
DSC Weekly 6 June 2023 – The Missing Part in LLMs and GPT-like Systems


Announcements

  • Attackers have many opportunities to strike on-site and cloud-based enterprise applications from early in the development process. But many solutions and tools, such as the emerging DevSecOps framework, are available to better secure applications and ensure security is prioritized within DevOps workflows and application security testing. Tune into the Effective Application Security summit to hear leading experts discuss how to secure the applications in your enterprise infrastructure with strategies like DevSecOps, along with the right combination of tools and testing.
  • Financial services firms are undergoing complete makeovers as they rework their vision, resources, and leadership strategy to stay competitive in the digital world. Join the Digital Transformation for Financial Services summit to learn how to transform strategically: creating a digital-first strategy that combines emerging technologies such as AI, analytics, and blockchain with the right talent and ways of working to optimize efficiency, bolster resilience, and drive long-term success.

[Image: high confidence level rating score speedometer, 3D illustration]

The Missing Part in LLMs and GPT-like Systems

These days, all the AI talk is about GPT (Generative Pre-trained Transformer), LLMs (large language models), generative AI, prompt engineering, and related technologies. You would have to be living alone on a small island to have never heard these terms.

LLMs originated in NLP (natural language processing), which gave rise to NLG (natural language generation) before evolving into what we have today. Deep neural networks, transformers in the case of GPT, are one of the components. Another is collecting vast amounts of unstructured text data and categorizing it. This is achieved by crawling websites such as Wikipedia, ArXiv (preprints and scientific research), Stack Exchange forum communities, GitHub, LinkedIn content, online news, other large repositories, and even Facebook conversations or Google search result pages. Starting with 1,000 seed keywords, retrieving what Google returns, and recursively crawling all the links found will, within a couple of months, produce a database of billions of webpages, covering 95% of Internet traffic.
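
To make the recursive crawling step concrete, here is a minimal sketch of a seed-driven, breadth-first crawler in Python. It is an illustration, not the author's implementation: the seed URLs, page limit, and politeness delay are placeholder assumptions, and a production crawler would also need robots.txt handling, URL normalization, and a distributed queue.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=1000, delay=1.0):
    """Breadth-first crawl starting from seed URLs (illustrative sketch)."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}                      # url -> raw HTML, the future text corpus

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # skip unreachable pages
        if resp.status_code != 200:
            continue
        pages[url] = resp.text

        # Extract and enqueue every link found on the page (the recursive step).
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup.find_all("a", href=True):
            link = urljoin(url, tag["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)           # politeness delay between requests
    return pages
```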

There are techniques to categorize this unstructured data: I have developed my own and implemented such smart crawlers, some discussed in my books. In the end, you can easily create a search engine better than the most popular ones on the market. Because of their monopoly, their incentive to innovate is small, and they are manipulated by spammers and other actors who find ways to push their content to the top.
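
The author's own categorization method is not described here, so the following is a generic stand-in: a naive keyword-bucket classifier that tags each crawled page with the category whose vocabulary it matches most often. The categories and keyword lists are invented for the example.

```python
# Hypothetical keyword buckets; the author's actual taxonomy is not disclosed.
CATEGORIES = {
    "machine_learning": {"neural", "gradient", "training", "model"},
    "finance": {"stock", "portfolio", "interest", "hedge"},
    "health": {"clinical", "vaccine", "diagnosis", "patient"},
}

def categorize(text: str) -> str:
    """Assign the category whose keywords appear most often in the text."""
    words = text.lower().split()
    counts = {cat: sum(w in kws for w in words) for cat, kws in CATEGORIES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "uncategorized"
```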

The next step was to develop a friendlier interface. Instead of returning links with a short summary, it composes complete answers to your questions. This is what tools like ChatGPT do. In the end, it will be manipulated the same way Google is.

There is one technique that could make these systems a lot better: scoring the input sources, whether a publisher, a specific channel, a website, a Facebook user, a journalist, or an author. The score attached to a source (more specifically, a set of scores, each measuring a specific attribute) tells you how trustworthy a piece of information is. A brand-new LinkedIn account with few connections, featuring a picture of an attractive, lightly dressed young woman whose only connections are wealthy older men, is likely to trigger a low score, compared to someone who consistently receives good feedback and reviews (unless the good feedback comes from a ring of fake accounts, which is easy to detect).
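
To illustrate what such a set of per-attribute scores might look like, here is a sketch. The attribute names, the weights, and the linear aggregation are assumptions made for the example, not a published scoring specification.

```python
from dataclasses import dataclass

@dataclass
class SourceScores:
    """One score per attribute, each in [0, 1]; attribute names are illustrative."""
    account_age: float        # 0 = brand new, 1 = long established
    feedback_quality: float   # sustained good reviews from independent accounts
    network_diversity: float  # low for a ring of similar or fake connections

    def trustworthiness(self, weights=(0.2, 0.5, 0.3)) -> float:
        """Weighted aggregate; the weights here are arbitrary placeholders."""
        attrs = (self.account_age, self.feedback_quality, self.network_diversity)
        return sum(w * a for w, a in zip(weights, attrs))

# The suspicious brand-new account from the example scores low overall.
suspicious = SourceScores(account_age=0.05, feedback_quality=0.3, network_diversity=0.1)
print(round(suspicious.trustworthiness(), 2))   # 0.19
```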

It is not just a matter of classifying information as trustworthy or not. The system can assign labels such as “exaggerated”, “politically biased” (conservative, liberal, and so on), “unverified”, and the list goes on. Each source could be assigned multiple labels, each with a probability determined by the scoring algorithm. For instance, a source could be classified as both exaggerated and real. And these scores would be updated daily.
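
Since the labels are not mutually exclusive, each label carries its own independent probability rather than a share of one distribution. A minimal sketch with made-up numbers:

```python
# Independent per-label probabilities (multi-label, so they need not sum to 1).
source_labels = {
    "exaggerated": 0.70,
    "politically_biased_conservative": 0.15,
    "politically_biased_liberal": 0.40,
    "unverified": 0.25,
    "real": 0.80,        # a source can be both exaggerated and real
}

def active_labels(labels: dict, threshold: float = 0.5) -> list:
    """Labels whose probability clears a (tunable) threshold."""
    return [name for name, p in labels.items() if p >= threshold]

print(active_labels(source_labels))   # ['exaggerated', 'real']
```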

One benefit, besides warning the user, is avoiding incoherence when GPT answers a question. If an answer is based on a mix of sources, some liberal, some conservative, it may say one thing in one paragraph and the opposite in the next. Using the scores, the answer could present the two contradictory arguments side by side and explain why they differ. Users could also choose to receive the answer they want to hear (regardless of veracity) by setting parameters associated with the scoring engine. A sophisticated scoring system would not automatically categorize statements as “misinformation” just because the average Joe, or even reputable scientists, says so; it would flag them as “controversial” instead. Sometimes that is because the information in question has not yet passed the test of time.
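
A sketch of how those labels could keep an answer coherent: group the retrieved passages by their source's political label, then either attribute each side explicitly or honor a user-selected preference. The function name, labels, and grouping rule are illustrative assumptions, not a description of any deployed system.

```python
def compose_answer(passages, user_bias=None):
    """Group retrieved (text, label) pairs by label, then either present all
    sides with explicit attribution or honor the user's stated preference."""
    by_label = {}
    for text, label in passages:
        by_label.setdefault(label, []).append(text)

    if user_bias is not None:                      # user opts into one side
        return " ".join(by_label.get(user_bias, []))

    # Otherwise surface the disagreement instead of mixing it silently.
    sections = []
    for label, texts in by_label.items():
        sections.append(f"According to {label} sources: " + " ".join(texts))
    return "\n".join(sections)

passages = [
    ("The policy reduced costs.", "conservative"),
    ("The policy increased costs.", "liberal"),
]
print(compose_answer(passages))            # both sides, clearly attributed
print(compose_answer(passages, "liberal")) # only the preferred side
```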

Finally, you can also score the output (the answers), not just the input sources. This is an area where I am currently actively involved, with patents already granted and technology that I started to develop years ago.
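
The author's patented approach is not described in the article; as a generic illustration, one simple way to score an answer is to average the trust scores of the sources it draws on and penalize disagreement between them, since a high spread suggests a controversial answer. A sketch under those assumptions:

```python
from statistics import mean, pstdev

def answer_confidence(source_trust_scores):
    """Hypothetical output score: average source trust, penalized when the
    sources disagree widely (high spread suggests a controversial answer)."""
    avg = mean(source_trust_scores)
    spread = pstdev(source_trust_scores)
    return max(0.0, avg - spread)

print(round(answer_confidence([0.9, 0.85, 0.8]), 2))   # consistent, trusted: 0.81
print(round(answer_confidence([0.9, 0.2, 0.5]), 2))    # mixed sources: much lower
```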

Vincent Granville, Contributor

Contact The DSC Team if you are interested in contributing.

