New Trends in LLM Architecture

Since OpenAI launched ChatGPT in November 2022, many things have happened. Competitors and new applications are born every month, some raising considerable funding. Search is becoming hot again, this time powered by RAG and LLMs rather than PageRank. It remains to be seen who will achieve profitability on a large scale. Costs are dramatically decreasing, and protagonists are fighting to deliver better quality with faster training and easier fine-tuning. While small or specialized LLMs are starting to emerge, the general trend is towards more GPUs, more weights, and more tokens, sometimes based on questionable input, such as Reddit, in an attempt to gather more rather than better sources. But these days, not all LLMs use transformers, and energy-efficient implementations are gaining popularity, in an attempt to lower GPU usage, and thus costs. Yet all but one still rely on black-box neural networks.

Great evaluation metrics remain elusive and will probably remain so forever: in the end, LLMs, just like clustering, are part of unsupervised learning. Two users looking at a non-trivial dataset will never agree on what the “true” underlying cluster structure is, because “true” is meaningless in this context. The same applies to LLMs, with some exceptions: when used for predictive analytics, that is, supervised learning, it is possible to tell which LLM is best in absolute terms (to some extent; it also depends on the dataset).

From big to simple LLMs, back to big ones

The first LLMs were very big, monolithic systems. Now you see many simple LLMs that deal with specialized content or applications, such as a corporate corpus. The benefits are faster training, easier fine-tuning, and reduced risk of hallucinations. But the trend could change, moving back to big LLMs. For instance, my xLLM architecture consists of small, specialized sub-LLMs, each one focusing on a top category. If you bundle 2000 of them together, you cover all of human knowledge. The whole system, sometimes called a mixture of experts, is managed with an LLM router.

LLM routers

The term multi-agent system is sometimes used instead, although not with the exact same meaning. An LLM router is a top layer above the sub-LLMs that guides users to the sub-LLMs relevant to their prompt. It can be explicit to the user (asking which sub-LLM to choose), transparent (automatically performed), or semi-transparent. For instance, a user looking for “gradient descent” in the statistical science sub-LLM may find very little: the relevant information is in the calculus sub-LLM. The LLM router should take care of this problem.
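To make the routing idea concrete, here is a minimal sketch of a transparent router that picks sub-LLMs by keyword overlap with the prompt. The keyword tables, the function name `route`, and the scoring rule are all hypothetical choices made for illustration; they are not taken from xLLM, and a real router would likely use richer signals.

```python
# Hypothetical keyword tables: one bag of keywords per sub-LLM (illustrative data only).
SUB_LLM_KEYWORDS = {
    "statistical science": {"regression", "variance", "hypothesis", "sampling", "estimator"},
    "calculus": {"gradient", "descent", "derivative", "integral", "limit"},
}

def route(prompt, keyword_tables, top_k=1):
    """Rank sub-LLMs by the number of prompt words found in their keyword table."""
    words = set(prompt.lower().split())
    scores = {name: len(words & keywords) for name, keywords in keyword_tables.items()}
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only sub-LLMs with at least one matching keyword.
    return [name for name, score in ranked[:top_k] if score > 0]

selected = route("gradient descent", SUB_LLM_KEYWORDS)
print(selected)
```

With these toy tables, the “gradient descent” prompt is sent to the calculus sub-LLM rather than the statistical science one. An explicit router would show the ranked list to the user instead of choosing automatically.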

Evaluation, faster fine-tuning and self-tuning

Fine-tuning an LLM on part of the system, rather than the whole, can speed up the process tremendously. With xLLM, you can fine-tune hyperparameters locally on a sub-LLM (fast), or across all sub-LLMs at once (slow). Hyperparameters can be local or global. In the case of xLLM, they are intuitive, as the system is based on explainable AI. In standard LLMs, LoRA, an abbreviation for Low-Rank Adaptation, achieves a similar goal.

Self-tuning works as follows: collect the favorite hyperparameters chosen by the users and build a default hyperparameter set based on these choices. The system also allows users to work with customized hyperparameters, so two users may get different answers to the same prompt. Make this process even easier by returning a relevancy score for each item listed in the answer (URLs, related concepts, definitions, references, examples, and so on).
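The aggregation step can be sketched as follows. This is only one possible rule, assumed here for illustration: numeric hyperparameters are summarized by their median and all others by their most frequent value. The hyperparameter names (`min_pmi`, `max_token_len`, `stemming`) are hypothetical.

```python
from collections import Counter
from statistics import median

def build_default_hyperparameters(user_choices):
    """Aggregate per-user hyperparameter dicts into one default set.

    Numeric values are aggregated with the median, other values with the mode.
    """
    collected = {}
    for choice in user_choices:
        for name, value in choice.items():
            collected.setdefault(name, []).append(value)

    defaults = {}
    for name, values in collected.items():
        # bool is a subclass of int in Python, so exclude it from the numeric branch.
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
        if numeric:
            defaults[name] = median(values)
        else:
            defaults[name] = Counter(values).most_common(1)[0][0]
    return defaults

# Hypothetical favorite settings saved by three users.
user_choices = [
    {"min_pmi": 0.2, "max_token_len": 3, "stemming": True},
    {"min_pmi": 0.4, "max_token_len": 2, "stemming": True},
    {"min_pmi": 0.3, "max_token_len": 3, "stemming": False},
]
defaults = build_default_hyperparameters(user_choices)
print(defaults)
```

A user who keeps customized values simply bypasses `defaults`, which is why two users can get different answers to the same prompt.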

Regarding evaluation, I proceed as follows. Reconstruct the taxonomy attached to the corpus: for each web page, assign a category, and compare it to the real category embedded in the corpus. I worked with Wolfram, Wikipedia, and corporate corpora: all have a very similar structure with taxonomy and related items; this structure can be retrieved while crawling.
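A minimal sketch of this comparison, assuming the score is the plain fraction of pages whose assigned category matches the category found in the corpus. The URLs, category labels, and the function name `taxonomy_accuracy` are hypothetical; the actual scoring used with xLLM may differ.

```python
def taxonomy_accuracy(predicted, actual):
    """Fraction of pages whose predicted category matches the category in the corpus."""
    common = predicted.keys() & actual.keys()
    if not common:
        raise ValueError("no pages in common between predicted and actual categories")
    hits = sum(predicted[url] == actual[url] for url in common)
    return hits / len(common)

# Hypothetical categories retrieved while crawling (ground truth) ...
actual = {
    "/page/gradient": "calculus",
    "/page/variance": "statistics",
    "/page/integral": "calculus",
    "/page/p-value": "statistics",
}
# ... and the categories assigned by the system under evaluation.
predicted = {
    "/page/gradient": "calculus",
    "/page/variance": "statistics",
    "/page/integral": "calculus",
    "/page/p-value": "calculus",
}
score = taxonomy_accuracy(predicted, actual)
print(score)
```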

Finally, whenever possible, use the evaluation metric as your loss function in the underlying gradient descent algorithm, typically the one training a deep neural network. Loss functions currently in use are poor proxies for model quality, so why not use the evaluation metric instead? This is hard to do because you need a loss function that can be updated with atomic changes such as a weight update or neuron activation, billions of times during training. My workaround is to start with a rough approximation of the evaluation metric and refine it over time until it converges to the desired metric. The result is an adaptive loss function. It also prevents you from getting stuck in a local minimum.
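One way to picture a loss that starts as a rough approximation and converges to the target metric is a temperature-controlled surrogate. The sketch below is an illustration under stated assumptions, not the adaptive loss used in xLLM: the target metric is assumed to be the misclassification rate, the surrogate is a sigmoid of the classification margin, and the temperature schedule is arbitrary.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def zero_one_error(margins):
    """Target evaluation metric: fraction of misclassified points (margin <= 0)."""
    return sum(m <= 0 for m in margins) / len(margins)

def smooth_error(margins, temperature):
    """Smooth surrogate of the metric: average of sigmoid(-margin / temperature)."""
    return sum(_sigmoid(-m / temperature) for m in margins) / len(margins)

# Hypothetical classification margins (positive means correctly classified).
margins = [-1.5, -0.2, 0.3, 0.8, 2.0]
target = zero_one_error(margins)

# Lowering the temperature during training moves the surrogate toward the metric.
gaps = []
for temperature in (1.0, 0.1, 0.01):
    gaps.append(abs(smooth_error(margins, temperature) - target))
print(target, gaps)
```

At a high temperature the surrogate is smooth and easy to optimize with gradient descent; as the temperature drops, it approaches the step-like metric it stands in for.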

Search, clustering and predictions

At the beginning, using LLMs for search was looked down upon. Now that search is what most corporate clients are looking for, and since LLMs can do a far better job than Google search or the search boxes found on company websites, it is starting to get a lot of attention. Great search on your website leads to more sales. Besides search, there are plenty of other applications: code generation, clustering, and predictive analytics based on text only.

Knowledge graphs and other improvements

There is a lot of talk about long-range context and knowledge graphs, built as a top layer to add more context to LLMs. In my xLLM, the knowledge graph is actually the bottom layer, retrieved from the corpus while browsing. If none is found or if quality is poor, I import one from an external source, calling it an augmented knowledge graph. I also built some from scratch using synonyms, indexes, glossaries, and books. A knowledge graph may consist of a taxonomy and related concepts. In any case, it brings the long-range context missing in the first LLM implementations.

I also introduced longer tokens consisting of multiple tokens, such as “data~science”. I call them multi-tokens. Meta also uses them. Finally, I use contextual tokens, denoted as (say) “data^science”. It means that the two words “data” and “science” are found in the same paragraph, but not adjacent to each other. Special care is needed to avoid an explosion in the number of tokens. In addition to the corpus itself, I leverage user prompts as augmented data to enrich the input data. The most frequent embeddings are stored in a cache for faster retrieval in the backend tables. Then, variable-length embeddings further increase the speed. While vector and graph databases are popular to store embeddings, in my case I use nested hashes, that is, a hash (or key-value database) where the value is itself a hash. This is very efficient for handling sparsity.
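The sketch below illustrates the three token types and a nested hash on a toy paragraph. It is a simplified illustration: the whitespace tokenizer, the restriction of multi-tokens to adjacent pairs, and the function names are assumptions, and it applies none of the safeguards needed to keep the number of tokens under control.

```python
from collections import defaultdict
from itertools import combinations

def extract_tokens(paragraph):
    """Return single tokens, adjacent-pair multi-tokens (a~b), and contextual tokens (a^b)."""
    words = paragraph.lower().split()
    singles = set(words)
    # Multi-tokens: two adjacent words joined with "~".
    multi = {f"{a}~{b}" for a, b in zip(words, words[1:])}
    # Contextual tokens: two distinct words in the same paragraph, not adjacent.
    contextual = set()
    for i, j in combinations(range(len(words)), 2):
        if j - i > 1 and words[i] != words[j]:
            contextual.add(f"{words[i]}^{words[j]}")
    return singles, multi, contextual

def build_nested_hash(paragraphs):
    """Nested hash: table[a][b] = number of paragraphs containing both a and b."""
    table = defaultdict(lambda: defaultdict(int))
    for paragraph in paragraphs:
        words = sorted(set(paragraph.lower().split()))
        for a, b in combinations(words, 2):
            table[a][b] += 1
            table[b][a] += 1
    return table

singles, multi, contextual = extract_tokens("data science uses data and statistical science")
table = build_nested_hash(["data science uses data", "science needs data"])
print(table["data"]["science"])
```

Only pairs that actually co-occur get an entry in the inner hash, which is what makes this layout economical for sparse data.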

Cosine distance and dot product, used to compare embeddings, are receiving increased criticism. There are alternative metrics, such as pointwise mutual information (PMI).
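For reference, PMI compares how often two tokens co-occur with how often they would co-occur if they were independent: PMI(a, b) = log( p(a, b) / (p(a) p(b)) ). A minimal sketch, assuming probabilities are estimated from paragraph counts and using the natural logarithm (both are choices made here, not details from the text above):

```python
import math

def pmi(count_a, count_b, count_ab, total):
    """PMI = log( p(a,b) / (p(a) * p(b)) ), with probabilities estimated as count / total."""
    if min(count_a, count_b, count_ab) <= 0 or total <= 0:
        raise ValueError("all counts must be positive")
    # Algebraically the same as (count_ab/total) / ((count_a/total) * (count_b/total)).
    return math.log(count_ab * total / (count_a * count_b))

# Hypothetical counts over 100 paragraphs: token a appears in 10, token b in 20.
independent = pmi(10, 20, 2, 100)   # co-occur exactly as often as chance predicts
associated = pmi(10, 20, 8, 100)    # co-occur four times more often than chance
print(independent, associated)
```

A PMI of zero means the two tokens co-occur as often as independence predicts; positive values indicate association.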

Local, secure, enterprise versions

There is more and more interest in local, secure implementations to serve corporate clients. After all, that’s where the money is. For these clients, hallucinations are a liability. Low latency, easy fine-tuning, and explainable parameters are other important criteria for them. Hence their interest in my open-source xLLM, which addresses all these problems.


I illustrate all the concepts discussed here in my new book “State of the Art in GenAI & LLMs — Creative Projects, with Solutions”, available here. For a high-level presentation, see my PowerPoint presentation here on Google drive (easy to view), or on GitHub, here. Both the book and the presentation focus on xLLM.


Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier), and patent owner, with one patent related to LLMs. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.