by Samuel Fletcher with additional input from Nick Solly and Chris Downton. Follow QR_'s LLM experimentation by signing up for early access insights here.
As we try to productionise artificial intelligence (AI) and make tools genuinely useful to industry (rather than just cool product teasers and 30-second demos), we are finding that Large Language Model (LLM) token limits are at odds with the resource-hungry, relatively unconstrained coding habits we’ve become used to.
“Back in the day”, coding had boundaries: bandwidth restrictions, limited processing power, and storage constraints forced developers to be efficient and mindful of length and complexity. Those constraints have been removed (or at least lessened) over the last fifty years. We now have huge databases where we can happily store “Production Intent” for every part rather than just “P”, because for a Bill of Materials (BoM) that adds only a few kB to the file size while being far more human-friendly to read. But we’re currently in a world where the finite token limits of these models once again demand maximum efficiency.
A short introduction to some fundamentals
Most LLMs use a specific neural network architecture called a transformer. These transformers are particularly suited to language processing as they can read vast amounts of text, spot patterns in how words and phrases relate to each other, and then make predictions about what words should come next.
Text is “tokenised” – chunked into meaningful units that the models process and generate. A token can represent an individual character, a word, a sub-word, or even a larger linguistic unit, depending on the method used – each model does this differently – and each token is assigned a unique identifier, or index. LLMs are trained by mapping the relationships between these numerical IDs, which encode the semantic and contextual information in the text, and when given a prompt they suggest which tokens should come next.
The architecture of each LLM determines a token limit: the maximum number of tokens the model can process at once, covering both the prompt and the model’s output.
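To make this concrete, here is a minimal sketch using OpenAI’s open-source tiktoken tokeniser to turn text into token IDs and to check a prompt against a limit. The 4,096-token budget is purely illustrative, and the BoM text is made up for the example.

```python
# A minimal illustration of tokenisation and a token-limit check, using
# OpenAI's open-source tiktoken library (pip install tiktoken).
import tiktoken

# "cl100k_base" is the encoding used by the GPT-3.5 / GPT-4 family of models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("Production Intent")
print(token_ids)                 # a short list of integer token IDs
print(enc.decode(token_ids))     # round-trips back to the original text


def fits_in_context(prompt: str, token_limit: int = 4096) -> bool:
    """Check a prompt against an (illustrative) model token limit."""
    return len(enc.encode(prompt)) <= token_limit


# A made-up 10,000-line BoM extract easily blows through the budget.
bom_text = "\n".join(
    f"Part {i}: bracket, status Production Intent" for i in range(10_000)
)
print(fits_in_context("Which parts are at Production Intent?"))  # True
print(fits_in_context(bom_text))                                 # False
```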
This becomes important when applying LLMs to a specific domain and industrialising them beyond simple product demos. Because these models are trained on very general corpora, they are less effective for domain-specific tasks when used “straight out of the box”. The token limit also caps how much data or context we can give the model, which for a Bill of Materials might equate to only a handful of lines of raw data – not particularly helpful!
Historical approaches to resource-constrained development
But although LLMs are new and we’re only beginning to apply them in industry settings, there are echoes of the same challenges faced in the earlier days of computing, from which we can learn.
If we rewind fifty years of computing and look at the constraints that have been present over that time, we could loosely group them into:
- Storage limits imposed by the media we had for transporting software
- Processing and memory challenges as we put the software to use.
Our storage options have evolved, taking us from distributing software on floppy discs in the 70s, where you had to fit your program into under 80 KB, through CDs and DVDs, to the current state where we rarely distribute on physical media at all and the main constraint is network bandwidth during transfer. Our processing and memory evolution has been just as dramatic – the comparisons of the Apollo Moon Lander computers to smartphones are well known.
Picture yourself trying to fit a 120 KB program onto an 80 KB floppy disk – how would you have done it? The answers to those early constraints typically fell into three groups:
- Optimisation of the code: finding efficiencies in delivering the features – every character counts! (One example being abstracting repeated logic out into functions wherever possible for maximum reuse.)
- Approximation: you can sacrifice accuracy for fuzziness in situations where you know that is acceptable. The video game OpenArena, for example, used an approximation to calculate the reciprocal of the square root of a 32-bit floating-point number in its lighting and reflection calculations – it was never more than 0.175% away from the true value, yet ran at the speed the game required (see the sketch after this list).
- Cutting features: after every last character had been squeezed out, you’d have ruthlessly cut the features you didn’t need and knowingly sacrificed functionality.
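That OpenArena trick – widely known as the “fast inverse square root” – is easy to sketch. The original was C; the Python port below loses the speed advantage, but it shows the technique of trading a little accuracy for a lot of cost: reinterpret the float’s bits as an integer, apply a magic constant for a first guess, then refine with a single Newton–Raphson step.

```python
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) using the classic bit-level trick (Python port)."""
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The famous "magic" constant turns that integer into a good first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration brings the error to within ~0.175%.
    return y * (1.5 - 0.5 * x * y * y)


print(fast_inv_sqrt(4.0))   # ~0.499, versus the exact value of 0.5
```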
What are the token limit equivalents of our answers to historic constraints?
- There are different methods for optimising tokens for LLMs, on a spectrum from the crude to the clever. At one extreme you could truncate or otherwise omit text, with no selectivity over what gets kept beyond arbitrary or random decisions. Or you could use contextual compression, where the supporting information for a query is compressed using the context of that query, so that only the most relevant information is passed to the LLM (there are built-in LangChain methods that can do this, for example – see the first sketch after this list). Building “one-shot prompts” can also avoid going through several rounds of questions, each of which saves its history as context for the next – the aim being to get the perfect response first time.
- For approximation techniques, we could ask the LLM itself to summarise the current state and context required, and start a new set of queries – accepting that a certain amount of loss may occur. Any text could also be approximated using compression techniques like Abbreviated Semantic Encoding to further reduce the token count of the queries.
- Sacrificing functionality is the last thing we generally want to do, but we can be more selective and choose what to prioritise when providing context to LLMs, with techniques such as Retrieval Augmented Generation (RAG). Rather than throwing every piece of data into the model as context, RAG retrieves the most relevant data and provides only that as context to the model (see the second sketch after this list). There are also smarter ways of building chains of these, where the model refines the RAG criteria itself over a number of iterations and so builds up the minimum context it needs to answer the original query.
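As a first sketch, here is the contextual-compression route from the first bullet, following LangChain’s documented ContextualCompressionRetriever pattern: an LLM strips each retrieved document down to only the parts relevant to the query before it is passed on as context. Import paths vary between LangChain versions, and the toy BoM texts, the OpenAI API key, and the faiss-cpu dependency are assumptions made purely for the example.

```python
# Minimal sketch of LangChain's contextual-compression pattern. Assumes an
# OpenAI API key and faiss-cpu are available; import paths follow the classic
# `langchain` layout and may differ in newer releases.
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.vectorstores import FAISS

# A toy "BoM" corpus standing in for real supporting documents.
vector_store = FAISS.from_texts(
    [
        "Part 1001, bracket, status: Production Intent, supplier: Acme",
        "Part 1002, fastener, status: Prototype, supplier: Bolts R Us",
        "Release notes for the 2023 chassis programme (long, mostly irrelevant)",
    ],
    OpenAIEmbeddings(),
)

llm = OpenAI(temperature=0)

# The compressor uses the LLM itself to strip each retrieved document down to
# only the content relevant to the query, so fewer tokens reach the model.
compressor = LLMChainExtractor.from_llm(llm)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(),
)

docs = retriever.get_relevant_documents("Which parts are at Production Intent?")
print([d.page_content for d in docs])
```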
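And here is a deliberately library-free sketch of the RAG idea from the last bullet: score each chunk of data against the query, keep only the most relevant chunks that fit within a token budget, and send just those as context. The BoM chunks are invented, word overlap stands in for real embedding similarity, and a word count stands in for a real tokeniser.

```python
# Library-free RAG sketch: retrieve only the most relevant chunks that fit a
# token budget, rather than passing the whole dataset as context.
import re

BOM_CHUNKS = [
    "Part 1001, bracket, status: Production Intent",
    "Part 1002, fastener, status: Prototype",
    "Part 1003, housing, status: Production Intent",
    "General release notes, mostly irrelevant to part status",
]


def words(text: str) -> set[str]:
    """Lower-case words with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(chunk: str, query: str) -> int:
    """Toy relevance score: word overlap in place of embedding similarity."""
    return len(words(chunk) & words(query))


def build_context(query: str, chunks: list[str], token_budget: int = 15) -> str:
    """Greedily add the most relevant chunks until the budget is spent."""
    context, used = [], 0
    for chunk in sorted(chunks, key=lambda c: score(c, query), reverse=True):
        cost = len(chunk.split())          # crude word-count "token" estimate
        if used + cost > token_budget:
            break
        context.append(chunk)
        used += cost
    return "\n".join(context)


query = "Which parts have status Production Intent?"
prompt = f"Context:\n{build_context(query, BOM_CHUNKS)}\n\nQuestion: {query}"
print(prompt)   # only the most relevant BoM lines make it into the prompt
```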
Moore's law and tokens?
Is this a persistent constraint, or only a current challenge that we will overcome with the benefit of time? If we look at the input token limits of the LLMs released over the last five years, we can see a small but gradual increase: from the first tentative steps of GPT-1 (512 tokens), to 2022/2023 where the majority of models have limits of 4,096 tokens, and we’re now beginning to see the first signs of surpassing that with the likes of Claude and GPT-4 8K. But these are currently the anomalies compared to the general population, and as such we should be prepared to work within these constraints for a while yet.
We should note here that the relief of the query size constraint will bring its own challenges and implications that we should not neglect! Not least from an environmental point of view: even with current query sizes, the energy required to compute an LLM query is estimated to be 4–5x that of a traditional search engine query (from Martin Bouchard at QScale), and that is likely to grow faster as the potential query size increases. Bigger also doesn’t always mean better – we’re only just beginning to see comparisons of query structures and performance: a paper that’s been doing the rounds recently is a Stanford one, where LLMs performed better when the relevant information was at the beginning or end of the input context.
What’s the conclusion of all this? LLMs’ strength as generalist tools is proving harder to replicate in highly specific industrial contexts, in part due to token limits. However, these kinds of constraints are nothing new, and looking at how similar challenges were previously addressed can provide useful starting points. While it can be tempting to wait for higher token limits to become available, on the balance of probability, making more efficient use of the resources available today will provide both quicker and better LLM solutions for very context-specific use cases. The constraints just mean we need to be more thoughtful about how we get there.
Follow QR_'s LLM experimentation by signing up for early access insights here.