Microsoft Scales LLMs to a Mind-Boggling 1B (!) Token Context 🤯

The paper “LongNet: Scaling Transformers to 1,000,000,000 tokens” presents a machine learning breakthrough in handling and analyzing large amounts of text data. Simply put, this paper is about a new model called LongNet that can process really long strings of text: up to 1 billion “tokens” at a time, where a token is a word or a piece of a word. For the rough comparisons below, we treat one token as about one word.

Let’s put this into perspective:

Books: The average novel might contain around 50,000 to 100,000 words. So, 1 billion words would equal 10,000 to 20,000 novels. Shakespeare's complete works run to roughly 900,000 words, so that’s like reading all of them more than 1,000 times!
Wikipedia: The English Wikipedia contains over 6 million articles and an estimated total of 4.3 billion words. So, 1 billion words would represent roughly a quarter of the entire English Wikipedia. (source)
Web Pages: An average web page might contain around 1,000 to 2,000 words. So, 1 billion words would be equivalent to about 500,000 to 1 million average-sized web pages.
Social Media: As of 2021, Twitter allowed up to 280 characters per tweet. Assuming an average of 5 characters per word, that’s about 56 words per tweet. So, 1 billion words would be about the same as 18 million tweets!

These examples should give you a sense of just how large a text dataset of 1 billion words really is. It’s a colossal amount of information, well beyond what any human could read and comprehend in a lifetime.
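
If you want to check the comparisons above yourself, they are all simple division. Here is a minimal sketch that redoes the arithmetic in Python. The per-item word counts are just the rough figures quoted in this article, and the helper name `how_many` is made up for this example.

```python
# Back-of-the-envelope check of the comparisons in this article.
# Every per-item word count below is a rough estimate, not a measurement.
TOTAL_WORDS = 1_000_000_000


def how_many(words_per_item: float) -> float:
    """Return how many items of the given size add up to TOTAL_WORDS."""
    return TOTAL_WORDS / words_per_item


novels_low = how_many(100_000)    # if every novel is long
novels_high = how_many(50_000)    # if every novel is short
wikipedia_share = TOTAL_WORDS / 4_300_000_000
pages_low = how_many(2_000)
pages_high = how_many(1_000)
words_per_tweet = 280 / 5         # 280 characters, about 5 characters per word
tweets = how_many(words_per_tweet)

print(f"Novels:    {novels_low:,.0f} to {novels_high:,.0f}")
print(f"Wikipedia: {wikipedia_share:.0%} of the English edition")
print(f"Web pages: {pages_low:,.0f} to {pages_high:,.0f}")
print(f"Tweets:    {tweets:,.0f} full-length tweets")
```

Swap in your own estimates to see how sensitive each comparison is to the assumed word counts.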

Until now, it’s been a challenge to process such long sequences of text in an efficient way that maintains the model’s ability to understand and generate meaningful output.

💡 Remember: We’re not talking about the size of the pre-trained AI model (=brain size 🧠), which often reached billions of parameters well before this paper was released. We’re talking about the size of the context of a single prompt (=question size ❓). Imagine you ask your friend a question and hand them all the books in your local library as part of the question.

The existing methods, while good, do have their limits. They either struggle with the sheer amount of calculations needed to process such long sequences of text or with the model’s ability to understand and generate meaningful output when dealing with such long sequences.

Enter LongNet. The LongNet model introduces a new technique called “dilated attention.”

In simple terms, the model pays close attention to nearby tokens and looks at distant tokens more and more sparsely: the farther apart two parts of the text are, the fewer token pairs get compared. The paper describes this as the attentive field expanding exponentially as the distance grows. This allows LongNet to process extremely long sequences of text without sacrificing its performance on shorter sequences.

This is beyond exponential growth of the context length over time☝️
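
To make the dilated attention idea a bit more concrete, here is a toy sketch in plain Python. It only shows which token positions end up being compared. The sequence is cut into segments, and inside each segment only every r-th position is kept. The function names and the three (segment length, dilation) settings are assumptions chosen for illustration, and the sketch ignores attention heads, masking, and the actual attention math.

```python
def dilated_segments(seq_len, segment_len, dilation):
    """Split positions 0..seq_len-1 into segments, keeping every `dilation`-th one."""
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        segments.append(list(range(start, end, dilation)))
    return segments


def attended_keys(query_pos, seq_len, configs):
    """Positions that share a sparsified segment with `query_pos` in any config."""
    keys = set()
    for segment_len, dilation in configs:
        for segment in dilated_segments(seq_len, segment_len, dilation):
            if query_pos in segment:
                keys.update(segment)
    return sorted(keys)


# Hypothetical settings: short dense segments plus longer, sparser ones.
configs = [(4, 1), (8, 2), (16, 4)]
keys = attended_keys(query_pos=0, seq_len=16, configs=configs)
print(keys)
```

For position 0 the result covers every nearby position but only a thinning sample of the far ones, which is the "dense nearby, sparse far away" pattern described above. Position 0 is used because it sits on every dilation grid in this simplified version.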

Some of the main advantages of LongNet are:

Advantage 1: It has linear computational complexity, which means that as the amount of text increases, the calculations needed to process it grow in proportion to its length, not quadratically as with standard self-attention. This makes LongNet much more efficient on long sequences than previous methods.
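
To get a feel for the difference between linear and quadratic growth, here is a rough counting sketch. It counts how many token pairs get compared by ordinary full attention versus a dilated scheme where segment length and dilation grow by the same factor. The base segment of 2,048 tokens and the doubling factor are assumptions picked for this example, not values taken from the article.

```python
def dense_pairs(seq_len):
    """Full attention compares every token with every token."""
    return seq_len * seq_len


def geometric_configs(seq_len, base_segment=2048, ratio=2):
    """Assumed schedule: segment length and dilation grow by the same ratio."""
    configs = []
    segment_len, dilation = base_segment, 1
    while segment_len <= seq_len:
        configs.append((segment_len, dilation))
        segment_len *= ratio
        dilation *= ratio
    return configs


def dilated_pairs(seq_len, configs):
    """Count compared pairs: each segment keeps segment_len // dilation tokens."""
    total = 0
    for segment_len, dilation in configs:
        num_segments = seq_len // segment_len
        kept = segment_len // dilation
        total += num_segments * kept * kept
    return total


for n in (2**15, 2**16, 2**17):
    sparse = dilated_pairs(n, geometric_configs(n))
    print(f"{n:>7} tokens: dense {dense_pairs(n):>14,}  dilated {sparse:>12,}")
```

Under these assumptions, doubling the sequence length quadruples the dense count but only roughly doubles the dilated count, because each extra level contributes half as many pairs as the one before it.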

Here’s a comparison of runtime in practice for various sequence lengths:

Advantage 2: It can be used for distributed training on extremely long sequences. This means that the work of processing one long sequence can be split across multiple devices (for example, several GPUs), making the process faster and more efficient.
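
Here is a toy simulation of what "dividing the sequence" could look like. Plain Python lists stand in for devices, so this is not real distributed code, and both function names are made up for this example. Each simulated device holds one slice of the sequence, and when a segment spans several devices, each one contributes only its sparsified (dilated) keys.

```python
def shard_sequence(seq_len, num_devices):
    """Give each simulated device one contiguous slice of token positions."""
    local_len = seq_len // num_devices
    return [list(range(d * local_len, (d + 1) * local_len))
            for d in range(num_devices)]


def gather_sparse_keys(shards, segment_len, dilation):
    """For each segment, collect every `dilation`-th key from the devices it spans.

    Segments shorter than one shard would be handled locally and are not
    modeled separately in this sketch.
    """
    local_len = len(shards[0])
    devices_per_segment = max(1, segment_len // local_len)
    gathered = []
    for first in range(0, len(shards), devices_per_segment):
        group = shards[first:first + devices_per_segment]
        gathered.append([pos for shard in group for pos in shard[::dilation]])
    return gathered


shards = shard_sequence(seq_len=64, num_devices=4)  # 16 tokens per device
for segment_len, dilation in [(16, 1), (32, 2), (64, 4)]:
    groups = gather_sparse_keys(shards, segment_len, dilation)
    print(segment_len, dilation, [len(g) for g in groups])
```

In this toy setup, the number of keys collected per segment stays the same even as the segment grows to cover more devices, because the dilation grows along with it. That is the intuition for why splitting the work does not have to blow up the amount of data exchanged.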

Advantage 3: Dilated attention can be easily integrated with existing methods of optimizing Transformer-based models, making it a versatile tool for machine learning experts.

The Microsoft Research team tested LongNet and found that it performs well on both tasks involving long sequences of text and general language tasks.

This opens up a world of possibilities for processing and understanding extremely long sequences of text, like an entire book collection, millions of articles, or even a quarter of Wikipedia.

In a single prompt!

The future is bright! 😎 Intelligence is exploding. Check out the GitHub here and the paper here.
