AI Scaling Laws - A Short Primer

The AI scaling laws could be the biggest finding in computer science since Moore’s Law was introduced. 📈 In my opinion, these laws haven’t gotten the attention they deserve (yet), even though they could show a clear way to make considerable improvements in artificial intelligence. This could change every industry in the world, and it’s a big deal.

ChatGPT Is Only The Beginning

In recent years, AI research has focused on scaling up compute, which has led to impressive improvements in model performance. In 2020, OpenAI’s paper Scaling Laws for Neural Language Models showed that, for a fixed compute budget, growing a model’s parameter count could yield better returns than simply adding more data.

This research paper explores how the performance of language models changes as we increase the model’s size, the amount of data used to train it, and the computing power used in training.

The authors found that the performance of these models, measured by their ability to predict the next word in a sentence, improves in a predictable way as we increase these factors, with some trends holding across more than seven orders of magnitude.

🧑‍💻 For example, a model that’s 10 times larger or trained on 10 times more data will perform better, and the size of the improvement can be predicted by a simple formula.
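To make the “simple formula” concrete, here is a minimal sketch. It assumes the power-law form L(X) = (X_c / X)^alpha and the approximate exponents reported in the OpenAI paper (about 0.076 for parameters and 0.095 for training tokens). The function name `loss_ratio` is my own, and the numbers are illustrative rather than exact.

```python
# Approximate exponents from "Scaling Laws for Neural Language Models"
# (Kaplan et al., 2020). Treat them as illustrative, not exact.
ALPHA_N = 0.076  # exponent for model size (non-embedding parameters)
ALPHA_D = 0.095  # exponent for dataset size (training tokens)


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Factor by which the loss is multiplied when one resource grows by
    `scale_factor`, assuming L(X) = (X_c / X) ** alpha and that nothing
    else is the bottleneck."""
    return scale_factor ** (-alpha)


if __name__ == "__main__":
    for name, alpha in [("parameters", ALPHA_N), ("tokens", ALPHA_D)]:
        ratio = loss_ratio(10.0, alpha)
        print(f"10x more {name}: loss falls to {ratio:.1%} of its previous value")
```

Under these assumptions, a 10x larger model lowers the loss to roughly 84% of its previous value, and 10x more data lowers it to roughly 80%. The gains are steady but far from linear, which is why each improvement needs an order of magnitude more resources.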

Interestingly, other factors like how many layers the model has or how wide each layer is don’t have a big impact within a certain range. The paper also provides guidelines for training these models efficiently.

For instance, when compute is the limiting factor, it’s often better to train a very large model on a moderate amount of data and stop well before it converges, rather than training a smaller model all the way to convergence.

In fact, I’d argue that transformers, the technology behind large language models, are the real deal as they just don’t converge, meaning their performance keeps improving as scale increases instead of leveling off.

This development sparked a race among companies to create models with more and more parameters, such as GPT-3 with its astonishing 175 billion parameters. Microsoft even released DeepSpeed, a tool designed to handle (in theory) trillions of parameters!

🧑‍💻 Recommended: Transformer vs LSTM: A Helpful Illustrated Guide

Model Size! (… and Training Data)

However, findings from DeepMind’s 2022 paper Training Compute-Optimal Large Language Models indicate that it’s not just about model size: the number of training tokens (data) also plays a crucial role. Until recently, many large models were trained using about 300 billion tokens, mainly because that’s what GPT-3 used.

DeepMind decided to experiment with a more balanced approach and created Chinchilla, a Large Language Model (LLM) with fewer parameters (only 70 billion) but a much larger dataset of 1.4 trillion training tokens. Surprisingly, Chinchilla outperformed much larger models trained on roughly 300 billion tokens, including GPT-3 (175 billion parameters), Gopher (280 billion), and Megatron-Turing NLG (530 billion).
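The arithmetic behind this comparison can be sketched in a few lines. The parameter and token counts come from the figures above. The training-cost estimate uses the common approximation C ≈ 6 × N × D FLOPs, which is an assumption I am adding here, not something either model’s authors state in this form. The helper names are my own.

```python
def tokens_per_parameter(params: float, tokens: float) -> float:
    """How many training tokens the model saw per parameter."""
    return tokens / params


def approx_training_flops(params: float, tokens: float) -> float:
    """Rough training cost, assuming C ~ 6 * N * D (a common rule of thumb)."""
    return 6.0 * params * tokens


# (parameters, training tokens) as quoted in the text above
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(
            f"{name}: {tokens_per_parameter(n, d):.1f} tokens/parameter, "
            f"~{approx_training_flops(n, d):.2e} training FLOPs"
        )
```

With these inputs, GPT-3 saw fewer than 2 tokens per parameter, while Chinchilla saw 20. That ratio of about 20 tokens per parameter is often used as a rule of thumb for compute-optimal training.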

What Does This Mean for You?

First, it means that AI models are likely to improve significantly as we throw more data and more compute at them. Simply scaling up the training process, without inventing anything new, still leaves us nowhere near the ceiling of AI performance.

Scaling is a comparatively straightforward exercise, so it is likely to happen quickly and to push these models to incredible performance levels.

Soon we’ll see significant improvements to the already impressive AI models.

How the AI Scaling Laws May Be as Important as Moore’s Law

Accelerating Technological Advancements: Just as Moore’s Law predicted a rapid increase in the power and efficiency of computer chips, the scaling laws in AI could lead to a similar acceleration in the development of AI technologies. As AI models become larger and more powerful, they could enable breakthroughs in fields such as natural language processing, computer vision, and robotics. This could lead to the creation of more advanced and capable AI systems, which could in turn drive further technological advancements.

Economic Growth and Disruption: Moore’s Law has been a key driver of economic growth and innovation in the tech industry. Similarly, the scaling laws in AI could lead to significant economic growth and disruption across various industries. As AI technologies become more powerful and efficient, they could be used to automate tasks, optimize processes, and create new business models. This could lead to increased productivity, reduced costs, and the creation of new markets and industries.

Societal Impact: Moore’s Law has had a profound impact on society, enabling the development of technologies such as smartphones, the internet, and social media. The scaling laws in AI could have a similar societal impact, as AI technologies become more integrated into our daily lives. AI systems could be used to improve healthcare, education, transportation, and other areas of society. This could lead to improved quality of life, increased access to resources, and new opportunities for individuals and communities.

Frequently Asked Questions

How can neural language models benefit from scaling laws?

Scaling laws can help predict the performance of neural language models based on their size, training data, and computational resources. By understanding these relationships, you can optimize model training and improve overall efficiency.

What’s the connection between DeepMind’s work and scaling laws?

DeepMind has conducted extensive research on scaling laws in the context of artificial intelligence and deep learning. Its Chinchilla work, in particular, showed that model size and the number of training tokens need to grow together for a given compute budget. OpenAI, in turn, scaled aggressively to reach significant performance improvements with GPT-3.5 and GPT-4.

How do autoregressive generative models follow scaling laws?

Autoregressive generative models, like other neural networks, can exhibit scaling laws in their performance. For example, as these models grow in size or are trained on more data, their ability to generate high-quality output may improve in a predictable way based on scaling laws.

Can you explain the mathematical representation of scaling laws in deep learning?

A scaling law in deep learning typically takes the form of a power-law relationship, where one variable (e.g., model performance) is proportional to another variable (e.g., model size) raised to a certain power. This can be represented as: Y = K * X^a, where Y is the dependent variable, K is a constant, X is the independent variable, and a is the scaling exponent. When Y is the model’s loss, a is negative, because the loss falls as the model grows. On a log-log plot, a power law appears as a straight line with slope a.
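Because a power law becomes a straight line after taking logarithms (log Y = log K + a * log X), the exponent can be estimated with ordinary least squares in log-log space. Here is a minimal sketch using only the standard library. The function name `fit_power_law` and the data points are hypothetical and chosen purely for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Fit Y = K * X**a by least squares on (log X, log Y); return (K, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # The slope of the best-fit line in log-log space is the scaling exponent.
    s_xx = sum((u - mean_x) ** 2 for u in log_x)
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    a = s_xy / s_xx
    k = math.exp(mean_y - a * mean_x)
    return k, a


if __name__ == "__main__":
    # Hypothetical measurements: model sizes and the loss observed at each size.
    sizes = [1e6, 1e7, 1e8, 1e9]
    losses = [5.0 * n ** -0.08 for n in sizes]  # synthetic, noise-free data
    k, a = fit_power_law(sizes, losses)
    print(f"K = {k:.3f}, a = {a:.3f}")
```

With real measurements the points will not sit exactly on a line, so it is worth checking how well the fit holds before extrapolating far beyond the range you measured.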

Which publication first discussed neural scaling laws in detail?

Neural scaling laws for language models were first explored in depth by researchers at OpenAI in the 2020 paper “Scaling Laws for Neural Language Models”. This publication has been instrumental in guiding further research on scaling laws in AI. A few months later, OpenAI’s GPT-3 paper, “Language Models are Few-Shot Learners”, showed what scaling delivers in practice.

Here’s a short excerpt from the GPT-3 paper:

🧑‍💻 OpenAI Paper:

“Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.

Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.

[…]

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.”

Is there an example of a neural scaling law that doesn’t hold true?

While scaling laws can often provide valuable insights into AI model performance, they are not always universally applicable. For instance, if a model’s architecture or training methodology differs substantially from others in its class, the scaling relationship may break down, and predictions based on scaling laws might not hold true.

💡 Recommended: 6 New AI Projects Based on LLMs and OpenAI