Top 10 LLM Training Datasets – It’s Money Laundering for Copyrighted Data!


I first read the description of large language models (LLMs) as “Money Laundering for Copyrighted Data” on Simon Willison’s blog. In today’s article, I’ll show you exactly which training datasets open-source LLMs use, so we can gain more insight into this new alien technology and, hopefully, become smarter and more effective prompters. Let’s get started! 👇

There’s a tectonic shift happening in software development. AI developers at Tesla, OpenAI, and Google increasingly focus on … data curation rather than on explicitly writing intelligent algorithms.

In fact, Andrej Karpathy, Tesla’s former AI director, coined the phrase Software 2.0, i.e., software that is written implicitly by data and AI training rather than explicitly by coders. The related field of “mechanistic interpretability” analyzes how neural nets have self-learned and encoded algorithms in their weights.

Image Credits: Amalie Mayer Altimira

One of the critical aspects of large language model training is the availability of diverse and high-quality training datasets. These datasets play a vital role in shaping the LLM’s understanding of text structure, context, and general semantics. Various datasets have been employed for training LLMs, depending on factors such as specialization of the model, size, and performance goals.

But where does the training data of LLMs actually come from? Let’s find out! πŸ§‘β€πŸ’»

Overview of Training Datasets

One of the most comprehensive open-source datasets available is The Pile (paper, online), which consists of a diverse range of text sources. The Pile aims to provide a solid foundation for training LLMs, incorporating a wide variety of subjects, writing styles, and domains. It includes data from scientific articles, books, web pages, and other text sources to ensure a comprehensive and well-rounded training base.
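The Pile draws each training document from one of its component datasets according to a mixture weight, so some sources are seen far more often than others. Here is a minimal sketch of that sampling idea in Python; the component names and weights below are illustrative placeholders, not the exact proportions from the paper:

```python
import random

# Illustrative component weights (NOT the paper's exact proportions).
# The Pile mixes many sources, each sampled with its own weight.
components = {
    "Pile-CC (web)": 0.18,
    "Books": 0.12,
    "Scientific papers": 0.15,
    "Code": 0.10,
    "Other sources": 0.45,
}

def sample_source(rng):
    """Pick which component the next training document is drawn from."""
    names = list(components)
    weights = [components[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
draws = [sample_source(rng) for _ in range(10_000)]
# Over many draws, the empirical mix approaches the configured weights.
print(draws.count("Pile-CC (web)") / len(draws))
```

The actual training pipeline also applies per-component epoch counts (some high-quality sources are repeated more than once), which this sketch omits.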

Here’s an overview of the training data used:

As you can see, many of the datasets used are not copyright-free at all; they actually contain copyrighted content. For example, the Books3 dataset consists of “mostly pirated ebooks”:

However, this copyrighted content is only used to train LLMs. By analogy: if you read 2,000 pirated books, you become more intelligent and educated, but your “output” wouldn’t necessarily contain copyrighted content. Reading pirated books may not be very ethical, but it sure is effective for learning abstract and specific knowledge, and it’s not necessarily illegal.

Another essential resource in LLM training is the C4 dataset, which is short for Colossal Clean Crawled Corpus. C4 is derived from the Common Crawl dataset, a massive web-crawled resource containing billions of web pages. The C4 dataset is preprocessed and filtered, making it a cleaner and more useful resource for training LLMs.
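To get a feel for the kind of filtering involved, here is a simplified sketch of a few of C4’s published cleaning heuristics (keep only lines ending in terminal punctuation, drop very short lines, discard pages containing “lorem ipsum” or curly braces). The real pipeline is considerably more involved:

```python
def clean_page(text):
    """Apply a few of C4's published cleaning heuristics to one page.

    Returns the cleaned page text, or None if the page is discarded.
    (Simplified sketch, not the full C4 pipeline.)
    """
    if "lorem ipsum" in text.lower() or "{" in text:
        return None  # boilerplate / source-code pages are dropped entirely
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):
            continue  # keep only lines ending in terminal punctuation
        if len(line.split()) < 3:
            continue  # drop very short lines (menus, buttons, etc.)
        if "javascript" in line.lower():
            continue  # drop "please enable javascript" style lines
        kept.append(line)
    return "\n".join(kept) if kept else None

page = "Home | About\nEnable javascript to continue.\nThis is a useful article about language models.\nRead more"
print(clean_page(page))  # → This is a useful article about language models.
```

Even these crude rules discard a surprising fraction of raw Common Crawl text, which is why C4 is so much smaller than the crawl it came from.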

RefinedWeb is another valuable dataset for LLM training. Built by the Technology Innovation Institute to train its Falcon models, it is derived from CommonCrawl through aggressive filtering and deduplication, demonstrating that carefully curated web data alone can rival more hand-crafted corpora.

Wikipedia forms an essential part of various training datasets as it offers a vast source of structured, human-curated information covering an extensive range of topics. Many LLMs rely on Wikipedia in their training process to ensure a general knowledge base and improve their ability to generate relevant and coherent outputs across different domains.

Hugging Face hosts a collection of tens of thousands of training datasets.

Meta’s Llama research group published its data sources in the Llama v1 paper, confirming some of our findings above:

To the best of my knowledge, the Books and CommonCrawl portions, in particular, are not copyright-free.

Many other dataset aggregation resources have emerged, such as this GitHub repository and this Reddit thread. These sources are very unstructured, and they often contain input/output pairs from other LLMs such as ChatGPT, which would likely yield biased models or even violate the terms of service of existing LLMs such as OpenAI’s GPT model series or Meta’s Llama models.
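One pragmatic mitigation is to scan aggregated instruction/response pairs for telltale assistant phrases before training on them. The phrase list and helper below are purely illustrative, not an established tool:

```python
# Telltale phrases that suggest a response was generated by another LLM.
# Training on such pairs can bias a new model and may violate upstream terms.
TELLTALE_PHRASES = (
    "as an ai language model",
    "as a large language model",
    "i cannot fulfill that request",
)

def looks_model_generated(response):
    """Heuristically flag responses that read like assistant output."""
    lowered = response.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)

pairs = [
    {"prompt": "Explain recursion.",
     "response": "Recursion is when a function calls itself."},
    {"prompt": "Tell me a secret.",
     "response": "As an AI language model, I cannot share secrets."},
]
clean = [p for p in pairs if not looks_model_generated(p["response"])]
print(len(clean))  # → 1 (the flagged pair is removed)
```

This is only a first-pass filter; it cannot catch model-generated text that happens to avoid the listed phrases.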

Domain-Specific Large Language Models

Domain-specific large language models (LLMs) incorporate industry-specific knowledge and formulations. These models are trained on extensive datasets within specialized fields, enabling them to generate accurate and context-aware results.

In the healthcare sector, LLMs are transforming medical practices by leveraging vast repositories of clinical literature and medical records. Large language models in medicine are instrumental in improving diagnostic predictions, enhancing drug discovery, and refining patient care. The use of domain-specific text during the training of these models results in higher utility and performance, addressing complex medical queries with higher precision.

For instance, check out Google Research on leveraging proprietary medical data sets to improve the LLM performance:

πŸ§‘β€πŸ’» Recommended: Med-PaLM 2: Will This Google Research Help You Increase Your Healthspan?

The finance industry also benefits from domain-specific LLMs tailored to handle financial data and industry-specific tasks. BloombergGPT, a large language model for finance, is designed to support a diverse array of tasks within the financial sector. By focusing on domain-specific content, this model can effectively comprehend and generate finance-related insights, such as market analysis, trend predictions, and risk assessment.

Many other proprietary data sources are often used for training (though not for reproducing their exact content, to avoid copyright issues), e.g., StackOverflow and GitHub, Quora and Twitter, or YouTube and Instagram.

Domain-specific LLMs have the potential to revolutionize various industries by combining the power of large-scale machine learning with the expertise and context of domain-specific data. By focusing on specialized knowledge and information, these models excel in generating accurate insights, improving decision-making, and transforming industry practices across healthcare, finance, and legal sectors.

Image Credits: Amalie Mayer Altimira

Check out how to make your own LLM with proprietary data using GPT-3.5: πŸ‘‡

πŸ§‘β€πŸ’» Recommended: Fine-Tuning GPT-3.5 Turbo – How to Craft Your Own Proprietary LLM

Frequently Asked Questions

What are the primary datasets used to train LLMs?

Large language models (LLMs) are usually trained on a diverse range of text data, which can include books, articles, and web pages. Popular training datasets include the Common Crawl dataset, which contains petabytes of web crawl data, and the BookCorpus dataset, which comprises thousands of free, self-published books. Other examples of primary datasets include Wikipedia, news articles, and scientific papers.

How is data collected for training large language models?

Data is collected for training LLMs through web scraping, dataset aggregation, and collaborative efforts. Web scraping involves extracting text from web pages, while aggregation consolidates existing databases and datasets. Collaborative efforts often involve partnerships with organizations that possess large volumes of data, such as research institutions and universities. Preprocessing is an essential step to ensure quality, as it includes tasks such as tokenization, normalization, and filtering out irrelevant content.
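As a toy illustration of the scraping-plus-preprocessing step, here is a sketch using only Python’s standard library: it extracts visible text from an HTML snippet (skipping script and style blocks) and applies trivial normalization and tokenization. Real pipelines use far more robust extractors and tokenizers:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

html = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Body text here.</p></body></html>")
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.chunks)
# Trivial normalization and tokenization: lowercase + whitespace split.
tokens = text.lower().split()
print(tokens)  # → ['title', 'body', 'text', 'here.']
```

In practice, production crawlers fetch pages over the network, respect robots.txt, and use trained boilerplate-removal models rather than a hand-rolled parser like this one.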

What are the open-source resources to find training datasets for LLMs?

There are various open-source resources to find training datasets for LLMs, such as the Hugging Face Datasets library, which provides easy access to numerous datasets for machine learning and natural language processing. Other resources include the United Nations Parallel Corpus, Gutenberg Project, and ArXiv, which offer extensive collections of text data.

Are there any limitations or biases in current LLM training datasets?

Yes, current LLM training datasets can exhibit limitations and biases. These can result from factors such as biased data sources, imbalanced data, and overrepresentation of certain domains or demographics. This may lead LLMs to inherit and even amplify these biases, which can affect the fairness, reliability, and overall quality of the models. Public attention is growing around the need to address these issues in the development of LLMs.

How do different LLMs compare in terms of dataset size and diversity?

Different LLMs may vary in terms of dataset size and diversity. Generally, state-of-the-art LLMs tend to have larger and more diverse training datasets to achieve better performance. However, the specific features of different LLMs can contribute to the variations in the datasets used. For instance, some LLMs may prioritize specific domains or languages, while others may focus on capturing broader content from various sources.

πŸ§‘β€πŸ’» Recommended: Llama 2: How Meta’s Free Open-Source LLM Beats GPT-4!