Open-source research on large language models (LLMs) is crucial for democratizing this powerful technology.
Although open-source LLMs are now widely used and studied, they faced initial challenges and criticism: early attempts such as OPT and BLOOM performed poorly compared to closed-source models. This pushed researchers toward higher-quality base models pre-trained on far larger datasets, measured in hundreds of billions to trillions of tokens:
- OPT: 180 billion tokens
- BLOOM: 341 billion tokens
- LLaMA: 1.4 trillion tokens
- MPT: 1 trillion tokens
- Falcon: 1.5 trillion tokens
- LLaMA 2: 2 trillion tokens
However, pre-training these models is expensive, so only well-funded organizations can afford to train them and release them freely to the community.
This article focuses on high-performing open-source base models that have significantly advanced the field. A great graphic of the historical context of open-source LLMs is presented on the LangChain page:
How can we determine the best of those? Easy: with chatbot leaderboards like this one on Hugging Face:
By the way, feel free to check out my article on Claude 2, which has proven to be one of the most powerful free but closed-source LLMs:
The introduction of LLaMA 1 and 2 was a significant step in improving the quality of open-source LLMs. LLaMA is a suite of different LLMs with sizes ranging from 7 billion to 65 billion parameters. These models strike a balance between performance and inference efficiency.
LLaMA models are pre-trained on a corpus containing over 1.4 trillion tokens of text, one of the largest pre-training corpora used for an open model at the time. The release of LLaMA models sparked an explosion of open-source research and development in the LLM community.
LLaMA 2, the latest release, sets a new state of the art among open-source LLMs. These models are pre-trained on 2 trillion tokens of publicly available data, and the larger variants use grouped-query attention (GQA) to improve inference efficiency.
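The idea behind GQA is simple: instead of giving every query head its own key/value head, several query heads share one key/value head, which shrinks the KV cache and speeds up inference. Below is a toy NumPy sketch of this grouping, not the actual LLaMA 2 implementation; all array shapes and names are illustrative assumptions.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy grouped-query attention (GQA) sketch.

    q: (n_q_heads, seq, d)   -- one query projection per head
    k, v: (n_kv_heads, seq, d) -- fewer key/value heads, shared across groups
    """
    n_q_heads, seq, d = q.shape
    group_size = n_q_heads // n_kv_heads   # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group_size               # map query head -> its KV group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        # numerically stable softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))  # 8 query heads
k = rng.normal(size=(2, 4, 16))  # only 2 KV heads -> 4 query heads per group
v = rng.normal(size=(2, 4, 16))
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 4, 16)
```

With `n_kv_heads` equal to the number of query heads this reduces to standard multi-head attention, and with `n_kv_heads=1` it becomes multi-query attention (MQA); GQA sits between the two.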
MPT, another commercially usable open-source LLM suite, was released by MosaicML. The MPT-7B and MPT-30B models gained popularity due to their performance and their permissive commercial licensing. While these models perform slightly worse than proprietary GPT variants, they outperform most other open-source models.
Falcon, an open-source alternative to proprietary models, was among the first open models to approach the quality of closed-source LLMs. The Falcon-7B and Falcon-40B models are commercially licensed and perform exceptionally well. They are pre-trained on a custom-curated web corpus called RefinedWeb, which contains over 5 trillion tokens of text.
You can currently try the Falcon-180B Demo here.
📈 TLDR: Open-source LLMs include OPT, BLOOM, LLaMA, MPT, and Falcon, each pre-trained on an extensive token corpus. LLaMA 2 and Falcon stand out for their innovative approaches and extensive training data.
👉 For the best open-source LLM, consider using Vicuna-33B for its superior performance among non-commercial options.
Also, make sure to check out my other article on the Finxter blog: 👇
🔗 Recommended: Six Best Private & Secure LLMs in 2023
While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com that has taught exponential skills to millions of coders worldwide. He’s the author of the best-selling programming books Python One-Liners (NoStarch 2020), The Art of Clean Code (NoStarch 2022), and The Book of Dash (NoStarch 2022). Chris also coauthored the Coffee Break Python series of self-published books. He’s a computer science enthusiast, freelancer, and owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills. You can join his free email academy here.