Claude 2 LLM Reads Ten Papers in One Prompt with Massive 200k Token Context


The new Claude 2 model from AI research company Anthropic has demonstrated remarkable new capabilities. In this quick article, I’ll give you a concise overview of what you need to know.

Claude 2 Overview

Anthropic’s latest prodigy, Claude 2, is making waves. This AI language model, part of the Claude series, is a master of conversation, writing, editing, and more. It’s like having a personal assistant who can also code and provide advice on a myriad of subjects. Claude 2 is also well-suited for creative and literary use cases, such as writing in a particular tone, voice, or personality.

In particular, it can do the following tasks:

  • Search
  • Writing
  • Editing
  • Outlining
  • Summarizing
  • Coding
  • Advising
  • Educating

It doesn’t yet search the web, but you can share large documents (e.g., PDFs) with it and interact with them, for example by asking specific questions about a document or finding content in it.

The quality of Claude 2 is quite good. It performs well on many standardized benchmarks, such as grade school math problem solving, Q&A on very long stories, science questions, and reading comprehension, in some cases better than humans.

Claude 2 scores above the 90th percentile in verbal reasoning and analytical writing when compared to human students.

Claude 2 can also pass the Multistate Bar Examination (MBE) and the United States Medical Licensing Examination (USMLE), each with more than roughly 60% correct answers.

But here’s the most insane benefit: 👇🤯🚀

Claude 2’s Long Context Data with up to 200k Tokens

Claude 2 has been trained to have an expanded context window of 200k tokens, and performance keeps improving with larger context sizes! A 200k-token context is equivalent to roughly 150k words. So you can query Claude 2 with a small book PDF as context data! 🤯

Here are a few examples to help illustrate what 150k words might look like in real life:

  1. Books: An average novel is around 80,000 to 100,000 words. So, 150,000 words would be equivalent to a long novel or perhaps a trilogy of shorter novels. For example, “Harry Potter and the Order of the Phoenix” by J.K. Rowling is over 257,000 words. So, 150,000 words would be a bit more than half of that book.
  2. Theses and Dissertations: A typical doctoral dissertation might be around 80,000 to 100,000 words. So, 150,000 words would be a particularly long and detailed dissertation or thesis. My own PhD Thesis on distributed graph processing was roughly 57k words long, so Claude 2 could process four years of work in one context window!
  3. Speeches: The average person speaks at around 125-150 words per minute. So, a speech of 150,000 words would last around 16 to 20 hours if delivered without breaks.
  4. Web Content: The average web page has around 500-1000 words. So, 150,000 words would be equivalent to the content of about 150-300 average web pages.
  5. Newspaper: The average newspaper article is around 500-800 words. So, 150,000 words would be equivalent to around 187-300 newspaper articles.
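The arithmetic behind these comparisons can be reproduced in a few lines. This is a minimal sketch that assumes the common rule of thumb of about 0.75 English words per token (which is what turns 200k tokens into 150k words); the function names are made up for illustration, and a real tokenizer will give somewhat different counts.

```python
from typing import Tuple

# Assumption: roughly 0.75 English words per token (rule of thumb, not exact).
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate how many English words fit into a given number of tokens."""
    return round(tokens * words_per_token)


def speech_hours(words: int, words_per_minute: int) -> float:
    """Hours needed to speak `words` at a constant rate without breaks."""
    return words / words_per_minute / 60


def item_range(words: int, min_len: int, max_len: int) -> Tuple[int, int]:
    """(fewest, most) items of min_len..max_len words that add up to `words`."""
    return words // max_len, words // min_len


words = tokens_to_words(200_000)
print(words)                                   # 150000
print(round(speech_hours(words, 150), 1))      # 16.7 (hours at 150 wpm)
print(round(speech_hours(words, 125), 1))      # 20.0 (hours at 125 wpm)
print(item_range(words, 500, 1000))            # (150, 300) web pages
print(item_range(words, 500, 800))             # (187, 300) newspaper articles
```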

Claude 2 will support 100k-token context windows at launch, with the goal of increasing the limit later. Again: large context windows like this are a true game changer. Neural networks are starting to become mega brains that can “load” and process huge amounts of information at once.

Not only do these AI models already have huge amounts of base knowledge encoded in their weights (which is what zero-shot prompting draws on), but you can now also load larger and larger amounts of application-specific information (up to 200k tokens of context) to generate high-quality output.

Here are some examples of how you can combine a mega-brain (LLM) with app-specific data (200k context query):

  • Legal Document Analysis: A mega brain AI with a large context window could be used to analyze lengthy legal documents, such as contracts or court transcripts. It could identify critical points, summarize content, and even provide insights on legal implications. This could be particularly useful for law firms and legal departments in corporations.
  • Medical Research: In the field of medicine, there are often extensive research papers and clinical trial reports that need to be reviewed. An AI with a significant context window could read and summarize these documents. A medical AI researcher can use it to create new research by combining various papers in unique ways.
  • Book Summarization and Analysis: An AI could read and summarize entire books for publishers or avid readers. It could provide plot summaries, character analyses, and themes. This could be useful for creating study guides or for readers trying to decide if they want to read a particular book.
  • Historical Research: Historians often have to sift through extensive primary source documents. An AI with a large context window could help by reading through these documents and identifying key events, figures, and themes, saving researchers significant time.
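In practice, each of these use cases boils down to packing documents into a single prompt without exceeding the context window. Here is a minimal sketch of that step. It makes no API call, the helper names `estimate_tokens` and `build_prompt` are hypothetical, the token count is only a word-based estimate (again assuming about 0.75 words per token), and the default budget uses the 100k tokens mentioned above for launch.

```python
# Assumption: roughly 0.75 words per token; a real tokenizer will differ.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    return int(len(text.split()) / WORDS_PER_TOKEN) + 1


def build_prompt(documents, question, max_tokens=100_000):
    """Add documents in order while they fit the budget, then append the question.

    `documents` is a list of (title, text) pairs. Returns the prompt string
    and the titles of the documents that did not fit.
    """
    budget = max_tokens - estimate_tokens(question)
    parts, skipped = [], []
    for title, text in documents:
        section = f"[Document: {title}]\n{text}\n"
        cost = estimate_tokens(section)
        if cost <= budget:
            parts.append(section)
            budget -= cost
        else:
            skipped.append(title)  # too large for the remaining budget
    parts.append(f"Question: {question}")
    return "\n".join(parts), skipped


# Two dummy "papers" of 30k and 60k words; only the first fits in 100k tokens.
docs = [("Paper A", "word " * 30_000), ("Paper B", "word " * 60_000)]
prompt, skipped = build_prompt(docs, "What do these papers have in common?")
print(skipped)  # ['Paper B']
```

A skipped document could be summarized first or sent in a second prompt; which strategy works best depends on the task.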

Helpful, Honest, Harmless (HHH) Evaluation Framework

Anthropic’s evaluation framework for their AI models, including Claude 2, is comprehensive and rigorous. It includes pre-deployment testing that assesses the model’s capabilities, safety, and alignment with ethical expectations.

Capabilities evaluations measure the model’s skills across various tasks, while safety and alignment evaluations assess potential risks and ethical conformity.

Red teaming is also employed, where independent teams attempt to exploit system vulnerabilities. The results are integrated into safety mitigations.

Anthropic collaborates with the Alignment Research Center (ARC) for safety audits and with external red teamers for Trust and Safety tests.

Human feedback is a crucial part of the evaluation process. Human preference data is used to calculate per-task Elo scores, a comparative performance metric that indicates how often a human evaluator prefers one model’s outputs over another.
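To see how an Elo difference translates into “how often one model is preferred”, here is a small sketch. It assumes the standard Elo convention (base 10, scale factor 400); the article does not spell out the exact convention or fitting procedure used in the paper, and the ratings in the example are invented.

```python
def elo_preference_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A's output is preferred over model B's.

    Standard Elo expected-score formula (assumed convention: base 10, scale 400).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Equal ratings mean each model is preferred half of the time.
print(elo_preference_probability(1000, 1000))            # 0.5
# A 100-point lead corresponds to being preferred about 64% of the time.
print(round(elo_preference_probability(1100, 1000), 2))  # 0.64
```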

[Figure: human preference evaluation from the paper (higher is better)]

[Figure: alignment evaluation from the paper (lower is better)]

All in all, this is an extremely impressive performance and value proposition. The large context windows, high prompting quality, and low bias of the Anthropic LLM Claude 2 mark one more milestone on the insane journey we’re currently on.

Compare this to where we were only a year ago in AI research. The future is bright! 🌞
