Python LangChain Course
- Part 0/6: Overview
- Part 1/6: Summarizing Long Texts Using LangChain
- Part 2/6: Chatting with Large Documents (this part)
- Part 3/6: Agents and Tools
- Part 4/6: Custom Tools
- Part 5/6: Understanding Agents and Building Your Own
- Part 6/6: RCI and LangChain Expression Language
Welcome back to part two, where we’re going to ‘chat’ with an entire book! We’ll be able to ask a textual question and we’ll receive a textual response based on the information inside the book. Besides being really cool, what are the practical real-world use cases for this?
Imagine you’re a writer, and you are on book number 7 of a series. You want to make sure you don’t contradict yourself or create some kind of continuity problem with the contents of the last 6 books, so you just ‘chat’ with your previous books to quickly find out. Just ask your books and get a natural-language answer back!
Or say you have a large amount of documentation explaining exactly how to use your product or software suite. You can store the entire documentation base and, instead of relying on keyword searches, people will be able to search it semantically and find exactly the part of your documentation that explains how to do the thing they need.
This is extremely helpful and, I personally think, extremely cool! The documentation use case is already being used in professional production environments. The point is that people can ask a natural-language question and get a natural-language answer back. It's just like chatting with ChatGPT, except it has access to the entire book, or even multiple books or an entire documentation base, and will answer your questions based on that knowledge.
So how can we do this? We will use embeddings! To avoid repetition for those who already took my 'Function calls and embeddings' course, I won't go too deep into what embeddings are. Basically, an 'embedding' is a way for the computer to represent a word or sentence as a vector, which is just a long list of numbers. These numbers represent the meaning of the text, not the exact words used! For example, the embeddings for the strings 'happy' and 'feel-good' would be very similar: even though the characters in the two strings are completely different, their meanings point in a similar direction. This is the whole idea behind embeddings.
This allows computers to compare vectors for different pieces of text and see how similar they are in meaning. This is how we can compare a user’s question to the contents of a book and find the most similar parts of the book to the user’s question. We can then return the most similar parts of the book as the answer to the user’s question. Again, if you want more information on embeddings and how they work, please refer back to my ‘Function calls and embeddings’ course.
OpenAI has not just the ChatGPT API, but also another API endpoint that allows us to send a piece of text and get one of these embeddings in return. We’ll be using this OpenAI embeddings API for our embeddings.
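To make this concrete, here is a minimal sketch of comparing two embeddings yourself. It assumes you have numpy installed alongside LangChain and that your OPENAI_API_KEY is in a .env file as set up in part 1; the exact similarity score you get will vary:

import numpy as np
from decouple import config
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings_api = OpenAIEmbeddings(openai_api_key=config("OPENAI_API_KEY"))

# Each embedding is a long list of floating point numbers (1536 for OpenAI's model)
vec_happy, vec_feelgood = embeddings_api.embed_documents(["happy", "feel-good"])

# Cosine similarity: closer to 1 means the meanings point in a similar direction
similarity = np.dot(vec_happy, vec_feelgood) / (
    np.linalg.norm(vec_happy) * np.linalg.norm(vec_feelgood)
)
print(similarity)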
The process
Our basic steps to achieve a full book or documentation chat are as follows:
- Load our data and split it up into small parts
- For each small part, send it to the OpenAI embeddings API and get the embedding back
- Store the embedding in a vector database, specifically designed to store and compare/search embeddings
- When a user asks a question, convert the question to an embedding and search the vector database for the closest embeddings, indicating that their meaning is related to the user’s question.
- Take over the world for total domination! (optional)
You can do this with a single huge document, like a book, or with a large collection of smaller documents, like the documentation for a software library. It doesn’t matter and the underlying process is basically the same. We will create a database of vectors and compare the user query converted to a vector to the vectors in the database. The closest vectors will be related to the user query and be returned as the answer to the user.
In the previous tutorial series we used a .csv file to store our embeddings. While this is fine for smaller projects, if you're going to be building a large production project you will need a vector database. That's why for this tutorial we'll be looking at Pinecone for storing our embeddings. Pinecone is a cloud-based vector database that is extremely easy to use and has a free tier that is more than enough for most projects.
That's enough introduction, let's get started on world domination… I mean, let's get started on building our book chat!
Note: You can watch the full course video right here on the blog; I'll embed the video below each of the other parts as well. If you want the step-by-step course with code and a downloadable PDF course certificate to show your employer or freelancing clients, follow this link to learn more.
Installing dependencies and getting a book
Before we get started, let’s install some dependencies and get a book we want to chat with. First, run the following command in a terminal window:
pip install pypdf pinecone-client
We'll use the pypdf library to read our book (which is in PDF format) and the pinecone-client library to connect to our Pinecone vector database (we'll create a free account later).
Next is our book. I will be using “How to succeed or, Stepping-stones to fame and fortune” by Orison Swett Marden. It’s a book from 1896 and is in the public domain, so we can use it for free. You can download the book from the following link:
https://manybooks.net/titles/mardenor2051320513-8.html#google_vignette
Go ahead and create a new folder in your root project folder named '2_Chat_with_large_documents' and put the book inside a 'data' folder within it. I will be renaming the book to 'How-to-succeed.pdf'. Your folder structure should look like this:
Finx_LangChain
    1_Summarizing_long_texts
    2_Chat_with_large_documents
        data
            How-to-succeed.pdf
    .env
Preparing our data / creating a vector database
The code for this tutorial part will be relatively short, but it is important that we know and keep track of what is happening and what LangChain is abstracting away for us, so that this doesn’t turn into some kind of magical code soup that we don’t understand. The less code there is, the harder it can often be to understand what is going on, so we’ll take a look under the hood as we go.
In your '2_Chat_with_large_documents' folder, create a file called '1_book_chat_setup.py':
Finx_LangChain
    1_Summarizing_long_texts
    2_Chat_with_large_documents
        data
            How-to-succeed.pdf
        1_book_chat_setup.py
    .env
Inside this file we’ll get started with our imports:
import re
import pinecone
from decouple import config
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Pinecone
The re library is Python's built-in regular expression library; we'll use it to clean up our texts. pinecone is for the database we'll be using, and decouple should be familiar by now for loading our API keys. PyPDFLoader allows us to load PDFs in LangChain, and OpenAIEmbeddings simply makes a call to the OpenAI embeddings API, just like we used 'ChatOpenAI' in the previous tutorial series. You're well familiar with the Document data structure by now, and Pinecone from langchain.vectorstores is another LangChain helper that will make it easier to interact with our Pinecone vector database in the cloud.
Speaking of Pinecone, let's create a free account. Go to pinecone.io and press the 'Sign Up Free' blue button on the top right of the webpage. You can just use your Google account to sign up for a free plan. Now you'll need two things. First, you'll need your API key. You can find this by clicking on 'API Keys'. Copy the key and paste it into your .env file below your OpenAI key as follows:
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
Again, make sure not to use any spaces anywhere in the .env file.
Second, you will need to create an index to store the embeddings (vectors) in. Back on the Pinecone website, in the left-hand menu above 'API Keys' there should be a menu item named 'Indexes'. Click this and then click the 'Create Index' button. For the name, use "langchain-vector-store". Set the number of dimensions to 1536, as this is what OpenAI embeddings use (it refers to the number of dimensions in the vector that represents our text, which means each vector we'll be storing is a list of 1536 floating point numbers).

For the metric, use cosine, as this is the standard mathematical way to compare vectors. Finally, set the Pod Type to starter, as this is free to use on our free account. Then click the 'Create Index' button to create your Pinecone database. You'll see something like this, letting you know the vector database is up and running:
langchain-vector-store (Free Tier)
METRIC: cosine | DIMENSIONS: 1536 | POD TYPE: starter | HOST: https://.......pinecone.io
PROVIDER | ENVIRONMENT | MONTHLY COST: $0
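If you prefer to create the index from code rather than clicking through the console, something along these lines should also work with the pinecone-client used in this course (treat it as a sketch; check the client docs if your version's signature differs):

import pinecone
from decouple import config

pinecone.init(api_key=config("PINECONE_API_KEY"), environment="gcp-starter")

# Only create the index if it doesn't exist yet
if "langchain-vector-store" not in pinecone.list_indexes():
    pinecone.create_index("langchain-vector-store", dimension=1536, metric="cosine")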
Loading our book into the vector database
Back in our '1_book_chat_setup.py' file, let's do some setup:
embeddings_api = OpenAIEmbeddings(openai_api_key=config("OPENAI_API_KEY"))
pinecone.init(api_key=config("PINECONE_API_KEY"), environment="gcp-starter")
pinecone_index = "langchain-vector-store"
First, we prepare the embeddings API by initializing a new OpenAIEmbeddings instance and passing in our openai_api_key from the .env file. The embeddings endpoint uses the same API key as the ChatGPT endpoint. Next, we initialize Pinecone by calling pinecone.init and passing in our PINECONE_API_KEY.
The environment for the second argument can be found by clicking the 'Indexes' menu item on the left-hand side of the Pinecone website. This will show your langchain-vector-store, and among the information will be 'Environment'. In my case, this is 'gcp-starter', and it will probably be the same for you.

Finally, we initialize a simple string variable and set its value to the name of our Pinecone index, which is "langchain-vector-store".
Time to load our book!
loader = PyPDFLoader("data/How-to-succeed.pdf")
data: list[Document] = loader.load_and_split()
The first line just creates a loader object linked to our file path but does not load it yet. We then call the .load_and_split method on the loader object, which will return a list of Document objects, with each Document being one page.

Remember that Document objects have a "page_content" and a "metadata" property. We won't be using the metadata property in this one, so let's simplify our data to a simple list of strings, each containing the text of one page:
page_texts: list[str] = [page.page_content for page in data]
print(page_texts[6][:1000])
Inside the list comprehension, for every page in the data variable, we keep only the page_content property. Then we print the first 1000 characters of the 7th page just to have a look at what we have. Go ahead and run this Python file so far. Make sure you cd into your …/Finx_LangChain/2_Chat_with_large_documents folder in your terminal before running the file. It will take a minute, and then you should see something like this:
One great need of the world to-day is for men and women who are good animals. To endure the strain of our concentrated civilization, the .....
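As a quick optional check, you can also inspect what load_and_split gave us besides the raw text; the exact page count and metadata values will depend on your copy of the PDF:

print(len(data))         # number of Document objects, one per page
print(data[6].metadata)  # something like {'source': 'data/How-to-succeed.pdf', 'page': 6}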
Let's get rid of all these unnecessary tabs that appeared in the loading process. Remove the previous print statement and then add the following below the page_texts variable:
page_texts_fixed: list[str] = [re.sub(r"\t|\n", " ", page) for page in page_texts]
print(page_texts_fixed[6][:1000])
For each page in page_texts, we run a regex substitution that replaces any tab ("\t") or ("|" means 'or' in regex) newline ("\n") character with a simple space (" ") character. The list comprehension runs this on every page and returns a new list. Let's print a small snippet again and see if it's fixed:
One great need of the world to-day is for men and women who are good animals. To endure the strain of our concentrated civilization, the coming man and woman must have....
That’s a lot better. Go ahead and get rid of the print statement. Now all we need to do is send each page to the OpenAI embeddings API and store the resulting embedding in our pinecone vector database. LangChain makes this very easy for us. (For more info on what these API calls do and what vectors are, refer to my ‘Function calls and embeddings’ course available here at the Finxter Academy). Add the following code to finish up our setup file:
vector_database = Pinecone.from_texts(
    page_texts_fixed, embeddings_api, index_name=pinecone_index
)
We call the .from_texts method on the Pinecone helper class that comes with LangChain. We pass in our list of strings containing the page texts, the embeddings_api we want to use to generate the embeddings, and the name of the Pinecone index to store the embeddings in. LangChain will do the rest for us behind the scenes: it will send each page to the OpenAI embeddings API and store the resulting embedding in our Pinecone vector database.
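To keep this from turning into magic code soup, here is a rough, simplified sketch of what Pinecone.from_texts is doing for us behind the scenes. This is an illustration only; LangChain also handles details like batching and ID generation:

# Simplified illustration -- no need to run this yourself
index = pinecone.Index(pinecone_index)
vectors = embeddings_api.embed_documents(page_texts_fixed)  # one vector per page
to_upsert = [
    (str(i), vector, {"text": text})  # (id, embedding values, metadata holding the text)
    for i, (vector, text) in enumerate(zip(vectors, page_texts_fixed))
]
index.upsert(vectors=to_upsert)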
Go ahead and run your '1_book_chat_setup.py' file and give it a couple of minutes to run. (Make sure you are in your …/Finx_LangChain/2_Chat_with_large_documents folder in your terminal before running the file.) When it's done running, go back to the Pinecone website, click 'Indexes' in the left-hand menu, and then click on your "langchain-vector-store".
It may take a minute to appear, but the browser tab at the bottom should start to display data. We can see that each entry has an ID, which is used internally by Pinecone; VALUES, which holds the vectors themselves; and METADATA, which has the text corresponding to each vector. We now have a fully functional, searchable vector database that holds embeddings for every page of our book!
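If you'd rather verify this from code than from the browser tab, the pinecone client can report how many vectors the index now holds; run something like this in any script where pinecone.init has already been called (the count should roughly match the number of pages in the book):

stats = pinecone.Index("langchain-vector-store").describe_index_stats()
print(stats)  # includes the total vector count for the index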
Getting vectors from the database
Let's make a very simple query to our database first, to see how we can get matching embeddings back from it; after that we'll build it out further. Create a new file in your '2_Chat_with_large_documents' folder called '2_book_simple_query.py':
Finx_LangChain
    1_Summarizing_long_texts
    2_Chat_with_large_documents
        data
            How-to-succeed.pdf
        1_book_chat_setup.py
        2_book_simple_query.py
    .env
Inside this file let’s start with the basic imports:
import pinecone
from decouple import config
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
You have seen all of these before. Let’s do our setup:
embeddings_api = OpenAIEmbeddings(openai_api_key=config("OPENAI_API_KEY"))
pinecone.init(api_key=config("PINECONE_API_KEY"), environment="gcp-starter")
pinecone_index = pinecone.Index("langchain-vector-store")
vectorstore = Pinecone(pinecone_index, embeddings_api, "text")
First we set up our OpenAIEmbeddings API in LangChain, as the question we ask must also be converted to an embedding before we can compare it to the embeddings already present in the Pinecone database and ask for similar results. Next, we connect to Pinecone using the .init method, passing our API key and environment again. Just to avoid confusion: the lowercase pinecone is the official pinecone library, and the uppercase Pinecone refers to the LangChain helper class.
We catch our Pinecone index in a variable by calling pinecone.Index and passing in the name of our index. Then we use LangChain's Pinecone helper, passing in the pinecone index, the embeddings API, and finally the string "text", which tells the helper which metadata key holds the original text for each stored vector. This returns an object we can use to interact with our vector database, so we'll name it vectorstore.
Now let’s set a query and run a quick test:
query = "What is the fastest way to get rich?"
print(vectorstore.similarity_search(query, k=5))
Running this prints a list of 5 Document objects: the five pages of the book that are most similar in meaning to our query, because we set the k argument to 5.
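If you want output that's easier to scan than five full Document reprs, you could print just a short snippet of each match instead, reusing the query variable we just defined:

for doc in vectorstore.similarity_search(query, k=5):
    print(doc.page_content[:200], "\n---")  # first 200 characters of each matching page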
That’s nice and all, but 5 pages is a lot to read through. I really would like a more concise and summarized natural language answer from ChatGPT for the end user, combining the essence of the material in these pages to precisely answer the user’s question. Let’s do that next!
Getting a natural language answer from our book chat
The '2_book_simple_query.py' file was just to give you more of an idea of what is actually going on before we abstract too much away. It's time to make a better version of our book chat now! Let's create a new file in our '2_Chat_with_large_documents' folder called '3_book_chat_final.py':
Finx_LangChain
    1_Summarizing_long_texts
    2_Chat_with_large_documents
        data
            How-to-succeed.pdf
        1_book_chat_setup.py
        2_book_simple_query.py
        3_book_chat_final.py
    .env
As always, we start with our imports:
import pinecone
from decouple import config
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone
All the imports are familiar except for load_qa_chain, which lets us load a question-and-answer chain, as you will see shortly. Also, notice we changed the ChatOpenAI import to OpenAI, because we'll be using the Davinci model instead of the GPT-3.5-turbo model this time.
The Davinci model is a language model from the completions endpoint that focuses purely on text completion rather than on being chatty, whereas the GPT-3.5-turbo model is a chatbot model from the chat_completions endpoint. We'll use the Davinci model because it will be as concise as possible and only answer the question, whereas the chatty GPT-3.5-turbo model tends to open with disclaimers like "the text does not explicitly state the answer to your question" whenever the book's wording differs slightly from the question. We could also play around with the prompts to try and achieve this, but I want you to be aware that there are more options than just ChatGPT.
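That said, if you'd rather stay with the chat model, load_qa_chain accepts a chat model just as well; swapping it in would look roughly like this, though you may then need to tweak the prompt to keep the answers terse:

from langchain.chat_models import ChatOpenAI

chat_api = ChatOpenAI(
    temperature=0, openai_api_key=config("OPENAI_API_KEY"), model_name="gpt-3.5-turbo"
)
# ...and later pass chat_api instead of davinci_api to load_qa_chain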
Now for our setup. We’ll need both our language model API and the embeddings API.
openai_key = config("OPENAI_API_KEY")
davinci_api = OpenAI(
    temperature=0, openai_api_key=openai_key, model_name="text-davinci-003"
)
embeddings_api = OpenAIEmbeddings(openai_api_key=openai_key)
We declare our openai_key in a separate variable as we'll need it twice. First, we instantiate our davinci_api by calling LangChain's OpenAI helper, passing in the temperature, our OpenAI key, and the model name we'll use. The temperature is 0 here because we just want the model to answer from the material provided; additional creative variety is not desired. We then instantiate our embeddings_api as we did before.
Now we set up Pinecone and connect to our vector database. (You can copy these three lines straight from the previous '2_book_simple_query' file):
pinecone.init(api_key=config("PINECONE_API_KEY"), environment="gcp-starter")
pinecone_index = pinecone.Index("langchain-vector-store")
vector_store = Pinecone(pinecone_index, embeddings_api, "text")
Now we will use the load_qa_chain method to quickly create a chain:
qa_chain = load_qa_chain(davinci_api, chain_type="stuff")
We load a question and answer chain, passing in our Davinci API and the chain_type of “stuff”. You will remember “stuff” from the previous tutorial part. We’re using it here because we need to stuff the multiple matches we get back from our vector database into the chain.
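As in the previous part, "stuff" is not the only option; if the combined matches ever get too long for a single prompt, you could load the same chain with a different strategy instead, for example:

# Alternative for long contexts (sketch): answer per document, then combine the results
qa_chain = load_qa_chain(davinci_api, chain_type="map_reduce")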
You might have noticed we skipped a step: we never set up our prompt template like we did in the previous tutorial part. That is because the load_qa_chain method already does this for us. The prompt template is internally stored in chain.llm_chain.prompt.messages and reads as follows:
## Do not put this in your code ##
template="Use the following pieces of context to answer the users question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n{context}"
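You don't have to take my word for it; after creating the chain, printing the prompt object shows what your installed LangChain version actually uses (the exact attribute layout and wording can differ slightly between chat and completion models):

print(qa_chain.llm_chain.prompt)  # shows the built-in question-answering prompt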
So we can see that this qa_chain will use the above template and stuff multiple documents into the {context} placeholder for us, all straight out of the box, without us having to do anything. We just need to get similar matches from our vector database and feed them to the chain. Let's make a function to do this:
def ask_question_to_book(question: str, verbose=False) -> str:
    matching_pages = vector_store.similarity_search(question, k=5)
    if verbose:
        print(f"Matching Documents:\n{matching_pages}\n")
    result = qa_chain.run(input_documents=matching_pages, question=question)
    print(result, "\n")
    return result
We declare a function that takes a question and a verbose boolean (defaulting to False) as arguments and returns a string. We call the similarity_search method on our vector_store, passing in the question and the number 5 for k, though you can experiment with a different number of matches if you like. This returns a list of Document objects that are the top matches for our question, the most similar in meaning among the entire book's contents.

We then check if verbose is True and, if so, print the matching pages to give more console feedback. Next, we call the .run method on our qa_chain, passing in the matching_pages as input_documents along with the original question. This returns a string that is the answer to the question. We print this string and return it.

Note that we never converted the user's question to an embedding ourselves. LangChain's Pinecone helper class, of which we have an instance in the vector_store variable, does this for us automatically.
Testing our book chat
I’m going to ask three questions to our book chat to test it out:
ask_question_to_book("What is the fastest way to get rich?", verbose=True)
ask_question_to_book("What is the problem with most people?")
ask_question_to_book("What is the best way to peel bananas?")
The first one uses the verbose flag set to True to test it out. The second one will only give us the answer without displaying all the context pages passed to Davinci. The third one is a test to make sure Davinci will not make up an answer if the answer is not contained in the book.
Let's go ahead and run our book chat and see what we get:
*A list with 5 Document objects, cut out for brevity*

The fastest way to get rich is to find a great want of humanity and improve any methods which men use to supply that want. It is also important to choose an occupation that is helpful to the largest number of people and to observe and take pains to find opportunities.

The problem with most people is that they are superficial and unprepared for their work. They lack the mental and moral breadth to be successful in their chosen profession and are often out of place in society. They lack the will-power and application to develop their own individual strengths and instead try to imitate others. They often fret and complain about things they cannot help and waste time on the road to success.

I don't know.
Even though this book is over 120 years old, it seems some things never change! Those are excellent answers straight from the book’s philosophy, and the third trick question was answered “I don’t know”, just like we wanted. You are now chatting with a book!
Now you can make this into a utility that takes input, or implement it into a website or app and allow your users to talk to this knowledge base! Of course, make sure you have the rights to use the book or documentation you are using, but this is a very powerful tool that can be used in many different ways.
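As a starting point for such a utility, a minimal interactive loop on top of our function could look something like this (just a sketch; a real app would want error handling and perhaps rate limiting):

if __name__ == "__main__":
    while True:
        question = input("Ask the book a question (or type 'quit' to exit): ")
        if question.strip().lower() == "quit":
            break
        ask_question_to_book(question)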
That’s it for part 2, I hope to see you soon in the next part where we’ll be looking at tools and agents!
This tutorial is part of our original course on Python LangChain. You can find the course link here:

Original Course Link: Becoming a Langchain Prompt Engineer with Python – and Build Cool Stuff