
Build a Chat-with-Documents App: LangChain + Streamlit End-to-End

Intermediate · 150 min · 3 exercises · 50 XP

You have a 200-page technical PDF and a question buried somewhere on page 147. You could skim the whole thing, or you could build an app that reads it for you. Upload the PDF, ask your question in plain English, and get a cited answer with the exact page and passage.

That is exactly what we are building: a full ChatPDF-style application using LangChain, ChromaDB, and Streamlit.

Architecture — How Chat-with-Documents Works

Before we write a single line of code, I want you to understand the full pipeline. Every "chat with your documents" app — ChatPDF, Notion AI, custom enterprise tools — follows the same five-step pattern. Once you see it, you will recognize it everywhere.

The pipeline has two phases. The ingestion phase happens once per document: load the PDF, split it into chunks, generate embeddings, and store them in a vector database.

The query phase runs every time the user asks a question. Embed the question, search the vector store for the most relevant chunks, and pass those chunks plus the question to an LLM to generate an answer.

The five-step RAG pipeline (pseudocode)
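Here is the whole pipeline sketched in pseudocode. The function names are illustrative, not a real API; the point is the shape of the two phases.

```text
# --- Ingestion phase (once per document) ---
pages   = load_pdf("report.pdf")                  # 1. load
chunks  = split(pages, size=1000, overlap=150)    # 2. chunk
vectors = embed(chunks)                           # 3. embed
store   = vector_db.add(chunks, vectors)          # 4. store

# --- Query phase (every question) ---
q_vec  = embed(question)
top_k  = store.search(q_vec, k=4)                 # 5a. retrieve
answer = llm(prompt(context=top_k, question=question))  # 5b. generate
```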

The key insight here is that this architecture separates finding the right information (retrieval) from generating the answer (generation). The vector store handles retrieval, the LLM handles generation, and LangChain wires them together.

Project Setup and Dependencies

We need six packages. I have listed each one with its role so you know why every dependency is here — no mystery installs.

Install all dependencies
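One command covers all six (versions unpinned here; pin them in a requirements file for reproducible installs):

```shell
pip install langchain langchain-openai langchain-community chromadb pypdf streamlit
```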
  • langchain — core framework: chains, prompts, base classes
  • langchain-openai — OpenAI chat model and embedding wrappers
  • langchain-community — community integrations, including the ChromaDB vector store
  • chromadb — lightweight vector database (runs locally, no server needed)
  • pypdf — PDF text extraction
  • streamlit — web UI with zero frontend code

Create a project folder and an app.py file. You will also need an OpenAI API key — set it as an environment variable or pass it directly. I prefer the environment variable approach because it keeps secrets out of your codebase.

Project structure
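A layout like this works well (the folder name is arbitrary; chroma_db/ appears automatically on first ingestion):

```text
chat-with-docs/
├── app.py        # Streamlit UI
├── ingest.py     # load, split, embed, store
├── chain.py      # RAG chain
└── chroma_db/    # persisted vector store (created at runtime)
```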

Loading and Splitting Documents

If you worked through the Document Loaders and Text Splitters tutorials, this will feel familiar — but now we are putting both pieces together in a real pipeline. The goal: take a PDF file path, extract all its text, and split it into chunks that are small enough for embedding but large enough to carry meaning.

ingest.py — Load and split a PDF
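A sketch of the loading step. The chunk size and overlap defaults are my starting values, not requirements:

```python
# ingest.py — load a PDF and split it into overlapping chunks
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_split(pdf_path: str, chunk_size: int = 1000, chunk_overlap: int = 150):
    """Return a list of Document chunks, each carrying source/page metadata."""
    pages = PyPDFLoader(pdf_path).load()  # one Document per page
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],  # prefer paragraph, then sentence breaks
    )
    return splitter.split_documents(pages)
```

RecursiveCharacterTextSplitter tries the separators in order, so chunks break at paragraph boundaries when possible and only fall back to mid-sentence cuts as a last resort.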

Two decisions matter here. Chunk size controls how much text each vector represents. Too small (100 characters) and you lose context — the chunk might be a sentence fragment that means nothing on its own. Too large (5,000 characters) and retrieval becomes imprecise — the relevant sentence gets buried in surrounding text. I have found 800-1,200 characters to be the sweet spot for most document QA tasks.

Chunk overlap ensures that sentences near a split boundary appear in both adjacent chunks. Without overlap, a sentence like "The deadline was moved to March 15th" could get cut between "moved to" and "March 15th" — and neither chunk would contain the complete fact. An overlap of 150-200 characters handles most of these edge cases.

Inspect a chunk to verify metadata
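A quick sanity check in a REPL or scratch script, assuming ingest.py defines a load_and_split() helper as described above:

```python
from ingest import load_and_split

chunks = load_and_split("report.pdf")
print(len(chunks))                   # total number of chunks
print(chunks[0].metadata)            # expect something like {'source': 'report.pdf', 'page': 0}
print(chunks[0].page_content[:200])  # first 200 characters of the first chunk
```

Confirm the page metadata survived the split; the citation feature later depends on it.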

Embedding and Storing in ChromaDB

Chunks are just text right now. To search them by meaning rather than keyword matches, we convert each chunk into a vector — a list of numbers that captures semantic meaning. I think of it as translating English into "math space" where similar sentences land close together, even if they use different words.

ingest.py — Create embeddings and store in ChromaDB
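A sketch of the embedding step. The model name is my choice; any OpenAI embedding model works as long as you use the same one everywhere:

```python
# ingest.py — embed chunks and persist them in a local Chroma database
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def build_vector_store(chunks, persist_directory: str = "./chroma_db"):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,  # written to disk for reuse
    )
```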

Chroma.from_documents() does three things at once: sends each chunk's text to OpenAI's embedding API, receives a 1536-dimensional vector for each chunk, and stores the vector alongside the original text and metadata. That is the entire ingestion pipeline in two functions.

The persist_directory argument saves the database to disk. This means you embed each document once, and subsequent queries load the pre-built index instantly — no re-embedding needed.

Load an existing vector store from disk
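Reopening the persisted index looks like this (the critical detail: the embedding model must match the one used at ingestion):

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def load_vector_store(persist_directory: str = "./chroma_db"):
    """Reopen a previously persisted Chroma index; no re-embedding happens."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # same model as ingestion!
    return Chroma(persist_directory=persist_directory, embedding_function=embeddings)
```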

Building the Retrieval Chain

This is the core of the application — the chain that takes a user question, finds relevant document chunks, and generates a grounded answer. I spent a lot of time getting this right in my own projects, and the pattern I landed on uses LangChain's LCEL (LangChain Expression Language) to pipe components together cleanly.

chain.py — The RAG chain with citations
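A sketch of chain.py. The chat model name and k value are my assumptions; swap in whatever you have access to:

```python
# chain.py — retrieval + grounded generation via LCEL
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

PROMPT = ChatPromptTemplate.from_template(
    """Answer ONLY from the context below.
If the context lacks the answer, say so.
Cite sources using [Source N, Page M].

{context}

{question}"""
)

def format_docs(docs):
    """Render retrieved Documents as numbered, page-tagged text."""
    return "\n\n".join(
        f"[Source {i}, Page {d.metadata.get('page', '?')}]\n{d.page_content}"
        for i, d in enumerate(docs, start=1)
    )

def build_chain(vector_store, k: int = 4):
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # any chat model works
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | PROMPT
        | llm
        | StrOutputParser()
    )
```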

Let me walk through that LCEL chain piece by piece, because the pipe syntax can look dense if you have not used it before.

The chain input is a string — the user's question. The first step creates a dictionary with two keys. context pipes the question through the retriever, which returns matching Document objects, and then through format_docs to convert them to readable text with page numbers. question uses RunnablePassthrough() to forward the original question string unchanged.

That dictionary feeds into the prompt template, filling {context} and {question}. The completed prompt goes to the LLM, and StrOutputParser() extracts the text. The entire pipeline runs with one call: chain.invoke("What is the deadline?").

Testing the Chain Before Building the UI

Before touching Streamlit, I always test the RAG chain in a plain Python script. This isolates problems: if the answers are wrong, the bug is in the chain or the data — not the UI.

test_chain.py — Verify the pipeline end-to-end
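A minimal smoke test, assuming ingest.py and chain.py expose the helpers described above (names are illustrative):

```python
# test_chain.py — verify retrieval and generation separately
from ingest import load_and_split, build_vector_store
from chain import build_chain

chunks = load_and_split("report.pdf")
store = build_vector_store(chunks)
chain = build_chain(store)

question = "What is the project deadline?"

# 1. Inspect retrieval on its own
for doc in store.as_retriever(search_kwargs={"k": 4}).invoke(question):
    print(f"--- page {doc.metadata.get('page')} ---")
    print(doc.page_content[:200])

# 2. Then check the full answer
print(chain.invoke(question))
```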

Run this with python test_chain.py. Check two things. First, are the retrieved chunks actually relevant to the question? If not, adjust chunk_size or k. Second, does the answer correctly use information from those chunks? If it hallucinates facts not in the chunks, strengthen the grounding instruction.

Weak grounding prompt
prompt = "Answer the question using the context.\n{context}\n{question}"
Strong grounding prompt
prompt = """Answer ONLY from the context below.
If the context lacks the answer, say so.
Cite sources using [Source N, Page M].

{context}

{question}"""
Exercise 1: Format Retrieved Documents for the Prompt
Write Code

Write a function format_sources(docs) that takes a list of document dictionaries (each with "text" and "page" keys) and returns a single formatted string. Each document should appear as [Source N, Page P] followed by the text on the next line, separated by blank lines.

For example, given [{"text": "Revenue grew 20%.", "page": 3}, {"text": "Costs fell 5%.", "page": 7}], the output should be:

[Source 1, Page 3]
Revenue grew 20%.

[Source 2, Page 7]
Costs fell 5%.
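One possible reference solution (try it yourself before reading):

```python
def format_sources(docs):
    """Format documents as [Source N, Page P] headers followed by the text."""
    blocks = [
        f"[Source {i}, Page {d['page']}]\n{d['text']}"
        for i, d in enumerate(docs, start=1)
    ]
    return "\n\n".join(blocks)
```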

Conversational Memory for Follow-Up Questions

The chain we built answers isolated questions, but real users ask follow-ups: "What were the Q3 results?" then "How did that compare to Q2?" The second question makes no sense without the context of the first. We need conversation memory.

If you worked through the LangChain memory tutorial, this approach will feel familiar. We store the conversation history and, before each retrieval, use the LLM to rephrase follow-ups into standalone questions. "How did that compare to Q2?" becomes "How did the Q3 results compare to Q2?" — now the retriever finds the right chunks.

chain.py — Add conversation-aware retrieval
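One way to implement the condense step (a sketch; the history format and prompt wording are my choices):

```python
# chain.py — rephrase follow-ups into standalone questions before retrieval
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

CONDENSE_PROMPT = ChatPromptTemplate.from_template(
    """Given the conversation below, rewrite the follow-up question
as a standalone question that needs no prior context.

Chat history:
{chat_history}

Follow-up question: {question}
Standalone question:"""
)

def condense_question(llm, chat_history, question: str) -> str:
    """chat_history is a list of (role, message) tuples; empty means no rewrite needed."""
    if not chat_history:
        return question
    history_text = "\n".join(f"{role}: {msg}" for role, msg in chat_history)
    chain = CONDENSE_PROMPT | llm | StrOutputParser()
    return chain.invoke({"chat_history": history_text, "question": question})
```

Feed the condensed question to the retriever, but show the user's original wording in the chat transcript.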

The condense step is the trick that makes follow-up questions work. Without it, searching for "How did that compare?" would return random chunks because the retriever has no idea what "that" refers to. With the condense step, the retriever searches for the right topic every time.

Building the Streamlit Interface

Streamlit is the fastest way to build a chat interface for a Python application. Here is the mental model if you have not used it: you write a normal Python script, and Streamlit re-runs it top to bottom on every user interaction. State that needs to survive across reruns lives in st.session_state.

app.py — Complete Streamlit application
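The upload-and-ingest half, as a sketch (it assumes ingest.py and chain.py expose load_and_split(), build_vector_store(), and build_chain() helpers; those names are illustrative):

```python
# app.py — sidebar: upload and ingest
import tempfile
import streamlit as st
from ingest import load_and_split, build_vector_store
from chain import build_chain

st.title("Chat with your documents")

with st.sidebar:
    uploaded = st.file_uploader("Upload a PDF", type="pdf")
    if uploaded and "vector_store" not in st.session_state:
        with st.spinner("Ingesting..."):
            # PyPDFLoader needs a real file path, not an in-memory upload
            with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                tmp.write(uploaded.getvalue())
            chunks = load_and_split(tmp.name)
            st.session_state.vector_store = build_vector_store(chunks)
            st.session_state.chain = build_chain(st.session_state.vector_store)
        st.success(f"Indexed {len(chunks)} chunks")
```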

The sidebar handles file upload and ingestion. When the user uploads a PDF, we save it to a temporary file because PyPDFLoader expects a file path, not a file object. Then we run the ingestion pipeline and store the vector store in session state. The if "vector_store" not in st.session_state guard prevents re-ingesting on every rerun.

app.py — Chat interface with message history
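The chat half, continuing the same script (a sketch; it assumes the chain and vector store are already in session state from the sidebar):

```python
# app.py — chat loop with message history and source citations
if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:                  # replay history on every rerun
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if question := st.chat_input("Ask about your document"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        answer = st.session_state.chain.invoke(question)
        st.write(answer)
        with st.expander("Sources"):
            retriever = st.session_state.vector_store.as_retriever()
            for doc in retriever.invoke(question):
                st.caption(f"Page {doc.metadata.get('page')}: {doc.page_content[:200]}")
    st.session_state.messages.append({"role": "assistant", "content": answer})
```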

Streamlit's st.chat_message and st.chat_input give you a ChatGPT-style layout with zero CSS. Messages persist in st.session_state["messages"]. The source documents appear in a collapsible expander below each answer, so the user can verify the citations.

Exercise 2: Simulate Text Chunking with Overlap
Write Code

Write a function chunk_text(text, chunk_size, overlap) that splits a string into overlapping chunks. Each chunk should be exactly chunk_size characters long (except possibly the last one), and consecutive chunks should overlap by overlap characters.

For example, chunk_text("abcdefghij", 5, 2) should return ["abcde", "defgh", "ghij"].

The step between chunks is chunk_size - overlap. Return an empty list if text is empty.

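One possible reference solution:

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of chunk_size characters, consecutive
    chunks sharing `overlap` characters."""
    if not text:
        return []
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):  # last chunk reached the end
            break
        i += chunk_size - overlap
    return chunks
```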

Improving Retrieval Quality

The basic pipeline works, but retrieval quality makes or breaks the user experience. If the retriever returns irrelevant chunks, the LLM either hallucinates or says "I don't know" when the answer is actually in the document. Here are three techniques I reach for when the initial results are not good enough.

Technique 1: MMR — Maximum Marginal Relevance

Standard similarity search can return four chunks that all say roughly the same thing — four paragraphs about revenue when the answer also needs cost data from a different section. MMR (Maximum Marginal Relevance) balances relevance with diversity. It picks the first chunk by relevance, then penalizes subsequent chunks that are too similar to ones already selected.

Switch from similarity to MMR search
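Switching the retriever to MMR is one argument plus tuning knobs (the specific values here are starting points, not rules):

```python
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,             # chunks returned to the prompt
        "fetch_k": 20,      # candidate pool MMR re-ranks from
        "lambda_mult": 0.7, # 1.0 = pure relevance, 0.0 = pure diversity
    },
)
```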

The lambda_mult parameter controls the trade-off. At 1.0 it behaves like standard similarity search. At 0.0 it maximizes diversity (even at the cost of relevance). I typically start at 0.7 and adjust based on the types of questions users ask.

Technique 2: Metadata Filtering

If the user uploads multiple documents, you do not want a question about "Document A" to return chunks from "Document B". ChromaDB supports metadata filters that constrain the search to specific documents, page ranges, or any metadata you attached during ingestion.

Filter retrieval by source document
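A sketch of a filtered retriever; it assumes each chunk's metadata carries a 'source' key set during ingestion (the file name here is illustrative):

```python
# restrict search to one document via Chroma's metadata filter
retriever = vector_store.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "document_a.pdf"}},
)
```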

Technique 3: Adjusting Chunk Size

There is no universally optimal chunk size. Dense technical documents — legal contracts, academic papers — benefit from smaller chunks (500-800 characters) that contain a single precise fact. Narrative documents like reports work better with larger chunks (1000-1500 characters) that preserve paragraph-level context.

Experiment with different chunk sizes
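A quick way to compare settings, assuming a load_and_split() helper that accepts chunk_size and chunk_overlap parameters:

```python
# keep overlap at ~20% of chunk_size while experimenting
for size in (500, 800, 1200):
    chunks = load_and_split("report.pdf", chunk_size=size, chunk_overlap=int(size * 0.2))
    avg = sum(len(c.page_content) for c in chunks) // len(chunks)
    print(f"chunk_size={size}: {len(chunks)} chunks, avg {avg} chars")
```

Then re-run your test questions against each index and compare which setting retrieves the most relevant chunks.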

Handling Multiple Documents

A real-world app needs to handle more than one PDF. The change is simple: ingest each document into the same ChromaDB collection, and the retriever searches across all of them. The source metadata tracks which chunk came from which file.

Support multiple file uploads in app.py
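The multi-file version of the sidebar, as a sketch (it assumes the same load_and_split() and build_vector_store() helpers):

```python
import tempfile
import streamlit as st
from ingest import load_and_split, build_vector_store

uploaded_files = st.file_uploader("Upload PDFs", type="pdf", accept_multiple_files=True)
if uploaded_files and "vector_store" not in st.session_state:
    all_chunks = []
    for uploaded in uploaded_files:
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            tmp.write(uploaded.getvalue())
        chunks = load_and_split(tmp.name)
        for chunk in chunks:
            chunk.metadata["source"] = uploaded.name  # track the origin file
        all_chunks.extend(chunks)
    st.session_state.vector_store = build_vector_store(all_chunks)
```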

The only structural change is the loop over uploaded_files and collecting all chunks into one list before creating the vector store. I have used this exact pattern in a project with 50+ internal policy documents. ChromaDB handles the rest — the source metadata in each returned chunk tells you which PDF it came from.

Common Mistakes and How to Fix Them

I have built several document QA systems, and these are the issues that trip people up most often. Knowing them upfront saves you hours of debugging.

Mistake 1: Chunks too large — the LLM ignores relevant details.

If your chunks are 3,000+ characters, the LLM receives a wall of text and might miss the specific sentence that answers the question. The fix: reduce chunk_size to 800-1,000 characters and test again.

Mistake 2: No overlap — facts split across chunk boundaries.

A chunk boundary might fall right in the middle of "the deadline is March 15, 2025." Without overlap, neither chunk contains the complete fact. Set chunk_overlap to at least 15-20% of chunk_size.

Mistake 3: Embedding model mismatch between ingestion and query.

If you embed documents with text-embedding-3-small but load the vector store with text-embedding-ada-002, the vectors live in different spaces. Similarity search returns garbage. Always use the same model for both.

Bug: mismatched embedding models
# Ingestion
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query — WRONG model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
Chroma(persist_directory="./db", embedding_function=embeddings)
Fix: same model everywhere
# Use a constant
EMBEDDING_MODEL = "text-embedding-3-small"

# Ingestion
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query — same model
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
Chroma(persist_directory="./db", embedding_function=embeddings)

Mistake 4: Session state not checked — re-ingesting on every Streamlit rerun.

Streamlit reruns your entire script on every interaction. Without an if "vector_store" not in st.session_state guard, you re-embed the document on every message. That is both slow and expensive.

Mistake 5: Not testing retrieval separately from generation.

When answers are wrong, you need to know whether retrieval failed (wrong chunks) or generation failed (right chunks, bad answer). Always inspect retriever.invoke(question) before blaming the LLM.

Exercise 3: Compute Cosine Similarity Between Vectors
Write Code

Vector search works by computing the similarity between the question vector and each document vector. Write a function cosine_similarity(vec_a, vec_b) that computes the cosine similarity between two lists of numbers.

Cosine similarity = (A . B) / (|A| * |B|)

where A . B is the dot product and |A| is the magnitude (square root of sum of squares).

Return the result rounded to 4 decimal places. If either vector has zero magnitude, return 0.0.

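One possible reference solution:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length vectors, rounded to 4 places."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    mag_a = math.sqrt(sum(a * a for a in vec_a))
    mag_b = math.sqrt(sum(b * b for b in vec_b))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return round(dot / (mag_a * mag_b), 4)
```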

What Would Be Different in Production

The app we built is fully functional for personal use or small teams. Production deployment introduces a different set of concerns. Here is a brief map of what changes and why — each of these could be its own tutorial.

Vector store. ChromaDB is excellent for local development and prototyping. For production with millions of documents, consider Pinecone (managed, scales automatically), Weaviate (open-source, hybrid search), or Qdrant (Rust-based, fast). The LangChain code changes by two lines — swap the import and constructor.

Embedding costs. OpenAI's embedding API charges per token. For high-volume apps, consider open-source models like sentence-transformers/all-MiniLM-L6-v2 running locally. They are free, fast, and produce competitive results for most document QA tasks.

Authentication and file storage. In production, users upload through an authenticated interface. Files live in cloud storage (S3, GCS) rather than temp directories. The ingestion pipeline stays the same — only the file loading adapter changes.

Streaming responses. Stream the LLM's response word by word instead of waiting for the complete answer. LangChain supports this with .stream() instead of .invoke(), and Streamlit has st.write_stream() for displaying streamed text.

Streaming the response in Streamlit
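A sketch of the streaming variant of the chat loop (requires Streamlit 1.31+ for st.write_stream; assumes the chain lives in session state):

```python
if question := st.chat_input("Ask about your document"):
    with st.chat_message("assistant"):
        # chain.stream() yields the answer in pieces; write_stream renders
        # them as they arrive and returns the full concatenated string
        answer = st.write_stream(st.session_state.chain.stream(question))
```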

Complete Code

Here is the entire application as three files you can copy into your project folder and run immediately.

ingest.py — Complete document ingestion module
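A consolidated sketch of the ingestion module described in the earlier sections (model name, paths, and defaults are my choices):

```python
# ingest.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

EMBEDDING_MODEL = "text-embedding-3-small"  # must match between ingestion and query

def load_and_split(pdf_path, chunk_size=1000, chunk_overlap=150):
    """Load a PDF and split it into overlapping, metadata-tagged chunks."""
    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(pages)

def build_vector_store(chunks, persist_directory="./chroma_db"):
    """Embed chunks and persist them in a local Chroma database."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    return Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)

def load_vector_store(persist_directory="./chroma_db"):
    """Reopen a persisted index without re-embedding."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    return Chroma(persist_directory=persist_directory, embedding_function=embeddings)
```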
chain.py — Complete RAG chain module
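A consolidated sketch of the chain module (chat model name and k are assumptions):

```python
# chain.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

PROMPT = ChatPromptTemplate.from_template(
    """Answer ONLY from the context below.
If the context lacks the answer, say so.
Cite sources using [Source N, Page M].

{context}

{question}"""
)

def format_docs(docs):
    """Render retrieved Documents as numbered, page-tagged text."""
    return "\n\n".join(
        f"[Source {i}, Page {d.metadata.get('page', '?')}]\n{d.page_content}"
        for i, d in enumerate(docs, start=1)
    )

def build_chain(vector_store, k: int = 4):
    """Wire retriever, prompt, and LLM into one invokable pipeline."""
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | PROMPT
        | llm
        | StrOutputParser()
    )
```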
app.py — Complete Streamlit application
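A consolidated sketch of the Streamlit app, assuming the two modules above expose the helper names used here:

```python
# app.py
import tempfile
import streamlit as st
from ingest import load_and_split, build_vector_store
from chain import build_chain

st.title("Chat with your documents")

with st.sidebar:
    uploaded = st.file_uploader("Upload a PDF", type="pdf")
    if uploaded and "vector_store" not in st.session_state:
        with st.spinner("Ingesting..."):
            with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                tmp.write(uploaded.getvalue())
            chunks = load_and_split(tmp.name)
            st.session_state.vector_store = build_vector_store(chunks)
            st.session_state.chain = build_chain(st.session_state.vector_store)
        st.success(f"Indexed {len(chunks)} chunks")

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if question := st.chat_input("Ask about your document"):
    if "chain" not in st.session_state:
        st.warning("Upload a document first.")
        st.stop()
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        answer = st.session_state.chain.invoke(question)
        st.write(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})
```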

Frequently Asked Questions

Can I use this with non-PDF documents like Word, HTML, or Markdown?

Yes. LangChain provides loaders for dozens of formats: Docx2txtLoader for Word, UnstructuredHTMLLoader for HTML, UnstructuredMarkdownLoader for Markdown. Swap the loader in load_and_split() — the rest of the pipeline stays identical.

Swap loader for different file types
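One way to pick a loader by file extension (a sketch; the mapping is easy to extend):

```python
from pathlib import Path
from langchain_community.document_loaders import (
    PyPDFLoader,
    Docx2txtLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
)

LOADERS = {
    ".pdf": PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".html": UnstructuredHTMLLoader,
    ".md": UnstructuredMarkdownLoader,
}

def get_loader(path: str):
    """Pick the right loader class from the file extension."""
    suffix = Path(path).suffix.lower()
    return LOADERS[suffix](path)
```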

How do I handle scanned PDFs (images, no selectable text)?

PyPDFLoader only extracts selectable text. For scanned documents, you need OCR. Use UnstructuredPDFLoader with mode="elements" and strategy="ocr_only" — it runs Tesseract OCR under the hood. Install the Python side with pip install "unstructured[pdf]"; the Tesseract engine itself is a system package (for example, apt-get install tesseract-ocr on Debian/Ubuntu or brew install tesseract on macOS), not something pip can install.
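In code, the OCR path looks like this (a sketch; the file name is illustrative and OCR is noticeably slower than plain text extraction):

```python
from langchain_community.document_loaders import UnstructuredPDFLoader

# OCR every page; requires unstructured[pdf] plus a system Tesseract install
docs = UnstructuredPDFLoader("scanned.pdf", mode="elements", strategy="ocr_only").load()
```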

How many documents can ChromaDB handle?

ChromaDB stores vectors locally and handles hundreds of thousands of chunks on a single machine. For millions of vectors or multi-user production, migrate to Pinecone, Weaviate, or Qdrant. The LangChain code changes minimally — swap the vector store class and pass connection credentials.

Can I use a free/open-source LLM instead of OpenAI?

Absolutely. Replace ChatOpenAI with ChatOllama for local models or ChatAnthropic for Claude. You will also need to swap the embedding model. HuggingFaceEmbeddings with sentence-transformers/all-MiniLM-L6-v2 is a popular free alternative.

Use Ollama for a fully local setup
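A sketch of the fully local swap (assumes Ollama is running locally with the llama3 model pulled, and sentence-transformers installed):

```python
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings

llm = ChatOllama(model="llama3", temperature=0)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# build the vector store and chain exactly as before, passing these in
```

Remember the embedding-model rule from the Common Mistakes section: if you switch embeddings, you must re-ingest your documents.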

What if the LLM still hallucinates despite the grounding prompt?

Three things help. First, set temperature=0 for deterministic outputs. Second, strengthen the prompt — explicitly tell it to say "I don't know" rather than guess. Third, add a verification step that checks if the answer contains text from the retrieved chunks. If it does not, flag it as potentially hallucinated.
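For the third idea, here is a naive containment check as an illustration (a sketch I use for rough flagging; production verification usually needs proper overlap or entailment scoring, and the function name is my own):

```python
def looks_grounded(answer: str, chunk_texts: list[str], min_overlap: int = 20) -> bool:
    """Naive check: does the answer share at least one reasonably long
    substring with any retrieved chunk? If not, flag it for review."""
    answer_lower = answer.lower()
    for text in chunk_texts:
        text_lower = text.lower()
        # slide a window over the chunk and look for it in the answer
        for start in range(0, max(1, len(text_lower) - min_overlap)):
            if text_lower[start:start + min_overlap] in answer_lower:
                return True
    return False
```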

References

  • LangChain Documentation — Retrieval-Augmented Generation
  • ChromaDB Documentation — Getting Started
  • OpenAI Embeddings Guide
  • Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. arXiv:2005.11401
  • Streamlit Documentation — Chat Elements
  • LangChain Documentation — Text Splitters
  • Carbonell, J. & Goldstein, J. (1998). "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries." SIGIR 1998.