LangChain Document Loaders: Ingest PDFs, Web Pages, CSV, Notion, and GitHub
You have your LangChain chain working with hardcoded strings. Now your boss drops a folder on your desk: 47 PDFs, a Notion workspace, three CSV exports, and a GitHub repo. "Make the AI answer questions about all of this." The gap between a working prototype and a production-ready ingestion pipeline is document loaders -- and getting them right determines whether your RAG system returns brilliant answers or hallucinated garbage.
What Are Document Loaders and Why Do They Matter?
A document loader is a LangChain class that reads data from a source — a file, a URL, an API — and converts it into a standardised Document object. Every loader, regardless of format, produces the same output: a list of Document objects, each containing page_content (the text) and metadata (source information like file name, page number, or URL).
I think of loaders as the plumbing of any LLM application. Nobody brags about plumbing, but if it leaks, everything downstream fails. Bad ingestion means your text splitter gets garbled input, your embeddings encode noise, and your retriever returns irrelevant chunks. More of the RAG quality issues I have debugged traced back to a loader problem than to any model or prompt issue.
Running this on a simple text file produces something like:
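For instance, with TextLoader pointed at a local text file (the file name notes.txt here is a placeholder):

```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader("notes.txt")  # hypothetical local file
docs = loader.load()

print(docs[0].page_content[:50])  # the raw text of the file
print(docs[0].metadata)           # {'source': 'notes.txt'}
```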
That metadata dictionary is quietly important. When your RAG system retrieves a chunk later, the metadata tells the user where the answer came from -- which PDF, which page, which URL. Without metadata, your system gives answers but cannot cite its sources.
Before we dive into specific loaders, here is what you need installed. LangChain splits loaders across packages so you only install what you actually use:
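A typical install for the loaders covered in this tutorial looks like this (exact extras vary by langchain version, so treat the pins below as a starting point):

```shell
pip install langchain langchain-community
pip install pypdf                 # PyPDFLoader
pip install "unstructured[pdf]"   # UnstructuredPDFLoader
pip install beautifulsoup4        # WebBaseLoader and friends
```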
Loading PDFs — The Most Common Starting Point
PDFs are the format I encounter most in real projects. Contracts, research papers, financial reports, internal documentation — they all live in PDF. LangChain gives you two primary loaders, and choosing the right one depends on the PDF's complexity.
PyPDFLoader — Fast and Simple
PyPDFLoader uses the pypdf library under the hood. It's fast, reliable, and creates one Document per page — which means the page number lands in your metadata automatically. For standard text-heavy PDFs (contracts, reports, articles), this is the loader I reach for first.
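A minimal sketch, assuming a local report.pdf:

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("report.pdf")  # hypothetical file
docs = loader.load()

print(len(docs))         # one Document per page
print(docs[0].metadata)  # {'source': 'report.pdf', 'page': 0}
```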
Notice the page key in metadata starts at 0. When you display citations to users later, you will want to add 1: f"Source: page {doc.metadata['page'] + 1}".
UnstructuredPDFLoader — For Complex Layouts
When PDFs have tables, multi-column layouts, headers/footers, or embedded images with captions, PyPDFLoader often mashes everything together. UnstructuredPDFLoader uses the unstructured library which does layout analysis — it identifies headings, tables, and list items as separate elements.
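A sketch of element-mode loading (mode="elements" asks unstructured to return one Document per layout element; the metadata key names come from the unstructured library):

```python
from langchain_community.document_loaders import UnstructuredPDFLoader

# mode="elements" returns one Document per layout element
# (Title, NarrativeText, Table, ...) instead of one text blob
loader = UnstructuredPDFLoader("report.pdf", mode="elements")
docs = loader.load()

for doc in docs[:5]:
    print(doc.metadata.get("category"), "->", doc.page_content[:40])
```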
Loading a Directory of PDFs
Real projects rarely involve a single file. DirectoryLoader walks a directory and applies a loader to every matching file. The glob parameter controls which files it picks up.
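A sketch for batch-loading a folder of PDFs (the contracts/ path is a placeholder; show_progress requires the tqdm package):

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "contracts/",            # hypothetical directory
    glob="**/*.pdf",         # recursive: every PDF in every subfolder
    loader_cls=PyPDFLoader,  # applied to each matching file
    show_progress=True,      # needs tqdm installed
)
docs = loader.load()
```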
Loading Web Pages — Turning URLs into Documents
Sometimes the knowledge your LLM needs lives on a website, not in a file. LangChain's web loaders fetch a URL, strip the HTML boilerplate, and give you clean text. The difference between a good and bad web loader is how much navigation, ads, and cookie banners end up in your document.
WebBaseLoader — Quick and Lightweight
WebBaseLoader uses requests and BeautifulSoup to fetch and parse HTML. You can pass it a single URL or a list. For blog posts and documentation pages with clean HTML, it works well.
Without filtering, WebBaseLoader grabs everything — navigation menus, footers, sidebars. The bs_kwargs parameter lets you pass a SoupStrainer from BeautifulSoup, which restricts parsing to specific HTML elements. This dramatically cleans up the output.
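A sketch of filtered loading. The URL and the post-content class name are assumptions; inspect the target page's HTML to find the element that actually wraps the article body:

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    "https://example.com/blog/post",  # placeholder URL
    # Only parse elements with this class; skips nav, footer, sidebar
    bs_kwargs={"parse_only": bs4.SoupStrainer(class_="post-content")},
)
docs = loader.load()
```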
RecursiveUrlLoader — Crawling Multiple Pages
When you need an entire documentation site or a multi-page knowledge base, RecursiveUrlLoader follows links from a starting URL up to a specified depth. I use this when a client says "ingest our entire docs site" — it handles the crawling so I don't have to write a custom scraper.
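A sketch, assuming a placeholder docs site. The extractor callback converts each fetched page's raw HTML to text; without it you get HTML markup in page_content:

```python
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    url="https://docs.example.com/",  # placeholder starting point
    max_depth=2,                      # follow links two hops deep
    extractor=lambda html: BeautifulSoup(html, "html.parser").get_text(),
)
docs = loader.load()
```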
Loading CSVs and Structured Data
CSVs are the second most common format I ingest after PDFs -- customer records, FAQ databases, product catalogs. CSVLoader reads each row as a separate Document, converting column names and values into a readable key-value string. You pass the file path and optional csv_args (delimiter, quote character) to handle non-standard formats.
Each CSV row becomes a separate Document. The page_content is a key-value text representation of the row — not the raw comma-separated line. This is exactly what you want for RAG: when the retriever finds this document, it returns a self-contained description of one record.
You can control which columns become the document content and which become metadata. This is critical for filtering later — if plan is in metadata, you can retrieve only enterprise customers without relying on semantic search.
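A sketch, assuming a customers.csv with name, notes, plan, and region columns; the content_columns and metadata_columns parameters exist in recent langchain_community releases:

```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    "customers.csv",                      # hypothetical file
    content_columns=["name", "notes"],    # what goes into page_content
    metadata_columns=["plan", "region"],  # available for filtered retrieval
    csv_args={"delimiter": ","},          # tweak for non-standard formats
)
docs = loader.load()
```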
Third-Party Loaders — Notion and GitHub
LangChain's real power shows when you move beyond files on disk. The langchain_community package includes loaders for dozens of third-party services. I will focus on two that come up constantly in enterprise projects: Notion (for internal knowledge bases) and GitHub (for codebases and documentation).
NotionDirectoryLoader — Internal Knowledge Bases
Notion is where many teams keep their documentation, meeting notes, and product specs. To load Notion content, export your workspace as Markdown files (Notion Settings > Export > Markdown & CSV), then point the loader at the exported directory. Each Markdown file becomes a Document with the file path as metadata.
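A sketch, assuming the unzipped Markdown export lives in a Notion_DB/ folder:

```python
from langchain_community.document_loaders import NotionDirectoryLoader

# Point at the unzipped Markdown export from Notion
loader = NotionDirectoryLoader("Notion_DB/")
docs = loader.load()

# metadata["source"] is the exported file path,
# e.g. Notion_DB/Engineering/Deployment.md
print(docs[0].metadata)
```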
The metadata includes the file path, which preserves the Notion page hierarchy. This is valuable context — when a user asks about deployment, you can cite the exact Notion page.
GitHubIssuesLoader — Codebases and Repo Documentation
GitHubIssuesLoader pulls issues and pull requests from a GitHub repository via the API. Each issue becomes a Document with the body text as page_content and metadata fields for title, labels, state, and author. You need a GitHub personal access token.
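A sketch, assuming a placeholder repo and a token in the GITHUB_TOKEN environment variable:

```python
import os

from langchain_community.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(
    repo="your-org/your-repo",              # placeholder owner/name
    access_token=os.environ["GITHUB_TOKEN"],
    include_prs=False,  # issues only, skip pull requests
    state="all",        # open and closed
)
docs = loader.load()

print(docs[0].metadata["title"], docs[0].metadata["state"])
```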
For loading actual source code files from a repo, you can clone the repository locally and use DirectoryLoader with a TextLoader. This approach gives you more control over which files to include and handles binary files gracefully.
Building a Unified Ingestion Function
In production, you rarely use just one loader. A typical RAG pipeline ingests PDFs, web pages, and CSV files from the same source. Writing a function that detects the file type and picks the right loader eliminates repetitive code.
The function below uses three strategies based on the input. First, it checks if the source is a URL and uses WebBaseLoader. Second, if the source is a directory, it iterates over a LOADER_MAP dictionary (mapping file extensions to loader classes) and uses DirectoryLoader for each extension. Third, for a single file, it looks up the extension in LOADER_MAP and instantiates the correct loader.
The caller passes a path string or URL. The function handles the rest:
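A sketch of that dispatch function. The supported extensions are an assumption; extend LOADER_MAP to match your corpus:

```python
from pathlib import Path

from langchain_community.document_loaders import (
    CSVLoader,
    DirectoryLoader,
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,
)

# Map file extensions to loader classes; extend as needed
LOADER_MAP = {
    ".pdf": PyPDFLoader,
    ".csv": CSVLoader,
    ".txt": TextLoader,
    ".md": TextLoader,
}

def load_source(source: str):
    """Dispatch to the right loader based on the source string."""
    # Strategy 1: URLs go to the web loader
    if source.startswith(("http://", "https://")):
        return WebBaseLoader(source).load()
    path = Path(source)
    # Strategy 2: directories get one DirectoryLoader per extension
    if path.is_dir():
        docs = []
        for ext, loader_cls in LOADER_MAP.items():
            loader = DirectoryLoader(
                str(path), glob=f"**/*{ext}",
                loader_cls=loader_cls, silent_errors=True,
            )
            docs.extend(loader.load())
        return docs
    # Strategy 3: single files get a direct extension lookup
    loader_cls = LOADER_MAP.get(path.suffix.lower())
    if loader_cls is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return loader_cls(str(path)).load()
```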
LangChain's Document objects always have page_content and metadata. Write a function summarize_documents(docs) that takes a list of dictionaries (each with keys page_content and metadata) and returns a dictionary with three keys:
"total_docs": the number of documents"total_chars": the sum of all page_content lengths"sources": a sorted list of unique source values from metadata["source"]Print the result.
Metadata Enrichment — Adding Context Loaders Miss
Raw loader metadata is minimal -- usually just source and maybe page or row. For a production RAG system, you want richer metadata: document category, department, access level. This metadata powers filtered retrieval later: "Find answers only from Q3 reports" or "Search only engineering docs."
The enrich_metadata function below takes a batch of loaded documents and stamps each one with a category, department, loaded_at timestamp, and a char_count field. It mutates the documents in place and returns them so you can chain calls. The char_count is useful later for filtering out tiny extraction artifacts before you embed them.
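A minimal sketch. It works on anything with a page_content string and a metadata dict, which is exactly LangChain's Document shape (the demo uses SimpleNamespace as a stand-in so it runs without langchain installed):

```python
from datetime import datetime, timezone
from types import SimpleNamespace

def enrich_metadata(docs, category, department):
    """Stamp each document with context the loader cannot know."""
    for doc in docs:
        doc.metadata["category"] = category
        doc.metadata["department"] = department
        doc.metadata["loaded_at"] = datetime.now(timezone.utc).isoformat()
        doc.metadata["char_count"] = len(doc.page_content)
    return docs  # returned so calls can be chained

# Demo with a Document-shaped stand-in
doc = SimpleNamespace(page_content="Q3 revenue grew 12%.",
                      metadata={"source": "q3.pdf"})
enrich_metadata([doc], category="finance", department="accounting")
print(doc.metadata["category"], doc.metadata["char_count"])
```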
That char_count field is more useful than it looks. When you split documents into chunks later, tiny fragments (under 50 characters) are usually headers, page numbers, or extraction artifacts. Filtering them out before embedding saves cost and improves retrieval quality.
Error Handling and Edge Cases
Loaders fail. Files are corrupted, URLs return 404, CSVs have encoding issues. In a batch ingestion pipeline, you do not want one bad file to crash the entire run.
The safe_load wrapper below wraps load_source in a try/except that catches FileNotFoundError, UnicodeDecodeError, and a generic fallback. Instead of raising, it returns an empty list so batch processing continues. It also filters out documents shorter than 20 characters -- those are usually headers, page numbers, or extraction artifacts that add noise to your vector store.
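A sketch of that wrapper. Here the loader function is passed in as a parameter (in the pipeline it would be load_source from earlier) so the error handling is easy to exercise; the stub loader below exists only for the demo:

```python
from types import SimpleNamespace

def safe_load(source, load_fn, min_chars=20):
    """Load one source without letting a bad file kill the batch."""
    try:
        docs = load_fn(source)
    except FileNotFoundError:
        print(f"Missing file, skipping: {source}")
        return []
    except UnicodeDecodeError:
        print(f"Encoding problem, skipping: {source}")
        return []
    except Exception as exc:  # generic fallback
        print(f"Failed to load {source}: {exc}")
        return []
    # Drop near-empty docs: headers, page numbers, extraction artifacts
    return [d for d in docs if len(d.page_content.strip()) >= min_chars]

# Demo stub standing in for load_source
def flaky_loader(source):
    if source == "missing.pdf":
        raise FileNotFoundError(source)
    return [
        SimpleNamespace(page_content="A real paragraph of content here.",
                        metadata={"source": source}),
        SimpleNamespace(page_content="p. 4", metadata={"source": source}),
    ]

print(len(safe_load("good.pdf", flaky_loader)))     # 1: "p. 4" is dropped
print(len(safe_load("missing.pdf", flaky_loader)))  # 0: error swallowed
```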
The key design choice is returning an empty list instead of raising, so one corrupt file never aborts the whole batch run.
Write a function get_loader_name(file_path) that takes a file path string and returns the name of the appropriate loader as a string. The mapping is:
.pdf -> "PyPDFLoader".csv -> "CSVLoader".txt -> "TextLoader".md -> "TextLoader".html -> "BSHTMLLoader""Unsupported"The function should be case-insensitive for extensions (.PDF and .pdf both work). Print the result for each test case.
Real-World Example: Building a Multi-Source Knowledge Base
Here is a pattern I use for production RAG pipelines: a YAML-driven config that maps source types to paths and metadata. Each source entry specifies the type (pdf, csv, web), the file path or URL list, and any custom metadata tags. The ingest_knowledge_base function iterates through configs, calls safe_load for each source, enriches metadata, and tracks success/failure stats.
The implementation uses a SourceConfig dataclass to hold source type, path, URL list, and metadata tags. The function loops through each config, resolves the source (path or URLs), calls safe_load, then enrich_metadata to stamp custom fields. A stats dictionary tracks how many documents loaded and how many sources failed -- essential visibility for debugging production pipelines.
This pattern separates configuration from code. Adding a new source means adding a SourceConfig entry -- no loader code changes. In production, I load these configs from a YAML file so non-engineers can add new data sources without touching Python.
Common Mistakes and How to Fix Them
After building document ingestion across many projects, these are the mistakes I see repeatedly.
Mistake 1: Loading an entire directory blindly.

```python
# Wrong: loads every file, including images, binaries, and .git internals
loader = DirectoryLoader("repo/", glob="**/*")
docs = loader.load()  # crashes or loads garbage
```

Only load what your pipeline can actually process. Note that brace patterns like `**/*.{py,md,txt}` are not expanded by Python's glob, so loop over the extensions instead:

```python
docs = []
for pattern in ("**/*.py", "**/*.md", "**/*.txt"):
    loader = DirectoryLoader(
        "repo/",
        glob=pattern,
        loader_cls=TextLoader,
        silent_errors=True,
    )
    docs.extend(loader.load())
```

Mistake 2: Embedding empty documents.

```python
# Wrong: sends all docs to embeddings, including empty ones
docs = loader.load()
vector_store.add_documents(docs)
```

```python
# Right: remove empty and near-empty documents first
docs = loader.load()
docs = [d for d in docs if len(d.page_content.strip()) > 50]
vector_store.add_documents(docs)
```

Mistake 3: Never checking metadata.

```python
# Wrong: loader returns docs but you never check metadata
docs = loader.load()
# Later: "The AI said X but WHERE did it come from?"
```

```python
# Right: stamp and verify provenance at ingestion time
docs = loader.load()
for doc in docs:
    doc.metadata["ingested_at"] = datetime.now().isoformat()
    assert "source" in doc.metadata, "Missing source!"
```

Write a function deduplicate_docs(docs) that takes a list of document dictionaries (each with page_content and metadata keys) and removes duplicates. Two documents are considered duplicates if they have the same metadata["source"] AND the same page_content. The function should return a new list with duplicates removed, preserving the order of first occurrence. Print the count of documents before and after deduplication.
Loader Quick-Reference Table
LangChain has over 160 loaders. Here are the ones I reach for most, grouped by source type:
| Source Type | Loader | Package | Key Feature |
|---|---|---|---|
| Plain text | TextLoader | langchain_community | Simplest loader, one doc per file |
| PDF (simple) | PyPDFLoader | langchain_community | One doc per page, fast |
| PDF (complex) | UnstructuredPDFLoader | langchain_community | Layout analysis, table extraction |
| CSV | CSVLoader | langchain_community | One doc per row, column control |
| Web page | WebBaseLoader | langchain_community | BeautifulSoup extraction |
| Web (multi-page) | RecursiveUrlLoader | langchain_community | Crawls links to max_depth |
| Notion export | NotionDirectoryLoader | langchain_community | Reads Markdown/CSV exports |
| GitHub issues | GitHubIssuesLoader | langchain_community | Loads issues with labels/metadata |
| JSON | JSONLoader | langchain_community | jq-like path expressions |
| Word (.docx) | Docx2txtLoader | langchain_community | Microsoft Word documents |
| HTML | BSHTMLLoader | langchain_community | Local HTML files |
| Directory | DirectoryLoader | langchain_community | Batch loading with any sub-loader |
Performance and Best Practices
Loading documents is I/O-bound -- your bottleneck is disk reads, network requests, and PDF parsing, not CPU. The practices below have saved me the most time in production.
The first is lazy loading for large corpora: replace loader.load() with a for loop over loader.lazy_load(). Each iteration yields a single Document, which you enrich and push to the vector store immediately, so the whole corpus never sits in memory at once.
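A sketch of that pattern, assuming a loader and vector_store already set up as in the earlier sections:

```python
from datetime import datetime

# Streams documents one at a time instead of materializing the
# whole corpus with loader.load()
for doc in loader.lazy_load():
    doc.metadata["ingested_at"] = datetime.now().isoformat()
    vector_store.add_documents([doc])
```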
The load_with_cache function below avoids re-parsing files that have not changed. It hashes the source path with hashlib.md5 to create a deterministic cache filename, then checks a .doc_cache/ directory for a matching JSON file. On a cache hit, it deserializes Document objects from JSON. On a cache miss, it calls load_source, serializes each document's page_content and metadata to JSON, and writes the cache file for next time.
LangChain also supports async loading via aload() and alazy_load() on every loader. If your ingestion pipeline is already async (common in FastAPI backends), these methods let you load documents without blocking the event loop. The API is identical -- just await the call.
Frequently Asked Questions
How do I load a JSON file with nested structure?
JSONLoader uses jq-style path expressions to navigate nested JSON. The jq_schema parameter tells the loader which field should become page_content. Set text_content=False if the extracted value is a JSON object (not a plain string) so the loader serializes it properly.
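A sketch, assuming a hypothetical tickets.json with a top-level tickets array (JSONLoader also requires the jq Python package to be installed):

```python
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="tickets.json",            # hypothetical file
    jq_schema=".tickets[].description",  # which field becomes page_content
    text_content=True,                   # the extracted value is a string
)
docs = loader.load()
```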
Can I load Google Docs or Google Drive files directly?
Yes. Use GoogleDriveLoader from langchain_community. It requires OAuth credentials and supports Google Docs, Sheets, and Slides. The setup takes about 10 minutes — follow the LangChain Google Drive guide.
What is the difference between load() and load_and_split()?
load() returns whole documents. load_and_split() loads and then runs a text splitter in one call. I prefer calling load() and splitting separately because it gives you a chance to inspect and enrich documents between loading and splitting.
How do I handle password-protected PDFs?
PyPDFLoader accepts a password parameter. Pass the PDF password as a string and it handles decryption before extraction.
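For example (file name and password are placeholders):

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("confidential.pdf", password="s3cret")
docs = loader.load()  # decrypted before text extraction
```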
Do I need LangChain just for loading documents?
No. If document loading is all you need, you can use pypdf, beautifulsoup4, or csv directly. LangChain loaders shine when you are building a full pipeline -- the standardized Document format means loaders, splitters, embeddings, and retrievers all fit together without format conversion.
What's Next — From Raw Documents to a Working RAG Pipeline
Document loading is the first stage of a RAG pipeline. You now have raw Document objects -- but they are too long to embed directly. The next step is text splitting, which breaks documents into chunks sized for your embedding model. From there, you embed the chunks and store them in a vector database.
For the full end-to-end flow -- loading, splitting, embedding, storing, and querying -- see our RAG with LangChain tutorial. If you are debugging chain behavior or tracking token costs across loader calls, LangSmith gives you observability into every step.
Complete Code
The full ingestion pipeline combining everything from this tutorial:
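A condensed sketch assembling the pieces from this tutorial: extension dispatch, error-safe loading, length filtering, and metadata enrichment. The source paths and tags in the `__main__` block are placeholders:

```python
from datetime import datetime, timezone
from pathlib import Path

from langchain_community.document_loaders import (
    CSVLoader, DirectoryLoader, PyPDFLoader, TextLoader, WebBaseLoader,
)

LOADER_MAP = {".pdf": PyPDFLoader, ".csv": CSVLoader,
              ".txt": TextLoader, ".md": TextLoader}

def load_source(source):
    """Dispatch a URL, directory, or file to the right loader."""
    if source.startswith(("http://", "https://")):
        return WebBaseLoader(source).load()
    path = Path(source)
    if path.is_dir():
        docs = []
        for ext, cls in LOADER_MAP.items():
            docs.extend(DirectoryLoader(str(path), glob=f"**/*{ext}",
                                        loader_cls=cls,
                                        silent_errors=True).load())
        return docs
    cls = LOADER_MAP.get(path.suffix.lower())
    if cls is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return cls(str(path)).load()

def safe_load(source, min_chars=20):
    """Never let one bad source crash the batch."""
    try:
        docs = load_source(source)
    except Exception as exc:
        print(f"Failed to load {source}: {exc}")
        return []
    return [d for d in docs if len(d.page_content.strip()) >= min_chars]

def enrich_metadata(docs, **tags):
    """Stamp custom tags, a timestamp, and a char count on each doc."""
    for doc in docs:
        doc.metadata.update(tags)
        doc.metadata["loaded_at"] = datetime.now(timezone.utc).isoformat()
        doc.metadata["char_count"] = len(doc.page_content)
    return docs

if __name__ == "__main__":
    sources = {
        "reports/": {"category": "finance"},             # placeholder dir
        "https://docs.example.com/intro": {"category": "docs"},  # placeholder URL
    }
    all_docs = []
    for source, tags in sources.items():
        all_docs.extend(enrich_metadata(safe_load(source), **tags))
    print(f"Ingested {len(all_docs)} documents")
```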