LangChain Document Loaders: Ingest PDFs, Web Pages, CSV, Notion, and GitHub
You've got your LangChain chain working with hardcoded strings. Now your boss drops a folder on your desk: 47 PDFs, a Notion workspace, three CSV exports, and a GitHub repo. "Make the AI answer questions about all of this." The gap between a working prototype and a production-ready ingestion pipeline is document loaders — and getting them right determines whether your RAG system returns brilliant answers or hallucinated garbage.
What Are Document Loaders and Why Do They Matter?
A document loader is a LangChain class that reads data from a source — a file, a URL, an API — and converts it into a standardised Document object. Every loader, regardless of format, produces the same output: a list of Document objects, each containing page_content (the text) and metadata (source information like file name, page number, or URL).
I think of loaders as the plumbing of any LLM application. Nobody brags about plumbing, but if it leaks, everything downstream fails. Bad ingestion means your text splitter gets garbled input, your embeddings encode noise, and your retriever returns irrelevant chunks. I've debugged more RAG quality issues that traced back to a loader problem than to any model or prompt issue.
Running a loader over a simple text file produces a list of Document objects: the text lands in page_content and the file path in metadata.
That metadata dictionary is quietly important. When your RAG system retrieves a chunk later, the metadata tells the user where the answer came from — which PDF, which page, which URL. Without metadata, your system gives answers but can't cite its sources.
Before we dive into specific loaders, here is what you need installed. LangChain splits loaders across packages so you only install what you actually use:
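A typical install set is sketched below; exact package names and extras shift between LangChain versions, so treat this as a starting point rather than a definitive list:

```shell
pip install langchain langchain-community   # core + community loaders
pip install pypdf                           # PyPDFLoader
pip install "unstructured[pdf]"             # UnstructuredPDFLoader
pip install beautifulsoup4                  # WebBaseLoader, BSHTMLLoader
```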
Loading PDFs — The Most Common Starting Point
PDFs are the format I encounter most in real projects. Contracts, research papers, financial reports, internal documentation — they all live in PDF. LangChain gives you two primary loaders, and choosing the right one depends on the PDF's complexity.
PyPDFLoader — Fast and Simple
PyPDFLoader uses the pypdf library under the hood. It's fast, reliable, and creates one Document per page — which means the page number lands in your metadata automatically. For standard text-heavy PDFs (contracts, reports, articles), this is the loader I reach for first.
Notice the page key in metadata starts at 0. When you display citations to users later, you will want to add 1: f"Source: page {doc.metadata['page'] + 1}".
UnstructuredPDFLoader — For Complex Layouts
When PDFs have tables, multi-column layouts, headers/footers, or embedded images with captions, PyPDFLoader often mashes everything together. UnstructuredPDFLoader uses the unstructured library which does layout analysis — it identifies headings, tables, and list items as separate elements.
Loading a Directory of PDFs
Real projects rarely involve a single file. DirectoryLoader walks a directory and applies a loader to every matching file. The glob parameter controls which files it picks up.
Loading Web Pages — Turning URLs into Documents
Sometimes the knowledge your LLM needs lives on a website, not in a file. LangChain's web loaders fetch a URL, strip the HTML boilerplate, and give you clean text. The difference between a good and bad web loader is how much navigation, ads, and cookie banners end up in your document.
WebBaseLoader — Quick and Lightweight
WebBaseLoader uses requests and BeautifulSoup to fetch and parse HTML. You can pass it a single URL or a list. For blog posts and documentation pages with clean HTML, it works well.
Without filtering, WebBaseLoader grabs everything — navigation menus, footers, sidebars. The bs_kwargs parameter lets you pass a SoupStrainer from BeautifulSoup, which restricts parsing to specific HTML elements. This dramatically cleans up the output.
RecursiveUrlLoader — Crawling Multiple Pages
When you need an entire documentation site or a multi-page knowledge base, RecursiveUrlLoader follows links from a starting URL up to a specified depth. I use this when a client says "ingest our entire docs site" — it handles the crawling so I don't have to write a custom scraper.
Loading CSVs and Structured Data
Each CSV row becomes a separate Document. The page_content is a key-value text representation of the row — not the raw comma-separated line. This is exactly what you want for RAG: when the retriever finds this document, it returns a self-contained description of one record.
You can control which columns become the document content and which become metadata. This is critical for filtering later — if plan is in metadata, you can retrieve only enterprise customers without relying on semantic search.
Third-Party Loaders — Notion and GitHub
LangChain's real power shows when you move beyond files on disk. The langchain_community package includes loaders for dozens of third-party services. I'll focus on two that come up constantly in enterprise projects: Notion (for internal knowledge bases) and GitHub (for codebases and documentation).
NotionDirectoryLoader — Internal Knowledge Bases
Notion is where half the startups I work with keep their documentation, meeting notes, and product specs. To load Notion content, you first export your workspace as Markdown files (Notion Settings > Export > Markdown & CSV), then point the loader at the exported directory.
The metadata includes the file path, which preserves the Notion page hierarchy. This is valuable context — when a user asks about deployment, you can cite the exact Notion page.
GitHubLoader — Codebases and Repo Documentation
Loading a GitHub repo lets your LLM answer questions about a codebase — architecture, function signatures, dependencies, README content. This is the foundation of "chat with your code" tools.
For loading actual source code files from a repo, you can clone the repository locally and use DirectoryLoader with a TextLoader. This approach gives you more control over which files to include and handles binary files gracefully.
Building a Unified Ingestion Function
In production, you rarely use just one loader. A typical RAG pipeline ingests PDFs, web pages, and CSV files from the same source. Writing a function that detects the file type and picks the right loader eliminates a lot of repetitive code. This is the pattern I use in every project.
Using it is straightforward: hand the function a file path and it returns ready-to-use Document objects, whatever the format.
LangChain's Document objects always have page_content and metadata. Write a function summarize_documents(docs) that takes a list of dictionaries (each with keys page_content and metadata) and returns a dictionary with three keys:
- `"total_docs"`: the number of documents
- `"total_chars"`: the sum of all page_content lengths
- `"sources"`: a sorted list of unique source values from metadata["source"]

Print the result.
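One possible solution sketch:

```python
def summarize_documents(docs):
    """Aggregate stats over a list of {'page_content': ..., 'metadata': ...} dicts."""
    return {
        "total_docs": len(docs),
        "total_chars": sum(len(d["page_content"]) for d in docs),
        "sources": sorted({d["metadata"]["source"] for d in docs}),
    }

docs = [
    {"page_content": "alpha", "metadata": {"source": "a.txt"}},
    {"page_content": "beta beta", "metadata": {"source": "b.txt"}},
    {"page_content": "gamma", "metadata": {"source": "a.txt"}},
]
print(summarize_documents(docs))
# {'total_docs': 3, 'total_chars': 19, 'sources': ['a.txt', 'b.txt']}
```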
Metadata Enrichment — Adding Context Loaders Miss
Raw loader metadata is minimal — usually just source and maybe page or row. For a production RAG system, you want richer metadata: document category, creation date, department, access level. This metadata powers filtered retrieval later: "Find answers only from Q3 reports" or "Search only engineering docs."
A char_count field (the length of page_content, stamped on during enrichment) is more useful than it looks. When you split documents into chunks later, tiny fragments (under 50 characters) are usually headers, page numbers, or extraction artifacts. Filtering them out before embedding saves cost and improves retrieval quality.
Error Handling and Edge Cases
Loaders fail. Files are corrupted, URLs return 404, CSVs have encoding issues. In a batch ingestion pipeline, you don't want one bad file to crash the entire run. Here is the defensive pattern I use.
The key decisions here: return an empty list instead of raising (so batch processing continues), log the error for debugging, and filter out documents that are too short to be useful. That 20-character threshold catches headers, page numbers, and extraction artifacts while keeping real content.
Write a function get_loader_name(file_path) that takes a file path string and returns the name of the appropriate loader as a string. The mapping is:
- `.pdf` -> `"PyPDFLoader"`
- `.csv` -> `"CSVLoader"`
- `.txt` -> `"TextLoader"`
- `.md` -> `"TextLoader"`
- `.html` -> `"BSHTMLLoader"`
- anything else -> `"Unsupported"`

The function should be case-insensitive for extensions (`.PDF` and `.pdf` both work). Print the result for each test case.
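One possible solution sketch:

```python
from pathlib import Path

def get_loader_name(file_path):
    """Map a file extension to the loader that handles it."""
    mapping = {
        ".pdf": "PyPDFLoader",
        ".csv": "CSVLoader",
        ".txt": "TextLoader",
        ".md": "TextLoader",
        ".html": "BSHTMLLoader",
    }
    # Path.suffix keeps the dot; lowercase it for case-insensitive matching
    return mapping.get(Path(file_path).suffix.lower(), "Unsupported")

for path in ["report.PDF", "data.csv", "notes.md", "image.png"]:
    print(path, "->", get_loader_name(path))
```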
Real-World Example: Building a Multi-Source Knowledge Base
Let me walk through a pattern I've used in three different client projects: ingesting documents from multiple sources into a single knowledge base that feeds a RAG pipeline. The ingestion config is a YAML file that maps source types to paths and metadata.
This pattern separates configuration from code. Adding a new source means adding a SourceConfig entry — no loader code changes. In production, I load these configs from a YAML file so non-engineers can add new data sources without touching Python.
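A sketch of what those config objects might look like in Python; SourceConfig, the kind values, and plan_ingestion are all illustrative names rather than LangChain APIs:

```python
from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    """One entry in the ingestion config (mirrors a YAML block)."""
    kind: str                  # e.g. "pdf_dir", "web", "csv"
    location: str              # path or URL
    metadata: dict = field(default_factory=dict)

# Hypothetical config; in production this would be parsed from YAML
SOURCES = [
    SourceConfig("pdf_dir", "reports/q3/", {"department": "finance"}),
    SourceConfig("web", "https://docs.example.com", {"category": "docs"}),
]

LOADER_NAME_BY_KIND = {
    "pdf_dir": "DirectoryLoader",
    "web": "WebBaseLoader",
    "csv": "CSVLoader",
}

def plan_ingestion(sources):
    """Resolve each config entry to the loader that would handle it."""
    return [(s.location, LOADER_NAME_BY_KIND[s.kind]) for s in sources]

print(plan_ingestion(SOURCES))
```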
Common Mistakes and How to Fix Them
After building document ingestion for a dozen projects, these are the mistakes I see repeatedly.
Mistake 1: Loading everything in a directory

```python
# Loads every file including images, binaries, .git
loader = DirectoryLoader("repo/", glob="**/*")
docs = loader.load()  # Crashes or loads garbage
```

Only load what your pipeline can actually process. Note that DirectoryLoader matches with pathlib-style globs, which do not support `{py,md,txt}` brace sets, so use one loader per extension:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loaders = [
    DirectoryLoader("repo/", glob=f"**/*{ext}", loader_cls=TextLoader, silent_errors=True)
    for ext in (".py", ".md", ".txt")
]
docs = [doc for loader in loaders for doc in loader.load()]
```

Mistake 2: Embedding empty documents

```python
# Sends all docs to embeddings — including empty ones
docs = loader.load()
vector_store.add_documents(docs)
```

```python
docs = loader.load()
# Remove empty and near-empty documents
docs = [d for d in docs if len(d.page_content.strip()) > 50]
vector_store.add_documents(docs)
```

Mistake 3: Never checking metadata

```python
# Loader returns docs but you never check metadata
docs = loader.load()
# Later: "The AI said X but WHERE did it come from?"
```

```python
from datetime import datetime

docs = loader.load()
for doc in docs:
    doc.metadata["ingested_at"] = datetime.now().isoformat()
    assert "source" in doc.metadata, "Missing source!"
```

Write a function deduplicate_docs(docs) that takes a list of document dictionaries (each with page_content and metadata keys) and removes duplicates. Two documents are considered duplicates if they have the same metadata["source"] AND the same page_content. The function should return a new list with duplicates removed, preserving the order of first occurrence. Print the count of documents before and after deduplication.
Loader Quick-Reference Table
LangChain has 160+ loaders. Here are the ones I reach for most, grouped by source type:
| Source Type | Loader | Package | Key Feature |
|---|---|---|---|
| Plain text | TextLoader | langchain_community | Simplest loader, one doc per file |
| PDF (simple) | PyPDFLoader | langchain_community | One doc per page, fast |
| PDF (complex) | UnstructuredPDFLoader | langchain_community | Layout analysis, table extraction |
| CSV | CSVLoader | langchain_community | One doc per row, column control |
| Web page | WebBaseLoader | langchain_community | BeautifulSoup extraction |
| Web (multi-page) | RecursiveUrlLoader | langchain_community | Crawls links to max_depth |
| Notion export | NotionDirectoryLoader | langchain_community | Reads Markdown/CSV exports |
| GitHub issues | GitHubIssuesLoader | langchain_community | Loads issues with labels/metadata |
| JSON | JSONLoader | langchain_community | jq-like path expressions |
| Word (.docx) | Docx2txtLoader | langchain_community | Microsoft Word documents |
| HTML | BSHTMLLoader | langchain_community | Local HTML files |
| Directory | DirectoryLoader | langchain_community | Batch loading with any sub-loader |
Performance and Best Practices
Loading documents is I/O-bound — your bottleneck is disk reads, network requests, and PDF parsing, not CPU. These practices come from ingesting thousands of documents in production systems.
1. Use `lazy_load()` for large collections. Every loader has a .lazy_load() method that returns a generator instead of a list. This means documents are processed one at a time instead of all loaded into memory at once. For 10,000+ documents, this is the difference between a working pipeline and an OutOfMemoryError.
2. Enable multithreading for `DirectoryLoader`. Setting use_multithreading=True loads files in parallel. On a directory with 500 PDFs, I measured a 3-4x speedup on an 8-core machine.
3. Cache loaded documents. If your source data does not change frequently, serialize the loaded documents to disk and skip re-loading on subsequent runs. A simple pickle or JSON cache saves minutes on large datasets.
4. Track ingestion lineage. Log which files were loaded, when, how many documents each produced, and any errors. When your RAG system returns a bad answer six months from now, this log tells you whether the source document was ingested correctly.
Frequently Asked Questions
How do I load a JSON file with nested structure?
Use JSONLoader with a jq_schema to extract the specific field you want as page_content. The jq_schema uses jq syntax to navigate the JSON tree.
Can I load Google Docs or Google Drive files directly?
Yes. Use GoogleDriveLoader from langchain_community. It requires OAuth credentials and supports Google Docs, Sheets, and Slides. The setup takes about 10 minutes — follow the LangChain Google Drive guide.
What is the difference between `load()` and `load_and_split()`?
load() returns whole documents. load_and_split() loads the documents and then runs a text splitter on them in one call. I prefer calling load() and splitting separately because it gives you a chance to inspect and enrich the documents between loading and splitting.
How do I handle password-protected PDFs?
PyPDFLoader accepts a password parameter. Pass the PDF password as a string and it handles decryption before extraction.
Do I need LangChain just for loading documents?
No. If document loading is all you need, you can use pypdf, beautifulsoup4, or csv directly. LangChain loaders shine when you are building a pipeline — the standardised Document format means loaders, splitters, embeddings, and retrievers all fit together without format conversion.
Complete Code
The full ingestion pipeline combining everything from this tutorial: