LangChain Document Loaders: Ingest PDFs, Web Pages, CSV, Notion, and GitHub
You've got your LangChain chain working with hardcoded strings. Now your boss drops a folder on your desk: 47 PDFs, a Notion workspace, three CSV exports, and a GitHub repo. "Make the AI answer questions about all of this." The gap between a working prototype and a production-ready ingestion pipeline is document loaders — and getting them right determines whether your RAG system returns brilliant answers or hallucinated garbage.
What Are Document Loaders and Why Do They Matter?
A document loader is a LangChain class that reads data from a source — a file, a URL, an API — and converts it into a standardised Document object. Every loader, regardless of format, produces the same output: a list of Document objects, each containing page_content (the text) and metadata (source information like file name, page number, or URL).
I think of loaders as the plumbing of any LLM application. Nobody brags about plumbing, but if it leaks, everything downstream fails. Bad ingestion means your text splitter gets garbled input, your embeddings encode noise, and your retriever returns irrelevant chunks. I've debugged more RAG quality issues that traced back to a loader problem than to any model or prompt issue.
Running a loader over a simple text file produces a list of Document objects: the text lands in page_content and the file path in metadata.
That metadata dictionary is quietly important. When your RAG system retrieves a chunk later, the metadata tells the user where the answer came from — which PDF, which page, which URL. Without metadata, your system gives answers but can't cite its sources.
Before we dive into specific loaders, here is what you need installed. LangChain splits loaders across packages so you only install what you actually use:
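A typical install set is sketched below; exact package names and extras shift between LangChain versions, so treat this as a starting point rather than a definitive list:

```shell
pip install langchain langchain-community   # core + community loaders
pip install pypdf                           # PyPDFLoader
pip install "unstructured[pdf]"             # UnstructuredPDFLoader
pip install beautifulsoup4                  # WebBaseLoader, BSHTMLLoader
```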
Loading PDFs — The Most Common Starting Point
PDFs are the format I encounter most in real projects. Contracts, research papers, financial reports, internal documentation — they all live in PDF. LangChain gives you two primary loaders, and choosing the right one depends on the PDF's complexity.
PyPDFLoader — Fast and Simple
PyPDFLoader uses the pypdf library under the hood. It's fast, reliable, and creates one Document per page — which means the page number lands in your metadata automatically. For standard text-heavy PDFs (contracts, reports, articles), this is the loader I reach for first.
Notice the page key in metadata starts at 0. When you display citations to users later, you will want to add 1: f"Source: page {doc.metadata['page'] + 1}".
UnstructuredPDFLoader — For Complex Layouts
When PDFs have tables, multi-column layouts, headers/footers, or embedded images with captions, PyPDFLoader often mashes everything together. UnstructuredPDFLoader uses the unstructured library which does layout analysis — it identifies headings, tables, and list items as separate elements.
Loading a Directory of PDFs
Real projects rarely involve a single file. DirectoryLoader walks a directory and applies a loader to every matching file. The glob parameter controls which files it picks up.
Loading Web Pages — Turning URLs into Documents
Sometimes the knowledge your LLM needs lives on a website, not in a file. LangChain's web loaders fetch a URL, strip the HTML boilerplate, and give you clean text. The difference between a good and bad web loader is how much navigation, ads, and cookie banners end up in your document.
WebBaseLoader — Quick and Lightweight
WebBaseLoader uses requests and BeautifulSoup to fetch and parse HTML. You can pass it a single URL or a list. For blog posts and documentation pages with clean HTML, it works well.
Without filtering, WebBaseLoader grabs everything — navigation menus, footers, sidebars. The bs_kwargs parameter lets you pass a SoupStrainer from BeautifulSoup, which restricts parsing to specific HTML elements. This dramatically cleans up the output.
RecursiveUrlLoader — Crawling Multiple Pages
When you need an entire documentation site or a multi-page knowledge base, RecursiveUrlLoader follows links from a starting URL up to a specified depth. I use this when a client says "ingest our entire docs site" — it handles the crawling so I don't have to write a custom scraper.
Loading CSVs and Structured Data
Each CSV row becomes a separate Document. The page_content is a key-value text representation of the row — not the raw comma-separated line. This is exactly what you want for RAG: when the retriever finds this document, it returns a self-contained description of one record.
You can control which columns become the document content and which become metadata. This is critical for filtering later — if plan is in metadata, you can retrieve only enterprise customers without relying on semantic search.
Third-Party Loaders — Notion and GitHub
LangChain's real power shows when you move beyond files on disk. The langchain_community package includes loaders for dozens of third-party services. I'll focus on two that come up constantly in enterprise projects: Notion (for internal knowledge bases) and GitHub (for codebases and documentation).
NotionDirectoryLoader — Internal Knowledge Bases
Notion is where half the startups I work with keep their documentation, meeting notes, and product specs. To load Notion content, you first export your workspace as Markdown files (Notion Settings > Export > Markdown & CSV), then point the loader at the exported directory.
The metadata includes the file path, which preserves the Notion page hierarchy. This is valuable context — when a user asks about deployment, you can cite the exact Notion page.
GitHubLoader — Codebases and Repo Documentation
Loading a GitHub repo lets your LLM answer questions about a codebase — architecture, function signatures, dependencies, README content. This is the foundation of "chat with your code" tools.
For loading actual source code files from a repo, you can clone the repository locally and use DirectoryLoader with a TextLoader. This approach gives you more control over which files to include and handles binary files gracefully.
Building a Unified Ingestion Function
In production, you rarely use just one loader. A typical RAG pipeline ingests PDFs, web pages, and CSV files from the same source. Writing a function that detects the file type and picks the right loader eliminates a lot of repetitive code. This is the pattern I use in every project.
Using it is straightforward: hand the function a file path and it returns ready-to-use Document objects, whatever the format.
LangChain's Document objects always have page_content and metadata. Write a function summarize_documents(docs) that takes a list of dictionaries (each with keys page_content and metadata) and returns a dictionary with three keys:
- `"total_docs"`: the number of documents
- `"total_chars"`: the sum of all page_content lengths
- `"sources"`: a sorted list of unique source values from metadata["source"]

Print the result.
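One possible solution sketch:

```python
def summarize_documents(docs):
    """Aggregate stats over a list of {'page_content': ..., 'metadata': ...} dicts."""
    return {
        "total_docs": len(docs),
        "total_chars": sum(len(d["page_content"]) for d in docs),
        "sources": sorted({d["metadata"]["source"] for d in docs}),
    }

docs = [
    {"page_content": "alpha", "metadata": {"source": "a.txt"}},
    {"page_content": "beta beta", "metadata": {"source": "b.txt"}},
    {"page_content": "gamma", "metadata": {"source": "a.txt"}},
]
print(summarize_documents(docs))
# {'total_docs': 3, 'total_chars': 19, 'sources': ['a.txt', 'b.txt']}
```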
Metadata Enrichment — Adding Context Loaders Miss
Raw loader metadata is minimal — usually just source and maybe page or row. For a production RAG system, you want richer metadata: document category, creation date, department, access level. This metadata powers filtered retrieval later: "Find answers only from Q3 reports" or "Search only engineering docs."
A char_count field (the length of page_content, stamped on during enrichment) is more useful than it looks. When you split documents into chunks later, tiny fragments (under 50 characters) are usually headers, page numbers, or extraction artifacts. Filtering them out before embedding saves cost and improves retrieval quality.
Error Handling and Edge Cases
Loaders fail. Files are corrupted, URLs return 404, CSVs have encoding issues. In a batch ingestion pipeline, you don't want one bad file to crash the entire run. Here is the defensive pattern I use.
The key decisions here: return an empty list instead of raising (so batch processing continues), log the error for debugging, and filter out documents that are too short to be useful. That 20-character threshold catches headers, page numbers, and extraction artifacts while keeping real content.
Write a function get_loader_name(file_path) that takes a file path string and returns the name of the appropriate loader as a string. The mapping is:
- `.pdf` -> `"PyPDFLoader"`
- `.csv` -> `"CSVLoader"`
- `.txt` -> `"TextLoader"`
- `.md` -> `"TextLoader"`
- `.html` -> `"BSHTMLLoader"`
- anything else -> `"Unsupported"`

The function should be case-insensitive for extensions (`.PDF` and `.pdf` both work). Print the result for each test case.
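One possible solution sketch:

```python
from pathlib import Path

def get_loader_name(file_path):
    """Map a file extension to the loader that handles it."""
    mapping = {
        ".pdf": "PyPDFLoader",
        ".csv": "CSVLoader",
        ".txt": "TextLoader",
        ".md": "TextLoader",
        ".html": "BSHTMLLoader",
    }
    # Path.suffix keeps the dot; lowercase it for case-insensitive matching
    return mapping.get(Path(file_path).suffix.lower(), "Unsupported")

for path in ["report.PDF", "data.csv", "notes.md", "image.png"]:
    print(path, "->", get_loader_name(path))
```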
Real-World Example: Building a Multi-Source Knowledge Base
Let me walk through a pattern I've used in three different client projects: ingesting documents from multiple sources into a single knowledge base that feeds a RAG pipeline. The ingestion config is a YAML file that maps source types to paths and metadata.
This pattern separates configuration from code. Adding a new source means adding a SourceConfig entry — no loader code changes. In production, I load these configs from a YAML file so non-engineers can add new data sources without touching Python.
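A sketch of what those config objects might look like in Python; SourceConfig, the kind values, and plan_ingestion are all illustrative names rather than LangChain APIs:

```python
from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    """One entry in the ingestion config (mirrors a YAML block)."""
    kind: str                  # e.g. "pdf_dir", "web", "csv"
    location: str              # path or URL
    metadata: dict = field(default_factory=dict)

# Hypothetical config; in production this would be parsed from YAML
SOURCES = [
    SourceConfig("pdf_dir", "reports/q3/", {"department": "finance"}),
    SourceConfig("web", "https://docs.example.com", {"category": "docs"}),
]

LOADER_NAME_BY_KIND = {
    "pdf_dir": "DirectoryLoader",
    "web": "WebBaseLoader",
    "csv": "CSVLoader",
}

def plan_ingestion(sources):
    """Resolve each config entry to the loader that would handle it."""
    return [(s.location, LOADER_NAME_BY_KIND[s.kind]) for s in sources]

print(plan_ingestion(SOURCES))
```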
Common Mistakes and How to Fix Them
After building document ingestion for a dozen projects, these are the mistakes I see repeatedly.
Mistake 1: Loading everything in a directory

```python
# Loads every file including images, binaries, .git
loader = DirectoryLoader("repo/", glob="**/*")
docs = loader.load()  # Crashes or loads garbage
```

Only load what your pipeline can actually process. Note that DirectoryLoader matches with pathlib-style globs, which do not support `{py,md,txt}` brace sets, so use one loader per extension:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loaders = [
    DirectoryLoader("repo/", glob=f"**/*{ext}", loader_cls=TextLoader, silent_errors=True)
    for ext in (".py", ".md", ".txt")
]
docs = [doc for loader in loaders for doc in loader.load()]
```

Mistake 2: Embedding empty documents

```python
# Sends all docs to embeddings — including empty ones
docs = loader.load()
vector_store.add_documents(docs)
```

```python
docs = loader.load()
# Remove empty and near-empty documents
docs = [d for d in docs if len(d.page_content.strip()) > 50]
vector_store.add_documents(docs)
```

Mistake 3: Never checking metadata

```python
# Loader returns docs but you never check metadata
docs = loader.load()
# Later: "The AI said X but WHERE did it come from?"
```

```python
from datetime import datetime

docs = loader.load()
for doc in docs:
    doc.metadata["ingested_at"] = datetime.now().isoformat()
    assert "source" in doc.metadata, "Missing source!"
```

Write a function deduplicate_docs(docs) that takes a list of document dictionaries (each with page_content and metadata keys) and removes duplicates. Two documents are considered duplicates if they have the same metadata["source"] AND the same page_content. The function should return a new list with duplicates removed, preserving the order of first occurrence. Print the count of documents before and after deduplication.
Loader Quick-Reference Table
LangChain has 160+ loaders. Here are the ones I reach for most, grouped by source type:
| Source Type | Loader | Package | Key Feature |
|---|---|---|---|
| Plain text | TextLoader | langchain_community | Simplest loader, one doc per file |
| PDF (simple) | PyPDFLoader | langchain_community | One doc per page, fast |
| PDF (complex) | UnstructuredPDFLoader | langchain_community | Layout analysis, table extraction |
| CSV | CSVLoader | langchain_community | One doc per row, column control |
| Web page | WebBaseLoader | langchain_community | BeautifulSoup extraction |
| Web (multi-page) | RecursiveUrlLoader | langchain_community | Crawls links to max_depth |
| Notion export | NotionDirectoryLoader | langchain_community | Reads Markdown/CSV exports |
| GitHub issues | GitHubIssuesLoader | langchain_community | Loads issues with labels/metadata |
| JSON | JSONLoader | langchain_community | jq-like path expressions |
| Word (.docx) | Docx2txtLoader | langchain_community | Microsoft Word documents |
| HTML | BSHTMLLoader | langchain_community | Local HTML files |
| Directory | DirectoryLoader | langchain_community | Batch loading with any sub-loader |
Performance and Best Practices
Loading documents is I/O-bound — your bottleneck is disk reads, network requests, and PDF parsing, not CPU. These practices come from ingesting thousands of documents in production systems.
1. Use `lazy_load()` for large collections. Every loader has a .lazy_load() method that returns a generator instead of a list. This means documents are processed one at a time instead of all loaded into memory at once. For 10,000+ documents, this is the difference between a working pipeline and an OutOfMemoryError.
2. Enable multithreading for `DirectoryLoader`. Setting use_multithreading=True loads files in parallel. On a directory with 500 PDFs, I measured a 3-4x speedup on an 8-core machine.
3. Cache loaded documents. If your source data does not change frequently, serialize the loaded documents to disk and skip re-loading on subsequent runs. A simple pickle or JSON cache saves minutes on large datasets.
4. Track ingestion lineage. Log which files were loaded, when, how many documents each produced, and any errors. When your RAG system returns a bad answer six months from now, this log tells you whether the source document was ingested correctly.
Frequently Asked Questions
How do I load a JSON file with nested structure?
Use JSONLoader with a jq_schema to extract the specific field you want as page_content. The jq_schema uses jq syntax to navigate the JSON tree.
Can I load Google Docs or Google Drive files directly?
Yes. Use GoogleDriveLoader from langchain_community. It requires OAuth credentials and supports Google Docs, Sheets, and Slides. The setup takes about 10 minutes — follow the LangChain Google Drive guide.
What is the difference between `load()` and `load_and_split()`?
load() returns whole documents. load_and_split() loads the documents and then runs a text splitter on them in one call. I prefer calling load() and splitting separately because it gives you a chance to inspect and enrich the documents between loading and splitting.
How do I handle password-protected PDFs?
PyPDFLoader accepts a password parameter. Pass the PDF password as a string and it handles decryption before extraction.
Do I need LangChain just for loading documents?
No. If document loading is all you need, you can use pypdf, beautifulsoup4, or csv directly. LangChain loaders shine when you are building a pipeline — the standardised Document format means loaders, splitters, embeddings, and retrievers all fit together without format conversion.
Complete Code
The full ingestion pipeline combining everything from this tutorial: