
LangChain Document Loaders: Ingest PDFs, Web Pages, CSV, Notion, and GitHub

Intermediate · 90 min · 3 exercises · 45 XP

You've got your LangChain chain working with hardcoded strings. Now your boss drops a folder on your desk: 47 PDFs, a Notion workspace, three CSV exports, and a GitHub repo. "Make the AI answer questions about all of this." The gap between a working prototype and a production-ready ingestion pipeline is document loaders — and getting them right determines whether your RAG system returns brilliant answers or hallucinated garbage.

What Are Document Loaders and Why Do They Matter?

A document loader is a LangChain class that reads data from a source — a file, a URL, an API — and converts it into a standardised Document object. Every loader, regardless of format, produces the same output: a list of Document objects, each containing page_content (the text) and metadata (source information like file name, page number, or URL).

I think of loaders as the plumbing of any LLM application. Nobody brags about plumbing, but if it leaks, everything downstream fails. Bad ingestion means your text splitter gets garbled input, your embeddings encode noise, and your retriever returns irrelevant chunks. I've debugged more RAG quality issues that traced back to a loader problem than to any model or prompt issue.

The universal loader pattern
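A minimal sketch of the pattern, using TextLoader as the simplest example (the file path is a placeholder):

```python
from langchain_community.document_loaders import TextLoader

# Every loader follows the same two steps:
# 1. construct it with a source, 2. call .load() to get a list of Documents
loader = TextLoader("notes.txt")  # placeholder path
docs = loader.load()

print(len(docs))                  # TextLoader yields one Document per file
print(docs[0].page_content[:80])  # the extracted text
print(docs[0].metadata)           # a dict such as {'source': 'notes.txt'}
```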

For a plain text file, each loaded Document carries its page_content plus a metadata dict holding the source path; that single source field is the minimum every loader guarantees.

That metadata dictionary is quietly important. When your RAG system retrieves a chunk later, the metadata tells the user where the answer came from — which PDF, which page, which URL. Without metadata, your system gives answers but can't cite its sources.

Before we dive into specific loaders, here is what you need installed. LangChain splits loaders across packages so you only install what you actually use:

Installation
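A typical install for the loaders covered in this tutorial (the format-specific packages are only needed for the loaders you actually use):

```shell
# Core abstractions plus the community loader collection
pip install langchain langchain-community

# Format-specific dependencies: pypdf for PyPDFLoader, unstructured for
# UnstructuredPDFLoader, beautifulsoup4 for the web loaders, jq for JSONLoader
pip install pypdf unstructured beautifulsoup4 jq
```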

Loading PDFs — The Most Common Starting Point

PDFs are the format I encounter most in real projects. Contracts, research papers, financial reports, internal documentation — they all live in PDF. LangChain gives you two primary loaders, and choosing the right one depends on the PDF's complexity.

PyPDFLoader — Fast and Simple

PyPDFLoader uses the pypdf library under the hood. It's fast, reliable, and creates one Document per page — which means the page number lands in your metadata automatically. For standard text-heavy PDFs (contracts, reports, articles), this is the loader I reach for first.

Loading a PDF with PyPDFLoader
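A sketch of loading a text-heavy PDF (the path is a placeholder; the exact metadata keys can vary slightly across pypdf versions):

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("contract.pdf")  # placeholder path
docs = loader.load()

print(len(docs))            # one Document per page
first = docs[0]
print(first.metadata)       # typically {'source': 'contract.pdf', 'page': 0}
print(first.page_content[:200])
```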

Notice the page key in metadata starts at 0. When you display citations to users later, you will want to add 1: f"Source: page {doc.metadata['page'] + 1}".

UnstructuredPDFLoader — For Complex Layouts

When PDFs have tables, multi-column layouts, headers/footers, or embedded images with captions, PyPDFLoader often mashes everything together. UnstructuredPDFLoader uses the unstructured library which does layout analysis — it identifies headings, tables, and list items as separate elements.

Extracting structured elements from a PDF
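A sketch using mode="elements", which returns one Document per detected layout element rather than one blob per file (path is a placeholder):

```python
from langchain_community.document_loaders import UnstructuredPDFLoader

# mode="elements" yields one Document per element (Title, NarrativeText,
# Table, ListItem, ...) instead of merging the whole file into one string
loader = UnstructuredPDFLoader("report.pdf", mode="elements")
docs = loader.load()

for doc in docs[:5]:
    # the element type lands in metadata under "category"
    print(doc.metadata.get("category"), "-", doc.page_content[:60])
```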

Loading a Directory of PDFs

Real projects rarely involve a single file. DirectoryLoader walks a directory and applies a loader to every matching file. The glob parameter controls which files it picks up.

Batch-loading an entire directory
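A sketch of batch-loading every PDF under a directory (the directory name is a placeholder; show_progress needs the tqdm package):

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "docs/",                  # placeholder directory
    glob="**/*.pdf",          # recursive match on PDFs only
    loader_cls=PyPDFLoader,   # applied to each matched file
    show_progress=True,
    use_multithreading=True,  # load files in parallel
)
docs = loader.load()
print(f"Loaded {len(docs)} page-documents")
```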

Loading Web Pages — Turning URLs into Documents

Sometimes the knowledge your LLM needs lives on a website, not in a file. LangChain's web loaders fetch a URL, strip the HTML boilerplate, and give you clean text. The difference between a good and bad web loader is how much navigation, ads, and cookie banners end up in your document.

WebBaseLoader — Quick and Lightweight

WebBaseLoader uses requests and BeautifulSoup to fetch and parse HTML. You can pass it a single URL or a list. For blog posts and documentation pages with clean HTML, it works well.

Loading a web page
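A sketch with a placeholder URL; WebBaseLoader also accepts a list of URLs:

```python
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com/blog/post")  # placeholder URL
docs = loader.load()

print(docs[0].metadata)            # 'source', and often 'title'/'description'
print(docs[0].page_content[:300])  # extracted text, boilerplate included
```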

Without filtering, WebBaseLoader grabs everything — navigation menus, footers, sidebars. The bs_kwargs parameter lets you pass a SoupStrainer from BeautifulSoup, which restricts parsing to specific HTML elements. This dramatically cleans up the output.

Filtering HTML with SoupStrainer
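A sketch of the filtering pattern. The class name "article-content" is an example; inspect the target page's HTML to find the selector that wraps the main content:

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Restrict parsing to the element that holds the article body
only_article = bs4.SoupStrainer(class_="article-content")  # site-specific

loader = WebBaseLoader(
    "https://example.com/blog/post",       # placeholder URL
    bs_kwargs={"parse_only": only_article},
)
docs = loader.load()  # navigation, footer, and sidebars never get parsed
```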

RecursiveUrlLoader — Crawling Multiple Pages

When you need an entire documentation site or a multi-page knowledge base, RecursiveUrlLoader follows links from a starting URL up to a specified depth. I use this when a client says "ingest our entire docs site" — it handles the crawling so I don't have to write a custom scraper.

Crawling a documentation site
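A sketch with a placeholder starting URL. Note that RecursiveUrlLoader returns raw HTML unless you supply an extractor, so a small BeautifulSoup helper is the usual companion:

```python
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

def clean_html(html: str) -> str:
    # Strip tags and collapse whitespace from each fetched page
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

loader = RecursiveUrlLoader(
    "https://example.com/docs/",  # placeholder starting URL
    max_depth=2,                  # follow links up to two hops from the start
    extractor=clean_html,
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")
```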

Loading CSVs and Structured Data

Loading a CSV file
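A sketch with a placeholder file; the column names shown in the output comment are hypothetical:

```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("customers.csv")  # placeholder path
docs = loader.load()

# Each row becomes one Document whose page_content reads like:
#   name: Acme Corp
#   plan: enterprise
#   mrr: 4200
print(docs[0].page_content)
print(docs[0].metadata)  # {'source': 'customers.csv', 'row': 0}
```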

Each CSV row becomes a separate Document. The page_content is a key-value text representation of the row — not the raw comma-separated line. This is exactly what you want for RAG: when the retriever finds this document, it returns a self-contained description of one record.


You can control which columns become the document content and which become metadata. This is critical for filtering later — if plan is in metadata, you can retrieve only enterprise customers without relying on semantic search.

Controlling content vs metadata columns
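A sketch of the column split. metadata_columns is long-standing; content_columns was added in more recent langchain-community releases, so check your version. Column names are hypothetical:

```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    "customers.csv",                      # placeholder path
    content_columns=["name", "notes"],    # only these appear in page_content
    metadata_columns=["plan", "region"],  # these become filterable metadata
)
docs = loader.load()
print(docs[0].metadata)  # e.g. {'source': ..., 'row': 0, 'plan': 'enterprise', ...}
```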

Third-Party Loaders — Notion and GitHub

LangChain's real power shows when you move beyond files on disk. The langchain_community package includes loaders for dozens of third-party services. I'll focus on two that come up constantly in enterprise projects: Notion (for internal knowledge bases) and GitHub (for codebases and documentation).

NotionDirectoryLoader — Internal Knowledge Bases

Notion is where half the startups I work with keep their documentation, meeting notes, and product specs. To load Notion content, you first export your workspace as Markdown files (Notion Settings > Export > Markdown & CSV), then point the loader at the exported directory.

Loading a Notion export
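A sketch pointing at the unzipped export directory (the directory name is a placeholder):

```python
from langchain_community.document_loaders import NotionDirectoryLoader

# Point at the unzipped "Markdown & CSV" export from Notion
loader = NotionDirectoryLoader("Notion_DB/")  # placeholder export directory
docs = loader.load()

for doc in docs[:3]:
    print(doc.metadata["source"])  # file path mirrors the Notion page hierarchy
```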

The metadata includes the file path, which preserves the Notion page hierarchy. This is valuable context — when a user asks about deployment, you can cite the exact Notion page.

GitHubIssuesLoader — Issues and Repo Documentation

Loading a GitHub repo lets your LLM answer questions about a codebase — architecture, function signatures, dependencies, README content. This is the foundation of "chat with your code" tools.

Loading GitHub issues
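A sketch of loading issues from a public repo. The token value is a placeholder; generate a GitHub personal access token and keep it out of source control:

```python
from langchain_community.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(
    repo="langchain-ai/langchain",  # "owner/name"
    access_token="ghp_...",         # placeholder: a GitHub personal access token
    include_prs=False,              # issues only, skip pull requests
    state="all",                    # open and closed issues
)
docs = loader.load()
print(docs[0].metadata)  # includes title, url, state, labels, creator
```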

For loading actual source code files from a repo, you can clone the repository locally and use DirectoryLoader with a TextLoader. This approach gives you more control over which files to include and handles binary files gracefully.

Loading source code from a cloned repository
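A sketch assuming the repo has already been cloned locally (e.g. `git clone <repo-url> repo/`). Note that DirectoryLoader's glob is pathlib-style, so brace patterns like `{py,md}` do not expand; use one pattern per pass:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    "repo/",              # the locally cloned repository
    glob="**/*.py",       # one pattern per loader; repeat for .md, .txt, etc.
    loader_cls=TextLoader,
    silent_errors=True,   # skip files TextLoader cannot decode
)
docs = loader.load()
print(f"{len(docs)} source files loaded")
```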

Building a Unified Ingestion Function

In production, you rarely use just one loader. A typical RAG pipeline ingests PDFs, web pages, and CSV files from the same source. Writing a function that detects the file type and picks the right loader eliminates a lot of repetitive code. This is the pattern I use in every project.

A unified load_source function
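One possible shape for the function, dispatching on file extension. Imports are deferred into each branch so unsupported formats cost nothing:

```python
from pathlib import Path

def load_source(path: str):
    """Pick a loader from the file extension and return its Documents."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        from langchain_community.document_loaders import PyPDFLoader
        loader = PyPDFLoader(path)
    elif suffix == ".csv":
        from langchain_community.document_loaders import CSVLoader
        loader = CSVLoader(path)
    elif suffix in {".txt", ".md"}:
        from langchain_community.document_loaders import TextLoader
        loader = TextLoader(path)
    else:
        raise ValueError(f"Unsupported file type: {suffix}")
    return loader.load()
```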

Using it is a one-liner per file: `docs = load_source("report.pdf")`. The caller never touches loader classes directly.
Exercise 1: Build a Document Metadata Parser

LangChain's Document objects always have page_content and metadata. Write a function summarize_documents(docs) that takes a list of dictionaries (each with keys page_content and metadata) and returns a dictionary with three keys:

  • "total_docs": the number of documents
  • "total_chars": the sum of all page_content lengths
  • "sources": a sorted list of unique source values from metadata["source"]

Print the result.

Metadata Enrichment — Adding Context Loaders Miss

Raw loader metadata is minimal — usually just source and maybe page or row. For a production RAG system, you want richer metadata: document category, creation date, department, access level. This metadata powers filtered retrieval later: "Find answers only from Q3 reports" or "Search only engineering docs."

Enriching metadata after loading
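A sketch of an enrichment pass. It works on anything with page_content and metadata attributes, which every LangChain Document has; the category value and field names are examples:

```python
from datetime import datetime, timezone

def enrich_metadata(docs, category: str):
    """Stamp each Document with fields the loader does not provide."""
    for doc in docs:
        doc.metadata["category"] = category  # e.g. "q3-report" (example value)
        doc.metadata["ingested_at"] = datetime.now(timezone.utc).isoformat()
        doc.metadata["char_count"] = len(doc.page_content)
    return docs
```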

That char_count field is more useful than it looks. When you split documents into chunks later, tiny fragments (under 50 characters) are usually headers, page numbers, or extraction artifacts. Filtering them out before embedding saves cost and improves retrieval quality.

Error Handling and Edge Cases

Loaders fail. Files are corrupted, URLs return 404, CSVs have encoding issues. In a batch ingestion pipeline, you don't want one bad file to crash the entire run. Here is the defensive pattern I use.

Defensive loading with error recovery
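One way to sketch the defensive wrapper; the helper name safe_load is my choice, and the 20-character default matches the threshold discussed below:

```python
def safe_load(loader, source_name: str, min_chars: int = 20):
    """Load documents, returning [] on failure so batch runs continue."""
    try:
        docs = loader.load()
    except Exception as exc:  # corrupted file, 404, encoding error, ...
        print(f"[ingest] failed on {source_name}: {exc}")
        return []
    # Drop fragments too short to be useful (headers, page numbers)
    return [d for d in docs if len(d.page_content.strip()) >= min_chars]
```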

The key decisions here: return an empty list instead of raising (so batch processing continues), log the error for debugging, and filter out documents that are too short to be useful. That 20-character threshold catches headers, page numbers, and extraction artifacts while keeping real content.

Batch ingestion with graceful failure
Exercise 2: Build a File Type Dispatcher

Write a function get_loader_name(file_path) that takes a file path string and returns the name of the appropriate loader as a string. The function should be case-insensitive for extensions (.PDF and .pdf both work). The mapping is:

  • .pdf -> "PyPDFLoader"
  • .csv -> "CSVLoader"
  • .txt -> "TextLoader"
  • .md -> "TextLoader"
  • .html -> "BSHTMLLoader"
  • Any other extension -> "Unsupported"

Print the result for each test case.

Real-World Example: Building a Multi-Source Knowledge Base

Let me walk through a pattern I've used in three different client projects: ingesting documents from multiple sources into a single knowledge base that feeds a RAG pipeline. The ingestion config is a YAML file that maps source types to paths and metadata.

Multi-source ingestion pipeline

This pattern separates configuration from code. Adding a new source means adding a SourceConfig entry — no loader code changes. In production, I load these configs from a YAML file so non-engineers can add new data sources without touching Python.

Common Mistakes and How to Fix Them

After building document ingestion for a dozen projects, these are the mistakes I see repeatedly.

Loading everything, filtering nothing
# Loads every file including images, binaries, .git
loader = DirectoryLoader("repo/", glob="**/*")
docs = loader.load()  # Crashes or loads garbage
Target specific file types
# Only load what your pipeline can actually process.
# Note: pathlib-style globs don't expand braces like {py,md,txt} —
# use one pattern per pass and repeat for each extension.
loader = DirectoryLoader(
    "repo/",
    glob="**/*.py",
    loader_cls=TextLoader,
    silent_errors=True,
)
Ignoring empty documents
docs = loader.load()
# Sends all docs to embeddings — including empty ones
vector_store.add_documents(docs)
Filter before embedding
docs = loader.load()
# Remove empty and near-empty documents
docs = [d for d in docs if len(d.page_content.strip()) > 50]
vector_store.add_documents(docs)
No metadata — impossible to cite sources
# Loader returns docs but you never check metadata
docs = loader.load()
# Later: "The AI said X but WHERE did it come from?"
Enrich and verify metadata early
from datetime import datetime

docs = loader.load()
for doc in docs:
    doc.metadata["ingested_at"] = datetime.now().isoformat()
    assert "source" in doc.metadata, "Missing source!"
Exercise 3: Deduplicate Documents by Source

Write a function deduplicate_docs(docs) that takes a list of document dictionaries (each with page_content and metadata keys) and removes duplicates. Two documents are considered duplicates if they have the same metadata["source"] AND the same page_content. The function should return a new list with duplicates removed, preserving the order of first occurrence. Print the count of documents before and after deduplication.

Loader Quick-Reference Table

LangChain has 160+ loaders. Here are the ones I reach for most, grouped by source type:

| Source Type      | Loader                | Package             | Key Feature                        |
|------------------|-----------------------|---------------------|------------------------------------|
| Plain text       | TextLoader            | langchain_community | Simplest loader, one doc per file  |
| PDF              | PyPDFLoader           | langchain_community | One doc per page, fast             |
| PDF (complex)    | UnstructuredPDFLoader | langchain_community | Layout analysis, table extraction  |
| CSV              | CSVLoader             | langchain_community | One doc per row, column control    |
| Web page         | WebBaseLoader         | langchain_community | BeautifulSoup extraction           |
| Web (multi-page) | RecursiveUrlLoader    | langchain_community | Crawls links to max_depth          |
| Notion export    | NotionDirectoryLoader | langchain_community | Reads Markdown/CSV exports         |
| GitHub issues    | GitHubIssuesLoader    | langchain_community | Loads issues with labels/metadata  |
| JSON             | JSONLoader            | langchain_community | jq-like path expressions           |
| Word (.docx)     | Docx2txtLoader        | langchain_community | Microsoft Word documents           |
| HTML             | BSHTMLLoader          | langchain_community | Local HTML files                   |
| Directory        | DirectoryLoader       | langchain_community | Batch loading with any sub-loader  |

Performance and Best Practices

Loading documents is I/O-bound — your bottleneck is disk reads, network requests, and PDF parsing, not CPU. These practices come from ingesting thousands of documents in production systems.

1. Use `lazy_load()` for large collections. Every loader has a .lazy_load() method that returns a generator instead of a list. This means documents are processed one at a time instead of all loaded into memory at once. For 10,000+ documents, this is the difference between a working pipeline and an OutOfMemoryError.

Using lazy_load for memory efficiency
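A sketch of the streaming pattern; process is a hypothetical per-document handler (embed, index, then discard):

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader("docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)

# lazy_load() yields Documents one at a time instead of building a full list,
# so memory use stays flat no matter how large the directory is
for doc in loader.lazy_load():
    process(doc)  # hypothetical handler: embed, add to index, move on
```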

2. Enable multithreading for `DirectoryLoader`. Setting use_multithreading=True loads files in parallel. On a directory with 500 PDFs, I measured a 3-4x speedup on an 8-core machine.

3. Cache loaded documents. If your source data does not change frequently, serialize the loaded documents to disk and skip re-loading on subsequent runs. A simple pickle or JSON cache saves minutes on large datasets.

File-based document caching
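A minimal pickle-based cache sketch; load_with_cache is my naming, and load_fn stands for any zero-argument function that performs the actual loading:

```python
import pickle
from pathlib import Path

def load_with_cache(cache_path: str, load_fn):
    """Return cached Documents if present; otherwise load and cache them."""
    cache = Path(cache_path)
    if cache.exists():
        return pickle.loads(cache.read_bytes())  # skip re-loading entirely
    docs = load_fn()
    cache.write_bytes(pickle.dumps(docs))        # persist for the next run
    return docs
```

Delete the cache file whenever the source data changes, or the pipeline will keep serving stale documents.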

4. Track ingestion lineage. Log which files were loaded, when, how many documents each produced, and any errors. When your RAG system returns a bad answer six months from now, this log tells you whether the source document was ingested correctly.

Frequently Asked Questions

How do I load a JSON file with nested structure?

Use JSONLoader with a jq_schema to extract the specific field you want as page_content. The jq_schema uses jq syntax to navigate the JSON tree.

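A sketch assuming a file shaped like `{"posts": [{"title": ..., "body": ...}, ...]}`; adapt the jq_schema to your structure. JSONLoader requires the jq Python package:

```python
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="posts.json",     # placeholder path
    jq_schema=".posts[].body",  # extract each post body as page_content
    text_content=True,          # require extracted values to be strings
)
docs = loader.load()
print(docs[0].metadata)  # includes 'source' and 'seq_num'
```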

Can I load Google Docs or Google Drive files directly?

Yes. Use GoogleDriveLoader from langchain_community. It requires OAuth credentials and supports Google Docs, Sheets, and Slides. The setup takes about 10 minutes — follow the LangChain Google Drive guide.

What is the difference between `load()` and `load_and_split()`?

load() returns whole documents. load_and_split() loads the documents and then runs a text splitter on them in one call. I prefer calling load() and splitting separately because it gives you a chance to inspect and enrich the documents between loading and splitting.

load() vs load_and_split()
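A sketch contrasting the two calls; the path and the enrichment field are placeholders, and the splitter settings are example values:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("report.pdf")  # placeholder path
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# One call: load and split together
chunks = loader.load_and_split(text_splitter=splitter)

# Two calls: load, then inspect/enrich, then split
docs = loader.load()
for doc in docs:
    doc.metadata["reviewed"] = True  # example enrichment between the two steps
chunks = splitter.split_documents(docs)
```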

How do I handle password-protected PDFs?

PyPDFLoader accepts a password parameter. Pass the PDF password as a string and it handles decryption before extraction.

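A sketch with placeholder path and password values:

```python
from langchain_community.document_loaders import PyPDFLoader

# The password is handed to pypdf for decryption before text extraction
loader = PyPDFLoader("protected.pdf", password="s3cret")  # placeholder values
docs = loader.load()
```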

Do I need LangChain just for loading documents?

No. If document loading is all you need, you can use pypdf, beautifulsoup4, or csv directly. LangChain loaders shine when you are building a pipeline — the standardised Document format means loaders, splitters, embeddings, and retrievers all fit together without format conversion.

Complete Code

The full ingestion pipeline combining everything from this tutorial:

Complete ingestion pipeline
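A condensed sketch tying the tutorial's pieces together: extension-based dispatch, per-file error recovery, the short-fragment filter, and metadata enrichment. Function names and thresholds follow the patterns above but are my own composition, not a verbatim reference implementation:

```python
from datetime import datetime, timezone
from pathlib import Path

SUPPORTED = {".pdf", ".csv", ".txt", ".md"}

def load_file(path: Path):
    """Dispatch on extension; imports are deferred so unused formats cost nothing."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        from langchain_community.document_loaders import PyPDFLoader
        return PyPDFLoader(str(path)).load()
    if suffix == ".csv":
        from langchain_community.document_loaders import CSVLoader
        return CSVLoader(str(path)).load()
    from langchain_community.document_loaders import TextLoader
    return TextLoader(str(path)).load()

def ingest(root: str, min_chars: int = 20):
    """Walk a directory tree, loading, filtering, and enriching every supported file."""
    docs, failures = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() not in SUPPORTED:
            continue
        try:
            loaded = load_file(path)
        except Exception as exc:  # one bad file must not kill the batch
            failures.append((str(path), str(exc)))
            continue
        for doc in loaded:
            if len(doc.page_content.strip()) < min_chars:
                continue  # drop headers, page numbers, extraction artifacts
            doc.metadata["ingested_at"] = datetime.now(timezone.utc).isoformat()
            doc.metadata["char_count"] = len(doc.page_content)
            docs.append(doc)
    return docs, failures
```

From here, `docs` goes to your text splitter and vector store, and `failures` goes to your ingestion log.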
