LangChain Document Loaders: Ingest PDFs, Web Pages, CSV, Notion, and GitHub

Intermediate · 90 min · 3 exercises · 45 XP

You have your LangChain chain working with hardcoded strings. Now your boss drops a folder on your desk: 47 PDFs, a Notion workspace, three CSV exports, and a GitHub repo. "Make the AI answer questions about all of this." The gap between a working prototype and a production-ready ingestion pipeline is document loaders -- and getting them right determines whether your RAG system returns brilliant answers or hallucinated garbage.

What Are Document Loaders and Why Do They Matter?

A document loader is a LangChain class that reads data from a source — a file, a URL, an API — and converts it into a standardised Document object. Every loader, regardless of format, produces the same output: a list of Document objects, each containing page_content (the text) and metadata (source information like file name, page number, or URL).

I think of loaders as the plumbing of any LLM application. Nobody brags about plumbing, but if it leaks, everything downstream fails. Bad ingestion means your text splitter gets garbled input, your embeddings encode noise, and your retriever returns irrelevant chunks. More of the RAG quality issues I have debugged trace back to a loader problem than to any model or prompt issue.

The universal loader pattern


That metadata dictionary is quietly important. When your RAG system retrieves a chunk later, the metadata tells the user where the answer came from -- which PDF, which page, which URL. Without metadata, your system gives answers but cannot cite its sources.

Before we dive into specific loaders, here is what you need installed. LangChain splits loaders across packages so you only install what you actually use:
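A typical starting set, hedged: the exact packages depend on which loaders you use, so install only the extras your sources require.

```shell
pip install langchain langchain-community
pip install pypdf                # PyPDFLoader
pip install beautifulsoup4       # WebBaseLoader, BSHTMLLoader
pip install "unstructured[pdf]"  # UnstructuredPDFLoader
```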


Loading PDFs — The Most Common Starting Point

PDFs are the format I encounter most in real projects. Contracts, research papers, financial reports, internal documentation — they all live in PDF. LangChain gives you two primary loaders, and choosing the right one depends on the PDF's complexity.

PyPDFLoader — Fast and Simple

PyPDFLoader uses the pypdf library under the hood. It's fast, reliable, and creates one Document per page — which means the page number lands in your metadata automatically. For standard text-heavy PDFs (contracts, reports, articles), this is the loader I reach for first.


Notice the page key in metadata starts at 0. When you display citations to users later, you will want to add 1: f"Source: page {doc.metadata['page'] + 1}".

UnstructuredPDFLoader — For Complex Layouts

When PDFs have tables, multi-column layouts, headers/footers, or embedded images with captions, PyPDFLoader often mashes everything together. UnstructuredPDFLoader uses the unstructured library which does layout analysis — it identifies headings, tables, and list items as separate elements.


Loading a Directory of PDFs

Real projects rarely involve a single file. DirectoryLoader walks a directory and applies a loader to every matching file. The glob parameter controls which files it picks up.


Loading Web Pages — Turning URLs into Documents

Sometimes the knowledge your LLM needs lives on a website, not in a file. LangChain's web loaders fetch a URL, strip the HTML boilerplate, and give you clean text. The difference between a good and bad web loader is how much navigation, ads, and cookie banners end up in your document.

WebBaseLoader — Quick and Lightweight

WebBaseLoader uses requests and BeautifulSoup to fetch and parse HTML. You can pass it a single URL or a list. For blog posts and documentation pages with clean HTML, it works well.


Without filtering, WebBaseLoader grabs everything — navigation menus, footers, sidebars. The bs_kwargs parameter lets you pass a SoupStrainer from BeautifulSoup, which restricts parsing to specific HTML elements. This dramatically cleans up the output.


RecursiveUrlLoader — Crawling Multiple Pages

When you need an entire documentation site or a multi-page knowledge base, RecursiveUrlLoader follows links from a starting URL up to a specified depth. I use this when a client says "ingest our entire docs site" — it handles the crawling so I don't have to write a custom scraper.


Loading CSVs and Structured Data

CSVs are the second most common format I ingest after PDFs -- customer records, FAQ databases, product catalogs. CSVLoader reads each row as a separate Document, converting column names and values into a readable key-value string. You pass the file path and optional csv_args (delimiter, quote character) to handle non-standard formats.


Each CSV row becomes a separate Document. The page_content is a key-value text representation of the row — not the raw comma-separated line. This is exactly what you want for RAG: when the retriever finds this document, it returns a self-contained description of one record.


You can control which columns become the document content and which become metadata. This is critical for filtering later — if plan is in metadata, you can retrieve only enterprise customers without relying on semantic search.


Third-Party Loaders — Notion and GitHub

LangChain's real power shows when you move beyond files on disk. The langchain_community package includes loaders for dozens of third-party services. I will focus on two that come up constantly in enterprise projects: Notion (for internal knowledge bases) and GitHub (for codebases and documentation).

NotionDirectoryLoader — Internal Knowledge Bases

Notion is where many teams keep their documentation, meeting notes, and product specs. To load Notion content, export your workspace as Markdown files (Notion Settings > Export > Markdown & CSV), then point the loader at the exported directory. Each Markdown file becomes a Document with the file path as metadata.


The metadata includes the file path, which preserves the Notion page hierarchy. This is valuable context — when a user asks about deployment, you can cite the exact Notion page.

GitHub — Issues, Code, and Repo Documentation

GitHubIssuesLoader pulls issues and pull requests from a GitHub repository via the API. Each issue becomes a Document with the body text as page_content and metadata fields for title, labels, state, and author. You need a GitHub personal access token.


For loading actual source code files from a repo, you can clone the repository locally and use DirectoryLoader with a TextLoader. This approach gives you more control over which files to include and handles binary files gracefully.


Building a Unified Ingestion Function

In production, you rarely use just one loader. A typical RAG pipeline ingests PDFs, web pages, and CSV files from the same source. Writing a function that detects the file type and picks the right loader eliminates repetitive code.

The function below uses three strategies based on the input. First, it checks if the source is a URL and uses WebBaseLoader. Second, if the source is a directory, it iterates over a LOADER_MAP dictionary (mapping file extensions to loader classes) and uses DirectoryLoader for each extension. Third, for a single file, it looks up the extension in LOADER_MAP and instantiates the correct loader.


The caller passes a path string or URL, and the function handles the rest.

Exercise 1: Build a Document Metadata Parser
Write Code

LangChain's Document objects always have page_content and metadata. Write a function summarize_documents(docs) that takes a list of dictionaries (each with keys page_content and metadata) and returns a dictionary with three keys:

  • "total_docs": the number of documents
  • "total_chars": the sum of all page_content lengths
  • "sources": a sorted list of unique source values from metadata["source"]

Print the result.

Metadata Enrichment — Adding Context Loaders Miss

Raw loader metadata is minimal -- usually just source and maybe page or row. For a production RAG system, you want richer metadata: document category, department, access level. This metadata powers filtered retrieval later: "Find answers only from Q3 reports" or "Search only engineering docs."

The enrich_metadata function below takes a batch of loaded documents and stamps each one with a category, department, loaded_at timestamp, and a char_count field. It mutates the documents in place and returns them so you can chain calls. The char_count is useful later for filtering out tiny extraction artifacts before you embed them.
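A sketch of enrich_metadata. It only assumes LangChain-style objects with a metadata dict and a page_content string, so langchain_core's Document fits:

```python
from datetime import datetime, timezone

def enrich_metadata(docs, category, department):
    # Mutates in place and returns the batch so calls can be chained
    stamp = datetime.now(timezone.utc).isoformat()
    for doc in docs:
        doc.metadata.update({
            "category": category,
            "department": department,
            "loaded_at": stamp,
            "char_count": len(doc.page_content),  # used later to drop tiny fragments
        })
    return docs
```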


That char_count field is more useful than it looks. When you split documents into chunks later, tiny fragments (under 50 characters) are usually headers, page numbers, or extraction artifacts. Filtering them out before embedding saves cost and improves retrieval quality.

Error Handling and Edge Cases

Loaders fail. Files are corrupted, URLs return 404, CSVs have encoding issues. In a batch ingestion pipeline, you do not want one bad file to crash the entire run.

The safe_load wrapper below wraps load_source in a try/except that catches FileNotFoundError, UnicodeDecodeError, and a generic fallback. Instead of raising, it returns an empty list so batch processing continues. It also filters out documents shorter than 20 characters -- those are usually headers, page numbers, or extraction artifacts that add noise to your vector store.
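A sketch of the wrapper. The loader_fn parameter is an illustration choice here, standing in for a direct call to load_source, so the snippet can be exercised with any loading function:

```python
def safe_load(source, loader_fn):
    """Return loaded documents, or an empty list on failure."""
    try:
        docs = loader_fn(source)
    except FileNotFoundError:
        print(f"[skipped] file not found: {source}")
        return []
    except UnicodeDecodeError:
        print(f"[skipped] encoding problem: {source}")
        return []
    except Exception as exc:  # generic fallback: one bad file cannot kill the batch
        print(f"[skipped] {source}: {exc}")
        return []
    # Drop near-empty fragments (headers, page numbers, artifacts)
    return [d for d in docs if len(d.page_content.strip()) >= 20]
```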


The key design choice: return an empty list instead of raising, so batch processing continues past a bad file. The 20-character threshold discards noise while keeping real content.

Exercise 2: Build a File Type Dispatcher
Write Code

Write a function get_loader_name(file_path) that takes a file path string and returns the name of the appropriate loader as a string. The mapping is:

  • .pdf -> "PyPDFLoader"
  • .csv -> "CSVLoader"
  • .txt -> "TextLoader"
  • .md -> "TextLoader"
  • .html -> "BSHTMLLoader"
  • Any other extension -> "Unsupported"

The function should be case-insensitive for extensions (.PDF and .pdf both work). Print the result for each test case.
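One possible solution, if you want to check your work after attempting it yourself:

```python
from pathlib import Path

def get_loader_name(file_path):
    mapping = {
        ".pdf": "PyPDFLoader",
        ".csv": "CSVLoader",
        ".txt": "TextLoader",
        ".md": "TextLoader",
        ".html": "BSHTMLLoader",
    }
    # .suffix keeps the dot; lower() makes .PDF and .pdf equivalent
    return mapping.get(Path(file_path).suffix.lower(), "Unsupported")

for path in ["report.PDF", "notes.md", "data.csv", "image.png"]:
    print(path, "->", get_loader_name(path))
```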

Real-World Example: Building a Multi-Source Knowledge Base

Here is a pattern I use for production RAG pipelines: a YAML-driven config that maps source types to paths and metadata. Each source entry specifies the type (pdf, csv, web), the file path or URL list, and any custom metadata tags. The ingest_knowledge_base function iterates through configs, calls safe_load for each source, enriches metadata, and tracks success/failure stats.


The implementation uses a SourceConfig dataclass to hold source type, path, URL list, and metadata tags. The function loops through each config, resolves the source (path or URLs), calls safe_load, then enrich_metadata to stamp custom fields. A stats dictionary tracks how many documents loaded and how many sources failed -- essential visibility for debugging production pipelines.
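A sketch of that pipeline. The field names on SourceConfig and the load_fn parameter (standing in for the safe_load wrapper) are illustration choices, not a fixed API:

```python
from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    source_type: str                           # "pdf", "csv", or "web"
    path: str = ""                             # file/directory path for disk sources
    urls: list = field(default_factory=list)   # URL list for web sources
    tags: dict = field(default_factory=dict)   # custom metadata to stamp

def ingest_knowledge_base(configs, load_fn):
    all_docs = []
    stats = {"loaded": 0, "failed_sources": 0}
    for cfg in configs:
        sources = cfg.urls if cfg.source_type == "web" else [cfg.path]
        for src in sources:
            docs = load_fn(src)  # e.g. the safe_load wrapper
            if not docs:
                stats["failed_sources"] += 1
                continue
            for doc in docs:
                doc.metadata.update(cfg.tags)  # stamp custom tags
            all_docs.extend(docs)
            stats["loaded"] += len(docs)
    return all_docs, stats
```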


This pattern separates configuration from code. Adding a new source means adding a SourceConfig entry -- no loader code changes. In production, I load these configs from a YAML file so non-engineers can add new data sources without touching Python.

Common Mistakes and How to Fix Them

After building document ingestion across many projects, these are the mistakes I see repeatedly.

Loading everything, filtering nothing
# Loads every file including images, binaries, .git
loader = DirectoryLoader("repo/", glob="**/*")
docs = loader.load()  # Crashes or loads garbage

Target specific file types
# DirectoryLoader's glob uses pathlib-style patterns, which do not
# support brace expansion like {py,md,txt} — run one pass per extension
docs = []
for pattern in ("**/*.py", "**/*.md", "**/*.txt"):
    loader = DirectoryLoader(
        "repo/",
        glob=pattern,
        loader_cls=TextLoader,
        silent_errors=True,
    )
    docs.extend(loader.load())

Ignoring empty documents
docs = loader.load()
# Sends all docs to embeddings — including empty ones
vector_store.add_documents(docs)

Filter before embedding
docs = loader.load()
# Remove empty and near-empty documents
docs = [d for d in docs if len(d.page_content.strip()) > 50]
vector_store.add_documents(docs)

No metadata — impossible to cite sources
# Loader returns docs but you never check metadata
docs = loader.load()
# Later: "The AI said X but WHERE did it come from?"

Enrich and verify metadata early
from datetime import datetime

docs = loader.load()
for doc in docs:
    doc.metadata["ingested_at"] = datetime.now().isoformat()
    assert "source" in doc.metadata, "Missing source!"
Exercise 3: Deduplicate Documents by Source
Write Code

Write a function deduplicate_docs(docs) that takes a list of document dictionaries (each with page_content and metadata keys) and removes duplicates. Two documents are considered duplicates if they have the same metadata["source"] AND the same page_content. The function should return a new list with duplicates removed, preserving the order of first occurrence. Print the count of documents before and after deduplication.
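One possible solution, if you want to check your work after attempting it yourself:

```python
def deduplicate_docs(docs):
    seen = set()
    unique = []
    for doc in docs:
        # A (source, content) pair identifies a duplicate
        key = (doc["metadata"]["source"], doc["page_content"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    print(f"Before: {len(docs)} docs, after: {len(unique)} docs")
    return unique

docs = [
    {"page_content": "alpha", "metadata": {"source": "a.txt"}},
    {"page_content": "alpha", "metadata": {"source": "a.txt"}},  # duplicate
    {"page_content": "alpha", "metadata": {"source": "b.txt"}},  # same text, new source
]
result = deduplicate_docs(docs)
```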

Loader Quick-Reference Table

LangChain has over 160 loaders. Here are the ones I reach for most, grouped by source type:

| Source Type | Loader | Package | Key Feature |
| --- | --- | --- | --- |
| Plain text | TextLoader | langchain_community | Simplest loader, one doc per file |
| PDF | PyPDFLoader | langchain_community | One doc per page, fast |
| PDF (complex) | UnstructuredPDFLoader | langchain_community | Layout analysis, table extraction |
| CSV | CSVLoader | langchain_community | One doc per row, column control |
| Web page | WebBaseLoader | langchain_community | BeautifulSoup extraction |
| Web (multi-page) | RecursiveUrlLoader | langchain_community | Crawls links to max_depth |
| Notion export | NotionDirectoryLoader | langchain_community | Reads Markdown/CSV exports |
| GitHub issues | GitHubIssuesLoader | langchain_community | Loads issues with labels/metadata |
| JSON | JSONLoader | langchain_community | jq-like path expressions |
| Word (.docx) | Docx2txtLoader | langchain_community | Microsoft Word documents |
| HTML | BSHTMLLoader | langchain_community | Local HTML files |
| Directory | DirectoryLoader | langchain_community | Batch loading with any sub-loader |

Performance and Best Practices

Loading documents is I/O-bound -- your bottleneck is disk reads, network requests, and PDF parsing, not CPU. These practices have saved me the most time in production.

The first is lazy loading. The pattern is simple: replace loader.load() with a for loop over loader.lazy_load(). Each iteration yields a single Document, which you enrich and push to the vector store immediately.
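A sketch of that streaming pattern. It is duck-typed on purpose: loader is anything with a lazy_load() generator and vector_store anything with add_documents(), so it works with any LangChain loader and store:

```python
def stream_ingest(loader, vector_store, batch_size=32):
    # lazy_load() yields one Document at a time, so the full corpus
    # never sits in memory; documents are pushed in small batches.
    batch = []
    for doc in loader.lazy_load():
        doc.metadata["char_count"] = len(doc.page_content)
        batch.append(doc)
        if len(batch) >= batch_size:
            vector_store.add_documents(batch)
            batch = []
    if batch:  # flush the final partial batch
        vector_store.add_documents(batch)
```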


The load_with_cache function below avoids re-parsing files that have not changed. It hashes the source path with hashlib.md5 to create a deterministic cache filename, then checks a .doc_cache/ directory for a matching JSON file. On a cache hit, it deserializes Document objects from JSON. On a cache miss, it calls load_source, serializes each document's page_content and metadata to JSON, and writes the cache file for next time.
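A stdlib-only sketch of that cache. Documents are represented as plain dicts here so they round-trip through JSON; a real version would rebuild LangChain Document objects on a cache hit, and load_fn stands in for load_source:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".doc_cache")

def load_with_cache(source, load_fn):
    CACHE_DIR.mkdir(exist_ok=True)
    # Deterministic cache filename derived from the source path
    key = hashlib.md5(source.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        # Cache hit: skip parsing entirely
        return json.loads(cache_file.read_text())
    # Cache miss: load for real, then persist for next time
    docs = load_fn(source)
    payload = [
        {"page_content": d["page_content"], "metadata": d["metadata"]}
        for d in docs
    ]
    cache_file.write_text(json.dumps(payload))
    return payload
```

Note this keys only on the path; keying on file modification time as well would invalidate the cache when a file changes.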


LangChain also supports async loading via aload() and alazy_load() on every loader. If your ingestion pipeline is already async (common in FastAPI backends), these methods let you load documents without blocking the event loop. The API is identical -- just await the call.


Frequently Asked Questions

How do I load a JSON file with nested structure?

JSONLoader uses jq-style path expressions to navigate nested JSON. The jq_schema parameter tells the loader which field should become page_content. Set text_content=False if the extracted value is a JSON object (not a plain string) so the loader serializes it properly.


Can I load Google Docs or Google Drive files directly?

Yes. Use GoogleDriveLoader from langchain_community. It requires OAuth credentials and supports Google Docs, Sheets, and Slides. The setup takes about 10 minutes — follow the LangChain Google Drive guide.

What is the difference between load() and load_and_split()?

load() returns whole documents. load_and_split() loads and then runs a text splitter in one call. I prefer calling load() and splitting separately because it gives you a chance to inspect and enrich documents between loading and splitting.


How do I handle password-protected PDFs?

PyPDFLoader accepts a password parameter. Pass the PDF password as a string and it handles decryption before extraction.


Do I need LangChain just for loading documents?

No. If document loading is all you need, you can use pypdf, beautifulsoup4, or csv directly. LangChain loaders shine when you are building a full pipeline -- the standardized Document format means loaders, splitters, embeddings, and retrievers all fit together without format conversion.

What's Next — From Raw Documents to a Working RAG Pipeline

Document loading is the first stage of a RAG pipeline. You now have raw Document objects -- but they are too long to embed directly. The next step is text splitting, which breaks documents into chunks sized for your embedding model. From there, you embed the chunks and store them in a vector database.

For the full end-to-end flow -- loading, splitting, embedding, storing, and querying -- see our RAG with LangChain tutorial. If you are debugging chain behavior or tracking token costs across loader calls, LangSmith gives you observability into every step.



References

  • LangChain Documentation — Document Loaders
  • LangChain Community — Document Loader Integrations
  • pypdf Documentation — PDF Parsing
  • Unstructured Library — Document Processing
  • BeautifulSoup Documentation — HTML Parsing
  • LangChain API Reference — Document Class
  • LangChain Documentation — Text Splitters
