
LangChain Callbacks: Custom Logging, Cost Tracking, and Streaming

Intermediate · 60 min · 3 exercises · 45 XP

You ship an LLM app, and it works. But then your boss asks: "How much did that cost last month?" You don't know. A user reports a weird answer. You can't reproduce it because you never logged the prompt. The model starts returning empty strings at 2 AM, and nobody notices until morning.

LangChain callbacks solve all three problems. They let you hook into every step of an LLM call — before the prompt is sent, after the response arrives, when tokens stream in, when something fails — and run your own code at each point. If you've built anything with LangChain, adding callbacks is one of the highest-impact improvements you can make to its observability.

What Are Callbacks and Why Do They Matter?

A callback is a function that LangChain calls automatically at a specific moment during execution. You don't call it yourself — you register it, and LangChain fires it at the right time.

To create a callback handler, you subclass BaseCallbackHandler and override the event methods you care about. The two most common are on_llm_start (fires before the API request) and on_llm_end (fires after the response arrives). LangChain provides no-op defaults for every method, so you only implement what you need.

A minimal callback handler
Loading editor...
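A minimal sketch of such a handler. The try/except fallback stub is ours, so the snippet runs even without LangChain installed; the attachment lines in the comments assume the langchain-openai package and an API key:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class MinimalHandler(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        # Fires just before the API request goes out
        print(f"LLM call starting with {len(prompts)} prompt(s)")

    def on_llm_end(self, response, **kwargs):
        # Fires once the full response has arrived
        print(f"LLM call finished. Got {len(response.generations)} generation(s)")

# Attach it to a real model (requires langchain-openai and an API key):
# llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[MinimalHandler()])
# print(llm.invoke("What is Python?").content)
```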

The first two lines come from our callback. The third is the model's response (yours will differ):

LLM call starting with 1 prompt(s)
LLM call finished. Got 1 generation(s)
Python is a high-level, interpreted programming language...

The Callback Lifecycle — Every Hook You Can Use

LangChain doesn't give you just two hooks. It gives you over a dozen, covering every phase of execution across LLMs, chains, tools, and retrievers. I find that most tutorials only show on_llm_start and on_llm_end. Knowing the full set opens up much better logging and debugging.

Here are the callback methods grouped by the component they monitor. You override only the ones you care about — BaseCallbackHandler provides no-op defaults for everything else:

LLM Events:

  • on_llm_start(serialized, prompts, **kwargs) — Before the LLM API call
  • on_llm_new_token(token, **kwargs) — Each streamed token arrives
  • on_llm_end(response, **kwargs) — After the complete response
  • on_llm_error(error, **kwargs) — When the LLM call fails
  • Chain Events:

  • on_chain_start(serialized, inputs, **kwargs) — Before a chain runs
  • on_chain_end(outputs, **kwargs) — After a chain completes
  • on_chain_error(error, **kwargs) — When a chain fails
  • Tool Events:

  • on_tool_start(serialized, input_str, **kwargs) — Before a tool executes
  • on_tool_end(output, **kwargs) — After a tool returns
  • on_tool_error(error, **kwargs) — When a tool fails
  • Retriever Events:

  • on_retriever_start(serialized, query, **kwargs) — Before a retriever searches
  • on_retriever_end(documents, **kwargs) — After documents are returned
  • on_retriever_error(error, **kwargs) — When retrieval fails
  • The lifecycle of a single LLM call flows like this: on_llm_start → (optionally) on_llm_new_token repeated for each token → on_llm_end. If something goes wrong at any point, on_llm_error fires instead of on_llm_end. For chains, the pattern nests: on_chain_starton_llm_start → ... → on_llm_endon_chain_end.

    Build a Logging Handler

    The first callback I build for any project is a logger. Not print() statements scattered through the code — a proper structured logger that captures the prompt, model name, latency, and token counts for every single LLM call. When something goes wrong in production, this log is what saves you.

This handler uses Python's logging and time modules to measure latency. The on_llm_start method records the timestamp and model name; on_llm_end computes the elapsed time and extracts token counts from the response metadata. The on_llm_error method captures failures with their duration:

    Structured logging callback
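A sketch of the handler, approximating the sample log format below. Two hedges: the import fallback stub is ours, and where the model name appears in on_llm_start's kwargs varies by LangChain version, so treat that lookup as an assumption. Note the single start_time attribute, which the production example later replaces with a run_id map:

```python
import logging
import time

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

logging.basicConfig(format="%(asctime)s | %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
logger = logging.getLogger("llm")

class LoggingHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None  # single slot — fine for serial calls only

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()
        # Assumption: the model name rides along in invocation_params;
        # fall back to "unknown" if this version of LangChain puts it elsewhere.
        model = kwargs.get("invocation_params", {}).get("model_name", "unknown")
        logger.info("LLM START | model=%s | prompt_chars=%d", model, len(prompts[0]))

    def on_llm_end(self, response, **kwargs):
        duration = time.time() - self.start_time
        usage = (response.llm_output or {}).get("token_usage", {})
        logger.info("LLM END | duration=%.2fs | prompt_tokens=%s | completion_tokens=%s",
                    duration, usage.get("prompt_tokens"), usage.get("completion_tokens"))

    def on_llm_error(self, error, **kwargs):
        duration = time.time() - (self.start_time or time.time())
        logger.error("LLM ERROR | duration=%.2fs | %s: %s",
                     duration, type(error).__name__, error)

# Attach per-call (requires langchain-openai and an API key):
# llm.invoke("Explain Python.", config={"callbacks": [LoggingHandler()]})
```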

To attach it, pass an instance in the config dict when you call .invoke(). Every method in the handler fires automatically at the right moment.


    Running this produces structured log lines with timestamps, latency, and token counts:

    2026-03-06 14:22:01 | LLM START | model=gpt-4o-mini | prompt_chars=52
    2026-03-06 14:22:02 | LLM END | duration=1.34s | prompt_tokens=18 | completion_tokens=45

    Build a Cost Tracker

    Knowing what your LLM calls cost is critical — especially when you're comparing models. I prefer tracking costs inside my app rather than relying solely on the provider dashboard, because it lets me attribute costs to specific features or users. If you want a broader comparison of API costs across providers, check out our dedicated guide.

    This CostTracker handler maintains a pricing dictionary mapping model names to per-million-token rates. Each time on_llm_end fires, it extracts the prompt and completion token counts from the response metadata, multiplies by the model-specific rate, and accumulates totals. The report() method returns a human-readable summary at any point:

    Token cost tracking callback
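A sketch of the tracker. The per-million-token rates below are illustrative examples, not authoritative pricing — check your provider's current price list — and the import fallback stub is ours:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class CostTracker(BaseCallbackHandler):
    # USD per 1M tokens — example rates; verify against current provider pricing
    PRICING = {
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4o": {"input": 2.50, "output": 10.00},
    }

    def __init__(self):
        self.total_cost = 0.0
        self.total_prompt_tokens = 0
        self.total_completion_tokens = 0
        self.call_count = 0

    def on_llm_end(self, response, **kwargs):
        llm_output = response.llm_output or {}
        model = llm_output.get("model_name", "unknown")
        usage = llm_output.get("token_usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        rates = self.PRICING.get(model, {"input": 0.0, "output": 0.0})
        self.total_cost += (prompt_tokens * rates["input"]
                            + completion_tokens * rates["output"]) / 1_000_000
        self.total_prompt_tokens += prompt_tokens
        self.total_completion_tokens += completion_tokens
        self.call_count += 1

    def report(self):
        total_tokens = self.total_prompt_tokens + self.total_completion_tokens
        return f"{self.call_count} calls | {total_tokens} tokens | ${self.total_cost:.6f}"

# Reuse ONE instance across calls so the totals accumulate:
# tracker = CostTracker()
# for q in questions:
#     llm.invoke(q, config={"callbacks": [tracker]})
# print(tracker.report())
```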

Reuse one CostTracker instance across all your calls so the counters accumulate — for example, run three questions through the same tracker and print a summary at the end.


    With gpt-4o-mini, three short questions typically cost well under a tenth of a cent. Swap the model to gpt-4o and the same calls cost roughly 17x more. The tracker makes that difference visible before your invoice does.

    Exercise 1: Build a Per-Model Cost Summary

    Complete the per_model_report() method that returns a dictionary mapping each model name to its total cost. The callback's on_llm_end method already tracks costs. You need to also store costs per model in self.model_costs (a defaultdict(float)) and implement per_model_report() to return that dictionary.

    The model name is available from response.llm_output.get("model_name", "unknown").


    Streaming Callbacks — Token-by-Token Output

    When ChatGPT shows text appearing word by word, that's streaming. With LangChain callbacks, you can intercept every single token as it arrives and do whatever you want with it — print it, send it over a WebSocket, feed it to a progress bar, or buffer it for post-processing.

    The key method is on_llm_new_token. It fires once per token, receiving just the new text fragment. This handler prints each token using sys.stdout.write() (instead of print(), which would add a newline after each partial word) and keeps a running count:

    Streaming token handler
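A sketch of the streaming handler (the import fallback stub is ours; the commented attachment assumes langchain-openai and an API key):

```python
import sys

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class StreamPrinter(BaseCallbackHandler):
    def __init__(self):
        self.token_count = 0

    def on_llm_new_token(self, token, **kwargs):
        # write() instead of print(): no newline after each partial word
        sys.stdout.write(token)
        sys.stdout.flush()
        self.token_count += 1

    def on_llm_end(self, response, **kwargs):
        print(f"\n--- {self.token_count} tokens streamed ---")

# Streaming must be enabled on the model or on_llm_new_token never fires:
# llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)
# result = llm.invoke("Write a haiku about Python.",
#                     config={"callbacks": [StreamPrinter()]})
```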

To enable streaming, set streaming=True on the LLM. Without this flag, on_llm_new_token never fires — LangChain waits for the full response and skips straight to on_llm_end.


    The output appears progressively in the terminal, token by token, ending with a count like --- 38 tokens streamed ---. The result variable still contains the full response after streaming completes.

    Combining Multiple Callbacks

    You're not limited to one handler. I typically run three simultaneously: a logger, a cost tracker, and an error alerter. Each has a single responsibility, and they all fire on the same LLM call. LangChain calls every handler in the list, in order, for each event.

    Stacking multiple handlers
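A sketch of stacking handlers. The three classes here are deliberately tiny stand-ins for the logger, cost tracker, and alerter; since running them for real needs an API key, the loop at the bottom simulates what LangChain does on one completed call, firing each handler in list order:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

events = []  # shared trace so the firing order is visible

class Logger(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        events.append("logger")

class CostCounter(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        events.append("cost")

class ErrorAlerter(BaseCallbackHandler):
    def on_llm_end(self, response, **kwargs):
        events.append("alerter")

handlers = [Logger(), CostCounter(), ErrorAlerter()]

# Real usage (needs an API key):
# llm.invoke("...", config={"callbacks": handlers})

# Simulate one completed call — LangChain fires every handler, in order:
for h in handlers:
    h.on_llm_end(response=None)

print(events)  # → ['logger', 'cost', 'alerter']
```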

    Constructor-Level vs Invocation-Level Callbacks

    There are two places to attach callbacks, and the choice matters. Constructor-level callbacks attach to the LLM instance and fire on every call. Invocation-level callbacks attach to a single .invoke() call. This distinction applies to LCEL chains the same way it applies to individual LLM calls.

    Constructor-level (always active)
    # Attached at creation — fires on EVERY call
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        callbacks=[LoggingHandler()]
    )
    
    # Both of these calls trigger the handler
    llm.invoke("Question 1")
    llm.invoke("Question 2")
    Invocation-level (per-call)
    llm = ChatOpenAI(model="gpt-4o-mini")
    
    # Only this call triggers the handler
    llm.invoke("Question 1",
        config={"callbacks": [LoggingHandler()]})
    
    # This call has no callbacks
    llm.invoke("Question 2")

    My preference: constructor-level for production monitoring (logging, cost tracking) that should always be on. Invocation-level for temporary debugging or per-request features like streaming to a specific user's WebSocket.


    Error Handling with Callbacks

    API calls fail. Rate limits hit, networks time out, models return unexpected responses. Without callbacks, you're wrapping every .invoke() call in try/except blocks scattered across your codebase. An error-handling callback centralizes all failure logic in one place.

    This ErrorAlertHandler stores every failure as a structured record with a timestamp, error type, full message, and traceback. It hooks into both on_llm_error and on_chain_error to catch failures at every level. The get_error_summary method gives you a quick status check without digging through logs:

    Error alerting callback
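A sketch of the alerter (the import fallback stub is ours):

```python
import time
import traceback

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class ErrorAlertHandler(BaseCallbackHandler):
    def __init__(self):
        self.errors = []

    def _record(self, error, source):
        record = {
            "timestamp": time.time(),
            "source": source,
            "type": type(error).__name__,
            "message": str(error),
            "traceback": traceback.format_exc(),
        }
        self.errors.append(record)
        # In production, replace this print with a Slack webhook or PagerDuty call
        print(f"ALERT [{source}] {record['type']}: {record['message']}")

    def on_llm_error(self, error, **kwargs):
        self._record(error, "llm")

    def on_chain_error(self, error, **kwargs):
        self._record(error, "chain")

    def get_error_summary(self):
        return {
            "total_errors": len(self.errors),
            "last_error": self.errors[-1]["message"] if self.errors else None,
        }
```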

    In production, you'd replace the print() call with a real alerting mechanism — a Slack webhook, PagerDuty integration, or a metrics counter that triggers an alarm when the error rate spikes.

    Async Callbacks for High-Throughput Apps

    If your app handles multiple LLM calls concurrently — a FastAPI server, a streaming backend, or a batch pipeline — synchronous callbacks can become a bottleneck. LangChain supports async callback handlers through AsyncCallbackHandler.

    The interface is identical to BaseCallbackHandler, but methods are async def and can await I/O operations without blocking the event loop. This AsyncFileLogger writes each LLM event to a log file using aiofiles for non-blocking I/O:

    Async callback handler
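A sketch of the async logger. Both fallbacks are ours: the stub base class covers a missing LangChain install, and if aiofiles is absent the write degrades to a blocking call pushed off the event loop with asyncio.to_thread:

```python
import asyncio

try:
    from langchain_core.callbacks import AsyncCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class AsyncCallbackHandler:
        pass

try:
    import aiofiles  # true non-blocking file I/O
except ImportError:
    aiofiles = None

class AsyncFileLogger(AsyncCallbackHandler):
    def __init__(self, path="llm_events.log"):
        self.path = path

    async def _write(self, line):
        if aiofiles is not None:
            async with aiofiles.open(self.path, "a") as f:
                await f.write(line + "\n")
        else:
            # fallback: run the blocking write in a worker thread
            await asyncio.to_thread(self._blocking_write, line)

    def _blocking_write(self, line):
        with open(self.path, "a") as f:
            f.write(line + "\n")

    async def on_llm_start(self, serialized, prompts, **kwargs):
        await self._write(f"START | {len(prompts)} prompt(s)")

    async def on_llm_end(self, response, **kwargs):
        await self._write("END")

# Pair it with async invocation (needs an API key):
# result = await llm.ainvoke("...", config={"callbacks": [AsyncFileLogger()]})
```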

Use it with LangChain's async methods — ainvoke instead of invoke.


    Built-in Callback Handlers

    Before building everything from scratch, check whether LangChain's built-in handlers do what you need. These ship with langchain and langchain-core — no extra installation required.

  • StdOutCallbackHandler — prints all events to stdout (langchain_core.callbacks)
  • StreamingStdOutCallbackHandler — prints streamed tokens to stdout (langchain_core.callbacks)
  • FileCallbackHandler — writes all events to a file (langchain_community.callbacks)
  • ConsoleCallbackHandler — rich console output with colors (langchain_core.tracers)

    I reach for StdOutCallbackHandler when I just want to see what's happening during development. For production, I switch to a custom handler with structured logging. The built-in handlers are quick starting points — but custom handlers give you full control over format, filtering, and where the data goes.


    Real-World Example — A Production Monitoring Stack

    This handler combines logging, cost tracking, latency monitoring, and error counting into a single class with a dashboard-ready summary method. It's the kind of handler you'd wire up once and leave running in production.

    The key design choice is using run_id (a unique ID LangChain assigns to each invocation) to track concurrent calls. The earlier LoggingHandler used a single self.start_time attribute, which breaks if two calls overlap. Here, a dictionary maps each run_id to its start timestamp, so latencies stay accurate even under load:

    Production monitoring callback
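A sketch of the monitoring stack, with the dashboard() method folded in. Assumptions to flag: the import fallback stub is ours, and the flat per-token cost constant is a placeholder — a real deployment would use the per-model pricing table from the CostTracker section:

```python
import time
from collections import defaultdict

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class MonitoringHandler(BaseCallbackHandler):
    COST_PER_TOKEN = 0.60 / 1_000_000  # flat example rate — swap in per-model pricing

    def __init__(self):
        self._start_times = {}   # run_id -> start timestamp: safe under concurrency
        self.latencies = []
        self.total_tokens = 0
        self.total_cost = 0.0
        self.call_count = 0
        self.errors_by_type = defaultdict(int)

    def on_llm_start(self, serialized, prompts, *, run_id=None, **kwargs):
        self._start_times[run_id] = time.time()

    def on_llm_end(self, response, *, run_id=None, **kwargs):
        start = self._start_times.pop(run_id, None)
        if start is not None:
            self.latencies.append(time.time() - start)
        usage = (getattr(response, "llm_output", None) or {}).get("token_usage", {})
        tokens = usage.get("total_tokens", 0)
        self.total_tokens += tokens
        self.total_cost += tokens * self.COST_PER_TOKEN
        self.call_count += 1

    def on_llm_error(self, error, *, run_id=None, **kwargs):
        self._start_times.pop(run_id, None)
        self.errors_by_type[type(error).__name__] += 1
        self.call_count += 1

    def dashboard(self):
        lat = sorted(self.latencies)
        errors = sum(self.errors_by_type.values())
        return {
            "total_calls": self.call_count,
            "total_tokens": self.total_tokens,
            "total_cost_usd": round(self.total_cost, 4),
            "avg_latency_s": round(sum(lat) / len(lat), 2) if lat else 0,
            "p95_latency_s": round(lat[int(len(lat) * 0.95)], 2) if lat else 0,
            "error_rate_pct": round(errors / self.call_count * 100, 2)
                              if self.call_count else 0,
            "errors_by_type": dict(self.errors_by_type),
        }
```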

The dashboard() method aggregates the raw data into metrics you'd expose as a health-check API endpoint. It computes average latency, 95th-percentile latency (your worst-case user experience), and the error rate as a percentage.


    After a batch of calls, dashboard() returns a dictionary like this:

    {
        "total_calls": 150,
        "total_tokens": 45200,
        "total_cost_usd": 0.028,
        "avg_latency_s": 1.2,
        "p95_latency_s": 2.8,
        "error_rate_pct": 1.33,
        "errors_by_type": {"RateLimitError": 2}
    }
    Exercise 2: Compute Error Rate and Average Latency

    Complete the compute_stats() function. It receives a list of latencies (floats in seconds) and two integers: call_count and error_count. It should return a dictionary with three keys:

  • "avg_latency": the mean of the latencies list, rounded to 3 decimal places (0 if the list is empty)
  • "p95_latency": the value at the 95th percentile index (int(len * 0.95)), rounded to 3 decimal places (0 if the list is empty)
  • "error_rate": error_count / call_count * 100, rounded to 2 decimal places (0 if call_count is 0)

    Common Mistakes and How to Fix Them

    Callback bugs are subtle because callbacks fail silently — your app keeps working, but your monitoring is broken. These are the four mistakes I see most often.

    Mistake 1: Sharing State Across Concurrent Calls

    Wrong — single start_time attribute
    class BadHandler(BaseCallbackHandler):
        def __init__(self):
            self.start_time = None  # Overwritten by concurrent calls!
    
        def on_llm_start(self, serialized, prompts, **kwargs):
            self.start_time = time.time()
    
        def on_llm_end(self, response, **kwargs):
            elapsed = time.time() - self.start_time  # Wrong if another call started
    Correct — use run_id to track each call
    class GoodHandler(BaseCallbackHandler):
        def __init__(self):
            self._start_times = {}
    
        def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
            self._start_times[run_id] = time.time()
    
        def on_llm_end(self, response, *, run_id, **kwargs):
            elapsed = time.time() - self._start_times.pop(run_id, time.time())

    If two LLM calls overlap, the first call's start_time gets overwritten by the second. The run_id pattern keeps them isolated.

    Mistake 2: Raising Exceptions in Callbacks

An unhandled exception in a callback can crash your entire LLM call, even if the API response was fine. Always wrap callback logic in try/except to isolate monitoring failures from application logic.

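A sketch of the pattern: the handler below swallows its own failures (here triggered by response metadata that may legitimately be missing), so the LLM call it is monitoring always survives:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class SafeCostTracker(BaseCallbackHandler):
    def __init__(self):
        self.total_tokens = 0

    def on_llm_end(self, response, **kwargs):
        try:
            usage = response.llm_output["token_usage"]  # llm_output may be None
            self.total_tokens += usage["total_tokens"]
        except Exception as exc:
            # Monitoring must never take down the app: report and move on
            print(f"callback error (ignored): {exc}")
```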

    Mistake 3: Forgetting streaming=True for Token Callbacks

This one bites everyone at least once. You write a beautiful on_llm_new_token handler, wire it up, and nothing happens. The fix is a single parameter: pass streaming=True when you construct the model.


    Mistake 4: Creating a New Handler Instance per Call

If your callback tracks cumulative metrics (total cost, call count), creating a fresh instance for each .invoke() call resets the counters. Reuse the same instance.

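A sketch of the fix. Since a real run needs an API key, the wrong and right attachment patterns are shown in comments, and the loop simulates LangChain firing the event once per call against one shared instance:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class CallCounter(BaseCallbackHandler):
    def __init__(self):
        self.calls = 0

    def on_llm_end(self, response, **kwargs):
        self.calls += 1

# Wrong — a fresh instance per call, so the counter never exceeds 1:
# for q in questions:
#     llm.invoke(q, config={"callbacks": [CallCounter()]})

# Right — one shared instance accumulates across calls (simulated here):
counter = CallCounter()
for _ in range(3):
    counter.on_llm_end(response=None)  # LangChain would fire this once per call

print(counter.calls)  # → 3
```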
    Exercise 3: Write a Safe Callback Wrapper

    Complete the safe_callback function. It takes a callable callback_fn and returns a new function that:

    1. Calls callback_fn with the given arguments

    2. If callback_fn raises any exception, catches it and returns the string "callback_error: " followed by the exception message

    3. If callback_fn succeeds, returns its return value

    This pattern prevents a broken callback from crashing the main application.


    Callbacks vs LangSmith — When to Use Which

    A question I see constantly: should you use custom callbacks or LangSmith for observability? The short answer is they're complementary, not alternatives. LangSmith itself is built on the callback system — it uses a LangChainTracer callback handler under the hood.

  • Setup — custom callbacks: write your own handlers; LangSmith: set env vars, auto-traces
  • Data location — custom callbacks: wherever you send it (your DB, files, etc.); LangSmith: LangChain's cloud servers
  • Customization — custom callbacks: full control over format and logic; LangSmith: dashboard UI, limited custom logic
  • Cost — custom callbacks: free (your infrastructure); LangSmith: free tier + paid plans
  • Best for — custom callbacks: custom metrics, on-premises, alerting; LangSmith: quick debugging, trace visualization

    I prefer custom callbacks when I need specific metrics (cost attribution per user, custom alerting) or when data cannot leave my infrastructure. LangSmith is faster to set up when you just want trace visualization during development. Many teams run both: LangSmith for development visibility and custom callbacks for production monitoring.

    Frequently Asked Questions

    Can I use callbacks with LCEL chains?

Yes. LCEL chains accept callbacks through the same config parameter. The chain-level events (on_chain_start, on_chain_end) fire for the overall chain, while LLM events fire for each model call within it.

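A sketch of a handler that records the event order. Real chain usage is in the comments (it needs an API key and langchain-openai), so the bottom of the snippet simulates the nesting LangChain produces for one chain wrapping one LLM call:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class EventOrder(BaseCallbackHandler):
    def __init__(self):
        self.events = []

    def on_chain_start(self, serialized, inputs, **kwargs):
        self.events.append("chain_start")

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.events.append("llm_start")

    def on_llm_end(self, response, **kwargs):
        self.events.append("llm_end")

    def on_chain_end(self, outputs, **kwargs):
        self.events.append("chain_end")

# Real usage (needs an API key):
# chain = prompt | llm  # any LCEL chain
# chain.invoke({"topic": "Python"}, config={"callbacks": [EventOrder()]})

# Simulated firing order for one chain wrapping one LLM call:
h = EventOrder()
h.on_chain_start({}, {})
h.on_llm_start({}, ["prompt"])
h.on_llm_end(None)
h.on_chain_end({})
print(h.events)  # → ['chain_start', 'llm_start', 'llm_end', 'chain_end']
```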

    How do I access the original prompt inside on_llm_end?

The on_llm_end method doesn't receive the prompt directly. Store it in on_llm_start and retrieve it in on_llm_end using run_id as the correlation key.

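A sketch of the correlation pattern (the import fallback stub is ours); because each call's prompts are keyed by run_id, the pairing stays correct even when calls interleave:

```python
try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so this sketch runs without LangChain installed
    class BaseCallbackHandler:
        pass

class PromptCapture(BaseCallbackHandler):
    def __init__(self):
        self._pending = {}  # run_id -> prompts still awaiting their response
        self.pairs = []     # (prompts, response) tuples, completed calls

    def on_llm_start(self, serialized, prompts, *, run_id=None, **kwargs):
        self._pending[run_id] = prompts

    def on_llm_end(self, response, *, run_id=None, **kwargs):
        prompts = self._pending.pop(run_id, None)  # correlate by run_id
        self.pairs.append((prompts, response))
```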

    Do callbacks work with tools and agents?

    Yes. If you're using LangChain tools, the on_tool_start, on_tool_end, and on_tool_error events fire for each tool execution. For agents, the chain-level events capture the agent's reasoning loop, while LLM and tool events capture individual steps within it.

    Can I dispatch custom events from my own code?

    Since LangChain 0.3, you can use dispatch_custom_event(name, data) inside a Runnable to fire arbitrary events that your callback handler can listen for via on_custom_event. This is useful when you need to track application-specific milestones (like "retrieval complete" or "guardrail check passed") that don't map to the standard LLM/chain/tool events.

    What to Learn Next

    You now have a complete callback toolkit: logging, cost tracking, streaming, error handling, async handlers, and production monitoring. Here are the natural next steps in your LangChain journey:

  • LangSmith — LangChain's managed tracing platform, built on callbacks. See your callback data in a visual dashboard.
  • LLM Streaming — Deep dive into streaming patterns beyond basic token callbacks.
  • LangChain Chains — Build multi-step pipelines where chain-level callbacks give you end-to-end visibility.
  • LangChain Tools — Add tool-use callbacks to monitor function calling in agents.
  • Streaming AI API Backend — Wire streaming callbacks into a FastAPI server.
Complete Code

    Here is every handler from this tutorial combined into a single copy-paste script. It includes the logging handler, cost tracker, streaming handler, error alerter, and a demo that runs them together:

    Full script — copy-paste and run

    References

  • LangChain documentation — Callbacks. Link
  • LangChain documentation — Custom Callback Handlers. Link
  • LangChain API Reference — BaseCallbackHandler. Link
  • LangChain documentation — Async Callbacks. Link
  • OpenAI API documentation — Token Usage. Link
  • OpenAI Pricing. Link
  • Python logging module documentation. Link
  • LangSmith documentation — Tracing Overview. Link
