LangChain Callbacks: Custom Logging, Cost Tracking, and Streaming
You ship an LLM app, and it works. But then your boss asks: "How much did that cost last month?" You don't know. A user reports a weird answer. You can't reproduce it because you never logged the prompt. The model starts returning empty strings at 2 AM, and nobody notices until morning.
LangChain callbacks solve all three problems. They let you hook into every step of an LLM call — before the prompt is sent, after the response arrives, when tokens stream in, when something fails — and run your own code at each point. If you've built anything with LangChain, adding callbacks is the single most impactful improvement you can make to observability.
What Are Callbacks and Why Do They Matter?
A callback is a function that LangChain calls automatically at a specific moment during execution. You don't call it yourself — you register it, and LangChain fires it at the right time.
To create a callback handler, you subclass BaseCallbackHandler and override the event methods you care about. The two most common are on_llm_start (fires before the API request) and on_llm_end (fires after the response arrives). LangChain provides no-op defaults for every method, so you only implement what you need.
The first two lines come from our callback. The third is the model's response (yours will differ):
LLM call starting with 1 prompt(s)
LLM call finished. Got 1 generation(s)
Python is a high-level, interpreted programming language...

The Callback Lifecycle — Every Hook You Can Use
LangChain doesn't give you just two hooks. It gives you over a dozen, covering every phase of execution across LLMs, chains, tools, and retrievers. I find that most tutorials only show on_llm_start and on_llm_end. Knowing the full set opens up much better logging and debugging.
Here are the callback methods grouped by the component they monitor. You override only the ones you care about — BaseCallbackHandler provides no-op defaults for everything else:
LLM Events:
- on_llm_start(serialized, prompts, **kwargs) — Before the LLM API call
- on_llm_new_token(token, **kwargs) — Each streamed token arrives
- on_llm_end(response, **kwargs) — After the complete response
- on_llm_error(error, **kwargs) — When the LLM call fails

Chain Events:
- on_chain_start(serialized, inputs, **kwargs) — Before a chain runs
- on_chain_end(outputs, **kwargs) — After a chain completes
- on_chain_error(error, **kwargs) — When a chain fails

Tool Events:
- on_tool_start(serialized, input_str, **kwargs) — Before a tool executes
- on_tool_end(output, **kwargs) — After a tool returns
- on_tool_error(error, **kwargs) — When a tool fails

Retriever Events:
- on_retriever_start(serialized, query, **kwargs) — Before a retriever searches
- on_retriever_end(documents, **kwargs) — After documents are returned
- on_retriever_error(error, **kwargs) — When retrieval fails

The lifecycle of a single LLM call flows like this: on_llm_start → (optionally) on_llm_new_token repeated for each token → on_llm_end. If something goes wrong at any point, on_llm_error fires instead of on_llm_end. For chains, the pattern nests: on_chain_start → on_llm_start → ... → on_llm_end → on_chain_end.
Build a Logging Handler
The first callback I build for any project is a logger. Not print() statements scattered through the code — a proper structured logger that captures the prompt, model name, latency, and token counts for every single LLM call. When something goes wrong in production, this log is what saves you.
This handler uses Python's logging module and the time library to measure latency. The on_llm_start method records the timestamp and model name; on_llm_end computes the elapsed time and extracts token counts from the response metadata. The on_llm_error method captures failures with their duration:
To attach it, pass an instance in the config dict when you call .invoke(). Every method in the handler fires automatically at the right moment:
Running this produces structured log lines with timestamps, latency, and token counts:
2026-03-06 14:22:01 | LLM START | model=gpt-4o-mini | prompt_chars=52
2026-03-06 14:22:02 | LLM END | duration=1.34s | prompt_tokens=18 | completion_tokens=45

Build a Cost Tracker
Knowing what your LLM calls cost is critical — especially when you're comparing models. I prefer tracking costs inside my app rather than relying solely on the provider dashboard, because it lets me attribute costs to specific features or users. If you want a broader comparison of API costs across providers, check out our dedicated guide.
This CostTracker handler maintains a pricing dictionary mapping model names to per-million-token rates. Each time on_llm_end fires, it extracts the prompt and completion token counts from the response metadata, multiplies by the model-specific rate, and accumulates totals. The report() method returns a human-readable summary at any point:
Reuse one CostTracker instance across all your calls so the counters accumulate. Here we run three questions through the same tracker and print a summary:
With gpt-4o-mini, three short questions typically cost well under a tenth of a cent. Swap the model to gpt-4o and the same calls cost roughly 17x more. The tracker makes that difference visible before your invoice does.
Complete the per_model_report() method that returns a dictionary mapping each model name to its total cost. The callback's on_llm_end method already tracks costs. You need to also store costs per model in self.model_costs (a defaultdict(float)) and implement per_model_report() to return that dictionary.
The model name is available from response.llm_output.get("model_name", "unknown").
Streaming Callbacks — Token-by-Token Output
When ChatGPT shows text appearing word by word, that's streaming. With LangChain callbacks, you can intercept every single token as it arrives and do whatever you want with it — print it, send it over a WebSocket, feed it to a progress bar, or buffer it for post-processing.
The key method is on_llm_new_token. It fires once per token, receiving just the new text fragment. This handler prints each token using sys.stdout.write() (instead of print(), which would add a newline after each partial word) and keeps a running count:
To enable streaming, set streaming=True on the LLM. Without this flag, on_llm_new_token never fires — LangChain waits for the full response and skips straight to on_llm_end:
The output appears progressively in the terminal, token by token, ending with a count like --- 38 tokens streamed ---. The result variable still contains the full response after streaming completes.
Combining Multiple Callbacks
You're not limited to one handler. I typically run three simultaneously: a logger, a cost tracker, and an error alerter. Each has a single responsibility, and they all fire on the same LLM call. LangChain calls every handler in the list, in order, for each event.
Constructor-Level vs Invocation-Level Callbacks
There are two places to attach callbacks, and the choice matters. Constructor-level callbacks attach to the LLM instance and fire on every call. Invocation-level callbacks attach to a single .invoke() call. This distinction applies to LCEL chains the same way it applies to individual LLM calls.
# Attached at creation — fires on EVERY call
llm = ChatOpenAI(
    model="gpt-4o-mini",
    callbacks=[LoggingHandler()]
)

# Both of these calls trigger the handler
llm.invoke("Question 1")
llm.invoke("Question 2")

# Attached per invocation — fires only on THAT call
llm = ChatOpenAI(model="gpt-4o-mini")

# Only this call triggers the handler
llm.invoke("Question 1",
           config={"callbacks": [LoggingHandler()]})

# This call has no callbacks
llm.invoke("Question 2")

My preference: constructor-level for production monitoring (logging, cost tracking) that should always be on. Invocation-level for temporary debugging or per-request features like streaming to a specific user's WebSocket.
Error Handling with Callbacks
API calls fail. Rate limits hit, networks time out, models return unexpected responses. Without callbacks, you're wrapping every .invoke() call in try/except blocks scattered across your codebase. An error-handling callback centralizes all failure logic in one place.
This ErrorAlertHandler stores every failure as a structured record with a timestamp, error type, full message, and traceback. It hooks into both on_llm_error and on_chain_error to catch failures at every level. The get_error_summary method gives you a quick status check without digging through logs:
In production, you'd replace the print() call with a real alerting mechanism — a Slack webhook, PagerDuty integration, or a metrics counter that triggers an alarm when the error rate spikes.
Async Callbacks for High-Throughput Apps
If your app handles multiple LLM calls concurrently — a FastAPI server, a streaming backend, or a batch pipeline — synchronous callbacks can become a bottleneck. LangChain supports async callback handlers through AsyncCallbackHandler.
The interface is identical to BaseCallbackHandler, but methods are async def and can await I/O operations without blocking the event loop. This AsyncFileLogger writes each LLM event to a log file using aiofiles for non-blocking I/O:
Use it with LangChain's async methods — ainvoke instead of invoke:
Built-in Callback Handlers
Before building everything from scratch, check whether LangChain's built-in handlers do what you need. These ship with langchain and langchain-core — no extra installation required.
| Handler | What It Does | Import Path |
|---|---|---|
StdOutCallbackHandler | Prints all events to stdout | langchain_core.callbacks |
StreamingStdOutCallbackHandler | Prints streamed tokens to stdout | langchain_core.callbacks |
FileCallbackHandler | Writes all events to a file | langchain_community.callbacks |
ConsoleCallbackHandler | Rich console output with colors | langchain_core.tracers |
I reach for StdOutCallbackHandler when I just want to see what's happening during development. For production, I switch to a custom handler with structured logging. The built-in handlers are quick starting points — but custom handlers give you full control over format, filtering, and where the data goes.
Real-World Example — A Production Monitoring Stack
This handler combines logging, cost tracking, latency monitoring, and error counting into a single class with a dashboard-ready summary method. It's the kind of handler you'd wire up once and leave running in production.
The key design choice is using run_id (a unique ID LangChain assigns to each invocation) to track concurrent calls. The earlier LoggingHandler used a single self.start_time attribute, which breaks if two calls overlap. Here, a dictionary maps each run_id to its start timestamp, so latencies stay accurate even under load:
The dashboard() method aggregates the raw data into metrics you'd expose as a health-check API endpoint. It computes average latency, 95th-percentile latency (your worst-case user experience), and the error rate as a percentage:
After a batch of calls, dashboard() returns a dictionary like this:
{
  "total_calls": 150,
  "total_tokens": 45200,
  "total_cost_usd": 0.028,
  "avg_latency_s": 1.2,
  "p95_latency_s": 2.8,
  "error_rate_pct": 1.33,
  "errors_by_type": {"RateLimitError": 2}
}

Complete the compute_stats() function. It receives a list of latencies (floats in seconds) and two integers: call_count and error_count. It should return a dictionary with three keys:
- "avg_latency": the mean of the latencies list, rounded to 3 decimal places (0 if the list is empty)
- "p95_latency": the value at the 95th percentile index (int(len * 0.95)), rounded to 3 decimal places (0 if the list is empty)
- "error_rate": error_count / call_count * 100, rounded to 2 decimal places (0 if call_count is 0)

Common Mistakes and How to Fix Them
Callback bugs are subtle because callbacks fail silently — your app keeps working, but your monitoring is broken. These are the four mistakes I see most often.
Mistake 1: Sharing State Across Concurrent Calls
class BadHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None  # Overwritten by concurrent calls!

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.start_time = time.time()

    def on_llm_end(self, response, **kwargs):
        elapsed = time.time() - self.start_time  # Wrong if another call started

class GoodHandler(BaseCallbackHandler):
    def __init__(self):
        self._start_times = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start_times[run_id] = time.time()

    def on_llm_end(self, response, *, run_id, **kwargs):
        elapsed = time.time() - self._start_times.pop(run_id, time.time())

If two LLM calls overlap, the first call's start_time gets overwritten by the second. The run_id pattern keeps them isolated.
Mistake 2: Raising Exceptions in Callbacks
An unhandled exception in a callback can crash your entire LLM call, even if the API response was fine. Always wrap callback logic in try/except to isolate monitoring failures from application logic:
Mistake 3: Forgetting streaming=True for Token Callbacks
This one bites everyone at least once. You write a beautiful on_llm_new_token handler, wire it up, and nothing happens. The fix is a single parameter:
Mistake 4: Creating a New Handler Instance per Call
If your callback tracks cumulative metrics (total cost, call count), creating a fresh instance for each .invoke() call resets the counters. Reuse the same instance:
Complete the safe_callback function. It takes a callable callback_fn and returns a new function that:
1. Calls callback_fn with the given arguments
2. If callback_fn raises any exception, catches it and returns the string "callback_error: " followed by the exception message
3. If callback_fn succeeds, returns its return value
This pattern prevents a broken callback from crashing the main application.
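One way to complete the exercise:

```python
def safe_callback(callback_fn):
    """Wrap callback_fn so any exception becomes a string instead of a crash."""
    def wrapper(*args, **kwargs):
        try:
            return callback_fn(*args, **kwargs)
        except Exception as exc:
            return "callback_error: " + str(exc)
    return wrapper

@safe_callback
def flaky(x):
    if x < 0:
        raise ValueError("negative input")
    return x * 2

print(flaky(5))    # 10
print(flaky(-1))   # callback_error: negative input
```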
Callbacks vs LangSmith — When to Use Which
A question I see constantly: should you use custom callbacks or LangSmith for observability? The short answer is they're complementary, not alternatives. LangSmith itself is built on the callback system — it uses a LangChainTracer callback handler under the hood.
| Aspect | Custom Callbacks | LangSmith |
|---|---|---|
| Setup | Write your own handlers | Set env vars, auto-traces |
| Data location | Wherever you send it (your DB, files, etc.) | LangChain's cloud servers |
| Customization | Full control over format and logic | Dashboard UI, limited custom logic |
| Cost | Free (your infrastructure) | Free tier + paid plans |
| Best for | Custom metrics, on-premises, alerting | Quick debugging, trace visualization |
I prefer custom callbacks when I need specific metrics (cost attribution per user, custom alerting) or when data cannot leave my infrastructure. LangSmith is faster to set up when you just want trace visualization during development. Many teams run both: LangSmith for development visibility and custom callbacks for production monitoring.
Frequently Asked Questions
Can I use callbacks with LCEL chains?
Yes. LCEL chains accept callbacks through the same config parameter. The chain-level events (on_chain_start, on_chain_end) fire for the overall chain, while LLM events fire for each model call within it:
How do I access the original prompt inside on_llm_end?
The on_llm_end method doesn't receive the prompt directly. Store it in on_llm_start and retrieve it in on_llm_end using run_id as the correlation key:
Do callbacks work with tools and agents?
Yes. If you're using LangChain tools, the on_tool_start, on_tool_end, and on_tool_error events fire for each tool execution. For agents, the chain-level events capture the agent's reasoning loop, while LLM and tool events capture individual steps within it.
Can I dispatch custom events from my own code?
Since LangChain 0.3, you can use dispatch_custom_event(name, data) inside a Runnable to fire arbitrary events that your callback handler can listen for via on_custom_event. This is useful when you need to track application-specific milestones (like "retrieval complete" or "guardrail check passed") that don't map to the standard LLM/chain/tool events.
What to Learn Next
You now have a complete callback toolkit: logging, cost tracking, streaming, error handling, async handlers, and production monitoring. Here are the natural next steps in your LangChain journey:
Complete Code
Here is every handler from this tutorial combined into a single copy-paste script. It includes the logging handler, cost tracker, streaming handler, error alerter, and a demo that runs them together:
References
Python logging module — official documentation.