Observability & Tracing
Logging, debugging, and tracing complex LLM requests and agent workflows.
The Black Box Problem
LLM Observability Stack
When an agent fails, a traditional server log is often useless for debugging. Did the agent pick the wrong tool? Did the LLM emit malformed JSON? Did the RAG system retrieve the wrong document?
Observability tooling (often grouped under LLMOps) provides visual tracing of every step in an AI workflow.
Top Platforms
- LangSmith: Native integration with LangChain; excellent for viewing entire agent trajectories and tool calls.
- Helicone: Proxies API calls to capture prompts, responses, and costs.
- Langfuse: Open-source alternative for detailed step-by-step traces.
What to Trace
- Input/Output: The exact prompt sent and the completion returned.
- Latency: Time to First Token (TTFT) and total generation time.
- Cost: Total tokens used (prompt and completion).
- Metadata: User ID, session IDs, and custom tags for analytics.
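The fields above can be captured with a thin wrapper around any streaming LLM call. The sketch below is a minimal, vendor-free illustration: `llm_stream` is a hypothetical callable that yields text chunks, and the whitespace token count is only a rough proxy for what a real API's usage field would report.

```python
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceSpan:
    """One traced LLM call: input/output, latency, cost, and metadata."""
    prompt: str
    user_id: str
    session_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    completion: str = ""
    ttft_ms: float = 0.0
    total_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0

def traced_call(llm_stream, prompt, user_id, session_id):
    """Wrap a streaming call, recording TTFT and total generation time.

    `llm_stream` is any callable yielding text chunks (hypothetical).
    """
    span = TraceSpan(prompt=prompt, user_id=user_id, session_id=session_id)
    start = time.perf_counter()
    chunks = []
    for i, chunk in enumerate(llm_stream(prompt)):
        if i == 0:
            # First chunk arrived: this is the Time to First Token.
            span.ttft_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)
    span.total_ms = (time.perf_counter() - start) * 1000
    span.completion = "".join(chunks)
    # Crude token proxy; real systems read usage from the API response.
    span.prompt_tokens = len(prompt.split())
    span.completion_tokens = len(span.completion.split())
    return span, asdict(span)
```

In practice the returned dict would be shipped to a platform like LangSmith or Langfuse rather than logged locally.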
Use Cases
Debugging infinite loops in LangGraph
Calculating cost-per-user by tagging API calls with custom user IDs
Creating datasets for fine-tuning by exporting highly-rated traces
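The cost-per-user use case boils down to aggregating token counts over trace records tagged with a user ID. A minimal sketch, assuming trace records are dicts with `user_id`, `prompt_tokens`, and `completion_tokens` keys; the per-1K-token prices are placeholder values, since real prices vary by model and provider:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices (USD); real prices vary by model.
PRICE_PER_1K = {"prompt": 0.0005, "completion": 0.0015}

def cost_per_user(traces):
    """Aggregate estimated spend by the user_id tag on each trace record."""
    totals = defaultdict(float)
    for t in traces:
        cost = (t["prompt_tokens"] / 1000) * PRICE_PER_1K["prompt"]
        cost += (t["completion_tokens"] / 1000) * PRICE_PER_1K["completion"]
        totals[t["user_id"]] += cost
    return dict(totals)
```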
Common Mistakes
Logging PII (Personally Identifiable Information) in plain text to third-party dashboards
Not tracking costs, leading to unexpected billing surprises
Failing to trace the steps *between* LLM calls (e.g. database latency vs. LLM latency)
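The PII mistake above is usually fixed by scrubbing traces before they leave your infrastructure. A minimal sketch using two illustrative regex patterns; production systems typically use a dedicated PII-detection library rather than hand-rolled patterns:

```python
import re

# Hypothetical patterns for illustration only; real scrubbers cover far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Mask common PII before a trace is sent to a third-party dashboard."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)
```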
Interview Insight
Relevance
Medium - Crucial for senior engineers scaling systems.