Agentic Memory Architecture
Short-term buffer memory vs. long-term graph memory (Mem0, Zep).
Chat History is Not "Memory"
Appending {"role": "user", "content": "..."} to an array until you hit the token limit is not an architecture — it's a ticking time bomb. True Agentic Memory is split into Short-Term (operational) and Long-Term (persistent) subsystems.
Short-Term Memory (Working Context)
This is standard sliding-window context. But instead of blindly appending, senior engineers use Token-Aware Summary Buffers. When the conversation history exceeds a threshold (e.g., 4,000 tokens), an asynchronous background LLM call distills the oldest ~3,000 tokens into a dense 200-token summary ("User and AI discussed Python deployment strategies..."). The new working context becomes: [Summary] + [Last 5 Messages].
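The buffer logic above can be sketched as follows. This is a minimal illustration, not a production implementation: `count_tokens` is a crude character-based estimate (real code would use `tiktoken`), and the `summarize` callable stands in for the background LLM distillation call.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token); swap in tiktoken in production."""
    return max(1, len(text) // 4)

class SummaryBufferMemory:
    """Token-aware summary buffer: compress old messages once a threshold is crossed."""

    def __init__(self, summarize, max_tokens: int = 4000, keep_last: int = 5):
        self.summarize = summarize      # callable(prev_summary, old_messages) -> str
        self.max_tokens = max_tokens
        self.keep_last = keep_last
        self.summary = ""               # running distilled summary
        self.messages: list[dict] = []  # recent raw messages

    def _total_tokens(self) -> int:
        return count_tokens(self.summary) + sum(
            count_tokens(m["content"]) for m in self.messages
        )

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self._total_tokens() > self.max_tokens and len(self.messages) > self.keep_last:
            # Distill everything except the most recent messages into the summary
            old = self.messages[:-self.keep_last]
            self.messages = self.messages[-self.keep_last:]
            self.summary = self.summarize(self.summary, old)

    def working_context(self) -> list[dict]:
        """[Summary] + [Last N Messages], ready to send to the model."""
        ctx = []
        if self.summary:
            ctx.append({"role": "system", "content": f"Conversation summary: {self.summary}"})
        return ctx + self.messages
```

In a real agent, `summarize` would itself be an LLM call run asynchronously so the user-facing request is never blocked on compression.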
Long-Term Memory (Persistent Storage)
What if a user mentions "I am allergic to peanuts" on day 1, and on day 400 asks for a recipe? Short-term memory has long forgotten this. Long-term memory continuously extracts semantic "Entities" and "Facts" from the user's conversation stream and writes them to a VectorDB or GraphDB.
Frameworks like Mem0 or Zep automatically monitor the stream, extracting facts (e.g. User_Allergy = Peanuts). When the user asks for a recipe later, the system queries the Long-Term memory vector store for relevant user facts and injects them into the system prompt dynamically.
Code Example
Using structured outputs (JSON Schema) to reliably extract persistent user facts from conversational streams in the background to build Long-Term agentic memory.
from pydantic import BaseModel, Field
import openai

client = openai.OpenAI()

class MemoryExtraction(BaseModel):
    user_facts: list[str] = Field(description="Explicit physical, dietary, or personal facts about the user.")
    core_preferences: list[str] = Field(description="Strong preferences or constraints the user has expressed.")

def extract_long_term_memory(chat_message: str) -> dict:
    """Run asynchronously to extract persistent facts from the user stream."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # Use cheap, fast models for memory extraction
        response_format=MemoryExtraction,  # SDK converts the Pydantic model into a strict JSON Schema
        messages=[
            {"role": "system", "content": "Extract hard facts about the user. Ignore conversational filler. If no facts, return empty lists."},
            {"role": "user", "content": chat_message},
        ],
    )
    return response.choices[0].message.parsed.model_dump()

# User says: "Can you help me plan a trip to Japan? I'm vegan and hate flying."
# Background extraction task runs:
facts = extract_long_term_memory("Can you help me plan a trip to Japan? I'm vegan and hate flying.")

# Extracted facts:
# {'user_facts': ['Dietary restriction: Vegan'], 'core_preferences': ['Dislikes flying']}

# Write these facts (as embeddings) into Pinecone/Mem0.
# Next time the user asks "Plan a vacation", retrieve these facts and inject them into the System Prompt!
Relevance
High - system design for autonomous agents that must maintain state across sessions.