Agentic Memory Architecture
Short-term buffer memory vs. long-term graph memory (Mem0, Zep).
Chat History is Not "Memory"
Appending {"role": "user", "content": "..."} to an array until you hit the token limit is not an architecture — it's a ticking time bomb. True Agentic Memory is split into Short-Term (operational) and Long-Term (persistent) subsystems.
Short-Term Memory (Working Context)
This is standard sliding-window context. But instead of blindly appending, senior engineers use Token-Aware Summary Buffers. When the conversation history exceeds a threshold (e.g., 4,000 tokens), an asynchronous background LLM call distills the oldest ~3,000 tokens into a dense 200-token summary ("User and AI discussed Python deployment strategies..."). The new working context becomes: [Summary] + [Last 5 Messages].
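The buffer logic above can be sketched as follows. This is a minimal illustration, not a production implementation: `count_tokens` is a crude character-based estimate (real code would use `tiktoken`), and the `summarize` callable stands in for the background LLM distillation call.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token); swap in tiktoken in production."""
    return max(1, len(text) // 4)

class SummaryBufferMemory:
    """Token-aware summary buffer: compress old messages once a threshold is crossed."""

    def __init__(self, summarize, max_tokens: int = 4000, keep_last: int = 5):
        self.summarize = summarize      # callable(prev_summary, old_messages) -> str
        self.max_tokens = max_tokens
        self.keep_last = keep_last
        self.summary = ""               # running distilled summary
        self.messages: list[dict] = []  # recent raw messages

    def _total_tokens(self) -> int:
        return count_tokens(self.summary) + sum(
            count_tokens(m["content"]) for m in self.messages
        )

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self._total_tokens() > self.max_tokens and len(self.messages) > self.keep_last:
            # Distill everything except the most recent messages into the summary
            old = self.messages[:-self.keep_last]
            self.messages = self.messages[-self.keep_last:]
            self.summary = self.summarize(self.summary, old)

    def working_context(self) -> list[dict]:
        """[Summary] + [Last N Messages], ready to send to the model."""
        ctx = []
        if self.summary:
            ctx.append({"role": "system", "content": f"Conversation summary: {self.summary}"})
        return ctx + self.messages
```

In a real agent, `summarize` would itself be an LLM call run asynchronously so the user-facing request is never blocked on compression.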
Long-Term Memory (Persistent Storage)
What if a user mentions "I am allergic to peanuts" on day 1, and on day 400 asks for a recipe? Short-term memory has long forgotten this. Long-term memory continuously extracts semantic "Entities" and "Facts" from the user's conversation stream and writes them to a VectorDB or GraphDB.
Frameworks like Mem0 or Zep automatically monitor the stream, extracting facts (e.g. User_Allergy = Peanuts). When the user asks for a recipe later, the system queries the Long-Term memory vector store for relevant user facts and injects them into the system prompt dynamically.
Code Example
Using structured outputs (JSON Schema) to reliably extract persistent user facts from conversational streams in the background to build Long-Term agentic memory.
from pydantic import BaseModel, Field
import openai

client = openai.OpenAI()

class MemoryExtraction(BaseModel):
    user_facts: list[str] = Field(description="Explicit physical, dietary, or personal facts about the user.")
    core_preferences: list[str] = Field(description="Strong preferences or constraints the user has expressed.")

def extract_long_term_memory(chat_message: str) -> dict:
    """Run asynchronously to extract persistent facts from the user stream."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # Use cheap, fast models for memory extraction
        response_format=MemoryExtraction,  # SDK converts the Pydantic model into a strict JSON Schema
        messages=[
            {"role": "system", "content": "Extract hard facts about the user. Ignore conversational filler. If no facts, return empty lists."},
            {"role": "user", "content": chat_message},
        ],
    )
    return response.choices[0].message.parsed.model_dump()

# User says: "Can you help me plan a trip to Japan? I'm vegan and hate flying."
# Background extraction task runs:
facts = extract_long_term_memory("Can you help me plan a trip to Japan? I'm vegan and hate flying.")

# Extracted facts:
# {'user_facts': ['Dietary restriction: Vegan'], 'core_preferences': ['Dislikes flying']}

# Write these facts (as embeddings) into Pinecone/Mem0.
# Next time the user asks "Plan a vacation", retrieve these facts and inject them into the System Prompt!
Relevance
High - system design for autonomous agents that must maintain state across sessions.