GraphRAG & Knowledge Graphs

Microsoft GraphRAG, entity extraction, and global summarization queries.

When Vector Search Fundamentally Fails

GraphRAG vs Naive RAG

[Diagram: Naive RAG (vector only) retrieves isolated chunks with no relationships between them and misses multi-hop queries. GraphRAG builds a knowledge graph (e.g., CEO —leads→ Acme —makes→ Product) and traverses relationships, handling multi-hop queries.]

Standard RAG is retrieval over isolated chunks. The question "What are the major themes across all of our quarterly earnings reports?" cannot be answered by Top-K retrieval. No single chunk contains a global answer. This is the problem Microsoft's GraphRAG was designed to solve.

Building a Knowledge Graph from Text

GraphRAG uses an LLM to parse documents and extract entities (companies, people, events) and their relationships (Acquired, Reported, Partnered) to construct a directed property graph. Nodes are entities; edges are typed relationships with attributes (date, sentiment, etc.).

For global queries, GraphRAG doesn't retrieve chunks — it traverses graph communities (clusters of tightly connected nodes), generates summaries for each community, and synthesizes a final answer across summaries. This enables cross-document reasoning that defeats standard vector search.
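The community step above can be sketched with off-the-shelf graph clustering. This toy example uses networkx's Louvain implementation (GraphRAG itself uses the closely related Leiden algorithm) on an invented two-cluster graph, and stands in member lists for the LLM-written summaries:

```python
import networkx as nx

# Toy knowledge graph: two tightly connected clusters plus one bridge edge
G = nx.Graph()
G.add_edges_from([
    ("Acme", "CEO_A"), ("Acme", "WidgetX"), ("CEO_A", "WidgetX"),    # company cluster
    ("BioLab", "DrugY"), ("BioLab", "TrialZ"), ("DrugY", "TrialZ"),  # research cluster
    ("Acme", "BioLab"),  # cross-cluster partnership edge
])

# Detect communities (GraphRAG uses Leiden; Louvain is a close stand-in)
communities = nx.community.louvain_communities(G, seed=42)

# In real GraphRAG each community's source text is summarized by an LLM;
# here we just list the members as a placeholder "summary"
summaries = [", ".join(sorted(c)) for c in communities]
for s in summaries:
    print(s)
```

A global query would then be answered by map-reducing over these per-community summaries rather than over raw chunks.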

Local vs. Global Query Routing

Smart GraphRAG systems route queries by type. Local queries ("What did Tim Cook say about AI in Q4 2024?") use standard vector retrieval since they target specific facts. Global queries ("What risks are mentioned across all board meetings?") use graph traversal and community summarization. Misrouting global queries to vector search is a common failure mode.
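A minimal router can be as simple as a keyword heuristic; production systems typically use an LLM classifier instead. The cue phrases below are illustrative assumptions, not an established list:

```python
# Phrases that usually signal a global, corpus-wide question (illustrative only)
GLOBAL_CUES = ("across all", "major themes", "overall", "summarize", "trends")

def route_query(question: str) -> str:
    """Route to 'global' (community summarization) or 'local' (vector retrieval)."""
    q = question.lower()
    if any(cue in q for cue in GLOBAL_CUES):
        return "global"
    return "local"

print(route_query("What did Tim Cook say about AI in Q4 2024?"))          # local
print(route_query("What risks are mentioned across all board meetings?")) # global
```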

Neo4j as the Graph Backend

Production knowledge graphs live in dedicated graph databases like Neo4j. The Cypher query language allows direct relationship traversal that SQL and vector databases cannot express. LangChain has a native GraphCypherQAChain that converts natural language queries into Cypher queries via an LLM, executes them, and returns structured results.

Code Example

Full GraphRAG pipeline: LLM extracts structured graph data from text, ingests into Neo4j, then uses natural language → Cypher to query it. This enables cross-document reasoning impossible with vector search.

```python
import json

from anthropic import Anthropic
from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI

client = Anthropic()

# Step 1: Extract entities and relationships from text using Claude
def extract_knowledge_graph(text: str) -> dict:
    """Use an LLM to extract a structured knowledge graph from raw text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Extract all entities and relationships from this text.
Return ONLY valid JSON in this format:
{{
  "entities": [{{"id": "E1", "name": "Apple Inc", "type": "Company"}}],
  "relationships": [{{"source": "E1", "target": "E2", "type": "ACQUIRED", "year": 2023}}]
}}

TEXT: {text}"""
        }]
    )
    return json.loads(response.content[0].text)

# Step 2: Ingest into Neo4j
def ingest_to_neo4j(graph_data: dict, neo4j_graph: Neo4jGraph):
    for entity in graph_data["entities"]:
        neo4j_graph.query(
            "MERGE (e:Entity {id: $id}) SET e.name = $name, e.type = $type",
            params=entity,
        )
    for rel in graph_data["relationships"]:
        # Cypher cannot parameterize relationship types, so the type is
        # interpolated directly -- validate it against an allowlist in production
        neo4j_graph.query(
            f"""MATCH (a:Entity {{id: $source}}), (b:Entity {{id: $target}})
            MERGE (a)-[r:{rel['type']}]->(b) SET r.year = $year""",
            params=rel,
        )

# Step 3: Natural language -> Cypher -> answer
def query_knowledge_graph(question: str, neo4j_graph: Neo4jGraph):
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    chain = GraphCypherQAChain.from_llm(
        llm=llm,
        graph=neo4j_graph,
        verbose=True,  # Log the generated Cypher queries for debugging
    )
    return chain.invoke({"query": question})

# Example: "What companies did Apple acquire after 2020?"
# LLM generates: MATCH (a:Entity {name: "Apple Inc"})-[r:ACQUIRED]->(b)
#                WHERE r.year > 2020 RETURN b.name
```

Use Cases

Legal firms querying relationships between contracts, parties, and clauses
Financial analysis requiring cross-document entity relationship queries
Drug discovery knowledge bases linking proteins, compounds, and clinical trials

Common Mistakes

Using GraphRAG for simple factual lookups — the overhead is massive and overkill for local queries
Not defining a strict entity extraction schema, causing inconsistent node types in the graph
Forgetting to deduplicate entities (Apple vs Apple Inc vs AAPL) before ingestion
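The deduplication mistake above can be addressed with a canonical alias map plus light normalization before ingestion; a real pipeline would add fuzzy matching or embedding similarity. The alias table here is an illustrative assumption:

```python
# Canonical-name lookup for known entity variants (illustrative)
ALIASES = {
    "apple": "Apple Inc",
    "apple inc": "Apple Inc",
    "aapl": "Apple Inc",
}

def canonicalize(name: str) -> str:
    """Map entity-name variants to one canonical node name before MERGE."""
    key = name.strip().lower().rstrip(".")
    return ALIASES.get(key, name.strip())

# All three variants collapse to a single graph node
print({canonicalize(n) for n in ["Apple", "AAPL", "apple inc."]})  # {'Apple Inc'}
```

Running `canonicalize` on every extracted entity before the Neo4j MERGE step keeps "Apple", "Apple Inc", and "AAPL" from becoming three disconnected nodes.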

Interview Insight

Relevance

High - Cutting-edge architecture for complex enterprise knowledge systems.
