Claude API with LlamaIndex: RAG, Agents, and Query Engines

Use the Claude API with LlamaIndex in Python (2026). Build RAG pipelines, query engines, and LlamaIndex agents backed by Claude Sonnet, with working code examples.

LlamaIndex is a popular framework for building RAG (Retrieval-Augmented Generation) applications. This guide shows how to wire Claude as the LLM backend for LlamaIndex query engines, agents, and pipelines.

Installation

pip install llama-index-core llama-index-llms-anthropic llama-index-embeddings-huggingface

Basic Claude LLM setup

import os
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(
    model="claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=1024,
)

# Direct completion
response = llm.complete("What is retrieval-augmented generation?")
print(response.text)

RAG pipeline: document Q&A with Claude

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use Claude as LLM, local HuggingFace model for embeddings (no OpenAI key needed)
Settings.llm = Anthropic(model="claude-sonnet-4-6")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load documents from a directory (TXT and MD load as plain text;
# PDF parsing additionally needs the llama-index-readers-file package)
documents = SimpleDirectoryReader("./docs").load_data()

# Build vector index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What are the key findings in the Q4 report?")
print(response.response)

# Show source nodes (what chunks were retrieved)
for node in response.source_nodes:
    print(f"  Score: {node.score:.3f} | {node.text[:120]}...")
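
To make the retrieval step less of a black box, here is a toy, standard-library-only sketch of what "retrieve the top-scoring chunks" means. It is not LlamaIndex's implementation: the word-count `embed` function stands in for a real embedding model such as the one configured above, and the sample chunks and the `retrieve` helper are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real pipeline uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the top_k best matches."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical chunks standing in for pieces of the files under ./docs
chunks = [
    "Q4 revenue grew due to strong subscription renewals.",
    "The office relocation is planned for next spring.",
    "Key findings in the Q4 report: churn fell and margins improved.",
]

results = retrieve("What are the key findings in the Q4 report?", chunks)
for score, text in results:
    print(f"  Score: {score:.3f} | {text}")
```

In the real pipeline the retrieved chunks are then placed into the prompt that Claude answers from; the scores printed from `response.source_nodes` play the same role as the similarity values here.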

Chat engine (multi-turn conversation over documents)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic

Settings.llm = Anthropic(model="claude-sonnet-4-6")
# ... (set embed_model as above) ...

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# CondenseQuestionChatEngine rephrases follow-up questions using history
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)

response1 = chat_engine.chat("What is the main topic?")
print(response1.response)

response2 = chat_engine.chat("Can you give me more detail on that?")
print(response2.response)  # Uses conversation history to resolve "that"
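
As a rough illustration of the condense-question idea, the sketch below builds the kind of prompt that asks the LLM to rewrite a follow-up as a standalone question before retrieval runs. The template wording, the `build_condense_prompt` helper, and the sample assistant reply are all hypothetical; LlamaIndex ships its own default prompt.

```python
def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Format chat history plus a follow-up into a 'rewrite as standalone question' prompt.

    The template text is an assumption for illustration, not LlamaIndex's default.
    """
    lines = [f"{role}: {text}" for role, text in history]
    return (
        "Given the conversation below and a follow-up message, rewrite the "
        "follow-up as a standalone question.\n\n"
        "Conversation:\n" + "\n".join(lines) + "\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


# Hypothetical history; the assistant reply is made up for the example
history = [
    ("user", "What is the main topic?"),
    ("assistant", "The documents mainly cover Q4 financial results."),
]
prompt = build_condense_prompt(history, "Can you give me more detail on that?")
print(prompt)
```

The rewritten question, not the raw follow-up, is what gets sent to the query engine, which is why a vague reference like "that" can still retrieve the right chunks.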

ReAct agent with Claude

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.anthropic import Anthropic

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b

def add(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

tools = [
    FunctionTool.from_defaults(fn=multiply),
    FunctionTool.from_defaults(fn=add),
]

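llm = Anthropic(model="claude-sonnet-4-6")
# from_tools() and chat() belong to the classic agent interface; newer LlamaIndex
# releases provide agents under llama_index.core.agent.workflow, so check the
# API of your installed version
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)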

response = agent.chat("What is (23.5 + 14.7) * 3.2?")
print(response.response)
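
The ReAct pattern alternates model "thoughts" with tool calls and observations. The following is a minimal sketch of that loop with scripted strings standing in for live Claude replies, so it runs without an API key. The `Action:` / `Action Input:` text format, the `run_react` helper, and the scripted steps are assumptions for illustration, not LlamaIndex's actual prompt or parser.

```python
import json
import re
from typing import Callable


def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b


def add(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b


TOOLS: dict[str, Callable[..., float]] = {"multiply": multiply, "add": add}
ACTION_RE = re.compile(r"Action: (\w+)\nAction Input: (\{.*\})")


def run_react(model_steps: list[str]) -> tuple[str, list[float]]:
    """Replay scripted model outputs, executing each tool call the 'model' requests."""
    observations: list[float] = []
    for step in model_steps:
        match = ACTION_RE.search(step)
        if match is None:
            # No tool request in this step: treat it as the final answer.
            return step.removeprefix("Answer: "), observations
        tool_name, raw_args = match.groups()
        result = TOOLS[tool_name](**json.loads(raw_args))
        observations.append(result)  # a real agent feeds this back to the LLM
    raise RuntimeError("model never produced a final answer")


# Scripted stand-ins for what the LLM might emit on each turn
scripted = [
    'Thought: add first.\nAction: add\nAction Input: {"a": 23.5, "b": 14.7}',
    'Thought: now multiply.\nAction: multiply\nAction Input: {"a": 38.2, "b": 3.2}',
    "Answer: 122.24",
]
answer, observations = run_react(scripted)
print(answer, observations)
```

With `verbose=True`, the real agent prints a comparable thought, action, and observation trace for each step.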

Streaming responses

from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-sonnet-4-6")

# Stream a completion
streaming_response = llm.stream_complete("Explain vector databases in detail.")
for chunk in streaming_response:
    print(chunk.delta, end="", flush=True)
print()  # newline after stream
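
If you need the full text as well as the incremental output (for logging or caching, say), one option is to accumulate the deltas while printing them. This is a sketch under the assumption that each chunk exposes a `.delta` attribute as in the loop above; `stream_and_collect` is a hypothetical helper, and the fake stream stands in for `llm.stream_complete(...)`.

```python
from types import SimpleNamespace
from typing import Iterable


def stream_and_collect(chunks: Iterable, echo: bool = True) -> str:
    """Print each chunk's delta as it arrives and return the concatenated text."""
    parts: list[str] = []
    for chunk in chunks:
        delta = chunk.delta or ""  # guard against chunks that carry no new text
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    if echo:
        print()  # newline after stream
    return "".join(parts)


# Stand-in for llm.stream_complete(...): objects exposing a .delta attribute
fake_stream = (
    SimpleNamespace(delta=d)
    for d in ["Vector ", "databases ", None, "store embeddings."]
)
full_text = stream_and_collect(fake_stream)
```

Passing the real `streaming_response` in place of `fake_stream` should behave the same way, since the helper only relies on `.delta`.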

LlamaIndex vs LangChain for Claude RAG

| Aspect | LlamaIndex | LangChain |
| --- | --- | --- |
| Primary use case | Document indexing, RAG, structured retrieval | General LLM orchestration, chains, agents |
| Document loaders | 100+ built-in (PDF, Word, Notion, S3, web) | 150+ loaders (broader, but less RAG-focused) |
| RAG ergonomics | One-liner: index.as_query_engine() | Requires explicit chain composition |
| Agent framework | ReActAgent, OpenAI-style function calling | More mature multi-agent support |
| Claude integration | llama-index-llms-anthropic (official) | langchain-anthropic (official) |
| Best for | Document Q&A, knowledge bases, RAG eval | Complex multi-step workflows, custom chains |

Prompt caching with Claude in LlamaIndex

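from llama_index.llms.anthropic import Anthropic

# For finer control, the Anthropic SDK can be used directly inside a custom LLM wrapper.
# LlamaIndex's Anthropic integration passes extra_headers through to the SDK.
# Prompt caching began as a beta feature enabled by this header; it is now
# generally available, so the header may be unnecessary on current API versions.
llm = Anthropic(
    model="claude-sonnet-4-6",
    additional_kwargs={
        "extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}
    }
)
# The header alone caches nothing: the request must also mark the reusable
# prefix (for example a large system prompt) with a cache_control breakpoint.
# Cached input tokens are billed at roughly 10% of the normal input rate, so
# repeated RAG queries can save up to ~90% on the cached part of the prompt.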
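
To see where a figure like "~90%" comes from, here is a back-of-the-envelope estimate. It assumes cache writes cost about 1.25x and cache reads about 0.1x the base input price (the multipliers Anthropic has published for the default short-lived cache), that every request after the first hits the cache, and it ignores output tokens. The workload numbers, the price of 3.0 per million tokens, and the `input_cost` helper are hypothetical.

```python
def input_cost(prefix_tokens: int, query_tokens: int, n_queries: int,
               price_per_mtok: float, cached: bool,
               write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Estimate input-token cost for n_queries that share one large prompt prefix.

    Assumes the first request writes the cache and every later request reads
    it before it expires; output tokens are not included.
    """
    per_tok = price_per_mtok / 1_000_000
    if not cached:
        return n_queries * (prefix_tokens + query_tokens) * per_tok
    first = prefix_tokens * write_mult + query_tokens
    rest = (n_queries - 1) * (prefix_tokens * read_mult + query_tokens)
    return (first + rest) * per_tok


# Hypothetical workload: 20k-token system prompt, 200-token questions, 100 queries
base = input_cost(20_000, 200, 100, price_per_mtok=3.0, cached=False)
with_cache = input_cost(20_000, 200, 100, price_per_mtok=3.0, cached=True)
print(f"uncached ${base:.2f}, cached ${with_cache:.2f}, saving {1 - with_cache / base:.0%}")
```

Under these assumptions the saving on input tokens lands a little below 90%, because the first request pays the cache-write premium and the uncached question tokens are billed at the full rate every time.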

To track Claude API costs across your LlamaIndex RAG pipeline, use the Claude API Cost Calculator. For the equivalent LangChain integration, see the Claude API LangChain guide. For production monitoring of token usage, see the Claude API monitoring guide.

Frequently asked questions

How do I use Claude with LlamaIndex in Python?
Install llama-index-llms-anthropic and llama-index-core. Create an Anthropic LLM object: from llama_index.llms.anthropic import Anthropic; llm = Anthropic(model='claude-sonnet-4-6'). Pass it as llm= when creating a query engine or agent, or assign it to Settings.llm to make it the global default.
What is the difference between LlamaIndex and LangChain with Claude?
Both are orchestration frameworks for building LLM apps. LlamaIndex is optimized for retrieval-augmented generation (RAG) — indexing documents, chunking, embedding, and querying. LangChain is more general-purpose (agents, chains, tools). For document Q&A and RAG over large corpora, LlamaIndex is more ergonomic. Both support Claude as a drop-in LLM.
How do I build a RAG pipeline with Claude and LlamaIndex?
1) Load documents with SimpleDirectoryReader. 2) Build a VectorStoreIndex with your documents and a local embedding model or OpenAI embeddings. 3) Create a query engine with index.as_query_engine(llm=Anthropic(model='claude-sonnet-4-6')). 4) Call query_engine.query('your question'). LlamaIndex handles chunking, embedding, retrieval, and synthesis automatically.
Can I use LlamaIndex agents with Claude?
Yes. Use llama_index.core.agent.ReActAgent.from_tools(tools, llm=Anthropic(model='claude-sonnet-4-6')). Claude's strong instruction-following makes it an excellent ReAct agent backbone. You can mix LlamaIndex built-in tools (file read, code interpreter) with custom FunctionTool wrappers.
Does LlamaIndex support streaming with Claude?
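Yes. Build the query engine with index.as_query_engine(streaming=True), then iterate over response.response_gen or call response.print_response_stream() on the result of query_engine.query(). Alternatively, use llm.stream_complete() directly. The Anthropic LlamaIndex integration wraps Claude's native streaming API.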
