Use the Claude API with LlamaIndex in Python (2026 guide). Build RAG pipelines, query engines, and LlamaIndex agents backed by Claude Sonnet, with working code examples.
LlamaIndex is a popular framework for building RAG (Retrieval-Augmented Generation) applications. This guide shows how to wire Claude as the LLM backend for LlamaIndex query engines, agents, and pipelines.
pip install llama-index-core llama-index-llms-anthropic llama-index-embeddings-huggingface
import os
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(
    model="claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=1024,
)
# Direct completion
response = llm.complete("What is retrieval-augmented generation?")
print(response.text)
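The setup above reads the key with `os.environ["ANTHROPIC_API_KEY"]`, which raises a bare `KeyError` when the variable is unset. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper, not part of LlamaIndex or the Anthropic SDK, and the placeholder key is for illustration only.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return a non-empty environment variable or fail with a readable error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the LLM client")
    return value


# Demo with a fake mapping so the sketch runs anywhere.
# In real code, call require_env("ANTHROPIC_API_KEY") and pass the result as api_key.
demo_key = require_env("ANTHROPIC_API_KEY", {"ANTHROPIC_API_KEY": "placeholder-key"})
print(len(demo_key) > 0)
```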
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use Claude as LLM, local HuggingFace model for embeddings (no OpenAI key needed)
Settings.llm = Anthropic(model="claude-sonnet-4-6")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Load documents from a directory (TXT and MD work out of the box;
# PDF parsing may need the llama-index-readers-file package)
documents = SimpleDirectoryReader("./docs").load_data()
# Build vector index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What are the key findings in the Q4 report?")
print(response.response)
# Show source nodes (what chunks were retrieved)
for node in response.source_nodes:
    print(f" Score: {node.score:.3f} | {node.text[:120]}...")
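To make the retrieve-then-generate flow behind the query engine concrete, here is a toy sketch using only the standard library. It is not how LlamaIndex works internally: bag-of-words cosine similarity stands in for the embedding model, the three chunks are made-up sample data, and `retrieve` and `build_prompt` are hypothetical helper names. The final prompt is what would be handed to the LLM.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the top_k best."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def build_prompt(query: str, hits: list[tuple[float, str]]) -> str:
    """Place the retrieved chunks ahead of the question."""
    context = "\n".join(f"- {text}" for _, text in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


# Made-up sample chunks, standing in for nodes parsed from ./docs
chunks = [
    "Q4 revenue grew compared with Q3.",
    "Office relocation completed during March.",
    "Key findings of the Q4 report: churn fell and revenue grew.",
]
question = "What are the key findings in the Q4 report?"
hits = retrieve(question, chunks)
prompt = build_prompt(question, hits)
print(prompt)
```

The scored `hits` list plays the role of `response.source_nodes`: each entry pairs a similarity score with the chunk text that was retrieved.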
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
Settings.llm = Anthropic(model="claude-sonnet-4-6")
# ... (set embed_model as above) ...
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# CondenseQuestionChatEngine rephrases follow-up questions using history
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response1 = chat_engine.chat("What is the main topic?")
print(response1.response)
response2 = chat_engine.chat("Can you give me more detail on that?")
print(response2.response) # Uses conversation history to resolve "that"
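The condense-question step can be pictured as a prompt that packs the history and the follow-up together, so the model can rewrite "that" into a standalone question before retrieval runs. The sketch below is an illustration under assumptions: the prompt wording and the `build_condense_prompt` helper are hypothetical, not LlamaIndex's actual template, and the assistant turn is a placeholder.

```python
def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Render chat history plus a follow-up into one rewrite request."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
        f"{transcript}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


# Placeholder history; the assistant text would be the real first answer.
history = [
    ("user", "What is the main topic?"),
    ("assistant", "<answer from the first turn>"),
]
condense_prompt = build_condense_prompt(history, "Can you give me more detail on that?")
print(condense_prompt)
```

The rewritten question, not the raw follow-up, is what then gets sent to the query engine.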
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.anthropic import Anthropic
def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b
def add(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b
tools = [
    FunctionTool.from_defaults(fn=multiply),
    FunctionTool.from_defaults(fn=add),
]
llm = Anthropic(model="claude-sonnet-4-6")
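# Classic agent API (from_tools / chat). Newer LlamaIndex releases move agents to
# llama_index.core.agent.workflow and run them with `await agent.run(...)`.
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)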
response = agent.chat("What is (23.5 + 14.7) * 3.2?")
print(response.response)
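For the question above, a ReAct-style agent is expected to call `add` first and feed the observation into `multiply`. The sketch below replays that loop without any LLM: the `scripted_actions` list is a hand-written stand-in for what the model might emit, and the `"$prev"` placeholder and `run_actions` helper are hypothetical conventions for this illustration, not part of LlamaIndex.

```python
from typing import Callable


def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b


def add(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b


# Tool registry: the agent picks a tool by name, the loop dispatches the call.
TOOLS: dict[str, Callable[..., float]] = {"multiply": multiply, "add": add}

# Hand-written stand-in for model output; "$prev" means "the last observation".
scripted_actions = [
    {"tool": "add", "args": {"a": 23.5, "b": 14.7}},
    {"tool": "multiply", "args": {"a": "$prev", "b": 3.2}},
]


def run_actions(actions: list[dict]) -> float:
    """Execute each scripted action and thread observations forward."""
    prev = None
    for step in actions:
        args = {k: (prev if v == "$prev" else v) for k, v in step["args"].items()}
        prev = TOOLS[step["tool"]](**args)
        print(f"Action: {step['tool']} {args} -> Observation: {prev}")
    return prev


result = run_actions(scripted_actions)
print(round(result, 2))
```

In the real agent, the model chooses each action after reading the previous observation; here the sequence is fixed so the dispatch mechanics are easy to see.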
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-sonnet-4-6")
# Stream a completion
streaming_response = llm.stream_complete("Explain vector databases in detail.")
for chunk in streaming_response:
    print(chunk.delta, end="", flush=True)
print() # newline after stream
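When streaming, it is often useful to keep the full text as well as printing deltas as they arrive. The sketch below shows that pattern against a fake stream so it runs without an API key: `FakeChunk` and `fake_stream` are hypothetical stand-ins that only mimic the `.delta` attribute used above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk; only the .delta field is modeled."""
    delta: str


def fake_stream() -> Iterator[FakeChunk]:
    """Yield a few made-up deltas in place of a real model stream."""
    for piece in ["Vector ", "databases ", "store ", "embeddings."]:
        yield FakeChunk(delta=piece)


def collect(stream: Iterable[FakeChunk]) -> str:
    """Print deltas as they arrive and return the concatenated text."""
    parts = []
    for chunk in stream:
        print(chunk.delta, end="", flush=True)
        parts.append(chunk.delta or "")
    print()  # newline after stream
    return "".join(parts)


full_text = collect(fake_stream())
```

Swapping `fake_stream()` for the real `llm.stream_complete(...)` call should work the same way, assuming each chunk exposes `.delta` as in the example above.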
| Aspect | LlamaIndex | LangChain |
|---|---|---|
| Primary use case | Document indexing, RAG, structured retrieval | General LLM orchestration, chains, agents |
| Document loaders | 100+ built-in (PDF, Word, Notion, S3, web) | 150+ loaders (broader, but less RAG-focused) |
| RAG ergonomics | One-liner: index.as_query_engine() | Requires explicit chain composition |
| Agent framework | ReActAgent, OpenAI-style function calling | More mature multi-agent support |
| Claude integration | llama-index-llms-anthropic (official) | langchain-anthropic (official) |
| Best for | Document Q&A, knowledge bases, RAG eval | Complex multi-step workflows, custom chains |
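from llama_index.llms.anthropic import Anthropic
# For advanced caching, use the Anthropic SDK directly inside a custom LLM wrapper.
# LlamaIndex's Anthropic integration passes additional_kwargs (including extra_headers)
# through to the underlying SDK call.
# Note: the header alone does not cache anything; only content marked with
# cache_control is cached, and newer API versions may not need the beta header.
llm = Anthropic(
    model="claude-sonnet-4-6",
    additional_kwargs={
        "extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}
    },
)
# Cached reads are billed at a fraction of the normal input price, so a large system prompt
# reused across RAG queries can cost up to ~90% less for its cached portion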
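A quick back-of-the-envelope check of the savings claim, as a sketch under stated assumptions: cache writes are priced at 1.25x and cache reads at 0.10x the base input price (Anthropic's published multipliers at the time of writing; verify against current pricing), every query after the first hits the cache, and the per-million-token price and token counts are hypothetical. Output tokens are ignored.

```python
def input_costs(
    prefix_tokens: int,
    query_tokens: int,
    n_queries: int,
    price_per_mtok: float,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> tuple[float, float]:
    """Return (uncached, cached) input cost for n_queries sharing one prefix."""
    per_token = price_per_mtok / 1_000_000
    uncached = n_queries * (prefix_tokens + query_tokens) * per_token
    cached_tokens = (
        prefix_tokens * write_mult                      # first call writes the cache
        + (n_queries - 1) * prefix_tokens * read_mult   # later calls read it
        + n_queries * query_tokens                      # per-query text is never cached
    )
    return uncached, cached_tokens * per_token


# Hypothetical workload: 10k-token system prompt, 200-token questions, 100 queries.
uncached, cached = input_costs(10_000, 200, 100, price_per_mtok=3.0)
savings = 1 - cached / uncached
print(f"uncached={uncached:.4f} cached={cached:.4f} savings={savings:.1%}")
```

The saving approaches 90% only when the cached prefix dominates the input; short prompts or queries that miss the cache window save much less.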
To track Claude API costs across your LlamaIndex RAG pipeline, use the Claude API Cost Calculator. For the equivalent LangChain integration, see the Claude API LangChain guide. For production monitoring of token usage, see the Claude API monitoring guide.