Advisor Tool
Pair a faster executor model with a higher-intelligence advisor model that provides strategic guidance mid-generation.
The advisor tool lets a fast, lower-cost executor model (Sonnet or Haiku) consult a high-intelligence advisor model (Opus 4.6) mid-generation. The advisor reads the full conversation and produces a plan or course correction — typically 400–700 text tokens — and the executor continues with the task.
This pattern is well-suited for long-horizon agentic workloads (coding agents, computer use, multi-step research) where most turns are mechanical but having an excellent plan is crucial. You get close to advisor-solo quality while the bulk of token generation happens at executor-model rates.
The advisor tool is in beta. Include anthropic-beta: advisor-tool-2026-03-01 in your requests — LiteLLM adds this automatically when it detects the advisor tool in your tools array.
Supported Providers
| Provider | Chat Completions API | Messages API |
|---|---|---|
| Anthropic API | ✅ | ✅ |
| Azure Anthropic | ❌ (coming soon) | ❌ (coming soon) |
| Google Cloud Vertex AI | ❌ (coming soon) | ❌ (coming soon) |
| Amazon Bedrock | ❌ (coming soon) | ❌ (coming soon) |
Model Compatibility
The executor and advisor models must form a valid pair. Currently the only supported advisor model is claude-opus-4-6.
| Executor | Advisor |
|---|---|
claude-haiku-4-5-20251001 | claude-opus-4-6 |
claude-sonnet-4-6 | claude-opus-4-6 |
claude-opus-4-6 | claude-opus-4-6 |
Chat Completions API
SDK Usage
Basic Example
import litellm
response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
)
print(response.choices[0].message.content)
With Optional Parameters
import litellm
response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Build a REST API with authentication in Python."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
"max_uses": 3, # cap advisor calls per request
"caching": {"type": "ephemeral", "ttl": "5m"}, # enable for 3+ calls per conversation
}
],
max_tokens=4096,
)
Streaming
import litellm
response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Implement a distributed rate limiter."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
The advisor sub-inference does not stream. The executor's stream pauses while the advisor runs, then the full advisor result arrives in a single event. Executor output resumes streaming afterward.
Multi-Turn Conversation
import litellm
tools = [
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
]
messages = [
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
]
response = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=messages,
tools=tools,
max_tokens=4096,
)
# Append the full response (includes server_tool_use + advisor_tool_result blocks)
messages.append({"role": "assistant", "content": response.choices[0].message.content})
# Continue the conversation — keep the same tools array
messages.append({"role": "user", "content": "Now add a max-in-flight limit of 10."})
response2 = litellm.completion(
model="anthropic/claude-sonnet-4-6",
messages=messages,
tools=tools,
max_tokens=4096,
)
LiteLLM automatically strips advisor_tool_result blocks from message history when the advisor tool is not present in the current request. This prevents the Anthropic 400 error that would otherwise occur.
AI Gateway Usage
Proxy Configuration
model_list:
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
Client Request via Proxy
from openai import OpenAI
client = OpenAI(
api_key="your-litellm-proxy-key",
base_url="http://0.0.0.0:4000/v1"
)
response = client.chat.completions.create(
model="claude-sonnet",
messages=[
{"role": "user", "content": "Implement a distributed rate limiter in Python."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
)
Messages API
SDK Usage
Basic Example
import asyncio
import litellm
async def main():
response = await litellm.anthropic.messages.acreate(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
)
print(response)
asyncio.run(main())
Streaming
import asyncio
import json
import litellm
async def main():
response = await litellm.anthropic.messages.acreate(
model="anthropic/claude-sonnet-4-6",
messages=[
{"role": "user", "content": "Implement a distributed rate limiter."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
max_tokens=4096,
stream=True,
)
async for chunk in response:
if isinstance(chunk, bytes):
for line in chunk.decode("utf-8").split("\n"):
if line.startswith("data: "):
try:
print(json.loads(line[6:]))
except json.JSONDecodeError:
pass
asyncio.run(main())
AI Gateway Usage
Proxy Configuration
model_list:
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
Client Request via Proxy (Anthropic SDK)
import anthropic
client = anthropic.Anthropic(
api_key="your-litellm-proxy-key",
base_url="http://0.0.0.0:4000"
)
response = client.beta.messages.create(
model="claude-sonnet",
max_tokens=4096,
betas=["advisor-tool-2026-03-01"],
messages=[
{"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
],
tools=[
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
],
)
print(response)
Response Structure
A successful advisor call returns server_tool_use and advisor_tool_result blocks in the assistant content:
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "Let me consult the advisor on this."
},
{
"type": "server_tool_use",
"id": "srvtoolu_abc123",
"name": "advisor",
"input": {}
},
{
"type": "advisor_tool_result",
"tool_use_id": "srvtoolu_abc123",
"content": {
"type": "advisor_result",
"text": "Use a channel-based coordination pattern. The tricky part is draining in-flight work during shutdown: close the input channel first, then wait on a WaitGroup..."
}
},
{
"type": "text",
"text": "Here's the implementation using a channel-based coordination pattern..."
}
]
}
Pass the full assistant content, including advisor blocks, back on subsequent turns. LiteLLM handles this automatically through provider_specific_fields.
Cost Control
Advisor calls run as a separate sub-inference billed at the advisor model's rates. Usage is reported in usage.iterations[]:
{
"usage": {
"input_tokens": 412,
"output_tokens": 531,
"iterations": [
{
"type": "message",
"input_tokens": 412,
"output_tokens": 89
},
{
"type": "advisor_message",
"model": "claude-opus-4-6",
"input_tokens": 823,
"output_tokens": 1612
},
{
"type": "message",
"input_tokens": 1348,
"output_tokens": 442
}
]
}
}
Top-level usage reflects executor tokens only. Advisor tokens appear in iterations entries with type: "advisor_message" and are billed at Opus rates.
Tips:
- Enable
cachingon the tool definition only when you expect 3+ advisor calls per conversation; it costs more than it saves below that threshold. - Use
max_usesto cap advisor calls per request. Once reached, the executor continues without further advice. - For conversation-level caps, count advisor calls client-side. When you reach your limit, remove the advisor tool from
tools.
Recommended System Prompt
For coding and agent tasks, Anthropic recommends prepending these blocks to your system prompt for consistent advisor timing and optimal cost/quality:
You have access to an `advisor` tool backed by a stronger reviewer model. It takes NO parameters — when you call advisor(), your entire conversation history is automatically forwarded. They see the task, every tool call you've made, every result you've seen.
Call advisor BEFORE substantive work — before writing, before committing to an interpretation, before building on an assumption. If the task requires orientation first (finding files, fetching a source, seeing what's there), do that, then call advisor. Orientation is not substantive work. Writing, editing, and declaring an answer are.
Also call advisor:
- When you believe the task is complete. BEFORE this call, make your deliverable durable: write the file, save the result, commit the change.
- When stuck — errors recurring, approach not converging, results that don't fit.
- When considering a change of approach.
On tasks longer than a few steps, call advisor at least once before committing to an approach and once before declaring done. On short reactive tasks where the next action is dictated by tool output you just read, you don't need to keep calling.
Give the advice serious weight. If you follow a step and it fails empirically, or you have primary-source evidence that contradicts a specific claim, adapt. A passing self-test is not evidence the advice is wrong.
If you've already retrieved data pointing one way and the advisor points another: don't silently switch. Surface the conflict in one more advisor call — "I found X, you suggest Y, which constraint breaks the tie?"
To reduce advisor output length by 35–45% without losing quality, add:
The advisor should respond in under 100 words and use enumerated steps, not explanations.