Website audit: when you need it, what it costs, what it contains in 2026

What Claude Agent SDK actually is

Claude Agent SDK is Anthropic’s official framework for AI agents, available in Python and TypeScript. Unlike the plain Claude API which handles single completion calls, the SDK is opinionated: it assumes you are building an agent with a long-running session, persistent state, tool use, and subagents. The framework exposes ready-made abstractions for things every team would otherwise write themselves: a prompt caching layer, a message queue, a tool dispatcher, streaming support, and JSON schema validation.

Under the hood the SDK talks to the same HTTP endpoints as the Claude API. The difference is in the application layer. Picture two scenarios: a chatbot that answers a question once and forgets (Claude API in completion mode), versus an operational assistant that runs a 30-minute conversation with a client, reads their calendar, updates the CRM, and drafts an email for approval (Claude Agent SDK). The first takes 20 lines of code with the anthropic library; the second requires state management, retry policy, tools with different costs and side effects. The SDK handles the second case without manual glue between components.

The SDK is open source under MIT, distributed via pip and npm. Updates ship alongside new Claude models (Sonnet 4, Opus 4.7, Haiku 4.5), which means new API features (extended thinking, computer use, batch processing) land in the SDK on launch day. That is a meaningful difference from LangChain, where support for new models often arrives weeks later and tends to be incomplete.

Claude Agent SDK versus vanilla Claude API

The choice between SDK and plain API is not religious: it depends on workload profile and the features you actually want. For a single call without context (classifying an email, extracting fields from a single invoice, generating a short product description) the plain API is lighter. You do not need state, you do not need a tool dispatcher, you do not need caching. You import anthropic, you call messages.create, you are done.

The SDK earns its keep in four situations. First: multi-turn conversation where you preserve history between exchanges. Second: regular tool calls (Gmail, Calendar, CRM, custom APIs) where the SDK handles dispatch, validation, and error handling. Third: aggressive prompt caching, where 90 percent of input cost disappears after the first exchange. Fourth: subagent orchestration, where the SDK uses the built-in Task tool to delegate work to parallel Claude instances.

Practical threshold: if your workflow has more than 3 tool calls per session or more than 5 conversation turns, the SDK pays off from day one. The plain API would become glue between pieces of logic you would have to maintain yourself. An engineer needs 2-3 days to write what the SDK provides out of the box, and then an hour each month to patch regressions whenever Claude updates.

There is a third option: use the SDK through Claude Code (CLI/IDE) rather than building your own agent. Claude Code uses Agent SDK internally and exposes ready-made tools for files, bash, and webfetch. For internal automation (deploy scripts, code refactors, content generation) Claude Code is often the faster path than a custom agent. A custom SDK build makes sense when the agent will be exposed to a client (Telegram bot, dashboard, embedded chat) or when you need your own runtime (Lambda, Cloud Run, on-prem).

Four patterns for SMBs where the SDK earns its keep

First pattern: a Telegram assistant per client with a persona. The agent receives messages from the business owner, has access to Google Calendar, Gmail, and a simple CRM (Notion, ClickUp, Airtable). It answers questions like “when did we last speak with Jan Kowalski”, “send a draft offer to Anna from lead 0042”, “summarize the last 5 supplier emails”. The persona in the system prompt reflects the client’s style: greeting, signature, banned-words list. We cover this pattern in more depth in the article Telegram AI bot for business.

Second pattern: a workflow orchestrator with many tools. Here the agent receives a high-level instruction (“prepare the October sales report”) and decides itself which sequence of tools to call: SQL query against the database, aggregation in Python, formatting to PDF, sending by email. The SDK manages it as a session: if a tool returns an error, the agent can try a different path without crashing the whole workflow. This is where traditional n8n or Make have limits (rigid graphs), while SDK plus Claude picks the order at runtime.

Third pattern: a document processing pipeline. The agent receives a file (invoice, contract, report), uses an OCR tool (Mistral OCR, Google Document AI) to extract raw text, classifies the document type, extracts specific fields by schema, verifies consistency (sums match line items, dates are realistic), and writes the result to a database. The pipeline can process 10-30 documents a day with a full audit log of every decision. We cover this pattern more broadly in AI implementation in business in the accounting automation section.

Fourth pattern: a customer support agent with escalation. The agent reads incoming support emails, classifies them (invoice, complaint, product question, quote request), finds related information in a knowledge base, generates a draft reply, checks whether the answer requires human review (amount above threshold, strategic client, negative sentiment), and either sends automatically or moves to the operator-review queue. This pattern is the core of B2B customer service automation.

Prompt caching, why it matters

Prompt caching is a mechanism in the Claude API where part of the input (system prompt, tool definitions, earlier turns) is cached on Anthropic’s server for 5 minutes. Every subsequent call within that window pays only 10 percent of the standard rate for the cached portion. For agents with a large system prompt (5-15 thousand tokens: role, brand voice, tool list, escalation policy, few-shot examples) this is the difference between business viability and its absence.

A real example from our portfolio: an internal multi-bot uses a system prompt of roughly 8 thousand tokens per bot. Without caching: each exchange costs 0.024 dollars of input (8000 by 3 dollars per million tokens). With caching after the first exchange: 0.0024 dollars. Monthly scale at 200 exchanges a day, 30 days, 5 bots: 720 dollars without caching, 72 dollars with caching. A tenfold difference decides whether an 800 PLN/mo retainer for a client is profitable.

The SDK handles caching declaratively. You define cache control breakpoints in the system prompt and tools array. The SDK keeps the breakpoints consistent between exchanges (one wrong character in the cached prefix invalidates the cache). Without the SDK you would be debugging why cache hit rate dropped from 95 percent to zero after you added one tool, manually watching token order and gaps.

Trap: prompt caching has a 5-minute window (or 1 hour in the extended variant). If the agent is used rarely (say once every 30 minutes), cache miss will be the rule and caching does not help. The pattern works for agents with continuous traffic: support chatbots, internal company assistants, pipelines processing a stream of documents. For agents invoked occasionally (a monthly report) caching is irrelevant, but the costs are low anyway.

Second trap: the “cache write” fee on the first exchange. Cache write costs 1.25 times more than a normal input token (3.75 dollars per million versus 3 dollars). For rarely used agents that is an overpay. Rule: count break-even at 3-4 exchanges within the 5-minute window; below that, caching does not return. For our AI Assistant pipeline we count break-even at 8-10 exchanges per day per client: most paying-interested clients sit in the 30-150 range, so caching pays off from day one. Monitoring cache hit rate is part of the retainer (an alarm fires when it drops below 70 percent, usually signaling a system-prompt regression).

Tool use in the SDK, implementation pattern

Tool use in the SDK relies on defining tools as JSON Schema with a name, description, input parameters, and (on the code side) the function that executes them. The agent decides on its own when and with what parameters to call a given tool. The SDK serializes the call, waits for the function’s result, injects the result back into the conversation, and the agent keeps reasoning. This is a loop that runs until the agent decides it has an answer for the user.

Key distinction: sync versus async tools. Sync are fast operations where the agent can wait synchronously (SQL query, calculation, file fetch). Async are long-running operations (image generation, sending an email with delivery confirmation, page scraping). The SDK supports both styles, but for async the preferred pattern is exposing a “submit_job” tool and a second “check_job_status” tool, where the agent orchestrates polling itself. Anthropic documents this in detail in the Claude Agent SDK documentation.

Second issue: idempotency for tools with side effects. Every tool that sends an email, creates an invoice, or adds a CRM record must be idempotent. Reason: the agent may retry after a timeout. Without idempotency the client receives 3 identical offers. Pattern: each call gets a unique idempotency key (a UUID generated by the agent), the tool checks whether an operation with that key was already executed, and if so returns the previous result. Implementation: a SQLite table or Redis with a 24-hour TTL.

Third issue: per-tool error handling. Bad patterns include throwing exceptions that kill the session, returning an empty string (the agent cannot tell whether it is no-data or an error), or returning a bare HTTP 500 without context. A good pattern: each tool returns a structured response object with a success flag, the data payload, an error message, and a retry-after hint. The agent sees the error, can retry once, can try a different path, can escalate to an operator. The SDK does not enforce this structurally, but good per-team conventions are critical.

Fourth issue: token budget per tool. Some tools return large blocks of text (scraping a full page, listing 100 emails, dumping 5000 records). Without a budget the agent can fill its context after 3 calls. Pattern: each tool declares a maximum number of tokens it may return, and if the result exceeds it, the tool returns a summary plus pagination offsets. The agent can call “fetch pages 2-5 of the result if needed”. This discipline matters especially for tools aggregating data (reports, listings, search).

Subagents (Task tool), when they are worth it

The Task tool in the SDK lets the main agent delegate a sub-task to a new, isolated Claude instance. The subagent gets its own system prompt, its own tools (usually a narrower set), and its own context window. When it finishes, it returns a summary to the parent agent. The parent sees only the result, not the full subagent transcript. This is the foundation of scalable agent architectures.

First use case: research delegation. The parent receives a question that requires analyzing 20 web pages. Rather than reading every page itself (it would fill its context window in 5 minutes), the parent dispatches 4 subagents at 5 pages each; each returns a 200-word summary; the parent aggregates. Second use case: parallelism. The parent must analyze 10 documents, each independently. Dispatch 10 subagents in parallel, each with its own context, aggregate results. Third use case: specialization. The subagent has its own system prompt with an expert role (for example, “you are a lawyer analyzing SaaS contracts”), different constraints, different tools.

Key benefit: context isolation. The parent has a clean 100 thousand tokens for its own reasoning, subagents work in their own windows, nothing leaks. Without it the parent’s context window would clog with details of subagent work and reasoning quality would drop. With this pattern the agent can operate for hours without exhausting context, only summaries flow up the hierarchy.

When not to use subagents: for simple linear workflows (one input, one output, 1-2 tools) the overhead of delegation outweighs the gain. Each subagent is an extra model call, so for sequential tasks with strong inter-step dependency the parent will execute them faster and cheaper. Rule of thumb: if tasks can be parallelized or require different roles, dispatch subagents; if they are sequential and homogeneous, let the parent run them itself in a tool-use loop.

AI Assistant V0.1 as a productized service

None of the above is abstract: in June 2026 we launch AI Assistant V0.1 as a productized service. It is a Telegram bot per client built on Claude Agent SDK plus an n8n layer for integrations with a low rate of change. Setup: 3000 PLN one-time (build, persona, integrations with Gmail, Calendar, and a simple CRM), retainer 800 PLN/mo (hosting on Mac Mini or VPS, monitoring, monthly improvements, 2 hours of ad-hoc tweaks per month). First paying-interested clients are the owner of a DACH e-commerce business (an operational assistant handling 50 orders a day) and a hospitality client (a booking assistant for accommodation).

Stack behind the facade: Claude Agent SDK (Python) as orchestrator, the python-telegram-bot library for the Telegram Bot API, n8n self-hosted for fast-changing integrations, MCP servers for operations requiring isolation (Gmail OAuth flow, custom CRM API). State in SQLite per client. Voice extraction from 30-50 client emails available as an add-on at plus 1500 PLN. The architecture is covered further in the MCP article and in our n8n versus Make versus Zapier comparison.

If you are wondering whether an AI assistant makes sense in your firm, start with an AI audit (1500 PLN, 4-8 hours of discovery plus a roadmap PDF plus a 1-hour call). The audit maps your processes, identifies 3-5 automation candidates, and produces a concrete prioritized plan. If you decide to implement, the audit cost is deducted from the implementation package. Book 30 minutes through the form on the contact page.

FAQ

Claude Agent SDK versus LangChain versus LlamaIndex, which to pick?

The SDK is Anthropic-native, with the best performance and feature parity for Claude models (cache, tool use, extended thinking). LangChain is multi-LLM, useful when you want to swap easily between OpenAI, Claude, and Mistral. LlamaIndex specializes in RAG (Retrieval-Augmented Generation), where you index a large knowledge base. Our default is Claude Agent SDK, since Claude is our production model and prompt caching requires the native SDK.

What are the real costs of Claude Agent SDK?

The SDK is free (MIT). You pay only for Claude API tokens: Sonnet 4 is 3 dollars per million input tokens and 15 dollars per million output. With prompt caching effective costs drop 5-10 times. Plus hosting (VPS 12-30 dollars per month, or a Mac Mini at 800 dollars one-time plus electricity). For an AI assistant with 200 exchanges a day: 50-150 dollars per month plus hosting.

Does the SDK have acceptable latency for real-time?

Yes, the SDK supports streaming response (token by token), the first response appears 200-500 milliseconds after the call. A full 500-1000 token reply takes 3-8 seconds. For a Telegram bot that is acceptable (the user sees the “typing…” indicator). For applications requiring sub-second responses (a voice telephony agent) prefer Haiku 4.5 over Sonnet; latency drops to 100-300 milliseconds.

Self-host or cloud, where to host the agent?

The SDK is a Python or TypeScript library, you host it anywhere you have a runtime: VPS (Hetzner, DigitalOcean), serverless (AWS Lambda, Vercel Functions, Cloud Run), or bare metal (Mac Mini in the office). For SMBs we recommend a Mac Mini M4 (one-time cost, no monthly fees, physical control over data) or a Hetzner CPX21 VPS (12 euros a month). Cloud serverless makes sense for spiky traffic, but for an agent with a persistent WebSocket connection a long-lived host is better.