On May 20, 2026 I passed the Claude Certified Architect – Foundations exam. The certificate landed in my inbox shortly after, with a verification URL on Skilljar and a six-month validity window. The credential itself is just a PNG. The journey behind it is what I want to talk about, because the exam is not what most people expect when they hear "AI certification." It is not a quiz about prompt templates and parameter names. It is a brutally practical test of whether you can architect systems with Claude — and most of the questions are designed in a way that will not reward you for memorizing documentation.

I have done this dance before. TOGAF, AWS Solutions Architect Professional, AWS Machine Learning Specialty, and a handful of others. Every certification has its own personality. The Anthropic exam felt different. It was the first certification where I genuinely changed how I built things before the exam, and the changes stuck. That alone makes it worth writing about.
This post is the field guide I wish I had at the start. It covers the structure of the exam, the five domains and what they actually test, the hardest parts for me personally, the study path that worked, and the hands-on exercises that mattered more than any reading. If you are sitting for this exam in the next six months, I hope this saves you a few weekends.
TL;DR (For When Your Standup Starts in 3 Minutes)
- It is a scenario exam, not a trivia exam. 4 of 6 production scenarios are drawn at random — customer support agents, Claude Code workflows, multi-agent research, developer productivity, CI/CD integration, structured extraction. Questions ask you to choose the most effective next step, often with four plausible options.
- Scoring: scaled 100–1,000, passing at 720. Multiple choice, no penalty for guessing, unanswered counts as wrong.
- Weighting matters: Agentic Architecture & Orchestration (27%), Prompt Engineering & Structured Output (20%), Claude Code (20%), Tool Design & MCP (18%), Context Management & Reliability (15%). Invest your time proportionally.
- The trap: several distractors are correct in a vacuum but wrong because they over-engineer, rely on probabilistic compliance where deterministic compliance is required, or solve a different problem than the one asked. You must read scenarios twice.
- What actually moves the needle: building things. I spent more time in
~/.claude/editing skills, rules, and MCP configs than reading docs. Five real hands-on exercises (below) are worth ten passes through the exam guide. - Hardest domain for me: Tool Design & MCP Integration. Easiest: Claude Code Configuration. Surprise difficulty: Context Management & Reliability — the wording of questions in this domain trips up experienced architects who default to "just use a bigger context window."
What the Certification Actually Validates
The exam guide is honest about its target candidate: a solution architect with at least six months of hands-on experience building production systems with Claude. Not a hobbyist. Not someone who has watched a few videos. Someone who has shipped agents into a customer-facing environment, fought with tool selection ambiguity at 2 a.m., and lost a weekend to a context window that quietly degraded across a long session.
The four technologies under test:
- Claude Agent SDK — the Python and TypeScript SDK for building agentic systems. Agentic loops, hooks, subagent spawning, session management.
- Claude Code — the CLI and IDE integration for engineering work.
CLAUDE.md,.claude/skills/,.claude/rules/, plan mode, slash commands, the-pheadless mode, MCP integration. - Model Context Protocol (MCP) — the open standard for connecting Claude to external systems. Server scoping, tool descriptions, structured errors, resources for content catalogs.
- Claude API — the raw HTTP/JSON layer.
tool_usewith JSON schemas,tool_choice,stop_reason, the Message Batches API.
Three things explicitly are not tested, and I want to flag them because they trip up people coming from other certifications: fine-tuning and training custom models, deployment infrastructure for MCP servers (networking, containers, cloud configuration), and cloud-vendor-specific patterns (AWS Bedrock, Vertex AI, Azure). If you came from AWS world, you have to actively unlearn the reflex to reach for an architecture-diagram answer.
Why This Certification Is Genuinely Hard
I have sat for exams where "hard" means "memorize a thousand acronyms." This is not that. The difficulty here is architectural taste. Three reasons it earns its difficulty:
Reason 1: Every Question Has Four Plausible Answers
The exam guide says it plainly: distractors are options that "a candidate with incomplete knowledge or experience might choose." In practice this means the wrong answers are not absurd. They are things you would actually do if you skipped one paragraph of documentation. Consider this real-style question from the practice set:
Question: Production data shows that in 12% of cases, your agent skips
get_customerentirely and callslookup_orderusing only the customer's stated name, occasionally leading to misidentified accounts and incorrect refunds. What change would most effectively address this reliability issue?A) Add a programmatic prerequisite that blocks
lookup_orderandprocess_refundcalls untilget_customerhas returned a verified customer ID.
B) Enhance the system prompt to state that customer verification viaget_customeris mandatory before any order operations.
C) Add few-shot examples showing the agent always callingget_customerfirst, even when customers volunteer order details.
D) Implement a routing classifier that analyzes each request and enables only the subset of tools appropriate for that request type.
The correct answer is A. But B is what most people instinctively try first. C is what the documentation suggests for ambiguous tool selection. D is what an enterprise architect would design after a Friday afternoon whiteboard session. A wins because financial operations require deterministic compliance, and B/C only provide probabilistic compliance — the LLM might follow the instruction 99% of the time, which is not good enough when 1% means an incorrect refund.
This is the entire personality of the exam in one question. The right answer is the one that matches the structural property of the problem, not the one that looks elegant in a slide deck.
Reason 2: It Tests Anti-Patterns Specifically
Domain 1.1 — the agentic loop task statement — explicitly calls out things you must avoid: parsing natural language signals to determine loop termination, setting arbitrary iteration caps as the primary stopping mechanism, checking assistant text content as a completion indicator. The exam will offer you a wrong answer that looks like a sensible engineering decision in isolation. If you have ever built an agent and used a "did the response contain DONE?" check to exit the loop, the exam will punish you for it. The right answer is always: inspect stop_reason. Continue when it is "tool_use", terminate when it is "end_turn".
I had to actively rewrite some of my own agent code based on this. Reading the exam guide was, no exaggeration, a free code review of my last six months of agent work.
Reason 3: Context Management Questions Punish Architects
The Domain 5 questions were where I lost the most points on the practice exam. Architects coming from systems backgrounds have a reflex: if the system runs out of memory, give it more memory. Translated to LLM thinking: if the context window fills up, use a model with a bigger window.
The exam rejects this reflex repeatedly. The correct answers in Domain 5 nearly always involve structural changes: splitting tasks across subagents with isolated context, trimming verbose tool outputs before they accumulate, extracting structured "case facts" outside summarized history, placing key findings at the start of inputs to mitigate the lost-in-the-middle effect. Bigger windows do not solve attention dilution, and the exam knows this.
Domain by Domain: What to Actually Study
The official content outline lists task statements with "Knowledge of" and "Skills in" sub-points. I will not regurgitate it — Anthropic publishes the exam guide and you should read it cover to cover at least twice. What follows is what I found actually mattered per domain, ranked by how much it showed up in the practice questions relative to its weighting.
Domain 1: Agentic Architecture & Orchestration (27%)
The largest domain, and the one that rewards real building experience the most. Master these concretely:
stop_reasonhandling."tool_use"means continue,"end_turn"means stop. Anything else (parsing assistant text, counting iterations) is an anti-pattern.- Coordinator–subagent hub-and-spoke patterns. Subagents do not inherit parent context. You must pass findings explicitly in the subagent's prompt. The
Tasktool is the spawn mechanism, andallowedToolsmust include"Task"for a coordinator to invoke subagents. - Parallel subagent execution. Emit multiple
Tasktool calls in a single coordinator response. Sequential calls across turns are the slow, wrong answer. - Hooks for deterministic compliance.
PostToolUsefor normalizing heterogeneous data formats. Tool-call interception hooks for blocking policy-violating actions (refunds above a threshold). Hooks beat prompts for any rule that must hold 100% of the time. - Task decomposition: prompt chaining vs adaptive decomposition. Chaining for predictable multi-aspect reviews (analyze each file, then a cross-file integration pass). Adaptive for open-ended investigations where each step's findings drive the next.
- Session management.
--resume <session-name>to continue a named session.fork_sessionfor divergent branches from a shared baseline. Start a fresh session with an injected summary when prior tool results are stale.
If you can build a five-agent research system from scratch — coordinator, web search, document analysis, synthesis, fact verification — with parallel spawning, structured error propagation, and provenance tracking, you have over-prepared for this domain. Which is exactly the right amount of preparation.
Domain 2: Tool Design & MCP Integration (18%)
The hardest domain for me, because the wrong answers are so close to correct. Key concepts:
- Tool descriptions are the primary mechanism LLMs use for tool selection. When you see "the agent sometimes calls the wrong tool" in a question, the answer is almost always: improve descriptions to differentiate similar tools, include input formats, edge cases, and boundary explanations. Few-shot examples are a band-aid; better descriptions are the cure.
- Splitting overlapping tools. If you have
analyze_contentandanalyze_documentwith similar descriptions, rename one (e.g.,extract_web_results) and split the other into purpose-specific tools (extract_data_points,summarize_content,verify_claim_against_source). - Structured error responses. Every MCP tool must return
errorCategory(transient/validation/business/permission),isRetryable, and a human-readable description. Generic "operation failed" is the wrong answer everywhere. - Distinction between access failures and valid empty results. A successful query that returned no rows is not an error. A timeout is. The coordinator needs to be able to tell these apart to make recovery decisions.
- Tool distribution discipline. Agents with 18 tools select worse than agents with 4–5. Restrict each subagent's tool set to its specialization. Give scoped cross-role tools (a
verify_factfor synthesis) for high-frequency needs. - MCP server scoping. Project-level
.mcp.json(shared via version control, with environment variable expansion for credentials) vs user-level~/.claude.json(personal, not shared). - MCP resources for content catalogs. Issue summaries, documentation hierarchies, database schemas — exposed as resources rather than requiring exploratory tool calls.
tool_choiceoptions."auto"(may return text),"any"(must call a tool, model picks), forced ({"type": "tool", "name": "..."}).
Domain 3: Claude Code Configuration & Workflows (20%)
The easiest domain if you actually use Claude Code daily. The trap is that the documentation has evolved rapidly, and old habits die hard.
CLAUDE.mdhierarchy. User-level~/.claude/CLAUDE.md(personal), project-level.claude/CLAUDE.mdor rootCLAUDE.md(shared via VCS), directory-level for subdirectory-specific conventions..claude/rules/with YAML frontmatter glob patterns. For conventions that span multiple directories (e.g., all test files via**/*.test.tsx), this beats directory-levelCLAUDE.mdfiles because rules load conditionally based on what is being edited.- Project-scoped slash commands in
.claude/commands/. Shared via version control, available to every developer who clones the repo.~/.claude/commands/is for personal commands. - Skills with
SKILL.mdfrontmatter.context: forkisolates the skill in a sub-agent context — critical for skills that produce verbose output.allowed-toolsrestricts tool access during skill execution.argument-hintprompts the developer when they invoke without arguments. - Plan mode vs direct execution. Plan mode for architectural changes, multi-file migrations, anything with multiple valid approaches. Direct execution for well-scoped changes. The exam will trick you with questions where the requirement explicitly states complexity — choose plan mode.
- CI integration.
claude -p "..."for non-interactive mode.--output-format jsonwith--json-schemafor machine-parseable findings posted as PR comments. - Session context isolation. The same Claude session that generated code is less effective at reviewing it. Use an independent review instance. This generalizes to multi-pass review architectures: split large reviews into focused per-file passes plus a cross-file integration pass.
Domain 4: Prompt Engineering & Structured Output (20%)
The domain that surprised me. I had assumed years of prompt engineering would make this trivial. It did not, because the exam tests a very specific philosophy of prompts — explicit categorical criteria over vague instructions, few-shot examples for ambiguous cases, schemas over prose.
- Explicit criteria over vague instructions. "Flag comments only when claimed behavior contradicts actual code behavior" beats "check that comments are accurate." "Be conservative" and "only report high-confidence findings" are wrong answers everywhere.
- Few-shot examples as the highest-leverage technique. 2–4 targeted examples for ambiguous scenarios, demonstrating reasoning for why one action was chosen. Showing format consistency. Distinguishing acceptable patterns from genuine issues.
tool_usewith JSON schemas for guaranteed structured output. Eliminates JSON syntax errors. Does not eliminate semantic errors (values in wrong fields, line items that don't sum). Schema design matters: required vs optional, nullable fields, enum + "other" + detail string patterns for extensible categories.- Validation-retry loops. Append specific validation errors to the prompt on retry. Retries help with format/structural errors, do not help when the information is absent from the source.
- Message Batches API. 50% cost savings, up to 24-hour window, no SLA, no multi-turn tool calls within a single batch request. Use for non-blocking, latency-tolerant workloads. Do not use for blocking pre-merge checks.
- Multi-instance and multi-pass review. Independent review instances catch issues self-review misses. Per-file local passes plus a cross-file integration pass beat single-pass reviews of large changesets.
Domain 5: Context Management & Reliability (15%)
The smallest domain by weight, but the one where I had to most actively unlearn habits. The mental model: context is a budget, not a resource you can throw money at.
- Progressive summarization is dangerous when it condenses numerical values, percentages, dates, and customer-stated expectations into vague summaries. Extract these as "case facts" into a persistent block included in each prompt, outside the summarized history.
- The lost-in-the-middle effect. Place key findings at the beginning of aggregated inputs. Organize detailed results with explicit section headers.
- Trim verbose tool outputs before they accumulate. If an order lookup returns 40 fields and only 5 are relevant, normalize to those 5 with a hook.
- Subagent delegation for context isolation. Verbose exploration in a subagent that returns a structured summary, not raw output, to the coordinator.
- Scratchpad files for persisting key findings across context boundaries. Each agent exports state to a known location; the coordinator loads a manifest on resume.
- Escalation calibration. Honor explicit customer requests for human agents immediately. Escalate when policy is ambiguous or silent on the customer's specific request. Sentiment-based escalation and LLM self-reported confidence are unreliable proxies for actual case complexity.
- Provenance and uncertainty. Require subagents to output structured claim-source mappings. Annotate conflicts with source attribution rather than arbitrarily selecting one value. Require publication/collection dates to prevent temporal differences from being misinterpreted as contradictions.
My Preparation Path: What Actually Worked
I prepared over roughly six weeks, alongside a full-time role running AI at ScienceSoft. Here is the breakdown of what worked, in order of impact-per-hour.
1. Build the Four Exercises in the Exam Guide (60% of value)
The exam guide ships with four preparation exercises. They are not optional. They are the exam, in slow motion. I will not reproduce them in full — read the guide — but the headlines:
- Build a multi-tool agent with escalation logic. 3–4 MCP tools with detailed differentiated descriptions. Agentic loop driven by
stop_reason. Structured error responses with category and retryable flags. APostToolUsehook that enforces a business rule. Test with multi-concern messages. - Configure Claude Code for a team development workflow. Project-level
CLAUDE.mdwith universal standards..claude/rules/with glob-scoped conventions for different code areas. A project-scoped skill withcontext: forkandallowed-toolsrestrictions. An MCP server in.mcp.jsonwith environment variable expansion. Test plan mode vs direct execution on tasks of varying complexity. - Build a structured data extraction pipeline. JSON schema with required, optional, nullable, and enum+other patterns.
tool_usefor guaranteed schema compliance. Validation-retry loops with error feedback. Few-shot examples for varied document formats. A Message Batches API submission of 100 documents withcustom_idcorrelation. Field-level confidence scores routed to human review. - Design and debug a multi-agent research pipeline. Coordinator with at least two subagents, explicit context passing (no relying on inheritance), parallel
Taskspawning, structured output with provenance, simulated subagent timeout with error propagation, conflicting source data with annotated synthesis output.
I built all four. Two of them — the multi-tool agent and the multi-agent research pipeline — became actual internal tools I still use. The exam questions in those domains were noticeably easier afterward, because I had already hit the exact failure modes the questions describe.
2. Read the Exam Guide Three Times (15% of value)
Three reads, for three different purposes. First read: cover to cover, no notes, just to understand the shape. Second read: with a notebook, copying every "anti-pattern" callout and every "skill in" bullet into flashcards. Third read: as a checklist against my own code — for every "skill in" item, did my code do this? If not, why not?
The third pass is the highest leverage. It turns the exam guide into a personal audit.
3. The Sample Questions and Practice Exam (15% of value)
The exam guide contains 12 sample questions with full explanations. I worked through them in two passes. First pass: answer cold, write down my reasoning, then read the explanation. Second pass: a week later, redo them, see if I had internalized the reasoning or just memorized the answer.
Of the 12, I got 8 right on first attempt. The four I missed were all in Domain 5 — context management — and all involved me defaulting to "use a bigger window" or "use a smarter model" instead of restructuring the context flow.
The practice exam (linked separately from the guide) is closer to real exam difficulty. Do it at least once under timed conditions. If you score below 75% on the practice, you are not ready — go back to the exercises.
4. Documentation Deep Dives (10% of value)
Specifically, the Claude Agent SDK reference for AgentDefinition, Task, hooks, and session management. The MCP specification for tool descriptions, errors, and resources. The Claude Code documentation pages for skills, rules, slash commands, and plan mode. The API reference for tool_use and tool_choice.
I deliberately did not spend much time on tutorials and blog posts. The exam tests current behavior, and content older than a few months is often subtly wrong because the surface evolves rapidly. Stick to the official references.
Exam Day: What to Expect
The exam is online-proctored. You will be asked to show a clean desk, identify yourself, and not look away from the screen for the duration. Plan accordingly: clear the area, queue up a glass of water, use the bathroom beforehand.
You get four scenarios drawn at random from the six. Each scenario has multiple questions. Take the full time — there is no bonus for finishing early, and the questions reward rereading.
My approach during the exam:
- Read the scenario twice before reading any questions. The scenario constrains what "correct" means. A question that asks for "the most effective first step" in a customer-support scenario where 80% first-contact resolution is the target is different from the same question in a research scenario.
- For every question, identify the trap before identifying the answer. Which option looks correct but isn't? That gives you the structural property the question is testing. Then find the answer that matches the property.
- Flag, don't skip. Mark anything you are less than 80% confident on. Come back to it with fresh eyes after finishing the rest.
- Trust your first instinct on the easy ones. Second-guessing on a Domain 3 Claude Code question because you started imagining edge cases that aren't in the scenario is how you lose points you should not lose.
The Hardest Questions for Me
I cannot share actual exam content. But the style of question I found hardest fell into three families:
Family 1: "Improve tool selection" questions. The instinct is to add few-shot examples. The correct answer is usually to improve tool descriptions first, because few-shot examples paper over an underlying ambiguity that descriptions can resolve at the root. I got this wrong twice on practice.
Family 2: "Why is the multi-agent system producing incomplete results" questions. The instinct is to blame the subagents — they must be filtering, or returning wrong format, or hitting context limits. The correct answer is often that the coordinator's task decomposition is too narrow, and the subagents are correctly executing on too-small assignments. The logs reveal it; you have to read them carefully.
Family 3: "Should we switch to the batch API" questions. The exam loves to give you a workload with two parts — one blocking, one overnight — and ask whether to switch both, neither, or one. The correct answer is almost always: only the non-blocking part. The trap is the 50% cost savings making "switch both" feel like the responsible choice.
What Changed in My Work After the Exam
The most genuinely useful side effect of preparing for this exam: my agents got better. Specifically:
- I deleted the iteration caps from three internal agents and replaced them with proper
stop_reasonhandling. They are now more reliable on long tasks. - I refactored two MCP tool descriptions to remove overlap, and the tool-misrouting bugs we had been seeing in production dropped to near zero.
- I moved three business rules from system prompts into
PostToolUsehooks. The "agent occasionally violates this rule under load" bug class disappeared. - I introduced
.claude/rules/with glob-scoped conventions for our test files. New team members now get our test conventions applied automatically, without needing to read a separate doc. - I added structured "case facts" extraction to a customer-support pilot. Long sessions stopped losing the customer's stated expectations.
These were all things I "knew" before, in the sense that I had read about them. Preparing for the exam forced me to do them. That is the value of the certification — not the badge, but the audit it forces you to perform on your own work.
A Note for Employers and Hiring Managers
If you are hiring an architect for Claude-based systems, this certification is a useful (though not sufficient) filter. A candidate who has earned it can be expected to know the difference between probabilistic and deterministic compliance, to reach for hooks before prompts when reliability matters, to design tools with structured errors, to set up Claude Code with project-scoped configuration that benefits the whole team, and to architect multi-agent systems with explicit context passing rather than relying on inheritance.
What it does not validate: model selection economics, infrastructure cost optimization, training and fine-tuning, evaluation methodology beyond field-level confidence scoring, production observability beyond what the SDK provides. For those, you still need separate signals.
Closing
The Claude Certified Architect – Foundations exam is six months old at the time of this writing. It is a young credential, and Anthropic will iterate on it. I expect the next version to add depth on evaluation methodology, prompt caching internals, and cross-provider patterns. None of that diminishes the current version.
If you are an architect building production systems with Claude, this is worth doing. Not because the certificate impresses anyone — most hiring managers will not have heard of it yet — but because the preparation itself will make your systems better. The exam is a forcing function for the kind of architectural taste that does not come from reading documentation, only from building things, breaking them, and reading the postmortems.
I am now studying for the next tier. Anthropic has not announced when "Professional" or "Specialty" versions will land, but the Foundations exam is explicitly labeled "Foundations" for a reason. Whatever comes next, the agents I build between now and then will be the real preparation.
If you sit for the exam, drop me a note on LinkedIn or Telegram — I would like to hear how the questions evolve. And if you found this guide useful, the highest compliment is to write your own once you pass. The community on this is still small. Let's grow it.
Credential verification: verify.skilljar.com/c/v9axz9t5re5n · Certificate v9axz9t5re5n · Valid May 20, 2026 – Nov. 20, 2026.