Building Evals for an AI Agent: From Zero to Consistency Testing

We’re building an AI agent that triages cloud security findings. It reads a finding from AWS Security Hub or Prowler, assesses the risk, and tells an engineer exactly what to do about it with specific AWS CLI commands they can run.

The agent worked. We had 620 unit tests proving the code was correct. But we had zero tests proving the agent’s output was correct. A model upgrade could silently change “Very High” risk to “Moderate” on a root account without MFA, and we’d have no idea until a customer noticed.

What we needed was an eval: give the agent an input, then apply grading logic to its output to measure success. This post describes how we built an eval suite from scratch: the design decisions, the problems we hit, and what the evals actually found. This is not the only way to do it. But if you’re building an agent and wondering how to get started with evals, this is one path that worked.

What we needed to evaluate
- The layered design
Starting simple: the assertion layer
- The data format
- First results
Adding subjective quality: the LLM judge layer
- Quality results
The surprise payoff: consistency testing
- Risk level variance
- Output structure variance
Beyond the tests: what else you need to build
The cost picture
What we deferred
What we learned

What we needed to evaluate

Our agent takes an OCSF security finding as input and produces a structured analysis as output:

Risk assessment: One of Very High, High, Moderate, Low, Very Low (NIST SP 800-30)
Risk rationale: Why this risk level was assigned
Vulnerability summary: Plain-language explanation for non-security engineers
Remediation steps: Actionable steps with AWS CLI commands
Remediation sources: URLs cited for guidance

We needed to answer three questions:

Correctness: Is the risk level right? Are the remediation steps relevant to the finding?
Quality: Would a cloud engineer actually find this useful?
Consistency: Does the agent give the same answer every time?

Each of these questions has a spectrum of answers. We’re not trying to prove the agent is perfect. We’re building a test suite for regression testing and initial measurement of key performance dimensions. We’ll raise the bar as we gather baseline data and user feedback.

This initial pass is deliberately scoped: 10 findings, enough to be confident we can expand test cases and extend the framework without rewriting everything. If adding case #11 requires refactoring the eval architecture, that means we’ve gotten the design wrong.

The layered design

We designed the eval suite as three layers, each answering one of those questions at a different cost/speed tradeoff:

Figure 1. Eval Suite Architecture. Test data in evals.json feeds three eval layers: (1) assertions with deterministic checks, (2) quality with LLM-as-judge, and (3) consistency measuring cross-trial variance. All three layers write their output to files in eval/outputs.

All three layers are driven by the same evals.json data file. Adding a test case is a JSON edit. All layers pick it up automatically.
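A minimal sketch of a shared loader that every layer could call. It assumes evals.json holds a top-level list of case objects, each with a unique id. That layout and the helper names are illustrative assumptions, not a description of our exact code.

```python
import json
from pathlib import Path


def parse_eval_cases(text):
    """Parse eval cases from JSON text.

    Assumes a top-level list of case objects, each with a unique "id".
    The layout and function names here are illustrative.
    """
    cases = json.loads(text)
    ids = [case["id"] for case in cases]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate eval case ids: {duplicates}")
    return cases


def load_eval_cases(path):
    """Read a data file such as evals.json and return its list of cases."""
    return parse_eval_cases(Path(path).read_text(encoding="utf-8"))
```

Each layer can then parametrize its tests over the returned list, for example with pytest.mark.parametrize, using the case ids as test ids so failures point straight at the JSON entry.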

Assertion layer: Deterministic checks on the agent’s structured output. Fast, cheap, no LLM judge. Answers “is the output correct?” Runs on every merge to main.
Quality layer: LLM-as-judge with custom rubrics. Answers “is the output useful to an engineer?” Expensive, run on-demand / periodically.
Consistency layer: Run each case 3 times, check all trials pass, then analyze cross-trial variance. Answers “does the agent give the same answer reliably?” Run on-demand / periodically.

Let’s dig into each eval layer and see how they build confidence.

Starting simple: the assertion layer

We started with Anthropic’s eval design advice: begin with 20-50 simple tasks drawn from real usage. We drew 10 real security findings from our example inputs: compliance findings, detection findings, exposure findings from AWS Security Hub, Prowler, and k9 Security. That was our starting set.

The first design decision was about what to check. Anthropic recommends combining code-based graders (fast, cheap, objective) with model-based graders (flexible, nuanced). We noticed that most of what we needed to evaluate initially is deterministic:

Is the risk level one of the five valid NIST levels? (exact match)
Is the risk level in the expected range for this finding? (exact match against an acceptable set)
Are there remediation steps? (non-empty check)
Does the vulnerability summary mention the right AWS service and resource? (regex)
Do any remediation steps include CLI commands? (custom check)

None of these need an LLM judge. A regex can tell you whether the summary mentions “S3” or “security group.” An equality check can verify the risk level is “High” or “Very High.”

This became our first design principle: use deterministic checks for everything you can, and reserve LLM judges for questions that genuinely require judgment.

The data format

We wanted adding a new eval case to be a data operation, not a code change. At 10 cases, either approach works. At 100 cases, maintaining Python fixtures becomes a burden. And we wanted to make contributing test cases easy for both people and agents.

We adapted the eval format used by Claude Cowork, which structures test cases as JSON with a prompt, expected output, and a list of typed assertions:

{
  "id": "compliance-root-hardware-mfa",
  "finding_file": "examples/inputs/findings/compliance-finding.hardware-mfa-should-be-enabled.root.json",
  "expected_output": "High or Very High risk for root account missing hardware MFA",
  "assertions": [
    {
      "name": "valid_risk_level",
      "type": "one_of",
      "field": "risk_assessment",
      "value": [ "Very High", "High", "Moderate", "Low", "Very Low"]
    },
    {
      "name": "has_remediation_steps",
      "type": "not_empty",
      "field": "remediation_steps"
    },
    {
      "name": "summary_mentions_mfa_or_root",
      "type": "regex",
      "field": "vulnerability_summary",
      "value": "(?i)(mfa|root|multi.?factor)"
    },
    {
      "name": "has_cli_commands",
      "type": "custom",
      "check": "any_step_has_commands"
    }
  ],
  "stage_assertions": {
    "1_triage": [
      {
        "name": "expected_risk",
        "type": "one_of",
        "field": "risk_assessment",
        "value": [ "Very High", "High"]
      }
    ]
  }
}

The key adaptation: finding_file replaces prompt. The agent’s prompt is built from the finding by the same code path used in production. The eval doesn’t construct synthetic prompts that might diverge from reality.

We split assertions into two groups: common (assertions) and stage-specific (stage_assertions). Common assertions apply to every pipeline stage. They validate that the output is well-formed and relevant: “is the risk level valid?”, “does the summary mention the right service?” Stage-specific assertions apply only to the named stage. For example, the expected_risk check lives under 1_triage with a wider acceptable range than we’d expect from the full pipeline, because Stage 1 triages without account context or live validation.

This split matters because the Stage 1 risk assessment will be refined by later stages with additional data. Context gathering validates the finding against the live account and provides data that affects likelihood and impact estimates. Stage 1 might reasonably assess a stale IAM key anywhere from “Low” to “High.” The full pipeline will narrow that. As a test case’s data moves through each stage of the agent’s pipeline, the stage-specific assertions will verify that new facts are integrated into the agent’s reasoning and that the output gets more specific. Different stages, different expectations, same test case data.
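In code, picking the assertions for one stage can be very small. This is a sketch: the function name assertions_for_stage is hypothetical, and the keys match the JSON above.

```python
def assertions_for_stage(case, stage):
    """Return the common assertions followed by those specific to `stage`.

    A stage with no entry under "stage_assertions" gets only the common set.
    """
    common = case.get("assertions", [])
    stage_specific = case.get("stage_assertions", {}).get(stage, [])
    return [*common, *stage_specific]


# Example case trimmed to assertion names only.
case = {
    "assertions": [{"name": "valid_risk_level"}, {"name": "has_remediation_steps"}],
    "stage_assertions": {"1_triage": [{"name": "expected_risk"}]},
}
triage_names = [a["name"] for a in assertions_for_stage(case, "1_triage")]
```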

Assertion types are intentionally simple: one_of, contains, regex, not_empty, and custom (for checks that need Python logic). A dispatcher function in the test conftest routes each assertion to the right check. Adding a new assertion type is a few lines of Python. Adding a new test case requires adding a JSON object to evals.json and no Python.
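To make that concrete, here is a minimal sketch of such a dispatcher. It is illustrative rather than our exact conftest code: the names run_assertion and CUSTOM_CHECKS are hypothetical, and it assumes each remediation step is a dict with an optional commands list.

```python
import re


def any_step_has_commands(output):
    """Custom check: at least one remediation step carries CLI commands.

    Assumes each step is a dict with an optional "commands" list; the real
    output schema may differ.
    """
    return any(step.get("commands") for step in output.get("remediation_steps", []))


# Registry of custom checks, looked up by the "check" key in evals.json.
CUSTOM_CHECKS = {"any_step_has_commands": any_step_has_commands}


def run_assertion(assertion, output):
    """Route one assertion dict to its check and return True if it passes."""
    kind = assertion["type"]
    if kind == "custom":
        return bool(CUSTOM_CHECKS[assertion["check"]](output))
    value = output.get(assertion["field"])
    if kind == "one_of":
        return value in assertion["value"]
    if kind == "contains":
        return assertion["value"] in (value or "")
    if kind == "regex":
        return re.search(assertion["value"], value or "") is not None
    if kind == "not_empty":
        return bool(value)
    raise ValueError(f"unknown assertion type: {kind}")
```

Raising on an unknown type is a deliberate choice in this sketch: a typo in evals.json fails loudly instead of silently passing.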

First results

All 10 cases passed on the first run. That was reassuring but not very interesting. It told us the agent wasn’t broken, not that the evals were useful. The evals became more useful when we added expected risk levels.

Initially, 8 of 10 cases only checked that the risk level was valid. Any of the five NIST levels would pass. Per Anthropic’s advice to “grade what the agent produced,” we added expected_risk assertions with acceptable ranges for every case. Root without hardware MFA should be Very High or High. A default VPC security group should be Moderate or Low. An admin IAM user without MFA should be Very High or High.

These ranges required judgment calls. Is a stale inactive IAM access key on a user who hasn’t logged in for 4 years a “Moderate” or “Low” risk? Both are defensible. The model sometimes even says “High” when the prompt asks it to consider worst-case consequences. We set the range to ["Low", "Moderate", "High"] and moved on. The eval forces the conversation about what “correct” means at each stage of the pipeline, which is valuable because it helps identify points where we need to improve accuracy or precision in the pipeline.

Adding subjective quality: the LLM judge layer

Deterministic assertions can tell you the risk level is “High” and the summary mentions “S3.” They can’t tell you whether the summary would actually help a Cloud or Platform Engineer understand the issue without being a security expert.

For this, we used the Strands Agents Evals SDK, which provides OutputEvaluator: you define a scoring rubric, and an LLM judge scores the agent’s output against it. We wrote two rubrics:

Triage quality scores the analysis on technical correctness: valid NIST risk level with justified rationale, clear vulnerability summary, actionable and finding-specific remediation steps, specific AWS CLI commands, and correct identification of the affected service and resource type.

Helpfulness scores from the engineer’s perspective: can they understand the vulnerability without security expertise? Does the risk level help them prioritize? Are the remediation steps specific enough to execute with minimal additional research?

Both require >= 80% pass rate across all 10 cases.

One gotcha: the Strands SDK has a built-in HelpfulnessEvaluator, but it’s a TRACE_LEVEL evaluator that requires OpenTelemetry trajectory data, not just the output string. The error you get is:

Evaluator error: Trace parsing requires actual_trajectory to be a Session object, got NoneType.

We didn’t want to add OpenTelemetry instrumentation to our eval suite (yet). OutputEvaluator with a custom rubric measures a similar subjective quality signal without the tracing complexity.

The quality layer runs the same 10 cases through the triage agent, then through the judge model. Each of the two rubrics runs all 10 cases, so that’s 20 agent invocations plus ~20 judge invocations, taking ~16 minutes in total. We run it on-demand, not on every push.

Quality results

The first quality run scored well:

Triage quality: 10/10 passed, all scoring 1.0. Every case had a valid NIST risk level with justified rationale, clear vulnerability summaries, actionable remediation steps with specific CLI commands, and correct service identification.

Helpfulness: 10/10 passed, 9 scoring 1.0 and 1 scoring 0.9. The 0.9 was the root hardware MFA case. The judge noted the analysis was “comprehensive” and “enables immediate action” but docked slightly. This is a legitimate nuance: you can’t programmatically enroll a physical hardware token, so the agent’s CLI commands can guide the process but not complete it. The eval surfaced a real limitation of automated remediation guidance for physical-device controls.

These scores are a baseline, not a finish line. As we add more diverse findings and edge cases, the scores will reveal where the agent’s guidance is weaker. We capture eval data to JSON files so we can inspect and trend it over time.

The surprise payoff: consistency testing

At this point, every assertion test passed. Every quality evaluation passed. The agent was correct and helpful on every single finding. We could have stopped here.

Then we ran each case three times. And things got interesting.

The consistency test uses the pass^k metric (informal explanation, original paper), which measures the probability that all k trials succeed. As k increases, the bar gets higher because you’re demanding consistency across more trials. The metric was introduced specifically for “real-world agent tasks requiring reliability and consistency like customer service.” Anthropic calls it “critical for customer-facing agents.” We set k=3: run each of the 10 cases three times, require all three to pass assertions, then compare the outputs across trials.
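As a sketch of the arithmetic: for a case with n trials of which c passed, pass^k can be estimated as C(c, k) / C(n, k) and then averaged across cases. With n = k = 3 this reduces to the fraction of cases where all three trials passed. The function name and the layout of results below are assumptions for illustration.

```python
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k independent trials all succeed.

    `results` maps case id -> list of per-trial booleans. For a case with n
    trials and c successes, the per-case estimate is C(c, k) / C(n, k);
    the suite-level value is the mean across cases.
    """
    per_case = []
    for trials in results.values():
        n, c = len(trials), sum(trials)
        if n < k:
            raise ValueError("need at least k trials per case")
        # comb(c, k) is 0 when c < k, so a case with too few passes scores 0.
        per_case.append(comb(c, k) / comb(n, k))
    return sum(per_case) / len(per_case)
```

A case that passes two of three trials contributes 2/3 at k=1 but 0 at k=3, which is why the bar rises with k.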

To verify consistency, we built four deterministic checks that compare outputs across the three trials:

Risk level agreement: Did all 3 runs produce the same risk level?
Step count stability: Same number of remediation steps?
Topic consistency: Do the step titles cover the same remediation topics?
Summary similarity: How much keyword overlap between vulnerability summaries?
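The four checks can be sketched roughly as follows. This is an illustration under stated assumptions, not our exact implementation: it treats keyword overlap as shared keywords divided by all keywords seen, uses a tiny hand-picked stopword list, and assumes each remediation step is a dict with a title.

```python
import re
from itertools import combinations

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "for", "on", "with"}


def keywords(text):
    """Lowercased word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def overlap(a, b):
    """Shared keywords divided by all keywords seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consistency_report(trials):
    """Compare outputs from two or more trials of the same case."""
    risks = {t["risk_assessment"] for t in trials}
    step_counts = [len(t["remediation_steps"]) for t in trials]
    topics = [keywords(" ".join(s["title"] for s in t["remediation_steps"])) for t in trials]
    summaries = [keywords(t["vulnerability_summary"]) for t in trials]
    all_topics = set.union(*topics)
    shared_topics = set.intersection(*topics)
    return {
        "risk_consistent": len(risks) == 1,
        "step_spread": max(step_counts) - min(step_counts),
        "topic_ratio": len(shared_topics) / len(all_topics) if all_topics else 1.0,
        "min_summary_overlap": min(overlap(a, b) for a, b in combinations(summaries, 2)),
    }
```

Reporting the minimum pairwise overlap, rather than the mean, keeps one outlier trial from being averaged away.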

Here’s what the consistency tests found.

Risk level variance

The first consistency run found a real issue. The compliance-rotate-iam-key case assessed risk as “Low” in trial 1 and “Moderate” in trials 2 and 3. Our expected range was ["Moderate", "High"], so trial 1 failed. After reviewing the finding, a stale inactive key on a user dormant for 4+ years, we added “Low” to the range. The eval forced us to decide whether the variance was a bug or an acceptable range of judgment.

More importantly, repeated runs identified that the agent is unable to consistently estimate risk within a two-level range at this stage, e.g. ["Moderate", "High"]. After reading the risk reasoning logged by the agent, we concluded the inconsistency is due to a lack of:

clear guidance and reference data to classify risk according to NIST 800-30
specific estimates for likelihood of occurrence and impact of each occurrence

To solve this, we plan to refactor the pipeline to have a discrete risk estimation stage that runs after the context is available to estimate likelihood, impact, and finally risk. The risk estimation stage will also have special reference data and tools to support accurate risk estimation.

Output structure variance

For the securityhub-ebs-snapshot-public case, all three trials passed every assertion. Valid risk level, has remediation steps, mentions “snapshot,” includes CLI commands. A green checkmark on every run.

But the cross-trial analysis told a different story:

Consistency report for securityhub-ebs-snapshot-public (3 trials):
  Risk              CONSISTENT (High: 3/3)
  Steps             VARIABLE (6, 7, 5, spread=2)
  Topics            VARIABLE (14/56 keywords shared, ratio=25%)
  Summary           VARIABLE (min pairwise keyword overlap: 24%)

Only 24% keyword overlap between vulnerability summaries. The model was describing the same vulnerability, public EBS snapshots, with completely different structure and emphasis each time. One trial used “What is the issue?” as a section heading; another used “What is this?” Both correct. Both different enough that an agent or engineer comparing triage reports across findings would see inconsistent formatting. We certainly noticed it in our own ad-hoc testing.

This is the kind of problem that assertions on individual fields can’t catch. The risk level is correct. The remediation steps are present. The keywords match. But the vulnerability summary, which passes through unchanged to the final report the engineer reads, looks different every time. An engineer reviewing multiple triage reports would see inconsistent structure across findings, which undermines confidence in the tool.

We fixed this with a prompt change: explicit section labels (“What is the issue?”, “Why does it matter?”, “What could happen if not fixed?”) that the model must use in every summary. After the change, section headings became consistent across trials. Content within sections still varies (25-35% keyword overlap), but the structural consistency makes reports scannable and comparable. A prompt change, not a code change. Discovered entirely by measuring consistency.
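A deterministic check could guard against this regressing. The sketch below is hypothetical: it assumes the labels appear verbatim in the summary text, and the function name is made up for illustration.

```python
REQUIRED_SECTIONS = (
    "What is the issue?",
    "Why does it matter?",
    "What could happen if not fixed?",
)


def missing_sections(summary, required=REQUIRED_SECTIONS):
    """Return required labels that are absent, or appear out of order, in `summary`."""
    missing, cursor = [], 0
    for label in required:
        # Search only after the previous label so ordering is enforced.
        idx = summary.find(label, cursor)
        if idx == -1:
            missing.append(label)
        else:
            cursor = idx + len(label)
    return missing
```

An empty list means the summary has every label in the expected order, which fits naturally into a custom assertion.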

Beyond the tests: what else you need to build

Here’s a glimpse into what we had to build in addition to the eval suites themselves to iterate accurately and quickly.

Match the production configuration

Our initial eval created the triage agent without the AWS Knowledge MCP client it uses in production. The evals passed, but they were testing a different agent than users would experience. When we added the MCP client to match production, it surfaced real infrastructure problems (rate limiting, connection lifecycle) that we needed to solve anyway.

Test what you run, run what you test. Shortcuts in eval configuration produce results that don’t predict production behavior.

Make evals fast enough to run

Our first make run-evals run took 10 minutes because it executed tests sequentially. Too slow for a feedback loop. We added pytest-xdist for parallel execution and got it down to under 4 minutes with 5 workers.

The main complication: if your agent calls external services (MCP servers, APIs), parallel workers multiply the connection load. We hit rate limits on the AWS Knowledge MCP server and had to add retry logic, share connections across tests, and eventually migrate to an IAM-authenticated endpoint to get reliable parallel execution.
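The retry piece can be sketched as exponential backoff with jitter. This is a generic pattern, not our exact code: RateLimitError is a placeholder for whatever exception your client raises when throttled, and the defaults are arbitrary.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for the error a client raises when it is throttled."""


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original error.
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out parallel workers that were throttled together.
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter matters with parallel workers: without it, workers that hit the limit at the same moment all retry at the same moment too.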

Save outputs for review

Anthropic’s guidance is explicit: “We do not take eval scores at face value until someone digs into the details.”

Every eval run saves the full analysis output as JSON to test-reports/eval-outputs/. Consistency runs save per-trial outputs. Quality evals save per-case scores with the judge’s reasoning. All are included as CI artifacts.

This paid off immediately. After a run where all tests passed, we reviewed the outputs and noticed the agent was producing 5-7 remediation steps per finding. All valid, but too many for an engineer to scan quickly. We wouldn’t have caught that from pass/fail results alone.

The cost picture

Building AI agent evals costs real money. Every eval case invokes the triage agent via Bedrock, and quality evals add LLM judge calls on top. The “invocations” counted here are logical agent invocations; each one involves multiple Bedrock API roundtrips for tool use (searching docs, reading pages, etc.).

Suite	Invocations (agent + judge)	Runtime	Input Tokens	Output Tokens	Cost ($)	When to Run
Assertions (10 cases)	10	~4 min	~580K	~38K	~$2.30	Every merge to main
Quality (10 cases x 2 rubrics)	~40	~16 min	~1,200K+	~76K+	~$4.70+	On-demand
Consistency (10 cases x 3 trials)	30	~10 min	~1,700K	~110K	~$6.80	On-demand

Note: Costs estimated with Claude Sonnet 4.6 Bedrock pricing on April 1, 2026: $3/1M input tokens, $15/1M output tokens.
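The cost column follows directly from the token counts and the prices in the note. A small sketch of the arithmetic, using the rounded token counts from the table (so results differ from the table by a few cents):

```python
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens, from the note above
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens, from the note above


def suite_cost(input_tokens, output_tokens):
    """Estimated USD cost of one suite run from its token totals."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Rounded token counts taken from the table.
SUITES = [
    ("Assertions", 580_000, 38_000),
    ("Quality", 1_200_000, 76_000),
    ("Consistency", 1_700_000, 110_000),
]

for name, tokens_in, tokens_out in SUITES:
    print(f"{name}: ${suite_cost(tokens_in, tokens_out):.2f}")
```

With these rounded inputs the estimates come out to about $2.31, $4.74, and $6.75. Input tokens dominate in every suite, which is why reducing token usage per case is the first lever to pull.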

The assertion suite is cheap enough at $2.30 to run on every merge. We’ll look into reducing token usage per case and sampling a subset of cases in CI. Quality and consistency are behind manual approval gates in CI; you trigger them when you need them, not on every push.

This is why separating deterministic assertions from LLM judge calls matters. The assertion suite runs 10 agent invocations and checks the results with code. The quality suite runs 20 agent invocations (10 cases x 2 rubrics) plus ~20 LLM judge invocations to score the outputs, totaling ~40 invocations and 4x the runtime. The consistency suite runs 30 agent invocations (10 cases x 3 trials) but analyzes cross-trial variance with deterministic code, not an LLM judge. Keep the fast feedback loop cheap by checking everything you can without a judge.

What we deferred

This is a pre-MVP agent. We deliberately limited scope for our initial eval suite:

No Stage 2 evals (context gathering against live AWS accounts). We haven’t integrated our ‘known vulnerable’ account yet. But Stage 1 triage evals cover much of the core value proposition and exercise the framework.
No negative test cases (findings the agent should deprioritize). These matter for filtering out noise from over-zealous checks or checks obsoleted by policy, such as 90-day password rotation.
No automated consistency thresholds. Consistency warnings are logged but don’t fail tests. We need baseline data before we can set meaningful thresholds and start raising the bar.

Each of these is a known gap, not an oversight. We’ll address them as the product matures and as user feedback tells us where the eval suite needs to grow.

What we learned

Start with real inputs, not synthetic ones. Our 10 test cases are sanitized security findings from real AWS accounts. The prompts are built by the same code path as production. This seeds your eval suite with cases that you know occurred in real life and that you’re probably already familiar with.

Most eval assertions don’t need an LLM judge. We have ~50 assertions across 10 cases. Roughly 80% are deterministic. If you’re checking whether the risk level is “High,” just compare strings.

Consistency testing finds problems that correctness testing misses. On the public EBS snapshot finding, every assertion passed on every trial, yet cross-trial analysis found only 24% summary overlap. The agent was “correct” every time and still looked different every time. Without consistency testing, you risk shipping an agent that gives different-looking answers to the same question.

Evals force decisions about what “correct” means. Is a stale inactive IAM key “Moderate” or “Low” risk? The eval makes you answer that question explicitly, with a range you commit to in code. That creates shared understanding of what the agent should do, which is more valuable than the eval itself.

Building an agent that is consistently correct is significant work. Getting the eval suite to verify “the agent produces valid output” took an afternoon. Getting to “the agent produces the right output, as judged by domain experts and an LLM, consistently across multiple runs” took substantially longer. We’re still finding things to improve. The eval suite isn’t done; it’s a living system that grows with the agent.

The eval outputs and consistency data are now shaping our product roadmap: where to invest in accuracy, where to reduce cost, and which pipeline stages need the most work. Building the eval suite took a few days. The insights it produced will steer development and build confidence for weeks and beyond.

—

We’re building Kitt, an AI agent that takes your team from security alert to ready-to-apply fix in minutes. Kitt auto-triages findings using your account, code, and IAM context, then proposes exact CLI or IaC fixes. If your team spends hours each week figuring out which findings matter and how to fix them, we’d love to hear about your workflow.
