# Evaluation
`gecx eval` runs scenario-replay tests against a `ChatSession`, scores them with deterministic and LLM-judge scorers, and prints a pass/fail table plus a JSON report. It also supports a regression gate that compares the current run against a committed baseline.
## Quick start
```bash
pnpm gecx eval ./apps/showcase/scenarios
```
You'll see a per-scenario pass/fail table, aggregate metrics (pass rate, deflection rate, p50/p95 latency, hallucination rate, tool-call accuracy), and a summary of which judge providers are configured.
For CI:
```bash
pnpm gecx eval ./apps/showcase/scenarios \
  --baseline ./apps/showcase/baseline.eval.json \
  --fail-on-regress \
  --json --output eval-report.json
```
Exit code 0 means every scenario passed and no regression gate tripped; exit code 1 means at least one scenario failed or a regression threshold was exceeded.
## Writing a scenario
Two formats, same shape. Both are loaded by walking the target directory for `*.scenario.ts`, `*.scenario.yaml`, and `*.scenario.yml` files.
### TypeScript
```ts
// apps/showcase/scenarios/welcome.scenario.ts
export default {
  id: 'welcome',
  name: 'Welcome flow greets the user',
  tags: ['support'],
  when: [{ user: 'hello' }],
  then: [
    { scorer: 'message-contains', args: { pattern: 'help' } },
    { scorer: 'no-handoff' },
    { scorer: 'no-error' },
  ],
};
```
### YAML
```yaml
# apps/showcase/scenarios/invoice-lookup.scenario.yaml
id: invoice-lookup
name: Invoice lookup calls the invoice tool
tags: [support, tool-call]
when:
  - user: Please pull invoice INV-5678
then:
  - scorer: tool-called
    args:
      name: lookup_invoice
  - scorer: no-handoff
  - scorer: no-error
```
### Anatomy
- `id` — unique within the suite.
- `name` — human-readable label for the report.
- `tags` — optional. Use `--filter <tag>` on the CLI to run a subset. Tag any scenario that uses an LLM-judge scorer with `llm-judge` so CI can gate it on a secret.
- `when` — turns. Either `{ user: '...' }` to send a user message, `{ approveToolCall: 'tag' }` to approve the next pending tool call, or `{ denyToolCall: 'tag' }` to deny it (see the sketch after this list).
- `then` — expectations evaluated in order. Each one names a scorer plus its `args`.
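For example, a scenario that drives the approval turns might look like the sketch below. The tag, tool name, and user message are hypothetical, not part of the showcase suite:

```ts
// Hypothetical scenario: the user requests a destructive action, the next
// pending tool call (tagged 'delete') is approved, then expectations run.
export default {
  id: 'delete-with-approval',
  name: 'Deletion runs only after the tool call is approved',
  tags: ['support'],
  when: [
    { user: 'Please delete my saved payment methods' },
    { approveToolCall: 'delete' },
  ],
  then: [
    { scorer: 'tool-called', args: { name: 'delete_payment_methods' } },
    { scorer: 'no-error' },
  ],
};
```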
## Built-in scorers
| Scorer | Args | What it checks |
|---|---|---|
| `message-contains` | `{ pattern }` | Assistant text (case-insensitive substring) |
| `message-matches-regex` | `{ pattern, flags? }` | Regex over assistant text |
| `final-message-equals` | `{ pattern }` | Last assistant message equals string |
| `tool-called` | `{ name }` | A tool with that name was requested |
| `tool-called-with-input` | `{ name, input }` | Same, plus matching input keys |
| `no-tool-called` | `{ name? }` | No (or a specific) tool was requested |
| `tool-call-accuracy` | `{ expected: [...] }` | Fraction of expected tools matched |
| `handoff-triggered` | — | `handoff_status_changed` emitted with a non-`none` status |
| `no-handoff` | — | No handoff happened |
| `error-code` | `{ code }` | A `ChatSdkError` with that code was emitted |
| `no-error` | — | No errors emitted |
| `latency-p50-under` | `{ ms }` | p50 of `assistant_response_completed.durationMs` below threshold |
| `latency-p95-under` | `{ ms }` | p95 of the same |
| `llm-judge-not-hallucinating` | `{ groundingContext, provider? }` | LLM judges whether output is grounded |
| `llm-judge-helpfulness` | `{ criteria?, provider? }` | LLM judges whether output is helpful |
| `llm-judge-tone-matches` | `{ tone, provider? }` | LLM judges whether tone matches |
`provider` defaults to `anthropic`. Set it to `openai` or `gemini` to use a different judge.
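As a sketch of how a judge scorer fits into a scenario, the hypothetical file below checks groundedness with OpenAI instead of the default Anthropic judge; the grounding text and expectations are made up for illustration:

```ts
// scenarios/refund-policy.scenario.ts (hypothetical)
export default {
  id: 'refund-policy',
  name: 'Refund answer stays grounded in policy text',
  // tagged 'llm-judge' so CI can gate this scenario on an API-key secret
  tags: ['support', 'llm-judge'],
  when: [{ user: 'How long do refunds take?' }],
  then: [
    {
      scorer: 'llm-judge-not-hallucinating',
      args: {
        groundingContext: 'Refunds are processed within 5 business days.',
        provider: 'openai', // override the default anthropic judge
      },
    },
    { scorer: 'no-handoff' },
  ],
};
```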
## Custom scorers
```ts
import { defineScorer } from 'gecx-chat/eval';

const allCapsTitle = defineScorer<{ minLength: number }>({
  id: 'all-caps-title',
  describe: (args) => `assistant title is ALL CAPS and >= ${args.minLength} chars`,
  async run(ctx, args) {
    const first = ctx.record.messages.find((m) => m.role === 'agent');
    const text = first?.parts.map((p) => (p.type === 'text' ? p.text : '')).join('') ?? '';
    const pass = text === text.toUpperCase() && text.length >= args.minLength;
    return { status: pass ? 'passed' : 'failed', score: pass ? 1 : 0 };
  },
});
```
Register it with the runner by passing your own `ScorerRegistry`:
```ts
import { createScorerRegistry, registerBuiltinScorers, runEval } from 'gecx-chat/eval';

const registry = registerBuiltinScorers(createScorerRegistry());
registry.register(allCapsTitle);

await runEval({ dir: './scenarios', registry });
```
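Once registered, the scorer can be referenced from a scenario's `then` block by its `id`. A hypothetical example:

```ts
// Hypothetical scenario exercising the custom scorer defined above.
export default {
  id: 'shouting-title',
  name: 'Assistant title is ALL CAPS',
  when: [{ user: 'hello' }],
  then: [{ scorer: 'all-caps-title', args: { minLength: 3 } }],
};
```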
## LLM-judge providers
| Provider | Env var | Default model |
|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-6` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Gemini | `GEMINI_API_KEY` | `gemini-2.5-flash` |
Each adapter posts directly to the provider's HTTP API, so there is no SDK dependency. If a key isn't set, scorers that need that provider return `status: 'skipped'` (the scenario itself does not fail). Override the model per provider in the eval config:
```jsonc
// eval.config.json
{
  "providers": {
    "anthropic": { "model": "claude-sonnet-4-6" }
  }
}
```
Then run `pnpm gecx eval ./scenarios --config eval.config.json`.
## Regression gates
`--baseline <path>` compares the current run to a previously committed report and applies thresholds. Defaults:
| Threshold | Default | Trips when |
|---|---|---|
| `deflectionDropPp` | 5 | Deflection rate drops by more than 5 percentage points |
| `p95LatencyIncreasePct` | 20 | p95 latency rises by more than 20% |
| `ttftP95IncreasePct` | 20 | p95 TTFT rises by more than 20% |
| `passRateDropPp` | 0 | Any pass-rate drop |
| `hallucinationRateIncreasePp` | 2 | Hallucination rate rises by more than 2 percentage points |
Override in `eval.config.json` under `regressionThresholds`, or in a separate file passed as `--regression-config <path>`. New scenario failures (not present in the baseline) always trip the gate regardless of metric thresholds.
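For example, a config that tightens the latency gate and disallows any hallucination-rate increase might look like this; the numbers are illustrative, not recommendations:

```jsonc
// eval.config.json
{
  "regressionThresholds": {
    "p95LatencyIncreasePct": 10,
    "passRateDropPp": 1,
    "hallucinationRateIncreasePp": 0
  }
}
```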
To update the baseline after intentional improvements:
```bash
pnpm gecx eval ./scenarios --update-baseline ./baseline.eval.json
```
## CI integration
See `docs/recipes/eval-github-action.md` for a full GitHub Actions workflow that runs the eval on every PR, uploads the JSON report as an artifact, and gates on regressions.
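As a rough sketch of what that workflow boils down to (action versions, paths, and the secret name are assumptions; the recipe doc is the source of truth):

```yaml
name: eval
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      # Run the suite against the committed baseline and gate on regressions.
      - run: >
          pnpm gecx eval ./apps/showcase/scenarios
          --baseline ./apps/showcase/baseline.eval.json
          --fail-on-regress
          --json --output eval-report.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      # Upload the report even when the gate fails so it can be inspected.
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json
```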
## Reference
- CLI flag reference: `docs/reference/eval-cli.md`
- JSON schemas: `schemas/eval-scenario.schema.json`, `schemas/eval-report.schema.json`, `schemas/eval-config.schema.json`
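If your editor runs yaml-language-server (the VS Code YAML extension does), you can point a YAML scenario at the schema for inline validation. The relative path below is an assumption about where the schema sits in your checkout:

```yaml
# yaml-language-server: $schema=../../../schemas/eval-scenario.schema.json
id: example
name: Example scenario validated by the schema
when:
  - user: hello
then:
  - scorer: no-error
```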