# Evaluation
`gecx eval` runs scenario-replay tests against a `ChatSession`, scores them with deterministic and LLM-judge scorers, and prints a pass/fail table plus a JSON report. It also supports a regression gate that compares the current run against a committed baseline.
## Quick start
```bash
pnpm gecx eval ./apps/showcase/scenarios
```
You'll see a per-scenario pass/fail table, aggregate metrics (pass rate, deflection rate, p50/p95 latency, hallucination rate, tool-call accuracy), and a summary of which judge providers are configured.
For CI:
```bash
pnpm gecx eval ./apps/showcase/scenarios \
  --baseline ./apps/showcase/baseline.eval.json \
  --fail-on-regress \
  --json --output eval-report.json
```
Exit code 0 means every scenario passed and no regression gate tripped; exit code 1 means at least one scenario failed or a regression threshold was exceeded.
## Writing a scenario
Two formats, same shape. Both are loaded by walking the target directory for `*.scenario.ts`, `*.scenario.yaml`, and `*.scenario.yml` files.
### TypeScript
```ts
// apps/showcase/scenarios/welcome.scenario.ts
export default {
  id: 'welcome',
  name: 'Welcome flow greets the user',
  tags: ['support'],
  when: [{ user: 'hello' }],
  then: [
    { scorer: 'message-contains', args: { pattern: 'help' } },
    { scorer: 'no-handoff' },
    { scorer: 'no-error' },
  ],
};
```
### YAML
```yaml
# apps/showcase/scenarios/invoice-lookup.scenario.yaml
id: invoice-lookup
name: Invoice lookup calls the invoice tool
tags: [support, tool-call]
when:
  - user: Please pull invoice INV-5678
then:
  - scorer: tool-called
    args:
      name: lookup_invoice
  - scorer: no-handoff
  - scorer: no-error
```
### Anatomy
- `id` — unique within the suite.
- `name` — human-readable label for the report.
- `tags` — optional. Use `--filter <tag>` on the CLI to run a subset. Tag any scenario that uses an LLM-judge scorer with `llm-judge` so CI can gate it on a secret.
- `when` — turns. Either `{ user: '...' }` to send a user message, `{ approveToolCall: 'tag' }` to approve the next pending tool call, or `{ denyToolCall: 'tag' }` to deny it (see the sketch after this list).
- `then` — expectations evaluated in order. Each one names a scorer plus its `args`.
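For example, a scenario that drives the approval turns might look like the sketch below. The tag, tool name, and user message are hypothetical, not part of the showcase suite:

```ts
// Hypothetical scenario: the user requests a destructive action, the next
// pending tool call (tagged 'delete') is approved, then expectations run.
export default {
  id: 'delete-with-approval',
  name: 'Deletion runs only after the tool call is approved',
  tags: ['support'],
  when: [
    { user: 'Please delete my saved payment methods' },
    { approveToolCall: 'delete' },
  ],
  then: [
    { scorer: 'tool-called', args: { name: 'delete_payment_methods' } },
    { scorer: 'no-error' },
  ],
};
```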
## Built-in scorers
| Scorer | Args | What it checks |
|---|---|---|
| `message-contains` | `{ pattern }` | Assistant text (case-insensitive substring) |
| `message-matches-regex` | `{ pattern, flags? }` | Regex over assistant text |
| `final-message-equals` | `{ pattern }` | Last assistant message equals string |
| `tool-called` | `{ name }` | A tool with that name was requested |
| `tool-called-with-input` | `{ name, input }` | Same, plus matching input keys |
| `no-tool-called` | `{ name? }` | No (or a specific) tool was requested |
| `tool-call-accuracy` | `{ expected: [...] }` | Fraction of expected tools matched |
| `handoff-triggered` | — | `handoff_status_changed` emitted with a non-`none` status |
| `no-handoff` | — | No handoff happened |
| `error-code` | `{ code }` | A `ChatSdkError` with that code was emitted |
| `no-error` | — | No errors emitted |
| `latency-p50-under` | `{ ms }` | p50 of `assistant_response_completed.durationMs` below threshold |
| `latency-p95-under` | `{ ms }` | p95 of the same |
| `llm-judge-not-hallucinating` | `{ groundingContext, provider? }` | LLM judges whether output is grounded |
| `llm-judge-helpfulness` | `{ criteria?, provider? }` | LLM judges whether output is helpful |
| `llm-judge-tone-matches` | `{ tone, provider? }` | LLM judges whether tone matches |
`provider` defaults to `anthropic`. Set it to `openai` or `gemini` to use a different judge.
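As a sketch of how a judge scorer fits into a scenario, the hypothetical file below checks groundedness with OpenAI instead of the default Anthropic judge; the grounding text and expectations are made up for illustration:

```ts
// scenarios/refund-policy.scenario.ts (hypothetical)
export default {
  id: 'refund-policy',
  name: 'Refund answer stays grounded in policy text',
  // tagged 'llm-judge' so CI can gate this scenario on an API-key secret
  tags: ['support', 'llm-judge'],
  when: [{ user: 'How long do refunds take?' }],
  then: [
    {
      scorer: 'llm-judge-not-hallucinating',
      args: {
        groundingContext: 'Refunds are processed within 5 business days.',
        provider: 'openai', // override the default anthropic judge
      },
    },
    { scorer: 'no-handoff' },
  ],
};
```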
## Custom scorers
```ts
import { defineScorer } from 'gecx-chat/eval';

const allCapsTitle = defineScorer<{ minLength: number }>({
  id: 'all-caps-title',
  describe: (args) => `assistant title is ALL CAPS and >= ${args.minLength} chars`,
  async run(ctx, args) {
    const first = ctx.record.messages.find((m) => m.role === 'agent');
    const text = first?.parts.map((p) => (p.type === 'text' ? p.text : '')).join('') ?? '';
    const pass = text === text.toUpperCase() && text.length >= args.minLength;
    return { status: pass ? 'passed' : 'failed', score: pass ? 1 : 0 };
  },
});
```
Register it with the runner by passing your own `ScorerRegistry`:
```ts
import { createScorerRegistry, registerBuiltinScorers, runEval } from 'gecx-chat/eval';

const registry = registerBuiltinScorers(createScorerRegistry());
registry.register(allCapsTitle);

await runEval({ dir: './scenarios', registry });
```
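Once registered, the scorer can be referenced from a scenario's `then` block by its `id`. A hypothetical example:

```ts
// Hypothetical scenario exercising the custom scorer defined above.
export default {
  id: 'shouting-title',
  name: 'Assistant title is ALL CAPS',
  when: [{ user: 'hello' }],
  then: [{ scorer: 'all-caps-title', args: { minLength: 3 } }],
};
```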
## LLM-judge providers
| Provider | Env var | Default model |
|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-6` |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Gemini | `GEMINI_API_KEY` | `gemini-2.5-flash` |
Each adapter posts directly to the provider's HTTP API, so there is no SDK dependency. If a key isn't set, scorers that need that provider return `status: 'skipped'` (the scenario itself does not fail). Override the model per provider in the eval config:
```jsonc
// eval.config.json
{
  "providers": {
    "anthropic": { "model": "claude-sonnet-4-6" }
  }
}
```
Then run `pnpm gecx eval ./scenarios --config eval.config.json`.
## Regression gates
`--baseline <path>` compares the current run to a previously committed report and applies thresholds. Defaults:
| Threshold | Default | Trips when |
|---|---|---|
| `deflectionDropPp` | 5 | Deflection rate drops by more than 5 percentage points |
| `p95LatencyIncreasePct` | 20 | p95 latency rises by more than 20% |
| `ttftP95IncreasePct` | 20 | p95 TTFT rises by more than 20% |
| `passRateDropPp` | 0 | Any pass-rate drop |
| `hallucinationRateIncreasePp` | 2 | Hallucination rate rises by more than 2 percentage points |
Override in `eval.config.json` under `regressionThresholds`, or in a separate file passed as `--regression-config <path>`. New scenario failures (not present in the baseline) always trip the gate regardless of metric thresholds.
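For example, a config that tightens the latency gate and disallows any hallucination-rate increase might look like this; the numbers are illustrative, not recommendations:

```jsonc
// eval.config.json
{
  "regressionThresholds": {
    "p95LatencyIncreasePct": 10,
    "passRateDropPp": 1,
    "hallucinationRateIncreasePp": 0
  }
}
```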
To update the baseline after intentional improvements:
```bash
pnpm gecx eval ./scenarios --update-baseline ./baseline.eval.json
```
## CI integration
See `docs/recipes/eval-github-action.md` for a full GitHub Actions workflow that runs the eval on every PR, uploads the JSON report as an artifact, and gates on regressions.
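As a rough sketch of what that workflow boils down to (action versions, paths, and the secret name are assumptions; the recipe doc is the source of truth):

```yaml
name: eval
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      # Run the suite against the committed baseline and gate on regressions.
      - run: >
          pnpm gecx eval ./apps/showcase/scenarios
          --baseline ./apps/showcase/baseline.eval.json
          --fail-on-regress
          --json --output eval-report.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      # Upload the report even when the gate fails so it can be inspected.
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json
```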
## Reference
- CLI flag reference: `docs/reference/eval-cli.md`
- JSON schemas: `schemas/eval-scenario.schema.json`, `schemas/eval-report.schema.json`, `schemas/eval-config.schema.json`
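If your editor runs yaml-language-server (the VS Code YAML extension does), you can point a YAML scenario at the schema for inline validation. The relative path below is an assumption about where the schema sits in your checkout:

```yaml
# yaml-language-server: $schema=../../../schemas/eval-scenario.schema.json
id: example
name: Example scenario validated by the schema
when:
  - user: hello
then:
  - scorer: no-error
```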