GitHub Actions: gate PRs on eval regressions

Run gecx eval on every PR, upload the JSON report as an artifact, and block merges when a scenario fails or a regression threshold trips.

Workflow

Copy this into .github/workflows/evaluation.yml:

name: evaluation
on:
  pull_request:
    branches: [main]
  workflow_dispatch:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm build:packages

      - name: Run scenarios
        run: |
          pnpm gecx eval ./apps/showcase/scenarios \
            --baseline ./apps/showcase/baseline.eval.json \
            --fail-on-regress \
            --json --output eval-report.json
        env:
          # Judge keys are optional; scorers whose provider has no key are skipped.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json
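
The report's full schema isn't reproduced here, but based on the fields the PR-comment script in the next section reads, the relevant slice of eval-report.json looks roughly like this (values are illustrative):

```json
{
  "metrics": {
    "passRate": 0.9,
    "passedScenarios": 18,
    "totalScenarios": 20,
    "deflectionRate": 0.45,
    "latencyP95Ms": 1843,
    "ttftP95Ms": 412,
    "hallucinationRate": 0.05,
    "toolCallAccuracy": 0.975
  }
}
```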

Gating LLM-judge scorers

LLM-judge scorers make real API calls on every run. Gate them behind a repo var: when the var is off, run the same command without the judge keys in the environment, and those scorers skip themselves:

- name: Run scenarios (LLM judges enabled)
  if: ${{ vars.EVAL_LLM_JUDGE_ENABLED == 'true' || github.event_name == 'workflow_dispatch' }}
  run: pnpm gecx eval ./apps/showcase/scenarios --baseline ./apps/showcase/baseline.eval.json --fail-on-regress
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

- name: Run scenarios (deterministic only)
  if: ${{ vars.EVAL_LLM_JUDGE_ENABLED != 'true' && github.event_name == 'pull_request' }}
  run: pnpm gecx eval ./apps/showcase/scenarios --baseline ./apps/showcase/baseline.eval.json --fail-on-regress

(Deterministic scorers always run; the LLM-judge ones simply return status: 'skipped' when no key is set.)
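
EVAL_LLM_JUDGE_ENABLED is a repository variable (Settings → Secrets and variables → Actions → Variables). Assuming gh 2.31+, it can also be set from the CLI:

```shell
# Enable LLM-judge scenarios for this repository's workflows.
gh variable set EVAL_LLM_JUDGE_ENABLED --body true
```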

Posting a PR comment summary

Append after the upload step:

- name: Comment summary on PR
  if: ${{ always() && github.event_name == 'pull_request' }}
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const report = JSON.parse(fs.readFileSync('eval-report.json', 'utf8'));
      const m = report.metrics;
      const body = [
        '### GECX eval results',
        '',
        `- Pass rate: ${(m.passRate * 100).toFixed(1)}% (${m.passedScenarios}/${m.totalScenarios})`,
        `- Deflection: ${(m.deflectionRate * 100).toFixed(1)}%`,
        `- Latency p95: ${Math.round(m.latencyP95Ms)}ms`,
        `- TTFT p95: ${Math.round(m.ttftP95Ms)}ms`,
        `- Hallucination rate: ${(m.hallucinationRate * 100).toFixed(1)}%`,
        `- Tool-call accuracy: ${(m.toolCallAccuracy * 100).toFixed(1)}%`,
      ].join('\n');
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body,
      });
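
The body-building logic above can be factored into a small helper, which also makes it easy to unit-test outside the workflow. A sketch in plain Node — the metrics field names match what the script reads, but formatEvalSummary is a hypothetical helper name, not part of gecx:

```javascript
// Build the PR-comment markdown from the report's metrics object.
function formatEvalSummary(m) {
  const pct = (x) => `${(x * 100).toFixed(1)}%`;
  return [
    '### GECX eval results',
    '',
    `- Pass rate: ${pct(m.passRate)} (${m.passedScenarios}/${m.totalScenarios})`,
    `- Deflection: ${pct(m.deflectionRate)}`,
    `- Latency p95: ${Math.round(m.latencyP95Ms)}ms`,
    `- TTFT p95: ${Math.round(m.ttftP95Ms)}ms`,
    `- Hallucination rate: ${pct(m.hallucinationRate)}`,
    `- Tool-call accuracy: ${pct(m.toolCallAccuracy)}`,
  ].join('\n');
}

// Example with illustrative numbers:
const body = formatEvalSummary({
  passRate: 0.9, passedScenarios: 18, totalScenarios: 20,
  deflectionRate: 0.45, latencyP95Ms: 1843.2, ttftP95Ms: 412.7,
  hallucinationRate: 0.05, toolCallAccuracy: 0.975,
});
console.log(body.split('\n')[2]); // → "- Pass rate: 90.0% (18/20)"
```

In the workflow step, the github-script body reduces to reading the report, calling the helper, and passing the result to github.rest.issues.createComment.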

Updating the baseline

When intentional improvements ship and you want to lock them in:

pnpm gecx eval ./apps/showcase/scenarios --update-baseline ./apps/showcase/baseline.eval.json
git add apps/showcase/baseline.eval.json
git commit -m "chore(eval): refresh baseline"

The committed baseline becomes the new floor for future PRs.
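
If the scenarios haven't moved, --update-baseline may rewrite the file with identical content, and the commit then fails on a clean tree. A small guard avoids that — the gecx invocation is the one from above; the git diff check is standard:

```shell
pnpm gecx eval ./apps/showcase/scenarios \
  --update-baseline ./apps/showcase/baseline.eval.json

# Commit only when the baseline actually changed.
if git diff --quiet -- apps/showcase/baseline.eval.json; then
  echo "baseline unchanged; nothing to commit"
else
  git add apps/showcase/baseline.eval.json
  git commit -m "chore(eval): refresh baseline"
fi
```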

Source: docs/recipes/eval-github-action.md