We always struggled with testing. It wasn't something we tracked in sprints or brought up in formal retrospectives — we're a small team, two or three people, and most of our honest conversations happen during code reviews or while debugging something that broke. And the same thing kept coming up: "we should really have tests for this." Everyone agreed. Nobody had time to go back and write them.
Our backend is a TypeScript API — Fastify serving both REST endpoints and GraphQL, with three separate GraphQL schemas: one public-facing for passengers, one private for internal admin, and one for carriers. Prisma on top of PostgreSQL, Redis for queuing, about 700 helper files across the codebase. The kind of project where one untested change can quietly break something three layers deep. We knew the risk. We just kept shipping anyway.
The Cursor Chapter Was Short
When Cursor's chat view launched we gave it a shot. You'd select a file, ask it to write tests, and it would. But the workflow was fully manual — you had to remember to do it, pick the file, wait, copy the output, create a test file, run it, then sit with the failures. For functions with external dependencies it would often mock things wrong or miss imports entirely, and you'd spend more time fixing the generated test than you would have spent writing it. The bigger issue was that nothing happened automatically. If you forgot to ask, no tests got written. And we forgot constantly, because we were busy building features.
We started thinking about whether we could remove ourselves from the loop entirely. What if tests just appeared whenever new code landed on staging?
The Core Idea
The solution is a single TypeScript script running in GitLab CI. Every time someone pushes to staging, it:
- Diffs the commit to find changed source files
- Filters out files that don't need tests (generated code, schema declarations, boilerplate)
- Sends each file to an AI API with a prompt that knows our codebase conventions
- Writes the generated test to disk and runs it
- If it fails, sends the error back to the AI to fix — up to a configurable number of attempts
- Pushes everything to a new branch and opens a GitLab MR
No framework, no orchestration platform. One script, about 700 lines, runs on every push.
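The last step, pushing a branch and opening the MR, is the only one not shown in code below, so here is a rough sketch. The branch naming scheme and commit message are our illustrative assumptions; the endpoint is GitLab's standard merge request REST API (`POST /projects/:id/merge_requests`), authenticated with the same CI token:

```typescript
import { execSync } from "node:child_process";

function mrBranchName(sha: string): string {
  // e.g. "ai-tests/3f2c9a1" -- the naming scheme is an assumption
  return `ai-tests/${sha.slice(0, 7)}`;
}

function mrPayload(branch: string): Record<string, string> {
  return {
    source_branch: branch,
    target_branch: "staging",
    title: `AI-generated tests for ${branch}`,
    remove_source_branch: "true",
  };
}

async function pushAndOpenMR(): Promise<void> {
  const sha = execSync("git rev-parse HEAD").toString().trim();
  const branch = mrBranchName(sha);

  // Commit the generated test files on a fresh branch and push it
  execSync(`git checkout -b ${branch}`);
  execSync(`git add src && git commit -m "chore: add AI-generated tests"`);
  execSync(`git push origin ${branch}`);

  // Open the MR via GitLab's REST API using the CI variables from the job
  await fetch(
    `https://${process.env.CI_SERVER_HOST}/api/v4/projects/${process.env.CI_PROJECT_ID}/merge_requests`,
    {
      method: "POST",
      headers: {
        "PRIVATE-TOKEN": process.env.GITLAB_TOKEN ?? "",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(mrPayload(branch)),
    },
  );
}
```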
Setting Up the CI Job
Start here, because this is what triggers everything. The job needs two things: shallow git history so it can diff against the previous commit, and a git remote configured with a token that can push branches and open MRs.
```yaml
generate_tests:
  stage: test
  image: node:22-alpine
  rules:
    - if: '$CI_COMMIT_BRANCH == "staging" && $CI_COMMIT_TITLE !~ /\[skip ci\]/'
      when: on_success
    - when: never
  variables:
    GIT_DEPTH: 2
    GIT_STRATEGY: clone
    AI_PROVIDER: "openai"
  before_script:
    - git remote set-url origin "https://gitlab-ci-token:${GITLAB_TOKEN}@${CI_SERVER_HOST}/${CI_PROJECT_PATH}.git"
    - git fetch --depth=2 origin staging
    - git checkout staging
    - yarn install --frozen-lockfile --prefer-offline
  script:
    - npx tsx agents/tests/generate-tests.ts
  allow_failure: true
```
`allow_failure: true` is important. Test generation should never block a deploy. If the AI API is down or something fails, the pipeline keeps moving and the team gets notified separately.
Diffing and Filtering
The script starts by finding what changed:
```typescript
const DIFF_BASE = process.env.DIFF_BASE ?? "HEAD~1";

// run() is a small execSync wrapper defined elsewhere in the script
function getChangedSourceFiles(): string[] {
  const raw = run(`git diff ${DIFF_BASE} --name-only --diff-filter=AM`, true);
  return raw
    .split("\n")
    .map((f) => f.trim())
    .filter(
      (f) =>
        f.startsWith("src/") &&
        f.endsWith(".ts") &&
        !f.includes("__tests__") &&
        !f.endsWith(".test.ts") &&
        !f.endsWith(".spec.ts") &&
        !f.endsWith(".d.ts"),
    );
}
```
`DIFF_BASE` defaults to `HEAD~1`, but you can override it — useful when you want to run the agent against a larger range, like `DIFF_BASE=HEAD~5` after a batch of commits.
Before anything hits the AI, a `.notestignore` file filters out paths that have no testable logic:
```
src/index.ts
src/server.ts
src/workers/
src/types/
src/constants/
src/graphql/generated/
src/graphql/schemas/
src/graphql/models/
src/prisma/generated/
*.inputs.ts
*.types.ts
*.queries.ts
*.mutations.ts
```
Patterns ending in a trailing `/` are treated as directory prefixes, patterns starting with `*` are matched as file suffixes, and everything else is an exact file path. This eliminated roughly half our API calls on the first real run.
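A minimal matcher for those rules might look like this (the exact-path fallback for entries like `src/index.ts` is our reading of the format):

```typescript
// Match a file against .notestignore patterns:
//   trailing "/"  -> directory prefix
//   leading "*"   -> filename suffix
//   anything else -> exact path
function isIgnored(file: string, patterns: string[]): boolean {
  return patterns.some((p) => {
    if (p.endsWith("/")) return file.startsWith(p);
    if (p.startsWith("*")) return file.endsWith(p.slice(1));
    return file === p;
  });
}
```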
The Generate → Refine → Fix Loop
Each file goes through three phases:
```typescript
for (const filePath of filesToProcess) {
  const sourceCode = fs.readFileSync(filePath, "utf-8");

  // Phase 1: generate
  let testContent = await callAI(buildPrompt(filePath, sourceCode));

  // Phase 2: refine -- the AI reviews its own output
  for (let i = 2; i <= GENERATION_ITERATIONS; i++) {
    const refined = await callAI(
      buildRefinePrompt(filePath, sourceCode, testContent, i)
    );
    if (refined.trim()) testContent = refined;
  }

  const testPath = testPathFor(filePath);
  fs.writeFileSync(testPath, testContent);

  // Phase 3: run, and feed failures back to the AI
  for (let attempt = 0; attempt <= MAX_FIX_ATTEMPTS; attempt++) {
    const { passed, output } = runTest(testPath);
    if (passed) break;
    const fixed = await callAI(
      buildFixPrompt(filePath, sourceCode, testPath, testContent, output)
    );
    testContent = fixed;
    fs.writeFileSync(testPath, testContent);
  }
}
```
GENERATION_ITERATIONS controls how many times the AI reviews and refines its own output before we ever run the tests. We settled on 2 — one generation pass, one self-review. Setting it higher improves quality slightly but costs more API calls per file.
MAX_FIX_ATTEMPTS controls the run → fix loop. When a test fails, the full error output from Vitest goes back to the AI along with the source file and the broken test. It tries to correct it and we run again. We use 2 attempts: enough to catch most compilation errors and wrong mock shapes, but not so many that a fundamentally flawed test wastes a dozen API calls. Files that still fail after all attempts land in the MR marked for manual review — the agent doesn't discard them, because even a failing test with correct structure is useful to edit.
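For reference, here is a minimal sketch of the two helpers the loop relies on. The `__tests__` directory layout and the exact Vitest flags are assumptions; adjust them to your project's conventions:

```typescript
import { execSync } from "node:child_process";
import path from "node:path";

// src/helpers/date.ts -> src/helpers/__tests__/date.test.ts
function testPathFor(sourcePath: string): string {
  const dir = path.join(path.dirname(sourcePath), "__tests__");
  const base = path.basename(sourcePath, ".ts");
  return path.join(dir, `${base}.test.ts`);
}

function runTest(testPath: string): { passed: boolean; output: string } {
  try {
    const output = execSync(`npx vitest run ${testPath} --reporter=verbose`, {
      encoding: "utf-8",
      stdio: "pipe",
    });
    return { passed: true, output };
  } catch (err: any) {
    // vitest exits non-zero on failure; capture everything for the fix prompt
    return { passed: false, output: `${err.stdout ?? ""}${err.stderr ?? ""}` };
  }
}
```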
The Prompt Is Everything (not really)
The generated tests are only as good as the context you give the AI. Generic prompts produce generic tests that don't compile in your project. Our prompt embeds our specific conventions:
```typescript
function buildPrompt(filePath: string, code: string): string {
  return `You are writing Vitest unit tests for a TypeScript Fastify/GraphQL backend.

## Project conventions you MUST follow
- Always use .js extension in imports, even for .ts files
- Path aliases available: $helpers/*, $services/*, $graphql/*, etc.
- Error handling: errors are thrown as CustomError from $helpers/CustomError.js
- Always mock external I/O: Prisma ($services/prisma.js), Redis ($services/redis.js), SendGrid, etc.
- Prefer describe/it blocks over flat test() calls

${TESTING_CONVENTIONS}

File path: ${filePath}

\`\`\`typescript
${code}
\`\`\``;
}
```
That `TESTING_CONVENTIONS` block is pulled straight from our CLAUDE.md at startup — the same onboarding doc our developers read. This means the AI prompt stays in sync with our actual conventions automatically, without a separate config file to maintain.
The same conventions are included in the fix prompt and the refinement prompt, so the AI can't "forget" them between passes.
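Pulling that block out of CLAUDE.md is a few lines; the `## Testing` heading is an assumption about how the doc is structured:

```typescript
import fs from "node:fs";

// Extract the testing section of CLAUDE.md: everything from the
// "## Testing" heading up to the next "## " heading (or end of file).
function loadTestingConventions(docPath = "CLAUDE.md"): string {
  const doc = fs.readFileSync(docPath, "utf-8");
  const match = doc.match(/## Testing\n([\s\S]*?)(?=\n## |$)/);
  return match?.[1].trim() ?? "";
}
```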
AI Provider
We abstract the AI call so we can switch between OpenAI and Anthropic with an environment variable:
```typescript
async function callAI(prompt: string): Promise<string> {
  if (AI_PROVIDER === "openai") {
    const client = new OpenAI({ apiKey: OPENAI_API_KEY });
    const response = await client.chat.completions.create({
      model: OPENAI_MODEL,
      messages: [{ role: "user", content: prompt }],
      temperature: 0.2,
    });
    return response.choices[0]?.message?.content ?? "";
  }

  const client = new Anthropic({ apiKey: ANTHROPIC_API_KEY });
  const response = await client.messages.create({
    model: ANTHROPIC_MODEL,
    max_tokens: 8192, // required by the Anthropic Messages API
    messages: [{ role: "user", content: prompt }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}
```
We run on `gpt-5.2`. Switching provider is a single CI variable change — no code touched.
What Does It Cost?
We spent time on testing before — just never consistently, and never on the files that needed it most. The agent doesn't replace that time; it shifts it. Instead of writing tests from scratch, you review and merge the MR.
In terms of API costs per file processed, it roughly breaks down like this. A simple helper function — pure logic, few dependencies — runs about $0.02–0.03 per file, including the refinement pass and a fix attempt if needed. Complex business logic with multiple mocked services and a longer source file lands closer to $0.50. Most of our codebase sits in the simpler bucket, so a typical staging push with 5–10 changed files costs well under a dollar. In practice it has averaged about $0.25 per day.
What's Next
A few things we're planning to add:
MR comment → fix loop. Right now, if a reviewer leaves a comment like "this mock is wrong — the function returns an array, not an object", nothing happens automatically. We want the agent to watch for comments on auto-generated test MRs, pick up the feedback, and push a corrected version. Close the loop between human review and the agent.
Dependency context. The prompt currently only sees the single source file. For resolvers that call several helpers, the AI has to guess at the signatures of things it's mocking. Passing in the immediate imports alongside the main file should make mock shapes much more accurate.
Update existing tests. Right now the agent only creates tests for files that don't have them yet. When a source file changes significantly, the existing test can go stale. We want a mode that diffs the source against the last time a test was generated and proposes updates. A related improvement: when an existing test file is present, pass it into the prompt alongside the source file so the AI knows what's already covered and can avoid duplicating tests or contradicting assertions that are intentionally there.
Inline opt-out. The `.notestignore` file works well for whole directories and file patterns, but sometimes you just want to exclude one specific file that doesn't fit any pattern. We want to support a `// @notest` comment at the top of a source file as a file-level opt-out — the agent would check for it before making any API call and skip the file silently. No config to update, no path to remember — the signal lives right next to the code.
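A minimal sketch of that planned check, assuming the marker sits within the first few lines of the file:

```typescript
import fs from "node:fs";

// Skip a source file if a "// @notest" marker appears near the top.
// Only the first five lines are checked, so the string appearing in a
// comment deeper in the file doesn't accidentally opt it out.
function hasNotestMarker(filePath: string): boolean {
  const head = fs.readFileSync(filePath, "utf-8").split("\n").slice(0, 5);
  return head.some((line) => line.trim().startsWith("// @notest"));
}
```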
Integration tests. Unit tests for helpers are the easy win. Integration tests for the full GraphQL resolvers — with a real database connection or a proper test fixture — are a much harder problem. We haven't figured out the right approach yet.
The agent isn't a silver bullet, and the tests it writes aren't production-grade on first pass. But consistently having something — a test file that's mostly right, covers the happy path and at least one error case, and runs without crashing — turns out to be worth a lot more than occasionally having nothing. We push code faster now knowing the helper layer has coverage, even imperfect coverage. That felt surprising when we first noticed it.
If you're on GitLab with Vitest and TypeScript, most of this is adaptable to your project in a day.