AI-Assisted Workflow
Explain how Blackbox gives coding agents an external behavioral checker without making the docs depend on AI.
Abstract
Blackbox gives AI-assisted development a stable verification loop: implement, exercise the system, observe runtime effects, compare them to the catalog, and hand evidence back to the developer for review.
The same loop also helps ordinary scripts, CI jobs, and human developers. Keep this page AI-aware, but not AI-dependent.
Audience
Teams using coding agents or automation while keeping a deterministic behavioral gate outside the implementation loop.
The Agent Problem
Agents need stop conditions. If the only stop condition is “tests are green” or “the task list is checked,” the agent can miss behavior the tests did not assert, drift from the spec, or declare completion without runtime proof.
Blackbox does not make the agent trustworthy by itself. It gives the workflow an external checker that is tied to observed behavior.
Recommended Loop
- The agent reads the task, existing tests, catalog, and reports.
- The agent makes one reviewable implementation change.
- The system test or E2E suite runs against the real boundary.
- Blackbox collects spans and produces effect coverage artifacts.
- The agent reads the report and fixes missing or forbidden effects.
- The agent hands the diff, report, and remaining uncertainty to a human reviewer.
Agent Command Loop
Agents should prefer read-only checks before state-changing commands:
    pnpm exec blackbox features check --features ./features --tests ./tests --json
    pnpm test:system
    pnpm exec blackbox coverage replay --coverage-dir .blackbox-coverage --no-html

When an agent needs to reshape tests, it should run the dry-run path first:

    pnpm exec blackbox features reshape ./tests/my.spec.ts --dry-run

State-changing commands such as features emit, features reshape --write, features experimental wrap --write, and features experimental scaffold --write should produce a reviewable diff. They should not be treated as automatic truth.
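The check sequence above (features check, system tests, coverage replay) can be sketched as a small TypeScript driver. This is a minimal sketch, not part of Blackbox: `Step`, `Run`, `checkSteps`, and `runCheckLoop` are hypothetical names, and only the command lines are taken from this page. Stopping at the first non-zero exit code is an assumption of the sketch, not documented Blackbox behavior.

```typescript
// Hypothetical driver: only the command lines below come from this page.
type Step = { name: string; argv: string[] };
type StepResult = { name: string; exitCode: number };

// `run` is injected so the sketch can be exercised without spawning processes.
// It receives the full argv and returns the process exit code.
type Run = (argv: string[]) => number;

const checkSteps: Step[] = [
  {
    name: "features check",
    argv: [
      "pnpm", "exec", "blackbox", "features", "check",
      "--features", "./features", "--tests", "./tests", "--json",
    ],
  },
  { name: "system tests", argv: ["pnpm", "test:system"] },
  {
    name: "coverage replay",
    argv: [
      "pnpm", "exec", "blackbox", "coverage", "replay",
      "--coverage-dir", ".blackbox-coverage", "--no-html",
    ],
  },
];

// Runs the steps in order and stops at the first non-zero exit code,
// so the agent reacts to one finding at a time (an assumption of this sketch).
function runCheckLoop(steps: Step[], run: Run): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    const exitCode = run(step.argv);
    results.push({ name: step.name, exitCode });
    if (exitCode !== 0) break;
  }
  return results;
}
```

In a real setup, `run` could wrap Node's `child_process.spawnSync` and return its `status`. Keeping the runner injectable also lets a checker replay the same sequence with a fake runner in tests.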
Agentic Output Verification
Agentic output verification means the agent’s work is checked against evidence outside the agent. In Blackbox, that evidence comes from runtime effects, catalogs, reports, and exit codes.
The workflow should not ask the same agent to be the sole author and final judge. It should give the agent a behavioral report it can react to, then leave merge decisions to a human or CI policy.
Maker And Checker Split
For higher-risk work, separate the implementation role from the verification role:
- The maker changes code.
- The checker runs Blackbox and reads artifacts from disk.
- The checker reports pass, fail, or uncertain against explicit criteria.
- The human decides whether the behavioral evidence is enough to merge.
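One way to make "pass, fail, or uncertain against explicit criteria" concrete is a small verdict function. The sketch below is illustrative only: `Observation`, `CriterionResult`, and `checkerVerdict` are hypothetical names, and the precedence rule (any violation fails, any unknown makes the result uncertain) is an assumption rather than something Blackbox defines.

```typescript
// Hypothetical checker-side types; none of these names come from Blackbox.
type Observation = "met" | "violated" | "unknown";
type Verdict = "pass" | "fail" | "uncertain";

interface CriterionResult {
  criterion: string;        // human-readable statement of the criterion
  observation: Observation; // what the checker read from artifacts on disk
}

// Assumed precedence: a violated criterion fails the change, an unknown one
// leaves it uncertain, and an empty criteria list cannot be judged at all.
function checkerVerdict(results: CriterionResult[]): Verdict {
  if (results.length === 0) return "uncertain";
  if (results.some((r) => r.observation === "violated")) return "fail";
  if (results.some((r) => r.observation === "unknown")) return "uncertain";
  return "pass";
}
```

The verdict is an input to the human decision in the last step, not a replacement for it.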
This matches the direction shown by loop-engineering discussions in the Spec Kit ecosystem, but the page should not depend on any specific agent, model, IDE, or prompt system.
What Blackbox Provides
- Stable artifacts an agent can parse: reports, catalogs, generated feature files, and exit codes.
- A behavioral signal outside the agent’s prose reasoning.
- A way to detect missing required effects and forbidden effects.
- A review artifact that humans can inspect without trusting the agent’s summary.
Useful Structured Outputs
- features check --json reports feature syntax and source/feature drift.
- features lint --json reports structural findings.
- features compare-observations --json reports experimental observation drift.
- omcdc-propagation.json, coverage.json, and shape-coverage.json report runtime coverage evidence.
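Before an agent reasons about the coverage artifacts, a checker can confirm that each expected file is present and parses as JSON. The sketch below makes no assumption about the schema of those files. `checkArtifacts` and `ArtifactStatus` are hypothetical names, and the caller is assumed to have already read the file contents from wherever the coverage run wrote them.

```typescript
// Artifact file names as listed on this page; everything else is hypothetical.
const expectedArtifacts = [
  "omcdc-propagation.json",
  "coverage.json",
  "shape-coverage.json",
];

type ArtifactStatus = "ok" | "missing" | "invalid-json";

// `contents` maps file name to raw file text. A name that is absent from the
// map is reported as missing; the JSON schema is deliberately not inspected.
function checkArtifacts(
  contents: Record<string, string | undefined>,
): Record<string, ArtifactStatus> {
  const status: Record<string, ArtifactStatus> = {};
  for (const name of expectedArtifacts) {
    const text = contents[name];
    if (text === undefined) {
      status[name] = "missing";
      continue;
    }
    try {
      JSON.parse(text);
      status[name] = "ok";
    } catch {
      status[name] = "invalid-json";
    }
  }
  return status;
}
```

A missing or unparseable artifact is better reported as "uncertain" than silently treated as a pass.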
What This Loop Does Not Prove
- It does not replace a human owner for product intent.
- It does not prove every requirement in a written spec.
- It does not remove the need for instrumentation, scenario design, or system tests.
- It should not be documented as a direct Spec Kit command until that integration exists.
Command Safety
Agents should treat exit code 1 from drift, check, lint, or compare commands as a useful finding, not a crash. Exit code 2 usually means the command could not produce a trustworthy result because inputs, directories, or transforms failed.
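The exit-code guidance above can be encoded so that an agent does not confuse a finding with a crash. This is a minimal sketch: `Outcome` and `classifyExit` are hypothetical names, treating exit code 0 as a clean result is a conventional assumption rather than something stated on this page, and the page itself says code 2 only "usually" indicates an untrustworthy result.

```typescript
// Hypothetical classification of exit codes from drift, check, lint,
// or compare commands, following the guidance on this page.
type Outcome = "clean" | "finding" | "untrustworthy" | "unexpected";

function classifyExit(exitCode: number | null): Outcome {
  switch (exitCode) {
    case 0:
      return "clean"; // assumption: conventional success code
    case 1:
      return "finding"; // a useful result to act on, not a crash
    case 2:
      return "untrustworthy"; // inputs, directories, or transforms likely failed
    default:
      return "unexpected"; // includes null, e.g. a process ended by a signal
  }
}
```

An agent would fix code or tests on "finding", repair its inputs and rerun on "untrustworthy", and hand "unexpected" back to a human instead of retrying blindly.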