Feature Files, BDD, and Staleness
Relate BDD, Gherkin, Cucumber, stale feature files, and the Blackbox gates that keep behavior artifacts synchronized.
BDD and Gherkin had the right ambition: make behavior visible, reviewable, and connected to executable checks. The failure mode was maintenance.
When .feature files become another hand-written artifact, they can drift from product intent, test source, and system behavior. A sentence can still read well while the service no longer calls the same dependency, emits the same event, writes the same record, or blocks the same forbidden action.
Blackbox does not ask every team to return to classic Cucumber-style BDD. It keeps the useful shape, then adds gates around it.
Why Feature Files Go Stale
Feature files go stale when they are treated as the source of truth but are not continuously checked against what developers actually run.
Classic BDD often created three things that had to agree:
- The .feature file.
- The step definitions.
- The implementation and test runner behavior.
That can work when the team has discipline and stable behavior. It breaks down when a system changes quickly, when tests are mostly written by engineers rather than product/QA, or when AI-assisted changes increase the amount of behavior reviewers must verify.
Two Kinds Of Drift
Blackbox separates two drift problems that often get mixed together:
| Drift type | Question | Blackbox gate |
|---|---|---|
| Feature-file drift | Does the .feature file still match the test source? | blackbox features check or blackbox features drift |
| Behavioral drift | Did the running system behavior change in a meaningful way? | Effect catalog, effect coverage, and optional observation comparison |
Feature-file drift is about synchronization between source tests and readable Gherkin. Behavioral drift is about runtime effects, assertions, outcomes, and system boundaries.
Do not use bare “drift” when writing docs or error messages. Say which kind.
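The two drift types can be pictured as two different comparisons. The sketch below is illustrative only: `feature_file_drift` and `behavioral_drift` are hypothetical helpers, not Blackbox APIs, and the effect strings are an invented notation. It assumes the emitted Gherkin and the observed effects are already available in memory.

```python
import difflib


def feature_file_drift(emitted: str, on_disk: str) -> list[str]:
    """Feature-file drift: text diff between Gherkin emitted from the
    test source and the checked-in .feature file. Empty means in sync."""
    return list(
        difflib.unified_diff(
            on_disk.splitlines(),
            emitted.splitlines(),
            fromfile="checked-in.feature",
            tofile="emitted-from-tests.feature",
            lineterm="",
        )
    )


def behavioral_drift(baseline: set[str], candidate: set[str]) -> dict[str, list[str]]:
    """Behavioral drift: difference between the boundary effects seen in
    a baseline run and those seen in a candidate run."""
    return {
        "missing": sorted(baseline - candidate),
        "unexpected": sorted(candidate - baseline),
    }
```

The first function compares two pieces of text; the second compares what the system did. A feature file can pass the first comparison while the system fails the second, which is why the two are named separately.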
The Blackbox Direction
Blackbox starts from tests and evidence:
- A system test exercises a behavior.
- The feature analyzer derives an AAA/Given-When-Then trace from the test source.
- The feature emitter writes a Gherkin .feature file from that trace.
- The syntax and drift gate checks the .feature file against the source.
- Runtime evidence and effect catalogs prove what the system did at its boundaries.
The feature file becomes a review surface over executable behavior, not a promise that must be trusted by itself.
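The analyze-then-emit direction described above can be sketched in a few lines. This is a minimal illustration, not the Blackbox emitter: the `Trace` dataclass and `emit_feature` function are hypothetical names, and the trace is assumed to have been extracted from a test already.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical AAA trace derived from one system test."""
    scenario: str
    given: list[str] = field(default_factory=list)
    when: list[str] = field(default_factory=list)
    then: list[str] = field(default_factory=list)


def emit_feature(feature: str, traces: list[Trace]) -> str:
    """Render traces as Gherkin text, one Scenario per trace."""
    lines = [f"Feature: {feature}"]
    for trace in traces:
        lines.append("")
        lines.append(f"  Scenario: {trace.scenario}")
        groups = (("Given", trace.given), ("When", trace.when), ("Then", trace.then))
        for keyword, steps in groups:
            for index, step in enumerate(steps):
                # Gherkin allows And for a repeated step of the same kind.
                lines.append(f"    {keyword if index == 0 else 'And'} {step}")
    return "\n".join(lines) + "\n"
```

Because the text is a function of the trace, regenerating it and comparing against the checked-in file is enough to detect feature-file drift; nobody has to remember to update the sentence by hand.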
Relationship To BDD And Gherkin
Blackbox does not need to argue that BDD was wrong. BDD correctly focused teams on behavior. The problem was the direction and cost of synchronization.
Traditional flow:
- Human writes feature file.
- Automation binds steps to code.
- People must maintain the feature file as the system evolves.
Blackbox flow:
- Test source describes or exercises the behavior.
- Blackbox analyzes the behavior grammar.
- Blackbox emits or checks the feature file.
- Runtime gates prove the behavior through effects and observations.
- People review the artifact changes instead of manually maintaining every sentence.
Gherkin remains the human-readable format. Cucumber-compatible parsing is used for syntax validation. Cucumber-style step definitions are not required.
The Behavior Grammar
The Blackbox feature pipeline is built around AAA:
Given* When+ Then+
The notation is regex-like: zero or more Given steps, then one or more When steps, then one or more Then steps. Given describes setup and preconditions. When describes the action. Then is the concluder: the scenario must end in observable assertions. A feature file without a meaningful Then is documentation, not verification.
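Read literally, the grammar is a regular expression over step keywords. A minimal sketch of that reading, assuming And/But steps have already been resolved to the keyword they continue; `matches_behavior_grammar` is a hypothetical name, not part of Blackbox:

```python
import re

# G = Given, W = When, T = Then
SHAPE = re.compile(r"G*W+T+")


def matches_behavior_grammar(keywords: list[str]) -> bool:
    """True when the keyword sequence has the shape Given* When+ Then+."""
    codes = {"Given": "G", "When": "W", "Then": "T"}
    # Unknown keywords become "?" and therefore never match.
    encoded = "".join(codes.get(keyword, "?") for keyword in keywords)
    return SHAPE.fullmatch(encoded) is not None
```

A scenario with no Given steps is accepted, while one with no When or no Then is rejected.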
The linter enforces this shape and catches common failure modes:
- Then before any When.
- Given after the action.
- When after assertions.
- Missing action.
- Missing assertion.
- Opaque generated steps.
- Repeated setup that should become a Background.
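Several of the failure modes listed above are ordering checks that fit in a single pass over the step keywords. The sketch below is an assumption about how such checks could look, not the Blackbox linter: `lint_scenario` and `shared_setup` are hypothetical names, and opaque-step detection is left out because the source does not define what counts as opaque.

```python
def lint_scenario(keywords: list[str]) -> list[str]:
    """Report ordering and completeness findings for one scenario."""
    findings = []
    seen_when = seen_then = False
    for keyword in keywords:
        if keyword == "Given" and seen_when:
            findings.append("Given after the action")
        elif keyword == "When":
            if seen_then:
                findings.append("When after assertions")
            seen_when = True
        elif keyword == "Then":
            if not seen_when:
                findings.append("Then before any When")
            seen_then = True
    if not seen_when:
        findings.append("Missing action")
    if not seen_then:
        findings.append("Missing assertion")
    return findings


def shared_setup(given_steps_per_scenario: list[list[str]]) -> list[str]:
    """Given steps present in every scenario: candidates for a Background."""
    if len(given_steps_per_scenario) < 2:
        return []
    first, *rest = given_steps_per_scenario
    return [step for step in first if all(step in other for other in rest)]
```

For example, `lint_scenario(["Given", "When", "Given", "Then"])` reports a Given after the action, and a scenario containing only Given steps reports both a missing action and a missing assertion.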
The Cycle Of Gates
Blackbox connects feature files, system tests, and E2E suites in a cycle of gates:
- features lint checks the behavior grammar.
- features emit generates Gherkin from test source.
- features check validates Gherkin syntax and catches feature-file drift.
- The project system-test command runs the behavior.
- Effect catalogs and effect coverage catch boundary behavior changes.
- Observation comparison can catch meaningful runtime differences during migration or refactor.
This is the connection to verification gates: feature files are not just docs, and runtime evidence is not just debugging data. Each artifact has a gate that can run locally, in CI, or inside an SDD/agent workflow.
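One way to picture gates that run locally, in CI, or inside an agent workflow is as an ordered list of named checks. The sketch below makes an assumption the source does not state: that a failed gate blocks the gates after it. `run_gates` is a hypothetical helper, and the lambdas stand in for real invocations of the commands named above.

```python
from typing import Callable

Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> list[tuple[str, str]]:
    """Run gates in order; after the first failure, mark the rest skipped."""
    results = []
    blocked = False
    for name, check in gates:
        if blocked:
            results.append((name, "skipped"))
            continue
        passed = check()
        results.append((name, "passed" if passed else "failed"))
        blocked = not passed
    return results


# Stand-in checks; a real setup would shell out to the actual commands.
example = run_gates([
    ("features lint", lambda: True),
    ("features check", lambda: False),
    ("system tests", lambda: True),
])
```

Here `example` records that lint passed, the feature-file check failed, and the system tests were skipped. A workflow that prefers to collect every failure in one run would drop the `blocked` flag.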
How The REPL Should Teach This
The ideal REPL is not a marketing animation. It should load @suites/blackbox-features, accept a small test, and show the transformation in real time:
- Source test.
- Detected adapter: Playwright or BDD DSL.
- Analyzed AAA/Given-When-Then trace.
- Generated .feature file.
- Lint findings, syntax result, and feature-file drift result.
- Optional observation comparison when baseline and candidate runs exist.
That REPL would make the product promise concrete: you can see exactly what Blackbox inferred, where it was confident, where it fell back to generic text, and which gate will protect the artifact.