Feature Files, BDD, and Staleness

This page relates BDD, Gherkin, Cucumber, and stale feature files to the Blackbox gates that keep behavior artifacts synchronized.

BDD and Gherkin had the right ambition: make behavior visible, reviewable, and connected to executable checks. The failure mode was maintenance.

When .feature files become another hand-written artifact, they can drift from product intent, test source, and system behavior. A sentence can still read well while the service no longer calls the same dependency, emits the same event, writes the same record, or blocks the same forbidden action.

Blackbox does not ask every team to return to classic Cucumber-style BDD. It keeps the readable Given/When/Then shape, then adds gates around it.

Why Feature Files Go Stale

Feature files go stale when they are treated as the source of truth but are not continuously checked against what developers actually run.

Classic BDD often created three things that had to agree:

The .feature file.
The step definitions.
The implementation and test runner behavior.

That can work when the team is disciplined and the behavior is stable. It breaks down when a system changes quickly, when tests are mostly written by engineers rather than product/QA, or when AI-assisted changes increase the amount of behavior reviewers must verify.

Two Kinds Of Drift

Blackbox separates two drift problems that often get mixed together:

Drift type	Question	Blackbox gate
Feature-file drift	Does the `.feature` file still match the test source?	`blackbox features check` or `blackbox features drift`
Behavioral drift	Did the running system behavior change in a meaningful way?	Effect catalog, effect coverage, and optional observation comparison

Feature-file drift is about synchronization between source tests and readable Gherkin. Behavioral drift is about runtime effects, assertions, outcomes, and system boundaries.
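
As a rough illustration of how the two checks differ, the sketch below treats feature-file drift as a text comparison between a committed .feature file and one regenerated from test source, and behavioral drift as a comparison of observed effects. The function names, the effect labels, and the set-based representation are assumptions made for this example; they are not Blackbox's actual data model or implementation.

```python
import difflib


def feature_file_drift(committed: str, regenerated: str) -> list[str]:
    """Return unified-diff lines; an empty list means no feature-file drift."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="committed.feature",
            tofile="regenerated.feature",
            lineterm="",
        )
    )


def behavioral_drift(baseline: set[str], candidate: set[str]) -> dict[str, set[str]]:
    """Compare two sets of observed effects (hypothetical string labels)."""
    return {"added": candidate - baseline, "removed": baseline - candidate}


# The Gherkin text is unchanged, so there is no feature-file drift...
committed = "Scenario: Checkout\n  When the user pays\n  Then an order is stored\n"
regenerated = committed
text_diff = feature_file_drift(committed, regenerated)

# ...but the running system stopped emitting an event and calls a new service.
baseline = {"db.write:orders", "event:order.created"}
candidate = {"db.write:orders", "http.call:fraud-service"}
effect_diff = behavioral_drift(baseline, candidate)

print("feature-file drift:", bool(text_diff))
print("behavioral drift:", bool(effect_diff["added"] or effect_diff["removed"]))
```

The example is built so that the sentence still reads correctly while the boundary behavior has changed, which is the case a text-only check cannot see.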

Do not use bare “drift” when writing docs or error messages. Say which kind.

The Blackbox Direction

Blackbox starts from tests and evidence:

A system test exercises a behavior.
The feature analyzer derives an AAA/Given-When-Then trace from the test source.
The feature emitter writes a Gherkin .feature file from that trace.
The syntax and drift gate checks the .feature file against the source.
Runtime evidence and effect catalogs prove what the system did at its boundaries.

The feature file becomes a review surface over executable behavior, not a promise that must be trusted by itself.
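
The emit step in the list above can be sketched as follows. This is a minimal illustration that assumes the analyzer has already produced an ordered trace of steps; the `Step` type and `emit_feature` function are hypothetical names for this example, not the Blackbox API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    keyword: str  # "Given", "When", or "Then"
    text: str


def emit_feature(feature: str, scenario: str, steps: list[Step]) -> str:
    """Render a trace as Gherkin, using And when a keyword repeats."""
    lines = [f"Feature: {feature}", "", f"  Scenario: {scenario}"]
    previous = None
    for step in steps:
        keyword = "And" if step.keyword == previous else step.keyword
        lines.append(f"    {keyword} {step.text}")
        previous = step.keyword
    return "\n".join(lines) + "\n"


trace = [
    Step("Given", "a cart with one item"),
    Step("When", "the user submits payment"),
    Step("Then", "an order record is written"),
    Step("Then", "an order.created event is emitted"),
]
print(emit_feature("Checkout", "Successful payment", trace))
```

Because the text is generated from the trace, regenerating it after a test change and comparing it with the committed file is what makes a feature-file drift check possible.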

Relationship To BDD And Gherkin

Blackbox does not need to argue that BDD was wrong. BDD correctly focused teams on behavior. The problem was the direction and cost of synchronization.

Traditional flow:

Human writes feature file.
Automation binds steps to code.
People must maintain the feature file as the system evolves.

Blackbox flow:

Test source describes or exercises the behavior.
Blackbox analyzes the behavior grammar.
Blackbox emits or checks the feature file.
Runtime gates prove the behavior through effects and observations.
People review the artifact changes instead of manually maintaining every sentence.

Gherkin remains the human-readable format. Cucumber-compatible parsing is used for syntax validation. Cucumber-style step definitions are not required.

The Behavior Grammar

The Blackbox feature pipeline is built around AAA (Arrange, Act, Assert), expressed in Gherkin keywords as:

Given*  When+  Then+

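The grammar allows zero or more Given steps, followed by one or more When steps, followed by one or more Then steps. Given describes setup and preconditions. When describes the action. Then is the concluder: the scenario must end in observable assertions. A feature file without a meaningful Then is documentation, not verification.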

The linter enforces this shape and catches common failure modes:

Then before any When.
Given after the action.
When after assertions.
Missing action.
Missing assertion.
Opaque generated steps.
Repeated setup that should become a Background.
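
The ordering checks in this list can be sketched as a small state machine over step keywords. The sketch assumes And/But steps have already been resolved to their primary keyword, and it covers only the ordering and missing-step findings, not opaque steps or Background suggestions. The names `lint_shape` and `matches_shape` are hypothetical and do not describe the real linter.

```python
import re

# The grammar Given* When+ Then+, with one letter per keyword.
SHAPE = re.compile(r"G*W+T+")
CODES = {"Given": "G", "When": "W", "Then": "T"}


def matches_shape(keywords: list[str]) -> bool:
    """True when the keyword sequence fits Given* When+ Then+."""
    return SHAPE.fullmatch("".join(CODES[k] for k in keywords)) is not None


def lint_shape(keywords: list[str]) -> list[str]:
    """Return findings for one scenario; an empty list means it passes."""
    findings = []
    seen_when = seen_then = False
    for index, keyword in enumerate(keywords, start=1):
        if keyword == "Given" and seen_when:
            findings.append(f"step {index}: Given after the action")
        elif keyword == "When":
            if seen_then:
                findings.append(f"step {index}: When after assertions")
            seen_when = True
        elif keyword == "Then":
            if not seen_when:
                findings.append(f"step {index}: Then before any When")
            seen_then = True
    if not seen_when:
        findings.append("missing action")
    if not seen_then:
        findings.append("missing assertion")
    return findings


for scenario in (["Given", "When", "Then"], ["When", "Given", "Then"], ["Given", "When"]):
    print(scenario, "->", lint_shape(scenario) or "ok")
```

The regular expression gives a yes/no answer, while the loop explains which rule was broken and at which step, which is the more useful output for a reviewer.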

The Cycle Of Gates

Blackbox connects feature files, system tests, and E2E suites in a cycle of gates:

features lint checks the behavior grammar.
features emit generates Gherkin from test source.
features check validates Gherkin syntax and catches feature-file drift.
The project system-test command runs the behavior.
Effect catalogs and effect coverage catch boundary behavior changes.
Observation comparison can catch meaningful runtime differences during migration or refactor.

This is the connection to verification gates: feature files are not just docs, and runtime evidence is not just debugging data. Each artifact has a gate that can run locally, in CI, or inside an SDD/agent workflow.
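
One way to picture gates that run the same way locally, in CI, or in an agent workflow is a runner that calls each gate in order and reports the first failure by name. The sketch below is an assumption-heavy illustration: the `run_gates` function, the stop-at-first-failure policy, and the lambda stand-ins are invented for this example and do not describe how Blackbox schedules its gates.

```python
from typing import Callable

# A gate returns a list of findings; an empty list means the gate passed.
Gate = Callable[[], list[str]]


def run_gates(gates: list[tuple[str, Gate]]) -> tuple[bool, list[str]]:
    """Run gates in order and stop at the first one that reports findings."""
    report = []
    for name, gate in gates:
        findings = gate()
        if findings:
            report.append(f"FAIL {name}: " + "; ".join(findings))
            return False, report
        report.append(f"PASS {name}")
    return True, report


# Hypothetical stand-ins for the real gates.
gates = [
    ("features lint", lambda: []),
    ("features check", lambda: ["feature-file drift in checkout.feature"]),
    ("effect coverage", lambda: []),
]
ok, report = run_gates(gates)
print("\n".join(report))
```

Naming the failing gate in the report keeps the two kinds of drift distinguishable in the output, in line with the rule against using bare "drift" in error messages.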

How The REPL Should Teach This

The ideal REPL is not a marketing animation. It should load @suites/blackbox-features, accept a small test, and show the transformation in real time:

Source test.
Detected adapter: Playwright or BDD DSL.
Analyzed AAA/Given-When-Then trace.
Generated .feature file.
Lint findings, syntax result, and feature-file drift result.
Optional observation comparison when baseline and candidate runs exist.

That REPL would make the product promise concrete: you can see exactly what Blackbox inferred, where it was confident, where it fell back to generic text, and which gate will protect the artifact.