Tests, Effects, and Feature Files

Understand the combined Blackbox loop: tests execute behavior, effects prove runtime boundaries, feature files make behavior readable, and gates keep them synchronized.

Blackbox is strongest when three artifacts stay connected:

A system or E2E test that executes the behavior.
Runtime effects that prove what the system actually did.
A readable feature file when the team wants behavior expressed in Gherkin.

You can use these layers separately. Effects are the core Blackbox layer. Feature files are optional. The combination matters because it turns a green test into a reviewable behavioral proof trail.

The combined workflow is stronger than any artifact alone: tests keep behavior executable, effects make runtime proof reviewable, and feature files give humans and agents a readable behavior surface.

The Three Artifacts

Each artifact answers a different question.

Artifact	What it gives you	Question it answers
System or E2E test	Executable behavior	Can we run the workflow again?
Runtime effects	Evidence from the running system	What did the system actually do and avoid?
Feature file	Human-readable behavior	Can reviewers, product people, and agents understand the intended behavior?

The mistake is treating one of them as a replacement for the others.

A test without effects can pass while missing a required queue message, audit write, cache update, or downstream call. A feature file without a test can become stale documentation. Runtime evidence without a readable behavior surface can be hard to review outside the engineering team.
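
The first failure mode can be made concrete with a small in-memory model. This is a sketch only: the `Effect` type, the `subscribe` handler, and the `billing-events` queue name are hypothetical and are not part of the Blackbox API. It shows a handler whose status assertion passes while a required queue message is never published.

```typescript
// Hypothetical model for illustration; none of these names are Blackbox APIs.
type Effect = { kind: string; target: string; detail?: string };

interface RunResult {
  status: number;
  effects: Effect[];
}

// A buggy handler: it returns 201 but forgets to publish the queue message.
function subscribe(userId: string): RunResult {
  const effects: Effect[] = [
    { kind: "db.write", target: "subscriptions", detail: userId },
  ];
  // Missing on purpose: { kind: "queue.publish", target: "billing-events" }
  return { status: 201, effects };
}

// Returns the required effects that were never observed in the run.
function missingEffects(required: Effect[], observed: Effect[]): Effect[] {
  return required.filter(
    (r) => !observed.some((o) => o.kind === r.kind && o.target === r.target),
  );
}

const result = subscribe("alice");

// The only thing a status-only test would check.
const statusAssertionPasses = result.status === 201;

// What an effect-aware check would add.
const missing = missingEffects(
  [
    { kind: "db.write", target: "subscriptions" },
    { kind: "queue.publish", target: "billing-events" },
  ],
  result.effects,
);

console.log(statusAssertionPasses); // true
console.log(missing.map((e) => `${e.kind}:${e.target}`)); // [ 'queue.publish:billing-events' ]
```

The status check alone stays green; only the comparison against the required effects reveals the gap.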

Blackbox lets the layers reinforce each other.

Why The Combination Changes The Process

Without this loop, many reviews stop at “the test is green.” That is often too thin for a refactor, migration, incident regression, or AI-generated change.

With the loop, a behavior change has more surfaces:

The test shows the workflow still executes.
The effect catalog shows the required and forbidden runtime behavior.
Effect coverage shows which effects were observed, missing, failed, or uncovered.
The feature file shows the behavior in a readable scenario format.
Drift checks show whether the readable behavior is still aligned with the test source.

This changes the review from “does the assertion pass?” to “does the system still prove the behavior we care about?”

That is the core Blackbox category: runtime-backed behavioral verification for system and E2E tests.
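
One way to picture how a catalog and a coverage report relate is the sketch below. It is an assumption-laden illustration, not the Blackbox implementation: the types, the effect ids, and the exact meaning given to each status ("observed", "missing", "failed", "uncovered", plus an extra "absent" for forbidden effects that correctly did not occur) are guesses made for the example.

```typescript
// Illustrative only: types and status semantics are assumptions, not the Blackbox API.
type Expectation = "required" | "forbidden";
type Status = "observed" | "missing" | "failed" | "absent";

interface CatalogEntry {
  id: string;
  expectation: Expectation;
}

interface CoverageReport {
  entries: Record<string, Status>;
  uncovered: string[]; // effects seen at runtime that have no catalog entry
}

function buildCoverage(catalog: CatalogEntry[], observedIds: string[]): CoverageReport {
  const observed = new Set(observedIds);
  const entries: Record<string, Status> = {};

  for (const entry of catalog) {
    const seen = observed.has(entry.id);
    if (entry.expectation === "required") {
      // A required effect is either observed or missing.
      entries[entry.id] = seen ? "observed" : "missing";
    } else {
      // A forbidden effect fails the run if it shows up.
      entries[entry.id] = seen ? "failed" : "absent";
    }
  }

  const known = new Set(catalog.map((e) => e.id));
  const uncovered = observedIds.filter((id) => !known.has(id));
  return { entries, uncovered };
}

const report = buildCoverage(
  [
    { id: "queue.publish:billing-events", expectation: "required" },
    { id: "email.send:welcome", expectation: "required" },
    { id: "http.call:legacy-billing", expectation: "forbidden" },
  ],
  ["queue.publish:billing-events", "http.call:legacy-billing", "cache.set:user-plan"],
);

console.log(report.entries);
console.log(report.uncovered); // [ 'cache.set:user-plan' ]
```

In this run the welcome email is missing, the legacy billing call is a failure, and the cache write is uncovered because nothing in the catalog describes it.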

The Verification Gates

The gates are deliberately layered. A team can start with only the runtime effect layer, then add feature-file gates when readable behavior specs become valuable.

Gate	Layer	What it protects
Project system test	Test	The workflow still runs and assertions still pass
Effect catalog matchers	Runtime effects	Required effects happened and forbidden effects stayed absent
Effect coverage report	Runtime effects	The cataloged effects were actually covered by the run
OMC/DC report	Runtime effects	Decision-sensitive behavior is distinguishable in observed effects
`features lint`	Feature files	The scenario has a useful Given/When/Then shape
`features check` / `features drift`	Feature files	The `.feature` file did not drift from the test source
`features compare-observations`	Optional semantic comparison	Runtime observations did not change meaningfully during a reshape

These gates do not all need to block on day one. The important choice is to make the gate match the risk.

For a first adoption, effect catalog matchers and effect coverage usually matter most. For a BDD-heavy or spec-driven team, feature-file drift becomes a useful second gate. For a large refactor, migration, or AI-assisted change, observation comparison can add another review surface.
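
Matching gates to risk can be written down as an explicit policy. The sketch below is hypothetical: the gate identifiers are shortened from the table above, while the profile names, the choice of which gates block, and the `gateMode` helper are assumptions made for illustration rather than a Blackbox configuration format.

```typescript
// Hypothetical policy shape; gate ids are abbreviations of the gates in the table above.
type Gate =
  | "system-test"
  | "effect-matchers"
  | "effect-coverage"
  | "omc-dc"
  | "features-lint"
  | "features-drift"
  | "compare-observations";

type RiskProfile = "first-adoption" | "bdd-team" | "large-refactor";

// Assumption: each profile blocks on a small set of gates and only warns on the rest.
const blockingGates: Record<RiskProfile, Gate[]> = {
  "first-adoption": ["system-test", "effect-matchers", "effect-coverage"],
  "bdd-team": ["system-test", "effect-matchers", "effect-coverage", "features-drift"],
  "large-refactor": ["system-test", "effect-matchers", "effect-coverage", "compare-observations"],
};

function gateMode(profile: RiskProfile, gate: Gate): "block" | "warn" {
  return blockingGates[profile].includes(gate) ? "block" : "warn";
}

console.log(gateMode("first-adoption", "features-drift")); // warn
console.log(gateMode("bdd-team", "features-drift")); // block
```

Keeping the policy in one place makes it easy to promote a gate from warning to blocking once the team trusts its signal.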

You Can Adopt The Layers Separately

Blackbox does not force a single methodology.

Adoption path	What you use	Best when
Effects only	Runtime evidence, catalog, matchers, coverage	You already have system tests and want stronger behavioral proof
Feature files only	`features emit`, lint, check, drift	You want readable scenarios generated from tests, even before effect gates
Combined loop	Tests, effects, coverage, feature files, drift gates	You want executable behavior, runtime proof, and readable specs to move together

Effects are the must-have layer because they capture the missing middle of many E2E and system tests: what happened between input and output.

Feature files add a different kind of value. They make behavior easier to review, discuss, and compare over time. They are especially useful when the team already uses BDD language, wants Gherkin artifacts, or is moving toward spec-driven development.

Why This Increases Confidence

Confidence improves because the same behavior is checked from multiple angles.

A system test proves the workflow can execute. The effect catalog proves the runtime boundaries that matter. The coverage report shows whether those effects were exercised. The feature file gives the behavior a readable form. The gates keep the artifacts synchronized.

This is not formal proof. It is practical engineering evidence from a real run.

That evidence is useful before a refactor because it records what the current system does. It is useful during a refactor because it shows exactly which behavior changed. It is useful after a refactor because it leaves behind artifacts future reviewers can inspect.

Why This Matters For AI-Assisted Development

AI coding agents make implementation changes quickly. They can also update tests, prose, and feature files quickly. That speed makes external verification more important, not less.

Blackbox gives the agentic workflow stop conditions outside the generated code:

If the source test changed, feature-file drift exposes it.
If the implementation changed behavior, runtime effects expose it.
If required effects disappeared, coverage exposes it.
If the readable scenario changed, the feature diff exposes it.
If a refactor changes the runtime shape, observation comparison can expose it.

This gives reviewers a better question to ask an agent: not “did you say the task is done?” but “which tests, effects, feature files, and gates prove the behavior?”

Where Gherkin And BDD Fit

Blackbox uses Gherkin as a readable behavior format. It does not require Cucumber as the test runner, and it does not require teams to adopt classic BDD ceremony before getting value.

The feature-file track is about keeping readable behavior close to executable behavior:

Analyze existing Playwright or Blackbox scenario tests.
Emit `.feature` files from the test source.
Lint the Given/When/Then shape.
Check whether the feature file drifted from the test.
Optionally compare runtime observations across baseline and candidate runs.

The result is not a promise that feature files will never go stale. It is a way to make staleness detectable.
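
A minimal sketch of what "detectable staleness" can mean: regenerate the Gherkin from the test source and compare it with the checked-in file. The `detectDrift` function and its line-by-line comparison are assumptions for illustration; the actual `features check` and `features drift` commands may detect drift differently.

```typescript
// Illustrative drift check; not the algorithm used by the Blackbox CLI.
interface Drift {
  line: number; // 1-based index into the normalized (trimmed, non-empty) lines
  expected: string | undefined; // line emitted from the test source
  actual: string | undefined; // line found in the checked-in .feature file
}

// Trim each line and drop blank ones so whitespace-only changes are ignored.
function normalize(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
}

function detectDrift(emitted: string, checkedIn: string): Drift[] {
  const e = normalize(emitted);
  const c = normalize(checkedIn);
  const drifts: Drift[] = [];
  const n = Math.max(e.length, c.length);
  for (let i = 0; i < n; i++) {
    if (e[i] !== c[i]) {
      drifts.push({ line: i + 1, expected: e[i], actual: c[i] });
    }
  }
  return drifts;
}

const emittedFeature = [
  "Feature: subscribing to the pro tier",
  "  Scenario: alice is an existing user with no active subscription",
  "    When alice POSTs /subscriptions with a valid card",
  "    Then the response status is 201",
].join("\n");

// Someone edited the checked-in file without touching the test.
const staleFeature = emittedFeature.replace("201", "200");

console.log(detectDrift(emittedFeature, emittedFeature).length); // 0
console.log(detectDrift(emittedFeature, staleFeature));
```

The second call reports a single drift on the `Then` line, which is the kind of signal a drift gate can turn into a failing check.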

Test to Feature REPL

This section sketches a future package-backed REPL flow: load a test, analyze its AAA (Arrange-Act-Assert) or Given-When-Then shape, emit Gherkin, then run the gates.

Source Test

The input can be a plain Playwright-style system test or a Blackbox BDD-DSL test. Plain tests are decompiled best-effort; DSL-authored tests preserve more intent.

test.system('subscribe-flow', 'alice subscribes to the pro tier', () => {
  test('alice is an existing user with no active subscription', async ({ request, system }) => {
    const response = await request.post(`${system.bff.hostBaseUrl}/subscriptions`, {
      data: { userId: 'alice', paymentMethodId: 'pm_card_visa' },
    });

    expect(response.status()).toBe(201);
  });
});

Analyzed Behavior Trace

The analyzer turns test structure into a behavior trace. The linter checks that the trace has a valid AAA shape: `Given*`, `When+`, `Then+` (zero or more Given steps, then one or more When steps, then one or more Then steps).

adapter: playwright
feature: subscribing to the pro tier
flow: subscribe-flow

scenario: alice is an existing user with no active subscription
  when:
    alice POSTs /subscriptions with a valid card
  then:
    response status is 201

grammar:
  aaa-shape: pass
  missing-then: pass
  opaque-step: none
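
The `Given*`, `When+`, `Then+` rule reads naturally as a regular expression over step kinds. The following is a minimal sketch under that assumption; `StepKind` and `hasValidAaaShape` are hypothetical names, and the real linter may apply additional rules.

```typescript
// Sketch of the shape rule only; names are hypothetical, not the Blackbox linter API.
type StepKind = "given" | "when" | "then";

function hasValidAaaShape(kinds: StepKind[]): boolean {
  // Encode each step as its first letter, e.g. ["given", "when", "then"] -> "gwt".
  const encoded = kinds.map((k) => k.charAt(0)).join("");
  // Zero or more givens, then at least one when, then at least one then.
  return /^g*w+t+$/.test(encoded);
}

console.log(hasValidAaaShape(["when", "then"])); // true
console.log(hasValidAaaShape(["given", "when", "then", "then"])); // true
console.log(hasValidAaaShape(["given", "then"])); // false
console.log(hasValidAaaShape(["then", "when"])); // false
```

The trace above has no `given` steps and still passes, because `Given*` allows zero of them.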

Generated Gherkin

The feature file is a readable projection from the test source. It is useful for review, but it is still checked against the source instead of trusted as disconnected prose.

@flow:subscribe-flow
Feature: subscribing to the pro tier

  Scenario: alice is an existing user with no active subscription
    When alice POSTs /subscriptions with a valid card
    Then the response status is 201
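
To show what "projection" means here, the sketch below renders a trace object into the Gherkin text shown above. The `Trace` and `Scenario` shapes and the `renderFeature` function are assumptions for illustration; they are not the output format or API of `features emit`.

```typescript
// Hypothetical trace model and renderer; not the Blackbox emitter.
interface Scenario {
  name: string;
  given: string[];
  when: string[];
  then: string[];
}

interface Trace {
  flow: string;
  feature: string;
  scenarios: Scenario[];
}

// The first step of a kind uses its keyword; later steps of the same kind use "And".
function renderSteps(keyword: string, steps: string[]): string[] {
  return steps.map((s, i) => `    ${i === 0 ? keyword : "And"} ${s}`);
}

function renderFeature(trace: Trace): string {
  const lines: string[] = [`@flow:${trace.flow}`, `Feature: ${trace.feature}`];
  for (const sc of trace.scenarios) {
    lines.push(
      "",
      `  Scenario: ${sc.name}`,
      ...renderSteps("Given", sc.given),
      ...renderSteps("When", sc.when),
      ...renderSteps("Then", sc.then),
    );
  }
  return lines.join("\n");
}

const gherkin = renderFeature({
  flow: "subscribe-flow",
  feature: "subscribing to the pro tier",
  scenarios: [
    {
      name: "alice is an existing user with no active subscription",
      given: [],
      when: ["alice POSTs /subscriptions with a valid card"],
      then: ["the response status is 201"],
    },
  ],
});

console.log(gherkin);
```

Because the text is derived from structured data, regenerating it after a test change is cheap, which is what makes a later drift comparison practical.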

Verification Gate

The gate is two-part: Cucumber-compatible Gherkin syntax validation, plus feature-file drift detection against the test source. Runtime effects and observation comparison can add stronger gates later.

$ pnpm exec blackbox features check --features ./features --tests ./e2e/tests
syntax: 1/1 .feature files parsed cleanly.
drift:  no drift detected.

$ pnpm exec blackbox features lint ./e2e/tests --fail-on error
no lint findings