title: "Extract — "What do we NOT see?"" source: "tasks/TFW-41__execution_quality_gates/research/extract.md"
Extract — "What do we NOT see?"¶
Parent: HL-TFW-41 Goal: Structural gates at every TFW handoff point — prevent failures, not just detect them.
Findings¶
E1: Evidence Chain — TS Content Type → Execution Quality (H1)¶
Controlled comparison (same project, same coordinator, same executor model):
| Case | TS type | TS size | AC pass rate | Phase D needed? | Evidence |
|---|---|---|---|---|---|
| HD-16/C | Code-in-TS (§4 Detailed Steps) | 50KB / 1280 LOC | 23/24 (96%) | YES — palette, 89 colors | per_page=100 vs 500 |
| HD-18/C | Requirements-first (§4 Acceptance Criteria) | 11KB / 206 LOC | 100% | NO | All grep gates passed |
Confounding variables controlled: - Same TFW version (conventions, templates) - Same project codebase (Helpdesk) - Executor model class: AI agent in both cases - Coordinator: AI + PO involvement in both
Causal mechanism: HD-16/C executor had complete code to copy → copied without thinking → didn't verify per_page against backend cap. HD-18/C executor had requirements + grep gates → had to engineer solution → ran gates that would have caught the discrepancy.
H1 verdict: SUPPORTED. The evidence chain is strong — requirements-first TS at minimum eliminates the "copy without verifying" failure mode. The 79% size reduction is a secondary benefit (fewer tokens = more attention budget for actual engineering).
Critical PO evidence: process.md F24 — "TS = требования + gates, НЕ код. TS с кодом → исполнитель копирует без мысли → форма есть, содержания нет." Source: PO feedback, 2026-04-19. The coordinator was explicitly taught this principle by the PO. Without F24, the coordinator would have continued writing code-in-TS.
Implication for TFW-41: Template enforcement is necessary. The coordinator learned F24 only through PO feedback after failure. Without structural enforcement in the TS template, future coordinators (new sessions, new projects) won't have F24 in their knowledge base.
E2: Pre-TS Gate Analysis (H2)¶
HD-18 coordinator self-assessment — 3 errors that reached ONB:
| # | Error | Root cause | Would reading RF N-1 have prevented it? |
|---|---|---|---|
| 1 | Wrong function signature (copied from other workers) | Read own TS plan, not actual code | ✅ YES — RF Phase B documents actual signatures |
| 2 | Wrong worker count (didn't count create_task calls) | Assumed count from HL plan | ✅ YES — RF Phase B §2 Key Decisions documents exact count |
| 3 | Referenced already-updated test | Read own TS Phase B, not RF Phase B | ✅ YES — RF Phase B lists which files were modified |
H2 verdict: STRONGLY SUPPORTED. All 3 errors have the same root cause: coordinator read own plan (TS N-1) instead of actual output (RF N-1). A Pre-TS Gate requiring Read RF Phase {N-1} before writing TS Phase {N} would have prevented all 3.
Structural analysis: The current plan.md workflow has no step that says "read RF of previous phase." It says "read HL" and "write TS" — the coordinator naturally reads their own plan because it's the most recent artifact they wrote.
E3: Execution Loops Stress Test (H3)¶
Would Execution Loops have caught HD-16/C per_page=100?
Scenario reconstruction:
1. TS specified: per_page: 500 in MonitorPage code
2. Executor copied code with per_page: 500 literal
3. Backend max_per_page=100 clamped the value
4. No gate checked actual runtime behavior
If Execution Loops had been active (Gate per Requirement):
Requirement would have been: "R3: per_page=500 — Monitor fetches all daily tickets" Gate would have been: "API returns 500 items when 500 exist" or "No truncation warning"
But: The gate as written in the TS was a CODE specification (per_page: 500), not a BEHAVIORAL requirement. Even with Execution Loops, the executor would have verified "does my code say 500?" → yes → pass. The real gate should have been: "Does the API return the expected count?"
H3 verdict: CONDITIONALLY SUPPORTED. Execution Loops catch issues only when gates test behavior, not code text. The loop pattern requires Requirements-first TS (H1) to be effective. Without H1, loops verify copy accuracy, not functional correctness.
Counter-evidence: HD-18/C succeeded without explicit Execution Loops. The requirements-first pattern with grep gates was sufficient. This suggests Execution Loops add value primarily when: - >3 requirements (more chances for drift between R_n and R_{n+1}) - Requirements involve runtime behavior, not just code structure - Requirements have dependencies (R3 depends on R1's API change)
E4: Zwicky Morphological Box — Meta-Test (H4)¶
H4-A: Diagnosing HD-19 Zwicky Box Failure¶
HD-19 extract.md Section C — what went wrong:
| Symptom | Root cause |
|---|---|
| All 8 dimensions selected Alt 1 | No Cross-Consistency Assessment (CCA) performed |
| "Рекомендуется" annotation on first option | Pre-judged before systematic evaluation |
| No elimination of inconsistent combinations | CCA step skipped entirely |
| No new discovery | No parameter interaction analysis |
The Zwicky Box was decorative. The researcher already had recommendations from sections A and B, then built the box to confirm them. This is the OPPOSITE of GMA methodology — the box should GENERATE options, CCA should ELIMINATE options, and the result should INFORM (not confirm) the recommendation.
Root cause: The research workflow says nothing about HOW to do Extract. The template says {content — patterns, comparisons, internal file analysis} — completely open-ended. The researcher optimized for completion, not discovery.
H4-B: Applying Zwicky Box to TFW-41's Solution Space (Live Test)¶
Step 1: Define parameters (independent dimensions of the solution)
| # | Parameter | Description |
|---|---|---|
| P1 | TS content structure | How the TS template organizes what-to-do |
| P2 | Gate verification method | How gates are checked |
| P3 | Coordinator input verification | What the coordinator reads before writing TS |
| P4 | Executor self-check pattern | When/how executor verifies own work |
| P5 | Principle enforcement mechanism | How HL §7 principles survive into implementation |
Step 2: Enumerate values per parameter
| P | V1 | V2 | V3 | V4 |
|---|---|---|---|---|
| P1 | §4 = Detailed Steps (current) | §4 = Requirements + §5 = Guidance | §4 = BDD/Gherkin scenarios | §4 = AC checklist only (no steps, no guidance) |
| P2 | Manual reviewer check | Grep/CLI commands in TS | Automated test assertion | Screenshot/visual evidence |
| P3 | Read HL only (current) | Read RF N-1 + HL | Read RF N-1 + HL + diff against TS N-1 | Read RF N-1 only (no HL re-read) |
| P4 | Linear execution (current) | Gate per Requirement (Execution Loop) | Per-file verification | Post-execution self-review |
| P5 | Text in HL §7 only (current) | HL §7 → TS AC mapping table | HL §7 → reviewer checklist item | HL §7 → automated gate |
Step 3: Cross-Consistency Assessment (CCA)
Pairwise elimination of inconsistent combinations:
| Pair | Inconsistency | Eliminated |
|---|---|---|
| P1-V1 × P4-V2 | Detailed Steps + Execution Loops: loops verify code text, not behavior | ⚠️ weak — loops possible but low value |
| P1-V3 × P2-V2 | BDD scenarios + grep commands: different verification paradigms | ❌ inconsistent |
| P1-V4 × P2-V4 | AC-only + screenshots: no guidance → executor may screenshot wrong thing | ⚠️ weak |
| P2-V4 × P4-V2 | Screenshots + Execution Loops: screenshot per requirement = impractical overhead | ❌ for >5 requirements |
| P3-V4 × P5-V2 | RF-only (no HL) + HL§7 mapping: can't map principles if HL not read | ❌ inconsistent |
| P1-V1 × P5-V2 | Detailed Steps + principle mapping: mapping exists but executor copies code, ignoring mapping | ⚠️ weak — mapping wasted |
Surviving viable configurations after CCA:
| Config | P1 | P2 | P3 | P4 | P5 | Notes |
|---|---|---|---|---|---|---|
| C1 | V2 (Req+Guidance) | V2 (grep/CLI) | V2 (RF+HL) | V2 (Loops) | V2 (mapping table) | Full enforcement — all new mechanisms |
| C2 | V2 (Req+Guidance) | V2 (grep/CLI) | V2 (RF+HL) | V1 (linear) | V2 (mapping table) | Moderate — no loops, requirements-first does heavy lifting |
| C3 | V2 (Req+Guidance) | V3 (test assertion) | V2 (RF+HL) | V2 (Loops) | V3 (reviewer check) | Test-heavy — requires test infra, higher overhead |
| C4 | V4 (AC only) | V2 (grep/CLI) | V2 (RF+HL) | V2 (Loops) | V2 (mapping) | Minimal TS — no guidance at all, executor fully autonomous |
CCA eliminated: 6 pairwise inconsistencies → reduced from 1024 (4^5) to 4 viable configurations.
Observation: C1 and C2 differ only in P4 (loops vs linear). H3 analysis shows loops are conditionally valuable. C3 requires test automation infrastructure. C4 gives zero guidance — risky for complex tasks.
Recommendation: C1 for phases with >3 requirements. C2 for simple phases (≤3 requirements). Both use Requirements+Guidance (P1-V2) and Pre-TS Gate (P3-V2).
H4-C: Diagnosing WHY the Box Was Decorative in HD-19¶
The researcher's error was NOT in the box itself but in the process:
-
No genuine CCA. The researcher annotated "рекомендуется" on values BEFORE doing pairwise elimination. The recommendation should come AFTER CCA, not before.
-
Parameters weren't independent. D2 (permission invalidation) depends on D3 (frontend store) — SSE requires Zustand, not Context. These should have been combined or the dependency made explicit.
-
No new discovery. A valid Zwicky Box should produce at least one viable configuration the researcher didn't initially consider. HD-19 produced zero — the "recommended" was already known from sections A and B.
Fix for TFW research Extract workflow:
The Zwicky Box step needs algorithmic enforcement:
- MUST enumerate ≥3 values per parameter (prevent binary "yes/no" thinking)
- MUST perform CCA pairwise (table of eliminated pairs with stated inconsistency)
- MUST NOT annotate "recommended" before CCA (prevent confirmation bias)
- MUST produce ≥2 surviving configurations (prevent single-option validation)
- MUST identify ≥1 non-obvious configuration that wasn't in the initial hypothesis
E5: Token Budget Analysis¶
Current workflow word counts vs proposed additions:
| Workflow | Current (words) | Proposed addition | New total | Budget (1200) |
|---|---|---|---|---|
| handoff.md | ~680 | Pre-RF Gate (~40w) + Execution Loops (~80w) + ONB answer protocol (~30w) | ~830 | ✅ Under |
| plan.md | ~795 | Pre-TS Gate (~50w) | ~845 | ✅ Under |
| review.md | ~600 | HL §7 principles check (~30w) | ~630 | ✅ Under |
| research/base.md | ~500 | Zwicky Box mandate in Extract (~60w) | ~560 | ✅ Under |
All additions fit within the 1200-word design rule. No workflow exceeds budget.
E6: TFW Project Self-Evidence¶
Does TFW itself exhibit the same problems?
Checked TFW task artifacts. The TS template (§4 Detailed Steps) is the same template used in Helpdesk. TFW tasks (code-focused: TFW-26, TFW-27) also used code-in-TS patterns. The difference: TFW is a methodology project — its "code" is markdown templates and workflows, which are INHERENTLY requirements (they describe what the system should do). So TFW is partially immune to the code-in-TS problem because its code IS requirements.
But: TFW research iterations (TFW-22, TFW-24, TFW-32) show the same "first viable option" pattern in Extract stages — no morphological analysis, no systematic parameter enumeration. H4's problem exists in TFW's own research history.
Checkpoint¶
| Found | Remaining |
|---|---|
| H1 SUPPORTED: Requirements-first TS → better execution quality. Evidence: HD-16/C vs HD-18/C + PO's F24 | — |
| H2 STRONGLY SUPPORTED: Pre-TS Gate prevents plan≠fact drift. All 3 HD-18 errors preventable | — |
| H3 CONDITIONALLY SUPPORTED: Loops valuable for >3 requirements with behavioral gates | Need Challenge to stress-test the >3 threshold |
| H4: Zwicky Box viable BUT needs algorithmic enforcement (5 rules). HD-19 = decorative box. TFW-41 live test produced 4 configs from 1024 | Need Challenge to test if enforcement rules are sufficient |
| Token budgets: all additions fit within 1200-word limit | — |
| PO explicitly taught F24 — coordinator doesn't self-correct | — |
Sufficiency: - [x] External source used? (Zwicky GMA methodology applied with proper CCA) - [x] Briefing gap closed? (All 4 hypotheses analyzed with evidence)
Deep mode criteria: - [x] Hypothesis tested? All 4 hypotheses tested with evidence and verdict - [x] Counter-evidence sought? HD-18/C success without loops (→ conditional H3). HD-19 Zwicky failure (→ enforcement rules for H4).
Metacognitive check: The Zwicky Box meta-test produced genuine discovery — configuration C4 (AC-only, no guidance) is viable but risky, which I hadn't considered. The CCA also revealed that P1-V1 (current Detailed Steps) + P4-V2 (Execution Loops) is a WEAK combination — loops on code-in-TS verify copy accuracy, not functional correctness. This is a new insight.
Stage complete: YES → Recommend: proceed to Challenge. Key questions: (1) Is the >3 requirement threshold for Execution Loops correct? (2) Are the 5 Zwicky enforcement rules sufficient or over-constraining? (3) What's the failure mode of Requirements-first TS — when does it NOT work?