
RES — TFW-38: Quality Enforcement

Date: 2026-04-14 | Author: Researcher | Status: 🔬 RES, Complete | Parent HL: HL-TFW-38 | Mode: Pipeline


Research Context

Empirical audit of TFW agent behavior across 4 active projects (helpdesk, auto-schedule, atamat, steps-framework) to test whether executors skip RF §6-8, reviewers rubber-stamp, researchers omit Findings Maps, and diagrams are abandoned. Goal: collect evidence to support or refute the HL hypotheses before writing the TS.

Briefing

Pipeline research for TFW-38. Focus areas: (1) empirical skip rates for RF §6-8 across real projects, (2) actual reviewer audit behavior from REVIEW files, (3) RES Findings Map presence rates, (4) external research on LLM instruction-following patterns. Mode: focused (1 OODA loop per stage).

Decisions

| # | Decision | Rationale |
|---|---|---|
| D1 | Explicit §6-8 enumeration in handoff.md is the correct fix | 96-100% skip rate empirically validated across 80+ RF files. Template presence is necessary but not sufficient. The only positive examples (HD PhaseG/H) correlate with user attention, not workflow instructions. Apple RLCF research confirms checklist decomposition beats implicit compliance. |
| D2 | Reviewer audit step is needed but should include a triage gate | Reviewers currently trust RF claims. Only TFW-19 has an explicit "Independent Verification" section. TFW-36 reviewer self-assessed: "I trusted executor's self-reported fix instead of verifying independently." But a file-level audit would be wasteful for analytical/docs tasks. Triage: code task → spot-check; docs task → verify existence. |
| D3 | Diagram collection should INDEX, not COPY | Zero diagrams are currently collected in any project. Copying creates stale repositories. Better: add references to KNOWLEDGE.md §2 ("Architecture diagram → RF Phase G §8"). Diagrams stay in the RF trace with full context. |
| D4 | 9 mandatory items in handoff Phase 3 is within the empirically validated constraint budget | Apple RLCF research shows agents struggle with >10 constraints. Going from 6 to 9 stays under the threshold. Each new item has an "If empty, write 'No X'" fallback, so it adds zero content burden. |
| D5 | Review workflow may benefit from research-like staged structure (Read → Verify → Challenge → Synthesize) | User observation: the research flow's power comes from forced cognitive mode transitions. Current review is single-pass. Structured stages would make audit behavior reliable, not personality-dependent. Beyond TFW-38 Phase A scope; recommended as iteration 2 thread. |
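
The triage rule in D2 can be sketched as a small lookup. This is a minimal illustration only: `TaskKind` and `audit_depth` are hypothetical names, and treating analytical tasks the same as docs tasks is an assumption drawn from the "analytical/docs" wording above.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical task categories for the D2 triage gate."""
    CODE = "code"
    DOCS = "docs"  # assumption: analytical tasks are triaged like docs tasks


def audit_depth(kind: TaskKind) -> str:
    """Return the reviewer action D2 assigns to a task kind."""
    if kind is TaskKind.CODE:
        return "spot-check"
    return "verify existence"
```

A reviewer step would call `audit_depth(TaskKind.CODE)` before opening any files, so the cost of the audit scales with the kind of task rather than being fixed.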

Open Questions

| # | Question | Status | Answer |
|---|---|---|---|
| Q1 | Which projects have the most RF/RES/REVIEW files? | ✅ Answered | Helpdesk (newest), atamat (most, older TFW), auto-schedule (few recent), steps-framework (most with current TFW) |
| Q2 | Are there positive examples of §6-8 being filled? | ✅ Answered | Only HD PhaseG/H (2026-04-14); correlates with user quality discussion, not workflow. TFW-36 PhaseA also has all sections. |
| Q3 | Has diagram collection been attempted before? | ✅ Answered | No, never attempted in any form. |

Hypotheses (from HL §10)

| # | Hypothesis | HL Status | RES Status | Evidence |
|---|---|---|---|---|
| H1 | Adding explicit §6-8 enumeration to handoff.md Phase 3 will stop executors from skipping them | open | 🟢 supported | 96-100% skip rate with current workflow. Only filled when user explicitly focused on quality. Template alone insufficient. Apple RLCF confirms checklist decomposition works. 9 items < 10 constraint threshold. |
| H2 | An audit verification step in review.md will change reviewer behavior from "trust" to "verify" | open | 🟢 supported | TFW-36 reviewer admitted trust-chain failure. Only TFW-19 has independent verification. 2026 industry consensus: multi-gate review with deterministic verification for objective checks. Triage gate handles analytical task overhead. User proposes deeper restructure (staged review like research flow); iteration 2 thread. |

HL Update Recommendations

| # | What to update | Source |
|---|---|---|
| 1 | Phase B D6: change "storage location strategy" to "indexing strategy"; diagrams should be indexed in KNOWLEDGE.md, not copied to docs/diagrams/ | D3, Challenge C5 |
| 2 | Add H3: "Structured review stages (Read → Verify → Challenge → Synthesize) will produce more reliable reviews than a single-pass audit step" | D5, user observation |
| 3 | Consider adding Phase C: "Review Workflow Restructure" for the staged review concept, if iteration 2 confirms value | D5 |
| 4 | Risk R2 mitigation confirmed valid: triage gate for trivial/analytical tasks is the right approach | D2, Challenge C2 |

Fact Candidates

Cognitive mode: Pure reporting — record factual observations without interpretation or synthesis.

Before writing: review the conversation history. The human's messages are the primary source.

| # | Category | Candidate | Source | Confidence |
|---|---|---|---|---|
| F1 | process | RF §6-8 skip rate is 96-100% across 80+ RF files in 4 active projects. The template contains the sections with detailed instructions, but agents don't fill them because the handoff workflow doesn't enumerate them. Template ≠ enforcement. | Gather G1, cross-project grep | High |
| F2 | process | The only RF files with §6-8 filled (HD PhaseG, PhaseH, TFW-36 PhaseA) were produced in sessions where the user was actively discussing quality enforcement. The correlation is user attention, not workflow instruction. | Gather G1, positive example analysis | High |
| F3 | philosophy | User: review should be structured like the research flow, with explicit stages, cognitive mode transitions, and traces. The research flow forces the agent through gather → extract → challenge modes. Review is currently single-pass read-and-write. Structured stages would make audit behavior reliable, not personality-dependent. | User, 2026-04-14 | High |
| F4 | convention | TFW-19 REVIEW is the only REVIEW in the steps-framework project with an explicit "§3. Independent Verification" section. TFW-31 (scheduler) REVIEW has line-by-line DoD verification. Both emerged organically, proving the capability exists but is not reliably triggered. | Extract E4, Challenge C6 | High |
| F5 | process | RES Findings Map presence is 80% in helpdesk (newer project, post-TFW-32 template) vs 15% in steps-framework (older tasks predate the section). The presence correlates with template version at task creation time, not with workflow enforcement. | Gather G2 | Medium |
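
A skip rate like the one in F1 can be tallied mechanically. The sketch below is illustrative and is not the grep used in Gather G1: it assumes RF sections appear as headings of the form `## 6. Title` (the real RF layout may differ), and it counts a file as skipped when any of §6-8 is missing or has an empty body, which is one possible definition. `filled_sections` and `skip_rate` are hypothetical names.

```python
import re

# ASSUMPTION: numbered sections are written as "## <n>. <title>" headings.
HEADING = re.compile(r"^##[ \t]*(\d+)\.", re.MULTILINE)


def filled_sections(rf_text: str) -> set[int]:
    """Return the numbers of sections whose body contains non-blank text."""
    matches = list(HEADING.finditer(rf_text))
    filled = set()
    for i, match in enumerate(matches):
        # The body runs from the end of the heading line to the next heading.
        line_end = rf_text.find("\n", match.end())
        if line_end == -1:
            line_end = len(rf_text)
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(rf_text)
        if rf_text[line_end:body_end].strip():
            filled.add(int(match.group(1)))
    return filled


def skip_rate(rf_texts: list[str], required: tuple[int, ...] = (6, 7, 8)) -> float:
    """Fraction of RF files where at least one required section is missing or empty."""
    if not rf_texts:
        return 0.0
    skipped = sum(1 for text in rf_texts if not set(required) <= filled_sections(text))
    return skipped / len(rf_texts)
```

Feeding it the text of every RF file in a project gives one number per project, which makes the cross-project comparison repeatable after the workflow change.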

Strategic Insights (Research)

| # | Category | Insight | Source | Confidence |
|---|---|---|---|---|
| SS1 | philosophy | The user observes that review should follow the same structural pattern as research: forced cognitive mode transitions via explicit stages. This is a deeper insight than "add an audit step": it suggests that ANY TFW workflow that requires quality judgment (research, review) should use the stage pattern, while workflows that are procedural (handoff, docs) can remain step-based. The implication: there are two classes of TFW workflows, investigative (staged) and procedural (linear). | User, 2026-04-14 | ★★★ |

Findings Map

PROBLEM CHAIN (confirmed empirically)
══════════════════════════════════════

handoff.md Phase 3          review.md Step 1           research/base.md Step 6
lists 6 items               says "read RF"             lists 4 synthesis items
  ↓                           ↓                           ↓
§6-8 NOT listed             no audit instruction       Findings Map NOT listed
  ↓                           ↓                           ↓
96-100% skip rate            trust-chain failures       ~50% skip (newer projects)
across 80+ RFs              (TFW-36 self-assessment)    0% in older projects

        ↓                            ↓                          ↓
    ┌───┴───┐                   ┌────┴────┐                 ┌───┴───┐
    │  H1   │                   │   H2    │                 │  H1   │
    │CONFIRM│                   │ CONFIRM │                 │CONFIRM│
    └───────┘                   └─────────┘                 └───────┘

ROOT CAUSE (unified)
════════════════════
Template has section + instructions
        BUT
Workflow doesn't enumerate it
        =
Agent skips it (workflow wins over template)

EXTERNAL VALIDATION
═══════════════════
Apple RLCF: checklist decomposition > implicit compliance
2026 consensus: multi-gate review with deterministic verification
Constraint budget: <10 items per checklist (we're at 9)

NEW DIRECTION (user, iteration 2 candidate)
═══════════════════════════════════════════
Review → staged flow (Read → Verify → Challenge → Synthesize)
  ↓
Parallels research flow (Briefing → Gather → Extract → Challenge → RES)
  ↓
Two classes of TFW workflows: investigative (staged) vs procedural (linear)
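
The root cause in the map (a section exists in the template but the workflow never names it) amounts to a set difference, and the constraint budget is a length check. The sketch below uses placeholder labels because this report does not name the six current Phase 3 items; it also assumes each of those items corresponds to a template section. `unenumerated` and `within_budget` are hypothetical helpers.

```python
CONSTRAINT_BUDGET = 10  # the report cites a threshold of fewer than 10 checklist items


def unenumerated(template_sections: list[str], workflow_items: list[str]) -> list[str]:
    """Template sections the workflow checklist never names, in template order."""
    listed = set(workflow_items)
    return [section for section in template_sections if section not in listed]


def within_budget(workflow_items: list[str]) -> bool:
    """True when the checklist stays under the constraint budget."""
    return len(workflow_items) < CONSTRAINT_BUDGET


# Placeholder labels: the real item names are not given in this report.
current = [f"item-{n}" for n in range(1, 7)]   # the 6 items listed today
template = current + ["§6", "§7", "§8"]        # template also defines §6-8
gaps = unenumerated(template, current)         # sections the workflow omits
proposed = current + gaps                      # 9 items after the fix
```

Running the same check against `proposed` returns an empty list, and `within_budget(proposed)` holds because 9 is below the threshold.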

Iteration Status

  • Iteration: 1 of 2 (min) / 3 (max)
  • Hypotheses tested: H1 (🟢 supported), H2 (🟢 supported)
  • Hypotheses deferred: None
  • Gaps discovered: User proposed staged review structure (D5, SS1) — not in original HL
  • Superseded decisions: None

Open Threads (for next iteration)

| # | Thread | Why it matters | Suggested focus |
|---|---|---|---|
| 1 | Staged review workflow (Read → Verify → Challenge → Synthesize) | Could replace the "add Step 1.5 audit" approach with a structurally deeper solution. Parallels research flow design. User-initiated direction. | Analyze: does review benefit from cognitive mode transitions the same way research does? What would review stage files look like? Would this exceed the 1200-word budget? |
| 2 | Two classes of TFW workflows: investigative vs procedural | If confirmed, this is a framework-level design principle that affects future workflow design. | Map all 10 TFW workflows into investigative vs procedural. Do the investigative ones all need staged structure? |
| 3 | Diagram indexing vs copying in docs.md | D3 recommends indexing. Need to design the actual KNOWLEDGE.md §2 format for diagram references. | Concrete format proposal for KNOWLEDGE.md diagram index entries. |
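
As a starting point for Thread 3, one possible shape for an index entry follows the example quoted in D3 ("Architecture diagram → RF Phase G §8"). This is a sketch under that single example, not an agreed KNOWLEDGE.md format; `DiagramRef` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagramRef:
    """Hypothetical pointer from KNOWLEDGE.md to a diagram kept in an RF file."""
    title: str     # e.g. "Architecture diagram"
    rf_phase: str  # e.g. "Phase G"
    section: int   # RF section that holds the diagram, e.g. 8

    def index_line(self) -> str:
        """Render the entry in the style of the D3 example."""
        return f"{self.title} → RF {self.rf_phase} §{self.section}"
```

Because the entry stores only a location, the diagram itself stays in the RF trace and cannot go stale in a second copy.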

Recommendation

  • [ ] SUFFICIENT — proceed to /tfw-plan to update HL and write TS
  • [x] MORE NEEDED — Iteration 2 should explore the staged review concept (Thread #1). This is a user-directed structural change that could modify Phase A scope from "add audit step" to "restructure review as staged flow." The mechanical fixes (§6-8 enumeration, Findings Map mandate) are ready for TS. The review restructure needs one more iteration.

⚠️ Coordinator decides whether to continue or proceed. Researcher recommends but does NOT decide.

Conclusion

Research iteration 1 validated both HL hypotheses with strong empirical evidence: 80+ RF files show 96-100% §6-8 skip rates, proving template-alone enforcement fails. External research (Apple RLCF, 2026 multi-gate review patterns) confirms that explicit checklist enumeration and deterministic verification gates are industry best practice. The iteration's most valuable discovery is the user's observation (D5/SS1) that review should be structured like the research flow — with staged cognitive modes rather than a single-pass audit step. This opens a potentially deeper redesign that warrants exploration in iteration 2 before writing the TS for Phase A.

