title: "Challenge — Iteration 3: "What do we NOT expect?"" source: "tasks/TFW-32__methodology_and_positioning/research3/challenge.md"
Challenge - Iteration 3: "What do we NOT expect?"
Parent: HL-TFW-32 Predecessors: Gather, Extract Goal: Stress-test the finalists for naming, visualization, and multi-iteration design.
C1: Empirical LLM naming test (Qwen3.5-27B, zero-context)
Method: The same conversation (client: 18% clients = 80% revenue, subscription model failed, PM frustrated, prefers markdown) run with different section names. Model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled (local vLLM). Zero context: the model does not know TFW. Minimal instruction: only the section name + 1 sentence.
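The setup above can be sketched as a small prompt builder: the conversation is held constant and only the section name changes. The conversation text, the instruction wording, and the names `CONVERSATION`, `SECTION_NAMES`, and `build_prompt` are illustrative assumptions, not the exact prompts used in the test.

```python
# Hypothetical reconstruction of the zero-context naming test.
# The conversation summary and the instruction wording are assumptions.

CONVERSATION = (
    "Client: 18% of our clients bring 80% of revenue. "
    "The subscription model failed. "
    "Our PM is frustrated that decisions get lost. "
    "We prefer markdown for documentation."
)

SECTION_NAMES = [
    "Strategic Insights",
    "Fact Candidates",
    "Strategic Signals",
    "Key Learnings",
    "Domain Signals",
]


def build_prompt(section_name: str, conversation: str = CONVERSATION) -> str:
    """Return a prompt that differs from its siblings only in the section name."""
    return (
        f"{conversation}\n\n"
        f"Fill in the section below based on the conversation.\n\n"
        f"## {section_name}\n"
    )


# One prompt per candidate name; everything except the name is identical.
prompts = {name: build_prompt(name) for name in SECTION_NAMES}
```

Each prompt would then be sent to the local model with no system prompt and no TFW context, so that any difference in output can be attributed to the section name alone.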
Comparison table of results
| Section Name | What the model recorded | Thinking mode | Captures preference (markdown)? | Captures emotion (PM frustrated)? | Analysis depth |
|---|---|---|---|---|---|
| Strategic Insights | Pareto analysis + business model implications + process gap + preference. ADDED: "revenue concentration risk", "non-negotiable customization", "decision continuity gaps" | Analytical: unpacked each item into its implications | ✅ Yes, but as «lightweight documentation preference» | ✅ Yes, as «decision continuity gaps» | Deepest: added strategic conclusions that were not in the input |
| Fact Candidates | 4 items, verbatim: Pareto, subscription failed, PM reports decisions lost, prefers markdown. No analysis. Lists directly. | Reporting: «These are all factual statements. I should record them concise and factual without adding interpretation» | ✅ As a bare fact | ✅ As a bare fact (no interpretation) | Shallowest: pure reporting without analysis |
| Strategic Signals | Revenue concentration risk + failed transition + decision tracking gap + documentation preference | Analytical but with signal-detection framing: «This indicates a process/communication breakdown» | ✅ As «tool/process misalignment» | ✅ As «operational inefficiency and PM frustration» | Medium-high: close to Strategic Insights |
| Key Learnings | Revenue concentration risk + subscription model failure + process gap + tool preference | Reflective: «I need to convert these observations into concise bullet points that capture the essence of what was learned» | ✅ As «documentation platform decisions» | ✅ As «knowledge retention problem causing PM frustration» | Medium: factual with mild interpretation |
| Domain Signals | Revenue concentration + failed subscription model + agile methodology gaps + documentation stack | Mixed: analytical + reporting | ✅ As «current documentation stack» | ✅ As «knowledge management or documentation process issue» | Medium: more factual than Strategic Insights |
Key empirical findings
- «Strategic Insights» triggered the deepest analytical mode. The model did not just record the facts; it ADDED conclusions: «revenue concentration risk», «non-negotiable customization», «decision continuity gaps». No other name produced this depth.
- «Fact Candidates» triggered pure reporting mode. The model LITERALLY said in its thinking: «I should record them concise and factual without adding interpretation.» This is EXACTLY the behavior we want for §6. The name works as a prompt.
- «Strategic Signals» ≈ «Strategic Insights» minus depth. Close in framing, but «Signals» produced a slightly shallower analysis.
- «Key Learnings» = retrospective mode. Medium depth and a good summary, but weak on forward-looking content.
- «Domain Signals» split attention: the model tried to both analyze and report, and the result was middling.
D28 empirical verdict
| Name | Target behavior | Actual behavior | Match? |
|---|---|---|---|
| Strategic Insights | Synthesis + implications | ✅ Deep synthesis | PERFECT |
| Fact Candidates | Raw reporting | ✅ Pure facts, no interpretation | PERFECT |
| Strategic Signals | Strategic detection | ⚠️ Close to Insights but shallower | Good |
| Key Learnings | Lessons extracted | ✅ Retrospective capture | Good |
| Domain Signals | Domain-specific patterns | ⚠️ Split between analysis and reporting | Fair |
VERDICT: The names «Strategic Insights» and «Fact Candidates» are already right. They trigger EXACTLY the cognitive mode we need. Changing them = regression.
C2: Visualization section naming: empirical test
Method: The same conversation (school management system). Two section names: «Diagrams» vs «Visual Map».
Results
| Section Name | What the model drew | Approach |
|---|---|---|
| Diagrams | 3 separate technical diagrams: enrollment timeline, grade export pipeline, server room architecture. Clear, structured, engineering-grade | The model split the context into components and gave each one its own diagram. Technical, systemic |
| Visual Map | 3 diagrams: enrollment seasonal cycle (with peak triggers), grade export pipeline (with the problem spot), relationship diagram. More narrative, with explanations inside | The model added annotations («Peak Triggers», «Issue»). More explanatory, less technical |
Analysis
- «Diagrams» = engineering approach. Clean diagrams, minimal text inside
- «Visual Map» = navigational approach. A map with explanations, context, annotations
For HL (vision level): «Visual Map» is better, because a vision needs a map with explanations. For RF (result level): «Diagrams» is better, because a result needs clear technical diagrams.
But we are looking for ONE name...
Counter-argument to «one name»: Maybe HL §3.1 already has «Result Visualization», and that is ENOUGH? And RF/RES need something different?
No. The user was explicit: «one shared section with one clear name». So it has to be an umbrella name.
Compromise position: «Diagrams» (umbrella) + per-template instructions (control framing). Reasons:
1. Zero learning curve
2. Empirically produces good results even for non-code projects
3. Cross-domain understood
4. Instructions carry the real behavioral load, not the name
5. «Visual Map» adds learning curve without proportional gain
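A minimal sketch of the compromise position, assuming the framing lives in a per-template instruction while the section name stays fixed. The instruction wording, the `##` heading level, and the names `TEMPLATE_INSTRUCTIONS` and `render_section` are hypothetical, not actual TFW template text.

```python
# Hypothetical sketch: one umbrella section name, framing set per template.
# The instruction wording below is an assumption, not TFW template text.

SECTION_NAME = "Diagrams"

TEMPLATE_INSTRUCTIONS = {
    # HL = vision level: map-like, annotated
    "HL": "Show the vision as a map: annotate diagrams with context and explanations.",
    # RF = result level: clean technical schemas
    "RF": "Show the result: clear technical diagrams with minimal text inside.",
}


def render_section(template: str) -> str:
    """Render the shared heading plus the framing instruction for one template."""
    instruction = TEMPLATE_INSTRUCTIONS[template]
    return f"## {SECTION_NAME}\n> {instruction}\n"
```

With this shape, the heading is identical in every template and only the instruction line shifts the model toward the annotated or the engineering style.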
C3: «Strategic Insights» edge cases stress-test
Per Extract E2, «Strategic Insights» is not ideal for preferences and emotions. The empirical test says it DOES capture them, but framed through a strategic lens.
| Edge case | «Strategic Insights» behaviour |
|---|---|
| User: «I don't like JSON, I want YAML» | Records as tool preference insight ✅ |
| User: 😤 «This drives me crazy» | Records as stakeholder frustration signal ✅, but framed as «operational issue», not emotion per se |
| User: «My investor said that XYZ» | Records perfectly ✅ as stakeholder intelligence |
| User: «In Kazakhstan, small businesses struggle with accounting» | Records as domain context ✅ |
| User: «I want the product to be for entrepreneurs, not programmers» | Records as strategic vision ✅ |
All 5 edge cases captured. The word «Strategic» doesn't BLOCK non-strategic items — it ELEVATES everything to strategic framing. Preferences become «tool strategy». Emotions become «stakeholder signals». This is a FEATURE, not a bug.
Instructions needed (addition to template):
> **What counts as strategic:** Business facts, domain knowledge, stakeholder priorities,
> emotional reactions, process preferences, forward-looking signals, corrections to direction.
> If in doubt: "Would the next agent decide differently knowing this?" → record it.
C4: Multi-iteration design: failure scenarios
| # | Failure scenario | Design handles it? |
|---|---|---|
| F1 | Researcher writes «SUFFICIENT» after iteration 1. Coordinator is a fresh agent, doesn't know about min_iterations | ✅ — iterations.yaml has min_iterations. Coordinator reads YAML before deciding |
| F2 | Coordinator forgets to update iterations.yaml | ⚠️ — risk of stale file. Mitigation: researcher exit protocol reminds. But not enforced programmatically |
| F3 | Researcher in iteration 2 doesn't read previous RES | ✅ — briefing template for iter 2+ has «Predecessor» field. Workflow Step 1: researcher MUST read all previous RES |
| F4 | 5 iterations, scope creep, never reaches TS | ✅ — max_iterations (soft cap) + coordinator judgment. Design says «beyond max = scope creep signal» |
| F5 | Stage files overwritten (single research/ folder): iteration 1 detail is lost | ⚠️ This is the open design question. Two options remain |
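The handling of F1 and F4 can be sketched as a single coordinator decision. Only `min_iterations` and `max_iterations` come from the design; the function name, its parameters, and the returned labels are assumptions, and in practice the two limits would be read from iterations.yaml rather than passed in by hand.

```python
# Hypothetical sketch of the coordinator's decision for scenarios F1 and F4.
# Only min_iterations and max_iterations come from the design; everything
# else (names, return labels) is an assumption for illustration.

def next_step(completed: int, researcher_says_sufficient: bool,
              min_iterations: int, max_iterations: int) -> str:
    """Decide what happens after `completed` research iterations."""
    if completed < min_iterations:
        # F1: a SUFFICIENT verdict cannot end research before the minimum.
        return "iterate"
    if researcher_says_sufficient:
        return "proceed_to_ts"
    if completed >= max_iterations:
        # F4: soft cap reached; going further is a scope creep signal
        # that the coordinator has to judge.
        return "scope_creep_review"
    return "iterate"
```

For example, with `min_iterations=2` a SUFFICIENT verdict after the first iteration still yields `"iterate"`, which is the behavior a fresh coordinator needs in F1.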
On F5: folder structure decision
| Approach | Pro | Con |
|---|---|---|
| Single research/, overwrite stage files | Clean. RES = trace. Stage files = scratch | Info loss. Can't re-read gather from iteration 1 |
| researchN/ folders, accumulate | Zero info loss. Full trace | Disk grows. Multiple folders. Which stage files matter? |
| Hybrid: research/ + archive | Active in research/, archive old to researchN/ before overwrite | Complexity. Extra step |
I now think researchN/ is correct. Reasons from practice:
1. VLM-3 research3 briefing REFERENCED specific findings from research1 gather. Without the file, researcher would need to re-discover
2. TFW philosophy = «trace is the product». Deleting stage files = deleting traces
3. Disk cost is negligible vs information value
4. Each researcher reads ONLY their own folder + all RES files on task root. Previous research folders are reference, not mandatory
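Reason 4 can be sketched as a reading-list helper. The `researchN/` folder naming follows the text above; the `RES*.md` pattern for result files on the task root and the function name `reading_list` are assumptions.

```python
from pathlib import Path

# Hypothetical sketch of reason 4. The researchN/ folder name follows the
# text; the "RES*.md" pattern for result files is an assumption.


def reading_list(task_root: Path, iteration: int) -> dict[str, list[Path]]:
    """Split files into mandatory reading and optional reference for one researcher."""
    own_folder = task_root / f"research{iteration}"
    # Mandatory: every RES file on the task root, plus the researcher's own folder.
    mandatory = sorted(task_root.glob("RES*.md"))
    mandatory += sorted(p for p in own_folder.glob("*") if p.is_file())
    # Reference only: stage files kept from earlier iterations.
    reference = sorted(
        p
        for n in range(1, iteration)
        for p in (task_root / f"research{n}").glob("*")
        if p.is_file()
    )
    return {"mandatory": mandatory, "reference": reference}
```

Because earlier folders land in `reference` instead of being deleted, a later briefing can still point at a specific finding from an earlier gather stage without forcing every researcher to read it.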
C5: Naming philosophy coherence test
Question from H_meta: Do all names follow ONE philosophy?
| Section | Current name | Philosophy |
|---|---|---|
| §5 (RF) | Observations | «What the agent SEES technically» — observer mode |
| §6 (RF, RES, REVIEW) | Fact Candidates | «What we THINK might be true» — candidate pipeline |
| §7/§11 (HL, RF, RES) | Strategic Insights | «What the human KNOWS strategically» — synthesis mode |
| Viz section | Diagrams (proposed) | «SHOW the structure visually» — rendering mode |
Does this cohere? Let me check:
- Observations → passive, technical, internal → ✅ right level
- Fact Candidates → pipeline metaphor (candidate → verified) → ✅ right process
- Strategic Insights → active synthesis from human input → ✅ right cognitive mode
- Diagrams → concrete output type → ⚠️ different ontological category (it's a FORMAT, not a COGNITION)
I notice: The first three names describe COGNITIVE MODES (observe, propose candidates, synthesize). The fourth describes an OUTPUT FORMAT (diagram). This is an inconsistency.
Alternative: «Visual Thinking» shares the cognitive ontology of the others. But it is likely worse at prompting (not tested empirically).
Assessment: The inconsistency is acceptable. «Diagrams» serves a different section purpose: it is about OUTPUT, not CAPTURE. §5/§6/§7 capture information. The Diagrams section RENDERS information. Different function = different naming logic is OK.
Checkpoint
| # | Finding | Status |
|---|---|---|
| C1 | Empirically proven: «Strategic Insights» = deepest analytical mode. «Fact Candidates» = cleanest reporting mode. Both ALREADY optimal | CONFIRMED: no rename needed |
| C2 | «Diagrams» > «Visual Map» for universal use. Per-template instructions carry framing load | Recommend «Diagrams» |
| C3 | «Strategic Insights» handles ALL edge cases including preferences and emotions (elevates them to strategic framing). Add instruction block for clarity | Keep + extend instructions |
| C4 | Multi-iteration design handles 4/5 failure scenarios. F5 (folder structure) → researchN/ is correct per TFW trace philosophy | Design solid |
| C5 | Naming philosophy is coherent for capture sections (cognitive modes). Viz section (output format) = different category, acceptable | Philosophy holds |
Sufficiency:
- [x] External source used? (Empirical LLM testing on local model)
- [x] Briefing gap closed? (All 3 topics challenged + edge-case tested)
Deep mode criteria:
- [x] Hypothesis tested? H5c (REFUTED: original names are optimal), H6c (Diagrams wins), H7c (design validated through failure scenarios), H_meta (philosophy coherent)
- [x] Counter-evidence sought? Tested 5 section names empirically. Tested 5 edge cases for SI. Tested 5 failure scenarios for multi-iteration
- [x] Metacognitive check: SURPRISED. I came in to change the names and am leaving with the verdict «do not change». The original names are empirically better than every proposed alternative. RES1 D2 and RES2 D10 were both regressions.
Stage complete: YES → Proceed to Synthesis (RES)