Skip to content

title: "Extract — "What do we NOT see?"" source: "tasks/TFW-26__documentation_site/research/extract.md"


Extract — "What do we NOT see?"

Parent: HL-TFW-26 Goal: Compile TFW project artifacts into a publishable documentation site — deterministic, scriptable, no agent involvement.

Findings

E1: Compilable Contract — Artifact Structure Audit

Audited all 10 source artifacts. Each analyzed for: structure consistency, heading format, cross-references, frontmatter needs, transformation complexity.

Source Heading format Cross-refs Frontmatter needed? Transformation
README.md # Title + ## Sections Relative links to .tfw/, tasks/ Yes: title, nav_order Moderate — landing page extraction (drop Task Board for docs?)
KNOWLEDGE.md ## 0-4 numbered sections Links to tasks/ RF files Yes: title, nav_order Split — §0-3 → separate pages. §4 → index only
knowledge/*.md # Knowledge: {Category} + table Links to RF/REVIEW sources Yes: title, category Minimal — add frontmatter, copy
TECH_DEBT.md # Tech Debt Registry + table Source column references tasks Yes: title Minimal — add frontmatter, copy
.tfw/README.md # TFW + long-form narrative Internal refs to conventions, glossary Yes: title, nav_order Moderate — extract sections to concept pages
.tfw/conventions.md ## 1-15 numbered sections Refs to glossary, templates, workflows Yes: title Minimal — add frontmatter, copy (or split large sections)
.tfw/glossary.md # TFW Glossary + ### Terms Refs to conventions Yes: title Minimal — add frontmatter, copy
.tfw/CHANGELOG.md Keep a Changelog format Refs to tasks Yes: title Minimal — add frontmatter, copy
.tfw/init.md # Getting Started pointer Refs to workflow Yes: title Minimal — getting-started page
RELEASE.md # Release Strategy Refs to CHANGELOG Yes: title Minimal — add frontmatter, copy

Key insight: Most artifacts need only frontmatter injection + copy. Only 3 need real transformation: 1. KNOWLEDGE.md — split by section (§0-§3 are different page types) 2. .tfw/README.md — long (353 lines) — could be split into concepts pages OR served as single page 3. README.md — landing page extraction (Task Board is not public docs content)

E2: Compilable Contract — Formal Specification

The compilable contract is the strict structure each artifact type MUST have for deterministic compilation:

# Compilable Contract v1 — what Layer 2 (utility) expects

artifacts:
  knowledge_topic:
    location: knowledge/{category}.md
    must_have:
      - "# Knowledge: {Category}" as H1
      - blockquote with "Topic file for" and "See KNOWLEDGE.md"
      - table with columns: #, Fact, Verified, Source(s), Added
    frontmatter_inject:
      title: "Knowledge: {Category}"
      section: knowledge

  knowledge_index:
    location: KNOWLEDGE.md
    must_have:
      - "# KNOWLEDGE.md" as H1
      - "## 0. Philosophy" section
      - "## 1. Architecture Map" section
      - "## 3. Legacy" section
      - "## 4. Project Facts" section (index table)
    compiler_action: split by ## sections → separate pages

  tech_debt:
    location: TECH_DEBT.md
    must_have:
      - "# Tech Debt Registry" as H1
      - table with columns: #, Source, Severity, File(s), Description, Status
    frontmatter_inject:
      title: "Tech Debt"
      section: reference

  changelog:
    location: .tfw/CHANGELOG.md
    must_have:
      - "# TFW Changelog" as H1
      - Keep a Changelog format ([x.y.z] — date)
    frontmatter_inject:
      title: "Changelog"
      section: reference

  conventions:
    location: .tfw/conventions.md
    must_have:
      - "# TFW Conventions" as H1
      - "## N)" numbered sections
    frontmatter_inject:
      title: "Conventions"
      section: reference

  glossary:
    location: .tfw/glossary.md
    must_have:
      - "# TFW Glossary" as H1
      - "### Term" format for entries
    frontmatter_inject:
      title: "Glossary"
      section: reference

  philosophy:
    location: .tfw/README.md
    must_have:
      - "# Trace-First Workflow (TFW)" as H1
      - "## Values and Principles" section
      - "## How It Works" section
    compiler_action: serve as single page OR split by ##
    frontmatter_inject:
      title: "Philosophy & Design"
      section: concepts

  landing:
    location: README.md
    must_have:
      - "# " as H1 (project title)
      - "## Quick Start" section
      - "## What's Inside" section
    compiler_action: extract non-Task-Board content for index page
    frontmatter_inject:
      title: "Home"

E3: Source→Output Mapping — Revised from HL

Based on Gather and Extract findings, the HL §3.1 mapping needs revision:

┌─────────────────────────────────┐     ┌──────────────────────────────┐
│  TFW PROJECT (source)           │     │  MkDocs site (output)        │
│                                 │     │  built by mkdocs-gen-files   │
│                                 │     │                              │
│  README.md ─────────────────────┼────→│  index.md                    │
│  .tfw/README.md ────────────────┼────→│  concepts/philosophy.md      │
│  .tfw/init.md ──────────────────┼────→│  getting-started.md          │
│  KNOWLEDGE.md §0 ───────────────┼────→│  architecture/principles.md  │
│  KNOWLEDGE.md §1 ───────────────┼────→│  architecture/decisions.md   │
│  KNOWLEDGE.md §3 ───────────────┼────→│  architecture/legacy.md      │
│  knowledge/*.md ────────────────┼────→│  knowledge/*.md  (copy+fm)   │
│  TECH_DEBT.md ──────────────────┼────→│  reference/tech-debt.md      │
│  .tfw/conventions.md ───────────┼────→│  reference/conventions.md    │
│  .tfw/glossary.md ──────────────┼────→│  reference/glossary.md       │
│  .tfw/CHANGELOG.md ─────────────┼────→│  reference/changelog.md      │
│                                 │     │                              │
│  mkdocs.yml ← ← ← ← ← ← ← ← │     │  (MkDocs config — NEW file)  │
│  scripts/gen_docs.py ← ← ← ← ←│     │  (gen-files script — NEW)    │
└─────────────────────────────────┘     └──────────────────────────────┘

Changes from HL: - No docs/ folder committed to git - mkdocs.yml + scripts/gen_docs.py added at project root - Output is site/ (MkDocs default), deployed via CI/CD - concepts/adapters.md dropped (too specific for generic docs)

E4: mkdocs-gen-files Script Complexity Estimate

The gen-files script needs to: 1. Read ~10 source files from various locations 2. For 7 of them: add YAML frontmatter + copy content 3. For 2 of them: split by ## headings into separate pages 4. For 1 (README.md): extract specific sections (remove Task Board) 5. Generate nav structure (or use mkdocs-literate-nav)

Estimated LOC: ~80-120 lines Python.

This is close to H3's "<100 LOC" hypothesis. The script is simple because: - No content transformation (just split + frontmatter inject) - No link rewriting (relative links still work within MkDocs) - No parsing beyond ## heading detection

BUT: link rewriting IS needed for cross-references between artifacts. Example: knowledge/convention.md references [RF TFW-18](../../TFW-18__knowledge_consolidation/RF__PhaseB__knowledge_quality.md) §6 — this link points to tasks/ which is NOT in the docs output. These become dead links. Two options: 1. Accept dead links in docs (they work in the repo, not in the site) 2. Strip task references from docs output (lossy but clean) 3. Include task artifacts in docs (scope explosion)

Recommendation: Option 1 — accept dead links. The docs site is a compiled view. Source links to tasks/ work in the repo context (GitHub), not in the docs context. This is a known, documented limitation.

E5: New Files Required in TFW Project

For a TFW project to get docs, it needs:

File Purpose Size
mkdocs.yml MkDocs configuration — theme, plugins, nav ~30 lines
scripts/gen_docs.py mkdocs-gen-files script — reads artifacts, writes docs ~80-120 lines
.github/workflows/docs.yml GitHub Actions deploy workflow ~20 lines
.gitlab-ci.yml section GitLab Pages job (if needed) ~10 lines
requirements-docs.txt Python deps for docs build ~5 lines

Total new files: 3-4 (not all projects need both GH+GL CI).

E6: MkDocs Configuration Sketch

site_name: "Project Docs"  # From PROJECT_CONFIG project.name
site_url: "https://tfw.saubakirov.kz"
theme:
  name: material
  features:
    - navigation.sections
    - navigation.expand
    - search.suggest
    - content.code.copy
plugins:
  - search
  - gen-files:
      scripts:
        - scripts/gen_docs.py
  - literate-nav:
      nav_file: SUMMARY.md
nav:
  - Home: index.md
  - Getting Started: getting-started.md
  - Concepts:
    - Philosophy & Design: concepts/philosophy.md
  - Architecture:
    - Principles: architecture/principles.md
    - Decisions: architecture/decisions.md
    - Legacy: architecture/legacy.md
  - Knowledge:
    - knowledge/*.md  # Auto-expanded
  - Reference:
    - Conventions: reference/conventions.md
    - Glossary: reference/glossary.md
    - Tech Debt: reference/tech-debt.md
    - Changelog: reference/changelog.md

Checkpoint

Found Remaining
Compilable contract defined — 8 artifact types with strict structure requirements Need to validate: does the current artifact structure actually match the contract? (audited: YES, it does)
Most artifacts need minimal transformation (frontmatter + copy) Only KNOWLEDGE.md and README.md need splitting/extraction
gen-files script ~80-120 LOC — close to H3 hypothesis Need to challenge: is link rewriting really not needed?
New files in project: 3-4 (mkdocs.yml, gen_docs.py, CI workflow, requirements) Acceptable scope — not invasive
Dead links from task references are a known limitation Decision needed: accept vs strip vs include tasks

Sufficiency: - [x] External source used? (No additional external needed — this stage was internal artifact audit) - [x] Briefing gap closed? (Compilable contract defined, source mapping revised)

Deep mode: - [x] Hypothesis tested? (H1: partially confirmed — copy + frontmatter for 7/10 artifacts. H3: ~100 LOC plausible. H6: current structure IS sufficient) - [x] Counter-evidence sought? (Link rewriting identified as gap. Dead links = known limitation)

Metacognitive check: I confirmed H6 (structure is already sufficient) through direct audit. This is confirmation, not new discovery. But the dead links issue IS new — it's a real gap that the HL §6 DoF doesn't mention. The contract specification in E2 is the most valuable output — it makes the implicit structure explicit.

Stage complete: YES → User decision: ___