title: "Gather — "What do we NOT know?"" source: "tasks/TFW-26__documentation_site/research/gather.md"
Gather — "What do we NOT know?"¶
Parent: HL-TFW-26 Goal: Compile TFW project artifacts into a publishable documentation site — deterministic, scriptable, no agent involvement.
Findings¶
G1: SSG Landscape — Tool Capabilities Matrix¶
| Tool | Language | Multi-source dirs | Frontmatter inject | Custom build scripts | GitHub Pages | GitLab Pages | Community |
|---|---|---|---|---|---|---|---|
| MkDocs + Material | Python | ❌ Single docs_dir only |
✅ Native YAML frontmatter | ✅ mkdocs-gen-files plugin (Python scripts at build time) |
✅ mkdocs gh-deploy |
✅ Standard CI/CD | Massive, industry standard |
| Jekyll + Just-the-Docs | Ruby | ⚠️ Collections within build dir | ✅ Native YAML frontmatter | ⚠️ Limited (plugins restricted on GH Pages) | ✅ Native (GH default) | ✅ Standard CI/CD | Large, native GH integration |
| Hugo | Go | ⚠️ Single content/ dir |
✅ Native YAML/TOML frontmatter | ⚠️ No build-time script plugin | ✅ Via Actions | ✅ Standard CI/CD | Large, fastest builds |
| Docusaurus | React/Node | ⚠️ Plugin-based multi-instance | ✅ Native frontmatter | ✅ Plugin system | ✅ Via Actions | ✅ Standard CI/CD | Meta-backed, heavy |
| Astro Starlight | Node | ✅ Multiple content dirs via config | ✅ MDX/Markdown frontmatter | ✅ Full Astro plugin system | ✅ Via Actions | ✅ Standard CI/CD | Fast-growing, modern |
| Zensical | Rust+Python | ✅ Reads mkdocs.yml natively |
✅ Same as MkDocs | ✅ Same ecosystem | ✅ Via Actions | ✅ Standard CI/CD | New (2025), MkDocs successor |
Key finding: ALL major SSGs require files to be within their designated source directory. None can natively read from scattered locations (root /, .tfw/, knowledge/). This means TFW always needs an aggregation step — either symlinks, copy script, or mkdocs-gen-files.
G2: MkDocs + mkdocs-gen-files — Deep Dive¶
The mkdocs-gen-files plugin is the strongest candidate for TFW's use case:
- Runs Python scripts during mkdocs build
- Scripts can read ANY file from the project, process it, and write to the virtual docs directory
- Files are generated in-memory — not checked into the repo
- Compatible with Material for MkDocs theme
- Compatible with Zensical (successor)
How it solves TFW's problem:
# scripts/gen_docs.py (runs at build time)
import mkdocs_gen_files
# Read KNOWLEDGE.md from project root, write to docs/architecture/decisions.md
with open("KNOWLEDGE.md") as f:
content = f.read()
# Extract §1 Architecture Decisions, add frontmatter
with mkdocs_gen_files.open("architecture/decisions.md", "w") as out:
out.write("---\ntitle: Architecture Decisions\n---\n")
out.write(extracted_section)
This is exactly the "compilation utility" the HL envisions — but it's a MkDocs plugin script, not a standalone utility. The script IS the compiler.
G3: ADR Tooling — Limited Relevance¶
| Tool | What it does | Fits TFW? |
|---|---|---|
adr-tools |
CLI for creating ADR markdown files | ❌ No compilation, no site generation |
log4brains |
ADR creation + static site (built-in) | ⚠️ ADR-only, can't handle arbitrary artifact types |
| Backstage TechDocs | MkDocs-based docs-as-code for microservices | ⚠️ Opinionated Backstage platform integration |
Conclusion: ADR tools solve a narrower problem. TFW's artifacts are broader (decisions + facts + conventions + glossary + changelog). No existing ADR tool fits. MkDocs with custom gen-files scripts is the right abstraction level.
G4: GitHub Pages Deployment Options¶
Current state: tfw.saubakirov.kz serves from repo root, main branch (CNAME file confirms tfw.saubakirov.kz).
Three deployment models:
| Model | How it works | Impact on repo |
|---|---|---|
A: /docs folder |
GH Pages Settings → "Deploy from branch" → main, /docs |
Compiled output checked into git. Pollutes history. |
B: gh-pages branch |
CI/CD builds → pushes to gh-pages branch |
Clean main branch, but legacy pattern. |
| C: GitHub Actions | CI/CD builds → actions/deploy-pages artifact upload |
Recommended. No build artifacts in git. Clean. Modern. |
Model C is clearly best. mkdocs gh-deploy already supports this. GitLab Pages uses standard .gitlab-ci.yml (mkdocs build → mv site public).
Critical implication: If we use GitHub Actions, docs/ folder in the repo is NOT the compiled output — it's the source for MkDocs (or we skip docs/ entirely and use mkdocs-gen-files to aggregate from scattered locations).
G5: GitLab Pages Compatibility¶
Standard .gitlab-ci.yml for MkDocs:
pages:
image: python:3.11-slim
script:
- pip install mkdocs mkdocs-material mkdocs-gen-files
- mkdocs build
- mv site public
artifacts:
paths: [public]
Fully compatible. Same mkdocs.yml config works for both platforms. Only the CI/CD wrapper differs.
G6: Critical Constraint — MkDocs docs_dir Limitation¶
MkDocs requires all source files in a single directory tree. Cannot set docs_dir: . (project root) — causes recursive build issues. Cannot natively read from multiple directories.
Three workarounds evaluated:
| Approach | Pros | Cons | Windows-safe? |
|---|---|---|---|
| Symlinks | No copy overhead | Fragile on Windows, CI/CD issues | ❌ |
| Pre-build copy script | Simple, portable | Duplicates files temporarily | ✅ |
mkdocs-gen-files plugin |
In-memory, no file duplication, runs at build time | Requires writing file-reading Python code | ✅ |
mkdocs-gen-files wins. It's the MkDocs-native way to aggregate from scattered directories. No symlinks, no copy scripts, no Windows issues.
G7: AI-Queryability Landscape — DeepWiki, MCP, Chat-over-docs¶
User's insight: the end goal isn't just "web docs" — it's AI-queryable knowledge endpoints. Web docs is one output, MCP-powered AI assistants is another. Both consume the same structured artifacts.
DeepWiki (by Cognition/Devin): - Analyzes repos → generates wiki-style docs + chat interface - Uses RAG for Q&A over codebase - Problem for TFW: DeepWiki analyzes code, not structured knowledge. TFW's value is in decisions, facts, processes — which DeepWiki can't distinguish from generic markdown
Existing MCP servers for documentation:
| Tool | Architecture | Fits TFW? |
|---|---|---|
markdown-rules-mcp |
Serves project md files as AI context. Smart dependency resolution, line-range embeds | ⚠️ Generic — serves files, no structure awareness |
Markdown Vault MCP (LobeHub) |
Full read/write to md folders. Frontmatter queries, regex search | ⚠️ Obsidian-optimized, but concept is right |
MCP-Markdown-RAG |
Local-first RAG with Milvus vector DB, semantic search | ⚠️ Heavy (requires vector DB), overkill for structured artifacts |
library-mcp |
Markdown KB server — retrieval by tags, date, full-text | ✅ Closest to TFW's needs — serves structured md with metadata |
mjm.local.docs |
.NET, Blazor web UI + MCP endpoint, semantic search | ⚠️ Interesting dual-interface (web + MCP) but heavy stack |
Yandex Docs MCP architecture: - MCP server connects to Yandex Cloud documentation - Agent queries → server retrieves relevant sections → injects into context - Key insight: Yandex doesn't serve raw docs — it serves indexed, structured content with search. This is what makes it useful.
G8: Architecture Implications — Three Output Layers¶
The user's vision reveals that Layer 2 from the HL is actually multiple output targets, not just a web site:
Layer 1 (Agent) → Maintains structured artifacts
↓
Layer 2 (Utility) → Produces MULTIPLE outputs:
├── 2a: Static docs site (MkDocs → GitHub Pages)
├── 2b: MCP-servable knowledge (markdown files + index for MCP server)
└── 2c: Zip/archive for distribution (portable knowledge bundle)
Critical insight: If TFW artifacts have consistent, compilable structure (the "compilable contract"), ALL these outputs can be produced from the same source: - MkDocs reads artifacts → web site - MCP server reads artifacts → AI context - Zip tool archives artifacts → portable bundle
The compilable contract IS the common interface. The outputs are independent consumers.
How this fits existing tools:
- A simple MCP server (like library-mcp or custom) pointed at TFW's knowledge/, KNOWLEDGE.md, .tfw/conventions.md already works — agents can read these files directly
- The user already experiences this: Yandex Docs MCP + agents at Innoforce
- What's missing: a TFW-specific MCP server that understands the artifact structure (decisions, facts, conventions) and can answer structured queries like "what decisions exist?" or "what are the verified facts about X?"
For TFW-26 scope: Web docs (2a) is in scope. MCP server (2b) and distribution (2c) are future tasks. BUT: the compilable contract design must not preclude 2b/2c. This is a constraint on the design, not a deliverable.
Checkpoint¶
| Found | Remaining |
|---|---|
MkDocs + Material + mkdocs-gen-files is the strongest candidate for web output |
Need to verify: can gen-files handle actual TFW artifact formatting? What transformations? |
| GitHub Actions deployment — no compiled output in git | Confirmed by user |
| GitLab Pages trivially compatible with MkDocs | Confirmed |
| ADR tools don't fit — TFW artifacts are broader | Confirmed |
| Zensical exists as MkDocs successor, reads same config | Risk: new, may not be stable. MkDocs is safer |
| AI-queryability is a parallel output target, not a replacement for web docs | Scope: compilable contract must support both, but MCP server is future task |
Existing MCP servers for markdown exist and work — library-mcp, markdown-rules-mcp |
TFW-specific MCP = future task. Generic MCP on TFW artifacts already works without any changes |
| The "compilable contract" design is the key deliverable — it serves ALL output targets | This is the Extract stage focus |
Sufficiency: - [x] External source used? (10 web searches, 12+ tools evaluated) - [x] Briefing gap closed? (SSG landscape + AI-queryability landscape fully mapped)
Deep mode: - [x] Hypothesis tested? (H2: MkDocs + gen-files = yes. AI-queryability: existing MCP servers already work on raw files) - [x] Counter-evidence sought? (All SSGs need aggregation. DeepWiki doesn't understand structured knowledge.)
Metacognitive check (loop 2): New insight — the compilable contract is more important than the tool choice. If artifacts have strict structure, they can serve web docs, MCP endpoints, AND portable distribution. The tool is just one consumer. This shifts the research focus from "which SSG?" to "what contract?"
Stage complete: YES → User decision: ___