Quran corpus — research & build plan

GSD phase gate: There is no .gsd/ directory in this repo yet, so gsd headless query exits with No .gsd/ directory found. Run gsd once from the repo root to initialize, then use your template’s EXPECTED_PHASE checks. Until then, structured scope → plan → build phases are documented here only.

Hypothesis (cycle): Finishing the 114 surah fetch plus keeping surah-hashes.json + Quartz paths stable removes the largest blockers for Ayah embeds, Atlas extraction, and publish.

This note is the master plan for turning the vault’s Quranic layer into a complete, reproducible, searchable, and publishable corpus. It uses Obsidian wikilinks to Surahs, Atlas, Ayah, Juz, scripts under .dev/scripts/, and related notes so you can navigate the graph and hand phases to agents without re-explaining context.


Current inventory (what already exists)

| Layer | Role | Where |
| --- | --- | --- |
| API client | Shared httpx + retries for Quran.com v4 | .dev/scripts/quran_api.py (used by fetch + generators) |
| Surah fetch | Arabic + English + OpenFurqan links + ayah_header_lines | .dev/scripts/fetch_quran.py → Surahs folder |
| Hash cache | Per-file SHA-256 + fetch options to skip API | surah-hashes.json |
| Ayah line index | CLI to refresh / extract ayah by line | .dev/scripts/quran_surah_index.py · .dev/scripts/quran_surah_lines.py |
| Ayah notes | 6,236 stubs embedding ### Ayah n from surah files | Ayah index |
| Juz pages | 30 parts, API verse_mapping, embeds Ayah notes | Juz index |
| Atlas | Divine names, people, places, books (surahs) | Quran Atlas |
| Overview | OpenFurqan, mushaf order, categories | Surahs (vault note) |

Gap (highest leverage): the 114 surahs are not all present as files yet—only a subset is fetched. Downstream embeds, entity extraction, and published HTML all depend on complete or explicitly scoped surah text first.


Target pipeline (end state)

flowchart LR
  F[Fetch surahs API] --> O[Organize paths + FM]
  O --> H[Hash + index JSON]
  O --> L[Ayah line index in FM]
  O --> A[Atlas entity notes]
  A --> C[Categorize + tag]
  C --> P[Publish Quartz]
  P --> V[Browser eval]

Each stage below lists inputs, outputs, tools, and Definition of Done (observable).


Phase A — Fetch (complete text)

Goal: Every surah 1…114 exists as Graphe/Quran/Surahs/Surah NNN - Name.md with consistent frontmatter and ayah_header_lines.

  • Inputs: translation_id, arabic_field, cache policy (see fetch script).
  • Outputs: 114 markdown files; updated surah-hashes.json.
  • Commands: uv run .dev/scripts/fetch_quran.py -f (or staged batches to respect API limits).
  • DoD: find Graphe/Quran/Surahs -name 'Surah *.md' | wc -l → 114; random spot-check ayah_count vs API.
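
The file-count check only says how many surah files exist, not which ones are absent. A minimal sketch of a gap finder, assuming the Surah NNN - Name.md naming from the Goal above; missing_surahs is a hypothetical helper, not part of fetch_quran.py:

```python
import re
from pathlib import Path

# Assumed filename convention from the Phase A goal: "Surah NNN - Name.md".
SURAH_FILE = re.compile(r"^Surah (\d{3}) - .+\.md$")


def missing_surahs(surah_dir: Path, total: int = 114) -> list[int]:
    """Return the surah numbers in 1..total that have no file in surah_dir."""
    present = set()
    for path in surah_dir.glob("Surah *.md"):
        match = SURAH_FILE.match(path.name)
        if match:  # ignore files that only look similar, e.g. "Surah notes.md"
            present.add(int(match.group(1)))
    return [n for n in range(1, total + 1) if n not in present]
```

Calling missing_surahs(Path("Graphe/Quran/Surahs")) from the repo root should return an empty list once Phase A is done; any numbers it returns are candidates for a staged re-fetch.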

Wikilinks: Surahs folder note · Surahs overview


Phase B — Organize (stable layout & naming)

Goal: One canonical tree; no duplicate “surah” stories.

  • Convention: Surah NNN - {name_simple}.md only under Surahs; [[Graphe/Quran/Ayah/Ayah|Ayah]] / Juz names stay Ayah SSS-AAA / Juz JJ.
  • Regenerate: uv run .dev/scripts/generate_quran_juz_ayah.py after any rename (uses quran_api + /chapters + /juzs).
  • DoD: No broken ![[Graphe/Quran/Surahs/...#ayah-n|Ayah n]] embeds in a sample of Ayah notes across all juz ranges.
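
The embed check can be partly automated. A rough sketch, assuming embed targets are vault-relative paths without the .md suffix; broken_embeds is a hypothetical helper. It only confirms the target file exists and does not verify the #ayah-n anchor:

```python
import re
from pathlib import Path

# Matches embeds shaped like ![[path#anchor|label]]; anchor and label optional.
EMBED = re.compile(r"!\[\[([^\]#|]+)(?:#([^\]|]+))?(?:\|[^\]]*)?\]\]")


def broken_embeds(note_text: str, vault_root: Path) -> list[str]:
    """Return embed targets in note_text whose file is missing from the vault."""
    broken = []
    for match in EMBED.finditer(note_text):
        target = match.group(1).strip()
        # Assumption: targets are vault-relative and omit the .md suffix.
        if not (vault_root / f"{target}.md").is_file():
            broken.append(target)
    return broken
```

Running this over a sample of Ayah notes from each juz range gives a quick pass/fail signal after a rename, before opening Obsidian or Quartz.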

Phase C — Hash & integrity

Goal: Reproducible “what changed” for CI and agents.

  • Existing: surah-hashes.json entries (path, surah, sha256, translation options).
  • Extensions: optional global manifest (single JSON listing all surah hashes + generator versions) for diff in PRs.
  • DoD: Re-run fetch with no API change → no file write (hash unchanged); intentional edit → hash flips.
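
The DoD above amounts to a hash-gated write. A minimal sketch of that idea, assuming a flat path-to-digest cache; this is a simplification for illustration, not the real surah-hashes.json schema, and write_if_changed is a hypothetical name:

```python
import hashlib
from pathlib import Path


def sha256_text(text: str) -> str:
    """Hex SHA-256 of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def write_if_changed(path: Path, new_text: str, cache: dict[str, str]) -> bool:
    """Write new_text only when its hash differs from the cached digest.

    `cache` maps a path string to a hex digest (assumed shape).
    Returns True when the file was written.
    """
    digest = sha256_text(new_text)
    key = str(path)
    if cache.get(key) == digest and path.exists():
        return False  # unchanged content: no write, so no spurious diff
    path.write_text(new_text, encoding="utf-8")
    cache[key] = digest
    return True
```

A second call with identical text returns False and leaves the file untouched, while an intentional edit yields a new digest and a write.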

Wikilink: surah-hashes.json


Phase D — Index (machine + human)

Goal: Fast random access without loading huge files.

  • Per-surah FM: ayah_header_lines (line of each ### Ayah n) — maintained by fetch + quran_surah_index.py index.
  • Optional: byte-offset index in a sidecar if line-scan cost becomes an issue (future).
  • DoD: uv run .dev/scripts/quran_surah_index.py extract -f "…/Surah 002 - Al-Baqarah.md" -a 7 prints correct block on a fully fetched Baqarah.
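
To illustrate why a line index gives fast random access, here is a small sketch of block extraction. It assumes ayah_header_lines is an ascending list of 1-based line numbers, one per ### Ayah n heading; the actual layout used by quran_surah_index.py may differ:

```python
def extract_ayah(lines: list[str], header_lines: list[int], ayah: int) -> str:
    """Return the block for `ayah` using a precomputed header line index.

    Assumption: header_lines[i] is the 1-based line number of the
    "### Ayah {i+1}" heading, in ascending order.
    """
    if not 1 <= ayah <= len(header_lines):
        raise ValueError(f"ayah {ayah} out of range 1..{len(header_lines)}")
    start = header_lines[ayah - 1] - 1  # convert to a 0-based index
    # The block ends where the next ayah heading starts, or at end of file.
    end = header_lines[ayah] - 1 if ayah < len(header_lines) else len(lines)
    return "\n".join(lines[start:end]).rstrip()
```

With the index in frontmatter, the lookup is two list accesses and one slice; no scan for headings is needed.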

Wikilinks: Atlas (tooling section) references the same index idea.


Phase E — Atlas entity extraction

Goal: Atlas entity notes are populated from corpus-wide extraction with a balanced quality gate.

Implemented workflow (full corpus):

  1. Ontology lock: Atlas extraction now scans four families, namely Divine Names, People, Places, and Books (scriptural books, not surah files).
  2. Candidate generation: run full scan over all 114 surahs:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --write-reports
  3. Confidence queue: emit summary + review queue:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-summary --write-review-queue
  4. Balanced write-back: apply only high confidence hits to Atlas notes via idempotent auto blocks:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --apply-high
  5. Validation + regression: sidecar schema/path/ayah checks plus Surah 1 baseline comparison:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --validate
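
The idempotent auto blocks in the write-back step can be pictured as a marker-delimited region that is replaced in place. The sketch below uses hypothetical marker strings and a hypothetical upsert_auto_block helper; the markers quran_entity_pipeline.py actually writes are not shown in this note:

```python
# Hypothetical marker strings; the pipeline's real markers may differ.
BEGIN = "<!-- auto:entities:begin -->"
END = "<!-- auto:entities:end -->"


def upsert_auto_block(note: str, body: str) -> str:
    """Replace the marked block if present, otherwise append one.

    Text outside the markers is never touched, so hand-written content
    in an Atlas note survives repeated runs.
    """
    block = f"{BEGIN}\n{body}\n{END}"
    start = note.find(BEGIN)
    end = note.find(END)
    if start != -1 and end != -1 and end > start:
        return note[:start] + block + note[end + len(END):]
    separator = "" if note.endswith("\n") or not note else "\n"
    return f"{note}{separator}\n{block}\n"
```

Applying the function twice with the same body returns the same text as applying it once, which is the property that makes re-running --apply-high safe in principle.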

Artifacts produced:

qmd-assisted review (semantic helper):

# Ensure qmd has the Quran collection once
qmd collection add "/Users/rmac/repos/GrapheLogos" --name graphelogos-quran --mask "Graphe/Quran/**/*.md"
 
# Build evidence for queued medium/low matches
uv run .dev/scripts/quran_entity_qmd_evidence.py --collection graphelogos-quran --mode search

Legacy pilot remains: uv run .dev/scripts/quran_entity_pilot.py -s 1 --write-report --write-sidecar (single-surah check).

Wikilinks: Divine Names · People · Places · Books


Phase F — Categorize & tag (corpus semantics)

Goal: Filter by Meccan/Medinan, theme, juz, hizb (optional), without duplicating the mushaf.

Implemented (surah-level): uv run .dev/scripts/quran_surah_metadata.py --write enriched all 114 surah frontmatter files with:

  • revelation_place: Meccan | Medinan (from Quran.com /api/v4/chapters)
  • revelation_order: <int> (chronological order 1–114; Surah 096 = 1, first revealed)

Script is idempotent; re-run is safe.
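
A quick sanity check on the enriched metadata could look like the sketch below. It works on frontmatter that has already been parsed into dicts (loading is left out) and assumes each revelation_order value is used exactly once across the corpus; metadata_problems is a hypothetical helper:

```python
VALID_PLACES = {"Meccan", "Medinan"}


def metadata_problems(records: dict[int, dict]) -> list[str]:
    """List problems in surah-level revelation metadata.

    `records` maps surah number -> frontmatter dict (assumed shape).
    """
    problems = []
    seen: dict[int, int] = {}  # revelation_order -> surah that claimed it
    for surah, fm in sorted(records.items()):
        if fm.get("revelation_place") not in VALID_PLACES:
            problems.append(f"surah {surah}: bad revelation_place")
        order = fm.get("revelation_order")
        if not isinstance(order, int) or not 1 <= order <= 114:
            problems.append(f"surah {surah}: bad revelation_order")
        elif order in seen:
            problems.append(
                f"surah {surah}: order {order} also used by surah {seen[order]}"
            )
        else:
            seen[order] = surah
    return problems
```

An empty result over all 114 records means every surah has a valid place and a unique order in range.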

  • Sources: external datasets (Quran.com metadata, academic tables) or manual YAML in Graphe/Quran/meta/ (proposed).
  • Per-ayah: extend Ayah frontmatter with optional topics: [] once extraction is trusted.
  • DoD: Query (e.g. Dataview or rg) returns consistent results for one tag (e.g. juz-30) across all Ayah notes.

Wikilinks: Juz index (structural partition) · Surahs (categories) for conceptual framing


Phase G — Publish (Quartz → localhost) & visual eval

Goal: Render the Quran tree in a static site, then screenshot and evaluate UX (navigation, embeds, search).

Implemented: uv run .dev/scripts/quartz_build.py --content Graphe/Quran temporarily points .dev/quartz/content at the Quran tree, swaps in quartz.config.quran.ts (fast build: ignores Ayah/), runs Quartz, then restores the Torah symlink and quartz.config.ts. Use --include-ayah for all 6k+ ayah pages (slow). Deploy defaults to quran-graphe; override with --pages-project.

Commands (verified)

cd /Users/rmac/repos/GrapheLogos
 
# If Quartz fails with ENOTEMPTY on rmdir under public/ (mixed Torah+Quran leftovers):
rm -rf .dev/quartz/public
 
uv run .dev/scripts/quartz_build.py --content Graphe/Quran --serve
# Listen URL is usually http://localhost:8080 — if EADDRINUSE: kill $(lsof -ti :8080)

Screenshot / browser eval

  1. bunx agent-browser install (once). If open reports Browser not launched, use Playwright from the Quartz package:
cd .dev/quartz && npx playwright screenshot http://localhost:8080 /tmp/quran-quartz.png
  2. Smoke: curl -sI http://localhost:8080 → 200; manually check Quran home, this plan, a sample surah.

Eval notes (local run)

| Check | Result | Improvement |
| --- | --- | --- |
| Home / index | Quran · GrapheLogos, explorer: Atlas / Juz / Surahs | OK |
| Graph view | Often empty with partial corpus | Add links or tune Quartz graph when more surahs exist |
| Surah subset | 114/114 files present | Keep fetch reruns hash-aware; regenerate sidecars after content updates |
| Git date warnings | Quartz warns “not yet tracked by git” | git add Graphe/Quran when ready |

DoD (publish slice): HTTP 200 on /; RESEARCH renders; build succeeds after public/ clean + free 8080.

Wikilinks: Quran home · Atlas · Ayah index · Juz index


Phase H — Structured build loop (GSD) alignment

The repo’s GSD workflow (gsd headless query, phases scope → … → done) is not initialized here until .gsd/ exists (gsd in project root). When you add it:

  1. Hypothesis for a cycle: e.g. “Completing fetch unblocks 90% of broken Ayah embeds.”
  2. Scope one gap (see table above).
  3. Research → Plan (single DoD) → Build → Test → Regression → Eval → Post-mortem → Log → Next paths.

Paste the phase verify bash blocks from your template at the top of each agent run; do not proceed on EXPECTED_PHASE mismatch.


Risk register

| Risk | Mitigation |
| --- | --- |
| API rate limits / 429 | fetch_quran delay + quran_api retries; batch fetch |
| Huge repo (6k+ Ayah files) | Git LFS optional; or generate Ayah on demand |
| Quartz wikilink paths | Align vault paths with Quartz baseUrl or use alias |
| Entity extraction false positives | Human-in-the-loop; pilot surahs first |
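
For the rate-limit risk, the usual mitigation pattern is retry with exponential backoff. The sketch below is a generic illustration, not the retry logic in quran_api.py; RateLimited and with_backoff are hypothetical names, and the sleep function is injectable so the behaviour can be checked without waiting:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a fetch function raises on HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on RateLimited wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the defaults, the waits between attempts are 1, 2, 4, and 8 seconds before the error is re-raised; batching the fetch lowers how often this path is hit at all.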

Next actions (ranked)

Latest run: full corpus extraction (114 surahs), schema_version: 3 sidecars, high-confidence Atlas backlinks (Divine Names + People + Places + Books), cross-scripture callouts for 14 shared figures, and surah-level revelation_place / revelation_order metadata (Phase F starter).

  1. Quartz smoke: uv run .dev/scripts/quartz_build.py --content Graphe/Quran after Atlas/review updates; verify Atlas pages and transcludes. Enables Phase G visual eval. (next cycle)
  2. Review queue triage: process low candidates and promote confirmed aliases into Atlas frontmatter.
  3. Alias precision pass: reduce low-signal English triggers (god, lord, short terms) by adding Arabic aliases and tighter disambiguation for overloaded entries.
  4. Phase F continuation: extend Ayah frontmatter with topics: [] once extraction is trusted; add juz tag to all Ayah notes.
  5. GSD in repo: run gsd from repo root so .gsd/ exists and §Phase H can be executed directly.

Committed (2026-03-20): atlas_kg + wikilinks all 4 families; CLAUDE.md; Phase F revelation metadata (114 surahs); noindex on 6,268 Ayah + Juz stubs (graph cleanup).


QMD second pass (search index)

qmd indexes Graphe/Quran as collection graphelogos-quran and runs BM25 “gap probes” (fetch coverage, review queue, stubs, entity pipeline, etc.). Regenerate the report after major vault changes:

uv run .dev/scripts/quran_qmd_gap_pass.py

Output: qmd-pipeline-gaps.md. Hybrid qmd query is optional locally (needs LLM + unset CI); BM25 is CI-safe.

qmd entity–relationship pass: uv run .dev/scripts/quran_qmd_entity_extract.py → qmd-atlas-entity-graph (BM25 graphe:qmd_cooccurs triples). Review-queue evidence: quran_entity_qmd_evidence.py.


See also


Cycle goal: wire full Quran fetch, Atlas extraction, and Quartz proof — see §Phase G (commands + eval) and §Phase H (GSD).