Quran corpus — research & build plan
GSD phase gate: There is no `.gsd/` directory in this repo yet, so `gsd headless query` exits with `No .gsd/ directory found`. Run `gsd` once from the repo root to initialize, then use your template’s `EXPECTED_PHASE` checks. Until then, structured scope → plan → build phases are documented here only.
Hypothesis (cycle): Finishing the 114 surah fetch plus keeping surah-hashes.json + Quartz paths stable removes the largest blockers for Ayah embeds, Atlas extraction, and publish.
This note is the master plan for turning the vault’s Quranic layer into a complete, reproducible, searchable, and publishable corpus. It uses Obsidian wikilinks to Surahs, Atlas, Ayah, Juz, scripts under .dev/scripts/, and related notes so you can navigate the graph and hand phases to agents without re-explaining context.
Current inventory (what already exists)
| Layer | Role | Where |
|---|---|---|
| API client | Shared httpx + retries for Quran.com v4 | .dev/scripts/quran_api.py (used by fetch + generators) |
| Surah fetch | Arabic + English + OpenFurqan links + ayah_header_lines | .dev/scripts/fetch_quran.py → Surahs folder |
| Hash cache | Per-file SHA-256 + fetch options to skip API | surah-hashes.json |
| Ayah line index | CLI to refresh / extract ayah by line | .dev/scripts/quran_surah_index.py · .dev/scripts/quran_surah_lines.py |
| Ayah notes | 6,236 stubs embedding ### Ayah n from surah files | Ayah index |
| Juz pages | 30 parts, API verse_mapping, embeds Ayah notes | Juz index |
| Atlas | Divine names, people, places, books (surahs) | Quran Atlas |
| Overview | OpenFurqan, mushaf order, categories | Surahs (vault note) |
Gap (highest leverage): the 114 surahs are not all present as files yet—only a subset is fetched. Downstream embeds, entity extraction, and published HTML all depend on complete or explicitly scoped surah text first.
Target pipeline (end state)
flowchart LR
  F[Fetch surahs API] --> O[Organize paths + FM]
  O --> H[Hash + index JSON]
  O --> L[Ayah line index in FM]
  O --> A[Atlas entity notes]
  A --> C[Categorize + tag]
  C --> P[Publish Quartz]
  P --> V[Browser eval]
Each stage below lists inputs, outputs, tools, and Definition of Done (observable).
Phase A — Fetch (complete text)
Goal: Every surah 1…114 exists as Graphe/Quran/Surahs/Surah NNN - Name.md with consistent frontmatter and ayah_header_lines.
- Inputs: `translation_id`, `arabic_field`, cache policy (see fetch script).
- Outputs: 114 markdown files; updated surah-hashes.json.
- Commands: `uv run .dev/scripts/fetch_quran.py -f` (or staged batches to respect API limits).
- DoD: `find Graphe/Quran/Surahs -name 'Surah *.md' | wc -l` → 114; random spot-check of `ayah_count` vs API.
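The file count in the DoD confirms the total but not which surahs are missing. A minimal sketch of a gap check, assuming the `Surah NNN - Name.md` filename pattern from the goal above; the helper name `missing_surahs` is hypothetical and not part of the existing scripts:

```python
import re
from pathlib import Path

# Assumed filename pattern, taken from the Phase A goal: "Surah NNN - Name.md".
SURAH_FILE = re.compile(r"^Surah (\d{3}) - .+\.md$")


def missing_surahs(surah_dir: Path, total: int = 114) -> list[int]:
    """Return surah numbers in 1..total that have no matching file."""
    present = set()
    for path in surah_dir.iterdir():
        match = SURAH_FILE.match(path.name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(1, total + 1) if n not in present]
```

An empty list from `missing_surahs(Path("Graphe/Quran/Surahs"))` would correspond to the 114-file DoD; a non-empty list names the surahs to put in the next fetch batch.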
Wikilinks: Surahs folder note · Surahs overview
Phase B — Organize (stable layout & naming)
Goal: One canonical tree; no duplicate “surah” stories.
- Convention: `Surah NNN - {name_simple}.md` only under Surahs; [[Graphe/Quran/Ayah/Ayah|Ayah]] / Juz names stay `Ayah SSS-AAA` / `Juz JJ`.
- Regenerate: `uv run .dev/scripts/generate_quran_juz_ayah.py` after any rename (uses `quran_api` + `/chapters` + `/juzs`).
- DoD: No broken `![[Graphe/Quran/Surahs/...#ayah-n|Ayah n]]` embeds in a sample of Ayah notes across all juz ranges.
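The embed DoD can be spot-checked mechanically. A rough sketch, assuming embeds follow the `![[Graphe/Quran/Surahs/...#ayah-n|Ayah n]]` shape and that link targets omit the `.md` extension; `broken_embeds` is a hypothetical helper, and it only checks that the target file exists, not that the `#ayah-n` heading resolves:

```python
import re
from pathlib import Path

# Captures the file part of embeds like ![[Graphe/Quran/Surahs/<file>#ayah-n|Ayah n]].
EMBED = re.compile(r"!\[\[(Graphe/Quran/Surahs/[^#|\]]+)#[^|\]]*(?:\|[^\]]*)?\]\]")


def broken_embeds(note_text: str, vault_root: Path) -> list[str]:
    """Return embed targets whose surah file does not exist under vault_root."""
    broken = []
    for target in EMBED.findall(note_text):
        # Assumption: links omit the .md extension, so add it when absent.
        rel = target if target.endswith(".md") else target + ".md"
        if not (vault_root / rel).is_file():
            broken.append(target)
    return broken
```

Running this over a sample of Ayah notes from each juz range would surface renamed or unfetched surah files before a Quartz build does.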
Phase C — Hash & integrity
Goal: Reproducible “what changed” for CI and agents.
- Existing: `surah-hashes.json` entries (path, surah, sha256, translation options).
- Extensions: optional global manifest (single JSON listing all surah hashes + generator versions) for `diff` in PRs.
- DoD: Re-run fetch with no API change → no file write (hash unchanged); intentional edit → hash flips.
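The DoD above amounts to a hash-gated write. A minimal sketch of that idea, not the fetch script's actual logic: the flat `cache` dict (path → digest) is a simplification of the surah-hashes.json entries, and `write_if_changed` is a hypothetical name:

```python
import hashlib
from pathlib import Path


def sha256_text(text: str) -> str:
    """Hex SHA-256 of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def write_if_changed(path: Path, text: str, cache: dict[str, str]) -> bool:
    """Write text only when its hash differs from the cached one.

    `cache` maps str(path) -> sha256 hex digest. This flat shape is an
    assumption for illustration, not the real surah-hashes.json schema.
    """
    digest = sha256_text(text)
    key = str(path)
    if cache.get(key) == digest and path.exists():
        return False  # unchanged: skip the write
    path.write_text(text, encoding="utf-8")
    cache[key] = digest
    return True
```

A `False` return matches "no file write (hash unchanged)"; a `True` return with a different stored digest matches "hash flips".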
Wikilink: surah-hashes.json
Phase D — Index (machine + human)
Goal: Fast random access without loading huge files.
- Per-surah FM: `ayah_header_lines` (line of each `### Ayah n`), maintained by fetch + `quran_surah_index.py index`.
- Optional: byte-offset index in a sidecar if line-scan cost becomes an issue (future).
- DoD: `uv run .dev/scripts/quran_surah_index.py extract -f "…/Surah 002 - Al-Baqarah.md" -a 7` prints correct block on a fully fetched Baqarah.
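The line-index idea can be illustrated in a few lines. This is a sketch under assumptions, not the code in `quran_surah_index.py`: it assumes 1-based line numbers counted over the whole file text and headers written exactly as `### Ayah n`; the function names are hypothetical:

```python
import re

AYAH_HEADER = re.compile(r"^### Ayah (\d+)\s*$")


def ayah_header_lines(text: str) -> dict[int, int]:
    """Map ayah number -> 1-based line number of its '### Ayah n' header."""
    index = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = AYAH_HEADER.match(line)
        if m:
            index[int(m.group(1))] = lineno
    return index


def extract_ayah(text: str, ayah: int) -> str:
    """Return the block from the ayah's header up to the next ayah header."""
    lines = text.splitlines()
    index = ayah_header_lines(text)
    start = index[ayah]  # raises KeyError if the ayah is not in the file
    later = [n for n in index.values() if n > start]
    end = min(later) - 1 if later else len(lines)
    return "\n".join(lines[start - 1:end]).rstrip()
```

With the index stored in frontmatter, a reader can skip the scan in `ayah_header_lines` and slice directly, which is the point of keeping `ayah_header_lines` up to date.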
Wikilinks: Atlas (tooling section) references the same index idea.
Phase E — Atlas entity extraction
Goal: Atlas entity notes are populated from corpus-wide extraction with a balanced quality gate.
Implemented workflow (full corpus):
- Ontology lock: Atlas extraction now scans four families: `Divine Names`, `People`, `Places`, `Books` (scriptural books, not surah files).
- Candidate generation: run full scan over all 114 surahs: `uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --write-reports`
- Confidence queue: emit summary + review queue: `uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-summary --write-review-queue`
- Balanced write-back: apply only `high` confidence hits to Atlas notes via idempotent auto blocks: `uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --apply-high`
- Validation + regression: sidecar schema/path/ayah checks plus Surah 1 baseline comparison: `uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --validate`

Artifacts produced:
- Per-surah reports: `Graphe/Quran/Research/entities/entity-scan-surah-NNN.md`
- Sidecars (`schema_version: 3`): `Graphe/Quran/meta/entities/surah-NNN.yaml`
- Corpus summary: entity-corpus-summary
- Review queue: entity-review-queue
- qmd evidence dossier: entity-review-qmd-evidence
- Validation report: entity-validation-report
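The balanced write-back step relies on auto blocks that can be re-applied without duplicating content. A minimal sketch of one way to get that property; the marker strings, the hit fields (`ref`, `confidence`), and `apply_auto_block` are all hypothetical, since this note does not show the pipeline's actual format:

```python
# Hypothetical markers; the real pipeline's markers are not shown in this note.
BEGIN = "<!-- auto:entities:begin -->"
END = "<!-- auto:entities:end -->"


def apply_auto_block(note: str, hits: list[dict]) -> str:
    """Replace (or append) the marked block with bullets for 'high' hits only."""
    bullets = [f"- {h['ref']}" for h in hits if h.get("confidence") == "high"]
    block = "\n".join([BEGIN, *bullets, END])
    start = note.find(BEGIN)
    end = note.find(END)
    if start != -1 and end != -1 and end > start:
        # Existing block: swap it in place, leaving hand-written text untouched.
        return note[:start] + block + note[end + len(END):]
    sep = "" if note.endswith("\n") or not note else "\n"
    return note + sep + block + "\n"
```

Because the block is rebuilt from the current hits each time, applying the same hits twice yields the same note, and medium or low hits stay in the review queue rather than reaching Atlas.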
qmd-assisted review (semantic helper):
# Ensure qmd has the Quran collection once
qmd collection add "/Users/rmac/repos/GrapheLogos" --name graphelogos-quran --mask "Graphe/Quran/**/*.md"
# Build evidence for queued medium/low matches
uv run .dev/scripts/quran_entity_qmd_evidence.py --collection graphelogos-quran --mode search

Legacy pilot remains: `uv run .dev/scripts/quran_entity_pilot.py -s 1 --write-report --write-sidecar` (single-surah check).
Wikilinks: Divine Names · People · Places · Books
Phase F — Categorize & tag (corpus semantics)
Goal: Filter by Meccan/Medinan, theme, juz, hizb (optional), without duplicating the mushaf.
Implemented (surah-level): uv run .dev/scripts/quran_surah_metadata.py --write enriched all 114 surah frontmatter files with:
- `revelation_place: Meccan | Medinan` (from Quran.com `/api/v4/chapters`)
- `revelation_order: <int>` (chronological order 1–114; Surah 096 = 1, first revealed)
Script is idempotent; re-run is safe.
- Sources: external datasets (Quran.com metadata, academic tables) or manual YAML in `Graphe/Quran/meta/` (proposed).
- Per-ayah: extend Ayah frontmatter with optional `topics: []` once extraction is trusted.
- DoD: Query (e.g. Dataview or `rg`) returns consistent results for one tag (e.g. `juz-30`) across all Ayah notes.
Wikilinks: Juz index (structural partition) · Surahs (categories) for conceptual framing
Phase G — Publish (Quartz → localhost) & visual eval
Goal: Render the Quran tree in a static site, then screenshot and evaluate UX (navigation, embeds, search).
Implemented: uv run .dev/scripts/quartz_build.py --content Graphe/Quran temporarily points .dev/quartz/content at the Quran tree, swaps in quartz.config.quran.ts (fast build: ignores Ayah/), runs Quartz, then restores the Torah symlink and quartz.config.ts. Use --include-ayah for all 6k+ ayah pages (slow). Deploy defaults to quran-graphe; override with --pages-project.
Commands (verified)
cd /Users/rmac/repos/GrapheLogos
# If Quartz fails with ENOTEMPTY on rmdir under public/ (mixed Torah+Quran leftovers):
rm -rf .dev/quartz/public
uv run .dev/scripts/quartz_build.py --content Graphe/Quran --serve
# Listen URL is usually http://localhost:8080; if EADDRINUSE: kill $(lsof -ti :8080)

Screenshot / browser eval
- `bunx agent-browser install` (once). If `open` reports Browser not launched, use Playwright from the Quartz package: `cd .dev/quartz && npx playwright screenshot http://localhost:8080 /tmp/quran-quartz.png`
- Smoke: `curl -sI http://localhost:8080` → 200; manually check Quran home, this plan, a sample surah.
Eval notes (local run)
| Check | Result | Improvement |
|---|---|---|
| Home / index | Quran · GrapheLogos, explorer: Atlas / Juz / Surahs | OK |
| Graph view | Often empty with partial corpus | Add links or tune Quartz graph when more surahs exist |
| Surah subset | 114/114 files present | Keep fetch reruns hash-aware; regenerate sidecars after content updates |
| Git date warnings | Quartz warns “not yet tracked by git” | git add Graphe/Quran when ready |
DoD (publish slice): HTTP 200 on /; RESEARCH renders; build succeeds after public/ clean + free 8080.
Wikilinks: Quran home · Atlas · Ayah index · Juz index
Phase H — Structured build loop (GSD) alignment
The repo’s GSD workflow (`gsd headless query`, phases scope → … → done) is not initialized here until `.gsd/` exists (run `gsd` in the project root). When you add it:
- Hypothesis for a cycle: e.g. “Completing fetch unblocks 90% of broken Ayah embeds.”
- Scope one gap (see table above).
- Research → Plan (single DoD) → Build → Test → Regression → Eval → Post-mortem → Log → Next paths.
Paste the phase verify bash blocks from your template at the top of each agent run; do not proceed on EXPECTED_PHASE mismatch.
Risk register
| Risk | Mitigation |
|---|---|
| API rate limits / 429 | fetch_quran delay + quran_api retries; batch fetch |
| Huge repo (6k+ Ayah files) | Git LFS optional; or generate Ayah on demand |
| Quartz wikilink paths | Align vault paths with Quartz baseUrl or use alias |
| Entity extraction false positives | Human-in-the-loop; pilot surahs first |
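The first mitigation in the table pairs a delay with retries. A generic sketch of retry with exponential backoff on rate limiting; this is not the code in `quran_api.py`, and `RateLimited` and `with_backoff` are hypothetical names standing in for whatever the client uses to signal a 429:

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from the API."""


def with_backoff(call: Callable[[], str], retries: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Run `call`, retrying on RateLimited with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... by default
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the delay schedule can be checked without waiting; batch fetching then reduces how often the retry path is hit at all.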
Next actions (ranked)
Latest run: full corpus extraction (114 surahs), schema_version: 3 sidecars, high-confidence Atlas backlinks (Divine Names + People + Places + Books), cross-scripture callouts for 14 shared figures, and surah-level revelation_place / revelation_order metadata (Phase F starter).
- Quartz smoke: `uv run .dev/scripts/quartz_build.py --content Graphe/Quran` after Atlas/review updates; verify Atlas pages and transcludes. Enables Phase G visual eval. (next cycle)
- Review queue triage: process low candidates and promote confirmed aliases into Atlas frontmatter.
- Alias precision pass: reduce low-signal English triggers (`god`, `lord`, short terms) by adding Arabic aliases and tighter disambiguation for overloaded entries.
- Phase F continuation: extend Ayah frontmatter with `topics: []` once extraction is trusted; add `juz` tag to all Ayah notes.
- GSD in repo: run `gsd` from repo root so `.gsd/` exists and §Phase H can be executed directly.
Committed (2026-03-20): atlas_kg + wikilinks all 4 families; CLAUDE.md; Phase F revelation metadata (114 surahs); noindex on 6,268 Ayah + Juz stubs (graph cleanup).
QMD second pass (search index)
qmd indexes Graphe/Quran as collection graphelogos-quran and runs BM25 “gap probes” (fetch coverage, review queue, stubs, entity pipeline, etc.). Regenerate the report after major vault changes:
uv run .dev/scripts/quran_qmd_gap_pass.py

Output: qmd-pipeline-gaps.md. Hybrid `qmd query` is optional locally (needs LLM + unset CI); BM25 is CI-safe.
qmd entity–relationship pass: uv run .dev/scripts/quran_qmd_entity_extract.py → qmd-atlas-entity-graph (BM25 graphe:qmd_cooccurs triples). Review-queue evidence: quran_entity_qmd_evidence.py.
See also
- Quran home (Quartz entry)
- Literary structures overview
- Juz — literary overview
- Surahs overview
- Quran Atlas
- Ayah index
- Juz index
- Torah Atlas (pattern reference for entity depth)
- QMD pipeline gap pass (BM25)
- qmd Atlas entity graph (co-occurrence)
Cycle goal: wire full Quran fetch, Atlas extraction, and Quartz proof — see §Phase G (commands + eval) and §Phase H (GSD).