Corpus Preparation Skill¶
The corpus-prep skill is the first phase of the encoding pipeline, responsible for fetching source texts from Sefaria, building complete derivation chains, identifying machloket (disputes), and preparing structured artifacts for human review.
Overview¶
| Attribute | Value |
|---|---|
| Skill Name | corpus-prep |
| Phase | 1 of 5 |
| Checkpoint | 1: Source Review |
| Prerequisites | Sefaria MCP available, valid seif reference |
| Outputs | corpus-report.md, corpus-sources.yaml, corpus-chain.mermaid, corpus-questions.yaml |
| Trigger Phrases | "prepare corpus", "fetch sources", "start encoding seif", "begin encoding", "prepare sources for encoding", "build derivation chain", "get source texts from Sefaria" |
Invocation¶
Typical Invocation¶
User: "Prepare corpus for YD 87:3"
Agent: I'll prepare the corpus for Yoreh Deah 87:3. Let me:
1. Validate the reference with Sefaria
2. Fetch the primary text (Hebrew + English)
3. Fetch relevant commentaries
4. Build the derivation chain
5. Identify any machloket
...
Explicit Invocation¶
Phase A: Reference Resolution & Validation¶
Step 1: Validate Reference Format¶
The skill uses Sefaria's clarify_name_argument tool to validate and canonicalize the seif reference:
# Validate "Yoreh Deah 88:1"
result = mcp__sefaria_texts__clarify_name_argument(
    name="Shulchan Arukh, Yoreh De'ah 88:1",
    type_filter="ref"
)
Common Reference Formats:
| Input | Canonical Form |
|---|---|
| YD 87:3 | Shulchan Arukh, Yoreh De'ah 87:3 |
| Yoreh Deah 88:1 | Shulchan Arukh, Yoreh De'ah 88:1 |
| SA YD 89:2 | Shulchan Arukh, Yoreh De'ah 89:2 |
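The shorthand expansion in the table above can be sketched as a small pre-processing helper. The alias table, regex, and function name below are illustrative, not part of the skill; canonicalization is ultimately delegated to clarify_name_argument.

```python
import re

# Hypothetical alias table mapping shorthand book names to canonical form
ALIASES = {
    "yd": "Yoreh De'ah",
    "yoreh deah": "Yoreh De'ah",
}

def normalize_reference(raw: str) -> str:
    """Expand shorthand like 'YD 87:3' to 'Shulchan Arukh, Yoreh De'ah 87:3'."""
    # Strip an optional leading "SA " and capture the book name plus siman:seif
    m = re.match(r"(?:sa\s+)?(.+?)\s+(\d+:\d+)$", raw.strip(), re.IGNORECASE)
    if not m:
        raise ValueError(f"Unrecognized reference format: {raw!r}")
    book, location = m.groups()
    canonical_book = ALIASES.get(book.lower(), book)
    return f"Shulchan Arukh, {canonical_book} {location}"
```

The normalized string is then passed to clarify_name_argument for authoritative validation.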
Step 2: Get Structural Context¶
The skill retrieves the siman's structure to understand context:
Extracted Information:
- Total seifim in siman
- Position of target seif
- Related seifim that may have dependencies
Step 3: Check Prior Encoding State¶
The skill reads .mistaber-session.yaml to determine:
- Previously encoded seifim
- Pending dependencies
- Any active encoding session
Step 4: Identify Dependencies¶
Cross-references to other seifim are identified:
- Forward references ("k'mo she'yitbaer l'kaman" - as will be explained below)
- Backward references ("k'mo she'katavnu l'eil" - as we wrote above)
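As a sketch, cross-reference markers can be detected with simple pattern matching. The phrase lists and the find_dependencies helper below are illustrative; the real skill works on the Hebrew text with a fuller phrase inventory.

```python
import re

# Illustrative marker phrases (transliterated/English stand-ins)
FORWARD_MARKERS = ["as will be explained below", "k'mo she'yitbaer l'kaman"]
BACKWARD_MARKERS = ["as we wrote above", "k'mo she'katavnu l'eil"]

def find_dependencies(text: str) -> dict:
    """Flag forward/backward references and collect cited simanim."""
    lowered = text.lower()
    return {
        "forward": any(m in lowered for m in FORWARD_MARKERS),
        "backward": any(m in lowered for m in BACKWARD_MARKERS),
        "simanim": re.findall(r"siman (\d+)", lowered),
    }
```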
Phase B: Primary Source Extraction¶
Step 1: Fetch Primary Text¶
Both Hebrew and English versions are retrieved:
# Fetch with all translations
text = mcp__sefaria_texts__get_text(
    reference="Shulchan Arukh, Yoreh De'ah 88:1",
    version_language="both"
)

# Get all English translations for comparison
translations = mcp__sefaria_texts__get_english_translations(
    reference="Shulchan Arukh, Yoreh De'ah 88:1"
)
Step 2: Linguistic Analysis¶
For the Hebrew text, the skill identifies:
| Element | Example | Purpose |
|---|---|---|
| Key terms | "basar", "chalav", "bishul" | Map to predicates |
| Ambiguous phrases | "taam k'ikar" | Flag for clarification |
| Cross-references | "siman 89" | Track dependencies |
Dictionary lookups are performed for technical terms.
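For illustration, term resolution might look like the following sketch, with a hard-coded glossary standing in for Sefaria's dictionary resources (GLOSSARY and lookup_term are hypothetical names):

```python
# Hypothetical glossary; the real pipeline resolves terms via Sefaria's
# dictionary resources rather than a hard-coded table.
GLOSSARY = {
    "basar": "meat",
    "chalav": "milk",
    "bishul": "cooking",
    "taam k'ikar": "the taste is like the substance itself",
}

def lookup_term(term: str) -> str:
    """Resolve a technical term, flagging unknowns for human review."""
    definition = GLOSSARY.get(term.lower())
    if definition is None:
        return f"UNRESOLVED: {term}"
    return definition
```

Unresolved terms surface in the gap analysis and the questions artifact.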
Step 3: Semantic Decomposition¶
The seif is broken into atomic statements:
statements:
  - id: s1
    type: ISSUR
    text_he: "בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל"
    text_en: "Meat of a kosher domesticated animal with milk of a kosher domesticated animal is forbidden to cook"
    conditions: []
    world_context: base
  - id: s2
    type: ISSUR
    text_he: "ואסור לאכול"
    text_en: "and forbidden to eat"
    conditions: []
    world_context: base
  - id: s3
    type: MACHLOKET
    text_he: "דגים וחגבים אין בהם איסור בשר בחלב... מפני הסכנה"
    text_en: "Fish and locusts have no meat-and-milk prohibition... due to danger"
    conditions: []
    world_context: mechaber  # Disputed by Rema
Statement Type Classification:
| Type | Description | Example |
|---|---|---|
| ISSUR | Prohibition | "forbidden to cook" |
| ISSUR_SAKANA | Health danger | "due to danger" |
| HETER | Permission | "permitted to eat" |
| CHIYUV | Obligation | "must wait six hours" |
| DEFINITION | Category definition | "fish is not basar" |
| CONDITION | Prerequisite | "if cooked together" |
| EXCEPTION | Exception to rule | "except bedieved" |
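A minimal keyword-based sketch of this classification follows. The English cues and the TYPE_CUES table are illustrative only; real classification considers the Hebrew wording and context.

```python
# Ordered cue list: more specific types (e.g. ISSUR_SAKANA) are checked
# before more general ones (e.g. ISSUR).
TYPE_CUES = [
    ("ISSUR_SAKANA", ["danger", "sakana"]),
    ("ISSUR", ["forbidden", "asur"]),
    ("HETER", ["permitted", "mutar"]),
    ("CHIYUV", ["must", "obligated"]),
    ("EXCEPTION", ["except"]),
    ("CONDITION", ["if "]),
]

def classify_statement(text_en: str) -> str:
    """Assign a statement type from keyword cues in the English text."""
    lowered = text_en.lower()
    for stype, cues in TYPE_CUES:
        if any(cue in lowered for cue in cues):
            return stype
    return "DEFINITION"  # fallback for categorical statements
```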
Phase C: Commentary Layer (Adaptive Depth)¶
Tier-Based Fetching Strategy¶
The skill fetches commentaries in tiers, proceeding to deeper tiers only when necessary:
graph TD
T1["Tier 1: Always Fetch"] --> D1{Sufficient?}
D1 -->|No| T2["Tier 2: Standard"]
D1 -->|Yes| DONE[Complete]
T2 --> D2{Sufficient?}
D2 -->|No| T3["Tier 3: Extended"]
D2 -->|Yes| DONE
T3 --> D3{Sufficient?}
D3 -->|No| T4["Tier 4: Modern"]
D3 -->|Yes| DONE
T4 --> DONE
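The decision flow above can be sketched as a loop over tiers, assuming stub fetch_tier and is_sufficient callables in place of the real Sefaria calls and coverage check:

```python
# Tier contents mirror the tier tables below; the loop structure is the point.
TIERS = {
    1: ["Shach", "Taz", "Rema"],
    2: ["Pri Megadim", "Chochmat Adam", "Aruch HaShulchan"],
    3: ["Pitchei Teshuva", "Darchei Teshuva", "Be'er Heitev"],
    4: ["Yalkut Yosef", "Mishnah Berurah"],
}

def fetch_commentaries(is_sufficient, fetch_tier):
    """Fetch tier by tier, stopping once coverage is sufficient."""
    fetched = []
    for tier in sorted(TIERS):
        fetched.extend(fetch_tier(tier))
        if is_sufficient(fetched):
            break
    return fetched
```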
Tier Definitions¶
Tier 1 (Always Fetch):
| Commentary | Abbreviation | Tradition |
|---|---|---|
| Shach (Siftei Kohen) | sk | Ashkenazi |
| Taz (Turei Zahav) | - | Ashkenazi |
| Rema | - | Ashkenazi gloss |
Tier 2 (Standard - fetch if Tier 1 insufficient):
| Commentary | Abbreviation | Notes |
|---|---|---|
| Pri Megadim | PM | Commentary on Shach/Taz |
| Chochmat Adam | CA | Practical summary |
| Aruch HaShulchan | AH | Modern codification |
Tier 3 (Extended - for complex machloket):
| Commentary | Focus |
|---|---|
| Pitchei Teshuva | Acharonim summary |
| Darchei Teshuva | Additional opinions |
| Be'er Heitev | Abbreviated summary |
Tier 4 (Modern - practical application):
| Commentary | Community |
|---|---|
| Yalkut Yosef | Sefardi |
| Mishnah Berurah | Ashkenazi (OC) |
Commentary Classification¶
Each commentary passage is classified:
# Fetch Shach
shach = mcp__sefaria_texts__get_text(
    reference="Shakh on Shulchan Arukh, Yoreh De'ah 88:1",
    version_language="both"
)
Classification Types:
| Type | Description | Encoding Impact |
|---|---|---|
| CLARIFICATION | Explains without changing | Comments only |
| QUALIFICATION | Adds conditions | New rule conditions |
| EXTENSION | Applies to new cases | New rules |
| DISPUTE | Disagrees with base | Override/machloket |
| PRACTICAL | Practical guidance | May affect world scoping |
Machloket Detection¶
Disputes are identified between:
| Dispute Type | Example |
|---|---|
| Mechaber vs. Rema | Fish + dairy: sakana vs. mutar |
| Shach vs. Taz | Interpretation of SA ruling |
| Ashkenazi vs. Sefardi | Practice differences |
| Earlier vs. Later | Rishonim vs. Acharonim |
Machloket Documentation:
machloket:
  - id: m1
    topic: "dag_bechalav"
    position_a:
      authority: mechaber
      ruling: sakana
      source: "BY YD 87"
    position_b:
      authority: rema
      ruling: mutar
      source: "Rema gloss YD 87:3"
    practical_difference: "Fish cooked in dairy (e.g., lox with cream cheese)"
    worlds_affected:
      - mechaber  # Position A
      - rema      # Position B
Phase D: Derivation Chain Building¶
Recursive Chain Building¶
The skill traces sources backward from SA to authoritative terminus:
SA YD 88:1 → Tur YD 88 → Rambam MA 9:1 → Gemara Chullin 104b → Mishnah Chullin 8:1 → Torah Shemot 23:19
Using Sefaria Links:
# Get SA links
sa_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 88:1",
    with_text="1"
)

# Follow chain to Tur
tur_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Tur, Yoreh Deah 88",
    with_text="1"
)

# Continue until reaching terminus
Chain Completeness¶
A complete chain reaches one of:
| Terminus Type | Example | For |
|---|---|---|
| Torah verse | Shemot 23:19 | d_oraita rules |
| Mishnah/Gemara | Chullin 8:1 | d_rabanan rules |
| Takana | Chazal decree | Rabbinic institutions |
| Minhag source | Community practice | Minhag rules |
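A completeness check over a traced chain might look like this sketch; the type tags are assumed labels, not the skill's actual schema.

```python
# Source types that count as an authoritative terminus (per the table above)
TERMINUS_TYPES = {"torah", "mishnah", "gemara", "takana", "minhag"}

def chain_is_complete(chain: list[dict]) -> bool:
    """A chain is complete if its last link is an authoritative terminus."""
    return bool(chain) and chain[-1]["type"] in TERMINUS_TYPES
```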
Loop Detection¶
The skill tracks visited references to detect circular chains:
visited = set()

def trace_chain(ref, depth=0):
    # Stop: this reference was already visited, so the chain is circular
    if ref in visited:
        return {"loop_detected": True, "ref": ref}
    visited.add(ref)
    links = get_links(ref)  # wrapper around get_links_between_texts
    # Continue recursively into each linked source...
Makor Chain Format¶
The final chain is formatted for ASP:
% Rule naming: r_{topic}_{specific} (NOT siman-seif based)
% Example: r_bb_achiila for basar bechalav eating rule
makor(r_bb_achiila, sa("yd:88:1")).
makor(r_bb_achiila, tur("yd:88")).
makor(r_bb_achiila, rambam("maachalot:9:1")).
makor(r_bb_achiila, gemara("chullin:104b")).
makor(r_bb_achiila, torah("shemot:23:19")).
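Generating these facts from a traced chain is mechanical; the following sketch assumes the chain is a list of (source_functor, location) pairs, SA first:

```python
def emit_makor_facts(rule_id: str, chain: list[tuple[str, str]]) -> list[str]:
    """Render each chain link as a makor/2 ASP fact."""
    return [f'makor({rule_id}, {functor}("{loc}")).' for functor, loc in chain]
```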
Phase E: Semantic Enrichment¶
Topic Lookup¶
Related topics are retrieved from Sefaria:
topics = mcp__sefaria_texts__get_topic_details(
    topic_slug="meat-and-milk",
    with_links=True,
    with_refs=True
)
Semantic Search¶
Related material is discovered through semantic search:
related = mcp__sefaria_texts__english_semantic_search(
    query="prohibition of cooking meat and milk together Biblical commandment",
    filters={"document_categories": ["Halakhah"]}
)
Shiur and Measurement Extraction¶
Quantities mentioned are identified:
| Hebrew | Transliteration | Meaning |
|---|---|---|
| כזית | k'zayit | olive-sized portion |
| כביצה | k'beitza | egg-sized portion |
| שישים | shishim | 60:1 ratio |
| נותן טעם | noten taam | taste transfer |
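A sketch of the extraction, using the vocabulary from the table above (the SHIURIM table and helper name are illustrative):

```python
# Measurement vocabulary keyed by the Hebrew term, valued by transliteration
SHIURIM = {
    "כזית": "k'zayit",
    "כביצה": "k'beitza",
    "שישים": "shishim",
    "נותן טעם": "noten taam",
}

def extract_shiurim(text_he: str) -> list[str]:
    """Return the measurement terms that appear in the Hebrew text."""
    return [name for term, name in SHIURIM.items() if term in text_he]
```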
Temporal and Contextual Conditions¶
Context modifiers are identified:
| Type | Examples | Encoding Impact |
|---|---|---|
| Time-based | "on Shabbat", "during Pesach" | Additional conditions |
| Situation-based | "bedieved", "hefsed merubeh" | World-specific rules |
| Location-based | "in Eretz Yisrael" | Geographic conditions |
Phase F: Gap Analysis¶
Ontology Coverage Check¶
The skill verifies predicates exist in the ontology:
predicates_check:
  existing:
    - issur/3: "Prohibition predicate"
    - heter/2: "Permission predicate"
    - sakana/1: "Danger predicate"
  missing:
    - hachshara/2: "Vessel preparation - needs to be added"
World Coverage Check¶
Required worlds are verified:
worlds_check:
  existing:
    - base: "Universal rulings"
    - mechaber: "Sefardi positions"
    - rema: "Ashkenazi positions"
  missing: []  # None in this case
Prerequisite Verification¶
Dependencies on other seifim are checked:
prerequisites:
  - seif: "YD:87:1"
    status: encoded
    notes: "Base BB prohibition"
  - seif: "YD:87:2"
    status: not_encoded
    notes: "Species definitions - may need to encode first"
    blocking: false
Complexity Scoring¶
A complexity score (1-10) is assigned:
complexity:
  score: 6
  factors:
    machloket_count: 2
    commentary_depth: 3  # Tiers needed
    chain_length: 5
    cross_references: 3
    novel_predicates: 1

# Scoring rubric:
# 1-3: Simple, no machloket, straightforward chain
# 4-6: Moderate, some machloket, complete chain
# 7-10: Complex, multiple machloket, incomplete chain or novel concepts
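How the factor counts combine into a single score is not specified here. As one possible sketch, summing a few of the counts and clamping to the 1-10 range happens to reproduce the example above; the choice of factors and weights is an assumption, not the skill's actual rubric.

```python
def complexity_score(factors: dict) -> int:
    """Sum assumed factor counts, clamped to the 1-10 range."""
    raw = (
        factors.get("machloket_count", 0)
        + factors.get("commentary_depth", 0)
        + factors.get("novel_predicates", 0)
    )
    return max(1, min(10, raw))
```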
Phase G: Artifact Generation¶
Artifact 1: Human Review Report¶
corpus-report-YD-{siman}-{seif}.md
Uses the template from templates/corpus-report.md to generate a comprehensive human-readable report including:
- Executive summary with metrics
- Primary source text (Hebrew + English)
- Atomic statements table
- Commentary excerpts by tier
- Derivation chain diagram
- Machloket documentation
- Gap analysis results
- Questions for human review
- Checkpoint criteria checklist
Artifact 2: Machine-Readable Sources¶
corpus-sources-YD-{siman}-{seif}.yaml
reference: "YD:88:1"
fetched_at: "2026-01-25T10:00:00Z"
primary:
  hebrew: "בשר בהמה טהורה בחלב..."
  english: "Meat of a kosher domesticated animal..."
translations:
  - version: "Sefaria Community Translation"
    text: "..."
statements:
  - id: s1
    type: ISSUR
    text_he: "..."
    text_en: "..."
    conditions: []
    world_context: base
commentaries:
  shach:
    - location: "sk 1"
      type: CLARIFICATION
      text_he: "..."
      text_en: "..."
  taz:
    - location: "sk 1"
      type: QUALIFICATION
      text_he: "..."
      text_en: "..."
derivation_chain:
  - level: 0
    source: "SA YD 88:1"
    ref: "Shulchan Arukh, Yoreh De'ah 88:1"
  - level: 1
    source: "Tur YD 88"
    ref: "Tur, Yoreh Deah 88"
  # ... continues to terminus
machloket:
  - id: m1
    topic: dag_bechalav
    position_a:
      authority: mechaber
      ruling: sakana
    position_b:
      authority: rema
      ruling: mutar
Artifact 3: Visual Derivation Chain¶
corpus-chain-YD-{siman}-{seif}.mermaid
graph TD
SA["SA YD 88:1<br/>Primary Source"] --> TUR["Tur YD 88"]
TUR --> RAMBAM["Rambam MA 9:1"]
RAMBAM --> GEM["Gemara Chullin 104b"]
GEM --> MISH["Mishnah Chullin 8:1"]
MISH --> TORAH["Torah: Shemot 23:19<br/>Lo tevashel gedi..."]
style SA fill:#90EE90
style TORAH fill:#FFD700
Artifact 4: Questions for Human¶
corpus-questions-YD-{siman}-{seif}.yaml
questions:
  - id: q1
    phase: B
    type: CLARIFICATION
    question: "Does 'beheima' include all domesticated animals or only cattle?"
    context: "Statement s1 mentions 'beheima tehora'"
    options:
      - "All domesticated kosher animals (cow, sheep, goat)"
      - "Only cattle specifically"
    default: 0
    impact: "Affects predicate scope"
  - id: q2
    phase: C
    type: MACHLOKET
    question: "Should we encode the Taz's lenient opinion on bedieved cases?"
    context: "Taz sk 3 permits bedieved in certain cases"
    options:
      - "Yes, encode as world-specific rule"
      - "No, follow strict Mechaber"
    default: 0
    impact: "May require additional taz world"
Checkpoint Criteria¶
Before requesting human approval, the skill verifies:
- [ ] Primary source text accurately fetched (Hebrew + English)
- [ ] All Tier 1 commentaries fetched and classified
- [ ] Derivation chain reaches authoritative source
- [ ] All machloket identified with both positions documented
- [ ] No unresolved blocking dependencies on other seifim
- [ ] Questions formulated for any ambiguities
- [ ] All four artifacts generated
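The gate above can be sketched as a simple all-criteria check; the criterion names are illustrative stand-ins for the checklist items.

```python
# Hypothetical criterion keys mirroring the checklist above
CRITERIA = [
    "primary_text_fetched",
    "tier1_classified",
    "chain_complete",
    "machloket_documented",
    "no_blocking_dependencies",
    "questions_formulated",
    "artifacts_generated",
]

def checkpoint_ready(state: dict) -> bool:
    """Approval may be requested only when every criterion holds."""
    return all(state.get(c, False) for c in CRITERIA)
```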
Session State Update¶
After completing corpus preparation:
current_phase: corpus-prep
target_seif: "YD:88:1"
checkpoints:
  corpus-prep:
    status: pending_review
    artifacts:
      - .mistaber-artifacts/corpus-report-YD-88-1.md
      - .mistaber-artifacts/corpus-sources-YD-88-1.yaml
      - .mistaber-artifacts/corpus-chain-YD-88-1.mermaid
      - .mistaber-artifacts/corpus-questions-YD-88-1.yaml
    complexity_score: 6
    statements_count: 5
    machloket_count: 1
    commentary_tiers_fetched: 2
Common Issues¶
Reference Not Found¶
Symptom: Sefaria returns no results for reference.
Solutions:
- Validate reference format using clarify_name_argument
- Try alternative spellings (Arukh vs Aruch)
- Check if text exists in Sefaria library
Incomplete Derivation Chain¶
Symptom: Chain doesn't reach authoritative terminus.
Solutions:
- Mark as requiring manual research
- Check if rule is d'rabanan (may terminate at Mishnah)
- Flag for human review in questions
Commentary Not Available¶
Symptom: Specific commentary not in Sefaria.
Solutions:
- Note as gap in corpus report
- Proceed with available commentaries
- Flag for human to provide source if critical
Machloket Identification Ambiguity¶
Symptom: Unclear if positions truly disagree.
Solutions:
- Generate clarification question
- Default to conservative interpretation (encode as machloket)
- Request human guidance
Related Documentation¶
- HLL Encoding Skill - Next phase after corpus approval
- Workflow Guide - Complete pipeline overview
- Troubleshooting - Error resolution