Corpus Preparation¶
The corpus preparation phase is the foundation of the encoding workflow. It transforms a seif reference into a comprehensive, structured corpus of source texts with complete derivation chains and identified machloket (disputes). This phase uses the corpus-prep skill and produces artifacts for human review at Checkpoint 1.
Purpose¶
The corpus-prep skill accomplishes several critical objectives:
- Source Retrieval: Fetch primary text and commentaries from Sefaria
- Derivation Chains: Trace each ruling back to its authoritative source (Torah/Mishnah/Takana)
- Machloket Identification: Identify all disputes between halachic authorities
- Semantic Decomposition: Break the seif into atomic, encodable statements
- Question Generation: Formulate clarifying questions for the human expert
Invoking the Skill¶
To start corpus preparation:
The skill automatically:
- Validates the reference
- Fetches all required texts
- Builds the derivation chain
- Generates review artifacts
- Requests checkpoint approval
Phase A: Reference Resolution and Validation¶
Step 1: Validate Reference Format¶
The skill uses Sefaria's clarify_name_argument() tool to validate and canonicalize the reference:
result = mcp__sefaria_texts__clarify_name_argument(
    name="Shulchan Arukh, Yoreh De'ah 87:1",
    type_filter="ref"
)
This ensures:
- The reference exists in Sefaria's database
- The canonical format is used consistently
- Typos and variations are resolved
Step 2: Get Structural Context¶
Understanding the seif's position within the siman helps identify dependencies:
The skill documents:
| Context | Purpose |
|---|---|
| Total seifim in siman | Understand scope of topic |
| Position of target seif | Identify sequence dependencies |
| Related seifim | Find cross-references |
Step 3: Check Prior Encoding State¶
If a .mistaber-session.yaml exists, the skill reads it to determine:
- Previously encoded seifim (dependencies satisfied)
- Pending dependencies (may block encoding)
- Active encoding session (resume vs. new)
Step 4: Identify Dependencies¶
The skill searches for cross-references within the text:
| Hebrew Pattern | Meaning | Action |
|---|---|---|
| "כמו שיתבאר לקמן" | "As will be explained below" | Forward dependency (may proceed) |
| "כמו שכתבנו לעיל" | "As we wrote above" | Backward dependency (must be encoded first) |
| "עיין סימן X" | "See siman X" | Cross-siman reference (note but don't block) |
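The pattern table above can be sketched as a small scanner. This is a minimal illustration, not the skill's actual implementation: the function name `find_dependencies`, the `kind` labels, and the record layout are assumptions, and real Sefaria text may carry nikud or abbreviations that these literal patterns would miss.

```python
import re

# Phrases copied from the table above; the kind labels are hypothetical names.
DEPENDENCY_PATTERNS = [
    (re.compile(r"כמו שיתבאר לקמן"), "forward"),
    (re.compile(r"כמו שכתבנו לעיל"), "backward"),
    (re.compile(r"עיין סימן\s+(\S+)"), "cross_siman"),
]


def find_dependencies(text_he: str) -> list[dict]:
    """Return one record per cross-reference phrase found in the Hebrew text."""
    found = []
    for pattern, kind in DEPENDENCY_PATTERNS:
        for match in pattern.finditer(text_he):
            found.append({
                "kind": kind,
                "phrase": match.group(0),
                # Only the cross-siman pattern captures a target siman.
                "target": match.group(1) if pattern.groups else None,
                # Per the table, only backward references must be encoded first.
                "blocking": kind == "backward",
            })
    return found


sample = "אסור לבשל כמו שכתבנו לעיל ועיין סימן פט"
deps = find_dependencies(sample)
```

Only backward references are marked as blocking, mirroring the Action column: forward references may proceed and cross-siman references are noted without blocking.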
Phase B: Primary Source Extraction¶
Step 1: Fetch Primary Text¶
Retrieve both Hebrew and English versions:
# Fetch with all translations
text = mcp__sefaria_texts__get_text(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)
# Get all English translations for comparison
translations = mcp__sefaria_texts__get_english_translations(
    reference="Shulchan Arukh, Yoreh De'ah 87:1"
)
The skill captures:
- Original Hebrew text
- Primary English translation
- Alternative translations (when available)
Step 2: Linguistic Analysis¶
For the Hebrew text, the skill identifies and documents:
Key Terms - Technical halachic vocabulary requiring precise understanding:
# Look up technical terms
definitions = mcp__sefaria_texts__search_in_dictionaries(
    query="נותן טעם"  # "noten taam" - taste transfer
)
Common terms requiring lookup:
| Term | Transliteration | Meaning |
|---|---|---|
| בשר בחלב | basar bechalav | meat and milk |
| נותן טעם | noten taam | taste transfer |
| בטל בשישים | batel b'shishim | nullified in 60:1 |
| לכתחילה | lechatchila | ideally/initially |
| בדיעבד | bedieved | after the fact |
Ambiguous Phrases - Terms requiring commentary for clarification
Cross-References - Internal references to other sources
Step 3: Semantic Decomposition¶
The skill breaks the seif into atomic statements. Each statement represents:
- A single normative claim (issur/heter), OR
- A single definitional claim, OR
- A single conditional relationship
Example Decomposition (YD 87:1):
statements:
  - id: s1
    type: ISSUR
    text_he: "בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל"
    text_en: "Meat of a kosher domesticated animal with milk of a kosher domesticated animal is forbidden to cook"
    conditions: []
  - id: s2
    type: ISSUR
    text_he: "ואסור לאכול"
    text_en: "and forbidden to eat"
    conditions: []
  - id: s3
    type: ISSUR
    text_he: "ואסור בהנאה"
    text_en: "and forbidden to derive benefit"
    conditions: []
  - id: s4
    type: DEFINITION
    text_he: "דאורייתא"
    text_en: "by Torah law"
    applies_to: [s1, s2, s3]
Statement Types¶
| Type | Description | Encoding Approach |
|---|---|---|
| ISSUR | Prohibition | asserts(W, issur(action, M, level)) |
| ISSUR_SAKANA | Health danger | asserts(W, sakana(M)) |
| HETER | Permission | asserts(W, heter(action, M)) |
| CHIYUV | Obligation | asserts(W, chiyuv(action, M, level)) |
| DEFINITION | Category/classification | is_X(entity) predicate |
| CONDITION | Prerequisite | Part of rule body |
| EXCEPTION | Override | override/3 or NAF |
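A decomposition can be sanity-checked before it reaches review. The sketch below is one possible check, assuming a hypothetical `Statement` dataclass that mirrors the YAML fields shown earlier; it verifies that every type comes from the table above and that every `applies_to` entry points at a known statement.

```python
from dataclasses import dataclass, field

# Type names taken from the Statement Types table.
STATEMENT_TYPES = {
    "ISSUR", "ISSUR_SAKANA", "HETER", "CHIYUV",
    "DEFINITION", "CONDITION", "EXCEPTION",
}


@dataclass
class Statement:
    """Hypothetical in-memory form of one decomposed statement."""
    id: str
    type: str
    text_he: str
    text_en: str
    conditions: list = field(default_factory=list)
    applies_to: list = field(default_factory=list)


def validate(statements: list) -> list:
    """Return human-readable problems; an empty list means the set is consistent."""
    problems = []
    known_ids = {s.id for s in statements}
    seen = set()
    for s in statements:
        if s.type not in STATEMENT_TYPES:
            problems.append(f"{s.id}: unknown type {s.type}")
        if s.id in seen:
            problems.append(f"{s.id}: duplicate id")
        seen.add(s.id)
        for target in s.applies_to:
            if target not in known_ids:
                problems.append(f"{s.id}: applies_to unknown statement {target}")
    return problems


statements = [
    Statement("s1", "ISSUR", "אסור לבשל", "forbidden to cook"),
    Statement("s2", "ISSUR", "ואסור לאכול", "and forbidden to eat"),
    Statement("s3", "ISSUR", "ואסור בהנאה", "and forbidden to derive benefit"),
    Statement("s4", "DEFINITION", "דאורייתא", "by Torah law",
              applies_to=["s1", "s2", "s3"]),
]
problems = validate(statements)
```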
Phase C: Commentary Layer (Adaptive Depth)¶
Step 1: Discover All Commentaries¶
The skill fetches all linked commentaries:
links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    with_text="1"
)
Step 2: Tier-Based Fetching¶
Commentaries are fetched in tiers based on authority and relevance:
TIER 1 (Always Fetch)¶
These commentaries are always required:
| Commentary | Author | Role |
|---|---|---|
| Shach (Siftei Kohen) | R. Shabsai HaKohen | Primary Ashkenazi commentary |
| Taz (Turei Zahav) | R. David HaLevi Segal | Primary Ashkenazi commentary |
| Rema gloss | R. Moshe Isserles | Ashkenazi rulings within text |
# Fetch Shach
shach = mcp__sefaria_texts__get_text(
    reference="Shakh on Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)
# Fetch Taz
taz = mcp__sefaria_texts__get_text(
    reference="Turei Zahav on Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)
TIER 2 (Standard - Fetch if TIER 1 is Insufficient)¶
| Commentary | When Needed |
|---|---|
| Pri Megadim | Complex TIER 1 disagreement |
| Chochmat Adam | Practical application unclear |
| Aruch HaShulchan | Modern summary needed |
TIER 3 (Extended - Fetch for Complex Machloket)¶
| Commentary | When Needed |
|---|---|
| Pitchei Teshuva | Multiple minority opinions |
| Darchei Teshuva | Chasidic practice variations |
| Be'er Heitev | Additional sources needed |
TIER 4 (Modern - Fetch for Practical Application)¶
| Commentary | When Needed |
|---|---|
| Yalkut Yosef | Sefardi modern practice |
| Mishnah Berurah | Cross-reference for YD |
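The tier rules can be read as a simple selection function. The following is a rough sketch under stated assumptions: the three boolean flags and the function name are hypothetical, and in practice the skill presumably decides these conditions from the fetched text rather than from caller-supplied flags.

```python
# Commentary names per tier, copied from the tables above.
TIERS = {
    1: ["Shach", "Taz", "Rema gloss"],
    2: ["Pri Megadim", "Chochmat Adam", "Aruch HaShulchan"],
    3: ["Pitchei Teshuva", "Darchei Teshuva", "Be'er Heitev"],
    4: ["Yalkut Yosef", "Mishnah Berurah"],
}


def commentaries_to_fetch(tier1_sufficient: bool,
                          complex_machloket: bool,
                          practical_application: bool) -> list:
    """Pick tiers per the headings: 1 always, 2-4 only when their condition holds."""
    wanted = [1]
    if not tier1_sufficient:
        wanted.append(2)
    if complex_machloket:
        wanted.append(3)
    if practical_application:
        wanted.append(4)
    return [name for tier in wanted for name in TIERS[tier]]


simple_case = commentaries_to_fetch(True, False, False)
disputed_case = commentaries_to_fetch(False, True, False)
```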
Step 3: Classify Commentary Type¶
Each commentary passage is classified:
| Type | Description | Example |
|---|---|---|
| CLARIFICATION | Explains without changing | "Beheima includes cow, sheep, and goat" |
| QUALIFICATION | Adds conditions | "Only applies when taste transfers" |
| EXTENSION | Applies to new cases | "Includes butter, not just milk" |
| DISPUTE | Disagrees with ruling | "Rema permits in Ashkenazi practice" |
| PRACTICAL | Guidance for practice | "The custom today is to be strict" |
Step 4: Machloket Detection and Mapping¶
The skill identifies disputes at multiple levels:
Primary Machloket (Mechaber vs. Rema):
machloket:
  - id: m1
    topic: "dag_bechalav"
    layer: primary
    position_a:
      authority: mechaber
      ruling: sakana
      source: "BY YD 87"
      reasoning: "Medical sources indicate danger"
    position_b:
      authority: rema
      ruling: mutar
      source: "Rema gloss YD 87:3"
      reasoning: "No current medical basis"
    practical_difference: "Fish cooked in dairy dishes"
Secondary Machloket (Commentary level):
machloket:
  - id: m2
    topic: "bishul_achrei_bishul"
    layer: secondary
    position_a:
      authority: shach
      ruling: applies_only_to_liquid
      source: "Shach sk 5"
    position_b:
      authority: taz
      ruling: applies_to_all
      source: "Taz sk 3"
    practical_difference: "Reheating solid food"
Phase D: Derivation Chain (Recursive)¶
Step 1: Build Chain Backward¶
For each normative statement, the skill traces the source chain:
SA YD 87:1 → Tur YD 87 → Rambam Maachalot Asurot 9:1 →
Gemara Chullin 104b → Mishnah Chullin 8:1 → Torah Shemot 23:19
The skill uses recursive get_links_between_texts() calls:
# Get SA links
sa_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    with_text="1"
)
# Find Tur reference and get its links
tur_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Tur, Yoreh Deah 87",
    with_text="1"
)
# Continue recursively until reaching terminus
Step 2: Loop Detection¶
The skill tracks visited references to detect circular chains:
visited = set()

def build_chain(ref, chain, depth=0):
    if ref in visited:
        return chain, "loop_detected"
    if depth > 10:
        return chain, "max_depth_exceeded"
    visited.add(ref)
    links = get_links(ref)
    # Continue recursion...
If a loop is detected, it's documented and flagged for human review.
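To make the loop check concrete, here is a self-contained sketch that runs against an in-memory link table instead of Sefaria. Several things are assumptions made for illustration: the `get_links` callback, the `terminus_reached` status, and the choice to follow only the first earlier source at each step. The real skill may branch over several links.

```python
MAX_DEPTH = 10  # same limit as the snippet above


def build_chain(ref, get_links, visited=None, chain=None, depth=0):
    """Walk backward from ref, stopping on a repeat, the depth limit, or no links."""
    visited = set() if visited is None else visited
    chain = [] if chain is None else chain
    if ref in visited:
        return chain, "loop_detected"
    if depth > MAX_DEPTH:
        return chain, "max_depth_exceeded"
    visited.add(ref)
    chain.append(ref)
    earlier = get_links(ref)
    if not earlier:
        return chain, "terminus_reached"
    # Assumption: follow only the first earlier source.
    return build_chain(earlier[0], get_links, visited, chain, depth + 1)


# Stand-in for the Sefaria link lookup, using the chain shown in Step 1.
LINKS = {
    "SA YD 87:1": ["Tur YD 87"],
    "Tur YD 87": ["Rambam Maachalot Asurot 9:1"],
    "Rambam Maachalot Asurot 9:1": ["Gemara Chullin 104b"],
    "Gemara Chullin 104b": ["Mishnah Chullin 8:1"],
    "Mishnah Chullin 8:1": ["Torah Shemot 23:19"],
    "Torah Shemot 23:19": [],
}

chain, status = build_chain("SA YD 87:1", lambda r: LINKS.get(r, []))

# A circular pair of references triggers the loop guard.
CYCLE = {"A": ["B"], "B": ["A"]}
loop_chain, loop_status = build_chain("A", lambda r: CYCLE.get(r, []))
```

Passing `visited` and `chain` as arguments rather than module-level state keeps separate chain builds from interfering with each other.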
Step 3: Chain Completeness Verification¶
A complete chain should reach one of:
| Terminus Type | Description | Example |
|---|---|---|
| Torah verse | Biblical source | Shemot 23:19 |
| Rabbinic institution | Takana d'rabanan | Chazal's enactment |
| Minhag source | Custom origin | "Minhag Ashkenaz" |
Incomplete chains are flagged:
chain_status: incomplete
missing: "Cannot trace beyond Tur - earlier sources not linked"
action_required: manual_research
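The completeness rule can be expressed as a check on the last link of the chain. This sketch assumes each chain entry is tagged with a source kind; the kind labels (`torah`, `takana`, `minhag`) and the `complete` status value are hypothetical names chosen to match the terminus table, not confirmed identifiers.

```python
# Kinds that count as a valid terminus, per the table above (labels assumed).
TERMINUS_KINDS = {"torah", "takana", "minhag"}


def chain_status(chain: list) -> dict:
    """chain holds (kind, ref) tuples ordered from the seif back to its earliest source."""
    if chain and chain[-1][0] in TERMINUS_KINDS:
        return {"chain_status": "complete"}
    last = chain[-1][1] if chain else "the seif itself"
    return {
        "chain_status": "incomplete",
        "missing": f"Cannot trace beyond {last}",
        "action_required": "manual_research",
    }


full = chain_status([("sa", "SA YD 87:1"), ("tur", "Tur YD 87"),
                     ("torah", "Shemot 23:19")])
partial = chain_status([("sa", "SA YD 87:1"), ("tur", "Tur YD 87")])
```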
Step 4: Assemble Makor Chain¶
The chain is formatted for ASP:
% Rule naming: r_{topic}_{specific} (topic-based, NOT siman-seif based)
% Example: r_bb_achiila for basar bechalav eating rule
makor(r_bb_achiila, sa("yd:87:1")).
makor(r_bb_achiila, tur("yd:87")).
makor(r_bb_achiila, rambam("maachalot:9:1")).
makor(r_bb_achiila, gemara("chullin:104b")).
makor(r_bb_achiila, mishnah("chullin:8:1")).
makor(r_bb_achiila, torah("shemot:23:19")).
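Facts in this shape are straightforward to generate from an ordered chain. A minimal sketch, assuming the chain is available as `(source_kind, location)` pairs; the helper name `makor_facts` is hypothetical.

```python
def makor_facts(rule_id: str, chain: list) -> list:
    """Render one makor/2 fact per chain link, in the format shown above."""
    return [f'makor({rule_id}, {kind}("{location}")).' for kind, location in chain]


chain = [
    ("sa", "yd:87:1"),
    ("tur", "yd:87"),
    ("rambam", "maachalot:9:1"),
    ("gemara", "chullin:104b"),
    ("mishnah", "chullin:8:1"),
    ("torah", "shemot:23:19"),
]
facts = makor_facts("r_bb_achiila", chain)
print("\n".join(facts))
```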
Phase E: Semantic Enrichment¶
Step 1: Topic Lookup¶
The skill queries Sefaria's topic system:
topics = mcp__sefaria_texts__get_topic_details(
    topic_slug="meat-and-milk",
    with_links=True,
    with_refs=True
)
This provides:
- Related topics for context
- Cross-references to related texts
- Modern scholarship links
Step 2: Semantic Search for Related Material¶
For edge cases and extensions:
related = mcp__sefaria_texts__english_semantic_search(
    query="prohibition of cooking meat and milk together Biblical commandment",
    filters={"document_categories": ["Halakhah"]}
)
Step 3: Cross-Siman Dependencies¶
Document references to other simanim:
cross_references:
  - target: "YD 89"
    type: forward
    topic: "waiting times after meat"
    note: "Seif 4 references waiting times"
  - target: "YD 87:3"
    type: internal
    topic: "fish and dairy"
    note: "Exception case within same siman"
Step 4: Shiur and Measurement Extraction¶
Identify quantities mentioned:
| Shiur | Measurement | Usage in Seif |
|---|---|---|
| k'zayit | Olive-size | Minimum for violation |
| k'beitza | Egg-size | (not in this seif) |
| shishim | 60:1 ratio | Nullification threshold |
| noten taam | Taste transfer | Determines applicability |
Step 5: Temporal and Contextual Conditions¶
Identify context modifiers:
| Condition Type | Hebrew | English | Effect |
|---|---|---|---|
| Time-based | בשבת | on Shabbat | May change ruling |
| Situation | בדיעבד | after the fact | Lenient ruling |
| Situation | הפסד מרובה | significant loss | May allow leniency |
| Situation | שעת הדחק | pressing circumstances | Emergency leniency |
| Location | בארץ ישראל | in Eretz Yisrael | Regional variation |
Phase F: Gap Analysis¶
Step 1: Ontology Coverage Check¶
The skill verifies predicates exist in the ontology:
predicates_check:
  found:
    - issur/3
    - heter/2
    - is_beheima_chalav_mixture/1
  missing:
    - name: "bishul_achrei_bishul"
      expected_arity: 1
      context: "reheating solid basar bechalav"
      action: "May need to add to sorts.lp"
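At its core the coverage check is a comparison between the predicates a seif needs and those the ontology declares. The sketch below assumes both sides are available as `name/arity` strings; how the skill actually extracts them from the ontology files is not shown in this document.

```python
def check_predicates(required: set, ontology: set) -> dict:
    """Split required name/arity signatures into those present and those absent."""
    return {
        "found": sorted(required & ontology),
        "missing": sorted(required - ontology),
    }


# Illustrative inputs matching the example output above.
ontology = {"issur/3", "heter/2", "is_beheima_chalav_mixture/1", "chiyuv/3"}
required = {"issur/3", "heter/2", "is_beheima_chalav_mixture/1",
            "bishul_achrei_bishul/1"}

result = check_predicates(required, ontology)
```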
Step 2: World Coverage Check¶
Verify worlds exist for all authorities:
worlds_check:
  existing:
    - base
    - mechaber
    - rema
    - shach
    - taz
  needed:
    - name: "yalkut_yosef"
      reason: "Modern Sefardi authority cited"
      action: "Consider adding to worlds/"
Step 3: Prerequisite Verification¶
Check dependent seifim are encoded:
dependencies:
  satisfied:
    - seif: "YD 87:1"
      status: "encoded"
  blocking:
    - seif: "YD 86:1"
      reason: "Defines treifa categories used here"
      action: "Encode 86:1 first, or proceed with caution"
  optional:
    - seif: "YD 89:1"
      reason: "Waiting times (forward reference)"
      action: "Proceed, encode later"
Step 4: Complexity Scoring¶
The skill assigns a complexity score (1-10):
complexity:
  score: 6
  factors:
    machloket_count: 2
    commentary_depth: 3  # Tiers needed
    chain_length: 5      # Steps to Torah
    cross_references: 3
    novel_predicates: 1
  breakdown:
    - "2 machloket (Mechaber/Rema, Shach/Taz)"
    - "Requires TIER 3 commentary"
    - "One new predicate needed"
Scoring Rubric:
| Score | Complexity | Characteristics |
|---|---|---|
| 1-3 | Simple | No machloket, straightforward chain, existing predicates |
| 4-6 | Moderate | 1-2 machloket, complete chain, minor extensions |
| 7-10 | Complex | Multiple machloket, incomplete chain, novel concepts |
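The rubric bands translate directly into a lookup. This sketch covers only the score-to-label mapping from the table; how the individual factors combine into the numeric score is not specified here, so no formula is attempted.

```python
def complexity_label(score: int) -> str:
    """Map a 1-10 complexity score to the rubric band from the table above."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 3:
        return "Simple"
    if score <= 6:
        return "Moderate"
    return "Complex"


label = complexity_label(6)  # the YD 87:1 example score
```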
Phase G: Report Generation¶
The skill generates four artifacts in .mistaber-artifacts/:
Artifact 1: Human Review Report¶
corpus-report-YD-{siman}-{seif}.md
A comprehensive markdown document for human review containing:
# Corpus Report: YD 87:1 - Basar Bechalav D'Oraita
## Executive Summary
- **Reference:** Yoreh Deah 87:1
- **Topic:** Basic prohibition of meat and milk
- **Complexity:** 6/10 (Moderate)
- **Machloket:** 2 identified
- **Chain Status:** Complete to Torah
## Primary Text
### Hebrew
בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל ואסור לאכול ואסור בהנאה דאורייתא...
### English (Sefaria Translation)
The meat of a pure domesticated animal with the milk of a pure domesticated animal is forbidden to cook, forbidden to eat, and forbidden to derive benefit, by Torah law...
## Atomic Statements
[Decomposed statements as YAML]
## Derivation Chain
[Mermaid diagram]
## Machloket Summary
[Table of disputes]
## Commentary Analysis
[TIER 1-4 excerpts and classifications]
## Questions for Human Review
[Generated questions]
Artifact 2: Machine-Readable Sources¶
corpus-sources-YD-{siman}-{seif}.yaml
reference: "YD:87:1"
fetched_at: "2026-01-25T10:00:00Z"
primary:
  hebrew: "בשר בהמה טהורה..."
  english: "The meat of a pure domesticated animal..."
translations:
  - version: "Sefaria Community Translation"
    text: "..."
  - version: "Artscroll"
    text: "..."
statements:
  - id: s1
    type: ISSUR
    text_he: "..."
    text_en: "..."
    predicates: [issur, bishul]
commentaries:
  shach:
    - location: "sk 1"
      type: CLARIFICATION
      text_he: "..."
      text_en: "..."
      affects_statements: [s1]
  taz:
    - location: "sk 1"
      type: QUALIFICATION
      text_he: "..."
      text_en: "..."
      affects_statements: [s1, s2]
derivation_chain:
  - level: 0
    source: "SA YD 87:1"
    ref: "Shulchan Arukh, Yoreh De'ah 87:1"
  - level: 1
    source: "Tur YD 87"
    ref: "Tur, Yoreh Deah 87"
  - level: 2
    source: "Rambam MA 9:1"
    ref: "Mishneh Torah, Forbidden Foods 9:1"
  - level: 3
    source: "Chullin 104b"
    ref: "Talmud Bavli Chullin 104b"
  - level: 4
    source: "Chullin 8:1"
    ref: "Mishnah Chullin 8:1"
  - level: 5
    source: "Shemot 23:19"
    ref: "Exodus 23:19"
    terminus: true
machloket:
  - id: m1
    topic: "dag_bechalav"
    position_a:
      authority: mechaber
      ruling: sakana
    position_b:
      authority: rema
      ruling: mutar
Artifact 3: Visual Derivation Chain¶
corpus-chain-YD-{siman}-{seif}.mermaid
graph TD
SA["SA YD 87:1<br/>Shulchan Aruch"] --> TUR["Tur YD 87"]
TUR --> RAMBAM["Rambam<br/>Maachalot Asurot 9:1"]
RAMBAM --> GEM["Chullin 104b<br/>Talmud Bavli"]
GEM --> MISH["Chullin 8:1<br/>Mishnah"]
MISH --> TORAH["Shemot 23:19<br/>Torah"]
SA --> SHACH["Shach sk 1"]
SA --> TAZ["Taz sk 1"]
style TORAH fill:#ffe6e6
style SA fill:#e6f3ff
style SHACH fill:#fff2e6
style TAZ fill:#fff2e6
Artifact 4: Questions for Human¶
corpus-questions-YD-{siman}-{seif}.yaml
questions:
  - id: q1
    phase: B
    type: CLARIFICATION
    question: "Does 'beheima' include all domesticated animals or only cattle?"
    context: "Statement s1 mentions 'beheima tehora'"
    options:
      - "All domesticated kosher animals (cow, sheep, goat)"
      - "Only cattle specifically"
    default: 0
    source_hint: "Shach sk 1 may clarify"
  - id: q2
    phase: C
    type: MACHLOKET
    question: "Should we encode the Taz's lenient opinion on bedieved cases?"
    context: "Taz sk 3 permits bedieved when mixed unintentionally"
    options:
      - "Yes, encode as interpretation layer"
      - "No, follow strict Mechaber ruling only"
    default: 0
    implications:
      - "Option 0: More complete encoding, more complex"
      - "Option 1: Simpler encoding, may miss practical ruling"
  - id: q3
    phase: D
    type: CHAIN
    question: "The chain includes a disputed gemara interpretation. Use Rashi or Tosafot?"
    context: "Chullin 104b has competing interpretations"
    options:
      - "Follow Rashi (standard)"
      - "Follow Tosafot (stricter)"
    default: 0
Checkpoint Criteria¶
Before requesting human approval, verify:
- [ ] Primary source text accurately fetched (Hebrew + English)
- [ ] All TIER 1 commentaries fetched and classified
- [ ] Derivation chain reaches authoritative source (Torah or explicit rabbinic)
- [ ] All machloket identified with both positions documented
- [ ] No unresolved blocking dependencies on other seifim
- [ ] Questions formulated for any ambiguities
- [ ] All four artifacts generated
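Most of this checklist can be evaluated mechanically before a human is asked to review. The following is a sketch only: the `state` dictionary and all of its keys are hypothetical, and the "questions formulated for any ambiguities" item is left out because it needs human judgment.

```python
TIER1 = {"shach", "taz", "rema"}  # the always-fetch commentaries


def checkpoint_failures(state: dict) -> list:
    """Return the checklist items that do not yet hold; empty means ready for review."""
    checks = {
        "primary text fetched (Hebrew + English)":
            bool(state.get("hebrew")) and bool(state.get("english")),
        "all TIER 1 commentaries fetched and classified":
            TIER1 <= set(state.get("commentaries", [])),
        "derivation chain reaches authoritative source":
            state.get("chain_status") == "complete",
        "every machloket has both positions":
            all("position_a" in m and "position_b" in m
                for m in state.get("machloket", [])),
        "no blocking dependencies":
            not state.get("blocking", []),
        "all four artifacts generated":
            len(state.get("artifacts", [])) == 4,
    }
    return [name for name, ok in checks.items() if not ok]


ready = {
    "hebrew": "בשר בהמה טהורה...",
    "english": "The meat of a pure domesticated animal...",
    "commentaries": ["shach", "taz", "rema"],
    "chain_status": "complete",
    "machloket": [{"id": "m1", "position_a": {}, "position_b": {}}],
    "blocking": [],
    "artifacts": ["report", "sources", "chain", "questions"],
}
failures = checkpoint_failures(ready)
```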
Session State Update¶
After completing corpus preparation:
current_phase: corpus-prep
target_seif: "YD:87:1"
checkpoints:
  corpus-prep:
    status: pending_review
    artifacts:
      - .mistaber-artifacts/corpus-report-YD-87-1.md
      - .mistaber-artifacts/corpus-sources-YD-87-1.yaml
      - .mistaber-artifacts/corpus-chain-YD-87-1.mermaid
      - .mistaber-artifacts/corpus-questions-YD-87-1.yaml
    complexity_score: 6
    questions_count: 3
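The artifact paths follow the naming templates given in Phase G, so they can be derived from the siman and seif numbers. A small sketch; the helper name `artifact_paths` is hypothetical.

```python
from pathlib import PurePosixPath

ARTIFACT_DIR = PurePosixPath(".mistaber-artifacts")

# (kind, extension) pairs for the four artifacts listed in Phase G.
ARTIFACT_KINDS = [
    ("report", "md"),
    ("sources", "yaml"),
    ("chain", "mermaid"),
    ("questions", "yaml"),
]


def artifact_paths(siman: int, seif: int) -> list:
    """Build the four artifact paths for a Yoreh De'ah seif."""
    return [
        str(ARTIFACT_DIR / f"corpus-{kind}-YD-{siman}-{seif}.{ext}")
        for kind, ext in ARTIFACT_KINDS
    ]


paths = artifact_paths(87, 1)
```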
Troubleshooting¶
"Reference not found in Sefaria"¶
The reference format may be incorrect. Try:
- Use full reference: "Shulchan Arukh, Yoreh De'ah 87:1" not "YD 87:1"
- Check spelling of siman name
- Use clarify_name_argument() to find the correct format
"Cannot fetch commentary"¶
Some commentaries have limited coverage in Sefaria:
- Check if the commentary exists for this seif
- Try alternative commentary in same tier
- Note as "requires manual research" in questions
"Derivation chain incomplete"¶
Not all sources are linked in Sefaria:
- Document the gap in the report
- Add manual research note
- Proceed with available chain
- Human reviewer can provide missing links
"Too many machloket identified"¶
If the skill identifies 5+ machloket, the seif may be too complex:
- Consider breaking into smaller units
- Focus on primary machloket first
- Create follow-up tasks for secondary disputes
Best Practices¶
- Verify Hebrew text carefully - Translation errors can propagate through the entire encoding
- Read TIER 1 commentaries fully - Even sections that seem unrelated may affect encoding
- Document uncertainty - Better to flag questions than assume incorrectly
- Check cross-references - A seif may depend on rulings from another siman
- Score conservatively - Underestimating complexity leads to unexpected difficulties later
Next Phase¶
After Checkpoint 1 approval, proceed to HLL Encoding.