Corpus Preparation¶

The corpus preparation phase is the foundation of the encoding workflow. It transforms a seif reference into a comprehensive, structured corpus of source texts with complete derivation chains and identified machloket (disputes). This phase uses the corpus-prep skill and produces artifacts for human review at Checkpoint 1.

Purpose¶

The corpus-prep skill has five objectives:

Source Retrieval: Fetch primary text and commentaries from Sefaria
Derivation Chains: Trace each ruling back to its authoritative source (Torah/Mishnah/Takana)
Machloket Identification: Identify all disputes between halachic authorities
Semantic Decomposition: Break the seif into atomic, encodable statements
Question Generation: Formulate clarifying questions for a human expert

Invoking the Skill¶

To start corpus preparation:

User: "Prepare corpus for YD 87:1"

The skill automatically:

Validates the reference
Fetches all required texts
Builds the derivation chain
Generates review artifacts
Requests checkpoint approval

Phase A: Reference Resolution and Validation¶

Step 1: Validate Reference Format¶

The skill uses Sefaria's clarify_name_argument() tool to validate and canonicalize the reference:

result = mcp__sefaria_texts__clarify_name_argument(
    name="Shulchan Arukh, Yoreh De'ah 87:1",
    type_filter="ref"
)

This ensures:

The reference exists in Sefaria's database
The canonical format is used consistently
Typos and variations are resolved

Step 2: Get Structural Context¶

Understanding the seif's position within the siman helps identify dependencies:

shape = mcp__sefaria_texts__get_text_or_category_shape(
    name="Shulchan Arukh, Yoreh De'ah 87"
)

The skill documents:

Context	Purpose
Total seifim in siman	Understand scope of topic
Position of target seif	Identify sequence dependencies
Related seifim	Find cross-references

Step 3: Check Prior Encoding State¶

If a .mistaber-session.yaml exists, the skill reads it to determine:

Previously encoded seifim (dependencies satisfied)
Pending dependencies (may block encoding)
Active encoding session (resume vs. new)

Step 4: Identify Dependencies¶

The skill searches for cross-references within the text:

Hebrew Pattern	Meaning	Action
"כמו שיתבאר לקמן"	"As will be explained below"	Forward dependency (may proceed)
"כמו שכתבנו לעיל"	"As we wrote above"	Backward dependency (must be encoded first)
"עיין סימן X"	"See siman X"	Cross-siman reference (note but don't block)

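The pattern table above can be sketched as a small classifier. This is an illustration under stated assumptions, not the skill's actual implementation: the names `Dependency`, `PATTERNS`, and `classify_cross_references` are hypothetical, and the regular expressions match only the three phrasings listed in the table.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    kind: str       # "forward", "backward", or "cross_siman"
    blocking: bool  # True only for backward dependencies
    match: str      # the Hebrew phrase that was matched


# Hypothetical pattern table mirroring the three phrases in the table above.
PATTERNS = [
    (re.compile(r"כמו שיתבאר לקמן"), "forward", False),
    (re.compile(r"כמו שכתבנו לעיל"), "backward", True),
    (re.compile(r"עיין סימן\s+\S+"), "cross_siman", False),
]


def classify_cross_references(text: str) -> list[Dependency]:
    """Return every cross-reference phrase found in text, grouped by pattern order."""
    found = []
    for pattern, kind, blocking in PATTERNS:
        for m in pattern.finditer(text):
            found.append(Dependency(kind, blocking, m.group(0)))
    return found
```

Only entries with `blocking=True` would need to be encoded before the target seif; the others are recorded and encoding proceeds.
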
Phase B: Primary Source Extraction¶

Step 1: Fetch Primary Text¶

Retrieve both Hebrew and English versions:

# Fetch the Hebrew text together with its English translation
text = mcp__sefaria_texts__get_text(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)

# Get all English translations for comparison
translations = mcp__sefaria_texts__get_english_translations(
    reference="Shulchan Arukh, Yoreh De'ah 87:1"
)

The skill captures:

Original Hebrew text
Primary English translation
Alternative translations (when available)

Step 2: Linguistic Analysis¶

For the Hebrew text, the skill identifies and documents:

Key Terms - Technical halachic vocabulary requiring precise understanding:

# Look up technical terms
definitions = mcp__sefaria_texts__search_in_dictionaries(
    query="נותן טעם"  # "noten taam" - taste transfer
)

Common terms requiring lookup:

Term	Transliteration	Meaning
בשר בחלב	basar bechalav	meat and milk
נותן טעם	noten taam	taste transfer
בטל בשישים	batel b'shishim	nullified at a 60:1 ratio
לכתחילה	lechatchila	ideally/initially
בדיעבד	bedieved	after the fact

Ambiguous Phrases - Terms requiring commentary for clarification

Cross-References - Internal references to other sources

Step 3: Semantic Decomposition¶

The skill breaks the seif into atomic statements. Each statement represents:

A single normative claim (issur/heter), OR
A single definitional claim, OR
A single conditional relationship

Example Decomposition (YD 87:1):

statements:
  - id: s1
    type: ISSUR
    text_he: "בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל"
    text_en: "Meat of a kosher domesticated animal with milk of a kosher domesticated animal is forbidden to cook"
    conditions: []

  - id: s2
    type: ISSUR
    text_he: "ואסור לאכול"
    text_en: "and forbidden to eat"
    conditions: []

  - id: s3
    type: ISSUR
    text_he: "ואסור בהנאה"
    text_en: "and forbidden to derive benefit"
    conditions: []

  - id: s4
    type: DEFINITION
    text_he: "דאורייתא"
    text_en: "by Torah law"
    applies_to: [s1, s2, s3]

Statement Types¶

Type	Description	Encoding Approach
`ISSUR`	Prohibition	`asserts(W, issur(action, M, level))`
`ISSUR_SAKANA`	Health danger	`asserts(W, sakana(M))`
`HETER`	Permission	`asserts(W, heter(action, M))`
`CHIYUV`	Obligation	`asserts(W, chiyuv(action, M, level))`
`DEFINITION`	Category/classification	`is_X(entity)` predicate
`CONDITION`	Prerequisite	Part of rule body
`EXCEPTION`	Override	`override/3` or NAF
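
As a rough sketch of how decomposed statements could be checked against the type table above, the following validates that each statement uses a known type and that every `applies_to` entry points at an existing statement id. The `Statement` class and `validate_statements` function are hypothetical names introduced for illustration; only the type names and encoding strings come from the table.

```python
from dataclasses import dataclass, field

# Encoding approach per statement type, copied from the table above.
ENCODING_APPROACH = {
    "ISSUR": "asserts(W, issur(action, M, level))",
    "ISSUR_SAKANA": "asserts(W, sakana(M))",
    "HETER": "asserts(W, heter(action, M))",
    "CHIYUV": "asserts(W, chiyuv(action, M, level))",
    "DEFINITION": "is_X(entity) predicate",
    "CONDITION": "part of rule body",
    "EXCEPTION": "override/3 or NAF",
}


@dataclass
class Statement:
    id: str
    type: str
    text_he: str
    text_en: str
    conditions: list = field(default_factory=list)
    applies_to: list = field(default_factory=list)


def validate_statements(statements):
    """Return a list of problems: unknown types or dangling applies_to ids."""
    known_ids = {s.id for s in statements}
    problems = []
    for s in statements:
        if s.type not in ENCODING_APPROACH:
            problems.append(f"{s.id}: unknown type {s.type!r}")
        for target in s.applies_to:
            if target not in known_ids:
                problems.append(f"{s.id}: applies_to references missing id {target!r}")
    return problems
```

An empty result means the decomposition is internally consistent; it says nothing about whether the decomposition is halachically correct, which remains a question for human review.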

Phase C: Commentary Layer (Adaptive Depth)¶

Step 1: Discover All Commentaries¶

The skill fetches all linked commentaries:

links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    with_text="1"
)

Step 2: Tier-Based Fetching¶

Commentaries are fetched in tiers based on authority and relevance:

TIER 1 (Always Fetch)¶

These commentaries are always required:

Commentary	Author	Role
Shach (Siftei Kohen)	R. Shabsai HaKohen	Primary Ashkenazi commentary
Taz (Turei Zahav)	R. David HaLevi Segal	Primary Ashkenazi commentary
Rema gloss	R. Moshe Isserles	Ashkenazi rulings within text

# Fetch Shach
shach = mcp__sefaria_texts__get_text(
    reference="Shakh on Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)

# Fetch Taz
taz = mcp__sefaria_texts__get_text(
    reference="Turei Zahav on Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)

TIER 2 (Standard - Fetch if TIER 1 is Insufficient)¶

Commentary	When Needed
Pri Megadim	Complex TIER 1 disagreement
Chochmat Adam	Practical application unclear
Aruch HaShulchan	Modern summary needed

TIER 3 (Extended - Fetch for Complex Machloket)¶

Commentary	When Needed
Pitchei Teshuva	Multiple minority opinions
Darchei Teshuva	Chasidic practice variations
Be'er Heitev	Additional sources needed

TIER 4 (Modern - Fetch for Practical Application)¶

Commentary	When Needed
Yalkut Yosef	Sefardi modern practice
Mishnah Berurah	Cross-reference for YD

Step 3: Classify Commentary Type¶

Each commentary passage is classified:

Type	Description	Example
`CLARIFICATION`	Explains without changing	"Beheima includes cow, sheep, and goat"
`QUALIFICATION`	Adds conditions	"Only applies when taste transfers"
`EXTENSION`	Applies to new cases	"Includes butter, not just milk"
`DISPUTE`	Disagrees with ruling	"Rema permits in Ashkenazi practice"
`PRACTICAL`	Guidance for practice	"The custom today is to be strict"

Step 4: Machloket Detection and Mapping¶

The skill identifies disputes at multiple levels:

Primary Machloket (Mechaber vs. Rema):

machloket:
  - id: m1
    topic: "dag_bechalav"
    layer: primary
    position_a:
      authority: mechaber
      ruling: sakana
      source: "BY YD 87"
      reasoning: "Medical sources indicate danger"
    position_b:
      authority: rema
      ruling: mutar
      source: "Rema gloss YD 87:3"
      reasoning: "No current medical basis"
    practical_difference: "Fish cooked in dairy dishes"

Secondary Machloket (Commentary level):

machloket:
  - id: m2
    topic: "bishul_achrei_bishul"
    layer: secondary
    position_a:
      authority: shach
      ruling: applies_only_to_liquid
      source: "Shach sk 5"
    position_b:
      authority: taz
      ruling: applies_to_all
      source: "Taz sk 3"
    practical_difference: "Reheating solid food"
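
Because the checkpoint requires both positions of every machloket to be documented, a simple completeness check over entries shaped like the YAML above may be useful. This is a sketch only: `incomplete_machloket` and the choice of required keys (`authority`, `ruling`, `source`) are assumptions, not part of the skill.

```python
# Keys assumed to be required on each side of a dispute (an assumption of this sketch).
REQUIRED_POSITION_KEYS = ("authority", "ruling", "source")


def incomplete_machloket(entries):
    """Map each machloket id to the position fields it is missing."""
    problems = {}
    for entry in entries:
        missing = []
        for side in ("position_a", "position_b"):
            position = entry.get(side) or {}
            missing += [
                f"{side}.{key}"
                for key in REQUIRED_POSITION_KEYS
                if not position.get(key)
            ]
        if missing:
            problems[entry.get("id", "<no id>")] = missing
    return problems
```

Entries would typically be loaded from the machine-readable sources file with a YAML parser before being passed in.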

Phase D: Derivation Chain (Recursive)¶

Step 1: Build Chain Backward¶

For each normative statement, the skill traces the source chain:

SA YD 87:1 → Tur YD 87 → Rambam Maachalot Asurot 9:1 →
Gemara Chullin 104b → Mishnah Chullin 8:1 → Torah Shemot 23:19

The skill uses recursive get_links_between_texts() calls:

# Get SA links
sa_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    with_text="1"
)

# Find Tur reference and get its links
tur_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Tur, Yoreh Deah 87",
    with_text="1"
)

# Continue recursively until reaching terminus

Step 2: Loop Detection¶

The skill tracks visited references to detect circular chains:

visited = set()

def build_chain(ref, chain, depth=0):
    if ref in visited:
        return chain, "loop_detected"
    if depth > 10:
        return chain, "max_depth_exceeded"

    visited.add(ref)
    links = get_links(ref)

    # Continue recursion...

If a loop is detected, it's documented and flagged for human review.
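
The fragment above leaves the recursion open. One way it could be completed is sketched below, under several assumptions: `get_links` is passed in as a plain callable that returns a list of earlier-source references, the chain follows only the first link at each level, and the `no_further_links` status is a label invented for this example. The real skill calls the Sefaria link tool and may select links differently.

```python
def build_chain(ref, get_links, visited=None, chain=None, depth=0, max_depth=10):
    """Follow the first link from ref until a loop, the depth limit, or a dead end."""
    visited = set() if visited is None else visited
    chain = [] if chain is None else chain
    if ref in visited:
        return chain, "loop_detected"
    if depth > max_depth:
        return chain, "max_depth_exceeded"
    visited.add(ref)
    chain.append(ref)
    links = get_links(ref)
    if not links:
        return chain, "no_further_links"
    return build_chain(links[0], get_links, visited, chain, depth + 1, max_depth)


# Hypothetical in-memory link table used only for this illustration.
SAMPLE_LINKS = {
    "SA YD 87:1": ["Tur YD 87"],
    "Tur YD 87": ["Rambam Maachalot Asurot 9:1"],
    "Rambam Maachalot Asurot 9:1": ["Tur YD 87"],  # artificial loop
}

chain, status = build_chain("SA YD 87:1", lambda r: SAMPLE_LINKS.get(r, []))
print(status)  # loop_detected
```

Passing `visited` and `chain` through the call rather than using a module-level set keeps separate chain builds from interfering with each other.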

Step 3: Chain Completeness Verification¶

A complete chain should reach one of:

Terminus Type	Description	Example
Torah verse	Biblical source	Shemot 23:19
Rabbinic institution	Takana d'rabanan	Chazal's enactment
Minhag source	Custom origin	"Minhag Ashkenaz"

Incomplete chains are flagged:

chain_status: incomplete
missing: "Cannot trace beyond Tur - earlier sources not linked"
action_required: manual_research
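
The completeness rule can be expressed as a check on the last element of the chain. In this sketch the terminus labels (`torah`, `rabbinic_institution`, `minhag`) and the function name `chain_status` are hypothetical; the returned keys follow the YAML shown above.

```python
# Hypothetical labels for the three terminus types in the table above.
TERMINUS_TYPES = {"torah", "rabbinic_institution", "minhag"}


def chain_status(chain):
    """chain is a list of (source_type, reference) pairs, latest source first."""
    if not chain:
        return {
            "chain_status": "incomplete",
            "missing": "Chain is empty",
            "action_required": "manual_research",
        }
    last_type, last_ref = chain[-1]
    if last_type in TERMINUS_TYPES:
        return {"chain_status": "complete", "terminus": last_ref}
    return {
        "chain_status": "incomplete",
        "missing": f"Cannot trace beyond {last_ref} - earlier sources not linked",
        "action_required": "manual_research",
    }
```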

Step 4: Assemble Makor Chain¶

The chain is formatted for ASP:

% Rule naming: r_{topic}_{specific} (topic-based, NOT siman-seif based)
% Example: r_bb_achiila for basar bechalav eating rule

makor(r_bb_achiila, sa("yd:87:1")).
makor(r_bb_achiila, tur("yd:87")).
makor(r_bb_achiila, rambam("maachalot:9:1")).
makor(r_bb_achiila, gemara("chullin:104b")).
makor(r_bb_achiila, mishnah("chullin:8:1")).
makor(r_bb_achiila, torah("shemot:23:19")).
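
Generating these facts from a chain is mechanical. The helper below is a minimal sketch that assumes the chain is already available as (functor, locator) pairs; `makor_facts` is a hypothetical name, and the output format simply reproduces the facts shown above.

```python
def makor_facts(rule_name, chain):
    """Render one makor/2 fact per (functor, locator) pair in the chain."""
    return [
        f'makor({rule_name}, {functor}("{locator}")).'
        for functor, locator in chain
    ]


chain = [
    ("sa", "yd:87:1"),
    ("tur", "yd:87"),
    ("rambam", "maachalot:9:1"),
    ("gemara", "chullin:104b"),
    ("mishnah", "chullin:8:1"),
    ("torah", "shemot:23:19"),
]

for fact in makor_facts("r_bb_achiila", chain):
    print(fact)
```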

Phase E: Semantic Enrichment¶

Step 1: Topic Lookup¶

The skill queries Sefaria's topic system:

topics = mcp__sefaria_texts__get_topic_details(
    topic_slug="meat-and-milk",
    with_links=True,
    with_refs=True
)

This provides:

Related topics for context
Cross-references to related texts
Modern scholarship links

Step 2: Semantic Search¶

For edge cases and extensions, the skill runs a semantic search:

related = mcp__sefaria_texts__english_semantic_search(
    query="prohibition of cooking meat and milk together Biblical commandment",
    filters={"document_categories": ["Halakhah"]}
)

Step 3: Cross-Siman Dependencies¶

Document references to other simanim:

cross_references:
  - target: "YD 89"
    type: forward
    topic: "waiting times after meat"
    note: "Seif 4 references waiting times"

  - target: "YD 87:3"
    type: internal
    topic: "fish and dairy"
    note: "Exception case within same siman"

Step 4: Shiur and Measurement Extraction¶

Identify quantities mentioned:

Shiur	Measurement	Usage in Seif
k'zayit	Olive-size	Minimum for violation
k'beitza	Egg-size	(not in this seif)
shishim	60:1 ratio	Nullification threshold
noten taam	Taste transfer	Determines applicability

Step 5: Temporal and Contextual Conditions¶

Identify context modifiers:

Condition Type	Hebrew	English	Effect
Time-based	בשבת	on Shabbat	May change ruling
Situation	בדיעבד	after the fact	Lenient ruling
Situation	הפסד מרובה	significant loss	May allow leniency
Situation	שעת הדחק	pressing circumstances	Emergency leniency
Location	בארץ ישראל	in Eretz Yisrael	Regional variation

Phase F: Gap Analysis¶

Step 1: Ontology Coverage Check¶

The skill verifies predicates exist in the ontology:

predicates_check:
  found:
    - issur/3
    - heter/2
    - is_beheima_chalav_mixture/1
  missing:
    - name: "bishul_achrei_bishul"
      expected_arity: 1
      context: "reheating solid basar bechalav"
      action: "May need to add to sorts.lp"

Step 2: World Coverage Check¶

Verify worlds exist for all authorities:

worlds_check:
  existing:
    - base
    - mechaber
    - rema
    - shach
    - taz
  needed:
    - name: "yalkut_yosef"
      reason: "Modern Sefardi authority cited"
      action: "Consider adding to worlds/"

Step 3: Prerequisite Verification¶

Check dependent seifim are encoded:

dependencies:
  satisfied:
    - seif: "YD 87:1"
      status: "encoded"
  blocking:
    - seif: "YD 86:1"
      reason: "Defines treifa categories used here"
      action: "Encode 86:1 first, or proceed with caution"
  optional:
    - seif: "YD 89:1"
      reason: "Waiting times (forward reference)"
      action: "Proceed, encode later"

Step 4: Complexity Scoring¶

The skill assigns a complexity score (1-10):

complexity:
  score: 6
  factors:
    machloket_count: 2
    commentary_depth: 3  # Tiers needed
    chain_length: 5      # Steps to Torah
    cross_references: 3
    novel_predicates: 1
  breakdown:
    - "2 machloket (Mechaber/Rema, Shach/Taz)"
    - "Requires TIER 3 commentary"
    - "One new predicate needed"

Scoring Rubric:

Score	Complexity	Characteristics
1-3	Simple	No machloket, straightforward chain, existing predicates
4-6	Moderate	1-2 machloket, complete chain, minor extensions
7-10	Complex	Multiple machloket, incomplete chain, novel concepts
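
The rubric's bands translate directly into a lookup. The text does not say how the individual factors are combined into the score, so this sketch covers only the mapping from a score to its band; `complexity_band` is a hypothetical helper name.

```python
def complexity_band(score: int) -> str:
    """Map a 1-10 complexity score to the band named in the rubric."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 3:
        return "Simple"
    if score <= 6:
        return "Moderate"
    return "Complex"


print(complexity_band(6))  # Moderate
```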

Phase G: Report Generation¶

The skill generates four artifacts in .mistaber-artifacts/:

Artifact 1: Human Review Report¶

corpus-report-YD-{siman}-{seif}.md

A comprehensive markdown document for human review containing:

# Corpus Report: YD 87:1 - Basar Bechalav D'Oraita

## Executive Summary
- **Reference:** Yoreh Deah 87:1
- **Topic:** Basic prohibition of meat and milk
- **Complexity:** 6/10 (Moderate)
- **Machloket:** 2 identified
- **Chain Status:** Complete to Torah

## Primary Text

### Hebrew
בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל ואסור לאכול ואסור בהנאה דאורייתא...

### English (Sefaria Translation)
The meat of a pure domesticated animal with the milk of a pure domesticated animal is forbidden to cook, forbidden to eat, and forbidden to derive benefit, by Torah law...

## Atomic Statements
[Decomposed statements as YAML]

## Derivation Chain
[Mermaid diagram]

## Machloket Summary
[Table of disputes]

## Commentary Analysis
[TIER 1-4 excerpts and classifications]

## Questions for Human Review
[Generated questions]

Artifact 2: Machine-Readable Sources¶

corpus-sources-YD-{siman}-{seif}.yaml

reference: "YD:87:1"
fetched_at: "2026-01-25T10:00:00Z"

primary:
  hebrew: "בשר בהמה טהורה..."
  english: "The meat of a pure domesticated animal..."
  translations:
    - version: "Sefaria Community Translation"
      text: "..."
    - version: "Artscroll"
      text: "..."

statements:
  - id: s1
    type: ISSUR
    text_he: "..."
    text_en: "..."
    predicates: [issur, bishul]

commentaries:
  shach:
    - location: "sk 1"
      type: CLARIFICATION
      text_he: "..."
      text_en: "..."
      affects_statements: [s1]

  taz:
    - location: "sk 1"
      type: QUALIFICATION
      text_he: "..."
      text_en: "..."
      affects_statements: [s1, s2]

derivation_chain:
  - level: 0
    source: "SA YD 87:1"
    ref: "Shulchan Arukh, Yoreh De'ah 87:1"
  - level: 1
    source: "Tur YD 87"
    ref: "Tur, Yoreh Deah 87"
  - level: 2
    source: "Rambam MA 9:1"
    ref: "Mishneh Torah, Forbidden Foods 9:1"
  - level: 3
    source: "Chullin 104b"
    ref: "Talmud Bavli Chullin 104b"
  - level: 4
    source: "Chullin 8:1"
    ref: "Mishnah Chullin 8:1"
  - level: 5
    source: "Shemot 23:19"
    ref: "Exodus 23:19"
    terminus: true

machloket:
  - id: m1
    topic: "dag_bechalav"
    position_a:
      authority: mechaber
      ruling: sakana
    position_b:
      authority: rema
      ruling: mutar

Artifact 3: Visual Derivation Chain¶

corpus-chain-YD-{siman}-{seif}.mermaid

graph TD
    SA["SA YD 87:1<br/>Shulchan Aruch"] --> TUR["Tur YD 87"]
    TUR --> RAMBAM["Rambam<br/>Maachalot Asurot 9:1"]
    RAMBAM --> GEM["Chullin 104b<br/>Talmud Bavli"]
    GEM --> MISH["Chullin 8:1<br/>Mishnah"]
    MISH --> TORAH["Shemot 23:19<br/>Torah"]

    SA --> SHACH["Shach sk 1"]
    SA --> TAZ["Taz sk 1"]

    style TORAH fill:#ffe6e6
    style SA fill:#e6f3ff
    style SHACH fill:#fff2e6
    style TAZ fill:#fff2e6

Artifact 4: Questions for Human¶

corpus-questions-YD-{siman}-{seif}.yaml

questions:
  - id: q1
    phase: B
    type: CLARIFICATION
    question: "Does 'beheima' include all domesticated animals or only cattle?"
    context: "Statement s1 mentions 'beheima tehora'"
    options:
      - "All domesticated kosher animals (cow, sheep, goat)"
      - "Only cattle specifically"
    default: 0
    source_hint: "Shach sk 1 may clarify"

  - id: q2
    phase: C
    type: MACHLOKET
    question: "Should we encode the Taz's lenient opinion on bedieved cases?"
    context: "Taz sk 3 permits bedieved when mixed unintentionally"
    options:
      - "Yes, encode as interpretation layer"
      - "No, follow strict Mechaber ruling only"
    default: 0
    implications:
      - "Option 0: More complete encoding, more complex"
      - "Option 1: Simpler encoding, may miss practical ruling"

  - id: q3
    phase: D
    type: CHAIN
    question: "The chain includes a disputed gemara interpretation. Use Rashi or Tosafot?"
    context: "Chullin 104b has competing interpretations"
    options:
      - "Follow Rashi (standard)"
      - "Follow Tosafot (stricter)"
    default: 0

Checkpoint Criteria¶

Before requesting human approval, verify:

[ ] Primary source text accurately fetched (Hebrew + English)
[ ] All TIER 1 commentaries fetched and classified
[ ] Derivation chain reaches authoritative source (Torah or explicit rabbinic)
[ ] All machloket identified with both positions documented
[ ] No unresolved blocking dependencies on other seifim
[ ] Questions formulated for any ambiguities
[ ] All four artifacts generated
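
The checklist can be treated as a gate: approval is requested only when every item passes. The sketch below assumes each criterion has been reduced to a boolean by earlier phases; the criterion keys and the `unmet_criteria` function are hypothetical names chosen for this example.

```python
# Hypothetical keys, one per checklist item above, in the same order.
CHECKPOINT_CRITERIA = [
    "primary_text_fetched",
    "tier1_commentaries_classified",
    "chain_reaches_authoritative_source",
    "machloket_positions_documented",
    "no_blocking_dependencies",
    "questions_formulated",
    "all_artifacts_generated",
]


def unmet_criteria(results: dict) -> list:
    """Return the criteria that are missing or False in results."""
    return [name for name in CHECKPOINT_CRITERIA if not results.get(name, False)]
```

A criterion that was never evaluated counts as unmet, so an incomplete run cannot pass the gate by omission.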

Session State Update¶

After completing corpus preparation:

current_phase: corpus-prep
target_seif: "YD:87:1"
checkpoints:
  corpus-prep:
    status: pending_review
    artifacts:
      - .mistaber-artifacts/corpus-report-YD-87-1.md
      - .mistaber-artifacts/corpus-sources-YD-87-1.yaml
      - .mistaber-artifacts/corpus-chain-YD-87-1.mermaid
      - .mistaber-artifacts/corpus-questions-YD-87-1.yaml
    complexity_score: 6
    questions_count: 3
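
The artifact paths listed above follow the naming templates given for each artifact, so they can be derived from the siman and seif numbers. This is a small sketch; `artifact_paths` and the dictionary keys are hypothetical, while the directory and filename templates are taken from the text.

```python
from pathlib import PurePosixPath

ARTIFACT_DIR = PurePosixPath(".mistaber-artifacts")

# Filename templates as given in the artifact descriptions.
TEMPLATES = {
    "report": "corpus-report-YD-{siman}-{seif}.md",
    "sources": "corpus-sources-YD-{siman}-{seif}.yaml",
    "chain": "corpus-chain-YD-{siman}-{seif}.mermaid",
    "questions": "corpus-questions-YD-{siman}-{seif}.yaml",
}


def artifact_paths(siman: int, seif: int) -> dict:
    """Return the four artifact paths for a Yoreh De'ah siman and seif."""
    return {
        name: str(ARTIFACT_DIR / template.format(siman=siman, seif=seif))
        for name, template in TEMPLATES.items()
    }


print(artifact_paths(87, 1)["report"])  # .mistaber-artifacts/corpus-report-YD-87-1.md
```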

Troubleshooting¶

"Reference not found in Sefaria"¶

The reference format may be incorrect. Try:

Use full reference: "Shulchan Arukh, Yoreh De'ah 87:1" not "YD 87:1"
Check spelling of siman name
Use clarify_name_argument() to find correct format

"Cannot fetch commentary"¶

Some commentaries have limited coverage in Sefaria:

Check if the commentary exists for this seif
Try alternative commentary in same tier
Note as "requires manual research" in questions

"Derivation chain incomplete"¶

Not all sources are linked in Sefaria:

Document the gap in the report
Add manual research note
Proceed with available chain
Human reviewer can provide missing links

"Too many machloket identified"¶

If the skill identifies 5+ machloket, the seif may be too complex:

Consider breaking into smaller units
Focus on primary machloket first
Create follow-up tasks for secondary disputes

Best Practices¶

Verify Hebrew text carefully - Translation errors can propagate through the entire encoding
Read TIER 1 commentaries fully - Even sections that seem unrelated may affect encoding
Document uncertainty - Better to flag questions than assume incorrectly
Check cross-references - A seif may depend on rulings from another siman
Score conservatively - Underestimating complexity leads to surprise difficulties

Next Phase¶

After Checkpoint 1 approval, proceed to HLL Encoding.