Corpus Preparation

The corpus preparation phase is the foundation of the encoding workflow. It transforms a seif reference into a comprehensive, structured corpus of source texts with complete derivation chains and identified machloket (disputes). This phase uses the corpus-prep skill and produces artifacts for human review at Checkpoint 1.

Purpose

The corpus-prep skill accomplishes several critical objectives:

  1. Source Retrieval: Fetch primary text and commentaries from Sefaria
  2. Derivation Chains: Trace each ruling back to its authoritative source (Torah/Mishnah/Takana)
  3. Machloket Identification: Identify all disputes between halachic authorities
  4. Semantic Decomposition: Break the seif into atomic, encodable statements
  5. Question Generation: Formulate clarifying questions for the human expert

Invoking the Skill

To start corpus preparation:

User: "Prepare corpus for YD 87:1"

The skill automatically:

  1. Validates the reference
  2. Fetches all required texts
  3. Builds the derivation chain
  4. Generates review artifacts
  5. Requests checkpoint approval

Phase A: Reference Resolution and Validation

Step 1: Validate Reference Format

The skill uses Sefaria's clarify_name_argument() tool to validate and canonicalize the reference:

result = mcp__sefaria_texts__clarify_name_argument(
    name="Shulchan Arukh, Yoreh De'ah 87:1",
    type_filter="ref"
)

This ensures:

  • The reference exists in Sefaria's database
  • The canonical format is used consistently
  • Typos and variations are resolved
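
Shorthand such as "YD 87:1" can be expanded locally before it is handed to the validation tool. The sketch below is illustrative only: `expand_short_ref` and the `SECTION_TITLES` table are hypothetical helpers, not part of the skill or of Sefaria's API, and only the Yoreh De'ah title string is taken from this page.

```python
import re

# Hypothetical abbreviation table; only the YD entry comes from this page.
SECTION_TITLES = {
    "YD": "Shulchan Arukh, Yoreh De'ah",
}

# Matches "<abbrev> <siman>" or "<abbrev> <siman>:<seif>".
SHORT_REF = re.compile(r"^\s*([A-Za-z]+)\s+(\d+)(?::(\d+))?\s*$")


def expand_short_ref(short_ref: str) -> str:
    """Expand 'YD 87:1' into the full title form used in the tool calls."""
    match = SHORT_REF.match(short_ref)
    if match is None:
        raise ValueError(f"Unrecognized reference format: {short_ref!r}")
    section, siman, seif = match.groups()
    title = SECTION_TITLES.get(section.upper())
    if title is None:
        raise ValueError(f"Unknown section abbreviation: {section!r}")
    return f"{title} {siman}:{seif}" if seif else f"{title} {siman}"
```

The expanded string would still need to go through clarify_name_argument(), since local expansion cannot confirm that the reference exists.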

Step 2: Get Structural Context

Understanding the seif's position within the siman helps identify dependencies:

shape = mcp__sefaria_texts__get_text_or_category_shape(
    name="Shulchan Arukh, Yoreh De'ah 87"
)

The skill documents:

Context | Purpose
Total seifim in siman | Understand scope of topic
Position of target seif | Identify sequence dependencies
Related seifim | Find cross-references

Step 3: Check Prior Encoding State

If a .mistaber-session.yaml file exists, the skill reads it to determine:

  • Previously encoded seifim (dependencies satisfied)
  • Pending dependencies (may block encoding)
  • Active encoding session (resume vs. new)

Step 4: Identify Dependencies

The skill searches for cross-references within the text:

Hebrew Pattern | Meaning | Action
"כמו שיתבאר לקמן" | "As will be explained below" | Forward dependency (may proceed)
"כמו שכתבנו לעיל" | "As we wrote above" | Backward dependency (must be encoded first)
"עיין סימן X" | "See siman X" | Cross-siman reference (note but don't block)

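The three cross-reference phrases listed for Step 4 can be detected with plain regular expressions. This is a minimal sketch under stated assumptions: `find_dependencies`, the `DEPENDENCY_PATTERNS` layout, and the result keys are hypothetical names chosen for illustration, and the skill's actual matching logic may differ.

```python
import re

# (pattern, dependency kind, blocks encoding?) for each phrase in Step 4.
# The tuple layout and kind labels are assumptions made for this sketch.
DEPENDENCY_PATTERNS = [
    (re.compile("כמו שיתבאר לקמן"), "forward", False),
    (re.compile("כמו שכתבנו לעיל"), "backward", True),
    (re.compile(r"עיין סימן\s+(\S+)"), "cross_siman", False),
]


def find_dependencies(text_he: str) -> list[dict]:
    """Return every cross-reference phrase found, in order of appearance."""
    found = []
    for pattern, kind, blocking in DEPENDENCY_PATTERNS:
        for match in pattern.finditer(text_he):
            found.append({
                "kind": kind,
                "blocking": blocking,
                # Only the cross-siman pattern captures a target.
                "target": match.group(1) if match.groups() else None,
                "offset": match.start(),
            })
    return sorted(found, key=lambda dep: dep["offset"])
```

Only entries with `blocking` set would hold up encoding; the others are recorded as notes.
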
Phase B: Primary Source Extraction

Step 1: Fetch Primary Text

Retrieve both Hebrew and English versions:

# Fetch the Hebrew and English text together
text = mcp__sefaria_texts__get_text(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)

# Get all English translations for comparison
translations = mcp__sefaria_texts__get_english_translations(
    reference="Shulchan Arukh, Yoreh De'ah 87:1"
)

The skill captures:

  • Original Hebrew text
  • Primary English translation
  • Alternative translations (when available)

Step 2: Linguistic Analysis

For the Hebrew text, the skill identifies and documents:

Key Terms - Technical halachic vocabulary requiring precise understanding:

# Look up technical terms
definitions = mcp__sefaria_texts__search_in_dictionaries(
    query="נותן טעם"  # "noten taam" - taste transfer
)

Common terms requiring lookup:

Term | Transliteration | Meaning
בשר בחלב | basar bechalav | meat and milk
נותן טעם | noten taam | taste transfer
בטל בשישים | batel b'shishim | nullified in 60:1
לכתחילה | lechatchila | ideally/initially
בדיעבד | bedieved | after the fact

Ambiguous Phrases - Terms requiring commentary for clarification

Cross-References - Internal references to other sources

Step 3: Semantic Decomposition

The skill breaks the seif into atomic statements. Each statement represents:

  • A single normative claim (issur/heter), OR
  • A single definitional claim, OR
  • A single conditional relationship

Example Decomposition (YD 87:1):

statements:
  - id: s1
    type: ISSUR
    text_he: "בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל"
    text_en: "Meat of a kosher domesticated animal with milk of a kosher domesticated animal is forbidden to cook"
    conditions: []

  - id: s2
    type: ISSUR
    text_he: "ואסור לאכול"
    text_en: "and forbidden to eat"
    conditions: []

  - id: s3
    type: ISSUR
    text_he: "ואסור בהנאה"
    text_en: "and forbidden to derive benefit"
    conditions: []

  - id: s4
    type: DEFINITION
    text_he: "דאורייתא"
    text_en: "by Torah law"
    applies_to: [s1, s2, s3]

Statement Types

Type | Description | Encoding Approach
ISSUR | Prohibition | asserts(W, issur(action, M, level))
ISSUR_SAKANA | Health danger | asserts(W, sakana(M))
HETER | Permission | asserts(W, heter(action, M))
CHIYUV | Obligation | asserts(W, chiyuv(action, M, level))
DEFINITION | Category/classification | is_X(entity) predicate
CONDITION | Prerequisite | Part of rule body
EXCEPTION | Override | override/3 or NAF
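
A decomposition can be sanity-checked before it is written to an artifact. The sketch below assumes statements are plain dictionaries shaped like the YD 87:1 example (`id`, `type`, optional `applies_to`); `validate_statements` is a hypothetical helper, not a documented part of the skill.

```python
# Statement types listed in the table above.
STATEMENT_TYPES = {
    "ISSUR", "ISSUR_SAKANA", "HETER", "CHIYUV",
    "DEFINITION", "CONDITION", "EXCEPTION",
}


def validate_statements(statements: list[dict]) -> list[str]:
    """Return the problems found; an empty list means the set is consistent."""
    known_ids = {s["id"] for s in statements}
    seen = set()
    errors = []
    for s in statements:
        if s["id"] in seen:
            errors.append(f"duplicate id: {s['id']}")
        seen.add(s["id"])
        if s["type"] not in STATEMENT_TYPES:
            errors.append(f"{s['id']}: unknown type {s['type']}")
        # applies_to must point at statements that exist in the same seif.
        for target in s.get("applies_to", []):
            if target not in known_ids:
                errors.append(f"{s['id']}: applies_to references missing id {target}")
    return errors
```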

Phase C: Commentary Layer (Adaptive Depth)

Step 1: Discover All Commentaries

The skill fetches all linked commentaries:

links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    with_text="1"
)

Step 2: Tier-Based Fetching

Commentaries are fetched in tiers based on authority and relevance:

TIER 1 (Always Fetch)

These commentaries are always required:

Commentary | Author | Role
Shach (Siftei Kohen) | R. Shabsai HaKohen | Primary Ashkenazi commentary
Taz (Turei Zahav) | R. David HaLevi Segal | Primary Ashkenazi commentary
Rema gloss | R. Moshe Isserles | Ashkenazi rulings within text

# Fetch Shach
shach = mcp__sefaria_texts__get_text(
    reference="Shakh on Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)

# Fetch Taz
taz = mcp__sefaria_texts__get_text(
    reference="Turei Zahav on Shulchan Arukh, Yoreh De'ah 87:1",
    version_language="both"
)

TIER 2 (Standard - Fetch if TIER 1 is Insufficient)

Commentary | When Needed
Pri Megadim | Complex TIER 1 disagreement
Chochmat Adam | Practical application unclear
Aruch HaShulchan | Modern summary needed

TIER 3 (Extended - Fetch for Complex Machloket)

Commentary | When Needed
Pitchei Teshuva | Multiple minority opinions
Darchei Teshuva | Chasidic practice variations
Be'er Heitev | Additional sources needed

TIER 4 (Modern - Fetch for Practical Application)

Commentary | When Needed
Yalkut Yosef | Sefardi modern practice
Mishnah Berurah | Cross-reference for YD
Step 3: Classify Commentary Type

Each commentary passage is classified:

Type | Description | Example
CLARIFICATION | Explains without changing | "Beheima includes cow, sheep, and goat"
QUALIFICATION | Adds conditions | "Only applies when taste transfers"
EXTENSION | Applies to new cases | "Includes butter, not just milk"
DISPUTE | Disagrees with ruling | "Rema permits in Ashkenazi practice"
PRACTICAL | Guidance for practice | "The custom today is to be strict"

Step 4: Machloket Detection and Mapping

The skill identifies disputes at multiple levels:

Primary Machloket (Mechaber vs. Rema):

machloket:
  - id: m1
    topic: "dag_bechalav"
    layer: primary
    position_a:
      authority: mechaber
      ruling: sakana
      source: "BY YD 87"
      reasoning: "Medical sources indicate danger"
    position_b:
      authority: rema
      ruling: mutar
      source: "Rema gloss YD 87:3"
      reasoning: "No current medical basis"
    practical_difference: "Fish cooked in dairy dishes"

Secondary Machloket (Commentary level):

machloket:
  - id: m2
    topic: "bishul_achrei_bishul"
    layer: secondary
    position_a:
      authority: shach
      ruling: applies_only_to_liquid
      source: "Shach sk 5"
    position_b:
      authority: taz
      ruling: applies_to_all
      source: "Taz sk 3"
    practical_difference: "Reheating solid food"

Phase D: Derivation Chain (Recursive)

Step 1: Build Chain Backward

For each normative statement, the skill traces the source chain:

SA YD 87:1 → Tur YD 87 → Rambam Maachalot Asurot 9:1 →
Gemara Chullin 104b → Mishnah Chullin 8:1 → Torah Shemot 23:19

The skill uses recursive get_links_between_texts() calls:

# Get SA links
sa_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 87:1",
    with_text="1"
)

# Find Tur reference and get its links
tur_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Tur, Yoreh Deah 87",
    with_text="1"
)

# Continue recursively until reaching terminus

Step 2: Loop Detection

The skill tracks visited references to detect circular chains:

visited = set()

def build_chain(ref, chain, depth=0):
    if ref in visited:
        return chain, "loop_detected"
    if depth > 10:
        return chain, "max_depth_exceeded"

    visited.add(ref)
    links = get_links(ref)

    # Continue recursion...

If a loop is detected, it's documented and flagged for human review.
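
For readers who want to experiment with the loop check, here is a self-contained variant that runs against a toy link table. It is a sketch, not the skill's implementation: the `get_links` callable stands in for the real link lookup, the "follow the first link" choice and the `end_of_links` status are assumptions made for illustration.

```python
MAX_DEPTH = 10  # same depth limit as the snippet above


def build_chain(ref, get_links, visited=None, chain=None, depth=0):
    """Walk links backward from ref, stopping on a repeat or at MAX_DEPTH."""
    visited = set() if visited is None else visited
    chain = [] if chain is None else chain
    if ref in visited:
        return chain, "loop_detected"
    if depth > MAX_DEPTH:
        return chain, "max_depth_exceeded"

    visited.add(ref)
    chain.append(ref)
    earlier = get_links(ref)
    if not earlier:
        return chain, "end_of_links"
    # Assumption: follow only the first earlier source that is linked.
    return build_chain(earlier[0], get_links, visited, chain, depth + 1)


# Toy link table with a deliberate cycle: A -> B -> C -> A.
TOY_LINKS = {"A": ["B"], "B": ["C"], "C": ["A"]}
chain, status = build_chain("A", lambda ref: TOY_LINKS.get(ref, []))
```

With the toy table, the walk records A, B, and C once each and then reports `loop_detected` when it is sent back to A.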

Step 3: Chain Completeness Verification

A complete chain should reach one of:

Terminus Type | Description | Example
Torah verse | Biblical source | Shemot 23:19
Rabbinic institution | Takana d'rabanan | Chazal's enactment
Minhag source | Custom origin | "Minhag Ashkenaz"

Incomplete chains are flagged:

chain_status: incomplete
missing: "Cannot trace beyond Tur - earlier sources not linked"
action_required: manual_research
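
The completeness flag can be derived from the chain itself. The sketch below assumes chain entries are dictionaries carrying a `source` label and an optional `terminus: true` marker on the final entry, as in the machine-readable sources artifact; `chain_status` as a function and the `complete` value are assumptions made for this example.

```python
def chain_status(derivation_chain: list[dict]) -> dict:
    """Report whether the last chain entry is marked as a terminus."""
    if derivation_chain and derivation_chain[-1].get("terminus") is True:
        return {"chain_status": "complete"}
    # No terminus reached: name the last source we could trace to.
    last = derivation_chain[-1]["source"] if derivation_chain else "the starting reference"
    return {
        "chain_status": "incomplete",
        "missing": f"Cannot trace beyond {last}",
        "action_required": "manual_research",
    }
```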

Step 4: Assemble Makor Chain

The chain is formatted for ASP:

% Rule naming: r_{topic}_{specific} (topic-based, NOT siman-seif based)
% Example: r_bb_achiila for basar bechalav eating rule

makor(r_bb_achiila, sa("yd:87:1")).
makor(r_bb_achiila, tur("yd:87")).
makor(r_bb_achiila, rambam("maachalot:9:1")).
makor(r_bb_achiila, gemara("chullin:104b")).
makor(r_bb_achiila, mishnah("chullin:8:1")).
makor(r_bb_achiila, torah("shemot:23:19")).
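
Facts in this shape can be generated mechanically from an ordered chain. A minimal sketch, assuming the chain is available as (source kind, location) pairs; `makor_facts` is a hypothetical helper and not part of the documented toolchain.

```python
def makor_facts(rule_id: str, chain: list[tuple[str, str]]) -> list[str]:
    """Render one makor/2 fact per chain step, in chain order."""
    return [f'makor({rule_id}, {kind}("{location}")).' for kind, location in chain]


facts = makor_facts(
    "r_bb_achiila",
    [
        ("sa", "yd:87:1"),
        ("tur", "yd:87"),
        ("torah", "shemot:23:19"),
    ],
)
```

Each string in `facts` is a complete ASP fact, so the list can be joined with newlines and written directly to a rule file.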

Phase E: Semantic Enrichment

Step 1: Topic Lookup

The skill queries Sefaria's topic system:

topics = mcp__sefaria_texts__get_topic_details(
    topic_slug="meat-and-milk",
    with_links=True,
    with_refs=True
)

This provides:

  • Related topics for context
  • Cross-references to related texts
  • Modern scholarship links

Step 2: Semantic Search

For edge cases and extensions, the skill runs a semantic search:

related = mcp__sefaria_texts__english_semantic_search(
    query="prohibition of cooking meat and milk together Biblical commandment",
    filters={"document_categories": ["Halakhah"]}
)

Step 3: Cross-Siman Dependencies

Document references to other simanim:

cross_references:
  - target: "YD 89"
    type: forward
    topic: "waiting times after meat"
    note: "Seif 4 references waiting times"

  - target: "YD 87:3"
    type: internal
    topic: "fish and dairy"
    note: "Exception case within same siman"

Step 4: Shiur and Measurement Extraction

Identify quantities mentioned:

Shiur | Measurement | Usage in Seif
k'zayit | Olive-size | Minimum for violation
k'beitza | Egg-size | (not in this seif)
shishim | 60:1 ratio | Nullification threshold
noten taam | Taste transfer | Determines applicability

Step 5: Temporal and Contextual Conditions

Identify context modifiers:

Condition Type | Hebrew | English | Effect
Time-based | בשבת | on Shabbat | May change ruling
Situation | בדיעבד | after the fact | Lenient ruling
Situation | הפסד מרובה | significant loss | May allow leniency
Situation | שעת הדחק | pressing circumstances | Emergency leniency
Location | בארץ ישראל | in Eretz Yisrael | Regional variation

Phase F: Gap Analysis

Step 1: Ontology Coverage Check

The skill verifies predicates exist in the ontology:

predicates_check:
  found:
    - issur/3
    - heter/2
    - is_beheima_chalav_mixture/1
  missing:
    - name: "bishul_achrei_bishul"
      expected_arity: 1
      context: "reheating solid basar bechalav"
      action: "May need to add to sorts.lp"

Step 2: World Coverage Check

Verify worlds exist for all authorities:

worlds_check:
  existing:
    - base
    - mechaber
    - rema
    - shach
    - taz
  needed:
    - name: "yalkut_yosef"
      reason: "Modern Sefardi authority cited"
      action: "Consider adding to worlds/"
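
The world coverage check amounts to a set difference between the authorities cited in the corpus and the worlds already defined. The following sketch assumes both are available as collections of world names; `worlds_check` as a function is hypothetical, and the output keys simply mirror the YAML above.

```python
def worlds_check(existing: set[str], cited: set[str]) -> dict:
    """List cited authorities that have no world defined yet."""
    needed = sorted(set(cited) - set(existing))
    return {
        "existing": sorted(existing),
        "needed": [
            {"name": name, "action": "Consider adding to worlds/"}
            for name in needed
        ],
    }
```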

Step 3: Prerequisite Verification

Check dependent seifim are encoded:

dependencies:
  satisfied:
    - seif: "YD 87:1"
      status: "encoded"
  blocking:
    - seif: "YD 86:1"
      reason: "Defines treifa categories used here"
      action: "Encode 86:1 first, or proceed with caution"
  optional:
    - seif: "YD 89:1"
      reason: "Waiting times (forward reference)"
      action: "Proceed, encode later"

Step 4: Complexity Scoring

The skill assigns a complexity score (1-10):

complexity:
  score: 6
  factors:
    machloket_count: 2
    commentary_depth: 3  # Tiers needed
    chain_length: 5      # Steps to Torah
    cross_references: 3
    novel_predicates: 1
  breakdown:
    - "2 machloket (Mechaber/Rema, Shach/Taz)"
    - "Requires TIER 3 commentary"
    - "One new predicate needed"

Scoring Rubric:

Score | Complexity | Characteristics
1-3 | Simple | No machloket, straightforward chain, existing predicates
4-6 | Moderate | 1-2 machloket, complete chain, minor extensions
7-10 | Complex | Multiple machloket, incomplete chain, novel concepts
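
The rubric's score bands translate directly into a lookup. This sketch covers only the band labels; how the individual factors are weighted into the 1-10 score is not specified here, and `complexity_band` is a hypothetical helper name.

```python
def complexity_band(score: int) -> str:
    """Map a 1-10 complexity score to its rubric label."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 3:
        return "Simple"
    if score <= 6:
        return "Moderate"
    return "Complex"
```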

Phase G: Report Generation

The skill generates four artifacts in .mistaber-artifacts/:

Artifact 1: Human Review Report

corpus-report-YD-{siman}-{seif}.md

A comprehensive markdown document for human review containing:

# Corpus Report: YD 87:1 - Basar Bechalav D'Oraita

## Executive Summary
- **Reference:** Yoreh Deah 87:1
- **Topic:** Basic prohibition of meat and milk
- **Complexity:** 6/10 (Moderate)
- **Machloket:** 2 identified
- **Chain Status:** Complete to Torah

## Primary Text

### Hebrew
בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל ואסור לאכול ואסור בהנאה דאורייתא...

### English (Sefaria Translation)
The meat of a pure domesticated animal with the milk of a pure domesticated animal is forbidden to cook, forbidden to eat, and forbidden to derive benefit, by Torah law...

## Atomic Statements
[Decomposed statements as YAML]

## Derivation Chain
[Mermaid diagram]

## Machloket Summary
[Table of disputes]

## Commentary Analysis
[TIER 1-4 excerpts and classifications]

## Questions for Human Review
[Generated questions]

Artifact 2: Machine-Readable Sources

corpus-sources-YD-{siman}-{seif}.yaml

reference: "YD:87:1"
fetched_at: "2026-01-25T10:00:00Z"

primary:
  hebrew: "בשר בהמה טהורה..."
  english: "The meat of a pure domesticated animal..."
  translations:
    - version: "Sefaria Community Translation"
      text: "..."
    - version: "Artscroll"
      text: "..."

statements:
  - id: s1
    type: ISSUR
    text_he: "..."
    text_en: "..."
    predicates: [issur, bishul]

commentaries:
  shach:
    - location: "sk 1"
      type: CLARIFICATION
      text_he: "..."
      text_en: "..."
      affects_statements: [s1]

  taz:
    - location: "sk 1"
      type: QUALIFICATION
      text_he: "..."
      text_en: "..."
      affects_statements: [s1, s2]

derivation_chain:
  - level: 0
    source: "SA YD 87:1"
    ref: "Shulchan Arukh, Yoreh De'ah 87:1"
  - level: 1
    source: "Tur YD 87"
    ref: "Tur, Yoreh Deah 87"
  - level: 2
    source: "Rambam MA 9:1"
    ref: "Mishneh Torah, Forbidden Foods 9:1"
  - level: 3
    source: "Chullin 104b"
    ref: "Talmud Bavli Chullin 104b"
  - level: 4
    source: "Chullin 8:1"
    ref: "Mishnah Chullin 8:1"
  - level: 5
    source: "Shemot 23:19"
    ref: "Exodus 23:19"
    terminus: true

machloket:
  - id: m1
    topic: "dag_bechalav"
    position_a:
      authority: mechaber
      ruling: sakana
    position_b:
      authority: rema
      ruling: mutar

Artifact 3: Visual Derivation Chain

corpus-chain-YD-{siman}-{seif}.mermaid

graph TD
    SA["SA YD 87:1<br/>Shulchan Aruch"] --> TUR["Tur YD 87"]
    TUR --> RAMBAM["Rambam<br/>Maachalot Asurot 9:1"]
    RAMBAM --> GEM["Chullin 104b<br/>Talmud Bavli"]
    GEM --> MISH["Chullin 8:1<br/>Mishnah"]
    MISH --> TORAH["Shemot 23:19<br/>Torah"]

    SA --> SHACH["Shach sk 1"]
    SA --> TAZ["Taz sk 1"]

    style TORAH fill:#ffe6e6
    style SA fill:#e6f3ff
    style SHACH fill:#fff2e6
    style TAZ fill:#fff2e6

Artifact 4: Questions for Human

corpus-questions-YD-{siman}-{seif}.yaml

questions:
  - id: q1
    phase: B
    type: CLARIFICATION
    question: "Does 'beheima' include all domesticated animals or only cattle?"
    context: "Statement s1 mentions 'beheima tehora'"
    options:
      - "All domesticated kosher animals (cow, sheep, goat)"
      - "Only cattle specifically"
    default: 0
    source_hint: "Shach sk 1 may clarify"

  - id: q2
    phase: C
    type: MACHLOKET
    question: "Should we encode the Taz's lenient opinion on bedieved cases?"
    context: "Taz sk 3 permits bedieved when mixed unintentionally"
    options:
      - "Yes, encode as interpretation layer"
      - "No, follow strict Mechaber ruling only"
    default: 0
    implications:
      - "Option 0: More complete encoding, more complex"
      - "Option 1: Simpler encoding, may miss practical ruling"

  - id: q3
    phase: D
    type: CHAIN
    question: "The chain includes a disputed gemara interpretation. Use Rashi or Tosafot?"
    context: "Chullin 104b has competing interpretations"
    options:
      - "Follow Rashi (standard)"
      - "Follow Tosafot (stricter)"
    default: 0

Checkpoint Criteria

Before requesting human approval, verify:

  • [ ] Primary source text accurately fetched (Hebrew + English)
  • [ ] All TIER 1 commentaries fetched and classified
  • [ ] Derivation chain reaches authoritative source (Torah or explicit rabbinic)
  • [ ] All machloket identified with both positions documented
  • [ ] No unresolved blocking dependencies on other seifim
  • [ ] Questions formulated for any ambiguities
  • [ ] All four artifacts generated

Session State Update

After completing corpus preparation:

current_phase: corpus-prep
target_seif: "YD:87:1"
checkpoints:
  corpus-prep:
    status: pending_review
    artifacts:
      - .mistaber-artifacts/corpus-report-YD-87-1.md
      - .mistaber-artifacts/corpus-sources-YD-87-1.yaml
      - .mistaber-artifacts/corpus-chain-YD-87-1.mermaid
      - .mistaber-artifacts/corpus-questions-YD-87-1.yaml
    complexity_score: 6
    questions_count: 3
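
The four artifact paths follow a fixed naming pattern, so they can be derived from the siman and seif numbers. A small sketch, assuming the Yoreh De'ah prefix shown in the filenames above; `artifact_paths` is a hypothetical helper.

```python
from pathlib import PurePosixPath

ARTIFACT_DIR = PurePosixPath(".mistaber-artifacts")

# Filename templates as given in Phase G.
ARTIFACT_TEMPLATES = [
    "corpus-report-YD-{siman}-{seif}.md",
    "corpus-sources-YD-{siman}-{seif}.yaml",
    "corpus-chain-YD-{siman}-{seif}.mermaid",
    "corpus-questions-YD-{siman}-{seif}.yaml",
]


def artifact_paths(siman: int, seif: int) -> list[str]:
    """Build the four artifact paths for one seif."""
    return [
        str(ARTIFACT_DIR / template.format(siman=siman, seif=seif))
        for template in ARTIFACT_TEMPLATES
    ]
```

A checkpoint script could compare this list against the files on disk to confirm that all four artifacts were generated.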

Troubleshooting

"Reference not found in Sefaria"

The reference format may be incorrect. Try:

  1. Use full reference: "Shulchan Arukh, Yoreh De'ah 87:1" not "YD 87:1"
  2. Check spelling of siman name
  3. Use clarify_name_argument() to find correct format

"Cannot fetch commentary"

Some commentaries have limited coverage in Sefaria:

  1. Check if the commentary exists for this seif
  2. Try alternative commentary in same tier
  3. Note as "requires manual research" in questions

"Derivation chain incomplete"

Not all sources are linked in Sefaria:

  1. Document the gap in the report
  2. Add manual research note
  3. Proceed with available chain
  4. Human reviewer can provide missing links

"Too many machloket identified"

If the skill identifies 5+ machloket, the seif may be too complex:

  1. Consider breaking into smaller units
  2. Focus on primary machloket first
  3. Create follow-up tasks for secondary disputes

Best Practices

  1. Verify Hebrew text carefully - Translation errors can propagate through the entire encoding
  2. Read TIER 1 commentaries fully - Even sections that seem unrelated may affect encoding
  3. Document uncertainty - Better to flag questions than assume incorrectly
  4. Check cross-references - A seif may depend on rulings from another siman
  5. Score conservatively - Underestimating complexity leads to surprise difficulties

Next Phase

After Checkpoint 1 approval, proceed to HLL Encoding.