Corpus Preparation Skill

The corpus-prep skill is the first phase of the encoding pipeline, responsible for fetching source texts from Sefaria, building complete derivation chains, identifying machloket (disputes), and preparing structured artifacts for human review.

Overview

| Attribute | Value |
| --- | --- |
| Skill Name | corpus-prep |
| Phase | 1 of 5 |
| Checkpoint | 1: Source Review |
| Prerequisites | Sefaria MCP available, valid seif reference |
| Outputs | corpus-report.md, corpus-sources.yaml, corpus-chain.mermaid, corpus-questions.yaml |
| Trigger Phrases | "prepare corpus", "fetch sources", "start encoding seif", "begin encoding", "prepare sources for encoding", "build derivation chain", "get source texts from Sefaria" |

Invocation

Typical Invocation

User: "Prepare corpus for YD 87:3"

Agent: I'll prepare the corpus for Yoreh Deah 87:3. Let me:
1. Validate the reference with Sefaria
2. Fetch the primary text (Hebrew + English)
3. Fetch relevant commentaries
4. Build the derivation chain
5. Identify any machloket
...

Explicit Invocation

User: "Use mistaber:corpus-prep for Shulchan Arukh Yoreh Deah 88:1"

Phase A: Reference Resolution & Validation

Step 1: Validate Reference Format

The skill uses Sefaria's clarify_name_argument tool to validate and canonicalize the seif reference:

# Validate "Yoreh Deah 88:1"
result = mcp__sefaria_texts__clarify_name_argument(
    name="Shulchan Arukh, Yoreh De'ah 88:1",
    type_filter="ref"
)

Common Reference Formats:

| Input | Canonical Form |
| --- | --- |
| YD 87:3 | Shulchan Arukh, Yoreh De'ah 87:3 |
| Yoreh Deah 88:1 | Shulchan Arukh, Yoreh De'ah 88:1 |
| SA YD 89:2 | Shulchan Arukh, Yoreh De'ah 89:2 |
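
The shorthand forms listed above can be expanded locally before the reference is sent to Sefaria. The following is a minimal sketch, assuming only the three prefixes shown in the table; `canonicalize_ref` is a hypothetical helper, not part of the skill, and the authoritative validation remains the clarify_name_argument call.

```python
import re

CANONICAL_BOOK = "Shulchan Arukh, Yoreh De'ah"

# Shorthand prefixes taken from the table above; anything else is rejected.
_REF_PATTERN = re.compile(
    r"^(?:SA\s+)?(?:YD|Yoreh\s+Deah)\s+(\d+):(\d+)$", re.IGNORECASE
)


def canonicalize_ref(raw: str) -> str:
    """Expand a shorthand seif reference to its canonical form (hypothetical helper)."""
    match = _REF_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognized reference format: {raw!r}")
    siman, seif = match.groups()
    return f"{CANONICAL_BOOK} {siman}:{seif}"
```

Rejecting unknown formats with an error, rather than guessing, keeps ambiguous input visible so it can be passed to Sefaria or to the user for clarification.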

Step 2: Get Structural Context

The skill retrieves the siman's structure to place the target seif in context:

shape = mcp__sefaria_texts__get_text_or_category_shape(
    name="Shulchan Arukh, Yoreh De'ah 88"
)

Extracted Information:

  • Total seifim in siman
  • Position of target seif
  • Related seifim that may have dependencies

Step 3: Check Prior Encoding State

The skill reads .mistaber-session.yaml to determine:

  • Previously encoded seifim
  • Pending dependencies
  • Any active encoding session

Step 4: Identify Dependencies

Cross-references to other seifim are identified:

  • Forward references ("k'mo she'yitbaer l'kaman" - as will be explained below)
  • Backward references ("k'mo she'katavnu l'eil" - as we wrote above)

Phase B: Primary Source Extraction

Step 1: Fetch Primary Text

Both Hebrew and English versions are retrieved:

# Fetch the primary text in Hebrew and English
text = mcp__sefaria_texts__get_text(
    reference="Shulchan Arukh, Yoreh De'ah 88:1",
    version_language="both"
)

# Get all English translations for comparison
translations = mcp__sefaria_texts__get_english_translations(
    reference="Shulchan Arukh, Yoreh De'ah 88:1"
)

Step 2: Linguistic Analysis

For the Hebrew text, the skill identifies:

| Element | Example | Purpose |
| --- | --- | --- |
| Key terms | "basar", "chalav", "bishul" | Map to predicates |
| Ambiguous phrases | "taam k'ikar" | Flag for clarification |
| Cross-references | "siman 89" | Track dependencies |

Dictionary lookups are performed for technical terms:

definitions = mcp__sefaria_texts__search_in_dictionaries(
    query="נותן טעם"  # "noten taam"
)

Step 3: Semantic Decomposition

The seif is broken into atomic statements:

statements:
  - id: s1
    type: ISSUR
    text_he: "בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל"
    text_en: "Meat of a kosher domesticated animal with milk of a kosher domesticated animal is forbidden to cook"
    conditions: []
    world_context: base

  - id: s2
    type: ISSUR
    text_he: "ואסור לאכול"
    text_en: "and forbidden to eat"
    conditions: []
    world_context: base

  - id: s3
    type: MACHLOKET
    text_he: "דגים וחגבים אין בהם איסור בשר בחלב... מפני הסכנה"
    text_en: "Fish and locusts have no meat-and-milk prohibition... due to danger"
    conditions: []
    world_context: mechaber  # Disputed by Rema

Statement Type Classification:

| Type | Description | Example |
| --- | --- | --- |
| ISSUR | Prohibition | "forbidden to cook" |
| ISSUR_SAKANA | Health danger | "due to danger" |
| HETER | Permission | "permitted to eat" |
| CHIYUV | Obligation | "must wait six hours" |
| DEFINITION | Category definition | "fish is not basar" |
| CONDITION | Prerequisite | "if cooked together" |
| EXCEPTION | Exception to rule | "except bedieved" |
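
The statement types and the YAML fields shown earlier can be modeled as a small typed structure so that unknown types are caught early. This is a sketch under assumptions: the class and function names are hypothetical, and only the field names and type labels come from the examples above.

```python
from dataclasses import dataclass, field
from enum import Enum


class StatementType(Enum):
    """Type labels and descriptions copied from the classification table."""
    ISSUR = "Prohibition"
    ISSUR_SAKANA = "Health danger"
    HETER = "Permission"
    CHIYUV = "Obligation"
    DEFINITION = "Category definition"
    CONDITION = "Prerequisite"
    EXCEPTION = "Exception to rule"


@dataclass
class Statement:
    id: str
    type: StatementType
    text_he: str
    text_en: str
    conditions: list = field(default_factory=list)
    world_context: str = "base"


def parse_statement(raw: dict) -> Statement:
    """Build a Statement from a YAML-style dict, rejecting unlisted types."""
    type_name = raw["type"]
    if type_name not in StatementType.__members__:
        raise ValueError(f"Unknown statement type: {type_name!r}")
    return Statement(
        id=raw["id"],
        type=StatementType[type_name],
        text_he=raw["text_he"],
        text_en=raw["text_en"],
        conditions=list(raw.get("conditions", [])),
        world_context=raw.get("world_context", "base"),
    )
```

Note that the sample statement s3 uses the type MACHLOKET, which the table does not list. This sketch would reject it, so a matching member would be needed if that type is intended.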

Phase C: Commentary Layer (Adaptive Depth)

Tier-Based Fetching Strategy

The skill fetches commentaries in tiers, proceeding to deeper tiers only when necessary:

graph TD
    T1["Tier 1: Always Fetch"] --> D1{Sufficient?}
    D1 -->|No| T2["Tier 2: Standard"]
    D1 -->|Yes| DONE[Complete]
    T2 --> D2{Sufficient?}
    D2 -->|No| T3["Tier 3: Extended"]
    D2 -->|Yes| DONE
    T3 --> D3{Sufficient?}
    D3 -->|No| T4["Tier 4: Modern"]
    D3 -->|Yes| DONE
    T4 --> DONE

Tier Definitions

Tier 1 (Always Fetch):

| Commentary | Abbreviation | Tradition |
| --- | --- | --- |
| Shach (Siftei Kohen) | sk | Ashkenazi |
| Taz (Turei Zahav) | - | Ashkenazi |
| Rema | - | Ashkenazi gloss |

Tier 2 (Standard - fetch if Tier 1 insufficient):

| Commentary | Abbreviation | Notes |
| --- | --- | --- |
| Pri Megadim | PM | Commentary on Shach/Taz |
| Chochmat Adam | CA | Practical summary |
| Aruch HaShulchan | AH | Modern codification |

Tier 3 (Extended - for complex machloket):

| Commentary | Focus |
| --- | --- |
| Pitchei Teshuva | Acharonim summary |
| Darchei Teshuva | Additional opinions |
| Be'er Heitev | Abbreviated summary |

Tier 4 (Modern - practical application):

| Commentary | Community |
| --- | --- |
| Yalkut Yosef | Sefardi |
| Mishnah Berurah | Ashkenazi (OC) |
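
The tiered strategy amounts to a loop that stops at the first tier whose results are judged sufficient. The sketch below assumes two caller-supplied callbacks, `fetch` and `is_sufficient`, which are hypothetical. The document does not specify how sufficiency is decided, so that test is left entirely to the caller.

```python
from typing import Callable

# Tier membership copied from the tier definitions above.
TIERS = [
    ["Shach", "Taz", "Rema"],
    ["Pri Megadim", "Chochmat Adam", "Aruch HaShulchan"],
    ["Pitchei Teshuva", "Darchei Teshuva", "Be'er Heitev"],
    ["Yalkut Yosef", "Mishnah Berurah"],
]


def fetch_adaptive(
    fetch: Callable[[str], str],
    is_sufficient: Callable[[dict], bool],
) -> tuple[dict, int]:
    """Fetch tier by tier, stopping once the collected texts suffice.

    Returns the fetched texts and the number of tiers that were used.
    """
    collected: dict = {}
    for tier_number, names in enumerate(TIERS, start=1):
        for name in names:
            collected[name] = fetch(name)
        if is_sufficient(collected):
            return collected, tier_number
    # Tier 4 is the last tier, so the loop ends there regardless.
    return collected, len(TIERS)
```

The returned tier count corresponds to the commentary_depth and commentary_tiers_fetched values recorded later in the pipeline.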

Commentary Classification

Each commentary passage is fetched and then assigned one of the classification types listed below:

# Fetch Shach
shach = mcp__sefaria_texts__get_text(
    reference="Shakh on Shulchan Arukh, Yoreh De'ah 88:1",
    version_language="both"
)

Classification Types:

| Type | Description | Encoding Impact |
| --- | --- | --- |
| CLARIFICATION | Explains without changing | Comments only |
| QUALIFICATION | Adds conditions | New rule conditions |
| EXTENSION | Applies to new cases | New rules |
| DISPUTE | Disagrees with base | Override/machloket |
| PRACTICAL | Practical guidance | May affect world scoping |

Machloket Detection

Disputes are identified between:

| Dispute Type | Example |
| --- | --- |
| Mechaber vs. Rema | Fish + dairy: sakana vs. mutar |
| Shach vs. Taz | Interpretation of SA ruling |
| Ashkenazi vs. Sefardi | Practice differences |
| Earlier vs. Later | Rishonim vs. Acharonim |

Machloket Documentation:

machloket:
  - id: m1
    topic: "dag_bechalav"
    position_a:
      authority: mechaber
      ruling: sakana
      source: "BY YD 87"
    position_b:
      authority: rema
      ruling: mutar
      source: "Rema gloss YD 87:3"
    practical_difference: "Fish cooked in dairy (e.g., lox with cream cheese)"
    worlds_affected:
      - mechaber  # Position A
      - rema      # Position B

Phase D: Derivation Chain Building

Recursive Chain Building

The skill traces sources backward from the Shulchan Arukh (SA) to an authoritative terminus:

SA YD 88:1 → Tur YD 88 → Rambam MA 9:1 → Gemara Chullin 104b → Mishnah Chullin 8:1 → Torah Shemot 23:19

Using Sefaria Links:

# Get SA links
sa_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 88:1",
    with_text="1"
)

# Follow chain to Tur
tur_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Tur, Yoreh Deah 88",
    with_text="1"
)

# Continue until reaching terminus

Chain Completeness

A complete chain reaches one of:

| Terminus Type | Example | For |
| --- | --- | --- |
| Torah verse | Shemot 23:19 | d_oraita rules |
| Mishnah/Gemara | Chullin 8:1 | d_rabanan rules |
| Takana | Chazal decree | Rabbinic institutions |
| Minhag source | Community practice | Minhag rules |
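
Once each link in a chain carries a source kind, completeness can be checked mechanically. A minimal sketch, assuming each link is a dict with a hypothetical `kind` field; neither the field nor the labels below are defined by the skill, they only mirror the terminus types above.

```python
# Hypothetical kind labels mirroring the terminus types above.
TERMINUS_KINDS = {"torah", "mishnah", "gemara", "takana", "minhag"}


def is_complete(chain: list[dict]) -> bool:
    """Return True when the last link of the chain is an authoritative terminus."""
    return bool(chain) and chain[-1]["kind"] in TERMINUS_KINDS
```

An empty chain is treated as incomplete, so a seif with no links at all is flagged for manual research rather than silently passing.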

Loop Detection

The skill tracks visited references to detect circular chains:

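visited = set()

def trace_chain(ref, depth=0):
    """Follow source links from ref, stopping if a reference repeats."""
    if ref in visited:
        return {"loop_detected": True, "ref": ref}
    visited.add(ref)
    links = get_links(ref)  # helper wrapping the Sefaria links call (not shown here)
    # Recurse into each linked source, one level deeper
    return {
        "ref": ref,
        "depth": depth,
        "sources": [trace_chain(link, depth + 1) for link in links],
    }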

Makor Chain Format

The final chain is emitted as ASP (Answer Set Programming) facts:

% Rule naming: r_{topic}_{specific} (NOT siman-seif based)
% Example: r_bb_achiila for basar bechalav eating rule
makor(r_bb_achiila, sa("yd:88:1")).
makor(r_bb_achiila, tur("yd:88")).
makor(r_bb_achiila, rambam("maachalot:9:1")).
makor(r_bb_achiila, gemara("chullin:104b")).
makor(r_bb_achiila, torah("shemot:23:19")).
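
Facts in this shape can be rendered directly from an ordered chain. The sketch below assumes the chain is available as (source kind, locator) pairs; the `makor_facts` helper is hypothetical and only reproduces the syntax shown above.

```python
def makor_facts(rule_id: str, chain: list[tuple[str, str]]) -> list[str]:
    """Render one makor/2 fact per (source_kind, locator) pair, in chain order."""
    return [f'makor({rule_id}, {kind}("{locator}")).' for kind, locator in chain]


if __name__ == "__main__":
    chain = [("sa", "yd:88:1"), ("tur", "yd:88"), ("torah", "shemot:23:19")]
    for fact in makor_facts("r_bb_achiila", chain):
        print(fact)
```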

Phase E: Semantic Enrichment

Topic Lookup

Related topics are retrieved from Sefaria:

topics = mcp__sefaria_texts__get_topic_details(
    topic_slug="meat-and-milk",
    with_links=True,
    with_refs=True
)

Related material is discovered through semantic search:

related = mcp__sefaria_texts__english_semantic_search(
    query="prohibition of cooking meat and milk together Biblical commandment",
    filters={"document_categories": ["Halakhah"]}
)

Shiur and Measurement Extraction

Quantities mentioned are identified:

| Hebrew | Transliteration | Meaning |
| --- | --- | --- |
| כזית | k'zayit | olive-sized portion |
| כביצה | k'beitza | egg-sized portion |
| שישים | shishim | 60:1 ratio |
| נותן טעם | noten taam | taste transfer |

Temporal and Contextual Conditions

Context modifiers are identified:

| Type | Examples | Encoding Impact |
| --- | --- | --- |
| Time-based | "on Shabbat", "during Pesach" | Additional conditions |
| Situation-based | "bedieved", "hefsed merubeh" | World-specific rules |
| Location-based | "in Eretz Yisrael" | Geographic conditions |

Phase F: Gap Analysis

Ontology Coverage Check

The skill verifies predicates exist in the ontology:

predicates_check:
  existing:
    - issur/3: "Prohibition predicate"
    - heter/2: "Permission predicate"
    - sakana/1: "Danger predicate"
  missing:
    - hachshara/2: "Vessel preparation - needs to be added"
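
At its core the coverage check is a set comparison between the predicate signatures a seif needs and those the ontology already defines. A minimal sketch, assuming both are available as sets of `name/arity` strings; how the skill actually loads the ontology is not described here, and `check_predicates` is a hypothetical name.

```python
def check_predicates(required: set[str], ontology: set[str]) -> dict:
    """Split required predicate signatures into existing and missing."""
    return {
        "existing": sorted(required & ontology),
        "missing": sorted(required - ontology),
    }


if __name__ == "__main__":
    ontology = {"issur/3", "heter/2", "sakana/1"}
    required = {"issur/3", "sakana/1", "hachshara/2"}
    print(check_predicates(required, ontology))
```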

World Coverage Check

Required worlds are verified:

worlds_check:
  existing:
    - base: "Universal rulings"
    - mechaber: "Sefardi positions"
    - rema: "Ashkenazi positions"
  missing: []  # None in this case

Prerequisite Verification

Dependencies on other seifim are checked:

prerequisites:
  - seif: "YD:87:1"
    status: encoded
    notes: "Base BB prohibition"
  - seif: "YD:87:2"
    status: not_encoded
    notes: "Species definitions - may need to encode first"
    blocking: false
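
Only prerequisites that are both unencoded and marked as blocking should hold up the checkpoint. The sketch below filters a list shaped like the YAML above. Treating a missing `blocking` key as false is an assumption, since the first entry in the example omits it.

```python
def blocking_prerequisites(prerequisites: list[dict]) -> list[str]:
    """Return seifim that are not yet encoded and are marked as blocking."""
    return [
        p["seif"]
        for p in prerequisites
        # Assumption: an absent "blocking" key means non-blocking.
        if p["status"] != "encoded" and p.get("blocking", False)
    ]
```

With the example data above the result is empty, because YD:87:2 is unencoded but explicitly non-blocking.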

Complexity Scoring

A complexity score (1-10) is assigned:

complexity:
  score: 6
  factors:
    machloket_count: 2
    commentary_depth: 3  # Tiers needed
    chain_length: 5
    cross_references: 3
    novel_predicates: 1

# Scoring rubric:
# 1-3: Simple, no machloket, straightforward chain
# 4-6: Moderate, some machloket, complete chain
# 7-10: Complex, multiple machloket, incomplete chain or novel concepts
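
The rubric bands can be applied to a score with a simple lookup. This sketch covers only the banding; the document does not give a formula for deriving the score from the factors, so none is assumed here, and the band labels are hypothetical shorthand for the rubric text.

```python
def complexity_band(score: int) -> str:
    """Map a 1-10 complexity score to its rubric band."""
    if not 1 <= score <= 10:
        raise ValueError(f"Score out of range 1-10: {score}")
    if score <= 3:
        return "simple"
    if score <= 6:
        return "moderate"
    return "complex"
```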

Phase G: Artifact Generation

Artifact 1: Human Review Report

corpus-report-YD-{siman}-{seif}.md

The skill uses the template at templates/corpus-report.md to generate a human-readable report that includes:

  • Executive summary with metrics
  • Primary source text (Hebrew + English)
  • Atomic statements table
  • Commentary excerpts by tier
  • Derivation chain diagram
  • Machloket documentation
  • Gap analysis results
  • Questions for human review
  • Checkpoint criteria checklist

Artifact 2: Machine-Readable Sources

corpus-sources-YD-{siman}-{seif}.yaml

reference: "YD:88:1"
fetched_at: "2026-01-25T10:00:00Z"
primary:
  hebrew: "בשר בהמה טהורה בחלב..."
  english: "Meat of a kosher domesticated animal..."
  translations:
    - version: "Sefaria Community Translation"
      text: "..."
statements:
  - id: s1
    type: ISSUR
    text_he: "..."
    text_en: "..."
    conditions: []
    world_context: base
commentaries:
  shach:
    - location: "sk 1"
      type: CLARIFICATION
      text_he: "..."
      text_en: "..."
  taz:
    - location: "sk 1"
      type: QUALIFICATION
      text_he: "..."
      text_en: "..."
derivation_chain:
  - level: 0
    source: "SA YD 88:1"
    ref: "Shulchan Arukh, Yoreh De'ah 88:1"
  - level: 1
    source: "Tur YD 88"
    ref: "Tur, Yoreh Deah 88"
  # ... continues to terminus
machloket:
  - id: m1
    topic: dag_bechalav
    position_a:
      authority: mechaber
      ruling: sakana
    position_b:
      authority: rema
      ruling: mutar

Artifact 3: Visual Derivation Chain

corpus-chain-YD-{siman}-{seif}.mermaid

graph TD
    SA["SA YD 88:1<br/>Primary Source"] --> TUR["Tur YD 88"]
    TUR --> RAMBAM["Rambam MA 9:1"]
    RAMBAM --> GEM["Gemara Chullin 104b"]
    GEM --> MISH["Mishnah Chullin 8:1"]
    MISH --> TORAH["Torah: Shemot 23:19<br/>Lo tevashel gedi..."]

    style SA fill:#90EE90
    style TORAH fill:#FFD700

Artifact 4: Questions for Human

corpus-questions-YD-{siman}-{seif}.yaml

questions:
  - id: q1
    phase: B
    type: CLARIFICATION
    question: "Does 'beheima' include all domesticated animals or only cattle?"
    context: "Statement s1 mentions 'beheima tehora'"
    options:
      - "All domesticated kosher animals (cow, sheep, goat)"
      - "Only cattle specifically"
    default: 0
    impact: "Affects predicate scope"

  - id: q2
    phase: C
    type: MACHLOKET
    question: "Should we encode the Taz's lenient opinion on bedieved cases?"
    context: "Taz sk 3 permits bedieved in certain cases"
    options:
      - "Yes, encode as world-specific rule"
      - "No, follow strict Mechaber"
    default: 0
    impact: "May require additional taz world"

Checkpoint Criteria

Before requesting human approval, the skill verifies:

  • [ ] Primary source text accurately fetched (Hebrew + English)
  • [ ] All Tier 1 commentaries fetched and classified
  • [ ] Derivation chain reaches authoritative source
  • [ ] All machloket identified with both positions documented
  • [ ] No unresolved blocking dependencies on other seifim
  • [ ] Questions formulated for any ambiguities
  • [ ] All four artifacts generated
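
The checklist can be evaluated as a set of named boolean results, with anything absent or false reported back. In this sketch the criterion keys are hypothetical names, one per checklist item above; the skill's real identifiers are not given in this document.

```python
# Hypothetical keys, one per checklist item above.
CHECKPOINT_CRITERIA = [
    "primary_text_fetched",
    "tier1_commentaries_classified",
    "chain_reaches_terminus",
    "machloket_documented",
    "no_blocking_dependencies",
    "questions_formulated",
    "artifacts_generated",
]


def unmet_criteria(results: dict[str, bool]) -> list[str]:
    """List criteria that are absent or False; an empty list means ready for review."""
    return [name for name in CHECKPOINT_CRITERIA if not results.get(name, False)]
```

Treating a missing result as unmet is the conservative choice, because a criterion that was never evaluated should not pass by default.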

Session State Update

After completing corpus preparation:

current_phase: corpus-prep
target_seif: "YD:88:1"
checkpoints:
  corpus-prep:
    status: pending_review
    artifacts:
      - .mistaber-artifacts/corpus-report-YD-88-1.md
      - .mistaber-artifacts/corpus-sources-YD-88-1.yaml
      - .mistaber-artifacts/corpus-chain-YD-88-1.mermaid
      - .mistaber-artifacts/corpus-questions-YD-88-1.yaml
    complexity_score: 6
    statements_count: 5
    machloket_count: 1
    commentary_tiers_fetched: 2
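
The four artifact paths follow directly from the target seif and the naming patterns given in Phase G. A minimal sketch, assuming the `YD:{siman}:{seif}` form used in target_seif; the `artifact_paths` helper is hypothetical.

```python
ARTIFACT_DIR = ".mistaber-artifacts"

# Naming patterns copied from the artifact headings in Phase G.
ARTIFACT_PATTERNS = [
    "corpus-report-YD-{siman}-{seif}.md",
    "corpus-sources-YD-{siman}-{seif}.yaml",
    "corpus-chain-YD-{siman}-{seif}.mermaid",
    "corpus-questions-YD-{siman}-{seif}.yaml",
]


def artifact_paths(target_seif: str) -> list[str]:
    """Expand the four artifact paths for a target such as "YD:88:1"."""
    section, siman, seif = target_seif.split(":")
    if section != "YD":
        raise ValueError(f"Only YD references are covered by these patterns: {target_seif!r}")
    return [
        f"{ARTIFACT_DIR}/{pattern.format(siman=siman, seif=seif)}"
        for pattern in ARTIFACT_PATTERNS
    ]
```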

Common Issues

Reference Not Found

Symptom: Sefaria returns no results for the reference.

Solutions:

  1. Validate reference format using clarify_name_argument
  2. Try alternative spellings (Arukh vs Aruch)
  3. Check if text exists in Sefaria library

Incomplete Derivation Chain

Symptom: The chain doesn't reach an authoritative terminus.

Solutions:

  1. Mark as requiring manual research
  2. Check if rule is d'rabanan (may terminate at Mishnah)
  3. Flag for human review in questions

Commentary Not Available

Symptom: Specific commentary not in Sefaria.

Solutions:

  1. Note as gap in corpus report
  2. Proceed with available commentaries
  3. Flag for human to provide source if critical

Machloket Identification Ambiguity

Symptom: Unclear if positions truly disagree.

Solutions:

  1. Generate clarification question
  2. Default to conservative interpretation (encode as machloket)
  3. Request human guidance