Corpus Preparation Skill¶

The corpus-prep skill is the first phase of the encoding pipeline, responsible for fetching source texts from Sefaria, building complete derivation chains, identifying machloket (disputes), and preparing structured artifacts for human review.

Overview¶

Attribute	Value
Skill Name	`corpus-prep`
Phase	1 of 5
Checkpoint	1: Source Review
Prerequisites	Sefaria MCP available, valid seif reference
Outputs	corpus-report.md, corpus-sources.yaml, corpus-chain.mermaid, corpus-questions.yaml
Trigger Phrases	"prepare corpus", "fetch sources", "start encoding seif", "begin encoding", "prepare sources for encoding", "build derivation chain", "get source texts from Sefaria"

Invocation¶

Typical Invocation¶

User: "Prepare corpus for YD 87:3"

Agent: I'll prepare the corpus for Yoreh Deah 87:3. Let me:
1. Validate the reference with Sefaria
2. Fetch the primary text (Hebrew + English)
3. Fetch relevant commentaries
4. Build the derivation chain
5. Identify any machloket
...

Explicit Invocation¶

User: "Use mistaber:corpus-prep for Shulchan Arukh Yoreh Deah 88:1"

Phase A: Reference Resolution & Validation¶

Step 1: Validate Reference Format¶

The skill uses Sefaria's clarify_name_argument tool to validate and canonicalize the seif reference:

# Validate "Yoreh Deah 88:1"
result = mcp__sefaria_texts__clarify_name_argument(
    name="Shulchan Arukh, Yoreh De'ah 88:1",
    type_filter="ref"
)

Common Reference Formats:

Input	Canonical Form
YD 87:3	Shulchan Arukh, Yoreh De'ah 87:3
Yoreh Deah 88:1	Shulchan Arukh, Yoreh De'ah 88:1
SA YD 89:2	Shulchan Arukh, Yoreh De'ah 89:2
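
The shorthand forms in the table can be expanded locally before the Sefaria call is made. The following is a minimal sketch that assumes only the three input shapes shown above; `canonicalize_ref` and its regular expression are hypothetical helpers, not part of the skill or of the Sefaria tools, and `clarify_name_argument` remains the authoritative check.

```python
import re

# Hypothetical local pre-normalizer; covers only the input shapes in the table above.
_REF_PATTERN = re.compile(
    r"^(?:SA\s+)?(?:YD|Yoreh\s+De'?ah)\s+(\d+):(\d+)$",
    re.IGNORECASE,
)


def canonicalize_ref(raw: str) -> str:
    """Expand a shorthand Yoreh De'ah reference to the canonical form."""
    match = _REF_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognized reference format: {raw!r}")
    siman, seif = match.groups()
    return f"Shulchan Arukh, Yoreh De'ah {siman}:{seif}"
```

Raising on unrecognized input, rather than guessing, leaves anything unusual to the Sefaria validation step.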

Step 2: Get Structural Context¶

The skill retrieves the siman's structure to understand context:

shape = mcp__sefaria_texts__get_text_or_category_shape(
    name="Shulchan Arukh, Yoreh De'ah 88"
)

Extracted Information:

Total seifim in siman
Position of target seif
Related seifim that may have dependencies

Step 3: Check Prior Encoding State¶

The skill reads .mistaber-session.yaml to determine:

Previously encoded seifim
Pending dependencies
Any active encoding session

Step 4: Identify Dependencies¶

Cross-references to other seifim are identified:

Forward references ("k'mo she'yitbaer l'kaman" - as will be explained below)
Backward references ("k'mo she'katavnu l'eil" - as we wrote above)
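
The direction of a cross-reference can be approximated by scanning for the directional word in each phrase. This is a rough sketch, assuming the Hebrew spellings "לקמן" (l'kaman, below) and "לעיל" (l'eil, above); `classify_cross_references` is a hypothetical helper, and the skill's actual detection logic may be more involved.

```python
# Assumed Hebrew markers for the two phrases above.
FORWARD_MARKER = "לקמן"   # "below": points to a later seif
BACKWARD_MARKER = "לעיל"  # "above": points to an earlier seif


def classify_cross_references(text_he: str) -> dict:
    """Report whether the text contains forward or backward reference markers."""
    return {
        "forward": FORWARD_MARKER in text_he,
        "backward": BACKWARD_MARKER in text_he,
    }
```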

Phase B: Primary Source Extraction¶

Step 1: Fetch Primary Text¶

Both Hebrew and English versions are retrieved:

# Fetch the primary text in both Hebrew and English
text = mcp__sefaria_texts__get_text(
    reference="Shulchan Arukh, Yoreh De'ah 88:1",
    version_language="both"
)

# Get all English translations for comparison
translations = mcp__sefaria_texts__get_english_translations(
    reference="Shulchan Arukh, Yoreh De'ah 88:1"
)

Step 2: Linguistic Analysis¶

For the Hebrew text, the skill identifies:

Element	Example	Purpose
Key terms	"basar", "chalav", "bishul"	Map to predicates
Ambiguous phrases	"taam k'ikar"	Flag for clarification
Cross-references	"siman 89"	Track dependencies

Dictionary lookups are performed for technical terms:

definitions = mcp__sefaria_texts__search_in_dictionaries(
    query="נותן טעם"  # "noten taam"
)

Step 3: Semantic Decomposition¶

The seif is broken into atomic statements:

statements:
  - id: s1
    type: ISSUR
    text_he: "בשר בהמה טהורה בחלב בהמה טהורה אסור לבשל"
    text_en: "Meat of a kosher domesticated animal with milk of a kosher domesticated animal is forbidden to cook"
    conditions: []
    world_context: base

  - id: s2
    type: ISSUR
    text_he: "ואסור לאכול"
    text_en: "and forbidden to eat"
    conditions: []
    world_context: base

  - id: s3
    type: MACHLOKET
    text_he: "דגים וחגבים אין בהם איסור בשר בחלב... מפני הסכנה"
    text_en: "Fish and locusts have no meat-and-milk prohibition... due to danger"
    conditions: []
    world_context: mechaber  # Disputed by Rema

Statement Type Classification:

Type	Description	Example
ISSUR	Prohibition	"forbidden to cook"
ISSUR_SAKANA	Health danger	"due to danger"
HETER	Permission	"permitted to eat"
CHIYUV	Obligation	"must wait six hours"
DEFINITION	Category definition	"fish is not basar"
CONDITION	Prerequisite	"if cooked together"
EXCEPTION	Exception to rule	"except bedieved"
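
One way to keep decomposed statements consistent with this table is to validate the `type` field against a fixed enumeration. The sketch below is illustrative only: the `StatementType` and `Statement` classes, the `parse_statement` helper, and the default `world_context` of "base" are assumptions, not the skill's actual data model. Types outside the table are rejected.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class StatementType(Enum):
    """Statement types from the classification table, keyed by name."""
    ISSUR = "Prohibition"
    ISSUR_SAKANA = "Health danger"
    HETER = "Permission"
    CHIYUV = "Obligation"
    DEFINITION = "Category definition"
    CONDITION = "Prerequisite"
    EXCEPTION = "Exception to rule"


@dataclass
class Statement:
    id: str
    type: StatementType
    text_he: str
    text_en: str
    conditions: List[str] = field(default_factory=list)
    world_context: str = "base"  # assumed default


def parse_statement(raw: dict) -> Statement:
    """Build a Statement from one decomposed entry, rejecting unknown types."""
    type_name = raw["type"]
    if type_name not in StatementType.__members__:
        raise ValueError(f"Unknown statement type: {type_name!r}")
    return Statement(
        id=raw["id"],
        type=StatementType[type_name],
        text_he=raw["text_he"],
        text_en=raw["text_en"],
        conditions=list(raw.get("conditions", [])),
        world_context=raw.get("world_context", "base"),
    )
```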

Phase C: Commentary Layer (Adaptive Depth)¶

Tier-Based Fetching Strategy¶

The skill fetches commentaries in tiers, proceeding to deeper tiers only when necessary:

graph TD
    T1["Tier 1: Always Fetch"] --> D1{Sufficient?}
    D1 -->|No| T2["Tier 2: Standard"]
    D1 -->|Yes| DONE[Complete]
    T2 --> D2{Sufficient?}
    D2 -->|No| T3["Tier 3: Extended"]
    D2 -->|Yes| DONE
    T3 --> D3{Sufficient?}
    D3 -->|No| T4["Tier 4: Modern"]
    D3 -->|Yes| DONE
    T4 --> DONE
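
The escalation in the flowchart amounts to a loop that stops at the first tier judged sufficient. The sketch below assumes two caller-supplied functions, `fetch_tier` and `is_sufficient`; both names are hypothetical, and the criteria for sufficiency are not specified here.

```python
from typing import Callable, Dict, List

TIERS = [1, 2, 3, 4]  # Always Fetch, Standard, Extended, Modern


def fetch_adaptive(
    fetch_tier: Callable[[int], List[str]],
    is_sufficient: Callable[[Dict[int, List[str]]], bool],
) -> Dict[int, List[str]]:
    """Fetch tiers in order, stopping once the collected material is sufficient."""
    collected: Dict[int, List[str]] = {}
    for tier in TIERS:
        collected[tier] = fetch_tier(tier)
        # Tier 4 is the last stop in the flowchart, so no check follows it.
        if tier < TIERS[-1] and is_sufficient(collected):
            break
    return collected
```

Passing the whole `collected` mapping to `is_sufficient` lets the decision depend on everything fetched so far rather than on the latest tier alone.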

Tier Definitions¶

Tier 1 (Always Fetch):

Commentary	Abbreviation	Tradition
Shach (Siftei Kohen)	sk	Ashkenazi
Taz (Turei Zahav)	-	Ashkenazi
Rema	-	Ashkenazi gloss

Tier 2 (Standard - fetch if Tier 1 insufficient):

Commentary	Abbreviation	Notes
Pri Megadim	PM	Commentary on Shach/Taz
Chochmat Adam	CA	Practical summary
Aruch HaShulchan	AH	Modern codification

Tier 3 (Extended - for complex machloket):

Commentary	Focus
Pitchei Teshuva	Acharonim summary
Darchei Teshuva	Additional opinions
Be'er Heitev	Abbreviated summary

Tier 4 (Modern - practical application):

Commentary	Community
Yalkut Yosef	Sefardi
Mishnah Berurah	Ashkenazi (OC)

Commentary Classification¶

Each commentary passage is fetched and then classified:

# Fetch Shach
shach = mcp__sefaria_texts__get_text(
    reference="Shakh on Shulchan Arukh, Yoreh De'ah 88:1",
    version_language="both"
)

Classification Types:

Type	Description	Encoding Impact
CLARIFICATION	Explains without changing	Comments only
QUALIFICATION	Adds conditions	New rule conditions
EXTENSION	Applies to new cases	New rules
DISPUTE	Disagrees with base	Override/machloket
PRACTICAL	Practical guidance	May affect world scoping

Machloket Detection¶

Disputes are identified between:

Dispute Type	Example
Mechaber vs. Rema	Fish + dairy: sakana vs. mutar
Shach vs. Taz	Interpretation of SA ruling
Ashkenazi vs. Sefardi	Practice differences
Earlier vs. Later	Rishonim vs. Acharonim

Machloket Documentation:

machloket:
  - id: m1
    topic: "dag_bechalav"
    position_a:
      authority: mechaber
      ruling: sakana
      source: "BY YD 87"
    position_b:
      authority: rema
      ruling: mutar
      source: "Rema gloss YD 87:3"
    practical_difference: "Fish cooked in dairy (e.g., lox with cream cheese)"
    worlds_affected:
      - mechaber  # Position A
      - rema      # Position B

Phase D: Derivation Chain Building¶

Recursive Chain Building¶

The skill traces sources backward from the Shulchan Arukh (SA) until it reaches an authoritative terminus:

SA YD 88:1 → Tur YD 88 → Rambam MA 9:1 → Gemara Chullin 104b → Mishnah Chullin 8:1 → Torah Shemot 23:19

Using Sefaria Links:

# Get SA links
sa_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Shulchan Arukh, Yoreh De'ah 88:1",
    with_text="1"
)

# Follow chain to Tur
tur_links = mcp__sefaria_texts__get_links_between_texts(
    reference="Tur, Yoreh Deah 88",
    with_text="1"
)

# Continue until reaching terminus

Chain Completeness¶

A complete chain reaches one of:

Terminus Type	Example	Applies To
Torah verse	Shemot 23:19	d_oraita rules
Mishnah/Gemara	Chullin 8:1	d_rabanan rules
Takana	Chazal decree	Rabbinic institutions
Minhag source	Community practice	Minhag rules
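
A completeness check can be reduced to inspecting the last link of the chain. The sketch below assumes the chain is a list of `(source_kind, locator)` pairs and that the kind labels are lowercase strings such as "torah" or "gemara"; both the representation and the `is_complete` helper are hypothetical.

```python
# Assumed labels for the terminus types in the table above.
TERMINUS_KINDS = {"torah", "mishnah", "gemara", "takana", "minhag"}


def is_complete(chain: list) -> bool:
    """A chain is complete when it is non-empty and ends at a terminus kind."""
    return bool(chain) and chain[-1][0] in TERMINUS_KINDS
```

This check does not verify that the terminus matches the rule's status (for example, that a d_oraita rule ends at a Torah verse); that would need the rule's classification as an extra input.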

Loop Detection¶

The skill tracks visited references to detect circular chains:

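def trace_chain(ref, get_links, visited=None, depth=0):
    # get_links: caller-supplied function returning the sources that ref cites
    if visited is None:
        visited = set()
    if ref in visited:
        return {"loop_detected": True, "ref": ref}
    visited.add(ref)
    chain = [{"ref": ref, "depth": depth}]
    for source in get_links(ref):
        result = trace_chain(source, get_links, visited, depth + 1)
        if isinstance(result, dict):  # a loop was reported further down
            return result
        chain.extend(result)
    return chain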

Makor Chain Format¶

The final chain is emitted as ASP (Answer Set Programming) facts:

% Rule naming: r_{topic}_{specific} (NOT siman-seif based)
% Example: r_bb_achiila for basar bechalav eating rule
makor(r_bb_achiila, sa("yd:88:1")).
makor(r_bb_achiila, tur("yd:88")).
makor(r_bb_achiila, rambam("maachalot:9:1")).
makor(r_bb_achiila, gemara("chullin:104b")).
makor(r_bb_achiila, torah("shemot:23:19")).
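
The facts above follow a regular pattern, so they can be generated from an ordered list of sources. This is a minimal sketch, assuming the chain is held as `(source_kind, locator)` pairs; `format_makor` is a hypothetical helper rather than the skill's own formatter.

```python
def format_makor(rule_id: str, chain: list) -> list:
    """Render one makor/2 fact per (source_kind, locator) pair in the chain."""
    return [f'makor({rule_id}, {kind}("{locator}")).' for kind, locator in chain]


chain = [
    ("sa", "yd:88:1"),
    ("tur", "yd:88"),
    ("rambam", "maachalot:9:1"),
    ("gemara", "chullin:104b"),
    ("torah", "shemot:23:19"),
]
print("\n".join(format_makor("r_bb_achiila", chain)))
```

Run as written, this prints the five facts shown above, in the same order.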

Phase E: Semantic Enrichment¶

Topic Lookup¶

Related topics are retrieved from Sefaria:

topics = mcp__sefaria_texts__get_topic_details(
    topic_slug="meat-and-milk",
    with_links=True,
    with_refs=True
)

Semantic Search¶

Related material is discovered through semantic search:

related = mcp__sefaria_texts__english_semantic_search(
    query="prohibition of cooking meat and milk together Biblical commandment",
    filters={"document_categories": ["Halakhah"]}
)

Shiur and Measurement Extraction¶

Quantities mentioned are identified:

Hebrew	Transliteration	Meaning
כזית	k'zayit	olive-sized portion
כביצה	k'beitza	egg-sized portion
שישים	shishim	60:1 ratio
נותן טעם	noten taam	taste transfer

Temporal and Contextual Conditions¶

Context modifiers are identified:

Type	Examples	Encoding Impact
Time-based	"on Shabbat", "during Pesach"	Additional conditions
Situation-based	"bedieved", "hefsed merubeh"	World-specific rules
Location-based	"in Eretz Yisrael"	Geographic conditions

Phase F: Gap Analysis¶

Ontology Coverage Check¶

The skill verifies that every predicate the seif requires already exists in the ontology:

predicates_check:
  existing:
    - issur/3: "Prohibition predicate"
    - heter/2: "Permission predicate"
    - sakana/1: "Danger predicate"
  missing:
    - hachshara/2: "Vessel preparation - needs to be added"
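
The coverage check is essentially a set comparison between what the seif needs and what the ontology defines. The sketch below assumes predicates are identified by `name/arity` strings as in the example; `check_coverage` is a hypothetical helper.

```python
def check_coverage(required: set, available: set) -> dict:
    """Split required items into those already available and those missing."""
    return {
        "existing": sorted(required & available),
        "missing": sorted(required - available),
    }


ontology = {"issur/3", "heter/2", "sakana/1"}
needed = {"issur/3", "heter/2", "sakana/1", "hachshara/2"}
report = check_coverage(needed, ontology)
```

The same comparison would work for world names such as `base`, `mechaber`, and `rema`.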

World Coverage Check¶

Required worlds are verified:

worlds_check:
  existing:
    - base: "Universal rulings"
    - mechaber: "Sefardi positions"
    - rema: "Ashkenazi positions"
  missing: []  # None in this case

Prerequisite Verification¶

Dependencies on other seifim are checked:

prerequisites:
  - seif: "YD:87:1"
    status: encoded
    notes: "Base BB prohibition"
  - seif: "YD:87:2"
    status: not_encoded
    notes: "Species definitions - may need to encode first"
    blocking: false
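
Only prerequisites that are both unencoded and marked as blocking should stop the phase. A minimal sketch, assuming each entry is a dict shaped like the YAML above and that a missing `blocking` key means non-blocking; `blocking_prerequisites` is a hypothetical helper.

```python
def blocking_prerequisites(prerequisites: list) -> list:
    """Return the seifim that are not yet encoded and are marked as blocking."""
    return [
        p["seif"]
        for p in prerequisites
        if p.get("status") != "encoded" and p.get("blocking", False)
    ]


prerequisites = [
    {"seif": "YD:87:1", "status": "encoded", "notes": "Base BB prohibition"},
    {"seif": "YD:87:2", "status": "not_encoded", "blocking": False},
]
unresolved = blocking_prerequisites(prerequisites)
```

For the example data, `unresolved` is empty, so the dependency on YD:87:2 is noted without halting the phase.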

Complexity Scoring¶

A complexity score (1-10) is assigned:

complexity:
  score: 6
  factors:
    machloket_count: 2
    commentary_depth: 3  # Tiers needed
    chain_length: 5
    cross_references: 3
    novel_predicates: 1

# Scoring rubric:
# 1-3: Simple, no machloket, straightforward chain
# 4-6: Moderate, some machloket, complete chain
# 7-10: Complex, multiple machloket, incomplete chain or novel concepts
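
The rubric maps a score to one of three bands. The sketch below covers only that mapping; how the individual factors combine into the score is not specified here, so no formula is assumed. `complexity_band` and the band labels are hypothetical names.

```python
def complexity_band(score: int) -> str:
    """Map a 1-10 complexity score to the band named in the rubric."""
    if not 1 <= score <= 10:
        raise ValueError(f"Score must be between 1 and 10, got {score}")
    if score <= 3:
        return "simple"
    if score <= 6:
        return "moderate"
    return "complex"
```

With the example score of 6, the function returns "moderate".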

Phase G: Artifact Generation¶

Artifact 1: Human Review Report¶

corpus-report-YD-{siman}-{seif}.md

The skill uses the template at templates/corpus-report.md to generate a human-readable report that includes:

Executive summary with metrics
Primary source text (Hebrew + English)
Atomic statements table
Commentary excerpts by tier
Derivation chain diagram
Machloket documentation
Gap analysis results
Questions for human review
Checkpoint criteria checklist

Artifact 2: Machine-Readable Sources¶

corpus-sources-YD-{siman}-{seif}.yaml

reference: "YD:88:1"
fetched_at: "2026-01-25T10:00:00Z"
primary:
  hebrew: "בשר בהמה טהורה בחלב..."
  english: "Meat of a kosher domesticated animal..."
  translations:
    - version: "Sefaria Community Translation"
      text: "..."
statements:
  - id: s1
    type: ISSUR
    text_he: "..."
    text_en: "..."
    conditions: []
    world_context: base
commentaries:
  shach:
    - location: "sk 1"
      type: CLARIFICATION
      text_he: "..."
      text_en: "..."
  taz:
    - location: "sk 1"
      type: QUALIFICATION
      text_he: "..."
      text_en: "..."
derivation_chain:
  - level: 0
    source: "SA YD 88:1"
    ref: "Shulchan Arukh, Yoreh De'ah 88:1"
  - level: 1
    source: "Tur YD 88"
    ref: "Tur, Yoreh Deah 88"
  # ... continues to terminus
machloket:
  - id: m1
    topic: dag_bechalav
    position_a:
      authority: mechaber
      ruling: sakana
    position_b:
      authority: rema
      ruling: mutar

Artifact 3: Visual Derivation Chain¶

corpus-chain-YD-{siman}-{seif}.mermaid

graph TD
    SA["SA YD 88:1<br/>Primary Source"] --> TUR["Tur YD 88"]
    TUR --> RAMBAM["Rambam MA 9:1"]
    RAMBAM --> GEM["Gemara Chullin 104b"]
    GEM --> MISH["Mishnah Chullin 8:1"]
    MISH --> TORAH["Torah: Shemot 23:19<br/>Lo tevashel gedi..."]

    style SA fill:#90EE90
    style TORAH fill:#FFD700
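
A linear chain like this one can be rendered from an ordered list of nodes. The sketch below assumes each node is a `(node_id, label)` pair and omits the `style` lines; `chain_to_mermaid` is a hypothetical helper, not the skill's own renderer.

```python
def chain_to_mermaid(nodes: list) -> str:
    """Render a linear chain of (node_id, label) pairs as a Mermaid graph."""
    lines = ["graph TD"]
    for i in range(len(nodes) - 1):
        src_id, src_label = nodes[i]
        dst_id, dst_label = nodes[i + 1]
        # Only the first node needs its label on the left side of an edge;
        # later nodes were already labeled as the target of the previous edge.
        src = f'{src_id}["{src_label}"]' if i == 0 else src_id
        lines.append(f'    {src} --> {dst_id}["{dst_label}"]')
    return "\n".join(lines)


diagram = chain_to_mermaid([
    ("SA", "SA YD 88:1"),
    ("TUR", "Tur YD 88"),
    ("RAMBAM", "Rambam MA 9:1"),
])
```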

Artifact 4: Questions for Human¶

corpus-questions-YD-{siman}-{seif}.yaml

questions:
  - id: q1
    phase: B
    type: CLARIFICATION
    question: "Does 'beheima' include all domesticated animals or only cattle?"
    context: "Statement s1 mentions 'beheima tehora'"
    options:
      - "All domesticated kosher animals (cow, sheep, goat)"
      - "Only cattle specifically"
    default: 0
    impact: "Affects predicate scope"

  - id: q2
    phase: C
    type: MACHLOKET
    question: "Should we encode the Taz's lenient opinion on bedieved cases?"
    context: "Taz sk 3 permits bedieved in certain cases"
    options:
      - "Yes, encode as world-specific rule"
      - "No, follow strict Mechaber"
    default: 0
    impact: "May require additional taz world"

Checkpoint Criteria¶

Before requesting human approval, the skill verifies:

[ ] Primary source text accurately fetched (Hebrew + English)
[ ] All Tier 1 commentaries fetched and classified
[ ] Derivation chain reaches authoritative source
[ ] All machloket identified with both positions documented
[ ] No unresolved blocking dependencies on other seifim
[ ] Questions formulated for any ambiguities
[ ] All four artifacts generated

Session State Update¶

After completing corpus preparation:

current_phase: corpus-prep
target_seif: "YD:88:1"
checkpoints:
  corpus-prep:
    status: pending_review
    artifacts:
      - .mistaber-artifacts/corpus-report-YD-88-1.md
      - .mistaber-artifacts/corpus-sources-YD-88-1.yaml
      - .mistaber-artifacts/corpus-chain-YD-88-1.mermaid
      - .mistaber-artifacts/corpus-questions-YD-88-1.yaml
    complexity_score: 6
    statements_count: 5
    machloket_count: 1
    commentary_tiers_fetched: 2
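
The artifact paths listed in the session state follow the `corpus-{kind}-YD-{siman}-{seif}` naming pattern used throughout Phase G. A minimal sketch of deriving them from the `target_seif` value, assuming a colon-separated reference; `artifact_paths` and the two constants are hypothetical names.

```python
from pathlib import PurePosixPath

ARTIFACT_DIR = PurePosixPath(".mistaber-artifacts")
# Artifact kind -> file extension, in the order the artifacts are generated.
ARTIFACT_KINDS = {
    "report": "md",
    "sources": "yaml",
    "chain": "mermaid",
    "questions": "yaml",
}


def artifact_paths(seif_ref: str) -> list:
    """Build the four artifact paths for a reference such as "YD:88:1"."""
    section, siman, seif = seif_ref.split(":")
    return [
        str(ARTIFACT_DIR / f"corpus-{kind}-{section}-{siman}-{seif}.{ext}")
        for kind, ext in ARTIFACT_KINDS.items()
    ]
```

For "YD:88:1" this yields the same four paths as the `artifacts` list above.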

Common Issues¶

Reference Not Found¶

Symptom: Sefaria returns no results for the reference.

Solutions:

Validate reference format using clarify_name_argument
Try alternative spellings (Arukh vs Aruch)
Check if text exists in Sefaria library

Incomplete Derivation Chain¶

Symptom: The chain does not reach an authoritative terminus.

Solutions:

Mark as requiring manual research
Check if rule is d'rabanan (may terminate at Mishnah)
Flag for human review in questions

Commentary Not Available¶

Symptom: A specific commentary is not available in Sefaria.

Solutions:

Note as gap in corpus report
Proceed with available commentaries
Flag for human to provide source if critical

Machloket Identification Ambiguity¶

Symptom: It is unclear whether the positions truly disagree.

Solutions:

Generate clarification question
Default to conservative interpretation (encode as machloket)
Request human guidance

Related Documentation¶

HLL Encoding Skill - Next phase after corpus approval
Workflow Guide - Complete pipeline overview
Troubleshooting - Error resolution