Retrieval is Cheap, Show Me the Code:

Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Jiashuo Sun*1, Jimeng Shi*1, Yixuan Xie1, Saizhuo Wang2, Jash Rajesh Parekh1, Pengcheng Jiang1, Zhiyi Shi1, Jiajun Fan1, Qinglong Zheng1, Peiran Li1,3, Shaowen Wang1, Ge Liu1, Jiawei Han1
1University of Illinois Urbana-Champaign   2Hong Kong University of Science and Technology   3Texas A&M University
*Equal contribution
The Question
"Who is older, Jed Hoyer or John William Henry II?"
Vanilla RAG
Single-shot retrieve-then-read. Noisy evidence, one chance.
✗ Prone to incomplete evidence
Search Agent
Iterative think→search→observe. Vague queries cause entity drift.
△ Error accumulation
PyRAG (Ours)
Decompose into atomic sub-queries, synthesize an executable program, run it.
✓ Inspectable & verifiable

Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions. Current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them.

We introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process.

Experiments on five QA benchmarks show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with +25.5 EM on Bamboogle over Vanilla RAG.

The Key Idea

What if the reasoning trace were the program?

Multi-hop QA is fundamentally step-by-step computation: decompose, compute intermediates, compose. This is exactly what code-specialized LLMs are trained to do. We synthesize an executable Python program over two primitives: retrieve(query) and answer(query, docs).

import re
from datetime import datetime

# Step 1–2: When was Jed Hoyer born?
doc1 = retrieve("When was Jed Hoyer born?")
jed_birth = answer("When was Jed Hoyer born?", doc1)

# Step 3–4: When was John William Henry II born?
doc2 = retrieve("When was John William Henry II born?")
john_birth = answer("When was John William Henry II born?", doc2)

# Step 5: Compose — handled by Python, not free-form LLM reasoning
jed_date  = datetime.strptime(re.search(r"[A-Z][a-z]+\s\d{1,2},\s\d{4}", jed_birth).group(), "%B %d, %Y")
john_date = datetime.strptime(re.search(r"[A-Z][a-z]+\s\d{1,2},\s\d{4}", john_birth).group(), "%B %d, %Y")

return "Jed Hoyer" if jed_date < john_date else "John William Henry II"
# → John William Henry II  ✓
Explicit state
Intermediate results are variables, not narrative fragments. No entity drift.
Deterministic feedback
Compiler/runtime errors are real signals — not the LLM grading itself.
Inspectable trace
Every retrieval, every answer, every variable — recorded and auditable.

Framework

Three agents, one execution interface.

01 · Decompose
Atomic sub-queries

An LLM breaks the question into independently answerable single-hop queries.

02 · Plan
Synthesize program

A code-specialized LLM emits a Python program threading retrieve and answer calls through named variables.

03 · Execute
Run step-by-step

A Python interpreter runs the plan, producing an inspectable trace and grounded final answer.
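
For concreteness, the sketch below shows one way the three agents and the execution interface could be wired together. The prompt strings, the exec-based runner, and the names run_program, run_pyrag, and _plan are illustrative assumptions rather than the released implementation; only the two tool primitives, retrieve(query) and answer(query, docs), come from PyRAG itself.

# Illustrative sketch: prompts, helper names, and the exec-based runner are assumptions.
from typing import Callable, List

def retrieve(query: str, top_k: int = 5) -> List[str]:
    """Tool primitive: return the top-k passages for a single-hop sub-query."""
    ...

def answer(query: str, docs: List[str]) -> str:
    """Tool primitive: answer a sub-query from the retrieved passages."""
    ...

def run_program(program: str) -> str:
    """Wrap the synthesized program in a function so its final `return` yields
    the answer, then execute it with the two tools in scope."""
    namespace = {"retrieve": retrieve, "answer": answer}
    body = "\n".join("    " + line for line in program.splitlines())
    exec("def _plan():\n" + body + "\n", namespace)
    return namespace["_plan"]()

def run_pyrag(question: str, llm: Callable[[str], str]) -> str:
    # 01 · Decompose: break the question into atomic, single-hop sub-queries.
    sub_queries = llm(f"Decompose into single-hop sub-queries:\n{question}")

    # 02 · Plan: a code-specialized LLM emits a program over retrieve/answer,
    # threading intermediate results through named variables.
    program = llm(
        "Write a Python program that answers the question using "
        "retrieve(query) and answer(query, docs).\n"
        f"Question: {question}\nSub-queries:\n{sub_queries}"
    )

    # 03 · Execute: run the program; every retrieval, answer, and variable
    # binding can be recorded as the reasoning trace.
    return run_program(program)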

ⓐ Compiler-Grounded Self-Repair

If the program raises a runtime exception, the traceback is fed back to the Plan Agent as a deterministic, grounded signal.

Triggers on ~5% of queries. No additional training.
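
A minimal sketch of this repair loop, assuming an executor like run_program above and a Plan Agent exposed as a prompt-to-text callable; the retry budget and prompt wording are illustrative assumptions.

import traceback
from typing import Callable

def execute_with_repair(program: str,
                        run_program: Callable[[str], str],
                        plan_llm: Callable[[str], str],
                        max_repairs: int = 2) -> str:
    for _ in range(max_repairs + 1):
        try:
            return run_program(program)
        except Exception:
            # The traceback is a deterministic, grounded signal: hand it back
            # to the Plan Agent and ask for a corrected program.
            tb = traceback.format_exc()
            program = plan_llm(
                "The program below raised an exception at runtime.\n"
                f"Program:\n{program}\n\nTraceback:\n{tb}\n"
                "Return a corrected Python program using retrieve() and answer()."
            )
    return ""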

ⓑ Execution-Driven Adaptive Retrieval

When answer() returns a sentinel like "unknown", the runtime automatically re-retrieves with a boosted top-k.

Triggers on ~20% of queries. Targeted, not global.
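
A hedged sketch of this fallback, assuming the retrieve/answer primitives from the Framework sketch above; the sentinel string, the top-k schedule, and the wrapper name are illustrative, not the released behavior.

from typing import List

def answer_with_fallback(query: str, docs: List[str],
                         boosted_top_ks: tuple = (10, 20)) -> str:
    result = answer(query, docs)
    for k in boosted_top_ks:
        if result.strip().lower() != "unknown":
            break
        # Targeted re-retrieval: only the failing sub-query gets a larger top-k.
        result = answer(query, retrieve(query, top_k=k))
    return result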

Results

Consistent gains. Largest on compositional multi-hop.

Exact Match (%) on five open-domain QA benchmarks, training-free setting.

Method                 PopQA   HotpotQA   2WikiMQA   MuSiQue   Bamboogle    Avg.

Qwen2.5-7B-Instruct
Direct Inference        14.0       18.3       12.6       3.1        12.0    12.0
Vanilla RAG             26.7       28.9       18.9       4.7        16.0    19.0
Self-Ask                29.4       30.2       21.5       6.8        22.1    22.0
IRCoT                   32.6       32.7       24.8       9.1        24.3    24.7
ITER-RETGEN             31.4       32.5       28.9       8.7        29.6    26.2
PyRAG (ours)            33.5       34.0       33.4      11.8        41.5    30.8
∆ vs. Vanilla RAG       +6.8       +5.1      +14.5      +7.1       +25.5   +11.8

Qwen2.5-72B-Instruct
Vanilla RAG             33.2       36.8       30.4      10.6        21.6    26.5
ITER-RETGEN             43.4       50.5       40.2      13.8        33.6    36.3
PyRAG (ours)            45.5       52.0       44.4      16.9        45.5    40.9
+25.5
EM gain on Bamboogle
(7B, vs. Vanilla RAG)
+11.8
Average EM gain
across 5 benchmarks
3.7
LLM calls / query
(matches Search-R1's EM)

Case Studies

See PyRAG in action.

Four representative execution traces from HotpotQA. Each case shows the generated program, intermediate variable bindings, and what makes the executable interface work.

Get Started

Run PyRAG in three steps.

# Clone
$ git clone https://github.com/GasolSun36/PyRAG.git
$ cd PyRAG

# Install
$ pip install -r requirements.txt

# Run on HotpotQA
$ python main.py

Full documentation, training scripts, and reproduction details on GitHub.

Cite

@misc{sun2026retrievalcheapcodeexecutable,
      title={Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation}, 
      author={Jiashuo Sun and Jimeng Shi and Yixuan Xie and Saizhuo Wang and Jash Rajesh Parekh and Pengcheng Jiang and Zhiyi Shi and Jiajun Fan and Qinglong Zheng and Peiran Li and Shaowen Wang and Ge Liu and Jiawei Han},
      year={2026},
      eprint={2605.12975},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.12975}, 
}