Retrieval is Cheap, Show Me the Code:

Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Jiashuo Sun*1, Jimeng Shi*1, Yixuan Xie1, Saizhuo Wang2, Jash Rajesh Parekh1, Pengcheng Jiang1, Zhiyi Shi1, Jiajun Fan1, Qinglong Zheng1, Peiran Li1,3, Shaowen Wang1, Ge Liu1, Jiawei Han1
1University of Illinois Urbana-Champaign   2Hong Kong University of Science and Technology   3Texas A&M University
*Equal contribution
The Question
"Who is older, Jed Hoyer or John William Henry II?"
Vanilla RAG
Single-shot retrieve-then-read. Noisy evidence, one chance.
✗ Prone to incomplete evidence
Search Agent
Iterative think→search→observe. Vague queries cause entity drift.
△ Error accumulation
PyRAG (Ours)
Decompose into atomic sub-queries, synthesize an executable program, run it.
✓ Inspectable & verifiable

Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions. Current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them.

We introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process.

Experiments on five QA benchmarks show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with +25.5 EM on Bamboogle over Vanilla RAG.

The Key Idea

What if the reasoning trace were the program?

Multi-hop QA is fundamentally step-by-step computation: decompose, compute intermediates, compose. This is exactly what code-specialized LLMs are trained to do. We synthesize an executable Python program over two primitives: retrieve(query) and answer(query, docs).

import re
from datetime import datetime

# Step 1–2: When was Jed Hoyer born?
doc1 = retrieve("When was Jed Hoyer born?")
jed_birth = answer("When was Jed Hoyer born?", doc1)

# Step 3–4: When was John William Henry II born?
doc2 = retrieve("When was John William Henry II born?")
john_birth = answer("When was John William Henry II born?", doc2)

# Step 5: Compose — handled by Python, not free-form LLM reasoning
jed_date  = datetime.strptime(re.search(r"[A-Z][a-z]+\s\d{1,2},\s\d{4}", jed_birth).group(), "%B %d, %Y")
john_date = datetime.strptime(re.search(r"[A-Z][a-z]+\s\d{1,2},\s\d{4}", john_birth).group(), "%B %d, %Y")

return "Jed Hoyer" if jed_date < john_date else "John William Henry II"
# → John William Henry II  ✓
Explicit state
Intermediate results are variables, not narrative fragments. No entity drift.
Deterministic feedback
Compiler/runtime errors are real signals — not the LLM grading itself.
Inspectable trace
Every retrieval, every answer, every variable — recorded and auditable.

Framework

Three agents, one execution interface.

01 · Decompose
Atomic sub-queries

An LLM breaks the question into independently answerable single-hop queries.

02 · Plan
Synthesize program

A code-specialized LLM emits a Python program threading retrieve and answer calls through named variables.

03 · Execute
Run step-by-step

A Python interpreter runs the plan, producing an inspectable trace and grounded final answer.
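
For concreteness, the sketch below shows one way the three agents and the execution interface could be wired together. The prompt strings, the exec-based runner, and the names run_program, run_pyrag, and _plan are illustrative assumptions rather than the released implementation; only the two tool primitives, retrieve(query) and answer(query, docs), come from PyRAG itself.

# Illustrative sketch: prompts, helper names, and the exec-based runner are assumptions.
from typing import Callable, List

def retrieve(query: str, top_k: int = 5) -> List[str]:
    """Tool primitive: return the top-k passages for a single-hop sub-query."""
    ...

def answer(query: str, docs: List[str]) -> str:
    """Tool primitive: answer a sub-query from the retrieved passages."""
    ...

def run_program(program: str) -> str:
    """Wrap the synthesized program in a function so its final `return` yields
    the answer, then execute it with the two tools in scope."""
    namespace = {"retrieve": retrieve, "answer": answer}
    body = "\n".join("    " + line for line in program.splitlines())
    exec("def _plan():\n" + body + "\n", namespace)
    return namespace["_plan"]()

def run_pyrag(question: str, llm: Callable[[str], str]) -> str:
    # 01 · Decompose: break the question into atomic, single-hop sub-queries.
    sub_queries = llm(f"Decompose into single-hop sub-queries:\n{question}")

    # 02 · Plan: a code-specialized LLM emits a program over retrieve/answer,
    # threading intermediate results through named variables.
    program = llm(
        "Write a Python program that answers the question using "
        "retrieve(query) and answer(query, docs).\n"
        f"Question: {question}\nSub-queries:\n{sub_queries}"
    )

    # 03 · Execute: run the program; every retrieval, answer, and variable
    # binding can be recorded as the reasoning trace.
    return run_program(program)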

ⓐ Compiler-Grounded Self-Repair

If the program raises a runtime exception, the traceback is fed back to the Plan Agent as a deterministic, grounded signal.

Triggers on ~5% of queries. No additional training.
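
A minimal sketch of this repair loop, assuming an executor like run_program above and a Plan Agent exposed as a prompt-to-text callable; the retry budget and prompt wording are illustrative assumptions.

import traceback
from typing import Callable

def execute_with_repair(program: str,
                        run_program: Callable[[str], str],
                        plan_llm: Callable[[str], str],
                        max_repairs: int = 2) -> str:
    for _ in range(max_repairs + 1):
        try:
            return run_program(program)
        except Exception:
            # The traceback is a deterministic, grounded signal: hand it back
            # to the Plan Agent and ask for a corrected program.
            tb = traceback.format_exc()
            program = plan_llm(
                "The program below raised an exception at runtime.\n"
                f"Program:\n{program}\n\nTraceback:\n{tb}\n"
                "Return a corrected Python program using retrieve() and answer()."
            )
    return ""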

ⓑ Execution-Driven Adaptive Retrieval

When answer() returns a sentinel like "unknown", the runtime automatically re-retrieves with a boosted top-k.

Triggers on ~20% of queries. Targeted, not global.
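
A hedged sketch of this fallback, assuming the retrieve/answer primitives from the Framework sketch above; the sentinel string, the top-k schedule, and the wrapper name are illustrative, not the released behavior.

from typing import List

def answer_with_fallback(query: str, docs: List[str],
                         boosted_top_ks: tuple = (10, 20)) -> str:
    result = answer(query, docs)
    for k in boosted_top_ks:
        if result.strip().lower() != "unknown":
            break
        # Targeted re-retrieval: only the failing sub-query gets a larger top-k.
        result = answer(query, retrieve(query, top_k=k))
    return result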

Results

Consistent gains. Largest on compositional multi-hop.

Exact Match (%) on five open-domain QA benchmarks, training-free setting.

Method                 PopQA   HotpotQA   2WikiMQA   MuSiQue   Bamboogle    Avg.

Qwen2.5-7B-Instruct
Direct Inference        14.0       18.3       12.6       3.1        12.0    12.0
Vanilla RAG             26.7       28.9       18.9       4.7        16.0    19.0
Self-Ask                29.4       30.2       21.5       6.8        22.1    22.0
IRCoT                   32.6       32.7       24.8       9.1        24.3    24.7
ITER-RETGEN             31.4       32.5       28.9       8.7        29.6    26.2
PyRAG (ours)            33.5       34.0       33.4      11.8        41.5    30.8
∆ vs. Vanilla RAG       +6.8       +5.1      +14.5      +7.1       +25.5   +11.8

Qwen2.5-72B-Instruct
Vanilla RAG             33.2       36.8       30.4      10.6        21.6    26.5
ITER-RETGEN             43.4       50.5       40.2      13.8        33.6    36.3
PyRAG (ours)            45.5       52.0       44.4      16.9        45.5    40.9
+25.5
EM gain on Bamboogle
(7B, vs. Vanilla RAG)
+11.8
Average EM gain
across 5 benchmarks
3.7
LLM calls / query
(matches Search-R1's EM)

Case Studies

See PyRAG in action.

Four representative execution traces from HotpotQA. Each case shows the generated program, intermediate variable bindings, and what makes the executable interface work.

Get Started

Run PyRAG in three steps.

# Clone
$ git clone https://github.com/GasolSun36/PyRAG.git
$ cd PyRAG

# Install
$ pip install -r requirements.txt

# Run on HotpotQA
$ python main.py

Full documentation, training scripts, and reproduction details on GitHub.

Cite

@misc{sun2026retrievalcheapcodeexecutable,
      title={Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation}, 
      author={Jiashuo Sun and Jimeng Shi and Yixuan Xie and Saizhuo Wang and Jash Rajesh Parekh and Pengcheng Jiang and Zhiyi Shi and Jiajun Fan and Qinglong Zheng and Peiran Li and Shaowen Wang and Ge Liu and Jiawei Han},
      year={2026},
      eprint={2605.12975},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.12975}, 
}