Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions. Current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them.
We introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process.
Experiments on five QA benchmarks show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with +25.5 EM on Bamboogle over Vanilla RAG.
Multi-hop QA is fundamentally step-by-step computation: decompose the question, compute intermediate answers, and compose them. This is exactly what code-specialized LLMs are trained to do. PyRAG synthesizes an executable Python program over two primitives: `retrieve(query)`, which fetches supporting documents, and `answer(query, docs)`, which extracts a single-hop answer from those documents.
```python
import re
from datetime import datetime

# Step 1–2: When was Jed Hoyer born?
doc1 = retrieve("When was Jed Hoyer born?")
jed_birth = answer("When was Jed Hoyer born?", doc1)

# Step 3–4: When was John William Henry II born?
doc2 = retrieve("When was John William Henry II born?")
john_birth = answer("When was John William Henry II born?", doc2)

# Step 5: Compose — handled by Python, not free-form LLM reasoning
date_pat = r"[A-Z][a-z]+\s\d{1,2},\s\d{4}"
jed_date = datetime.strptime(re.search(date_pat, jed_birth).group(), "%B %d, %Y")
john_date = datetime.strptime(re.search(date_pat, john_birth).group(), "%B %d, %Y")
return "Jed Hoyer" if jed_date < john_date else "John William Henry II"
# → John William Henry II ✓
```
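The program above assumes the two tool primitives exist in scope. A minimal self-contained mock of them, with a hypothetical two-document in-memory corpus standing in for a real retriever index, might look like:

```python
import re
from datetime import datetime

# Hypothetical in-memory corpus; a real deployment would query a retriever index.
CORPUS = {
    "Jed Hoyer": "Jed Hoyer was born on December 7, 1973.",
    "John William Henry II": "John William Henry II was born on September 13, 1949.",
}

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Toy retriever: return documents whose entity name appears in the query."""
    return [doc for name, doc in CORPUS.items() if name in query][:top_k]

def answer(query: str, docs: list[str]) -> str:
    """Toy reader: return the first retrieved document as the answer string."""
    return docs[0] if docs else "unknown"

doc1 = retrieve("When was Jed Hoyer born?")
jed_birth = answer("When was Jed Hoyer born?", doc1)
doc2 = retrieve("When was John William Henry II born?")
john_birth = answer("When was John William Henry II born?", doc2)

pat = r"[A-Z][a-z]+\s\d{1,2},\s\d{4}"
jed_date = datetime.strptime(re.search(pat, jed_birth).group(), "%B %d, %Y")
john_date = datetime.strptime(re.search(pat, john_birth).group(), "%B %d, %Y")
print("Jed Hoyer" if jed_date < john_date else "John William Henry II")
```

Because composition happens in Python, the date comparison is exact rather than left to free-form generation.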
The PyRAG pipeline has three stages:

1. **Decompose.** An LLM breaks the question into independently answerable single-hop queries.
2. **Plan.** A code-specialized LLM emits a Python program that threads `retrieve` and `answer` calls through named variables.
3. **Execute.** A Python interpreter runs the plan, producing an inspectable trace and a grounded final answer.
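The inspectable trace can be obtained directly from the interpreter's namespace after execution. A minimal sketch (the `final_answer` variable name is an assumption about the program convention):

```python
def run_with_trace(program: str) -> tuple[object, dict]:
    """Execute a synthesized program and return (final answer, variable trace)."""
    scope: dict = {}
    exec(program, scope)  # sandboxing omitted for brevity
    # Every named variable the program bound is a recoverable intermediate state.
    trace = {k: v for k, v in scope.items() if not k.startswith("__")}
    return scope.get("final_answer"), trace
```

Each intermediate hop is thus a concrete variable binding, not an implicit span of generated text.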
**Exception feedback.** If the program raises a runtime exception, the traceback is fed back to the Plan Agent as a deterministic, grounded repair signal. This triggers on roughly 5% of queries and requires no additional training.
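A minimal sketch of this feedback loop, where `plan_agent` is a hypothetical callable that maps a question (and optional traceback feedback) to program text:

```python
import traceback

def execute_with_feedback(plan_agent, question, max_retries=2):
    """Run the synthesized program; on exception, feed the traceback
    back to the planner as a deterministic repair signal."""
    feedback = None
    for _ in range(max_retries + 1):
        program = plan_agent(question, feedback)  # hypothetical planner interface
        try:
            scope = {}
            exec(program, scope)  # sandboxing omitted for brevity
            return scope.get("final_answer")
        except Exception:
            feedback = traceback.format_exc()  # grounded, deterministic signal
    return None
```

Unlike self-critique in natural language, the error signal here comes from the interpreter, not from the model that produced the mistake.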
**Sentinel re-retrieval.** When `answer()` returns a sentinel like `"unknown"`, the runtime automatically re-retrieves with a boosted top-k. This triggers on roughly 20% of queries; the fallback is targeted, not global.
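This fallback can be sketched as a thin wrapper around the two primitives; the sentinel set and the specific top-k values below are illustrative assumptions, not the paper's exact configuration:

```python
SENTINELS = {"unknown", "not found", ""}  # hypothetical sentinel set

def answer_with_fallback(query, retrieve, answer, top_k=5, boosted_top_k=20):
    """If the reader returns a sentinel, re-retrieve with a larger top-k
    and try once more. Only the failing hop is re-run, not the whole program."""
    docs = retrieve(query, top_k=top_k)
    result = answer(query, docs)
    if result.strip().lower() in SENTINELS:
        docs = retrieve(query, top_k=boosted_top_k)  # boosted retrieval
        result = answer(query, docs)
    return result
```

Because the sentinel check happens at a single call site, the extra retrieval cost is paid only where the first pass failed.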
Exact Match (%) on five open-domain QA benchmarks, training-free setting.
| Method | PopQA | HotpotQA | 2WikiMQA | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|
| **Qwen2.5-7B-Instruct** | | | | | | |
| Direct Inference | 14.0 | 18.3 | 12.6 | 3.1 | 12.0 | 12.0 |
| Vanilla RAG | 26.7 | 28.9 | 18.9 | 4.7 | 16.0 | 19.0 |
| Self-Ask | 29.4 | 30.2 | 21.5 | 6.8 | 22.1 | 22.0 |
| IRCoT | 32.6 | 32.7 | 24.8 | 9.1 | 24.3 | 24.7 |
| ITER-RETGEN | 31.4 | 32.5 | 28.9 | 8.7 | 29.6 | 26.2 |
| PyRAG (ours) | 33.5 | 34.0 | 33.4 | 11.8 | 41.5 | 30.8 |
| ∆ vs. Vanilla RAG | +6.8 | +5.1 | +14.5 | +7.1 | +25.5 | +11.8 |
| **Qwen2.5-72B-Instruct** | | | | | | |
| Vanilla RAG | 33.2 | 36.8 | 30.4 | 10.6 | 21.6 | 26.5 |
| ITER-RETGEN | 43.4 | 50.5 | 40.2 | 13.8 | 33.6 | 36.3 |
| PyRAG (ours) | 45.5 | 52.0 | 44.4 | 16.9 | 45.5 | 40.9 |
Four representative execution traces from HotpotQA. Each case shows the generated program, intermediate variable bindings, and what makes the executable interface work.
```shell
# Clone
git clone https://github.com/GasolSun36/PyRAG.git
cd PyRAG

# Install
pip install -r requirements.txt

# Run on HotpotQA
python main.py
```
Full documentation, training scripts, and reproduction details on GitHub.
@misc{sun2026retrievalcheapcodeexecutable,
title={Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation},
author={Jiashuo Sun and Jimeng Shi and Yixuan Xie and Saizhuo Wang and Jash Rajesh Parekh and Pengcheng Jiang and Zhiyi Shi and Jiajun Fan and Qinglong Zheng and Peiran Li and Shaowen Wang and Ge Liu and Jiawei Han},
year={2026},
eprint={2605.12975},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.12975},
}