This repository contains the cleaned implementation for the paper Reflective Prompt Tuning through Language Model Function-Calling.
RPT automates prompt improvement by using an optimizer LLM with function-calling abilities to invoke diagnostic tools, analyze the target model’s failures, and produce structured evaluation reports. Each report, together with the history of earlier reports, is fed back to the optimizer, which iteratively refines the target prompt.
RPT is a function-calling workflow for iterative prompt optimization. Given a task, a target model, and an initial prompt program, the system repeatedly:
- evaluates the current prompt on an optimization split,
- diagnoses recurring failure modes,
- summarizes calibration and task metrics,
- asks an optimizer model to propose a prompt patch or stop,
- selects the final prompt using held-out validation performance.
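The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the repository's implementation: `Report`, `evaluate`, `optimize`, and the `run_target` and `propose_patch` callables are hypothetical names, the report carries only accuracy and failed inputs (calibration metrics are omitted), and the optimizer is modeled as a function that returns a patched prompt or `None` to stop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


@dataclass
class Report:
    """Hypothetical evaluation report: accuracy plus the inputs that failed."""
    accuracy: float
    failures: List[str] = field(default_factory=list)


def evaluate(prompt: str,
             run_target: Callable[[str, str], str],
             examples: Sequence[Example]) -> Report:
    """Score a prompt on a split and collect the inputs it got wrong."""
    if not examples:
        raise ValueError("examples must not be empty")
    failures = [x for x, y in examples if run_target(prompt, x) != y]
    return Report(accuracy=1.0 - len(failures) / len(examples), failures=failures)


def optimize(seed_prompt: str,
             run_target: Callable[[str, str], str],
             propose_patch: Callable[[str, Report, List[Report]], Optional[str]],
             opt_split: Sequence[Example],
             val_split: Sequence[Example],
             iters: int = 40) -> str:
    """Iteratively patch the prompt, then pick the best candidate on validation."""
    history: List[Report] = []
    candidates = [seed_prompt]
    prompt = seed_prompt
    for _ in range(iters):
        report = evaluate(prompt, run_target, opt_split)  # evaluate and diagnose
        history.append(report)                            # keep earlier reports
        patched = propose_patch(prompt, report, history)  # optimizer: patch or stop
        if patched is None:
            break
        prompt = patched
        candidates.append(prompt)
    # Final selection uses held-out validation accuracy, not the optimization split.
    return max(candidates, key=lambda p: evaluate(p, run_target, val_split).accuracy)
```

Keeping every candidate and choosing on the validation split means a late patch that overfits the optimization split does not automatically become the final prompt.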
The cleaned repository supports three tasks:
- hotpotqa
- livebench_math
- xbrl_formula
RPT/
├── .gitignore
├── LICENSE
├── README.md
├── data/
│   ├── hotpotqa/
│   │   └── .gitkeep
│   ├── livebench_math/
│   │   └── .gitkeep
│   └── xbrl_formula/
│       └── .gitkeep
├── figs/
│   └── RPT_overview.png
├── requirements.txt
├── rpt/
│   ├── __init__.py
│   ├── analysis/
│   │   ├── __init__.py
│   │   ├── cluster_failures_and_patches.py
│   │   ├── cluster_fusion.py
│   │   ├── interpret_data_using_heatmaps.py
│   │   ├── paths.py
│   │   └── performance_summarization_and_analysis.py
│   ├── common.py
│   ├── data/
│   │   ├── __init__.py
│   │   └── prepare.py
│   ├── gemini_utils.py
│   ├── paths.py
│   └── tasks/
│       ├── __init__.py
│       ├── hotpotqa.py
│       ├── hotpotqa_gemini.py
│       ├── livebench_math.py
│       ├── livebench_math_gemini.py
│       ├── xbrl_formula.py
│       └── xbrl_formula_gemini.py
└── run_analysis_pipeline.sh
Generated artifacts are ignored by git: logs/, clustering_results/, vis_results/, results/, analysis_reports/, and generated data/**/*.jsonl files.
Use Python >= 3.10.
Install dependencies:
pip install -r requirements.txt
For editable local development, optionally create a virtual environment first:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Set only the credentials for the backend you plan to use.
For OpenAI target or optimizer runs:
export OPENAI_API_KEY="..."
For Gemini optimizer runs through Vertex AI:
export GOOGLE_CLOUD_PROJECT="..."
export GOOGLE_CLOUD_LOCATION="global"
The repository does not redistribute dataset JSONL files. It keeps the expected data directories in place and provides a preparation command that downloads or regenerates the same local split paths.
| Dataset | Local path | Split files | Upstream reference |
|---|---|---|---|
| HotpotQA | data/hotpotqa/ | train.jsonl, dev.jsonl, test.jsonl | HotpotQA / Hugging Face |
| LiveBench Math | data/livebench_math/ | train.jsonl, val.jsonl, test.jsonl | LiveBench / Hugging Face |
| XBRL Formula | data/xbrl_formula/ | train.jsonl, val.jsonl, test.jsonl | ACE finance data |
Prepare all datasets:
python -m rpt.data.prepare
The first run requires network access to Hugging Face and GitHub.
This writes:
- data/hotpotqa/train.jsonl, data/hotpotqa/dev.jsonl, and data/hotpotqa/test.jsonl
- data/livebench_math/train.jsonl, data/livebench_math/val.jsonl, and data/livebench_math/test.jsonl
- data/xbrl_formula/train.jsonl, data/xbrl_formula/val.jsonl, and data/xbrl_formula/test.jsonl
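To confirm that preparation produced every expected file, a small check along the following lines may help. It is a sketch that is not part of the repository: `EXPECTED_SPLITS` and `missing_splits` are hypothetical names, the file names are copied from the list above, and the default root `data` assumes no path overrides are in effect.

```python
from pathlib import Path
from typing import Dict, List, Union

# Split file names as listed in this README (HotpotQA uses dev, the others use val).
EXPECTED_SPLITS: Dict[str, List[str]] = {
    "hotpotqa": ["train.jsonl", "dev.jsonl", "test.jsonl"],
    "livebench_math": ["train.jsonl", "val.jsonl", "test.jsonl"],
    "xbrl_formula": ["train.jsonl", "val.jsonl", "test.jsonl"],
}


def missing_splits(data_root: Union[str, Path] = "data") -> Dict[str, List[str]]:
    """Return {dataset: [missing file names]} for splits not found under data_root."""
    root = Path(data_root)
    missing: Dict[str, List[str]] = {}
    for dataset, files in EXPECTED_SPLITS.items():
        absent = [name for name in files if not (root / dataset / name).is_file()]
        if absent:
            missing[dataset] = absent
    return missing


if __name__ == "__main__":
    gaps = missing_splits()
    if gaps:
        for dataset, files in gaps.items():
            print(f"{dataset}: missing {', '.join(files)}")
    else:
        print("All expected split files are present.")
```

An empty result means all nine split files exist; it does not validate their contents.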
Prepare one dataset or overwrite existing files:
python -m rpt.data.prepare --dataset xbrl_formula
python -m rpt.data.prepare --force
Dataset paths can be overridden with environment variables:
export RPT_DATA_ROOT="/path/to/data"
export RPT_HOTPOTQA_DATA_DIR="/path/to/hotpotqa"
export RPT_LIVEBENCH_MATH_DATA_DIR="/path/to/livebench_math"
export RPT_XBRL_FORMULA_DATA_DIR="/path/to/xbrl_formula"
The generated local splits build on the following data sources:
- HotpotQA
  - Source: HotpotQA and the hotpotqa/hotpot_qa Hugging Face mirror.
  - License: CC BY-SA 4.0.
- LiveBench Math
  - Source: LiveBench and the livebench/math Hugging Face dataset.
  - License: Apache License, Version 2.0.
- XBRL Formula
  - Source: ACE finance data.
  - License: Apache License, Version 2.0.
Please refer to the respective upstream sources for complete licensing terms and attribution requirements.
Prepare the local JSONL splits first:
python -m rpt.data.prepare
Run an OpenAI optimizer:
python -m rpt.tasks.hotpotqa --iters 40
python -m rpt.tasks.xbrl_formula --iters 40
python -m rpt.tasks.livebench_math --iters 40
Run a Gemini optimizer:
python -m rpt.tasks.hotpotqa_gemini --iters 40 --optimizer_name gemini-3.1-pro
python -m rpt.tasks.livebench_math_gemini --iters 40 --optimizer_name gemini-3.1-pro
python -m rpt.tasks.xbrl_formula_gemini --iters 40 --optimizer_name gemini-3.1-pro
Prepare or inspect cached LiveBench Math splits without running optimization:
python -m rpt.tasks.livebench_math --prepare_only
Evaluate the seed prompt only:
python -m rpt.tasks.hotpotqa --evaluate_only
python -m rpt.tasks.livebench_math --evaluate_only
Run the analysis pipeline for an existing log:
./run_analysis_pipeline.sh \
--log_path logs/xbrl_formula/gpt-5/example.jsonl \
--task_name xbrl_formula \
--model_name gpt-5
The pipeline can generate:
- failure and patch corpora,
- ClusterFusion topics,
- human-readable topic labels,
- transition and persistence summaries,
- heatmaps and prompt-length plots.
RPT runs write JSONL logs containing prompt programs, train/dev/test metrics, diagnostic reports, decisions, and final evaluations. Analysis scripts write derived artifacts into task/model/log-specific subdirectories.
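Because the logs are JSONL, they can be inspected with the standard library alone. The record schema is not documented in this README, so the sketch below assumes only that each non-empty line is one JSON value and reports which top-level keys appear. `iter_records` and `key_counts` are hypothetical helper names, not functions from the `rpt` package.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Iterator, Union


def iter_records(log_path: Union[str, Path]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def key_counts(log_path: Union[str, Path]) -> Counter:
    """Count how often each top-level key appears across object records."""
    counts: Counter = Counter()
    for record in iter_records(log_path):
        if isinstance(record, dict):  # skip lines that are not JSON objects
            counts.update(record.keys())
    return counts
```

Calling `key_counts` on a run's log, for example the path passed to `--log_path`, gives a quick view of which fields a run recorded before any analysis script is applied.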
Common output locations:
- logs/
- clustering_results/
- vis_results/
- analysis_reports/
- results/
If you use this repository, please cite our work:
@article{bayat2026reflectiveprompttuning,
title = {Reflective Prompt Tuning through Language Model Function-Calling},
author = {Fatahi Bayat, Farima and Aminnaseri, Moin and Pezeshkpour, Pouya and Hruschka, Estevam},
year = {2026},
url = {https://arxiv.org/abs/2605.21781}
}
Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses, which may be CC license and Apache 2.0 license. In the event of conflicts between Megagon Labs, Inc., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party's license conditions. All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur.
If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.
