Evaluating how AI agents understand and adapt tests to real-world software evolution.
TestEvo-Bench is a live benchmark for evaluating AI software engineering agents on realistic software test evolution tasks mined from open-source repositories.
Unlike traditional benchmarks that isolate tests from production changes, TestEvo-Bench models the real-world co-evolution of production code and test suites.
The benchmark contains two complementary tracks:
Test Generation — https://huggingface.co/datasets/TestEvo-Bench/teb-generation
Test Update — https://huggingface.co/datasets/TestEvo-Bench/teb-update
Each task is execution-grounded with runnable environments and evaluated using metrics such as pass rate, coverage, and mutation score.
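To make the three metrics concrete, here is a minimal sketch of how they might be computed from per-task execution results. The `TaskResult` record and its field names are illustrative assumptions, not the benchmark's actual schema or harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task execution record; field names are
    # assumptions, not the benchmark's actual result schema.
    tests_passed: int    # tests that passed when run against the new code
    tests_total: int     # total tests executed
    lines_covered: int   # production lines exercised by the suite
    lines_total: int     # production lines in scope
    mutants_killed: int  # seeded mutants detected by at least one failing test
    mutants_total: int   # seeded mutants in total


def pass_rate(r: TaskResult) -> float:
    # Fraction of executed tests that pass.
    return r.tests_passed / r.tests_total if r.tests_total else 0.0


def coverage(r: TaskResult) -> float:
    # Line coverage: fraction of in-scope production lines exercised.
    return r.lines_covered / r.lines_total if r.lines_total else 0.0


def mutation_score(r: TaskResult) -> float:
    # Fraction of seeded mutants the test suite detects (kills).
    return r.mutants_killed / r.mutants_total if r.mutants_total else 0.0


result = TaskResult(tests_passed=9, tests_total=10,
                    lines_covered=180, lines_total=200,
                    mutants_killed=14, mutants_total=20)
print(pass_rate(result), coverage(result), mutation_score(result))
# → 0.9 0.9 0.7
```

Execution grounding means these numbers come from actually running the suite in the task's environment, rather than from static similarity to a reference test.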
🌐 Website — https://www.testevo-bench.com/
🤗 Hugging Face Space — https://huggingface.co/spaces/TestEvo-Bench/
💻 Code — https://anonymous.4open.science/r/testevo-bench-1150/README.md
Real-world • Execution-grounded • Live software evolution benchmark