C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations • Paper • 2507.22968 • Published Jul 30, 2025 • 25 • 4
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks • Paper • 2506.10954 • Published Jun 12, 2025 • 54 • 2
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution • Paper • 2505.04606 • Published May 7, 2025 • 9 • 1