One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
Abstract
One-Eval is an agentic evaluation system that automates large language model assessment by converting natural-language requests into executable workflows with integrated benchmark planning, dataset handling, and customizable reporting.
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
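To make the three-stage pipeline concrete, below is a minimal sketch of how such a request-to-report workflow could be wired together. This is an illustration only, assuming a Python implementation: every name here (EvalPlan, ResolvedBenchmark, nl2bench, bench_resolve, run_and_report) is hypothetical and not part of the actual One-Eval API.

```python
# A minimal sketch of the request -> plan -> resolve -> report workflow
# described in the abstract. All names are hypothetical illustrations,
# not the actual One-Eval API.

from dataclasses import dataclass


@dataclass
class EvalPlan:
    """Structured intent produced from a natural-language request (cf. NL2Bench)."""
    task: str
    benchmarks: list[str]
    metrics: list[str]


@dataclass
class ResolvedBenchmark:
    """A benchmark with its dataset acquired and schema normalized (cf. BenchResolve)."""
    name: str
    samples: list[dict]


def nl2bench(request: str) -> EvalPlan:
    # In the real system an LLM would structure the request; here the
    # plan is hard-coded for illustration.
    return EvalPlan(task="qa", benchmarks=["toy-qa"], metrics=["exact_match"])


def bench_resolve(plan: EvalPlan) -> list[ResolvedBenchmark]:
    # Map each planned benchmark to normalized (input, reference) samples.
    toy_samples = [{"input": "2+2?", "reference": "4"}]
    return [ResolvedBenchmark(name=b, samples=toy_samples) for b in plan.benchmarks]


def run_and_report(benchmarks: list[ResolvedBenchmark], model) -> dict:
    # Compute metrics while keeping a per-sample evidence trail, mirroring
    # the auditability property the abstract claims.
    report = {}
    for bench in benchmarks:
        evidence, correct = [], 0
        for sample in bench.samples:
            pred = model(sample["input"])
            hit = pred.strip() == sample["reference"]
            correct += hit
            evidence.append({"sample": sample, "prediction": pred, "correct": hit})
        report[bench.name] = {
            "exact_match": correct / len(bench.samples),
            "evidence": evidence,  # retained for debugging and review
        }
    return report


if __name__ == "__main__":
    plan = nl2bench("Evaluate my model on simple arithmetic QA.")
    # A human-in-the-loop checkpoint would sit here: the plan could be
    # reviewed, edited, or rolled back before execution.
    resolved = bench_resolve(plan)
    print(run_and_report(resolved, model=lambda x: "4"))
```

The design choice worth noting is that the report carries per-sample evidence alongside the aggregate score, which is what makes the workflow traceable rather than a black-box scalar.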
Community
The following similar papers were recommended by the Semantic Scholar API:
- EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair (2026)
- SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks (2026)
- Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking (2026)
- Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent (2026)
- TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces (2026)
- Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems (2026)
- Agent-Based Software Artifact Evaluation (2026)
