PatronusAI/Qwen3-4B-Instruct-2507-CE-s39T-GPT41Tea-notR-L2-M-Ep1-6e-5-Q32-65536-1534Feb14
4B • Updated • 48
LLM Evaluation
Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments