Models That Know How Evaluations Are Designed Score Safer Paper • 2605.28591 • Published 6 days ago • 6
stefanocarrera/autophagycode_D_he_train-mercury_Qwen3-8B_strategy_trust_t1.25_g6_run1 Viewer • Updated 4 days ago • 164 • 24 • 1
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise Paper • 2602.12783 • Published Feb 13 • 246
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers Paper • 2603.24414 • Published Mar 25 • 183
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models Paper • 2603.16859 • Published Mar 17 • 248