News
The Allen Institute for AI updated its reward model evaluation benchmark, RewardBench, to better reflect real-life scenarios for enterprises.
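For a rough sense of what a reward-model benchmark like RewardBench measures, here is a minimal Python sketch of the pairwise-accuracy idea: a reward model should score a "chosen" response above a "rejected" one for the same prompt. The `score_fn` and toy data below are placeholders for illustration, not Ai2's actual implementation.

```python
# Pairwise accuracy: fraction of (prompt, chosen, rejected) triples where the
# reward model scores the chosen response higher than the rejected one.
# score_fn is a stand-in for a real reward model's scoring function.

def pairwise_accuracy(pairs, score_fn):
    correct = 0
    total = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if score_fn(prompt, chosen) > score_fn(prompt, rejected):
            correct += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Toy example with a dummy "reward" (response length), purely illustrative.
    toy_pairs = [
        ("Summarize the report.", "A concise, accurate summary.", "Unrelated text."),
    ]
    print(pairwise_accuracy(toy_pairs, lambda prompt, response: len(response)))
```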
The study’s co-authors first surveyed academic literature to establish an overview of the harms and risks models pose today, and the state of existing AI model evaluations. They then interviewed ...
The latest version of the R1 model reportedly performs just below OpenAI's o3 and o4-mini, based on evaluations by ...
OpenAI is moving to publish the results of its internal AI model safety evaluations more regularly, in what the company says is an effort to increase transparency. On Wednesday, OpenAI launched ...
and many others are missing a unified source of truth for model benchmarking." As a result, OpenAI will now work with multiple companies across each industry to develop those evaluations ...
Model evaluations have found only moderate improvements over previous versions, sparking concerns about the rate of advancement we can expect in future generative AI models. OpenAI employees who ...
The goal of these agentic evaluations is to measure how capable models are at acting autonomously. Ideally, the problem is unique so the model hasn't simply memorized the solution from its ...