News
The Allen Institute for AI updated its reward model evaluation benchmark, RewardBench, to better reflect real-life scenarios for enterprises.
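For a rough sense of what a reward-model benchmark like RewardBench measures, here is a minimal Python sketch of the pairwise-accuracy idea: a reward model should score a "chosen" response above a "rejected" one for the same prompt. The `score_fn` and toy data below are placeholders for illustration, not Ai2's actual implementation.

```python
# Pairwise accuracy: fraction of (prompt, chosen, rejected) triples where the
# reward model scores the chosen response higher than the rejected one.
# score_fn is a stand-in for a real reward model's scoring function.

def pairwise_accuracy(pairs, score_fn):
    correct = 0
    total = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if score_fn(prompt, chosen) > score_fn(prompt, rejected):
            correct += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Toy example with a dummy "reward" (response length), purely illustrative.
    toy_pairs = [
        ("Summarize the report.", "A concise, accurate summary.", "Unrelated text."),
    ]
    print(pairwise_accuracy(toy_pairs, lambda prompt, response: len(response)))
```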
The study’s co-authors first surveyed academic literature to establish an overview of the harms and risks models pose today, and the state of existing AI model evaluations. They then interviewed ...
The latest version of the R1 model reportedly performs just below OpenAI's o3 and o4-mini, based on evaluations by ...
OpenAI is moving to publish the results of its internal AI model safety evaluations more regularly, in what the company says is an effort to increase transparency. On Wednesday, OpenAI launched ...
and many others are missing a unified source of truth for model benchmarking." As a result, OpenAI will now work with multiple companies across each industry to develop those evaluations ...
Model evaluations have found only moderate improvements over previous versions, sparking concerns about the rate of advancement we can expect in future generative AI models. OpenAI employees who ...
The goal of these agentic evaluations is to measure how capable models are at acting autonomously. Ideally, the problem is unique so the model hasn't simply memorized the solution from its ...