A study by the Oxford Internet Institute, conducted in collaboration with more than thirty institutions, analyzes 445 benchmarks used to evaluate artificial intelligence (AI). The researchers emphasize that many of these tests lack scientific rigor and do not accurately measure the abilities they claim to assess.
For example, some benchmarks do not clearly define the competencies they are meant to evaluate, while others reuse data from earlier tests, which undermines the reliability of the results. Adam Mahdi, one of the lead authors, warns that these shortcomings can distort the perception of AI progress. The study proposes eight recommendations for building more transparent and reliable benchmarks, including clearly defining the purpose of each test and using more representative task sets.
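To make the data-reuse problem concrete, here is a minimal sketch, not code from the study, of one way overlap between a new benchmark and an older one might be estimated: each item is normalized and the fraction of new items that already appear in the older set is counted. The function names and toy datasets below are illustrative assumptions.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide reuse."""
    return " ".join(text.lower().split())

def overlap_rate(new_items: list[str], old_items: list[str]) -> float:
    """Fraction of items in a new benchmark that duplicate items from an older one."""
    if not new_items:
        return 0.0
    old_set = {normalize(item) for item in old_items}
    reused = sum(1 for item in new_items if normalize(item) in old_set)
    return reused / len(new_items)

# Toy example: two of the three "new" questions are recycled from the older set.
old_benchmark = ["What is 2 + 2?", "Name the capital of France."]
new_benchmark = ["WHAT IS 2 + 2?", "Name the capital of France.", "Define entropy."]
print(f"Reused items: {overlap_rate(new_benchmark, old_benchmark):.0%}")
```

A high overlap rate would suggest that scores on the new benchmark partly reflect previously seen material rather than the ability the test claims to measure.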