Humanity’s Last Exam: A Bold New Initiative to Challenge AI Systems

A groundbreaking initiative called Humanity’s Last Exam has been launched. Spearheaded by the Center for AI Safety (CAIS) and Scale AI, the ambitious project seeks to create the world’s most challenging public AI benchmark from expert-written questions spanning a wide range of fields.

According to Dan Hendrycks, the director of CAIS, the initiative marks a significant leap in AI evaluation methodologies. “We are collecting the hardest and broadest set of questions ever to evaluate how close we are to achieving expert-level AI across diverse domains,” he stated. Technology experts are invited to submit their most difficult questions by November 1st, with a total prize pool of $500,000 available for selected contributions.

The initiative encourages submissions from individuals with over five years of experience in a technical field, or from those who hold or are pursuing a PhD. Participants whose questions are selected will receive monetary rewards and will also be credited as co-authors on the research paper accompanying the new dataset. The top 50 submissions will earn $5,000 each, and the next 500 accepted questions will receive $500 each, fostering competition and innovation within the AI community.
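
For reference, those two award tiers together account exactly for the announced $500,000 prize pool:

50 × $5,000 + 500 × $500 = $250,000 + $250,000 = $500,000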

Scale AI, a San Francisco-based company known for providing labeled training data for AI applications, emphasizes the necessity of this initiative: current benchmarks have become too easy for advanced AI models, making more rigorous evaluations essential. OpenAI’s latest model, o1 (code-named “Strawberry”), released in September, has already demonstrated near-ceiling performance on existing benchmarks, underscoring the urgency of more challenging assessments.

The guidelines for submissions are stringent: all entries must be original, challenging, objective, and self-contained, and questions may span a wide variety of fields. Notably, the initiative prohibits questions related to sensitive subjects, such as weapons of mass destruction or cyber warfare, keeping the focus on constructive and safe inquiry.

Through its commitment to AI safety and rigorous evaluation methods, Scale AI aims to distinguish models that merely excel on basic assessments from those that can contribute to advanced research and problem-solving. As AI technology continues to evolve, initiatives like Humanity’s Last Exam are vital for pushing the boundaries of what these systems can achieve.

For those interested in participating, detailed submission guidelines and information can be found on the official website. As the AI landscape shifts, Humanity’s Last Exam represents a pivotal step toward developing robust and effective benchmarks for the next generation of intelligent systems.