People are benchmarking AI by having it make balls bounce in rotating shapes

The list of informal, weird AI benchmarks keeps growing.

Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and make sure that the ball stays within the shape.”

Some models manage better on this “ball in rotating shape” benchmark than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 wiped the floor with OpenAI’s o1 pro mode, which costs $200 per month as part of OpenAI’s ChatGPT Pro plan.

Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.

But what does it prove that an AI can or can’t code a rotating, ball-containing shape?

Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (e.g. a ball and the side of a shape) collide. Poorly written algorithms can affect the simulation’s performance or lead to obvious physics mistakes.
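To make the idea concrete, here is a minimal sketch of one common way to detect and resolve a collision between a ball and a single straight wall. It is an illustration only, not the code any of the models produced; the function names (`closest_point_on_segment`, `collide_ball_with_wall`) and the tuple-based vectors are assumptions made for this example.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on the segment a-b that is closest to point p."""
    ax, ay = a
    bx, by = b
    abx, aby = bx - ax, by - ay
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:
        return a  # degenerate wall: both endpoints coincide
    # Project p onto the wall's line, then clamp the result to the segment.
    t = ((p[0] - ax) * abx + (p[1] - ay) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * abx, ay + t * aby)

def collide_ball_with_wall(pos, vel, radius, a, b):
    """Check a ball against one wall; returns (pos, vel, hit).

    On a hit, the velocity is mirrored about the wall normal and the
    ball is pushed out so it no longer overlaps the wall.
    """
    cx, cy = closest_point_on_segment(pos, a, b)
    dx, dy = pos[0] - cx, pos[1] - cy
    dist = math.hypot(dx, dy)
    if dist >= radius or dist == 0.0:
        return pos, vel, False  # no overlap (or no usable normal)
    nx, ny = dx / dist, dy / dist  # unit normal pointing from wall to ball
    vn = vel[0] * nx + vel[1] * ny
    if vn < 0:  # only reflect if the ball is moving into the wall
        vel = (vel[0] - 2 * vn * nx, vel[1] - 2 * vn * ny)
    overlap = radius - dist
    pos = (pos[0] + overlap * nx, pos[1] + overlap * ny)
    return pos, vel, True
```

A simulation would run this check against every side of the shape on each frame. Skipping the push-out step, or reflecting a ball that is already moving away from the wall, is the kind of mistake that can let a ball tunnel through a side.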

X user n8programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” n8programs explained in a post.
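The bookkeeping n8programs describes might look something like the sketch below, which converts points between the fixed "world" frame and the rotating shape's own frame, and accounts for the fact that a rotating wall is itself moving. This is not his code; every helper name here is hypothetical, and the shape is assumed to be a regular polygon spinning about the origin.

```python
import math

def rotate(v, angle):
    """Rotate 2D vector v counterclockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def polygon_vertices(sides, circumradius, angle):
    """World-space vertices of a regular polygon centered on the origin."""
    step = 2 * math.pi / sides
    return [
        rotate((circumradius * math.cos(i * step),
                circumradius * math.sin(i * step)), angle)
        for i in range(sides)
    ]

def world_to_shape(p, angle):
    """Express a world-space point in the rotating shape's frame."""
    return rotate(p, -angle)

def shape_to_world(p, angle):
    """Express a shape-frame point in world space."""
    return rotate(p, angle)

def wall_velocity(point, omega):
    """Velocity of a wall point rotating about the origin at omega rad/s."""
    return (-omega * point[1], omega * point[0])

def reflect_off_moving_wall(vel, normal, wall_vel):
    """Reflect the ball's velocity relative to a moving wall.

    `normal` is a unit vector pointing from the wall toward the ball.
    """
    rel = (vel[0] - wall_vel[0], vel[1] - wall_vel[1])
    vn = rel[0] * normal[0] + rel[1] * normal[1]
    if vn >= 0:
        return vel  # already separating from the wall
    rel = (rel[0] - 2 * vn * normal[0], rel[1] - 2 * vn * normal[1])
    return (rel[0] + wall_vel[0], rel[1] + wall_vel[1])
```

For example, `polygon_vertices(7, 200.0, angle)` would give the corners of a heptagon for the current frame. Reflecting the velocity relative to the wall, rather than the raw world velocity, is one way to handle the fact that the walls keep moving between frames.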

But while bouncing balls and rotating shapes are a reasonable test of programming skills, they’re not a very empirical AI benchmark. Even slight variations in the prompt can, and do, yield different outcomes. That’s why some users on X report having more luck with o1, while others say that R1 falls short.

If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It’s often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren’t relevant to most people.

Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare — and in the meantime watch GIFs of balls bouncing in rotating shapes.


