What Messaging to Extraterrestrial Intelligence Can Teach Us About ASI Benchmarks
January 28, 2026
Written by Jumyung Park

Designing benchmarks for artificial superintelligence (ASI) is often framed as a technical problem: how do we create tasks that measure abstraction, reasoning, and generalization rather than memorization or scale? But there is a deeper, rarely acknowledged issue underneath this effort—communication with an unknown intelligence.

In this sense, ASI benchmark design has more in common with Messaging to Extraterrestrial Intelligence (METI) than with traditional evaluation. When we design a benchmark, we are effectively sending a “message” to an intelligence whose perception, priors, embodiment, and learning dynamics we do not fully understand. If the benchmark is misunderstood, it may fail not because the agent lacks intelligence, but because we failed to communicate.

This post argues that many current benchmarks—ARC included—implicitly rely on unspoken human priors, and that eliminating such priors is neither possible nor desirable. Instead, we should make those priors explicit, complete, and portable, much like the design philosophy behind humanity’s attempts to communicate with hypothetical alien intelligences.

The Impossibility of Prior-Free Benchmarks

A common aspiration in benchmark design is to minimize or eliminate “human priors.” The motivation is clear: if humans solve tasks intuitively while machines struggle, perhaps the benchmark captures something fundamental about intelligence.

I’m skeptical of this framing.

If humans can solve a benchmark intuitively, that intuition necessarily draws on lived experience, embodied interaction, and culturally mediated abstractions. There is no such thing as a task that humans solve easily without any human-grounded priors. Conversely, if we construct a truly “AI-native” benchmark—one stripped of human concepts and experience—humans will struggle significantly.

This is not a bug; it’s a consequence of having only one known instance of general intelligence: humans.

ARC attempted to address this by appealing to “core knowledge priors.” However, these priors are abstract, not explicitly specified, and never shown to be sufficient or complete for the task distribution. As a result, it is unclear which priors are actually being tested in any given task, or whether an agent that fails lacks reasoning capacity or is simply missing an unstated assumption.

METI as a Design Lens

One of the most serious attempts at describing knowledge without shared context comes from METI—specifically, the Voyager Golden Record and the Arecibo message.

The Voyager Golden Record includes engraved diagrams explaining how to play the record and interpret its contents, intended for an intelligence with no shared language, culture, or biology. Every design choice mattered. One famous example is the deliberate avoidance of arrows in the diagrams. Arrows feel universal, yet they encode assumptions about motion, directionality, and objects traveling through a physical medium, assumptions that may not hold for non-human intelligences.


This illustrates a crucial lesson: even our most basic symbols smuggle in priors.

The designers of the Golden Record treated communication as an adversarial problem against hidden assumptions. That mindset is strikingly absent from most AI benchmark design.

Benchmarks as Low-Bandwidth Messages to Alien Intelligence

What if we treated benchmark deployment as sending a message across 50,000 light-years, with no chance to clarify it after it leaves?

In this framing:

  • The benchmark designer is the sender.
  • The benchmark itself is the message.
  • The evaluated model is the alien intelligence.
  • Failure is ambiguous: misunderstanding or lack of intelligence?

ARC already partially simulates this constraint. In ARC-3, for example, actions are costly, creating an extremely low-bandwidth interaction loop. But low bandwidth alone does not guarantee clarity. If the interface, task representation, or interaction mechanics leak human experience, we are no longer testing general intelligence—we are testing exposure to human conventions.
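The costly-action constraint can be made concrete as a hard budget wrapped around an otherwise opaque environment. The sketch below is a generic interface, not the actual ARC-3 API; `step`, `ToyEnv`, and the budget mechanics are all illustrative assumptions.

```python
class BudgetedEnv:
    """Wrap an environment so every action spends from a fixed budget.

    The `step` interface here is a generic RL-style sketch; the real
    ARC-3 interaction API may differ.
    """

    def __init__(self, env, budget: int):
        self.env = env
        self.remaining = budget

    def step(self, action):
        if self.remaining <= 0:
            raise RuntimeError("action budget exhausted")
        self.remaining -= 1
        return self.env.step(action)


class ToyEnv:
    """Stand-in environment: just echoes the action back."""

    def step(self, action):
        return action


wrapped = BudgetedEnv(ToyEnv(), budget=3)
wrapped.step("poke")  # one unit of bandwidth spent, two remain
```

The point of the wrapper is that scarcity alone says nothing about clarity: an agent can burn its entire budget productively or waste it probing conventions the designer never declared.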

This raises a concern: ARC-3 may be less self-contained than ARC-1 or ARC-2 due to UI and interaction assumptions. A comparative study between digitally native users and those unfamiliar with modern interfaces might reveal performance gaps unrelated to reasoning ability.

Explicit Priors, Not Hidden Ones

Eliminating human priors is impossible—and arguably undesirable. The real problem is unstated priors.

Instead of pretending benchmarks are neutral, we should:

  1. Clearly define the priors being assumed.
  2. Prove (or at least argue) that these priors are sufficient and complete for the task set.
  3. Ship them as an explicit “attachment” alongside the benchmark.

In other words, an intelligent alien should be able to:

  • Read the priors,
  • Learn from the provided training tasks,
  • And solve the benchmark without guessing what the designer “meant.”
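One way to make such an “attachment” concrete is a machine-readable priors manifest that ships with the benchmark, so that undeclared assumptions in individual tasks can be audited mechanically. Everything below is a hypothetical sketch: the field names, prior tags, and `audit_task` helper are illustrative, not part of any real ARC release.

```python
from dataclasses import dataclass, field


@dataclass
class PriorsManifest:
    """Declares the priors a benchmark assumes. All names are illustrative."""

    benchmark: str
    priors: list[str]          # e.g. "objectness", "symmetry"
    sufficiency_argument: str  # why these priors suffice for the task set
    excluded: list[str] = field(default_factory=list)  # deliberately NOT assumed


def audit_task(task_tags: set[str], manifest: PriorsManifest) -> set[str]:
    """Return any priors a task relies on that the manifest never declared."""
    return task_tags - set(manifest.priors)


manifest = PriorsManifest(
    benchmark="hypothetical-arc-like",
    priors=["objectness", "symmetry", "counting"],
    sufficiency_argument="every training task is solvable from these priors alone",
    excluded=["natural language", "UI conventions"],
)

# A task tagged with an undeclared prior is flagged as leaking hidden assumptions.
leaks = audit_task({"objectness", "left-to-right reading order"}, manifest)
```

A manifest like this does not remove priors; it converts them from silent assumptions into published, contestable claims.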

This shifts competition away from exploiting human-privileged tricks and toward genuine reasoning under declared assumptions.

Notably, the dominance of LLMs in ARC-1 and ARC-2 may reflect precisely this issue: these benchmarks inadvertently rewarded models that are best at capturing subtle human abstractions. The Abstraction and Reasoning Corpus filtered out models that failed to align with those abstractions, leaving behind systems optimized for human concept inference rather than general intelligence.

Towards Benchmarks for Benchmarks

One provocative idea is to evaluate benchmarks themselves.

Imagine testing a benchmark against multiple “alien” intelligences:

  • agents with non-visual perception,
  • symbolic solvers without spatial intuition,
  • or humans given radically constrained instructions.

If only one narrow class of intelligence can solve the benchmark, that suggests the benchmark is not a closed system but one that leaks designer-specific assumptions.
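This leakage test can be sketched as a small evaluation harness: run the same task set past several solver classes and flag the benchmark if exactly one class clears a pass threshold. The solver names, the `solve(task) -> bool` interface, and the threshold are all assumptions made for illustration.

```python
from typing import Callable

Task = dict
Solver = Callable[[Task], bool]  # True if the solver produces a correct answer


def leakage_score(tasks: list[Task], solvers: dict[str, Solver]) -> dict[str, float]:
    """Per-solver-class pass rate over the task set."""
    return {name: sum(s(t) for t in tasks) / len(tasks) for name, s in solvers.items()}


def leaks_designer_priors(rates: dict[str, float], threshold: float = 0.5) -> bool:
    """If exactly one class clears the threshold, the benchmark likely
    encodes that class's priors rather than testing general reasoning."""
    return sum(r >= threshold for r in rates.values()) == 1


# Toy solvers standing in for radically different kinds of intelligence.
tasks = [{"id": i} for i in range(4)]
solvers = {
    "visual-human-proxy": lambda t: True,          # solves everything
    "symbolic-no-spatial": lambda t: t["id"] == 0,  # solves one task
    "non-visual": lambda t: False,                  # solves nothing
}
rates = leakage_score(tasks, solvers)
```

Here only the human-proxy class passes, so the harness would flag the benchmark as leaking designer-specific assumptions rather than measuring something general.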

This is analogous to asking whether the Golden Record would actually be interpretable by a non-human intelligence—or only by humans projecting themselves into the alien’s role.

Conclusion

Designing ASI benchmarks is not just an engineering challenge; it is a problem in semiotics, communication, and epistemic humility.

If we are serious about measuring general intelligence, we must think like the designers of the Voyager Golden Record: paranoid about hidden assumptions, explicit about priors, and honest about what our messages do—and do not—convey.

Perhaps the real question is not whether an ASI can solve our benchmarks, but whether we have learned how to speak clearly to an intelligence that does not already think like us.