1. Introduction: ARC as a Research Loss Function
Since its introduction in *On the Measure of Intelligence*, the Abstraction and Reasoning Corpus (ARC) has played a unique role in AI research. It has resisted shortcuts, discouraged superficial pattern recognition, and provided a focused interface for evaluating abstraction and reasoning (abstraction first, then reasoning). ARC sits in a rare category of benchmarks that intentionally shape the direction of a field.
One helpful analogy, which I want to make explicit, is that a benchmark behaves like a loss function for the research community.
Benchmarks define what “doing well” means, and researchers, collectively, move downhill on the loss landscape that benchmark creates. A well-designed benchmark can gently steer a field toward productive directions. A sparsely informative benchmark can still be valuable—it can serve as an indicator or detector of progress—even if it provides fewer gradients for incremental improvement. ARC has often functioned in this latter mode: a high-level signal rather than a step-by-step guide.
As ARC has evolved into ARC-Prize and ARC-2, it has remained unusually resilient. It rarely rewards marginal architectural tweaks or overfitting to incidental patterns. Instead, it seems to reward only approaches that move meaningfully closer to its underlying goal: the efficient acquisition of new skills under severe data scarcity.
This makes ARC a powerful benchmark, but also one that is difficult to interpret, extend, or use diagnostically. This post reflects on why ARC exhibits this robustness, what its limitations are, and how one might extend it—gently and naturally—toward a version that aligns more directly with its original philosophy.
2. What Makes ARC So Robust to Contemporary Models?
2.1 The difficulty may lie more in abstraction than reasoning
ARC is often described as a reasoning benchmark, but a growing body of work—and personal observation—suggests that the real bottleneck is abstraction. Once the correct abstraction or representation is found, the reasoning required is often quite straightforward. Conversely, incorrect abstractions lead quickly and irreversibly into dead ends.
This distinction helps explain several ARC phenomena:
- The dominance of LLM-based systems on leaderboards
- The limited impact of purely algorithmic or search-based approaches
- The plateau of methods that rely heavily on hand-designed DSLs
LLMs appear to excel not because ARC directly matches natural language, but because they carry broad, flexible priors that help stabilize abstraction under ambiguity. ARC tasks tend to reward models that can tolerate incomplete information and still form coherent hypotheses.
2.2 ARC’s intended “closedness” and the challenge of core knowledge
ARC was designed around a set of “core knowledge priors,” with the ideal that a solver should rely only on these priors and the examples given. In practice, it remains an open question how fully self-contained ARC truly is. Current best-performing methods often make heavy use of pretrained models, and this introduces powerful abstractions not originally envisioned.
This is not necessarily problematic, but it highlights a subtle tension: ARC itself is carefully specified, but the knowledge injected into solvers is not. The challenge of ensuring closedness may simply be much harder than assumed, especially given modern model capabilities.
3. ARC’s Measurement: Entanglement and Sparsity
Chollet’s definition of intelligence emphasizes efficiency of skill acquisition, especially under limited experience. ARC encodes this strongly—it constrains experience to just a few examples. But when it comes to measurement, ARC mostly observes whether a task is solved, and little else.
This leads to what I would call an entangled measure:
- Compute constraints
- Representational choices
- Pretrained priors
- Experience efficiency
…all intertwine to produce a single binary outcome. ARC is extremely stringent and elegant in what it permits, but the information it returns is sparse.
Here again the benchmark–loss-function analogy is useful: ARC defines a loss landscape of wide plateaus broken by a few narrow, deep minima. It serves as a detector of important advances but provides almost no intermediate gradients to guide incremental improvement.
This is not a flaw in ARC—it reflects its design philosophy—but it does shape the type of work that tends to succeed and the types of research signals it provides (or withholds).
4. The Missing Dimension: Agency and Active Querying
One of the most interesting consequences of ARC’s design is that the agent has no control over what data it receives. The experience is fixed. Yet in many real learning settings—scientific inquiry, robotics, human learning—the crucial step is choosing which experience to gather next.
ARC therefore measures a form of intelligence that is passive by design. But this means ARC cannot measure:
- experiment design
- hypothesis-driven querying
- active elimination of possibilities
- or efficient exploration of ambiguous rule spaces
These are all meaningful components of skill acquisition efficiency. Their absence may partially explain why RL-style or agentic approaches do not currently appear on ARC leaderboards: ARC simply does not expose the dimensions these methods specialize in.
This observation is not a critique of ARC; rather it suggests a natural direction for extension.
5. Introducing Active-ARC
Active-ARC is a proposal that aims to preserve the central ARC philosophy while expanding the measured dimensions of intelligence. The idea is straightforward:
- The agent is given one input-output example of a task.
- It may propose additional input grids of its own choice.
- An oracle returns the output according to the hidden rule.
- Each query incurs a small cost (with optional complexity-based penalties).
- When ready, the agent attempts the test input.
- Performance is measured both by correctness and by query efficiency.
This transforms ARC from a purely passive setting into a mild form of active learning, where skill acquisition efficiency can be measured more directly and with finer resolution; a minimal sketch of the interaction loop is given below.
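To make the protocol concrete, here is a minimal Python sketch of a single episode. The class name, the flat query cost, and the scoring rule are illustrative assumptions rather than part of any specification; the fixed ingredients are only the single demonstration, the query-answer loop, and the penalized final attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Grid = List[List[int]]  # ARC grids: 2-D lists of color indices 0-9


@dataclass
class ActiveArcEpisode:
    """Minimal sketch of one Active-ARC episode (names are illustrative)."""
    demo: Tuple[Grid, Grid]              # the single input-output example shown up front
    hidden_rule: Callable[[Grid], Grid]  # the oracle's transformation, never exposed directly
    query_cost: float = 0.1              # flat cost charged per oracle query
    queries: List[Tuple[Grid, Grid]] = field(default_factory=list)

    def query(self, proposed_input: Grid) -> Grid:
        """Agent submits an input grid of its choice; the oracle applies the hidden rule."""
        output = self.hidden_rule(proposed_input)
        self.queries.append((proposed_input, output))
        return output

    def score(self, test_input: Grid, predicted_output: Grid) -> float:
        """Correctness minus a penalty proportional to the number of queries used."""
        correct = float(predicted_output == self.hidden_rule(test_input))
        return correct - self.query_cost * len(self.queries)
```

A complexity-based penalty, as mentioned above, would simply replace the flat `query_cost * len(self.queries)` term with a sum of per-query costs that grow with the size or intricacy of each proposed grid.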
5.1 Why this extension is aligned with ARC’s philosophy
Active-ARC does not introduce new object types, new task domains, or new cognitive requirements. It remains within the symbolic, abstract world of ARC-1 and ARC-2, where the task rules stay implicit in the examples. What changes is the agent’s ability to seek clarifying information.
From an information-theoretic standpoint, Active-ARC allows us to measure how efficiently an agent reduces uncertainty in the hypothesis space using the queries available. This aligns with both the spirit of ARC and the broader interpretation of intelligence as efficient model-building under limited experience.
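One way to make this measurable (a hedged illustration, not part of the proposal itself) is to score an agent by the average reduction in uncertainty over the hypothesis space per query:

$$
\text{efficiency} = \frac{H(R \mid D_0) - H(R \mid D_0, (x_1, y_1), \ldots, (x_k, y_k))}{k},
$$

where $R$ is the hidden rule, $D_0$ is the initial example, $(x_i, y_i)$ are the agent’s chosen query grids and the oracle’s answers, and $k$ is the number of queries. In practice $H$ would have to be approximated, for example over a DSL-defined hypothesis space, but the quantity captures exactly what Active-ARC wants to reward: large uncertainty reduction from few, cheap queries.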
5.2 Not ARC-3, but something like ARC-2.5
Compared to ARC-3, which introduces sequential environments and interaction, Active-ARC is deliberately more modest. It retains the “abstract transformations” character central to ARC-1/2, while adding a minimal action interface. In this sense, Active-ARC can be viewed as a bridge—a step toward interactive intelligence measurement without leaving ARC’s symbolic domain.
6. Practical Considerations
Implementing Active-ARC raises several design questions:
- Oracle construction: ensuring no unintended information leakage
- Handling undefined inputs: choosing a principled, consistent behavior
- Anti-gaming: avoiding overly complex mechanisms that become secondary oracles
- Human baselines: grounding the metric by measuring human query efficiency
A prototype implementation is feasible, especially using existing DSL-based enumerations of ARC tasks; a sketch of one possible oracle wrapper is given below. Careful engineering is required, but the conceptual challenges seem surmountable.
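As a rough illustration of the first two points, the sketch below wraps a task rule (assumed to come from such a DSL enumeration; `make_oracle` and its defaults are hypothetical) so that malformed or undefined inputs all produce a single uniform response rather than informative errors:

```python
from typing import Callable, List, Optional

Grid = List[List[int]]


def make_oracle(rule: Callable[[Grid], Grid],
                max_size: int = 30) -> Callable[[Grid], Optional[Grid]]:
    """Wrap a DSL-defined task rule as a query oracle (illustrative sketch).

    - Rejects malformed or oversized grids instead of raising, so error
      messages cannot leak information about the hidden rule.
    - Returns None for inputs on which the rule is undefined, making the
      "undefined input" behavior explicit and consistent across tasks.
    """
    def oracle(proposed_input: Grid) -> Optional[Grid]:
        if not proposed_input or len(proposed_input) > max_size:
            return None
        if any(len(row) != len(proposed_input[0]) or len(row) > max_size
               for row in proposed_input):
            return None
        if any(cell not in range(10) for row in proposed_input for cell in row):
            return None
        try:
            return rule(proposed_input)
        except Exception:
            # Collapse all failures to the same signal to avoid a side channel.
            return None
    return oracle
```

Collapsing every failure mode into the same `None` response is a deliberate design choice here: differentiated error messages would themselves become a secondary information channel about the hidden rule.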
7. Conclusion: Toward Benchmarks With More Informative Signals
ARC remains one of the most carefully constructed benchmarks in AI, notable not only for its difficulty but for its clarity of purpose. It has pushed the community to grapple with abstraction more seriously and exposed where contemporary methods still fall short.
However, as a research loss function, ARC provides very sparse gradients. It tells us when we have succeeded, but not much about the space between. Active-ARC aims to provide a slightly smoother landscape—still rigorous, still abstract, but more diagnostic and more aligned with the measurement of skill acquisition efficiency.
This proposal is not meant as a replacement for ARC, nor an attempt to redefine its philosophy. Instead, it is an invitation to explore how ARC’s central ideas might extend into the realm of active querying, while keeping the abstraction-and-reasoning spirit intact.
If intelligence is partly the ability to choose the right experiment, then allowing ARC agents to act—carefully, minimally—may help us study that dimension more directly, and perhaps bring us a little closer to benchmarks that reflect the full richness of learning.