Why Program Synthesis Approaches Have Struggled
July 28, 2025
Written by
Jumyung Park
Program Synthesis
DreamCoder
Neuro-Symbolic

Program synthesis seems like a natural fit for ARC: tasks are rule-based, compositional, and require transforming objects under explicit constraints. Many teams tried symbolic or hybrid neuro-symbolic methods early on. Yet despite the promise, these approaches have shown limited progress compared to neural or LLM-based models.

Two things appear repeatedly in the history of ARC research:

  1. Program synthesis is fundamentally hard to engineer.
  2. We lack the supporting tools and frameworks that made deep learning flourish.

This post breaks down why.

The Limits of Program Synthesis on ARC

Most program-synthesis efforts begin with a hand-designed DSL—Hodel’s DSL being the most widely used—and treat ARC as a search problem over program space. Researchers then apply the standard toolbox: heuristics, pruning, parallelization, caching, etc.
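The search framing can be made concrete with a toy sketch: enumerate compositions of a few hand-written grid primitives and return the first program consistent with every input/output example. The primitives here are illustrative stand-ins, not Hodel's actual DSL.

```python
# A minimal sketch of ARC-as-search: enumerate compositions of DSL
# primitives and return the first program consistent with all examples.
from itertools import product

def rot90(g):    return tuple(zip(*g[::-1]))  # rotate grid 90° clockwise
def hmirror(g):  return tuple(g[::-1])        # flip grid upside down
def identity(g): return g

PRIMITIVES = [rot90, hmirror, identity]

def run(chain, grid):
    """Apply a sequence of primitives to a grid."""
    for f in chain:
        grid = f(grid)
    return grid

def synthesize(examples, max_depth=3):
    """Breadth-first: try all programs of depth 1, then 2, and so on."""
    for depth in range(1, max_depth + 1):
        for chain in product(PRIMITIVES, repeat=depth):
            if all(run(chain, i) == o for i, o in examples):
                return chain
    return None

# Task: the output grid is the input rotated 180°.
examples = [(((1, 2), (3, 4)), ((4, 3), (2, 1)))]
prog = synthesize(examples)
print([f.__name__ for f in prog])  # → ['rot90', 'rot90']
```

Everything that follows in this section is about why this simple picture stops working at realistic scale.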

The results: incremental improvements, but no breakthrough.

This outcome aligns with ARC’s design philosophy: the benchmark is intentionally resistant to brute-force or heuristic search.

Why progress stalls

1. The program space is enormous

Even well-engineered DSLs explode combinatorially. Search quickly becomes intractable, no matter how many heuristics you add.
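A back-of-the-envelope count makes the explosion concrete: with p primitives there are p^d programs of length d, so the space outgrows any search budget within a few composition steps.

```python
# Programs up to depth d over p primitives: p + p^2 + ... + p^d.
def search_space(num_primitives, depth):
    return sum(num_primitives ** d for d in range(1, depth + 1))

# Even a modest DSL passes a trillion candidates within a handful of steps.
for p in (10, 50, 100):
    for d in (3, 5, 8):
        print(f"{p} primitives, depth {d}: {search_space(p, d):,} programs")
```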

2. Designing a “good” DSL is brutally hard

A high-quality DSL must be:

  • Expressive: able to describe all transformations you care about.
  • Compact: not so expressive that it generates endless irrelevant programs.

Most DSLs end up expressive-but-not-compact—too many useless expressions for search algorithms to sift through.
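One way to see this failure mode is to deduplicate enumerated programs by their observed behavior on a probe input. In a redundant DSL, most syntactically distinct programs compute the same function (for example, four quarter-rotations equal the identity), so search keeps revisiting work it has already done.

```python
# Measure DSL redundancy: enumerate programs, then collapse them by the
# outputs they produce on a probe grid.
from itertools import product

def rot90(g):   return tuple(zip(*g[::-1]))
def hmirror(g): return tuple(g[::-1])

def behavior(chain, grids):
    """Fingerprint a program by its outputs on the probe grids."""
    out = []
    for g in grids:
        for f in chain:
            g = f(g)
        out.append(g)
    return tuple(out)

probe = [((1, 2), (3, 4))]
chains = list(product([rot90, hmirror], repeat=4))
distinct = {behavior(c, probe) for c in chains}
print(f"{len(chains)} programs, {len(distinct)} distinct behaviors")
# → 16 programs, 4 distinct behaviors
```

Observational-equivalence pruning like this is a standard mitigation, but it only trims the redundancy; it does not fix a DSL whose grammar generates it in the first place.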

And the DSL is fragile: small changes in primitive definitions can drastically alter the search space. Debugging these shifts is difficult, unintuitive, and often unpredictable.

3. Search is CPU-bound

Program enumeration and symbolic execution don’t benefit much from GPU parallelism. This makes scaling slow and expensive.
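The standard parallelization move is to shard the program space, for example by first primitive, so that each CPU core searches an independent subtree. The sketch below runs the shards through a plain serial map; swapping in multiprocessing.Pool.map is the usual next step, and note that nothing in this workload maps onto a GPU.

```python
# Shard the depth-3 program space by its first primitive. Each shard is an
# independent subtree, so shards can be farmed out to worker processes
# (e.g. via multiprocessing.Pool.map); a serial map is used here for
# simplicity.
from itertools import product

def rot90(g):    return tuple(zip(*g[::-1]))
def hmirror(g):  return tuple(g[::-1])
def identity(g): return g

PRIMITIVES = [rot90, hmirror, identity]
EXAMPLES = [(((1, 2), (3, 4)), ((4, 3), (2, 1)))]  # 180° rotation task

def search_shard(first, depth=3):
    """Search only programs whose first step is `first`."""
    hits = []
    for rest in product(PRIMITIVES, repeat=depth - 1):
        chain = (first,) + rest
        ok = True
        for inp, out in EXAMPLES:
            g = inp
            for f in chain:
                g = f(g)
            if g != out:
                ok = False
                break
        if ok:
            hits.append(tuple(f.__name__ for f in chain))
    return hits

# One shard per primitive; with Pool.map these would run on separate cores.
results = [hit for shard in map(search_shard, PRIMITIVES) for hit in shard]
print(results)
```

The inner loop is branchy pure-Python object manipulation: no tensor contractions, no batching, nothing a GPU accelerates. Throwing more CPU cores at it buys a linear speedup against an exponential space.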

4. DSL choice dominates results

Search quality is usually overshadowed by DSL quality. In practice:

“Success” is mostly determined by whether the DSL happens to model the right space efficiently.

This raises an important open question:

Can we isolate the effect of DSL design from the effect of search algorithms?

Traditional search methods have solid evaluation theory, but modern neural-guided search lacks such analysis.

5. Slow iteration kills idea exploration

Because DSLs are brittle and search is expensive, you can’t cheaply try 20 variations and see what happens. Each tweak requires deep engineering.

6. Hard, unglamorous engineering

The fastest solvers are written in C, OCaml, or other low-level languages, because Python struggles with these CPU-bound workloads. This raises the barrier for experimentation and drives researchers toward easier, more ergonomic alternatives—typically deep learning or LLMs.

DreamCoder: A Promising Idea That’s Too Hard to Build On

DreamCoder demonstrated a compelling neuro-symbolic system:

  • learns reusable program primitives
  • performs wake-sleep cycles
  • improves its own DSL over time
  • aligns closely with ARC’s philosophy of skill acquisition and abstraction
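The library-learning idea can be caricatured in a few lines: after solving some tasks, find the most frequent fragment across the solution programs and promote it to a new named primitive, compressing future search. Real DreamCoder compresses λ-calculus fragments under a Bayesian description-length objective; this sketch, with made-up primitive names, only shows the intuition.

```python
# Toy library learning: mine the most common contiguous fragment across
# solved programs and promote it to a new primitive.
from collections import Counter

solutions = [                      # programs as primitive-name sequences
    ("rot90", "rot90", "hmirror"),
    ("rot90", "rot90", "transpose"),
    ("crop", "rot90", "rot90"),
]

def fragments(prog, min_len=2):
    """All contiguous subsequences of a program, length >= min_len."""
    for i in range(len(prog)):
        for j in range(i + min_len, len(prog) + 1):
            yield prog[i:j]

counts = Counter(f for p in solutions for f in fragments(p))
fragment, freq = counts.most_common(1)[0]
print(f"promote {fragment} (seen {freq}x) as a new primitive")
```

Each promoted fragment shortens future programs, which shrinks the effective search depth: this is the self-improving DSL loop that makes DreamCoder align so well with ARC's skill-acquisition framing.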

Yet DreamCoder-style systems have not become mainstream in ARC research. There are few replications, few variants, and almost no modern extensions.

Why?

1. Neuro-symbolic engines are extremely difficult to engineer

Compared to stacking layers in PyTorch or calling an LLM API, building a performant symbolic system is:

  • slower
  • more fragile
  • more difficult to optimize
  • less documented
  • less supported by modern tooling

Deep learning benefits from enormous industrial investment in developer experience. Program synthesis does not.

2. Fast iteration is the engine of innovation

In software ecosystems, productivity is king. Web developers move fast because the tooling makes iteration cheap. Deep learning exploded because frameworks like PyTorch and TensorFlow:

  • abstracted away complexity
  • standardized workflows
  • made experimentation cheap
  • provided reusable layers, components, utilities

In program synthesis, we are still in the “NumPy era”:

  • many bespoke engines
  • few shared abstractions
  • no standardized interchange formats
  • no ecosystem of reusable blocks
  • heavy reliance on low-level languages

There is no “PyTorch for program synthesis.”

3. DreamCoder’s components are not reusable

Although DreamCoder contains many general ideas—library learning, neural guidance, DSL bootstrapping—they aren’t packaged as modular primitives. You can’t import a “DreamCoder library-learning layer” the way you import a PyTorch LSTM.

To build your own variant, you essentially have to reimplement everything from scratch. This discourages experimentation.

4. The community gravitates toward what’s easy to use

Approaches that package their work into reusable components get far more traction. LLM-based methods have skyrocketed partly because:

  • the API is trivial
  • experimenting requires minimal engineering
  • results are visible early
  • iteration is fast

Symbolic models lack this developer experience, so fewer people explore them—even when they’re conceptually aligned with ARC.

We Need a Program Synthesis Framework

Given the recurring lessons, the missing piece is obvious:

We need a shared, high-level program-synthesis framework that makes exploration cheap.

A framework that provides:

  • standard DSL tools
  • modular search algorithms
  • neural-guidance components
  • symbolic execution utilities
  • profiling and visualization tools
  • benchmarks and auto-evaluators
  • sane defaults, reusable patterns, clear APIs
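To make the wish concrete, here is a purely hypothetical sketch of what such an API might feel like. None of these classes exist in any real library; the point is that the DSL, the search strategy, and eventually neural guidance become swappable components, the way PyTorch makes layers and optimizers swappable.

```python
# Hypothetical framework API: a DSL is data, a search strategy is a
# pluggable object, and swapping either one is a one-line change.
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional

Grid = tuple  # placeholder grid type

@dataclass
class DSL:
    primitives: dict[str, Callable[[Grid], Grid]]

class BreadthFirst:
    """One interchangeable search strategy among many."""
    def __init__(self, max_depth: int = 3):
        self.max_depth = max_depth

    def run(self, dsl: DSL, examples) -> Optional[tuple[str, ...]]:
        names = list(dsl.primitives)
        for depth in range(1, self.max_depth + 1):
            for chain in product(names, repeat=depth):
                def apply(g):
                    for n in chain:
                        g = dsl.primitives[n](g)
                    return g
                if all(apply(i) == o for i, o in examples):
                    return chain
        return None

# Swapping the DSL or the strategy is one line, so ablations that isolate
# DSL quality from search quality become cheap.
dsl = DSL({"rot90": lambda g: tuple(zip(*g[::-1])),
           "hmirror": lambda g: tuple(g[::-1])})
task = [(((1, 2), (3, 4)), ((4, 3), (2, 1)))]
print(BreadthFirst(max_depth=2).run(dsl, task))  # → ('rot90', 'rot90')
```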

This would let researchers:

  • prototype new ideas quickly
  • swap components without rewriting everything
  • isolate the effects of DSL vs. search
  • build on top of each other’s work
  • lower the barrier for newcomers

DreamCoder showed the direction. We now need the tooling that allows the next thousand experiments to happen.

A “PyTorch for program synthesis” would radically accelerate progress—especially on benchmarks like ARC and ARC-2, where symbolic and compositional reasoning is central.