Autoresearch for SAT Solvers

An autonomous AI agent teaches itself to solve complex MaxSAT problems, discovering novel strategies and outperforming competition benchmarks without human guidance.

_ _ ____ _ _____
/ \ __ _ ___ _ __ | |_/ ___| / \|_ _|
/ _ \ / _` |/ _ \ '_ \| __\___ \ / _ \ | |
/ ___ \ (_| | __/ | | | |_ ___) / ___ \| |
/_/ \_\__, |\___|_| |_|\__|____/_/ \_\|_|
|___/

An autonomous AI agent that teaches itself to become the world's top expert on MaxSAT. Given 229 weighted MaxSAT instances from the 2024 MaxSAT Evaluation (main anytime weighted track), it discovers novel strategies, finds better solutions and iteratively refines its toolbox. No human guidance.

An AI agent (e.g. Claude Code) reads program.md

for instructions - It reads expert.md

for accumulated knowledge from prior runs - It reads the library for available tools

It runs solvers on instances, discovers what works, updates everything
It commits and pushes to this repo so other agents can build on its findings

┌─────────────────┐
│ GitHub Repo │
│ │
│ expert.md │
│ library/ │
│ best-solutions │
│ experiments.log │
└────────┬─────────┘
git pull/push │
┌─────────────┬────────┴────────┬─────────────┐
│ │ │ │
┌──────▼──────┐ ┌───▼────────┐ ┌──────▼──────┐ ...
│ VM 1 │ │ VM 2 │ │ VM 3 │
│ │ │ │ │ │
│ ┌─────────┐ │ │ ┌────────┐ │ │ ┌─────────┐ │
│ │ Agent 1 │ │ │ │Agent 3 │ │ │ │ Agent 5 │ │
│ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
│ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
│ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
│ ├─────────┤ │ │ ├────────┤ │ │ ├─────────┤ │
│ │ Agent 2 │ │ │ │Agent 4 │ │ │ │ Agent 6 │ │
│ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
│ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
│ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
│ └─────────┘ │ │ └────────┘ │ │ └─────────┘ │
└─────────────┘ └────────────┘ └─────────────┘
S = solver process (python)

# Launch on EC2 (handles everything: installs deps, clones repo,
# downloads benchmarks from Helsinki, launches agents in tmux)
./run.sh --host ec2-user@<ip> --agents 3

Requires a .env

file with CLAUDE_CODE_API_KEY

and GITHUB_ACCESS_TOKEN

. The API key is auto-refreshed from your local Claude Code login on each deploy.

Multiple agents can work on the same repo simultaneously, communicating through git — each agent pulls the latest solutions and expert knowledge, builds on what others found, and pushes its own improvements. No coordination needed beyond git pull

and git push

| Metric | Count | |---|---| | Instances solved | 220 / 229 | | Optimal (matching competition best) | 30 | Better than competition | 5 | | Novel solve (no known solution existed) | 1 | | Within 1.1x of reference | 123 | | Within 1.5x | 183 | | Within 2x | 209 | | Unsolved | 9 |

| Instance | Our cost | Competition best | Improvement | |---|---|---|---| | switchingactivity_74 | 10 | 16 | 37.5% better | | synplicate dag_run2_10_size_11 | 374 | 518 | 27.8% better | | synplicate dag_run2_16_size_9 | 333 | 398 | 16.3% better | | switchingactivity_68 | 8 | 9 | 11.1% better | | BTBNSL hailfinder_10000 | 49,986,819,152 | 50,007,681,202 | 0.04% better | | pseudoBoolean mod010 | 8,081 | no solution | novel solve |

| Instance | Ratio | Why it's hard | |---|---|---| | relational-inference pa-1 | 603x | 2.5M vars, 1.1M soft clauses | | 9.7x | 51K softs, 9.7M hard clauses | | | causal-discovery Water | 4.4x | 8.3M cost vs 1.9M ref | | timetabling test4 | 3.3x | 131K vars | | decision-tree tic-tac-toe | 2.6x | adaboost ensemble |

9 instances remain unsolved — mostly >16M variables or no known reference solution.

The agent developed these approaches autonomously, discovering what works through experimentation:

| Technique | Best for | Key insight | |---|---|---| Greedy SAT with selector variables | Few soft clauses (<500) | Heaviest-first greedy with CaDiCaL assumptions | Core-guided search | Unit soft clauses | Iterative UNSAT core relaxation. comp07.lp: 1778x → optimal | WPM1 core-guided | Weighted unit softs | Proper relaxation variables + at-most-one constraints | Biased-SAT | Breaking local optima | Random assumption subsets produce diverse solutions | Clause-weighting LS (SATLike) | Stuck at local optima | Dynamic weight adjustment escapes single-flip traps | Tabu search | No-hard / unit-soft instances | With SAT init + restarts. judgment-aggregation: 49x → 1.5x | Multi-init | Diverse starting points | Different solvers (CaDiCaL, glucose4, MiniCard) + random assumptions | Alternating CWLS + WalkSAT | Deep optimization | Alternating phases for continuous improvement. pa-1: 5445x → 612x | RC2 with CaDiCaL | Weighted unit softs | Previously dismissed, but solver='cd19' finds optimals. haplotyping-12: 3.8x → 1.01x |

All code the agent writes lives in library/

| Module | Functions | Purpose | |---|---|---| solvers.py | greedy_sat , tabu_search , multi_init , sat_init , walksat_hard , walksat_soft , sat_solve_with_timeout | Core solver building blocks | core_guided.py | core_guided_solve , core_guided_budget , wpm1_solve | UNSAT core-based optimization for unit soft clauses | clause_weight_ls.py | clause_weight_local_search | SATLike-inspired dynamic clause weighting | solutions.py | load_solutions , update_solution , get_best_costs | Compressed solution storage (1.7GB → 1.5MB) | wcnf_parser.py | parse_wcnf , evaluate_cost , check_hard_clauses | Single-pass streaming WCNF parser |

Low parallelism: Claude Code rarely launches more than 6 parallel scripts, and often runs just 1-2 at a time, leaving most cores idle on large machines.Tunnel vision: The agent can fixate on grinding one instance for hours (e.g. pa-1 from 5445x to 612x over many rounds) while ignoring easier wins elsewhere.Session length: Despite "never stop" instructions, the agent tends to wrap up after a few hours, deciding it has reached a natural stopping point.

The agent maintains expert.md

as a living knowledge base and improves the library as it learns.

Source: Hacker News