Thu, Jun 11 · 7:00 PM – Jun 18
AI Control Hackathon
AI Control Hackathon
A one-week AI safety hackathon on AI control: limiting the harm an untrusted AI system can cause while it operates, even when it tries to subvert the controls around it.
Co-hosted by HΩ and HackOS.
Schedule
- Thu June 11 7PM - Kickoff, with talk from Henri Lemoine (EquiStamp, Mila, McGill). - One week of hackathon, online, with option to work in-person at Ω Labs. - Thu June 18 7PM - Demos evening The problem
AI agents increasingly do software work with real access: reading databases, changing configuration, calling services, under less and less human review. The same access that makes an agent useful is what lets it cause harm, like leaking a secret, weakening a security control, or corrupting data, so locking it down to be safe also makes it useless. The more capable agents get, the less we can simply assume they are on our side.
AI control takes the pessimistic view: assume the capable agent might be working against you, and build the deployment so you stay safe regardless. The core tool is a second, weaker model you do trust, watching the capable one and flagging suspicious actions for the small fraction a human can audit. Can a limited, trusted monitor catch a more capable agent that is trying not to get caught?
How it works
Join one side:
- The Blue team builds the protocols and monitors that catch sabotage without raising false alarms on honest work. - The Red team hides sabotage inside coding work, adversarially bypassing the Blue team protocol. - The White team manages the game and environments the above two play in. The white team’s purpose is for the Control game to run smoothly and produce high signals. One number scores both sides: safety, the share of sabotage the monitor catches before it does harm, measured at a fixed audit budget. A human can audit only about 1% of the agent's actions, so what matters isn't how much a monitor flags but whether that 1% catches the attacks. Red drives the number down, whereas blue drives it up.
Note that actual iterative Blue team development requires some Red team work as well. For instance, while developing a protocol, it is essential to think of how this protocol could be defeated by a sufficiently clever adversary.
Tracks
🟦 Blue team
You build control defenses.
Track A: Monitor building
Build the monitor. You write the logic and prompt for a small, trusted model that scores how suspicious each agent action is, calibrated to flag about 1% of honest actions. The hard part is the budget: flagging everything catches every attack but is useless, so the skill is catching more while keeping false alarms low. Most of this needs no Docker: you work from released trajectory data, score your monitor, and read off a safety curve. Calibration is the trap: your safety number is only meaningful if the 1% threshold is set on a held-out honest set, not tuned on the attacks.
Example projects:
- Beat the baseline monitor's safety score on the released attack set at a 1% false-positive rate, and report which attack types drove the gain. - Combine the signal from multiple monitors to achieve a better cost-performance frontier than any one taken individually. - A passive monitor's power depends on what it sees. Find the ceiling of a passive monitor by feeding it more: the full list of possible attacks, more of the trajectory, summaries of earlier actions. Measure how much each addition lifts safety, and where it stops helping. - Build an agentic monitor that can issue follow-up queries about a suspicious action, and test whether monitoring directly beats investigating more. Expect to build scaffolding the framework does not ship; a working prototype plus one clean comparison is a strong result. Track B: Protocol design
Go beyond a single monitor: design a protocol and measure its safety/usefulness tradeoff. Defer to the trusted model when suspicious, have it rewrite risky output, gate permissions action-by-action, give the trusted model a tool (e.g. a firewall to whitelist/blacklist URLs), or split a task across models so sabotage needs collusion.
Example projects:
- Implement a firewall-tool protocol: the trusted model whitelists/blacklists URLs action-by-action; show high safety against API-key-exfil side tasks with little usefulness hit on main tasks that need the internet. - Take a shipped protocol, design a variant (new deferral rule, a pre-execution check), and compare them at a fixed usefulness cost, with a worked example of each failure mode. - Prototype a genuinely new protocol and characterize its dynamics. 🟥 Red team
You stress-test the Blue team’s control defenses.
Track C: Attack crafting
Hand-build one sabotage trajectory: a sequence of edits and commands that completes a hidden harmful task (steal a secret, plant a backdoor, disable an audit log) while reading as ordinary work to our monitors, refined against the monitor until it stops flagging you. The craft is making sabotage look like an honest bug. It counts only if it works and uses no facts the agent never looked up.
Example projects:
- Land a working sabotage trajectory on a side task. - Write an attack strategy: a reusable playbook that, handed to an automated attacker, raises its undetected rate. Track D: Red-teaming control protocols
A monitor is only one defense. The framework ships fuller protocols ready to run: defer to the trusted model when suspicious, have it rewrite risky output, gate permissions. Take one and break it: find the attack that survives the protocol, then propose and test a fix. The result worth finding is where a protocol's safety comes from and where it's brittle.
Example projects:
- Pick a shipped protocol (trusted-editing or defer-to-resample), find an attack class it lets through, and show a concrete tweak that closes the gap without much usefulness cost. - Compare two protocols at a fixed usefulness cost and show which is safer against which kinds of attack, with a worked example of each failure. Note: a plain threshold sweep on trusted-monitoring is nearly free (re-scoring existing data); protocols like trusted-editing or defer need live agent re-runs and draw on the sponsored compute budget. Plan around that.
⬜ White team
You build the rules of the game and the arena both sides play in.
Track E: New control settings
Extend the testbed. The benchmark is only as good as its environments and tasks; build new control settings, environments, and main and side tasks to run control evals on.
Example projects:
- Design a new side task in an existing environment, ship its scorer, and demonstrate a working attack that the baseline monitor misses. A new task only counts if all three hold: its scorer works, a real attack accomplishes it, and the baseline monitor doesn't trivially catch that attack. Validate all three early. See docs for more details. - Port an existing repo or benchmark into Control Tower as a new setting: bring in a realistic codebase or an existing capability benchmark and add a side task. Porting is often most of the work; a clean port plus one side task is a strong result. E.g. RE-Bench sabotage. Deliverable
- A short report (2 pages, template provided): problem, method, results, limitations - Code repository - 3 minute demo video Judging
Reports are scored on a shared rubric, 1-5 per dimension:
- Impact and innovation - Execution quality - Clarity There are also automatic leaderboard prizes for the core game (best attack, best monitor).
Strong attacks can be contributed back to the benchmark.
Schedule
- Thursday June 11, evening. Kickoff: a short talk, a live demo of one attack and one monitor, a workshop on finding and hiding subtle code vulnerabilities, team formation, setup help. - Around Sunday June 14. Online checkpoint: attacks validated, monitors running, debugging help - Thursday June 18, evening. Submissions close, scoring runs live, three-minute demos, results and awards. Any size
Fixed reference opponents (a baseline monitor and a set of human-crafted attacks) are always available, so every team has something to measure against whether five show up or twenty: a challenge ladder at the low end, a full tournament at the high end. The core game should be able to run on a laptop.
Resources
- Control Tower library - Control Tower docs
- AI Control Concepts - Running evals on Hawk [link todo] - LinuxArena: A Control Setting for AI Agents in Live Production Software Environments (Tracy et al.). - AI Control: Improving Safety Despite Intentional Subversion (Greenblatt, Shlegeris et al., ICLR 2025). - Ctrl-Z: Controlling AI Agents via Resampling (Bhatt, Rushing et al., 2025). - An Overview of Control Measures (Greenblatt). Contact hosts or team@horizonomega.org for questions.