How My Agents Self-Heal in Production

Vishnu Suresh, a software engineer at LangChain, explains how he built a self-healing deployment pipeline that automatically detects, triages, and fixes regressions using AI agents.

By Vishnu Suresh, Software Engineer @ LangChain

This blog was initially published on X.

I built a self-healing deployment pipeline for our GTM Agent. After every deploy, it detects regressions, triages whether the change caused them, and kicks off an agent to open a PR with a fix, with no manual intervention needed until review time.

The hard part of shipping isn't getting code out. It's everything after: figuring out if your last deploy broke something, whether it's actually your fault, and fixing it before users notice. I wanted to deploy, move on, and trust that if something regressed, the system would catch it and close the loop itself.

How the Self-Healing Flow Works

The GTM Agent runs on Deep Agents and deploys through LangSmith Deployments. We already had an internal coding agent called Open SWE, an open-source async coding agent that can research a codebase, write fixes, and open PRs. The missing piece was automated regression detection and triage to connect production errors back to Open SWE.

Right after a deployment to main, a self-healing GitHub Action triggers, capturing the build and server logs. The flow has two paths: catching build failures immediately, and detecting server-side regressions over a monitoring window. If either path finds a real issue, Open SWE gets kicked off to fix it and open a PR.

Catching Docker Build Failures

First, we check the build logs to make sure the Docker images build properly. If the image fails to build, the pipeline automatically captures the error logs from the CLI, fetches the git diff from the last commit to main, and hands both off to Open SWE, with no human involved. Build failures are almost always caused by the most recent change, so a narrow diff gives Open SWE enough context to act on.

Monitoring for Post-Deploy Errors

Server-side issues are trickier than build failures. Any production system carries a background error rate: network timeouts, third-party API issues, transient failures. In an ideal world you'd track and fix every single one, but when you're trying to answer "did my last deploy break something," you need to separate the errors your change caused from the noise that was already there. That's what this step does.

First, I collect a baseline of all error logs from the past 7 days. These get normalized into error signatures: a regex pass replaces UUIDs, timestamps, and long numeric strings, and the result is truncated to 200 characters, so logically identical errors get bucketed together even when the specifics differ.
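A minimal sketch of that normalization step, to make the bucketing concrete. The post names the categories that get replaced and the 200-character cutoff, but not the actual patterns, so the regexes, the placeholder tokens, and the function names below are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns: the categories come from the post, the regexes do not.
_UUID = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
)
_LONG_NUMBER = re.compile(r"\d{6,}")  # assumed threshold for "long"

MAX_SIGNATURE_LENGTH = 200  # truncation length stated in the post


def normalize_error(message: str) -> str:
    """Collapse request-specific details so equivalent errors share a signature."""
    # UUIDs and timestamps go first, since both contain digit runs that the
    # long-number pattern could otherwise partially consume.
    signature = _UUID.sub("<uuid>", message)
    signature = _TIMESTAMP.sub("<timestamp>", signature)
    signature = _LONG_NUMBER.sub("<num>", signature)
    return signature[:MAX_SIGNATURE_LENGTH]


def count_signatures(messages):
    """Bucket raw error messages by normalized signature."""
    return Counter(normalize_error(m) for m in messages)
```

Running `count_signatures` once over the 7-day baseline and once over the post-deploy window yields two per-signature counts that can be compared bucket by bucket.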

Next, I poll for errors from the current revision over a 60-minute window after deployment, normalizing the same way. Once that window closes, I have error counts from two very different time scales: a week of baseline data and an hour of post-deployment data. I could naively compare these two numbers to detect whether our latest change caused an error, but I wanted to take a more principled approach.

Gating with a Poisson Test

A Poisson distribution models how many times an event occurs in a fixed interval, given a known average rate (λ) and the assumption that events are independent. Using the 7-day baseline, I estimate the expected error rate per hour for each error signature, then scale it to the 60-minute post-deployment window. If the observed count significantly exceeds what the distribution predicts (p < 0.05), I flag it as a potential regression.
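Here is a rough sketch of that gate using only the standard library. It assumes a one-sided test, that is, the probability of seeing at least the observed count when the baseline rate still holds. The function names are hypothetical, and treating a signature with no baseline history as automatically flagged is my assumption, not something the post specifies.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    if expected <= 0:
        # Assumption: with no baseline occurrences, any error is surprising.
        return 0.0
    # Accumulate P(X = k) for k = 0 .. observed - 1, then take the complement.
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_potential_regression(
    baseline_count: int,
    observed_count: int,
    baseline_hours: float = 7 * 24,  # 7-day baseline from the post
    window_hours: float = 1.0,       # 60-minute post-deploy window
    alpha: float = 0.05,             # significance threshold from the post
) -> bool:
    """Flag a signature whose post-deploy count is unlikely under the baseline."""
    rate_per_hour = baseline_count / baseline_hours
    expected = rate_per_hour * window_hours
    return poisson_upper_tail(observed_count, expected) < alpha
```

For example, a signature seen 168 times in the baseline week has an expected rate of one per hour. Two occurrences in the post-deploy hour would not be flagged, while five would, because the chance of five or more under that rate is well below 0.05.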

The Triage Agent

Rather than feeding errors directly into Open SWE, I add another gating mechanism. The diffs from the last commit and the specific error get passed into a triage agent built on Deep Agents. The agent must establish a concrete causal link between a specific line in the diff and the observed error. This prevents false positives where the agent might hallucinate a causal chain from a non-runtime file to a production bug.
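One way to enforce the "specific line in the diff" requirement is a deterministic check on whatever the triage agent returns. The sketch below is not the Deep Agents API or the actual triage implementation. `TriageVerdict`, `added_lines`, `verdict_is_grounded`, and the list of non-runtime suffixes are all hypothetical, and it assumes the diff is standard `git diff` output and that the cited line is one the commit added.

```python
import re
from dataclasses import dataclass

_HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

# Hypothetical list of file types that cannot cause a runtime error.
NON_RUNTIME_SUFFIXES = (".md", ".rst", ".txt")


@dataclass
class TriageVerdict:
    """Hypothetical shape of the triage agent's structured answer."""
    file: str
    line: int
    explanation: str


def added_lines(diff_text: str) -> dict[str, set[int]]:
    """Map each changed file to the new-file line numbers the diff added."""
    result: dict[str, set[int]] = {}
    current_file = None
    new_line = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            current_file, in_hunk = None, False
            continue
        if not in_hunk and line.startswith("+++ "):
            path = line[4:].strip()
            current_file = None if path == "/dev/null" else path.removeprefix("b/")
            if current_file is not None:
                result.setdefault(current_file, set())
            continue
        hunk = _HUNK.match(line)
        if hunk:
            new_line, in_hunk = int(hunk.group(1)), True
            continue
        if not in_hunk or current_file is None:
            continue
        if line.startswith("+"):
            result[current_file].add(new_line)
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers don't advance
        else:
            new_line += 1  # context line
    return result


def verdict_is_grounded(verdict: TriageVerdict, diff_text: str) -> bool:
    """Accept a verdict only if it cites a runtime file and a line the diff added."""
    if verdict.file.endswith(NON_RUNTIME_SUFFIXES):
        return False
    return verdict.line in added_lines(diff_text).get(verdict.file, set())
```

A verdict that points at a README, or at a line the commit never touched, fails this check before anything is handed to Open SWE.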

Closing the Loop with Open SWE

Once the triage agent green-lights an investigation, Open SWE takes over, works through the bug, and opens a PR. I get notified when it's ready for review, so the entire flow from error detection to proposed fix happens without any manual intervention.

Future Improvements

Wider Lookback Window

The triage agent currently looks only at the diff between the current and previous deployed revisions, so it can miss a regression introduced by an earlier change that only surfaces later. Widening the lookback is the obvious fix, but the more diffs you feed into triage, the noisier the signal gets.

Smarter Error Grouping

One idea I've been considering is embedding error messages into a vector space and clustering them, rather than relying on regex normalization. Errors that mean the same thing would naturally land near each other regardless of surface-level differences.

Fix-Forward vs Rolling Back

Right now the system always fixes forward. A smarter approach would be deciding between a rollback or a fix-forward based on severity, error rate, and triage confidence.

Source: LangChain Blog