Idempotency Audit: A Script That Ran Twice Broke Production
Senior engineers tend to design for idempotency: an operation that can run twice and leave the same result as running it once. This report describes a script that ran twice and broke production, and shows why "check-then-act" logic is more fragile than declarative state. Enforcing declarative patterns is a core component of IT governance architecture.
Outcome: Automated scripts replaced with declarative state enforcement, preventing race conditions and duplicate resources.
At a glance
Goal: Automate VM provisioning in a brownfield environment.
Constraint: Existing legacy network configuration must be preserved.
Reality: The script added duplicate NICs and corrupted routing tables when retried.
Imperative scripts that check for existence often fail during race conditions or partial failures. Declarative engines enforce the end state regardless of the starting point.
Engineering standards used
- Idempotency is a property of the code. The code must handle re-runs safely rather than relying on the runner.
- Destructive actions need checks. Explicitly verify state before modifying or deleting resources.
- Distinguish change from no-op. Logs must clearly show when no action was taken vs. when a change occurred.
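As a rough illustration of the last two standards, here is a minimal Python sketch. The `Vm` class, the `remove_nic` helper, and the `keep_at_least` parameter are hypothetical names invented for this example, not part of any real provisioning API. The helper verifies state before a destructive action and returns a message that says whether anything changed.

```python
from dataclasses import dataclass, field


@dataclass
class Vm:
    """Hypothetical in-memory stand-in for a VM record."""
    name: str
    nics: list[str] = field(default_factory=list)


def remove_nic(vm: Vm, nic: str, keep_at_least: int = 1) -> str:
    """Detach a NIC only after verifying current state; report what happened."""
    if nic not in vm.nics:
        # No-op: say so explicitly instead of logging a misleading "removed".
        return f"ok: {nic} not attached to {vm.name}, nothing to do"
    if len(vm.nics) <= keep_at_least:
        # Guard: refuse a destructive action that would break the VM.
        return f"refused: removing {nic} would leave {vm.name} below {keep_at_least} NIC(s)"
    vm.nics.remove(nic)
    return f"changed: detached {nic} from {vm.name}"


vm = Vm("app-01", ["nic0", "nic1"])
print(remove_nic(vm, "nic1"))  # changed: detached nic1 from app-01
print(remove_nic(vm, "nic1"))  # ok: nic1 not attached to app-01, nothing to do
```

Running the same call twice is safe, and the log line makes the difference between the first and second run obvious.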
The "Smart" Script
The incident started with a well-intentioned script designed to provision VMs. It included logic to check if a VM already existed before attempting to create it.
if (!exists(vm)) create(vm);
This logic works perfectly in isolation. However, in a distributed system, or even a slow one, the gap between the check and the act is a danger zone.
The Race Condition
During a deployment, the API response for the creation request timed out. The system, interpreting this as a failure, retried the script.
The first request had actually succeeded on the backend but failed to report back in time. On retry, the script checked for existence again, but because of eventual consistency or simple timing, the new VM was not yet visible in the query result.
The script proceeded to "create" the resources again. Since the VM ID was reused, it attached a second network interface to the existing VM instead of failing or updating it. This duplicate NIC grabbed a new IP via DHCP, creating a routing loop that took the application offline.
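The failure sequence can be reproduced with a toy model. The sketch below is an assumption-laden simulation, not the original script: `EventuallyConsistentCloud` is a hypothetical backend whose reads lag behind writes until `sync()` is called, and its `create` deliberately mimics the behaviour described above, where a reused VM ID gains another NIC instead of failing.

```python
class EventuallyConsistentCloud:
    """Toy backend: writes land immediately, reads lag until sync() is called."""

    def __init__(self):
        self._actual = {}   # vm name -> list of NIC names (the truth)
        self._visible = {}  # what existence queries currently return

    def exists(self, name):
        return name in self._visible

    def create(self, name):
        # Mirrors the incident: a reused VM ID gains another NIC instead of failing.
        nics = self._actual.setdefault(name, [])
        nics.append(f"nic{len(nics)}")

    def sync(self):
        # Reads catch up with writes.
        self._visible = {k: list(v) for k, v in self._actual.items()}

    def nics(self, name):
        return list(self._actual[name])


def provision(cloud, name):
    """The original check-then-act logic."""
    if not cloud.exists(name):
        cloud.create(name)


cloud = EventuallyConsistentCloud()
provision(cloud, "app-01")   # first attempt: succeeds, but the response times out
provision(cloud, "app-01")   # retry: the read is stale, so the check passes again
cloud.sync()
print(cloud.nics("app-01"))  # ['nic0', 'nic1']
```

The check itself is correct. It is the stale read between the two attempts that lets the second `create` through.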
The Fix: Declarative State
The solution was not to write better checks but to stop hand-writing them altogether.
I moved the provisioning logic to a declarative tool (Terraform/Ansible). Instead of saying "create this," I defined the end state: "This VM exists, and it has exactly one NIC."
When the declarative engine runs, it queries the actual state of the resource and compares it with the definition. If it sees two NICs, it removes one to match the definition. If the VM already exists and matches the definition, it does nothing.
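The idea can be sketched as a small reconcile function. This is a toy model under stated assumptions, not how Terraform or Ansible is implemented: `reconcile` and the dictionary-based state are hypothetical, and detaching the most recently attached NIC is an arbitrary choice made for the example.

```python
def reconcile(actual: dict, name: str, desired_nics: int = 1) -> list[str]:
    """Converge `actual` (vm name -> NIC list) on the desired state.

    Returns the list of changes made; an empty list means a no-op.
    """
    changes = []
    if name not in actual:
        actual[name] = []
        changes.append(f"created {name}")
    nics = actual[name]
    while len(nics) > desired_nics:
        # Too many NICs: remove extras until the definition is met.
        changes.append(f"detached {nics.pop()}")
    while len(nics) < desired_nics:
        # Too few NICs: attach until the definition is met.
        nics.append(f"nic{len(nics)}")
        changes.append(f"attached {nics[-1]}")
    return changes


state = {"app-01": ["nic0", "nic1"]}  # the broken state left by the incident
print(reconcile(state, "app-01"))     # ['detached nic1']
print(reconcile(state, "app-01"))     # []
print(reconcile({}, "app-02"))        # ['created app-02', 'attached nic0']
```

Because the function compares against the end state rather than asking "does it exist?", a retry after a timeout converges on one VM with one NIC no matter how far the first attempt got.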
Takeaway
If you can't run it twice safely, don't run it once automatically. Idempotency is the foundation of automation that lets you sleep at night.
Next step
Most engagements start with the Health Check. Fixed fee, clear picture, under two weeks.