Decision Note: Deferring a Major Version Upgrade


We deferred a major platform upgrade after weighing recovery certainty, operational stability, and reversibility against the value of new features.

> **Outcome:** Upgrade deferred, compensating controls implemented, restore-test cadence increased, and decision log updated.

---

## Context

We evaluated a major version upgrade inside a core infrastructure layer: virtualization, storage firmware, or backup software. The release promised meaningful features and better alignment with the vendor roadmap, but it also introduced change across multiple surfaces at once: disk formats, management tooling, drivers, and recovery behavior. The existing platform is stable and predictable.

## Decision

We chose to defer the upgrade. This is not a permanent refusal. It is a timing decision. We will revisit after additional validation, compatibility work, and a lower-risk change window.

## What we optimized for

- Predictable maintenance windows with explicit validation steps.
- Restore proof over feature promise.
- Minimal blast radius per change.

![Diagram showing the gating logic for deferring or proceeding with a major version upgrade.](/images/notes/upgrade-deferral-gates.svg)

Upgrade decisions move through explicit evidence gates. If rollback or recovery proof is weak, the default is defer with compensating controls.

> **Decision Snapshot**
>
> - **Decision:** defer the major version upgrade.
> - **Primary objective:** maintain reversibility and recovery certainty.
> - **Tradeoff:** postpone non-essential features in favor of stability.

## Risks considered

- **Recovery and rollback risk.** Major versions often change on-disk formats or metadata handling. Rollbacks can be destructive, not reversible.
- **Operational continuity risk.** Upgrades require maintenance modes, evacuations, or service restarts.
- **Compatibility risk.** Hypervisor, firmware, drivers, and storage paths move together.
- **Backup integrity risk.** Backup software upgrades can alter formats, encryption, or retention logic.
- **Operational response risk.** New versions change telemetry and error patterns, slowing triage.

## Evidence reviewed

- Internal stability metrics were stable or improving.
- Recent restore tests met RPO/RTO targets on current versions.
- Vendor release notes included active caveats for the first patch cycle.
- Compatibility matrices required staged remediation before upgrade.
- Peer outcomes showed uneven stability until point releases arrived.
- The upgrade window overlapped with higher-risk operational periods.

## Compensating controls implemented

- Increased restore test frequency for tier-1 systems.
- Applied security and stability patches within the existing major line.
- Standardized settings across clusters and removed known sources of drift.
- Documented rollback procedures and verified credential access.
- Elevated monitoring for storage latency, backup success, and restore anomalies.

## Revisit criteria

- Vendor point release reduces known-issue surface area.
- Compatibility gaps are remediated across firmware and drivers.
- Lab upgrade and rollback rehearsal completed successfully.
- Two consecutive restore tests meet RPO/RTO under planned versions.
- A lower-risk maintenance window is available.

---

> **Operating Principle**
>
> **Stability before novelty.** A major upgrade is a one-way door unless you can prove the way back.