hibernation: wake fast-path returns Active without clearing hibernate annotation #2

@CMGS

Summary

In hibernation/wake.go, when reconcileWake observes that the VM has already come back (container Running + VMID set), it immediately drops the snapshot tag and marks the CR Active, without clearing the vm.cocoonstack.io/hibernate annotation on the pod.

// hibernation/wake.go
if vmClonedAndRunning(pod) {
    r.Epoch.DeleteManifest(ctx, vmName, meta.HibernateSnapshotTag)
    return ctrl.Result{}, r.setPhase(ctx, hib, cocoonv1.CocoonHibernationPhaseActive, vmName)
}

// this clear is skipped on the fast-path above
if meta.ReadHibernateState(pod) {
    commonk8s.PatchHibernateState(ctx, r.Client, pod, false)
}

Scenario

  1. A pod is already running with a valid VMID, but still carries hibernate=true (e.g. residue from a prior failed hibernate, or a CR created against an already-awake pod).
  2. The user creates a CR with Desire=Wake, or sets it on an existing CR. The first reconcile hits the fast-path and returns Active, leaving the hibernate=true annotation in place.
  3. User flips Desire=Hibernate. reconcileHibernate calls PatchHibernateState(pod, true), which is a no-op because the annotation already matches (see cocoon-common/k8s/utils.go:27).
  4. The reconciler immediately probes the registry for the snapshot tag. If a stale tag happens to be present, the CR gets marked Hibernated without vk-cocoon ever taking a new snapshot for this cycle.
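
Steps 1 to 3 can be reproduced in a small self-contained model. This is only a sketch: the pod struct, wakeFastPath, and the lowercase patchHibernateState below are hypothetical stand-ins rather than the real cocoon types, and the no-op behavior is assumed to match what step 3 describes.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the Kubernetes pod. Only the fields
// relevant to the scenario are modeled.
type pod struct {
	running bool
	vmid    string
	// hibernate models the vm.cocoonstack.io/hibernate annotation.
	hibernate bool
}

// patchHibernateState models the behavior described in step 3 (an assumption
// about the real helper): it writes only when the desired value differs from
// the current one, and reports whether a write happened.
func patchHibernateState(p *pod, want bool) bool {
	if p.hibernate == want {
		return false
	}
	p.hibernate = want
	return true
}

// wakeFastPath models the current fast-path: it reports Active for a live VM
// and never touches the annotation.
func wakeFastPath(p *pod) (string, bool) {
	if p.running && p.vmid != "" {
		return "Active", true
	}
	return "", false
}

func main() {
	// Step 1: a live pod that still carries hibernate=true.
	p := &pod{running: true, vmid: "vm-1", hibernate: true}

	// Step 2: Desire=Wake hits the fast-path.
	phase, _ := wakeFastPath(p)

	// Step 3: Desire=Hibernate tries to set the annotation to true.
	wrote := patchHibernateState(p, true)

	fmt.Println(phase, p.hibernate, wrote) // prints "Active true false"
}
```

The final `false` is the point: no write reaches the pod on the hibernate request, so the annotation does not change for this cycle.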

Impact

A subsequent wake would clone from a stale (or nonexistent) snapshot, resulting in data divergence or a stuck Waking phase.

Notes

  • This is pre-existing behavior (predates 82a9bc3). Not introduced by the recent VMID-gate hardening.
  • Raised during a /code review of HEAD~3..HEAD; deferred out of scope for that review.

Possible fixes

  • Always call PatchHibernateState(pod, false) before returning Active on the fast-path.
  • Or: move the ReadHibernateState/PatchHibernateState block above the fast-path, so the annotation is cleared unconditionally during any wake reconcile.
  • Either fix needs a small unit test covering the "hibernate annotation residue on an already-live pod" case.
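
A rough sketch of the second option, together with the residue case the test should cover, might look like the following. It reuses a simplified model rather than the real reconciler: pod, reconcileWake, and patchHibernateState here are hypothetical stand-ins, and returning "Waking" on the slow path is an assumption made for illustration.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the Kubernetes pod.
type pod struct {
	running bool
	vmid    string
	// hibernate models the vm.cocoonstack.io/hibernate annotation.
	hibernate bool
}

// patchHibernateState writes only when the value differs and reports whether
// a write happened (assumed to mirror the real helper's no-op behavior).
func patchHibernateState(p *pod, want bool) bool {
	if p.hibernate == want {
		return false
	}
	p.hibernate = want
	return true
}

// reconcileWake models the wake path with the clear moved above the
// fast-path, so the annotation is cleared on every wake reconcile.
func reconcileWake(p *pod) string {
	patchHibernateState(p, false)
	if p.running && p.vmid != "" {
		return "Active"
	}
	// Assumption: the slow path keeps waiting for the VM to come back.
	return "Waking"
}

func main() {
	// Residue case: an already-live pod that still carries hibernate=true.
	p := &pod{running: true, vmid: "vm-1", hibernate: true}
	phase := reconcileWake(p)

	// A later hibernate request now performs a real write.
	wrote := patchHibernateState(p, true)

	fmt.Println(phase, wrote) // prints "Active true"
}
```

In the real controller the equivalent unit test would presumably assert the same two things against a fake client: the annotation is gone after the wake reconcile, and the following hibernate reconcile issues an actual patch.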
