build: pre-flight tag existence check + cleanup on downstream failure #421

@gandalf-at-lerian

Problem

Run: tenant-manager · Build Pipeline · 2.2.0-beta.6

This is a two-attempt failure chain — each attempt has a distinct root cause.


Attempt #1 — Cosign signing failed (Rekor 404)

Job: build / Build tenant-manager (attempt 1)

Build and push to both DockerHub and GHCR succeeded. Cosign signing then failed on all 3 retry attempts with:

error during command execution: signing [docker.io/lerianstudio/tenant-manager@sha256:49758a04...]:
signing digest: [GET /api/v1/log/entries/{entryUUID}][404] getLogEntryByUuidNotFound

Root cause: transient Rekor (Sigstore public transparency log) 404 getLogEntryByUuidNotFound. The cosign client successfully retrieved the SCT (Signed Certificate Timestamp) but then failed to confirm the entry in Rekor, indicating a momentary inconsistency in the public log service.

There is also a secondary template error at the end of this step:

The template is not valid. ...build.yml@v1.31.0 (Line: 372, Col: 28): Unexpected value ''

This suggests the continue-on-error expression or a similar field on line 372 evaluates to an empty string under some conditions, which is itself a bug.


Attempt #2 — Docker push denied (tag immutability)

Job: build / Build tenant-manager (attempt 2)

The pipeline was re-run to recover from the cosign failure. The full Docker build ran again (~19 s), then:

ERROR: failed to solve: failed to push lerianstudio/tenant-manager:2.2.0-beta.6:
denied: requested access to the resource is denied — tag 2.2.0-beta.6 is already
assigned to an image in this repository and cannot be updated due to immutability settings.

Root cause: DockerHub has tag immutability enabled. The tag was already published in attempt #1; the re-run had no way to detect this before spending time on a full rebuild.


Proposed Fixes

Fix 1 — Resilience to transient Rekor failures (attempt #1 root cause)

The 3-attempt retry with exponential backoff exists but is insufficient for Rekor intermittency, which can last several minutes. Options:

  • Increase cosign_max_attempts default from 3 to a higher value (e.g. 5) and increase the backoff ceiling.
  • Add jitter to the retry delay to avoid thundering-herd if multiple jobs hit Rekor simultaneously.
  • Fix the template error on line 372: The Unexpected value '' error means a field receives an empty string where a boolean or defined value is expected. This should be investigated and fixed — it may also mask error propagation silently in other scenarios.
  • Consider honouring continue_gitops_on_signing_failure more broadly: if Rekor is down, the image is still valid and the signing certificate was issued; only retrieval of the transparency log entry failed. Blocking the entire pipeline, and forcing a re-run that then fails for a different reason, is a disproportionate response to a Sigstore outage.
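
The retry options above (a higher attempt count, a capped backoff, and jitter) could look roughly like the following. This is a minimal sketch, not the workflow's actual retry code: the function names, the base delay, and the cap are all hypothetical, and "full jitter" (a uniform random delay between 0 and the capped backoff) is one of several possible jitter strategies.

```shell
# Sketch only: capped exponential backoff with full jitter.
# All function names and default values here are assumptions.

# backoff_delay ATTEMPT BASE CAP -> prints min(CAP, BASE * 2^(ATTEMPT-1)) in seconds
backoff_delay() {
  local attempt=$1 base=$2 cap=$3
  local delay=$(( base * (1 << (attempt - 1)) ))
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

# jittered_delay ATTEMPT BASE CAP -> prints a random integer in [0, capped delay]
jittered_delay() {
  local max rand
  max=$(backoff_delay "$@")
  # Read 2 random bytes as an unsigned integer (0..65535)
  rand=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  echo $(( rand % (max + 1) ))
}

# retry_with_backoff MAX_ATTEMPTS BASE CAP COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts=$1 base=$2 cap=$3 attempt=1 delay_s
  shift 3
  while true; do
    if "$@"; then return 0; fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::command failed after $attempt attempts" >&2
      return 1
    fi
    delay_s=$(jittered_delay "$attempt" "$base" "$cap")
    echo "attempt $attempt failed; retrying in ${delay_s}s" >&2
    sleep "$delay_s"
    attempt=$(( attempt + 1 ))
  done
}
```

A signing step could then wrap its command, for example `retry_with_backoff 5 10 120 cosign sign --yes "$IMAGE@$DIGEST"`, where the 10 s base and 120 s cap are illustrative values rather than tested recommendations.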

Fix 2 — Pre-flight tag existence check (attempt #2 root cause)

Before starting the Docker build, check whether the target tag already exists in each enabled registry. If it does, either skip the build (idempotent re-run behaviour) or fail fast with a clear, early error — not after a full build.

Suggested new input:

on_existing_tag: 'fail' | 'skip' | 'warn'   # default: 'fail'

Implementation sketch (DockerHub):

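# Assumed reconstruction: login and tag lookup via the Docker Hub v2 API.
# DOCKERHUB_USER and DOCKERHUB_TOKEN are hypothetical secret names.
TOKEN=$(curl -fsS -H "Content-Type: application/json" \
  -d "{\"username\":\"$DOCKERHUB_USER\",\"password\":\"$DOCKERHUB_TOKEN\"}" \
  https://hub.docker.com/v2/users/login/ | jq -r .token)
# curl reports 000 as the HTTP code when the request itself fails
STATUS=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer $TOKEN" \
  "https://hub.docker.com/v2/repositories/$ORG/$IMAGE/tags/$TAG/")
if [ "$STATUS" = "200" ]; then
  echo "::warning::Tag $TAG already exists; skipping (immutable registry)."
  exit 0
fi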

For GHCR: docker manifest inspect ghcr.io/$ORG/$IMAGE:$TAG.
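
To show how the GHCR check and the proposed on_existing_tag input might fit together, here is a minimal sketch. The function names are hypothetical, and the meaning of 'warn' (emit a warning and build anyway) is an assumption, since the proposal above does not define it.

```shell
# Sketch only: pre-flight decision for the proposed on_existing_tag input.
# Function names and the semantics of 'warn' are assumptions.

# tag_exists REF -> exit 0 if the registry returns a manifest for REF
tag_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# preflight_decision MODE EXISTS -> prints "build", "skip", or "fail"
# MODE is the on_existing_tag value; EXISTS is "yes" or "no".
preflight_decision() {
  local mode=$1 exists=$2
  if [ "$exists" != "yes" ]; then echo build; return 0; fi
  case "$mode" in
    skip) echo skip ;;
    warn) echo build ;;  # assumed: warn, then continue with the build
    fail) echo fail ;;
    *) echo "unknown on_existing_tag value: $mode" >&2; return 2 ;;
  esac
}

# preflight REF [MODE] -> prints workflow annotations and a final decision=... line
preflight() {
  local ref=$1 mode=${2:-fail} exists=no decision
  if tag_exists "$ref"; then exists=yes; fi
  decision=$(preflight_decision "$mode" "$exists") || return 2
  case "$decision" in
    skip) echo "::warning::$ref already exists; skipping build." ;;
    fail) echo "::error::$ref already exists; refusing to rebuild."
          return 1 ;;
    build) if [ "$exists" = "yes" ]; then
             echo "::warning::$ref already exists; building anyway."
           fi ;;
  esac
  echo "decision=$decision"
}
```

A step could call `preflight "ghcr.io/$ORG/$IMAGE:$TAG" "$ON_EXISTING_TAG"` before the build. Note that `docker manifest inspect` also fails on authentication or network errors, so this sketch treats those cases as "tag not found"; a stricter check would need to tell them apart.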


Fix 3 — Cleanup pushed images on downstream failure (suggestion)

If the build+push succeeds but a later step fails (cosign, GitOps artifact upload, Helm dispatch), the image is left in the registry unsigned and without a GitOps record. No rollback exists today.

Suggestion: an optional cleanup job/step that runs on if: failure(), controlled by a new input:

cleanup_on_failure: true | false   # default: false

  • DockerHub: DELETE /v2/repositories/{namespace}/{repository}/tags/{tag} (requires delete-scope token).
  • GHCR: gh api -X DELETE /orgs/{org}/packages/container/{package}/versions/{version_id}.
  • Target only tags published in the current run.
  • If the registry has immutability and deletion is not possible, emit a warning with the digest and manual remediation steps.
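
The GHCR part of the cleanup bullets above could be sketched as follows. This assumes the `gh` CLI is authenticated with a token that can delete packages, and that the version list endpoint exposes tags under `.metadata.container.tags`; the function names are hypothetical.

```shell
# Sketch only: delete the GHCR package version that carries a given tag,
# and fall back to a warning when deletion is not possible.
# Function names are assumptions; gh must be authenticated with delete rights.

# ghcr_version_id ORG PACKAGE TAG -> prints ids of package versions carrying TAG
ghcr_version_id() {
  local org=$1 package=$2 tag=$3
  gh api --paginate "/orgs/$org/packages/container/$package/versions" \
    --jq ".[] | select(.metadata.container.tags | index(\"$tag\")) | .id"
}

# cleanup_ghcr_tag ORG PACKAGE TAG DIGEST -> always exits 0 (warning, not hard failure)
cleanup_ghcr_tag() {
  local org=$1 package=$2 tag=$3 digest=$4 version_id
  version_id=$(ghcr_version_id "$org" "$package" "$tag" | head -n 1) || version_id=""
  if [ -n "$version_id" ] &&
     gh api -X DELETE "/orgs/$org/packages/container/$package/versions/$version_id" \
       >/dev/null 2>&1; then
    echo "Deleted $package:$tag (version $version_id)."
    return 0
  fi
  # Deletion not possible: report the digest so it can be handled manually.
  echo "::warning::Could not delete $package:$tag ($digest); manual cleanup required."
  return 0
}
```

In a step guarded by `if: failure()`, this might be invoked as `cleanup_ghcr_tag "$ORG" "$IMAGE" "$TAG" "$DIGEST"`. Deleting a package version removes every tag attached to that version, so the "current run only" constraint would need an extra check when the same digest carries other tags. Package names containing a slash must be URL-encoded in the path.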

Checklist

  • Investigate and fix Unexpected value '' on line 372 of build.yml
  • Review cosign_max_attempts default and retry backoff ceiling
  • Add jitter to cosign retry delay
  • Add on_existing_tag input (fail / skip / warn) with pre-flight check for DockerHub and GHCR
  • Add cleanup_on_failure input with registry cleanup on downstream step failure
  • Cleanup emits a warning (not hard failure) when registry deletion is not possible

Labels

bug (Something is not working as expected), enhancement (New feature or improvement request)
