feat: add max_broken_part_ratio to allow partial backups #1442

Open
mvanhorn wants to merge 1 commit into Altinity:master from mvanhorn:feat/1418-max-broken-part-ratio

@mvanhorn

Summary

This adds a new opt-in general.max_broken_part_ratio option (env MAX_BROKEN_PART_RATIO) that lets backup creation tolerate a bounded fraction of broken data parts and still finish successfully, as requested in #1418. The value is a fraction between 0 and 1, and it defaults to 0, which keeps exactly today's behavior where the very first broken part stops the whole backup.
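As a rough illustration of the range check described above, here is a minimal sketch. The helper names `validateMaxBrokenPartRatio` and `parseMaxBrokenPartRatio` are hypothetical and are not taken from this PR. Treating both endpoints (0 and 1) as valid and an empty value as the default 0 are assumptions; the actual config loading in clickhouse-backup may work differently.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// validateMaxBrokenPartRatio is a hypothetical helper, not the PR's code.
// Assumption: the allowed range is inclusive, so 0 (the default) and 1 pass.
func validateMaxBrokenPartRatio(ratio float64) error {
	if math.IsNaN(ratio) || ratio < 0 || ratio > 1 {
		return fmt.Errorf("general.max_broken_part_ratio must be between 0 and 1, got %v", ratio)
	}
	return nil
}

// parseMaxBrokenPartRatio parses the raw text of MAX_BROKEN_PART_RATIO.
// Assumption: an empty value means "not set" and yields the default 0.
func parseMaxBrokenPartRatio(raw string) (float64, error) {
	if raw == "" {
		return 0, nil
	}
	ratio, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("MAX_BROKEN_PART_RATIO: %w", err)
	}
	if err := validateMaxBrokenPartRatio(ratio); err != nil {
		return 0, err
	}
	return ratio, nil
}

func main() {
	// Show how an unset, a valid, and an out-of-range value are handled.
	for _, raw := range []string{"", "0.001", "1.5"} {
		ratio, err := parseMaxBrokenPartRatio(raw)
		fmt.Printf("%q -> ratio=%v err=%v\n", raw, ratio, err)
	}
}
```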

When the ratio is greater than zero, creation no longer aborts on a part it cannot freeze, move, or upload. Instead it skips that part, counts it as broken, and keeps going. Once every table has been processed, the worker compares the total number of broken parts against the total number of parts collected. If the ratio is at or below the configured threshold, the backup is marked successful with a warning that records how many parts were lost; if it exceeds the threshold, the backup fails with an explicit error naming the observed ratio and the configured limit. Counting honors --partitions, so parts outside the requested partitions are never counted as broken, and parts that an incremental backup marks as required (their data lives in the diff base) are preserved rather than discarded when an object-disk upload fails.
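The end-of-run decision can be sketched as below. This is an illustration under stated assumptions, not the code in this PR: `checkBrokenPartRatio` and its signature are hypothetical, and the behavior of the zero-total guard (succeed when nothing was collected and nothing broke, fail if broken parts were somehow counted against a zero total) is a guess at one reasonable design.

```go
package main

import "fmt"

// checkBrokenPartRatio is a hypothetical helper, not the PR's code.
// It returns nil when the backup may be marked successful and an error
// naming the observed ratio and the configured limit otherwise.
func checkBrokenPartRatio(broken, total int, maxRatio float64) error {
	if broken == 0 {
		return nil // nothing was skipped, so there is nothing to compare
	}
	// Zero-total guard (assumed behavior): avoid dividing by zero.
	if total <= 0 {
		return fmt.Errorf("%d broken parts counted but no parts collected", broken)
	}
	observed := float64(broken) / float64(total)
	if observed > maxRatio {
		return fmt.Errorf("broken part ratio %.6f (%d of %d) exceeds max_broken_part_ratio %.6f",
			observed, broken, total, maxRatio)
	}
	return nil // at or below the threshold: partial backup is accepted
}

func main() {
	// Under, at, and over the threshold, plus the zero-total case.
	cases := []struct{ broken, total int }{{1, 8}, {2, 8}, {3, 8}, {0, 0}}
	for _, c := range cases {
		err := checkBrokenPartRatio(c.broken, c.total, 0.25)
		fmt.Printf("broken=%d total=%d err=%v\n", c.broken, c.total, err)
	}
}
```

In this sketch a threshold of 0 rejects any broken part at the final check; per the description above, the real default path stops at the first broken part instead of waiting until the end.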

Why this matters

When a backup hits broken parts because of an S3-disk or filesystem failure, clickhouse-backup currently gives up entirely and produces nothing. For large datasets that means a single corrupt part can deny an operator any backup at all, even though the other 99.99% of the data is intact and worth preserving. Issue #1418 asks for a way to accept a small, known fraction of broken parts so a partial backup still succeeds while the failure stays visible. Because the default is 0, nobody who has not deliberately set the option sees any change in behavior.

Testing

New unit tests in the config package cover the default value, the validation that rejects ratios outside the 0 to 1 range, and the threshold decision itself for ratios below, at, and above the threshold, plus the zero-total guard. go vet, gofmt, and go test ./pkg/backup/... ./pkg/config/... all pass, and the full module builds.

Fixes #1418

Add general.max_broken_part_ratio config option so create/upload can
continue when the fraction of broken data parts stays at or below the
configured threshold, producing a successful-but-partial backup instead
of aborting completely. Default 0 preserves today's behavior where any
broken part stops the backup.

Fixes Altinity#1418
