feat: add max_broken_part_ratio to allow partial backups #1442
Open
mvanhorn wants to merge 1 commit into
Add general.max_broken_part_ratio config option so create/upload can continue when the fraction of broken data parts stays at or below the configured threshold, producing a successful-but-partial backup instead of aborting completely. Default 0 preserves today's behavior where any broken part stops the backup. Fixes Altinity#1418
Summary
This adds a new opt-in general.max_broken_part_ratio option (env MAX_BROKEN_PART_RATIO) that lets backup creation tolerate a bounded fraction of broken data parts and still finish successfully, as requested in #1418. The value is a fraction between 0 and 1, and it defaults to 0, which keeps exactly today's behavior where the very first broken part stops the whole backup.

When the ratio is greater than zero, creation no longer aborts on a part it cannot freeze, move, or upload. Instead it skips that part, counts it as broken, and keeps going. Once every table has been processed, the worker compares the total number of broken parts against the total number of parts collected. If their ratio stayed at or below the configured threshold, the backup is marked successful with a warning that records how many parts were lost; if it went over the threshold, the backup fails with an explicit error naming the observed ratio and the configured limit. Counting honors --partitions, so parts outside the requested partitions are never blamed, and parts that an incremental backup marks as required (their data lives in the diff base) are preserved rather than discarded when an object-disk upload fails.

Why this matters
When a backup hits broken parts because of an S3-disk or filesystem failure, clickhouse-backup currently gives up entirely and produces nothing. For large datasets that means a single corrupt part can deny an operator any backup at all, even though the other 99.99% of the data is intact and worth preserving. Issue #1418 asks for a way to accept a small, known fraction of broken parts so a partial backup still succeeds while the failure stays visible. Because the default is 0, nobody who has not deliberately set the option sees any change in behavior.

Testing
New unit tests in the config package cover the default value, the validation that rejects ratios outside the 0 to 1 range, and the threshold decision itself across the under, at, and over cases plus the zero-total guard.

go test ./pkg/backup/... ./pkg/config/..., go vet, and gofmt all pass, and the full module builds.

Fixes #1418
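For reviewers, the validation and threshold decision described in the Summary can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names `validateRatio` and `checkBrokenRatio`, the error wording, and the handling of a zero part total are all assumptions made for the example.

```go
package main

import "fmt"

// validateRatio is a hypothetical helper: it rejects values outside
// the 0 to 1 range, as the PR's config validation is described to do.
func validateRatio(ratio float64) error {
	if ratio < 0 || ratio > 1 {
		return fmt.Errorf("max_broken_part_ratio must be between 0 and 1, got %v", ratio)
	}
	return nil
}

// checkBrokenRatio is a hypothetical helper: it returns nil when the
// observed fraction of broken parts is at or below maxRatio, and an
// error naming both the observed ratio and the limit otherwise.
func checkBrokenRatio(broken, total int, maxRatio float64) error {
	if broken == 0 {
		// Nothing was skipped, so there is nothing to compare.
		return nil
	}
	if total <= 0 {
		// Assumed zero-total guard: avoid dividing by zero and treat
		// broken parts without a valid total as a failure.
		return fmt.Errorf("%d broken parts reported but no parts were collected", broken)
	}
	observed := float64(broken) / float64(total)
	if observed > maxRatio {
		return fmt.Errorf("broken part ratio %.4f exceeds max_broken_part_ratio %.4f (%d of %d parts)",
			observed, maxRatio, broken, total)
	}
	return nil
}

func main() {
	// Under, at, and over the threshold, plus the default of 0.
	fmt.Println(checkBrokenRatio(1, 100, 0.05))
	fmt.Println(checkBrokenRatio(5, 100, 0.05))
	fmt.Println(checkBrokenRatio(6, 100, 0.05))
	fmt.Println(checkBrokenRatio(1, 100, 0))
	fmt.Println(validateRatio(1.5))
}
```

With a threshold of 0 any broken part fails the check, which matches the default behavior described above, although the description indicates that with the default the backup stops at the first broken part rather than after a final comparison.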