fix #849 fix #1428 need rebind replica zk path when path already exists explicitly to cover some corner cases#1440
Open
Slach wants to merge 7 commits into
Open
fix #849 fix #1428 need rebind replica zk path when path already exists explicitly to cover some corner cases#1440Slach wants to merge 7 commits into
Slach wants to merge 7 commits into
Conversation
…le (#1428) The #849 stale-path detection (commit 0d8279e, v2.7.1) rebound a restored ReplicatedMergeTree to default_replica_path whenever the table-level ZK path had children but our own replica entry was absent. In an HA restore the live sibling's children look identical to stale leftovers, so the second replica was wrongly rebound -> split-brain with total_replicas=1 on both nodes (#1428). The discriminator is whether a *different local* table already uses that exact resolved path: system.replicas is node-local, so a hit means a foreign/renamed table occupies the path on THIS node (the real #849 case) and we rebind automatically; a genuine HA sibling lives on another node and is absent from the local system.replicas, so we join (HA-safe). The residual case (ZK children but no local table -- async-stale after DROP, indistinguishable from a briefly-offline sibling) stays behind the opt-in flag: clickhouse.rebind_replica_path_if_exists (env CLICKHOUSE_REBIND_REPLICA_PATH_IF_EXISTS, CLI --rebind-replica-path-if-exists on restore/restore_remote). Add TestRebindReplicaPathIfExists covering auto-rebind, HA-join and flag cases.
39aaada to
1c8499f
Compare
Signed-off-by: slach <bloodjazman@gmail.com>
…ints (#1428) Wire the rebind-replica-path-if-exists override into POST /backup/restore and /backup/restore_remote: both now read the per-request reloaded config and set ClickHouse.RebindReplicaPathIfExists=true when the rebind_replica_path_if_exists / rebind-replica-path-if-exists query param is present, mirroring the CLI flag override semantics (only forces true, never false). The restore handler previously built the Backuper from the stale api.config instead of the freshly reloaded cfg; switch it to cfg. Also fix httpCleanHandler to reload config from disk per request like every other HTTP handler and its actionsCleanHandler sibling, using a closure defer so the recorded status captures errors from both ReloadConfig and Clean. Document both query params in ReadMe.md and add testAPIRebindReplicaPath to the integration TestServerAPI suite, exercising flag-off (join shared ZK path) vs flag-on (rebind to default_replica_path) through the API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ix/1428-rebind-replica-path-opt-in
The SYNC keyword for DROP TABLE only lands in ClickHouse 21.x; on 20.8 it fails with 'Syntax error ... Expected one of: NO DELAY'. Mirror the version gate already used in replicationPath_test.go: the test is gated to >= 20.8, where 'DROP TABLE ... NO DELAY' provides the same synchronous drop needed to deregister the replica ZK entry before restore.
TestFIPS intermittently fails on the first create_remote with 'another operation is currently running' (ErrAPILocked, HTTP 500). Root cause: the fips server runs ResumeOperationsAfterRestart on startup, which scans /var/lib/clickhouse/backup/*/*.state2 and resumes any leftover upload/download state, holding the API lock while the test fires its first action. env.Cleanup does not remove resumable state, so it leaks between tests sharing a pooled env. Add a diagnostic check at the start of TestFIPS (before it creates any backup) that lists *.state2 and fails with the offending paths, so the leaking test and backup name are identified instead of surfacing as a flaky lock error.
Coverage Report for CI Build 28297144703Warning Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes. Coverage increased (+0.08%) to 65.104%Details
Uncovered Changes
Coverage Regressions7 previously-covered lines in 4 files lost coverage.
Coverage Stats
💛 - Coveralls |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix #849 fix #1428
The #849 stale-path detection (commit 0d8279e, v2.7.1) rebound a restored ReplicatedMergeTree to default_replica_path whenever the table-level ZK path still had children but our own replica entry was absent. In an HA restore the live sibling replica's children are observationally identical to stale leftovers, so the second replica was wrongly rebound -> split-brain with total_replicas=1 on both nodes (#1428).
Stale/foreign leftovers (#849) and a live sibling (#1428) cannot be told apart from ZK state — the discriminator is operator intent. So default to the HA-safe join behavior and put the #849 rebind behind an explicit opt-in: clickhouse.rebind_replica_path_if_exists (env CLICKHOUSE_REBIND_REPLICA_PATH_IF_EXISTS, CLI --rebind-replica-path-if-exists on restore/restore_remote).
Add TestRebindReplicaPathIfExists covering both branches.