feat: add bucket-based processing to PR analysis snapshot IN-1180 #4166

Merged
gaspergrom merged 3 commits into main from feat/IN-1180-pull-request-analysis-initial-snapshot
Jun 4, 2026
Conversation

@gaspergrom gaspergrom commented Jun 3, 2026

Summary

Adds bucket-based (sharded) processing to the PR analysis initial snapshot pipe to avoid hitting memory limits when processing large datasets. Each run can target a subset of segments via bucket_id and num_buckets parameters; once all buckets complete, the hourly snapshot merger takes over. Also switches the copy mode from replace to append to support incremental runs.

Changes

  • Added % (templated query) marker to all NODEs so Tinybird evaluates the {% if defined(bucket_id) %} conditionals
  • Added cityHash64(segmentId) % num_buckets = bucket_id filter to every NODE, allowing callers to process one bucket at a time
  • Changed COPY_MODE from replace to append to support incremental bucket runs
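The bucket filter described above can be illustrated outside Tinybird. The following is a minimal Python sketch, not the pipe itself: `stable_hash64` is a stand-in for ClickHouse's `cityHash64` (which is not in the Python standard library), and the sample rows and helper names are hypothetical. It only shows that a `hash(segmentId) % num_buckets = bucket_id` filter assigns every row to exactly one bucket, so running all buckets covers the full dataset once.

```python
import hashlib


def stable_hash64(value: str) -> int:
    """Stand-in for cityHash64: any stable 64-bit hash illustrates the idea."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bucket_of(segment_id: str, num_buckets: int) -> int:
    """Mirror of the filter expression: hash(segmentId) % num_buckets."""
    return stable_hash64(segment_id) % num_buckets


def select_bucket(rows, bucket_id, num_buckets):
    """Keep only the rows whose segment falls in the requested bucket."""
    return [r for r in rows if bucket_of(r["segmentId"], num_buckets) == bucket_id]


# Hypothetical sample data: 200 PR rows spread over 40 segments.
rows = [{"segmentId": f"segment-{i % 40}", "prId": i} for i in range(200)]
num_buckets = 5

# One pass per bucket, as an operator would trigger them.
shards = [select_bucket(rows, b, num_buckets) for b in range(num_buckets)]
rows_covered = sum(len(shard) for shard in shards)
print(rows_covered == len(rows))
```

Because the bucket depends only on `segmentId`, all rows of a segment land in the same pass, which is what lets each pass stay small.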

Type of change

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Performance improvement
  • Chore / dependency update
  • Documentation

JIRA ticket

https://linuxfoundation.atlassian.net/browse/IN-1180


Note

Medium Risk
Changes how pull_requests_analyzed is bootstrapped: append mode plus partial bucket runs can leave incomplete or duplicate snapshot data if operations skip buckets or omit a full reset.

Overview
The PR analysis initial snapshot Tinybird pipe can now be run in sharded passes using optional bucket_id and num_buckets (default 5), filtering each node on cityHash64(segmentId) % num_buckets so large backfills stay within memory limits.

Every upstream SQL node is switched to templated queries (% plus Jinja) so the bucket filter applies consistently across opened, lifecycle, and patchset nodes. COPY_MODE moves from replace to append, so each bucket run adds rows instead of wiping pull_requests_analyzed; operators are expected to run all buckets, then rely on the hourly merger as before.

Reviewed by Cursor Bugbot for commit 18f7b98. Bugbot is set up for automated code reviews on this repo.

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
@gaspergrom gaspergrom self-assigned this Jun 3, 2026
Copilot AI review requested due to automatic review settings June 3, 2026 19:25
Copilot AI left a comment

Copilot encountered an error: Your billing is not configured or you have Copilot licenses from multiple standalone organizations or enterprises. To use premium requests, select a billing entity via the GitHub site, under Settings > Copilot > Features.

  TYPE COPY
  TARGET_DATASOURCE pull_requests_analyzed
- COPY_MODE replace
+ COPY_MODE append

Append duplicates full unbucketed run

High Severity

With COPY_MODE append, any on-demand run that omits bucket_id still scans the full dataset and appends every PR row. The previous replace mode cleared the target first. Appending onto an already populated pull_requests_analyzed duplicates keys and skews downstream averages and counts that read the table without a snapshotId filter.

Reviewed by Cursor Bugbot for commit 153af27.
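To make the concern concrete, here is a small Python sketch of the two copy modes as the comment describes them. It is an in-memory model only: `copy_into`, the column names, and the sample values are hypothetical and do not come from the pipe. It assumes the target already holds one bucket's rows when an unbucketed run is triggered.

```python
from statistics import mean


def copy_into(target, rows, mode):
    """Toy model of the copy modes: replace clears the target, append adds to it."""
    if mode == "replace":
        return list(rows)
    if mode == "append":
        return list(target) + list(rows)
    raise ValueError(f"unknown copy mode: {mode}")


# Hypothetical full dataset of three PRs.
full_scan = [
    {"prId": 1, "hoursToMerge": 10.0},
    {"prId": 2, "hoursToMerge": 20.0},
    {"prId": 3, "hoursToMerge": 60.0},
]

# Assume an earlier bucket run already loaded PRs 1 and 2.
already_loaded = full_scan[:2]

replaced = copy_into(already_loaded, full_scan, "replace")
appended = copy_into(already_loaded, full_scan, "append")

true_mean = mean(r["hoursToMerge"] for r in replaced)    # 30.0 over 3 rows
skewed_mean = mean(r["hoursToMerge"] for r in appended)  # 24.0 over 5 rows
print(len(replaced), true_mean, len(appended), skewed_mean)
```

In this model the appended table holds five rows for three distinct PRs, so both counts and averages drift for any reader that does not deduplicate.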

  TYPE COPY
  TARGET_DATASOURCE pull_requests_analyzed
- COPY_MODE replace
+ COPY_MODE append

Re-run bucket appends duplicate PRs

High Severity

COPY_MODE append has no idempotency for a given bucket_id. Re-running the same bucket after a successful copy writes another copy of the same PR rows (same keys and snapshotId). The hourly merger unions historical rows without deduplicating identical keys, so duplicates can remain in pull_requests_analyzed and inflate analytics.

Reviewed by Cursor Bugbot for commit 153af27.
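One possible way to reason about idempotency is to skip rows whose key is already present before appending. The Python sketch below models that idea in memory; it is not what the PR implements, and the key columns (`prId`, `snapshotId`), the function name, and the sample rows are assumptions made for illustration.

```python
def append_bucket_idempotent(target, bucket_rows, key=("prId", "snapshotId")):
    """Append only rows whose key is not yet in the target (hypothetical guard)."""
    seen = {tuple(row[k] for k in key) for row in target}
    result = list(target)
    for row in bucket_rows:
        row_key = tuple(row[k] for k in key)
        if row_key not in seen:
            seen.add(row_key)
            result.append(row)
    return result


# Hypothetical rows produced by one bucket run.
bucket_rows = [
    {"prId": 1, "snapshotId": "2026-06-03", "segmentId": "a"},
    {"prId": 2, "snapshotId": "2026-06-03", "segmentId": "a"},
]

# Plain append: a re-run of the same bucket doubles the rows.
plain = [] + bucket_rows + bucket_rows

# Guarded append: the second run is a no-op.
guarded = append_bucket_idempotent([], bucket_rows)
guarded = append_bucket_idempotent(guarded, bucket_rows)

print(len(plain), len(guarded))
```

An alternative with the same effect would be to delete a bucket's existing rows before re-appending them; which option fits depends on how the target data source is keyed.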

  TYPE COPY
  TARGET_DATASOURCE pull_requests_analyzed
- COPY_MODE replace
+ COPY_MODE append

Merger during partial bucket load

High Severity

Append mode exposes partially loaded data in pull_requests_analyzed while buckets 0–N are still running. The hourly pull_request_analysis_snapshot_merger_copy job uses COPY_MODE replace and treats whatever is already in the table as the historical baseline. If it runs before every bucket finishes, the replace output can permanently under-represent PRs until a full rebootstrap.

Reviewed by Cursor Bugbot for commit 153af27.
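A simple operational guard against this race is to hold the merger back until every bucket has reported completion. The Python sketch below shows only the bookkeeping; the function names and the idea of tracking completed bucket ids are assumptions, and nothing in the PR indicates such a gate exists.

```python
def missing_buckets(completed, num_buckets):
    """Return the bucket ids that have not finished yet, in ascending order."""
    return sorted(set(range(num_buckets)) - set(completed))


def merger_may_run(completed, num_buckets):
    """The merger is safe only once every bucket 0..num_buckets-1 has completed."""
    return not missing_buckets(completed, num_buckets)


num_buckets = 5

# Hypothetical state part-way through the bootstrap: buckets 2 and 4 are pending.
partial = {0, 1, 3}
print(merger_may_run(partial, num_buckets), missing_buckets(partial, num_buckets))

# Once all buckets are done, the hourly merger can take over.
done = {0, 1, 2, 3, 4}
print(merger_may_run(done, num_buckets), missing_buckets(done, num_buckets))
```

In practice the same effect could be achieved by pausing the hourly schedule for the duration of the bootstrap, if the platform allows it.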

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Copilot AI review requested due to automatic review settings June 4, 2026 06:42
Copilot AI left a comment

Copilot encountered an error: Your billing is not configured or you have Copilot licenses from multiple standalone organizations or enterprises. To use premium requests, select a billing entity via the GitHub site, under Settings > Copilot > Features.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 18f7b98.

required=False,
)
}}
{% end %}

Bucket gate ignores num_buckets

Medium Severity

Sharding is enabled when only bucket_id is defined, while num_buckets can default independently per template. A run with bucket_id but a missing or different num_buckets than other bucket runs mis-partitions segments, leaving gaps or double-processing PR data across the append loads.

Reviewed by Cursor Bugbot for commit 18f7b98.
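The mis-partitioning risk can be shown with plain integers standing in for hash values. The Python sketch below is illustrative only: `resolve_bucket_params` is a hypothetical caller-side check that requires both parameters together, and the run list assumes one run was triggered with `num_buckets=4` while the others used 5.

```python
def resolve_bucket_params(bucket_id=None, num_buckets=None):
    """Hypothetical check: shard only when both parameters are given and consistent."""
    if bucket_id is None and num_buckets is None:
        return None  # unsharded run
    if bucket_id is None or num_buckets is None:
        raise ValueError("bucket_id and num_buckets must be supplied together")
    if num_buckets <= 0 or not 0 <= bucket_id < num_buckets:
        raise ValueError("bucket_id must satisfy 0 <= bucket_id < num_buckets")
    return bucket_id, num_buckets


def times_covered(hash_value, runs):
    """Count how many runs would select a row with this hash value."""
    return sum(1 for bucket_id, n in runs if hash_value % n == bucket_id)


# Integers 0..19 stand in for hashed segment ids.
hashes = range(20)

# Four runs use num_buckets=5; the last one mistakenly uses num_buckets=4.
mixed_runs = [(0, 5), (1, 5), (2, 5), (3, 5), (3, 4)]

gaps = [h for h in hashes if times_covered(h, mixed_runs) == 0]
doubles = [h for h in hashes if times_covered(h, mixed_runs) > 1]
print(gaps)     # [4, 9, 14]
print(doubles)  # [3, 7, 11, 15]
```

With a consistent `num_buckets` across all runs, both lists would be empty; the check above simply refuses a run that specifies one parameter without the other.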

@gaspergrom gaspergrom merged commit 225d3fe into main Jun 4, 2026
16 checks passed
@gaspergrom gaspergrom deleted the feat/IN-1180-pull-request-analysis-initial-snapshot branch June 4, 2026 06:45