feat: criticality worker init [CM-1214] #4161
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Pull request overview
Introduces initial “criticality” groundwork in packages_worker by adding a PageRank-based centrality computation (written into packages_universe) and updating the database ranking function/migrations to support an ADR-based criticality scoring formula.
Changes:
- Add CSR graph building + PageRank computation and a standalone runner for validating graph correctness.
- Add DB queries to load direct dependency edges and merge computed centrality scores back into packages_universe.
- Add migrations for graph-derived signals and a v2 rank_packages_universe() implementation using weighted percentile ranks.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| services/apps/packages_worker/src/criticality/types.ts | Adds types for centrality input/output and criticality weight definitions. |
| services/apps/packages_worker/src/criticality/run-pagerank.ts | Adds a standalone CLI-like script to build/validate the graph and optionally run full PageRank. |
| services/apps/packages_worker/src/criticality/queries.ts | Adds SQL helpers to load direct dependency edges and bulk-merge centrality scores. |
| services/apps/packages_worker/src/criticality/graph.ts | Implements CSR graph construction and PageRank iteration utilities. |
| services/apps/packages_worker/src/criticality/activities.ts | Implements the Temporal activity to compute PageRank and persist centrality scores in chunks. |
| backend/src/osspckgs/migrations/V1780416481__rank_packages_universe_v2.sql | Replaces/updates rank_packages_universe() scoring + ranking logic to match the ADR methodology. |
| backend/src/osspckgs/migrations/V1780394591__packages_universe_graph_signals.sql | Adds transitive_dependent_count and centrality_score columns to packages_universe. |
Signed-off-by: Mouad BANI <mbani@contractor.linuxfoundation.org>
export interface CriticalityWeights {
  wCentrality: number // 0.40
  wTransitive: number // 0.10
  wDepPkgs: number // 0.20
  wDepRepos: number // 0.15
  wDownloads: number // 0.15
}
Is this still up to date?
Good catch! This is actually no longer used, removed it
weight_downloads numeric DEFAULT 0.25,
weight_dependent_packages numeric DEFAULT 0.25,
weight_transitive numeric DEFAULT 0.50,
critical_top_n_by_ecosystem jsonb DEFAULT '{}'::jsonb
Not sure where this is defined, but to cover all of the possible registries up front, I would suggest something like:
- npm: 30% (210k)
- PyPI: 20% (140k)
- Maven Central: 17% (120k)
- NuGet: 10% (70k)
- Packagist: 8% (56k)
- Go modules: 6% (42k)
- crates.io: 4% (28k)
- RubyGems: 3% (21k)
- Docker Hub: 2% (13k)
The idea was to set them during the function call, but yeah we should have default values.
Fixed
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 025534c.
const [result] = await qx.select(
  `SELECT * FROM rank_packages_universe($/wDownloads/, $/wDepPkgs/, $/wTransitive/, $/topN/::jsonb)`,
  { wDownloads, wDepPkgs, wTransitive, topN },
Partial top-N clears critical flags
Medium Severity
run:impact always passes critical_top_n_by_ecosystem as JSON, and the default only includes cargo and maven. In rank_packages_universe(), missing ecosystem keys make (critical_top_n_by_ecosystem ->> ecosystem)::int null, so is_critical stays false for npm, pypi, go, nuget, and others when the script runs without --top-n.
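The finding above comes down to a partial JSON object replacing the per-ecosystem defaults instead of extending them. One possible guard is to merge any user-supplied value over a complete default map before calling the function. The sketch below is not from the PR: `DEFAULT_TOP_N` and `resolveTopN` are hypothetical names, the numbers are the ones suggested earlier in this review thread, and the ecosystem key spellings are assumptions that would have to match the values stored in `packages.ecosystem`.

```typescript
// Hypothetical defaults using the per-registry numbers suggested in review.
// Key spellings are assumptions; they must match the stored ecosystem values.
const DEFAULT_TOP_N: Record<string, number> = {
  npm: 210000,
  pypi: 140000,
  maven: 120000,
  nuget: 70000,
  packagist: 56000,
  go: 42000,
  cargo: 28000,
  rubygems: 21000,
  docker: 13000,
}

// Merge a user-supplied --top-n JSON string over the defaults so that
// ecosystems missing from the override keep their default top-N.
function resolveTopN(rawJson: string | undefined): Record<string, number> {
  const merged: Record<string, number> = { ...DEFAULT_TOP_N }
  if (rawJson === undefined) return merged

  const parsed: unknown = JSON.parse(rawJson)
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('--top-n must be a JSON object of ecosystem -> integer')
  }
  for (const [ecosystem, value] of Object.entries(parsed)) {
    if (typeof value !== 'number' || !Number.isInteger(value) || value < 0) {
      throw new Error(`invalid top-N for ${ecosystem}: ${String(value)}`)
    }
    merged[ecosystem] = value
  }
  return merged
}
```

With this shape, `resolveTopN('{"cargo": 5}')` overrides only cargo and leaves every other ecosystem at its default, so the SQL side never sees a missing key.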
const ecosystem = process.argv[2] ?? 'cargo'
const graphOnly = process.argv.includes('--graph-only')

function parseJsonArg(flag: string, fallback: string): string {
  const idx = process.argv.indexOf(flag)
  return idx !== -1 ? process.argv[idx + 1] : fallback
}
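As written, `parseJsonArg` returns `process.argv[idx + 1]` even when the flag is the last argument, in which case the value is `undefined` at runtime despite the `string` return type. A stricter variant might look like the sketch below. The name `parseJsonArgStrict` is hypothetical, and passing `argv` in as a parameter (rather than reading `process.argv` directly) is an assumption made here to keep the helper testable.

```typescript
// Hypothetical stricter variant of parseJsonArg.
// Returns the fallback only when the flag is absent; otherwise the value
// must exist, must not look like another flag, and must be valid JSON.
function parseJsonArgStrict(argv: readonly string[], flag: string, fallback: string): string {
  const idx = argv.indexOf(flag)
  if (idx === -1) return fallback

  const raw = argv[idx + 1]
  if (raw === undefined || raw.startsWith('--')) {
    throw new Error(`${flag} requires a JSON value`)
  }
  JSON.parse(raw) // throws SyntaxError early on malformed JSON
  return raw
}
```

Failing fast in the CLI is a design choice: a malformed value would otherwise only surface as a cast error inside the SQL call.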
  if (delta < convergence) break // scores have stabilised
}

return { scores, iterations: iters }
// Each node v collects votes from packages that depend on it.
// numDeps[dependent] is always >= 1 here: only packages with at least one
// outgoing edge appear in colData, so division by zero cannot occur.
// Dangling nodes (numDeps = 0) never appear in colData; their score
// accumulates but never redistributes. This is acceptable because scores
// are used for relative ranking via pct_rank(), not as absolute values.
for (let v = 0; v < N; v++) {
  let incoming = 0
  for (let j = rowPtr[v]; j < rowPtr[v + 1]; j++) {
    const dependent = colData[j]
    incoming += scores[dependent] / numDeps[dependent]
  }
  next[v] = teleportation + damping * incoming
}
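For comparison with the loop above, here is a self-contained sketch of the same pull-style iteration that additionally spreads the score of dangling nodes uniformly over all nodes, so the scores keep summing to 1. This is the textbook variant, not what the PR implements. The `Graph` shape is inferred from the destructuring in `computePageRank`; the teleportation term `(1 - damping) / N`, the L1 definition of `delta`, and the name `pageRankWithDangling` are assumptions.

```typescript
interface Graph {
  N: number
  rowPtr: Uint32Array // length N + 1; in-edges of v are colData[rowPtr[v] .. rowPtr[v + 1])
  colData: Uint32Array // index of the dependent package for each in-edge
  numDeps: Uint32Array // number of direct dependencies (out-degree) per node
}

function pageRankWithDangling(
  { N, rowPtr, colData, numDeps }: Graph,
  damping = 0.85,
  maxIter = 100,
  convergence = 1e-6,
): { scores: Float64Array; iterations: number } {
  let scores = new Float64Array(N).fill(1 / N)
  let next = new Float64Array(N)
  let iterations = 0

  while (iterations < maxIter) {
    iterations++

    // Score held by nodes with no outgoing edges; shared equally by everyone.
    let danglingMass = 0
    for (let u = 0; u < N; u++) {
      if (numDeps[u] === 0) danglingMass += scores[u]
    }
    const base = (1 - damping) / N + (damping * danglingMass) / N

    let delta = 0
    for (let v = 0; v < N; v++) {
      let incoming = 0
      for (let j = rowPtr[v]; j < rowPtr[v + 1]; j++) {
        const dependent = colData[j]
        incoming += scores[dependent] / numDeps[dependent]
      }
      next[v] = base + damping * incoming
      delta += Math.abs(next[v] - scores[v])
    }

    // Swap buffers instead of allocating a new array per iteration.
    const tmp = scores
    scores = next
    next = tmp

    if (delta < convergence) break // scores have stabilised
  }
  return { scores, iterations }
}
```

Because the downstream formula only consumes a percentile rank, the extra normalisation may not change the final ordering much; whether it is worth the additional pass over `numDeps` is a judgment call for the authors.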
// Bulk-update centrality_score on packages_universe rows by joining through packages.
// Uses unnest: one parameterised query regardless of row count, no string interpolation.
// Isolated packages (not in the graph) remain NULL; rank_packages_universe() treats
// NULL as 0 via COALESCE. Idempotent, so safe for Temporal retries.
export function computePageRank(
  { numDeps, rowPtr, colData, N }: Graph,
  damping = 0.85,
  maxIter = 100,
  convergence = 1e-6,
-- Formula (ADR-0001 §Criticality scoring methodology):
--   impact = w_downloads  * pct_rank( LOG(1 + downloads_last_30d) )         within ecosystem
--          + w_dep_pkgs   * pct_rank( LOG(1 + dependent_count) )            within ecosystem
--          + w_transitive * pct_rank( LOG(1 + transitive_dependent_count) ) within ecosystem
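To make the formula concrete, the sketch below computes the same weighted blend in memory for the rows of a single ecosystem. It assumes `pct_rank` means SQL's `PERCENT_RANK()`, that is `(rank - 1) / (n - 1)` with ties sharing the lowest rank, and that `LOG` is base 10; the `Signals` type and the function names are hypothetical, and the default weights are the 0.25 / 0.25 / 0.50 defaults from the migration.

```typescript
interface Signals {
  downloads: number // downloads_last_30d
  dependents: number // dependent_count
  transitive: number // transitive_dependent_count
}

// PERCENT_RANK semantics: (rank - 1) / (n - 1); ties share the lowest rank.
// A single-row partition gets 0.
function percentRank(values: number[]): number[] {
  const n = values.length
  if (n <= 1) return values.map(() => 0)
  const sorted = [...values].sort((a, b) => a - b)
  const firstIndex = new Map<number, number>()
  sorted.forEach((v, i) => {
    if (!firstIndex.has(v)) firstIndex.set(v, i)
  })
  return values.map((v) => (firstIndex.get(v) ?? 0) / (n - 1))
}

// Weighted blend of per-signal percentile ranks for one ecosystem.
function impactScores(
  rows: Signals[],
  w = { downloads: 0.25, dependents: 0.25, transitive: 0.5 },
): number[] {
  const logged = (xs: number[]) => xs.map((x) => Math.log10(1 + x))
  const d = percentRank(logged(rows.map((r) => r.downloads)))
  const p = percentRank(logged(rows.map((r) => r.dependents)))
  const t = percentRank(logged(rows.map((r) => r.transitive)))
  return rows.map((_, i) => w.downloads * d[i] + w.dependents * p[i] + w.transitive * t[i])
}
```

One observation that follows from the definition: a percentile rank depends only on the ordering of the values, and the logarithm is strictly increasing, so in exact arithmetic the log transform does not change the rank. The sketch keeps it only to mirror the formula.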
impact = w_downloads  * pct_rank( LOG(1 + downloads_last_30d) )         within ecosystem
       + w_dep_pkgs   * pct_rank( LOG(1 + dependent_count) )            within ecosystem
       + w_transitive * pct_rank( LOG(1 + transitive_dependent_count) ) within ecosystem
// ── Build graph ───────────────────────────────────────────────────────────
console.log(`Building graph for ecosystem=${ecosystem} ...`)
let t = Date.now()
const edges = await loadDirectEdges(qx, ecosystem)
const edgeCount = edges.length
const graph = buildGraph(edges)
    }
  },
)
log.info({ ecosystem, iterations, nodeCount: graph.N }, 'PageRank converged')
"start:packages-worker": "CROWD_TEMPORAL_TASKQUEUE=packages-worker CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=packages-worker tsx src/bin/packages-worker.ts",
"start:criticality-worker": "CROWD_TEMPORAL_TASKQUEUE=packages-worker CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=criticality-worker tsx src/bin/criticality-worker.ts",
"start:deps-dev-ingest": "CROWD_TEMPORAL_TASKQUEUE=deps-dev-ingest CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=deps-dev-ingest tsx src/bin/deps-dev-ingest.ts",
"start:github-repos-enricher": "SERVICE=github-repos-enricher tsx src/bin/github-repos-enricher.ts",
"start:npm-worker": "CROWD_TEMPORAL_TASKQUEUE=npm-worker CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=npm-worker tsx src/bin/npm-worker.ts",
"start:packages-worker": "CROWD_TEMPORAL_TASKQUEUE=packages-worker CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=packages-worker tsx src/bin/packages-worker.ts",
"start:github-repos-enricher": "SERVICE=github-repos-enricher tsx src/bin/github-repos-enricher.ts",
"scripts": {
  "start:packages-worker": "CROWD_TEMPORAL_TASKQUEUE=packages-worker CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=packages-worker tsx src/bin/packages-worker.ts",
  "start:criticality-worker": "CROWD_TEMPORAL_TASKQUEUE=packages-worker CROWD_TEMPORAL_NAMESPACE=$CROWD_PACKAGES_TEMPORAL_NAMESPACE SERVICE=criticality-worker tsx src/bin/criticality-worker.ts",
FROM package_dependencies pd
JOIN packages p
  ON p.id = pd.package_id
  AND p.ecosystem = $/ecosystem/
WHERE pd.dependency_kind = 'direct'`,
export function buildGraph(edges: DirectEdge[]): Graph {
  // Pass 0: assign contiguous indices
  const nodeIndex = new Map<number, number>()
  const nodeIds: number[] = []
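Only the first pass of `buildGraph` is visible in this snippet. For readers unfamiliar with CSR, the sketch below shows one common way to finish the construction: count degrees, prefix-sum them into `rowPtr`, then fill `colData`. It is an illustration, not the PR's code. The `DirectEdge` field names (`packageId`, `dependsOnId`), the `CsrGraph` type, and the name `buildCsr` are assumptions; rows are indexed by dependency so that each row lists the packages depending on it, matching the PageRank loop quoted in this review.

```typescript
interface DirectEdge {
  packageId: number // the dependent package (assumed field name)
  dependsOnId: number // the package it depends on (assumed field name)
}

interface CsrGraph {
  N: number
  nodeIds: number[] // index -> original package id
  rowPtr: Uint32Array // length N + 1
  colData: Uint32Array // dependents, grouped by dependency
  numDeps: Uint32Array // out-degree per node
}

function buildCsr(edges: DirectEdge[]): CsrGraph {
  // Pass 0: assign contiguous indices in first-seen order.
  const nodeIndex = new Map<number, number>()
  const nodeIds: number[] = []
  const indexOf = (id: number): number => {
    let i = nodeIndex.get(id)
    if (i === undefined) {
      i = nodeIds.length
      nodeIndex.set(id, i)
      nodeIds.push(id)
    }
    return i
  }
  const src = new Uint32Array(edges.length)
  const dst = new Uint32Array(edges.length)
  edges.forEach((e, k) => {
    src[k] = indexOf(e.packageId)
    dst[k] = indexOf(e.dependsOnId)
  })
  const N = nodeIds.length

  // Pass 1: count in-degree (into rowPtr, shifted by one) and out-degree.
  const rowPtr = new Uint32Array(N + 1)
  const numDeps = new Uint32Array(N)
  for (let k = 0; k < edges.length; k++) {
    rowPtr[dst[k] + 1]++
    numDeps[src[k]]++
  }
  // Prefix sum turns counts into row start offsets.
  for (let v = 0; v < N; v++) rowPtr[v + 1] += rowPtr[v]

  // Pass 2: place each dependent into its dependency's row.
  const cursor = rowPtr.slice(0, N)
  const colData = new Uint32Array(edges.length)
  for (let k = 0; k < edges.length; k++) {
    colData[cursor[dst[k]]++] = src[k]
  }
  return { N, nodeIds, rowPtr, colData, numDeps }
}
```

Note that this sketch does not deduplicate repeated edges; if the source table can contain the same dependency twice, the duplicate would count twice in both `colData` and `numDeps`.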
Per-ecosystem percentile-rank of each log-transformed signal, then weighted blend:

```
score = w_downloads * pct_rank( LN(1 + downloads_last_30d) ) within ecosystem
+ w_dep_pkgs * pct_rank( LN(1 + dependent_packages_count) ) within ecosystem
+ w_dep_repos * pct_rank( LN(1 + dependent_repos_count) ) within ecosystem
+ w_transitive * pct_rank( LN(1 + transitive_dependent_count) ) within ecosystem
+ w_centrality * pct_rank( centrality_score ) within ecosystem
impact = w_downloads * pct_rank( LOG(1 + downloads_last_30d) ) within ecosystem
+ w_dep_pkgs * pct_rank( LOG(1 + dependent_count) ) within ecosystem
```


This pull request introduces a new criticality scoring pipeline for open source packages, implementing the ADR-0001 methodology. It adds a manual override table for criticality, a new impact scoring formula, and a standalone PageRank-based centrality computation. The changes include new SQL migrations, worker scripts, and supporting TypeScript modules for graph construction and scoring. These updates make the system more flexible, auditable, and tunable.
Database schema and scoring logic:
- Adds a package_criticality_spotlight table to allow manual overrides for critical package designation, ensuring certain packages are always marked as critical regardless of computed score.
- Updates rank_packages_universe() to use an "impact" metric (replacing criticality_score), based on weighted percentiles of downloads, direct dependents, and transitive dependents. The function now also applies spotlight overrides and propagates scores to the main packages table.
PageRank centrality computation:
Worker scripts and developer tools:
- Adds scripts to package.json for running and developing the criticality worker, PageRank, and impact scorer, supporting both production and local environments.
- Adds standalone scripts for running PageRank (run-pagerank.ts) with validation/spot-checks, and for triggering the impact scoring function (run-impact.ts) with tunable parameters.
Note
High Risk
Renames scoring columns and redefines is_critical selection for large package universes; incorrect weights, top-N JSON, or graph/spotlight logic would mis-rank Tier 2 enrichment targets.
Overview
Introduces the criticality slice of ADR-0001: auditable spotlight overrides, a new impact ranking pass in Postgres, and an in-worker PageRank path that writes centrality_score ahead of folding it into impact.
Database: Adds package_criticality_spotlight and replaces rank_packages_universe() so criticality_score becomes impact on packages_universe and packages. Impact is a per-ecosystem weighted blend of percentile ranks on log downloads, direct dependents, and transitive dependents (defaults 0.25 / 0.25 / 0.50), then top-N is_critical, then spotlight forces critical, then propagation to packages. ADR-0001 is updated to describe this slimmer formula (PageRank stored but not in impact yet).
Worker / tooling: New src/criticality/ builds a CSR graph from direct package_dependencies, runs PageRank, and bulk-updates packages_universe.centrality_score. criticality-worker is a DB health stub for now; run:pagerank (with DB spot-checks) and run:impact invoke scoring on demand. package.json gains start/dev scripts for the criticality worker and the CLIs.
Reviewed by Cursor Bugbot for commit 6054c53. Bugbot is set up for automated code reviews on this repo.
src/criticality/builds a CSR graph from directpackage_dependencies, runs PageRank, and bulk-updatespackages_universe.centrality_score.criticality-workeris a DB health stub for now;run:pagerank(with DB spot-checks) andrun:impactinvoke scoring on demand.package.jsongains start/dev scripts for the criticality worker and the CLIs.Reviewed by Cursor Bugbot for commit 6054c53. Bugbot is set up for automated code reviews on this repo. Configure here.