feat: pom fetcher #4144
Pull request overview
Introduces a new Maven “POM fetcher” worker loop to enrich packages data in the osspckgs database by fetching Maven Central metadata/POMs, and adds corresponding data-access-layer helpers for selecting candidates and upserting packages/maintainers.
Changes:
- Added `@crowd/data-access-layer` osspckgs module with queries for Maven enrichment candidates and upserts into `packages`, `maintainers`, and `package_maintainers`.
- Added a `pom-fetcher` worker (config + entrypoint + enrichment loop) that resolves latest Maven versions and extracts POM metadata (licenses, SCM, developers/contributors).
- Wired up scripts/deps for running the new worker (package.json scripts, docker-compose service yaml, lockfile updates).
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/osspckgs/types.ts | Adds DB-facing types for osspckgs package/maintainer upserts and universe rows. |
| services/libs/data-access-layer/src/osspckgs/packages.ts | Adds query to list Maven universe packages needing enrichment + upsert into packages. |
| services/libs/data-access-layer/src/osspckgs/maintainers.ts | Adds upserts for maintainers and package_maintainers. |
| services/libs/data-access-layer/src/osspckgs/index.ts | Re-exports osspckgs DAL surface. |
| services/libs/data-access-layer/src/index.ts | Exposes osspckgs DAL from the package root. |
| services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts | Implements batch/concurrent enrichment loop and persistence of extracted metadata. |
| services/apps/packages_worker/src/pom-fetcher/metadata.ts | Resolves latest version via maven-metadata.xml. |
| services/apps/packages_worker/src/pom-fetcher/extract.ts | Fetches POMs and extracts fields with limited parent inheritance traversal. |
| services/apps/packages_worker/src/config.ts | Adds pom-fetcher config loader. |
| services/apps/packages_worker/src/bin/pom-fetcher.ts | Adds runnable entrypoint with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and deps (axios, fast-xml-parser) for pom-fetcher. |
| scripts/services/pom-fetcher.yaml | Adds docker-compose service definition for pom-fetcher. |
| pnpm-lock.yaml | Updates lockfile for new deps (but includes an unexpected workspace importer). |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…ved to packages_worker) Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
INSERT INTO versions (package_id, ecosystem, name, number, is_latest, is_prerelease, licenses, last_synced_at)
SELECT
  t.package_id, t.ecosystem, t.name, t.number, t.is_latest, t.is_prerelease,
  CASE WHEN t.license IS NULL THEN NULL ELSE ARRAY[t.license] END,
  NOW()
FROM UNNEST(
  $(packageIds)::bigint[],
  $(ecosystems)::text[],
  $(names)::text[],
  $(numbers)::text[],
  $(isLatests)::bool[],
  $(isPreleases)::bool[],
  $(licenses)::text[]
) AS t(package_id, ecosystem, name, number, is_latest, is_prerelease, license)
ON CONFLICT (package_id, number) DO UPDATE SET
  is_latest = EXCLUDED.is_latest,
  is_prerelease = EXCLUDED.is_prerelease,
  licenses = COALESCE(EXCLUDED.licenses, versions.licenses),
  last_synced_at = NOW()
RETURNING number, is_latest, is_prerelease, licenses
Your PR title doesn't contain a Jira issue key. Please add one to your PR title for better traceability.
Cursor Bugbot has reviewed your changes and found 5 potential issues.
Reviewed by Cursor Bugbot for commit a21c949. Configure here.
const main = async () => {
  const qx = await getPackagesDb()
  const sql = readFileSync(SQL_PATH, 'utf8')
Missing data quality SQL file
Medium Severity
The new validate:maven-quality script reads ../maven/data_quality.sql at runtime, but that file is not added in this change and is absent from src/maven/. The script fails on startup with a file-not-found error instead of running checks or gating deploys.
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=monitor tsx src/scripts/monitorOsspckgs.ts'",
"trigger-bootstrap": "SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"trigger-bootstrap:local": "set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && node ../../../scripts/monitor-osspckgs.mjs'",
Broken monitor osspckgs script
Medium Severity
monitor:osspckgs:local now runs node ../../../scripts/monitor-osspckgs.mjs, but that file does not exist in the repository. The previous local script invoked src/scripts/monitorOsspckgs.ts, which still exists but is no longer referenced.
if (maintainerLinks.length > 0) {
  const pmChanged = await replacePackageMaintainers(t, packageId, maintainerLinks)
  pmChanged.forEach((f) => changed.add(f))
}
Stale maintainers never cleared
Medium Severity
After a successful POM sync, replacePackageMaintainers runs only when maintainerLinks.length > 0. If the resolved POM has no developers or contributors, existing package_maintainers rows are left untouched, so outdated maintainer links can remain in the database.
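One possible fix for the issue described above is to call the replace step unconditionally, so that an empty maintainer list clears existing links. The following is a minimal sketch under stated assumptions: the in-memory `packageMaintainers` map stands in for the `package_maintainers` table, and `replacePackageMaintainersSketch` and `syncMaintainers` are hypothetical, synchronous stand-ins (the real `replacePackageMaintainers` is async and takes a transaction).

```typescript
// Hypothetical in-memory stand-in for the package_maintainers table.
type MaintainerLink = { maintainerId: number; role: string }

const packageMaintainers = new Map<number, MaintainerLink[]>()

// Sketch of a "replace" operation: the stored links for a package become exactly
// the given list. Returns the names of fields that changed (assumed shape).
function replacePackageMaintainersSketch(packageId: number, links: MaintainerLink[]): string[] {
  const before = JSON.stringify(packageMaintainers.get(packageId) ?? [])
  if (links.length === 0) {
    packageMaintainers.delete(packageId)
  } else {
    packageMaintainers.set(packageId, links)
  }
  const after = JSON.stringify(packageMaintainers.get(packageId) ?? [])
  return before === after ? [] : ['maintainers']
}

function syncMaintainers(packageId: number, links: MaintainerLink[], changed: Set<string>): void {
  // No length guard: an empty list is a valid result and must clear stale rows.
  replacePackageMaintainersSketch(packageId, links).forEach((f) => changed.add(f))
}
```

With this shape, a POM that lists no developers or contributors removes previously stored links instead of leaving them in place. Whether clearing is the desired behaviour when a POM fetch partially fails is a separate decision for the author.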
)
ORDER BY
  p.criticality_score DESC NULLS LAST,
  p.id ASC
Critical queue breaks POM cache
Medium Severity
Tier‑2 critical sync orders work by criticality_score, while the parent POM cache in extract.ts assumes batches cluster by rank_in_ecosystem so sibling artifacts share cached parents. That mismatch reduces cache hits and increases redundant Maven HTTP traffic (and throttling risk).
POM_FETCHER_GROUP_DELAY_MS=100
# Set to 'true' on first run against a fresh/restored DB to skip the version-unchanged
# optimisation and force full POM extraction. Set to 'false' after the first pass.
POM_FETCHER_FORCE_FULL_EXTRACTION=true
Force full extraction env ignored
Medium Severity
POM_FETCHER_FORCE_FULL_EXTRACTION is documented in .env.dist.local as controlling full POM extraction on first run, but nothing in getMavenConfig or the worker reads it. The Temporal path always passes forceFullExtraction: false for universe polling regardless of that variable.
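To illustrate one way the variable could be wired in, here is a small sketch that parses the flag and exposes it on a config object. `parseBoolEnv`, `PomFetcherFlags`, and `loadPomFetcherFlags` are hypothetical names introduced for illustration; the actual shape of `getMavenConfig` is not shown in this PR excerpt.

```typescript
// Hypothetical helper: parse a boolean-like env value, falling back to a default
// when the variable is unset or empty.
function parseBoolEnv(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === '') return fallback
  const v = value.trim().toLowerCase()
  if (v === 'true' || v === '1') return true
  if (v === 'false' || v === '0') return false
  throw new Error(`Invalid boolean value: ${value}`)
}

// Assumed config shape; only the flag discussed in the comment is modelled.
interface PomFetcherFlags {
  forceFullExtraction: boolean
}

function loadPomFetcherFlags(env: Record<string, string | undefined>): PomFetcherFlags {
  return {
    forceFullExtraction: parseBoolEnv(env.POM_FETCHER_FORCE_FULL_EXTRACTION, false),
  }
}
```

A caller would pass `process.env` and forward `forceFullExtraction` to the polling path instead of a hard-coded `false`. Throwing on unrecognised values is a design choice of this sketch; silently falling back would also be defensible.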
status: string
}

const SQL_PATH = join(__dirname, '../maven/data_quality.sql')
 * Returns null when the artifact is not found (404) or the metadata is
 * malformed.
 */
  upsertRepo,
  upsertVersionsBatch,
} from '@crowd/data-access-layer'
import { QueryExecutor } from '@crowd/data-access-layer/src/queryExecutor'
/**
 * Core POM extraction logic — pure functions (no I/O side-effects, no DB calls).
 * Callers are responsible for concurrency, retries, and persistence.
 */
// with transient errors — we never do it. Maven coordinates are immutable, so a cached
// POM never goes stale; the LRU size cap is purely to bound memory.

const POM_CACHE_MAX_ENTRIES = 5_000
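The comment above describes a size-capped LRU cache with no expiry. As a rough sketch of that idea (not the implementation in `extract.ts`, which is not shown here), a `Map` can serve as an LRU because it iterates in insertion order. `LruCacheSketch` is a hypothetical name, and request coalescing is deliberately left out.

```typescript
// Minimal LRU sketch built on Map insertion order: the first key is the
// least recently used one. There is no TTL, matching the "never goes stale" reasoning.
class LruCacheSketch<V> {
  private readonly entries = new Map<string, V>()
  private readonly maxEntries: number

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key)
    if (value === undefined) return undefined
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: string, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, value)
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry to bound memory.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }

  get size(): number {
    return this.entries.size
  }
}
```

Keys would plausibly be Maven coordinates such as `groupId:artifactId:version`; because a released coordinate is immutable, eviction is the only reason an entry ever leaves the cache.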
| Entry point | Mode | Behaviour |
| --- | --- | --- |
| Standalone `bin/maven.ts` | **backfill** | Always runs full POM extraction for every selected critical package, regardless of version. Use for the initial fill / periodic full refresh. |
| Temporal `mavenCriticalWorkflow` | **incremental** | If the upstream release version equals the stored `latest_version`, skips the POM fetch and only bumps `last_synced_at` (status `unchanged`). Full extraction runs only for new packages or when the version changed. |
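The two modes in the table above reduce to a small decision rule. The sketch below is illustrative only: `decideSync`, `SyncMode`, and the `'extract'` status are hypothetical names, while `unchanged` is the status named in the table.

```typescript
type SyncMode = 'backfill' | 'incremental'
// 'unchanged' comes from the table; 'extract' is a hypothetical label for full extraction.
type SyncDecision = 'extract' | 'unchanged'

function decideSync(
  mode: SyncMode,
  storedLatestVersion: string | null, // null models a package that is new to the DB
  upstreamVersion: string,
): SyncDecision {
  // Backfill always runs full POM extraction, regardless of version.
  if (mode === 'backfill') return 'extract'
  // Incremental skips the POM fetch when the upstream release matches the stored one.
  if (storedLatestVersion !== null && storedLatestVersion === upstreamVersion) return 'unchanged'
  // New package or changed version: run full extraction.
  return 'extract'
}
```

For an `unchanged` decision the caller would only bump `last_synced_at`, as the table describes.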
Two Temporal schedules are registered on startup of `bin/packages-worker.ts`
(see `maven/schedule.ts`):

| Schedule ID | Cron | Workflow | Activity | Workflow timeout |
| --- | --- | --- | --- | --- |
| `maven-critical` | `*/5 * * * *` (every 5 min) | `mavenCriticalWorkflow` | `processMavenCriticalBatch` → one critical batch | 15 min |
| `maven-non-critical` | `*/10 * * * *` (every 10 min) | `mavenNonCriticalWorkflow` | `processMavenNonCriticalBatch` → one non-critical batch | 5 min |
SELECT
  p.id,
  p.purl,
  p.namespace,
  p.name,
  p.criticality_score AS "criticalityScore",
  p.dependent_count AS "dependentPackagesCount",
  p.dependent_repos_count AS "dependentReposCount",
  p.downloads_last_month AS "downloads30d",
  p.latest_version AS "latestVersion"
INSERT INTO repos (url, host, owner, name, last_synced_at)
VALUES ($(url), $(host), $(owner), $(name), NOW())
ON CONFLICT (url) DO UPDATE SET
  host = COALESCE(EXCLUDED.host, repos.host),
  owner = COALESCE(EXCLUDED.owner, repos.owner),
  name = COALESCE(EXCLUDED.name, repos.name),
  last_synced_at = NOW()
RETURNING id


Summary
Adds a Maven POM fetcher to the packages_worker service that syncs Maven Central package metadata into the packages DB. It pulls candidates from packages_universe, extracts metadata from POM files (with parent-chain resolution), and populates package, version, maintainer, and repository data. This brings the Maven ecosystem to parity with the existing npm pipeline so critical Maven packages get high-quality, enriched metadata for downstream analytics.
Changes
Type of change
Note
Medium Risk
Large-scale writes to packages DB and sustained outbound calls to Maven Central (rate limits); Temporal scheduling and shared maintainer/repo upserts add operational and concurrency risk, mitigated by idempotent upserts and retries.
Overview
Adds a Maven POM enrichment pipeline in `packages_worker` that syncs critical Maven packages from Central (or a configurable mirror) into the packages DB: metadata, full version lists, maintainers (email hashed), and declared repo links.

Runtime: `packages-worker` registers a `maven-critical` Temporal schedule (1‑minute cron in code) that runs `processMavenCriticalBatch` per tick: an optional delta API pass (`MAVEN_SYNC_SOURCE` `api`/`both`) plus polling of Tier‑2 `packages` rows (`is_critical`), with incremental skips when `latest_version` is unchanged. A separate `maven-backfill` entry point drains the critical queue with always-full POM extraction and graceful shutdown.

Implementation highlights: HTTP fetch/parse (`axios`, `fast-xml-parser`), parent POM inheritance (up to 8 hops), in-process LRU POM cache with request coalescing, rate-limit retries, transactional upserts with deadlock retry, and sentinel `ingestion_source` values for not-on-Central / no-version / POM errors. New `osspckgs` DAL helpers cover package/version/maintainer/repo upserts and sync queue queries. Adds unit tests for prerelease/SCM/repo URL normalization, benchmark and SQL-based data-quality scripts, and local env/script wiring (`backfill:maven`, delta benchmark, quality validate).

Reviewed by Cursor Bugbot for commit a21c949. Bugbot is set up for automated code reviews on this repo.
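The parent POM inheritance mentioned in the overview (up to 8 hops) can be pictured as a bounded walk up the parent chain that fills in fields the child leaves empty. This is a simplified sketch under assumptions: `PomSketch`, `resolveInherited`, and the `lookup` callback are hypothetical, the lookup is synchronous rather than an HTTP fetch, and only licenses and an SCM URL are modelled.

```typescript
// Simplified view of a parsed POM; `parent` holds the parent's coordinate, if any.
interface PomSketch {
  licenses?: string[]
  scmUrl?: string
  parent?: string
}

const MAX_PARENT_HOPS = 8 // hop limit taken from the PR overview

function resolveInherited(
  coord: string,
  lookup: (c: string) => PomSketch | undefined, // stand-in for the real fetch + parse
): { licenses: string[]; scmUrl: string | undefined } {
  let licenses: string[] | undefined
  let scmUrl: string | undefined
  let current: string | undefined = coord
  const seen = new Set<string>()

  // hops = 0 is the starting POM; each later iteration follows one parent link.
  for (let hops = 0; current !== undefined && hops <= MAX_PARENT_HOPS; hops++) {
    if (seen.has(current)) break // guard against cyclic parent references
    seen.add(current)
    const pom = lookup(current)
    if (!pom) break
    // The nearest POM that declares a field wins; ancestors only fill gaps.
    if (licenses === undefined && pom.licenses && pom.licenses.length > 0) licenses = pom.licenses
    if (scmUrl === undefined && pom.scmUrl) scmUrl = pom.scmUrl
    if (licenses !== undefined && scmUrl !== undefined) break
    current = pom.parent
  }
  return { licenses: licenses ?? [], scmUrl }
}
```

Stopping as soon as every field is filled keeps the number of parent fetches low, which matters when each hop is an outbound request to Maven Central.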