feat: pom fetcher #4144

Open
ulemons wants to merge 15 commits into main from feat/pom-fetcher

@ulemons ulemons commented May 26, 2026

Summary

Adds a Maven POM fetcher to the packages_worker service that syncs Maven Central package metadata into the packages DB. It pulls candidates from packages_universe, extracts metadata from POM files (with parent-chain resolution), and populates package, version, maintainer, and repository data. This brings the Maven ecosystem to parity with the existing npm pipeline so critical Maven packages get high-quality, enriched metadata for downstream analytics.

Changes

  • Two-tier fetch strategy — non-critical packages are DB-only (copy universe stats, no HTTP, ~1000 pkg/sec); critical packages get full POM extraction with parent-chain resolution (max 8 hops) for description, homepage, SCM/repo, licenses, maintainers, and the full version list.
  • Two entry points — bin/packages-worker.ts registers the maven-critical Temporal schedule for incremental syncing (skips POM extraction when the version is unchanged), and bin/maven-backfill.ts (pnpm backfill:maven) does a one-shot, resumable full-extraction backfill. The DB state is the cursor, so re-runs pick up where they left off.
  • Module-level parent POM cache (extract.ts) — coordinate-keyed LRU with request coalescing, caches only successful fetches (never null, to avoid poisoning), no TTL since Maven coordinates are immutable. This is the main lever against Maven Central rate limiting and works because the rank_in_ecosystem ordering clusters sibling artifacts that share parent POMs. Exposes getPomCacheStats() for hit-rate observability.
  • New osspckgs data-access-layer module — query functions for packages, versions, maintainers, and repos (functional, pg-promise via queryExecutor), shared across the worker.
  • Delta API support (deltaApi.ts) for incremental upstream change detection, plus benchmark and data-quality validation scripts.
  • Adds unit tests for the pure normalization functions.
  • Maven-specific config in config.ts (POM_FETCHER_REFRESH_DAYS, POM_CACHE_MAX_ENTRIES, etc.) and .env.dist.local entries.
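The cache behaviour described in the list above (LRU eviction, request coalescing, never caching null) can be illustrated with a minimal sketch. The class name `PomCache`, the `Fetcher` type, and the `stats()` shape are hypothetical and are not taken from `extract.ts`; the real module may be structured quite differently.

```typescript
// Hypothetical sketch: names here are illustrative, not the PR's actual API.
type Fetcher<V> = (key: string) => Promise<V | null>

class PomCache<V> {
  // A Map iterates in insertion order, which doubles as LRU order here.
  private readonly entries = new Map<string, V>()
  private readonly inFlight = new Map<string, Promise<V | null>>()
  private readonly maxEntries: number
  private readonly fetcher: Fetcher<V>
  private hits = 0
  private misses = 0

  constructor(maxEntries: number, fetcher: Fetcher<V>) {
    this.maxEntries = maxEntries
    this.fetcher = fetcher
  }

  async get(key: string): Promise<V | null> {
    const cached = this.entries.get(key)
    if (cached !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key)
      this.entries.set(key, cached)
      this.hits++
      return cached
    }

    // Request coalescing: concurrent callers for one key share one fetch.
    const pending = this.inFlight.get(key)
    if (pending) return pending

    this.misses++
    const request = this.fetcher(key)
      .then((value) => {
        // Only successful fetches are stored; a null result is never cached.
        if (value !== null) this.store(key, value)
        return value
      })
      .finally(() => {
        this.inFlight.delete(key)
      })
    this.inFlight.set(key, request)
    return request
  }

  private store(key: string, value: V): void {
    this.entries.set(key, value)
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }

  stats(): { hits: number; misses: number; size: number } {
    return { hits: this.hits, misses: this.misses, size: this.entries.size }
  }
}
```

There is no TTL in the sketch, which matches the stated reasoning that Maven coordinates are immutable; the size cap exists only to bound memory.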

Type of change

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Performance improvement
  • Chore / dependency update
  • Documentation

Note

Medium Risk
Large-scale writes to packages DB and sustained outbound calls to Maven Central (rate limits); Temporal scheduling and shared maintainer/repo upserts add operational and concurrency risk, mitigated by idempotent upserts and retries.

Overview
Adds a Maven POM enrichment pipeline in packages_worker that syncs critical Maven packages from Central (or a configurable mirror) into the packages DB: metadata, full version lists, maintainers (email hashed), and declared repo links.

Runtime: packages-worker registers a maven-critical Temporal schedule (1‑minute cron in code) that runs processMavenCriticalBatch per tick—optional delta API pass (MAVEN_SYNC_SOURCE api/both) plus polling of Tier‑2 packages rows (is_critical), with incremental skips when latest_version is unchanged. A separate maven-backfill entry point drains the critical queue with always-full POM extraction and graceful shutdown.
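The incremental skip described above can be sketched as a small pure decision function. The names `decideSyncAction` and `SyncInput` are hypothetical; only the `unchanged` status and the `forceFullExtraction` flag appear elsewhere in this PR, and the actual worker logic may take more inputs into account.

```typescript
// Hypothetical sketch of the incremental-vs-full decision; not the PR's code.
type SyncAction = 'full-extraction' | 'unchanged'

interface SyncInput {
  storedLatestVersion: string | null // null when the package was never synced
  upstreamLatestVersion: string
  forceFullExtraction: boolean // the backfill entry point would pass true
}

function decideSyncAction(input: SyncInput): SyncAction {
  // Backfill mode always re-extracts, regardless of version.
  if (input.forceFullExtraction) return 'full-extraction'
  // A package with no stored version has never been enriched.
  if (input.storedLatestVersion === null) return 'full-extraction'
  // Same version upstream: skip the POM fetch and only bump last_synced_at.
  return input.storedLatestVersion === input.upstreamLatestVersion
    ? 'unchanged'
    : 'full-extraction'
}
```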

Implementation highlights: HTTP fetch/parse (axios, fast-xml-parser), parent POM inheritance (up to 8 hops), in-process LRU POM cache with request coalescing, rate-limit retries, transactional upserts with deadlock retry, and sentinel ingestion_source values for not-on-Central / no-version / POM errors. New osspckgs DAL helpers cover package/version/maintainer/repo upserts and sync queue queries. Adds unit tests for prerelease/SCM/repo URL normalization, benchmark and SQL-based data-quality scripts, and local env/script wiring (backfill:maven, delta benchmark, quality validate).
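Parent POM inheritance with a hop limit can be sketched as follows. This is an illustration under assumptions: the `PomFields` shape, the `lookup` callback, and the rule that the nearest POM wins for every field are hypothetical simplifications, and the real `extract.ts` fetches POMs over HTTP rather than from a synchronous lookup.

```typescript
// Hypothetical sketch of bounded parent-chain resolution; not the PR's code.
interface PomFields {
  description?: string
  url?: string
  scmUrl?: string
  licenses?: string[]
  parent?: string // "groupId:artifactId:version" of the parent POM, if any
}

const MAX_PARENT_HOPS = 8

function resolveInherited(
  coordinate: string,
  lookup: (coordinate: string) => PomFields | undefined,
): PomFields {
  const resolved: PomFields = {}
  const seen = new Set<string>()
  let current: string | undefined = coordinate
  // The child itself is hop 0; at most MAX_PARENT_HOPS parents are visited.
  for (let hop = 0; hop <= MAX_PARENT_HOPS && current !== undefined; hop++) {
    if (seen.has(current)) break // guard against cyclic parent references
    seen.add(current)
    const pom = lookup(current)
    if (!pom) break
    // Values closer to the child win; parents only fill fields still missing.
    resolved.description ??= pom.description
    resolved.url ??= pom.url
    resolved.scmUrl ??= pom.scmUrl
    resolved.licenses ??= pom.licenses
    current = pom.parent
  }
  return resolved
}
```

Bounding the walk keeps a malformed or unusually deep parent chain from turning one package into an unbounded series of requests.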

Reviewed by Cursor Bugbot for commit a21c949. Bugbot is set up for automated code reviews on this repo. Configure here.

Copilot AI review requested due to automatic review settings May 26, 2026 15:59
CLAassistant commented May 26, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ mbani01
❌ ulemons
You have signed the CLA already but the status is still pending? Let us recheck it.

Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed

@github-actions github-actions Bot left a comment

Conventional Commits FTW!

@ulemons ulemons changed the base branch from main to feat/track-packages May 26, 2026 16:00

Copilot AI left a comment

Pull request overview

Introduces a new Maven “POM fetcher” worker loop to enrich packages data in the osspckgs database by fetching Maven Central metadata/POMs, and adds corresponding data-access-layer helpers for selecting candidates and upserting packages/maintainers.

Changes:

  • Added @crowd/data-access-layer osspckgs module with queries for Maven enrichment candidates and upserts into packages, maintainers, and package_maintainers.
  • Added a pom-fetcher worker (config + entrypoint + enrichment loop) that resolves latest Maven versions and extracts POM metadata (licenses, SCM, developers/contributors).
  • Wired up scripts/deps for running the new worker (package.json scripts, docker-compose service yaml, lockfile updates).

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 12 comments.

Summary per file

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/osspckgs/types.ts | Adds DB-facing types for osspckgs package/maintainer upserts and universe rows. |
| services/libs/data-access-layer/src/osspckgs/packages.ts | Adds query to list Maven universe packages needing enrichment + upsert into packages. |
| services/libs/data-access-layer/src/osspckgs/maintainers.ts | Adds upserts for maintainers and package_maintainers. |
| services/libs/data-access-layer/src/osspckgs/index.ts | Re-exports osspckgs DAL surface. |
| services/libs/data-access-layer/src/index.ts | Exposes osspckgs DAL from the package root. |
| services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts | Implements batch/concurrent enrichment loop and persistence of extracted metadata. |
| services/apps/packages_worker/src/pom-fetcher/metadata.ts | Resolves latest version via maven-metadata.xml. |
| services/apps/packages_worker/src/pom-fetcher/extract.ts | Fetches POMs and extracts fields with limited parent inheritance traversal. |
| services/apps/packages_worker/src/config.ts | Adds pom-fetcher config loader. |
| services/apps/packages_worker/src/bin/pom-fetcher.ts | Adds runnable entrypoint with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and deps (axios, fast-xml-parser) for pom-fetcher. |
| scripts/services/pom-fetcher.yaml | Adds docker-compose service definition for pom-fetcher. |
| pnpm-lock.yaml | Updates lockfile for new deps (but includes an unexpected workspace importer). |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/types.ts
Comment thread services/libs/data-access-layer/src/osspckgs/types.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/config.ts Outdated
Comment thread pnpm-lock.yaml Outdated
Base automatically changed from feat/track-packages to main May 26, 2026 17:44
@ulemons ulemons changed the title from "Feat/pom fetcher" to "feat: pom fetcher" May 27, 2026
@ulemons ulemons self-assigned this May 27, 2026
Copilot AI review requested due to automatic review settings June 2, 2026 13:49
@ulemons ulemons force-pushed the feat/pom-fetcher branch from b0812f9 to d907907 Compare June 2, 2026 13:49

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 24 changed files in this pull request and generated 16 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts
Comment thread services/libs/data-access-layer/src/osspckgs/versions.ts Outdated
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Copilot AI review requested due to automatic review settings June 3, 2026 19:36
mbani01 and others added 11 commits June 3, 2026 21:38
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…ved to packages_worker)

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
@ulemons ulemons force-pushed the feat/pom-fetcher branch from 1fef57d to 27c4836 Compare June 3, 2026 19:40

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 28 changed files in this pull request and generated 9 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread backend/.env.dist.local
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 3, 2026 20:21
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 15 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +38 to +57
INSERT INTO versions (package_id, ecosystem, name, number, is_latest, is_prerelease, licenses, last_synced_at)
SELECT
  t.package_id, t.ecosystem, t.name, t.number, t.is_latest, t.is_prerelease,
  CASE WHEN t.license IS NULL THEN NULL ELSE ARRAY[t.license] END,
  NOW()
FROM UNNEST(
  $(packageIds)::bigint[],
  $(ecosystems)::text[],
  $(names)::text[],
  $(numbers)::text[],
  $(isLatests)::bool[],
  $(isPreleases)::bool[],
  $(licenses)::text[]
) AS t(package_id, ecosystem, name, number, is_latest, is_prerelease, license)
ON CONFLICT (package_id, number) DO UPDATE SET
  is_latest = EXCLUDED.is_latest,
  is_prerelease = EXCLUDED.is_prerelease,
  licenses = COALESCE(EXCLUDED.licenses, versions.licenses),
  last_synced_at = NOW()
RETURNING number, is_latest, is_prerelease, licenses
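The query above takes one array per column, so the caller has to transpose its row objects into parallel arrays before binding. A minimal sketch of that step follows; the `VersionRow` interface and the `toUnnestParams` helper are hypothetical, while the returned keys mirror the `$(...)` placeholders in the quoted query, including the spelling `isPreleases`.

```typescript
// Hypothetical helper: transposes rows into the column arrays UNNEST expects.
interface VersionRow {
  packageId: number
  ecosystem: string
  name: string
  number: string
  isLatest: boolean
  isPrerelease: boolean
  license: string | null
}

function toUnnestParams(rows: VersionRow[]) {
  // Every array has the same length and index i always describes row i,
  // which is what lets UNNEST zip them back into rows on the database side.
  return {
    packageIds: rows.map((r) => r.packageId),
    ecosystems: rows.map((r) => r.ecosystem),
    names: rows.map((r) => r.name),
    numbers: rows.map((r) => r.number),
    isLatests: rows.map((r) => r.isLatest),
    isPreleases: rows.map((r) => r.isPrerelease),
    licenses: rows.map((r) => r.license),
  }
}
```

Because a null license is passed through as NULL, the `COALESCE` in the `ON CONFLICT` clause keeps any license already stored for that version.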
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts
Comment thread services/apps/packages_worker/package.json
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 3, 2026 20:40
@ulemons ulemons marked this pull request as ready for review June 3, 2026 20:45

github-actions Bot commented Jun 3, 2026

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 5 potential issues.


const main = async () => {
  const qx = await getPackagesDb()
  const sql = readFileSync(SQL_PATH, 'utf8')

Missing data quality SQL file

Medium Severity

The new validate:maven-quality script reads ../maven/data_quality.sql at runtime, but that file is not added in this change and is absent from src/maven/. The script fails on startup with a file-not-found error instead of running checks or gating deploys.
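Independently of adding the missing file, the script could fail with a clearer message when the SQL file is absent. The sketch below is one possible guard; the helper name `loadSqlFile` and the error wording are hypothetical and not part of this PR.

```typescript
import { existsSync, readFileSync } from 'node:fs'

// Hypothetical guard: fail fast with an actionable message instead of a raw
// ENOENT stack trace when the data-quality SQL file is missing.
function loadSqlFile(sqlPath: string): string {
  if (!existsSync(sqlPath)) {
    throw new Error(`Data-quality SQL file not found: ${sqlPath}`)
  }
  return readFileSync(sqlPath, 'utf8')
}
```

In the script quoted above, this would stand in for the direct `readFileSync(SQL_PATH, 'utf8')` call.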


"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=monitor tsx src/scripts/monitorOsspckgs.ts'",
"trigger-bootstrap": "SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"trigger-bootstrap:local": "set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && SERVICE=deps-dev-ingest tsx src/scripts/triggerBootstrap.ts",
"monitor:osspckgs:local": "bash -c 'set -a && . ../../../backend/.env.dist.local && . ../../../backend/.env.override.local && set +a && node ../../../scripts/monitor-osspckgs.mjs'",

Broken monitor osspckgs script

Medium Severity

monitor:osspckgs:local now runs node ../../../scripts/monitor-osspckgs.mjs, but that file does not exist in the repository. The previous local script invoked src/scripts/monitorOsspckgs.ts, which still exists but is no longer referenced.


if (maintainerLinks.length > 0) {
  const pmChanged = await replacePackageMaintainers(t, packageId, maintainerLinks)
  pmChanged.forEach((f) => changed.add(f))
}

Stale maintainers never cleared

Medium Severity

After a successful POM sync, replacePackageMaintainers runs only when maintainerLinks.length > 0. If the resolved POM has no developers or contributors, existing package_maintainers rows are left untouched, so outdated maintainer links can remain in the database.
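One way to reason about the fix is to compute the sync plan unconditionally, so that an empty desired list results in every existing link being deleted. The sketch below is a pure illustration; `MaintainerLink` and `planMaintainerSync` are hypothetical names, and the real `replacePackageMaintainers` may have different semantics.

```typescript
// Hypothetical sketch: diff stored links against the links from the POM.
interface MaintainerLink {
  maintainerId: number
  role: string
}

function planMaintainerSync(existing: MaintainerLink[], desired: MaintainerLink[]) {
  const key = (link: MaintainerLink) => `${link.maintainerId}:${link.role}`
  const existingKeys = new Set(existing.map(key))
  const desiredKeys = new Set(desired.map(key))
  return {
    // Links present in the POM but not yet stored.
    toInsert: desired.filter((link) => !existingKeys.has(key(link))),
    // Stored links the POM no longer declares. With an empty desired list,
    // this is every existing link, so stale rows are cleared.
    toDelete: existing.filter((link) => !desiredKeys.has(key(link))),
  }
}
```

Whether an empty developer list should really clear the stored maintainers is a product decision: a POM that fails to resolve its parent chain could also yield an empty list.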


)
ORDER BY
p.criticality_score DESC NULLS LAST,
p.id ASC

Critical queue breaks POM cache

Medium Severity

Tier‑2 critical sync orders work by criticality_score, while the parent POM cache in extract.ts assumes batches cluster by rank_in_ecosystem so sibling artifacts share cached parents. That mismatch reduces cache hits and increases redundant Maven HTTP traffic (and throttling risk).
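One possible mitigation, which this PR does not propose, is to leave the `criticality_score` ordering in the query but regroup each fetched batch so that artifacts from the same group are processed back to back. The sketch assumes that the `namespace` column holds the Maven groupId and that siblings in a group tend to share parent POMs; `clusterByNamespace` is a hypothetical helper.

```typescript
// Hypothetical sketch: regroup a batch so same-namespace items are adjacent.
function clusterByNamespace<T extends { namespace: string }>(batch: T[]): T[] {
  const groups = new Map<string, T[]>()
  for (const item of batch) {
    const group = groups.get(item.namespace)
    if (group) group.push(item)
    else groups.set(item.namespace, [item])
  }
  // Groups keep the order of their first appearance, so the most critical
  // namespace is still processed first; only neighbours are rearranged.
  const clustered: T[] = []
  for (const group of groups.values()) clustered.push(...group)
  return clustered
}
```

This changes only the order inside a batch, not which packages are selected, so the prioritisation by criticality is preserved at batch granularity.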


Comment thread backend/.env.dist.local
POM_FETCHER_GROUP_DELAY_MS=100
# Set to 'true' on first run against a fresh/restored DB to skip the version-unchanged
# optimisation and force full POM extraction. Set to 'false' after the first pass.
POM_FETCHER_FORCE_FULL_EXTRACTION=true

Force full extraction env ignored

Medium Severity

POM_FETCHER_FORCE_FULL_EXTRACTION is documented in .env.dist.local as controlling full POM extraction on first run, but nothing in getMavenConfig or the worker reads it. The Temporal path always passes forceFullExtraction: false for universe polling regardless of that variable.
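If the variable is meant to be honoured, the config loader would need to parse it into a boolean and pass the result through. A minimal sketch of the parsing step follows; `readBooleanFlag` is a hypothetical helper, and accepting `1`/`0` alongside `true`/`false` is an assumption rather than an established convention in this repo.

```typescript
// Hypothetical helper for reading a boolean flag from an env-like object.
function readBooleanFlag(
  env: Record<string, string | undefined>,
  name: string,
  fallback = false,
): boolean {
  const raw = env[name]
  if (raw === undefined || raw.trim() === '') return fallback
  const value = raw.trim().toLowerCase()
  if (value === 'true' || value === '1') return true
  if (value === 'false' || value === '0') return false
  // Reject typos loudly rather than silently treating them as false.
  throw new Error(`Invalid boolean for ${name}: ${raw}`)
}
```

A loader could then call `readBooleanFlag(process.env, 'POM_FETCHER_FORCE_FULL_EXTRACTION')` and hand the result to the Temporal path in place of the hard-coded `false`.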



Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 12 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

status: string
}

const SQL_PATH = join(__dirname, '../maven/data_quality.sql')
Comment on lines +8 to +10
* Returns null when the artifact is not found (404) or the metadata is
* malformed.
*/
upsertRepo,
upsertVersionsBatch,
} from '@crowd/data-access-layer'
import { QueryExecutor } from '@crowd/data-access-layer/src/queryExecutor'
Comment on lines +1 to +4
/**
* Core POM extraction logic — pure functions (no I/O side-effects, no DB calls).
* Callers are responsible for concurrency, retries, and persistence.
*/
// with transient errors — we never do it. Maven coordinates are immutable, so a cached
// POM never goes stale; the LRU size cap is purely to bound memory.

const POM_CACHE_MAX_ENTRIES = 5_000
Comment on lines +36 to +39
| Entry point | Mode | Behaviour |
| -------------------------------- | --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Standalone `bin/maven.ts` | **backfill** | Always runs full POM extraction for every selected critical package, regardless of version. Use for the initial fill / periodic full refresh. |
| Temporal `mavenCriticalWorkflow` | **incremental** | If the upstream release version equals the stored `latest_version`, skips the POM fetch and only bumps `last_synced_at` (status `unchanged`). Full extraction runs only for new packages or when the version changed. |
Comment on lines +303 to +310
Two Temporal schedules are registered on startup of `bin/packages-worker.ts`
(see `maven/schedule.ts`):

| Schedule ID | Cron | Workflow | Activity | Workflow timeout |
| -------------------- | ----------------------------- | -------------------------- | ------------------------------------------------------- | ---------------- |
| `maven-critical` | `*/5 * * * *` (every 5 min) | `mavenCriticalWorkflow` | `processMavenCriticalBatch` → one critical batch | 15 min |
| `maven-non-critical` | `*/10 * * * *` (every 10 min) | `mavenNonCriticalWorkflow` | `processMavenNonCriticalBatch` → one non-critical batch | 5 min |

Comment thread backend/.env.dist.local
Comment on lines +210 to +212
# Set to 'true' on first run against a fresh/restored DB to skip the version-unchanged
# optimisation and force full POM extraction. Set to 'false' after the first pass.
POM_FETCHER_FORCE_FULL_EXTRACTION=true
Comment on lines +69 to +78
SELECT
  p.id,
  p.purl,
  p.namespace,
  p.name,
  p.criticality_score AS "criticalityScore",
  p.dependent_count AS "dependentPackagesCount",
  p.dependent_repos_count AS "dependentReposCount",
  p.downloads_last_month AS "downloads30d",
  p.latest_version AS "latestVersion"
Comment on lines +23 to +30
INSERT INTO repos (url, host, owner, name, last_synced_at)
VALUES ($(url), $(host), $(owner), $(name), NOW())
ON CONFLICT (url) DO UPDATE SET
  host = COALESCE(EXCLUDED.host, repos.host),
  owner = COALESCE(EXCLUDED.owner, repos.owner),
  name = COALESCE(EXCLUDED.name, repos.name),
  last_synced_at = NOW()
RETURNING id

4 participants