feat: tracking NPM packages #4159
Signed-off-by: anilb <epipav@gmail.com>
Pull request overview
Adds first-pass npm package tracking to the OSS packages subsystem, including ingestion from the npm registry + _changes feed, persistence into the packages-db schema, and scheduled workflows to backfill/refresh download metrics.
Changes:
- Introduces new data-access-layer modules for npm packages, versions, maintainers, repos, download metrics, worker state, and audit-field-change logging.
- Implements Temporal workflows + activities in `packages_worker` to ingest watch-listed npm packages, backfill `downloads_daily`, and refresh monthly `downloads_last_30d` windows.
- Extends DB migrations and local scaffolding to support npm worker state tables and pg_partman partition setup for download tables.
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/packages/versions.ts | Upsert helper for npm package versions. |
| services/libs/data-access-layer/src/packages/repos.ts | Repo upsert + package↔repo provenance linking helpers. |
| services/libs/data-access-layer/src/packages/packages.ts | Package upsert + “tracked npm packages” query. |
| services/libs/data-access-layer/src/packages/npmWorkerState.ts | Persistence for npm _changes cursor. |
| services/libs/data-access-layer/src/packages/npmPackageState.ts | Persistence for “scanned” npm packages. |
| services/libs/data-access-layer/src/packages/maintainers.ts | Maintainer upsert + package↔maintainer linking. |
| services/libs/data-access-layer/src/packages/index.ts | Exports newly added packages DAL modules. |
| services/libs/data-access-layer/src/packages/fundingLinks.ts | Upsert helper for package funding links. |
| services/libs/data-access-layer/src/packages/downloadsLast30d.ts | Reads/upserts downloads_last_30d and optionally mirrors latest to packages_universe. |
| services/libs/data-access-layer/src/packages/downloadsDaily.ts | Computes missing daily dates + inserts daily download rows. |
| services/libs/data-access-layer/src/packages/auditFieldChanges.ts | Inserts audit rows recording which fields changed per ingest execution. |
| services/libs/data-access-layer/src/index.ts | Re-exports the new packages DAL module. |
| services/apps/packages_worker/src/workflows/index.ts | Exposes npm workflows from the worker entrypoint. |
| services/apps/packages_worker/src/utils/concurrency.ts | Adds a generic concurrency-limited async mapper. |
| services/apps/packages_worker/src/npm/workflows.ts | Implements ingest + download backfill/refresh workflows. |
| services/apps/packages_worker/src/npm/watchList.ts | Hardcoded temporary npm watch list. |
| services/apps/packages_worker/src/npm/upsertPackage.ts | Normalizes packuments and persists package + related entities. |
| services/apps/packages_worker/src/npm/types.ts | Types for packument + fetch error shapes. |
| services/apps/packages_worker/src/npm/schedule.ts | Registers Temporal schedules for ingest and download jobs. |
| services/apps/packages_worker/src/npm/normalize.ts | npm name parsing, purl construction, license normalization, repo canonicalization, maintainers extraction. |
| services/apps/packages_worker/src/npm/last30dGaps.ts | Computes missing monthly “rolling 30d” windows. |
| services/apps/packages_worker/src/npm/fetchPackument.ts | Fetches npm packuments from the registry. |
| services/apps/packages_worker/src/npm/fetchDownloads.ts | Fetches npm download counts (daily range + point range + bulk point range). |
| services/apps/packages_worker/src/npm/fetchChanges.ts | Polls replicate.npmjs.com _changes feed. |
| services/apps/packages_worker/src/npm/downloadGaps.ts | Computes contiguous missing-date windows for daily downloads backfill. |
| services/apps/packages_worker/src/npm/activities.ts | Temporal activities: change polling, ingestion, gap detection, downloads fetch+persist. |
| services/apps/packages_worker/src/bin/packages-worker.ts | Worker bootstrap now registers the new schedules. |
| services/apps/packages_worker/src/activities.ts | Re-exports npm activities for Temporal worker registration. |
| services/apps/packages_worker/CONTEXT.md | Documents OSS packages terminology and relationships. |
| scripts/scaffold.yaml | Tightens port bindings to localhost and adds a custom packages-db image build. |
| scripts/packages-db/Dockerfile | Builds a Postgres image with pg_partman installed. |
| docs/adr/0001-oss-packages-design-decisions.md | Updates ADR to reflect npm worker ownership of downloads_last_30d and download workflows. |
| backend/src/osspckgs/migrations/V1780231200__npm_worker.sql | Adds npm worker state tables + pg_partman setup + partition creation for downloads tables. |
| backend/src/osspckgs/migrations/V1779710880__initial_schema.sql | Adjusts maintainer email storage and documents partition management migration; adds audit_field_changes. |
| backend/.env.dist.local | Minor formatting change near packages DB env vars. |
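The `downloadGaps.ts` row above describes computing contiguous missing-date windows for the daily downloads backfill. The PR's actual implementation is not shown in this conversation, so the following is only a minimal sketch of that idea. The names `DateWindow`, `addDays`, and `toContiguousWindows` are hypothetical, and dates are assumed to be `YYYY-MM-DD` strings in UTC.

```typescript
// Hypothetical shape for one backfill window (inclusive on both ends).
interface DateWindow {
  start: string
  end: string
}

// Adds `days` to a YYYY-MM-DD string using UTC arithmetic, so the
// result does not depend on the local timezone.
function addDays(isoDate: string, days: number): string {
  const d = new Date(`${isoDate}T00:00:00Z`)
  d.setUTCDate(d.getUTCDate() + days)
  return d.toISOString().slice(0, 10)
}

// Groups missing dates into runs of consecutive days.
// Assumption: input may be unsorted and may contain duplicates.
function toContiguousWindows(missingDates: readonly string[]): DateWindow[] {
  // YYYY-MM-DD strings sort chronologically under plain string comparison.
  const sorted = Array.from(new Set(missingDates)).sort()
  const windows: DateWindow[] = []
  for (const date of sorted) {
    const last = windows[windows.length - 1]
    if (last && addDays(last.end, 1) === date) {
      last.end = date // extends the current run by one day
    } else {
      windows.push({ start: date, end: date }) // starts a new run
    }
  }
  return windows
}
```

Grouping gaps this way would let a backfill job issue one range request per window instead of one request per missing day, which matters when the upstream API is rate limited.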
Your PR title doesn't contain a Jira issue key. Please add a Jira issue key to your PR title for better traceability.
@cursor review
export function parseNpmName(raw: string): { namespace: string | null; name: string } {
  if (raw.startsWith('@')) {
    const slash = raw.indexOf('/')
    if (slash !== -1) {
      return { namespace: raw.slice(0, slash), name: raw.slice(slash + 1) }
    }
  }
  return { namespace: null, name: raw }
}

export function buildPurl(raw: string): string {
  const { namespace, name } = parseNpmName(raw)
  return namespace ? `pkg:npm/${namespace}/${name}` : `pkg:npm/${name}`
}
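One point worth checking against `buildPurl`: the npm examples in the purl specification percent-encode the scope's leading `@` as `%40` (for example `pkg:npm/%40angular/animation@12.3.1`). Whether the packages-db expects raw or encoded purls is not stated in this conversation, so the variant below is only a sketch under that assumption. `buildPurlEncoded` is a hypothetical name, not part of the PR.

```typescript
// Same parsing logic as the PR's parseNpmName, repeated so the sketch is self-contained.
function parseNpmName(raw: string): { namespace: string | null; name: string } {
  if (raw.startsWith('@')) {
    const slash = raw.indexOf('/')
    if (slash !== -1) {
      return { namespace: raw.slice(0, slash), name: raw.slice(slash + 1) }
    }
  }
  return { namespace: null, name: raw }
}

// Hypothetical variant of buildPurl that percent-encodes each path segment.
// encodeURIComponent turns '@' into '%40' and leaves letters, digits,
// '-', '_' and '.' untouched.
function buildPurlEncoded(raw: string): string {
  const { namespace, name } = parseNpmName(raw)
  const encodedName = encodeURIComponent(name)
  return namespace
    ? `pkg:npm/${encodeURIComponent(namespace)}/${encodedName}`
    : `pkg:npm/${encodedName}`
}
```

If existing rows already store the unencoded form, switching encodings would change the purl key, so any such change would need to be applied consistently on both the write and lookup paths.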
ins AS (
  INSERT INTO package_repos (package_id, repo_id, source, confidence)
  VALUES ($(packageId)::bigint, $(repoId)::bigint, $(source), $(confidence))
  ON CONFLICT (package_id, repo_id) DO UPDATE SET
    confidence = EXCLUDED.confidence,
    verified_at = NOW()
  RETURNING source, confidence
)
SELECT array_remove(ARRAY[
  CASE WHEN o.source IS NULL THEN 'package_repos.repo_id' END,
  CASE WHEN o.source IS NULL THEN 'package_repos.source' END,
  CASE WHEN o.source IS NULL
    OR o.confidence IS DISTINCT FROM ins.confidence THEN 'package_repos.confidence' END
| `versions` | Append-only via `INSERT … ON CONFLICT DO NOTHING`. Yanked/deprecated status is a separate targeted `UPDATE (is_yanked = true) WHERE …`. |
| `repos` | Registry workers (npm, Maven) do **not** write directly to `repos`. They write `package_repos` rows. The GitHub enricher — triggered when `repos.last_synced_at IS NULL` — upserts `repos` with metadata. Docker Hub worker adds `docker_*` columns on top. |
| `repos` | Registry workers (npm, Maven) do **not** write `repos` enrichment metadata. They INSERT a minimal `repos(url, host)` row — `url` (canonical) and `host` (coarse classification) are both derived from the declared repository URL — solely to create the FK target their `package_repos` link needs. `owner`/`name`/`stars`/`description` and all other metadata stay NULL and remain enricher-owned; existing rows are never updated by registry workers. The GitHub enricher — triggered when `repos.last_synced_at IS NULL` — upserts `repos` with metadata. Docker Hub worker adds `docker_*` columns on top. |
| `package_repos` | Composite PK `(package_id, repo_url)`. Each `source` value ('declared', 'deps_dev', 'heuristic', 'manual') is a separate row — sources do not overwrite each other. |
| `maintainers` / `package_maintainers` | Upsert on `(ecosystem, username)`. Never delete — history is preserved. |
| `maintainers` / `package_maintainers` | `maintainers`: upsert on `(ecosystem, username)`, never deleted — the identity history is preserved. `package_maintainers`: reflects the **current** link set — the npm worker replaces a package's links each ingest (delete + reinsert), so prior link rows are not retained. |
| `downloads_daily` | Append-only time-series. Each `(package_id, date)` row is written once. npm and Maven workers own disjoint rows by ecosystem. Historical timelines are preserved — workers do not overwrite past dates. |
| `downloads_last_30d` | Upsert on `(purl, end_date)`. Written by the weekly ranking worker only. The cached `packages_universe.downloads_last_30d` column must be updated in the same pass. |
ALTER TABLE maintainers DROP COLUMN IF EXISTS email_hash;
ALTER TABLE maintainers ADD COLUMN IF NOT EXISTS email text;
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 37d889e.
  await Promise.all(executing)
  if (failed) throw firstError
  return results
}
Unused `mapWithConcurrency` utility in production code
Low Severity
`mapWithConcurrency` is exported and fully tested, but no production code imports it; the only import is in its own test file. It is dead code in the current changeset.
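For context on what such a helper does, here is a minimal sketch of a concurrency-limited async mapper. Only the final three statements (`await Promise.all(executing)`, `if (failed) throw firstError`, `return results`) appear in the diff excerpt; the signature and everything before those statements are assumptions made for illustration, not the PR's code.

```typescript
// Sketch only: the signature and loop body are assumed, not taken from the PR.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  const executing = new Set<Promise<void>>()
  let failed = false
  let firstError: unknown

  for (let i = 0; i < items.length; i++) {
    if (failed) break // stop scheduling new work after the first failure
    const task: Promise<void> = Promise.resolve()
      .then(() => fn(items[i], i))
      .then((value) => {
        results[i] = value // keeps output order aligned with input order
      })
      .catch((err: unknown) => {
        if (!failed) {
          failed = true
          firstError = err
        }
      })
      .then(() => {
        executing.delete(task)
      })
    executing.add(task)
    // Wait for a free slot once `limit` tasks are in flight.
    if (executing.size >= limit) await Promise.race(executing)
  }

  await Promise.all(executing)
  if (failed) throw firstError
  return results
}
```

A call such as `await mapWithConcurrency(names, 4, fetchOne)` (with `names` and `fetchOne` as placeholder identifiers) would keep at most four fetches in flight while returning results in input order.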
Note
Medium Risk
Large new ingestion surface (external npm APIs, rate limits, maintainer emails in DB) plus pg_partman migration dependency; watch-list scope limits blast radius for now.
Overview
Replaces the npm placeholder with a full Temporal-driven npm registry worker: daily `_changes` polling with a persisted `changes_last_seq`, packument ingest into `packages` / `versions` / maintainers / funding / minimal `repos` + `package_repos`, and audit logging. Adds self-healing download jobs, `backfillDailyDownloads` (gap-filled `downloads_daily`) and `refreshLast30dDownloads` (fixed 30-day `downloads_last_30d` windows, mirroring the latest count to `packages_universe` when `isLatest`), all keyed off a temporary static watch list until deps.dev sourcing lands.

Data layer & ops: new Flyway migration (`npm_worker_state`, `npm_package_state`, maintainer `email` instead of `email_hash`, `pg_partman` + historical partitions for download tables), a packages-db image with postgresql-14-partman, and broad data-access-layer upsert/query helpers. Local scaffold binds services to `127.0.0.1` and moves S3 to host port 9100 (`.env.dist.local` updated). ADR-0001 and `CONTEXT.md` document npm ingest, repo/maintainer write rules, and download window semantics.