feat: tracking NPM packages#4159

Open
epipav wants to merge 12 commits into main from feat/tracking-npm-packages

Collaborator

@epipav epipav commented Jun 2, 2026

Note

Medium Risk
Large new ingestion surface (external npm APIs, rate limits, maintainer emails in DB) plus pg_partman migration dependency; watch-list scope limits blast radius for now.

Overview
Replaces the npm placeholder with a full Temporal-driven npm registry worker: daily _changes polling with a persisted changes_last_seq, packument ingest into packages / versions / maintainers / funding / minimal repos + package_repos, and audit logging. Adds two self-healing download jobs: backfillDailyDownloads (gap-filled downloads_daily) and refreshLast30dDownloads (fixed 30-day downloads_last_30d windows, mirroring the latest count to packages_universe when isLatest). All of these are keyed off a temporary static watch list until deps.dev sourcing lands.

Data layer & ops: new Flyway migration (npm_worker_state, npm_package_state, maintainer email instead of email_hash, pg_partman + historical partitions for download tables), a packages-db image with postgresql-14-partman, and broad data-access-layer upsert/query helpers. Local scaffold binds services to 127.0.0.1 and moves S3 to host port 9100 (.env.dist.local updated). ADR-0001 and CONTEXT.md document npm ingest, repo/maintainer write rules, and download window semantics.
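As an illustration of the cursor-based polling described above, here is a minimal sketch of how one page of a CouchDB-style `_changes` response could be reduced to a list of watch-listed packages plus the next cursor value. The names `planIngest`, `ChangesPage`, and `PollResult` are hypothetical and do not come from this PR; only the `results` / `last_seq` response shape is taken from the CouchDB changes feed format.

```typescript
// Only the fields of a CouchDB-style _changes response that this sketch uses.
interface ChangeRow {
  seq: number | string
  id: string
}

interface ChangesPage {
  results: ChangeRow[]
  last_seq: number | string
}

interface PollResult {
  // Watch-listed package names seen in this page, de-duplicated.
  toIngest: string[]
  // Value to persist as the cursor (e.g. changes_last_seq) after ingest succeeds.
  nextSeq: number | string
}

// Hypothetical helper: a pure function with no network or database access.
function planIngest(page: ChangesPage, watchList: Set<string>): PollResult {
  const seen = new Set<string>()
  for (const row of page.results) {
    // Anything outside the watch list is ignored; repeats collapse to one entry.
    if (watchList.has(row.id)) seen.add(row.id)
  }
  return { toIngest: Array.from(seen), nextSeq: page.last_seq }
}

const page: ChangesPage = {
  results: [
    { seq: 101, id: 'react' },
    { seq: 102, id: 'some-unwatched-package' },
    { seq: 103, id: 'react' },
  ],
  last_seq: 103,
}

const plan = planIngest(page, new Set(['react', 'left-pad']))
console.log(plan)
```

Persisting `nextSeq` only after the ingest step succeeds means a failed run would re-read the same page on the next poll, which is one way to get the self-healing behaviour the description mentions.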

Reviewed by Cursor Bugbot for commit 37d889e. Bugbot is set up for automated code reviews on this repo. Configure here.

epipav added 3 commits June 1, 2026 10:26
Signed-off-by: anilb <epipav@gmail.com>
Signed-off-by: anilb <epipav@gmail.com>
Signed-off-by: anilb <epipav@gmail.com>
Copilot AI review requested due to automatic review settings June 2, 2026 07:29
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Comment thread services/apps/packages_worker/src/npm/normalize.ts Fixed
Contributor

Copilot AI left a comment

Pull request overview

Adds first-pass npm package tracking to the OSS packages subsystem, including ingestion from the npm registry + _changes feed, persistence into the packages-db schema, and scheduled workflows to backfill/refresh download metrics.

Changes:

  • Introduces new data-access-layer modules for npm packages, versions, maintainers, repos, download metrics, worker state, and audit-field-change logging.
  • Implements Temporal workflows + activities in packages_worker to ingest watch-listed npm packages, backfill downloads_daily, and refresh monthly downloads_last_30d windows.
  • Extends DB migrations and local scaffolding to support npm worker state tables and pg_partman partition setup for download tables.

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `services/libs/data-access-layer/src/packages/versions.ts` | Upsert helper for npm package versions. |
| `services/libs/data-access-layer/src/packages/repos.ts` | Repo upsert + package↔repo provenance linking helpers. |
| `services/libs/data-access-layer/src/packages/packages.ts` | Package upsert + “tracked npm packages” query. |
| `services/libs/data-access-layer/src/packages/npmWorkerState.ts` | Persistence for npm _changes cursor. |
| `services/libs/data-access-layer/src/packages/npmPackageState.ts` | Persistence for “scanned” npm packages. |
| `services/libs/data-access-layer/src/packages/maintainers.ts` | Maintainer upsert + package↔maintainer linking. |
| `services/libs/data-access-layer/src/packages/index.ts` | Exports newly added packages DAL modules. |
| `services/libs/data-access-layer/src/packages/fundingLinks.ts` | Upsert helper for package funding links. |
| `services/libs/data-access-layer/src/packages/downloadsLast30d.ts` | Reads/upserts downloads_last_30d and optionally mirrors latest to packages_universe. |
| `services/libs/data-access-layer/src/packages/downloadsDaily.ts` | Computes missing daily dates + inserts daily download rows. |
| `services/libs/data-access-layer/src/packages/auditFieldChanges.ts` | Inserts audit rows recording which fields changed per ingest execution. |
| `services/libs/data-access-layer/src/index.ts` | Re-exports the new packages DAL module. |
| `services/apps/packages_worker/src/workflows/index.ts` | Exposes npm workflows from the worker entrypoint. |
| `services/apps/packages_worker/src/utils/concurrency.ts` | Adds a generic concurrency-limited async mapper. |
| `services/apps/packages_worker/src/npm/workflows.ts` | Implements ingest + download backfill/refresh workflows. |
| `services/apps/packages_worker/src/npm/watchList.ts` | Hardcoded temporary npm watch list. |
| `services/apps/packages_worker/src/npm/upsertPackage.ts` | Normalizes packuments and persists package + related entities. |
| `services/apps/packages_worker/src/npm/types.ts` | Types for packument + fetch error shapes. |
| `services/apps/packages_worker/src/npm/schedule.ts` | Registers Temporal schedules for ingest and download jobs. |
| `services/apps/packages_worker/src/npm/normalize.ts` | npm name parsing, purl construction, license normalization, repo canonicalization, maintainers extraction. |
| `services/apps/packages_worker/src/npm/last30dGaps.ts` | Computes missing monthly “rolling 30d” windows. |
| `services/apps/packages_worker/src/npm/fetchPackument.ts` | Fetches npm packuments from the registry. |
| `services/apps/packages_worker/src/npm/fetchDownloads.ts` | Fetches npm download counts (daily range + point range + bulk point range). |
| `services/apps/packages_worker/src/npm/fetchChanges.ts` | Polls replicate.npmjs.com _changes feed. |
| `services/apps/packages_worker/src/npm/downloadGaps.ts` | Computes contiguous missing-date windows for daily downloads backfill. |
| `services/apps/packages_worker/src/npm/activities.ts` | Temporal activities: change polling, ingestion, gap detection, downloads fetch+persist. |
| `services/apps/packages_worker/src/bin/packages-worker.ts` | Worker bootstrap now registers the new schedules. |
| `services/apps/packages_worker/src/activities.ts` | Re-exports npm activities for Temporal worker registration. |
| `services/apps/packages_worker/CONTEXT.md` | Documents OSS packages terminology and relationships. |
| `scripts/scaffold.yaml` | Tightens port bindings to localhost and adds a custom packages-db image build. |
| `scripts/packages-db/Dockerfile` | Builds a Postgres image with pg_partman installed. |
| `docs/adr/0001-oss-packages-design-decisions.md` | Updates ADR to reflect npm worker ownership of downloads_last_30d and download workflows. |
| `backend/src/osspckgs/migrations/V1780231200__npm_worker.sql` | Adds npm worker state tables + pg_partman setup + partition creation for downloads tables. |
| `backend/src/osspckgs/migrations/V1779710880__initial_schema.sql` | Adjusts maintainer email storage and documents partition management migration; adds audit_field_changes. |
| `backend/.env.dist.local` | Minor formatting change near packages DB env vars. |
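The file summary mentions computing contiguous missing-date windows for the daily downloads backfill. The sketch below shows one way such grouping could work, assuming dates are `YYYY-MM-DD` strings; `groupIntoWindows` and `DateWindow` are illustrative names, not the contents of `downloadGaps.ts`.

```typescript
interface DateWindow {
  start: string // inclusive, YYYY-MM-DD
  end: string // inclusive, YYYY-MM-DD
}

const DAY_MS = 24 * 60 * 60 * 1000

// Parse YYYY-MM-DD as UTC midnight so day arithmetic is unaffected by DST.
function toUtcMs(day: string): number {
  return Date.parse(`${day}T00:00:00Z`)
}

// Hypothetical helper: collapse a set of missing days into contiguous ranges.
function groupIntoWindows(missingDays: string[]): DateWindow[] {
  // ISO dates sort chronologically when sorted as plain strings.
  const sorted = Array.from(new Set(missingDays)).sort()
  const windows: DateWindow[] = []
  for (const day of sorted) {
    const last = windows[windows.length - 1]
    if (last && toUtcMs(day) - toUtcMs(last.end) === DAY_MS) {
      last.end = day // extends the current window by one day
    } else {
      windows.push({ start: day, end: day }) // gap found, start a new window
    }
  }
  return windows
}

const windows = groupIntoWindows(['2026-05-31', '2026-06-01', '2026-05-10', '2026-05-30'])
console.log(windows)
```

Grouping into windows lets a backfill issue one range request per gap rather than one request per missing day, which matters when an external API is rate limited.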

Comment thread services/apps/packages_worker/src/utils/concurrency.ts
Comment thread services/apps/packages_worker/src/npm/normalize.ts
Comment thread services/apps/packages_worker/src/npm/normalize.ts Outdated
Comment thread services/libs/data-access-layer/src/packages/versions.ts Outdated
Comment thread services/libs/data-access-layer/src/packages/repos.ts Outdated
Comment thread services/libs/data-access-layer/src/packages/maintainers.ts
Comment thread services/libs/data-access-layer/src/packages/packages.ts Outdated
Comment thread services/apps/packages_worker/src/npm/fetchChanges.ts
Comment thread backend/src/osspckgs/migrations/V1779710880__initial_schema.sql
Comment thread scripts/scaffold.yaml
Comment thread services/apps/packages_worker/src/utils/concurrency.ts Outdated
epipav added 2 commits June 2, 2026 09:39
Signed-off-by: anilb <epipav@gmail.com>
Signed-off-by: anilb <epipav@gmail.com>
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

Comment thread services/apps/packages_worker/src/npm/normalize.ts Fixed
Signed-off-by: anilb <epipav@gmail.com>
Copilot AI review requested due to automatic review settings June 2, 2026 08:25
@epipav
Collaborator Author

epipav commented Jun 2, 2026

@cursor review

Comment thread services/apps/packages_worker/src/activities.ts
Comment thread services/libs/data-access-layer/src/index.ts Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 13 comments.

Comment thread services/libs/data-access-layer/src/packages/maintainers.ts
Comment thread services/libs/data-access-layer/src/packages/maintainers.ts Outdated
Comment thread services/libs/data-access-layer/src/packages/maintainers.ts
Comment thread services/libs/data-access-layer/src/packages/repos.ts Outdated
Comment thread services/libs/data-access-layer/src/packages/repos.ts Outdated
Comment thread services/apps/packages_worker/src/npm/fetchChanges.ts
Comment thread services/apps/packages_worker/src/utils/concurrency.ts
Comment thread backend/src/osspckgs/migrations/V1780231200__npm_worker.sql
Comment thread backend/src/osspckgs/migrations/V1780231200__npm_worker.sql
Comment thread services/libs/data-access-layer/src/index.ts Outdated
epipav added 2 commits June 2, 2026 10:33
Signed-off-by: anilb <epipav@gmail.com>
Signed-off-by: anilb <epipav@gmail.com>
Comment thread services/apps/packages_worker/src/npm/fetchDownloads.ts
Signed-off-by: anilb <epipav@gmail.com>
Copilot AI review requested due to automatic review settings June 2, 2026 09:17
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 9 comments.

Comment thread services/libs/data-access-layer/src/packages/versions.ts Outdated
Comment thread services/libs/data-access-layer/src/packages/versions.ts
Comment thread services/libs/data-access-layer/src/packages/fundingLinks.ts
Comment thread services/apps/packages_worker/src/npm/upsertPackage.ts
Comment thread services/libs/data-access-layer/src/packages/packages.ts
Comment thread docs/adr/0001-oss-packages-design-decisions.md
Comment thread services/apps/packages_worker/src/npm/downloadGaps.ts
Comment thread backend/src/osspckgs/migrations/V1780231200__npm_worker.sql
Comment thread services/apps/packages_worker/src/npm/last30dGaps.ts
mbani01
mbani01 previously approved these changes Jun 2, 2026
Signed-off-by: anilb <epipav@gmail.com>
joanagmaia
joanagmaia previously approved these changes Jun 2, 2026
Signed-off-by: anilb <epipav@gmail.com>
Copilot AI review requested due to automatic review settings June 2, 2026 12:44
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.

Comment on lines +6 to +19
export function parseNpmName(raw: string): { namespace: string | null; name: string } {
  if (raw.startsWith('@')) {
    const slash = raw.indexOf('/')
    if (slash !== -1) {
      return { namespace: raw.slice(0, slash), name: raw.slice(slash + 1) }
    }
  }
  return { namespace: null, name: raw }
}

export function buildPurl(raw: string): string {
  const { namespace, name } = parseNpmName(raw)
  return namespace ? `pkg:npm/${namespace}/${name}` : `pkg:npm/${name}`
}
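For context on this excerpt: the Package URL (purl) specification's npm examples percent-encode the scope's leading `@` (for instance `pkg:npm/%40angular/animation@12.3.1`), whereas the snippet above emits the scope verbatim. Whether the PR intends the unencoded form is not stated here. The following is a hedged sketch of an encoded variant; `buildEncodedPurl` is a hypothetical name, and `parseNpmName` is copied from the excerpt so the block is self-contained.

```typescript
function parseNpmName(raw: string): { namespace: string | null; name: string } {
  if (raw.startsWith('@')) {
    const slash = raw.indexOf('/')
    if (slash !== -1) {
      return { namespace: raw.slice(0, slash), name: raw.slice(slash + 1) }
    }
  }
  return { namespace: null, name: raw }
}

// Hypothetical variant of buildPurl that percent-encodes each path segment,
// so a scope such as "@angular" becomes "%40angular".
function buildEncodedPurl(raw: string): string {
  const { namespace, name } = parseNpmName(raw)
  const encodedName = encodeURIComponent(name)
  return namespace
    ? `pkg:npm/${encodeURIComponent(namespace)}/${encodedName}`
    : `pkg:npm/${encodedName}`
}

console.log(buildEncodedPurl('@angular/animation')) // pkg:npm/%40angular/animation
console.log(buildEncodedPurl('react')) // pkg:npm/react
```

If purls are used as join keys against other sources (the ADR excerpt later in this thread keys `downloads_last_30d` on `purl`), both sides would need to agree on the encoded or unencoded form.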
Comment on lines +38 to +50
ins AS (
  INSERT INTO package_repos (package_id, repo_id, source, confidence)
  VALUES ($(packageId)::bigint, $(repoId)::bigint, $(source), $(confidence))
  ON CONFLICT (package_id, repo_id) DO UPDATE SET
    confidence = EXCLUDED.confidence,
    verified_at = NOW()
  RETURNING source, confidence
)
SELECT array_remove(ARRAY[
  CASE WHEN o.source IS NULL THEN 'package_repos.repo_id' END,
  CASE WHEN o.source IS NULL THEN 'package_repos.source' END,
  CASE WHEN o.source IS NULL
    OR o.confidence IS DISTINCT FROM ins.confidence THEN 'package_repos.confidence' END
| `versions` | Append-only via `INSERT … ON CONFLICT DO NOTHING`. Yanked/deprecated status is a separate targeted `UPDATE (is_yanked = true) WHERE …`. |
| `repos` | Registry workers (npm, Maven) do **not** write directly to `repos`. They write `package_repos` rows. The GitHub enricher — triggered when `repos.last_synced_at IS NULL` — upserts `repos` with metadata. Docker Hub worker adds `docker_*` columns on top. |
| `repos` | Registry workers (npm, Maven) do **not** write `repos` enrichment metadata. They INSERT a minimal `repos(url, host)` row — `url` (canonical) and `host` (coarse classification) are both derived from the declared repository URL — solely to create the FK target their `package_repos` link needs. `owner`/`name`/`stars`/`description` and all other metadata stay NULL and remain enricher-owned; existing rows are never updated by registry workers. The GitHub enricher — triggered when `repos.last_synced_at IS NULL` — upserts `repos` with metadata. Docker Hub worker adds `docker_*` columns on top. |
| `package_repos` | Composite PK `(package_id, repo_url)`. Each `source` value ('declared', 'deps_dev', 'heuristic', 'manual') is a separate row — sources do not overwrite each other. |
| `maintainers` / `package_maintainers` | Upsert on `(ecosystem, username)`. Never delete — history is preserved. |
| `maintainers` / `package_maintainers` | `maintainers`: upsert on `(ecosystem, username)`, never deleted; the identity history is preserved. `package_maintainers`: reflects the **current** link set: the npm worker replaces a package's links each ingest (delete + reinsert), so prior link rows are not retained. |
| `downloads_daily` | Append-only time-series. Each `(package_id, date)` row is written once. npm and Maven workers own disjoint rows by ecosystem. Historical timelines are preserved — workers do not overwrite past dates. |
| `downloads_last_30d` | Upsert on `(purl, end_date)`. Written by the weekly ranking worker only. The cached `packages_universe.downloads_last_30d` column must be updated in the same pass. |
Comment on lines +3 to +4
ALTER TABLE maintainers DROP COLUMN IF EXISTS email_hash;
ALTER TABLE maintainers ADD COLUMN IF NOT EXISTS email text;

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 37d889e. Configure here.

  await Promise.all(executing)
  if (failed) throw firstError
  return results
}
Unused mapWithConcurrency utility in production code

Low Severity

mapWithConcurrency is exported and fully tested but never imported in any production code — only in its own test file. This is dead code in the current changeset.
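For readers unfamiliar with the pattern, a concurrency-limited async mapper can be sketched roughly as below. This is a worker-pool variant written for illustration only; it is not the PR's `mapWithConcurrency` (whose body is only partly visible above), and the name `mapLimited` is hypothetical.

```typescript
// Runs fn over items with at most `limit` calls in flight; results keep input order.
async function mapLimited<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0

  // Each worker claims the next unprocessed index until the input is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++
      results[i] = await fn(items[i], i)
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length))
  // Promise.all rejects with the first error raised by any worker.
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}

// Usage: track the peak number of simultaneous calls to confirm the limit holds.
let active = 0
let peak = 0
const done = mapLimited([1, 2, 3, 4, 5], 2, async (n) => {
  active++
  peak = Math.max(peak, active)
  await new Promise<void>((resolve) => setTimeout(resolve, 5))
  active--
  return n * 2
})
done.then((doubled) => console.log(doubled, 'peak concurrency:', peak))
```

Note that in this sketch a failure in one worker does not stop the others from claiming further items; an implementation that tracks a `failed` flag, as the excerpt above appears to, can stop scheduling new work after the first error.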

7 participants