DOC-2262: Add OAuth authentication to the docs MCP server (Redpanda Cloud IdP) #181
JakeSCahill wants to merge 49 commits into
Add a lightweight email->token authentication gate to the docs MCP server to capture users' work email addresses for lead capture and usage attribution.
- New /mcp/register endpoint: users submit a work email and the bearer token is delivered ONLY by email (never in the HTTP response), so possession of a working token proves the address is real and owned.
- Mandatory 4-layer validation: format, work-domain filter (reject free/disposable providers), MX-record check, email delivery.
- Tokens stored hashed in Netlify Blobs; auth middleware in mcp.mjs threads the authenticated email/domain to Kapa via _meta.user for attribution.
- Bearer header and ?token= query fallback (for clients that can't set headers).
- Gated behind REQUIRE_AUTH (grace period -> enforce); per-token rate limiting.
- Captured emails -> Netlify Blobs + logs + optional CRM_WEBHOOK_URL forward.
- Docs: registration + per-client setup + privacy/consent note; server-card and server.json advertise the token requirement.
- 17 unit tests (tests/mcp-auth.test.ts).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Netlify Functions don't reliably set NODE_ENV=production at runtime, so the previous NODE_ENV-based dev bypass could fire in deployed environments — silently logging tokens instead of emailing them and not failing when RESEND_API_KEY is missing. Gate the bypass on NETLIFY_DEV (set only by `netlify dev`/`functions:serve`) so any deployed env without a key errors loudly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
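The gating described in this commit can be sketched as a small pure function. This is an illustration only: `resolveTokenDelivery` and its return values are hypothetical names, and the sketch assumes the Netlify CLI sets `NETLIFY_DEV` to some non-empty value.

```javascript
// Hypothetical helper: decide how a registration token is delivered.
// `env` is passed in (instead of reading process.env) so it is easy to test.
function resolveTokenDelivery(env) {
  if (env.RESEND_API_KEY) return "email";
  // NETLIFY_DEV is only set by the local Netlify CLI, so it is a safer dev
  // signal than NODE_ENV, which may be unset in deployed functions.
  if (env.NETLIFY_DEV) return "log";
  // Any deployed environment without a key fails loudly.
  throw new Error("RESEND_API_KEY is required outside local development");
}
```

With this shape, a deployed environment that only has `NODE_ENV=development` set still throws instead of logging tokens.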
Replace the custom email->token gate with a standard MCP OAuth 2.1 resource server delegating to the Redpanda Cloud IdP (auth.prd.cloud.redpanda.com). This is required so ChatGPT can authenticate (ChatGPT only supports spec OAuth, not static tokens), while still capturing users' verified work emails. Verified the Cloud IdP supports everything needed (open Dynamic Client Registration, CIMD, PKCE S256, public clients, email scope, userinfo).
- /.well-known/oauth-protected-resource (RFC 9728) edge function advertises the Cloud IdP as the authorization server; clients self-register via DCR/CIMD.
- mcp.mjs auth middleware validates the bearer token against the IdP /userinfo endpoint, extracts the verified email/org, captures it (Blobs + log + optional CRM_WEBHOOK_URL), and threads it to Kapa via _meta.user.
- Optional work-email enforcement (REQUIRE_WORK_EMAIL, default on) returns 403 for personal providers; REQUIRE_AUTH keeps the grace->enforce rollout.
- Remove the email->token registration endpoint and email-sending module.
- Docs updated: clients prompt for Redpanda Cloud sign-in (no token to paste).
- Unit tests rewritten for the OAuth logic (16 tests).
Production hardening (needs identity team): register an Auth0 API for the MCP resource so tokens are audience-bound JWTs, and add email as an access-token claim. Until then we validate via /userinfo (no audience binding).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
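For readers unfamiliar with RFC 9728, a minimal sketch of what the protected-resource metadata and the matching 401 challenge might look like follows. The function names are hypothetical, and the exact URLs (scheme, trailing slashes, metadata path suffix) are assumptions built from the hostnames mentioned in this PR.

```javascript
// Hypothetical sketch of an RFC 9728 protected-resource metadata document.
function protectedResourceMetadata(resource, authorizationServer) {
  return {
    resource, // the protected MCP endpoint
    authorization_servers: [authorizationServer], // where clients obtain tokens
  };
}

// A 401 that points clients at the metadata document so they can discover
// the authorization server (the resource_metadata parameter is from RFC 9728).
function unauthorized(metadataUrl) {
  return new Response(JSON.stringify({ error: "unauthorized" }), {
    status: 401,
    headers: {
      "Content-Type": "application/json",
      "WWW-Authenticate": `Bearer resource_metadata="${metadataUrl}"`,
    },
  });
}

// Assumed URLs, for illustration only.
const metadata = protectedResourceMetadata(
  "https://docs.redpanda.com/mcp",
  "https://auth.prd.cloud.redpanda.com",
);
```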
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
⛔ Blocked on identity team: Dynamic Client Registration is disabled on the Cloud IdP
The OAuth resource server, discovery, and token validation are implemented and verified, but end-to-end auth cannot work yet because MCP clients can't register with the Redpanda Cloud IdP.
What works (verified against prd
Drop the unused OAUTH_ISSUER export from idp.mjs and de-export the FREE_EMAIL_DOMAINS / DISPOSABLE_DOMAINS sets (used only internally). No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The probe confirmed CIMD is not enabled on the Cloud IdP (a valid client metadata document used as client_id still returns 'Unknown client'). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Update (call with Cloud Identity, Santi): implementation direction decided
Architecture: We unblock MCP auth by having the docs service run its own OAuth 2.1 Authorization Server (AS), with the Cloud IdP (Auth0) as the upstream identity provider. This sidesteps the earlier blocker (the Cloud Auth0 tenant has both DCR and CIMD disabled; see prior comments): the AI tools register and authenticate against our AS, not the Cloud IdP directly. The Cloud IdP only ever sees one client: ours.
Division of responsibilities
Flow: AI client → our AS (
Open asks to Santi (in flight):
Future (phase 2): the same Auth0 federation core also powers human login to the docs site; it just sets a browser session instead of issuing tokens. One Auth0 app, two consumers; MCP ships first.
Next steps:
… (M1)
Replace the superseded resource-server-pointing-at-Cloud approach with the agreed broker architecture: our service is the OAuth 2.1 Authorization Server, federating the human login upstream to Auth0 and issuing/validating its own tokens. Ports the validated spike to production shape.
Added (Milestone 1, AS core):
- lib/oauth/keys.mjs: jose RS256 sign/verify + JWKS; key from env (MCP_OAUTH_SIGNING_JWK) or dev-generated + persisted in Blobs (the spike proved an in-memory key breaks the flow)
- lib/oauth/store.mjs: auth requests + auth codes on Netlify Blobs (interface is the seam for a Netlify DB/Neon backend when relational queries are needed)
- lib/oauth/pkce.mjs, config.mjs, upstream.mjs (Auth0 + dev mock federation, id_token validated against Auth0 JWKS)
- mcp-oauth.mjs: AS endpoints: discovery (RFC 8414), JWKS, /authorize, /mcp/callback, /token (authorization_code + PKCE)
Changed:
- mcp.mjs resource server now validates OUR OWN access tokens (jose) instead of calling the upstream /userinfo
- protected-resource metadata + server card point authorization_servers at us
- removed lib/idp.mjs (superseded /userinfo validation)
Deferred (clearly marked): DCR/CIMD client registration (M2), refresh_token grant + rotation (M3), consent UI, revocation. Neon backend is a documented swap behind the store interface (needs Netlify DB provisioning). Auth0 mode needs Santi's client_id; defaults to a dev mock until then.
Tests: 22 pass (PKCE incl. RFC 7636 vector; JWT issue/verify; JWKS leaks no private key; wrong-audience/tampered rejected).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
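The PKCE check that /token performs can be sketched as below. This is not the contents of lib/oauth/pkce.mjs; `s256Challenge` and `verifyPkce` are hypothetical names, and the sample values are the verifier/challenge pair from RFC 7636 Appendix B, the vector the tests above refer to.

```javascript
// BASE64URL(SHA256(ASCII(code_verifier))), per RFC 7636 section 4.2.
// node:crypto is loaded with a dynamic import so the snippet runs as either
// an ES module or a CommonJS script.
async function s256Challenge(codeVerifier) {
  const { createHash } = await import("node:crypto");
  return createHash("sha256").update(codeVerifier, "ascii").digest("base64url");
}

// True when the verifier sent to /token matches the challenge that was
// stored when the client called /authorize.
async function verifyPkce(codeVerifier, storedChallenge) {
  return (await s256Challenge(codeVerifier)) === storedChallenge;
}

// RFC 7636 Appendix B test vector.
const RFC_VERIFIER = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk";
const RFC_CHALLENGE = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM";
```

Calling `verifyPkce(RFC_VERIFIER, RFC_CHALLENGE)` resolves to `true`; any other verifier resolves to `false`, which the token endpoint would presumably map to an `invalid_grant` error.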
Milestone 1 landed: production AS scaffold (jose + storage), federating to Auth0
Pivoted the branch from the superseded resource-server-pointing-at-Cloud approach to the agreed broker architecture: our service is the OAuth 2.1 Authorization Server; it federates the human login upstream to Auth0 and issues/validates its own tokens. The validated spike is now ported to production shape.
In this milestone
Tests: 22 pass (PKCE incl. the RFC 7636 vector; JWT issue/verify; JWKS leaks no private key; wrong-audience/tampered rejected).
Deferred (clearly marked in code): DCR/CIMD client registration (M2), refresh-token grant + rotation (M3), consent UI, revocation.
Still gated on:
The flow itself was already validated end-to-end on Netlify Functions in the spike branch (
The dev mock issues canned identities, so it must never be reachable by accident in a deployed environment. Resolve the upstream mode fail-closed: mock is only allowed under an explicit dev signal (NETLIFY_DEV or MCP_OAUTH_ALLOW_MOCK=true). Anything that would otherwise silently fall back to mock (e.g. a prod deploy missing REDPANDA_OAUTH_CLIENT_ID) resolves to null, and the AS returns 503 on the flow endpoints instead of handing out mock tokens. Discovery + JWKS stay up.
- config.mjs: resolveUpstreamMode() (pure, tested) + UPSTREAM_MISCONFIGURED
- upstream.mjs: throw if neither auth0 nor mock is active
- mcp-oauth.mjs: 503 on /authorize, /callback, /token, mock-idp when misconfigured
- tests: 6 cases covering the resolution matrix (28 total pass)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
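A rough sketch of the fail-closed resolution described above. The function name matches the commit message, but the body is a guess at its shape: in particular, the precedence (a configured client ID wins over a dev signal) and the `guardFlowEndpoint` helper are assumptions.

```javascript
// Sketch: resolve the upstream mode without ever defaulting to the mock.
// Returns "auth0", "mock", or null (misconfigured).
function resolveUpstreamMode(env) {
  if (env.REDPANDA_OAUTH_CLIENT_ID) return "auth0";
  const devSignal = Boolean(env.NETLIFY_DEV) || env.MCP_OAUTH_ALLOW_MOCK === "true";
  // No client ID and no explicit dev signal: fail closed.
  return devSignal ? "mock" : null;
}

// Hypothetical guard for flow endpoints (/authorize, /callback, /token):
// returns a 503 response when misconfigured, otherwise null to continue.
function guardFlowEndpoint(env) {
  if (resolveUpstreamMode(env) !== null) return null;
  return new Response("upstream identity provider is not configured", { status: 503 });
}
```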
Netlify statically analyzes config.path at bundle time, so it can't be an array of imported constants (PATHS.*) — that failed bundling (and the PR preview build) with 'path: Must be a string or array of strings'. Use literal paths. Verified the full M1 flow live (functions:serve, mock upstream): authorize -> mock-idp -> /mcp/callback -> /token -> AS-issued JWT, then /mcp accepts that token (200) and rejects no-token / garbage (401). Confirms cross-function token validation via the Blobs-shared signing key. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Netlify Blobs defaults to eventual consistency (deletes/updates propagate up to
60s). For one-time-use auth codes and refresh-token rotation/reuse-detection
that window would let a consumed code/token be replayed, so the auth store now
uses { consistency: 'strong' }. The dev signing-key store does too, so the
resource server reads the key the AS just wrote rather than regenerating.
Verified live (functions:serve): full flow issues a token, /mcp accepts it
(200), and replaying a consumed auth code is rejected (400).
Note: Blobs still has no atomic CAS, so a sub-second concurrent replay remains
theoretically possible — negligible at our volume; a relational DB is the only
full fix (documented as the future swap behind the store interface).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
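To make the read-then-delete pattern and its remaining race concrete, here is a minimal sketch against a hypothetical store interface (async `get`/`set`/`delete`). An in-memory Map stands in for the Blobs-backed store, which per the commit above is opened with `{ consistency: 'strong' }`.

```javascript
// Hypothetical in-memory stand-in for the auth store interface.
function createMemoryStore() {
  const data = new Map();
  return {
    async set(key, value) { data.set(key, value); },
    async get(key) { return data.has(key) ? data.get(key) : null; },
    async delete(key) { data.delete(key); },
  };
}

// Redeems an authorization code at most once for sequential callers.
// Two concurrent callers can both pass the get() before either delete()
// lands; without an atomic compare-and-swap that window cannot be closed.
async function consumeAuthCode(store, code, now = Date.now()) {
  const record = await store.get(code);
  if (!record) return null; // unknown or already consumed
  await store.delete(code); // delete first, so a later replay misses
  if (record.expiresAt <= now) return null; // expired codes are also burned
  return record;
}
```

A transactional backend could replace the get/delete pair with a single statement that deletes the row and returns it, which is the kind of guarantee the note above attributes to a relational database.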
With Cloud login the email is already verified by Auth0, so blocking personal-domain Cloud accounts (gmail etc.) is just friction with little benefit. Flip the default off: we accept and capture every verified Cloud login, and still record the email domain for attribution. Set REQUIRE_WORK_EMAIL=true to re-enable the free/disposable 403 rejection.
- config.mjs + auth.mjs: default false (only true when explicitly 'true')
- test updated; docs 403 note reworded (personal emails accepted by default)
- 47 unit tests pass
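The "only true when explicitly 'true'" default can be illustrated as follows. `checkEmail` is a hypothetical helper and the domain set is a two-entry stand-in for the real free/disposable lists.

```javascript
// Illustrative subset; the real lists live in the repo and are much longer.
const FREE_EMAIL_DOMAINS = new Set(["gmail.com", "outlook.com"]);

// Off unless the variable is exactly the string "true".
function requireWorkEmail(env) {
  return env.REQUIRE_WORK_EMAIL === "true";
}

// Always records the domain for attribution; only rejects when enforcement is on.
function checkEmail(email, env) {
  const domain = email.slice(email.lastIndexOf("@") + 1).toLowerCase();
  const rejected = requireWorkEmail(env) && FREE_EMAIL_DOMAINS.has(domain);
  return { domain, status: rejected ? 403 : 200 };
}
```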
# Conflicts: # package.json
…endpoints CodeRabbit flagged the protected-resource metadata edge function for missing OPTIONS/CORS preflight. Fixed there, and applied the same to the sibling endpoints browser-based OAuth clients fetch cross-origin (authorization-server metadata, JWKS, /token, /register): OPTIONS now returns 204 with the CORS headers, and the JSON responses carry Access-Control-Allow-Origin. /authorize and /callback are top-level navigations, so they don't need it. Verified: OPTIONS -> 204 + CORS, GET discovery -> 200 + ACAO, OPTIONS /oauth/token -> 204. 47 tests pass.
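A minimal sketch of the preflight handling described above, using the standard Request/Response objects. The wrapper name and the specific header values (a wildcard origin, the allowed methods and headers) are assumptions; the PR does not state which values the endpoints actually send.

```javascript
// Assumed header values, for illustration only.
const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
  "Access-Control-Allow-Headers": "Authorization, Content-Type",
};

// Hypothetical wrapper: answers OPTIONS with 204 + CORS headers and adds the
// same headers to whatever the wrapped handler returns.
async function handleWithCors(request, handler) {
  if (request.method === "OPTIONS") {
    return new Response(null, { status: 204, headers: CORS_HEADERS });
  }
  const response = await handler(request);
  const headers = new Headers(response.headers);
  for (const [name, value] of Object.entries(CORS_HEADERS)) headers.set(name, value);
  return new Response(response.body, { status: response.status, headers });
}
```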
… test email_verified
- clients.mjs: harden the CIMD SSRF guard against IPv6: strip brackets and block ::1/::, IPv4-mapped (::ffff:), ULA (fc00::/7) and link-local (fe80::/10), in addition to the IPv4 private/loopback/link-local ranges. Extracted isBlockedHost and unit-tested it (incl. bracketed forms).
- mcp-oauth.mjs: require client_id in token requests for both grants (public clients don't authenticate, so RFC 6749 requires it); missing -> 400 invalid_request, mismatch -> 400 invalid_grant.
- tests: cover email_verified=false / absent (still allowed since SSO logins often omit it, but flagged unverified in the captured context).
Verified live: token without client_id -> 400, with -> 200; 52 unit tests pass.
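The host check could look roughly like this. It is a sketch of the ranges listed above, not the code in clients.mjs: it only inspects IP literals in the hostname and does not resolve DNS names, so a name that resolves to a private address would need a separate check.

```javascript
// Sketch of an SSRF guard over the hostname of a client metadata URL.
function isBlockedHost(hostname) {
  // URL.hostname keeps brackets around IPv6 literals, so strip them first.
  const host = hostname.replace(/^\[|\]$/g, "").toLowerCase();

  if (host.includes(":")) {
    if (host === "::1" || host === "::") return true; // loopback, unspecified
    if (host.startsWith("::ffff:")) return true; // IPv4-mapped
    if (/^f[cd][0-9a-f]{2}:/.test(host)) return true; // ULA, fc00::/7
    if (/^fe[89ab][0-9a-f]:/.test(host)) return true; // link-local, fe80::/10
    return false;
  }

  const match = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!match) return false; // a DNS name, not an IP literal
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 10 || // 10.0.0.0/8
    a === 127 || // loopback
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    (a === 169 && b === 254) // link-local
  );
}
```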
- CIMD fetch: refuse redirects (redirect: 'error') and cap the response size with a streaming reader, so a malicious or huge client doc can't drive an SSRF or memory blowup
- Rate-limit CIMD client resolution per IP in /authorize (allowCimd), alongside the existing /register limit
- Pin RS256 when verifying our signing key and the upstream Auth0 ID token
- Capture email_verified on the user record, log line, and CRM payload (we record it rather than block, since SSO often omits it)
- Note the read-then-delete race for one-time codes/tokens (no Blobs CAS) and that the Postgres backend fixes it transactionally
- Clarify the MCP server is an OAuth resource server (not authless) and why GET/SSE isn't gated
- Fix the docs rate-limit sample (40 -> 60) to match the real limit
- Tests for the size cap, blocked redirect, and IPv6 SSRF guard
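The first item above, refusing redirects and capping the body with a streaming reader, can be sketched like this. The function names, the injectable `fetchImpl` parameter, and the 64 KiB default are assumptions made for the example.

```javascript
// Reads a response body chunk by chunk and aborts once it exceeds maxBytes,
// so an oversized document is never buffered in full.
async function readCapped(response, maxBytes) {
  const reader = response.body.getReader();
  const chunks = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel();
      throw new Error(`response exceeds ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  return Buffer.concat(chunks).toString("utf8");
}

// Hypothetical fetcher for a client metadata document.
async function fetchClientMetadata(url, { fetchImpl = fetch, maxBytes = 64 * 1024 } = {}) {
  // redirect: "error" makes fetch reject on any 3xx instead of following it.
  const response = await fetchImpl(url, { redirect: "error" });
  if (!response.ok) throw new Error(`unexpected status ${response.status}`);
  return JSON.parse(await readCapped(response, maxBytes));
}
```

Passing `fetchImpl` keeps the sketch testable without network access; in production the global `fetch` would be used.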
Add a privacy-policy link to the login interstitial (and the docs auth section) so users know we collect their work email and track MCP usage before they sign in. The URL is configurable via MCP_OAUTH_PRIVACY_URL and defaults to redpanda.com/legal/privacy-policy.
Remove the affirmative query-handling promise from the login interstitial and docs note. Queries are proxied to Kapa, so the claim is hard to stand behind end-to-end; data handling belongs in the linked Privacy Policy where it can be properly qualified. The notice keeps to what we need consent for: collecting the work email and tracking usage.
Update the default SIGNUP_URL and the two sign-up references in the docs to the dedicated sign-up page rather than the Cloud landing page.
Add a 'Staying signed in' subsection explaining that the MCP client handles token renewal automatically: sign in once, the client refreshes in the background, active users stay signed in, and 30 days idle triggers a fresh (usually silent) sign-in. No manual token handling.
After thinking through the storage options for the OAuth layer, I’m leaning toward using Neon Postgres (via Netlify’s integration) as the system of record for auth state, rather than Netlify Blobs. Postgres gives us:
Netlify Blobs feels fine for simple metadata or low-risk storage, but it doesn’t provide strong enough guarantees for OAuth flows where race conditions or replay could become an issue.
Add an MCP tool that lets AI clients forward user feedback (bugs, doc gaps, frustrations, feature requests) straight to the Redpanda team. The tool description tells agents to ask the user before submitting and to include the relevant page/context. Feedback goes to the existing api-feedback Netlify form (the same store our docs feedback uses); the hidden form is extended with category, source, and user identity fields. When the user is signed in we attach their email + domain so the team can follow up; anonymous otherwise. We log only category/domain/authed, never the raw email or feedback text. Also bumps the server version and documents the capability for users.
Mention the feedback tool in the MCP server-card and server.json descriptions, and bump the server-card version to 1.3.0 to match.
The feedback tool POSTed to the site root, which 301-redirects to /home/. fetch followed the redirect (POST -> GET, body dropped), so Netlify Forms never recorded the submission — but the final 200 made the tool report success. Verified: POST to / => 301; POST to /home/ => 200 and the submission lands. POST to /home/ (configurable via MCP_FEEDBACK_FORM_PATH) and set redirect: 'error' so a redirect surfaces as a failure instead of a false success.
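A sketch of the fixed submission path follows. `submitFeedback` and its options are hypothetical; the `form-name` field is how Netlify Forms identifies the target form, and the `/home/` default mirrors the path named in this commit.

```javascript
// Hypothetical sketch: POST feedback to a Netlify form without following redirects.
async function submitFeedback(fields, { origin, formPath = "/home/", fetchImpl = fetch } = {}) {
  const body = new URLSearchParams({ "form-name": "api-feedback", ...fields });
  try {
    const response = await fetchImpl(new URL(formPath, origin), {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: body.toString(),
      // A 301/302 now rejects instead of silently turning into a body-less GET.
      redirect: "error",
    });
    return { ok: response.ok, status: response.status };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}
```

With `redirect: "error"`, a misconfigured path surfaces as `{ ok: false, ... }` to the tool caller rather than a false success.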
Jira: DOC-2262
Goal
Add authentication to the docs MCP server (docs.redpanda.com/mcp) so AI tools (ChatGPT, Claude, Cursor, VS Code) have users sign in with their Redpanda Cloud account, letting us capture verified work emails and attribute docs usage to organizations.
Architecture (decided with Cloud Identity)
The docs service runs its own OAuth 2.1 Authorization Server (AS), with the Cloud IdP (Auth0) as the upstream identity provider. AI tools register and authenticate against our AS; we federate the human login to Auth0 and issue our own tokens.
Why this shape:
email/org from Auth0 to capture + attribute.
Division of responsibilities
client_id, Authorization Code + PKCE, no secret), our /callback redirect URIs allow-listed, ID token returns email/email_verified + org. One app covers MCP now and docs-site login later.
/authorize, /callback, /token (+ refresh w/ rotation), client registration (DCR + CIMD), JWKS, consent/login UI; federate login to Auth0; issue/validate our own tokens. State in Netlify.
Also in this PR: documentation feedback tool
Adds an MCP tool, submit_documentation_feedback, so AI clients can forward user feedback (bugs, documentation gaps, incorrect or missing information, feature requests) straight to the docs/DX team. The tool description instructs the agent to ask the user before submitting and to include the relevant page/context.
api-feedback Netlify form (same store as our docs feedback); the hidden registration form gains category, source, and user-identity fields.
category/domain/authed, never the raw email or feedback text.
server.json descriptions; server version bumped to 1.3.0.
Future (phase 2)
The same Auth0 federation core also powers human login to the docs site — it just sets a browser session instead of issuing tokens. One Auth0 app, two consumers; MCP ships first.
Testing
2026-06-18_19-50-10.mp4
The deploy preview is wired to the integration Auth0 tenant, so you can test the full flow there.
1. Add the preview server to Claude Code:
2. Authenticate and use it:
/mcp, select redpanda-preview, choose Authenticate
claude mcp remove redpanda-preview
Notes:
integration-cloudv2.us.auth0.com), not prod.
npm run test:mcp plus the tests/mcp-oauth-*.test.ts suites.
Kapa now has your conversation along with your email and org
History
We first explored a pure OAuth resource server pointing at the Cloud IdP (blocked: DCR/CIMD are disabled on that tenant), then landed on the AS-broker design above after the call with Cloud Identity.