Reeve

Reeve is a production-shaped autonomous agent that maintains a GitHub repository the way a senior maintainer would. Built on Mastra and Google Gemini, it interprets a task, discovers and selects its own tools by description (62 tools across 9 namespaces, progressively exposed), delegates isolated subtasks to scoped read-only subagents, and runs long-horizon jobs without losing its plan (the triage_repository task crossed 27 tool calls in one live session). All of this sits behind production scaffolding: a single throttled, retrying GitHub client, a typed error taxonomy, structured spans, and an eval harness with unit + integration tests.

Quickstart

# 1. Install (Node 20+, pnpm)
pnpm install

# 2. Configure — copy and fill in
cp .env.example .env
#   GITHUB_TOKEN                  fine-grained PAT for the sandbox repo
#   GOOGLE_GENERATIVE_AI_API_KEY  Google Generative AI key (Mastra model router)
#   GITHUB_SANDBOX_REPO           owner/repo Reeve operates on

# 3. Verify (no network / no model needed)
pnpm typecheck && pnpm test       # unit + integration suites
pnpm eval --mock                  # eval harness, fully offline

# 4. Live demos (need the env above; use Gemini free-tier quota)
pnpm tsx scripts/smoke.ts         # one orchestrated task end-to-end
pnpm tsx scripts/triage-demo.ts   # flagship long-horizon triage (20+ tool calls)
pnpm eval                         # eval with the live LLM judge

Free-tier note: gemini-2.5-flash-lite is capped at 20 requests/day. One full triage run can exhaust that quota, and runs fail fast on a 429 rather than retrying. For sustained use, move the model in src/config/models.ts to a higher tier.

See it run

A real excerpt from the flagship triage_repository run on the sandbox repo (full capture: artifacts/triage-demo.txt) — it records a plan, paginates every open issue, clusters them, investigates the top items via the isolated subagent, and emits a ranked backlog:

======== PLAN (recorded to memory at start) ========
Triage all open issues in abhay-codes07/reeve-sandbox
  1. gather: paginate through all open issues
  2. cluster: group issues into prioritised clusters
  3. investigate: run the investigate_issue subagent on the top-priority items
  4. draft: write maintainer responses for each cluster
  5. backlog: emit a ranked backlog

======== RANKED BACKLOG ========
  #1 Security (critical) — issues #5
      labels: security, priority:critical
      draft : Thanks for the report. We treat security issues (#5) as critical priority
              and will investigate immediately. Please avoid sharing further exploit details publicly.
  #2 Bug (high) — issues #9, #4, #1
      labels: bug, needs-triage
  #3 Performance (medium) — issues #8
      labels: performance

======== SUMMARY ========
total issues     : 11
clusters         : 7
TOTAL TOOL CALLS : 27 (>20 ✅)

Architecture

flowchart TD
    User([Task]) --> Orch[Orchestrator agent<br/>gemini-2.5-flash → flash-lite]
    Orch -->|discover| Disc[list_namespaces · list_tools · get_tool_schema]
    Orch -->|act| Inv[invoke_tool dispatcher]
    Disc -.reads.-> Reg[(Tool registry<br/>62 tools / 9 namespaces<br/>progressively exposed)]
    Inv -->|by name| Reg
    Reg --> NS[github-issues · github-prs · github-repo<br/>github-actions · github-search · github-checks<br/>github-releases · triage · subagents]
    NS --> GH[GitHubClient<br/>Octokit + throttling + retry/backoff<br/>single choke point]
    GH -->|failures mapped| Err[Typed error taxonomy]
    GH --> Log[(Structured logging + spans)]

    Orch -->|delegates| Sub[Isolated subagents<br/>review_pr · investigate_issue<br/>worker model · scoped read-only subset<br/>brief-only input · typed return]
    Sub -->|own context| GH

    Orch -.long-horizon.-> Triage[triage_repository workflow]
    Triage -->|plan + compacted summaries| Mem[(TriageMemory)]
    Triage -->|chain| Chain[search_issues → cluster_issues → draft_triage_report]
    Triage -->|top items| Sub
    Triage --> Backlog([Ranked backlog + tool-call count])
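
The discover/act path in the flowchart (list_namespaces, list_tools, get_tool_schema, then invoke_tool) can be pictured with a small in-memory registry. This is a minimal sketch, not the code in src/tools/registry.ts: ToolDef, ToolRegistry, the string-label schema, and the get_issue tool are hypothetical names introduced for illustration.

```typescript
// Illustrative only: ToolDef, ToolRegistry and the sample tools are hypothetical,
// not the contents of src/tools/registry.ts.
type ToolInput = Record<string, unknown>;

interface ToolDef {
  name: string;
  namespace: string;
  description: string;
  inputSchema: Record<string, string>; // field -> type label, a stand-in for a real schema
  handler: (input: ToolInput) => unknown;
}

class ToolRegistry {
  private readonly tools = new Map<string, ToolDef>();

  register(tool: ToolDef): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  // Step 1: only namespace names are exposed.
  listNamespaces(): string[] {
    const names = Array.from(this.tools.values()).map((t) => t.namespace);
    return Array.from(new Set(names)).sort();
  }

  // Step 2: names and descriptions for one namespace, no schemas yet.
  listTools(namespace: string): Array<{ name: string; description: string }> {
    return Array.from(this.tools.values())
      .filter((t) => t.namespace === namespace)
      .map((t) => ({ name: t.name, description: t.description }));
  }

  // Step 3: the input schema, fetched only for the tool the model picked.
  getToolSchema(name: string): Record<string, string> {
    return this.lookup(name).inputSchema;
  }

  // Step 4: dispatch by name.
  invokeTool(name: string, input: ToolInput): unknown {
    return this.lookup(name).handler(input);
  }

  private lookup(name: string): ToolDef {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    return tool;
  }
}

const registry = new ToolRegistry();
registry.register({
  name: "get_issue", // hypothetical tool name
  namespace: "github-issues",
  description: "Fetch one issue by number",
  inputSchema: { number: "number" },
  handler: (input) => ({ number: input["number"], title: "stub issue" }),
});
registry.register({
  name: "cluster_issues",
  namespace: "triage",
  description: "Group issues into prioritised clusters",
  inputSchema: { issues: "Issue[]" },
  handler: () => [],
});
```

The point of the staging is that the model's context only ever holds the namespaces, then one namespace's descriptions, then one schema, instead of all 62 schemas at once.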

Stack: TypeScript (strict) + Mastra + Node 20+ + pnpm.
Models: Google Gemini via Mastra's model router. Orchestrator + long-horizon task use a fallback chain gemini-2.5-flash → gemini-2.5-flash-lite with per-model retries; subagents and the eval judge use gemini-2.5-flash-lite. Provider-swappable by design (src/config/models.ts).
Namespaces: github-issues, github-prs, github-repo, github-actions, github-search, github-checks, github-releases, triage, subagents.
Flagship long-horizon task: triage_repository — paginates all open issues, clusters and prioritises them, investigates top items via the isolated subagent, drafts responses, emits a ranked backlog; 20+ tool calls with the plan persisted and intermediate batches compacted.
Composable chain: search_issues → cluster_issues → draft_triage_report.
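
The composable chain above is three typed stages where each stage's output is the next stage's input. The sketch below shows that shape with in-memory stand-ins; the real chain lives in src/workflows/triage-chain.ts and goes through the GitHub client. The Issue, Cluster, and TriageReport types, the label-to-priority map, and the sample titles are assumptions made for illustration.

```typescript
// In-memory stand-ins. Types, the label -> priority map and the sample data are assumptions.
type Priority = "critical" | "high" | "medium" | "low";

interface Issue { number: number; title: string; labels: string[]; }
interface Cluster { name: string; priority: Priority; issues: number[]; }
interface TriageReport { backlog: Array<{ rank: number; cluster: Cluster }>; }

const PRIORITY_BY_LABEL: Record<string, Priority> = {
  security: "critical",
  bug: "high",
  performance: "medium",
};
const PRIORITY_ORDER: Priority[] = ["critical", "high", "medium", "low"];

// Stage 1: select issues matching a query (here: a case-insensitive title substring).
function searchIssues(all: Issue[], query: string): Issue[] {
  const q = query.toLowerCase();
  return all.filter((i) => i.title.toLowerCase().indexOf(q) !== -1);
}

// Stage 2: group by first label; unknown labels fall back to "low".
function clusterIssues(issues: Issue[]): Cluster[] {
  const byName = new Map<string, Cluster>();
  for (const issue of issues) {
    const name = issue.labels[0] ?? "unlabelled";
    const existing = byName.get(name);
    if (existing) {
      existing.issues.push(issue.number);
    } else {
      byName.set(name, { name, priority: PRIORITY_BY_LABEL[name] ?? "low", issues: [issue.number] });
    }
  }
  return Array.from(byName.values());
}

// Stage 3: rank clusters by priority and number them.
function draftTriageReport(clusters: Cluster[]): TriageReport {
  const sorted = clusters
    .slice()
    .sort((a, b) => PRIORITY_ORDER.indexOf(a.priority) - PRIORITY_ORDER.indexOf(b.priority));
  return { backlog: sorted.map((cluster, i) => ({ rank: i + 1, cluster })) };
}

// The chain is plain function composition over the typed stages.
function triageChain(all: Issue[], query: string): TriageReport {
  return draftTriageReport(clusterIssues(searchIssues(all, query)));
}

const sampleIssues: Issue[] = [
  { number: 9, title: "Crash on empty config", labels: ["bug"] },
  { number: 8, title: "Slow startup", labels: ["performance"] },
  { number: 5, title: "Token leak in logs", labels: ["security"] },
  { number: 4, title: "Wrong exit code", labels: ["bug"] },
  { number: 1, title: "Typo breaks parser", labels: ["bug"] },
];
const report = triageChain(sampleIssues, "");
```

Because each stage has a typed input and output, any stage can be tested or replaced in isolation.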

Required properties → where they live

Property	Code	Proven by
50+ tools, ≥4 namespaces, model-selected, progressively exposed	`src/tools/registry.ts`, `src/tools/exposure.ts` (62 tools / 9 ns)	`tests/unit/tools.registry.test.ts`, `tests/unit/tools.handlers.test.ts`; live `artifacts/smoke.txt`
≥1 truly isolated subagent, scoped tools, typed return	`src/agents/subagents/runner.ts`, `review-pr.ts`, `investigate-issue.ts`	`tests/unit/subagents.isolation.test.ts`; live `review_pr` → typed PrReview on PR #11 `artifacts/review-pr.txt`; investigations in `artifacts/triage-demo.txt`
Single session crosses 20 tool calls, plan intact	`src/workflows/triage-repository.ts`, `triage-memory.ts`	`tests/unit/triage-repository.test.ts`; live 27 calls `artifacts/triage-demo.txt`
Observability, retries+backoff, rate limiting, typed errors, eval, unit+integration tests	`src/github/client.ts`, `src/observability/`, `src/errors/`, `src/eval/`	`tests/integration/github.resilience.test.ts`, `tests/unit/model-fallback.test.ts`, `tests/unit/eval.test.ts`, `artifacts/eval-mock.txt`
≥1 composed tool chain	`src/workflows/triage-chain.ts`	`tests/unit/chain.schemas.test.ts`, `tests/integration/triage-chain.test.ts`
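
The subagent row in the table above names three properties: isolation, scoped tools, and a typed return. A minimal sketch of how they might fit together follows. It is not the code in src/agents/subagents/runner.ts; Tool, Brief, Investigation, and the sample tool names are hypothetical, and a real subagent would run a model loop rather than call every tool once.

```typescript
// Hypothetical shapes for illustration only.
interface Tool {
  name: string;
  readOnly: boolean;
  run: (arg: string) => string;
}
interface Brief { issueNumber: number; question: string; }
interface Investigation { issueNumber: number; findings: string[]; toolsUsed: string[]; }

// Build the subagent's tool view: a fresh map holding only tools that are
// both allow-listed and read-only.
function scopeTools(all: Tool[], allowList: string[]): ReadonlyMap<string, Tool> {
  const scoped = new Map<string, Tool>();
  for (const tool of all) {
    if (tool.readOnly && allowList.indexOf(tool.name) !== -1) scoped.set(tool.name, tool);
  }
  return scoped;
}

// The subagent receives only the brief and its scoped tools, never the
// orchestrator's conversation. This stub calls each scoped tool once.
function investigateIssue(brief: Brief, tools: ReadonlyMap<string, Tool>): Investigation {
  const findings: string[] = [];
  const toolsUsed: string[] = [];
  tools.forEach((tool, name) => {
    findings.push(tool.run(String(brief.issueNumber)));
    toolsUsed.push(name);
  });
  return { issueNumber: brief.issueNumber, findings, toolsUsed };
}

const allTools: Tool[] = [
  { name: "get_issue", readOnly: true, run: (n) => `issue ${n} body` },
  { name: "list_comments", readOnly: true, run: (n) => `comments on ${n}` },
  { name: "close_issue", readOnly: false, run: (n) => `closed ${n}` },
];

// close_issue is allow-listed here on purpose: the read-only filter still drops it.
const scoped = scopeTools(allTools, ["get_issue", "list_comments", "close_issue"]);
const investigation = investigateIssue({ issueNumber: 5, question: "Is this exploitable?" }, scoped);
```

Filtering on both the allow-list and the read-only flag means a misconfigured allow-list cannot hand a write tool to a subagent.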

See CLAUDE.md for the full engineering charter and invariants, and DECISIONS.md for autonomous decisions and their rationale.

Layout

src/
  config/         # zod-validated env + shared Mastra model config (fallback chain)
  github/         # Octokit wrapper: throttling + retry, the single GitHub choke point
  errors/         # typed error taxonomy + Octokit → taxonomy mapping
  observability/  # pino structured logger with operation context
  tools/          # GitHub tool registry              (step 3+)
  agents/         # orchestrator + isolated subagents  (step 4+)
  workflows/      # composable chains + triage_repository (step 5+)
  eval/           # scored evaluation harness          (step 6+)
tests/
  unit/           # hermetic; msw blocks all network
  integration/    # real client + plugins, GitHub simulated by msw
  msw/ · setup/   # shared msw server + per-project setup

Getting started

pnpm install
cp .env.example .env   # then fill in GITHUB_TOKEN, GOOGLE_GENERATIVE_AI_API_KEY, GITHUB_SANDBOX_REPO

Required environment (validated at startup — missing values fail fast):

Variable	Purpose
`GITHUB_TOKEN`	GitHub PAT for all API calls
`GOOGLE_GENERATIVE_AI_API_KEY`	Google key read by Mastra's model router
`GITHUB_SANDBOX_REPO`	Target repo as `owner/repo`
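
To make "validated at startup, missing values fail fast" concrete, here is a dependency-free sketch of the idea. The project itself uses zod for this; the hand-rolled checks, the Env interface, and the owner/repo pattern below are assumptions standing in for the real schema in src/config/env.ts.

```typescript
// Dependency-free stand-in for the zod schema; names and rules are assumptions.
interface Env {
  GITHUB_TOKEN: string;
  GOOGLE_GENERATIVE_AI_API_KEY: string;
  GITHUB_SANDBOX_REPO: string;
}

const REQUIRED_KEYS: Array<keyof Env> = [
  "GITHUB_TOKEN",
  "GOOGLE_GENERATIVE_AI_API_KEY",
  "GITHUB_SANDBOX_REPO",
];

function loadEnv(source: Record<string, string | undefined>): Env {
  const problems: string[] = [];

  // Collect every problem before failing, so one run reports all of them.
  for (const key of REQUIRED_KEYS) {
    const value = source[key];
    if (value === undefined || value.trim() === "") problems.push(`${key} is required`);
  }

  const repo = source["GITHUB_SANDBOX_REPO"];
  if (repo !== undefined && repo.trim() !== "" && !/^[^/\s]+\/[^/\s]+$/.test(repo)) {
    problems.push("GITHUB_SANDBOX_REPO must look like owner/repo");
  }

  if (problems.length > 0) {
    throw new Error(`Invalid environment:\n- ${problems.join("\n- ")}`);
  }

  return {
    GITHUB_TOKEN: source["GITHUB_TOKEN"] as string,
    GOOGLE_GENERATIVE_AI_API_KEY: source["GOOGLE_GENERATIVE_AI_API_KEY"] as string,
    GITHUB_SANDBOX_REPO: source["GITHUB_SANDBOX_REPO"] as string,
  };
}
```

In an application this would be called once at boot with the process environment, so a bad configuration aborts before any GitHub or model call is made.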

Scripts

Command	Description
`pnpm typecheck`	`tsc --noEmit` (strict)
`pnpm build`	Emit `dist/` from `src/`
`pnpm test`	Run unit + integration suites
`pnpm test:unit` / `pnpm test:integration`	Run one project
`pnpm dev`	Run the bootstrap entry point

Production scaffolding

Env config — zod-validated, fail-fast with an aggregated error.
Model config — Mastra model router with a fallback chain + per-model retries.
GitHub client — Octokit composed with throttling (rate-limit aware) and retry (exponential backoff on 5xx/network); every tool calls GitHub only through it.
Error taxonomy — AuthError, NotFoundError, RateLimitError, ValidationError, UpstreamError; Octokit failures map in, no untyped throws.
Observability — pino structured logging with bound operation context.
Tests — vitest unit + integration projects; msw guarantees unit tests never hit the network.
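
The error taxonomy bullet above says Octokit failures map into five typed errors. One plausible way to do that mapping is sketched below. The five names come from the list above; the single ReeveError class with a code discriminant, the HttpFailure shape, and the exact status-to-error rules are assumptions, not the contents of src/errors/.

```typescript
// The five codes mirror the taxonomy names; everything else here is an assumption.
type ErrorCode =
  | "AuthError"
  | "NotFoundError"
  | "RateLimitError"
  | "ValidationError"
  | "UpstreamError";

class ReeveError extends Error {
  readonly code: ErrorCode;
  readonly status: number | undefined;

  constructor(code: ErrorCode, message: string, status?: number) {
    super(message);
    this.name = code;
    this.code = code;
    this.status = status;
  }
}

// Minimal view of a failed HTTP call; a real Octokit error carries more fields.
interface HttpFailure {
  status?: number;
  message: string;
  headers?: Record<string, string | undefined>;
}

function mapGitHubError(failure: HttpFailure): ReeveError {
  const { status, message } = failure;
  const headers = failure.headers ?? {};

  // GitHub signals rate limiting with 429, or with 403 plus rate-limit headers.
  const rateLimited =
    status === 429 ||
    (status === 403 &&
      (headers["x-ratelimit-remaining"] === "0" || headers["retry-after"] !== undefined));

  if (rateLimited) return new ReeveError("RateLimitError", message, status);
  if (status === 401 || status === 403) return new ReeveError("AuthError", message, status);
  if (status === 404) return new ReeveError("NotFoundError", message, status);
  if (status === 400 || status === 422) return new ReeveError("ValidationError", message, status);

  // 5xx, network failures without a status, and anything unexpected.
  return new ReeveError("UpstreamError", message, status);
}
```

Checking the rate-limit case before the auth case matters, because a rate-limited 403 would otherwise be reported as a permissions problem.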

Observability — what is traced

Everything logs through one pino-based layer (src/observability), which binds an operation (and optional correlationId) to every line and redacts tokens/keys. Each significant unit emits structured spans carrying operation, the tool name where relevant, durationMs latency, and an outcome (success/failure):

Layer	Span(s)	Key fields
GitHub client	`github.request.start/success/failure`	`operation`, `durationMs`, mapped `err`
Throttle/retry	`Primary/Secondary rate limit hit`	`method`, `url`, `retryAfter`, `retryCount`
Orchestrator	`orchestrator.tool_call`	`operation`, `tool`, `durationMs`, `outcome`, `errorCode`
Subagents	`subagent.start` / `subagent.done` / `subagent.failed`	`threadId`, `scope`, `durationMs`, `outcome`
triage_repository	`triage.tool_call`, `triage.plan_recorded`, `triage.gathered`, `triage.done`	`tool`, `durationMs`, `outcome`, running `count`

The tool-call count is visible end-to-end: every tool/subagent call in the long-horizon task increments a ToolCallCounter that logs the running count on each call, and the final result reports totalToolCalls. Logs are JSON by default; set REEVE_LOG_PRETTY=1 for human-readable output and REEVE_LOG_LEVEL=debug for verbose tracing.
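
As a rough illustration of the counter and span fields described above, the sketch below wraps a tool call, bumps a running count, and records one span per call. ToolCallCounter is named in the text, but its interface here, the tracedCall helper, and the in-memory sink (standing in for the pino logger) are assumptions. The wrapper is synchronous for brevity; real tool calls are asynchronous.

```typescript
// Assumed shapes; the in-memory sink stands in for the pino-based logger.
type Outcome = "success" | "failure";

interface Span {
  operation: string;
  tool: string;
  durationMs: number;
  outcome: Outcome;
  count: number; // running tool-call count at the time of this call
}

class ToolCallCounter {
  private count = 0;

  increment(): number {
    this.count += 1;
    return this.count;
  }

  total(): number {
    return this.count;
  }
}

// Runs fn, records exactly one span whether it succeeds or throws,
// and rethrows the original error on failure.
function tracedCall<T>(counter: ToolCallCounter, sink: Span[], tool: string, fn: () => T): T {
  const count = counter.increment();
  const startedAt = Date.now();
  let outcome: Outcome = "success";
  try {
    return fn();
  } catch (err) {
    outcome = "failure";
    throw err;
  } finally {
    sink.push({
      operation: "triage.tool_call",
      tool,
      durationMs: Date.now() - startedAt,
      outcome,
      count,
    });
  }
}
```

Counting in the wrapper rather than in each tool keeps the total accurate even when a tool throws.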

Evaluation harness

src/eval scores triage/investigation quality against fixtures mirroring the seeded sandbox. The scorer has two modes: deterministic checks (exact / contains / ordering on structured outcomes) and an LLM judge for fuzzy criteria. The judge is the only place the harness touches a live model and is isolated behind one function, so it is fully mockable.

pnpm eval          # default: live judge (google/gemini-2.5-flash-lite)
pnpm eval --mock   # fully offline: stubbed judge, deterministic checks run for real

Live runs fail fast on a Gemini 429 (no retry loop).
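
The two scorer modes can be sketched as follows: deterministic checks run as plain functions, and the judge is an injected function so that a stub can replace it in mock mode. The Check union, the Judge signature, and the equal-weight averaging are assumptions for illustration, not the harness in src/eval. The judge is synchronous here for brevity; a live judge would be asynchronous.

```typescript
// Assumed shapes for illustration; not the harness in src/eval.
type Check =
  | { kind: "exact"; field: string; expected: string }
  | { kind: "contains"; field: string; expected: string }
  | { kind: "ordering"; field: string; expected: string[] };

type EvalOutcome = Record<string, string | string[]>;

// Returns a score in [0, 1] for one fuzzy criterion.
type Judge = (criterion: string, outcome: EvalOutcome) => number;

function runCheck(check: Check, outcome: EvalOutcome): boolean {
  const actual = outcome[check.field];
  switch (check.kind) {
    case "exact":
      return actual === check.expected;
    case "contains":
      return typeof actual === "string"
        ? actual.indexOf(check.expected) !== -1
        : Array.isArray(actual) && actual.indexOf(check.expected) !== -1;
    case "ordering": {
      // Passes when the expected items appear in this relative order (gaps allowed).
      if (!Array.isArray(actual)) return false;
      let next = 0;
      for (const item of actual) {
        if (next < check.expected.length && item === check.expected[next]) next += 1;
      }
      return next === check.expected.length;
    }
  }
}

// Equal-weight average over deterministic checks and judged criteria.
function scoreCase(checks: Check[], criteria: string[], outcome: EvalOutcome, judge: Judge): number {
  const deterministic: number[] = checks.map((c) => (runCheck(c, outcome) ? 1 : 0));
  const judged: number[] = criteria.map((c) => judge(c, outcome));
  const all = deterministic.concat(judged);
  return all.length === 0 ? 0 : all.reduce((sum, s) => sum + s, 0) / all.length;
}

// Offline stand-in for the LLM judge: always returns full marks.
const mockJudge: Judge = () => 1;
```

Keeping the judge behind one function type is what lets the offline mode exercise the deterministic checks for real while stubbing only the model call.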

Deployment readiness

The codebase is structured for deployment, not just demo:

Typed, fail-fast config — all secrets/settings load through one zod-validated loadEnv() (src/config/env.ts); a missing var aborts at boot, not mid-run.
Single GitHub choke point — every external call goes through one throttled + retrying GitHubClient (src/github/client.ts), so rate limiting, exponential backoff, logging, and typed-error mapping are enforced in one place rather than reimplemented per call.
Structured logs + spans — JSON logging with operation/tool/latency/outcome and token redaction (src/observability), ready to ship to any log sink.
Stateless tools — each tool is a pure typed input→output handler over the shared client; nothing holds process-local state, so the service scales horizontally.
Model-router fallback — the orchestrator degrades flash → flash-lite on 429/5xx, absorbing transient upstream failures.
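
The model-router fallback in the list above can be sketched as a loop over models with a bounded number of retries per model. This is one reading of how "fail fast on a 429" and "degrade on 429/5xx" could fit together, not the Mastra model router's actual behaviour: here a 429 skips straight to the next model, while a 5xx retries the same model first. CallResult, ModelCall, and runWithFallback are hypothetical names, the calls are synchronous for brevity, and backoff delays between retries are omitted.

```typescript
// Hypothetical sketch; the real routing is configured in src/config/models.ts.
type CallResult = { ok: true; text: string } | { ok: false; status: number };
type ModelCall = (model: string) => CallResult;

interface FallbackResult {
  model: string;      // the model that finally answered
  text: string;
  attempts: string[]; // every model tried, in order, including retries
}

function runWithFallback(
  models: string[],
  maxRetriesPerModel: number,
  call: ModelCall,
): FallbackResult {
  const attempts: string[] = [];
  let lastStatus = 0;

  for (const model of models) {
    for (let attempt = 0; attempt <= maxRetriesPerModel; attempt += 1) {
      attempts.push(model);
      const result = call(model);
      if (result.ok) return { model, text: result.text, attempts };

      lastStatus = result.status;
      // Quota exhausted: retrying the same model will not help, so move on.
      if (result.status === 429) break;
      // Other 4xx errors are treated as non-retryable.
      if (result.status < 500) {
        throw new Error(`non-retryable status ${result.status} from ${model}`);
      }
      // 5xx: fall through and retry the same model (a real client would back off here).
    }
  }

  throw new Error(`all models failed, last status ${lastStatus}`);
}
```

With the chain gemini-2.5-flash then gemini-2.5-flash-lite, a transient 5xx on the first model costs a retry, while a quota error moves directly to the second model.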

How it would deploy (not done here): package src/ as a container image (or a serverless function) with config supplied entirely via environment variables; the GitHub PAT and model key are injected as secrets. To take it off the free tier, switch the runtime model to a higher-tier model or a paid key with a one-line change in src/config/models.ts (the model router keeps the provider and model swappable). Pointing Reeve at a different repository requires no code changes, only a new GITHUB_SANDBOX_REPO value.
