DivergenceLens

Reference-free, structurally-grounded silent-divergence auditing for LangChain Deep Agents.

Final-answer evals miss most step-level failures. A large fraction of "successful" agent runs are corrupt successes — an async subagent silently errors, a todo gets marked done with no supporting action, a file-write claim has no matching mutation. DivergenceLens catches these without requiring a reference trajectory.

What it does

DivergenceLens audits a Deep Agents run against its own stated plan and claims. It builds a causal provenance graph over the run, applies a deterministic rule engine plus an optional LLM judge, and classifies findings into a typed divergence taxonomy:

Category	What it catches
Phantom completion	Todo marked done with no supporting successful action
Silent failure masking	Tool errored but agent claimed success
Claim–write mismatch	Agent asserts it wrote a file; no `FileMutation` exists
Summary inflation	Async subagent summary overstates vs. its real trajectory
Plan drift	Consequential actions with no corresponding todo
Orphaned evidence	Retrieved content that is never used, or is contradicted later
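
To make the first category concrete, here is a minimal sketch of what a deterministic phantom-completion rule could look like. The `Event` dataclass, its fields, and the `kind` values are hypothetical stand-ins for illustration; they are not the project's actual event schema or rule engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical, simplified event record (not the real DivergenceLens schema).
    step: int
    kind: str                      # "tool_call" or "todo_update"
    todo_id: Optional[str] = None  # todo this event is attributed to
    status: Optional[str] = None   # for todo_update: e.g. "done"
    ok: Optional[bool] = None      # for tool_call: did the tool succeed?


def phantom_completions(events):
    """Return steps where a todo is marked done without any earlier
    successful tool call attributed to that todo."""
    supported = set()
    findings = []
    for ev in sorted(events, key=lambda e: e.step):
        if ev.kind == "tool_call" and ev.ok and ev.todo_id is not None:
            supported.add(ev.todo_id)
        elif ev.kind == "todo_update" and ev.status == "done":
            if ev.todo_id not in supported:
                findings.append(ev.step)
    return findings


run = [
    Event(0, "tool_call", todo_id="t1", ok=True),
    Event(1, "todo_update", todo_id="t1", status="done"),
    Event(2, "tool_call", todo_id="t2", ok=False),   # tool failed
    Event(3, "todo_update", todo_id="t2", status="done"),
]
print(phantom_completions(run))  # [3]
```

Todo `t2` is flagged because its only attributed action failed, while `t1` passes because a successful tool call precedes its completion.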

Architecture

serve / sdk / cli               ← interfaces
    reporting · dashboard       ← outputs
    runtime: middleware · monitor · policy · interrupt · rollback  ← act
    detection: consistency matrix · taxonomy · severity            ← decide
    alignment: deterministic rules · judge · calibration           ← score
    provenance: causal / data-flow graph                           ← structure
    ingest: LangSmith · LangGraph state · OTEL · stream            ← normalize
    core: event schema · matrix types · config · registries        ← foundation
bench/ (corpus · injection · metrics · baselines · ablations)
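
As a rough illustration of why the provenance layer sits underneath detection, the sketch below checks a claim–write mismatch by walking a tiny causal graph. The node names, the `kinds` labels, and the parent-list representation are assumptions made for this example, not the project's graph API.

```python
from collections import deque


def ancestors(parents, node):
    """Return every transitive cause of `node`.
    `parents` maps a node to the list of nodes it directly depends on."""
    seen = set()
    queue = deque(parents.get(node, []))
    while queue:
        cur = queue.popleft()
        if cur in seen:
            continue
        seen.add(cur)
        queue.extend(parents.get(cur, []))
    return seen


def unsupported_write_claims(parents, kinds):
    """Return write claims with no file mutation anywhere upstream."""
    flagged = []
    for node, kind in kinds.items():
        if kind != "write_claim":
            continue
        upstream = ancestors(parents, node)
        if not any(kinds.get(a) == "file_mutation" for a in upstream):
            flagged.append(node)
    return sorted(flagged)


# Hypothetical run: claim_a is backed by a real write, claim_b is not.
kinds = {
    "plan": "todo",
    "write_a": "file_mutation",
    "claim_a": "write_claim",
    "claim_b": "write_claim",
}
parents = {
    "write_a": ["plan"],
    "claim_a": ["write_a"],
    "claim_b": ["plan"],
}
print(unsupported_write_claims(parents, kinds))  # ['claim_b']
```

Because the check only needs the run's own structure, it works without a reference trajectory.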

Quickstart

# Install
git clone https://github.com/Lkumar209/divergencelens
cd divergencelens
uv sync

# Audit a LangSmith run
divergencelens audit <run_id>

# Audit from exported JSON
divergencelens audit ./trace.json

# Run the benchmark
divergencelens bench --split test --seeds 3

# Start the audit service
divergencelens serve --port 8000
# POST /audit {"run_id": "<id>"}

# Smoke test (no API required)
make smoke
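
For calling the audit service from Python, a request along these lines should work using only the standard library. It assumes the service from `divergencelens serve --port 8000` is reachable at `http://localhost:8000` and accepts a JSON body; the response format is not documented here, so the sketch prints it as raw text rather than parsing fields.

```python
import json
import urllib.request


def build_audit_request(run_id, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /audit request for a run ID.
    The base URL is an assumption matching the --port 8000 example."""
    body = json.dumps({"run_id": run_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/audit",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_audit_request("<id>")
print(req.get_method(), req.full_url)

# To send it against a running service:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```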

SDK usage

from divergencelens import DivergenceLens, DivergenceLensConfig

lens = DivergenceLens()

# Option 1: audit by LangSmith run ID
result = lens.audit_langsmith_run("run_id_here")

# Option 2: audit a Run object you have already loaded
# result = lens.audit_run(run)

# Print the overall summary, then one line per finding
print(result.summary)
for div in result.divergences:
    print(f"[{div.severity.value}] {div.category.value}: {div.rationale}")
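
One way to use the result is as a gate in CI: fail the job when any finding reaches a chosen severity. The sketch below uses stand-in objects that mimic the `severity.value`, `category.value`, and `rationale` attributes shown above. The severity names and their ordering are hypothetical, so check the project's severity enum before relying on them.

```python
from types import SimpleNamespace

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def blocking(divergences, threshold="high"):
    """Return the findings whose severity is at or above `threshold`."""
    floor = SEVERITY_ORDER.index(threshold)
    return [
        d for d in divergences
        if SEVERITY_ORDER.index(d.severity.value) >= floor
    ]


def fake_divergence(severity, category, rationale):
    """Stand-in for a real finding, used only to make the sketch runnable."""
    return SimpleNamespace(
        severity=SimpleNamespace(value=severity),
        category=SimpleNamespace(value=category),
        rationale=rationale,
    )


divs = [
    fake_divergence("low", "plan_drift", "action with no matching todo"),
    fake_divergence("critical", "phantom_completion", "todo done, no action"),
]
failures = blocking(divs)
for d in failures:
    print(f"BLOCKING [{d.severity.value}] {d.category.value}: {d.rationale}")
# In CI you might follow this with: sys.exit(1 if failures else 0)
```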

Middleware (real-time, deepagents-compatible)

from deepagents import create_deep_agent
from divergencelens.runtime.middleware import DivergenceMiddleware

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    middleware=[DivergenceMiddleware()],
    # ... your tools and subagents
)

Benchmark

DivergenceLens is validated on DivergenceBench — a labeled dataset built via fault injection on synthetic Deep Agents runs, with a frozen train/dev/test split. Ground-truth labels come from the injector, not LLM annotation.

make reproduce   # regenerates DivergenceBench and RESULTS.md from scratch
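
The idea behind injector-derived labels can be sketched as follows: start from a clean synthetic run, corrupt exactly one step, and record what was changed as the gold label. The dictionary-based event format and the category string are illustrative assumptions, not the format used in `bench/inject/`.

```python
import copy
import random


def inject_silent_failure(run, rng):
    """Flip one successful tool call to a failure while leaving the agent's
    later success claim untouched. Returns (corrupted_run, gold_label)."""
    corrupted = copy.deepcopy(run)
    candidates = [
        i for i, ev in enumerate(corrupted)
        if ev["kind"] == "tool_call" and ev["ok"]
    ]
    if not candidates:
        return corrupted, None  # nothing to corrupt, so no label
    idx = rng.choice(candidates)
    corrupted[idx]["ok"] = False
    corrupted[idx]["error"] = "injected failure"
    label = {"category": "silent_failure_masking", "step": idx}
    return corrupted, label


clean = [
    {"kind": "plan", "text": "fetch data, then summarize"},
    {"kind": "tool_call", "name": "fetch", "ok": True},
    {"kind": "claim", "text": "Fetched the data successfully."},
]
corrupted, label = inject_silent_failure(clean, random.Random(0))
print(label)  # {'category': 'silent_failure_masking', 'step': 1}
```

Because the label is produced by the same code that introduces the fault, no annotator (human or LLM) has to judge where the divergence is.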

Results (DivergenceBench, test split, 3 seeds, no LLM judge)

Method	F1	Precision	Recall	FP Rate
Final-answer baseline	0.00	—	—	0.00
Generic LLM-judge baseline	—	—	—	—
Deterministic-only	1.00	1.00	1.00	0.00
Graph-only	0.00	0.00	0.00	0.00
DivergenceLens (full)	1.00	1.00	1.00	0.00

Mean localization accuracy (within ±2 steps of gold): 0.848

Per-category F1: all 5 injected categories score 1.00.

Note: F1=1.0 on a synthetic corpus reflects rule-matched injections, not real-world generalization. See LIMITATIONS.md and results/RESULTS.md for honest discussion.

See results/RESULTS.md for full per-seed breakdowns, ablation table, and figures.
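
For readers who want to sanity-check numbers like these, the sketch below shows one conventional way to compute precision, recall, F1, and a "within ±2 steps" localization accuracy. It is not the code in `bench/metrics/`; in particular, returning `None` for an undefined ratio and pairing predictions with gold steps by position are assumptions of this example.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts.
    A ratio with a zero denominator is returned as None (undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = 0.0  # convention used here when F1 is undefined or zero
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def localization_accuracy(predicted_steps, gold_steps, tolerance=2):
    """Fraction of paired predictions that land within `tolerance`
    steps of the corresponding gold step."""
    pairs = list(zip(predicted_steps, gold_steps))
    if not pairs:
        return 0.0
    hits = sum(1 for p, g in pairs if abs(p - g) <= tolerance)
    return hits / len(pairs)


print(prf1(tp=3, fp=1, fn=1))                          # (0.75, 0.75, 0.75)
print(prf1(tp=0, fp=0, fn=5))                          # (None, 0.0, 0.0)
print(localization_accuracy([4, 10, 7], [5, 6, 7]))    # 2 of 3 within ±2
```

A detector that never fires makes no predictions, so its precision is undefined rather than zero, which is one reason a results table may show a dash instead of a number.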

Environment variables

LANGSMITH_API_KEY=...      # for LangSmith loading + feedback writing
OPENAI_API_KEY=...         # for LLM judge (optional; rules work without it)
ANTHROPIC_API_KEY=...      # alternative judge provider

Copy .env.example → .env and fill in what you need.
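
Since the judge is optional, a caller can decide up front whether a judge is available at all. The helper below is a hypothetical sketch of that decision: the function name, the returned provider strings, and the preference for OpenAI when both keys are set are assumptions, not documented behavior of DivergenceLens.

```python
import os


def pick_judge_provider(env=os.environ):
    """Return which judge provider has a key configured, or None to
    indicate a rules-only audit. Preference order is an assumption."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None


print(pick_judge_provider({"ANTHROPIC_API_KEY": "sk-example"}))  # anthropic
print(pick_judge_provider({}))                                   # None
```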

Limitations

See LIMITATIONS.md for an honest accounting of where the method is weakest:

Synthetic corpus (not real production agent runs)
Claim extraction is heuristic-based (regex patterns)
LLM judge cost grows linearly with run length
Not evaluated against agents trained to evade detection; deterministic rules can be worked around by a sufficiently adversarial agent
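
The regex limitation is easy to see with a toy extractor. The pattern below is written for this illustration only and is not one of the project's actual patterns: it catches a direct "wrote ... file.ext" statement but misses a paraphrase that describes the same write without a trigger verb.

```python
import re

# Illustrative pattern: a trigger verb, then (within the same sentence)
# something that looks like a file path with an extension.
WRITE_CLAIM = re.compile(
    r"\b(?:wrote|saved|created|updated)\b[^.\n]*?(?P<path>[\w./-]+\.\w+)",
    re.IGNORECASE,
)


def extract_write_claims(text):
    """Return file paths the text explicitly claims to have written."""
    return [m.group("path") for m in WRITE_CLAIM.finditer(text)]


direct = "I wrote the results to report.md."
paraphrase = "The report is now on disk at out/report.md."

print(extract_write_claims(direct))      # ['report.md']
print(extract_write_claims(paraphrase))  # []
```

The second sentence makes the same claim, so a rule that depends on this extractor would never compare it against the recorded file mutations.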

Project structure

src/divergencelens/
  core/         event schema, types, config, registries
  ingest/       LangSmith, LangGraph, OTEL, stream normalizers
  provenance/   causal graph, entity tracker, localizer
  alignment/    deterministic rules, LLM judge, calibration, fusion
  detection/    consistency matrix, taxonomy, severity
  runtime/      middleware, monitor, policy, interrupt, rollback
  integrations/ LangSmith feedback, OTEL export, webhook
  report/       per-run and aggregate reports
  serve/        FastAPI audit service
  sdk/          DivergenceLens programmatic API
  cli/          Typer CLI
bench/
  corpus/       synthetic run corpus
  inject/       fault injectors (one per divergence category)
  metrics/      precision, recall, F1, localization, calibration
  baselines/    comparison baselines
tests/
  unit/         core, detection, provenance unit tests
  smoke/        end-to-end pipeline smoke test (no API)

License

MIT — see LICENSE.
