faultsim

Failure Misclassification in Distributed Clusters: A Simulation Study Under Jitter, Churn, and Partitions

faultsim is a discrete-event simulator for studying how failure-detection strategies behave under unstable network conditions. It focuses on misclassification: healthy nodes being declared failed because of jitter, delay, churn, packet loss, or partitions.

Research Question

Under what network conditions do common failure-detection strategies misclassify healthy nodes as failed, and how do adaptive or gossip-assisted approaches compare to fixed-timeout methods?

Strategies Under Study

Strategy	Description
Fixed-timeout heartbeat	Declares failure after a static timeout with no heartbeat received
Adaptive-timeout heartbeat	EWMA of observed inter-arrival times drives the timeout
Gossip-assisted suspicion	Combines local heartbeat monitoring with suspicion disseminated via gossip
Phi accrual	Suspicion level derived from a distribution over recent inter-arrival samples
Adaptive accrual	Accrual-style detector with parameters tuned online

Key Metrics

False positive rate - fraction of healthy nodes incorrectly declared failed
Detection latency - time from actual failure to detection, including mean and p95 summaries
Wall-clock runtime - simulator execution time in milliseconds (wall_time_ms in CSV/JSON outputs)

Project Structure

src/            Simulator source code (Rust)
docs/           Design notes, experiment plan, demo guide, custom-detector guide
configs/        Scenario configuration files (TOML)
scripts/        Plotting and experiment helper scripts
results/        Generated output (git-ignored, except .gitkeep)
tests/          Integration and smoke tests

Getting Started

# Build the simulator
cargo build
cargo build --release

# Run one scenario
cargo run --release -- run --config configs/scenarios/baseline.toml
cargo run --release -- run --config configs/scenarios/baseline.toml --seed 42 --output results/demo

# Run every scenario in a directory and write a combined summary.csv
cargo run --release -- run-all --scenarios configs/scenarios --output results/batch

# Re-run one scenario across many seeds for aggregate statistics
cargo run --release -- sweep-seeds --config configs/scenarios/baseline.toml --seeds 30 --output results/seeds

# Sweep one detector or network parameter over a range
cargo run --release -- sweep --config configs/scenarios/crash_phi_accrual.toml \
    --param phi_threshold --start 1.0 --end 16.0 --steps 10 --output results/phi_sweep

# Test and lint
cargo test
cargo fmt -- --check
cargo clippy -- -D warnings

# Plot a combined summary.csv into publication-style figures
python scripts/plot_results.py results/batch/summary.csv --output results/plots

The plotting script expects matplotlib and pandas. For a batch summary it generates fp_rate.png, latency.png, strategy_comparison.png, and wall_time.png. For sweep CSVs it generates sweep_<param>.png.

Single-scenario runs can export CSV or JSON summaries plus detection logs. run-all always writes a combined summary.csv, and the aggregate CSV outputs now include wall_time_ms.

See docs/demo.md for a guided walkthrough of the main workflows.

Scenarios

configs/scenarios/ holds the full scenario set. Each TOML specifies cluster size, network parameters, detector strategy, and a fault schedule (crash, recover, partition_start, partition_end).

Current scenario groups include:

baseline / adaptive / gossip - clean-network baselines
crash_* - single-node crash scenarios for each detector
high_jitter_* - elevated jitter stress tests
drops_5pct / drops_15pct - lossy-link scenarios
partition - partition and heal scenarios
custom_example - starter scenario for user-defined detectors

See configs/scenarios/README.md for the full config schema and scenario conventions.

Custom Detectors

To add your own detector:

Edit src/detector/custom.rs.
Set strategy = "custom" in a scenario TOML.
Put any detector-specific knobs under [detector.params].
Run the scenario normally, or sweep those custom parameters with faultsim sweep.

The engine, metrics export, seed sweeps, parameter sweeps, and plotting workflow all work with the custom strategy. See docs/custom-detector.md for the full guide and worked example.

Reproducibility

Every run is deterministic for a given (config, seed) pair. A single StdRng is threaded through src/scenario.rs into the engine and network model, and the deterministic_replay integration test guards against accidental nondeterminism.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
configs/scenarios		configs/scenarios
docs		docs
results		results
scripts		scripts
src		src
tests		tests
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

faultsim

Research Question

Strategies Under Study

Key Metrics

Project Structure

Getting Started

Scenarios

Custom Detectors

Reproducibility

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

faultsim

Research Question

Strategies Under Study

Key Metrics

Project Structure

Getting Started

Scenarios

Custom Detectors

Reproducibility

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages