IncidentBench
Resilience Benchmarking for Log Analytics Platforms
IncidentBench is an open-source, Kubernetes-native benchmark that measures how log analytics platforms behave when it matters most: during a real incident, when an ingestion surge and a query storm overlap.
IncidentBench is not a throughput benchmark. It is a resilience benchmark.
Apache 2.0
The Problem
Every SRE has lived this moment. A cascading failure spikes log volume 10–50x. Simultaneously, multiple engineers are hammering the logging platform with investigative queries. The platform you’re using to investigate the outage is also drowning in the same incident you’re investigating.
Existing benchmarks don’t test for this. They measure throughput in isolation: max ingest rate, peak QPS, compression ratios, storage efficiency. Those are useful numbers, but they tell you nothing about whether your platform survives the overlap.
IncidentBench fills that gap.
What It Measures
IncidentBench evaluates four properties that determine whether a platform holds up under real operational stress.
Stability
Do query latencies stay predictable when ingestion surges? Or does everything slow to a crawl?
Isolation
Does a query storm degrade ingestion throughput? Does an ingestion surge degrade query performance? Or does the platform keep them independent?
Predictability
Are tail latencies bounded under stress? Or do p99s blow up unpredictably during the overlap window?
Recovery
How quickly does the system return to baseline after the incident ends? Does it snap back, or does it limp along with lingering degradation?
How It Works
IncidentBench simulates an incident lifecycle in six phases, modeling how a real outage unfolds from first spike to full recovery.
Baseline
Normal operating conditions. Moderate ingestion, light query load. Establishes the performance reference that everything else is measured against.
Incident Trigger
A bug is deployed. Error rates spike. Log volume triples. No human response yet. Alerting hasn’t fired.
Ingestion Surge
Retries cascade. Log volume hits 10x baseline. The alert fires. On-call engineers begin querying.
Overlap
Full incident mode. Ingestion at peak. Multiple engineers querying simultaneously. This is the core measurement window. Behavior here is the primary differentiator between platforms.
Recovery
Fix deployed. Error rates declining. Engineers still querying to verify the fix.
Post-Incident
Rates return to baseline. Measures whether the system has fully recovered or has lingering degradation.
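The six phases above amount to an ordered schedule of rate multipliers. A minimal Python sketch (durations and multipliers are the flagship scenario's defaults; the real harness drives transitions from the benchmark CR spec):

```python
# Minimal sketch of the six-phase incident lifecycle as an ordered schedule.
# Durations and multipliers are the default SRE Outage scenario values.
PHASES = [
    # (name, duration_s, ingest_multiplier, query_multiplier)
    ("baseline",         120,  1, 1),
    ("incident_trigger",  60,  3, 1),
    ("ingestion_surge",   90, 10, 3),
    ("overlap",          120, 10, 8),
    ("recovery",          90,  2, 4),
    ("post_incident",    120,  1, 1),
]

def phase_at(elapsed_s: float) -> str:
    """Return the name of the phase active at a given elapsed time."""
    t = 0
    for name, duration, _ingest_x, _query_x in PHASES:
        t += duration
        if elapsed_s < t:
            return name
    return "finished"
```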
The Scorecard
Every run produces a scorecard with the metrics that matter.
| Metric | What It Tells You |
|---|---|
| Baseline p99 latency | How fast queries are under normal conditions |
| Overlap p99 latency | How fast queries are during peak incident stress |
| p99 degradation ratio | How much worse queries get during the overlap (overlap p99 / baseline p99) |
| Query error rate | Did queries fail during peak stress? |
| Peak ingestion backlog | Maximum Kafka consumer lag. Can the platform keep up with the data? |
| Backlog drain time | How long to catch up after the surge ends |
| Recovery time | How long until query latency returns to within 1.2x of baseline |
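To make the arithmetic behind two of these metrics concrete, here is an illustrative Python sketch using a simple nearest-rank percentile (the harness itself merges t-digests across workers; this is not its exact code):

```python
import math

def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[idx]

def degradation_ratio(baseline_samples, overlap_samples):
    """Scorecard's p99 degradation ratio: overlap p99 / baseline p99."""
    return p99(overlap_samples) / p99(baseline_samples)

def recovery_time(post_incident_windows, baseline_p99, threshold=1.2):
    """Seconds until the rolling query p99 returns to within `threshold`x of
    baseline. `post_incident_windows` is a list of (elapsed_s, window_p99)."""
    for elapsed_s, window_p99 in post_incident_windows:
        if window_p99 <= threshold * baseline_p99:
            return elapsed_s
    return None  # did not recover within the measured window
```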
The scorecard is designed to be immediately legible to anyone evaluating a platform — whether you’re an SRE, a platform engineer, or a buyer making a procurement decision.
Architecture
IncidentBench runs as a Kubernetes operator. Benchmark scenarios are defined as Custom Resources. The harness horizontally scales ingest and query workers across pods, routes ingestion through Kafka, and streams real-time metrics during the run.
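As an illustration of the Custom Resource model, a run declaration might look roughly like the following. This is a hypothetical sketch: the API group, version, and `spec` field names are assumptions, not the published schema; only the `IncidentBenchRun` kind and the `scaling` fields appear elsewhere in this README. Consult `config/samples/` for the real schema.

```yaml
# Hypothetical sketch of a benchmark run CR; see config/samples/ for the
# actual schema shipped with the operator.
apiVersion: incidentbench.io/v1alpha1   # assumed group/version
kind: IncidentBenchRun
metadata:
  name: sre-outage-run-001
spec:
  scenario: sre-outage                  # assumed field name
  scaling:
    rate_scale: 0.01
    duration_scale: 0.1
```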
Key Design Decisions
Kafka as the ingestion path
Ingest workers produce events to Kafka topics. The target platform consumes from Kafka via its own ingest pipeline. This models real production architectures where log data flows through a message bus. It also provides a natural, always-available backlog metric: Kafka consumer lag.
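Consumer lag itself is simple offset arithmetic, which is what makes it such a dependable backlog signal. A minimal sketch of the computation (the harness reads the actual offsets via rdkafka):

```python
def total_consumer_lag(end_offsets, committed_offsets):
    """Total backlog across partitions: log-end offset minus committed offset.
    Both arguments map partition id -> offset; missing commits count from 0."""
    return sum(
        max(0, end_offsets[p] - committed_offsets.get(p, 0))
        for p in end_offsets
    )
```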
Distributed load generation
Ingest and query workers run as horizontally scaled Kubernetes deployments. The default SRE Outage scenario uses 10 ingest workers producing a combined 50,000 events per second and 4 query workers sustaining 40 queries per second during the overlap window.
Precise phase coordination
Phase transitions across all workers are synchronized via a two-phase barrier protocol over gRPC bidirectional streaming, achieving sub-second skew. This ensures measurement integrity across the distributed harness.
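The arrive/release pattern can be illustrated with an in-process analogue, where threads and a `Condition` stand in for workers and the phase controller. This is a sketch of the idea, not the harness's gRPC wire protocol:

```python
import threading

class TwoPhaseBarrier:
    """Sketch of a two-phase barrier: workers report arrival at a phase
    boundary (phase 1), then block until the controller releases everyone
    at once (phase 2). The real harness does this over gRPC streams."""
    def __init__(self, n_workers):
        self.n = n_workers
        self.arrived = 0
        self.released = False
        self.cv = threading.Condition()

    def arrive(self):
        # Phase 1: report arrival at the phase boundary.
        with self.cv:
            self.arrived += 1
            self.cv.notify_all()

    def await_release(self):
        # Phase 2: block until the controller releases all workers.
        with self.cv:
            self.cv.wait_for(lambda: self.released)

    def release_when_all_arrived(self):
        # Controller: wait for every worker, then release simultaneously.
        with self.cv:
            self.cv.wait_for(lambda: self.arrived == self.n)
            self.released = True
            self.cv.notify_all()
```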
Pluggable target adapters
The adapter trait handles platform setup (creating indexes, ingest pipelines, warehouses) and query execution. The ingestion path is always Kafka. Adapters are contributed per platform.
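In the harness this is a Rust trait; as a language-neutral illustration, the contract looks roughly like the abstract class below. The method names are hypothetical, not the actual trait:

```python
from abc import ABC, abstractmethod

class TargetAdapter(ABC):
    """Illustrative sketch of the adapter contract; method names are
    hypothetical. Ingestion is always Kafka, so adapters only handle
    platform setup, query execution, and cleanup."""

    @abstractmethod
    def setup(self, run_config: dict) -> None:
        """Create indexes, ingest pipelines, warehouses for the run."""

    @abstractmethod
    def execute_query(self, query: dict) -> float:
        """Run one query against the platform; return latency in seconds."""

    @abstractmethod
    def teardown(self) -> None:
        """Remove run-scoped resources after the benchmark completes."""
```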
Deterministic data generation
Each run uses a seed. The generated event stream is deterministic for a given seed and worker count, ensuring reproducibility across runs.
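The idea can be sketched in a few lines: each worker derives its own RNG stream from the run seed, so the same (seed, worker count) pair replays the same events. The derivation formula and event fields below are illustrative, not the harness's actual generator:

```python
import random

def event_batch(seed: int, worker_id: int, n: int):
    """Sketch of seeded generation: derive a per-worker RNG from the run
    seed so a given (seed, worker count) pair replays identically."""
    rng = random.Random(seed * 1_000_003 + worker_id)  # per-worker derivation
    services = [f"svc-{i}" for i in range(8)]
    return [
        {"service": rng.choice(services), "latency_ms": rng.randint(1, 500)}
        for _ in range(n)
    ]
```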
The SRE Outage Scenario
v0.1 ships with the flagship SRE Outage scenario: a deployment introduces a bug in a core microservice, error rates spike, retries cascade, and log volume surges while engineers investigate.
| Phase | Duration | Ingest EPS | Query QPS |
|---|---|---|---|
| Baseline | 120s | 5,000 (1x) | 5 (1x) |
| Incident Trigger | 60s | 15,000 (3x) | 5 (1x) |
| Ingestion Surge | 90s | 50,000 (10x) | 15 (3x) |
| Overlap | 120s | 50,000 (10x) | 40 (8x) |
| Recovery | 90s | 10,000 (2x) | 20 (4x) |
| Post-Incident | 120s | 5,000 (1x) | 5 (1x) |
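A quick sanity check on the table: total volume for an unscaled run follows directly from the phase durations and rates.

```python
# Phase table for the unscaled SRE Outage scenario:
# (duration_s, ingest_eps, query_qps)
PHASES = [
    (120,  5_000,  5),   # Baseline
    ( 60, 15_000,  5),   # Incident Trigger
    ( 90, 50_000, 15),   # Ingestion Surge
    (120, 50_000, 40),   # Overlap
    ( 90, 10_000, 20),   # Recovery
    (120,  5_000,  5),   # Post-Incident
]

total_s = sum(d for d, _, _ in PHASES)                # 600 s: a 10-minute run
total_events = sum(d * eps for d, eps, _ in PHASES)   # 13,500,000 events
total_queries = sum(d * qps for d, _, qps in PHASES)  # 9,450 queries

# With laptop-friendly scaling (rate_scale=0.01, duration_scale=0.1),
# the same run generates only 13,500 events.
scaled_events = int(total_events * 0.01 * 0.1)
```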
The scenario generates realistic structured application logs from 8 microservices with trace IDs, HTTP status codes, error codes, response times, and Kubernetes metadata. The query mix includes 8 query types representing what an SRE team actually runs during an incident investigation:
- error_search — Search for errors in the failing service
- recent_errors — Most recent errors across all services
- error_code_agg — Top error codes
- service_error_rate — Error rate breakdown by service
- trace_lookup — Look up a specific trace
- status_code_timeline — HTTP status code trend for failing service
- slow_requests — Find slow requests in the failing service
- wildcard_message — Wildcard search for specific exception
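A query worker can drive a mix like this with simple weighted sampling. The weights below are illustrative only; the shipped scenario defines its own distribution:

```python
import random

QUERY_MIX = {  # illustrative weights; not the scenario's actual distribution
    "error_search": 25, "recent_errors": 20, "error_code_agg": 10,
    "service_error_rate": 10, "trace_lookup": 10, "status_code_timeline": 10,
    "slow_requests": 10, "wildcard_message": 5,
}

def next_query_type(rng: random.Random) -> str:
    """Pick the next query type according to the mix weights."""
    names = list(QUERY_MIX)
    return rng.choices(names, weights=[QUERY_MIX[n] for n in names], k=1)[0]
```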
Duration and rate scaling are configured as fields in the benchmark CR spec, letting you run smoke tests on a laptop or scale up for production-grade hardware:
```yaml
scaling:
  rate_scale: 0.01
  duration_scale: 0.1
```

Multi-Warehouse Mode
IncidentBench supports benchmarking workload isolation across multiple query warehouses. Define analyst groups in the scenario (e.g., “heavy analysts” running complex aggregations, “SRE triage” running fast searches), map each group to a separate warehouse, and measure whether isolation holds under stress.
This is particularly relevant for platforms that offer dedicated query resources. If your platform has real workload isolation as a primitive, multi-warehouse mode reveals it. If it doesn’t, the benchmark reveals that too.
The Mach5 adapter supports multi-warehouse mode natively, creating and tearing down dedicated warehouses as part of the benchmark lifecycle.
Getting Started
Prerequisites
- A Kubernetes cluster (kind/minikube for local testing, or a production cluster)
- Kafka (external, or let IncidentBench deploy a managed instance)
- A target platform to benchmark (Mach5 adapter ships with v0.1)
Quick Start
```sh
# Clone the repository
git clone https://github.com/mach5-io/IncidentBench.git
cd IncidentBench

# Deploy a local cluster with Kafka and the operator
make deploy-local

# Start a benchmark run
kubectl apply -f config/samples/sre-outage-run.yaml

# Watch live metrics
incidentbench metrics sre-outage-run-001 --live

# Generate the report
incidentbench report sre-outage-run-001

# Cleanup
kubectl delete incidentbenchrun sre-outage-run-001
```

Note: The sample CRs in the repository use a scaled-down configuration (fewer workers, lower rates) suitable for local testing. The default scenario parameters listed above represent the full-scale production benchmark.
Built With
| Component | Role |
|---|---|
| Rust | Zero-overhead load generation with no GC-induced latency jitter |
| kube-rs | Kubernetes operator framework |
| tonic | gRPC coordination between workers and the phase controller |
| rdkafka | Kafka production and consumer-lag monitoring |
| ratatui | Live terminal UI |
| t-digest | Mergeable latency percentile estimation across distributed workers |
Roadmap
v0.1 ships with the SRE Outage scenario and the Mach5 adapter. Planned additions:
| Scenario | Domain | Description |
|---|---|---|
| SOC Attack | Security | Brute-force attack triggers alert storm. Analysts query IOCs while SIEM ingestion spikes. |
| E-Commerce Flash Sale | Commerce | Clickstream and order logging surge under real-time analytics load. |
| SaaS Tenant Storm | Multi-tenant | Single large tenant generates disproportionate load. Measures noisy-neighbor isolation. |
| Data Backfill | Operations | Historical backfill runs alongside live ingestion and queries. |
Additional platform adapters (Elasticsearch, OpenSearch) are planned as community contributions.
Open Source
IncidentBench is licensed under Apache 2.0.
IncidentBench is built by Mach5, a high-performance search and analytics platform for security, observability, and product analytics.