IncidentBench

Resilience Benchmarking for Log Analytics Platforms

IncidentBench is an open-source, Kubernetes-native benchmark that measures how log analytics platforms behave when it matters most: during a real incident, when an ingestion spike and a query spike overlap.

IncidentBench is not a throughput benchmark. It is a resilience benchmark.


The Problem

Every SRE has lived this moment. A cascading failure spikes log volume 10–50x. Simultaneously, multiple engineers are hammering the logging platform with investigative queries. The platform you’re using to investigate the outage is also drowning in the same incident you’re investigating.

Existing benchmarks don’t test for this. They measure throughput in isolation: max ingest rate, peak QPS, compression ratios, storage efficiency. Those are useful numbers, but they tell you nothing about whether your platform survives the overlap.

IncidentBench fills that gap.

What It Measures

IncidentBench evaluates four properties that determine whether a platform holds up under real operational stress.

Stability

Do query latencies stay predictable when ingestion surges? Or does everything slow to a crawl?

Isolation

Does a query storm degrade ingestion throughput? Does an ingestion surge degrade query performance? Or does the platform keep them independent?

Predictability

Are tail latencies bounded under stress? Or do p99s blow up unpredictably during the overlap window?

Recovery

How quickly does the system return to baseline after the incident ends? Does it snap back, or does it limp along with lingering degradation?

How It Works

IncidentBench simulates a realistic incident lifecycle in six phases, modeling how a real outage unfolds from first spike to full recovery.

1. Baseline

Normal operating conditions. Moderate ingestion, light query load. Establishes the performance reference that everything else is measured against.

2. Incident Trigger

A bug is deployed. Error rates spike. Log volume triples. No human response yet. Alerting hasn’t fired.

3. Ingestion Surge

Retries cascade. Log volume hits 10x baseline. The alert fires. On-call engineers begin querying.

4. Overlap

Full incident mode. Ingestion at peak. Multiple engineers querying simultaneously. This is the core measurement window. Behavior here is the primary differentiator between platforms.

5. Recovery

Fix deployed. Error rates declining. Engineers still querying to verify the fix.

6. Post-Incident

Rates return to baseline. Measures whether the system has fully recovered or has lingering degradation.

The Scorecard

Every run produces a scorecard with the metrics that matter.

| Metric | What It Tells You |
| --- | --- |
| Baseline p99 latency | How fast queries are under normal conditions |
| Overlap p99 latency | How fast queries are during peak incident stress |
| p99 degradation ratio | How much worse queries get during the overlap (overlap p99 / baseline p99) |
| Query error rate | Did queries fail during peak stress? |
| Peak ingestion backlog | Maximum Kafka consumer lag. Can the platform keep up with the data? |
| Backlog drain time | How long to catch up after the surge ends |
| Recovery time | How long until query latency returns to within 1.2x of baseline |
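The two derived metrics are simple functions of the measured p99s; a minimal sketch in Rust (the function names and latency values here are illustrative, not part of the harness):

```rust
// Derived scorecard metrics computed from measured p99 latencies.
// All values below are illustrative, not from a real run.
fn degradation_ratio(baseline_p99_ms: f64, overlap_p99_ms: f64) -> f64 {
    overlap_p99_ms / baseline_p99_ms
}

// Recovery is declared once post-incident p99 falls within 1.2x of baseline.
fn has_recovered(baseline_p99_ms: f64, current_p99_ms: f64) -> bool {
    current_p99_ms <= baseline_p99_ms * 1.2
}

fn main() {
    let baseline = 80.0; // ms, hypothetical
    let overlap = 640.0; // ms, hypothetical
    println!("degradation ratio: {:.1}x", degradation_ratio(baseline, overlap));
    println!("recovered at 90ms: {}", has_recovered(baseline, 90.0));
}
```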

The scorecard is designed to be immediately legible to anyone evaluating a platform — whether you’re an SRE, a platform engineer, or a buyer making a procurement decision.

Architecture

IncidentBench runs as a Kubernetes operator. Benchmark scenarios are defined as Custom Resources. The harness horizontally scales ingest and query workers across pods, routes ingestion through Kafka, and streams real-time metrics during the run.

Key Design Decisions

Kafka as the ingestion path

Ingest workers produce events to Kafka topics. The target platform consumes from Kafka via its own ingest pipeline. This models real production architectures where log data flows through a message bus. It also provides a natural, always-available backlog metric: Kafka consumer lag.
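Under this design, the backlog metric is just the per-partition gap between the broker's high watermark and the platform's committed offset, summed across partitions. A minimal sketch (offsets are illustrative; in practice the harness would read them from the broker via rdkafka rather than hard-coding them):

```rust
// Backlog for one topic: sum over partitions of
// (high watermark - committed offset), clamped at zero.
fn total_lag(partitions: &[(i64, i64)]) -> i64 {
    partitions
        .iter()
        .map(|&(high_watermark, committed)| (high_watermark - committed).max(0))
        .sum()
}

fn main() {
    // (high watermark, committed offset) per partition, hypothetical values
    let parts = [(1_000_000, 850_000), (1_000_000, 990_000), (999_000, 999_000)];
    println!("backlog: {} events", total_lag(&parts));
}
```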

Distributed load generation

Ingest and query workers run as horizontally-scaled Kubernetes deployments. The default SRE Outage scenario uses 10 ingest workers producing 50,000 events per second and 4 query workers sustaining 40 queries per second during the overlap window.

Precise phase coordination

Phase transitions across all workers are synchronized via a two-phase barrier protocol over gRPC bidirectional streaming, achieving sub-second skew. This ensures measurement integrity across the distributed harness.
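The pattern can be illustrated in-process: every worker first arrives at a "ready" barrier, and only once all have arrived are they released together into the next phase. A sketch with threads standing in for gRPC streams (names are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

// Two-phase barrier: workers first signal "ready", then are released
// together into the next phase. Threads stand in for gRPC streams here.
fn run_phase_transition(workers: usize) -> usize {
    let ready = Arc::new(Barrier::new(workers));
    let go = Arc::new(Barrier::new(workers));
    let entered = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let (ready, go, entered) =
                (Arc::clone(&ready), Arc::clone(&go), Arc::clone(&entered));
            thread::spawn(move || {
                ready.wait(); // phase 1: report "ready for next phase"
                go.wait();    // phase 2: released only once all are ready
                entered.fetch_add(1, Ordering::SeqCst);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    entered.load(Ordering::SeqCst)
}

fn main() {
    println!("{} workers entered the next phase", run_phase_transition(4));
}
```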

Pluggable target adapters

The adapter trait handles platform setup (creating indexes, ingest pipelines, warehouses) and query execution. The ingestion path is always Kafka. Adapters are contributed per platform.
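As a sketch, an adapter might expose a shape like the following; the trait name and method set are assumptions for illustration, not the actual IncidentBench API:

```rust
// Illustrative adapter shape; the real trait in IncidentBench may differ.
trait TargetAdapter {
    /// One-time platform setup: indexes, ingest pipelines, warehouses.
    fn setup(&self) -> Result<(), String>;
    /// Execute one benchmark query; returns latency in milliseconds.
    fn run_query(&self, query: &str) -> Result<f64, String>;
    /// Tear down anything created in setup.
    fn teardown(&self) -> Result<(), String>;
}

// A no-op adapter, useful for dry runs of the harness itself.
struct NoopAdapter;

impl TargetAdapter for NoopAdapter {
    fn setup(&self) -> Result<(), String> { Ok(()) }
    fn run_query(&self, _query: &str) -> Result<f64, String> { Ok(0.0) }
    fn teardown(&self) -> Result<(), String> { Ok(()) }
}

fn main() {
    let adapter = NoopAdapter;
    adapter.setup().unwrap();
    let latency = adapter.run_query("error_search").unwrap();
    println!("latency: {latency} ms");
    adapter.teardown().unwrap();
}
```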

Deterministic data generation

Each run uses a seed. The generated event stream is deterministic for a given seed and worker count, ensuring reproducibility across runs.
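A common way to implement this kind of determinism is to derive each worker's stream from the run seed plus the worker index, so the same (seed, worker count) always yields the same events. The sketch below uses splitmix64, a well-known PRNG, purely as an illustration (not necessarily what IncidentBench uses internally):

```rust
// splitmix64: a tiny, well-known PRNG step function.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Each worker derives its own stream from (run seed, worker index).
fn worker_stream(run_seed: u64, worker: u64, n: usize) -> Vec<u64> {
    let mut state = run_seed ^ worker.wrapping_mul(0xA24B_AED4_963E_E407);
    (0..n).map(|_| splitmix64(&mut state)).collect()
}

fn main() {
    // Same seed and worker index => identical event stream.
    assert_eq!(worker_stream(42, 3, 5), worker_stream(42, 3, 5));
    // Different worker index => a different stream.
    assert_ne!(worker_stream(42, 3, 5), worker_stream(42, 4, 5));
    println!("first values for worker 3: {:?}", worker_stream(42, 3, 3));
}
```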

The SRE Outage Scenario

v0.1 ships with the flagship SRE Outage scenario: a deployment introduces a bug in a core microservice, error rates spike, retries cascade, and log volume surges while engineers investigate.

| Phase | Duration | Ingest EPS | Query QPS |
| --- | --- | --- | --- |
| Baseline | 120s | 5,000 (1x) | 5 (1x) |
| Incident Trigger | 60s | 15,000 (3x) | 5 (1x) |
| Ingestion Surge | 90s | 50,000 (10x) | 15 (3x) |
| Overlap | 120s | 50,000 (10x) | 40 (8x) |
| Recovery | 90s | 10,000 (2x) | 20 (4x) |
| Post-Incident | 120s | 5,000 (1x) | 5 (1x) |
Total duration: 10 minutes
Total events: ~13.5 million
Peak concurrent load: 50,000 EPS + 40 QPS
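These totals follow directly from the phase table; summing duration × rate per phase reproduces them:

```rust
// Per-phase (duration seconds, ingest events/sec) taken from the table above.
fn scenario_totals() -> (u64, u64) {
    let phases = [
        (120, 5_000),  // Baseline
        (60, 15_000),  // Incident Trigger
        (90, 50_000),  // Ingestion Surge
        (120, 50_000), // Overlap
        (90, 10_000),  // Recovery
        (120, 5_000),  // Post-Incident
    ];
    let seconds: u64 = phases.iter().map(|&(d, _)| d).sum();
    let events: u64 = phases.iter().map(|&(d, eps)| d * eps).sum();
    (seconds, events)
}

fn main() {
    let (seconds, events) = scenario_totals();
    println!("duration: {} min, events: {:.1}M", seconds / 60, events as f64 / 1e6);
    // → duration: 10 min, events: 13.5M
}
```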

The scenario generates realistic structured application logs from 8 microservices with trace IDs, HTTP status codes, error codes, response times, and Kubernetes metadata. The query mix includes 8 query types representing what an SRE team actually runs during an incident investigation:

  • error_search — Search for errors in the failing service
  • recent_errors — Most recent errors across all services
  • error_code_agg — Top error codes
  • service_error_rate — Error rate breakdown by service
  • trace_lookup — Look up a specific trace
  • status_code_timeline — HTTP status code trend for failing service
  • slow_requests — Find slow requests in the failing service
  • wildcard_message — Wildcard search for a specific exception

Duration and rate scaling are configured as fields in the benchmark CR spec, letting you run smoke tests on a laptop or scale up for production-grade hardware:

scaling:
  rate_scale: 0.01
  duration_scale: 0.1
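In context, the `scaling` block sits inside the run CR's spec. A hypothetical full resource might look like the sketch below; the `apiVersion` and every field other than `scaling` are assumptions based on the operator pattern, not the actual schema:

```yaml
# Hypothetical IncidentBenchRun resource; only the scaling fields are
# taken from the example above, everything else is illustrative.
apiVersion: incidentbench.io/v1alpha1
kind: IncidentBenchRun
metadata:
  name: sre-outage-smoke
spec:
  scenario: sre-outage
  scaling:
    rate_scale: 0.01    # 1% of full-scale event/query rates
    duration_scale: 0.1 # 10% of full-scale phase durations
```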

Multi-Warehouse Mode

IncidentBench supports benchmarking workload isolation across multiple query warehouses. Define analyst groups in the scenario (e.g., “heavy analysts” running complex aggregations, “SRE triage” running fast searches), map each group to a separate warehouse, and measure whether isolation holds under stress.

This is particularly relevant for platforms that offer dedicated query resources. If your platform has real workload isolation as a primitive, multi-warehouse mode reveals it. If it doesn’t, the benchmark reveals that too.

The Mach5 adapter supports multi-warehouse mode natively, creating and tearing down dedicated warehouses as part of the benchmark lifecycle.
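A scenario fragment for this mode might look like the sketch below; the field names are illustrative assumptions, not the actual schema:

```yaml
# Hypothetical multi-warehouse configuration; field names are assumptions.
query_groups:
  - name: sre-triage
    warehouse: triage-wh      # fast searches on dedicated resources
    qps: 30
    queries: [error_search, recent_errors, trace_lookup]
  - name: heavy-analysts
    warehouse: analytics-wh   # complex aggregations on separate resources
    qps: 10
    queries: [service_error_rate, error_code_agg]
```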

Getting Started

Prerequisites

  • A Kubernetes cluster (kind/minikube for local testing, or a production cluster)
  • Kafka (external, or let IncidentBench deploy a managed instance)
  • A target platform to benchmark (Mach5 adapter ships with v0.1)

Quick Start

# Clone the repository
git clone https://github.com/mach5-io/IncidentBench.git
cd IncidentBench

# Deploy a local cluster with Kafka and the operator
make deploy-local

# Start a benchmark run
kubectl apply -f config/samples/sre-outage-run.yaml

# Watch live metrics
incidentbench metrics sre-outage-run-001 --live

# Generate the report
incidentbench report sre-outage-run-001

# Cleanup
kubectl delete incidentbenchrun sre-outage-run-001

Note: The sample CRs in the repository use a scaled-down configuration (fewer workers, lower rates) suitable for local testing. The default scenario parameters listed above represent the full-scale production benchmark.

Built With

Rust

Low-overhead load generation with no GC-induced latency jitter

kube-rs

Kubernetes operator framework

tonic

gRPC coordination between workers and the phase controller

rdkafka

Kafka production and consumer lag monitoring

ratatui

Live terminal UI

t-digest

Mergeable percentile estimation for latency measurements across distributed workers

Roadmap

v0.1 ships with the SRE Outage scenario and the Mach5 adapter. Planned additions:

| Scenario | Domain | Description |
| --- | --- | --- |
| SOC Attack | Security | Brute-force attack triggers an alert storm. Analysts query IOCs while SIEM ingestion spikes. |
| E-Commerce Flash Sale | Commerce | Clickstream and order logging surge under real-time analytics load. |
| SaaS Tenant Storm | Multi-tenant | A single large tenant generates disproportionate load. Measures noisy-neighbor isolation. |
| Data Backfill | Operations | Historical backfill runs alongside live ingestion and queries. |

Additional platform adapters (Elasticsearch, OpenSearch) are planned as community contributions.

Open Source

IncidentBench is licensed under Apache 2.0.

IncidentBench is built by Mach5, a high-performance search and analytics platform for security, observability, and product analytics.