Root Cause Analysis with CtrlB

Introduction

When issues occur in distributed systems, quick and accurate root cause analysis (RCA) is crucial. CtrlB's RCA framework provides a systematic approach to debugging problems using distributed tracing data.

The CtrlB RCA Framework

This 10-step framework helps you methodically identify and resolve issues in your distributed systems.

Step 1: Establish Baseline

Create a performance baseline for comparison:

Navigate to the Services table
Set time range to 30 minutes before the incident
Record the P50, P90, and P99 latency values and the error rate for every service

This is our baseline.
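
If you also export raw span durations, the same baseline figures can be reproduced offline. The following is a minimal sketch, assuming you have a list of span durations in milliseconds and an error count for one service; the function name `latency_summary` and the sample numbers are hypothetical and not part of CtrlB.

```python
from statistics import quantiles

def latency_summary(durations_ms, error_count):
    """Summarize one service's spans: P50/P90/P99 latency and error rate."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "error_rate": error_count / len(durations_ms),
    }

# Hypothetical span durations (ms) for one service in the baseline window.
baseline = latency_summary(
    [12, 15, 14, 18, 20, 16, 13, 95, 17, 19], error_count=1
)
print(baseline)
```

A single slow span (95 ms here) barely moves P50 but dominates P99, which is why all three percentiles are worth recording.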

Step 2: Identify Problem Window

Compare incident performance against baseline:

Switch the time range to the incident window
View the same Services table for the incident window and compare it against the baseline
Look for services whose P90/P99 rose significantly
Identify the service with the worst spike

This service becomes our primary suspect.
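
The comparison can be sketched in code as a ratio of incident latency to baseline latency per service. This is an illustrative example only: the dictionaries stand in for values copied from the Services table, and the service names and numbers are hypothetical.

```python
def rank_suspects(baseline, incident, metric="p99"):
    """Return (service, ratio) pairs sorted by how much `metric` grew."""
    ratios = []
    for service, stats in incident.items():
        base = baseline.get(service, {}).get(metric)
        if not base:  # skip services with no usable baseline value
            continue
        ratios.append((service, stats[metric] / base))
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)

# Hypothetical P99 values (ms) from the baseline and incident windows.
baseline = {"checkout": {"p99": 200.0}, "cart": {"p99": 120.0}, "auth": {"p99": 80.0}}
incident = {"checkout": {"p99": 2400.0}, "cart": {"p99": 180.0}, "auth": {"p99": 84.0}}

suspects = rank_suspects(baseline, incident)
print(suspects[0])  # the service with the largest relative increase
```

Ranking by ratio rather than absolute difference keeps a normally slow service from masking a fast service that degraded badly.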

Step 3: Narrow Down Timing

Pinpoint exact problem start time:

Open the suspected service's P50/P90/P99 latency graph
Zoom into the incident window and identify the exact problem start time
Look for patterns such as regular spikes (which might indicate timeouts) or service degradation
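
One way to make "problem start time" precise is to look for the first point where latency stays above a threshold for several consecutive samples, so that an isolated spike is not mistaken for the onset. The sketch below assumes a per-minute P99 series read off the latency graph; the threshold factor, the `sustain` count, and the sample data are all hypothetical choices.

```python
def find_onset(series, baseline_p99, factor=2.0, sustain=3):
    """Return the timestamp that starts the first run of `sustain`
    consecutive points above `factor * baseline_p99`, or None."""
    threshold = factor * baseline_p99
    run_start, run_len = None, 0
    for ts, value in series:
        if value > threshold:
            if run_len == 0:
                run_start = ts
            run_len += 1
            if run_len >= sustain:
                return run_start
        else:
            run_len = 0  # an isolated spike does not count as the onset
    return None

# Hypothetical per-minute P99 values (ms) for the suspected service.
series = [("14:00", 210), ("14:01", 650), ("14:02", 220),
          ("14:03", 900), ("14:04", 1500), ("14:05", 1800)]
onset = find_onset(series, baseline_p99=200)
print(onset)
```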

Step 4: Find Slow Operations

Identify which specific operations became slow:

Open the Operations table for the problem service during the incident window
Sort by P99 latency in descending order
Compare against the baseline Operations table to see which specific operations became slow

Step 5: Search for Errors

Find error patterns that coincide with performance issues:

Search for error spans in a 2-3 minute window around the problem start time
Look for the most frequent error types
Note any patterns in timing
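
If the error spans are exported, the frequency count can be reproduced with a few lines of code. This is a rough sketch: the span fields (`start`, `is_error`, `error_type`) are an assumed shape rather than CtrlB's export format, and the window is taken here as 90 seconds on either side of the problem start, which is one reading of "2-3 minutes around".

```python
from collections import Counter
from datetime import datetime, timedelta

def top_error_types(spans, problem_start, half_width=timedelta(seconds=90)):
    """Count error types among error spans near the problem start time."""
    lo, hi = problem_start - half_width, problem_start + half_width
    counts = Counter(
        span["error_type"]
        for span in spans
        if span["is_error"] and lo <= span["start"] <= hi
    )
    return counts.most_common()  # most frequent first

# Hypothetical spans; field names and timestamps are assumptions.
start = datetime(2024, 1, 1, 14, 3)
spans = [
    {"start": datetime(2024, 1, 1, 14, 2, 30), "is_error": True, "error_type": "ConnectionTimeout"},
    {"start": datetime(2024, 1, 1, 14, 3, 10), "is_error": True, "error_type": "ConnectionTimeout"},
    {"start": datetime(2024, 1, 1, 14, 3, 40), "is_error": True, "error_type": "NullReference"},
    {"start": datetime(2024, 1, 1, 14, 3, 50), "is_error": False, "error_type": None},
    {"start": datetime(2024, 1, 1, 14, 20, 0), "is_error": True, "error_type": "NullReference"},
]
ranked = top_error_types(spans, start)
print(ranked)
```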

Step 6: Analyze Error Traces

Get detailed error information from trace waterfalls:

Select 2-3 representative error spans
Click into their trace waterfall views
Inspect the span JSON for detailed error messages and stack traces
Identify exactly what failed
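
When several spans need the same inspection, the error details can be pulled out of the span JSON programmatically. The sketch below assumes the span JSON follows the OpenTelemetry convention of recording exceptions as span events named `exception` with `exception.type`, `exception.message`, and `exception.stacktrace` attributes; the JSON that CtrlB actually shows may be shaped differently, so treat the field paths as assumptions.

```python
import json

def extract_exceptions(span_json):
    """Collect exception details from a span's events (assumed OTel-style)."""
    span = json.loads(span_json)
    found = []
    for event in span.get("events", []):
        if event.get("name") != "exception":
            continue
        attrs = event.get("attributes", {})
        found.append({
            "type": attrs.get("exception.type"),
            "message": attrs.get("exception.message"),
            "stacktrace": attrs.get("exception.stacktrace"),
        })
    return found

# Hypothetical span; the stack trace is a placeholder, not real output.
sample = json.dumps({
    "name": "SELECT orders",
    "events": [{
        "name": "exception",
        "attributes": {
            "exception.type": "ConnectionTimeout",
            "exception.message": "pool exhausted after 30s",
            "exception.stacktrace": "...",
        },
    }],
})
errors = extract_exceptions(sample)
print(errors[0]["type"], errors[0]["message"])
```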

Step 7: Form Hypothesis

Categorize the root cause type.

Categories:

Database Problems: Connection issues, query failures, missing data
Resource Exhaustion: Timeouts, memory, CPU constraints
Application Logic Errors: Poor error handling, retry loops
External Dependencies: Third-party service failures
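
As a rough aid, error messages gathered from the traces can be matched against keywords for each category to suggest a starting hypothesis. The keyword lists below are illustrative assumptions, not a CtrlB feature, and a match is only a hint: a message can fit more than one category, and the final call remains a human judgment.

```python
# Hypothetical keyword hints per category; tune these for your own systems.
CATEGORY_KEYWORDS = {
    "Database Problems": ("connection refused", "deadlock", "sql", "no rows"),
    "Resource Exhaustion": ("timeout", "timed out", "out of memory"),
    "Application Logic Errors": ("retry", "nullpointer", "unhandled"),
    "External Dependencies": ("upstream", "503", "third-party"),
}

def suggest_categories(message):
    """Return every category whose keywords appear in the error message."""
    text = message.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

print(suggest_categories("SQL deadlock detected"))
print(suggest_categories("upstream request timed out"))
```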

Step 8: Validate Pattern

Confirm hypothesis with additional evidence:

Check whether similar errors occurred in other time windows
Look for infrastructure metrics that support the hypothesis
Verify timing matches between different symptoms
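
The timing check can be made explicit by collecting the onset time of each symptom and confirming that they fall close together. This is a minimal sketch; the symptom names, timestamps, and the two-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta

def timings_align(onsets, tolerance=timedelta(minutes=2)):
    """True if all symptom onset times fall within `tolerance` of each other."""
    times = list(onsets.values())
    return max(times) - min(times) <= tolerance

# Hypothetical onset times gathered from latency graphs, error search,
# and infrastructure metrics.
onsets = {
    "p99_latency_spike": datetime(2024, 1, 1, 14, 3),
    "first_error_span": datetime(2024, 1, 1, 14, 2, 30),
    "db_connection_saturation": datetime(2024, 1, 1, 14, 2),
}
aligned = timings_align(onsets)
print(aligned)
```

If the check fails, the symptom that started much earlier or later than the rest may be unrelated to the incident, or may point to a different root cause.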

Step 9: Document Root Cause

Documentation should include:

Specific error type and message
Affected operations and services
Exact timing of incident
Supporting evidence, such as relevant span IDs and trace IDs
Impact scope and duration
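
Capturing these items in a fixed structure makes incident records easier to compare later. The following is one possible shape, sketched as a dataclass; the class name, field names, and example values are hypothetical and can be adapted to whatever incident template your team already uses.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RootCauseRecord:
    error_type: str
    error_message: str
    affected_services: list
    affected_operations: list
    incident_start: datetime
    incident_end: datetime
    trace_ids: list = field(default_factory=list)  # supporting evidence
    span_ids: list = field(default_factory=list)   # supporting evidence
    impact_scope: str = ""

    @property
    def duration(self):
        """Incident duration derived from the recorded start and end."""
        return self.incident_end - self.incident_start

# Hypothetical incident; every value here is made up for illustration.
record = RootCauseRecord(
    error_type="ConnectionTimeout",
    error_message="pool exhausted after 30s",
    affected_services=["checkout"],
    affected_operations=["POST /orders"],
    incident_start=datetime(2024, 1, 1, 14, 3),
    incident_end=datetime(2024, 1, 1, 14, 30),
    trace_ids=["trace-id-placeholder"],
    impact_scope="checkout requests failing or slow",
)
print(record.duration)
```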

Step 10: Plan Fix

Based on the RCA, the following steps can be taken:

Identify immediate mitigations, such as restarting services or adjusting timeouts
Look for long-term fixes, such as code changes or monitoring improvements
