Root Cause Analysis with CtrlB
Introduction
When issues occur in distributed systems, quick and accurate root cause analysis (RCA) is crucial. CtrlB's RCA framework provides a systematic approach to debugging problems using distributed tracing data.
The CtrlB RCA Framework
This 10-step framework helps you methodically identify and resolve issues in your distributed systems.
Step 1: Establish Baseline
Create a performance baseline for comparison:
- Navigate to the Services table
- Set the time range to the 30 minutes before the incident
- Record P50, P90, P99 values and error rates for all services
This is our baseline.
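If the span data is available outside the UI, the same baseline can be computed by hand. The following is a minimal sketch, assuming spans have been exported as dictionaries with hypothetical `service`, `duration_ms`, and `error` fields (these field names are illustrative, not CtrlB's export format). It uses nearest-rank percentiles, which may differ slightly from the method the Services table uses.

```python
from collections import defaultdict

def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty, ascending-sorted list."""
    # Ceiling division keeps the rank arithmetic in integers.
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]

def summarize(spans):
    """Group spans by service and compute P50/P90/P99 latency and error rate."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span["service"]].append(span)

    summary = {}
    for service, items in by_service.items():
        durations = sorted(s["duration_ms"] for s in items)
        errors = sum(1 for s in items if s["error"])
        summary[service] = {
            "p50": percentile(durations, 50),
            "p90": percentile(durations, 90),
            "p99": percentile(durations, 99),
            "error_rate": errors / len(items),
        }
    return summary

# Hypothetical baseline spans: durations 10, 20, ..., 100 ms, no errors.
baseline_spans = [
    {"service": "checkout", "duration_ms": d, "error": False}
    for d in range(10, 110, 10)
]
baseline = summarize(baseline_spans)
print(baseline["checkout"])
```

Keeping the result as a plain dictionary keyed by service makes it easy to compare against the same summary computed for the incident window.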
Step 2: Identify Problem Window
Compare incident performance against baseline:
- Switch the time range to the incident window
- Compare the same Services table values against the baseline
- Look for services whose P90/P99 latency rose sharply relative to the baseline
- Identify the service with the worst spike
This service becomes our primary suspect.
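The comparison can be sketched as a ratio of incident P99 to baseline P99 per service. This is an illustrative sketch only: the input dictionaries, the `rank_suspects` helper, and the `min_ratio` cutoff of 2.0 are assumptions, not values defined by CtrlB.

```python
def rank_suspects(baseline, incident, min_ratio=2.0):
    """Return (service, ratio) pairs whose P99 grew by at least min_ratio, worst first."""
    suspects = []
    for service, stats in incident.items():
        base = baseline.get(service)
        if base is None or base["p99"] <= 0:
            continue  # no usable baseline for this service
        ratio = stats["p99"] / base["p99"]
        if ratio >= min_ratio:
            suspects.append((service, ratio))
    return sorted(suspects, key=lambda item: item[1], reverse=True)

# Hypothetical P99 values in milliseconds.
baseline = {"checkout": {"p99": 200}, "cart": {"p99": 100}, "auth": {"p99": 50}}
incident = {"checkout": {"p99": 1800}, "cart": {"p99": 300}, "auth": {"p99": 55}}

suspects = rank_suspects(baseline, incident)
for service, ratio in suspects:
    print(f"{service}: P99 is {ratio:.1f}x baseline")
```

A ratio is used instead of an absolute difference so that a normally fast service with a large relative jump is not hidden behind a normally slow one.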
Step 3: Narrow Down Timing
Pinpoint the exact problem start time:
- Go to the suspected service's P99/P90/P50 latency graph
- Zoom into the incident window and identify the exact time the problem started
- Look for patterns such as regular spikes (which might indicate timeouts) or service degradation
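One way to make "problem start" precise is to require that latency stays above a threshold for several consecutive data points, so that a single isolated spike is not mistaken for the start of the incident. The sketch below assumes a hypothetical list of `(timestamp_seconds, p99_ms)` points; the threshold and the `sustained` count are illustrative choices.

```python
def find_problem_start(series, threshold_ms, sustained=3):
    """Return the first timestamp that begins `sustained` consecutive points above threshold."""
    run_start = None
    run_length = 0
    for timestamp, latency in series:
        if latency > threshold_ms:
            if run_length == 0:
                run_start = timestamp  # a new run of slow points begins here
            run_length += 1
            if run_length >= sustained:
                return run_start
        else:
            run_length = 0  # the run was interrupted; start over
    return None

# Hypothetical P99 series sampled once per minute.
series = [
    (0, 120), (60, 130), (120, 900), (180, 125),
    (240, 950), (300, 980), (360, 1010),
]
start = find_problem_start(series, threshold_ms=500)
print(start)
```

In this sample the spike at 120 seconds is ignored because latency recovers at the next point, and the sustained run beginning at 240 seconds is reported instead.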
Step 4: Find Slow Operations
Identify which specific operations became slow:
- Open the Operations table for the problem service during the incident window
- Sort by P99 latency in descending order
- Compare against the baseline Operations table to see which specific operations became slow
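The baseline comparison matters because the slowest operation during the incident is not necessarily the one that regressed. A small sketch, assuming hypothetical per-operation P99 values in milliseconds (the operation names are placeholders):

```python
def compare_operations(baseline_ops, incident_ops):
    """List operations sorted by incident P99 (descending) with their change from baseline."""
    rows = []
    for operation, p99 in incident_ops.items():
        base = baseline_ops.get(operation)  # None if there was no baseline traffic
        delta = None if base is None else p99 - base
        rows.append({
            "operation": operation,
            "p99_ms": p99,
            "baseline_p99_ms": base,
            "delta_ms": delta,
        })
    return sorted(rows, key=lambda row: row["p99_ms"], reverse=True)

baseline_ops = {"GET /report": 2000, "POST /order": 150, "GET /cart": 80}
incident_ops = {"GET /report": 2100, "POST /order": 1900, "GET /cart": 85}

rows = compare_operations(baseline_ops, incident_ops)
for row in rows:
    print(row)
```

Here `GET /report` tops the sorted list but was already slow in the baseline, while `POST /order` shows the largest increase and is the more likely regression.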
Step 5: Search for Errors
Find error patterns that coincide with performance issues:
- Search for error spans in a 2-3 minute window around the problem start
- Look for the most frequent error types
- Note any patterns in timing
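The search above can be approximated on exported spans by filtering to a window around the problem start and counting error types. The span fields (`timestamp`, `error`, `error_type`) and the 90-second half-width (a 3-minute window in total) are assumptions made for this sketch.

```python
from collections import Counter

def error_summary(spans, problem_start, half_window_s=90):
    """Count error types among error spans within half_window_s of problem_start."""
    lo = problem_start - half_window_s
    hi = problem_start + half_window_s
    in_window = [
        s for s in spans
        if s["error"] and lo <= s["timestamp"] <= hi
    ]
    # most_common() returns (error_type, count) pairs, most frequent first.
    return Counter(s["error_type"] for s in in_window).most_common()

# Hypothetical spans; timestamps are seconds from the start of the graph.
spans = [
    {"timestamp": 200, "error": True, "error_type": "ConnectionTimeout"},
    {"timestamp": 250, "error": True, "error_type": "ConnectionTimeout"},
    {"timestamp": 260, "error": True, "error_type": "NullPointer"},
    {"timestamp": 270, "error": False, "error_type": None},
    {"timestamp": 900, "error": True, "error_type": "ConnectionTimeout"},
]
summary = error_summary(spans, problem_start=240)
print(summary)
```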
Step 6: Analyze Error Traces
Get detailed error information from trace waterfalls:
- Select 2-3 representative error spans
- Click into their trace waterfall views
- Look at the span JSON for detailed error messages and stack traces
- Identify what exactly failed
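When reviewing several spans, it can help to pull the error details out of the span JSON programmatically. The exact JSON shape depends on the instrumentation; the sketch below assumes OpenTelemetry-style exception events (an event named `exception` with `exception.type` and `exception.message` attributes), and the sample values are placeholders.

```python
import json

# Assumed span shape; adjust the keys to match the JSON shown in the waterfall view.
SPAN_JSON = """
{
  "span_id": "example-span-id",
  "name": "POST /order",
  "status": "ERROR",
  "events": [
    {
      "name": "exception",
      "attributes": {
        "exception.type": "ConnectionTimeout",
        "exception.message": "pool exhausted after 30s"
      }
    }
  ]
}
"""

def extract_exceptions(span):
    """Return the type and message of every exception event recorded on a span."""
    found = []
    for event in span.get("events", []):
        if event.get("name") == "exception":
            attrs = event.get("attributes", {})
            found.append({
                "type": attrs.get("exception.type"),
                "message": attrs.get("exception.message"),
            })
    return found

details = extract_exceptions(json.loads(SPAN_JSON))
print(details)
```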
Step 7: Form Hypothesis
Categorize the root cause type.
Categories:
- Database Problems: Connection issues, query failures, missing data
- Resource Exhaustion: Timeouts, memory, CPU constraints
- Application Logic Errors: Poor error handling, retry loops
- External Dependencies: Third-party service failures
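As a first pass, error messages can be mapped to these categories by keyword. This is a rough sketch: the keyword lists below are illustrative guesses, not an authoritative taxonomy, and a message that matches several categories should be treated as several candidate hypotheses to test.

```python
# Illustrative keywords only; tune them to the error messages seen in your traces.
CATEGORY_KEYWORDS = {
    "Database Problems": ("connection refused", "sql", "deadlock", "no rows"),
    "Resource Exhaustion": ("timeout", "timed out", "out of memory"),
    "Application Logic Errors": ("retry", "nullpointer", "unhandled"),
    "External Dependencies": ("upstream", "502", "503", "third-party"),
}

def categorize(message):
    """Return every category whose keywords appear in the error message."""
    text = message.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

print(categorize("upstream payment provider returned 503"))
print(categorize("Query timed out waiting for SQL connection pool"))
```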
Step 8: Validate Pattern
Confirm hypothesis with additional evidence:
- Check whether similar errors occurred in other time windows
- Look for infrastructure metrics that support the hypothesis
- Verify timing matches between different symptoms
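The timing check can be made explicit by comparing when each symptom first appeared. A minimal sketch, assuming hypothetical symptom names and start times in seconds, with an arbitrary tolerance of two minutes:

```python
def timing_matches(symptom_starts, tolerance_s=120):
    """Return True if all symptom start times fall within tolerance_s of each other."""
    times = list(symptom_starts.values())
    if not times:
        return False  # nothing to compare
    return max(times) - min(times) <= tolerance_s

# Hypothetical start times (seconds) gathered from the earlier steps.
symptoms = {
    "p99_latency_spike": 240,
    "first_error_span": 250,
    "db_cpu_saturation": 200,
}
print(timing_matches(symptoms))
```

If one symptom starts well before or after the others, it may be a separate issue or a downstream effect, and the hypothesis should be revisited.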
Step 9: Document Root Cause
The documentation should include:
- Specific error type and message
- Affected operations and services
- Exact timing of the incident
- Supporting evidence, such as relevant span IDs and trace IDs
- Impact scope and duration
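To keep RCA write-ups consistent across incidents, the items above can be captured in a fixed structure. The following sketch is one possible layout; the `RootCauseRecord` class and every value in the example are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class RootCauseRecord:
    """One record per incident, mirroring the documentation checklist."""
    error_type: str
    error_message: str
    affected_services: List[str]
    affected_operations: List[str]
    incident_start: str  # e.g. an ISO 8601 timestamp
    incident_end: str
    trace_ids: List[str] = field(default_factory=list)
    span_ids: List[str] = field(default_factory=list)
    impact: str = ""

# Placeholder values for illustration only.
record = RootCauseRecord(
    error_type="ConnectionTimeout",
    error_message="pool exhausted after 30s",
    affected_services=["checkout"],
    affected_operations=["POST /order"],
    incident_start="<incident start time>",
    incident_end="<incident end time>",
    trace_ids=["<trace id>"],
    span_ids=["<span id>"],
    impact="<scope and duration>",
)
print(asdict(record))
```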
Step 10: Plan Fix
Based on the RCA, the following steps can be taken:
- Identify immediate mitigations, such as restarting services or adjusting timeouts
- Look for long-term fixes, such as code changes or monitoring improvements