Root Cause Analysis with CtrlB

Introduction

When issues occur in distributed systems, quick and accurate root cause analysis (RCA) is crucial. CtrlB's RCA framework provides a systematic approach to debugging problems using distributed tracing data.

The CtrlB RCA Framework

This 10-step framework helps you methodically identify and resolve issues in your distributed systems.

Step 1: Establish Baseline

Create a performance baseline for comparison:

  • Navigate to the Services table
  • Set the time range to the 30 minutes before the incident began
  • Record P50, P90, P99 values and error rates for all services

This is our baseline.
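If you record the baseline from exported span data rather than reading it off the Services table, the same numbers can be computed by hand. The following is a minimal sketch, assuming each span is available as a `(duration_ms, is_error)` tuple; that shape, the function names, and the sample data are illustrative assumptions, not part of CtrlB. It uses the nearest-rank percentile definition, which may differ slightly from the interpolation CtrlB applies.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(spans):
    """spans: list of (duration_ms, is_error) tuples for one service (hypothetical shape)."""
    durations = [duration for duration, _ in spans]
    errors = sum(1 for _, is_error in spans if is_error)
    return {
        "p50": percentile(durations, 50),
        "p90": percentile(durations, 90),
        "p99": percentile(durations, 99),
        "error_rate": errors / len(spans),
    }

# Synthetic sample: durations 1..100 ms, the five slowest marked as errors.
baseline = summarize([(d, d > 95) for d in range(1, 101)])
print(baseline)
```

Storing one such summary per service gives you a fixed reference to compare against in the next step.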

Step 2: Identify Problem Window

Compare incident performance against baseline:

  • Switch the time range to the incident window
  • Review the same Services table for the incident window
  • Look for services whose P90/P99 rose significantly above baseline
  • Identify which service had the worst spike

The service with the worst spike becomes our primary suspect.
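The comparison can be sketched as a ratio of incident P99 to baseline P99 per service. This is an illustration only: the service names, the numbers, and the 2x cutoff for "significant" are assumptions chosen for the example, and the right threshold depends on how noisy your services normally are.

```python
def rank_suspects(baseline, incident, min_ratio=2.0):
    """Return (service, ratio) pairs sorted by P99 growth, largest first."""
    suspects = []
    for service, stats in incident.items():
        base = baseline.get(service)
        if not base or base["p99"] <= 0:
            continue  # no usable baseline to compare against
        ratio = stats["p99"] / base["p99"]
        if ratio >= min_ratio:
            suspects.append((service, ratio))
    return sorted(suspects, key=lambda item: item[1], reverse=True)

# Hypothetical P99 values in milliseconds.
baseline = {"checkout": {"p99": 200}, "cart": {"p99": 100}, "auth": {"p99": 50}}
incident = {"checkout": {"p99": 1600}, "cart": {"p99": 250}, "auth": {"p99": 55}}

suspects = rank_suspects(baseline, incident)
print(suspects)
```

The first entry in the result is the primary suspect; later entries are often downstream or upstream victims of the same problem.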

Step 3: Narrow Down Timing

Pinpoint exact problem start time:

  • Open the suspected service's P50/P90/P99 latency graph
  • Zoom into the incident window and identify the exact time the problem started
  • Look for patterns such as regular spikes (which may indicate timeouts) or service degradation
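One way to make "exact problem start time" less subjective is to require several consecutive breaches of a latency threshold, so that a single isolated spike is not mistaken for the start of the incident. The sketch below assumes a list of `(timestamp, p99_ms)` points taken from the latency graph; the threshold, the run length of three, and the sample values are all hypothetical.

```python
def find_problem_start(series, threshold_ms, sustained=3):
    """series: (timestamp, p99_ms) points in time order.

    Returns the timestamp that begins the first run of `sustained`
    consecutive points above threshold_ms, or None if there is no such run.
    """
    run_start = None
    run_length = 0
    for timestamp, p99 in series:
        if p99 > threshold_ms:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= sustained:
                return run_start
        else:
            run_length = 0  # an in-range point breaks the run
    return None

series = [
    ("10:00", 180), ("10:01", 900), ("10:02", 190),
    ("10:03", 850), ("10:04", 910), ("10:05", 1200),
]
start = find_problem_start(series, threshold_ms=400)
print(start)
```

Here the lone spike at 10:01 is ignored because latency recovers at 10:02, and the sustained run beginning at 10:03 is reported instead.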

Step 4: Find Slow Operations

Identify which specific operations became slow:

  • Open the Operations table for the problem service during the incident window
  • Sort by P99 latency, descending
  • Compare against the baseline Operations table to see which specific operations became slow

Step 5: Search for Errors

Find error patterns that coincide with performance issues:

  • Search for error spans in a 2-3 minute window around the problem start time
  • Look for the most frequent error types
  • Note any patterns in timing
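If the error spans are exported for offline analysis, the frequency count can be sketched as follows. The span fields (`time`, `error`, `error_type`) and the error names are assumptions for illustration and do not reflect CtrlB's actual span schema; the window is treated as a total width centered on the problem start.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_errors(spans, problem_start, window=timedelta(minutes=3), limit=3):
    """Count error types among error spans inside a window centered on problem_start."""
    half = window / 2
    lo, hi = problem_start - half, problem_start + half
    counts = Counter(
        span["error_type"]
        for span in spans
        if span.get("error") and lo <= span["time"] <= hi
    )
    return counts.most_common(limit)

problem_start = datetime(2024, 1, 1, 10, 3, 0)
spans = [  # hypothetical exported spans
    {"time": datetime(2024, 1, 1, 10, 2, 30), "error": True, "error_type": "ConnectionTimeout"},
    {"time": datetime(2024, 1, 1, 10, 3, 10), "error": True, "error_type": "ConnectionTimeout"},
    {"time": datetime(2024, 1, 1, 10, 4, 0), "error": True, "error_type": "NullPointerException"},
    {"time": datetime(2024, 1, 1, 10, 3, 30), "error": False, "error_type": None},
    {"time": datetime(2024, 1, 1, 9, 30, 0), "error": True, "error_type": "ConnectionTimeout"},
]

frequent = top_errors(spans, problem_start)
print(frequent)
```

The span at 09:30 falls outside the window and is not counted, which keeps unrelated earlier errors from skewing the ranking.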

Step 6: Analyze Error Traces

Get detailed error information from trace waterfalls:

  • Select 2-3 representative error spans
  • Click into their trace waterfall views
  • Look at the span JSON for detailed error messages and stack traces
  • Identify what exactly failed

Step 7: Form Hypothesis

Categorize the root cause type.

Categories:

  • Database Problems: Connection issues, query failures, missing data
  • Resource Exhaustion: Timeouts, memory, CPU constraints
  • Application Logic Errors: Poor error handling, retry loops
  • External Dependencies: Third-party service failures
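As a rough first pass, error messages can be bucketed into the four categories above by keyword matching. The keyword lists below are assumptions made up for this sketch, not a CtrlB feature, and a keyword match is only a starting hypothesis that still needs the validation described in Step 8.

```python
# Hypothetical keyword lists; tune these to the errors your own services emit.
CATEGORY_KEYWORDS = {
    "Database Problems": ("connection refused", "sql", "query", "deadlock"),
    "Resource Exhaustion": ("timeout", "timed out", "out of memory", "cpu"),
    "Application Logic Errors": ("retry", "nullpointer", "unhandled"),
    "External Dependencies": ("upstream", "third-party", "502", "503"),
}

def categorize(message):
    """Return the category whose keywords match the message most often."""
    text = message.lower()
    scores = {
        category: sum(keyword in text for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Uncategorized"

print(categorize("SQL query failed: deadlock detected"))
print(categorize("Upstream payment API returned 503"))
```

Messages that match nothing are returned as "Uncategorized" rather than forced into a category, since a wrong label can send the investigation in the wrong direction.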

Step 8: Validate Pattern

Confirm hypothesis with additional evidence:

  • Check whether similar errors occurred in other time windows
  • Look for infrastructure metrics that support the hypothesis
  • Verify that the timing of the different symptoms matches
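The timing check can be expressed as a simple rule: the first observation of each symptom should fall within a small tolerance of the others. This is a sketch under assumptions; the symptom names, the timestamps, and the two-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta

def onsets_align(onsets, tolerance=timedelta(minutes=2)):
    """onsets: mapping of symptom name -> time it was first observed.

    True when every symptom began within `tolerance` of every other one.
    """
    times = list(onsets.values())
    return max(times) - min(times) <= tolerance

onsets = {
    "p99 latency spike": datetime(2024, 1, 1, 10, 3, 0),
    "first ConnectionTimeout span": datetime(2024, 1, 1, 10, 2, 30),
    "connection pool saturation": datetime(2024, 1, 1, 10, 2, 45),
}
aligned = onsets_align(onsets)
print(aligned)
```

If a symptom started well before or after the others, it is more likely a separate issue or a downstream effect than part of the root cause.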

Step 9: Document Root Cause

The documentation should include:

  • Specific error type and message
  • Affected operations and services
  • Exact timing of incident
  • Supporting evidence, such as relevant span IDs and trace IDs
  • Impact scope and duration
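Teams that keep RCA records in code or structured storage could capture the checklist above as a small record type. The class and field names below are illustrative assumptions, as are the sample values; the `missing_fields` helper simply flags anything left empty before the report is filed.

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseReport:
    error_type: str
    error_message: str
    affected_services: list
    affected_operations: list
    started_at: str
    duration_minutes: int
    impact: str
    trace_ids: list = field(default_factory=list)
    span_ids: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields left empty, so gaps are caught before filing."""
        return [name for name, value in vars(self).items() if value in ("", [], None)]

report = RootCauseReport(
    error_type="ConnectionTimeout",          # hypothetical incident details
    error_message="timed out acquiring database connection",
    affected_services=["checkout"],
    affected_operations=["POST /orders"],
    started_at="2024-01-01T10:03:00Z",
    duration_minutes=17,
    impact="Roughly 1 in 5 checkout requests failed",
    span_ids=["span-123"],
)
print(report.missing_fields())
```

In this example no trace IDs were attached, so the helper reports `trace_ids` as the one gap to fill.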

Step 10: Plan Fix

Based on the RCA, take the following steps:

  • Identify immediate mitigations, such as restarting services or adjusting timeouts
  • Plan long-term fixes, such as code changes or monitoring improvements