k13d Stability Architecture¶
This document describes the internal stability patterns that ensure k13d operates reliably in production environments. It serves as a guide for contributors and maintainers working on the TUI codebase.
Table of Contents¶
- Lock Ordering Convention
- Goroutine Lifecycle
- Error Recovery Strategy
- Watcher & Refresh Synchronization
- Shutdown Sequence
- Thread-Safe UI Updates
Lock Ordering Convention¶
k13d uses multiple mutexes to protect concurrent access to shared state. Lock ordering is critical to prevent deadlocks.
Lock Hierarchy (Must Follow This Order)¶
Rule: Always acquire locks from left to right. Never nest locks in reverse order.
Lock Definitions¶
| Lock | Type | Purpose | Scope |
|---|---|---|---|
watchMu | sync.Mutex | Protects watcher lifecycle (start/stop) | pkg/ui/app.go |
navMx | sync.Mutex | Protects navigation stack (back/forward history) | pkg/ui/app.go |
mx | sync.RWMutex | Protects app state (resource, namespace, filters, sorts) | pkg/ui/app.go |
pfMx | sync.Mutex | Protects port-forward tracking | pkg/ui/app.go |
aiMx | sync.RWMutex | Protects AI decision approval state | pkg/ui/app.go |
cancelLock | sync.Mutex | Protects context cancellation | pkg/ui/app.go |
Lock Ordering Examples¶
✅ Correct: watchMu → mx¶
func (a *App) startWatch() {
a.watchMu.Lock()
defer a.watchMu.Unlock()
// Later acquire mx (allowed: watchMu before mx)
a.mx.RLock()
resource := a.currentResource
a.mx.RUnlock()
// ... start watcher ...
}
❌ WRONG: mx → watchMu (Deadlock Risk!)¶
// NEVER DO THIS - violates lock ordering
func badFunction() {
a.mx.Lock() // Acquire mx first
a.watchMu.Lock() // Then watchMu - WRONG ORDER!
// ...
}
The navigateTo() Centralized State Transition¶
navigateTo() is the single entry point for all resource/namespace transitions. It enforces the lock ordering convention:
func (a *App) navigateTo(resource, namespace, filter string) {
// 1. Stop watch first (watchMu)
a.stopWatch()
// 2. Update state (mx)
a.mx.Lock()
a.currentResource = resource
a.currentNamespace = namespace
a.filterText = filter
a.sortColumn = -1
a.sortAscending = true
a.mx.Unlock()
// 3. Refresh and start new watch (no locks held)
a.safeGo("navigateTo-refresh", func() {
a.updateHeader()
a.updateStatusBar()
a.refresh()
a.startWatch() // Start watch for new resource
})
}
Why this works: 1. stopWatch() acquires watchMu alone 2. State mutation acquires mx after watchMu is released 3. startWatch() runs in goroutine, acquiring watchMu again (no deadlock risk)
Goroutine Lifecycle¶
k13d uses goroutines extensively for async operations. All goroutines use the safeGo() wrapper for panic recovery.
The safeGo() Pattern (k9s RunE Pattern)¶
// safeGo wraps goroutines with panic recovery
func (a *App) safeGo(name string, fn func()) {
go func() {
defer func() {
if r := recover(); r != nil {
a.logger.Error("goroutine panic recovered",
"name", name,
"error", r,
"stack", string(debug.Stack()))
a.flashMsg(fmt.Sprintf("Internal error in %s. The operation has been recovered. Please try again or check logs for details.", name), true)
}
}()
fn()
}()
}
Goroutine Usage Examples¶
// Resource loading
a.safeGo("loadAPIResources", a.loadAPIResources)
// AI queries
a.safeGo("askAI", func() {
a.askAI(question)
})
// Refresh after mutation
a.safeGo("refresh-after-execute", a.refresh)
Key Points: - Always use safeGo() instead of raw go statements - Provide descriptive names for debugging (shows in logs) - Panics are logged with full stack trace - User sees friendly error message in TUI
Context Propagation and Cancellation¶
Use context.WithCancel or context.WithTimeout for operations that need to be aborted:
a.safeGo("editResource-fetch", func() {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
yaml, err := a.k8s.GetResourceYAML(ctx, ns, name, gvr)
if ctx.Err() != nil {
a.logger.Debug("Fetch cancelled", "resource", resource)
return // Silent return on cancellation
}
// ... use yaml ...
})
Atomic Guards (Lock-Free Coordination)¶
k13d uses sync/atomic for lightweight state checks without mutexes:
| Atomic Flag | Purpose |
|---|---|
stopping | Set to 1 immediately when Stop() is called |
running | Set to 1 after Application.Run() starts |
inUpdate | Deduplicates concurrent refresh() calls |
hasToolCall | Tracks pending AI tool approvals |
needsSync | Triggers full terminal re-sync |
Example:
func (a *App) refresh() {
// Prevent duplicate concurrent refreshes
if !atomic.CompareAndSwapInt32(&a.inUpdate, 0, 1) {
return // Already refreshing
}
defer atomic.StoreInt32(&a.inUpdate, 0)
// Check if shutting down
if atomic.LoadInt32(&a.stopping) == 1 {
return
}
// ... perform refresh ...
}
Error Recovery Strategy¶
k13d uses a multi-layered approach to error handling:
1. Panic Recovery (safeGo)¶
All goroutines recover from panics:
a.safeGo("operation", func() {
// If this panics, it's caught by safeGo's defer/recover
dangerousOperation()
})
2. Kubernetes API Failure Handling¶
Use exponential backoff for transient failures:
func (a *App) refresh() {
b := backoff.NewExponentialBackOff()
b.MaxElapsedTime = 30 * time.Second
var pods []k8s.Pod
err := backoff.Retry(func() error {
var err error
pods, err = a.k8s.ListPods(ctx, namespace)
return err
}, b)
if err != nil {
if ctx.Err() != nil {
// Context cancelled - silently return
return
}
// Real error - show to user
a.flashMsg(fmt.Sprintf("Failed to load pods: %v", err), true)
return
}
// ... populate table ...
}
3. Graceful Degradation¶
When non-critical features fail, continue operation:
// AI client initialization failure - continue without AI
newClient, err := ai.NewClient(&a.config.LLM)
if err != nil {
a.logger.Warn("AI client init failed", "error", err)
a.aiClient = nil // Disable AI, but TUI still works
} else {
a.aiClient = newClient
}
4. User-Friendly Error Messages¶
Always provide actionable error messages:
// ❌ BAD: Generic, unhelpful
a.flashMsg("Error: %v", err)
// ✅ GOOD: Specific, actionable
a.flashMsg(fmt.Sprintf("Failed to load %s: %v. Check cluster connectivity and permissions.", resource, err), true)
Watcher & Refresh Synchronization¶
k13d uses Kubernetes watch API for real-time resource updates. The watch/refresh cycle is carefully synchronized to avoid race conditions.
Watch → Debounce → onChange → Refresh Flow¶
┌──────────────┐
│ K8s API │
│ Watch │
└──────┬───────┘
│
│ Resource changed event
▼
┌──────────────────┐
│ ResourceWatcher │
│ (debounced) │
└──────┬───────────┘
│
│ onChange callback (100ms debounce)
▼
┌──────────────┐
│ a.refresh() │ ← Fetches latest data and updates table
└──────────────┘
Debouncing (k9s Pattern)¶
Rapid K8s API events are debounced to avoid UI thrashing:
type ResourceWatcher struct {
debounce time.Duration // Default: 100ms
timer *time.Timer
}
func (w *ResourceWatcher) eventLoop(ctx context.Context) {
for event := range w.eventChan {
// Reset timer on each event
w.timer.Reset(w.debounce)
}
}
func (w *ResourceWatcher) debounceLoop(ctx context.Context) {
for {
select {
case <-w.timer.C:
w.onChange() // Call callback after debounce period
case <-ctx.Done():
return
}
}
}
Stop Guard with isStopped()¶
The watcher checks if it's been stopped before invoking callbacks:
func (w *ResourceWatcher) isStopped() bool {
w.mu.RLock()
defer w.mu.RUnlock()
return w.stopped
}
func (w *ResourceWatcher) debounceLoop(ctx context.Context) {
for {
select {
case <-w.timer.C:
if !w.isStopped() {
w.onChange()
}
case <-ctx.Done():
return
}
}
}
Resource Switch Sequence¶
When switching resources (e.g., :pods → :svc):
1. stopWatch() → Cancel old watcher context
→ Set watcher.stopped = true
→ Drain remaining events
2. State change → Update currentResource, currentNamespace
→ Reset filters, sorts
3. startWatch() → Create new watcher for new resource
→ Start new watch goroutines
Critical: Always call stopWatch() before changing state to prevent stale events from updating the wrong resource view.
Shutdown Sequence¶
k13d follows a graceful shutdown sequence to avoid resource leaks and crashes.
Shutdown Flow¶
User presses 'q'
│
▼
┌──────────────────────────────────────┐
│ 1. Set stopping = 1 (atomic) │ ← Prevent new operations
└──────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ 2. Cancel in-flight operations │ ← Call cancelFn()
└──────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ 3. Stop watcher (stopWatch) │ ← Cancel watch context
└──────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ 4. Cleanup resources │ ← Close channels, stop goroutines
└──────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ 5. tview.Application.Stop() │ ← Exit event loop
└──────────────────────────────────────┘
Shutdown Code¶
func (a *App) Stop() {
// 1. Set atomic flag immediately (guards against new operations)
atomic.StoreInt32(&a.stopping, 1)
// 2. Cancel all in-flight contexts
a.cancelLock.Lock()
if a.cancelFn != nil {
a.cancelFn()
}
a.cancelLock.Unlock()
// 3. Stop watcher
a.stopWatch()
// 4. Stop tview application (blocks until event loop exits)
if a.Application != nil {
a.Application.Stop()
}
}
Checking for Shutdown in Operations¶
All long-running operations check the stopping flag:
func (a *App) refresh() {
// Exit early if shutting down
if atomic.LoadInt32(&a.stopping) == 1 {
return
}
// Prevent duplicate refreshes
if !atomic.CompareAndSwapInt32(&a.inUpdate, 0, 1) {
return
}
defer atomic.StoreInt32(&a.inUpdate, 0)
// ... perform refresh ...
}
Avoiding Deadlocks on Shutdown¶
Problem: Goroutines blocking on mutexes during shutdown.
Solution: Use atomic flags (stopping) to exit early before acquiring locks:
func (a *App) startWatch() {
// Check stopping flag BEFORE acquiring watchMu
if atomic.LoadInt32(&a.stopping) == 1 {
return
}
a.watchMu.Lock()
defer a.watchMu.Unlock()
// ... start watch ...
}
Thread-Safe UI Updates¶
tview is not thread-safe. All UI updates from goroutines must go through QueueUpdateDraw().
QueueUpdateDraw Pattern¶
// From a goroutine (e.g., safeGo callback)
a.safeGo("fetchData", func() {
data := fetchData() // Background operation
// Update UI on main thread
a.QueueUpdateDraw(func() {
a.table.SetCell(0, 0, tview.NewTableCell(data))
})
})
QueueUpdateDraw Implementation¶
func (a *App) QueueUpdateDraw(f func()) {
// Guard: Don't queue if stopping or app is nil
if a.Application == nil || atomic.LoadInt32(&a.stopping) == 1 {
return
}
// IMPORTANT: Run in goroutine to avoid deadlock when called from input handlers
go func() {
a.Application.QueueUpdateDraw(f)
}()
}
Why the extra goroutine?
tview's QueueUpdateDraw can block if called from within an input handler (because the event loop is waiting for the handler to return). Wrapping it in a goroutine makes it non-blocking.
Direct Update (queueUpdateDrawDirect)¶
Use when already outside the event loop (e.g., from safeGo callback):
func (a *App) queueUpdateDrawDirect(f func()) {
if a.Application == nil || atomic.LoadInt32(&a.stopping) == 1 {
return
}
a.Application.QueueUpdateDraw(f) // No extra goroutine needed
}
AI Streaming Updates (Throttled)¶
AI responses stream token-by-token. Throttle updates to avoid UI thrashing:
func (a *App) appendAIResponse(text string) {
now := time.Now().UnixNano()
last := atomic.LoadInt64(&a.lastAIDraw)
// Throttle: Only update UI every 50ms
if now-last < 50*time.Millisecond.Nanoseconds() {
return
}
if atomic.CompareAndSwapInt64(&a.lastAIDraw, last, now) {
a.QueueUpdateDraw(func() {
// Update AI panel with new text
a.aiPanel.SetText(a.aiPanel.GetText(true) + text)
})
}
}
Testing Stability¶
k13d includes regression tests for stability patterns:
Test Files¶
| File | Purpose |
|---|---|
pkg/ui/stability_regression_test.go | Tests deadlock scenarios, lock ordering, shutdown |
pkg/ui/feature_test.go | E2E tests with concurrency checks |
pkg/ui/golden_test.go | Screen snapshot tests |
Running Stability Tests¶
# Always use -race flag for TUI tests
go test -race ./pkg/ui/ -run Stability
# Check for deadlocks
go test -race -timeout=10s ./pkg/ui/ -run TestNavigationDeadlock
Example: Testing Lock Ordering¶
func TestNavigationDeadlock(t *testing.T) {
app := NewTestApp()
defer app.Stop()
// Spawn 100 concurrent navigations
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func(idx int) {
defer wg.Done()
resource := []string{"pods", "svc", "deploy"}[idx%3]
app.navigateTo(resource, "", "")
}(i)
}
// Should complete without deadlock
done := make(chan struct{})
go func() {
wg.Wait()
close(done)
}()
select {
case <-done:
// Success
case <-time.After(5 * time.Second):
t.Fatal("Deadlock detected: navigation did not complete")
}
}
Common Pitfalls & Solutions¶
Pitfall 1: Acquiring Locks in Wrong Order¶
Problem:
Solution: Follow the lock hierarchy: watchMu → navMx → mx
Pitfall 2: Forgetting to Release Locks¶
Problem:
func badFunction() {
a.mx.Lock()
if err != nil {
return // OOPS: forgot to unlock!
}
a.mx.Unlock()
}
Solution: Always use defer:
func goodFunction() {
a.mx.Lock()
defer a.mx.Unlock()
if err != nil {
return // Lock released automatically
}
}
Pitfall 3: UI Updates Without QueueUpdateDraw¶
Problem:
a.safeGo("fetch", func() {
data := fetch()
a.table.SetCell(0, 0, tview.NewTableCell(data)) // CRASH: not thread-safe!
})
Solution:
a.safeGo("fetch", func() {
data := fetch()
a.QueueUpdateDraw(func() {
a.table.SetCell(0, 0, tview.NewTableCell(data))
})
})
Pitfall 4: Not Checking stopping Flag¶
Problem:
func (a *App) longOperation() {
a.mx.Lock()
defer a.mx.Unlock()
// If Stop() was called, this will deadlock or cause errors
// ... long operation ...
}
Solution:
func (a *App) longOperation() {
if atomic.LoadInt32(&a.stopping) == 1 {
return // Exit early
}
a.mx.Lock()
defer a.mx.Unlock()
// ... operation ...
}
Debugging Tips¶
Enable Debug Logging¶
Check for Data Races¶
Analyze Deadlocks¶
Profile the TUI¶
Summary¶
k13d's stability architecture follows these principles:
- Strict Lock Ordering:
watchMu → navMx → mx - Centralized State Transitions:
navigateTo()is the single entry point - Panic Recovery: All goroutines use
safeGo() - Graceful Shutdown: Atomic
stoppingflag prevents new operations - Thread-Safe UI: All UI updates via
QueueUpdateDraw() - Watch/Refresh Sync: Debounced watchers with stop guards
- Error Recovery: Exponential backoff for transient failures
These patterns ensure k13d remains responsive and stable even under high load, rapid navigation, and concurrent operations.
References¶
- k9s Stability: github.com/derailed/k9s (RunE pattern, watch debouncing)
- tview Threading: github.com/rivo/tview (QueueUpdateDraw documentation)
- Go Concurrency: golang.org/doc/effective_go#concurrency
- Lock Ordering: "The Little Book of Semaphores" by Allen B. Downey