k13d Stability Architecture

This document describes the internal stability patterns that ensure k13d operates reliably in production environments. It serves as a guide for contributors and maintainers working on the TUI codebase.

Table of Contents

  • Lock Ordering Convention
  • Goroutine Lifecycle
  • Error Recovery Strategy
  • Watcher & Refresh Synchronization
  • Shutdown Sequence
  • Thread-Safe UI Updates
  • Testing Stability
  • Common Pitfalls & Solutions
  • Debugging Tips
  • Summary
  • References

Lock Ordering Convention

k13d uses multiple mutexes to protect concurrent access to shared state. Lock ordering is critical to prevent deadlocks.

Lock Hierarchy (Must Follow This Order)

watchMu → navMx → mx

Rule: Always acquire locks from left to right. Never nest locks in reverse order.

Lock Definitions

| Lock | Type | Purpose | Scope |
|------|------|---------|-------|
| watchMu | sync.Mutex | Protects watcher lifecycle (start/stop) | pkg/ui/app.go |
| navMx | sync.Mutex | Protects navigation stack (back/forward history) | pkg/ui/app.go |
| mx | sync.RWMutex | Protects app state (resource, namespace, filters, sorts) | pkg/ui/app.go |
| pfMx | sync.Mutex | Protects port-forward tracking | pkg/ui/app.go |
| aiMx | sync.RWMutex | Protects AI decision approval state | pkg/ui/app.go |
| cancelLock | sync.Mutex | Protects context cancellation | pkg/ui/app.go |
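
The remaining locks (pfMx, aiMx, cancelLock) sit outside this hierarchy and guard independent state. As a hedged sketch, port-forward bookkeeping might look like this (the trackPortForward helper and activeForwards map are illustrative, not actual k13d code):

func (a *App) trackPortForward(key string, stopCh chan struct{}) {
    a.pfMx.Lock()
    defer a.pfMx.Unlock()

    // activeForwards: illustrative map[string]chan struct{},
    // assumed initialized when the App is constructed
    a.activeForwards[key] = stopCh
}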

Lock Ordering Examples

✅ Correct: watchMu → mx

func (a *App) startWatch() {
    a.watchMu.Lock()
    defer a.watchMu.Unlock()

    // Later acquire mx (allowed: watchMu before mx)
    a.mx.RLock()
    resource := a.currentResource
    a.mx.RUnlock()

    // ... start watcher ...
}
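
✅ Correct: navMx → mx

For completeness, a hypothetical back-navigation sketch that respects navMx → mx; the navStack field and its entry type are illustrative, not actual k13d fields:

func (a *App) goBack() {
    a.navMx.Lock()
    if len(a.navStack) == 0 {
        a.navMx.Unlock()
        return
    }
    prev := a.navStack[len(a.navStack)-1] // navStack: illustrative []navEntry
    a.navStack = a.navStack[:len(a.navStack)-1]
    a.navMx.Unlock()

    // mx is acquired strictly after navMx, never the other way around
    a.mx.Lock()
    a.currentResource = prev.resource
    a.currentNamespace = prev.namespace
    a.mx.Unlock()
}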

❌ WRONG: mx → watchMu (Deadlock Risk!)

// NEVER DO THIS - violates lock ordering
func (a *App) badFunction() {
    a.mx.Lock()              // Acquire mx first
    a.watchMu.Lock()         // Then watchMu - WRONG ORDER!
    // ...
}

The navigateTo() Centralized State Transition

navigateTo() is the single entry point for all resource/namespace transitions. It enforces the lock ordering convention:

func (a *App) navigateTo(resource, namespace, filter string) {
    // 1. Stop watch first (watchMu)
    a.stopWatch()

    // 2. Update state (mx)
    a.mx.Lock()
    a.currentResource = resource
    a.currentNamespace = namespace
    a.filterText = filter
    a.sortColumn = -1
    a.sortAscending = true
    a.mx.Unlock()

    // 3. Refresh and start new watch (no locks held)
    a.safeGo("navigateTo-refresh", func() {
        a.updateHeader()
        a.updateStatusBar()
        a.refresh()
        a.startWatch() // Start watch for new resource
    })
}

Why this works:

  1. stopWatch() acquires watchMu alone
  2. The state mutation acquires mx only after watchMu has been released
  3. startWatch() runs in a goroutine, acquiring watchMu again (no deadlock risk)


Goroutine Lifecycle

k13d uses goroutines extensively for async operations. All goroutines use the safeGo() wrapper for panic recovery.

The safeGo() Pattern (k9s RunE Pattern)

// safeGo wraps goroutines with panic recovery
func (a *App) safeGo(name string, fn func()) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                a.logger.Error("goroutine panic recovered",
                    "name", name,
                    "error", r,
                    "stack", string(debug.Stack()))
                a.flashMsg(fmt.Sprintf("Internal error in %s. The operation has been recovered. Please try again or check logs for details.", name), true)
            }
        }()
        fn()
    }()
}

Goroutine Usage Examples

// Resource loading
a.safeGo("loadAPIResources", a.loadAPIResources)

// AI queries
a.safeGo("askAI", func() {
    a.askAI(question)
})

// Refresh after mutation
a.safeGo("refresh-after-execute", a.refresh)

Key Points:

  • Always use safeGo() instead of raw go statements
  • Provide descriptive names for debugging (they appear in the logs)
  • Panics are logged with a full stack trace
  • The user sees a friendly error message in the TUI

Context Propagation and Cancellation

Use context.WithCancel or context.WithTimeout for operations that need to be aborted:

a.safeGo("editResource-fetch", func() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    yaml, err := a.k8s.GetResourceYAML(ctx, ns, name, gvr)
    if ctx.Err() != nil {
        a.logger.Debug("Fetch cancelled", "resource", resource)
        return // Silent return on cancellation
    }
    // ... use yaml ...
})
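
Stop() (see the Shutdown Sequence section) aborts in-flight work through a.cancelFn. A hedged sketch of how an operation might register its cancel function under cancelLock (the helper name is illustrative):

func (a *App) newOperationContext() (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(context.Background())

    // Record the cancel function under cancelLock so Stop() can
    // invoke it during shutdown
    a.cancelLock.Lock()
    a.cancelFn = cancel
    a.cancelLock.Unlock()

    return ctx, cancel
}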

Atomic Guards (Lock-Free Coordination)

k13d uses sync/atomic for lightweight state checks without mutexes:

| Atomic Flag | Purpose |
|-------------|---------|
| stopping | Set to 1 immediately when Stop() is called |
| running | Set to 1 after Application.Run() starts |
| inUpdate | Deduplicates concurrent refresh() calls |
| hasToolCall | Tracks pending AI tool approvals |
| needsSync | Triggers full terminal re-sync |

Example:

func (a *App) refresh() {
    // Prevent duplicate concurrent refreshes
    if !atomic.CompareAndSwapInt32(&a.inUpdate, 0, 1) {
        return // Already refreshing
    }
    defer atomic.StoreInt32(&a.inUpdate, 0)

    // Check if shutting down
    if atomic.LoadInt32(&a.stopping) == 1 {
        return
    }

    // ... perform refresh ...
}

Error Recovery Strategy

k13d uses a multi-layered approach to error handling:

1. Panic Recovery (safeGo)

All goroutines recover from panics:

a.safeGo("operation", func() {
    // If this panics, it's caught by safeGo's defer/recover
    dangerousOperation()
})

2. Kubernetes API Failure Handling

Use exponential backoff for transient failures:

func (a *App) refresh() {
    b := backoff.NewExponentialBackOff() // github.com/cenkalti/backoff/v4 assumed
    b.MaxElapsedTime = 30 * time.Second

    // ctx and namespace are taken from the app's current state
    var pods []k8s.Pod
    err := backoff.Retry(func() error {
        var err error
        pods, err = a.k8s.ListPods(ctx, namespace)
        return err
    }, b)

    if err != nil {
        if ctx.Err() != nil {
            // Context cancelled - silently return
            return
        }
        // Real error - show to user
        a.flashMsg(fmt.Sprintf("Failed to load pods: %v", err), true)
        return
    }
    // ... populate table ...
}

3. Graceful Degradation

When non-critical features fail, continue operation:

// AI client initialization failure - continue without AI
newClient, err := ai.NewClient(&a.config.LLM)
if err != nil {
    a.logger.Warn("AI client init failed", "error", err)
    a.aiClient = nil // Disable AI, but TUI still works
} else {
    a.aiClient = newClient
}

4. User-Friendly Error Messages

Always provide actionable error messages:

// ❌ BAD: Generic, unhelpful
a.flashMsg(fmt.Sprintf("Error: %v", err), true)

// ✅ GOOD: Specific, actionable
a.flashMsg(fmt.Sprintf("Failed to load %s: %v. Check cluster connectivity and permissions.", resource, err), true)

Watcher & Refresh Synchronization

k13d uses Kubernetes watch API for real-time resource updates. The watch/refresh cycle is carefully synchronized to avoid race conditions.

Watch → Debounce → onChange → Refresh Flow

┌──────────────┐
│   K8s API    │
│   Watch      │
└──────┬───────┘
       │ Resource changed event
┌──────────────────┐
│  ResourceWatcher │
│   (debounced)    │
└──────┬───────────┘
       │ onChange callback (100ms debounce)
┌──────────────┐
│ a.refresh()  │ ← Fetches latest data and updates table
└──────────────┘

Debouncing (k9s Pattern)

Rapid K8s API events are debounced to avoid UI thrashing:

type ResourceWatcher struct {
    debounce  time.Duration // Default: 100ms
    timer     *time.Timer   // created with time.NewTimer in the constructor
    eventChan chan struct{} // raw watch events (element type illustrative)
}

func (w *ResourceWatcher) eventLoop(ctx context.Context) {
    for range w.eventChan {
        // Restart the debounce window on every event
        w.timer.Reset(w.debounce)
    }
}

func (w *ResourceWatcher) debounceLoop(ctx context.Context) {
    for {
        select {
        case <-w.timer.C:
            w.onChange() // Call callback after debounce period
        case <-ctx.Done():
            return
        }
    }
}

Stop Guard with isStopped()

The watcher checks if it's been stopped before invoking callbacks:

func (w *ResourceWatcher) isStopped() bool {
    w.mu.RLock()
    defer w.mu.RUnlock()
    return w.stopped
}

func (w *ResourceWatcher) debounceLoop(ctx context.Context) {
    for {
        select {
        case <-w.timer.C:
            if !w.isStopped() {
                w.onChange()
            }
        case <-ctx.Done():
            return
        }
    }
}

Resource Switch Sequence

When switching resources (e.g., :pods → :svc):

1. stopWatch()      → Cancel old watcher context
                    → Set watcher.stopped = true
                    → Drain remaining events

2. State change     → Update currentResource, currentNamespace
                    → Reset filters, sorts

3. startWatch()     → Create new watcher for new resource
                    → Start new watch goroutines

Critical: Always call stopWatch() before changing state to prevent stale events from updating the wrong resource view.
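
A minimal sketch of stopWatch() under these assumptions (the watcher and watchCancel fields, and the watcher's Stop() method, are illustrative):

func (a *App) stopWatch() {
    a.watchMu.Lock()
    defer a.watchMu.Unlock()

    if a.watcher == nil {
        return
    }

    // Mark the watcher stopped first so in-flight debounce callbacks
    // become no-ops, then cancel the context to unwind its goroutines
    a.watcher.Stop()
    if a.watchCancel != nil {
        a.watchCancel()
    }
    a.watcher = nil
}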


Shutdown Sequence

k13d follows a graceful shutdown sequence to avoid resource leaks and crashes.

Shutdown Flow

User presses 'q'
┌──────────────────────────────────────┐
│ 1. Set stopping = 1 (atomic)         │ ← Prevent new operations
└──────┬───────────────────────────────┘
┌──────────────────────────────────────┐
│ 2. Cancel in-flight operations       │ ← Call cancelFn()
└──────┬───────────────────────────────┘
┌──────────────────────────────────────┐
│ 3. Stop watcher (stopWatch)          │ ← Cancel watch context
└──────┬───────────────────────────────┘
┌──────────────────────────────────────┐
│ 4. Cleanup resources                 │ ← Close channels, stop goroutines
└──────┬───────────────────────────────┘
┌──────────────────────────────────────┐
│ 5. tview.Application.Stop()          │ ← Exit event loop
└──────────────────────────────────────┘

Shutdown Code

func (a *App) Stop() {
    // 1. Set atomic flag immediately (guards against new operations)
    atomic.StoreInt32(&a.stopping, 1)

    // 2. Cancel all in-flight contexts
    a.cancelLock.Lock()
    if a.cancelFn != nil {
        a.cancelFn()
    }
    a.cancelLock.Unlock()

    // 3. Stop watcher
    a.stopWatch()

    // 4. Stop tview application (blocks until event loop exits)
    if a.Application != nil {
        a.Application.Stop()
    }
}

Checking for Shutdown in Operations

All long-running operations check the stopping flag:

func (a *App) refresh() {
    // Exit early if shutting down
    if atomic.LoadInt32(&a.stopping) == 1 {
        return
    }

    // Prevent duplicate refreshes
    if !atomic.CompareAndSwapInt32(&a.inUpdate, 0, 1) {
        return
    }
    defer atomic.StoreInt32(&a.inUpdate, 0)

    // ... perform refresh ...
}

Avoiding Deadlocks on Shutdown

Problem: Goroutines blocking on mutexes during shutdown.

Solution: Use atomic flags (stopping) to exit early before acquiring locks:

func (a *App) startWatch() {
    // Check stopping flag BEFORE acquiring watchMu
    if atomic.LoadInt32(&a.stopping) == 1 {
        return
    }

    a.watchMu.Lock()
    defer a.watchMu.Unlock()
    // ... start watch ...
}

Thread-Safe UI Updates

tview is not thread-safe. All UI updates from goroutines must go through QueueUpdateDraw().

QueueUpdateDraw Pattern

// From a goroutine (e.g., safeGo callback)
a.safeGo("fetchData", func() {
    data := fetchData() // Background operation

    // Update UI on main thread
    a.QueueUpdateDraw(func() {
        a.table.SetCell(0, 0, tview.NewTableCell(data))
    })
})

QueueUpdateDraw Implementation

func (a *App) QueueUpdateDraw(f func()) {
    // Guard: Don't queue if stopping or app is nil
    if a.Application == nil || atomic.LoadInt32(&a.stopping) == 1 {
        return
    }

    // IMPORTANT: Run in goroutine to avoid deadlock when called from input handlers
    go func() {
        a.Application.QueueUpdateDraw(f)
    }()
}

Why the extra goroutine?

tview's QueueUpdateDraw can block if called from within an input handler (because the event loop is waiting for the handler to return). Wrapping it in a goroutine makes it non-blocking.
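
For example, a key binding registered with tview's SetInputCapture can safely call the wrapper from inside the event loop (the 'r' refresh binding here is illustrative):

a.table.SetInputCapture(func(ev *tcell.EventKey) *tcell.EventKey {
    if ev.Rune() == 'r' {
        // Non-blocking even though we are inside an input handler:
        // the wrapper queues the draw from a fresh goroutine
        a.QueueUpdateDraw(func() {
            a.updateStatusBar()
        })
        return nil
    }
    return ev
})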

Direct Update (queueUpdateDrawDirect)

Use when already outside the event loop (e.g., from safeGo callback):

func (a *App) queueUpdateDrawDirect(f func()) {
    if a.Application == nil || atomic.LoadInt32(&a.stopping) == 1 {
        return
    }
    a.Application.QueueUpdateDraw(f) // No extra goroutine needed
}

AI Streaming Updates (Throttled)

AI responses stream token-by-token. Updates are throttled so the UI is redrawn at most every 50ms; tokens arriving inside the throttle window must be buffered or they would be silently dropped. In this sketch, aiBuf (a strings.Builder) and aiBufMu are illustrative fields holding pending text:

func (a *App) appendAIResponse(text string) {
    // Buffer every token first so throttled tokens are not lost
    a.aiBufMu.Lock()
    a.aiBuf.WriteString(text)
    a.aiBufMu.Unlock()

    now := time.Now().UnixNano()
    last := atomic.LoadInt64(&a.lastAIDraw)

    // Throttle: redraw at most every 50ms; pending text is flushed by
    // whichever call wins the next CAS (a final flush when the stream
    // ends is assumed elsewhere)
    if now-last < 50*time.Millisecond.Nanoseconds() {
        return
    }

    if atomic.CompareAndSwapInt64(&a.lastAIDraw, last, now) {
        a.QueueUpdateDraw(func() {
            a.aiBufMu.Lock()
            pending := a.aiBuf.String()
            a.aiBuf.Reset()
            a.aiBufMu.Unlock()

            a.aiPanel.SetText(a.aiPanel.GetText(true) + pending)
        })
    }
}

Testing Stability

k13d includes regression tests for stability patterns:

Test Files

| File | Purpose |
|------|---------|
| pkg/ui/stability_regression_test.go | Tests deadlock scenarios, lock ordering, shutdown |
| pkg/ui/feature_test.go | E2E tests with concurrency checks |
| pkg/ui/golden_test.go | Screen snapshot tests |

Running Stability Tests

# Always use -race flag for TUI tests
go test -race ./pkg/ui/ -run Stability

# Check for deadlocks
go test -race -timeout=10s ./pkg/ui/ -run TestNavigationDeadlock

Example: Testing Lock Ordering

func TestNavigationDeadlock(t *testing.T) {
    app := NewTestApp()
    defer app.Stop()

    // Spawn 100 concurrent navigations
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(idx int) {
            defer wg.Done()
            resource := []string{"pods", "svc", "deploy"}[idx%3]
            app.navigateTo(resource, "", "")
        }(i)
    }

    // Should complete without deadlock
    done := make(chan struct{})
    go func() {
        wg.Wait()
        close(done)
    }()

    select {
    case <-done:
        // Success
    case <-time.After(5 * time.Second):
        t.Fatal("Deadlock detected: navigation did not complete")
    }
}

Common Pitfalls & Solutions

Pitfall 1: Acquiring Locks in Wrong Order

Problem:

func (a *App) badFunction() {
    a.mx.Lock()      // Acquire mx first
    a.watchMu.Lock() // Then watchMu - WRONG!
}

Solution: Follow the lock hierarchy: watchMu → navMx → mx
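
A minimal corrected version:

func (a *App) fixedFunction() {
    a.watchMu.Lock() // hierarchy respected: watchMu before mx
    defer a.watchMu.Unlock()

    a.mx.Lock()
    defer a.mx.Unlock()
    // ...
}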


Pitfall 2: Forgetting to Release Locks

Problem:

func (a *App) badFunction() {
    a.mx.Lock()
    if err := doWork(); err != nil { // doWork is a stand-in for any fallible call
        return // OOPS: forgot to unlock!
    }
    a.mx.Unlock()
}

Solution: Always use defer:

func (a *App) goodFunction() {
    a.mx.Lock()
    defer a.mx.Unlock()

    if err := doWork(); err != nil {
        return // Lock released automatically
    }
}


Pitfall 3: UI Updates Without QueueUpdateDraw

Problem:

a.safeGo("fetch", func() {
    data := fetch()
    a.table.SetCell(0, 0, tview.NewTableCell(data)) // CRASH: not thread-safe!
})

Solution:

a.safeGo("fetch", func() {
    data := fetch()
    a.QueueUpdateDraw(func() {
        a.table.SetCell(0, 0, tview.NewTableCell(data))
    })
})


Pitfall 4: Not Checking stopping Flag

Problem:

func (a *App) longOperation() {
    a.mx.Lock()
    defer a.mx.Unlock()
    // If Stop() was called, this will deadlock or cause errors
    // ... long operation ...
}

Solution:

func (a *App) longOperation() {
    if atomic.LoadInt32(&a.stopping) == 1 {
        return // Exit early
    }

    a.mx.Lock()
    defer a.mx.Unlock()
    // ... operation ...
}


Debugging Tips

Enable Debug Logging

# config.yaml
log_level: debug

Check for Data Races

go test -race ./pkg/ui/

Analyze Deadlocks

# Run with GODEBUG to see mutex contention
GODEBUG=schedtrace=1000,scheddetail=1 k13d
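
If the process hangs outright, a goroutine dump usually pinpoints the blocked locks. The Go runtime prints all goroutine stacks on SIGQUIT by default:

# Dump goroutine stacks of a hung k13d process
kill -QUIT $(pgrep k13d)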

Profile the TUI

go test -cpuprofile=cpu.prof -memprofile=mem.prof -bench=.
go tool pprof cpu.prof

Summary

k13d's stability architecture follows these principles:

  1. Strict Lock Ordering: watchMu → navMx → mx
  2. Centralized State Transitions: navigateTo() is the single entry point
  3. Panic Recovery: All goroutines use safeGo()
  4. Graceful Shutdown: Atomic stopping flag prevents new operations
  5. Thread-Safe UI: All UI updates via QueueUpdateDraw()
  6. Watch/Refresh Sync: Debounced watchers with stop guards
  7. Error Recovery: Exponential backoff for transient failures

These patterns ensure k13d remains responsive and stable even under high load, rapid navigation, and concurrent operations.


References

  • k9s Stability: github.com/derailed/k9s (RunE pattern, watch debouncing)
  • tview Threading: github.com/rivo/tview (QueueUpdateDraw documentation)
  • Go Concurrency: golang.org/doc/effective_go#concurrency
  • Lock Ordering: "The Little Book of Semaphores" by Allen B. Downey