ADR 007: Weak Network Resilience Analysis

Status

Accepted (2026-02-12)

Context

A comprehensive analysis of the project's weak network resilience, conducted against v3.0.4, revealed the following critical issues:

  1. Inconsistent network components: The project has four independent network request implementations (FetchWorker, NetworkUtils, NetworkWorker, ComponentService), each with its own timeout, retry, and backoff strategy
  2. NetworkWorker lacks a timeout: It waits on QEventLoop::exec() with no time limit, so a thread can block permanently under weak network conditions
  3. FetchWorker does not retry after a timeout: Timeouts are the most common error under weak network conditions, yet FetchWorker skips retries when one occurs
  4. Timeout values are unsuitable: FetchWorker's 8-10s timeouts are too short for weak network environments
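
Issue 2 comes down to an unbounded wait. The following is a minimal sketch in standard C++ (not Qt) of the bounded alternative. `waitWithTimeout` is a hypothetical helper and not part of the project. In Qt code a similar effect is commonly achieved with a single-shot QTimer that quits the event loop, or with QNetworkRequest::setTransferTimeout (Qt 5.15+).

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <optional>

// Hypothetical helper: wait for a reply for at most `timeout`.
// Returns std::nullopt on timeout instead of blocking forever.
template <typename T>
std::optional<T> waitWithTimeout(std::future<T>& reply,
                                 std::chrono::milliseconds timeout) {
    assert(reply.valid());  // waiting on an invalid future is undefined behavior
    if (reply.wait_for(timeout) == std::future_status::ready) {
        return reply.get();
    }
    return std::nullopt;  // caller decides whether to retry or give up
}
```

The caller always regains control after `timeout`, so a stalled connection surfaces as an error that can be retried rather than as a permanently blocked thread.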

Analysis Conclusions

Issue Severity Classification

| Severity | Issue | Impact |
|----------|-------|--------|
| Critical | NetworkWorker has no timeout and can block permanently | Application unresponsive |
| Critical | FetchWorker does not retry after timeout | All batch exports fail under weak network |
| Critical | Timeout too short (8-10s) | Initial connection likely to time out when RTT > 2s |
| Moderate | Thundering herd effect | Bandwidth contention intensified |
| Moderate | Linear backoff (not exponential) | Slow recovery from rate limiting |
| Moderate | Fixed 500ms retry delay | Too aggressive under weak network |
| Low | QNAM thread_local leak | Memory growth during long runs |

Component Weak Network Capability Comparison

| Component | Timeout | Retry | Backoff Strategy | Weak Network Rating |
|-----------|---------|-------|------------------|---------------------|
| FetchWorker | 8-10s | 3x (no retry on timeout) | Linear +1000ms | Poor |
| NetworkUtils | 30s | 3x (retries on timeout) | Incremental 3/5/10s | Good |
| NetworkWorker | None | None | None | Very Poor |
| ComponentService | 15s | 3x | Incremental 1/2/3s | Fair |

Decision

Improvement Directions

Based on the analysis conclusions, improvements are recommended at two priority levels:

P0 Improvements (Priority)

  1. Unify network timeout mechanism
     • Add timeout protection to NetworkWorker
     • Evaluate migrating NetworkWorker to FetchWorker or NetworkUtils
  2. Fix FetchWorker timeout retry logic
     • Perform normal retries on OperationCanceledError (timeout)
     • Align with NetworkUtils timeout retry behavior
  3. Adjust timeout values
     • Component info: 8s -> 15-20s
     • 3D models: 10s -> 30s
     • Align with NetworkUtils timeout configuration
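
The P0 retry and timeout changes can be sketched as follows. This is an illustration in standard C++, not the project's code: `FetchError`, `RequestKind`, `isRetryable`, `timeoutFor`, and `fetchWithRetry` are hypothetical names, and the enum stands in for the Qt reply errors (a timeout is assumed to surface as OperationCanceledError). The delay between attempts is omitted for brevity.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical error categories; real code would map reply errors onto these.
enum class FetchError { None, Timeout, NotFound };

// Timeouts are treated as retryable; a permanent error such as NotFound is not.
constexpr bool isRetryable(FetchError e) {
    return e == FetchError::Timeout;
}

enum class RequestKind { ComponentInfo, Model3D };

// Timeout values proposed above (upper bound of the 15-20s range for component info).
constexpr std::chrono::seconds timeoutFor(RequestKind kind) {
    return kind == RequestKind::Model3D ? std::chrono::seconds(30)
                                        : std::chrono::seconds(20);
}

// Calls `attempt(i)` up to `maxAttempts` times and returns the last error.
// Stops early on success or on a non-retryable error.
template <typename Attempt>
FetchError fetchWithRetry(Attempt attempt, int maxAttempts) {
    assert(maxAttempts > 0);
    FetchError last = FetchError::None;
    for (int i = 0; i < maxAttempts; ++i) {
        last = attempt(i);
        if (last == FetchError::None || !isRetryable(last)) {
            break;
        }
    }
    return last;
}
```

With `maxAttempts` set to 3, matching the retry count the components already use, a request that times out twice and then succeeds completes normally instead of failing on the first timeout.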

P1 Improvements (Follow-up)

  1. Unify backoff strategy
     • Implement true exponential backoff (1s -> 2s -> 4s -> 8s)
     • Add random jitter to avoid thundering herd effects
  2. Unify retry delays
     • Adopt NetworkUtils incremental delay pattern (3s -> 5s -> 10s)
  3. Fix QNAM memory management
     • Use std::unique_ptr to manage thread_local objects
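
The exponential backoff with jitter named in the first P1 item could look like the sketch below, in standard C++. `backoffDelay` and `withJitter` are hypothetical names. The 8s cap follows the 1s -> 2s -> 4s -> 8s sequence above, and the jitter range (a uniform pick between half the delay and the full delay) is an assumption, since the ADR does not specify one.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <random>

using Millis = std::chrono::milliseconds;

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelay.
// With the defaults this yields 1s, 2s, 4s, 8s, 8s, ...
inline Millis backoffDelay(int attempt,
                           Millis base = Millis(1000),
                           Millis maxDelay = Millis(8000)) {
    assert(attempt >= 0);
    const int shift = std::min(attempt, 20);  // keep the shift far from overflow
    return std::min(Millis(base.count() << shift), maxDelay);
}

// Spreads retries out by picking a random delay in [delay / 2, delay],
// so that clients which failed together do not all retry at the same instant.
template <typename Rng>
Millis withJitter(Millis delay, Rng& rng) {
    std::uniform_int_distribution<Millis::rep> dist(delay.count() / 2, delay.count());
    return Millis(dist(rng));
}
```

A caller would sleep for `withJitter(backoffDelay(attempt), rng)` before each retry, with `rng` being any standard random engine such as `std::mt19937`.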

Consequences

Positive

  • Clarified the current state of weak network support and improvement directions
  • Provided clear priority guidance for network resilience improvements in future versions
  • Standardized network strategy across components

Negative

  • Improvement work requires additional development and testing time
  • Increasing timeout values may slightly increase waiting time under normal network conditions (mitigated by dynamic timeout strategies)

Risks

  • Before NetworkWorker is changed, it must be confirmed that the component is still actively used, to avoid unnecessary modifications
  • Timeout adjustments need to balance weak network resilience with user experience

References