Resilience Overview

Resilience in NPipeline refers to the ability of your data pipelines to detect, handle, and recover from failures without complete system breakdown. This section provides a comprehensive guide to building robust, fault-tolerant pipelines.

⚡ Quick Start: Node Restart

If you want to enable node restarts, start here: Node Restart Quick Start Checklist

Node restart requires three mandatory configuration steps; missing any one of them causes silent failures. The quickstart guide is the canonical source of truth for configuring all three prerequisites correctly.


Why Resilience Matters

In production environments, pipelines inevitably encounter failures from various sources:

  • Transient infrastructure issues: Network timeouts, database connection failures
  • Data quality problems: Invalid formats, missing values, unexpected data types
  • Resource constraints: Memory pressure, CPU saturation, I/O bottlenecks
  • External service dependencies: API rate limits, service outages, authentication failures

Without proper resilience mechanisms, these failures can cascade through your pipeline, causing data loss, system instability, and costly manual intervention.

Resilience Strategy Comparison

| Strategy | Best For | Memory Requirements | Complexity | Key Benefits |
| --- | --- | --- | --- | --- |
| Simple Retry | Transient failures (network timeouts, temporary service issues) | Low | Low | Quick recovery from temporary issues |
| Node Restart | Persistent node failures, resource exhaustion | Medium (requires materialization) | Medium | Complete recovery from node-level failures |
| Circuit Breaker | Protecting against cascading failures, external service dependencies | Low | Medium | Prevents system overload during outages |
| Dead-Letter Queues | Handling problematic items that can't be processed | Low | High | Preserves problematic data for manual review |
| Combined Approach | Production systems with multiple failure types | High | High | Comprehensive protection against all failure types |

Choosing the Right Strategy

  • For simple pipelines with basic needs: Start with Simple Retry
  • For streaming data processing: Use Node Restart with materialization
  • For external service dependencies: Add Circuit Breaker to prevent cascade failures
  • For critical data pipelines: Implement Dead-Letter Queues to preserve failed items
  • For production systems: Combine multiple strategies for comprehensive protection

Core Resilience Components

NPipeline's resilience framework is built around several interconnected components:

| Component | Role | Critical Dependency |
| --- | --- | --- |
| ResilientExecutionStrategy | Wrapper that enables recovery capabilities for nodes | Prerequisite for all resilience features |
| Materialization & Buffering | Buffers input items to enable replay during restarts | Required for PipelineErrorDecision.RestartNode |
| Error Handling | Determines how to respond to different types of failures | Provides decision logic for recovery actions |
| Retry Options | Configures retry limits and materialization caps | Controls resilience behavior boundaries |
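To show how these components fit together, here is a minimal sketch of wrapping a node with `ResilientExecutionStrategy`. The builder and node names (`pipelineBuilder`, `AddNode`, `WithExecutionStrategy`, `ParseRecordsNode`) are illustrative assumptions, not the exact NPipeline API; only `ResilientExecutionStrategy` and `PipelineRetryOptions` come from this document.

```csharp
// Sketch only: the registration calls below are hypothetical.
// The retry options bound the resilience behavior (see table above).
var retryOptions = new PipelineRetryOptions(
    MaxItemRetries: 3,          // bounded per-item retries
    MaxNodeRestartAttempts: 2,  // bounded node restarts
    MaxMaterializedItems: 1000  // buffer cap that enables replay
);

// Wrapping the node is what activates recovery; without the wrapper,
// error-handler decisions are ignored and the pipeline simply fails.
pipelineBuilder
    .AddNode(new ParseRecordsNode()) // hypothetical node
    .WithExecutionStrategy(new ResilientExecutionStrategy(retryOptions));
```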

⚠️ Critical Prerequisites for Node Restart (RestartNode)

If you intend to use PipelineErrorDecision.RestartNode to recover from failures, read the Node Restart Quick Start Checklist first.

You must configure all three of the following mandatory prerequisites. The quickstart guide provides detailed step-by-step instructions for each requirement.

💡 Pro Tip: The NPipeline build-time analyzer (NP9002) detects incomplete resilience configurations at compile-time, preventing these silent failures. See Build-Time Resilience Analyzer for details.

Mandatory Requirements Summary

  • Requirement 1: ResilientExecutionStrategy

    • The node must be wrapped with ResilientExecutionStrategy
    • Without this: Restart decisions are ignored; node cannot recover
    • See detailed instructions: Node Restart Quick Start Checklist
  • Requirement 2: MaxNodeRestartAttempts Configuration

    • Set MaxNodeRestartAttempts > 0 in PipelineRetryOptions
    • Without this: the pipeline has no restart budget, so the node is never restarted
    • See detailed instructions: Node Restart Quick Start Checklist

  • Requirement 3: MaxMaterializedItems Configuration

    • Set MaxMaterializedItems > 0 in PipelineRetryOptions (for streaming inputs)
    • This enables the input stream to be buffered/materialized for replay
    • Critical: Without this, even if RestartNode is requested, the pipeline will fall back to FailPipeline
    • See detailed instructions: Node Restart Quick Start Checklist
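The prerequisites above cover configuration; the restart itself is triggered by an error handler returning `PipelineErrorDecision.RestartNode`. A minimal sketch of such a handler follows. The handler signature is an assumption; only the decision values (`RestartNode`, `FailPipeline`) appear in this document.

```csharp
// Sketch: handler shape is hypothetical; decision values are from the docs.
PipelineErrorDecision HandlePipelineError(Exception error)
{
    // Restart for failures a clean restart can plausibly fix,
    // e.g. a broken connection held in node state.
    if (error is TimeoutException or IOException)
        return PipelineErrorDecision.RestartNode;

    // Otherwise fail fast rather than loop on a permanent error.
    return PipelineErrorDecision.FailPipeline;
}
```

Note that RestartNode only takes effect when all three prerequisites above are in place; otherwise the decision is silently downgraded as described in the table below.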

What Happens If You Miss These

| Missing Component | What Goes Wrong | Observable Behavior |
| --- | --- | --- |
| ResilientExecutionStrategy | Restart capability disabled | Error handler decisions are ignored; pipeline always fails |
| MaxMaterializedItems | Input stream not buffered | RestartNode falls back to FailPipeline; entire pipeline halts unexpectedly |
| Error handler returning RestartNode | Restart never triggered | All errors result in pipeline failure, even recoverable ones |

Example of Silent Failure:

// ❌ WRONG: Missing materialization
var options = new PipelineRetryOptions(
    MaxItemRetries: 3,
    MaxNodeRestartAttempts: 2,
    MaxMaterializedItems: null // ← This is the problem!
);

// The developer expects RestartNode to work, but...
// When an error occurs and the handler returns RestartNode:
// → the pipeline sees MaxMaterializedItems is not set
// → it falls back to FailPipeline
// → the entire pipeline halts (unexpected failure!)
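For contrast, here is a sketch of the same options with materialization enabled. The concrete values are illustrative; choose a buffer cap that fits your memory budget.

```csharp
// ✅ Sketch of a corrected configuration (values are illustrative)
var options = new PipelineRetryOptions(
    MaxItemRetries: 3,
    MaxNodeRestartAttempts: 2,
    MaxMaterializedItems: 1000 // > 0: input is buffered, replay is possible
);

// Now, when the error handler returns RestartNode:
// → the buffered input can be replayed to the restarted node
// → the restart proceeds instead of falling back to FailPipeline
```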

For complete configuration examples and detailed explanations, see the Node Restart Quick Start Checklist.

The Dependency Chain

Understanding the dependency relationships between resilience components is crucial for proper configuration:

Figure: The dependency chain showing how resilience components must be configured in the correct sequence.

Critical Dependency Rules

  1. ResilientExecutionStrategy is mandatory: All resilience features require this strategy to be applied to a node
  2. Materialization enables restarts: PipelineErrorDecision.RestartNode only works if the input stream is materialized via MaxMaterializedItems
  3. Buffer size matters: The MaxMaterializedItems value determines how many items can be replayed during a restart
  4. Streaming inputs need materialization: Only streaming inputs require explicit materialization; already-buffered inputs work automatically

Decision Flow for Choosing Resilience Strategies

Use this flow diagram to determine the appropriate resilience configuration for your use case:

Key Scenarios

Scenario 1: Simple Retry Logic

For handling transient failures without node restarts:

  • Apply ResilientExecutionStrategy
  • Configure NodeErrorDecision.Retry or NodeErrorDecision.Skip
  • No materialization required
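This scenario can be sketched as an item-level error handler. The handler shape is an assumption; only the decision values (`NodeErrorDecision.Retry`, `NodeErrorDecision.Skip`) come from this document.

```csharp
// Sketch: handler shape is hypothetical; no materialization is needed
// because the node is never restarted, only individual items are retried.
NodeErrorDecision HandleItemError(Exception error)
{
    // Retry transient failures; skip items that can never succeed.
    return error is TimeoutException
        ? NodeErrorDecision.Retry
        : NodeErrorDecision.Skip;
}
```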

Scenario 2: Node Restart Capability

For recovering from node-level failures:

  • Apply ResilientExecutionStrategy
  • Configure PipelineErrorDecision.RestartNode
  • Set MaxMaterializedItems to enable replay (for streaming inputs)
  • See detailed configuration: Node Restart Quick Start Checklist

Scenario 3: Memory-Constrained Environment

For systems with limited memory:

  • Apply ResilientExecutionStrategy
  • Set MaxMaterializedItems to a conservative value
  • Monitor for buffer overflow exceptions
  • Consider alternative recovery strategies

Next Steps