Introduction

In enterprise environments, API integrations are critical infrastructure. A single point of failure can cascade into data inconsistencies, lost transactions, and frustrated users. Over the years, I've learned that the difference between a good integration and a great one isn't what happens when everything works—it's how gracefully the system handles failures.

This article covers the patterns, practices, and techniques I use to build integrations that stay resilient under real-world conditions: network hiccups, API rate limits, service outages, and the inevitable unexpected errors.

The Reality of API Failures

Let's start with a hard truth: all APIs fail eventually. Network issues, server problems, rate limits, deployment windows, and bugs are inevitable. Your integration needs to expect and handle these scenarios.

Common Failure Scenarios

  • Transient failures: Temporary network issues, timeouts, brief service interruptions
  • Rate limiting: Too many requests in a time window (HTTP 429)
  • Service degradation: API responds slowly but doesn't fail completely
  • Partial failures: Some records succeed, others fail in batch operations
  • Data inconsistencies: Response received but processing failed locally
  • Authentication expiration: Tokens expire mid-operation
  • Schema changes: API updates break existing integrations
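
One way to make these scenarios actionable is to decide up front how each kind of response will be handled. The sketch below is illustrative only: `FailureKind` and `FailureClassifier` are hypothetical names, and the status-code mapping is a common convention rather than something every API follows.

```csharp
using System.Net;

// Hypothetical categories, named here for illustration only
public enum FailureKind { None, Transient, RateLimited, AuthExpired, Permanent }

public static class FailureClassifier
{
    public static FailureKind Classify(HttpStatusCode status)
    {
        var code = (int)status;

        if (code < 400) return FailureKind.None;
        if (status == HttpStatusCode.TooManyRequests) return FailureKind.RateLimited; // 429
        if (status == HttpStatusCode.Unauthorized) return FailureKind.AuthExpired;    // 401
        if (status == HttpStatusCode.RequestTimeout || code >= 500)                   // 408, 5xx
            return FailureKind.Transient;

        // Remaining 4xx responses usually mean the request itself is wrong
        return FailureKind.Permanent;
    }
}
```

Transient and rate-limited results are candidates for a retry, an expired token calls for a refresh, and a permanent failure should go straight to logging and alerting.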

Core Principles of Resilient Integrations

1. Idempotency: Safe to Retry

Design your integration operations to be idempotent—executing the same request multiple times produces the same result as executing it once. This makes retries safe.

✅ Implementing Idempotency

public class OrderSyncService
{
    public async Task SyncOrderAsync(string orderId, Order orderData)
    {
        // Millisecond precision, so two edits within the same second get distinct keys
        var idempotencyKey = $"order-sync-{orderId}-{orderData.LastModified:yyyyMMddHHmmssfff}";
        
        // Check if we've already processed this exact version
        if (await _cache.ExistsAsync(idempotencyKey))
        {
            _logger.LogInformation($"Order {orderId} already synced, skipping");
            return;
        }
        
        // Perform the sync operation
        await _apiClient.UpdateOrderAsync(orderId, orderData);
        
        // Mark as completed
        await _cache.SetAsync(idempotencyKey, "completed", TimeSpan.FromDays(7));
    }
}

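Key points: Use a unique identifier combining the resource ID and a version marker (timestamp, version number, or hash). Record completion only after the API call succeeds, so a failed attempt is retried rather than skipped. The check and the write are not atomic: two workers handling the same version at the same time can both call the API, so the target operation should itself tolerate a duplicate update.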
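
When the source record has no reliable timestamp or version number, a hash of the payload can serve as the version marker. The following is a minimal sketch, assuming .NET 5 or later (for `SHA256.HashData` and `Convert.ToHexString`); `IdempotencyKeys` and `ForPayload` are hypothetical names, and the `order-sync` prefix simply mirrors the example above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class IdempotencyKeys
{
    // Builds a key from the resource ID plus a SHA-256 hash of the serialized payload,
    // so any change to the payload yields a different key.
    public static string ForPayload<T>(string resourceId, T payload)
    {
        var json = JsonSerializer.Serialize(payload);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(json));
        return $"order-sync-{resourceId}-{Convert.ToHexString(hash)}";
    }
}
```

The trade-off is that the key now depends on how the payload serializes: a field that changes on every read, such as a retrieval timestamp, would defeat the deduplication and should be left out of the hashed object.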

2. Retry Logic with Exponential Backoff

When an API call fails, don't immediately retry. Use exponential backoff to progressively increase wait times between attempts.

✅ Exponential Backoff Implementation

public class ResilientApiClient
{
    private readonly int[] _retryDelays = { 1000, 2000, 4000, 8000, 16000 }; // milliseconds
    
    public async Task<T> ExecuteWithRetryAsync<T>(
        Func<Task<T>> operation,
        string operationName)
    {
        Exception lastException = null;
        
        for (int attempt = 0; attempt <= _retryDelays.Length; attempt++)
        {
            try
            {
                _logger.LogDebug($"{operationName}: Attempt {attempt + 1}");
                return await operation();
            }
            // Filter on Exception so IsTransient also sees timeouts and API errors
            catch (Exception ex) when (IsTransient(ex))
            {
                lastException = ex;
                
                if (attempt < _retryDelays.Length)
                {
                    var delay = _retryDelays[attempt];
                    _logger.LogWarning(
                        $"{operationName} failed, retrying in {delay}ms: {ex.Message}"
                    );
                    await Task.Delay(delay);
                }
            }
            catch (Exception ex)
            {
                _logger.LogError($"{operationName} failed with non-retryable error: {ex}");
                throw;
            }
        }
        
        throw new IntegrationException(
            $"{operationName} failed after {_retryDelays.Length + 1} attempts",
            lastException
        );
    }
    
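    private bool IsTransient(Exception ex)
    {
        // HttpClient reports its own timeouts as TaskCanceledException
        if (ex is TaskCanceledException) return true;

        // ApiException is assumed to carry the HTTP status code of the response
        if (ex is ApiException apiEx) return apiEx.StatusCode >= 500;

        // StatusCode (.NET 5+) is null when no response arrived, e.g. DNS or connection failures
        if (ex is HttpRequestException httpEx)
            return httpEx.StatusCode is null || (int)httpEx.StatusCode >= 500;

        return false;
    }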
}
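
The fixed delay table in `ResilientApiClient` doubles at each step. The same schedule can be computed rather than listed, which makes it easier to change the base delay or add a ceiling. This is a sketch under assumptions: `BackoffSchedule` and `DelayFor` are hypothetical names, and the cap is a parameter chosen by the caller rather than a recommended value.

```csharp
using System;

public static class BackoffSchedule
{
    // Delay before retry number `attempt` (0-based): baseDelay * 2^attempt, capped at maxDelay
    public static TimeSpan DelayFor(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        // Clamp the exponent so the multiplication stays well within range for large attempt counts
        var factor = Math.Pow(2, Math.Min(attempt, 30));
        var delayMs = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(delayMs);
    }
}
```

With a base delay of one second, attempts 0 through 4 give 1, 2, 4, 8, and 16 seconds, the same values as the `_retryDelays` array.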

3. Circuit Breaker Pattern

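Don't keep hammering a failing API. Implement a circuit breaker that stops sending requests once the number of failures reaches a threshold, then periodically tests whether the service has recovered. The example below is not thread-safe as written, so guard its state with a lock if one breaker instance is shared across concurrent calls.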

Circuit Breaker States

  • Closed: Normal operation, requests pass through
  • Open: Too many failures detected, requests fail immediately
  • Half-Open: Testing if service recovered, limited requests allowed

public class CircuitBreaker
{
    private CircuitState _state = CircuitState.Closed;
    private int _failureCount = 0;
    private DateTime _lastFailureTime;
    private readonly int _failureThreshold = 5;
    private readonly TimeSpan _openDuration = TimeSpan.FromMinutes(2);
    
    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        if (_state == CircuitState.Open)
        {
            if (DateTime.UtcNow - _lastFailureTime > _openDuration)
            {
                _state = CircuitState.HalfOpen;
                _logger.LogInformation("Circuit entering half-open state");
            }
            else
            {
                throw new CircuitBreakerOpenException(
                    "Circuit breaker is open, service unavailable"
                );
            }
        }
        
        try
        {
            var result = await operation();
            
            if (_state == CircuitState.HalfOpen)
            {
                _state = CircuitState.Closed;
                _logger.LogInformation("Circuit closed, service recovered");
            }
            
            // Reset on every success so that only consecutive failures open the circuit
            _failureCount = 0;
            
            return result;
        }
        catch (Exception)
        {
            _failureCount++;
            _lastFailureTime = DateTime.UtcNow;
            
            if (_failureCount >= _failureThreshold)
            {
                _state = CircuitState.Open;
                _logger.LogWarning(
                    $"Circuit breaker opened after {_failureCount} failures"
                );
            }
            
            throw;
        }
    }
}

Handling Rate Limits

Most APIs enforce rate limits. Your integration must respect these limits and handle 429 responses gracefully.

Strategy 1: Respect Rate Limit Headers

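public async Task<HttpResponseMessage> SendRequestAsync(
    Func<HttpRequestMessage> requestFactory,
    int maxRateLimitRetries = 3)
{
    for (int attempt = 0; ; attempt++)
    {
        // An HttpRequestMessage cannot be sent twice, so build a fresh one per attempt
        using var request = requestFactory();
        var response = await _httpClient.SendAsync(request);
        
        if (response.StatusCode != HttpStatusCode.TooManyRequests ||
            attempt >= maxRateLimitRetries)
        {
            return response;
        }
        
        var waitTime = GetRateLimitDelay(response);
        if (waitTime == null)
        {
            // No usable hint from the server; let the caller decide what to do
            return response;
        }
        
        response.Dispose();
        _logger.LogWarning($"Rate limited, waiting {waitTime.Value.TotalSeconds:F0} seconds");
        await Task.Delay(waitTime.Value);
    }
}

private static TimeSpan? GetRateLimitDelay(HttpResponseMessage response)
{
    // Retry-After holds either a number of seconds or an HTTP date
    var retryAfter = response.Headers.RetryAfter;
    if (retryAfter?.Delta is TimeSpan delta)
    {
        return delta;
    }
    if (retryAfter?.Date is DateTimeOffset date)
    {
        var untilDate = date - DateTimeOffset.UtcNow;
        return untilDate > TimeSpan.Zero ? untilDate : TimeSpan.Zero;
    }
    
    // X-RateLimit-Reset is not standardized; this assumes Unix time in seconds
    if (response.Headers.TryGetValues("X-RateLimit-Reset", out var resetValues) &&
        long.TryParse(resetValues.FirstOrDefault(), out var resetSeconds))
    {
        var untilReset = DateTimeOffset.FromUnixTimeSeconds(resetSeconds) - DateTimeOffset.UtcNow;
        return untilReset > TimeSpan.Zero ? untilReset : TimeSpan.Zero;
    }
    
    return null;
}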

Strategy 2: Implement Token Bucket

Proactively limit your request rate to stay under API limits.

public class TokenBucket
{
    private int _tokens;
    private readonly int _capacity;
    private readonly int _refillRate; // tokens per second
    private DateTime _lastRefill;
    private readonly SemaphoreSlim _lock = new SemaphoreSlim(1, 1);
    
    public TokenBucket(int capacity, int refillRate)
    {
        _capacity = capacity;
        _refillRate = refillRate;
        _tokens = capacity;
        _lastRefill = DateTime.UtcNow;
    }
    
    public async Task WaitForTokenAsync()
    {
        while (true)
        {
            await _lock.WaitAsync();
            try
            {
                Refill();
                
                if (_tokens > 0)
                {
                    _tokens--;
                    return;
                }
            }
            finally
            {
                _lock.Release();
            }
            
            // No token available: wait outside the lock for about one refill interval
            // (assumes a positive refill rate)
            await Task.Delay(TimeSpan.FromSeconds(1.0 / _refillRate));
        }
    }
    
    // Adds the tokens earned since the last refill; call only while holding _lock
    private void Refill()
    {
        var now = DateTime.UtcNow;
        var tokensToAdd = (int)((now - _lastRefill).TotalSeconds * _refillRate);
        
        if (tokensToAdd > 0)
        {
            _tokens = Math.Min(_capacity, _tokens + tokensToAdd);
            // Advance only by the time the whole tokens represent, keeping the fractional remainder
            _lastRefill = _lastRefill.AddSeconds((double)tokensToAdd / _refillRate);
        }
    }
}

Maintaining Data Consistency

Use a Message Queue

Don't process integrations synchronously in user-facing operations. Queue messages for asynchronous processing.

Benefits:

  • Decouples API failures from user operations
  • Provides natural retry mechanism
  • Enables processing during API maintenance windows
  • Scales independently of source system

// Dynamics 365 Plugin - Queue the operation
public class AccountUpdatePlugin : IPlugin
{
    public void Execute(IServiceProvider serviceProvider)
    {
        var context = (IPluginExecutionContext)serviceProvider
            .GetService(typeof(IPluginExecutionContext));
        
        var account = (Entity)context.InputParameters["Target"];
        
        // Queue message instead of calling API directly
        var message = new IntegrationMessage
        {
            EntityName = "account",
            EntityId = account.Id,
            Operation = "update",
            Timestamp = DateTime.UtcNow
        };
        
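        // Execute is synchronous, so block until the send completes;
        // otherwise a failed send would go unnoticed
        _serviceBus.SendMessageAsync("account-sync", message).GetAwaiter().GetResult();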
    }
}

// Azure Function - Process from queue with retry
[FunctionName("ProcessAccountSync")]
public async Task ProcessAccountSync(
    [ServiceBusTrigger("account-sync")] IntegrationMessage message,
    ILogger log)
{
    try
    {
        await _resilientApiClient.ExecuteWithRetryAsync(
            async () => await SyncAccountToExternalSystemAsync(message),
            "SyncAccount"
        );
    }
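    catch (Exception ex)
    {
        log.LogError(ex, "Failed to sync account {EntityId}", message.EntityId);
        // Rethrow so Service Bus redelivers the message; once the queue's
        // max delivery count is exceeded it moves to the dead letter queue
        throw;
    }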
}

Dead Letter Queue Handling

When messages fail repeatedly, they move to a dead letter queue. Monitor and process these.

[FunctionName("ProcessDeadLetters")]
public async Task ProcessDeadLetters(
    [TimerTrigger("0 */15 * * * *")] TimerInfo timer,
    ILogger log)
{
    // Dispose the receiver when the function returns so its link is closed
    await using var receiver = _serviceBusClient.CreateReceiver(
        "account-sync",
        new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter }
    );
    
    var messages = await receiver.ReceiveMessagesAsync(maxMessages: 10);
    
    foreach (var message in messages)
    {
        var body = message.Body.ToObjectFromJson<IntegrationMessage>();
        
        // Log for manual investigation
        await _alertingService.SendAlertAsync(
            "Dead Letter Message",
            $"Entity: {body.EntityName}, ID: {body.EntityId}, " +
            $"Reason: {message.DeadLetterReason}"
        );
        
        // Complete the message to remove from dead letter queue
        await receiver.CompleteMessageAsync(message);
    }
}

Monitoring and Alerting

What to Monitor

  • Success rate: Percentage of successful API calls
  • Response times: Average, p95, p99 latencies
  • Error rates by type: Transient vs. permanent failures
  • Queue depth: Number of pending messages
  • Dead letter count: Messages that failed repeatedly
  • Circuit breaker state: How often it opens
  • Rate limit hits: How close to limits you're running
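
Monitoring platforms normally compute percentiles for you, but it helps to know what p95 and p99 mean when reading a dashboard. The sketch below uses the nearest-rank method on an in-memory list of latencies; `LatencyStats` is a hypothetical helper, and real telemetry backends may use a different, approximate algorithm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it
    public static double Percentile(IReadOnlyCollection<double> samplesMs, double percentile)
    {
        if (samplesMs.Count == 0)
            throw new ArgumentException("No samples", nameof(samplesMs));
        if (percentile <= 0 || percentile > 100)
            throw new ArgumentOutOfRangeException(nameof(percentile));

        var sorted = samplesMs.OrderBy(x => x).ToArray();
        var rank = (int)Math.Ceiling(percentile * sorted.Length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }
}
```

An average hides slow outliers, while p95 and p99 report how slow the worst 5% and 1% of calls are, and those are usually the calls that trigger timeouts and retries.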

Implementation with Application Insights

public class MonitoredApiClient
{
    private readonly TelemetryClient _telemetry;
    
    public async Task<T> CallApiAsync<T>(string endpoint, object request)
    {
        var sw = Stopwatch.StartNew();
        var success = false;
        
        try
        {
            var result = await _httpClient.PostAsJsonAsync(endpoint, request);
            success = result.IsSuccessStatusCode;
            
            if (!success)
            {
                _telemetry.TrackEvent("ApiCallFailed", new Dictionary<string, string>
                {
                    { "endpoint", endpoint },
                    { "statusCode", ((int)result.StatusCode).ToString() }
                });
                
                // Throw instead of trying to deserialize an error body as T
                result.EnsureSuccessStatusCode();
            }
            
            return await result.Content.ReadFromJsonAsync<T>();
        }
        finally
        {
            sw.Stop();
            
            _telemetry.TrackMetric("ApiCallDuration", sw.ElapsedMilliseconds, 
                new Dictionary<string, string>
                {
                    { "endpoint", endpoint },
                    { "success", success.ToString() }
                });
        }
    }
}

Timeout Configuration

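Always set explicit timeouts, and never let an operation hang indefinitely. HttpClient defaults to a 100-second timeout, which is usually far too long for a user-facing call. Create the client once and reuse it rather than constructing a new one per request.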

var httpClient = new HttpClient
{
    Timeout = TimeSpan.FromSeconds(30)
};

// For specific operations, pass a CancellationToken with a shorter timeout
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
try
{
    var response = await httpClient.GetAsync(url, cts.Token);
}
catch (OperationCanceledException ex) when (cts.IsCancellationRequested)
{
    _logger.LogWarning("Request timed out after 10 seconds");
    // Keep the original exception as the inner exception for diagnostics
    throw new TimeoutException("API request timed out", ex);
}
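
When the caller already supplies its own CancellationToken, the per-operation timeout can be combined with it through a linked token source, so that either one cancels the request. The following is a minimal sketch; `TimeoutHelper` and `WithTimeoutAsync` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutHelper
{
    // Runs `operation` with a token that is cancelled either by the caller
    // or when `timeout` elapses, whichever happens first.
    public static async Task<T> WithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan timeout,
        CancellationToken callerToken = default)
    {
        using var timeoutCts = new CancellationTokenSource(timeout);
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
            callerToken, timeoutCts.Token);

        try
        {
            return await operation(linkedCts.Token);
        }
        catch (OperationCanceledException ex)
            when (timeoutCts.IsCancellationRequested && !callerToken.IsCancellationRequested)
        {
            // Only the timeout fired: report it as a timeout rather than a cancellation
            throw new TimeoutException(
                $"Operation timed out after {timeout.TotalSeconds:F0} seconds", ex);
        }
    }
}
```

A call such as `TimeoutHelper.WithTimeoutAsync(ct => httpClient.GetAsync(url, ct), TimeSpan.FromSeconds(10), callerToken)` then surfaces a TimeoutException when the deadline passes, while a cancellation requested by the caller still propagates as an OperationCanceledException.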

Authentication Resilience

Token expiration is a common failure point. Implement automatic token refresh.

public class TokenManager
{
    private string _accessToken;
    private DateTime _tokenExpiry;
    private readonly SemaphoreSlim _refreshLock = new SemaphoreSlim(1, 1);
    
    public async Task<string> GetValidTokenAsync()
    {
        // Check if token is still valid (with 5 minute buffer)
        if (_accessToken != null && 
            DateTime.UtcNow < _tokenExpiry.AddMinutes(-5))
        {
            return _accessToken;
        }
        
        // Ensure only one thread refreshes at a time
        await _refreshLock.WaitAsync();
        try
        {
            // Double-check after acquiring lock
            if (_accessToken != null && 
                DateTime.UtcNow < _tokenExpiry.AddMinutes(-5))
            {
                return _accessToken;
            }
            
            // Refresh token
            var tokenResponse = await RequestNewTokenAsync();
            _accessToken = tokenResponse.AccessToken;
            _tokenExpiry = DateTime.UtcNow.AddSeconds(tokenResponse.ExpiresIn);
            
            return _accessToken;
        }
        finally
        {
            _refreshLock.Release();
        }
    }
}
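
A token manager only helps if every outgoing request actually uses it. One option is a DelegatingHandler that asks for a valid token just before each send. This is a sketch rather than a definitive implementation: `BearerTokenHandler` is a hypothetical name, and it takes a delegate so that it does not depend on any particular token manager class.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: attaches a bearer token from the supplied delegate to every request
public class BearerTokenHandler : DelegatingHandler
{
    private readonly Func<Task<string>> _getToken;

    public BearerTokenHandler(Func<Task<string>> getToken)
    {
        _getToken = getToken;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        // Fetch the token per request so a refreshed token is picked up automatically
        var token = await _getToken();
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
        return await base.SendAsync(request, cancellationToken);
    }
}
```

It could be wired up as `new HttpClient(new BearerTokenHandler(tokenManager.GetValidTokenAsync) { InnerHandler = new HttpClientHandler() })`, which keeps token handling out of the individual API calls.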

Testing Resilience

Chaos Engineering

Test your integration's resilience by intentionally introducing failures.

public class ChaosHttpHandler : DelegatingHandler
{
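    // Random.Shared (.NET 6+) is thread-safe; a plain Random instance is not,
    // and a handler can be invoked by several requests at once
    private readonly Random _random = Random.Shared;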
    private readonly double _failureRate;
    
    public ChaosHttpHandler(double failureRate = 0.1)
    {
        _failureRate = failureRate;
    }
    
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        // Randomly fail requests
        if (_random.NextDouble() < _failureRate)
        {
            throw new HttpRequestException("Chaos: Simulated network failure");
        }
        
        // Randomly delay requests
        if (_random.NextDouble() < 0.2)
        {
            await Task.Delay(_random.Next(1000, 5000), cancellationToken);
        }
        
        return await base.SendAsync(request, cancellationToken);
    }
}

Best Practices Checklist

  • ✅ Implement idempotent operations
  • ✅ Use exponential backoff for retries
  • ✅ Implement circuit breaker pattern
  • ✅ Respect rate limits with token bucket
  • ✅ Use message queues for async processing
  • ✅ Monitor dead letter queues
  • ✅ Set appropriate timeouts
  • ✅ Implement automatic token refresh
  • ✅ Log comprehensively for debugging
  • ✅ Monitor success rates and latencies
  • ✅ Alert on anomalies
  • ✅ Test with chaos engineering

Conclusion

Resilient API integrations don't happen by accident. They require deliberate design choices, proper error handling, and comprehensive monitoring. The patterns covered here—idempotency, retries, circuit breakers, rate limiting, and queue-based processing—form the foundation of reliable integrations that enterprises can depend on.

Remember: the goal isn't to prevent all failures (that's impossible), but to handle failures gracefully so they don't cascade into bigger problems. When you build resilience into your integrations from the start, you'll spend less time firefighting and more time delivering value.