Transient Fault Handling

Introduction

Transient faults occur on almost every platform, and cloud applications in particular cannot be expected to run without them. This article explains how to handle transient failures.

Definition of a Transient Fault

When a client makes a request to a server, the request may fail for temporary reasons such as:

Network Issues
Infrastructure faults
Explicit throttling

These failures are very common in cloud applications. Retrying the same operation after a short delay may succeed. Such errors are called transient faults. They occur intermittently, which makes them hard to reproduce or track down.

Because the caller often cannot tell in advance whether a failure is transient, retrying the same request a few more times is frequently enough to get a successful response. Performing these retries according to a predefined policy is known as handling transient faults.
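The idea of retrying according to a predefined policy can be sketched in plain C# without any library. The following is a minimal illustration only, not the application block's implementation; the names SimpleRetry, Execute and the isTransient predicate are hypothetical and chosen for this example.

```csharp
using System;
using System.Threading;

public static class SimpleRetry
{
    // Runs 'operation' and retries it up to 'maxRetries' times, waiting a
    // fixed 'delay' between attempts, as long as 'isTransient' classifies
    // the thrown exception as a transient fault.
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        int maxRetries,
        TimeSpan delay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (attempt < maxRetries && isTransient(ex))
            {
                // Transient fault with retries left: wait, then try again.
                Thread.Sleep(delay);
            }
            // Non-transient faults, or a transient fault after the last
            // allowed retry, propagate to the caller unchanged.
        }
    }
}
```

A caller might treat only TimeoutException as transient by passing ex => ex is TimeoutException as the predicate. Any other exception would then surface immediately without a retry.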

Transient Fault Handling Application Block

The Transient Fault Handling Application Block can apply retry policies to operations that an application performs against services that may exhibit transient faults. This makes it easier to implement consistent retry behaviour for any transient faults that may affect the application.

All retry mechanisms, based on the provided strategies, are handled by the Transient Fault Handling Application Block. The block has two parts: detection strategies and retry strategies.

The detection strategies identify transient faults in cloud-based services. They contain the logic that decides whether a given exception counts as a transient fault.

The detection strategies are available for:

SQL database
Azure Service Bus
Azure Storage Service
Azure Caching Service

The application block also lets the user define custom retry strategies, so that the areas where transient faults are likely to occur can be handled according to the user's knowledge of the application. A custom retry strategy includes the logic that decides when a retry should be performed, the maximum number of retries the application should perform, and the time interval between retries.

This kind of logic is known as Conditional retry.

A few built-in strategies exist, which perform the retries based on:

Fixed intervals
Incremental intervals
Random exponential intervals

The interval is the time between consecutive retries after the first exception has occurred. Custom detection strategies can be defined if the built-in strategies do not meet the requirements.
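To make the three interval types concrete, the following sketch computes the wait before each retry in plain C#. It is an illustration under stated assumptions, not the application block's own code: the class and method names are hypothetical, retryCount is taken to be the number of retries already performed (starting at 0), and the jitter range of 0.8 to 1.2 in the exponential case is an assumption made for this example.

```csharp
using System;

public static class RetryIntervals
{
    // Fixed: the same wait before every retry.
    public static TimeSpan Fixed(TimeSpan interval, int retryCount)
    {
        return interval;
    }

    // Incremental: the initial interval plus a constant increment for
    // every retry already performed.
    public static TimeSpan Incremental(TimeSpan initial, TimeSpan increment, int retryCount)
    {
        return initial + TimeSpan.FromTicks(increment.Ticks * retryCount);
    }

    // Random exponential: the wait grows roughly as 2^retryCount, is
    // randomised by a jitter factor, and is capped at maxBackoff.
    public static TimeSpan RandomExponential(
        TimeSpan minBackoff,
        TimeSpan maxBackoff,
        TimeSpan deltaBackoff,
        int retryCount,
        Random random)
    {
        double jitter = 0.8 + 0.4 * random.NextDouble(); // assumed range 0.8 to 1.2
        double deltaMs = (Math.Pow(2, retryCount) - 1) * deltaBackoff.TotalMilliseconds * jitter;
        double totalMs = Math.Min(minBackoff.TotalMilliseconds + deltaMs, maxBackoff.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(totalMs);
    }
}
```

With an initial interval and an increment of one second each, Incremental yields waits of 1, 2 and 3 seconds for retryCount values 0, 1 and 2. The randomisation in the exponential case is intended to keep many clients from retrying at the same instant.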

Following is the pictorial representation of the transient fault handling application block:

A retry policy combines a retry strategy with a detection strategy. The application then wraps its call in the policy's ExecuteAction method so that the call is retried according to that policy.

Implementation

All the prerequisites are handled by the following NuGet packages:

WindowsAzure.TransientFaultHandling
TransientFaultHandlingFx

The following code loads the “Incremental Retry Strategy” from the configuration and uses it to perform the retries:

using System;
using System.Diagnostics;
using Microsoft.Practices.EnterpriseLibrary.Common.Configuration;
using Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling;
using Microsoft.Practices.TransientFaultHandling;

// Get an instance of the RetryManager class.
var retryManager = EnterpriseLibraryContainer.Current.GetInstance<RetryManager>();

// Create a retry policy that uses a retry strategy from the configuration.
var retryPolicy = retryManager.GetRetryPolicy<StorageTransientErrorDetectionStrategy>("Incremental Retry Strategy");

// Receive notifications about retries.
retryPolicy.Retrying += (sender, args) =>
{
    // Log details of the retry.
    var msg = String.Format("Retry - Count:{0}, Delay:{1}, Exception:{2}",
        args.CurrentRetryCount, args.Delay, args.LastException);
    Trace.WriteLine(msg, "Information");
};

try
{
    // Do some work that may result in a transient fault.
    var queues = retryPolicy.ExecuteAction(
        () =>
        {
            // Call a method that uses any Azure client and which may
            // throw a transient exception.
            // namespaceManager is an instance of the NamespaceManager class.
            return namespaceManager.GetQueues();
        });
}
catch (Exception)
{
    // All the retries failed.
}

Thus, by providing the required configuration, we can handle transient failures.

The ExecuteAction method is used to wrap synchronous calls that may be affected by transient faults.

The ExecuteAsync method is used to wrap asynchronous calls that may be affected by transient faults.
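What wrapping an asynchronous call means can be illustrated with a small library-free sketch. This is not the block's ExecuteAsync implementation; SimpleRetryAsync and its parameters are hypothetical names used only here. The point it shows is that the wait between attempts is awaited rather than blocking a thread.

```csharp
using System;
using System.Threading.Tasks;

public static class SimpleRetryAsync
{
    // Awaits 'operation' and retries it up to 'maxRetries' times when
    // 'isTransient' classifies the failure as transient. The delay between
    // attempts is awaited, so no thread is blocked while waiting.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxRetries,
        TimeSpan delay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt < maxRetries && isTransient(ex))
            {
                // Transient fault with retries left: wait asynchronously.
                await Task.Delay(delay).ConfigureAwait(false);
            }
        }
    }
}
```

The synchronous counterpart would block the calling thread during each wait, which is why asynchronous calls are usually wrapped with an asynchronous retry method instead.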

Below is an example of retry strategies defined in the application configuration file:

<RetryPolicyConfiguration defaultRetryStrategy="Fixed Interval Retry Strategy"
                          defaultSqlConnectionRetryStrategy="Backoff Retry Strategy"
                          defaultSqlCommandRetryStrategy="Incremental Retry Strategy"
                          defaultAzureStorageRetryStrategy="Fixed Interval Retry Strategy"
                          defaultAzureServiceBusRetryStrategy="Fixed Interval Retry Strategy">
  <incremental name="Incremental Retry Strategy" retryIncrement="00:00:01" retryInterval="00:00:01" maxRetryCount="10" />
  <fixedInterval name="Fixed Interval Retry Strategy" retryInterval="00:00:01" maxRetryCount="10" />
  <exponentialBackoff name="Backoff Retry Strategy" minBackoff="00:00:01"
                      maxBackoff="00:00:30" deltaBackoff="00:00:10" maxRetryCount="10"
                      fastFirstRetry="false" />
</RetryPolicyConfiguration>

Transient Handling With Database Connection

The Transient Fault Handling Application Block takes care of the retry mechanism when the application uses ADO.NET with SQL Database. The user only needs to provide the retry configuration.

The following code handles the retry mechanism:

public void HandleTransients()
{
    var connStr = "some database";

    // Retry up to 3 times, waiting 5 seconds between attempts.
    var policy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
        retryCount: 3,
        retryInterval: TimeSpan.FromSeconds(5));

    using (var conn = new ReliableSqlConnection(connStr, policy))
    {
        // Do SQL stuff here.
    }
}

The database connection string, the number of retries to be performed and the retry interval should be provided as per the requirement.

With Entity Framework, the retry policy is supplied through a DbConfiguration class:

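using System.Data.Entity;
using System.Data.Entity.SqlServer;

public class EFConfiguration : DbConfiguration
{
    public EFConfiguration()
    {
        // Register the SQL Azure execution strategy for the SQL Server provider.
        SetExecutionStrategy("System.Data.SqlClient", () => new SqlAzureExecutionStrategy());
    }
}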

Conclusion

To avoid unexpected errors, cloud applications should implement transient fault handling, which improves their reliability and usability. It is especially advisable for applications that make frequent use of database connections.