In this post we will look at handling transient failures in distributed applications. Though the design principles can be applied to any programming language, we write our sample code in Clojure - a modern, dynamic, and functional dialect of the Lisp programming language on the Java platform.

Fault Tolerance and Recovery

Most applications face momentary loss of connectivity to network services and resources. These faults are usually self-correcting and the connection attempt likely succeeds if retried after a delay. For example, a database server when under heavy workload, may reject new clients. An application might still be able to connect to the database, if it tries again after a few seconds.

An application may want to apply different retry policies for different network services. It may even want to dynamically change the policy for a service based on the last response received from it. It should be possible to attempt a retry immediately or after computing a suitable delay. It should also be possible to cancel any further retries, if the response indicates that the fault is non-recoverable.

This post shows how various retry strategies can be captured as a simple abstraction. This abstraction is defined in recoil, a resilience and fault-handling library for Clojure.

Let me demonstrate the use of this abstraction in the context of a program that tries to establish a new database connection. For this purpose, it has a function defined called make-db-connection. This function may throw a TimeoutException if the database server fails to respond within a specific amount of time. It may also throw a ConnectionRefusedException if the database rejects the client due to overload. If the function succeeds, the new connection object is returned in a map under the key :ok.

;; Pseudo-code, for illustration purpose only.
(defn make-db-connection [db-config]
  (when-let [connection (db/connect db-config)]
    {:ok connection}))

If there was a connection failure due to Timeout or ConnectionRefused exceptions, the program should re-attempt to connect after a few seconds. The application may not want to retry if there is some other error. It also don’t want to retry infinitely. It should stop after a given number of retries, and return a value indicating that the number of retries has expired. The retry pattern as implemented in recoil allows us to capture these policies as:

(use '[recoil.retry :as r])

(def exec (r/executor {:handle [TimeoutException ConnectionRefusedException]
                       :wait-secs 5
                       :retry 3}))

The call to r/executor returns a function that knows how to apply the given retry policies to a user-defined action. The action must be represented as a zero-argument function and passed to the new executor.

The policy passed to r/executor above means a retry will be initiated in the following scenarios:

  • if action throws either TimeoutException or ConnectionRefusedException
  • if action does not return a map with an :ok entry

The rest of the policy spec means 3 retries will be made 5 seconds apart. If any of those attempts result in a {:ok ...} value, the executor will return that and terminate. If all retries expire without success the executor will return {:status :no-retries-left, :result ...}, where :result will be the last failure value returned by the action or the last handled exception thrown by it. If action throws any exception not listed in :handle, the executor will return {:error :unhandled-exception :exception the-exception-object} and immediately terminate.

Here is how we call the make-db-connection as an action via the new retry executor:

(exec #(make-db-connection db-config))

Dynamically Configuring Wait Delay

Sometimes, the failure value returned by the action can give an indication as to how long to wait before making the next connection attempt. In such cases, dynamically computing the wait-secs from the result or other parameters would be ideal than having a constant delay. For handling use-cases like this, the retry policy allows a key called :wait-fn. The value of this key must be a function with three parameters - the value of the result returned by the last call to action, the current value of wait-secs, and the number of retries remaining. If the last call to action raised a handled exception, it will be wrapped in the result argument as {:error :handled-exception, :exception the-exception-object}. If the :wait-secs key was not set in the policy spec, it will initially come as nil.

The following program shows how to create an executor that implements a simple exponential backoff policy.

(def exec (r/executor {:handle [TimeoutException ConnectionRefusedException]
                       :retry 3
                       :wait-fn (fn [_ wait-secs _] ; ignore the last result and number of retries left.
                                  (if wait-secs
                                    (* wait-secs 2) ; exponential back-off.
                                    1))}))

Retrying in the Background

Operations like obtaining a resource over the network can be lengthy. Applications would like to execute tasks like this in a background thread while the main threads proceeds to do other computations. Later at some point in the future, it would like to check the result of the network call.

This can be easily achieved in Clojure by composing the built-in future function with a retry executor.

(def f (future (exec #(make-db-connection db-config))))
;; Do other stuff while the attempts to connect to the DB
;; proceeds on another thread.

;; Later check the status of the background task:
(if (:ok @f)
   (log "connected to db")
   (log "failed to connect to db"))

Conclusion

In real world applications, failure to acquire a resource or respond to a message is a common occurrence. Fault tolerance strategies like the Retry Pattern are required to mitigate this. This post showed how this pattern could be abstracted into a first-class function - customized by simple configuration data. The abstraction is powerful enough to respond to the runtime behavior of the program and adjust the configuration dynamically. The implementation is also thread-safe and can be easily composed with multi-processing abstractions available in the platform.


Note that name and e-mail are required for posting comments