How Are Backoff Strategies (with Client Retries) Helpful?

Sometimes a client attempts to connect to or use an application. Sometimes a Kubernetes Pod is being created and tries to pull down an image. Sometimes a network device tries to establish a connection to an endpoint. These attempts can initially fail, and retries may then be made in rapid succession. To limit excessive attempts in a short amount of time (so as not to waste resources or effectively mount a denial-of-service attack), a backoff strategy can be implemented.

A backoff strategy is one where a program or algorithm pauses after an initial failed attempt, then tries again later. Ultimately the goal is to connect to another service or use a file (or something similar). But if that other service is busy and constrained (e.g., via concurrency control), the client's attempts may fail.

It might seem like aggressively and persistently retrying to use the resource (e.g., a file) would be the best option after an initial failure. So why use backoff strategies rather than repeated retries at a simple, fixed interval?

For files: fair, round-robin-like access for multiple clients, processes, or applications is achievable over time, provided that simultaneous attempts to access a strictly locked file are avoided. Slowing the retry rate after a certain number of failures benefits both the overall success of the competing processes and the security of the entire software system.

Concurrency control over files, variables in messaging systems, or atomic values in data stores often involves decentralized components. A race condition arises when two or more clients try to use a resource at the same time. To back off from an unavailable resource (e.g., a file or variable that two or more processes are trying to modify), a random amount of time can be used to delay the next attempt. Successive backoffs, as part of a strategy to eventually modify or write the data, typically use durations that increase with each attempt. Excessive failed attempts can create bandwidth congestion or unnecessary CPU and RAM utilization, while randomized retry delays help prevent clients from inadvertently retrying at the same moment.
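The randomized, successively increasing delays described above are commonly implemented as "exponential backoff with full jitter." A minimal sketch (the function name and parameters here are illustrative, not from any particular library):

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Exponential backoff with full jitter.

    The ceiling doubles with each attempt (base * 2**attempt), capped
    at `cap`, and the actual delay is drawn uniformly from [0, ceiling]
    so that competing clients are unlikely to retry simultaneously.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Drawing from the full `[0, ceiling]` range (rather than adding a small random offset to a fixed delay) spreads retries out as widely as possible, which is what breaks up accidental lockstep between clients.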

Backoff algorithms (with delays that grow with each successive attempt) can also blunt denial-of-service attacks, since compromised systems cannot make too many requests in a short duration. A client thread (or process) that attempts to access data may need to invoke a sleep function (a common feature of most programming languages) to ensure time elapses between attempts.
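A client thread that sleeps between attempts typically wraps the operation in a retry loop. A hedged sketch of such a loop (the exception type, attempt limit, and timing parameters are placeholders, not a prescription):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0):
    """Call `operation` until it succeeds or attempts are exhausted.

    Between failures, sleep for an exponentially growing, jittered
    duration so retries neither hammer the resource nor synchronize
    with other clients retrying the same resource.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:  # e.g., a locked file or a refused connection
            if attempt == max_attempts - 1:
                raise  # give up: surface the last failure to the caller
            time.sleep(random.uniform(0.0, min(cap, base * (2 ** attempt))))
```

Note that the final failure is re-raised rather than swallowed, so the caller can distinguish "eventually succeeded" from "exhausted all attempts."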

Network traffic routing also involves a decentralized pair of devices (a source and a destination). When packets collide on a network, many link and routing algorithms wait a variable amount of time before a packet is transmitted across the network again. Collision avoidance along shared paths is better achieved with a backoff strategy. See [1] for more information.
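A classic instance of this is truncated binary exponential backoff in Ethernet (CSMA/CD): after a collision, the sender waits a random number of slot times drawn from a window that doubles with each collision, up to a limit. A rough sketch of that slot selection (the window limit of 10 follows the traditional 802.3 scheme; this is illustrative, not a protocol implementation):

```python
import random

def collision_backoff_slots(collisions):
    """Truncated binary exponential backoff slot count.

    After the k-th collision, wait a random integer number of slot
    times in [0, 2**min(k, 10) - 1]; the window stops growing after
    10 collisions ("truncated").
    """
    window = 2 ** min(collisions, 10)
    return random.randrange(window)
```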

Upstream throttling and load shedding are deliberate actions designed to prevent a denial of service. Keeping some service available at the price of reduced rate or capacity is a design tradeoff that systems architects are often willing to accept. Scheduled synchronization of resource utilization can also be an alternative to backoff strategies, by avoiding the scenarios that make them necessary. This performance-tuning approach usually applies to routine batch jobs or nightly backups: by giving processes that connect to the same server or use the same file different schedules, you can avoid race conditions and avoid approaching maximum bandwidth utilization at a single time of day.
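One simple way to stagger such schedules deterministically is to derive each job's start offset from its name, so jobs spread themselves across a window without central coordination. A hypothetical sketch (the function and window size are illustrative):

```python
import hashlib

def start_offset_minutes(job_name, window_minutes=60):
    """Spread jobs across a window by hashing their names.

    Each job gets a stable offset in [0, window_minutes), so nightly
    jobs configured for "sometime after midnight" do not all start at
    00:00 and compete for the same server, file, or bandwidth.
    """
    digest = hashlib.sha256(job_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes
```

Because the offset is a pure function of the job name, it is stable across restarts and needs no shared state, unlike a randomly chosen delay.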


