Fun with CloudFront failovers

AWS CloudFront, the content delivery network from Amazon, has a lot of great features, but the one I would like to focus on today is origin failover.

A basic premise of a CloudFront distribution is that of origins and behaviours.
An origin is the source of the content being served. It defines where content is fetched from, how it is fetched, any headers that need adding, and so on.
A behaviour defines the conditions under which an origin is used and how its responses are cached: which URL path patterns it matches, which origin to use, which headers are forwarded and added to the cache key, what the response headers are, any edge functions, and so forth.

There is an alternative to specifying an origin for a behaviour though. We can specify an origin group.
And that opens some interesting possibilities.

Introducing origin groups

An origin group is simply a pair of origins, and its purpose is to improve availability by specifying an alternative origin to ‘fail over’ to. The name is slightly misleading: an origin group must contain exactly two different origins, a primary and a secondary.

In our origin group we specify the conditions (specific HTTP status codes) under which a request to origin A should fail over to origin B. Any behaviour which uses this origin group will first try to serve the content from origin A and, if the failover criteria are met, for example if a 500 Internal Server Error is returned, serve the content from origin B instead. CloudFront also fails over if it cannot connect to origin A at all, or if the request to it times out.
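As a rough sketch, this is how such an origin group might look when building a distribution config for the CloudFront API (for example to pass to boto3's create_distribution or update_distribution). The helper name build_origin_group and the IDs origin-a, origin-b and images-with-fallback are made up for illustration. The dictionary keys follow the OriginGroups structure of the API as I understand it, where the first member is treated as the primary origin.

```python
def build_origin_group(group_id, primary_origin_id, secondary_origin_id, status_codes):
    """Build one entry for DistributionConfig["OriginGroups"]["Items"].

    All IDs are hypothetical; they must match origins defined elsewhere
    in the same distribution config.
    """
    if primary_origin_id == secondary_origin_id:
        raise ValueError("an origin group needs two different origins")
    codes = sorted(set(status_codes))
    if not codes:
        raise ValueError("at least one failover status code is required")
    return {
        "Id": group_id,
        "FailoverCriteria": {
            "StatusCodes": {"Quantity": len(codes), "Items": codes},
        },
        # Assumption: member order is significant, primary first.
        "Members": {
            "Quantity": 2,
            "Items": [
                {"OriginId": primary_origin_id},
                {"OriginId": secondary_origin_id},
            ],
        },
    }


group = build_origin_group(
    "images-with-fallback", "origin-a", "origin-b", [500, 502, 503, 504]
)
origin_groups = {"Quantity": 1, "Items": [group]}
```

A behaviour then points at the group by using the group's Id as its TargetOriginId, just as it would for a single origin.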

It is important to note a major limitation of origin group failover: CloudFront only fails over when the viewer request uses a read-only method, that is GET, HEAD, or OPTIONS.
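To make the rule concrete, here is a tiny model of the decision described above. It is only an illustration of the logic, not CloudFront's actual implementation, and the function name should_fail_over is my own. It ignores everything except the request method and the status code returned by origin A.

```python
READ_ONLY_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def should_fail_over(method, primary_status, failover_status_codes):
    """Return True if a request would be retried against origin B."""
    # Write requests (POST, PUT, DELETE, ...) are never failed over.
    if method.upper() not in READ_ONLY_METHODS:
        return False
    # Otherwise fail over only on the status codes configured for the group.
    return primary_status in failover_status_codes


criteria = {500, 502, 503, 504}
print(should_fail_over("GET", 503, criteria))   # True
print(should_fail_over("POST", 503, criteria))  # False: not a read-only method
print(should_fail_over("GET", 404, criteria))   # False: 404 is not in the criteria
```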

It should be obvious how this could be highly useful for availability, serving identical content from a second source (perhaps in a different region).
But there are other, more creative possibilities, such as image manipulation, serving versioned content, and even waking sleeping machines.

Image manipulation and thumbnailing on the fly

Processing uploaded images—such as creating thumbnails or optimizing compression—is typically done before the assets are requested. This can be done in a queue or, in some cases, as the user uploads them (and waits).

Imagine we have a headless content management system with a media library. We upload a large number of images, but only a few of those will ever actually be used in content.

Making every thumbnail size for every image would be a waste of time and resources, and might not even be practical. If we want to generate and store only those images that are being used, we need to make them on demand.

This on-demand image manipulation can be achieved with an origin group failover, and might work as follows:

The client requests an image from a behaviour with an origin group. Origin A is the image store, for example S3, where we look for the specific image.
If the image exists, we simply return it. If we receive a 403 or 404 error (S3 responds with 403 Forbidden rather than 404 Not Found for a missing key when the caller has no permission to list the bucket), we fail over to origin B.
Origin B is a dedicated image processing microservice (perhaps using Lambda functions). It generates the requested thumbnail on the fly, saves it to S3, and returns it to the user (with headers to prevent the response being cached).
Future requests to this URL will still go to S3 first (because of the no-cache headers), where they will find the thumbnail and return it immediately.
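The origin B side of this flow could be sketched roughly as below. Everything here is an assumption made for illustration: the /thumbs/WIDTHxHEIGHT/KEY URL scheme, the function names, a plain dict standing in for the S3 bucket, and a resize callable standing in for a real imaging library. The response dict loosely mimics what a Lambda behind a function URL might return.

```python
import re

# Hypothetical URL scheme: /thumbs/<width>x<height>/<key of the original image>
THUMB_PATH = re.compile(r"^/thumbs/(\d+)x(\d+)/(.+)$")
NO_STORE = {"Cache-Control": "no-store"}


def parse_thumbnail_path(path):
    """Return (width, height, source_key), or None if the path does not match."""
    match = THUMB_PATH.match(path)
    if match is None:
        return None
    width, height, source_key = match.groups()
    return int(width), int(height), source_key


def handle_missing_thumbnail(path, store, resize):
    """Generate a thumbnail that origin A did not have, and save it for next time.

    store  -- dict-like stand-in for the bucket (key -> bytes)
    resize -- callable (image_bytes, width, height) -> bytes
    """
    parsed = parse_thumbnail_path(path)
    if parsed is None:
        return {"statusCode": 404, "headers": NO_STORE, "body": b""}
    width, height, source_key = parsed
    original = store.get(source_key)
    if original is None:
        # No original image either, so this really is a 404.
        return {"statusCode": 404, "headers": NO_STORE, "body": b""}
    thumbnail = resize(original, width, height)
    # Save under the requested path so origin A finds it on the next request.
    store[path.lstrip("/")] = thumbnail
    return {"statusCode": 200, "headers": NO_STORE, "body": thumbnail}
```

In a real service the store would be S3 calls and the resize function would come from an imaging library. Keeping both as parameters just makes the flow easy to follow and to test.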

This architecture could be further refined with expiration rules in S3, preventing the retention and accumulation of unused images and associated storage costs.
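A lifecycle rule for this might be built along the following lines. The helper name, the thumbs/ prefix and the 30 day period are assumptions for the sake of the example. The dictionary is in the shape that boto3's put_bucket_lifecycle_configuration expects for its LifecycleConfiguration argument, as far as I know.

```python
def thumbnail_expiry_rule(prefix, days):
    """Build a lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = thumbnail_expiry_rule("thumbs/", 30)
# Applying it would look something like this (bucket name is hypothetical):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-media-bucket", LifecycleConfiguration=config
# )
```

Note that S3 expiration counts from when the object was created, not from when it was last requested. In practice this means every thumbnail expires after the chosen period, and the ones still in use are simply regenerated by the failover on their next request.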

Serving versioned or private content

In some instances, it might be desirable to fall back to the previous working version of the content, should a deployment fail.
For example, we can have origin A as the latest version of our site, with origin B providing the previous stable version. Should a deployment fail for any reason, our site will fall back to the previous version.

A similar approach can be used when rebuilding a site.
Origin B could be the legacy version, with origin A serving the replacement new build. When content or a feature is not available (yet) in the new build, the legacy version will be served until such time that it is.

The same concept could also be used for serving paywalled content, falling back to a free, limited version, or displaying a paywall.

In all cases, it is important to consider the cache headers that will be used to serve a request: failover responses are cached in the same way as any other response, and this may not always be desirable.
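One way to handle this is for the fallback origin to send its own, more cautious Cache-Control header. The sketch below is only a suggestion: the function name is hypothetical, and it assumes the behaviour's cache policy has a minimum TTL of 0, since a higher minimum TTL can cause CloudFront to keep a response for longer than the header asks.

```python
def fallback_cache_control(shared_cache_seconds):
    """Cache-Control value for responses served by the fallback origin."""
    if shared_cache_seconds < 0:
        raise ValueError("shared_cache_seconds must not be negative")
    if shared_cache_seconds == 0:
        # Never cache: every request is tried against origin A again.
        return "no-store"
    # s-maxage applies to shared caches such as CloudFront, while max-age=0
    # asks browsers to revalidate rather than reuse the fallback copy.
    return f"public, max-age=0, s-maxage={shared_cache_seconds}"


print(fallback_cache_control(0))   # no-store
print(fallback_cache_control(60))  # public, max-age=0, s-maxage=60
```

A short s-maxage keeps load off the fallback origin during an outage, while still letting the primary origin take over again within a minute or so of recovering.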

Also consider that this is only suitable for read requests (GET/HEAD/OPTIONS), and thus is only really practicable for static content.

Waking sleeping machines on demand

Some resources, particularly internal ones, are only required sporadically. Services such as reporting software, staging or preview sites, or auditing tools may be very rarely used, yet it is often infeasible to re-engineer these as a Lambda function which only runs on demand.

We want to be able to turn these resources off, either manually or on a schedule, and only boot them up again on demand.

Using a CloudFront origin group failover can simplify this for us. If the service returns a 503 error, the server is not available, and we can fail over to a Lambda function which boots it up and returns a message to the user inviting them to return in 5-10 minutes.
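A minimal sketch of such a wake-up function is shown below, assuming the sleeping service is a stopped EC2 instance and that the Lambda is exposed to CloudFront through something like a function URL. The instance ID is a placeholder, the handler name is my own, and the optional ec2 parameter exists only so the logic can be exercised without AWS. start_instances is the real boto3 EC2 client call.

```python
# Placeholder ID of the stopped instance; replace with your own.
INSTANCE_IDS = ["i-0123456789abcdef0"]

WAKE_MESSAGE = (
    "This service was asleep and is now starting up. "
    "Please try again in 5-10 minutes."
)


def wake_handler(event, context, ec2=None):
    """Start the sleeping instance and tell the visitor to come back later."""
    if ec2 is None:
        import boto3  # imported lazily so the sketch runs without AWS installed

        ec2 = boto3.client("ec2")
    # Starting an instance that is already running is harmless; one that is
    # still in the middle of stopping may raise an error that real code
    # would want to catch.
    ec2.start_instances(InstanceIds=INSTANCE_IDS)
    return {
        "statusCode": 503,
        "headers": {
            "Content-Type": "text/plain; charset=utf-8",
            # Do not let the CDN or browser hold on to the "waking up" page.
            "Cache-Control": "no-store",
            "Retry-After": "300",
        },
        "body": WAKE_MESSAGE,
    }
```

Returning a 503 with a Retry-After header is one choice among several. A 200 with a friendly HTML page would also work, as long as it is not cached beyond the time the machine takes to boot.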

Bear in mind that any traffic would wake the machine, so either a specific, non-indexed URL needs to be chosen, or the whole distribution needs to be unavailable to the public (for example behind authentication).
The restriction of failover to read-only requests also applies here: if your domain root expects POST/PUT/DELETE requests, then your wake-up failover will need to be placed on a specific URL.

Conclusion

Origin failover groups in AWS CloudFront offer more than just resilience and high availability: they can be used creatively to address a variety of common challenges. While they do come with certain limitations, such as being restricted to read-only requests, these can be mitigated with thoughtful planning around CDN caching and behaviours.

When used effectively, origin failover groups can be a powerful tool for optimizing performance, reducing operational complexity, and enhancing efficiency. They enable us to offload complex logic to infrastructure and microservices, ultimately helping to save time, reduce costs, and minimize environmental impact. Understanding both their potential and limitations allows us to use CloudFront to build more efficient and sustainable solutions.