Sparking joy in container host maintenance
I’m sure you can relate to this story:
Step 1: You build the perfect application,
…
Step 3: you never think about it again.
In the end, your team hands you a trophy and you get a massive raise for your efforts as well as the love/adoration of junior, senior, and executive colleagues. Happens every day.
What, the world isn’t like that? Yep, you’re right: the world is messy and full of muck. Your job is to contain and manage the technological mess that haunts every project on the face of the earth. Our industry has learned a lot of muck-reducing lessons in recent years: turns out, not managing your own infrastructure and containerizing applications go a long way.
Container hosts, the servers that run your containerized workloads, thankfully rank relatively low on the muck scale for Amazon ECS clusters. Many folks use AWS Fargate here, which removes most of the need to even think about your container host. Fargate doesn’t fit every circumstance, though; for those cases you can run your own container hosts on Amazon EC2 with something like the Amazon ECS-optimized AMI.
The Amazon ECS-optimized AMI isn’t magical: it’s functionally an image of Amazon Linux with the pre-installed software needed to join an ECS cluster. In theory, you could install the same software on practically any Linux distribution and it would join the cluster. Why is this important? Because the underlying OS is just Linux, and like any machine running Linux, you have some responsibility for updating and maintaining its software.
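If you want to poke at that AMI yourself, here’s a minimal sketch (Python with boto3) that looks up the recommended ECS-optimized AMI ID from the public SSM parameter AWS publishes. The parameter path (the AL2023 x86_64 variant) and the region are assumptions; adjust them for your setup.

```python
# Sketch: look up the recommended ECS-optimized AMI via the public SSM parameter.
# The parameter path is an assumption (AL2023, x86_64); AL2 and arm64 variants
# live under different paths.
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")  # region is an assumption

param = ssm.get_parameter(
    Name="/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended/image_id"
)
ami_id = param["Parameter"]["Value"]
print(f"Recommended ECS-optimized AMI: {ami_id}")
```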
The heart of most (foreshadowing: not all) Linux distributions is the ‘package’. Packages have prerequisites and are often dependencies for other packages. There is very little chance a human could manage all of this themselves without flaw, so package managers take care of this complexity for us. Package managers even take care of checking for, and installing, updates. So, as someone administering a container host, your job should be to just tell the package manager to update any out-of-date software, right? Right?
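For instance, checking whether a host has pending updates is easy to script. Here’s a minimal sketch, assuming a dnf-based host like Amazon Linux 2023; it leans on dnf’s documented exit codes for `check-update` (100 means updates are available, 0 means the host is current).

```python
# Sketch: ask the package manager whether updates are pending on a dnf-based
# host (e.g. Amazon Linux 2023). `dnf check-update` exits with 100 when updates
# are available and 0 when the host is current.
import subprocess

result = subprocess.run(
    ["dnf", "check-update", "--quiet"],
    capture_output=True,
    text=True,
)

if result.returncode == 100:
    print("Updates available:")
    print(result.stdout)
elif result.returncode == 0:
    print("Host is up to date.")
else:
    print(f"dnf reported an error: {result.stderr}")
```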
In an ideal world, yes, but we don’t live in an ideal world. The problem is that some packages can’t be transparently updated automatically. Packages are built from ‘upstream’ projects, typically the originating authors of the software. The upstream projects fix bugs and address security issues but backport to a limited set of releases. Over time, the project moves forward and backporting slows for older releases due to the increase in difficulty as they diverge from the more recent releases. The teams that manage distros, especially those with long lifecycles, often take the burden of patching these older versions for their users.
Patches are applied in a couple of different ways: you can get an up-to-date AMI that includes the latest patches, or keep around an older AMI and have it patch itself on launch (and deal with potential reboots on kernel updates).
For an AMI that is only a few weeks old, patching on launch is no problem: the updates are quick to fetch and apply. As the AMI ages, though, patches build up and the on-launch update takes longer and longer. From a cost-efficiency standpoint, you’re paying for compute time that isn’t productive for your application, so it’s to your advantage to adopt the latest AMI. Additionally, patching on launch means that you’re less able to scale exactly when you want to; instead, you’re always waiting on patches to apply before your compute is useful.
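One way to keep that bias toward the latest AMI is to bake AMI rotation into your tooling. Below is a minimal sketch (Python with boto3) that points an existing launch template at a newer AMI so freshly launched capacity starts patched instead of patching on boot. The launch template name and AMI ID are placeholders; in practice the AMI ID would come from a lookup like the SSM parameter shown earlier.

```python
# Sketch: roll an existing launch template forward to a newer ECS-optimized AMI.
# "ecs-container-hosts" is a hypothetical launch template name.
import boto3

ec2 = boto3.client("ec2")

def adopt_latest_ami(launch_template_name: str, ami_id: str) -> None:
    # Find the current default version so the new version inherits its settings.
    template = ec2.describe_launch_templates(
        LaunchTemplateNames=[launch_template_name]
    )["LaunchTemplates"][0]

    # Create a new version based on the default, swapping in the new AMI.
    new_version = ec2.create_launch_template_version(
        LaunchTemplateName=launch_template_name,
        SourceVersion=str(template["DefaultVersionNumber"]),
        LaunchTemplateData={"ImageId": ami_id},
    )["LaunchTemplateVersion"]["VersionNumber"]

    # Make the new version the default so future launches pick it up.
    ec2.modify_launch_template(
        LaunchTemplateName=launch_template_name,
        DefaultVersion=str(new_version),
    )

# Hypothetical names/IDs for illustration only.
adopt_latest_ami("ecs-container-hosts", ami_id="ami-0123456789abcdef0")
```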
So, as someone who administers a container host for ECS, you have some “care and feeding” responsibilities. It’s tempting to approach a container host as “if it’s not broken, don’t fix it.” Sadly, this is a dangerous path. You have a few broad options:
- Know and understand the releases of your distro beyond just updating. This means figuring out the release cadence, what a release changes, and how this might affect your workload. In the case of Amazon Linux, at the time of writing, the oldest continually updated release is Amazon Linux 2 (AL2) and the current is Amazon Linux 2023 (AL2023). These releases are quite different in how they’re put together, so moving any workload (including a container host) from AL2 to AL2023 is something you’ll need to test, as it uses different versions of the container runtime and kernel (among many other differentiated bits). Eventually, AL2 will stop receiving updates, so you’ll need to migrate to a newer major version; this will also happen with AL2023, but on a longer time scale. Migrating is not without its own hassles, so you’ll probably want to maximize the interval between migrations by adopting new versions early, so you’re not under excessive pressure and time constraints when an end of life comes.
- Avoid patching altogether. In the context of a container host, your applications are already abstracted through containerization, so they’re largely decoupled from the host. Instead of maintaining instances through patching, regularly (and frequently) provision new capacity with the most up-to-date host operating system and move your workloads over (see the sketch after this list). This requires a degree of intentionality and coordination: you need to make sure your workloads can tolerate regular changes in the host and graceful draining, as well as the automation to remove old and provision new capacity.
- Adopt a container host OS with abstractions to minimize care and feeding. Bottlerocket is a free and open source Linux distribution started at AWS that is specialized to host containers, especially for ECS and Kubernetes (Amazon EKS). Because it’s specialized, it has a small surface area and eliminates packages in favor of an image-based update system that can be easily automated. It also abstracts away much of the minutiae of the underlying Linux components, meaning the team building Bottlerocket can make bigger moves in the underlying components without bubbling those changes up to you. The abstractions in Bottlerocket do have a learning curve, and you might have to adapt some processes and policies to fit the model of a specialized OS like Bottlerocket. Additionally, while these abstractions don’t entirely remove the need to migrate to newer versions, they do lessen the complexity, dampen the impact, and reduce the frequency of migrations.
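If you go the replace-instead-of-patch route, much of the heavy lifting can be delegated to your Auto Scaling group. Here’s a minimal sketch (Python with boto3) that kicks off an instance refresh after the launch template has been pointed at a new AMI; the group name and the preference values are assumptions you’d tune to your workload’s tolerance for churn.

```python
# Sketch: replace container hosts rather than patching them in place by asking
# the Auto Scaling group to cycle instances onto the updated launch template.
# "ecs-container-hosts-asg" and the preference values are assumptions.
import boto3

autoscaling = boto3.client("autoscaling")

response = autoscaling.start_instance_refresh(
    AutoScalingGroupName="ecs-container-hosts-asg",
    Preferences={
        # Keep most of the fleet healthy while old hosts leave and new ones join.
        "MinHealthyPercentage": 90,
        # Give the ECS agent time to register and place tasks before continuing.
        "InstanceWarmup": 300,
    },
)
print(f"Started instance refresh: {response['InstanceRefreshId']}")
```

You’d typically pair this with draining of the ECS container instance (for example via a termination lifecycle hook) so tasks are stopped gracefully before the old host goes away.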
Both AL2023 and Bottlerocket have mechanisms to let you control when you update and how those updates work. AL2023 has “deterministic upgrades,” which allow you to update on your own schedule, do more precise testing, and revert to another version if things don’t work out. Bottlerocket is immutable (so it updates the entire OS atomically) and can update either in place (manually or with the ECS Updater) or through node replacement.
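To make the AL2023 side concrete, here’s a minimal sketch of what that looks like from the host itself, based on AL2023’s documented releasever mechanism. The release label is a hypothetical example; check the output of the release-update check for the versions actually available to you.

```python
# Sketch: AL2023 deterministic upgrades from the host's point of view.
# The release label below is a hypothetical example.
import subprocess

# See whether a newer AL2023 release exists than the one this host is locked to.
subprocess.run(["dnf", "check-release-update"], check=False)

# Move the host to a specific release on your schedule (not just "whatever is newest").
subprocess.run(
    ["dnf", "upgrade", "--releasever=2023.6.20250101", "-y"],
    check=True,
)
```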
Winning the trophy for muck management means you need to have a long-term plan for your container hosts. You can use something like Amazon Linux and plan for migrations, or you can use Bottlerocket to abstract away most of the muck. Or, if your workloads are right for it, you can use Fargate to let AWS handle the container host entirely. Whatever you do, planning for changes is part of the responsibility of running containerized applications.