As the Bayzat SRE team, we migrated our infrastructure to AWS ECS about 1.5 years ago. This blog post explains why we decided to migrate to container orchestration, how we chose AWS ECS among the alternatives, what we are happy about, and what we still lack.

What did we migrate from?

We have a mostly monolithic architecture, consisting of one big JVM app, some Lambda functions, and a separate web server for front-end assets. We hosted all of these, except Lambda of course, on EC2 behind an Nginx proxy, managed by a supervisord process. We executed shell commands against supervisord to manage deployments, restarts, and even stops.

Why even migrate?

You must have heard this gem multiple times before: if it works, it works; don’t touch it!

The simple EC2-based solution served us for quite some time. It served us well: it was simple and easy to understand. But as the company grew, so did the team, and this simple workflow did not scale with the team’s requirements.

After a while, we figured out that the issues below were affecting team productivity:

1. Golden images

Having a JVM-based application means you can package all your dependencies into a single JAR file. But it still runs on top of an OS, a toolset, and many other dependencies specific to your application (ours used a specific font file that we didn’t know about until we migrated to containers and failed in production). Packaging the application with everything it depends on makes sense.

We could achieve the same thing by building AMIs, but we concluded it had the following issues:

To take an AMI snapshot, you first need to start a VM. Once all the provisioners have completed, you take the snapshot, and snapshotting a VM is not instant.
To build the AMI properly, we would have to use some combination of Packer and Ansible, which I think is not as intuitive or accessible as Dockerfiles.

2. Self-buildable applications

We used to depend on Ansible scripts to provision host machines. It meant that the dependencies of the application were in another repository because Ansible scripts were centralized. We needed to break this dependency; each service should define how it should be built/run/deployed. Changes to the build/run mechanism should not require releases of multiple services.

We thought about hosting Ansible provisioners in each service’s code repository, but that meant abandoning battle-tested, commonly used Ansible roles and playbooks. Considering that the application might later be split into multiple services, we didn’t go for this option.

Another reason is that Ansible is not easy to dive into from a developer’s point of view. Docker, on the other hand, is much more intuitive, has more use cases, and is better received in the developer community. If needed, we could share common parent images among multiple services.

3. Expanding toolset around container orchestration

If we had kept the old system, we would have needed to improve and maintain the deployer code and its capabilities. Supervisord does not change frequently; Linux service management tools have had the same capabilities for years. On the other hand, we get blue/green deployments, canary deployments, rollbacks, and auto-scaling policies out of the box when we use hot cloud technologies.

At some point, these issues forced us to migrate to a container-based architecture and a container orchestration platform.

How did we evaluate alternatives?

After the decision was made, we did a two-day hackathon in the SRE team. The purpose of the hackathon was plain and simple: considering these problems, evaluate different container orchestration platforms, check out their capabilities, and do a PoC if possible.

We checked out the solutions below:

Docker on EC2
Kubernetes
AWS ECS
AWS EKS
Hashicorp Nomad
Netflix Titus

At the end of the hackathon, everyone shared their experiences, and the winner revealed itself. Everybody agreed on the same solution.

Of course, it wasn’t that easy. We had to set some constraints:

AWS was/is the most prominent cloud provider in our region. If the solution were to be self-hosted, it would run on EC2 anyway.
Because of our team size, an open-source solution would have to be much more attractive than a managed service for us to pick it.
We replicate our production infrastructure to development/test environments too. So the solution should be replicable to many development environments.
Being a startup means that we need to prefer short-term gains as much as possible. So we needed a solution with extensive tooling support.

Let me share our findings at that time.

Evaluation results

Docker on EC2

Docker on EC2 meant running one container on one VM for us. It would be a good step in the containerization process. While keeping EC2 know-how, we could leverage container functionality too.

The problem with this approach is that we would need to manage the container lifecycle ourselves, carefully and correctly. Our team did not want to maintain run scripts full of docker run and docker stop commands, so one alternative was Docker Compose. At that time, however, the docker-compose documentation said not to use it in production, so we quickly abandoned this solution.

Netflix Titus

To not just jump on the K8s train, I personally added Titus and Nomad to the list. Titus looked like it was being abandoned; there were no recent commits. So it didn’t conform to our ecosystem constraint.

Hashicorp Nomad

Nomad was promising. We were already using Terraform heavily, so why not use another solution under Hashicorp’s umbrella? It runs as a single executable on the VM. Its concepts, usage, and installation were simpler than K8s.

Compared to K8s and AWS, however, it had less tooling. For me, the real deal-breaker was that namespaces were restricted in the open-source version, so it didn’t conform to our development environment constraint. From their perspective, it makes sense to put a feature behind a license. From our perspective, it was hard to sell it upstairs while there were cost-free, more popular solutions.

Docker Swarm

There were more solutions back then than there are today. Not all of them survived or are still actively developed. Titus was the first example; Swarm is, I think, another one.

It is fair to say that we skipped Swarm quickly. At the time of our decision, Swarm still had some activity, but it was clear which orchestration platform would be the winner.

Speaking of which:

Kubernetes (self-hosted)

The elephant in the room: secretly, everyone wants to do something with it; it is cool, it is hyped. I had tried it out before in some personal projects and tutorials, but had never maintained any production services on it.

The good part about K8s is that, being the hype, everyone is doing something with it. There are many tools, extensive tutorials, a very dynamic ecosystem, managed services from all cloud providers, and a lot of things within arm’s reach.

The bad part is that it is not simple. K8s requires migrating to a whole new mindset. Managing a whole cluster of VMs running K8s is something else entirely; we would need to spare two people just to deal with K8s effectively.

Our development environment is based on a multi-account AWS infrastructure. Migrating this architecture to a limited number of clusters was doable, but not simple. If we were starting from scratch, it might make more sense.

AWS EKS

EKS is the hosted K8s solution AWS provides. It effectively removes the problem of sparing two people for K8s maintenance, scaling, etc. It also supports Fargate, AWS’s serverless compute for containers.

What is the problem? Personally, I remember that someone’s cluster went missing out of the blue.

More objectively, it didn’t remove the complexity of K8s. It was still complex, and spreading the knowledge across the whole team would have been hard.

The final straw was AWS charging per EKS cluster per hour. This directly conflicted with our development environment constraint. This is, of course, solvable by rethinking the development environment model, but we would still want at least three clusters.

AWS ECS

ECS is AWS’s own take on container orchestration. It is simpler to manage than K8s. With Fargate, it causes much less of a headache. Out of the box, it supports manual and automated scaling, blue/green and canary deployment patterns, and it comes with AWS’s opinionated tooling.

The bad parts? It gets too opinionated. Like:

We use lots of development environments. For the features above, ECS needs an ALB in front of the containers, which makes sense. But not for development environments: there I don’t care about scalability, blue/green deployments, etc., yet I have to pay for the ALB anyway.
Some applications require downtime during deployments. ECS allows you to configure that: just play with minimumHealthyPercent. But be careful not to change the number of tasks while deploying; ECS will start a new task before the first one shuts down.
We are at the mercy of AWS’s tooling roadmap. While K8s has this exponential tool growth, we are left with AWS only. I’d very much appreciate a tool like k9s, for instance.
Fargate is very cool, but it has resource limits: at the time of writing, it doesn’t go over 4 vCPUs and 30 GB of memory. This makes sense for microservices, that is, services that can run as multiple instances behind a load balancer, but it meant we also had to invest in our back-end app to run like that.
Some applications need complex startup logic, such as running external scripts and migrations right before starting.
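To make the minimumHealthyPercent point concrete, here is a minimal Python sketch of how the deployment settings bound the number of running tasks during a rolling deployment. The function name and example values are hypothetical. The rounding (lower bound rounded up, upper bound rounded down) reflects my reading of the ECS deployment configuration documentation, so treat it as an assumption and check it against your own service.

```python
def deployment_task_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) tasks during a rolling deployment.

    Assumption: the lower bound is rounded up and the upper bound is
    rounded down, as described for minimumHealthyPercent / maximumPercent.
    """
    # Integer arithmetic avoids floating-point rounding surprises.
    min_running = -((-desired_count * minimum_healthy_percent) // 100)  # ceiling
    max_running = (desired_count * maximum_percent) // 100  # floor
    return min_running, max_running


if __name__ == "__main__":
    # One task, no downtime allowed: the new task starts before the old one stops.
    print(deployment_task_bounds(1, 100, 200))
    # One task, downtime allowed: the old task may stop before the new one starts.
    print(deployment_task_bounds(1, 0, 100))
```

With a single task, a lower bound of 1 and an upper bound of 2 means two copies overlap for a while. Setting minimumHealthyPercent to 0 and maximumPercent to 100 gives bounds of 0 and 1, which is the configuration that tolerates downtime.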

Some of these problems are inherent to all container solutions. From a different perspective, they are not problems at all; they are our growing pains in adapting to cloud-native container solutions. We needed to rethink our way of doing things and adapt to them.

Being simple and a managed service, AWS ECS was the ideal candidate for us. A simple PoC showed us how easy it was to get started with it. I’m repeating myself, but Fargate was an excellent reason to choose it.

Conclusion & experiences after 1.5 years

We have matured our ECS-based architecture so far. We are using both ECS on EC2 and ECS on Fargate. Our choice of technology fit our conditions at that time. As times change and our constraints and focus shift, we might go with something else.

The pros we have:

Managing deployment logic is opinionated but, again, easy. Do you have downtime requirements? Change a variable. Do you need rolling deployments? Change a variable. But do not expect too much customizability.
ECS is a good alternative if you cannot justify investing in more controllable/customizable solutions.
Fargate is amazing. Just specify the resource requirements, and that is all. It is much more intuitive than creating big instances and scheduling multiple tasks inside them. ECS Exec lets us run diagnostic shell commands from outside, without having to SSH into an EC2 instance.
Scalability on ECS is easy. Since our workload did not require more, we left the number of containers running in parallel at one. Then, one day, we had to scale it up. It took us 10 minutes to test the logic with multiple running instances, and 30 minutes to deploy the changes via Terraform.

Here are the problems we have encountered:

Scalability with Fargate is simple and fun. However, when the Fargate limits were insufficient, we had to use ECS on EC2, which comes with Auto Scaling groups. Auto Scaling groups are easy to scale out but complex to scale in (decreasing the number of tasks and terminating unused instances). You can decrease the number of running tasks easily, but it is not guaranteed that the idle EC2 instances will be terminated. The official documentation says: it will scale out, it may scale in.
When the health checks of a task fail, ECS decides to shut it down: it takes the task out of the load balancer and starts a new one. But we want to keep the failing task running, take heap/thread dumps, and investigate later. AWS doesn’t provide this out of the box and even makes it very hard to manage. There are workarounds people have implemented, and even feature requests waiting for consideration.
We had one particular problem with new deployments. According to the official documentation, we should use one tag in the task definition, push the updated image to ECR under the same tag, and then force a new deployment. In theory, it worked fine. In practice, it worked fine most of the time. But sometimes the ECR change was not reflected in ECS: the new deployment started with the old image, which confused the hell out of us. After it happened three or four times, we decided to use exact versions in task definitions and update the task definition for every deployment.
ALB costs in development environments are still hurting us, but there is nothing we can do about it.
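The "exact versions in task definitions" approach from the deployment problem above can be sketched in Python as follows. This is an illustration under assumptions, not our actual deployer: pin_image, the container name, and the repository name are all hypothetical, and the task definition is represented as a plain dictionary shaped like the ECS containerDefinitions structure.

```python
import copy


def pin_image(task_definition, container_name, repository, version):
    """Return a copy of task_definition with one container pinned to an exact tag."""
    if not version or version == "latest":
        raise ValueError("use an exact, immutable version instead of a moving tag")
    # Work on a copy so the currently deployed definition stays untouched.
    updated = copy.deepcopy(task_definition)
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = f"{repository}:{version}"
            return updated
    raise KeyError(f"no container named {container_name!r}")


if __name__ == "__main__":
    # Hypothetical task definition that still points at a moving tag.
    current = {
        "family": "backend",
        "containerDefinitions": [{"name": "app", "image": "example/backend:latest"}],
    }
    print(pin_image(current, "app", "example/backend", "1.4.2"))
```

The returned definition would then be registered as a new task definition revision and the service pointed at it, for example through Terraform or the ECS RegisterTaskDefinition and UpdateService API calls. Since every deployment references a distinct image version, there is no ambiguity about which image a new task pulls.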

To conform to ECS, we are trying to reduce our service requirements into smaller applications running in parallel. If we can manage that, on-demand scale out/in will be pretty easy and cost-effective.
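The choice between Fargate and ECS on EC2 described earlier comes down to whether a single task fits within the Fargate limits. Here is a rough Python sketch of that decision. The limits are the ones quoted in this post and change over time, the function name is hypothetical, and the sketch deliberately ignores the specific CPU/memory combinations Fargate accepts.

```python
# Limits as quoted in this post; real Fargate limits change over time,
# so treat these constants as assumptions.
FARGATE_MAX_VCPU = 4
FARGATE_MAX_MEMORY_GB = 30


def choose_launch_type(vcpu, memory_gb):
    """Return "FARGATE" when a single task fits the limits, otherwise "EC2"."""
    if vcpu <= FARGATE_MAX_VCPU and memory_gb <= FARGATE_MAX_MEMORY_GB:
        return "FARGATE"
    return "EC2"


if __name__ == "__main__":
    print(choose_launch_type(2, 8))   # small service
    print(choose_launch_type(8, 64))  # large monolith
```

Splitting a large service into smaller tasks that each pass this check is what would let us rely on Fargate alone and avoid the scale-in complexity of Auto Scaling groups.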

The second issue, debugging faulty tasks, is still a problem to be solved on the AWS side. It might even become a deal-breaker for us if we ever have a serious incident because of it. We might need to reconsider our constraints and finally jump on the K8s train to get more fine-grained control.