Why We're Switching from AWS to Heroku

Published on Aug 17, 2020

[Image: AWS and Heroku logos, courtesy of AWS and Heroku; their use does not represent sponsorship.]

For the past five years, DataMade has maintained a custom deployment framework on Amazon Web Services (AWS) to put our clients’ Django apps online. This summer, after more than six months of research and evaluation, we finally deprecated our custom framework and switched to deploying containerized Django apps on Heroku, a Containers-as-a-Service platform for web apps.

We put a lot of effort into our custom deployment framework, and we also spent serious time researching solutions that could replace it. We suspect that we aren’t alone in this kind of journey. For anyone else that may be facing similar choices, this blog post documents why and how DataMade decided to switch away from a custom-built deployment framework in favor of a Containers-as-a-Service platform, and why Heroku won us over.

TL;DR: Save time, spend on infrastructure automation

Overall, these are our main reasons for switching to Heroku:

  • As a small team working with small clients, we want to spend as little time as possible maintaining infrastructure, and as much time as possible writing code.
  • Big providers like AWS focus their container services on helping big teams scale horizontally, which often means maintaining even more infrastructure than a simple server requires.
  • Heroku is the most mature platform that optimizes for time to first deploy, and it does the most to automate infrastructure provisioning. Its higher prices reflect this added value.

Read on for more details on our legacy deployment practices, our current Heroku practices, and our reasons for making the switch.

The old method: CodeDeploy, EC2, and “zero-downtime” deployments

DataMade first began experimenting with Continuous Deployment back in 2014. Back then, DevOps was still a relatively new idea, AWS felt like a jumble of under-documented APIs, and Docker had been publicly available for less than a year. Facing an immature market, we rolled our own solution, designing a custom deployment framework on AWS using CodeDeploy and EC2.

The framework was relatively straightforward, relying mostly on Bash scripts run by CodeDeploy. In its simplest form, CodeDeploy exposes a web API and some convenience functions for pulling code from a Git remote onto an EC2 instance and running arbitrary Bash scripts to set up an application. Our deployment framework amounted to a standardized set of Bash scripts that CodeDeploy could use to create or update an application on a server, as well as an extensive set of documentation detailing how to set up an EC2 instance, how to configure CodeDeploy, which daemonized processes to install and run in order to serve an application, and how to troubleshoot in cases where deployments went awry.
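In practice, each deployment hook was just another Bash script. A stripped-down setup hook in the spirit of our framework might have looked something like this (an illustrative sketch with placeholder paths, not our actual scripts):

#!/bin/bash
# Illustrative "after install"-style hook: prepare a fresh copy of the app.
# $APP_NAME and $DEPLOYMENT_ID are assumed to be exported by an earlier hook.
set -euo pipefail

PROJECT_DIR=/opt/apps/$APP_NAME-$DEPLOYMENT_ID

# Create a per-deployment virtualenv and install Python dependencies
python3 -m venv "$PROJECT_DIR/.venv"
"$PROJECT_DIR/.venv/bin/pip" install -r "$PROJECT_DIR/requirements.txt"

# Prepare the Django app before Supervisord starts serving the new code
"$PROJECT_DIR/.venv/bin/python" "$PROJECT_DIR/manage.py" collectstatic --noinput
"$PROJECT_DIR/.venv/bin/python" "$PROJECT_DIR/manage.py" migrate --noinput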

It took us about four years to fully stabilize the framework. At first, the biggest pain point was that we had to stop and start applications in order to update them, causing anywhere from a few seconds to a few minutes of downtime for our clients. To fix this problem, DataMade engineers built a “zero-downtime” feature into our deployment framework, which required lots of custom logic to create new application directories, switch Supervisord daemons to serve files from the new directories, and clean up the old application directories when the deployment was done. Here’s a quick taste of our app_start.sh script, which polled a custom application endpoint intended to report whether or not a new deployment had started up:

# Re-read supervisor config, and add new processes
supervisorctl reread
supervisorctl add $APP_NAME-$DEPLOYMENT_ID

# Check to see if our /pong/ endpoint responds with the correct deployment ID.
loop_counter=0
while true; do
    # check to see if the socket file that the gunicorn process that is running
    # the app has been created. If not, wait for a second.
    if [[ -e /tmp/$APP_NAME-${DEPLOYMENT_ID}.sock ]]; then

        # Pipe an HTTP request into the netcat tool (nc) and grep the response
        # for the deployment ID. If it's not there, wait for a second.
        running_app=`printf "GET /pong/ HTTP/1.1 \r\nHost: localhost \r\n\r\n" | nc -U /tmp/$APP_NAME-${DEPLOYMENT_ID}.sock | grep -e "$DEPLOYMENT_ID" -e 'Bad deployment*'`
        echo $running_app
        if [[ $running_app == $DEPLOYMENT_ID ]] ; then
            echo "App matching $DEPLOYMENT_ID started"
            break
        elif [[ $loop_counter -ge 20 ]]; then
            echo "Application matching deployment $DEPLOYMENT_ID has failed to start"
            exit 99
        else
            echo "Waiting for app $DEPLOYMENT_ID to start"
            sleep 1
        fi
    elif [[ $loop_counter -ge 20 ]]; then
        echo "Application matching deployment $DEPLOYMENT_ID has failed to start"
        exit 99
    else
        echo "Waiting for socket $APP_NAME-$DEPLOYMENT_ID.sock to be created"
        sleep 1
    fi
    loop_counter=$(expr $loop_counter + 1)
done

There’s a lot going on here. Since the Gunicorn process that runs the application starts up asynchronously, we need custom polling logic to check whether it has finished and to gracefully handle the case where it fails to start. We also need to maintain a dedicated endpoint in our application code, /pong/, solely for the sake of deployment. And this snippet represents only 34 of the 198 total lines of deployment script!

Even after four years of development, our custom “zero-downtime” deployment framework was brittle: difficult to change, difficult to troubleshoot, and difficult to operate.

Consistent problems with custom deployments

In May 2019, the DataMade DevOps committee convened to talk through the problems with our custom deployment framework. A few themes emerged:

  • The framework couldn’t deploy containers. Over the course of 2018, DataMade had rapidly adopted Docker as a standard tool for containerizing our development environments. We wanted to deploy containers in production, too, but our custom framework didn’t support it, and we would have had to significantly redesign it to retrofit it for containers.

  • Custom deployment scripts were hard to read, write, and maintain. Bash is a notoriously esoteric language, and we had no way to unit test changes to our scripts. Plus, whenever a change was made, it was difficult to push it to all of the applications that used our scripts, because our scripts required so much ad-hoc customization to meet the needs of particular applications.

  • SSH access was granted on a per-application basis, but also had to be revoked on a per-application basis. Rotating developer credentials was a daunting task because we had to audit every single one of our EC2 instances to remove SSH and GPG keys. Because of this, we didn’t rotate keys as often as we should have.

  • Logs were inconsistent, hard to find, and required SSH access. Requiring our team members to shell into servers to hunt for logs discouraged them from actively monitoring applications, and further exacerbated the wide distribution of developer credentials across servers.

  • Setting up deployments took too long. In some cases it took new developers up to 15 hours to learn how to use our framework and set up their first deployment. Veteran developers could do it in three or four hours, but this still represented a significant loss of productive time.

  • Shared tenancy introduced undesirable dependencies between colocated applications. Around 2017, we began deploying every new production client app on its own dedicated EC2 instance. But staging apps (and some legacy production apps) were colocated on shared EC2 instances, which created hellish dependency collisions.

  • Colocated applications were hard to clean up. Between Nginx configs, Gunicorn processes, Supervisord configs, and SSL certificates, colocated applications were painful to clean up, and doing so would sometimes accidentally take down other services.

How we evaluated possible solutions

After years of struggling with our custom deployment framework, we knew that we needed to make some fundamental changes. To figure out which changes to make, we followed our R&D process for contributing changes to our stack. Our R&D process involves six key steps:

  1. Propose a research project to the team of lead developers
  2. Conduct research and develop a proof of concept
  3. Recommend adoption, further research, or abandonment
  4. Notify company partners of recommendation
  5. Pilot use of the tool on a project
  6. Produce adoption artifacts

We generated a list of possible solutions and used step 2, the research phase, to evaluate each solution in detail. The sections below provide our high-level evaluation of each option. If you’d like to read a more detailed evaluation, see our research report and recommendation to adopt Heroku.

Option 1: Further automation and Docker Compose on EC2

The most incremental step that we could have taken would have been to invest in further automating our deployment framework and refactoring it to run containerized applications using Docker Compose. Our initial conversation about problems with our deployment process revealed a number of clear improvements we could have made, including:

  • Adding unit tests
  • Using Terraform or an equivalent tool to automate infrastructure provisioning
  • Packaging deployment scripts as a library for easier updating
  • Using a centralized credential management solution like AWS Session Manager to provision shell access
  • Pushing logs to AWS CloudWatch
  • Using Docker Compose and Docker Hub to pull, start, and stop services

However, we realized that by investing in these improvements, we would effectively be continuing to build our own Containers-as-a-Service platform. Before sinking potentially hundreds more hours into fine-tuning our custom solution, we wanted to see if we could pay for a service that would do the work for us.
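For a sense of what that investment would have bought us, the per-server update step might have boiled down to a handful of Docker Compose commands, roughly like the following sketch (assuming CI publishes images to Docker Hub; the path and service layout are illustrative):

# Sketch of a Compose-based update on an EC2 instance.
cd /opt/apps/myapp

docker-compose pull        # fetch the latest images from Docker Hub
docker-compose up -d       # recreate only the services whose images changed
docker-compose ps          # confirm the new containers came up

docker image prune -f      # clean up superseded images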

Option 2: AWS Elastic Container Service (ECS)

ECS was the first service that we evaluated in detail, since it is by far the most popular Containers-as-a-Service platform. Ultimately, we left feeling pessimistic about the prospect of using ECS at DataMade.

For us, the primary disadvantage of ECS is that it is optimized for large teams with serious horizontal scaling requirements. ECS gives the user full control over every aspect of their container deployment, for better and for worse. The user is responsible for pushing and pulling images to and from a registry, stopping and starting containers, and provisioning SSL certificates and load balancers, among other tasks.
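Even a minimal ECS release involves coordinating several moving pieces yourself. As a rough sketch of the release step alone (resource names and the account ID are placeholders, and this leaves out the load balancer, target group, and certificate setup entirely):

# Push the new image to a registry (here, ECR)...
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag myapp:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest

# ...then register a new task definition revision and roll the service over to it
aws ecs register-task-definition --cli-input-json file://task-definition.json
aws ecs update-service --cluster myapp-cluster --service myapp-service --force-new-deployment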

Overall, our evaluation revealed that adopting ECS would require us to maintain a more complicated deployment process instead of a simpler one. Given the problems we were facing with maintaining our existing deployment process, this felt like a deal killer.

Option 3: Divio

The second service we evaluated in detail was Divio, a Containers-as-a-Service platform optimized for Django.

Due to gaps in documentation, rigid requirements about repository structures, and a lack of online community, we decided that Divio wasn’t mature enough or flexible enough for us to put into production. For more details, see our rundown of why we aren’t adopting Divio.

The idea of a Django-focused CaaS platform is very appealing to us, and we’re excited to see Divio grow and develop. For now, though, it won’t meet our clients’ day-to-day infrastructure needs.

Option 4: Heroku

There was a lot to like about Heroku from our very first trial. Overall, we loved how quickly we could spin new applications up and down on the service, using the same container images we use in development. Some specific features that won us over included:

  • Zero deployment scripts. A Heroku pipeline can be set up entirely through configuration files and a few CLI commands (see the sketch at the end of this section). There are no deployment scripts to maintain, just a set of templates for recommended config values.

  • Deploy previews for every pull request. With review apps, Heroku can automatically spin up a testing version of an app for each open pull request, similar to Netlify’s deploy previews. We love how much this feature speeds up the code review process.

  • Fully-managed load balancing and SSL. After years of shelling into servers and debugging Let’s Encrypt failures, we were delighted to get load balancing and SSL for free with every Heroku deploy.

In all, Heroku automates much of the tedium of infrastructure provisioning, and optimizes for time-to-first-deploy. As a small team, this type of service is critical for us.
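For a concrete sense of how little setup is involved, standing up a container-based staging and production pipeline comes down to a few CLI commands, roughly like this sketch (app and pipeline names are placeholders; review apps can then be enabled from the pipeline dashboard):

# Create a staging app that builds from the repository's Dockerfile (the "container" stack)
heroku apps:create myapp-staging
heroku stack:set container --app myapp-staging

# Put it in a pipeline...
heroku pipelines:create myapp --app myapp-staging --stage staging

# ...then add a production app to the same pipeline
heroku apps:create myapp-production
heroku stack:set container --app myapp-production
heroku pipelines:add myapp --app myapp-production --stage production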

Challenges in transitioning to Heroku

Switching infrastructure services is always disruptive, and after five years of building on AWS, we knew to expect real difficulty. Transitioning to Heroku came with its own set of challenges, including:

  • Pricing. At a minimum of $25/mo for a production dyno and $50/mo for a production database, Heroku is a lot more expensive than EC2, where a $10/mo t3.small can often successfully host a small client’s entire app. We felt that the decrease in time we would have to spend maintaining servers would easily offset this higher cost, but we had to make that case carefully.

  • Long-term maintenance. The web changes fast, and companies go out of business all the time. We need to make sure Heroku will last as long as our client applications do, and that we’ll be able to use it effectively for that long. Ultimately, we felt that Heroku’s status as a 13-year-old company with major backing from Salesforce meant it was stable enough to rely on.

  • Changes to our stack. A number of standard tools that we’ve used for Django apps in the past, including Nginx, Supervisord, and Blackbox, have no sensible analogs in Heroku. Often Heroku requires switching to a new paradigm entirely, as when using WhiteNoise to serve static files directly from the app.

Ultimately, we felt that the value proposition of Heroku would outweigh these challenges, and that we couldn’t afford not to switch.

Making the switch

The transition took months to complete, but we’re grateful that our process for researching alternatives to our custom deployment pipeline led us to a service that fits our needs. We’re happy with our choice of Heroku and we’re excited to fully deprecate our legacy deployments.

We now maintain detailed documentation on deploying Django apps to Heroku using containers, along with a template for quickly spinning up a new containerized Django app and deploying it to Heroku. Our docs and our templates are free and open source, and we encourage contributions from anyone who uses them.

Are you looking to switch CaaS providers, and aren’t sure where to start? Or do you already have years of experience putting containers in production on Heroku, and you want to share some wisdom? Reach out to us on Twitter and let us know what you think.