We, like many small startups, had started our application on Heroku, and had built up considerable technical and social practice around its use. About 18 months ago, the Engineering team here at Salsify recognized that as our team, product, and user base continued to grow, we would outgrow the ideal case for Heroku.
One driving factor was the increasing number of applications (services) that the Engineering team was creating. Heroku did not have good facilities for service discovery, coordination, or dependency management. In addition to services management, we anticipated a need for tighter control of resource utilization, varied scaling scenarios, and security.
This recognition led to kicking off a major project to design and build the next Salsify hosting platform, which we’ll talk about in more detail.
What Options Were Available?
We spent a couple of weeks evaluating container orchestration platforms. Our capacity allowed us to create a proof of concept for three solutions: Docker Swarm, Tectonic Kubernetes, and vanilla Kubernetes. Having rigid low-level requirements helped narrow the list considerably. We required a granular permissioning model, the ability to attach to deployed containers for development agility, support for AWS IAM roles at the container level, and a mechanism to easily organize Rails applications. We realized no single solution would check every box, which led us to look for something that gave us primitives we could build the rest on top of.
Once we realized this, Kubernetes became a very appealing choice. We were also immediately biased towards it because of its incredible and sustained development velocity, which beat Docker Swarm's by an order of magnitude in terms of measurable activity on GitHub. Kubernetes' API-first philosophy allowed us to develop towards a model that provided the usability of Heroku with the observability and configurability you can only get when running your own PaaS.
When designing our Kubernetes environment, we knew it would be an iterative process given the gradual rollout of microservices to it. However, we wanted to reach a general consensus on design decisions where possible. By using the kops project, we were able to experiment with cluster configurations fairly quickly.
For instance, creating two logically separate clusters for our Staging and Production environments allowed us to become more comfortable with deployments, and gave us two identical environments for testing major changes with applications running. Another major design principle we chose was to break applications up into their own namespaces. This means that a backend audits service, with all of its processes and config, lives in a namespace called audits-service. Running and scaling the Deployments independently within a namespace effectively turns this collection into the equivalent of a Heroku Formation. And, by using role-based access control (RBAC), we were able to replicate the granular per-app permission model we had before in Heroku.
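To make the namespace-per-application idea concrete, here is a minimal sketch of what the objects for one application could look like, built as plain Python dictionaries that can be serialized to JSON or YAML. The function name, the role name, the group name, and the exact list of resources and verbs are illustrative assumptions, not the rules we actually run.

```python
from typing import Dict, List


def namespace_rbac(app: str, group: str) -> List[Dict]:
    """Build a Namespace, Role, and RoleBinding scoped to one application.

    `app` doubles as the namespace name (e.g. "audits-service").
    `group` is a hypothetical identity-provider group name.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": app},
    }
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": f"{app}-developer", "namespace": app},
        # Illustrative permissions only: enough to inspect, exec into,
        # and update the application's own workloads.
        "rules": [
            {
                "apiGroups": ["", "apps"],
                "resources": ["pods", "pods/log", "pods/exec",
                              "deployments", "configmaps"],
                "verbs": ["get", "list", "watch", "create", "update", "patch"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{app}-developers", "namespace": app},
        "subjects": [
            {"kind": "Group", "name": group,
             "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "Role",
            "name": role["metadata"]["name"],
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    return [namespace, role, binding]
```

Because the Role and RoleBinding are both namespaced, the group gains no access to any other application's namespace, which is what gives the per-app permission boundary.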
For authentication, we tested a few different solutions, but eventually settled on Google OIDC auth. We use Google Apps authentication for a number of other services, which made it natural to extend it to this case as well. This allowed our clients to use ephemeral keys tied to their Google accounts to authenticate to the Kubernetes API.
In terms of deployment, we chose Jenkins to be our Swiss Army knife. Using declarative pipelines, we were able to capture all the deploy logic we needed and scale that to deploy all our services to both Kubernetes environments.
A change as big as switching platforms can really disrupt the agility of an engineering org. It requires becoming comfortable with new artifacts, new tools, and new monitoring, among other things. With this in mind, we knew that handing developers a low-level API tool like kubectl would neither produce the best initial outcome nor foster the adoption the company was hoping for.
In turn, we started developing in-house tools to ease the transition. The two major ones, salsifyk8s and ksonnify, were designed to help engineers migrate and operate their applications. salsifyk8s contains all the logic required to run a Rails console session with the current code and configuration for a given environment. It also displays information like process types and counts, and supports scaling by process name instead of by deployment. Ksonnify, on the other hand, allows easy creation of the configuration artifacts required for a given deployment, modifying fields like image, secret name, and configmap name, and handles the initial bootstrap of a Dockerfile and a Kubernetes namespace with RBAC rules. Ksonnify derives its initial configuration from the Procfile required by Heroku, making the first steps of migrating an app as simple as possible.
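The Procfile-driven approach can be sketched roughly as follows. This is not ksonnify's code; it is a minimal illustration, assuming one Deployment per Procfile process type. The function names, the label keys, and the `<app>-config` / `<app>-secrets` naming convention are all hypothetical.

```python
from typing import Dict


def parse_procfile(text: str) -> Dict[str, str]:
    """Map each Procfile process type to its command, skipping blanks and comments."""
    processes = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; the command itself may contain colons.
        name, sep, command = line.partition(":")
        if not sep or not name.strip() or not command.strip():
            raise ValueError(f"malformed Procfile line: {raw!r}")
        processes[name.strip()] = command.strip()
    return processes


def deployment_for(app: str, process: str, command: str,
                   image: str, replicas: int = 1) -> Dict:
    """Build a Deployment manifest for one process type in the app's namespace."""
    labels = {"app": app, "process": process}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{app}-{process}", "namespace": app,
                     "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": process,
                        "image": image,
                        # Run the Procfile command through a shell, as written.
                        "command": ["/bin/sh", "-c", command],
                        # Hypothetical naming convention for config and secrets.
                        "envFrom": [
                            {"configMapRef": {"name": f"{app}-config"}},
                            {"secretRef": {"name": f"{app}-secrets"}},
                        ],
                    }]
                },
            },
        },
    }
```

Labeling each Deployment with its process type is also what would make it possible to look up and scale a workload by process name rather than by Deployment name.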
After we had been running in production for a few months, we decided to make upgrading the clusters a common occurrence. We never wanted to fall into the mindset that because things were running smoothly, nothing should be touched. To show the organization that it could trust not only Operations but also Kubernetes, we went through various upgrades. One interesting note: once we had a few services in the cluster, we realized that upgrading etcd from 2 to 3 via kops would be dangerous. Instead, we deployed new clusters and wrote a script to port an application to the new cluster one deployment at a time. This really drove home that declarative applications built for Kubernetes are easy to lift and move to other Kubernetes environments.
Once we started migrating services, we soon realized our existing monitoring was not cut out for containers. Furthermore, the visibility into process metrics in Heroku was pretty limited, and we felt we could do better. To solve this problem, we deployed Prometheus using the prometheus-operator framework. With some customization, we had a monitoring stack deployed with Grafana, and per-namespace alerting to Slack and PagerDuty, in very little time.
This did, however, introduce some new issues that we had not faced before with monitoring. For instance, we had set up alerting on a metric that captured the memory usage of a container. However, this specific metric also included page cache, which, when working with lots of files in Ruby, makes it an inaccurate representation of memory usage. At the end of the day, just capturing all the data via metrics in Kubernetes and then visualizing it in Grafana has made us much more proactive in sizing our services. Admittedly, Operations still tends to be the number one user of this monitoring stack, although that is changing organically over time.
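The page cache problem is easiest to see with numbers. The sketch below assumes the container runtime reports total usage (which includes file cache) alongside the inactive file cache, and subtracts the latter to get a working-set figure, which is how cAdvisor derives its working-set metric. The function names and the 80% threshold are illustrative assumptions.

```python
MIB = 1024 * 1024


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Total usage minus reclaimable (inactive) file cache, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


def over_threshold(value_bytes: int, limit_bytes: int,
                   threshold: float = 0.8) -> bool:
    """True when the value is at or above the given fraction of the limit."""
    return value_bytes / limit_bytes >= threshold


# A file-heavy worker: 900 MiB "used", but 500 MiB of that is cold page cache.
usage = 900 * MIB
inactive_file = 500 * MIB
limit = 1024 * MIB

alert_on_raw_usage = over_threshold(usage, limit)
alert_on_working_set = over_threshold(
    working_set_bytes(usage, inactive_file), limit)
```

In this example an alert on raw usage fires at roughly 88% of the limit, while the working set sits near 39%, so the container is nowhere near being out of memory.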
When the time came to migrate services, we created a pretty simple process for ourselves. Since we use lots of background workers that just process jobs off a queue, we could cut over processes one by one and ensure things were working as expected. To enable this, we first extended our Jenkins pipeline to support deploying to both Heroku and Kubernetes at the same time. Then, starting with low-risk processes, we would update the Kubernetes definition, deploy the application, and spin down that process in Heroku. After a few times, this became comfortable and very repeatable. The last processes to move were the web processes, which were left to a simple DNS cutover when we were ready.
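The ordering described above can be written down as a small plan generator. This is only a sketch of the sequencing, not our Jenkins pipeline: the `Step` type, the action descriptions, and the final step of scaling Heroku's web process to zero after the DNS change are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    platform: str  # "kubernetes", "heroku", or "dns"
    process: str   # process type, e.g. a worker name or "web"
    action: str    # human-readable description of what to do


def cutover_plan(workers_by_risk: List[str],
                 web_process: str = "web") -> List[Step]:
    """Order the cutover: workers from lowest risk first, web last via DNS."""
    plan: List[Step] = []
    for process in workers_by_risk:
        if process == web_process:
            continue  # web is always handled last, below
        plan += [
            Step("kubernetes", process, "deploy and scale up"),
            Step("kubernetes", process, "verify jobs are being processed"),
            Step("heroku", process, "scale to zero"),
        ]
    plan += [
        Step("kubernetes", web_process, "deploy and scale up"),
        Step("dns", web_process, "point traffic at Kubernetes"),
        Step("heroku", web_process, "scale to zero"),
    ]
    return plan
```

Because queue workers on either platform pull from the same queue, each worker can move independently, and a failed verification step only requires scaling that one process back up on Heroku.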
Looking back over the last 18 months, we succeeded at nearly all of our initial goals in the migration, albeit significantly slower than we’d estimated (by at least 6 months). We still have work to do on improving developer enablement, particularly as it relates to the creation of new services. At varying points in the process we underestimated the cost of prototyping, and overinvested in tooling too early. This overinvestment led to a persistent tax of updating our tooling as our process and practices shifted, as we learned more and became more opinionated. The tools were, and are, an important piece of our developer ecosystem, but could have benefited from a better plan and design as opposed to the more organic approach that we took.
Ultimately we managed to migrate our production application from one hosting environment to another with minimal downtime, while gaining considerable capabilities we would not have had otherwise.
Now that we’re past the major lift of the migration itself, we’re also really excited about the new things we can do on top of this new platform. We have good aggregate-level tooling now to work with entire deployments as single objects, so we must evangelize those tools and get their functionality integrated into our workflows. We have a solid API for inspection and introspection, which creates opportunities for improvements in monitoring, service discovery, announcement and more. Our provisioning time has dropped to the point that dynamically provisioning full testing environments seems a lot less of a dream. In a lot of ways, one of the best parts about getting all this difficult work done is...it makes way more work for us.
Salsify DevOps Team (past and present): Adam Bell, Josh Branham, Tanner Burson, Nurus Chowdhury, James Harrington, Joe Roberts