Criteria for Architecture
In this segment of our larger piece, we will cover the criteria we used to create and plan our architecture. Moving from a monolithic app to microservices isn’t just a matter of code; it is also, and above all, a question of architecture. Without a cohesive plan, and a strong understanding of the potential pitfalls and gains, a large transition such as this one is bound to falter. As you (may have) read in our previous article, this project had us move from classic servers on VMs to something completely new to us. Every decision could change a multitude of later processes, without any assurance that it was the right path. For that reason, we wrote a full specifications document and shared it with the team, to gather as much feedback as possible and to gain support and buy-in for our plan. Sharing this document also gave the team the big-picture view, and allowed them to influence key decisions.
Within the full specifications document shared with our team, the following key points were covered:
As Gemnasium is an application responsible for delivering security advisories, it was vitally important that no matter what path we chose, that the security of our application remained a top consideration. After all, who would trust the advisories provided by a solution that was not itself secure? With more than 500K projects, we have access to a lot of accounts, tokens, and data. Because of this, we take security very seriously.
This consideration was tested early, when we evaluated the first batch of new tools: Docker and its siblings, such as docker-compose, swarm, etc. While the Docker registry was freely available and open source, there was no way to authenticate users at that point. We couldn’t yet see a way with these solutions to secure access to the registry, or to provide different roles (read-only, read-write, etc.). Another drawback was that the tools we tested had nothing that would isolate projects completely from each other, or even isolate the components themselves. For these reasons, we couldn’t justify a shift to these tools.
When we discovered Openshift, on the other hand, we found a solution in which the registry was only available with proper authentication. It provided compartmentalization, strong authentication/authorization, and still had the flexibility we needed for building enterprise grade architecture.
A simple, process-driven and easy deployment was important to us. We had been using Capistrano for years. We also loved Heroku-like deployments, where pushing code to a specific remote deploys the app. We wanted something similar, simple and neat. However, we had a few other considerations that forced us to make a move.
First of all, our new architecture had to support rollbacks, both manual and automatic, like we did with Capistrano. This change alone involved a lot of changes to our processes, so we needed a solution that wouldn’t add to the workload. Unfortunately, while most of the solutions we tested did provide at least a partial response to our criteria, or were close to meeting them, the gaps were significant enough that they required creating and maintaining a lot of in-house scripts. This meant more work for us, and no one wants that if they can avoid it.
Once again, Openshift proved to be the solution (for our requirements). The health checks on deployed containers practically guaranteed clean deployments: if something was failing, the previous version would still be in place. Its deployment strategies gave us full control over how the new Docker containers were deployed. For example, we could set the “Rolling” strategy to bring two new containers up while keeping at least two running at all times. Best of all, like Capistrano, it offered zero-downtime deploys, which in turn ensured smooth database migrations. Because at least two versions of the codebase could be running at once, every change had to work alongside the previous version, which kept us from introducing breaking changes.
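To make the rolling strategy and health check more concrete, here is a minimal sketch that builds an Openshift DeploymentConfig as a Python dictionary and prints it as JSON. The service name, image, port, health path and timing values are all assumptions chosen for illustration, and the `apiVersion` shown is the one used by recent Openshift releases; treat it as an illustrative manifest, not a production configuration.

```python
import json


def rolling_deployment_config(name, image, replicas=2, port=8080, health_path="/health"):
    """Build a DeploymentConfig manifest that uses the Rolling strategy.

    All values are illustrative assumptions, not settings from the article.
    """
    return {
        "apiVersion": "apps.openshift.io/v1",
        "kind": "DeploymentConfig",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"app": name},
            "strategy": {
                "type": "Rolling",
                "rollingParams": {
                    # Start up to `replicas` extra pods during a rollout...
                    "maxSurge": replicas,
                    # ...and never drop below the desired count.
                    "maxUnavailable": 0,
                    # Give up on the rollout if it has not finished in time.
                    "timeoutSeconds": 600,
                },
            },
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            # A new pod only receives traffic once this check passes.
                            "readinessProbe": {
                                "httpGet": {"path": health_path, "port": port},
                                "initialDelaySeconds": 5,
                                "periodSeconds": 10,
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    config = rolling_deployment_config("web", "registry.example.com/web:1.2.3")
    print(json.dumps(config, indent=2))
```

With `maxUnavailable` set to 0, old pods are only removed once their replacements pass the readiness probe, which is what keeps the previous version in place when a new release fails to start.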
Scalability was another strong consideration. As anyone in IT knows, while VMs can be easily duplicated, preparing and updating VMs, or removing old architecture, as would be required for our new approach, is by far not a point-and-click process. We needed a solution that provided more than just vertical scaling (adding more resources to a single instance). The vertical scaling approach has a ceiling (the node’s available resources), and is not resistant to a node failure. With only one instance running, and all resources tied to the overtaxed node, nothing is available to restart that instance elsewhere.
We wanted something simple, but with the ability to scale with a few operations if needed. We were shifting our processes to microservices, but there were some pieces that would still require a bit of heft. At the risk of sounding repetitive, Openshift again had everything our organization needed (they’re not paying us, we swear). A few clicks enabled us to scale containers up or down as needed. As we exploded our giant app into a multitude of smaller pieces, we could scale each fragment exactly to its own requirements, without scaling the whole app. On top of this, the underlying Kubernetes layer, responsible for orchestrating all these containers, also provided real-time monitoring, ensuring that exactly the desired number of replicas was running.
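The idea of keeping “the exact desired number of replicas” running can be pictured as a reconciliation loop: compare what is running with what is wanted, then start or stop instances to close the gap. The toy model below is only a sketch of that idea; the `ReplicaSet` class and its instance names are hypothetical and do not reflect how Kubernetes is implemented internally.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ReplicaSet:
    """Toy model of one service and the replicas that should be running."""

    name: str
    desired: int
    running: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def reconcile(self):
        """Start or stop instances until the running count matches `desired`.

        Returns the list of actions taken, so callers can log or inspect them.
        """
        actions = []
        while len(self.running) < self.desired:
            instance = f"{self.name}-{next(self._ids)}"
            self.running.append(instance)
            actions.append(("start", instance))
        while len(self.running) > self.desired:
            actions.append(("stop", self.running.pop()))
        return actions


if __name__ == "__main__":
    workers = ReplicaSet("worker", desired=3)
    print(workers.reconcile())          # initial start-up

    workers.running.remove("worker-2")  # simulate a crashed container
    print(workers.reconcile())          # a replacement is started

    workers.desired = 1                 # scale this one service down
    print(workers.reconcile())
```

Scaling a single fragment of the app then amounts to changing one number, `desired`, for that fragment alone.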
If you’re making any kind of wholesale changes to your app, knowing exactly what is going on, or at least being able to dig into the problem at a moment’s notice, is vital to your progress. Checking the logs, whether to confirm your suspicion, or to dig into the bowels of your system, is a go-to staple for any troubleshooting coder. Unfortunately, prior to 2015, and the introduction of Docker logging drivers, nothing was available that would send these logs directly to a central location. Aggregation of these logs was a requirement for us, because no one wants to hunt for the logs, only to then have to hunt through the logs for the issue. One hunt at a time, folks.
Our solution was simple, if a little old school. If environment variables told a component to send its logs to a specific location, we followed those instructions and sent a copy of the logs to the centralized location as well. With the help of a “log” service (http://kubernetes.io/docs/user-guide/services), it was relatively easy to configure this for all of our components at once.
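As a rough illustration of this approach, the sketch below always logs to standard output and adds a second, remote destination when the environment describes one. It assumes the variables `LOG_SERVICE_HOST` and `LOG_SERVICE_PORT`, which is the naming Kubernetes uses when it injects the address of a service called “log” into a container, and it assumes the central collector speaks syslog over UDP. Both are assumptions made for the example, not details taken from our setup.

```python
import logging
import logging.handlers
import os
import sys


def remote_log_target(env):
    """Return (host, port) for the central log collector, or None if unset.

    LOG_SERVICE_HOST / LOG_SERVICE_PORT are assumed names (see lead-in).
    """
    host = env.get("LOG_SERVICE_HOST")
    if not host:
        return None
    return host, int(env.get("LOG_SERVICE_PORT", "514"))


def build_logger(name, env):
    """Create a logger that writes to stdout and, if configured, to syslog."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    # Local output is always kept, so container logs still work as usual.
    logger.addHandler(logging.StreamHandler(sys.stdout))

    # Mirror every record to the central location when one is configured.
    target = remote_log_target(env)
    if target is not None:
        logger.addHandler(logging.handlers.SysLogHandler(address=target))
    return logger


if __name__ == "__main__":
    log = build_logger("billing", os.environ)
    log.info("service started")
```

Because the destination comes from the environment, the same image runs unchanged with or without central logging.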
No solution can be considered ready for go-live if a backup solution is not in place, and well tested. More importantly, above and beyond backup, a disaster recovery plan should be in place. When a critical failure occurs, whatever the cause, you need to be able to restore things to the status quo in a very short time frame. There is no time to figure out where the required data is backed up, where it should be restored, and what to do next. One challenge of the microservices approach is that if a single (but key) service fails, finding the backup files specific to that piece might be problematic, if not adequately planned. This was a particular concern because many of the solutions we tested did not provide centralized data stores. Additionally, no matter which solution we looked at, the question of what to do next in a recovery situation was not really covered at all in the documentation.
In Kubernetes (hence Openshift), all persistent data is available through persistent volumes. This mechanism is automated, well documented, and very complete. Many volume types and shares are supported, including NFS, GlusterFS, and even Ceph. With all the persistent data in one place, it is extremely easy to back up, restore, and monitor. What to do next is also simplified, reduced to scaling down replicas to zero, restoring the data, then scaling back up again to the original number of replicas. Even if it is not spelled out, planning your recovery is made simpler by the features provided.
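The three recovery steps above (scale to zero, restore, scale back up) can be scripted in a few lines. The sketch below wraps the `oc scale` command; the resource name `dc/db` and the `restore_data` callable are hypothetical placeholders, and a real procedure would also wait for the old pods to terminate before touching the volume.

```python
import subprocess


def scale_command(resource, replicas):
    """Build the `oc scale` command line for a resource such as 'dc/db'."""
    return ["oc", "scale", resource, f"--replicas={replicas}"]


def restore_service(resource, replicas, restore_data, run=subprocess.check_call):
    """Stop a service, restore its persistent data, then start it again.

    `restore_data` is a caller-supplied function (hypothetical here) that
    copies the backup onto the persistent volume. `run` executes a command
    and can be swapped out for a dry run or a test.
    """
    # 1. Stop every replica so nothing writes to the volume during the restore.
    run(scale_command(resource, 0))

    # 2. Restore the data. If this raises, the service deliberately stays
    #    down so an operator can investigate before it serves bad data.
    restore_data()

    # 3. Bring the service back to its original size.
    run(scale_command(resource, replicas))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    restore_service(
        "dc/db",
        replicas=3,
        restore_data=lambda: print("restoring backup onto the volume..."),
        run=print,
    )
```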
Another key consideration when choosing your architecture is monitoring. It is a bit of a challenge to monitor microservices, because it is difficult to predict where and when a component/service will be running. In most cases, all you need to do is check the load balancer for expected responses from the services behind it. However, this won’t tell you how many instances of a given service are running. If only one instance is left, your app could be circling the proverbial drain, one small push away from disaster (or at least downtime) territory. Good monitoring is proactive, and should let you know of such impending problems. Ideally, it should be able to trigger automated responses, such as restarting services if they stop, and notifying an admin when this happens.
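The gap described here, a load balancer that still answers while only one instance is left, can be expressed as a simple rule on replica counts. The function below is a hypothetical sketch: the thresholds, the status names, and the example service names are assumptions, and the counts would come from whatever your platform reports.

```python
def replica_health(ready, desired, minimum_safe=2):
    """Classify a service from its replica counts.

    'critical' : nothing is serving traffic.
    'warning'  : traffic is served, but there is little or no redundancy left.
    'ok'       : every desired replica is ready and redundancy is intact.
    """
    if ready == 0:
        return "critical"
    if ready < desired or ready < minimum_safe:
        return "warning"
    return "ok"


def services_needing_attention(counts):
    """Return {service: status} for every service that is not 'ok'.

    `counts` maps a service name to a (ready, desired) tuple.
    """
    report = {}
    for service, (ready, desired) in counts.items():
        status = replica_health(ready, desired)
        if status != "ok":
            report[service] = status
    return report


if __name__ == "__main__":
    observed = {"api": (3, 3), "mailer": (1, 3), "reports": (0, 2)}
    print(services_needing_attention(observed))
```

A plain load balancer check would report “api” and “mailer” as equally healthy; counting replicas is what surfaces the second one as a warning before it becomes downtime.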
In our case, metrics are largely provided by the app directly, and fed into a centralized group of servers dedicated to that purpose.
Of the solutions we looked at, only Openshift provided key metrics and monitoring out of the box. Openshift automatically provided Hawkular metrics, and made graphs available directly in the console. While this provided a convenient and attractive dashboard, the metrics did not go beyond CPU and RAM consumption. For us, this was not enough, so we came up with our own innovations, which we will cover in part 4 of this serial blog.
You can have the most brilliant, innovative solution in the world, but if you don’t tell people how to use it, it will fail. Adequate documentation is where nearly all of the solutions we tested failed. Granted, many were in the early stages of development, and now probably have more complete documentation, but at the time we were testing these solutions, every problem encountered resulted in a great deal of wasted time. As is often the case with open source projects, solutions are very clear for the authors of the project, but until there is a larger community surrounding the project, the holes in the provided information are simply not filled in. If you are planning to shift to a new methodology and a new platform, solid documentation is vital. Hands down, Openshift won this round. Red Hat provided dedicated teams to create online documentation. Even in the beta stages, we had not seen a solution with documentation as precise and complete as this. Even better, it was constantly evolving, with new details being added in tandem with each version (roughly once a month). This alone might have made our decision for us, even if a couple of the other pieces had been missing, simply because with adequate documentation you know what is missing or incompatible, what is on the roadmap, and can plan accordingly.
We spent a great deal of time evaluating potential solutions, and for some it took several weeks before we finally had to toss them aside. In many cases, we liked the solution, but could not justify the delays we would incur waiting for bug fixes or for a needed feature to be integrated. Choosing promising and shiny startup software is always tempting, but what if the project is discontinued? You are back to square one, because all the tests you’ve done, and the development and planning surrounding this solution, are now obsolete. We needed to make a long-term investment, and master it as quickly as possible. This meant we had to see a similar investment from our chosen solution. Red Hat’s support of Openshift development swayed us in this regard. This support made it clear that Openshift was in it for the long haul. The underlying layer, Kubernetes, had just been open sourced by Google, and was starting to get a great deal of attention from the development community. To us, this meant that the solution had both the required support, and a promising development path for the future.
The final decision to go with Openshift was a watershed moment, based on the first feedback from our IT team. The feedback was simply: “We went through the Getting Started part of the documentation, which includes installation, and our test app is working fine”. This was the first time that everything had gone well in our testing from start to finish. That was all it really took: we had our new tool. It was time to figure out how to fit it to our existing app, and the new services within. With the help of services and endpoints, we could start putting components into Openshift, while still being able to communicate with legacy components outside the cluster. This was the core of our new solution, because it allowed us to migrate one piece at a time, and roll back easily if there was a problem. The benefits of the Kubernetes configuration of our app also allowed us to look ahead to our upcoming Enterprise Edition of Gemnasium.
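For readers unfamiliar with the services-and-endpoints trick, here is a minimal sketch of the pattern: a Kubernetes Service declared without a selector, paired with an Endpoints object of the same name that points at an address outside the cluster. The service name, IP address and port below are made-up examples, not values from our infrastructure.

```python
import json


def external_service(name, ip, port):
    """Build a selector-less Service plus matching Endpoints manifests.

    Components inside the cluster can then reach the legacy component at
    `name:port`, exactly as they would reach an in-cluster service.
    """
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        # No "selector" key: Kubernetes will not manage the endpoints itself.
        "spec": {"ports": [{"port": port, "targetPort": port}]},
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # Must carry the same name as the Service it backs.
        "metadata": {"name": name},
        "subsets": [{"addresses": [{"ip": ip}], "ports": [{"port": port}]}],
    }
    return [service, endpoints]


if __name__ == "__main__":
    # Hypothetical legacy database still running on a VM outside the cluster.
    for manifest in external_service("legacy-db", "192.0.2.10", 5432):
        print(json.dumps(manifest, indent=2))
```

Once the legacy component has itself been migrated, the Endpoints object can be dropped and a selector added to the Service, without the callers changing the address they use.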
In our next article, we’ll start to show you how we put the pieces together. See you next week, and thanks for reading!