Off-the-shelf Cloud Platforms vs DIY with Infrastructure-as-Code

In this blog post, we compare two prevalent approaches to cloud infrastructure management. The first is what we broadly classify as Infrastructure-as-Code, where engineers use programming or scripting languages to build a set of scripts that achieve the desired topology on a cloud platform. Terraform, CloudFormation, Chef, Puppet, and Ansible are some popular tools. Each consists of a language for writing scripts plus a controller that runs them. Once satisfied with the result, the user saves the scripts in a code repository. Subsequently, if a change is needed, the files are edited and the same process is repeated.

The second category is a “Cloud Orchestrator” or “Platform”. This is typically a thin abstraction over native cloud APIs, exposed to the user as a web service. The user connects to the service (via UI or API) and builds the cloud topology within that web service itself. The orchestrator applies the topology and saves it in its own database, so the user does not need to save the configuration explicitly. When an update has to be made, the user logs in to the system again and makes the change.

For smaller-scale use cases, a platform may be too heavy. But at scale, the former approach morphs into an in-house platform. In this blog, we argue that for larger-scale infrastructure, a better strategy is to use an off-the-shelf platform that can be enhanced with infrastructure-as-code scripts when customization is required. Mega-scale data centers like those belonging to Facebook and Netflix are a different ball game, and nothing in this blog applies to them.

Not sure where you fall? Let us present you with a solution tailored to suit your needs.


“Long Running Context”

The fundamental value that a platform-based approach provides is what we call “long running context.” People may also call this a “project” or a “tenant”. A context could map to, say, an application or an environment such as demo, test, prod, or a developer sandbox. When making updates to the topology, the user always operates in this context. The platform saves the updates in its own database within this context before applying them to the cloud. In short: one is always guaranteed that what is present in this database is what is applied to the cloud.
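The save-then-apply flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Context`, `Platform`, `update`, and the in-memory dictionaries standing in for the platform database and the cloud are all hypothetical names invented here, not the API of any real product.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical long-running context (a tenant, project, or environment)."""
    name: str
    desired: dict = field(default_factory=dict)  # stand-in for the platform's database
    applied: dict = field(default_factory=dict)  # stand-in for what is live in the cloud


class Platform:
    """Hypothetical orchestrator that records every change before applying it."""

    def __init__(self):
        self.contexts = {}

    def context(self, name):
        # Returning the existing context lets a user pick up where they left off.
        return self.contexts.setdefault(name, Context(name))

    def update(self, context_name, resource, spec):
        ctx = self.context(context_name)
        ctx.desired[resource] = dict(spec)  # 1. save the change within the context
        self._apply(ctx, resource)          # 2. then apply it to the cloud
        return ctx

    def _apply(self, ctx, resource):
        # A real platform would call cloud APIs here; this sketch just copies state.
        ctx.applied[resource] = dict(ctx.desired[resource])


platform = Platform()
platform.update("demo", "web", {"replicas": 2})
ctx = platform.update("demo", "web", {"replicas": 3})
print(ctx.desired == ctx.applied)
```

Because every change goes through `update`, the recorded state and the applied state cannot diverge in this toy model, which is the property the context is meant to guarantee.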

In the Infrastructure-as-Code approach, such a context is not provided natively and is left to the user. Typically this translates to something like “which scripts need to be run for which context” or maybe a “folder” in the code base that represents the configuration for a given tenant or project. Defining the context as a collection of code is harder because many of the scripts might be common across tenants. So most likely it comes down to the developers’ understanding of the code base.

A platform is a more declarative approach to the problem: it requires little or no coding, because the system generates the code from the user's intent without requiring knowledge of low-level implementation details. With scripting, by contrast, any change requires a good understanding of the code base, especially when operating at scale. On a platform, a user can come back and log in to the same context a few days later and continue where they left off without having to dig deep into the code to understand what was done before.

Difference between the code base and what is applied to the cloud

The second fundamental difference between the two is that Infrastructure-as-Code is a multi-step process (write the script, run it, and merge it into the repo), while a platform is a one-step process (log in to the context and make the change). With scripting, it is possible to update and run a script but forget or postpone saving it in the repository. Meanwhile, another engineer could have changed the code base for their own side of the topology and merged it. Since many pieces of code are shared between the two use cases, the first developer may find themselves in a conflict which, even if resolved by merging the code, leaves them in a situation where what was run in the cloud is not what is in the repo. The developer now has to re-run the merged code to validate it, at the risk of causing a regression. To avoid this risk, the script first has to be tested in a QA environment.
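To make the mismatch concrete, here is a minimal sketch of comparing what the repository describes with what is running in the cloud. The function `find_drift` and the dictionary shapes are assumptions made for this illustration; real tools compute this difference against live cloud APIs rather than plain dictionaries.

```python
def find_drift(repo_state, cloud_state):
    """Return resources whose definition differs between repo and cloud.

    Both arguments are hypothetical dicts mapping a resource name to its spec.
    A resource missing on one side is reported with None for that side.
    """
    drift = {}
    for name in sorted(set(repo_state) | set(cloud_state)):
        in_repo = repo_state.get(name)
        in_cloud = cloud_state.get(name)
        if in_repo != in_cloud:
            drift[name] = {"repo": in_repo, "cloud": in_cloud}
    return drift


# The first developer ran a change to "web" but never merged it,
# while a second developer merged a new "cache" that was never run.
repo = {"web": {"replicas": 2}, "cache": {"nodes": 1}}
cloud = {"web": {"replicas": 3}}

for name, sides in find_drift(repo, cloud).items():
    print(name, sides)
```

In this example both `web` and `cache` are reported, one for each direction in which the repo and the cloud can fall out of step.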

All the “other” stuff

Scripting tools enable deployments, but there is much more to running infrastructure for cloud software. We need an application provisioning mechanism, a way to collect and segregate logs and metrics per application, health monitoring and alerting, an audit trail, and an authentication system to manage user access to infrastructure. Several tools are available to solve these individual problems, but they need to be put together and integrated into an application context. Kubernetes, Splunk, CloudWatch, SignalFx, Sentry, ELK, and OAuth providers are all examples of these tools. But one needs a coherent “platform” to bring all this together in order to operate at a reasonable scale. This brings us to our next point.

Much of Infrastructure-as-Code is basically a homegrown cloud platform

When talking to engineers, we often hear the argument that Infrastructure-as-Code combined with bash scripts, or even regular programming languages like Go, Java, and Python, provides all the hooks necessary to overcome the above challenges. Of course, I agree. With this sort of code, one can build anything. But effectively, you might be building the same kind of platform that already exists. Why not start from an existing platform and add customization through scripts?

The second argument we have heard is that Infrastructure-as-Code is more flexible and allows for deep customization, while with a platform one might have to wait for the vendor to provide the same support. I think that, with technology having progressed to the point where cars are driving themselves (once thought to be little more than pure fantasy!), platforms are far more advanced than they are given credit for and have strong machine generation techniques that satisfy most, if not all, use cases. Plus, a good platform would not block one from customizing the part that is beyond its own scope through scripting tools. A well-designed platform should provide the right hooks to consume scripts written outside the platform itself. Thus, in my opinion, this argument does not justify building a code base for the majority of tasks, which are standard.
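As a rough illustration of what such hooks could look like, the sketch below lets user-written functions adjust a topology after the platform has generated the standard parts. Everything here (`HookRegistry`, the `post_generate` event name, and `add_cost_center_tag`) is hypothetical and chosen for this example; it does not describe any particular vendor's interface.

```python
class HookRegistry:
    """Hypothetical extension point: user scripts registered against named events."""

    def __init__(self):
        self._hooks = {}

    def register(self, event, func):
        self._hooks.setdefault(event, []).append(func)

    def run(self, event, topology):
        # Each hook receives the topology and returns a (possibly modified) copy.
        for func in self._hooks.get(event, []):
            topology = func(topology)
        return topology


def add_cost_center_tag(topology):
    """Example user script: a customization outside the platform's own scope."""
    updated = dict(topology)
    updated["tags"] = {**topology.get("tags", {}), "cost-center": "1234"}
    return updated


hooks = HookRegistry()
hooks.register("post_generate", add_cost_center_tag)

generated = {"service": "web", "tags": {"env": "demo"}}  # produced by the platform
final = hooks.run("post_generate", generated)
print(final)
```

The platform keeps ownership of the standard topology, while the team keeps its custom logic in small scripts that plug in at well-defined points.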

“There is no platform that fits our needs”

This is also a common argument. And I agree: a good platform should strive to solve this prevalent problem. At DuploCloud, we believe we have built an exceptional platform that addresses the majority of the use cases while giving developers the ability to integrate policies created and managed outside the system.

“The San Mateo Line!”

A somewhat surprising argument in favor of building homegrown platforms is that it is simply a very cool project for engineers to tackle, especially engineers from a systems background. We live in the Bay Area and have found a very interesting trend while talking to customers north and south of San Mateo.

When we talk to infrastructure engineers in companies headquartered south of San Mateo, we find that they have a stronger urge to build platforms in-house. They are quite clear that they are building a “platform” for their respective organizations and are not, as they would consider it, “scripting.” For such companies, customization is the common argument against off-the-shelf tools, while hybrid cloud and on-premises deployments are very important use cases. Open source components like Kubernetes, Consul, etc. are common, and thus we frequently hear the assertion that the wheel need not be reinvented. Yet the size of the team and the time allocated to the solution are substantial. In some cases, the focus on building the platform overshadows the core business product that the company is supposed to sell.

Meanwhile, north of San Mateo, we find mostly cloud-native applications. The core talent is full stack. The nature of the business is SaaS. The applications use so many native cloud services (S3, DynamoDB, SQS, SNS) that it’s hard to be hybrid. These teams are happy to hand the container to AWS ECS via API or UI to deploy it. They find no joy in either deploying or learning about Kubernetes. Hence, in-house customizations are far less common and far less deep.

How many times, and by how many people, will the same code be written to achieve the same result? Time-to-market will eventually prevail.
