Dissecting Cloud Migrations and the Role of Automation
- Custom Applications developed by Engineering Teams.
- Commercial off-the-shelf (COTS) Applications
Migration Work Streams
Migrating an environment from on-premise to cloud can be split into 4 workstreams:
- Application Migration: This includes the custom components developed by engineering teams.
- Data Migration: This includes SQL databases, stored procedures, jobs like SSIS and SSRS, file stores and NoSQL.
- Infrastructure Migration: This includes networking topologies, security groups, ACLs, IAM policies, user management, encryption keys, disks, SQL Servers, file store setups and the full list of compliance controls.
- Security Controls: For businesses in regulated industries, there is an exhaustive list of controls to be implemented. Unlike other workstreams, the overall approach to security is very different in cloud compared to on-premise. This is discussed later in the whitepaper.
“The expertise required to replicate and migrate the infrastructure components is more or less standardized, but it requires a cloud SME and can be very laborious with current automation techniques. Application migration is less laborious but involves subject matter expertise specific to the app. Data migration is partly standardized and partly requires an understanding of the custom logic in the environment, for example familiarity with the databases, their interdependencies, stored procedures, SSIS, SSRS, etc.”
Let’s dig deeper into each of these work streams.
- Lift-and-shift Migrations: Minimal change and typically the fastest way to get the first version of the migration up and running.
- Redeploy with no application or configuration change: Uses a cloud-native VM setup, i.e. the deployment uses the latest configuration in terms of OS versions and diagnostics configuration such as agents.
- Repackage (Containerize) and Deploy: Closest to cloud native. Most automated, and easy to maintain and scale.
Data migration typically involves 2 types of data stores: SQL and file stores. The most important goal of data migration is to reduce downtime. The application may be a 24×7 SaaS service with large amounts of data, so stopping all incoming requests for several hours or days while the data is copied over is not an option. There are 2 strategies for this:
- Differential copy over the internet: Here data is copied from on-premise to cloud while the application is functioning. The process is repeated several times until the delta is small enough to be copied over in a very short time. That final copy requires downtime to ensure that no new data is being added to the on-premise data stores. We have seen this technique work well for a few TBs of data. To speed up the data transfer, one could add site-to-site VPN connectivity with dedicated bandwidth instead of copying over the public internet.
- Offline copy by shipping disks: This technique is required when the data is so large that copying it over the internet is not practical. For such cases cloud providers offer a hardware storage device that is shipped to the on-premise site. The data is copied onto it, and the device is shipped back to the cloud provider, where the data is copied into the customer’s cloud account.
For differential copy over the internet, SQL databases provide 2 techniques:
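The choice between the two strategies above largely comes down to arithmetic on data size, bandwidth and the rate at which new data arrives. The following is a minimal sketch of that arithmetic, not a sizing tool: the function names, the link efficiency factor and the assumption of a constant change rate are all illustrative and do not come from any provider's tooling.

```python
def transfer_hours(size_tb, bandwidth_mbps, efficiency=0.8):
    """Hours needed to move size_tb terabytes over a link of bandwidth_mbps.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable (protocol overhead, contention); it is a guess.
    """
    size_megabits = size_tb * 1_000_000 * 8  # decimal TB -> MB -> megabits
    return size_megabits / (bandwidth_mbps * efficiency) / 3600


def passes_until_cutover(size_tb, bandwidth_mbps, change_tb_per_hour,
                         max_downtime_hours, efficiency=0.8, max_passes=50):
    """Simulate repeated differential copies.

    Returns the duration in hours of each pass. The last entry is the pass
    that is short enough to run inside the downtime window. Raises
    ValueError if the delta never shrinks enough, which suggests the
    offline (shipped disk) strategy instead.
    """
    durations = []
    delta_tb = size_tb  # the first pass copies everything
    for _ in range(max_passes):
        hours = transfer_hours(delta_tb, bandwidth_mbps, efficiency)
        durations.append(hours)
        if hours <= max_downtime_hours:
            return durations
        # Data written on-premise while this pass ran becomes the next delta.
        delta_tb = change_tb_per_hour * hours
    raise ValueError("delta does not fit the downtime window; consider offline copy")
```

For example, `passes_until_cutover(2, 1000, 0.01, 0.5, efficiency=1.0)` models 2 TB over a fully used 1 Gbps link with 10 GB of new data per hour: the first pass takes roughly 4.4 hours and the second, about 6 minutes long, fits in a 30-minute downtime window. If data arrives faster than the link can carry it, the deltas grow instead of shrinking and the function raises.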
- Differential backups: In this technique, one would create a full backup, copy it over to a file store in the cloud provider and keep adding small differential backups to the store at periodic intervals. The idea is to get to a point where differential backups are so small that they can be created and copied over in, say, an hour or even less. When we hit that point, we would stop the application, take a final differential backup, copy it over to the cloud and restore the whole series of backups in an isolated database in the cloud.
- Secondary DB: An extra database server is stood up in the cloud provider and connected to the primary as a replica. The primary begins to mirror all transactions at runtime. Over a period of time the secondary catches up to the primary. Finally, one would take a short downtime and flip over. This approach keeps downtime to a minimum.
Differential copy of raw files is a very simple process using tools like RichCopy, which can compare the differences between two folders and copy only what is needed. We could repeatedly run such a tool to replicate data from on-premise to cloud. The first replication would take the longest. Again, the goal is to reach a point where the last replication runs after the application is stopped and takes minimal time to copy over.
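The idea behind such a tool can be sketched in a few lines using only the Python standard library. This is an illustration of the compare-and-copy loop, not a replacement for a dedicated tool: the function name `sync_changed_files` is hypothetical, both folders are assumed to be reachable as local or mounted paths, and files deleted at the source are not removed at the destination.

```python
import filecmp
import os
import shutil


def sync_changed_files(src, dst):
    """Copy files from src to dst that are missing or different in dst.

    Returns the sorted list of relative paths that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(src):
        rel_root = os.path.relpath(root, src)
        target_root = os.path.normpath(os.path.join(dst, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            source_file = os.path.join(root, name)
            target_file = os.path.join(target_root, name)
            # With shallow=True, files whose type, size and modification
            # time match are treated as equal without reading contents.
            if (not os.path.exists(target_file)
                    or not filecmp.cmp(source_file, target_file, shallow=True)):
                # copy2 preserves the modification time, so an unchanged
                # file compares equal on the next run.
                shutil.copy2(source_file, target_file)
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    return sorted(copied)
```

Running the function repeatedly mirrors the process described above: the first call copies everything, and each later call copies only the files that changed since the previous one.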
Outside of the application packages and data, everything else constitutes infrastructure. All of the cloud provider (AWS, Azure, etc.) configuration falls in this category. We would start by drawing out a high-level application architecture. This would typically be done by an architect in the organization. An example of one such diagram for AWS is shown in Figure 2 below.
Here, we see a topology that consists of a VPC with a set of public and private subnets and multiple availability zones. The application runs on EC2 instances fronted by a load balancer and WAF, with an MSSQL database and a file store. There are multiple components of the application, each with its own LB and WAF.
If an organization is in Azure, they have a deployment architecture that is Azure specific. The constructs and terminology may change but conceptually it’s the same. One such topology is shown in Figure 3.
Such a high-level architecture, with 15 or so constructs, gets passed to DevOps teams who translate it into hundreds of lower-level cloud configurations that require thousands of lines of Infrastructure-as-Code.
Infrastructure configuration to realize the application blueprints is typically done in a series of phases as shown in Figure 4.
Security Controls and Compliance Standards
Intertwined with the infrastructure setup are the security controls. This can be the most laborious component and thus deserves to be a workstream by itself. The process starts with mapping the desired compliance standards, like PCI-DSS, HITRUST, HIPAA and GDPR, to the corresponding configurations in the cloud. Cloud providers like AWS have published mappings for these control sets, which act as the authoritative implementation guide. In addition, there are tools like AWS Security Hub and Azure Security Center which list hundreds of granular configurations that need to be applied.
Each compliance standard prescribes a list of controls as shown in figure below:
[Figure: number of controls per compliance standard, including the AWS Well-Architected Framework]
“Using current automation techniques, each control would take a day or 2 to automate. One can see how security by itself can make migrations a multi-month project.”
- Turn off public access on S3 bucket
- Drop invalid HTTP headers in ELB
- Deploy firewall at each Internet connection and between any demilitarized zone (DMZ) and the Internal network zone
- Limit inbound Internet traffic to IP addresses within the DMZ.
- Rotation of secrets
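Controls like the first one in the list above usually reduce to a check over a resource's configuration. The sketch below assumes the bucket's public access block settings have already been fetched into a dictionary keyed by the four S3 flag names (`BlockPublicAcls`, `IgnorePublicAcls`, `BlockPublicPolicy`, `RestrictPublicBuckets`). The function name and the sample dictionary are hypothetical, and fetching the settings from the cloud provider is left out.

```python
# The four flags of the S3 public access block configuration.
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_public_access_blocks(config):
    """Return the flags that are absent or not set to True in config.

    An empty result means public access is fully turned off for the bucket.
    """
    return [flag for flag in REQUIRED_FLAGS if config.get(flag) is not True]


# Hypothetical settings for one bucket: two flags were never enabled.
sample = {"BlockPublicAcls": True, "IgnorePublicAcls": True}
violations = missing_public_access_blocks(sample)
```

Each compliance standard adds many checks of this kind across different services, which is why the effort per control adds up quickly.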
An additional challenge with security is that, unlike the other workstreams, the approach in cloud is substantially different. On-premise security is largely a centralized function revolving around firewalls, IDS, IPS, endpoint security and the OS.
“In cloud, security controls are distributed across the full stack of infrastructure. In addition, many expensive security tools used on-premise have limited applicability. Adapting to the new way of security is both a people and a process problem.”
Automation of these workstreams is the single most important factor for any migration project. Duration of the project, downtime, security and correctness are all functions of the level of the automation. Automation is achieved in a series of phases shown in figure 5:
- Base Infrastructure: This is the starting point where one would pick the regions, bring up VPCs/VNETs with the right address spaces, and set up VPN connectivity and availability zones.
- Application Services: This is the area where we have our virtual machines, databases, NoSQL, object store, CDN, Elasticsearch, Redis, Memcached, message queues and other supporting services. Also in this area are DR, backup, image templates, resource management and other such supporting functions.
- Application Provisioning: Depending on the application packaging type, different automation techniques and tools can be applied. For example, if it's a vanilla virtual machine then Ansible can help automate deployments. For containers we have Kubernetes, Amazon ECS and Azure Web Apps. EMR and Databricks are options for data science workloads.
- Logging, Monitoring and Alerts: These are the core diagnostic functions that need to be set up. Centralized logging can be achieved with Elasticsearch, Splunk, Sumo Logic and Datadog. For monitoring and APM we have Datadog, CloudWatch, SignalFx and so on. For alerts we have Sentry. Many unified tools like Datadog provide all 3 functions.
- CI/CD: There are probably 25+ good CI/CD tools in the industry, from our good old Jenkins to CircleCI, Harness.io, Azure DevOps and so on.
- Security Controls: These are required across the board.
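As a small illustration of the Base Infrastructure phase, the address space of a VPC/VNET can be carved into public and private subnets per availability zone with the standard `ipaddress` module. This is a sketch under assumptions: the function name `plan_subnets`, the `public`/`private` tier names and the example CIDR are illustrative, and creating the subnets in the cloud provider is out of scope.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, new_prefix):
    """Assign one public and one private subnet per zone from vpc_cidr.

    Returns a dict mapping "<tier>-<zone>" to a subnet CIDR string.
    Raises ValueError if the address space cannot fit all subnets.
    """
    network = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the new size.
    available = network.subnets(new_prefix=new_prefix)
    plan = {}
    for tier in ("public", "private"):
        for zone in zones:
            try:
                plan[f"{tier}-{zone}"] = str(next(available))
            except StopIteration:
                raise ValueError("address space too small for the requested subnets")
    return plan


# Example with a hypothetical /16 address space and two zones.
layout = plan_subnets("10.0.0.0/16", ["a", "b"], 24)
```

Here `layout` maps `public-a` and `public-b` to `10.0.0.0/24` and `10.0.1.0/24`, and `private-a` and `private-b` to `10.0.2.0/24` and `10.0.3.0/24`. Planning the ranges up front helps avoid overlaps with the on-premise network once VPN connectivity is set up.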
Executing the Migration
Once we have the migration strategy planned for the workstreams and automation in place, it is time to execute. Execution is performed in the following steps:
- Staging Environment: The setup running on-premise is mirrored in the cloud. It starts with deploying the underlying infrastructure, followed by creating the data stores with a small but representative dataset, and finally the application is deployed in the environment. The environment is then validated. Security is largely a behind-the-scenes function and should ideally be baked into the automation.
- Data Migration: Kick off the data copy and reach a point where enough data has been copied over to the cloud that any remaining data can be moved in a short duration by stopping the application so as to avoid any new data ingestion.
- Production Infrastructure Setup: Bring up the production environment as per the blueprint, pointing to the new data set. Keep the application in stopped mode.
- Begin Downtime and Finish Data Migration: The on-premise application is stopped and the final data copy is performed.
- Bring up Cloud Environment: The new production environment in the cloud is brought up and validations are run. During this time it is important to turn off any background jobs that can trigger production functionality. Once validation is complete, the DNS names are swapped, other finishing tasks are performed, and the environment is live in the cloud.
Accelerating Infrastructure setup and security with DuploCloud
At DuploCloud our focus is the Infrastructure Migration and Security Controls workstreams. We have created an end-to-end automation platform that enables the DevSecOps lifecycle shown in figure 5 out of the box. There are over 200 tasks automated across the full lifecycle, with a breakdown shown in figure 6 below. It covers almost all configurations in a cloud provider and a multitude of compliance standards. Further, the power of the platform lies not only in the workflows and compliance controls available out of the box but also in the rules engine powering the platform, which enables newer services and workflows without thousands of lines of code having to be written by expensive subject matter experts. The two DuploCloud whitepapers for DevOps and Security provide a deeper dive into DuploCloud’s automation and security platforms.
In a migration project, by adopting DuploCloud, the infrastructure setup and security workstreams are accelerated by 10x with a 75% cost reduction. DuploCloud levels the playing field for cloud-native and on-premise companies by making niche automation techniques like IaC easily accessible and adoptable by operations teams of all skill levels.
Cloud migration requires careful planning, coordination and automation. There are 4 key workstreams: Application Migration, Data Migration, Infrastructure Setup and Security Controls. The onus of application and data migration is largely on the authors of these components, while the infrastructure and security controls are more standardized. While application and data migration require a custom skill set, they are bounded components relative to infrastructure and security, which are a lot more laborious in nature. Infrastructure automation can substantially reduce migration times and improve operational efficiency post migration. It is best done early in the process so as to have a scalable digital transformation path post migration as well. At DuploCloud our goal is to make automation and security a no-op so engineering teams can focus on the product rather than spending hours on standardized configurations.