By Adarsh Shah
1. This article is the first in a series of articles on "From Infrastructure as Code to Environment as Code". It covers the current challenges of scaling Infrastructure as Code; follow-up articles will cover "How to resolve challenges scaling IaC" and more.
2. If you are just getting started or have a very simple setup, you might not face these issues.
If you have not already, I highly recommend reading my previous article, "Infrastructure as Code: Principles, Patterns, and Practices", because those principles, patterns, and practices will help you understand this article.
Infrastructure as Code (IaC) has made managing infrastructure easier in a lot of ways, but there are many challenges that companies accept as the cost of adopting IaC, especially when scaling. This article digs into these challenges to better understand them. So let’s get started:
As mentioned in my previous article, once you write IaC, it’s recommended to execute it from a shared environment using a pipeline or GitOps workflow. While your IaC is declarative, these pipelines and what they manage are not. Especially when implemented with CI tools, this creates a maintenance nightmare as you scale your environment infrastructure, because CI tools are not built to support an optimal workflow for managing infrastructure. My teams and I have built these pipelines many times, and they become unmanageable after a point.
Teams want entire environments, like the one in diagram 1 above, to deploy and operate their applications, not just individual infrastructure resources. As you can see, it has networking at the top, with platform-ec2, platform-eks, and db-security-group & rds-database all dependent on networking.
To provision these types of environments, teams typically use one of the approaches below. These approaches are inefficient for provisioning and tearing down environments, and they slow down feature development and innovation. They also usually require dedicated teams to bootstrap and manage these environments.
One way to provision an entire environment and manage dependencies is to do it in a pipeline. For example, execute the networking layer IaC first, then the platform-eks layer, and then the k8s-addons layer. Teardown must also be supported, and any failures or errors that impact the environment must be accounted for. While your individual infrastructure resources are idempotent on their own thanks to IaC, your entire environment is not.
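To make the pipeline approach concrete, here is a minimal Python sketch (3.9+, for `graphlib`) of the orchestration logic such a pipeline ends up encoding. The dependency map is an assumption built from diagram 1 plus the k8s-addons layer mentioned above (assumed to depend only on platform-eks), and `apply_layer` is a hypothetical callback that would run your IaC tool for one layer. The sketch orders layers so each runs after its dependencies, reverses that order for teardown, and stops at the first failure.

```python
from graphlib import TopologicalSorter

# Hypothetical layer graph: each layer maps to the layers it depends on.
LAYERS = {
    "networking": set(),
    "platform-ec2": {"networking"},
    "platform-eks": {"networking"},
    "db-security-group": {"networking"},
    "rds-database": {"networking"},
    "k8s-addons": {"platform-eks"},  # assumption: only needs the EKS layer
}


def provision_order(layers):
    """Return an order in which every layer comes after its dependencies."""
    return list(TopologicalSorter(layers).static_order())


def teardown_order(layers):
    """Destroy dependents before the layers they depend on."""
    return list(reversed(provision_order(layers)))


def run(layers, apply_layer):
    """Apply layers in dependency order, stopping at the first failure.

    Returns the layers that were applied successfully, so the caller can
    decide whether to retry the rest or tear down what was created.
    """
    applied = []
    for layer in provision_order(layers):
        try:
            apply_layer(layer)  # hypothetical: run the IaC tool for this layer
        except Exception as exc:
            print(f"{layer} failed: {exc}; applied so far: {applied}")
            break
        applied.append(layer)
    return applied
```

Even this small sketch has to decide what happens to a partially applied environment, which is exactly the kind of logic teams end up hand-maintaining in CI pipelines.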
Another way is to manage the entire environment as a single monolithic IaC codebase. This creates tight coupling, which in turn creates a maintenance nightmare and causes issues like slow provisioning (due to aspects like large state files). I would not recommend going this route unless you have a very simple setup.
If you want to follow principles like immutability for your environments, or make it easier to share best-practice implementations of environments across teams, having a mechanism to easily replicate environments is critical. Since the approaches above are not ideal for managing entire environments, it becomes painful to replicate them, and teams spend a lot of time writing custom code to do so.
Teams also struggle to visualize and understand the environments (like dev, qa, or production) that they provision using IaC. Trying to find that information directly in the cloud provider’s dashboard is even more confusing. If they want to troubleshoot an issue, share knowledge between teams, or make changes to existing environments, they need to go through a painful and time-consuming process.
A lot of teams create diagrams of their environments showing the various cloud resources and how they are connected, but these diagrams soon fall out of sync with the real environments. Instead of helping, they provide incorrect information and can cause confusion.
Getting insights like usage and costs at various levels (team, environment, individual resource) is not straightforward. There are tools that give you insights at the individual resource level, but not at a higher (team or environment) level.
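As a rough illustration of what "higher-level" insights mean, the sketch below rolls resource-level costs up to a team or environment total and flags groups that exceed a budget. It assumes, hypothetically, that every resource carries `team` and `environment` tags and that per-resource costs have already been exported from the provider's billing data; the `ResourceCost` type and both functions are illustrative, not part of any specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ResourceCost:
    """Hypothetical per-resource cost record for one billing period."""
    resource_id: str
    cost: float  # assumed to be in a single currency
    tags: dict = field(default_factory=dict)


def roll_up(costs, tag_key):
    """Sum resource-level costs by the value of one tag (e.g. "team")."""
    totals = defaultdict(float)
    for item in costs:
        # Resources missing the tag are grouped under "untagged" so that
        # their spend stays visible instead of silently disappearing.
        totals[item.tags.get(tag_key, "untagged")] += item.cost
    return dict(totals)


def over_budget(totals, budgets):
    """Return the groups whose total spend exceeds their budget."""
    return {k: v for k, v in totals.items() if k in budgets and v > budgets[k]}
```

The hard part in practice is not the arithmetic but keeping tags consistent across every resource an environment provisions, which is why untagged spend is kept as its own bucket here.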
Most companies give teams budgets for their cloud expenses. Teams are also given budgets to experiment with various cloud services, but they struggle to track costs for the environments they provision using IaC.
Over time, due to human error or indirect changes, provisioned infrastructure drifts from the desired state in code. With existing solutions (like using a pipeline to execute IaC), drift can be detected because IaC is declarative, but only the next time that pipeline executes the IaC. Teams should know about drift right away so they can remediate any issues as soon as possible.
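Conceptually, drift detection is a diff between the desired state declared in code and the actual state observed in the cloud. The minimal sketch below assumes, hypothetically, that both states have already been loaded as dictionaries mapping resource IDs to attribute dicts; fetching them is out of scope. Running a comparison like this on a schedule, rather than only when the pipeline next executes, is what would surface drift right away.

```python
def detect_drift(desired, actual):
    """Compare desired state (from code) with observed state (from the cloud).

    Both arguments map resource IDs to attribute dicts. Returns a report of
    resources that are missing, unmanaged (present but not declared in code),
    or have attributes that differ from the declared values.
    """
    report = {"missing": [], "unmanaged": [], "changed": {}}
    for rid, want in desired.items():
        have = actual.get(rid)
        if have is None:
            report["missing"].append(rid)
            continue
        # Record (desired, actual) for every declared attribute that differs.
        diffs = {
            key: (value, have.get(key))
            for key, value in want.items()
            if have.get(key) != value
        }
        if diffs:
            report["changed"][rid] = diffs
    report["unmanaged"] = sorted(set(actual) - set(desired))
    return report
```

With Terraform specifically, a similar signal is available by running `terraform plan -detailed-exitcode` on a schedule: it exits with code 2 when the plan contains changes.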
Thanks for reading the article, and I hope you find it useful. If you have any questions or comments, you can reach out to me via Twitter or email: firstname.lastname@example.org. If you enjoyed this article, watch out for a follow-up article on “How to resolve challenges scaling IaC: From Infrastructure as Code to Environment as Code” that is coming out soon.
Arielle Sullivan read the draft version of this article and provided feedback to improve it.
If you enjoyed this article, you might like our product, which makes environment management easy across all three major cloud providers as well as on-premises.