The Power of On-Demand Environments
By Arielle Sullivan & Adarsh Shah
We have seen many companies treat their environments as fragile. Tasks that require an update to an environment’s configuration are often met with “Why should I touch my environment? It hasn’t given me any issues and has been untouched for years.” This reaction is rooted in the toil and time many engineers have spent provisioning one-off environment requests and troubleshooting mysterious bugs caused by differences between environments, as well as in the high stakes of impacting production with no quick fix.
The idea of a repeatable, automated process to provision and tear down environments is often pushed aside as not worth the cost of building. Because so few teams have such a process, its benefits are not widely known.
Ephemeral, or on-demand, environments deserve a closer look. This article breaks down their benefits to developer productivity and the use cases where they are most valuable.
Environment: a logical grouping of all the infrastructure components and apps needed to run business applications. The grouping includes components such as networking, platform-eks, databases, S3 buckets, the frontend app, the backend app and any other components.
An ephemeral environment is provisioned on demand when it is needed and destroyed afterward. This is a useful technique for testing your infrastructure as code (IaC) and the applications running on it without having to keep the environment running all the time. Ephemeral environments are also called dynamic or short-lived environments.
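The provision-use-destroy lifecycle is simple enough to sketch. This is a minimal illustration rather than any specific tool's API: `ephemeral_environment` is an illustrative name, and `provision` and `destroy` are hypothetical hooks that would wrap your IaC tooling (for example, applying and destroying a stack). The `try`/`finally` guarantees teardown even if the work in between fails.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_environment(
    name: str,
    provision: Callable[[str], None],  # hypothetical hook, e.g. wraps an IaC apply
    destroy: Callable[[str], None],    # hypothetical hook, e.g. wraps an IaC destroy
) -> Iterator[str]:
    """Provision an environment on entry and always destroy it on exit."""
    try:
        provision(name)
        # Hand the environment to the caller (tests, a demo, an experiment).
        yield name
    finally:
        # Runs on success, on failure inside the block, and if provisioning
        # fails partway, so destroy() must tolerate partially created resources.
        destroy(name)


if __name__ == "__main__":
    with ephemeral_environment(
        "pr-123",
        provision=lambda n: print(f"provisioning {n}"),
        destroy=lambda n: print(f"destroying {n}"),
    ) as env:
        print(f"running tests in {env}")
```

Running the demo prints the provisioning, test and teardown steps in that order; nothing outlives the `with` block.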
Advantages of Ephemeral Environments
Avoid Bugs Due to Configuration Drift
Configuration drift is a huge problem with environments. It occurs when, over time, unrecorded changes are made to environments, and the environments drift from each other in ways that are not easily reproducible. This usually happens with mutable infrastructure that lives for a long time. Long-lived infrastructure is also more brittle in general, since issues like a slow memory leak or a disk filling up with accumulated logs might go unnoticed for a long time. Immutable infrastructure, which is replaced rather than modified in place, avoids these issues.
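One way to make drift visible is to diff the declared configuration against what is actually running. The sketch below assumes both sides are available as flat dictionaries of settings; how the live state is fetched depends on your tooling and is left out, and the setting names are purely illustrative.

```python
from typing import Any, Dict, Tuple

Config = Dict[str, Any]


def detect_drift(declared: Config, actual: Config) -> Dict[str, Tuple[Any, Any]]:
    """Compare declared settings with the live ones.

    Returns {setting: (declared_value, actual_value)} for every mismatch.
    A setting missing on one side is reported as None on that side.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(declared.keys() | actual.keys()):
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    declared = {"instance_type": "t3.medium", "log_retention_days": 14}
    # Someone resized the instance and enabled debug logging by hand.
    actual = {"instance_type": "t3.large", "log_retention_days": 14, "debug": True}
    print(detect_drift(declared, actual))
```

With ephemeral environments, the usual remedy for a reported mismatch is not to patch the live setting but to destroy the environment and recreate it from code.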
Have Confidence in Environment Provisioning
Many teams run disaster recovery drills once in a while to test their failover process. Automation lets you build confidence in your ability to handle a worst-case scenario continuously instead. Automated ephemeral environments enable your team to provision environments (both infrastructure and applications) from scratch frequently and tear them down when they are not needed, so rebuilding an environment becomes a routine, well-exercised operation.
Get Production-Like Environments
There are use cases (such as load testing and troubleshooting production issues) where you want an ephemeral environment that is exactly like production so that you can reproduce the same behavior. Having automation to provision environments gives you the flexibility to configure them with production configuration, infrastructure and application versions, and even data.
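Starting every ephemeral environment from the production configuration and layering explicit overrides on top keeps it production-like by default, with every deviation visible in one place. A minimal sketch, assuming configuration is held as nested dictionaries; `with_overrides` and the keys in `PRODUCTION` are hypothetical.

```python
import copy
from typing import Any, Dict

Config = Dict[str, Any]


def with_overrides(base: Config, overrides: Config) -> Config:
    """Return a copy of `base` with `overrides` applied; nested dicts are merged."""
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = with_overrides(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical production settings; in practice these would come from the
# same source of truth your production deployment uses.
PRODUCTION: Config = {
    "app_version": "2.3.1",
    "replicas": 6,
    "database": {"engine": "postgres", "size": "large", "snapshot": "prod-latest"},
}

if __name__ == "__main__":
    # A load-test environment keeps every production setting except its name.
    load_test = with_overrides(PRODUCTION, {"name": "loadtest-1"})
    # A debugging environment keeps production versions but a smaller database.
    debug = with_overrides(PRODUCTION, {"name": "debug-1", "database": {"size": "small"}})
    print(debug["database"])
```

Because the merge works on copies, the production configuration itself is never modified by an environment's overrides.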
Reduce Cloud Costs
Companies waste a lot of money on unused cloud resources and even entire environments. These resources include running but unused VM instances, unattached storage volumes, obsolete snapshots, idle load balancers, etc. Entire environments can also be left running when not needed, for example when engineers forget to bring down experimentation environments or leave development environments running after hours. Being able to easily tear down cloud resources, and even entire environments, when they are not needed reduces costs significantly.
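Automatic teardown is often driven by an expiry policy. A minimal sketch, assuming each environment is tagged with a hypothetical creation time and time-to-live; it only decides which environments have expired and leaves the actual deletion to your teardown automation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Environment:
    name: str
    created_at: datetime  # timezone-aware creation time (assumed to be stored as a tag)
    ttl_hours: float      # allowed lifetime (assumed to be stored as a tag)


def expired_environments(envs: List[Environment], now: datetime) -> List[str]:
    """Return the names of environments whose time-to-live has elapsed."""
    return [
        env.name
        for env in envs
        if now >= env.created_at + timedelta(hours=env.ttl_hours)
    ]


if __name__ == "__main__":
    utc = timezone.utc
    envs = [
        # A developer environment created in the morning with a 10-hour TTL.
        Environment("dev-alice", datetime(2024, 1, 8, 8, tzinfo=utc), 10),
        # A preview environment that should live for a day.
        Environment("pr-42", datetime(2024, 1, 8, 19, tzinfo=utc), 24),
    ]
    # A scheduled job would run this periodically; here it is evaluated at 20:00 UTC.
    print(expired_environments(envs, datetime(2024, 1, 8, 20, tzinfo=utc)))
```

At 20:00 only `dev-alice` (expired at 18:00) is returned; `pr-42` survives until 19:00 the next day.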
Use Cases for Ephemeral Environments
Now that we understand the advantages of ephemeral environments, let's go through some of their common use cases.
Development Environments
With modern cloud-native architectures, engineers usually need a complex setup for development. With a microservices architecture, you may need to run many microservices and even databases, which makes local development really hard. This leaves engineers with two painful options:
- Setting up everything locally on your machine, which is slow and painful. Engineers spend a lot of time managing the local environment on their machines.
- Relying on a shared development environment for the services and databases you are not working on, which creates a bottleneck. These environments are shared with other team members and are often out of date or broken, especially when someone is upgrading a dependency. This slows down developer productivity significantly.
High-performing teams have an automated way to provision and tear down development environments. When developers start their day, they can quickly create a new environment in a reliable state for their own development work. When they leave for the day, the environment is destroyed to save money. Because these environments are dedicated to each developer, they can try dependency upgrades and other changes without affecting anyone else on their team.
Preview Environments
One of the key principles of Continuous Delivery is to “build quality in” so that you can find and fix problems sooner. Merging code to trunk often and with confidence is much easier when an ephemeral preview environment is provisioned for each pull request under review. Being able to preview infrastructure or application changes from a pull request lets both the engineers who worked on the change and the PR reviewers validate it in a real environment before it is merged to the main branch.
Reviewers often pull down the changes from the branch and test them locally on their machines, which is time-consuming and error-prone. Preview environments save a lot of time and effort during code review and improve the quality of changes that reach higher environments.
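Preview environments are usually keyed to the pull request, so each one gets an isolated, predictable name and can be found and destroyed when the PR is merged or closed. A minimal sketch: the `pr-<number>-<branch>` naming convention is a hypothetical choice, constrained to DNS label rules (lowercase letters, digits and hyphens, at most 63 characters); the action strings are those carried by GitHub's `pull_request` webhook events.

```python
import re

# pull_request webhook actions this sketch treats as "new code to preview".
DEPLOY_ACTIONS = {"opened", "reopened", "synchronize"}


def preview_env_name(pr_number: int, branch: str, max_len: int = 63) -> str:
    """Build a DNS-label-safe name such as "pr-42-feature-add-login".

    The "pr-<number>-<branch>" convention is hypothetical. Runs of characters
    outside [a-z0-9] become a single hyphen, and the result is cut to 63
    characters, the maximum length of a DNS label.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    name = f"pr-{pr_number}-{slug}" if slug else f"pr-{pr_number}"
    # Truncation can leave a trailing hyphen, which DNS labels do not allow.
    return name[:max_len].rstrip("-")


def preview_action(action: str) -> str:
    """Decide what to do with a PR's preview environment for a webhook action."""
    if action in DEPLOY_ACTIONS:
        return "deploy"   # create or update the environment for the new commits
    if action == "closed":
        return "destroy"  # merged or abandoned: tear the environment down
    return "ignore"       # e.g. label or title edits
```

For example, `preview_env_name(42, "feature/Add_Login")` yields `pr-42-feature-add-login`, and a `closed` event maps to `destroy`, so no preview outlives its pull request.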
Running tests only provides peace of mind if the environment you are testing in matches production conditions. Drift in a long-running staging environment can produce misleading test results, such as tests that pass in staging but would fail in production, and hide bugs for long periods of time. When a problematic configuration is eventually detected, fixing it can be a big lift. Bugs can also be difficult to find if environment drift means they can’t be reproduced in your non-production environments.
Test Product Features
Ephemeral environments enable you to run end-to-end tests against an accurate environment to detect issues. Testing an application against real infrastructure gives the most coverage but can be deemed too costly. With ephemeral environments, teams can run these automated end-to-end tests on a schedule and tear the environment down afterward. QA can also use ephemeral environments to test new and existing features.
Troubleshoot Production Issues
Troubleshooting a production issue, or debugging of any other kind, often requires putting data into a specific state in your long-running staging environment. Staging data is easily corrupted and costly to customize, especially under time pressure while fixing a bug. A more robust approach is to use an on-demand environment reseeded with sanitized data that is ready for testing. Automating seed data that represents various user states will save your team time in the future.
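Sanitizing production-like data before seeding it could look like the sketch below. It assumes records are dictionaries and that a hypothetical `PII_FIELDS` set lists the fields holding personal data. A keyed hash (HMAC-SHA256) keeps placeholders stable, so the same email maps to the same placeholder everywhere and relationships between rows survive, while the original value is not stored. This is a sketch of pseudonymization, not a complete anonymization scheme.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable, List

# Hypothetical set of fields that hold personal data in this example schema.
PII_FIELDS = {"email", "name", "phone"}


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a stable keyed hash: same input and key, same output."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"anon-{digest[:12]}"


def sanitize(records: Iterable[Dict[str, Any]], key: bytes) -> List[Dict[str, Any]]:
    """Return copies of `records` with PII fields pseudonymized; inputs are untouched."""
    cleaned = []
    for record in records:
        row = dict(record)  # shallow copy so the source data is not modified
        for field in PII_FIELDS:
            if row.get(field) is not None:
                row[field] = pseudonymize(str(row[field]), key)
        cleaned.append(row)
    return cleaned
```

Keeping the key secret matters: without it, someone holding the seed data cannot confirm a guessed email simply by hashing it.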
Performance/Load Testing
Performance and load testing are difficult. Using benchmarks from lower environments (that are not like production) to predict how an application behaves under load is an imperfect approach, and using real traffic inherently puts your production environment at risk. Ephemeral environments that can easily be provisioned with the same configuration and data as production make performance and load tests realistic, helping you catch performance issues before they reach production.
Experimentation
Experimentation is a critical part of product development, but the risk and cost of experimenting with infrastructure can stand in the way. Many companies assign experimentation budgets to teams, and staying within that budget is always challenging, especially when resources are accidentally left running in the cloud. Say your team wants to try out a machine learning solution in the cloud for a limited period but has no infrastructure for it. Ephemeral environments let you quickly spin up such experimentation environments, and an easy teardown option means they aren’t left running by accident, draining your experimentation budget.
Lowering the barriers to experimentation brings further savings, because trying out cost-saving infrastructure (e.g., migrating to spot instances) becomes easier.
Testing Major Upgrades & Code Refactoring
Major dependency upgrades and code refactoring should be like routine oil changes for a car: done regularly rather than put off until something breaks down. These changes may require an engineer to use a dedicated environment to avoid blocking others’ work. Being able to provision environments and tear them down when they are no longer needed makes it possible to test these major changes in isolation.
Sales Demos
Customizing your application for a last-minute, same-day sales demo can win a deal. Ephemeral environments allow you to add a customer’s logo, or a requested feature that is not yet production-ready, to a demo. Cleanup is easy: just tear down the environment. Productivity does not take a hit because the demo environment is separate from all ongoing work.
Beyond the demo, you can create a sandbox environment that both you and your customer can access. This is an impactful way to showcase new features or troubleshoot specific application issues. Gone are the days of lengthy email exchanges about bugs, or feature announcements that leave users to figure things out on their own.
Blue/Green Deployments in Production
Testing changes with production configuration and realistic data gives teams confidence when deploying to production. Blue/green deployments let you do just that.
A blue/green deployment is a deployment strategy in which you create two separate, but identical environments. One environment (blue) is running the current application version and one environment (green) is running the new application version. They both run against the same production database. Using a blue/green deployment strategy increases application availability and reduces deployment risk by simplifying the rollback process if a deployment fails. Once testing has been completed on the green environment, live application traffic is directed to the green environment and the blue environment is deprecated. [Source: AWS]
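The cutover and rollback described above come down to switching a single pointer between two environments. A minimal in-memory sketch, assuming a hypothetical `healthy` check; `BlueGreenRouter` is an illustrative name, and in practice the switch would be a load balancer, DNS or service-mesh change rather than a Python attribute.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    versions: Dict[str, str]  # e.g. {"blue": "v1", "green": "v2"}
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cut_over(self, healthy: Callable[[str], bool]) -> bool:
        """Send traffic to the idle environment only if it passes its checks."""
        target = self.idle
        if not healthy(target):
            return False  # keep serving from the current live environment
        self.live = target
        return True

    def roll_back(self) -> None:
        """Point traffic back at the previous environment.

        Only meaningful while that environment still exists, i.e. before the
        old (blue) environment is torn down.
        """
        self.live = self.idle
```

A failed health check leaves traffic where it was, and a rollback is a single pointer flip rather than a redeploy.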
In this cloud-native age, automated ephemeral environments unlock workflows and productivity gains that far outweigh their cost. Companies have long underestimated what is possible, and it's time to change that thinking. Thank you for reading.