Easy is Difficult; Difficult is Impossible
Have you encountered a Reddit discussion along these lines:
Question: How do I implement disaster recovery for EC2 instances?
Me: Replicate EC2 volume snapshots across regions and test regularly. (By the way, we have a product that does this.)
Commenter: No marketing here; I'll just write a script. Easy.
I appreciate the do-it-yourself spirit of the AWS Reddit discussion boards, but we append marketing plugs to the knowledge we share because we think they are helpful: DR, even if you intend to implement it yourself, might not be a high priority. More importantly, and the subject of this post, this scenario is a concrete example of my axiom of software solutions: what seems easy is frequently difficult, and what seems difficult is often impossible.
Writing a script to snapshot EC2 volumes, replicate them across regions, attach them to remote instances, and test that the application can recover might be complex, but it should be within the skill set of a seasoned AWS developer.
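As a rough illustration, here is a minimal sketch of the core snapshot-and-replicate step, written against boto3-style EC2 clients. The function name and structure are my own invention, not a reference implementation; the clients are passed in as parameters so the logic can be exercised without AWS credentials, and real code would also have to wait for each snapshot to complete before copying it.

```python
def replicate_instance_volumes(ec2_src, ec2_dst, instance_id, source_region):
    """Snapshot every volume attached to an instance and copy each
    snapshot to the region served by ec2_dst. Returns the remote
    snapshot IDs. Clients are injected (boto3-style) so the logic
    can be tested without AWS credentials."""
    volumes = ec2_src.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]
    remote_ids = []
    for vol in volumes:
        snap = ec2_src.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"DR snapshot of {vol['VolumeId']} ({instance_id})",
        )
        # Real code must wait for the snapshot to reach 'completed'
        # (e.g. with a boto3 waiter) before issuing the copy.
        copy = ec2_dst.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snap["SnapshotId"],
        )
        remote_ids.append(copy["SnapshotId"])
    return remote_ids
```

Even this toy version hints at the hidden complexity: it ignores encryption, tagging, retention, and rate limits, each of which a production script must handle.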
However, where will you host this script? Not on your laptop; it needs to run regularly. In an EC2 instance? It is hardly economical to pay 24/7 for CPU cycles to host a single script that executes infrequently. How about Lambda? It sounds simple, but how will you navigate the 15-minute timeout for operations, like snapshot replication, that may take longer than that to complete? A state machine may be necessary, which is not trivial. And that is before you handle the complex cases, such as multiple volumes attached to the same instance, as well as the customary error handling.
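To make the state-machine point concrete, a hypothetical Step Functions definition for polling a long-running copy might look like the following. The Lambda ARNs are placeholders, and the StartCopy/CheckCopy functions are assumed to exist, not part of any real deployment:

```json
{
  "Comment": "Hypothetical poll loop for a snapshot copy that outlives a single Lambda run",
  "StartAt": "StartCopy",
  "States": {
    "StartCopy": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:StartCopy",
      "Next": "WaitFiveMinutes"
    },
    "WaitFiveMinutes": { "Type": "Wait", "Seconds": 300, "Next": "CheckCopy" },
    "CheckCopy": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:CheckCopy",
      "Next": "IsCopyDone"
    },
    "IsCopyDone": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.state", "StringEquals": "completed", "Next": "Done" }
      ],
      "Default": "WaitFiveMinutes"
    },
    "Done": { "Type": "Succeed" }
  }
}
```

Each Lambda invocation stays well under the timeout; the state machine, not the function, carries the long-running work, which is exactly the extra moving part the "easy" script did not anticipate.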
What permissions will your script need? You certainly don’t want to embed credentials in your script, so tight but sufficient IAM roles are the order of the day.
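As a starting point only, a least-privilege policy for the snapshot-and-copy portion of the script might look something like this; a real policy would likely also need tag-based conditions, and KMS permissions if the volumes are encrypted:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot",
        "ec2:CopySnapshot",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }
  ]
}
```

Attached to the Lambda's execution role, this avoids embedded credentials entirely; scoping it down further is one more piece of design work the script demands.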
What interface will you use to discover your production instances and their VPCs, key pairs, and other attributes, to make sure equivalent backup instances are provisioned in the recovery region?
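One possible shape for that discovery step is sketched below, again with the client injected. The attribute list is illustrative, not exhaustive, and note the NextToken pagination loop, an easy detail to forget that silently drops instances in larger accounts:

```python
def discover_protected_instances(ec2):
    """Collect the instance attributes needed to provision equivalents
    in the recovery region. Follows describe_instances pagination via
    NextToken so no instance is silently skipped."""
    instances, token = [], None
    while True:
        kwargs = {"Filters": [{"Name": "instance-state-name",
                               "Values": ["running"]}]}
        if token:
            kwargs["NextToken"] = token
        page = ec2.describe_instances(**kwargs)
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                instances.append({
                    "InstanceId": inst["InstanceId"],
                    "InstanceType": inst["InstanceType"],
                    "VpcId": inst.get("VpcId"),
                    "SubnetId": inst.get("SubnetId"),
                    "KeyName": inst.get("KeyName"),
                })
        token = page.get("NextToken")
        if not token:
            return instances
```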
Next, how will you manage your script to make sure it keeps working properly? Where will logs go, how will they be accessible, and how will you be notified in case of a problem? Achieving lofty set-and-forget goals can be a challenge even for the most straightforward solutions.
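At a minimum, failures should notify someone rather than die silently. A sketch of a notification helper built on an SNS-style client; the topic ARN, helper name, and subject format are my own assumptions:

```python
def notify_failure(sns, topic_arn, job, error):
    """Publish a failure notification to an SNS topic (boto3-style
    client injected) so a broken DR job is never silently broken."""
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"DR job failed: {job}",
        Message=str(error),
    )
```

Wiring the topic to email or a pager, deciding what counts as a failure, and keeping the alarm itself from breaking are all further chores.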
How will you manage changes in your environment? New EC2 instances will need protection; terminated instances will likewise need to be removed from the DR plan.
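The bookkeeping itself reduces to comparing what is protected against what currently exists. A minimal sketch (the function name and tuple shape are illustrative):

```python
def reconcile(protected_ids, current_ids):
    """Return (to_add, to_remove): instances that appeared since the
    last run and must be brought under protection, and protected
    instances that no longer exist and must be retired from the plan."""
    protected, current = set(protected_ids), set(current_ids)
    return sorted(current - protected), sorted(protected - current)
```

The set arithmetic is trivial; persisting the protected list between runs, and deciding when it is safe to delete a terminated instance's snapshots, is where the real work hides.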
How will you keep the environment secure and up to date? The underlying OS on which the script runs, along with its libraries and language runtimes, will eventually require care and feeding.
Most important of all, though, is a robust and secure testing facility. DR must work correctly the first, and potentially only, time it is employed; you don't want to power on the backup instances for the first time in an emergency. Regular power-ons are a must, but even more thorough would be a facility that connects to the backup applications on a regular basis to confirm they can recover properly. Architecting a secure, extensible, and meaningful deep-testing framework might grow into a whole project in and of itself.
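A deep test, reduced to its skeleton, might look like the sketch below: launch a throwaway instance from the replicated image, probe an application health endpoint, and terminate the test instance no matter what happens. The health URL and the injected ec2/http_get parameters are illustrative assumptions, and real code would wait on instance-status waiters before probing:

```python
def run_dr_drill(ec2, http_get, image_id, subnet_id):
    """Launch a test instance from the replicated image, probe its
    health endpoint, then terminate it. Returns True if the
    application recovered. http_get is injected so the drill logic
    is testable offline."""
    inst = ec2.run_instances(ImageId=image_id, SubnetId=subnet_id,
                             MinCount=1, MaxCount=1)["Instances"][0]
    instance_id = inst["InstanceId"]
    try:
        # Real code waits for instance-running / status-ok here.
        ip = inst.get("PrivateIpAddress")
        status = http_get(f"http://{ip}/health")
        return status == 200
    finally:
        # Always clean up the drill instance, even on failure.
        ec2.terminate_instances(InstanceIds=[instance_id])
```

Notice how much is still missing: network isolation so the drill can't touch production, databases that need real recovery verification, and scheduling and reporting around the whole exercise.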
Hopefully by now you appreciate my axiom: something that seems relatively easy explodes, upon closer examination, in complexity in several directions. Building the script robustly might require more resources and time than originally intended. This is precisely why we think it is helpful to suggest our commercial alternative, which addresses all of these issues. Our 100% cloud-native disaster recovery automation solution robustly protects your mission-critical cloud workloads in a lightweight, easy-to-use, and cost-effective package.
Our interfaces are CloudFormation, CloudWatch, EventBridge, and CloudFront, services with which you are most likely already familiar.
You can cross DR off your long list of cloud requirements without writing a single line of code, and without spending a lot of money. That might appeal even on Reddit.