Drift Detection

I recently came across two presentations that got me writing this post about IaC and drift detection. The first was Cloudonaut podcast episode 70, where Michael Wittig described the conflict between Terraform and other tools that apply tags to resources.

There are tools that run in some AWS accounts that just automatically tag all the new resources with some tags. For example the user who created the resource, maybe the timestamp and things like that. But this approach does not really work well with infrastructure as code, because what we all [learn] when we start using IaC is that out-of-band changes are bad … and now every time you run terraform it will just remove all the auto-added tags because it truly believes this is not how it should look …

As a solution, the Terraform AWS provider has an ignore_tags setting that you can use to skip certain tags when detecting drift between your state and reality. The podcast got me curious about how this works in CloudFormation …
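To make the idea concrete, here is a minimal Python sketch of what ignoring tags means when comparing declared tags with reality. It is not how the Terraform provider is implemented; it only mirrors the two selectors that ignore_tags offers (exact keys and key prefixes). The function names and the example tags (CreatedBy, auto:created-at) are made up for illustration.

```python
def filter_ignored(tags, ignore_keys=(), ignore_key_prefixes=()):
    """Drop tags whose key matches an ignored key or starts with an ignored prefix."""
    prefixes = tuple(ignore_key_prefixes)
    return {
        key: value
        for key, value in tags.items()
        if key not in ignore_keys and not key.startswith(prefixes)
    }


def tags_to_remove(declared, actual, ignore_keys=(), ignore_key_prefixes=()):
    """Return the tag keys a tool would delete to make reality match the code."""
    considered = filter_ignored(actual, ignore_keys, ignore_key_prefixes)
    return sorted(key for key in considered if key not in declared)


# Hypothetical example: one declared tag, two tags added by an auto-tagging tool.
declared = {"Project": "blog"}
actual = {
    "Project": "blog",
    "CreatedBy": "alice",
    "auto:created-at": "2023-01-01",
}

# Without ignore rules both auto-added tags would be removed.
print(tags_to_remove(declared, actual))

# With ignore rules nothing is left to remove.
print(tags_to_remove(declared, actual,
                     ignore_keys=["CreatedBy"],
                     ignore_key_prefixes=["auto:"]))
```

The point of the sketch is that the ignore rules are applied to reality before the comparison, so the auto-added tags never show up as a difference in the first place.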

Definition of drift in CloudFormation

The CloudFormation User Guide defines the scope of drift detection as follows:

CloudFormation only determines drift for property values that are explicitly set, either through the stack template or by specifying template parameters. This doesn’t include default values for resource properties. To have CloudFormation track a resource property for purposes of determining drift, explicitly set the property value, even if you are setting it to the default value.

If you don’t explicitly set a property for a resource in the template, you are free to change it from its default value outside of CloudFormation, and it won’t be considered drift. For tagging, this means that if you don’t define any tags in the resource properties, you are free to apply and modify them without causing drift for the stack. However, if you define even a single tag, i.e. have a Tags: property in the resource definition,

     Tags:
      - Key: FOO
        Value: BAR

then any change to the resource’s tags will be picked up as drift, because the resource property is the whole array of tags, not the individual tags you defined in the template.
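The rule can be modelled in a few lines of Python. This is only a toy model of the behaviour described above, not CloudFormation’s actual implementation: detect_drift is a hypothetical helper that compares just the properties the template sets explicitly, and treats Tags as one value. The InstanceType property and the CreatedBy tag are example data.

```python
def detect_drift(template_props, actual_props):
    """Return the names of explicitly set properties whose actual value differs.

    Properties missing from the template are never inspected, which is
    why changing them out of band does not count as drift in this model.
    """
    return sorted(
        name
        for name, expected in template_props.items()
        if actual_props.get(name) != expected
    )


# Reality: someone (or some tool) added a CreatedBy tag out of band.
actual = {
    "InstanceType": "t3.micro",
    "Tags": [
        {"Key": "FOO", "Value": "BAR"},
        {"Key": "CreatedBy", "Value": "alice"},
    ],
}

# Template that never mentions Tags: the extra tag is invisible.
template_without_tags = {"InstanceType": "t3.micro"}

# Template that sets a single tag: the whole Tags array is now tracked.
template_with_tags = {
    "InstanceType": "t3.micro",
    "Tags": [{"Key": "FOO", "Value": "BAR"}],
}

print(detect_drift(template_without_tags, actual))  # []
print(detect_drift(template_with_tags, actual))     # ['Tags']
```

Note that FOO=BAR itself is untouched in the second case. The drift comes purely from the array as a whole no longer matching the template.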

Managing tags without causing drift

The above definition of drift detection isn’t 100% true. There are two kinds of tags that are not considered drift even though they are not defined in the template. The first group is tags added by AWS services. These start with the aws: prefix, and you cannot create or modify them. The second group ignored by CloudFormation drift detection is the tags added by CloudFormation itself.

When you apply tags to a CloudFormation stack, those tags are propagated to the resources in the stack. To avoid unnecessary drift, tools that apply extra tags to resources should start by updating the stacks with the proper tags, and only after that process the individual resources that are not part of any stack. Alternatively, the tool could check for each resource whether it belongs to a stack, and tag the resource directly only if it doesn’t. The limitation of this approach is that it applies the same key=value pairs to every resource in a given stack.
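Here is a rough Python sketch of the decision such a tagging tool would make. It assumes stack membership can be read from the aws:cloudformation:stack-name tag that CloudFormation puts on the resources it creates; a real tool might ask the CloudFormation API instead. The plan_tagging function and the resource dictionaries are hypothetical, and the sketch only builds a plan without calling AWS.

```python
STACK_NAME_TAG = "aws:cloudformation:stack-name"


def plan_tagging(resources):
    """Split resources into stacks to update and resources to tag directly.

    Each resource is a dict with an "id" and a "tags" mapping (assumed shape).
    """
    stacks = set()
    direct = []
    for resource in resources:
        stack_name = resource["tags"].get(STACK_NAME_TAG)
        if stack_name:
            # Stack member: tag the stack and let CloudFormation propagate.
            stacks.add(stack_name)
        else:
            # Not managed by any stack: safe to tag the resource itself.
            direct.append(resource["id"])
    return {"stacks": sorted(stacks), "resources": direct}


# Hypothetical inventory: two buckets from the same stack, one created by hand.
resources = [
    {"id": "bucket-a", "tags": {STACK_NAME_TAG: "web"}},
    {"id": "bucket-b", "tags": {STACK_NAME_TAG: "web"}},
    {"id": "bucket-c", "tags": {}},
]

print(plan_tagging(resources))
# {'stacks': ['web'], 'resources': ['bucket-c']}
```

The plan also makes the limitation visible: bucket-a and bucket-b collapse into a single entry for the web stack, so they can only ever get the same tags.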

Is the problem with IaC or reality?

But this was just tags, and avoiding drift is already getting complicated. When the state of your IaC tooling and reality don’t match, is the problem in reality or in how the state is managed? Saying that all operations must be done through a certain tool is a bit of a lazy argument, unless you are prepared to prove this can be done. I also find it hard to believe that people cause drift just because they are stupid or evil; there must be something built into the concept of IaC that attracts drift.

The second talk was Adam Jacob’s “What if Infrastructure as Code never existed”, where he explains the history of configuration-as-code and then breaks its rules and myths. The idea of having the single source of truth in your code especially resonated with me. It doesn’t matter how much you insist the truth should be in the Terraform/CloudFormation/Pulumi code: when it doesn’t align with reality, the code will always be wrong. Or as Adam puts it: “It is a f***ing lie!!!”

If infrastructure as code is something you are interested in (and it must be, as you read this far), I highly recommend watching Adam’s thought-provoking but also very entertaining talk.

Summary

Going to extremes doesn’t give the best results. The same applies to infrastructure as code. It certainly has its virtues, but saying “all things must be in IaC” will cause some (serious) drift that you must be prepared to handle. One way of handling it could be leaving certain parts out of scope, accepting some drift, and measuring the goals against available skills (see Adam’s comments about the 200% knowledge problem).

Remembering that drift isn’t the problem, but rather a symptom of code not being the single source of truth, makes infrastructure-as-model sound very tempting. But building a digital twin of an AWS account is going to be complex as well. I’m interested to see how System Initiative will succeed in building this and creating “Devops without papercuts”.