
From the Ashes: (Re)building the Spine Platform with IaC – Part 2

  • Jacob Head
  • Dec 11
  • 12 min read

Updated: Dec 12

Phoenix image designed by Wannapik

In my last post (read it here: [link]) I extolled the virtues of Infrastructure as Code (IaC) and described how we rebuilt a new company – Spine Energy Technology – from scratch using an existing IaC template. This extreme application of “lift and shift” proves the incalculable value of well-maintained IaC, and should give some hope to those dreamers whose disaster recovery plan is essentially just terraform apply.


However, cloud infrastructure management is a complicated beast and IaC is not a silver bullet. When a SaaS platform gets to a certain level of complexity, several challenges begin to emerge.


Some of these challenges are purely organisational: the way the IaC is used, or the interactions between the many component systems that make up the stack. Some of the challenges are technological: edge cases or peculiarities which start to emerge from sufficiently large-scale (or long-term) usage of IaC technologies.


The good news is, we hit both kinds of these speed bumps in the course of re-applying our entire codebase – so there are a lot of rich learnings to share. The bad news, of course, is that some of these problems might be lurking in your codebase too.


In this post, I’ll pass on some of that battle-scarred wisdom so that you can understand some of the limitations of the IaC pattern – even if you’re not attempting to apply it for a company-wide rebuild.


I’ll then distil these lessons learned into a few key IaC rules to live by in a later post.


The Bad

Let’s talk about what could have gone better for us while we rebuilt Spine from IaC.

This is a mixture of the unfortunate (bad luck or unfavourable circumstances going into the process) and the unhelpful (subtle pitfalls or drawbacks of the technology) – but none of these problems are unique to us.


Don’t get me wrong: I don’t mean to say these are situations where IaC can’t help you. Rather, these might be places where you need to be mindful of the limitations, because IaC alone won’t dig you out of the hole.


For us, one of the biggest challenges was:


  • We were in the middle of some big refactors


IaC doesn’t exist in a vacuum – it’s part of the living, changing tapestry that is the software development lifecycle.


Think about it: If you were to drop tools today and capture a snapshot of your codebase, what would it look like? Most of it would be fine, but I’d expect at least one service to appear caught like a rabbit in the headlights: halfway between a bright future and a murky tech-debt past.


If you then had to restore your stack from scratch using that IaC, would you choose to re-deploy the older version – warts and all – or the incomplete new version?


For us, there were three big refactor rabbits: a migration between Identity Providers (IdPs), the consolidation of our identity-as-a-service platform as a CDKTF monorepo, and the replacement of a restrictive battery scheduler service with something more flexible and extensible for the future.


In each case, we made the call to spend the time productionising the incomplete - but preferable - solution, rather than to deploy the wrong thing fast.


This may sound like the sunk-cost fallacy (and it is), but sometimes there’s no way to avoid spending more effort in this kind of exercise. Take our IdP refactor, for example:

We had been running an outdated version of Keycloak which, for a security-centric tool, is ... not great. Our options here were:


  1. Redeploy as-is and hope that nobody malicious sneezes too hard in our direction

  2. Deploy the latest Keycloak

  3. Push on and complete implementation of the new IdP


Given how far Keycloak interfaces have moved on (and hence the scale of rework required for option 2), the last two options weren’t so different in effort. So, we stuck to our guns and finished the refactor – trusting in our past decisions and moving to an IdP better suited to our usage (Zitadel).


This choice of technology ultimately worked really well for us, but it came at the cost of time spent completing the implementation.


Incomplete refactors were not the only tech-debt that came due for us though:

 

  • IaC versions can change subtly over time


Terraform is great for long-term stability. The core tech, and even the third-party provider plugins that power it, offer fantastic backwards and forwards compatibility: generally, you can pick up some IaC created years ago and re-run it successfully – at worst needing to run commands from inside an older Docker container.
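
For example, a minimal sketch of re-running an old stack with a pinned historical Terraform release via the official hashicorp/terraform Docker image (illustrative version tag and paths – adapt to your own setup and credentials):

docker run --rm -v "$(pwd)":/workspace -w /workspace hashicorp/terraform:0.12.31 plan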


We were working with IaC written across 10 years, and hence across a wide range of Terraform versions. By and large, this didn’t cause problems for independently-deployed infrastructure, but it could be a pain for platform-scale IaC.


If you’re consolidating this kind of infrastructure into a single deployable platform stack*, you also need to align on a single IaC version throughout that stack. In our case this meant that several parts of our platform jumped multiple years of IaC versions.


* - Which you might want to do so that values can be passed directly between the stack’s component parts, rather than needing to be provided as external config.

These version bumps can bring in anything from subtle, benign changes (e.g. defaults being added or changed) through to more inscrutable differences (e.g. structural changes or new mandatory fields). In the worst case, cloud vendors retire older versions of their Terraform providers altogether, when the burden of backwards compatibility starts to limit the development of their own products.
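
One mitigation – not something specific to our stack, just a sketch of standard Terraform practice – is to pin the core and provider versions explicitly in every stack, so that any jump is a deliberate, reviewable change rather than a surprise. Illustrative constraints:

terraform {
  required_version = "~> 1.6"   # illustrative: pin whatever version you standardise on

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"        # illustrative provider constraint
    }
  }
}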


All of this adds up to a large time-sink, and a lot of time squinting at the output of terraform plan.


But, even if every piece of IaC had started on the same, up-to-date version:

 

  • Not everything is built with IaC


At face value, this may seem like an unforced error – a gap in our IaC strategy with an easy solution. Obviously, if you don’t capture all aspects of your tech stack in IaC, then IaC alone cannot rebuild that tech stack.


But should everything be covered by IaC?


Let’s assume you’re starting from nothing and want to apply some pre-existing IaC – what are the fundamental prerequisites? If we assume you’re using something terraform-esque and trying to rebuild something cloud-ish, I’d say you need at least:

 

  • A runtime for your IaC

  • Somewhere to store your IaC state

  • A cloud account to create resources in

 

The runtime seems easy – any good CI system should be able to serve this purpose, with the added benefit that CI systems are generally driven by repeatable configuration files you could (and should) be storing alongside your IaC. You could even define your CI system setup itself through IaC!


But – err – if you did that, where would you apply the IaC from?


Perhaps in this kind of extreme situation you could run the bootstrapping from the terminal on your laptop, sure. But then, where are you going to store the state files?


For the uninitiated: in Terraform, the IaC state is stored (as a file) to keep track of all the resources managed by that stack. This acts both as a record of the current status of the stack and as a mapping layer between the general resources described by the IaC and the specific instances running in the cloud. This enables the IaC runtime to figure out which resources need to change (and how) without having to call out to each resource’s remote API to compute the delta.


One of the most common approaches to managing Terraform state is to store it in Amazon S3. This enables collaboration with other users and ensures that the state will not be misplaced or versioned incorrectly – backed by S3’s highly resilient file storage.
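
For reference, a minimal sketch of what that backend configuration looks like (hypothetical bucket name, key, and region):

terraform {
  backend "s3" {
    bucket  = "spine-terraform-state"       # hypothetical bucket - must already exist
    key     = "platform/terraform.tfstate"
    region  = "eu-west-2"                   # hypothetical region
    encrypt = true                          # encrypt the state file at rest in S3
  }
}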


But to use the S3 backend for terraform you already need to have an S3 bucket and an AWS account to host it in (more on this in a moment). Both of these things should arguably be managed by IaC as well, which poses something of a circular dependency.


You could rely on local state files for some of these foundational resources (but you probably shouldn’t*), or maybe store the state files in version control (but you really shouldn’t**)?


*- Local state mostly precludes collaboration unless you want to keep passing the file around – and good luck if you misplace the file or use a historical version rather than the latest.


**- State files capture a lot of information about the resources they represent – including sensitive credentials. Storing state files in version control is as bad an idea as storing your passwords in version control. Terraform, in fact, recommends encrypting your state file at rest and limiting who can access it.


So, next problem: how do you create the cloud account you want to apply your IaC into?

Again, you could define this like any other resource in IaC – but does anyone actually do that? Should anyone do that?


For example, AWS recommends that any mature cloud-native organisation have one root account for organisation management and central billing (which otherwise contains no workloads or resources), nested under which are multiple functional accounts aligned with business purpose. This is both a better security posture*** and a better encapsulation of separated concerns across the tech stack.

 

*** - i.e. a compromised leaf account does not grant access to the wider organisation, and the security cross-section of the root account is kept as small as possible.

 

This means, however, that there should probably not be a CI runner or user with (enduring) authorisation to create resources in that root account. Therefore, any IaC stack responsible for account creation should probably be a single-use setup step (and de-authorised shortly thereafter).
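
If you did choose to capture account creation in IaC, it might look something like the following sketch using the AWS provider’s organisations resources (hypothetical names and email) – applied once with root-account credentials and then locked away:

# Single-use bootstrap stack: run with root/management account credentials,
# then de-authorised once the member account exists
resource "aws_organizations_account" "workloads" {
  name      = "spine-workloads"             # hypothetical member account name
  email     = "aws-workloads@example.com"   # hypothetical unique billing email
  role_name = "OrganizationAccountAccessRole"
}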


In summary, some of the early prerequisites for rebuilding your world from IaC can themselves be represented in IaC, but this approach requires so much contrivance that it starts to lose value. If an IaC stack can only be meaningfully run once and uses different patterns for authorisation or state management than your wider ecosystem, why use IaC here at all?


In our case, many of these foundational resources were entirely absent from our IaC codebase, and had to be either reverse-engineered or redone manually. I suspect it’s a similar story for most tech stacks out in the wild.


But suppose you did have some complete, IaC-purist representation of your entire organisation. The bootstrapping issues I’ve discussed above (e.g. “I can’t do X before Y because X relies on Y!”) point to another universal – and more insidious – problem:

 

  • IaC can be dependency hell


In an ideal world every cloud resource (and hence every piece of IaC) can be deployed independently and in any order. In the real world, infrastructure can have very hard dependencies on other infrastructure, and this only gets worse the further upstream you go*.

 

* - e.g. You can very easily deploy application-layer dependencies out of order, and an app will happily crashloop until everything is in place. You can’t deploy an app at all if you don’t have a runtime, and you probably can’t deploy a runtime without some compute.

 

This is an equal problem for monolithic infrastructure stacks and distributed independent infrastructure stacks: in the former, your IaC stack will not be able to complete; in the latter, your many stacks won’t apply successfully unless you run them in the correct order. In our case, we had a mixture of both and hit both kinds of problem.


The more insidious aspect of this drawback of IaC is that it is entirely possible to sleepwalk, over time, into a situation where your stack(s) encode an incorrect or incomplete dependency chain. This might apply to you (yes: you!) right now without you even realising it. In fact, you might not realise the layer of dependency hell you’ve sunk to until you try to do something like rebuilding your stack from scratch.

 

Let me explain with a simple example.

 

Suppose you have a simple, monolithic infrastructure stack with three components: A, B, and C. These components depend on each other as follows, and so you structure your IaC stack accordingly:

v1: A <-- B <-- C

i.e. Component C depends on component B, which in turn depends on component A


Everything so far is correct, so you apply your IaC and A, B, and C spring into being somewhere in the cloud.


Later, you want to introduce component D into the stack. D has a runtime dependency on C, but you overlook this. Worse, during implementation you wire in a config dependency for B on D (e.g. you take the output of D and feed it into the config of B). Your stack should look like this:

vCorrect: A <-- B <-- C <-- D


But it actually looks like this:

v2: A <-- D <-- B <-- C
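
In Terraform terms, the mistake might look something like this sketch (hypothetical modules and outputs; component A omitted for brevity). Nothing in D’s configuration references C, so Terraform never learns about the runtime dependency – while the reference to D’s output from B quietly creates the unintended edge:

module "c" {
  source = "./modules/component-c"
}

module "d" {
  source = "./modules/component-d"
  # D needs component C at runtime, but nothing here references module.c,
  # so Terraform has no idea this edge exists
}

module "b" {
  source     = "./modules/component-b"
  d_endpoint = module.d.endpoint   # implicit dependency: B now waits for D
}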

 

However, “good” news: your IaC applies just fine! Component D gets created, component B gets updated with its new config – everything seems great!

 

“So, what’s going on?”, I hear you cry.


Well, IaC attempts to apply only the delta between your original state and your desired state. This means that the “apply” stage of v2 attempts to:


  1. Deploy D

  2. Update B with D’s output as config


Step 1 requires component C in order to be successful, but luckily you had already deployed it during the rollout of v1.

 

“Great! So, what’s the problem exactly?”

 

We’ve established that you can successfully deploy an invalid dependency tree, sequentially. But what if you wanted to re-deploy a fresh version of your stack?


In deploying v2 from scratch, your IaC would attempt to:


  1. Deploy A

  2. Deploy D <-- This will fail!

  3. Deploy B (with output config from D)

  4. Deploy C


Step 1 would be fine, but step 2 would fail (D doesn’t have its runtime dependency, C). Without the output from D, step 3 would be blocked, and hence so would step 4.


Your stack is now deadlocked by its dependencies and can only be fixed by correcting the dependency tree to match vCorrect** above.

 

**- Note that you would also need to remove the dependency from B to D as part of this, otherwise you’d end up with a circular (and equally deadlocked) dependency tree.
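
Continuing the hypothetical module sketch from above, the fix in Terraform terms is to make D’s runtime dependency on C explicit (and to drop the output reference from B), for example with depends_on:

module "d" {
  source = "./modules/component-d"
  # make the runtime dependency explicit so Terraform creates C before D
  # (and destroys D before C)
  depends_on = [module.c]
}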

 

“Oh no! Can this get any worse?”, you wail, despondent.

 

Of course it can! Let’s make the failure mode even more inscrutable: suppose you never wired in the config dependency of B on D, but still omitted D’s runtime dependency on C.


Your stack now looks like this:

 

v3: A <-- B <-- C
    A <-- D


As before, moving from v1 to v3 would succeed without a hitch, but in attempting to deploy v3 from nothing, your IaC would try to do something like:


  1. Deploy A

  2. In parallel:

    1. Deploy B

    2. Deploy D  <-- This might fail. Sometimes.

  3. Deploy C


Step 2b will sometimes fail for the same reason as the earlier example: there’s no component C to satisfy the runtime dependency. But what’s more likely to happen is that this stage will hang for a while, waiting for component D to stabilise (e.g. by passing some liveness test). If you’re lucky, this will be enough time for 2a to complete and step 3 to start running, deploying the component C needed for 2b to complete successfully.


Hooray: We’ve traded a deadlocked dependency tree for a dependency race condition! Let’s hope steps 2a and 3 don’t ever vary in duration or we might have a stack which fails sometimes (hint: you do).


Now scale this example up to dozens of components across many parallel branches (and loosely coupled IaC stacks) and you can probably imagine how many nasty gremlins might be hiding in your dependency tree.


Things aren’t conceptually any better for this example with many independent IaC stacks. The only material difference would be that you, the human in the loop, would be manually orchestrating the dependency tree through the order in which you applied those stacks.


So, what’s the solution here?


Ultimately you need to be on top of your IaC dependencies, but this is far easier said than done. In my experience, the best solution is to regularly test your IaC stack thoroughly, and a great tool for this is performing a destroy/create cycle.


Let me explain: If your dependency chain looks like this:

A <-- B <-- C


Then a terraform apply would:

  1. Create A

  2. Create B

  3. Create C


But terraform destroy would:

  1. Destroy C

  2. Destroy B

  3. Destroy A


i.e. the dependency chain is inverted to ensure that resources relied upon by others are not taken away until nothing else depends on them.


This is really useful for finding dependency chain misconfigurations!


Let’s take the misconfigured dependency chain from earlier:

A <-- D <-- B <-- C

 

Where, if you recall, D actually depends on C, but is misconfigured to depend on A. If we tried to terraform destroy this stack, this is what would happen:


  1. Destroy C

    1. D would break at this stage, because its dependency is gone!

  2. Destroy B

  3. Try to destroy D <-- this will likely fail because D is in an unhealthy state


Even if step 3 didn’t somehow fail, you would not be able to re-apply the stack from scratch because of the dependency ordering described earlier. So, by performing a full destroy/create cycle you can easily catch mistakes or omissions in your dependency chain.


In our case, we had been running destroy/create cycles on one environment nightly and it caught a lot of issues. Did this mean we didn’t experience any dependency problems during our rebuild? No, but we were far, far better prepared.
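
As a sketch (assuming a disposable environment and a dedicated workspace – the details will vary with your setup), the nightly cycle is essentially just:

terraform workspace select nightly-test
terraform destroy -auto-approve
terraform apply -auto-approve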

 

Conclusion

All in all, while IaC is a great pattern in general for infrastructure management (and terraform can be a powerful tool, specifically), it cannot solve all your problems. More troubling: it’s easy to get subtly wrong in ways that won’t become obvious until you need to recover from some disaster.


In our case, needing to fix forward (rather than redeploy IaC frozen midway through a metamorphosis) and having to unpick a snarled knot of dependencies added considerable time to our reboot. However, these kinds of pitfalls weren’t the only things holding us back.


In the next part of this series, I’ll explain the grimy underbelly of IaC: the ugly truths, technological missteps, and the places where reasonable choices made in the past did not help us out in Spine’s rebirth.


As a ray of hope, in the final part of the series I’ll distil a few key takeaways and best practice recommendations to help you and your IaC ecosystem (even if you don’t need to rebuild your world from scratch).

 
 
 
