Operations: Recover from an incomplete Terraform execution - 22Acacia/sossity GitHub Wiki
It is unlikely, but possible, for a Terraform execution to fail after it has altered a number of resources but before it has written out its state. This is analogous to a database write reaching a table even though the surrounding transaction fails. At this point the world has entered a paradox from Terraform's point of view: its recorded state no longer matches the real infrastructure, and manual intervention is required.
This page covers how to manually alter Terraform's state file to bring it into line with the current infrastructure, so that future executions run as expected.
During a normal execution, Terraform will:
- destroy any existing resources that are no longer desired
- create any new resources that are desired and modify any existing resources as desired.
We will perform the same steps manually.
- Download the current state file from Atlas.
  It is wrong, but it is the best place to start.
- Open this file in an editor of your choice, preferably one that provides bracket highlighting. There will be JSON editing later, and this is a great help.
- Find the 'serial' key. It is one of the first few keys and has an integer value; it will be used later.
- Execute sossity to generate the current Terraform config file.
  This assumes the current working directory is an up-to-date checkout of pipeline-controller.
- Run 'terraform refresh' (start of the delete loop).
- Increase the value of the serial key by one.
- For every resource that the refresh command lists as 'not found - 404', delete that document from the Terraform state file.
- Return to the start of the delete loop.
  This may have to be done several times, as Terraform refreshes resources one level of the dependency chain at a time.
- Stop when the refresh command returns successfully.
This will take ten to fifteen minutes to complete.
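The bookkeeping in each pass of the delete loop can be scripted instead of done by hand. The following is a minimal Python sketch under stated assumptions, not a tool from this repository: it assumes the legacy JSON layout written by Terraform versions of the Atlas era (an integer top-level "serial" and a "modules" list whose "resources" object is keyed by "<type>.<name>"), and every function name is hypothetical. It deliberately does not parse the refresh output, since the exact wording of the 404 messages may vary; the operator copies the missing addresses in by hand.

```python
import json

# Assumes the legacy (pre-0.12) state layout: top-level "serial" and a
# "modules" list, each with a "resources" object keyed by "<type>.<name>".

def load_state(path):
    """Load the state file downloaded from Atlas."""
    with open(path) as f:
        return json.load(f)

def save_state(state, path):
    """Write the edited state back out, keeping key order for easy diffing."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
        f.write("\n")

def resource_addresses(state):
    """Return every resource address ("<type>.<name>") across all modules."""
    return [address
            for module in state.get("modules", [])
            for address in module.get("resources", {})]

def remove_missing(state, missing):
    """One pass of the delete loop: bump the serial by one, then delete each
    address in `missing` (copied by hand from the 'not found - 404' lines of
    the refresh output) from every module. Returns the addresses removed."""
    state["serial"] = int(state["serial"]) + 1
    removed = []
    for module in state.get("modules", []):
        resources = module.get("resources", {})
        for address in missing:
            if resources.pop(address, None) is not None:
                removed.append(address)
    return removed
```

A pass would then be `state = load_state("state.json")`, `remove_missing(state, [...])`, `save_state(state, "state.json")`, followed by another 'terraform refresh'; the file name here is only an example.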
- Run 'terraform plan'.
- Identify each resource in the plan that has already been created. These are the resources that need to be added manually to the Terraform state file.
- For each resource that needs to be added, add a document of the form:
"<type>.<name>": { "type": "<type>", "primary": { "id": "<id of resource in gcloud>" } }
- <type> is something like google_storage_bucket or googlecli_container_replica_controller
- <name> is the name value for this resource from the .tf config file generated by sossity
- <id of resource in gcloud> is the unique, service-specific ID of each resource.
  For example, for buckets it is the bucket name and for Pub/Sub topics it is projects//topics/; for all others, find an existing resource of the same type in the state file to determine the pattern of the ID.
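Adding a document of the form above can also be scripted. This is a minimal sketch, assuming the same legacy layout and a single module whose "path" is ["root"]; add_resource, its parameters, and the example names are hypothetical. It refuses to overwrite an existing entry, which guards against clobbering a resource that refresh already knows about. It does not touch the serial; increment that as described in the steps.

```python
def add_resource(state, rtype, name, resource_id, module_path=("root",)):
    """Insert a minimal '<type>.<name>' document into the module whose
    "path" matches `module_path` (assumed ["root"] for a single-module setup).
    Raises ValueError rather than overwrite an existing entry."""
    address = f"{rtype}.{name}"
    for module in state.get("modules", []):
        if tuple(module.get("path", [])) == tuple(module_path):
            resources = module.setdefault("resources", {})
            if address in resources:
                raise ValueError(f"{address} is already in the state file")
            # Only type and primary.id; attributes are left for refresh to fill.
            resources[address] = {"type": rtype, "primary": {"id": resource_id}}
            return address
    raise KeyError(f"no module with path {list(module_path)}")
```

For example, `add_resource(state, "google_storage_bucket", "raw_events", "raw-events")` adds the entry "google_storage_bucket.raw_events" with the bucket name as its ID.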
If desired, the operator can run 'terraform refresh' after adding a resource to check that it has been introduced correctly. If this is done, be sure to increment the serial value after each refresh.
This can be error-prone, so be very careful. This is also where bracket highlighting in the editor is most useful.
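Because hand-edited JSON is easy to break, it can help to re-check the file before running refresh or uploading it. A minimal sketch with a hypothetical check_state_text helper and the same legacy-layout assumptions: json.loads reports unbalanced brackets and mismatched quotes with a line and column, and the loop checks that each resource entry has the minimal shape described above. It assumes every address is a managed resource of the form "<type>.<name>".

```python
import json

def check_state_text(text):
    """Return a list of problems found in edited state-file text (empty if none)."""
    # json.loads reports unbalanced brackets and mismatched quotes with a position.
    try:
        state = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(state, dict):
        return ["top level of the state file is not a JSON object"]
    problems = []
    if not isinstance(state.get("serial"), int):
        problems.append("missing or non-integer 'serial'")
    for module in state.get("modules", []):
        for address, doc in module.get("resources", {}).items():
            # Assumes managed resources addressed as "<type>.<name>".
            rtype = address.split(".", 1)[0]
            if not isinstance(doc, dict):
                problems.append(f"{address}: entry is not a JSON object")
                continue
            if doc.get("type") != rtype:
                problems.append(f"{address}: 'type' does not match the address prefix")
            if not doc.get("primary", {}).get("id"):
                problems.append(f"{address}: missing primary.id")
    return problems
```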
Many resources also require a number of attributes to be present in the state file to prevent Terraform from tearing them down and rebuilding them. It is straightforward to let Terraform rebuild the system, but that may not be practical from a business perspective.
This is not recommended practice by Terraform or 22Acacia and is documented only for the case where a Terraform run fails without saving its state.
This will take 30 to 120 minutes to complete.