Updating custom resources causes them to be deleted?

It’s important to understand the custom resource life cycle, to prevent your data from being deleted.

An important thing to know is that CloudFormation compares the physical resource ID returned by your Lambda function to the one it returned previously. If the IDs are different, CloudFormation assumes the resource has been replaced with a new resource. Then something interesting happens.

When the resource update logic completes successfully, a Delete request is sent with the old physical resource id. If the stack update fails and a rollback occurs, the new physical resource id is sent in the Delete event.
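The rule above can be restated as a small decision function. This is only a model of the behavior described here, not CloudFormation's own code; the name `deleteTargetAfterUpdate` and its parameters are made up for illustration.

```javascript
// Model of which physical resource ID receives the Delete request after an
// Update, as described above. Hypothetical helper, not an AWS API.
function deleteTargetAfterUpdate(oldId, newId, updateSucceeded) {
  // Same ID returned: treated as an in-place update, nothing to delete.
  if (oldId === newId) {
    return null;
  }
  // Different ID: the resource counts as replaced. On success the old one
  // is cleaned up; on rollback the newly created one is.
  return updateSucceeded ? oldId : newId;
}

console.log(deleteTargetAfterUpdate("res-1", "res-1", true));  // in-place update
console.log(deleteTargetAfterUpdate("res-1", "res-2", true));  // successful replacement
console.log(deleteTargetAfterUpdate("res-1", "res-2", false)); // rollback
```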

You can read more here about the custom resource life cycle and other best practices.


The problem seems to be the sample implementation of the sendResponse() function that is used to send the custom resource completion event back to CloudFormation. This method is responsible for setting the custom resource's physical resource ID. As far as I understand, this value represents the globally unique identifier of the "external resource" that is managed by the Lambda function backing the CloudFormation custom resource.

As can be seen in CloudFormation's "Lambda-backed Custom Resource" sample code, as well as in the cfn-response NPM module's send() and CloudFormation's built-in cfn-response module, this function has a default for the physical resource ID: if none is provided as the 5th parameter, it uses the name of the CloudWatch Logs log stream that handles logging for the request being processed:

var responseBody = JSON.stringify({
    ...
    PhysicalResourceId: context.logStreamName,
    ...
})

Because CloudFormation (or the AWS Lambda runtime?) occasionally switches to a new log stream, the physical resource ID generated by sendResponse() changes unexpectedly from time to time, which confuses CloudFormation.

As I understand it, CloudFormation-managed entities sometimes need to be replaced during an update (a good example is AWS::RDS::DBInstance, which needs replacing for many kinds of property changes). CloudFormation's policy is that if a resource needs replacing, the new resource is created during the "update stage" and the old resource is deleted during the "cleanup stage".

So using the default sendResponse() physical resource ID calculation, the process looks like this:

  1. A stack is created.
  2. A new log stream is created to handle the custom resource logging.
  3. The backing Lambda function is called to create the resource, and the default behavior sets its resource ID to the log stream ID.
  4. Some time passes.
  5. The stack gets updated with new parameters for the custom resource.
  6. A new log stream is created to handle the custom resource logging, with a new ID.
  7. The backing Lambda function is called to update the resource, and the default behavior sets a new resource ID: the new log stream ID.
  8. CloudFormation understands that a new resource was created to replace the old resource and according to the policy it should delete the old resource during the "cleanup stage".
  9. CloudFormation reaches the "cleanup stage" and sends a delete request with the old physical resource ID.
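The steps above can be reproduced with a toy simulation. It assumes the default fallback of the sample code (the caller's ID if given, otherwise `context.logStreamName`); `fakeSend`, `simulate`, and the log stream names are invented for the example, and the replacement check is a simplified model of what CloudFormation does.

```javascript
// Mimics the default ID selection of the sample sendResponse(): use the
// caller's ID if given, otherwise fall back to the log stream name.
function fakeSend(context, physicalResourceId) {
  return { PhysicalResourceId: physicalResourceId || context.logStreamName };
}

// Toy model of a Create followed by an Update. Returns the IDs that would
// receive a Delete request during the cleanup stage.
function simulate(explicitId) {
  const deletes = [];
  // Steps 1-3: Create is handled while logging to one stream.
  const created = fakeSend({ logStreamName: "stream-A" }, explicitId);
  // Steps 5-7: Update is handled later, logging to a different stream.
  const updated = fakeSend({ logStreamName: "stream-B" }, explicitId);
  // Steps 8-9: a changed ID is read as a replacement, so the old ID is deleted.
  if (updated.PhysicalResourceId !== created.PhysicalResourceId) {
    deletes.push(created.PhysicalResourceId);
  }
  return deletes;
}

console.log(simulate(undefined));     // default ID: the old log stream name is deleted
console.log(simulate("my-fixed-id")); // stable ID: nothing is deleted
```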

The solution, at least in my case, where I never "replace the external resource", is to fabricate a unique identifier for the managed resource, provide it as the 5th parameter to the send response routine, and then stick to it: in every update response, send back the same physical resource ID that was received in the update request. CloudFormation will then never send a delete request during the "cleanup stage".

My implementation (in JavaScript) looks something like this:

    // Reuse the ID CloudFormation sent with the request (Update/Delete); generate one on Create.
    var resID = event.PhysicalResourceId || uuid();
    ...
    sendResponse(event, context, status, resData, resID);
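For a self-contained version of the same idea, here is a minimal sketch. It assumes Node.js 14.17 or later, so that `crypto.randomUUID()` can stand in for `uuid()`, and it reads the previous ID from the top-level `PhysicalResourceId` field that CloudFormation includes in Update and Delete requests. `stablePhysicalId` is a hypothetical helper name.

```javascript
const { randomUUID } = require("crypto");

// Hypothetical helper: reuse the ID CloudFormation sent (Update/Delete),
// or mint a new one when the request carries no ID yet (Create).
function stablePhysicalId(event) {
  return event.PhysicalResourceId || randomUUID();
}

const createId = stablePhysicalId({ RequestType: "Create" });
const updateId = stablePhysicalId({
  RequestType: "Update",
  PhysicalResourceId: createId,
});
console.log(createId === updateId); // true
```

The returned value is what would be passed as the 5th argument to the send response routine.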

Another alternative, which would probably only make sense if you actually need to replace the external resource and want to adhere to the CloudFormation model of removing the old resource during cleanup, is to use the actual external resource ID as the physical resource ID and, when receiving a delete request, to use the provided physical resource ID to delete the old external resource. That is probably what the CloudFormation designers had in mind in the first place, but their default sample implementation causes a lot of confusion, probably because it doesn't manage a real resource and has no update functionality. I also found no CloudFormation documentation that explains the design and reasoning.
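A rough sketch of that alternative, with an in-memory `Map` standing in for the external system. `externalStore`, `handle`, and the `Name` property are invented for the example; a real handler would call the external service and then report the result back through sendResponse(). In this sketch an Update always creates a replacement, to show how the later Delete request cleans up the old resource.

```javascript
// In-memory stand-in for the external system (assumption for the sketch).
const externalStore = new Map();
let nextId = 1;

// Returns the physical resource ID to report back to CloudFormation.
function handle(event) {
  switch (event.RequestType) {
    case "Create":
    case "Update": {
      // Create a new external resource and report its real ID. On Update
      // the ID changes, so CloudFormation will later send a Delete request
      // carrying the old ID.
      const id = "ext-" + nextId++;
      externalStore.set(id, { name: event.ResourceProperties.Name });
      return id;
    }
    case "Delete":
      // Delete exactly the resource named by the provided physical ID.
      externalStore.delete(event.PhysicalResourceId);
      return event.PhysicalResourceId;
    default:
      throw new Error("Unknown RequestType: " + event.RequestType);
  }
}

const oldId = handle({ RequestType: "Create", ResourceProperties: { Name: "v1" } });
const newId = handle({
  RequestType: "Update",
  PhysicalResourceId: oldId,
  ResourceProperties: { Name: "v2" },
});
// Cleanup stage: CloudFormation asks for the old resource to be removed.
handle({ RequestType: "Delete", PhysicalResourceId: oldId });
console.log([...externalStore.keys()]); // only the replacement remains
```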