s3: possible race condition, adding new object to bucket between deletion of bucket objects and bucket deletion prevents deletion #21776
Based on the error you received, it looks like the custom resource didn't actually finish emptying the bucket (although it did try to delete it):
This can happen when S3 buckets have a large number of objects in them. I was able to run:

```ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';

export class DummyStack extends cdk.Stack {
  public readonly Bucket: cdk.CfnOutput;
  public readonly logBucket: cdk.CfnOutput;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create a bucket for all the different logs.
    const logBucket = new s3.Bucket(this, 'LogBucket', {
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
      accessControl: s3.BucketAccessControl.LOG_DELIVERY_WRITE,
      enforceSSL: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
    });

    const bucket = new s3.Bucket(this, 'OutputBucket', {
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
      enforceSSL: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      serverAccessLogsPrefix: "output-access-logs/",
      serverAccessLogsBucket: logBucket,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
    });

    this.Bucket = new cdk.CfnOutput(this, 'S3Bucket', {
      value: bucket.bucketName,
      description: "The bucket",
    });

    this.logBucket = new cdk.CfnOutput(this, 'logS3Bucket', {
      value: logBucket.bucketName,
      description: "The log bucket",
    });
  }
}
```
I wasn't able to reproduce this with the template and steps provided, @max-allan-surevine; maybe @pradoz is right and this is related to having a large number of objects to delete?
If the "empty bucket" lambda has not emptied the bucket, surely it should not be returning "Success", regardless of the reason why it didn't empty the bucket? Did you definitely wait for some content to be published to the log bucket before deleting the stack? Using the stack @pradoz provided, I still get exactly the same problem. There is a single folder output-access-logs with a single object in it. The stack was created at about 09:55 BST, the object is dated 10:20:50 BST, and the delete failed at 09:21:07.499Z (10:21 BST). So the Lambda appears to be failing to delete a single object. Is it possible the log delivery mechanism is holding an object "open" so that it can't be deleted?

Oh, and the logs from the auto-delete-objects lambda ran at 09:20:19Z, which is about 30 seconds before the timestamp on the object in the bucket (10:20:50 BST):
Yes. For reference, here's our logic for emptying a bucket.
We don't check to ensure the bucket is actually emptied, and assume that this function takes care of it. I'll try to reproduce this again later today. According to your description, I wonder if the remaining object was added to the bucket during the attempt to delete all objects. Or, according to your timeline, maybe it was added after all objects were deleted, but before the bucket itself was deleted.
I just tried again and waited a long time after adding files to the content bucket. No logs had been added to the log bucket for ~20 minutes (2 hours since content was added), and the bucket and stack then deleted successfully. The delayed log delivery seems to be causing this problem somehow, but it is going to be tricky to figure out whether the stray object arrives during the empty step or between empty and delete. During the empty step sounds plausible: there could be a lot of log files slowly being created in the background. The initial "list objects" wouldn't spot the new file, then it takes a while to delete all the objects, during which a new object is created by log delivery. I think simply repeating the delete-objects step would catch that most of the time, because the second pass would only have a small number of objects to delete and so complete quickly. But if there is then a delay before deleting the bucket, more files could still arrive. Maybe the best plan is to remove the log delivery permission before deleting the objects! Could I make the accessControl on the log bucket its own CDK object, with a dependency from the objects bucket, so the accessControl is created after and deleted before the buckets?
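The "repeat the delete" idea above can be sketched as a retry loop that keeps listing and deleting until a final listing really comes back empty. This is a hypothetical sketch, not the actual CDK handler: the `ObjectLister` interface and the `batch` and `emptyBucket` names are invented here, and a real implementation would back them with `ListObjectsV2`/`DeleteObjects` calls and also handle pagination and versioned objects.

```typescript
// Hypothetical sketch of "empty until verified empty"; not the real CDK handler.
// The S3 client sits behind a minimal interface so the loop logic is testable.
interface ObjectLister {
  listKeys(bucket: string): Promise<string[]>;               // e.g. ListObjectsV2
  deleteKeys(bucket: string, keys: string[]): Promise<void>; // e.g. DeleteObjects
}

// DeleteObjects accepts at most 1000 keys per request, so split into batches.
function batch<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Keep deleting until a listing comes back empty, or give up after maxPasses
// (objects may keep arriving, e.g. access logs trickling in from delivery).
async function emptyBucket(s3: ObjectLister, bucket: string, maxPasses = 5): Promise<boolean> {
  for (let pass = 0; pass < maxPasses; pass++) {
    const keys = await s3.listKeys(bucket);
    if (keys.length === 0) {
      return true; // verified empty, safe to delete the bucket
    }
    for (const group of batch(keys, 1000)) {
      await s3.deleteKeys(bucket, group);
    }
  }
  return (await s3.listKeys(bucket)).length === 0;
}
```

As the comment notes, this only narrows the window: a log object can still land between the last empty listing and the DeleteBucket call, so cutting off log delivery first would be the more robust fix.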
I'm not sure how possible any of this is. I don't think you could make accessControl on the log bucket its own object; the only things you can add dependencies on are actual CloudFormation resources, or abstractions of them. There are only the two buckets, with their two bucket policies. I'm not confident that rearranging the order of deletion here is a good idea, or even an idea that will work. Do you know what the current order of deletion is between the lambda deleting objects and all of these resources getting deleted?
We are seeing this as well.
Our observation is that the bucket is emptied; however, some queues are still processing, and server access logs get delivered into the bucket between the custom resource emptying the bucket and destroying it. Our solution could be
But I think the proper solution is for the custom resource to disable server access logging on the bucket before deleting the items in it, or to destroy the bucket first and then delete its logging bucket.
Hey, we've decided that, at least for now, we're not sure how to address this. We'll document that this is an issue that will need to be worked around if run into.
@peterwoodworth One solution might be to update the bucket policy to block adding objects to the bucket, and only then delete the objects. That should prevent writes to the bucket between clearing the bucket and deleting it (unless the write method used ignores the bucket policy).
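A sketch of the suggestion above, assuming the blocking is done by appending a blanket Deny for `s3:PutObject` to the bucket policy before the objects are deleted. Only the statement shape matters here; `withPutObjectDeny` and the `Sid` are hypothetical names, and actually applying the result would go through `GetBucketPolicy`/`PutBucketPolicy`. In IAM evaluation an explicit Deny overrides any Allow in the same policy, though deliveries already in flight could still land.

```typescript
// Hypothetical sketch: append a blanket Deny for s3:PutObject before emptying
// the bucket, so no principal (including log delivery) can add new objects.
interface PolicyStatement {
  Sid?: string;
  Effect: "Allow" | "Deny";
  Principal: unknown;
  Action: string | string[];
  Resource: string | string[];
}

interface BucketPolicy {
  Version: string;
  Statement: PolicyStatement[];
}

function withPutObjectDeny(policy: BucketPolicy, bucketName: string): BucketPolicy {
  return {
    ...policy,
    Statement: [
      ...policy.Statement,
      {
        Sid: "DenyWritesWhileEmptying", // hypothetical Sid
        Effect: "Deny",
        Principal: "*",
        Action: ["s3:PutObject"],
        Resource: [`arn:aws:s3:::${bucketName}/*`],
      },
    ],
  };
}
```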
Seeing this problem also. Interestingly, we observe the issue even though our system couldn't cause any access events on the content bucket during the stack destroy; we don't have any compute accessing the buckets at that time. I have no problem manually emptying and then deleting the access logs bucket, because there's nothing continuously accessing the content that would be populating the access logs. However, CDK fails to empty and then delete the access logs bucket every time, reliably. The content bucket gets emptied and deleted, never the access logs. Therefore, I wonder if it's a timing issue in CDK itself, perhaps marking its "empty the access logs bucket" process finished before its "empty the content bucket" process has finished. One theory:
I'm seeing the exact same problem, and after a little bit of debugging I thought this could be the problem too.
In addition, I noticed that after running
I'm seeing the same problem, and was looking to see if I could implement a fix. My thought was to modify the auto-delete-objects handler so that it disables access logging before it starts deleting objects. This would ensure that the downstream logging bucket is not hit with access log entries resulting from the objects being deleted in the source bucket. However, I was then a bit confused, because the existing code already puts a policy statement onto the bucket before it starts deleting objects. This statement explicitly denies PutObject for all principals. So I wonder: is the more specific bucket policy allowing PutObject for logging given priority over the Deny for PutObject? I would be happy to submit a patch, but I need feedback on whether my interpretation is correct.

Background info: this is the existing policy on my logging bucket, before I try to destroy the stack:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<roleid>:role/<replaced>-f-CustomS3AutoDeleteObjects-FhzxbAEGWCwc"
      },
      "Action": [
        "s3:DeleteObject*",
        "s3:GetBucket*",
        "s3:List*",
        "s3:PutBucketPolicy"
      ],
      "Resource": [
        "arn:aws:s3:::<replaced>-logging-bucket",
        "arn:aws:s3:::<replaced>-logging-bucket/*"
      ]
    },
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<replaced>-logging-bucket",
        "arn:aws:s3:::<replaced>-logging-bucket/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "logging.s3.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<replaced>-logging-bucket/fe*",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "<id>"
        },
        "ArnLike": {
          "aws:SourceArn": "arn:aws:s3:::<replaced>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::054676820928:root"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<replaced>-logging-bucket/AWSLogs/255377600067/*"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "delivery.logs.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<replaced>-logging-bucket/AWSLogs/255377600067/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-acl": "bucket-owner-full-control"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "delivery.logs.amazonaws.com"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::<replaced>-logging-bucket"
    }
  ]
}
```

And this is the code fragment in packages/@aws-cdk/custom-resource-handlers/lib/aws-s3/auto-delete-objects-handler/index.ts:

```ts
policy.Statement.push(
  // Prevent any more objects from being created in the bucket
  {
    Principal: '*',
    Effect: 'Deny',
    Action: ['s3:PutObject'],
    Resource: [`arn:aws:s3:::${bucketName}/*`],
  },
);
```
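The "disable access logging first" idea can be sketched with S3's PutBucketLogging API, which turns server access logging off when `BucketLoggingStatus` is left empty. This is an illustration, not the handler's code: `disableLoggingInput` is a made-up helper that only builds the request input; sending it would use `PutBucketLoggingCommand` from `@aws-sdk/client-s3`.

```typescript
// Hypothetical helper: build a PutBucketLogging input that disables server
// access logging on the source bucket (an empty BucketLoggingStatus means off).
interface PutBucketLoggingInput {
  Bucket: string;
  BucketLoggingStatus: {
    LoggingEnabled?: { TargetBucket: string; TargetPrefix: string };
  };
}

function disableLoggingInput(sourceBucket: string): PutBucketLoggingInput {
  return {
    Bucket: sourceBucket,
    BucketLoggingStatus: {}, // no LoggingEnabled block = logging disabled
  };
}
```

Even with this, S3 delivers access logs on a best-effort basis with a delay, so entries queued before logging is disabled may still arrive afterwards; the emptying step would still need to tolerate late arrivals.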
There seems to be a duplicate issue around this problem, #26874. @PieterVanEde @jduchon-sonarsource I see you both commented this year, and I want to check the version of CDK you used. This PR added a Deny policy during the custom resource delete. One other note, to answer @PieterVanEde's question on the policy:
This policy was not the one added by the custom resource (in the PR). The custom resource only adds the policy when handling the delete. I would like the community's help on this issue, to see if anyone is still able to reproduce it using the latest versions of CDK (v2.130+). If anyone can still reproduce this issue using a newer CDK version, please tag me immediately and describe how you reproduced it. I'll take a look and get back to you ASAP.
At the project where I observed this behaviour, we are using CDK 2.130 exactly. I couldn't find a clearer reproduction path, because this seems to be a race condition without a deterministic trigger. But if we could somehow prevent new logs from being written to the access bucket before we start cleaning it up, that should solve the problem.
@PieterVanEde thanks for getting back to me. That's interesting, because AFAIK we've added a Deny policy for exactly this. This policy is attached when CloudFormation sends the delete event.
Describe the bug
If I create a log bucket, then a content bucket using the log bucket for server access logs, and then run `cdk destroy`, the access logs are not removed and the stack fails to destroy.
Expected Behavior
The destroy to complete successfully! (The log bucket and content to be removed.)
Current Behavior
CloudWatch logs for the delete-objects Lambda suggest it was successful:

But the result of the `cdk destroy` is a failure because there is still content:

Reproduction Steps
Deploy the template below. Upload your current directory to "bucket" and wait up to an hour until there are some logs in logBucket. Then run the destroy.
(Yes, the code is a bit janky; I did a lazy copy/paste and removed some logic that isn't relevant.)
Possible Solution
I imagine the "delete content" lambda is not checking some error condition it isn't expecting, and reports success when it hasn't succeeded. The Lambda does get deleted, and the code in the Lambda editor is minified, so I couldn't easily retry or debug it.
Additional Information/Context
The logic I removed in the code example avoids deleting the content in our "production" deployments. But in dev, we have no need to keep the access logs. So if someone is thinking "but it's log data, we can't delete that", please: do delete any data. Or at least provide an "actuallyAutoDeleteAllObjects: true" option.
CDK CLI Version
2.38.1 (build a5ced21)
Framework Version
No response
Node.js Version
Node.js v18.8.0
OS
OSX
Language
Typescript
Language Version
No response
Other information
No response