(aws-eks): Custom::AWSCDK-EKS-HelmChart StateNotFoundError: State functionActiveV2 not found #23862

Closed
beamsies opened this issue Jan 27, 2023 · 19 comments
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.

Comments

@beamsies

beamsies commented Jan 27, 2023

Describe the bug

I'm trying to deploy an eks cluster from a tutorial I'm following here:

After I run cdk bootstrap I then run cdk deploy.

Then I get this error a little more than halfway through the process:

11:27:27 PM | CREATE_FAILED | Custom::AWSCDK-EKS-HelmChart | my-cluster/chart-a...r/Resource/Default Received response status [FAILED] from custom resource. Message returned: StateNotFoundError: State functionActiveV2 not found. at constructor.loadWaiterConfig (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:196:32) at new constructor (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:64:10) at features.constructor.waitFor (/var/runtime/node_modules/aws-sdk/lib/service.js:271:18) at Object.defaultInvokeFunction [as invokeFunction] (/var/task/outbound.js:1:826) at processTicksAndRejections (internal/process/task_queues.js:95:5) at async invokeUserFunction (/var/task/framework.js:1:2149) at async onEvent (/var/task/framework.js:1:365) at async Runtime.handler (/var/task/cfn-response.js:1:1543) (RequestId: 0d8ef9af-72af-4130-82bb-c480d217e863)
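For anyone comparing failures across deployments, the text above can be split into the error name, the message, and the request ID. This is only a minimal sketch: `parseCustomResourceFailure` and the regular expressions are hypothetical helpers written for this thread, and they assume the "Message returned: <Name>: <message> at <frame> ... (RequestId: <id>)" layout seen here.

```typescript
// Hypothetical helper for illustration; not part of the CDK or the AWS SDK.
interface CustomResourceFailure {
  errorName: string;               // e.g. "StateNotFoundError"
  errorMessage: string;            // text between the error name and the first stack frame
  requestId: string | undefined;   // value of the trailing "(RequestId: ...)", if present
}

function parseCustomResourceFailure(text: string): CustomResourceFailure | undefined {
  // Assumed layout: "... Message returned: <Name>: <message> at <frame> ... (RequestId: <id>)"
  const main = /Message returned:\s*(\w+):\s*(.*?)\s+at\s/.exec(text);
  if (main === null) {
    return undefined;
  }
  const req = /\(RequestId:\s*([0-9a-fA-F-]+)\)/.exec(text);
  return {
    errorName: main[1] ?? "",
    errorMessage: main[2] ?? "",
    requestId: req === null ? undefined : req[1],
  };
}

// Usage with an abbreviated copy of the failure text from this issue.
const failure = parseCustomResourceFailure(
  "Received response status [FAILED] from custom resource. " +
  "Message returned: StateNotFoundError: State functionActiveV2 not found. " +
  "at async Runtime.handler (/var/task/cfn-response.js:1:1543) " +
  "(RequestId: 0d8ef9af-72af-4130-82bb-c480d217e863)"
);
console.log(failure?.errorName, failure?.requestId);
```

The message is cut at the first " at ", so a message that itself contains " at " would be truncated; that is acceptable for quick triage but not for anything stricter.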

Expected Behavior

I'm expecting the cdk deploy command to successfully deploy the eks cdk stack since it was from a tutorial on aws blog.

Current Behavior

The cdk deploy command failed with the following error:

11:27:27 PM | CREATE_FAILED | Custom::AWSCDK-EKS-HelmChart | my-cluster/chart-a...r/Resource/Default Received response status [FAILED] from custom resource. Message returned: StateNotFoundError: State functionActiveV2 not found. at constructor.loadWaiterConfig (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:196:32) at new constructor (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:64:10) at features.constructor.waitFor (/var/runtime/node_modules/aws-sdk/lib/service.js:271:18) at Object.defaultInvokeFunction [as invokeFunction] (/var/task/outbound.js:1:826) at processTicksAndRejections (internal/process/task_queues.js:95:5) at async invokeUserFunction (/var/task/framework.js:1:2149) at async onEvent (/var/task/framework.js:1:365) at async Runtime.handler (/var/task/cfn-response.js:1:1543) (RequestId: 0d8ef9af-72af-4130-82bb-c480d217e863)

Reproduction Steps

  1. git clone https://github.com/aws-samples/cdk-eks-fargate
  2. npm install
  3. npm i -g aws-cdk
  4. cdk bootstrap
  5. cdk deploy
  6. Look for error described above

Possible Solution

I have searched all over the internet for similar issues and I am not sure. I am learning from a tutorial and ran into this strange error. I've tried different versions of aws-cdk (going down) and that didn't help either.

Additional Information/Context

One thing to note that may be causing this: I'm using aws-nuke to purge any and all resources when I'm done for the day, as I'm just trying to get the cluster up and running and configured the way I like it. I'm doing this for cost reasons so I do not incur charges for something that isn't serving any apps/websites.

CDK CLI Version

2.62.1 (build 8641449)

Framework Version

^2.31.1

Node.js Version

v18.13.0

OS

WSL2

Language

TypeScript

Language Version

^4.0.2

Other information

I'm attempting to deploy on us-east-2.

@beamsies beamsies added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 27, 2023
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Jan 27, 2023
@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed needs-triage This issue or PR still needs to be triaged. labels Jan 27, 2023
@pahud pahud self-assigned this Jan 27, 2023
@beamsies
Author

Update: The error in the description of this bug seems to happen on any aws-cdk eks cluster I try to create (I'm trying a few different tutorial examples).

Also, I'm trying on my linux machine now to see if I get different results. I still got an error but a different one now:

4:39:51 PM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | myclustereksfargatelogging92048F91 Received response status [FAILED] from custom resource. Message returned: TooManyRequestsException: Rate Exceeded. at Object.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27) at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/rest_json.js:49:8) at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20) at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10) at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:686:14) at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10) at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12) at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10 at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9) at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:688:12) (RequestId: 90a9e646-6283-4da2-aa9a-c990a8b62fd7)

@pahud
Contributor

pahud commented Jan 30, 2023

Hi

As this issue is related to the sample repo, the best place to report this issue is https://github.com/aws-samples/cdk-eks-fargate/issues

As this is not relevant to aws-cdk directly, I am closing this issue for now.

@pahud pahud closed this as completed Jan 30, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@pahud pahud removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jan 30, 2023
@jaredtbates

jaredtbates commented Feb 24, 2023

Hey there @pahud, we encountered this same issue with KubernetesPatch and upgrading @aws-cdk/lambda-layer-kubectl-v24 from 2.0.77 to 2.0.107, and CDK from 2.58.1 to 2.66.0. We are not using the sample repo. Any recommendations on how to proceed?

@beamsies did you ever find a solution to this?

@beamsies
Author

@jaredtbates No, I did not find a solution to this issue.

I went back to using terraform for now lol.

I'd love to figure this out but I am not sure how to debug this issue.

@pahud mentioned that it is not directly related to this repository and to open a new issue here:
https://github.com/aws-samples/cdk-eks-fargate/issues

I just forgot to be honest.

@mmayors

mmayors commented Feb 24, 2023

I'm seeing this exact same error when deploying a custom resource backed by a NodeJS lambda (wrapped in a Provider). Oddly this only seems to reproduce in the ap-south-2 region.

In the provider's OnEvent log group I see

INFO    [provider-framework] executing user function arn:aws:lambda:ap-south-2:...:function:... with payload 
{
    "RequestType": "Create",
    ...
    "RequestId": "ce62339d-6830-47db-bd61-3b53da709967"
}

INFO    [provider-framework] CREATE failed, responding with a marker physical resource id so that the subsequent DELETE will be ignored

INFO    [provider-framework] submit response to cloudformation 
{
    "Status": "FAILED",
    "Reason": "StateNotFoundError: State functionActiveV2 not found.\n    at constructor.loadWaiterConfig (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:196:32)\n    at new constructor (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:64:10)\n    at features.constructor.waitFor (/var/runtime/node_modules/aws-sdk/lib/service.js:271:18)\n    at Object.defaultInvokeFunction [as invokeFunction] (/var/task/outbound.js:1:826)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)\n    at async invokeUserFunction (/var/task/framework.js:1:2149)\n    at async onEvent (/var/task/framework.js:1:365)\n    at async Runtime.handler (/var/task/cfn-response.js:1:1543)",
    "StackId": "...",
    "RequestId": "ce62339d-6830-47db-bd61-3b53da709967",
    "PhysicalResourceId": "AWSCDK::CustomResourceProviderFramework::CREATE_FAILED",
    "LogicalResourceId": "..."
}

But there's no corresponding RequestId in the lambda function's log group (the function that the provider says it's invoking). I'll try to create a repro but given the recent comments from others also seeing this error I suggest reopening this issue.
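The check described here (a FAILED response carrying the CREATE_FAILED marker, with no trace of the request in the user function's logs) could be scripted roughly as follows. This is a sketch under assumptions: the field names and the marker string are taken from the log excerpt above, while `summarizeProviderResponse` and `requestSeenInLogs` are hypothetical helpers, and the log lines are assumed to have been exported as plain strings.

```typescript
// Only the fields used below are modeled; they mirror the log excerpt in this comment.
interface ProviderResponse {
  Status: string;
  Reason?: string;
  RequestId?: string;
  PhysicalResourceId?: string;
}

interface ProviderSummary {
  failed: boolean;
  createFailedMarker: boolean;    // true when the marker physical resource id is present
  error: string;                  // first line of Reason, without the stack trace
  requestId: string | undefined;
}

// Marker value as it appears in the provider framework log above.
const CREATE_FAILED_MARKER = "AWSCDK::CustomResourceProviderFramework::CREATE_FAILED";

function summarizeProviderResponse(json: string): ProviderSummary {
  const r = JSON.parse(json) as ProviderResponse;
  const firstLine = (r.Reason ?? "").split("\n")[0] ?? "";
  return {
    failed: r.Status === "FAILED",
    createFailedMarker: r.PhysicalResourceId === CREATE_FAILED_MARKER,
    error: firstLine,
    requestId: r.RequestId,
  };
}

// Hypothetical check: does any exported log line from the user function mention the request?
function requestSeenInLogs(requestId: string, logLines: string[]): boolean {
  return logLines.some((line) => line.indexOf(requestId) !== -1);
}
```

If `createFailedMarker` is true and `requestSeenInLogs` is false for the user function's log group, that matches what is described in this comment: the failure is reported by the provider framework and the user function never logs the request.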

@pahud
Contributor

pahud commented Feb 24, 2023

reopening this issue as it is still relevant.

@pahud pahud reopened this Feb 24, 2023
@pahud
Contributor

pahud commented Feb 24, 2023

Hi @jaredtbates @beamsies @mmayors

Instead of deploying the sample from https://github.com/aws-samples/cdk-eks-fargate, can you share a small sample that I can use to reproduce it in my account? I'd be happy to help investigate.

@pahud pahud added needs-reproduction This issue needs reproduction. investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Feb 24, 2023
@pahud
Contributor

pahud commented Feb 24, 2023

The reason I need a small sample for reproduction is that I suspect this error means the Lambda function is unable to call back to the CloudFormation service during custom resource creation. This usually happens when your Lambda function is attached to VPC subnets that have no egress. But I am not 100% sure, so I need a small sample so I can dive into it.

    "Reason": "StateNotFoundError: State functionActiveV2 not found.\n    at constructor.loadWaiterConfig (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:196:32)\n    at new constructor (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:64:10)\n    at features.constructor.waitFor (/var/runtime/node_modules/aws-sdk/lib/service.js:271:18)\n    at Object.defaultInvokeFunction [as invokeFunction] (/var/task/outbound.js:1:826)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)\n    at async invokeUserFunction (/var/task/framework.js:1:2149)\n    at async onEvent (/var/task/framework.js:1:365)\n    at async Runtime.handler (/var/task/cfn-response.js:1:1543)",

You can try redeploying with cdk deploy -R or cdk deploy --no-rollback. If the deployment fails, go to the Lambda console, check the network configuration of the cluster handler, and check all the VPC subnets associated with this Lambda function. Make sure all subnets have egress access. If any isolated subnet is associated with the Lambda function, the custom resource may fail with errors like this. This is just off the top of my head, and any sample code that reproduces the error would be appreciated.
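The subnet check described above can be sketched in code. Everything here is an assumption for illustration: `Route` and `SubnetInfo` are simplified, hypothetical shapes (not the EC2 API types), and "egress" is approximated as a 0.0.0.0/0 route that points at a NAT gateway or a non-local gateway. VPC endpoints and other routing setups are ignored.

```typescript
// Simplified, hypothetical shapes; fill them from whatever route table data you have exported.
interface Route {
  destinationCidrBlock?: string;
  gatewayId?: string;      // "local" denotes the VPC-local route
  natGatewayId?: string;
}

interface SubnetInfo {
  subnetId: string;
  routes: Route[];
}

// A subnet counts as having egress if it has a default route via a NAT or non-local gateway.
function hasEgress(subnet: SubnetInfo): boolean {
  return subnet.routes.some((r) => {
    if (r.destinationCidrBlock !== "0.0.0.0/0") {
      return false;
    }
    const viaNat = r.natGatewayId !== undefined;
    const viaGateway = r.gatewayId !== undefined && r.gatewayId !== "local";
    return viaNat || viaGateway;
  });
}

// Returns the ids of subnets that look isolated under the approximation above.
function subnetsWithoutEgress(subnets: SubnetInfo[]): string[] {
  return subnets.filter((s) => !hasEgress(s)).map((s) => s.subnetId);
}
```

Any subnet ID returned by `subnetsWithoutEgress` that is also attached to the handler's Lambda function would be a candidate for the isolated-subnet problem described above.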

@zelu-zuehlke

zelu-zuehlke commented Feb 24, 2023

I had the same error when I used cdk8s-plus-24 kplus deployment. When I reverted my deployment to k8s.io/v1 KubeDeployment, the error disappeared.
So I suppose that problem was somewhere in kplus part or in the inconsistency that was produced because of kplus.

@jaredtbates

I had the same error when I used cdk8s-plus-24 kplus deployment. When I reverted my deployment to k8s.io/v1 KubeDeployment, the error disappeared.

So I suppose that problem was somewhere in kplus part or in the inconsistency that was produced because of kplus.

We aren't using cdk8s or cdk8s+ at this point, just the built in CDK constructs.

@pahud I can try to get you a reproduction next week sometime if I get time. To follow up, if I let the update run its course, cloudformation seems to time out after an hour or so and then just keeps going and succeeds. I guess I hadn't waited long enough? But I still think this error is causing the trouble.

@pahud
Contributor

pahud commented Feb 24, 2023

@jaredtbates It depends on your deployment size. If you just deploy an EKS 1.24 cluster with a default managed nodegroup in an existing VPC, the deployment should complete in about 20 minutes.

AFAIK there are some recent known EKS issues:

  1. If you are deploying an EKS 1.24 cluster, make sure you specify kubectlLayer with the 1.24 version layer assets.
  2. If you include isolated subnets in the vpcSubnets property, make sure NOT to enable placeClusterHandlerInVpc, because your cluster handler could be placed in the isolated VPC subnets and be unable to call back to CloudFormation.

Anyway, feel free to provide a minimal working sample here that I can reproduce in my account. I will need to know how you configure your EKS cluster so that known issues like these can be ruled out. cc @mmayors
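The two known issues listed above can be written down as a small pre-deployment check. This is a sketch only: `ClusterSettings` is a hypothetical plain object that you would fill in by hand from your own stack, not a CDK type, and the property names merely echo kubectlLayer, vpcSubnets, and placeClusterHandlerInVpc from the list.

```typescript
type SubnetKind = "public" | "private-with-egress" | "isolated";

// Hypothetical summary of a cluster definition, filled in manually.
interface ClusterSettings {
  kubernetesVersion: string;                  // e.g. "1.24"
  kubectlLayerVersion: string | undefined;    // version of the kubectl layer supplied, if any
  placeClusterHandlerInVpc: boolean;
  subnetKinds: SubnetKind[];                  // kinds of subnets selected via vpcSubnets
}

// Returns one warning per known issue from the list above that the settings appear to hit.
function checkKnownIssues(s: ClusterSettings): string[] {
  const warnings: string[] = [];
  if (s.kubernetesVersion === "1.24" && s.kubectlLayerVersion !== "1.24") {
    warnings.push("EKS 1.24 cluster without a 1.24 kubectlLayer");
  }
  const hasIsolated = s.subnetKinds.some((k) => k === "isolated");
  if (s.placeClusterHandlerInVpc && hasIsolated) {
    warnings.push("placeClusterHandlerInVpc enabled with isolated subnets");
  }
  return warnings;
}
```

An empty result only means that neither of these two issues applies; it says nothing about other causes of the error.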

@mmayors

mmayors commented Feb 27, 2023

So I attached a minimal repro, but... it only reproduces in a specific account and only in ap-south-2 ¯_(ツ)_/¯
The stack contains:

If I deploy to ap-south-2 using a particular account, CloudFormation fails when creating the custom resources. It successfully creates maybe 3 out of 10 custom resources, then the rest change to CREATE_FAILED with the error:
Received response status [FAILED] from custom resource. Message returned: StateNotFoundError: State functionActiveV2 not found. at constructor.loadWaiterConfig (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:196:32) at new constructor (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:64:10) at features.constructor.waitFor (/var/runtime/node_modules/aws-sdk/lib/service.js:271:18) at Object.defaultInvokeFunction [as invokeFunction] (/var/task/outbound.js:1:826) at processTicksAndRejections (internal/process/task_queues.js:95:5) at async invokeUserFunction (/var/task/framework.js:1:2149) at async onEvent (/var/task/framework.js:1:365) at async Runtime.handler (/var/task/cfn-response.js:1:1543) (RequestId: 2227040a-bedb-4cfe-bb68-2576b1ef4054)

If I deploy to us-east-1, it succeeds. And if I deploy to ap-south-2 using any other account opted in to the region, it also succeeds.

I know that's not a lot to go on but it's what I have. Open to any suggestions to help narrow this down.

statenotfound-repro.zip

(Not sure if relevant to the stacktrace) It doesn't matter if I use the NodeJS 16 or 18 lambda runtime. But AFAIK JavaScript SDK V3 should come preinstalled with Node18, not V2.
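For readers unfamiliar with this error: the stack trace shows it being thrown from loadWaiterConfig, reached through waitFor, when the state name functionActiveV2 is looked up. The toy model below illustrates that failure mode only. It is NOT the AWS SDK's implementation; the tables, their contents, and the numeric values are made up. One possible reading, which this thread does not confirm, is that the SDK copy doing the lookup simply has no entry under that name.

```typescript
// Toy model of a waiter lookup; names mirror the stack trace, behavior is assumed.
interface WaiterConfig {
  delaySeconds: number;
  maxAttempts: number;
}

function makeStateNotFoundError(state: string): Error {
  const err = new Error("State " + state + " not found.");
  err.name = "StateNotFoundError";
  return err;
}

// Looks up a waiter state by name and throws if the table has no such entry.
function loadWaiterConfig(known: { [state: string]: WaiterConfig }, state: string): WaiterConfig {
  if (!Object.prototype.hasOwnProperty.call(known, state)) {
    throw makeStateNotFoundError(state);
  }
  return known[state] as WaiterConfig;
}

// Two made-up tables: one that lacks the state name and one that defines it.
const tableWithoutState: { [state: string]: WaiterConfig } = {
  someOtherState: { delaySeconds: 1, maxAttempts: 10 },
};
const tableWithState: { [state: string]: WaiterConfig } = {
  someOtherState: { delaySeconds: 1, maxAttempts: 10 },
  functionActiveV2: { delaySeconds: 1, maxAttempts: 10 },
};
```

In this model, the same lookup succeeds against `tableWithState` and throws against `tableWithoutState`, which would be consistent with a failure that depends on where the code runs rather than on the stack definition.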

@pahud
Contributor

pahud commented Feb 28, 2023

Hi @mmayors, your case is related to #24358 rather than to EKS.

If anyone has code that reproduces this error with an EKS cluster, please share it in the comments. Thanks.

@jaredtbates

I'm not going to be able to get a reproduction since we already updated our clusters and don't have time. Sorry about that. It's likely an edge case specific to our versions or environment then.

@pahud
Contributor

pahud commented Feb 28, 2023

@jaredtbates No problem. If anyone is able to get a reproduction, please share it in the comments. This issue will auto-close in a few days if there are no further comments. Feel free to reopen if necessary.

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Feb 28, 2023
@github-actions

github-actions bot commented Mar 3, 2023

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Mar 3, 2023
@github-actions github-actions bot closed this as completed Mar 8, 2023
@strongmindsnan

strongmindsnan commented Mar 10, 2023

@pahud I have the same issue on a new project, for now just trying to deploy a simple cluster with an instance of Kafka and its associated Zookeeper helper to EKS using cdk, kubectl 1.24 and no customized networking, in the eu-west-1 region.

This cluster has deployed correctly recently, but I had to pull it down because the instance type I specified was too small for the number of pods, and now each deployment attempt fails with the same error:

Stack Deployments Failed: Error: The stack named cluster-stack failed to deploy: CREATE_FAILED (The following resource(s) failed to create: [pubsubclustermanifestzookeepersvc4A739AD8]. ): Received response status [FAILED] from custom resource. Message returned: StateNotFoundError: State functionActiveV2 not found. at constructor.loadWaiterConfig (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:196:32) at new constructor (/var/runtime/node_modules/aws-sdk/lib/resource_waiter.js:64:10) at features.constructor.waitFor (/var/runtime/node_modules/aws-sdk/lib/service.js:271:18) at Object.defaultInvokeFunction [as invokeFunction] (/var/task/outbound.js:1:826) at processTicksAndRejections (internal/process/task_queues.js:95:5) at async invokeUserFunction (/var/task/framework.js:1:2149) at async onEvent (/var/task/framework.js:1:365) at async Runtime.handler (/var/task/cfn-response.js:1:1543) (RequestId: fb6f2409-3698-4743-bdce-07c31e17ccdb)

The exact resource that it fails at (here “pubsubclustermanifestzookeepersvc4A739AD8”) differs between attempts, and can be any of the services, the AWS auth object, or the pods.

Here is a reproduction repo with just the stack I am trying to deploy: https://github.com/strongmindsnan/CdkTest
Note that the "YourRoleNameHere" on line 174 in cdktest-stack.ts needs to be filled out with the name of the role you’re using.

@pahud
Contributor

pahud commented Mar 23, 2023

@strongmindsnan This issue may not be related to EKS but rather to CFN and custom resources. Please watch #24358 for updates.

@pahud pahud removed their assignment Aug 12, 2024
@pahud pahud removed investigating This issue is being investigated and/or work is in progress to resolve the issue. needs-reproduction This issue needs reproduction. labels Aug 12, 2024
6 participants