vpc-cni 1.10.0 upgrade fails #1738
@fitchtech Did you apply the correct manifest? Or did you update the image tag in the 1.9 manifest?
@achevuru I did it as an upgrade through the EKS cluster's VPC-CNI add-on.
@achevuru I do have cert-manager with the aws-pca plugin deployed to this EKS cluster. Based on the error message, it looks like it's now trying to hit the cert-manager endpoint for some reason, though that URL doesn't look right because of the 172.20.0.1 address; that'd be the control plane, I believe. The cert-manager release I have deployed to the cert-manager namespace has the webhook service.
@fitchtech Yeah, there is a known issue with the managed add-on CNI manifest for v1.10.0. Will update here once the issue is addressed.
@achevuru It's strange that nothing is mentioned in the release notes regarding cert-manager or known issues. It doesn't seem to be an issue with the add-on when using 1.9.1 or 1.9.3; those work just fine. You should be aware that the console now shows a big banner saying there's an update to this add-on. Applying that update would then break anyone with cert-manager deployed, which is a very commonly used, standard Kubernetes service. During the update, this failure causes other Kubernetes services to break, since the CNI is a dependency they need running, and pods are stuck in an unready state. The EKS add-on doesn't time out the failed update for an hour, so until that timeout the cluster is broken. This is a pretty big issue; if this is a known issue, I'd recommend you not release 1.10.0 into the wild. It's obviously not ready.
@fitchtech The issue is not tied to cert-manager or the v1.10 image itself. It is due to an issue in the managed add-on manifest used for v1.10.0. The managed add-on change was already in the process of being rolled back when you upgraded your cluster. I believe the rollback is already complete. It'll be re-enabled once the issue is addressed.
@achevuru Good to know. Looks like the rollback has been completed, as I'm no longer seeing that add-on version listed.
Not sure if I should open a new issue, but I tried to create a new 1.21 cluster, which tried to install the latest CNI plugin (1.10), and the nodes come up with
This was working fine when I created the cluster yesterday, so I am sure this is related to the new CNI 1.10 version. Edit: I downgraded the CNI version to
@singhswg Can you confirm if the cluster was upgraded to CNI v1.10 via managed add-ons?
@singhswg I just tried creating a 1.21 cluster in
I created a new 1.21 cluster in the us-west-2 region using eksctl, and it came up fine.
Could you please share more details on the issue you encountered?
I used the aws-eks Terraform module v17.22.0 to spin up this cluster. I didn't use any managed add-ons, so whatever comes by default with the Terraform module was used. This was a new cluster creation; I didn't specifically upgrade the CNI from 1.9.3 to 1.10.1. Yesterday morning the creation worked without issues, and this morning I encountered the CNI issue shown in the kubelet logs.
@singhswg Thanks. Could you open an AWS support ticket if you have access to that?
Sure, I'll try to get more data when I recreate the cluster again or just upgrade the CNI to 1.10.0 again. Also, after some reading, it looks like they don't support add-ons in the EKS Terraform module yet. A case seems to already be open: terraform-aws-modules/terraform-aws-eks#1443
@singhswg Are you using a custom AMI?
@achevuru I am using the AWS-provided EKS-optimized AMI.
I tried the example in https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/examples/complete to create a new EKS 1.21 cluster in the "us-west-2" region. Cluster creation was successful and the nodes became ready. I verified that the CNI image was set to 1.10.0 in this cluster.
I am doing the same thing too, but I'll try to replicate again at some point today or tomorrow. Just FYI, I used module version v17.22.0 and AMI ID ami-05de3fef5bb9d43a
I tested again and encountered the same issue.
Tried downgrading the CNI to 1.9.3, and that fixed the issue. This is the same code that had been working for weeks until today, so I am sure nothing really changed on my end. Let me know if I should try something else here.
@singhswg Can you check if you have IMDSv1 access disabled? https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#imds
I have
@singhswg When you create an EKS cluster with Terraform, it does not install any EKS add-ons. After the EKS cluster is created, and before creating the EKS node groups, apply an aws_eks_addon resource with the specific version pinned. Side note: the CoreDNS add-on must be created after at least one node group exists, whereas the VPC-CNI and kube-proxy add-ons can be applied before the node groups are created.
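A minimal sketch of pinning the add-on version as described above. The cluster reference and the version string here are placeholders, not values from this thread; pick a version you have actually tested:

```hcl
# Hypothetical example: pin the VPC-CNI managed add-on to a known-good version
# so a cluster rebuild doesn't silently pick up the latest release.
resource "aws_eks_addon" "vpc_cni" {
  cluster_name      = aws_eks_cluster.this.name # placeholder cluster reference
  addon_name        = "vpc-cni"
  addon_version     = "v1.9.3-eksbuild.1"       # placeholder; pin a tested version
  resolve_conflicts = "OVERWRITE"
}
```

Pinning `addon_version` is the design point here: without it, a fresh cluster can pick up a newly released add-on version that the same code did not get yesterday, which matches the behavior reported earlier in this thread.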
@achevuru I tested the
Thank you guys.
v1.10.1 is now available via Managed add-ons as well. Closing this issue. |
On an EKS 1.21 cluster with cert-manager deployed, when upgrading the VPC-CNI add-on from 1.9.1 to 1.10.0, new pods in the aws-node daemonset get the following error and fail to start:
```
{"level":"info","ts":"2021-11-09T21:01:59.568Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-11-09T21:01:59.569Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2021-11-09T21:01:59.583Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-11-09T21:01:59.584Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I1109 21:02:00.660940      12 request.go:621] Throttling request took 1.0416662s, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x39 pc=0x56248cd53508]

goroutine 580 [running]:
github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd.(*IPAMContext).StartNodeIPPoolManager(0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go:633 +0x28
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:64 +0x32c
```
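The `StartNodeIPPoolManager(0x0)` frame in the trace above shows the method being invoked on a nil `*IPAMContext`: initialization returned no context, but the caller proceeded to use it. A minimal sketch of that failure mode, with hypothetical type and field names that make no assumption about the real ipamd internals:

```go
package main

import "fmt"

// IPAMContext stands in for the real ipamd context; any field access through
// a nil receiver is what produces the SIGSEGV seen in the log.
type IPAMContext struct {
	warmPool []string // hypothetical field for illustration
}

// StartNodeIPPoolManager dereferences the receiver, so calling it on a nil
// *IPAMContext panics with "invalid memory address or nil pointer dereference".
func (c *IPAMContext) StartNodeIPPoolManager() {
	fmt.Println("warm pool size:", len(c.warmPool)) // nil dereference when c == nil
}

func main() {
	defer func() {
		// Recover only so the sketch can print the same error the daemon logs.
		if r := recover(); r != nil {
			fmt.Println("panic:", r)
		}
	}()
	var c *IPAMContext // initialization failed, context left nil
	c.StartNodeIPPoolManager()
}
```

In Go, calling a method on a nil pointer receiver is legal; the crash happens only when the method body dereferences the receiver, which is why the panic surfaces inside `StartNodeIPPoolManager` rather than at the call site.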