
vpc-cni 1.10.0 upgrade fails #1738

Closed
fitchtech opened this issue Nov 9, 2021 · 25 comments


On an EKS 1.21 cluster with cert-manager deployed, upgrading the VPC-CNI add-on from 1.9.1 to 1.10.0 causes the new pods in the aws-node daemonset to fail to start with the following error:

```
{"level":"info","ts":"2021-11-09T21:01:59.568Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-11-09T21:01:59.569Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2021-11-09T21:01:59.583Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-11-09T21:01:59.584Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I1109 21:02:00.660940 12 request.go:621] Throttling request took 1.0416662s, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x39 pc=0x56248cd53508]

goroutine 580 [running]:
github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd.(*IPAMContext).StartNodeIPPoolManager(0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go:633 +0x28
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:64 +0x32c
```
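
For reference, a minimal AWS CLI sketch to confirm which managed add-on build a cluster is actually running (the cluster name is a placeholder):

```sh
# Show the currently installed vpc-cni managed add-on version and status
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --query 'addon.{version: addonVersion, status: status}'

# List the add-on versions EKS offers for this Kubernetes version
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.21
```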

achevuru (Contributor) commented Nov 9, 2021

@fitchtech Did you apply the correct manifest? Or did you update the image tag in the 1.9 manifest?

fitchtech (Author) commented Nov 9, 2021

@achevuru I did it as an upgrade through the EKS cluster's VPC-CNI add-on.

fitchtech (Author) commented:

@achevuru
Once the update timed out after an hour and failed, I was able to update to 1.9.3 without issue. So it appears to be a problem only when updating the vpc-cni add-on to 1.10.0.
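
In case anyone else gets stuck mid-upgrade, rolling the managed add-on back should look roughly like this (a sketch; the cluster name is a placeholder):

```sh
# Roll the managed add-on back to a known-good version; OVERWRITE
# resolves conflicts with whatever manifest is currently applied
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.9.3-eksbuild.1 \
  --resolve-conflicts OVERWRITE

# Then watch the daemonset pods roll
kubectl -n kube-system rollout status daemonset/aws-node
```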

I do have cert-manager with the aws-pca plugin deployed to this EKS cluster. Based on the error message, it looks like it's now trying to hit the cert-manager endpoint for some reason, though that URL doesn't look right because of the 172.20.0.1 address; I believe that's the control plane. The cert-manager release I have deployed to the cert-manager namespace has the webhook service.
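
For what it's worth, the throttled GET in the panic log is client-side API discovery walking every registered API group, cert-manager's included, and 172.20.0.1 is the in-cluster `kubernetes` Service IP rather than cert-manager. A quick sketch to check both:

```sh
# Are cert-manager's registered API services healthy?
kubectl get apiservices | grep cert-manager

# 172.20.0.1 should be the ClusterIP of the kubernetes Service
kubectl get svc kubernetes -n default
```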

achevuru (Contributor) commented Nov 9, 2021

@fitchtech Yeah, there is a known issue with the Managed add-on CNI manifest for v1.10.0. Will update here once the issue is addressed.

fitchtech (Author) commented:

@achevuru It's strange that nothing is mentioned in the release notes regarding cert-manager or known issues. It doesn't seem to be an issue with the add-on when using 1.9.1 or 1.9.3, however; those work just fine. You should be aware that the console now puts up a big banner saying there's an update to this add-on. Doing that update would break anyone with cert-manager deployed, which is a very commonly used standard Kubernetes service. During the update, this failure breaks other Kubernetes services too, since they depend on the CNI running, and pods are stuck in an unready state. The EKS add-on takes an hour to time out and report that the update has failed, and until it does, the cluster is broken.

This is a pretty big issue. If this is a known issue, I'd recommend you not release 1.10.0 into the wild; it's obviously not ready.

achevuru (Contributor) commented Nov 9, 2021

@fitchtech The issue is not tied to cert-manager or the v1.10 image itself. It is due to an issue in the Managed add-on manifest used for v1.10.0. The Managed add-on change was already being rolled back when you upgraded your cluster. I believe the rollback is already complete. It'll be re-enabled once the issue is addressed.

fitchtech (Author) commented:

@achevuru Good to know. It looks like the rollback has completed, as I'm no longer seeing that add-on version listed.

singhswg commented Nov 10, 2021

Not sure if I should open a new issue, but I tried to create a new 1.21 cluster, which installed the latest CNI plugin, 1.10. The nodes come up with NotReady status and the kubelet complains about the CNI:

```
Nov 10 16:20:46 ip-10-0-51-254.us-west-2.compute.internal kubelet[3187]: I1110 16:20:46.978882    3187 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
Nov 10 16:20:47 ip-10-0-51-254.us-west-2.compute.internal kubelet[3187]: E1110 16:20:47.842699   3187 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:Netw...initialized"
```

```
$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.10.0-eksbuild.1
amazon-k8s-cni:v1.10.0-eksbuild.1
```

This was working fine when I created the cluster yesterday, so I am sure this is related to the new CNI 1.10 version.
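
In case it helps anyone else debug this, a quick sketch of the node-level checks for the "no networks found" error:

```sh
# The kubelet error means no CNI config was written; check whether the
# init container ever populated it (run on the affected node)
ls /etc/cni/net.d

# Did the aws-node pods become ready, and what did they log?
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-node --tail=50
```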

Edit: I downgraded the CNI version to 1.9.3 and the CNI was configured okay; the node is in Ready state again.

```
$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.9.3
amazon-k8s-cni:v1.9.3
```
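
For reference, the in-place downgrade can be done by pointing the daemonset back at the v1.9.3 images; a sketch (container names assumed from the upstream aws-node manifest):

```sh
# Repoint both the init container and the main container at v1.9.3
kubectl -n kube-system set image daemonset/aws-node \
  aws-vpc-cni-init=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.9.3 \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.9.3
```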

jsidhu commented Nov 10, 2021

@singhswg Can you confirm whether the cluster was upgraded to CNI v1.10 via Managed Add-ons?

achevuru (Contributor) commented Nov 10, 2021

@singhswg I just tried creating a 1.21 cluster in us-west-2 and it came up fine (with v1.10). Did you upgrade an existing cluster, or is this a newly created cluster?

vikasmb (Contributor) commented Nov 10, 2021

@singhswg

I created a new 1.21 cluster in us-west-2 region using eksctl and it came up fine.

```
2021-11-10 10:34:38 [✔]  EKS cluster "test-pdx-latest-cluster-110" in "us-west-2" region is ready
$ kubectl get nodes
NAME                                STATUS   ROLES    AGE     VERSION
ip-...us-west-2.compute.internal   Ready    <none>   4m11s   v1.21.4-eks-033ce7e
ip-...us-west-2.compute.internal   Ready    <none>   4m3s    v1.21.4-eks-033ce7e
ip-...us-west-2.compute.internal   Ready    <none>   4m11s   v1.21.4-eks-033ce7e
$ kubectl describe ds aws-node -n kube-system | grep Image
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.10.0-eksbuild.1
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.10.0-eksbuild.1
```

Could you please share more details on the issue you encountered?

singhswg commented:

I used the aws-eks Terraform module v17.22.0 to spin up this cluster. I didn't use any managed add-ons, so whatever comes by default with the Terraform module was used.

This was a new cluster creation; I didn't upgrade the CNI specifically from 1.9.3 to 1.10.0. Yesterday morning the creation worked without issues, and this morning I encountered the CNI issue shown in the kubelet logs.

vikasmb (Contributor) commented Nov 10, 2021

@singhswg Thanks. Could you open an AWS support ticket if you have access to that?
Otherwise, can you run the commands in https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#troubleshooting-cniipamd-at-node-level and share the results?
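
For completeness, the node-level steps in that doc boil down to roughly this (a sketch; paths are the ones the troubleshooting guide documents):

```sh
# Collect CNI/ipamd logs into a support bundle (run on the node)
sudo bash /opt/cni/bin/aws-cni-support.sh

# Query ipamd's local introspection endpoint for ENI/IP state
curl http://localhost:61679/v1/enis | python -m json.tool
```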

singhswg commented:

Sure, I'll try to get more data when I recreate the cluster again, or just upgrade the CNI to 1.10.0 again. Also, after some reading, it looks like they don't support add-ons in the EKS Terraform module yet. An issue is already open: terraform-aws-modules/terraform-aws-eks#1443

achevuru (Contributor) commented:

@singhswg Are you using a custom AMI?

singhswg commented:

@achevuru I am using the AWS-provided EKS-optimized AMI.

vikasmb (Contributor) commented Nov 10, 2021

I tried the example in https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/examples/complete to create a new EKS 1.21 cluster in the us-west-2 region. Cluster creation was successful and the nodes became Ready. I verified that the CNI image was set to 1.10.0 in this cluster.

singhswg commented:

I am doing the same thing, but I'll try to replicate again at some point today or tomorrow. Just FYI, I used module version v17.22.0 and AMI ID ami-05de3fef5bb9d43a

singhswg commented Nov 10, 2021

I tested again and encountered the same issue.

Terraform version: 1.0.9
AWS provider: 3.63.0

```
$ kubectl get nodes -o wide -w
NAME                                      STATUS     ROLES    AGE   VERSION               INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-25-0.us-west-2.compute.internal   NotReady   <none>   29s   v1.21.4-eks-033ce7e   10.0.25.0     <none>        Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7
ip-10-0-25-0.us-west-2.compute.internal   NotReady   <none>   30s   v1.21.4-eks-033ce7e   10.0.25.0     <none>        Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7

$ kubectl describe ds aws-node -n kube-system | grep Image
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.10.0-eksbuild.1
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.10.0-eksbuild.1
```

Downgrading the CNI to 1.9.3 fixed the issue. This is the same code that had been working for weeks until today, so I am sure nothing really changed on my end. Let me know if I should try something else here.

achevuru (Contributor) commented:

@singhswg Do you have IMDSv2 enabled on your worker nodes?

singhswg commented:

I have IMDSv2 enabled on the cluster nodes.
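
In case it's relevant, a quick sketch to inspect (and, if needed, relax) a node's IMDS settings; the instance ID is a placeholder:

```sh
# Show the node's IMDS configuration (tokens required? hop limit?)
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'

# If tokens are required and the hop limit is 1, it can be raised
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2
```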

fitchtech (Author) commented:

@singhswg When you create an EKS cluster with Terraform, it does not install any EKS add-ons. After the EKS cluster is created, and before creating the EKS node groups, apply an aws_eks_addon resource with the specific version pinned.

Side note: the CoreDNS add-on must be created after at least one node group exists, whereas the VPC-CNI and kube-proxy add-ons can be applied before any node group is created (see the CLI sketch after the Terraform example below).

```hcl
variable "cluster_id" {
  type        = string
  description = "The name of the EKS cluster"
}

variable "cluster_oidc_issuer" {
  type        = string
  description = "Required for VPC-CNI addon. The EKS Cluster OIDC Issuer."
}

variable "vpc_cni_version" {
  type        = string
  description = "EKS Addon version tag"
  default     = "v1.9.3-eksbuild.1"
}

# IRSA role assumed by the aws-node service account in kube-system
module "irsa_vpc_cni" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "4.6.0"
  create_role                   = true
  role_name                     = "${var.cluster_id}-vpc-cni"
  provider_url                  = replace(var.cluster_oidc_issuer, "https://", "")
  role_policy_arns              = [aws_iam_policy.vpc_cni.arn, "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"]
  oidc_fully_qualified_subjects = ["system:serviceaccount:kube-system:aws-node"]
}

resource "aws_iam_policy" "vpc_cni" {
  name        = "${var.cluster_id}-vpc-cni"
  description = "EKS cluster addon for VPC CNI ${var.cluster_id}"
  policy      = data.aws_iam_policy_document.vpc_cni.json
  lifecycle {
    ignore_changes = [name, description]
  }
}

data "aws_iam_policy_document" "vpc_cni" {
  statement {
    actions = [
      "sts:AssumeRoleWithWebIdentity",
      "sts:AssumeRole"
    ]
    effect = "Allow"

    condition {
      test     = "StringEquals"
      variable = "${replace(var.cluster_oidc_issuer, "https://", "")}:sub"
      values   = ["system:serviceaccount:kube-system:aws-node"]
    }
    resources = ["*"]
  }
}

# Managed add-on pinned to a known-good version; OVERWRITE resolves
# conflicts with the self-managed manifest the cluster starts with
resource "aws_eks_addon" "cni" {
  cluster_name             = var.cluster_id
  addon_name               = "vpc-cni"
  addon_version            = var.vpc_cni_version
  resolve_conflicts        = "OVERWRITE"
  service_account_role_arn = module.irsa_vpc_cni.iam_role_arn
  lifecycle {
    ignore_changes = [cluster_name, addon_name]
  }
}
```
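
And since CoreDNS has to wait for schedulable nodes (per the side note above), the equivalent ordering expressed with the AWS CLI would look roughly like this (cluster name, node group name, and add-on version are placeholders):

```sh
# Wait for the node group before creating the CoreDNS add-on
aws eks wait nodegroup-active \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes

aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name coredns \
  --addon-version v1.8.4-eksbuild.1 \
  --resolve-conflicts OVERWRITE
```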

singhswg commented:

@achevuru I tested the 1.10.1 release and that seems to have fixed my issue. For now, I am using the 1.9.3 version, pinned via aws_eks_addon as @fitchtech suggested.

Thank you guys.

achevuru self-assigned this Nov 16, 2021
achevuru (Contributor) commented:

v1.10.1 is now available via Managed add-ons as well. Closing this issue.

github-actions commented:

⚠️ COMMENT VISIBILITY WARNING ⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
