
vpc-cni 1.10.0 upgrade fails #1738

Closed
fitchtech opened this issue Nov 9, 2021 · 25 comments


On an EKS 1.21 cluster with cert-manager deployed, upgrading the VPC-CNI add-on from 1.9.1 to 1.10.0 causes the new pods in the aws-node daemonset to fail to start with the following error:

```
{"level":"info","ts":"2021-11-09T21:01:59.568Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-11-09T21:01:59.569Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2021-11-09T21:01:59.583Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-11-09T21:01:59.584Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I1109 21:02:00.660940 12 request.go:621] Throttling request took 1.0416662s, request: GET:https://172.20.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x39 pc=0x56248cd53508]

goroutine 580 [running]:
github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd.(*IPAMContext).StartNodeIPPoolManager(0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go:633 +0x28
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:64 +0x32c
```
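
For reference, a minimal AWS CLI sketch to confirm which managed add-on build a cluster is actually running (the cluster name is a placeholder):

```sh
# Show the currently installed vpc-cni managed add-on version and status
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --query 'addon.{version: addonVersion, status: status}'

# List the add-on versions EKS offers for this Kubernetes version
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.21
```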

achevuru (Contributor) commented Nov 9, 2021

@fitchtech Did you apply the correct manifest? Or did you update the image tag in the 1.9 manifest?

fitchtech (Author) commented Nov 9, 2021

@achevuru I did it as an upgrade through the EKS cluster's VPC-CNI add-on.

fitchtech (Author) commented:

@achevuru
Once the update timed out after an hour and failed, I was able to update to 1.9.3 without issue. So it appears to be a problem only when updating the vpc-cni add-on to 1.10.0.
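
In case anyone else gets stuck mid-upgrade, rolling the managed add-on back should look roughly like this (a sketch; the cluster name is a placeholder):

```sh
# Roll the managed add-on back to a known-good version; OVERWRITE
# resolves conflicts with whatever manifest is currently applied
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.9.3-eksbuild.1 \
  --resolve-conflicts OVERWRITE

# Then watch the daemonset pods roll
kubectl -n kube-system rollout status daemonset/aws-node
```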

I do have cert-manager with the aws-pca plugin deployed to this EKS cluster. Based on the error message, it looks like it's now trying to hit the cert-manager endpoint for some reason, though that URL doesn't look right because of the 172.20.0.1 address; I believe that's the control plane. The cert-manager release I have deployed to the cert-manager namespace has the webhook service.
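
For what it's worth, the throttled GET in the panic log is client-side API discovery walking every registered API group, cert-manager's included, and 172.20.0.1 is the in-cluster `kubernetes` Service IP rather than cert-manager. A quick sketch to check both:

```sh
# Are cert-manager's registered API services healthy?
kubectl get apiservices | grep cert-manager

# 172.20.0.1 should be the ClusterIP of the kubernetes Service
kubectl get svc kubernetes -n default
```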

achevuru (Contributor) commented Nov 9, 2021

@fitchtech Yeah, there is a known issue with the Managed add-on CNI manifest for v1.10.0. Will update here once the issue is addressed.

fitchtech (Author) commented:

@achevuru It's strange that nothing is mentioned in the release notes regarding cert-manager or known issues. It doesn't seem to be an issue with the add-on when using 1.9.1 or 1.9.3, however; those work just fine. You should be aware that the console now puts up a big banner saying there's an update to this add-on. Doing that update would break anyone with cert-manager deployed, which is a very commonly used standard Kubernetes service. During the update, this failure breaks other Kubernetes services too, since they depend on the CNI running, and pods are stuck in an unready state. The EKS add-on takes an hour to time out and report that the update has failed, and until it does, the cluster is broken.

This is a pretty big issue. If this is a known issue, I'd recommend you not release 1.10.0 into the wild; it's obviously not ready.

achevuru (Contributor) commented Nov 9, 2021

@fitchtech The issue is not tied to cert-manager or the v1.10 image itself. It is due to an issue in the Managed add-on manifest used for v1.10.0. The Managed add-on change was already being rolled back when you upgraded your cluster. I believe the rollback is already complete. It'll be re-enabled once the issue is addressed.

fitchtech (Author) commented:

@achevuru Good to know. It looks like the rollback has completed, as I'm no longer seeing that add-on version listed.

singhswg commented Nov 10, 2021

Not sure if I should open a new issue, but I tried to create a new 1.21 cluster, which installed the latest CNI plugin, 1.10. The nodes come up with NotReady status and the kubelet complains about the CNI:

```
Nov 10 16:20:46 ip-10-0-51-254.us-west-2.compute.internal kubelet[3187]: I1110 16:20:46.978882    3187 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
Nov 10 16:20:47 ip-10-0-51-254.us-west-2.compute.internal kubelet[3187]: E1110 16:20:47.842699   3187 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:Netw...initialized"
```

```
$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.10.0-eksbuild.1
amazon-k8s-cni:v1.10.0-eksbuild.1
```

This was working fine when I created the cluster yesterday, so I am sure this is related to the new CNI 1.10 version.
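
In case it helps anyone else debug this, a quick sketch of the node-level checks for the "no networks found" error:

```sh
# The kubelet error means no CNI config was written; check whether the
# init container ever populated it (run on the affected node)
ls /etc/cni/net.d

# Did the aws-node pods become ready, and what did they log?
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-node --tail=50
```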

Edit: I downgraded the CNI version to 1.9.3 and the CNI was configured okay; the node is in Ready state again.

```
$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.9.3
amazon-k8s-cni:v1.9.3
```
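
For reference, the in-place downgrade can be done by pointing the daemonset back at the v1.9.3 images; a sketch (container names assumed from the upstream aws-node manifest):

```sh
# Repoint both the init container and the main container at v1.9.3
kubectl -n kube-system set image daemonset/aws-node \
  aws-vpc-cni-init=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.9.3 \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.9.3
```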

jsidhu commented Nov 10, 2021

@singhswg Can you confirm whether the cluster was upgraded to CNI v1.10 via Managed Add-ons?

achevuru (Contributor) commented Nov 10, 2021

@singhswg I just tried creating a 1.21 cluster in us-west-2 and it came up fine (with v1.10). Did you upgrade an existing cluster, or is this a newly created cluster?

vikasmb (Contributor) commented Nov 10, 2021

@singhswg

I created a new 1.21 cluster in us-west-2 region using eksctl and it came up fine.

```
2021-11-10 10:34:38 [✔]  EKS cluster "test-pdx-latest-cluster-110" in "us-west-2" region is ready
$ kubectl get nodes
NAME                                STATUS   ROLES    AGE     VERSION
ip-...us-west-2.compute.internal   Ready    <none>   4m11s   v1.21.4-eks-033ce7e
ip-...us-west-2.compute.internal   Ready    <none>   4m3s    v1.21.4-eks-033ce7e
ip-...us-west-2.compute.internal   Ready    <none>   4m11s   v1.21.4-eks-033ce7e
$ kubectl describe ds aws-node -n kube-system | grep Image
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.10.0-eksbuild.1
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.10.0-eksbuild.1
```

Could you please share more details on the issue you encountered?

singhswg commented:

I used the aws-eks Terraform module v17.22.0 to spin up this cluster. I didn't use any managed add-ons, so whatever comes by default with the Terraform module was used.

This was a new cluster creation; I didn't upgrade the CNI specifically from 1.9.3 to 1.10.0. Yesterday morning the creation worked without issues, and this morning I encountered the CNI issue shown in the kubelet logs.

vikasmb (Contributor) commented Nov 10, 2021

@singhswg Thanks. Could you open an AWS support ticket if you have access to that?
Otherwise, can you run the commands in https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#troubleshooting-cniipamd-at-node-level and share the results?
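
For completeness, the node-level steps in that doc boil down to roughly this (a sketch; paths are the ones the troubleshooting guide documents):

```sh
# Collect CNI/ipamd logs into a support bundle (run on the node)
sudo bash /opt/cni/bin/aws-cni-support.sh

# Query ipamd's local introspection endpoint for ENI/IP state
curl http://localhost:61679/v1/enis | python -m json.tool
```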

singhswg commented:

Sure, I'll try to get more data when I recreate the cluster again, or just upgrade the CNI to 1.10.0 again. Also, after some reading, it looks like they don't support add-ons in the EKS Terraform module yet. An issue is already open: terraform-aws-modules/terraform-aws-eks#1443

achevuru (Contributor) commented:

@singhswg Are you using a custom AMI?

singhswg commented:

@achevuru I am using the AWS-provided EKS-optimized AMI.

vikasmb (Contributor) commented Nov 10, 2021

I tried the example in https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/examples/complete to create a new EKS 1.21 cluster in the us-west-2 region. Cluster creation was successful and the nodes became Ready. I verified that the CNI image was set to 1.10.0 in this cluster.

singhswg commented:

I am doing the same thing, but I'll try to replicate again at some point today or tomorrow. Just FYI, I used module version v17.22.0 and AMI ID ami-05de3fef5bb9d43a

singhswg commented Nov 10, 2021

I tested again and encountered the same issue.

Terraform version: 1.0.9
AWS provider: 3.63.0

```
$ kubectl get nodes -o wide -w
NAME                                      STATUS     ROLES    AGE   VERSION               INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-25-0.us-west-2.compute.internal   NotReady   <none>   29s   v1.21.4-eks-033ce7e   10.0.25.0     <none>        Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7
ip-10-0-25-0.us-west-2.compute.internal   NotReady   <none>   30s   v1.21.4-eks-033ce7e   10.0.25.0     <none>        Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7

$ kubectl describe ds aws-node -n kube-system | grep Image
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.10.0-eksbuild.1
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.10.0-eksbuild.1
```

Downgrading the CNI to 1.9.3 fixed the issue. This is the same code that had been working for weeks until today, so I am sure nothing really changed on my end. Let me know if I should try something else here.

achevuru (Contributor) commented:

@singhswg Do you have IMDSv2 enabled on your worker nodes?

singhswg commented:

I have IMDSv2 enabled on the cluster nodes.
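
In case it's relevant, a quick sketch to inspect (and, if needed, relax) a node's IMDS settings; the instance ID is a placeholder:

```sh
# Show the node's IMDS configuration (tokens required? hop limit?)
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'

# If tokens are required and the hop limit is 1, it can be raised
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2
```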

fitchtech (Author) commented:

@singhswg When you create an EKS cluster with Terraform, it does not install any EKS add-ons. After the EKS cluster is created, and before creating the EKS node groups, apply an aws_eks_addon resource with the specific version pinned.

Side note: the CoreDNS add-on must be created after at least one node group exists, whereas the VPC-CNI and kube-proxy add-ons can be applied before any node group is created (see the CLI sketch after the Terraform example below).

```hcl
variable "cluster_id" {
  type        = string
  description = "The name of the EKS cluster"
}

variable "cluster_oidc_issuer" {
  type        = string
  description = "Required for VPC-CNI addon. The EKS Cluster OIDC Issuer."
}

variable "vpc_cni_version" {
  type        = string
  description = "EKS Addon version tag"
  default     = "v1.9.3-eksbuild.1"
}

# IRSA role assumed by the aws-node service account in kube-system
module "irsa_vpc_cni" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "4.6.0"
  create_role                   = true
  role_name                     = "${var.cluster_id}-vpc-cni"
  provider_url                  = replace(var.cluster_oidc_issuer, "https://", "")
  role_policy_arns              = [aws_iam_policy.vpc_cni.arn, "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"]
  oidc_fully_qualified_subjects = ["system:serviceaccount:kube-system:aws-node"]
}

resource "aws_iam_policy" "vpc_cni" {
  name        = "${var.cluster_id}-vpc-cni"
  description = "EKS cluster addon for VPC CNI ${var.cluster_id}"
  policy      = data.aws_iam_policy_document.vpc_cni.json
  lifecycle {
    ignore_changes = [name, description]
  }
}

data "aws_iam_policy_document" "vpc_cni" {
  statement {
    actions = [
      "sts:AssumeRoleWithWebIdentity",
      "sts:AssumeRole"
    ]
    effect = "Allow"

    condition {
      test     = "StringEquals"
      variable = "${replace(var.cluster_oidc_issuer, "https://", "")}:sub"
      values   = ["system:serviceaccount:kube-system:aws-node"]
    }
    resources = ["*"]
  }
}

# Managed add-on pinned to a known-good version; OVERWRITE resolves
# conflicts with the self-managed manifest the cluster starts with
resource "aws_eks_addon" "cni" {
  cluster_name             = var.cluster_id
  addon_name               = "vpc-cni"
  addon_version            = var.vpc_cni_version
  resolve_conflicts        = "OVERWRITE"
  service_account_role_arn = module.irsa_vpc_cni.iam_role_arn
  lifecycle {
    ignore_changes = [cluster_name, addon_name]
  }
}
```
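
And since CoreDNS has to wait for schedulable nodes (per the side note above), the equivalent ordering expressed with the AWS CLI would look roughly like this (cluster name, node group name, and add-on version are placeholders):

```sh
# Wait for the node group before creating the CoreDNS add-on
aws eks wait nodegroup-active \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes

aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name coredns \
  --addon-version v1.8.4-eksbuild.1 \
  --resolve-conflicts OVERWRITE
```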

singhswg commented:

@achevuru I tested the 1.10.1 release and that seems to have fixed my issue. For now, I am using the 1.9.3 version, pinned via aws_eks_addon as @fitchtech suggested.

Thank you guys.

achevuru self-assigned this Nov 16, 2021
achevuru (Contributor) commented:

v1.10.1 is now available via Managed add-ons as well. Closing this issue.

github-actions commented:

⚠️ COMMENT VISIBILITY WARNING ⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
