
Enable HPA by default #27

Merged
merged 3 commits into from
Feb 27, 2020
Conversation

sslavic

@sslavic sslavic commented Feb 10, 2020

@sslavic sslavic self-assigned this Feb 10, 2020
- `controller.autoscaling.enabled` | Enables or disables Horizontal Pod Autoscaler (HPA) for NGINX Ingress Controller Deployment. | `false`
- `controller.autoscaling.minReplicas` | Configures HPA min replicas. | `1`
+ `controller.autoscaling.enabled` | Enables or disables Horizontal Pod Autoscaler (HPA) for NGINX Ingress Controller Deployment. | `true`
+ `controller.autoscaling.minReplicas` | Configures HPA min replicas. | `2`
Author


Syncs docs with actual configured default
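For reference, the documented defaults above correspond to a values fragment along these lines (a sketch; only the autoscaling keys come from the table, the surrounding structure is an assumption about the chart layout):

```yaml
# Sketch of the relevant values.yaml fragment; only the autoscaling keys
# reflect the documented defaults, everything else is illustrative.
controller:
  autoscaling:
    enabled: true     # HPA on by default after this PR
    minReplicas: 2    # HPA lower bound
```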

@@ -26,7 +26,7 @@ configmap:
   # Increase hash table size to allow more server names for stability reasons
   server-name-hash-bucket-size: "1024"
   server-tokens: "false"
-  worker-processes: "4"
+  worker-processes: "1"
Author


Each nginx replica Pod starts a number of processes:

  • controller
  • cache
  • master
  • workers

and some processes get started periodically or on-demand, e.g. to load configuration changes.

The upstream default for the number of worker processes is `auto`, which means the number of worker processes in each nginx replica/Pod is set to the detected number of CPUs. The problem is that the current logic for detecting available CPUs is not cgroups aware.

So far we've been overriding worker-processes to 4. Given that we've been requesting only 500m CPU and 600Mi memory, this was obviously misconfigured. From HPA trials on adidas tenant clusters we observed that each worker process seems to keep its own copy of the entire configuration derived from the Ingress definitions, and on dev clusters with a large number of Ingress CRs, especially during configuration validation/reload, resource usage would double and didn't fit in the requested resources.

Reducing worker-processes to just 1, and granting enough memory and CPU, made the resource usage of each replica more predictable and stable, with enough capacity to hold the configuration even during reloads.

Currently these settings are sized to handle the largest known clusters.

Once Prometheus per tenant cluster is available, we'll be able to collect more metrics, such as nginx's built-in ones (connection count/rate, request rate) and Ingress count/rate, analyze the variance in resource utilization across clusters, and potentially discover different usage profiles, so we'll very likely revisit these settings in the future.

Contributor


I have two concerns here:

  1. Increasing from 500m to 2G will have an impact on the allocatable memory/CPU of each worker node. We need to check whether the default instance type (or the available instance types) has enough resources to give customers functional workers.

  2. If we set 2 CPUs, would it make more sense to set worker processes to 2 instead of 1?
    According to https://nginx.org/en/docs/ngx_core_module.html#worker_processes, a good initial configuration is to align worker processes with CPU cores.

Author


  1. I share your concern; I'm actually testing this and won't merge until verifying it fits. I got an ack earlier today from @puja108 on Slack that, if needed, it's OK even to raise the minimal requirements.

  2. I had the same thoughts, though I have to admit that relatively naive experiments so far didn't show a perf benefit, but they did show better stability with a single worker; maybe that's because it's not only worker processes that run in each nginx Pod: there's the controller, master, cache, and periodically kicked-off processes for validating and applying configuration updates. So 1 worker process now seems like a safe start; we can always relatively easily adjust to 2, revert to 4, or even disable HPA on demand on a case-by-case basis, e.g. if the rollout to existing production tenant clusters causes previously unobserved issues. Once metrics are available and we collect more info/experience from the live deployments, we can set objective performance and capacity SLOs per replica and continuously ensure they are met with automated testing.

Contributor


  1. What about adding an entry in the changelog? I will try to give it more visibility so users are aware of the implications.

  2. OK, all good.

Author


@pipo02mix I've updated the changelog, PTAL; we'll likely repeat this in some form and may even expand on it in the legacy platform release notes.

Notice I've bumped the release with HPA enabled to 1.6.0 instead of 1.5.0; I pushed HPA back a bit in order to first release a regression fix for 1.4.0 plus an upgrade to the latest nginx, and that will be released as 1.5.0 instead. The regression was discovered while testing legacy WIP releases on AWS, which we discussed a bit earlier today in the Slack #batman channel thread https://gigantic.slack.com/archives/C76JX6YLQ/p1581929367347400

resources:
  requests:
    cpu: 500m
    memory: 600Mi
Author


The e2e tests environment seems to be constrained, so without adjusting resource usage like this, the 3rd replica wasn't able to start due to insufficient resources.

Contributor


We could maybe increase the size of the kind nodes but I think this makes more sense. 👍

@@ -0,0 +1,15 @@
{{- if or (and .Values.controller.autoscaling.enabled (gt (.Values.controller.autoscaling.minReplicas | int) 1)) (gt (.Values.ingressController.replicas | int) 1) }}
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
Author


With HPA enabled there's an additional source of voluntary disruptions. This PDB helps us additionally protect the nginx Deployment by controlling how many instances can be unavailable at the same time.
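For illustration, the rendered PDB might look roughly like the following (a sketch; the selector labels and the maxUnavailable value are assumptions, not necessarily the chart's actual output):

```yaml
# Sketch of a rendered PodDisruptionBudget for the controller; the
# metadata name, selector labels, and maxUnavailable are assumed values.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-ingress-controller
spec:
  maxUnavailable: 1                  # at most one replica down voluntarily
  selector:
    matchLabels:
      k8s-app: nginx-ingress-controller
```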

@sslavic
Author

sslavic commented Feb 11, 2020

Tested on giraffe (AWS, China), with nginx as an optional app on top of GS 11.0.0.

@sslavic sslavic marked this pull request as ready for review February 11, 2020 16:40
@sslavic sslavic requested review from a team February 11, 2020 16:40
Contributor

@pipo02mix pipo02mix left a comment


Some concerns raised, but it is a wonderful step forward.


@sslavic
Author

sslavic commented Feb 17, 2020

@marians suggested https://docs.giantswarm.io/basics/cluster-size-autoscaling/#ingress-controller-replicas-with-autoscaling needs to be updated

@pipo02mix
Contributor

@sslavic do you think then it is a good idea to add an entry to the changelog with the changes introduced?

@sslavic
Author

sslavic commented Feb 17, 2020

@pipo02mix yes, I did add a (too brief) thumbs-up when you brought it up last time via #27 (comment),
but I didn't get to it yet since I was working on the legacy release first and had intense on-call shifts; I plan to work on it today, as I shared via https://gigantic.slack.com/archives/CRBH3QNNS/p1581932970023000

@pipo02mix
Contributor

no worries, take your time. :)

Contributor

@pipo02mix pipo02mix left a comment


Just some wording to tighten the assessments

@sslavic sslavic force-pushed the enable-hpa branch 3 times, most recently from a1eb4cb to 30f6a55 Compare February 18, 2020 14:09
@sslavic sslavic force-pushed the enable-hpa branch 2 times, most recently from dae407a to a6e7bf0 Compare February 25, 2020 14:58
@sslavic
Author

sslavic commented Feb 25, 2020

@rossf7 @pipo02mix @puja108 I've updated the PR to include the changes we discussed earlier today in #team-batman https://gigantic.slack.com/archives/C76JX6YLQ/p1582626800037000: break away from cluster-operator's dynamically configurable ingressController.replicas

PTAL

Contributor

@pipo02mix pipo02mix left a comment


LGTM

use-forwarded-headers: "true"

controller:
  name: nginx-ingress-controller
  k8sAppLabel: nginx-ingress-controller

  replicaCount: 2
Author


@pipo02mix @puja108 @rossf7 recently the GS AWS worker instance type with 2 CPUs was removed, so the min instance type now has 4 CPUs, yet the ASG allowed min and max remained 1. Whatever the use case for clusters with at most a single worker node is (test/playground/learning; I'm trying to discover that via https://gigantic.slack.com/archives/C0251MXL5/p1582671214019900), 2 nginx replicas with 2 CPUs each couldn't fit on such a minimal single-node 4-CPU tenant cluster, especially not together with all the other pre-installed managed workloads.

Maybe we can consider increasing min worker count to 2.

Still, assuming we won't increase the minimum worker count to 2, I'm thinking to either

  • reduce the nginx replica count default to 1, and HPA minReplicas to 1 too, so that hopefully at least 1 can get scheduled on a minimal single worker node cluster; the con is that even in production environments the nginx Deployment could be dropped by HPA to a single replica only, a risky setup with no redundancy; or alternatively
  • keep 2 as the default replica count, tolerating that not both of them would be scheduled; I'd test that at least one does get scheduled. Having both of them scheduled alongside the other pre-installed components is, I think, not currently possible, since nginx gets installed relatively late in cluster init; the customer could be instructed to manually customize/optimize resource requests and autoscaling settings in this non-production corner case. The pro would be that the more important and more common non-idle use case is better covered with automation out of the box.

Longer term we could figure out the supported use cases / usage profiles. If we can categorize them into a few T-shirt sizes, determined by the resources assigned to the cluster (worker count range and worker size / instance type), cluster-operator could hold that resources-to-profile mapping logic and publish the determined usage profile/category in the cluster values ConfigMap; based on the profile, apps could then adjust their default resource requests and redundancy settings, which would hopefully minimize the need to customize resources/scaling for the supported usage profiles.
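As a sketch of the idea, the mapping might look something like this in chart values (all profile names, thresholds, and numbers here are illustrative assumptions, not an existing implementation):

```yaml
# Hypothetical "resources -> profile" mapping; cluster-operator would
# derive clusterProfile and publish it in the cluster values ConfigMap.
clusterProfile: medium        # assumed value, derived from cluster size

profiles:
  small:                      # e.g. 1-3 small workers
    replicaCount: 1
    autoscaling: { minReplicas: 1, maxReplicas: 3 }
  medium:                     # e.g. 3-10 workers
    replicaCount: 2
    autoscaling: { minReplicas: 2, maxReplicas: 5 }
  large:                      # e.g. 10+ workers
    replicaCount: 2
    autoscaling: { minReplicas: 4, maxReplicas: 20 }
```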

What do you think?

Contributor


My understanding (which is limited) is that you can create single node clusters but they are not fully supported.

The 1st option of reducing to 1 replica seems simpler. There is a risk of downtime to the ingress but if you've configured a single node cluster you've increased your risk of downtime.

Contributor


Can we do an intermediate solution, like setting a min value depending on the size of the cluster?

-> cluster, single NP with 1 min / 3 max workers -> HPA IC min 1
-> cluster, single NP with 3 min / 10 max workers -> HPA IC min 2
-> cluster, single NP with 10 min / 50 max workers -> HPA IC min 4

The problem is that cluster scaling actions would then have to modify the HPA values of some components (here the IC), but IMO it would be the ideal scenario. I talked to a customer with a big node pool cluster and they were surprised that the HPA min value for CoreDNS was 2 replicas. I think conditioning the min (or max) on cluster size is ideal. Can we do it somehow?
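One way the proposed tiers could be expressed is a conditional in the chart's HPA template, keyed off a value describing cluster size (a sketch; `.Values.workerCountMin` and the thresholds are assumptions, not existing chart values):

```yaml
# Hypothetical Helm template fragment implementing the proposed tiers;
# workerCountMin would have to be injected per cluster, e.g. by
# cluster-operator via the cluster values ConfigMap.
minReplicas: {{ if ge (.Values.workerCountMin | int) 10 }}4{{ else if ge (.Values.workerCountMin | int) 3 }}2{{ else }}1{{ end }}
```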

Author


@rossf7 thanks for your feedback on the single node cluster use case and its support.

It makes no sense to have the HPA min higher than the default replica count, as HPA would immediately scale up to its min.
I think defaulting to 1 replica (and so also an HPA min of 1) is an availability risk even for tenant clusters with more than 1 worker node: nginx or the node will crash or get recycled, and as we saw in recent incidents with Pods stuck Terminating for ages, it takes time for a replacement to come up, so 1 replica is a risk even in multi-node clusters.

Therefore I'm more for defaulting to 2 replicas, letting 1 hang unscheduled on playground single small node clusters.
I'll go with that for now (i.e. keep the settings as is) and test a WIP release even on a single node cluster with the smallest instance type; we can always come back and make adjustments in a follow-up patch release.

Author


@pipo02mix yes, I mentioned earlier the idea of having cluster-operator do the T-shirt size / usage profile estimation. The estimation would be based on what the customer assigned to the tenant cluster (scaling range, i.e. instance count, and instance type / capacity); the estimated profile would be saved/updated in the (tenant) cluster values ConfigMap that is provided to every installed app, and app charts could map T-shirt sizes to app configuration (replica count, resource assignment, HPA range, ...).

Start simple with 2-3 levels/profiles; I think anything is better than the current one-size-fits-all, since apps are not smart enough to deal with the full range of clusters/usage, and we practically can't apply both HPA and VPA at the same time to a single app.

Author


Tested HPA-enabled nginx with 2 initial replicas and an HPA min of 2 replicas on the smallest single node cluster we currently support on AWS, using the 11.0.1 platform release and installing the nginx App as optional/on-demand without any adjustments. Other things started as expected, while there was a single running nginx replica and 2 Pending; HPA was trying to scale out since the desired replica count was lower than the minimum replica count.

The problem with this configuration is that we'll get paged about the nginx Deployment not being satisfied. For this reason (to cover the smallest cluster without paging) I'll reduce the initial replica count to 1 and the HPA min to 1 as well.
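Concretely, the rendered HPA after this change might look roughly like the following (a sketch; only `minReplicas: 1` reflects the decision above, while the names, upper bound, and CPU target are assumed values):

```yaml
# Illustrative HPA manifest; only minReplicas: 1 comes from this change,
# the other fields are assumptions for illustration.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 1
  maxReplicas: 20                      # assumed upper bound
  targetCPUUtilizationPercentage: 50   # assumed CPU target
```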

Contributor

@rossf7 rossf7 left a comment


LGTM

This is great Stevo. Thanks for all your efforts on this. I mainly reviewed in the context of the cluster-operator changes, which all make sense to me.



Stevo Slavić and others added 2 commits February 26, 2020 17:20
Co-Authored-By: Ross Fairbanks <rossf7@users.noreply.github.com>
@sslavic
Author

sslavic commented Feb 27, 2020

Tested on gauss on the smallest single worker node cluster with 4 CPUs; the nginx upgrade works.

@sslavic sslavic merged commit 741a18a into master Feb 27, 2020
@sslavic sslavic deleted the enable-hpa branch February 27, 2020 17:47