
Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers #65

Closed
4 of 8 tasks
MarcelMue opened this issue Feb 4, 2019 · 39 comments

@MarcelMue

MarcelMue commented Feb 4, 2019

Goals

  • Make our ingress controllers more dynamic with autoscaling worker nodes.
  • Consume fewer resources in tenant clusters where the IC is not used or sees low traffic.
  • Respond better to traffic peaks hitting the Ingress Controller.
  • Have load test coverage for Ingress Controller HPA and cluster-autoscaler.

Current state:

  • When launching a new cluster we set the number of ingress controller replicas to the number of nodes in the workers array in cluster-operator.
  • This causes problems, as the workers array is no longer a reliable source of truth for worker counts once cluster-autoscaling is in use.
  • A cluster with autoscaling will not change the number of ICs running as the number of nodes changes.

TODO

EDIT: Updated checklist with current state.

@MarcelMue
Author

I also checked some actual IC usage on the asgard cluster 05ri3.
CPU usage seemed to be between 5 and 10% for the ICs:

rate(container_cpu_usage_seconds_total{cluster_type="guest",cluster_id="05ri3",container_name="nginx-ingress-controller"}[20m]) * 100

Memory usage was much higher, jumping between 400 and 800 MB:

container_memory_usage_bytes{cluster_type="guest",cluster_id="05ri3",container_name="nginx-ingress-controller"}

I could very well have done this wrong as my Prometheus skills are a bit rusty :(

@MarcelMue
Author

cc @giantswarm/team-batman

@puja108
Member

puja108 commented Feb 4, 2019

Two tests from my side:

  1. load test against an app behind the ICs
  2. test an IC that handles lots of ingress objects (for memory usage)

@pipo02mix
Contributor

I have created a simple project to make running JMeter easier. It's a first iteration, but enough to run a simple load test.

https://github.com/giantswarm/jmeter

@MarcelMue we can take a look at what exactly we need to set and iterate from there.

@rossf7
Contributor

rossf7 commented Feb 27, 2019

Picking this up for the hackathon. My plan is to use the Sock Shop demo from Weave, as it has built-in load testing with Locust, which I've used before. https://github.com/microservices-demo/load-test/

For the first test I'm going to use an AWS tenant cluster to test the interaction with the cluster-autoscaler.

@rossf7
Contributor

rossf7 commented Feb 28, 2019

Hackathon Day 1 Update

  • The Sock Shop demo is very outdated and doesn't work with our PSPs. We installed the GCP demo instead, which is much newer: https://github.com/GoogleCloudPlatform/microservices-demo. It turns out it can be run without Istio, which was the reason I initially rejected it.
  • @TheoBrigitte worked on testing the existing PR with wrk2 and our helloworld image. Scaling happened, but we couldn't get good CPU metrics for the IC pods.
  • I worked on some Go code for mass-generating ingress resources, so we can simulate production clusters and test memory usage for the IC pods (a sketch of the kind of resource being generated is at the end of this comment).

TODO

  • Theo is working on getting metrics so we can observe the IC under load.
  • I'll get the ingress generation code into a Git repo so we can reuse it.
  • Once we have good metrics we will retest with different scenarios and check scaling behaviour and limits for IC pods.
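
For reference, the generated resources are essentially many copies of a plain Ingress like the sketch below, with only the name and host varying per copy (names, hosts and the backend service here are made-up illustrations, not what the generator actually emits):

  apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    name: loadtest-ingress-0001        # generated names would just count upwards
  spec:
    rules:
      - host: app-0001.example.com     # one hypothetical host per generated ingress
        http:
          paths:
            - path: /
              backend:
                serviceName: helloworld   # assumed backend service
                servicePort: 8080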

@stale

stale bot commented Jun 8, 2019

This issue has been marked as stale as it has not had recent activity, and will be closed in a week if there is no further activity.
To prevent this deletion, remove the label lifecycle/stale, and comment on this issue.
To prevent this deletion and keep this Issue indefinitely, remove the label lifecycle/stale and add the label lifecycle/keep.

@rossf7
Contributor

rossf7 commented Jun 10, 2019

@MarcelMue Yes, this is taking longer than we'd like :( Now that conference season is over for a bit, that should help.

I talked with @puja108 about this on Friday. The next step is to use StormForger so we have e2e coverage. We don't want to roll our own load testing setup. We will work on this in parallel with the next phase of the App Catalog work.

I'll contact Matthias @ StormForger tomorrow and ask him to create me an owner account.

cc @pipo02mix @tomahawk28 @sslavic

@rossf7 rossf7 self-assigned this Jun 10, 2019
@rossf7
Contributor

rossf7 commented Jun 17, 2019

Status 17 June 2019

  • I am now the owner of our giantswarm org in the StormForger app (we can only have 1 owner).
  • I created accounts for Batman and Marcel.

TODO

@rossf7
Contributor

rossf7 commented Jun 26, 2019

@pipo02mix I updated the top comment and added a task list. Does it look OK?

@rossf7
Contributor

rossf7 commented Jun 27, 2019

Status 27 June 2019

@pipo02mix and I worked on this the past 2 days during the Q2 hackathon. The goal was to have an automated way to trigger load tests using StormForger against AWS tenant clusters.

Hackathon tasks

I'm still debugging the e2e test but it's basically code-complete.

Next steps

  • Demo during Batman planning next week.
  • Decide whether we can permanently enable the e2e test in aws-operator (often the TC API does not come up).
  • Extend JS tests in stormforger-cli to stress IC more.
  • Test at higher load and see whether we can enable HPA for IC.

cc @puja108

@rossf7
Contributor

rossf7 commented Jul 1, 2019

Status 1 July 2019

  • I've split the e2e test into 2 PRs because it's big. The 1st PR is in review.

TODO

  • Demo in planning and decide what is next on this.

@rossf7
Contributor

rossf7 commented Jul 23, 2019

Status 23 July 2019

The loadtest e2e test is now merged, so we have a fully automated load test that uses StormForger and is controlled from a Circle job. The aws-operator PR is stale, so I need to do a fresh PR and align it with the changes from the reviews.

The test we run is a "Hello World" style test, but I wanted to get the automation sorted first. It is stored in a configmap and deployed using a Helm chart, so it's easy to iterate on.

https://github.com/giantswarm/stormforger-cli/blob/1c086b187dc27c3ef979c48a07f1d6818855f08a/helm/stormforger-cli-chart/templates/configmap.yaml
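
For illustration only, the chart essentially wraps a StormForger test case (JavaScript) in a ConfigMap along these lines; the names and the test body below are simplified assumptions, the real definition lives in the file linked above:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: stormforger-testcase          # assumed name, see the linked chart for the real one
  data:
    testcase.js: |
      // simplified "Hello World" style test case, target URL is a placeholder
      definition.setTarget("http://loadtest-app.example.com");
      definition.session("hello world", function (session) {
        session.get("/");
      });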

I'd like to schedule another call with StormForger for next week to get their input on how we can do the deep testing of nginx-ingress-controller and cluster-autoscaler that we need.

@puja108 @pipo02mix WDYT?

cc @teemow @MarcelMue

@puja108
Member

puja108 commented Jul 23, 2019

Cluster-autoscaler only scales based on what is scheduled, so there's no need for actual load; you can even test it by scheduling empty pods with huge resource requests.
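
To illustrate the point, a minimal sketch of such a test workload (names and sizes are made up; the idea is just that the requests exceed the free capacity per node, so pods become unschedulable and cluster-autoscaler adds nodes):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: autoscaler-dummy-load           # hypothetical name, illustration only
  spec:
    replicas: 10
    selector:
      matchLabels:
        app: autoscaler-dummy-load
    template:
      metadata:
        labels:
          app: autoscaler-dummy-load
      spec:
        containers:
          - name: pause
            image: k8s.gcr.io/pause:3.1   # does nothing, only reserves the requested resources
            resources:
              requests:
                cpu: "2"                  # large requests so nodes fill up quickly
                memory: 4Gi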

@rossf7
Contributor

rossf7 commented Jul 23, 2019

Yes, I just want to be sure that nginx-ingress-controller and cluster-autoscaler play nicely. I think the focus should be on nginx-ingress-controller.

As discussed I'll come up with an agenda for the StormForger call and we can review it in batman first.

@pipo02mix
Contributor

Yeah, it is a fair point; we are only testing the IC at this point. In the future it makes sense to test the cluster version (IC, cluster-autoscaler, ...) under load. Great work Ross, count me in for the StormForger call.

@pipo02mix
Contributor

Status 23 August 2019

I am working on getting the first real scenario done. After talking with StormForger, these are the things I am considering to make a first reliable test:

  • fewer sessions, more requests per session
  • Use requests that take longer (?delay=1234)
  • Test with SSL
  • Set up NFR checks (https://docs.stormforger.com/reference/nfr/) - don't check Apdex but latencies. Establish a baseline to have something to compare against.
  • Latency Distribution Graph in StormForger UI
  • Get requirements of new connections vs new requests

Ross will help with the automation once we know the test we want to run. He will add the e2e test to aws-operator so that it only runs on request.

@pipo02mix
Contributor

I could not work on this, as other things with higher priority came up.

@puja108
Member

puja108 commented Oct 29, 2019

As IC is becoming both optional and more configurable, we decided to start testing this with a few customers in batman cycle 3 to develop realistic testing scenarios.

@cokiengchiara cokiengchiara changed the title Add HPA to our tenant ingress controllers Autscaling Ingress Controllers | Add HPA to our tenant ingress controllers Dec 9, 2019
@cokiengchiara
Contributor

cokiengchiara commented Dec 9, 2019

@rossf7 , I'm prepping for Cycle Planning. Here's what we discussed in the Pre-Cycle Planning. https://www.dropbox.com/scl/fi/87go47eugq4ikmqe4rp2f/Managed%20Services%20Dec%2010%20Cycle%20Planning.paper?dl=0&rlkey=gjjdj655vp97r342od7pe3p0d LMKYT.

  1. Puja thinks the apps team should do this.
  2. This is basically ready, but needs to be tested with customers. On eng'g day, Fer and Julien's customers said they are OK with testing this.
  3. One customer even has an autoscaling ingress controller setup. They can show us what they have, and then we can see if there's something different, discuss why it's different, and decide whether their setup makes sense for us. Puja would trust that what they have is not bad. Ping @piontec and Puja here. @puja108, who should we talk to to ask our customers which non-critical clusters are safe to test this on? @giantswarm/sig-customer, would you know if there are non-critical clusters we could test autoscaling IC on?

@rossf7
Contributor

rossf7 commented Dec 9, 2019

@cokiengchiara I agree it makes sense for the apps team to take it. I just have 2 worries.

  • I feel a bit guilty I didn't get this done and this task gets passed to someone else again.
  • I'd like us to eventually have automated testing here. We've been talking to StormForger (who are also friends, based in Cologne). The blocker is that we need a scenario with traffic numbers to test with. Maybe we could also get these from customers?

@cokiengchiara
Contributor

Thanks @rossf7, please don't feel guilty. Should we agree to give this to Halo? In the long run, it helps move IC work from you to them as well...

@rossf7
Contributor

rossf7 commented Dec 11, 2019

Thanks @cokiengchiara I agree it makes sense to go to Halo.

@rossf7
Contributor

rossf7 commented Dec 23, 2019

This moved to Halo. Unassigning myself but happy to help with questions or handover.

@rossf7 rossf7 removed their assignment Dec 23, 2019
@sslavic sslavic changed the title Autscaling Ingress Controllers | Add HPA to our tenant ingress controllers Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers Dec 23, 2019
@cokiengchiara
Contributor

cokiengchiara commented Jan 6, 2020

@sslavic can talk with @pipo02mix re: his customer's autoscaling settings

@puja108
Member

puja108 commented Jan 8, 2020

Customer said 2 things:

  1. IC replicas above 12 are questionable
  2. NGINX with CPU above 60% is already a bad sign

Re 1: IMO maybe 20 is still OK, also considering that AWS instances, for example, have limited bandwidth; in one of our first load tests with a customer the bottleneck was the bandwidth of each EC2 instance, not the IC.

Re 2: I am not sure; it might just need a bit more research, and maybe we should ask the customer where they have their values from.

Also worth considering is what GitLab has as defaults: https://gitlab.com/gitlab-org/charts/gitlab/blob/master/charts/nginx/values.yaml#L152-157

@puja108
Member

puja108 commented Jan 8, 2020

As GitLab sometimes moves or deletes things, here is a copy from their values.yaml:

  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 11
    targetCPUUtilizationPercentage: 50
    targetMemoryUtilizationPercentage: 50

@cokiengchiara
Contributor

cokiengchiara commented Jan 21, 2020

  1. Test in our own test clusters.
    1.1. A flag to flip it on/off would also make Stevo comfortable rolling it out, so we can switch back to the old version if it doesn't work.
  2. Customer is already willing to test in a non-production cluster - @jgsqware can you ask the customer for test clusters we can test this in?

  3. Enable it for everyone else with enough confidence from the previous steps. What else do we need to feel confident that we can roll it out? Scalable Prometheus would be nice, even though it's not a hard dependency.

@cokiengchiara cokiengchiara transferred this issue from another repository Jan 27, 2020
@cokiengchiara
Contributor

@sslavic I transferred this to the public roadmap repository and lost the labels, sorry :(

@cokiengchiara
Contributor

cokiengchiara commented Jan 27, 2020

Documenting the benefit of HPA: so the NGINX IC can always run optimally, without it overloading resources.

@sslavic's explanation: HPA makes the target workload reactive to / scaled with the load. When the load (as measured by the chosen metric) goes over our configured threshold, more replicas get created. We use CPU & memory average utilization, with a 50% threshold for both. There is a min and max number of replicas configured, so it doesn't scale out indefinitely.

Without HPA, the number of replicas would be fixed. That typically means (1) when load is low, it would be wasting resources, and (2) when load increases beyond the initial capacity, it wouldn't utilize the available resources and... what are the consequences?

So "can always run optimally" is correct and that is very concise way to explain the benefit, while "without it overloading resources" not really.

@cokiengchiara cokiengchiara self-assigned this Jan 31, 2020
@cokiengchiara
Contributor

@sslavic from your comment here https://gigantic.slack.com/archives/CR99LSZ1N/p1581063843111900?thread_ts=1581059520.110600&cid=CR99LSZ1N I think we need to split the ticket, so we can scope this down to what we can do now, then create a new one for the stuff that doesn't make sense to implement now. WDYT?

@sslavic

sslavic commented Feb 7, 2020

Yes, this one is an epic, and a relic of the previous way of managing the backlog/TODO.

@cokiengchiara
Contributor

Plan to ship today, except Azure. Doing everything on our side, so they can take over when they're ready.

@cokiengchiara
Contributor

Shipped AWS.
Today, shipping KVM. Handing the Azure release over to Celestial.

Once we ship the release, we need to create follow-up tickets.

  • add a specification in the design approach
  • where else can we use this? (for the general backlog)

@snizhana-dynnyk snizhana-dynnyk modified the milestones: Azure 11.2.1, 2020 Q2 Apr 1, 2020
@sslavic

sslavic commented Apr 27, 2020

HPA has already been available for a while; it was disabled by default, and no customer used it on their own while we had a mix of other solutions (cluster-operator, CronJob) managing the number of nginx replicas.

With the work of the last quarter or so, HPA has been enabled by default for selected clusters, across the currently supported channels.

IMO any follow-up work can be broken down (evaluated, prioritized) and this epic closed.

@cokiengchiara
Contributor

This is done. @cokiengchiara to create follow-up issues. Talk to @sslavic.

@cokiengchiara
Contributor

@sslavic In Slack, you said, "repeatable load test was one of the TODOs in the epic. rest of the TODOs (and more) I think is covered with epic/nginx-tuning."

  1. Add a load test to e2etests that creates a TC, installs the test app and triggers the test - Ross

Created this: https://github.com/giantswarm/roadmap/issues/141

  2. The rest I added to https://github.com/giantswarm/giantswarm/issues/10163 and not as separate issues, since the first two are tasks to "check". Let me know if this organization makes sense for you.
  • Check if thresholds for CPU / Memory make sense in production
  • Check what the min/max number of ICs should be (also consider how many workers an IC is using?)
  • Find a better metric to scale the pods giantswarm/giantswarm#10163
