
Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers #65

Closed
4 of 8 tasks
MarcelMue opened this issue Feb 4, 2019 · 39 comments

@MarcelMue

MarcelMue commented Feb 4, 2019

Goals

  • Make our ingress controllers more dynamic with autoscaling worker nodes.
  • Consume fewer resources in tenant clusters where the IC is not used or sees low traffic.
  • Respond better to traffic peaks hitting the Ingress Controller.
  • Have load test coverage for Ingress Controller HPA and cluster-autoscaler.

Current state:

  • When launching a new cluster we set the number of ingress controller replicas to the number of nodes in the workers array in cluster-operator.
  • This causes problems, as the workers array is no longer a reliable source of truth for worker counts once cluster-autoscaling is in use.
  • A cluster with autoscaling will not change the number of ICs running as the number of nodes changes.

TODO

EDIT: Updated checklist with current state.

@MarcelMue
Author

I also checked some actual IC usage on the asgard cluster 05ri3.
CPU usage seemed to be between 5 and 10% for the ICs:

rate(container_cpu_usage_seconds_total{cluster_type="guest",cluster_id="05ri3",container_name="nginx-ingress-controller"}[20m]) * 100

Memory usage was much higher, jumping between 400 and 800 MB:

container_memory_usage_bytes{cluster_type="guest",cluster_id="05ri3",container_name="nginx-ingress-controller"}

I could very well have done this wrong as my Prometheus skills are a bit rusty :(

@MarcelMue
Author

cc @giantswarm/team-batman

@puja108
Member

puja108 commented Feb 4, 2019

Two tests from my side:

  1. load test against an app behind the ICs
  2. test an IC that handles lots of ingress objects (for memory usage)

@pipo02mix
Contributor

I have created a simple project to make running JMeter easier. It's a first iteration, but enough to run a simple load test.

https://github.com/giantswarm/jmeter

@MarcelMue we can take a look at what exactly we need to set and iterate from there.

@rossf7
Contributor

rossf7 commented Feb 27, 2019

Picking this up for the hackathon. My plan is to use the Sock Shop demo from Weave, as it has built-in load testing with Locust, which I've used before. https://github.com/microservices-demo/load-test/

For the first test I'm going to use an AWS tenant cluster to test the interaction with the cluster-autoscaler.

@rossf7
Contributor

rossf7 commented Feb 28, 2019

Hackathon Day 1 Update

  • The Sock Shop demo is very outdated and doesn't work with our PSPs. We installed the GCP demo instead, which is much newer: https://github.com/GoogleCloudPlatform/microservices-demo. It turns out it can be run without Istio, which was the reason I initially rejected it.
  • @TheoBrigitte worked on testing the existing PR with wrk2 and our helloworld image. Scaling happened, but we couldn't get good CPU metrics for the IC pods.
  • I worked on some Go code for mass-generating ingress resources, so we can simulate production clusters and test memory usage for the IC pods (a sketch of the kind of resource being generated is at the end of this comment).

TODO

  • Theo is working on getting metrics so we can observe the IC under load.
  • I'll get the ingress generation code into a Git repo so we can reuse it.
  • Once we have good metrics we will retest with different scenarios and check scaling behaviour and limits for IC pods.
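
For reference, the generated resources are essentially many copies of a plain Ingress like the sketch below, with only the name and host varying per copy (names, hosts and the backend service here are made-up illustrations, not what the generator actually emits):

  apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    name: loadtest-ingress-0001        # generated names would just count upwards
  spec:
    rules:
      - host: app-0001.example.com     # one hypothetical host per generated ingress
        http:
          paths:
            - path: /
              backend:
                serviceName: helloworld   # assumed backend service
                servicePort: 8080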

@stale

stale bot commented Jun 8, 2019

This issue has been marked as stale as it has not had recent activity, and will be closed in a week if there is no further activity.
To prevent this deletion, remove the label lifecycle/stale, and comment on this issue.
To prevent this deletion and keep this Issue indefinitely, remove the label lifecycle/stale and add the label lifecycle/keep.

@rossf7
Contributor

rossf7 commented Jun 10, 2019

@MarcelMue Yes, this is taking longer than we'd like :( Now that conference season is over for a bit, that should help.

I talked with @puja108 about this on Friday. The next step is to use StormForger so we have e2e coverage. We don't want to roll our own load testing setup. We will work on this in parallel with the next phase of the App Catalog work.

I'll contact Matthias @ StormForger tomorrow and ask him to create me an owner account.

cc @pipo02mix @tomahawk28 @sslavic

@rossf7 rossf7 self-assigned this Jun 10, 2019
@rossf7
Contributor

rossf7 commented Jun 17, 2019

Status 17 June 2019

  • I am now the owner of our giantswarm org in the StormForger app (we can only have 1 owner).
  • I created accounts for Batman and Marcel.

TODO

@rossf7
Contributor

rossf7 commented Jun 26, 2019

@pipo02mix I updated the top comment and added a task list. Does it look OK?

@rossf7
Contributor

rossf7 commented Jun 27, 2019

Status 27 June 2019

@pipo02mix and I worked on this the past 2 days during the Q2 hackathon. The goal was to have an automated way to trigger load tests using StormForger against AWS tenant clusters.

Hackathon tasks

I'm still debugging the e2e test but it's basically code-complete.

Next steps

  • Demo during Batman planning next week.
  • Decide whether we can permanently enable the e2e test in aws-operator (often the TC API does not come up).
  • Extend JS tests in stormforger-cli to stress IC more.
  • Test at higher load and see whether we can enable HPA for IC.

cc @puja108

@rossf7
Contributor

rossf7 commented Jul 1, 2019

Status 1 July 2019

  • I've split the e2e test into 2 PRs because it's big. The 1st PR is in review.

TODO

  • Demo in planning and decide what is next on this.

@rossf7
Contributor

rossf7 commented Jul 23, 2019

Status 23 July 2019

The loadtest e2e test is now merged, so we have a fully automated load test that uses StormForger and is controlled from a Circle job. The aws-operator PR is stale, so I need to do a fresh PR and align it with the changes from the reviews.

The test we run is a "Hello World" style test, but I wanted to get the automation sorted first. It is stored in a configmap and deployed using a Helm chart, so it's easy to iterate on.

https://github.com/giantswarm/stormforger-cli/blob/1c086b187dc27c3ef979c48a07f1d6818855f08a/helm/stormforger-cli-chart/templates/configmap.yaml
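
For illustration only, the chart essentially wraps a StormForger test case (JavaScript) in a ConfigMap along these lines; the names and the test body below are simplified assumptions, the real definition lives in the file linked above:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: stormforger-testcase          # assumed name, see the linked chart for the real one
  data:
    testcase.js: |
      // simplified "Hello World" style test case, target URL is a placeholder
      definition.setTarget("http://loadtest-app.example.com");
      definition.session("hello world", function (session) {
        session.get("/");
      });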

I'd like to schedule another call with StormForger for next week to get their input on how we can do the deep testing of nginx-ingress-controller and cluster-autoscaler that we need.

@puja108 @pipo02mix WDYT?

cc @teemow @MarcelMue

@puja108
Member

puja108 commented Jul 23, 2019

Cluster-autoscaler only scales based on what is scheduled, so there's no need for actual load; you can even test it by scheduling empty pods with huge resource requests.
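
To illustrate the point, a minimal sketch of such a test workload (names and sizes are made up; the idea is just that the requests exceed the free capacity per node, so pods become unschedulable and cluster-autoscaler adds nodes):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: autoscaler-dummy-load           # hypothetical name, illustration only
  spec:
    replicas: 10
    selector:
      matchLabels:
        app: autoscaler-dummy-load
    template:
      metadata:
        labels:
          app: autoscaler-dummy-load
      spec:
        containers:
          - name: pause
            image: k8s.gcr.io/pause:3.1   # does nothing, only reserves the requested resources
            resources:
              requests:
                cpu: "2"                  # large requests so nodes fill up quickly
                memory: 4Gi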

@rossf7
Contributor

rossf7 commented Jul 23, 2019

Yes, I just want to be sure that nginx-ingress-controller and cluster-autoscaler play nicely. I think the focus should be on nginx-ingress-controller.

As discussed I'll come up with an agenda for the StormForger call and we can review it in batman first.

@pipo02mix
Contributor

Yeah, it is a fair point; we are only testing the IC at this point. In the future it makes sense to test the cluster version (IC, cluster-autoscaler, ...) under load. Great work Ross, count me in for the StormForger call.

@pipo02mix
Contributor

Status 23 August 2019

I am working on getting the first real scenario done. After talking with StormForger, these are the things I am considering to make a first reliable test:

  • fewer sessions, more requests per session
  • Use requests that take longer (?delay=1234)
  • Test with SSL
  • Set up NFR checks (https://docs.stormforger.com/reference/nfr/) - don't check Apdex but latencies. Establish a baseline to have something to compare against.
  • Latency Distribution Graph in StormForger UI
  • Get requirements of new connections vs new requests

Ross will help with the automation once we know the test we want to run. He will add the e2e test to aws-operator so that it only runs on request.

@pipo02mix
Contributor

I could not work on this, as other things with higher priority came up.

@puja108
Member

puja108 commented Oct 29, 2019

As IC is becoming both optional and more configurable, we decided to start testing this with a few customers in batman cycle 3 to develop realistic testing scenarios.

@cokiengchiara cokiengchiara changed the title Add HPA to our tenant ingress controllers Autscaling Ingress Controllers | Add HPA to our tenant ingress controllers Dec 9, 2019
@cokiengchiara
Contributor

cokiengchiara commented Dec 9, 2019

@rossf7 , I'm prepping for Cycle Planning. Here's what we discussed in the Pre-Cycle Planning. https://www.dropbox.com/scl/fi/87go47eugq4ikmqe4rp2f/Managed%20Services%20Dec%2010%20Cycle%20Planning.paper?dl=0&rlkey=gjjdj655vp97r342od7pe3p0d LMKYT.

  1. Puja thinks the apps team should do this.
  2. This is basically ready, but needs to be tested with customers. On eng'g day, Fer and Julien's customers said they are OK with testing this.
  3. One customer even has an autoscaling ingress controller setup. They can show us what they have, and then we can see if there's something different, discuss why it's different, and decide whether their setup makes sense for us. Puja would trust that what they have is not bad. Ping @piontec and Puja here. @puja108, who should we talk to to ask our customers which non-critical clusters are safe to test this on? @giantswarm/sig-customer, would you know if there are non-critical clusters we could test autoscaling IC on?

@rossf7
Contributor

rossf7 commented Dec 9, 2019

@cokiengchiara I agree it makes sense for the apps team to take it. I just have 2 worries.

  • I feel a bit guilty I didn't get this done and this task gets passed to someone else again.
  • I'd like us to eventually have automated testing here. We've been talking to StormForger (who are also friends, based in Cologne). The blocker is that we need a scenario with traffic numbers to test with. Maybe we could also get these from customers?

@cokiengchiara
Contributor

Thanks @rossf7, please don't feel guilty. Should we agree to give this to Halo? In the long run, it helps move IC work from you to them as well...

@rossf7
Contributor

rossf7 commented Dec 11, 2019

Thanks @cokiengchiara I agree it makes sense to go to Halo.

@rossf7
Contributor

rossf7 commented Dec 23, 2019

This moved to Halo. Unassigning myself but happy to help with questions or handover.

@rossf7 rossf7 removed their assignment Dec 23, 2019
@sslavic sslavic changed the title Autscaling Ingress Controllers | Add HPA to our tenant ingress controllers Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers Dec 23, 2019
@cokiengchiara
Contributor

cokiengchiara commented Jan 6, 2020

@sslavic can talk with @pipo02mix re: his customer's autoscaling settings

@puja108
Member

puja108 commented Jan 8, 2020

Customer said 2 things:

  1. IC replicas above 12 are questionable
  2. NGINX with CPU above 60% is already a bad sign

Re 1: IMO maybe 20 is still OK, also considering that AWS instances, for example, have limited bandwidth; in one of our first load tests with a customer the bottleneck was the bandwidth of each EC2 instance, not the IC.

Re 2: I am not sure; it might just need a bit more research, and maybe we should ask the customer where they have their values from.

Also worth considering is what GitLab has as defaults: https://gitlab.com/gitlab-org/charts/gitlab/blob/master/charts/nginx/values.yaml#L152-157

@puja108
Member

puja108 commented Jan 8, 2020

As GitLab sometimes moves or deletes things, here is a copy from their values.yaml:

  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 11
    targetCPUUtilizationPercentage: 50
    targetMemoryUtilizationPercentage: 50

@cokiengchiara
Contributor

cokiengchiara commented Jan 21, 2020

  1. Test in our own test clusters.
    1.1. A flag to flip it on/off would also make Stevo comfortable rolling it out, so we can switch back to the old version if it doesn't work.
  2. Customer is already willing to test in a non-production cluster - @jgsqware can you ask the customer for test clusters we can test this in?

  3. Enable it for everyone else with enough confidence from the previous steps. What else do we need to feel confident that we can roll it out? Scalable Prometheus would be nice, even though it's not a hard dependency.

@cokiengchiara cokiengchiara transferred this issue from another repository Jan 27, 2020
@cokiengchiara
Contributor

@sslavic I transferred this to the public roadmap repository and lost the labels, sorry :(

@cokiengchiara
Contributor

cokiengchiara commented Jan 27, 2020

Documenting the benefit of HPA: so the NGINX IC can always run optimally, without it overloading resources.

@sslavic's explanation: HPA makes the target workload reactive to / scaled with the load. When the load (as measured by the chosen metric) goes over our configured threshold, more replicas get created. We use CPU & memory average utilization, with a 50% threshold for both. There is a min and max number of replicas configured, so it doesn't scale out indefinitely.

Without HPA, the number of replicas would be fixed. That typically means (1) when load is low, it would be wasting resources, and (2) when load increases beyond the initial capacity, it wouldn't utilize the available resources and... what are the consequences?

So "can always run optimally" is correct and that is very concise way to explain the benefit, while "without it overloading resources" not really.

@cokiengchiara cokiengchiara self-assigned this Jan 31, 2020
@cokiengchiara
Contributor

@sslavic from your comment here https://gigantic.slack.com/archives/CR99LSZ1N/p1581063843111900?thread_ts=1581059520.110600&cid=CR99LSZ1N I think we need to split the ticket, so we can scope this down to what we can do now, then create a new one for the stuff that doesn't make sense to implement now. WDYT?

@sslavic

sslavic commented Feb 7, 2020

Yes, this one is an epic, and a relic of the previous way of managing the backlog/TODO.

@cokiengchiara
Contributor

Plan to ship today, except Azure. Doing everything on our side, so they can take over when they're ready.

@cokiengchiara
Contributor

Shipped AWS.
Today, shipping KVM. Handing the Azure release over to Celestial.

Once we ship the release, we need to create follow-up tickets.

  • add a specification in the design approach
  • where else can we use this? (for the general backlog)

@snizhana-dynnyk snizhana-dynnyk modified the milestones: Azure 11.2.1, 2020 Q2 Apr 1, 2020
@sslavic

sslavic commented Apr 27, 2020

HPA has already been available for a while; it was disabled by default, and no customer used it on their own while we had a mix of other solutions (cluster-operator, CronJob) managing the number of nginx replicas.

With the work of the last quarter or so, HPA has been enabled by default for selected clusters, across the currently supported channels.

IMO any follow-up work can be broken down (evaluated, prioritized) and this epic closed.

@cokiengchiara
Contributor

This is done. @cokiengchiara to create follow-up issues. Talk to @sslavic.

@cokiengchiara
Contributor

@sslavic In Slack, you said, "repeatable load test was one of the TODOs in the epic. rest of the TODOs (and more) I think is covered with epic/nginx-tuning."

  1. Add a load test to e2etests that creates a TC, installs the test app and triggers the test - Ross

Created this: https://github.com/giantswarm/roadmap/issues/141

  2. The rest I added to https://github.com/giantswarm/giantswarm/issues/10163 and not as separate issues, since the first two are tasks to "check". Let me know if this organization makes sense for you.
  • Check if thresholds for CPU / Memory make sense in production
  • Check what the min/max number of ICs should be (also consider how many workers an IC is using?)
  • Find a better metric to scale the pods giantswarm/giantswarm#10163
