Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers #65
I also tried to check some actual IC usage on
While memory usage was much higher, jumping between 400 and 800M:
I could very well have done this wrong as my prometheus skills are a bit rusty :(
cc @giantswarm/team-batman
two tests from my side:
I have created a simple project to make running jmeter easier. First iteration, but enough to run a simple load test: https://github.com/giantswarm/jmeter @MarcelMue we can take a look at what exactly we need to set and iterate on it.
Picking this up for the hackathon. My plan is to use the Sock Shop demo from Weave as it has built-in load testing with Locust, which I've used before. https://github.com/microservices-demo/load-test/ For the first test I'm going to use an AWS tenant cluster to test the interaction with the cluster-autoscaler.
Hackathon Day 1 Update
TODO
This issue has been marked as stale as it has not had recent activity, and will be closed in a week if there is no further activity.
@MarcelMue Yes, this is taking longer than we'd like :( Now that conference season is over for a bit, that should help. I talked with @puja108 about this on Friday. The next step is to use StormForger so we have e2e coverage. We don't want to roll our own load testing setup. We will work on this in parallel with the next phase of the App Catalog work. I'll contact Matthias @ StormForger tomorrow and ask him to create me an owner account.
Status 17 June 2019
TODO
@pipo02mix I updated the top comment and added a task list. Does it look OK?
Status 27 June 2019
@pipo02mix and I worked on this the past 2 days during the Q2 hackathon. The goal was to have an automated way to trigger load tests using StormForger against AWS tenant clusters.
Hackathon tasks
I'm still debugging the e2e test but it's basically code-complete. Next steps
cc @puja108
Status 1 July 2019
TODO
Status 23 July 2019
The loadtest e2e test is now merged, so we have a fully automated load test that uses StormForger and is controlled from a Circle job. The aws-operator PR is stale so I need to do a fresh PR and align it with the changes from the reviews. The test we run is a "Hello World" style test, but I wanted to get the automation sorted first. It is stored in a configmap and deployed using a Helm chart so it's easy to iterate on. I'd like to schedule another call with StormForger for next week to get their input on how we can do the deep testing of nginx-ingress-controller and cluster-autoscaler that we need. @puja108 @pipo02mix WDYT?
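For illustration, a minimal sketch of how a test definition can live in a ConfigMap managed by the chart; the name, namespace, and file contents below are hypothetical and do not reflect the actual loadtest chart. Keeping the definition in a ConfigMap means the test can be changed by editing chart values instead of rebuilding anything.

```yaml
# Hypothetical sketch only - names and contents are illustrative, not the real chart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: loadtest-test-definition   # hypothetical name
  namespace: loadtest              # hypothetical namespace
data:
  # The "Hello World" style test case lives here as a plain file,
  # so iterating on it is just a values/ConfigMap change.
  testcase.js: |
    // placeholder test definition
    // e.g. target the tenant cluster ingress and request "/" at a low, constant rate
```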
Cluster autoscaler only scales based on what is scheduled, so there's no need for actual load; you can even test it by scheduling empty pods with huge resource requests.
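As a hedged sketch of that idea (the name, image, replica count, and request sizes below are illustrative assumptions, not a prescribed setup), a deployment of idle pods with oversized requests leaves pods Pending and makes the cluster-autoscaler add nodes without generating real traffic:

```yaml
# Illustrative only: "empty" pods whose requests exceed spare cluster capacity,
# forcing cluster-autoscaler to scale the node pool without any real load.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaler-dummy-load   # hypothetical name
spec:
  replicas: 10                  # raise this to request more capacity
  selector:
    matchLabels:
      app: autoscaler-dummy-load
  template:
    metadata:
      labels:
        app: autoscaler-dummy-load
    spec:
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.1   # does nothing, only occupies the request
          resources:
            requests:
              cpu: "2"        # deliberately large requests so pods stay Pending
              memory: 4Gi
```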
Yes, I just want to be sure that nginx-ingress-controller and cluster-autoscaler play nicely. I think the focus should be on nginx-ingress-controller. As discussed, I'll come up with an agenda for the StormForger call and we can review it in batman first.
Yeah, it is a fair point, we are only testing the IC at this point. In the future it makes sense to test the whole cluster version (IC, cluster autoscaler, ...) under load. Great work Ross, count me in for the StormForger call.
Status 23 August 2019
I am working on having the first real scenario done. After talking with StormForger, these are the things I am considering to make the first test reliable:
Ross will help with the automation once we know the test we want to run. He will add the e2e test to aws-operator so that it only runs on request.
I could not work on it, as other things with higher priority came up.
As IC is becoming both optional and more configurable, we decided to start testing this with a few customers in batman cycle 3 to develop realistic testing scenarios.
@rossf7, I'm prepping for Cycle Planning. Here's what we discussed in the Pre-Cycle Planning. https://www.dropbox.com/scl/fi/87go47eugq4ikmqe4rp2f/Managed%20Services%20Dec%2010%20Cycle%20Planning.paper?dl=0&rlkey=gjjdj655vp97r342od7pe3p0d LMKYT.
@cokiengchiara I agree it makes sense for the apps team to take it. I just have 2 worries.
Thanks @rossf7, please don't feel guilty. Should we agree to give this to Halo? In the long run, it helps free up IC from you to them as well...
Thanks @cokiengchiara I agree it makes sense to go to Halo.
This moved to Halo. Unassigning myself but happy to help with questions or handover.
@sslavic can talk with @pipo02mix re: his customer's autoscaling settings
Customer said 2 things:
Re 1: IMO maybe 20 is still OK, also considering that AWS instances, for example, have limited bandwidth; in one of our first load tests with a customer the bottleneck was the bandwidth of each EC2 instance, not the IC.
Re 2: I am not sure, it might just need a bit more research, and maybe we should ask the customer where they got their values from. Also to be considered is what gitlab has as defaults: https://gitlab.com/gitlab-org/charts/gitlab/blob/master/charts/nginx/values.yaml#L152-157
As gitlab sometimes moves or deletes things, here is a copy from their values.yaml:
@sslavic I transferred this to the public roadmap repository and lost the labels, sorry :(
Documenting the benefit of HPA: so the NGINX IC can always run optimally, without overloading resources.
@sslavic's explanation: HPA makes the target workload reactive to / scaled with the load. When load (as measured by the chosen metric) goes over our configured threshold, more replicas get created. We use CPU & memory average utilization, with a 50% threshold for both. There is a min and max number of replicas configured, so it doesn't scale out indefinitely.
Without HPA, the number of replicas would be fixed. That typically means (1) when load is low, it would be wasting resources, and (2) when load increases over the initial capacity, it wouldn't utilize available resources and... what are the consequences? So "can always run optimally" is correct and is a very concise way to explain the benefit, while "without it overloading resources" is not really.
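For reference, a minimal sketch of an HPA matching that description (CPU & memory average utilization at 50%, bounded by min/max replicas); the target name, namespace, and replica bounds are assumptions for illustration, not necessarily our actual defaults:

```yaml
# Illustrative HPA matching the explanation above; names and bounds are assumed.
apiVersion: autoscaling/v2beta2   # autoscaling/v2 on newer clusters
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller   # hypothetical name
  namespace: kube-system           # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 2    # illustrative lower bound
  maxReplicas: 20   # illustrative upper bound, so it doesn't scale out indefinitely
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50
```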
@sslavic from your comment here https://gigantic.slack.com/archives/CR99LSZ1N/p1581063843111900?thread_ts=1581059520.110600&cid=CR99LSZ1N I think we need to split the ticket, so we can scope this down to what we can do now, then create a new one for the stuff that doesn't make sense to implement now. WDYT?
Yes, this one is an epic, and a relic of the previous way of managing the backlog/TODO.
Plan to ship today, except Azure. Doing everything on our side, so they can take over when they're ready. |
Shipped AWS. Once we ship the release, we need to create follow-up tickets.
HPA was already available for a while; it was disabled by default and no customer used it on their own, while we had a mix of other solutions (cluster-operator, CronJob) managing the number of nginx replicas. With the work in the last quarter or so, HPA has been enabled by default for selected clusters, across currently supported channels. IMO any follow-up work can be broken down (evaluated, prioritized) and this epic closed.
This is done. @cokiengchiara, create follow-up issues. Talk to @sslavic
@sslavic In Slack, you said, "repeatable load test was one of the TODOs in the epic. rest of the TODOs (and more) I think is covered with epic/nginx-tuning."
Created this: https://github.com/giantswarm/roadmap/issues/141
Goals
Current state:
`workers-array` in `cluster-operator`. `workers-array` is no longer supported as a source of truth for worker counts with cluster-autoscaling.
TODO
- e2e test in `e2etests` that creates a TC, installs the test app and triggers the test - Ross https://github.com/giantswarm/roadmap/issues/141

EDIT: Updated checklist with current state.