
WIP: https & http/2 support (both h2c and full h2) #27

Merged
merged 9 commits into from
May 21, 2019

Conversation

krancour
Contributor

No description provided.

@krancour krancour added the "do not merge" label (A PR author or reviewer requests the PR not yet be merged, even if otherwise approved) on Feb 15, 2019
@krancour krancour self-assigned this Feb 15, 2019
@codecov-io

codecov-io commented Feb 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@6b69328).
The diff coverage is 66.27%.


@@           Coverage Diff            @@
##             master     #27   +/-   ##
========================================
  Coverage          ?   57.3%           
========================================
  Files             ?      11           
  Lines             ?     623           
  Branches          ?       0           
========================================
  Hits              ?     357           
  Misses            ?     237           
  Partials          ?      29
| Impacted Files | Coverage Δ |
|---|---|
| pkg/net/http/single_connection_listener.go | 100% <100%> (ø) |
| pkg/net/http/transport.go | 100% <100%> (ø) |
| pkg/net/http/version.go | 100% <100%> (ø) |
| pkg/net/tcp/proxy.go | 39.62% <39.62%> (ø) |
| pkg/net/http/proxy.go | 55.05% <55.05%> (ø) |
| pkg/net/tcp/dynamic_proxy.go | 61.76% <61.76%> (ø) |
| pkg/net/http/httputil/reverseproxy.go | 71.03% <71.03%> (ø) |
| pkg/net/peekable_connection.go | 93.1% <93.1%> (ø) |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b69328...4ce1b1b.

@krancour krancour added and then removed the "blocker" label (This issue jeopardizes an upcoming release if not adequately addressed) on Feb 21, 2019
@gogiraghu1

Is there a Slack channel for this commit where we can discuss scale-to-zero support?

@gogiraghu1

gogiraghu1 commented Apr 15, 2019

It looks like the hello-osiris example in the main branch only works for service type: LoadBalancer. I see some changes in this commit to address it. Is that true? We tried using ClusterIP in the service YAML and it did not seem to scale out.

@tbeerbower

To add to the above comment: if the hello-osiris example is type: LoadBalancer, then scale to zero works, and scale out can be triggered by accessing the service through its cluster IP.

If the hello-osiris example is type: ClusterIP, then scale to zero works, but accessing the service through its cluster IP results in 'Connection refused' with no scale out.

@krancour
Contributor Author

> Is there a Slack channel for this commit where we can discuss scale-to-zero support?

We don't currently have a Slack channel open to the community.

> It looks like the hello-osiris example in the main branch only works for service type: LoadBalancer. I see some changes in this commit to address it. Is that true?

Any seemingly related changes are coincidental; until now, I was unaware of any issue with services that aren't type LoadBalancer.

Can you clarify whether you see this problem in both the master branch and this branch? Might I suggest opening a separate issue for this, please?

@krancour
Contributor Author

Can you please provide detailed instructions for reproducing the behavior you are observing? In the meantime, after attempting to replicate this myself, I have a pretty good guess at what's going on.

@tbeerbower

@krancour
I just went through the steps to reproduce the issue (basically the hello-osiris demo with the service type: ClusterIP). This time it behaved as expected and scaled back out when I accessed it through the cluster IP. I repeated this several times and all was okay.
I'm not sure what I did wrong last time, but previously I could not get the scale back out to occur when type: ClusterIP was set.
Anyway, things look good now. Thanks for your quick response, and sorry for the false alarm.

@krancour
Contributor Author

@tbeerbower, I'm glad to know things are working for you right now.

> This time it behaved as expected and scaled back out when I accessed it through the cluster IP. I repeated this several times and all was okay.

I can give you a little insight into why I think you might have observed this issue and then seen it go away. I'll describe the steps I used to try to replicate it, and how I was able to do so under one particular condition but not under others...

  1. Installed the latest edge release as indicated by our docs.
  2. Installed the hello-osiris demo as indicated by the docs, with the service type edited to ClusterIP.
  3. Forced a scale-to-zero: kubectl scale deployment hello-osiris --replicas=0
  4. Observed that the hello-osiris endpoints were rewritten by the endpoints controller to point to the activator (a good sign).
  5. Deployed a pod inside the cluster that I could exec into and test from:
    1. kubectl run deb --image debian:latest --command -- sleep 1000
    2. Found the name of this pod and exec'd into it: kubectl exec -it <pod name> -- bash
    3. Installed curl
  6. Used curl to try to initiate a scale-from-zero:
    1. First attempt FAILED: curl http://hello-osiris
    2. Second attempt succeeded: curl http://hello-osiris.default
    3. Third attempt succeeded: curl http://hello-osiris.default.svc.cluster.local

And then I remembered why this happens...

The activator is a multi-tenant component, with "multi-tenancy" in this case meaning that it serves many Osiris-enabled services/deployments across many namespaces. Internally, the activator maintains a map of public IPs, private IPs, and known internal and external DNS names to the specific deployments it may need to reactivate. But here's the thing: there could be a hello-osiris service in multiple namespaces, so the unqualified hello-osiris is not a key in that map. The namespace-qualified names hello-osiris.default, hello-osiris.default.svc.cluster.local (and a few others) are in that map.
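
To illustrate the idea, here's a minimal sketch of that lookup. The type and field names are hypothetical; the real activator's internals differ:

```go
package main

import "fmt"

// appEntry identifies a deployment the activator can scale from zero.
// Hypothetical type; the real activator's types and fields differ.
type appEntry struct {
	namespace  string
	deployment string
}

// index maps namespace-qualified hostnames (and known IPs) to apps.
// Note there is deliberately no entry for the bare "hello-osiris":
// that name would be ambiguous if a hello-osiris service existed in
// more than one namespace.
var index = map[string]appEntry{
	"hello-osiris.default":                   {"default", "hello-osiris"},
	"hello-osiris.default.svc":               {"default", "hello-osiris"},
	"hello-osiris.default.svc.cluster.local": {"default", "hello-osiris"},
	"10.96.214.148":                          {"default", "hello-osiris"},
}

// lookup resolves a request's host to the deployment to activate.
func lookup(host string) (appEntry, bool) {
	app, ok := index[host]
	return app, ok
}

func main() {
	if _, ok := lookup("hello-osiris"); !ok {
		fmt.Println("unqualified name: no match, so no scale-from-zero (by design)")
	}
	if app, ok := lookup("hello-osiris.default"); ok {
		fmt.Printf("qualified name: activate %s/%s\n", app.namespace, app.deployment)
	}
}
```

This is why the unqualified curl in step 6 failed while the qualified ones succeeded.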

So... bottom line is that if you had tested with http://hello-osiris, scale-from-zero would not have worked, and that's by design. If you used any of the other internal DNS names for that service, it would work fine.

Hope this helps explain why you might have seen an issue and then also seen it go away.

@tbeerbower

@krancour, thanks for the explanation. I don't think it applies in my case, as I was always testing with the cluster IP:

curl http://10.96.214.148
Hello, World!

I did see another issue today that may be related. I have another deployment with a ClusterIP service. I can set it up so that it works with Osiris for scale to/from zero. It works fine for some number of requests, but at some point after a scale-in, I'll access it again through the cluster IP and get a bad response with no scale out:

curl -i http://10.96.52.85:9060/foo
HTTP/1.1 502 Bad Gateway

If I immediately retry the curl command, then it scales out and gives the correct response. However, something is not right with the pod: it doesn't contain the Osiris container, so at this point it will never scale back to zero.

Another strange thing: if I delete the deployment and recreate it, the new initial pod will not contain the Osiris container. I think I can only get it working again by changing the name of the deployment and recreating it.

Sorry I don't have clear steps to reproduce. I'll try to get it to happen again with a simple test case.

Any suggestions on how to debug this?

Thanks!

@gogiraghu1

Can we please merge this PR into master? Why has it been pending for so long?

@krancour
Contributor Author

> Can we please merge this PR into master? Why has it been pending for so long?

tbh, it's pending feedback from an internal user at MS who specifically requested this feature and is supposed to be confirming that it has adequately enabled their use case. I'll follow up with them tomorrow.

@gogiraghu1

gogiraghu1 commented May 17, 2019

Thanks, Kent. When I tried to activate with the service endpoint, i.e. service-name.namespace.svc.cluster.local, through the Kubernetes ingress, the activator did not instantiate pods.

We use an ingress controller load balancer to route to a local ingress in a namespace, which talks to a service endpoint.
More specifically, how can I configure Osiris to listen on the ingress-configured service versus the actual load balancer URL that gets forwarded to the Kubernetes ingress?
It looks like only curl http://clusterIP works to activate the pods. I am hoping that is addressed in this PR.

@krancour
Contributor Author

I'm not quite following you. Can you open a separate issue please and describe steps to reproduce?

@gogiraghu1

gogiraghu1 commented May 17, 2019

I finally figured out that the load balancer annotation, osiris.deislabs.io/loadBalancerHostname, was missing from my service YAML. After specifying it, everything worked fine.
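
For anyone else who hits this, here's a minimal sketch of where that annotation lives in a Service manifest. The service name, hostname, and ports below are hypothetical, and other Osiris annotations may also be required; check the docs for the authoritative list:

```yaml
# Hypothetical example -- only the placement of the
# osiris.deislabs.io/loadBalancerHostname annotation is the point here.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    osiris.deislabs.io/loadBalancerHostname: my-app.example.com
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```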

Is there a way we can specify a global default for this value in the values.yaml of the Osiris Helm charts, instead of having to update the service YAML every time we deploy a new service?

One observation: the time taken to terminate idle pods is not exact. For example, with a 150-second metrics polling interval specified in values.yaml, I sometimes see pods terminate only after 5 minutes. I'm not sure why that is the case.

Another observation: if a request comes in while the pods are terminating, we frequently run into 502 errors, even though pod instantiation is happening. Is there any recommendation for this?

@krancour
Contributor Author

krancour commented May 17, 2019

@gogiraghu1, glad to hear it's working for you now.

> When I tried to activate with the service endpoint, i.e. service-name.namespace.svc.cluster.local, through the Kubernetes ingress, the activator did not instantiate pods.

This is what confused me about your issue. When you access service-name.namespace.svc.cluster.local (from within the cluster, of course), there is no ingress involved.

> I finally figured out that the load balancer annotation, osiris.deislabs.io/loadBalancerHostname, was missing from my service YAML. After specifying it, everything worked fine.

I thought your issue might be something like that, but wasn't sure enough to suggest it because of the confusion.

> Is there a way we can specify a global default for this value in the values.yaml of the Osiris Helm charts, instead of having to update the service YAML every time we deploy a new service?

Nope. The purpose of that annotation is to say "this service is the one that needs to be activated by requests for this domain name." If a given domain name were mapped to multiple services, it would be ambiguous which among them should be activated by a corresponding request; for example, if services in two different namespaces both claimed the same hostname, the activator could not know which one to scale up.

If you don't mind, please open a separate issue if you want to discuss that further. I don't want the HTTP/2 thread to be hijacked by something unrelated.

@gogiraghu1

> If you don't mind, please open a separate issue if you want to discuss that further.

Sure, Kent. Will do.
