
WIP: https & http/2 support (both h2c and full h2) #27

Merged
merged 9 commits into from
May 21, 2019

Conversation

krancour
Contributor

No description provided.

@krancour krancour added the "do not merge" label (A PR author or reviewer requests the PR not yet be merged, even if otherwise approved) on Feb 15, 2019
@krancour krancour self-assigned this Feb 15, 2019
@codecov-io

codecov-io commented Feb 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@6b69328).
The diff coverage is 66.27%.


@@           Coverage Diff            @@
##             master     #27   +/-   ##
========================================
  Coverage          ?   57.3%           
========================================
  Files             ?      11           
  Lines             ?     623           
  Branches          ?       0           
========================================
  Hits              ?     357           
  Misses            ?     237           
  Partials          ?      29
| Impacted Files | Coverage Δ |
|---|---|
| pkg/net/http/single_connection_listener.go | 100% <100%> (ø) |
| pkg/net/http/transport.go | 100% <100%> (ø) |
| pkg/net/http/version.go | 100% <100%> (ø) |
| pkg/net/tcp/proxy.go | 39.62% <39.62%> (ø) |
| pkg/net/http/proxy.go | 55.05% <55.05%> (ø) |
| pkg/net/tcp/dynamic_proxy.go | 61.76% <61.76%> (ø) |
| pkg/net/http/httputil/reverseproxy.go | 71.03% <71.03%> (ø) |
| pkg/net/peekable_connection.go | 93.1% <93.1%> (ø) |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b69328...4ce1b1b.

@krancour krancour added and then removed the "blocker" label (This issue jeopardizes an upcoming release if not adequately addressed) on Feb 21, 2019
@gogiraghu1

Is there a Slack channel for this commit where we can discuss scale-to-zero support?

@gogiraghu1

gogiraghu1 commented Apr 15, 2019

It looks like the hello-osiris example in the main branch only works for service type: LoadBalancer. I see some changes in this commit to address it. Is that true? We tried using ClusterIP in the service YAML and it did not seem to scale out.

@tbeerbower

To add to the above comment: if the hello-osiris example is type: LoadBalancer, then scale to zero works, and scale out can be triggered by accessing the service through its cluster IP.

If the hello-osiris example is type: ClusterIP, then scale to zero works, but accessing the service through its cluster IP results in 'Connection refused' with no scale out.

@krancour
Contributor Author

> Is there a Slack channel for this commit where we can discuss scale-to-zero support?

We don't currently have a Slack channel open to the community.

> It looks like the hello-osiris example in the main branch only works for service type: LoadBalancer. I see some changes in this commit to address it. Is that true?

Any seemingly related changes are coincidental; until now, I was unaware of any issue with services that aren't type LoadBalancer.

Can you clarify whether you see this problem in both the master branch and this branch? Might I suggest opening a separate issue for this, please?

@krancour
Contributor Author

Can you please provide detailed instructions for reproducing the behavior you are observing? In the meantime, after attempting to replicate this myself, I have a pretty good guess at what's going on.

@tbeerbower

@krancour
I just went through the steps to reproduce the issue (basically the hello-osiris demo with the service type: ClusterIP). This time it behaved as expected and scaled back out when I accessed it through the cluster IP. I repeated this several times and all was okay.
I'm not sure what I did wrong last time, but previously I could not get the scale back out to occur when type: ClusterIP was set.
Anyway, things look good now. Thanks for your quick response, and sorry for the false alarm.

@krancour
Contributor Author

@tbeerbower, I'm glad to know things are working for you right now.

> This time it behaved as expected and scaled back out when I accessed it through the cluster IP. I repeated this several times and all was okay.

I can give you a little insight into why I think you might have observed this issue and then seen it go away. I'll describe the steps I used to try to replicate it, and how I was able to do so under one particular condition but not under others...

  1. Installed the latest edge release as indicated by our docs.
  2. Installed the hello-osiris demo as indicated by the docs, with the service type edited to ClusterIP.
  3. Forced a scale-to-zero: kubectl scale deployment hello-osiris --replicas=0
  4. Observed that the hello-osiris endpoints were rewritten by the endpoints controller to point to the activator (a good sign).
  5. Deployed a pod inside the cluster that I could exec into and test from:
    1. kubectl run deb --image debian:latest --command -- sleep 1000
    2. Found the name of this pod and exec'd into it: kubectl exec -it <pod name> -- bash
    3. Installed curl
  6. Used curl to try to initiate a scale-from-zero:
    1. First attempt FAILED: curl http://hello-osiris
    2. Second attempt succeeded: curl http://hello-osiris.default
    3. Third attempt succeeded: curl http://hello-osiris.default.svc.cluster.local

And then I remembered why this happens...

The activator is a multi-tenant component, with "multi-tenancy" in this case meaning that it serves many Osiris-enabled services/deployments across many namespaces. Internally, the activator maintains a map of public IPs, private IPs, and known internal and external DNS names to the specific deployments it may need to reactivate. But here's the thing: there could be a hello-osiris service in multiple namespaces, so the unqualified hello-osiris is not a key in that map. The namespace-qualified names hello-osiris.default, hello-osiris.default.svc.cluster.local (and a few others) are in that map.
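
To illustrate the idea, here's a minimal sketch of that lookup. The type and field names are hypothetical; the real activator's internals differ:

```go
package main

import "fmt"

// appEntry identifies a deployment the activator can scale from zero.
// Hypothetical type; the real activator's types and fields differ.
type appEntry struct {
	namespace  string
	deployment string
}

// index maps namespace-qualified hostnames (and known IPs) to apps.
// Note there is deliberately no entry for the bare "hello-osiris":
// that name would be ambiguous if a hello-osiris service existed in
// more than one namespace.
var index = map[string]appEntry{
	"hello-osiris.default":                   {"default", "hello-osiris"},
	"hello-osiris.default.svc":               {"default", "hello-osiris"},
	"hello-osiris.default.svc.cluster.local": {"default", "hello-osiris"},
	"10.96.214.148":                          {"default", "hello-osiris"},
}

// lookup resolves a request's host to the deployment to activate.
func lookup(host string) (appEntry, bool) {
	app, ok := index[host]
	return app, ok
}

func main() {
	if _, ok := lookup("hello-osiris"); !ok {
		fmt.Println("unqualified name: no match, so no scale-from-zero (by design)")
	}
	if app, ok := lookup("hello-osiris.default"); ok {
		fmt.Printf("qualified name: activate %s/%s\n", app.namespace, app.deployment)
	}
}
```

This is why the unqualified curl in step 6 failed while the qualified ones succeeded.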

So... bottom line is that if you had tested with http://hello-osiris, scale-from-zero would not have worked, and that's by design. If you used any of the other internal DNS names for that service, it would work fine.

Hope this helps explain why you might have seen an issue and then also seen it go away.

@tbeerbower

@krancour, thanks for the explanation. I don't think it applies in my case, as I was always testing with the cluster IP:

curl http://10.96.214.148
Hello, World!

I did see another issue today that may be related. I have another deployment with a ClusterIP service. I can set it up so that it works with Osiris for scale to/from zero. It works fine for some number of requests, but at some point after a scale-in, I'll access it again through the cluster IP and get a bad response with no scale out:

curl -i http://10.96.52.85:9060/foo
HTTP/1.1 502 Bad Gateway

If I immediately retry the curl command, then it scales out and gives the correct response. However, something is not right with the pod: it doesn't contain the Osiris container, so at this point it will never scale back to zero.

Another strange thing: if I delete the deployment and recreate it, the new initial pod will not contain the Osiris container. I think I can only get it working again by changing the name of the deployment and recreating it.

Sorry I don't have clear steps to reproduce. I'll try to get it to happen again with a simple test case.

Any suggestions on how to debug this?

Thanks!

@gogiraghu1

Can we please merge this PR into master? Why has it been pending for so long?

@krancour
Contributor Author

> Can we please merge this PR into master? Why has it been pending for so long?

tbh, it's pending feedback from an internal user at MS who specifically requested this feature and is supposed to be confirming that it has adequately enabled their use case. I'll follow up with them tomorrow.

@gogiraghu1

gogiraghu1 commented May 17, 2019

Thanks, Kent. When I tried to activate with the service endpoint, i.e. service-name.namespace.svc.cluster.local, through the Kubernetes ingress, the activator did not instantiate pods.

We use an ingress controller load balancer to route to a local ingress in a namespace, which talks to a service endpoint.
More specifically, how can I configure Osiris to listen on the ingress-configured service versus the actual load balancer URL that gets forwarded to the Kubernetes ingress?
It looks like only curl http://clusterIP works to activate the pods. I am hoping that is addressed in this PR.

@krancour
Contributor Author

I'm not quite following you. Can you open a separate issue please and describe steps to reproduce?

@gogiraghu1

gogiraghu1 commented May 17, 2019

I finally figured out that the load balancer annotation, osiris.deislabs.io/loadBalancerHostname, was missing from my service YAML. After specifying it, everything worked fine.
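
For anyone else who hits this, here's a minimal sketch of where that annotation lives in a Service manifest. The service name, hostname, and ports below are hypothetical, and other Osiris annotations may also be required; check the docs for the authoritative list:

```yaml
# Hypothetical example -- only the placement of the
# osiris.deislabs.io/loadBalancerHostname annotation is the point here.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    osiris.deislabs.io/loadBalancerHostname: my-app.example.com
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```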

Is there a way we can specify a global default for this value in the values.yaml of the Osiris Helm charts, instead of having to update the service YAML every time we deploy a new service?

One observation: the time taken to terminate idle pods is not exact. For example, with a 150-second metrics polling interval specified in values.yaml, I sometimes see pods terminate only after 5 minutes. I'm not sure why that is the case.

Another observation: if a request comes in while the pods are terminating, we frequently run into 502 errors, even though pod instantiation is happening. Is there any recommendation for this?

@krancour
Contributor Author

krancour commented May 17, 2019

@gogiraghu1, glad to hear it's working for you now.

> When I tried to activate with the service endpoint, i.e. service-name.namespace.svc.cluster.local, through the Kubernetes ingress, the activator did not instantiate pods.

This is what confused me about your issue. When you access service-name.namespace.svc.cluster.local (from within the cluster, of course), there is no ingress involved.

> I finally figured out that the load balancer annotation, osiris.deislabs.io/loadBalancerHostname, was missing from my service YAML. After specifying it, everything worked fine.

I thought your issue might be something like that, but wasn't sure enough to suggest it because of the confusion.

> Is there a way we can specify a global default for this value in the values.yaml of the Osiris Helm charts, instead of having to update the service YAML every time we deploy a new service?

Nope. The purpose of that annotation is to say "this service is the one that needs to be activated by requests for this domain name." If a given domain name were mapped to multiple services, it would be ambiguous which among them should be activated by a corresponding request; for example, if services in two different namespaces both claimed the same hostname, the activator could not know which one to scale up.

If you don't mind, please open a separate issue if you want to discuss that further. I don't want the HTTP/2 thread to be hijacked by something unrelated.

@gogiraghu1

> If you don't mind, please open a separate issue if you want to discuss that further.

Sure, Kent. Will do.
