Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On download, ensure resolve() endpoints exhaust all object locations #415

Open
csjx opened this issue Dec 11, 2017 · 6 comments
Open

On download, ensure resolve() endpoints exhaust all object locations #415

csjx opened this issue Dec 11, 2017 · 6 comments

Comments

@csjx
Copy link
Member

csjx commented Dec 11, 2017

We enable direct object downloads using the Download button in the MetadataView, and the URLs in the dataone theme default to CN resolve() URL endpoints. This call returns a response body with a Types.ObjectLocationList payload, and an HTTP 303 redirection code to the first URL in the object location list. If there is any failure on the redirected request, we need to catch the exception and iterate through the rest of the ObjectLocationList to try each replica URL until we succeed with the download. Only after exhausting the list should we show any sort of error message. Currently, we don't catch the error to provide a message - the browser just shows an error in the download status.

We should also think about providing a configuration flag in all themes to request to use the CN resolve URL by default, or to just go straight to the MN get() endpoint.

@csjx
Copy link
Member Author

csjx commented Nov 25, 2019

Moving comments from https://github.com/NCEAS/metacat/1407

We've discussed the issue where a repository in the down state, or just inaccessible because of network issues, will not be able to deliver objects with MN.get(). For people downloading from search.dataone.org, they can view the metadata (since it's replicated to the CN), but the data objects are unavailable, even when they are replicated to multiple member nodes.

@laurenwalker pointed out that XHR calls to CN.resolve() to get the ObjectLocationList for each object results in a 30x redirect, and the browser automatically follows the redirect to the (at times unavailable) first replica in the list. The ordering of the replicas on the CN side is being addressed in https://redmine.dataone.org/issues/8853 to de-prioritize replicas that are known to be unavailable (INVALID, down node status, etc). But MetacatUI needs to be able to iterate through the ObjectLocationList nonetheless and try/catch failures until success or the list is exhausted.

From this StackOverflow question we see that the browsers are doing what they are supposed to with XHR calls - follow the redirect.

However, we might be able to use the new Fetch API when calling CN.resolve()and set the Request.redirect property to manual so we can obtain the replica list and loop through it. Of course, this depends on browser support. Other ideas welcome!

I did some quick testing in a JSFiddle, and it turns out that fetch(url, { redirect: "manual"}) does not expose the body of the original request in the response (it is set to null), which is the same case with { redirect: "error"}, which just throws a NetworkError if the response code is 3xx.

Ultimately, there's no way to access the CN.resolve() body in the browser. From discussion here and here, the WHATWG Fetch API specifically disallows access for cross-site scripting security reasons. I would imagine this is the same reason that the browser also does the same for XHR requests (specifically disallowing).

The only two ways forward I see are

  1. Changing the CN response code from 303 to 200 and require the client to iterate through the ObjectLocationList body (unlikely, given this is a full CI stack change - @datadavev @mbjones can comment)

  2. MetacatUI tries to follow the redirect, and catches any errors (404, timeout, etc.) and then creates its own replica list from the SystemMetadata.replica list, which would involve discovering the baseUrl of each replica node from the node list given a nodeId, and tries to fetch the object from each replica in the list until it succeeds. Since the replica list should include status of each replica, we could hit only COMPLETED replicas (skipping those with FAILED|QUEUED|REQUESTED|INVALIDATED).

@amoeba
Copy link
Contributor

amoeba commented Nov 26, 2019

Thanks for bringing the details in here. (2) doesn't sound too bad. Other ideas might be:

  1. Have MetacatUI set either the Accept (or another header, like X-PLZ-NO-REDIRECT: true) to override the redirect behavior at the servlet level. For Accept, the value could be set to either (text|application)/xml or even a string representing the http://ns.dataone.org/service/types/v1#objectLocationList to make it even more clear.
  2. Have the servlet pre-flight the redirect before responding.

@csjx
Copy link
Member Author

csjx commented Nov 27, 2019

Yeah @amoeba - I like the negotiated Accept type idea, so the default behavior would be 303, and optionally a 200 if the ObjectLocationList is desired and the Accept header is whatever we decide is a good one. Both (3) and (4) still involve a CN change, but as in (1), I suppose those are just d1_cn_rest changes, and not type or library changes. Unless we hear from @datadavev or @mbjones regarding cycles to put towards this on the CN side, I think we would have to go with (2) at a minimum.

@csjx
Copy link
Member Author

csjx commented Feb 18, 2020

More comments and ideas from a Slack thread:

lauren 1:44 PM
The download via CORS issue is fixed… but no, the DataONE CN resolve service doesn’t seem to work like I would expect it to, since even though the connection to the replica server timed out, the resolve service doesn’t try to fetch the object from a different replica server
1:44
Maybe the other devs here know why that is
1:45
or how that could be fixed

rossdm 1:45 PM
ok, strange...it looks like the download links described above are working now
new messages
1:46
will be a problem if one replica going offline affects the overall package, kind of the opposite desired effect....will let our users know that the issue seems to be(?) resolved (edited)
1:46
ty for letting me know

chris: 1:48 PM
@lauren - Since the resolve service is just a pass-through, this really is a client issue. I think in the short term our only fix is #2 in #415 (comment) . When we have resources to make changes on the CN side, we could pursue that too.

lauren 1:48 PM
I’m just not sure how we are supposed to tell if a download has failed in the browser
1:48
I guess I’ll look into it
1:48
It seems like a lot of work for the client to do

chris: 1:49 PM
I think it’s a matter of a try/catch - so if you get a network error or a 404 or the like, you move on to the next replica in the list.
1:50
the error callback in the XHR would need to handle it I think

lauren 1:50 PM
That makes sense for XHR downloads, but we just fixed the download functionality so that it only sends XHR when the object is private
1:51
Otherwise, we just make an HTML link with the download attribute and just let the browser handle it

chris: 1:51 PM
Ah right. That definitely complicates it 🙂

lauren 1:52 PM
I think the only solution would be to show multiple links to the user, for each replica. So the user can manually try another link

chris: 1:54 PM
You could also do an XHR call to MNCore.describe(pid) (which is an HTTP HEAD /object/{pid} ) and if that succeeds, put in the link to MNCore.read()

davev 1:54 PM
link status should really be tracked by the CNs to prioritize resolve, though that adds a big load to CN work
1:56
It would be great if the CN could receive notice of failed links from clients. Something like a client does the head requests as Chris described and pings the CN with the info
1:57
Kind of like pushing the work onto the clients, but exposing the results of that work to future clients.

chris: 1:57 PM
Well, the replica auditor does that to a certain degree (checking fixity and availability), but it also needs some work

@csjx
Copy link
Member Author

csjx commented Apr 13, 2020

Bump. @laurenwalker - In the ESS-DIVE discussion today, this bug came up as a priority. Can we add this into a up and coming milestone?

@laurenwalker laurenwalker added this to the 2.12.0 milestone Apr 14, 2020
@laurenwalker
Copy link
Member

Ok, I added it to 2.12.0

@laurenwalker laurenwalker modified the milestones: 2.12.0, 2.13.0 May 20, 2020
@laurenwalker laurenwalker removed this from the 2.13.0 milestone Jul 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants