Queue-based transcoding w/ monq #104
Conversation
This is 🔥 Amazing to see how much simpler the code is with monq. Nice work.
Only real issue is that you ripped out the existing tests and haven't added any replacement.
@@ -0,0 +1,2 @@
test -f .env && dotenv
What's this file for?
It allows .env files to be used (by loading the contents into one's environment separately) without necessitating tools like foreman or libraries like dotenv that tie more tightly to .env files.
(It's really convenient for me but isn't necessary.)
Ok, then I think this needs to be in .gitignore.
docker-compose.yml
Outdated
@@ -1,11 +1,31 @@
version: '2'
services:
  app:
    environment:
Why does mongo need to be in Docker? It's easy enough to install on all OSes, and besides, we need to access the Mongolab DBs in production.
I work on a lot of different projects with a wide variety of dependencies, so this is a convenient way to keep each set of service dependencies isolated. It's for a standalone dev environment, essentially.
Feel free to revert this; I can keep an uncommitted version locally to facilitate development.
docker-compose.yml
Outdated
    volumes:
      - ./bin:/app/bin
Is this for mounting development code? If so, that should be put in the test/docker-compose.yml file; in fact, mounting live code is already solved in that file.
Ditto for a local development environment. Each sub-directory / file is called out so that node_modules doesn't get mounted in (for cross-platform binary reasons). I can keep this uncommitted locally.
@@ -144,6 +158,105 @@ module.exports = [
  },

  /**
   * @api {post} /uploads/:id/:sceneIdx/:imageId Update imagery metadata
I don't fully understand the purpose of this endpoint. I think it's run after process.sh has completed (whether successfully or not)? But I don't understand why it's over HTTP? process.sh has direct access to the DB, so why not just do this in process.sh? Or even better, in fact ideally, in bin/transcoder.js?
It's used for marking status generally--requests will be made when transcoding starts, when different stages are reached, and when it completes (successfully or not).

I'm trying to treat process.sh as a black-boxed external dependency to the extent I'm able. Mapzen uses this same script to transcode DEMs and needs to write footprint data to Postgres, so HTTP seemed like the simplest way to decouple them.

When I add the optional Batch transcoding backend (as an alternative to the bin/transcoder.js monq worker), this may make more sense (transcoding jobs are fully queued rather than having "callbacks" associated with them--it doesn't seem to make sense to have each transcoding node open a connection to Mongo).
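To make the status-callback idea concrete, here is a hedged sketch (not the PR's actual code; all names and the payload shape are illustrative) of how a transcoding worker might build a stage update to POST to the `/uploads/:id/:sceneIdx/:imageId` endpoint instead of opening its own Mongo connection:

```javascript
// Build an HTTP status update for the catalog's metadata endpoint.
// The URL shape matches POST /uploads/:id/:sceneIdx/:imageId; the
// { status: ... } body is an assumption for illustration only.
function buildStatusUpdate(baseUrl, id, sceneIdx, imageId, status) {
  return {
    method: 'POST',
    url: baseUrl + '/uploads/' + id + '/' + sceneIdx + '/' + imageId,
    body: JSON.stringify({ status: status })
  };
}

// e.g. a worker reporting that transcoding has started
var update = buildStatusUpdate('http://localhost:4000', 'abc123', 0, 'img1', 'processing');
console.log(update.url); // http://localhost:4000/uploads/abc123/0/img1
```

Any HTTP client (or curl from process.sh) could then deliver this request at each stage of the job.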
> it doesn't seem to make sense to have each transcoding node open a connection to Mongo

Can you explain this in detail please?
I think there are a few problems with this approach. As mentioned, using HTTP, when there's a perfectly good Mongo socket connection already there. Supporting Batch is out of scope for this work, not to mention that Batch is platform-specific - in fact I'm sure that the "on-demand worker instance" paradigm is significantly cheaper almost everywhere apart from AWS. If we're going to be accommodating said paradigm, then there are broader benefits to supporting a platform-agnostic method. However, I don't think OAM's transcoding load is going to scale to the point of seeing architectural cost-benefits for a long time, and even then, the simpler approach is just to host somewhere cheaper. Also, process.sh should first and foremost be serving OAM; it doesn't need to cater to anyone else. And even if theoretically there was a justification for HTTP here, then strictly speaking it should exist as a separate API, therefore a distinct service. Having this endpoint on the catalog is unintuitive and sets a bad precedent of blurring separation of concerns.
@smit1678 is Batch out of scope? I've been assuming that it's not, for these reasons:
- we'd planned on incorporating it when the uploader got refactored to transcode imagery, but Nick ran out of time
- OAM occasionally needs to handle ~50GB uploads and process them; in the current paradigm, this means keeping hundreds of GB of space available for when this does happen
- after disasters, large quantities of imagery become available and should be ingested into OAM (I just did this with DG's hurricane-related imagery this/last week). being able to do this rapidly benefits many parties. using Batch makes this process much quicker--Irma imagery took ~12 hours to fully ingest because it could only process 4 images at a time (and I managed to use up the burst credits on the t2 instance; granted, an AWS-specific thing).
- it represents progressive enhancement when running on AWS; the existing worker implementation (mongodb/monq) can be used when Batch isn't available
marblecutter-tools (which includes process.sh) was extracted and made generic based on the transcoding utilities in oam-dynamic-tiler. Mapzen is using these for DEM data, and re-adding application-specific code will only make it diverge again.
Transcoding is logically a separate service ("service / tool that produces COGs") distinct from the catalog API (which deals in metadata management + uploads). Without adding application-specific service dependencies (mongo), do you have any suggestions on how to communicate progress, status, and eventually submit generated metadata about items that have been queued?
> is Batch out of scope? I've been assuming that it's not, for these reasons:
Correct, I don't think this is out of scope at all; it's something we've been thinking about for a while. But I don't think scope is the right question here. Let's agree upon an implementation method, as we talked about, for extracting and refactoring the dynamic-tiler into a set of tools. Let's make sure we're on the same page there.
For some reason the Travis build didn't trigger on this. But I know for a fact the tests won't pass for this PR. I've changed the Travis settings, so it should build on the next commit. So at the very least this won't get merged until the existing tests pass.

Edit: Just figured it out: the Travis build contains encrypted ENV vars, so external PRs will not run the integration tests. Seth, do you not have direct commit access on this repo?
This reverts commit 72bbbd1.
I didn't actually remove any tests--just the 1 test […] does currently fail when running […] (maybe this belongs in an issue). Can you update the docs on how to run the integration tests (via […])?
However, it seems to fail to initialize while indexing the staging bucket:
The tests are passing on Travis from the parent commit, so […]. Are you using that […]? You can update […].
@tombh @mojodna What do we need to do to come to a consensus here? The current plan is to continue to stay on AWS, so we should be open to finding wins where we can by leveraging services within AWS. I think our time is better spent on other things than making sure we're platform agnostic, so I don't think platform-specific code should be an issue here.
The point is that supporting Batch is out of scope. The next sentence is personal opinion framed by the incidental rhetoric of my "not to mention". The main point is that platform-specific support is simply not part of the scope of this work, whereas "thorough refactoring" is. And even if it were theoretically part of scope, even then it doesn't justify the use of a new HTTP endpoint to update job statuses.

Edit: And completely off topic, I would very strongly disagree that our time is better spent making sure we're platform agnostic. I know for a fact that the tiler's necessity for Lambda has cost me significant development and debugging time.
Can you elaborate / ask for help? If you need to run it locally, there's a Flask version of the server available.
Requires a queue and job definition to exist and be configured.
Firstly, thanks for fixing that test. I've checked and it passes. Also, whilst I still strongly disagree with the use of Batch, your […]. Now, most importantly, this HTTP endpoint. I have a better idea of […].

I want to note that […]. Now let's consider Marblecutter Tools' […].

To my mind, a refactored tool like Marblecutter Tools […]:

```
mbtools mkcog florida.tiff
mbtools thumbnail -width 300 -height 300 florida.tiff
mbtools footprint florida.tiff
```

This is what I've spoken about before in regards to making […]. All of which is to say Mapzen and OAM should be responsible for their own […]:

```js
import 'db';
import 'worker';
import 's3';
import 'metadata_gen';

worker.progress('downloading gtiff');
s3.get(gtiff);
worker.progress('converting to cog');
spawn('mbtools mkcog ' + gtiff);
worker.progress('generating thumbnail');
spawn('mbtools thumbnail -width 300 -height 300 ' + gtiff);
worker.progress('generating footprint');
spawn('mbtools footprint ' + gtiff);
worker.progress('generating metadata');
meta = metadata_gen(gtiff);
worker.progress('uploading assets');
s3.put(gtiff);
s3.put(meta);
worker.progress('completed');
```
@tombh Can you please break the refactor and other ideas out into separate tickets? We can then come back to see how we can continue to improve.

@tombh @mojodna To make sure we're clear on the discussion at hand, the open and on-topic questions are: […]
I understand the refactoring and the HTTP endpoint to be precisely the same topic. I don't think it's right to make it into another ticket or come back to it later. I need to understand the thinking behind why an HTTP endpoint is the right approach here. Yes, the integration test still needs to be fixed.
I can't get the integration tests to run in my environment: […]

How: […]

Docker Compose appears to be set up to expose […]
For the HTTP endpoint used to update status / provide metadata, what do you (or anyone) see as alternatives? (I can't think of any reasonable alternatives right now.) Mongo isn't an option because it's an implementation detail specific to OAM, and workers need to update from wherever they're running (there may be 1 running locally or there may be 1000 running across a cluster somewhere with available CPU; it feels like optimizing for the latter makes the former possible, but not vice versa).

Re: refactoring, all of that makes sense (and it's very helpful to see the process broken out through someone else's eyes), though beyond being cleaner, I don't see any immediate benefits. Steps 1 and 5 only occur if remote URIs are provided, and thumbnailing is really the only optional part of the process (thumbnail size can be set using […]).
This was necessary to get the tests running on macOS where Mongo was running on the host (binding to 0.0.0.0), docker-compose'd app running in Docker (proxied to localhost using Docker for Mac), and the tests running on the host.
      sensor: scene.sensor
    },
    title: scene.title
  }
Potential metadata changes (these all match, but there may be parts that are missing).
  };
}), callback);

// replace the urls list with a list of _id's
// NOTE this causes side-effects
I don't think these side-effects (which have always been present) are problematic, but they're something to look out for.
meta.gsd = request.payload.properties.resolution_in_meters;
meta.meta_uri = meta.uuid.replace(/\.tif$/, '_meta.json');
meta.properties = Object.assign(meta.properties, request.payload.properties);
meta.properties.tms = `${config.tilerBaseUrl}/${request.params.id}/${request.params.sceneIdx}/${request.params.imageId}/{z}/{x}/{y}.png`;
^^ potential API changes
(especially GeoJSON vs. WKT)
This replaces the bespoke queue / worker implementation with monq and adds a local worker that uses marblecutter-tools to transcode imagery. Status updates and final metadata are provided using HTTP callbacks.
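As a rough illustration of the queue/worker pattern this PR adopts, here is a toy in-process sketch: named job handlers are registered with a worker, jobs are enqueued by name, and the worker drains the queue. This is not monq itself (monq persists jobs in MongoDB rather than an in-memory array), and all names here are illustrative assumptions.

```javascript
// Minimal in-memory imitation of the register/enqueue/process cycle.
function makeQueue() {
  var handlers = {};
  var jobs = [];
  return {
    // register named job handlers, e.g. { transcode: fn }
    register: function (map) { Object.assign(handlers, map); },
    // enqueue a job by name with its params
    enqueue: function (name, params) { jobs.push({ name: name, params: params }); },
    // process every queued job and collect the results
    drain: function () {
      var results = [];
      jobs.forEach(function (job) {
        handlers[job.name](job.params, function (err, result) {
          results.push(err || result);
        });
      });
      return results;
    }
  };
}

var queue = makeQueue();
queue.register({
  transcode: function (params, callback) {
    // a real handler would shell out to process.sh here
    callback(null, 'transcoded ' + params.file);
  }
});
queue.enqueue('transcode', { file: 'scene.tif' });
console.log(queue.drain()); // [ 'transcoded scene.tif' ]
```

The appeal of this shape for the PR is that the HTTP-callback reporting can live inside the job handler, keeping the transcoding script itself free of any Mongo dependency.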