Make maintenance-mode more bulletproof #186

Open
nathanielrindlaub opened this issue Apr 26, 2024 · 3 comments

@nathanielrindlaub

When we are deploying major changes to prod and need to shut down inputs temporarily, we currently set both the ingestion Lambda and the frontend into maintenance mode. For the ingestion Lambda, `MAINTENANCE_MODE: true` pauses the creation of new image records when images are uploaded to the ingestion bucket and instead routes those images to a "parking-lot" bucket. They live there until we've completed the updates and set maintenance mode back to `false`, at which point we can move them back to the ingestion bucket for processing.
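
Below is a minimal sketch of what that maintenance-mode branch of the ingestion Lambda might look like, assuming an S3-triggered handler and a `MAINTENANCE_MODE` env var; this is illustrative, not the actual animl-ingest code, and the bucket name is taken from later in this thread:

```ts
import { S3Client, CopyObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3';
import type { S3Event } from 'aws-lambda';

const s3 = new S3Client({});
const PARKING_LOT_BUCKET = 'animl-images-parkinglot-prod';

export async function handler(event: S3Event): Promise<void> {
  for (const record of event.Records) {
    const Bucket = record.s3.bucket.name;
    const Key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    if (process.env.MAINTENANCE_MODE === 'true') {
      // Divert the image to the parking lot instead of creating a record
      await s3.send(new CopyObjectCommand({
        Bucket: PARKING_LOT_BUCKET,
        CopySource: `${Bucket}/${Key}`,
        Key,
      }));
      await s3.send(new DeleteObjectCommand({ Bucket, Key }));
      continue;
    }

    // ... normal path: create the image record and kick off inference
  }
}
```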

When the frontend is in maintenance mode, a splash-screen is displayed that prevents users from accessing the app.

This works ok, but it's not perfect, as we learned today. There are two main problems:

  1. If the frontend is already loaded in a browser tab on a user's computer and they haven't refreshed it, they will still be able to access and interact with the frontend (edit labels, initiate bulk uploads) until they refresh the page and their cached files are updated with `MAINTENANCE_MODE: true`. So we need to figure out some way to force the user to refresh the page, perhaps by using Cognito to log out all users at once? Another idea might be to set up a maintenance mode for the GraphQL API, so that even if a user has access to the frontend, any actions they take would get rejected by the API.
  2. Users may have initiated bulk uploads before we set the ingestion Lambda into maintenance mode. If the zip was received and the batch job was started before we turn on maintenance mode, the batch job would validate and unzip those images, then move them to the ingestion bucket one by one, at which point the ingestion Lambda would move them to the parking-lot bucket (because it's now in maintenance mode), and the images would sit there with S3 keys that look like `<batchId>/path/to/image.jpg`. That is fine until we manually move them from the parking-lot bucket back to the ingestion bucket: because there's a batchId in the key, Animl assumes the image is part of a batch, but depending on how much time has elapsed, that batch's corresponding SQS queues may have been torn down already, so inference would fail (see the sketch after this list).
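
To make that failure mode concrete, here's a hedged sketch of the kind of key check that would mis-route those parked images; the helper and the prefix format are hypothetical, invented purely for illustration:

```ts
// Hypothetical helper: decide whether an S3 key belongs to a batch based on
// its top-level prefix (assuming, for illustration only, that batch IDs
// carry a recognizable 'batch-' prefix).
function batchIdFromKey(key: string): string | null {
  const [prefix] = key.split('/');
  return prefix && prefix.startsWith('batch-') ? prefix : null;
}

// An image parked mid-upload keeps its batch prefix, so when we move it back
// to the ingestion bucket it still looks like part of that batch...
const batchId = batchIdFromKey('batch-abc123/path/to/image.jpg'); // 'batch-abc123'
// ...but if that batch's SQS queues have already been torn down, routing the
// image to them fails and inference never runs.
```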

For now, I think the low-tech solution to that issue will be to add a step to our production deployment workflows to manually check batch logs and the DB to make sure there aren't any fresh uploads that are in progress but haven't yet been fully unzipped. In the DB, those batches would have a `created: <date_time>` property but wouldn't yet have `uploadComplete`, `processingStart`, or `ingestionComplete` fields. I'm not sure what a less manual approach might look like; I'd have to think some more on that.
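
For reference, a query along these lines could surface those in-flight batches. This is a hedged sketch using the Node MongoDB driver; the DB and collection names are assumptions, while the field names come from the comment above:

```ts
import { MongoClient } from 'mongodb';

// Find batches that were created but never finished uploading/ingesting.
async function findInFlightBatches(uri: string) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    return await client
      .db('animl')            // assumed DB name
      .collection('batches')  // assumed collection name
      .find({
        created: { $exists: true },
        uploadComplete: { $exists: false },
        processingStart: { $exists: false },
        ingestionComplete: { $exists: false },
      })
      .toArray();
  } finally {
    await client.close();
  }
}
```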

@nathanielrindlaub

@jue-henry looked into how we could use Cognito to log users out, but I am not sure we even need to log them out... we really just need to force a page refresh when the frontend is in maintenance mode. I think a solution could be:

  1. Create a Maintenance Mode parameter in the SSM Parameter Store, which both animl-ingest and animl-api could retrieve at runtime, instead of having to hard-code the value and redeploy.
  2. If animl-api is in M.M., throw an error early on all /external calls that indicates it's in M.M.
  3. On the frontend, check for that error on each call, and if it's detected, force a page reload (sketched below).
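
A hedged sketch of how 1 and 2 could fit together; the parameter name, error code, and 60-second cache are assumptions, not existing animl-api behavior:

```ts
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';
import { GraphQLError } from 'graphql';

const ssm = new SSMClient({});

// (1) Read the shared flag from SSM at runtime, with a short cache so we
// don't hit Parameter Store on every request.
let cached: { value: boolean; fetchedAt: number } | null = null;

async function inMaintenanceMode(): Promise<boolean> {
  if (cached && Date.now() - cached.fetchedAt < 60_000) return cached.value;
  const res = await ssm.send(
    new GetParameterCommand({ Name: '/animl/maintenance-mode' }) // assumed name
  );
  cached = { value: res.Parameter?.Value === 'true', fetchedAt: Date.now() };
  return cached.value;
}

// (2) In animl-api, fail /external calls early with a recognizable code.
export async function assertNotInMaintenanceMode(): Promise<void> {
  if (await inMaintenanceMode()) {
    throw new GraphQLError('Animl is in maintenance mode', {
      extensions: { code: 'MAINTENANCE_MODE' }, // assumed error code
    });
  }
}
```

Step 3 would then just be the frontend checking each GraphQL response for that `MAINTENANCE_MODE` code and calling `window.location.reload()` when it sees it.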

So the workflow for setting the app into M.M. would be (steps 4 and 8 are sketched after the list):

  1. Set the hard-coded M.M. variable to true in the frontend config, deploy to prod, and clear the CloudFront cache.
  2. Check batch logs and the DB for any fresh uploads that are in progress but haven't yet been unzipped.
  3. Set the SSM M.M. param to true.
  4. Wait for messages in ALL SQS queues to wind down to zero (i.e., if there's currently a bulk upload job being processed, wait for it to finish).
  5. Back up the prod DB by running npm run export-db-prod from the animl-api project root.
  6. Deploy animl-api to prod.
  7. Turn off IN_MAINTENANCE_MODE in SSM first, then in animl-frontend (deploy the frontend to prod and clear the CloudFront cache).
  8. Copy any images that happened to land in animl-images-parkinglot-prod while the stacks were being deployed over to animl-images-ingestion-prod, and then delete them from the parking-lot bucket.
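
For what it's worth, steps 4 and 8 are scriptable. Here's a hedged sketch using the AWS SDK v3; the queue URL and polling interval are placeholders, and keys with special characters would need URL-encoding in `CopySource`:

```ts
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';
import {
  S3Client,
  ListObjectsV2Command,
  CopyObjectCommand,
  DeleteObjectCommand,
} from '@aws-sdk/client-s3';

const sqs = new SQSClient({});
const s3 = new S3Client({});

// Step 4: poll until a queue reports zero visible and in-flight messages.
async function waitForDrain(queueUrl: string): Promise<void> {
  for (;;) {
    const res = await sqs.send(new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: [
        'ApproximateNumberOfMessages',
        'ApproximateNumberOfMessagesNotVisible',
      ],
    }));
    const visible = Number(res.Attributes?.ApproximateNumberOfMessages ?? '0');
    const inFlight = Number(res.Attributes?.ApproximateNumberOfMessagesNotVisible ?? '0');
    if (visible + inFlight === 0) return;
    await new Promise((r) => setTimeout(r, 30_000)); // re-check every 30s
  }
}

// Step 8: move everything from the parking lot back to the ingestion bucket.
async function emptyParkingLot(): Promise<void> {
  let ContinuationToken: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: 'animl-images-parkinglot-prod',
      ContinuationToken,
    }));
    for (const obj of page.Contents ?? []) {
      if (!obj.Key) continue;
      await s3.send(new CopyObjectCommand({
        Bucket: 'animl-images-ingestion-prod',
        CopySource: `animl-images-parkinglot-prod/${obj.Key}`,
        Key: obj.Key,
      }));
      await s3.send(new DeleteObjectCommand({
        Bucket: 'animl-images-parkinglot-prod',
        Key: obj.Key,
      }));
    }
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
}
```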
