Add a where clause to support soft delete #326

Open
wants to merge 9 commits into base: master

@ankitml ankitml commented Nov 17, 2022

Why

Many databases in the wild are bloated not only by dead tuples but also by zombie tuples, i.e. data that stays in a table only because cleaning up and pruning old data is hard.
pg_repack is everyone's favourite tool for cleaning up dead tuples. With a new flag, it can just as easily clean up data that is no longer needed:
--where-clause="deleted_at IS NOT NULL"
The where clause is generic and can reference foreign tables as well.
To keep data in mytable that was updated in the last 90 days:
--table="<mytable>" --where-clause="updated_at > NOW() - Interval '90 days'"

This PR adds data cleanup and soft-delete support to pg_repack:
pg_repack --dbname="ankitmittal" --table="test_repack" --echo --elevel=DEBUG --where-clause="deleted_at IS NOT NULL"
pg_repack --dbname="ankitmittal" --table="test_repack" --echo --elevel=DEBUG --where-clause="updated_at < NOW() - Interval '90 days'"

This has been discussed before (#279) with a different approach.

Why not

It could cause data loss if used incorrectly.

If used properly, this cleans up logical bloat, i.e. data that was thrown into the database but never cleaned up. The alternative is to perform a repack-like online table swap manually.

What it doesn't do

Data that arrives while the repack is running is left as it is, for the sake of simplicity.

@ankitml ankitml marked this pull request as draft November 17, 2022 17:42
@ankitml ankitml changed the title Add a where clause Add a where clause to support soft delete Nov 18, 2022
@ankitml ankitml marked this pull request as ready for review November 18, 2022 20:06
@ankitml
Author

ankitml commented Nov 22, 2022

@fabriziomello curious to know what your thoughts are on this.

@andreasscherbaum
Collaborator

There is a bin/.idea/workspace.xml file committed in this PR which does not belong there.

@andreasscherbaum
Collaborator

> It could cause data loss if used incorrectly

How can this be prevented?

@ankitml
Author

ankitml commented May 1, 2023

Thanks, removed the workspace file

@ankitml
Author

ankitml commented May 1, 2023

> It could cause data loss if used incorrectly
>
> How can this be prevented?

The where clause specified by the user is intended to delete data, so users need to verify that the data being deleted is what they want to delete. A dry run of pg_repack shows precisely which data is going to remain. It is recommended to think through the output of the dry run before running the real command.
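
One way to think through the numbers before a real run is to count how many rows the clause matches and how many it does not, and compare that with what the dry run reports. The following is only a minimal sketch: `preview_sql` is a hypothetical helper, not part of pg_repack or this PR, it assumes the table name is a plain identifier that needs no quoting, and the table and database names are the ones from the commands in the description. It prints the query and never touches the database itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of pg_repack): build a query that counts
# all rows, the rows matching the where clause, and the rows not matching it.
# Assumes the table name is a plain identifier that needs no quoting.
preview_sql() {
    table=$1
    clause=$2
    printf 'SELECT count(*) AS total, count(*) FILTER (WHERE (%s)) AS matching, count(*) FILTER (WHERE NOT (%s)) AS not_matching FROM %s;\n' \
        "$clause" "$clause" "$table"
}

# Print the query. To run it, pipe the output to psql, for example:
#   preview_sql test_repack "deleted_at IS NOT NULL" | psql --dbname="ankitmittal"
preview_sql test_repack "deleted_at IS NOT NULL"
```

If `matching` and `not_matching` do not add up to `total`, the clause evaluates to NULL for some rows (for example when `updated_at` is NULL), and those rows deserve a closer look before the real command is run.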

@ankitml
Author

ankitml commented May 1, 2023

We have been running this repack at Instacart weekly on a few tables to clean up zombie tuples for the last 4 months, on a cron, without any manual intervention.

@andreasscherbaum
Collaborator

@ankitml Can you add unit tests to ensure this is working as intended?
Please also add tests with, let's say, unusual where clauses: something which is broken, something which clearly should not be in a where clause, something with spaces or special characters. Anything which can break the query.
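
To make the kinds of input concrete, here is a rough sketch of a list of awkward clauses run through a conservative pre-check. Everything in it is an assumption for illustration: `clause_looks_unsafe` is a hypothetical wrapper-side guard, not something pg_repack or this PR provides, and it is no substitute for validation and tests inside pg_repack itself. It rejects empty clauses, clauses containing a statement separator or SQL comment marker, and clauses with an odd number of single quotes.

```shell
#!/bin/sh
# Hypothetical pre-check (not part of pg_repack or this PR).
# Returns 0 (unsafe) for an empty clause, a clause containing ';', '--'
# or '/*', or a clause with an odd number of single quotes.
clause_looks_unsafe() {
    clause=$1
    case $clause in
        '') return 0 ;;
        *';'*|*--*|*'/*'*) return 0 ;;
    esac
    # Count single quotes by deleting every other character.
    quotes=$(printf '%s' "$clause" | tr -cd "'" | wc -c)
    [ $((quotes % 2)) -ne 0 ]
}

# Sample inputs: two ordinary clauses from the PR description, then
# clauses that are broken or clearly do not belong in a where clause.
for clause in \
    "deleted_at IS NOT NULL" \
    "updated_at > NOW() - Interval '90 days'" \
    "1=1; DROP TABLE test_repack" \
    "name = 'O'Brien'" \
    "deleted_at IS NULL -- trailing comment" \
    ""
do
    if clause_looks_unsafe "$clause"; then
        echo "reject: $clause"
    else
        echo "accept: $clause"
    fi
done
```

The check is deliberately strict, so it also rejects legitimate clauses such as one with a semicolon inside a string literal. With the sample list above, the first two clauses are accepted and the remaining four are rejected.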
