Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement DB collections for mirrors, repos and snapshots #766

Merged
merged 1 commit into from
Aug 21, 2018

Conversation

smira
Copy link
Contributor

@smira smira commented Aug 6, 2018

See #765, #761

Collections were relying on keeping in-memory list of all the objects
for any kind of operation which doesn't scale well the number of
objects in the database.

With this rewrite, objects are loaded only on demand which might
be pessimization in some edge cases but should improve performance
and memory footprint significantly.

This doesn't touch PublishedRepoCollection as it relies on list of
all the objects in many places to implement unique checks, proper
cleanup.

Checklist

  • unit-test added (if change is algorithm)
  • functional test added/updated (if change is functional)
  • man page updated (if applicable)
  • bash completion updated (if applicable)
  • documentation updated
  • author name in AUTHORS

@smira smira requested a review from a team August 6, 2018 22:22
@smira smira added this to the 1.4.0 milestone Aug 6, 2018
@codecov
Copy link

codecov bot commented Aug 6, 2018

Codecov Report

Merging #766 into master will increase coverage by 0.34%.
The diff coverage is 89.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #766      +/-   ##
==========================================
+ Coverage   63.71%   64.05%   +0.34%     
==========================================
  Files          50       50              
  Lines        6308     6326      +18     
==========================================
+ Hits         4019     4052      +33     
+ Misses       1797     1778      -19     
- Partials      492      496       +4
Impacted Files Coverage Δ
deb/snapshot.go 76.16% <87.14%> (+9.34%) ⬆️
deb/local.go 84.48% <90.9%> (-0.84%) ⬇️
deb/remote.go 64.11% <90.9%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb5985b...699323e. Read the comment docs.

Copy link
Contributor

@sliverc sliverc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general just one comment above concerning tests. Maybe it would also be good to do some performance and memory measurements with a large repo.

Just to see what is the memory usage gain and performance hit. The code does get more complicated with this change so I guess it would be good to prove that it is actually decreases memory usage with a acceptable performance hit.

return nil, err
}

r := &LocalRepo{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem to be covered by a test (https://codecov.io/gh/aptly-dev/aptly/pull/766/diff#D2-210)

I think as this is a success path it would be good to have a unit test for it. There are some similar code bits uncovered in the other collections too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, thanks, I will check it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've improved test coverage and added some benchmarks for ByUUID() and ByName(). They exercise "worst case": collection is created and one lookup is performed. On master branch this always leads to loading all the objects.

For ByUUID() in the new approach loading is fast, as object is looked up directly (loading only single element). ByName() still requires scanning whole collection, in the worst case it is as bad as the old approach, on average it's 50% better.

ForEach() doesn't cache objects in memory, so this should help #761 as objects would be GCed as soon as they're scanned. ByUUID() is used a lot to lookup source of published repositories.

On master:

BenchmarkSnapshotCollectionByUUID-8    	     500	   2953932 ns/op	 1352168 B/op	   30743 allocs/op
BenchmarkSnapshotCollectionByName-8    	     500	   2922043 ns/op	 1352504 B/op	   30747 allocs/op

This branch:

BenchmarkSnapshotCollectionByUUID-8    	  300000	      4492 ns/op	    1792 B/op	      39 allocs/op
BenchmarkSnapshotCollectionByName-8    	    1000	   1433058 ns/op	  533994 B/op	   14870 allocs/op

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I guess we can merge this and try to get some feedback of the users with a specific problem in #761 whether this fix helps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had an idea to implement very simple "index" to lookup things by name (which is really frequent lookup): just a key in DB which is like name -> UUID. ByName() could use this index optimistically - if it's missing or doesn't point to right entry, ByName() falls back to full scan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a way. I think we should see first how these changes are in terms of performance in a actual use case before we make the code even more complex.

See #765, #761

Collections were relying on keeping in-memory list of all the objects
for any kind of operation which doesn't scale well the number of
objects in the database.

With this rewrite, objects are loaded only on demand which might
be pessimization in some edge cases but should improve performance
and memory footprint signifcantly.
@smira smira merged commit 72ff71f into master Aug 21, 2018
@smira smira deleted the 761-more-lazy branch August 21, 2018 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants