Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce loaded range tree memory usage #9181

Merged
merged 18 commits into from
Oct 9, 2019
Merged

Conversation

pcd1193182
Copy link
Contributor

@pcd1193182 pcd1193182 commented Aug 19, 2019

Signed-off-by: Paul Dagnelie pcd@delphix.com
Reviewed by: Matt Ahrens mahrens@delphix.com
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed by: George Wilson gwilson@delphix.com

Motivation and Context

On large systems with high fragmentation, very large amounts of memory can be used to store metaslab space maps in memory in range trees. In addition, if the system is very fragmented, it is often necessary to store many of these in memory in order to be able to efficiently perform allocations. Failure to keep enough metaslabs loaded can cause serious performance problems, especially on sync write workloads.

Part of the reason that so much memory is used is that the structure used to store the range segs is a pair of AVL trees (one is sorted by offset, the other by size). This results in range_seg_ts of 72 bytes to store a mere 24 bytes of payload (8 of which is unused for metaslab range trees). While these trees are easy to use and provide good performance, we can significantly reduce the memory usage of the range trees by using a more efficient tree structure.

Description

This patch implements a new tree structure for ZFS, and uses it to store range trees more efficiently.

The new structure is approximately a B-tree, though there are some small differences from the usual characterizations. The tree has core nodes and leaf nodes; each contain data elements, which the elements in the core nodes acting as separators between its children. The difference between core and leaf nodes is that the core nodes have an array of children, while leaf nodes don't. Every node in the tree may be only partially full; in most cases, they are all at least 50% full (in terms of element count) except for the root node, which can be less full. Underfull nodes will steal from their neighbors or merge to remain full enough, while overfull nodes will split in two. The data elements are contained in tree-controlled buffers; they are copied into these on insertion, and overwritten on deletion. This means that the elements are not independently allocated, which reduces overhead, but also means they can't be shared between trees (and also that pointers to them are only valid until a side-effectful tree operation occurs). The overhead varies based on how dense the tree is, but is usually on the order of about 50% of the element size; the per-node overheads are very small, and so don't make a significant difference. The trees can accept arbitrary records; they accept a size and a comparator to allow them to be used for a variety of purposes.

The new trees replace the AVL trees used in the range trees today. Currently, the range_seg_t structure contains three 8 byte integers of payload and two 24 byte avl_tree_node_ts to handle its storage in both an offset-sorted tree and a size-sorted tree (total size: 64 bytes). In the new model, the range seg structures are usually two 4 byte integers, but a separate one needs to exist for the size-sorted and offset-sorted tree. Between the raw size, the 50% overhead, and the double storage, the new btrees are expected to use 8*1.5*2 = 24 bytes per record, or 33.3% as much memory as the AVL trees (this is for the purposes of storing metaslab range trees; for other purposes, like scrubs, they use ~50% as much memory).

We reduced the size of the payload in the range segments by teaching range trees about starting offsets and shifts; since metaslabs have a fixed starting offset, and they all operate in terms of disk sectors, we can store the ranges using 4-byte integers as long as the size of the metaslab divided by the sector size is less than 2^32. For 512-byte sectors, this is a 2^41 (or 2TB) metaslab, which with the default settings corresponds to a 256PB disk. 4k sector disks can handle metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not anticipate disks of this size in the near future, there should be almost no cases where metaslabs need 64-byte integers to store their ranges. We do still have the capability to store 64-byte integer ranges to account for cases where we are storing per-vdev (or per-dnode) trees, which could reasonably go above the limits discussed. We also do not store fill information in the compact version of the node, since it is only used for sorted scrub.

We also optimized the metaslab loading process in various other ways to offset some inefficiencies in the btree model. While individual operations (find, insert, remove_from) are faster for the btree than they are for the avl tree, remove usually requires a find operation, while in the AVL tree model the element itself suffices. Some clever changes actually caused an overall speedup in metaslab loading; we use approximately 40% less cpu to load metaslabs in our tests on Illumos.

Another memory and performance optimization was achieved by changing what is stored in the size-sorted trees. When a disk is heavily fragmented, the df algorithm used by default in ZFS will almost always find a number of small regions in its initial cursor-based search; it will usually only fall back to the size-sorted tree to find larger regions. If we increase the size of the cursor-based search slightly, and don't store segments that are smaller than a tunable size floor in the size-sorted tree, we can further cut memory usage down to below 20% of what the AVL trees store. This also results in further reductions in CPU time spent loading metaslabs.

The 16KiB size floor was chosen because it results in substantial memory usage reduction while not usually resulting in situations where we can't find an appropriate chunk with the cursor and are forced to use an oversized chunk from the size-sorted tree. In addition, even if we do have to use an oversized chunk from the size-sorted tree, the chunk would be too small to use for ZIL allocations, so it isn't as big of a loss as it might otherwise be. And often, more small allocations will follow the initial one, and the cursor search will now find the remainder of the chunk we didn't use all of and use it for subsequent allocations. Practical testing has shown little or no change in fragmentation as a result of this change.

If the size-sorted tree becomes empty while the offset sorted one still has entries, it will load all the entries from the offset sorted tree and disregard the size floor until it is unloaded again. This operation occurs rarely with the default setting, only on incredibly thoroughly fragmented pools.

There are some other small changes to zdb to teach it to handle btrees, but nothing major.

How Has This Been Tested?

Correctness testing

This change has undergone extensive testing on Illumos: first, a feature was added to zhack that would allocate a btree and manipulate it at random by inserting and removing random integers. An AVL tree was used alongside it to verify its integrity, and internal integrity checks were run frequently to identify any broken invariants. This acted as a fuzzer to catch basic problems. Next, the code was integrated into the kernel for use in range trees, and then run through zfs-test and zloop to identify issues with any workloads that might occur in normal operation that didn't occur regularly with random inputs and outputs (many inserts/deletions in a row, lots of merges and splits, etc). Finally, the code was used in performance tests on a performance VM with a large, heavily fragmented pool. This provided both further correctness testing as well as performance testing.

On Linux, this change has been run through zfs-test and zloop several times, with a focus on the areas that differ between Illumos and Linux (notably the sorted scrub code, which uses the fill range_seg entry that doesn't exist on Illumos).

Performance testing

Performance test results from Illumos:
~80% decrease in memory utilization to store loaded metaslabs
~40% decrease in cpu utilization to load metaslabs
specific range tree operations vary, but most operations are slightly faster in btrees or comparable, except for range_tree_vacate (which takes somewhat longer because it needs to free two trees).

Some basic performance tests have been run on Linux to verify that there was no regression due to these changes. The specifics of memory and CPU usage are expected to remain approximately the same between the two platforms.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@pcd1193182 pcd1193182 added Status: Code Review Needed Ready for review and testing Type: Performance Performance improvement or performance problem labels Aug 19, 2019
@ahrens ahrens mentioned this pull request Aug 21, 2019
12 tasks
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you for the detailed PR comment. It was tremendously helpful. Mainly a few comments to head off some future issues.

include/sys/range_tree.h Show resolved Hide resolved
module/zfs/arc.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/metaslab.c Outdated Show resolved Hide resolved
module/zfs/metaslab.c Outdated Show resolved Hide resolved
module/zfs/range_tree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/metaslab.c Outdated Show resolved Hide resolved
@behlendorf
Copy link
Contributor

Several of the bots hit the recent unrelated metaslab ASSERT, I've resubmitted those tests. The style builder also flagged a few minor cstyle warnings.

include/sys/range_tree.h Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
Copy link
Member

@ahrens ahrens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the refactor! Makes it easier to read the (quite complicated) insertion/removal code.

module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/metaslab.c Outdated Show resolved Hide resolved
@sempervictus
Copy link
Contributor

sempervictus commented Aug 30, 2019

This looks like it will be a godsend for old pools. Unfortunately i can't seem to get this built for testing - failing to build in-tree with

build.log:fs/zfs/zfs/metaslab.c:1452:2: error: invalid initializer
build.log:fs/zfs/zfs/metaslab.c:1455:2: error: initialization of ‘void (*)(range_tree_t *, void *)’ {aka ‘void (*)(struct range_tree *, void *)’} from incompatible pointer type ‘void (*)(range_tree_t *, range_seg_t *, void *)’ {aka ‘void (*)(struct range_tree *, void *, void *)’} [-Werror=incompatible-pointer-types]

EDIT: the comment here seems to indicate that the struct definition was made for human readability, not so much informing the compiler as to what types are involved in a clear manner:

/*
 * This is just for clarity of code purposes, so we can make it clear that a
 * pointer is to a range seg of some type; when we need to do the actual math,
 * we'll figure out the real type.
 */
typedef void range_seg_t;

struct range_tree_ops {
        void    (*rtop_create)(range_tree_t *rt, void *arg);
        void    (*rtop_destroy)(range_tree_t *rt, void *arg);
        void    (*rtop_add)(range_tree_t *rt, void *rs, void *arg);
        void    (*rtop_remove)(range_tree_t *rt, void *rs, void *arg);
        void    (*rtop_vacate)(range_tree_t *rt, void *arg);
};

I'm guessing that this is fixed with a designated initializer detailing the struct members for the compiler up front.

@pcd1193182
Copy link
Contributor Author

@sempervictus It looks like you've got an outdated definition of range_tree_ops_t, or possibly a merge error. In the latest version, line 1452 of metaslab.c is in the middle of a comment. If you can point me at a copy of your code I can take a look at let you know if I see any issues.

@sempervictus
Copy link
Contributor

@pcd1193182: this also includes the kvmem alloc PR (8384) and the grsec patch, neither of which touch that file. Its also from today's master branch.
Usually this sort of thing for grsec users is due to the stricter compiler options and the plugins' need to understand the layout as early as possible, and the function pointer casting bit in range_tree_ops looks very much like that type of issue.
I'll try to do another tree/build after client ops this evening...

@sempervictus
Copy link
Contributor

Ok, so this gets me past the first build errors:

diff --git i/fs/zfs/zfs/metaslab.c w/fs/zfs/zfs/metaslab.c
index de9e1d909..fbfcae713 100644
--- i/fs/zfs/zfs/metaslab.c
+++ w/fs/zfs/zfs/metaslab.c
@@ -1441,11 +1441,11 @@ metaslab_rt_vacate(range_tree_t *rt, void *arg)
 }
 
 static range_tree_ops_t metaslab_rt_ops = {
-	metaslab_rt_create,
-	metaslab_rt_destroy,
-	metaslab_rt_add,
-	metaslab_rt_remove,
-	metaslab_rt_vacate
+	.rtop_create = metaslab_rt_create,
+	.rtop_destroy = metaslab_rt_destroy,
+	.rtop_add = metaslab_rt_add,
+	.rtop_remove = metaslab_rt_remove,
+	.rtop_vacate = metaslab_rt_vacate
 };
 
 /*

which brings me to:

build.log:fs/zfs/zfs/range_tree.c:790:18: error: conflicting type qualifiers for ‘rt_btree_ops’
build.log:fs/zfs/zfs/range_tree.c:791:2: error: invalid initializer
build.log:fs/zfs/zfs/range_tree.c:793:2: error: initialization of ‘void (*)(range_tree_t *, void *)’ {aka ‘void (*)(struct range_tree *, void *)’} from incompatible pointer type ‘void (*)(range_tree_t *, range_seg_t *, void *)’ {aka ‘void (*)(struct range_tree *, void *, void *)’} [-Werror=incompatible-pointer-types]
build.log:fs/zfs/zfs/range_tree.c:795:2: error: initialization of ‘void (*)(range_tree_t *, void *, void *)’ {aka ‘void (*)(struct range_tree *, void *, void *)’} from incompatible pointer type ‘void (*)(range_tree_t *, void *)’ {aka ‘void (*)(struct range_tree *, void *)’} [-Werror=incompatible-pointer-types]

The first issue is likely being caught by the randstruct plugin, which exists upstream as well, so designated initializers will need to be used (anywhere that has more than one struct member which are not VLAs as those go to the end of a randstructed struct, IIUC).
The second issue seems to be more of the same, adding this to the mix and trying another build:

diff --git i/module/zfs/range_tree.c w/module/zfs/range_tree.c
index b705a526f..9e243af96 100644
--- i/module/zfs/range_tree.c
+++ w/module/zfs/range_tree.c
@@ -787,12 +787,12 @@ rt_btree_vacate(range_tree_t *rt, void *arg)
        rt_btree_create(rt, arg);
 }
 
-range_tree_ops_t rt_btree_ops = {
-       rt_btree_create,
-       rt_btree_destroy,
-       rt_btree_add,
-       rt_btree_remove,
-       rt_btree_vacate
+static range_tree_ops_t rt_btree_ops = {
+       .rtop_create = rt_btree_create,
+       .rtop_destroy = rt_btree_destroy,
+       .rtop_add = rt_btree_add,
+       .rtop_remove = rt_btree_remove,
+       .rtop_vacate = rt_btree_vacate
 };

@sempervictus
Copy link
Contributor

sempervictus commented Sep 1, 2019

@pcd1193182: designated initializers PR created in your repo to deal with the randstruct bit, and ensure that rt_btree_ops is always a range_tree_ops_t.
Builds fine in-tree as builtin now, zloop running for a number of cycles so far with no errors coming up.
Adding these changes to the full testing branch with the vmem/cache changes and a few other PRs and building a full kernel to test on something a bit less virtual presently.

@sempervictus
Copy link
Contributor

sempervictus commented Sep 1, 2019

@pcd1193182: so the plot thickens a bit, think i found an edge case related to how i'm building:
The test kernel i produced and verified with zloop yesterday was a minimal virt-device-driver-only image with ZFS built directly into the binary (not as a module). This works fine, passed tests, life is good.
The 2nd phase is similar to allyesconfig with the ZFS module still baked into the kernel, but now i'm being blocked at link time by redefinitions of btree functions - namespace collision @ link time which is brought on by the difference in Kconfig, not codebase. Could all of the btree_* functions become zfs_btree_* to avoid the following?

  LD      vmlinux.o
ld: lib/btree.o: in function `btree_last':
(.text+0x20): multiple definition of `btree_last'; fs/zfs/zfs/btree.o:btree.c:(.text+0x1350): first defined here
ld: lib/btree.o: in function `btree_insert':
(.text+0x1470): multiple definition of `btree_insert'; fs/zfs/zfs/btree.o:btree.c:(.text+0x1bd0): first defined here
ld: lib/btree.o: in function `btree_remove':
(.text+0xe50): multiple definition of `btree_remove'; fs/zfs/zfs/btree.o:btree.c:(.text+0x2930): first defined here
ld: lib/btree.o: in function `btree_init':
(.text+0x4a0): multiple definition of `btree_init'; fs/zfs/zfs/btree.o:btree.c:(.text+0xf90): first defined here
ld: lib/btree.o: in function `btree_destroy':
(.text+0x6c0): multiple definition of `btree_destroy'; fs/zfs/zfs/btree.o:btree.c:(.text+0x1a60): first defined here

@sempervictus
Copy link
Contributor

@pcd1193182: the renaming of btree_* to zfs_btree_* seems to work, everything built, zloop is running happily. I've added that commit to the PR as well, which should address all of the issues i've found so far. Hoping to verify stability and check against some rather old and fragmented pools to see how it behaves.

@sempervictus
Copy link
Contributor

sempervictus commented Sep 2, 2019

@pcd1193182: now it looks like i've found (stumbled over thanks to uderef) a real bug. This happens when importing a zpool which has dedup enabled and created in a prior version of ZFS, then upgraded to the current master's feature set, sha512 IIRC (9th gen HK chip, power to spare).

@pcd1193182
Copy link
Contributor Author

pcd1193182 commented Sep 2, 2019

@sempervictus do you happen to have a crash dump from that kernel panic?

@sempervictus
Copy link
Contributor

@pcd1193182: negative, boot-time crash fully locks up the system, and it cant write anything to any media since all volumes are ZFS (except /boot and /boot/efi) - hence the shoddy phone photo

@pcd1193182
Copy link
Contributor Author

@sempervictus Ok so I've had time now to catch up with the various points you addressed.

I'm happy to pull in the designated initializers part of the PR, that seems like a good change.

I'm not as convinced about the rename from btree_* to zfs_btree_*; we don't use that naming scheme for most other data structures (avl trees, range trees) to avoid external name collisions. From what I can see, we mostly use that prefix for things where it's important for clarify and zfs_refcount_t. I'm not sure it's worth making that change and adding bulk to the code to support an edge-case build setup.

As for the kernel panic, I'll see if I can reproduce it. Does this happen consistently when you populate a pool as you described, or is that just the scenario that your pool happened to be in when you encountered this issue? Neither dedup nor pool features should have an effect on the behavior of the btree code, so if that's the consistent factor it would be useful information.

@sempervictus
Copy link
Contributor

@pcd1193182: if those functions aren't renamed, they will conflict with the in-kernel (in Linux) btree implementation at least when building ZFS into the kernel as a static binary. Edge-case seems like an odd way to minimize the build failure - isn't in-kernel build what Illumos does too? As you mention, refcount is a good example of where we already do this - the kernel has its own refcount API, so ZFS had to prefix its own to avoid the collision IIRC.
I'm perfectly happy to maintain my own patchsets and trees to deal with this, but as is, this will break an existing build option (feature) of ZoL for people who don't. I'm not 100% sure this wont come up when built as a module in some other "edge case" either.

module/zfs/btree.c Show resolved Hide resolved
module/zfs/btree.c Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
module/zfs/btree.c Outdated Show resolved Hide resolved
@pcd1193182
Copy link
Contributor Author

@sempervictus I hadn't realized to what extent building as part of the kernel was a supported operation; I've prefixed the data types and functions with zfs_ to help avoid name collisions in the future. Still haven't managed to reproduce the panic you saw, but I'll keep trying.

@sempervictus
Copy link
Contributor

@pcd1193182: i think 65a91b1 might be causing some problems here. Latest naive merge attempt resulted in build failures with incorrect argument numbers for functions and redefinitions of things.

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Copy link
Contributor

@ikozhukhov ikozhukhov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for update!
LGTM with wrapper at this moment and can update it later to another one if will agreed with proposals.

@ikozhukhov
Copy link
Contributor

ikozhukhov commented Oct 8, 2019

maybe use define something like:

#ifndef	zfs_bcopy
#define	zfs_bcopy(src, dst, size)	memmove((dst), (src), (size))
#endif

just for example
i'll try to test this idea by separate PR

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
@ikozhukhov
Copy link
Contributor

(1) https://bitbucket.org/dilos/dilos-illumos/commits/7495cf4e17d78eb677e2a3d60c38f61c4e66268b
i have tested (1) with 2 tries in loop ZTS (about 12h) - no issues
(2) https://bitbucket.org/dilos/dilos-illumos/commits/4e1638b87137d4de6b0f2bcf5793cafaef1c4fb0
it is update after (1) - build has been done and ZTS in progress.

with my updates I can see several options:
1 - integrate Paul changes after zfs_bcopy - if we have agreements for it for separate PR
2 - integrate Paul version, but define zfs_bcopy and use it, will update others ZFS code later
3 - include zfs_bcopy update in all code to Paul PR
i'll be happy with any options

@behlendorf
Copy link
Contributor

@ikozhukhov I'm going to go ahead and merge this PR as is. It includes the bmov wrapper function for memmove. If we decide it would be useful to rename or refactor that code let's discuss it in a new PR after the merge.

@ikozhukhov
Copy link
Contributor

ok, i have no issues with bmov

@behlendorf behlendorf merged commit ca57777 into openzfs:master Oct 9, 2019
behlendorf pushed a commit that referenced this pull request Dec 19, 2019
Additional test cases for the btree implementation, see #9181.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes #9717
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Apr 28, 2020
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.

The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.

The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).

We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.

We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.

Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.

The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.

If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.

There are some other small changes to zdb to teach it to handle btrees,
but nothing major.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9181

Signed-off-by: Bryant G. Ly <bly@catalogicsoftware.com>

Conflicts:
	cmd/zdb/zdb.c
	include/sys/spa.h
	module/zfs/Makefile.in
	module/zfs/dmu_send.c
	module/zfs/dsl_bookmark.c
	module/zfs/dsl_deadlist.c
	module/zfs/metaslab.c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested) Type: Performance Performance improvement or performance problem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants