
After rsync of ~2TiB of data, SUnreclaim (ARC) keeps growing without limit (slabtop), slowing the system to a halt #3157

Closed
kernelOfTruth opened this issue Mar 6, 2015 · 25 comments

@kernelOfTruth
Contributor

Posting the data here before the system goes "boom" (it's getting slower and slower) - hope it's useful

Symptoms:
opening chromium, firefox, konqueror, etc. takes several seconds to load

besides that the system is (still) working fine

I don't think SUnreclaim should be that huge

echo "786432" > /proc/sys/vm/min_free_kbytes 
echo "6" > /sys/module/zfs/parameters/zfs_arc_shrink_shift
echo "16384" > /sys/module/spl/parameters/spl_kmem_cache_slab_limit
echo "2" > /sys/module/spl/parameters/spl_kmem_cache_expire
echo "0" > /sys/module/spl/parameters/spl_kmem_cache_reclaim
echo "4096" > /sys/module/spl/parameters/spl_kmem_cache_slab_limit
echo "0x100000000" > /sys/module/zfs/parameters/zfs_arc_max
echo "0" > /sys/module/zfs/parameters/zfs_prefetch_disable

Should spl_kmem_cache_reclaim be set to something else?

The slub_nomerge kernel parameter is used during bootup.

Below is the relevant output of the system - nothing suspicious in dmesg.

cat /proc/meminfo 
MemTotal:       32897404 kB
MemFree:        10061740 kB
MemAvailable:   10166792 kB
Buffers:            3460 kB
Cached:          2199712 kB
SwapCached:          104 kB
Active:          2017092 kB
Inactive:         977588 kB
Active(anon):     907012 kB
Inactive(anon):   546068 kB
Active(file):    1110080 kB
Inactive(file):   431520 kB
Unevictable:          44 kB
Mlocked:              44 kB
SwapTotal:      37748724 kB
SwapFree:       37743924 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        791448 kB
Mapped:           322004 kB
Shmem:            661572 kB
Slab:           14331220 kB
SReclaimable:     634560 kB
SUnreclaim:     13696660 kB
KernelStack:       17888 kB
PageTables:        26368 kB
KsmZeroPages:      66420 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    64066644 kB
Committed_AS:    3589856 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      759104 kB
VmallocChunk:   34358583476 kB
HardwareCorrupted:     0 kB
AnonHugePages:     94208 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     3251076 kB
DirectMap2M:    27095040 kB
DirectMap1G:     4194304 kB
cat /proc/slabinfo 
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_ffff8806b3063480      0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
pid_2                544    544    128   32    1 : tunables    0    0    0 : slabdata     17     17      0
zs_handle          28039  29696      8  512    1 : tunables    0    0    0 : slabdata     58     58      0
zs_handle           2048   2048      8  512    1 : tunables    0    0    0 : slabdata      4      4      0
zil_lwb_cache        360    360    200   20    1 : tunables    0    0    0 : slabdata     18     18      0
l2arc_buf_hdr_t   747879 747966     40  102    1 : tunables    0    0    0 : slabdata   7333   7333      0
arc_buf_t         174450 499317    104   39    1 : tunables    0    0    0 : slabdata  12803  12803      0
arc_buf_hdr_t     1077404 2191700    320   25    2 : tunables    0    0    0 : slabdata  87668  87668      0
dmu_buf_impl_t    2765075 5395338    312   26    2 : tunables    0    0    0 : slabdata 207513 207513      0
dnode_t           2608993 4033260    896   36    8 : tunables    0    0    0 : slabdata 112035 112035      0
sa_cache          2541328 3665376    112   36    1 : tunables    0    0    0 : slabdata 101816 101816      0
lz4_cache             16     16  16384    2    8 : tunables    0    0    0 : slabdata      8      8      0
zio_data_buf_16384     10     10  16384    2    8 : tunables    0    0    0 : slabdata      5      5      0
zio_buf_16384     122973 174912  16384    2    8 : tunables    0    0    0 : slabdata  87456  87456      0
zio_data_buf_14336     10     10  14336    2    8 : tunables    0    0    0 : slabdata      5      5      0
zio_buf_14336        129    164  14336    2    8 : tunables    0    0    0 : slabdata     82     82      0
zio_data_buf_12288      8      8  12288    2    8 : tunables    0    0    0 : slabdata      4      4      0
zio_buf_12288        123    172  12288    2    8 : tunables    0    0    0 : slabdata     86     86      0
zio_data_buf_10240     15     15  10240    3    8 : tunables    0    0    0 : slabdata      5      5      0
zio_buf_10240        255    393  10240    3    8 : tunables    0    0    0 : slabdata    131    131      0
zio_data_buf_8192     24     24   8192    4    8 : tunables    0    0    0 : slabdata      6      6      0
zio_buf_8192         253    384   8192    4    8 : tunables    0    0    0 : slabdata     96     96      0
zio_data_buf_7168     24     24   7168    4    8 : tunables    0    0    0 : slabdata      6      6      0
zio_buf_7168         358    628   7168    4    8 : tunables    0    0    0 : slabdata    157    157      0
zio_data_buf_6144     30     30   6144    5    8 : tunables    0    0    0 : slabdata      6      6      0
zio_buf_6144         561   1035   6144    5    8 : tunables    0    0    0 : slabdata    207    207      0
zio_data_buf_5120     18     18   5120    6    8 : tunables    0    0    0 : slabdata      3      3      0
zio_buf_5120         896   1668   5120    6    8 : tunables    0    0    0 : slabdata    278    278      0
zio_data_buf_4096     32     32   4096    8    8 : tunables    0    0    0 : slabdata      4      4      0
zio_buf_4096         784   1416   4096    8    8 : tunables    0    0    0 : slabdata    177    177      0
zio_data_buf_3584     45     45   3584    9    8 : tunables    0    0    0 : slabdata      5      5      0
zio_buf_3584        1335   2439   3584    9    8 : tunables    0    0    0 : slabdata    271    271      0
zio_data_buf_3072     30     30   3072   10    8 : tunables    0    0    0 : slabdata      3      3      0
zio_buf_3072        1622   2940   3072   10    8 : tunables    0    0    0 : slabdata    294    294      0
zio_data_buf_2560     72     72   2560   12    8 : tunables    0    0    0 : slabdata      6      6      0
zio_buf_2560        1633   3048   2560   12    8 : tunables    0    0    0 : slabdata    254    254      0
zio_data_buf_2048    112    112   2048   16    8 : tunables    0    0    0 : slabdata      7      7      0
zio_buf_2048        2732   4896   2048   16    8 : tunables    0    0    0 : slabdata    306    306      0
zio_data_buf_1536    126    126   1536   21    8 : tunables    0    0    0 : slabdata      6      6      0
zio_buf_1536        3905   7266   1536   21    8 : tunables    0    0    0 : slabdata    346    346      0
zio_data_buf_1024    224    224   1024   32    8 : tunables    0    0    0 : slabdata      7      7      0
zio_buf_1024        5505  10784   1024   32    8 : tunables    0    0    0 : slabdata    337    337      0
zio_data_buf_512    1280   1376    512   32    4 : tunables    0    0    0 : slabdata     43     43      0
zio_buf_512       2612411 4143872    512   32    4 : tunables    0    0    0 : slabdata 129496 129496      0
zio_link_cache      3910   3910     48   85    1 : tunables    0    0    0 : slabdata     46     46      0
zio_cache            870    870   1104   29    8 : tunables    0    0    0 : slabdata     30     30      0
ddt_entry_cache        0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
range_seg_cache     9095  16128     64   64    1 : tunables    0    0    0 : slabdata    252    252      0
bio-2               1020   1020    960   34    8 : tunables    0    0    0 : slabdata     30     30      0
rpc_inode_cache        0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
rpc_buffers           16     16   2048   16    8 : tunables    0    0    0 : slabdata      1      1      0
rpc_tasks             32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
btrfs_end_io_wq     1120   1120    144   28    1 : tunables    0    0    0 : slabdata     40     40      0
btrfs_prelim_ref       0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
btrfs_delayed_extent_op   3162   3162     40  102    1 : tunables    0    0    0 : slabdata     31     31      0
btrfs_delayed_data_ref   2814   2940     96   42    1 : tunables    0    0    0 : slabdata     70     70      0
btrfs_delayed_tree_ref   3128   3128     88   46    1 : tunables    0    0    0 : slabdata     68     68      0
btrfs_delayed_ref_head   1825   1825    160   25    1 : tunables    0    0    0 : slabdata     73     73      0
btrfs_inode_defrag      0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
btrfs_delayed_node   1412   1508    304   26    2 : tunables    0    0    0 : slabdata     58     58      0
btrfs_ordered_extent   1596   1596    424   38    4 : tunables    0    0    0 : slabdata     42     42      0
btrfs_extent_map   20818  22820    144   28    1 : tunables    0    0    0 : slabdata    815    815      0
bio-1               1100   1175    320   25    2 : tunables    0    0    0 : slabdata     47     47      0
btrfs_extent_buffer   7961   9802    280   29    2 : tunables    0    0    0 : slabdata    338    338      0
btrfs_extent_state  16193  34323     80   51    1 : tunables    0    0    0 : slabdata    673    673      0
btrfs_delalloc_work      0      0    152   26    1 : tunables    0    0    0 : slabdata      0      0      0
btrfs_free_space    1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
btrfs_path           224    224    144   28    1 : tunables    0    0    0 : slabdata      8      8      0
btrfs_transaction    216    216    296   27    2 : tunables    0    0    0 : slabdata      8      8      0
btrfs_trans_handle    184    184    176   23    1 : tunables    0    0    0 : slabdata      8      8      0
btrfs_inode        43131  48345    984   33    8 : tunables    0    0    0 : slabdata   1465   1465      0
uksm_tree_node     90952 375424     64   64    1 : tunables    0    0    0 : slabdata   5866   5866      0
uksm_vma_slot      23957  24860    184   22    1 : tunables    0    0    0 : slabdata   1130   1130      0
uksm_node_vma      10294  16422     40  102    1 : tunables    0    0    0 : slabdata    161    161      0
uksm_stable_node    7247  10192     72   56    1 : tunables    0    0    0 : slabdata    182    182      0
uksm_rmap_item    103145 403002     80   51    1 : tunables    0    0    0 : slabdata   7902   7902      0
slot_tree_node       180    180   2080   15    8 : tunables    0    0    0 : slabdata     12     12      0
zswap_entry         1360   1360     48   85    1 : tunables    0    0    0 : slabdata     16     16      0
nf-frags               0      0    216   37    2 : tunables    0    0    0 : slabdata      0      0      0
xfrm6_tunnel_spi       0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
ip6-frags              0      0    216   37    2 : tunables    0    0    0 : slabdata      0      0      0
fib6_nodes           256    256     64   64    1 : tunables    0    0    0 : slabdata      4      4      0
ip6_dst_cache         84     84    384   21    2 : tunables    0    0    0 : slabdata      4      4      0
PINGv6                 0      0   1088   30    8 : tunables    0    0    0 : slabdata      0      0      0
RAWv6                180    180   1088   30    8 : tunables    0    0    0 : slabdata      6      6      0
UDPLITEv6              0      0   1088   30    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                240    240   1088   30    8 : tunables    0    0    0 : slabdata      8      8      0
tw_sock_TCPv6          0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
request_sock_TCPv6      0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 80     80   1984   16    8 : tunables    0    0    0 : slabdata      5      5      0
nf_conntrack_ffffffff82104b80   1300   1378    312   26    2 : tunables    0    0    0 : slabdata     53     53      0
nf_conntrack_expect     96     96    256   32    2 : tunables    0    0    0 : slabdata      3      3      0
dm_snap_pending_exception      0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dm_exception           0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dm_mpath_io            0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dm_crypt_io          168    168    192   21    1 : tunables    0    0    0 : slabdata      8      8      0
kcopyd_job             0      0   3312    9    8 : tunables    0    0    0 : slabdata      0      0      0
io                     0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2632   12    8 : tunables    0    0    0 : slabdata      0      0      0
dm_rq_target_io        0      0    408   20    2 : tunables    0    0    0 : slabdata      0      0      0
dm_io               1326   1326     40  102    1 : tunables    0    0    0 : slabdata     13     13      0
uhci_urb_priv          0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
scsi_sense_cache     352    352    128   32    1 : tunables    0    0    0 : slabdata     11     11      0
scsi_cmd_cache       399    399    384   21    2 : tunables    0    0    0 : slabdata     19     19      0
sd_ext_cdb           128    128     32  128    1 : tunables    0    0    0 : slabdata      1      1      0
bfq_io_cq           1024   1024    128   32    1 : tunables    0    0    0 : slabdata     32     32      0
bfq_queue            780    780    400   20    2 : tunables    0    0    0 : slabdata     39     39      0
cfq_io_cq            272    272    120   34    1 : tunables    0    0    0 : slabdata      8      8      0
cfq_queue            280    280    232   35    2 : tunables    0    0    0 : slabdata      8      8      0
bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
nilfs2_btree_path_cache      0      0   1792   18    8 : tunables    0    0    0 : slabdata      0      0      0
nilfs2_segbuf_cache      0      0    240   34    2 : tunables    0    0    0 : slabdata      0      0      0
nilfs2_transaction_cache      0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
nilfs2_inode_cache      0      0    952   34    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_dqtrx              0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_dquot              0      0    472   34    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_buf               21     21    384   21    2 : tunables    0    0    0 : slabdata      1      1      0
xfs_icr                0      0    144   28    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_ili                0      0    152   26    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_inode              0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
xfs_efi_item           0      0    400   20    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_efd_item           0      0    400   20    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_buf_item           0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_log_item_desc      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_trans              0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_ifork              0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_da_state           0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_btree_cur          0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_bmap_free_item      0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_log_ticket         0      0    184   22    1 : tunables    0    0    0 : slabdata      0      0      0
xfs_ioend             39     39    104   39    1 : tunables    0    0    0 : slabdata      1      1      0
jfs_mp                32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
jfs_ip                26     26   1224   26    8 : tunables    0    0    0 : slabdata      1      1      0
udf_inode_cache        0      0    704   23    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_request         160    160    400   20    2 : tunables    0    0    0 : slabdata      8      8      0
fuse_inode            69     69    704   23    4 : tunables    0    0    0 : slabdata      3      3      0
isofs_inode_cache      0      0    600   27    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    688   23    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     29     29    552   29    4 : tunables    0    0    0 : slabdata      1      1      0
squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
jbd2_transaction_s      0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
jbd2_inode             0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_journal_handle      0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_journal_head      0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_table_s      0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record_s      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle         0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
journal_head           0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
revoke_table           0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
revoke_record          0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_inode_cache       0      0   1000   32    8 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context      0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_system_zone       0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end            0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_extent_status      0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    776   21    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache       0      0    800   20    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_xattr             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
reiser_inode_cache     66     66    728   22    4 : tunables    0    0    0 : slabdata      3      3      0
configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                  0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
kioctx                 0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
kiocb                  0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
fanotify_event_info      0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
fsnotify_mark          0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
inotify_inode_mark    819    819    104   39    1 : tunables    0    0    0 : slabdata     21     21      0
dnotify_mark           0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
dnotify_struct         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dio                  200    200    640   25    4 : tunables    0    0    0 : slabdata      8      8      0
fasync_cache          85     85     48   85    1 : tunables    0    0    0 : slabdata      1      1      0
pid_namespace         84     84   2224   14    8 : tunables    0    0    0 : slabdata      6      6      0
posix_timers_cache      0      0    248   33    2 : tunables    0    0    0 : slabdata      0      0      0
iommu_devinfo        448    448     64   64    1 : tunables    0    0    0 : slabdata      7      7      0
iommu_domain         224    224    128   32    1 : tunables    0    0    0 : slabdata      7      7      0
iommu_iova         40754  41600     64   64    1 : tunables    0    0    0 : slabdata    650    650      0
UNIX                1188   1188    896   36    8 : tunables    0    0    0 : slabdata     33     33      0
ip4-frags              0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
ip_mrt_cache           0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
UDP-Lite               0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
tcp_bind_bucket     1024   1024     64   64    1 : tunables    0    0    0 : slabdata     16     16      0
inet_peer_cache       42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
secpath_cache          0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
flow_cache             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
ip_fib_trie          511    511     56   73    1 : tunables    0    0    0 : slabdata      7      7      0
ip_fib_alias         595    595     48   85    1 : tunables    0    0    0 : slabdata      7      7      0
ip_dst_cache         168    168    192   21    1 : tunables    0    0    0 : slabdata      8      8      0
PING                   0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
RAW                  288    288    896   36    8 : tunables    0    0    0 : slabdata      8      8      0
UDP                  272    272    960   34    8 : tunables    0    0    0 : slabdata      8      8      0
tw_sock_TCP          832    832    256   32    2 : tunables    0    0    0 : slabdata     26     26      0
request_sock_TCP       0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
TCP                  323    323   1856   17    8 : tunables    0    0    0 : slabdata     19     19      0
eventpoll_pwq       1064   1064     72   56    1 : tunables    0    0    0 : slabdata     19     19      0
eventpoll_epi       2496   2496    128   32    1 : tunables    0    0    0 : slabdata     78     78      0
sgpool-128            64     64   4096    8    8 : tunables    0    0    0 : slabdata      8      8      0
sgpool-64            128    128   2048   16    8 : tunables    0    0    0 : slabdata      8      8      0
sgpool-32            256    256   1024   32    8 : tunables    0    0    0 : slabdata      8      8      0
sgpool-16            256    256    512   32    4 : tunables    0    0    0 : slabdata      8      8      0
sgpool-8             320    320    256   32    2 : tunables    0    0    0 : slabdata     10     10      0
scsi_data_buffer       0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
blkdev_integrity       0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
blkdev_queue         112    112   1928   16    8 : tunables    0    0    0 : slabdata      7      7      0
blkdev_requests      660    660    368   22    2 : tunables    0    0    0 : slabdata     30     30      0
blkdev_ioc           630    630     96   42    1 : tunables    0    0    0 : slabdata     15     15      0
bio-0               1504   1504    256   32    2 : tunables    0    0    0 : slabdata     47     47      0
biovec-256           264    304   4096    8    8 : tunables    0    0    0 : slabdata     38     38      0
biovec-128           128    128   2048   16    8 : tunables    0    0    0 : slabdata      8      8      0
biovec-64            800    832   1024   32    8 : tunables    0    0    0 : slabdata     26     26      0
biovec-16            288    288    256   32    2 : tunables    0    0    0 : slabdata      9      9      0
bio_integrity_payload     42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
khugepaged_mm_slot    816    816     40  102    1 : tunables    0    0    0 : slabdata      8      8      0
uid_cache            256    256    128   32    1 : tunables    0    0    0 : slabdata      8      8      0
sock_inode_cache    1220   1400    640   25    4 : tunables    0    0    0 : slabdata     56     56      0
skbuff_fclone_cache   1056   1120    512   32    4 : tunables    0    0    0 : slabdata     35     35      0
skbuff_head_cache   1770   2144    256   32    2 : tunables    0    0    0 : slabdata     67     67      0
file_lock_cache      168    168    192   21    1 : tunables    0    0    0 : slabdata      8      8      0
net_namespace         42     42   4480    7    8 : tunables    0    0    0 : slabdata      6      6      0
shmem_inode_cache   3550   4758    624   26    4 : tunables    0    0    0 : slabdata    183    183      0
pool_workqueue       256    256    256   32    2 : tunables    0    0    0 : slabdata      8      8      0
task_delay_info     2432   2432     64   64    1 : tunables    0    0    0 : slabdata     38     38      0
taskstats            192    192    328   24    2 : tunables    0    0    0 : slabdata      8      8      0
proc_inode_cache    6650   7176    608   26    4 : tunables    0    0    0 : slabdata    276    276      0
sigqueue             200    200    160   25    1 : tunables    0    0    0 : slabdata      8      8      0
bdev_cache           312    312    832   39    8 : tunables    0    0    0 : slabdata      8      8      0
kernfs_node_cache  37264  37264    120   34    1 : tunables    0    0    0 : slabdata   1096   1096      0
mnt_cache            525    525    320   25    2 : tunables    0    0    0 : slabdata     21     21      0
filp                8194   9696    256   32    2 : tunables    0    0    0 : slabdata    303    303      0
inode_cache         9787  12499    552   29    4 : tunables    0    0    0 : slabdata    431    431      0
dentry            2535755 2887290    192   21    1 : tunables    0    0    0 : slabdata 137490 137490      0
names_cache          112    112   4096    8    8 : tunables    0    0    0 : slabdata     14     14      0
key_jar               21     21    192   21    1 : tunables    0    0    0 : slabdata      1      1      0
buffer_head         1482   1482    104   39    1 : tunables    0    0    0 : slabdata     38     38      0
nsproxy              510    510     48   85    1 : tunables    0    0    0 : slabdata      6      6      0
vm_area_struct     26092  26950    184   22    1 : tunables    0    0    0 : slabdata   1225   1225      0
mm_struct            396    396    896   36    8 : tunables    0    0    0 : slabdata     11     11      0
fs_cache             512    512     64   64    1 : tunables    0    0    0 : slabdata      8      8      0
files_cache          375    375    640   25    4 : tunables    0    0    0 : slabdata     15     15      0
signal_cache        1358   1428   1152   28    8 : tunables    0    0    0 : slabdata     51     51      0
sighand_cache       1174   1230   2112   15    8 : tunables    0    0    0 : slabdata     82     82      0
task_xstate         2340   2340    832   39    8 : tunables    0    0    0 : slabdata     60     60      0
task_struct         1225   1552   1936   16    8 : tunables    0    0    0 : slabdata     97     97      0
cred_jar            3447   3885    192   21    1 : tunables    0    0    0 : slabdata    185    185      0
Acpi-Operand        4648   4648     72   56    1 : tunables    0    0    0 : slabdata     83     83      0
Acpi-ParseExt        448    448     72   56    1 : tunables    0    0    0 : slabdata      8      8      0
Acpi-Parse           680    680     48   85    1 : tunables    0    0    0 : slabdata      8      8      0
Acpi-State           408    408     80   51    1 : tunables    0    0    0 : slabdata      8      8      0
Acpi-Namespace      3162   3162     40  102    1 : tunables    0    0    0 : slabdata     31     31      0
anon_vma_chain     23684  24384     64   64    1 : tunables    0    0    0 : slabdata    381    381      0
anon_vma           14025  14025     80   51    1 : tunables    0    0    0 : slabdata    275    275      0
pid                 2336   2336    128   32    1 : tunables    0    0    0 : slabdata     73     73      0
radix_tree_node    16602  23632    584   28    4 : tunables    0    0    0 : slabdata    844    844      0
ftrace_event_file   1472   1472     88   46    1 : tunables    0    0    0 : slabdata     32     32      0
ftrace_event_field   3995   3995     48   85    1 : tunables    0    0    0 : slabdata     47     47      0
idr_layer_cache      630    630   2096   15    8 : tunables    0    0    0 : slabdata     42     42      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512       32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192       83890 135172   8192    4    8 : tunables    0    0    0 : slabdata  33793  33793      0
kmalloc-4096         521    552   4096    8    8 : tunables    0    0    0 : slabdata     69     69      0
kmalloc-2048        7578   9488   2048   16    8 : tunables    0    0    0 : slabdata    593    593      0
kmalloc-1024        8997  12288   1024   32    8 : tunables    0    0    0 : slabdata    384    384      0
kmalloc-512         8344  14528    512   32    4 : tunables    0    0    0 : slabdata    454    454      0
kmalloc-256         5022  11104    256   32    2 : tunables    0    0    0 : slabdata    347    347      0
kmalloc-192       788570 949788    192   21    1 : tunables    0    0    0 : slabdata  45228  45228      0
kmalloc-128        30980  55136    128   32    1 : tunables    0    0    0 : slabdata   1723   1723      0
kmalloc-96        237820 1249500     96   42    1 : tunables    0    0    0 : slabdata  29750  29750      0
kmalloc-64        3982123 10722240     64   64    1 : tunables    0    0    0 : slabdata 167535 167535      0
kmalloc-32        1043526 2419840     32  128    1 : tunables    0    0    0 : slabdata  18905  18905      0
kmalloc-16         17920  17920     16  256    1 : tunables    0    0    0 : slabdata     70     70      0
kmalloc-8         141663 474624      8  512    1 : tunables    0    0    0 : slabdata    927    927      0
kmem_cache_node      640    640     64   64    1 : tunables    0    0    0 : slabdata     10     10      0
kmem_cache           399    399    192   21    1 : tunables    0    0    0 : slabdata     19     19      0
cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 19570905527 43702854394909
name                            type data
hits                            4    46492830
misses                          4    20507067
demand_data_hits                4    15074520
demand_data_misses              4    2806344
demand_metadata_hits            4    27476417
demand_metadata_misses          4    3066533
prefetch_data_hits              4    15827
prefetch_data_misses            4    14490518
prefetch_metadata_hits          4    3926066
prefetch_metadata_misses        4    143672
mru_hits                        4    26915537
mru_ghost_hits                  4    308961
mfu_hits                        4    15635499
mfu_ghost_hits                  4    282392
deleted                         4    17825742
recycle_miss                    4    1056226
mutex_miss                      4    104
evict_skip                      4    5609706
evict_l2_cached                 4    1041728847360
evict_l2_eligible               4    536921002496
evict_l2_ineligible             4    425963240448
hash_elements                   4    1077226
hash_elements_max               4    2463909
hash_collisions                 4    8353270
hash_chains                     4    116449
hash_chain_max                  4    7
p                               4    6095964672
c                               4    13457679280
c_min                           4    4194304
c_max                           4    4294967296
size                            4    7367698704
hdr_size                        4    194825584
data_size                       4    866843136
meta_size                       4    2079236096
other_size                      4    4028910448
anon_size                       4    49152
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1400349184
mru_evict_data                  4    419882496
mru_evict_metadata              4    3375616
mru_ghost_size                  4    6371602944
mru_ghost_evict_data            4    3234320384
mru_ghost_evict_metadata        4    3137282560
mfu_size                        4    1545680896
mfu_evict_data                  4    446960640
mfu_evict_metadata              4    491201024
mfu_ghost_size                  4    1684319232
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    1684319232
l2_hits                         4    173299
l2_misses                       4    20333384
l2_feeds                        4    98785
l2_rw_clash                     4    12
l2_read_bytes                   4    1218807296
l2_write_bytes                  4    986769056768
l2_writes_sent                  4    83836
l2_writes_done                  4    83836
l2_writes_error                 4    0
l2_writes_hdr_miss              4    10
l2_evict_lock_retry             4    1
l2_evict_reading                4    0
l2_free_on_write                4    388925
l2_cdata_free_on_write          4    287
l2_abort_lowmem                 4    17
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    68072326144
l2_asize                        4    62877538816
l2_hdr_size                     4    197883440
l2_compress_successes           4    2112247
l2_compress_zeros               4    0
l2_compress_failures            4    7716608
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    61
memory_indirect_count           4    1075
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    975
arc_meta_used                   4    6500855568
arc_meta_limit                  4    12632603136
arc_meta_max                    4    12642418624
zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
SWAP2_1   58.5G   828K  58.5G         -     0%     0%  1.00x  ONLINE  -
SWAP2_2   58.5G   916K  58.5G         -     0%     0%  1.00x  ONLINE  -
WD30EFRX  2.66T  1.72T   962G         -     3%    64%  1.00x  ONLINE  -
cat /proc/swaps 
Filename                Type        Size    Used    Priority
/dev/zram0                              partition   20971516    1608    100
/dev/zd0                                partition   8388604 1588    100
/dev/zd16                               partition   8388604 1604    100
zpool iostat -v
                     capacity     operations    bandwidth
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
SWAP2_1            828K  58.5G      0      0     12    114
  swap2_1          828K  58.5G      0      0     12    114
----------------  -----  -----  -----  -----  -----  -----
SWAP2_2            916K  58.5G      0      0     11    112
  swap2_2          916K  58.5G      0      0     11    112
----------------  -----  -----  -----  -----  -----  -----
WD30EFRX          1.72T   962G    410     27  41.0M   260K
  mirror          1.72T   962G    410     27  41.0M   260K
    wd30efrx_002      -      -    202      7  20.6M   324K
    wd30efrx          -      -    202      7  20.6M   324K
cache                 -      -      -      -      -      -
  intelSSD180     58.5G  41.3M      3    178  27.3K  21.5M
----------------  -----  -----  -----  -----  -----  -----

I will attempt "echo 3 > /proc/sys/vm/drop_caches" and see how it goes ...
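A simple way to watch whether that actually frees anything, using only standard tools:

watch -n 10 'grep -E "^(Slab|SReclaimable|SUnreclaim):" /proc/meminfo'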

@snajpa
Contributor

snajpa commented Mar 6, 2015

Are you by any chance rsyncing over NFS? I've had similar problems; NFS seems to use some caches in the SLAB and pagecache, which pressures out the ARC. My workaround was to set up a 5-minute cron job with
echo 1 > /proc/sys/vm/drop_caches
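For reference, a minimal system crontab entry for that workaround might look like this (the exact file layout is distribution-specific):

# /etc/crontab - drop the pagecache every 5 minutes
*/5 * * * * root /bin/sh -c 'echo 1 > /proc/sys/vm/drop_caches'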

@kernelOfTruth
Contributor Author

no, only via USB 3.0 :(

I had run

echo 3 > /proc/sys/vm/drop_caches

which freed up SUnreclaim in 20-50 KiB steps, but at times it looked like memory was still growing as well (I don't have the time to wait 5+ days for the system to become usable again)

after 1 hour I rebooted (via the magic SysRq key) - I have the impression that there are still memory-pressure issues despite using all the recent changes, including #2129

@kernelOfTruth
Contributor Author

I've done several rsync transfers of the 2 TB (albeit incremental - so at most 10-30 GB per import and export)

despite now using pre-set values for the ARC

echo "0x100000000" > /sys/module/zfs/parameters/zfs_arc_max
echo "0x100000000" > /sys/module/zfs/parameters/zfs_arc_min`

memory keeps on growing

Can anybody shed light on what the problem between transparent hugepages and ZFSonLinux is?

https://groups.google.com/a/zfsonlinux.org/forum/#!msg/zfs-discuss/7a77qQcG4C0/Bpc-VHKSjycJ

the advice to disable it keeps popping up when searching for solutions to an ever-growing ARC or ZFS slabs

What is causing the SPL/ZFS slabs and the ARC to grow continually?

Slab:           17339664 kB
SReclaimable:    1035176 kB
SUnreclaim:     16304488 kB
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
8609728 7768306  90%    0.06K 134527       64    538108K kmalloc-64
5660980 5660980 100%    0.30K 217730       26   1741840K dmu_buf_impl_t
5369568 5367594  99%    0.50K 167799       32   2684784K zio_buf_512
5360112 5358715  99%    0.88K 148892       36   4764544K dnode_t
5314824 5314824 100%    0.11K 147634       36    590536K sa_cache
5282844 5282844 100%    0.19K 251564       21   1006256K dentry
755750 748071  98%    0.31K  30230       25    241840K arc_buf_hdr_t
613504 613504 100%    0.03K   4793      128     19172K kmalloc-32
563262 563262 100%    0.09K  13411       42     53644K kmalloc-96
439110 224047  51%    0.08K   8610       51     34440K uksm_rmap_item
427986 311810  72%    0.10K  10974       39     43896K arc_buf_t
313242 312955  99%    0.04K   3071      102     12284K l2arc_buf_hdr_t
284608 188966  66%    0.06K   4447       64     17788K uksm_tree_node
245358 243018  99%   16.00K 122679        2   3925728K zio_buf_16384
170520 170471  99%    8.00K  42630        4   1364160K kmalloc-8192
140028 110041  78%    0.19K   6668       21     26672K kmalloc-192
122880 122880 100%    0.01K    240      512       960K kmalloc-8
 62080  61604  99%    0.12K   1940       32      7760K kmalloc-128
 38590  38590 100%    0.12K   1135       34      4540K kernfs_node_cache
 34496  34118  98%    0.18K   1568       22      6272K vm_area_struct
 34192  22220  64%    2.00K   2137       16     68384K kmalloc-2048
 32186  31444  97%    0.18K   1463       22      5852K uksm_vma_slot
 30912  29510  95%    0.06K    483       64      1932K anon_vma_chain
 18176  18176 100%    0.02K     71      256       284K kmalloc-16
 16256  11372  69%    0.06K    254       64      1016K range_seg_cache
 16065  15472  96%    0.08K    315       51      1260K anon_vma
 14784  14784 100%    0.06K    231       64       924K iommu_iova
 13824  13760  99%    0.50K    432       32      6912K kmalloc-512
 13224  13224 100%    0.54K    456       29      7296K inode_cache
 12448   7993  64%    0.25K    389       32      3112K kmalloc-256
 12036  11095  92%    0.04K    118      102       472K uksm_node_vma
 11730  10908  92%    0.08K    230       51       920K btrfs_extent_state
 11424   9506  83%    0.57K    408       28      6528K radix_tree_node
 10725  10028  93%    0.96K    325       33     10400K btrfs_inode
 10444  10444 100%    0.14K    373       28      1492K btrfs_extent_map
 10112   9599  94%    0.25K    316       32      2528K filp
  9600   6754  70%    1.00K    300       32      9600K zio_buf_1024
  8832   8283  93%    1.00K    276       32      8832K kmalloc-1024

this is after a short uptime of 12 hours,
one zpool scrub of a 2.77 TB pool of data (1.72 TB, partially with ditto blocks),

then 2 rsync transfers (currently on the 2nd incremental run)

I've read that exporting pools is supposed to reset memory consumption,

but how can that be the solution?

Assuming running programs and their working state have to be preserved,
it's not possible to export and re-import the pool where my /home partition resides (mirrored, backed by a small l2arc of now 50 GB; 1.72 TB)

@behlendorf, @tuxoko - you two are the memory and/or ARC experts when it comes to ZFSonLinux:
do you have any suggestions for settings or approaches that might improve things?

Meanwhile I keep looking for experience reports and settings that might have helped others deal with this problem (besides disabling THP)

Sorry for the bother in any case - I just want to avoid repeating an experience similar to #3142

Many thanks in advance!

@dweeezil
Contributor

dweeezil commented Mar 7, 2015

@kernelOfTruth I've been meaning to give this issue a more serious look but haven't yet had a chance to do so. Looking at the numbers from your initial posting, this sticks out:

other_size                      4    4028910448

That's pretty much blowing your 4GiB ARC size limit right there. This value is the sum of a few other things, the sizes of which aren't necessarily readily available. It contains the dnode cache "dnode_t", the dbuf cache "dmu_buf_impl_t" and some of the 512-byte zio buffers "zio_buf_512".
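To track just that counter over time, a one-liner like this works (arcstats rows are "name type data", so the third field is the value):

awk '$1 == "other_size" { print $3 }' /proc/spl/kstat/zfs/arcstats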

Here are the relevant slab lines from above:

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dnode_t           2608993 4033260    896   36    8 : tunables    0    0    0 : slabdata 112035 112035      0
dmu_buf_impl_t    2765075 5395338    312   26    2 : tunables    0    0    0 : slabdata 207513 207513      0
zio_buf_512       2612411 4143872    512   32    4 : tunables    0    0    0 : slabdata 129496 129496      0

As you can see, they've all got a ton of items. They're also all somewhat sparsely populated, which likely means there's a fair bit of slab fragmentation.

The common thread in these related problems seems to be the use of rsync which, of course, traverses directory structures and requires all the inode information for every file and directory. I have a feeling the culprit is this other kernel-related slab cache:

dentry            2535755 2887290    192   21    1 : tunables    0    0    0 : slabdata 137490 137490      0

AFAIK right now, the only way to tame the kernel's dentry cache is to set /proc/sys/vm/vfs_cache_pressure to a value > 100.

Once the values look like the above, I'm not sure what can be done to lower them or to reduce any fragmentation that might have occurred.
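For completeness, raising vfs_cache_pressure looks like this (the sysctl.d file name below is just a placeholder):

echo 10000 > /proc/sys/vm/vfs_cache_pressure
echo 'vm.vfs_cache_pressure = 10000' >> /etc/sysctl.d/99-vfs.conf   # persistent across reboots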

@dweeezil
Contributor

dweeezil commented Mar 7, 2015

No, it deals with ARC buffers.

@kernelOfTruth
Contributor Author

@dweeezil, @kpande, @snajpa - thanks for taking a look at this!

#2129 is already on board

/proc/sys/vm/vfs_cache_pressure was previously at 1000 ( #3142 )

for this issue it's at 10000 - ok, so I'll raise it by one more step

Related to vfs_cache_pressure settings: what stuck in my memory was something Andrew Morton (?) wrote about setting it to >= 100000

I'll give that a try, thanks!

Wandering around the web with this issue in mind I found following information:

  • having an l2arc might add 1-2 GB of additional memory usage in my case ( http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg34674.html )
  • zfs_arc_meta_limit is now set to 6 GB instead of the previous value of 0 ( echo "6442450944" > /sys/module/zfs/parameters/zfs_arc_meta_limit ); limiting it supposedly also made a positive impact for some folks
  • experimentally, spl_kmem_cache_kmem_threads is raised from 4 to 8; not sure what real-world difference this will make
  • high CPU load is not an issue here (in the past, high CPU load was reported in connection with THP; disabling THP lowers CPU load but of course costs up to 30-40% or more in performance)

Questions - on improving memory reclaim and/or limiting growth:

What is the "optimal" setting for spl_kmem_cache_expire in this connection ?
2 == by low memory conditions [was set at that all the time]
1 == by age

What is the "optimal" setting for spl_kmem_cache_reclaim ?
code says:

unsigned int spl_kmem_cache_reclaim = 0 /* KMC_RECLAIM_ONCE */;

so 0 == reclaim once and then no more?

MODULE_PARM_DESC(spl_kmem_cache_reclaim, "Single reclaim pass (0x1)");

so 1 == reclaim once?
This one's confusing :/

It should reclaim; I wouldn't mind if latency were a little higher, as long as memory growth doesn't get out of control

I'm running memory compaction manually from time to time, but it might not address this kind of fragmentation issue; I'll take a look and see what can be tweaked in that regard
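For reference, the manual compaction trigger (requires a kernel built with CONFIG_COMPACTION):

echo 1 > /proc/sys/vm/compact_memory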

I'll give @snajpa's suggestion of

echo 1 > /proc/sys/vm/drop_caches

a try and see if it works here

Thanks!

@kernelOfTruth
Contributor Author

setting spl_kmem_alloc_max to 65536 per #3041 (default: 2097152)

@kernelOfTruth
Contributor Author

something's fishy:

even though

echo "0x100000000" > /sys/module/zfs/parameters/zfs_arc_max
echo "0x100000000" > /sys/module/zfs/parameters/zfs_arc_min
echo "6442450944" > /sys/module/zfs/parameters/zfs_arc_meta_limit

are set

and I observed that the settings seem to apply on a per-zpool basis

(is that true?)

after a scrub of an additional pool and now after the export of the mentioned pool (only the pool containing /home is currently imported)

the values are now at:

arc_meta_used                   4    6448093560
arc_meta_limit                  4    12632603136
arc_meta_max                    4    12653024936

I should have copied arc_meta_max and arc_meta_limit beforehand,

but I'm sure at least one of the values was significantly lower (arc_meta_max? at 6 GB?)

Do those values keep rising with each subsequently imported pool, and are they not reset after export?

SUnreclaim was also at around 18-20 GB;

I would understand if it was around 14-15 GB, but that is three times the 6 GB value

weird ...

copying arcstats for good measure

The values should reflect /home + l2arc, after import & export of one additional pool (2.7 TB; 1.7 TB partially with ditto blocks), an rsync to that additional pool, and an rsync to a btrfs volume (1.7 TB)

cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 20141264769 44155010516587
name                            type data
hits                            4    46142768
misses                          4    1620385
demand_data_hits                4    435736
demand_data_misses              4    34844
demand_metadata_hits            4    25703873
demand_metadata_misses          4    1036994
prefetch_data_hits              4    1157
prefetch_data_misses            4    48027
prefetch_metadata_hits          4    20002002
prefetch_metadata_misses        4    500520
mru_hits                        4    8253358
mru_ghost_hits                  4    221404
mfu_hits                        4    24659066
mfu_ghost_hits                  4    40243
deleted                         4    493513
recycle_miss                    4    61459
mutex_miss                      4    5
evict_skip                      4    243854
evict_l2_cached                 4    4067502592
evict_l2_eligible               4    3200269312
evict_l2_ineligible             4    6979838464
hash_elements                   4    337622
hash_elements_max               4    830717
hash_collisions                 4    109652
hash_chains                     4    13556
hash_chain_max                  4    5
p                               4    9516086784
c                               4    11319298288
c_min                           4    4194304
c_max                           4    4294967296
size                            4    11318676224
hdr_size                        4    138155184
data_size                       4    4255234560
meta_size                       4    2702728192
other_size                      4    4214504248
anon_size                       4    153600
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    3688179200
mru_evict_data                  4    2307723264
mru_evict_metadata              4    467456
mru_ghost_size                  4    654553600
mru_ghost_evict_data            4    642440704
mru_ghost_evict_metadata        4    12112896
mfu_size                        4    3269629952
mfu_evict_data                  4    1947249152
mfu_evict_metadata              4    883870208
mfu_ghost_size                  4    1809937408
mfu_ghost_evict_data            4    1809665024
mfu_ghost_evict_metadata        4    272384
l2_hits                         4    143907
l2_misses                       4    1476094
l2_feeds                        4    44103
l2_rw_clash                     4    0
l2_read_bytes                   4    216845312
l2_write_bytes                  4    3365597184
l2_writes_sent                  4    3596
l2_writes_done                  4    3596
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    3
l2_cdata_free_on_write          4    17
l2_abort_lowmem                 4    12
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    4174839296
l2_asize                        4    2982881792
l2_hdr_size                     4    8054040
l2_compress_successes           4    128589
l2_compress_zeros               4    0
l2_compress_failures            4    44607
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    19
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    68
arc_meta_used                   4    7063441664
arc_meta_limit                  4    12632603136
arc_meta_max                    4    12653024936

Exporting all pools seems to reclaim and/or reset memory usage, but that can't be the only solution

When I get the opportunity I'll try this without the l2arc and see if that makes a difference ...

...

and I've disabled transparent hugepages as a last resort

@dweeezil
Contributor

dweeezil commented Mar 8, 2015

Without addressing a few of the specifics in @kernelOfTruth's last couple of postings, I'd like to summarize the problem: unlike most (all?) other native Linux filesystems, ZFS carries quite a bit of baggage corresponding to the kernel's dentry cache. As of at least 302f753, ZoL is completely reliant on the kernel's shrinker callback mechanism to shed memory. Due to the nature of Linux's dentry cache (it can grow to a lot of entries very easily) and the fact that ZFS requires a lot of metadata to be associated with each entry, the ARC can easily blow past the administrator-set limit when lots of files are being traversed. A quick peek through the kernel makes me think that vfs_cache_pressure isn't going to be of much help.

In summary, if the kernel's dcache is large, ZFS will consume a correspondingly-large (actually, several times larger) amount of memory which will show up in arcstats as "other_size".

That all said, however, the shrinker pressure mechanism does work... to a point. If I max out the memory on a system by traversing lots of files and causing other_size to get very large, the ARC will shrink if I apply pressure from a normal userland program trying to allocate memory. The manner in which the pressure is applied is dependent on the kernel's overcommit policy and the quantity of swap space. In particular, userland programs may find it difficult to allocate memory in large chunks but the same amount may succeed if the program "nibbles" away at the memory, causing the shrinkers to engage.
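As a rough illustration of that difference (a sketch, assuming the common stress(1) utility is installed; sizes are made up):

# one big allocation may get refused outright ...
stress --vm 1 --vm-bytes 16G --timeout 60s
# ... while the same total, "nibbled" in 1G chunks across workers, tends to
# succeed because each step gives the shrinkers a chance to engage
stress --vm 16 --vm-bytes 1G --timeout 60s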

I'm not sure of the best solution to this issue at the moment, but it's not unique to ZFS. There are plenty of reports around the Internet of dcache-related memory problems being caused by rsync on ext4-only systems. The difference, however, is that ext4 doesn't add its own extras to the dcache, so the effects are a lot less severe. Postings in which people are complaining about this problem usually mention vfs_cache_pressure as a solution and, in the case of ext4, I believe it will help more.

@dweeezil
Contributor

dweeezil commented Mar 9, 2015

@kernelOfTruth A bit more testing shows me that you might have better success if you set the module parameters at module load time, as in modprobe zfs zfs_arc_max=1073741824 .... It seems the changes don't "take" properly if set after the module is loaded.
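For example, to make the setting stick across reboots, the usual modprobe.d form would be:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=1073741824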

@snajpa
Contributor

snajpa commented Mar 9, 2015

@dweeezil would you please elaborate on how setting the ARC limit at runtime doesn't take properly, as you say? So far I've only seen that if I limit the ARC to a size smaller than it already is, it may in some cases never shrink (we run into a deadlock sooner than it has a chance to).

@dweeezil
Contributor

dweeezil commented Mar 9, 2015

Internally, arc_c_max limits arc_c (the target ARC size). The value of arc_c is set to arc_c_max during module initialization, and arc_c_max is set to the value of the tunable zfs_arc_max. If zfs_arc_max is changed once the module is loaded, arc_c_max is updated to the new value (in arc_adapt()), but changes to arc_c_max are "soft": they don't have any immediate effect and only take hold when memory pressure is applied.

There should be something in the documentation about the difference between setting zfs_arc_max (and likely other tunables) at module load time and setting them once the module has been loaded.
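
In shell terms, the "soft" behavior can be observed roughly like this (a sketch; field names are those from /proc/spl/kstat/zfs/arcstats):

# Lower the cap on a loaded module...
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
# ...c_max follows eventually (via arc_adapt()), but "size" only drops
# below the new cap once memory pressure actually engages the shrinkers:
awk '$1 == "size" || $1 == "c" || $1 == "c_max" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats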

@kernelOfTruth
Contributor Author

As mentioned in #3155, it would be nice if we could avoid having to use two caches, dentry & dnode

Anyway, @dweeezil, coincidentally I also made the observation that at least two settings can't be set dynamically once spl/zfs is already loaded, so I started to put some settings into spl.conf & zfs.conf (applied when the modules are loaded):

spl_kmem_cache_kmem_threads
spl_kmem_cache_magazine_size

also

zfs_arc_max
and
zfs_arc_min

seemingly can't be set to the same value during load (and/or that error coincided with a different value of spl_kmem_cache_max_size)

otherwise it would lead to lots of segmentation faults from mount

so the testing settings right now are:

zfs.conf

options zfs zfs_arc_max=0x100000000
#options zfs zfs_arc_min=0x100000000
options zfs zfs_arc_meta_limit=6442450944

spl.conf

options spl spl_kmem_cache_kmem_limit=4096
options spl spl_kmem_cache_slab_limit=16384
options spl spl_kmem_cache_magazine_size=64
options spl spl_kmem_cache_kmem_threads=8
options spl spl_kmem_cache_expire=2
#options spl spl_kmem_cache_max_size=8
options spl spl_kmem_cache_reclaim=1
options spl spl_kmem_alloc_max=65536
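
After a reboot, a quick way to verify that those values actually took (prints each parameter file with its current value; zfs_arc_max should show up in decimal, even when set as hex):

grep . /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zfs_arc_meta_limit /sys/module/spl/parameters/spl_kmem_cache_*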

currently I also have transparent hugepages disabled via

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Thanks !

@dweeezil
Contributor

dweeezil commented Mar 9, 2015

Regarding #3155, I was clearly wrong about other filesystems not hanging onto a lot of stuff in the Linux slab. Here are some slabinfo entries after stat(2)ing about a million files in a 3-level nested set of 1000 directories on an EXT4 filesystem:

ext4_inode_cache active=1002656 num=1002656 size=1889.47MiB
    inode_cache active=6635   num=7595   size=7.53MiB
         dentry active=1010025 num=1010025 size=308.24MiB

and this is after doing the same on a ZFS file system (with an intervening drop_caches to clean everything up):

 dmu_buf_impl_t active=1034979 num=1034979 size=971.24MiB
        dnode_t active=1002246 num=1002246 size=2217.49MiB
    zio_buf_512 active=1002336 num=1002336 size=489.42MiB
ext4_inode_cache active=1565   num=2224   size=4.19MiB
    inode_cache active=6726   num=7595   size=7.53MiB
         dentry active=1015115 num=1015125 size=309.79MiB

ZFS is definitely grabbing more *node-related stuff but it's not like EXT4 doesn't add on its own stuff.
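
For reference, a rough sketch of the kind of traversal used for numbers like these (layout and counts are illustrative, not the exact test):

# 10x10x10 = 1000 leaf directories, 1000 files each => ~1M files
for a in $(seq 0 9); do for b in $(seq 0 9); do for c in $(seq 0 9); do
    d=testtree/$a/$b/$c; mkdir -p "$d"
    for f in $(seq 0 999); do touch "$d/$f"; done
done; done; done
find testtree -type f -exec stat {} + > /dev/null   # populate dcache/ARC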

@tuxoko
Contributor

tuxoko commented Mar 9, 2015

@dweeezil
dentry and inode (znode) are all handled by the kernel VFS, so there's no reason they would behave differently across filesystems. However, dnode and dmu_buf are handled by ZFS, and the dnode should be loosely coupled with the inode (znode), so I don't think the inode (znode) would hold on to the dnode.

I wonder why ZFS doesn't reclaim more aggressively. I'd like to investigate this, but I'm currently busy with other stuff...

@dweeezil
Contributor

dweeezil commented Mar 9, 2015

@tuxoko Right, I mainly wanted to point out that ZFS isn't the only filesystem that uses a lot of inode-related storage. Also, it's not clear to me that the kernel handles large dentry cache sizes very well. Finally, I wanted to point out that ZFS can behave much better if the ARC limit is set at module load time rather than after the fact. For my part, I'm not going to be able to look into this much further right now, either. I do plan on investigating related issues more closely as part of #3115 (speaking of which, and on a totally unrelated subject, I have a feeling it may be a major pain to merge ABD into that).

@kernelOfTruth
Contributor Author

Seems like the issue is resolved (reclaim seems to work fine) - I'm not really sure which of the modified settings made that change possible, but I guess it's a combination

posting the data here for reference if anyone should encounter an ever-growing ARC:

Keep in mind that this is tailored toward a desktop, home-backup and workstation kind of setup

The kernel is 3.19, with the following notable additional patchsets that give memory allocations a higher chance of success:

http://www.eenyhelp.com/patch-0-3-rfc-mm-vmalloc-fix-possible-exhaustion-vmalloc-space-help-215610311.html

[PATCH V4] Allow compaction of unevictable pages

enhanced compaction algorithm

swap on ZRam with LZ4 compression

/etc/modprobe.d/spl.conf

options spl spl_kmem_cache_kmem_limit=4096
options spl spl_kmem_cache_slab_limit=16384
options spl spl_kmem_cache_magazine_size=64
options spl spl_kmem_cache_kmem_threads=8
options spl spl_kmem_cache_expire=2
#options spl spl_kmem_cache_max_size=8
options spl spl_kmem_cache_reclaim=1
options spl spl_kmem_alloc_max=65536

/etc/modprobe.d/zfs.conf

#options spl spl_kmem_cache_kmem_limit=4096
#options spl spl_kmem_cache_slab_limit=16384
#options spl spl_kmem_cache_magazine_size=64
#options spl spl_kmem_cache_kmem_threads=8
#options spl spl_kmem_cache_expire=2
options zfs zfs_arc_max=0x100000000
#options zfs zfs_arc_min=0x100000000
options zfs zfs_arc_meta_limit=6442450944
options zfs zfs_arc_p_dampener_disable=0

<-- several of those parameters, for both the ZFS and SPL kernel modules, have to be specified while the modules are loaded - otherwise they don't appear to be adhered to
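
A minimal sketch of how to re-apply the .conf files after editing them, without a full reboot (assuming all pools can be exported first):

zpool export -a
modprobe -r zfs     # also drops the now-unused dependencies (spl et al.)
modprobe zfs        # re-reads the options from /etc/modprobe.d/*.conf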

slub_nomerge is appended to the kernel command line for safety reasons (buggy drivers; igb had that memory-corruption problem, afaik)

intel_iommu=on is appended to the kernel command line per advice from @ryao

CONFIG_PARAVIRT_SPINLOCKS is enabled in the kernel configuration; if I remember correctly there was an issue where @ryao mentioned that a certain codepath (the slowpath?) is removed with that configuration option and thus lockups tend to occur less often. #3091

cat /proc/sys/vm/vfs_cache_pressure 
100000

Disabling THP - transparent hugepages - (which seems to work fine with the recent tweaks to ZFS)

and regularly running

echo 1 > /proc/sys/vm/compact_memory

might raise stability in certain cases (if I remember correctly, it was also mentioned in relation to OpenVZ)
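
A minimal sketch of what "regularly" could look like (the interval is arbitrary; run as root):

while sleep 1800; do echo 1 > /proc/sys/vm/compact_memory; done &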

echo "786432" > /proc/sys/vm/min_free_kbytes 
echo "65536" > /proc/sys/vm/mmap_min_addr

are also set here as preventative & stability-enhancing measures
(might need adapting, it's tailored towards 32GB of RAM)
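
To make those two survive a reboot, the usual sysctl route works (the file name is arbitrary):

# /etc/sysctl.d/99-zfs-tuning.conf
vm.min_free_kbytes = 786432
vm.mmap_min_addr = 65536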

Code-changes & commits:
kernelOfTruth@8135db5 from #3181
with the value raised to 12500

kernelOfTruth@fa8f5cd higher ZFS_OBJ_MTX_SZ (512; double the value), which leads to the following error messages during mount/import: http://pastebin.com/cWm5Hvn0 but works fine in operation

kernelOfTruth@086f234
arc_evict_iterations to 180
zfs_arc_grow_retry to 20
zfs_arc_shrink_shift to 4 (but I just saw that I was still manually setting it to 5)

So the ARC doesn't grow as aggressively, and more objects are scanned through and recycled or evicted at the same time.

It might not address SUnreclaim directly, but the changes in #3115 should refine the ARC's behavior in that regard (arc_evict_iterations is replaced with zfs_arc_evict_batch_limit)

Code changes & commits in SPL:
openzfs/spl#372
With the change that "broke" it reverted - Retire spl_module_init()/spl_module_fini(), kernelOfTruth/spl@ee4bd8b - until the pull request is updated: kernelOfTruth/spl@06dd9cc

Additional manually set settings:
zfs_arc_shrink_shift to 5 (will try between 4 and 5 in the future and see if that raises latency)
spl_taskq_thread_bind to 1
zfs_prefetch_disable to 0 (disabling prefetch ("1") might hurt read performance and lead to way more reads - a lot - though latency is lower with "1")
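
As shell commands, the same runtime settings would look like this (a sketch; the first value per the note above):

echo 5 > /sys/module/zfs/parameters/zfs_arc_shrink_shift
echo 1 > /sys/module/spl/parameters/spl_taskq_thread_bind
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable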

@kernelOfTruth
Contributor Author

Below follows the output of /proc/slabinfo, /proc/meminfo and /proc/spl/kstat/zfs/arcstats during the restore operation of 1.7 TB from an external USB 3.0 disk (both source and target were ZFS pools)

ZFS ARC stats, 1 TB in, mainly "larger" files (hundreds of MB to GB): http://pastebin.com/uASLYsqW
(close to the beginning a few small files; the others several hundred MB or gigabytes, mostly close to 10 MB)

ZFS ARC stats, 1.3 TB, more large data: http://pastebin.com/Fi5CMc65

ZFS ARC stats, 1.6 TB, mixed (large + little data): http://pastebin.com/uDYUuBGY

ZFS ARC stats, 1.7 TB, heavily mixed, close to end of backup: http://pastebin.com/tHHT1cXX

ZFS ARC stats, 1.7 TB, heavily mixed, after rsync: http://pastebin.com/BEmqGFQX

Note how other_size doesn't seem to grow out of proportion anymore; the only swap used during the backup was on ZRAM

Will post the stats of several imports and exports of pools + Btrfs partitions and small incremental rsync updates later - this was always a problem in the past, where SUnreclaim would grow almost unstoppably ever larger

@kernelOfTruth
Contributor Author

So here the data after:

rsync (1.7 TB) - ZFS /home to ZFS bak (several hundred megabytes transferred)
stage4 (system backup from Btrfs partition to ZFS; 7z)
updatedb (system [Btrfs] + ZFS /home partition)
2x rsync (1.7 TB) - ZFS /home to ZFS bak (5-10 GB transferred)

http://pastebin.com/JgUGMaxi

other_size is twice the size of data_size; meta_size is close to the size of data_size

zio_buf_16384 was never blown out of proportion (e.g. 4 GB) and always stayed at a value around 1-1.3 GB
dnode_t also had a size of around 1 GB

Will post the data after updatedb with /home + another additional ZFS pool imported, plus an rsync job after that - this was usually the worst-case scenario for me in the recent past, where things really seemed to wreak havoc (despite using #2129 )

If things don't change I'll re-add the l2arc device and see how things go over the next days - with it, memory consumption was always greater (improved with #3115 ?); but even without those changes it should behave in a way more civil manner with an L2ARC

@kernelOfTruth
Contributor Author

ok, decided to add l2arc

cache                 -      -      -      -      -      -
  intelSSD180     1.27G  57.3G     46      5   102K   563K

updatedb with additional imported pool, then another rsync: http://pastebin.com/DgxWtjR5

after export of the additional pool, zio_buf_16384 even went down to 553472K - previously it would only ever grow;

dnode_t is at around 1156928K

SUnreclaim: 3544568 kB

Seems like everything works as intended now 👍

@snajpa
Contributor

snajpa commented Mar 15, 2015

@kernelOfTruth lucky you, my SUnreclaim still just keeps on growing. But unlike you I'm stuck with the RHEL6 kernel and can't move on to anything newer (OpenVZ).

@kernelOfTruth
Contributor Author

Thanks :)

@snajpa that's unfortunate :/ Does the support contract allow compiling a different kernel from the sources - as long as you're staying on that version? (I'm eyeballing the paravirt stability issues, since you also mentioned lockup problems in #3160 and that RHEL6 kernels aren't compiled with it)

I also just recently added that support since I only use virtualbox for virtualization purposes

From what I've read there seem to be at least 2 significant landmarks: 3.10 (RHEL7 seems to contain it) and 3.12, where some locking & dentry-handling changes were introduced (http://permalink.gmane.org/gmane.linux.kernel.commits.head/407250)

Would experimenting with all of the options I summarized above be possible?
Like I wrote, I'm not sure which change exactly made it "click" and work properly - but it would surely be nice if it wasn't too kernel-version dependent - 2.6.32, as in some of the issues mentioned, would perhaps be too old, though

Anyway: good luck - if it can be made to work here, I'm sure you'll also figure it out; I don't have that much knowledge or expertise in the kernel or code department compared to you guys, I'm sure (doing this as a mere hobby & from experience, as a Gentoo user)

edit:

I remember having read that

Disabling THP - transparent hugepages - (which seems to work fine with the recent tweaks to ZFS)

and regularly running

echo 1 > /proc/sys/vm/compact_memory

might raise stability in certain cases (if I remember correctly, it was also mentioned in relation to OpenVZ)

echo "786432" > /proc/sys/vm/min_free_kbytes 
echo "65536" > /proc/sys/vm/mmap_min_addr

are also set here as preventative & stability-enhancing measures
(might need adapting, it's tailored towards 32GB of RAM)

echo "bfq" > /sys/module/zfs/parameters/zfs_vdev_scheduler

(default: noop)

and BFQ is also set in

/sys/block/sd*/queue/scheduler

where that isn't supported, CFQ could make a difference related to latency or perhaps even stability - experiment between noop, deadline & cfq
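
A small sketch of applying this across all disks at boot, falling back to cfq where bfq isn't compiled in (device glob is illustrative):

for q in /sys/block/sd*/queue/scheduler; do
    grep -q bfq "$q" && echo bfq > "$q" || echo cfq > "$q"
done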

perhaps that also might be of help

@kernelOfTruth kernelOfTruth changed the title After rsync of ~2TiB of data large amount of SUnreclaim, keeps on growing (slabtop) After rsync of ~2TiB of data large amount of SUnreclaim (ARC), keeps on growing (slabtop) without limit Mar 24, 2015
@kernelOfTruth kernelOfTruth changed the title After rsync of ~2TiB of data large amount of SUnreclaim (ARC), keeps on growing (slabtop) without limit After rsync of ~2TiB of data large amount of SUnreclaim (ARC), keeps on growing (slabtop) without limit - slowing down system to a halt Mar 24, 2015
@kernelOfTruth
Contributor Author

With the recent upstream changes in master ( #3202 ) this doesn't seem to appear anymore

but it surely still needs a few days (or weeks+) of testing

appears to be fixed - therefore closing.

@behlendorf
Contributor

@kernelOfTruth Excellent news, thanks for the update.
