Recurring ZFS deadlock; zfs_iput_taskq stuck at 100% for minutes #3687

Closed
zhro opened this issue Aug 17, 2015 · 19 comments

zhro commented Aug 17, 2015

I have a recurring problem with ZFS on my server where the system crawls and/or deadlocks when attempting to interact with ZFS volumes. For example, I cannot read or execute scripts from these disks but directory navigation and file listing works (albeit slowly).

Sometimes the lockups get so bad that even the ssh sessions start timing out. It's not limited to the network interface, as I sometimes can't even log in locally while at the box with a monitor and keyboard.

Java programs start throwing exceptions when trying to write to the ZFS arrays:

Internal Exception: java.io.IOException: Error while write(...): Connection reset by peer
Internal Exception: java.io.IOException: Error while write(...): Broken pipe

During these instances, according to top, zfs_iput_taskq and kswapd0 get stuck at 100%. txg_sync also blocks for longer than 120 seconds and is forcibly killed by the system.

These recurring deadlocks are causing both program crashes and downtime.

zpool status reports everything OK:

  pool: backup
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        backup                                        ONLINE       0     0     0
          raidz3-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5PKHHE5  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E6TYUX25  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EJXKYYU2  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EE1YSAL2  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2CDE449  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2CDER66  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E7HT982J  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0PVT451  ONLINE       0     0     0
        cache
          ata-Crucial_CT128MX100SSD1_14440DA986E7     ONLINE       0     0     0

errors: No known data errors

  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME                                STATE     READ WRITE CKSUM
        data                                ONLINE       0     0     0
          mirror-0                          ONLINE       0     0     0
            ata-ST480HM000-1G5162_Z4P00DGL  ONLINE       0     0     0
            ata-ST480HM000-1G5162_Z4P00BHN  ONLINE       0     0     0

errors: No known data errors

I'm currently using zfs 0.6.4.2-1 on CentOS 7.

Any thoughts?

Is issue #457 related?

@alexanderhaensch

Running perf top could show the kernel tasks involved during the lock. Can you post some more information about the system?
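
For example, sampling all CPUs with call graphs for ~30 seconds and saving a text report (the exact invocation is only a suggestion; flags can differ between perf versions):

perf record -a -g -- sleep 30
perf report --stdio > perf-report.txt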

zhro commented Aug 27, 2015

The deadlock is recurring but can take weeks to occur. Unfortunately, I cannot make it happen at will to provide additional information.

The motherboard is a Supermicro H8DG6 with dual AMD 6328:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD Opteron(tm) Processor 6328
Stepping:              0
CPU MHz:               3200.000
BogoMIPS:              6399.12
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15

CentOS Linux release 7.1.1503 (Core)
Linux version 3.10.0-229.7.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015

zhro commented Sep 6, 2015

I'm experiencing this issue again right now. The server is beginning to grind to a halt, and zfs_iput_taskq, kswapd0, kswapd1, kswapd2, kswapd3, and kthreadd are all at maximum CPU usage.

My Skype is Kahenraz if someone who is knowledgeable on this subject can call me. I can provide a shell account for diagnostics.

Currently dumping perf

zhro commented Sep 6, 2015

I've made a pastebin of a perf top log that I was able to pull before the whole system deadlocked. Still waiting for ssh to become responsive again.

http://pastebin.com/raw.php?i=vFpBFUFz

Here is a log from top as well:

#top
top - 19:10:59 up 59 days, 15:55,  8 users,  load average: 34.09, 21.19, 10.41
Tasks: 674 total,  15 running, 659 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.1 us, 79.3 sy,  0.0 ni,  7.9 id,  1.4 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 19785259+total,   902672 free, 96152040 used, 10079788+buff/cache
KiB Swap:  4095932 total,        0 free,  4095932 used.   815976 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6448 admin     20   0 42.186g 0.035t   7664 S 160.7 18.8   9888:57 java
18841 admin     20   0 15.365g 8.774g  10292 S 153.5  4.7 321:02.60 java
12642 admin     20   0 38.324g 1.589g   6452 S 134.3  0.8   2287:44 java
23767 root       0 -20       0      0      0 R  96.4  0.0  61:19.88 zfs_iput_taskq/
28693 admin     20   0 15.340g 3.198g   7096 S  80.9  1.7 804:21.21 java
14797 admin     20   0 4316332 4.008g   1656 R  78.9  2.1   2:29.82 bash
  170 root      20   0       0      0      0 R  78.5  0.0  26:42.44 kswapd2
  168 root      20   0       0      0      0 R  77.2  0.0 259:35.34 kswapd0
    2 root      20   0       0      0      0 R  73.9  0.0  16:04.48 kthreadd
17531 admin     20   0  113248    960    620 D  70.0  0.0 133:13.18 admin
  171 root      20   0       0      0      0 R  69.3  0.0  36:33.35 kswapd3
 7934 admin     20   0  113248   1276    924 R  58.1  0.0  10:46.98 admin
  169 root      20   0       0      0      0 R  47.2  0.0   7:08.22 kswapd1
21343 admin     20   0 17.504g 6.717g   7140 S  33.7  3.6   1249:59 java
16980 admin     20   0 18.440g 8.021g   7412 S  24.1  4.3   1111:38 java
  124 root      rt   0       0      0      0 S  17.8  0.0   4:15.53 migration/10
   93 root      20   0       0      0      0 S  15.8  0.0   6:04.96 ksoftirqd/4
   42 root      20   0       0      0      0 S  11.9  0.0 695:04.59 rcu_sched

@kernelOfTruth

The pastebin link above has a redirection loop; this one works instead:

http://pastebin.com/vFpBFUFz

@zhro can you post arcstats the next time this occurs? [hopefully it won't]

cat /proc/spl/kstat/zfs/arcstats

Also try limiting the size of the ARC to 1/4 or 1/3 of total memory.
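
For example, on a box with ~188G of RAM a cap of roughly 48G could be set like this (the value is purely illustrative; zfs_arc_max is in bytes):

echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max
echo "options zfs zfs_arc_max=51539607552" >> /etc/modprobe.d/zfs.conf

The first line takes effect at runtime, the second makes the setting persist across reboots.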

Stack issues?

Referencing a few related issues:
#531
#1928
#3055
#3183
#3225
#610
#3148
http://comments.gmane.org/gmane.linux.file-systems.zfs.user/19754

zhro commented Sep 6, 2015

Current arcstats:

5 1 0x01 86 4128 34247967600329 5157383365117166
name                            type data
hits                            4    2498352634
misses                          4    92929589
demand_data_hits                4    2386861450
demand_data_misses              4    66292474
demand_metadata_hits            4    92378823
demand_metadata_misses          4    23345083
prefetch_data_hits              4    11986562
prefetch_data_misses            4    2378988
prefetch_metadata_hits          4    7125799
prefetch_metadata_misses        4    913044
mru_hits                        4    124623651
mru_ghost_hits                  4    1419638
mfu_hits                        4    2354625643
mfu_ghost_hits                  4    1429528
deleted                         4    17372701
recycle_miss                    4    1385996
mutex_miss                      4    876
evict_skip                      4    37560009
evict_l2_cached                 4    314152801792
evict_l2_eligible               4    658012682752
evict_l2_ineligible             4    98585411584
hash_elements                   4    12387883
hash_elements_max               4    13134993
hash_collisions                 4    32257366
hash_chains                     4    1782341
hash_chain_max                  4    7
p                               4    18829823488
c                               4    60455276608
c_min                           4    4194304
c_max                           4    101300527104
size                            4    59474479224
hdr_size                        4    3602463504
data_size                       4    4077260800
meta_size                       4    16727907840
other_size                      4    34093835488
anon_size                       4    163840
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    16866227200
mru_evict_data                  4    4076564992
mru_evict_metadata              4    4925252096
mru_ghost_size                  4    43587863040
mru_ghost_evict_data            4    43583545856
mru_ghost_evict_metadata        4    4317184
mfu_size                        4    3938777600
mfu_evict_data                  4    564736
mfu_evict_metadata              4    940006400
mfu_ghost_size                  4    16391070208
mfu_ghost_evict_data            4    16152187904
mfu_ghost_evict_metadata        4    238882304
l2_hits                         4    496324
l2_misses                       4    92422907
l2_feeds                        4    5141328
l2_rw_clash                     4    5
l2_read_bytes                   4    729461760
l2_write_bytes                  4    263430246400
l2_writes_sent                  4    55561
l2_writes_done                  4    55561
l2_writes_error                 4    0
l2_writes_hdr_miss              4    13
l2_evict_lock_retry             4    1
l2_evict_reading                4    0
l2_free_on_write                4    935777
l2_cdata_free_on_write          4    38
l2_abort_lowmem                 4    4485
l2_cksum_bad                    4    391255
l2_io_error                     4    28453
l2_size                         4    287636466176
l2_asize                        4    241466195968
l2_hdr_size                     4    973011592
l2_compress_successes           4    672810
l2_compress_zeros               4    0
l2_compress_failures            4    1740931
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    14899
memory_indirect_count           4    745383
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    55397218424
arc_meta_limit                  4    75975395328
arc_meta_max                    4    58063263928

The results of perf top have also changed now that things have settled down:

7.73%  [kernel]                    [k] page_fault                                                                   
3.58%  [kernel]                    [k] unmap_page_range                                                             
2.91%  [kernel]                    [k] copy_page_rep                                                                
2.89%  [kernel]                    [k] __find_get_page                                                              
2.42%  libc-2.17.so                [.] _dl_addr                                                                     
2.14%  [kernel]                    [k] clear_page_c                                                                 
1.95%  [kernel]                    [k] _raw_spin_lock                                                               
1.36%  [kernel]                    [k] _raw_spin_lock_irqsave                                                       
1.21%  [kernel]                    [k] __do_page_fault                                                              
1.12%  [kernel]                    [k] filemap_fault                                                                
1.09%  [kernel]                    [k] handle_mm_fault                                                              
1.05%  [kernel]                    [k] __list_del_entry                                                             
0.97%  [kernel]                    [k] release_pages                                                                
0.95%  [kernel]                    [k] copy_pte_range                                                               
0.85%  [kernel]                    [k] generic_exec_single                                                          
0.83%  [kernel]                    [k] do_wp_page                                                                   
0.81%  [kernel]                    [k] vma_interval_tree_insert                                                     
0.78%  [kernel]                    [k] kmem_cache_alloc                                                             
0.76%  ld-2.17.so                  [.] _dl_relocate_object                                                          
0.76%  [kernel]                    [k] find_vma                                                                     
0.75%  [kernel]                    [k] page_remove_rmap                                                             
0.75%  perf                        [.] rb_next                                                                      
0.62%  [kernel]                    [k] get_pageblock_flags_group                                                    
0.62%  [kernel]                    [k] __mem_cgroup_uncharge_common                                                 
0.61%  [kernel]                    [k] kfree                                                                        
0.60%  [kernel]                    [k] cmpxchg_double_slab.isra.50                                                  
0.58%  [kernel]                    [k] get_page_from_freelist                                                       
0.58%  [kernel]                    [k] __wake_up_bit                                                                
0.51%  [kernel]                    [k] l2arc_write_buffers                                                          
0.50%  [kernel]                    [k] __mem_cgroup_commit_charge                                                   
0.50%  ld-2.17.so                  [.] do_lookup_x                                                                  
0.49%  [kernel]                    [k] change_protection_range                                                      
0.48%  [kernel]                    [k] kmem_cache_free                                                              
0.48%  [kernel]                    [k] _cond_resched                                                                
0.48%  libc-2.17.so                [.] _int_malloc                                                                  

@kernelOfTruth

What non-default settings do you use for the spl or zfs kernel modules?

/sys/module/zfs/parameters/
/sys/module/spl/parameters
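
For example, to dump them all at once (prints filename:value pairs):

grep . /sys/module/zfs/parameters/* /sys/module/spl/parameters/*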

http://gentoovps.net/zfs-cheatsheet-linux-on-zfs/

zhro commented Sep 6, 2015

Running:

CentOS Linux release 7.1.1503 (Core)
Linux version 3.10.0-229.7.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015

Motherboard/CPU:

Supermicro H8DG6 with dual AMD 6328

System memory:

$free -h
              total        used        free      shared  buff/cache   available
Mem:           188G         37G         74G        1.8G         76G         74G
Swap:          3.9G        2.3G        1.6G

Note that there are two pools here:

Pool data:

$zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME                                STATE     READ WRITE CKSUM
        data                                ONLINE       0     0     0
          mirror-0                          ONLINE       0     0     0
            ata-ST480HM000-1G5162_Z4P00DGL  ONLINE       0     0     0
            ata-ST480HM000-1G5162_Z4P00BHN  ONLINE       0     0     0

errors: No known data errors

$zpool get all data
NAME  PROPERTY                    VALUE                       SOURCE
data  size                        444G                        -
data  capacity                    24%                         -
data  altroot                     -                           default
data  health                      ONLINE                      -
data  guid                        12863517867423264499        default
data  version                     -                           default
data  bootfs                      -                           default
data  delegation                  on                          default
data  autoreplace                 off                         default
data  cachefile                   -                           default
data  failmode                    wait                        default
data  listsnapshots               off                         default
data  autoexpand                  off                         default
data  dedupditto                  0                           default
data  dedupratio                  1.00x                       -
data  free                        333G                        -
data  allocated                   111G                        -
data  readonly                    off                         -
data  ashift                      13                          local
data  comment                     -                           default
data  expandsize                  -                           -
data  freeing                     0                           default
data  fragmentation               65%                         -
data  leaked                      0                           default
data  feature@async_destroy       enabled                     local
data  feature@empty_bpobj         enabled                     local
data  feature@lz4_compress        active                      local
data  feature@spacemap_histogram  active                      local
data  feature@enabled_txg         active                      local
data  feature@hole_birth          active                      local
data  feature@extensible_dataset  enabled                     local
data  feature@embedded_data       active                      local
data  feature@bookmarks           enabled                     local

Pool backup:

$zpool status backup
  pool: backup
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        backup                                        ONLINE       0     0     0
          raidz3-0                                    ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5PKHHE5  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E6TYUX25  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EJXKYYU2  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EE1YSAL2  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2CDE449  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2CDER66  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E7HT982J  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0PVT451  ONLINE       0     0     0
        cache
          ata-Crucial_CT128MX100SSD1_14440DA986E7     ONLINE       0     0     0

errors: No known data errors

$zpool get all backup
NAME    PROPERTY                    VALUE                       SOURCE
backup  size                        29T                         -
backup  capacity                    32%                         -
backup  altroot                     -                           default
backup  health                      ONLINE                      -
backup  guid                        9479200480195810554         default
backup  version                     -                           default
backup  bootfs                      -                           default
backup  delegation                  on                          default
backup  autoreplace                 off                         default
backup  cachefile                   -                           default
backup  failmode                    wait                        default
backup  listsnapshots               off                         default
backup  autoexpand                  off                         default
backup  dedupditto                  0                           default
backup  dedupratio                  1.00x                       -
backup  free                        19.6T                       -
backup  allocated                   9.44T                       -
backup  readonly                    off                         -
backup  ashift                      12                          local
backup  comment                     -                           default
backup  expandsize                  -                           -
backup  freeing                     0                           default
backup  fragmentation               3%                          -
backup  leaked                      0                           default
backup  feature@async_destroy       enabled                     local
backup  feature@empty_bpobj         enabled                     local
backup  feature@lz4_compress        active                      local
backup  feature@spacemap_histogram  active                      local
backup  feature@enabled_txg         active                      local
backup  feature@hole_birth          active                      local
backup  feature@extensible_dataset  enabled                     local
backup  feature@embedded_data       active                      local
backup  feature@bookmarks           enabled                     local

ZFS module parameters:

$ls -1 /sys/module/zfs/parameters/zfs_arc_* && cat /sys/module/zfs/parameters/zfs_arc_*
/sys/module/zfs/parameters/zfs_arc_average_blocksize
/sys/module/zfs/parameters/zfs_arc_grow_retry
/sys/module/zfs/parameters/zfs_arc_max
/sys/module/zfs/parameters/zfs_arc_memory_throttle_disable
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts
/sys/module/zfs/parameters/zfs_arc_meta_limit
/sys/module/zfs/parameters/zfs_arc_meta_prune
/sys/module/zfs/parameters/zfs_arc_min
/sys/module/zfs/parameters/zfs_arc_min_prefetch_lifespan
/sys/module/zfs/parameters/zfs_arc_p_aggressive_disable
/sys/module/zfs/parameters/zfs_arc_p_dampener_disable
/sys/module/zfs/parameters/zfs_arc_shrink_shift
8192
5
0
1
4096
0
10000
0
1000
1
1
5

@kernelOfTruth

That's a lot of swap usage (it might be due to a high swappiness setting, but still ...).

What Java implementation do you use?

FYI: in the case of IcedTea 7.2* there appears to be a resource-leak bug with OpenJDK 7.

Is most of the shared memory exhausted?

ipcs -m
ipcs -mp

http://www.thegeekstuff.com/2010/08/ipcs-command-examples/

zhro commented Sep 7, 2015

Currently running:

java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

The system slows down, eventually grinding to a halt (dropping network connections), then eventually perks back up and starts running again.

Every time this happens I consistently see, though with a small sample size so far, ~100% CPU usage on zfs_iput_taskq, kswapd, and kthreadd.

In this case, txg_sync did not block for longer than 120 seconds as it did the last time.

zhro commented Sep 7, 2015

I have an update to report which may or may not be related to this issue. At the time of this slowdown, an erroneous bash function was being executed:

function cp() {
   cp --preserve=all --no-dereference $@
}

And at the terminal:

cp test/ test2/

top for this user showed virtual memory being allocated in the gigabytes as the process appeared hung but continued to gobble up memory. The process was killed manually and things returned to normal. This function was added recently and could not have been the cause of the lockup from the original post but the symptoms were similar.
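
Presumably the runaway memory came from the function calling itself: inside the body, cp resolves to the shell function rather than /bin/cp, so it recurses without bound. A non-recursive version (for illustration only) would bypass the function lookup with command:

function cp() {
    # "command" invokes the external cp binary instead of re-entering this function
    command cp --preserve=all --no-dereference "$@"
}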

I am not an expert on the inner workings of ZFS, but according to what I recorded, my virtual memory was being maxed out, which caused zfs_iput_taskq, kswapd, and kthreadd to become unresponsive despite physical memory being available.

Does this information shed any light onto this issue or is it a false lead?

Despite having 196GB of memory, is having a small 4GB swap a problem?

@alexanderhaensch

@zhro Whether to have swap or not depends on your workload. Allowing programs to swap brings at least a 100x slowdown.

In your case: disable it!
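
For example (a one-off change; to make it permanent the swap entry in /etc/fstab also has to be removed or commented out):

swapoff -a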

zfsbot commented Sep 11, 2015

Is swap on a zvol?

zhro commented Sep 11, 2015

No. Swap is on the primary disk, which contains only / and swap. Everything outside of the standard distro files is run from the /data pool.

The primary disk is a 120GB SSD with nothing but the core OS and swap, for IOPS. I also use it occasionally as a large tmp or cache area, but this has never coincided with the performance drops mentioned in the first post.

zhro commented Oct 20, 2015

Experienced the problem today. Swap is currently disabled and memory is available so we can rule out swapping.

Current free:

              total        used        free      shared  buff/cache   available
Mem:           188G        100G         23G        4.0G         64G         23G
Swap:            0B          0B          0B

Current arcstats:

5 1 0x01 86 4128 34247967600329 8910619611887517
name                            type data
hits                            4    4467294508
misses                          4    153297091
demand_data_hits                4    4263775094
demand_data_misses              4    108561148
demand_metadata_hits            4    180068225
demand_metadata_misses          4    40815250
prefetch_data_hits              4    15959539
prefetch_data_misses            4    2946943
prefetch_metadata_hits          4    7491650
prefetch_metadata_misses        4    973750
mru_hits                        4    196625296
mru_ghost_hits                  4    3573509
mfu_hits                        4    4247230008
mfu_ghost_hits                  4    5313773
deleted                         4    23040820
recycle_miss                    4    1674320
mutex_miss                      4    17106
evict_skip                      4    37677917
evict_l2_cached                 4    335699931136
evict_l2_eligible               4    747496161792
evict_l2_ineligible             4    116407156736
hash_elements                   4    12946427
hash_elements_max               4    14219314
hash_collisions                 4    75390909
hash_chains                     4    1926890
hash_chain_max                  4    7
p                               4    73613462928
c                               4    85572636448
c_min                           4    4194304
c_max                           4    101300527104
size                            4    85467191160
hdr_size                        4    2676711888
data_size                       4    50815961600
meta_size                       4    9651529216
other_size                      4    20407399920
anon_size                       4    2431423488
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    43442677760
mru_evict_data                  4    38540871168
mru_evict_metadata              4    2284605952
mru_ghost_size                  4    14629716992
mru_ghost_evict_data            4    10837405184
mru_ghost_evict_metadata        4    3792311808
mfu_size                        4    14593389568
mfu_evict_data                  4    12073219584
mfu_evict_metadata              4    2002615808
mfu_ghost_size                  4    6113362944
mfu_ghost_evict_data            4    2350701568
mfu_ghost_evict_metadata        4    3762661376
l2_hits                         4    3873069
l2_misses                       4    149413664
l2_feeds                        4    8890056
l2_rw_clash                     4    21
l2_read_bytes                   4    2831632384
l2_write_bytes                  4    271459764224
l2_writes_sent                  4    62272
l2_writes_done                  4    62272
l2_writes_error                 4    0
l2_writes_hdr_miss              4    27
l2_evict_lock_retry             4    1
l2_evict_reading                4    0
l2_free_on_write                4    1132541
l2_cdata_free_on_write          4    828
l2_abort_lowmem                 4    6622
l2_cksum_bad                    4    534136
l2_io_error                     4    36038
l2_size                         4    299861820416
l2_asize                        4    249242680832
l2_hdr_size                     4    1915588536
l2_compress_successes           4    751769
l2_compress_zeros               4    0
l2_compress_failures            4    1777570
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    17659
memory_indirect_count           4    813939
arc_no_grow                     4    0
arc_tempreserve                 4    2228224
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    34651228936
arc_meta_limit                  4    75975395328
arc_meta_max                    4    60570318704

zhro commented Oct 21, 2015

Following up on this, I'm of the assumption that this is all due to slab and disk cache allocation by ZFS. After a fresh reboot today, doing nothing but hammering the zfs pools resulted in the following:

$free -h
              total        used        free      shared  buff/cache   available
Mem:           188G         79G         61G         25M         47G         62G
Swap:          3.9G          0B        3.9G

The memory in use is all buff/cache and slab/dentry/inode cache. Here is the memory usage of user processes:

$ps -eo size,pid,user,command --sort -size | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' | head
     0.00 Mb COMMAND
   753.30 Mb /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql ..
   440.66 Mb /usr/lib/polkit-1/polkitd --no-debug
   297.09 Mb /usr/sbin/NetworkManager --no-daemon
   297.00 Mb /usr/bin/python -Es /usr/sbin/tuned -l -P
   216.72 Mb /usr/sbin/ModemManager
   153.12 Mb /usr/sbin/rsyslogd -n
   144.51 Mb /usr/libexec/dconf-service
    87.64 Mb /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
    72.33 Mb /sbin/audispd

Results from evaluating slabtop and /proc/meminfo:

      OBJS    ACTIVE  USE OBJ    SIZE   SLABS  OBJ/SLAB  CACHE SIZE                    NAME
  10855309  10855105      99%   1.07K  374321        29      11.42G         zfs_znode_cache
  10893059  10892738      99%   0.85K  294407        37       8.98G                 dnode_t
    412686    410725      99%  16.00K  206343         2       6.30G           zio_buf_16384
  12502304  12289095      98%   0.50K  390697        32       5.96G             kmalloc-512
  12776610  12743741      99%   0.29K  232302        55       3.54G          dmu_buf_impl_t
  10855309  10855102      99%   0.27K  374321        29       2.86G                sa_cache
    370668    370630      99%   8.00K   92667         4       2.83G            kmalloc-8192
   3269280   3028039      92%   0.32K   66720        49       1.02G               taskstats
  10882788  10871028      99%   0.08K  213388        51       0.81G  selinux_inode_security
  12158272  12131428      99%   0.06K  189973        64       0.72G              kmalloc-64
   3231396   3212605      99%   0.19K   76938        42       0.59G                  dentry
   5577558   5517626      98%   0.09K  132799        42       0.51G              kmalloc-96
     92872     82296      88%   4.00K   11609         8       0.35G            kmalloc-4096
   1962464   1951965      99%   0.12K   61327        32       0.23G             kmalloc-128
   6022912   6022900      99%   0.03K   47054       128       0.18G              kmalloc-32
      8326      8300      99%  12.00K    4163         2       0.13G           zio_buf_12288
..

Total Slab Allocation: 46.8217GB

I can confirm that dropping all caches frees this memory:

sync; echo 3 > /proc/sys/vm/drop_caches

     OBJS   ACTIVE  USE OBJ   SIZE  SLABS  OBJ/SLAB  CACHE SIZE              NAME
  1118520  1116904      99%  0.10K  28680        39       0.11G        blkdev_ioc
   938816   154521      16%  0.06K  14669        64       0.06G        kmalloc-64
  1677952  1270504      75%  0.03K  13109       128       0.05G        kmalloc-32
    16640     9377      56%  1.00K    520        32       0.02G      kmalloc-1024
    12870    11886      92%  0.58K    234        55       0.01G       inode_cache
    38976    27944      71%  0.19K    928        42       0.01G            dentry
    20874     2119      10%  0.32K    426        49       0.01G         taskstats
    59904    59904     100%  0.11K   1664        36       0.01G   sysfs_dir_cache
    10045     9319      92%  0.64K    205        49       0.01G  proc_inode_cache
     2112     2013      95%  2.84K    192        11       0.01G       task_struct
     2625     2496      95%  2.06K    175        15       0.01G     sighand_cache
    17440     7666      43%  0.25K    545        32       0.00G       kmalloc-256
     4224     1643      38%  1.00K    132        32       0.00G         xfs_inode
    19055    18515      97%  0.21K    515        37       0.00G    vm_area_struct
     3584     3196      89%  1.12K    128        28       0.00G      signal_cache
     7744     2560      33%  0.50K    242        32       0.00G       kmalloc-512
..

Total Slab Allocation: 0.340336GB

Result from free:

$free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        2.2G        186G         25M        447M        185G
Swap:          3.9G          0B        3.9G

I was under the assumption that this kind of cache would be freely available to the system if it was under memory pressure. But upon closer inspection of /proc/meminfo, I found these values:

Slab:         46.8651GB
SReclaimable: 0.63637GB
SUnreclaim:   46.2286GB
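
These GB figures were presumably derived from the kB values reported by /proc/meminfo, e.g. with something like:

awk '/^(Slab|SReclaimable|SUnreclaim):/ { printf "%-14s %.4f GB\n", $1, $2/1048576 }' /proc/meminfo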

If memory allocated by zfs is located in SUnreclaim, is it still released when the system is under pressure? What makes it store so much in SUnreclaim and so little in SReclaimable?

Manually dropping the caches takes brutally long and uses a significant amount of CPU in the arc_reclaim process along with many kworker and migration threads. For the majority of this time the cache is simply being evaluated; memory isn't freed until a long time has passed, suggesting that this is an expensive operation and not something that can be done quickly. In this instance I didn't time it, but it took at least tens of minutes. Past deadlocks (or what seemed like deadlocks) could take well over an hour.
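
To watch the progress while this runs, something like the following can poll the relevant ARC counters (just an ad-hoc suggestion):

watch -n 5 'grep -E "^(size|c|arc_meta_used) " /proc/spl/kstat/zfs/arcstats'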

Note that I found a post suggesting that it could take hours even on a lower memory system (16GB), which would definitely give the appearance of a deadlock:
http://serverfault.com/questions/560513/linux-echo-3-proc-sys-vm-drop-caches-takes-hours-to-complete

It seems that Thecus's solution was to run a script which automatically drops all caches once memory pressure exceeds 50% of the available memory (a rough sketch of such a script follows below). My research also suggests that Thecus uses Linux with ZFS, which suggests zfsonlinux; my point being that the linked Server Fault question is most likely about ZFS cache allocation and deallocation (arc_reclaim).
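
A rough sketch of that kind of watchdog (illustrative only, not Thecus's actual script; the 50% threshold and the use of MemAvailable are assumptions):

#!/bin/bash
# Drop page/dentry/inode caches when available memory falls below half of total.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt $(( total_kb / 2 )) ]; then
    sync
    echo 3 > /proc/sys/vm/drop_caches
fi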

But regarding Thecus's script method, isn't zfs_arc_max already configured by default to be 1/2 the system's memory (#1134)?

One might argue that the current behavior is still OK if most of the system's tasks are relegated to ZFS operations. But as I've experienced when the pressure hits, on systems with large amounts of memory the enormous time it takes to free the cache can cause massive swapping, process termination, write failures, and network timeouts.

Thoughts?

zhro commented Oct 21, 2015

This issue may also be related: #3157

@kernelOfTruth

@zhro my issue is too general and there were no deadlocks, so I doubt it applies in this case.
