panic on zfs receive #10698

Closed
helamonster opened this issue Aug 10, 2020 · 21 comments
Labels
Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@helamonster

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Linux Kernel 5.4.0-42-generic #46-Ubuntu
Architecture x86_64
ZFS Version 0.8.4-1~20.04.york0
SPL Version 0.8.4-1~20.04.york0

Describe the problem you're observing

I am encountering a ZFS PANIC upon a zfs recv that hangs the pool until the system is restarted.
I initially encountered this problem with Ubuntu 20.04's official version of ZFS (0.8.3) and then upgraded to 0.8.4 to see whether the problem persists. The PANIC occurs in both versions. I used the pre-built packages from this PPA:
https://launchpad.net/~jonathonf/+archive/ubuntu/zfs

Describe how to reproduce the problem

Dry-run (-n) output of the zfs send command:

#  /sbin/zfs send -P -R -v -n  rpool/scratch@2020-07-23-03-39-10
full    rpool/scratch@2020-07-23-03-39-10       27955560808
size    27955560808
tosnap: '2020-07-23-03-39-10'
fss:
    0x56f7d2ee5ca60cf0:
        name: 'rpool/scratch'
        parentfromsnap: 0
        props:
            snapshot_limit: 18446744073709551614
            mountpoint: 'legacy'
        snaps:
            2020-07-23-03-39-10: 14988460586390450179
        snapprops:
            2020-07-23-03-39-10:

I can issue a zfs send to /dev/null successfully:

# /sbin/zfs send -P -R   rpool/scratch@2020-07-23-03-39-10 | pv -s 27955560808 -p -r -t -a >/dev/null
full    rpool/scratch@2020-07-23-03-39-10       27955560808
size    27955560808
0:01:02 [ 430MiB/s] [ 430MiB/s] [==========================] 101%

The exact zfs send command used when the destination crashes:

# /sbin/zfs send -P -R -v rpool/scratch@2020-07-23-03-39-10

The zfs recv command that crashes:

# /sbin/zfs recv -svuF dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch

Properties of the source dataset:

# zfs get all rpool/scratch@2020-07-23-03-39-10
NAME                               PROPERTY              VALUE                  SOURCE
rpool/scratch@2020-07-23-03-39-10  type                  snapshot               -
rpool/scratch@2020-07-23-03-39-10  creation              Thu Jul 23  3:39 2020  -
rpool/scratch@2020-07-23-03-39-10  used                  152M                   -
rpool/scratch@2020-07-23-03-39-10  referenced            22.1G                  -
rpool/scratch@2020-07-23-03-39-10  compressratio         1.21x                  -
rpool/scratch@2020-07-23-03-39-10  devices               on                     default
rpool/scratch@2020-07-23-03-39-10  exec                  on                     default
rpool/scratch@2020-07-23-03-39-10  setuid                on                     default
rpool/scratch@2020-07-23-03-39-10  createtxg             9180312                -
rpool/scratch@2020-07-23-03-39-10  xattr                 sa                     inherited from rpool
rpool/scratch@2020-07-23-03-39-10  version               5                      -
rpool/scratch@2020-07-23-03-39-10  utf8only              on                     -
rpool/scratch@2020-07-23-03-39-10  normalization         formD                  -
rpool/scratch@2020-07-23-03-39-10  casesensitivity       sensitive              -
rpool/scratch@2020-07-23-03-39-10  nbmand                off                    default
rpool/scratch@2020-07-23-03-39-10  guid                  14988460586390450179   -
rpool/scratch@2020-07-23-03-39-10  primarycache          all                    default
rpool/scratch@2020-07-23-03-39-10  secondarycache        all                    default
rpool/scratch@2020-07-23-03-39-10  defer_destroy         off                    -
rpool/scratch@2020-07-23-03-39-10  userrefs              1                      -
rpool/scratch@2020-07-23-03-39-10  objsetid              49456                  -
rpool/scratch@2020-07-23-03-39-10  mlslabel              none                   default
rpool/scratch@2020-07-23-03-39-10  refcompressratio      1.21x                  -
rpool/scratch@2020-07-23-03-39-10  written               22.1G                  -
rpool/scratch@2020-07-23-03-39-10  clones                                       -
rpool/scratch@2020-07-23-03-39-10  logicalreferenced     26.0G                  -
rpool/scratch@2020-07-23-03-39-10  acltype               posixacl               inherited from rpool
rpool/scratch@2020-07-23-03-39-10  context               none                   default
rpool/scratch@2020-07-23-03-39-10  fscontext             none                   default
rpool/scratch@2020-07-23-03-39-10  defcontext            none                   default
rpool/scratch@2020-07-23-03-39-10  rootcontext           none                   default
rpool/scratch@2020-07-23-03-39-10  encryption            off                    default

Include any warning/errors/backtraces from the system logs

Relevant dmesg:

[Mon Aug 10 09:30:15 2020] VERIFY3(count >= 0) failed (-1 >= 0)
[Mon Aug 10 09:30:15 2020] PANIC at dsl_dir.c:925:dsl_fs_ss_count_adjust()
[Mon Aug 10 09:30:15 2020] Showing stack for process 10815
[Mon Aug 10 09:30:15 2020] CPU: 0 PID: 10815 Comm: txg_sync Tainted: P           OE     5.4.0-42-generic #46-Ubuntu
[Mon Aug 10 09:30:15 2020] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 06/26/2018
[Mon Aug 10 09:30:15 2020] Call Trace:
[Mon Aug 10 09:30:15 2020]  dump_stack+0x6d/0x9a
[Mon Aug 10 09:30:15 2020]  spl_dumpstack+0x29/0x2b [spl]
[Mon Aug 10 09:30:15 2020]  spl_panic+0xd4/0xfc [spl]
[Mon Aug 10 09:30:15 2020]  ? avl_find+0x5f/0x90 [zavl]
[Mon Aug 10 09:30:15 2020]  ? dbuf_rele+0x3a/0x40 [zfs]
[Mon Aug 10 09:30:15 2020]  ? dmu_buf_rele+0xe/0x10 [zfs]
[Mon Aug 10 09:30:15 2020]  ? zap_unlockdir+0x3f/0x60 [zfs]
[Mon Aug 10 09:30:15 2020]  ? zap_lookup_norm+0x95/0xc0 [zfs]
[Mon Aug 10 09:30:15 2020]  dsl_fs_ss_count_adjust+0x1ad/0x1c0 [zfs]
[Mon Aug 10 09:30:15 2020]  dsl_destroy_head_sync_impl+0x35c/0x980 [zfs]
[Mon Aug 10 09:30:15 2020]  ? dbuf_rele+0x3a/0x40 [zfs]
[Mon Aug 10 09:30:15 2020]  dmu_recv_end_sync+0x1ec/0x5a0 [zfs]
[Mon Aug 10 09:30:15 2020]  dsl_sync_task_sync+0xb6/0x100 [zfs]
[Mon Aug 10 09:30:15 2020]  dsl_pool_sync+0x3d6/0x4c0 [zfs]
[Mon Aug 10 09:30:15 2020]  spa_sync+0x59b/0xf90 [zfs]
[Mon Aug 10 09:30:15 2020]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Mon Aug 10 09:30:15 2020]  txg_sync_thread+0x2be/0x450 [zfs]
[Mon Aug 10 09:30:15 2020]  ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[Mon Aug 10 09:30:15 2020]  thread_generic_wrapper+0x79/0x90 [spl]
[Mon Aug 10 09:30:15 2020]  kthread+0x104/0x140
[Mon Aug 10 09:30:15 2020]  ? __thread_exit+0x20/0x20 [spl]
[Mon Aug 10 09:30:15 2020]  ? kthread_park+0x90/0x90
[Mon Aug 10 09:30:15 2020]  ret_from_fork+0x22/0x40

I've done plenty of sends and receives between these two systems, but only this particular dataset is causing this problem, so reproducing it elsewhere may be difficult.
Is this definitely a problem on the destination side, or could it be a problem with the ZFS stream itself?

What other information can I provide that would help narrow the problem down?
Should the version of ZFS on the source machine matter even though it's the receive that is crashing?
The source machine is running kernel 5.4.0-40-generic with the official Ubuntu zfs/spl 0.8.3-1ubuntu12.2.

@ahrens
Member

ahrens commented Aug 10, 2020

cc @pcd1193182 @jjelinek

@ahrens
Member

ahrens commented Aug 10, 2020

It looks like you are doing a "blow away" receive, where we are receiving a full (non-incremental) stream on top of a filesystem that already exists (dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch).

In this case, we receive into a temporary %recv filesystem, which is deleted when the receive completes. For fs/ss accounting purposes, these filesystems should be ignored. But that does not seem to be happening here: we try to decrement the fs count, but it's already zero. So we need to figure out why it isn't ignoring this filesystem.
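
One quick thing that might be worth checking on the destination is whether the accounted counts already disagree with reality before the receive. A rough sketch (filesystem_count/snapshot_count only carry real values once the limit feature has been activated on that part of the tree, and the listing below is only an approximate cross-check):

# Accounted counts stored on the destination filesystem:
zfs get -p filesystem_count,snapshot_count dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch

# Actual numbers, for rough comparison:
zfs list -H -o name -r -t filesystem dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch | wc -l
zfs list -H -o name -r -t snapshot   dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch | wc -l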

Any chance you can provide a crash dump?
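
If kdump isn't set up yet, a rough sketch of getting one on Ubuntu (package and parameter names from memory, so please double-check; the spl_panic_halt option makes an SPL PANIC actually panic the kernel instead of just hanging the offending thread):

# Make an ASSERT/PANIC trigger a real kernel panic so it can be captured
# (if the parameter isn't writable at runtime, set spl_panic_halt=1 as a
# module option instead):
echo 1 > /sys/module/spl/parameters/spl_panic_halt

# Install and arm kdump; a reboot is needed for the crashkernel reservation:
apt install linux-crashdump
kdump-config show

# After reproducing the panic and rebooting, the dump should be under /var/crash/.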

@ahrens
Member

ahrens commented Aug 10, 2020

If you haven't configured the system to reboot when ZFS panics, you might be able to get the output of cat /proc/spl/kstat/zfs/dbgmsg (at least the end of it, starting when you do the receive), which might also be helpful.

@bghira

bghira commented Aug 10, 2020

> If you haven't configured the system to reboot when ZFS panics, you might be able to get the output of cat /proc/spl/kstat/zfs/dbgmsg (at least the end of it, starting when you do the receive), which might also be helpful.

This generally has to be enabled first by toggling debug flags (on Linux) before reproducing the issue. They might also want to bump the max size of the log. @ahrens, can you suggest some debug flags they should enable before re-testing, to save time?
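
In the meantime, roughly this (parameter names from memory; see zfs-module-parameters(5)):

# Make sure the in-kernel debug log is enabled and give it plenty of room
# (zfs_dbgmsg_maxsize is in bytes; 64M is an arbitrary choice):
echo 1        > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 67108864 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize

# zfs_flags is a debug bitmask; bit 0 (ZFS_DEBUG_DPRINTF, if I remember the
# bit values right) also routes the verbose dprintf() messages into the log:
echo 1 > /sys/module/zfs/parameters/zfs_flags

# Clear the log right before the receive, then save it afterwards:
echo 0 > /proc/spl/kstat/zfs/dbgmsg
cat /proc/spl/kstat/zfs/dbgmsg > dbgmsg.txt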

@helamonster
Author

I noticed that the dry-run send output includes the line:

snapshot_limit: 18446744073709551614

The destination dataset has that value:

# zfs get snapshot_limit -p   dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch 
NAME                                                     PROPERTY        VALUE    SOURCE
dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch  snapshot_limit  18446744073709551614  local

As does the source:

#  /home/jeremy/documents/migrate-zfs-system # zfs get snapshot_limit -p  rpool/scratch
NAME           PROPERTY        VALUE    SOURCE
rpool/scratch  snapshot_limit  18446744073709551614  local

I set that value a while back on the source dataset to the maximum value it would accept. So maybe there's an off-by-one bug or something...

Since the dmesg line mentions a function (dsl_fs_ss_count_adjust) that appears to be related to snapshot counting:

[Mon Aug 10 09:30:15 2020] PANIC at dsl_dir.c:925:dsl_fs_ss_count_adjust()

... it looks like it might be related to this property.
The comment in dsl_dir.c says:

The filesystem and snapshot limits are validated by dsl_fs_ss_limit_check()
and updated by dsl_fs_ss_count_adjust(). A new limit value is setup in
dsl_dir_activate_fs_ss_limit() and the counts are adjusted, if necessary, by
dsl_dir_init_fs_ss_count().

Where it panicked:

	/*
	 * If we hit an uninitialized node while recursing up the tree, we can
	 * stop since we know the counts are not valid on this node and we
	 * know we shouldn't touch this node's counts. An uninitialized count
	 * on the node indicates that either the feature has not yet been
	 * activated or there are no limits on this part of the tree.
	 */
	if (!dsl_dir_is_zapified(dd) || (err = zap_lookup(os, dd->dd_object,
	    prop, sizeof (count), 1, &count)) == ENOENT)
		return;
	VERIFY0(err);

	count += delta;
	/* Use a signed verify to make sure we're not neg. */
	VERIFY3S(count, >=, 0);            /* PANICKED HERE */

I bet it's something in there...

I can still provide dbgmsg output if you like, but I'm out of gas for the moment after trying a few things to work around this. I do know that lowering the snapshot_limit values on the source and destination datasets before the send/recv did not make a difference.

@helamonster
Author

Ahh, what the heck: Here's the only dbgmsg output from the recv command:

1597125475   spa_history.c:309:spa_history_log_sync(): txg 2828496 resume receive dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023) 
1597125475   spa_history.c:309:spa_history_log_sync(): txg 2828497 finish receiving dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023) snap=2020-07-23-03-39-10
1597125475   spa_history.c:309:spa_history_log_sync(): txg 2828497 clone swap dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023) parent=scratch
1597125475   spa_history.c:309:spa_history_log_sync(): txg 2828497 snapshot dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch@2020-07-23-03-39-10 (id 562365) 
1597125475   spa_history.c:309:spa_history_log_sync(): txg 2828497 destroy dpool/backups/xyz/jbsnet/latitude/CURRENT/rpool/scratch/%recv (id 530023) 

@helamonster
Author

This also crashes when sending to an entirely new dataset on a different pool, different server (5.4.0-40-generic / 0.8.3-1ubuntu12), even after reducing the snapshot_limit to 9999999 (but this time it sends all the data before crashing).

@ahrens
Member

ahrens commented Aug 11, 2020

I think the panic occurs when destroying the %recv dataset (the last line in the dbgmsg). In dsl_fs_ss_count_adjust() we have this code, which should cause it to bail out before the assertion that we hit:

	/*
	 * When we receive an incremental stream into a filesystem that already
	 * exists, a temporary clone is created.  We don't count this temporary
	 * clone, whose name begins with a '%'. We also ignore hidden ($FREE,
	 * $MOS & $ORIGIN) objsets.
	 */
	if ((dd->dd_myname[0] == '%' || dd->dd_myname[0] == '$') &&
	    strcmp(prop, DD_FIELD_FILESYSTEM_COUNT) == 0)
		return;

@helamonster
Author

Hmm. So the assertion fails either when dsl_fs_ss_count_adjust() is called with the %recv dataset and prop being DD_FIELD_SNAPSHOT_COUNT, or with the dataset being something else entirely, right? Could it be that the snapshot count for %recv was somehow not incremented after the receive but before the call to dsl_fs_ss_count_adjust()?

What's the best way for me to get more details on what's happening? I can run bpftrace if someone can point me to the probes to monitor.

@ahrens
Member

ahrens commented Aug 12, 2020

You could use bpftrace to print out the stack, dd_myname and prop when dsl_fs_ss_count_adjust() is called.
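
Untested, but roughly something like the sketch below. It assumes the current signature dsl_fs_ss_count_adjust(dd, delta, prop, tx); printing dd_myname would additionally need the field's offset within dsl_dir_t (or the zfs headers), since it's a char array inside the struct:

# Print the property name, the delta, and the kernel stack on every call:
bpftrace -e '
kprobe:dsl_fs_ss_count_adjust
{
	/* arg1 = delta, arg2 = prop (the ZAP property name being adjusted) */
	printf("prop=%s delta=%d\n%s\n", str(arg2), arg1, kstack);
}'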

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Aug 23, 2020
@helamonster
Author

I ran into another problem (#10787) that's kept me from further troubleshooting this.
Also, I couldn't figure out exactly how to print the stack trace and those variables; can anyone help with how to do that?

@helamonster
Author

I was hoping this would have been resolved by the changes in the 2.0 release, but I am still encountering what appears to be the same PANIC when attempting to destroy a dataset.
This is on a production system running ZFS on root, so when this PANIC happens I have to reboot the entire server.
I will try to grab more details and report back...

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Linux Kernel 5.4.0-64-generic #72-Ubuntu
Architecture x86_64
ZFS Version 2.0.1-0york0~18.04
SPL Version 2.0.1-0york0~18.04

root@myhost-mirror ~ # zfs destroy -r -v data/myhost-main/data/backupdata/veeam/repos/repo_01_old                                                                 
will destroy data/myhost-main/data/backupdata/veeam/repos/repo_01_old/%recv    

root@myhost-mirror ~ # tail /var/log/syslog
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.505901] VERIFY3(count >= 0) failed (-1 >= 0)
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.510578] PANIC at dsl_dir.c:929:dsl_fs_ss_count_adjust()
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516209] Showing stack for process 83858
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516212] CPU: 58 PID: 83858 Comm: txg_sync Tainted: P           OE     5.4.0-64-generic #72-Ubuntu
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516213] Hardware name: iXsystems IXR4224-E2C-1280-IXN/X10DRH-CLN4, BIOS 3.2 11/21/2019
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516214] Call Trace:
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516234]  dump_stack+0x6d/0x9a
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516245]  spl_dumpstack+0x29/0x2b [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516249]  spl_panic+0xd4/0xfc [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516259]  ? avl_find+0x5f/0x90 [zavl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516321]  ? dbuf_rele+0x3d/0x50 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516370]  ? dmu_buf_rele+0xe/0x10 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516431]  ? zap_unlockdir+0x3f/0x60 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516491]  ? zap_lookup_norm+0x95/0xc0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516537]  dsl_fs_ss_count_adjust+0x1a8/0x1c0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516592]  dsl_destroy_head_sync_impl+0x812/0xe00 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516634]  dsl_destroy_head_sync+0x5d/0xc0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516682]  dsl_sync_task_sync+0xb6/0x100 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516725]  dsl_pool_sync+0x3d4/0x4c0 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516784]  spa_sync+0x5a3/0x1000 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516790]  ? mutex_lock+0x13/0x40
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516852]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516905]  txg_sync_thread+0x2c6/0x460 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516960]  ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516967]  thread_generic_wrapper+0x79/0x90 [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516972]  kthread+0x104/0x140
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516976]  ? __thread_exit+0x20/0x20 [spl]
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516980]  ? kthread_park+0x90/0x90
Jan 26 14:48:45 myhost-mirror kernel: [ 1284.516983]  ret_from_fork+0x35/0x40

@helamonster
Author

Dataset details:

root@myhost-mirror ~ # cat /sys/module/zfs/parameters/zfs_dbgmsg_enable 
1

root@myhost-mirror ~ # zfs get all   data/myhost-main/data/backupdata/veeam/repos/repo_01_old
NAME                                                    PROPERTY              VALUE                                                                                                                                                                                                                                                                                                                                                                                                                                                            SOURCE
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  type                  volume                                                                                                                                                                                                                                                                                                                                                                                                                                                           -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  creation              Mon Oct  5 22:29 2020                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  used                  10.8T                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  available             54.2T                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  referenced            9.52T                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  compressratio         1.00x                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  reservation           64T                                                                                                                                                                                                                                                                                                                                                                                                                                                              received
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  volsize               56T                                                                                                                                                                                                                                                                                                                                                                                                                                                              local
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  volblocksize          1M                                                                                                                                                                                                                                                                                                                                                                                                                                                               -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  checksum              on                                                                                                                                                                                                                                                                                                                                                                                                                                                               default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  compression           off                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  readonly              off                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  createtxg             8608789                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  copies                1                                                                                                                                                                                                                                                                                                                                                                                                                                                                default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  refreservation        none                                                                                                                                                                                                                                                                                                                                                                                                                                                             received
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  guid                  16399445975178063768                                                                                                                                                                                                                                                                                                                                                                                                                                             -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  primarycache          all                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  secondarycache        all                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  usedbysnapshots       941G                                                                                                                                                                                                                                                                                                                                                                                                                                                             -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  usedbydataset         9.52T                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  usedbychildren        326G                                                                                                                                                                                                                                                                                                                                                                                                                                                             -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  usedbyrefreservation  0B                                                                                                                                                                                                                                                                                                                                                                                                                                                               -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  logbias               latency                                                                                                                                                                                                                                                                                                                                                                                                                                                          default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  objsetid              1743950                                                                                                                                                                                                                                                                                                                                                                                                                                                          -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  dedup                 off                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  mlslabel              none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  sync                  always                                                                                                                                                                                                                                                                                                                                                                                                                                                           received
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  refcompressratio      1.00x                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  written               0                                                                                                                                                                                                                                                                                                                                                                                                                                                                -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  logicalused           11.3T                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  logicalreferenced     9.97T                                                                                                                                                                                                                                                                                                                                                                                                                                                            -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  volmode               default                                                                                                                                                                                                                                                                                                                                                                                                                                                          default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  snapshot_limit        none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  snapshot_count        22                                                                                                                                                                                                                                                                                                                                                                                                                                                               local
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  snapdev               hidden                                                                                                                                                                                                                                                                                                                                                                                                                                                           default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  context               none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  fscontext             none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  defcontext            none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  rootcontext           none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  redundant_metadata    all                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  receive_resume_token  1-17920e7c1f-168-789c636064000310a501c49c50360710a715e5e7a69766a630404164fcbe531785f79f5700b2d991d4e52765a5269740f82080219f96569c5a0296f9ffdf0422cf86249f5459925a0ca425b6bc730fc5a2bf241fe28ade9b7758d9e2dfd94520c97382e5f3127353191852124b12f5931293b34b0bc0ccb2d4d4c45cfda2d482fc6230196f60e850549a97979997ae6b646064a06b68a06b60a66b68a86b0a641882ede5654084474e62517a6a524e7e72767e36484c02ea1e987c6a6e526a0a500aa48f1b493c393fb7a028b5b818a20b02007d2a3ec7  -
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  encryption            off                                                                                                                                                                                                                                                                                                                                                                                                                                                              default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  keylocation           none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  keyformat             none                                                                                                                                                                                                                                                                                                                                                                                                                                                             default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  pbkdf2iters           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                default
data/myhost-main/data/backupdata/veeam/repos/repo_01_old  zrepl:placeholder     on                                                                                                                                                                                                                                                                                                                                                                                                                                                               inherited from data/myhost-main/data/backupdata

The PANIC does not output anything to /proc/spl/kstat/zfs/dbgmsg, although that log does have some output from other things beforehand, so it appears to be enabled.

@ahrens
Member

ahrens commented Jan 26, 2021

The originally-reported issue should have been resolved by #10791, which is fixed in 2.0. It's possible that if your filesystem (data/myhost-main/data/backupdata/veeam/repos/repo_01_old) was created on an earlier release, you could still hit this when destroying it.

@helamonster
Author

I also tried aborting the partial receive:

root@myhost-mirror ~ # zfs recv -A   data/myhost-main/data/backupdata/veeam/repos/repo_01_old

Message from syslogd@myhost-mirror at Jan 26 16:01:26 ...
 kernel:[  534.938981] VERIFY3(count >= 0) failed (-1 >= 0)

Message from syslogd@myhost-mirror at Jan 26 16:01:26 ...
 kernel:[  534.943665] PANIC at dsl_dir.c:929:dsl_fs_ss_count_adjust()

From dmesg:

[Tue Jan 26 16:01:26 2021] VERIFY3(count >= 0) failed (-1 >= 0)
[Tue Jan 26 16:01:26 2021] PANIC at dsl_dir.c:929:dsl_fs_ss_count_adjust()
[Tue Jan 26 16:01:26 2021] Showing stack for process 80062
[Tue Jan 26 16:01:26 2021] CPU: 21 PID: 80062 Comm: txg_sync Tainted: P           OE     5.4.0-64-generic #72-Ubuntu
[Tue Jan 26 16:01:26 2021] Hardware name: iXsystems IXR4224-E2C-1280-IXN/X10DRH-CLN4, BIOS 3.2 11/21/2019
[Tue Jan 26 16:01:26 2021] Call Trace:
[Tue Jan 26 16:01:26 2021]  dump_stack+0x6d/0x9a
[Tue Jan 26 16:01:26 2021]  spl_dumpstack+0x29/0x2b [spl]
[Tue Jan 26 16:01:26 2021]  spl_panic+0xd4/0xfc [spl]
[Tue Jan 26 16:01:26 2021]  ? avl_find+0x5f/0x90 [zavl]
[Tue Jan 26 16:01:26 2021]  ? dbuf_rele+0x3d/0x50 [zfs]
[Tue Jan 26 16:01:26 2021]  ? dmu_buf_rele+0xe/0x10 [zfs]
[Tue Jan 26 16:01:26 2021]  ? zap_unlockdir+0x3f/0x60 [zfs]
[Tue Jan 26 16:01:26 2021]  ? zap_lookup_norm+0x95/0xc0 [zfs]
[Tue Jan 26 16:01:26 2021]  dsl_fs_ss_count_adjust+0x1a8/0x1c0 [zfs]
[Tue Jan 26 16:01:26 2021]  dsl_destroy_head_sync_impl+0x812/0xe00 [zfs]
[Tue Jan 26 16:01:26 2021]  dsl_destroy_head_sync+0x5d/0xc0 [zfs]
[Tue Jan 26 16:01:26 2021]  dsl_sync_task_sync+0xb6/0x100 [zfs]
[Tue Jan 26 16:01:26 2021]  dsl_pool_sync+0x3d4/0x4c0 [zfs]
[Tue Jan 26 16:01:26 2021]  spa_sync+0x5a3/0x1000 [zfs]
[Tue Jan 26 16:01:26 2021]  ? mutex_lock+0x13/0x40
[Tue Jan 26 16:01:26 2021]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[Tue Jan 26 16:01:26 2021]  txg_sync_thread+0x2c6/0x460 [zfs]
[Tue Jan 26 16:01:26 2021]  ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[Tue Jan 26 16:01:26 2021]  thread_generic_wrapper+0x79/0x90 [spl]
[Tue Jan 26 16:01:26 2021]  kthread+0x104/0x140
[Tue Jan 26 16:01:26 2021]  ? __thread_exit+0x20/0x20 [spl]
[Tue Jan 26 16:01:26 2021]  ? kthread_park+0x90/0x90
[Tue Jan 26 16:01:26 2021]  ret_from_fork+0x35/0x40

@helamonster
Author

> The originally-reported issue should have been resolved by #10791, which is fixed in 2.0. It's possible that if your filesystem (data/myhost-main/data/backupdata/veeam/repos/repo_01_old) was created on an earlier release, you could still hit this when destroying it.

Yes, it was created on a 0.8.x release. Any idea how I can safely get rid of this dataset then?

@ahrens
Member

ahrens commented Jan 26, 2021

If you can remove the VERIFY and compile from source, the destroy might "just work" and not introduce any more problems.
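
The assertion lives in module/zfs/dsl_dir.c, which is built into the zfs kernel module (so it's the rebuilt module that needs to be installed and reloaded; the userland tools don't matter here). A minimal, temporary hack would be something like commenting out the failing line in dsl_fs_ss_count_adjust():

	count += delta;
	/* Use a signed verify to make sure we're not neg. */
	/* VERIFY3S(count, >=, 0); */	/* temporarily disabled to let the destroy proceed */

Afterwards the stock module/packages can be put back.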

@helamonster
Author

I have yet to be able to try that, as the system encountering this problem is in production and it's critical that it stay up and running. But I have also now encountered this problem on another system, which has hundreds of datasets that I am unable to destroy (or destroy snapshots from) due to the same issue. And since that server is continuously receiving snapshots from other systems, it is only a matter of time before its pool runs out of space. Yikes! I am now in a race against the clock to solve this, so I will likely be testing this shortly.

But let's say it does work... Shouldn't that change (or a more proper fix for the issue) be made to zfs (not just the manually compiled version on my server) to avoid this problem showing up again in the future for anyone else?

Also, I can't readily tell at the moment, but is that change to dsl_dir.c part of the kernel module or the zfs userland CLI program?

Thanks!

@helamonster
Author

On a system running zfs 0.8.4 I have rebuilt zfs, commenting out the VERIFY statement in question. So far I have been able to destroy such problematic datasets (and snapshots on those datasets) without any visible sign of a problem.

How should we move forward with this information to fix this problem permanently for everyone?
I don't want to have to rebuild zfs with this change for every new version and don't want to have to migrate data from problematic datasets to new datasets.
Could there be a way to fix the affected datasets in-place?

@stale

stale bot commented Jul 31, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Jul 31, 2022
@stale stale bot closed this as completed Oct 29, 2022