
kernel/userspace version mismatch: segfault on 'zfs list -r -t all' #892

Closed

GregorKopka opened this issue Aug 24, 2012 · 4 comments

@GregorKopka (Contributor)
I get this segfault by issuing 'zfs list -r -t all' on a dataset with snapshots; it works when the dataset doesn't have snapshots, or on individual snapshots (when I know their names):

$ zfs create data/segfault
$ zfs list -r -t all data/segfault
NAME USED AVAIL REFER MOUNTPOINT
data/segfault 83K 500G 83K /data/segfault

$ zfs snapshot data/segfault@fail
$ zfs list -r -t all data/segfault
Segmentation fault

$ zfs list -r -t all data/segfault@fail
NAME USED AVAIL REFER MOUNTPOINT
data/segfault@fail 0 - 83K -

$ zfs destroy data/segfault@fail
$ zfs list -r -t all data/segfault
NAME USED AVAIL REFER MOUNTPOINT
data/segfault 83K 500G 83K /data/segfault

Linux version 3.0.6-gentoo (root@backend) (gcc version 4.4.5 (Gentoo 4.4.5 p1.2, pie-0.4.5) ) #1 SMP Thu Aug 23 01:05:49 CEST 2012

$ zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
data 3,62T 3,07T 558G 84% 1.00x ONLINE -

$ zpool status
pool: data
state: ONLINE
scan: scrub canceled on Fri Aug 24 17:57:51 2012
config:

    NAME            STATE     READ WRITE CKSUM
    data            ONLINE       0     0     0
      mirror-0      ONLINE       0     0     0
        data-0-2    ONLINE       0     0     0
        data-0-1    ONLINE       0     0     0
      mirror-1      ONLINE       0     0     0
        data-1-2    ONLINE       0     0     0
        data-1-1    ONLINE       0     0     0
      mirror-2      ONLINE       0     0     0
        data-2-2    ONLINE       0     0     0
        data-2-1    ONLINE       0     0     0
    cache
      data-cache-0  ONLINE       0     0     0

errors: No known data errors

The problem occurs with both zfs-rc10 and zfs-9999 from portage; I also tried it with kernel 3.0.6. Everything worked fine with rc8 until the upgrade; the only other thing I changed in the kernel was a flag that zfs complained about as missing, after which I recompiled it.

gdb backtrace:

Program received signal SIGSEGV, Segmentation fault.
#0 0x00007fffffffe160 in ?? ()

No symbol table info available.
#1 0x00007ffff6c0c635 in zfs_iter_snapshots () from /lib64/libzfs.so.1

No symbol table info available.
#2 0x0000000000404af8 in zfs_callback (zhp=0x62e400, data=0x7fffffffe160) at ../../cmd/zfs/zfs_iter.c:133

    dontclose = 1
    include_snaps = 2

#3 0x00000000004051c4 in zfs_for_each (argc=, argv=, flags=, types=, sortcol=, proplist=, limit=0, callback=0x408220 <list_callback>, data=0x7fffffffe250) at ../../cmd/zfs/zfs_iter.c:433

    i = 1
    zhp = 0x635bc0
    argtype = 7
    cb = {cb_avl = 0x621fa0, cb_flags = 3, cb_types = 7, cb_sortcol = 0x0, cb_proplist = 0x7fffffffe258, cb_depth_limit = 0, cb_depth = 1, cb_props_table = "\000\000\001\001\001\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\001\000\000\000\001\001", '\000' <repeats 31 times>}
    ret = 0
    node = <optimized out>
    walk = <optimized out>

#4 0x0000000000409fa3 in zfs_do_list (argc=5, argv=0x7fffffffe3f0) at ../../cmd/zfs/zfs_main.c:2830

    c = <optimized out>
    scripted = B_FALSE
    default_fields = "name\000used\000available\000referenced\000mountpoint"
    types = 7
    types_specified = B_TRUE
    fields = 0x617aa0 "name"
    cb = {cb_first = B_TRUE, cb_scripted = B_FALSE, cb_proplist = 0x6322f0}
    value = 0x0
    limit = 0
    ret = <optimized out>
    sortcol = 0x0
    flags = 3

#5 0x000000000040b5d7 in main (argc=6, argv=0x7fffffffe3e8) at ../../cmd/zfs/zfs_main.c:6183

    ret = <optimized out>
    i = 9
    cmdname = 0x7fffffffe670 "list"

rax 0x635bc0 6511552
rbx 0x7fffffff9b30 140737488329520
rcx 0x0 0
rdx 0x0 0
rsi 0x20f10 134928
rdi 0x635bc0 6511552
rbp 0x62e400 0x62e400
rsp 0x7fffffff9b18 0x7fffffff9b18
r8 0x62e6d0 6481616
r9 0x0 0
r10 0x0 0
r11 0x1 1
r12 0x404950 4213072
r13 0x7fffffffe160 140737488347488
r14 0x20f10 134928
r15 0x0 0
rip 0x7fffffffe160 0x7fffffffe160
eflags 0x10206 [ PF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
=> 0x7fffffffe160: movabs 0x30000000000621f,%al
0x7fffffffe169: add %al,(%rax)
0x7fffffffe16b: add %al,(%rdi)
0x7fffffffe16d: add %al,(%rax)
0x7fffffffe16f: add %al,(%rax)
0x7fffffffe171: add %al,(%rax)
0x7fffffffe173: add %al,(%rax)
0x7fffffffe175: add %al,(%rax)
0x7fffffffe177: add %bl,-0x1e(%rax)
0x7fffffffe17a: (bad)
0x7fffffffe17b: (bad)
0x7fffffffe17c: (bad)
0x7fffffffe17d: jg 0x7fffffffe17f
0x7fffffffe17f: add %al,(%rax)
0x7fffffffe181: add %al,(%rax)
0x7fffffffe183: add %al,(%rcx)

Thread 1 (Thread 0x7ffff7fe4f40 (LWP 8146)):
#0 0x00007fffffffe160 in ?? ()
#1 0x00007ffff6c0c635 in zfs_iter_snapshots () from /lib64/libzfs.so.1
#2 0x0000000000404af8 in zfs_callback (zhp=0x62e400, data=0x7fffffffe160) at ../../cmd/zfs/zfs_iter.c:133
#3 0x00000000004051c4 in zfs_for_each (argc=, argv=, flags=, types=, sortcol=, proplist=, limit=0, callback=0x408220 <list_callback>, data=0x7fffffffe250) at ../../cmd/zfs/zfs_iter.c:433
#4 0x0000000000409fa3 in zfs_do_list (argc=5, argv=0x7fffffffe3f0) at ../../cmd/zfs/zfs_main.c:2830
#5 0x000000000040b5d7 in main (argc=6, argv=0x7fffffffe3e8) at ../../cmd/zfs/zfs_main.c:6183

A debugging session is active.

    Inferior 1 [process 8146] will be killed.

In case you need more info or anything else, please let me know.

@GregorKopka (Contributor, Author)

I just found out that this issue also affects 'zfs send' operations, rendering the setup completely unusable at the moment.

It also happens with another, freshly created pool (on a new thumb drive, with the pool I originally discovered the problem on kept hidden by not decrypting its partitions), so I guess the problem does not originate from garbage contained in the on-disk data.

@behlendorf (Contributor)

@mmatuska Could you take a look at this if you've got a minute? I believe this issue snuck in with one of the recent Illumos backports. I haven't had a chance to investigate it seriously yet, but it's on the short list of things we need to get fixed.

@GregorKopka (Contributor, Author)

Problem found:

There were stale executables in /usr/local/sbin on the system (for whatever reason) which, not surprisingly, had trouble dealing with a kernel module of a different version. I guess they were leftovers from the initial install of zfsonlinux on this machine, which came from the pendor overlay back in March, and hadn't been cleaned up correctly when I removed the overlay after zfs appeared in portage.

The only lesson to learn might be to have the userland tools check the version of the kernel module, though in this case that might have failed anyway since they were all from 0.6.0 RCs.
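
For illustration, a minimal sketch of what such a userland-side check could look like, assuming the loaded module exposes its version through /sys/module/zfs/version and that the tool carries a compiled-in version string (the USERLAND_VERSION define below is a made-up placeholder, not a real build macro):

```c
/*
 * Illustrative sketch only: compare the tool's compiled-in version
 * against whatever the loaded kernel module reports.  Assumes the
 * module exposes its version via /sys/module/zfs/version.
 */
#include <stdio.h>
#include <string.h>

#ifndef USERLAND_VERSION
#define	USERLAND_VERSION	"0.6.0-rc10"	/* stand-in value */
#endif

static int
check_module_version(void)
{
	char modver[64] = { 0 };
	FILE *fp = fopen("/sys/module/zfs/version", "r");

	if (fp == NULL)
		return (-1);		/* module not loaded or no sysfs node */

	if (fgets(modver, sizeof (modver), fp) == NULL) {
		(void) fclose(fp);
		return (-1);
	}
	(void) fclose(fp);
	modver[strcspn(modver, "\n")] = '\0';

	if (strcmp(modver, USERLAND_VERSION) != 0) {
		(void) fprintf(stderr, "zfs: userland %s does not match "
		    "kernel module %s\n", USERLAND_VERSION, modver);
		return (1);
	}
	return (0);
}

int
main(void)
{
	return (check_module_version() > 0);
}
```

As noted above, a plain version-string comparison would not have caught this particular case, since both the stale binaries and the module identified themselves as 0.6.0 RCs; comparing an explicit interface revision would be more robust.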

Sorry for wasting your time with this.

@behlendorf (Contributor)

That's good news, and it makes perfect sense. There was a rare change to the kernel ioctl ordering as part of some Illumos changes. We'll certainly try to minimize that sort of thing going forward. However, it might still not be a bad idea to version that interface as an extra sanity check.
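
For illustration only, here is one way such an interface version check could look. The names below (my_zfs_cmd_t, zc_abi_version, ZFS_IOCVER_CURRENT, zfsdev_check_version) are hypothetical and not the actual ZFS ioctl interface; the idea is simply that the userland library stamps each request with the ABI revision it was built against, and the kernel dispatcher rejects anything else instead of misinterpreting the struct:

```c
/*
 * Hypothetical sketch of a versioned ioctl ABI: the request struct
 * starts with a fixed-size version field that the kernel-side
 * dispatcher validates before touching any other field.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define	ZFS_IOCVER_CURRENT	3	/* bumped whenever the struct layout
					   or ioctl numbering changes */

typedef struct my_zfs_cmd {
	uint32_t	zc_abi_version;	/* must stay first and fixed-size */
	char		zc_name[256];
	/* ... remaining fields may change between versions ... */
} my_zfs_cmd_t;

/* What the kernel-side dispatcher would do before parsing the request. */
static int
zfsdev_check_version(const my_zfs_cmd_t *zc)
{
	if (zc->zc_abi_version != ZFS_IOCVER_CURRENT)
		return (ENOTSUP);	/* stale utility or stale module */
	return (0);
}

int
main(void)
{
	/* Simulate a request stamped by an older userland build. */
	my_zfs_cmd_t zc = { .zc_abi_version = 2, .zc_name = "data/segfault" };

	if (zfsdev_check_version(&zc) != 0)
		(void) fprintf(stderr, "zfs: ioctl ABI mismatch, "
		    "refusing request for %s\n", zc.zc_name);
	return (0);
}
```

Keeping the version field first and fixed-size means even an older module can read it safely before interpreting any field whose layout may have changed.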
