
kernel/userspace version mismatch: segfault on 'zfs list -r -t all' #892

Closed

GregorKopka opened this issue Aug 24, 2012 · 4 comments

@GregorKopka (Contributor)
I get this segfault by issuing 'zfs list -r -t all' on a dataset with snapshots; it works when the dataset doesn't have snapshots, or on individual snapshots (when I know their names):

$ zfs create data/segfault
$ zfs list -r -t all data/segfault
NAME USED AVAIL REFER MOUNTPOINT
data/segfault 83K 500G 83K /data/segfault

$ zfs snapshot data/segfault@fail
$ zfs list -r -t all data/segfault
Segmentation fault

$ zfs list -r -t all data/segfault@fail
NAME USED AVAIL REFER MOUNTPOINT
data/segfault@fail 0 - 83K -

$ zfs destroy data/segfault@fail
$ zfs list -r -t all data/segfault
NAME USED AVAIL REFER MOUNTPOINT
data/segfault 83K 500G 83K /data/segfault

Linux version 3.0.6-gentoo (root@backend) (gcc version 4.4.5 (Gentoo 4.4.5 p1.2, pie-0.4.5) ) #1 SMP Thu Aug 23 01:05:49 CEST 2012

$ zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
data 3,62T 3,07T 558G 84% 1.00x ONLINE -

$ zpool status
pool: data
state: ONLINE
scan: scrub canceled on Fri Aug 24 17:57:51 2012
config:

    NAME            STATE     READ WRITE CKSUM
    data            ONLINE       0     0     0
      mirror-0      ONLINE       0     0     0
        data-0-2    ONLINE       0     0     0
        data-0-1    ONLINE       0     0     0
      mirror-1      ONLINE       0     0     0
        data-1-2    ONLINE       0     0     0
        data-1-1    ONLINE       0     0     0
      mirror-2      ONLINE       0     0     0
        data-2-2    ONLINE       0     0     0
        data-2-1    ONLINE       0     0     0
    cache
      data-cache-0  ONLINE       0     0     0

errors: No known data errors

The problem occurs with both zfs-rc10 and zfs-9999 from portage; I also tried it with kernel 3.0.6. Everything worked fine with rc8 until the upgrade; the only other thing I changed in the kernel was a flag that zfs complained about as missing, after which I recompiled it.

gdb backtrace:

Program received signal SIGSEGV, Segmentation fault.
#0 0x00007fffffffe160 in ?? ()

No symbol table info available.
#1 0x00007ffff6c0c635 in zfs_iter_snapshots () from /lib64/libzfs.so.1

No symbol table info available.
#2 0x0000000000404af8 in zfs_callback (zhp=0x62e400, data=0x7fffffffe160) at ../../cmd/zfs/zfs_iter.c:133

    dontclose = 1
    include_snaps = 2

#3 0x00000000004051c4 in zfs_for_each (argc=, argv=, flags=, types=, sortcol=, proplist=, limit=0, callback=0x408220 <list_callback>, data=0x7fffffffe250) at ../../cmd/zfs/zfs_iter.c:433

    i = 1
    zhp = 0x635bc0
    argtype = 7
    cb = {cb_avl = 0x621fa0, cb_flags = 3, cb_types = 7, cb_sortcol = 0x0, cb_proplist = 0x7fffffffe258, cb_depth_limit = 0, cb_depth = 1, cb_props_table = "\000\000\001\001\001\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\001\000\000\000\001\001", '\000' <repeats 31 times>}
    ret = 0
    node = <optimized out>
    walk = <optimized out>

#4 0x0000000000409fa3 in zfs_do_list (argc=5, argv=0x7fffffffe3f0) at ../../cmd/zfs/zfs_main.c:2830

    c = <optimized out>
    scripted = B_FALSE
    default_fields = "name\000used\000available\000referenced\000mountpoint"
    types = 7
    types_specified = B_TRUE
    fields = 0x617aa0 "name"
    cb = {cb_first = B_TRUE, cb_scripted = B_FALSE, cb_proplist = 0x6322f0}
    value = 0x0
    limit = 0
    ret = <optimized out>
    sortcol = 0x0
    flags = 3

#5 0x000000000040b5d7 in main (argc=6, argv=0x7fffffffe3e8) at ../../cmd/zfs/zfs_main.c:6183

    ret = <optimized out>
    i = 9
    cmdname = 0x7fffffffe670 "list"

rax 0x635bc0 6511552
rbx 0x7fffffff9b30 140737488329520
rcx 0x0 0
rdx 0x0 0
rsi 0x20f10 134928
rdi 0x635bc0 6511552
rbp 0x62e400 0x62e400
rsp 0x7fffffff9b18 0x7fffffff9b18
r8 0x62e6d0 6481616
r9 0x0 0
r10 0x0 0
r11 0x1 1
r12 0x404950 4213072
r13 0x7fffffffe160 140737488347488
r14 0x20f10 134928
r15 0x0 0
rip 0x7fffffffe160 0x7fffffffe160
eflags 0x10206 [ PF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
=> 0x7fffffffe160: movabs 0x30000000000621f,%al
0x7fffffffe169: add %al,(%rax)
0x7fffffffe16b: add %al,(%rdi)
0x7fffffffe16d: add %al,(%rax)
0x7fffffffe16f: add %al,(%rax)
0x7fffffffe171: add %al,(%rax)
0x7fffffffe173: add %al,(%rax)
0x7fffffffe175: add %al,(%rax)
0x7fffffffe177: add %bl,-0x1e(%rax)
0x7fffffffe17a: (bad)
0x7fffffffe17b: (bad)
0x7fffffffe17c: (bad)
0x7fffffffe17d: jg 0x7fffffffe17f
0x7fffffffe17f: add %al,(%rax)
0x7fffffffe181: add %al,(%rax)
0x7fffffffe183: add %al,(%rcx)

Thread 1 (Thread 0x7ffff7fe4f40 (LWP 8146)):
#0 0x00007fffffffe160 in ?? ()
#1 0x00007ffff6c0c635 in zfs_iter_snapshots () from /lib64/libzfs.so.1
#2 0x0000000000404af8 in zfs_callback (zhp=0x62e400, data=0x7fffffffe160) at ../../cmd/zfs/zfs_iter.c:133
#3 0x00000000004051c4 in zfs_for_each (argc=, argv=, flags=, types=, sortcol=, proplist=, limit=0, callback=0x408220 <list_callback>, data=0x7fffffffe250) at ../../cmd/zfs/zfs_iter.c:433
#4 0x0000000000409fa3 in zfs_do_list (argc=5, argv=0x7fffffffe3f0) at ../../cmd/zfs/zfs_main.c:2830
#5 0x000000000040b5d7 in main (argc=6, argv=0x7fffffffe3e8) at ../../cmd/zfs/zfs_main.c:6183

A debugging session is active.

    Inferior 1 [process 8146] will be killed.

In case you need more info or anything else, please let me know.

@GregorKopka (Contributor, Author)

I just found out that this issue also affects 'zfs send' operations, rendering the setup completely unusable at the moment.

It also happens with another, freshly created pool (on a new thumb drive, with the pool I originally discovered the problem on kept hidden by not decrypting its partitions), so I guess the problem does not originate from garbage contained in the on-disk data.

@behlendorf (Contributor)

@mmatuska Could you take a look at this if you've got a minute? I believe this issue snuck in with one of the recent Illumos backports. I haven't had a chance to investigate it seriously yet, but it's on the short list of things we need to get fixed.

@GregorKopka (Contributor, Author)

Problem found:

There were stale executables in /usr/local/sbin on the system (for whatever reason) which, not surprisingly, had trouble dealing with a kernel module of a different version. I guess they were leftovers from the initial install of zfsonlinux on this machine, which came from the pendor overlay back in March, and hadn't been cleaned up correctly when I removed the overlay after zfs appeared in portage.

The only lesson to learn might be to have the userland tools check the version of the kernel module, though in this case that might have failed anyway since they were all from 0.6.0 RCs.
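
For illustration, a minimal sketch of what such a userland-side check could look like, assuming the loaded module exposes its version through /sys/module/zfs/version and that the tool carries a compiled-in version string (the USERLAND_VERSION define below is a made-up placeholder, not a real build macro):

```c
/*
 * Illustrative sketch only: compare the tool's compiled-in version
 * against whatever the loaded kernel module reports.  Assumes the
 * module exposes its version via /sys/module/zfs/version.
 */
#include <stdio.h>
#include <string.h>

#ifndef USERLAND_VERSION
#define	USERLAND_VERSION	"0.6.0-rc10"	/* stand-in value */
#endif

static int
check_module_version(void)
{
	char modver[64] = { 0 };
	FILE *fp = fopen("/sys/module/zfs/version", "r");

	if (fp == NULL)
		return (-1);		/* module not loaded or no sysfs node */

	if (fgets(modver, sizeof (modver), fp) == NULL) {
		(void) fclose(fp);
		return (-1);
	}
	(void) fclose(fp);
	modver[strcspn(modver, "\n")] = '\0';

	if (strcmp(modver, USERLAND_VERSION) != 0) {
		(void) fprintf(stderr, "zfs: userland %s does not match "
		    "kernel module %s\n", USERLAND_VERSION, modver);
		return (1);
	}
	return (0);
}

int
main(void)
{
	return (check_module_version() > 0);
}
```

As noted above, a plain version-string comparison would not have caught this particular case, since both the stale binaries and the module identified themselves as 0.6.0 RCs; comparing an explicit interface revision would be more robust.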

Sorry for wasting your time with this.

@behlendorf (Contributor)

That's good news, and it makes perfect sense. There was a rare change to the kernel ioctl ordering as part of some Illumos changes. We'll certainly try to minimize that sort of thing going forward. However, it might still not be a bad idea to version that interface as an extra sanity check.
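
For illustration only, here is one way such an interface version check could look. The names below (my_zfs_cmd_t, zc_abi_version, ZFS_IOCVER_CURRENT, zfsdev_check_version) are hypothetical and not the actual ZFS ioctl interface; the idea is simply that the userland library stamps each request with the ABI revision it was built against, and the kernel dispatcher rejects anything else instead of misinterpreting the struct:

```c
/*
 * Hypothetical sketch of a versioned ioctl ABI: the request struct
 * starts with a fixed-size version field that the kernel-side
 * dispatcher validates before touching any other field.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define	ZFS_IOCVER_CURRENT	3	/* bumped whenever the struct layout
					   or ioctl numbering changes */

typedef struct my_zfs_cmd {
	uint32_t	zc_abi_version;	/* must stay first and fixed-size */
	char		zc_name[256];
	/* ... remaining fields may change between versions ... */
} my_zfs_cmd_t;

/* What the kernel-side dispatcher would do before parsing the request. */
static int
zfsdev_check_version(const my_zfs_cmd_t *zc)
{
	if (zc->zc_abi_version != ZFS_IOCVER_CURRENT)
		return (ENOTSUP);	/* stale utility or stale module */
	return (0);
}

int
main(void)
{
	/* Simulate a request stamped by an older userland build. */
	my_zfs_cmd_t zc = { .zc_abi_version = 2, .zc_name = "data/segfault" };

	if (zfsdev_check_version(&zc) != 0)
		(void) fprintf(stderr, "zfs: ioctl ABI mismatch, "
		    "refusing request for %s\n", zc.zc_name);
	return (0);
}
```

Keeping the version field first and fixed-size means even an older module can read it safely before interpreting any field whose layout may have changed.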
