Snapshot Directory (.zfs) #173

Closed
edwinvaneggelen opened this issue Mar 23, 2011 · 80 comments
Labels
Type: Feature Feature request or new feature

Comments

@edwinvaneggelen

I was unable to find the .zfs directory. Normally this directory is present after snapshots are created, but with the rc-2 release I could not find it even after creating snapshots.

@behlendorf
Contributor

The .zfs snapshot directory is not yet supported. It's on the list of development items which need to be worked on. Snapshots can still be mounted directly read-only using 'mount -t zfs pool/dataset@snap /mntpoint'.

http://zfsonlinux.org/zfs-development-items.html
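
For example, something along these lines (the pool, dataset, and snapshot names here are only placeholders):

# Mount a snapshot read-only at a temporary location, browse it, then unmount.
mkdir -p /mnt/snap
mount -t zfs tank/home@monday /mnt/snap
ls /mnt/snap
umount /mnt/snap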

@behlendorf
Contributor

Summary of Required Work

While snapshots do work, the .zfs snapshot directory has not yet been implemented. Snapshots can be manually mounted as needed with the mount command, mount -t zfs dataset@snap /mnt/snap. To implement the .zfs snapshot directory a special .zfs inode must be created. This inode will have custom hooks which allow it to list available snapshots as part of readdir(), and when a listed snapshot is traversed the dataset must be mounted on demand. This should all be doable using the existing Linux automounter framework, which has the advantage of simplifying the zfs code.

@ghost

ghost commented May 6, 2011

Following your advice to use automounter I came up with the following:

#!/bin/bash

# /etc/auto.zfs
# This file must be executable to work! chmod 755!

key="$1"
opts="-fstype=zfs"

for P in /bin /sbin /usr/bin /usr/sbin
do
    if [ -x $P/zfs ]
    then
        ZFS=$P/zfs
        break
    fi
done

[ -x "$ZFS" ] || exit 1

ZFS="$ZFS list -rHt snapshot -o name $key"

$ZFS | LC_ALL=C sort -k 1 | \
    awk -v key="$key" -v opts="$opts" -- '
    BEGIN   { ORS=""; first=1 }
        { if (first) { print opts; first=0 }; s=$1; sub(key, "", s); sub(/@/, "/", s); print " \\\n\t" s " :" $1 }
    END { if (!first) print "\n"; else exit 1 } ' 

and to /etc/auto.master add

/.zfs  /etc/auto.zfs

Snapshots can then be easily accessed through /.zfs/poolname/fsname/snapshotname.
While this does not give us the .zfs directory inside our filesystem, it at least gives an easy way to access the snapshots for now.
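
To use it, roughly (the paths and the autofs init script name are assumptions; adjust for your distribution):

chmod 755 /etc/auto.zfs            # the program map must be executable
/etc/init.d/autofs reload          # or restart autofs however your distro prefers
ls /.zfs/tank/home/monday          # browse snapshot 'monday' of dataset tank/home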

@behlendorf
Contributor

Neat. Thanks for posting this. As you say it provides a handy way to get to the snapshots until we get the .zfs directory in place.

@rohan-puri
Contributor

Hi Brian, I would like to work on this.

@behlendorf
Contributor

Sounds good to me! I'd very much like to see this get done; it's a big deal for people who want to use ZFS for an NFS server. I've tried to describe my initial thinking at a high level in the previous comment:

https://github.com/behlendorf/zfs/issues/173#issuecomment-1095388

Why don't you dig in to the nitty-gritty details of this, and we can discuss any concerns, problems, or issues you run into.

@rohan-puri
Contributor

Thank you Brian, I will start with this and discuss if I have any problems. FYI, I have created a branch named snapshot of your fork of the 'zfs' repo and will be working on this in that branch.

@rohan-puri
Contributor

Hi Brian,

I am done with the snapshot automounting framework. When a snapshot named 'snap1' is created on a pool named 'tank', which is mounted by default on '/tank', one can access the contents of the snapshot by cd'ing to '/tank/.zfs/snapshot/snap1'. The implementation uses the Linux automount framework as you suggested. Also, when someone destroys the dataset, the snapshot is unmounted, and a pool can be destroyed even while a snapshot is mounted. Multiple snapshot mounts/unmounts work. The other places where the snapshot unmount is called are rename and promote, which also work now.

But one issue is that the functions I am calling are GPL-exported Linux kernel functions, which conflicts with the ZFS CDDL license. Currently, to check the implementation, I changed the CDDL license to GPL.

One way of solving this issue is to write wrapper functions in the SPL module (which is GPL licensed), export them from SPL, and use them in ZFS instead of directly calling the GPL-exported symbols of the Linux kernel. But I want to know your opinion on this.

These symbols are: vfs_kern_mount(), do_add_mount(), and mark_mounts_for_expiry().

BTW, I am currently working on access to auto-mounted snapshots through NFS.

The link to the branch is: https://github.com/rohan-puri/zfs/tree/snapshot

Please have a look at the branch when you get time and let me know whether the implementation approach seems correct.

@behlendorf
Contributor

Hi Rohan,

Thanks again for working on this. Here are my initial review comments and questions:

  • This isn't exactly what I had in mind. I must not have explained myself well, so let me try again. What I want to do is integrate ZFS with the generic Linux automounter to manage snapshots as autofs mount points. See man automount. Basically, we should be able to set up an automount map (/etc/auto.zfs) which describes the .zfs snapshot directory for a dataset. We can then provide a program-type map which gets passed the proper key and returns the correct snapshot entry. Then when the snapshots are accessed we can rely on the automounter to mount them read-only and umount them after they are idle. On the in-kernel ZFS side we need some minimal support to show this .zfs directory and the list of available snapshots. There are obviously some details still to be worked out, but that's basically the rough outline I'd like to pursue. This approach has several advantages.
    • Once we're using the standard Linux automounter there should be no need to check /proc/mounts (PROC_MOUNTS) instead of /etc/mnttab (MNTTAB). By using the standard mount utility and mount.zfs helper to mount the snapshots we ensure /etc/mnttab is properly locked and updated.
    • We don't need to use any GPL-only symbols such as vfs_kern_mount(), do_add_mount(), and mark_mounts_for_expiry().
    • This should require fewer changes to the ZFS kernel code and be less complex.
    • Using the automounter is the Linux way; it should be more understandable to the general Linux community.
  • Add the full CDDL header to zpl_snap.c. We want it to be clear that all of the code being added to the ZFS repository is licensed under the CDDL.
  • As an aside, if you want to change the module license for debugging purposes you can just update the 'License' line in the META file. Change CDDL to GPL and all the modules will be built as GPL. This way you don't need to modify all the files individually. Obviously, this is only for local debugging.

@jeremysanders

Would this automounter idea make it impossible to see the .zfs directories over nfs?

@behlendorf
Contributor

Good question, I hope not. We should see what the current Linux behavior is regarding automounts on NFS servers. There's no reason we can't use ext4 for this today and see if traversing into an automount-mapped directory via NFS triggers the automount. If it doesn't, we'll need to figure out whether there's anything we can do about it; we absolutely want the .zfs directory to work for NFS mounts.
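
A rough sketch of that experiment, with a made-up device, map, and export (none of this is specific to ZFS):

# On the NFS server: an ext4-backed indirect autofs map.
echo '/autotest /etc/auto.test' >> /etc/auto.master
echo 'data -fstype=ext4 :/dev/sdb1' > /etc/auto.test
/etc/init.d/autofs reload
echo '/autotest *(ro,no_subtree_check)' >> /etc/exports
exportfs -ra

# On a client: does traversing the map entry over NFS trigger the automount?
mount -t nfs server:/autotest /mnt
ls /mnt/data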

@gunnarbeutner
Contributor

I've come across this issue while implementing libshare. Assuming auto-mounted directories aren't visible (I haven't tested it yet), a possible work-around would be to use the "crossmnt" NFS option, although that would have the side effect of making other sub-volumes accessible via NFS, which is different from what Solaris does.

@rohan-puri
Contributor

Hello Brian,

I tried using bziller's script posted above and was able to mount the ZFS snapshots using the Linux automounter. We can make changes to the script and write minimal kernel code to show the .zfs and snapshot dir lists.

I agree with the approach you have provided.

The only thing we need to take care of is that the unmount happens not only when the mounted snapshot file-system is idle, but also in the following cases:

  1. When snapshot is destroyed.
  2. When the file-system is destroyed of which the snapshot was taken.
  3. When the pool is destroyed that had the snapshots of one or more filesystem/s.
  4. When a snapshot is renamed.
  5. When a promote command is used.

When we use the Linux automounter, to force expiry of a mountpoint we need to send the USR1 signal to automount. The command is:

killall -USR1 automount

It unmounts the unused snapshots. Have checked this.

Now the thing is we need to trigger this command for each of the above cases.

Need your feedback on this.

@behlendorf
Contributor

Wonderful, I'm glad to have you working on this.

I also completely agree we need to handle the 5 cases you're describing. I think the cleanest way to do this will be to update the zfs utilities to perform the umount. In all the cases you're concerned about, the umount is needed because a user ran some zfs command. In the context of that command you can call something like unshare_unmount() to do the umount. This has the advantage of also cleanly tearing down any NFS/SMB share.

I don't think you need to inform the automounter of this in any way, but I haven't checked so I could be wrong about that.
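
To illustrate the userspace side of that idea, a rough sketch (this is not the actual zfs code path; the helper name and mountpoint layout are assumptions):

#!/bin/sh
# destroy-snap.sh -- unmount a snapshot's .zfs automount point (if active)
# before destroying it, so no stale mount is left behind.
# Usage: destroy-snap.sh tank/home@monday
snap="$1"
fs="${snap%@*}"; name="${snap#*@}"
mnt="$(zfs get -H -o value mountpoint "$fs")/.zfs/snapshot/$name"
mountpoint -q "$mnt" && umount "$mnt"
exec zfs destroy "$snap"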

@gunnarbeutner
Contributor

Ideally the snapshot directory feature should work across cgroups/OpenVZ containers, so (end-)users can access snapshots when using ZFS datasets to store the root filesystem for containers.

@rohan-puri
Contributor

Hi Brian,

I have implemented the minimal per-file-system directory hierarchy (in kernel) in my fork of your zfs repository (new snapshot branch created), which supports <dentry, inode> creation for the .zfs dir, the snapshot dir, and the snapshot entry dirs.

I was playing with the Linux automounter and am facing some issues:

bziller above used an indirect map, in which we need to specify a mount-point in the auto.master file under which the zfs snapshot datasets would be mounted (this list is generated by giving the key, which in this case is the fs dataset, to the auto.zfs map file, which is specific to the zfs filesystem).

In this case bziller solved the problem using /.zfs as the autofs mountpoint.

But each snapshot needs to be mounted under the .zfs/snapshot dir of its mount-point, which is different for each file system.

So this autofs mount-point has to be different for each individual zfs file system, under which we will mount snapshots related to that fs later on (using some kind of script auto.zfs as you said).

So the problem here is which mountpoint we need to specify in the auto.master file?

  1. We need to specify the mount-point as /fs-path-to-mntpt/.zfs/snapshot (if this is the case, in-kernel support for the minimal dir hierarchy is also not required, as autofs takes care of creation). The problem is that this list will vary, so on creation of each fs or pool (via the zpool/zfs utility) we need to edit the file and restart the automount service.
  2. Use '/' as the mount-point (cannot do this).
  3. Can we execute shell commands in auto.master so that we can get the list and do some string appending to get the proper mount-point? (Execution of commands is not supported by auto.master, but auto.zfs specifically is executable.)

Need your view on this.

@Rudd-O
Contributor

Rudd-O commented Jul 6, 2011

This will be much easier to do once we integrate with systemd. Systemd will take care of doing the right thing with respect to configuring the Linux kernel automounter -- all we will have to do is simply export the filesystems over NFS and voila.

@behlendorf
Contributor

In my view the cleanest solution will be your number 1 above.

When a new zfs filesystem is created it will be automatically added to /etc/auto.master and the automount daemon signaled to pick up the change. Conversely, when a dataset is destroyed it must be removed and the automount daemon signaled. For example, if we create the filesystem tank/fish the /etc/auto.master would be updated like this.

/tank/fish/.zfs/snapshot        /etc/auto.zfs        -zpool=tank

The /etc/auto.zfs script can then be used as a generic indirect map as described above. However, since it would be good to validate the key against the known set of snapshots we also need to make the pool name available to the script. I believe this can be done by passing it as an option to the map. The man page says arguments with leading dashes are considered options for the maps but I haven't tested this.

Sound good?
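
Very roughly, the userspace side could look like this (the HUP-to-reload behaviour and the exact sed expression are assumptions, not tested):

# On 'zfs create tank/fish': add the map entry and ask automount to re-read it.
echo '/tank/fish/.zfs/snapshot  /etc/auto.zfs  -zpool=tank' >> /etc/auto.master
kill -HUP $(pidof automount)   # assumed to trigger a master map re-read

# On 'zfs destroy tank/fish': drop the entry again and signal once more.
sed -i '\|^/tank/fish/\.zfs/snapshot|d' /etc/auto.master
kill -HUP $(pidof automount)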

@ulope

ulope commented Aug 27, 2011

To play the devils advocate: I can think of quite a few sysadmins who wouldn't take kindly at all to "some random filesystem" changing system config files.

Isn't this a situation similar to the way mounting of zfs filesystems is handled?
They also "just get mounted" without zfs create modifying /etc/fstab.

@behlendorf
Contributor

The trouble is we want to leverage the automount daemon to automatically do the mount for us so we don't need to have all the snapshots mounted all the time. For that to work we need to keep the automount daemon aware of the available snapshots via the config file.

@ulope

ulope commented Sep 4, 2011

I assumed (naively I'm sure) that there would be some kind of API that would allow dynamically registering/removing automount maps without having to modify the config file.

@behlendorf
Contributor

If only that were true. :)

@khenriks

How about running multiple instances of the automount daemon? Then ZFS could have its own separate auto.master file just for handling its own snapshot mounts.
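
Something like this, assuming automount accepts an alternate master map as its argument (see automount(8); this is an untested assumption):

cat > /etc/auto.master.zfs <<'EOF'
/tank/fish/.zfs/snapshot  /etc/auto.zfs
EOF
automount -f /etc/auto.master.zfs   # second instance, foreground, ZFS-only master map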

@Rudd-O
Contributor

Rudd-O commented Sep 10, 2011

I believe the automount daemon is being phased out in favor of systemd units. Those should be used first. We must write a small systemd helper to inject units into systemd without touching configuration or unit files.

@rohan-puri
Contributor

Hello Rudd-O, I agree that we should leverage systemd instead of making changes to the current infrastructure (the automount daemon). But not all systems may come with systemd, in which case we must provide an alternate way.

Hello Brian,

I do agree with ulope's point. Also, when I was working on this earlier and trying to implement the first solution you mentioned (#173 (comment)), even after restarting the automount daemon I was not seeing the changes; they were getting reflected only after a reboot.

NFS, CIFS, etc. all make use of in-kernel mounts, in which case we don't have to rely on automount for mounting.
As per #173 (comment), we can implement this support with an in-kernel mount: the snapshots would be mounted only when they are accessed through their default mount path in the .zfs dir, and each mount would be associated with a timer. When this timer expires and the mountpoint dir is not in use, the unmount is triggered.

All of the above 5 cases in which an unmount is to be triggered can also be covered by this approach.

Need your input :)
Regards,
Rohan

@Rudd-O
Contributor

Rudd-O commented Oct 12, 2011

I totally agree there has got to be an alternate non-systemd way. It'll probably mean some code duplication for a couple of years. It's okay.

@rohan-puri
Contributor

We can avoid this by following the approach described in #173 (comment); need Brian's input on it though.

What's your opinion on that?

@behlendorf
Contributor

So I think there are a few things to consider here. I agree that using the automounter, while it seemed desirable on the surface, seems to be causing more trouble than it's worth. Implementing the .zfs snapshot directory by mounting the snapshot via a kernel upcall during .zfs path traversal seems like a reasonable approach.

However, it's now clear to me why Solaris does something entirely different. If we mount the snapshots like normal filesystems under .zfs they will not be available from nfsv3 clients because they will have a different fsid. Since this is probably a pretty common use case it may be worth reimplementing the Solaris solution. That said, I wouldn't object to including your proposed solution as a stop gap for the short to medium term.
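
The mismatch is easy to see locally once a snapshot has been auto-mounted (the names are hypothetical):

stat -c 'dev=%D  %n' /tank/fish /tank/fish/.zfs/snapshot/monday
# Different device numbers mean an NFSv3 client that mounted /tank/fish cannot
# cross into the snapshot without a separate export or the crossmnt option.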

@Rudd-O
Contributor

Rudd-O commented Oct 18, 2011

Brilliant observation.

b333z added a commit to b333z/zfs that referenced this issue Feb 6, 2012
Fixes "dataset not found" error on zfs destory <snapshot> see openzfs#173.
Fixes race in dsl_dataset_user_release_tmp() when the temp snapshot
from zfs diff dataset@snap command is used see openzfs#481.
@b333z
Contributor

b333z commented Feb 6, 2012

I think I have found what was causing the "dataset does not exist" issue and have submitted a pull request.

@behlendorf
Contributor

Refreshed version of the patch which includes the zfs diff fix. Once we sort out any .zfs/snapshot issues over NFS this is ready to be merged. behlendorf/zfs@f34cd0a

@b333z
Contributor

b333z commented Feb 11, 2012

I am doing some testing on accessing snapshots over NFS, but currently I am unable to access the .zfs directory from the client. After zfs set snapdir=visible system, I can now see the .zfs directory on the NFS client, but attempting to access it gives: ls: cannot open directory /system/.zfs: No such file or directory...

So far it looks to be some issue looking up the attrs on the .zfs inode? I am seeing this from the client NFS: nfs_revalidate_inode: (0:17/-1) getattr failed, error=-2

Below are some debug messages from NFS on both sides and some output of a systemtap script I'm using to explore the issue. I am a bit stuck on where to go from here... Anyone have this working, or have any suggestions on a direction I can take to further debug the issue?

  • Client:
# cat /etc/fstab | grep system
192.168.128.15:/system  /system         nfs             nfsvers=3,tcp,rw,noauto 0 0

# umount /system; mount /system; ls -la /system
total 39
drwxr-xr-x  3 root root    7 Feb 10 14:32 .
drwxr-xr-x 23 root root 4096 Dec 17 04:20 ..
dr-xr-xr-x  1 root root    0 Jan  1  1970 .zfs
-rw-r--r--  1 root root    5 Jan  5 12:21 file45
drwxr-xr-x  2 root root    7 Dec  6 01:37 logrotate.d
-rw-r--r--  1 root root    5 Dec  6 01:33 test.txt
-rw-r--r--  1 root root   10 Dec  6 01:40 test2.txt
-rw-r--r--  1 root root    8 Feb 10 14:32 testet.etet


# tail -f /var/log/messages &
# rpcdebug -m nfs -s all
# ls -la /system/.zfs
ls: cannot open directory /system/.zfs: No such file or directory
Feb 11 18:06:27 b13 kernel: [60889.473116] NFS: permission(0:17/4), mask=0x1, res=0
Feb 11 18:06:27 b13 kernel: [60889.473133] NFS: nfs_lookup_revalidate(/.zfs) is valid
Feb 11 18:06:27 b13 kernel: [60889.473144] NFS: dentry_delete(/.zfs, c018)
Feb 11 18:06:27 b13 kernel: [60889.473169] NFS: permission(0:17/4), mask=0x1, res=0
Feb 11 18:06:27 b13 kernel: [60889.473176] NFS: nfs_lookup_revalidate(/.zfs) is valid
Feb 11 18:06:27 b13 kernel: [60889.473185] NFS: dentry_delete(/.zfs, c018)
Feb 11 18:06:27 b13 kernel: [60889.474539] NFS: permission(0:17/4), mask=0x1, res=0
Feb 11 18:06:27 b13 kernel: [60889.474549] NFS: revalidating (0:17/-1)
Feb 11 18:06:27 b13 kernel: [60889.474555] NFS call  getattr
Feb 11 18:06:27 b13 kernel: [60889.476370] NFS reply getattr: -2
Feb 11 18:06:27 b13 kernel: [60889.476379] nfs_revalidate_inode: (0:17/-1) getattr failed, error=-2
Feb 11 18:06:27 b13 kernel: [60889.476391] NFS: nfs_lookup_revalidate(/.zfs) is invalid
Feb 11 18:06:27 b13 kernel: [60889.476398] NFS: dentry_delete(/.zfs, c018)
Feb 11 18:06:27 b13 kernel: [60889.476409] NFS: lookup(/.zfs)
Feb 11 18:06:27 b13 kernel: [60889.476415] NFS call  lookup .zfs
Feb 11 18:06:27 b13 kernel: [60889.477680] NFS: nfs_update_inode(0:17/4 ct=2 info=0x7e7f)
Feb 11 18:06:27 b13 kernel: [60889.477689] NFS reply lookup: 0
Feb 11 18:06:27 b13 kernel: [60889.477699] NFS: nfs_update_inode(0:17/-1 ct=1 info=0x7e7f)
Feb 11 18:06:27 b13 kernel: [60889.477704] NFS: nfs_fhget(0:17/-1 ct=1)
Feb 11 18:06:27 b13 kernel: [60889.477719] NFS call  access
Feb 11 18:06:27 b13 kernel: [60889.478948] NFS reply access: -2
Feb 11 18:06:27 b13 kernel: [60889.478988] NFS: permission(0:17/-1), mask=0x24, res=-2
Feb 11 18:06:27 b13 kernel: [60889.478995] NFS: dentry_delete(/.zfs, c010)
  • Server:
# rpcdebug -m nfsd -s all
# tail -f /var/log/messages &
Feb 11 18:11:40 b15 kernel: [61924.682697] nfsd_dispatch: vers 3 proc 4
Feb 11 18:11:40 b15 kernel: [61924.682712] nfsd: ACCESS(3)   8: 00010001 00000064 00000000 00000000 00000000 00000000 0x1f
Feb 11 18:11:40 b15 kernel: [61924.682724] nfsd: fh_verify(8: 00010001 00000064 00000000 00000000 00000000 00000000)
Feb 11 18:11:40 b15 kernel: [61924.684475] nfsd_dispatch: vers 3 proc 1
Feb 11 18:11:40 b15 kernel: [61924.684490] nfsd: GETATTR(3)  20: 01010001 00000064 ffff000a ffffffff 00000000 00000000
Feb 11 18:11:40 b15 kernel: [61924.684501] nfsd: fh_verify(20: 01010001 00000064 ffff000a ffffffff 00000000 00000000)

     0 nfsd(12660):->zfsctl_is_node ip=0xffff880037ca0e48
    19 nfsd(12660):<-zfsctl_is_node return=0x1
     0 nfsd(12660):->zfsctl_fid ip=0xffff880037ca0e48 fidp=0xffff880078817154
    79 nfsd(12660): ->zfsctl_fid ip=0xffff880037ca0e48 fidp=0xffff880078817154
    93 nfsd(12660): <-zfsctl_fid return=0x0
   101 nfsd(12660):<-zfsctl_fid return=0x0
        zpl_ctldir.c:136 error=?
     0 nfsd(12660):->zpl_root_getattr mnt=0xffff880075dc4e00 dentry=0xffff88007b781000 stat=0xffff88007305bd30
    14 nfsd(12660): ->zpl_root_getattr mnt=0xffff880075dc4e00 dentry=0xffff88007b781000 stat=0xffff88007305bd30
        zpl_ctldir.c:139 error=?
    74 nfsd(12660):  ->simple_getattr mnt=0xffff880075dc4e00 dentry=0xffff88007b781000 stat=0xffff88007305bd30
    83 nfsd(12660):  <-simple_getattr return=0x0
        zpl_ctldir.c:140 error=0x0
        zpl_ctldir.c:143 error=0x0
        zpl_ctldir.c:143 error=0x0
   168 nfsd(12660): <-zpl_root_getattr return=0x0
   175 nfsd(12660):<-zpl_root_getattr return=0x0

@b333z
Contributor

b333z commented Feb 27, 2012

Think I may have tracked down the issue with not being able to do ls -la /system/.zfs.

The path is something like:

getattr -> nfs client -> nfs server -> nfsd3_proc_getattr -> zpl_fh_to_dentry -> zfs_vget

zfs_vget() attempts to retrieve a znode, but as it's a control directory it doesn't have a backing znode, so it should not do a normal lookup. This condition is identified in zfs_vget() here:

        /* A zero fid_gen means we are in the .zfs control directories */
        if (fid_gen == 0 &&
            (object == ZFSCTL_INO_ROOT || object == ZFSCTL_INO_SNAPDIR)) {

            ...

            ZFS_EXIT(zsb);
            return (0);
        }

But from my traces I found this condition was not triggered.

I did a trace of the locals in zfs_vget() and got the following:

       157 nfsd(3367):   zfs_vfsops.c:1293 zsb=? zp=0xffff88007c137c20 object=? fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       166 nfsd(3367):   zfs_vfsops.c:1294 zsb=? zp=0x286 object=? fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       174 nfsd(3367):   zfs_vfsops.c:1304 zsb=? zp=0x286 object=? fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       183 nfsd(3367):   zfs_vfsops.c:1302 zsb=0xffff88007b29c000 zp=0x286 object=? fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       192 nfsd(3367):   zfs_vfsops.c:1306 zsb=0xffff88007b29c000 zp=0x286 object=? fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       206 nfsd(3367):   zfs_vfsops.c:1326 zsb=0xffff88007b29c000 zp=0x286 object=? fid_gen=? gen_mask=? zp_gen=? i=? err=0x11270000 __func__=[...]
       216 nfsd(3367):   zfs_vfsops.c:1330 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0x0 fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       229 nfsd(3367):   zfs_vfsops.c:1329 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       243 nfsd(3367):   zfs_vfsops.c:1330 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       252 nfsd(3367):   zfs_vfsops.c:1329 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       261 nfsd(3367):   zfs_vfsops.c:1330 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       270 nfsd(3367):   zfs_vfsops.c:1329 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       279 nfsd(3367):   zfs_vfsops.c:1330 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       288 nfsd(3367):   zfs_vfsops.c:1329 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       297 nfsd(3367):   zfs_vfsops.c:1330 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       306 nfsd(3367):   zfs_vfsops.c:1329 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       316 nfsd(3367):   zfs_vfsops.c:1330 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       325 nfsd(3367):   zfs_vfsops.c:1329 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=? gen_mask=? zp_gen=? i=? err=? __func__=[...]
       335 nfsd(3367):   zfs_vfsops.c:1333 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       348 nfsd(3367):   zfs_vfsops.c:1332 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       362 nfsd(3367):   zfs_vfsops.c:1333 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       372 nfsd(3367):   zfs_vfsops.c:1332 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       381 nfsd(3367):   zfs_vfsops.c:1333 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       390 nfsd(3367):   zfs_vfsops.c:1332 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       399 nfsd(3367):   zfs_vfsops.c:1333 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       409 nfsd(3367):   zfs_vfsops.c:1332 zfid=? zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       418 nfsd(3367):   zfs_vfsops.c:1340 zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       432 nfsd(3367):   zfs_vfsops.c:1341 zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       441 nfsd(3367):   zfs_vfsops.c:1356 zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       459 nfsd(3367):   zfs_vfsops.c:1357 zsb=0xffff88007b29c000 zp=0x286 object=0xffffffffffff fid_gen=0x0 gen_mask=? zp_gen=? i=? err=? __func__=[...]
       550 nfsd(3367):   zfs_vfsops.c:1379 zsb=0xffff88007b29c000 zp=0x0 object=? fid_gen=0x0 gen_mask=? zp_gen=0xffffffffa0df2b34 i=? err=0x2 __func__=[...]
       559 nfsd(3367):   zpl_export.c:93 fid=? ip=? len_bytes=? rc=?
       572 nfsd(3367):   zpl_export.c:94 fid=? ip=? len_bytes=? rc=0x2
       585 nfsd(3367):   zpl_export.c:99 fid=? ip=0x0 len_bytes=? rc=?

Did a printk to confirm:

printk("zfs: %d == 0 && ( %llx == %llx || %llx == %llx )",
    fid_gen, object, ZFSCTL_INO_ROOT, object, ZFSCTL_INO_SNAPDIR);

[  204.649576] zfs: 0 == 0 && ( ffffffffffff == ffffffffffffffff || ffffffffffff == fffffffffffffffd )

So it looks like zlfid->zf_setid is not long enough, or perhaps it's been truncated by NFSv3, but either way object ends up with not enough f's (ffffffffffff), so it doesn't match ZFSCTL_INO_ROOT.

As a test I adjusted the values for the control inode defines:

        diff --git a/include/sys/zfs_ctldir.h b/include/sys/zfs_ctldir.h
        index 5546aa7..46e7353 100644
        --- a/include/sys/zfs_ctldir.h
        +++ b/include/sys/zfs_ctldir.h
        @@ -105,10 +105,10 @@ extern void zfsctl_fini(void);
          * because these inode numbers are never stored on disk we can safely
          * redefine them as needed in the future.
          */
        -#define        ZFSCTL_INO_ROOT         0xFFFFFFFFFFFFFFFF
        -#define        ZFSCTL_INO_SHARES       0xFFFFFFFFFFFFFFFE
        -#define        ZFSCTL_INO_SNAPDIR      0xFFFFFFFFFFFFFFFD
        -#define        ZFSCTL_INO_SNAPDIRS     0xFFFFFFFFFFFFFFFC
        +#define        ZFSCTL_INO_ROOT         0xFFFFFFFFFFFF
        +#define        ZFSCTL_INO_SHARES       0xFFFFFFFFFFFE
        +#define        ZFSCTL_INO_SNAPDIR      0xFFFFFFFFFFFD
        +#define        ZFSCTL_INO_SNAPDIRS     0xFFFFFFFFFFFC

Then tested, and I am now able to traverse the .zfs directory and see the shares and snapshot dirs inside.

I noticed that other ZFS implementations use low values for the control dir inodes. Have we gone too large on these, or is this a limitation in NFSv3 itself? Wondering how best to deal with it.

@b333z
Contributor

b333z commented Feb 28, 2012

Now that I can get into the .zfs directory, trying to cd to the snapshot dir causes cd to hang and I get the following in dmesg (slowly getting there!):

        [  563.905750] ------------[ cut here ]------------
        [  563.905768] WARNING: at fs/inode.c:901 unlock_new_inode+0x31/0x53()
        [  563.905776] Hardware name: Bochs
        [  563.905780] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl scsi_wait_scan
        [  563.905805] Pid: 3358, comm: nfsd Tainted: P            3.0.6-gentoo #1
        [  563.905811] Call Trace:
        [  563.905843]  [<ffffffff81071a5d>] warn_slowpath_common+0x85/0x9d
        [  563.905853]  [<ffffffff81071a8f>] warn_slowpath_null+0x1a/0x1c
        [  563.905861]  [<ffffffff81141701>] unlock_new_inode+0x31/0x53
        [  563.905902]  [<ffffffffa046c14a>] snapentry_compare+0xcb/0x12f [zfs]
        [  563.905937]  [<ffffffffa046c44a>] zfsctl_root_lookup+0xc3/0x123 [zfs]
        [  563.905967]  [<ffffffffa047bd25>] zfs_vget+0x1f6/0x3e4 [zfs]
        [  563.905988]  [<ffffffff817475ce>] ? seconds_since_boot+0x1b/0x21
        [  563.905996]  [<ffffffff81748d17>] ? cache_check+0x57/0x2d0
        [  563.906021]  [<ffffffffa0491d93>] zpl_snapdir_rename+0x11e/0x455 [zfs]                                           [43/1854]
        [  563.906052]  [<ffffffff811f160c>] exportfs_decode_fh+0x56/0x21e
        [  563.906060]  [<ffffffff811f4690>] ? fh_compose+0x367/0x367
        [  563.906079]  [<ffffffff813a370f>] ? selinux_cred_prepare+0x1f/0x36
        [  563.906094]  [<ffffffff8112a2ad>] ? __kmalloc_track_caller+0xee/0x101
        [  563.906103]  [<ffffffff813a370f>] ? selinux_cred_prepare+0x1f/0x36
        [  563.906112]  [<ffffffff811f4a58>] fh_verify+0x299/0x4d9
        [  563.906121]  [<ffffffff817475ce>] ? seconds_since_boot+0x1b/0x21
        [  563.906129]  [<ffffffff8174935b>] ? sunrpc_cache_lookup+0x146/0x16d
        [  563.906137]  [<ffffffff811f4f4c>] nfsd_access+0x2d/0xfa
        [  563.906145]  [<ffffffff81748f73>] ? cache_check+0x2b3/0x2d0
        [  563.906154]  [<ffffffff811fc469>] nfsd3_proc_access+0x75/0x80
        [  563.906164]  [<ffffffff811f1afd>] nfsd_dispatch+0xf1/0x1d5
        [  563.906172]  [<ffffffff817400b2>] svc_process+0x45e/0x665
        [  563.906181]  [<ffffffff811f1fa9>] ? nfsd_svc+0x170/0x170
        [  563.906190]  [<ffffffff811f209f>] nfsd+0xf6/0x13a
        [  563.906198]  [<ffffffff811f1fa9>] ? nfsd_svc+0x170/0x170
        [  563.906206]  [<ffffffff8108d357>] kthread+0x82/0x8a
        [  563.906216]  [<ffffffff817ece24>] kernel_thread_helper+0x4/0x10
        [  563.906225]  [<ffffffff8108d2d5>] ? kthread_worker_fn+0x158/0x158
        [  563.906233]  [<ffffffff817ece20>] ? gs_change+0x13/0x13
        [  563.906240] ---[ end trace c8c4cba0e76b487f ]---
        [  563.906272] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        [  563.911695] IP: [<ffffffffa048819a>] zfs_inode_destroy+0x72/0xd1 [zfs]
        [  563.913129] PGD 780fc067 PUD 77074067 PMD 0 
        [  563.913857] Oops: 0002 [#1] SMP 
        [  563.914487] CPU 1 
        [  563.914565] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl scsi_wait_scan
        [  563.915111] 
        [  563.915111] Pid: 3358, comm: nfsd Tainted: P        W   3.0.6-gentoo #1 Bochs Bochs
        [  563.915111] RIP: 0010:[<ffffffffa048819a>]  [<ffffffffa048819a>] zfs_inode_destroy+0x72/0xd1 [zfs]
        [  563.915111] RSP: 0018:ffff8800704779c0  EFLAGS: 00010282
        [  563.915111] RAX: ffff880074945020 RBX: ffff880074945048 RCX: 0000000000000000
        [  563.915111] RDX: 0000000000000000 RSI: 0000000000014130 RDI: ffff880071b763e0
        [  563.915111] RBP: ffff8800704779e0 R08: ffffffff813a3e91 R09: 0000000000000000
        [  563.915111] R10: dead000000200200 R11: dead000000100100 R12: ffff880071b76000
        [  563.915111] R13: ffff880074944ea0 R14: ffff880071b763e0 R15: ffffffffa049d200
        [  563.915111] FS:  00007f8d978a2700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
        [  563.915111] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [  563.915111] CR2: 0000000000000008 CR3: 0000000079577000 CR4: 00000000000006e0
        [  563.915111] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  563.915111] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        [  563.915111] Process nfsd (pid: 3358, threadinfo ffff880070476000, task ffff88007caaf800)
        [  563.915111] Stack:
        [  563.915111]  ffff880074945048 ffff880074945068 ffffffffa049dba0 ffff880074944ea0
        [  563.915111]  ffff8800704779f0 ffffffffa0492cbd ffff880070477a10 ffffffff81141df2
        [  563.915111]  ffff880074945048 ffff880074945048 ffff880070477a30 ffffffff81142370
        [  563.915111] Call Trace:
        [  563.915111]  [<ffffffffa0492cbd>] zpl_vap_init+0x525/0x59c [zfs]
        [  563.915111]  [<ffffffff81141df2>] destroy_inode+0x40/0x5a
        [  563.915111]  [<ffffffff81142370>] evict+0x130/0x135
        [  563.915111]  [<ffffffff81142788>] iput+0x173/0x17b
        [  563.915111]  [<ffffffffa046c155>] snapentry_compare+0xd6/0x12f [zfs]
        [  563.915111]  [<ffffffffa046c44a>] zfsctl_root_lookup+0xc3/0x123 [zfs]
        [  563.915111]  [<ffffffffa047bd25>] zfs_vget+0x1f6/0x3e4 [zfs]
        [  563.915111]  [<ffffffff817475ce>] ? seconds_since_boot+0x1b/0x21
        [  563.915111]  [<ffffffff81748d17>] ? cache_check+0x57/0x2d0
        [  563.915111]  [<ffffffffa0491d93>] zpl_snapdir_rename+0x11e/0x455 [zfs]
        [  563.915111]  [<ffffffff811f160c>] exportfs_decode_fh+0x56/0x21e
        [  563.915111]  [<ffffffff811f4690>] ? fh_compose+0x367/0x367
        [  563.915111]  [<ffffffff813a370f>] ? selinux_cred_prepare+0x1f/0x36
        [  563.915111]  [<ffffffff8112a2ad>] ? __kmalloc_track_caller+0xee/0x101
        [  563.915111]  [<ffffffff813a370f>] ? selinux_cred_prepare+0x1f/0x36
        [  563.915111]  [<ffffffff811f4a58>] fh_verify+0x299/0x4d9
        [  563.915111]  [<ffffffff817475ce>] ? seconds_since_boot+0x1b/0x21
        [  563.915111]  [<ffffffff8174935b>] ? sunrpc_cache_lookup+0x146/0x16d
        [  563.915111]  [<ffffffff811f4f4c>] nfsd_access+0x2d/0xfa
        [  563.915111]  [<ffffffff81748f73>] ? cache_check+0x2b3/0x2d0
        [  563.915111]  [<ffffffff811fc469>] nfsd3_proc_access+0x75/0x80
        [  563.915111]  [<ffffffff811f1afd>] nfsd_dispatch+0xf1/0x1d5
        [  563.915111]  [<ffffffff817400b2>] svc_process+0x45e/0x665
        [  563.915111]  [<ffffffff811f1fa9>] ? nfsd_svc+0x170/0x170
        [  563.915111]  [<ffffffff811f209f>] nfsd+0xf6/0x13a
        [  563.915111]  [<ffffffff811f1fa9>] ? nfsd_svc+0x170/0x170
        [  563.915111]  [<ffffffff8108d357>] kthread+0x82/0x8a
        [  563.915111]  [<ffffffff817ece24>] kernel_thread_helper+0x4/0x10
        [  563.915111]  [<ffffffff8108d2d5>] ? kthread_worker_fn+0x158/0x158
        [  563.915111]  [<ffffffff817ece20>] ? gs_change+0x13/0x13
        [  563.915111] Code: 35 e1 4c 89 e8 49 03 84 24 c0 03 00 00 49 bb 00 01 10 00 00 00 ad de 49 ba 00 02 20 00 00 00 ad de 4c 89
         f7 48 8b 08 48 8b 50 08 
        [  563.915111]  89 51 08 48 89 0a 4c 89 18 4c 89 50 08 49 ff 8c 24 d8 03 00 
        [  563.915111] RIP  [<ffffffffa048819a>] zfs_inode_destroy+0x72/0xd1 [zfs]
        [  563.915111]  RSP <ffff8800704779c0>
        [  563.915111] CR2: 0000000000000008
        [  563.966671] ---[ end trace c8c4cba0e76b4880 ]---

@behlendorf
Contributor

@b333z Nice job. Yes, you're exactly right, I'd forgotten about this NFS limit when selecting those object IDs. There's actually a very nice comment in the code detailing exactly where this limit comes from. So we're limited to 48 bits for NFSv2 compatibility reasons, and actually the DMU imposes a (not widely advertised) 48-bit object number limit too.

include/sys/zfs_vfsops.h:105

/*
 * Normal filesystems (those not under .zfs/snapshot) have a total
 * file ID size limited to 12 bytes (including the length field) due to
 * NFSv2 protocol's limitation of 32 bytes for a filehandle.  For historical
 * reasons, this same limit is being imposed by the Solaris NFSv3 implementation
 * (although the NFSv3 protocol actually permits a maximum of 64 bytes).  It
 * is not possible to expand beyond 12 bytes without abandoning support
 * of NFSv2.
 *
 * For normal filesystems, we partition up the available space as follows:
 *      2 bytes         fid length (required)
 *      6 bytes         object number (48 bits)
 *      4 bytes         generation number (32 bits)
 *
 * We reserve only 48 bits for the object number, as this is the limit
 * currently defined and imposed by the DMU.
 */
typedef struct zfid_short {
        uint16_t        zf_len;
        uint8_t         zf_object[6];           /* obj[i] = obj >> (8 * i) */
        uint8_t         zf_gen[4];              /* gen[i] = gen >> (8 * i) */
} zfid_short_t;

include/sys/zfs_znode.h:160

/*
 * The directory entry has the type (currently unused on Solaris) in the
 * top 4 bits, and the object number in the low 48 bits.  The "middle"
 * 12 bits are unused.
 */
#define ZFS_DIRENT_TYPE(de) BF64_GET(de, 60, 4)
#define ZFS_DIRENT_OBJ(de) BF64_GET(de, 0, 48)

So the right fix is going to have to be to use smaller values for the ZFSCTL_INO_* constants... along with a very good comment explaining why those values are what they are.

As you noted they are quite a bit larger than their upstream counterparts. The reason is that the upstream code creates a separate namespace for the small .zfs/ directory and then mounts the snapshots on top of it. Under Linux it was far easier to just create these directories in the same namespace as the original zfs filesystem. However, since they are in the same namespace (unlike upstream) we needed to make sure the object ids never conflicted, so they used the uppermost object ids. Since zfs allocates all of its object ids from 1 in a monotonically increasing fashion there wouldn't be a conflict.

The second issue looks like it's caused by trying to allocate a new inode for the .zfs/snapshot directory when one already exists in the namespace. Normally, this wouldn't occur in the usual vfs callpaths but the NFS paths differ. We're going to need to perform a lookup and only create the inode when the lookup fails. See zfsctl_snapdir_lookup() as an example of this.

I'd really like to get this code done and merged into master, but I don't have the time to run down all these issues right now. If you can work on this and resolve the remaining NFS bugs that would be great; I'm happy to iterate with you on this in the bug and merge it once it's done.

@b333z
Contributor

b333z commented Feb 29, 2012

Sounds good Brian, I'll expand my test env to include nfs2 and nfs4 and work towards resolving any remaining issues.

@b333z
Contributor

b333z commented Mar 17, 2012

Making some slow progress on this. I can traverse the control directory structure down to the snapshots now (still on NFSv3). As you said, it looks like it was trying to create new inodes for the control directories when they were already there, so adding a lookup as you suggested seems to have done the trick.

These are the changes that I have so far:

diff --git a/include/sys/zfs_ctldir.h b/include/sys/zfs_ctldir.h
index 5546aa7..46e7353 100644
--- a/include/sys/zfs_ctldir.h
+++ b/include/sys/zfs_ctldir.h
@@ -105,10 +105,10 @@ extern void zfsctl_fini(void);
  * because these inode numbers are never stored on disk we can safely
  * redefine them as needed in the future.
  */
-#define    ZFSCTL_INO_ROOT     0xFFFFFFFFFFFFFFFF
-#define    ZFSCTL_INO_SHARES   0xFFFFFFFFFFFFFFFE
-#define    ZFSCTL_INO_SNAPDIR  0xFFFFFFFFFFFFFFFD
-#define    ZFSCTL_INO_SNAPDIRS 0xFFFFFFFFFFFFFFFC
+#define    ZFSCTL_INO_ROOT     0xFFFFFFFFFFFF
+#define    ZFSCTL_INO_SHARES   0xFFFFFFFFFFFE
+#define    ZFSCTL_INO_SNAPDIR  0xFFFFFFFFFFFD
+#define    ZFSCTL_INO_SNAPDIRS 0xFFFFFFFFFFFC

#define ZFSCTL_EXPIRE_SNAPSHOT  300

diff --git a/module/zfs/zfs_ctldir.c b/module/zfs/zfs_ctldir.c
index 6abbedf..61dea94 100644
--- a/module/zfs/zfs_ctldir.c
+++ b/module/zfs/zfs_ctldir.c
@@ -346,12 +346,30 @@ zfsctl_root_lookup(struct inode *dip, char *name, struct inode **ipp,
    if (strcmp(name, "..") == 0) {
        *ipp = dip->i_sb->s_root->d_inode;
    } else if (strcmp(name, ZFS_SNAPDIR_NAME) == 0) {
-       *ipp = zfsctl_inode_alloc(zsb, ZFSCTL_INO_SNAPDIR,
-           &zpl_fops_snapdir, &zpl_ops_snapdir);
+       *ipp = ilookup(zsb->z_sb, ZFSCTL_INO_SNAPDIR);
+       if (!*ipp)
+       {
+           *ipp = zfsctl_inode_alloc(zsb, ZFSCTL_INO_SNAPDIR,
+           &zpl_fops_snapdir, &zpl_ops_snapdir);
+       }
    } else if (strcmp(name, ZFS_SHAREDIR_NAME) == 0) {
-       *ipp = zfsctl_inode_alloc(zsb, ZFSCTL_INO_SHARES,
-           &zpl_fops_shares, &zpl_ops_shares);
+       *ipp = ilookup(zsb->z_sb, ZFSCTL_INO_SHARES);
+       if (!*ipp)
+       {         
+           *ipp = zfsctl_inode_alloc(zsb, ZFSCTL_INO_SHARES,
+               &zpl_fops_shares, &zpl_ops_shares);
+       }
    } else {
        *ipp = NULL;
        error = ENOENT;
    }
diff --git a/module/zfs/zfs_vfsops.c b/module/zfs/zfs_vfsops.c
index f895f5c..3197243 100644
--- a/module/zfs/zfs_vfsops.c
+++ b/module/zfs/zfs_vfsops.c
@@ -1336,25 +1336,42 @@ zfs_vget(struct super_block *sb, struct inode **ipp, fid_t *fidp)
        return (EINVAL);
    }

-   /* A zero fid_gen means we are in the .zfs control directories */
-   if (fid_gen == 0 &&
-       (object == ZFSCTL_INO_ROOT || object == ZFSCTL_INO_SNAPDIR)) {
-       *ipp = zsb->z_ctldir;
-       ASSERT(*ipp != NULL);
-       if (object == ZFSCTL_INO_SNAPDIR) {
-           VERIFY(zfsctl_root_lookup(*ipp, "snapshot", ipp,
-               0, kcred, NULL, NULL) == 0);
+   printk("zfs.zfs_vget() - Decoded: fid_gen: %llx object: %llx\n",
+       fid_gen, object);
+
+   if (fid_gen == 0) {
+       if (object == ZFSCTL_INO_ROOT || object == ZFSCTL_INO_SNAPDIR || object == ZFSCTL_INO_SHARES) {
+           *ipp = zsb->z_ctldir;
+           ASSERT(*ipp != NULL);
+           if (object == ZFSCTL_INO_SNAPDIR) {
+               VERIFY(zfsctl_root_lookup(*ipp, ZFS_SNAPDIR_NAME, ipp,
+                   0, kcred, NULL, NULL) == 0);
+           } else if (object == ZFSCTL_INO_SHARES) {
+               VERIFY(zfsctl_root_lookup(*ipp, ZFS_SHAREDIR_NAME, ipp,
+                   0, kcred, NULL, NULL) == 0);
+           } else if (object == ZFSCTL_INO_ROOT) {
+               igrab(*ipp);
+           }
+           ZFS_EXIT(zsb);
+           return (0);
        } else {
-           igrab(*ipp);
+           printk("zfs.zfs_vget() - Not .zfs,shares,snapshot must be snapdir doing lookup...\n");
+           *ipp = ilookup(zsb->z_sb, object);
+           if (*ipp) {
+               printk("zfs.zfs_vget() - Found snapdir Node\n");
+               ZFS_EXIT(zsb);
+               return (0);
+           } else {
+               printk("zfs.zfs_vget() - snapdir Node not found continuing...\n");
+           }
        }
-       ZFS_EXIT(zsb);
-       return (0);
    }

    gen_mask = -1ULL >> (64 - 8 * i);

    dprintf("getting %llu [%u mask %llx]\n", object, fid_gen, gen_mask);
    if ((err = zfs_zget(zsb, object, &zp))) { 
        ZFS_EXIT(zsb);
        return (err);
    }

I have tried a few combinations in zfs_vget of dealing with a snapshot directory (currently an ilookup), but so far all I get is the "." and ".." directories inside with a 1970 timestamp.

I had tried traversing the directory first via the local zfs mount to ensure the snapshot is mounted, then traversing via NFS, but I still get an empty directory.

My current thinking is that nfsd refuses to export anything below that point as it's a new mount point. I did some experimentation in forcing/ensuring that getattr returned the same stat->dev as the parent filesystem; that didn't seem to help. I will start doing some tracing on the nfs code so I can see what it's doing.

I then did some experimentation with the crossmnt nfs option; that seems to have some promise. It gives a stale file handle error, but it looked to be at least attempting to traverse the mount.

Anyhow, slowly getting my head around it all. Just planning to keep improving my tracing and hopefully get to the bottom of it soon; let us know if you have any ideas or tips!
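
For reference, the crossmnt experiment on the server side was roughly the following (the export line reflects my local test network, not a recommendation):

# /etc/exports on the server
/system  192.168.128.0/24(rw,no_subtree_check,crossmnt)

exportfs -ra    # re-export after editing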

@behlendorf
Contributor

Sounds good. Since I want to avoid this branch getting any staler than it already is I'm seriously considering merging this change in without the NFS support now that -rc7 has been tagged. We can further work on the NFS issues in another bug. Any objections?

As for the specific NFS issues you're seeing, the idea here is to basically fake out NFS for snapshots. The snapshot filesystems should be created with the same fsid as their parent so NFS can't tell the difference. Then it should allow traversal even without the crossmnt option. The NFS handles themselves are constructed in such a way as to avoid collisions, so lookups will be performed in the proper dataset. That said, clearly that's all not working quite right under Linux. We'll still need to dig into why.

@b333z
Contributor

b333z commented Mar 20, 2012

I have no objections; it would be great to get this code merged. There's some great functionality there even without NFS support, so if I can assist in any way, let us know.

I'll continue to dig deeper on the nfs stuff and see what I can find.

@behlendorf
Contributor

The deed is done. Thanks for being patient with me to make sure this was done right. The core .zfs/snapshot code has been merged into master with the following limitations.

  • A kernel with d_automount support is required, 2.6.37+ or RHEL6.2
  • Snapshots may not yet be accessed over NFS; see issue #616 (Support .zfs/snapshot access via NFS)
  • The .zfs/shares directory exists but is not yet functional.

Please open new issues for any problems you observe.
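
Basic usage of the merged code looks roughly like this (the pool and snapshot names are placeholders; creating or destroying snapshots via mkdir/rmdir in .zfs/snapshot requires root for now, see the referenced commit below):

zfs set snapdir=visible tank/fish        # optional: show .zfs in directory listings
zfs snapshot tank/fish@monday
ls /tank/fish/.zfs/snapshot/             # lists available snapshots
ls /tank/fish/.zfs/snapshot/monday       # triggers the automount (2.6.37+ or RHEL6.2)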

dajhorn referenced this issue in zfsonlinux/pkg-zfs Mar 26, 2012
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as possible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known deficiencies, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older than 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet
   possible.  The majority of the ground work for this is complete.
   However, finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no further smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributions-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #173
@baquar

baquar commented Feb 12, 2013

Can anyone help me?
I set up ZFS on two systems with GlusterFS data replicated from one volume to another. On system 1 I am taking a snapshot, but it is not visible and not replicating. I have tried many things (ls -a, looking in all directories, zfs set snapdir=visible zpool/zfilesystem) but am still not able to find .zfs. Please, anyone?

@behlendorf
Contributor

@baquar I'm not exactly sure what you're asking. ZFS is a local filesystem; if you take a snapshot it will just be visible in .zfs/snapshot on that system. Gluster, which layers on top of ZFS, will not replicate it to your other systems. You can however manually ship it to the other system with send/recv.

@baquar

baquar commented Feb 19, 2013

@behlendorf Hi Brian, I'm sorry, you didn't get me; the issue was that I am unable to find the snapshot location in ZFS.
I also want to learn the SPL and ZFS code. If you can give me some useful tips and advice I would be pleased; I am really passionate about learning the ZFS and SPL code.
Thanks,
baquar

@baquar

baquar commented Feb 19, 2013

@behlendorf One more question: I am mounting a snapshot using the command mount -t zfs datapool@dara /export/queue-data/ to share the directory, but it is read-only. Could you please tell me how to set permissions on a snapshot in ZFS?
Thank you.

@aikudinov

@baquar You can't write to a snapshot; make a clone if you need to write. And please use the mailing list for questions.

@baquar

baquar commented Feb 19, 2013

@aikudinov Thank you, I really appreciate your warm response, but my question is: can we set full permissions on a snapshot? I am sending it to another system and restoring it.

@behlendorf
Contributor

@baquar IMHO the best way to get familiar with the SPL/ZFS code is to pick an open issue you care about and see if you can fix it. We're always happy to have the help and are willing to provide advice and hints.

However, let me second @aikudinov and point you to the zfs-discuss@zfsonlinux.org mailing list. There are lots of helpful people reading the list who can probably very quickly answer your questions. As for your question about the snapshots: they are by definition immutable. If you need a read-write copy you must clone it and mount the clone.
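
Concretely, with the names from the question above (the clone name is made up):

# Snapshots are immutable; clone one to get a writable filesystem.
zfs clone datapool@dara datapool/dara-rw
zfs set mountpoint=/export/queue-data datapool/dara-rw
# The clone can later be destroyed without touching the original snapshot.
zfs destroy datapool/dara-rw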
