
Revert "Revert "Revert "Fix unlink/xattr deadlock""" #2408

Closed
wants to merge 1 commit

Conversation

ryao
Contributor

@ryao ryao commented Jun 20, 2014

This reverts commit 7973e46. That commit had been intended to work
around a deadlock issue involving zfs_zget(), which was since fixed by
6f9548c. The workaround had the side effect of making zfs_zinactive()
cause excessive CPU utilization in zfs_iput_taskq by queuing an
iteration of all objects in a dataset on every unlink of a directory
that had extended attributes. That resulted in many issue reports about
iput_taskq spinning. Since the original rationale for the change is no
longer valid, we can safely revert it to resolve all of those issue
reports.

Conflicts:
module/zfs/zfs_dir.c

Closes:
#457
#2058
#2128
#2240

@8573

8573 commented Jun 21, 2014

After updating my Gentoo Linux system from ZFS on Linux 0.6.2 to ZoL 0.6.3, I noticed that echoing 3 into /proc/sys/vm/drop_caches took around 0.5–2 minutes when I had just done the same thing immediately before (i.e., with an empty cache), with the echoing process (zsh, if using its built-in echo; or /bin/echo) at ~100% CPU usage during that time. (With a non-empty cache, it took maybe a few seconds.)

ryao/zfs@08c0a3d reduces this time to only around 2–5 seconds with an empty cache, and to less than 1 second with a non-empty cache.

@behlendorf behlendorf added this to the 0.6.4 milestone Jun 24, 2014
@behlendorf
Contributor

@ryao My concern with reverting this change is that we'll reintroduce the deadlock described in commit b00131d. It's certainly possible that the 6f9548c fix resolves this case, but I haven't spent enough time refreshing my memory to say for certain. If you could verify this is no longer possible and explain why, that would help speed up getting this merged.

Here's the original reproducer.

This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches

@dweeezil
Contributor

FYI, I've not been able to trigger the original reproducer during my work on #1548. I'm still in the middle of my testing against a handful of different kernels (thinking the post-3.0 kernel shrinker changes may be involved). It would be interesting to know which kernel versions were involved.

@ryao
Contributor Author

ryao commented Jun 27, 2014

@behlendorf Issue #266 has a stack trace for that deadlock. 6f9548c includes a stack trace showing essentially the same deadlock. The only difference is that we enter this code path through shrink_icache_memory() instead of writeback_sb_inodes(). In both stack traces, it is clear that the deadlock occurred because eviction called ilookup() through zfs_zget(), which then blocked against eviction in __wait_on_freeing_inode(). The merge that included 6f9548c replaces ilookup() with igrab(), which resolves the deadlock by preventing us from calling __wait_on_freeing_inode() in this code path.

With that in mind, I believe that 6f9548c was the right fix for #266 and that 7973e46 was not. While 7973e46 did help, it did not cover all cases and the improvements in the cases where it did help came at the expense of a serious regression in the form of #457. Now that 6f9548c is in the tree, there is no reason to keep 7973e46 any longer.
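
To make the shape of that change concrete, here's a rough before/after sketch. This is not the actual zfs_zget() code; locking, error handling and the znode hash lookup are omitted, the function name is made up, and it assumes the usual ZFS/Linux definitions (znode_t, ZTOI()):

/* Illustrative only -- not the real zfs_zget(). */
static struct inode *
zget_lookup_sketch(struct super_block *sb, znode_t *zp, uint64_t obj_num)
{
#ifdef BEFORE_6F9548C
        /*
         * Old path: ilookup() searches the inode cache by number and, if
         * the inode has I_FREEING set, sleeps in __wait_on_freeing_inode().
         * When the caller is itself running as part of eviction, that
         * sleep never ends -- the deadlock from #266.
         */
        return (ilookup(sb, obj_num));
#else
        /*
         * New path: igrab() returns NULL instead of sleeping when
         * I_FREEING is set, so zfs_zget() can drop its locks and retry
         * instead of blocking against the eviction in progress.
         */
        return (igrab(ZTOI(zp)));
#endif
}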

@behlendorf
Contributor

@dweeezil I know I was able to reproduce the original issue under RHEL6, so 2.6.32.

@ryao OK, I've refreshed my memory on this issue. However, there's a lot of history we need to cover so let me try and summarize the critical points.

  1. The original patch to make zfs_rmnode() asynchronous was introduced to prevent a deadlock which could occur when unlinking directory-based xattrs. Basically, the prune_icache function (now prune_icache_sb) would select a group of inodes to be evicted and set the I_FREEING flag on them. If a related xattr directory and child inode were both selected and the directory was processed before the child, then the zfs_zget() in zfs_rmnode() would deadlock in ilookup() on the child. Making the lookup asynchronous prevented this deadlock.

  2. The change in 6f9548c won't address this specific issue. Instead of blocking in ilookup(), we'll now spin in the igrab()-and-retry loop forever (see the sketch at the end of this comment). Since the I_FREEING flag was set by the current process in prune_icache_sb(), it won't be cleared until we evict all of those inodes... which won't happen because now we're looping instead of blocked. 6f9548c did nicely address the case where a different thread is the one responsible for the contention.

  3. What did mostly solve the issue described in 1) was commit e89260a. This patch ensured that child xattr files always hold a reference on their parent xattr directory. That reference ensures prune_icache_sb will always skip xattr directories until their children have all been evicted. This nicely avoids the issue described above and is why I felt it was safe to revert the original workaround and return to synchronous behavior. See commit 53c7411.

  4. That is, until #1176 ("rcu_sched stall, then lockup when mounting zfs") was filed shortly after I made that change. It shows a case that was not fully considered. As long as the xattr directories are always pruned leaves-first, which the reference ensures, everything is fine. However, as part of mount the unlinked set is drained and walked, and there we have no such guarantee. According to #1176 a deadlock is possible in this situation, although the details there aren't clear cut from the contents of the issue.

So from my point of view we can only safely merge this patch once we get to the bottom of #1176. If we can convince ourselves that case is no longer possible then this patch can be merged.
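
Here's a sketch of the looping concern from point 2 (hypothetical code, not anything in the tree; znode_t and ZTOI() are the usual ZFS/Linux definitions):

static struct inode *
zget_retry_sketch(znode_t *zp)
{
        struct inode *ip;

        for (;;) {
                ip = igrab(ZTOI(zp));
                if (ip != NULL)
                        return (ip);    /* got a reference, carry on */

                /*
                 * The inode has I_FREEING set.  Normally some other
                 * thread finishes evicting it and a later retry
                 * succeeds.  But if this thread is the one running the
                 * eviction (the prune_icache_sb() case above), the flag
                 * is never cleared and this loop spins forever.
                 */
                schedule_timeout_interruptible(1);
        }
}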

@dweeezil
Contributor

@behlendorf I finally got 2.6.32 (actually, a stock 2.6.32.61) running on my testing rig with an appropriately old spl (openzfs/spl@372c257) and zfs (6f0cf71), plus a few manual patches to get it to compile with gcc 4.8.2 (its static analysis finds some old sizeof bugs, etc.). The original #266 reproducer listed above still deadlocks.

Updating spl and zfs to master, reverting b00131d and 6f9548c, and running the test on my 2.6.32 kernel works just fine (no deadlock).

Now that I've got an old kernel working on my otherwise new-ish test system, I'll do some targeted bisection and see if I can find out where the problem was fixed or worked around.

@behlendorf
Contributor

@dweeezil If you're going to bisect, start with e89260a. This is the commit which should cleanly solve the original issue by ensuring iprune_super() never tries to evict an xattr directory inode while it still has cached xattr file inodes. The lingering question in my mind is whether we reintroduce #1176 if we revert these changes.

@dweeezil
Contributor

@behlendorf Yep, it was e89260a. For reference, I pushed dweeezil/zfs@9f5dfac to document the commit history of the "working" state. If dweeezil/zfs@e2a6ebc is reverted from that branch (which re-introduces b00131d), the deadlock occurs.

I'm going to try to hack up a pool with an unlinked set (which we ought to add to zfsonlinux/zfs-images) in an attempt to create a reproducer for #1176.

@behlendorf
Contributor

@dweeezil Great. It's nice to have that fix confirmed. It should be pretty easy to create a large unlinked set and I agree it would be great to add a pool like this to the zfs-images for testing.

@ryao
Contributor Author

ryao commented Jun 30, 2014

@dweeezil Why did you revert 6f9548c in your testing? Did you hit the infinite loop @behlendorf predicted might happen?

@dweeezil
Contributor

@ryao No, I didn't hit it.

I already knew the #266 reproducer didn't deadlock with recent master code, and I simply wanted to make sure that neither b00131d nor 6f9548c was the reason it didn't deadlock. In retrospect, I had no reason to touch 6f9548c since my previous attempts at the #266 reproducer were made long before that patch was committed.

Now I'm trying to get a solid reproducer for the #1176 issue by creating pools (actually, filesystems) with unlinked sets, but it looks like I need to do it a bit more cleverly so the unlinked set isn't in just the right order to avoid the deadlock.
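
The basic ingredient is a batch of files that are unlinked while still held open; if the machine is then hard-reset before they're closed, their znodes remain on the dataset's unlinked set at the next import. A minimal user-space helper along those lines (hypothetical, shown only to illustrate the idea; the file names and count are arbitrary) might look like:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int count = (argc > 1) ? atoi(argv[1]) : 1000;
        char name[64];
        int i, fd;

        /* Run this from a directory on the test ZFS filesystem. */
        for (i = 0; i < count; i++) {
                (void) snprintf(name, sizeof (name), "unlinked-%d", i);
                fd = open(name, O_CREAT | O_WRONLY, 0644);
                if (fd < 0) {
                        perror("open");
                        return (1);
                }
                (void) unlink(name);    /* now open but unlinked */
                /*
                 * Deliberately keep fd open so the znode stays busy.
                 * Each file holds a descriptor, so mind RLIMIT_NOFILE
                 * for large counts.
                 */
        }

        puts("files created and unlinked; hard-reset the box now");
        pause();        /* hold the descriptors open until the reset */
        return (0);
}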

@prometheanfire
Contributor

Here's git master with this patch http://bpaste.net/show/420834/

It doesn't mount /var/tmp/portage and thinks it's mounted when it isn't (at least not according to zfs mount). There's a soft lockup on reboot too.

Kernel: 3.15.2

@prometheanfire
Contributor

@ryao said that this confirmed @behlendorf's opinion that it could cause the lockup, so yay.

I'm on the second boot now, and it seems fine (until it's not)

@behlendorf
Contributor

@prometheanfire Yes, http://bpaste.net/show/420834/ was exactly what I was concerned about. We need to explain exactly how that lockup is possible and address it before we can revert this patch.

@prometheanfire
Contributor

this was 'zfs mount -a'.

@ryao
Contributor Author

ryao commented Jul 8, 2014

@behlendorf I suspect that the right thing to do here is to make each active SA handle hold a reference on the corresponding struct inode. I will take time to play with that idea after I get a handle on another mystery that I am debugging.
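
Roughly, as a purely hypothetical sketch (nothing like this exists in the tree; the function names are invented, and the real hook points would be wherever the znode's SA handle is set up and torn down):

static int
sa_handle_pin_inode_sketch(znode_t *zp)
{
        /*
         * When the znode's SA handle is established, take a reference
         * on the inode so it cannot be evicted while the handle is
         * still active.
         */
        return (igrab(ZTOI(zp)) != NULL ? 0 : -1);
}

static void
sa_handle_unpin_inode_sketch(znode_t *zp)
{
        /* When the SA handle is torn down, drop that reference. */
        iput(ZTOI(zp));
}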

@behlendorf behlendorf modified the milestones: 0.6.7, 0.6.4 Jul 16, 2014
@behlendorf
Contributor

Just for future reference. Fully reverting to the Illumos code where zfs_purgedir() is called directly from zfs_rmnode() results in the following deadlock. We'll need to resolve that before this can be merged.

 [<ffffffff81529e2e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa047730f>] refcount_remove_many+0x14f/0x220 [zfs]
 [<ffffffffa04773f6>] ? refcount_remove+0x16/0x20 [zfs]
 [<ffffffffa04486d8>] ? dnode_rele+0x78/0x120 [zfs]
 [<ffffffffa043333a>] ? dmu_bonus_hold+0x17a/0x370 [zfs]
 [<ffffffffa0424379>] ? dbuf_rele_and_unlock+0x179/0x300 [zfs]
 [<ffffffffa0479b8e>] ? sa_buf_hold+0xe/0x10 [zfs]
 [<ffffffffa04e7eee>] ? zfs_zget+0x8e/0x300 [zfs]
 [<ffffffffa04c2b2e>] ? zfs_purgedir+0xde/0x280 [zfs]
 [<ffffffff8111f200>] ? find_get_pages_tag+0x40/0x130
 [<ffffffff8111fd7e>] ? wait_on_page_writeback_range+0x8e/0x190
 [<ffffffffa022ab32>] ? spl_debug_msg+0x442/0xa30 [spl]
 [<ffffffffa044b346>] ? dnode_setdirty+0x116/0x520 [zfs]
 [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa04c2f35>] ? zfs_rmnode+0x265/0x410 [zfs]
 [<ffffffffa04e9e9e>] ? zfs_zinactive+0xfe/0x200 [zfs]
 [<ffffffffa04e4b0f>] ? zfs_inactive+0x7f/0x370 [zfs]
 [<ffffffffa05017ce>] ? zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811a642c>] ? clear_inode+0xac/0x140
 [<ffffffff811a6c53>] ? generic_drop_inode+0x33/0x80
 [<ffffffff811a5ad2>] ? iput+0x62/0x70
 [<ffffffffa04c2f0c>] ? zfs_rmnode+0x23c/0x410 [zfs]
 [<ffffffffa04e9e9e>] ? zfs_zinactive+0xfe/0x200 [zfs]
 [<ffffffffa04e4b0f>] ? zfs_inactive+0x7f/0x370 [zfs]
 [<ffffffffa0501940>] ? zpl_inode_delete+0x0/0x30 [zfs]
 [<ffffffffa05017ce>] ? zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811a642c>] ? clear_inode+0xac/0x140
 [<ffffffffa0501960>] ? zpl_inode_delete+0x20/0x30 [zfs]
 [<ffffffff811a6b2e>] ? generic_delete_inode+0xde/0x1d0
 [<ffffffff811a6c85>] ? generic_drop_inode+0x65/0x80
 [<ffffffff811a5ad2>] ? iput+0x62/0x70
 [<ffffffffa04e995e>] ? zfs_inode_destroy+0x10e/0x1c0 [zfs]
 [<ffffffffa05019b0>] ? zpl_inode_destroy+0x20/0x60 [zfs]
 [<ffffffff811a60ef>] ? destroy_inode+0x2f/0x60
 [<ffffffff811a6577>] ? dispose_list+0xb7/0x120
 [<ffffffff811a69aa>] ? invalidate_inodes+0xea/0x190
 [<ffffffff8118b0bc>] ? generic_shutdown_super+0x4c/0xe0
 [<ffffffff8118b1b6>] ? kill_anon_super+0x16/0x60
 [<ffffffffa050167e>] ? zpl_kill_sb+0x1e/0x30 [zfs]
 [<ffffffff8118b957>] ? deactivate_super+0x57/0x80
 [<ffffffff811ab2ff>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811abe4b>] ? sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

@behlendorf behlendorf closed this Jul 31, 2014
behlendorf added a commit to behlendorf/zfs that referenced this pull request Aug 5, 2014
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations.  This was
possible due to the work done in these previous commits.

e89260a Directory xattr znodes hold a reference on their parent
26cb948 Avoid 128K kmem allocations in mzap_upgrade()
4acaaf7 Add zfs_iput_async() interface

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2408
Issue openzfs#457
Issue openzfs#2058
Issue openzfs#2128
Issue openzfs#2240
@behlendorf behlendorf mentioned this pull request Aug 5, 2014
Labels
Type: Performance (performance improvement or performance problem)
5 participants