
Revert "Revert "Revert "Fix unlink/xattr deadlock""" #2408

Closed
wants to merge 1 commit

Conversation

ryao
Contributor

@ryao ryao commented Jun 20, 2014

This reverts commit 7973e46. That commit had been intended to work
around a deadlock issue involving zfs_zget(), which was since fixed by
6f9548c. The workaround had the side effect of making zfs_zinactive()
cause excessive CPU utilization in zfs_iput_taskq by queuing an
iteration of all objects in a dataset on every unlink of a directory
that had extended attributes. That resulted in many issue reports about
iput_taskq spinning. Since the original rationale for the change is no
longer valid, we can safely revert it to resolve all of those issue
reports.

Conflicts:
module/zfs/zfs_dir.c

Closes:
#457
#2058
#2128
#2240

@8573

8573 commented Jun 21, 2014

After updating my Gentoo Linux system from ZFS on Linux 0.6.2 to ZoL 0.6.3, I noticed that echoing 3 into /proc/sys/vm/drop_caches took around 0.5–2 minutes when I had just done the same thing immediately before (i.e., with an empty cache), with the echoing process (zsh, if using its built-in echo; or /bin/echo) at ~100% CPU usage during that time. (With a non-empty cache, it took maybe a few seconds.)

ryao/zfs@08c0a3d reduces this time to only around 2–5 seconds with an empty cache, and to less than 1 second with a non-empty cache.

@behlendorf behlendorf added this to the 0.6.4 milestone Jun 24, 2014
@behlendorf
Contributor

@ryao My concern with reverting this change is that we'll reintroduce the deadlock described in commit b00131d. It's certainly possible that the 6f9548c fix resolves this case, but I haven't spent enough time refreshing my memory to say for certain. If you could verify this is no longer possible and explain why, that would help speed up getting this merged.

Here's the original reproducer.

This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches

@dweeezil
Contributor

FYI, I've not been able to trigger the original reproducer during my work on #1548. I'm still in the middle of my testing against a handful of different kernels (thinking the post-3.0 kernel shrinker changes may be involved). It would be interesting to know which kernel versions were involved.

@ryao
Contributor Author

ryao commented Jun 27, 2014

@behlendorf Issue #266 has a stack trace for that deadlock. 6f9548c includes a stack trace showing essentially the same deadlock. The only difference is that we enter this code path through shrink_icache_memory() instead of writeback_sb_inodes(). In both stack traces, it is clear that the deadlock occurred because eviction called ilookup() through zfs_zget(), which then blocked against eviction in __wait_on_freeing_inode(). The merge that included 6f9548c replaces ilookup() with igrab(), which resolves the deadlock by preventing us from calling __wait_on_freeing_inode() in this code path.

With that in mind, I believe that 6f9548c was the right fix for #266 and that 7973e46 was not. While 7973e46 did help, it did not cover all cases and the improvements in the cases where it did help came at the expense of a serious regression in the form of #457. Now that 6f9548c is in the tree, there is no reason to keep 7973e46 any longer.
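
To make the shape of that change concrete, here's a rough before/after sketch. This is not the actual zfs_zget() code; locking, error handling and the znode hash lookup are omitted, the function name is made up, and it assumes the usual ZFS/Linux definitions (znode_t, ZTOI()):

/* Illustrative only -- not the real zfs_zget(). */
static struct inode *
zget_lookup_sketch(struct super_block *sb, znode_t *zp, uint64_t obj_num)
{
#ifdef BEFORE_6F9548C
        /*
         * Old path: ilookup() searches the inode cache by number and, if
         * the inode has I_FREEING set, sleeps in __wait_on_freeing_inode().
         * When the caller is itself running as part of eviction, that
         * sleep never ends -- the deadlock from #266.
         */
        return (ilookup(sb, obj_num));
#else
        /*
         * New path: igrab() returns NULL instead of sleeping when
         * I_FREEING is set, so zfs_zget() can drop its locks and retry
         * instead of blocking against the eviction in progress.
         */
        return (igrab(ZTOI(zp)));
#endif
}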

@behlendorf
Contributor

@dweeezil I know I was able to reproduce the original issue under RHEL6, so 2.6.32.

@ryao OK, I've refreshed my memory on this issue. However, there's a lot of history we need to cover so let me try and summarize the critical points.

  1. The original patch to make zfs_rmnode() asynchronous was introduced to prevent a deadlock which could occur when unlinking directory-based xattrs. Basically, the prune_icache function (now prune_icache_sb) would select a group of inodes to be evicted and set the I_FREEING flag on them. If a related xattr directory and child inode were both selected and the directory was processed before the child, then the zfs_zget() in zfs_rmnode() would deadlock in ilookup() on the child. Making the lookup asynchronous prevented this deadlock.

  2. The change in 6f9548c won't address this specific issue. Instead of blocking in ilookup(), we'll now spin in the igrab()-and-retry loop forever (see the sketch at the end of this comment). Since the I_FREEING flag was set by the current process in prune_icache_sb(), it won't be cleared until we evict all of those inodes... which won't happen because now we're looping instead of blocked. 6f9548c did nicely address the case where a different thread is the one responsible for the contention.

  3. What did mostly solve the issue described in 1) was commit e89260a. This patch ensured that child xattr files always hold a reference on their parent xattr directory. That reference ensures prune_icache_sb will always skip xattr directories until their children have all been evicted. This nicely avoids the issue described above and is why I felt it was safe to revert the original workaround and return to synchronous behavior. See commit 53c7411.

  4. That is, until #1176 ("rcu_sched stall, then lockup when mounting zfs") was filed shortly after I made that change. It shows a case that was not fully considered. As long as the xattr directories are always pruned leaves-first, which the reference ensures, everything is fine. However, as part of mount the unlinked set is drained and walked, and there we have no such guarantee. According to #1176 a deadlock is possible in this situation, although the details there aren't clear cut from the contents of the issue.

So from my point of view we can only safely merge this patch once we get to the bottom of #1176. If we can convince ourselves that case is no longer possible then this patch can be merged.
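
Here's a sketch of the looping concern from point 2 (hypothetical code, not anything in the tree; znode_t and ZTOI() are the usual ZFS/Linux definitions):

static struct inode *
zget_retry_sketch(znode_t *zp)
{
        struct inode *ip;

        for (;;) {
                ip = igrab(ZTOI(zp));
                if (ip != NULL)
                        return (ip);    /* got a reference, carry on */

                /*
                 * The inode has I_FREEING set.  Normally some other
                 * thread finishes evicting it and a later retry
                 * succeeds.  But if this thread is the one running the
                 * eviction (the prune_icache_sb() case above), the flag
                 * is never cleared and this loop spins forever.
                 */
                schedule_timeout_interruptible(1);
        }
}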

@dweeezil
Contributor

@behlendorf I finally got 2.6.32 (actually, a stock 2.6.32.61) running on my testing rig with an appropriately old spl (openzfs/spl@372c257) and zfs (6f0cf71), plus a few manual patches to get it to compile with gcc 4.8.2 (its static analysis finds some old sizeof bugs, etc.). The original #266 reproducer listed above still deadlocks.

Updating spl and zfs to master, reverting b00131d and 6f9548c, and running the test on my 2.6.32 kernel works just fine (no deadlock).

Now that I've got an old kernel working on my otherwise new-ish test system, I'll do some targeted bisection and see if I can find out where the problem was fixed or worked around.

@behlendorf
Contributor

@dweeezil If you're going to bisect, start with e89260a. This is the commit which should cleanly solve the original issue by ensuring iprune_super() never tries to evict an xattr directory inode while it still has cached xattr file inodes. The lingering question in my mind is whether we reintroduce #1176 if we revert these changes.

@dweeezil
Contributor

@behlendorf Yep, it was e89260a. For reference, I pushed dweeezil/zfs@9f5dfac to document the commit history of the "working" state. If dweeezil/zfs@e2a6ebc is reverted from that branch (which re-introduces b00131d), the deadlock occurs.

I'm going to try to hack up a pool with an unlinked set (which we ought to add to zfsonlinux/zfs-images) in an attempt to create a reproducer for #1176.

@behlendorf
Contributor

@dweeezil Great. It's nice to have that fix confirmed. It should be pretty easy to create a large unlinked set and I agree it would be great to add a pool like this to the zfs-images for testing.

@ryao
Contributor Author

ryao commented Jun 30, 2014

@dweeezil Why did you revert 6f9548c in your testing? Did you hit the infinite loop @behlendorf predicted might happen?

@dweeezil
Contributor

@ryao No, I didn't hit it.

I already knew the #266 reproducer didn't deadlock with recent master code, and I simply wanted to make sure that neither b00131d nor 6f9548c was the reason it didn't deadlock. In retrospect, I had no reason to touch 6f9548c since my previous attempts at the #266 reproducer were made long before that patch was committed.

Now I'm trying to get a solid reproducer for the #1176 issue by creating pools (actually, filesystems) with unlinked sets, but it looks like I need to do it a bit more cleverly so the unlinked set isn't in just the right order to avoid the deadlock.
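
The basic ingredient is a batch of files that are unlinked while still held open; if the machine is then hard-reset before they're closed, their znodes remain on the dataset's unlinked set at the next import. A minimal user-space helper along those lines (hypothetical, shown only to illustrate the idea; the file names and count are arbitrary) might look like:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int count = (argc > 1) ? atoi(argv[1]) : 1000;
        char name[64];
        int i, fd;

        /* Run this from a directory on the test ZFS filesystem. */
        for (i = 0; i < count; i++) {
                (void) snprintf(name, sizeof (name), "unlinked-%d", i);
                fd = open(name, O_CREAT | O_WRONLY, 0644);
                if (fd < 0) {
                        perror("open");
                        return (1);
                }
                (void) unlink(name);    /* now open but unlinked */
                /*
                 * Deliberately keep fd open so the znode stays busy.
                 * Each file holds a descriptor, so mind RLIMIT_NOFILE
                 * for large counts.
                 */
        }

        puts("files created and unlinked; hard-reset the box now");
        pause();        /* hold the descriptors open until the reset */
        return (0);
}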

@prometheanfire
Contributor

Here's git master with this patch http://bpaste.net/show/420834/

It doesn't mount /var/tmp/portage and thinks it's mounted when it isn't (at least not according to zfs mount). There's a soft lockup on reboot too.

Kernel: 3.15.2

@prometheanfire
Contributor

@ryao said that this confirmed @behlendorf's opinion that it could cause the lockup, so yay.

I'm on the second boot now, and it seems fine (until it's not)

@behlendorf
Contributor

@prometheanfire Yes, http://bpaste.net/show/420834/ was exactly what I was concerned about. We need to explain exactly how that lockup is possible and address it before we can revert this patch.

@prometheanfire
Contributor

this was 'zfs mount -a'.

@ryao
Contributor Author

ryao commented Jul 8, 2014

@behlendorf I suspect that the right thing to do here is to make each active SA handle hold a reference on the corresponding struct inode. I will take time to play with that idea after I get a handle on another mystery that I am debugging.
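
Roughly, as a purely hypothetical sketch (nothing like this exists in the tree; the function names are invented, and the real hook points would be wherever the znode's SA handle is set up and torn down):

static int
sa_handle_pin_inode_sketch(znode_t *zp)
{
        /*
         * When the znode's SA handle is established, take a reference
         * on the inode so it cannot be evicted while the handle is
         * still active.
         */
        return (igrab(ZTOI(zp)) != NULL ? 0 : -1);
}

static void
sa_handle_unpin_inode_sketch(znode_t *zp)
{
        /* When the SA handle is torn down, drop that reference. */
        iput(ZTOI(zp));
}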

@behlendorf behlendorf modified the milestones: 0.6.7, 0.6.4 Jul 16, 2014
@behlendorf
Contributor

Just for future reference. Fully reverting to the Illumos code where zfs_purgedir() is called directly from zfs_rmnode() results in the following deadlock. We'll need to resolve that before this can be merged.

 [<ffffffff81529e2e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa047730f>] refcount_remove_many+0x14f/0x220 [zfs]
 [<ffffffffa04773f6>] ? refcount_remove+0x16/0x20 [zfs]
 [<ffffffffa04486d8>] ? dnode_rele+0x78/0x120 [zfs]
 [<ffffffffa043333a>] ? dmu_bonus_hold+0x17a/0x370 [zfs]
 [<ffffffffa0424379>] ? dbuf_rele_and_unlock+0x179/0x300 [zfs]
 [<ffffffffa0479b8e>] ? sa_buf_hold+0xe/0x10 [zfs]
 [<ffffffffa04e7eee>] ? zfs_zget+0x8e/0x300 [zfs]
 [<ffffffffa04c2b2e>] ? zfs_purgedir+0xde/0x280 [zfs]
 [<ffffffff8111f200>] ? find_get_pages_tag+0x40/0x130
 [<ffffffff8111fd7e>] ? wait_on_page_writeback_range+0x8e/0x190
 [<ffffffffa022ab32>] ? spl_debug_msg+0x442/0xa30 [spl]
 [<ffffffffa044b346>] ? dnode_setdirty+0x116/0x520 [zfs]
 [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa04c2f35>] ? zfs_rmnode+0x265/0x410 [zfs]
 [<ffffffffa04e9e9e>] ? zfs_zinactive+0xfe/0x200 [zfs]
 [<ffffffffa04e4b0f>] ? zfs_inactive+0x7f/0x370 [zfs]
 [<ffffffffa05017ce>] ? zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811a642c>] ? clear_inode+0xac/0x140
 [<ffffffff811a6c53>] ? generic_drop_inode+0x33/0x80
 [<ffffffff811a5ad2>] ? iput+0x62/0x70
 [<ffffffffa04c2f0c>] ? zfs_rmnode+0x23c/0x410 [zfs]
 [<ffffffffa04e9e9e>] ? zfs_zinactive+0xfe/0x200 [zfs]
 [<ffffffffa04e4b0f>] ? zfs_inactive+0x7f/0x370 [zfs]
 [<ffffffffa0501940>] ? zpl_inode_delete+0x0/0x30 [zfs]
 [<ffffffffa05017ce>] ? zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811a642c>] ? clear_inode+0xac/0x140
 [<ffffffffa0501960>] ? zpl_inode_delete+0x20/0x30 [zfs]
 [<ffffffff811a6b2e>] ? generic_delete_inode+0xde/0x1d0
 [<ffffffff811a6c85>] ? generic_drop_inode+0x65/0x80
 [<ffffffff811a5ad2>] ? iput+0x62/0x70
 [<ffffffffa04e995e>] ? zfs_inode_destroy+0x10e/0x1c0 [zfs]
 [<ffffffffa05019b0>] ? zpl_inode_destroy+0x20/0x60 [zfs]
 [<ffffffff811a60ef>] ? destroy_inode+0x2f/0x60
 [<ffffffff811a6577>] ? dispose_list+0xb7/0x120
 [<ffffffff811a69aa>] ? invalidate_inodes+0xea/0x190
 [<ffffffff8118b0bc>] ? generic_shutdown_super+0x4c/0xe0
 [<ffffffff8118b1b6>] ? kill_anon_super+0x16/0x60
 [<ffffffffa050167e>] ? zpl_kill_sb+0x1e/0x30 [zfs]
 [<ffffffff8118b957>] ? deactivate_super+0x57/0x80
 [<ffffffff811ab2ff>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811abe4b>] ? sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

@behlendorf behlendorf closed this Jul 31, 2014
behlendorf added a commit to behlendorf/zfs that referenced this pull request Aug 5, 2014
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations.  This was
possible due to the work done in these previous commits.

e89260a Directory xattr znodes hold a reference on their parent
26cb948 Avoid 128K kmem allocations in mzap_upgrade()
4acaaf7 Add zfs_iput_async() interface

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2408
Issue openzfs#457
Issue openzfs#2058
Issue openzfs#2128
Issue openzfs#2240
@behlendorf behlendorf mentioned this pull request Aug 5, 2014
Labels
Type: Performance (performance improvement or performance problem)
5 participants