
Re: [Gluster-devel] RFC on fix to bug #802414


From: Anand Avati
Subject: Re: [Gluster-devel] RFC on fix to bug #802414
Date: Tue, 22 May 2012 00:11:36 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20120209 Thunderbird/10.0.1

<in continuation from our chat>

The PARENT_DOWN_HANDLED approach will take us backwards from the current state, where we are resilient to frame losses and other classes of bugs (i.e., if a frame loss happens on either the server or the client, it only prevents graph cleanup; the graph switch still happens).

The root "cause" here is that we are giving up on a very important and fundamental principle of immutability on the fd object. The real solution here is to never modify fd->inode. Instead we must bring about a more native fd "migration" than just re-opening an existing fd on the new graph.

Think of the inode migration analogy. The handle coming from FUSE (the address of the object) is a "hint". Usually the hint is right: the object at that address belongs to the latest graph. If not, we use the GFID to resolve a new inode on the latest graph and use that instead.
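To make the analogy concrete, here is a minimal, self-contained sketch of that hint check. The types and the resolve_gfid_on_graph() helper are simplified stand-ins for illustration, not the actual fuse resolver code:

/* Simplified stand-ins for inode_t/inode_table_t; illustration only. */
#include <stdint.h>

struct graph;                              /* opaque translator graph */
typedef uint8_t gfid_t[16];

struct inode_table {
    struct graph *graph;                   /* graph this table belongs to */
};

struct inode {
    struct inode_table *table;
    gfid_t              gfid;              /* stable identity across graphs */
};

/* Hypothetical helper: look up (or freshly resolve) the inode with this
 * GFID in the latest graph's inode table. */
struct inode *resolve_gfid_on_graph(struct inode_table *latest_table,
                                    const gfid_t gfid);

struct inode *
resolve_inode_hint(struct inode *hint, struct inode_table *latest_table)
{
    /* The hint is usable only if it already lives in the latest graph. */
    if (hint->table->graph == latest_table->graph)
        return hint;

    /* Otherwise fall back to the stable identity: the GFID. */
    return resolve_gfid_on_graph(latest_table, hint->gfid);
}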

In the case of an fd we can do something similar, except there are no GFIDs (which should not be a problem). We need to make the handle coming from FUSE (the address of the fd_t) just a hint. If fd->inode->table->xl->graph is the latest, then the hint was a HIT. If the graph is not the latest, we look for a previous migration attempt+result in the "base" (original) fd's context. If that does not exist, or is not fresh (i.e., not on the latest graph), then we create a new fd, open it on the new graph, fd_unref the old cached result in the base fd's context and keep a ref to this new result. All of this must happen from fuse_resolve_fd(). The setting of the latest fd and the update of the latest-fd pointer happen under the scope of base_fd->lock, which gives the operation a very clear and unambiguous scope that was missing with the old scheme.
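A rough sketch of what that fd-hint resolution could look like. Everything here is a simplified stand-in (the struct fd field "migrated" replaces the fd-context storage the proposal describes, and fd_open_on_graph() is a hypothetical helper), so treat it as an illustration of the scheme rather than the implementation:

#include <pthread.h>
#include <stddef.h>

struct graph;                              /* opaque translator graph */

struct inode {
    struct graph   *graph;                 /* graph this inode belongs to */
};

struct fd {
    struct inode   *inode;                 /* never modified after creation */
    struct fd      *migrated;              /* cached migration result, or NULL */
    pthread_mutex_t lock;                  /* stands in for base_fd->lock */
};

/* Hypothetical helpers for the sketch. */
struct fd *fd_open_on_graph(struct fd *basefd, struct graph *latest);
struct fd *fd_ref(struct fd *fd);
void       fd_unref(struct fd *fd);

/* Treat the fd handle coming from FUSE purely as a hint. */
struct fd *
resolve_fd_hint(struct fd *basefd, struct graph *latest)
{
    struct fd *resolved = NULL;

    /* fd->inode is immutable in this scheme, so this read needs no lock. */
    if (basefd->inode->graph == latest)
        return fd_ref(basefd);             /* hint was a HIT */

    pthread_mutex_lock(&basefd->lock);
    {
        if (basefd->migrated &&
            basefd->migrated->inode->graph == latest) {
            /* A previous migration result is still fresh; reuse it. */
            resolved = fd_ref(basefd->migrated);
        } else {
            /* Create and open a new fd on the latest graph, drop the
             * stale cached result, and cache the new one. */
            resolved = fd_open_on_graph(basefd, latest);
            if (basefd->migrated)
                fd_unref(basefd->migrated);
            basefd->migrated = fd_ref(resolved);
        }
    }
    pthread_mutex_unlock(&basefd->lock);

    return resolved;
}

Because the cached result is only ever read and updated under base_fd->lock, the scope of a migration is unambiguous, and fd->inode itself never changes.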

[The next step will be to nuke the fd->inode swapping in fuse_create_cbk]

Avati

On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote:


----- Original Message -----
From: "Pranith Kumar Karampuri"<address@hidden>
To: "Anand Avati"<address@hidden>
Cc: "Vijay Bellur"<address@hidden>, "Amar Tumballi"<address@hidden>, "Krishnan 
Parthasarathi"
<address@hidden>, "Raghavendra Gowdappa"<address@hidden>
Sent: Tuesday, May 22, 2012 8:42:58 AM
Subject: Re: RFC on fix to bug #802414

Dude,
     We already put logs in LOCK and UNLOCK yesterday and saw
     that the &fd->inode->lock address changed between LOCK and UNLOCK.

Yes, I too believe that the hang is because of the fd->inode swap in
fuse_migrate_fd and not the one in fuse_create_cbk. We could clearly see the
following race in the log files:

fuse-mig-thr: acquires fd->inode->lock in order to swap fd->inode (this was a
naive fix for the race condition caused by swapping fd->inode - hold the lock
on the inode in the old graph - which didn't work)

poll-thr: tries to acquire fd->inode->lock (the inode is old_inode, present in
the old graph) in afr_local_cleanup
fuse-mig-thr: swaps fd->inode and releases old_inode->lock
poll-thr: wakes up from its lock call on old_inode->lock
poll-thr: does its work, but while unlocking, uses fd->inode, which now points
to the inode in the new graph

We had logs printing the lock address before and after acquiring the lock, and
we could clearly see that the lock address changed after acquiring the lock in
afr_local_cleanup.
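For reference, a tiny self-contained illustration of that kind of instrumentation (simplified types and plain pthreads, not the actual patch), and of why a mismatched address means a stuck lock:

#include <pthread.h>
#include <stdio.h>

struct inode { pthread_mutex_t lock; };
struct fd    { struct inode *inode; };     /* may be swapped by another thread */

static void
locked_section(struct fd *fd)
{
    pthread_mutex_t *before = &fd->inode->lock;
    fprintf(stderr, "LOCK   %p\n", (void *)before);
    pthread_mutex_lock(before);

    /* ... critical section; meanwhile fuse_migrate_fd may swap fd->inode ... */

    pthread_mutex_t *after = &fd->inode->lock;     /* re-evaluated, as in the bug */
    fprintf(stderr, "UNLOCK %p\n", (void *)after);
    pthread_mutex_unlock(after);

    /* If 'after' != 'before', the unlock hits the wrong mutex and 'before'
     * stays locked forever: the next thread that tries to take it hangs. */
}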


"The hang in fuse_migrate_fd is _before_ the inode swap performed
there."
All the fds are opened on the same file, so all the fds in the fd
migration point to the same inode. The race is hit by the nth fd, and the
(n+1)th fd hangs. We have seen that afr_local_cleanup was doing an fd_unref,
and LOCK(fd->inode->lock) was done with one address, but by the time
UNLOCK(fd->inode->lock) was done the address had changed. So the next fd
that has to migrate hung, because the previous inode's lock was never
unlocked.

If, after the nth fd introduces the race, a _cbk comes in the epoll thread on
the (n+1)th fd and tries to LOCK(fd->inode->lock), the epoll thread will hang.
That is my theory for the hang we observed on Saturday.

Pranith.
----- Original Message -----
From: "Anand Avati"<address@hidden>
To: "Raghavendra Gowdappa"<address@hidden>
Cc: "Vijay Bellur"<address@hidden>, "Amar Tumballi"
<address@hidden>, "Krishnan Parthasarathi"
<address@hidden>, "Pranith Kumar Karampuri"
<address@hidden>
Sent: Tuesday, May 22, 2012 2:09:33 AM
Subject: Re: RFC on fix to bug #802414

On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
Avati,

fuse_migrate_fd (running in the reader thread - rdthr) assigns a new
inode to the fd once it looks up the inode in the new graph. But this
assignment can race with code executing in the poll thread (pthr) that
accesses fd->inode->lock, as follows:

pthr: LOCK (fd->inode->lock); (inode in old graph)
rdthr: fd->inode = inode (resolved in new graph)
pthr: UNLOCK (fd->inode->lock); (inode in new graph)
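As an aside, here is a minimal sketch (simplified types and plain pthreads, not GlusterFS code) of why that interleaving leaves the old lock held, next to the usual "snapshot the pointer" pattern:

#include <pthread.h>

struct inode { pthread_mutex_t lock; };
struct fd    { struct inode *inode; };

/* Buggy pattern: fd->inode is dereferenced twice, so a concurrent
 * "fd->inode = new_inode" makes LOCK and UNLOCK hit different mutexes. */
static void
cleanup_buggy(struct fd *fd)
{
    pthread_mutex_lock(&fd->inode->lock);      /* old inode's lock */
    /* ... rdthr swaps fd->inode here ... */
    pthread_mutex_unlock(&fd->inode->lock);    /* new inode's lock; old one stays held */
}

/* Defensive pattern: dereference fd->inode exactly once. */
static void
cleanup_snapshot(struct fd *fd)
{
    struct inode *inode = fd->inode;           /* single snapshot */

    pthread_mutex_lock(&inode->lock);
    /* ... work on this inode ... */
    pthread_mutex_unlock(&inode->lock);        /* same mutex as the lock call */
}

The snapshot only guarantees that lock and unlock land on the same object; it does not decide which inode the caller should really be working on once the graphs have switched, which is the argument above for never mutating fd->inode in the first place.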


The way I see it (going by the backtrace output in the other mail), the swap
happening in fuse_create_cbk() must be the one causing lock/unlock to
land on different inode objects. The hang in fuse_migrate_fd is _before_
the inode swap performed there. Can you put some logs in
fuse_create_cbk()'s inode swap code and confirm this?


Now, any lock operations on the inode in the old graph will block. Thanks
to Pranith for pointing out this race condition.

The problem here is that we don't have a single lock that can
synchronize the assignment "fd->inode = inode" with other locking
attempts on fd->inode->lock. So, instead of trying to synchronize them,
we are thinking of eliminating the parallel accesses altogether. This
can be done by splitting fd migration into two tasks.

1. Actions on the old graph (like fsync to flush writes to disk)
2. Actions on the new graph (lookup, open)

We can send PARENT_DOWN when:
1. Task 1 is complete, and
2. no fop sent by fuse is pending.

On receiving PARENT_DOWN, protocol/client will shut down its transports.
As part of transport cleanup, all pending frames are unwound and
protocol/client notifies its parents with the PARENT_DOWN_HANDLED
event. Each translator will pass this event on to its parents
once it is convinced that there are no pending fops started by itself
(like background self-heal, reads issued by read-ahead, etc.). Once
fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there
will be no replies racing with migration (note that migration is done
using syncops). At this point it is safe to start Task 2 (which
associates the fd with an inode in the new graph).

Also note that the reader thread will not do any other operations till it
completes both tasks.

As far as the implementation of this patch goes, the major work is in
translators like read-ahead, afr, and dht, to provide the guarantee
required before sending the PARENT_DOWN_HANDLED event to their parents.
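A hedged sketch of what that per-translator guarantee could look like. The struct, fields and functions below are simplified stand-ins, not the real xlator/notify API: each translator counts its own in-flight fops and forwards PARENT_DOWN_HANDLED to its parent only when the count drops to zero.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct xlator {
    struct xlator  *parent;
    pthread_mutex_t lock;
    int             pending_fops;   /* self-started fops: self-heal, read-ahead, ... */
    bool            down_handled;   /* children have reported PARENT_DOWN_HANDLED */
};

static void notify_parent_down_handled(struct xlator *xl);

/* Called when one of this translator's own fops completes. */
static void
fop_done(struct xlator *xl)
{
    bool forward = false;

    pthread_mutex_lock(&xl->lock);
    if (--xl->pending_fops == 0 && xl->down_handled)
        forward = true;
    pthread_mutex_unlock(&xl->lock);

    if (forward)
        notify_parent_down_handled(xl->parent);
}

/* Called when all children have reported PARENT_DOWN_HANDLED. */
static void
on_parent_down_handled(struct xlator *xl)
{
    bool forward = false;

    pthread_mutex_lock(&xl->lock);
    xl->down_handled = true;
    if (xl->pending_fops == 0)
        forward = true;
    pthread_mutex_unlock(&xl->lock);

    if (forward)
        notify_parent_down_handled(xl->parent);
}

/* Hand the event up one level (the real code would go through notify()). */
static void
notify_parent_down_handled(struct xlator *xl)
{
    if (xl != NULL)
        on_parent_down_handled(xl);
}

Once the topmost consumer (fuse) sees the event with nothing pending below it, no cbk can race with the syncop-driven migration, and Task 2 can start.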

Please let me know your thoughts on this.


All the above steps might not apply if it is caused by the swap in
fuse_create_cbk(). Let's confirm that first.

Avati




