Hi,
Traditionally, afr only records in extended attributes which of the directories
are good and which are stale. At self-heal time it then does a full directory
scan, deleting stale entries and creating missing ones. There are two problems
with this approach:
1) Creating, deleting, or renaming even a single entry requires a full scan of
the directory.
2) If both bricks crash at the same time while a rename is in progress, it can
lead to split-brains where the same name has different gfids.
Example:
0) dir1 has file 'a' with gfid-a, dir2 has file 'b' with gfid-b.
1) User executes rename dir1/a -> dir2/b on the mount, overwriting
the original file b.
2) On brick-0 the rename succeeds, so the end result is that dir1 does not
have 'a' and dir2 has file 'b' with gfid-a.
3) At this point both brick processes go down (or a data center
shutdown happens, etc.), so brick-1 still has dir1 with file 'a' with 'gfid-a'
and dir2 with file 'b' with 'gfid-b'.
4) Now when both bricks are back up, dir1 can be healed
conservatively: 'a' will be recreated with 'gfid-a' by healing it from
brick-1 to brick-0 (incorrectly undoing the rename).
5) But for dir2, brick-0 has a file 'b' with gfid-a whereas
brick-1 has a file 'b' with 'gfid-b'. afr at the moment doesn't
store any information to figure out which one is correct.
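The scenario above can be replayed with a small model. This is only an illustrative sketch: each brick is assumed to be a dict of directory -> {name: gfid}, and `rename` and `gfid_conflicts` are hypothetical helpers, not afr code.

```python
import copy

def rename(brick, src_dir, src_name, dst_dir, dst_name):
    """Apply a rename on one brick; an existing destination entry is overwritten."""
    gfid = brick[src_dir].pop(src_name)
    brick[dst_dir][dst_name] = gfid

def gfid_conflicts(b0, b1):
    """List (dir, name, gfid on b0, gfid on b1) for names present on both
    bricks under the same directory but with different gfids."""
    conflicts = []
    for d in sorted(b0):
        for name in sorted(b0[d]):
            if name in b1.get(d, {}) and b1[d][name] != b0[d][name]:
                conflicts.append((d, name, b0[d][name], b1[d][name]))
    return conflicts

# Step 0: both bricks start out identical.
initial = {"dir1": {"a": "gfid-a"}, "dir2": {"b": "gfid-b"}}
brick0 = copy.deepcopy(initial)
brick1 = copy.deepcopy(initial)

# Steps 1-3: the rename completes on brick-0 only, then both bricks go down.
rename(brick0, "dir1", "a", "dir2", "b")

conflicts = gfid_conflicts(brick0, brick1)
print(conflicts)  # [('dir2', 'b', 'gfid-a', 'gfid-b')]
```

The single conflict reported is the dir2/b case from step 5: nothing in the model says which gfid should win.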
To address this issue, the granularity of the preop/postop of entry operations
needs to be made finer, from per-directory to per-entry.
A filename inside a directory can be uniquely identified by the entry-tuple
(parent-gfid, entryname, entry-gfid).
Example: dir2/b in the example above can be represented on brick-1 as
(gfid-of-dir2, b, gfid-b).
So we need to remember such information for every entry fop along with whether
that entry is coming 'in' to the directory or going 'out' of the directory.
So in the previous example we would have remembered that dir2/b with gfid-b is
going out of that directory, so that entry could be deleted and dir2/b with
gfid-a could be healed from brick-0.
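As a rough sketch of how such a record could settle the dir2/b conflict: the code below assumes the pending 'out' records are available as a set of entry-tuples. `EntryTuple` and `pick_source` are hypothetical names, and the rule shown (keep the one candidate that is not marked as going out) is just one reading of the paragraph above.

```python
from typing import NamedTuple, Optional, Set

class EntryTuple(NamedTuple):
    parent_gfid: str
    name: str
    gfid: str

def pick_source(on_b0: EntryTuple, on_b1: EntryTuple,
                pending_out: Set[EntryTuple]) -> Optional[EntryTuple]:
    """Return the entry to keep for a same-name, different-gfid conflict,
    or None when the records do not single out exactly one candidate."""
    candidates = [e for e in (on_b0, on_b1) if e not in pending_out]
    return candidates[0] if len(candidates) == 1 else None

on_b0 = EntryTuple("gfid-of-dir2", "b", "gfid-a")
on_b1 = EntryTuple("gfid-of-dir2", "b", "gfid-b")

# With the 'out' record for (gfid-of-dir2, b, gfid-b), brick-0's entry wins.
with_record = pick_source(on_b0, on_b1, {on_b1})
# Without any record the conflict stays undecidable, as it is today.
without_record = pick_source(on_b0, on_b1, set())
print(with_record, without_record)
```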
Whatever solution we come up with should broadly provide the following
functionality:
1) Given an entry-tuple it should be able to remember that it is going in or
out of that directory.
2) Given an existing entry-tuple it should be able to forget it.
3) Given an entry-tuple, we should be able to query if that entry-tuple is
going in/out.
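The three requirements above amount to a small interface. A minimal in-memory sketch follows; `EntryIndex`, `Direction` and the method names are made up for illustration, and a real implementation would have to persist this state on the brick.

```python
from enum import Enum
from typing import Dict, NamedTuple, Optional

class Direction(Enum):
    IN = "in"
    OUT = "out"

class EntryTuple(NamedTuple):
    parent_gfid: str
    name: str
    gfid: str

class EntryIndex:
    """Hypothetical in-memory index of pending entry operations."""

    def __init__(self) -> None:
        self._pending: Dict[EntryTuple, Direction] = {}

    def remember(self, entry: EntryTuple, direction: Direction) -> None:
        # 1) Record that the entry is going in or out of its directory.
        self._pending[entry] = direction

    def forget(self, entry: EntryTuple) -> None:
        # 2) Drop the record for the entry (a no-op here if it is unknown).
        self._pending.pop(entry, None)

    def query(self, entry: EntryTuple) -> Optional[Direction]:
        # 3) Report the pending direction, or None if nothing is pending.
        return self._pending.get(entry)

index = EntryIndex()
entry = EntryTuple("gfid-of-dir2", "b", "gfid-b")
index.remember(entry, Direction.OUT)
before = index.query(entry)
index.forget(entry)
after = index.query(entry)
print(before, after)
```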
This is one possible way to address this issue:
0) Create the directory .glusterfs/indices/entry and two files, 'in' and 'out',
in that directory.
1) Every time creat/mknod/symlink/link/mkdir happens, create a hardlink from
the path .glusterfs/indices/entry/pargfid/gfid/filename to
'.glusterfs/indices/entry/in' as part of pre-op.
2) Every time unlink/rmdir happens, create a hardlink from the path
.glusterfs/indices/entry/pargfid/gfid/filename to
'.glusterfs/indices/entry/out' as part of pre-op.
3) Every time rename happens, create the following two hardlinks, or three if
the destination exists:
- .glusterfs/indices/entry/old-pargfid/gfid/old-filename to
'.glusterfs/indices/entry/out'
- .glusterfs/indices/entry/new-pargfid/gfid/new-filename to
'.glusterfs/indices/entry/in'
and if the destination exists:
- .glusterfs/indices/entry/new-pargfid/existing-file-gfid/new-filename to
'.glusterfs/indices/entry/out'
4) Delete the same files as part of post-op.
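The steps above could be prototyped roughly as follows. This is a sketch against a scratch directory rather than a real brick; the function names are hypothetical, and the direction of an entry is recovered by checking which marker file the index path shares an inode with.

```python
import os
import tempfile

def init_index(brick_root):
    """Step 0: create the index directory and the 'in'/'out' marker files."""
    base = os.path.join(brick_root, ".glusterfs", "indices", "entry")
    os.makedirs(base, exist_ok=True)
    for marker in ("in", "out"):
        open(os.path.join(base, marker), "a").close()
    return base

def entry_path(base, pargfid, gfid, filename):
    return os.path.join(base, pargfid, gfid, filename)

def remember(base, pargfid, gfid, filename, direction):
    """Pre-op: hardlink the index path to the 'in' or 'out' marker."""
    path = entry_path(base, pargfid, gfid, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    os.link(os.path.join(base, direction), path)

def forget(base, pargfid, gfid, filename):
    """Post-op: remove the index path again."""
    os.unlink(entry_path(base, pargfid, gfid, filename))

def query(base, pargfid, gfid, filename):
    """Return 'in', 'out', or None if nothing is pending for the entry."""
    path = entry_path(base, pargfid, gfid, filename)
    if not os.path.exists(path):
        return None
    for direction in ("in", "out"):
        if os.path.samefile(path, os.path.join(base, direction)):
            return direction
    return None

def rename_preop(base, old_par, old_name, new_par, new_name, gfid,
                 existing_gfid=None):
    """Step 3: two hardlinks, plus a third if the destination exists."""
    remember(base, old_par, gfid, old_name, "out")
    remember(base, new_par, gfid, new_name, "in")
    if existing_gfid is not None:
        remember(base, new_par, existing_gfid, new_name, "out")

# Replay the pre-op of: rename dir1/a -> dir2/b, overwriting dir2/b.
base = init_index(tempfile.mkdtemp())
rename_preop(base, "gfid-of-dir1", "a", "gfid-of-dir2", "b", "gfid-a",
             existing_gfid="gfid-b")
pending = [
    query(base, "gfid-of-dir1", "gfid-a", "a"),
    query(base, "gfid-of-dir2", "gfid-a", "b"),
    query(base, "gfid-of-dir2", "gfid-b", "b"),
]
print(pending)  # ['out', 'in', 'out']
```

The sketch leaves the now-empty pargfid/gfid directories behind after `forget`; whether and when those get cleaned up is not covered by the steps above.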