Hi Vijay,
I will answer your questions inline:
On 11/21/2013 09:25 PM, Vijay Bellur wrote:
On 11/19/2013 10:51 PM, Alex Pyrgiotis wrote:
[for the installation...] we consulted the Debian README [1] of
GlusterFS and downloaded the packages for Wheezy. Although the
packages were installed properly, to the best of our knowledge we
found no trace of the libgfapi library. Thus, we cloned the git
repo, checked out the 3.4 branch and compiled
liblglusterfs/libgfapi from there.
<...snip...>
I think libgfapi is part of glusterfs-common in Debian. But I will
work with our package maintainers to see if having a separate
package for libgfapi is possible.
Hm, that's interesting.
When I was searching for the libgfapi library, I was simply looking for
the header files. I dug into this a bit more and ran:
dpkg -c glusterfs-common_3.4.1-2_amd64.deb
Although the header files are nowhere to be found, I see that
libgfapi.so is indeed there. I looked into this issue and found this
bug report:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717558
My guess is that the Debian maintainers' fix has not been backported to
the packages you create for Wheezy, so you probably need to add it to
the packages you provide.
In any case, I set up a single-node Gluster server/client and copied
the header files by hand, to check whether the tests I have posted
pass. It seems that the first two tests complete successfully, while
the Python test still fails. I'll re-run them once I tear down my
cluster and rebuild it with the correct packages.
<...snip...>
1. There are no async operations for open/create/close/stat/unlink,
which are necessary for various operations of Archipelago.
Is there more description on how various operations of Archipelago
rely on async operations for open/create etc.? I must admit that I
haven't gone through your code but will definitely do so to get a
better understanding.
Sure, I'll explain our rationale, but first let me provide some insight
into the fundamental logic of Archipelago, to establish the context in
which we operate:
An Archipelago volume is a COW volume, consisting of many contiguous
pieces (volume chunks), typically 4MB in size. It is COW since it may
share read-only chunks with other volumes (e.g. if the volumes are
created from the same OS image) and creates new chunks when it needs to
write. In order to refer to a chunk, we assign it a name (e.g.
volume1_0002), which can be considered the object name (Rados) or file
name (Gluster, NFS).
The above logic is handled by separate Archipelago entities (mappers,
volume composers). This means that the storage driver's only task is to
read/write chunks from and to the storage. Moreover, given that there
is one such driver per host - a host where 50 VMs can be running - it
must handle a lot of chunks.
Now, back to our storage driver and the need for full asynchronism. When
it receives a read/write request for a chunk, it will generally need to
open the file, create it if it doesn’t exist, perform the I/O and
finally close the file. Having non-blocking read/write but blocking
open/create/close essentially turns the whole request into a blocking
one. So, if the driver supports e.g. 64 in-flight requests, it needs 64
threads to be able to manage all of them.
Let’s assume that open/create/close are indeed non-blocking or virtually
nonexistent [1]. Most importantly, this would greatly reduce the
read/write latency, especially for 4k requests. Another benefit is the
ability to use a much smaller number of threads. However, besides
read/write, there are other operations that the driver must support,
such as stat()ing or deleting a file. If these operations are blocking,
a stray delete or stat can stall our driver; once more, it needs a lot
of threads to remain operational.
2. There is no way to create notifications on a file (as Rados can
with its objects).
How are these notifications consumed?
They are consumed by the lock/unlock operations, which are also handled
by our driver. For instance, the Rados driver can wait asynchronously
for someone to unlock an object by registering a watch on the object
along with a callback function. Correspondingly, the unlock operation
makes sure to send a notification to all watchers of the object. Thus,
the lock/unlock operations can happen asynchronously [2].
I have read that Gluster supports POSIX locks, but this is not the
locking scheme we have in mind. We need a persistent kind of lock that
stays on a file even if the process closes the file descriptor or,
worse, crashes.
Our current solution is to create a “lock file”, e.g.
“volume1_0002_lock”, with the owner name written in it. The lock/unlock
operations then generally proceed as follows:
a) Lock: Try to exclusively create the lock file. If successful, write
the owner id to it. If not, sleep for 1 second and retry.
b) Unlock: Read the lock file and its owner. If we are the owner,
delete it. Else, fail.
As you can see, this is not an elegant approach and it is subject to
race conditions. If Gluster can provide a better solution, we would be
more than happy to hear about it.
Regards,
Alex
[1] In our NFS driver, we sidestep this issue by caching the file
descriptors, so that the blocking open/create/close calls happen less
frequently, provided the cache is large enough. This, however, is not a
reliable solution.
[2] To be fair, Rados does not currently have an asynchronous
lock/unlock function, so we spawn a thread that handles this task.
Since lock/unlock operations are rare though, the driver's performance
is not affected by this.