[Gluster-devel] Some performance issues in mount/fuse


From: Xavier Hernandez
Date: Mon, 11 Mar 2013 11:49:47 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130221 Thunderbird/17.0.3

Hello,

I've recently run some tests with Gluster on a fast network (IP over InfiniBand) and got some unexpected results. It seems that mount/fuse becomes a bottleneck when the network and disks are very fast.

I started with a simple distributed volume with 2 bricks placed on a ramdisk to avoid possible disk bottlenecks (I repeated the tests with an SSD and, later, with a normal hard disk, and the results were the same, probably thanks to the performance translators). With this configuration, a single write reached a throughput of ~420 MB/s. That's well below the network limit, but for a single write it's quite acceptable. However, with two concurrent writes (carefully chosen so that each one goes to a different brick), the throughput was ~200 MB/s for each transfer. That was totally unexpected: with plenty of bandwidth available and no IO limitation, I expected something near 800 MB/s.
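
For reference, the test setup was roughly the following (hostnames and paths here are placeholders, not the exact ones I used):

    # on each server: put the brick on a tmpfs to rule out disk IO
    mount -t tmpfs -o size=8g tmpfs /mnt/ram

    # 2-brick distributed volume, no replication
    gluster volume create test server1:/mnt/ram/brick \
                               server2:/mnt/ram/brick
    gluster volume start test

    # on the client
    mount -t glusterfs server1:/test /mnt/test
    dd if=/dev/zero of=/mnt/test/file1 bs=128k count=32768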

In fact, any combination of concurrent writes always led to the same combined throughput of ~400 MB/s.

Trying to determine the cause of this odd behavior, I noticed that mount/fuse uses a single thread to serve kernel requests: once a request is received, it is sent down the xlator stack, and the next request is only read after the stack returns. This means that to sustain a 420 MB/s throughput with 128 KB per request (the current maximum block size), it has to serve at least 3360 requests per second (420 * 1024 / 128), i.e. process each request in about 300 us. If we take into account that every translator allocates memory and makes some system calls, it's quite plausible that serving a request really does take around 300 us.
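
Schematically, the current model looks like this (a condensed sketch, not the actual fuse code; handle_request() is a stand-in for decoding the message and winding it down the xlator stack):

    #include <unistd.h>

    extern void handle_request (char *buf, ssize_t len);  /* stand-in */

    static void
    fuse_serve_loop (int fuse_fd, char *buf, size_t bufsize)
    {
        for (;;) {
            ssize_t len = read (fuse_fd, buf, bufsize);
            if (len <= 0)
                break;
            /* nothing else is read from the kernel until this call
             * returns from the bottom of the stack */
            handle_request (buf, len);
        }
    }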

To check whether this was the case, I added the performance/io-threads xlator just below mount/fuse. It queues each request to a worker thread, freeing the reader thread to fetch the next request much sooner than those 300 us. This should improve the concurrent-writes case.
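
In practice this means making io-threads the topmost volume of the client volfile, so that it sits directly below mount/fuse (a sketch; 'dht' stands for whatever the previous top volume was):

    volume iot
        type performance/io-threads
        option thread-count 16     # 16 is the default
        subvolumes dht
    end-volume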

The results were good: with this simple modification, 2 concurrent writes performed at ~300 MB/s each. However, the throughput of a single write dropped to ~250 MB/s. In any case, this solution is not valid, because something in this configuration is incompatible and some operations misbehave (for example, a simple 'ls' does not show all the files).

Then I modified the mount/fuse xlator itself to start several threads to serve kernel requests. With this modification everything seems to work as expected and throughput is noticeably better: a single write still performs at ~420 MB/s, and 2 concurrent writes reach ~330 MB/s each. In fact, any combination of 2 or more concurrent writes gives a combined throughput of ~650 MB/s.
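
The change boils down to something like this (again a sketch with made-up names, not the actual patch). The kernel hands each request read from /dev/fuse to exactly one reader, so several requests can travel down the stack in parallel:

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define FUSE_NUM_READERS 4    /* assumed thread count */

    extern void handle_request (char *buf, ssize_t len);  /* stand-in */

    struct reader_args {
        int    fuse_fd;           /* fd of /dev/fuse */
        size_t bufsize;           /* maximum request size */
    };

    static void *
    fuse_reader (void *data)
    {
        struct reader_args *a = data;
        char *buf = malloc (a->bufsize);  /* one buffer per thread */

        while (buf != NULL) {
            ssize_t len = read (a->fuse_fd, buf, a->bufsize);
            if (len <= 0)
                break;
            handle_request (buf, len);
        }
        free (buf);
        return NULL;
    }

    static void
    start_fuse_readers (struct reader_args *a)
    {
        pthread_t tid;
        int       i;

        for (i = 0; i < FUSE_NUM_READERS; i++)
            pthread_create (&tid, NULL, fuse_reader, a);
    }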

However, a replicate volume does not improve at all, and I'm not sure why; there seems to be some kind of serialization point in cluster/afr. A single write has a throughput of ~175 MB/s, and 2 concurrent writes reach ~85 MB/s each. I'll have to investigate this further.

Does all this make sense?

Is this something worth investing more time in?

Regards,

Xavi


