
Re: [Gluster-devel] Architecture advice


From: Gordan Bobic
Subject: Re: [Gluster-devel] Architecture advice
Date: Mon, 12 Jan 2009 23:41:24 +0000
User-agent: Thunderbird 2.0.0.19 (X11/20090107)

Martin Fick wrote:
Why is that the correct way?  There's nothing
wrong with having "bonding" at the glusterfs
protocol level, is there?

The problem is that it only covers a very narrow edge case
that isn't all that likely. A bonded NIC over separate
switches all the way to both servers is a much more sensible
option. Or else what failure are you trying to protect
yourself against? It's a bit like fitting a big padlock
on the door when there's a wall missing.

I think you need to be more specific rather than using analogies. My only guess from your assertions is that you have a very narrow, specific use case / setup / terminology in mind that does not necessarily mesh with my narrow use case ... :)

LOL! That is a distinct possibility. :)

So, the HA translator supports talking to two
different servers with two different transport
mechanisms and two different IPs. Bonding does not support anything like this, as far as I can tell.

True. Bonding is more transparent. You make two NICs into one virtual NIC and round-robin packets down them. If one NIC/path fails, all the traffic will fail over to the other NIC/path.
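For what it's worth, a minimal sketch of the sort of thing I mean (RHEL-style syntax; the interface names, addresses and bonding mode are just placeholders to adjust for your setup):

  # /etc/modprobe.conf - load the bonding driver for bond0
  alias bond0 bonding
  options bond0 mode=balance-rr miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0 - the virtual NIC
  DEVICE=bond0
  IPADDR=192.168.0.1
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1) - slaves
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Run each slave to a different switch and a NIC, cable or switch failure just shifts the traffic onto the surviving path.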

So, it seems like you are assuming a
different back-end use case, one where the servers share the same IP, perhaps using round robin or perhaps in an active/passive way.

No, not at all. Multiple servers, 1 floating IP per server. Floating as in it can be migrated to the other server if one fails. You balance the load by assigning half of your clients to one floating IP, and the other half of the clients to the other floating IP. So, when both servers are up, each handles half the load. If one server fails, its IP gets migrated to the other server, and all clients thereafter talk to the surviving server since it has both IPs (until the other server comes back up and asks for its IP address back).
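With Heartbeat v1-style configuration that is just two resource lines (hostnames, addresses and interface are placeholders):

  # /etc/ha.d/haresources - one floating IP preferred on each server
  serverA IPaddr::192.168.0.10/24/eth0
  serverB IPaddr::192.168.0.11/24/eth0

Half the clients get pointed at .10 and the other half at .11; if serverA dies, heartbeat brings .10 up on serverB, and when serverA comes back it reclaims its address.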

Both
of these are very different beasts and I would
need to know which you are talking about to
understand what you are getting at.  But the HA
translator setup is closer to the round robin
(active/active) setup and I am guessing you are talking about an active/passive setup.

In general, there are relatively few things that you cannot make active/active, so I always mean active/active + failover unless I explicitly state it.

That is somewhat what the HA translator is, except
that it is supposed to take care of some additional
failures.  It is supposed to retransmit "in
progress" operations that have not succeeded because of
comm failures (I have yet to figure out where in the code
this happens though).

This is a reinvention of a wheel. NFS already handles this
gracefully for the use-case you are describing.

I am lost, what does NFS have to do with it?

It already handles the "server has gone away" situation gracefully. What I'm saying is that you can use GlusterFS underneath for mirroring the data (AFR) and re-export with NFS to the clients. If you want to avoid client-side AFR and still have graceful failover with lightweight transport, NFS is not a bad choice.
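Roughly, the GlusterFS half on each server looks like this (2.0-era volfile syntax from memory; hostnames, paths and exact option names are placeholders that need checking against your version):

  # mirror.vol - mounted locally on each server, then re-exported over NFS
  volume local
    type protocol/client
    option transport-type tcp
    option remote-host serverA          # this server's own brick
    option remote-subvolume brick
  end-volume

  volume remote
    type protocol/client
    option transport-type tcp
    option remote-host serverB          # the other server's brick
    option remote-subvolume brick
  end-volume

  volume mirror
    type cluster/afr                    # server-to-server mirroring
    subvolumes local remote
  end-volume

Each server also runs the usual protocol/server volfile exporting its local storage/posix volume as "brick"; the mount point of "mirror" is what gets re-exported to the clients over NFS.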

Why re-invent the wheel when the tools to deal
with these
failure modes already exist?
Are you referring to bonding here? If so, see above
for why HA may be better (or an additional benefit).

My original point is that it doesn't add anything new
that you couldn't achieve with tools that are already
available.


Well, I was trying to explain to you that it does, but then the NFS thing came up and now I am confused.

How do current tools achieve the following
setup? Client A talks to Server A and submits a read request. The read request is received on Server A (TCP ACKed to the client), and then Server A dies. How will that request be completed without glusterfs returning an "endpoint not connected" error?

You make client <-> server comms NFS.
You make server <-> server comms GlusterFS.

If the NFS server goes away, the client will keep retrying until the server returns. In this case, that would mean it'll keep retrying until the other server fails the IP address over to itself.

This achieves:
1) server side AFR with GlusterFS for redundancy
2) the client connects to a single server via NFS, so there's no doubled bandwidth use on the client as there would be with client-side AFR (see the exports sketch below)
3) servers can fail over relatively transparently to the client
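The NFS side is then just an ordinary export of that GlusterFS mount point, something like (a guess at sensible options; re-exporting a FUSE mount through the kernel nfsd needs an explicit fsid, or a userspace NFS server such as unfs3 instead):

  # /etc/exports on both servers - export the locally mounted mirror
  /mnt/mirror  192.168.0.0/24(rw,sync,no_subtree_check,fsid=10)

Clients mount their assigned floating IP rather than a server's real address, so the export follows the IP across a fail-over.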

No, I have not confirmed that this actually
works with the HA translator, but I was told
that the following would happen if it were used. Client A talks to Server A and submits a read request. The read request is received on Server A (TCP acked to the client), and then Server A dies. Client A
will then in theory retry the read request
on Server B.  Bonding cannot do anything
like this (since the read was TCP ACKed)?

Agreed, if a server fails, bonding won't help. Cluster fail-over server-side, however, will, provided the network file system protocol can deal with it reasonably well.

Neither can heartbeat/failover
of an active/passive backend since on the
first failure the client will get a connection error (and the glusterfs client protocol does not retransmit).

This is where I clearly failed to clarify what I meant. I was talking about using NFS for the client<->server part of the communication. NFS will typically block until the server starts responding again (note: it doesn't have to be the same server, just one like it).
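That blocking behaviour is just the standard hard NFS mount, e.g. (floating IP and paths are placeholders):

  # on the client - a hard mount retries forever instead of returning an error
  mount -t nfs -o hard,intr 192.168.0.10:/mnt/mirror /mnt/data

With "hard", processes simply stall until the server (or whichever server is currently holding that IP) answers again; "intr" just lets you interrupt them in the meantime.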

I think that this is quite different from
any bonding solution. Not better, different. If I were to use this it would not preclude me from also using bonding, but it solves a somewhat different problem. It is not a complete solution; it is a piece, but not
a duplicated piece.  If you don't like it,
or it doesn't fit your backend use case, don't use it! :)

If it can handle the described failure more gracefully than what I'm proposing, then I'm all for it. I'm just not sure there is that much scope for it being better, since the last write may not have made it to the mirror server anyway; so even if the protocol can retry, it would need some kind of journaling to roll back the incomplete operation and replay it.

This, however, is a much more complex approach (very similar to what GFS does), and there is a high price to pay in terms of performance when the nodes aren't on the same LAN.

Yes, if a server goes down you are fine (aside from the
scenario where the other server then goes down followed
by the first one coming back up).  But, if you are using
the HA translator above and the communication goes down
between the two servers you may still get split brain
(thus the need for heartbeat/fencing).
And therein lies the problem - unless you are proposing
adding a complete fencing infrastructure into glusterfs,
too.

No. I am proposing adding a complete transactional model to AFR so that if a write fails on one node, some policy can decide whether the same write should be committed or rolled back on the other nodes. Today, the policy is to simply apply it to the other nodes regardless. This is a recipe for split brain.

OK, I get what you mean. It's basically the same problem I described above when I mentioned that you'd need some kind of a journal to roll back the operation that hasn't been fully committed.

In the case of network segregation some policy should decide to allow writes to be applied on one side of the segregation and denied on the other. This does not require fencing (though it would be better with it); it could be a simple policy like "apply writes if a majority of nodes can be reached", and otherwise fail (or, even better, block).

Hmm... This could lead to an elastic shifting quorum. I'm not sure how you'd handle resyncing if nodes are constantly leaving/joining. It seems a bit non-deterministic.

AFR needs to be able to write all or nothing to all
servers until some external policy machine (such as
heartbeat) decides that it is safe (because of fencing or
other mechanism) to proceed writing to only a portion of the
subvolumes (servers).  Without this I don't see how you
can prevent split brain?
With server-side AFR, split brain cannot really occur (OK,
there's a tiny window of opportunity for it if the
server isn't really totally dead, since there's no
total FS lock-out until fencing is completed like on GFS,
but it's probably close enough). If the servers
can't heartbeat to each other, they can't AFR to
each other, either. So either the write gets propagated, or
it doesn't. The machine that remained operational will
have more up-to-date files, and as necessary those will get
synced back. It's not quite as tight in terms of
ensuring data consistency as a DRBD+GFS solution would be,
but it is probably close enough for most use-cases.


I guess what you call tiny, I call huge. Even if you have your heartbeat fencing occur in under a tenth of a second, that is time enough to split brain a major portion of a filesystem. I would never trust it.

In GlusterFS that problem exists anyway, but it is largely mitigated by the fact that it works at file level rather than block device level. In the case of GFS, RHCS will block all access to the file system until the node is successfully fenced and confirmed fenced, before rolling back its journals and resuming operation.

To borrow your analogy, adding heartbeat to the current AFR: "It's a bit like fitting a big padlock on the door when there's a wall missing." :) Every single write needs to ensure that it will not cause split brain for me to trust it.

Sounds like GlusterFS isn't necessarily the solution for you, then. :(

If not, why would I bother with glusterfs over
AFR instead of glusterfs over DRBD? Oh right, because I cannot get glusterfs to fail over without
incurring connection errors on the client! ;)
(not your beef, I know, from another thread)

Precisely - which is why I originally suggested not using GlusterFS for client-server communication. :)

This is one reason I was hoping that the HA
translator would address this, but the HA
translator is useless in an active/passive
backend setup; it only works in active/active.
If you try using it in an active/passive setup,
during failover it will retry too quickly on
the second server, causing connection errors
on the client!!!  This is the primary reason
that I am suggesting that the HA translator
block until the connection is restored; that
would allow failovers to occur.

And this is exactly why I suggested using NFS for the client<->server connection. NFS blocks until the server becomes contactable again.

But, to be clear, I am not disagreeing with you
that the HA translator does not solve the split
brain problem at all. Perhaps this is what is really "upsetting" you: not that it is "duplicated" functionality, but rather that it does not help AFR solve its split brain personality disorders; it only helps make them more available, thus making split brain even more likely!! ;(

I'm not sure it makes it any worse WRT split-brain; it just seems that you are looking for GlusterFS+HA to provide you with exactly the same set of features that NFS+(server fail-over) already provides. Of course, there could be advantages in GlusterFS behaving the same way as NFS when the server goes away if it's a single-server setup - it would be easier to set up and a bit more elegant. But it wouldn't add any functionality that couldn't be re-created using the sort of setup I described.

Gordan



