
Re: [Gluster-devel] Architecture advice


From: Gordan Bobic
Subject: Re: [Gluster-devel] Architecture advice
Date: Mon, 12 Jan 2009 23:41:24 +0000
User-agent: Thunderbird 2.0.0.19 (X11/20090107)

Martin Fick wrote:
Why is that the correct way?  There's nothing
wrong with having "bonding" at the glusterfs
protocol level, is there?

The problem is that it only covers a very narrow edge case
that isn't all that likely. A bonded NIC over separate
switches all the way to both servers is a much more sensible
option. Or else what failure are you trying to protect
yourself against? It's a bit like fitting a big padlock
on the door when there's a wall missing.

I think you need to be more specific rather than using analogies. My only guess from your assertions is that you have a very narrow, specific use case / setup / terminology in mind that does not necessarily mesh with my narrow use case ... :)

LOL! That is a distinct possibility. :)

So, the HA translator supports talking to two
different servers with two different transport
mechanisms and two different IPs. Bonding does not support anything like this, as far as I can tell.

True. Bonding is more transparent. You make two NICs into one virtual NIC and round-robin packets down them. If one NIC/path fails, all the traffic will fail over to the other NIC/path.
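For what it's worth, a minimal sketch of the sort of thing I mean (RHEL-style syntax; the interface names, addresses and bonding mode are just placeholders to adjust for your setup):

  # /etc/modprobe.conf - load the bonding driver for bond0
  alias bond0 bonding
  options bond0 mode=balance-rr miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0 - the virtual NIC
  DEVICE=bond0
  IPADDR=192.168.0.1
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1) - slaves
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Run each slave to a different switch and a NIC, cable or switch failure just shifts the traffic onto the surviving path.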

So, it seems like you are assuming a
different back-end use case, one where the servers share the same IP, perhaps using round robin or perhaps in an active/passive way.

No, not at all. Multiple servers, 1 floating IP per server. Floating as in it can be migrated to the other server if one fails. You balance the load by assigning half of your clients to one floating IP, and the other half of the clients to the other floating IP. So, when both servers are up, each handles half the load. If one server fails, its IP gets migrated to the other server, and all clients thereafter talk to the surviving server since it has both IPs (until the other server comes back up and asks for its IP address back).
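With Heartbeat v1-style configuration that is just two resource lines (hostnames, addresses and interface are placeholders):

  # /etc/ha.d/haresources - one floating IP preferred on each server
  serverA IPaddr::192.168.0.10/24/eth0
  serverB IPaddr::192.168.0.11/24/eth0

Half the clients get pointed at .10 and the other half at .11; if serverA dies, heartbeat brings .10 up on serverB, and when serverA comes back it reclaims its address.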

Both
of these are very different beasts and I would
need to know which you are talking about to
understand what you are getting at.  But the HA
translator setup is closer to the round robin
(active/active) setup and I am guessing you are talking about an active/passive setup.

In general, there are relatively few things that you cannot make active/active, so I always mean active/active + failover unless I explicitly state it.

That is somewhat what the HA translator is, except
that it is supposed to take care of some additional
failures.  It is supposed to retransmit "in
progress" operations that have not succeeded because of
comm failures (I have yet to figure out where in the code
this happens though).

This is a reinvention of a wheel. NFS already handles this
gracefully for the use-case you are describing.

I am lost, what does NFS have to do with it?

It already handles the "server has gone away" situation gracefully. What I'm saying is that you can use GlusterFS underneath for mirroring the data (AFR) and re-export with NFS to the clients. If you want to avoid client-side AFR and still have graceful failover with lightweight transport, NFS is not a bad choice.
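Roughly, the GlusterFS half on each server looks like this (2.0-era volfile syntax from memory; hostnames, paths and exact option names are placeholders that need checking against your version):

  # mirror.vol - mounted locally on each server, then re-exported over NFS
  volume local
    type protocol/client
    option transport-type tcp
    option remote-host serverA          # this server's own brick
    option remote-subvolume brick
  end-volume

  volume remote
    type protocol/client
    option transport-type tcp
    option remote-host serverB          # the other server's brick
    option remote-subvolume brick
  end-volume

  volume mirror
    type cluster/afr                    # server-to-server mirroring
    subvolumes local remote
  end-volume

Each server also runs the usual protocol/server volfile exporting its local storage/posix volume as "brick"; the mount point of "mirror" is what gets re-exported to the clients over NFS.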

Why re-invent the wheel when the tools to deal
with these
failure modes already exist?
Are you referring to bonding here? If so, see above
for why HA may be better (or an additional benefit).

My original point is that it doesn't add anything new
that you couldn't achieve with tools that are already
available.


Well, I was trying to explain to you that it does, but then the NFS thing came up and now I am confused.

How do current tools achieve the following
setup? Client A talks to Server A and submits a read request. The read request is received on Server A (TCP ACKed to the client), and then Server A dies. How will that request be completed without glusterfs returning an "endpoint not connected" error?

You make client <-> server comms NFS.
You make server <-> server comms GlusterFS.

If the NFS server goes away, the client will keep retrying until the server returns. In this case, that would mean it'll keep retrying until the other server fails the IP address over to itself.

This achieves:
1) server side AFR with GlusterFS for redundancy
2) the client connects to a single server via NFS, so there's no doubled bandwidth use on the client as there would be with client-side AFR (see the exports sketch below)
3) servers can fail over relatively transparently to the client
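The NFS side is then just an ordinary export of that GlusterFS mount point, something like (a guess at sensible options; re-exporting a FUSE mount through the kernel nfsd needs an explicit fsid, or a userspace NFS server such as unfs3 instead):

  # /etc/exports on both servers - export the locally mounted mirror
  /mnt/mirror  192.168.0.0/24(rw,sync,no_subtree_check,fsid=10)

Clients mount their assigned floating IP rather than a server's real address, so the export follows the IP across a fail-over.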

No, I have not confirmed that this actually
works with the HA translator, but I was told
that the following would happen if it were used. Client A talks to Server A and submits a read request. The read request is received on Server A (TCP acked to the client), and then Server A dies. Client A
will then in theory retry the read request
on Server B.  Bonding cannot do anything
like this (since the read was TCP ACKed)?

Agreed, if a server fails, bonding won't help. Cluster fail-over server-side, however, will, provided the network file system protocol can deal with it reasonably well.

Neither can heartbeat/failover
of an active/passive backend since on the
first failure the client will get a connection error (and the glusterfs client protocol does not retransmit).

This is where I clearly failed to clarify what I meant. I was talking about using NFS for the client<->server part of the communication. NFS will typically block until the server starts responding again (note: it doesn't have to be the same server, just one like it).
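That blocking behaviour is just the standard hard NFS mount, e.g. (floating IP and paths are placeholders):

  # on the client - a hard mount retries forever instead of returning an error
  mount -t nfs -o hard,intr 192.168.0.10:/mnt/mirror /mnt/data

With "hard", processes simply stall until the server (or whichever server is currently holding that IP) answers again; "intr" just lets you interrupt them in the meantime.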

I think that this is quite different from
any bonding solution. Not better, different. If I were to use this it would not preclude me from also using bonding, but it solves a somewhat different problem. It is not a complete solution; it is a piece, but not
a duplicated piece.  If you don't like it,
or it doesn't fit your backend use case, don't use it! :)

If it can handle the described failure more gracefully than what I'm proposing, then I'm all for it. I'm just not sure there is that much scope for it being better, since the last write may not have made it to the mirror server anyway; so even if the protocol can retry, it would need some kind of journaling to roll back the incomplete operation and replay it.

This, however, is a much more complex approach (very similar to what GFS does), and there is a high price to pay in terms of performance when the nodes aren't on the same LAN.

Yes, if a server goes down you are fine (aside from the
scenario where the other server then goes down followed
by the first one coming back up).  But, if you are using
the HA translator above and the communication goes down
between the two servers you may still get split brain
(thus the need for heartbeat/fencing).
And therein lies the problem - unless you are proposing
adding a complete fencing infrastructure into glusterfs,
too.

No. I am proposing adding a complete transactional model to AFR so that if a write fails on one node, some policy can decide whether the same write should be committed or rolled back on the other nodes. Today, the policy is to simply apply it to the other nodes regardless. This is a recipe for split brain.

OK, I get what you mean. It's basically the same problem I described above when I mentioned that you'd need some kind of a journal to roll back the operation that hasn't been fully committed.

In the case of network segregation some policy should decide to allow writes to be applied on one side of the segregation and denied on the other. This does not require fencing (though it would be better with it); it could be a simple policy like "apply writes if a majority of nodes can be reached", and otherwise fail (or, even better, block).

Hmm... This could lead to an elastic shifting quorum. I'm not sure how you'd handle resyncing if nodes are constantly leaving/joining. It seems a bit non-deterministic.

AFR needs to be able to write all or nothing to all
servers until some external policy machine (such as
heartbeat) decides that it is safe (because of fencing or
other mechanism) to proceed writing to only a portion of the
subvolumes (servers).  Without this I don't see how you
can prevent split brain?
With server-side AFR, split brain cannot really occur (OK,
there's a tiny window of opportunity for it if the
server isn't really totally dead, since there's no
total FS lock-out until fencing is completed like on GFS,
but it's probably close enough). If the servers
can't heartbeat to each other, they can't AFR to
each other, either. So either the write gets propagated, or
it doesn't. The machine that remained operational will
have more up-to-date files, and as necessary those will get
synced back. It's not quite as tight in terms of
ensuring data consistency as a DRBD+GFS solution would be,
but it is probably close enough for most use-cases.


I guess what you call tiny, I call huge. Even if you have your heartbeat fencing occur in under a tenth of a second, that is time enough to split brain a major portion of a filesystem. I would never trust it.

In GlusterFS that problem exists anyway, but it is largely mitigated by the fact that it works at file level rather than block device level. In the case of GFS, RHCS will block all access to the file system until the node is successfully fenced and confirmed fenced, before rolling back its journals and resuming operation.

To borrow your analogy, adding heartbeat to the current AFR: "It's a bit like fitting a big padlock on the door when there's a wall missing." :) Every single write needs to ensure that it will not cause split brain for me to trust it.

Sounds like GlusterFS isn't necessarily the solution for you, then. :(

If not, why would I bother with glusterfs over
AFR instead of glusterfs over DRBD? Oh right, because I cannot get glusterfs to fail over without
incurring connection errors on the client! ;)
(not your beef, I know, from another thread)

Precisely - which is why I originally suggested not using GlusterFS for client-server communication. :)

This is one reason I was hoping that the HA
translator would address this, but the HA
translator is useless in an active/passive
backend setup; it only works in active/active.
If you try using it in an active/passive setup,
during failover it will retry too quickly on
the second server, causing connection errors
on the client!!!  This is the primary reason
that I am suggesting that the HA translator
block until the connection is restored; that
would allow failovers to occur.

And this is exactly why I suggested using NFS for the client<->server connection. NFS blocks until the server becomes contactable again.

But, to be clear, I am not disagreeing with you
that the HA translator does not solve the split
brain problem at all. Perhaps this is what is really "upsetting" you: not that it is "duplicated" functionality, but rather that it does not help AFR solve its split brain personality disorders; it only helps make them more available, thus making split brain even more likely!! ;(

I'm not sure it makes it any worse WRT split-brain; it just seems that you are looking for GlusterFS+HA to provide you with exactly the same set of features that NFS+(server fail-over) already provides. Of course, there could be advantages in GlusterFS behaving the same way as NFS when the server goes away if it's a single-server setup - it would be easier to set up and a bit more elegant. But it wouldn't add any functionality that couldn't be re-created using the sort of setup I described.

Gordan



