I'm evaluating Spread for our worldwide server communication, as it seems
the most advanced tool.
I almost ran into a pitfall: Spread isn't "reliable" in the way I thought, as
it is not bulletproof against host, daemon, or network downtime. Messages just
disappear if a host isn't joined for some reason.
As a reader of your book (it's great!) I found the little note in the Spread
chapter p. 245: "Attempting to deal with this scenario ... Would require
... SAFE messages as well as adding additional logic to the purge daemon."
Now I'm interested in the "additional logic". What do you mean by that?
That I have to build my own complete queue system for messages? Do you have
more information on this topic?
Thanks!
A description of Spread
Spread provides EVS (extended virtual synchrony), which provides reliable delivery of messages. The problem is that it sounds like you need durability of messages. Spread will reliably deliver a message to the group, and if it is SAFE you are guaranteed that if the group membership changes, the new membership is notified of that at the right time. So, you always know exactly who received which messages -- and thus it is both reliable and safe.
In Scalable Internet Architectures, I discuss handling the simple case of cache purging on machines that drop out of the group membership. Caches are unique in that they hold no critical data that cannot be replaced. So a cache can simply purge ALL of its contents upon joining a group and you are guaranteed that it is consistent.
To do durable messaging (which means that nodes that are down will receive the messages sent to them once they rejoin), you likely want closed group semantics and you'd need to build a durable message store. This is not a trivial exercise. Without knowing the exact problem you are trying to solve, I would guess that Spread is not likely the tool you want and you might look instead to something like a JMS system or ActiveMQ.
A more detailed explanation of the problem at hand.
So if one receiver host (that joined a group) is down, the group sender sees
which hosts received the message (and which did not?). I think the main point
is that if a host goes down it's no longer subscribed to the group and
therefore Spread doesn't care about it anymore...
I'll describe the challenge in a little more detail and maybe you know the
right tool. Further, I thought about our own queue system and I'll
describe that. I know that your help is voluntary but your knowledge about
this seems to be unique.
1. Our challenge
We have to build a worldwide "durable" messaging bus that has to be
resilient against network, host and daemon faults (http://www.jimdo.com/).
The sender daemon should resend the message until it has got an "OK" from
all hosts. The main feature is to sync databases/data between the servers.
The messages will probably be XML-RPC, as George describes it in his book.
Our main programming language is PHP and will not change in the near future.
The current system is XML-RPC over HTTP but that isn't really reliable ...
2. Own queue system
After I noticed that Spread will not do all the work for us I thought about
a queue system based on Spread.
The sender has a queue (SQL) table that stores the transactions that have to
be done in a specific group. A background daemon checks which transactions
are not completed on all servers and sends them again into the messaging bus.
On the receiver's side there's also a queue with a transaction id and a daemon
processes the messages. It will ignore transaction ids that are already
done.
If one "message" is done, the receiver will send "OK" with its "server_id" and
transaction_id to the originating server (extra Spread group).
Now the originating server gets the OK, and if all servers have sent an "OK"
(because it knows its servers) the transaction is fully completed.
This would also work without Spread, over HTTP/SOAP/REST/whatever, because
I took the durability part out of the transport layer.
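A minimal sketch of the receiver side described above, in Python with sqlite3 rather than PHP, purely for illustration. The table name `done`, the functions `open_receiver_db` and `handle_message`, and the shape of the returned "OK" are all assumptions made for this example; the transport (Spread, HTTP, or anything else) is deliberately left out.

```python
import sqlite3


def open_receiver_db(path=":memory:"):
    # Hypothetical schema: one row per transaction id already applied.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS done (transaction_id TEXT PRIMARY KEY)"
    )
    return db


def handle_message(db, server_id, transaction_id, apply_change):
    """Apply a transaction at most once and build the OK for the sender."""
    with db:  # commits on success, rolls back if apply_change raises
        cur = db.execute(
            "INSERT OR IGNORE INTO done (transaction_id) VALUES (?)",
            (transaction_id,),
        )
        if cur.rowcount == 1:
            # First time we see this id: do the actual work. The work is
            # only atomic with the bookkeeping if it uses the same db.
            apply_change()
        # rowcount == 0 means the id was already recorded: ignore the
        # duplicate, but still answer OK so the sender can stop resending.
    return {"status": "OK", "server_id": server_id,
            "transaction_id": transaction_id}
```

Because a resent message only produces another "OK", the sender can retry as often as it likes. If `apply_change` fails, nothing is recorded and no "OK" is returned, so the transaction stays open on the sender's side.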
I hope these points make my problem much clearer.
Maybe there is a prebuilt solution that does exactly these things, or do we
have to build it on our own?
There's a significant feature gap between Spread and durability
So, it is clear now that you are attempting to build a durable messaging system. It isn't clear what the ordering requirements are, but as you mention updating databases, I assume strict ordering guarantees are required. Database replication that doesn't have a single master is a deep problem with considerable nuance -- it is possible, but not an "exercise left to the reader" if you know what I mean.
Establishing durability on top of Spread requires a few things:
- You need a durable queueing system.
- Every publisher needs a priori knowledge of all participants (subscribers).
The concept is that you put a message in the queue tagged for pending delivery to all subscribers. You send the message to Spread and, as it is self-received, you mark those in the "current" (as of reception) membership as having received it.
You then need to build a secondary system that allows subscribers to be sent the "missing" messages they did not receive.
You remove messages from the queue that have no subscribers pending.
There are a lot of moving parts and a good bit of engineering has to go into the solution to make sure it is robust.
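To make the bookkeeping concrete, here is a rough sketch of the pending-delivery tracking described above. It is in Python and keeps everything in memory, so it is not durable as written; a real version would persist this state (for example in an SQL table) and would be wired to actual Spread membership and message events. The class `PendingQueue` and all of its method names are hypothetical, not part of Spread or any library.

```python
from dataclasses import dataclass, field


@dataclass
class PendingQueue:
    """Tracks, per message, which known subscribers are still owed it."""
    subscribers: frozenset                         # a priori participant list
    pending: dict = field(default_factory=dict)    # msg_id -> set still owed
    payloads: dict = field(default_factory=dict)   # msg_id -> message body

    def enqueue(self, msg_id, payload):
        # Queue the message tagged as pending for every subscriber.
        self.payloads[msg_id] = payload
        self.pending[msg_id] = set(self.subscribers)

    def on_self_receive(self, msg_id, membership):
        # Our own message came back: everyone in the membership at that
        # moment is marked as having received it.
        self._mark(msg_id, membership)

    def missing_for(self, subscriber):
        # Messages a (re)joining subscriber still has to be sent.
        return [(m, self.payloads[m]) for m in sorted(self.pending)
                if subscriber in self.pending[m]]

    def on_resend_ack(self, msg_id, subscriber):
        # The secondary system confirmed delivery of a missing message.
        self._mark(msg_id, {subscriber})

    def _mark(self, msg_id, receivers):
        if msg_id not in self.pending:
            return
        self.pending[msg_id] -= set(receivers)
        # Remove messages that have no subscribers pending.
        if not self.pending[msg_id]:
            del self.pending[msg_id]
            del self.payloads[msg_id]
```

A publisher would call `enqueue` before sending, `on_self_receive` when its own message is delivered back with the membership at that time, and `missing_for` when a subscriber rejoins. Ordering across publishers, crash recovery of the queue itself, and acknowledgement of resends are exactly the moving parts this sketch leaves out.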
ActiveMQ should provide all of these features -- and it supports PHP. I'd recommend evaluating that.