In 30th IEEE Symposium on Reliable Distributed Systems (SRDS 2011), pages 101-110, Madrid, Spain, October 2011. IEEE.
Acceptance rate: 34%.
In this paper, we develop reliable distributed publish/subscribe algorithms that can tolerate the concurrent failure of up to δ broker machines or communication links. In our approach, δ is a configuration parameter that determines the level of fault tolerance of the system, and reliability refers to exactly-once, per-source in-order delivery of publications to clients with matching subscriptions. We propose protocols to address three problems in the presence of broker or link failures: (i) subscription propagation; (ii) publication forwarding; and (iii) broker recovery. Finally, we study the effectiveness of our approach when the number of concurrent failures exceeds δ. Through large-scale experimental evaluations with up to 500 brokers, we demonstrate that a system configured with a modest value of δ = 3 is able to reliably deliver 97% of publications in the presence of failures of up to 17% of its brokers.