Middleware Systems Research Group, University of Toronto, Canada

Fault Resiliency

A broker or link failure may stall publication flows and leave parts of the network near the failed region in an inconsistent state, such as an incorrect view of the topology or invalid entries in the routing tables. In such cases, the PADRES system restores a correct operational state. Once a broker or link failure is detected, the recovery procedure it triggers performs the following actions:

  • Maintains the integrity of the broker network
  • Updates the advertisement routing tables
  • Updates the subscription routing tables
  • Updates the publication routing tables
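
The actions above can be pictured with a minimal sketch. It assumes each broker keeps three routing tables that map a message identifier to a neighboring broker. The class, field, and method names are hypothetical and are not taken from the PADRES code base; actual recovery would also need to reconnect the overlay, which is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical broker state; names are illustrative, not PADRES APIs."""
    broker_id: str
    neighbors: set = field(default_factory=set)
    # Assumption: each table maps a message ID to the neighbor it is routed via.
    advertisements: dict = field(default_factory=dict)
    subscriptions: dict = field(default_factory=dict)
    publications: dict = field(default_factory=dict)

    def on_neighbor_failure(self, failed_id):
        """Forget the failed neighbor and purge routing entries that use it.

        Returns the purged message IDs per table so the caller can
        re-establish them over another path.
        """
        self.neighbors.discard(failed_id)
        removed = {}
        for name in ("advertisements", "subscriptions", "publications"):
            table = getattr(self, name)
            stale = sorted(k for k, hop in table.items() if hop == failed_id)
            for key in stale:
                del table[key]
            removed[name] = stale
        return removed
```

Returning the purged identifiers is a design choice of this sketch: it lets the caller decide how to rebuild the affected routes instead of silently discarding them.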

Cycles in the overlay are also exploited to speed up the recovery of publication flows. Cyclic networks contain redundant paths, which can improve the network's resiliency because a failure does not necessarily partition the network. The recovery algorithm uses local information gathered as part of normal system operation, and seamlessly routes publications around failures. The cycle detection component is also used by the PADRES load balancer to distribute load among brokers.
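
To illustrate why redundant paths help, the following sketch searches a small overlay for a path that avoids failed brokers. It is a simplification under stated assumptions: it uses a breadth-first search over a global view of the overlay, whereas the recovery algorithm described above relies on local information. The function name `find_path` and the example topology are hypothetical.

```python
from collections import deque


def find_path(overlay, source, target, failed=frozenset()):
    """Return a shortest broker path from source to target that avoids
    every broker in `failed`, or None if no such path exists.

    `overlay` maps each broker ID to the set of its neighbors.
    """
    if source in failed or target in failed:
        return None
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        # Sorted iteration keeps the result deterministic.
        for nxt in sorted(overlay.get(node, ())):
            if nxt not in failed and nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# A four-broker ring (cyclic) and a three-broker chain (acyclic).
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
```

In the ring, `find_path(ring, "A", "C", failed={"B"})` still finds a route through `D`. In the chain, the same failure leaves no path, so the call returns `None` and the network is partitioned.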

Fault Resilience

Reliable publication routing ensures that once a publisher/subscriber routing path is constructed, no publications are lost. The reliable routing algorithm uses the services provided by the regular content-based routing protocols and the failure recovery algorithms to maintain an operational routing path between publishers and subscribers. It tolerates message loss (due to unreliable links or faulty brokers) to provide publication delivery guarantees.
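
The text does not say how message loss is tolerated. One generic way to mask loss on a single hop is acknowledgement with retransmission, sketched below. This is an assumed technique chosen for illustration, not the PADRES reliable routing algorithm, and `LossyLink` and `send_reliably` are hypothetical names.

```python
class LossyLink:
    """Hypothetical link that silently drops its first `drops` transmissions."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []   # (sequence number, payload) pairs, in arrival order
        self._seen = set()    # sequence numbers already delivered

    def send(self, seq, payload):
        """Return True if the receiver acknowledged the message."""
        if self.drops > 0:
            self.drops -= 1
            return False      # message lost, so no acknowledgement arrives
        if seq not in self._seen:
            # Duplicates are acknowledged but delivered only once.
            self._seen.add(seq)
            self.delivered.append((seq, payload))
        return True


def send_reliably(link, seq, payload, max_attempts=5):
    """Retransmit until acknowledged; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if link.send(seq, payload):
            return attempt
    raise RuntimeError(f"publication {seq} not acknowledged after {max_attempts} attempts")
```

Sequence numbers let the receiver discard duplicates, so a retransmitted publication is delivered only once. The attempt limit is a design choice of the sketch: it surfaces a persistent failure to the caller, at which point a recovery procedure would have to take over.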