This article provides strategies for dealing with failures in a directory service.
The chapter broadly defines failure as anything that prevents Directory Server from providing the minimum required level of service, and divides failures into two classes: system unavailable and system unreliable (a classification I did not find particularly enlightening). The chapter then presents two backup strategies: binary (a full binary copy of the database) and LDIF (a logical backup based on differences from the previous version). A binary backup is faster to take and restore than an LDIF backup; LDIF offers finer control over granularity, but in situations where rapid restoration is required it may take too long to be viable.
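The difference-based LDIF idea can be sketched roughly as follows. This is my own illustration, not code from the chapter: it assumes a deliberately naive LDIF parser (entries separated by blank lines, keyed on their `dn:` line) and emits only the entries that changed since the previous export, which is the essence of a logical incremental backup:

```python
def parse_ldif(text):
    """Split an LDIF export into {dn: entry_text}.
    Naive: entries are blank-line separated, and the first
    line of each entry is assumed to be its dn."""
    entries = {}
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if lines and lines[0].startswith("dn:"):
            entries[lines[0]] = block.strip()
    return entries

def ldif_diff(previous, current):
    """Return only entries added or modified since the previous
    export -- the 'backup based on differences' idea."""
    old = parse_ldif(previous)
    new = parse_ldif(current)
    return [entry for dn, entry in new.items() if old.get(dn) != entry]

# Hypothetical exports: between the two, only Alice's entry changes.
full_export = (
    "dn: uid=alice,dc=example\nuid: alice\n\n"
    "dn: uid=bob,dc=example\nuid: bob\n"
)
next_export = (
    "dn: uid=alice,dc=example\nuid: alice\nmail: a@example.com\n\n"
    "dn: uid=bob,dc=example\nuid: bob\n"
)
delta = ldif_diff(full_export, next_export)
# the incremental backup holds just the one changed entry
```

A real restore would still need a base export to apply the deltas against, which is exactly the restoration-time cost the chapter flags for the LDIF approach.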
The paper then describes replication topologies ranging from a single data center up to five data centers. The high-level idea is to place one or two master servers per data center, linked by an interconnected topology with redundant links and optional backup links. The architecture appears to cap the number of master servers at four, even in the five-data-center topology, but does not explain why.
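To make the redundant-link idea concrete, here is a small sketch of my own (not from the paper) that models data centers as nodes in an undirected graph and checks whether replication traffic can still reach every site after any single link fails:

```python
def reachable(links, start):
    """Sites reachable from `start` over undirected replication links."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for a, b in links:
            if node in (a, b):
                other = b if node == a else a
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return seen

def survives_single_link_failure(sites, links):
    """True if every site still reaches every other site
    after any one link is removed."""
    return all(
        reachable([l for l in links if l != broken], sites[0]) == set(sites)
        for broken in links
    )

# A ring of four data centers is redundant: two paths between any pair.
sites = ["DC1", "DC2", "DC3", "DC4"]
ring = [("DC1", "DC2"), ("DC2", "DC3"), ("DC3", "DC4"), ("DC4", "DC1")]
# A simple chain is not: losing one link partitions the topology.
chain = ring[:-1]
```

This is the property the redundant and backup links are buying in the paper's topologies, stated as a connectivity check.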
This article gives a good perspective on current industry thinking about replication and backup. It would be interesting to see how well such a model scales. The consistency mechanism appears to be manually configured based on the interconnections and the placement of the master servers; as a result, it is unlikely to scale well to the multi-thousand-server clusters that are becoming prevalent.