
title: nothing stays the same: adventures in directory rescue
link: https://onemoretech.wordpress.com/2007/08/21/nothing-stays-the-same-adventures-in-directory-rescue/
author: lembobro
description:
post_id: 658
created: 2007/08/21 19:02:00
created_gmt: 2007/08/21 19:02:00
comment_status: open
post_name: nothing-stays-the-same-adventures-in-directory-rescue
status: publish
post_type: post

nothing stays the same: adventures in directory rescue

OK. So this all relates to the downing of my company’s primary master directory some weeks ago when the host server’s system disk failed. Apparently on its way out, the system caused some fatal corruption in the directory db that prevented it from starting up. Because that particular master was also the “hub” from which we replicate to our read-only replicas, losing the db also meant that we’d have to rebuild those replicas as well (a drawback of how Netscape engineered replication in its directory servers). Yup, a pretty bad situation.

When I first started with iPlanet Directory back in ‘00, I quickly learned the most efficient way to rebuild not only a master directory but also a replica in the event the underlying db got corrupted.

For the most part I’d kept using the procedures I developed back then, right up until now. To rebuild a replica, for example, I normally just did a re-initialization from the master over the wire.
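For reference, an over-the-wire re-init on this family of directory servers (iPlanet/Sun/Red Hat DS 5.x and later) mostly comes down to flipping an attribute on the replication agreement entry under cn=config. A minimal sketch, not the exact commands I ran; the host, suffix, agreement name and credentials are placeholders, and the ldapmodify flags shown are the OpenLDAP-style ones:

```
# Hypothetical example of kicking off an online (over-the-wire) consumer
# re-initialization by updating the replication agreement under cn=config.
# Host, suffix, agreement name and credentials are all placeholders.
ldapmodify -h master1.example.com -p 389 -D "cn=Directory Manager" -w "$DM_PW" <<EOF
dn: cn=to-replica1,cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
changetype: modify
replace: nsds5BeginReplicaRefresh
nsds5BeginReplicaRefresh: start
EOF
```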

This time, instead of doing an online rebuild, I let the replication agreement wizard create an init file (an LDIF with all the pertinent info), shut down the replica and did an offline init using ldif2db. This solved a couple of problems, the most important being that there was a LocalDirector in front of my replicas that would round-robin traffic to a directory as long as it was listening (LD isn’t very smart: it can’t tell whether the directory is actually responding, only that it’s listening). If I had kept the directory online, which you must do for an over-the-wire init, apps and users would have hit a wall every time they got shifted over to the initializing directory.
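The offline init itself is just the instance’s stop script, an ldif2db import, and a restart. A rough sketch, assuming the classic instance-directory scripts and a userRoot backend; paths and file names are placeholders:

```
# Hypothetical offline init of a replica from the wizard-generated init LDIF.
# Instance path, backend name and LDIF location are placeholders.
cd /opt/netscape/servers/slapd-replica1            # instance directory
./stop-slapd                                       # take the replica offline first
./ldif2db -n userRoot -i /tmp/replica1-init.ldif   # import rebuilds the db and its indexes
./start-slapd                                      # bring it back once the import finishes
```

A nice side effect: while the instance is stopped, nothing is listening on the LDAP port, so the LocalDirector quietly drops it from the round-robin until it comes back.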

It also turns out that with the number of entries now in my directory (50,000 plus), the rebuild went faster using init files. This was especially true for my heavily indexed primary master, which I rebuilt from the secondary master.
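The init file for that rebuild came out of the surviving master the same way, via the agreement wizard’s export. As I understand it, the command-line counterpart is a db2ldif run with the replication flag; a sketch with placeholder names:

```
# Hypothetical export of an init LDIF from the surviving (secondary) master.
# The -r option includes the replication metadata the rebuilt master needs;
# instance path, backend and output file are placeholders.
cd /opt/netscape/servers/slapd-master2
./db2ldif -n userRoot -r -a /tmp/master1-init.ldif
```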

In all it took about an hour to get most everything back online, and another hour to get the whole environment running at peak performance again. Because we use clustering in our SSO solution, most apps and their users were unaware of the outage. We also bought ourselves some breathing space by reassigning a CNAME that had belonged to the dead master to its partner, so that all traffic hardcoded for that CNAME went to the functioning directory.
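The DNS piece was nothing fancy; repointing the CNAME can be a hand edit of the zone or, if the zone accepts dynamic updates, something along the lines of this sketch (all names here are placeholders):

```
# Hypothetical dynamic update repointing the dead master's CNAME at its partner.
nsupdate <<'EOF'
server ns1.example.com
update delete ldap-master.example.com. CNAME
update add ldap-master.example.com. 300 CNAME master2.example.com.
send
EOF
```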

Copyright 2004-2019 Phil Lembo