8 September 2008

Understanding your environment

A practical demonstration of why understanding your environment is vital occurred a few evenings ago when some NetApp filer/Domino work went wrong. A little bit of background first: the Domino data is stored on a NetApp filer and shared over NFS. The export is mounted by the Domino server and it all (most of the time) works.

For some reason this particular server running Domino (let's call him Bob) was showing high I/O stats, although the server itself was responding fine. The filer (Nutkins) wasn't reporting any problems, but it was deemed that Nutkins had to be at fault. There are a lot of connections to Nutkins after all, and in fairness the mount point lives in an aggregate that is unbalanced in terms of I/O profile, so the decision was made to create a new aggregate dedicated to Bob. Simple enough to do. For those not filer aware, an aggregate is a collection of physical disks. Giving Bob his own aggregate dedicated 8 spindles to the Domino data. More than enough to remove any I/O bottleneck.

Now, Nutkins itself has a very cool piece of technology called snapmirror. A snapmirror was duly set up and Nutkins began copying the data to its new home.

So, the big evening arrives. The paperwork is signed (in blood, naturally). The changes authorised, the servers poised....... A hush descends and the commands to stop Domino are typed into Bob......... and Domino promptly hangs.

Red flag 1 - when a manager says "oh, it always does that. Just issue kill -9 and everything will be fine, well, except that a few databases might be corrupt", it's probably time to start worrying. However, the final snapmirror update is initiated and the last 140 MB of changes are copied (in 22 seconds no less, not even enough time to get a cup of tea). The snapmirror is then quiesced and broken, which makes the snapmirror destination writable. Over to the Unix admin, and a few key clicks later the export is mounted and Bob is started.........
Or not. Seems that a small fact was missed. Bob not only has data stored on Nutkins but also has a local directory for crash dump logs.
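
A pre-flight check run before the change window would have flagged this. Here is a minimal sketch in Python, assuming someone maintains a list of the directories Bob needs (which is rather the point of the story). The function name is mine and any paths you feed it are your own; nothing here is Bob's real layout.

```python
import os


def missing_directories(required_dirs):
    """Return the entries of required_dirs that are not existing directories."""
    # os.path.isdir is False for missing paths and for plain files alike,
    # so a file sitting where a directory should be is also reported.
    return [path for path in required_dirs if not os.path.isdir(path)]
```

Called with something like `missing_directories(["/mnt/domino/data", "/var/domino/crashlogs"])` (hypothetical paths), an empty list means everything is in place; anything else is a reason not to start the server yet.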

Red flag 2 - when Bob's admin doesn't know the configuration of Bob's setup, it is probably time to start panicking. Anyway, a tappety-tap of the keyboard and the directory is created. Oh, let's stop and start Bob, hoping red flag 1 doesn't pop up. Mr. Unix issues the command and on the screen: "server shutdown. Bob_stop not found". OK, so did it shut down or not? A quick ps -ef | grep lotus and nope, nothing running. Red flag 1 avoided! So, start Bob and..... Nothing. Not happy. Hmmm. Time to fail back, since something is either not understood or not working. So Mr. Unix does his stuff and...... No Bob. Seems red flag 1 corrupted the data, then the final snapmirror copied the corrupt data. Also seems that the shutdown script has at least one bug in it which causes a loop to fail when the script is executed.
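
Since the shutdown script's message could not be trusted, the process table was the real source of truth. Below is a rough Python sketch of the same ps -ef | grep lotus check, so that a stop script could verify its own result. The function names are my own invention, and matching on the string "lotus" simply mirrors the command above; adjust it for whatever your processes are actually called.

```python
import subprocess


def matching_lines(ps_output, pattern):
    """Return the lines of ps output that contain pattern."""
    return [line for line in ps_output.splitlines() if pattern in line]


def processes_matching(pattern):
    """Run `ps -ef` and return the output lines that mention pattern."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return matching_lines(result.stdout, pattern)
```

If `processes_matching("lotus")` comes back empty, nothing matching is left running; if not, the stop has not finished, whatever the script printed.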

Anyway, to cut a long story short, we backed out and made the change a few days later. There are several lessons learnt here, mostly revolving around documentation, standardisation and knowing your environment. I'll leave it as an exercise for the reader to work out the rest!
