You can increase NT's reliability
Microsoft Cluster Server (MSCS) can deliver excellent Windows NT system availability for a reasonable cost. From my experience deploying several MSCS clusters that are achieving near-perfect reliability (i.e., running with only about 0.001 percent of unplanned downtime), I know that you can use MSCS to make NT fault tolerant. However, managing a dependable MSCS cluster (i.e., a cluster that offers services at all required times) involves more specialized preparation, implementation, and maintenance than managing two standalone servers.
MSCS is a component of NT Server, Enterprise Edition (NTS/E), and Microsoft also includes MSCS in Windows 2000 Advanced Server and Win2K Datacenter Server. Win2K MSCS is nearly identical to the NTS/E version, except that the Win2K version can support more than two servers in a cluster (e.g., Datacenter will support a 4-node cluster). Most of the information in this article applies equally to Win2K and NT clusters, although Win2K clustering might raise concerns that I don't address here. If you're considering an MSCS cluster implementation in the near future, I recommend that you use NTS/E.
Preparation
MSCS clusters can provide high availability, although not to the degree that very specialized (and very expensive) high-end non-NT systems offer. After all, you're still dealing with NT's security vulnerabilities, hotfix reboots, and volume hardware. Therefore, you must evaluate your requirements and environment before you move ahead with MSCS, then decide whether the product will offer sufficient functionality to meet your needs.
Evaluate your business needs. If your business demands a system that is available 24 x 7, or if your business or profile makes you a target for attack, an NT-based solution isn't for you. If you can occasionally take down your system for brief maintenance, and if your business can tolerate a few minutes of unplanned outage each year, an MSCS cluster might be appropriate.
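To put "a few minutes of unplanned outage each year" in perspective, you can convert the 0.001 percent unplanned-downtime figure mentioned at the start of this article into minutes per year. The following Python sketch does the arithmetic; it assumes a 365-day year, and the function name is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(downtime_percent: float) -> float:
    """Convert a downtime percentage into minutes of outage per year."""
    return MINUTES_PER_YEAR * downtime_percent / 100.0


if __name__ == "__main__":
    minutes = downtime_minutes_per_year(0.001)
    print(f"{minutes:.2f} minutes per year")  # prints "5.26 minutes per year"
```

In other words, 0.001 percent unplanned downtime works out to a little more than 5 minutes a year, which is the order of outage your business needs to be able to tolerate.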
Non-business-critical systems can also be good candidates for clustering. Each server in an MSCS cluster can run different applications and provide failover for the other servers. The security of having a managed hot standby for each server can justify the incremental cost of configuring two otherwise standalone servers as a cluster.
Evaluate your applications. Some applications are less suitable for a clustered environment than others. Suppose you install an application on both cluster servers. The application runs on one server and is dormant on the second server. If the active server fails, the dormant version on the second server will ideally start automatically. But some applications (e.g., databases that don't implement automatic integrity recovery) can't recover automatically and so aren't optimal choices for a clustered environment.
If you decide to install on a cluster an application that can't recover automatically, you'll need to intervene to recover the application if it fails. Document the required steps, and thoroughly test the recovery process. Alternatively, if you can program the process as a script, you can configure MSCS to execute the script before attempting to bring the application online. Again, comprehensive testing is essential. If the script fails, you'll need a tested manual process to bring up the application.
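As a rough illustration of what such a recovery script might look like, the following Python sketch runs a list of recovery steps in order and stops at the first failure. Everything in it is an assumption for illustration: the step functions are hypothetical placeholders for your application's real checks, and the convention of returning a nonzero exit code on failure is simply one way to signal that you need to fall back on the manual process. How you attach the script to a cluster resource depends on your configuration and isn't shown here.

```python
import sys
from typing import Callable, List, Tuple

# Each step is a (description, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_recovery(steps: List[Step]) -> int:
    """Run recovery steps in order and stop at the first failure.

    Returns 0 if every step succeeded and 1 otherwise, so the caller can
    decide whether to bring the application online or recover manually.
    """
    for description, action in steps:
        try:
            ok = action()
        except Exception as exc:  # treat any error as a failed step
            print(f"FAILED: {description} ({exc})")
            return 1
        if not ok:
            print(f"FAILED: {description}")
            return 1
        print(f"ok: {description}")
    return 0


# Hypothetical placeholder steps; replace them with real application checks.
def check_data_files() -> bool:
    return True


def replay_transaction_log() -> bool:
    return True


if __name__ == "__main__":
    recovery_steps = [
        ("verify data files on shared disk", check_data_files),
        ("replay transaction log", replay_transaction_log),
    ]
    sys.exit(run_recovery(recovery_steps))
```

Stopping at the first failed step keeps the script from bringing a damaged application online, and the printed log tells you where the manual process needs to pick up.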
Some proprietary applications write information to the Registry as they execute. In a clustered environment, the MSCS service replicates this information in the failover server's Registry. However, a systems administrator must exactly describe those Registry keys in the application's cluster resource definition. This task requires detailed application knowledge or documentation, so I recommend against clustering applications that write to the Registry while they're running.
The best candidates for clustering are applications that maintain configuration and state information on the shared-disk storage: Examples include file and printer shares, Microsoft IIS, Microsoft SQL Server, and Oracle databases. If you use Oracle databases, I strongly recommend that you also install Oracle's Fail Safe product, which creates an Oracle Database cluster-resource type and provides useful tools to integrate Oracle databases into an MSCS environment.
Evaluate your hardware. Although MSCS lets you cluster dissimilar servers, I recommend that you use identical servers in a cluster whenever possible. Doing so lets you identically configure and manage clustered servers, simplifying administration and increasing the likelihood of successful failover.
To leverage resources, you can run different applications on each server. However, be sure each server has sufficient capacity to run all applications if one server fails; otherwise, you'll need to accept an increase in response time and a reduction in user population while all the applications temporarily run on one server during failover.
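A simple way to reason about this capacity question is to add up the load of every application and compare the total with what one server can handle. The Python sketch below does that; the application names, the load figures, and the idea of expressing load as a percentage of one server's capacity are all hypothetical values chosen for illustration.

```python
from typing import Dict


def failover_headroom(capacity: float, loads: Dict[str, float]) -> float:
    """Return the spare capacity left on one server running every application.

    A negative result means the surviving server is overcommitted.
    """
    return capacity - sum(loads.values())


if __name__ == "__main__":
    # Hypothetical loads, as a percentage of one server's capacity.
    loads = {"file and print shares": 35.0, "database": 50.0}
    headroom = failover_headroom(100.0, loads)
    if headroom >= 0:
        print(f"One server can carry the full load with {headroom:.0f}% to spare")
    else:
        print(f"One server is overcommitted by {-headroom:.0f}% during failover")
```

If the result is negative, you're in the situation described above: plan for slower response times and fewer users until the failed server returns.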
The choices for cluster storage architecture are SCSI or fibre channel. SCSI is economical and established; fibre channel is expensive and relatively new but promises better performance and reliability than SCSI. Microsoft has also mentioned (e.g., in TechNet presentations) that fibre channel will be the primary focus for the company's future clustering solutions. (For a detailed comparison of SCSI and fibre channel, see Dean Porter, "Fibre Channel, SCSI, and You," September 1997.) I recommend that you use fibre channel if it's within your budget.
Implementing a cluster doesn't mean that you can neglect server and storage resilience measures. Several factors, such as hardware-component resilience, determine your system's overall availability; you need to make each system constituent as reliable as possible. Don't depend on cluster software to come to the rescue during server failures. Invest in relatively inexpensive redundancy features (e.g., power, fans, network cards) that most modern servers include, and protect your local server storage against disk failure with mirroring (i.e., use an internal RAID controller or NT mirroring).
Common shared-disk cluster storage creates a single point of failure: If the cluster storage becomes inaccessible, so does your system. Implement disk controllers as redundant pairs that work together. Provide redundant power and cooling for the storage unit. Protect disks, ideally by mirroring.
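To see why a single point of failure matters so much, consider the textbook availability arithmetic: components that must all work multiply their availabilities, whereas a redundant pair fails only when both halves fail. The Python sketch below applies those two formulas. It assumes component failures are independent, which real hardware doesn't guarantee, and the availability figures are hypothetical.

```python
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be working."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_pair(a: float) -> float:
    """Availability of two identical components when either one suffices."""
    return 1.0 - (1.0 - a) ** 2


if __name__ == "__main__":
    servers = redundant_pair(0.99)   # two clustered servers (hypothetical figure)
    single_storage = 0.99            # one unprotected storage unit
    paired_storage = redundant_pair(0.99)

    print(f"unprotected storage: {series_availability([servers, single_storage]):.4f}")
    print(f"redundant storage:   {series_availability([servers, paired_storage]):.4f}")
```

Under these assumptions, the unprotected storage unit drags the whole system down to roughly its own availability, no matter how well the servers fail over.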
Implementation
Remember, you aren't implementing just two servers and a storage unit; you're implementing a cluster. You need specific knowledge and skills to ensure successful performance. I recommend that you read as much authoritative documentation as possible. (For a list of useful documentation, see "MSCS Resources.") Don't rely on the NTS/E printed manual, which is out of date in several key areas. Research the subject thoroughly, not only to gain cluster-specific knowledge but also to find out how a cluster environment will affect your existing processes. For example, you might be using Rdisk or Regback as part of your overall security strategy. The cluster Registry hive, Clusdb, resides in the \winnt\cluster directory; neither Rdisk nor Regback will automatically copy this hive. Unless you use Regback to manually copy the hive, your Emergency Repair Disk (ERD) or manual repair directory will be incomplete.
Implementing MSCS. The implementation process comprises several stages. Bear in mind that your aim is to configure all cluster elements as perfectly as possible, leaving only circumstances beyond your control as threats to your system's availability. Complete and test each stage, and resolve any problems before you progress to the next stage; don't wait until after you complete the full installation process to resolve problems. I suggest that you progress through installation following these steps (refer to detailed documentation for step-by-step installation instructions).
- Install all the hardware (e.g., servers, controllers, disks).
- Install NTS/E on each server, and upgrade to Service Pack 3 (SP3). SP3 comes with NTS/E, so use this service pack during these initial steps. You can apply a higher service pack later in the process, if you want.
- For recovery purposes, build a second basic OS installation (i.e., an installation without software other than programs that you need to run your network card, tape drive, and cluster storage access) on each server. Try to put this emergency recovery installation on a different disk from the server's primary installation. Apply SP3 to this installation as well.
- Install any additional device drivers that you need to access cluster common shared-disk storage.
- Use an external access method (e.g., serial port) to configure the cluster storage controllers. Configure one device to be the cluster quorum disk. (For information about quorum disks, see Mark Russinovich, NT Internals, "Inside Microsoft Cluster Server," February 1998.)
- Install MSCS on one server; keep the second server at the OS selection menu. Reboot the first server, and confirm that MSCS Cluster Administrator connects to the cluster service and displays the cluster details.
- Install MSCS on the second server while the first server is fully booted. Reboot the second server, and confirm that Cluster Administrator now shows both servers in the cluster.
- Confirm that the Cluster Group and Quorum Disk Group can move successfully between servers during both manual initiation and server shutdown.
- If you want to use a service pack higher than SP3, apply it (and any hotfixes) now.
- If you want to configure more cluster storage devices, do so now. Follow the method that Microsoft's "MS Cluster Server Administrator's Guide" describes. You can find this guide on TechNet (http://www.microsoft.com/technet) or on the Microsoft Product Support Services (PSS) Web site (http://support.microsoft.com). Incorporate the additional devices into the cluster resource groups, and test the devices for successful failover. You now have a functional cluster on which to build applications.