Keeping mission-critical applications available for employees, business partners, and customers is a key goal for any IT department. Many IT administrators turn to clustering (physically and programmatically linking two or more systems to ensure high availability and to balance workloads) to achieve this goal. Clustering lets application processing continue on another server if the primary system fails or you need to shut down the primary system for maintenance. Clustering can also improve performance by evenly distributing workloads among several servers. No matter how many systems make up the cluster, it appears as one server to end users.
If your primary goal is to improve the availability of your back-end applications, you can use Microsoft Cluster service, which is included in Windows 2000 Advanced Server (Win2K AS) and Win2K Datacenter Server, or choose from a variety of competing products. (Cluster service is known as Microsoft Cluster Server [MSCS] in Windows NT Server, Enterprise Edition [NTS/E].) All these products move processing of back-end applications, such as enterprise resource planning (ERP), database management, and messaging applications, from the primary cluster server (called a node) to other cluster nodes when the primary node fails. This process is known as failover. After you repair the failed server, the clustering software shifts resources and processing back to the original node, a process known as failback. Making the best clustering-product choice depends on multiple factors, including the risks to your computing environment, your current disaster-recovery plan, the geographic separation of your servers, the location of clients that use the clustered applications, the number of applications and servers to be clustered, and of course, your budget.
Clustering Basics
When deciding on a high-availability strategy, you must first understand that clustering doesn't deliver true fault tolerance. No matter whose clustering product you choose, failure of a cluster node or application results in 5 seconds to 30 seconds of application downtime, depending on the number of transactions written to the transaction log since it was last saved. In addition, depending on the design of the client application, users might have to reconnect to the clustered application when it resumes on the new node. For some environments, these inconveniences are inconsequential. Other environments might need a fault-tolerant solution that can deliver higher levels of availability than clustering products provide. For more information about two such solutions, see Ed Roth, Lab Feature, "Stratus ftServer 3210," July 2002, InstantDoc ID 25335, and John Green, Lab Feature, "Endurance 6200 3.0," July 2001, InstantDoc ID 21140.
A common clustering implementation is to have all clustered nodes run the cluster-protected applications so that servers don't sit idle. This arrangement is referred to as an active/active configuration. Should one node fail or be shut down, the failover process copies the necessary resources to the active target node. Depending on the target node's configuration and utilization, the added workload might degrade performance of existing and failed-over applications. The alternative is to create an active/passive configuration, in which one or more servers sit idle or run nonclustered applications until the primary server fails.
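To make the trade-off between the two configurations concrete, here is a minimal Python sketch of what happens to the failover target's workload in each case. It is a toy model, not part of any clustering product: the Node class, the fail_over function, the capacity units, and the application loads are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of a cluster node; capacity and loads are hypothetical units."""
    name: str
    capacity: float
    apps: dict = field(default_factory=dict)  # application name -> load
    online: bool = True

    def utilization(self) -> float:
        """Fraction of this node's capacity consumed by its applications."""
        return sum(self.apps.values()) / self.capacity


def fail_over(failed: Node, target: Node) -> None:
    """Take the failed node offline and restart its applications on the target."""
    failed.online = False
    target.apps.update(failed.apps)
    failed.apps = {}


# Active/active: both nodes already carry a workload (assumed numbers).
a1 = Node("A1", capacity=100, apps={"erp": 60})
a2 = Node("A2", capacity=100, apps={"database": 55})
fail_over(a1, a2)
print(f"active/active target utilization: {a2.utilization():.0%}")

# Active/passive: the target sits idle until the primary fails.
p1 = Node("P1", capacity=100, apps={"erp": 60})
p2 = Node("P2", capacity=100)
fail_over(p1, p2)
print(f"active/passive target utilization: {p2.utilization():.0%}")
```

With these assumed loads, the active/active target ends up over its nominal capacity after failover, which is the performance degradation described above, whereas the passive target absorbs the workload with room to spare.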
Cluster Service
Win2K AS supports two nodes running Cluster service (as does NTS/E running MSCS); Win2K Datacenter supports as many as four nodes running Cluster service. Win2K AS's two-node limitation might be a problem if you want to cluster two Microsoft Exchange Server systems that would support several thousand employees. Microsoft recommends that you support no more than 1900 Exchange users per cluster node. If you followed this guideline, you'd need to create two active/passive clusters, with an idle node for each Exchange server. Alternatively, Win2K Datacenter's four-node limit would let you create a three-node cluster with one idle system serving as a failover node for both Exchange servers. (To anticipate the failure of both primary Exchange nodes, you'd need to add a fourth passive node to the cluster, of course.) But because Microsoft sells Win2K Datacenter only with server hardware, this would be an expensive clustering solution. Clustering multiple Exchange servers with Cluster service will be easier and less expensive when Windows .NET Enterprise Server (Win.NET Enterprise Server, the follow-on to Win2K AS) is available because it will support eight-node clusters and won't be packaged with server hardware. Obviously, the failover targets need to have sufficient processor, memory, and storage resources to support the additional workload of the failed servers.
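The node-count arithmetic in this scenario can be worked through with a short Python sketch. The 1900-users-per-node guideline and the two-node and four-node limits come from the discussion above; the 3800-user figure, the function names, and the assumption that each cluster reserves a fixed number of passive nodes are illustrative only.

```python
import math

MAX_USERS_PER_NODE = 1900  # Exchange guideline cited in the article


def plan_clusters(users: int, max_nodes_per_cluster: int,
                  passive_per_cluster: int = 1) -> dict:
    """Estimate how many clusters and servers a deployment would need.

    Assumes (for illustration) that every cluster reserves the same
    number of passive failover nodes.
    """
    active_per_cluster = max_nodes_per_cluster - passive_per_cluster
    if active_per_cluster < 1:
        raise ValueError("each cluster needs room for at least one active node")
    active = math.ceil(users / MAX_USERS_PER_NODE)
    clusters = math.ceil(active / active_per_cluster)
    passive = clusters * passive_per_cluster
    return {"active": active, "clusters": clusters,
            "passive": passive, "total_servers": active + passive}


USERS = 3800  # hypothetical head count requiring two active Exchange nodes

two_node_limit = plan_clusters(USERS, max_nodes_per_cluster=2)
four_node_limit = plan_clusters(USERS, max_nodes_per_cluster=4)
four_node_two_spares = plan_clusters(USERS, max_nodes_per_cluster=4,
                                     passive_per_cluster=2)

print("two-node limit:            ", two_node_limit)
print("four-node limit:           ", four_node_limit)
print("four-node limit, 2 spares: ", four_node_two_spares)
```

Under these assumptions, the two-node limit forces two separate active/passive clusters (four servers in all), whereas the four-node limit lets one idle node back up both Exchange servers in a single three-server cluster, or four servers if you want a spare for each primary.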
Cluster service is a good solution if you plan to run more than one clustered application per node. If Cluster service detects that one of the clustered applications has failed and can't be restarted, Cluster service can fail over just the affected application without disrupting the others if the applications are cluster-aware and are in different resource groups.
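The following Python sketch illustrates why the resource-group layout matters. It is a simplified model rather than the Cluster service API: the ResourceGroup class, the fail_over_group_of function, and the node and application names are hypothetical, and the model assumes only that a resource group moves between nodes as a unit.

```python
from dataclasses import dataclass


@dataclass
class ResourceGroup:
    """Toy model: a named set of applications hosted together on one node."""
    name: str
    apps: list
    owner: str  # name of the node currently hosting the group


def fail_over_group_of(app: str, groups: list, target: str) -> list:
    """Move every group that contains the failed application to the target node.

    Returns the names of the groups that moved.
    """
    moved = []
    for group in groups:
        if app in group.apps and group.owner != target:
            group.owner = target
            moved.append(group.name)
    return moved


# Applications in different resource groups: only the failed one moves.
separate = [ResourceGroup("GroupA", ["mail"], "NODE1"),
            ResourceGroup("GroupB", ["database"], "NODE1")]
moved_separate = fail_over_group_of("mail", separate, "NODE2")
print(moved_separate, [(g.name, g.owner) for g in separate])

# Applications sharing one resource group: both are disrupted together.
shared = [ResourceGroup("GroupAB", ["mail", "database"], "NODE1")]
moved_shared = fail_over_group_of("mail", shared, "NODE2")
print(moved_shared, [(g.name, g.owner) for g in shared])
```

In the first layout, the database keeps running undisturbed on its original node; in the second, it is dragged along with the failed mail application because the two share a group.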
Most Cluster service clusters share a common storage array that connects directly to each node or is part of a Storage Area Network (SAN). Shared-storage clusters are simpler to implement than clusters that replicate data between nodes, but the shared array becomes a single point of failure and limits the geographic separation of the cluster nodes. You can geographically separate Win2K AS or Win2K Datacenter's Cluster service nodes if each node connects to a SAN-connected storage array and each array's I/O controller synchronously replicates data and quorum disk information (i.e., cluster-configuration information stored on a special volume). Third-party products such as NSI Software's GeoCluster Advanced Server 4.1 can also perform data replication for Cluster service. Win.NET Enterprise Server and Win.NET Datacenter Server will let you separate nodes in two locations; the OSs will provide a quorum mechanism that synchronizes cluster configuration information among all nodes, but you'll still need a separate data-replication capability. All Cluster service nodes, storage arrays, and I/O controllers must be on the Win2K or Win.NET Cluster Hardware Compatibility List (HCL) if you want Microsoft to support them. Figure 1 shows Cluster service's management console.
Phil McCue, MCP October 30, 2002