Find out what it has to tell you
You're working along just fine, and suddenly, your screen display changes from your nice
user interface to something that looks like Screen 1. You know what it is: a Windows NT kernel STOP
error, or the blue screen of death. So, what can you do? Often, the problem goes away when you
reboot the system. But what if it doesn't? What does that screen mean? Is it safe to continue using
the system? Let's look at what a kernel STOP error means, what can cause it, and most important,
what information you can get from the blue screen.
What Happens at the Kernel Level?
First, let's review the basics of the NT architecture. The NT operating system has two layers:
user mode and kernel mode. User mode is where the various subsystems--such as the Win32, POSIX, or
OS/2 subsystem--reside. Components in this mode provide the environments in which all user
applications run. For instance, Win32 programs run on the Win32 subsystem.
As you see in Figure 1, the kernel mode sits between the user mode and the physical layer (the
hardware) and prevents the user mode from directly accessing the hardware. The kernel mode also is
the home for the various NT executive services, such as the Object Manager, Security Reference
Monitor, and Process Manager. Just above the physical device hardware lies the hardware abstraction
layer (HAL), and above that is the NT microkernel. The HAL is the platform-specific portion of the
kernel-mode code; it hides hardware differences from the rest of the OS. The microkernel is the heart of the OS that
takes care of all the NT internal OS operations.
An important component of executive services is the I/O Manager. Besides taking care of all
input and output for the operating system, the I/O Manager manages communications between drivers
and supports all file system drivers and hardware device drivers.
NT is a modular operating system, which means you can add DLLs or device drivers to extend
the system's capabilities. You can, for instance, add fault tolerance to NT by adding device
drivers. When a peripheral manufacturer develops a driver for NT, the driver is most likely a kernel
mode driver: It resides in the kernel mode area and probably interfaces with Microsoft kernel
drivers. You can think of kernel drivers as the NT counterpart to Windows 3.1 or Windows 95 virtual device
drivers (VxDs). Kernel drivers are the low-level mechanisms for talking to the hardware. So when the
driver does something it's not supposed to, the error occurs at the lowest level and directly
affects the overall system and causes a kernel STOP error.
If an application operating in user mode does something to cause an error, NT halts the process
and generates an Illegal Operation error. Because every Win32 application has its own virtual
protected space, this error condition doesn't affect any other Win32 programs running. If the
application tries to directly access the hardware without going through the correct methods, NT
notices this and generates an exception error. A nice thing about NT is that it has good protection
systems for erratic applications.
When an application faults, you can close the offending program and resume work. Kernel error
conditions, however, typically are not recoverable; you have to reboot the system. You can think of
the kernel STOP as a built-in error-trapping mechanism. A kernel STOP error is NT's way of halting
further activity before the activity severely damages your system or corrupts data.
What Does This Weird Screen Mean?
OK, so what does this screen tell me? The kernel STOP may mean that a kernel driver--either a
system device driver or a third party driver--has illegally accessed the privileged kernel area. Or
the kernel STOP may mean that you have mixed SIMMs or added a bad network controller or SCSI
controller. In these cases, you can fix the problem by removing the offending hardware device. If
you have not added any new hardware, you need to get more information from the blue screen. Let's
look at each portion of Screen 1. Fortunately, you don't need to understand everything on the
screen.
At the top of the display is a hexadecimal value followed by four hex numbers in parentheses.
The first hex code is the kernel error code. With this error code, you can determine where the error
occurred, but not which driver caused the error. Table 1 lists the various error conditions. In our
example, the error condition is 0x0000000A, IRQL_NOT_LESS_OR_EQUAL. This code means that a process
attempted to access pageable memory at an interrupt request level (IRQL) that was too high.
Microsoft Windows NT Server Resource Kit and Microsoft Windows NT Workstation Resource Kit have
complete listings of STOP codes.
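If you copy the top line of a blue screen into a text file, a few lines of script can split it into the STOP code and its four parameters. The following is only a minimal sketch in Python: the sample line is hypothetical (Screen 1 is not reproduced here, so the values are made up), the function name `parse_stop_line` is invented for illustration, and the name table holds just the one code discussed above rather than the Resource Kit's complete list.

```python
import re

# Hypothetical sample; the values are illustrative, not copied from Screen 1.
SAMPLE = "*** STOP: 0x0000000A (0x00000000,0x0000001C,0x00000000,0x80116BF4)"

# Deliberately incomplete: only the code discussed in the text.
STOP_NAMES = {0x0000000A: "IRQL_NOT_LESS_OR_EQUAL"}

# One hex code after "STOP:", then everything inside the parentheses.
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (stop_code, [four parameters]) as integers."""
    match = STOP_RE.search(line)
    if match is None:
        raise ValueError("no STOP code found in line")
    code = int(match.group(1), 16)
    # Parameters may be separated by commas, spaces, or both.
    fields = match.group(2).replace(",", " ").split()
    params = [int(field, 16) for field in fields]
    if len(params) != 4:
        raise ValueError("expected four parameters, found %d" % len(params))
    return code, params


code, params = parse_stop_line(SAMPLE)
name = STOP_NAMES.get(code, "unknown STOP code")
print("0x%08X %s" % (code, name))
print(", ".join("0x%08X" % value for value in params))
```

Keeping the values as integers instead of strings makes it easy to look the code up in a table or compare parameters across several crashes.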
The values in the parentheses give more specific information about what the driver was doing
when the error happened. The first value (00000000) is the address that the driver referenced
improperly. The second value is the IRQL in effect when the driver made the reference. The third value
specifies whether the driver was doing a read (0) or a write (1). The fourth value is the address of the
instruction that attempted the access. By looking at the STOP code and the third and fourth parameters,
you can possibly determine what caused the error condition.
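To make the four values easier to read, you can label them. The sketch below assumes the commonly documented layout for STOP 0x0000000A (address referenced, IRQL at the time of the reference, a flag where 0 means read and 1 means write, and the address of the instruction that made the reference). The type and function names are invented for illustration, and the sample numbers are hypothetical.

```python
from typing import NamedTuple


class IrqlFault(NamedTuple):
    address: int      # memory address that was referenced
    irql: int         # IRQL at the time of the reference (assumed layout)
    is_write: bool    # False = read, True = write
    instruction: int  # address of the instruction that made the reference


def decode_irql_not_less_or_equal(params):
    """Label the four parameters of a STOP 0x0000000A."""
    if len(params) != 4:
        raise ValueError("STOP 0x0000000A carries exactly four parameters")
    address, irql, access, instruction = params
    # Only the low bit is treated as the read/write flag in this sketch.
    return IrqlFault(address, irql, bool(access & 1), instruction)


def describe(fault):
    kind = "write" if fault.is_write else "read"
    return "instruction at 0x%08X tried a %s of address 0x%08X at IRQL 0x%X" % (
        fault.instruction, kind, fault.address, fault.irql)


# Hypothetical values, not taken from Screen 1.
fault = decode_irql_not_less_or_equal([0x00000000, 0x1C, 0x0, 0x80116BF4])
print(describe(fault))
```

The instruction address is usually the most useful field: comparing it with the driver base addresses listed lower on the blue screen can suggest which driver was executing when the error occurred.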
The only problem is that much of what they’ll learn from this article is wrong. The article is absolutely rife with technical errors. For example, the second sentence in the section headed “What Does This Weird Screen Mean?” reads, “The kernel STOP may mean that a kernel driver ... has illegally accessed the privileged kernel area.” This statement is very close to meaningless, and any meaning I can attach to it is wrong. Kernel drivers are privileged (i.e., they run in kernel mode) and have full access to the kernel area.
Another example? Let’s take Table 1: Kernel Mode Error Conditions. The first entry for IRQL_NOT_LESS_OR_EQUAL tells the reader, “A process attempted to access pageable memory at a process internal request level (IRQL) that was too high.” That statement is one reason for getting this error. But it’s not the reason.
The text continues, “A process can access only objects that have priorities (IRQL) equal to or lower than its own.” This statement is nonsense. Objects don’t have IRQLs. IRQLs are not typically associated with processes (with one exception). The IRQL indicates the CPU state at any point in time, relative to that CPU’s interruptability, preemptability, and dispatchability. That is, the IRQL identifies which devices can interrupt the CPU, whether at the end of a quantum the current thread will be rescheduled, and whether any scheduling operations at all are allowable.
Yikes! I found plenty of other technical problems in the article, too.
I’ve spent lots of time reading blue screens and debugging drivers, so I recognized the problems with this article. How many of your readers can say the same?
--Peter G. Viscarola
Thanks, Peter, for pointing out these problems. We’ll be very careful not to let such errors slip by in the future, and we apologize for any inconvenience these errors may have caused anyone.
--Karen Forster
Peter G. Viscarola August 13, 1999