Collecting Additional Data Points
Continuing where I left off in Part 1, I had identified that the issue was likely either faulty hardware or a bad driver – which pretty much describes the cause of every bugcheck. I was hoping for something a bit more precise. I did have the one mini-dump so I decided to tear into it a little bit more until I could get my hands on a kernel dump.
Deep Dive: Level 2
The first thing I did after loading the crash dump into WinDbg was see what !analyze -v had to say about it:
2: kd> !analyze -v ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* SYSTEM_SERVICE_EXCEPTION (3b) An exception happened while executing a system service routine. Arguments: Arg1: 00000000c0000005, Exception code that caused the bugcheck Arg2: fffff800031842b1, Address of the instruction which caused the bugcheck Arg3: fffff88007a64ef0, Address of the context record for the exception that caused the bugcheck Arg4: 0000000000000000, zero. Debugging Details: ------------------ EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s. FAULTING_IP: nt!ObpLookupObjectName+461 fffff800`031842b1 f6 ??? CONTEXT: fffff88007a64ef0 -- (.cxr 0xfffff88007a64ef0) rax=fffffa80042bccc8 rbx=fffff80002e0f000 rcx=fffff8a00c73af12 rdx=000000000000006c rsi=0000000000000000 rdi=0000000000000001 rip=fffff800031842b1 rsp=fffff88007a658d0 rbp=fffffa80042bcb10 r8=fffffa8003c6cf30 r9=fffffa8005fec268 r10=fffff8a000004ca0 r11=fffffa80040f6890 r12=0000000000000000 r13=fffff8a000004ca0 r14=0000000000000040 r15=fffff88007a65a40 iopl=0 nv up ei pl nz na po nc cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010206 nt!ObpLookupObjectName+0x461: fffff800`031842b1 f6 ??? Resetting default scope CUSTOMER_CRASH_COUNT: 1 DEFAULT_BUCKET_ID: CODE_CORRUPTION BUGCHECK_STR: 0x3B PROCESS_NAME: svchost.exe CURRENT_IRQL: 0 LAST_CONTROL_TRANSFER: from fffff800031855f6 to fffff800031842b1 STACK_TEXT: fffff880`07a658d0 fffff800`031855f6 : 00000000`00000000 fffffa80`042bcb10 fffffa80`00000000 fffffa80`03d0a260 : nt!ObpLookupObjectName+0x461 fffff880`07a659c0 fffff800`03158d8c : fffff800`03013940 00000000`00000000 fffff8a0`0c445301 00000000`00000005 : nt!ObOpenObjectByName+0x306 fffff880`07a65a90 fffff800`03163ef2 : 00000000`03c9f980 00000000`00020019 00000000`03c9f990 000007fe`00000000 : nt!CmOpenKey+0x28a fffff880`07a65be0 fffff800`02e8d253 : 00000000`00000008 000007fe`e8425c34 00000000`00000001 00000000`00000000 : nt!NtOpenKey+0x12 fffff880`07a65c20 00000000`771a143a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 00000000`03c9f938 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x771a143a CHKIMG_EXTENSION: !chkimg -lo 50 -d !nt fffff800031842b0 - nt!ObpLookupObjectName+460 [ 85:95 ] 1 error : !nt (fffff800031842b0) MODULE_NAME: memory_corruption IMAGE_NAME: memory_corruption FOLLOWUP_NAME: memory_corruption DEBUG_FLR_IMAGE_TIMESTAMP: 0 MEMORY_CORRUPTOR: ONE_BIT STACK_COMMAND: .cxr 0xfffff88007a64ef0 ; kb FAILURE_BUCKET_ID: X64_MEMORY_CORRUPTION_ONE_BIT BUCKET_ID: X64_MEMORY_CORRUPTION_ONE_BIT Followup: memory_corruption ---------
So we have a one-bit memory corruption that has overwritten code in the nt module. Besides the nice bucket id to key us into this, we can also see it in the disassembly. We know that the f6 instruction has decoded into an invalid instruction, so the code corruption must be right before or right after that instruction. If we disassemble starting just before the faulting IP in the crash we get this:
2: kd> u nt!ObpLookupObjectName+0x457 nt!ObpLookupObjectName+0x457: fffff800`031842a7 488bb42428010000 mov rsi,qword ptr [rsp+128h] fffff800`031842af 4895 xchg rax,rbp fffff800`031842b1 f6 ??? fffff800`031842b2 0f8544060000 jne nt!ObpLookupObjectName+0xaac (fffff800`031848fc) fffff800`031842b8 b801000000 mov eax,1 fffff800`031842bd 488d942490000000 lea rdx,[rsp+90h] fffff800`031842c5 440fb6c8 movzx r9d,al fffff800`031842c9 458bc6 mov r8d,r14d
Going to a known good copy of the same OS (Windows 7 x64 SP1) and disassembling that same location, we get this:
1: kd> u nt!ObpLookupObjectName+457 nt!ObpLookupObjectName+0x457: fffff800`02bc6357 488bb42428010000 mov rsi,qword ptr [rsp+128h] fffff800`02bc635f 4885f6 test rsi,rsi fffff800`02bc6362 0f8544060000 jne nt!ObpLookupObjectName+0xaac (fffff800`02bc69ac) fffff800`02bc6368 b801000000 mov eax,1 fffff800`02bc636d 488d942490000000 lea rdx,[rsp+90h] fffff800`02bc6375 440fb6c8 movzx r9d,al fffff800`02bc6379 458bc6 mov r8d,r14d fffff800`02bc637c 498bcd mov rcx,r13
Ignoring the different addresses due to the different module base addresses, we can see that the second instruction is incorrect. The xchg rax, rbp; ??? should be a test rsi,rsi. Looking at the code bytes, we should have 0x4885f6 but instead we have 0x4895f6. There’s the single-bit corruption!
Let’s Try Throwing Money at the Problem
This was another strike in the faulty hardware column for me. The next thing I did was have them run the built-in Windows Memory Diagnostic Tool. I had them configure it to use the Extended test with Default caching and an unlimited number of passes. I was hoping for an easy win at this point because RAM is cheap and easy to install. After about an 18 hrs test run however, nothing had shown up. The next thing I tried as a CPU stress-test using Prime95 to see if the CPU was flaking out when it heated up. I had them use the ‘Blend’ setting which uses lots of RAM in addition to stressing the CPU and 4 threads since the machine is a 4 Core i5 750 which lacks hyperthreading. After running for a few hours it will still running strong without any signs of trouble.
Going back to my last visit when I cleaned the case out, the video card was the only thing that had a significant amount of dust on it. Video cards make heavy use of RAM and use DMA so a fault in the hardware could result in direct corruption of physical memory that would appear to show up in random places in the virtual address space of the kernel and processes on the system. This fact plus the lack of a good GPU diagnostic tool led me to just order them a Radeon HD 6450 to replace their existing ATI Radeon HD 4350. I’ve also had several of those same ATI Radeon HD 4350 cards from the same manufacturer die on me at work as well, which further added to my suspicion. A few days later it arrived and they installed it. All seemed to be well for a couple of weeks and I was beginning to think we had nipped this one. Apparently the machine had other plans though, as it crashed yet again.
Deep Dive: Level 3
Having previously enabled kernel crash dumps, I was now armed with both a mini-dump and a kernel memory dump for this latest crash. Zipped, the kernel memory dump was only 144MB which was easily uploaded to me overnight, even on the slow DSL connection. I once again started off with !analyze -v:
3: kd> !analyze -v ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* PAGE_FAULT_IN_NONPAGED_AREA (50) Invalid system memory was referenced. This cannot be protected by try-except, it must be protected by a Probe. Typically the address is just plain bad or it is pointing at freed memory. Arguments: Arg1: fffff8800764a6c0, memory referenced. Arg2: 0000000000000001, value 0 = read operation, 1 = write operation. Arg3: fffff960006b9eca, If non-zero, the instruction address which referenced the bad memory address. Arg4: 0000000000000000, (reserved) Debugging Details: ------------------ Could not read faulting driver name WRITE_ADDRESS: fffff8800764a6c0 FAULTING_IP: cdd+9eca fffff960`006b9eca 48898c2490000000 mov qword ptr [rsp+90h],rcx MM_INTERNAL_CODE: 0 CUSTOMER_CRASH_COUNT: 1 DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT BUGCHECK_STR: 0x50 PROCESS_NAME: OUTLOOK.EXE CURRENT_IRQL: 0 TRAP_FRAME: fffff8800744a4a0 -- (.trap 0xfffff8800744a4a0) NOTE: The trap frame does not contain all registers. Some register values may be zeroed or incorrect. rax=fffff900c0151020 rbx=0000000000000000 rcx=0000000000000060 rdx=fffff8800744aa50 rsi=0000000000000000 rdi=0000000000000000 rip=fffff960006b9eca rsp=fffff8800744a630 rbp=fffff8800744a6e0 r8=0000000000000001 r9=fffff8800744af50 r10=fffff900c0151080 r11=fffff8800744a6e0 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei pl nz na pe nc cdd+0x9eca: fffff960`006b9eca 48898c2490000000 mov qword ptr [rsp+90h],rcx ss:0018:fffff880`0744a6c0=fffff8800744aa50 Resetting default scope LAST_CONTROL_TRANSFER: from fffff80002e8deec to fffff80002ee5fc0 STACK_TEXT: fffff880`0744a338 fffff800`02e8deec : 00000000`00000050 fffff880`0764a6c0 00000000`00000001 fffff880`0744a4a0 : nt!KeBugCheckEx fffff880`0744a340 fffff800`02ee40ee : 00000000`00000001 fffff880`0764a6c0 fffff6fc`80610800 fffff880`0744aa50 : nt! ?? ::FNODOBFM::`string'+0x4514f fffff880`0744a4a0 fffff960`006b9eca : 00000000`00000000 fffff880`0744ab70 fffff900`c0145020 fffff900`c1daf9c8 : nt!KiPageFault+0x16e fffff880`0744a630 00000000`00000000 : fffff880`0744ab70 fffff900`c0145020 fffff900`c1daf9c8 00000000`0000056a : cdd+0x9eca STACK_COMMAND: kb FOLLOWUP_IP: cdd+9eca fffff960`006b9eca 48898c2490000000 mov qword ptr [rsp+90h],rcx SYMBOL_STACK_INDEX: 3 SYMBOL_NAME: cdd+9eca FOLLOWUP_NAME: MachineOwner MODULE_NAME: cdd IMAGE_NAME: cdd.dll DEBUG_FLR_IMAGE_TIMESTAMP: 0 FAILURE_BUCKET_ID: X64_0x50_cdd+9eca BUCKET_ID: X64_0x50_cdd+9eca Followup: MachineOwner ---------
This bugcheck is different from the last one we’ve got, but we’ve seen that this machine has been issuing bugcheck 0x50 stop codes via the Windows Event Logs I investigated in Part 1. The faulting IP this time is in cdd.dll, which is the Microsoft Canonical Display Driver. Although the bugcheck name is PAGE_FAULT_IN_NONPAGED_AREA, we can see that our current IRQL is 0 (Passive) where page faults are allowed. This indicates that the reference memory address is invalid. We can see that it was a write operation to fffff8800764a6c0 and that the faulting IP was fffff960006b9eca. Viewing the trap frame to see the disassembly of the faulting instruction and the current values of the registers, we see this:
3: kd> .trap 0xfffff8800744a4a0 NOTE: The trap frame does not contain all registers. Some register values may be zeroed or incorrect. rax=fffff900c0151020 rbx=0000000000000000 rcx=0000000000000060 rdx=fffff8800744aa50 rsi=0000000000000000 rdi=0000000000000000 rip=fffff960006b9eca rsp=fffff8800744a630 rbp=fffff8800744a6e0 r8=0000000000000001 r9=fffff8800744af50 r10=fffff900c0151080 r11=fffff8800744a6e0 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei pl nz na pe nc cdd+0x9eca: fffff960`006b9eca 48898c2490000000 mov qword ptr [rsp+90h],rcx ss:0018:fffff880`0744a6c0=fffff8800744aa50
Hrmn! All this instruction is trying to do is write the value of rcx to a stack location and it’s essentially access violating. A buffer overrun is one way this can occur, but you generally don’t see that with a static stack offset. Let’s check the math on this calculation:
rsp+90 => fffff8800744a630 + 90 => fffff8800744a6c0
So the address of the stack write should be fffff8800744a6c0 however we see from argument one of the bugcheck that it was fffff8800764a6c0. Viewing these in 64-bit binary representation we see:
fffff8800744a630 (correct calculation) 1111 1111 1111 1111 1111 1000 1000 0000 0000 0111 0100 0100 1010 0110 1100 0000 fffff8800764a630 (actual access attempt) 1111 1111 1111 1111 1111 1000 1000 0000 0000 0111 0110 0100 1010 0110 1100 0000 ^
Notice that bit 21 is incorrectly set in the actual access attempt. We have another single-bit memory corruption!
At this point I am a bit puzzled. In my experience, CPU failures are generally pretty rare and if the CPU is going to go bad, it does so very early on (this machine is >3yrs old). I’ve already replaced the video card. I’m leaning towards this being a faulty RAM issue although it could be a motherboard issue. However, from my recollection from my computer organization courses in college, the way that move instruction is going to execute is that that the arithmetic logic unit (ALU) will get the value from the RSP register, add 90, and then directly pass the resulting address to the memory-management unit (MMU) via its select lines, indicating a write operation and the value in RCX. The various processor caches (L1, L2, L3, etc.) are of course involved but the main point is that I don’t recall main system memory (RAM) being involved in the calculation of the write address. The address of the fault is a virtual address and the output from the MMU to the RAM should be a physical address. Said another way, although RAM is the target of the write, I don’t recall there being any RAM access required to lookup the target address. The only things I could think of were:
- During the hardware assisted conversion of the virtual address to a physical address, the various structures (PXE, PDE, PTE,PFN) weren’t present in the translation look-aside buffer (TLB) which resulted in a page walk, introducing RAM accesses which presents an opportunity for the target address to be corrupted.
- The problem isn’t RAM at all, but the motherboard or CPU or Power Supply Unit (PSU) or ….
Let’s Throw Some More Money at the Problem
Motherboards are fairly pricey, especially OEM boards and they are difficult for the average joe to replace. The CPU in this machine is a Core i5 750 which is still plenty fast for what they need. However, it’s the old LGA 1156 form factor, which they don’t make anymore. As a result, getting a replacement processor was going to be $200 and it’s still difficult for the average joe to replace. So, although the last memory test didn’t show anything, I went ahead and ordered 2 4GB DIMMS to replace the existing 4 1GB DIMMS.
I also noticed a few drivers on the system that I didn’t recognize or that have been known to cause issues in the past (based on various Google searches). Windows has a built-in utility called Driver Verifier. Per MSDN – “Driver Verifier monitors kernel-mode drivers to detect incorrect function calls or actions that might corrupt the system.” I went ahead and had them enable Driver Verifier using the standard settings on the following list of drivers:
- 000 – Associated with Cyberlink Power DVD. Fishy module name most likely associated with DRM.
- mcupdate –
McAfee update service. McAfee isn’t installed on the box.Update: After looking at the loaded drivers in Process Explorer, the full name of this module mcupdate_GenuineIntel.dll and it’s the Intel Microcode Update Library from Microsoft Corporation
- ElbyCDIO – VirtualClone drive
- VClone – VirtualClone drive
- fowp64 – This one is really fishy. It’s a driver that injects into every process and hooks file accesses. It has a corresponding user-mode service and is part of something called FileOpen Manager Service. Google searches are turning up very little and it’s installed into it’s own Program Files directory so I don’t know what application it’s associated with.
- rikvm – Associated with Cyberlink Power DVD. Google searches turn up a lot of talk of this being a rootkit and people being unhappy with Cyberlink about it
- psi_mf – Associated with Secunia PSI
- pnarp – Associated with Cisco Pure Networks
- k57nd60a – Broadcom Ethernet Driver
Wait and See
Now that I have still have kernel crash dumps enabled, Driver Verifier turned on, the video card replaced, all the drivers and applications fully updated, and some new RAM on the way, it’s time to just wait and see. Hopefully the new RAM takes care of the issue in which case I’ll update this post with a “Hooray!”. If not, then it will be on to Part 3. Until then….
Update 2013-03-08:Within 12hrs of installing the new ram, the machine bug checked 3 times. All the new bug checks point at single bit memory corruption. Going back to the mov qword ptr [rsp+90h],rcx with a stuck bit, I just can’t see how this could be anything other than the CPU. At this point I’m hunting eBay for a Core i5-750 LGA 1156 to give that a go. More when it’s available….
Update 2013-06-13:After swapping out the processor almost 4 months ago, there have been zero crashes. Looks like we finally got this one beat!