Xserve - frequent Kernel panics

mshapiro

I have a dual G4 Xserve running 10.4.11 that has been experiencing frequent kernel panics. I've been unable to determine the cause, and I've run out of ideas and options.

Warranty support has of course long since expired.
The internal CD-ROM drive is broken, so I've been trying to boot the Apple Hardware Test CD from an external DVD/CD-ROM drive, but that doesn't work either. I downloaded the AHT image for Xserve from Apple, and it comes up as an option when I reboot holding the Option key. But when I select the Hardware Test disc and press the Enter key, the boot options just refresh and the machine never boots off the CD.

I was running 10.4.9 when the kernel panics started, so I upgraded to 10.4.11 thinking it might help, but it didn't.

There is no regular interval between kernel panics. I've seen the server simply crash and reboot on its own, error text written directly on the screen over the desktop, and a gray screen saying "Please reboot your machine."

It would seem that my last option is to reinstall OS X Server.

Here are the latest panics from the panic.log:

Sun Apr 19 23:34:24 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 00000000907C5E5C, msr = 000000004200F030, dsisr = 42000000, dar = 00000000041310B6
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 0000000000000000

Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x000A8C00 0x000A8348 0x000ABB80
Proceeding back via exception chain:
Exception state (sv=0x52A6E280)
PC=0x907C5E5C; MSR=0x4200F030; DAR=0x041310B6; DSISR=0x42000000; LR=0x907C5E34; R1=0xBFFFF210; XCP=0x00000008 (0x200 - Machine check)

Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********

Mon Apr 20 09:09:16 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 00000000907C5E5C, msr = 000000004200F030, dsisr = 42000000, dar = 00000000041310B6
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 0000000000000000

Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x000A8C00 0x000A8348 0x000ABB80
Proceeding back via exception chain:
Exception state (sv=0x52A6E280)
PC=0x907C5E5C; MSR=0x4200F030; DAR=0x041310B6; DSISR=0x42000000; LR=0x907C5E34; R1=0xBFFFF210; XCP=0x00000008 (0x200 - Machine check)

Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********

Mon Apr 20 09:11:43 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 00000000907C5E5C, msr = 000000004200F030, dsisr = 42000000, dar = 00000000041310B6
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 0000000000000000

Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x000A8C00 0x000A8348 0x000ABB80
Proceeding back via exception chain:
Exception state (sv=0x52A6E280)
PC=0x907C5E5C; MSR=0x4200F030; DAR=0x041310B6; DSISR=0x42000000; LR=0x907C5E34; R1=0xBFFFF210; XCP=0x00000008 (0x200 - Machine check)

Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********

Mon Apr 20 15:01:09 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 000000000458DD58, msr = 000000004000F030, dsisr = 42000000, dar = 000000001E649000
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 00000000fffc0000

Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x000A8C00 0x000A8348 0x000ABB80
Proceeding back via exception chain:
Exception state (sv=0x490B6280)
PC=0x0458DD58; MSR=0x4000F030; DAR=0x1E649000; DSISR=0x42000000; LR=0x0457F9C0; R1=0xF07FEC30; XCP=0x00000008 (0x200 - Machine check)

Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********

Mon Apr 20 15:24:59 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 000000000458DD58, msr = 000000004000F030, dsisr = 42000000, dar = 000000001E649000
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 00000000fffc0000

Latest stack backtrace for cpu 1:
Backtrace:
0x000954F8 0x00095A10 0x00026898 0x000A8C00 0x000A8348 0x000ABB80
Proceeding back via exception chain:
Exception state (sv=0x490B6280)
PC=0x0458DD58; MSR=0x4000F030; DAR=0x1E649000; DSISR=0x42000000; LR=0x0457F9C0; R1=0xF07FEC30; XCP=0x00000008 (0x200 - Machine check)

Kernel version:
Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC
*********

Any help would be appreciated. Thank you.
 
Run the hardware test.
CPU 1 is referenced in every one of those panics, so temporarily shutting it off with the CHUD tools might be worth trying as a test...
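One way to check that observation is to tally the panic.log entries by CPU, faulting PC and BusFir value. Below is a minimal Python sketch, assuming the log format shown in the post above. The function names (parse_panics, summarize) are made up for illustration, and the sample text is copied from two of the posted entries.

```python
import re
from collections import Counter

# Entry header lines look like "Sun Apr 19 23:34:24 2009".
STAMP = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \d{4}$")
# e.g. "panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 0000..."
PANIC = re.compile(r"panic\(cpu (\d+) caller 0x[0-9A-Fa-f]+\): (.*?): pc = ([0-9A-Fa-f]+)")
BUSFIR = re.compile(r"BusFir = ([0-9A-Fa-f]+)")

SAMPLE = """\
Sun Apr 19 23:34:24 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 00000000907C5E5C, msr = 000000004200F030, dsisr = 42000000, dar = 00000000041310B6
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 0000000000000000
*********

Mon Apr 20 15:01:09 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: pc = 000000000458DD58, msr = 000000004000F030, dsisr = 42000000, dar = 000000001E649000
AsyncSrc = 0000000000000000, CoreFIR = 0000000000000000
L2FIR = 0000000000000000, BusFir = 00000000fffc0000
*********
"""

def parse_panics(text):
    """Split panic.log text into one dict per entry."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if STAMP.match(line):
            # A timestamp line starts a new entry.
            current = {"when": line}
            entries.append(current)
        elif current is not None:
            m = PANIC.search(line)
            if m:
                current["cpu"] = int(m.group(1))
                current["kind"] = m.group(2)
                current["pc"] = m.group(3)
            m = BUSFIR.search(line)
            if m:
                current["busfir"] = m.group(1)
    return entries

def summarize(entries):
    """Count entries that share the same CPU, panic type, PC and BusFir."""
    return Counter(
        (e.get("cpu"), e.get("kind"), e.get("pc"), e.get("busfir"))
        for e in entries
    )

if __name__ == "__main__":
    for key, count in summarize(parse_panics(SAMPLE)).items():
        print(count, key)
```

To run it against the real log, read the file's contents into a string and pass that to parse_panics instead of SAMPLE. The tally only groups the entries; it does not say which component is at fault.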
 
I will try shutting off CPU 1 with the CHUD tools (never knew one could do that). I found the original Apple Hardware Test CD that came with the Xserve and ran the Extended test only to find nothing wrong (all tests passed).

It's been a few days with a new install of OS X Server 10.4 and I'm still getting kernel panics.
 
Using the hwprefs command-line tool, I was not able to disable CPU 1. I get an error that says "ERROR: CPU1 cannot be disabled." I assume this is because all current processes are running on CPU 1. Is there a way to swap which CPU is the primary one?

(side note: I can successfully disable CPU 2)
 
SOLUTION!

Not sure if anyone is following this anymore, but I started removing the RAM one-by-one and I think bad RAM may have been the culprit after all. The server has gone a day without crashing after I removed one particular RAM stick. Will continue to wait and see, but I'd consider it solved for now. Thanks!
 
Good to know :)
Did you just pick RAM sticks at random, or did you run the hardware test to see which one of them would come up with an error first?
 
I ran the hardware test CD, but all the tests passed, including the memory! So that wasn't any help. I pulled one RAM stick Friday night and left the server on over the weekend. It crashed over the weekend. On Monday I put that stick back and pulled the next one. The server has not crashed since. :-)
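For anyone repeating this one-stick-at-a-time test, it helps to check panic.log for entries newer than the moment a stick was pulled. Here is a minimal Python sketch, assuming the timestamp format shown in the log above and an English locale for the day and month names. The function names and the cutoff time are hypothetical, and the sample text only mimics the log's timestamp lines.

```python
import re
from datetime import datetime

# Entry header lines look like "Mon Apr 20 15:01:09 2009".
STAMP = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \d{4}$")

SAMPLE_LOG = """\
Sun Apr 19 23:34:24 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: ...
*********

Mon Apr 20 09:09:16 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: ...
*********

Mon Apr 20 15:01:09 2009
panic(cpu 1 caller 0x000A8C00): Uncorrectable machine check: ...
*********
"""

def panic_times(text):
    """Return the timestamp of every entry found in panic.log text."""
    times = []
    for raw in text.splitlines():
        line = raw.strip()
        if STAMP.match(line):
            times.append(datetime.strptime(line, "%a %b %d %H:%M:%S %Y"))
    return times

def panics_since(text, pulled_at):
    """Return the panic timestamps at or after the time a stick was pulled."""
    return [t for t in panic_times(text) if t >= pulled_at]

if __name__ == "__main__":
    # Hypothetical cutoff: a stick was pulled at noon on April 20, 2009.
    pulled_at = datetime(2009, 4, 20, 12, 0, 0)
    later = panics_since(SAMPLE_LOG, pulled_at)
    if later:
        print("Panics since the swap:", ", ".join(str(t) for t in later))
    else:
        print("No panics since the swap.")
```

An empty result after a long enough soak period suggests the pulled stick was the bad one, though with intermittent faults a clean day or two is not proof.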
 
Aah. Did you run the extended test just once, or for at least a few hours or overnight? With RAM, the test can often pass once or twice, so the trick is to loop it and ideally keep it running overnight. Anyway, that was good luck in locating the RAM stick causing this with the first change.
 
Bad RAM is a very common reason for kernel panics.

I've found that RAM checking software often gives a false negative. The only sure way to test for bad RAM is to swap out the RAM modules and/or replace them with known good RAM.

Good work!
 