  1. #1
    Linux Newbie humbletech99's Avatar
    Join Date
    Nov 2005
    Posts
    225

    Machine Check Exception on new Opteron server


    I set up a new amd64 Gentoo server yesterday on an Opteron, but within a few hours of it being up I got a "Machine Check Exception" and the machine froze. I had to go to the local console to see this, and then had to hard-reboot the machine. It wasn't doing much at the time other than compiling a couple of things. The server is a dual-CPU, dual-core machine (4 cores in total) with 8GB of RAM and 12 SCSI disks, plus 2 SATA disks for the OS.

    The error from the console is below:

    Code:
    HARDWARE ERROR
    CPU 2: Machine Check Exception:                                    4 Bank 4:  f615200133000813
    TSC 5ac60e50b6a ADDR 1d251ec00
    This is not a software problem!
    Run through mcelog --ascii to decode and contact your hardware vendor
    Kernel panic - not syncing: Machine check

    I have been googling around since yesterday but haven't found anything conclusive.

    I've tried running mcelog and got the following:
    Code:
    # mcelog --k8 /dev/mcelog
    MCE 0
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a4d0cd72d5a8
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0
    MCE 1
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a56b2eba7649
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0
    MCE 2
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a60591585bda
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0
    MCE 3
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a69ff2a635e8
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0
    MCE 4
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a73a53f42ca9
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0
    MCE 5
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a7d4b6934fdf
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0
    MCE 6
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 2 4 northbridge TSC a86f17e0a6a8
    ADDR 191b0b000
      Northbridge Chipkill ECC error
      Chipkill ECC syndrome = c12f
           bit46 = corrected ecc error
           bit62 = error overflow (multiple errors)
      bus error 'local node response, request didn't time out
          generic read mem transaction
          memory access, level generic'
    STATUS d417c000c1080a13 MCGSTATUS 0
    MCE 7
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC a86f17e0c311
    ADDR 23c400000
      Northbridge GART error
           bit61 = error uncorrected
      TLB error 'generic transaction, level generic'
    STATUS a40000000005001b MCGSTATUS 0

    Does anybody know anything about this?
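
For anyone reading the STATUS values by hand: the flags mcelog prints can be recovered from the raw 64-bit word. Below is a minimal Python sketch, assuming the K8 northbridge (bank 4) status layout described in AMD's BIOS and Kernel Developer's Guide. The bit positions, the `decode_nb_status` name and the small name table are assumptions for illustration, not mcelog's own code.

```python
# Assumed K8 northbridge status layout (check the AMD BKDG before relying on it):
#   bit 63 valid, bit 62 overflow, bit 61 uncorrected, bit 58 address valid,
#   bit 46 corrected ECC, bit 45 uncorrected ECC,
#   bits 54:47 syndrome low byte, bits 31:24 syndrome high byte (Chipkill),
#   bits 19:16 extended error code, bits 15:0 error code.
EXT_ERROR_NAMES = {  # partial table, only the codes seen in this thread
    0x0: "ECC error",
    0x5: "GART error",
    0x8: "Chipkill ECC error",
}


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_nb_status(status):
    """Split a 64-bit northbridge STATUS word into named fields."""
    ext = (status >> 16) & 0xF
    syndrome_low = (status >> 47) & 0xFF
    syndrome_high = (status >> 24) & 0xFF
    return {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "addr_valid": bit(status, 58),
        "corrected_ecc": bit(status, 46),
        "uncorrected_ecc": bit(status, 45),
        "ext_error": EXT_ERROR_NAMES.get(ext, "other (0x%x)" % ext),
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # STATUS values copied from the console panic and the mcelog output above.
    for raw in (0xF615200133000813, 0xA40000000005001B, 0xD417C000C1080A13):
        fields = decode_nb_status(raw)
        print("%016x" % raw)
        for name, value in fields.items():
            shown = "%#x" % value if name in ("syndrome", "error_code") else value
            print("    %-16s %s" % (name, shown))
```

Run against the values above, this reports the repeated CPU 0 records as uncorrected GART errors, and the CPU 2 record as a corrected Chipkill ECC error with the overflow bit set and syndrome c12f, which matches the mcelog text.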

  2. #2
    Linux Newbie humbletech99's Avatar
    Join Date
    Nov 2005
    Posts
    225
    After replacing the processors and then the motherboard, the problem was still occurring. After exhausting all other options, including eliminating the drive cards (since it happened under drive load), I eventually replaced the memory and the problem went away.

    This was annoying because I ran extensive memory tests at the beginning and they showed no errors. But replacing the memory was the only thing that stopped the problem.
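
Since the memory tests came back clean, the decoded mcelog records were the only place the memory errors showed up. One way to spot a pattern in them is to tally records by CPU, error type and address. The following is a rough Python sketch, assuming the text layout of the mcelog output posted earlier in this thread (other mcelog versions may format records differently); `summarize_mcelog` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Two records abridged from the mcelog output posted in this thread.
SAMPLE = """\
MCE 6
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC a86f17e0a6a8
ADDR 191b0b000
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = c12f
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
STATUS d417c000c1080a13 MCGSTATUS 0
MCE 7
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a86f17e0c311
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
"""


def summarize_mcelog(text):
    """Count decoded records by (cpu, error description, address)."""
    summary = Counter()
    cpu = addr = kind = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if re.match(r"MCE \d+$", line):
            # A new record starts: forget fields from the previous one.
            cpu = addr = kind = None
        elif line.startswith("CPU "):
            cpu = line.split()[1]
        elif line.startswith("ADDR "):
            addr = line.split()[1]
        elif line.startswith("Northbridge ") and line.endswith("error"):
            kind = line
        elif line.startswith("STATUS "):
            # STATUS is the last line of a record in this layout.
            summary[(cpu, kind, addr)] += 1
    return summary


if __name__ == "__main__":
    for (cpu, kind, addr), count in summarize_mcelog(SAMPLE).most_common():
        print("CPU %s  %-32s ADDR %-12s x%d" % (cpu, kind, addr, count))
```

Fed the full output rather than the abridged sample, a tally like this makes it easy to separate the GART entries that repeat at one address from the ECC entries that point at memory.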

  3. #3
    Linux Guru bigtomrodney's Avatar
    Join Date
    Nov 2004
    Location
    Ireland
    Posts
    6,132
    Thanks for posting that back. I've never seen any of those errors. I'm just waiting for the Quads next year, so I'm glad it wasn't an issue with multiple cores (didn't think it would be but I'm still glad).
