  1. #1
    Just Joined!
    Join Date
    Dec 2009
    Posts
    12

    Hardware error being logged; how serious is it?


    Hi,
    I have a Sandy Bridge platform running Ubuntu 12.04.2, and dmesg is showing lots of errors of the form:

    [534501.259784] sbridge: HANDLING MCE MEMORY ERROR
    [534501.259801] CPU 0: Machine Check Exception: 0 Bank 5: cc0000c000010090
    [534501.259802] TSC 0 ADDR dc5d48300 MISC 20402e2e86 PROCESSOR 0:206d7 TIME 1375889743 SOCKET 0 APIC 0
    [534501.259956] sbridge: HANDLING MCE MEMORY ERROR
    [534501.259959] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
    [534501.259961] TSC 0 ADDR dc5d49000 MISC 2040404086 PROCESSOR 0:206d7 TIME 1375889743 SOCKET 0 APIC 0
    [534501.260130] sbridge: HANDLING MCE MEMORY ERROR
    [534501.260133] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
    [534501.260135] TSC 0 ADDR dc5d4a000 MISC 20400a0a86 PROCESSOR 0:206d7 TIME 1375889743 SOCKET 0 APIC 0
    [534501.260303] sbridge: HANDLING MCE MEMORY ERROR
    [534501.260305] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
    [534501.260306] TSC 0 ADDR dc5d4b000 MISC 2040646486 PROCESSOR 0:206d7 TIME 1375889743 SOCKET 0 APIC 0
    [534501.814827] EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#2_DIMM#0": 3 Unknown error(s): memory read on FATAL area OVERFLOW: cpu=0 Err=0001:0090 (ch=0), addr = 0xdc5d48300 => socket=0, Channel=2(mask=4), rank=0
    [534501.814830]
    [550160.585051] soft_offline: 0xcc5d4d: unknown non LRU page type 200000000008000
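
    As a rough aid to reading the status words in the dmesg lines above, here is a minimal Python sketch that unpacks an MCi_STATUS value using the architectural bit layout described in the Intel SDM. The function name decode_status and the table names are made up for illustration. The corrected-error count field is only meaningful when bit 10 of MCG_CAP is set, which is the case for the MCGCAP value 1000c14 reported by mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, Vol. 3B).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM sub-field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its main fields (illustrative helper)."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    info = {
        "flags": flags,
        # Bits 52:38 hold the corrected error count when MCG_CAP[10] is set.
        "corrected_count": (status >> 38) & 0x7FFF,
        # Bits 31:16 hold the model-specific error code.
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors have bit 7 set and bits 15:8 clear,
    # ignoring bit 12 (the correction report filtering bit).
    if (mca_code & 0xEF80) == 0x0080:
        info["transaction"] = MEM_TRANSACTIONS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return info


if __name__ == "__main__":
    for word in (0xcc0000c000010090, 0x8c00004000010090):
        print(hex(word), decode_status(word))
```

    For the first status word, cc0000c000010090, this gives a corrected count of 3 with the overflow flag set and the UC flag clear, on a memory read transaction for channel 0, which is consistent with the EDAC line.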

    mcelog reports it as:

    MCE 30
    CPU 0 BANK 8
    MISC 90844000400208c ADDR cc5d4d800
    TIME 1375860607 Wed Aug 7 08:30:07 2013
    MCG status:
    MCi status:
    Corrected error
    MCi_MISC register valid
    MCi_ADDR register valid
    MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
    Transaction: Memory scrubbing error
    STATUS 8c000050000800c0 MCGSTATUS 0
    MCGCAP 1000c14 APICID 0 SOCKETID 0
    CPUID Vendor Intel Family 6 Model 45
    Corrected memory errors on page cc5d4d000 exceed threshold 10 in 24h: 10 in 24h
    Location SOCKET:0 CHANNEL:0 DIMM []
    Offlining page cc5d4d000
    Offlining page cc5d4d000 failed: Input/output error
    Hardware event. This is not a software error.
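
    The page and frame numbers in these logs relate to the reported ADDR by simple bit arithmetic. A small sketch, assuming 4 KiB pages (the usual x86-64 base page size); the helper names are hypothetical:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, so the low 12 bits are the offset


def page_base(addr):
    """Return the address of the start of the page containing addr."""
    return addr & ~((1 << PAGE_SHIFT) - 1)


def page_frame_number(addr):
    """Return the page frame number (address with the in-page offset dropped)."""
    return addr >> PAGE_SHIFT


if __name__ == "__main__":
    addr = 0xcc5d4d800  # ADDR from the mcelog record
    print(hex(page_base(addr)), hex(page_frame_number(addr)))
```

    With ADDR cc5d4d800 this yields page cc5d4d000, the page mcelog tried to offline, and frame number 0xcc5d4d, which matches the soft_offline line in dmesg.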

    The machine keeps running and jobs even finish, so are these errors that the ECC memory can safely handle (as some people online have suggested), or is it the precursor to a dying memory module (or worse)?

    TIA

  2. #2
    Linux Newbie
    Join Date
    Nov 2009
    Posts
    228
    Hello boodle.

    I would say that it *is* indicative of failing in the memory department. A bit like dementia really, it creeps up on you.

    I don't know whether it is the DIMM(s) or some other component involved in accessing them, but it seems to be happening every 2 1/2 hours.

    My experience has been that this kind of error message is very often the precursor to a hardware failure. If you can, pop some other memory module(s) in there and see if the problem goes away. If it doesn't, there is something amiss on the hardware platform itself.

    Cheers - VP

  3. #3
    Just Joined!
    Join Date
    Dec 2009
    Posts
    12
    Hi VP,

    Just a quick update. After some drawn-out dialogue with the vendor and Intel, it turns out that a memory chip needs replacing and that there's a known bug in the ME firmware (downgrade recommended).

    boodle

  4. #4
    Linux Newbie
    Join Date
    Nov 2009
    Posts
    228
    Good for you! Well done in nailing the b****!

    Cheers - VP
