  1. #1
    Just Joined!
    Join Date
    Aug 2009
    Posts
    12

    Need help debugging hard lock on high disk activity


    Hello all,
    I am struggling trying to find the cause of a hardlock under heavy IO. First things first:
    Hardware:
    lshw attached, in short:
    ASUS ncch-dl (dual xeon, i875 chipset)
    Nvidia Quadro FX 4000
    Adaptec 2120S U320 SCSI RAID controller on PCI-X 66MHz, 4 drives in RAID0

    Symptoms:
    System often hangs under heavy disk I/O (e.g. dd if=/dev/zero of=test bs=1M count=1024, installing a package, etc.). Keyboard and mouse are completely dead, including numlock and capslock. It happens both in the GUI and on a virtual terminal. In the GUI it looks like a complete system freeze, but on the virtual terminal I can still see the cursor blinking, which makes me suspect a kernel deadlock rather than a full system freeze.
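One way to trigger the load repeatably while leaving a trace on disk is a small wrapper around the dd command above. This is only a minimal sketch, not part of the original setup: the function name stress_write and the log file are hypothetical, and conv=fsync assumes GNU dd.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT MiB of zeros to FILE in 1 MiB blocks.
# A timestamp is logged before and after, so after a hang and reboot the
# log shows whether the write was still in progress.
stress_write() {
    file=$1
    count=$2
    log=$3
    echo "$(date '+%F %T') start: $count MiB -> $file" >> "$log"
    sync    # push the "start" line to disk before the heavy write begins
    # conv=fsync (GNU dd) makes dd flush the data to disk before exiting
    dd if=/dev/zero of="$file" bs=1M count="$count" conv=fsync 2>> "$log"
    status=$?
    echo "$(date '+%F %T') done: dd exit status $status" >> "$log"
    return "$status"
}
```

Calling it as stress_write test 1024 stress.log mirrors the command from the post. If the log ends with a "start" line and no matching "done" line after a reboot, the hang occurred during that write.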

    My efforts so far:
    It is not a distribution-specific issue: I tested kernels from 2.6.24 onwards and always hit it.
    It is not a hardware failure, since the system is rock solid on PC-BSD. I tried three different RAID controllers, from both Adaptec and LSI, but it made no difference. I tested the RAM with memtest and found no problems.
    I can not find any indication of nearing failure in any system log. I tried enabling nmi_watchdog from GRUB, but kernel log stays empty and system still hangs.
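Before relying on the NMI watchdog, it can help to confirm that it is actually firing: the kernel's NMI watchdog documentation suggests checking that the NMI count in /proc/interrupts is nonzero and increasing. The following is a rough sketch under that assumption; the function names nmi_total and nmi_watchdog_ticking are hypothetical, and the file argument exists only so the parsing can be tried on a saved copy.

```shell
#!/bin/sh
# Hypothetical helper: sum the per-CPU NMI counters from an interrupts file
# (defaults to /proc/interrupts). Prints 0 if there is no NMI line.
nmi_total() {
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { print sum + 0 }' "${1:-/proc/interrupts}"
}

# Hypothetical helper: succeed if the NMI total grows over an interval
# (default 5 seconds), i.e. the watchdog appears to be ticking.
nmi_watchdog_ticking() {
    file=${1:-/proc/interrupts}
    before=$(nmi_total "$file")
    sleep "${2:-5}"
    after=$(nmi_total "$file")
    [ "$after" -gt "$before" ]
}
```

Running nmi_watchdog_ticking && echo "NMI count is increasing" on the affected machine shows whether the counters move at all. If they stay flat, the watchdog is probably not active in the selected mode, which would be consistent with the empty kernel log.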

    I'd be really happy if anyone could give me any hints on what the cause may be, or on how I can collect useful information that would allow me to open a bug in the kernel bugzilla.
    Thanks in advance!

  2. #2
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,380
    Have you tried another controller board? This sounds like possibly an interrupt deadlock - more specifically, interrupts did not get reenabled. The Linux kernel can nest interrupt requests, but perhaps under too high an interrupt load this got too deeply nested. Possibly there are some kernel parameters that can be tweaked to improve this. My suggestion is to post your query to the kernel.org forums so you get the benefit of the kernel developers' expertise.
    Sometimes, real fast is almost as good as real time.
    Just remember, Semper Gumbi - always be flexible!

  3. #3
    Just Joined!
    Join Date
    Aug 2009
    Posts
    12
    Thanks for your answer. Changing the controller board would imply changing the mainboard, right? If so, to my knowledge only two dual xeon mainboards with fsb800 and AGP were ever produced, the asus ncch-dl and the iwill dh800, and the latter is nearly impossible to find... so that would be quite hard to do...
    Concerning the official kernel forums, would you mind pointing me to them? I already searched previously, but the closest I got was here or kernelnewbies.org, which didn't seem the right place as I am not actually developing...
    Thanks!

  4. #4
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,380
    I was just thinking of the Adaptec SCSI controller board, which is a PCI-X board, according to your original hardware list.

  5. #5
    Just Joined!
    Join Date
    Aug 2009
    Posts
    12
    Uhm it's actually already the third controller I am trying with, but the same issue already occurred with another adaptec controller as well as with a lsi controller...

  6. #6
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,380
    Originally posted by smani:
    Uhm it's actually already the third controller I am trying with, but the same issue already occurred with another adaptec controller as well as with a lsi controller...
    Ok. Do you have another PCI-X slot in the motherboard to try? Have you tried another motherboard, such as one from Intel or other than ASUS? I have an Intel dual xeon motherboard myself (S5000XVN) with a RAID controller as well as internal sata/ide controllers w/ 10 sata/esata drives attached and a lot of I/O - no problems here.

  7. #7
    Just Joined!
    Join Date
    Aug 2009
    Posts
    12
    I have two PCIX slots, tried both of them... As for another mainboard, nothing at hand (actually I have an iwill dh800 around, but something around the ram capacitors is shorted and the mainboard won't go past POST).
    Anyway, I guess it is pretty clear that it's the ASUS mainboard that doesn't work well with the linux kernel (actually not the first ASUS mainboard I have used that behaved funnily). The question is whether there is any way I can narrow the issue down: should I just open a bug in the kernel bugzilla and hope that someone there can provide me with instructions on how to obtain some debug information?

  8. #8
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,380
    Last thing to try with this motherboard is to see if ASUS has an updated bios and/or firmware to download and install, and to check on the Intel website for a CPU microcode update for your chips. Some of these issues are bios/firmware or microcode related (floating/lost IRQ's). If that doesn't work, then I think you need a better system board.

  9. #9
    Just Joined!
    Join Date
    Aug 2009
    Posts
    12
    I tried all bioses up to the latest beta bios, no luck... But it cannot be just the hardware's fault since bsd is working just fine... But I'll try with microcode.ctl and the latest microcode from intel, who knows..

  10. #10
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,380
    Originally posted by smani:
    I tried all bioses up to the latest beta bios, no luck... But it cannot be just the hardware's fault since bsd is working just fine... But I'll try with microcode.ctl and the latest microcode from intel, who knows..
    You're probably right. You seem to be doing the right things to get to a root cause of this problem. Anyway, that's why I suggested before that you post this issue on the kernel.org forums. However, if it is a kernel issue, I cannot see why others haven't had the same problem.
