- 08-16-2009 #1 Just Joined!
- Join Date
- Aug 2009
- Posts
- 12
Need help debugging hard lock on high disk activity
Hello all,
I am struggling trying to find the cause of a hardlock under heavy IO. First things first:
Hardware:
lshw attached, in short:
ASUS ncch-dl (dual xeon, i875 chipset)
Nvidia Quadro FX 4000
Adaptec 2120S U320 SCSI RAID controller on PCI-X 66MHz, 4 drives in RAID0
Symptoms:
The system often hangs under heavy disk I/O (e.g. dd if=/dev/zero of=test bs=1M count=1024, or installing a package). Keyboard and mouse are completely dead, including Num Lock and Caps Lock. It happens both in the GUI and on a virtual terminal. In the GUI it looks like a complete system freeze, but on the virtual terminal I can still see the cursor blinking, which makes me suspect a kernel deadlock rather than a full system freeze.
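For anyone trying to reproduce this, the dd command above can be approximated in Python with periodic progress output, so that the last line printed before the hang shows roughly how far the write got. This is only a sketch: the function name `write_zeros` and the `report_every` parameter are made up for illustration, and the periodic `fsync` is an assumption that changes the I/O pattern compared to plain dd (which does not sync unless asked to).

```python
import os
import sys
import time


def write_zeros(path, block_size=1024 * 1024, count=1024, report_every=64):
    """Write `count` blocks of zeros to `path`, roughly like
    dd if=/dev/zero of=path bs=1M count=1024. Returns the bytes written."""
    block = bytes(block_size)  # one block of zero bytes
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        for i in range(1, count + 1):
            f.write(block)
            written += block_size
            if i % report_every == 0 or i == count:
                # Push the data to disk before reporting (assumption: we
                # want the progress line to reflect data actually flushed).
                f.flush()
                os.fsync(f.fileno())
                elapsed = time.monotonic() - start
                print(f"{i}/{count} blocks, {written} bytes, {elapsed:.1f}s",
                      file=sys.stderr, flush=True)
    return written


if __name__ == "__main__":
    # Default target name matches the "of=test" used in the dd example.
    target = sys.argv[1] if len(sys.argv) > 1 else "test"
    write_zeros(target)
```

Run from a virtual terminal, the progress lines stay on screen after the lock-up, which may help tell whether the hang occurs at a consistent point or at random.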
My efforts so far:
It is not a distribution-specific issue: I tested kernels from 2.6.24 onwards and always hit the problem.
It does not look like a hardware failure, since the system is rock solid on PC-BSD. I tried three different RAID controllers, from both Adaptec and LSI, but it made no difference. I tested the RAM with memtest and found no problems.
I cannot find any indication of an impending failure in any system log. I tried enabling nmi_watchdog from GRUB, but the kernel log stays empty and the system still hangs.
I'd be really happy if anyone could give me hints on what the cause may be, or on how I can collect useful information that would allow me to open a bug in the kernel bugzilla.
Thanks in advance!
- 08-17-2009 #2 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 10,143
Have you tried another controller board? This sounds like possibly an interrupt deadlock - more specifically, interrupts did not get reenabled. The Linux kernel can nest interrupt requests, but perhaps under too high an interrupt load this got too deeply nested. Possibly there are some kernel parameters that can be tweaked to improve this. My suggestion is to post your query to the kernel.org forums so you get the benefit of the kernel developers' expertise.
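One way to get some data on the interrupt theory is to compare two snapshots of /proc/interrupts taken a few seconds apart during heavy I/O and see which counters are climbing. The sketch below does that in Python; the function names are made up for illustration, and the parser assumes the usual layout of /proc/interrupts (a header row of CPU columns, then one "label: counts... description" row per interrupt). It can also show whether the NMI line is increasing at all, which is commonly used as a hint that nmi_watchdog is actually armed.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (total_count, description)}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        fields = rest.split()
        per_cpu = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break  # rows such as ERR: have fewer count columns
            per_cpu.append(int(field))
        description = " ".join(fields[len(per_cpu):])
        counts[label.strip()] = (sum(per_cpu), description)
    return counts


def diff_counts(before, after):
    """Return {label: (delta, description)} between two parsed snapshots."""
    deltas = {}
    for label, (total, description) in after.items():
        previous = before.get(label, (0, description))[0]
        deltas[label] = (total - previous, description)
    return deltas


def read_interrupts(path="/proc/interrupts"):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    first = parse_interrupts(read_interrupts())
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = parse_interrupts(read_interrupts())
    rows = sorted(diff_counts(first, second).items(),
                  key=lambda item: -item[1][0])
    for label, (delta, description) in rows:
        if delta:
            print(f"{label:>4} +{delta:<10} {description}")
```

Running it in a loop on a virtual terminal while the dd test is going would leave the last set of deltas on screen when the machine locks up. A controller line with a very high rate, or an NMI line that never moves, would be worth including in a bug report.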
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!
- 08-17-2009 #3 Just Joined!
- Join Date
- Aug 2009
- Posts
- 12
Thanks for your answer. Changing the controller board would imply changing the mainboard, right? If so, to my knowledge only two dual Xeon mainboards with FSB800 and AGP were ever produced, the ASUS NCCH-DL and the Iwill DH800, and the latter is nearly impossible to find, so that would be quite hard to do...
Concerning the official kernel forums, would you mind pointing out which ones they are?
I already searched, but the closest I got was here or kernelnewbies.org, which didn't seem the right place as I am not actually developing...
Thanks!
- 08-17-2009 #4 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 10,143
I was just thinking of the Adaptec SCSI controller board, which is a PCI-X board, according to your original hardware list.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!
- 08-17-2009 #5 Just Joined!
- Join Date
- Aug 2009
- Posts
- 12
Uhm, it's actually already the third controller I am trying: the same issue occurred with another Adaptec controller as well as with an LSI controller...
- 08-17-2009 #6 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 10,143
Ok. Do you have another PCI-X slot in the motherboard to try? Have you tried another motherboard, such as one from Intel or another vendor besides ASUS? I have an Intel dual Xeon motherboard myself (S5000XVN) with a RAID controller as well as internal SATA/IDE controllers with 10 SATA/eSATA drives attached and a lot of I/O - no problems here.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!
- 08-17-2009 #7 Just Joined!
- Join Date
- Aug 2009
- Posts
- 12
I have two PCI-X slots and tried both of them... As for another mainboard, I have nothing at hand (actually I have an Iwill DH800 around, but something around the RAM capacitors is shorted and the board won't get past POST).
Anyway, I guess it is pretty clear that it's the ASUS mainboard that doesn't work well with the Linux kernel (it is not the first ASUS mainboard I have used that behaved oddly). The question is whether there is any way I can narrow the issue down: should I just open a bug in the kernel bugzilla and hope that someone there can give me instructions on how to obtain some debug information?
- 08-17-2009 #8 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 10,143
The last thing to try with this motherboard is to see if ASUS has an updated BIOS and/or firmware to download and install, and to check the Intel website for a CPU microcode update for your chips. Some of these issues are BIOS/firmware or microcode related (floating/lost IRQs). If that doesn't work, then I think you need a better system board.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!
- 08-17-2009 #9 Just Joined!
- Join Date
- Aug 2009
- Posts
- 12
I tried every BIOS version up to the latest beta BIOS, no luck... But it cannot be just the hardware's fault, since BSD is working just fine... I'll try microcode.ctl and the latest microcode from Intel anyway, who knows...
- 08-17-2009 #10 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 10,143
You're probably right. You seem to be doing the right things to get to a root cause of this problem. Anyway, that's why I suggested before that you post this issue on the kernel.org forums. However, if it is a kernel issue, I cannot see why others haven't had the same problem.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!



