2007.0 hard crashes after 4 hours
2007.0 is reliably hard crashing after about 4 hours of uptime on my Acer Ferrari 3200. The syslog does not suggest any obvious cause. Quite often there was recent activity on the wireless network card (bcm4306, using the bcm43xx driver), but not always, and usually not exactly at the moment the machine crashed. For instance, the following was recorded in the syslog around the time of the last crash (entries from 15:55:32 onwards are the start of the reboot after the hard reset):
Nov 13 15:52:26 localhost last message repeated 4 times
Nov 13 15:52:38 localhost dhclient: DHCPREQUEST on eth0 to 192.168.111.2 port 67
Nov 13 15:52:47 localhost kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 13 15:52:47 localhost kernel: bcm43xx: Controller RESET (TX timeout) ...
Nov 13 15:52:47 localhost kernel: ACPI: PCI interrupt for device 0000:00:09.0 disabled
Nov 13 15:52:47 localhost kernel: ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 19 (level, low) -> IRQ 20
Nov 13 15:52:47 localhost kernel: bcm43xx: Chip ID 0x4306, rev 0x3
Nov 13 15:52:47 localhost kernel: bcm43xx: Number of cores: 5
Nov 13 15:52:47 localhost kernel: bcm43xx: Core 0: ID 0x800, rev 0x4, vendor 0x4243, enabled
Nov 13 15:52:47 localhost kernel: bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled
Nov 13 15:52:47 localhost kernel: bcm43xx: Core 2: ID 0x80d, rev 0x2, vendor 0x4243, enabled
Nov 13 15:52:47 localhost kernel: bcm43xx: Core 3: ID 0x807, rev 0x2, vendor 0x4243, disabled
Nov 13 15:52:47 localhost kernel: bcm43xx: Core 4: ID 0x804, rev 0x9, vendor 0x4243, enabled
Nov 13 15:52:47 localhost kernel: bcm43xx: PHY connected
Nov 13 15:52:47 localhost kernel: bcm43xx: Detected PHY: Version: 2, Type 2, Revision 2
Nov 13 15:52:48 localhost kernel: bcm43xx: Detected Radio: ID: 2205017f (Manuf: 17f Ver: 2050 Rev: 2)
Nov 13 15:52:48 localhost kernel: bcm43xx: Radio turned off
Nov 13 15:52:48 localhost kernel: bcm43xx: Radio turned off
Nov 13 15:52:48 localhost kernel: bcm43xx: Controller restarted
Nov 13 15:52:50 localhost dhclient: DHCPREQUEST on eth0 to 192.168.111.2 port 67
Nov 13 15:53:30 localhost last message repeated 2 times
Nov 13 15:55:32 localhost syslogd 1.4.1: restart.
Nov 13 15:55:32 localhost kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 13 15:55:32 localhost kernel: Inspecting /boot/System.map-2.6.17-5mdv
Nov 13 15:55:32 localhost kernel: Loaded 21427 symbols from /boot/System.map-2.6.17-5mdv.
Despite these messages, wireless networking appears to be working, as does the wired NIC.
The laptop is AMD64-based (2800+), but I am using the 32-bit version of 2007.0 because of previous issues with plugins for Mozilla, compatibility of OpenOffice and Java, etc. Installation went smoothly. It has 512 MB of RAM and a NetXtreme BCM5788 Gigabit Ethernet NIC (tg3 driver).
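To compare what was logged just before each crash, something along these lines could pull out the last few lines ahead of every syslogd restart. This is only a rough sketch: the function name and the `context` parameter are my own, and it assumes the restart line looks like the one in the excerpt above. Clean reboots will show up as well, so the output still needs reading by eye.

```python
import re
from collections import deque

# Matches the line syslogd writes when it starts again, e.g.
# "Nov 13 15:55:32 localhost syslogd 1.4.1: restart."
RESTART = re.compile(r"syslogd \S+: restart\.")


def lines_before_restarts(lines, context=5):
    """For each syslogd restart, return the `context` lines logged just before it."""
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if RESTART.search(line):
            results.append(list(recent))
            recent.clear()  # start a fresh window for the next uptime period
        else:
            recent.append(line)
    return results
```

It takes any iterable of lines, so an open log file can be passed in directly, for example `lines_before_restarts(open(path), context=10)` with `path` pointing at wherever syslog is written on the machine.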
Graphics card is "ATI Technologies Inc RV350 [Mobility Radeon 9600 M10]" according to lspci. I have tried with both fglrx and the xorg radeon driver, with the same result on both. Currently using the xorg driver, because Mandriva's fglrx driver has occasional artefacts on screen (small horizontal line following the pointer around).
I've tried with and without the new 3D desktop features; it doesn't make any difference. Same behaviour in KDE and Gnome.
I assume there isn't a problem similar to the one with kat in 2006?
I am not using any particular application when it crashes. I usually have Firefox, Thunderbird and Konsole running, but it will hang regardless of which application I am using, or if I am not using it at all at the time.
Any ideas?
dmesg, acpi and kernel panic
fingal: yeah, I know dmesg thanks, but there's no problem on boot that stands out. If you see my new information below, I guess the problem isn't with the wireless NIC, so I won't chase that error from the log, as the NIC works.
bigtomrodney: good point. I'll try the acpi=off, noapic and nolapic options that have been necessary with past incarnations of Mandr[iva|ake]. I don't want to lose ACPI though, as this is an AMD64 laptop and gets pretty hot without power management. But it's worth testing.
I learnt something new just now by accident. I had left the machine shutting down (so I thought) a few hours ago. I don't know if anyone else is finding shutdown pretty erratic with 3D enabled, but it varies between shutting down correctly straight from KDE, restarting KDM (from where I can shut down), and dropping to a shell. In this case, it had dropped to a shell, where it had sat until it crashed. I therefore got the system message that is not normally visible when it crashes in X, and doesn't get logged. It said:
CPU0: Machine Check Exception: 0000000000000004
Bank 4: b200000000070f0f
Kernel panic - not syncing: CPU context corrupt
That sounds like very bad news. I've got no particular reason to say this other than past experience, but I guess noapic may be my best hope.
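To get a bit more out of those two hex numbers, here is a rough Python sketch that picks apart the flag bits. It assumes the first number is the global machine-check status register (MCG_STATUS) and the "Bank 4" number is that bank's status register, and it only decodes the architecturally defined top bits; the function and table names are my own, and the lower error-code fields are model specific and left as raw numbers.

```python
# Architecturally defined top bits of a per-bank machine-check status register.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error was not corrected by hardware)",
    60: "EN (reporting of this error was enabled)",
    59: "MISCV (MISC register holds extra information)",
    58: "ADDRV (ADDR register holds an address)",
    57: "PCC (processor context corrupt)",
}

# Low bits of the global MCG_STATUS register.
MCG_FLAGS = {
    0: "RIPV (restart instruction pointer valid)",
    1: "EIPV (error instruction pointer valid)",
    2: "MCIP (machine check in progress)",
}


def set_flags(value, table):
    """List the names of the flags in `table` that are set in `value`, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a 64-bit bank status value into its flags and raw error-code fields."""
    return {
        "flags": set_flags(status, STATUS_FLAGS),
        "error_code": status & 0xFFFF,                   # bits 0-15
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 16-31
    }


if __name__ == "__main__":
    print(set_flags(0x0000000000000004, MCG_FLAGS))
    print(decode_bank_status(0xB200000000070F0F))
```

For the values above this reports MCIP for the global register and VAL, UC, EN and PCC for bank 4, which fits the "CPU context corrupt" panic text: the CPU itself flagged an uncorrected error. As far as I know, bank 4 on these K8 chips is the northbridge, but I haven't checked that against AMD's documentation.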
It would be nice if Mandriva worked 100% satisfactorily for a change, rather than the usual 95% brilliantly and 5% badly. Maybe it's time to try Ubuntu, but I go back to the very early days of Mandrake, and know how to get under its skin when there is a problem. Changing distros, you have to learn a whole new set of secrets.