  1. #1
    Just Joined!
    Join Date
    Mar 2013
    Posts
    9

    BUG: spinlock lockup on CPU by MySQL and Asterisk


    Dear All,

    I am new to this forum and hope someone can help me fix the following issue:

    We have a high-end server running Asterisk PBX services with Sangoma telephony cards. It previously ran fine, but for the past couple of days it has been logging this error:

    Mar 19 10:59:46 localhost kernel: BUG: spinlock lockup on CPU#7, asterisk/15200, ffff8100a3cf1028 (Tainted: G )
    Mar 19 10:59:46 localhost kernel:
    Mar 19 10:59:46 localhost kernel: Call Trace:
    Mar 19 10:59:46 localhost kernel: [<ffffffff800079d6>] _raw_spin_lock+0xcd/0xeb
    Mar 19 10:59:46 localhost kernel: [<ffffffff8006843f>] _spin_lock+0x47/0x52
    Mar 19 10:59:46 localhost kernel: [<ffffffff8004d16a>] unix_stream_sendmsg+0x255/0x363
    Mar 19 10:59:46 localhost kernel: [<ffffffff8003a041>] do_sock_write+0xc6/0x102
    Mar 19 10:59:46 localhost kernel: [<ffffffff80049f8e>] sock_aio_write+0x4f/0x5e
    Mar 19 10:59:46 localhost kernel: [<ffffffff8001930a>] do_sync_write+0xc7/0x104
    Mar 19 10:59:46 localhost kernel: [<ffffffff800a7596>] autoremove_wake_function+0x0/0x2e
    Mar 19 10:59:46 localhost kernel: [<ffffffff8000d4a1>] dnotify_parent+0x1f/0x79
    Mar 19 10:59:46 localhost kernel: [<ffffffff80017885>] vfs_write+0xe1/0x174
    Mar 19 10:59:46 localhost kernel: [<ffffffff8001816a>] sys_write+0x45/0x6e
    Mar 19 10:59:46 localhost kernel: [<ffffffff800602a6>] tracesys+0xd5/0xdf
    Mar 19 10:59:46 localhost kernel:
    Mar 19 11:12:11 localhost kernel: BUG: spinlock lockup on CPU#7, mysqld/20042, ffff81012c4afc08 (Tainted: G )
    Mar 19 11:12:11 localhost kernel:
    Mar 19 11:12:11 localhost kernel: Call Trace:
    Mar 19 11:12:11 localhost kernel: [<ffffffff800079d6>] _raw_spin_lock+0xcd/0xeb
    Mar 19 11:12:11 localhost kernel: [<ffffffff8006843f>] _spin_lock+0x47/0x52
    Mar 19 11:12:11 localhost kernel: [<ffffffff8004d16a>] unix_stream_sendmsg+0x255/0x363
    Mar 19 11:12:11 localhost kernel: [<ffffffff8003a041>] do_sock_write+0xc6/0x102
    Mar 19 11:12:11 localhost kernel: [<ffffffff80049f8e>] sock_aio_write+0x4f/0x5e
    Mar 19 11:12:11 localhost kernel: [<ffffffff8001930a>] do_sync_write+0xc7/0x104
    Mar 19 11:12:11 localhost kernel: [<ffffffff800a7596>] autoremove_wake_function+0x0/0x2e
    Mar 19 11:12:11 localhost kernel: [<ffffffff8013b076>] file_has_perm+0x48/0xa3
    Mar 19 11:12:11 localhost kernel: [<ffffffff80017885>] vfs_write+0xe1/0x174
    Mar 19 11:12:11 localhost kernel: [<ffffffff8001816a>] sys_write+0x45/0x6e
    Mar 19 11:12:11 localhost kernel: [<ffffffff800602a6>] tracesys+0xd5/0xdf


    Please help me resolve this issue, as this is a production server and the problem recurs periodically.

    Below is the system information:

    [root@localhost ~]# uname -a
    Linux localhost.localdomain 2.6.18-238.19.1.el5debug #1 SMP Fri Jul 15 09:01:56 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux


    [root@localhost ~]# lspci
    00:00.0 Host bridge: Intel Corporation Core Processor DMI (rev 11)
    00:03.0 PCI bridge: Intel Corporation Core Processor PCI Express Root Port 1 (rev 11)
    00:05.0 PCI bridge: Intel Corporation Core Processor PCI Express Root Port 3 (rev 11)
    00:08.0 System peripheral: Intel Corporation Core Processor System Management Registers (rev 11)
    00:08.1 System peripheral: Intel Corporation Core Processor Semaphore and Scratchpad Registers (rev 11)
    00:08.2 System peripheral: Intel Corporation Core Processor System Control and Status Registers (rev 11)
    00:08.3 System peripheral: Intel Corporation Core Processor Miscellaneous Registers (rev 11)
    00:10.0 System peripheral: Intel Corporation Core Processor QPI Link (rev 11)
    00:10.1 System peripheral: Intel Corporation Core Processor QPI Routing and Protocol Registers (rev 11)
    00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
    00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)
    00:1c.4 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 5 (rev 05)
    00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
    00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
    00:1f.0 ISA bridge: Intel Corporation 3400 Series Chipset LPC Interface Controller (rev 05)
    00:1f.2 IDE interface: Intel Corporation 5 Series/3400 Series Chipset 4 port SATA IDE Controller (rev 05)
    00:1f.5 IDE interface: Intel Corporation 5 Series/3400 Series Chipset 2 port SATA IDE Controller (rev 05)
    01:03.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (rev 0a)
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
    02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
    05:00.0 PCI bridge: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge (rev aa)
    06:04.0 Network controller: Sangoma Technologies Corp. A104d QUAD T1/E1 AFT card


    [root@localhost ~]# cat /proc/interrupts
               CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
       0: 329604018         0         0         0         0         0         0         0  IO-APIC-edge   timer
       8:         1         0         0         0         0         0         0         0  IO-APIC-edge   rtc
       9:         0         0         0         0         0         0         0         0  IO-APIC-level  acpi
      58:        48         0         0         0         0         0    302492     77909  PCI-MSI-X      eth1-0
      66:        31        67        86      1256         0       575      8253         0  PCI-MSI-X      eth1-1
      74:        22       223       216     21737         0      1018       920         0  PCI-MSI-X      eth1-2
      82:        23      2612      1128       207         0     14616       406       930  PCI-MSI-X      eth1-3
      90:        22       118      7263      7790         0         8         0      7709  PCI-MSI-X      eth1-4
      98:        37     17838       316      3724         0      6702       309     25094  PCI-MSI-X      eth1-5
     106:        21       225        95       132         0      6673      7937     34191  PCI-MSI-X      eth1-6
     114:        30       923        48        38         0       404       113       323  PCI-MSI-X      eth1-7
     122:         2         0         0         0         0         0         0         0  PCI-MSI-X      cnic
     169:     43305         0         0 329418631         0         0         0         0  IO-APIC-level  wanpipe1, wanpipe2, wanpipe3, wanpipe4, wanpipe5, wanpipe6, wanpipe7, wanpipe8
     217:       131         0         0         0         0         0         0         0  IO-APIC-level  ehci_hcd:usb1, ehci_hcd:usb2
     225:     12858         0    271539         0         0         0         0         0  IO-APIC-level  ata_piix
     233:       222   3565275         0         0         0         0         0         0  IO-APIC-level  ata_piix
     NMI:     40211      9293      9929    153109     34809      9374      9205     14554
     LOC: 329603953 329603848 329603776 329510665 329603638 329603562 329603481 329603420
     ERR:         0
     MIS:         0


    [root@localhost ~]# asterisk -V
    Asterisk 1.6.2.19

    dahdi-linux-complete-2.4.0+2.4.0

    MySQL Server version: 5.0.77 Source distribution

    Please let me know if any other information is required.
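
    To see how often the lockup is reported and by which processes, the log could be condensed with a small filter. This is only a sketch: the helper name summarize_lockups is hypothetical, the field layout is taken from the two log entries quoted above, and the log path in the usage line is an assumption (RHEL-family systems usually write kernel messages to /var/log/messages).

```shell
# summarize_lockups: read syslog text on stdin and count
# "BUG: spinlock lockup" reports per CPU and process name.
# Hypothetical helper; the pattern follows the log lines quoted in this post.
summarize_lockups() {
  sed -n 's|.*BUG: spinlock lockup on \(CPU#[0-9]*\), \([^/]*\)/[0-9]*,.*|\1 \2|p' |
    sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Usage (log path is an assumption):
#   summarize_lockups < /var/log/messages
```

    For the two entries quoted above this prints "CPU#7 asterisk 1" and "CPU#7 mysqld 1", one per line.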

  2. #2
    Trusted Penguin Irithori
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    Hi and welcome

    From the log, my guess would be a hard disk error.
    This needs to be verified by looking at dmesg, the md status, or the RAID controller utilities.

    However, there is no RAID controller in the lspci listing, so I hope you have a software RAID in place.
    Then you could replace the broken drive.
    If there is no RAID, my suggestion is to verify the backup of config and data *now*
    and build a replacement machine with it.
    You must always face the curtain with a bow.

  3. #3
    Linux Guru Rubberman
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,748
    Quote: Originally posted by Irithori
    Hi and welcome

    From the log, my guess would be a hard disk error.
    This needs to be verified by looking at dmesg, the md status, or the RAID controller utilities.

    However, there is no RAID controller in the lspci listing, so I hope you have a software RAID in place.
    Then you could replace the broken drive.
    If there is no RAID, my suggestion is to verify the backup of config and data *now*
    and build a replacement machine with it.
    Also, run the SMART tools to see whether you have a failing drive. It could just be a file system problem that has munged the boot kernel or some other files. So, if SMART says the physical drives are OK, boot a recovery or live CD/DVD and run fsck on the various file systems. You can do that with the -n option, which gives you a report without modifying the file systems themselves.
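
    The report-only fsck pass could be wrapped roughly like this. It is a sketch under stated assumptions: check_fs_readonly is a hypothetical helper, MOUNTS_FILE and FSCK_CMD are illustrative overrides rather than standard variables, and the device names in the usage line are placeholders. It skips anything listed as mounted, because fsck results on a mounted file system are not reliable; note that /proc/mounts may name a device differently (for example /dev/root), so the check is only a convenience.

```shell
# check_fs_readonly: run "fsck -n" (report only, no repairs) on each device
# given as an argument, skipping any device that is currently mounted.
# Hypothetical helper; MOUNTS_FILE and FSCK_CMD are illustrative overrides.
check_fs_readonly() {
  mounts_file=${MOUNTS_FILE:-/proc/mounts}
  fsck_cmd=${FSCK_CMD:-fsck}
  for dev in "$@"; do
    # Column 1 of /proc/mounts is the mounted device name.
    if awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts_file"; then
      echo "skip $dev: mounted"
    else
      "$fsck_cmd" -n "$dev"
    fi
  done
}

# Usage from a live CD (device names are placeholders):
#   check_fs_readonly /dev/sda1 /dev/sda2
```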
    Sometimes, real fast is almost as good as real time.
    Just remember, Semper Gumbi - always be flexible!

  5. #4
    Just Joined!
    Join Date
    Mar 2013
    Posts
    9
    Thank you all,

    Are you sure it is related to the hard drives and is not a kernel issue? I thought I could fix it by running yum update -y to update everything on the system, as the installation is more than a year old.

    I don't understand why the word "debug" is appended to the kernel version. Here is the kernel version:

    [root@localhost ~]# uname -r
    2.6.18-238.19.1.el5debug

  6. #5
    Trusted Penguin Irithori
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    Well, if the system ran with that kernel for a year without problems
    and now it suddenly reports errors, then this is because of a change.

    The kernel did not change (as I understand your posts so far), but a hardware error can always happen.
    Also, the call trace indicates issues with writing/syncing files.

    So I would check the hard disks first.
    Or in other words: if a fire alarm rings, don't shut down the fire alarm, extinguish the fire.

  7. #6
    Just Joined!
    Join Date
    Mar 2013
    Posts
    9
    Thanks a lot Irithori

    I will definitely check the hard disks and the other hardware too. Please suggest some good tools I can use to check the hard drives without disrupting service, as this is a production server.

    Regards.

  8. #7
    Trusted Penguin Irithori
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    First of all:
    Make sure there is a recent and functional backup of configs and data.
    Second:
    Make sure there is a recent and functional backup of configs and data.

    Then:
    - Have a look at dmesg for recent messages.
    - As suggested, use the SMART tools (and of course read the man page).
    Insert your <DEVICE> as appropriate.
    Code:
    smartctl -a <DEVICE>
    smartctl -t short <DEVICE>
    smartctl -l selftest <DEVICE>
    - You could also run badblocks on the devices.
    WARNING:
    Be *very* sure which options you use here:
    badblocks is capable of overwriting your hard disk (the -w write-mode test destroys all data on the device).
    The default mode, used below, is a read-only test. See the man page.
    Code:
    badblocks -sv <DEVICE>
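
    For the dmesg step in the list above, the output can be narrowed to lines that commonly accompany disk trouble. A minimal sketch: disk_messages is a hypothetical helper, and the pattern list is an assumption (a heuristic, not a complete catalogue of kernel disk errors), so an empty result does not prove the disks are healthy.

```shell
# disk_messages: keep only kernel log lines that commonly accompany disk trouble.
# Hypothetical helper; the pattern list is a heuristic, not an exhaustive catalogue.
disk_messages() {
  grep -iE 'I/O error|medium error|ata[0-9]+(\.[0-9]+)?: (exception|error|failed)|sd[a-z]+: .*(error|fail)'
}

# Usage:
#   dmesg | disk_messages
```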

  9. #8
    Just Joined!
    Join Date
    Mar 2013
    Posts
    9
    Thanks Irithori,

    I will certainly try the SMART tools, JUST for monitoring the state of the device. If I need any more guidance, I will come back here. Thank you again for your help.

    Regards

  10. #9
    Just Joined!
    Join Date
    Mar 2013
    Posts
    9
    Hi,

    Below is the output of smartctl -a /dev/sda.

    Do Pre-fail and Old_age mean my drives are aging and going bad, so that I need to replace them? I haven't run the smartctl -t short test, as it will take 10 minutes and I fear it might cause a disturbance or a drive failure on my production server. Thanks in advance for your help.

    [root@localhost ~]# smartctl -a /dev/sda
    smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    === START OF INFORMATION SECTION ===
    Device Model: SAMSUNG HE502HJ
    Serial Number: S2B6J90B311768
    Firmware Version: 1AJ30001
    User Capacity: 500,107,862,016 bytes
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: 8
    ATA Standard is: Not recognized. Minor revision code: 0x28
    Local Time is: Fri Apr 5 12:50:22 2013 ICT

    ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x82) Offline data collection activity
    was completed without error.
    Auto Offline Data Collection: Enabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: (4620) seconds.
    Offline data collection
    capabilities: (0x5b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 77) minutes.
    SCT capabilities: (0x003f) SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0026   056   056   000    Old_age   Always       -       4182
      3 Spin_Up_Time            0x0023   083   083   025    Pre-fail  Always       -       5289
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       29
      5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       11151
     10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
     13 Read_Soft_Error_Rate    0x003a   100   100   000    Old_age   Always       -       0
    191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       30
    194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       20 (Lifetime Min/Max 17/36)
    195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
    196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0032   100   100   000    Old_age   Always       -       11151
    241 Unknown_Attribute       0x0032   096   094   000    Old_age   Always       -       5817166
    242 Unknown_Attribute       0x0032   096   095   000    Old_age   Always       -       5901314
    254 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    # 1 Extended offline Completed without error 00% 1 -
    # 2 Short offline Completed without error 00% 0 -

    SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
    SMART Selective self-test log data structure revision number 0
    Warning: ATA Specification requires selective self-test log data structure revision number = 1
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Completed [00% left] (0-65535)
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    [root@localhost ~]#
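
    On the Pre-fail / Old_age question: in smartctl's table the TYPE column is the category of each attribute, not its current status. An attribute counts as failed only when its normalized VALUE drops to or below THRESH, and WHEN_FAILED then shows something other than "-". A rough way to scan the table for rows worth a closer look is sketched below. The helper name flag_smart_attrs is hypothetical, and also watching the raw values of IDs 5, 197 and 198 (reallocated, pending and uncorrectable sectors) is an assumption about what is useful, not something smartctl itself reports as a failure.

```shell
# flag_smart_attrs: read the attribute table printed by "smartctl -a" (or -A)
# on stdin and report rows that deserve attention.
# Hypothetical helper; the watched IDs (5, 197, 198) are an assumption.
flag_smart_attrs() {
  awk '
    # Attribute rows start with a numeric ID and have at least 10 fields:
    # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    $1 ~ /^[0-9]+$/ && NF >= 10 && $3 ~ /^0x/ {
      value = $4 + 0; thresh = $6 + 0; raw = $10 + 0
      if (thresh > 0 && value <= thresh) {
        print "FAILING " $2 " (" $7 "): value " $4 " <= thresh " $6
      } else if (($1 == 5 || $1 == 197 || $1 == 198) && raw > 0) {
        print "WATCH " $2 ": raw value " $10
      }
    }
  '
}

# Usage:
#   smartctl -a /dev/sda | flag_smart_attrs
```

    Fed the table quoted above, this prints nothing: no VALUE is at or below its THRESH, and the raw counts for attributes 5, 197 and 198 are all 0.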

  11. #10
    Just Joined!
    Join Date
    Mar 2013
    Posts
    9
    Also, below is the output of another command, smartctl -Hc /dev/sda:


    [root@localhost ~]# smartctl -Hc /dev/sda
    smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is xxxxxxxxxxxxmartmontools.sourceforge.net/

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x82) Offline data collection activity
    was completed without error.
    Auto Offline Data Collection: Enabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: (4620) seconds.
    Offline data collection
    capabilities: (0x5b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 77) minutes.
    SCT capabilities: (0x003f) SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.
