Scientific Linux Forum.org




> Kernel soft lockup when multiple threads pinned to one core
cliffdi
 Posted: Jan 11 2013, 05:13 PM

Hi,

I have what I think is a kernel bug and just wondered if anyone has any experience with a similar issue.

I have a test app that starts 6 threads. Each is set to run at real-time priority using the round-robin scheduler. They are all pinned to the same core.

There are various reasons why I'm doing this, and testing the kernel under different runtime configurations and stress conditions is one of them. I'm trying to get detailed stats on the performance improvements from changing various kernel parameters. Obviously I'm changing one thing at a time, and I've hit this problem with the out-of-the-box configuration!

The threads do some work (for a microsecond or so) and then go back to sleep for a random interval (anything from 50 microseconds to a few milliseconds). The software is running on a 12-core (2 x 6-core) IBM M3 server and runs continually, collecting lots of performance stats.
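For anyone who wants to try reproducing this, here is a minimal sketch of the kind of setup described above. It is not the actual TestApp1 source: the names (`random_sleep_ns`, `start_rt_thread`, `run_sketch`), the 3 ms upper bound on the sleep, the busy loop standing in for the real work, and the priority passed in are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS  6          /* six threads, as in the post */
#define MIN_SLEEP_NS 50000L     /* 50 microseconds, as in the post */
#define MAX_SLEEP_NS 3000000L   /* "a few millisecs": 3 ms is an assumption */

struct worker_arg {
    unsigned int seed;          /* per-thread seed for rand_r() */
    int iterations;             /* number of work/sleep cycles to run */
};

/* Pick a sleep length in [MIN_SLEEP_NS, MAX_SLEEP_NS]. */
static long random_sleep_ns(unsigned int *seed)
{
    long span = MAX_SLEEP_NS - MIN_SLEEP_NS + 1;
    return MIN_SLEEP_NS + (long)(rand_r(seed) % span);
}

/* Do a tiny amount of work, then sleep for a random interval. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;
    volatile unsigned long sink = 0;

    for (int i = 0; i < arg->iterations; i++) {
        for (int k = 0; k < 200; k++)   /* placeholder for the real work */
            sink += (unsigned long)k;
        struct timespec ts = { 0, random_sleep_ns(&arg->seed) };
        nanosleep(&ts, NULL);
    }
    return NULL;
}

/* Create one SCHED_RR thread pinned to `core`. Returns 0 or an errno value
 * (typically EPERM when run without real-time privileges). */
static int start_rt_thread(pthread_t *tid, int core, int prio,
                           struct worker_arg *arg)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };
    cpu_set_t cpus;
    int rc;

    assert(core >= 0 && core < CPU_SETSIZE);
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);

    pthread_attr_init(&attr);
    /* Without PTHREAD_EXPLICIT_SCHED the policy set below is ignored. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &sp);
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    rc = pthread_create(tid, &attr, worker, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

/* Start NUM_THREADS workers on one core and wait for them to finish.
 * Returns how many threads were actually started. */
static int run_sketch(int core, int prio, int iterations)
{
    pthread_t tids[NUM_THREADS];
    struct worker_arg args[NUM_THREADS];
    int started = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].seed = 1000u + (unsigned int)i;
        args[i].iterations = iterations;
        int rc = start_rt_thread(&tids[i], core, prio, &args[i]);
        if (rc != 0) {
            fprintf(stderr, "thread %d: %s\n", i, strerror(rc));
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return started;
}
```

Calling something like `run_sketch(7, 10, 1000)` from `main()` (core 7 matches the CPU in the dump; the priority of 10 is arbitrary) and building with `gcc -O2 -pthread` should be enough to try it. Creating SCHED_RR threads normally needs root or a suitable rtprio limit, otherwise `pthread_create` fails with EPERM and the function returns 0.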

Occasionally (once every couple of days or so), without any specific cause, one of the threads gets hung up in the kernel when it goes to sleep (soft lockup, see the dump below). As all the threads are pinned to the same core, they all stop running.

I have tried running the same test without setting the core affinities and so far I haven't had the same problem with that configuration. So the problem does seem to be scheduler-related.

I'm using an SL6.3 minimal installation (latest updates) with the standard kernel (2.6.32-279.19.1.el6.x86_64 #1 SMP Tue Dec 18 17:22:54 CST 2012).

Thanks

Cliff

Jan 11 15:23:16 testhost1 kernel: BUG: soft lockup - CPU#7 stuck for 67s! [TestApp1:32607]
Jan 11 15:23:16 testhost1 kernel: Modules linked in: autofs4 sunrpc 8021q garp stp llc ipv6 onload(U) sfc_char(U) sfc_resource(U) sfc_affinity(U) sfc_tune(U) sfc(U) i2c_algo_bit mdio bnx2 cdc_ether usbnet mii microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix qla2xxx scsi_transport_fc scsi_tgt megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Jan 11 15:23:16 testhost1 kernel: CPU 7
Jan 11 15:23:16 testhost1 kernel: Modules linked in: autofs4 sunrpc 8021q garp stp llc ipv6 onload(U) sfc_char(U) sfc_resource(U) sfc_affinity(U) sfc_tune(U) sfc(U) i2c_algo_bit mdio bnx2 cdc_ether usbnet mii microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif pata_acpi ata_generic ata_piix qla2xxx scsi_transport_fc scsi_tgt megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Jan 11 15:23:16 testhost1 kernel:
Jan 11 15:23:16 testhost1 kernel: Pid: 32607, comm: TestApp1 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 IBM System x3650 M3 -[7945N2G]-/69Y5698
Jan 11 15:23:16 testhost1 kernel: RIP: 0033:[<00007f2e35de2dd8>] [<00007f2e35de2dd8>] 0x7f2e35de2dd8
Jan 11 15:23:16 testhost1 kernel: RSP: 002b:00007f2e35399cf0 EFLAGS: 00000297
Jan 11 15:23:16 testhost1 kernel: RAX: 0000000000007f61 RBX: 0000000000fc0708 RCX: 00000039b90e4ae9
Jan 11 15:23:16 testhost1 kernel: RDX: 0000000000007f5f RSI: 00007f2e37944380 RDI: 00007f2e35399cb0
Jan 11 15:23:16 testhost1 kernel: RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000035
Jan 11 15:23:16 testhost1 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Jan 11 15:23:16 testhost1 kernel: R13: 0000000000000035 R14: 0000000000000000 R15: 0000000000fc0700
Jan 11 15:23:16 testhost1 kernel: FS: 00007f2e3539a700(0000) GS:ffff88099fe20000(0000) knlGS:0000000000000000
Jan 11 15:23:16 testhost1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 11 15:23:16 testhost1 kernel: CR2: 00000000012a2b40 CR3: 0000000907397000 CR4: 00000000000006e0
Jan 11 15:23:16 testhost1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 11 15:23:16 testhost1 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 11 15:23:16 testhost1 kernel: Process TestApp1 (pid: 32607, threadinfo ffff8809073fc000, task ffff880907384ae0)
Jan 11 15:23:16 testhost1 kernel:
Jan 11 15:23:16 testhost1 kernel: Call Trace: