Scientific Linux Forum.org




> NFSD/KSWAPD problems on high usage
MikeDacre
 Posted: Oct 21 2013, 08:35 PM

I am having trouble with a small cluster that runs Scientific Linux. We have a 28TB RAID drive connected through an LSI MegaRAID card. It has an XFS file system and is mounted as an NFS v3 drive on 21 other machines over InfiniBand DDR x4 (10Gb/s). These machines form a cluster controlled by Torque, which means that highly parallel filesystem access is quite common.

Recently, whenever drive usage approaches 100% I/O (around 500MB/s write speed), I get the following error messages printed to the console on the server:

CODE
INFO: task nfsd:13278 blocked for more than 120 seconds
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.


On the other machines, I get this message:
CODE

nfs: server 192.168.2.1 not responding, still trying
nfs: server 192.168.2.1 OK


All of these machines are on SL 6.2, kernel version 2.6.32-358.23.2.el6.x86_64, with all software including fastbugs up to date as of last Friday. This problem happened before I switched to the fastbugs repository.

The machine has 32GB of RAM and 32GB of swap space. Am I just running out of memory and swap? I can't upgrade the memory further on this machine, but I can add an extra SSD as swap, perhaps as large as 512GB. Will that help?

Other important entries from /var/log/messages:

Many messages equivalent to this:
CODE
Oct 21 10:34:45 fruster kernel: nfsd: page allocation failure. order:5, mode:0x20
Oct 21 10:34:45 fruster kernel: Pid: 13387, comm: nfsd Not tainted 2.6.32-358.23.2.el6.x86_64 #1
Oct 21 10:34:45 fruster kernel: Call Trace:
Oct 21 10:34:45 fruster kernel: [<ffffffff8112c287>] ? __alloc_pages_nodemask+0x757/0x8d0
Oct 21 10:34:45 fruster kernel: [<ffffffff81166dc2>] ? kmem_getpages+0x62/0x170
Oct 21 10:34:45 fruster kernel: [<ffffffff811679da>] ? fallback_alloc+0x1ba/0x270
Oct 21 10:34:45 fruster kernel: [<ffffffff8116742f>] ? cache_grow+0x2cf/0x320
Oct 21 10:34:45 fruster kernel: [<ffffffff81167759>] ? ____cache_alloc_node+0x99/0x160
Oct 21 10:34:45 fruster kernel: [<ffffffff8143f7b2>] ? pskb_expand_head+0x62/0x270
Oct 21 10:34:45 fruster kernel: [<ffffffff81168529>] ? __kmalloc+0x189/0x220
Oct 21 10:34:45 fruster kernel: [<ffffffff8143f7b2>] ? pskb_expand_head+0x62/0x270
Oct 21 10:34:45 fruster kernel: [<ffffffff8143f006>] ? skb_checksum+0x56/0x2e0
Oct 21 10:34:45 fruster kernel: [<ffffffff8144008a>] ? __pskb_pull_tail+0x2aa/0x360
Oct 21 10:34:45 fruster kernel: [<ffffffff814493de>] ? dev_hard_start_xmit+0x2be/0x530
Oct 21 10:34:45 fruster kernel: [<ffffffff81485ae8>] ? ip_output+0xb8/0xc0
Oct 21 10:34:45 fruster kernel: [<ffffffff814677aa>] ? sch_direct_xmit+0x15a/0x1c0
Oct 21 10:34:45 fruster kernel: [<ffffffff8144d130>] ? dev_queue_xmit+0x3b0/0x550
Oct 21 10:34:45 fruster kernel: [<ffffffff81452c0d>] ? neigh_connected_output+0xbd/0x100
Oct 21 10:34:45 fruster kernel: [<ffffffff81485957>] ? ip_finish_output+0x237/0x310
Oct 21 10:34:45 fruster kernel: [<ffffffff81485ae8>] ? ip_output+0xb8/0xc0
Oct 21 10:34:45 fruster kernel: [<ffffffff8149c4de>] ? tcp_write_xmit+0x20e/0xa20
Oct 21 10:34:45 fruster kernel: [<ffffffff81484de5>] ? ip_local_out+0x25/0x30
Oct 21 10:34:45 fruster kernel: [<ffffffff814852c0>] ? ip_queue_xmit+0x190/0x420
Oct 21 10:34:45 fruster kernel: [<ffffffff81510c8b>] ? _spin_unlock_bh+0x1b/0x20
Oct 21 10:34:45 fruster kernel: [<ffffffff8149a0be>] ? tcp_transmit_skb+0x40e/0x7b0
Oct 21 10:34:45 fruster kernel: [<ffffffff81499777>] ? tcp_init_tso_segs+0x37/0x50
Oct 21 10:34:45 fruster kernel: [<ffffffff8149c4cb>] ? tcp_write_xmit+0x1fb/0xa20
Oct 21 10:34:45 fruster kernel: [<ffffffff8149cd20>] ? tcp_push_one+0x30/0x40
Oct 21 10:34:45 fruster kernel: [<ffffffff8148df79>] ? tcp_sendpage+0x569/0x580
Oct 21 10:34:45 fruster kernel: [<ffffffff814343d8>] ? kernel_sendpage+0x58/0x90
Oct 21 10:34:45 fruster kernel: [<ffffffffa0695f1d>] ? svc_send_common+0xfd/0x160 [sunrpc]
Oct 21 10:34:45 fruster kernel: [<ffffffffa0695ff2>] ? svc_sendto+0x72/0x1f0 [sunrpc]
Oct 21 10:34:45 fruster kernel: [<ffffffffa0696bed>] ? auth_domain_put+0x1d/0x70 [sunrpc]
Oct 21 10:34:45 fruster kernel: [<ffffffffa0696209>] ? svc_tcp_sendto+0x39/0xa0 [sunrpc]
Oct 21 10:34:45 fruster kernel: [<ffffffffa06a168b>] ? svc_send+0xab/0xf0 [sunrpc]
Oct 21 10:34:45 fruster kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
Oct 21 10:34:45 fruster kernel: [<ffffffffa0693c80>] ? svc_process+0x130/0x160 [sunrpc]
Oct 21 10:34:45 fruster kernel: [<ffffffffa071cb62>] ? nfsd+0xc2/0x160 [nfsd]
Oct 21 10:34:45 fruster kernel: [<ffffffffa071caa0>] ? nfsd+0x0/0x160 [nfsd]
Oct 21 10:34:45 fruster kernel: [<ffffffff81096a36>] ? kthread+0x96/0xa0
Oct 21 10:34:45 fruster kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Oct 21 10:34:45 fruster kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Oct 21 10:34:45 fruster kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20


Also this:
CODE
Oct 21 12:40:00 fruster kernel: kswapd0: page allocation failure. order:5, mode:0x20
Oct 21 12:40:00 fruster kernel: Pid: 98, comm: kswapd0 Not tainted 2.6.32-358.23.2.el6.x86_64 #1
Oct 21 12:40:00 fruster kernel: Call Trace:
Oct 21 12:40:00 fruster kernel: <IRQ>  [<ffffffff8112c287>] ? __alloc_pages_nodemask+0x757/0x8d0
Oct 21 12:40:00 fruster kernel: [<ffffffff81166dc2>] ? kmem_getpages+0x62/0x170
Oct 21 12:40:00 fruster kernel: [<ffffffff811679da>] ? fallback_alloc+0x1ba/0x270
Oct 21 12:40:00 fruster kernel: [<ffffffff8116742f>] ? cache_grow+0x2cf/0x320
Oct 21 12:40:00 fruster kernel: [<ffffffff81167759>] ? ____cache_alloc_node+0x99/0x160
Oct 21 12:40:00 fruster kernel: [<ffffffff8143f7b2>] ? pskb_expand_head+0x62/0x270
Oct 21 12:40:00 fruster kernel: [<ffffffff81168529>] ? __kmalloc+0x189/0x220
Oct 21 12:40:00 fruster kernel: [<ffffffff8143f7b2>] ? pskb_expand_head+0x62/0x270
Oct 21 12:40:00 fruster kernel: [<ffffffff8143f006>] ? skb_checksum+0x56/0x2e0
Oct 21 12:40:00 fruster kernel: [<ffffffff8144008a>] ? __pskb_pull_tail+0x2aa/0x360
Oct 21 12:40:00 fruster kernel: [<ffffffff814493de>] ? dev_hard_start_xmit+0x2be/0x530
Oct 21 12:40:00 fruster kernel: [<ffffffffa063fc2b>] ? ipoib_start_xmit+0x10b/0x440 [ib_ipoib]
Oct 21 12:40:00 fruster kernel: [<ffffffff814677aa>] ? sch_direct_xmit+0x15a/0x1c0
Oct 21 12:40:00 fruster kernel: [<ffffffff81449428>] ? dev_hard_start_xmit+0x308/0x530
Oct 21 12:40:00 fruster kernel: [<ffffffff8144d130>] ? dev_queue_xmit+0x3b0/0x550
Oct 21 12:40:00 fruster kernel: [<ffffffff81452c0d>] ? neigh_connected_output+0xbd/0x100
Oct 21 12:40:00 fruster kernel: [<ffffffff81485957>] ? ip_finish_output+0x237/0x310
Oct 21 12:40:00 fruster kernel: [<ffffffff81485ae8>] ? ip_output+0xb8/0xc0
Oct 21 12:40:00 fruster kernel: [<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
Oct 21 12:40:00 fruster kernel: [<ffffffff81484de5>] ? ip_local_out+0x25/0x30
Oct 21 12:40:00 fruster kernel: [<ffffffff814852c0>] ? ip_queue_xmit+0x190/0x420
Oct 21 12:40:00 fruster kernel: [<ffffffff8149a0be>] ? tcp_transmit_skb+0x40e/0x7b0
Oct 21 12:40:00 fruster kernel: [<ffffffff8149c4cb>] ? tcp_write_xmit+0x1fb/0xa20
Oct 21 12:40:00 fruster kernel: [<ffffffff8149ce80>] ? __tcp_push_pending_frames+0x30/0xe0
Oct 21 12:40:00 fruster kernel: [<ffffffff81494913>] ? tcp_data_snd_check+0x33/0x100
Oct 21 12:40:00 fruster kernel: [<ffffffff8149855d>] ? tcp_rcv_established+0x3ed/0x800
Oct 21 12:40:00 fruster kernel: [<ffffffff8149855d>] ? tcp_rcv_established+0x3ed/0x800
Oct 21 12:40:00 fruster kernel: [<ffffffff814a0553>] ? tcp_v4_do_rcv+0x2e3/0x430
Oct 21 12:40:00 fruster kernel: [<ffffffff8149855d>] ? tcp_rcv_established+0x3ed/0x800
Oct 21 12:40:00 fruster kernel: [<ffffffff814a1dde>] ? tcp_v4_rcv+0x4fe/0x8d0
Oct 21 12:40:00 fruster kernel: [<ffffffff8145b4ed>] ? sk_filter+0x9d/0xd0
Oct 21 12:40:00 fruster kernel: [<ffffffff8147fa6d>] ? ip_local_deliver_finish+0xdd/0x2d0
Oct 21 12:40:00 fruster kernel: [<ffffffff8147fcf8>] ? ip_local_deliver+0x98/0xa0
Oct 21 12:40:00 fruster kernel: [<ffffffff8147f1bd>] ? ip_rcv_finish+0x12d/0x440
Oct 21 12:40:00 fruster kernel: [<ffffffff8147f745>] ? ip_rcv+0x275/0x350
Oct 21 12:40:00 fruster kernel: [<ffffffff8144891b>] ? __netif_receive_skb+0x4ab/0x750
Oct 21 12:40:00 fruster kernel: [<ffffffff8144acf8>] ? netif_receive_skb+0x58/0x60
Oct 21 12:40:00 fruster kernel: [<ffffffffa064a52a>] ? ipoib_cm_handle_rx_wc+0x20a/0x7a0 [ib_ipoib]
Oct 21 12:40:00 fruster kernel: [<ffffffff8143dbe8>] ? skb_release_data+0xd8/0x110
Oct 21 12:40:00 fruster kernel: [<ffffffff8143d7db>] ? consume_skb+0x3b/0x80
Oct 21 12:40:00 fruster kernel: [<ffffffffa06490b3>] ? ipoib_cm_handle_tx_wc+0x193/0x340 [ib_ipoib]
Oct 21 12:40:00 fruster kernel: [<ffffffffa0642365>] ? ipoib_poll+0x115/0x1d0 [ib_ipoib]
Oct 21 12:40:00 fruster kernel: [<ffffffff8144d4c3>] ? net_rx_action+0x103/0x2f0
Oct 21 12:40:00 fruster kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
Oct 21 12:40:00 fruster kernel: [<ffffffff810e1760>] ? handle_IRQ_event+0x60/0x170
Oct 21 12:40:00 fruster kernel: [<ffffffff8107710f>] ? __do_softirq+0x11f/0x1e0
Oct 21 12:40:00 fruster kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Oct 21 12:40:00 fruster kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Oct 21 12:40:00 fruster kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
Oct 21 12:40:00 fruster kernel: [<ffffffff81517775>] ? do_IRQ+0x75/0xf0
Oct 21 12:40:00 fruster kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Oct 21 12:40:00 fruster kernel: <EOI>  [<ffffffff8117045d>] ? bit_spin_lock+0xd/0x30
Oct 21 12:40:00 fruster kernel: [<ffffffff811721f1>] ? __mem_cgroup_uncharge_common+0x81/0x300
Oct 21 12:40:00 fruster kernel: [<ffffffff81172480>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
Oct 21 12:40:00 fruster kernel: [<ffffffff81131938>] ? __remove_mapping+0xb8/0x160
Oct 21 12:40:00 fruster kernel: [<ffffffff811327f7>] ? shrink_page_list.clone.3+0x3f7/0x650
Oct 21 12:40:00 fruster kernel: [<ffffffff81133433>] ? shrink_inactive_list+0x343/0x830
Oct 21 12:40:00 fruster kernel: [<ffffffff8116c67f>] ? putback_lru_pages+0x5f/0x80
Oct 21 12:40:00 fruster kernel: [<ffffffff8116d546>] ? migrate_pages+0x276/0x4c0
Oct 21 12:40:00 fruster kernel: [<ffffffff8112cfba>] ? determine_dirtyable_memory+0x1a/0x30
Oct 21 12:40:00 fruster kernel: [<ffffffff8112d067>] ? get_dirty_limits+0x27/0x2f0
Oct 21 12:40:00 fruster kernel: [<ffffffff8111dfcf>] ? zone_watermark_ok+0x1f/0x30
Oct 21 12:40:00 fruster kernel: [<ffffffff81133cce>] ? shrink_mem_cgroup_zone+0x3ae/0x610
Oct 21 12:40:00 fruster kernel: [<ffffffff8117289d>] ? mem_cgroup_iter+0xfd/0x280
Oct 21 12:40:00 fruster kernel: [<ffffffff81133f93>] ? shrink_zone+0x63/0xb0
Oct 21 12:40:00 fruster kernel: [<ffffffff81135355>] ? balance_pgdat+0x705/0x820
Oct 21 12:40:00 fruster kernel: [<ffffffff811355a4>] ? kswapd+0x134/0x3c0
Oct 21 12:40:00 fruster kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Oct 21 12:40:00 fruster kernel: [<ffffffff81135470>] ? kswapd+0x0/0x3c0
Oct 21 12:40:00 fruster kernel: [<ffffffff81096a36>] ? kthread+0x96/0xa0
Oct 21 12:40:00 fruster kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Oct 21 12:40:00 fruster kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Oct 21 12:40:00 fruster kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20


CODE
grep "page allocation failure" /var/log/messages | wc


Returns 93 instances of page allocation failure for the following four programs:
CODE
kswapd0
nfsd
rsyslogd
swapper

Of these, the overwhelming majority were nfsd, with kswapd0 and swapper almost tied at 14 and 13 counts respectively.
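
For anyone who wants to reproduce the per-process breakdown, here is a rough sketch. It assumes the syslog layout shown in the excerpts above, where the process name is the sixth whitespace-separated field followed by a colon. The function name count_alloc_failures is just a label made up for this example.

```shell
# count_alloc_failures FILE: tally "page allocation failure" lines per process.
# Assumes lines look like:
#   Oct 21 10:34:45 fruster kernel: nfsd: page allocation failure. order:5, mode:0x20
# so the process name is field 6 with a trailing colon.
count_alloc_failures() {
    grep "page allocation failure" "$1" |
        awk '{ sub(/:$/, "", $6); count[$6]++ }
             END { for (p in count) print count[p], p }' |
        sort -rn
}

# Usage (needs read access to the log):
#   count_alloc_failures /var/log/messages
```

The output is one line per process, highest count first, which makes it easy to see whether nfsd really dominates.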

If you guys need any other information, let me know and I can send it.

I may also post a copy of this post in another forum if this isn't the right place for it.

Thanks!
toracat
 Posted: Oct 22 2013, 05:32 PM

I've read somewhere that "order:5" means the kernel asked for 32 contiguous pages of memory (2^5), and that increasing the value of "min_free_kbytes" might help. It might be worth a try.
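
To make the arithmetic concrete, here is a small sketch. It assumes the usual 4 KiB page size on x86_64; the helper name order_to_kib and the value 262144 are only illustrative, not a tested recommendation for this server.

```shell
# order_to_kib ORDER [PAGE_KIB]: size in KiB of a contiguous allocation
# of 2^ORDER pages. PAGE_KIB defaults to 4 (assumed x86_64 page size).
order_to_kib() {
    echo $(( (1 << $1) * ${2:-4} ))
}

order_to_kib 5    # 32 pages of 4 KiB each

# Show the current reserve, if the file is readable on this system.
if [ -r /proc/sys/vm/min_free_kbytes ]; then
    cat /proc/sys/vm/min_free_kbytes
fi

# To raise it (as root). 262144 is an illustrative value only:
#   sysctl -w vm.min_free_kbytes=262144
# To make it persistent across reboots, add this line to /etc/sysctl.conf:
#   vm.min_free_kbytes = 262144
```

So an order:5 request needs 128 KiB of physically contiguous free memory. A larger reserve may make such requests more likely to succeed, but it also takes that memory away from the page cache, so it is probably best to raise it in steps and watch whether the failures stop.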

--------------------
ELRepo: repository specializing in hardware support for EL