Scientific Linux Forum.org



> NFS issue - files become unavailable
Lasastard
 Posted: Dec 18 2013, 07:50 AM

Dear SL community,

First of all, I hope this topic belongs here - otherwise someone please move it.

That being said, on to the issue.

I have recently installed a small scientific cluster using SL 6.2 (dictated by our InfiniBand hardware, which I can't get to work under 6.4), consisting of 7 compute nodes, 1 head node and a file server. The file server exports folders to the compute nodes, and I have started to notice that files on those shares randomly become unavailable on individual nodes. In practice this means that jobs sent to the queueing system won't start because a binary can't be read or executed on one node, whereas it works perfectly fine on the other nodes.

I can neither trigger this behaviour deliberately nor pin down how to fix it. I can 'ls' the file and it shows all the relevant attributes, but I can't do anything with it, be it reading or executing. Other files in the same folder do not exhibit this behaviour. It's as if the system doesn't think that specific file exists.

Un-mounting and re-mounting the share fixes the problem, but this is obviously not a long-term solution, since it happens again after a random interval (days).

Some details about the setup:

The file server exports a 16.x TB partition (XFS formatted) to the nodes in several pieces. Those pieces are subfolders on the big storage partition: one each for projects, home directories and software.

/data0/projects
/data0/home
/data0/sw

Those folders are mounted on each node as

/projects
/home
/sw

/sw is the mount that causes the problems (so far I haven't seen problems with the other two mounts).

/proc/mounts:

192.168.1.2:/data0/sw /sw nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.100,minorversion=0,local_lock=none,addr=192.168.1.2 0 0

/etc/fstab (node)

192.168.1.2:/data0/sw /sw nfs4 _netdev,auto 0 0


Note: The projects share is actually exported and mounted over InfiniBand for speed - not sure if that would explain why I am only having trouble with the Ethernet-exported shares.
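
The symptom described in this post (a file that lists fine but cannot be read) could be checked across a whole share with a small script run on the affected node. The following is a minimal sketch, not something from the original post: the function name check_unreadable is hypothetical, and it assumes find and GNU head are available, as they are on a stock SL install. It prints every regular file under a directory whose first byte cannot be read.

```shell
#!/bin/sh
# check_unreadable DIR
# Hypothetical helper: print regular files under DIR that can be found
# by find but whose first byte cannot be read by the current user.
check_unreadable() {
    dir=$1
    # -xdev keeps find on the same filesystem as DIR.
    find "$dir" -xdev -type f 2>/dev/null | while IFS= read -r f; do
        # Reading one byte is enough to show whether the file opens;
        # an empty file still counts as readable.
        if ! head -c 1 "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling check_unreadable /sw on each node and comparing the output would show whether the same files are affected everywhere. Files that are unreadable simply because of their permissions are listed too, so it should be run as a user who is expected to have access.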
redman
 Posted: Dec 18 2013, 09:15 AM

QUOTE (Lasastard @ Dec 18 2013, 09:50 AM)
first of all, I hope this topic belongs here - otherwise someone please move it.

Don't worry, this is the right subforum ;)


You could try "nolock" instead of "local_lock=none".
I also use the "intr" option myself.
More details on common NFS mount options can be found here (Red Hat portal).
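
Applied to the /etc/fstab entry from the first post, those two options might look like the line below. This is only a sketch of the suggestion, not a confirmed fix: nfs(5) lists lock/nolock among the options for NFS versions 2 and 3 and describes intr as deprecated after kernel 2.6.25, so it is worth checking in /proc/mounts whether they have any effect on an nfs4 mount.

```
192.168.1.2:/data0/sw /sw nfs4 _netdev,auto,nolock,intr 0 0
```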

--------------------
"Sometimes the best helping hand you can give is a good, firm push."