Posted: Dec 18 2013, 07:50 AM
Member No.: 2855
Joined: 18-December 13
Dear SL community,
first of all, I hope this topic belongs here - otherwise someone please move it.
That being said, on to the issue.
I have recently installed a small scientific cluster using SL 6.2 (dictated by our InfiniBand hardware, which I can't get to work under 6.4), consisting of 7 compute nodes, 1 head node and a file server. The file server exports folders to the compute nodes, and I have started to notice that files on those shares become randomly unavailable on random nodes. In practice this means that jobs sent to the queueing system won't start because a binary can't be read or executed on one node, while it works perfectly fine on the other nodes. I can neither trigger this behaviour deliberately nor pin down how to fix it. I can 'ls' the file and it shows all the relevant attributes, but I can't do anything with it, be it reading or executing. Other files in the same folder do not exhibit this behaviour. It's as if the system doesn't think that specific file exists. Unmounting and re-mounting the share fixes the problem, but this is obviously not a long-term solution, since it happens again after a random interval (days).
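To narrow down which files are affected on a given node, something like the following rough Python sketch might help. It walks a directory and reports files whose metadata can be looked up (what 'ls -l' relies on) but which fail when actually opened and read. The function name find_unreadable is made up for this illustration, and the idea that the failure surfaces as an error on open or read is an assumption based on the symptoms described above.

```python
import os

def find_unreadable(root):
    """Return (path, errno) for files under root that stat() fine but cannot be read."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.stat(path)  # metadata lookup, roughly what 'ls -l' needs
            except OSError:
                continue  # not even stat-able: a different symptom, skipped here
            try:
                with open(path, "rb") as fh:
                    fh.read(1)  # force an actual read from the share
            except (IOError, OSError) as exc:  # IOError covers Python 2 as well
                bad.append((path, exc.errno))
    return bad
```

Running find_unreadable("/sw") on a misbehaving node and on a healthy one, then comparing the two lists, would at least show whether the same files are hit everywhere or only on one client.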
Some details about the setup:
The file server exports a 16.x TB partition (XFS formatted) to the nodes, in several pieces. Those pieces are subfolders on the big storage partition - one each for projects, home directories and software.
Those folders are mounted on each node over NFSv4. The sw share is the one that creates the problems (so far I haven't seen problems with the other two mounts). Below are its mount options as reported on a node, followed by its fstab entry:
192.168.1.2:/data0/sw /sw nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.100,minorversion=0,local_lock=none,addr=192.168.1.2 0 0
192.168.1.2:/data0/sw /sw nfs4 _netdev,auto 0 0
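Since the problem only shows up on some nodes, it may be worth checking whether the effective mount options really are identical everywhere. The sketch below is only an illustration: it assumes lines in the "device mountpoint fstype options dump pass" layout shown above, and the helper names parse_mount_options and diff_options are invented here. It ignores clientaddr by default because that value naturally differs per node.

```python
def parse_mount_options(line):
    """Split one 'device mountpoint fstype options dump pass' line into its parts."""
    device, mountpoint, fstype, options = line.split()[:4]
    parsed = {}
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        # Flags such as 'hard' carry no value; record them as True.
        parsed[key] = value if sep else True
    return device, mountpoint, fstype, parsed

def diff_options(a, b, ignore=("clientaddr",)):
    """Return {option: (value_in_a, value_in_b)} for every option that differs."""
    keys = (set(a) | set(b)) - set(ignore)
    return dict((k, (a.get(k), b.get(k))) for k in keys if a.get(k) != b.get(k))
```

Feeding the /sw line from a healthy node and from a failing node through parse_mount_options and then diff_options would return an empty dict if the two mounts were negotiated the same way.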
Note: The project partition is actually exported and mounted over InfiniBand for speed - not sure if that would explain why I am having trouble only with the Ethernet-exported shares.
Posted: Dec 18 2013, 09:15 AM
Retired SLF Administrator
Member No.: 2
Joined: 8-April 11
Don't worry, right subforum
You could try "nolock" instead of "local_lock=none".
Myself, I use the "intr" option as well.
More details on common NFS mount options can be found here (Red Hat portal).
"Sometimes the best helping hand you can give is a good, firm push."