Scientific Linux Forum.org




> kernel i/o, system goes to read only for /root
PSchiffer
 Posted: Aug 14 2012, 09:24 AM


SLF Newbie


Group: Members
Posts: 13
Member No.: 1797
Joined: 14-August 12









Hi!

Coming back from my holidays I found a problem with my workstation. I am running SL, which I updated to 6.3 today once it kept running for long enough. The principal error seems to be a kernel I/O error (kernel: journal commit I/O error), which leads the system to remount the /root filesystem read-only and later to lock up (i.e. no access over ssh or directly).
My system is installed on an SSD which has a small /boot partition and a larger one, in an LVM setup. The larger partition is further divided into /root, /home and swap. In addition I've got three 2TB HDDs, two of which are combined in a hardware RAID as /tmp, and the last one is partitioned into two 1TB chunks as my /data and /work. The system has 48 GB of RAM and 24 virtual processors. Everything has been running smoothly for half a year. I can't really remember having changed anything substantial before going on vacation besides mounting a remote Samba share (on an Ubuntu system). I got the startup message that the maximum number of mounts has been reached for my sda HDD. This, however, does not seem to be the main problem, as setting <fsckorder> to 2 for the /data and /work partitions removed the error. Also, as said above, the read-only state is for /root as far as I can see.
I am quite at a loss here and glad for any advice; I am also happy to give more log info (although I couldn't see anything myself, e.g. in dmesg).

Thanks

Phil

PSchiffer
 Posted: Aug 14 2012, 09:31 AM

Update: looking at my boot log, what I get is a lot of
udevd[1022]: worker [1103] unexpectedly returned with status 0x0100
^M
udevd[1022]: worker [1103] failed while handling '/devices/LNXSYSTM:00/LNXTHERM:00'
^M
Wait timeout. Will continue in the background.udevd[1022]: worker [1026] unexpectedly returned with status 0x0100
^M
udevd[1022]: worker [1026] failed while handling '/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1'
^M
udevd[1022]: worker [1048] unexpectedly returned with status 0x0100
^M
udevd[1022]: worker [1048] failed while handling '/devices/pci0000:3e/0000:3e:02.0'

and later on

udevd[1022]: worker [1194] unexpectedly returned with status 0x0100
^M
udevd[1022]: worker [1194] failed while handling '/devices/virtual/cpuid/cpu18'
^M
udevd[1022]: worker [1030] unexpectedly returned with status 0x0100
^M
udevd[1022]: worker [1030] failed while handling '/devices/virtual/cpuid/cpu19'
^M
udevd[1022]: worker [1195] unexpectedly returned with status 0x0100

I am wondering if the ^M is indicating something, as other log files (e.g. dmesg) don't contain any weird characters.
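
On the ^M question: in caret notation ^M is the carriage-return character (byte 0x0D), so it is most likely just a line-ending artifact of how the boot log was captured rather than an error in itself. A minimal Python sketch for stripping it before reading the log; the sample lines are copied from the post above, and the function name is made up for illustration:

```python
def strip_carriage_returns(text: str) -> list[str]:
    """Split log text into lines, dropping carriage returns and empty lines."""
    cleaned = []
    # Normalise CRLF and bare CR to LF first, then split on LF.
    for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        line = line.strip()
        if line:
            cleaned.append(line)
    return cleaned


sample = (
    "udevd[1022]: worker [1103] unexpectedly returned with status 0x0100\r\n"
    "udevd[1022]: worker [1103] failed while handling "
    "'/devices/LNXSYSTM:00/LNXTHERM:00'\r\n"
)

for entry in strip_carriage_returns(sample):
    print(entry)
```

With the carriage returns gone, the remaining lines are the actual udevd messages and can be compared directly with other logs.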
PSchiffer
 Posted: Aug 14 2012, 09:39 AM

Second update (sorry for the fragmentation):
I see that my lvm2-monitor(ing) service is dead. Might that be causing the trouble, having the /root (and /boot and /home) on a logical volume?
tux99
 Posted: Aug 14 2012, 09:55 AM


SLF Moderator
********

Group: Moderators
Posts: 1272
Member No.: 224
Joined: 28-May 11









My guess is your SSD is dying and has defective sectors (actually memory cells).

Do a check of the SSD with smartctl (but boot from a CD or USB stick).

--------------------
My personal SL6 repository, specialized in audio/video software: http://pkgrepo.linuxtech.net/el6/
(can be used together with EPEL and ELRepo repositories) - repository mirror: http://linuxsoft.cern.ch/linuxtech/el6/
PSchiffer
 Posted: Aug 14 2012, 09:59 AM

QUOTE (tux99 @ Aug 14 2012, 09:55 AM)
My guess is your SSD is dying and has defective sectors (actually memory cells).

Do a check of the SSD with smartctl (but boot from a CD or USB stick).


Hmm, yes, I think that might be possible. However, it's only about 8 months old, it has not really had a lot written to it, and Disk Utility said it's OK. Is there another way to check?
tux99
 Posted: Aug 14 2012, 10:03 AM

QUOTE (PSchiffer @ Aug 14 2012, 11:59 AM)

Hmm, yes, I think that might be possible. However, it's only about 8 months old, it has not really had a lot written to it


What brand and model is it? Some SSDs are very flakey and have high failure rates.

QUOTE
and disk util said it's ok. Is there another way to check?


Post the output of:

smartctl --all /dev/sdX (where X is the device letter of your ssd)

PSchiffer
 Posted: Aug 14 2012, 10:22 AM

QUOTE (tux99 @ Aug 14 2012, 10:03 AM)
QUOTE (PSchiffer @ Aug 14 2012, 11:59 AM)

Hmm, yes, I think that might be possible. However, it's only about 8 months old, it has not really had a lot written to it


What brand and model is it? Some SSDs are very flakey and have high failure rates.

QUOTE
and disk util said it's ok. Is there another way to check?


Post the output of:

smartctl --all /dev/sdX (where X is the device letter of your ssd)


I know, but given the price of this workstation I would have guessed that the vendor put in something sensible. It seems to be a Micron C400 SSD 128GB (I would need to open the computer to check the label, as my invoice only says Highspeed SATA III SSD...).

Anyway, the output of smartctl --all /dev/sdb is
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: C400-MTFDDAC128MAM
Serial Number: 0000000011460320397F
Firmware Version: 0009
User Capacity: 128.035.676.160 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Tue Aug 14 12:14:14 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 595) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 9) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       5225
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       46
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       1
174 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0022   100   100   001    Old_age   Always       -       94489346069
183 Runtime_Bad_Block       0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 High_Fly_Writes         0x000e   100   100   001    Old_age   Always       -       84
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 Data_Address_Mark_Errs  0x0018   100   100   001    Old_age   Offline      -       0
206 Flying_Height           0x000e   100   100   001    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
tux99
 Posted: Aug 14 2012, 11:30 AM

The Crucial/Micron SSDs are actually among the better ones (but of course even the best ones can fail).

There isn't much useful info in that output, Program_Fail_Cnt_Total is the only attribute that looks suspicious.
I found this with regards to it:
QUOTE
Description
Program Fail Count (chip) S.M.A.R.T. parameter indicates a number of flash program failures.

Recommendations
This parameter is considered informational by the most hardware vendors. Although degradation of this parameter can be an indicator of drive aging and/or potential electromechanical problems, it does not directly indicate imminent drive failure. Regular backup is recommended. Pay closer attention to other parameters and overall drive health.
http://lime-technology.com/forum/index.php?topic=13946.0

Unfortunately the smartctl version included by default in SL6 is rather old and doesn't recognize many SSD specific attributes.
You could get a newer version in my linuxtech-backports repo and try with that again:
http://pkgrepo.linuxtech.net/el6/backports/x86_64/smartmontools-5.41-1.el6.x86_64.rpm

For example the output on my Transcend SSD looks like this (the two lines with SSD specific wear and failure info are in bold):

QUOTE
SMART Attributes Data Structure revision number: 1280
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      368
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      0
194 Temperature_Celsius    0x0007  032  100  000    Pre-fail  Always      -      0
229 Halt_System/Flash_ID    0x0002  100  ---  000    Old_age  Always      -      0x00ecd551a668ecd5
232 Firmware_Version_Info  0x0002  100  ---  000    Old_age  Always      -      0x3038303832370802
233 ECC_Fail_Record        0x0002  100  ---  000    Old_age  Always      -      0x000000000000
234 Avg/Max_Erase_Ct        0x0002  100  ---  000    Old_age  Always      -      1155/1369
235 Good/Sys_Block_Ct      0x0002  100  ---  000    Old_age  Always      -      16148/860


Also you could try to run an extended disk self-test (but you should do that while booted from a CD or USB drive with all filesystems on the SSD unmounted):
smartctl --test=long /dev/sdb

tux99
 Posted: Aug 14 2012, 11:44 AM

Pages 4 and 5 of this PDF provide better explanations of the SMART attributes for your SSD:
http://www.micron.com/~/media/Documents/Products/Technical%20Note/Solid%20State%20Storage/5611tnfd03.ashx

PSchiffer
 Posted: Aug 14 2012, 12:22 PM

QUOTE (tux99 @ Aug 14 2012, 11:44 AM)
Pages 4 and 5 of this PDF provide better explanations of the SMART attributes for your SSD:
http://www.micron.com/~/media/Documents/Products/Technical%20Note/Solid%20State%20Storage/5611tnfd03.ashx



Hmm, I don't see a difference between the two outputs (pasting only the sdiff differences below); the newer version just additionally reports the sector size and the LU WWN Id. I will go through the PDF (many thanks!) and run the long test after booting from a live DVD. Will be back with the output after that.

Short self-test routine Short self-test routine
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-279.1.1.e | smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (l
Copyright © 2002-11 by Bruce Allen, http://smartmontools.so | Copyright © 2002-10 by Bruce Allen, http://smartmontools.so

LU WWN Device Id: 5 00a075 10320397f <

User Capacity: 128.035.676.160 bytes [128 GB] | User Capacity: 128.035.676.160 bytes
Sector Size: 512 bytes logical/physical <

Local Time is: Tue Aug 14 14:06:20 2012 CEST | Local Time is: Tue Aug 14 12:14:14 2012 CEST

data collection: ( 595) seconds. | data collection: ( 595) seconds.

SCT Error Recovery Co <

9 Power_On_Hours 0x0032 100 100 001 Old_a | 9 Power_On_Hours 0x0032 100 100 001 Old_a
12 Power_Cycle_Count 0x0032 100 100 001 Old_a | 12 Power_Cycle_Count 0x0032 100 100 001 Old_a
tux99
 Posted: Aug 14 2012, 12:35 PM

It could well be that smartctl 5.41 is still too old for your SSD and doesn't know about the correct smart attributes yet.
As you can see in the PDF, several attributes have a different name compared to the smartctl output and the data is probably not shown correctly either.

tux99
 Posted: Aug 14 2012, 12:53 PM

Ok, I have quickly rebuilt the latest version (5.43) of the smartmontools package, you can find it here:
http://pkgrepo.linuxtech.net/el6/backports/x86_64/smartmontools-5.43-2.el6.x86_64.rpm

PSchiffer
 Posted: Aug 14 2012, 01:40 PM

QUOTE (tux99 @ Aug 14 2012, 12:35 PM)
It could well be that smartctl 5.41 is still too old for your SSD and doesn't know about the correct smart attributes yet.
As you can see in the PDF, several attributes have a different name compared to the smartctl output and the data is probably not shown correctly either.


Yes, you were right. 5.43 seems to work actually. At least it gets
Model Family: Crucial/Micron RealSSD C300/C400/m4
and also the formerly unknown attributes are in accordance with the pdf (though there is one, 184 End-to-End_Error, which is not in the document). Is it safe to assume that the Pre-fail attributes indicate an imminent failure then? Guess it's time to call the vendor and ask for replacement (?) - hopefully under warranty. (Long test still not run).

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       5227
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       48
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Levelling_Count    0x0033   100   100   010    Pre-fail  Always       -       1
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       0
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       22 1 21
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       84
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       0
tux99
 Posted: Aug 14 2012, 02:05 PM

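Pre-fail is only the attribute type: it means that if the normalized VALUE ever drops to or below THRESH, a failure is considered imminent. It does not mean the drive is failing right now (all your values are at 100, well above their thresholds).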

As far as I understand it the only attribute that is non-zero and that according to that PDF is relevant with regards to a warranty claim is:

173 Wear_Levelling_Count 0x0033 100 100 010 Pre-fail Always - 1

But I'm not sure how to interpret that value, for example the Wear_Levelling_Count on my SSD is currently 1155/1369 (avg/max).

TBH judging purely by the smart output I would think the drive is still OK, but the fact that you are having problems with the kernel remounting the filesystem to read-only (which usually happens when the kernel has detected an uncorrectable error on the device) seems to indicate that the drive has a problem.

Have you tried a full forced fsck yet, just to see if there are filesystem errors already?

If not try it after the extended smart test has completed, do it still while booted with the live CD.

I don't know what else to suggest, I guess contacting Crucial/Micron support could be a good idea but that usually takes time so it's not a quick solution.
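
The rule smartctl applies to these columns can be sketched in Python: an attribute is flagged as failing when its normalized VALUE is at or below THRESH, and the TYPE column (Pre-fail or Old_age) only says what kind of failure that would point to. This is a rough illustration, not part of smartmontools; the class and function names are invented, and the rows are copied from the output posted above.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    attr_type: str
    raw: str


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one row of the attribute table printed by 'smartctl --all'."""
    fields = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        attr_type=fields[6],
        raw=" ".join(fields[9:]),  # the raw value may contain spaces
    )


def is_failing(attr: SmartAttribute) -> bool:
    """An attribute counts as failing once VALUE is at or below THRESH."""
    return attr.value <= attr.thresh


rows = [
    "5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0",
    "173 Wear_Levelling_Count 0x0033 100 100 010 Pre-fail Always - 1",
    "181 Non4k_Aligned_Access 0x0022 100 100 001 Old_age Always - 22 1 21",
]

for attr in map(parse_attribute, rows):
    status = "FAILING" if is_failing(attr) else "ok"
    print(f"{attr.name}: value={attr.value} thresh={attr.thresh} "
          f"type={attr.attr_type} raw={attr.raw!r} -> {status}")
```

None of the rows posted in this thread comes anywhere near its threshold, which fits the reading that the SMART data alone gives no reason to condemn the drive.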

PSchiffer
 Posted: Aug 14 2012, 02:43 PM

Hi again. Many thanks for all the help!
I need to read a student's thesis now and will thus do the remaining tests (including fsck) overnight (I guess from a GParted live CD). So I will be back with details tomorrow if something comes up from there. It's still strange to me that I can't pinpoint the exact time or the action at which the system switches to read-only. An hour ago I even managed to run two programs (which use quite a lot of memory and processors) without problems, but their data were on other HDDs. Just now I copied all the data from /home (on the SSD) to /tmp, but that was read access of course.

Maybe I will get in contact with my computer vendor anyway (got a next day support in my buying contract).

Just to double-check; you think the
udevd[1022]: worker [1103] unexpectedly returned with status 0x0100
alike errors I posted are not connected to the main problem?

Hey now! I just booted into runlevel 3 (just to see actually) and it is now printing
EXT4-fs related errors on device dm-0 (which must be the logical volume). Also there is something with orphaned inodes, but doing a df -i on all volumes does not show anything suspicious (unlikely I know, but I once managed to use all inodes on a disc on another system).

Guess more tomorrow, but thanks again!

tux99
 Posted: Aug 14 2012, 02:55 PM

QUOTE (PSchiffer @ Aug 14 2012, 04:43 PM)

Just to double-check; you think the
udevd[1022]: worker [1103] unexpectedly returned with status 0x0100
alike errors I posted are not connected to the main problem?

I think these errors are just a consequence of the main problem (when the filesystem switches to read only udevd will have trouble writing to files).

QUOTE (PSchiffer @ Aug 14 2012, 04:43 PM)
Hey now! I just booted into runlevel 3 (just to see actually) and it is now printing EXT4-fs related errors on device dm-0 (which must be the logical volume). Also there is something with orphaned inodes, but doing a df -i on all volumes does not show anything suspicious (unlikely I know, but I once managed to use all inodes on a disc on another system).

That sounds like the filesystem is corrupted, which could be purely a filesystem problem but more likely is a consequence of the SSD i/o errors.

PSchiffer
 Posted: Aug 14 2012, 06:56 PM

QUOTE (PSchiffer @ Aug 14 2012, 04:43 PM)
Just to double-check; you think the
udevd[1022]: worker [1103] unexpectedly returned with status 0x0100
alike errors I posted are not connected to the main problem?

QUOTE (tux99 @ Aug 14 2012, 02:55 PM)
I think these errors are just a consequence of the main problem (when the filesystem switches to read only udevd will have trouble writing to files).

Hmm, but the udevd errors come already during boot, so before the file system goes read only.

QUOTE (PSchiffer @ Aug 14 2012, 04:43 PM)
Hey now! I just booted into runlevel 3 (just to see actually) and it is now printing EXT4-fs related errors on device dm-0 (which must be the logical volume). Also there is something with orphaned inodes, but doing a df -i on all volumes does not show anything suspicious (unlikely I know, but I once managed to use all inodes on a disc on another system).

QUOTE (tux99 @ Aug 14 2012, 02:55 PM)
That sounds like the filesystem is corrupted, which could be purely a filesystem problem but more likely is a consequence of the SSD i/o errors.

fsck says all volumes in the LVM are clean (the problem actually seems to be with /root). I am wondering if it would make sense to reformat and copy the system back (or do a fresh install), just to make sure it's really not a software issue.
redman
 Posted: Aug 14 2012, 08:15 PM


Retired SLF Administrator
********

Group: Admins
Posts: 1276
Member No.: 2
Joined: 8-April 11









PSchiffer, please correct the above message and make sure your quotations are correct.

--------------------
"Sometimes the best helping hand you can give is a good, firm push."
tux99
 Posted: Aug 14 2012, 09:51 PM

QUOTE (PSchiffer @ Aug 14 2012, 08:56 PM)
Hmm, but the udevd errors come already during boot, so before the file system goes read only.

Ok then they aren't caused by the fs going ro, but I still think they are secondary issues caused by the main problem, udev is complaining about several devices unrelated to each other so it's very unlikely those devices all have a problem, more likely udev is malfunctioning for some reason (i.e. this is just an effect of the underlying cause).
That said it might be worth doing a 24 hour memtest86 check of your workstation, because in case the culprit isn't the SSD then it could be the RAM (or even the motherboard or the cpu or the PSU, but those are IMHO less likely).


QUOTE (PSchiffer @ Aug 14 2012, 08:56 PM)
fsck says all volumes in the lvm are clean (the problem seems to be actually with /root). wondering if it would make sense to reformat and copy the system back (or do a fresh install), just to make sure it's really not a software issue.


Your reply isn't clear, it would have been more useful if you had posted the output of the fsck (including the command you ran).
Did you do a forced fsck "e2fsck -f" for every filesystem on the SSD?
Are you saying /root had errors and got repaired but the other filesystems didn't have errors?

PSchiffer
 Posted: Aug 15 2012, 10:48 AM

QUOTE
That said it might be worth doing a 24 hour memtest86 check of your workstation, because in case the culprit isn't the SSD then it could be the RAM (or even the motherboard or the cpu or the PSU, but those are IMHO less likely).

Okay, I will be looking at that next. Thanks.



QUOTE
Your reply isn't clear, it would have been more useful if you had posted the output of the fsck (including the command you ran).
Did you do a forced fsck "e2fsck -f" for every filesystem on the SSD?
Are you saying /root had errors and got repaired but the other filesystems didn't have errors?


Sorry about that. I am posting the output of e2fsck -f below. Please note that under the live environment dm-0, which is mentioned in the error reports, becomes dm-2. What you see is e2fsck -f for sda1, which is /boot and not in the LV, and then the output for /root and /home in the LV, each run twice (omitting swap).

/dev/sda1: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: 53/128016 files (7.5% non-contiguous), 108862/512000 blocks


/dev/vg_rechenknecht/lv_root: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vg_rechenknecht/lv_root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vg_rechenknecht/lv_root: 189472/3276800 files (0.2% non-contiguous), 3936475/13107200 blocks

e2fsck -f /dev/dm-2
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/dm-2: 189472/3276800 files (0.2% non-contiguous), 3936475/13107200 block


e2fsck -f /dev/vg_rechenknecht/lv_home
e2fsck 1.41.12 (17-May-2010)
/dev/vg_rechenknecht/lv_home: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/vg_rechenknecht/lv_home: 116900/1277952 files (0.2% non-contiguous), 3238986/5107712 blocks

e2fsck -f /dev/dm-4
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/dm-4: 116900/1277952 files (0.2% non-contiguous), 3238986/5107712 blocks

I am not really sure what to make of that...
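
To compare runs like these at a glance, the summary line e2fsck prints can be reduced to a few numbers. The Python sketch below is only an illustration (the function name and the returned keys are invented); it reports whether the run modified the filesystem and how full the inode and block tables are, using the lv_root output pasted above.

```python
import re

# Matches e.g. "/dev/sda1: 53/128016 files (7.5% non-contiguous), 108862/512000 blocks"
SUMMARY = re.compile(
    r"^(?P<dev>\S+): (?P<files>\d+)/(?P<inodes>\d+) files "
    r"\((?P<frag>[\d.]+)% non-contiguous\), "
    r"(?P<used>\d+)/(?P<total>\d+) blocks?$"
)


def summarize(output: str) -> dict:
    """Extract usage figures and the 'modified' flag from e2fsck output."""
    result = {"modified": "FILE SYSTEM WAS MODIFIED" in output}
    for line in output.splitlines():
        match = SUMMARY.match(line.strip())
        if match:
            result["device"] = match["dev"]
            result["inode_use_pct"] = round(
                100 * int(match["files"]) / int(match["inodes"]), 1)
            result["block_use_pct"] = round(
                100 * int(match["used"]) / int(match["total"]), 1)
    return result


lv_root_run = """\
/dev/vg_rechenknecht/lv_root: recovering journal
Pass 5: Checking group summary information

/dev/vg_rechenknecht/lv_root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vg_rechenknecht/lv_root: 189472/3276800 files (0.2% non-contiguous), 3936475/13107200 blocks
"""

print(summarize(lv_root_run))
```

The inode figure is the same information df -i gives for a mounted filesystem, so it also confirms that running out of inodes is not the issue here.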
tux99
 Posted: Aug 15 2012, 11:21 AM

Well, on the /root fs the "***** FILE SYSTEM WAS MODIFIED *****" indicates that fsck found filesystem problems of some kind and had to correct them.
Of course filesystem problems don't necessarily mean disk problems (just as disk problems don't necessarily cause filesystem problems, since fsck only checks the fs metadata, not the integrity of the actual data). Filesystem problems could have been caused by a simple unclean shutdown.

Try booting the system normally from the SSD again now that you have done the fsck, and see if you again get errors of any kind (udevd errors, ext4 errors, i/o errors, fs switching read-only or anything else unusual). Also check with smartctl whether any of the attribute values have changed.

PSchiffer
 Posted: Aug 16 2012, 07:13 AM

I think I can put this issue to SOLVED:
To my own embarrassment (as I think I should have checked this much earlier) it comes down to a firmware update needed on the SSD.

QUOTE
Try booting the system normally from the SSD again now that you have done the fsck and see if you get again errors of any kind (udevd errors, ext4 errors, i/o errors, fs switching read-only or anything else unusual). Also check with smartctl if any of the attribute values have changed.


As the error was still there, and a couple of hours of memtest turned up nothing, I finally called the workstation vendor, and, well, they just said: "Oh yes, there is a firmware update for your SSD that addresses exactly this issue". I am quoting the Micron document below, just to show how ridiculous it is. Since the update the computer has been running sweetly for more than 12 hours, even when I put substantial load on the system.

Many thanks again for all the advice and help tux99, I guess I learned a lot.

QUOTE
• The C400 drive may experience a condition in which an incorrect response to a SMART counter will cause the C400 drive to become unresponsive after 5184 hours of power-on time. Although the drive may recover after a power cycle, such failure may repeat once per hour after reaching this condition. Even if the drive has reached 5184 hours of power-on time and experienced the foregoing offline condition, the drive will allow the end user to successfully apply the firmware update to correct the condition. If the drive has not yet reached 5184 hours of power-on time, the installation of the firmware update to the drive will prevent the foregoing condition from occurring.
• This firmware update is strongly recommended for all drives currently in the field to avoid interruptions to normal computing operations and to ensure satisfactory user experiences.
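
The 5184-hour figure in the notice (216 days of continuous power-on) can be checked against the Power_On_Hours raw values posted earlier in this thread, 5225 and 5227. A small Python sketch of that arithmetic; the constant and function names are made up for illustration:

```python
BUG_THRESHOLD_HOURS = 5184  # figure quoted from the Micron notice above


def hours_past_threshold(power_on_hours: int) -> int:
    """Hours beyond the threshold (negative if it has not been reached yet)."""
    return power_on_hours - BUG_THRESHOLD_HOURS


# Power_On_Hours raw values from the two smartctl outputs in this thread.
for reading in (5225, 5227):
    delta = hours_past_threshold(reading)
    state = "past" if delta >= 0 else "before"
    print(f"{reading} h: {abs(delta)} h {state} the {BUG_THRESHOLD_HOURS} h mark")

print(f"{BUG_THRESHOLD_HOURS} h = {BUG_THRESHOLD_HOURS // 24} days")
```

Both readings are a few dozen hours past the threshold, which is consistent with the problem first showing up around the time of the holiday.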


redman
 Posted: Aug 16 2012, 09:34 AM

Thanks for the feedback.
As far as the firmware goes, you wouldn't be the first one to forget that.

tux99
 Posted: Aug 16 2012, 10:11 AM

I'm glad you solved it, and I agree with you that it's ridiculous that these days you need to worry even about firmware updates for SSDs... (this never used to be the case for hard disks).
