I. Storage Failure Overview

1. Failure environment
Two RAID 5 arrays, each built from four 600 GB SAS hard disks. Both arrays were divided into LUNs, the LUNs were combined into an LVM structure, and the resulting volume was formatted as an EXT3 file system.

2. Failure analysis
One hard disk went offline unexpectedly and the hot spare came online to replace it. While the hot spare was still synchronizing, a second hard disk went offline, so the synchronization failed. One of the two RAID arrays collapsed, the LVM structure became incomplete, and the file system could no longer be used. Both offline disks were examined: the disk that went offline first could not be recognized at all, which pointed to a hardware fault requiring the drive to be opened for repair; the other disk could still be recognized.

II. Solution Overview

Based on the failure analysis, the following plan was drawn up:
1. Repair the failed disk and image it with the MRT data recovery tool.
2. Use WinHex to make full-disk images of the remaining member disks of the failed RAID and of all member disks of the other RAID.
3. Analyze the data on each disk and reassemble the RAID arrays according to the structure found.
4. Analyze the reassembled arrays, locate the LVM information, and reassemble the LVM volume.
5. Parse the EXT3 file system on the reassembled LVM volume, then recover and export all data.

III. Implementing the Solution

1. Repairing the failed disk
The failed disk was opened for repair. The platters turned out to be severely worn and beyond repair, so the array had to be reassembled with that member missing.

2. Imaging the disks
WinHex was used to make full-disk images of the remaining member disks of the failed RAID array and of all member disks of the healthy RAID array. The imaging results were as follows:
[Screenshot: imaging results]

3. Reassembling the RAID arrays
The low-level data on the disks was analyzed carefully. By parsing the EXT3 file system structures, the disk order, stripe size, parity rotation, and other parameters of both RAID arrays were determined, and both arrays were reassembled in WinHex. The analysis showed that both arrays use a 64 KB block size and left-synchronous parity rotation. The failed array was reassembled with one member missing.

4. Reassembling the LVM structure
After both RAID arrays were reassembled, their low-level data was analyzed to locate and interpret the LVM metadata. The LUNs that serve as PVs (LVM physical volumes) were exported from the two arrays, and UFS Explorer was then used to combine the two PVs and regenerate the LVM logical volume.

5. Recovering the data
With the LVM reassembled, the EXT3 file system in the LV (logical volume) was parsed, and all of its data was recovered and exported. The recovered data is shown below:
[Screenshot: recovered data]

IV. Data Verification

A sample of the recovered data, including compressed archives, was checked. Some files were found to be damaged. Comparing the parsing results with the recovery results confirmed that these files are damaged and cannot be recovered. The preliminary conclusion is that the damage is related to bad sectors on some of the disks in the two RAID arrays. The bad sectors reported for those disks were:

RAID 1:
Disk #2: 67 bad source sectors encountered.
Disk #4: 13 bad source sectors encountered.
RAID 2:
Disk #2: 37 bad source sectors encountered.

V. Recovery Conclusion

Because the failed disk was too badly damaged to repair and several other disks had bad sectors, the reassembled RAID structure may contain defects, and some files are damaged. Most files were verified as successfully recovered, and only a small portion were lost or damaged. The data recovery was completed successfully.
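
The RAID reassembly step (64 KB blocks, one member missing) relies on the fact that in RAID 5 any single block in a stripe row equals the XOR of all the other blocks in that row. The following is a minimal Python sketch of that idea, not the procedure WinHex performs internally. It assumes the parity rotation reported in the write-up corresponds to the common left-synchronous (left-symmetric) layout, that the disk order is already known, and that array data starts at offset 0 of each image. The names `locate`, `read_chunk`, and `xor_blocks` are hypothetical helpers introduced for illustration.

```python
CHUNK = 64 * 1024  # 64 KB block size reported in the write-up


def xor_blocks(blocks):
    """XOR several equal-length byte strings into one."""
    size = len(blocks[0])
    acc = 0
    for block in blocks:
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(size, "big")


def locate(chunk_index, n_disks):
    """Map a logical chunk number to (disk, row).

    Assumed layout: left-synchronous RAID 5. Parity starts on the last
    disk and moves one disk to the left per row; data chunks of a row
    start on the disk right after the parity disk and wrap around.
    """
    row, k = divmod(chunk_index, n_disks - 1)
    parity_disk = (n_disks - 1) - (row % n_disks)
    disk = (parity_disk + 1 + k) % n_disks
    return disk, row


def read_chunk(images, chunk_index, chunk_size=CHUNK):
    """Read one logical chunk from a list of disk images.

    `images` holds binary file-like objects in array order, with None
    in the slot of the missing member (at most one None is supported).
    """
    n_disks = len(images)
    disk, row = locate(chunk_index, n_disks)
    offset = row * chunk_size

    def read(d):
        images[d].seek(offset)
        return images[d].read(chunk_size)

    if images[disk] is not None:
        return read(disk)
    # Degraded read: rebuild the missing chunk from the surviving
    # data and parity chunks of the same row.
    return xor_blocks([read(d) for d in range(n_disks) if d != disk])
```

Chunks that fall on a bad sector of a surviving disk cannot be rebuilt this way when another member is already missing, which is consistent with the damaged files noted in the verification section.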

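The LVM reassembly step combines two PVs into one logical volume. A rough sketch of the address translation involved is shown below, assuming simple linear segments. The segment values, the extent size, and the PV labels `pv0` and `pv1` are hypothetical; in a real case they would come from the LVM metadata found on the PVs. This does not describe how UFS Explorer works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_extent: int     # first logical extent covered by this segment
    extent_count: int     # number of extents in the segment
    pv: str               # hypothetical PV label, e.g. "pv0"
    pv_start_extent: int  # first physical extent used on that PV


def map_lv_offset(offset, segments, extent_size, pe_start):
    """Translate a byte offset in the LV to (pv, byte offset in that PV).

    `pe_start` maps each PV label to the byte offset of its first
    physical extent (expressed in bytes here for simplicity).
    """
    logical_extent, within = divmod(offset, extent_size)
    for seg in segments:
        if seg.start_extent <= logical_extent < seg.start_extent + seg.extent_count:
            physical_extent = seg.pv_start_extent + (logical_extent - seg.start_extent)
            return seg.pv, pe_start[seg.pv] + physical_extent * extent_size + within
    raise ValueError("offset lies outside the logical volume")
```

With a mapping like this, reading the logical volume sequentially amounts to reading the right byte ranges from each exported PV image in segment order.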