HDFS—Storage Optimization
Erasure coding
Principle of erasure coding
Overview
By default, HDFS keeps 3 replicas of every file, which improves data reliability but also brings 2x redundancy overhead. Hadoop 3.x therefore introduced an erasure coding mechanism: instead of storing full replicas, it computes parity data, which can save roughly 50% of storage space.
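As a quick sanity check of the 50% figure, compare 3-way replication with the RS-6-3-1024k layout described below (6 data cells plus 3 parity cells), where d is the raw data size:

$$ \text{replication cost} = 3d, \qquad \text{RS-6-3 cost} = \tfrac{6+3}{6}\,d = 1.5\,d, \qquad \text{savings} = 1 - \tfrac{1.5d}{3d} = 50\% $$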
Erasure coding commands
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec
Usage: bin/hdfs ec [COMMAND]
    [-listPolicies]
    [-addPolicies -policyFile <file>]
    [-getPolicy -path <path>]
    [-removePolicy -policy <policy>]
    [-setPolicy -path <path> [-policy <policy>] [-replicate]]
    [-unsetPolicy -path <path>]
    [-listCodecs]
    [-enablePolicy -policy <policy>]
    [-disablePolicy -policy <policy>]
    [-help <command-name>]
View supported erasure coding policies
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -listPolicies
Strategy explanation
RS-3-2-1024k: Reed-Solomon (RS) encoding; for every 3 data units, 2 parity units are generated, 5 units in total. The original data can be recovered as long as any 3 of the 5 units survive, whether they are data units or parity units. Each unit (cell) is 1024 KB.
RS-10-4-1024k: RS encoding; 4 parity units are generated for every 10 data units, 14 units in total. Any 10 surviving units are enough to recover the original data. Each unit is 1024 KB.
RS-6-3-1024k: RS encoding; 3 parity units are generated for every 6 data units, 9 units in total. Any 6 surviving units are enough to recover the original data. Each unit is 1024 KB.
RS-LEGACY-6-3-1024k: Same layout as RS-6-3-1024k, but it uses the rs-legacy codec.
XOR-2-1-1024k: XOR encoding (faster than RS); 1 parity unit is generated for every 2 data units, 3 units in total. Any 2 surviving units are enough to recover the original data. Each unit is 1024 KB.
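To make the recovery idea concrete, here is a toy illustration of the XOR-2-1 case (plain shell arithmetic, not HDFS code): the parity of two data values lets either value be rebuilt from the survivor plus the parity. Reed-Solomon generalizes this idea to more data and parity units.

d1=202; d2=77                     # two toy "data units"
p=$(( d1 ^ d2 ))                  # the parity unit
echo "parity = $p"
echo "rebuilt d1 = $(( p ^ d2 ))" # prints 202
echo "rebuilt d2 = $(( p ^ d1 ))" # prints 77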
Erasure coding in practice
Policy Application
- An erasure coding policy is set on a specific path; all files uploaded to that path afterwards are stored according to that policy.
Specific steps
Enable RS-3-2-1024k policy support
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -enablePolicy -policy RS-3-2-1024k
Erasure coding policy RS-3-2-1024k is enabled
Create an HDFS directory and set policies
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -mkdir /input
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -setPolicy -path /input -policy RS-3-2-1024k
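Optionally, verify that the policy is now attached to the directory with the -getPolicy subcommand from the usage listing above:

[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -getPolicy -path /input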
Upload files and view storage
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -put <local-file> /input
Note: the uploaded file needs to be larger than 2 MB to see the full effect of erasure coding (below 2 MB the three data units are not all filled, so you may see only a single data unit together with the two parity units).
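For example, a test file larger than 2 MB can be generated locally and uploaded; the name /tmp/ec_test.dat is only illustrative:

[lzl@hadoop12 hadoop-3.1.3]$ dd if=/dev/urandom of=/tmp/ec_test.dat bs=1M count=5
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -put /tmp/ec_test.dat /input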
- View the data units and parity units in the storage paths (see the sketch after this list)
- Destruction experiment (corrupt some units, then verify the file is still readable)
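A minimal sketch of both checks, assuming the illustrative /tmp/ec_test.dat file from above and DataNode data directories under /opt/module/hadoop-3.1.3/data (adjust the paths to your own hdfs-site.xml):

# 1. See on which DataNodes the data units and parity units landed
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /input/ec_test.dat -files -blocks -locations

# 2. Destruction experiment: on at most 2 of the 5 DataNodes, delete the stored
#    block files (this wipes all blocks on that node, so only do it on a test cluster)
[lzl@hadoop13 hadoop-3.1.3]$ rm -rf data/dfs/data/current/BP-*/current/finalized/

# 3. The file should still be readable, because any 3 of the 5 units are enough
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -get /input/ec_test.dat /tmp/ec_test.readback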
Heterogeneous storage (hot and cold data separation)
Heterogeneous storage shell operation
View available storage policies
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -listPolicies
Set storage policy for specified paths
hdfs storagepolicies -setStoragePolicy -path xxx -policy xxx
Get the storage policy for the specified path
hdfs storagepolicies -getStoragePolicy -path xxx
Cancel storage policy
hdfs storagepolicies -unsetStoragePolicy -path xxx
View the distribution of file blocks
bin/hdfs fsck xxx -files -blocks -locations
View cluster nodes
hadoop dfsadmin -report
Test environment preparation
Environment description
Cluster size: 5 servers
Cluster configuration: the replication factor is 2; a directory is created in advance for each storage type on every node.
Cluster Planning:
Node       Storage types
hadoop12   RAM_DISK, SSD
hadoop13   SSD, DISK
hadoop14   DISK, RAM_DISK
hadoop15   ARCHIVE
hadoop16   ARCHIVE
Configuration file information
hadoop12 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[SSD]file:///opt/module/hadoop-3.1.3/hdfsdata/ssd,[RAM_DISK]file:///opt/module/hadoop-3.1.3/hdfsdata/ram_disk</value>
</property>
hadoop13 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[SSD]file:///opt/module/hadoop-3.1.3/hdfsdata/ssd,[DISK]file:///opt/module/hadoop-3.1.3/hdfsdata/disk</value>
</property>
hadoop14 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[RAM_DISK]file:///opt/module/hdfsdata/ram_disk,[DISK]file:///opt/module/hadoop-3.1.3/hdfsdata/disk</value>
</property>
hadoop15 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[ARCHIVE]file:///opt/module/hadoop-3.1.3/hdfsdata/archive</value>
</property>
hadoop16 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[ARCHIVE]file:///opt/module/hadoop-3.1.3/hdfsdata/archive</value>
</property>
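Since the storage-type directories must exist before the DataNodes start, here is a sketch of creating them on each node, using the exact paths from the configurations above:

# hadoop12
mkdir -p /opt/module/hadoop-3.1.3/hdfsdata/ssd /opt/module/hadoop-3.1.3/hdfsdata/ram_disk
# hadoop13
mkdir -p /opt/module/hadoop-3.1.3/hdfsdata/ssd /opt/module/hadoop-3.1.3/hdfsdata/disk
# hadoop14
mkdir -p /opt/module/hdfsdata/ram_disk /opt/module/hadoop-3.1.3/hdfsdata/disk
# hadoop15 and hadoop16
mkdir -p /opt/module/hadoop-3.1.3/hdfsdata/archive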
Data preparation
Start the cluster
[lzl@hadoop12 hadoop-3.1.3]$ hdfs namenode -format
[lzl@hadoop12 hadoop-3.1.3]$ sbin/start-dfs.sh
Create an HDFS directory
[lzl@hadoop12 hadoop-3.1.3]$ hadoop fs -mkdir /hdfsdata
Upload file
[lzl@hadoop12 hadoop-3.1.3]$ hadoop fs -put /opt/module/hadoop-3.1.3/ /hdfsdata
HOT storage policy testing
Get the initial storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -getStoragePolicy -path /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
The default storage policy is HOT
WARM storage policy testing
Setting up WARM storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy WARM
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
Check the file block distribution again
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
COLD storage policy testing
Set up COLD storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy COLD
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
ONE_SSD policy testing
Set ONE_SSD storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy ONE_SSD
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
ALL_SSD policy testing
Setting ALL_SSD storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy ALL_SSD
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
LAZY_PERSIST policy testing
Setting LAZY_PERSIST storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy LAZY_PERSIST
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
Things to note
When the DataNode where the client runs has no RAM_DISK storage, the file block is written to that DataNode's DISK storage, and the remaining replicas are written to the DISK storage of other nodes.
If the DataNode where the client runs does have RAM_DISK, but the "dfs.datanode.max.locked.memory" parameter is not set or is set too small (smaller than the "dfs.block.size" parameter), the file block is likewise written to that DataNode's DISK storage, and the remaining replicas are written to the DISK storage of other nodes.
The virtual machine's "max locked memory" limit is only 64 KB, so if dfs.datanode.max.locked.memory is configured larger than that, an error is reported.
Query the "max locked memory" parameter
[lzl@hadoop12 hadoop-3.1.3]$ ulimit -a
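ulimit -a prints every limit; the relevant line is "max locked memory". A sketch of checking just that value and raising it if dfs.datanode.max.locked.memory needs more (this assumes root access to edit /etc/security/limits.conf and a re-login for the change to take effect; the user name lzl is taken from the prompts above):

[lzl@hadoop12 hadoop-3.1.3]$ ulimit -l      # prints the limit in KB, e.g. 64
# Raise the limit for the user that runs the DataNode, for example by adding
# the following line to /etc/security/limits.conf and logging in again:
#   lzl  -  memlock  unlimited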