HDFS—Storage Optimization
Erasure coding
Principle of erasure coding
Overview
By default, HDFS keeps 3 replicas of every file, which improves data reliability but also brings 2x redundancy overhead. Hadoop 3.x therefore introduced an erasure coding mechanism: instead of storing full replicas, it computes parity data, which can save roughly 50% of storage space.
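As a quick sanity check of the 50% figure, compare 3-way replication with the RS-6-3-1024k layout described below (6 data cells plus 3 parity cells), where d is the raw data size:

$$ \text{replication cost} = 3d, \qquad \text{RS-6-3 cost} = \tfrac{6+3}{6}\,d = 1.5\,d, \qquad \text{savings} = 1 - \tfrac{1.5d}{3d} = 50\% $$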
Erasure coding commands
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec
Usage: bin/hdfs ec [COMMAND]
    [-listPolicies]
    [-addPolicies -policyFile <file>]
    [-getPolicy -path <path>]
    [-removePolicy -policy <policy>]
    [-setPolicy -path <path> [-policy <policy>] [-replicate]]
    [-unsetPolicy -path <path>]
    [-listCodecs]
    [-enablePolicy -policy <policy>]
    [-disablePolicy -policy <policy>]
    [-help <command-name>]
View supported erasure coding policies
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -listPolicies
Strategy explanation
RS-3-2-1024k: Reed-Solomon (RS) encoding; for every 3 data units, 2 parity units are generated, 5 units in total. The original data can be recovered as long as any 3 of the 5 units survive, whether they are data units or parity units. Each unit (cell) is 1024 KB.
RS-10-4-1024k: RS encoding; 4 parity units are generated for every 10 data units, 14 units in total. Any 10 surviving units are enough to recover the original data. Each unit is 1024 KB.
RS-6-3-1024k: RS encoding; 3 parity units are generated for every 6 data units, 9 units in total. Any 6 surviving units are enough to recover the original data. Each unit is 1024 KB.
RS-LEGACY-6-3-1024k: Same layout as RS-6-3-1024k, but it uses the rs-legacy codec.
XOR-2-1-1024k: XOR encoding (faster than RS); 1 parity unit is generated for every 2 data units, 3 units in total. Any 2 surviving units are enough to recover the original data. Each unit is 1024 KB.
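To make the recovery idea concrete, here is a toy illustration of the XOR-2-1 case (plain shell arithmetic, not HDFS code): the parity of two data values lets either value be rebuilt from the survivor plus the parity. Reed-Solomon generalizes this idea to more data and parity units.

d1=202; d2=77                     # two toy "data units"
p=$(( d1 ^ d2 ))                  # the parity unit
echo "parity = $p"
echo "rebuilt d1 = $(( p ^ d2 ))" # prints 202
echo "rebuilt d2 = $(( p ^ d1 ))" # prints 77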
Erasure coding in practice
Policy Application
- An erasure coding policy is set on a specific path; all files uploaded to that path afterwards are stored according to that policy.
Specific steps
Enable RS-3-2-1024k policy support
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -enablePolicy -policy RS-3-2-1024k
Erasure coding policy RS-3-2-1024k is enabled
Create an HDFS directory and set policies
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -mkdir /input
[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -setPolicy -path /input -policy RS-3-2-1024k
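Optionally, verify that the policy is now attached to the directory with the -getPolicy subcommand from the usage listing above:

[lzl@hadoop12 hadoop-3.1.3]$ hdfs ec -getPolicy -path /input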
Upload files and view storage
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -put <local-file> /input
Note: the uploaded file needs to be larger than 2 MB to see the full effect of erasure coding (below 2 MB the three data units are not all filled, so you may see only a single data unit together with the two parity units).
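For example, a test file larger than 2 MB can be generated locally and uploaded; the name /tmp/ec_test.dat is only illustrative:

[lzl@hadoop12 hadoop-3.1.3]$ dd if=/dev/urandom of=/tmp/ec_test.dat bs=1M count=5
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -put /tmp/ec_test.dat /input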
- View the data units and parity units in the storage paths (see the sketch after this list)
- Destruction experiment (corrupt some units, then verify the file is still readable)
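A minimal sketch of both checks, assuming the illustrative /tmp/ec_test.dat file from above and DataNode data directories under /opt/module/hadoop-3.1.3/data (adjust the paths to your own hdfs-site.xml):

# 1. See on which DataNodes the data units and parity units landed
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /input/ec_test.dat -files -blocks -locations

# 2. Destruction experiment: on at most 2 of the 5 DataNodes, delete the stored
#    block files (this wipes all blocks on that node, so only do it on a test cluster)
[lzl@hadoop13 hadoop-3.1.3]$ rm -rf data/dfs/data/current/BP-*/current/finalized/

# 3. The file should still be readable, because any 3 of the 5 units are enough
[lzl@hadoop12 hadoop-3.1.3]$ hdfs dfs -get /input/ec_test.dat /tmp/ec_test.readback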
Heterogeneous storage (hot and cold data separation)
Heterogeneous storage shell operation
View available storage policies
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -listPolicies
Set storage policy for specified paths
hdfs storagepolicies -setStoragePolicy -path xxx -policy xxx
Get the storage policy for the specified path
hdfs storagepolicies -getStoragePolicy -path xxx
Cancel storage policy
hdfs storagepolicies -unsetStoragePolicy -path xxx
View the distribution of file blocks
bin/hdfs fsck xxx -files -blocks -locations
View cluster nodes
hadoop dfsadmin -report
Test environment preparation
Environment description
Cluster size: 5 servers
Cluster configuration: the replication factor is 2; a directory is created in advance for each storage type on every node.
Cluster Planning:
Node       Storage types
hadoop12   RAM_DISK, SSD
hadoop13   SSD, DISK
hadoop14   DISK, RAM_DISK
hadoop15   ARCHIVE
hadoop16   ARCHIVE
Configuration file information
hadoop12 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[SSD]file:///opt/module/hadoop-3.1.3/hdfsdata/ssd,[RAM_DISK]file:///opt/module/hadoop-3.1.3/hdfsdata/ram_disk</value>
</property>
hadoop13 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[SSD]file:///opt/module/hadoop-3.1.3/hdfsdata/ssd,[DISK]file:///opt/module/hadoop-3.1.3/hdfsdata/disk</value>
</property>
hadoop14 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[RAM_DISK]file:///opt/module/hdfsdata/ram_disk,[DISK]file:///opt/module/hadoop-3.1.3/hdfsdata/disk</value>
</property>
hadoop15 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[ARCHIVE]file:///opt/module/hadoop-3.1.3/hdfsdata/archive</value>
</property>
hadoop16 node
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.storage.policy.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>[ARCHIVE]file:///opt/module/hadoop-3.1.3/hdfsdata/archive</value>
</property>
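Since the storage-type directories must exist before the DataNodes start, here is a sketch of creating them on each node, using the exact paths from the configurations above:

# hadoop12
mkdir -p /opt/module/hadoop-3.1.3/hdfsdata/ssd /opt/module/hadoop-3.1.3/hdfsdata/ram_disk
# hadoop13
mkdir -p /opt/module/hadoop-3.1.3/hdfsdata/ssd /opt/module/hadoop-3.1.3/hdfsdata/disk
# hadoop14
mkdir -p /opt/module/hdfsdata/ram_disk /opt/module/hadoop-3.1.3/hdfsdata/disk
# hadoop15 and hadoop16
mkdir -p /opt/module/hadoop-3.1.3/hdfsdata/archive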
Data preparation
Start the cluster
[lzl@hadoop12 hadoop-3.1.3]$ hdfs namenode -format
[lzl@hadoop12 hadoop-3.1.3]$ sbin/start-dfs.sh
Create an HDFS directory
[lzl@hadoop12 hadoop-3.1.3]$ hadoop fs -mkdir /hdfsdata
Upload file
[lzl@hadoop12 hadoop-3.1.3]$ hadoop fs -put /opt/module/hadoop-3.1.3/ /hdfsdata
HOT storage policy testing
Get the initial storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -getStoragePolicy -path /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
The default storage policy is HOT
WARM storage policy testing
Setting up WARM storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy WARM
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
Check the file block distribution again
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
COLD storage policy testing
Set up COLD storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy COLD
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
ONE_SSD policy testing
Set ONE_SSD storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy ONE_SSD
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
ALL_SSD policy testing
Setting ALL_SSD storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy ALL_SSD
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
LAZY_PERSIST policy testing
Setting LAZY_PERSIST storage policy
[lzl@hadoop12 hadoop-3.1.3]$ hdfs storagepolicies -setStoragePolicy -path /hdfsdata -policy LAZY_PERSIST
Manually migrate file blocks
[lzl@hadoop12 hadoop-3.1.3]$ hdfs mover /hdfsdata
View file block distribution
[lzl@hadoop12 hadoop-3.1.3]$ hdfs fsck /hdfsdata -files -blocks -locations
Things to note
When the DataNode where the client runs has no RAM_DISK storage, the file block is written to that DataNode's DISK storage, and the remaining replicas are written to the DISK storage of other nodes.
If the DataNode where the client runs does have RAM_DISK, but the "dfs.datanode.max.locked.memory" parameter is not set or is set too small (smaller than the "dfs.block.size" parameter), the file block is likewise written to that DataNode's DISK storage, and the remaining replicas are written to the DISK storage of other nodes.
The virtual machine's "max locked memory" limit is only 64 KB, so if dfs.datanode.max.locked.memory is configured larger than that, an error is reported.
Query the "max locked memory" parameter
[lzl@hadoop12 hadoop-3.1.3]$ ulimit -a
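ulimit -a prints every limit; the relevant line is "max locked memory". A sketch of checking just that value and raising it if dfs.datanode.max.locked.memory needs more (this assumes root access to edit /etc/security/limits.conf and a re-login for the change to take effect; the user name lzl is taken from the prompts above):

[lzl@hadoop12 hadoop-3.1.3]$ ulimit -l      # prints the limit in KB, e.g. 64
# Raise the limit for the user that runs the DataNode, for example by adding
# the following line to /etc/security/limits.conf and logging in again:
#   lzl  -  memlock  unlimited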