
Linux: speeding up compression of very large (100 G) files with tar and gzip

I found myself having to compress a number of very large files (80-ish GB each), and I was surprised by the (lack of) speed my system showed. I get a compression speed of about 500 MB/min; according to top, I seem to be using a single CPU at roughly 100%.

I'm pretty sure this is not (just) disk access speed, since creating the tar file (that's how the 80 G files were made) took only a few minutes (maybe 5 or 10), but after more than 2 hours my plain gzip command still hasn't finished.
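For what it's worth, a rough way to check that gzip (i.e. the CPU) rather than the disk is the limiting factor, assuming the archive is called myDir.tar (the name is only illustrative): dd prints its throughput when it finishes, so the two numbers can be compared directly.

dd if=myDir.tar of=/dev/null bs=1M              # raw read throughput from disk
dd if=myDir.tar bs=1M | gzip -6 > /dev/null     # throughput once gzip is in the pipeline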

In summary:

tar -cvf myDir.tar myDir/*

takes less than 5 minutes to create an 87 G tar file;

gzip myDir.tar

takes two hours and ten minutes to create a 55 G myDir.tar.gz file.

My question: Is this normal? Are there options to make gzip faster? Would chaining the commands into a pipeline be faster? I've seen references to pigz, a parallel implementation of gzip, but unfortunately I can't install software on the machine I'm using, so that's not an option for me. See, for example, this earlier question.
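To be concrete, this is what I mean by chaining the commands: stream tar straight into gzip so the 87 G intermediate tar file never touches the disk (directory and output names are only illustrative).

tar -c myDir/ | gzip > myDir.tar.gz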

I'm going to try some of these options myself and time them; however, it's quite possible I won't hit the "magic combination" of options. I'm hoping someone on this site knows the right trick to speed things up.

I will update this question when I have results from other trials, but I would be very grateful for any particularly good tips. Maybe gzip just takes more processing time than I realized...

Update

As promised, I tried the suggested tweaks: changing the compression level and changing the destination of the output file. For a tar of about 4.1 GB I got the following results (the sameDisk column is the time difference between writing the output to the same disk as the source and writing it to the other disk):

flag   user      system   size     sameDisk
-1     189.77s   13.64s   2.786G   +7.2s
-2     197.20s   12.88s   2.776G   +3.4s
-3     207.03s   10.49s   2.739G   +1.2s
-4     223.28s   13.73s   2.735G   +0.9s
-5     237.79s    9.28s   2.704G   -0.4s
-6     271.69s   14.56s   2.700G   +1.4s
-7     307.70s   10.97s   2.699G   +0.9s
-8     528.66s   10.51s   2.698G   -6.3s
-9     722.61s   12.24s   2.698G   -4.0s

So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change in the size of the compressed file. Whether I use the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).
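In practice that means something like the lines below (names are illustrative). As far as I know, gzip also honors a GZIP environment variable for default options, although newer gzip releases mark it as obsolete.

tar -c myDir/ | gzip -1 > myDir.tar.gz      # explicit fast setting on the pipeline
GZIP=-1 tar -czf myDir.tar.gz myDir/        # same idea via the (deprecated) GZIP variable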

In case anyone is interested, I generated these timing benchmarks using the following two scripts:

#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'
logFile='./timerOutput'
rm $logFile

for i in {1..9}
do
  /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $sameDisk $logFile
  /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $otherDisk $logFile
done

The second script (compressWith):

#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz
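For anyone who wants to rerun this: the driver script above is never named in the post, so compareCompression below is just the name I happen to give it; edit sourceDir first so it points at a real directory.

chmod +x compareCompression compressWith
./compareCompression
cat ./timerOutput          # the per-flag timings accumulate here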

Three things to note:

I used /usr/bin/time instead of time, because bash's built-in command has fewer options than the GNU command.

I didn't bother with the --format option, although that would make the log file easier to read.

I used a script because time seemed to operate only on the first command in a pipeline, so I made the whole pipeline look like a single command.
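An alternative I could have used instead of the wrapper script, I believe, is to hand the whole pipeline to a sub-shell, so /usr/bin/time measures tar and gzip together (the paths and the -6 level are only illustrative):

/usr/bin/time -a --output=./timerOutput \
    sh -c 'tar -c /dirToCompress | gzip -6 > ./test-6.tar.gz'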

Having learned all this, my conclusions are:

Speed things up with the -1 flag (accepted answer).

Compressing the data takes much more time than reading it from disk.

Invest in faster compression software (pigz seems like a good choice).

If you have multiple files to compress, you can put each gzip command in its own thread and make use of more of the available CPUs (a poor man's pigz).
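For completeness, a minimal sketch of that "poor man's pigz", assuming GNU xargs (for -0/-P), coreutils' nproc, and that the archives to compress match *.tar:

find . -maxdepth 1 -name '*.tar' -print0 |
    xargs -0 -P "$(nproc)" -n 1 gzip -1     # one gzip per file, up to one per CPU core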

Thank you to everyone who helped me learn all this!