Compressed Files: How They Work
Posted by Max Baun on Sun, Aug 10, 2008 @ 08:59 AM
Have you ever received a file with a .ZIP extension in an email from a co-worker? When you open the file, WinZip takes over and creates a folder containing a collection of text files that he has been working on. You may also notice that the .ZIP file itself was smaller than the files that came out of it. The laws of physics do not let a large object fit in a box physically smaller than itself, so how does a file manage it? The trick is file compression.
Here I will take a quick phrase and compress it.
“Try not to become a man of success but rather to become a man of value”
This phrase contains 16 words, 55 letters, and 15 spaces. Counting each letter and each space as one unit, we will say the phrase takes up 70 units of memory. Several words appear more than once, so we can give each of them a short code. Using the following key, we assign each repeated word to a number.
1 - to
2 - become
3 - a
4 - man
5 - of
Now our sentence reads:
“Try not 1 2 3 4 5 success but rather 1 2 3 4 5 value”
Once the repeated words are replaced with their codes, we come out with fewer characters. There are still 16 words and 15 spaces, but now we only have 37 characters. Together, the new phrase can be saved in 52 units of memory. If you know the key we used to compress the original phrase, you can easily translate the compressed sentence back into the original. Essentially, this is what your compression software, WinZip, does when you double-click a .ZIP file.
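The substitution above can be sketched in a few lines of Python. This is only an illustration of the idea in this post, not how WinZip actually works internally. The names KEY, compress, and decompress are made up for the example, and the sketch assumes the original text never contains the digits 1 to 5 as standalone words.

```python
# Word-level substitution using the key from the post.
# KEY, compress, and decompress are illustrative names, not part of any real tool.
KEY = {"to": "1", "become": "2", "a": "3", "man": "4", "of": "5"}
REVERSE_KEY = {code: word for word, code in KEY.items()}


def compress(text):
    """Replace every word that has an entry in KEY with its code."""
    return " ".join(KEY.get(word, word) for word in text.split(" "))


def decompress(text):
    """Replace every code with the word it stands for."""
    return " ".join(REVERSE_KEY.get(token, token) for token in text.split(" "))


phrase = "Try not to become a man of success but rather to become a man of value"
packed = compress(phrase)

print(packed)                        # Try not 1 2 3 4 5 success but rather 1 2 3 4 5 value
print(len(phrase))                   # 70
print(len(packed))                   # 52
print(decompress(packed) == phrase)  # True
```

The counts match the ones worked out by hand: 70 units before and 52 after, with the original phrase recovered exactly.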
Searching for even more patterns, we can see that the string “1 2 3 4 5” appears twice. Assigning both occurrences to the number 6 means we no longer have to store “1 2 3 4 5” twice.
“Try not 6 success but rather 6 value”
Now we have an even more compressed sentence with 8 words, 29 characters, and 7 spaces, making 36 units in total. To recap, we have taken the original phrase with 70 units of memory and compressed it down to 36 units of memory. The computer now takes the 36 new memory units, along with the key needed to reverse the substitutions, and zips it all up into a .ZIP file.
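The second pass can be sketched the same way. Here the repeated run of codes is folded into a single new code. The names PHRASE_KEY, fold, and unfold are invented for this example, and the sketch ignores the space the key itself would take up when stored, which a real compressor has to pay for.

```python
# Second pass: fold a repeated run of codes into one new code.
# PHRASE_KEY, fold, and unfold are illustrative names for this sketch.
PHRASE_KEY = {"1 2 3 4 5": "6"}


def fold(text):
    """Replace each repeated run listed in PHRASE_KEY with its short code."""
    for run, code in PHRASE_KEY.items():
        text = text.replace(run, code)
    return text


def unfold(text):
    """Expand each short code back into the run it stands for."""
    for run, code in PHRASE_KEY.items():
        text = text.replace(code, run)
    return text


first_pass = "Try not 1 2 3 4 5 success but rather 1 2 3 4 5 value"
second_pass = fold(first_pass)

print(second_pass)                        # Try not 6 success but rather 6 value
print(len(second_pass))                   # 36
print(unfold(second_pass) == first_pass)  # True
```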
Even though I looked for patterns of words in this sentence, patterns of characters can also be grouped and assigned to a common value. For example, “cc” or “at” could appear elsewhere in the file, before or after this phrase. There are countless ways for your software to find redundancy in a text file. This works very well on text files containing long, repetitive character strings. The larger the file, the greater the chance of redundancy. Most video and audio files, on the other hand, do not shrink much further, because their formats are usually already compressed and little redundancy is left in them.
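To see the effect of redundancy for yourself, here is a small sketch using Python's built-in zlib module, which implements the same family of compression used in .ZIP files. The sample text is made up, and the random bytes from os.urandom are only a stand-in for data with no redundancy left, such as a file that is already compressed.

```python
import os
import zlib

# Highly repetitive text: the same sentence 200 times.
repetitive = b"Try not to become a man of success. " * 200

# Random bytes stand in for data with almost no redundancy (an assumption
# made for this sketch; a real video or audio file would behave similarly).
random_like = os.urandom(len(repetitive))

packed_repetitive = zlib.compress(repetitive)
packed_random = zlib.compress(random_like)

# Print original size and compressed size for each input.
print(len(repetitive), len(packed_repetitive))
print(len(random_like), len(packed_random))
```

The repetitive text should shrink to a tiny fraction of its size, while the random bytes come out about as large as they went in.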
The example above shows the type of compression called lossless compression. With this type of compression, the file you get back when you decompress is identical to the file you compressed. It is like taking a big object of volume X, breaking it down into pieces, and squeezing them all into a smaller box along with instructions on how to piece it back together. When you open the box, you read the instructions and rebuild the object.
The other type of compression, lossy compression, is very different. Breaking down the same object, it is like packing only the pieces that fit in the box and throwing out the rest. With fewer pieces than you started with, you cannot rebuild the object that you started with. Applied to audio and image files, lossy compression results in lost quality. In an extreme case, an image of grass with a lot of color and texture may be compressed into a solid green image.
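The difference between the two types can be shown with a toy example. The pixel values below are made up, and the lossy step is a deliberately crude rounding scheme chosen for illustration. Real lossy formats such as JPEG or MP3 are far more sophisticated about what they throw away.

```python
import zlib

# One row of "grass" pixels: green-channel values from 0 to 255 (made-up data).
grass = [96, 103, 110, 99, 121, 108, 115, 101]

# Lossless: zlib gives back exactly the bytes it was handed.
original = bytes(grass)
restored = zlib.decompress(zlib.compress(original))
print(restored == original)  # True

# Lossy (toy version): remember only which 64-wide band each value falls in.
STEP = 64
stored = [value // STEP for value in grass]              # what gets saved
rebuilt = [band * STEP + STEP // 2 for band in stored]   # best guess on load

print(stored)            # [1, 1, 1, 1, 1, 1, 1, 1]
print(rebuilt)           # [96, 96, 96, 96, 96, 96, 96, 96]
print(rebuilt == grass)  # False
```

Every pixel comes back as the same shade of green, which is the solid green image from the grass example: the texture was thrown out and no set of instructions can bring it back.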
After reading this blog, you will never look at a compressed file the same way. We can now see how useful compression is when sending large files by email or through a file-sharing website. I hope you don’t go hacking that $40,000 vase apart when you have to send it cross country in a physically smaller box.