It’s interesting how long you can go using something without really understanding what’s happening under the hood. I didn’t think anything of it until today, when I was doing my annual download via Google Takeout. I have something like 500 GiB of compressed (zipped) data in Google from last year. Google asked me if I wanted a tgz or a zip file and I found myself stumped. Why does it matter? Aren’t they all just compressed archives?[1]

Re-summarizing the info I found on Stack Overflow:

  1. tar files are uncompressed archives: many files are bundled into a single file for ease of movement and storage. They are also called tarballs. The imagery of files glued together with tar is pretty evocative.
  2. tgz files are tar files where the tarball is made first, then gzipped as a single unit to save space. Thus tgz (tarred and gzipped).
  3. zip files are created by compressing individual files, then gluing those together.
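
To make the three definitions above concrete, here is a minimal Python sketch using only the standard library. The file names and contents are made up for illustration, and everything stays in memory. It builds the tarball first, gzips it as one unit to get a tgz, and then builds a zip where each file is compressed on its own.

```python
import gzip
import io
import tarfile
import zipfile

# Hypothetical sample files, kept in memory so nothing touches disk.
files = {
    "notes.txt": b"hello tarball\n" * 50,
    "todo.txt": b"buy more disk space\n" * 50,
}

# 1. tar: bundle the files into one uncompressed stream.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
tar_bytes = tar_buf.getvalue()

# 2. tgz: gzip the finished tarball as a single unit.
tgz_bytes = gzip.compress(tar_bytes)

# 3. zip: compress each file on its own, then store the results together.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
zip_bytes = zip_buf.getvalue()

# The tar is at least as large as its contents (headers and padding add bytes).
raw_size = sum(len(data) for data in files.values())
print("raw:", raw_size, "tar:", len(tar_bytes), "tgz:", len(tgz_bytes))

# Un-gzipping the tgz gives back exactly the tarball we started with.
print("tgz is gzip(tar):", gzip.decompress(tgz_bytes) == tar_bytes)

# Each zip member carries its own compressed size.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    for member in zf.infolist():
        print(member.filename, member.file_size, "->", member.compress_size)
```

In practice you would let `tarfile.open(..., mode="w:gz")` do steps 1 and 2 in one go; they are split here only to show the order of operations.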

The consequences of this are:

  • tgz files may compress better, since gzip sees the whole tarball at once and can exploit redundancy across files, not just within each one.
  • zip files may compress somewhat worse, but you can decompress individual files on demand, meaning random access is faster.
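
The random-access point can be sketched in Python with the standard `zipfile` and `tarfile` modules. The archive contents and the `wanted` file name below are hypothetical. Both reads return the same bytes; the difference is in how much work happens to get there.

```python
import io
import tarfile
import zipfile

# Hypothetical archive contents: 100 small text files.
files = {f"file_{i:03d}.txt": f"contents of file {i}\n".encode() for i in range(100)}
wanted = "file_075.txt"

# Build a zip and a tgz holding the same files, both in memory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

tgz_buf = io.BytesIO()
with tarfile.open(fileobj=tgz_buf, mode="w:gz") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# zip: a directory at the end of the archive records where each member lives,
# so only the wanted member has to be decompressed.
zip_buf.seek(0)
with zipfile.ZipFile(zip_buf) as zf:
    from_zip = zf.read(wanted)

# tgz: there is no index, so the gzip stream is decompressed from the start
# and the tar headers are walked until the wanted member turns up.
tgz_buf.seek(0)
with tarfile.open(fileobj=tgz_buf, mode="r:gz") as tar:
    from_tgz = tar.extractfile(wanted).read()

print(from_zip == from_tgz == files[wanted])
```

With 100 tiny files the cost is invisible, but the same scan over a multi-gigabyte tgz means decompressing everything that comes before the file you want.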

Practical recommendations

  • If you’re downloading a huge archive, like one from Google Takeout, get it as a zip. You’re unlikely to need all the files at once later, and being able to decompress just the one small file you want to look at is definitely useful.
  • If you’re just downloading a file for long-term storage and have no plans to access anything in it, tgz will be slightly more space efficient in theory. I haven’t tested the theory.
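
For anyone who wants to poke at the space-efficiency theory, here is a rough Python sketch. It uses synthetic data (200 small, nearly identical text files, which is an assumption and close to the best case for tgz), and the helper names `tgz_size` and `zip_size` are made up for this example. A real Takeout export is mostly photos and videos that are already compressed, so the gap there would likely be much smaller.

```python
import io
import tarfile
import zipfile

# Synthetic stand-in data (an assumption): 200 small, nearly identical files.
files = {
    f"record_{i:03d}.txt": (
        f"record {i}\n" + "the quick brown fox jumps over the lazy dog\n" * 5
    ).encode()
    for i in range(200)
}


def tgz_size(files):
    """Size in bytes of a gzipped tarball holding the given files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


def zip_size(files):
    """Size in bytes of a zip archive with each file deflated separately."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


print("tgz bytes:", tgz_size(files))
print("zip bytes:", zip_size(files))
```

On data like this the tgz should come out clearly smaller, because gzip can reuse what it learned from one file when compressing the next, while zip starts fresh for every file and also pays a per-file header cost.
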
  1. This is so embarrassing because I’ve been a Linux user since grade school and my brain was like “tgz is just a free version of zip.”