Let's create an archive utility!
Topics we'll learn more about
| File formats | .zip, .7z, .warc, .gz, bzip2 § Format |
| Compression codecs | DEFLATE, LZMA, Brotli, bzip2 § Codec |
| Hashes and checksums | SHA-256, CRC-32 |
| HTTP | Parsing HTTP, HTTP redirects, Content-Encoding, Transfer-Encoding § chunked |
| Data economics | Indexes, Reducing seeks, Replication, Storage reliability |
Archive files
Archive files contain items. For example, you might have:
- a .zip or .7z archive file containing zero or more files
- a .warc archive file containing raw HTTP requests and responses
Sometimes there is no compression involved, and each item simply exists as a substring of the archive file; i.e. the archive file is simply the concatenation of item content along with fragments of metadata.
TODO: example diagram, with more details as you hover/select parts of it
E.g. you can read foo.png's bytes simply by reading TODO bytes from offset TODO of that archive file.
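As a minimal sketch of that kind of read, assume a ZIP archive whose items are stored without compression. The item names (`a.txt`, `b.txt`) and the helper `stored_item_span` are made up for illustration; the arithmetic relies on the ZIP local file header being 30 fixed bytes followed by the file name and an extra field.

```python
import io
import struct
import zipfile


def stored_item_span(archive: bytes, name: str) -> tuple[int, int]:
    """Return (offset, length) of an uncompressed item's content in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f"{name} is compressed; its bytes are not a plain substring")
    # The local file header is 30 fixed bytes, then the file name and an
    # "extra" field; the two lengths sit at bytes 26-29 of the header.
    name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
    return info.header_offset + 30 + name_len + extra_len, info.compress_size


# Build a small uncompressed archive in memory (hypothetical item names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", b"hello")
    zf.writestr("b.txt", b"world!")
archive = buf.getvalue()

offset, length = stored_item_span(archive, "b.txt")
item_bytes = archive[offset:offset + length]  # a plain slice, no decompression
```

Once the offset and length are known, the read itself is just a seek and a fixed-size read, which is why an index of offsets is worth keeping around.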
Sometimes each item in an archive is compressed individually:
TODO: example diagram, with more details as you hover/select parts of it
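One way to see per-item compression in practice is a ZIP file written with DEFLATE. The sketch below is an illustration under assumptions: the item names and the helper `read_deflated_item` are hypothetical. It locates one item's compressed bytes and inflates only that item, leaving the rest of the archive untouched.

```python
import io
import struct
import zipfile
import zlib

# Build a small archive where each item is DEFLATE-compressed on its own.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"abc" * 1000)
    zf.writestr("b.txt", b"xyz" * 1000)
archive = buf.getvalue()


def read_deflated_item(archive: bytes, name: str) -> bytes:
    """Inflate a single DEFLATE-compressed item without touching the others."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError(f"{name} is not DEFLATE-compressed")
    # Skip the 30-byte local file header plus the name and extra field.
    name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    compressed = archive[start:start + info.compress_size]
    # ZIP stores raw DEFLATE streams (no zlib header), hence wbits=-15.
    return zlib.decompress(compressed, wbits=-15)


content = read_deflated_item(archive, "b.txt")
```

Because every item has its own compressed stream, random access to one item stays cheap, at the cost of not sharing anything between items.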
Sometimes the archive format itself applies no compression, but the entire archive file is then compressed (e.g. .zip.bz2 or .warc.gz). This can produce a smaller file than compressing each item individually, especially if the items in the archive have lots of substrings in common, because the compressor can reuse matches across item boundaries.
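A rough way to check the size claim is to compress the same items both ways and compare. This is only a sketch with synthetic data: the 50 items below are invented, and each one shares a 320-byte hard-to-compress prefix so that the effect is easy to see. It uses DEFLATE via `zlib` in both cases.

```python
import hashlib
import zlib

# 320 bytes that compress poorly on their own (concatenated SHA-256 digests).
shared = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(10))

# 50 synthetic items: the shared prefix plus a short unique suffix.
items = [shared + f"item {i}".encode() for i in range(50)]

# Option 1: compress every item on its own and add up the sizes.
per_item_total = sum(len(zlib.compress(item)) for item in items)

# Option 2: concatenate first, then compress the whole thing once.
whole = b"".join(items)
whole_size = len(zlib.compress(whole))

print(f"per-item total: {per_item_total} bytes")
print(f"whole archive:  {whole_size} bytes")
```

The trade-off is that reading one item from a whole-archive-compressed file generally means decompressing everything before it.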
Goals
Let's create some tooling to help with:
- Creating and indexing archives, using a very basic index format that we design
- Reading from archives, making use of index files
- Organizing and auditing the persistence of archive files
- Inspecting/understanding archive files, in full detail
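To make these goals a little more concrete, here is a minimal sketch of what creating, indexing, reading and auditing could look like. It is not the index format this project will design: the JSON layout, the field names, and the helpers `build_archive` and `read_item` are all assumptions made for illustration, and the "archive" is just the concatenation of item contents.

```python
import hashlib
import json


def build_archive(items: dict[str, bytes]) -> tuple[bytes, str]:
    """Concatenate item contents; return (archive bytes, JSON index text)."""
    chunks, index, offset = [], [], 0
    for name, content in items.items():
        index.append({
            "name": name,
            "offset": offset,
            "length": len(content),
            "sha256": hashlib.sha256(content).hexdigest(),
        })
        chunks.append(content)
        offset += len(content)
    return b"".join(chunks), json.dumps(index)


def read_item(archive: bytes, index_text: str, name: str) -> bytes:
    """Look an item up in the index, slice it out, and verify its hash."""
    for entry in json.loads(index_text):
        if entry["name"] == name:
            start = entry["offset"]
            content = archive[start:start + entry["length"]]
            if hashlib.sha256(content).hexdigest() != entry["sha256"]:
                raise ValueError(f"checksum mismatch for {name}")
            return content
    raise KeyError(name)


# Hypothetical items, purely for demonstration.
archive_bytes, index_text = build_archive({"a.txt": b"hello", "b.txt": b"world!"})
item = read_item(archive_bytes, index_text, "b.txt")
```

Storing a hash next to each offset lets the same index serve both fast reads and later audits of whether the archive bytes are still intact.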
Quests
Prepare:
Implement:
- Parsing a WARC file
- Parsing a ZIP file
- Writing a ZIP file