Let's create an archive utility!

Topics we'll learn more about


File formats: `.zip`, `.7z`, `.warc`, `.gz`, bzip2 § Format
Compression codecs: DEFLATE, LZMA, Brotli, bzip2 § Codec
Hashes and checksums: SHA-256, CRC-32
HTTP: Parsing HTTP, HTTP redirects, `Content-Encoding`, `Transfer-Encoding` § chunked
Data economics: Indexes, Reducing seeks, Replication, Storage reliability

Archive files

Archive files contain items. For example, you might have:

a .zip or .7z archive file containing zero or more files
a .warc archive file containing raw HTTP requests and responses

Sometimes there is no compression involved, and each item exists as a contiguous substring of the archive file; i.e. the archive file is the concatenation of each item's content, interleaved with fragments of metadata.

TODO: example diagram, with more details as you hover/select parts of it

E.g. you can read foo.png's bytes simply by reading TODO bytes from offset TODO of that archive file.
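As a rough illustration of reading an item straight out of an uncompressed archive, here is a minimal sketch using Python's standard `zipfile` module. The payload is made-up stand-in data, not a real PNG, and the archive is built in memory so the example is self-contained. It relies on the ZIP local file header layout: 30 fixed bytes, followed by the file name and an optional extra field.

```python
import io
import struct
import zipfile

# Hypothetical item content (stand-in bytes, not a valid image).
payload = b"\x89PNG\r\n\x1a\n" + b"pretend image data" * 4

# Build a ZIP in memory with compression disabled (ZIP_STORED).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("foo.png", payload)
archive = buf.getvalue()

# Ask the central directory where the item's local file header starts.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    info = zf.getinfo("foo.png")

# The name length and extra-field length are two little-endian uint16
# values stored at bytes 26..29 of the local file header.
name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
data_offset = info.header_offset + 30 + name_len + extra_len

# The item is just a slice of the archive file.
item = archive[data_offset : data_offset + info.file_size]
assert item == payload
print(f"foo.png: {info.file_size} bytes at offset {data_offset}")
```

No decompression step is needed here: once the offset and length are known, a plain seek and read is enough.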

Sometimes each item in an archive is compressed individually:

TODO: example diagram, with more details as you hover/select parts of it
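To make the per-item case concrete, here is a small sketch, again assuming a ZIP archive built in memory with Python's `zipfile` and a made-up text payload. Each item is still a contiguous slice of the archive, but the slice now holds a raw DEFLATE stream that has to be decompressed before use.

```python
import io
import struct
import zipfile
import zlib

# Hypothetical, highly repetitive item content.
payload = b"hello archive " * 50

# Build a ZIP in memory where each item is compressed with DEFLATE.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", payload)
archive = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    info = zf.getinfo("hello.txt")

# Skip the 30-byte fixed local header plus the name and extra field.
name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
data_offset = info.header_offset + 30 + name_len + extra_len
compressed = archive[data_offset : data_offset + info.compress_size]

# ZIP stores a raw DEFLATE stream (no zlib header), hence wbits=-15.
restored = zlib.decompress(compressed, wbits=-15)
assert restored == payload

# The CRC-32 recorded in the archive covers the uncompressed bytes.
assert zlib.crc32(restored) == info.CRC
print(f"{info.compress_size} compressed bytes -> {len(restored)} bytes")
```

Because every item is compressed on its own, one item can be extracted without touching the others.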

Sometimes the archive format itself applies no compression, but the entire archive file is then compressed as a whole (e.g. .zip.bz2 or .warc.gz). This can produce a smaller file than compressing each item individually, especially if the items have lots of substrings in common, because the compressor can reuse matches across item boundaries.
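The size difference can be checked with a quick experiment. The sketch below uses invented HTTP-like items that share most of their text, and compares gzip-compressing each item separately against gzip-compressing their concatenation. It ignores archive metadata, so it only approximates what a real archive would show.

```python
import gzip

# Hypothetical items: 100 small responses that share most of their text.
shared = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Server: example\r\n\r\n"
    b"<html><head><title>Example</title></head><body>"
)
items = [shared + f"<p>page {i}</p></body></html>".encode() for i in range(100)]

# Option 1: compress every item on its own and add up the sizes.
per_item_total = sum(len(gzip.compress(item)) for item in items)

# Option 2: concatenate the items, then compress the result once.
whole_archive = len(gzip.compress(b"".join(items)))

print(f"uncompressed:           {sum(map(len, items))} bytes")
print(f"compressed per item:    {per_item_total} bytes")
print(f"compressed as a whole:  {whole_archive} bytes")
```

The trade-off is random access: with whole-file compression, reading one item generally means decompressing everything that comes before it.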

Goals

Let's create some tooling to help with:

Creating and indexing archives, using a very basic index format that we design
Reading from archives, making use of index files
Organizing and auditing the persistence of archive files
Inspecting/understanding archive files, in full detail

Quests

Prepare:

Implement:

Parsing a WARC file
Parsing a ZIP file
Writing a ZIP file