So You Want To Retrieve and Extract .tar.gz Archives with Emscripten
As part of ongoing work with my scientific project’s cool cool web interface, I spent this afternoon and evening figuring out how to download configuration and data files into the browser. I figured I should share what I learned!
Emscripten’s nifty file packaging, which prepares elements of the browser file system at compile time and packages them with your “executable,” had been my go-to for a while. However, I needed to be able to swap out file contents without recompiling, so this no longer cut the mustard.
The configuration and data files I work with are organized in a hierarchical structure with meaningful filenames that have unpredictable components. I could try to have the user specify the filenames one by one in the browser then copy them down one by one, but that would be brittle and annoying.
Instead, I’d like to grab them all at once. A tar archive, which allows entire directory structures to be packaged into a single file, seems an appealing option here.
My files are big and repetitive, so I’d like to use gzip compression so they’ll serve up over the internets faster.
Compressing and extracting .tar.gz archives at the terminal is easy-ish.
Getting it going from a C++ program inside Emscripten’s browser sandbox?
A little harder, but once I found the right pieces to slot together not actually too bad!
In this blog, I’ll briefly discuss each of the components I assembled and tweaked to create a minimal working example of retrieving and extracting a .tar.gz archive with Emscripten.
Then, I’ll walk you through getting the minimal working example actually running on your own machine.
You can find the complete minimal working example on GitHub.
🔗 inflate.h
The first step is to decompress (or “inflate”) the archive.
The zlib project, which has a maintained port for Emscripten, provides a gzFile_s file handle analogous to the C-style FILE handle to get at a gzipped file.
So, we just do some K & R B.S. to copy content from the gzipped file over to a regular file.
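For the curious, that copy loop looks roughly like the sketch below. This is not the code from inflate.h: copy_chunks and read_some are names I made up, and the reader is left as a generic callable so the loop can be tried without zlib. It only assumes the reader follows gzread's contract (bytes read when positive, 0 at end of stream, negative on error).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Copy everything `read_some` yields into `out`, one fixed-size chunk at a time.
// `read_some(buf, len)` is a hypothetical callable mirroring gzread's contract:
// it returns the number of bytes read, 0 at end of stream, negative on error.
template <typename Reader>
bool copy_chunks(Reader read_some, std::FILE* out) {
  assert(out != nullptr);
  unsigned char buffer[16384];
  int got = 0;
  while ((got = read_some(buffer, static_cast<unsigned>(sizeof buffer))) > 0) {
    const std::size_t wanted = static_cast<std::size_t>(got);
    // a short write means the destination is full or broken, so give up
    if (std::fwrite(buffer, 1, wanted, out) != wanted) return false;
  }
  return got == 0;  // 0 is a clean end of stream; negative is a read error
}
```

With zlib in hand, the reader would presumably be a small lambda wrapping gzread, something like `[&](void* buf, unsigned len) { return gzread(file, buf, len); }`. That is my guess at the shape of it, not a copy of the repository's version.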
🔗 untar.h
Now we just have a regular old tar file.
But we still have to do a little more work.
This header provides untar, which reads that tar archive into the file system.
I borrowed most of the code in this header from Tim Kientzle’s untar and added support for long filenames stored using @LongLink.
Besides standard library components, untar.h has no dependencies.
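To make the @LongLink business concrete, here's a rough sketch of how a tar reader can walk headers and pick up long names. It is not the code from untar.h: it works on an in-memory buffer instead of a FILE, it only lists names, and list_names, parse_octal, read_string and padded are names I made up for illustration. The field offsets (name at 0, size at 124, type flag at 156) come from the tar header layout, and a type flag of 'L' is how GNU tar marks a ././@LongLink pseudo-entry whose data is the real name of the next entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kBlock = 512;       // tar works in 512-byte blocks
constexpr std::size_t kNameWidth = 100;   // name field starts at offset 0
constexpr std::size_t kSizeOffset = 124;  // size field: 12 bytes of octal text
constexpr std::size_t kSizeWidth = 12;
constexpr std::size_t kTypeOffset = 156;  // single-character type flag

// Parse an octal text field that ends at a NUL, a space, or the field width.
std::size_t parse_octal(const char* field, std::size_t width) {
  std::size_t i = 0, value = 0;
  while (i < width && field[i] == ' ') ++i;  // tolerate leading space padding
  for (; i < width && field[i] >= '0' && field[i] <= '7'; ++i) {
    value = value * 8 + static_cast<std::size_t>(field[i] - '0');
  }
  return value;
}

// Read a string field that is NUL-terminated unless it fills the whole width.
std::string read_string(const char* field, std::size_t width) {
  return std::string(field, std::find(field, field + width, '\0'));
}

// Bytes an entry's data occupies once padded to a whole number of blocks.
std::size_t padded(std::size_t size) {
  return (size + kBlock - 1) / kBlock * kBlock;
}

// List entry names in an in-memory tar archive, honoring GNU long names.
std::vector<std::string> list_names(const std::string& tar) {
  std::vector<std::string> names;
  std::string pending_long_name;
  std::size_t pos = 0;
  while (pos + kBlock <= tar.size()) {
    const char* header = tar.data() + pos;
    if (header[0] == '\0') break;  // a zeroed block marks the end of the archive
    const std::size_t size = parse_octal(header + kSizeOffset, kSizeWidth);
    const char type = header[kTypeOffset];
    pos += kBlock;
    if (pos + size > tar.size()) break;  // truncated archive: stop quietly
    if (type == 'L') {
      // this pseudo-entry's data is the real name of the entry that follows
      pending_long_name = read_string(tar.data() + pos, size);
    } else {
      names.push_back(pending_long_name.empty()
                          ? read_string(header, kNameWidth)
                          : pending_long_name);
      pending_long_name.clear();
    }
    pos += padded(size);  // skip the data blocks to reach the next header
  }
  return names;
}
```

A real extractor would also create directories and write out each entry's data, which is where the borrowed untar code does the heavy lifting.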
🔗 main.cc
We use emscripten_wget to copy the .tar.gz archive into Emscripten’s in-browser file system.
Then, we call inflate and untar in sequence.
Finally, we clean up the original .tar.gz file (don’t need it anymore!) and print some results to make sure everything worked.
This file contains the interesting code bits and bops you’d plug into your project if you were trying to do this yourself, so I’ve included a listing here for your edification.
#include <iostream>
#include <stdio.h>
#include <fstream>
#include <set>
#include <experimental/filesystem>

#include <zlib.h>
#include <emscripten.h>

#include "inflate.h"
#include "untar.h"

const std::string source_filename{"example.tar.gz"};

int main() {

  // this call to copy down the tar.gz archive is blocking
  // you have to compile with -s ASYNCIFY=1 to use it
  // emscripten_async_wget doesn't require ASYNCIFY
  emscripten_wget(
    "http://127.0.0.1:8000/example.tar.gz",
    source_filename.c_str()
  );

  auto file = gzopen(source_filename.c_str(), "rb");
  auto temp = std::tmpfile();

  // unzip into temporary file
  inflate(file, temp);
  gzclose(file);
  std::rewind(temp);

  // untar into present working directory
  untar(temp, "temp");

  // deletes temporary file
  std::fclose(temp);

  // remove the original .tar.gz archive... we don't need it anymore!
  std::experimental::filesystem::remove(source_filename);

  // print results
  std::cout << "time to print results!" << std::endl;
  for (const auto & filename : std::set{
    "example/example_file.txt",
    "example/example_directory/another_file.txt"
  }) {
    std::cout << "filename: " << filename << std::endl;
    std::cout << " size: " << std::experimental::filesystem::file_size(filename) << std::endl;
    std::cout << " content: " << std::endl;
    std::ifstream file{filename};
    std::string line;
    while(getline(file, line)) std::cout << " " << line << std::endl;
  }

  std::cout << "all done" << std::endl;

}
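One thing the listing above glosses over: it goes straight from download to gzopen without checking what actually landed. zlib's gzread passes non-gzip input through untouched, so a missing archive or an error page from the server would only blow up later, in untar. If you'd rather fail early, here's a minimal sanity check, assuming a peek at the two gzip magic bytes (0x1f 0x8b) is enough for your purposes. looks_like_gzip is a hypothetical helper of mine, not something from zlib or the example repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Returns true when the file at `path` starts with the gzip magic bytes.
// `looks_like_gzip` is a hypothetical helper, not part of zlib or Emscripten.
bool looks_like_gzip(const char* path) {
  assert(path != nullptr);
  std::FILE* f = std::fopen(path, "rb");
  if (f == nullptr) return false;  // no file at all: the download likely failed
  unsigned char magic[2] = {0, 0};
  const std::size_t got = std::fread(magic, 1, sizeof magic, f);
  std::fclose(f);
  return got == 2 && magic[0] == 0x1f && magic[1] == 0x8b;
}
```

In main.cc this would presumably sit right after the emscripten_wget call, bailing out with a message if looks_like_gzip(source_filename.c_str()) comes back false.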
🔗 index.html
This html file gives us something to look at when we serve up the minimal working example. :squirrel:
🔗 example.tar.gz
This is the example .tar.gz archive we want to retrieve and extract.
example
├── example_directory
│ └── another_file.txt
└── example_file.txt
🔗 Compile & Run
You’ll need a working copy of Emscripten to start with. Emscripten’s Getting Started page has instructions on getting that set up. I used version 1.38.28.
Grab the minimal working code.
git clone https://github.com/mmore500/emscripten-targz.git
cd emscripten-targz
Compile with emcc with flags to request zlib and asyncify.
emcc -std=c++17 -s USE_ZLIB=1 -s ASYNCIFY=1 -O3 main.cc -o example.js
All that’s left to do is serve it up…
python3 -m http.server 8000
…and surf over to http://127.0.0.1:8000 and pull up the JavaScript console in your developer tools.
🔗 Let’s Chat
I would love to hear your thoughts, questions, and comments RE: Emscripten .tar.gz kung fu!!!
I started a Twitter thread (right below) so we can chat.
throwing together this Quality Blog on retrieving and extracting .tar.gz archives with #Empscripten at 230 am ... so I'll let you guess how much fiddling and twiddling it took to get working 🙃 🙃🙃 https://t.co/ovolzUmqot
— mmore500 (@mmore500) January 27, 2020