Developer’s Guide

General Development Information

Development happens on GitHub; bug reports and feature requests are welcome. If you are interested in giving us a hand, please do not hesitate to submit a pull request there.

Continuous testing is provided by GitHub Actions. Note that ReproZip still tries to support Python 2 as well as Python 3. Test coverage is not very high because many operations are difficult to cover on CI (for instance, Vagrant VMs cannot be used there).

If you have any questions or need help with the development of an unpacker or plugin, please use our development mailing-list at reprozip@nyu.edu.

Introduction to ReproZip

ReproZip works in two steps: tracing and packing. Under the hood, tracing is two separate steps, leading to the following workflow:

  • Running the experiment under trace. During this part, the experiment is running, and the _pytracer C extension watches it through the ptrace mechanism, recording information in the trace SQLite3 database (.reprozip-trace/trace.sqlite3). The tracer stores raw information as it is recorded and does little else, leaving further processing to the next step. This part is referred to as the “C tracer”.
  • After the experiment is done, the Python code computes some additional information to generate the configuration file, by looking at the trace database and the filesystem. For example, all accesses to a file are aggregated to decide whether it is read or written by the overall experiment and whether it is an input or output file, symlinks are resolved, and so on. Additional information is also written, such as OS information and which distribution package each file comes from.
  • Packing reads the configuration file, which the user may have edited, to create the .rpz bundle. The bundle includes the configuration file (rewritten into a “canonical” version), the trace database (though it is not read at this step), and the files listed in the configuration.
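The aggregation performed in the second step can be pictured with a small sketch. The flag values and the function names below are hypothetical and chosen only for illustration; the real tracer defines its own constants and the real code does considerably more (symlinks, packages, input/output detection).

```python
# Hypothetical access flags, for illustration only; these are NOT
# ReproZip's actual constants.
READ = 0x01
WRITE = 0x02


def aggregate_accesses(accesses):
    """Combine (path, mode) records into a single mode per path."""
    modes = {}
    for path, mode in accesses:
        # OR the flags together so that one write marks the file as written
        modes[path] = modes.get(path, 0) | mode
    return modes


def classify(mode):
    """Label a combined mode as 'read', 'written' or 'read+written'."""
    if mode & READ and mode & WRITE:
        return 'read+written'
    if mode & WRITE:
        return 'written'
    return 'read'
```

For instance, a file that is opened once for reading and once for writing ends up with both flags set, and is therefore reported as both read and written by the overall experiment.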

Therefore, it is important to note that the configuration file and the trace database contain distinct information: although the configuration is inferred from the database, it contains some additional details that were obtained from the original machine afterwards.

Only the configuration file should be necessary to run unpackers. The trace database is included for information, and to support additional commands like reprounzip graph (see Visualizing the Provenance Graph).
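Since the trace is an ordinary SQLite3 database, it can be inspected with Python's standard sqlite3 module. The sketch below only lists the tables it finds and makes no assumption about the schema; the function name list_tables is ours, not part of ReproZip.

```python
import sqlite3


def list_tables(database_path):
    """Return the names of the tables in a SQLite database, sorted."""
    conn = sqlite3.connect(database_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' "
            "ORDER BY name")
        return [row[0] for row in rows]
    finally:
        conn.close()
```

It could be called as list_tables('.reprozip-trace/trace.sqlite3') after tracing an experiment. Note that sqlite3.connect() creates an empty database if the file does not exist, so check the path first.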

Writing Unpackers

ReproZip is divided into two steps. The first is packing, which gives a generic package containing the trace SQLite database, the YAML configuration file (which lists the paths, packages, and metadata such as command line, environment variables, and input/output files), and the actual files. In the second step, a package can be run using reprounzip. This decoupling allows the reproducer to select the unpacker of their choice, and also means that when a new unpacker is released, users will be able to use it on their old packages.

Currently, several unpackers are maintained: the default ones (directory and chroot), vagrant (distributed as reprounzip-vagrant) and docker (distributed as reprounzip-docker). However, the interface is such that new unpackers can be easily added. While taking a look at the “official” unpackers’ source is probably a good idea, this page gives some useful information about how they work.

ReproZip Bundle Format (.rpz)

An .rpz file is a tar.gz archive that contains a directory METADATA, which contains meta-information from reprozip, and an archive DATA.tar.gz, which contains the actual files that were packed and that will be unpacked to the target directory for reproducing the experiment.

The METADATA/version file marks the file as a ReproZip bundle. It always contains the string REPROZIP VERSION 2. Before version 0.8 (2015), it contained REPROZIP VERSION 1, and DATA was a directory rather than a tar.gz file.

The METADATA/config.yml file is in the same format as the configuration file generated by reprozip, but without the additional_patterns section (at this point, it has already been expanded to the actual list of files while packing).

The METADATA/trace.sqlite3 file is the original trace generated by the C tracer and maintained in a SQLite database; it contains all the information about the experiment, in case the configuration file is insufficient in some aspect. This file is used, for instance, by the graph unpacker, so that it can recover the exact hierarchy of processes, together with the executable images they execute and the files they access (with the time and mode of these accesses).
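As an illustration of this layout, the following sketch uses the standard tarfile module to read the version marker from a bundle. It assumes that the member is stored under the exact name METADATA/version and tolerates surrounding whitespace; the function names are hypothetical and this is not how reprounzip itself is implemented.

```python
import tarfile

EXPECTED_VERSION = b'REPROZIP VERSION 2'


def read_bundle_version(rpz_path):
    """Return the contents of METADATA/version from an .rpz bundle."""
    # 'r:*' lets tarfile detect the compression (gzip for .rpz files)
    with tarfile.open(rpz_path, 'r:*') as tar:
        member = tar.extractfile('METADATA/version')
        if member is None:
            raise ValueError("METADATA/version is not a regular file")
        return member.read().strip()


def is_version_2_bundle(rpz_path):
    """Check whether the file carries the version 2 marker."""
    try:
        return read_bundle_version(rpz_path) == EXPECTED_VERSION
    except (KeyError, tarfile.TarError):
        # KeyError: no METADATA/version member; TarError: not a tar archive
        return False
```

The same approach can be used to extract METADATA/config.yml or METADATA/trace.sqlite3 before deciding how to unpack DATA.tar.gz.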

Structure

An unpacker is a Python module. It can be distributed separately or be a part of a bigger distribution, provided that it is declared in that distribution’s setup.py as an entry_point to be registered with pkg_resources (see setuptools’ advertising behavior section). You should declare a function as an entry_point in the reprounzip.unpackers group. The name of the entry_point (before =) will be the reprounzip subcommand, and the value is a callable that will get called with the argparse.ArgumentParser object for that subcommand.

The package reprounzip.unpackers is a namespace package, so you should be able to add your own unpackers there if you want to. Please remember to put the correct code in the __init__.py file (which you can copy from here) so that namespace packages work correctly.

The modules reprounzip.common, reprounzip.utils, and reprounzip.unpackers.common contain utilities that you might want to use (make sure to list reprounzip as a requirement in your setup.py).

Example of setup.py:

from setuptools import setup

setup(name='reprounzip-vagrant',
      namespace_packages=['reprounzip', 'reprounzip.unpackers'],
      install_requires=['reprounzip>=0.4'],
      entry_points={
          'reprounzip.unpackers': [
              # The setup() function sets up the parser for reprounzip vagrant
              'vagrant = reprounzip.unpackers.vagrant:setup'
          ]
      }
      # ...
)

Usual Commands

If possible, you should try to follow the same command names that the official unpackers use, which are:

  • setup: to create the experiment directory and set everything for execution;
  • run: to reproduce the experiment;
  • destroy: to tear down everything that setup created and safely delete the experiment directory;
  • upload and download: to replace input files in the experiment, and to get the output files for further examination, respectively.

If these commands can be broken down into different steps that you want to expose to the user, or if you provide completely different actions from these defaults, you can add them to the parser as well. For instance, the vagrant unpacker exposes setup/start, which starts or resumes the virtual machine, and destroy/vm, which stops and deallocates the virtual machine but leaves the template for possible reuse.

A Note on File Paths

ReproZip supports Python 2 and 3, is portable to different operating systems, and is meant to accept a wide variety of configurations so that it is compatible with most experiments out there. Even trickier, the reprounzip-vagrant unpacker, for example, needs to manipulate POSIX filenames on Windows. Therefore, the rpaths library is used everywhere internally. You should make sure to use the correct type of path (either PosixPath or Path) and to cast these to the type that Python functions expect, keeping in mind 2/3 differences (most likely either filename.path or str(filename)).
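To illustrate why a dedicated POSIX path type matters, here is a sketch that uses the standard library's pathlib instead of rpaths. This is a deliberate substitution: pathlib is only available on Python 3 and is not what ReproZip uses, but its PurePosixPath shows the same idea of handling POSIX paths identically on any host. The function name guest_path is hypothetical.

```python
from pathlib import PurePosixPath


def guest_path(root, packed_file):
    """Place a packed file's absolute POSIX path under a POSIX root."""
    packed = PurePosixPath(packed_file)
    # Make the path relative first; joining an absolute path would
    # discard the root entirely
    return PurePosixPath(root) / packed.relative_to('/')


# str() gives the form expected by functions that take plain strings
location = str(guest_path('/experimentroot', '/usr/bin/python'))
```

Because PurePosixPath never consults the host operating system, the separators stay as forward slashes even when the code runs on Windows.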

Experiment Directory Format

Unpackers usually create a directory with everything necessary to later run the experiment. This directory is created by the setup operation, cleaned up by destroy, and is the argument to every command. For example, with reprounzip-vagrant:

$ reprounzip vagrant setup someexperiment.rpz mydirectory
$ reprounzip vagrant upload mydirectory /tmp/replace.txt:input_text

Unpackers unpack the config.yml file to the root of that directory, and keep status information in a .reprounzip file, which is a dict in pickle format. Following the same structure will allow the showfiles command, as well as FileUploader and FileDownloader classes, to work correctly. Please try to follow this structure.
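The status file can be pictured with the following sketch, which stores and reloads a dict with the standard pickle module. The helper names and the keys used in the example are hypothetical; reprounzip ships its own utilities for this, which you should prefer in a real unpacker.

```python
import os
import pickle

STATUS_FILE = '.reprounzip'


def write_status(directory, status):
    """Save the status dict in the experiment directory."""
    with open(os.path.join(directory, STATUS_FILE), 'wb') as fp:
        # Protocol 2 can be read by both Python 2 and Python 3
        pickle.dump(status, fp, protocol=2)


def read_status(directory):
    """Load the status dict from the experiment directory."""
    with open(os.path.join(directory, STATUS_FILE), 'rb') as fp:
        return pickle.load(fp)
```

For example, setup could call write_status(target, {'unpacker': 'example'}) and later commands could call read_status(target) to check that the directory was created by the same unpacker. Only load pickle files from directories you trust, since unpickling can execute arbitrary code.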

Signals

Since version 0.4.1, reprounzip has signals that can be used to hook in plugins, although no such plugin has been released at this time. To ensure that these work correctly when using your unpacker, you should emit them when appropriate. The complete list of signals is available in signal.py.
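The general mechanism can be sketched as a simple publish/subscribe class. This is a generic illustration with hypothetical names and is not reprounzip's own signal implementation; refer to the module mentioned above for the real signals and their arguments.

```python
class Signal(object):
    """A list of listeners that are all called when the signal is emitted."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, func):
        """Register a callable to be notified."""
        self._listeners.append(func)

    def emit(self, **kwargs):
        """Call every listener with the given keyword arguments."""
        # Iterate over a copy so that listeners may subscribe while we loop
        for func in list(self._listeners):
            func(**kwargs)


# Hypothetical signal that an unpacker would emit after its setup step
post_setup = Signal()
```

A plugin would call post_setup.subscribe(callback), and the unpacker would call post_setup.emit(target=...) once the experiment directory has been created.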

Final Observations

After reading this page, reading the source code of one of the “official” unpackers is probably the best way of understanding how to write your own. They should be short enough to be easy to grasp. Should you have additional questions, do not hesitate to use our mailing-list: reprozip@nyu.edu.