Using reprozip

The reprozip component is responsible for packing an experiment. In ReproZip, we assume that the experiment can be executed by a single command line, preferably with no GUI involved (please refer to Further Considerations for additional information regarding different types of experiments).

There are three steps when packing an experiment with reprozip: tracing the experiment, editing the configuration file (if necessary), and creating the reproducible package. Each of these steps is explained in more detail below. Please note that reprozip is only available for Linux distributions.

Tracing an Experiment

First, reprozip needs to trace the operating system calls used by the experiment, so as to identify all the necessary information for its future re-execution, such as binaries, files, library dependencies, and environment variables.

The following command is used to trace an experiment:

$ reprozip trace <command-line>

where <command-line> is the command line used to execute the experiment. By running this command, reprozip executes the experiment and uses ptrace to trace all the system calls issued, storing them in an SQLite database.

By default, if the operating system is Debian or Debian-based (e.g., Ubuntu), reprozip will also try to automatically identify the distribution packages from which the files come, using the system's package manager. This provides more detailed information about the dependencies and helps when reproducing the experiment. Note, however, that this identification runs after the experiment has finished and can take some time, depending on the number of file dependencies that the experiment has. To disable this feature, users may use the flag --dont-identify-packages:

$ reprozip trace --dont-identify-packages <command-line>

The database and a configuration file (see below) are placed in a directory named .reprozip, created under the path where the reprozip trace command was issued.
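Because the trace is stored as an ordinary SQLite database, it can be inspected with any SQLite client. The following is a minimal sketch using Python's standard sqlite3 module. The file name trace.sqlite3 is an assumption (this document does not state it, so check the contents of .reprozip), and list_tables is a hypothetical helper. It only lists table names and assumes nothing about the schema.

```python
import sqlite3

# Assumed location of the trace database; the actual file name is not
# given in this document, so check the .reprozip directory first.
TRACE_DB = ".reprozip/trace.sqlite3"


def list_tables(db_path):
    """Return the names of all tables in an SQLite file, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables(TRACE_DB) from the directory where the trace was run would then show which tables the tracer created.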

Editing the Configuration File

The configuration file, which can be found in .reprozip/config.yml, contains all the information necessary for creating the experiment package. It is generated by the tracer and drives the packing step.

You may not need to edit it, as the automatically generated file should be sufficient to generate a working package. However, you may want to edit this file prior to the creation of the package in order to add or remove files. This can be particularly useful, for instance, to remove big files that can be obtained elsewhere when reproducing the experiment, so as to keep the package small, and also to remove sensitive information that the experiment may use. The configuration file can also be used to edit the main command line, to add or remove environment variables, and to edit information regarding input/output files.

The first part of the configuration file gives general information with respect to the experiment execution, including the command line, environment variables, main input and output files, and machine information:

# Run info
version: <reprozip-version>
runs:
- architecture: <machine-architecture>
  argv: <command-line-arguments>
  binary: <command-line-binary>
  distribution: <linux-distribution>
  environ: <environment-variables>
  exitcode: <exit-code>
  gid: <group-id>
  hostname: <machine-hostname>
  input_files: <input-files>
  output_files: <output-files>
  system: <system-kernel>
  uid: <user-id>
  workingdir: <working-directory>

If necessary, users may change the command line parameters by editing argv, and add or remove environment variables by editing environ. In addition, input_files and output_files can be modified to inform ReproZip about any input/output file that the tool failed to detect, and also to give meaningful id names to these files (this may be useful for the unpacking step). Other attributes should generally not be changed.

The next section in the configuration file shows the files to be packed. If the software dependencies were identified by the package manager of the system during the reprozip trace command, they will be listed under packages; the file dependencies not identified in software packages are listed under other_files:

packages:
  - name: <package-name>
    version: <package-version>
    size: <package-size>
    packfiles: <include-package>
    files:
      # Total files used: <used-files-size>
      # Installed package size: <package-size>
      <files-list>
  - name: ...
  ...

other_files:
  <files-list>

The attribute packfiles can be used to control which software packages will be packed: its default value is true, but users may change it to false to inform reprozip that the corresponding software package should not be included. To remove a file that was not identified as part of a package, users can simply remove it from the list under other_files.
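When many packages need to be excluded, the edit can be scripted. The sketch below is a line-based approach that assumes the layout shown above (a "- name:" line followed by a "packfiles: true" line inside each package entry). It deliberately avoids a YAML library so that the comments in the file are preserved. The function name exclude_package is hypothetical, and a generated file may differ in details this sketch does not handle (for example, a trailing comment on the packfiles line).

```python
import re

NAME_RE = re.compile(r"^\s*-\s*name:\s*(.+?)\s*$")
PACKFILES_RE = re.compile(r"^(\s*packfiles:\s*)true\s*$")


def exclude_package(config_text, package):
    """Return config_text with packfiles set to false for one package."""
    current = None  # name of the package entry currently being scanned
    out = []
    for line in config_text.split("\n"):
        name_match = NAME_RE.match(line)
        if name_match:
            current = name_match.group(1).strip("'\"")
        elif line and not line[0].isspace() and not line.startswith("#"):
            current = None  # a new top-level key such as other_files
        elif current == package:
            pack_match = PACKFILES_RE.match(line)
            if pack_match:
                line = pack_match.group(1) + "false"
        out.append(line)
    return "\n".join(out)
```

The function works on the file's text, so a caller would read .reprozip/config.yml, pass its contents through exclude_package, and write the result back. Running reprozip reset restores the original file if the result is not what was intended.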

Last, users may add file patterns under additional_patterns to include other files that they think will be useful for a future reproduction. As an example, the following would add everything under /etc/apache2/ and all the Python files of all users from LXC containers (contrived example):

additional_patterns:
  - /etc/apache2/**
  - /var/lib/lxc/*/rootfs/home/**/*.py

Note that users can always reset the configuration file to its initial state by running the following command:

$ reprozip reset

Creating a Package

After tracing the experiment and optionally editing the configuration file, the experiment package can be created by issuing the command below:

$ reprozip pack <package-name>

where <package-name> is the name given to the package. This command generates a .rpz file in the current directory, which can then be sent to others so that the experiment can be reproduced. For more information regarding the unpacking step, please see Using reprounzip.

Note that reprozip pack copies files from your environment into the package. You should therefore not change any file used by the experiment between tracing and packing; otherwise, the package will contain different files from the ones the experiment used when it was traced.
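One rough way to check for such changes before packing is to compare file modification times against the time the trace finished. The sketch below is only a heuristic and is not part of ReproZip: changed_since is a hypothetical helper, the list of paths and the reference timestamp must be supplied by the user, and modification times can be altered without the content changing (and vice versa).

```python
import os


def changed_since(paths, reference_time):
    """Return paths modified after reference_time, or no longer present.

    reference_time is a POSIX timestamp, such as the time the trace ended.
    """
    changed = []
    for path in paths:
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            changed.append(path)  # removed since the trace
            continue
        if mtime > reference_time:
            changed.append(path)
    return changed
```

An empty result suggests, but does not prove, that the listed files are still in the state the experiment saw.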

Further Considerations

Packing Multiple Command Lines

ReproZip is meant to trace a whole experiment in one go. Therefore, if an experiment comprises multiple successive commands, users should create a simple script that runs all these commands, and pass that to reprozip trace.
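Such a wrapper can be very small. The following sketch uses Python's subprocess module; the two commands in STEPS are placeholders, and the names STEPS and run_steps are hypothetical. A shell script that runs the same commands in sequence would serve equally well.

```python
import subprocess
import sys

# Placeholder steps; replace them with the experiment's real command lines.
STEPS = [
    [sys.executable, "-c", "print('step 1: prepare data')"],
    [sys.executable, "-c", "print('step 2: run analysis')"],
]


def run_steps(steps):
    """Run each command in order and stop at the first failure.

    Returns the exit code of the failing step, or 0 if every step succeeded.
    """
    for argv in steps:
        completed = subprocess.run(argv)
        if completed.returncode != 0:
            return completed.returncode
    return 0
```

Saved as, say, run_all.py with a final line sys.exit(run_steps(STEPS)), the whole experiment could then be traced in one go with reprozip trace python run_all.py.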

Packing GUI and Interactive Tools

Currently, ReproZip cannot ensure that GUIs will be correctly reproduced, so we recommend packing tools in a non-GUI mode for a successful reproduction.

Additionally, there is no restriction on packing interactive experiments (i.e., experiments that require input from users). Note, however, that only the dependencies used during the trace are packed: if entering different input at reproduction time makes the experiment load additional dependencies, the experiment will probably fail when reproduced on a different machine.

Capturing Connections to Servers

When reproducing an experiment that communicates with a server, the experiment will try to connect to the same server, which may or may not succeed depending on the status of the server at the moment of the reproduction. However, if the experiment uses a local server (e.g., a database) that the user has control over, this server can also be captured, together with the experiment, to ensure that the connection will succeed. Users should create a script to:

  • start the server,
  • execute the experiment, and
  • stop the server,

and use reprozip to trace the script execution, rather than the experiment itself. This way, ReproZip is able to capture the local server as well, which ensures that the server will be alive at the time of the reproduction.
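The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed implementation: run_with_server is a hypothetical helper, the server and experiment command lines must be supplied by the user, and the fixed startup delay is a simplification (a real script might instead wait until the server accepts connections).

```python
import subprocess
import time


def run_with_server(server_argv, experiment_argv, startup_delay=2.0):
    """Start a server, run the experiment, then stop the server.

    Returns the experiment's exit code.
    """
    server = subprocess.Popen(server_argv)
    try:
        time.sleep(startup_delay)  # crude wait for the server to come up
        return subprocess.run(experiment_argv).returncode
    finally:
        server.terminate()  # sends SIGTERM on Linux
        try:
            server.wait(timeout=10)
        except subprocess.TimeoutExpired:
            server.kill()  # force-stop a server that ignores SIGTERM
            server.wait()
```

The try/finally block ensures the server is stopped even if the experiment fails, so the traced script always terminates.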

Excluding Sensitive and Third-Party Information

ReproZip automatically tries to identify log and temporary files, removing them from the package, but the configuration file should be edited to remove any sensitive information that the experiment uses, or any third-party file/software that should not be distributed. Note that the ReproZip team is not responsible for personal and non-authorized files that may get distributed in a package; users should double-check the configuration file and their package before sending it to others.

Identifying Output Files

ReproZip tries to automatically identify the main output files generated by the experiment during the trace command to provide useful interfaces for users during the unpacking step. However, if the experiment creates unique names for its outputs every time it is executed (e.g., names containing the current date and time), the reprounzip component will not be able to correctly detect these, because it assumes that the path names of input and output files do not change between executions. In this case, handling output files will fail. It is recommended that users modify their experiment (or use a wrapper script) to generate a symbolic link (with a default name) that always points to the latest result, and use that as the output file's path in the configuration file (under the output_files section).
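A wrapper script could maintain such a link along the following lines. This is a minimal sketch: publish_latest and the ".tmp" suffix are hypothetical choices, and the link is created under a temporary name and then renamed so that the default name never points to a missing file.

```python
import os


def publish_latest(result_path, link_path):
    """Point the symbolic link at link_path to result_path."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(os.path.abspath(result_path), tmp_link)
    os.replace(tmp_link, link_path)  # atomic rename on POSIX systems
```

After each run, the wrapper would call publish_latest with the newly generated file and a fixed link name, and that fixed name is the path to list under output_files.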