Using reprozip

The reprozip component is responsible for packing an experiment, which is done in three steps: tracing the experiment, editing the configuration file (if necessary), and creating the reproducible package. Each of these steps is explained in more detail below. Please note that reprozip is only available for Linux distributions.

Tracing an Experiment

First, reprozip needs to trace the operating system calls used by the experiment, so as to identify all the necessary information for its future re-execution, such as binaries, files, library dependencies, and environment variables.

The following command is used to trace a command line, or a run, used by the experiment:

$ reprozip trace <command-line>

where <command-line> is the command line to execute. By running this command, reprozip executes <command-line> and uses ptrace to trace all the system calls issued, storing them in an SQLite database.

If you run the command multiple times, reprozip might ask you if you want to continue with your current trace (append the new command-line to it) or replace it (throw away the previous command-line you traced). You can skip this prompt by using either the --continue or --overwrite flag, like this:

$ reprozip trace --continue <command-line>

Note that the final bundle will be able to reproduce any of the runs, and files shared by multiple runs are only stored once.
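As a minimal sketch of tracing several runs into one trace without being prompted, the helper below appends to the current trace when a .reprozip-trace directory already exists and starts a new one otherwise. The function name trace_run, the REPROZIP variable, and the script names in the usage comment are hypothetical and only serve as illustration:

```shell
#!/bin/sh
# Assumption: reprozip is on PATH; set REPROZIP to point at another executable.
REPROZIP="${REPROZIP:-reprozip}"

# trace_run CMD [ARGS...]: trace one command line as a run (hypothetical helper).
trace_run() {
    if [ -d .reprozip-trace ]; then
        # A trace already exists here: append this run to it.
        "$REPROZIP" trace --continue "$@"
    else
        # No trace yet: start a new one.
        "$REPROZIP" trace "$@"
    fi
}

# Usage (placeholder scripts):
#   trace_run ./preprocess.sh raw.csv
#   trace_run ./analyze.sh clean.csv
```

To start over instead of appending, run reprozip trace --overwrite directly.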

By default, if the operating system is based on Debian or RPM packages (e.g.: Ubuntu, CentOS, Fedora, …), reprozip will also try to automatically identify the distribution packages from which the files come, using the available package manager of the system. This is useful to provide more detailed information about the dependencies, as well as to further help when reproducing the experiment. However, note that the trace command can take some time doing that after the experiment finishes, depending on the number of file dependencies that the experiment has. To disable this feature, users may use the flag --dont-identify-packages:

$ reprozip trace --dont-identify-packages <command-line>

The database, together with a configuration file (see below), is placed in a directory named .reprozip-trace, created under the path where the reprozip trace command was issued.

Editing the Configuration File

The configuration file, which can be found in .reprozip-trace/config.yml, contains all the information necessary for creating the experiment bundle. This file is generated by the tracer and drives the packing step.

It is very likely that you won’t need to modify this file, as the automatically generated one should be sufficient to create a working bundle. However, in some cases, you may want to edit it before creating the package to add or remove files used by your experiment. This can be particularly useful, for instance, to remove big files that can be obtained elsewhere when reproducing the experiment, to keep the size of the package small, and also to remove sensitive information that the experiment may use. The configuration file can also be used to edit the main command line, to add or remove environment variables, and to edit information regarding input/output files.

The first part of the configuration file gives general information with respect to the experiment and its runs, including command lines, environment variables, working directory, and machine information. Also, each run has a unique identifier (given by id) that is consistently used for packing and unpacking purposes:

# Run info
version: <reprozip-version>
runs:
# Run 0
- id: <run-id>
  architecture: <machine-architecture>
  argv: <command-line-arguments>
  binary: <command-line-binary>
  distribution: <linux-distribution>
  environ: <environment-variables>
  exitcode: <exit-code>
  gid: <group-id>
  hostname: <machine-hostname>
  system: <system-kernel>
  uid: <user-id>
  workingdir: <working-directory>

# Run 1
- id: ...
...

If necessary, users may change command line parameters by editing argv, and add or remove environment variables by editing environ. Users may also give a more meaningful and user-friendly identifier for a run by changing id. Other attributes should not be changed in general.

The next section contains information about input and output files, including their original paths and which runs read and/or wrote them. These are the files that reprozip identified as the main inputs or results of the experiment, which reprounzip will later be able to replace and extract from the experiment when reproducing it. You may add, remove, or edit these files in case reprozip fails to recognize any important information, as well as give meaningful names to them by editing name:

# Inputs are files that are only read by a run; reprounzip can replace these
# files on demand to run the experiment with custom data.
# Outputs are files that are generated by a run; reprounzip can extract these
# files from the experiment on demand, for the user to examine.
# The name field is the identifier the user will use to access these files.
inputs_outputs:
  - name: <file-identifier>
    path: <path-to-file>
    read_by_runs: <run-ids>
    written_by_runs: <run-ids>
  - name: ...
  ...

Note that you can prevent reprozip from identifying which files are input or output by using the --dont-find-inputs-outputs flag in the reprozip trace command.

Note

To visualize the dataflow of the experiment, please refer to Visualizing the Provenance Graph.

Creating a Bundle

After tracing all the runs from the experiment and optionally editing the configuration file, the experiment bundle can be created by using the following command:

$ reprozip pack <bundle>

where <bundle> is the name given to the package. This command generates a .rpz file in the current directory, which can then be sent to others so that the experiment can be reproduced. For more information regarding the unpacking step, please see Using reprounzip.

Note that, by using reprozip pack, files will be copied from your environment to the package; as such, you should not change any file that the experiment used before packing it, otherwise the package will contain different files from the ones the experiment used when it was originally traced.
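The packing step can be wrapped in a small guard that refuses to pack when no trace is present in the current directory. This is only a sketch: the function name pack_bundle and the REPROZIP variable are hypothetical, and the bundle name is passed with an explicit .rpz extension so the script does not rely on how reprozip completes the name:

```shell
#!/bin/sh
# Assumption: reprozip is on PATH; set REPROZIP to point at another executable.
REPROZIP="${REPROZIP:-reprozip}"

# pack_bundle NAME: pack the trace in the current directory into NAME.rpz
# (hypothetical helper) and print the bundle file name.
pack_bundle() {
    bundle="$1.rpz"
    # The tracer writes its configuration here; without it there is nothing to pack.
    if [ ! -f .reprozip-trace/config.yml ]; then
        echo "no trace found in $(pwd); run 'reprozip trace' first" >&2
        return 1
    fi
    "$REPROZIP" pack "$bundle" || return 1
    echo "$bundle"
}

# Usage (placeholder name):
#   pack_bundle my_experiment
```

Run it right after tracing, from the same directory, so that the files copied into the bundle are the ones the experiment actually used.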

Warning

Before sending your bundle to others, it is advisable to test it and ensure that the reproduction of the experiment works.

Further Considerations

Packing Multiple Command Lines

As mentioned before, ReproZip allows multiple runs (i.e., command lines) to be traced and included in the same bundle. Alternatively, users can create a simple script that runs all the command lines, and pass that to reprozip trace. However, in this case, there will be no flexibility in choosing a single run to be reproduced, since the entire script will be re-executed.

Note that this flexibility has the caveat that users may reproduce the runs in a different order than the one originally used while tracing. If the order is important for the reproduction (e.g.: each run represents a step in a dataflow), please make sure to communicate the correct reproduction order to whoever wants to replicate the experiment. The order can also be obtained by running reprounzip graph; please refer to Creating a Provenance Graph for more information.

ReproZip can also combine multiple traces into a single one, in order to create a single bundle, using the reprozip combine command. The runs of each subsequent trace are simply appended in order.

Packing GUI and Interactive Tools

ReproZip is able to pack GUI tools. Additionally, there is no restriction on packing interactive experiments (i.e., experiments that require input from users). Note, however, that if entering different input makes the experiment load additional dependencies, the experiment will probably fail when reproduced on a different machine.

Capturing Connections to Servers

When reproducing an experiment that communicates with a server, the experiment will try to connect to the same server, which may or may not fail depending on the status of the server at the moment of the reproduction. However, if the experiment uses a local server (e.g.: database) that the user has control over, this server can also be captured, together with the experiment, to ensure that the connection will succeed. Users should create a script to:

start the server,
execute the experiment, and
stop the server,

and use reprozip to trace the script execution, rather than the experiment itself. In this way, ReproZip is able to capture the local server as well, which ensures that the server will be alive at the time of the reproduction.

For example, if you have a web app that uses MySQL and that runs until Ctrl+C is received, you can use the following script:

#!/bin/sh

if [ "$(id -u)" != 0 ]; then echo "This script needs to run as root so that it can execute MySQL" >&2; exit 1; fi

# Start MySQL
sudo -u mysql /usr/sbin/mysqld --pid-file=/run/mysqld/mysqld.pid &
sleep 5

# Don't exit the whole script on Ctrl+C
trap ' ' INT

# Execute actual experiment that uses the database
./manage.py runserver 0.0.0.0:8000

trap - INT

# Graceful shutdown
/usr/bin/mysqladmin shutdown

Note the use of trap to avoid exiting the entire script when pressing Ctrl+C, which makes sure that the database gets shut down by the next command.

Excluding Sensitive and Third-Party Information

ReproZip automatically tries to identify log and temporary files, removing them from the bundle, but the configuration file should be edited to remove any sensitive information that the experiment uses, or any third-party file/software that should not be distributed. Note that the ReproZip team is not responsible for personal and non-authorized files that may get distributed in a package; users should double-check the configuration file and their package before sending it to others.

Identifying Output Files

The reprozip component tries to automatically identify the main output files generated by the experiment during the trace command to provide useful interfaces for users during the unpacking step. However, if the experiment creates unique names for its outputs every time it is executed (e.g.: names with current date and time), the reprounzip component will not be able to correctly detect these; it assumes that input and output files do not have their path names changed between different executions. In this case, handling output files will fail. It is recommended that users modify their experiment (or use a wrapper script) to generate a symbolic link (with a fixed name) that always points to the latest result, and use that as the output file’s path in the configuration file (under the inputs_outputs section).
