Artifact Evaluation (AE) for RTSS is an additional, optional evaluation process for research works that have been accepted for publication at RTSS. It gives authors the opportunity to have the community validate the reproducibility of the experiments and data reported in their paper.
Authors of accepted regular papers with computational components will be invited to submit (but are not required to submit) the relevant artifact for evaluation by the artifact evaluation committee. AE is designed to help with the goal of producing reproducible science. In the AE process, peer practitioners from the community will follow the instructions included in the artifacts and give feedback to the authors, while keeping papers and artifacts confidential and under the control of the authors.
The AE process is non-competitive. The acceptance of the papers has already been decided before the AE process starts and the hope is that all the artifacts submitted will pass the evaluation criteria.
Based on previous experience, the biggest hurdle to successful reproducibility is the setup and installation of the necessary libraries and dependencies. Authors are therefore strongly encouraged to prepare a virtual machine (VM) image of their artifact and keep it accessible through an HTTP link throughout the evaluation process. As the basis of the VM image, please choose a commonly used OS version that has been tested with the virtual machine software and that evaluators are likely to be familiar with. We encourage you to use VirtualBox (https://www.virtualbox.org) and to save the VM image as an Open Virtual Appliance (OVA) file.
The submission website is available to authors of accepted and shepherded papers at https://www.softconf.com/i/ae2018. When submitting an artifact for evaluation, please provide a README file with instructions on how to use the artifact to reproduce the results in the paper. The README file should include a link to the virtual machine image. Additionally, include a description of the OS and parameters of the image, as well as the host platform on which you prepared and tested your virtual machine image (OS, RAM, number of cores, CPU frequency). Please describe how to proceed after booting the image, including the instructions for running the artifact. Finally, be sure to include the version of the accepted paper related to the artifact that is closest to final. A good guide on how to prepare an artifact evaluation package is available online at http://bit.ly/HOWTO-AEC.
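As a minimal sketch of how the host platform description requested above could be collected, the following Python script gathers the OS, core count, and RAM using only the standard library. The function names `describe_host` and `format_host_section` are hypothetical, the RAM query relies on `os.sysconf` names that are assumed to exist only on some Unix-like systems, and the CPU frequency is left as a field to fill in by hand because the standard library offers no portable way to read it.

```python
import os
import platform


def describe_host():
    """Collect basic facts about the host used to prepare and test the VM image."""
    info = {
        "os": platform.platform(),         # e.g. kernel and distribution string
        "machine": platform.machine(),     # e.g. x86_64
        "logical_cores": os.cpu_count(),   # may be None if it cannot be determined
        "ram_gib": None,                   # filled in below when the query is supported
    }
    try:
        # Assumption: these sysconf names are available (true on Linux, not everywhere).
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        info["ram_gib"] = round(page_size * page_count / 2**30, 1)
    except (AttributeError, ValueError, OSError):
        pass  # unsupported platform: report RAM manually
    return info


def format_host_section(info):
    """Render the collected facts as plain-text lines for a README."""
    ram = "fill in manually" if info["ram_gib"] is None else f"{info['ram_gib']} GiB"
    return "\n".join([
        "Host platform used to prepare and test the VM image:",
        f"  OS:            {info['os']}",
        f"  Architecture:  {info['machine']}",
        f"  Logical cores: {info['logical_cores']}",
        f"  RAM:           {ram}",
        "  CPU frequency: fill in manually",
    ])
```

Calling `print(format_host_section(describe_host()))` on the host produces a block of text that can be pasted into the README and completed by hand.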
If you are not in a position to prepare the artifact as above, or if your artifact requires special libraries (Matlab or specific toolboxes) or hardware, please contact the AE chair.
Authors of papers whose artifacts pass the evaluation will be entitled to include in the camera-ready version of their paper an RTSS AE seal indicating that the artifact has passed the repeatability test. Authors are also entitled, and indeed encouraged, to use this RTSS AE seal on the title slide of the corresponding presentation.
The artifact evaluation criteria are similar to those used for other conferences’ repeatability evaluation. Submissions will be judged on three criteria (coverage, instructions, and quality), where each criterion is assessed on the following scale:
- significantly exceeds expectations (5),
- exceeds expectations (4),
- meets expectations (3),
- falls below expectations (2),
- missing or significantly falls below expectations (1).
In order to be judged “repeatable” an artifact must “meet expectations” (average score of 3 or more), and must not have any missing elements (no scores of 1). Each artifact is evaluated independently according to the objective criteria. The higher scores (“exceeds” or “significantly exceeds expectations”) in the criteria will be considered aspirational goals, not requirements for acceptance.
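The acceptance rule above can be sketched in Python as follows. This is an illustration only, not the committee's tooling: the function name `is_repeatable` is hypothetical, and the sketch assumes that the "average score" is the arithmetic mean of one score per criterion. How scores from several evaluators are combined is not specified here and is not modelled.

```python
from statistics import mean

# The three criteria named in the text.
CRITERIA = ("coverage", "instructions", "quality")


def is_repeatable(scores):
    """Return True if one score per criterion satisfies the stated rule.

    Rule: the average score is 3 or more, and no criterion is scored 1.
    Assumption: the average is the arithmetic mean over the three criteria.
    """
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    values = [scores[name] for name in CRITERIA]
    if any(value not in range(1, 6) for value in values):
        raise ValueError("each score must be an integer from 1 to 5")
    return mean(values) >= 3 and min(values) > 1
```

Under this reading, scores of 4, 3, and 2 pass (the mean is exactly 3 and nothing is missing), scores of 5, 5, and 1 fail because one element is missing, and scores of 3, 2, and 3 fail because the mean is below 3.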
The coverage criterion asks what fraction of the appropriate figures and tables are reproduced by the artifact. Note that some figures and tables should not be included in this calculation; for example, figures generated by a drawing program, or tables listing only parameter values. The focus is on those figures or tables in the paper containing computationally generated or processed experimental evidence used to support the claims of the paper.
Note that satisfying this criterion does not require that the corresponding figures or tables be recreated in exactly the same format as appears in the paper, merely that the data underlying those figures or tables be generated faithfully in a recognizable format.
A repeatable element is one for which the computation can be rerun by following the instructions provided with the artifact in a suitably equipped environment. An extensible element is one for which variations of the original computation can be run by modifying elements of the code and/or data. Consequently, necessary conditions for extensibility include that the modifiable elements be identified in the instructions or documentation, and that all source code be available and/or involve only calls to commonly available and trusted software (e.g., Windows, Linux, C or Python standard libraries, Matlab).
The categories for this criterion are:
- None (missing / 1): There are no repeatable elements.
- Some (falls below expectations / 2): There is at least one repeatable element.
- Most (meets expectations / 3): The majority (at least half) of the elements are repeatable.
- All repeatable or most extensible (exceeds expectations / 4): All elements are repeatable or most are repeatable and easily modified. Note that if there is only one computational element and it is repeatable, then this score should be awarded.
- All extensible (significantly exceeds expectations / 5): All elements are repeatable and easily modified.
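One possible reading of the coverage categories above, written as a Python function. The name `coverage_score` and its parameters are hypothetical, and two points are assumptions of this sketch rather than statements from the criteria: "most" is taken to mean "at least half" (as stated for category 3), and a single element that is both repeatable and extensible is scored 5.

```python
def coverage_score(total, repeatable, extensible):
    """Map element counts to a coverage score from 1 to 5.

    total      -- number of computational elements (qualifying figures and tables)
    repeatable -- how many of them can be rerun by following the instructions
    extensible -- how many of the repeatable ones are also easily modified
    """
    if total < 1 or not (0 <= extensible <= repeatable <= total):
        raise ValueError("expected 0 <= extensible <= repeatable <= total, total >= 1")
    if repeatable == 0:
        return 1                      # None: no repeatable elements
    if extensible == total:
        return 5                      # All extensible
    if repeatable == total:
        return 4                      # All repeatable (covers the one-element case)
    if 2 * extensible >= total:
        return 4                      # Most extensible (assumed: at least half)
    if 2 * repeatable >= total:
        return 3                      # Most repeatable (at least half)
    return 2                          # Some: at least one repeatable element
```

For example, a paper with four qualifying figures of which two can be rerun would score 3 under this reading, and a paper with a single repeatable but not extensible element would score 4.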
The instructions criterion focuses on the instructions that allow other practitioners to re-create the computationally generated results from the paper. The categories for this criterion are:
- None (missing / 1): No instructions were included in the artifact.
- Rudimentary (falls below expectations / 2): The instructions specify a script or command to run, but little else.
- Complete (meets expectations / 3): For every computational element that is repeatable, there is a specific instruction which explains how to repeat it. The environment under which the software was originally run is described.
- Comprehensive (exceeds expectations / 4): For every computational element that is repeatable there is a single command or clearly defined short series of steps which recreates that element almost exactly as it appears in the published paper (e.g., file format, fonts, and line styles might not be the same, but the content of the element is the same). In addition to identifying the specific environment under which the software was originally run, a broader class of environments is identified under which it could run.
- Outstanding (significantly exceeds expectations / 5): In addition to the criteria for a comprehensive set of instructions, explanations are provided of:
  - all the major components / modules in the software,
  - important design decisions made during implementation,
  - how to modify / extend the software, and/or
  - what environments / modifications would break the software.
The quality criterion explores the means provided to infer, show, or prove the trustworthiness of the software and its results. A set of scripts which exactly recreate, for example, the figures from the paper certainly aids repeatability. However, without well-documented code it is hard to understand how the data in a figure was processed, without well-documented data it is hard to determine whether the input is correct, and without testing it is hard to determine whether the results can be trusted.
If there are tests in the artifact which are not included in the paper, they should at least be mentioned in the instructions document. Documentation of test details can be put into the instructions document or into a separate document in the artifact.
The categories for this criterion are:
- None (missing / 1): There is no evidence of software documentation or testing.
- Rudimentary documentation (falls below expectations / 2): The purpose of almost all files is documented (preferably within the file, but otherwise in the instructions or a separate readme file).
- Comprehensive documentation (meets expectations / 3): The purpose of almost all files is documented. Within source code files, almost all classes, methods, attributes and variables are given clear, descriptive names and/or documentation of their purpose. Within data files, the format and structure of the data is documented; for example, in comma separated value (csv) files there is a header row and/or comments explaining the contents of each column.
- Comprehensive documentation and rudimentary testing (exceeds expectations / 4): In addition to the criteria for comprehensive documentation, there are identified test cases with known solutions which can be run to validate at least some components of the code.
- Comprehensive documentation and testing (significantly exceeds expectations / 5): In addition to the criteria for comprehensive documentation, there are clearly identified unit tests (preferably run within a unit test framework) which exercise a significant fraction of the smaller components of the code (individual functions and classes) and system level tests which exercise a significant fraction of the full package. Unit tests are typically self-documenting, but the system level tests will require documentation of at least the source of the known solution. Note that tests are a form of documentation, so it is not really possible to have testing without documentation.
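To illustrate what a "test case with a known solution" might look like, here is a small sketch using Python's `unittest` framework. The function under test, `liu_layland_bound`, is a hypothetical stand-in for an artifact component; it was chosen only because its reference values are well known (the rate-monotonic utilization bound n(2^(1/n) - 1) from Liu and Layland, 1973). As the criteria ask, each test records the source of its expected value.

```python
import math
import unittest


def liu_layland_bound(task_count):
    """Rate-monotonic utilization bound n * (2**(1/n) - 1) for n periodic tasks."""
    if task_count < 1:
        raise ValueError("task_count must be at least 1")
    return task_count * (2 ** (1 / task_count) - 1)


class LiuLaylandBoundTest(unittest.TestCase):
    """Known solutions taken from the closed-form bound (Liu and Layland, 1973)."""

    def test_single_task_can_use_the_whole_processor(self):
        self.assertAlmostEqual(liu_layland_bound(1), 1.0)

    def test_two_tasks_match_published_value(self):
        # 2 * (sqrt(2) - 1) is approximately 0.8284.
        self.assertAlmostEqual(liu_layland_bound(2), 0.8284, places=4)

    def test_bound_approaches_ln2_for_many_tasks(self):
        # The bound converges to ln 2 (about 0.6931) as the task count grows.
        self.assertAlmostEqual(liu_layland_bound(10**6), math.log(2), places=4)
```

Saved as a module inside the artifact, such tests could be run with `python -m unittest`, and the instructions document would only need to mention that command.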
Artifacts for accepted papers are expected to be submitted on or before August 17th 2018 (two weeks after notification). The evaluators may give early feedback if there are any issues with the artifacts that prevent them from being run correctly. Notification of the results of artifact evaluation will be given one week prior to the camera-ready deadline (i.e. 21st September 2018).
Artifact Evaluation Committee Chair
- Julio Medina (University of Cantabria – Spain)
Artifact Evaluation Committee
- Mohammad Ashjaei (Mälardalen University – Sweden)
- Sergiy Bogomolov (The Australian National University – Australia)
- José Fonseca (CISTER, ISEP – Portugal)
- David García-Villaescusa (University of Cantabria – Spain)
- Mohamed Hassan (University of Guelph – Canada)
- Taylor Johnson (Vanderbilt University – USA)
- Cláudio Maia (CISTER, ISEP – Portugal)
- Dionisio de Niz (Software Engineering Institute, Carnegie Mellon University – USA)
- Alessandro Papadopoulos (Mälardalen University – Sweden)
- Aditya Zutshi (Duke University – USA)