METRICS evaluation framework
The evaluation framework that will be implemented during the METRICS competitions is based on metrology concepts and builds on previous projects that organized robotics competitions (e.g. SciRoc, RockEU2, RoCKIn, euRathlon, ECHORD++, EURON, EuRoC and RoboCup).
The METRICS evaluation paradigm consists in comparing reference data (the “ground truth” annotated by human experts or provided by measuring instruments in the test facility) with hypothesis data (the behaviour or output produced automatically by the intelligent system). This comparison allows the estimation of the performance, the reliability and other characteristics of the robots, such as their efficiency. The evaluation can concern either the entire system (during task benchmarks, TBM) or the main technological bricks taken independently (during functionality benchmarks, FBM).
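As a purely illustrative sketch of this paradigm (not the official METRICS protocol), the example below compares a hypothetical set of ground-truth annotations with a system’s output for a simple recognition task and counts false positives and false negatives; the function name and data are invented for illustration.

```python
# Illustrative only: comparing reference annotations ("ground truth")
# with hypothesis output for a simple recognition task.

def detection_errors(reference: set, hypothesis: set) -> dict:
    """Count true/false positives and false negatives by comparing label sets.

    A real benchmark would match detections by class *and* position;
    plain set comparison is used here only to keep the sketch minimal.
    """
    false_positives = hypothesis - reference   # reported, but absent from the ground truth
    false_negatives = reference - hypothesis   # present in the ground truth, but missed
    true_positives = reference & hypothesis
    return {"TP": len(true_positives),
            "FP": len(false_positives),
            "FN": len(false_negatives)}

# Hypothetical data: annotations by human experts vs. the robot's output.
reference = {"cup", "plate", "fork"}
hypothesis = {"cup", "plate", "spoon"}
print(detection_errors(reference, hypothesis))  # {'TP': 2, 'FP': 1, 'FN': 1}
```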
This approach to evaluation is based on fundamental metrological principles:
- the measurements must be quantitative and provided either by a formula (metric) indicating the distance between the reference and the hypothesis (for example: the distance between the executed trajectory and a real or ideal reference trajectory in a navigation task, as sketched after this list; the number of false positives and false negatives in an image recognition task; the binary success of a task; etc.), or by a direct measurement (time to completion, distance covered, etc.);
- the measurement must be repeatable: two evaluations carried out under the same conditions must produce the same result;
- the experiments must be reproducible (based on a transparent evaluation protocol and on common testing environments and datasets): one can replay the evaluation in one’s own laboratory under similar conditions;
- the influence factors must be identified and controlled: the evaluation should be designed so as to address one specific dependent variable at a time;
- the results must be interpretable, so as to identify areas of improvement explicitly and determine the maturity of a technology. The granularity of the evaluation must be fine enough to identify the technological brick responsible for any underperformance of the system.
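To make the first of these principles concrete, here is a minimal sketch (assuming trajectories sampled at the same timestamps; this is not an official METRICS formula) of a distance-style metric for a navigation task: the mean Euclidean error between the trajectory executed by the robot and the reference trajectory recorded in the test facility.

```python
# Illustrative distance-based metric: mean Euclidean error between the
# executed trajectory and a reference trajectory with matching timestamps.
import math

def mean_trajectory_error(reference, hypothesis):
    """Average point-to-point distance between two equally sampled 2-D paths."""
    if len(reference) != len(hypothesis):
        raise ValueError("trajectories must be sampled at the same timestamps")
    errors = [math.dist(r, h) for r, h in zip(reference, hypothesis)]
    return sum(errors) / len(errors)

# Hypothetical data: reference path from the test facility's tracking
# system vs. the path actually driven by the robot.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hypothesis = [(0.0, 0.1), (1.0, 0.2), (2.1, 0.0)]
print(round(mean_trajectory_error(reference, hypothesis), 3))  # 0.133
```

A direct measurement such as time to completion needs no such comparison formula: the measuring instrument in the test facility provides the value itself.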
Moreover, the evaluation framework will include the following characteristics:
- The simultaneous evaluation of all systems is required both by the difficulty of modelling the influence of environmental factors on a system’s performance and by two imperatives concerning testing environments:
  - A priori ignorance: evaluated systems must have a learning capacity and, as a consequence, should not have a priori knowledge of the testing environment (testbeds and testing datasets) used for the evaluation, in order to avoid measurement bias. This remark remains valid for systems without learning capabilities, since developers can influence the design of their systems if they have a priori information about the testbeds and data;
  - A posteriori publication: to ensure the reproducibility of the evaluation experiments, the testing environments used must be publicly described (and accessible, in the case of datasets) when the measurements and results are published;
- The evaluation will be carried out by a trusted third party (the METRICS Consortium). This impartial trusted third party will have expertise in metrology applied to such systems, enabling it to develop an evaluation protocol common to all participants.