AI evaluation campaigns during robotics competitions: the METRICS paradigm

Guillaume Avrin, Virginie Barbosa, Agnes Delaborde

The H2020 METRICS project (2020-2023) organizes competitions in four application areas (Healthcare, Infrastructure Inspection and Maintenance, Agri-Food, and Agile Production), relying on both physical testing facilities (field evaluation campaigns) and virtual testing facilities (data-based evaluation campaigns) to mobilize, in addition to the European robotics community, the artificial intelligence community. This article presents this approach and paves the way for a new robotics and artificial intelligence competition paradigm.

ACRE: Quantitative Benchmarking in Agricultural Robotics

Riccardo Bertoglio, Giulio Fontana, Matteo Matteucci, Davide Facchinetti and Stefano Santoro

The aim of ACRE (Agri-food Competition for Robot Evaluation) is to provide a set of benchmarks for agricultural robots and smart implements. While involving capabilities of general application, ACRE puts a special focus on weeding, identified as one of the tasks where it is easiest for robotics to demonstrate its potential. ACRE, like the other three robot competitions organised by the European project METRICS, is built on the established idea of benchmarking through competitions. In this paper we present the framework of ACRE and examples of its benchmarks.


Activity Recognition for Ambient Assisted Living with Videos, Inertial Units and Ambient Sensors

Ranieri, C. M., MacLeod, S., Dragone, M., Vargas, P. A., & Romero, R. A. F.

Worldwide demographic projections point to a progressively older population. This fact has fostered research on Ambient Assisted Living, which includes developments in smart homes and social robots. To endow such environments with truly autonomous behaviours, algorithms must extract semantically meaningful information from whichever sensor data is available. Human activity recognition is one of the most active fields of research within this context. Proposed approaches vary according to the input modality and the environments considered.


Automatic Dataset Generation From CAD for Vision-Based Grasping

Saad Ahmad, Kulunu Samarawickrama, Esa Rahtu and Roel Pieters

Published in: 20th International Conference on Advanced Robotics

Recent developments in robotics and deep learning enable the training of models for a wide variety of tasks from large amounts of collected data. Visual and robotic tasks, such as pose estimation or grasping, are trained from image data (RGB-D) or point clouds that need to be representative of the actual objects in order to obtain accurate and robust results. This implies either generalized object models or large training datasets that include all object and environment variability. However, data collection is often a bottleneck in the fast development of learning-based models. In fact, data collection might be impossible or even undesirable when physical objects are unavailable or when the physical recording of data is too time-consuming and expensive, for example when building a data recording setup with cameras and robotic hardware. CAD tools, in combination with robot simulation, offer a solution for the generation of training data that can be easily automated and that can be just as realistic as real-world data. In this work, we propose a data generation pipeline that takes as input a CAD model of an object and automatically generates the required training data for object pose estimation and object grasp detection. The generated object data are: RGB and depth images, object binary mask, class label, and ground-truth pose in the camera and world frames. We demonstrate the dataset generation for several sets of industrial object assemblies and evaluate the trained models on state-of-the-art pose estimation and grasp detection approaches. Code and video are available at:
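The ground-truth poses mentioned above, in both camera and world frames, are related by a composition of homogeneous transforms. A minimal numpy sketch, using hypothetical example poses rather than the paper's actual data:

```python
import numpy as np

def make_T(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical poses: object 1 m along world x, camera 2 m along world x
# rotated 180 degrees about z so it looks back toward the object.
T_world_obj = make_T(np.eye(3), np.array([1.0, 0.0, 0.0]))
R_cam = np.array([[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]])
T_world_cam = make_T(R_cam, np.array([2.0, 0.0, 0.0]))

# Ground-truth pose of the object expressed in the camera frame:
T_cam_obj = np.linalg.inv(T_world_cam) @ T_world_obj
print(T_cam_obj[:3, 3])  # object lies 1 m along the camera's x axis
```

The same composition yields the camera-frame label for every rendered view, which is what makes fully automatic annotation from simulation possible.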


Domestic service robots are becoming more ubiquitous and can perform various assistive tasks such as fetching items or helping with medicine intake to support humans with impairments of varying severity. However, the development of robots taking care of humans should not only be focused on developing advanced functionalities, but should also be accompanied by the definition of benchmarking protocols enabling the rigorous and reproducible evaluation of robots and their functionalities. Of particular importance is the assessment of robots’ ability to deal with failures and unexpected events which occur when they interact with humans in real-world scenarios. For example, a person might drop an object during a robot-to-human handover due to its weight. However, the systematic investigation of hazardous situations remains challenging as (i) failures are difficult to reproduce; and (ii) they may impact the health of humans. Therefore, we propose in this paper to employ the concept of scientific robotic competitions as a benchmarking protocol for assessing care robots and for collecting datasets of human-robot interactions covering a large variety of failures present in real-world domestic environments. We demonstrate the process of defining the benchmarking procedure with the human-to-robot and robot-to-human handover functionalities, and execute a dry-run of the benchmarks while inducing several failure modes such as dropping objects, ignoring the robot, and not releasing objects. A dataset comprising colour and depth images, wrist force-torque measurements, and other internal sensor data of the robot was collected during the dry-run. In addition, we discuss the relation between benchmarking protocols and standards that exist or need to be extended with regard to the test procedures required for verifying and validating conformance to standards.


From ERL to RAMI: Expanding Marine Robotics Competitions Through Virtual Events

G. Ferri, F. Ferreira, A. Faggiani, T. Fabbri

In: OCEANS 2021: San Diego – Porto, 2021, pp. 1-8.

Robotics competitions have the potential of engaging future engineers and improving their technical and soft skills. In competitions, students are faced with unique challenges, usually not present in theoretical lectures. CMRE has been organising marine robotics competitions since 2010. Over the years, these have become more complex, including multi-domain cooperation, and more application-oriented (from search and rescue to oil & gas). Similarly, there were changes in the scoring methodology and purpose of the competition. While the European Robotics League (ERL) Emergency has been running for a few years with search and rescue scenarios, CMRE has recently been working on a new competition, Robotics for Asset Maintenance and Inspection (RAMI), which expands the precise scientific scoring method of ERL Emergency into a metrological evaluation. The RAMI competition inaugurates a new concept of having both physical and cascade (based on acquired data) competitions. This enlarges the competition to research communities not typically engaged in marine robotics, such as the Artificial Intelligence (AI) community. This paper reports on the latest ERL Emergency event (held in 2019) and on the ongoing implementation of the new RAMI competition.

On the Design of the Agri-Food Competition for Robot Evaluation (ACRE)

Riccardo Bertoglio, Giulio Fontana, Matteo Matteucci, Davide Facchinetti, Michel Berducat, Daniel Boffety

Published in: 2021 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC)

The Agri-Food Competition for Robot Evaluation (ACRE) is a novel competition for autonomous robots and smart implements. It is focused on agricultural tasks such as removing weeds or mapping/surveying crops down to single-plant resolution. Such abilities are crucial for the transition to so-called “Agriculture 4.0”, i.e., precision agriculture supported by ICT, Artificial Intelligence, and Robotics. ACRE is a benchmarking competition, i.e., the activities that participants are required to execute are structured as performance benchmarks. The benchmarks are grounded on the key scientific concepts of objective evaluation, repeatability, and reproducibility. Transferring such concepts to the agricultural context, where large parts of the test environment are not fully controllable, is one of the challenges tackled by ACRE. The ACRE competition involves both physical Field Campaigns and data-based Cascade Campaigns. In this paper, we present the benchmarks designed for both kinds of Campaigns and report the outcome of the ACRE dry-runs that took place in 2020.


Validation of methodologies for evaluating stand-alone weeding solutions, within the framework of the ROSE Challenge and METRICS projects

The ROSE Challenge is the first global robotics and artificial intelligence competition to implement a third-party evaluation of the performance of robotized intra-row weed control solutions in real and reproducible conditions, to ensure a credible and objective assessment of their effectiveness. This paper reports on the design and validation of the test facilities for this competition, which presents a particular complexity: the evaluations take place in real conditions on crop plots and target living organisms (crops and weeds). Moreover, the experimental conditions need to be reproducible to allow for comparison of evaluation results and for fair treatment of different participants. The article also discusses the opportunity this challenge offers to define, in a consensual manner, the means and methods for characterizing these intelligent systems. The tools developed in the framework of this challenge establish the necessary references for future research in the field of agricultural robotics: the annotated images will be particularly useful to the community, and the evaluation protocol will support the definition of harmonized methodologies beyond the ROSE Challenge. After presenting the objectives of the challenge, the article presents the methodology and tools developed and used to allow an objective and comparable evaluation of the performance of the systems and solutions developed. Finally, the article illustrates this potential for harmonization and sharing of references through the ACRE competition of the European H2020 project METRICS.


Learning-enabled components in robots must be assessed with regard to non-functional requirements (NFRs) such as reliability, fault tolerance, and adaptability to ease the acceptance of responsible robots into human-centered environments. While many factors impact NFRs, in this paper we focus on the datasets used to train the learning models that are deployed on robots. We describe desirable characteristics for robotics datasets and identify the associated NFRs they affect. The characteristics are described in relation to the variability of the instances in the dataset, out-of-distribution data, the spatio-temporal embodiment of robots, interaction failures, and lifelong learning. We emphasize the need to include out-of-distribution and failure data in the datasets, both to improve the performance of learning models and to allow the assessment of robots in unexpected situations. We also stress the importance of continually updating the datasets throughout the lifetime of the robot, and of the associated documentation of the datasets for improved transparency and traceability.


Assessing artificial intelligence capabilities

Guillaume Avrin

Published in: AI and the Future of Skills, Volume 1: Capabilities and Assessments, OECD Publishing, Paris (2021)

As artificial intelligence (AI) becomes more mature, it is increasingly used in the world of work alongside human beings. This raises the question of the real value provided by AI, its limits and its complementarity with the skills of biological intelligence. Based on evaluations of AI systems by the Laboratoire national de métrologie et d’essais in France, this chapter proposes a high-level taxonomy of AI capabilities and generalises it to other AI tasks to draw a parallel with human capabilities. It also presents proven practices for evaluating AI systems, which could serve as a basis for developing a methodology for comparing AI and human intelligence. Finally, it recommends further actions to progress in identifying the strengths and weaknesses of AI vs. human intelligence. To that end, it considers the functions and mechanisms underlying capabilities, taking into account the specificities of non-convex AI behaviour in the definition of evaluation tools.


The VISTA datasets, a combination of inertial sensors and depth cameras data for activity recognition.

Fiorini, L., Cornacchia Loizzo, F. G., Sorrentino, A., Rovini, E., Di Nuovo, A., & Cavallo, F.

Published in: Scientific Data, 9(1), 1-14. (2022)

This paper makes the VISTA database, composed of inertial and visual data, publicly available for gesture and activity recognition. The inertial data were acquired with the SensHand, which can capture the movement of the wrist, thumb, index and middle fingers, while the RGB-D visual data were acquired simultaneously from two different points of view, front and side. The VISTA database was acquired in two experimental phases: in the former, the participants were asked to perform 10 different actions; in the latter, they had to execute five scenes of daily living, each corresponding to a combination of the selected actions. In both phases, Pepper interacted with the participants. The two camera points of view mimic the different points of view of Pepper. Overall, the dataset includes 7682 action instances for the training phase and 3361 action instances for the testing phase. It can be seen as a framework for future studies on artificial intelligence techniques for activity recognition, including inertial-only data, visual-only data, or a sensor fusion approach.


A COTS (UHF) RFID Floor for Device-Free Ambient Assisted Living Monitoring

Smith, R., Ding, Y., Goussetis, G., Dragone, M.

In: Novais, P., Vercelli, G., Larriba-Pey, J.L., Herrera, F., Chamoso, P. (eds) Ambient Intelligence – Software and Applications. ISAmI 2020. Advances in Intelligent Systems and Computing, vol 1239. Springer, Cham. (2021)

The complexity and intrusiveness of current proposals for AAL monitoring negatively impact end-user acceptability, and ultimately still hinder widespread adoption by key stakeholders (e.g. public and private sector care providers) who seek to balance system usefulness with upfront installation and long-term configuration and maintenance costs. We present the results of our experiments with a device-free wireless sensing (DFWS) approach utilising commercial off-the-shelf (COTS) Ultra High Frequency (UHF) Radio Frequency Identification (RFID) equipment. Our system is based on antennas above the ceiling and a dense deployment of passive RFID tags under the floor. We provide baseline performance of state-of-the-art machine learning techniques applied to a region-level localisation task. We describe the dataset, which we collected in a realistic testbed and which we share with the community. Contrary to past work with similar systems, our dataset was collected in a realistic domestic environment over a number of days. The data highlights the potential, but also the problems that need to be solved before RFID DFWS approaches can be used for long-term AAL monitoring.


A comparative study of Fourier transform and CycleGAN as domain adaptation techniques for weed segmentation

Riccardo Bertoglio, Alessio Mazzucchelli, Nico Catalano, Matteo Matteucci

In: Smart Agricultural Technology, Volume 4, 2023.

Automatic weed identification is becoming increasingly important in the Precision Agriculture field as a fundamental capability for targeted spraying or mechanical weed destruction. Targeted weed elimination reduces herbicide use and thus lowers the environmental impact of treatments. Convolutional Neural Networks are one of the most successful techniques to automatically detect weeds in RGB images. Such models require a large amount of labeled data to obtain satisfying detection performance. The agricultural context presents a high degree of variability, and it is thus unfeasible to expect a representative dataset for each specific condition that can appear in the fields. Domain Adaptation (DA) techniques are exploited to maintain high detection performance in different field conditions, lowering the need for labeled data. This study presents a comparison of the two main style transfer techniques for performing domain adaptation, that is, the Fourier Transform and the CycleGAN architecture. We used these techniques to reduce the domain gap in two use cases: one with images collected by different robots with different cameras, and another with images collected by the same platform in different years. We show how, in the first case, the CycleGAN architecture attains satisfying performance and beats the simpler Fourier Transform. Instead, in the second case, all the tested DA techniques struggle to reach baseline performance. We also show how introducing a loss based on phase discrepancy in the CycleGAN architecture stabilizes the training and improves the performance. Moreover, we release a new dataset of labeled agricultural images and the code of our experiments for the reproducibility of the results and comparison with future works.
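The Fourier-based style transfer compared above rests on a simple idea: swap the low-frequency amplitude spectrum (which carries the image "style") between domains while keeping the phase (which carries "content"). A minimal single-channel numpy sketch of this idea, with an illustrative `beta` band-size parameter (the paper's exact formulation may differ):

```python
import numpy as np

def fourier_style_transfer(src, trg, beta=0.05):
    """Replace the low-frequency amplitudes of src with those of trg,
    keeping src's phase. src, trg: float arrays of shape (H, W)."""
    fft_src = np.fft.fftshift(np.fft.fft2(src))
    fft_trg = np.fft.fftshift(np.fft.fft2(trg))
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    h, w = src.shape
    b = int(min(h, w) * beta)          # half-width of the swapped central band
    ch, cw = h // 2, w // 2
    # Low frequencies sit at the centre after fftshift; swap them in.
    amp_src[ch - b:ch + b + 1, cw - b:cw + b + 1] = \
        amp_trg[ch - b:ch + b + 1, cw - b:cw + b + 1]

    fft_mix = amp_src * np.exp(1j * pha_src)
    return np.fft.ifft2(np.fft.ifftshift(fft_mix)).real

src = np.random.rand(64, 64)   # stand-in for a source-domain image
trg = np.random.rand(64, 64)   # stand-in for a target-domain image
adapted = fourier_style_transfer(src, trg)
print(adapted.shape)  # (64, 64)
```

Unlike CycleGAN, this transform needs no training, which is why it serves as the lightweight baseline in the comparison.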


Towards Multimodal Sensing and Interaction for Assistive Autonomous Robots

Emanuele De Pellegrin, Ronnie Smith, Scott MacLeod, Mauro Dragone,  Ronald P. A. Petrick

In: Towards Autonomous Robotic Systems. TAROS 2023

Assistive environments that combine sensors, actuators, and robotic devices have gained attention in recent years, with the potential to provide assistance and support to individuals with disabilities or to the ageing population. One key challenge in designing autonomous assistive environments is to accurately recognise and understand the user’s goals and intentions. To address these challenges, we propose a system with a goal recognition and planning loop that uses data collected from sensors embedded in an assistive environment. We present the general architecture of the system, highlighting the key components available at the time of writing, plus the proposed extensions for the long-term goal of the project. We present data collected in a smart kitchen used by the activity recognition system, as well as a custom tool to simulate and test the data.


Robustness of Deep Learning Methods for Occluded Object Detection - A Study Introducing a Novel Occlusion Dataset

Ziling Wu, Armaghan Moemeni, Simon Castle-Green, Praminda Caleb-Solly

In: 2023 International Joint Conference on Neural Networks (IJCNN) (pp. 1-10). IEEE.

A large number of deep learning based object detection algorithms have been proposed and applied in a wide range of domains such as security, autonomous driving and robotics. In practical usage, occluded objects are common and can result in reduced accuracy and reliability. To increase the robustness of object detection algorithms under occlusion scenarios, it is necessary to consider the influence of different types of occlusion on the performance of object detection approaches. Our research revealed a gap in benchmarking datasets providing exemplars that cover a range of occlusion scenarios. In this paper, we present a new benchmarking dataset that includes a range of exemplars covering different types of occlusion cases. This dataset is designed for object detection of everyday objects in indoor scenarios, and comprises occlusion in three orthogonal atomic factors, namely, the degree of occlusion, the location of occlusion, and the classes of occluded and occluding objects. Our dataset is balanced in terms of classes and degrees of occlusion, with a total of 5970 sample images. The effect of these three atomic factors has been investigated on some classic general object detectors. Using this benchmarking dataset, we also present results on the impact of the distribution of the training dataset, in terms of degree of occlusion, on the robustness of several typical object detection algorithms (e.g. Fast RCNN, Faster RCNN, and FCOS). The benchmark is available at “”. This dataset is seen as a key contribution to research investigating the influence of occlusion on the performance of object detectors.
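The "degree of occlusion" factor above can be made concrete as the fraction of an object's full (amodal) mask that is hidden. A minimal numpy sketch of such a metric; this is an illustrative definition, not necessarily the dataset's exact one:

```python
import numpy as np

def occlusion_degree(full_mask, visible_mask):
    """Fraction of an object's full (amodal) mask that is not visible.
    Both masks are boolean arrays of the same shape."""
    full = full_mask.sum()
    if full == 0:
        return 0.0
    hidden = np.logical_and(full_mask, ~visible_mask).sum()
    return hidden / full

# Usage: a 6x6 object with its left half covered is 50% occluded.
full = np.zeros((10, 10), dtype=bool)
full[2:8, 2:8] = True          # 36-pixel amodal mask
visible = full.copy()
visible[2:8, 2:5] = False      # 18 pixels hidden by an occluder
print(occlusion_degree(full, visible))  # 0.5
```

Binning such a score (e.g. into low/medium/high bands) is one way a dataset can be balanced across degrees of occlusion, as described above.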


Lowering the Entry Barrier to Aerial Robotics Competitions

Francisco J. Pérez-Grau, Pablo León Barriga, Antidio Viguria

In: 2023 International Conference on Unmanned Aircraft Systems (ICUAS)



Agri-Food Competition for Robot Evaluation (ACRE), the Design of a competition of METRICS project

Daniel Boffety, Manon Boulet, Michel Berducat, Matteo Matteucci, Riccardo Bertoglio, Giulio Fontana, and Davide Facchinetti

In: MCG 2022 – Machine Control and Guidance – International conference

METRICS is an EU-funded project dedicated to the metrological evaluation and testing of autonomous robots. ACRE is one of the four benchmarking competitions for autonomous robots and smart implements organized by METRICS. ACRE deals with the applications of robotics to agriculture. The participants are required to execute performance benchmarks, which are grounded on the concepts of objective evaluation, repeatability, and reproducibility.
Transferring such concepts to the agricultural context, where large parts of the test environment are not fully controllable, is one of the challenges tackled by ACRE. ACRE is focused on agricultural tasks such as weeding or mapping/surveying crops down to single-plant resolution. Seven benchmarks were identified and classified into two categories.
The ACRE competition involves two separate but interconnected events: a Field Campaign, in which robots execute activities in agricultural environments such as open-air experimental plots, and a Cascade Campaign, during which Artificial Intelligence systems perform activities on data collected during the Field Campaigns.
Two dry-run evaluation campaigns took place in 2020 and 2021. These first two events allowed the organisers to test, check, complete, and validate the ACRE organization and its evaluation plan. The first field evaluation campaign began with an on-field event from 7 to 10 June 2022. A second cascade evaluation will take place at the beginning of 2023, and the last field evaluation campaign of the project is scheduled for May 2023 in Cornaredo, near Milan, Italy.


Instructing Hierarchical Tasks to Robots by Verbal Commands

P. Telkes, A. Angleraud, R. Pieters

In: 16th IEEE/SICE International Symposium on System Integration (SII), 2024

Natural language is an effective tool for communication, as information can be expressed in different ways and at different levels of complexity. Verbal commands, utilized for instructing robot tasks, can therefore replace traditional robot programming techniques and provide a more expressive means to assign actions and enable collaboration. However, the challenge of utilizing speech for robot programming is how actions and targets can be grounded to physical entities in the world. In addition, to be time-efficient, a balance needs to be found between fine- and coarse-grained commands and natural language phrases. In this work, we provide a framework for instructing tasks to robots by verbal commands. The framework includes functionalities for mapping single commands to actions and targets, as well as for longer-term sequences of actions, thereby providing a hierarchical structure to the robot tasks. Experimental evaluation demonstrates the functionalities of the framework through human collaboration with a robot in different tasks with different levels of complexity. The tools are provided open-source at this https URL


Multi-label Annotation for Visual Multi-Task Learning Models

G. Sharma, A. Angleraud, R. Pieters

In: IEEE International Conference on Robotic Computing, 2023

Deep learning requires large amounts of data and a well-defined pipeline for labeling and augmentation. Current solutions support numerous computer vision tasks with dedicated annotation types and formats, such as bounding boxes, polygons, and key points. These annotations can be combined into a single data format to benefit approaches such as multi-task models. However, to our knowledge, no available labeling tool supports export to such a combined benchmark format, and no augmentation library supports transformations for the combination of all three. In this work, these functionalities are presented, with visual data annotation and augmentation used to train a multi-task model (object detection, segmentation, and key point extraction). The tools are demonstrated in two robot perception use cases.
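A combined format as described above can be as simple as one record per object carrying all three annotation types. The field names below are illustrative, not the paper's actual export schema:

```python
import json

# Hypothetical combined annotation record: one object carries a bounding box,
# a segmentation polygon, and key points in a single entry, so a multi-task
# loader can read all three annotation types at once.
annotation = {
    "image_id": 1,
    "objects": [
        {
            "category": "wrench",
            "bbox": [34, 50, 120, 80],                  # x, y, width, height
            "segmentation": [[34, 50, 154, 50, 154, 130, 34, 130]],
            "keypoints": [60, 70, 2, 140, 110, 2],      # x, y, visibility triples
        }
    ],
}

# One JSON file per image keeps the three tasks' labels aligned per object.
serialized = json.dumps(annotation)
print(len(json.loads(serialized)["objects"]))  # 1
```

The layout loosely follows COCO conventions (bbox as x/y/width/height, keypoint visibility flags), which is a common starting point for multi-task annotation schemas.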


6D Assembly Pose Estimation by Point Cloud Registration for Robot Manipulation

K. Samarawickrama, G. Sharma, A. Angleraud, R. Pieters


The demands on robotic manipulation skills to perform challenging tasks have drastically increased in recent times. To perform these tasks with dexterity, robots require perception tools to understand the scene and extract useful information that can be transformed into robot control inputs. To this end, recent research has introduced various object pose estimation and grasp pose detection methods that yield precise results. Assembly pose estimation is a secondary yet highly desirable skill in robotic assembly, as it requires more detailed information on object placement compared to bin picking and pick-and-place tasks. However, it has often been overlooked in research due to the complexity of integration in an agile framework. To address this issue, we propose an assembly pose estimation method with RGB-D input and 3D CAD models of the associated objects. The framework consists of semantic segmentation of the scene and registering point clouds of local surfaces against target point clouds derived from CAD models to estimate 6D poses. We show that our method can deliver sufficient accuracy for assembling objects, using evaluation metrics and demonstrations. The source code and dataset for the work can be found at: this https URL
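The registration step at the core of this pipeline amounts to finding the rigid transform that best aligns two point clouds. A minimal numpy sketch of the classic least-squares (Kabsch/SVD) solution for the simplified case of known correspondences; the paper's full pipeline additionally handles segmentation and CAD-derived targets:

```python
import numpy as np

def rigid_registration(P, Q):
    """Least-squares rigid transform aligning P onto Q (Kabsch/SVD).
    P, Q: (N, 3) arrays with known point correspondences.
    Returns R (3x3) and t (3,) such that Q ~ P @ R.T + t."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# Usage: recover a known rotation and translation from noiseless points.
rng = np.random.default_rng(0)
P = rng.random((50, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
Q = P @ R_true.T + t_true
R_est, t_est = rigid_registration(P, Q)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))  # True True
```

In practice correspondences are unknown, so this closed-form solve is wrapped inside an iterative scheme such as ICP, with the CAD-derived cloud as the registration target.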


Visual inspection is a crucial yet time-consuming task across various industries. Numerous established methods employ machine learning in inspection tasks, necessitating specific training data that includes predefined inspection poses and the images essential for model training. The acquisition of such data and its integration into an inspection framework is challenging due to the variety of objects and scenes involved, and due to additional bottlenecks caused by the manual collection of training data by humans, thereby hindering the automation of visual inspection across diverse domains. This work proposes a solution for automatic path planning using a single depth camera mounted on a robot manipulator. Point clouds obtained from the depth images are processed and filtered to extract object profiles, then transformed into inspection target paths for the robot end-effector. The approach relies on the geometry of the object and generates an inspection path that follows the shape normal to the surface. Depending on the object size and shape, inspection paths can be defined as single- or multi-path plans. Results are demonstrated in both simulated and real-world environments, yielding promising inspection paths for objects with varying sizes and shapes. Code and video are open-source available at: this https URL
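Generating a path "normal to the surface", as described above, can be sketched by offsetting each surface point along its unit normal by a standoff distance to obtain camera waypoints. A minimal numpy illustration; the `standoff` parameter and function names are assumptions, not taken from the paper:

```python
import numpy as np

def inspection_waypoints(points, normals, standoff=0.15):
    """Offset each surface point along its normal by a standoff distance.
    points, normals: (N, 3) arrays. Returns camera positions and the
    view directions that point back at the surface."""
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    positions = points + standoff * n   # camera waypoints above the surface
    view_dirs = -n                      # each waypoint looks at its surface point
    return positions, view_dirs

# Usage: a flat patch with upward normals yields waypoints 0.15 m above it,
# all looking straight down.
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
nrm = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pos, views = inspection_waypoints(pts, nrm)
print(pos[0])  # waypoint 0.15 m above the first surface point
```

Ordering these waypoints along the extracted object profile then yields a single- or multi-path plan, depending on the object's size and shape.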


Évaluation de l’intelligence artificielle

Guillaume AVRIN

In: Techniques de l'ingénieur (10.02.2023)


Artificial intelligence (AI) is growing rapidly and raises questions for all audiences: private individuals, professionals, and academics. To frame these exchanges, shared and rational principles and practices for measuring the performance, and the limits, of intelligent systems must be established.

This article presents a methodical approach, compliant with the rules of metrology, that outlines:

- metrics for quantitative and repeatable performance measurements;

- physical and virtual test environments for reproducible experiments representative of the real operating conditions of the AI under evaluation, together with organisational tools (benchmarking, challenges, competitions).

Together these address the needs of the entire ecosystem.