The HEART-MET cascade campaign
We are happy to invite you to participate in the HEART-MET Activity Recognition Challenge, which is hosted online on CodaLab and is open to both individuals and teams.
The Challenge
In the first challenge, the task is to recognize activities performed by a person in short video clips. In subsequent challenges, which will be announced shortly, the task will be to recognize and temporally segment activities in long untrimmed videos of a person performing daily activities in a home.
The second dataset addresses the need for robots to adapt to new environments, and thus contains videos of activities being performed by multiple participants in a single apartment from two viewpoints.
Results
In this campaign we took the opportunity to evaluate our dataset using state-of-the-art video classification models. These models were designed to identify and categorise actions in video clips. For this analysis, both Convolutional Neural Networks (CNNs) and Vision Transformer models were employed, covering a broad range of modern techniques in the field of video classification.
The models used include:
- I3D: Based on the Inception architecture, this model is enhanced to handle the temporal nature of videos by using 3D convolutions, which analyse both spatial and temporal dimensions.
- R2Plus1D: A ResNet-based model that factorizes each 3D convolution into a 2D convolution over the spatial dimensions followed by a 1D convolution over the temporal dimension.
- S3D: A variant of the I3D model, which employs separable 3D convolutions to process the spatial and temporal aspects independently.
- Swin Transformers (Tiny, Small, and Base): These models rely on the transformer architecture, which processes spatio-temporal patches (or 3D tokens) from video clips. The models differ in size, with Tiny and Small being more compact than the Base model.
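The appeal of the factorized convolutions used by R2Plus1D and S3D can be illustrated with a back-of-the-envelope parameter count. The kernel size and channel counts below are made-up examples for illustration, not the actual configurations of these models:

```python
def conv3d_params(c_in, c_out, t, d):
    # Full 3D convolution: one t x d x d kernel per (input, output) channel pair.
    return c_in * c_out * t * d * d

def factorized_params(c_in, c_mid, c_out, t, d):
    # (2+1)D factorization: a 1 x d x d spatial convolution into c_mid channels,
    # followed by a t x 1 x 1 temporal convolution into c_out channels.
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal

# Example: 64 -> 64 channels with a 3 x 3 x 3 receptive field.
full = conv3d_params(64, 64, 3, 3)          # 110,592 parameters
fact = factorized_params(64, 64, 64, 3, 3)  # 49,152 parameters
```

In practice, R(2+1)D actually chooses the intermediate channel count so that the factorized block matches the parameter count of the full 3D convolution, trading the savings for an extra nonlinearity between the spatial and temporal steps.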
For all models, we used weights pre-trained on the Kinetics-400 dataset and trained only the final classification layers on our data. The results of this evaluation are summarised in the table below; the numbers are true positive rates.
| Model | Dataset A | Dataset B |
| --- | --- | --- |
| I3D | 0.58 | 0.80 |
| R2Plus1D | 0.37 | 0.57 |
| S3D | 0.42 | 0.57 |
| Swin Tiny | 0.53 | 0.73 |
| Swin Small | 0.45 | 0.87 |
| Swin Base | 0.50 | 0.73 |
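Since the table reports true positive rates, it may help to spell out how a per-class TPR (recall) is computed from predictions. This is a minimal sketch with toy labels, not the evaluation pipeline used for the table:

```python
import numpy as np

def true_positive_rates(y_true, y_pred, n_classes):
    # TPR (recall) per class: samples of a class that were predicted correctly,
    # divided by the total number of samples of that class.
    tpr = np.zeros(n_classes)
    for c in range(n_classes):
        mask = y_true == c
        tpr[c] = (y_pred[mask] == c).mean() if mask.any() else 0.0
    return tpr

# Toy example with three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(true_positive_rates(y_true, y_pred, 3))  # [0.5 1.  0.5]
```

Averaging these per-class rates gives a single score that is not dominated by frequent classes, which matters when some activities occur much more often than others.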
Key findings
Our evaluation showed that the models generally performed better on Dataset B, even though it contains fewer samples. This outcome was expected, as Dataset B was collected in a controlled environment, leading to more consistent data across the training and testing phases. Such consistency helps models learn effectively, which is particularly relevant in healthcare scenarios where robots must quickly adapt to the environment and the individuals they interact with.
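The frozen-backbone setup used in the evaluation (a pre-trained feature extractor with only the final classification layer trained) can be sketched as a linear probe. The random projection below is a stand-in for the Kinetics-400 backbones, and the data is synthetic; this is an illustration of the recipe, not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained backbone: a fixed projection from
# flattened clip features to an embedding. These weights are never updated.
W_backbone = rng.normal(size=(256, 64))

def extract_features(clips):
    # clips: (n, 256) flattened inputs; ReLU embedding from the frozen backbone.
    return np.maximum(clips @ W_backbone, 0.0)

# Synthetic 3-class problem.
n_classes = 3
X = rng.normal(size=(300, 256))
y = rng.integers(0, n_classes, size=300)
feats = extract_features(X)

# Only the final classification layer is trained (softmax regression).
W_head = np.zeros((64, n_classes))
b = np.zeros(n_classes)
lr = 0.1
onehot = np.eye(n_classes)[y]
for _ in range(200):
    logits = feats @ W_head + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    W_head -= lr * feats.T @ (probs - onehot) / len(y)
    b -= lr * (probs - onehot).mean(axis=0)

preds = (feats @ W_head + b).argmax(axis=1)
```

Because the backbone stays fixed, only the small head (here 64 x 3 weights plus biases) is fitted, which is why this recipe works even on datasets far smaller than Kinetics-400.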