Safety concerns have hindered the deployment of robotized solutions in hybrid workspaces for the past decade. Current solutions, such as physical barriers, safety monitoring systems, and wearable sensors, are costly and resource-intensive. Overcoming this barrier calls for new approaches based on artificial intelligence (AI) that enable safe and natural human-robot interaction. Recent improvements in computational power and the rise of AI offer hope for using data-driven techniques in real-time, safe robotized solutions in hybrid environments.

For human detection, two techniques are at the forefront: background subtraction and machine learning. Background subtraction continuously compares images of the workspace with and without humans and detects areas of human presence using statistical models such as the Gaussian mixture model. Machine learning techniques require datasets and either feature descriptors paired with classical supervised classification algorithms or deep learning detectors such as YOLO, Faster R-CNN, and the Single Shot Detector (SSD). For tracking people across stereo-matched frames, traditional key point descriptors such as the scale-invariant feature transform (SIFT) or speeded-up robust features (SURF) cannot be used, because of the operators' similar clothing and similarities in the environment. Deploying such techniques in real time, however, remains a challenge.
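
To make the background-subtraction idea concrete, here is a minimal sketch in plain Python. It is a deliberate simplification and not the MERGING implementation: it keeps a single running Gaussian per pixel instead of a full Gaussian mixture, and the class name, parameter names, and default values are all assumptions chosen for illustration.

```python
class RunningGaussianBackground:
    """Per-pixel running Gaussian background model for grayscale frames.

    Simplification: one Gaussian per pixel, not a full Gaussian mixture.
    All parameter defaults are illustrative assumptions.
    """

    def __init__(self, first_frame, alpha=0.05, k=2.5,
                 initial_var=25.0, min_var=4.0):
        self.alpha = alpha      # learning rate for the background statistics
        self.k = k              # foreground threshold, in standard deviations
        self.min_var = min_var  # variance floor so the model stays tolerant to noise
        self.mean = [[float(p) for p in row] for row in first_frame]
        self.var = [[initial_var] * len(row) for row in first_frame]

    def apply(self, frame):
        """Return a 0/1 foreground mask and update the background model."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, pixel in enumerate(row):
                diff = pixel - self.mean[y][x]
                # Foreground if the pixel lies more than k standard deviations
                # away from the background mean.
                foreground = diff * diff > self.k ** 2 * self.var[y][x]
                if not foreground:
                    # Only background pixels update the model, so a person in
                    # the scene does not contaminate the background statistics.
                    self.mean[y][x] += self.alpha * diff
                    self.var[y][x] = max(
                        self.min_var,
                        (1 - self.alpha) * self.var[y][x]
                        + self.alpha * diff * diff,
                    )
                mask_row.append(1 if foreground else 0)
            mask.append(mask_row)
        return mask


# Usage: learn an empty 4x4 workspace, then a bright "person" enters.
empty_workspace = [[100] * 4 for _ in range(4)]
model = RunningGaussianBackground(empty_workspace)
occupied = [row[:] for row in empty_workspace]
occupied[1][1] = occupied[1][2] = 200
mask = model.apply(occupied)
```

A mixture model extends this by keeping several weighted Gaussians per pixel, which helps with backgrounds that legitimately alternate between states, such as flickering lights or moving machinery.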

How can a multi-modal perception system based on deep learning, using a few cameras for detection, tracking, and gesture recognition, ensure safety without wearables?

The MERGING perception system uses cameras and deep learning to detect and track workers without requiring them to wear any extra equipment. It deploys 2D and 360° cameras to create a multi-level perception system that recognizes single and multiple workers in a robot’s surroundings. The system also extracts information about workers’ gestures and limb positions for safety purposes. The perception system was validated in the AIMEN laboratory environment.

The MERGING perception system detects and tracks workers in a hybrid environment, decomposes their movements, and recognizes their gestures. It enables dynamic adaptation of robotic behavior for safe and natural human-robot interaction.

In this blog post, we summarize the assessment of a multi-level perception system that validates technologies for perception in human-robot interaction. The system relies on non-wearable devices to provide person detection, tracking, movement decomposition, and gesture recognition, for use in specific applications such as fabric detection, planning for wrinkle removal, pose extraction, continuous monitoring of fabric deformation, and continuous gripping monitoring.

For human detection and tracking, we deployed a calibrated stereo system of two RGB cameras aided by YOLO filtering and piped the output for each detected person into OpenPose, which estimates the pixel positions of 15 key points on the human body, mainly corresponding to body joints. Our implementation achieved a frame rate of 13 fps. The key point information was then fed into a gesture recognition algorithm that converted it into a line diagram of the human, using only the upper-body key points. The gesture recognition algorithm calculated the relative angles between the key points and returned a gesture by matching them against the reference quaternions defined in a lookup table.
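
The lookup-table step can be illustrated with a small sketch. It is not the project's code: it compares plain 2D segment angles rather than quaternions, and the gesture names, key point names, reference angles, and tolerance are hypothetical placeholders (the five gestures actually recognized are not listed here).

```python
import math

# Hypothetical lookup table: reference angle (degrees) of each arm segment
# per gesture. "left"/"right" refer to the image, not to the person.
GESTURE_TABLE = {
    "arms_down": {"left_arm": -90.0, "right_arm": -90.0},
    "arms_up": {"left_arm": 90.0, "right_arm": 90.0},
    "arms_out": {"left_arm": 180.0, "right_arm": 0.0},
}

# Segments of the upper-body line diagram: (start key point, end key point).
SEGMENTS = {
    "left_arm": ("left_shoulder", "left_wrist"),
    "right_arm": ("right_shoulder", "right_wrist"),
}


def segment_angle(start, end):
    """Angle of the segment start->end in degrees; 0 = image right, 90 = up."""
    dx = end[0] - start[0]
    dy = start[1] - end[1]  # image y grows downward, so flip it
    return math.degrees(math.atan2(dy, dx))


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def classify_gesture(keypoints, tolerance_deg=20.0):
    """Return the best-matching gesture name, or None if nothing is close."""
    angles = {
        name: segment_angle(keypoints[start], keypoints[end])
        for name, (start, end) in SEGMENTS.items()
    }
    best_name, best_error = None, float("inf")
    for name, reference in GESTURE_TABLE.items():
        # A gesture matches only as well as its worst-matching segment.
        error = max(
            angle_difference(angles[segment], ref_angle)
            for segment, ref_angle in reference.items()
        )
        if error < best_error:
            best_name, best_error = name, error
    return best_name if best_error <= tolerance_deg else None


# Usage with pixel coordinates such as a pose estimator might return.
pose = {
    "left_shoulder": (100, 100), "left_wrist": (100, 0),
    "right_shoulder": (200, 100), "right_wrist": (200, 0),
}
gesture = classify_gesture(pose)
```

Scoring each gesture by its worst segment keeps a pose from being accepted when only one arm matches, and the tolerance lets the function return no gesture at all rather than forcing a guess.
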
We verified and validated the system under actual industrial conditions before scheduling its deployment for the next few months. Simulated results show that the system is robust under industrial lighting conditions and can detect, track, and decompose human motion at high frame rates. However, we also highlight some limitations of the system, such as scenarios where two humans cross each other or wear clothing that matches the background color. Future work will focus on improving the frame rate and addressing these limitations. For more details, please see the public deliverable D5.3, "Perception functionalities demonstration".

The results show that our system can detect, track, and decompose the movement of single or multiple humans wearing varied industrial clothing, except when one person occludes another as they cross paths or when clothing matches the background color. In addition, our gesture recognition system achieved a 97% success rate in recognizing five human gestures.

Afra María Pertusa Llopis

Afra María Pertusa Llopis is a robotics engineer and a Ph.D. student in Advanced Robotic Technologies and Application at AIMEN. She has been an active contributing member of the MERGING project in validation and verification. She contributed as a writer of this blog post.


Jawad Masood

Jawad Masood is the team leader of the Advanced Robotic Technologies and Application group at AIMEN. He has been involved in the application of sensors for human detection and tracking using deep learning (DL) approaches in the automotive domain and wearable robotics. He contributed as a writer of this blog post.

Acknowledgement: We would like to acknowledge the contribution of Adriana Costas López and Diego Pérez Losada for the perception system development, verification, and validation.