Leader: Alessandra Sorrentino (UNIFI); Other collaborator(s):
Intelligent and adaptable multi-modal behaviors will be designed and developed to improve HCI/HRI, thereby providing advanced human-like interaction and communication. The data perceived from the scene/context will be merged with the user profile to increase the level of understanding, to measure relevant parameters, and to plan tailored machine reactions. In particular, i) social cues (e.g. gaze, emotion, body pose and gestures, voice tone) extracted through vision sensors and Natural Language Processing (NLP) algorithms will be merged at different levels to generate models of interaction. Then, ii) such models will be used by advanced AI-based reasoning strategies to endow the machine with advanced social capabilities, adapting its behaviors to the humans and enhancing the social-task engagement.
Brief description of the activities and of the intermediate results:
The activities of Task 2.1 followed two main, inter-connected directions. On one side, we focused on the identification of the requirements of an innovative co-speech gesture generation model that, once integrated on a robotic platform, could foster the social engagement of the end-user. Namely, we revised the current literature on the topic, identifying the main limitations and designing the most appropriate solution. We developed a co-speech gesture generation model based on a Generative Adversarial Network (GAN), which allows the robot to directly learn the appropriate association between the gesture and, simultaneously, the content (what) and quality (how) of speech. We integrated the proposed model on a humanoid robot (i.e. the Pepper robot) and evaluated it both quantitatively and qualitatively. In the first case, we adopted statistical metrics to compare the quality and accuracy of the generated gestures with respect to the target ones. To qualitatively evaluate the model, we recruited a cohort of individuals who interacted with the robot in an ad-hoc experimental session and compiled a questionnaire at the end. The results show that the proposed model is accurate and appreciated by the human counterpart, even if additional fine-tuning effort is required on gesture synchronisation and velocity.
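A minimal sketch (PyTorch) of a conditional GAN for co-speech gesture generation is given below. The architecture, feature dimensions and names are illustrative assumptions for clarity, not the actual model developed in Task 2.1: the generator maps speech content (text embeddings) and speech quality (audio features) to a joint-angle sequence, while the discriminator judges realism conditioned on the same speech features.

```python
# Illustrative sketch of a conditional GAN for co-speech gesture generation.
# All dimensions, layer choices and names are assumptions, not the Task 2.1 model.
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Maps speech content (text embeddings) and speech quality (audio features),
    plus noise, to a sequence of joint angles for the robot's upper body."""
    def __init__(self, text_dim=300, audio_dim=64, noise_dim=16, hidden_dim=256, n_joints=10):
        super().__init__()
        self.rnn = nn.GRU(text_dim + audio_dim + noise_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_joints)

    def forward(self, text_feats, audio_feats, noise):
        # text_feats: (B, T, text_dim), audio_feats: (B, T, audio_dim), noise: (B, T, noise_dim)
        x = torch.cat([text_feats, audio_feats, noise], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # (B, T, n_joints) gesture sequence

class GestureDiscriminator(nn.Module):
    """Scores whether a gesture sequence is real (motion-captured) or generated,
    conditioned on the same speech features."""
    def __init__(self, text_dim=300, audio_dim=64, hidden_dim=256, n_joints=10):
        super().__init__()
        self.rnn = nn.GRU(n_joints + text_dim + audio_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, gestures, text_feats, audio_feats):
        x = torch.cat([gestures, text_feats, audio_feats], dim=-1)
        h, _ = self.rnn(x)
        return self.score(h[:, -1])  # one realism score per sequence
```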
Additionally, the activities of the team focused on the engagement estimation task. Namely, we addressed this problem by investigating user engagement dynamics during a robot-to-human (R2H) handover task, considering three main components of engagement: affective, cognitive, and behavioral. For this study, we automatically extracted 10 visual features from the video recordings of 31 participants, using state-of-the-art automatic frameworks: Mediapipe (for pose estimation) and OpenFace (for gaze detection and emotion recognition). Each individual engaged in eight consecutive sessions with a robot manipulator designed with social cues (i.e. a social manipulator). Our statistical analysis indicated that prolonged interaction with the robot could influence user engagement. Comparing the user engagement in the first and in the last interaction, we observed a decrease in positive emotions (affective) and a more regulated quantity of motion (behavioral). Additionally, there was a reduced attentional focus on the robot’s assigned tasks (cognitive), although the participants’ execution of the task itself remained unchanged. The obtained results were reported in a regular paper, submitted to the 33rd IEEE International Conference on Robot and Human Interactive Communication (IEEE RO-MAN 2024).
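The sketch below illustrates, under stated assumptions, how per-frame visual features of this kind can be obtained: MediaPipe Pose for body landmarks and a pre-computed OpenFace CSV for gaze cues. The exact 10-feature set, file names and column names used in the study are not reproduced here; the quantity-of-motion measure is a simplified illustrative version.

```python
# Illustrative per-frame feature extraction for engagement analysis.
# Assumptions: MediaPipe Pose for landmarks, OpenFace CSV already exported offline.
import cv2
import pandas as pd
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_pose_landmarks(video_path):
    """Yield (x, y, visibility) landmarks for each frame of a session video."""
    cap = cv2.VideoCapture(video_path)
    with mp_pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                yield [(lm.x, lm.y, lm.visibility)
                       for lm in results.pose_landmarks.landmark]
    cap.release()

def quantity_of_motion(landmark_frames):
    """Behavioral cue: average frame-to-frame displacement of the body landmarks."""
    total, count, prev = 0.0, 0, None
    for frame in landmark_frames:
        if prev is not None:
            total += sum(abs(x - px) + abs(y - py)
                         for (x, y, _), (px, py, _) in zip(frame, prev))
            count += 1
        prev = frame
    return total / count if count else 0.0

# Gaze cue read from an OpenFace output file (file name is an assumption).
openface_df = pd.read_csv("session_01_openface.csv")
mean_gaze_angle_x = openface_df["gaze_angle_x"].mean()
```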
Main policy, industrial and scientific implications:
The scenario investigated for assessing the social-task engagement in participants interacting with the robotic manipulator is part of a rehabilitation scenario designed in collaboration with the UNIFI-Don Gnocchi Joint Lab.
Please see the next reporting period.
The activities of Task 2.1 focused on the identification of digital biomarkers that can be extracted directly from the sensors mounted on a social robotic platform. Two main scenarios were investigated. The first scenario refers to a gait activity task, in which the human walks in front of the robot, which follows the human from behind. Considering the laser and the RGB-D camera as the perception sensors of interest, a preliminary data analysis on healthy and young subjects was conducted to automatically segment the gait phases of each foot (i.e., stance and swing phases). Given this information, a selected pool of digital biomarkers was extracted, namely: number of steps (NS), average step length (SL), gait time (GT), and gait velocity (GV). In parallel to this activity, we continued the investigation of digital biomarkers of interest in the robot-to-human handover scenario. In this activity, the work focused on exploiting other digital biomarkers related to the user’s arm motion trajectories that may be associated with the social-task engagement as well as with the user’s task performance. Similarly to the analysis already performed for the behavioral engagement, we exploited the information carried by the body motion of the user in each sub-phase of the interaction (Reaching, Handover, Placing), and we compared the variability of the collected parameters between the first and the last interactions. To date, a small group of interactions (i.e., 4 interactions) has been analyzed.
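As a minimal sketch of the gait biomarker computation, the snippet below assumes the stance/swing segmentation has already produced an ordered list of heel-strike events with timestamps and foot positions; the data structures and field names are illustrative assumptions rather than the actual pipeline.

```python
# Illustrative computation of the four gait biomarkers (NS, SL, GT, GV)
# from already-segmented heel-strike events (data structure is an assumption).
from dataclasses import dataclass

@dataclass
class HeelStrike:
    t: float   # timestamp (s)
    x: float   # foot position along the walking direction (m)

def gait_biomarkers(heel_strikes):
    """Compute number of steps (NS), average step length (SL),
    gait time (GT) and gait velocity (GV) from ordered heel-strike events."""
    ns = len(heel_strikes) - 1                       # number of steps
    step_lengths = [abs(b.x - a.x)
                    for a, b in zip(heel_strikes, heel_strikes[1:])]
    sl = sum(step_lengths) / ns if ns > 0 else 0.0   # average step length (m)
    gt = heel_strikes[-1].t - heel_strikes[0].t      # gait time (s)
    gv = sum(step_lengths) / gt if gt > 0 else 0.0   # gait velocity (m/s)
    return {"NS": ns, "SL": sl, "GT": gt, "GV": gv}
```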
The activities of Task 2.1 continued to be focused on the identification of digital biomarkers that can be extracted directly from the sensors mounted on a social robotic platform during a robot-to-human handover scenario. Namely, the work focused on exploiting the information carried by the body motion of the user in each sub-phase of the interaction (Reaching, Handover, Placing), and we compared the variability of the collected parameters between the first and the last interactions of each user. A total of 20 users have been analyzed, demonstrating some differences in the behaviors related to the diminishing engagement of the users while performing the task. In parallel to this activity, we also started the design and the implementation of a real-time hand-tracking system based on visual sensors. Namely, we designed a novel multi-device hand-tracking system composed solely of visual sensors (i.e. two Leap Motion Controllers and one Intel RealSense camera), which could be integrated into several human-robot interaction frameworks. The aim of this system is to detect the hand motion of the user, so that the robot can use this information to better predict the action the user wants to perform in physical human-robot interactions. The system's performance was assessed during two distinct hand gestures, referred to as Grasp 1 and Grasp 2, involving different palm orientations. Data were collected from real users performing 60 grasping repetitions each. The results indicate that middle and index fingertip positions most effectively characterize grasping gestures. In terms of data quality, the RGB-D camera provided higher confidence values in hand detection compared to the Leap Motion Controllers, though its higher rate of missing data reduces overall reliability. The data stream from the Leap Motion Controllers achieved a favorable balance between data quality and quantity. The preliminary results of this activity have been presented at the “Real-World Physical and Social Human-Robot Interaction” workshop, held in conjunction with the IEEE-RAS International Conference on Humanoid Robots, Nancy (France), Nov 22 - 24, 2024.
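The sketch below illustrates the kind of per-device quality comparison and simple fusion rule described above (mean detection confidence vs. missing-data rate across streams). The sample format, field names and confidence-weighted fusion rule are illustrative assumptions, not the actual multi-device implementation.

```python
# Illustrative per-device data-quality comparison for the hand-tracking streams
# (Leap Motion Controllers vs. RGB-D camera). All structures are assumptions.
from typing import NamedTuple, Optional, Tuple

class HandSample(NamedTuple):
    t: float                              # timestamp (s)
    index_tip: Optional[Tuple[float, float, float]]   # (x, y, z) in a common frame, None if missing
    middle_tip: Optional[Tuple[float, float, float]]
    confidence: float                     # device-reported hand-detection confidence in [0, 1]

def stream_quality(samples):
    """Summarize one device stream: mean confidence and missing-data rate."""
    valid = [s for s in samples if s.index_tip is not None and s.middle_tip is not None]
    missing_rate = 1.0 - len(valid) / len(samples) if samples else 1.0
    mean_conf = sum(s.confidence for s in valid) / len(valid) if valid else 0.0
    return {"mean_confidence": mean_conf, "missing_rate": missing_rate}

def fuse(sample_a, sample_b):
    """Confidence-weighted choice between two time-aligned samples of the same hand."""
    candidates = [s for s in (sample_a, sample_b) if s is not None and s.index_tip is not None]
    return max(candidates, key=lambda s: s.confidence) if candidates else None
```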
The activities of the Task continued to be focused on the identification of digital biomarkers, exploiting the data recorded by the sensors mounted on the social robots. Namely, we conducted a brief survey on the parameters commonly assessed during human-robot interaction in three clinical scenarios: i) cognitive assessment, ii) physical assessment, and iii) physical rehabilitation. This literature review helped us identify the parameters of interest as well as the main technological challenges related to this research topic. Additionally, we kept investigating the role of biomarkers in the robot-to-human handover scenario, exploiting the duality of information (i.e., social and clinical) carried by each parameter. In the same scenario, we explored the dynamics of user engagement during repeated human-robot handover interactions by integrating human-annotated data, clustering-based analysis of behavioral features, and the predictive capabilities of large language models (LLMs). While manual annotation remains a standard approach in engagement assessment, the inclusion of clustering analysis enables a more objective identification of behavioral patterns associated with varying engagement states. Given their advanced reasoning abilities, LLMs were also evaluated for their effectiveness in this domain. Specifically, we compared two reasoning strategies: Monte Carlo multi-trial sampling and single-trial Chain-of-Thought (CoT) inference. Both approaches were employed to assess engagement at an overall level and across three key components: cognitive, affective, and behavioral. The findings highlight the pivotal role of affective engagement, with clustering analysis effectively distinguishing between highly engaged and less responsive participants. A consistent decline in engagement was observed across users by the fourth trial, indicating a potential reduction in attention or interest over time. LLM-based assessments closely mirrored the clustering results, especially in affective engagement classification. Notably, Monte Carlo sampling demonstrated superior performance in identifying affective engagement, whereas CoT reasoning yielded higher accuracy in overall engagement classification. These results emphasize the promise of LLMs in modeling engagement and point to critical challenges in tracking engagement dynamics over time.
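A minimal sketch of the two LLM reasoning strategies compared above is given below: Monte Carlo multi-trial sampling (majority vote over several sampled answers at non-zero temperature) versus a single Chain-of-Thought pass. The prompt wording, model name, label set and client library are illustrative assumptions, not the study's exact setup.

```python
# Illustrative comparison of Monte Carlo sampling vs. single-trial CoT for
# engagement classification. Prompts, model name and labels are assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()
LABELS = ["low", "medium", "high"]

def ask(prompt, temperature):
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat-completion model would do
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip().lower()

def monte_carlo_engagement(feature_summary, n_trials=10):
    """Sample the model several times and majority-vote the affective-engagement label."""
    prompt = (f"Given these behavioral features of a handover trial:\n{feature_summary}\n"
              f"Answer with one word among {LABELS}: what is the user's affective engagement?")
    votes = [ask(prompt, temperature=0.8) for _ in range(n_trials)]
    valid = [v for v in votes if v in LABELS]
    return Counter(valid).most_common(1)[0][0] if valid else "unknown"

def cot_engagement(feature_summary):
    """Single pass with explicit step-by-step reasoning, then a final one-word label."""
    prompt = (f"Given these behavioral features of a handover trial:\n{feature_summary}\n"
              "Reason step by step about cognitive, affective and behavioral cues, "
              f"then end with 'LABEL: <one of {LABELS}>' for the overall engagement.")
    answer = ask(prompt, temperature=0.0)
    return answer.rsplit("label:", 1)[-1].strip() if "label:" in answer else answer
```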
Scientific publications
Dissemination events