Ph. D. Thesis Presentation for “Dottorandi in Ateneo” UNIBS Event

The Ateneo di Brescia is an institution of Napoleonic origin established in 1802 that gathers, through a system of co-optation, a defined number of scholars and experts in artistic, literary and scientific disciplines. It carries out an intense activity of cultural and scientific promotion through conventions, conferences and publications.

These initiatives are primarily aimed at fostering mutual acquaintance among academics and exchange between different disciplines, within the two classes in which the academy is organized: arts & literature and sciences.

The class of sciences offers final year doctoral students in scientific disciplines the opportunity to present their research work, their experience and their expectations as scholars in training to the academics of the University, to their doctoral colleagues in other scientific fields and to the interested public.

In the following video you can see Cristina Nuzzi's presentation about her Ph. D. thesis work (Italian only). The event was also published in the Giornale di Brescia.

UNIBS-UNITN Collaboration continues!

The collaboration between the Mechanical and Thermal Measurements group (MMT and Vis4Mechs) of the University of Brescia and the MiRo laboratory of the University of Trento continues.

The measurement experiments involving the various members of the two teams are aimed at evaluating, thanks to a precise protocol established in a previous publication, the reconstruction accuracy of consumer (Kinect v2 and Kinect Azure) and industrial (Basler) time-of-flight cameras, compared to the gold standard obtained with a Konica Minolta 3D scanner.

The experimental campaign includes several tests, which for now will be held separately at the two locations in compliance with the pandemic prevention regulations. A first meeting, the only one that required the simultaneous presence of the various devices, has already been held in the MiRo Laboratory in Trento.

Related publications

S. Pasinetti, M.M. Hassan, J. Eberhardt, M. Lancini, F. Docchio, G. Sansoni, “Performance analysis of the PMD Camboard Picoflexx time-of-flight camera for markerless motion capture applications”, in IEEE Transactions on Instrumentation and Measurement 68 (11), 4456-4471

Hands-Free v2 to teleoperate robotic manipulators: three axis precise positioning study


The idea of this thesis project is to improve the already developed teleoperation system presented at Ubiquitous Robotics by implementing the z-axis control. In fact, the original system only performed an xy teleoperation, allowing users to move the end-effector of the robot to the desired position determined by the index finger keypoint extracted by OpenPose. However, a more interesting application also integrates a precise z-axis control and a trajectory planner, which are the key improvements of this version as seen in Fig. 1.

Fig. 1 - Concept of Hands-Free v2. By analyzing the hand skeleton using OpenPose it is possible to extract the index finger position over time to build a complete trajectory. The points are interpolated by the ad-hoc interpolator and the final trajectory is sent to the ROS node of the Sawyer robot.


Due to the COVID-19 pandemic restrictions, this project has been carried out using ROS Gazebo to reproduce the laboratory set-up already seen for Hands-Free v1. The camera adopted is a consumer-grade RGB camera that has been calibrated following the standard procedure described in the Hands-Free v1 paper. The robot calibration has been similarly performed by setting up a simulated environment as shown in Fig. 2.

Fig. 2 - Example of the robot calibration procedure carried out in the simulated Gazebo environment.


The trajectory builder is an extension of the capabilities of the old version of the software, hence it only works in 2D considering the user frame (xy plane) and the vertical robot frame (zy plane). The procedure to use this application is the following and may be seen in the video below:

  1. Users place their open hand on the user frame in order to detect the “hand-open” gesture. This allows the system to reset the variables and move the robot to its home pose (Fig. 3).
  2. After the initialization phase, users may move their hand around the user frame performing the “index” gesture (with both thumb and index finger opened). The index finger position is extracted considering keypoint 8. Only positions which differ from the preceding one by at least 5 px are retained. Moreover, each position is extracted as the mean position detected over N consecutive frames. In this case, we set N = 3 (Fig. 4).
  3. The detected trajectory points are filtered and interpolated according to the ad-hoc interpolator developed. The resulting trajectory may be sent to the robot if the “move” gesture is performed (with index and middle fingers opened, see Fig. 5).
Fig. 3 - Example of the initialization phase performed by using the "hand-open" gesture.
Fig. 4 - Example of the definition of a trajectory by performing the "index" gesture. The trajectory points are saved and filtered according to their position with respect to the preceding point.
Fig. 5 - Example of the launch of the interpolated trajectory, performed by using the "move" gesture.
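
The point-retention rule of step 2 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the Hands-Free code: the function name `build_trajectory` and its parameters are hypothetical, the N frames are grouped in non-overlapping windows, and the 5 px distance is measured against the last retained point.

```python
import math

def build_trajectory(fingertip_px, n_frames=3, min_step_px=5.0):
    """Average keypoint-8 positions over groups of n_frames and keep a point
    only if it lies at least min_step_px away from the last retained one."""
    trajectory = []
    # Assumption: non-overlapping groups of n_frames consecutive detections.
    for start in range(0, len(fingertip_px) - n_frames + 1, n_frames):
        group = fingertip_px[start:start + n_frames]
        mean_x = sum(p[0] for p in group) / n_frames
        mean_y = sum(p[1] for p in group) / n_frames
        if trajectory:
            last_x, last_y = trajectory[-1]
            if math.hypot(mean_x - last_x, mean_y - last_y) < min_step_px:
                continue  # too close to the previous point: discard
        trajectory.append((mean_x, mean_y))
    return trajectory
```

Averaging first and thresholding afterwards means that small jitter of the detected keypoint is smoothed before it can generate spurious trajectory points.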


To control the manipulator along the z-axis (which in this set-up corresponds to the robot’s x-axis) three different modalities have been studied and implemented. For now, however, the depth control is still separate from the trajectory planner.


Intuitively, by using this mode the robot may be moved home (h), forward (w), or backward (s) according to the pressed key of the keyboard. The stepsize of its movement is fixed. By pressing ctrl+c or esc it is possible to exit the modality and close the communication with the robot.


In this modality, the core functions of Hands-Free are retained in order to detect the hand gestures. However, in this case, only the “hand-open” gesture and the “move” gesture are detected. By checking the mutual distance between the two fingers of the “move” gesture it is possible to detect if the robot should move forward (small distance up to zero) or backward (higher distance, corresponding to the original “move” gesture with the two fingers quite separate from each other). An example of this modality may be seen in the video below.
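
The distance check described above could look like the following minimal sketch. The function name `depth_command` and the 30 px threshold are hypothetical values chosen for illustration only; the fingertips are assumed to be given as (x, y) pixel coordinates of the index and middle fingertip keypoints.

```python
import math

def depth_command(index_tip, middle_tip, close_threshold_px=30.0):
    """Return 'forward' when the two fingertips are close together and
    'backward' when they are clearly apart (original "move" gesture)."""
    distance = math.hypot(index_tip[0] - middle_tip[0],
                          index_tip[1] - middle_tip[1])
    # The threshold is an illustrative assumption, not a value from the thesis.
    return "forward" if distance < close_threshold_px else "backward"
```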


The last modality implements depth control by leveraging the Vicara Kai wearable sensor. The sensor should be worn on the hand and, according to the detected orientation of the opened hand, it is possible to determine if the robot should stay still (hand parallel to the ground), move forward (hand tilted down), or backward (hand tilted up).


C. Nuzzi, S. Ghidini, R. Pagani, S. Pasinetti, G. Coffetti and G. Sansoni, “Hands-Free: a robot augmented reality teleoperation system,” 2020 17th International Conference on Ubiquitous Robots (UR), Kyoto, Japan, 2020, pp. 617-624, doi: 10.1109/UR49135.2020.9144841.

Analysis and development of a teleoperation system for cobots based on vision systems and hand-gesture recognition


The idea behind this thesis work is to make a first step towards the development of a vision-based system to teleoperate cobots in real-time using the tip of the user’s hand.

This has been experimentally done by developing a ROS-based program that simultaneously (i) analyzes the hand-gesture performed by the user leveraging the OpenPose skeletonization algorithm and (ii) moves the robot accordingly.


After the Kinect v2 sensor has been intrinsically calibrated to align the depth data to the RGB images, it is necessary to calibrate the workspace, hence establishing a user frame reference system in order to convert from image pixels to meters and vice-versa.
This has been done by adopting the vision library OpenCV. By using its functions it has been possible to automatically detect the master’s markers and assign to each of them the corresponding coordinates in the user reference system. Hence, given the pairs of points in the user frame and in the camera frame, it has been possible to estimate the calibration matrix M by solving the linear system using the least squares method.
In Fig. 1 the experimental set-up of the camera-user frame portion of the project is presented.
Fig. 1 - Experimental set-up. The horizontal user frame is viewed by a Kinect v2 camera mounted at fixed height.
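
As an illustration of the least squares step, the sketch below fits a calibration matrix from marker correspondences with NumPy. It assumes a 2x3 affine model between pixel and user-frame coordinates, which is a simplification chosen for this example; the form of M actually used in the thesis is not stated here. The function names are hypothetical.

```python
import numpy as np

def estimate_calibration_matrix(pixel_pts, user_pts):
    """Least-squares fit of a 2x3 affine matrix M such that
    user_pt ≈ M @ [u, v, 1] for every (pixel, user-frame) pair."""
    pixel_pts = np.asarray(pixel_pts, dtype=float)
    user_pts = np.asarray(user_pts, dtype=float)
    # Homogeneous pixel coordinates: one row [u, v, 1] per marker.
    A = np.hstack([pixel_pts, np.ones((len(pixel_pts), 1))])
    # Solve A @ M.T ≈ user_pts in the least-squares sense.
    M_T, *_ = np.linalg.lstsq(A, user_pts, rcond=None)
    return M_T.T

def pixels_to_meters(M, pixel_pt):
    """Convert one pixel position to user-frame coordinates."""
    u, v = pixel_pt
    return M @ np.array([u, v, 1.0])
```

At least three non-collinear markers are needed for this model; using more markers than strictly necessary lets the least squares solution average out detection noise.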


First test

In this test a rectangular object of size 88 x 41 x 25 mm has been positioned at 8 measurement points of the set-up, as shown in Fig. 2, by placing its bottom-left corner on the measurement point.

The measured position of the object in each point has been calculated by applying the conversion from pixels to meters developed before. Hence, it has been possible to estimate the positional error as the difference between the measured position and the real position of the bottom-left corner of the object in each pose.
From the results of this analysis it emerged that the average displacement for the two coordinates is equal to 6.0 mm for the x-axis and 5.8 mm for the y-axis. These two errors are probably caused by the perspective distortions of the lenses and by the scene illumination. In fact, since the calibration performed did not consider the lens distortions, they affect the measurements especially around the corners of the image. Moreover, the height of the object casts shadows on the set-up that introduce errors in the corner detection, depending on the mutual position of the light and the object.


Similarly to the first test, in this one a rectangular object of size 73 x 48 x 0.5 mm has been positioned at the 8 measurement points.

In this case the average displacement observed is equal to 6.4 mm for the x-axis and 6.5 mm for the y-axis. This highlighted how heavily the perspective distortions affect the measurements: in fact, since the height of the object is not enough to cast shadows on the plane, these errors are only due to the lens distortions.

Fig. 2 - Measurement points of the calibration master adopted. The reference system is centered around the bottom-left marker (marker 0) of the target.


Since the system uses OpenPose to extract the hand skeleton in real time, it has been necessary to define three hand-gestures to detect according to the position of the keypoints (see Fig. 3, Fig. 4 and Fig. 5).

However, since OpenPose estimates the hand keypoints even if they are not present, it has been necessary to define a filtering procedure to determine the output gesture according to some geometrical references:

  • the thumb must be present to correctly assign the keypoints numbers
  • the distance between the start and end keypoint of the thumb must be between 20 and 50 mm
  • the angle between the thumb and the x-axis must be > 90°
  • the distance between the start and end keypoints of index, middle and ring fingers must be between 20 and 100 mm
  • the distance between the start and end keypoint of the pinky finger must be between 20 and 70 mm
Moreover, the acquisition window has been set to 3 s in order to take into account the time intervals between a change of hand gesture or a movement.
Fig. 3 - Hand-gesture of the first sub-task named "positioning".
Fig. 4 - Hand-gesture of the second sub-task named "picking".
Fig. 5 - Hand-gesture of the third sub-task named "release".
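
The geometrical checks listed above can be sketched as a simple filter. This is a hedged illustration, not the project's implementation: the function names, the dictionary layout and the way the thumb angle is measured (unsigned angle with respect to the positive x-axis) are assumptions, and keypoints are assumed to be already converted to millimeters in the user frame.

```python
import math

# Accepted start-to-end lengths in mm, taken from the list above.
LENGTH_LIMITS_MM = {
    "thumb": (20.0, 50.0),
    "index": (20.0, 100.0),
    "middle": (20.0, 100.0),
    "ring": (20.0, 100.0),
    "pinky": (20.0, 70.0),
}

def finger_length(start, end):
    return math.hypot(end[0] - start[0], end[1] - start[1])

def thumb_angle_deg(start, end):
    """Unsigned angle between the thumb direction and the positive x-axis."""
    return abs(math.degrees(math.atan2(end[1] - start[1], end[0] - start[0])))

def valid_fingers(fingers):
    """fingers maps a finger name to its (start, end) keypoints in mm.
    Returns the set of fingers that pass the checks, or an empty set
    when the thumb is missing or implausible."""
    thumb = fingers.get("thumb")
    if thumb is None:
        return set()
    low, high = LENGTH_LIMITS_MM["thumb"]
    if not (low <= finger_length(*thumb) <= high) or thumb_angle_deg(*thumb) <= 90.0:
        return set()
    accepted = {"thumb"}
    for name, (start, end) in fingers.items():
        if name == "thumb":
            continue
        low, high = LENGTH_LIMITS_MM[name]
        if low <= finger_length(start, end) <= high:
            accepted.add(name)
    return accepted
```

The resulting set of accepted fingers could then be matched against the finger combination expected for each gesture.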


The proposed gestures have been performed by three male students with pale skin at different times of the day (morning, afternoon, late afternoon). The purpose of this test was to determine if the system was able to robustly detect the gestures under different illumination conditions of the scene.

The students moved their hand around the user frame and performed the gestures one at a time. On average, the proposed gestures were recognized 90% of the time.

It is worth noting that the “positioning” gesture was defined in such a way as to reduce the misclassification of the keypoints that could happen in some cases due to the presence of only one finger (the index). In fact, it has been observed that increasing the number of fingers clearly visible in the scene also increased the recognition accuracy of the gesture. This is probably due to the fact that the thumb must be present to avoid misclassifications with the index finger. However, even if the “positioning” gesture adopts three fingers, only the position of the index finger’s tip (keypoint 8) is used to estimate the position to which the user is pointing.


The complete system is composed of (i) the gesture recognition ROS node that detects the gesture and (ii) a robot node to properly move the Sawyer cobot accordingly. Since the robot workspace is vertical (as shown in Fig. 6), it has been necessary to calibrate the vertical workspace with respect to the robot user frame. This has been done using a centering tool to build the calibration matrix from pairs of points expressed in robot frame coordinates and in vertical workspace coordinates.

Fig. 6 - Complete set-up developed, showing the two workspaces (horizontal and vertical).

Project Hands-Free presented at Ubiquitous Robotics 2020

Hands-Free is a ROS-based software to teleoperate a robot with the user hand. The skeleton of the hand is extracted by using OpenPose and the position of the user’s index finger in the user workspace is mapped to the corresponding robot position in the robot workspace.

The project is available on GitHub and the paper has been published in the Ubiquitous Robotics 2020 virtual conference proceedings.

Check out the presentation video below!

Deep Learning in the kitchen: development and validation of an action recognition system based on RGB-D sensors

This thesis work is part of a research project between the Laboratory and the Cast Alimenti cooking school. Cast Alimenti aims to obtain a product to improve teaching in its classrooms. The idea is to develop a system (hardware and software) that simplifies the process of writing recipes while the teacher gives a lecture in the kitchen. The project aims to translate into written language a recipe performed during the demonstration lessons, which also makes it possible to transcribe in Italian the lessons given by foreign teachers. This goal has been achieved through the use of an action recognition system.

Action recognition aims to interpret human actions through mathematical algorithms and is based on the identification and tracking of the position of the human body in time. The topic is actively studied in various fields: by using this technique, devices such as smart bands and cell phones may recognize whether a person is stationary, walking or running. Another example is related to security cameras and systems, in which the recognition of actions that may be considered violent or dangerous allows the authorities to intervene quickly when necessary.

The project is divided into several phases. First, it is necessary to implement the recognition of the cook’s activity during the practical demonstration and later, based on the recognized actions, it is necessary to automate the writing of the recipe. This thesis work is focused on the first part of the project, in particular on the choice of the mathematical algorithm necessary for the recognition of actions.


To fully understand the difficulties of applying the system under development, it is necessary to contextualize the environment in which it will work. The scenario is that of an open kitchen in which the cook works behind a counter, stays in the same position for most of the time, and only his upper body is visible (Fig. 1). The cook interacts with the various working tools and food only through the upper limbs and there is no interaction with other subjects.

The kitchen changes as the recipe progresses, as work tools such as stoves, mixers, pots and pans and other utensils are added and removed as needed. The predominant colors of the scene are white (the cook’s uniform, the wall behind him, the cutting board) and steel (the worktop and equipment). This results in uneven lighting and the creation of reflections and shadows. The presence of machinery and heat sources generates both auditory and visual noise, especially in the infrared spectrum. The working environment also presents other critical issues, such as the presence of water, steam, temperature changes, substances of various kinds such as oil and acids, as well as cleaning chemicals.

To obtain reliable data in this context it is necessary to use appropriate instrumentation, with specific technical characteristics.

Fig. 1 - Example of the "smart" kitchen used in this work at Cast Alimenti. The predominant colors are white and gray and the overall luminance is dark. Due to the characteristics of the scenario the instrumentation of choice must be carefully selected.

Technology of choice for action recognition

Tracking is the first step in action recognition. The technology used for this project was carefully selected by evaluating the systems used commercially to track full-body movements, such as accelerometers and gyroscopes, RGB cameras and depth sensors (3D cameras). 

Wearable solutions involve the use of accelerometers coupled with gyroscopes, a technique adopted in almost all commercially available smart bands for sports performance assessment and sleep monitoring. However, using such wearable devices in this project would require the cook to wear sensors on the hands or wrists, and the data obtained from them would have to be synchronized with each other. Considering the scenario in which the cook operates, the sensors would also need to be waterproof and resistant to aggressive substances.

Optical instrumentation, such as RGB or 3D cameras positioned externally to the scene, makes it possible to keep the image sensors away from the chef’s workspace, thus avoiding exposure to the critical environmental conditions of the kitchen. A disadvantage of this solution is the huge amount of images obtained and the related processing. However, the computational power of modern computer systems allows their application in real-time systems.

Given the environmental conditions and the need of making the acquisition system easily accessible to non-expert users, we opted for the second solution choosing to place the cameras in front of the cook who performs the recipe, thus replicating the view of the students during the lectures.

Fig. 2 - Example taken from the experimental set-up. The two Kinect v2 cameras have been positioned in front of the kitchen counter, replicating the students' view.


An action is any movement made by the operator in which tools are used to obtain a certain result. Characteristics of an action are (i) the speed of execution and (ii) the space in which it takes place. Based on these variables, it is important to select an appropriate frame rate and sensor resolution.

Recognizing the performance of actions through a mathematical algorithm that analyzes images is not a computationally simple task, because the computational load increases proportionally to the number of frames per second and to the image resolution. It is therefore crucial to find the optimal configuration in order to maintain a good image quality, which is necessary to recognize the action, while still being able to use a consumer processing system.
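
To give a rough idea of this proportionality, the sketch below counts the pixels to be processed per second for two stream configurations. The figures used (a 1920 x 1080 color stream and a 512 x 424 depth stream, both at 30 frames per second) are the nominal Kinect v2 specifications and are assumed here for illustration; the configuration actually adopted in the thesis is not stated in this text.

```python
def pixels_per_second(width, height, fps):
    """Raw number of pixels an algorithm must process each second."""
    return width * height * fps

color_load = pixels_per_second(1920, 1080, 30)  # 62,208,000 pixels/s
depth_load = pixels_per_second(512, 424, 30)    # 6,512,640 pixels/s

# Ratio between the two loads (about 9.6 times more pixels for the color stream).
ratio = color_load / depth_load
```

Halving the frame rate or the resolution along each axis reduces this raw load by a factor of 2 or 4 respectively, which is why both parameters must be chosen according to the speed and spatial extent of the actions to recognize.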

Modern consumer computing systems (PCs) currently provide sufficient computational power to perform the necessary calculations by harnessing the power of parallel computing in GPUs (graphics processing units).


There are several algorithms available to analyze actions in real time given a set of image frames temporally consistent. They may be subdivided into two main categories:

  • Algorithms that analyze 3D images, such as images generated using depth cameras. This type of data removes all issues related to the color composition of the scene and subjects that may be blurred or that vary during the execution (Fig. 3 b);
  • Algorithms that process skeleton data, in which an artificial skeleton composed of keypoints corresponding to the fundamental joints of the body is computed by the network. The keypoints represent (x, y, z) positions of the body’s joints in the camera reference system (Fig. 3 a).

Moreover, by combining the two categories it is possible to obtain hybrid algorithms that analyze both types of data.

Among the broad set of algorithms available two of them have been selected for this work:

  1. HPM+TM: it is a supervised classification algorithm developed in MATLAB by the University of Western Australia. It was created specifically for action recognition and has achieved the best performance in the 3D Action Pairs dataset, reaching an accuracy of 98%.
  2. indRNN: this model was developed as part of a collaboration between Australia’s University of Wollongong and the University of Electronic Science and Technology of China. Although the algorithm has not been specifically designed for action recognition, it is still applicable where it is necessary to recognize features over time. It is a supervised classification algorithm and obtained an accuracy of 88% on the NTU RGB+D dataset.
Fig. 3 - Example of data that may be processed by deep learning models. (a) Image frame in RGB on which skeletal data is drawn; (b) depth frame taken from Kinect v2 cameras.


The experimental campaign took place over two days at the Cast Alimenti cooking school. Two Kinect v2 cameras recorded Nicola Michieletto, the chef who has worked with our Laboratory since the beginning of the project, while he cooked lasagne. The entire preparation was repeated and filmed twice, with the aim of obtaining a larger and more representative dataset.

A first analysis made it evident that some actions were repeated much more often than others. This strong difference in the number of samples available forced us to carry out some preliminary analyses in order to understand how the presence of categories with a small number of samples influences the accuracy of the algorithm adopted.

Therefore, we selected only a sub-sample of the actions present in the dataset, namely:

  1. stirring: using a ladle the cook mixes the ingredients inside a pot or a bowl with circular movements;
  2. pouring: the cook takes an ingredient from one container and pours it inside another container;
  3. rolling out the pasta: a process in which pasta is made flat and thin using an inclined plane dough sheeter. The cook loads a thick sheet from the top of the machine and pulls out a thinner sheet from the bottom;
  4. cutting: the cook cuts a dish by means of a kitchen knife; the dish is held steady with the left hand and the knife is used with the right hand;
  5. placing the pasta: the cook takes the pasta sheet from the cloths on which it was put to rest and places it inside the pan where he is composing the lasagna;
  6. spreading: process by which the béchamel and Bolognese sauce are distributed in an even layer during the composition of the lasagna;
  7. sprinkling: the cook takes the Parmesan grated cheese and distributes it forming an even layer;
  8. blanching: a process in which the cook takes freshly rolled pasta from the counter and plunges it inside a pot with salted water for a brief cooking time;
  9. straining the pasta: with the use of a perforated ladle the cook removes the pasta from the pot in which it was cooking and deposits it in a pan with water and ice;
  10. draining the pasta on cloth: the cook removes with his hands the pasta from the water and ice pan and lays it on a cloth in order to allow it to dry;
  11. folding the pasta: during the rolling-out process it is sometimes necessary to fold the pasta on itself before rolling it out again, in order to obtain a more uniform pasta sheet;
  12. turn on/off induction plate: the cook turns on or off a portable induction plate located on the work counter;
  13. catching: simple process where the cook grabs an object and moves it closer to the work point;
  14. moving pot: the cook moves the pot within the work space, in most cases involves moving it to or from the induction plate.
Fig. 4 - Examples of the 14 actions selected for the work. (a) stirring, (b) pouring; (c) rolling out the pasta; (d) cutting; (e) placing the pasta; (f) spreading; (g) sprinkling; (h) blanching; (i) straining the pasta; (l) draining the pasta on cloth; (m) folding the pasta; (n) turn on/off induction plate; (o) catching; (p) moving pot.


We performed a detailed analysis to determine the performance of the two algorithms according to (i) a reduction of the number of classes and (ii) an increase of the number of samples per class. Although it is theoretically known that deep learning algorithms improve as more data are used and that a high number of classes with similarities between each other reduces the overall inference accuracy, with this test we wanted to quantify this phenomenon.

In summary, the results for each algorithm are:

  • HPM+TM: this algorithm performs better when fewer classes are adopted and a high number of samples per class is used. Highest accuracy achieved: 54%
  • indRNN: this model performs better than the other one and is more robust even if fewer samples per class are used. Moreover, no significant improvements can be observed by reducing the number of classes by more than half. Highest accuracy achieved: 85%
Moreover, by observing the resulting confusion matrices it is possible to note that the “stirring” and “pouring” classes are the most critical. In fact, the highest number of false positives is obtained for the “stirring” class while the highest number of false negatives is observed for the “pouring” class. Both cases are often due to the two classes being confused with each other. This highlights the fact that during the cooking procedure the chef often poured an ingredient while stirring the pot with the other hand, so the two actions frequently overlap. It would therefore be best to merge the two classes into one to account for this eventuality.
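
For readers unfamiliar with how false positives and false negatives are read off a confusion matrix, the following minimal sketch computes both counts per class. The function name is hypothetical and the convention assumed is that rows are true classes and columns are predicted classes; any numbers passed to it in examples are made up and are not results of this work.

```python
def false_positives_negatives(confusion, labels):
    """confusion[i][j] counts samples of true class i predicted as class j.
    Returns, for each label, the number of false positives and negatives."""
    stats = {}
    for k, label in enumerate(labels):
        true_positives = confusion[k][k]
        # Samples of class k assigned to any other class.
        false_negatives = sum(confusion[k]) - true_positives
        # Samples of other classes assigned to class k.
        false_positives = sum(row[k] for row in confusion) - true_positives
        stats[label] = {"fp": false_positives, "fn": false_negatives}
    return stats
```

With this convention, a class that absorbs samples from a similar class shows a high column total (false positives), while the class losing those samples shows a high row total (false negatives), which is the pattern described above for “stirring” and “pouring”.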

Project “RemindLy” participating in the HRI Student Competition!

Cristina Nuzzi, Stefano Ghidini, Roberto Pagani, and Federica Ragni, a team of DRIMI Applied Mechanics Ph. D. Students, participated in the Student Competition of the 2020 ACM/IEEE International Conference on Human-Robot Interaction.

Their project, called “RemindLy”, has been published in the conference proceedings and is available in the ACM library.

Check out their presentation video below!

This video has also been selected for IEEE Spectrum's weekly selection of robotics videos and published online!