UNIBS-UNITN Collaboration continues!

The collaboration between the Mechanical and Thermal Measurements group (MMT and Vis4Mechs) of the University of Brescia and the MiRo laboratory of the University of Trento continues.

The measurement experiments involving the various members of the two teams are aimed at evaluating, thanks to a precise protocol established in a previous publication, the reconstruction accuracy of consumer (Kinect v2 and Kinect Azure) and industrial (Basler) time-of-flight cameras, compared to the gold standard obtained with a Konica Minolta 3D scanner.

The experimental campaign includes several tests, which will be held for now separately in the two locations in compliance with the pandemic prevention regulations. A first meeting has already been held in the MiRo Laboratory of Trento, the only one that required the simultaneous presence of the various devices.

related publications

S. Pasinetti, M.M. Hassan, J. Eberhardt, M. Lancini, F. Docchio, G. Sansoni, “Performance analysis of the PMD Camboard Picoflexx time-of-flight camera for markerless motion capture applications“, in IEEE Transactions on Instrumentation and Measurement 68 (11), 4456-4471

Analysis and development of a teleoperation system for cobots based on vision systems and hand-gesture recognition


The idea behind this thesis work is to make a first step towards the development of a vision-based system to teleoperate cobots in real-time using the tip of the user’s hand.

This has been experimentally done by developing a ROS-based program that simultaneously (i) analyzes the hand-gesture performed by the user leveraging OpenPose skeletonization algorithm and (ii) moves the robot accordingly.


After the Kinect v2 sensor has been intrinsecally calibrated to perfectly align the depth data to the RGB images, it is necessary to calibrate the workspace, hence establishing a user frame reference system in order to convert from image pixels to meters and vice-versa.
This has been done by adopting the vision library OpenCV. By using its functions it has been possible to detect automatically the master’s markers and assigning to each of them the corresponding coordinates in the user reference system. Hence, given the couples of points in the user frame and in the camera frame, it has been possible to estimate the calibration matrix M by solving the linear system using the least squares method.
In Fig. 1 the experimental set-up of the camera-user frame portion of the project is presented.
Fig. 1 - Experimental set-up. The horizontal user frame is viewed by a Kinect v2 camera mounted at fixed height.


First test

In this test a rectangular object of shape 88 x 41 x 25 mm has been positioned in correspondence of 8 measure points of the set-up as shown in Fig. 2 by placing its bottom-left corner in the measure point. 

The measured position of the object in each point has been calculated by applying the conversion from pixels to meters developed before. Hence, it has been possible to estimate the positional error as the difference between the measured position and the real position of the bottom-left corner of the object in each pose.
From the results of this analysis it emerged that the average displacement for the two coordinates is equal to 6.0 mm for the x-axis and 5.8 mm for the y-axis. These two errors are probably caused by the prospectic distortions of the lenses and by the scene illumination. In fact, since the calibration performed did not consider the lens distortions, their effect affects the measurements especially around the corners of the image. Moreover, the height of the object casts shadows on the set-up that introduce errors in the corner detection due to the mutual position of the light and the object.


Similarly to the first test, in this one a rectangular object of shape 73 x 48 x 0.5 mm has been positioned in correspondence of the 8 measure points.

In this case the average displacement observed is equal to 6.4 mm for the x-axis and 6.5 mm for the y-axis. This highlighted how the prospectic distortions heavily affect the measurements: in fact, since the height of the object is not enough to cast shadows on the plane, these errors are only due to the lens distortions.

Fig. 2 - Measure points of the calibration master adopted. The reference system is centered around the bottom-left marker (marker 0) of the target.


Since the system adoperates OpenPose to extract in real-time the hand skeleton, it has been necessary to define three hand-gestures to detect according to the position of the keypoints (see Fig. 3, Fig. 4 and Fig. 5).

However, since OpenPose estimates the hand keypoints even if they are not present, it has been necessary to define a filtering procedure to determine the output gesture according to some geometrical references:

  • the thumb must be present to correctly assign the keypoints numbers
  • the distance between the start and end keypoint of the thumb must be between 20 and 50 mm
  • the angle between the thumb and the x-axis must be > 90°
  • the distance between the start and end keypoints of index, middle and ring fingers must be between 20 and 100 mm
  • the distance between the start and end keypoint of the pinky finger must be between 20 and 70 mm
Moreover, the acquisition window has been set to 3 s in order to take into account the time intervals between a change of hand gesture or a movement.
Fig. 3 - Hand-gesture of the first sub-task named "positioning".
Fig. 4 - Hand-gesture of the second sub-task named "picking".
Fig. 5 - Hand-gesture of the third sub-task named "release".


The proposed gestures have been performed by three male students with pale skin in different moments of the day (morning, afternoon, late afternoon). The purpose of this test was to determine if the system was able to robustly detect the gestures also considering the illumination of the scene.

The students moved their hand around the user frame and performed the gesture one at a time. I resulted that, on average, the proposed gestures were recognized 90% of the times.

It is worth noting that gesture “positioning” was defined in such a way to reduce the misclassification of the keypoints that could happen in some cases due to the presence of only one finger (the index). In fact, it has been observed that incrementing the number of fingers clearly visible in the scene also incremented the recognition accuracy of the gesture. This is probably due to the fact that the thumb must be present to avoid misclassifications with the index finger. However, even if the “positioning” gesture adopts three fingers, only the position of the index finger’s tip (keypoint 8) is used to estimate the position to which the user is pointing to.


The complete system is composed by (i) the gesture recognition ROS node that detects the gesture and (ii) a robot node to properly move the Sawyer cobot accordingly. Hence, since the robot workspace is vertical (as shown in Fig. 6) it has been necessary to properly calibrate the vertical workspace with respect to the robot user frame. This has been done using a centering tool to build the calibration matrix (adopting points couples of robot frame coordinates – vertical workspace coordinates).

Fig. 6 - Complete set-up developed, showing the two workspaces (horizontal and vertical).

Extrinsic Calibration procedure by using human skeletonization

Metrology techniques based on Industrial Vision are increasingly used both in research and industry. These are contactless techniques characterized by the use of optical devices, such as RGB and Depth cameras. In particular, multi-camera systems, i. e. systems composed of several cameras, are used to carry out three-dimensional measurements of various types. In addition to mechanical measurements this type of systems are often used to monitor the movements of human subjects within a given space, as well as for the reconstruction of three-dimensional objects and shapes.

To perform an accurate measurment, multi-camera systems must be carefully calibrated. The calibration process is a fundamental concept in Computer Vision applications that involve image-based measurements. In fact, Vision System calibration is a key factor to obtain reliable performances, as it allows to find the correspondence between the workspace and the points present on the images acquired by the cameras.

A multi-camera system calibration is the process that allows to obtain the geometric parameters related to distortions, positions and orientations of each camera that constitutes the system. It is therefore necessary to identify a mathematical model that takes into account both the internal functioning of the device and the relationship between the camera and the external world.

The calibration process consists of two parts:

  1. The intrinsic calibration, necessary to model the individual devices in terms of focal length, coordinates of the main point and distortion coefficients of the acquired image;
  2. The extrinsic calibration, necessary to determine the position and orientation of each device with respect to an absolute reference system. By means of extrinsic calibration it is possible to obtain the position of the camera reference system with respect to the absolute reference system in terms of rotation and translation.
Traditional calibration methods are based on the use of a calibration target, i. e. a known object, usually a plane object, on which there are calibration points that are unambiguously recognizable and have known coordinates. These calibration targets are automatically recognized by the system in order to calculate the position of the object with respect to the camera and thus obtain the parameters of rotation and translation of the camera with respect to the global reference system (Fig. 1).
Fig. 1 - Examples of traditional calibration masters.

The methods based on the recognition of three-dimensional shapes are based on the geometric coherence of a 3D object positioned in the field of view (FoV) of the various cameras. Each device records only a part of the target object and then, combining each view with the portion actually recorded by the camera, the relative displacement of the sensors with respect to the object is evaluated, thus obtaining the rotation and translation of the acquisition device itself (Fig. 2).

Fig. 2 - Example of a known object used as the calibration master, in this case a green sphere of a known diameter.

However, both traditional calibration methods and those based on the recognition of three-dimensional shapes have the disadvantage of being very complex and computationally expensive. In fact, former methods exploit the use of a generally flat calibration target which limits the positioning of the target itself because, under certain conditions and positions, it is not possible to obtain simultaneous views. The latter methods also require to converge to a solution, hence a good initialization of the parameters of 3D recognition by the operator is needed, resulting in low reliability and limited automation.

Finally, the methods based on the recognition of the human skeleton use as targets the skeleton joints of the human positioned within the FoV of the cameras. Skeleton based methods represent therefore an evolution of the 3D shape matching methods, since the human is considered as the target object (Fig. 3).

In this thesis work, we exploited skeleton based methods to create a new calibration method which is (i) easier to use compared to known methods, since users only need to stand in front of a camera and the system will take care of everything and (ii) faster and computationally inexpensive.

Fig. 3 - Example of a skeleton obtained by connecting the joints acquired from a Kinect camera (orange dots) and from a calibration set up based on multiple cameras (red dots).

STEP 1: Measurment set-up

The proposed set-up is very simple, as it is composed of a pair of Kinect v2 intrinsecally calibrated by using known procedures, a process required to correctly align the color information on top of the depth information acquired by the camera.

The cameras are placed in order to acquire the measurment area in a suitable way. After the calibration images have been taken, these are evaluated by a skeletonization algorithm based on OpenPose, which also takes into account the depth information to correctly place the skeleton in the 3D world [1]. Joints coordinates are therefore extracted for every pair of color-depth images correctly aligned both spatially and temporally.

To obtain accurate joint positions we also perform an optimization procedure afterwards to correctly place the joints on the human figure according to a minimization error procedure written in MATLAB.

Fig. 4 shows the abovementioned steps, while Fig. 5 shows a detail of the skeletonization algorithm used.


Fig. 4 - Steps required for the project. First we acquire both RGB and Depth frames from every Kinect, second we perform the skeletonization of the human using the skeletonization algorithm of choice and finally we use it to obtain the extrinsinc calibration matrix for each kinect, in order to reproject them to a known reference system.
Fig. 5 - Skeletonization procedure used by the algorithm.

STEP 2: Validation of the proposed procedure

To validate the system we used a mannequin placed still in 3 different positions (2 m, 3.5 m, 5 m) showing both the front and the back of it to the cameras. In fact, skeletonization procedures are usually more robust when humans are in a frontal position since also the face keypoints are visible. The validation positions are shown in Fig. 6, while the joint positions calculated by the procedure are shown in Fig. 7 and 8.

Fig. 6 - Validation positions of the mannequin. Only a single camera was used in this phase.
Fig. 7 - Joints calculated by the system when the mannequin is positioned in frontal position at (a) 2 m, (b) 3.5 m and (c) 5 m.
Fig. 8 - Joints calculated by the system when the mannequin is positioned in back position at (a) 2 m, (b) 3.5 m and (c) 5 m.

STEP 3: Calibration experiments

We tested the system in 3 configurations shown in Fig. 9: (a) when the two cameras are placed in front of the operator, one next to the other; (b) when a camera is positioned in front of the operator and the other is positioned laterally with an angle between them of 90°; (c) when the two cameras are positioned with an angle of 180° between them (one in front of the other, operator in the middle).

We first obtained the calibration matrix by using our procedure, hence the target used is the human subject placed in the middle of the scene (at position zero). Then, we compared the calibration matrix obtained in this way to the calibration matrix obtained from another algorithm developed by the University of Trento [2]. This algorithm is based on the recognition of a 3D known object, in this case the green sphere shown in Fig. 2.

The calibration obtained from both methods is evaluated using a 3D object, a cylinder of known shape which has been placed in 7 different positions in the scene as shown in Fig. 10. The exact same positions have been used for each configuration.

Fig. 9 - Configurations used for the calibration experiments.
Fig. 10 - Cylinder positions in the scene.

We finally compared the results aligning the point clouds obtained by using both calibration matrixes. These results have been elaborated in PolyWorks for better visualization, and are shown in the presentation below. Feel free to download it to know more about the project and to view the results!