Hands-Free v2 to teleoperate robotic manipulators: three axis precise positioning study


This thesis project improves the teleoperation system presented at Ubiquitous Robotics by adding z-axis control. The original system only performed xy teleoperation: users moved the end-effector of the robot to the position indicated by the index finger keypoint extracted by OpenPose. This version adds precise z-axis control and a trajectory planner, which are its key improvements, as seen in Fig. 1.

Fig. 1 - Concept of Hands-Free v2. By analyzing the hand skeleton using OpenPose it is possible to extract the index finger position over time to build a complete trajectory. The points are interpolated by the ad-hoc interpolator and the final trajectory is sent to the ROS node of the Sawyer robot.


Due to the COVID-19 pandemic restrictions, this project has been carried out using ROS Gazebo to reproduce the laboratory set-up already seen for Hands-Free v1. The camera adopted is a consumer-grade RGB camera that has been calibrated following the standard procedure described in the Hands-Free v1 paper. The robot calibration has been similarly performed by setting up a simulated environment as shown in Fig. 2.

Fig. 2 - Example of the robot calibration procedure carried out in the simulated Gazebo environment.


The trajectory builder is an extension of the capabilities of the old version of the software, hence it only works in 2D considering the user frame (xy plane) and the vertical robot frame (zy plane). The procedure to use this application is the following and may be seen in the video below:

  1. Users place their hand open on the user frame in order to detect the “hand-open” gesture. This allows the system to reset the variables and move the robot to its home pose (Fig. 3)
  2. After the initialization phase, users may move their hand around the user frame performing the “index” gesture (with both thumb and index finger extended). The index finger position is extracted considering keypoint 8. Only positions that differ from the preceding one by at least 5 px are retained. Moreover, each position is extracted as the mean position detected over N consecutive frames. In this case, we set N = 3 (Fig. 4)
  3. The detected trajectory points are filtered and interpolated according to the ad-hoc interpolator developed. The resulting trajectory may be sent to the robot if the “move” gesture is performed (with index and middle fingers opened, see Fig. 5).
Fig. 3 - Example of the initialization phase performed by using the "hand-open" gesture.
Fig. 4 - Example of the definition of a trajectory by performing the "index" gesture. The trajectory points are saved and filtered according to their position with respect to the preceding point.
Fig. 5 - Example of the launch of the interpolated trajectory, performed by using the "move" gesture.
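
The point selection described in step 2 can be sketched as follows. This is only an illustrative sketch, not the project's code: the function name `build_trajectory`, the use of non-overlapping groups of N frames, and the Euclidean distance for the 5 px test are assumptions, since the text only states the threshold and the value of N.

```python
import math
from statistics import mean

MIN_STEP_PX = 5  # minimum displacement from the last retained point (from the text)
N_FRAMES = 3     # number of frames averaged into one position (from the text)


def build_trajectory(fingertips, n_frames=N_FRAMES, min_step_px=MIN_STEP_PX):
    """Average keypoint-8 detections over n_frames, then keep only points
    that lie at least min_step_px away from the last retained point."""
    trajectory = []
    # Assumption: detections are grouped in non-overlapping windows of n_frames.
    for i in range(0, len(fingertips) - n_frames + 1, n_frames):
        group = fingertips[i:i + n_frames]
        point = (mean(p[0] for p in group), mean(p[1] for p in group))
        # Assumption: "differ by at least 5 px" is read as Euclidean distance.
        if not trajectory or math.dist(point, trajectory[-1]) >= min_step_px:
            trajectory.append(point)
    return trajectory
```

With this reading, a hand that jitters by a pixel or two between windows adds no new trajectory points, while a deliberate movement does.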


To control the manipulator along the z-axis (which in this set-up corresponds to the robot’s x-axis) three different modalities have been studied and implemented. For now, however, the depth control is still separate from the trajectory planner.


In the keyboard modality the robot is moved home (h), forward (w), or backward (s) according to the key pressed on the keyboard. The step size of each movement is fixed. Pressing ctrl+c or esc exits the modality and closes the communication with the robot.
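
A minimal sketch of this key mapping is given below. The step value, the sign convention (forward increases the depth coordinate) and the function name `next_depth` are hypothetical; the text only says that the step size is fixed.

```python
STEP_M = 0.01  # hypothetical fixed step in metres; the actual value is not given


def next_depth(key, depth, home_depth=0.0, step=STEP_M):
    """Return the new depth target for the pressed key, or None to quit."""
    if key == "h":       # home
        return home_depth
    if key == "w":       # forward (assumed to increase the depth coordinate)
        return depth + step
    if key == "s":       # backward
        return depth - step
    if key == "\x1b":    # esc: leave the modality
        return None
    return depth         # any other key leaves the target unchanged
```

In a real loop, ctrl+c would typically surface as a `KeyboardInterrupt` rather than as a key code, so it would be handled around the loop instead of inside this function.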


In this modality, the core functions of Hands-Free are retained in order to detect the hand gestures. However, in this case, only the “hand-open” gesture and the “move” gesture are detected. By checking the mutual distance between the two fingers of the “move” gesture it is possible to detect if the robot should move forward (small distance up to zero) or backward (higher distance, corresponding to the original “move” gesture with the two fingers quite separate from each other). An example of this modality may be seen in the video below.
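
The decision rule can be sketched as a threshold on the fingertip distance. The threshold value and the names `depth_command` and `CLOSE_THRESHOLD_PX` are assumptions made for illustration; the text does not state how "small" and "higher" distances are separated.

```python
import math

CLOSE_THRESHOLD_PX = 20.0  # hypothetical boundary between "fingers together" and "fingers apart"


def depth_command(index_tip, middle_tip, threshold=CLOSE_THRESHOLD_PX):
    """Map the distance between index and middle fingertips to a depth command."""
    distance = math.dist(index_tip, middle_tip)
    # Small distance (down to zero) moves forward, larger distance moves backward.
    return "forward" if distance <= threshold else "backward"
```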


The last modality implements depth control by leveraging the Vicara Kai wearable sensor. The sensor is worn on the hand and, according to the detected orientation of the open hand, it is possible to determine whether the robot should stay still (hand parallel to the ground), move forward (hand tilted down), or move backward (hand tilted up).
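
Assuming the hand pitch angle has already been read from the sensor, the mapping from orientation to command could look like the sketch below. The dead band value, the sign convention (positive pitch means hand tilted up) and the function name are hypothetical, and no call of the sensor's SDK is shown.

```python
DEAD_BAND_DEG = 15.0  # hypothetical tolerance around "hand parallel to the ground"


def tilt_command(pitch_deg, dead_band=DEAD_BAND_DEG):
    """Map the hand pitch to a depth command.

    Assumed convention: pitch_deg > 0 means the hand is tilted up.
    """
    if pitch_deg > dead_band:
        return "backward"   # hand tilted up
    if pitch_deg < -dead_band:
        return "forward"    # hand tilted down
    return "stay"           # hand roughly parallel to the ground
```

A dead band of this kind keeps the robot still when the hand is only approximately level, which avoids reacting to small involuntary movements.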


C. Nuzzi, S. Ghidini, R. Pagani, S. Pasinetti, G. Coffetti and G. Sansoni, “Hands-Free: a robot augmented reality teleoperation system,” 2020 17th International Conference on Ubiquitous Robots (UR), Kyoto, Japan, 2020, pp. 617-624, doi: 10.1109/UR49135.2020.9144841.

Analysis and development of a teleoperation system for cobots based on vision systems and hand-gesture recognition


The idea behind this thesis work is to make a first step towards the development of a vision-based system to teleoperate cobots in real-time using the tip of the user’s hand.

This has been done experimentally by developing a ROS-based program that simultaneously (i) analyzes the hand gesture performed by the user, leveraging the OpenPose skeletonization algorithm, and (ii) moves the robot accordingly.


After the Kinect v2 sensor has been intrinsically calibrated to align the depth data with the RGB images, it is necessary to calibrate the workspace, that is, to establish a user frame reference system in order to convert from image pixels to meters and vice versa.
This has been done by adopting the vision library OpenCV. By using its functions it has been possible to automatically detect the markers of the calibration master and assign to each of them the corresponding coordinates in the user reference system. Hence, given the pairs of corresponding points in the user frame and in the camera frame, it has been possible to estimate the calibration matrix M by solving the linear system with the least squares method.
In Fig. 1 the experimental set-up of the camera-user frame portion of the project is presented.
Fig. 1 - Experimental set-up. The horizontal user frame is viewed by a Kinect v2 camera mounted at fixed height.
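
The least squares estimate of M can be illustrated with the sketch below. It assumes an affine pixel-to-meter model, [x, y] ≈ M · [u, v, 1], which is one possible form of the calibration matrix; the actual model used in the project is not specified here. The function names are hypothetical and the code uses only the standard library.

```python
def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_calibration(pixels, metres):
    """Least-squares 2x3 matrix M such that [x, y] ~= M @ [u, v, 1] (assumed affine model)."""
    rows = [[u, v, 1.0] for u, v in pixels]
    # Normal equations (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    matrix = []
    for axis in range(2):
        atb = [sum(r[i] * p[axis] for r, p in zip(rows, metres)) for i in range(3)]
        matrix.append(solve_3x3(ata, atb))
    return matrix


def pixel_to_metre(matrix, u, v):
    """Convert a pixel position to user-frame coordinates with the fitted matrix."""
    return tuple(row[0] * u + row[1] * v + row[2] for row in matrix)
```

At least three non-collinear marker pairs are needed for this model; using more markers than the minimum is what makes the problem a least squares fit rather than an exact solve.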


First test

In this test a rectangular object measuring 88 x 41 x 25 mm has been placed at each of the 8 measurement points of the set-up shown in Fig. 2, with its bottom-left corner on the measurement point.

The measured position of the object at each point has been calculated by applying the pixel-to-meter conversion developed before. Hence, it has been possible to estimate the positional error as the difference between the measured position and the real position of the bottom-left corner of the object in each pose.
This analysis showed that the average displacement is 6.0 mm along the x-axis and 5.8 mm along the y-axis. These two errors are probably caused by the perspective distortions introduced by the lens and by the scene illumination. Since the calibration performed did not account for lens distortions, they affect the measurements, especially around the corners of the image. Moreover, because of its height the object casts shadows on the set-up, and these introduce errors in the corner detection that depend on the relative position of the light and the object.
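
The error computation can be sketched as a per-axis mean of the absolute differences between measured and real corner positions. Reading "average displacement" as a mean absolute error is an assumption, and the function name is hypothetical.

```python
def mean_abs_error(measured, real):
    """Per-axis mean absolute difference between measured and real positions.

    Both arguments are lists of (x, y) pairs expressed in the same unit.
    """
    n = len(measured)
    error_x = sum(abs(m[0] - r[0]) for m, r in zip(measured, real)) / n
    error_y = sum(abs(m[1] - r[1]) for m, r in zip(measured, real)) / n
    return error_x, error_y
```

The function would be called with the 8 measured corner positions and the 8 nominal positions of the measurement points.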


As in the first test, a rectangular object, this time measuring 73 x 48 x 0.5 mm, has been placed at each of the 8 measurement points.

In this case the average displacement observed is 6.4 mm along the x-axis and 6.5 mm along the y-axis. This highlights how heavily the perspective distortions affect the measurements: since the object is too thin to cast shadows on the plane, these errors are only due to the lens distortions.

Fig. 2 - Measure points of the calibration master adopted. The reference system is centered around the bottom-left marker (marker 0) of the target.


Since the system uses OpenPose to extract the hand skeleton in real time, it has been necessary to define three hand gestures to be detected according to the position of the keypoints (see Fig. 3, Fig. 4 and Fig. 5).

However, since OpenPose estimates the hand keypoints even if they are not present, it has been necessary to define a filtering procedure to determine the output gesture according to some geometrical references:

  • the thumb must be present to correctly assign the keypoints numbers
  • the distance between the start and end keypoint of the thumb must be between 20 and 50 mm
  • the angle between the thumb and the x-axis must be > 90°
  • the distance between the start and end keypoints of index, middle and ring fingers must be between 20 and 100 mm
  • the distance between the start and end keypoint of the pinky finger must be between 20 and 70 mm
Moreover, the acquisition window has been set to 3 s in order to take into account the time intervals between a change of hand gesture or a movement.
Fig. 3 - Hand-gesture of the first sub-task named "positioning".
Fig. 4 - Hand-gesture of the second sub-task named "picking".
Fig. 5 - Hand-gesture of the third sub-task named "release".
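
The geometric checks listed above can be sketched as follows. The length ranges come from the list; the data layout (a dictionary of start and end keypoints in millimetres), the function names, and the convention used for the thumb angle (unsigned angle between the thumb vector and the positive x-axis) are assumptions.

```python
import math

# Allowed start-to-end finger lengths in mm, taken from the list above.
LENGTH_RANGES_MM = {
    "thumb": (20, 50),
    "index": (20, 100),
    "middle": (20, 100),
    "ring": (20, 100),
    "pinky": (20, 70),
}


def finger_is_valid(name, start, end):
    """True when the finger length lies inside its allowed range."""
    low, high = LENGTH_RANGES_MM[name]
    return low <= math.dist(start, end) <= high


def thumb_angle_deg(start, end):
    """Unsigned angle between the thumb vector and the +x axis (assumed convention)."""
    return abs(math.degrees(math.atan2(end[1] - start[1], end[0] - start[0])))


def visible_fingers(fingers):
    """Return the names of the fingers that pass the checks.

    fingers maps a finger name to a (start, end) pair of 2D points in mm.
    An empty set is returned when the thumb is missing or fails its checks.
    """
    thumb = fingers.get("thumb")
    if thumb is None or not finger_is_valid("thumb", *thumb):
        return set()
    if thumb_angle_deg(*thumb) <= 90:
        return set()
    return {name for name, (s, e) in fingers.items() if finger_is_valid(name, s, e)}
```

The set of valid fingers could then be compared with the finger combination expected for each gesture.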


The proposed gestures have been performed by three male students with pale skin in different moments of the day (morning, afternoon, late afternoon). The purpose of this test was to determine if the system was able to robustly detect the gestures also considering the illumination of the scene.

The students moved their hand around the user frame and performed the gestures one at a time. The results showed that, on average, the proposed gestures were recognized 90% of the time.

It is worth noting that the “positioning” gesture was defined so as to reduce the misclassification of the keypoints that could happen in some cases when only one finger (the index) is present. In fact, it has been observed that increasing the number of fingers clearly visible in the scene also increased the recognition accuracy of the gesture. This is probably because the thumb must be present to avoid misclassifications with the index finger. However, even if the “positioning” gesture uses three fingers, only the position of the index fingertip (keypoint 8) is used to estimate the position the user is pointing to.


The complete system is composed of (i) the gesture recognition ROS node that detects the gesture and (ii) a robot node that moves the Sawyer cobot accordingly. Since the robot workspace is vertical (as shown in Fig. 6), it has been necessary to calibrate the vertical workspace with respect to the robot user frame. This has been done using a centering tool to build the calibration matrix from pairs of robot frame coordinates and vertical workspace coordinates.

Fig. 6 - Complete set-up developed, showing the two workspaces (horizontal and vertical).

Project Hands-Free presented at Ubiquitous Robotics 2020

Hands-Free is a ROS-based software to teleoperate a robot with the user hand. The skeleton of the hand is extracted by using OpenPose and the position of the user’s index finger in the user workspace is mapped to the corresponding robot position in the robot workspace.

The project is available on GitHub and the paper has been published in the Ubiquitous Robotics 2020 virtual conference proceedings.

Check out the presentation video below!

Project “RemindLy” participating in the HRI Student Competition!

Cristina Nuzzi, Stefano Ghidini, Roberto Pagani, and Federica Ragni, a team of DRIMI Applied Mechanics Ph. D. Students, participated in the Student Competition of the 2020 ACM/IEEE International Conference on Human-Robot Interaction.

Their project, called “RemindLy“, has been published in the conference proceedings and it is available at the ACM library.

Check out their presentation video below!

This video has also been selected by IEEE SPECTRUM weekly video selection on Robotics and published online!

Real-Time robot command system based on hand gestures recognition

With the Industry 4.0 paradigm, the industrial world has faced a technological revolution. Manufacturing environments in particular are required to be smart and integrate automatic processes and robots in the production plant. To achieve this smart manufacturing it is necessary to re-think the production process in order to create a true collaboration between human operators and robots. Robotic cells usually have safety cages in order to protect the operators from any harm that a direct contact can produce, thus limiting the interaction between the two. Only collaborative robots can really collaborate in the same workspace as humans without risks, due to their proper design. They pose another problem, though: in order to not harm human safety, they must operate at low velocities and forces, hence their operations are slow and quite comparable to the ones a human operator does. In practice, collaborative robots hardly have a place in a real industrial environment with high production rates.

In this context, this thesis work presents an innovative command system to be used in a collaborative workstation, in order to work alongside robots in a more natural and straightforward way for humans, thus reducing the time needed to command the robot on the fly. Recent techniques of Computer Vision, Image Processing and Deep Learning are used to create the intelligence behind the system, which is in charge of recognizing the gestures performed by the operator in real-time.

Step 1: Creation of the gesture recognition system

A number of suitable algorithms and models are available in the literature for this purpose. An Object Detector in particular has been chosen for the job, called “Faster Region-based Convolutional Neural Network”, or Faster R-CNN, implemented in MATLAB.

Object Detectors are especially suited for the task of gesture recognition because they are capable of (i) finding the objects in the image and (ii) classifying them, thus recognizing which objects they are. Figure 1 shows this concept: the object “number three”, which the algorithm has to find, is shown in the figure.

Fig. 1 - The process carried out by Object Detectors in general. Two networks process the image in different steps: first the region proposals are extracted, which are the positions of the objects of interest found. Then, the proposals are evaluated by the classification network, which at the end outputs both the position of the object (the bounding box) and the name of the object class.

After a careful selection of gestures, purposely acquired by means of different mobile phones, and a preliminary study to understand if the model was able to differentiate between left and right hand and at the same time between the palm and the back of the hand, the final gestures proposed and their meaning in the control system are shown in Fig. 2.

Fig. 2 - Definitive gesture commands used in the command system.

Step 2: creation of the command system

The proposed command system is structured as in Fig. 3: the images are acquired in real-time by a Kinect v2 camera connected to the master PC and processed in MATLAB in order to obtain the gesture commands frame by frame. The commands are then sent to the ROS node in charge of translating the numerical command into an operation for the robot. The ROS node, by means of a driver purposely developed for the robot used, sends the movement positions to the robot controller. Finally, the robot receives the ROS packets of the desired trajectory and executes the movements. Fig. 4 shows how the data are sent to the robot.

Fig. 3 - Overview of the complete system, composed of the acquisition system, the elaboration system and the actuator system.
Fig. 4 - The data are sent to the "PUB_Joint" ROS topic, elaborated by the Robox Driver which uses ROS Industrial and finally sent to the controller to move the robot.

Four modalities have been developed for the interface, by means of a State Machine developed in MATLAB:

  1. Points definition state
  2. Collaborative operation state
  3. Loop operation state
  4. Jog state
Below you can see the initialization of the system, which accounts for the light conditions of the working area and for the areas where the hands will probably be found, according to the barycenter calibration performed by the initialization procedure.
If you are interested in the project, download the presentation by clicking the button below. The thesis document is also available on request.

Related Publications

Nuzzi, C.; Pasinetti, S.; Lancini, M.; Docchio, M.; Sansoni, G. “Deep Learning based Machine Vision: first steps towards a hand gesture recognition set up for Collaborative Robots“, Workshop on Metrology for Industry 4.0 and IoT, pp. 28-33. 2018

Nuzzi, C.; Pasinetti, S.; Lancini, M.; Docchio, M.; Sansoni, G. “Deep learning-based hand gesture recognition for collaborative robots“, IEEE Instrumentation & Measurement Magazine 22 (2), pp. 44-51. 2019

Vision and safety for collaborative robotics

The communication and collaboration between humans and robots is one of the main principles of the fourth industrial revolution (Industry 4.0). In the next years, robots and humans will become co-workers, sharing the same working space and helping each other. A robot intended for collaboration with humans has to be equipped with safety components, which are different from the standard ones (cages, laser scans, etc.).

In this project, a safety system for applications of human-robot collaboration has been developed. The system is able to:

  • recognize and track the robot;
  • recognize and track the human operator;
  • measure the distance between them;
  • discriminate between safe and unsafe situations.

The safety system is based on two Microsoft Kinect v2 Time-Of-Flight (TOF) cameras. Each TOF camera measures the 3D position of each point in the scene evaluating the time-of-flight of a light signal emitted by the camera and reflected by each point. The cameras are placed on the safety cage of a robotic cell (Figure 1) so that the respective field of view covers the entire robotic working space. The 3D point clouds acquired by the TOF cameras are aligned with respect to a common reference system using a suitable calibration procedure [1].

Figure 1 - Positions of the TOF cameras on the robotic cell.
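
The time-of-flight principle mentioned above reduces to a one-line relation: light that returns after a round-trip time t has travelled to the point and back, so the distance is c·t/2. The sketch below only illustrates this relation; the function name is hypothetical, and a real sensor may estimate the travel time indirectly rather than timing a pulse.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum


def tof_distance(round_trip_s):
    """Distance in metres of a point whose reflected signal returns after round_trip_s seconds."""
    # The signal covers the camera-to-point distance twice, hence the division by 2.
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0
```

A round trip of 20 ns corresponds to a point roughly 3 m away, which gives an idea of the timing resolution required at the scale of a robotic cell.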

The robot and human detections are developed analyzing the RGB-D images (Figure 2) acquired by the cameras. These images contain both the RGB information and the depth information of each point in the scene.

Figure 2 - RGB-D images captured by the two TOF cameras.

The robot recognition and tracking (Figure 3) is based on a KLT (Kanade-Lucas-Tomasi) algorithm, using the RGB data to detect the moving elements in a sequence of images [2]. The algorithm analyzes the RGB-D images and finds feature points such as edges and corners (see the green crosses in figure 3). The 3D position of the robot (represented by the red triangle in figure 3) is finally computed by averaging the 3D positions of feature points.

Figure 3 - Robot recognition and tracking.
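
The last step, averaging the 3D positions of the tracked feature points, can be sketched as a plain centroid. The function name is hypothetical and the feature tracking itself is not shown.

```python
def centroid(points):
    """Average a non-empty list of (x, y, z) feature point positions."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```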

The human recognition and tracking (figure 4) is based on the HOG (Histogram of Oriented Gradient) algorithm [3]. The algorithm computes the 3D human position analyzing the gradient orientations of portions of RGB-D images and using them in a trained support vector machine (SVM). The human operator is framed in a yellow box after being detected, and his 3D center of mass is computed (see the red square in figure 4).

Figure 4 - Human recognition and tracking.

Three different safety strategies have been developed. The first strategy is based on the definition of suitable comfort zones of both the human operator and the robotic device. The second strategy implements virtual barriers separating the robot from the operator. The third strategy is based on the combined use of the comfort zones and of the virtual barriers.

In the first strategy, a sphere and a cylinder are defined around the robot and the human respectively, and the distance between them is computed. Three different situations may occur (figure 5):

  1. Safe situation (figure 5.a): the distance is higher than zero and the sphere and the cylinder are far from each other;
  2. Warning situation (figure 5.b): the distance decreases toward zero and sphere and cylinder are very close;
  3. Unsafe situation (figure 5.c): the distance is negative and sphere and cylinder collide.
Figure 5 - Monitored situations in the comfort zones strategy. Safe situation (a), warning situation (b), and unsafe situation (c).
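
The comfort zones check can be sketched as a signed distance between the robot sphere and the human cylinder, followed by a three-way classification. The sketch assumes a vertical cylinder, a hypothetical warning margin, and hypothetical function names; the text does not give the threshold at which a situation becomes a warning.

```python
import math

WARNING_MARGIN_M = 0.5  # hypothetical distance below which the situation is a warning


def cylinder_signed_distance(point, axis_xy, z_min, z_max, radius):
    """Signed distance from a point to a solid vertical cylinder (negative inside)."""
    radial = math.hypot(point[0] - axis_xy[0], point[1] - axis_xy[1]) - radius
    vertical = max(z_min - point[2], point[2] - z_max)
    outside = math.hypot(max(radial, 0.0), max(vertical, 0.0))
    inside = min(max(radial, vertical), 0.0)
    return outside + inside


def comfort_zone_state(sphere_center, sphere_radius, axis_xy, z_min, z_max,
                       cylinder_radius, warning_margin=WARNING_MARGIN_M):
    """Classify the situation from the sphere-to-cylinder distance."""
    distance = cylinder_signed_distance(
        sphere_center, axis_xy, z_min, z_max, cylinder_radius) - sphere_radius
    if distance < 0:
        return "unsafe"    # sphere and cylinder collide
    if distance < warning_margin:
        return "warning"   # sphere and cylinder are very close
    return "safe"
```

Subtracting the sphere radius from the distance between the sphere center and the cylinder gives a value that becomes negative exactly when the two volumes overlap, which matches the three situations listed above.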

In the second strategy, two virtual barriers are defined (Figure 6). The first barrier (displayed in green in figure 6) defines the limit between the safe zone (i.e. the zone where the human can move safely and the robot cannot hit him) and the warning zone (i.e. the zone where contact between human and robot can happen). The second barrier (displayed in red in figure 6) defines the limit between the warning zone and the error zone (i.e. the zone where the robot works and can easily hit the operator).

Figure 6 - Virtual barriers defined in the second strategy.
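
A simplified sketch of the barrier check is shown below. It assumes that both barriers are vertical planes perpendicular to a single axis, with the robot on the side of increasing coordinate, so that the zone depends on one coordinate of the operator position. This geometry and the function name are assumptions made for illustration.

```python
def barrier_zone(operator_x, green_x, red_x):
    """Classify the operator position along the axis pointing towards the robot.

    green_x is the safe/warning barrier and red_x the warning/error barrier,
    with green_x < red_x (assumed layout).
    """
    if operator_x < green_x:
        return "safe"
    if operator_x < red_x:
        return "warning"
    return "error"
```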

The third strategy is a combination of comfort zones and virtual barriers (figure 7). This strategy gives redundant information: both the human-robot distance and positions are considered.

Figure 7 - Redundant safety strategy: combination of comfort zones and virtual barriers.


The safety system shows good performance:
  • The robotic device is always recognized;
  • The human operator is recognized when he moves frontally with respect to the TOF cameras. The human recognition must be improved (for example by increasing the number of TOF cameras) for the case in which the human operator moves transversally with respect to the TOF cameras;
  • The safety situations are always identified correctly. The algorithm classifies the safety situations with an average delay of 0.86 ± 0.63 s (k=1). This could be improved by using real-time hardware.

Related Publications

Pasinetti, S.; Nuzzi, C.; Lancini, M.; Sansoni, G.; Docchio, F.; Fornaser, A. “Development and characterization of a Safety System for Robotic Cells based on Multiple Time of Flight (TOF) cameras and Point Cloud Analysis“, Workshop on Metrology for Industry 4.0 and IoT, pp. 1-6. 2018

Gesture control of robotic arm using the Kinect Module

The aim of this project is to create a remote control system for a robotic arm controlled by using the Kinect v2 sensor, to track the movements of the user arm, without any additional point of measurement (marker-less modality).

The Kinect camera acquires a 3D point cloud of the body and a skeleton representation of the gesture/pose is obtained using the SDK library software provided by the Kinect. The skeleton joints are tracked and used to estimate the angles.

Figure 1 - The point cloud acquired by the Kinect, and the skeleton. Points A, B, and C are the joints.

Point A is the wrist joint, point B is the elbow joint and point C is the shoulder joint. In three-dimensional space, vectors BA and BC are calculated using the coordinates of points A, B and C, which are taken from the skeleton. Angle α is calculated from the dot product of the two vectors.

The software has been developed in C# in Visual Studio 2015.

Figure 2 - Elbow angle geometry.
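
The angle computation described above follows from the dot product identity cos α = (BA · BC) / (|BA| |BC|). The sketch below is written in Python for illustration only and is not the project's C# code; the function name is hypothetical.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle at joint B, in degrees, between vectors BA and BC.

    a, b and c are (x, y, z) positions of the wrist, elbow and shoulder joints.
    """
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_alpha = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_alpha))
```

A fully extended arm gives an angle close to 180 degrees, while a right-angle bend of the elbow gives 90 degrees.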

Related Publications

Sarikaya, Y.; Bodini, I.; Pasinetti, S.; Lancini, M.; Docchio, F.; Sansoni, G. “Remote control system for 3D printed robotic arm based on Kinect camera“, Congresso Nazionale delle Misure Elettriche ed Elettroniche GMEE-GMMT. 2017

Optoranger: a 3D pattern matching method for bin picking applications

Optoranger is a new method, based on 3D vision, for the recognition of free-form objects in the presence of clutters and occlusions, ideal for robotic bin picking tasks. The method can be considered as a compromise between complexity and effectiveness.

A 3D point cloud representing the scene is generated by a triangulation-based scanning system, where a fast camera acquires a blade projected by a laser source. Image segmentation is based on 2D images, and on the estimation of the distances between point pairs, to search for empty areas. Object recognition is performed using commercial software libraries integrated with custom-developed segmentation algorithms, and a database of model clouds created by means of the same scanning system.

Experiments carried out to verify the performance of the method have been designed by randomly placing objects of different types in the Robot work area. The results demonstrate the excellent ability of the system to perform the bin picking procedure, and the reliability of the method proposed for automatic recognition of identity, position and orientation of the objects.

Related Publications

Sansoni, G.; Bellandi, P.; Leoni, F.; Docchio, F. “Optoranger: A 3D pattern matching method for bin picking applications“, Optics and Lasers in Engineering, Vol. 54, pp. 222-231. 2014

The Robo3DScan Project

Recently, the integration of vision with robots has gained considerable attention from industry. Pick and place, sorting, assembling, cutting and welding processes are examples of applications which can have great advantage from the combination of information from 3D images with robot motion.

Our laboratory developed, in collaboration with DENSO EUROPE B. V., a system integrating 3D vision into robotic cells. The project led to the Roboscan system.

Roboscan is a Robot cell that combines 2D and 3D vision in a simple device, to aid a Robot manipulator in pick-and-place operations in a fast and accurate way. The optical head of Roboscan combines the two vision systems: the camera is used “stand-alone” in the 2D system, and combined to a laser slit projector in the 3D system, which operates in the triangulation mode. The 2D system, using suitable libraries, provides the preliminary 2D information to the 3D system to perform in a very fast, flexible and robust way the point cloud segmentation and fitting. The most innovative part of the system is represented by the use of robust 2D geometric template matching as a means to classify 3D objects. In this way, we avoid time-consuming 3D point cloud segmentation and 3D object classification, using 3D data only for estimating pose and orientation of the robot gripper. In addition, a novel approach to the template definition in the 2D geometric template matching is proposed, where the influence of surface reflectance and colour of the objects over the definition of the template geometry is minimized.

Related Publications

Bellandi, P.; Docchio, F.; Sansoni, G. “Roboscan: a combined 2D and 3D vision system for improved speed and flexibility in pick-and-place operation“, The International Journal of Advanced Manufacturing Technology, Vol. 69, no. 5–8, pp. 1873–1886. 2013

The BARMAN project

The aim of this project is to add 2D vision to the BARMAN demonstrator shown in Fig. 1. The BARMAN is composed of two DENSO robots. In its basic release it picks up bottles, uncorks them and places them on the rotating table. It then rotates the table, so that people can pick them up and drink.

The tasks of the Barman are summarized here:

  1. to survey the foreground and check if empty glasses are present;
  2. to rotate the table and move glasses to the background;
  3. to monitor for a bottle on the conveyor, recognize it, pick it up, uncork it and fill the glasses;
  4. to rotate the table to move glasses to the foreground zone.

These simple operations require that suitable image processing is developed and validated. The software environment is the Halcon Library 9.0; the whole project is developed in VB2005. The robot platform is the ORiN 2 (from DENSO).

The work performed so far implements the basic functions described below.

Calibration of BOTH cameras and robot

The aim is to define an absolute reference system for space point coordinates, where the camera coordinates can be mapped to the robot coordinates and vice versa. To perform this task, a predefined master is acquired under different perspectives (see Fig. 1 and Fig. 2).

The acquired images are elaborated by the Halcon calibration functions, and both extrinsic and intrinsic camera parameters are estimated. In parallel, special procedures for robot calibration have been reproduced and combined with the parameters estimated for the cameras (Fig 3).

Fig. 1 - Acquisition of the master in correspondence with the background zone (left); corresponding image in the Halcon environment (right).
Fig. 2 - Images of the master before (left) and after (right) the elaboration.
Fig. 3 - Calibration process of the robot.

Detection of empty glasses in the foreground

Flexible detection has been implemented to monitor the number of empty glasses. In addition, some special cases have been taken into account. Some examples are shown in the following images (Fig. 4 and Fig. 5).

Fig. 4 - Detection of a single glass (left); detection of three glasses (right).
Fig. 5 - Detection of four glasses very close to each other (left); detection in the presence of glasses turned upside down.

Detection of glasses in the background

The positions of the glasses in the background are calculated very precisely, since the camera is calibrated. In addition, it is possible to recognize semi-empty glasses and glasses turned upside down. This detection is mandatory to guarantee that the filling operation is performed correctly. Fig. 6 shows some significant examples.

Fig. 6 - Detection of the position of the glasses in the background. The system detects the presence of semi-empty glasses (center image) and turned glasses and does not mark them as available for subsequent operations.

Bottle detection

Three different types of bottles are recognized as “good” bottles by the system. The system is also able to detect any other object that does not match these cases. It can determine whether the “unknown” object can be picked up and disposed of by the Barman, or whether it must be removed manually.

Fig. 8 - Detection of unmatched objects.

Filling of the glasses

The robot is moved to the positions detected by the camera (see Fig. 6), and fills the glasses. Since both the cameras and the robots share the same reference system, this operation is “safe”: the robot knows where the glasses are.

Fig. 9 - Filling operation.

RAI Optolab Interview TG Neapolis 12/05/2010