Deep Learning in the kitchen: development and validation of an action recognition system based on RGB-D sensors

This thesis work is part of a research project between the Laboratory and the Cast Alimenti cooking school. Cast Alimenti aims to obtain a product that improves teaching in its classrooms. The idea is to develop a system (hardware and software) that simplifies the process of writing recipes while the teacher gives a lecture in the kitchen. The project aims to translate into written language a recipe performed during the demonstration lessons, making it possible to transcribe in Italian also the lessons given by foreign teachers. This goal has been achieved through the use of an action recognition system.

Action recognition aims to interpret human actions through mathematical algorithms and is based on identifying and tracking the position of the human body over time. The topic is actively studied in various fields: by using this technique, devices such as smart bands and cell phones can recognize whether a person is stationary, walking or running. Another example is security cameras and surveillance systems, in which the recognition of actions that may be considered violent or dangerous allows the authorities to intervene quickly when necessary.

The project is divided into several phases. First, the recognition of the cook’s activity during the practical demonstration must be implemented; later, based on the recognized actions, the writing of the recipe must be automated. This thesis work is focused on the first part of the project, in particular on the choice of the mathematical algorithm needed for the recognition of actions.


To fully understand the difficulties in applying the system under development, it is necessary to contextualize the environment in which it will work. The scenario is that of an open kitchen in which the cook works behind a counter: for most of the time he remains in the same position and only his upper body is visible (Fig. 1). The cook interacts with the various working tools and with the food only through the upper limbs, and there is no interaction with other subjects.

The kitchen changes as the recipe progresses: work tools such as stoves, mixers, pots, pans and other utensils are added and removed as needed. The predominant colors of the scene are white (the cook’s uniform, the wall behind him, the cutting board) and steel (the worktop and equipment), which results in uneven lighting and in the creation of reflections and shadows. The presence of machinery and heat sources generates both acoustic and visual noise, especially in the infrared spectrum. The working environment also presents other critical issues, such as the presence of water, steam, temperature changes, substances of various kinds such as oil and acids, as well as chemicals used for cleaning.

To obtain reliable data in this context it is necessary to use appropriate instrumentation, with specific technical characteristics.

Fig. 1 - Example of the "smart" kitchen used in this work at Cast Alimenti. The predominant colors are white and gray and the overall illumination is low. Due to the characteristics of the scenario, the instrumentation must be carefully selected.

Technology of choice for action recognition

Tracking is the first step in action recognition. The technology used for this project was carefully selected by evaluating the systems used commercially to track full-body movements, such as accelerometers and gyroscopes, RGB cameras and depth sensors (3D cameras). 

Wearable solutions involve the use of accelerometers coupled with gyroscopes, a technique adopted in almost all commercially available smart bands for sports performance assessment and sleep monitoring. However, using such wearable devices in this project would mean equipping the cook with sensors applied to the hands or wrists, and the data obtained from them would have to be synchronized with each other. Considering the scenario in which the cook operates, the sensors would also need to be waterproof and resistant to aggressive substances.

Optical instrumentation, such as RGB or 3D cameras positioned externally to the scene, makes it possible to keep the image sensors away from the chef’s workspace, thus avoiding exposure to the critical environmental conditions of the kitchen. A disadvantage of this solution is the huge amount of images produced and the related processing load. However, the computational power of modern computer systems allows this approach to be applied in real time.

Given the environmental conditions and the need to make the acquisition system easily accessible to non-expert users, we opted for the second solution, choosing to place the cameras in front of the cook performing the recipe, thus replicating the students' view during the lectures.

Fig. 2 - Example taken from the experimental set-up. The two Kinect v2 cameras have been positioned in front of the kitchen counter, replicating the students' view.


An action is any movement made by the operator in which tools are used to obtain a certain result. Characteristics of an action are (i) the speed of execution and (ii) the space in which it takes place. Based on these variables, it is important to select an appropriate frame rate and sensor resolution.

Recognizing actions through a mathematical algorithm that analyzes images is not a computationally simple task, because the computational load grows proportionally to the number of frames per second and to the image resolution. It is therefore crucial to find the optimal configuration that maintains an image quality good enough to recognize the action while remaining within the capabilities of a consumer processing system.

Modern consumer computing systems (PCs) currently provide sufficient computational power to perform the necessary calculations by harnessing the power of parallel computing in GPUs (graphics processing units).
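To give an idea of the data volumes involved, the back-of-the-envelope sketch below (using the nominal Kinect v2 stream formats) estimates the raw bandwidth produced by a single sensor before any processing:

```python
# Rough estimate of the raw data rate produced by one Kinect v2.
# Nominal streams: color 1920x1080 @ 30 fps (4 bytes/pixel, BGRA),
# depth 512x424 @ 30 fps (2 bytes/pixel, 16-bit depth map).

def stream_rate_mb_s(width, height, bytes_per_pixel, fps):
    """Raw bandwidth of one stream in megabytes per second."""
    return width * height * bytes_per_pixel * fps / 1e6

color = stream_rate_mb_s(1920, 1080, 4, 30)   # ~249 MB/s
depth = stream_rate_mb_s(512, 424, 2, 30)     # ~13 MB/s
print(f"color: {color:.0f} MB/s, depth: {depth:.0f} MB/s, total: {color + depth:.0f} MB/s")
```

Two cameras streaming simultaneously therefore produce on the order of half a gigabyte of raw pixels per second, which is why frame rate and resolution have to be traded off against the available processing power.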


There are several algorithms available to analyze actions in real time given a temporally consistent set of image frames. They may be subdivided into two main categories:

  • Algorithms that analyze 3D data, such as the images generated by depth cameras. This type of data removes all issues related to the color composition of the scene and to subjects that may be blurred or that change appearance during the execution (Fig. 3 a); 
  • Algorithms that process skeleton data, in which an artificial skeleton composed of keypoints corresponding to the fundamental joints of the body is computed. The keypoints represent the (x, y, z) positions of the body’s joints in the camera reference system (Fig. 3 b). 

Moreover, by combining the two categories it is possible to obtain hybrid algorithms that analyze both types of data.
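To make the second data type concrete, the sketch below shows a plausible in-memory layout for skeleton data (an illustration only, not the exact format expected by the selected algorithms): each frame is an array of 3D joint positions in the camera reference system, and a clip is a stack of such frames that can be fed to a classifier. The 25-joint count matches the skeleton tracked by the Kinect v2 SDK.

```python
import numpy as np

N_JOINTS = 25        # the Kinect v2 SDK tracks 25 body joints
FRAME_RATE = 30      # Kinect v2 nominal frame rate (frames per second)

# One skeleton frame: (x, y, z) camera-space coordinates, in meters, for each joint.
frame = np.zeros((N_JOINTS, 3), dtype=np.float32)

# A clip of T frames becomes a (T, N_JOINTS, 3) tensor, the typical input
# shape for skeleton-based action classifiers.
T = 2 * FRAME_RATE                                   # e.g. a two-second action
clip = np.zeros((T, N_JOINTS, 3), dtype=np.float32)

# Many recurrent models consume a flattened per-frame feature vector instead.
features = clip.reshape(T, N_JOINTS * 3)
print(features.shape)    # (60, 75)
```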

Among the broad set of available algorithms, two have been selected for this work:

  1. HPM+TM: a supervised classification algorithm developed in MATLAB by the University of Western Australia. It was created specifically for action recognition and achieved the best performance on the 3D Action Pairs dataset, reaching an accuracy of 98%.
  2. indRNN: this model was developed as part of a collaboration between Australia’s University of Wollongong and the University of Electronic Science and Technology of China. Although the algorithm was not specifically designed for action recognition, it is applicable wherever features have to be recognized over time. It is a supervised classification algorithm and obtained an accuracy of 88% on the NTU RGB+D dataset (its core recurrence is sketched below).
Fig. 3 - Example of data that may be processed by deep learning models. (a) Image frame in RGB on which skeletal data is drawn; (b) depth frame taken from Kinect v2 cameras.
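Regarding point 2 above: what distinguishes indRNN from a standard recurrent network is that each neuron keeps its own independent recurrent weight, so the recurrence is an element-wise product rather than a full matrix multiplication, h_t = ReLU(W x_t + u * h_{t-1} + b), where * denotes the element-wise product. The NumPy sketch below reproduces only this recurrence as an illustration; the full model stacks several such layers and is trained as a supervised classifier.

```python
import numpy as np

def indrnn_layer(x, W, u, b):
    """Run one IndRNN layer over a sequence.

    x: (T, D_in) input sequence, W: (D_hid, D_in) input weights,
    u: (D_hid,) per-neuron recurrent weights (element-wise), b: (D_hid,) bias.
    Returns the (T, D_hid) hidden-state sequence.
    """
    T, D_hid = x.shape[0], u.shape[0]
    h = np.zeros((T, D_hid), dtype=np.float32)
    h_prev = np.zeros(D_hid, dtype=np.float32)
    for t in range(T):
        h_prev = np.maximum(0.0, W @ x[t] + u * h_prev + b)   # ReLU activation
        h[t] = h_prev
    return h

# Example: 60 skeleton frames of 75 values each (25 joints x 3 coordinates).
rng = np.random.default_rng(0)
x = rng.standard_normal((60, 75)).astype(np.float32)
W = (0.1 * rng.standard_normal((128, 75))).astype(np.float32)
u = rng.uniform(0.0, 1.0, 128).astype(np.float32)
b = np.zeros(128, dtype=np.float32)
print(indrnn_layer(x, W, u, b).shape)   # (60, 128)
```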


The experimental campaign took place over two days at the Cast Alimenti cooking school. Two Kinect v2 cameras recorded Nicola Michieletto, the chef who has worked with our Laboratory since the beginning of the project, while he cooked lasagne. The entire preparation was repeated and filmed twice, with the aim of obtaining a larger and more representative dataset.

A first analysis made it evident that some actions were repeated much more often than others. This strong difference in the number of available samples forced us to perform some preliminary analyses in order to understand how the presence of categories with a small number of samples influences the accuracy of the adopted algorithm. 

Therefore, we selected only a sub-sample of the actions present in the dataset, namely:

  1. stirring: using a ladle the cook mixes the ingredients inside a pot or a bowl with circular movements;
  2. pouring: the cook takes an ingredient from one container and pours it inside another container;
  3. rolling out the pasta: a process in which pasta is made flat and thin using an inclined plane dough sheeter. The cook loads a thick sheet from the top of the machine and pulls out a thinner sheet from the bottom;
  4. cutting: the cook cuts a dish by means of a kitchen knife; the dish is held steady with the left hand and the knife is used with the right hand;
  5. placing the pasta: the cook takes the pasta sheet from the cloths on which it was left to rest and lays it inside the pan where he is composing the lasagna;
  6. spreading: process by which the béchamel and Bolognese sauce are distributed in an even layer during the composition of the lasagna;
  7. sprinkling: the cook takes the grated Parmesan cheese and distributes it to form an even layer;
  8. blanching: a process in which the cook takes a freshly rolled-out pasta sheet from the counter and plunges it into a pot of salted water for a brief cooking time;
  9. straining the pasta: with the use of a perforated ladle the cook removes the pasta from the pot in which it was cooking and deposits it in a pan with water and ice;
  10. draining the pasta on cloth: the cook removes the pasta by hand from the pan of water and ice and lays it on a cloth to allow it to dry;
  11. folding the pasta: during the rolling-out process it is sometimes necessary to fold the pasta on itself in order to run it through the sheeter again and obtain a more uniform pasta layer;
  12. turn on/off induction plate: the cook turns on or off a portable induction plate located on the work counter;
  13. catching: simple process where the cook grabs an object and moves it closer to the work point;
  14. moving pot: the cook moves the pot within the workspace; in most cases this involves moving it to or from the induction plate.
Fig. 4 - Examples of the 14 actions selected for the work. (a) stirring, (b) pouring; (c) rolling out the pasta; (d) cutting; (e) placing the pasta; (f) spreading; (g) sprinkling; (h) blanching; (i) straining the pasta; (l) draining the pasta on cloth; (m) folding the pasta; (n) turn on/off induction plate; (o) catching; (p) moving pot.


We performed a detailed analysis to determine the performance of the two algorithms when (i) the number of classes is reduced and (ii) the number of samples per class is increased. Although it is theoretically known that deep learning algorithms improve as more data are used, and that a high number of mutually similar classes reduces the overall inference accuracy, with this test we wanted to quantify these phenomena.

In summary, the results for each algorithm are:

  • HPM+TM: this algorithm performs better when fewer classes are adopted and a high number of samples per class is used. Highest accuracy achieved: 54%.
  • indRNN: this model performs better than the other and is more robust even when fewer samples per class are used. Moreover, no significant improvement is observed when the number of classes is reduced by more than half. Highest accuracy achieved: 85%.
Moreover, by observing the resulting confusion matrices it is possible to note that the “stirring” and “pouring” classes are the most critical. In fact, the highest number of false positives is obtained for the “stirring” class, while the highest number of false negatives is observed for the “pouring” class, and the two classes are often misclassified as one another. This highlighted the fact that during the cooking procedure the chef often poured an ingredient while stirring the pot with the other hand, so the two actions overlap more often than not. It would therefore be best to merge the two classes into one to account for this eventuality.
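To make this kind of check concrete, the sketch below (with made-up label vectors, not the actual experimental data) shows how a confusion matrix exposes the stirring/pouring overlap and how much accuracy is recovered by merging the two labels before evaluation:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

classes = ["stirring", "pouring", "cutting"]   # reduced set, for illustration only
y_true = np.array(["stirring", "pouring", "pouring", "cutting", "stirring", "pouring"])
y_pred = np.array(["stirring", "stirring", "pouring", "cutting", "stirring", "stirring"])

print(confusion_matrix(y_true, y_pred, labels=classes))
print("accuracy:", accuracy_score(y_true, y_pred))              # 0.67 on this toy example

# Merge the two overlapping actions into a single class and re-evaluate.
merge = {"stirring": "stirring/pouring", "pouring": "stirring/pouring"}
y_true_m = np.array([merge.get(c, c) for c in y_true])
y_pred_m = np.array([merge.get(c, c) for c in y_pred])
print("accuracy after merging:", accuracy_score(y_true_m, y_pred_m))   # 1.0 here
```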

Performance Benchmark between an Embedded GPU and an FPGA

Nowadays it is impossible to think of a device without considering it “intelligent” to some extent. If ten years ago “intelligent” systems were carefully designed and could only be used in specific fields (e.g. industry, defense or research), today smart sensors as big as a button are everywhere, from cellphones to refrigerators, from vacuum cleaners to industrial machines.
If you think that’s it, you’re wrong: the future is tiny.

Embedded systems are increasingly needed in industry as well, because they are capable of performing complex computations on board, sometimes comparable to those carried out by standard PCs. Small, portable, flexible and smart: it is not hard to understand why they’re used more and more!

A plethora of embedded systems is available on the market, depending on the needs of the client. One important characteristic to check is the capability of the chosen embedded platform to be “smart”, which nowadays means being able to run a Deep Learning model with the same performance obtained when running it on a PC. The reason is that Deep Learning models require a lot of resources to perform well, and running them on CPUs usually means losing accuracy and speed compared to running them on GPUs.

To solve this issue, some companies started to produce embedded platforms with GPUs on board. While the architecture of these systems is still different from that of PCs, they are a considerable improvement on the matter. Another type of embedded system is the FPGA: these platforms led the market for a while before embedded GPUs became common; they are programmed at a low level and, because of that, are usually high performing.

In this thesis work, conducted in collaboration with Tattile, we performed a benchmark between the Nvidia Jetson TX2 embedded platform and the Xilinx FPGA Evaluation Board ZCU104.

STEP 1: Determine the BASELINE model

To perform our benchmark we selected an example model. We kept it simple by choosing the well-known VGG Net model, which we trained on a host machine equipped with a GPU on the standard CIFAR-10 dataset. This dataset is composed of 10 classes of different objects (dogs, cats, airplanes, cars…) with a standard image size of 32×32 px.

The network was trained in Caffe for 40,000 iterations, reaching an average accuracy of 86.1%. Note that the model obtained after this step is represented in 32-bit floating point (FP32), which allows a refined and accurate representation of weights and activations.

Fig. 3 - CIFAR-10 dataset examples for each class.
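For reference, once training is finished the FP32 baseline can be queried from Python through the pycaffe interface, roughly as sketched below (the file names and the "data"/"prob" blob names are assumptions that depend on the actual prototxt; mean subtraction is omitted for brevity):

```python
import numpy as np
import caffe

caffe.set_mode_gpu()   # the FP32 baseline runs on the host GPU

# Hypothetical file names for the deploy definition and the trained weights.
net = caffe.Net("vgg_cifar10_deploy.prototxt",
                "vgg_cifar10_iter_40000.caffemodel",
                caffe.TEST)

def classify(image_bgr):
    """Classify one 32x32 BGR image and return the predicted class index."""
    blob = image_bgr.astype(np.float32).transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> NCHW
    net.blobs["data"].reshape(*blob.shape)
    net.blobs["data"].data[...] = blob
    out = net.forward()
    return int(out["prob"][0].argmax())

print(classify(np.zeros((32, 32, 3), dtype=np.uint8)))
```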

STEP 2: Performance on the Nvidia Jetson TX2

This board is equipped with a GPU, thus allowing a floating-point representation. Even though FP32 is supported, we chose to perform a quantization procedure to reduce the representation complexity of the trained model, moving to FP16. This choice was driven by the fact that the performance obtained with this representation is reported as the best one in the literature.

We used TensorRT, which is natively installed on the board, to perform the quantization procedure. After this process the model obtained an average accuracy of 85.8%.
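As an indication of what this step looks like in code, the sketch below uses the legacy TensorRT Python API (the Caffe parser and the fp16_mode flag available in the TensorRT versions shipped with JetPack for the TX2); the file and blob names are assumptions:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.CaffeParser()

# Parse the trained FP32 Caffe model into a TensorRT network definition.
tensors = parser.parse("vgg_cifar10_deploy.prototxt",
                       "vgg_cifar10_iter_40000.caffemodel",
                       network, trt.float32)
network.mark_output(tensors.find("prob"))      # assumed output blob name

builder.max_batch_size = 1
builder.max_workspace_size = 1 << 28           # 256 MB of scratch space
builder.fp16_mode = True                       # build FP16 kernels where supported

engine = builder.build_cuda_engine(network)
with open("vgg_cifar10_fp16.engine", "wb") as f:
    f.write(engine.serialize())
```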

STEP 3: Performance on the Xilinx Zynq ZCU104 FPGA

The FPGA does not support floating-point representations. The toolbox used to modify the original model and adapt it to the board is a proprietary one, called DNNDK.

Two configurations were tested. In the first, the original model was quantized from FP32 to INT8, thus giving up the floating-point representation and drastically reducing the size of the network to a few MB. The average accuracy obtained in this case is 86.6%, slightly better than the baseline, probably because of the large representation gap, which in some cases turned predictions that were borderline between correct and incorrect into correct ones.
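The DNNDK quantizer is proprietary, but the arithmetic behind an FP32 → INT8 step can be sketched independently. The snippet below shows the general idea of symmetric post-training quantization on a toy weight tensor; it is not the DNNDK implementation:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map FP32 values onto the INT8 range [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the INT8 values back to FP32 to compare them with the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = (0.05 * rng.standard_normal(10_000)).astype(np.float32)   # toy weight tensor

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("bytes: FP32 =", w.nbytes, " INT8 =", q.nbytes)   # 4x smaller
print("max absolute error:", np.abs(w - w_hat).max())
```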

The second configuration applies a pruning process after the quantization procedure, thus removing redundant parts of the network. In this case the average accuracy reached is 84.2%, a slight drop that is expected when the two processes are combined.

STEP 4: Performance benchmark of the boards

Finally, we compared the performance obtained by the two boards when running inference in real time. The results can be found in the presentation below: if you’re interested, feel free to download it!

The thesis document is also available on request.
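For context, the comparison itself boils down to measuring the average per-frame latency and throughput on each board. A generic timing sketch is shown below, with run_inference standing in for the TensorRT or DNNDK call (a dummy function is used here as a placeholder):

```python
import time
import numpy as np

def benchmark(run_inference, input_batch, warmup=10, runs=100):
    """Return the average latency (ms) and throughput (FPS) of an inference callable."""
    for _ in range(warmup):                    # let clocks, caches and allocators settle
        run_inference(input_batch)
    start = time.perf_counter()
    for _ in range(runs):
        run_inference(input_batch)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / runs
    fps = runs * input_batch.shape[0] / elapsed
    return latency_ms, fps

# Example with a dummy "model" and a single CIFAR-10-sized input.
dummy = lambda x: np.tanh(x).sum()
latency, fps = benchmark(dummy, np.zeros((1, 3, 32, 32), dtype=np.float32))
print(f"{latency:.3f} ms/frame, {fps:.1f} FPS")
```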

Real-Time robot command system based on hand gesture recognition

With the Industry 4.0 paradigm, the industrial world has faced a technological revolution. Manufacturing environments in particular are required to be smart and to integrate automatic processes and robots into the production plant. To achieve this smart manufacturing it is necessary to re-think the production process in order to create a true collaboration between human operators and robots. Robotic cells usually have safety cages to protect the operators from any harm that direct contact can produce, thus limiting the interaction between the two. Only collaborative robots, thanks to their specific design, can really collaborate in the same workspace as humans without risks. They pose another problem, though: in order not to endanger human safety, they must operate at low velocities and forces, hence their operations are slow, roughly comparable to those a human operator performs. In practice, collaborative robots hardly have a place in a real industrial environment with high production rates.

In this context, this thesis work presents an innovative command system to be used in a collaborative workstation, allowing humans to work alongside robots in a more natural and straightforward way and reducing the time needed to command the robot on the fly. Recent Computer Vision, Image Processing and Deep Learning techniques are used to create the intelligence behind the system, which is in charge of correctly recognizing the gestures performed by the operator in real time.

Step 1: Creation of the gesture recognition system

A number of suitable algorithms and models are available in the literature for this purpose. An Object Detector in particular has been chosen for the job, the “Faster Region-based Convolutional Neural Network“, or Faster R-CNN, implemented in MATLAB.

Object Detectors are especially suited to the task of gesture recognition because they are capable of (i) locating the objects in the image and (ii) classifying them, thus recognizing which objects they are. Figure 1 shows this concept: the object “number three” is shown in the figure, and the algorithm has to find it. 

Fig. 1 - The general process carried out by Object Detectors. Two networks process the image in different steps: first the region proposals are extracted, i.e. the candidate positions of the objects of interest. Then the proposals are evaluated by the classification network, which finally outputs both the position of the object (the bounding box) and the name of the object class.

The gestures were carefully selected and purposely acquired by means of different mobile phones, and a preliminary study was carried out to understand whether the model was able to differentiate between the left and right hand and, at the same time, between the palm and the back of the hand. The final gestures proposed and their meaning in the control system are shown in Fig. 2.

Fig. 2 - Definitive gesture commands used in the command system.

Step 2: Creation of the command system

The proposed command system is structured as in Fig. 3: the images are acquired in real time by a Kinect v2 camera connected to the master PC and processed in MATLAB in order to obtain the gesture commands frame by frame. The commands are then sent to the ROS node in charge of translating the numerical command into an operation for the robot. It is the ROS node, by means of a driver purposely developed for the robot used, that sends the movement positions to the robot controller. Finally, the robot receives the ROS packets of the desired trajectory and executes the movements. Fig. 4 shows how the data are sent to the robot.

Fig. 3 - Overview of the complete system, composed of the acquisition system, the elaboration system and the actuator system.
Fig. 4 - The data are sent to the "PUB_Joint" ROS topic, elaborated by the Robox Driver which uses ROS Industrial and finally sent to the controller to move the robot.
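The gesture commands produced frame by frame are sent to a ROS node; a minimal rospy sketch of this publishing step is shown below (the /gesture_cmd topic name, the Int32 message type and the gesture-to-number mapping are assumptions for illustration, not the actual interface of the system):

```python
#!/usr/bin/env python
import rospy
from std_msgs.msg import Int32

GESTURES = {"define_point": 1, "start_loop": 2, "stop": 3}   # hypothetical mapping

def send_command(pub, gesture_name):
    """Translate a recognized gesture into its numerical command and publish it."""
    pub.publish(Int32(data=GESTURES[gesture_name]))

if __name__ == "__main__":
    rospy.init_node("gesture_command_publisher")
    pub = rospy.Publisher("/gesture_cmd", Int32, queue_size=10)
    rate = rospy.Rate(30)                  # one command per acquired frame
    while not rospy.is_shutdown():
        send_command(pub, "stop")          # placeholder: use the detector output here
        rate.sleep()
```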

Four modalities have been developed for the interface, by means of a State Machine developed in MATLAB (a simplified sketch of this logic follows the list):

  1. Points definition state
  2. Collaborative operation state
  3. Loop operation state
  4. Jog state
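A simplified sketch of how such a modal interface can be structured is shown below (the actual State Machine is implemented in MATLAB; the gestures and transition rules here are assumptions for illustration only):

```python
from enum import Enum, auto

class Mode(Enum):
    POINTS_DEFINITION = auto()
    COLLABORATIVE = auto()
    LOOP = auto()
    JOG = auto()

# Hypothetical gesture-to-transition table: (current mode, gesture) -> next mode.
TRANSITIONS = {
    (Mode.POINTS_DEFINITION, "confirm"): Mode.COLLABORATIVE,
    (Mode.COLLABORATIVE, "loop"): Mode.LOOP,
    (Mode.LOOP, "stop"): Mode.COLLABORATIVE,
    (Mode.COLLABORATIVE, "jog"): Mode.JOG,
    (Mode.JOG, "stop"): Mode.COLLABORATIVE,
}

def step(mode, gesture):
    """Return the next mode; unrecognized gestures leave the state unchanged."""
    return TRANSITIONS.get((mode, gesture), mode)

mode = Mode.POINTS_DEFINITION
for g in ["confirm", "loop", "stop", "jog", "stop"]:
    mode = step(mode, g)
    print(g, "->", mode.name)
```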
Below you can see the initialization of the system, which is needed to correctly handle the lighting conditions of the working area and to identify the areas where the hands will most likely be found, according to the barycenter calibration performed by the initialization procedure. 
If you are interested in the project, download the presentation by clicking the button below. The thesis document is also available on request.

Related Publications

Nuzzi, C.; Pasinetti, S.; Lancini, M.; Docchio, M.; Sansoni, G. “Deep Learning based Machine Vision: first steps towards a hand gesture recognition set up for Collaborative Robots“, Workshop on Metrology for Industry 4.0 and IoT, pp. 28-33. 2018

Nuzzi, C.; Pasinetti, S.; Lancini, M.; Docchio, M.; Sansoni, G. “Deep learning-based hand gesture recognition for collaborative robots“, IEEE Instrumentation & Measurement Magazine 22 (2), pp. 44-51. 2019