Keyword spotting on an embedded device: a demo

21 Aug 2023 Dmytro Liukevych

In this article, I will share my experience setting up a software demo project for a device that can recognize and act on certain voice commands. The solution will be used to evaluate the feasibility of using a machine learning (ML) approach to detect keywords in a constrained environment.

Our demo is quite simple conceptually: a hardware device with connected LEDs and a microphone. The device listens for voice commands and recognizes three color keywords: GREEN, RED, and WHITE. For each recognized word, an LED of the corresponding color is turned on.

Before we dive into the technical side, watch the video presenting the basics of our application:

 

Terms and abbreviations

First of all, let me share the terminology and abbreviations used in this article:

ML – machine learning.
NN – neural network.
MCU – microcontroller unit.
SoC – System on Chip. A type of integrated circuit that usually combines an MCU with analog, digital, or other circuitry not typical for a general-purpose MCU. For instance, an MCU that incorporates a radio front-end is an SoC.
HW – hardware.
FW – firmware.
DSP – digital signal processing.
POC – proof of concept.
SNR – signal-to-noise ratio.

Hardware platform description

For the demo setup I chose hardware that consists of three main elements: a custom electret microphone shield, a smart-LED board, and an nRF52 development kit equipped with the nRF52832 SoC (Cortex-M4F, 512 kB flash, 64 kB RAM). This device does not have strong computing capabilities, so running an NN on such a platform can be challenging.

 

The microphone shield is equipped with a transistor preamplifier; the schematic is shown below:

The shield is powered from an external Li-Ion battery to reduce the noise level. The smart-LED board has four WS2812B LEDs.

Machine learning framework description

Creating ML algorithms from scratch with native TensorFlow tools is a complex and time-consuming task. To make it easier, Edge Impulse released a platform that allows embedded developers to take advantage of ML without experience in data science, TensorFlow, or Python. Although it has certain flaws and can hardly be considered a professional tool, or has to be used with some limitations, it is a good entry point into machine learning. The platform provides convenient and easy-to-use tools that cover all stages of model creation, from data collection to deploying the model on the customer's device. It also offers predefined model creation patterns for the most common use cases such as motion and gesture recognition, keyword spotting, and so on. You can learn more about Edge Impulse in its official documentation.

The workflow for our solution contains the following key steps:
Data Collection -> Model Creation -> Model Integration -> Testing

Let’s dive in and look at each step in detail.

Data collection

ML model creation starts with collecting the data sets used for model training and verification. We are going to detect three keywords: GREEN, RED, and WHITE.

To clearly distinguish the keywords from other sounds and ambient noise, the ML algorithm requires two additional classes – unknown words and silence – so we need to collect five sets of audio data.
Each word has to be recorded numerous times and the audio data loaded into Edge Impulse Studio for processing. In our case it is important to use the same HW set for data collection and for running the generated model, so we use the same microphone and devkit.

Data collection FW

This firmware is straightforward; it does only two things (a minimal sketch follows the list):

  • Samples analog data from the microphone at 8 kHz with 12-bit resolution.
  • Sends the data over UART at 1 Mbps.
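
Below is a minimal sketch of such a loop. It is not the actual project firmware: adc_read_sample() and uart_write() are hypothetical helpers standing in for the nRF52 SAADC and UART drivers, and the 16-sample chunk size matches the framing described in the next section.

    // Sketch of the data collection firmware loop (illustrative, not the project code).
    // adc_read_sample() and uart_write() are hypothetical helpers standing in for the
    // nRF52 SAADC and UART drivers.
    #include <cstdint>
    #include <cstddef>

    extern int16_t adc_read_sample();                       // one 12-bit sample, paced at 8 kHz
    extern void    uart_write(const uint8_t *p, size_t n);  // blocking UART transfer at 1 Mbps

    static constexpr size_t kChunkSamples = 16;              // 16 samples = 2 ms of audio at 8 kHz

    void data_collection_loop()
    {
        int16_t chunk[kChunkSamples];
        for (;;) {
            for (size_t i = 0; i < kChunkSamples; ++i) {
                chunk[i] = adc_read_sample();                // blocks until the next 125 us tick
            }
            uart_write(reinterpret_cast<const uint8_t *>(chunk), sizeof(chunk));
        }
    }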

At this point we have a digital audio stream arriving on the PC side. The next step is to upload it to the Edge Impulse server, so the PC acts as a bridge that routes data from the device to the place where it will be processed. We explored two ways to do this:

Data forwarder command line utility (the wrong way)

The Data forwarder is a utility written in JavaScript that listens on a COM port and forwards the received data to Edge Impulse. It can automatically detect the sample rate of the input stream; however, in our case the detection was not accurate. To optimize UART usage, the FW sends data in small chunks of 16 samples every 2 ms (16 × 125 µs) rather than a single sample every 125 µs. Oscilloscope measurements confirmed those timings:

The Data forwarder always reported a sample rate much lower than the actual 8000 Hz, which indicates data loss and inconsistency:

 

Based on those findings, it became obvious that this was not the way to go, so we considered another approach: custom data collection for training the model.

Custom data collection

Edge Impulse supports manual upload of WAV files, so we need to create them ourselves. That was not too hard: I just recorded myself saying the three words (GREEN, WHITE, RED). I used the following process for collecting the audio data:

A Python script listens on the COM port, records data to a buffer, and saves it to a WAV file. The audio sample length is configurable. To synchronize the stream, every 100 ms the FW inserts a special byte sequence, a so-called ‘anchor’, to let the script know where an audio sample starts; the firmware side of this is sketched below.
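
The sketch below extends the firmware loop shown earlier with anchor insertion. The anchor byte sequence itself is an assumption made for illustration; only the 100 ms cadence (800 samples at 8 kHz, i.e. 50 of the 16-sample chunks) comes from the description above.

    // Sketch of anchor insertion on the firmware side (illustrative only).
    // The anchor value and the adc_read_sample()/uart_write() helpers are assumptions.
    #include <cstdint>
    #include <cstddef>

    extern int16_t adc_read_sample();
    extern void    uart_write(const uint8_t *p, size_t n);

    static constexpr uint8_t kAnchor[4]      = {0xA5, 0x5A, 0xA5, 0x5A}; // assumed marker bytes
    static constexpr size_t  kChunkSamples   = 16;                       // 2 ms of audio at 8 kHz
    static constexpr size_t  kChunksPer100ms = 50;                       // 50 * 2 ms = 100 ms

    void stream_with_anchors()
    {
        int16_t chunk[kChunkSamples];
        size_t  chunk_count = 0;
        for (;;) {
            if (chunk_count == 0) {
                uart_write(kAnchor, sizeof(kAnchor));        // marks the start of a 100 ms block
            }
            for (size_t i = 0; i < kChunkSamples; ++i) {
                chunk[i] = adc_read_sample();
            }
            uart_write(reinterpret_cast<const uint8_t *>(chunk), sizeof(chunk));
            chunk_count = (chunk_count + 1) % kChunksPer100ms;
        }
    }

On the PC side, the script simply scans the incoming byte stream for this sequence and aligns the recording buffer at that point.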

 

By default, the script records 25 seconds of audio, which is enough to speak 15-20 short words. Later, Edge Impulse automatically splits the file into short single-word samples.

Once the data is collected and sorted into groups, it can be uploaded to Edge Impulse for further processing:

 

Experience shows that manual collection and uploading of data is more accurate than the Data forwarder utility by Edge Impulse, although it is more involved.

Model creation and training

Once the data is collected and organized by labels into five equal-size groups, it is split into training and test sets to start creating a model.

According to the official tutorial, the process of model creation is simple and straightforward, but it has numerous parameters for fine-tuning.

 

Unlike the default value in the Time Series Data block, the window size was set to 500 ms: the selected keywords are short enough to fit into a 500 ms window, and at an 8 kHz sample rate that is only 4000 raw samples. The shorter the window, the less RAM is needed to handle the data. The rest of the processing blocks in the picture above are left as is.

The next step is setting up the DSP parameters, which are crucial for the future model’s performance. All parameters except two are left at their defaults: the frequency range was narrowed to the human voice bandwidth, which filters out frequencies we are not interested in and reduces the RAM footprint.

 

The DSP stage converts raw audio data into a mel-scaled spectrogram – a format suitable for extracting features the NN can recognize. The spectrogram is fed to the NN input as a one-dimensional array; the sketch below illustrates how its size is derived.
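
As a rough illustration of how the NN input size is determined (the actual frame length, stride, and filter count are whatever is configured in the DSP block; the values below are assumptions):

    // Illustrative sizing only: kFrameMs, kStrideMs and kNumFilters are assumed values,
    // not the exact settings used in the Edge Impulse DSP block.
    constexpr int kWindowMs    = 500;   // classification window from the Time Series block
    constexpr int kFrameMs     = 20;    // assumed frame length
    constexpr int kStrideMs    = 10;    // assumed frame stride
    constexpr int kNumFilters  = 40;    // assumed number of mel filters

    constexpr int kNumFrames   = (kWindowMs - kFrameMs) / kStrideMs + 1;  // 49 frames
    constexpr int kInputLength = kNumFrames * kNumFilters;                // 1960 values

    // The DSP block produces a kNumFrames x kNumFilters spectrogram, which is flattened
    // into a one-dimensional array of kInputLength values and fed to the NN input layer.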

The NN architecture offered by Edge Impulse is as follows:

The NN training results show decent accuracy; however, the confusion matrix shows that misclassification is highly probable for the keyword WHITE.

Model integration

Now we have all the components ready to integrate everything together. The model output format depends on the target platform. Since the nRF52832 is not directly supported by Edge Impulse, the best option is to select a generic Cortex-M4 device. In this case Edge Impulse generates a pack of C++ source files grouped into the following categories:

  1. Edge Impulse SDK – libraries supporting the TensorFlow core, DSP computations, and Edge Impulse-specific code.
  2. The model itself – a C++ source file with related headers. It contains the already-parsed model from the *.tflite format, which avoids parsing on the device and makes the model easier for a developer to analyze.
  3. Model parameters – a set of header files with exported model parameters to be used by the customer’s code.

This format is convenient and easy to integrate into a project. Once the project infrastructure is ready, switching between models is just a matter of replacing a few files. A minimal sketch of hooking the generated classifier into the firmware is shown below.
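
The sketch follows the typical Edge Impulse static-buffer example pattern rather than the exact project code: the set_led() helper and the 0.8 confidence threshold are assumptions for illustration, while the macro and type names come from the generated SDK headers.

    // Sketch of wiring the generated Edge Impulse classifier into the firmware
    // (follows the usual static-buffer example pattern; not the exact project code).
    #include <cstring>
    #include "edge-impulse-sdk/classifier/ei_run_classifier.h"

    // Raw audio window filled by the microphone driver (samples converted to float).
    static float audio_window[EI_CLASSIFIER_RAW_SAMPLE_COUNT];

    extern void set_led(const char *color);   // hypothetical WS2812B helper

    // Callback the SDK uses to pull slices of the input signal.
    static int get_audio_data(size_t offset, size_t length, float *out_ptr)
    {
        memcpy(out_ptr, audio_window + offset, length * sizeof(float));
        return 0;
    }

    void classify_window(void)
    {
        signal_t signal;
        signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
        signal.get_data     = &get_audio_data;

        ei_impulse_result_t result = {0};
        if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK) {
            return;
        }

        // Pick the label with the highest score and drive the LED accordingly.
        size_t best = 0;
        for (size_t ix = 1; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
            if (result.classification[ix].value > result.classification[best].value) {
                best = ix;
            }
        }
        if (result.classification[best].value > 0.8f) {   // assumed confidence threshold
            set_led(result.classification[best].label);   // e.g. "green", "red", "white"
        }
    }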

The integration flow is as follows:

Testing

At this point, only two model versions (V1 and V2) have been built and tested. Despite the high calculated accuracy, the real-world performance of the model is low: the device recognized keywords in only 10-20% of cases. The main reason it is so poor is the small training dataset – only 50 seconds of audio per class were used, while Edge Impulse recommends at least 10 minutes. The table below compares the two models in terms of MCU resource utilization:

 

Conclusion and possible improvements

Based on the results of our ML demo (built on TensorFlow Lite), we can say that this approach is feasible for medium-scale and sophisticated embedded systems.

The Edge Impulse platform makes it easy for developers to create a POC or simple NN-powered projects. As Edge Impulse evolves, it may become a powerful professional ML tool for projects of high complexity. Unfortunately, for now some parts of the product are not stable, which forces developers to fall back on custom tools.

For this specific demo project, here is a list of suggested improvements to increase the keyword recognition rate and overall performance:

  • Change the microphone and/or rework the preamplifier circuit to increase SNR and output signal level. A digital microphone would also be a good option.
  • Use a sufficiently large training dataset collected from different speakers in different ambient conditions.
  • Conduct a series of experiments with DSP and NN architecture parameters to find the optimal ones.
  • Analyze the model for ways to reduce RAM usage.
