README.md

# Automatic Speech Recognition with PyArmNN

This sample application guides the user through performing automatic speech recognition (ASR) with the PyArmNN API.

## Prerequisites

### PyArmNN

Before proceeding to the next steps, make sure that you have successfully installed the newest version of PyArmNN on your system by following the instructions in the README of the PyArmNN root directory.

You can verify that the PyArmNN library is installed and check its version using:

```bash
$ pip show pyarmnn
```

You can also verify the installation by running the following command and checking that it prints a version string similar to the one below:

```bash
$ python -c "import pyarmnn as ann;print(ann.GetVersion())"
'32.0.0'
```

### Dependencies

Install the PortAudio package:

```bash
$ sudo apt-get install libsndfile1 libportaudio2
```

Install the required Python modules:

```bash
$ pip install -r requirements.txt
```

### Model

The model we are using is [Wav2Letter](https://github.com/ARM-software/ML-zoo/tree/master/models/speech_recognition/wav2letter/tflite_int8), which can be found in the [Arm Model Zoo repository](https://github.com/ARM-software/ML-zoo/tree/master/models).

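If you do not already have the model, one possible way to fetch it is to clone the Model Zoo repository and take the quantized `.tflite` file from the directory linked above. This is only a sketch and assumes Git LFS is set up, since the Model Zoo stores its model files with it:

```bash
# One possible way to fetch the Wav2Letter int8 model (assumes git and git-lfs are installed)
$ git clone https://github.com/ARM-software/ML-zoo.git
$ ls ML-zoo/models/speech_recognition/wav2letter/tflite_int8/
```
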
A small selection of suitable wav files containing human speech can be found [here](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/sampledata/audiofiles).

Labels for this model are defined within `run_audio_file.py`.

## Performing Automatic Speech Recognition

### Processing Audio Files

Please ensure that your audio file has a sampling rate of 16000 Hz.

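If you are unsure of a file's sampling rate, it can be checked with a few lines of Python. The snippet below assumes the `soundfile` module is available; it is not part of this sample's code:

```python
# Quick sanity check of an audio file's sampling rate (assumes the soundfile module is installed)
import soundfile as sf

with sf.SoundFile('<path/to/your_audio>') as audio_file:
    print(audio_file.samplerate)  # should print 16000 for this model
```
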
To run ASR on an audio file, use the following command:

```bash
$ python run_audio_file.py --audio_file_path <path/to/your_audio> --model_file_path <path/to/your_model>
```

You may also add the optional flags:

* `--preferred_backends`

  * Takes the preferred backends in preference order, separated by whitespace. For example, passing in "CpuAcc CpuRef" will be read as the list ["CpuAcc", "CpuRef"], which is also the default; see the example after this list.

    * CpuAcc represents the CPU backend

    * GpuAcc represents the GPU backend

    * CpuRef represents the CPU reference kernels

* `--help` prints all available options to screen

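For example, to prefer GPU acceleration with CPU fallbacks, the backends can be listed in order of preference after the flag (the audio and model paths below are placeholders):

```bash
$ python run_audio_file.py --audio_file_path <path/to/your_audio> --model_file_path <path/to/your_model> --preferred_backends GpuAcc CpuAcc CpuRef
```
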
## Application Overview

1. [Initialization](#initialization)

2. [Creating a network](#creating-a-network)

3. [Automatic speech recognition pipeline](#automatic-speech-recognition-pipeline)

### Initialization

The application parses the supplied user arguments and loads the audio file in chunks through the `capture_audio()` method, which accepts sampling criteria as an `AudioCaptureParams` tuple.

With ASR from an audio file, the application creates a generator object that yields blocks of audio data from the file, with a minimum sample size defined in `AudioCaptureParams`.

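The sketch below illustrates this step. The exact field names and values of `AudioCaptureParams` are defined in the sample's audio capture utilities; the ones used here are illustrative assumptions only:

```python
# Illustrative sketch only: the AudioCaptureParams fields and values shown here are assumptions,
# not the sample's exact definitions.
from collections import namedtuple

import numpy as np

AudioCaptureParams = namedtuple('AudioCaptureParams',
                                ['dtype', 'overlap', 'min_samples', 'sampling_freq', 'mono'])

capture_params = AudioCaptureParams(dtype=np.float32,
                                    overlap=0,          # samples shared between consecutive blocks
                                    min_samples=16000,  # minimum number of samples per yielded block
                                    sampling_freq=16000,
                                    mono=True)

# capture_audio() (provided by the sample) returns a generator that yields blocks of audio data
for audio_block in capture_audio(audio_file_path, capture_params):
    ...  # extract MFCC features from the block and run inference
```
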
MFCC features are extracted from each block based on criteria defined in the `MFCCParams` tuple. These extracted features constitute the input tensors for the model.

To interpret the inference result of the loaded network, the application passes the label dictionary defined in `run_audio_file.py` to a decoder and displays the result.

### Creating a network

A PyArmNN application must import a graph from file using an appropriate parser. Arm NN provides parsers for various model file types, including TFLite and ONNX. These parsers are libraries for loading neural networks of various formats into the Arm NN runtime.

Arm NN supports optimized execution on multiple CPU, GPU, and Ethos-N devices. Before executing a graph, the application must select the appropriate device context by using `IRuntime()` to create a runtime context with default options. We can optimize the imported graph by specifying a list of backends in order of preference and implementing backend-specific optimizations. Each backend is identified by a unique string; for example, CpuAcc, GpuAcc, and CpuRef represent the accelerated CPU backend, the accelerated GPU backend, and the CPU reference kernels respectively.

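A minimal sketch of creating the runtime context with default options is shown below; in this sample the `ArmnnNetworkExecutor` class performs this step for you:

```python
import pyarmnn as ann

# Create an Arm NN runtime context with default creation options
options = ann.CreationOptions()
runtime = ann.IRuntime(options)

# The device spec describes the backends available on this machine
print(runtime.GetDeviceSpec())
```
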
Arm NN splits the entire graph into subgraphs based on these backends. Each subgraph is then optimized, and the corresponding subgraph in the original graph is substituted with its optimized version.

The `Optimize()` function optimizes the graph for inference, then `LoadNetwork()` loads the optimized network onto the compute device. The `LoadNetwork()` function also creates the backend-specific workloads for the layers and a backend-specific workload factory.

Parsers extract the input information for the network. The `GetSubgraphInputTensorNames()` function extracts all the input names, and the `GetNetworkInputBindingInfo()` function obtains the input binding information of the graph. The input binding information contains all the essential information about the input: it is a tuple consisting of integer identifiers for bindable layers and tensor information (data type, quantization info, dimension count, total elements).

Similarly, we can get the output binding information for an output layer by using the parser to retrieve output tensor names and calling the `GetNetworkOutputBindingInfo()` function.

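As a rough sketch, the tensor names and binding information can be retrieved as follows; the `parser` object is the `ITfLiteParser` created in the excerpt further below, and in this sample these calls are made inside `ArmnnNetworkExecutor`:

```python
# Sketch: retrieving tensor names and binding information from the TFLite parser
graph_id = 0  # assuming the model contains a single subgraph

input_names = parser.GetSubgraphInputTensorNames(graph_id)
input_binding_info = parser.GetNetworkInputBindingInfo(graph_id, input_names[0])

output_names = parser.GetSubgraphOutputTensorNames(graph_id)
output_binding_info = parser.GetNetworkOutputBindingInfo(graph_id, output_names[0])
```
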
For this application, the main point of contact with PyArmNN is through the `ArmnnNetworkExecutor` class, which handles the network creation step for you.

```python
# common/network_executor.py
# The provided wav2letter model is in .tflite format, so we use ITfLiteParser() to import the graph
if ext == '.tflite':
    parser = ann.ITfLiteParser()
network = parser.CreateNetworkFromBinaryFile(model_file)
...
# Optimize the network for the list of preferred backends
opt_network, messages = ann.Optimize(
    network, preferred_backends, self.runtime.GetDeviceSpec(), ann.OptimizerOptions()
    )
# Load the optimized network onto the runtime device
self.network_id, _ = self.runtime.LoadNetwork(opt_network)
# Get the input and output binding information
self.input_binding_info = parser.GetNetworkInputBindingInfo(graph_id, input_names[0])
self.output_binding_info = parser.GetNetworkOutputBindingInfo(graph_id, output_name)
```

### Automatic speech recognition pipeline

Mel-frequency cepstral coefficients (MFCCs, [see Wikipedia](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)) are extracted based on criteria defined in the `MFCCParams` tuple and the associated `MFCC` class. MFCCs are the result of computing the dot product of the Discrete Cosine Transform (DCT) matrix and the log of the Mel energy.

The `MFCC` class is used in conjunction with the `AudioPreProcessor` class to extract and process MFCC features from a given audio frame.

After all the MFCCs needed for an inference have been extracted from the audio data, we convolve them with 1-dimensional Savitzky-Golay filters to compute the first and second MFCC derivatives with respect to time. The MFCCs and the derivatives constitute the input tensors that will be classified by an `ArmnnNetworkExecutor` object.

```python
# mfcc.py & wav2letter_mfcc.py
# Extract MFCC features
log_mel_energy = np.maximum(log_mel_energy, log_mel_energy.max() - top_db)
mfcc_feats = np.dot(self.__dct_matrix, log_mel_energy)
...
# Compute first and second derivatives (delta and delta-delta respectively) by passing a
# Savitzky-Golay filter as a 1D convolution over the features
for i in range(features.shape[1]):
    idelta = np.convolve(features[:, i], self.__savgol_order1_coeffs, 'same')
    mfcc_delta_np[:, i] = idelta
    ideltadelta = np.convolve(features[:, i], self.savgol_order2_coeffs, 'same')
    mfcc_delta2_np[:, i] = ideltadelta
```

```python
# audio_utils.py
# Quantize the input data and create input tensors with PyArmNN
input_data = quantize_input(input_data, input_binding_info)
input_tensors = ann.make_input_tensors([input_binding_info], [input_data])
```

Note: `ArmnnNetworkExecutor` has already created the output tensors for you.

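For reference, output tensors can be created from the output binding information with a single call; this is a sketch of the step that `ArmnnNetworkExecutor` performs on your behalf:

```python
# output_binding_info is obtained from the parser as shown earlier
output_tensors = ann.make_output_tensors([output_binding_info])
```
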
After creating the workload tensors, the compute device performs inference for the loaded network by using the `EnqueueWorkload()` function of the runtime context. Calling the `workload_tensors_to_ndarray()` function obtains the inference results as a list of ndarrays.

```python
# common/network_executor.py
status = runtime.EnqueueWorkload(net_id, input_tensors, self.output_tensors)
self.output_result = ann.workload_tensors_to_ndarray(self.output_tensors)
```

The output from the inference must be decoded to obtain the recognised characters from the speech. A simple greedy decoder classifies the results by taking the highest element of the output as a key for the labels dictionary. The value returned is a character, which is appended to a list; the list is then filtered to remove unwanted characters, and the resulting string is displayed on the console.

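A minimal sketch of such a greedy decoder is shown below. The shape of the model output, the label dictionary, and the set of filtered characters are assumptions here; the actual ones are defined in `run_audio_file.py` and the sample's decoder utilities:

```python
import numpy as np

def greedy_decode(model_output: np.ndarray, labels: dict) -> str:
    """Pick the most probable label at each time step and join the resulting characters."""
    # model_output is assumed to have shape (time_steps, num_classes)
    best_indices = np.argmax(model_output, axis=1)
    chars = [labels[int(idx)] for idx in best_indices]
    # Filter out characters that should not appear in the transcription,
    # e.g. the model's blank symbol (assumed to be '$' here)
    unwanted = {'$'}
    return ''.join(c for c in chars if c not in unwanted)
```
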
## Next steps

Having now gained a solid understanding of performing automatic speech recognition with PyArmNN, you are able to take control and create your own application. As a next step, we suggest implementing your own network, which can be done by updating the parameters of `ModelParams` and `MfccParams` to match your custom model. The `ArmnnNetworkExecutor` class will handle the network optimisation and loading for you.

An important step towards improving the accuracy of the generated output sentences is providing cleaner data to the network. This can be done by including additional preprocessing steps, such as noise reduction, on your audio data.

In this application we used a greedy decoder to decode the integer-encoded output; however, better results can be achieved by implementing a beam search decoder. You may even try adding a language model at the end to correct any spelling mistakes the model may produce.


README.md.license

#
# Copyright © 2020-2022 Arm Ltd and Contributors. All rights reserved.
# SPDX-License-Identifier: MIT
#