Video Demo:
1 Introduction
Introduction for Module / Chip
ESP32-S2-MINI-1 and ESP32-S2-MINI-1U are two powerful, generic Wi-Fi MCU modules that have a rich set of peripherals. They are an ideal choice for a wide variety of application scenarios related to Internet of Things (IoT), such as wearable electronics and smart home.
- Core: Xtensa® single-core 32-bit LX7 CPU, frequency up to 240 MHz
- Memories:
- 128 KB of ROM
- 320 KB of SRAM
- 16 KB of RTC SRAM
- 4 MB of Flash memory
Please refer to: module datasheet
Introduction for Dev Board
This project uses a development board from EETree that is based on the module above. The board provides the following circuits and features:
- Based on the ESP32-S2 Wi-Fi core module;
- 128×64 OLED display with SPI interface, for displaying messages, parameters and waveforms;
- 4 buttons for parameter and mode selection;
- 1-channel Mic audio input: analog circuit with filter, gain adjustable from 0 to 40 dB by potentiometer;
- 1-channel audio input through the headphone jack: analog circuit with filter, gain adjustable from 0 to 40 dB by potentiometer;
- 2-channel audio output with power amplifier, able to drive a speaker and headphones;
- An FM receiver, with channel selection and volume adjustment over the I2C interface;
- An analog switch to route the selected audio source (ESP32 or FM) to the output devices (speaker, headphones).
Please refer to the development board's official website:
Introduction for the Task
Audio Scene Classification (ASC) is one of the basic tasks in the computer acoustics field. The goal is to classify an acquired piece of sound into the correct environment label, such as a dog barking or rain. This project deploys ASC on the embedded ESP32 platform, where it can be useful in daily life while running at low power consumption. This post introduces the process for this project, including audio acquisition, feature extraction, the neural network and deployment.
2 Audio Signal Acquisition
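The schematic shows that the signal picked up by the Mic is amplified by the operational amplifier U9, whose gain can be adjusted through RV1, and is finally fed into an ESP32 ADC input. Following the reference manual and the programming guide, the acquisition part therefore uses DIG ADC1 to measure the pin voltage at a sampling rate of 20 kHz, and the samples are copied directly into memory by DMA (direct memory access) without CPU involvement. The driver for this part is taken directly from the one already developed in the referenced project. However, because the API has since been updated, I forced the use of deprecated library functions, so warnings are produced at compile time. This driver can be upgraded in the future.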
3 Feature Extraction
3.1 Window Function
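First, we take 1×1024 consecutive points from the sampled audio data. These points will later be Fourier transformed. Before that, to reduce spectral leakage, we multiply this segment by a function that is large in the middle, with a maximum of 1, and small at both ends, gradually approaching 0. Such a function is called a "window function". The window I use is the commonly used Hann window (Hanning Window). For details, please refer to: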
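The windowing step described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the firmware code: the helper names `hann_window` and `apply_window` are hypothetical, and the symmetric form of the Hann window is an assumption, since the post does not say which variant is used on the device.

```python
import math

FRAME_LEN = 1024  # number of consecutive samples per frame, as stated in the text


def hann_window(n_points):
    """Return a symmetric Hann window with n_points coefficients (assumed variant)."""
    if n_points < 2:
        return [1.0] * n_points
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def apply_window(frame, window):
    """Multiply a frame of samples element-wise by the window coefficients."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [sample * coeff for sample, coeff in zip(frame, window)]


window = hann_window(FRAME_LEN)

# A constant test frame makes the effect easy to see: after windowing,
# the ends are pulled towards 0 while the middle stays close to 1.
frame = [1.0] * FRAME_LEN
windowed = apply_window(frame, window)
```

The windowed frame is what would then be passed to the FFT. In a real implementation the coefficients would typically be computed once and reused for every frame.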
3.2 Fast Fourier Transformation (FFT)
3.3 Mel Spectrogram
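Research shows that human perception of frequency is not linear, and that we are more sensitive to low-frequency signals than to high-frequency ones. For example, people can easily tell the difference between 500 and 1000 Hz, but can hardly tell the difference between 7500 and 8000 Hz. The horizontal axis of the spectrum after the Fourier transform, however, is linear. The Mel scale was proposed for this reason: it is a nonlinear transformation of Hz, such that equal differences expressed in mel are perceived as nearly equal by listeners. Figure 3.1(b) shows the relationship between the pitch perceived by the human ear and the actual frequency. For more details, please refer to: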
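To make the nonlinearity concrete, here is a small sketch that converts between Hz and mel. It assumes the widely used formula m = 2595 · log10(1 + f / 700). The post does not state which Mel formula the project uses, so treat the constants and the function names `hz_to_mel` and `mel_to_hz` as assumptions for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mel (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 500 Hz step, once at low and once at high frequency.
low_step = hz_to_mel(1000.0) - hz_to_mel(500.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7500.0)

print("500 -> 1000 Hz spans {:.1f} mel".format(low_step))
print("7500 -> 8000 Hz spans {:.1f} mel".format(high_step))
```

Under this formula the low-frequency step covers several times more mel than the high-frequency step, which matches the observation that the 500 to 1000 Hz difference is much easier to hear.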
3.4 Program Pipeline
4 Model Training
4.1 Dataset
4.2 Model Architecture
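After feature extraction, the audio classification problem becomes an image classification problem. We use a fairly simple convolutional neural network (CNN) to classify it. Memory on the ESP32 platform is limited, and EETree uses a chip variant without the 2 MB PSRAM, so only the 320 KB of internal SRAM is available. The network therefore cannot be too large. In general, on embedded platforms the memory size limits the scale of the network, while the CPU performance determines how fast the network can run. For visualizing CNN model architectures, the following website is recommended: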
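As a rough check of the memory argument, the sketch below walks through the layer sequence used in the inference code in the appendix (two convolutions, pooling, two convolutions, pooling, starting from a 1×30×40 input) and estimates the size of the feature maps. Several things here are assumptions rather than facts from the post: 3×3 unpadded convolutions and 2×2 pooling are inferred from how the dimensions shrink in the appendix calls, and 4 bytes per value (float32) is a guess about the storage format. The helper names are hypothetical.

```python
BYTES_PER_VALUE = 4        # assumption: feature maps stored as float32
SRAM_BYTES = 320 * 1024    # internal SRAM of the ESP32-S2 module


def conv3x3_valid(shape, out_channels):
    """Output shape of a 3x3 convolution without padding, stride 1 (assumed)."""
    _, height, width = shape
    return (out_channels, height - 2, width - 2)


def max_pool2x2(shape):
    """Output shape of a 2x2 max pooling with stride 2 (assumed)."""
    channels, height, width = shape
    return (channels, height // 2, width // 2)


def num_values(shape):
    channels, height, width = shape
    return channels * height * width


# Layer order taken from the appendix: conv, conv, pool, conv, conv, pool.
shapes = [(1, 30, 40)]
for layer in ("conv", "conv", "pool", "conv", "conv", "pool"):
    previous = shapes[-1]
    if layer == "conv":
        shapes.append(conv3x3_valid(previous, 10))
    else:
        shapes.append(max_pool2x2(previous))

# Each layer needs its input and its output in memory at the same time.
peak_values = max(num_values(a) + num_values(b) for a, b in zip(shapes, shapes[1:]))
peak_bytes = peak_values * BYTES_PER_VALUE

for shape in shapes:
    print(shape, num_values(shape) * BYTES_PER_VALUE, "bytes")
print("peak input + output:", peak_bytes, "bytes of", SRAM_BYTES)
```

This only counts activations. Weights, the audio buffers and the rest of the firmware also share the same SRAM, so the real budget is tighter than this estimate suggests.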
5 Model Deployment
6 Program Flow
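- Button 1: toggles the DAC on/off (effective when the ADC is off); sets the ADC digital filter parameters (when the ADC is on)
- Button 2: toggles the ADC on/off (effective when the DAC is off)
- Button 3: when the ADC is on, changes the ADC mode (described in detail below); when the DAC is on, adjusts the DAC volume
- Button 4: switches the DAC frequency
Figure 6 focuses on the program logic in ADC mode. Once the ADC is turned on, the ADC_Task task starts running and keeps responding to ADC DMA interrupts. It then executes the code that corresponds to whichever of the 4 ADC modes has been selected with the buttons.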
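The button behaviour listed above can be modelled as a small state machine. The following Python sketch is only a model of that logic, not the firmware: the `DeviceState` fields, the step counters, and the choice to cycle through the 4 ADC modes in order are all hypothetical. Button 4 is applied unconditionally because the post does not say whether it requires the DAC to be on.

```python
from dataclasses import dataclass

NUM_ADC_MODES = 4  # the text mentions 4 ADC modes


@dataclass
class DeviceState:
    adc_on: bool = False
    dac_on: bool = False
    adc_mode: int = 0          # index of the active ADC mode
    adc_filter_step: int = 0   # hypothetical counter for the filter setting
    dac_volume_step: int = 0   # hypothetical counter for the volume setting
    dac_freq_step: int = 0     # hypothetical counter for the frequency setting


def handle_button(state, button):
    """Update the state for one button press (buttons are numbered 1 to 4)."""
    if button == 1:
        if state.adc_on:
            state.adc_filter_step += 1         # ADC on: change filter parameter
        else:
            state.dac_on = not state.dac_on    # ADC off: toggle the DAC
    elif button == 2:
        if not state.dac_on:
            state.adc_on = not state.adc_on    # only effective while the DAC is off
    elif button == 3:
        if state.adc_on:
            state.adc_mode = (state.adc_mode + 1) % NUM_ADC_MODES
        elif state.dac_on:
            state.dac_volume_step += 1
    elif button == 4:
        state.dac_freq_step += 1
    else:
        raise ValueError("unknown button: {}".format(button))
    return state


state = DeviceState()
handle_button(state, 2)   # turn the ADC on
handle_button(state, 3)   # move to the next ADC mode
print(state)
```

Note that the rules make the ADC and the DAC mutually exclusive: each one can only be switched on while the other is off.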
7 Results
Appendix: Code
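/* CNN inference on the ESP32 (C). Sizes are channels x height x width. */
/* The mutex is held only while the shared spectrogram `spec` is read. */
xSemaphoreTake( xMutexSpec, portMAX_DELAY );
conv2d_relu(&spec, 1, 30, 40, feature1, 10, conv1_1_w, conv1_1_b);      /* 1x30x40 -> 10x28x38 */
xSemaphoreGive( xMutexSpec );
conv2d_relu(feature1, 10, 28, 38, feature2, 10, conv1_2_w, conv1_2_b);  /* -> 10x26x36 */
swap_feature(feature1, feature2);  /* after each swap, feature1 is the input of the next layer */
max_pool2d(feature1, 10, 26, 36, feature2);                             /* -> 10x13x18 */
swap_feature(feature1, feature2);
conv2d_relu(feature1, 10, 13, 18, feature2, 10, conv2_1_w, conv2_1_b);  /* -> 10x11x16 */
swap_feature(feature1, feature2);
conv2d_relu(feature1, 10, 11, 16, feature2, 10, conv2_2_w, conv2_2_b);  /* -> 10x9x14 */
swap_feature(feature1, feature2);
max_pool2d(feature1, 10, 9, 14, feature2);                              /* -> 10x4x7 */
swap_feature(feature1, feature2);
flatten_fc(feature1, 10, 4, 7, out_neurons, 10, fc_w, fc_b);            /* 10x4x7 -> 10 outputs */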
# Training loop (PyTorch). model, criterion, optimizer, asc_dataloader,
# asc_dataset and num_epochs are assumed to be defined beforehand.
import torch

for epoch in range(num_epochs):  # loop over the dataset multiple times
    running_loss = 0.0
    correct = 0
    for i, data in enumerate(asc_dataloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        labels = torch.as_tensor(labels, dtype=torch.long)
        # zero the parameter gradients
        optimizer.zero_grad()  # `optimizer` is an assumed name, not shown in the original
        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        _, preds = torch.max(outputs, 1)
        # show_images(inputs[:6], labels[:6], preds[:6])
        correct += (preds == labels).sum().item()
        # print statistics
        running_loss += loss.item()
        if i % 1 == 0:  # print every mini-batch
            print('Epoch [{}/{}], Step [{}/{}], Loss(Avg): {:.3f}'
                  .format(epoch + 1, num_epochs, i + 1, len(asc_dataloader), running_loss / 1))
            running_loss = 0.0
    print("Epoch {}, Acc: {}/{}".format(epoch + 1, correct, len(asc_dataset)))