An open-source wake word project with customizable wake words

wangerxian

Howl

Howl provides the wake word detection model for Firefox Voice and supports open datasets such as Google Speech Commands and Mozilla Common Voice.

Citation:

@inproceedings{tang-etal-2020-howl,
    title = "Howl: A Deployed, Open-Source Wake Word Detection System",
    author = "Tang, Raphael and Lee, Jaejun and Razi, Afsaneh and Cambre, Julia and Bicking, Ian and Kaye, Jofish and Lin, Jimmy",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.9",
    doi = "10.18653/v1/2020.nlposs-1.9",
    pages = "61--65"
}

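Training Guide

Installation

git clone https://github.com/castorini/howl && cd howl
Install PyTorch by following the platform-specific instructions.
Install PyAudio and its dependencies through your distribution's package system.
pip install -r requirements.txt -r requirements_training.txt (some apt packages may need to be installed first)
./download_mfa.sh to set up the Montreal Forced Aligner (MFA) for dataset generation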

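Preparing a Dataset

Generating a dataset for a custom wake word requires three steps:

Generate a raw audio dataset that Howl can load, starting from an open dataset.
Generate orthographic transcription alignments for each audio file.
Attach the alignments to the raw audio dataset generated in step 1.

We recommend the Common Voice dataset as the open audio dataset and the Montreal Forced Aligner (MFA) for transcription alignment.
Running the download_mfa.sh script downloads MFA. Along with the aligner, the script downloads the necessary English pronunciation dictionary.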

./generate_dataset.sh <common voice dataset path> <underscore separated wakeword (e.g. hey_fire_fox)> <inference sequence (e.g. [0,1,2])> <(Optional) "true" to skip negative dataset generation>

For detailed instructions, see the guide on how to generate a dataset for a custom wake word.
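
The script arguments above follow a simple pattern that can be derived from the wake phrase itself. The sketch below is not part of Howl; the helper name wakeword_arguments is hypothetical, and it assumes the vocabulary consists of exactly the wake words in spoken order, so the inference sequence is just 0 to n-1.

```python
import json


def wakeword_arguments(phrase_words):
    """Derive command-line strings for a wake phrase (hypothetical helper).

    Assumes the vocabulary is exactly the wake words, in spoken order.
    """
    words = [w.strip().lower() for w in phrase_words if w.strip()]
    if not words:
        raise ValueError("at least one word is required")
    # Underscore-separated form, e.g. hey_fire_fox
    wakeword = "_".join(words)
    # Compact JSON list, e.g. ["hey","fire","fox"]
    vocab = json.dumps(words, separators=(",", ":"))
    # Indices into the vocabulary, e.g. [0,1,2]
    sequence = json.dumps(list(range(len(words))), separators=(",", ":"))
    return wakeword, vocab, sequence


if __name__ == "__main__":
    print(wakeword_arguments(["hey", "fire", "fox"]))
```

The second and third values have the same shape as the VOCAB and INFERENCE_SEQUENCE settings used elsewhere in this post.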

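Training and Running a Model

Source the relevant environment variables for training the res8 model: source envs/res8.env.
Train the model: python -m training.run.train -i datasets/fire/positive datasets/fire/negative --model res8 --workspace workspaces/fire-res8. If the training datasets are small, it is recommended to also use --use-stitched-datasets.
For the CLI demo, run python -m training.run.demo --model res8 --workspace workspaces/fire-res8.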

./train_model.sh <env file path (e.g. envs/res8.env)> <model type (e.g. res8)> <workspace path (e.g. workspaces/fire-res8)> <dataset1 (e.g. datasets/fire-positive)> <dataset2 (e.g. datasets/fire-negative)> ...
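
When training is launched from another program, it can help to assemble the train_model.sh arguments in one place. This is a minimal sketch based only on the argument order shown above; the function name train_model_command is hypothetical, and the paths are the example values from this post.

```python
import shlex


def train_model_command(env_file, model_type, workspace, datasets):
    """Build the argument list for train_model.sh (hypothetical helper)."""
    if not datasets:
        raise ValueError("at least one dataset path is required")
    # Order: env file, model type, workspace, then any number of datasets.
    return ["./train_model.sh", env_file, model_type, workspace, *datasets]


cmd = train_model_command(
    "envs/res8.env",
    "res8",
    "workspaces/fire-res8",
    ["datasets/fire-positive", "datasets/fire-negative"],
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

From the repository root, the list could then be passed to subprocess.run(cmd, check=True).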

Pretrained Models

howl-models contains workspaces with pretrained models.

To get the latest models, simply run git submodule update --init --recursive

Hey Firefox

VOCAB='["hey","fire","fox"]' INFERENCE_SEQUENCE=[0,1,2] INFERENCE_THRESHOLD=0 NUM_MELS=40 MAX_WINDOW_SIZE_SECONDS=0.5 python -m training.run.demo --model res8 --workspace howl-models/howl/hey-fire-fox
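
The demo command above passes its settings through environment variables. A minimal sketch of the same invocation prepared from Python follows; the helper name demo_invocation is hypothetical, and the values are copied from the command above.

```python
import os
import sys

# Settings copied from the demo command; environment values must be strings.
DEMO_SETTINGS = {
    "VOCAB": '["hey","fire","fox"]',
    "INFERENCE_SEQUENCE": "[0,1,2]",
    "INFERENCE_THRESHOLD": "0",
    "NUM_MELS": "40",
    "MAX_WINDOW_SIZE_SECONDS": "0.5",
}


def demo_invocation(workspace, model="res8", base_env=None):
    """Return (argv, env) for the CLI demo (hypothetical helper)."""
    # Start from a copy so the caller's environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEMO_SETTINGS)
    argv = [sys.executable, "-m", "training.run.demo",
            "--model", model, "--workspace", workspace]
    return argv, env
```

Running it would then be a matter of calling subprocess.run(argv, env=env, check=True) from the repository root.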

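Installing Howl with pip

Install PyAudio and PyTorch 1.5+ through your distribution's package system.
Install Howl using pip:

pip install howl

To use pretrained Howl models for inference right away, we provide a "client" API. The example below (also found under examples/hey_fire_fox.py) loads the "hey_fire_fox" pretrained model with a simple callback and starts the inference client.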

from howl.client import HowlClient

def hello_callback(detected_words):
    print("Detected: {}".format(detected_words))

client = HowlClient()
client.from_pretrained("hey_fire_fox", force_reload=False)
client.add_listener(hello_callback)
client.start().join()
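
A listener like hello_callback may fire several times in quick succession for one utterance. The sketch below wraps any callback so that repeats within a cooldown period are ignored. It is not part of Howl: DebouncedListener is a hypothetical name, and it assumes add_listener accepts any callable that takes the detected words as its single argument, as the example above suggests.

```python
import time


class DebouncedListener:
    """Forward detections to a callback at most once per cooldown period."""

    def __init__(self, on_detect, cooldown_seconds=2.0, clock=time.monotonic):
        self._on_detect = on_detect
        self._cooldown = cooldown_seconds
        self._clock = clock          # injectable so the class can be tested
        self._last_fired = None

    def __call__(self, detected_words):
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self._cooldown:
            return False             # still cooling down; drop this detection
        self._last_fired = now
        self._on_detect(detected_words)
        return True
```

Under that assumption, it would be registered with client.add_listener(DebouncedListener(hello_callback)).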

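Reproducing Paper Results

First, follow the installation instructions in the quickstart guide.

Google Speech Commands

Download the Google Speech Commands dataset and extract it.
Source the appropriate environment variables: source envs/res8.env
Set the dataset path to the root folder of the Speech Commands dataset: export DATASET_PATH=/path/to/dataset
Train the res8 model: NUM_EPOCHS=20 MAX_WINDOW_SIZE_SECONDS=1 VOCAB='["yes","no","up","down","left","right","on","off","stop","go"]' BATCH_SIZE=64 LR_DECAY=0.8 LEARNING_RATE=0.01 python -m training.run.pretrain_gsc --model res8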
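
Before starting a long training run, it may be worth checking that DATASET_PATH really points at the extracted dataset. The Speech Commands archive stores recordings in one folder per word, so a quick check is to look for a folder for each VOCAB entry. The helper name missing_keyword_dirs is hypothetical and not part of Howl.

```python
import json
import os
import tempfile


def missing_keyword_dirs(dataset_root, vocab_json):
    """Return the VOCAB words that have no folder under dataset_root."""
    vocab = json.loads(vocab_json)
    return [w for w in vocab if not os.path.isdir(os.path.join(dataset_root, w))]


if __name__ == "__main__":
    # Demonstration on a throwaway directory that only contains "yes".
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "yes"))
        print(missing_keyword_dirs(root, '["yes","no"]'))
```

An empty result suggests the path is set correctly; any listed word points to an incomplete extraction or a wrong root folder.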

Hey Firefox

Download the Hey Firefox corpus, licensed under CC0, and extract it.
Download our noise dataset, built from Microsoft SNSD and MUSAN, and extract it.
Source the appropriate environment variables: source envs/res8.env
Set the noise dataset path to its root folder: export NOISE_DATASET_PATH=/path/to/snsd
Set the Hey Firefox dataset path to its root folder: export DATASET_PATH=/path/to/hey_firefox
Train the model: LR_DECAY=0.98 VOCAB='["hey","fire","fox"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1,2] MAX_WINDOW_SIZE_SECONDS=0.5 python -m training.run.train --model res8 --workspace workspaces/hey-ff-res8
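
The steps above rely on two exported paths, and a missing or mistyped one is easy to overlook. A minimal pre-flight check might look like the sketch below; check_training_env is a hypothetical helper, not something Howl provides.

```python
import os

# The two path variables exported in the steps above.
REQUIRED_PATH_VARS = ("DATASET_PATH", "NOISE_DATASET_PATH")


def check_training_env(environ=None):
    """Return a list of human-readable problems; empty means all good."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_PATH_VARS:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} does not point to a directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_training_env():
        print(problem)
```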

Hey Snips

Download the Hey Snips dataset.
Process the dataset into a format that Howl can load:

VOCAB='["hey","snips"]' INFERENCE_SEQUENCE=[0,1] DATASET_PATH=datasets/hey-snips python -m training.run.deprecated.create_raw_dataset --dataset-type 'hey-snips' -i ~/path/to/hey_snips_dataset

Generate some mock alignments for the dataset; the actual alignments do not matter at this point:

python -m training.run.attach_alignment \
  --input-raw-audio-dataset datasets/hey-snips \
  --token-type word \
  --alignment-type stub

Use MFA to generate alignments for the dataset:

mfa_align datasets/hey-snips/audio eng.dict pretrained_models/english.zip datasets/hey-snips/alignments

Attach the MFA alignments to the dataset:

python -m training.run.attach_alignment \
  --input-raw-audio-dataset datasets/hey-snips \
  --token-type word \
  --alignment-type mfa \
  --alignments-path datasets/hey-snips/alignments

Source the appropriate environment variables: source envs/res8.env
Set the noise dataset path to its root folder: export NOISE_DATASET_PATH=/path/to/snsd
Set the Hey Snips dataset path to its root folder: export DATASET_PATH=/path/to/hey-snips
Train the model: LR_DECAY=0.98 VOCAB='["hey","snips"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1] MAX_WINDOW_SIZE_SECONDS=0.5 python -m training.run.train --model res8 --workspace workspaces/hey-snips-res8
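
The MFA alignment and attachment steps above have to run in order, and a failure in the first makes the second pointless. A small runner that stops at the first failing command could be sketched as follows. The commands are copied from this section; run_steps and HEY_SNIPS_STEPS are hypothetical names.

```python
import subprocess

# The two MFA-related commands from this section, as argument lists.
HEY_SNIPS_STEPS = [
    ["mfa_align", "datasets/hey-snips/audio", "eng.dict",
     "pretrained_models/english.zip", "datasets/hey-snips/alignments"],
    ["python", "-m", "training.run.attach_alignment",
     "--input-raw-audio-dataset", "datasets/hey-snips",
     "--token-type", "word",
     "--alignment-type", "mfa",
     "--alignments-path", "datasets/hey-snips/alignments"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and return how many completed."""
    completed = 0
    for argv in steps:
        # With check=True, subprocess.run raises CalledProcessError on a
        # non-zero exit status, which stops the loop at the failing step.
        runner(argv, check=True)
        completed += 1
    return completed
```

The runner parameter exists only so the sequence can be exercised without actually launching the tools.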

Generating a Mycroft-Precise Dataset

Howl also provides a script for converting a Howl dataset into a mycroft-precise dataset:

VOCAB='["hey","fire","fox"]' INFERENCE_SEQUENCE=[0,1,2] python -m training.run.generate_precise_dataset --dataset-path /path/to/howl_dataset

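Experiments

To verify the correctness of our implementation, we first train and evaluate our models on the Google Speech Commands dataset, for which many results are already known. We then create wake word detection datasets and report our models' performance on them.

Both experiments generate a report in Excel format. The experiments folder contains sample outputs for each experiment, and the corresponding workspaces can be found here.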

commands_recognition

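For command recognition, we train four different models (res8, LSTM, LAS encoder, MobileNetv2) to detect twelve different keywords: "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", unknown, or silence.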

python -m training.run.eval_commands_recognition --num_iterations n --dataset_path < path_to_gsc_datasets >
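
The twelve classes are the ten command words plus two catch-all classes. The sketch below shows one way such a mapping could be written. The ordering of the labels and the function name label_index are assumptions made for illustration and may differ from what the evaluation script uses internally.

```python
# Ten command words, followed by the two catch-all classes.
KEYWORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
LABELS = KEYWORDS + ["unknown", "silence"]


def label_index(word):
    """Map a transcribed word (or None for no speech) to a class index."""
    if word is None or word == "":
        return LABELS.index("silence")
    word = word.lower()
    if word in KEYWORDS:
        return KEYWORDS.index(word)
    # Any other spoken word falls into the "unknown" class.
    return LABELS.index("unknown")


if __name__ == "__main__":
    print(len(LABELS), label_index("stop"), label_index("bed"), label_index(None))
```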

word_detection

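In this experiment, we train res8, the best-performing command recognition model, on "Hey Firefox" and "Hey Snips" and evaluate it with different thresholds.

Two performance reports are generated, one for clean audio and one for noisy audio.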

python -m training.run.eval_wake_word_detection --num_models n --hop_size < number between 0 and 1 > --exp_type < hey_firefox | hey_snips > --dataset_path "x" --noiseset_path "y"

We also provide a script for generating ROC curves. The exp_timestamp can be found in the reports generated by the previous command.

python -m training.run.generate_roc --exp_timestamp < experiment timestamp > --exp_type < hey_firefox | hey_snips >
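
For readers unfamiliar with how a ROC curve relates to the thresholds mentioned above, here is a generic sketch of the computation. It is not the logic of the generate_roc script: roc_points is a hypothetical function, and the scores and labels are made-up illustration values, not experimental results.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, false positive rate, true positive rate) tuples.

    A sample counts as detected when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    # Illustrative values only: two wake word clips and two negative clips.
    for point in roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], [0.0, 0.5, 1.0]):
        print(point)
```

Raising the threshold moves along the curve toward fewer false accepts at the cost of more missed wake words.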