Building a Neuromorphic Keyword Detector - abr/telluride-nengo GitHub Wiki

Note: the abr_speech repo is not available here; this page is intended to demonstrate a typical workflow.

This notebook illustrates the steps required to go from a high-level deep network for speech recognition to a low-level neuromorphic implementation of that network. We'll focus specifically on the example of keyword spotting, i.e., the task of identifying a specific target word (in this case "aloha") in a speech signal.

The general workflow we'll use can be broken down into four basic steps: (1) load a dataset of audio/text pairs, (2) use a TensorFlow model to learn alignments between windows of audio and specific text characters, (3) use these alignments as data to train a spiking Nengo DL model, and (4) save the parameters of this Nengo DL model to load onto a neuromorphic chip (e.g., Loihi).

1. Load the data

To start, we'll load a dataset of audio/text pairs comprising a mix of positive examples of the target phrase and negative examples of other phrases. Note that you'll need to have downloaded a pickle file containing the dataset by following the instructions in the abr_speech README file (in brief, just run python download.py from within the /scripts directory of the abr_speech repo).

We can print some basic info about the contents of the dataset:

import pickle
import nengo
import nengo_dl
import numpy as np

from abr_speech.models import TFSpeechModel, SpikingSpeechModel

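# load the pickled dataset of audio/text pairs downloaded above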
with open('../data/example_data.pickle', 'rb') as pfile:
    dataset = pickle.load(pfile)

n_speakers = len(dataset.speakers)
print('Speakers: %d' % n_speakers)
print('Testing Items: %d' % len(dataset.test_data))
print('Training Items: %d' % len(dataset.train_data))

2. Train a TensorFlow alignment model

Now we can train a TensorFlow model that aligns windows of the audio signal with specific characters (e.g., "a", "o", etc.). This is relatively straightforward, though to get a good model it is advisable to monitor performance and possibly do additional training until the loss is acceptably low; just pass resume=True to tf_model.train(...) to resume training from the most recent checkpoint, as shown after the training code below. To get some intuition for how the alignments are learned, this blog post on connectionist temporal classification (CTC) is a good place to start: https://distill.pub/2017/ctc/

tf_checkpoints = './tf_model_checkpoints'

tf_model = TFSpeechModel(n_speakers=n_speakers, checkpoints=tf_checkpoints)
tf_model.train(dataset.train_data, rate=0.001, n_epochs=12)
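
If the loss is still too high after these epochs, training can be resumed from the most recent checkpoint rather than restarted from scratch, as noted above (the number of additional epochs here is just an example):

# continue training from the most recent checkpoint in tf_checkpoints
tf_model.train(dataset.train_data, rate=0.001, n_epochs=4, resume=True)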

We can also look at the average label error rate (LER) on the test data to see how well the model is doing. Note that we care mostly about the LER on positive examples, since we want to be highly accurate in those cases; for negative examples, we mainly just want to avoid mistaking the input for the target phrase (i.e., to avoid false positives).

print('Pos char LER on test data:')
pos_data = [x for x in dataset.test_data if x.text == 'aloha']
print('LER: %.4f' % tf_model.label_error_rate(pos_data))
print('')

print('Neg char LER on test data:')
neg_data = [x for x in dataset.test_data if x.text != 'aloha']
label_error = tf_model.label_error_rate(neg_data)
print('LER: %.4f' % label_error)
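
For reference, the label error rate is the edit (Levenshtein) distance between the predicted and target character sequences, normalized by the target length. A minimal sketch of that computation (the abr_speech implementation may differ in its details):

def char_label_error_rate(predicted, target):
    # Levenshtein distance via dynamic programming
    d = np.zeros((len(predicted) + 1, len(target) + 1), dtype=int)
    d[:, 0] = np.arange(len(predicted) + 1)
    d[0, :] = np.arange(len(target) + 1)
    for i in range(1, len(predicted) + 1):
        for j in range(1, len(target) + 1):
            sub = d[i - 1, j - 1] + (predicted[i - 1] != target[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    # normalize by the length of the target sequence
    return d[-1, -1] / len(target)

print(char_label_error_rate('alona', 'aloha'))  # one substitution -> 0.2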

3. Train a Nengo DL model

Next we'll use the aligned (audio_window, target_character) pairs produced by the TensorFlow model to train a spiking neural network in Nengo DL. This training procedure involves first building a rate-based model that uses a differentiable approximation of the leaky integrate-and-fire (LIF) neuron model. Once this rate-based model is trained to predict the right output character for each input audio window, we can swap in a spiking neuron model. Filtered spike trains only approximate the instantaneous rates used during training, which introduces noise; to minimize this noise, we increase the firing rate of the spiking neurons while shrinking the amplitude of each spike, so that the filtered output tracks the ideal rate more closely.
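
To see why this scaling works, here is a quick standalone illustration (not part of the original notebook): the mean of a lowpass-filtered spike train is roughly rate * amplitude, so multiplying the rate by 5 while dividing the amplitude by 5 leaves the mean unchanged and shrinks the fluctuations around it.

dt = 0.001   # simulation timestep (s)
steps = 1000

def filtered_stats(rate, amp, tau=0.02, seed=0):
    # Bernoulli spike train: a spike of area `amp` occurs with
    # probability rate * dt on each timestep
    rng = np.random.RandomState(seed)
    spikes = (rng.rand(steps) < rate * dt) * (amp / dt)
    # exponential lowpass filter with time constant tau
    decay = np.exp(-dt / tau)
    out = np.zeros(steps)
    for i in range(1, steps):
        out[i] = decay * out[i - 1] + (1 - decay) * spikes[i]
    return out.mean(), out.std()

print(filtered_stats(100, 0.01))      # baseline: mean ~1, noisy
print(filtered_stats(500, 0.01 / 5))  # 5x rate, 1/5 amplitude: same mean, less noise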

Below, we'll first define the rate-based and spiking neuron models to be used. Then, we'll use the TensorFlow model to build an aligned dataset of (audio_window, target_character) pairs. After this, we can train the rate model in Nengo DL on the aligned dataset and compute a loss measure on both the training and test sets.

nengo_checkpoints = './nengo_model_checkpoints'

# define scaling parameters for reducing spike noise in spiking implementation
base_amp = 0.01
softlif_scale = 1
lif_scale = 5
max_rate = 100

# define neuron models for training and for inference
softlifs = nengo_dl.SoftLIFRate(
    tau_rc=0.02, tau_ref=0.002, sigma=0.002, amplitude=base_amp/softlif_scale)

lifs = nengo.LIF(tau_rc=0.02, tau_ref=0.001, amplitude=base_amp/lif_scale)
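# note: the spike amplitude is divided by lif_scale here; build_network
# (below) is assumed to raise the firing rates by the same factor via its
# scale and max_rate arguments, so the filtered spike output matches the
# trained rates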

# convert data to (n_items, n_steps, n_features) Nengo DL node format
nengo_train_data = tf_model.create_nengo_data(dataset.train_data, n_steps=1)
nengo_test_data = tf_model.create_nengo_data(dataset.test_data, n_steps=1)

# initialize the model
nengo_model = SpikingSpeechModel(n_neurons=tf_model.n_per_layer,
                                 n_inputs=tf_model.size_in,
                                 n_chars=tf_model.n_chars,
                                 n_ids=tf_model.n_speakers,
                                 n_layers=2,
                                 checkpoints=nengo_checkpoints)

# build the network using softlifs for training, then train
nengo_model.build_network(softlifs, softlif_scale, max_rate)
nengo_model.train(nengo_train_data, rate=0.001, n_epochs=15)

# display the loss for each dataset after training 
print('Train loss: %.2f' % nengo_model.compute_error_metric(nengo_train_data))
print('Test loss: %.2f' % nengo_model.compute_error_metric(nengo_test_data))

Now that we have a trained model, we can substitute in the spiking neurons, rebuild the network, and evaluate the model's performance.

# rebuild the network using LIF neurons for spiking inference
nengo_model.build_network(lifs, lif_scale, max_rate)
nengo_model.set_probes(char_synapse=0.005, id_synapse=None, d_synapse=0.02)

# format the data as continuous streams (10 ms per window) for evaluation
train_stream = tf_model.create_nengo_data(
    dataset.train_data, n_steps=10, stream=True, itemize=True)
test_stream = tf_model.create_nengo_data(
    dataset.test_data, n_steps=10, stream=True, itemize=True)

# compute d vectors and summary statistics on the test stream, then display them
nengo_model.compute_d_vectors(test_stream)
nengo_model.compute_statistics(test_stream)

print(nengo_model.stats)

We can also look at some of the predictions made by the model to get a qualitative feel for its performance:

for arrays, text, speaker_id, _ in test_stream[:2]:
    p_text = nengo_model.decode_audio(arrays)
    print('Correct: %s' % text)
    print('Predicted: %s' % p_text)
    print('')

4. Save parameters and build a reference Nengo model to run on-chip

With a working spiking implementation of the keyword spotter running in Nengo DL, we can move on to saving the network parameters so they can be loaded into a reference Nengo model that can run on a chip such as Loihi. This model can easily be interfaced with a microphone and an audio preprocessing module to build a proper demo, as illustrated in reference_demo_model.py in the /scripts directory.

# save and reload the parameters for each Nengo object
nengo_model.save_param_dict('./parameters.pickle')

with open('./parameters.pickle', 'rb') as pfile:
    params = pickle.load(pfile)

# build the reference model using the saved parameters    
with nengo.Network() as model:
    model.config[nengo.Connection].synapse = None

    inp = nengo.Node(None, size_in=tf_model.size_in)
    out = nengo.Node(None, size_in=tf_model.n_chars)
    bias = nengo.Node(1)

    layer_1 = nengo.Ensemble(n_neurons=tf_model.n_per_layer, dimensions=1,
                             neuron_type=lifs,
                             gain=params['x_c_0']['gain'],
                             bias=params['x_c_0']['bias'],
                             label='Layer 1')

    layer_2 = nengo.Ensemble(n_neurons=tf_model.n_per_layer, dimensions=1,
                             neuron_type=lifs,
                             gain=params['x_c_1']['gain'],
                             bias=params['x_c_1']['bias'],
                             label='Layer 2')

    nengo.Connection(
        inp, layer_1.neurons, transform=params['input_node -> x_c_0'])

    nengo.Connection(
        layer_1.neurons, layer_2.neurons, transform=params['x_c_0 -> x_c_1'])
    
    nengo.Connection(
        layer_2.neurons, out, transform=params['x_c_1 -> char_output'])

    nengo.Connection(bias, out, 
        transform=np.expand_dims(params['char_output_bias'], axis=1))
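
    # hypothetical addition (not in the original script): probe the filtered
    # character output so it can be inspected after running the simulator
    char_probe = nengo.Probe(out, synapse=0.005)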

    
sim = nengo.Simulator(model)

# to run on Loihi, use something like the following instead
# import nengo_loihi
# sim = nengo_loihi.Simulator(model)
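
As a quick sanity check, we can run the reference model briefly and read back the probed output (a minimal sketch using the hypothetical char_probe added above; with nothing connected to inp the input is zero, so this only verifies that the network builds and runs):

sim.run(0.1)  # simulate 100 ms
char_activity = sim.data[char_probe]  # array of shape (n_steps, n_chars)
print(char_activity.shape)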