
Speech Emotion Classification

Emotion analysis is widely used across many industries. Companies providing online services use it to analyse customer feedback. In clinical research, it is used to assess the state of patients. Human Resources departments use it for applications such as attrition prediction. The insurance industry uses it to detect fraud. And there are many more use cases for emotion analysis.

Emotion analysis is generally conducted on text data, a field of Natural Language Processing (NLP) in which words are analysed to determine the sentiment and/or emotion. In this article, we discuss emotion analysis from speech. Here, aspects such as pitch, tone, speaking pace and loudness are analysed to determine the emotion expressed in a speech sample.

Detecting emotions from speech requires machine learning. The field is still in its nascent stage, as there are not many reliable datasets available for building good machine learning models. In this article, we discuss the mechanics of building such models. The models can be improved by supplying better quality data and/or by improving the model itself.

Dataset

The datasets used in this article have been obtained from Kaggle.

TESS Dataset

The first dataset used in this article is TESS (Toronto Emotional Speech Set). It contains 2800 files. A set of 200 target words were spoken in the carrier phrase “Say the word _” by two actresses, and the set was recorded in seven different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). Both actresses spoke English as their first language, were university educated, and had musical training. Audiometric testing indicated that both actresses had thresholds within the normal range.

Ravdess Dataset

The second dataset used in this article is RAVDESS (The Ryerson Audio-Visual Database of Emotional Speech and Song). This dataset contains 1440 files: 60 trials per actor × 24 actors = 1440. RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

File naming convention

Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

Filename identifiers

  • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
  • Vocal channel (01 = speech, 02 = song).
  • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
  • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the ‘neutral’ emotion.
  • Statement (01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”).
  • Repetition (01 = 1st repetition, 02 = 2nd repetition).
  • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Filename example: 03-01-06-01-02-01-12.wav

- Audio-only - 03
- Speech - 01
- Fearful - 06
- Normal intensity - 01
- Statement "dogs" - 02
- 1st Repetition - 01
- 12th Actor - 12 (female, as the actor ID number is even); see the decoding sketch below.
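
To illustrate the naming scheme, here is a small sketch that is not part of the original notebook (the function name decodeRavdessFilename is purely illustrative). It decodes a RAVDESS-style filename into the fields most relevant for this article.

import os

# Emotion codes from the RAVDESS naming convention (third field of the identifier)
emotionCodes = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
                '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def decodeRavdessFilename(path):
    # '03-01-06-01-02-01-12.wav' -> ['03', '01', '06', '01', '02', '01', '12']
    parts = os.path.basename(path).split('.')[0].split('-')
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {'emotion': emotionCodes[emotion],
            'intensity': 'normal' if intensity == '01' else 'strong',
            'actor': int(actor),
            'actorSex': 'male' if int(actor) % 2 == 1 else 'female'}

print(decodeRavdessFilename('03-01-06-01-02-01-12.wav'))
# {'emotion': 'fearful', 'intensity': 'normal', 'actor': 12, 'actorSex': 'female'}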

Links to download the datasets

To obtain the Tess and Rav Datasets, click this link.

To obtain the Test Dataset, click this link.

Information

Librosa

Librosa is a Python package built for speech and audio analysis. It provides modular functions that simplify working with audio data and support a wide range of applications, such as identifying characteristics of different speakers’ voices or detecting emotions from audio samples.

For further details on the Librosa package, refer here.
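
As a minimal sketch of what the package offers (assuming librosa is already installed, which is covered later in the article, and that 'sample.wav' stands in for any local WAV file), loading a clip and reporting its duration and sampling rate takes only a few lines:

import librosa

# Load a local WAV file (placeholder path); librosa resamples to 22,050 Hz by default
signal, samplingRate = librosa.load('sample.wav')

print('Duration (seconds):', librosa.get_duration(y = signal, sr = samplingRate))
print('Sampling rate (Hz):', samplingRate)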

In [1]:
import warnings
warnings.filterwarnings('ignore')

Building the Model

Now we start building our model for extracting emotions from the audio files.

Loading the Tess data and Ravdess data audio files

We start by loading the datasets. Using the links provided in the above section, one would get zipped files containing the audio files. The ZIP files need to be unzipped before proceeding to the next step.

We use the glob library in Python to load the files. The appropriate path where the files in the dataset have been downloaded will need to be provided to the glob() function. In my case, the files are stored in the same directory where my Jupyter Notebook is stored.

In [2]:
import glob

tessFiles = glob.glob('Ravdess_Tess/Tess/**/*.wav', recursive = True)
ravdessFiles = glob.glob('Ravdess_Tess/ravdess/**/*.wav', recursive = True)

print("Number of Tess Files loaded:", len(tessFiles))
print("Number of Ravdess Files loaded:", len(ravdessFiles))
Number of Tess Files loaded: 2679
Number of Ravdess Files loaded: 1168

Play sample audios

Once the data (audio files) are loaded, we can play them using the code as shown below.

In [3]:
import IPython.display as ipd

sampleAudio1 = ravdessFiles[0]
ipd.Audio(sampleAudio1)
Out[3]:

In [4]:
sampleAudio2 = tessFiles[100]
ipd.Audio(sampleAudio2)
Out[4]:

Exploring and Visualising the data

From the dataset, we first need to establish the distribution of the data. The distribution of the data is a key aspect of building a good model. We know that the data contains audio clips of 8 different emotions. We need to check whether the numbers of audio files in the dataset for each emotion are similar or dissimilar. If the numbers of audio files for the emotions are similar, we say that the dataset is balanced. Otherwise, the dataset is termed unbalanced.

If the dataset is balanced, the model gives roughly equal weight to each emotion. If not, the model could be biased towards the emotions for which more audio clips are available.

In case the dataset is unbalanced, we can balance the dataset algorithmically. We will discuss this later in the article.

Visualise the distribution of all the labels

We first establish how many audio clips we have for each emotion. Here, the emotion expressed in an audio clip is our label. We tell the machine that a particular audio clip expresses a particular emotion; based on this knowledge, the machine is expected to learn how to determine the emotion expressed in a new audio clip it is given.

The emotion associated with every audio clip is encoded in its file name, as described earlier. The code below extracts the text associated with the emotion from the file name and associates it with the audio clip. Please note that the file naming conventions for the Tess files and the Ravdess files are different. Also note that the variables tessFiles and ravdessFiles contain the full paths of all the audio clip files.

In [5]:
emotionsInTessFiles = [fileName.split("/")[-1].split("_")[-1].split(".")[0].lower() for fileName in tessFiles]
emotionsInRavdessFiles = [fileName.split("/")[-1].split("_")[-1].split(".")[0].lower() for fileName in ravdessFiles]

We now check which emotions are represented by audio clips in the Tess files and the Ravdess files.

In [6]:
print('Audio Clips of Emotions available in Tess Files:', set(emotionsInTessFiles))
print('Audio Clips of Emotions available in Ravdess Files:', set(emotionsInRavdessFiles))
Audio Clips of Emotions available in Tess Files: {'fear', 'sad', 'happy', 'surprised', 'neutral', 'angry', 'disgust'}
Audio Clips of Emotions available in Ravdess Files: {'fear', 'sad', 'happy', 'surprised', 'neutral', 'angry', 'disgust'}

Notice that we have no samples for the emotion calm, so the model we will build cannot detect calm.

Next, let us count how many audio files for each emotion are available in the 2 datasets.

In [7]:
import pandas as pd

print('Number of Audio Clips available for each Emotion in the Tess Files')
dfTessEmotions = pd.DataFrame(emotionsInTessFiles, columns = ['Emotion'])
dfTessCount = dfTessEmotions.value_counts().reset_index(name = 'count')
dfTessCount
Number of Audio Clips available for each Emotion in the Tess Files
Out[7]:
Emotion count
0 disgust 391
1 surprised 387
2 happy 383
3 angry 382
4 fear 379
5 sad 379
6 neutral 378
In [8]:
print('Number of Audio Clips available for each Emotion in the Rav Files')
dfRavdessEmotions = pd.DataFrame(emotionsInRavdessFiles, columns = ['Emotion'])
dfRavdessCount = dfRavdessEmotions.value_counts().reset_index(name = 'count')
dfRavdessCount
Number of Audio Clips available for each Emotion in the Rav Files
Out[8]:
Emotion count
0 sad 183
1 fear 182
2 surprised 182
3 disgust 180
4 angry 179
5 happy 174
6 neutral 88
In [9]:
print('Number of Audio Clips available for each Emotion in both the Tess and Rav Files combined together')
# DataFrame.append has been removed in recent pandas versions; pd.concat does the same job
dfCombined = pd.concat([dfTessEmotions, dfRavdessEmotions], ignore_index = True)
dfCombinedCount = dfCombined.value_counts().reset_index(name = 'count')
dfCombinedCount
Number of Audio Clips available for each Emotion in both the Tess and Rav Files combined together
Out[9]:
Emotion count
0 disgust 571
1 surprised 569
2 sad 562
3 angry 561
4 fear 561
5 happy 557
6 neutral 466

As we will be building the model by combining the Tess and Ravdess datasets, let us look at the distribution of the number of audio clips across the different emotions for the combined set.

In [10]:
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sb

sb.barplot(x = dfCombinedCount['Emotion'], y = dfCombinedCount['count'])
plt.show()
Distribution Chart

We see that we have more or less the same number of files for all emotions except neutral. We will proceed with the data as it is for now; later we will see how we can balance the dataset.

Visualise sample audio signal using librosa

We will now visualise audio signals in the audio files. For doing this, we will use the librosa library.

In most cases, the librosa library will need to be installed. Use the code below to install it on your machine. To install librosa, you need the numba library at version 0.53 or later (at the time of writing this article). Check whether you have the appropriate version of numba installed on your machine. The code below installs numba and then installs librosa, to be safe.

In [11]:
!pip -qq install numba
!pip -qq install librosa

Now, we can use the librosa library to visualise the wave form in the audio files.

Before we can visualise the waveform, we need to load the audio clip. An audio clip is a time series: a sequence of amplitude values sampled at regular points in time.

The librosa.load() function returns this audio time series as a 1-dimensional array (for mono) or a 2-dimensional array (for stereo), together with the sampling rate, which determines how many array elements correspond to one second of audio. The elements of the array represent the amplitude of the sound wave at each sample point.

For the physics enthusiasts: for a wave travelling at a fixed speed, it is frequency and wavelength that are inversely proportional; the amplitude of a wave is independent of its frequency.

In short, librosa.load() loads an audio file and decodes it into a time series X and a sampling rate sr (the sampling rate of X). The default sr is 22,050 Hz.

In [12]:
import librosa

X, samplingRate = librosa.load(sampleAudio1)
print('X\n', X)
print('\nNumber of elements in X:', len(X))
print('Sampling Rates:', samplingRate)
X
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ... -2.2909328e-05
 -2.3180044e-06  0.0000000e+00]

Number of elements in X: 91231
Sampling Rates: 22050

Now we plot the waveform for the sample audio clip.

In [13]:
import librosa.display

plt.figure(figsize=(15, 5))
librosa.display.waveshow(X, sr = samplingRate)
plt.show()
Wave Form

We display the wave form for another sample audio clip.

In [14]:
X, samplingRate = librosa.load(sampleAudio2, mono = False)

plt.figure(figsize=(15, 5))
librosa.display.waveshow(X, sr = samplingRate)
plt.show()
Wave Form

Extracting Features

Next, we need to extract features from the audio clips, again using the librosa library. We will take one audio clip and apply individual librosa functions to see what features we can extract.

Load the Audio Clip

The first step is to load the audio clip using the librosa.load() function. We have discussed how to load an audio clip in the previous section.

Read one WAV file at a time using Librosa; refer to the supplementary notebook (‘Audio feature extraction’).

To know more about Librosa, explore the link.

In [15]:
X, samplingRate = librosa.load(sampleAudio1)
print('X\n', X)
print('\nNumber of elements in X:', len(X))
print('Sampling Rates:', samplingRate)

plt.figure(figsize=(15, 5))
librosa.display.waveshow(X, sr = samplingRate)
plt.show()
X
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ... -2.2909328e-05
 -2.3180044e-06  0.0000000e+00]

Number of elements in X: 91231
Sampling Rates: 22050
Wave Form

Pre-Emphasis

The next step is to apply a pre-emphasis filter to the signal to amplify the high frequencies. A pre-emphasis filter is useful in several ways: (1) it balances the frequency spectrum, since high frequencies usually have smaller magnitudes than lower frequencies, (2) it helps avoid numerical problems during the Fourier transform operation, and (3) it may also improve the signal-to-noise ratio (SNR).

The pre-emphasis filter can be applied to a signal X using the first order filter in the following equation:

y(t)=X(t)−αX(t−1)

which can be implemented using the following code, where typical values for the filter coefficient (α) are 0.95 or 0.97.

In [16]:
import numpy as np

preEmphasis = 0.97
emphasizedSignal = np.append(X[0], X[1:] - preEmphasis * X[:-1])
print('Emphasized Signal\n', emphasizedSignal)
print('Number of elements:', len(emphasizedSignal))

plt.figure(figsize=(15, 5))
librosa.display.waveshow(emphasizedSignal, sr = samplingRate)
plt.show()
Emphasized Signal
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 1.4261317e-05 1.9904044e-05
 2.2484644e-06]
Number of elements: 91231
Wave Form

Apply the Short-Time Fourier Transform (STFT) to the signal

The Short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In practice, the procedure for computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the Fourier transform separately on each shorter segment. This reveals the Fourier spectrum on each shorter segment.

The parameter n_fft is the length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft = 2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz (the default sample rate in librosa) and is well suited to music signals. In speech processing, however, the recommended value is 512, corresponding to 23 milliseconds at the same sample rate; with n_fft = 512, the STFT matrix has 1 + 512/2 = 257 rows, which matches the shape printed below.

In [17]:
stft = np.abs(librosa.stft(emphasizedSignal, n_fft = 512))

print('STFT\n', stft)
print('\nSTFT Shape:', stft.shape)
STFT
 [[0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 2.7251750e-05
  3.9638679e-05 1.7839702e-05]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 4.9780352e-05
  7.5593362e-06 3.2044696e-05]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 6.8415771e-05
  3.0557694e-05 4.4763026e-05]
 ...
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 4.9180654e-10
  1.2953085e-07 2.1654926e-06]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 4.3327109e-10
  1.3034068e-07 2.1655023e-06]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 ... 1.8417015e-10
  1.2918262e-07 2.1647879e-06]]

STFT Shape: (257, 713)

Extract Mel-frequency cepstral coefficients (MFCCs)

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system’s response more closely than the linearly-spaced frequency bands used in the normal spectrum. This frequency warping can allow for better representation of sound.

Even though higher-order coefficients represent increasing levels of spectral detail, 12 to 20 cepstral coefficients are typically sufficient for speech analysis, depending on the sampling rate and estimation method. The parameter n_mfcc is the number of MFCCs to return.

MFCCs are one of the feature sets we will use for our model.

In [18]:
# Compute MFCCs
mfccs = np.mean(librosa.feature.mfcc(y = emphasizedSignal, sr = samplingRate, n_mfcc = 20).T,axis=0)

print('MFCCs\n', mfccs)
print('\nShape of MFCCs:', mfccs.shape)
MFCCs
 [-691.9465      16.475983   -41.23284     30.430498   -37.66319
    9.062544   -32.74897     -3.751106   -16.79984      1.284387
   -8.620657   -13.641662     7.2394624  -15.900555     2.3270628
  -11.306092     1.4221047   -7.637101    -4.7733912   -2.9576488]

Shape of MFCCs: (20,)

Let us plot the Spectrogram.

In [19]:
plt.specgram(mfccs, Fs = samplingRate)
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()
Spectrogram

Compute chroma features

In Western music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, also referred to as “pitch class profiles”, are a powerful tool for analysing music whose pitches can be meaningfully categorised (often into twelve categories) and whose tuning approximates the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.

We can extract chroma from speeches to gather information about the pitch.

The parameter n_fft is the length of the Fast Fourier Transformation (FFT) window.

The parameter hop_length is the number of samples between successive frames.

Chroma features are another feature set to be used in our model.

In [20]:
chroma = np.mean(librosa.feature.chroma_stft(S = stft, sr = samplingRate,
                                             n_fft = 2048, hop_length = 1024).T, axis = 0)
print('Chroma Features\n', chroma)
print('\nShape of Chroma Features:', chroma.shape)
Chroma Features
 [0.50195783 0.4888782  0.4903264  0.5083782  0.5654427  0.6067438
 0.6496609  0.7068625  0.7637172  0.7783719  0.6315559  0.51893705]

Shape of Chroma Features: (12,)

Let us plot the Spectrogram.

In [21]:
plt.specgram(chroma, Fs = samplingRate)
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()
Spectrogram

Compute Mel Spectrogram

In 1937, Stevens, Volkmann, and Newman proposed a unit of pitch such that equal distances in pitch sounded equally distant to the listener. This is called the mel scale. A mel spectrogram is simply a spectrogram whose frequency axis has been mapped onto the mel scale.
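
For reference, a commonly used conversion (the HTK variant; librosa also supports the Slaney formulation) from a frequency f in Hz to mels m is:

m = 2595 × log10(1 + f / 700)

So, for example, 1000 Hz corresponds to roughly 1000 mels, while higher frequencies are increasingly compressed on the mel scale.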

Mel spectrograms are the third feature set for our model.

In [22]:
mel = np.mean(librosa.feature.melspectrogram(y = emphasizedSignal, 
                                             sr = samplingRate, 
                                             n_fft = 2048, hop_length = 1024).T,axis=0)
mel = librosa.power_to_db(mel, ref = np.max)
print('Mel Features\n', mel)
print('\nShape of Mel Features:', mel.shape)
Mel Features
 [-61.801987   -55.450996   -50.93421    -49.083176   -36.02623
 -20.011873   -16.51483    -11.3316555  -13.352919   -22.661512
 -31.087145   -23.020506   -13.105236   -12.087397    -7.0429688
  -4.0740967   -6.888693   -11.014183   -14.583776   -13.200335
 -10.399036    -7.8551064   -7.265728    -7.757189    -7.788246
  -6.970579   -10.638233   -17.545675   -16.610128   -16.1247
 -13.935667    -8.652401    -1.6291714   -1.3393536   -7.800398
 -16.06399    -19.65119    -20.810633   -21.928988   -14.4132
 -12.689123   -13.038822   -10.37067    -17.309526   -21.313086
 -21.712477   -16.863276   -16.091433   -15.040375   -15.491827
 -14.489065   -18.051981   -20.749914   -18.52836    -18.619913
 -17.552462   -20.481653   -23.189596   -19.737371   -16.960604
 -16.758047   -19.439604   -18.422312   -16.890676   -15.539293
 -16.266619   -16.751532   -14.235237   -16.51673    -16.35687
 -17.693552   -21.775587   -21.612665   -17.628744   -19.65919
 -19.551107   -18.433867   -18.725119   -17.297182   -16.050035
 -12.273521   -12.48354    -14.198645   -13.677156   -11.450945
  -9.101324    -9.204962   -10.222328    -8.791361    -9.5330715
 -10.055771   -11.536472   -10.589256   -12.779024   -14.519482
 -14.234987   -12.8337345  -12.06805    -14.898588   -11.875582
  -8.260246    -7.2238255   -7.3592777   -4.006094    -1.3495064
   0.          -1.7769852   -1.3771267   -0.64060783  -1.0396976
  -0.49908638  -0.60912514  -1.7561302   -9.229126   -21.85959
 -38.128456   -72.50883    -80.         -80.         -80.
 -80.         -80.         -80.         -80.         -80.
 -80.         -80.         -80.        ]

Shape of Mel Features: (128,)

Let us plot the Spectrogram.

In [23]:
plt.specgram(mel, Fs = samplingRate)
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()
Spectrogram

Function to collect all the features from an audio clip

Now that we have discussed which features we can collect from the audio clips, let us write a function that collects all of them from a single audio clip.

In [24]:
def extractFeatures(fileName):
    features = np.array([]) # Variable to store all the features
    
    # Load the Audio Clip
    X, samplingRate = librosa.load(fileName)
    
    # Apply pre-emphasis
    preEmphasis = 0.97
    emphasizedSignal = np.append(X[0], X[1:] - preEmphasis * X[:-1])
    
    # Apply the Short-Time Fourier Transform (STFT) to the signal
    stft = np.abs(librosa.stft(emphasizedSignal, n_fft = 512))
    
    # Compute mfcc and collect features
    mfccs = np.mean(librosa.feature.mfcc(y = emphasizedSignal, sr = samplingRate, n_mfcc = 20).T,axis=0)
    features = np.hstack((features, mfccs))
    
    # Compute chroma features and collect features
    chroma = np.mean(librosa.feature.chroma_stft(S = stft, sr = samplingRate,
                                                 n_fft = 2048, hop_length = 1024).T, axis = 0)
    features = np.hstack((features, chroma))

    # Compute melspectrogram and collect features
    mel = np.mean(librosa.feature.melspectrogram(y = emphasizedSignal, 
                                                 sr = samplingRate, 
                                                 n_fft = 2048, hop_length = 1024).T,axis=0)
    mel = librosa.power_to_db(mel, ref = np.max)
    features = np.hstack((features, mel))

    return features

Let us test our function by collecting the features from our sample audio clip. We expect 160 features per clip: 20 MFCCs, 12 chroma features, and 128 mel spectrogram values.

In [25]:
features = extractFeatures(sampleAudio1)
print('Features\n', features)
print('\nNumber of Features:', len(features))
Features
 [-6.91946472e+02  1.64759827e+01 -4.12328415e+01  3.04304981e+01
 -3.76631889e+01  9.06254387e+00 -3.27489700e+01 -3.75110602e+00
 -1.67998409e+01  1.28438699e+00 -8.62065697e+00 -1.36416616e+01
  7.23946238e+00 -1.59005547e+01  2.32706285e+00 -1.13060923e+01
  1.42210472e+00 -7.63710117e+00 -4.77339125e+00 -2.95764875e+00
  5.01957834e-01  4.88878191e-01  4.90326405e-01  5.08378208e-01
  5.65442681e-01  6.06743813e-01  6.49660885e-01  7.06862509e-01
  7.63717175e-01  7.78371871e-01  6.31555915e-01  5.18937051e-01
 -6.18019867e+01 -5.54509964e+01 -5.09342117e+01 -4.90831757e+01
 -3.60262299e+01 -2.00118732e+01 -1.65148296e+01 -1.13316555e+01
 -1.33529186e+01 -2.26615124e+01 -3.10871449e+01 -2.30205059e+01
 -1.31052361e+01 -1.20873966e+01 -7.04296875e+00 -4.07409668e+00
 -6.88869286e+00 -1.10141830e+01 -1.45837765e+01 -1.32003345e+01
 -1.03990364e+01 -7.85510635e+00 -7.26572800e+00 -7.75718880e+00
 -7.78824615e+00 -6.97057915e+00 -1.06382332e+01 -1.75456753e+01
 -1.66101284e+01 -1.61247005e+01 -1.39356670e+01 -8.65240097e+00
 -1.62917137e+00 -1.33935356e+00 -7.80039787e+00 -1.60639896e+01
 -1.96511898e+01 -2.08106327e+01 -2.19289875e+01 -1.44132004e+01
 -1.26891232e+01 -1.30388222e+01 -1.03706703e+01 -1.73095264e+01
 -2.13130856e+01 -2.17124767e+01 -1.68632755e+01 -1.60914326e+01
 -1.50403748e+01 -1.54918270e+01 -1.44890652e+01 -1.80519810e+01
 -2.07499142e+01 -1.85283604e+01 -1.86199131e+01 -1.75524616e+01
 -2.04816532e+01 -2.31895962e+01 -1.97373714e+01 -1.69606037e+01
 -1.67580471e+01 -1.94396038e+01 -1.84223118e+01 -1.68906765e+01
 -1.55392933e+01 -1.62666187e+01 -1.67515316e+01 -1.42352371e+01
 -1.65167294e+01 -1.63568707e+01 -1.76935520e+01 -2.17755871e+01
 -2.16126652e+01 -1.76287441e+01 -1.96591892e+01 -1.95511074e+01
 -1.84338665e+01 -1.87251186e+01 -1.72971821e+01 -1.60500355e+01
 -1.22735214e+01 -1.24835396e+01 -1.41986446e+01 -1.36771564e+01
 -1.14509449e+01 -9.10132408e+00 -9.20496178e+00 -1.02223282e+01
 -8.79136086e+00 -9.53307152e+00 -1.00557709e+01 -1.15364723e+01
 -1.05892563e+01 -1.27790241e+01 -1.45194817e+01 -1.42349873e+01
 -1.28337345e+01 -1.20680504e+01 -1.48985882e+01 -1.18755817e+01
 -8.26024628e+00 -7.22382545e+00 -7.35927773e+00 -4.00609398e+00
 -1.34950638e+00  0.00000000e+00 -1.77698517e+00 -1.37712669e+00
 -6.40607834e-01 -1.03969765e+00 -4.99086380e-01 -6.09125137e-01
 -1.75613022e+00 -9.22912598e+00 -2.18595905e+01 -3.81284561e+01
 -7.25088272e+01 -8.00000000e+01 -8.00000000e+01 -8.00000000e+01
 -8.00000000e+01 -8.00000000e+01 -8.00000000e+01 -8.00000000e+01
 -8.00000000e+01 -8.00000000e+01 -8.00000000e+01 -8.00000000e+01]

Number of Features: 160

Extracting features from all the Audio Clips

Now that we have the function to extract features from Audio Clips, we extract the features from all the Audio Clips.

In [26]:
!pip -qq install tqdm
In [27]:
from tqdm import tqdm

def extractFeaturesFromFiles(files):
    X = []
    for file in tqdm(files):
        features = extractFeatures(file)

        X.append(features)
        
    return X

XTess = extractFeaturesFromFiles(tessFiles)
XRavdess = extractFeaturesFromFiles(ravdessFiles)
100%|██████████| 2679/2679 [09:11<00:00,  4.86it/s]
100%|██████████| 1168/1168 [05:58<00:00,  3.25it/s]

Now we combine the two sets of features (one obtained from the Tess files and one from the Ravdess files) into a single array.

In [28]:
X = np.vstack((XTess, XRavdess))

print('Shape of X:', X.shape)
Shape of X: (3847, 160)

Converting the labels to numbers

We have the labels for the files in the dataframe dfCombined. However, the labels are stored as strings, and we cannot use strings directly to train a model. We need to convert them to numbers. We will use a LabelEncoder to convert the string labels to numeric labels.

In [29]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dfCombined['EmotionLabel'] = le.fit_transform(dfCombined['Emotion'])

y = dfCombined['EmotionLabel']
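
To see which numeric code the encoder assigned to each emotion, a small sketch like the one below (using the fitted le object from the cell above) prints the mapping; le.classes_ lists the emotions in the order of their numeric codes.

# Print the numeric code assigned to each emotion by the LabelEncoder
for code, emotion in enumerate(le.classes_):
    print(code, '->', emotion)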

Creating a RandomForest Classification Model

We will create a multi-class Random Forest Classification Model.

Create the Training Set and Test Set

First, we create the training set, on which the model will be built, and the test set, which will be used to evaluate the model.

In [30]:
from sklearn.model_selection import train_test_split

XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size = 0.05, random_state = 42)

print('XTrain - Shape:', XTrain.shape, ' XTest - Shape:', XTest.shape)
print('yTrain - Shape:', yTrain.shape, ' yTest - Shape:', yTest.shape)
XTrain - Shape: (3654, 160)  XTest - Shape: (193, 160)
yTrain - Shape: (3654,)  yTest - Shape: (193,)

Create the Model

Now, we create the Random Forest Model on the Training set.

Among the parameters, class_weight is set to 'balanced' because we have a different number of samples for each label class.

In [31]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(bootstrap = False, 
                            max_depth = 16, 
                            n_estimators = 400, 
                            class_weight = 'balanced', 
                            random_state = 42)

rf.fit(XTrain, yTrain)
Out[31]:
RandomForestClassifier(bootstrap=False, class_weight='balanced', max_depth=16,
                       n_estimators=400, random_state=42)

Check the Prediction on the Training Set

We make the predictions using the model on the training set and check the metrics.

For the metrics, we will check the accuracy, precision, recall, and F1 score.

In [32]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make the predictions
yPredTrain = rf.predict(XTrain)

# Compute the metrics
accuracy = accuracy_score(yTrain, yPredTrain)
precision = precision_score(yTrain, yPredTrain, average = 'weighted')
recall = recall_score(yTrain, yPredTrain, average = 'weighted')
f1Score = f1_score(yTrain, yPredTrain, average = 'weighted')

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1Score)
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Check the Prediction on the Test Set

We now make the predictions on the Test Set and check the metrics.

In [33]:
# Make the predictions
yPredTest = rf.predict(XTest)

# Compute the metrics
accuracy = accuracy_score(yTest, yPredTest)
precision = precision_score(yTest, yPredTest, average = 'weighted')
recall = recall_score(yTest, yPredTest, average = 'weighted')
f1Score = f1_score(yTest, yPredTest, average = 'weighted')

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1Score)
Accuracy: 0.8860103626943006
Precision: 0.8954733016260285
Recall: 0.8860103626943006
F1 Score: 0.8872015465652233

We can see that the training accuracy is 100%, whereas the test accuracy is 88.6%. Clearly, the model is overfitting. Nevertheless, it is still capable of giving reasonable predictions.
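
As an optional check that is not part of the original notebook, one could estimate how well this configuration generalises with k-fold cross-validation on the full feature matrix. A sketch, assuming X and y from the cells above:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation of the same Random Forest configuration
rfCv = RandomForestClassifier(bootstrap = False,
                              max_depth = 16,
                              n_estimators = 400,
                              class_weight = 'balanced',
                              random_state = 42)
scores = cross_val_score(rfCv, X, y, cv = 5)
print('Accuracy per fold:', scores)
print('Mean accuracy:', scores.mean())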

Validating the Model on Live Data

The purpose of building the model is to make predictions on live data. To demonstrate the mechanics, I use the Test Dataset provided by Kaggle. The code below loads the Kaggle test data and makes predictions on it. Finally, the predictions are saved to a CSV file for submission to Kaggle.

The model developed in this article predicted with an accuracy of 81.094% on the Kaggle Test Dataset.

In [34]:
!unzip -qq Kaggle_Testset.zip
unzip:  cannot find or open Kaggle_Testset.zip, Kaggle_Testset.zip.zip or Kaggle_Testset.zip.ZIP.
In [35]:
# Set the model to Random Forest Model we just developed
MODEL = rf

# Load the Kaggle Files
kaggleFiles = glob.glob('KaggleDataSet/Kaggle_Testset/Kaggle_Testset/*.wav', recursive = False)

# Extract features from the Kaggle Files
XKaggle = extractFeaturesFromFiles(kaggleFiles)

# Make predictions for the emotions in the Kaggle Files
predKaggle = MODEL.predict(XKaggle)
predictedEmotions = le.inverse_transform(predKaggle)

# Build the submission dataframe
# (DataFrame.append has been removed in recent pandas versions, so build the frame in one go)
fileNumbers = [fileName.split("/")[-1].split(".")[0] for fileName in kaggleFiles]
dfKaggle = pd.DataFrame({'Id': fileNumbers, 'Label': predictedEmotions})

print(dfKaggle.head())

# Store the prediction in a CSV file in case you need to submit to Kaggle
dfKaggle.to_csv('KaggleSubmission.csv', index = False)
100%|██████████| 201/201 [01:07<00:00,  2.99it/s]
    Id    Label
0   16      sad
1  103  neutral
2  117    angry
3  116      sad
4  102     fear

Balancing the Dataset and Making the Model

In the last step, we will balance the dataset using Synthetic Minority Over-sampling Technique (SMOTE) and create the model on the balanced dataset. We will check the results after that.

In [36]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(k_neighbors=2,random_state=42)
XBalanced, yBalanced = sm.fit_resample(X, y)

XBTrain, XBTest, yBTrain, yBTest = train_test_split(XBalanced, yBalanced, test_size=0.05, random_state=42)

print('X Train Shape:', XBTrain.shape, ' X Test Shape:', XBTest.shape)
print('y Train Shape:', yBTrain.shape, ' y Test Shape:', yBTest.shape)

print('\nNumber of data points for each label in the Training set\n', yBTrain.value_counts())
print('\nNumber of data points for each label in the Test set\n', yBTest.value_counts())
X Train Shape: (3797, 160)  X Test Shape: (200, 160)
y Train Shape: (3797,)  y Test Shape: (200,)

Number of data points for each label in the Training set
 0    550
4    549
1    543
6    541
3    541
2    537
5    536
Name: EmotionLabel, dtype: int64

Number of data points for each label in the Test set
 5    35
2    34
3    30
6    30
1    28
4    22
0    21
Name: EmotionLabel, dtype: int64

Build the Model

We build the Random Forest Model on this balanced dataset.

In [37]:
# Build the model
rfb = RandomForestClassifier(bootstrap=False, max_depth=16, n_estimators=600)
rfb.fit(XBTrain, yBTrain)

# Make Predictions on the Training Data
yBPredTrain = rfb.predict(XBTrain)

# Make Predictions on the Test Data
yBPredTest = rfb.predict(XBTest)

# Compute the metrics for the Training Dataset
accuracy = accuracy_score(yBTrain, yBPredTrain)
precision = precision_score(yBTrain, yBPredTrain, average = 'weighted')
recall = recall_score(yBTrain, yBPredTrain, average = 'weighted')
f1Score = f1_score(yBTrain, yBPredTrain, average = 'weighted')

print('\nMetrics for the Training Set')
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1Score)

# Compute the metrics for the Test Dataset
accuracy = accuracy_score(yBTest, yBPredTest)
precision = precision_score(yBTest, yBPredTest, average = 'weighted')
recall = recall_score(yBTest, yBPredTest, average = 'weighted')
f1Score = f1_score(yBTest, yBPredTest, average = 'weighted')

print('\nMetrics for the Test Set')
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1Score)
Metrics for the Training Set
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Metrics for the Test Set
Accuracy: 0.845
Precision: 0.8512914707457484
Recall: 0.845
F1 Score: 0.8459180970551824

We see that the training accuracy is 100% and the test accuracy is about 85%. Clearly, this model is also overfitting, and it is overfitting more than the model we developed without balancing the data.

 

 
