Master's Thesis
Title:
Noisy Speech Recognition using Kaldi and
Neural Architectures
Author:
Ander González Docasal
Summary:
The goal of an Automatic Speech Recognition (ASR) system is to transform a set of acoustic
features into a sequence of words. It consists of the following parts: feature extraction,
which extracts linguistic information from the audio signal in the form of acoustic feature
vectors; the acoustic model, in charge of converting the acoustic vectors into phonemes;
and the language model, which returns the most probable sequence of words given the
detected phonemes.
Throughout their history, these systems were built with statistical methods, mainly Hidden
Markov Models (HMM) and Gaussian Mixture Models (GMM). However, in recent years the
use of neural architectures such as Deep, Convolutional and Recurrent Neural Networks
(DNN, CNN and RNN) has significantly improved on the results obtained earlier. Kaldi is
one of the best known and most widely used ASR systems. It includes several packages
that implement neural networks (nnet1, nnet2 and nnet3). These can be used to implement
the acoustic model because they are fast, accurate and able to handle large databases, the
latter by distributing the load across clusters of machines. However, because of Kaldi's
slow development cycle, new neural architectures are not implemented until many years
after their publication.
Therefore, in this work the acoustic model of Kaldi is replaced with our own implementations
written in TensorFlow, which has the largest user community and the best support compared
with other deep learning libraries. By replacing the acoustic model of Kaldi with different
architectures on the Aurora-4 database, the initial Word Error Rate (WER) of 15.14 % was
improved by 3.17 points when training with Convolutional Neural Networks. Likewise,
focusing only on the clean subset of the Test part of the database, the results were improved
further by implementing a CNN + RNN structure; specifically, the 4.54 % WER obtained
with the CNN alone was reduced to 4.13 % with this architecture.
This work therefore shows that the results obtained with one of the most widespread ASR
systems can be improved simply by implementing more advanced deep learning techniques.
It also shows that these techniques can be executed by other, more powerful dedicated
programs.
For future work, a deeper analysis of more complex CNNs could lead to better ASR
performance on this particular database and, in general, in noisy environments. When
working in clean conditions, however, the focus should be on CNNs and RNNs, since these
obtained the best results under those conditions.
Abstract:
The goal of an Automatic Speech Recognition (ASR) system is to transform a set of acoustic
features into a sequence of words. It consists of three main parts: the feature extraction
part, which extracts information from a speech signal; the acoustic model, in charge of the
conversion from speech to phonemes; and the language model, which transforms the
detected phonemes into the most probable sequence of words.
Throughout their history, these systems were built with statistical methods, mainly Hidden
Markov Models (HMM) and Gaussian Mixture Models (GMM). However, in recent years the
use of neural architectures such as Deep, Convolutional and Recurrent Neural Networks
(DNN, CNN and RNN) has improved the achieved results significantly. Moreover, freely
available tools have made ASR research develop quickly. Kaldi is one of the best known
and most widely used ASR systems. It includes a set of neural network packages (nnet1,
nnet2 and nnet3) which can be used for implementing the acoustic model. These are fast,
accurate and able to handle huge databases, since they distribute the load on clusters of
machines. However, Kaldi’s slow development cycle implies that new neural architectures
may be introduced many years after their publication.
Therefore, in this work we replace the neural acoustic model of Kaldi with our own
implementations written in TensorFlow. TensorFlow has the largest community of users and
the best support among the available deep learning libraries. By replacing the acoustic
model of Kaldi with different architectures and testing their performance on the well-known
Aurora-4 database, we managed to reduce the Word Error Rate (WER) by 3.17 percentage
points (from a baseline of 15.14 %) when using a CNN architecture. Also, focusing on just
the clean subset of the Test part of the database, a further improvement was achieved by
implementing a CNN + RNN structure, from a 4.54 % WER with the CNNs alone to 4.13 %
with this architecture.
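The WER figures above can be made concrete with a minimal sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and reads the 3.17 figure as an absolute reduction from the 15.14 % baseline; the function names and example sentences are hypothetical and are not taken from the thesis or from Kaldi's scoring tools.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Illustrative sketch based on word-level Levenshtein distance;
    not the scoring code used in the thesis.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences: one substitution and one deletion in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

    # Assumption: 15.14 % minus 3.17 points gives 11.97 % for the CNN.
    print(relative_reduction(15.14, 15.14 - 3.17))
    # Clean subset: CNN alone versus CNN + RNN.
    print(relative_reduction(4.54, 4.13))
```

Under these assumptions, the CNN result corresponds to a relative WER reduction of roughly 21 % over the baseline, and the CNN + RNN structure to roughly 9 % over the CNN alone on the clean subset.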
This work therefore shows that the results obtained by one of the most widely used ASR
tools can be improved simply by implementing more advanced deep learning techniques,
which can be executed by more powerful, dedicated external programs.
For future work, a further analysis of more complex convolutional networks could lead
to better performance on this particular database and, in general, in noisy environments.
Finally, further improvement of convolutional and recurrent architectures is suggested for
clean, noise-free conditions, since they have been shown to obtain the best results in
these specific circumstances.
Tutor:
Vassilis Tsiaras, George P. Kafentzis, Yannis Stylianou
Year:
2018