Speech To Text

CS+

December 27, 2017

Outline

  • Introduction
  • Signal Preprocessing
  • Acoustic Modeling
  • Language Modeling
  • Decoding & Searching

Introduction

Basic Approach for Large Vocabulary Speech Recognition

Block Diagram

Hierarchy of Research Areas

Research Areas

Reference

Signal Preprocessing

Audio Waveform

Speech Waveform

Transcription: "mining a year of speech"

MFCC

  • Mel-frequency cepstral coefficients
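  • In speech recognition and speaker recognition, the most commonly used speech feature is the Mel-frequency cepstral coefficient (MFCC). It accounts for how sensitive the human ear is to different frequencies, which makes it particularly well suited to speech recognition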
MFCC

Pre-emphasis

Pass the speech signal $s(t)$ through a high-pass filter

Pre-emphasis

$$ s'(t) = s(t) - a \cdot s(t-1),\ 0.9 \leq a \leq 1.0$$
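The filter above can be sketched in a few lines of plain Python. This is only an illustration: the function name `pre_emphasis` and the default `a = 0.95` are assumptions (any value in the stated range would do), and the first sample is passed through unchanged because it has no predecessor.

```python
def pre_emphasis(signal, a=0.95):
    """Apply s'(t) = s(t) - a * s(t-1) to a list of samples.

    The first sample has no predecessor, so it is copied as-is
    (an assumption of this sketch).
    """
    if not signal:
        return []
    out = [signal[0]]
    for t in range(1, len(signal)):
        out.append(signal[t] - a * signal[t - 1])
    return out


# A constant (zero-frequency) signal is almost entirely removed,
# which is the expected behavior of a high-pass filter.
emphasized = pre_emphasis([1.0, 1.0, 1.0, 1.0], a=0.95)
```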

Framing & Windowing

Frame blocking

  • Group every N samples into one frame (N = 256 or 512)
  • Each frame covers 20~30 ms, and adjacent frames overlap

Hamming window

  • Multiply framed signals with a window
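A minimal sketch of frame blocking followed by Hamming windowing. The hop size of half a frame is an assumption (the slides only say that frames overlap), and the helper names are hypothetical. The window uses the usual Hamming coefficients 0.54 and 0.46.

```python
import math


def frame_signal(signal, frame_size=256, hop=128):
    """Split a signal into overlapping frames of frame_size samples.

    hop is the distance between frame starts; hop < frame_size gives overlap.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(signal[start:start + frame_size])
    return frames


def hamming(n):
    """Hamming window of length n: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


signal = [1.0] * 1024
frames = frame_signal(signal)
window = hamming(256)
windowed = [apply_window(f, window) for f in frames]
```

Tapering both ends of each frame toward a small value reduces the discontinuity at the frame boundaries before the spectrum is computed.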

Hamming Window

Hamming Window

Windowed Signal (Noisy)

Windowing

Windowed Signal (Speech)

Windowing

Fast Fourier Transform

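The characteristics of a signal are usually hard to see from how it varies in the time domain, so it is usually converted to an energy distribution in the frequency domain. Different energy distributions represent the characteristics of different speech sounds. After windowing, each frame must therefore go through an FFT to obtain its energy distribution over the spectrum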
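As an illustration, the following sketch computes the power spectrum of one frame with a textbook recursive radix-2 FFT. It assumes the frame length is a power of two (as with N = 256 or 512), and the function names are hypothetical. A real system would normally call an optimized FFT library instead.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Squared magnitude of the non-negative-frequency bins (n/2 + 1 values)."""
    spectrum = fft(frame)
    half = len(frame) // 2 + 1
    return [abs(c) ** 2 for c in spectrum[:half]]


# A cosine that completes exactly 4 cycles in 64 samples
# puts its energy in bin 4.
frame = [math.cos(2 * math.pi * 4 * n / 64) for n in range(64)]
ps = power_spectrum(frame)
peak_bin = max(range(len(ps)), key=lambda k: ps[k])
```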

Triangular Bandpass Filters

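  • Multiply the energy spectrum by a bank of 20 triangular bandpass filters and compute the log energy of each filter's output
  • These 20 triangular bandpass filters are evenly spaced on the Mel frequency scale
  • The Mel frequency scale represents how the human ear perceives frequency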
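The filter bank described above might be built as follows. This is a sketch under stated assumptions: the Hz-to-Mel conversion uses the common formula mel(f) = 2595 log10(1 + f/700), which the slides do not spell out; the FFT size of 512 and the 16 kHz sample rate are illustrative; and all function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """Common Hz-to-Mel conversion (an assumption; not given in the slides)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters=20, fft_size=512, sample_rate=16000):
    """Return num_filters triangular filters over fft_size // 2 + 1 bins."""
    num_bins = fft_size // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge/center points, evenly spaced on the Mel scale.
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor(fft_size * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    filters = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * num_bins
        for k in range(left, center):          # rising edge
            weights[k] = (k - left) / (center - left)
        weights[center] = 1.0                  # peak of the triangle
        for k in range(center + 1, right):     # falling edge
            weights[k] = (right - k) / (right - center)
        filters.append(weights)
    return filters


def log_filter_energies(power, filters, floor=1e-10):
    """Log energy of each filter output; floor avoids log(0)."""
    return [math.log(max(sum(p * w for p, w in zip(power, f)), floor))
            for f in filters]


filters = mel_filterbank()
energies = log_filter_energies([1.0] * 257, filters)
```

Because the center points are evenly spaced in Mel rather than in Hz, the triangles are narrow at low frequencies and wide at high frequencies.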

Mel-frequency

Mel-frequency

Triangular Filters

Triangular filter

Discrete cosine transform

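  • Feed the 20 log energies $E_k$ from the previous step into a discrete cosine transform to obtain the $L$-th order Mel-scale cepstrum parameters ($L$ is usually 12)
  • This transforms the data back into something resembling the time domain, called the quefrency domain, which is in fact the cepstrum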
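A sketch of this step, assuming the DCT form $C_m = \sum_{k=1}^{N} E_k \cos[m (k - 0.5) \pi / N]$ for $m = 1, \dots, L$, which is a common choice for MFCC but is not stated in the slides. The function name `mel_cepstrum` is hypothetical.

```python
import math


def mel_cepstrum(log_energies, num_coeffs=12):
    """DCT of the log filter-bank energies E_1..E_N.

    Returns C_1..C_L with C_m = sum_k E_k * cos(m * (k - 0.5) * pi / N).
    """
    n = len(log_energies)
    return [
        sum(e * math.cos(m * (k - 0.5) * math.pi / n)
            for k, e in enumerate(log_energies, start=1))
        for m in range(1, num_coeffs + 1)
    ]


# A perfectly flat log spectrum has no spectral shape,
# so every coefficient comes out (numerically) zero.
flat = mel_cepstrum([1.0] * 20)
```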

Log energy

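The volume (that is, the energy) of a frame is also an important speech feature. We usually append the log energy of the frame, so that the basic speech feature of each frame has 13 dimensions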
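For illustration, the 13th dimension can be appended like this. The natural logarithm and the small floor value are assumptions of this sketch (some implementations use 10 log10 instead), and the function names are hypothetical.

```python
import math


def log_energy(frame, floor=1e-10):
    """Natural log of the sum of squared samples; floor avoids log(0)."""
    return math.log(max(sum(x * x for x in frame), floor))


def basic_features(cepstrum, frame):
    """Append the frame's log energy to its 12 cepstral coefficients."""
    return list(cepstrum) + [log_energy(frame)]


features = basic_features([0.0] * 12, [1.0, 1.0])
```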

Delta cepstrum

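  • So far, we have a 13-dimensional feature
  • In actual speech recognition, we usually also add the delta cepstrum (the first derivative) to capture how the cepstral coefficients change over time
  • Adding the second derivative as well gives a total of 39 dimensions (13 × 3) of Mel-frequency cepstral coefficients

This gives us the 39-dimensional MFCC. It is a broadly applicable feature that performs reasonably well across different speech processing tasks.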
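One way to compute the first and second derivatives is the regression formula sketched below. The window half-width of 2 frames and the edge handling (repeating the first and last frame) are assumptions, and the function names are hypothetical.

```python
def delta(features, width=2):
    """Regression-based time derivative of per-frame feature vectors.

    delta_t = sum_{m=1..width} m * (c[t+m] - c[t-m]) / (2 * sum_m m^2),
    with frames outside the utterance replaced by the nearest edge frame.
    """
    num_frames = len(features)
    denom = 2 * sum(m * m for m in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for m in range(1, width + 1):
                ahead = features[min(t + m, num_frames - 1)][d]
                behind = features[max(t - m, 0)][d]
                acc += m * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_deltas(features):
    """Concatenate static, delta, and delta-delta features per frame."""
    d1 = delta(features)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(features, d1, d2)]


# Ten frames whose 13 features all increase by 1 per frame.
static = [[float(t)] * 13 for t in range(10)]
full = add_deltas(static)
```

For this linearly increasing input, the first derivative in the interior of the utterance is 1 and the second derivative is 0.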

Reference

Acoustic Modeling

Acoustic Modeling

An acoustic model is the abstract unit modeled by an HMM, and one acoustic model usually contains several states. A syllable, a phoneme, or even a whole word can serve as the unit of an acoustic model

Phoneme Model

  • Monophone
  • Biphone
  • Triphone

Triphone Model

  • A phoneme model that takes both the left and right neighboring phonemes into account; assuming an inventory of 60 phonemes, the number of possible triphones is $$ 60^3 = 216000 $$
  • Very good generalizability and accuracy
  • Parameter-sharing techniques let us balance accuracy against trainability
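To make the context dependence concrete, the sketch below expands a phoneme sequence into triphone labels. The `left-center+right` label format and the `sil` boundary symbol are assumptions borrowed from common toolkits, not something the slides specify, and the example phoneme sequence is made up.

```python
def to_triphones(phonemes, boundary="sil"):
    """Label each phoneme with its neighbors as 'left-center+right'.

    The utterance is padded with a boundary symbol on both sides.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# With 60 phonemes, every (left, center, right) combination is a distinct model.
num_phonemes = 60
num_triphones = num_phonemes ** 3

labels = to_triphones(["s", "p", "iy", "ch"])
# labels == ["sil-s+p", "s-p+iy", "p-iy+ch", "iy-ch+sil"]
```

Most of the 216000 combinations never occur often enough in training data to estimate their parameters reliably, which is the motivation for parameter sharing.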

Parameter Sharing

Parameter Sharing

Clustering by Decision Trees

Decision Tree Clustering
Phone Model

Take a Look At HMM

Emoji HMM

hmm...

Hidden Markov Model

HMM

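$ \lambda_{HMM} = (A, B, \pi) $, where $A$ holds the state transition probabilities, $B$ the observation probabilities of each state, and $\pi$ the initial state probabilities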
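As a small worked example of the triple $(A, B, \pi)$, the sketch below evaluates the probability of an observation sequence with the forward algorithm. It assumes a discrete HMM, where $B$ is a table of observation probabilities per state; in acoustic modeling $B$ is usually a continuous density instead. The two-state numbers are made up for illustration.

```python
def forward_probability(A, B, pi, observations):
    """P(observations | lambda) for a discrete HMM lambda = (A, B, pi).

    A[i][j]: probability of moving from state i to state j
    B[i][o]: probability that state i emits observation symbol o
    pi[i]:   probability of starting in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state model with two observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
prob = forward_probability(A, B, pi, [0, 1])
```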

Gaussian Mixture Model

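A Gaussian mixture model (GMM) uses several Gaussian probability density functions to describe the distribution of a variable precisely. It is a statistical model that decomposes the distribution into a number of components, each based on a Gaussian probability density function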
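A one-dimensional sketch of the idea: the mixture density is a weighted sum of Gaussian densities, with weights that sum to 1. Real acoustic models use multivariate Gaussians over the feature vectors; the scalar case and the parameter values below are assumptions chosen for readability.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture: sum_k w_k * N(x; mean_k, var_k)."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Two equally weighted components centered at -1 and +1.
density = gmm_pdf(0.0, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
```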

Gaussian Mixture Model

Model Hierarchy (Biphone)

Acoustic Model Hierarchy

Reference

Language Modeling

LM Example

Entropy

$$ H(S) = - \sum_i{p(x_i) \log_2{p(x_i)}} $$


  • e.g. upper- and lowercase English letters, 52 equally likely symbols $$ H(S) = - \sum_{i=1}^{52}{\frac{1}{52} \log_2{\frac{1}{52}}} = \log_2{52} \approx 5.7\ \text{bits} $$
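The entropy formula can be checked numerically with a short sketch. The function name `entropy` is hypothetical; zero-probability symbols are skipped because they contribute nothing to the sum.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: -sum p * log2(p), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# 52 equally likely letters: log2(52), about 5.70 bits,
# so 6 bits are enough to encode one letter.
h_letters = entropy([1 / 52] * 52)

# A fair coin carries exactly 1 bit.
h_coin = entropy([0.5, 0.5])
```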

Perplexity

$$ PP(S) = 2^{H(S)} $$


  • e.g. upper- and lowercase English letters $$ PP(S) = 2^{\log_2{52}} = 52 $$
Perplexity
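Perplexity can be read as the effective number of equally likely choices. The sketch below (with a hypothetical `perplexity` helper) shows that a uniform distribution over 52 symbols has perplexity 52, while a skewed distribution over 4 symbols has a perplexity below 4.

```python
import math


def perplexity(probs):
    """2 ** H, where H is the Shannon entropy in bits of the distribution."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** h


pp_letters = perplexity([1 / 52] * 52)         # uniform over 52 symbols
pp_skewed = perplexity([0.7, 0.1, 0.1, 0.1])   # one symbol dominates
```

A language model that concentrates probability on the words that actually occur therefore has lower perplexity, which makes the recognition task easier.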

Smoothing

  • Add-one Smoothing
  • Back-off Smoothing
  • Interpolation Smoothing
  • Good-Turing Smoothing
  • Katz Smoothing
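As an example of the simplest technique in the list, here is a sketch of add-one smoothing for a bigram model: every bigram count is increased by 1, and the denominator grows by the vocabulary size so the probabilities still sum to 1. The sentence markers `<s>` and `</s>`, the function names, and the toy corpus are assumptions of this sketch.

```python
from collections import Counter


def train_bigram_add_one(sentences):
    """Return (prob, vocab) for an add-one smoothed bigram model.

    prob(prev, word) = (count(prev, word) + 1) / (count(prev) + |V|)
    """
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])                  # counts as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    v = len(vocab)

    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + v)

    return prob, vocab


prob, vocab = train_bigram_add_one([["a", "year", "of", "speech"], ["a", "year"]])
seen = prob("a", "year")      # bigram observed twice
unseen = prob("a", "speech")  # bigram never observed, but still non-zero
```

Add-one smoothing tends to move too much probability mass to unseen events, which is one reason the other methods in the list exist.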

Language Modeling

Language Modeling

Reference

Decoding & Searching

Block Diagram

Viterbi Algorithm

  • Dynamic Programming
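A minimal sketch of the dynamic programming idea for a discrete HMM: for each time step and state, keep only the best-scoring path that ends there, then trace back from the best final state. The two-state model is made up for illustration, and a practical decoder would work with log probabilities to avoid underflow.

```python
def viterbi(A, B, pi, observations):
    """Most likely state sequence and its probability for a discrete HMM.

    A[i][j]: transition probability, B[i][o]: emission probability,
    pi[i]: initial probability.
    """
    n = len(pi)
    score = [pi[i] * B[i][observations[0]] for i in range(n)]
    backpointers = []
    for obs in observations[1:]:
        prev_best, new_score = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: score[i] * A[i][j])
            prev_best.append(best_i)
            new_score.append(score[best_i] * A[best_i][j] * B[j][obs])
        backpointers.append(prev_best)
        score = new_score
    # Trace back from the best final state.
    last = max(range(n), key=lambda i: score[i])
    path = [last]
    for prev_best in reversed(backpointers):
        path.append(prev_best[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical two-state model with two observation symbols (0 and 1).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
best_path, best_prob = viterbi(A, B, pi, [0, 1, 1])
```

The work per time step is proportional to the square of the number of states, instead of growing exponentially with the length of the utterance.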

Tree Lexicon

Tree Lexicon
Tree Lexicon 2

From words to sentences

  • Intra-word Transition (HMM only)
  • Inter-word Transition (LM only)

More Search Algorithms

  • Beam Search
  • Two-pass Search
    N-best List and Word Graph

Search Techniques (optional)

  • Blind Search Algorithms (BFS, DFS)
  • Heuristic Search
Heuristic Search

Reference

END