VAD(Voice Activity Detection)算法详解

Voice Activity Detection (VAD) 在语音信号处理中,例如语音增强,语音识别等领域有着非常重要的作用。它的作用是从一段语音(纯净或带噪)信号中标识出语音片段与非语音片段。VAD系统通常包括两个部分,特征提取和语音/非语音判决;
常用的特征提取可以分为五类:

  • 基于能量
    基于能量的准则是检测信号的强度,并且假设语音能量大于背景噪声能量,这样当能量大于某一门限时,可以认为有语音存在。然而当噪声大到和语音一样时,能量这个特征无法区分语音还是纯噪声。
    早先基于能量的方法,将宽带语音分成各个子带,在子带上求能量;因为语音在2KHz以下频带包含大量的能量,而噪声在24KHz或者4KHz以上频带比02HKz频带倾向有更高的能量。这其实就是频谱平坦度的概念,webrtc中已经用到了。在信噪比低于10dB时,语音和噪声的区分能力会加速下降。
  • 频域
    通过STFT将时域信号变成频域信号,即使在SNR到0dB时,一些频带的长时包络还是可以区分语音和噪声。
  • 倒谱
    对于VAD,能量倒谱峰值确定了语音信号的基频(pitch),也有使用MFCC做为特征的;
    在这里插入图片描述
  • 谐波
    语音的一个明显特征是包含了基频F0 及其多个谐波频率,即使在强噪声场景,谐波这一特征也是存在的。可以使用自相关的方法找到基频。
  • 长时信息
    语音是非稳态信号。普通语速通常每秒发出10~15个音素,音素见的谱分布是不一样的,这就导致了随着时间变化语音统计特性也是变化的。另一方面,日常的绝大多数噪声是稳态的(变化比较慢的),如白噪声/机器噪声。

基于能量的特征常用硬件实现,谱(频谱和倒谱)在低SNR可以获得较好的效果。当SNR到达0dB时,基于语音谐波和长时语音特征更具有鲁棒性。

判决准则可以分为三类:

  • 基于门限
  • 统计模型
  • 机器学习
    对于VAD分类问题,特征尤为重要,好的特征应该能具备如下特性:
  • 区分能力:含噪语音帧分布和仅仅噪声帧分布的分离度,理论上,好的特征能够让语音和噪声两类没有交集
  • 噪声鲁棒性:背景噪声会造成语音失真,进而语音特征也会失真,影响到提取到的特征的区分能力。

在语音增强中,我们希望从带噪语音信号中剔除噪音,得到纯净的语音信号,第一步就是提取噪音信息。通常的思路是通过VAD函数得到非语音片段,而非语音片段可以认为是纯噪音片段。从而可以从纯噪音信号中提取出有用信息,例如进行傅里叶变换得到噪音频谱等,再进而做下一步处理。例如谱减法,维纳滤波。此处不作讨论。

VAD有很多种方法,此处介绍一种最简单直接的办法。 通过short timeenergy (STE)和zero cross counter (ZCC) 来测定。(实际上精确度高的VAD会提取4种或更多的特征进行判断,这里只介绍两种特征的基本方法)。
STE: 短时能量,即一帧语音信号的能量,ZCC: 过零率,即一帧语音时域信号穿过0(时间轴)的次数。
理论基础是在信噪比(SNR)不是很低的情况下,语音片段的STE相对较大,而ZCC相对较小;而非语音片段的STE相对较小,但是ZCC相对较大。因为语音信号能量绝大部分包含在低频带内,而噪音信号通常能量较小且含有较高频段的信息。故而可以通过测量语音信号的这两个特征并且与两个门限(阈值)进行对比,从而判断语音信号与非语音信号。一段音频小于STE门限同时大于ZCC门限的部分,我们可以认为它是噪声
通常在对语音信号分帧时,取一帧20ms (因为一般会进行短时傅里叶变换,时域和频域的分辨率需要一个平衡,20ms为平衡点,此处不用考虑)。此处输入信号采样率为8000HZ。因此每一帧长度为160 samples.
STE的计算方法是 , 即帧内信号的平方和。ZCC的计算方法是,将帧内所有sample平移1,再对应点做乘积,符号为负的则说明此处过零,只需将帧内所有负数乘积数目求出则得到该帧的过零率。

python代码实现

import numpy as np
import scipy.io.wavfile as wf
import matplotlib.pyplot as plt

class VoiceActivityDetector():
    """ Use signal energy to detect voice activity in wav file """
    
    def __init__(self, wave_input_filename):
        self._read_wav(wave_input_filename)._convert_to_mono()
        self.sample_window = 0.02 #20 ms
        self.sample_overlap = 0.01 #10ms
        self.speech_window = 0.5 #half a second
        self.speech_energy_threshold = 0.6 #60% of energy in voice band
        self.speech_start_band = 300
        self.speech_end_band = 3000
           
    def _read_wav(self, wave_file):
        self.rate, self.data = wf.read(wave_file)
        self.channels = len(self.data.shape)
        self.filename = wave_file
        return self
    
    def _convert_to_mono(self):
        if self.channels == 2 :
            self.data = np.mean(self.data, axis=1, dtype=self.data.dtype)
            self.channels = 1
        return self
    
    def _calculate_frequencies(self, audio_data):
        data_freq = np.fft.fftfreq(len(audio_data),1.0/self.rate)
        data_freq = data_freq[1:]
        return data_freq    
    
    def _calculate_amplitude(self, audio_data):
        data_ampl = np.abs(np.fft.fft(audio_data))
        data_ampl = data_ampl[1:]
        return data_ampl
        
    def _calculate_energy(self, data):
        data_amplitude = self._calculate_amplitude(data)
        data_energy = data_amplitude ** 2
        return data_energy
        
    def _znormalize_energy(self, data_energy):
        energy_mean = np.mean(data_energy)
        energy_std = np.std(data_energy)
        energy_znorm = (data_energy - energy_mean) / energy_std
        return energy_znorm
    
    def _connect_energy_with_frequencies(self, data_freq, data_energy):
        energy_freq = {}
        for (i, freq) in enumerate(data_freq):
            if abs(freq) not in energy_freq:
                energy_freq[abs(freq)] = data_energy[i] * 2
        return energy_freq
    
    def _calculate_normalized_energy(self, data):
        data_freq = self._calculate_frequencies(data)
        data_energy = self._calculate_energy(data)
        #data_energy = self._znormalize_energy(data_energy) #znorm brings worse results
        energy_freq = self._connect_energy_with_frequencies(data_freq, data_energy)
        return energy_freq
    
    def _sum_energy_in_band(self,energy_frequencies, start_band, end_band):
        sum_energy = 0
        for f in energy_frequencies.keys():
            if start_band<f<end_band:
                sum_energy += energy_frequencies[f]
        return sum_energy
    
    def _median_filter (self, x, k):
        assert k % 2 == 1, "Median filter length must be odd."
        assert x.ndim == 1, "Input must be one-dimensional."
        k2 = (k - 1) // 2
        y = np.zeros ((len (x), k), dtype=x.dtype)
        y[:,k2] = x
        for i in range (k2):
            j = k2 - i
            y[j:,i] = x[:-j]
            y[:j,i] = x[0]
            y[:-j,-(i+1)] = x[j:]
            y[-j:,-(i+1)] = x[-1]
        return np.median (y, axis=1)
        
    def _smooth_speech_detection(self, detected_windows):
        median_window=int(self.speech_window/self.sample_window)
        if median_window%2==0: median_window=median_window-1
        median_energy = self._median_filter(detected_windows[:,1], median_window)
        return median_energy
        
    def convert_windows_to_readible_labels(self, detected_windows):
        """ Takes as input array of window numbers and speech flags from speech
        detection and convert speech flags to time intervals of speech.
        Output is array of dictionaries with speech intervals.
        """
        speech_time = []
        is_speech = 0
        for window in detected_windows:
            if (window[1]==1.0 and is_speech==0): 
                is_speech = 1
                speech_label = {}
                speech_time_start = window[0] / self.rate
                speech_label['speech_begin'] = speech_time_start
                print(window[0], speech_time_start)
                #speech_time.append(speech_label)
            if (window[1]==0.0 and is_speech==1):
                is_speech = 0
                speech_time_end = window[0] / self.rate
                speech_label['speech_end'] = speech_time_end
                speech_time.append(speech_label)
                print(window[0], speech_time_end)
        return speech_time
      
    def plot_detected_speech_regions(self):
        """ Performs speech detection and plot original signal and speech regions.
        """
        data = self.data
        detected_windows = self.detect_speech()
        data_speech = np.zeros(len(data))
        it = np.nditer(detected_windows[:,0], flags=['f_index'])
        while not it.finished:
            data_speech[int(it[0])] = data[int(it[0])] * detected_windows[it.index,1]
            it.iternext()
        plt.figure()
        plt.plot(data_speech)
        plt.plot(data)
        plt.show()
        return self
       
    def detect_speech(self):
        """ Detects speech regions based on ratio between speech band energy
        and total energy.
        Output is array of window numbers and speech flags (1 - speech, 0 - nonspeech).
        """
        detected_windows = np.array([])
        sample_window = int(self.rate * self.sample_window)
        sample_overlap = int(self.rate * self.sample_overlap)
        data = self.data
        sample_start = 0
        start_band = self.speech_start_band
        end_band = self.speech_end_band
        while (sample_start < (len(data) - sample_window)):
            sample_end = sample_start + sample_window
            if sample_end>=len(data): sample_end = len(data)-1
            data_window = data[sample_start:sample_end]
            energy_freq = self._calculate_normalized_energy(data_window)
            sum_voice_energy = self._sum_energy_in_band(energy_freq, start_band, end_band)
            sum_full_energy = sum(energy_freq.values())
            speech_ratio = sum_voice_energy/sum_full_energy
            # Hipothesis is that when there is a speech sequence we have ratio of energies more than Threshold
            speech_ratio = speech_ratio>self.speech_energy_threshold
            detected_windows = np.append(detected_windows,[sample_start, speech_ratio])
            sample_start += sample_overlap
        detected_windows = detected_windows.reshape(int(len(detected_windows)/2),2)
        detected_windows[:,1] = self._smooth_speech_detection(detected_windows)
        return detected_windows
 

from vad import VoiceActivityDetector
import argparse
import json

def save_to_file(data, filename):
    with open(filename, 'w') as fp:
        json.dump(data, fp)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Analyze input wave-file and save detected speech interval to json file.')
    parser.add_argument('inputfile', metavar='INPUTWAVE',
                        help='the full path to input wave file')
    parser.add_argument('outputfile', metavar='OUTPUTFILE',
                        help='the full path to output json file to save detected speech intervals')
    args = parser.parse_args()
    
    v = VoiceActivityDetector(args.inputfile)
    raw_detection = v.detect_speech()
    speech_labels = v.convert_windows_to_readible_labels(raw_detection)
    
    save_to_file(speech_labels, args.outputfile)

语音增强和语音识别书籍链接:
链接

相关推荐
©️2020 CSDN 皮肤主题: 像素格子 设计师:CSDN官方博客 返回首页