JP6731362B2

JP6731362B2 - Audio coding/decoding method

Info

Publication number: JP6731362B2
Application number: JP2017038988A
Authority: JP
Inventors: 行雄松尾; 祥好中島; 和夫上田; 拓也岸田
Original assignee: gakkou houjin touhoku Gakuin
Current assignee: gakkou houjin touhoku Gakuin
Priority date: 2017-03-02
Filing date: 2017-03-02
Publication date: 2020-07-29
Anticipated expiration: 2037-03-02
Also published as: JP2018146652A

Description

The present invention relates to a method of coding speech on the transmitting side and decoding it on the receiving side.

Today, with the spread of high-speed lines such as optical fiber and the advent of large-capacity storage, the limits on the data sizes we can handle are disappearing, and in some cases data compression is no longer considered necessary. On the other hand, data size is restricted in communication within outer space or under the sea, and between space or the sea and the ground. Such cases call for a speech compression method that avoids interference and produces small data.

MP3 and WMA are known speech compression methods. They can compress while preserving musical elements as well as linguistic information, and achieve a compression rate of about 87%.

Separately, although it does not compress speech, the method disclosed in Patent Document 1 has been proposed for making specific speech, such as station announcements, intelligible despite reverberation or background noise. That method divides the input speech signal into a plurality of frequency bands, divides the signal in each frequency band into a plurality of time frames, calculates the average power in each time frame, compares the calculated average powers with one another, determines the amplification of each signal divided by the time frame division unit based on the comparison result, and sums the amplified signals of the frequency bands.

Patent Document 1: Japanese Patent No. 5115818

Compression by existing methods such as MP3 and WMA does not achieve a sufficient compression rate, and a higher compression rate is required in environments where data size is restricted.

In the embodiment shown in FIG. 6 of Patent Document 1, the coding step divides the speech into four bands and also performs zero-cross wave generation, which converts the instantaneous amplitude of the time waveform into +1 if positive, 0 if zero, and -1 if negative. This zero-cross wave is further divided into four bands and then coded. No compression is performed, however.

To solve the above problems, in the present invention the coding consists of a process of dividing the input speech into a plurality of bands and then compressing it, and a process of 0,1 encoding the input speech and then compressing it. The decoding consists of a process of upsampling the speech data that was compressed after division into the plurality of bands; a process of upsampling the speech data that was compressed after 0,1 encoding and then dividing it into the same bands as those used in coding; and a process of combining the upsampled speech data of each band obtained from the band-divided, compressed input speech with the upsampled speech data of each band obtained from the 0,1 encoded (rectangular clipped) speech.

As a specific example of the process of dividing the input speech into a plurality of bands and then compressing it, the speech may be divided into a plurality of bands with bandpass filters, the time interval at which the envelope of the speech data in each divided frequency band is extracted may be changed, and the bit depth may then be reduced.

In another conceivable mode, after the input speech is divided into a plurality of bands and compressed, the amplitude information is digitized, and the number of the band having the maximum value and its amplitude value are extracted as the amplitude information. In decoding, the band number having the maximum value and its amplitude value are restored to amplitude information.

In a further conceivable mode, the coding process of dividing the input speech into a plurality of bands and then compressing it uses the results of a factor analysis of speech power fluctuations, and the decoding multiplies the correlation value obtained for each factor in coding by the factor loading.

According to the present invention, it is possible to provide a speech coding/decoding method that achieves a high compression rate in coding and a high speech recognition rate (rate of correct answers) after decoding.
Furthermore, the method according to the present invention can be used for speech communication in any environment.

FIG. 1: Diagram explaining an example of coding.
FIG. 2: Diagram explaining an example of decoding.
FIG. 3: Diagram explaining compression by changing the extraction time interval.
FIG. 4: Diagram explaining compression by bit depth.
FIG. 5: Diagram explaining 0,1 encoding.
FIG. 6: Diagram similar to FIG. 1, explaining another embodiment.
FIG. 7: Diagram similar to FIG. 2, explaining another embodiment.
FIG. 8: Diagram similar to FIG. 1, explaining another embodiment.
FIG. 9: Diagram similar to FIG. 2, explaining another embodiment.
FIG. 10: (a) is a graph showing the relationship between the four factors and center frequency; (b) is a graph showing the relationship between the three factors and center frequency.
FIG. 11: Diagram explaining the method of factor analysis.
FIG. 12: Diagram explaining the calculation method used in coding.
FIG. 13: Diagram explaining the calculation method used in decoding.

A case where speech represented at 16000 Hz and 16 bits is used as the original speech (input speech) will be described.
In coding, two processes are applied to the original speech in parallel: band division using bandpass filters, and 0,1 encoding (rectangular clipped speech).

In the band division using bandpass filters, the speech is divided into four bands, for example a first band of 20 to 510 Hz, a second band of 510 to 1270 Hz, a third band of 1270 to 2700 Hz, and a fourth band of 2700 to 6400 Hz. Although this embodiment uses four bands, the number of bands may also be two, three, or five or more.
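For illustration only, the four-band division might be sketched as follows. The description specifies only that bandpass filters are used; the brick-wall DFT mask, the names `BANDS_HZ`, `dft`, and `band_split`, and the short 10 ms test frame are assumptions made for this sketch, not the filter design of the embodiment.

```python
import cmath
import math

# Band edges in Hz, taken from the embodiment.
BANDS_HZ = [(20, 510), (510, 1270), (1270, 2700), (2700, 6400)]

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def band_split(samples, sample_rate=16000, bands=BANDS_HZ):
    """Split a frame into bands by zeroing DFT bins outside each band."""
    n = len(samples)
    spectrum = dft(samples)
    outputs = []
    for low, high in bands:
        masked = []
        for k, coeff in enumerate(spectrum):
            # Bins above n/2 mirror the negative frequencies.
            freq = min(k, n - k) * sample_rate / n
            masked.append(coeff if low <= freq < high else 0)
        # Inverse DFT; the imaginary part is rounding noise for a real input.
        band = [sum(masked[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n)).real / n
                for t in range(n)]
        outputs.append(band)
    return outputs

# A 1000 Hz tone sampled at 16 kHz for 10 ms (160 samples).
tone = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(160)]
energies = [sum(v * v for v in band) for band in band_split(tone)]
print(energies.index(max(energies)))  # 1, i.e. the 510 to 1270 Hz band
```

A practical implementation would more likely use a conventional IIR or FIR filter bank; the DFT mask is used here only because it needs nothing beyond the standard library.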

Downsampling is then performed for each divided band. As shown in FIG. 3, downsampling is performed by changing the time interval at which the amplitude envelope of the speech waveform is extracted. In FIG. 3 the amplitude envelope is extracted every 20 ms (milliseconds). In the embodiment, the time interval is set to 40 ms, which compresses 16000 Hz to 25 Hz, a factor of 1/640.
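The envelope-based downsampling can be sketched as below. The description does not state how the envelope value for each interval is computed, so the mean absolute amplitude per frame used here, and the name `frame_envelope`, are assumptions.

```python
import math

def frame_envelope(samples, sample_rate=16000, frame_ms=40):
    """Return one envelope value per frame (mean absolute amplitude, an assumption)."""
    frame_len = sample_rate * frame_ms // 1000  # 640 samples for 40 ms at 16 kHz
    envelope = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        envelope.append(sum(abs(x) for x in frame) / frame_len)
    return envelope

# One second of a 440 Hz tone with peak amplitude 1000.
one_second = [1000.0 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
env = frame_envelope(one_second)
print(len(env))  # 25 values per second, i.e. 16000 / 640
```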

The speech data compressed by downsampling is further compressed by reducing the bit depth. The original speech has a 16-bit depth (65536 levels), and is compressed, for example, to 3-bit data (8 levels, 0 to 7) as shown in FIG. 4. In the embodiment, it is compressed to 8-bit data (256 levels).
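A minimal sketch of the bit-depth reduction follows, assuming uniform quantization of non-negative envelope values against a fixed full-scale value. The mapping, the `full_scale` default, and the names `quantize` and `dequantize` are illustrative assumptions; the description gives only the target bit depths.

```python
def quantize(values, bits=8, full_scale=32768.0):
    """Map envelope values in [0, full_scale] to integer codes 0 .. 2**bits - 1."""
    levels = (1 << bits) - 1
    codes = []
    for v in values:
        clipped = min(max(v, 0.0), full_scale)  # keep the value inside the range
        codes.append(round(clipped / full_scale * levels))
    return codes

def dequantize(codes, bits=8, full_scale=32768.0):
    """Map integer codes back to approximate envelope values."""
    levels = (1 << bits) - 1
    return [c / levels * full_scale for c in codes]

print(quantize([0.0, 10000.0, 32768.0], bits=8))  # [0, 78, 255]
print(quantize([0.0, 10000.0, 32768.0], bits=3))  # [0, 2, 7]
```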

The 0,1 encoding (rectangular clipped speech) in coding is, as shown in FIG. 5, a process that outputs 1 when the amplitude of the input speech is 0 or greater and 0 when it is less than 0. In the embodiment, 16000 Hz is compressed to 1600 Hz (1/10); that is, a 0 or 1 is determined every 1/1600 second.

The 0 or 1 may be determined, for example, from the average of the 10 sampled values or from an arbitrary instantaneous value (for example, the first sample).
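The two decision rules mentioned above can be sketched as follows. The function name `binary_code` and the `mode` parameter are hypothetical; the threshold (1 for values of 0 or greater, otherwise 0) and the block size of 10 samples follow the embodiment.

```python
def binary_code(samples, factor=10, mode="mean"):
    """Emit one bit per block of `factor` samples.

    mode="mean"  : decide from the average of the block
    mode="first" : decide from the first sample of the block
    """
    bits = []
    for start in range(0, len(samples) - factor + 1, factor):
        block = samples[start:start + factor]
        value = sum(block) / factor if mode == "mean" else block[0]
        bits.append(1 if value >= 0 else 0)
    return bits

samples = [1] * 10 + [-1] * 10 + [-5] + [1] * 9
print(binary_code(samples, mode="mean"))   # [1, 0, 1]
print(binary_code(samples, mode="first"))  # [1, 0, 0]
```

The third block shows that the two rules can disagree: its first sample is negative while its average is positive.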

The amplitude information of each band obtained by the bandpass filter division thus amounts to 25 × 8 = 200 bits per second, the 0,1 encoded data amounts to 1600 bits per second, and the total is 200 × 4 + 1600 = 2400 bits per second.

In decoding, upsampling and related processing are applied to the compressed speech data.
First, the speech data that was band-divided with bandpass filters and then compressed is upsampled. In the embodiment, it is upsampled by a factor of 640.

One conceivable upsampling method is linear interpolation followed by a low-pass filter whose cutoff corresponds to the Nyquist frequency.
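The linear interpolation step might look like the sketch below. The subsequent low-pass filter is omitted, and both the name `upsample_linear` and the choice to hold the last value for the final interval are assumptions.

```python
def upsample_linear(values, factor):
    """Upsample by an integer factor using linear interpolation between neighbours."""
    if len(values) < 2:
        return [v for v in values for _ in range(factor)]
    out = []
    for a, b in zip(values, values[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    # No following sample exists for the last value, so it is simply held.
    out.extend([values[-1]] * factor)
    return out

print(upsample_linear([0.0, 4.0], 4))  # [0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0]
```

With `factor=640` this turns 25 envelope values per second back into 16000 samples per second, and with `factor=10` it restores the 1600 Hz binary stream to 16000 Hz.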

Meanwhile, the 0,1 encoded speech data transmitted in parallel with the above is also upsampled. Since it was compressed to 1/10, it is upsampled by a factor of 10 using the same technique.

The speech data upsampled by a factor of 10 is then divided into the same four bands as the bandpass filter division, and these four band signals are combined, band by band, with the speech data upsampled by a factor of 640.
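The description does not say how the two streams are combined within each band. One plausible reading, shown here purely as an assumption, is that each band's upsampled envelope scales the corresponding band-limited signal derived from the 0,1 data, and the scaled bands are then summed. The name `synthesize` is hypothetical.

```python
def synthesize(envelopes, carriers):
    """Multiply each band's envelope by its carrier and sum over bands (assumed rule)."""
    n = len(carriers[0])
    out = [0.0] * n
    for env, car in zip(envelopes, carriers):
        for t in range(n):
            out[t] += env[t] * car[t]
    return out

# Two bands, two samples each, with made-up values.
print(synthesize([[1.0, 2.0], [3.0, 4.0]], [[1.0, -1.0], [0.5, 0.5]]))  # [2.5, 0.0]
```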

Evaluation conditions
・Speech used: 30 sentences in total, used in random order, taken from the Japanese speech data of the "NTT-AT Multilingual Speech Database 2002" (sampling frequency 16 kHz, 16 bits), 15 sentences each from one male and one female speaker.
・Subjects: 12 students with normal hearing.
・Responses were entered by keyboard.

Under the above conditions, with 4 bands, a sampling frequency of 1600 Hz for the 0,1 encoding, a time interval of 40 ms, and a bit depth of 8 bits, giving a total of 2400 bits (compression rate 99.1%), a mora accuracy of 80% or more and an exact-match accuracy of about 60% were obtained. These values are sufficient for use in speech communication.
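The bit count and the compression rate can be checked with the short calculation below. It assumes that all figures are per second of speech and that the baseline is the uncompressed original at 16000 Hz and 16 bits; the description does not state the baseline explicitly.

```latex
\begin{align*}
\text{envelope bits per band} &= 25 \times 8 = 200 \\
\text{coded total} &= 4 \times 200 + 1600 = 2400 \ \text{bits/s} \\
\text{original} &= 16000 \times 16 = 256000 \ \text{bits/s} \\
\text{compression rate} &= 1 - \frac{2400}{256000} \approx 0.9906
\end{align*}
```

Under these assumptions the result, about 99.06%, agrees with the stated 99.1% after rounding.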

FIGS. 6 and 7 show another embodiment. Because the amplitude information digitized with 8 bits in the above embodiment covers four frequency bands, the band number can be expressed with 2 bits. Accordingly, the number of the band having the maximum value and its amplitude value are extracted as the amplitude information, and decoding restores them to four pieces of amplitude information.
With this configuration, a coding/decoding method with high decoding performance can be obtained even when the compression rate is increased.
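The peak-band representation of FIGS. 6 and 7 might be sketched as follows. Extracting the band number and its amplitude follows the description, but how the remaining three amplitudes are restored in decoding is not stated; filling them with zero, and the names `encode_peak` and `decode_peak`, are assumptions made for this sketch.

```python
def encode_peak(frame_amplitudes):
    """Return (band number, amplitude) of the band with the largest amplitude."""
    band = max(range(len(frame_amplitudes)), key=lambda i: frame_amplitudes[i])
    return band, frame_amplitudes[band]

def decode_peak(band, amplitude, n_bands=4, fill=0):
    """Rebuild per-band amplitudes; non-peak bands get `fill` (assumed rule)."""
    restored = [fill] * n_bands
    restored[band] = amplitude
    return restored

band, amplitude = encode_peak([12, 200, 45, 3])
print(band, amplitude)               # 1 200
print(decode_peak(band, amplitude))  # [0, 200, 0, 0]
```

Per frame this sends 2 bits for the band number plus 8 bits for the amplitude, instead of 4 × 8 = 32 bits.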

FIGS. 8 and 9 illustrate a further embodiment. In the coding of this embodiment, the results of a factor analysis of speech power fluctuations, obtained from a multilingual database using the outputs of a critical-band filter bank of 20 bands serving as the bandpass filters, are used to express the characteristics of speech with a small number of factors. In decoding, the correlation value obtained for each factor in coding is multiplied by the factor loading to convert it into amplitude information for the 20 bands, and speech synthesis is performed.

It is known that the amplitude spectrum of speech can be represented using the loadings of three or four factors. FIG. 10(a) is a graph showing the relationship between the four factors and center frequency; the number of factors may also be three, as shown in FIG. 10(b).

FIG. 11 shows the method of deriving the factor loadings of speech by factor analysis. Further details of this method are disclosed in Scientific Reports (doi:10.1038/srep42468).

FIG. 12 shows the calculation of the correlation with the factor loadings in coding that uses the factor analysis results, and FIG. 13 shows the calculation of the amplitude of each band from the factor loadings in decoding. In the coding correlation, the factor loadings prepared in advance for each band (four or three factors) are multiplied by the output of each band of the input speech and summed. In decoding, the amplitude of each band is calculated by multiplying the correlation value obtained for each factor in coding by the factor loading for each band.
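The two calculations can be sketched as below. The loadings are made-up numbers for 2 factors and 3 bands rather than the 3 or 4 factors and 20 bands of the embodiment, the names `encode_factors` and `decode_factors` are hypothetical, and summing the per-factor products in decoding is an assumption, since the description says only that the values are multiplied for each band.

```python
def encode_factors(band_outputs, loadings):
    """Sum of products of each factor's loadings with the band outputs."""
    return [sum(l * x for l, x in zip(row, band_outputs)) for row in loadings]

def decode_factors(scores, loadings):
    """Per-band amplitude: factor values times loadings, summed over factors (assumed)."""
    n_bands = len(loadings[0])
    return [sum(s * row[b] for s, row in zip(scores, loadings))
            for b in range(n_bands)]

# Hypothetical loadings: one row per factor, one column per band.
loadings = [[1.0, 0.5, 0.0],
            [0.0, 0.5, 1.0]]
scores = encode_factors([2.0, 4.0, 6.0], loadings)
print(scores)                            # [4.0, 8.0]
print(decode_factors(scores, loadings))  # [4.0, 6.0, 8.0]
```

The decoded amplitudes differ from the inputs, which reflects that the factor representation is lossy: three band values are carried by two numbers.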

Coding and decoding can each be built into separate devices, but in practice they are manufactured as a single transceiver incorporating both a coding unit and a decoding unit.

Claims

A speech coding/decoding method for coding speech and then decoding it, characterized in that the coding comprises: a process of dividing the input speech into a plurality of bands, compressing it, digitizing the compressed amplitude information, and extracting the number of the band having the maximum value and its amplitude value as amplitude information; and a process of 0,1 encoding the input speech and then compressing it; and in that the decoding comprises: a process of restoring the extracted band number having the maximum value and its amplitude value to digitized compressed speech data and upsampling that data; a process of upsampling the speech data compressed after the 0,1 encoding and then dividing it into the same bands as those used in coding; and a process of combining the speech data of each band that was upsampled after the input speech was divided into a plurality of bands and compressed with the speech data of each band that was upsampled after 0,1 encoding.

A speech coding/decoding method for coding speech and then decoding it, characterized in that the coding comprises: a process of dividing the input speech into a plurality of bands and compressing it using the results of a factor analysis of speech power fluctuations; and a process of 0,1 encoding the input speech and then compressing it; and in that the decoding comprises: a process of upsampling the compressed speech data and converting the upsampled data into amplitude information by multiplying the correlation value obtained for each factor in coding by the factor loading for each band; a process of upsampling the speech data compressed after the 0,1 encoding and then dividing it into the same bands as those used in coding; and a process of combining the speech data of each band that was upsampled after the input speech was divided into a plurality of bands and compressed with the speech data of each band that was upsampled after 0,1 encoding.