JP4364544B2 - Audio signal processing apparatus and method

Info

Publication number: JP4364544B2
Application number: JP2003105148A
Authority: JP
Inventors: 陽平池田; 哲也高橋; 孝之稗方; 敏章下田
Original assignee: Kobe Steel Ltd
Current assignee: Kobe Steel Ltd
Priority date: 2003-04-09
Filing date: 2003-04-09
Publication date: 2009-11-18
Anticipated expiration: 2023-04-09
Also published as: JP2004309893A

Abstract

PROBLEM TO BE SOLVED: To reflect the periodicity of every channel signal, and so prevent degradation of audio quality, when time-axis compression and/or expansion is carried out according to a pitch period obtained from the input audio signals of a plurality of channels, while also preventing an increase in computational load and the occurrence of an inter-channel phase difference that sounds unnatural to the listener.

SOLUTION: A composite signal generation unit 11 generates a plurality of composite signals L+R and L-R by combining the input signals L and R of the plurality of channels through different combination processes (addition and subtraction). An effective signal selection unit 12 selects the one composite signal with the largest amplitude as the effective signal. A pitch period detection unit 13 obtains the pitch period from the effective signal, and a signal compression/expansion unit 14 compresses and/or expands the time axis of all channel signals by the PICOLA method according to that pitch period.

COPYRIGHT: (C)2005, JPO & NCIPI

Description

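[Brief description of the drawings]
[FIG. 1] Block diagram showing the schematic configuration of the audio signal processing apparatus X according to an embodiment of the present invention.
[FIG. 2] Diagram schematically showing the waveform of an audio signal when time-axis compression is performed by the PICOLA method.
[FIG. 3] Diagram schematically showing the waveform of an audio signal when time-axis expansion is performed by the PICOLA method.
[Explanation of reference signs]
11 ... Composite signal generation unit (composite signal generating means)
12 ... Effective signal selection unit (effective signal selecting means)
13 ... Pitch period detection unit (pitch period detecting means)
14 ... Signal compression/expansion unit (time axis adjusting means)
[0001]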
[Technical field of the invention]
The present invention relates to an audio signal processing apparatus and method that compress and/or expand the time axis of input audio signals based on a pitch period obtained from the input audio signals of a plurality of channels.
[0002]
[Prior art]
Changing the tempo (speed) of karaoke or the playback speed of video requires time-axis compression/expansion processing (an example of audio signal processing), which speeds up or slows down playback of an audio signal without changing its pitch. Pitch conversion processing (another example of audio signal processing), which changes only the pitch without changing the playback speed, may also be required.
Non-Patent Documents 1 and 2 describe a technique that finds the strongly periodic portions of an audio signal and performs pitch-period-based time-axis compression/expansion by omitting or repeating (inserting) the audio signal in units of that period (the pitch period). The technique uses a method called PICOLA (Pointer Interval Control OverLap and Add, an overlap-add method with pointer movement control). For compression, the one pitch period of signal to be omitted is overlap-added onto the next pitch period of signal with cross-fade weighting. For expansion, the one pitch period of signal to be inserted is formed by overlap-adding the pitch periods of signal before and after it with cross-fade weighting.
[0003]
FIG. 2 schematically shows a waveform of an audio signal when time axis compression is performed by the PICOLA method.
First, as shown in FIG. 2(a), a pointer is set at the start position Po1 of the range of the audio signal to be compressed (that is, where signal is to be omitted), and the pitch period P (a period with strong periodicity) of the audio signal from pointer position Po1 is detected. An example of a method for detecting the pitch period P is described later.
Next, as shown in FIG. 2(b), two signals a and b, each one pitch period P long, are taken from pointer position Po1 and overlap-added with cross-fade weighting to generate a signal a′. That is, when a and b are added, the weight on signal a fades out (gradually decreases) and the weight on signal b fades in (gradually increases) as time advances, as indicated by broken lines W1 and W2 in FIG. 2(a).
Next, signal a is deleted (omitted) and signal b is replaced with signal a′. This completes time-axis compression by one pitch period P. Because signal a′ is an overlap-add with cross-fade weighting, it connects smoothly to the audio before and after it, so the compression sounds natural.
Next, with a target compression ratio Rx (0 < Rx < 1), the pointer is reset to position Po2, which is C (= P × Rx / (1 − Rx)) beyond Po1. The compressed audio signal from Po1 to Po2 is output, and the same compression processing is repeated from pointer position Po2. A compressed signal of length C is thus generated (output) from an original signal of length P + C, which achieves the target compression ratio Rx (= C / (P + C)).
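The compression step described above can be sketched in a few lines. This is a minimal illustration rather than the implementation in the cited documents: the function name `picola_compress_step`, the linear cross-fade weights, and the rounding of C to a whole number of samples are all assumptions made here. The buffer is taken to start at pointer position Po1.

```python
def picola_compress_step(buf, p, rx):
    """One PICOLA-style compression step (illustrative sketch).

    buf : list of samples starting at the current pointer (Po1)
    p   : pitch period P in samples (len(buf) must be at least 2 * p)
    rx  : target compression ratio, 0 < rx < 1
    Returns (emitted, remaining): the C output samples and the buffer
    that the next step starts from (pointer Po2).
    """
    a = buf[:p]
    b = buf[p:2 * p]
    # Cross-fade: the weight on a falls from 1 towards 0 while the
    # weight on b rises from 0 towards 1 (linear weights assumed).
    a_prime = [(1 - n / p) * a[n] + (n / p) * b[n] for n in range(p)]
    # C = P * Rx / (1 - Rx), rounded to whole samples (assumption).
    c = round(p * rx / (1 - rx))
    # Drop a and put a' in place of b.
    spliced = a_prime + buf[2 * p:]
    # Emit C samples; P + C input samples have now been consumed.
    return spliced[:c], spliced[c:]
```

With P = 4 and Rx = 0.5 this gives C = 4, so 8 input samples produce 4 output samples, matching Rx = C / (P + C).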
[0004]
On the other hand, FIG. 3 schematically shows the waveform of an audio signal when time axis expansion is performed by the PICOLA method.
First, as shown in FIG. 3(a), a pointer is set at the start position Po3 of the range of the audio signal to be expanded (that is, where signal is to be inserted), and the pitch period P (a period with strong periodicity) of the audio signal from pointer position Po3 is detected.
Next, as shown in FIG. 3(b), two signals a and b, each one pitch period P long, are taken from pointer position Po3 and overlap-added with cross-fade weighting to generate a signal a′. For expansion, the weight on signal a fades in (gradually increases) and the weight on signal b fades out (gradually decreases) as time advances, as indicated by broken lines W3 and W4 in FIG. 3(a).
Next, signal a′ is inserted between signals a and b. This completes time-axis expansion by one pitch period P. Because the inserted signal a′ is an overlap-add with cross-fade weighting, it connects smoothly to the audio before and after it, so the expansion sounds natural.
Next, with a target expansion ratio Ry (Ry > 1), the pointer is reset to position Po4, which is P + S (S = P × 1 / (Ry − 1)) beyond Po3. The expanded audio signal from Po3 to Po4 is output, and the same expansion processing is repeated from pointer position Po4. An expanded signal of length P + S is thus generated (output) from an original signal of length S, which achieves the target expansion ratio Ry (= (P + S) / S).
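The expansion step can be sketched the same way. Again this is only an illustration under assumptions: the name `picola_expand_step`, the linear cross-fade, and the rounding of S are choices made here, and the buffer is taken to start at pointer position Po3.

```python
def picola_expand_step(buf, p, ry):
    """One PICOLA-style expansion step (illustrative sketch).

    buf : list of samples starting at the current pointer (Po3)
    p   : pitch period P in samples (len(buf) must be at least 2 * p)
    ry  : target expansion ratio, ry > 1
    Returns (emitted, remaining): the P + S output samples and the
    buffer that the next step starts from (pointer Po4).
    """
    a = buf[:p]
    b = buf[p:2 * p]
    # Cross-fade for expansion: the weight on a rises from 0 towards 1
    # while the weight on b falls from 1 towards 0 (linear assumed).
    a_prime = [(n / p) * a[n] + (1 - n / p) * b[n] for n in range(p)]
    # S = P / (Ry - 1), rounded to whole samples (assumption).
    s = round(p / (ry - 1))
    # Insert a' between a and b.
    stretched = a + a_prime + buf[p:]
    emit = p + s
    # Emit P + S samples; only S input samples have been consumed.
    return stretched[:emit], stretched[emit:]
```

With P = 4 and Ry = 2 this gives S = 4, so 4 input samples produce 8 output samples, matching Ry = (P + S) / S.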
[0005]
Patent Document 1 describes a technique that converts the pitch of an audio signal: the input signal is first time-adjusted by time-axis compression or expansion using PICOLA or a similar method, and its sampling rate is then converted by interpolation so that it returns to the same time length (number of samples) as the input. This makes it possible to change only the pitch without changing the tempo (speed) of the audio signal.
[0006]
When the audio signal to be processed has multiple channels, as a stereo audio signal does, applying PICOLA to each channel separately has two problems. First, the computationally heavy pitch period calculation must be run for every channel, so the computational load becomes very high. Second, the pitch period can differ from channel to channel, so the processed signal acquires inter-channel phase differences that were not in the original, which sounds unnatural to the listener.
An effective way to solve this problem is to use one common pitch period for compression/expansion on all channels.
For example, Patent Document 2 proposes detecting the pitch period of the signal (L + R) obtained by adding the L and R channels of a stereo audio signal, and performing compression/expansion processing (PICOLA) on both channels based on that pitch period.
Patent Document 3 further proposes detecting the pitch period of either the sum of a plurality of channel signals or the channel signal with the largest amplitude, and performing compression/expansion processing on all channel signals based on that pitch period.
With these techniques, the heavy pitch period calculation need only be run on one audio signal, which prevents an increase in computational load, and the processed signal does not acquire inter-channel phase differences that sound unnatural to the listener.
[0007]
[Patent Document 1]
JP H08-272390 A
[Patent Document 2]
JP 2001-5500 A
[Patent Document 3]
JP 2002-297200 A
[Non-Patent Document 1]
Morita and Itakura, "Time-axis expansion and contraction of speech using the autocorrelation function," Proceedings of the Acoustical Society of Japan, S61.3, pp. 199-200
[Non-Patent Document 2]
Morita and Itakura, "Time-axis expansion and compression of speech using the overlap-add method with pointer movement control (PICOLA) and its evaluation," S61.10, pp. 149-150
[0008]
[Problems to be solved by the invention]
However, when the pitch period is obtained from the sum of a plurality of channel signals, the periodicity of the original channel signals may not appear in the sum. For example, when the L and R channels of a stereo audio signal are in opposite phase, their periodicity cancels out, an appropriate pitch period is not detected, and audio quality after compression/expansion deteriorates.
When the pitch period detected from only one of the channel signals (for example, the one with the largest amplitude) is used, the periodicity of the other channel signals is not reflected at all, and the channel signals not used for pitch period detection suffer large quality degradation from time-axis compression/expansion.
The present invention was made in view of these circumstances. Its object is to provide an audio signal processing apparatus and method that, when compressing and/or expanding the time axis based on a pitch period obtained from multi-channel input audio signals such as a stereo audio signal, reflect the periodicity of each channel signal to prevent degradation of audio quality, while also preventing an increase in computational load and inter-channel phase differences that sound unnatural to the listener.
[0009]
[Means for Solving the Problems]
To achieve the above object, the present invention is an audio signal processing apparatus having time axis adjusting means for compressing and/or expanding the time axis of input audio signals based on a pitch period obtained from the input audio signals of a plurality of channels, the apparatus comprising: composite signal generating means for generating a plurality of composite signals by multiplying the input audio signal of every channel by a predetermined weighting coefficient and adding and subtracting the results; effective signal selecting means for selecting, as an effective signal, the composite signal with the largest amplitude among the plurality of composite signals; and pitch period detecting means for detecting a pitch period from the effective signal; wherein the time axis adjusting means compresses and/or expands the time axis of the input audio signals of all channels based on the pitch period obtained from the effective signal.
As a result, from a plurality of composite signals formed by different combination processes, the one composite signal that best reflects the periodicity of each channel signal (that is, one in which the channel signals do not cancel each other) can be selected as the effective signal for pitch period detection, according to the relative relationship between the channels of the input audio signal at that moment. This prevents degradation of audio quality while also preventing an increase in computational load and inter-channel phase differences that sound unnatural to the listener.
[0010]
For example, when the input audio signal is a two-channel stereo audio signal, the composite signal generating means may generate one composite signal by multiplying each of the two stereo channels by a predetermined weighting coefficient and adding them, and another by subtracting them.
This yields composite signals that reflect the characteristics of each channel signal equally, or with a desired weighting. Moreover, when the channel signals are in phase with each other, their periodicity is well reflected in the added composite signal; conversely, when they are in opposite phase, it is well reflected in the subtracted composite signal. An appropriate composite signal can therefore be selected each time.
[0011]
The selection rule used by the effective signal selecting means may be, for example, to select the composite signal with the largest average amplitude or the largest standard deviation.
The apparatus may also comprise pitch conversion means that converts the pitch of the input audio signal by converting the sampling rate of each channel's time-adjusted audio signal (whose time axis has been compressed or expanded by the time axis adjusting means) so that it returns to its original time length.
This makes it possible to realize pitch conversion with little degradation of audio quality.
[0012]
The present invention may also be understood as an audio signal processing method corresponding to the processing of the above audio signal processing apparatus.
That is, it is an audio signal processing method for compressing and/or expanding the time axis of input audio signals based on a pitch period obtained from the input audio signals of a plurality of channels, the method comprising: a composite signal generation step of generating a plurality of composite signals by multiplying the input audio signal of every channel by a predetermined weighting coefficient and adding and subtracting the results; an effective signal selection step of selecting, as an effective signal, the composite signal with the largest amplitude among the plurality of composite signals; a pitch period detection step of detecting a pitch period from the effective signal; and a time axis adjustment step of compressing and/or expanding the time axis of the input audio signals of all channels based on the pitch period obtained from the effective signal.
When the input audio signal is a two-channel stereo audio signal, the composite signal generation step may generate one composite signal by multiplying each of the two stereo channels by a predetermined weighting coefficient and adding them, and another by subtracting them.
[0013]
[Embodiments of the invention]
Embodiments and examples of the present invention are described below with reference to the accompanying drawings to aid understanding of the invention. The following embodiments and examples are merely examples embodying the present invention and do not limit its technical scope.
FIG. 1 is a block diagram showing the schematic configuration of the audio signal processing apparatus X according to an embodiment of the present invention. FIG. 2 schematically shows the waveform of an audio signal when time-axis compression is performed by the PICOLA method. FIG. 3 schematically shows the waveform of an audio signal when time-axis expansion is performed by the PICOLA method.
[0014]
The audio signal processing apparatus X according to the embodiment of the present invention is described below with reference to FIG. 1.
The audio signal processing apparatus X comprises: a composite signal generation unit 11 that receives a two-channel (L and R) stereo audio signal (the input audio signal) and generates a plurality of composite signals, each combining both channels (all channels) by a different combination process; an effective signal selection unit 12 that selects one of the generated composite signals as the effective signal according to a predetermined selection rule; a pitch period detection unit 13 that detects the pitch period from the effective signal; and a signal compression/expansion unit 14 (an example of the time axis adjusting means) that compresses and expands the time axis of the input stereo audio signals of both channels (all channels) based on the pitch period obtained from the effective signal.
The composite signal generation unit 11 multiplies each of the two stereo channels by the same weighting coefficient (for example, 1 or 0.5) and generates, as the composite signals, the added signal (L + R) and the subtracted signal (L − R).
As a result, when the channel signals (L, R) are in phase or nearly so, the amplitude of the added composite signal (L + R) is large and the amplitude of the subtracted composite signal (L − R) is small. Conversely, when the channel signals (L, R) are in opposite phase or nearly so, the amplitude of the added composite signal (L + R) is small and the amplitude of the subtracted composite signal (L − R) is large.
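The behaviour of the two composite signals can be checked with a small sketch. The function name `make_composites` and the default weight of 0.5 are illustrative assumptions; the description only requires that both channels use the same weighting coefficient.

```python
def make_composites(left, right, weight=0.5):
    """Return the weighted sum (L + R) and difference (L - R) signals."""
    added = [weight * (l + r) for l, r in zip(left, right)]
    subtracted = [weight * (l - r) for l, r in zip(left, right)]
    return added, subtracted


# Opposite-phase channels: the sum cancels, the difference keeps the waveform.
left = [1, -1, 1, -1]
right = [-1, 1, -1, 1]
added, subtracted = make_composites(left, right)
# added is all zeros; subtracted equals the left channel.
```

For in-phase channels the roles swap: the sum keeps the waveform and the difference shrinks towards zero.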
The effective signal selection unit 12 selects the larger-amplitude one of the two (plural) composite signals as the effective signal (an example of the selection rule). The amplitude can be evaluated, for example, by the sum of the squared sample values of each signal over a predetermined time range (a predetermined number of samples), or by the sum of the absolute values as shown in equation (1) below. These evaluations are simple calculations and add only slightly to the computational load.
[Expression 1]
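The image for equation (1) is not reproduced in this text. Read from the surrounding description, a sum of absolute values over a window would take roughly the following form. The symbols A (the amplitude measure), Y_i (a sample of one composite signal), and M (the window length in samples) are names assumed here and do not come from the source.

```latex
A = \sum_{i=0}^{M-1} \lvert Y_i \rvert
```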

As a result, the composite signal in which the periodicity of the original channel signals (L, R) appears more clearly (is not cancelled), given their phase relationship, is selected as the effective signal used for pitch period detection. An appropriate pitch period reflecting the periodicity of each channel signal is therefore detected, and degradation of audio quality after compression/expansion is prevented.
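A selection rule of this kind might look like the following sketch, which uses the sum of absolute values as the amplitude measure. The names `amplitude` and `select_effective` are hypothetical, and the window is simply the whole list passed in.

```python
def amplitude(signal):
    """Sum of absolute sample values over the given window."""
    return sum(abs(v) for v in signal)


def select_effective(composites):
    """Pick the composite signal with the largest amplitude measure."""
    return max(composites, key=amplitude)
```

Swapping `abs(v)` for `v * v` gives the sum-of-squares variant that the description also mentions.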
Each of the components 11 to 14 shown in FIG. 1 may be implemented with its own CPU and peripheral devices (ROM, RAM, etc.) and a program executed by that CPU. Alternatively, they may be implemented with a single CPU and its peripheral devices, together with program modules executed by that CPU that correspond to the processing performed by components 11 to 14.
[0015]
As an example of the pitch period detection (calculation) method used by the pitch period detection unit 13, a predetermined range j = N₀ to N is set in advance for the candidates j of the pitch period P, the strength of periodicity is compared across the candidates j (N₀ to N), and the period evaluated as most strongly periodic is taken as the pitch period P.
For example, when the time range (sample index) i of the effective signal X_i to be evaluated for periodicity is 0 to N (so that the effective signal is referenced over a maximum range of 0 to 2N), the evaluation function for the strength of periodicity may be equation (2) or equation (3) below.
[Expression 2]
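The image for equation (2) is not reproduced in this text. The description of these evaluation functions (a difference between signal values j samples apart, taken as an absolute value or a squared value, with i running from 0 to N) suggests that the absolute-value form is roughly as follows. The name D_1 is assumed here, and which of the two forms is numbered (2) is also an assumption based on the order in which they are mentioned.

```latex
D_1(j) = \sum_{i=0}^{N} \lvert X_i - X_{i+j} \rvert
```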

[Expression 3]
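The image for equation (3) is likewise not reproduced. Under the same reading of the description, the squared-difference form would be roughly as follows, with the name D_2 and the numbering assumed here rather than taken from the source.

```latex
D_2(j) = \sum_{i=0}^{N} \left( X_i - X_{i+j} \right)^2
```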

These equations compute the difference (absolute value or squared value) between signal values j samples apart, and evaluate the periodicity at period j as stronger the smaller that difference is (that is, a similar waveform recurs every period j). The evaluation value of equation (2) or (3) is therefore calculated for each j = N₀ to N, and the j that gives the smallest evaluation value is detected (calculated) as the pitch period P.
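The search over candidate periods can be sketched as below, using the absolute-difference measure. The function name `detect_pitch` is hypothetical, and the exact summation range (i = 0 to N inclusive) is read from the description rather than from the missing equation images.

```python
def detect_pitch(x, n0, n):
    """Return the candidate period j in [n0, n] with the smallest
    summed absolute difference between samples j apart.

    x must hold at least 2 * n + 1 samples, because indices up to
    i + j = 2 * n are referenced.
    """
    def score(j):
        return sum(abs(x[i] - x[i + j]) for i in range(n + 1))

    return min(range(n0, n + 1), key=score)
```

For a signal that repeats every 5 samples, such as `[0, 1, 2, 3, 4] * 5`, a search over j = 2 to 8 scores zero only at j = 5, so that value is returned as the pitch period.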
Based on the pitch period P detected in this way, the signal compression/expansion unit 14 then compresses (or expands) the time axis of each of the two channel signals of the stereo audio signal at the desired compression (or expansion) ratio, and outputs the compressed (or expanded) audio signals L′ and R′. The PICOLA method described above is used for compression and expansion.
Because all channel signals are compressed or expanded based on a single pitch period P obtained from the multi-channel input audio signal, an increase in computational load is prevented, as is any inter-channel phase difference after compression/expansion that would sound unnatural to the listener.
[0016]
An audio signal (channel signals L′, R′) that has been time-axis compressed or expanded by deleting or inserting pitch periods of signal keeps the same pitch between input and output, unlike a signal that is compressed or expanded by changing its frequency.
If a sampling rate conversion unit (an example of the pitch conversion means) is provided after the signal compression/expansion unit 14 to convert the sampling rate of each time-axis compressed or expanded audio signal (channel signals L′, R′) so that it returns to its original time length, pitch conversion with little degradation of audio quality can be realized.
That is, if audio signals L′ and R′ that were time-axis compressed at the target compression ratio Rx (0 < Rx < 1) are sampling-rate converted so that their time length becomes 1/Rx times and are then played back, the signal plays back more slowly, so the frequency of the playback signal (the signal after sampling rate conversion) is multiplied by Rx and the pitch falls accordingly. Similarly, if audio signals L′ and R′ that were time-axis expanded at the target expansion ratio Ry (> 1) are sampling-rate converted so that their time length becomes 1/Ry times and are then played back, the signal plays back faster, so the frequency of the playback signal is multiplied by Ry and the pitch rises accordingly. Therefore, letting Rz be the ratio of the frequency of the output audio signal (playback signal) to the frequency of the input audio signal, the desired pitch conversion is obtained for a given Rz by performing time-axis compression with Rx ← Rz when 0 < Rz < 1, or time-axis expansion with Ry ← Rz when Rz > 1, and then converting the sampling rate so that the signal returns to its original time length.
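The two decisions in this paragraph, choosing compression or expansion from Rz and then restoring the original length, can be sketched as follows. The function names and the use of linear interpolation are assumptions made here; the description only says that the sampling rate is converted, without fixing the interpolation method.

```python
def plan_pitch_shift(rz):
    """Map a frequency ratio Rz to the time-axis operation and its ratio."""
    if rz <= 0 or rz == 1:
        # The description covers only 0 < Rz < 1 and Rz > 1.
        raise ValueError("rz must be positive and different from 1")
    return ("compress", rz) if rz < 1 else ("expand", rz)


def resample_linear(signal, target_len):
    """Stretch or shrink a signal to target_len samples by linear
    interpolation (one simple stand-in for sampling rate conversion)."""
    if target_len == 1:
        return [signal[0]]
    step = (len(signal) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * step
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1 - frac) * signal[i] + frac * nxt)
    return out
```

For example, Rz = 0.5 selects compression with Rx = 0.5, after which the half-length result is resampled back to the input length, lowering the pitch by one octave.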
[0017]
Although the case where the input audio signal is a two-channel stereo audio signal is shown here, a multi-channel audio signal of three or more channels may be used as the input audio signal.
In this case, it is generally desirable to generate composite signals in which the information of all channels is reflected evenly. The composite signal generation unit 11 may therefore generate the composite signals by adding or subtracting the channel signals after applying weights of equal absolute value. For example, for three channel signals I1, I2, and I3, four composite signals α·(I1 + I2 + I3), α·(I1 + I2 − I3), α·(I1 − I2 + I3), and α·(−I1 + I2 + I3) may be generated using a predetermined weighting coefficient α (for example, α = 0.3).
Of course, when the characteristics of the input audio signal make it desirable to place particular emphasis on the periodicity of one of the channel signals, the composite signals may instead be generated by adding and subtracting the channel signals with a relatively large weight applied to that channel signal.
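The three-channel example above, followed by selection of the composite signal with the largest amplitude, might look like the sketch below. The helper names, the toy signals, and the use of peak absolute value as the amplitude measure are assumptions for illustration; the text here does not specify how amplitude is measured.

```python
def composite(channels, weights):
    """Weighted sum of the channel signals, sample by sample."""
    length = len(channels[0])
    return [
        sum(w * ch[n] for w, ch in zip(weights, channels))
        for n in range(length)
    ]


def select_effective(channels, weight_sets):
    """Return (weights, signal) of the composite with the largest peak |value|.

    Peak absolute value is an assumed amplitude measure.
    """
    candidates = [(ws, composite(channels, ws)) for ws in weight_sets]
    return max(candidates, key=lambda c: max(abs(v) for v in c[1]))


alpha = 0.3
weight_sets = [
    (alpha, alpha, alpha),     # alpha * ( I1 + I2 + I3)
    (alpha, alpha, -alpha),    # alpha * ( I1 + I2 - I3)
    (alpha, -alpha, alpha),    # alpha * ( I1 - I2 + I3)
    (-alpha, alpha, alpha),    # alpha * (-I1 + I2 + I3)
]

# Toy input: I3 is in opposite phase to I1 and I2, so plain addition
# partly cancels while the I1 + I2 - I3 pattern reinforces.
i1 = [1.0, -1.0, 1.0, -1.0]
i2 = [1.0, -1.0, 1.0, -1.0]
i3 = [-1.0, 1.0, -1.0, 1.0]

weights, effective = select_effective([i1, i2, i3], weight_sets)
```

To emphasize one channel, a weight set with unequal magnitudes, such as a hypothetical (0.5, 0.25, 0.25), can be passed to the same helper.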
[0018]
[Effects of the invention]
As described above, according to the present invention, a plurality of composite signals, each combining all channels, are generated from the input audio signals of a plurality of channels by addition and subtraction, and time axis compression and/or expansion of all channel signals is performed based on the pitch period obtained from the composite signal having the largest amplitude among them. Consequently, deterioration of sound quality is prevented by reflecting the periodicity of each channel signal, while an increase in calculation load and the occurrence of a phase difference between channels that feels unnatural to the listener are also prevented.
Furthermore, it is possible to perform pitch conversion by converting the sampling rate of the audio signal subjected to time axis compression and / or expansion and returning it to the original time length.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of an audio signal processing apparatus X according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing a waveform of an audio signal when time base compression of the audio signal is performed by the PICOLA method.
FIG. 3 is a diagram schematically showing the waveform of an audio signal when the time base expansion of the audio signal is performed by the PICOLA method.
[Explanation of symbols]
11: Composite signal generation unit (composite signal generating means)
12: Effective signal selection unit (effective signal selection means)
13: Pitch period detection unit (pitch period detecting means)
14: Signal compression/expansion unit (time axis adjusting means)

Claims

1. An audio signal processing apparatus comprising time axis adjusting means for compressing and/or expanding the time axis of input audio signals based on a pitch period obtained from the input audio signals of a plurality of channels, the apparatus further comprising:
composite signal generating means for generating a plurality of composite signals by multiplying each of the input audio signals of all the channels by a predetermined weighting coefficient and adding and subtracting the results;
effective signal selection means for selecting, as an effective signal, the composite signal having the largest amplitude among the plurality of composite signals; and
pitch period detecting means for detecting a pitch period from the effective signal,
wherein the time axis adjusting means executes compression and/or expansion of the time axis of the input audio signals of all the channels based on the pitch period obtained from the effective signal.

2. The audio signal processing apparatus according to claim 1, wherein the input audio signal is a two-channel stereo audio signal, and
the composite signal generating means generates a composite signal obtained by multiplying each of the two channel stereo audio signals by a predetermined weighting coefficient and adding the results, and a composite signal obtained by multiplying each of them by a predetermined weighting coefficient and subtracting one result from the other.

3. The audio signal processing apparatus according to claim 1 or 2, further comprising pitch conversion means for converting the pitch of the input audio signal by converting the sampling rate of the time-axis-adjusted audio signal of each channel, whose time axis has been compressed or expanded by the time axis adjusting means, so that it is restored to the original time length of the input audio signal.

4. An audio signal processing method for compressing and/or expanding the time axis of input audio signals based on a pitch period obtained from the input audio signals of a plurality of channels, the method comprising:
a composite signal generating step of generating a plurality of composite signals by multiplying each of the input audio signals of all the channels by a predetermined weighting coefficient and adding and subtracting the results;
an effective signal selection step of selecting, as an effective signal, the composite signal having the largest amplitude among the plurality of composite signals;
a pitch period detecting step of detecting a pitch period from the effective signal; and
a time axis adjustment step of performing time axis compression and/or expansion of the input audio signals of all the channels based on the pitch period obtained from the effective signal.

5. The audio signal processing method according to claim 4, wherein the input audio signal is a two-channel stereo audio signal, and
the composite signal generating step generates a composite signal obtained by multiplying each of the two channel stereo audio signals by a predetermined weighting coefficient and adding the results, and a composite signal obtained by multiplying each of them by a predetermined weighting coefficient and subtracting one result from the other.