JP5572338B2

JP5572338B2 - Multipoint connection device, multipoint connection method

Info

Publication number: JP5572338B2
Application number: JP2009148755A
Authority: JP
Inventors: 仲大室; 祐介日和▲崎▼; 茂明佐々木
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2009-06-23
Filing date: 2009-06-23
Publication date: 2014-08-13
Anticipated expiration: 2029-06-23
Also published as: JP2011009845A

Description

The present invention relates to a multipoint connection technique that, when digitized acoustic signals such as voice and music (hereinafter called audio signals in this specification) are transmitted and received over a digital communication network, mixes the audio signals from a plurality of points and distributes a mixed signal to each point.

Audio signals are increasingly exchanged among three or more points over digital communication networks such as packet communication networks, for example in audio conferences. A known way to build an audio conference system is to place a server device called a multipoint connection device on the communication network, mix the audio signals sent from the points, and distribute the result to each point.

FIG. 1 is a conceptual diagram showing that, when an audio conference is held among three points, a multipoint connection device (MCU) 100 serves as the core that realizes mutual communication between the points. FIG. 2 shows a configuration example of the multipoint mixing unit 110 included in the multipoint connection device 100. The audio codes sent from points A, B, and C are decoded by decoders 112a, 112b, and 112c, respectively, into digital audio signals, for example in PCM format. The PCM mixing unit 111 uses the decoded digital audio signals to create a mixed signal for each point. For example, the mixed signal for point A is a mixture of the signals from points B and C, and the mixed signal for point B is a mixture of the signals from points A and C. A point's own signal is left out of its mixed signal so that the speaker's own voice does not come back as an echo. The mixed signals for the points are encoded by encoders 113a, 113b, and 113c, respectively, and distributed to the points.

In such a multipoint connection device 100, if the decoders 112a, 112b, 112c and the encoders 113a, 113b, 113c require a large amount of computation, the heavy load placed on the multipoint connection device 100 becomes a problem. For example, when ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) G.711 (see Non-Patent Document 1) is used as the audio coding scheme, both the decoders 112a, 112b, 112c and the encoders 113a, 113b, 113c need little computation, so the load on the multipoint connection device 100 is small. However, when the wideband coding scheme ITU-T G.722 (see Non-Patent Document 2) is used, far more computation is required than with G.711, and as a result the number of points that a single multipoint connection device 100 can handle drops sharply.

To address this problem, a method has been proposed that uses an embedded coding scheme (also called a scalable coding scheme) with G.711 as its core as the audio coding scheme, and uses selective mixing (also called partial mixing) as the processing in the multipoint connection device 100. Details of this processing are described in Non-Patent Documents 3 and 4.

FIG. 3 shows a functional configuration example, described in Non-Patent Document 3, of a multipoint connection device 100 that uses selective mixing. The PCM mixing unit 111 is identical to the PCM mixing unit 111 shown in FIG. 2, so its description is omitted. The audio packet sent from each point is separated by the demultiplexing units (deMUX) 114a, 114b, and 114c into G.711 bits (first audio code), extension bits (second audio code), and control information. The G.711 bits corresponding to each point are decoded by the G.711 decoders 112a, 112b, and 112c, mixed by the PCM mixing unit 111, and then encoded by the G.711 encoders 113a, 113b, and 113c to create the G.711 bits for each point.

The extension bits separated from each point's audio packet are input to the extension bit switching unit 117, and the control information separated from each point's audio packet is input to the point selection unit 116. Using the control information of each point, the point selection unit 116 determines, moment by moment, the main utterance point and the subordinate utterance point, and outputs a control signal for controlling the extension bit switching unit 117. If, at some instant, the main utterance point is A and the subordinate utterance point is B, the extension bit switching unit 117 outputs the extension bits of point A for points B and C, and the extension bits of point B for point A. The G.711 bits and the extension bits for each point are combined by the multiplexing units (MUX) 115a, 115b, and 115c, and the resulting audio packet for each point is sent to that point.

In the configuration shown in FIG. 3, the G.711 coding scheme, the G.711 decoders 112a, 112b, 112c, and the G.711 encoders 113a, 113b, 113c may be replaced by another coding scheme and its corresponding encoders and decoders; in general, a coding scheme that requires little processing for encoding and decoding is desirable. In the case of the G.711.1 audio coding scheme, the extension bits correspond to the audio codes for the lower-band enhancement layer and the higher-band enhancement layer, that is, everything other than the G.711 core layer. The control information may be information that is multiplexed with the audio code and sent independently, or information created inside the multipoint mixing unit from the G.711 bits, from the PCM audio signal output by the G.711 decoder, or from the extension bits. Commonly used control information includes the power of the audio signal (including the PCM audio output by the G.711 decoder), voice/non-voice section information (also called VAD), and voiced/unvoiced identification information.

Non-Patent Document 1: ITU-T Recommendation G.711, "Pulse Code Modulation (PCM) of voice frequencies," ITU-T, 1988.

Non-Patent Document 2: ITU-T Recommendation G.722, "7 kHz Audio-Coding within 64 kbit/s," ITU-T, 1988.

Non-Patent Document 3: "A G.711 Embedded Wideband Speech Coding for VoIP Conferences," IEICE TRANS. INF. & SYST., VOL. E89-D, No. 9, 2006.

Non-Patent Document 4: "International Standard ITU-T G.711.1 for Wideband Speech Coding," NTT Technical Journal, Vol. 20, No. 5, 2008.

When, as shown in FIG. 3, only part of the audio code contained in each audio packet is mixed in the conventional way, while the remaining audio code is not mixed but handled by selection and switched from moment to moment, the quality of the mixed audio reproduced at each point may degrade depending on the processing algorithms of the point selection unit and the extension bit switching unit.

Non-Patent Document 3 also gives an example of an algorithm that uses VAD information to suppress switching of the utterance point at intervals of 200 ms or less. This algorithm has the advantage of avoiding the annoying noise caused by frequent switching. On the other hand, at the moment the utterance point changes, the voice of the main utterance point is temporarily heard as narrowband audio in the 3.4 kHz band, so the transition between 7 kHz wideband audio and 3.4 kHz narrowband audio is noticeable.

Accordingly, an object of the present invention is to provide a multipoint connection technique with little degradation in sound quality.

The present invention divides each audio packet sent from each of three or more points into time units shorter than the packetization period (subframe length units) and outputs divided audio packets [time direction division process]; extracts at least a first audio code and a second audio code from each divided audio packet corresponding to each point [demultiplexing process]; decodes the first audio code corresponding to each point and outputs a first audio signal [decoding process]; mixes the first audio signals corresponding to the points and outputs a mixed audio signal for each point [mixing process]; encodes the mixed audio signal corresponding to each point and outputs a mixed audio code [encoding process]; determines an utterance point from among the points and outputs a control signal corresponding to that utterance point [point selection process]; outputs, for the point determined by the control signal, the second audio code determined by the control signal from among the second audio codes corresponding to the points [second audio code switching process]; combines the mixed audio code corresponding to each point with the second audio code output for that point by the second audio code switching process and outputs a unit audio packet in subframe length units [multiplexing process]; and combines a plurality of unit audio packets corresponding to each point and outputs a transmission audio packet whose time unit is the packetization period [time direction combining process].

Alternatively, the point selection process may perform an acoustic attribute determination process that obtains the acoustic attribute of the first audio code corresponding to each point, a process that stores the acoustic attribute of the first audio code corresponding to each point, an utterance point determination process that determines the current utterance point from among the points based on the current and past acoustic attributes of the first audio code corresponding to each point, and a control signal output process that outputs a control signal corresponding to the determined utterance point.

Alternatively, the point selection process may perform an acoustic attribute determination process that obtains the acoustic attribute of the first audio code corresponding to each point, an utterance point determination process that determines the current utterance point from among the points based on the acoustic attribute of the first audio code corresponding to each point, a process that stores information representing the utterance point, and a control signal output process that outputs a control signal corresponding to the determined utterance point.

According to the present invention, processing is performed in time units shorter than the packetization period, and the result is then restored to audio packets having the packetization period. This makes it possible both to avoid annoying noise and to keep the transition between wideband and narrowband audio inconspicuous, so that a multipoint connection technique with little sound quality degradation can be realized with a small amount of computation.

FIG. 1 is a conceptual diagram showing that, when an audio conference is held among three points, a multipoint connection device (MCU) serves as the core that realizes mutual communication between the points.

FIG. 2 is a diagram showing a functional configuration example of the multipoint mixing unit included in a conventional multipoint connection device.

FIG. 3 is a diagram showing a functional configuration example included in a conventional multipoint connection device using selective mixing.

FIG. 4 is a block diagram showing a functional configuration example of the multipoint connection device of the first embodiment.

FIG. 5 is a diagram showing the processing flow in the multipoint connection device of the first embodiment.

FIG. 6 is a block diagram showing a functional configuration example of the multipoint connection device of the second embodiment.

FIG. 7 is a diagram showing the processing flow in the multipoint connection device of the second embodiment.

FIG. 8 is a block diagram showing a functional configuration example of configuration example 1 of the control information calculation / point selection unit 116p.

FIG. 9 is a diagram showing the processing flow in configuration example 1 of the control information calculation / point selection unit 116p.

FIG. 10 is a block diagram showing a functional configuration example of configuration example 2 of the control information calculation / point selection unit 116p.

FIG. 11 is a diagram showing the processing flow in configuration example 2 of the control information calculation / point selection unit 116p.

FIG. 12 is a block diagram showing a functional configuration example of configuration example 3 of the control information calculation / point selection unit 116p.

FIG. 13 is a diagram showing the processing flow in configuration example 3 of the control information calculation / point selection unit 116p.

FIG. 14 is a block diagram showing a functional configuration example of configuration example 4 of the control information calculation / point selection unit 116p.

FIG. 15 is a diagram showing the processing flow in configuration example 4 of the control information calculation / point selection unit 116p.

[First Embodiment]
When audio packets are transmitted and received over a packet communication network, for example, the two ends agree on a packetization period, which is the time length of the audio signal corresponding to the audio code placed in one audio packet. The packetization period is commonly 10 milliseconds or 20 milliseconds. Each audio coding scheme, in turn, has a time length called a frame, which is the minimum time unit for encoding and decoding. For example, the frame length of G.711.1 is 5 milliseconds. There are also audio coding schemes whose frame length is 20 milliseconds but whose structure allows part of the internal information to be divided into shorter time units, for example 5 milliseconds. Such a time unit shorter than a frame is called a subframe.

FIG. 4 shows a configuration example of the multipoint connection device 200 according to the first embodiment of the present invention. The multipoint connection device 200 is the multipoint connection device 100 shown in FIG. 3 with time direction division units 120a, 120b, 120c and time direction combining units 121a, 121b, 121c added. FIG. 5 shows the processing flow of the multipoint connection device 200 of the first embodiment.

In this multipoint connection device 200, the time direction division units 120a, 120b, and 120c first divide each audio packet received from each point, as far as possible, into time units shorter than the packetization period (hereinafter called subframe length units) and output divided audio packets (step S1). What the time direction division units 120a, 120b, and 120c divide is the payload of the audio packet (the body of data related to the audio information). Specifically, the time direction division unit 120a divides the audio packet received from point A, as far as possible, into subframe length units shorter than the packetization period and outputs divided audio packets. The time direction division unit 120b performs the same processing on the audio packet received from point B, and the time direction division unit 120c performs it on the audio packet received from point C. The inputs and outputs of the time direction division units 120a, 120b, and 120c need not be packets in the narrow sense used for transmission over a communication network (including header information and the like); they may be, for example, code strings delimited into fixed time units.
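Step S1 can be illustrated with a minimal Python sketch. It assumes the payload is a plain concatenation of equal-length subframes with no payload header, which a real payload format may not satisfy; the function name `split_payload` and the 60-byte subframe size are hypothetical and do not come from the patent.

```python
def split_payload(payload: bytes, packet_ms: int, subframe_ms: int) -> list[bytes]:
    """Split one packet payload into equal subframe-length chunks (step S1)."""
    if packet_ms % subframe_ms != 0:
        raise ValueError("packetization period must be a multiple of the subframe length")
    count = packet_ms // subframe_ms
    if len(payload) % count != 0:
        raise ValueError("payload does not divide evenly into subframes")
    size = len(payload) // count
    return [payload[i * size:(i + 1) * size] for i in range(count)]


# Hypothetical input: a 20 ms packet made of four 5 ms subframes of 60 bytes each.
packet = bytes(range(240))
chunks = split_payload(packet, packet_ms=20, subframe_ms=5)
```

With a 20 ms packetization period and a 5 ms subframe length, each packet yields four divided audio packets, so the later selection step can change its decision four times per packet instead of once.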

Next, the demultiplexing units (deMUX) 114a, 114b, and 114c separate each subframe-length divided audio packet corresponding to each point into G.711 bits (first audio code), extension bits (second audio code), and control information (step S2). Specifically, the demultiplexing unit 114a separates each subframe-length divided audio packet corresponding to point A into G.711 bits, extension bits, and control information. The demultiplexing unit 114b performs the same processing on each subframe-length divided audio packet corresponding to point B, and the demultiplexing unit 114c performs it on each subframe-length divided audio packet corresponding to point C.
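The separation in step S2 might look like the sketch below. The byte layout is an assumption made for illustration: the patent does not specify how the three parts are arranged inside a subframe. The 40-byte core follows from G.711 at 64 kbit/s over 5 ms (320 bits); the 20-byte extension size, the trailing control field, and the names `Subframe` and `demux` are hypothetical.

```python
from typing import NamedTuple


class Subframe(NamedTuple):
    core: bytes       # G.711 bits (first audio code)
    extension: bytes  # extension bits (second audio code)
    control: bytes    # control information, possibly empty


def demux(subframe: bytes, core_len: int = 40, ext_len: int = 20) -> Subframe:
    """Separate one divided audio packet into its three parts (step S2).

    Assumed layout: core bytes, then extension bytes, then any control bytes.
    """
    if len(subframe) < core_len + ext_len:
        raise ValueError("subframe shorter than core + extension")
    core = subframe[:core_len]
    extension = subframe[core_len:core_len + ext_len]
    control = subframe[core_len + ext_len:]
    return Subframe(core, extension, control)


parts = demux(bytes(40) + b"\x01" * 20 + b"\x07")
```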

The G.711 bits corresponding to each point are decoded by the G.711 decoders 112a, 112b, and 112c (step S3), mixed by the PCM mixing unit 111 (see FIG. 2) (step S4), and then encoded by the G.711 encoders 113a, 113b, and 113c to create the G.711 bits for each point (step S5).

That is, the G.711 decoders 112a, 112b, and 112c decode the subframe-length G.711 bits sent from the demultiplexing units 114a, 114b, and 114c into digital audio signals (first audio signals), for example in PCM format. Specifically, the G.711 decoder 112a decodes each set of subframe-length G.711 bits sent from the demultiplexing unit 114a into a digital audio signal in PCM format. The G.711 decoder 112b performs the same processing on the subframe-length G.711 bits sent from the demultiplexing unit 114b, and the G.711 decoder 112c performs it on those sent from the demultiplexing unit 114c.
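To show why G.711 decoding is cheap, the following sketch expands mu-law codes to linear PCM using the widely used bit-manipulation form of the expansion. It assumes the mu-law variant of G.711; the patent does not say whether mu-law or A-law is used, and the function names are hypothetical.

```python
def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear PCM sample."""
    code = ~code & 0xFF                      # mu-law bytes are stored bit-inverted
    magnitude = ((code & 0x0F) << 3) + 0x84  # mantissa plus bias
    magnitude <<= (code & 0x70) >> 4         # apply the segment (exponent)
    return 0x84 - magnitude if code & 0x80 else magnitude - 0x84


def decode_subframe(core: bytes) -> list[int]:
    """Decode the G.711 bits of one subframe into PCM samples (step S3)."""
    return [ulaw_to_linear(b) for b in core]


samples = decode_subframe(bytes([0xFF, 0x00, 0x80]))
```

Each sample costs a handful of integer operations, and at 8 kHz sampling a 5 ms subframe holds only 40 samples.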

The PCM mixing unit 111 then uses the subframe-length decoded digital audio signals to create a mixed audio signal for each point. In this example, the mixed audio signal for point A is a mixture of the decoded digital audio signals of points B and C, the mixed audio signal for point B is a mixture of those of points A and C, and the mixed audio signal for point C is a mixture of those of points A and B. As already stated, a point's own signal is left out of its mixed signal so that the speaker's own voice does not come back as an echo. The method of creating the mixed audio signal for each point is not limited. For example, when the PCM mixing unit 111 has the configuration shown in FIG. 2, the mixture of the decoded digital audio signals of points B and C (the mixed audio signal for point A) is created by subtracting the decoded digital audio signal of point A from the mixture of the decoded digital audio signals of points A, B, and C. Similarly, the mixture for points A and C (the mixed audio signal for point B) is created by subtracting the decoded digital audio signal of point B from the mixture of all three, and the mixture for points A and B (the mixed audio signal for point C) is created by subtracting the decoded digital audio signal of point C from the mixture of all three. The method is not limited to subtracting the destination point's decoded digital audio signal from a total mixed signal of all points; the mixed audio signal for a destination point may instead be created by directly mixing the decoded digital audio signals of every point other than that destination.
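The two ways of building the per-point mixes described above can be compared in a short sketch. It assumes at least two points, integer PCM samples, and 16-bit saturation applied only after the final sum or difference; the saturation step and all function names are assumptions, not part of the patent.

```python
def clamp16(x: int) -> int:
    """Saturate a value to the signed 16-bit PCM range (an assumed output format)."""
    return max(-32768, min(32767, x))


def mix_direct(signals: dict[str, list[int]]) -> dict[str, list[int]]:
    """For each destination, sum the signals of every other point."""
    mixes = {}
    for dest in signals:
        others = [s for src, s in signals.items() if src != dest]
        mixes[dest] = [clamp16(sum(col)) for col in zip(*others)]
    return mixes


def mix_by_subtraction(signals: dict[str, list[int]]) -> dict[str, list[int]]:
    """Sum all points once, then subtract each destination's own signal."""
    total = [sum(col) for col in zip(*signals.values())]
    return {
        dest: [clamp16(t - x) for t, x in zip(total, own)]
        for dest, own in signals.items()
    }


signals = {"A": [100, -200, 30000], "B": [1, 2, 30000], "C": [10, 20, -5]}
direct = mix_direct(signals)
subtracted = mix_by_subtraction(signals)
```

The subtraction form computes one total and N differences, whereas the direct form recomputes a partial sum for every destination, which is the practical reason to prefer it as the number of points grows.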

Next, the G.711 encoders 113a, 113b, and 113c G.711-encode the subframe-length mixed audio signals for the respective points and output mixed audio codes. Specifically, the G.711 encoder 113a G.711-encodes each subframe-length mixed audio signal for point A and outputs G.711 bits (mixed audio code). The G.711 encoder 113b performs the same processing on each subframe-length mixed audio signal for point B, and the G.711 encoder 113c performs it on each subframe-length mixed audio signal for point C.

The control information corresponding to each point is input to the point selection unit 116. Using the control information of each point, the point selection unit 116 determines, moment by moment, the main utterance point and the subordinate utterance point, and outputs a control signal for controlling the extension bit switching unit 117 (step S6). The control signal is a signal generated in correspondence with the utterance points. As already stated, in the case of the G.711.1 audio coding scheme the extension bits correspond to the audio codes for the lower-band enhancement layer and the higher-band enhancement layer, that is, everything other than the G.711 core layer. The control information may be information that is multiplexed with the subframe-length divided audio packet and sent independently, or information created inside the multipoint mixing unit from the G.711 bits, from the PCM audio signal output by the G.711 decoder, or from the extension bits. For example, the power of the audio signal (including the PCM audio output by the G.711 decoder), voice/non-voice section information (VAD), and voiced/unvoiced identification information are used as control information.
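One possible decision rule for step S6 is sketched below. The rule itself is an assumption chosen for illustration: among points whose VAD flag is active, the one with the highest power becomes the main utterance point and the next highest becomes the subordinate one. The patent names power and VAD as usable control information but does not prescribe this rule, and the names `ControlInfo` and `select_points` are hypothetical.

```python
from typing import NamedTuple, Optional


class ControlInfo(NamedTuple):
    power: float        # signal power for the current subframe
    voice_active: bool  # VAD decision for the current subframe


def select_points(info: dict[str, ControlInfo]) -> tuple[Optional[str], Optional[str]]:
    """Return (main, subordinate) utterance points for one subframe.

    Hypothetical rule: rank voice-active points by power, highest first.
    """
    active = [point for point, c in info.items() if c.voice_active]
    ranked = sorted(active, key=lambda point: info[point].power, reverse=True)
    main = ranked[0] if ranked else None
    sub = ranked[1] if len(ranked) > 1 else None
    return main, sub


info = {
    "A": ControlInfo(power=0.9, voice_active=True),
    "B": ControlInfo(power=0.4, voice_active=True),
    "C": ControlInfo(power=0.7, voice_active=False),
}
main_point, sub_point = select_points(info)
```

This sketch is memoryless; a selection that also consults stored past attributes or past decisions would need extra state.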

Based on the control signal from the point selection unit 116, the extension bit switching unit 117 selects, from among the extension bits of the points that are input to it, the extension bits to be used for an output point, and outputs the selected extension bits to whichever of the multiplexing units (MUX) 115a, 115b, 115c corresponds to that output point (step S7). This processing is performed for each output point. In other words, the extension bit switching unit 117 outputs, for the point determined by the control signal, the extension bits determined by the control signal from among the extension bits corresponding to the points.

For example, when the output point is the main utterance point, the extension bit switching unit 117 selects the extension bits of the subordinate utterance point and outputs them to the multiplexing unit corresponding to the main utterance point. When the output point is an utterance point other than the main utterance point, the extension bit switching unit 117 selects the extension bits of the main utterance point and outputs them to the multiplexing unit corresponding to that output point.

As a specific example, suppose that at some instant the main utterance point is A and the subordinate utterance point is B. Based on the control signal from the point selection unit 116, the first switching control unit 117a of the extension bit switching unit 117 selects the extension bits of point B as the extension bits for output point A, and the second switching control unit 117b of the extension bit switching unit 117 outputs the selected extension bits to the multiplexing unit corresponding to output point A. Likewise, based on the control signal from the point selection unit 116, the first switching control unit 117a selects the extension bits of point A as the extension bits for output points B and C, and the second switching control unit 117b outputs the selected extension bits to the multiplexing units corresponding to output points B and C.
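The routing rule in this example can be written as a small sketch. The function name `route_extension_bits` and the dictionary representation are hypothetical; the rule itself (the main utterance point receives the subordinate point's extension bits, every other point receives the main point's) is the one stated above.

```python
def route_extension_bits(ext: dict[str, bytes], main: str, sub: str) -> dict[str, bytes]:
    """Choose, for each output point, whose extension bits it receives (step S7)."""
    return {dest: ext[sub] if dest == main else ext[main] for dest in ext}


ext_bits = {"A": b"ext-A", "B": b"ext-B", "C": b"ext-C"}
routed = route_extension_bits(ext_bits, main="A", sub="B")
```

No decoding or re-encoding of the extension bits takes place; the switch only forwards bytes, which is where the computational saving of selective mixing comes from.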

The multiplexing units (MUX) 115a, 115b, and 115c then each combine the G.711 bits (mixed speech code) and the extension bits (second speech code) for the corresponding point and output a unit voice packet of subframe length (step S8). That is, the multiplexing unit 115a combines the subframe-length G.711 bits for point A output by the G.711 encoder 113a with the extension bits output by the extension bit switching unit 117 (in this example, the extension bits of the subordinate utterance point B) and outputs a unit voice packet of subframe length. Similarly, the multiplexing unit 115b combines the G.711 bits for point B output by the G.711 encoder 113b with the extension bits output by the extension bit switching unit 117 (in this example, the extension bits of the main utterance point A) and outputs a unit voice packet of subframe length, and the multiplexing unit 115c combines the G.711 bits for point C output by the G.711 encoder 113c with the extension bits output by the extension bit switching unit 117 (in this example, the extension bits of the main utterance point A) and outputs a unit voice packet of subframe length. Note that a unit voice packet of subframe length is a collection of data related to speech information (a payload); it need not be a packet in the narrow sense of a format for transmission and reception over a communication network.

Next, the time direction combining units 121a, 121b, and 121c combine, in the time direction, the plural subframe-length unit voice packets output by the multiplexing units 115a, 115b, and 115c, and output toward each output point a transmission voice packet whose time unit is the packetization period (step S9). That is, the time direction combining unit 121a combines in the time direction the plural subframe-length unit voice packets output by the multiplexing unit 115a and outputs the resulting transmission voice packet, whose time unit is the packetization period, toward output point A. Similarly, the time direction combining unit 121b combines the unit voice packets output by the multiplexing unit 115b and outputs the resulting transmission voice packet toward output point B, and the time direction combining unit 121c combines the unit voice packets output by the multiplexing unit 115c and outputs the resulting transmission voice packet toward output point C. Note that combining unit voice packets in the time direction is not limited to simply arranging them in time order; codes (layers) may be rearranged within a unit voice packet or between unit voice packets. For example, the transmission voice packet may be generated by joining the time-direction concatenation of the G.711 bits contained in the unit voice packets with the time-direction concatenation of the extension bits contained in the unit voice packets.
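
As a rough sketch of steps S8 and S9, the following Python shows one subframe being multiplexed and then two possible layouts for the transmission voice packet: unit packets in time order, and all G.711 bits followed by all extension bits. The function names, the use of plain byte concatenation, and the two-byte placeholder payloads are assumptions made for illustration; the specification does not fix a byte layout.

```python
def multiplex(g711_bits: bytes, ext_bits: bytes) -> bytes:
    """Step S8 (sketch): join one subframe's G.711 bits and extension bits."""
    return g711_bits + ext_bits


def combine_in_time_order(units) -> bytes:
    """Step S9, simple layout: unit voice packets placed back to back."""
    return b"".join(multiplex(g, e) for g, e in units)


def combine_layer_grouped(units) -> bytes:
    """Step S9, alternative layout: all G.711 bits, then all extension bits."""
    g711_part = b"".join(g for g, _ in units)
    ext_part = b"".join(e for _, e in units)
    return g711_part + ext_part


# Four subframes per packetization period (an assumed ratio), with
# placeholder payloads standing in for real coded bits.
units = [(b"G0", b"e0"), (b"G1", b"e1"), (b"G2", b"e2"), (b"G3", b"e3")]
interleaved = combine_in_time_order(units)
grouped = combine_layer_grouped(units)
```

Both layouts carry the same bits; they differ only in ordering, which is the freedom the paragraph above describes.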

[Second Embodiment]
FIG. 6 shows a configuration example of the multipoint connection device 300 according to the second embodiment of the present invention, and FIG. 7 shows its processing flow. Unlike the multipoint connection device 200 shown in FIG. 4, the multipoint connection device 300 includes a control information calculation / point selection unit 116p in place of the point selection unit 116. The processing and procedures of all functions other than the demultiplexing units 114a, 114b, and 114c and the control information calculation / point selection unit 116p are the same as in the first embodiment, so that description is incorporated here in place of a duplicate explanation. The following description focuses on the processing of the control information calculation / point selection unit 116p.

When the time direction division units 120a, 120b, and 120c divide a voice packet into subframe-length units, the control information for the point selection unit 116 contained in the divided voice packets may not support control on a per-subframe basis. That is, in the first embodiment the demultiplexing units (deMUX) 114a, 114b, and 114c separate the subframe-length divided voice packet of each point into G.711 bits, extension bits, and control information, but in some cases control information in subframe-length units cannot be obtained (step S2a). The second embodiment addresses this situation: the control information calculation / point selection unit 116p determines the utterance points using the G.711 bits and generates the control signal (step S6a).

FIGS. 8 to 15 show functional configuration examples of the control information calculation / point selection unit 116p and their processing flows.
<Configuration example 1>
FIG. 8 shows configuration example 1 of the control information calculation / point selection unit 116p, and FIG. 9 shows its processing flow. The power calculation units 1161a, 1161b, and 1161c calculate the power of each subframe from the G.711 bits of each point, which are input in subframe-length units (step S6a11). The power can be obtained as the sum of squares of the PCM signal produced by decoding the G.711 bits with a G.711 decoder. Alternatively, as a substitute for power, the mean absolute value of that PCM signal, or the mean of the per-sample codes obtained by removing the sign from the G.711 bits, may be used (hereinafter these are collectively referred to as power). The utterance point determination unit 1162 compares the power of the points, determines the point with the highest power as the main utterance point and the point with the second highest power as the subordinate utterance point, and outputs information representing them (step S6a12). The control signal output unit 1163 uses the information on the main utterance point and the subordinate utterance point supplied by the utterance point determination unit 1162 to output a control signal for controlling the extension bit switching unit 117 (step S6a13).
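
A minimal sketch of configuration example 1 follows, assuming the G.711 bits are mu-law coded (the text does not say whether mu-law or A-law is used) and that each byte is one sample. The names `ulaw_to_linear`, `subframe_power`, and `rank_points` are hypothetical, and tie-breaking by dictionary order is likewise an assumption.

```python
def ulaw_to_linear(code: int) -> int:
    """Expand one G.711 mu-law byte to a linear PCM sample."""
    code = ~code & 0xFF                      # mu-law bytes are stored inverted
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if code & 0x80 else magnitude


def subframe_power(g711_bits: bytes) -> int:
    """Step S6a11 (sketch): sum of squares of the decoded PCM samples."""
    return sum(ulaw_to_linear(b) ** 2 for b in g711_bits)


def rank_points(powers: dict) -> tuple:
    """Step S6a12 (sketch): loudest point is main, second loudest is subordinate."""
    ordered = sorted(powers, key=powers.get, reverse=True)
    return ordered[0], ordered[1]


# One short subframe per point: A is loud, C is quieter, B is silent.
subframes = {
    "A": bytes([0x80] * 4),
    "B": bytes([0xFF] * 4),
    "C": bytes([0x90] * 4),
}
powers = {point: subframe_power(bits) for point, bits in subframes.items()}
main_point, sub_point = rank_points(powers)
```

The substitute measures mentioned above (mean absolute value, or the mean of the sign-stripped codes) would only change `subframe_power`; the ranking step stays the same.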

<Configuration example 2>
FIG. 10 shows configuration example 2 of the control information calculation / point selection unit 116p, and FIG. 11 shows its processing flow.
In addition to the functional configuration of configuration example 1 shown in FIG. 8, configuration example 2 includes memories 1164a, 1164b, and 1164c corresponding to the respective points. The processing of all functions other than the utterance point determination unit 1162 and the memories 1164a, 1164b, and 1164c is the same as in configuration example 1, so that description is incorporated here in place of a duplicate explanation.

The power of each point, calculated for each subframe by the power calculation units 1161a, 1161b, and 1161c, is temporarily stored in the memories 1164a, 1164b, and 1164c (step S6a2). The utterance point determination unit 1162 determines the main utterance point and the subordinate utterance point using, in addition to the power of the current subframe for each point, the time series of subframe power for each point stored in the memories 1164a, 1164b, and 1164c, and outputs information representing them (step S6a22). For example, when the differences between points in the power values of the current subframe are smaller than a threshold, the main and subordinate utterance points are determined by taking into account, in addition to the current ordering, the ordering between points of the power values one subframe earlier. Alternatively, when the ordering between points of the power values one subframe earlier is more pronounced than the ordering of the power values of the current subframe, the main and subordinate utterance points may be determined from the ordering one subframe earlier. Alternatively, the average of the power value of the current subframe and the power value one subframe earlier may be compared between points to determine the main and subordinate utterance points.
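
The last variant above, comparing the average of the current and the previous subframe power, might look like the sketch below. The class name `SmoothedSelector`, the use of a `deque` per point to play the role of the memories 1164a to 1164c, and the sample power values are all assumptions for illustration.

```python
from collections import deque


class SmoothedSelector:
    """Rank points by power averaged over the current and stored subframes."""

    def __init__(self, points, history=1):
        # One bounded buffer per point, standing in for memories 1164a-1164c.
        self.memory = {p: deque(maxlen=history) for p in points}

    def decide(self, current):
        averaged = {}
        for point, power in current.items():
            past = self.memory[point]
            averaged[point] = (power + sum(past)) / (1 + len(past))
            past.append(power)               # becomes "one subframe earlier"
        ordered = sorted(averaged, key=averaged.get, reverse=True)
        return ordered[0], ordered[1]        # (main, subordinate)


selector = SmoothedSelector(["A", "B", "C"])
decisions = [
    selector.decide({"A": 100, "B": 10, "C": 1}),
    selector.decide({"A": 40, "B": 60, "C": 1}),   # B is louder only briefly
    selector.decide({"A": 20, "B": 90, "C": 1}),   # B stays louder
]
```

In the second subframe point B has the larger instantaneous power, but the averaged values (70 for A, 35 for B) keep A as the main utterance point; the switch happens one subframe later once B's average overtakes A's.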

<Configuration example 3>
FIG. 12 shows configuration example 3 of the control information calculation / point selection unit 116p, and FIG. 13 shows its processing flow.
In addition to the functional configuration of configuration example 1 shown in FIG. 8, configuration example 3 includes memories 1165a and 1165b corresponding to the utterance points determined as main and subordinate. The processing of all functions other than the utterance point determination unit 1162 and the memories 1165a and 1165b is the same as in configuration example 1, so that description is incorporated here in place of a duplicate explanation.

In configuration example 3, information on the main and subordinate utterance points determined by the utterance point determination unit 1162 is temporarily stored in the memories 1165a and 1165b (step S6a3). The utterance point determination unit 1162 changes the criteria for determining the main and subordinate utterance points according to the information, obtained from the memories 1165a and 1165b, on the main and subordinate utterance points one subframe earlier. Under those criteria it determines the main utterance point and the subordinate utterance point using the power of the current subframe for each point, and outputs information representing them (step S6a32).

In determining the main and subordinate utterance points, the main utterance point is determined first and the subordinate utterance point afterwards. The utterance point determination unit 1162 therefore first changes the criterion for determining the main utterance point according to the information, obtained from the memory 1165a, on the main utterance point one subframe earlier; under that criterion it determines the main utterance point using the power of the current subframe for each point, and outputs information representing it.

For example, if the main utterance point one subframe earlier was point A and, in the current subframe, point A has the largest power of all points, point A remains the main utterance point. If the main utterance point one subframe earlier was point A and, in the current subframe, point B has the largest power of all points, then, unlike configuration example 1 shown in FIG. 8, which would simply make point B the main utterance point, the following processing is performed: if the power of point B in the current subframe is at least α times the power of point A, the main utterance point is changed to point B, but if the power of point B, although larger, is less than α times the power of point A, the main utterance point remains point A (where α is a positive real number, usually greater than 1).

This processing makes it easy for the main utterance point to continue and sets a high hurdle for changing it, which prevents the utterance point from switching more often than necessary. A rule may also be adopted under which the longer the run of subframes in which a given point has been judged the main utterance point, the higher the hurdle for the main utterance point to switch to another point. For example, when the same point has been the main utterance point for four consecutive subframes, a constraint may be imposed that switching the utterance point in the next subframe requires a power difference of a factor of β or more. Here β is a positive real number set independently of α and is usually larger than α.
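
The α and β rules for the main utterance point can be sketched as a small stateful selector. This is an illustrative reading, not the specification's implementation: the class name, the default values α = 2 and β = 4, and the choice to apply β once the incumbent has held for four or more subframes are assumptions.

```python
class MainPointSelector:
    """Hysteresis on the main utterance point (sketch of the alpha/beta rules)."""

    def __init__(self, alpha=2.0, beta=4.0, long_run=4):
        self.alpha, self.beta, self.long_run = alpha, beta, long_run
        self.main = None        # plays the role of memory 1165a
        self.run_length = 0     # consecutive subframes with the same main point

    def decide(self, powers):
        loudest = max(powers, key=powers.get)
        if self.main is None or self.main not in powers:
            new_main = loudest                      # no incumbent yet
        elif loudest == self.main:
            new_main = self.main                    # incumbent is still loudest
        else:
            # A long-standing incumbent is harder to displace (beta > alpha).
            factor = self.beta if self.run_length >= self.long_run else self.alpha
            challenger_wins = powers[loudest] >= factor * powers[self.main]
            new_main = loudest if challenger_wins else self.main
        self.run_length = self.run_length + 1 if new_main == self.main else 1
        self.main = new_main
        return new_main


selector = MainPointSelector(alpha=2.0, beta=4.0, long_run=4)
decisions = [
    selector.decide({"A": 10, "B": 1}),    # A becomes the main utterance point
    selector.decide({"A": 5, "B": 8}),     # B is louder, but not by 2x
    selector.decide({"A": 5, "B": 12}),    # B is at least 2x louder
]
```

Here point B takes over only in the third subframe, when its power reaches α times that of point A; a brief, slightly louder burst in the second subframe does not cause a switch.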

Following the above determination of the main utterance point, the utterance point determination unit 1162 changes the criterion for determining the subordinate utterance point according to the information, obtained from the memory 1165b, on the subordinate utterance point one subframe earlier; under that criterion it determines the subordinate utterance point using the power of the current subframe for each point, and outputs information representing it.

For example, <<1>> when, in the above determination of the main utterance point, point B, the subordinate utterance point one subframe earlier, was not promoted to main utterance point, the following processing is performed. If the subordinate utterance point one subframe earlier was point B and, in the current subframe, the power of point B is the largest among all points excluding the main utterance point determined above, point B remains the subordinate utterance point. If the subordinate utterance point one subframe earlier was point B and, in the current subframe, the power of point C is the largest among all points excluding the main utterance point determined above, then, unlike configuration example 1 shown in FIG. 8, which would simply make point C the subordinate utterance point, the following is done: if the power of point C in the current subframe is at least γ times the power of point B, the subordinate utterance point is changed to point C, but if the power of point C, although larger, is less than γ times the power of point B, the subordinate utterance point remains point B (where γ is a positive real number, usually greater than 1).

<<2>> When, in the above determination of the main utterance point, point B, the subordinate utterance point one subframe earlier, was promoted to main utterance point, the following processing is performed. Point A, the main utterance point one subframe earlier, is regarded as the subordinate utterance point one subframe earlier, and the same processing as in case <<1>> is performed. That is, if the subordinate utterance point one subframe earlier is taken to be point A and, in the current subframe, the power of point A is the largest among all points excluding the main utterance point determined above, point A becomes the subordinate utterance point. If the subordinate utterance point one subframe earlier is taken to be point A and, in the current subframe, the power of point C is the largest among all points excluding the main utterance point determined above, then, unlike configuration example 1 shown in FIG. 8, which would simply make point C the subordinate utterance point, the following is done: if the power of point C in the current subframe is at least γ times the power of point A, the subordinate utterance point is changed to point C, but if the power of point C, although larger, is less than γ times the power of point A, point A becomes the subordinate utterance point (where γ is a positive real number, usually greater than 1).

In case <<2>>, instead of the above processing, in which the main utterance point A of one subframe earlier is regarded as the subordinate utterance point of one subframe earlier in order to determine the subordinate utterance point of the current subframe, the following processing may be performed: the contents of the memory 1165b are cleared, and the subordinate utterance point of the current subframe is determined using the power of the current subframe in the same way as described for configuration example 1.

In this way, the determination of the subordinate utterance point may, like that of the main utterance point, make it easy for the subordinate utterance point to continue and set a high hurdle for changing it, thereby preventing the utterance point from switching more often than necessary. A rule may also be adopted under which the longer the run of subframes in which a given point has been judged the subordinate utterance point, the higher the hurdle for the subordinate utterance point to switch to another point. For example, when the same point has been the subordinate utterance point for four consecutive subframes, a constraint may be imposed that switching the utterance point in the next subframe requires a power difference of a factor of θ or more. Here θ is a positive real number set independently of γ and is usually larger than γ.
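
Cases <<1>> and <<2>> for the subordinate utterance point can be expressed as one function that first picks the incumbent and then applies the γ test. The function name `decide_sub`, its argument list, and the default γ = 2 are assumptions for illustration; the θ rule for long runs is omitted to keep the sketch short.

```python
def decide_sub(powers, new_main, prev_main, prev_sub, gamma=2.0):
    """Choose the subordinate utterance point for the current subframe."""
    # Case <<2>>: the previous subordinate point was promoted to main, so the
    # previous main point is treated as the incumbent subordinate point.
    # Case <<1>>: otherwise the previous subordinate point is the incumbent.
    incumbent = prev_main if new_main == prev_sub else prev_sub

    # The main utterance point decided for this subframe is excluded.
    candidates = {p: v for p, v in powers.items() if p != new_main}
    loudest = max(candidates, key=candidates.get)

    if incumbent is None or incumbent not in candidates or loudest == incumbent:
        return loudest
    if candidates[loudest] >= gamma * candidates[incumbent]:
        return loudest          # challenger clears the gamma hurdle
    return incumbent            # louder, but not by enough


# Case <<1>>: A stays main, B was subordinate, C is only 1.5x louder than B.
case1_keep = decide_sub({"A": 100, "B": 10, "C": 15}, "A", "A", "B")
# Same situation, but C is now 2.5x louder than B.
case1_switch = decide_sub({"A": 100, "B": 10, "C": 25}, "A", "A", "B")
# Case <<2>>: B was promoted to main, so A is treated as the incumbent.
case2_keep = decide_sub({"A": 10, "B": 100, "C": 15}, "B", "A", "B")
```

The alternative described for case <<2>>, clearing the memory and falling back to configuration example 1, corresponds to calling the function with `prev_main=None` and `prev_sub=None`.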

Note that in configuration example 3 it is not essential to change the criterion for determining the subordinate utterance point; in that case the memory 1165b is unnecessary (that is, the subordinate utterance point is determined in accordance with configuration example 1).

<Configuration example 4>
FIG. 14 shows configuration example 4 of the control information calculation / point selection unit 116p, and FIG. 15 shows its processing flow.
Configuration example 4 is a composite of the functional configuration of configuration example 2 shown in FIG. 10 and the functional configuration of configuration example 3 shown in FIG. 12. The processing of each function is already clear from the descriptions of configuration examples 2 and 3, so those descriptions are incorporated here in place of a duplicate explanation.

In configuration example 4, the utterance point determination unit 1162 determines the main and subordinate utterance points for the current subframe using the power of the current subframe for each point, the time series of subframe power for each point stored in the memories 1164a, 1164b, and 1164c, and the information on the main and subordinate utterance points stored in the memories 1165a and 1165b, and outputs information representing them (step S6a4). This determination may be carried out as a two-stage scheme in which the determination of configuration example 2 is performed first and that of configuration example 3 second, as a two-stage scheme in which the determination of configuration example 3 is performed first and that of configuration example 2 second, or as a determination that merges the algorithms of configuration examples 2 and 3. For example, when the main utterance point one subframe earlier is point A, the main utterance point is switched to point B when the power of point B has been the largest in the two most recent consecutive subframes.
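
The closing example of configuration example 4 (switch only after a challenger has been loudest in two consecutive subframes) can be sketched as follows. It is assumed here that the two subframes include the current one, which the text leaves open; the class name and the list used to hold recent results are likewise hypothetical.

```python
class ConsecutiveWinSelector:
    """Switch the main utterance point only after repeated wins by a challenger."""

    def __init__(self, required=2):
        self.required = required
        self.main = None            # previous decision (role of memory 1165a)
        self.recent_loudest = []    # derived from the stored power time series

    def decide(self, powers):
        loudest = max(powers, key=powers.get)
        # Keep only the most recent `required` loudest-point results.
        self.recent_loudest = (self.recent_loudest + [loudest])[-self.required:]
        if self.main is None:
            self.main = loudest
        elif (len(self.recent_loudest) == self.required
              and all(p == loudest for p in self.recent_loudest)):
            self.main = loudest
        return self.main


selector = ConsecutiveWinSelector(required=2)
decisions = [
    selector.decide({"A": 9, "B": 1}),   # A starts as the main utterance point
    selector.decide({"A": 3, "B": 5}),   # B loudest once: no switch yet
    selector.decide({"A": 3, "B": 5}),   # B loudest twice in a row: switch
    selector.decide({"A": 6, "B": 5}),   # A loudest once: B is kept
]
```

This combines the stored history of configuration example 2 with the dependence on the previous decision from configuration example 3, which is one of several possible ways to merge the two.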

Note that in configuration example 4 it is not essential to change the criterion for determining the subordinate utterance point; in that case the memory 1165b is unnecessary (that is, the subordinate utterance point is determined in accordance with configuration example 1 or configuration example 2).

In the embodiments described above, G.711.1 is used as the speech coding scheme, but the invention is not limited to this and other speech coding schemes may be used. Also, in the second embodiment, power as the control information is calculated from the G.711 bits and the utterance points are determined using this power, but the processing is not limited to this. Acoustic attributes such as the power of the speech signal (including the PCM speech output by the G.711 decoder), speech / non-speech interval information (VAD), and voiced / unvoiced discrimination information may be obtained as control information from the PCM speech signal output by the G.711 decoder or from the extension bits.

For example, when the multipoint connection device is implemented by a computer, the multipoint connection device according to the embodiments includes: an input unit to which a keyboard or the like can be connected and an output unit to which a liquid crystal display or the like can be connected [these are not necessarily required when the device is implemented as a simple relay base station]; a communication unit to which a communication device (for example, a modem) capable of communicating with the outside of the multipoint connection device can be connected; a DSP [a CPU may be used instead, and a cache memory or the like may be provided]; a RAM (Random Access Memory) and a ROM (Read Only Memory) as memory; an external storage device such as a hard disk; and a bus connecting the input unit, output unit, communication unit, DSP, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the multipoint connection device may also be provided with a device (drive) capable of reading from and writing to a storage medium such as a CD-ROM.

The external storage device of the multipoint connection device stores a program for multipoint connection and the data required for the processing of this program [the program may instead be stored in a ROM, which is a read-only storage device, for example]. Data obtained through the processing of these programs is stored as appropriate in the RAM or the like.

Specifically, the external storage device stores: a program for dividing the voice packet sent from each of three or more points into subframe-length units shorter than the packetization period and outputting divided voice packets; a program for extracting at least a first speech code and a second speech code from each of the divided voice packets corresponding to the points; a program for decoding the first speech code corresponding to each point and outputting a first speech signal; a program for mixing the first speech signals corresponding to the points and outputting a mixed speech signal for each point; a program for encoding the mixed speech signal corresponding to each point and outputting a mixed speech code; a program for determining the utterance points from among the points and outputting a control signal corresponding to those utterance points; a program for outputting, for each point determined by the control signal, the second speech code determined by the control signal from among the second speech codes corresponding to the points; a program for combining the mixed speech code corresponding to each point with the second speech code for that point output by the second speech code switching processing and outputting a unit voice packet of subframe length; and a program for combining a plurality of the unit voice packets corresponding to each point and outputting a transmission voice packet whose time unit is the packetization period. In addition, a control program for controlling the processing based on these programs is also stored as appropriate.

In the multipoint connection device according to the embodiments, each program stored in the external storage device [or ROM, etc.] and the data required for its processing are read into the RAM as needed and interpreted, executed, and processed by the DSP. As a result, the DSP realizes the prescribed functions (time direction division units, demultiplexing units, decoders, mixing unit, encoders, point selection unit, second speech code switching unit (extension bit switching unit), multiplexing units, and time direction combining units), whereby multipoint connection is achieved.

In addition, the multipoint connection device and method of the present invention are not limited to the embodiments described above and may be modified as appropriate without departing from the gist of the present invention. Furthermore, the processes described for the multipoint connection device and method need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the device that executes them, or as needed.

When the processing functions of the multipoint connection device are realized by a computer, the processing content of the functions that the multipoint connection device should have is described by a program. Executing this program on the computer realizes the processing functions of the multipoint connection device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another mode of execution, the computer may read the program directly from the portable recording medium and execute the process according to it, or, each time a program is transferred to the computer from the server computer, it may execute the process according to the received program in turn. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the multipoint connection device is configured by executing a predetermined program on a computer; however, at least part of this processing content may be realized in hardware.

Claims

A multipoint connection device comprising:
a time direction division unit that divides each voice packet sent from each of three or more points into time units shorter than the packetization period (hereinafter referred to as subframe length units) and outputs divided voice packets;
a demultiplexing unit that extracts at least a first speech code and a second speech code from each of the divided voice packets corresponding to each of the points;
a decoder that decodes the first speech code corresponding to each of the points and outputs a first audio signal;
a mixing unit that mixes the first audio signals corresponding to the points and outputs a mixed audio signal for each of the points;
an encoder that encodes the mixed audio signal corresponding to each of the points and outputs a mixed speech code;
a point selection unit that determines an utterance point from among the points and outputs a control signal corresponding to the utterance point;
a second speech code switching unit that outputs, for the point among the points that is determined according to the control signal, the second speech code that is determined according to the control signal from among the second speech codes corresponding to the points;
a multiplexing unit that combines the mixed speech code corresponding to each of the points with the second speech code corresponding to that point output from the second speech code switching unit and outputs a unit voice packet in subframe length units; and
a time direction combining unit that combines a plurality of the unit voice packets corresponding to each of the points and outputs a voice packet for transmission whose time unit is the packetization period,
wherein the point selection unit includes:
an acoustic attribute determination unit that obtains, for each of the points, one of the sum of squares of the first audio signal, the average of the absolute values of the first audio signal, or the average of the per-sample code values of the first speech code with the sign bit excluded (hereinafter referred to as power);
an utterance point determination unit that applies a decision criterion to the power corresponding to each of the points and determines the current utterance point from among the points;
a memory that stores information representing the utterance point; and
a control signal output unit that outputs a control signal corresponding to the determined utterance point,
wherein the decision criterion is such that the criterion on the power for determining, as the main utterance point of the current subframe, a point different from the main utterance point of the preceding subframe when that different point has the maximum power is higher than the criterion on the power for determining, as the main utterance point of the current subframe, the same point as the main utterance point of the preceding subframe obtained from the memory when that same point has the maximum power, and
the utterance point determination unit determines the utterance point under the decision criterion.
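As an illustration only, and not as part of the claims, the three power measures named above might be computed as in the following Python sketch. The function names are hypothetical, and the third function assumes an 8-bit code whose top bit is the sign bit; a real codec's bit layout may differ and would need its own mapping.

```python
from typing import Sequence


def power_sum_of_squares(samples: Sequence[int]) -> float:
    """Sum of squares of the decoded samples in one subframe."""
    return float(sum(s * s for s in samples))


def power_mean_abs(samples: Sequence[int]) -> float:
    """Average of the absolute values of the decoded samples."""
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def power_mean_code_magnitude(codes: bytes, sign_mask: int = 0x80) -> float:
    """Average of the per-sample code values with the sign bit masked out.

    Works directly on the encoded bytes, so no decoding is needed.
    The 8-bit layout and sign_mask value are assumptions for illustration.
    """
    if not codes:
        return 0.0
    keep = 0xFF & ~sign_mask
    return sum(c & keep for c in codes) / len(codes)
```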

The multipoint connection device according to claim 1,
wherein the longer a given point has continued to be determined as the main utterance point over consecutive subframes, the higher the decision criterion is made for the case where the point having the maximum power differs from the main utterance point of the preceding subframe.

The multipoint connection device according to claim 1,
wherein the decision criterion is such that,
when the maximum power among the powers corresponding to the points for the current subframe is at least α times the power corresponding to the main utterance point of the preceding subframe (where α is a positive number greater than 1), the main utterance point is changed to the point corresponding to the maximum power, and when it is less than α times, the main utterance point is left unchanged from the main utterance point of the preceding subframe.
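As an illustration only, and not as part of the claims, the α rule can be sketched in Python as follows. The function name and the default `alpha=2.0` are hypothetical. The sketch also assumes that "the power corresponding to the main utterance point of the preceding subframe" means that point's power in the current subframe, and it adds a guard so that a tie (for example, all points silent) never triggers a switch.

```python
from typing import Sequence


def select_main_point(
    powers: Sequence[float], prev_main: int, alpha: float = 2.0
) -> int:
    """Return the index of the main utterance point for the current subframe.

    powers[i] is the current-subframe power of point i; prev_main is the
    main utterance point chosen for the preceding subframe.
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    loudest = max(range(len(powers)), key=lambda i: powers[i])
    # Switch only when the loudest point beats the previous main point
    # by a factor of at least alpha; otherwise keep the previous point.
    if (
        powers[loudest] > powers[prev_main]
        and powers[loudest] >= alpha * powers[prev_main]
    ):
        return loudest
    return prev_main
```

Because α is greater than 1, a point that is only slightly louder than the current main point does not take over, which suppresses rapid back-and-forth switching between points of similar power.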

The multipoint connection device according to claim 2,
wherein the decision criterion is such that,
when the maximum power among the powers corresponding to the points for the current subframe is at least α times the power corresponding to the main utterance point of the preceding subframe (where α is a positive number greater than 1), or at least β times that power (where β is a positive number greater than α) if the same point has been the main utterance point over a plurality of consecutive subframes up to the preceding subframe, the main utterance point is changed to the point corresponding to the maximum power, and
when it is less than α times, the main utterance point is left unchanged from the main utterance point of the preceding subframe.
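As an illustration only, and not as part of the claims, the α/β rule can be sketched in Python by tracking how many consecutive subframes the current main point has held. The function name, the default values, and the `long_run` parameter (the run length at which β replaces α, set here to 2 as one reading of "a plurality of consecutive subframes") are assumptions.

```python
from typing import Sequence, Tuple


def select_with_run_length(
    powers: Sequence[float],
    prev_main: int,
    run_length: int,
    alpha: float = 2.0,
    beta: float = 4.0,
    long_run: int = 2,
) -> Tuple[int, int]:
    """Return (main point, updated run length) for the current subframe.

    run_length counts the consecutive subframes, up to the preceding one,
    in which prev_main was the main utterance point.
    """
    if not 1 < alpha < beta:
        raise ValueError("require 1 < alpha < beta")
    # A point that has held for a long run is harder to displace.
    factor = beta if run_length >= long_run else alpha
    loudest = max(range(len(powers)), key=lambda i: powers[i])
    if (
        powers[loudest] > powers[prev_main]
        and powers[loudest] >= factor * powers[prev_main]
    ):
        return loudest, 1
    return prev_main, run_length + 1
```

The caller would feed the returned run length back in on the next subframe, so the stricter factor takes effect once a point has been the main utterance point for `long_run` subframes in a row.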

A multipoint connection device comprising:
a time direction division unit that divides each voice packet sent from each of three or more points into time units shorter than the packetization period (hereinafter referred to as subframe length units) and outputs divided voice packets;
a demultiplexing unit that extracts at least a first speech code and a second speech code from each of the divided voice packets corresponding to each of the points;
a decoder that decodes the first speech code corresponding to each of the points and outputs a first audio signal;
a mixing unit that mixes the first audio signals corresponding to the points and outputs a mixed audio signal for each of the points;
an encoder that encodes the mixed audio signal corresponding to each of the points and outputs a mixed speech code;
a point selection unit that determines an utterance point from among the points and outputs a control signal corresponding to the utterance point;
a second speech code switching unit that outputs, for the point among the points that is determined according to the control signal, the second speech code that is determined according to the control signal from among the second speech codes corresponding to the points;
a multiplexing unit that combines the mixed speech code corresponding to each of the points with the second speech code corresponding to that point output from the second speech code switching unit and outputs a unit voice packet in subframe length units; and
a time direction combining unit that combines a plurality of the unit voice packets corresponding to each of the points and outputs a voice packet for transmission whose time unit is the packetization period,
wherein the point selection unit includes:
an acoustic attribute determination unit that obtains, for each of the points, one of the sum of squares of the first audio signal, the average of the absolute values of the first audio signal, or the average of the per-sample code values of the first speech code with the sign bit excluded (hereinafter referred to as power);
a memory that stores the power corresponding to each of the points;
an utterance point determination unit that determines the current utterance point from among the points based on the current power corresponding to each of the points and the past power corresponding to each of the points stored in the memory; and
a control signal output unit that outputs a control signal corresponding to the determined utterance point,
and wherein, when the difference between the points in the power value of the current subframe is smaller than a threshold, the utterance point determination unit also takes into account in the determination the magnitude relationship between the points in the power value of the preceding subframe.
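As an illustration only, and not as part of the claims, one way to take the preceding subframe into account is sketched below in Python. The claim does not say how the earlier magnitude relationship is weighed, so this sketch makes an assumption: when the two loudest points of the current subframe are closer than the threshold, the one that was louder in the preceding subframe wins. The function name is hypothetical.

```python
from typing import Sequence


def select_with_history(
    current: Sequence[float], previous: Sequence[float], threshold: float
) -> int:
    """Pick the utterance point, using the preceding subframe as a tie-breaker.

    current[i] and previous[i] are the powers of point i in the current and
    preceding subframes.
    """
    order = sorted(range(len(current)), key=lambda i: current[i], reverse=True)
    best = order[0]
    if len(order) < 2:
        return best
    runner_up = order[1]
    # If the current subframe does not separate the top two points clearly,
    # fall back on which of them was louder one subframe earlier.
    if current[best] - current[runner_up] < threshold:
        if previous[runner_up] > previous[best]:
            return runner_up
    return best
```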

A multipoint connection device comprising:
a time direction division unit that divides each voice packet sent from each of three or more points into time units shorter than the packetization period (hereinafter referred to as subframe length units) and outputs divided voice packets;
a demultiplexing unit that extracts at least a first speech code and a second speech code from each of the divided voice packets corresponding to each of the points;
a decoder that decodes the first speech code corresponding to each of the points and outputs a first audio signal;
a mixing unit that mixes the first audio signals corresponding to the points and outputs a mixed audio signal for each of the points;
an encoder that encodes the mixed audio signal corresponding to each of the points and outputs a mixed speech code;
a point selection unit that determines an utterance point from among the points and outputs a control signal corresponding to the utterance point;
a second speech code switching unit that outputs, for the point among the points that is determined according to the control signal, the second speech code that is determined according to the control signal from among the second speech codes corresponding to the points;
a multiplexing unit that combines the mixed speech code corresponding to each of the points with the second speech code corresponding to that point output from the second speech code switching unit and outputs a unit voice packet in subframe length units; and
a time direction combining unit that combines a plurality of the unit voice packets corresponding to each of the points and outputs a voice packet for transmission whose time unit is the packetization period,
wherein the point selection unit includes:
an acoustic attribute determination unit that obtains, for each of the points, one of the sum of squares of the first audio signal, the average of the absolute values of the first audio signal, or the average of the per-sample code values of the first speech code with the sign bit excluded (hereinafter referred to as power);
a memory that stores the power corresponding to each of the points;
an utterance point determination unit that determines the current utterance point from among the points based on the current power corresponding to each of the points and the past power corresponding to each of the points stored in the memory; and
a control signal output unit that outputs a control signal corresponding to the determined utterance point,
and wherein, when the magnitude relationship between the points in the power value of the preceding subframe is more pronounced than the magnitude relationship between the points in the power value of the current subframe, the utterance point determination unit determines the main utterance point based on the magnitude relationship between the points in the power value of the preceding subframe.
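As an illustration only, and not as part of the claims, the comparison between subframes can be sketched in Python. The claim does not define how pronounced a magnitude relationship is, so this sketch assumes a hypothetical measure: the gap between the largest and second-largest power. The function names are likewise hypothetical.

```python
from typing import Sequence


def top_gap(powers: Sequence[float]) -> float:
    """Gap between the largest and second-largest power (assumed measure)."""
    ordered = sorted(powers, reverse=True)
    return ordered[0] - ordered[1] if len(ordered) > 1 else ordered[0]


def select_by_clearer_subframe(
    current: Sequence[float], previous: Sequence[float]
) -> int:
    """Pick the loudest point from whichever subframe separates points better."""
    # Use the preceding subframe only when its ranking is the clearer one.
    source = previous if top_gap(previous) > top_gap(current) else current
    return max(range(len(source)), key=lambda i: source[i])
```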

A multipoint connection method comprising:
a time direction division step of dividing each voice packet sent from each of three or more points into time units shorter than the packetization period (hereinafter referred to as subframe length units) and outputting divided voice packets;
a demultiplexing step of extracting at least a first speech code and a second speech code from each of the divided voice packets corresponding to each of the points;
a decoding step of decoding the first speech code corresponding to each of the points and outputting a first audio signal;
a mixing step of mixing the first audio signals corresponding to the points and outputting a mixed audio signal for each of the points;
an encoding step of encoding the mixed audio signal corresponding to each of the points and outputting a mixed speech code;
a point selection step of determining an utterance point from among the points and outputting a control signal corresponding to the utterance point;
a second speech code switching step of outputting, for the point among the points that is determined according to the control signal, the second speech code that is determined according to the control signal from among the second speech codes corresponding to the points;
a multiplexing step of combining the mixed speech code corresponding to each of the points with the second speech code corresponding to that point output in the second speech code switching step and outputting a unit voice packet in subframe length units; and
a time direction combining step of combining a plurality of the unit voice packets corresponding to each of the points and outputting a voice packet for transmission whose time unit is the packetization period,
wherein the point selection step includes:
an acoustic attribute determination step of obtaining, for each of the points, one of the sum of squares of the first audio signal, the average of the absolute values of the first audio signal, or the average of the per-sample code values of the first speech code with the sign bit excluded (hereinafter referred to as power);
an utterance point determination step of applying a decision criterion to the power corresponding to each of the points and determining the current utterance point from among the points;
a storage step of storing information representing the utterance point in a memory; and
a control signal output step of outputting a control signal corresponding to the determined utterance point,
wherein the decision criterion is such that the criterion on the power for determining, as the main utterance point of the current subframe, a point different from the main utterance point of the preceding subframe when that different point has the maximum power is higher than the criterion on the power for determining, as the main utterance point of the current subframe, the same point as the main utterance point of the preceding subframe obtained from the memory when that same point has the maximum power, and
in the utterance point determination step, the utterance point is determined under the decision criterion.

The multipoint connection method according to claim 7,
wherein the longer a given point has continued to be determined as the main utterance point over consecutive subframes, the higher the decision criterion is made for the case where the point having the maximum power differs from the main utterance point of the preceding subframe.

The multipoint connection method according to claim 7,
wherein the decision criterion is such that,
when the maximum power among the powers corresponding to the points for the current subframe is at least α times the power corresponding to the main utterance point of the preceding subframe (where α is a positive number greater than 1), the main utterance point is changed to the point corresponding to the maximum power, and when it is less than α times, the main utterance point is left unchanged from the main utterance point of the preceding subframe.

The multipoint connection method according to claim 8,
wherein the decision criterion is such that,
when the maximum power among the powers corresponding to the points for the current subframe is at least α times the power corresponding to the main utterance point of the preceding subframe (where α is a positive number greater than 1), or at least β times that power (where β is a positive number greater than α) if the same point has been the main utterance point over a plurality of consecutive subframes up to the preceding subframe, the main utterance point is changed to the point corresponding to the maximum power, and
when it is less than α times, the main utterance point is left unchanged from the main utterance point of the preceding subframe.

A multipoint connection method comprising:
a time direction division step of dividing each voice packet sent from each of three or more points into time units shorter than the packetization period (hereinafter referred to as subframe length units) and outputting divided voice packets;
a demultiplexing step of extracting at least a first speech code and a second speech code from each of the divided voice packets corresponding to each of the points;
a decoding step of decoding the first speech code corresponding to each of the points and outputting a first audio signal;
a mixing step of mixing the first audio signals corresponding to the points and outputting a mixed audio signal for each of the points;
an encoding step of encoding the mixed audio signal corresponding to each of the points and outputting a mixed speech code;
a point selection step of determining an utterance point from among the points and outputting a control signal corresponding to the utterance point;
a second speech code switching step of outputting, for the point among the points that is determined according to the control signal, the second speech code that is determined according to the control signal from among the second speech codes corresponding to the points;
a multiplexing step of combining the mixed speech code corresponding to each of the points with the second speech code corresponding to that point output in the second speech code switching step and outputting a unit voice packet in subframe length units; and
a time direction combining step of combining a plurality of the unit voice packets corresponding to each of the points and outputting a voice packet for transmission whose time unit is the packetization period,
wherein the point selection step includes:
an acoustic attribute determination step of obtaining, for each of the points, one of the sum of squares of the first audio signal, the average of the absolute values of the first audio signal, or the average of the per-sample code values of the first speech code with the sign bit excluded (hereinafter referred to as power);
a storage step of storing the power corresponding to each of the points in a memory;
an utterance point determination step of determining the current utterance point from among the points based on the current power corresponding to each of the points and the past power corresponding to each of the points stored in the memory; and
a control signal output step of outputting a control signal corresponding to the determined utterance point,
and wherein, in the utterance point determination step, when the difference between the points in the power value of the current subframe is smaller than a threshold, the magnitude relationship between the points in the power value of the preceding subframe is also taken into account in the determination.

A multipoint connection method comprising:
a time direction division step of dividing each voice packet sent from each of three or more points into time units shorter than the packetization period (hereinafter referred to as subframe length units) and outputting divided voice packets;
a demultiplexing step of extracting at least a first speech code and a second speech code from each of the divided voice packets corresponding to each of the points;
a decoding step of decoding the first speech code corresponding to each of the points and outputting a first audio signal;
a mixing step of mixing the first audio signals corresponding to the points and outputting a mixed audio signal for each of the points;
an encoding step of encoding the mixed audio signal corresponding to each of the points and outputting a mixed speech code;
a point selection step of determining an utterance point from among the points and outputting a control signal corresponding to the utterance point;
a second speech code switching step of outputting, for the point among the points that is determined according to the control signal, the second speech code that is determined according to the control signal from among the second speech codes corresponding to the points;
a multiplexing step of combining the mixed speech code corresponding to each of the points with the second speech code corresponding to that point output in the second speech code switching step and outputting a unit voice packet in subframe length units; and
a time direction combining step of combining a plurality of the unit voice packets corresponding to each of the points and outputting a voice packet for transmission whose time unit is the packetization period,
wherein the point selection step includes:
an acoustic attribute determination step of obtaining, for each of the points, one of the sum of squares of the first audio signal, the average of the absolute values of the first audio signal, or the average of the per-sample code values of the first speech code with the sign bit excluded (hereinafter referred to as power);
a storage step of storing the power corresponding to each of the points in a memory;
an utterance point determination step of determining the current utterance point from among the points based on the current power corresponding to each of the points and the past power corresponding to each of the points stored in the memory; and
a control signal output step of outputting a control signal corresponding to the determined utterance point,
and wherein, in the utterance point determination step, when the magnitude relationship between the points in the power value of the preceding subframe is more pronounced than the magnitude relationship between the points in the power value of the current subframe, the main utterance point is determined based on the magnitude relationship between the points in the power value of the preceding subframe.