JP7571111B2

JP7571111B2 - COMMUNICATION TERMINAL, INFORMATION PROCESSING APPARATUS, COMMUNICATION METHOD, AND PROGRAM

Info

Publication number: JP7571111B2
Application number: JP2022209807A
Authority: JP
Inventors: 浩亮藁谷
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2022-12-27
Filing date: 2022-12-27
Publication date: 2024-10-22
Anticipated expiration: 2042-12-27
Also published as: JP2024093431A

Description

The present invention relates to a communication terminal, an information processing device, a communication method, and a program.

Conventionally, for a terminal capable of voice communication with a call destination terminal, a technique is known that makes the voice of a speaker who is talking on another nearby terminal harder to hear at the call destination terminal (see, for example, Patent Document 1).

Japanese Patent No. 7137033

In the technique described in Patent Document 1, the voice of the speaker using the own terminal is separated based on the difference between the voice level of the speaker using the own terminal and the voice level of a speaker not using the own terminal, and the separated voice is transmitted to the call destination terminal. When the voice to be transmitted is selected in this way, a change in the positional relationship between the own terminal and the multiple speakers near it changes the balance of the speakers' volumes in the mixed voice, and the voice of another speaker may then be selected and transmitted to the call destination terminal.

The present invention has been made in view of these points, and aims to improve the probability that the voice of the speaker using the own terminal is transmitted to the call destination terminal.

A communication terminal according to a first aspect of the present invention is a communication terminal used by a user, and includes: a storage unit that stores feature data including features of the user's voice; a sound collection unit that collects sound around the communication terminal and converts it into voice data; a voice separation unit that separates multiple voice data of multiple speakers including the user, contained in the voice data, to generate multiple internally separated voice data; a selection unit that identifies multiple internal similarities, each being the similarity between one of the internally separated voice data and the feature data, and selects, as selected separated voice data, the internally separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the multiple internal similarities; and a communication unit that transmits the selected separated voice data to a call destination terminal used by the user's call partner.

The selection unit may select, as the selected separated voice data, the internally separated voice data corresponding to the internal similarity that satisfies the predetermined condition of being the largest among the multiple internal similarities.

The communication terminal may further include an information acquisition unit that acquires, from another communication terminal that collects the same sound as the sound collected by the sound collection unit, multiple externally separated voice data generated by the other communication terminal in association with multiple external similarities, each being the similarity between one of the externally separated voice data and feature data of another user who uses the other communication terminal. For each of the internally separated voice data, the selection unit may associate the internally separated voice data and its internal similarity with the externally separated voice data that is most similar to it among those acquired by the information acquisition unit and with the external similarity corresponding to that externally separated voice data, and may determine the selected separated voice data based on a result of comparing the associated internal similarity and external similarity.

The selection unit may select, as the selected separated voice data, the internally separated voice data that satisfies the predetermined condition that the internal similarity is greater than the external similarity.

When the internal similarity is equal to or less than the external similarity, the selection unit may select, as the selected separated voice data, internally separated voice data other than the one whose internal similarity fails the predetermined condition.

The communication terminal may further include an image data acquisition unit that acquires face image data of the user, and the selection unit may select, as the selected separated voice data, the internally separated voice data that is synchronized with the timing of changes in the user's face indicated by the face image data, from among the multiple internally separated voice data.

An information processing device according to a second aspect of the present invention includes: a storage unit that stores feature data including features of the voice of a user who uses a communication terminal; a voice separation unit that separates multiple voice data of multiple speakers including the user, contained in voice data based on the sound around the communication terminal collected by the communication terminal, to generate multiple separated voice data; a selection unit that identifies the similarity between each of the separated voice data and the feature data, and selects, as selected separated voice data, the separated voice data corresponding to the similarity that satisfies a predetermined condition among the multiple similarities; and a communication unit that transmits the selected separated voice data to a call destination terminal used by the user's call partner.

The storage unit may store multiple feature data including features of the voices of multiple users who use multiple communication terminals, the selection unit may determine the selected separated voice data in association with each of the multiple communication terminals, and the communication unit may transmit the selected separated voice data to the multiple call destination terminals corresponding to the multiple communication terminals.

A communication method according to a third aspect of the present invention is executed by a computer and includes the steps of: separating multiple voice data of multiple speakers including a user, contained in voice data based on the sound around the user who is talking with a call partner using a destination terminal, to generate multiple internally separated voice data; generating multiple internal similarities, each being the similarity between one of the internally separated voice data and feature data that is stored in a storage unit and includes features of the user's voice; selecting, as selected separated voice data, the internally separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the multiple internal similarities; and transmitting the selected separated voice data to a call destination terminal used by the user's call partner.

A program according to a fourth aspect of the present invention causes a computer to execute the steps of: separating multiple voice data of multiple speakers including a user, contained in voice data based on the sound around the user who is talking with a call partner using a destination terminal, to generate multiple internally separated voice data; generating multiple internal similarities, each being the similarity between one of the internally separated voice data and feature data that is stored in a storage unit and includes features of the user's voice; selecting, as selected separated voice data, the internally separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the multiple internal similarities; and transmitting the selected separated voice data to a call destination terminal used by the user's call partner.

The present invention has the effect of improving the probability that the voice of the speaker using the own terminal is transmitted to the call destination terminal.

FIG. 1 is a diagram for explaining an overview of a communication system S.
FIG. 2 is a diagram showing the configuration of a communication terminal 1.
FIG. 3 is a diagram for explaining details of the operation in which the selection unit 163 determines selected separated voice data using externally separated voice data.
FIG. 4 is a diagram showing the multiple internally separated voice data and the multiple internal similarities transmitted by each communication terminal 1.
FIG. 5 is a diagram for explaining a method by which the selection unit 163 of each of multiple communication terminals 1 determines selected separated voice data using internal similarity information and external similarity information.
FIG. 6 is a sequence diagram showing the flow of processing in multiple communication terminals 1.
FIG. 7 is a flowchart showing the flow of processing in the communication terminal 1.
FIG. 8 is a diagram showing the configuration of a communication terminal 1A according to a first modification.
FIG. 9 is a diagram for explaining an overview of a second embodiment.
FIG. 10 is a diagram showing the configuration of an information processing device 3.

<First Embodiment>
[Overview of communication system S]
FIG. 1 is a diagram for explaining an overview of a communication system S. The communication system S is a system with which a user operating a communication terminal 1 performs voice communication with another user via a network N. The network N is, for example, a telephone network, the Internet, or a local area network.

FIG. 1 shows a state in which user U1 is using communication terminal 1 to talk with user U10, the call partner, who uses call destination terminal 2. Near user U1 is user U2, who is talking with a user other than user U10, so the voice of user U2 enters communication terminal 1 together with the voice of user U1. When both voices enter communication terminal 1 in this way, call destination terminal 2 outputs both of them, which makes it hard for user U10 to hear user U1. In the example shown in FIG. 1, user U1 says "Hello" into communication terminal 1 while user U2 says "Nice to meet you," so at call destination terminal 2 the words "Nice to meet you" are heard overlapping "Hello."

To solve this problem, techniques for separating the voice of user U1 from the voice of user U2 are known. As one example, in the technique described in Japanese Patent No. 7137033, the communication terminal 1 used by user U1 and the communication terminal 1 used by user U2 communicate wirelessly, and each transmits to the other a signal based on the voice input to it. The volume ratio between the voices of user U1 and user U2 in the mixed voice collected by a communication terminal 1 differs from the volume ratio in the mixed voice it receives by wireless communication. Based on this difference, the communication terminal 1 performs independent component analysis on the collected mixed voice and the received mixed voice to separate the voice of user U1 (separated voice Z1 in FIG. 1) from the voice of user U2 (separated voice Z2 in FIG. 1).
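The role of the differing volume ratios can be written out as a small worked model. This is only an illustrative sketch under a simplifying assumption that is not stated in the source: the mixing is instantaneous and linear, with no delay or reverberation. The symbols $s_1, s_2$ (the voices of users U1 and U2), $x_1$ (the mixed voice collected by the own terminal), $x_2$ (the mixed voice received wirelessly), $A$ and $W$ are introduced here for illustration.

```latex
\begin{aligned}
x_1(t) &= a_{11}\, s_1(t) + a_{12}\, s_2(t), \\
x_2(t) &= a_{21}\, s_1(t) + a_{22}\, s_2(t),
\end{aligned}
\qquad
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\qquad
\frac{a_{11}}{a_{12}} \neq \frac{a_{21}}{a_{22}}
\;\Longleftrightarrow\; \det A \neq 0 .

\mathbf{z}(t) = W\,\mathbf{x}(t), \qquad W \approx P\,D\,A^{-1}
```

Here $P$ is a permutation matrix and $D$ a diagonal scaling matrix: independent component analysis recovers the sources only up to ordering and gain. Under this model the separation alone does not say which output is the voice of user U1, which is why a separate step is needed to pick the correct separated voice.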

Any voice separation method may be used. For example, the communication terminal 1 may have multiple microphones and separate the multiple voices based on the sound collected by the microphones and the distances between them. Of separated voice Z1 and separated voice Z2, the communication terminal 1 transmits to call destination terminal 2 the one it determines to be the voice of user U1. As a result, user U10, who uses call destination terminal 2, hears the voice of user U1 and does not hear the voice of user U2.

When multiple voices are separated in this way, the volume of the voice uttered by user U2 may fluctuate depending on the surrounding environment, or the voice may be delayed by reflections, and as a result the voice of user U1 may fail to be transmitted to call destination terminal 2 used by user U1's call partner. The communication terminal 1 according to this embodiment is therefore characterized in that it uses feature data acquired in advance that includes features of user U1's voice, or feature data that includes features of user U1's face while speaking, to increase the probability of correctly selecting the voice of user U1 from separated voice Z1 and separated voice Z2.

[Configuration of communication terminal 1]
The configuration and operation of the communication terminal 1 are described in detail below. This specification takes as an example the case in which the communication terminal 1 separates multiple voices using voice received from another communication terminal 1 by wireless communication. In the following description, user U1 uses communication terminal 1-1 and user U2 uses communication terminal 1-2.

FIG. 2 is a diagram showing the configuration of the communication terminal 1. The communication terminal 1 has a sound collection unit 11, a first communication unit 12, a second communication unit 13, a voice output unit 14, a storage unit 15, and a control unit 16. The control unit 16 has a voice separation unit 161, an information acquisition unit 162, and a selection unit 163. Here, the configuration is described for the case in which the communication terminal 1 is communication terminal 1-1 used by user U1.

The sound collection unit 11 collects sound around the communication terminal 1 and converts it into voice data. The sound collection unit 11 has, for example, a microphone. The sound collection unit 11 inputs the voice data to the voice separation unit 161.

The first communication unit 12 communicates wirelessly with communication terminal 1-2 used by user U2 over a wireless communication channel (for example, Bluetooth (registered trademark) or Wi-Fi (registered trademark)). The first communication unit 12 receives the mixed voice data collected by communication terminal 1-2 and the multiple externally separated voice data that communication terminal 1-2 generated from the sound it collected. In association with the externally separated voice data, the first communication unit 12 also receives multiple external similarities, each being the similarity between one of the externally separated voice data and the feature data of user U2, the other user who uses communication terminal 1-2. The first communication unit 12 inputs the received mixed voice data to the voice separation unit 161, and inputs the received externally separated voice data and external similarities to the information acquisition unit 162.

The multiple externally separated voice data are voice data that communication terminal 1-2 generates by separating the multiple voices contained in the sound it collected. Each external similarity is expressed, for example, as a correlation value or the Kullback-Leibler divergence between information obtained by cepstral analysis of the externally separated voice data and information obtained by cepstral analysis of the feature data of user U2.
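The two measures named above can be sketched as follows. This is a minimal illustration, not the method of the specification: it assumes the cepstral analysis has already produced fixed-length feature vectors, and the function names `pearson_correlation` and `kl_divergence` are hypothetical. The Kullback-Leibler divergence is defined for probability distributions, so the sketch assumes non-negative inputs that are normalized to sum to 1; how a divergence is turned into a similarity is not stated in the source.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation between two equal-length feature vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for two non-negative vectors, each normalized to sum to 1.

    Assumption: inputs are non-negative (e.g. spectral-envelope histograms).
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    total = 0.0
    for x, y in zip(p, q):
        pn, qn = x / sum_p, y / sum_q
        if pn > 0:
            total += pn * math.log(pn / max(qn, eps))
    return total
```

Note the opposite directions of the two measures: a larger correlation means the separated voice is closer to the registered features, whereas a smaller divergence does, so a divergence would have to be negated or inverted before it is compared as a similarity.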

The second communication unit 13 has a communication interface for communicating via the network N with call destination terminal 2 used by user U10 and for transmitting and receiving voice signals. The second communication unit 13 transmits the selected separated voice data input from the selection unit 163 to call destination terminal 2, the terminal used by user U10, who is user U1's call partner. The second communication unit 13 also inputs the voice data received from call destination terminal 2 to the voice output unit 14. The selected separated voice data is described in detail later.

The voice output unit 14 converts the voice data received from call destination terminal 2 via the second communication unit 13 into sound and outputs it. The voice output unit 14 is, for example, a speaker, earphones, or headphones.

The storage unit 15 has storage media such as a ROM (Read Only Memory) and a RAM (Random Access Memory). The storage unit 15 stores the programs executed by the control unit 16. The storage unit 15 also stores various data used to separate multiple voices. For example, the storage unit 15 stores pre-registered feature data including features of user U1's voice.

The control unit 16 has, for example, a CPU (Central Processing Unit). By executing the programs stored in the storage unit 15, the control unit 16 functions as the voice separation unit 161, the information acquisition unit 162, and the selection unit 163.

The voice separation unit 161 separates the multiple voice data of multiple speakers including user U1 that are contained in the voice data input from the sound collection unit 11, and generates multiple internally separated voice data. As noted above, the voice separation unit 161 may use any method for this separation. The voice separation unit 161 inputs the generated internally separated voice data to the selection unit 163, for example in association with information identifying each of them. The voice separation unit 161 may input the individual internally separated voice data to the selection unit 163 at different times.

Via the first communication unit 12, the information acquisition unit 162 acquires, from another communication terminal 1 (for example, communication terminal 1-2) that collects the same sound as the sound collected by the sound collection unit 11, the multiple externally separated voice data generated by the other communication terminal 1 in association with multiple external similarities, each being the similarity between one of the externally separated voice data and the feature data of the other user U2 who uses the other communication terminal 1. The information acquisition unit 162 inputs the acquired externally separated voice data and external similarities to the selection unit 163 in association with each other. The information acquisition unit 162 may also store them in the storage unit 15 in association with each other.

The selection unit 163 identifies multiple internal similarities, each being the similarity between one of the internally separated voice data input from the voice separation unit 161 and the feature data of user U1 stored in the storage unit 15. Like the external similarities described above, each internal similarity is expressed, for example, as a correlation value or the Kullback-Leibler divergence between information obtained by cepstral analysis of the internally separated voice data and information obtained by cepstral analysis of the feature data of user U1.

The selection unit 163 selects, as the selected separated voice data, the internally separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the identified internal similarities. The predetermined condition is, for example, that the internal similarity is the largest among the multiple internal similarities. In this case, the selection unit 163 selects the internally separated voice data whose internal similarity is the largest. The predetermined condition may further require that the similarity be equal to or greater than a threshold.
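The selection condition described above (largest internal similarity, optionally also at or above a threshold) can be sketched as follows. The function name `select_by_max_similarity` and the choice to return `None` when nothing qualifies are assumptions made for illustration; the source does not say what happens when the threshold is not met.

```python
from typing import Optional, Sequence


def select_by_max_similarity(
    similarities: Sequence[float],
    threshold: Optional[float] = None,
) -> Optional[int]:
    """Return the index of the internally separated voice data to transmit.

    The candidate with the largest internal similarity is chosen. If a
    threshold is given and even the best candidate falls below it, no
    candidate is selected (assumed behaviour, returned as None).
    """
    if not similarities:
        return None
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    if threshold is not None and similarities[best] < threshold:
        return None
    return best
```

For instance, with internal similarities `[0.2, 0.9, 0.4]` the second candidate is selected, and with a threshold of 0.95 no candidate is selected.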

The selection unit 163 inputs the selected separated voice data to the second communication unit 13, which transmits it to call destination terminal 2 via the network N. Because the selection unit 163 thus sends call destination terminal 2 the internally separated voice data most similar to the feature data of user U1, user U10, who uses call destination terminal 2, hears no voice other than that of user U1, which makes it easier for user U10 to talk with user U1.

If the voice features of user U1 and user U2 are similar, however, the selection unit 163 of communication terminal 1-1 used by user U1 may mistakenly select separated voice data based on the voice of user U2. The selection unit 163 may therefore additionally use the multiple externally separated voice data acquired via the information acquisition unit 162 to determine the selected separated voice data.

Specifically, for each of the internally separated voice data, the selection unit 163 associates the internally separated voice data and its internal similarity with the externally separated voice data that is most similar to it among those acquired by the information acquisition unit 162 and with the external similarity corresponding to that externally separated voice data. The selection unit 163 then selects the selected separated voice data to be transmitted to call destination terminal 2 based on the result of comparing the associated internal similarity and external similarity. The operation in which the selection unit 163 determines the selected separated voice data using the externally separated voice data is described in detail below.
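The association step can be sketched as below. The source does not say how "most similar" is measured between an internally and an externally separated voice, so this sketch assumes a normalized zero-lag correlation of the sample sequences; the names `Separated`, `normalized_correlation` and `associate` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Separated:
    label: str                # identifier of the separated voice data
    samples: Sequence[float]  # separated waveform (illustrative)
    similarity: float         # similarity to the owning terminal's user features


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-lag normalized correlation (assumed measure of signal likeness)."""
    n = min(len(a), len(b))
    dot = sum(a[i] * b[i] for i in range(n))
    norm_a = math.sqrt(sum(a[i] ** 2 for i in range(n)))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in range(n)))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(
    internal: List[Separated], external: List[Separated]
) -> List[Tuple[Separated, Separated]]:
    """Pair each internal item with the external item that resembles it most."""
    if not external:
        return []
    pairs = []
    for item in internal:
        best = max(
            external,
            key=lambda e: normalized_correlation(item.samples, e.samples),
        )
        pairs.append((item, best))
    return pairs
```

Each resulting pair carries both the internal similarity (`pair[0].similarity`) and the external similarity (`pair[1].similarity`), which is what the subsequent comparison needs. A real implementation would likely have to tolerate time offsets between terminals, which this zero-lag measure does not.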

FIGS. 3 to 5 are diagrams for explaining details of the operation in which the selection unit 163 determines the selected separated voice data using the externally separated voice data. Here, the operation of the selection unit 163 is described for the case in which user U2 and user U3 are near user U1. User U1, using communication terminal 1-1, says "Hello"; user U2, using communication terminal 1-2, says "Nice to meet you" at the same time; and user U3, using communication terminal 1-3, also says "How are you?" at the same time.

The selection units 163 of communication terminals 1-1, 1-2, and 1-3 share, by wireless communication, information indicating the multiple internal similarities corresponding to the multiple internally separated voice data input from the voice separation unit 161 (hereinafter, "internal similarity information"). Specifically, communication terminal 1-1, for example, transmits its internal similarity information to communication terminals 1-2 and 1-3, with which it can communicate wirelessly. Communication terminals 1-2 and 1-3 likewise transmit their internal similarity information to the other communication terminals 1. Each communication terminal 1 manages the internal similarity information received from the other communication terminals 1 as external similarity information.

図４は、それぞれの通信端末１が送信する複数の内部分離音声データと複数の内部類似度とを示す図である。図４（ａ）は、通信端末１－１が送信する内部類似度情報であり、図４（ｂ）は、通信端末１－２が送信する内部類似度情報であり、図４（ｃ）は、通信端末１－３が送信する内部類似度情報である。内部類似度情報には、複数の内部分離音声データのうち選択された内部分離音声データを示す情報（図４においては〇）が含まれている。 Figure 4 is a diagram showing multiple internal separated audio data and multiple internal similarities transmitted by each communication terminal 1. Figure 4(a) shows internal similarity information transmitted by communication terminal 1-1, Figure 4(b) shows internal similarity information transmitted by communication terminal 1-2, and Figure 4(c) shows internal similarity information transmitted by communication terminal 1-3. The internal similarity information includes information (circles in Figure 4) indicating internal separated audio data selected from the multiple internal separated audio data.
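As an illustration of the information in Figure 4, the following is a minimal Python sketch of one possible in-memory layout for the internal similarity information. The class names, field names, and the helper `build_info` are hypothetical and do not appear in the description; in particular, `stream_id` is only a stand-in for the internal separated audio data that the terminals actually share.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SimilarityEntry:
    """One row of a Figure 4 style table (field names are assumptions)."""
    stream_id: str      # stand-in for one internal separated audio data item
    similarity: float   # internal similarity to the terminal user's feature data
    selected: bool      # True for the provisionally selected item (the circle in Figure 4)


@dataclass(frozen=True)
class InternalSimilarityInfo:
    """Internal similarity information sent by one communication terminal."""
    terminal_id: str
    entries: List[SimilarityEntry]

    def selected_entry(self) -> SimilarityEntry:
        # Return the entry carrying the selection mark.
        return next(e for e in self.entries if e.selected)


def build_info(terminal_id: str, similarities: Dict[str, float]) -> InternalSimilarityInfo:
    """Mark the entry with the largest similarity as provisionally selected."""
    best = max(similarities, key=similarities.get)
    entries = [SimilarityEntry(s, v, s == best) for s, v in similarities.items()]
    return InternalSimilarityInfo(terminal_id, entries)
```

A receiving terminal would treat the same structure as external similarity information, as described above.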

選択部１６３は、このようにして他の通信端末１から受信した内部類似度情報である外部類似度情報を用いて選択分離音声データを決定する。具体的には、選択部１６３は、内部類似度が外部類似度よりも大きいという所定の条件を満たす内部分離音声データを選択分離音声データとして選択する。選択部１６３は、自端末の内部類似度が外部類似度以下である場合、上記の所定の条件を満たさない内部類似度に対応する内部分離音声データと異なる内部分離音声データを選択分離音声データとして選択する。 The selection unit 163 determines the selected separated audio data using the external similarity information, which is the internal similarity information received from the other communication terminal 1 in this manner. Specifically, the selection unit 163 selects, as the selected separated audio data, internal separated audio data that satisfies a predetermined condition that the internal similarity is greater than the external similarity. When the internal similarity of the own terminal is equal to or less than the external similarity, the selection unit 163 selects, as the selected separated audio data, internal separated audio data that is different from the internal separated audio data that corresponds to the internal similarity that does not satisfy the above-mentioned predetermined condition.

図５は、複数の通信端末１それぞれの選択部１６３が内部類似度情報及び外部類似度情報を用いて選択分離音声データを決定する方法を説明するための図である。図５（ａ）は、各通信端末１において算出された内部類似度と各通信端末１において仮に選択された内部分離音声データを示す情報（仮選択結果）が示されている。テーブルＡは通信端末１－１に対応し、テーブルＢは通信端末１－２に対応し、テーブルＣは通信端末１－３に対応している。 Figure 5 is a diagram for explaining a method in which the selection unit 163 of each of multiple communication terminals 1 determines selected separated audio data using internal similarity information and external similarity information. Figure 5(a) shows information (provisional selection result) indicating the internal similarity calculated in each communication terminal 1 and the internal separated audio data provisionally selected in each communication terminal 1. Table A corresponds to communication terminal 1-1, table B corresponds to communication terminal 1-2, and table C corresponds to communication terminal 1-3.

テーブルＡにおいては、ユーザＵ１の特徴データと「こんにちは」との類似度が最も大きく０．８であり、通信端末１－１が「こんにちは」を仮に選択していることが確認できる。テーブルＢにおいては、ユーザＵ２の特徴データと「はじめまして」との類似度が最も大きく０．６であり、通信端末１－２が「はじめまして」を仮に選択していることが確認できる。テーブルＣにおいては、ユーザＵ３の特徴データと「はじめまして」との類似度が最も大きく０．５であり、通信端末１－３も「はじめまして」を仮に選択していることが確認できる。 In table A, the similarity between the feature data of user U1 and "Hello" is the highest at 0.8, and it can be seen that communication terminal 1-1 has provisionally selected "Hello." In table B, the similarity between the feature data of user U2 and "Nice to meet you" is the highest at 0.6, and it can be seen that communication terminal 1-2 has provisionally selected "Nice to meet you." In table C, the similarity between the feature data of user U3 and "Nice to meet you" is the highest at 0.5, and it can be seen that communication terminal 1-3 has also provisionally selected "Nice to meet you."

このような場合、通信端末１－１の選択部１６３は、自身が仮に選択した内部分離音声データを他の通信端末１が選択していないので、仮に選択した内部分離音声データを選択分離音声データに決定する。一方、通信端末１－２及び通信端末１－３の選択部１６３は、自身が仮に選択した内部分離音声データを他の通信端末１が仮に選択しているため、選択した内部分離音声データに対応する内部類似度と外部類似度とを比較する。図５（ａ）に示す例においては、テーブルＢの類似度０．６とテーブルＣの類似度０．５とを比較する。その結果、テーブルＢの内部類似度の方がテーブルＣの外部類似度よりも大きいので、通信端末１－２の選択部１６３は、仮に選択した内部分離音声データを選択分離音声データに決定する。 In such a case, the selection unit 163 of communication terminal 1-1 determines the provisionally selected internal separated audio data as the selected separated audio data, since the other communication terminal 1 has not selected the internal separated audio data that it has provisionally selected. On the other hand, the selection units 163 of communication terminals 1-2 and 1-3 compare the internal similarity and external similarity corresponding to the selected internal separated audio data, since the internal separated audio data that they have provisionally selected has been provisionally selected by the other communication terminal 1. In the example shown in FIG. 5(a), the similarity of 0.6 in table B is compared with the similarity of 0.5 in table C. As a result, the internal similarity of table B is greater than the external similarity of table C, so the selection unit 163 of communication terminal 1-2 determines the provisionally selected internal separated audio data as the selected separated audio data.

通信端末１－３の選択部１６３は、内部類似度が他の通信端末１－２の外部類似度の方よりも小さいことから、仮に選択した内部分離音声データを選択分離音声データに決定せず、次に内部類似度が大きい内部分離音声データを選択分離音声データに決定する。図５（ａ）に示す例の場合、通信端末１－３の選択部１６３は、内部類似度が２番目に大きい「お元気ですか」を選択分離音声データに決定する。 Because its internal similarity is smaller than the external similarity received from communication terminal 1-2, the selection unit 163 of communication terminal 1-3 does not adopt the provisionally selected internal separated audio data as the selected separated audio data; instead, it adopts the internal separated audio data with the next highest internal similarity. In the example shown in Figure 5(a), the selection unit 163 of communication terminal 1-3 determines "How are you?", which has the second highest internal similarity, as the selected separated audio data.

図５（ｂ）は、このようにして各通信端末１の選択部１６３が選択分離音声データを決定した結果を示している。選択部１６３がこのように動作することで、通信端末１を使用するユーザＵの音声の特徴が他のユーザＵの音声の特徴と似ている場合であっても、正しいユーザＵの音声が選択される確率が高まる。 Figure 5(b) shows the result of the selection unit 163 of each communication terminal 1 determining the selected and separated voice data in this manner. By the selection unit 163 operating in this manner, the probability that the correct voice of user U will be selected is increased even if the voice characteristics of user U using communication terminal 1 are similar to the voice characteristics of another user U.
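The Figure 5 example can be reproduced with a short Python sketch. Only the three similarities 0.8, 0.6 and 0.5 are taken from the description; every other value in `TABLES`, along with the function names, is made up for illustration. The sketch also assumes that a tie is treated like a lower internal similarity, which is only one of the options the description mentions.

```python
from typing import Dict

# Internal similarities per terminal, keyed by separated utterance.
# Only 0.8, 0.6 and 0.5 (the provisional picks) come from the description;
# all remaining values are invented for this example.
TABLES: Dict[str, Dict[str, float]] = {
    "1-1": {"Hello": 0.8, "Nice to meet you": 0.3, "How are you?": 0.2},
    "1-2": {"Hello": 0.2, "Nice to meet you": 0.6, "How are you?": 0.3},
    "1-3": {"Hello": 0.1, "Nice to meet you": 0.5, "How are you?": 0.4},
}


def provisional_pick(table: Dict[str, float]) -> str:
    """Return the utterance with the largest internal similarity."""
    return max(table, key=table.get)


def final_pick(own_id: str, tables: Dict[str, Dict[str, float]]) -> str:
    """Keep the provisional pick unless a rival terminal scored it at least as high."""
    own = tables[own_id]
    ranked = sorted(own, key=own.get, reverse=True)
    first = ranked[0]
    # External similarities from terminals that provisionally picked the same utterance.
    rivals = [
        table[first]
        for tid, table in tables.items()
        if tid != own_id and provisional_pick(table) == first
    ]
    if all(own[first] > r for r in rivals):
        return first
    # Otherwise fall back to the next highest internal similarity.
    return ranked[1]
```

With these tables, terminal 1-1 keeps "Hello", terminal 1-2 keeps "Nice to meet you", and terminal 1-3 falls back to "How are you?", matching the outcome described for Figure 5(b).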

［複数の通信端末１における処理の流れ］ [Processing flow in multiple communication terminals 1]
図６は、複数の通信端末１における処理の流れを示すシーケンス図である。図６に示すシーケンス図は、通信端末１－１、通信端末１－２及び通信端末１－３が集音を開始した時点から開始している。図６においては、通信端末１－１がネットワークＮを介して通話先端末２－１と通信し、通信端末１－２がネットワークＮを介して通話先端末２－２と通信している状態が想定されている。 Figure 6 is a sequence diagram showing the flow of processing in multiple communication terminals 1. The sequence diagram shown in Figure 6 starts from the point when communication terminals 1-1, 1-2, and 1-3 start collecting sound. In Figure 6, it is assumed that communication terminal 1-1 communicates with call destination terminal 2-1 via network N, and communication terminal 1-2 communicates with call destination terminal 2-2 via network N.

まず、通信端末１－１、通信端末１－２及び通信端末１－３は、ユーザＵ１、ユーザＵ２及びユーザＵ３の音声が含まれる混合音声を他の通信端末１から取得する（Ｓ１）。続いて、各通信端末１の音声分離部１６１が、取得した混合音声に基づいて音声分離処理を実行する（Ｓ２）。続いて、各通信端末１の選択部１６３は、記憶部１５に記憶された特徴データと複数の内部分離音声データとの類似度を算出し（Ｓ３）、１つの内部分離音声を仮選択する（Ｓ４）。 First, communication terminal 1-1, communication terminal 1-2, and communication terminal 1-3 acquire mixed audio including the voices of user U1, user U2, and user U3 from other communication terminals 1 (S1). Next, the audio separation unit 161 of each communication terminal 1 executes audio separation processing based on the acquired mixed audio (S2). Next, the selection unit 163 of each communication terminal 1 calculates the similarity between the feature data stored in the storage unit 15 and multiple internally separated audio data (S3), and provisionally selects one internally separated audio (S4).

続いて、通信端末１－１、通信端末１－２及び通信端末１－３は、複数の分離音声データと類似度とを示す類似度情報を共有する（Ｓ５）。すなわち、各通信端末１の情報取得部１６２は、他の通信端末１が送信し複数の内部分離音声データと内部類似度とを示す内部類似度情報を、複数の外部分離音声データと外部類似度を示す外部類似度情報として取得する。 Then, communication terminal 1-1, communication terminal 1-2, and communication terminal 1-3 share similarity information indicating the multiple separated audio data and the similarity (S5). That is, the information acquisition unit 162 of each communication terminal 1 acquires the internal similarity information transmitted by the other communication terminal 1 and indicating the multiple internal separated audio data and the internal similarity as external similarity information indicating the multiple external separated audio data and the external similarity.

各通信端末１の選択部１６３は、情報取得部１６２が他の通信端末１から取得した複数の外部分離音声データと、自端末の音声分離部１６１から入力された複数の内部分離音声データとの相関値を算出することにより、それぞれの内部分離音声データを最も相関値が高い外部分離音声データに紐づける（Ｓ６）。そして、選択部１６３は、それぞれの内部分離音声データに対応する外部類似度を特定し、図５に示したようなテーブルを作成する。 The selection unit 163 of each communication terminal 1 calculates a correlation value between the multiple externally separated audio data acquired by the information acquisition unit 162 from the other communication terminals 1 and the multiple internally separated audio data input from the audio separation unit 161 of the terminal itself, and links each internally separated audio data to the externally separated audio data with the highest correlation value (S6). The selection unit 163 then identifies the external similarity corresponding to each internally separated audio data, and creates a table as shown in FIG. 5.
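The linking step S6 can be sketched as follows. The description only says that a correlation value is calculated; this sketch assumes the Pearson correlation coefficient on time-aligned sample sequences of equal length, and the function names `pearson` and `link_streams` are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def link_streams(
    internal: Dict[str, Sequence[float]],
    external: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """Map each internal stream to the external stream with the highest correlation."""
    return {
        name: max(external, key=lambda ext: pearson(samples, external[ext]))
        for name, samples in internal.items()
    }
```

Once each internal separated audio data item is linked to an external one, the external similarity attached to that external item can be placed next to the internal similarity, which yields a table like the ones in Figure 5. Real recordings from different terminals would likely need time alignment before such a comparison; that step is omitted here.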

各通信端末１の選択部１６３は、図５を参照しながら説明したように、自身が仮に選択した内部分離音声データに対応する内部類似度が、当該内部分離音声データに対応する外部類似度よりも大きいか否かを判定することにより、仮に選択した内部分離音声データの正誤を判定する（Ｓ７）。選択部１６３は、内部類似度が外部類似度よりも大きく、仮に選択した内部分離音声データが正しいと判定した場合、仮に選択した内部分離音声データを選択分離音声データとして選択する。選択部１６３は、誤っていると判定した場合、他の内部分離音声データを選択分離音声データ（すなわち送信する音声データ）として選択する（Ｓ８）。 As explained with reference to FIG. 5, the selection unit 163 of each communication terminal 1 determines whether the provisionally selected internal separated audio data is correct by determining whether the internal similarity corresponding to the internal separated audio data that it has provisionally selected is greater than the external similarity corresponding to that internal separated audio data (S7). If the selection unit 163 determines that the internal similarity is greater than the external similarity and that the provisionally selected internal separated audio data is correct, it selects the provisionally selected internal separated audio data as the selected separated audio data. If the selection unit 163 determines that the provisionally selected internal separated audio data is incorrect, it selects other internal separated audio data as the selected separated audio data (i.e., audio data to be transmitted) (S8).

通信端末１－１の選択部１６３は、選択分離音声データを通話先端末２－１に送信し（Ｓ９）、通信端末１－２の選択部１６３は、選択分離音声データを通話先端末２－２に送信する（Ｓ１０）。このような手順により、複数の通信端末１それぞれが、複数の通信端末１を使用するユーザＵの音声を通話中の相手が使用する通話先端末２に送信することができる。 The selection unit 163 of communication terminal 1-1 transmits the selected separated voice data to the call destination terminal 2-1 (S9), and the selection unit 163 of communication terminal 1-2 transmits the selected separated voice data to the call destination terminal 2-2 (S10). Through this procedure, each of the multiple communication terminals 1 can transmit the voice of the user U who is using that communication terminal 1 to the call destination terminal 2 used by the other party to the call.

［通信端末１における処理の流れ］ [Processing flow in communication terminal 1]
図７は、通信端末１における処理の流れを示すフローチャートである。図７に示すフローチャートは、通信端末１を使用するユーザＵ１が、ユーザＵ１０との通話を開始する操作をした時点から開始している。 Figure 7 is a flowchart showing the flow of processing in the communication terminal 1. The flowchart shown in Figure 7 starts at the point when the user U1 who uses the communication terminal 1 performs an operation to start a call with the user U10.

音声分離部１６１は集音部１１を介して音声データを取得するとともに、第１通信部１２を介して他の通信端末１が集音することにより生成された混合音声データを取得する（Ｓ１１）。音声分離部１６１は、取得した音声データに基づいて、複数の内部分離音声データを生成する（Ｓ１２）。 The audio separation unit 161 acquires audio data via the sound collection unit 11, and also acquires mixed audio data generated by the other communication terminal 1 collecting audio via the first communication unit 12 (S11). The audio separation unit 161 generates multiple internally separated audio data based on the acquired audio data (S12).

続いて、選択部１６３は、複数の内部分離音声データを記憶部１５に記憶された特徴データと比較することにより、複数の内部分離音声データそれぞれと特徴データとの類似度を算出する（Ｓ１３）。選択部１６３は、類似度が最も大きい内部分離音声データを仮に選択する（Ｓ１４）。選択部１６３は、第１通信部１２を介して、内部分離音声データと内部類似度とを示す内部類似度情報を他の通信端末１に送信する（Ｓ１５）。 Then, the selection unit 163 calculates the similarity between each of the internal separated audio data and the feature data by comparing the multiple internal separated audio data with the feature data stored in the storage unit 15 (S13). The selection unit 163 provisionally selects the internal separated audio data with the highest similarity (S14). The selection unit 163 transmits internal similarity information indicating the internal separated audio data and the internal similarity to the other communication terminal 1 via the first communication unit 12 (S15).
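Steps S13 and S14 can be sketched as below. The description does not say how the similarity is computed, so this sketch assumes, purely for illustration, that the feature data and each internal separated audio data item are represented as fixed-length feature vectors compared by cosine similarity. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_and_pick(
    feature: Sequence[float],
    streams: Dict[str, Sequence[float]],
) -> Tuple[Dict[str, float], str]:
    """S13: score every stream against the feature data. S14: pick the best one."""
    similarities = {
        name: cosine_similarity(feature, vector) for name, vector in streams.items()
    }
    provisional = max(similarities, key=similarities.get)
    return similarities, provisional
```

The returned similarities correspond to the internal similarities that are then sent to the other terminals in S15, and the returned name corresponds to the provisional selection.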

また、選択部１６３は、第１通信部１２及び情報取得部１６２を介して、他の通信端末１から外部分離音声データと外部類似度とを示す外部類似度情報を取得する（Ｓ１６）。選択部１６３は、複数の内部分離音声データと複数の外部分離音声データとの組み合わせごとに相関値を算出することにより、内部分離音声データと外部分離音声データとを紐づけて、図５に示したようなテーブルを作成する（Ｓ１７）。 The selection unit 163 also acquires external similarity information indicating the external separated audio data and the external similarity from the other communication terminal 1 via the first communication unit 12 and the information acquisition unit 162 (S16). The selection unit 163 links the internal separated audio data with the external separated audio data by calculating a correlation value for each combination of multiple internal separated audio data and multiple external separated audio data, thereby creating a table such as that shown in FIG. 5 (S17).

選択部１６３は、Ｓ１３において算出した複数の類似度のうち最大の類似度（すなわち、仮に選択した内部分離音声データに対応する類似度）が、他の通信端末１から取得した、当該内部分離音声データに対応する外部分離音声データに関連付けられた外部類似度よりも大きいか否かを判定する（Ｓ１８）。選択部１６３は、Ｓ１８において、最大の内部類似度が外部類似度よりも大きいと判定した場合（Ｓ１８においてＹＥＳ）、最大の内部類似度に対応する内部分離音声データを選択分離音声データとして選択する（Ｓ１９）。 The selection unit 163 determines whether the maximum similarity among the multiple similarities calculated in S13 (i.e., the similarity corresponding to the provisionally selected internal separated audio data) is greater than the external similarity associated with the external separated audio data corresponding to the internal separated audio data obtained from another communication terminal 1 (S18). If the selection unit 163 determines in S18 that the maximum internal similarity is greater than the external similarity (YES in S18), it selects the internal separated audio data corresponding to the maximum internal similarity as the selected separated audio data (S19).

選択部１６３は、Ｓ１８において、最大の内部類似度が外部類似度よりも小さいと判定した場合（Ｓ１８においてＮＯ）、次に大きな内部類似度に対応する内部分離音声データを選択分離音声データとして選択する（Ｓ２０）。選択部１６３は、第２通信部１３を介して、選択分離音声データを通話先端末２に送信する（Ｓ２１）。通信端末１は、Ｓ１１からＳ２１までの処理を所定の時間間隔で繰り返す。所定の時間間隔は、例えば１秒である。 When the selection unit 163 determines in S18 that the maximum internal similarity is smaller than the external similarity (NO in S18), it selects the internal separated audio data corresponding to the next largest internal similarity as selected separated audio data (S20). The selection unit 163 transmits the selected separated audio data to the call destination terminal 2 via the second communication unit 13 (S21). The communication terminal 1 repeats the processes from S11 to S21 at a predetermined time interval. The predetermined time interval is, for example, one second.

なお、Ｓ１８において最大の内部類似度と外部類似度とが等しいと選択部１６３が判定した場合、選択部１６３は、例えば、最大の内部類似度と２番目の内部類似度との差が、他の通信端末１の最大の外部類似度と２番目の外部類似度との差よりも大きいことを条件として、Ｓ１９の処理を実行する。 If the selection unit 163 determines in S18 that the maximum internal similarity and the external similarity are equal, the selection unit 163 executes the process of S19 under the condition that, for example, the difference between the maximum internal similarity and the second internal similarity is greater than the difference between the maximum external similarity and the second external similarity of the other communication terminal 1.
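The S18 decision together with this tie-break can be sketched as follows. The function names and argument layout are assumptions; `external_all` stands for the other terminal's full list of similarities, and each list is assumed to contain at least two values. Exact floating-point equality is used only to keep the sketch short; a practical implementation would probably compare within a tolerance.

```python
from typing import Sequence


def margin(similarities: Sequence[float]) -> float:
    """Difference between the largest and second largest similarity."""
    top, second = sorted(similarities, reverse=True)[:2]
    return top - second


def keep_provisional(
    internal: Sequence[float],
    external_for_pick: float,
    external_all: Sequence[float],
) -> bool:
    """Return True when S19 applies (keep the provisional pick), False for S20."""
    best = max(internal)
    if best > external_for_pick:
        return True   # YES branch of S18
    if best < external_for_pick:
        return False  # NO branch of S18
    # Tie: keep the pick only if the own terminal is the more decisive one.
    return margin(internal) > margin(external_all)
```

For example, with internal similarities of 0.6 and 0.2 against external similarities of 0.6 and 0.5, the own margin (about 0.4) exceeds the other terminal's margin (about 0.1), so the provisional pick is kept.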

Ｓ１８において最大の内部類似度と外部類似度とが等しいと選択部１６３が判定した場合、選択部１６３は、仮に選択した内部音声データを通話先端末２に送信してもよい。Ｓ１８において最大の内部類似度と外部類似度とが等しいと選択部１６３が判定した場合、選択部１６３は、選択分離音声データを決定せず、例えば処理の時間間隔である１秒間は音声データを通話先端末２に送信しないようにしてもよい。 If the selection unit 163 determines in S18 that the maximum internal similarity and the external similarity are equal, the selection unit 163 may transmit the provisionally selected internal voice data to the call destination terminal 2. If the selection unit 163 determines in S18 that the maximum internal similarity and the external similarity are equal, the selection unit 163 may not determine the selected separated voice data, and may not transmit the voice data to the call destination terminal 2 for, for example, one second, which is the processing time interval.

［変形例］ [Modification]
図８は、第１変形例に係る通信端末１Ａの構成を示す図である。図８に示す通信端末１Ａは、画像データ取得部１７をさらに有するという点で図２に示した通信端末１と異なり、他の点で同じである。画像データ取得部１７は、通信端末１を使用するユーザＵの顔画像データを取得する。画像データ取得部１７は、例えば撮像素子を有する。 Figure 8 is a diagram showing the configuration of a communication terminal 1A according to a first modified example. The communication terminal 1A shown in Figure 8 differs from the communication terminal 1 shown in Figure 2 in that it further includes an image data acquisition unit 17, but is otherwise the same. The image data acquisition unit 17 acquires face image data of a user U who uses the communication terminal 1. The image data acquisition unit 17 includes, for example, an image sensor.

選択部１６３は、複数の内部分離音声データのうち、顔画像データが示すユーザＵの顔の変化タイミングに同期している内部分離音声データを選択分離音声データとして選択する。選択部１６３は、例えば、複数の内部分離音声データを周波数領域のデータに変換し、顔画像データが、ユーザＵの口が開閉する動き又は口の形の変化に同期して周波数が変化している内部分離音声データを選択する。選択部１６３は、顔画像データのフレーム間差分に基づいてユーザＵの口の動き（又は唇の動き）の特徴を含む特徴データを用いてもよい。選択部１６３がこのように動作することで、適切な内部分離音声データを選択できる確率がさらに高まる。 The selection unit 163 selects, from among the multiple internal separated audio data, the internal separated audio data that is synchronized with the timing of changes in the face of user U indicated by the face image data as the selected separated audio data. For example, the selection unit 163 converts the multiple internal separated audio data into frequency domain data and selects the internal separated audio data whose frequency changes in synchronization with the opening and closing of user U's mouth, or with changes in the shape of the mouth, shown in the face image data. The selection unit 163 may use feature data that includes features of user U's mouth movement (or lip movement) derived from inter-frame differences of the face image data. By the selection unit 163 operating in this manner, the probability of selecting appropriate internal separated audio data is further increased.

［通信端末１による効果］ [Effects of communication terminal 1]
以上説明したように、本実施形態に係る通信端末１は、集音した音声を話者ごとに分離して生成された複数の内部分離音声データそれぞれと、通信端末１のユーザＵの音声の特徴を含む特徴データとの類似度である複数の内部類似度を特定し、当該複数の内部類似度のうち所定の条件を満たす内部類似度に対応する内部分離音声データを通話先の通話先端末２に送信する。通信端末１がこのように構成されていることで、通信端末１のユーザＵの音声に対応する内部分離音声データが選択される確率が高くなり、自端末を使用している話者の音声を通話先の端末に送信できる確率が向上する。 As described above, the communication terminal 1 according to the present embodiment identifies a plurality of internal similarities, which are the similarities between each of a plurality of internal separated voice data generated by separating collected voice for each speaker and feature data including voice features of the user U of the communication terminal 1, and transmits the internal separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the plurality of internal similarities to the call destination terminal 2. By configuring the communication terminal 1 in this way, the probability that the internal separated voice data corresponding to the voice of the user U of the communication terminal 1 will be selected increases, and the probability that the voice of the speaker using the communication terminal 1 can be transmitted to the call destination terminal increases.

＜第２実施形態＞ <Second Embodiment>
第１実施形態においては、通信端末１が、入力された混合音声に含まれる複数の音声を分離し、通信端末１を使用するユーザＵの音声を選択した。これに対して、第２実施形態においては、情報処理装置３が通信端末１から混合音声のデータを取得し、情報処理装置３が混合音声に含まれる複数の音声を分離して選択分離音声データを生成し、選択分離音声データを通話先端末２に送信するという点で第１実施形態と異なる。情報処理装置３は、例えばネットワークＮを介して通信端末１及び通話先端末２と通信可能なコンピュータである。 In the first embodiment, the communication terminal 1 separated multiple voices included in the input mixed voice and selected the voice of the user U who uses the communication terminal 1. In contrast, the second embodiment differs from the first embodiment in that the information processing device 3 acquires mixed voice data from the communication terminal 1, separates multiple voices included in the mixed voice to generate selected separated voice data, and transmits the selected separated voice data to the call destination terminal 2. The information processing device 3 is a computer capable of communicating with the communication terminal 1 and the call destination terminal 2 via, for example, a network N.

図９は、第２実施形態の概要を説明するための図である。通信端末１－１、通信端末１－２及び通信端末１－３は、それぞれが集音して生成した音声データを情報処理装置３に送信する。情報処理装置３は、複数の通信端末１それぞれのユーザＵの特徴データを記憶しており、複数の通信端末１それぞれから受信した音声データに基づく複数の内部分離音声データのうち、特徴データに最も類似する内部分離音声データを選択する。情報処理装置３は、選択した内部分離音声データを、複数の通信端末１それぞれと通信している通話先端末２に送信する。情報処理装置３は、例えば通信端末１－１と通信している通話先端末２－１に対して、ユーザＵ１が発した「こんにちは」に対応する内部分離音声データを送信する。 Figure 9 is a diagram for explaining an overview of the second embodiment. Communication terminal 1-1, communication terminal 1-2, and communication terminal 1-3 each transmit voice data generated by collecting sound to information processing device 3. Information processing device 3 stores feature data of user U of each of the multiple communication terminals 1, and selects the internal separated voice data that is most similar to the feature data from multiple internal separated voice data based on voice data received from each of the multiple communication terminals 1. Information processing device 3 transmits the selected internal separated voice data to call destination terminal 2 communicating with each of the multiple communication terminals 1. Information processing device 3 transmits internal separated voice data corresponding to "hello" uttered by user U1 to call destination terminal 2-1 communicating with communication terminal 1-1, for example.

図１０は、情報処理装置３の構成を示す図である。情報処理装置３は、通信部３１と、記憶部３２と、制御部３３と、を有する。制御部３３は、音声分離部３３１と選択部３３２とを有する。 Figure 10 is a diagram showing the configuration of the information processing device 3. The information processing device 3 has a communication unit 31, a storage unit 32, and a control unit 33. The control unit 33 has an audio separation unit 331 and a selection unit 332.

通信部３１は、ネットワークＮを介して通信端末１及び通話先端末２と通信するための通信インターフェースを有する。通信部３１は、通信端末１から送信された音声データを受信し、受信した音声データを音声分離部３３１に入力する。通信部３１は、選択部３３２が生成した選択分離音声データを通話先端末２に送信する。すなわち、通信部３１は、ユーザＵの通話相手が使用する通話先端末である通話先端末２に選択分離音声データを送信する。通信部３１は、複数の通信端末１に対応する複数の通話先端末２に選択分離音声データを送信してもよい。 The communication unit 31 has a communication interface for communicating with the communication terminal 1 and the call destination terminal 2 via the network N. The communication unit 31 receives voice data transmitted from the communication terminal 1 and inputs the received voice data to the voice separation unit 331. The communication unit 31 transmits the selected separated voice data generated by the selection unit 332 to the call destination terminal 2. In other words, the communication unit 31 transmits the selected separated voice data to the call destination terminal 2, which is the call destination terminal used by the call partner of the user U. The communication unit 31 may transmit the selected separated voice data to multiple call destination terminals 2 corresponding to multiple communication terminals 1.

記憶部３２は、ＲＯＭ、ＲＡＭ及びＳＳＤ（Solid State Drive）等の記憶媒体を有する。記憶部３２は、制御部３３が実行するプログラムを記憶する。また、記憶部３２は、複数の通信端末１を使用する複数のユーザの音声の特徴を含む複数の特徴データを記憶する。記憶部３２は、例えば、通信端末１又は通信端末１を使用するユーザＵを識別するための識別情報に関連付けて、ユーザＵの特徴データを記憶する。記憶部３２は、複数の通信端末１それぞれに関連付けて、通話先の通話先端末２の識別情報を記憶してもよい。 The storage unit 32 has storage media such as ROM, RAM, and SSD (Solid State Drive). The storage unit 32 stores programs executed by the control unit 33. The storage unit 32 also stores multiple feature data including voice features of multiple users who use multiple communication terminals 1. The storage unit 32 stores the feature data of user U, for example, in association with identification information for identifying the communication terminal 1 or the user U who uses the communication terminal 1. The storage unit 32 may also store identification information of the call destination terminal 2, in association with each of the multiple communication terminals 1.

制御部３３は、記憶部３２に記憶されたプログラムを実行することにより、音声分離部３３１及び選択部３３２として機能する。音声分離部３３１は、第１実施形態で説明した通信端末１における音声分離部１６１と同等の機能を有する。すなわち、音声分離部３３１は、通信端末１により集音された通信端末１の周囲の音声に基づく音声データに含まれている、ユーザＵを含む複数の話者による複数の音声データを分離して複数の分離音声データを生成する。ただし、音声分離部３３１は、複数の通信端末１から受信した複数の音声データそれぞれに対して、複数の分離音声データを生成する。音声分離部３３１は、通信端末１又はユーザＵの識別情報に関連付けて複数の分離音声データを選択部３３２に通知する。 The control unit 33 executes a program stored in the storage unit 32 to function as a voice separation unit 331 and a selection unit 332. The voice separation unit 331 has a function equivalent to that of the voice separation unit 161 in the communication terminal 1 described in the first embodiment. That is, the voice separation unit 331 separates multiple voice data by multiple speakers, including the user U, contained in voice data based on voices around the communication terminal 1 collected by the communication terminal 1, to generate multiple separated voice data. However, the voice separation unit 331 generates multiple separated voice data for each of the multiple voice data received from the multiple communication terminals 1. The voice separation unit 331 notifies the selection unit 332 of the multiple separated voice data in association with the identification information of the communication terminal 1 or the user U.

選択部３３２は、第１実施形態で説明した通信端末１における選択部１６３と同等の機能を有する。すなわち、選択部３３２は、複数の分離音声データそれぞれと特徴データとの類似度を特定し、当該複数の類似度のうち所定の条件を満たす類似度に対応する分離音声データを選択分離音声データとして選択する。選択部３３２は、例えば、音声分離部３３１から通知された複数の通信端末１に対応する複数の分離音声データに基づいて、それぞれの通信端末１に対応する複数の分離音声データのうち、通信端末１のユーザＵの特徴データとの類似度が最も大きい分離音声データを選択分離音声データとして選択する。選択部３３２は、複数の通信端末１それぞれに関連付けて選択分離音声データを決定する。 The selection unit 332 has a function equivalent to that of the selection unit 163 in the communication terminal 1 described in the first embodiment. That is, the selection unit 332 specifies the similarity between each of the multiple separated voice data and the feature data, and selects, from the multiple similarities, the separated voice data corresponding to the similarity that satisfies a predetermined condition as the selected separated voice data. For example, based on the multiple separated voice data corresponding to the multiple communication terminals 1 notified by the voice separation unit 331, the selection unit 332 selects, from the multiple separated voice data corresponding to each communication terminal 1, the separated voice data that has the greatest similarity to the feature data of the user U of the communication terminal 1 as the selected separated voice data. The selection unit 332 determines the selected separated voice data in association with each of the multiple communication terminals 1.

選択部３３２は、複数の通信端末１に対応する選択分離音声データが同一の分離音声データになった場合、図５を参照して説明した手順により、複数の通信端末１それぞれに対応する分離音声データが異なるように選択分離音声データを決定する。選択部３３２は、複数の通信端末１それぞれに対応する選択分離音声データを、複数の通信端末１それぞれのユーザＵが通話する他のユーザＵが使用する通話先端末２に対して送信する。 When the selected separated audio data corresponding to the multiple communication terminals 1 becomes the same separated audio data, the selection unit 332 determines the selected separated audio data so that the separated audio data corresponding to each of the multiple communication terminals 1 is different, according to the procedure described with reference to FIG. 5. The selection unit 332 transmits the selected separated audio data corresponding to each of the multiple communication terminals 1 to the call destination terminal 2 used by the other user U with whom the user U of each of the multiple communication terminals 1 calls.

このように情報処理装置３が複数の通信端末１から受信した音声データから、通信端末１を使用するユーザＵの特徴データに類似する分離音声データを生成することで、通信端末１の処理を軽くしつつ、通信端末１を使用している話者の音声を通話先の端末に送信できる確率を高めることが可能になる。 In this way, by generating separated voice data similar to the characteristic data of the user U using the communication terminal 1 from the voice data received by the information processing device 3 from multiple communication terminals 1, it is possible to reduce the processing load on the communication terminal 1 while increasing the probability that the voice of the speaker using the communication terminal 1 can be transmitted to the destination terminal.

なお、本発明により、国連が主導する持続可能な開発目標（SDGs）の目標９「産業と技術革新の基盤をつくろう」に貢献することが可能となる。 Furthermore, the present invention makes it possible to contribute to Goal 9, "Industry, Innovation and Infrastructure," of the Sustainable Development Goals (SDGs) led by the United Nations.

以上、本発明を実施の形態を用いて説明したが、本発明の技術的範囲は上記実施の形態に記載の範囲には限定されず、その要旨の範囲内で種々の変形及び変更が可能である。例えば、装置の全部又は一部は、任意の単位で機能的又は物理的に分散・統合して構成することができる。また、複数の実施の形態の任意の組み合わせによって生じる新たな実施の形態も、本発明の実施の形態に含まれる。組み合わせによって生じる新たな実施の形態の効果は、もとの実施の形態の効果を併せ持つ。 Although the present invention has been described above using embodiments, the technical scope of the present invention is not limited to the scope described in the above embodiments, and various modifications and changes are possible within the scope of the gist of the invention. For example, all or part of the device can be configured by functionally or physically distributing or integrating it in any unit. New embodiments resulting from any combination of multiple embodiments are also included in the embodiments of the present invention. A new embodiment resulting from such a combination also has the effects of the original embodiments.

REFERENCE SIGNS LIST
１通信端末 1 Communication terminal
２通話先端末 2 Call destination terminal
３情報処理装置 3 Information processing device
１１集音部 11 Sound collection unit
１２第１通信部 12 First communication unit
１３第２通信部 13 Second communication unit
１４音声出力部 14 Audio output unit
１５記憶部 15 Storage unit
１６制御部 16 Control unit
１７画像データ取得部 17 Image data acquisition unit
３１通信部 31 Communication unit
３２記憶部 32 Storage unit
３３制御部 33 Control unit
１５１記憶部 151 Storage unit
１６１音声分離部 161 Audio separation unit
１６２情報取得部 162 Information acquisition unit
１６３選択部 163 Selection unit
３３１音声分離部 331 Audio separation unit
３３２選択部 332 Selection unit

Claims

A communication terminal used by a user, comprising:
a storage unit that stores feature data including features of the user's voice;
a sound collection unit that collects sound around the communication terminal and converts the sound into voice data;
a voice separation unit that separates a plurality of voice data by a plurality of speakers including the user, which are included in the voice data, to generate a plurality of internal separated voice data;
a selection unit that identifies a plurality of internal similarities, which are similarities between each of the plurality of internal separated voice data and the feature data, and selects, as selected separated voice data, the internal separated voice data corresponding to an internal similarity that satisfies a predetermined condition among the plurality of internal similarities; and
a communication unit that transmits the selected separated voice data to a call destination terminal used by a call partner of the user.

The communication terminal according to claim 1, wherein
the selection unit selects, as the selected separated voice data, the internal separated voice data corresponding to the internal similarity that satisfies the predetermined condition of being the maximum among the plurality of internal similarities.

The communication terminal according to claim 1, further comprising an information acquisition unit that acquires, from another communication terminal that collects the same sound as the sound collected by the sound collection unit, a plurality of external separated voice data generated by the other communication terminal and a plurality of external similarities, which are similarities between each of the plurality of external separated voice data and feature data of another user who uses the other communication terminal, in association with each other, wherein
the selection unit associates, for each of the plurality of internal separated voice data, the internal similarity corresponding to the internal separated voice data, the external separated voice data that is most similar to the internal separated voice data among the plurality of external separated voice data acquired by the information acquisition unit, and the external similarity corresponding to that external separated voice data, and determines the selected separated voice data based on a result of comparing the associated internal similarity with the associated external similarity.

The communication terminal according to claim 3, wherein
the selection unit selects, as the selected separated voice data, the internal separated voice data that satisfies the predetermined condition that the internal similarity is greater than the external similarity.

The communication terminal according to claim 4, wherein,
when the internal similarity is equal to or less than the external similarity, the selection unit selects, as the selected separated voice data, internal separated voice data different from the internal separated voice data corresponding to the internal similarity that does not satisfy the predetermined condition.
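
The comparison between internal and external similarities can be sketched as below. This is one possible reading under stated assumptions, not the claimed implementation: each separated stream carries a hypothetical feature vector used to match an internal stream to its most similar external stream, candidates are tried in descending order of internal similarity, and `None` is returned when no candidate satisfies the condition. The `Stream` type and the function names are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Stream:
    feature: Sequence[float]  # hypothetical feature vector of the separated voice
    similarity: float         # similarity to the feature data of the terminal's own user


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity (an assumed matching measure); 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_with_external(internal: List[Stream], external: List[Stream]) -> Optional[int]:
    """Return the index of the internal stream to transmit, or None.

    An internal stream is accepted only if its internal similarity is greater
    than the external similarity of the external stream it matches best.
    """
    order = sorted(range(len(internal)),
                   key=lambda i: internal[i].similarity, reverse=True)
    for i in order:
        if not external:
            return i  # nothing to compare against: keep the best internal stream
        match = max(external, key=lambda e: cosine(internal[i].feature, e.feature))
        if internal[i].similarity > match.similarity:
            return i
        # Otherwise the other terminal's user is the better match for this
        # voice, so fall through to a different internal stream.
    return None
```

In this sketch, a voice that the other terminal attributes more strongly to its own user is skipped even when it has the highest internal similarity, and the next candidate is examined instead.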

The communication terminal according to claim 1, further comprising
an image data acquisition unit that acquires face image data of the user, wherein
the selection unit selects, as the selected separated voice data, the internal separated voice data that is synchronized with the timing of changes in the user's face indicated by the face image data, from among the plurality of pieces of internal separated voice data.
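
The face-synchronized selection can be illustrated with a small sketch. The claims do not state how synchronization is measured; the code below assumes, purely for illustration, that the face image data has been reduced to a per-frame mouth-motion signal and that each separated stream has a per-frame energy signal on the same time grid, and it picks the stream whose energy correlates best (Pearson correlation) with the mouth motion. All names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length signals; 0.0 if either is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("signals must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_by_face_sync(
    stream_energies: List[Sequence[float]],
    mouth_motion: Sequence[float],
) -> int:
    """Return the index of the stream whose per-frame energy best follows the
    per-frame mouth motion derived from the face image data (assumed inputs)."""
    if not stream_energies:
        raise ValueError("at least one separated stream is required")
    scores = [pearson(energy, mouth_motion) for energy in stream_energies]
    return max(range(len(scores)), key=scores.__getitem__)
```

A stream that is loud while the user's mouth is still scores low here, so a nearby speaker's voice is unlikely to be chosen even if it dominates the mixture.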

An information processing device comprising:
a storage unit that stores feature data including features of a voice of a user who uses a communication terminal;
a voice separation unit that separates a plurality of pieces of voice data of a plurality of speakers including the user, which are included in voice data based on sounds around the communication terminal collected by the communication terminal, to generate a plurality of pieces of separated voice data;
a selection unit that identifies a similarity between each of the plurality of pieces of separated voice data and the feature data, and selects, as selected separated voice data, the separated voice data corresponding to the similarity that satisfies a predetermined condition among the plurality of similarities; and
a communication unit that transmits the selected separated voice data to a call destination terminal used by a call partner of the user.

The information processing device according to claim 7, wherein
the storage unit stores a plurality of pieces of feature data including voice features of a plurality of users who use a plurality of the communication terminals,
the selection unit determines the selected separated voice data in association with each of the plurality of communication terminals, and
the communication unit transmits the selected separated voice data to a plurality of the call destination terminals corresponding to the plurality of communication terminals.
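
The per-terminal selection and routing on the information processing device can be sketched as follows. This is a simplified illustration under several assumptions not found in the claims: all terminals share one set of separated streams, each stream and each user is described by a feature vector, the similarity measure is supplied by the caller, and the maximum-similarity condition is used. The function name and the dictionary layout are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def route_selected_streams(
    stream_features: List[Vector],
    user_features: Dict[str, Vector],   # terminal id -> stored feature data of its user
    destinations: Dict[str, str],       # terminal id -> call destination terminal id
    similarity: Callable[[Vector, Vector], float],
) -> Dict[str, int]:
    """For each communication terminal, pick the separated stream most similar
    to that terminal's user and map it to the terminal's call destination.

    Returns a mapping: call destination terminal id -> selected stream index.
    """
    if not stream_features:
        raise ValueError("at least one separated stream is required")
    routes: Dict[str, int] = {}
    for terminal_id, feature in user_features.items():
        sims = [similarity(f, feature) for f in stream_features]
        best = max(range(len(sims)), key=sims.__getitem__)
        routes[destinations[terminal_id]] = best
    return routes
```

Keeping the selection keyed by terminal lets a single device serve several simultaneous calls in the same room, with each call partner receiving only the voice of the corresponding user.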

A communication method executed by a computer, comprising:
a step of generating a plurality of pieces of internal separated voice data by separating a plurality of pieces of voice data of a plurality of speakers including a user, the plurality of pieces of voice data being included in voice data based on sounds around the user, who is on a call with a call partner using a call destination terminal;
a step of generating a plurality of internal similarities, each being a similarity between one of the plurality of pieces of internal separated voice data and feature data that is stored in a storage unit and includes features of the user's voice;
a step of selecting, as selected separated voice data, the internal separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the plurality of internal similarities; and
a step of transmitting the selected separated voice data to the call destination terminal used by the call partner of the user.

A program for causing a computer to execute:
a step of generating a plurality of pieces of internal separated voice data by separating a plurality of pieces of voice data of a plurality of speakers including a user, the plurality of pieces of voice data being included in voice data based on sounds around the user, who is on a call with a call partner using a call destination terminal;
a step of generating a plurality of internal similarities, each being a similarity between one of the plurality of pieces of internal separated voice data and feature data that is stored in a storage unit and includes features of the user's voice;
a step of selecting, as selected separated voice data, the internal separated voice data corresponding to the internal similarity that satisfies a predetermined condition among the plurality of internal similarities; and
a step of transmitting the selected separated voice data to the call destination terminal used by the call partner of the user.