JP7687433B2

JP7687433B2 - Utterance desire estimation device, utterance desire estimation method, and program

Info

Publication number: JP7687433B2
Application number: JP2023561954A
Authority: JP
Inventors: 俊一瀬古; 直紀萩山; 真奈笹川; 理香望月; 晴美齋藤; 隆二山本
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2025-06-03
Anticipated expiration: 2041-11-16
Also published as: US20250004699A1; JPWO2023089662A1; WO2023089662A1

Description

The present invention relates to a technology for estimating a user's desire to speak during a remote conference.

In remote conferences such as web conferences, factors such as unclear video and network delay make it harder than in face-to-face communication to identify who wants to speak (who has a desire to speak).

Patent Document 1 discloses a technology that acquires the behavior of users (participants in a remote conference) from a camera and a microphone, then calculates and displays the degree of each user's desire to speak. With this technology, each user can easily grasp who wants to speak.

However, in remote conferences, cameras and microphones are often turned off so that line congestion, noise, and the like do not hinder communication, which makes it difficult to estimate the desire to speak from video and audio.

Patent Document 1: Japanese Patent Application Publication No. 2013-183183

An object of the present invention is to provide a technology for estimating a user's desire to speak without using video or audio information.

An utterance desire estimation device according to one aspect of the present invention is provided in a first conference device among a plurality of conference devices used for a remote conference over a communication network, and comprises: an operation information generation unit that generates operation information indicating an operation that a user performed on the first conference device during the remote conference; an utterance desire degree calculation unit that calculates, based on the generated operation information, an utterance desire degree indicating the degree to which the user desires to speak; and a communication unit that transmits information based on the calculated utterance desire degree to a second conference device among the plurality of conference devices.

According to the present invention, a technology is provided for estimating a user's desire to speak without using video or audio information.

FIG. 1 is a block diagram showing a conference system according to an embodiment.
FIG. 2 is a functional block diagram showing a client including an utterance desire estimation device according to an embodiment.
FIG. 3 is a diagram showing a user interface of a remote conference application according to the embodiment.
FIG. 4 is a diagram showing operation information stored in the operation information storage unit shown in FIG. 2.
FIG. 5 is a block diagram showing a hardware configuration of a client including an utterance desire estimation device according to an embodiment.
FIG. 6 is a flowchart showing an utterance desire estimation method according to the embodiment.
FIG. 7 is a functional block diagram showing a client including an utterance desire estimation device according to an embodiment.

Embodiments of the present invention are described below with reference to the drawings.

The embodiments relate to a conference system in which multiple users in different locations hold a remote conference using multiple conference devices connected to a communication network. In one embodiment, each conference device includes an utterance desire estimation device that estimates the utterance desire of the user of that conference device. The utterance desire estimation device calculates the user's utterance desire degree based on the operations the user performed on the conference device during the remote conference, and transmits information based on the calculated utterance desire degree to the other conference devices. The utterance desire degree indicates how strongly the user desires to speak. Each conference device receives information indicating the utterance desire degrees of the other users from the other conference devices and presents the received information to its user. With the conference system according to the embodiments, each user's utterance desire can be estimated without using video or audio information, and each user can easily determine whether another user wants to speak. As a result, colliding utterances can be avoided.

<First Embodiment>
[Configuration]
FIG. 1 schematically shows a conference system 10 according to a first embodiment. As shown in FIG. 1, the conference system 10 includes a plurality of clients 11, each used by one of a plurality of users, and a server 12 connected to the clients 11 via a communication network 19. The communication network 19 may include the Internet, an intranet, or a combination of the two. The server 12 relays data between the clients 11. For example, the server 12 receives data from a client 11 via the communication network 19 and forwards the received data to the other clients 11 via the communication network 19.
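The relay role of the server 12 can be pictured with a minimal in-memory sketch. The class name `RelayServer`, its methods, and the per-client inbox lists are hypothetical names chosen for illustration; the source does not describe the server's implementation or transport.

```python
from typing import Any, Dict, List, Tuple


class RelayServer:
    """Toy stand-in for server 12: forwards data from one client to all others."""

    def __init__(self) -> None:
        # One inbox per connected client (hypothetical structure).
        self.inboxes: Dict[str, List[Tuple[str, Any]]] = {}

    def connect(self, client_id: str) -> None:
        self.inboxes[client_id] = []

    def relay(self, sender_id: str, data: Any) -> None:
        # Deliver the data to every client except the sender.
        for client_id, inbox in self.inboxes.items():
            if client_id != sender_id:
                inbox.append((sender_id, data))
```

A real deployment would replace the inbox lists with network connections; the sketch only shows the fan-out to the other clients.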

Each client 11 may be a computer such as a personal computer (PC). The client 11 corresponds to a conference device used for a remote conference over the communication network 19. In this embodiment, the client 11 functions as a conference device by executing a remote conference application. In other embodiments, the client 11 may function as a conference device by accessing the server 12 with a browser.

The clients 11 may have the same or similar configurations. The configuration of one client 11 is described below as a representative.

FIG. 2 schematically shows the functional configuration of the client 11 according to this embodiment. As shown in FIG. 2, the client 11 includes a control unit 21, an input unit 22, an output unit 23, a communication unit 24, an operation information generation unit 25, an utterance desire degree calculation unit 26, and a storage unit 29. The storage unit 29 includes an operation information storage unit 291 and a rule storage unit 292. The control unit 21, the operation information generation unit 25, and the utterance desire degree calculation unit 26 are collectively referred to as a processing unit 27. The control unit 21, the communication unit 24, the operation information generation unit 25, the utterance desire degree calculation unit 26, the operation information storage unit 291, and the rule storage unit 292 correspond to the utterance desire estimation device according to this embodiment.

The control unit 21 controls the operation of the client 11. Specifically, the control unit 21 controls the input unit 22, the output unit 23, the communication unit 24, the operation information generation unit 25, the utterance desire degree calculation unit 26, and the storage unit 29.

The input unit 22 receives input from the user and passes it to the control unit 21. In the example shown in FIG. 2, the input unit 22 includes a mouse 221, a camera 222, and a microphone 223. The mouse 221 lets the user operate the client 11. For example, the mouse 221 lets the user operate the user interface provided by the remote conference application. A touchpad (trackpad), a touch panel, a keyboard, or the like may be used instead of or in addition to the mouse 221. The camera 222 captures the user and generates video data representing the user's video. The camera 222 may have a physical button for switching the camera 222 on and off. The microphone 223 picks up the user's voice and generates audio data representing it. The microphone 223 may have a physical button for switching the microphone 223 on and off. The control unit 21 receives the video data from the camera 222 and the audio data from the microphone 223, and transmits them to the other clients 11 via the communication unit 24.

The output unit 23 outputs information generated by the control unit 21 to the user. In the example shown in FIG. 2, the output unit 23 includes a display device 231 and a speaker 232. The display device 231 is a display such as a liquid crystal display and shows images generated by the control unit 21. For example, the control unit 21 generates an image containing the user interface provided by the remote conference application, and the display device 231 displays that image. The user interface includes an area for displaying the video of other users. The control unit 21 receives the video data of other users from the other clients 11 via the communication unit 24 and applies the received video data to the user interface so that the other users' video is displayed there. The speaker 232 emits sound according to the sound data supplied by the control unit 21. For example, the control unit 21 receives the audio data of other users from the other clients 11 via the communication unit 24 and sends the received audio data to the speaker 232 so that the speaker 232 outputs the other users' voices.

FIG. 3 schematically shows a user interface 30 for a remote conference provided by the remote conference application. In the example shown in FIG. 3, the user interface 30 includes a video area 31 and a control bar 32. The video area 31 displays the video of other users. The control bar 32 includes a mute button 321, an audio settings button 322, a video button 323, and a video settings button 324.

The mute button 321 switches audio input between on (enabled) and off (disabled). Clicking the mute button 321 while audio input is on switches it off, and clicking the mute button 321 while audio input is off switches it on. While audio input is on, the audio data obtained by the microphone 223 is sent to the other clients 11; while audio input is off, it is not.

The audio settings button 322 displays an audio-related list. The audio-related list contains multiple items such as microphone settings and speaker settings. When the microphone settings item is selected (clicked), a microphone settings screen for configuring the microphone 223 is displayed. The microphone settings screen allows the volume of the microphone 223 to be adjusted.

The video button 323 switches video input between on and off. Clicking the video button 323 while video input is on switches it off, and clicking the video button 323 while video input is off switches it on. While video input is on, the video data obtained by the camera 222 is transmitted to the other clients 11; while video input is off, it is not.

The video settings button 324 displays a video-related list. The video-related list contains multiple items such as camera switching and camera settings. When the camera settings item is selected, a camera settings screen for configuring the camera 222 in use is displayed. The camera settings screen shows the video currently being captured by the camera 222 in use.

Referring again to FIG. 2, the communication unit 24 communicates with the other clients 11 via the communication network 19 and the server 12. The communication unit 24 transmits information related to the remote conference, received from the control unit 21, to the other clients 11. For example, the communication unit 24 transmits the video data obtained by the camera 222 and the audio data obtained by the microphone 223 to the other clients 11. The communication unit 24 also receives information related to the remote conference from the other clients 11 and passes it to the control unit 21. For example, the communication unit 24 receives the video data and audio data obtained by the other clients 11.

The operation information generation unit 25 generates operation information indicating the operations the user performed on the client 11 during the remote conference, and stores the generated operation information in the operation information storage unit 291. Specifically, the operation information includes information indicating the operations the user performed, during the remote conference, on the user interface provided by the remote conference application. Examples of recorded operations include placing the cursor on the mute button 321, switching audio input from off to on, displaying the microphone settings screen, displaying the speaker settings screen, displaying the camera settings screen, bringing the remote conference application to the foreground, sending the remote conference application to the background, and speaking. The remote conference application running in the foreground means it is in an active state in which the user can operate it. The remote conference application running in the background means it is running but the user cannot operate it. The operation information generation unit 25 receives, from the control unit 21, mouse operation information indicating the user's operation of the mouse 221 and screen information indicating the image displayed on the display device 231. From the mouse operation information and the screen information, the operation information generation unit 25 can detect operations on the user interface, for example the position of the cursor on the user interface. For instance, the operation information generation unit 25 detects that the cursor has moved onto the mute button 321 and stayed there, and generates operation information for the operation of placing the cursor on the mute button 321.
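As a rough illustration of detecting that the cursor has moved onto the mute button and stayed there, the sketch below scans timestamped cursor samples against the button's rectangle. The `Rect` type, the `(time, x, y)` sample format, and the function name `hover_intervals` are assumptions made for this example; the source does not specify how the mouse operation information and screen information are represented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Rect:
    """Screen rectangle of a UI element (hypothetical representation)."""
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def hover_intervals(samples: Iterable[Tuple[float, float, float]],
                    button: Rect) -> List[Tuple[float, float]]:
    """Return (start, end) times during which the cursor stayed on the button.

    samples: (time_seconds, x, y) tuples in chronological order.
    """
    intervals: List[Tuple[float, float]] = []
    start = None
    last_inside = None
    for t, x, y in samples:
        if button.contains(x, y):
            if start is None:
                start = t          # cursor has just entered the button
            last_inside = t
        elif start is not None:
            intervals.append((start, t))  # cursor first observed outside
            start = None
    if start is not None:
        # Cursor is still on the button at the final sample.
        intervals.append((start, last_inside))
    return intervals
```

Each returned interval could then be turned into one operation record whose duration is the interval length.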

FIG. 4 schematically shows an example of the operation information stored in the operation information storage unit 291. Each operation is managed as one record (entry). In the example shown in FIG. 4, each record has the following data items: identifier (No.), operation type, start time, end time, duration, and operation flag. The identifier identifies the operation; for example, it represents the order in which the operations were performed. The operation type indicates the kind of operation. The start time indicates when the operation started. The end time indicates when the operation ended. The duration indicates how long the operation lasted. The operation flag indicates whether the operation is still ongoing: "-" means the operation has ended, and "○" means the operation is ongoing.
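One way to model such a record is sketched below. The field names, the use of seconds as floats for times, and deriving the duration and flag from the start and end times (instead of storing them) are choices made for this illustration, not details taken from FIG. 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperationRecord:
    """One entry of the operation information (illustrative layout)."""
    no: int                            # identifier, e.g. order of occurrence
    operation_type: str                # e.g. "cursor_on_mute_button"
    start_time: float                  # seconds since conference start
    end_time: Optional[float] = None   # None while the operation is ongoing

    @property
    def ongoing(self) -> bool:
        return self.end_time is None

    @property
    def flag(self) -> str:
        # "○" while ongoing, "-" once ended, mirroring the flags in the text.
        return "○" if self.ongoing else "-"

    def duration(self, now: float) -> float:
        # For an ongoing operation, measure up to the current time.
        end = now if self.end_time is None else self.end_time
        return end - self.start_time
```

For example, a record created with `OperationRecord(1, "cursor_on_mute_button", 10.0)` reports the flag "○" until its `end_time` is set.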

Referring again to FIG. 2, the utterance desire degree calculation unit 26 calculates the user's utterance desire degree based on the operation information stored in the operation information storage unit 291. In this embodiment, the utterance desire degree is defined to take a value from 0 to 1, with a larger value indicating a stronger desire to speak.

In this embodiment, the utterance desire degree is calculated on a rule basis. The rule storage unit 292 stores predetermined utterance desire estimation rules. The utterance desire degree calculation unit 26 refers to the utterance desire estimation rules stored in the rule storage unit 292 to calculate the user's utterance desire degree. The utterance desire estimation rules include information specifying the types of operation that are taken to indicate a desire to speak. Examples of such operations include placing the cursor on the mute button, switching audio input from off to on, displaying the microphone settings screen, displaying the camera settings screen, and bringing the remote conference application to the foreground.

In general, when a user in a remote conference is about to speak while audio input and/or video input is off, the user often does one or more of the following.
(1) The user places the cursor on the mute button so that audio input can be switched from off to on immediately after the current speaker finishes, and waits for the current speaker to finish.
(2) The user clicks the mute button to switch audio input from off to on, and then waits for the current speaker to finish.
(3) The user displays the microphone settings screen and checks the microphone volume.
(4) The user displays the camera settings screen and checks the video captured by the camera.
(5) The user brings the remote conference application back to the foreground.

Actions like these, which are often performed before speaking (pre-utterance actions), are adopted as the operations taken to indicate a desire to speak. Hereinafter, an operation taken to indicate a desire to speak is also called a target operation. Placing the cursor on the mute button, displaying the microphone settings screen, and displaying the camera settings screen are continuous target operations, whereas switching audio input from off to on and bringing the remote conference application to the foreground are momentary target operations. The utterance desire degree calculation unit 26 estimates that the user is in a state of wanting to speak when an operation matching a target operation has occurred since the user's most recent utterance (or, if the user has not yet spoken, since the user joined the remote conference or the remote conference started).

For each operation the user has performed since the most recent utterance, the utterance desire degree calculation unit 26 calculates a score indicating how likely the operation is to be a pre-utterance action, and calculates the utterance desire degree from the calculated scores. The utterance desire estimation rules may include a reference time set for each continuous target operation. The reference time of a target operation is used to calculate the score of a matching operation. As an example, the reference time is set to 5 seconds for placing the cursor on the mute button, 5 seconds for displaying the microphone settings screen, and 10 seconds for displaying the camera settings screen.
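The rule set described so far could be held as a small lookup table, as in the sketch below. The operation-type strings, the `TargetOperation` type, and the `RULES` name are hypothetical; only the split into continuous and momentary target operations and the example reference times (5 s, 5 s, 10 s) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TargetOperation:
    """Rule entry for one target operation (illustrative layout)."""
    continuous: bool
    reference_seconds: Optional[float] = None  # only for continuous operations


# Hypothetical identifiers for the target operations named in the text.
RULES: Dict[str, TargetOperation] = {
    "cursor_on_mute_button": TargetOperation(True, 5.0),
    "mic_settings_screen": TargetOperation(True, 5.0),
    "camera_settings_screen": TargetOperation(True, 10.0),
    "audio_input_off_to_on": TargetOperation(False),
    "app_to_foreground": TargetOperation(False),
}


def is_target(operation_type: str) -> bool:
    """True if the operation type is taken to indicate a desire to speak."""
    return operation_type in RULES
```

Operations that are recorded but are not target operations, such as displaying the speaker settings screen, are simply absent from the table.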

When the operation is a continuous target operation such as placing the cursor on the mute button, the utterance desire degree calculation unit 26 calculates the score of the operation from the duration of the operation and the reference time of the target operation. For example, when the duration of the operation is equal to or longer than the reference time, the utterance desire degree calculation unit 26 sets the score to 1. When the duration is shorter than the reference time, the utterance desire degree calculation unit 26 calculates the score from the difference or the ratio between the duration and the reference time. For example, with D the duration of the operation, R the reference time of the matching target operation, and S the score, S = D/R. In this example, when D is 2 seconds and R is 5 seconds, S is 0.4. The score may also be calculated with a function that is not linear, for example S = (D/R)^2. In that example, when D is 2 seconds and R is 5 seconds, S is 0.16.
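The score for a continuous target operation can be written as a single clamped function. In the sketch below, the function name `continuous_score` and the `exponent` parameter are assumptions for illustration; `exponent=1` gives the linear form S = D/R and `exponent=2` gives the quadratic form S = (D/R)^2 from the text.

```python
def continuous_score(duration: float, reference: float, exponent: float = 1.0) -> float:
    """Score of a continuous target operation.

    Returns 1 once the duration reaches the reference time, and
    (duration / reference) ** exponent while it is shorter.
    """
    if reference <= 0:
        raise ValueError("reference time must be positive")
    if duration >= reference:
        return 1.0
    return (max(duration, 0.0) / reference) ** exponent
```

With a duration of 2 seconds and a reference time of 5 seconds this yields about 0.4 for the linear form and about 0.16 for the quadratic form, matching the worked values above.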

A user may, for example, click the mute button 321 immediately after moving the cursor onto it in order to turn audio input on. When the user has clicked the mute button 321 to turn audio input on, the utterance desire degree calculation unit 26 may set the score of the cursor-on-mute-button operation to 1 regardless of how long the cursor stayed on the mute button.

When the operation is a momentary target operation, such as bringing the remote conference application to the foreground, the utterance desire degree calculation unit 26 sets the score of the operation to 1.

When the operation is not any of the target operations, the utterance desire degree calculation unit 26 sets the score of the operation to 0.
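Taken together, the per-operation scoring cases can be sketched as one dispatch function. The operation-type strings, the `followed_by_unmute_click` parameter, and the use of the linear score are assumptions for this illustration; the reference times are the example values given earlier.

```python
# Example reference times (seconds) for continuous target operations.
REFERENCE_SECONDS = {
    "cursor_on_mute_button": 5.0,
    "mic_settings_screen": 5.0,
    "camera_settings_screen": 10.0,
}
# Momentary target operations always score 1.
MOMENTARY = {"audio_input_off_to_on", "app_to_foreground"}


def score_operation(operation_type: str,
                    duration: float = 0.0,
                    followed_by_unmute_click: bool = False) -> float:
    """Score one operation as a possible pre-utterance action (0 to 1)."""
    if operation_type in MOMENTARY:
        return 1.0
    if operation_type in REFERENCE_SECONDS:
        # A click that unmutes overrides a short hover on the mute button.
        if operation_type == "cursor_on_mute_button" and followed_by_unmute_click:
            return 1.0
        return min(1.0, duration / REFERENCE_SECONDS[operation_type])
    return 0.0  # not a target operation
```

For instance, `score_operation("speaker_settings_screen", 8.0)` is 0 because displaying the speaker settings screen is recorded but is not a target operation.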

When there is a gap of at least a certain length between operations, the utterance desire degree calculation unit 26 may treat that period as one operation (operation type "no operation") and set its score to 0. The utterance desire estimation rules may include information indicating that length.
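The gap handling can be sketched as a pass over the chronologically ordered operations that inserts a "no operation" entry wherever the idle time reaches the configured length. The tuple layout `(type, start, end)`, the function name, and the `"no_operation"` identifier are assumptions for this example, and the source gives no concrete value for the minimum gap.

```python
from typing import List, Tuple

Operation = Tuple[str, float, float]  # (operation_type, start, end), in seconds


def fill_idle_gaps(operations: List[Operation], min_gap: float) -> List[Operation]:
    """Insert a 'no_operation' entry into every idle gap of at least min_gap.

    operations must be sorted by start time and must not overlap.
    The inserted entries are meant to be scored 0 downstream.
    """
    filled: List[Operation] = []
    prev_end = None
    for op_type, start, end in operations:
        if prev_end is not None and start - prev_end >= min_gap:
            filled.append(("no_operation", prev_end, start))
        filled.append((op_type, start, end))
        prev_end = end
    return filled
```

Because each inserted entry contributes a score of 0 to the average, long idle periods pull the utterance desire degree down.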

The utterance desire degree calculation unit 26 takes the average of the scores calculated for the individual operations as the utterance desire degree. Alternatively, the utterance desire degree calculation unit 26 may use a weighted average of the scores. As one example, operations that occurred within the last 30 seconds are given a weight of 1, operations that occurred between 60 and 30 seconds ago a weight of 0.9, operations that occurred between 90 and 60 seconds ago a weight of 0.8, and so on. In another example, the operation the user is currently performing is given a weight of 1, the previous operation a weight of 0.9, the operation before that a weight of 0.8, and so on.
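The plain and time-weighted averages might look like the following sketch. The text gives weights only for the first three 30-second windows and ends with "and so on"; continuing to subtract 0.1 per window and flooring the weight at 0 is an assumption made here, as are the function names and the `(score, time)` input format.

```python
from typing import Iterable, List, Tuple


def plain_average(scores: List[float]) -> float:
    """Utterance desire degree as the unweighted mean of operation scores."""
    return sum(scores) / len(scores) if scores else 0.0


def time_weight(age_seconds: float) -> float:
    """1.0 for the last 30 s, 0.9 for 30-60 s, 0.8 for 60-90 s.

    Assumption: the pattern continues in 0.1 steps and never drops below 0.
    """
    window = int(max(age_seconds, 0.0) // 30)
    return max(0.0, 1.0 - 0.1 * window)


def weighted_degree(scored_ops: Iterable[Tuple[float, float]], now: float) -> float:
    """Weighted mean of (score, time_of_operation) pairs, weighted by recency."""
    weighted_sum = 0.0
    weight_total = 0.0
    for score, occurred_at in scored_ops:
        w = time_weight(now - occurred_at)
        weighted_sum += w * score
        weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

Since every score lies between 0 and 1 and the weights are non-negative, both averages stay within the 0 to 1 range defined for the utterance desire degree.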

The control unit 21 transmits user information based on the user's utterance desire degree to the other clients 11 via the communication unit 24. For example, the control unit 21 drives the communication unit 24 to transmit the user information to the other clients 11. The user information may contain the utterance desire degree itself. Alternatively, the user information may contain information notifying that the user has a desire to speak. For example, when the utterance desire degree calculated by the utterance desire degree calculation unit 26 exceeds a predetermined threshold, the control unit 21 notifies the other clients 11 that the user has a desire to speak.
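The two variants of the user information can be sketched as follows. The JSON encoding, the field names, the `send_raw_degree` switch, and the default threshold of 0.5 are all assumptions for illustration; the source specifies neither a message format nor a threshold value.

```python
import json
from typing import Optional


def build_user_info(user_id: str,
                    degree: float,
                    threshold: float = 0.5,
                    send_raw_degree: bool = True) -> Optional[str]:
    """Build the message sent to the other clients, or None if nothing is sent.

    send_raw_degree=True : always send the utterance desire degree itself.
    send_raw_degree=False: send a notification only when the degree
                           exceeds the threshold.
    """
    if send_raw_degree:
        return json.dumps({"user_id": user_id, "utterance_desire_degree": degree})
    if degree > threshold:
        return json.dumps({"user_id": user_id, "wants_to_speak": True})
    return None
```

Sending the raw degree leaves the presentation decision to the receiving client, while the notification variant keeps the thresholding on the sender's side.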

The control unit 21 receives, from the other clients 11 via the communication unit 24, user information based on the other users' utterance desire degrees. The control unit 21 applies the received user information to the user interface. Where the user information contains the utterance desire degree, the control unit 21 may display each user's utterance desire degree in association with that user's video. Alternatively, the control unit 21 may emphasize the video of any user whose utterance desire degree exceeds a predetermined threshold. For example, the control unit 21 may surround that user's video with a red frame or attach a mark to it.
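On the receiving side, the presentation choice can be reduced to a small decision per video tile, as in this sketch. The function name, the returned dictionary keys, the percentage label, and the default threshold of 0.5 are hypothetical; the text only says the degree may be shown next to the video or the video may be emphasized, for example with a red frame.

```python
from typing import Dict, Union


def tile_decoration(degree: float, threshold: float = 0.5) -> Dict[str, Union[str, bool]]:
    """Decide how to decorate one remote user's video tile."""
    return {
        "label": f"{degree:.0%}",             # degree shown next to the video
        "red_frame": degree > threshold,      # emphasize users above the threshold
    }
```

A client could apply both decorations at once or only one of them, depending on which variant of the user information it receives.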

図５は、クライアント１１のハードウェア構成例を概略的に示している。図５に示すように、クライアント１１は、ハードウェア構成要素として、図２に示したマウス２２１、カメラ２２２、マイク２２３、表示装置２３１、及びスピーカ２３２に加えて、コンピュータ５０を備える。 Figure 5 shows a schematic diagram of an example hardware configuration of client 11. As shown in Figure 5, client 11 includes, as hardware components, a computer 50 in addition to the mouse 221, camera 222, microphone 223, display device 231, and speaker 232 shown in Figure 2.

コンピュータ５０は、ＣＰＵ（Central Processing Unit）５１、ＲＡＭ（Random Access Memory）５２、プログラムメモリ５３、ストレージデバイス５４、入出力インタフェース５５、及び通信インタフェース５６を備える。ＣＰＵ５１は、ＲＡＭ５２、プログラムメモリ５３、ストレージデバイス５４、入出力インタフェース５５、及び通信インタフェース５６と通信可能に接続される。The computer 50 includes a CPU (Central Processing Unit) 51, a RAM (Random Access Memory) 52, a program memory 53, a storage device 54, an input/output interface 55, and a communication interface 56. The CPU 51 is communicatively connected to the RAM 52, the program memory 53, the storage device 54, the input/output interface 55, and the communication interface 56.

ＣＰＵ５１はプロセッサの一例である。プロセッサとして、他の汎用回路を使用してもよく、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field-Programmable Gate Array）などの専用回路を使用してもよい。The CPU 51 is an example of a processor. Other general-purpose circuits may be used as the processor, or dedicated circuits such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array) may be used.

ＲＡＭ５２はＳＤＲＡＭ（Synchronous Dynamic Random Access Memory）などの揮発性メモリを含む。ＲＡＭ５２はワーキングメモリとしてＣＰＵ５１により使用される。プログラムメモリ５３は、発話欲求推定プログラムを含むリモート会議アプリケーションなど、ＣＰＵ５１により実行されるプログラムを記憶する。プログラムはコンピュータ実行可能命令を含む。プログラムメモリ５３として例えばＲＯＭ（Read Only Memory）が使用される。ストレージデバイス５４の一部領域がプログラムメモリ５３として使用されてもよい。ＣＰＵ５１は、プログラムメモリ５３に記憶されたプログラムをＲＡＭ５２に展開し、プログラムを解釈及び実行する。リモート会議アプリケーションは、ＣＰＵ５１により実行されると、処理部２７に関して説明される一連の処理をＣＰＵ５１に行わせる。言い換えると、ＣＰＵ５１は、リモート会議アプリケーションに従って、制御部２１、操作情報生成部２５、及び発話欲求度合い算出部２６として機能する。なお、発話欲求推定プログラムはリモート会議アプリケーションとは別のプログラムとして設けられてもよい。発話欲求推定プログラムは、ＣＰＵ５１により実行されると、発話欲求推定に関連する一連の処理をＣＰＵ５１に行わせる。The RAM 52 includes a volatile memory such as an SDRAM (Synchronous Dynamic Random Access Memory). The RAM 52 is used by the CPU 51 as a working memory. The program memory 53 stores a program executed by the CPU 51, such as a remote conference application including a speech desire estimation program. The program includes computer executable instructions. For example, a ROM (Read Only Memory) is used as the program memory 53. A portion of the storage device 54 may be used as the program memory 53. The CPU 51 expands the program stored in the program memory 53 into the RAM 52, and interprets and executes the program. When the remote conference application is executed by the CPU 51, the CPU 51 performs a series of processes described with respect to the processing unit 27. In other words, the CPU 51 functions as the control unit 21, the operation information generation unit 25, and the speech desire degree calculation unit 26 according to the remote conference application. The speech desire estimation program may be provided as a program separate from the remote conference application. When executed by the CPU 51, the desire to speak estimation program causes the CPU 51 to perform a series of processes related to desire to speak estimation.

プログラムは、コンピュータで読み取り可能な記録媒体に記憶された状態でコンピュータ５０に提供されてよい。この場合、コンピュータ５０は、記録媒体からデータを読み出すドライブを備え、記録媒体からプログラムを取得する。記録媒体の例は、磁気ディスク、光ディスク（ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＤＶＤ－ＲＯＭ、ＤＶＤ－Ｒなど）、光磁気ディスク（ＭＯなど）、及び半導体メモリを含む。また、プログラムはネットワークを通じて配布するようにしてもよい。具体的には、プログラムをネットワーク上のサーバに格納し、コンピュータ５０がサーバからプログラムをダウンロードするようにしてもよい。The program may be provided to the computer 50 in a state where it is stored on a computer-readable recording medium. In this case, the computer 50 is equipped with a drive for reading data from the recording medium and acquires the program from the recording medium. Examples of recording media include magnetic disks, optical disks (CD-ROM, CD-R, DVD-ROM, DVD-R, etc.), magneto-optical disks (MO, etc.), and semiconductor memories. The program may also be distributed over a network. Specifically, the program may be stored on a server on the network, and the computer 50 may download the program from the server.

ストレージデバイス５４は、ＨＤＤ（Hard Disk Drive）又はＳＳＤ（Solid State Drive）などの不揮発性メモリを含む。ストレージデバイス５４はデータを記憶する。ストレージデバイス５４は、記憶部２９、具体的には、操作情報記憶部２９１及びルール記憶部２９２として機能する。The storage device 54 includes a non-volatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The storage device 54 stores data. The storage device 54 functions as the storage unit 29, specifically, as the operation information storage unit 291 and the rule storage unit 292.

入出力インタフェース５５は周辺機器と通信するためのインタフェースである。マウス２２１、カメラ２２２、マイク２２３、表示装置２３１、及びスピーカ２３２は入出力インタフェース５５によりコンピュータ５０に接続される。コンピュータ５０がノート型ＰＣである例では、カメラ２２２、マイク２２３、表示装置２３１、及びスピーカ２３２はコンピュータ５０に内蔵されたものであり得る。The input/output interface 55 is an interface for communicating with peripheral devices. The mouse 221, the camera 222, the microphone 223, the display device 231, and the speaker 232 are connected to the computer 50 by the input/output interface 55. In an example in which the computer 50 is a notebook PC, the camera 222, the microphone 223, the display device 231, and the speaker 232 may be built into the computer 50.

通信インタフェース５６は、通信ネットワーク１９に接続される外部装置（例えば図１に示すサーバ１２及び他のクライアント１１）と通信するためのインタフェースである。通信インタフェース５６は、有線モジュール及び／又は無線モジュールを備える。通信インタフェース５６は通信部２４として機能する。The communication interface 56 is an interface for communicating with external devices (e.g., the server 12 and other clients 11 shown in FIG. 1) connected to the communication network 19. The communication interface 56 includes a wired module and/or a wireless module. The communication interface 56 functions as the communication unit 24.

［動作］ [Operation]
図６は、図２に示したクライアント１１により実行される発話欲求推定方法を概略的に示している。ここでは、現時刻において他のユーザが発話しているものとする。 Fig. 6 schematically shows the desire to speak estimation method executed by the client 11 shown in Fig. 2. Here, it is assumed that another user is speaking at the current time.

図６のステップＳ６１において、操作情報生成部２５は、リモート会議中にユーザがクライアント１１に対して行った操作を示す操作情報を生成し、生成した操作情報を操作情報記憶部２９１に記憶させる。具体的には、操作情報生成部２５は、会議アプリケーションにより提供されるユーザインタフェースに対するユーザの操作を示す操作情報を生成する。In step S61 of Fig. 6, the operation information generating unit 25 generates operation information indicating an operation performed by a user on the client 11 during the remote conference, and stores the generated operation information in the operation information storage unit 291. Specifically, the operation information generating unit 25 generates operation information indicating an operation performed by the user on a user interface provided by the conference application.

ステップＳ６２において、発話欲求度合い算出部２６は、操作情報に基づいてユーザの発話欲求度合いを算出する。例えば、発話欲求度合い算出部２６は、操作情報記憶部２９１に記憶されている操作情報から、リモート会議中におけるユーザによる１つ前の発話の後にユーザがクライアント１１に対して行った操作を特定し、操作ごとにスコアを算出し、算出されたスコアから発話欲求度合いを算出する。操作が対象操作のいずれかである場合、発話欲求度合い算出部２６は、操作の継続時間Ｄと対象操作に関する基準時間Ｒとに基づいて操作のスコアを算出する。発話欲求度合い算出部２６は、操作の継続時間Ｄが対象操作に関する基準時間Ｒ以上である場合、スコアを１に決定し、操作の継続時間Ｄが対象操作種に関する基準時間Ｒを下回る場合、操作の継続時間Ｄを対象操作種に関する基準時間Ｒで割った値を操作のスコアとして得る。操作がいずれの対象操作でもない場合、発話欲求度合い算出部２６は、操作のスコアをゼロに決定する。操作間に一定時間の空きがある場合、発話欲求度合い算出部２６は、対象操作に該当しない操作が行われたものとみなし、当該操作のスコアをゼロに決定する。続いて、発話欲求度合い算出部２６は、検出した操作ごとに算出されたスコアを平均することにより、ユーザの発話欲求度合いを求める。In step S62, the desire to speak degree calculation unit 26 calculates the user's degree of desire to speak based on the operation information. For example, the desire to speak degree calculation unit 26 identifies, from the operation information stored in the operation information storage unit 291, the operations that the user performed on the client 11 after the user's previous utterance during the remote conference, calculates a score for each operation, and calculates the degree of desire to speak from the calculated scores. If an operation is one of the target operations, the desire to speak degree calculation unit 26 calculates the score of the operation based on the duration D of the operation and the reference time R for that target operation. If the duration D is equal to or longer than the reference time R, the score is set to 1; if the duration D is shorter than the reference time R, the score is the duration D divided by the reference time R. If an operation is not any of the target operations, the desire to speak degree calculation unit 26 sets the score of the operation to zero. If a certain period elapses between operations, the desire to speak degree calculation unit 26 assumes that an operation not corresponding to any target operation has been performed and sets the score of that operation to zero. The desire to speak degree calculation unit 26 then obtains the user's degree of desire to speak by averaging the scores calculated for the detected operations.
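The scoring rule of step S62 can be sketched in a few lines. This is a minimal illustration rather than the claimed implementation; the function names `operation_score` and `desire_degree`, and the use of `None` to mark a non-target operation, are assumptions made for the example.

```python
from typing import Optional, Sequence


def operation_score(duration_s: float, reference_s: Optional[float]) -> float:
    """Score of one operation.

    reference_s is the reference time R of the matching target operation,
    or None when the operation is not a target operation (assumed encoding).
    """
    if reference_s is None:
        return 0.0
    # D >= R gives 1; D < R gives D / R.
    return min(duration_s / reference_s, 1.0)


def desire_degree(scores: Sequence[float]) -> float:
    """Degree of desire to speak: the average of the per-operation scores."""
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a microphone settings screen shown for 1 second against a 5 second reference time scores 0.2, and averaging it with one zero-score operation gives a degree of 0.1.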

ステップＳ６３において、制御部２１は、通信部２４を介して他のクライアント１１に、ステップＳ６２で得られたユーザの発話欲求度合いを含むユーザ情報を送信する。In step S63, the control unit 21 transmits user information including the user's degree of desire to speak obtained in step S62 to other clients 11 via the communication unit 24.

ステップＳ６１に示す処理は、リモート会議中に、周期的に、例えば１秒間隔で、実行されてよい。ステップＳ６２、Ｓ６３に示す処理は、リモート会議中且つユーザが発話していない期間中に、周期的に、例えば１秒間隔で、実行されてよい。The process shown in step S61 may be executed periodically, for example at one-second intervals, during the remote conference. The processes shown in steps S62 and S63 may be executed periodically, for example at one-second intervals, during the remote conference and during a period when the user is not speaking.

図４に示す操作情報を参照して、発話欲求度合いの算出について説明する。ここでは、ミュートボタン上へのカーソル配置に関する基準時間は５秒に設定され、マイク設定画面表示に関する基準時間は５秒に設定され、カメラ設定画面表示に関する基準時間は１０秒に設定されるものとする。Calculation of the degree of desire to speak will be described with reference to the operation information shown in Figure 4. Here, the reference time for placing the cursor on the mute button is set to 5 seconds, the reference time for displaying the microphone setting screen is set to 5 seconds, and the reference time for displaying the camera setting screen is set to 10 seconds.

発話が終了した１４：２８：２２～１４：３０：２１では、ユーザは何の操作もしておらず、発話欲求度合いはゼロである。１４：２９：２２では、６０秒間何の操作も発生しなかったことから、発話欲求度合い算出部２６は、１つの操作が発生したと判断し、当該操作のスコアを０と決定する。発話欲求度合いはゼロのままである。 From 14:28:22, when the utterance ended, until 14:30:21, the user performs no operation, and the degree of desire to speak is zero. At 14:29:22, because no operation has occurred for 60 seconds, the desire to speak degree calculation unit 26 determines that one operation has occurred and sets the score of that operation to 0. The degree of desire to speak remains zero.

１４：３０：２１でユーザはマイク設定画面を開く。１４：３０：２２では、マイク設定画面表示のスコアが０．２（＝１／５）となり、発話欲求度合いは０．１（＝（０＋０．２）／２）となる。発話欲求度合いは、１４：３０：２３では０．２となり、１４：３０：２４では０．３となり、１４：３０：２５では０．４となり、１４：３０：２６～１４：３０：２７では０．５となる。At 14:30:21, the user opens the microphone settings screen. At 14:30:22, the score for displaying the microphone settings screen is 0.2 (= 1/5), and the degree of desire to speak is 0.1 (= (0 + 0.2)/2). The degree of desire to speak is 0.2 at 14:30:23, 0.3 at 14:30:24, 0.4 at 14:30:25, and 0.5 from 14:30:26 to 14:30:27.

１４：３０：２７でユーザはマイク設定画面を閉じてカメラ設定画面を開く。１４：３０：２８では、カメラ設定画面表示のスコアが０．１（＝１／１０）となり、発話欲求度合いは０．３７（≒（０＋１＋０．１）／３）となる。発話欲求度合いは、１４：３０：２９では０．４となり、１４：３０：３０では０．４３となり、・・、１４：３０：３６では０．６３となり、１４：３０：３７～１４：３１：０５では０．６７となる。１４：３０：４２でユーザはカメラ設定画面を閉じ、１４：３０：４２～１４：３１：０５まで操作を行わない。At 14:30:27, the user closes the microphone settings screen and opens the camera settings screen. At 14:30:28, the score for displaying the camera settings screen is 0.1 (= 1/10), and the degree of desire to speak is 0.37 (≒ (0 + 1 + 0.1)/3). The degree of desire to speak is 0.4 at 14:30:29, 0.43 at 14:30:30, ..., 0.63 at 14:30:36, and 0.67 from 14:30:37 to 14:31:05. At 14:30:42, the user closes the camera settings screen and performs no operation from 14:30:42 to 14:31:05.

１４：３１：０５でユーザはマウス２２１を操作してカーソルをミュートボタン３２１に合わせる。１４：３１：０６では、ミュートボタン上へのカーソル配置のスコアが０．２（＝１／５）となり、発話欲求度合いは０．５５（≒（０＋１＋１＋０．２）／４）となる。発話欲求度合いは、１４：３１：０７では０．６となり、１４：３１：０８では０．６５となり、１４：３１：０９では０．７となり、１４：３１：１０～１４：３１：１３では０．７５となる。１４：３１：１３でユーザはミュートボタン３２１をクリックして発話を開始する。At 14:31:05, the user operates mouse 221 to move the cursor to mute button 321. At 14:31:06, the score for placing the cursor on the mute button is 0.2 (= 1/5), and the degree of desire to speak is 0.55 (≒ (0 + 1 + 1 + 0.2) / 4). The degree of desire to speak is 0.6 at 14:31:07, 0.65 at 14:31:08, 0.7 at 14:31:09, and 0.75 from 14:31:10 to 14:31:13. At 14:31:13, the user clicks mute button 321 to start speaking.
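The worked example above can be replayed with a short script. The sketch below is an assumption-laden illustration: the operation names, the `OPS` list (a transcription of the walkthrough, not the actual contents of Fig. 4), and the rule that an operation is counted only once its start time has passed are all choices made for this example.

```python
# Reference times R in seconds, as stated in the walkthrough.
REFERENCE_S = {"mic_settings": 5, "camera_settings": 10, "cursor_on_mute": 5}


def t(hms: str) -> int:
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s


# (operation type, start, end). "idle" stands for the zero-score operation
# registered at 14:29:22 after 60 seconds without any operation.
OPS = [
    ("idle", t("14:29:22"), t("14:29:22")),
    ("mic_settings", t("14:30:21"), t("14:30:27")),
    ("camera_settings", t("14:30:27"), t("14:30:42")),
    ("cursor_on_mute", t("14:31:05"), t("14:31:13")),
]


def degree_at(now: int) -> float:
    """Average of the scores of all operations started before `now`."""
    scores = []
    for kind, start, end in OPS:
        if start >= now:
            continue  # not started yet, or started at this very second
        elapsed = min(now, end) - start
        ref = REFERENCE_S.get(kind)  # None for non-target operations
        scores.append(0.0 if ref is None else min(elapsed / ref, 1.0))
    return sum(scores) / len(scores) if scores else 0.0


for stamp in ("14:30:22", "14:30:26", "14:30:37", "14:31:06", "14:31:10"):
    print(stamp, round(degree_at(t(stamp)), 2))
```

Under these assumptions the script yields 0.1, 0.5, 0.67, 0.55, and 0.75 for the five timestamps, matching the values given in the walkthrough.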

［効果］ [Effects]
本実施形態では、通信ネットワーク１９を介したリモート会議に使用されるクライアント１１の各々は、リモート会議中にユーザがクライアント１１に対して行った操作を示す操作情報を生成し、操作情報に基づいてユーザの発話欲求度合いを算出し、算出された発話欲求度合いを他のクライアント１１に送信する。発話欲求度合いの算出には、ユーザがクライアント１１に対して行った操作を示す操作情報が使用される。当該構成によれば、音声及び映像情報を利用せずにユーザの発話欲求を推定することが可能となる。さらに、算出された発話欲求度合いが他のクライアント１１に通知される。当該構成によれば、各クライアント１１において他のユーザの発話欲求度合いを表示することが可能となる。その結果、各クライアント１１のユーザは他のユーザが発話を望んでいるか否かを判断することができるようになり、発話の衝突を回避できるようになる。 In this embodiment, each of the clients 11 used in a remote conference via the communication network 19 generates operation information indicating an operation performed by the user on the client 11 during the remote conference, calculates the user's degree of desire to speak based on the operation information, and transmits the calculated degree of desire to speak to the other clients 11. The operation information indicating the operation performed by the user on the client 11 is used to calculate the degree of desire to speak. With this configuration, it is possible to estimate the user's desire to speak without using audio and video information. Furthermore, the calculated degree of desire to speak is notified to the other clients 11. With this configuration, each client 11 can display the degree of desire to speak of the other users. As a result, the user of each client 11 can determine whether the other users want to speak, and speech collisions can be avoided.

クライアント１１は、操作情報からリモート会議中におけるユーザによる１つ前の発話の後にユーザがクライアント１１に対して行った操作を特定し、特定された操作ごとに操作が発話の事前行動である可能性を示すスコアを算出し、算出されたスコアから発話欲求度合いを算出する。当該構成によれば、ユーザが発話の事前行動を行ったか否かを評価することが可能となり、ユーザの発話欲求を適切に推定することが可能となる。From the operation information, the client 11 identifies an operation performed by the user on the client 11 after the user's previous utterance during the remote conference, calculates a score indicating the possibility that the operation is a pre-utterance action for each identified operation, and calculates the degree of desire to utter from the calculated score. With this configuration, it is possible to evaluate whether the user has performed a pre-utterance action, and it is possible to appropriately estimate the user's desire to utter.

クライアント１１は、操作が継続的な対象操作である場合、操作の継続時間と対象操作に関する基準時間との比較に基づいて操作のスコアを算出してよい。当該構成によれば、操作が行われた時間長に応じてスコアを算出することが可能となる。When an operation is a continuous target operation, the client 11 may calculate a score for the operation based on a comparison between the duration of the operation and a reference time for the target operation. This configuration makes it possible to calculate a score according to the length of time the operation is performed.

継続的な対象操作は、音声入力をオンとオフとの間で切り替えるミュートボタンへのカーソル配置と、マイクを設定するためのマイク設定画面の表示と、カメラを設定するためのカメラ設定画面の表示と、の少なくとも１つを含んでよい。これらは発話の事前行動の典型例であり、よって、ユーザの発話欲求を適切に推定することが可能となる。The continuous target operation may include at least one of the following: moving the cursor to a mute button that switches the voice input between on and off; displaying a microphone setting screen for setting the microphone; and displaying a camera setting screen for setting the camera. These are typical examples of pre-speech behaviors, and thus it is possible to appropriately estimate the user's desire to speak.

＜第２の実施形態＞ <Second Embodiment>
上述した第１の実施形態では、ルールベースで発話欲求度合いを算出する。第２の実施形態では、機械学習により得られる発話欲求推定モデルを使用して発話欲求度合いを算出する。第２の実施形態では、第１の実施形態と同じ構成要素及び処理についての説明は適宜省略する。 In the first embodiment described above, the degree of desire to speak is calculated on a rule basis. In the second embodiment, the degree of desire to speak is calculated using a desire to speak estimation model obtained by machine learning. In the second embodiment, descriptions of components and processes that are the same as in the first embodiment are omitted as appropriate.

［構成］ [Configuration]
図７は、第２の実施形態に係るクライアント７１を概略的に示している。第２の実施形態に係る会議システムは図１に示したものと同じであり、図７に示すクライアント７１は図１に示したクライアント１１の代替として使用される。図７において、図２に示したものと同様の構成要素に同様の符号を付して、それらについての説明を適宜省略する。 Fig. 7 schematically shows a client 71 according to the second embodiment. The conference system according to the second embodiment is the same as that shown in Fig. 1, and the client 71 shown in Fig. 7 is used in place of the client 11 shown in Fig. 1. In Fig. 7, components similar to those shown in Fig. 2 are denoted by the same reference numerals, and their description is omitted as appropriate.

図７に示すように、クライアント７１は、制御部２１、入力部２２、出力部２３、通信部２４、操作情報生成部２５、発話欲求度合い算出部７６、学習部７８、及び記憶部７９を備える。記憶部７９は、操作情報記憶部２９１及びモデル記憶部７９２を備える。制御部２１、操作情報生成部２５、発話欲求度合い算出部７６、及び学習部７８を処理部７７と総称する。制御部２１、通信部２４、操作情報生成部２５、発話欲求度合い算出部７６、学習部７８、操作情報記憶部２９１、及びモデル記憶部７９２は、第２の実施形態に係る発話欲求推定装置に相当する。 As shown in Fig. 7, the client 71 includes a control unit 21, an input unit 22, an output unit 23, a communication unit 24, an operation information generation unit 25, a desire to speak degree calculation unit 76, a learning unit 78, and a storage unit 79. The storage unit 79 includes an operation information storage unit 291 and a model storage unit 792. The control unit 21, the operation information generation unit 25, the desire to speak degree calculation unit 76, and the learning unit 78 are collectively referred to as a processing unit 77. The control unit 21, the communication unit 24, the operation information generation unit 25, the desire to speak degree calculation unit 76, the learning unit 78, the operation information storage unit 291, and the model storage unit 792 correspond to the desire to speak estimation device according to the second embodiment.

学習部７８は、機械学習により、クライアント７１に対する少なくとも１つの操作を示す操作情報を入力として受け取り、発話欲求度合いを表す数値を出力するように構成された発話欲求推定モデルを生成する。学習部７８は、操作情報記憶部２９１に記憶されている操作情報を学習データとして使用して発話欲求推定モデルを学習する。発話欲求推定モデルはニューラルネットワークであってよく、学習はニューラルネットワークを構成するパラメータ（重み及びバイアス）を決定する処理である。The learning unit 78 uses machine learning to generate a desire to speak estimation model configured to receive, as input, operation information indicating at least one operation on the client 71 and to output a numerical value representing the degree of desire to speak. The learning unit 78 trains the desire to speak estimation model using the operation information stored in the operation information storage unit 291 as training data. The desire to speak estimation model may be a neural network, in which case training is the process of determining the parameters (weights and biases) of the neural network.

学習部７８は、操作情報記憶部２９１に記憶されている操作情報から、発話につながる操作情報と発話につながらない操作情報とを生成する。例えば、学習部７８は、各発話の直前の所定期間（例えば６０秒間）における操作情報を発話につながる操作情報として得る。具体的には、学習部７８は、各発話の開始タイムより６０秒前の時刻から発話の開始タイムまでの操作情報を発話につながる操作情報として得る。学習部７８は、それより前の所定期間（例えば６０秒間）における操作情報を発話につながらない操作情報として得る。具体的には、学習部７８は、各発話の開始タイムより１２０秒前の時刻から発話の開始タイムより６０秒前の時刻までの操作情報や各発話の開始タイムより１８０秒前の時刻から発話の開始タイムより１２０秒前の時刻までの操作情報などを発話につながらない操作情報として得る。The learning unit 78 generates operation information that leads to speech and operation information that does not lead to speech from the operation information stored in the operation information storage unit 291. For example, the learning unit 78 obtains operation information for a predetermined period (e.g., 60 seconds) immediately before each speech as operation information that leads to speech. Specifically, the learning unit 78 obtains operation information from a time 60 seconds before the start time of each speech to the start time of the speech as operation information that leads to speech. The learning unit 78 obtains operation information for a predetermined period (e.g., 60 seconds) before that as operation information that does not lead to speech. Specifically, the learning unit 78 obtains operation information from a time 120 seconds before the start time of each speech to a time 60 seconds before the start time of the speech, operation information from a time 180 seconds before the start time of each speech to a time 120 seconds before the start time of the speech, and the like as operation information that does not lead to speech.
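The windowing described above can be sketched as follows. This is an illustrative sketch only: the representation of an operation as a `(timestamp_s, op_type)` tuple, the function name `split_windows`, and the parameter `negatives_per_utterance` are assumptions. The sketch also ignores the case where an earlier window overlaps a previous utterance.

```python
from typing import List, Sequence, Tuple

Operation = Tuple[float, str]  # (timestamp in seconds, operation type)


def split_windows(operations: Sequence[Operation],
                  utterance_starts: Sequence[float],
                  window_s: float = 60.0,
                  negatives_per_utterance: int = 2):
    """Return (positives, negatives), each a list of operation windows."""

    def window(lo: float, hi: float) -> List[Operation]:
        return [op for op in operations if lo <= op[0] < hi]

    positives, negatives = [], []
    for start in utterance_starts:
        # The window immediately before the utterance leads to speech.
        positives.append(window(start - window_s, start))
        # Earlier windows of the same length do not lead to speech.
        for k in range(1, negatives_per_utterance + 1):
            negatives.append(
                window(start - (k + 1) * window_s, start - k * window_s))
    return positives, negatives
```

With an utterance starting at 180 seconds, the positive window covers 120 to 180 seconds and the two negative windows cover 60 to 120 seconds and 0 to 60 seconds.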

学習部７８は、発話につながる操作情報及び発話につながらない操作情報を発話欲求推定モデルへの入力として使用して発話欲求推定モデルの機械学習を行う。モデル記憶部７９２は、学習部７８により生成された発話欲求推定モデルを記憶する。The learning unit 78 performs machine learning of the desire to speak estimation model using operation information that leads to speech and operation information that does not lead to speech as input to the desire to speak estimation model. The model storage unit 792 stores the desire to speak estimation model generated by the learning unit 78.

発話欲求度合い算出部７６は、発話欲求推定モデルを使用して、操作情報記憶部２９１に記憶されている操作情報に基づいて、ユーザの発話欲求度合いを算出する。例えば、発話欲求度合い算出部７６は、操作情報記憶部２９１に記憶されている操作情報から、所定期間（例えば６０秒間）における操作情報を抽出する。具体的には、発話欲求度合い算出部７６は、操作情報記憶部２９１に記憶されている操作情報から、リモート会議中におけるユーザによる１つ前の発話の後であって現時刻より６０秒前の時刻から現時刻までにユーザがクライアント７１に対して行った操作を示す操作情報を抽出する。発話欲求度合い算出部７６は、抽出された操作情報を発話欲求推定モデルに入力し、発話欲求推定モデルから出力される数値を発話欲求度合いとして得る。The desire to speak degree calculation unit 76 uses the desire to speak estimation model to calculate the user's degree of desire to speak based on the operation information stored in the operation information storage unit 291. For example, the desire to speak degree calculation unit 76 extracts operation information for a predetermined period (e.g., 60 seconds) from the operation information stored in the operation information storage unit 291. Specifically, the desire to speak degree calculation unit 76 extracts operation information indicating the operations that the user performed on the client 71 after the user's previous utterance during the remote conference and within the period from 60 seconds before the current time to the current time. The desire to speak degree calculation unit 76 inputs the extracted operation information into the desire to speak estimation model and obtains the numerical value output from the model as the degree of desire to speak.
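The extraction step combines two lower bounds: the end of the user's previous utterance and the start of the 60 second window. A minimal sketch, assuming operations are `(timestamp_s, op_type)` tuples and that both bounds are inclusive (the specification does not state the boundary handling):

```python
from typing import List, Sequence, Tuple

Operation = Tuple[float, str]  # (timestamp in seconds, operation type)


def extract_recent(operations: Sequence[Operation],
                   now_s: float,
                   last_utterance_end_s: float,
                   window_s: float = 60.0) -> List[Operation]:
    """Operations after the previous utterance and within the last window."""
    lower = max(now_s - window_s, last_utterance_end_s)
    return [op for op in operations if lower <= op[0] <= now_s]
```

The returned list is what would be fed to the estimation model; the name `extract_recent` is hypothetical.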

発話欲求推定モデルから出力される値の範囲が０から１までの範囲でない場合、発話欲求度合い算出部７６は、発話欲求推定モデルから出力される値が０から１までの範囲になるように正規化を行ってよい。If the range of values output from the desire to speak estimation model is not between 0 and 1, the desire to speak degree calculation unit 76 may perform normalization so that the values output from the desire to speak estimation model are in the range from 0 to 1.
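One simple way to perform such a normalization is min-max scaling with clamping, sketched below. The specification does not name a normalization method, so this choice, the function name `normalize`, and the assumption that the model's output range `[lo, hi]` is known in advance are all illustrative.

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Map a model output from the assumed range [lo, hi] onto [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo)
    # Clamp in case the model produces a value outside the assumed range.
    return min(max(scaled, 0.0), 1.0)
```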

なお、操作情報がある程度蓄積されるまでは、発話欲求推定モデルの学習を行うことができない。このため、操作情報がある程度蓄積されるまでは、発話欲求度合い算出部７６は予め用意された発話欲求推定モデル（リモート会議アプリケーションにプリセットされる発話欲求推定モデル）を使用してよい。代替として、発話欲求度合い算出部７６は、第１の実施形態で説明したものと同じ方法で発話欲求度合いを算出するようにしてもよい。Note that the speech desire estimation model cannot be trained until a certain amount of operation information has been accumulated. Therefore, until a certain amount of operation information has been accumulated, the speech desire degree calculation unit 76 may use a pre-prepared speech desire estimation model (a speech desire estimation model preset in the remote conference application). Alternatively, the speech desire degree calculation unit 76 may calculate the speech desire degree in the same manner as described in the first embodiment.

クライアント７１は図５に示したものと同様のハードウェア構成を有することができる。本実施形態に係る発話欲求推定プログラムを含むリモート会議アプリケーションは、ＣＰＵにより実行されると、処理部７７に関して説明される一連の処理をＣＰＵに行わせる。言い換えると、ＣＰＵは、リモート会議アプリケーションに従って、制御部２１、通信部２４、操作情報生成部２５、発話欲求度合い算出部７６、学習部７８として機能する。The client 71 may have a hardware configuration similar to that shown in Figure 5. When executed by a CPU, the remote conference application including the desire to speak estimation program according to this embodiment causes the CPU to perform a series of processes described with respect to the processing unit 77. In other words, the CPU functions as the control unit 21, the communication unit 24, the operation information generation unit 25, the desire to speak degree calculation unit 76, and the learning unit 78 in accordance with the remote conference application.

［動作］ [Operation]
クライアント７１により実行される学習方法を説明する。 The learning method executed by the client 71 will now be described.

操作情報生成部２５は、リモート会議中にユーザがクライアント７１に対して行った操作を示す操作情報を生成し、生成した操作情報を操作情報記憶部２９１に記憶させる。The operation information generation unit 25 generates operation information indicating operations performed by a user on the client 71 during a remote conference, and stores the generated operation information in the operation information storage unit 291.

学習部７８は、操作情報記憶部２９１に記憶されている操作情報から、発話につながる操作情報としての複数の第１サンプルと発話につながらない操作情報としての複数の第２サンプルとを含む複数のサンプルを生成する。各サンプルには正解データが付与される。例えば、発話欲求推定モデルの出力層が２つのノードを含む場合、各第１サンプルにはベクトル（１，０）が正解データとして付与され、各第２サンプルにはベクトル（０，１）が正解データとして付与されてよい。The learning unit 78 generates a plurality of samples including a plurality of first samples as operation information leading to speech and a plurality of second samples as operation information not leading to speech from the operation information stored in the operation information storage unit 291. Correct answer data is assigned to each sample. For example, if the output layer of the speech desire estimation model includes two nodes, the vector (1, 0) may be assigned to each first sample as correct answer data, and the vector (0, 1) may be assigned to each second sample as correct answer data.

学習部７８は、例えばランダムに、サンプルの中から少なくとも１つのサンプルを選択する。学習部７８は、各サンプルを発話欲求推定モデルに入力し、発話欲求推定モデルからの出力データを得る。学習部７８は、出力データが正解データに近づくように、発話欲求推定モデルのパラメータを更新する。例えば、目的関数として交差エントロピー誤差を使用し、最適化アルゴリズムとして勾配降下法を使用してよい。The learning unit 78 selects at least one sample from the samples, for example randomly. The learning unit 78 inputs each sample into the speech desire estimation model and obtains output data from the speech desire estimation model. The learning unit 78 updates parameters of the speech desire estimation model so that the output data approaches the correct answer data. For example, the cross-entropy error may be used as the objective function, and gradient descent may be used as the optimization algorithm.
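The select, forward, and update cycle described above can be illustrated with a deliberately small model. The sketch below trains a single linear layer with two output nodes, softmax outputs, cross-entropy loss, and stochastic gradient descent. It is not the model of the specification: the fixed-length feature vector per sample, the single-layer architecture, and all names and hyperparameters are assumptions made for illustration.

```python
import math
import random


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(params, x):
    """Return the two output probabilities for feature vector x."""
    W, b = params
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(z)


def train(samples, n_features, steps=200, lr=0.5, seed=0):
    """samples: list of (feature_vector, one_hot_label) pairs.

    Label (1, 0) marks a first sample (leads to speech) and
    label (0, 1) marks a second sample (does not lead to speech).
    """
    rng = random.Random(seed)
    W = [[0.0] * n_features for _ in range(2)]
    b = [0.0, 0.0]
    for _ in range(steps):
        x, y = rng.choice(samples)  # random sample selection
        p = forward((W, b), x)
        for k in range(2):
            # Gradient of the cross-entropy loss with respect to logit k.
            g = p[k] - y[k]
            for j in range(n_features):
                W[k][j] -= lr * g * x[j]
            b[k] -= lr * g
    return W, b
```

After training, the output of the first node, `forward(params, x)[0]`, could be read as the degree of desire to speak; that reading is also an assumption of this sketch.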

学習部７８は、サンプル選択からパラメータ更新までの処理を繰り返し実行する。その結果、クライアント７１を使用するユーザに適合する発話欲求推定モデルが生成される。The learning unit 78 repeatedly executes the processes from sample selection to parameter update. As a result, a desire to speak estimation model adapted to the user of the client 71 is generated.

次に、クライアント７１により実行される発話欲求推定方法を説明する。ここでは、発話欲求推定モデルの学習が完了しているものとする。さらに、現時刻において他のユーザが発話しているものとする。Next, we will explain the method of estimating desire to speak executed by the client 71. Here, we assume that the learning of the desire to speak estimation model has been completed. Furthermore, we assume that another user is currently speaking.

発話欲求度合い算出部７６は、モデル記憶部７９２に記憶されている発話欲求推定モデルを使用して、操作情報記憶部２９１に記憶されている操作情報に基づいて、ユーザの発話欲求度合いを算出する。例えば、発話欲求度合い算出部７６は、操作情報記憶部２９１に記憶されている操作情報から、現時刻より６０秒前の時刻から現時刻までの操作情報を抽出し、抽出された操作情報を発話欲求推定モデルに入力し、発話欲求推定モデルから出力される値を発話欲求度合いとして得る。The desire to speak degree calculation unit 76 uses the desire to speak estimation model stored in the model storage unit 792 to calculate the user's degree of desire to speak based on the operation information stored in the operation information storage unit 291. For example, the desire to speak degree calculation unit 76 extracts, from the operation information stored in the operation information storage unit 291, the operation information from 60 seconds before the current time to the current time, inputs the extracted operation information into the desire to speak estimation model, and obtains the value output from the model as the degree of desire to speak.

制御部２１は、通信部２４を介して他のクライアント１１に、発話欲求度合い算出部７６により算出されたユーザの発話欲求度合いを含むユーザ情報を送信する。The control unit 21 transmits user information including the user's degree of desire to speak calculated by the desire to speak degree calculation unit 76 to other clients 11 via the communication unit 24.

［効果］ [Effects]
本実施形態は、第１の実施形態で説明したものと同様の効果を得ることができる。本実施形態では、機械学習により得られる発話欲求推定モデルを使用して発話欲求度合いが算出される。当該構成によれば、ユーザの発話欲求をより適切に推定できることが期待できる。 This embodiment can obtain the same effects as those described in the first embodiment. In this embodiment, the degree of desire to speak is calculated using a desire to speak estimation model obtained by machine learning. With this configuration, the user's desire to speak can be expected to be estimated more appropriately.

クライアント７１は、操作情報記憶部２９１に記憶されている操作情報を学習データとして使用して発話欲求推定モデルを学習する。当該構成によれば、ユーザに適合した発話欲求推定モデルを得ることが可能となり、ユーザの発話欲求をさらに適切に推定することが可能となる。The client 71 trains the desire to speak estimation model using the operation information stored in the operation information storage unit 291 as training data. With this configuration, it is possible to obtain a desire to speak estimation model adapted to the user and to estimate the user's desire to speak even more appropriately.

＜変形例＞ <Modification>
上述した実施形態では、リモート会議はクライアントサーバモデルに基づいて実施される。他の実施形態では、会議システムがサーバを備えず、リモート会議はＰ２Ｐ（peer-to-peer）的にクライアント間で行われてもよい。 In the embodiments described above, the remote conference is conducted based on a client-server model. In other embodiments, the conference system may not include a server, and the remote conference may be conducted between clients in a peer-to-peer (P2P) manner.

なお、本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。また、各実施形態は適宜組み合わせて実施してもよく、その場合組み合わせた効果が得られる。さらに、上記実施形態には種々の発明が含まれており、開示される複数の構成要素から選択された組み合わせにより種々の発明が抽出され得る。例えば、実施形態に示される全構成要素からいくつかの構成要素が削除されても、課題が解決でき、効果が得られる場合には、この構成要素が削除された構成が発明として抽出され得る。 Note that the present invention is not limited to the above-described embodiments, and can be modified in various ways in the implementation stage without departing from the gist of the invention. The embodiments may also be implemented in appropriate combination, in which case the combined effects can be obtained. Furthermore, the above-described embodiments include various inventions, and various inventions can be extracted by combinations selected from the multiple components disclosed. For example, if the problem can be solved and the effect can be obtained even if some components are deleted from all the components shown in the embodiments, the configuration from which these components are deleted can be extracted as an invention.

１０ …会議システム
１１ …クライアント
１２ …サーバ
１９ …通信ネットワーク
２１ …制御部
２２ …入力部
２２１…マウス
２２２…カメラ
２２３…マイク
２３ …出力部
２３１…表示装置
２３２…スピーカ
２４ …通信部
２５ …操作情報生成部
２６ …算出部
２７ …処理部
２９ …記憶部
２９１…操作情報記憶部
２９２…ルール記憶部
３０ …ユーザインタフェース
３１ …映像領域
３２ …コントロールバー
３２１…ミュートボタン
３２２…オーディオ設定ボタン
３２３…映像ボタン
３２４…映像設定ボタン
５０ …コンピュータ
５１ …ＣＰＵ
５２ …ＲＡＭ
５３ …プログラムメモリ
５４ …ストレージデバイス
５５ …入出力インタフェース
５６ …通信インタフェース
７１ …クライアント
７６ …算出部
７７ …処理部
７８ …学習部
７９ …記憶部
７９２…モデル記憶部
LIST OF SYMBOLS
10 ... Conference system
11 ... Client
12 ... Server
19 ... Communication network
21 ... Control unit
22 ... Input unit
221 ... Mouse
222 ... Camera
223 ... Microphone
23 ... Output unit
231 ... Display device
232 ... Speaker
24 ... Communication unit
25 ... Operation information generation unit
26 ... Calculation unit
27 ... Processing unit
29 ... Storage unit
291 ... Operation information storage unit
292 ... Rule storage unit
30 ... User interface
31 ... Video area
32 ... Control bar
321 ... Mute button
322 ... Audio setting button
323 ... Video button
324 ... Video setting button
50 ... Computer
51 ... CPU
52 ... RAM
53 ... Program memory
54 ... Storage device
55 ... Input/output interface
56 ... Communication interface
71 ... Client
76 ... Calculation unit
77 ... Processing unit
78 ... Learning unit
79 ... Storage unit
792 ... Model storage unit

Claims

A desire to speak estimation device provided in a first conference device among a plurality of conference devices used in a remote conference via a communication network, the desire to speak estimation device comprising:
an operation information generation unit that generates operation information indicating an operation performed by a user on the first conference device during the remote conference, the operation information including information indicating a type of the operation and a duration that is a length of time during which the operation was performed;
a desire to speak degree calculation unit that calculates a desire to speak degree indicating a degree to which the user desires to speak, based on the information indicating the type and the duration included in the generated operation information; and
a communication unit that transmits information based on the calculated desire to speak degree to a second conference device among the plurality of conference devices.

The desire to speak estimation device according to claim 1, wherein
the desire to speak degree calculation unit identifies, from the generated operation information, each operation performed by the user on the first conference device after the user's previous utterance during the remote conference, calculates, for each identified operation, a score indicating the possibility that the operation is a pre-utterance behavior, and calculates the desire to speak degree from the calculated scores.

The desire to speak estimation device according to claim 2, wherein,
when the identified operation matches a predetermined operation, the desire to speak degree calculation unit calculates the score of the identified operation based on a comparison between the duration of the identified operation and a reference time set for that operation.

The desire to speak estimation device according to claim 3, wherein
the predetermined operation includes at least one of: moving a cursor onto a mute button for switching audio input on and off; displaying a microphone setting screen for configuring a microphone; and displaying a camera setting screen for configuring a camera.
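
As a non-limiting illustration, the rule-based calculation recited in the three preceding claims might be sketched in Python as follows. The operation type names, the reference times, the score formula (duration divided by reference time, capped at 1) and the use of the maximum score as the desire to speak degree are all assumptions made for this sketch; the claims only require that the score be based on a comparison between the duration and a reference time.

```python
from dataclasses import dataclass

# Hypothetical operation types and reference times in seconds (assumed values).
REFERENCE_TIME_S = {
    "cursor_on_mute_button": 2.0,
    "microphone_settings_open": 3.0,
    "camera_settings_open": 3.0,
}


@dataclass(frozen=True)
class OperationInfo:
    op_type: str       # type of the operation
    start_s: float     # start time, in seconds from the start of the conference
    duration_s: float  # length of time during which the operation was performed


def score_operation(op: OperationInfo) -> float:
    """Score in [0, 1] for how likely the operation is a pre-utterance behavior.

    Operations that are not predetermined operations score 0. Otherwise the
    duration is compared with the reference time as a ratio capped at 1
    (an assumed comparison rule).
    """
    reference = REFERENCE_TIME_S.get(op.op_type)
    if reference is None:
        return 0.0
    return min(op.duration_s / reference, 1.0)


def desire_degree(ops: list[OperationInfo], last_utterance_end_s: float) -> float:
    """Desire to speak degree from operations started after the previous utterance.

    The degree is taken as the highest score (an assumed aggregation rule).
    """
    recent = [op for op in ops if op.start_s >= last_utterance_end_s]
    return max((score_operation(op) for op in recent), default=0.0)
```

For example, a one-second hover over the mute button against an assumed two-second reference time gives a degree of 0.5, while operations performed before the previous utterance are ignored.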

The desire to speak estimation device according to any one of claims 1 to 4, further comprising
a desire to speak estimation model configured to receive, as input, operation information indicating at least one operation and to output a numerical value indicating the desire to speak degree, wherein
the desire to speak degree calculation unit extracts, from the generated operation information, operation information indicating an operation performed by the user on the first conference device after the user's previous utterance during the remote conference, inputs the extracted operation information into the desire to speak estimation model, and obtains the numerical value output from the desire to speak estimation model as the desire to speak degree.

The desire to speak estimation device according to claim 5, further comprising a learning unit that trains the desire to speak estimation model using the generated operation information.
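
As a non-limiting illustration, a desire to speak estimation model and its learning unit might be sketched in Python as follows. The claims do not specify the model type, the input features or the training labels; the logistic regression, the per-type total duration features, the operation type names and the label (whether the user actually spoke afterwards) are all assumptions made for this sketch.

```python
import math

# Hypothetical operation types used as model features (assumed names).
OP_TYPES = ("cursor_on_mute_button", "microphone_settings_open", "camera_settings_open")


def to_features(ops):
    """Total duration per operation type for (op_type, duration_s) pairs."""
    totals = {op_type: 0.0 for op_type in OP_TYPES}
    for op_type, duration_s in ops:
        if op_type in totals:
            totals[op_type] += duration_s
    return [totals[op_type] for op_type in OP_TYPES]


class DesireModel:
    """Logistic regression standing in for the desire to speak estimation model."""

    def __init__(self):
        self.weights = [0.0] * len(OP_TYPES)
        self.bias = 0.0

    def predict(self, ops):
        """Return a value in (0, 1) used as the desire to speak degree."""
        x = to_features(ops)
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, samples, epochs=500, lr=0.1):
        """Train by stochastic gradient descent on (ops, spoke) pairs.

        `spoke` is 1 if the user spoke after the operations and 0 otherwise
        (an assumed labeling scheme).
        """
        for _ in range(epochs):
            for ops, spoke in samples:
                x = to_features(ops)
                error = self.predict(ops) - spoke
                self.bias -= lr * error
                self.weights = [w - lr * error * v for w, v in zip(self.weights, x)]
```

In this sketch, `to_features` would be applied only to operations extracted after the user's previous utterance, and `fit` plays the role of the learning unit.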

A method for estimating a desire to speak, executed by a first conference device among a plurality of conference devices used in a remote conference via a communication network, the method comprising:
generating operation information indicating an operation performed by a user on the first conference device during the remote conference, the operation information including information indicating a type of the operation and a duration that is the length of time during which the operation was performed;
calculating a desire to speak degree, indicating the degree to which the user desires to speak, based on the information indicating the type and the duration included in the generated operation information; and
transmitting information based on the calculated desire to speak degree to a second conference device among the plurality of conference devices.

A program for causing a computer to function as the desire to speak estimation device according to any one of claims 1 to 6.