JP6550951B2 - Terminal, video conference system, and program

Info

Publication number: JP6550951B2
Application number: JP2015120357A
Authority: JP
Inventors: 悠斗後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2015-06-15
Filing date: 2015-06-15
Publication date: 2019-07-31
Anticipated expiration: 2035-06-15
Also published as: JP2017005616A

Description

The present invention relates to a terminal, a video conference system, and a program.

Video conference systems in which two video conference terminals are connected via a network such as the Internet are conventionally known. Such a system transmits and receives video and audio data bidirectionally in real time, enabling real-time communication even between people at remote locations. It is also already known that, by using a plurality of video conference terminals, a plurality of users at multiple sites can participate in the same conference at the same time.

In a meeting where people gather in person, attendees can recognize, without conscious effort, which attendee the speaker is addressing. In a video conference system, by contrast, attendees do not keep looking at the camera or the like at all times, and it is difficult to identify which speaker is talking to whom.

Therefore, in the video conference system according to Patent Document 1, for example, when the user of the speaker-side terminal 21 is paying attention to the video of the user of the audience-side terminal 22 among the videos displayed by the display means, the terminal 22 receives from the terminal 21 first attention information indicating that attention is being paid to the terminal 22.

On receiving the first attention information, the terminal 22 determines whether the terminal corresponding to the video that its own user is paying attention to matches the terminal 21. If it determines that they match, the terminal 22 generates and transmits second attention information. On receiving the second attention information, the terminal 21 highlights the video corresponding to the terminal 22 among the videos displayed by the display means.

According to Patent Document 1, a user who becomes the speaker can tell whether the party he or she is paying attention to is also paying attention in return. However, in the process of determining the speaker, the users of both terminals must call each other's names, which is cumbersome.

Furthermore, in Patent Document 1, the process of highlighting the other party's video on the own terminal is performed only after both users are paying attention to each other's video and the attention information has been exchanged in both directions. Consequently, if this process is delayed, for example because a transmission delay disrupts the exchange of attention information, the real-time nature of the video conference system is impaired and smooth communication is hindered.

The present invention has been made in view of these circumstances, and its object is to realize smooth communication in a video conference system.

To achieve the above object, the present invention is characterized by comprising: receiving means for receiving, from another terminal connected via a network, imaging information obtained by imaging another user who uses the other terminal, voice information emitted in the vicinity of the other terminal, and attention information indicating that the other user is paying attention to the own terminal; utterance passive determination means for determining whether the other terminal that is the source of the voice information matches the other terminal that is the source of the attention information; and utterance passive notification means for notifying, when the utterance passive determination means determines that the other terminals match, of an utterance directed from the other user who uses the other terminal to the own terminal.

According to the present invention, smooth communication can be realized in a video conference system.

A schematic diagram of the video conference system in the embodiment of the present invention.

A hardware configuration diagram of the terminal in the embodiment of the present invention.

A functional block diagram of the terminal in the embodiment of the present invention.

A flowchart outlining the processing procedure in the embodiment of the present invention.

A flowchart showing the procedure for generating the gaze terminal ID signal in the embodiment of the present invention.

A schematic diagram showing an example of the gaze determination process in the embodiment of the present invention.

A flowchart showing the utterance passive notification procedure in the embodiment of the present invention.

A time chart showing the flow of the utterance passive notification in the embodiment of the present invention.

The terminal and the video conference system according to an embodiment of the present invention are described below with reference to the drawings; however, the invention is not limited to this embodiment as long as its gist is not exceeded. In the drawings, identical or corresponding parts are denoted by the same reference numerals, and redundant description is simplified or omitted as appropriate. The embodiment described below is the best mode of the present invention and does not limit the scope of the claims.

First, an overview of the video conference system 1 of this embodiment is described with reference to FIG. 1. The figure shows a system in which the four terminals constituting the system are placed at four sites A to D, one per site, and connected to one another via a network 14. Although a system of four sites, each with one of four users, is illustrated here, the system may consist of three or fewer sites or of five or more sites.

An example of the schematic configuration of the terminal and its peripheral devices at each site is described using the configuration at site A. Site A is provided with a video conference terminal 4 serving as the terminal; an imaging device 5, for example a gaze tracking device, that tracks the line of sight of user A; a camera 6 that photographs user A; a voice input device 7, such as a microphone, that acquires the voice uttered by user A; a voice output device 8; and a video output device 9 that displays the image of user A and the images of the users received from the terminals at the respective sites. The sites other than site A have the same configuration, so their description is omitted.

The gaze tracking device serving as the imaging device 5 is installed in front of user A and acquires user A's line-of-sight data by measuring user A's eye movement. The camera 6 photographs user A, and predetermined image processing is applied to the captured image to detect the position of the line of sight, its change, its angle, and the like. A glasses-type device may also be used as the gaze tracking device. In the present invention, the imaging device 5 and the camera 6 are collectively referred to as the imaging device.

The video conference terminal 4 is an information processing apparatus, such as a personal computer, that receives the image of each user from each site and displays it on the video output device 9, analyzes the image of user A, transmits it to the video conference terminals at the other sites, and performs the various kinds of information processing according to this embodiment described later. The video conference terminal 4 is described in detail later. Data such as images may be transmitted and received via an intermediary such as a server.

At site A, the voice input device 7 is a microphone or the like that receives the voice uttered by user A and sounds emitted in the vicinity of the video conference terminal 4. The voice output device 8 is a speaker or the like that outputs the voice of each user received from each site. These devices may be built into the video conference terminal 4 or connected to it separately as an external microphone, an external speaker, or the like.

On the video output device 9 at site A, the image of user B at site B is displayed at the upper left of the screen, the image of user C at site C at the upper right, and the image of user D at site D at the lower left, and the image of user A at site A is also displayed on the screen. Needless to say, this display layout is only an example.

In this embodiment, the gaze target of a user of the video conference system is detected. When the user speaks while gazing, the utterance is treated as directed at the gaze target, and the user who is the target of the utterance is notified that the utterance is directed at him or her. The following description uses an example in which user A at site A speaks while gazing at the image of user D at site D shown on the video output device 9.

In this case, the video conference terminal at site D determines that the utterance is directed from user A to user D, and a red frame surrounding the image of user A at site A is displayed at site D. In addition, a sound such as a beep is output at the start and at the end of the utterance to notify user D.

In this embodiment, the images of the sites are displayed on a single screen divided into four. In a one-to-one system, however, a single screen may display an image of the other party that shows, for example, the whole conference room. Also, instead of one user per site, a plurality of users may be present at one site; in that case, it is preferable that line-of-sight data can be acquired for each of those users.

Next, the hardware configuration of the terminal in the video conference system of this embodiment is described with reference to FIG. 2. The configuration at site A is described as an example; the other sites B to D have the same configuration, so their description is omitted.

The video conference terminal 4 includes an input unit 28, a memory 29, a CPU 30, and a network interface 31 (hereinafter "network I/F"). It may also include other hardware, such as an HDD or various external or built-in media drives.

The input unit 28 consists of operation buttons for operations such as turning the power on and off and changing the volume. The memory 29 is a ROM, a RAM, or the like that stores the programs that execute the processes of this embodiment, various control programs, input/output video and audio data, line-of-sight data, and the gaze terminal ID signal, speaker ID signal, utterance passive ID signal, and the like described later.

The CPU 30 controls the operation of the video conference terminal 4 and encodes and decodes video data. The network I/F 31 transfers various data over a communication network.

Next, the functional blocks of the video conference terminal 4 in this embodiment are described with reference to FIG. 3. As in the example of FIG. 1, the description uses a configuration in which four sites A to D are connected by the network 14. The functional blocks of the video conference terminals at sites B to D have the same configuration as those of the video conference terminal 4, so they are shown in simplified form in the figure, and description overlapping with the configuration of site A is omitted.

In the figure, the imaging device 5, the voice input device 7, the voice output device 8, and the video output device 9 are collectively shown as an input/output device group 10. Site B is provided with a video conference terminal 4' and an input/output device group 10', site C with a video conference terminal 4" and an input/output device group 10", and site D with a video conference terminal 4"' and an input/output device group 10"'.

The video conference terminal 4 includes a video input unit 15, a data transmission unit 16, an attention information generation unit 17, a voice acquisition unit 18, a data reception unit 21, a speaker determination unit 22, a speaker identification signal generation unit 23, an utterance passive determination unit 24, and an output unit 25. The output unit 25 includes an utterance passive notification unit 251.

The video input unit 15 is imaging information acquisition means that acquires the video captured by the imaging device 5. The acquired video is separately compressed and encoded by a video compression unit.

The data transmission unit 16 is transmission means that transmits the video data acquired by the video input unit 15 to the video conference terminal at each site via the network 14. The data transmission unit 16 also transmits the gaze terminal ID signal described later, which is the attention information, to the video conference terminal at each site.

The attention information generation unit 17 is attention information generation means that analyzes the imaging information, acquired by the video input unit 15, of the other user who uses the other terminal, and generates attention information indicating that this other user is paying attention to the imaging information of the own user, who uses the own terminal, displayed on the video output device 9 by the output unit 25 serving as display means. The terms "own terminal" and "other terminal", and "own user" and "other user", are used in this embodiment only for convenience and do not imply that either is primary or secondary.

The attention information is, for example, a gaze terminal ID signal that is generated when the line-of-sight data acquired by the gaze tracking device described with reference to FIG. 1 is concentrated on the video of the user at a specific site. The gaze terminal ID signal contains the name of the site being gazed at and the identification signal of the terminal at the site from which the gaze originates. Details are given later. The attention information need not be based on line-of-sight data; for example, the orientation of the user's face may be analyzed from the image of the user, a known image analysis technique may be used to determine whether the face is directed at a specific target, and the attention information may be generated from the result.
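The two fields of the gaze terminal ID signal can be pictured as a small record. The following is a minimal sketch, not the patent's own format: the class name `GazeTerminalIdSignal`, the field names, and the helper `make_gaze_signal` are hypothetical, and it assumes that no signal is produced while the gaze is not concentrated on any site's video.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GazeTerminalIdSignal:
    """Hypothetical container for the two fields named in the description."""
    target_site: str       # name of the site being gazed at
    source_terminal: str   # identification of the terminal at the gazing site


def make_gaze_signal(gazed_site: Optional[str],
                     own_terminal: str) -> Optional[GazeTerminalIdSignal]:
    """Build a signal only while the gaze rests on some site's video.

    gazed_site is None when the line of sight is not concentrated on
    any site's video (an assumption of this sketch).
    """
    if gazed_site is None:
        return None
    return GazeTerminalIdSignal(target_site=gazed_site,
                                source_terminal=own_terminal)
```

For instance, `make_gaze_signal("A", "terminal-D")` would describe user D's terminal reporting a gaze on site A.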

The voice acquisition unit 18 is voice information acquisition means that acquires the voice information input to the voice input device 7, such as a microphone. At the video conference terminal 4 at site A, the voice input device 7 accepts voice information emitted in the vicinity of the video conference terminal 4. That is, in addition to the voice uttered by user A, the voice input device 7 also accepts ambient sounds and the like. When the acquired voice data is compressed and encoded, it is decoded by a voice decompression unit.

The data reception unit 21 is receiving means that receives, from another terminal connected via the network, imaging information obtained by imaging the other user who uses the other terminal, voice information emitted in the vicinity of the other terminal, and attention information indicating that the other user is paying attention to the own terminal. The imaging information of the other users has been compressed and encoded by the video compression unit of the video conference terminal at each site, and it is decoded by a video decompression unit.

The speaker determination unit 22 analyzes the voice information, received by the data reception unit 21, that was emitted in the vicinity of the plurality of other terminals, and determines which of the plurality of other users using those terminals is speaking. The voice information may be analyzed, for example, by judging from the input gain of the acquired voice information, but the analysis is not limited to this and may use any known technique.
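As a rough illustration of judging the speaker from the input level, the sketch below computes the RMS level of one audio frame per terminal and treats a terminal as "speaking" when the level reaches a threshold. The function names, the normalized sample range, and the threshold value of 0.1 are assumptions of this sketch; the description only says that the input gain may be used.

```python
import math
from typing import Dict, List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (0.0 for an empty frame)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speaking_terminals(frames: Dict[str, Sequence[float]],
                       threshold: float = 0.1) -> List[str]:
    """Return the IDs of terminals whose frame level reaches the threshold.

    frames maps a terminal ID to samples normalized to [-1.0, 1.0];
    the threshold is an illustrative value, not taken from the patent.
    """
    return sorted(tid for tid, samples in frames.items()
                  if rms(samples) >= threshold)
```

A practical system would likely smooth the level over several frames so that short noises are not mistaken for speech.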

The speaker identification signal generation unit 23 generates a speaker identification signal that identifies the speaker addressing the own terminal, based on the identification signal of the other terminal used by the other user whom the speaker determination unit 22 has determined to be speaking. The speaker identification signal contains an identification signal that identifies the terminal from which the voice information of that other user was transmitted.

The utterance passive determination unit 24 determines whether the other terminal that is the source of the voice information matches the other terminal that is the source of the attention information.

The utterance passive notification unit 251 is utterance passive notification means that, when the utterance passive determination unit 24 determines that the other terminal that is the source of the voice information matches the other terminal that is the source of the attention information, notifies of an utterance directed from the other user who uses that other terminal to the own terminal.

The utterance passive notification unit 251 also functions as display control means that controls the display on the video output device 9 by the output unit 25 so that the imaging information of the other user who uses the other terminal determined to match by the utterance passive determination unit 24 is changed to a display mode in which the user of the own terminal can recognize that he or she is being spoken to by that other user. Specifically, changing the display mode means making the imaging information stand out from that of the other sites, for example by displaying a red frame around the imaging information shown on the video output device 9, as described with reference to FIG. 1, or by enlarging it relative to the imaging information of the other sites.

When the utterance passive determination unit 24 determines that the other terminal that is the source of the voice information matches the other terminal that is the source of the attention information, it may generate an utterance passive identification signal identifying that the other user who uses the other terminal is speaking to the own user who uses the own terminal. In this case, the utterance passive notification unit 251 may issue the utterance passive notification based on the utterance passive identification signal generated by the utterance passive determination unit 24. As described in detail later, generating the utterance passive signal allows the terminal to recognize easily that it is in the utterance passive state.

When the utterance passive notification unit 251 receives the utterance passive identification signal from the utterance passive determination unit 24, it may control the output unit 25 to output a predetermined sound. The predetermined sound is, for example, the beep mentioned above. This allows the user to recognize that an utterance directed at him or her has started.

Furthermore, the utterance passive notification unit 251 may perform control to output a predetermined sound when reception of the utterance passive identification signal from the utterance passive determination unit 24 ceases. This allows the user to recognize that the utterance directed at him or her has ended.

Furthermore, the utterance passive notification unit 251 may continue the utterance passive notification for a fixed time after reception of the utterance passive identification signal from the utterance passive determination unit 24 ceases. As described in detail later, this avoids a situation in which, for example, a beep sounds the moment the signal ceases and hurries the user to respond; the user can take time to respond, which contributes to smooth communication.
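The hold behavior described above can be sketched as a small timer: the notification stays active while the utterance passive identification signal is being received and for a fixed time after it ceases. The class name `PassiveNotification`, the `update` method, and the 2-second hold time are hypothetical; the description only states that the notification continues for "a fixed time".

```python
from typing import Optional


class PassiveNotification:
    """Keeps the notification active for hold_seconds after the signal stops."""

    def __init__(self, hold_seconds: float = 2.0) -> None:
        self.hold_seconds = hold_seconds          # illustrative value
        self._last_signal_time: Optional[float] = None

    def update(self, signal_received: bool, now: float) -> bool:
        """Return True while the notification should be shown.

        now is a monotonically increasing timestamp in seconds.
        """
        if signal_received:
            self._last_signal_time = now
            return True
        if self._last_signal_time is None:
            return False                          # never signalled so far
        return (now - self._last_signal_time) <= self.hold_seconds
```

With a 2-second hold, a signal last seen at t = 1.0 keeps the notification active at t = 2.5 but not at t = 3.5.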

The utterance passive notification unit 251 may also perform control to raise the volume of a predetermined sound that is already being output when the utterance passive identification signal is received, and to lower the raised volume when reception of the utterance passive identification signal ceases. This allows the user to recognize the start and end of the utterance passive state from the change in volume.

Furthermore, the utterance passive notification unit 251 may perform control to vibrate the video conference terminal when the utterance passive identification signal is received and to vibrate it again when reception of the utterance passive identification signal ceases. This allows the user to recognize the start and end of the utterance passive state from the vibration.

Even when the data reception unit 21 stops receiving the attention information, the utterance passive determination unit 24 may continue generating the utterance passive signal until the data reception unit 21 stops receiving the voice information. As described in detail later, even if the gaze terminal ID signal, which is the attention information, is no longer received, for example because the other user has stopped gazing at the displayed video, the utterance passive state is regarded as continuing for as long as the other user keeps speaking, so smooth communication is not hindered.
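This continuation rule behaves like a latch: the state turns on when attention information and voice information arrive from the same terminal, stays on while voice continues even if attention information ceases, and turns off when voice ceases. The sketch below is one possible reading under that assumption; `PassiveSignalLatch` and `step` are hypothetical names, and both inputs are taken to refer to the same other terminal.

```python
class PassiveSignalLatch:
    """Tracks whether the utterance passive signal should be generated."""

    def __init__(self) -> None:
        self.active = False

    def step(self, attention_received: bool, audio_received: bool) -> bool:
        """Advance one update cycle and return the current state."""
        if not audio_received:
            # The other user stopped speaking: the state ends.
            self.active = False
        elif attention_received:
            # Speaking while gazing at the own user: the state starts.
            self.active = True
        # Speaking without attention information: keep the previous state.
        return self.active
```

Note that voice alone never starts the state in this sketch; it only sustains a state that was started while attention information was present.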

Suppose, for example, that user D at site D is speaking while gazing at user A shown on the display serving as the video output device. The utterance passive determination unit 24 of the video conference terminal 4 compares the gaze terminal ID signal, which is the attention information received from each site, with the speaker ID signal. When the comparison shows that the site names they indicate match, the utterance passive determination unit 24 determines that user A is in the utterance passive state and generates an utterance passive ID signal. In this example, the gaze terminal ID signal and the speaker ID signal both indicate site D, so user A is in the utterance passive state with respect to user D, that is, user A is being spoken to by user D.
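The comparison in this example can be sketched as a lookup: among the received gaze signals, find one that targets the own site and whose source site is also a current speaker. The function name and the `(source_site, target_site)` tuple layout are assumptions made for illustration, not the signal format of the embodiment.

```python
from typing import Iterable, Optional, Tuple


def passive_utterance_source(own_site: str,
                             gaze_signals: Iterable[Tuple[str, str]],
                             speaker_sites: Iterable[str]) -> Optional[str]:
    """Return the site that is both gazing at own_site and speaking.

    gaze_signals holds (source_site, target_site) pairs; speaker_sites
    holds the sites judged to be speaking. Returns None when no site
    satisfies both conditions.
    """
    speakers = set(speaker_sites)
    for source, target in gaze_signals:
        if target == own_site and source in speakers:
            return source
    return None
```

In the example above, site A receives a gaze signal from D targeting A while D is the speaker, so the function returns "D"; if B were speaking while gazing at C, it would return None at site A.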

Next, an outline of the processing procedure in this embodiment is described with reference to FIGS. 3 and 4. As in the description using FIG. 1, it is assumed here that user D at site D is gazing at the image of user A at site A.

First, the video conference terminal 4"' at site D acquires an image of user D via the imaging device and acquires the line-of-sight data of user D from the image (step S1).

Next, the attention information generation unit 17 of the video conference terminal 4"' generates a gaze terminal ID, which is the attention information (step S2). The gaze terminal ID generation process performed by the attention information generation unit 17 is described in detail later.

The video conference terminal 4"' at site D transmits the image of user D and the gaze terminal ID to the video conference terminal 4 at site A (step S3).

The speaker determination unit 22 at site A determines the speaker, and the speaker identification signal generation unit 23 generates a speaker ID signal, which is the speaker identification signal (step S4). Generation of the speaker ID signal is described in detail later.

The utterance passive determination unit 24 then executes the utterance passive determination process (step S5). This process is described in detail later.

The utterance passive notification unit 251 performs display control on the captured image of the user who is paying attention to the local user (step S6).

The gaze terminal ID generation process shown in FIG. 4 will now be described in detail with reference to FIG. 5. The gaze determination procedure at site A is described as an example, assuming that user A at site A is speaking to user D at site D.

First, video conference terminal 4 acquires line-of-sight data of user A from the imaging device 5 (step S11). The line-of-sight data is acquired, for example, as coordinate data expressed as (x, y).

Because line-of-sight data normally changes over time, video conference terminal 4 updates it at a predetermined update frequency (step S12). When the frame rate of the video data is 30 fps, for example, an update interval of 33 msec is preferable. The imaging device 5 obtains the amount of change between captured images of the eyeball as an angle.

Video conference terminal 4 stores the acquired line-of-sight data sequentially in the memory 29 and refers to, for example, the 10 most recent samples (step S13). It then performs fixation determination based on those 10 samples (step S14). The fixation determination preferably uses determination conditions that reflect the characteristics of human eye movement, such as those disclosed in Non-Patent Document 1.
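The update interval and the "10 most recent samples" window of steps S12 and S13 can be sketched as follows. This is a minimal illustration under stated assumptions: `GazeBuffer` and `update_interval_ms` are hypothetical names, and a bounded deque stands in for the memory 29.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

# A sample is an (x, y) pair; None marks a sample that could not be acquired.
Sample = Optional[Tuple[float, float]]


def update_interval_ms(frame_rate_fps: float) -> float:
    """Interval between updates when one gaze sample is taken per video frame."""
    return 1000.0 / frame_rate_fps


class GazeBuffer:
    """Keeps only the most recent `size` line-of-sight samples (hypothetical helper)."""

    def __init__(self, size: int = 10) -> None:
        self._samples: Deque[Sample] = deque(maxlen=size)

    def push(self, sample: Sample) -> None:
        # The oldest sample is discarded automatically once the buffer is full.
        self._samples.append(sample)

    def window(self) -> List[Sample]:
        return list(self._samples)

    def is_full(self) -> bool:
        return len(self._samples) == self._samples.maxlen
```

At 30 fps, `update_interval_ms(30)` is about 33.3, which matches the 33 msec interval mentioned in the text.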

That is, in the present embodiment, the fixation determination conditions are:
1: "Consecutive line-of-sight samples are not separated by 2.1 degrees or more in visual angle."
2: "Among the 10 most recent samples, the maximum distance between any two samples is not 2.1 degrees or more in visual angle."
These conditions are only an example; other known determination conditions may be adopted instead.
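The two conditions can be sketched as below. This is only an illustration: it assumes each sample is already expressed as an (x, y) pair in degrees of visual angle and that the Euclidean distance between two such pairs is an acceptable approximation of their angular separation. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

THRESHOLD_DEG = 2.1  # threshold stated in conditions 1 and 2

Point = Tuple[float, float]


def angular_distance(a: Point, b: Point) -> float:
    """Approximate separation in degrees between two samples (assumed units: degrees)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def is_fixation(samples: Sequence[Point], threshold: float = THRESHOLD_DEG) -> bool:
    """Apply condition 1 (consecutive samples) and condition 2 (any two samples)."""
    if len(samples) < 2:
        return False
    # Condition 1: no consecutive pair is separated by the threshold or more.
    for prev, cur in zip(samples, samples[1:]):
        if angular_distance(prev, cur) >= threshold:
            return False
    # Condition 2: no pair anywhere in the window is separated by the threshold or more.
    return all(angular_distance(a, b) < threshold
               for a, b in combinations(samples, 2))
```

Within a single window, condition 2 already implies condition 1, since consecutive pairs are a subset of all pairs. The sketch keeps them separate only to mirror the text.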

A human blink is generally said to last 100 msec to 150 msec, which corresponds to 3 to 5 line-of-sight samples. During a blink, line-of-sight data based on the user's eyeball cannot be acquired, so the data is missing. In the present embodiment, therefore, up to 5 consecutive missing samples among the 10 most recent samples are excluded from the fixation determination, and the determination is made with the remaining samples.
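The blink handling can be sketched as follows. Missing samples are represented as `None`, which is an assumption of this example. The text does not say what happens when more than 5 consecutive samples are missing, so the sketch simply treats such a window as unusable. That choice is an assumption, not part of the embodiment.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Optional[Tuple[float, float]]  # None marks a missing sample


def samples_spanned(duration_ms: float, fps: float = 30.0) -> int:
    """Number of samples covered by an event of the given duration, rounded up."""
    return math.ceil(duration_ms * fps / 1000.0)


def drop_blink_gaps(window: Sequence[Sample],
                    max_gap: int = 5) -> Optional[List[Tuple[float, float]]]:
    """Remove missing samples if every run of them is at most max_gap long.

    Returns the remaining samples, or None when a run of missing samples is
    longer than max_gap (assumed here to make the window unusable).
    """
    longest = run = 0
    for sample in window:
        run = run + 1 if sample is None else 0
        longest = max(longest, run)
    if longest > max_gap:
        return None
    return [s for s in window if s is not None]
```

At 30 fps, `samples_spanned(100)` is 3 and `samples_spanned(150)` is 5, which is where the "3 to 5 samples" figure comes from.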

When the line-of-sight data is determined to be fixated in step S14 (step S14, YES), video conference terminal 4 checks which site's video on the screen the line of sight is fixated on (step S15). Video conference terminal 4 is assumed to know in which region of the screen the video from each site is displayed. It determines the fixated site video according to whether the coordinates of the 10 samples judged to be fixated fall within the rectangular region in which each site video is displayed.

When line-of-sight data lies on the boundary between site videos, the site whose video contains the larger number of the 10 most recent samples is taken as the fixated site. If the counts are equal, neither site is taken as the fixated site, and the fixation determination processing so far is reset.
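The lookup in step S15, including the majority rule and the tie case, might look like the following sketch. It assumes that the samples and the rectangles share one screen coordinate system and that each rectangle is given as (left, top, right, bottom). The names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def site_at(point: Point, regions: Dict[str, Rect]) -> Optional[str]:
    """Return the site whose video rectangle contains the point, if any."""
    x, y = point
    for site, (left, top, right, bottom) in regions.items():
        if left <= x < right and top <= y < bottom:
            return site
    return None


def fixated_site(samples: Sequence[Point], regions: Dict[str, Rect]) -> Optional[str]:
    """Pick the site containing the most samples; return None on a tie or no hit."""
    hits = (site_at(p, regions) for p in samples)
    counts = Counter(site for site in hits if site is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # equal counts: neither site is taken, caller resets
    return ranked[0][0]
```

A return value of `None` corresponds to the case in which the fixation determination processing is reset.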

If, on the other hand, the line-of-sight data is determined not to be fixated in step S14 (step S14, NO), the fixation determination processing is reset (step S113).

Next, video conference terminal 4 determines whether the fixated site identified in step S15 is the same as the previous fixated site (step S16). If it differs from the previous one (step S16, NO), video conference terminal 4 resets the fixation determination (step S113).

If the fixated site is the same as in the previous determination, that is, the gaze determination immediately before (step S16, YES), video conference terminal 4 determines whether the gaze terminal ID signal is currently being transmitted (step S17).

If the gaze terminal ID signal is not being transmitted (step S17, NO), video conference terminal 4 increments the fixation count by 1 (step S18). If the gaze terminal ID signal is being transmitted (step S17, YES), video conference terminal 4 skips the gaze determination and continues transmitting the gaze terminal ID signal to the video conference terminal at the target site (step S112).

After step S18, video conference terminal 4 determines whether fixation has occurred three times in a row (step S19). In the present embodiment, three consecutive fixations are judged to be a "gaze". If the count is less than three, the process returns to step S12. With this determination, slowly and smoothly following a target with the eyes can also be judged to be a gaze, even if the line-of-sight data does not remain at one fixed position for 30 samples.

Video conference terminal 4 repeats the processing of step S18, and when it determines that fixation has occurred three times in a row (step S19, YES), it judges this to be a "gaze" and sets the fixated site as the gaze site (step S110).

Video conference terminal 4 generates a gaze terminal ID signal containing its own site ID and the ID of the video conference terminal at the gaze site (step S111). It then transmits the gaze terminal ID signal to the video conference terminal being gazed at (step S112).

If, while a fixation or gaze is in progress, the line of sight changes greatly and the fixation determination in step S14 finds that the fixation or gaze has ended (step S14, NO), those states are reset (step S113).

After step S113, video conference terminal 4 determines whether it is currently transmitting the gaze terminal ID signal (step S114). If it is (step S114, YES), it stops transmitting the gaze terminal ID signal (step S115). Until then, the gaze terminal ID signal of step S112 is assumed to be transmitted continuously.
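Steps S14 to S19 and S110 to S115 together behave like a small state machine, sketched below. The class and field names are hypothetical. The sketch also assumes that, after a reset caused by a change of site, the new fixation is counted as the first one. The text only says that the determination is reset, so this detail is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

GAZE_COUNT = 3  # consecutive fixations required for a "gaze" (step S19)


@dataclass(frozen=True)
class GazeTerminalIdSignal:
    source_site: str  # the terminal's own site ID
    gazed_site: str   # ID of the site being gazed at


class GazeTracker:
    """Tracks consecutive fixations and decides when to transmit (hypothetical)."""

    def __init__(self, own_site: str) -> None:
        self.own_site = own_site
        self.last_site: Optional[str] = None
        self.count = 0
        self.active: Optional[GazeTerminalIdSignal] = None  # signal being sent

    def _reset(self) -> None:
        # Steps S113 to S115: clear the state and stop any transmission.
        self.last_site = None
        self.count = 0
        self.active = None

    def update(self, fixated_site: Optional[str]) -> Optional[GazeTerminalIdSignal]:
        """Feed one fixation result; return the signal to transmit, or None."""
        if fixated_site is None:                 # step S14, NO
            self._reset()
            return None
        if fixated_site != self.last_site:       # step S16, NO (or first fixation)
            self._reset()
            self.last_site = fixated_site
            self.count = 1                       # assumption: counts as the first
            return None
        if self.active is not None:              # step S17, YES: keep sending
            return self.active
        self.count += 1                          # step S18
        if self.count >= GAZE_COUNT:             # step S19, YES: S110 and S111
            self.active = GazeTerminalIdSignal(self.own_site, fixated_site)
        return self.active                       # step S112, or None
```

Feeding three consecutive fixations on site D into a tracker created as `GazeTracker("A")` yields a signal carrying the IDs of site A and site D. A later result of `None` stops the transmission.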

Next, an example of the gaze determination process in the present embodiment will be described with reference to FIG. 6. The figure shows how line-of-sight data dwells when user A at site A is paying attention to the captured image of user D at site D displayed on the screen. In the figure, the small numbered circles labeled "E1" and so on represent the positions indicated by the line-of-sight data on the screen of the video output device 9, and the numbers give the order in which the data was acquired.

The change from E1 to E2 is judged not to be a fixation under condition 1 of the fixation determination. E3, E4, E5, and E6 are likewise judged not to be fixations, whereas the change from E7 to E8 satisfies condition 1. The changes through E9 and E10 to E16 also satisfy condition 1, and if these 10 samples satisfy condition 2 as well, the line-of-sight data E7 to E16 can be said to be fixated. Because these 10 samples lie within the site video region of site D, shown as video 12, the fixated site is D.

For line-of-sight data E17 to E26 and E27 to E36, the fixated site is likewise D. Because fixation has now occurred three times in a row, the user is regarded as gazing from E37 to E44, and fixation determination is not performed during that period. At this time, video conference terminal 4 at site A transmits to the video conference terminal at site D a gaze terminal ID signal carrying the ID of site D and the ID of its own site.

Between E44 and E45 the line-of-sight data changes by at least the threshold of the fixation condition, so the gaze is regarded as ending here. Video conference terminal 4 then stops transmitting the gaze terminal ID signal, which lets the video conference terminal at site D know in real time when the gaze occurred and how long it lasted.
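As a rough worked example of the timing in FIG. 6, the three fixations E7 to E16, E17 to E26, and E27 to E36 cover 30 samples. Assuming one sample per frame at 30 fps and non-overlapping 10-sample windows as in that example, the time needed to recognize a gaze can be estimated as follows. The function name is hypothetical.

```python
def gaze_latency_ms(fps: float = 30.0, window: int = 10, fixations: int = 3) -> float:
    """Time covered by the samples needed to recognize a gaze.

    Assumes non-overlapping windows of `window` samples, one sample per frame.
    """
    return window * fixations * 1000.0 / fps
```

Under these assumptions, `gaze_latency_ms()` is 1000.0, so a gaze is recognized after about one second of steady looking.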

Next, the utterance passive notification procedure in the present embodiment will be described with reference to FIG. 7. It is assumed that user A at site A is speaking to user D at site D, and the utterance passive notification procedure at site D is described as an example.

First, the video conference terminal at site D receives video information, audio information, the gaze terminal ID signal, and so on from video conference terminal 4 at site A (step S21).

Next, the video conference terminal at site D analyzes the received audio data to find which site it came from, using the source terminal ID attached to the received data (step S22).

The video conference terminal at site D then performs speaker determination to decide whether there is a speaker (step S23). When it determines that there is a speaker (step S23, YES) and that the speaker is currently speaking, it generates a speaker ID signal containing the ID of the speaker's site (step S24).

While the video conference terminal at site D is receiving a gaze terminal ID signal (step S25, YES), it determines whether the site that sent that signal matches the site whose ID is attached to the speaker ID signal (step S26). If the two match (step S26, YES), the video conference terminal at site D generates an utterance passive ID signal (step S27).

The video conference terminal at site D then starts the utterance passive notification (step S28). Here, it outputs a beep and displays in red the frame of the site video showing site A, the site from which the utterance is being received (step S29). Step S29 represents the utterance passive notification being continued; in this case, the frame of site A's video remains red.

Next, the video conference terminal at site D returns to step S21 and performs the next speaker determination. If the speaker is still speaking (step S22, then step S23, YES), the speaker ID signal is updated (step S24) and continues to be generated.

Even when the line of sight of the user at site A has shifted greatly and, as a result, the gaze site ID signal from site A is no longer being received (step S25, NO), the video conference terminal at site D regards the ongoing utterance as directed at the user at site D. That is, because the video conference terminal at site D is still generating the utterance passive ID signal at step S210 (step S210, YES), it continues the utterance passive notification (step S29). If the utterance passive ID signal is not being generated (step S210, NO), the video conference terminal at site D returns to step S21 and receives the next data (step S21).

The video conference terminal at site D receives the data again and makes the next determination (step S21). If the audio data analysis shows that silence has continued for a certain period (step S22), the speaker determination concludes that there is no speaker (step S23, NO).

In this case, the video conference terminal at site D stops the utterance passive ID signal (step S211) and ends the utterance passive notification (step S212). At this point it also restores the screen display to its original state and outputs a beep, which naturally prompts user D at site D to respond to the utterance from the user at site A.

If the speaker ID and the gaze site ID do not match in step S26 (step S26, NO), the video conference terminal at site D determines that the utterance is not directed at its own user, does nothing, and returns to step S21.
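The receiving-side loop of steps S21 to S212 can be sketched as a state machine that is stepped once per received batch of data. This is a simplified illustration with hypothetical names. It assumes at most one speaking site at a time and represents "no speaker" or "no gaze signal" as `None`.

```python
from typing import Optional


class PassiveNotifier:
    """Decides whether the local user is currently being spoken to (hypothetical)."""

    def __init__(self) -> None:
        # Site whose utterance is judged to be directed at the local user.
        self.passive_from: Optional[str] = None

    def step(self, speaker_site: Optional[str],
             gaze_source_site: Optional[str]) -> str:
        """Process one round; returns 'start', 'continue', 'end', or 'idle'."""
        if speaker_site is None:                       # step S23, NO
            if self.passive_from is not None:          # steps S211 and S212
                self.passive_from = None
                return "end"
            return "idle"
        matched = (gaze_source_site is not None        # step S25, YES
                   and gaze_source_site == speaker_site)  # step S26, YES
        if matched and self.passive_from is None:
            self.passive_from = speaker_site           # step S27
            return "start"                             # step S28
        # Either the notification is already running (step S29, or step S210
        # YES after the gaze signal stopped), or nothing needs to be done.
        return "continue" if self.passive_from is not None else "idle"
```

In this sketch, "start" and "end" are the points at which a beep would be output, and "continue" corresponds to keeping the red frame displayed.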

Next, the flow of utterance passive notification in the present embodiment will be described with reference to FIG. 8. Assuming that user A at site A is speaking to user D at site D, the determinations made at each site's video conference terminal and the handling of the ID signals are described in chronological order. Latency due to the processing of each function, transmission delay, and the like may occur, but the transmitted and received audiovisual data and the line-of-sight data are assumed to be kept synchronized.

First, it is determined at the video conference terminal at site D (step S41) that, at video conference terminal 4 at site A, the line-of-sight data of user A is fixated on the site D video (step S31; call this "fixation 1"). The same determination is made for steps S32 and S33.

When the user at site A speaks at the same time, the video conference terminal at site D, which receives the audio data, performs speaker determination to identify which site's user is speaking (step S42). The video conference terminal at site D then generates a speaker ID signal (step S35).

When fixation has occurred three times in a row at site A, it is determined that the user is gazing at the site D video (step S43), and video conference terminal 4 at site A simultaneously transmits a gaze site ID signal to site D (step S37). The video conference terminal at site D receives the gaze site ID signal (step S38). Because the speaker ID signal has already been generated at this point, the video conference terminal at site D performs the utterance passive determination (step S44).

When the sites indicated by the gaze site ID signal and the speaker ID signal match, the video conference terminal at site D generates an utterance passive ID signal (step S39). It also notifies user D that the utterance passive state has started by outputting a beep (step S310).

The video conference terminal at site D then displays a red frame around the video of site A shown on the screen at site D, notifying user D that the utterance passive state is continuing (step S311).

When the line-of-sight data of user A at site A changes greatly because user A looks at another site or off the screen (step S36), video conference terminal 4 at site A stops transmitting the gaze site ID signal (step S37). Reception of the gaze site ID signal at site D accordingly ends as well (step S38). However, because the video conference terminal at site D is still generating the speaker ID signal (step S39), it does not end the utterance passive notification (step S311).

When user A at site A finishes speaking (step S34) and the speaker determination unit at site D determines that silence has continued for a certain period, the video conference terminal at site D stops generating the speaker ID signal (step S35) and also stops generating the utterance passive ID signal (step S39).

The video conference terminal at site D defers stopping the utterance passive notification shown as the red frame around site A's video (step S311) for a preset period (hereinafter the "set time length") (step S313). Once the set time length has elapsed, it stops the utterance passive notification (step S311) and outputs a beep to notify user D that user A has finished speaking (step S312).
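The deferral of step S313 amounts to a simple hold timer, sketched below. The class name, the use of seconds, and passing the current time in explicitly are assumptions made so that the example is easy to test. The text does not specify a value for the set time length.

```python
from typing import Optional


class NotificationHold:
    """Keeps the red frame visible for a set time after the signal stops (hypothetical)."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds          # the "set time length"
        self._stopped_at: Optional[float] = None  # when the passive ID signal stopped

    def signal_stopped(self, now: float) -> None:
        """Record the moment the utterance passive ID signal stopped."""
        if self._stopped_at is None:
            self._stopped_at = now

    def frame_visible(self, now: float) -> bool:
        """True while the red frame should still be displayed."""
        if self._stopped_at is None:
            return True  # signal still being generated
        return (now - self._stopped_at) < self.hold_seconds
```

When `frame_visible` first returns `False`, the terminal would remove the red frame and output the closing beep of step S312.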

As a result, the user at site D can see at a glance which of the users at other sites is speaking to them and can also tell when the utterance ends, so they can reply to the other user's utterance with natural timing.

As described above, the video conference system of the present embodiment continuously tracks the user's line of sight. When it is determined that the user is speaking while gazing at the video of a user at another site displayed on the screen, the receiving terminal changes the display of only the region showing the speaker and outputs a notification sound when the utterance starts and when it ends. This lets the user at the receiving terminal know who is speaking to them and when.

In addition, the present embodiment does not need to measure the line-of-sight data of users at both sites; analyzing only the speaker's line-of-sight data is sufficient, which gives the real-time performance needed for smooth communication.

Furthermore, because the difference between utterance length and gaze length is taken into account, even if the speaker looks away from the addressed user during an utterance, the utterance is still judged to be directed at the gaze target until it ends, and the addressed user continues to be notified that the utterance is directed at them.

Each embodiment described above is a preferred embodiment of the present invention, and various modifications can be made without departing from the gist of the invention. For example, each process in the information processing apparatus and video conference system of the embodiment described above can be executed by hardware, by software, or by a combination of the two.

When the processing is executed by software, a program recording the processing sequence can be installed into memory in a computer built into dedicated hardware and executed. Alternatively, the program can be installed and executed on a general-purpose computer capable of executing various kinds of processing.

DESCRIPTION OF SYMBOLS
1 Video conference system
4 Video conference terminal
5 Imaging device
6 Camera
7 Audio input device
8 Audio output device
9 Video output device
10 Input/output device group
14 Network
15 Video input unit
16 Data transmission unit
17 Attention information generation unit
18 Audio acquisition unit
21 Data reception unit
22 Speaker determination unit
23 Speaker identification signal generation unit
24 Utterance passive determination unit
25 Output unit
251 Utterance passive notification unit

JP 2010-200150 A

Takaki Wakiyama and two others, "Creation of interest model based on attention detection and picture recommendation", Journal of Information Processing Society of Japan, May 2007, Vol. 48, No. 3, pp. 1048-1057

Claims

A terminal comprising:
receiving means for receiving, from another terminal connected via a network, imaging information obtained by imaging another user who uses the other terminal, voice information emitted in the vicinity of the other terminal, and attention information indicating that the other user is paying attention to the own terminal;
utterance passive determination means for determining whether the other terminal that is the transmission source of the voice information matches the other terminal that is the transmission source of the attention information; and
utterance passive notification means for notifying the own terminal of an utterance from the other user who uses the other terminal when the utterance passive determination means determines that the other terminals match.

The terminal according to claim 1, further comprising display means for displaying the imaging information of the other user received by the receiving means,
wherein the utterance passive notification means controls the display means so that the displayed imaging information of the other user who uses the other terminal determined to match by the utterance passive determination means changes to a display mode in which the user of the own terminal can recognize that they are being spoken to by the other user.

The terminal according to claim 2, wherein the utterance passive determination means generates, when it determines that the other terminals match, an utterance passive identification signal for identifying that the other user who uses the other terminal is speaking.

4. The terminal according to claim 3, wherein the utterance passive notification means notifies the utterance passive based on the utterance passive identification signal generated by the utterance passive determination means.

5. The terminal according to claim 3, wherein said utterance passive notification means performs control to output a predetermined voice when said utterance passive identification signal is received from said utterance passive determination means.

The utterance passive notification means performs control to output a predetermined sound when reception of the utterance passive identification signal from the utterance passive determination means stops. The terminal described in.

7. The terminal according to claim 6, wherein the utterance passive notification means continues the utterance passive notification for a certain period of time after the reception of the utterance passive identification signal from the utterance passive determination means stops.

The utterance passive determination means is characterized in that generation of the utterance passive identification signal is continued until reception of voice information is interrupted by the reception means even when reception of the attention information is interrupted by the reception means. The terminal according to any one of claims 3 to 7.

A video conference system in which an own terminal and two or more other terminals are connected via a network, wherein
the other terminal comprises:
imaging information acquisition means for acquiring imaging information of another user who uses the other terminal, imaged by an imaging device;
voice information acquisition means for acquiring voice information emitted in the vicinity of the other terminal;
display means for displaying, side by side on the same screen, at least imaging information of the own user who uses the own terminal, received from the own terminal, and imaging information of users of terminals other than the other terminal, received from those terminals;
attention information generation means for analyzing the imaging information of the other user acquired by the imaging information acquisition means and generating attention information indicating that the other user is paying attention to the imaging information of the own user displayed by the display means; and
transmission means for transmitting, to the own terminal, the imaging information of the other user acquired by the imaging information acquisition means, the voice information acquired by the voice information acquisition means, and the attention information generated by the attention information generation means, and
the own terminal comprises:
receiving means for receiving, from the other terminal, the imaging information obtained by imaging the other user who uses the other terminal, the voice information emitted in the vicinity of the other terminal, and the attention information;
utterance passive determination means for determining whether the other terminal that is the transmission source of the voice information matches the other terminal that is the transmission source of the attention information; and
utterance passive notification means for notifying the own terminal of an utterance from the other user who uses the other terminal when the utterance passive determination means determines that the other terminals match.

The video conference system according to claim 9, wherein the attention information generation means generates the attention information when the number of fixations of the other user's line of sight on the imaging information of the own user displayed by the display means, as analyzed from the imaging information of the other user who uses the other terminal, is equal to or greater than a predetermined threshold.

A program for causing a computer to execute:
a process of receiving, from another terminal connected via a network, imaging information obtained by imaging another user who uses the other terminal, voice information emitted in the vicinity of the other terminal, and attention information indicating that the other user is paying attention to the own terminal, and storing them in a storage unit;
a process of determining whether the other terminal that is the transmission source of the voice information matches the other terminal that is the transmission source of the attention information; and
a process of notifying the own terminal of an utterance from the other user who uses the other terminal when it is determined that the other terminals match.

A computer readable program to be executed by a video conference system in which an own terminal and two or more other terminals are connected via a network,
The program causes the other terminal to execute:
A process of acquiring imaging information of the other user who uses the other terminal, imaged by the imaging device, and storing the acquired imaging information in the storage unit of the other terminal;
A process of acquiring voice information emitted in the vicinity of the other terminal and storing it in the storage unit of the other terminal;
A process of displaying, on the display unit of the other terminal, at least the imaging information of the own user using the own terminal received from the own terminal, arranged on the same screen as imaging information received from a terminal other than the other terminal and obtained by imaging the user of that terminal;
A process of analyzing the imaging information of the other user using the other terminal stored in the storage unit of the other terminal, generating attention information indicating that the other user using the other terminal is paying attention to the imaging information of the own user using the own terminal displayed on the display unit, and storing the attention information in the storage unit of the other terminal; and
A process of transmitting, to the own terminal, the imaging information of the other user who uses the other terminal stored in the storage unit of the other terminal, the voice information emitted in the vicinity of the other terminal, and the attention information;
The program causes the own terminal to execute:
A process of receiving, from the other terminal, the imaging information obtained by imaging the other user using the other terminal, the voice information emitted in the vicinity of the other terminal, and the attention information, and storing the received information in a storage unit of the own terminal;
A process of determining whether the other terminal that is the transmission source of the voice information matches the other terminal that is the transmission source of the attention information; and
A process of notifying the own terminal of an utterance from the other user who uses the other terminal when it is determined that the other terminals match.
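
The division of processing between the other terminal and the own terminal in this claim may be sketched, without limitation, as follows. This is an illustrative sketch only: the `Packet` structure, the class and method names, the representation of attention information as the identifier of the terminal being looked at, and the notification wording are all assumptions that do not appear in the specification. Gaze analysis and screen display are omitted, and the result of the analysis is passed in directly.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Packet:
    """What the other terminal transmits to the own terminal (hypothetical)."""
    source_id: str
    image: bytes
    voice: Optional[bytes]            # None when nobody spoke near the sender
    attention_target: Optional[str]   # assumed: id of the terminal being looked at


@dataclass
class OtherTerminal:
    terminal_id: str
    storage: Dict[str, Any] = field(default_factory=dict)

    def acquire(self, image: bytes, voice: Optional[bytes]) -> None:
        # Store the captured image and voice in this terminal's storage unit.
        self.storage["image"] = image
        self.storage["voice"] = voice

    def generate_attention(self, gazed_terminal_id: Optional[str]) -> None:
        # The gaze analysis result is supplied by the caller in this sketch.
        self.storage["attention_target"] = gazed_terminal_id

    def build_packet(self) -> Packet:
        # Read back everything from storage and package it for transmission.
        return Packet(
            self.terminal_id,
            self.storage["image"],
            self.storage.get("voice"),
            self.storage.get("attention_target"),
        )


@dataclass
class OwnTerminal:
    terminal_id: str
    storage: List[Packet] = field(default_factory=list)

    def receive(self, packet: Packet) -> Optional[str]:
        # Store the received information, then compare the two transmission sources.
        self.storage.append(packet)
        voice_source = packet.source_id if packet.voice is not None else None
        attention_source = (
            packet.source_id if packet.attention_target == self.terminal_id else None
        )
        if voice_source is not None and voice_source == attention_source:
            return f"Terminal {packet.source_id} is speaking to you"
        return None
```

A usage example under the same assumptions: a terminal `B` that calls `acquire(b"img", b"hello")` and `generate_attention("own")` yields a packet for which `OwnTerminal("own").receive(...)` returns a notification, whereas a packet whose attention target is a different terminal returns `None`.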