JP7659322B2

JP7659322B2 - Information Extraction Device

Info

Publication number: JP7659322B2
Application number: JP2022129730A
Authority: JP
Inventors: 渉三神谷
Original assignee: Imbesideyou Inc
Current assignee: Imbesideyou Inc
Priority date: 2020-10-27
Filing date: 2022-08-16
Publication date: 2025-04-09
Anticipated expiration: 2040-10-27
Also published as: JPWO2022091230A1; JP2022166195A; WO2022091230A1

Description

本開示は、情報抽出装置に関する。 This disclosure relates to an information extraction device.

近年、各種コンテンツの配信を受ける配信サービスが普及しつつある。 In recent years, distribution services that deliver various types of content have become increasingly popular.

特許文献１には、ユーザが配信を希望する音楽コンテンツの曲名等がわからないときで
も、捜索対象である楽曲の鼻歌を入力することで、所望の音楽コンテンツを検出する処理
を可能にした技術が開示されている。 Patent Document 1 discloses a technology that enables a process to detect desired music content by inputting a humming of the song to be searched for, even when the user does not know the title of the music content that he or she wishes to have delivered.

特開２００２－５５９９４号公報 JP 2002-55994 A

ところで、特許文献１に記載の技術は、配信されるコンテンツが音楽コンテンツに限ら
れるため、それ以外のあらゆる動画コンテンツに対して捜索対象を検出する処理を行うに
は、コンピュータによる膨大な演算処理が必要となる。 However, the technology described in Patent Document 1 limits the content to be distributed to music content, and therefore requires a huge amount of computational processing by a computer to perform processing to detect search targets in all other video content.

そこで、本開示は、このような状況に鑑みてなされたものであり、演算処理に伴う負荷
を軽減し得る情報抽出装置を提供することを一つの目的とする。 Therefore, the present disclosure has been made in consideration of such circumstances, and has an object to provide an information extraction device that can reduce the load associated with calculation processing.

上記課題を解決するための本発明の主たる発明は、複数のフレームから構成される動画
像から、外部から指示される所定の特定条件に従って特定のフレーム群を抽出する抽出部
を備えることを特徴とする。 The main invention of the present invention for solving the above problem is characterized by including an extraction unit that extracts a specific group of frames from a moving image made up of a plurality of frames in accordance with predetermined specific conditions instructed from outside.

本開示によれば、演算処理に伴う負荷を軽減し得る。 This disclosure can reduce the load associated with computational processing.

本開示の第１の実施形態に係る在宅個別指導システム１の構成例を示す概念図である。 A conceptual diagram showing a configuration example of the home individual instruction system 1 according to a first embodiment of the present disclosure.
本開示の第１の実施形態に係る教室映像配信装置１０を実現するコンピュータのハードウェア構成例を示す図である。 A diagram showing an example of the hardware configuration of a computer that realizes the classroom video distribution device 10 according to the first embodiment of the present disclosure.
本開示の第１の実施形態に係る受講生端末２０を実現するコンピュータのハードウェア構成例を示す図である。 A diagram showing an example of the hardware configuration of a computer that realizes the student terminal 20 according to the first embodiment of the present disclosure.
本開示の第１の実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である。 A diagram showing an example of the software configuration of the classroom video distribution device 10 according to the first embodiment of the present disclosure.
本開示の第１の実施の形態による教室映像配信方法の処理の流れを説明するフローチャートである。 A flowchart illustrating the processing flow of the classroom video distribution method according to the first embodiment of the present disclosure.
本開示の第２の実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である。 A diagram showing an example of the software configuration of the classroom video distribution device 10 according to a second embodiment of the present disclosure.
本開示の第２の実施の形態による教室映像配信方法の処理の流れを説明するフローチャートである。 A flowchart illustrating the processing flow of the classroom video distribution method according to the second embodiment of the present disclosure.
本開示の第３の実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である。 A diagram showing an example of the software configuration of the classroom video distribution device 10 according to a third embodiment of the present disclosure.
本開示の第３の実施の形態による教室映像配信方法の処理の流れを説明するフローチャートである。 A flowchart illustrating the processing flow of the classroom video distribution method according to the third embodiment of the present disclosure.
本開示の第４の実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である。 A diagram showing an example of the software configuration of the classroom video distribution device 10 according to a fourth embodiment of the present disclosure.
本開示の第４の実施の形態による教室映像配信方法の処理の流れを説明するフローチャートである。 A flowchart illustrating the processing flow of the classroom video distribution method according to the fourth embodiment of the present disclosure.

本開示の実施形態の内容を列記して説明する。本開示は、以下のような構成を備える。
［項目１］
複数のフレームから構成される動画像を取得する取得部と、
当該動画像内に含まれる所定のデータを特定するための特定条件を記憶する記憶部と、
当該特定条件に従って、前記動画像から特定のフレーム群を複数抽出する抽出部と、
抽出された前記特定のフレーム群同士を連結する連結部と、
連結された複数のフレーム群を含むダイジェスト情報を出力する出力部と、を備える、
情報抽出装置。
［項目２］
項目１に記載の情報抽出装置であって、
所定の波形データを予め登録する波形登録部、を更に備え、
前記特定条件は、前記動画像内に含まれる音の波形データと前記登録されている波形デ
ータとが一致するか否かであって、
前記抽出部は、両波形データが一致した場合に、当該一致した波形に対応するフレーム
群を前記特定のフレーム群として前記動画像から抽出する、
情報抽出装置。
［項目３］
項目２に記載の情報抽出装置であって、
動画内に含まれる前記音を音声認識によりテキスト情報に変換する変換部を更に備え、
前記変換部は、前記特定のフレーム群とその前後所定フレーム数とを含む補助フレーム
群に対応する前記音を変換する、
情報抽出装置。
［項目４］
項目２又は項目３に記載の情報抽出装置であって、
前記被写体を含む周囲の音が示す情報には、会話情報と非会話情報とが混在する、
情報抽出装置。
［項目５］
項目４に記載の情報抽出装置であって、
前記会話情報には、ポジティブな感情を示すワードと、ネガティブな感情を示すワード
の少なくとも何れかが含まれる、
情報抽出装置。
［項目６］
項目４又は項目５に記載の情報抽出装置であって、
前記非会話情報には、舌打ち、溜め息、相槌の少なくとも何れかを示す情報が含まれる
、
情報抽出装置。
［項目７］
項目１に記載の情報抽出装置であって、
顔の表情に関する所定の顔評価値を予め登録する顔情報登録部、を更に備え、
前記特定条件は、前記動画像内に含まれる顔の表情から算出される顔評価値と前記登録
されている顔評価値とが一致するか否かであって、
前記抽出部は、両顔評価値が一致した場合に、当該一致した顔評価値に対応するフレー
ム群を前記特定のフレーム群として前記動画像から抽出する、
情報抽出装置。
［項目８］
項目７に記載の情報抽出装置であって、
前記顔評価値には、前記人物の幸福感、退屈感又は緊張感の度合いを評価した評価値が
含まれる、
情報抽出装置。
［項目９］
項目７又は項目８に記載の情報抽出装置であって、
前記顔評価値には、前記人物の表情、前記人物の視線の向き、前記人物の顔の向きを評
価した評価値が含まれる、
情報抽出装置
［項目１０］
項目１に記載の情報抽出装置であって、
人物の動作に関する所定の動作評価値を予め登録する動作情報登録部、を更に備え、
前記特定条件は、前記動画像内に含まれる人物から算出される動作評価値と前記登録さ
れている動作評価値とが一致するか否かであって、
前記抽出部は、両動作評価値が一致した場合に、当該一致した動作評価値に対応するフ
レーム群を前記特定のフレーム群として前記動画像から抽出する、
情報抽出装置。
［項目１１］
項目１０に記載の情報抽出装置であって、
前記動作評価値には、前記人物の身振り、手振り、ジェスチャ、ボディランゲージの少
なくとも何れかの動作を評価した評価値が含まれる、
情報抽出装置。

［項目１２］
項目１に記載の情報抽出装置であって、
所定の生体情報に関する生体評価値を予め登録する生体情報登録部と、を備え、
前記特定条件は、前記動画像内に含まれる人物から算出可能な生体評価値と、前記登録
されている生体評価値とが一致するか否かであって、
前記抽出部は、両生体評価値が一致した場合に、当該一致した生体評価値に対応するフ
レーム群を前記特定のフレーム群として前記動画像から抽出する、
情報抽出装置。
［項目１３］
項目１２に記載の情報抽出装置であって、
前記生体評価値には、前記人物の血圧、脈拍、脈圧の少なくとも何れかが含まれる、
情報抽出装置。
［項目１４］
項目１乃至項目１３の何れか一項に記載の情報抽出装置であって、
前記特定のフレーム群に対して、当該特定のフレーム群と時系列的に前後に連続する追
加フレームを追加するフレーム追加部を備えている、
情報抽出装置。
［項目１５］
項目１乃至項目１４の何れかに記載の情報抽出装置によって抽出されたダイジェスト情報
に含まれる少なくとも顔画像又は音声を所定のフレーム単位ごとに識別する識別手段と、
識別した前記顔画像に関する評価値を算出する評価手段とを更に備える、
ビデオミーティング評価端末。
［項目１６］
項目１５に記載のビデオミーティング評価端末であって、
ビデオミーティング評価端末は、前記評価値の時系列によるグラフ情報を提供する、
ビデオミーティング評価端末。
［項目１７］
項目１５又は項目１６に記載のビデオミーティング評価端末であって、
前記ビデオミーティング評価端末は、前記顔画像を複数の異なる観点によって評価した
複数の評価値を算出する、
ビデオミーティング評価端末。
［項目１８］
項目１５乃至項目１７のいずれかに記載のビデオミーティング評価端末であって、
前記ビデオミーティング評価端末は、前記動画像に含まれる音声と共に前記評価値を算
出する、
ビデオミーティング評価端末。
［項目１９］
項目１５乃至項目１８のいずれかに記載のビデオミーティング評価端末であって、
前記ビデオミーティング評価端末は、前記動画像内に含まれる前記顔画像以外の対象物
と共に前記評価値を算出する、
ビデオミーティング評価端末。 The contents of the embodiments of the present disclosure will be described below. The present disclosure has the following configuration.
[Item 1]
An information extraction device comprising:
an acquisition unit that acquires a moving image composed of a plurality of frames;
a storage unit that stores a specific condition for identifying predetermined data included in the moving image;
an extraction unit that extracts a plurality of specific frame groups from the moving image in accordance with the specific condition;
a connecting unit that connects the extracted specific frame groups to one another; and
an output unit that outputs digest information including the plurality of connected frame groups.
[Item 2]
The information extraction device according to item 1, further comprising
a waveform registration unit that registers predetermined waveform data in advance, wherein
the specific condition is whether or not waveform data of a sound included in the moving image matches the registered waveform data, and
when the two sets of waveform data match, the extraction unit extracts, from the moving image, a frame group corresponding to the matching waveform as the specific frame group.
[Item 3]
The information extraction device according to item 2, further comprising
a conversion unit that converts the sound included in the moving image into text information by speech recognition, wherein
the conversion unit converts the sound corresponding to an auxiliary frame group that includes the specific frame group and a predetermined number of frames before and after it.
[Item 4]
The information extraction device according to item 2 or 3, wherein
the information indicated by surrounding sounds, including those of the subject, contains a mixture of conversational information and non-conversational information.
[Item 5]
The information extraction device according to item 4, wherein
the conversational information includes at least one of words indicating positive emotions and words indicating negative emotions.
[Item 6]
The information extraction device according to item 4 or 5, wherein
the non-conversational information includes information indicating at least one of clicking of the tongue, sighing, and interjections.
[Item 7]
The information extraction device according to item 1, further comprising
a face information registration unit that registers in advance a predetermined face evaluation value related to a facial expression, wherein
the specific condition is whether or not a face evaluation value calculated from a facial expression included in the moving image matches the registered face evaluation value, and
when the two face evaluation values match, the extraction unit extracts, from the moving image, a frame group corresponding to the matching face evaluation value as the specific frame group.
[Item 8]
The information extraction device according to item 7, wherein
the face evaluation value includes an evaluation value that evaluates the degree of happiness, boredom, or tension of the person.
[Item 9]
The information extraction device according to item 7 or 8, wherein
the face evaluation value includes evaluation values that evaluate the facial expression of the person, the direction of the person's gaze, and the direction of the person's face.
[Item 10]
The information extraction device according to item 1, further comprising
a motion information registration unit that registers in advance a predetermined motion evaluation value related to a motion of a person, wherein
the specific condition is whether or not a motion evaluation value calculated from a person included in the moving image matches the registered motion evaluation value, and
when the two motion evaluation values match, the extraction unit extracts, from the moving image, a frame group corresponding to the matching motion evaluation value as the specific frame group.
[Item 11]
The information extraction device according to item 10, wherein
the motion evaluation value includes an evaluation value that evaluates at least one of the person's body movements, hand movements, gestures, and body language.

[Item 12]
The information extraction device according to item 1, further comprising
a biometric information registration unit that registers in advance a biometric evaluation value related to predetermined biometric information, wherein
the specific condition is whether or not a biometric evaluation value that can be calculated from a person included in the moving image matches the registered biometric evaluation value, and
when the two biometric evaluation values match, the extraction unit extracts, from the moving image, a frame group corresponding to the matching biometric evaluation value as the specific frame group.
[Item 13]
The information extraction device according to item 12, wherein
the biometric evaluation value includes at least one of the person's blood pressure, pulse rate, and pulse pressure.
[Item 14]
The information extraction device according to any one of items 1 to 13, further comprising
a frame adding unit that adds, to the specific frame group, additional frames that are continuous with the specific frame group before and after it in time series.
[Item 15]
A video meeting evaluation terminal comprising:
identification means for identifying, for each predetermined frame unit, at least a face image or a voice included in the digest information extracted by the information extraction device according to any one of items 1 to 14; and
evaluation means for calculating an evaluation value for the identified face image.
[Item 16]
The video meeting evaluation terminal according to item 15, wherein
the video meeting evaluation terminal provides graph information showing the evaluation value in time series.
[Item 17]
The video meeting evaluation terminal according to item 15 or 16, wherein
the video meeting evaluation terminal calculates a plurality of evaluation values by evaluating the face image from a plurality of different viewpoints.
[Item 18]
The video meeting evaluation terminal according to any one of items 15 to 17, wherein
the video meeting evaluation terminal calculates the evaluation value together with the audio included in the moving image.
[Item 19]
The video meeting evaluation terminal according to any one of items 15 to 18, wherein
the video meeting evaluation terminal calculates the evaluation value together with an object other than the face image included in the moving image.
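
The flow recited in Item 1 (acquire frames, apply a stored specific condition, extract a plurality of frame groups, connect them, and output a digest) can be illustrated with the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the disclosure: frames are modeled as arbitrary values, the specific condition as a per-frame predicate, and a "frame group" as a run of consecutive frames satisfying that predicate. The function names `extract_frame_groups` and `build_digest` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def extract_frame_groups(
    frames: Sequence[Frame], condition: Callable[[Frame], bool]
) -> List[List[Frame]]:
    """Return every run of consecutive frames that satisfies the condition.

    Assumption: the specific condition can be evaluated per frame.
    """
    groups: List[List[Frame]] = []
    current: List[Frame] = []
    for frame in frames:
        if condition(frame):
            current.append(frame)
        elif current:
            # The run just ended, so close the current group.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def build_digest(
    frames: Sequence[Frame], condition: Callable[[Frame], bool]
) -> List[Frame]:
    """Connect the extracted frame groups, in time order, into one digest."""
    return [frame for group in extract_frame_groups(frames, condition) for frame in group]


# Usage: integers stand in for frames; the condition selects values >= 5.
sample_frames = [0, 1, 5, 6, 2, 7, 8, 9, 1]
sample_groups = extract_frame_groups(sample_frames, lambda f: f >= 5)
sample_digest = build_digest(sample_frames, lambda f: f >= 5)
```

Because only the frames that satisfy the condition are passed downstream, any later analysis operates on the digest rather than on the full moving image, which is the load reduction the disclosure aims at.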

以下に添付図面を参照しながら、本開示の好適な実施の形態について詳細に説明する。な
お、本明細書及び図面において、実質的に同一の機能構成を有する構成要素については、
同一の符号を付することにより重複説明を省略する。 Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

本開示では、デジタル通信回線を介して学習塾・教育機関と生徒・受講生宅を結び、各
生徒・受講生は自宅に居ながら学習塾・教育機関で講義されている講義映像を視聴して、
学習塾・教育機関の授業を受けられる在宅個別指導システムに情報抽出装置を適用する例
を説明する。 This disclosure describes an example in which the information extraction device is applied to a home individual instruction system that connects a cram school or educational institution with the homes of its pupils and students via a digital communication line, so that each student can take the institution's lessons by watching, from home, video of the lectures given there.

＜第１の実施形態＞
図１は、本開示の第１の実施形態に係る在宅個別指導システム１の構成例を示す概念図で
ある。図示するように、この在宅個別指導システム１では、遠隔授業を行う講師Ｔの教室
側に設けられた教室映像配信装置１０と、それぞれの在宅で指導を受ける受講生群（受講
生Ａ、Ｂ、Ｃ）に夫々関連する受講生端末２０Ａ、２０Ｂ、２０Ｃと、がネットワークＮ
Ｗを介して通信可能に接続されている。なお以下では、受講生端末２０Ａ、２０Ｂ、２０
Ｃを特に区別して説明する必要がない場合には、単に受講生端末２０と略記する。同様に
、受講生Ａ、Ｂ、Ｃを特に区別して説明する必要がない場合には、単に受講生と略記する
。 <First Embodiment>
FIG. 1 is a conceptual diagram showing a configuration example of a home individual instruction system 1 according to a first embodiment of the present disclosure. As shown in the figure, in this home individual instruction system 1, a classroom video distribution device 10 provided on the classroom side of an instructor T who conducts remote lessons and student terminals 20A, 20B, and 20C, respectively associated with a group of students (students A, B, and C) who each receive instruction at home, are communicably connected via a network NW. In the following description, when there is no need to distinguish between the student terminals 20A, 20B, and 20C, they are simply referred to as the student terminal 20. Similarly, when there is no need to distinguish between students A, B, and C, they are simply referred to as students.

教室映像配信装置１０は、請求の範囲に記載された情報抽出装置の一例となる。なお、本
構成は一例であり、ある構成が他の構成を兼ね備えていたり、他の構成が含まれていたり
してもよい。なお、ここでは受講生Ａ、Ｂ、Ｃの３名の場合を示しているが、講師が同時
に指導できる人数又はネットワークＮＷの接続回線数等に応じて、さらに多人数としても
よい。 The classroom video distribution device 10 is an example of an information extraction device described in the claims. Note that this configuration is just one example, and one configuration may combine with other configurations or include other configurations. Note that although the case of three students A, B, and C is shown here, the number of students may be greater depending on the number of students that the instructor can teach at the same time or the number of connection lines of the network NW, etc.

本実施形態において、「講師」とは、教授、教諭、教師を含む概念である。「教室」とは
、学習塾、カルチャーセンター、教育機関（例えば、初等・中等・高等教育機関、高等学
校、高等専門学校、専門学校、短期大学、四年制大学、大学院など、文部科学省に登録さ
れている学校）を含む概念である。「受講生」とは、生徒、学生、聴講生を含む概念であ
る。 In this embodiment, the term "lecturer" refers to a concept that includes professors, teachers, and instructors. The term "classroom" refers to a concept that includes cram schools, cultural centers, and educational institutions (for example, primary, secondary, and higher education institutions, high schools, technical colleges, vocational schools, junior colleges, four-year universities, graduate schools, and other schools registered with the Ministry of Education, Culture, Sports, Science and Technology). The term "student" refers to a concept that includes pupils, students, and auditors.

本実施形態においてネットワークＮＷはインターネットを想定している。ネットワークＮ
Ｗは、例えば、公衆電話回線網、携帯電話回線網、無線通信網、イーサネット（登録商標
）などにより構築される。 In this embodiment, the network NW is assumed to be the Internet. The network NW is constructed, for example, by a public telephone line network, a mobile phone line network, a wireless communication network, Ethernet (registered trademark), and the like.

＜ハードウェア構成＞
図２は、本実施形態に係る教室映像配信装置１０を実現するコンピュータのハードウェア
構成例を示す図である。コンピュータは、少なくとも、通信部１１と、撮像部１２と、収
音部１３と、モニタ１４と、メモリ１５と、ストレージ１６と、入出力部１７と、制御部
１８等を備える。これらはバス１９を通じて相互に電気的に接続される。 <Hardware Configuration>
FIG. 2 is a diagram showing an example of the hardware configuration of a computer that realizes the classroom video distribution device 10 according to this embodiment. The computer includes at least a communication unit 11, an imaging unit 12, a sound collection unit 13, a monitor 14, a memory 15, a storage 16, an input/output unit 17, and a control unit 18. These are electrically connected to each other via a bus 19.

通信部１１は、教室映像配信装置１０をネットワークＮＷに接続する。通信部１１は、例
えば、有線ＬＡＮ（Local Area Network）、無線ＬＡＮ、Ｗｉ－Ｆｉ（Wireless Fide
lity、登録商標）、赤外線通信、Ｂｌｕｅｔｏｏｔｈ（登録商標）、近距離または非接触
通信等の方式で、外部機器と直接またはネットワークアクセスポイントを介して通信する
。 The communication unit 11 connects the classroom video distribution device 10 to the network NW. The communication unit 11 communicates with external devices, directly or via a network access point, using a method such as wired LAN (Local Area Network), wireless LAN, Wi-Fi (Wireless Fidelity, registered trademark), infrared communication, Bluetooth (registered trademark), or short-range or contactless communication.

撮像部１２は、ＣＭＯＳ又はＣＣＤなどの撮像素子を用いて電子撮影する機能を有する。
撮像部１２は、受講生に対する講義を行う講師Ｔを被写体として撮像して、講師映像を取
得する。撮像部１２は、講師Ｔが講義を進行する際に使用する黒板又はホワイトボードに
記載した画像も撮像できる構成とするとよいが、黒板又はホワイトボードの為に独立した
カメラを設けてもよい。 The imaging unit 12 has a function of electronically capturing images using an imaging element such as a CMOS or a CCD.
The imaging unit 12 captures an image of the lecturer T who is giving a lecture to the students as a subject, and obtains the lecturer video. The imaging unit 12 is preferably configured to capture images of writings on a blackboard or whiteboard used by the lecturer T when proceeding with the lecture, but a separate camera may be provided for the blackboard or whiteboard.

収音部１３は、講師Ｔを含む周囲の音を収音する。収音部１３は、講師Ｔの音声を含む周
囲の音を取得するためのマイクロフォン等を備える。さらに、収音部１３は、取得した音
を電気信号に変換する等の適宜処理を行い得る。 The sound collection unit 13 collects surrounding sounds including the instructor T. The sound collection unit 13 includes a microphone and the like for acquiring surrounding sounds including the voice of the instructor T. Furthermore, the sound collection unit 13 can perform appropriate processing such as converting the acquired sounds into electrical signals.

モニタ１４は、受講生端末２０から送信される受講生映像と、撮像部１２で取得される講
師映像とを一覧可能な状態で表示し得る。もちろん、モニタ１４は、受講生映像のみを単
独で表示してもよく、講師映像のみを単独で表示してもよい。 The monitor 14 can display, in a viewable manner, the student video transmitted from the student terminal 20 and the instructor video acquired by the imaging unit 12. Of course, the monitor 14 may display only the student video alone, or only the instructor video alone.

メモリ１５は、ＤＲＡＭ（ＤｙｎａｍｉｃＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ
）等の揮発性記憶装置で構成される主記憶と、フラッシュメモリ又はＨＤＤ（Ｈａｒｄ
ＤｉｓｃＤｒｉｖｅ）等の不揮発性記憶装置で構成される補助記憶と、を含む。メモリ
１５は、制御部１８のワークエリア等として使用され、また、教室映像配信装置１０の起
動時に実行されるＢＩＯＳ（ＢａｓｉｃＩｎｐｕｔ／ＯｕｔｐｕｔＳｙｓｔｅｍ）、
及び各種設定情報等を格納する。 The memory 15 includes a main memory composed of a volatile storage device such as DRAM (Dynamic Random Access Memory), and an auxiliary memory composed of a non-volatile storage device such as flash memory or an HDD (Hard Disc Drive). The memory 15 is used as a work area for the control unit 18, and also stores the BIOS (Basic Input/Output System) that is executed when the classroom video distribution device 10 starts up, as well as various setting information.

ストレージ１６は、アプリケーション・プログラム等の各種プログラムを格納する。各処
理に用いられるデータを格納したデータベースがストレージ１６に構築されていてもよい
。 The storage 16 stores various programs such as application programs, etc. A database that stores data used in each process may be constructed in the storage 16.

入出力部１７は、例えば、キーボード、マウス、タッチパネル等の情報入力機器である。 The input/output unit 17 is, for example, an information input device such as a keyboard, a mouse, or a touch panel.

制御部１８は、教室映像配信装置１０全体の動作を制御し、各要素間におけるデータの送
受信の制御、及びアプリケーションの実行及び認証処理に必要な情報処理等を行う演算装
置である。例えば制御部１８は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎ
ｉｔ）等のプロセッサであり、ストレージ１６に格納されメモリ１５に展開されたプログ
ラム等を実行して各情報処理を実施する。 The control unit 18 is a computing device that controls the overall operation of the classroom video distribution device 10, controls data transmission and reception between the elements, and performs the information processing necessary for application execution and authentication processing. For example, the control unit 18 is a processor such as a CPU (Central Processing Unit), and carries out each information process by executing programs and the like that are stored in the storage 16 and loaded into the memory 15.

バス１９は、上記各要素に共通に接続され、例えば、アドレス信号、データ信号及び各種
制御信号を伝達する。 A bus 19 is commonly connected to each of the above elements, and transmits, for example, address signals, data signals, and various control signals.

図３は、本実施形態に係る受講生端末２０を実現するコンピュータのハードウェア構成例
を示す図である。コンピュータは、少なくとも、通信部２１と、撮像部２２と、収音部２
３と、モニタ２４と、メモリ２５と、ストレージ２６と、入出力部２７と、制御部２８等
を備える。これらはバス２９を通じて相互に電気的に接続される。本実施形態に係る受講
生端末２０を実現するコンピュータ等のハードウェア構成は、図２に示す教室映像配信装
置１０のハードウェア構成例と同様であるため、相違点のみ説明する。 FIG. 3 is a diagram showing an example of the hardware configuration of a computer that realizes the student terminal 20 according to this embodiment. The computer includes at least a communication unit 21, an imaging unit 22, a sound collection unit 23, a monitor 24, a memory 25, a storage 26, an input/output unit 27, and a control unit 28. These are electrically connected to each other via a bus 29. The hardware configuration of the computer that realizes the student terminal 20 is similar to the hardware configuration example of the classroom video distribution device 10 shown in FIG. 2, so only the differences are described.

通信部２１は、受講生端末２０をネットワークＮＷに接続する。 The communication unit 21 connects the student terminal 20 to the network NW.

撮像部２２は、講義を受講する受講生を被写体として撮像して、受講生映像を取得する。 The imaging unit 22 captures images of students attending a lecture as subjects and obtains student video.

収音部２３は、受講生を含む周囲の音を収音する。音声データを送受するために、受講生
端末２０においては、マイク付きヘッドフォンを設けてもよいが、当該端末に内蔵された
マイク並びにスピーカを用いてもよい。 The sound collection unit 23 collects surrounding sounds including those of the students. To transmit and receive audio data, the student terminal 20 may be provided with headphones with a microphone, or may use the microphone and speaker built into the terminal.

モニタ２４は、教室映像配信装置１０から送信される講師映像と、撮像部２２で取得され
る受講生映像とを一覧可能な状態で表示し得る。もちろん、モニタ２４は、講師映像のみ
を単独で表示してもよく、受講生映像のみを単独で表示してもよい。 The monitor 24 can display, in a viewable manner, the lecturer video transmitted from the classroom video distribution device 10 and the student video captured by the imaging unit 22. Of course, the monitor 24 may display only the lecturer video alone, or only the student video alone.

制御部２８は、受講生端末２０全体の動作を制御し、各要素間におけるデータの送受信
の制御、及びアプリケーションの実行及び認証処理に必要な情報処理等を行う演算装置で
ある。 The control unit 28 is a calculation device that controls the overall operation of the student terminal 20, controls the transmission and reception of data between each element, and performs information processing required for application execution and authentication processing.

＜ソフトウェア構成＞
図４は、本実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である。
教室映像配信装置１０は、抽出部１０１と、波形登録部１０２と、変換部１０３と、表示
部１０４と、フレーム切り出し部１０５と、生成部１０６と、を備える。 <Software configuration>
FIG. 4 is a diagram showing an example of the software configuration of the classroom video distribution device 10 according to this embodiment.
The classroom video distribution device 10 comprises an extraction unit 101, a waveform registration unit 102, a conversion unit 103, a display unit 104, a frame cut-out unit 105, and a generation unit 106.

抽出部１０１と、波形登録部１０２と、変換部１０３と、表示部１０４と、フレーム切り
出し部１０５と、生成部１０６とは、制御部１８がストレージ１６に記憶されているプロ
グラムをメモリ１５に読み出して実行することにより実現され得る。 The extraction unit 101, the waveform registration unit 102, the conversion unit 103, the display unit 104, the frame cut-out unit 105, and the generation unit 106 can be realized by the control unit 18 reading out a program stored in the storage 16 into the memory 15 and executing it.

抽出部１０１は、撮像部１２で取得される講師映像と、撮像部２２で取得される受講生
映像とを適宜取捨選択して合成すると共に、収音部１３又は収音部２３で収音された音を
組み合わせて教室映像を生成する。ここでの講師映像又は受講生映像は、請求の範囲に記
載された複数のフレームの一例となる。また、教室映像は、請求の範囲に記載された動画
像の一例となる。教室映像は、テキストデータ、数値データ、図形データ、画像データ、
動画データ、音声データ等、又はこれらの組み合わせであり、記憶、編集及び検索等の対
象となり、システム又は利用者間で個別の単位として交換できるものをいい、これらに類
似するものを含む。 The extraction unit 101 appropriately selects and synthesizes the lecturer video captured by the imaging unit 12 and the student video captured by the imaging unit 22, and generates a classroom video by combining the sounds collected by the sound collection unit 13 or the sound collection unit 23. The lecturer video or student video here is an example of the plurality of frames recited in the claims. The classroom video is an example of the moving image recited in the claims. The classroom video is text data, numerical data, graphic data, image data, moving image data, audio data, or the like, or a combination of these; it is something that can be stored, edited, searched, and so on, and exchanged as an individual unit between systems or users, and it includes anything similar to these.

抽出部１０１は、かかる教室映像から、外部から指示される所定の特定条件に従って特定
のフレーム群を抽出する機能を有する。例えば、外部からの指示は、合成する教室映像の
項目、画像の配置・画像の占有面積等を指示するものであり得る。外部からの指示は、例
えば、講師Ｔ自身が講義の途中で映像構成を編集可能な簡便な操作であることが好ましい
。 The extraction unit 101 has a function of extracting a specific frame group from the classroom video in accordance with a specific condition instructed from outside. For example, the external instruction may instruct the items of the classroom video to be synthesized, the image arrangement, the area occupied by the image, etc. It is preferable that the external instruction is, for example, a simple operation that allows the lecturer T himself to edit the video composition in the middle of a lecture.

波形登録部１０２は、所定の波形データを予め登録する機能を有する。所定の特定条件
とは、例えば、収音部１３又は収音部２３により収音された音の波形データと、波形登録
部１０２に登録されている波形データとが一致するか否かであってよく、もちろん、他の
条件であってもよい。本実施形態において波形登録部１０２に登録されている波形データ
は、講義におけるその場全体の雰囲気を評価するために用いられ得る。 The waveform registration unit 102 has a function of registering predetermined waveform data in advance. The predetermined specific condition may be, for example, whether or not the waveform data of the sound collected by the sound collection unit 13 or the sound collection unit 23 matches the waveform data registered in the waveform registration unit 102, and may of course be other conditions. In this embodiment, the waveform data registered in the waveform registration unit 102 can be used to evaluate the overall atmosphere of the lecture.

抽出部１０１は、所定の特定条件が満たされた場合、例えば、両波形データが一致した
場合に、当該一致した波形に対応するフレーム群を特定のフレーム群として前記教室映像
から抽出する。 When a predetermined specific condition is met, for example when both waveform data match, the extraction unit 101 extracts a group of frames corresponding to the matching waveforms as a specific group of frames from the classroom video.
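
As a rough illustration of the waveform condition described above, the following Python sketch slides a registered waveform over the collected audio and converts each match position into a range of video frame indices. The disclosure only states that the two waveforms "match"; the use of normalized cross-correlation with a threshold, the names `find_matches` and `samples_to_frames`, and the parameters `threshold`, `sample_rate`, and `fps` are all assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length sample windows (0.0 if either is silent)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_matches(
    audio: Sequence[float], template: Sequence[float], threshold: float = 0.9
) -> List[int]:
    """Return start sample indices where the audio matches the registered waveform.

    Assumption: "match" means the normalized correlation reaches the threshold.
    """
    n = len(template)
    matches: List[int] = []
    i = 0
    while i + n <= len(audio):
        if normalized_correlation(audio[i:i + n], template) >= threshold:
            matches.append(i)
            i += n  # Skip past this match so it is reported only once.
        else:
            i += 1
    return matches


def samples_to_frames(
    start_sample: int, length: int, sample_rate: int, fps: int
) -> Tuple[int, int]:
    """Map a matched sample span to inclusive (first, last) video frame indices."""
    first = start_sample * fps // sample_rate
    last = (start_sample + length - 1) * fps // sample_rate
    return first, last


# Usage: a short registered waveform embedded (at twice the amplitude) in silence.
registered = [1.0, -1.0, 1.0, -1.0]
collected = [0.0] * 8 + [2.0, -2.0, 2.0, -2.0] + [0.0] * 8
match_starts = find_matches(collected, registered)
frame_span = samples_to_frames(match_starts[0], len(registered), sample_rate=4, fps=2)
```

Normalizing the correlation makes the comparison insensitive to overall volume, which is one plausible reading of "match" for speech recorded at varying levels; a practical system would more likely compare spectral features than raw samples.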

変換部１０３は、収音部１３が収音した音を音声認識によりテキストに変換する機能を有
する。このテキストとは、任意の文字列のことである。変換部１０３は、音声認識に成功
した場合は、生成したテキストを含む音声認識結果を出力する。音声認識結果に、音声認
識が成功したことを示す成功情報を含めてもよい。 The conversion unit 103 has a function of converting the sound collected by the sound collection unit 13 into text by speech recognition. This text is any character string. If the speech recognition is successful, the conversion unit 103 outputs a speech recognition result including the generated text. The speech recognition result may include success information indicating that the speech recognition was successful.

表示部１０４は、変換部１０３により変換されたテキストをモニタ１４又はモニタ２４
に表示する機能を有する。講師Ｔ又は受講生Ａ、Ｂ、Ｃを含む周囲の音が示す情報には、
会話情報と非会話情報とが混在する。会話情報には、例えば、ポジティブな感情を示すワ
ードと、ネガティブな感情を示すワードの少なくとも何れかが含まれる。 The display unit 104 has a function of displaying the text converted by the conversion unit 103 on the monitor 14 or the monitor 24. The information indicated by surrounding sounds, including those of the instructor T or the students A, B, and C, contains a mixture of conversational information and non-conversational information. Conversational information includes, for example, at least one of words indicating positive emotions and words indicating negative emotions.

ポジティブな感情を示すワードの一例としては、講師が受講生を褒めたり、応援したり、
励ましたりする内容として、「よく頑張ったね」「努力したね」「すごいね」「素晴らし
いね」「立派だね」「偉いね」等を挙げることができる。 Examples of words that indicate positive emotions include phrases with which the instructor praises, cheers on, or encourages a student, such as "You did a great job," "You put in the effort," "That's amazing," "That's wonderful," "You're impressive," and "You're great."

一方、ネガティブな感情を示すワードの一例としては、講師が受講生を貶したり、非難
したり、誹謗したりする内容として、「そんなんじゃダメだ」「お前はダメだ」「なにや
ってんのよ」「落ちるぞ」「バカ」等を挙げることができる。 On the other hand, examples of words that indicate negative emotions include phrases with which the instructor belittles, criticizes, or slanders a student, such as "That's no good," "You're no good," "What are you doing," "You'll fail," and "You're stupid."
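
The positive and negative example phrases listed above can be treated as a small lexicon applied to the text produced by speech recognition. The following Python sketch is one hypothetical way to do so; the substring matching, the function name `classify_utterance`, and the labels "mixed" and "neutral" are assumptions for illustration and are not described in the disclosure.

```python
# Example phrases taken from the description above (original Japanese wording).
POSITIVE_WORDS = (
    "よく頑張ったね", "努力したね", "すごいね", "素晴らしいね", "立派だね", "偉いね",
)
NEGATIVE_WORDS = (
    "そんなんじゃダメだ", "お前はダメだ", "なにやってんのよ", "落ちるぞ", "バカ",
)


def classify_utterance(text: str) -> str:
    """Label recognized text by which lexicon entries it contains.

    Assumption: simple substring matching stands in for real language analysis.
    """
    has_positive = any(word in text for word in POSITIVE_WORDS)
    has_negative = any(word in text for word in NEGATIVE_WORDS)
    if has_positive and has_negative:
        return "mixed"
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neutral"


# Usage: classify a few recognized utterances.
labels = [
    classify_utterance("今日はよく頑張ったね"),
    classify_utterance("バカ"),
    classify_utterance("すごいね、でも落ちるぞ"),
    classify_utterance("こんにちは"),
]
```

Substring matching ignores context, which is why the description goes on to cut out the frames around each detected word before any judgment is made.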

非会話情報は、会話情報以外のテキスト情報である。非会話情報には、舌打ち、溜め息
、相槌の少なくとも何れかを示す情報が含まれる。これらの非会話情報は、講義を受講す
る受講生の感情を判断するための判断基準となり得る。受講生の感情は、例えば、「幸福
感」、「退屈感」、「緊張感」の３つに分類され得る。 Non-conversational information is text information other than conversational information. Non-conversational information includes information indicating at least one of clicking of the tongue, sighing, and interjections. Such non-conversational information can serve as a criterion for judging the emotions of students attending a lecture. The emotions of students can be classified into three categories, for example, "happiness," "boredom," and "tension."

フレーム切り出し部１０５は、抽出部１０１により抽出された特定のフレーム群に対し
て、少なくとも時系列的に前後に連続するフレーム群を切り出す機能を有する。フレーム
切り出し部１０５は、例えば、各ワードがどのような文脈で使用されたかを示す文脈情報
を取得するために適用可能な任意のフレームレートを用いて、前後に連続するフレーム群
を切り出すことができる。ここでの文脈情報とは、例えば、単語前後の任意範囲の文字列
、単語間の共起関係等を示す情報である。 The frame cut-out unit 105 has a function of cutting out, for a specific frame group extracted by the extraction unit 101, at least the frame groups that immediately precede and follow it in time series. The frame cut-out unit 105 can cut out the preceding and following frame groups using, for example, any frame rate suitable for obtaining context information that indicates the context in which each word was used. The context information here is, for example, information indicating a character string of arbitrary range before and after a word, co-occurrence relationships between words, and the like.

生成部１０６は、抽出部１０１により抽出された特定のフレーム群に対して、その前後に
連続する先行フレーム群と後続フレームを連結して、ダイジェスト動画を生成する機能を
有する。 The generating unit 106 has a function of generating a digest movie by linking a group of preceding and succeeding frames that are consecutive before and after a specific group of frames extracted by the extracting unit 101 .

次に、このように構成された在宅個別指導システム１の動作について説明する。図５は、
本開示の第１の実施の形態による教室映像配信方法の処理の流れを説明するフローチャー
トである。 Next, the operation of the home individual instruction system 1 configured as above will be described.
FIG. 5 is a flowchart illustrating the processing flow of the classroom video distribution method according to the first embodiment of the present disclosure.

ここでは、教室を運営する運営者等が、講師が受講生に対して不適切な発言・問題発言を
していないかどうかをチェックする場面を例に挙げて説明する。具体的に、予め登録され
た講師の声でネガティブな感情を示すワードとして「バカ」を表す波形データを用いて、
ダイジェスト動画を生成する場面を例に説明する。 Here, we will explain an example in which a classroom administrator or the like checks whether a lecturer is making inappropriate or problematic remarks to students. Specifically, we describe a scene in which a digest movie is generated using waveform data that represents "baka" (idiot), a word indicating negative emotion, in the pre-registered voice of the lecturer.

まず、講義が開始される時刻になると、各受講生Ａ、Ｂ、Ｃは受講生端末２０Ａ、２０Ｂ
、２０ＣをネットワークＮＷ経由で在宅個別指導システム１に接続して、講師Ｔの講義開
始を待つ。講師Ｔは、教室映像配信装置１０が備えるモニタ１４を見て各受講生Ａ、Ｂ、
Ｃが受講態勢にあるか否かを判断し、受講態勢が整っていれば、講義を開始する。すなわ
ち、撮像部１２並びに撮像部２２による撮像動作が開始されるとともに、収音部１３並び
に収音部２３による収音動作が開始される（ステップＳ１００）。 First, when the time for the lecture to start arrives, each of the students A, B, and C connects the student terminals 20A, 20B, and 20C to the home tutoring system 1 via the network NW and waits for the instructor T to begin the lecture. The instructor T watches the monitor 14 of the classroom video distribution device 10 to determine whether each of the students A, B, and C is ready to attend, and starts the lecture if they are. That is, the imaging unit 12 and the imaging unit 22 start imaging operations, and the sound collection unit 13 and the sound collection unit 23 start collecting sounds (step S100).

そして、抽出部１０１は、撮像部１２で取得される講師映像と、撮像部２２で取得される
受講生映像とを適宜取捨選択して合成すると共に、収音部１３又は収音部２３で収音され
た音を組み合わせて教室映像を生成すると共に、生成した教室映像をネットワークＮＷ経
由で受講生端末２０Ａ、２０Ｂ、２０Ｃに配信する（ステップＳ１０２）。 Then, the extraction unit 101 appropriately selects and synthesizes the instructor video captured by the imaging unit 12 and the student video captured by the imaging unit 22, combines the sounds collected by the sound collection unit 13 or the sound collection unit 23 to generate a classroom video, and distributes the generated classroom video to the student terminals 20A, 20B, and 20C via the network NW (step S102).

次に、講師が「バカ」と発声すると、その音声は収音部１３で収音されて、その音声デ
ータを含む教室映像が特定のフレーム群として抽出部１０１によって抽出される（ステッ
プＳ１０４）。 Next, when the teacher utters the word "idiot", the voice is picked up by the sound pickup unit 13, and the classroom video including the voice data is extracted as a specific frame group by the extraction unit 101 (step S104).

フレーム切り出し部１０５は、その教室映像の前後において例えば１０秒～２０秒程度
の時間間隔で連なるフレーム群を切り出す（ステップＳ１０６）。 The frame extracting unit 105 extracts a series of frames at time intervals of, for example, about 10 to 20 seconds before and after the classroom video (step S106).
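
The clipping in step S106 can be sketched as follows. This is a minimal illustration, assuming a fixed frame rate and symmetric padding around the extracted frame group; the function name `context_window` and the parameters `fps` and `pad_seconds` are hypothetical and do not appear in the disclosure, which only gives "about 10 to 20 seconds" as an example.

```python
def context_window(first_hit, last_hit, total_frames, fps=30.0, pad_seconds=10.0):
    """Return (start, end) frame indices, both inclusive, that cover the
    extracted frame group plus pad_seconds of context on each side.

    fps and pad_seconds are assumed values chosen for illustration.
    """
    if not 0 <= first_hit <= last_hit < total_frames:
        raise ValueError("hit range must lie inside the video")
    pad = int(round(fps * pad_seconds))          # padding expressed in frames
    start = max(0, first_hit - pad)              # clamp at the first frame
    end = min(total_frames - 1, last_hit + pad)  # clamp at the last frame
    return start, end
```

Clamping at both ends keeps the window valid when the matched word is spoken near the start or end of the lecture.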

そして、生成部１０６は、これらのフレーム群を連結したダイジェスト動画を生成すると
共に、当該ダイジェスト動画に基づいて、講師の音声データに対して所定の音響分析を
施す（ステップＳ１０８）。 The generating unit 106 then generates a digest movie by connecting these frames together, and performs a predetermined acoustic analysis on the instructor's voice data based on the digest movie (step S108).

この音響分析によれば、「バカ」というワードがどういう文脈で使われたかを把握するこ
とができる。 This acoustic analysis allows us to understand the context in which the word "baka" was used.

例えば、（受講生以外の第三者に対して）「こういうバカなことを言ってる人はダメだよ
ね」という文脈で講師において「バカ」というワードが使われた場合には、講師は受講生
のことをバカと言ったわけではないことが把握できる。 For example, if an instructor uses the word "idiot" in the context of saying (to a third party other than the students) "People who say stupid things like this are no good," it is clear that the instructor did not mean to call the students stupid.

また例えば、「俺はバカだから」と文脈で講師において「バカ」というワードが使われた
場合には、講師は自身のことをバカと言っていることが把握できる。 As another example, if the instructor uses the word "idiot" in the context of "I'm an idiot," it can be understood that the instructor is referring to himself or herself.

かくして、教室を運営する運営者等は、「バカ」というワードとその前後の文脈をテキス
ト文章として例えばレポート形式で取得し得るので、講師が受講生に対して不適切な発言
・問題発言をしていないかどうかを容易にチェックできる。 Thus, a classroom administrator or the like can obtain the word "baka" and the context before and after it as a text sentence, for example in the form of a report, and can easily check whether the instructor is making inappropriate or problematic remarks to students.

すなわち、予め登録済みの波形データを用いた照合により、演算処理に伴う負荷を軽減
しながらも、教室映像内において講師が不適切な発言・問題発言を引き起こす可能性の高
い状況を含む特徴的なシーンをピンポイントで引き出すことが可能となる。例えば、講義
時間（例えば９０分）に対して、講師によるネガティブな感情を示すワードの発声回数が
比較的大きい所定の回数（例えば１０回）に至るような場合には、講師の人間性を判断す
ることも可能になる。もちろん、講師によるポジティブな感情を示すワードの発声回数も
講師の人間性を判断する材料になり得る。さらには、講師の側に限らず、受講生の側にお
いても、講師からの発話に対する舌打ちの回数、溜め息の回数、相槌の回数は、受講生が
どのような感情を抱いているかを判断する材料となり得る。 That is, by using the preregistered waveform data for comparison, it is possible to pinpoint characteristic scenes in the classroom video, including situations where the lecturer is likely to make inappropriate or problematic remarks, while reducing the load of computational processing. For example, if the lecturer utters a relatively large number of words indicating negative emotions (e.g., 10 times) during a lecture time (e.g., 90 minutes), it is possible to judge the lecturer's personality. Of course, the number of times the lecturer utters words indicating positive emotions can also be used to judge the lecturer's personality. Furthermore, not only on the lecturer's side, but also on the students' side, the number of times they click their tongues, sigh, and respond to the lecturer's remarks can be used to judge the students' emotions.
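
The utterance-count criterion above can be sketched as follows. This is only an illustration: the first embodiment matches registered waveform data, whereas this sketch assumes the matched utterances are already available as text. The function names and the default threshold of 10 (taken from the example in the text) are assumptions.

```python
from collections import Counter


def count_registered_words(utterances, registered_words):
    """Count how often each registered word occurs in a list of utterance strings."""
    counts = Counter()
    for text in utterances:
        for word in registered_words:
            counts[word] += text.count(word)
    return counts


def reaches_threshold(counts, threshold=10):
    """Return True if the total number of registered-word utterances
    in one lecture reaches the given threshold."""
    return sum(counts.values()) >= threshold
```

The same counting applies to positive words, or to per-student counts of tongue clicks, sighs, and back-channel responses.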

＜第２の実施形態＞
以下、図６及び図７に基づいて、第２の実施形態に係る在宅個別指導システムについて説
明する。この実施形態では、上述した第１実施形態で説明した要素と同一の要素について
同じ符号を付し、詳細な説明を省略する。 Second Embodiment
Hereinafter, an at-home tutoring system according to a second embodiment will be described with reference to Fig. 6 and Fig. 7. In this embodiment, the same elements as those described in the first embodiment are denoted by the same reference numerals, and detailed description thereof will be omitted.

上記の第１実施形態に係る在宅個別指導システムでは、既に述べたように、予め登録済み
の波形データを用いた照合により、教室映像の中から特定のフレーム群を抽出していたが
、第２の実施形態に係る在宅個別指導システムでは、人物の顔の表情に関する評価値に基
づいて、教室映像の中から特定のフレーム群を抽出する処理が行われる。 As already mentioned, in the above-mentioned first embodiment of the tutoring system, specific frames are extracted from classroom footage by matching with pre-registered waveform data, whereas in the second embodiment of the tutoring system, a process is performed to extract specific frames from classroom footage based on an evaluation value related to a person's facial expression.

＜ソフトウェア構成＞
図６は、本実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である。
教室映像配信装置１０は、抽出部１０１と、フレーム切り出し部１０５と、生成部１０６
と、顔検出部１０７と、算出部１０８と、顔情報登録部１０９と、を備える。 <Software configuration>
FIG. 6 is a diagram showing an example of the software configuration of the classroom video distribution device 10 according to this embodiment.
The classroom video distribution device 10 includes an extraction unit 101, a frame extraction unit 105, a generation unit 106, a face detection unit 107, a calculation unit 108, and a face information registration unit 109.

顔検出部１０７は、例えば、メモリ１５に格納されている教室映像を構成する複数のフレ
ームの夫々について、人物の顔検出を行う機能を有する。例えば、顔検出部１０７は、教
室映像の中から特徴点を抽出して、講師Ｔ又は各受講生Ａ、Ｂ、Ｃの顔領域、顔領域の大
きさ・顔面積等を検出する。 The face detection unit 107 has a function of detecting human faces in each of a plurality of frames constituting the classroom video stored in the memory 15. For example, the face detection unit 107 extracts feature points from the classroom video and detects the face region, the size of the face region, the face area, etc. of the instructor T or each of the students A, B, and C.

特徴点としては、例えば、眉、目、鼻、唇の各端点、顔の輪郭点、頭頂点、顎の下端点等
が挙げられる。そして、顔検出部１０７は、顔領域の位置情報を特定する。例えば、顔検
出部１０７は、画像の横方向をＸ軸とし、縦方向をＹ軸として、顔領域に含まれる画素の
Ｘ座標及びＹ座標を算出する。さらに、顔検出部１０７は、上述した特徴点を用いた演算
処理によって、検出した講師又は各受講生の表情・年齢などを判別し得る。 Examples of the feature points include the end points of the eyebrows, eyes, nose, and lips, the contour points of the face, the vertex of the head, and the bottom end point of the chin. Then, the face detection unit 107 identifies the position information of the face area. For example, the face detection unit 107 calculates the X-coordinate and the Y-coordinate of the pixels included in the face area by taking the horizontal direction of the image as the X-axis and the vertical direction as the Y-axis. Furthermore, the face detection unit 107 can determine the facial expression, age, and the like of the detected instructor or each student by performing a calculation process using the above-mentioned feature points.

算出部１０８は、教室映像を構成する複数のフレームの夫々について、講師Ｔ又は各受講
生Ａ、Ｂ、Ｃの顔に関する評価値を算出する機能を有する。算出部１０８において算出さ
れる各評価値は、以下に示す所定の評価値（１）～（６）が含まれる。これらの評価値（
１）～（６）は、顔情報登録部１０９に予め登録される。 The calculation unit 108 has a function of calculating an evaluation value related to the face of the instructor T or each of the students A, B, and C for each of the multiple frames that make up the classroom video. The evaluation values calculated by the calculation unit 108 include the following predetermined evaluation values (1) to (6). These evaluation values (1) to (6) are registered in advance in the face information registration unit 109.

（１）笑顔の度合い
算出部１０８は、顔検出部１０７が検出した顔のそれぞれについて、例えば、パターンマ
ッチングなどの公知技術を用いて、笑顔の度合いを評価値として算出する。本実施形態で
は、度合いの一例として、「０：笑顔なし」、「１：微笑」、「２：普通笑い」、「３：
大笑い」までの４段階で笑顔の度合いを示す。 (1) Degree of smile: The calculation unit 108 calculates the degree of smile as an evaluation value for each face detected by the face detection unit 107, using a known technique such as pattern matching. In this embodiment, as an example, the degree of smile is indicated on a four-level scale: "0: no smile", "1: slight smile", "2: normal smile", and "3: laughing loudly".

（２）視線の向き
算出部１０８は、顔検出部１０７が検出した顔のそれぞれについて、公知の技術を用いて
、視線の向きを評価値として算出する。本実施形態では、一例として、「０：視線正面」
、「１：視線左右方向」、「２：視線右方向」、「３：視線検出不可」の４種類で視線の
向きを示す。 (2) Gaze direction: The calculation unit 108 calculates the gaze direction as an evaluation value for each face detected by the face detection unit 107, using a known technique. In this embodiment, as an example, the gaze direction is indicated by four types: "0: gaze forward", "1: gaze left/right", "2: gaze right", and "3: gaze not detectable".

（３）顔の向き
算出部１０８は、顔検出部１０７が検出した顔のそれぞれについて、公知の技術を用いて
、顔の向きを評価値として算出する。本実施形態では、一例として、「０：顔向き正面」
、「１：顔向き左方向」、「２：顔向き右方向」、「３：検出不可」の４種類で顔の向き
を示す。 (3) Face direction: The calculation unit 108 calculates the face direction as an evaluation value for each face detected by the face detection unit 107, using a known technique. In this embodiment, as an example, the face direction is indicated by four types: "0: face direction forward", "1: face direction to the left", "2: face direction to the right", and "3: not detectable".

（４）顔面積
算出部１０８は、顔検出部１０７が検出した顔のそれぞれについて、顔部分の面積を評価
値として算出する。 (4) Face area: The calculation unit 108 calculates the area of the face portion as an evaluation value for each face detected by the face detection unit 107.

（５）年齢
算出部１０８は、顔検出部１０７が検出した顔のそれぞれについて、公知の技術を用いて
、その人物の年齢を評価値として算出する。 (5) Age: The calculation unit 108 calculates the person's age as an evaluation value for each face detected by the face detection unit 107, using a known technique.

（６）目つぶり度合い
算出部１０８は、顔検出部１０７が検出した顔のそれぞれについて、公知の技術を用いて
、目つぶり度合いを評価値として算出する。本実施形態では、一例として、「０：目つぶ
りなし」、「１：一部目つぶりあり」、「２：両目目つぶり」、「３：目つぶり検出不可
」の４種類で目つぶり度合いを示す。 (6) Degree of eye closure: The calculation unit 108 calculates the degree of eye closure as an evaluation value for each face detected by the face detection unit 107, using a known technique. In this embodiment, as an example, the degree of eye closure is indicated by four types: "0: no eye closure", "1: partial eye closure", "2: both eyes closed", and "3: eye closure not detectable".
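
The discrete evaluation values listed above could be represented as follows. This is a sketch under assumptions: the class and member names are hypothetical, only the numeric codes come from the text, and the gaze direction is left as a plain integer because its codes are not fully distinct in the source.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class SmileDegree(IntEnum):
    NONE = 0
    SLIGHT = 1
    NORMAL = 2
    LOUD = 3


class FaceDirection(IntEnum):
    FRONT = 0
    LEFT = 1
    RIGHT = 2
    UNDETECTABLE = 3


class EyeClosure(IntEnum):
    NONE = 0
    PARTIAL = 1
    BOTH_CLOSED = 2
    UNDETECTABLE = 3


@dataclass
class FaceEvaluation:
    """Evaluation values (1) to (6) for one detected face in one frame."""
    frame_index: int
    smile: SmileDegree
    gaze_direction: int          # codes 0 to 3 as listed in item (2)
    face_direction: FaceDirection
    face_area: float             # unit (e.g. pixels) is an assumption
    eye_closure: EyeClosure
    age: Optional[int] = None    # None when age could not be estimated
```

Using `IntEnum` keeps the stored values identical to the numeric codes in the text while making comparisons such as `face_direction == FaceDirection.LEFT` readable.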

これらの評価値（１）～（６）は、講義を受講する受講生の感情を判断するための判断基
準となり得る。受講生の感情は、例えば、「幸福感」、「退屈感」、「緊張感」の３つに
分類され得る。 These evaluation values (1) to (6) can be used as criteria for judging the emotions of students attending a lecture. Student emotions can be classified into, for example, three categories: "happiness," "boredom," and "tension."

抽出部１０１は、所定の特定条件が満たされた場合、例えば、算出部１０８により算出さ
れた評価値と、顔情報登録部１０９に予め登録されている評価値とが一致した場合に、当
該一致した評価値に対応するフレーム群を特定のフレーム群として前記教室映像から抽出
する。 When a predetermined specific condition is satisfied, for example, when the evaluation value calculated by the calculation unit 108 matches an evaluation value pre-registered in the face information registration unit 109, the extraction unit 101 extracts a group of frames corresponding to the matching evaluation value from the classroom video as a specific group of frames.

次に、このように構成された在宅個別指導システムの動作について説明する。図７は、本
開示の第２の実施の形態による教室映像配信方法の処理の流れを説明するフローチャート
である。 Next, the operation of the at-home individual instruction system configured as above will be described. Fig. 7 is a flowchart illustrating the process flow of the classroom video distribution method according to the second embodiment of the present disclosure.

ここでは、教室を運営する運営者等が、受講生Ａである子供を塾などに預ける保護者等か
らの要望であって、受講生Ａの学習態度・学習状況を把握したいという要望に応える場面
を例に挙げて説明する。 Here, we will explain an example of a situation in which a classroom operator responds to a request from a parent or guardian who has their child (student A) attend a cram school or other such institution, who wants to understand student A's learning attitude and learning situation.

まず、講義が開始される時刻になると、各受講生Ａ、Ｂ、Ｃは受講生端末２０Ａ、２０Ｂ
、２０ＣをネットワークＮＷ経由で在宅個別指導システムに接続して、講師Ｔの講義開始
を待つ。講師Ｔは、教室映像配信装置１０が備えるモニタ１４を見て各受講生Ａ、Ｂ、Ｃ
が受講態勢にあるか否かを判断し、受講態勢が整っていれば、講義を開始する。すなわち
、撮像部１２並びに撮像部２２による撮像動作が開始されるとともに、収音部１３並びに
収音部２３による収音動作が開始される（ステップＳ２００）。 First, when the time for the lecture to start arrives, each of the students A, B, and C connects the student terminals 20A, 20B, and 20C to the home tutoring system via the network NW and waits for the instructor T to begin the lecture. The instructor T watches the monitor 14 of the classroom video distribution device 10 to determine whether each of the students A, B, and C is ready to attend, and starts the lecture if they are. That is, the imaging units 12 and 22 start imaging operations, and the sound collection units 13 and 23 start collecting sounds (step S200).

そして、抽出部１０１は、撮像部１２で取得される講師映像と、撮像部２２で取得される
受講生映像とを適宜取捨選択して合成すると共に、収音部１３又は収音部２３で収音され
た音を組み合わせて教室映像を生成すると共に、生成した教室映像をネットワークＮＷ経
由で受講生端末２０Ａ、２０Ｂ、２０Ｃに配信する（ステップＳ２０２）。 Then, the extraction unit 101 appropriately selects and synthesizes the instructor video captured by the imaging unit 12 and the student video captured by the imaging unit 22, combines the sounds collected by the sound collection unit 13 or the sound collection unit 23 to generate a classroom video, and distributes the generated classroom video to the student terminals 20A, 20B, and 20C via the network NW (step S202).

次に、顔検出部１０７は、教室映像について、受講生Ａの顔検出を行う（ステップＳ２０
４）。顔検出の具体的な手法については、公知技術と同様であるため説明を省略する。 Next, the face detection unit 107 detects the face of student A in the classroom video (step S204). The specific method of face detection is the same as in known techniques, so its description is omitted.

そして、算出部１０８は、顔検出部１０７が検出した受講生Ａの顔について、公知の技術
を用いて、顔の向きを評価値として算出する（ステップＳ２０６）。続いて、抽出部１０
１は、算出部１０８により算出された評価値が、顔情報登録部１０９に予め登録されてい
る評価値「１：顔向き左方向」、「２：顔向き右方向」の何れかと一致した場合に、当該
一致した評価値に対応するフレーム群を特定のフレーム群として教室映像から抽出する（
ステップＳ２０８）。 Then, the calculation unit 108 calculates the orientation of the face of the student A detected by the face detection unit 107 as an evaluation value using a known technique (step S206). Next, when the evaluation value calculated by the calculation unit 108 matches either of the evaluation values "1: face direction to the left" or "2: face direction to the right" registered in advance in the face information registration unit 109, the extraction unit 101 extracts the frame group corresponding to the matching evaluation value from the classroom video as a specific frame group (step S208).
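
The matching in step S208 can be sketched as grouping consecutive frames whose evaluation value is one of the registered values. This is a minimal illustration, assuming one integer evaluation value per frame; the function name `matching_frame_groups` is hypothetical.

```python
def matching_frame_groups(values, registered):
    """Group consecutive frame indices whose evaluation value is registered.

    values: per-frame evaluation values (list index = frame number).
    registered: set of values that satisfy the specific condition,
                e.g. {1, 2} for "face left" or "face right".
    Returns a list of (start, end) index pairs, both inclusive.
    """
    groups = []
    start = None
    for i, value in enumerate(values):
        if value in registered:
            if start is None:
                start = i                    # a new matching run begins
        elif start is not None:
            groups.append((start, i - 1))    # the run ended on the previous frame
            start = None
    if start is not None:                    # close a run that reaches the last frame
        groups.append((start, len(values) - 1))
    return groups
```

For example, with per-frame face-direction codes `[0, 1, 1, 0, 2, 2, 2, 3, 1]` and registered values `{1, 2}`, the function returns `[(1, 2), (4, 6), (8, 8)]`.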

すなわち、受講生Ａの顔の向きが正面ではなく左方向又は右方向を向いているような場合
には、受講生Ａの講義に対する集中度が低下していることが推認され得る。 That is, if Student A's face is not facing forward but is facing to the left or right, it can be inferred that Student A's concentration on the lecture is declining.

フレーム切り出し部１０５は、受講生Ａの顔が正面を向いていない教室映像の前後にお
いて例えば１０秒～２０秒程度の時間間隔で連なるフレーム群を切り出す（ステップＳ２
１０）。 The frame extraction unit 105 extracts a series of frames over a time interval of, for example, about 10 to 20 seconds before and after the classroom video in which the face of student A is not facing forward (step S210).

そして、生成部１０６は、これらのフレーム群を連結したダイジェスト動画を生成すると
共に、当該ダイジェスト動画に基づいて、講師の音声データに対して所定の音響分析を施
す（ステップＳ２１２）。かかる音響分析によれば、受講生Ａの顔が正面を向いていない
ときに、講師Ｔが発話しているワードが何であるかのみをフラグで管理できる。これによ
り、受講生Ａの講義に対する集中度が低下した要因となり得るワードを、教室を運営する
運営者等が収集できるとともに、受講生Ａの学習態度・学習状況を把握したいという保護
者等の要望に対しても効率的に応えることができる。 The generator 106 then generates a digest video by connecting these frames, and performs a predetermined acoustic analysis on the instructor's voice data based on the digest video (step S212). This acoustic analysis allows a flag to be used to manage only the words that instructor T utters when student A is not facing forward. This allows the administrator of the classroom to collect words that may be the cause of student A's reduced concentration on the lecture, and also efficiently responds to the requests of parents and others who want to understand student A's learning attitude and learning situation.

もちろん、受講生Ａの講義に対する集中度の判定は、受講生Ａの顔の向きに限らず、受講
生Ａの視線の向きによっても行い得る。すなわち、受講生Ａの視線の向きが正面ではなく
左方向又は右方向を向いているような場合には、受講生Ａの講義に対する集中度が低下し
ていることが推認され得る。 Of course, the degree of concentration of Student A on the lecture can be determined not only from the direction of Student A's face but also from the direction of Student A's gaze. In other words, if Student A's gaze direction is not forward but toward the left or right, it can be inferred that Student A's degree of concentration on the lecture is decreasing.

本実施形態に係る在宅個別指導システムには、さらに以下のような使用例が考えられる。 Further possible use cases for the at-home individual tutoring system according to this embodiment include the following:

具体的に、受講生Ａが満足感・幸福感・充実感といったポジティブな感情を抱いたシーン
のみを集めて編集した動画を受講生Ａの保護者等に向けたダイジェスト動画として生成し
得る。 Specifically, a video may be created by editing only scenes in which student A felt positive emotions such as satisfaction, happiness, and fulfillment, and the edited video may be generated as a digest video for student A's parents, etc.

かかるダイジェスト動画を生成し得る具体的な処理の一例としては、まず、算出部１０８
は、顔検出部１０７が検出した受講生Ａの顔について、公知の技術を用いて、笑顔の度合
いを評価値として算出することができる。続いて、抽出部１０１は、算出部１０８により
算出された度合いが、顔情報登録部１０９に予め登録されている度合い「２：普通笑い」
、「３：大笑い」の何れかと一致した場合に、当該一致した評価値に対応するフレーム群
を特定のフレーム群として教室映像から抽出することができる。 As an example of a specific process for generating such a digest movie, the calculation unit 108 first calculates the degree of smile as an evaluation value for the face of student A detected by the face detection unit 107, using a known technique. Next, when the degree calculated by the calculation unit 108 matches either of the degrees "2: normal smile" or "3: laughing loudly" registered in advance in the face information registration unit 109, the extraction unit 101 can extract the frame group corresponding to the matching evaluation value from the classroom video as a specific frame group.

さらに、フレーム切り出し部１０５は、受講生Ａの笑顔の度合いが「２：普通笑い」、
「３：大笑い」の何れかであるときの教室映像の前後において例えば１０秒～２０秒程度
の時間間隔で連なるフレーム群を切り出すことができる。そして、最後に、生成部１０６
は、これらのフレーム群を連結したダイジェスト動画を、受講生Ａの保護者等に向けたダ
イジェスト動画として生成する。 Furthermore, the frame extraction unit 105 can extract a series of frames over a time interval of, for example, about 10 to 20 seconds before and after the classroom video in which the degree of smile of student A is either "2: normal smile" or "3: laughing loudly". Finally, the generation unit 106 connects these frame groups to generate a digest movie intended for student A's parents or guardians.
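
When several smiling scenes occur close together, their context windows can overlap. One way to connect the frame groups into a single digest is sketched below. This is an assumption about how the concatenation might be done, not a method stated in the disclosure; the function names are hypothetical.

```python
def merge_windows(windows):
    """Merge overlapping or directly adjacent (start, end) frame windows.

    Windows are inclusive index pairs; the result is sorted by start.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def digest_frame_indices(windows):
    """Return the frame indices of the digest in playback order,
    with no frame repeated."""
    indices = []
    for start, end in merge_windows(windows):
        indices.extend(range(start, end + 1))
    return indices
```

Merging before concatenation avoids showing the same seconds twice in the digest.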

＜第３の実施形態＞
以下、図８及び図９に基づいて、第３の実施形態に係る在宅個別指導システムについて説
明する。この実施形態では、上述した第１実施形態で説明した要素と同一の要素について
同じ符号を付し、詳細な説明を省略する。 Third Embodiment
Hereinafter, a home tutoring system according to a third embodiment will be described with reference to Fig. 8 and Fig. 9. In this embodiment, the same elements as those described in the first embodiment are denoted by the same reference numerals, and detailed description thereof will be omitted.

上記の第１実施形態に係る在宅個別指導システムでは、既に述べたように、予め登録済み
の波形データを用いた照合により、教室映像の中から特定のフレーム群を抽出していたが
、第３の実施形態に係る在宅個別指導システムでは、人物の動作に関する動作情報に基づ
いて、教室映像の中から特定のフレーム群を抽出する処理が行われる。 As already mentioned above, in the at-home tutoring system of the first embodiment, specific frames were extracted from classroom footage by matching with pre-registered waveform data, whereas in the at-home tutoring system of the third embodiment, a process is performed to extract specific frames from classroom footage based on movement information relating to a person's movements.

＜ソフトウェア構成＞
図８は、本実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図である
。教室映像配信装置１０は、抽出部１０１と、フレーム切り出し部１０５と、生成部１０
６と、特定部１１０と、動作情報登録部１１１と、を備える。 <Software configuration>
FIG. 8 is a diagram showing an example of the software configuration of the classroom video distribution device 10 according to this embodiment. The classroom video distribution device 10 includes an extraction unit 101, a frame extraction unit 105, a generation unit 106, an identification unit 110, and a motion information registration unit 111.

特定部１１０は、例えば、メモリ１５に格納されている教室映像を構成する複数のフレー
ムの夫々について、人物の動作に関する動作情報を特定する機能を有する。この動作情報
は、例えば、人物の動作を複数の姿勢の連続として捉えた情報であって、様々な姿勢に対
応する人体の骨格を形成する各関節の情報を含み得る。動作情報には、例えば、人物の身
振り、手振り、ジェスチャ、ボディランゲージの少なくとも何れかが含まれる。 The identification unit 110 has a function of identifying motion information related to a person's motion for each of a plurality of frames constituting the classroom video stored in the memory 15. This motion information is, for example, information that captures a person's motion as a sequence of postures, and may include information on each joint that forms the skeleton of the human body in those postures. The motion information includes, for example, at least one of a person's body movements, hand movements, gestures, and body language.

特定部１１０は、例えば、人体パターンを用いたパターンマッチングにより、教室映像を
構成する複数のフレームから、人体の骨格を形成する各関節の座標を得る。座標取得の具
体的な手法については、公知技術と同様であるため説明を省略する。そして、この座標系
で表される各関節の座標が、例えば、１フレーム分の骨格情報となり得る。さらに、複数
フレーム分の骨格情報が所定の動作情報となり得る。 The identification unit 110 obtains the coordinates of each joint forming the skeleton of the human body from the multiple frames constituting the classroom video, for example, by pattern matching using a human body pattern. The specific method of obtaining the coordinates is similar to known techniques, and therefore will not be described here. The coordinates of each joint expressed in this coordinate system can be, for example, skeleton information for one frame. Furthermore, skeleton information for multiple frames can be predetermined motion information.

かかる所定の動作情報は、動作情報登録部１１１に予め登録されている。すなわち、動作
情報登録部１１１は、様々な姿勢に対応する動作情報を、例えば、公知の人工知能技術を
用いた機械学習により予め記憶している。例えば、本実施形態において、受講生が手を振
る動きに対応するジェスチャは、講義の内容に納得ができなかったり、引っかかるところ
があったりする受講生が講師に対して補充説明を求めるジェスチャパターンとして機械学
習済みであるとする。 Such predetermined motion information is pre-registered in the motion information registration unit 111. That is, the motion information registration unit 111 pre-stores motion information corresponding to various postures, for example, by machine learning using a known artificial intelligence technology. For example, in this embodiment, a gesture corresponding to a student's hand waving movement has been pre-programmed by machine learning as a gesture pattern for a student who is not convinced or has difficulty with the content of a lecture to ask the lecturer for additional explanation.

抽出部１０１は、所定の特定条件が満たされた場合、例えば、特定部１１０により特定さ
れた動作情報と、動作情報登録部１１１に予め登録されている動作情報とが一致した場合
に、当該一致した動作情報に対応するフレーム群を特定のフレーム群として教室映像から
抽出する。 When a predetermined specific condition is satisfied, for example, when the motion information identified by the identification unit 110 matches the motion information pre-registered in the motion information registration unit 111, the extraction unit 101 extracts a group of frames corresponding to the matching motion information from the classroom video as a specific group of frames.

次に、このように構成された在宅個別指導システムの動作について説明する。図９は、本
開示の第３の実施の形態による教室映像配信方法の処理の流れを説明するフローチャート
である。 Next, the operation of the at-home individual instruction system thus configured will be described. Fig. 9 is a flowchart illustrating the process flow of the classroom video distribution method according to the third embodiment of the present disclosure.

ここでは、講義の内容に納得ができない受講生Ａが講師に対して補充説明を求める状況を
含むシーンを教室映像からピックアップする場面を例に挙げて説明する。 Here, an example will be described in which a scene is picked up from a classroom video including a situation in which a student A, who is not convinced by the content of the lecture, asks the lecturer for additional explanation.

まず、講義が開始される時刻になると、各受講生Ａ、Ｂ、Ｃは受講生端末２０Ａ、２０Ｂ
、２０ＣをネットワークＮＷ経由で在宅個別指導システムに接続して、講師Ｔの講義開始
を待つ。講師Ｔは、教室映像配信装置１０が備えるモニタ１４を見て各受講生Ａ、Ｂ、Ｃ
が受講態勢にあるか否かを判断し、受講態勢が整っていれば、講義を開始する。すなわち
、撮像部１２並びに撮像部２２による撮像動作が開始されるとともに、収音部１３並びに
収音部２３による収音動作が開始される（ステップＳ３００）。 First, when the time for the lecture to start arrives, each of the students A, B, and C connects the student terminals 20A, 20B, and 20C to the home tutoring system via the network NW and waits for the instructor T to begin the lecture. The instructor T watches the monitor 14 of the classroom video distribution device 10 to determine whether each of the students A, B, and C is ready to attend, and starts the lecture if they are. That is, the imaging units 12 and 22 start imaging operations, and the sound collection units 13 and 23 start collecting sounds (step S300).

そして、抽出部１０１は、撮像部１２で取得される講師映像と、撮像部２２で取得される
受講生映像とを適宜取捨選択して合成すると共に、収音部１３又は収音部２３で収音され
た音を組み合わせて教室映像を生成すると共に、生成した教室映像をネットワークＮＷ経
由で受講生端末２０Ａ、２０Ｂ、２０Ｃに配信する（ステップＳ３０２）。 Then, the extraction unit 101 appropriately selects and synthesizes the instructor video captured by the imaging unit 12 and the student video captured by the imaging unit 22, combines the sounds collected by the sound collection unit 13 or the sound collection unit 23 to generate a classroom video, and distributes the generated classroom video to the student terminals 20A, 20B, and 20C via the network NW (step S302).

次に、特定部１１０は、人体パターンを用いたパターンマッチングにより、教室映像を構
成する複数のフレームから、受講生Ａの骨格を形成する各関節の座標を得る。さらに、特
定部１１０は、各関節の座標に基づいて、複数フレーム分の骨格情報を受講生Ａのジェス
チャとして特定する（ステップＳ３０４）。 Next, the identification unit 110 obtains the coordinates of each joint forming the skeleton of student A from multiple frames constituting the classroom video by pattern matching using the human body pattern. Furthermore, the identification unit 110 identifies the skeleton information for multiple frames as gestures of student A based on the coordinates of each joint (step S304).

続いて、抽出部１０１は、特定部１１０により特定されたジェスチャが、動作情報登録部
１１１において機械学習済みのジェスチャパターン（受講生Ａが講師に対して補充説明を
求めるジェスチャパターン）と一致した場合に、当該一致したジェスチャに対応するフレ
ーム群を特定のフレーム群として教室映像から抽出する（ステップＳ３０６）。 Next, if the gesture identified by the identification unit 110 matches a gesture pattern that has been machine-learned in the action information registration unit 111 (a gesture pattern in which student A requests additional explanation from the instructor), the extraction unit 101 extracts a group of frames corresponding to the matching gesture as a specific group of frames from the classroom video (step S306).
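
The comparison in step S306 can be illustrated with a deliberately simple stand-in. The disclosure uses gesture patterns learned by machine learning; the sketch below instead compares two skeleton sequences by mean joint distance against a tolerance, purely to show the data shapes involved. The function names, the 2-D coordinate layout, and the tolerance value are all assumptions.

```python
import math


def mean_joint_distance(seq_a, seq_b):
    """Mean Euclidean distance between corresponding joints of two
    equal-length skeleton sequences.

    Each sequence is a list of frames; each frame is a list of (x, y)
    joint coordinates given in the same joint order.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and equal in length")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(seq_a, seq_b):
        if len(frame_a) != len(frame_b) or not frame_a:
            raise ValueError("frames must contain the same non-empty joint set")
        for (xa, ya), (xb, yb) in zip(frame_a, frame_b):
            total += math.hypot(xa - xb, ya - yb)
            count += 1
    return total / count


def matches_gesture(observed, registered, tolerance=0.1):
    """Return True if the observed sequence is within the tolerance of
    the registered gesture pattern (tolerance is an assumed value)."""
    return mean_joint_distance(observed, registered) <= tolerance
```

A learned classifier would replace `matches_gesture` in practice; the sketch only shows that skeleton information for multiple frames is what gets compared.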

フレーム切り出し部１０５は、受講生Ａ講師に対して補充説明を求める教室映像の前後
において例えば１０秒～２０秒程度の時間間隔で連なるフレーム群を切り出す（ステップ
Ｓ３０８）。 The frame extracting unit 105 extracts a series of frames at time intervals of, for example, about 10 to 20 seconds before and after the classroom video in which Student A requests additional explanation from the instructor (step S308).

そして、生成部１０６は、これらのフレーム群を連結したダイジェスト動画を生成する（
ステップＳ３１０）。 The generation unit 106 then generates a digest movie by connecting these frame groups (step S310).

かくして、受講生Ａが講師に対して補充説明を求めるジェスチャパターンが既に登録済み
の状態であるので、今後上記フローと同様の状況があれば、動作情報登録部１１１に保持
されているジェスチャパターンに従った照合により、教室映像の中から特定のフレーム群
をピックアップしてくれば、演算処理に伴う負荷を増やさなくとも同様のダイジェスト動
画を生成することが可能となる。 Thus, since the gesture pattern by which student A requests further explanation from the instructor has already been registered, if a situation similar to the above flow occurs in the future, a specific group of frames can be picked up from the classroom video by matching against the gesture pattern stored in the motion information registration unit 111, and a similar digest video can be generated without increasing the load associated with computational processing.

＜第４の実施形態＞
以下、図１０及び図１１に基づいて、第４の実施形態に係る在宅個別指導システムについ
て説明する。この実施形態では、上述した第１実施形態で説明した要素と同一の要素につ
いて同じ符号を付し、詳細な説明を省略する。 Fourth Embodiment
Hereinafter, a home tutoring system according to a fourth embodiment will be described with reference to Fig. 10 and Fig. 11. In this embodiment, the same elements as those described in the first embodiment are denoted by the same reference numerals, and detailed description thereof will be omitted.

上記の第１実施形態に係る在宅個別指導システムでは、既に述べたように、予め登録済み
の波形データを用いた照合により、教室映像の中から特定のフレーム群を抽出していたが
、第４の実施形態に係る在宅個別指導システムでは、人物の生体情報に基づいて、教室映
像の中から特定のフレーム群を抽出する処理が行われる。 As already mentioned above, in the at-home tutoring system of the first embodiment, specific frames were extracted from classroom video footage by matching with pre-registered waveform data, whereas in the at-home tutoring system of the fourth embodiment, a process is performed to extract specific frames from classroom video footage based on a person's biometric information.

＜ソフトウェア構成＞
図１０は、本実施形態に係る教室映像配信装置１０のソフトウェア構成例を示す図であ
る。教室映像配信装置１０は、抽出部１０１と、フレーム切り出し部１０５と、生成部１
０６と、生体情報検出部１１２と、生体情報登録部１１３と、を備える。 <Software configuration>
FIG. 10 is a diagram showing an example of the software configuration of the classroom video distribution device 10 according to this embodiment. The classroom video distribution device 10 includes an extraction unit 101, a frame extraction unit 105, a generation unit 106, a biometric information detection unit 112, and a biometric information registration unit 113.

生体情報検出部１１２は、例えば、メモリ１５に格納されている教室映像を構成する複数
のフレームの夫々について、人物の生体情報を検出する機能を有する。人物の生体情報に
は、人物の血圧、脈拍、脈圧の少なくとも何れかが含まれる。これら所定の生体情報は、
各フレームに映り込んだ講師又は受講生の顔領域を一般的な顔検知技術等によって抽出し
たのちに、血流方向に沿って複数の領域に分割し、各領域における血流を示す色画像の時
系列変化に基づいて取得することができる。 The biometric information detection unit 112 has a function of detecting biometric information of a person for each of a plurality of frames constituting the classroom video stored in the memory 15. The biometric information of a person includes at least one of the person's blood pressure, pulse, and pulse pressure. This predetermined biometric information can be obtained by extracting the facial area of the instructor or student appearing in each frame using a general face detection technique or the like, dividing that area into multiple regions along the direction of blood flow, and using the time-series changes in the color images that indicate blood flow in each region.
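
As a rough illustration of deriving a pulse rate from a color time series, the sketch below counts local maxima in the mean color intensity of a single facial region. It is a simplified stand-in, not the multi-region method described above: it assumes the per-frame intensity has already been extracted and is reasonably noise-free. The function name `estimate_pulse_bpm` and its parameters are hypothetical.

```python
def estimate_pulse_bpm(signal, fps):
    """Estimate pulse rate in beats per minute by counting local maxima
    that lie above the mean of a per-frame color-intensity signal.

    signal: list of floats, one mean intensity value per frame.
    fps: frame rate of the video in frames per second.
    """
    if len(signal) < 3 or fps <= 0:
        raise ValueError("need at least 3 samples and a positive frame rate")
    mean = sum(signal) / len(signal)
    peaks = 0
    for prev, cur, nxt in zip(signal, signal[1:], signal[2:]):
        # A peak is a sample above the mean that is higher than its
        # left neighbor and not lower than its right neighbor.
        if cur > mean and cur > prev and cur >= nxt:
            peaks += 1
    duration_minutes = len(signal) / fps / 60.0
    return peaks / duration_minutes
```

Real footage would need filtering before peak counting, since lighting changes and head motion add noise to the color signal.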

かかる所定の生体情報は、生体情報登録部１１３に予め登録されている。すなわち、生体
情報登録部１１３は、例えば緊張の有無等の精神状態、体調の良否等の身体状態の検知に
用いる生体情報を、例えば、公知の人工知能技術を用いた機械学習により予め記憶してい
る。例えば、本実施形態において、受講生Ａにおいてミリ秒単位での表情の変化、瞳孔の
開き、脈拍の速さ（脈拍数）、顔面の紅潮、発汗具合等、受講生Ａが無意識に支配されて
いる情動を読み取り得る生体情報が学習済みであるとする。 Such predetermined biometric information is registered in advance in the biometric information registration unit 113. That is, the biometric information registration unit 113 stores in advance, for example through machine learning using known artificial intelligence technology, biometric information used to detect mental states such as the presence or absence of tension and physical states such as good or poor physical condition. For example, in this embodiment, it is assumed that biometric information from which the emotions unconsciously governing student A can be read has already been learned, such as millisecond-scale changes in facial expression, pupil dilation, pulse speed (pulse rate), facial flushing, and degree of sweating.

抽出部１０１は、所定の特定条件が満たされた場合、例えば、生体情報検出部１１２によ
り検出された生体情報と、生体情報登録部１１３に予め登録されている生体情報とが一致
した場合に、当該一致した生体情報に対応するフレーム群を特定のフレーム群として教室
映像から抽出する。 When a predetermined specific condition is met, for example, when the biometric information detected by the biometric information detection unit 112 matches the biometric information pre-registered in the biometric information registration unit 113, the extraction unit 101 extracts a group of frames corresponding to the matching biometric information from the classroom video as a specific group of frames.

次に、このように構成された在宅個別指導システムの動作について説明する。図１１は、
本開示の第４の実施の形態による教室映像配信方法の処理の流れを説明するフローチャー
トである。 Next, the operation of the home individual instruction system configured in this way will be described. FIG. 11 is a flowchart illustrating the processing flow of a classroom video distribution method according to the fourth embodiment of the present disclosure.

ここでは、人間には肉体的安全を保つために遺伝的に備わっているバイアスがあり、見慣
れないもの、理解しにくいものに対しては瞬間的に異常を感じるという知見のもとで、受
講生Ａが緊張状態に陥ったシーンを教室映像からピックアップする場面を例に挙げて説明
する。 Here, based on the knowledge that humans have a genetically built-in bias for maintaining physical safety and momentarily sense an abnormality when they encounter something unfamiliar or difficult to understand, the description takes as an example the case of picking out, from the classroom video, the scenes in which student A became tense.

まず、講義が開始される時刻になると、各受講生Ａ、Ｂ、Ｃは受講生端末２０Ａ、２０Ｂ
、２０ＣをネットワークＮＷ経由で在宅個別指導システムに接続して、講師Ｔの講義開始
を待つ。講師Ｔは、教室映像配信装置１０が備えるモニタ１４を見て各受講生Ａ、Ｂ、Ｃ
が受講態勢にあるか否かを判断し、受講態勢が整っていれば、講義を開始する。すなわち
、撮像部１２並びに撮像部２２による撮像動作が開始されるとともに、収音部１３並びに
収音部２３による収音動作が開始される（ステップＳ４００）。 First, when the time for the lecture to start arrives, each of the students A, B, and C connects the student terminals 20A, 20B, and 20C to the home individual instruction system via the network NW and waits for the instructor T to start the lecture. The instructor T looks at the monitor 14 of the classroom video distribution device 10 to judge whether each of the students A, B, and C is ready to attend, and starts the lecture once they are ready. That is, the imaging units 12 and 22 start imaging operations, and the sound collection units 13 and 23 start sound collection operations (step S400).

そして、抽出部１０１は、撮像部１２で取得される講師映像と、撮像部２２で取得される
受講生映像とを適宜取捨選択して合成すると共に、収音部１３又は収音部２３で収音され
た音を組み合わせて教室映像を生成すると共に、生成した教室映像をネットワークＮＷ経
由で受講生端末２０Ａ、２０Ｂ、２０Ｃに配信する（ステップＳ４０２）。 Then, the extraction unit 101 appropriately selects and synthesizes the instructor video captured by the imaging unit 12 and the student video captured by the imaging unit 22, combines the sounds collected by the sound collection unit 13 or the sound collection unit 23 to generate a classroom video, and distributes the generated classroom video to the student terminals 20A, 20B, and 20C via the network NW (step S402).

次に、生体情報検出部１１２は、公知の技術を用いて、教室映像を構成する複数のフレー
ムの夫々について、受講生Ａの脈拍数を検出する（ステップＳ４０４）。 Next, the biological information detection unit 112 detects the pulse rate of student A for each of the frames constituting the classroom video using a known technique (step S404).

続いて、抽出部１０１は、生体情報検出部１１２により検出された脈拍数が、生体情報登
録部１１３において機械学習済みの脈拍数（受講生Ａが緊張状態にある脈拍数）と一致し
た場合に、当該一致した脈拍数に対応するフレーム群を特定のフレーム群として教室映像
から抽出する（ステップＳ４０６）。 Next, when the pulse rate detected by the biometric information detection unit 112 matches the pulse rate that has been machine-learned in the biometric information registration unit 113 (the pulse rate when student A is in a state of tension), the extraction unit 101 extracts a group of frames corresponding to the matching pulse rate from the classroom video as a specific group of frames (step S406).
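
Step S406 can be pictured as scanning the per-frame pulse rates and collecting runs of consecutive matching frames. The sketch below is one possible reading: it treats "match" as falling within a tolerance of the registered value, which is an assumption, since the source does not define how a match is judged. The names `matching_frame_groups` and `tolerance_bpm` are hypothetical.

```python
from typing import List, Sequence, Tuple


def matching_frame_groups(pulse_per_frame: Sequence[float],
                          registered_bpm: float,
                          tolerance_bpm: float = 5.0) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) index pairs of consecutive frames
    whose detected pulse rate is within tolerance of the registered rate."""
    groups: List[Tuple[int, int]] = []
    start = None  # index where the current matching run began, if any
    for idx, bpm in enumerate(pulse_per_frame):
        matched = abs(bpm - registered_bpm) <= tolerance_bpm
        if matched and start is None:
            start = idx                      # a new run begins
        elif not matched and start is not None:
            groups.append((start, idx - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        # The video ended while a run was still open.
        groups.append((start, len(pulse_per_frame) - 1))
    return groups
```

Each returned pair corresponds to one specific frame group in the sense used above.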

フレーム切り出し部１０５は、受講生Ａが緊張状態にある教室映像の前後において例え
ば１０秒～２０秒程度の時間間隔で連なるフレーム群を切り出す（ステップＳ４０８）。 The frame extraction unit 105 cuts out groups of consecutive frames spanning, for example, about 10 to 20 seconds before and after the portions of the classroom video in which student A is tense (step S408).

そして、生成部１０６は、これらのフレーム群を連結したダイジェスト動画を生成する（
ステップＳ４１０）。 The generation unit 106 then generates a digest video by concatenating these frame groups (step S410).
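
Steps S408 and S410 can be sketched as widening each extracted frame group by a time margin, merging groups that come to overlap, and then listing the frames in order for the digest. The function names `pad_and_merge` and `digest_frame_indices`, the default margin of 10 seconds, and the choice to merge overlapping segments are assumptions for illustration; the source states only that frames spanning about 10 to 20 seconds are cut out and concatenated.

```python
from typing import List, Sequence, Tuple


def pad_and_merge(groups: Sequence[Tuple[int, int]], total_frames: int,
                  fps: float, pad_seconds: float = 10.0) -> List[Tuple[int, int]]:
    """Widen each (first, last) frame group by pad_seconds on both sides,
    clamp to the video length, and merge segments that overlap or touch."""
    pad = int(round(pad_seconds * fps))
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(groups):
        lo = max(0, start - pad)
        hi = min(total_frames - 1, end + pad)
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or is adjacent to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def digest_frame_indices(segments: Sequence[Tuple[int, int]]) -> List[int]:
    """Concatenate the segments into one ordered list of frame indices."""
    indices: List[int] = []
    for lo, hi in segments:
        indices.extend(range(lo, hi + 1))
    return indices
```

Merging before concatenation keeps the same frame from appearing twice in the digest when two tense scenes occur close together.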

かくして、受講生Ａが緊張状態にある脈拍数が既に登録済みの状態であるので、今後上記
フローと同様の状況があれば、生体情報登録部１１３に保持されている脈拍数に従った照
合により、教室映像の中から特定のフレーム群をピックアップしてくれば、演算処理に伴
う負荷を増やさなくとも同様のダイジェスト動画を生成することが可能となる。 Thus, because the pulse rate of student A in a tense state has already been registered, whenever a similar situation arises in the future, a specific group of frames can be picked out of the classroom video simply by matching against the pulse rate held in the biometric information registration unit 113, and a similar digest video can be generated without increasing the computational load.

以上、添付図面を参照しながら本開示の好適な実施形態について詳細に説明したが、本
開示の技術的範囲はかかる例に限定されない。本開示の技術分野における通常の知識を有
する者であれば、請求の範囲に記載された技術的思想の範疇内において、各種の変更例ま
たは修正例に想到し得ることは明らかであり、これらについても、当然に本開示の技術的
範囲に属するものと了解される。 Although the preferred embodiment of the present disclosure has been described in detail above with reference to the attached drawings, the technical scope of the present disclosure is not limited to such examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various modified or amended examples within the scope of the technical ideas described in the claims, and it is understood that these also naturally belong to the technical scope of the present disclosure.

上述した各実施形態では、遠隔授業を支援する在宅個別指導システムに情報抽出装置を
適用する例について述べた。しかし、これに限らない。例えば、会議、講演会等のように
、開始時刻および終了時刻が事前に決められており、主として教室、会議室などの特定の
空間で行われる各種イベントを支援するシステムに情報抽出装置を適用してもよい。 In each of the embodiments described above, the information extraction device is applied to a home individual instruction system that supports remote classes. However, the present disclosure is not limited to this. For example, the information extraction device may be applied to a system that supports various events, such as meetings and lectures, whose start and end times are determined in advance and which are mainly held in a specific space such as a classroom or a conference room.

本明細書において説明した装置は、単独の装置として実現されてもよく、一部または全
部がネットワークで接続された複数の装置（例えばクラウドサーバ）等により実現されて
もよい。例えば、教室映像配信装置１０のストレージ１６又は制御部１８は、互いにネッ
トワークで接続された異なるサーバにより実現されてもよい。 The devices described in this specification may be realized as a single device, or may be realized by multiple devices (e.g., cloud servers) partially or entirely connected via a network. For example, the storage 16 or the control unit 18 of the classroom video distribution device 10 may be realized by different servers connected to each other via a network.

本明細書において説明した装置による一連の処理は、ソフトウェア、ハードウェア、及
びソフトウェアとハードウェアとの組合せのいずれを用いて実現されてもよい。本実施形
態に係る教室映像配信装置１０の各機能を実現するためのコンピュータプログラムを作製
し、ＰＣ等に実装することが可能である。また、このようなコンピュータプログラムが格
納された、コンピュータで読み取り可能な記録媒体も提供することができる。記録媒体は
、例えば、磁気ディスク、光ディスク、光磁気ディスク、フラッシュメモリ等である。ま
た、上記のコンピュータプログラムは、記録媒体を用いずに、例えばネットワークを介し
て配信されてもよい。 The series of processes performed by the device described in this specification may be realized using software, hardware, or a combination of software and hardware. A computer program for realizing each function of the classroom video distribution device 10 according to this embodiment can be created and installed in a PC or the like. A computer-readable recording medium on which such a computer program is stored can also be provided. Examples of the recording medium include a magnetic disk, an optical disk, a magneto-optical disk, and a flash memory. The above computer program may also be distributed, for example, via a network, without using a recording medium.

また、本明細書においてフローチャート図を用いて説明した処理は、必ずしも図示され
た順序で実行されなくてもよい。いくつかの処理ステップは、並列的に実行されてもよい
。また、追加的な処理ステップが採用されてもよく、一部の処理ステップが省略されても
よい。 In addition, the processes described herein using flowchart diagrams do not necessarily have to be performed in the order shown. Some process steps may be performed in parallel. Additional process steps may be employed, and some process steps may be omitted.

また、本明細書に記載された効果は、あくまで説明的または例示的なものであって限定
的ではない。つまり、本開示に係る技術は、上記の効果とともに、または上記の効果に代
えて、本明細書の記載から当業者には明らかな他の効果を奏しうる。 In addition, the effects described in this specification are merely descriptive or exemplary and are not limiting. In other words, the technology according to the present disclosure may achieve other effects that are apparent to a person skilled in the art from the description of this specification, in addition to or in place of the above effects.

１在宅個別指導システム
１０教室映像配信装置（情報抽出装置）
１０１抽出部
１０２波形登録部
１０３変換部
１０４表示部
１０５フレーム切り出し部
１０６生成部
１０７顔検出部
１０８算出部
１０９顔情報登録部
１１０特定部
１１１動作情報登録部
１１２生体情報検出部
１１３生体情報登録部
ＮＷネットワーク
REFERENCE SIGNS LIST
1 Home individual instruction system
10 Classroom video distribution device (information extraction device)
101 Extraction unit
102 Waveform registration unit
103 Conversion unit
104 Display unit
105 Frame extraction unit
106 Generation unit
107 Face detection unit
108 Calculation unit
109 Face information registration unit
110 Identification unit
111 Motion information registration unit
112 Biometric information detection unit
113 Biometric information registration unit
NW Network

Claims

1. A device comprising:
an acquisition unit that acquires a moving image composed of a plurality of frames;
a storage unit that stores a specific condition for identifying predetermined data included in the video;
an extracting unit that extracts a plurality of specific frame groups from the video in accordance with the specific condition;
a connecting portion that connects the extracted specific frame groups together;
an output unit that outputs digest information including a group of multiple linked frames;
an identification means for identifying at least a face image or a voice included in the digest information for each predetermined frame; and
an evaluation means for calculating an evaluation value for the classified face image.

2. The apparatus of claim 1,
A waveform registration unit that registers predetermined waveform data in advance,
The specific condition is whether or not waveform data of a sound included in the moving image matches the registered waveform data,
When both waveform data match, the extraction unit extracts a frame group corresponding to the matching waveform from the video as the specific frame group.
Device.

3. The apparatus of claim 2,
A conversion unit that converts the sounds included in the video into text information by speech recognition,
the conversion unit converts the sound corresponding to an auxiliary frame group including the specific frame group and a predetermined number of frames before and after the specific frame group.
Device.

4. An apparatus according to claim 2 or claim 3, comprising:
The information provided by the surrounding sounds, including the subject, is a mixture of conversational and non-conversational information.
Device.

5. The apparatus of claim 4,
The conversation information includes at least one of words indicating positive emotions and words indicating negative emotions.
Device.

6. An apparatus according to claim 4 or claim 5,
The non-conversational information includes information indicating at least one of clicking of the tongue, sighing, and interjections.
Device.

7. The apparatus of claim 1,
A face information registration unit that registers a predetermined face evaluation value related to a facial expression in advance,
the specific condition is whether or not a face evaluation value calculated from a facial expression included in the moving image matches the registered face evaluation value,
when the two face evaluation values match, the extraction unit extracts a frame group corresponding to the matching face evaluation values from the video as the specific frame group.
Device.

8. The apparatus of claim 7,
The face evaluation value includes an evaluation value that evaluates the degree of happiness, boredom, or tension of the person.
Device.

9. An apparatus according to claim 7 or claim 8, comprising:
The face evaluation value includes evaluation values of a person's facial expression, a direction of the person's gaze, and a direction of the person's face.
Device.

10. The apparatus of claim 1,
a motion information registration unit that registers in advance a predetermined motion evaluation value related to a motion of a person,
the specific condition is whether or not a motion evaluation value calculated from a person included in the video matches the registered motion evaluation value;
when the two action evaluation values match, the extraction unit extracts a frame group corresponding to the matching action evaluation values from the video as the specific frame group.
Device.

11. The apparatus of claim 10,
The motion evaluation value includes an evaluation value that evaluates at least one of the person's gestures, hand movements, gestures, and body language.
Device.

12. The apparatus of claim 1,
a biometric information registration unit for registering in advance a biometric evaluation value relating to predetermined biometric information,
the specific condition is whether or not a biometric evaluation value that can be calculated from a person included in the moving image matches the registered biometric evaluation value,
when the two biometric evaluation values match, the extraction unit extracts a frame group corresponding to the matching biometric evaluation values from the video as the specific frame group.
Device.

13. The apparatus of claim 12,
The biometric evaluation value includes at least one of the person's blood pressure, pulse rate, and pulse pressure.
Device.

14. An apparatus according to any one of claims 1 to 13, comprising:
a frame adding unit that adds additional frames that are consecutive to the specific frame group in time series to the specific frame group,
Device.

15. The apparatus of claim 1,
the video meeting evaluation terminal provides graph information of the evaluation value in time series;
Device.

16. The apparatus of claim 15,
the video meeting evaluation terminal calculates a plurality of evaluation values by evaluating the face image from a plurality of different viewpoints;
Device.

17. Apparatus according to any one of claims 1 to 16, comprising:
calculating the evaluation value together with the audio included in the moving image;
Device.

18. Apparatus according to any one of claims 1 to 17, comprising:
calculating the evaluation value together with an object other than the face image included in the moving image;
Device.