JP7728580B2

JP7728580B2 - Video conference analysis system and video conference analysis program

Info

Publication number: JP7728580B2
Application number: JP2021214096A
Authority: JP
Inventors: 和幸五島
Original assignee: QUALITIA CO., LTD.
Current assignee: QUALITIA CO., LTD.
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2025-08-25
Anticipated expiration: 2041-12-28
Also published as: JP2023097789A

Description

This invention relates to a video conference analysis system and a video conference analysis program that analyze the statements made by participants in a video conference held over a network.

With the spread of high-speed, high-capacity communication environments, it has become common to hold video conferences by connecting participants in remote locations over a network. Participants do not need to gather in one place, which frees them from travel time and the physical fatigue of travel and lets them take part in conferences efficiently. Proposals have also been made to make video conferences themselves more efficient, such as the conference support system described in Patent Document 1. This conference support system identifies attendees by facial recognition and can display attendees' statements on a display as text in real time while also recording them. It can also generate minutes in summary form and record them in association with information identifying the conference. Video conferencing thus lends itself well to efficiency improvements, and various proposals have been made.

Patent Document 1: JP 2019-61594 A

However, because a video conference takes place through a screen rather than face to face, there is concern that participants' emotional restraint may weaken. For example, participants may become emotional and more readily say or do things that make other participants uncomfortable.

Because of these concerns, there has been demand for a system or program that can analyze the emotional state of participants during a video conference and the state of harassment, that is, the degree of harassment contained in their statements.

The present invention was made in view of these circumstances. Its object is to provide a video conference analysis system and a video conference analysis program that analyze the statements of video conference participants so that the participants' emotional state and the state of harassment contained in their statements can be ascertained.

To solve this problem, the invention of claim 1 is a video conference analysis system that analyzes statements made by participants in a video conference held over a network, comprising: an image and audio receiving unit that receives audio data of the participants' statements over the network; a speech-to-text conversion unit that converts the audio data received by the image and audio receiving unit into text; a text analysis unit that analyzes the text indicating the content of the participants' statements, composed of the characters converted from the audio data by the speech-to-text conversion unit, to obtain an utterance analysis result; a screen composition unit that composes a display screen showing the text and the utterance analysis result obtained by the text analysis unit; a screen providing unit that transmits the display screen composed by the screen composition unit over the network to the participants in the video conference; and a control unit that controls the operation of having the speech-to-text conversion unit convert the audio data received by the image and audio receiving unit into text to obtain the text, having the text analysis unit analyze the text to obtain the utterance analysis result, having the screen composition unit compose the display screen, and having the screen providing unit transmit the display screen. The text analysis unit performs analysis based on the text indicating the content of the participants' statements and estimates an emotion level indicating the emotional state and a harassment level indicating the degree of harassment to obtain the utterance analysis result.
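The control flow recited in claim 1 (receive audio, convert to text, analyze, compose a screen, transmit) can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `UtteranceAnalysis`, `DisplayScreen`, and `control_unit`, the 0.0 to 1.0 level scale, and the callables passed in are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UtteranceAnalysis:
    emotion_level: float     # assumed scale: 0.0 (calm) to 1.0 (highly emotional)
    harassment_level: float  # assumed scale: 0.0 (none) to 1.0 (severe)


@dataclass
class DisplayScreen:
    text: str                    # text of the participant's statement
    analysis: UtteranceAnalysis  # utterance analysis result shown with the text


def control_unit(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    analyze_text: Callable[[str], UtteranceAnalysis],
    send: Callable[[DisplayScreen], None],
) -> DisplayScreen:
    """Run one pass of the claim 1 flow for a received chunk of audio data."""
    text = speech_to_text(audio)        # speech-to-text conversion unit
    analysis = analyze_text(text)       # text analysis unit
    screen = DisplayScreen(text=text, analysis=analysis)  # screen composition unit
    send(screen)                        # screen providing unit
    return screen
```

Passing the conversion, analysis, and transmission steps in as callables keeps the sketch independent of any particular speech recognizer, analysis model, or network transport.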

In the invention of claim 2, in addition to the configuration of claim 1, the speech-to-text conversion unit obtains the text by inputting the audio data received by the image and audio receiving unit into a trained speech-to-text conversion model and performing computation to convert the audio data into text, the model having been generated by machine learning using, as training data, combinations of training audio data and correct text data corresponding to the training audio data. The text analysis unit obtains the utterance analysis result by inputting the obtained text into a trained text analysis model and performing computation to estimate the emotion level and the harassment level, the model having been generated by machine learning using, as training data, combinations of training text and correct emotion data corresponding to the training text, and combinations of the training text and correct harassment data corresponding to the training text.

In the invention of claim 3, in addition to the configuration of claim 1 or 2, the system comprises a translation unit that translates source text written in a source language into a predetermined target language. The control unit controls the operation of having the translation unit translate the text indicating the content of the participants' statements to obtain a translation; the screen composition unit composes a translation display screen that displays the translation of the text produced by the translation unit; and the screen providing unit transmits the translation display screen composed by the screen composition unit over the network to the participants in the video conference.

In the invention of claim 4, in addition to the configuration of claim 3, the translation unit obtains a translation of the training source text by inputting the source text into a trained translation model and performing computation, the model having been generated by machine learning using, as training data, combinations of training source text and a correct translation of that training source text.

In the invention of claim 5, in addition to the configuration of any one of claims 1 to 4, the system comprises an image analysis unit that performs analysis based on image data capturing the participants and estimates the emotion level to obtain an image analysis result. The image and audio receiving unit receives the image data of the participants over the network; the control unit controls the operation of having the image analysis unit analyze the image data received by the image and audio receiving unit; the image analysis unit obtains the image analysis result by inputting the image data into a trained image analysis model and performing computation, the model having been generated by machine learning using, as training data, combinations of training face images and the correct emotion type corresponding to each training face image; the screen composition unit composes an image analysis display screen that displays the image analysis result obtained by the image analysis unit; and the screen providing unit transmits the image analysis display screen composed by the screen composition unit over the network to the participants in the video conference.

In the invention of claim 6, in addition to the configuration of any one of claims 1 to 4, the system comprises an image analysis unit that performs analysis based on image data capturing the participants and estimates the emotion level to obtain an image analysis result. The image and audio receiving unit receives the image data of the participants over the network; the control unit controls the operation of having the image analysis unit analyze the image data received by the image and audio receiving unit; the image analysis unit extracts a face image from the image data and obtains the image analysis result by performing analysis using at least one of the following: the shape of the eyes, changes in the shape of the eyes, the shape of the eyebrows, changes in the shape of the eyebrows, the shape of the corners of the mouth (the parts at both sides of the lips), changes in the shape of the corners of the mouth, the shape of the cheeks, changes in the shape of the cheeks, the frequency with which teeth appear, and changes in the frequency with which teeth appear; the screen composition unit composes an image analysis display screen that displays the image analysis result obtained by the image analysis unit; and the screen providing unit transmits the image analysis display screen composed by the screen composition unit over the network to the participants in the video conference.

In the invention of claim 7, in addition to the configuration of any one of claims 1 to 4, the system comprises a voice analysis unit that performs analysis based on the audio data received by the image and audio receiving unit and estimates the emotion level and the harassment level to obtain a voice analysis result. The control unit controls the operation of having the voice analysis unit analyze the audio data; the voice analysis unit obtains the voice analysis result by inputting the received audio data into a trained voice analysis model and performing computation to estimate the emotion level and the harassment level, the model having been generated by machine learning using, as training data, combinations of training audio data and correct emotion data corresponding to the training audio data, and combinations of the training audio data and correct harassment data corresponding to the training audio data; the screen composition unit composes a voice analysis display screen that displays the voice analysis result obtained by the voice analysis unit; and the screen providing unit transmits the voice analysis display screen composed by the screen composition unit over the network to the participants in the video conference.

In the invention of claim 8, in addition to the configuration of any one of claims 1 to 4, the system comprises a voice analysis unit that performs analysis based on the audio data received by the image and audio receiving unit and estimates the emotion level and the harassment level to obtain a voice analysis result. The control unit controls the operation of having the voice analysis unit analyze the audio data; the voice analysis unit obtains the voice analysis result by performing analysis using at least one of the following: voice volume, changes in voice volume, voice pitch, changes in voice pitch, speaking rate, changes in speaking rate, the frequency of speaking over the words of other participants, and changes in the frequency of speaking over the words of other participants; the screen composition unit composes a voice analysis display screen that displays the voice analysis result obtained by the voice analysis unit; and the screen providing unit transmits the voice analysis display screen composed by the screen composition unit over the network to the participants in the video conference.
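Two of the acoustic cues listed in claim 8, voice volume (with its change) and the frequency of speaking over other participants, can be sketched in Python as follows. The patent does not specify how these quantities are measured; using root-mean-square amplitude for volume and counting utterances that begin inside another participant's speech interval are assumptions, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds) of one utterance


def rms_volume(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of one audio frame (assumed volume measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def volume_change(previous: Sequence[float], current: Sequence[float]) -> float:
    """Change in volume between two consecutive frames."""
    return rms_volume(current) - rms_volume(previous)


def overlap_count(speaker: Sequence[Interval], others: Sequence[Interval]) -> int:
    """Count the speaker's utterances that start while another participant is talking."""
    return sum(
        1
        for start, _end in speaker
        if any(o_start < start < o_end for o_start, o_end in others)
    )
```

Dividing `overlap_count` by the elapsed conference time would give a frequency, and comparing that frequency across successive time windows would give its change.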

In the invention of claim 9, in addition to the configuration of claim 1, the system comprises: an image analysis unit that performs analysis based on image data capturing the participants and estimates the emotion level to obtain an image analysis result; a voice analysis unit that performs analysis based on the audio data received by the image and audio receiving unit and estimates the emotion level and the harassment level to obtain a voice analysis result; and a judgment unit that comprehensively evaluates the utterance analysis result obtained by the text analysis unit, the image analysis result obtained by the image analysis unit, and the voice analysis result obtained by the voice analysis unit, and estimates at least one of an overall emotion level and an overall harassment level to obtain an overall judgment result. The image and audio receiving unit receives the image data of the participants over the network; the control unit controls the operation of having the image analysis unit analyze the image data, having the voice analysis unit analyze the audio data, and having the judgment unit comprehensively evaluate the utterance analysis result, the image analysis result, and the voice analysis result to obtain the overall judgment result; the screen composition unit composes an overall judgment display screen that displays the overall judgment result obtained by the judgment unit; and the screen providing unit transmits the overall judgment display screen composed by the screen composition unit over the network to the participants in the video conference.
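One simple way to picture the comprehensive evaluation of claim 9 is a weighted average of the per-modality results. The patent does not state how the judgment unit combines them, so the weighted mean, the default weights, and the names `weighted_mean` and `overall_judgment` below are assumptions. Note that the image analysis result contributes only an emotion level, so the overall harassment level is computed from the text and voice results alone.

```python
from typing import Dict, Sequence, Tuple


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average; the weights need not sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(values, weights)) / total


def overall_judgment(
    text_emotion: float,
    text_harassment: float,
    image_emotion: float,
    voice_emotion: float,
    voice_harassment: float,
    weights: Tuple[float, float, float] = (0.5, 0.2, 0.3),  # text, image, voice (assumed)
) -> Dict[str, float]:
    """Combine text, image, and voice results into overall levels."""
    w_text, w_image, w_voice = weights
    return {
        "overall_emotion_level": weighted_mean(
            [text_emotion, image_emotion, voice_emotion],
            [w_text, w_image, w_voice],
        ),
        # Image analysis estimates emotion only, so it is left out here.
        "overall_harassment_level": weighted_mean(
            [text_harassment, voice_harassment],
            [w_text, w_voice],
        ),
    }
```

A learned fusion model or a rule such as taking the maximum across modalities would be equally consistent with the claim wording; the weighted mean is chosen here only because it is easy to inspect.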

In the invention of claim 10, in addition to the configuration of any one of claims 1 to 9, the system comprises an inappropriate phrase detection unit that detects whether the text contains any pre-registered inappropriate phrase. When the inappropriate phrase detection unit detects an inappropriate phrase, the control unit performs an inappropriate phrase blocking measure that includes at least one of: stopping transmission of the portion of the audio data corresponding to the inappropriate phrase that would be sent over the network to the participants in the video conference; deleting the inappropriate phrase from the text; and replacing the inappropriate phrase in the text with an appropriate phrase corresponding to it.
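The text-side blocking measures of claim 10 (deleting a registered phrase or replacing it with a corresponding appropriate phrase) can be sketched in Python as below. The registry format, the case-insensitive matching, and the function name `block_inappropriate` are assumptions; stopping the corresponding portion of the audio transmission is not modeled here.

```python
import re
from typing import Dict, Optional, Tuple


def block_inappropriate(
    text: str, registry: Dict[str, Optional[str]]
) -> Tuple[str, int]:
    """Apply deletion or replacement for each registered phrase.

    `registry` maps an inappropriate phrase to its replacement, or to None
    when the phrase should simply be deleted. Returns the cleaned text and
    the number of detections.
    """
    hits = 0
    for phrase, replacement in registry.items():
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the substitute.
        substitute = replacement or ""
        text, count = pattern.subn(lambda _m, s=substitute: s, text)
        hits += count
    # Collapse the extra whitespace that deletions leave behind.
    return " ".join(text.split()), hits
```

For example, with the hypothetical registry `{"badword": "***", "rudeword": None}`, the input `"that badword plan is rudeword nonsense"` becomes `"that *** plan is nonsense"` with two detections.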

In the invention of claim 11, in addition to the configuration of claim 10, when the frequency with which the inappropriate phrase detection unit detects inappropriate phrases exceeds a predetermined threshold, the control unit controls the operation of having the screen composition unit compose a warning display screen and having the screen providing unit transmit the warning display screen composed by the screen composition unit over the network to the participants in the video conference.
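The threshold check of claim 11 can be sketched as a sliding-window counter. The patent does not define how detection frequency is measured; counting detections within a fixed time window, and the class name `DetectionFrequencyMonitor` with its parameters, are assumptions for illustration.

```python
from collections import deque
from typing import Deque


class DetectionFrequencyMonitor:
    """Track inappropriate-phrase detections within a sliding time window."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds  # assumed length of the window
        self.threshold = threshold            # warn when the count exceeds this
        self._times: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one detection; return True if a warning screen should be sent."""
        self._times.append(timestamp)
        # Drop detections that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.threshold
```

With a 60-second window and a threshold of 2, the third detection inside one minute would trigger the warning, while an isolated detection much later would not.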

The invention of claim 12 is a video conference analysis program that analyzes statements made by participants in a video conference held over a network, comprising: an image and audio reception process that receives audio data of the participants' statements over the network; a speech-to-text conversion process that converts the audio data received in the image and audio reception process into text; and a text analysis process that analyzes the text indicating the content of the participants' statements, composed of the characters converted from the audio data in the speech-to-text conversion process, to obtain an utterance analysis result. The text analysis process performs analysis based on the text indicating the content of the participants' statements and estimates an emotion level indicating the emotional state and a harassment level indicating the degree of harassment to obtain the utterance analysis result.

In the invention of claim 13, in addition to the configuration of claim 12, the program comprises a translation process that translates source text written in a source language into a predetermined target language, and the translation process translates the text indicating the content of the participants' statements to generate a translation.

In the invention of claim 14, in addition to the configuration of claim 12, the program comprises an image analysis process that performs analysis based on image data capturing the participants and estimates the emotion level to obtain an image analysis result. The image and audio reception process receives the image data of the participants over the network, and the image analysis process extracts a face image from the image data and obtains the image analysis result by performing analysis using at least one of the following: the shape of the eyes, changes in the shape of the eyes, the shape of the eyebrows, changes in the shape of the eyebrows, the shape of the corners of the mouth (the parts at both sides of the lips), changes in the shape of the corners of the mouth, the shape of the cheeks, changes in the shape of the cheeks, the frequency with which teeth appear, and changes in the frequency with which teeth appear.

In the invention of claim 15, in addition to the configuration of claim 12, the program comprises a voice analysis process that performs analysis based on the audio data received in the image and audio reception process and estimates the emotion level and the harassment level to obtain a voice analysis result. The voice analysis process obtains the voice analysis result by performing analysis using at least one of the following: voice volume, changes in voice volume, voice pitch, changes in voice pitch, speaking rate, changes in speaking rate, the frequency of speaking over the words of other participants, and changes in the frequency of speaking over the words of other participants.

In the invention of claim 16, in addition to the configuration of claim 12, the program comprises: an image analysis process that performs analysis based on image data capturing the participants and estimates the emotion level to obtain an image analysis result; a voice analysis process that performs analysis based on the audio data received in the image and audio reception process and estimates the emotion level and the harassment level to obtain a voice analysis result; and a judgment process that comprehensively evaluates the utterance analysis result obtained in the text analysis process, the image analysis result obtained in the image analysis process, and the voice analysis result obtained in the voice analysis process, and estimates at least one of an overall emotion level and an overall harassment level to obtain an overall judgment result. The image and audio reception process receives the image data of the participants over the network; the image analysis process analyzes the image data to obtain the image analysis result; the voice analysis process analyzes the audio data to obtain the voice analysis result; and the judgment process comprehensively evaluates the utterance analysis result, the image analysis result, and the voice analysis result to obtain the overall judgment result.

According to the invention of claim 1, the audio data of a video conference participant's statements is converted into text, producing text that indicates the content of the statements. Analysis is then performed on this text to estimate an emotion level indicating the emotional state and a harassment level indicating the degree of harassment, yielding an utterance analysis result. A display screen showing the text and the utterance analysis result is then composed and transmitted to the video conference participants.

In this way, the emotion level and the harassment level can be analyzed from the statements of video conference participants, and the utterance analysis result can be shared among the participants. As a result, when a participant becomes emotional, for example, that participant can recognize the state and exercise self-restraint, and the other participants can also recognize it and encourage the participant to calm down.

In addition, because the text indicating the content of participants' statements is transmitted to the video conference participants and can be checked on screen, the conference can continue without interruption even when the connection is poor and the audio is hard to hear. Participants with hearing impairments can also take part in the conference.

According to the invention of claim 2, the speech-to-text conversion unit converts the audio data into text using a trained speech-to-text conversion model generated by machine learning, a form of artificial intelligence (AI). The audio data can therefore be converted into text stably and with high accuracy. Likewise, the text analysis unit obtains the emotion level and the harassment level using a trained text analysis model generated by machine learning. The emotion level and the harassment level can therefore be estimated stably and accurately.

According to the invention of claim 3, the text indicating the content of video conference participants' statements is translated into a predetermined target language, so participants who use different languages can communicate smoothly by referring to the translation.

According to the invention of claim 4, the translation unit translates the source text using a trained translation model generated by machine learning. The text can therefore be translated reliably and with high accuracy.

According to the invention of claim 5, the image analysis unit estimates the emotion level from the image data using a trained image analysis model generated by machine learning. The emotion level can therefore be obtained from the image data stably and with high accuracy.

According to the invention of claim 6, the emotion level is estimated from image data capturing the video conference participants, yielding an image analysis result. An image analysis display screen showing this result is then composed and transmitted to the video conference participants, so the image analysis result is shared with them. Because the image analysis result is derived from the participants' image data, the emotion level can be analyzed using data different from the text indicating the content of their statements, and analysis results can be obtained from multiple perspectives.

According to the invention of claim 7, the voice analysis unit estimates the emotion level and the harassment level from the audio data using a trained voice analysis model generated by machine learning. The emotion level and the harassment level can therefore be detected from the audio data accurately and reliably.

According to the invention of claim 8, the emotion level and the harassment level are estimated from the audio data of video conference participants' statements, yielding a voice analysis result. A voice analysis display screen showing this result is then composed and transmitted to the video conference participants, so the voice analysis result is shared with them. Because the voice analysis result is derived from the participants' audio data, the emotion level and the harassment level can be analyzed using the audio data itself rather than the text indicating the content of the statements, and analysis results can be obtained from multiple perspectives.

According to the invention of claim 9, an overall judgment result is obtained by comprehensively evaluating the utterance analysis result derived from the text indicating the content of participants' statements, the image analysis result derived from the participants' image data, and the voice analysis result derived from the audio data of their statements. An overall judgment display screen showing this result is then composed and transmitted to the video conference participants. Because the utterance analysis result, the image analysis result, and the voice analysis result are combined, a more multifaceted analysis result can be obtained.

Furthermore, according to the invention of claim 10, the text indicating the content of a participant's utterance is checked for inappropriate words and phrases, and if any are detected, an inappropriate-phrase blocking measure is taken. This measure prevents inappropriate phrases that may offend others from being conveyed to the conference participants, so participants can take part in the conference with peace of mind.

Furthermore, according to the invention of claim 11, if the detection frequency of inappropriate phrases exceeds a predetermined threshold, a warning display screen is transmitted to the participants. This screen lets the participants recognize objectively that inappropriate remarks are increasing, so that they can take a countermeasure such as a break.

According to the invention of claim 12, the audio data of the video conference participants' utterances is converted into characters, and text indicating the content of the utterances is composed. This text is then analyzed to estimate an emotion level indicating the emotional state and a harassment level indicating the degree of harassment, and an utterance analysis result is obtained. In this way, the emotion level and harassment level can be analyzed based on what the video conference participants say.

Because the video conference analysis program can be provided as an API (Application Programming Interface), the API can be used by a variety of video conference systems, which makes the program versatile.

Furthermore, according to the invention of claim 13, the text indicating the content of the video conference participants' utterances is translated into a predetermined target language, so that participants who use different languages can communicate smoothly by referring to the translated text.

Furthermore, according to the invention of claim 14, an emotion level is estimated based on image data captured of the video conference participants, and an image analysis result is obtained. The emotion level can therefore be analyzed using data other than the text indicating the content of the participants' utterances, and analysis results can be obtained from multiple perspectives.

Furthermore, according to the invention of claim 15, an emotion level and a harassment level are estimated based on the audio data of the video conference participants' utterances, and an audio analysis result is obtained. The emotion level and harassment level can therefore be analyzed using the participants' audio data itself rather than text indicating the content of their utterances, and analysis results can be obtained from multiple perspectives.

Furthermore, according to the invention of claim 16, an overall judgment result is obtained by comprehensively evaluating the utterance analysis result obtained from text indicating the content of the participants' utterances, the image analysis result obtained from image data of the participants, and the audio analysis result obtained from the audio data of the participants' utterances. Because the utterance analysis result, the image analysis result, and the audio analysis result are combined, a more multifaceted analysis result can be obtained.

FIG. 1 is a block diagram schematically showing the configuration of a video conference system including a video conference analysis system according to a first embodiment of the present invention.
FIG. 2 is a schematic block diagram showing the configuration of a WEB/APP server according to the first embodiment.
FIG. 3 is a schematic block diagram showing the configuration of a speech-to-text conversion server according to the first embodiment.
FIG. 4 is a schematic block diagram showing the configuration of a text analysis server according to the first embodiment.
FIG. 5 is a schematic block diagram showing the configuration of a translation server according to the first embodiment.
FIG. 6 is a schematic block diagram showing the configuration of an image and audio analysis server according to the first embodiment.
FIG. 7 is a schematic block diagram showing the configuration of a client terminal of the video conference system including the video conference analysis system according to the first embodiment.
FIG. 8 is a diagram showing an example of a video conference screen displayed on a client terminal according to the first embodiment.
FIG. 9 is a diagram showing another example of the video conference screen displayed on a client terminal according to the first embodiment.
FIG. 10(a) is a diagram showing examples of pictograms that display an emotion level according to the first embodiment, and FIG. 10(b) is a diagram showing examples of pictograms that display a harassment level.
FIG. 11 is a diagram showing an example of a screen that displays an emotion level according to the first embodiment.
FIG. 12 is a schematic flowchart of the video conference analysis system according to the first embodiment at the start of a video conference.
FIG. 13 is a schematic flowchart of the video conference analysis system according to the first embodiment when a video conference participant speaks.
FIG. 14 is a block diagram schematically showing the configuration of a video conference system including a video conference analysis system according to a second embodiment of the present invention.
FIG. 15 is a schematic block diagram showing the configuration of a WEB/APP server according to the second embodiment.
FIG. 16 is a diagram illustrating an example of the specification of a video conference analysis API according to the second embodiment.
FIG. 17 is a schematic block diagram showing the configuration of a video conference server of the video conference system including the video conference analysis system according to the second embodiment.

[First embodiment of the invention]
A first embodiment of the present invention will be described with reference to FIGS. 1 to 13.

FIG. 1 is a block diagram schematically showing the configuration of a video conference system 100 including a video conference analysis system 1 according to the first embodiment. The video conference analysis system 1 includes a WEB/APP (Web/Application) server 2, a speech-to-text conversion server 3, a text analysis server 4, a translation server 5, and an image and audio analysis server 6. Each server of the analysis system 1 is connected to the Internet 8, which serves as the "network." Client terminals 7_1, 7_2, 7_3, ..., 7_n (hereinafter referred to as "client terminals 7") operated by the video conference participants are also connected to the Internet 8, so that participants can join the video conference via the Internet 8. The video conference system 100 is thus configured to include the video conference analysis system 1 and the client terminals 7.

The video conference system 100 not only lets participants in distant locations hold a video conference. The video conference analysis system 1 also analyzes the participants' utterances, estimates their emotional state and the degree of harassment contained in the utterances, and displays the results on each client terminal 7.

The servers that make up the video conference analysis system 1, and the client terminals 7, are described below.

<WEB/APP server>
The WEB/APP server 2 receives the image and audio data of the participants transmitted from each participant's client terminal 7, composes the video conference screen and the conference audio from this data, and transmits them to each client terminal 7. As a result, the participants can view the conference screen and hear the conference audio through their client terminals 7, and the video conference can proceed. The WEB/APP server 2 thus provides the video conference function itself. In addition, the WEB/APP server 2 has the image and audio data transmitted from each client terminal 7 processed and analyzed by the speech-to-text conversion server 3, the text analysis server 4, the translation server 5, and the image and audio analysis server 6, which are described later. Based on the analysis results obtained in this way, the WEB/APP server 2 composes a display screen to be shown on the client terminals 7 and transmits it to each client terminal 7. Through this screen, each participant can check the analysis results for themselves and for the other participants.

As shown in the schematic block diagram of FIG. 2, the WEB/APP server 2 includes a WEB/APP server control unit 20, a screen composition unit 21, a conference setting unit 22, a minutes recording unit 23, a judgment unit 24, an image and audio receiving unit 25, a screen providing unit 26, a communication unit 28, and a storage unit 29.

The WEB/APP server control unit 20, which serves as the "control unit," includes a CPU (not shown) that executes programs, performs arithmetic processing, and controls the elements that make up the WEB/APP server 2. The WEB/APP server control unit 20 executes programs stored in an auxiliary storage device (not shown), a non-volatile storage device that forms part of the storage unit 29, and thereby operates the elements of the WEB/APP server 2. The auxiliary storage device may be, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive). When a program is executed, a RAM (not shown), a volatile memory that forms part of the storage unit 29, is used as the work area for program execution and arithmetic processing by the CPU.

Specifically, the auxiliary storage device of the storage unit 29 stores a program that runs the video conference, and a program that has each server analyze the image and audio data of the participants transmitted from each client terminal 7, composes a screen displaying the analysis results, and transmits that screen to each client terminal 7. The CPU of the WEB/APP server control unit 20 executes these programs using the RAM of the storage unit 29.

The communication unit 28 is connected to the Internet 8 and exchanges data with the servers that make up the video conference analysis system 1 and with each client terminal 7.

Under the control of the WEB/APP server control unit 20, the image and audio receiving unit 25 receives the captured images of the participants and the audio data of their utterances transmitted via the Internet 8 from each participant's client terminal 7 (image and audio reception process). The image and audio receiving unit 25 may be implemented by the communication unit 28. The image and audio receiving unit 25 or the communication unit 28 may also receive character data making up sentences entered from a character input unit 74, such as a keyboard, of each participant's client terminal 7, as well as chart data transmitted from the client terminals 7.

The WEB/APP server control unit 20 transmits the audio data of a participant's utterance received by the image and audio receiving unit 25 from the communication unit 28 to the speech-to-text conversion server 3, has the audio data converted into characters, and obtains text indicating the content of the utterance. Next, the WEB/APP server control unit 20 transmits the character data of the obtained text from the communication unit 28 to the text analysis server 4, has the text analyzed, and obtains an utterance analysis result. The WEB/APP server control unit 20 also transmits the participant's image and audio data received by the image and audio receiving unit 25 from the communication unit 28 to the image and audio analysis server 6, has the image data analyzed to obtain an image analysis result, and has the audio data analyzed to obtain an audio analysis result.
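The order of requests described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and field names (`process_utterance`, `AnalysisBundle`, `to_text`, and so on) are hypothetical, and each server is represented by a plain callable rather than by a real network request.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AnalysisBundle:
    """Results collected by the WEB/APP server for one utterance (hypothetical shape)."""
    text: str
    utterance_result: dict
    image_result: Optional[dict]
    audio_result: dict


def process_utterance(
    audio: bytes,
    image: Optional[bytes],
    language: str,
    to_text: Callable[[bytes, str], str],          # stands in for server 3
    analyze_text: Callable[[str, str], dict],      # stands in for server 4
    analyze_image: Callable[[bytes], dict],        # stands in for server 6 (image)
    analyze_audio: Callable[[bytes], dict],        # stands in for server 6 (audio)
) -> AnalysisBundle:
    # Step 1: convert the audio data into text.
    text = to_text(audio, language)
    # Step 2: analyze the text to obtain the utterance analysis result.
    utterance_result = analyze_text(text, language)
    # Step 3: analyze the image (if one was received) and the audio itself.
    image_result = analyze_image(image) if image is not None else None
    audio_result = analyze_audio(audio)
    return AnalysisBundle(text, utterance_result, image_result, audio_result)
```

Passing the servers in as callables keeps the sketch independent of any particular transport; a real deployment would replace them with requests sent through the communication unit 28.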

Under the control of the WEB/APP server control unit 20, the screen composition unit 21 composes the screens displayed on each client terminal 7, such as the video conference screens 200 and 210 described later. The display screen shows the images of the participants; text indicating the content of each utterance, obtained by converting the audio data of the utterance into characters; the utterance analysis result obtained by analyzing that text; the translation of the text into a predetermined target language (translation display screen); the image analysis result obtained by analyzing image data captured of the participants (image analysis display screen); the audio analysis result obtained by analyzing the participants' audio data (audio analysis display screen); the overall judgment result obtained by comprehensively evaluating the utterance, image, and audio analysis results (overall judgment display screen); and the text and chart data entered from the participants' client terminals 7.

Under the control of the WEB/APP server control unit 20, the screen providing unit 26 transmits the screen composed by the screen composition unit 21 to each participant's client terminal 7. The screen providing unit 26 may be implemented by the communication unit 28.

Before a conference, the conference setting unit 22 is used to configure the video conference, including the language used in the conference, whether translation is performed, and what analysis is performed, as well as the conference participants and the language each participant uses. When translation is enabled, each utterance is translated from the speaker's language into the language used by each other participant who receives the utterance at a client terminal 7. The video conference proceeds according to the settings in the conference setting unit 22.

The minutes recording unit 23 records the audio data of the participants' utterances received by the image and audio receiving unit 25, and records the character data obtained by converting that audio data into characters. It may also record the analysis results obtained from the participants' image and audio data, the text entered from the keyboard or other input device of each client terminal 7, and the chart data and image data transmitted from each client terminal 7.

Under the control of the WEB/APP server control unit 20, the judgment unit 24 comprehensively evaluates the utterance analysis result obtained by the text analysis server 4 and the image analysis result and audio analysis result obtained by the image and audio analysis server 6, estimates an overall emotion level and an overall harassment level, and obtains an overall judgment result (judgment process). As a method of comprehensively evaluating the three results, each result may, for example, be weighted before an overall judgment is made.
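One possible reading of the weighting mentioned above is a weighted average of the three levels. The following Python sketch assumes that each analysis result has already been reduced to a single percentage; the function name `overall_level`, the source keys, and the weights are hypothetical and do not come from the specification.

```python
from typing import Dict


def overall_level(levels: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the levels (percent, 0-100) of the sources that are present."""
    # Only sources that have both a level and a positive weight take part.
    used = {src: w for src, w in weights.items() if src in levels and w > 0}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted source available")
    return sum(levels[src] * w for src, w in used.items()) / total


# Usage: the utterance text is trusted most, the image least (illustrative weights).
harassment = overall_level(
    {"utterance": 80.0, "image": 40.0, "audio": 60.0},
    {"utterance": 5, "image": 2, "audio": 3},
)
print(harassment)  # 66.0
```

Normalizing by the sum of the weights actually used lets the same function handle a conference in which one source, for example the camera image, is unavailable.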

Furthermore, when an inappropriate phrase is detected in the text that the speech-to-text conversion server 3 produced from the audio data of a participant's utterance received by the image and audio receiving unit 25, the WEB/APP server control unit 20 may take an inappropriate-phrase blocking measure. For example, it may withhold the portion of the audio data corresponding to the inappropriate phrase from the audio data transmitted to each participant's client terminal 7, or delete the inappropriate phrase from the text display screen transmitted to each client terminal 7. As another blocking measure, appropriate phrases corresponding to inappropriate phrases may be stored in advance in the WEB/APP server 2, the speech-to-text conversion server 3, or elsewhere, and an inappropriate phrase contained in the text may be replaced with its corresponding appropriate phrase.
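The text-side blocking measures (deletion and replacement) can be sketched in Python as below. The function name `block_inappropriate`, the mask string, and the example phrases are hypothetical. Matching is by simple case-insensitive substring search, which is an assumption; the specification does not say how phrases are matched.

```python
import re
from typing import Dict, Optional


def block_inappropriate(
    text: str,
    registered: Dict[str, Optional[str]],
    mask: str = "***",
) -> str:
    """Replace each registered phrase with its substitute, or with `mask` if it has none."""
    # Map lower-cased phrases to their substitutes, ignoring empty phrases.
    lookup = {p.lower(): sub for p, sub in registered.items() if p}
    if not lookup:
        return text
    # Try longer phrases first so a phrase that contains a shorter one wins.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)

    def swap(match):
        substitute = lookup.get(match.group(0).lower())
        return mask if substitute is None else substitute

    return pattern.sub(swap, text)


print(block_inappropriate(
    "You idiot, shut up",
    {"idiot": None, "shut up": "please wait"},
))  # You ***, please wait
```

Substring matching can also hit harmless words that happen to contain a registered phrase, so a practical system would likely match on tokens produced by a language-specific tokenizer instead.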

Furthermore, when the number of times inappropriate phrases have been detected in the text exceeds the count set in advance in the conference setting unit 22, the WEB/APP server control unit 20 may have the screen composition unit 21 compose a warning display screen indicating that the detection frequency of inappropriate phrases has exceeded the predetermined threshold, and have the screen providing unit 26 transmit it to each participant's client terminal 7. The warning display screen may, for example, warn that inappropriate remarks are increasing or prompt the participants to take a break. If a particular participant repeatedly makes remarks containing inappropriate phrases, the screen may instead recommend that the participant leave the video conference.
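A minimal Python sketch of this threshold check follows. It assumes that "frequency" means a running count over the whole conference, which the specification leaves open; the class name `InappropriatePhraseMonitor` and the per-participant limit are hypothetical.

```python
from collections import Counter
from typing import List


class InappropriatePhraseMonitor:
    """Counts detections per participant and reports when the preset count is exceeded."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold        # count preset in the conference settings
        self.counts: Counter = Counter()  # detections per participant

    def record(self, participant_id: str, detections: int = 1) -> bool:
        """Add detections and return True if a warning screen should be sent."""
        self.counts[participant_id] += detections
        return self.total() > self.threshold

    def total(self) -> int:
        return sum(self.counts.values())

    def repeat_offenders(self, per_participant_limit: int) -> List[str]:
        """Participants whose own count exceeds the limit (candidates for a leave recommendation)."""
        return sorted(p for p, c in self.counts.items() if c > per_participant_limit)


monitor = InappropriatePhraseMonitor(threshold=2)
monitor.record("alice")          # False: 1 detection so far
monitor.record("bob")            # False: 2 detections, not yet above 2
print(monitor.record("alice"))   # True: 3 detections exceed the threshold
print(monitor.repeat_offenders(per_participant_limit=1))  # ['alice']
```

A time-windowed count (for example, detections in the last ten minutes) would be an equally plausible reading and would only require storing timestamps instead of totals.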

<Speech-to-text conversion server>
The speech-to-text conversion server 3 receives the audio data of the conference participants' utterances and converts it into characters to compose text indicating the content of the utterances. Inappropriate phrases are also registered in the speech-to-text conversion server 3 in advance, and the server detects whether the text obtained from the audio data contains any of them.

As shown in the schematic block diagram of FIG. 3, the speech-to-text conversion server 3 includes a speech-to-text conversion server control unit 30, a speech-to-text conversion unit 31, a speech-to-text conversion dictionary database 32, a trained speech-to-text conversion model holding unit 33, an inappropriate phrase storage unit 34, an inappropriate phrase detection unit 35, a communication unit 38, and a storage unit 39.

The speech-to-text conversion server control unit 30 includes a CPU (not shown) that executes programs, performs arithmetic processing, and controls the elements that make up the speech-to-text conversion server 3. The speech-to-text conversion server control unit 30 executes programs stored in an auxiliary storage device (not shown) that forms part of the storage unit 39, and thereby operates the elements of the speech-to-text conversion server 3. When a program is executed, a RAM (not shown), a volatile memory that forms part of the storage unit 39, is used as the work area for program execution and arithmetic processing by the CPU.

The communication unit 38 is connected to the Internet 8 and exchanges data with the WEB/APP server 2 and other devices. Under the control of the speech-to-text conversion server control unit 30, the communication unit 38 receives the audio data of a participant's utterance, together with the language of that audio data, transmitted from the WEB/APP server 2.

The speech-to-text conversion dictionary database 32 stores the characters that correspond to audio data, and is used as a dictionary when extracting characters from audio data. The database 32 contains dictionary databases for multiple languages, so that audio data can be converted into characters in a specified language.

The trained speech-to-text conversion model holding unit 33 holds trained speech-to-text conversion models generated by machine learning, using as training data a large number of pairs of training audio data and the corresponding correct character data. To support multiple languages, a model generated for each language is held. A language is specified, audio data is input into the trained model for that language, and the computation outputs the characters in the specified language that are estimated to correspond to the audio data. Because machine learning, a form of artificial intelligence, is used, the audio data is converted into characters accurately and stably.

Under the control of the speech-to-text conversion server control unit 30, the speech-to-text conversion unit 31 converts the audio data received by the communication unit 38 into characters in the language received with it (speech-to-text conversion process). Specifically, the speech-to-text conversion unit 31 converts the audio data into characters using the speech-to-text conversion dictionary database 32. It also converts the same audio data into characters using the trained speech-to-text conversion model for that language held in the trained speech-to-text conversion model holding unit 33. It then compares the characters obtained with the dictionary database 32 against the characters obtained with the trained model, and adjusts and corrects them to arrive at the best result. The speech-to-text conversion unit 31 does not have to use both the dictionary database 32 and the trained model; it may convert audio data into characters using only one of them.

The speech-to-text conversion server control unit 30 transmits the character data of the text obtained by converting the audio data into characters from the communication unit 38 to the WEB/APP server 2.

The inappropriate phrase storage unit 34 stores inappropriate phrases, registered in advance, that may cause discomfort to others or constitute harassment.

Under the control of the speech-to-text conversion server control unit 30, the inappropriate phrase detection unit 35 detects whether the text obtained by the speech-to-text conversion unit 31 contains any of the inappropriate phrases stored in the inappropriate phrase storage unit 34.

If an inappropriate phrase is detected in the text, the speech-to-text conversion server control unit 30 transmits, from the communication unit 38 to the WEB/APP server 2, a notification indicating that the text composed from the received audio data contains an inappropriate phrase, together with the detected phrase. The speech-to-text conversion server 3 may also store appropriate phrases corresponding to inappropriate phrases, and transmit both the detected inappropriate phrase and its corresponding appropriate phrase to the WEB/APP server 2. This allows the WEB/APP server 2 to replace the inappropriate phrase with the appropriate one.
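The detection step and the notification it produces can be sketched in Python as follows. The payload keys (`contains_inappropriate`, `detected`, `suggested`) and the function name are hypothetical; the specification says only that a notification, the detected phrase, and optionally a corresponding appropriate phrase are sent.

```python
from typing import Dict, Optional


def detect_inappropriate(text: str, registered: Dict[str, Optional[str]]) -> dict:
    """Build the notification sent to the WEB/APP server for one piece of text.

    `registered` maps each inappropriate phrase to an appropriate substitute,
    or to None when no substitute has been stored.
    """
    # Simple substring test against every registered phrase (an assumption).
    found = [phrase for phrase in registered if phrase and phrase in text]
    return {
        "contains_inappropriate": bool(found),
        "detected": found,
        # Only phrases that have a stored substitute appear here.
        "suggested": {p: registered[p] for p in found if registered[p] is not None},
    }


notice = detect_inappropriate(
    "shut up and listen",
    {"shut up": "please wait", "idiot": None},
)
print(notice["detected"])   # ['shut up']
print(notice["suggested"])  # {'shut up': 'please wait'}
```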

On receiving the notification that an inappropriate phrase is present, together with the phrase itself, the WEB/APP server 2 can take the inappropriate-phrase blocking measures described above.

<Text analysis server>
The text analysis server 4 analyzes the text composed by converting the audio data of the conference participants' utterances into characters, estimates an emotion level indicating the emotional state and a harassment level indicating the degree of harassment, and obtains an utterance analysis result (text analysis process).

The emotion level may be a combination of an emotion type, such as "joy," "sadness," "anger," "disgust," "fear," or "surprise," and the intensity of that emotion expressed as a percentage. Alternatively, it may be the estimated probability of an emotion such as "joy" or "sadness," expressed as a percentage, or a percentage indicating where the emotional state lies between a good emotion and a bad emotion.

The harassment level may be a combination of a harassment type, such as "power harassment" or "sexual harassment," and the intensity of that harassment expressed as a percentage. Alternatively, it may be the estimated probability of a type of harassment such as "power harassment" or "sexual harassment," expressed as a percentage, or a percentage indicating where the state lies between harassment being present and absent.
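The first two representations (a type paired with a percentage, and a set of per-type probabilities) can be sketched in Python as below. The class name `Level` and the helper `dominant` are hypothetical and serve only to make the data shapes concrete.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Level:
    """A type paired with a percentage, e.g. ("anger", 70.0) or ("power harassment", 35.0)."""
    kind: str
    percent: float

    def __post_init__(self) -> None:
        # Percentages outside 0-100 are rejected.
        if not 0.0 <= self.percent <= 100.0:
            raise ValueError(f"percent out of range: {self.percent}")


def dominant(probabilities: Dict[str, float]) -> Level:
    """Reduce per-type estimated probabilities (in percent) to the most likely type."""
    kind = max(probabilities, key=probabilities.get)
    return Level(kind, probabilities[kind])


print(dominant({"joy": 10.0, "anger": 70.0, "sadness": 20.0}))
# Level(kind='anger', percent=70.0)
```

The third representation, a single percentage on a good-to-bad or present-to-absent scale, needs no type at all and could be stored as a plain number.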

図４に示す概略ブロック図のように文字文章分析サーバ４は、文字文章分析サーバ制御部４０、文字文章分析部４１、文字文章分析辞書データベース４２、学習済み文字文章分析モデル保持部４３、通信部４８、記憶部４９を含むように構成されている。 As shown in the schematic block diagram of Figure 4, the text analysis server 4 is configured to include a text analysis server control unit 40, a text analysis unit 41, a text analysis dictionary database 42, a trained text analysis model storage unit 43, a communication unit 48, and a memory unit 49.

文字文章分析サーバ制御部４０は、文字文章分析サーバ４を構成する各要素の制御などを行うＣＰＵ（図示せず）を含むように構成されている。記憶部４９は、補助記憶装置（図示せず）やＲＡＭ（図示せず）により構成されている。 The text analysis server control unit 40 is configured to include a CPU (not shown) that controls the elements making up the text analysis server 4. The memory unit 49 consists of an auxiliary storage device (not shown) and RAM (not shown).

通信部４８は、インターネット８に接続されて、ＷＥＢ／ＡＰＰサーバ２などとの間でデータの送受信を行う。この通信部４８では、文字文章分析サーバ制御部４０の制御に基づいて、ＷＥＢ／ＡＰＰサーバ２から送信されてくる参加者の音声データを文字に変換した文字文章の文字データやその言語を受け付ける。 The communication unit 48 is connected to the Internet 8 and exchanges data with the WEB/APP server 2 and other devices. Under the control of the text analysis server control unit 40, the communication unit 48 receives from the WEB/APP server 2 the character data of the text converted from participants' audio data, together with the language of that text.

文字文章分析辞書データベース４２には、文章を構成する語句に対応する感情レベルやハラスメントレベルがデータベースとして記憶されており、このデータベース４２は文字文章から感情レベルやハラスメントレベルを推定する際の辞書として用いられる。また、文字文章分析辞書データベース４２は、多言語の辞書データベースを有しており、言語を指定して文字文章から感情レベルなどを推定する辞書として使用される。 The text analysis dictionary database 42 stores, as a database, the emotion levels and harassment levels corresponding to the words and phrases that make up sentences, and this database 42 is used as a dictionary when estimating emotion levels and harassment levels from text. The text analysis dictionary database 42 also contains dictionary databases for multiple languages, so that a language can be specified when it is used as a dictionary for estimating emotion levels and the like from text.

学習済み文字文章分析モデル保持部４３は、学習用文章とそれに対応する正解感情データとの組合せ、学習用文章とそれに対応する正解ハラスメントデータとの組合せからなる多数の組合せを学習データとして用い、機械学習により生成させた学習済み文字文章分析モデルが保持されている。この学習済み文字文章分析モデルは、学習済み文字文章分析感情生成モデルと学習済み文字文章分析ハラスメント生成モデルから構成される。学習済み文字文章分析感情生成モデルは、学習用文章とそれに対応する、例えば、「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの正解となる感情の種類との組合せを学習の教師データとして用いて生成され、文字文章を入力して演算することにより感情レベルが求められる。感情の種類としては、「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの少なくともいずれか一つの感情を用いればよい。一方、学習済み文字文章分析ハラスメント生成モデルは、学習用文章とそれに対応する、例えば、「パワーハラスメント」や「セクシャルハラスメント」、そして、「ハラスメント無し」などの正解となるハラスメントの種類との組合せを教師データとして用いて生成され、文字文章を入力して演算することによりハラスメントレベルが求められる。ハラスメントの種類としては、「パワーハラスメント」、「セクシャルハラスメント」、「ハラスメント無し」などの少なくともいずれか一つを用いればよい。多言語に対応できるように、言語ごとの生成モデルが保持されており、言語を指定して、この学習済み文字文章分析モデルに文字文章を入力して演算させることによってその文字文章に対応すると推定される感情レベルやハラスメントレベルが出力されるようになっている。 The trained text analysis model storage unit 43 holds a trained text analysis model generated by machine learning, using as training data a large number of pairs of training sentences and their corresponding correct emotion data, and pairs of training sentences and their corresponding correct harassment data. This trained text analysis model consists of a trained text analysis emotion generation model and a trained text analysis harassment generation model. The emotion generation model is generated using, as supervised training data, pairs of training sentences and their correct emotion types, such as "joy," "sadness," "anger," "disgust," "fear," and "surprise," and it yields an emotion level when text is input and the calculation is performed. At least one emotion such as "joy," "sadness," "anger," "disgust," "fear," or "surprise" may be used as the emotion type. The harassment generation model, in turn, is generated using, as supervised training data, pairs of training sentences and their correct harassment types, such as "power harassment," "sexual harassment," and "no harassment," and it yields a harassment level when text is input and the calculation is performed. At least one of "power harassment," "sexual harassment," "no harassment," and the like may be used as the harassment type. To support multiple languages, a generation model is held for each language; when a language is specified and text is input into the trained text analysis model, the emotion level and harassment level estimated to correspond to that text are output.

文字文章分析部４１は、文字文章分析サーバ制御部４０の制御に基づいて、通信部４８で受け付けた文字データから構成される文字文章とその言語について、文字文章分析辞書データベース４２を用いその文字文章から感情レベルやハラスメントレベルを推定する。文字文章分析辞書データベース４２を用いると、文字文章を構成する語句ごとの感情レベルやハラスメントレベルが求められる。そして、それら語句ごとの感情レベルやハラスメントレベルに重み付けなどの処理を行い、文字文章全体としての感情レベルやハラスメントレベルが推定される。また、文字文章分析部４１は、文字文章とその言語について、学習済み文字文章分析モデル保持部４３に保持されている学習済み文字文章分析モデルを用いその文字文章から感情レベルやハラスメントレベルを推定する。その後、文字文章分析辞書データベース４２を用いて推定された感情レベルやハラスメントレベルと、学習済み文字文章分析モデルを用いて推定された感情レベルやハラスメントレベルとを比較して調整を行い、最適な感情レベルやハラスメントレベルになるように修正する。 Under the control of the text analysis server control unit 40, the text analysis unit 41 uses the text analysis dictionary database 42 to estimate the emotion level and harassment level of the text composed of the character data received by the communication unit 48, taking the language of the text into account. The text analysis dictionary database 42 yields an emotion level and a harassment level for each word or phrase in the text. These per-phrase levels are then weighted and otherwise processed to estimate the emotion level and harassment level of the text as a whole. The text analysis unit 41 also estimates the emotion level and harassment level of the text, for its language, using the trained text analysis model held in the trained text analysis model storage unit 43. The levels estimated with the text analysis dictionary database 42 are then compared with the levels estimated with the trained text analysis model, and adjusted so as to arrive at the optimal emotion level and harassment level.
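
The two-stage estimate described above (per-phrase dictionary levels combined by weighting, then reconciled with the trained model's estimate) might look like the following sketch. The lexicon entries, the weights, and the equal blend in `reconcile` are all assumptions made for illustration; the document does not state how the weighting or the adjustment is actually performed.

```python
from typing import Optional

# Hypothetical dictionary entries: phrase -> (harassment level in 0..1, weight).
LEXICON = {
    "useless": (0.9, 2.0),
    "hurry up": (0.5, 1.0),
    "thank you": (0.0, 1.0),
}


def dictionary_level(text: str, lexicon=LEXICON) -> Optional[float]:
    """Weighted average of the levels of every dictionary phrase found in the text."""
    lowered = text.lower()
    hits = [entry for phrase, entry in lexicon.items() if phrase in lowered]
    total_weight = sum(weight for _, weight in hits)
    if total_weight == 0:
        return None  # no dictionary phrase matched
    return sum(level * weight for level, weight in hits) / total_weight


def reconcile(dict_level: Optional[float],
              model_level: Optional[float],
              model_share: float = 0.5) -> Optional[float]:
    """Blend the two estimates; fall back to whichever one is available."""
    if dict_level is None:
        return model_level
    if model_level is None:
        return dict_level
    return model_share * model_level + (1.0 - model_share) * dict_level


if __name__ == "__main__":
    from_dictionary = dictionary_level("You are useless, hurry up")
    from_model = 0.6  # placeholder for the trained model's output
    print(reconcile(from_dictionary, from_model))
```

Returning `None` when one source is unavailable lets the same function cover the variant in which only the dictionary or only the trained model is used.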

なお、文字文章分析部４１では、文字文章分析辞書データベース４２と学習済み文字文章分析モデルの両方を利用する必要はなく、どちらか一方を利用して、文字文章から感情レベルやハラスメントレベルを推定するようにしてもよい。また、文字文章分析部４１では、感情レベルまたはハラスメントレベルの何れか一方を推定するようにしてもよい。 The text analysis unit 41 does not need to use both the text analysis dictionary database 42 and the trained text analysis model; it may use only one of them to estimate the emotion level and harassment level from the text. The text analysis unit 41 may also estimate only one of the two levels, either the emotion level or the harassment level.

このようにして参加者の発言の内容を示す文字文章を分析して推定された感情レベルやハラスメントレベルが発言分析結果となる。文字文章分析サーバ制御部４０は、この発言分析結果を通信部４８からＷＥＢ／ＡＰＰサーバ２に向けて送信する。 The emotion level and harassment level estimated in this way, by analyzing the text that represents the content of participants' remarks, constitute the utterance analysis result. The text analysis server control unit 40 transmits this utterance analysis result from the communication unit 48 to the WEB/APP server 2.

なお、文字文章分析部４１では、クライアント端末７のキーボードなどから入力された文章に対して分析を行い、感情レベルやハラスメントレベルを推定するようにしてもよい。 The text analysis unit 41 may also analyze sentences entered via the keyboard or similar device of the client terminal 7 to estimate the emotion level and harassment level.

＜翻訳サーバ＞ <Translation server>
翻訳サーバ５は、会議参加者の発言の音声データを文字に変換して構成された文字文章を翻訳言語に翻訳し、原文言語で構成される原文文章である文字文章の翻訳文を生成する（翻訳処理）。 The translation server 5 translates the text produced by converting the audio data of conference participants' speech into characters into the target language, generating a translation of that text, which is the original text written in the source language (translation processing).

図５に示す概略ブロック図のように翻訳サーバ５は、翻訳サーバ制御部５０、翻訳部５１、翻訳辞書データベース５２、学習済み翻訳モデル保持部５３、通信部５８、記憶部５９を含むように構成されている。 As shown in the schematic block diagram of Figure 5, the translation server 5 is configured to include a translation server control unit 50, a translation unit 51, a translation dictionary database 52, a trained translation model storage unit 53, a communication unit 58, and a memory unit 59.

翻訳サーバ制御部５０は、翻訳サーバ５を構成する各要素の制御などを行うＣＰＵ（図示せず）を含むように構成されている。記憶部５９は、補助記憶装置（図示せず）やＲＡＭ（図示せず）により構成されている。 The translation server control unit 50 is configured to include a CPU (not shown) that controls the various elements that make up the translation server 5. The memory unit 59 is configured from an auxiliary storage device (not shown) and RAM (not shown).

通信部５８は、インターネット８に接続されて、ＷＥＢ／ＡＰＰサーバ２などとの間でデータの送受信を行う。この通信部５８では、翻訳サーバ制御部５０の制御に基づいて、ＷＥＢ／ＡＰＰサーバ２から送信されてくる参加者の音声データを文字に変換した文字文章の文字データ、文字文章の言語である原文言語、翻訳する言語である翻訳言語を受け付ける。 The communication unit 58 is connected to the Internet 8 and transmits and receives data to and from the WEB/APP server 2 and the like. Under the control of the translation server control unit 50, this communication unit 58 receives text data of text sentences converted from participant voice data transmitted from the WEB/APP server 2, the source language in which the text sentences are written, and the translation language into which the text is to be translated.

翻訳辞書データベース５２には、原文言語の語句に対応する翻訳語がデータベースとして記憶されており、このデータベース５２は文字文章の翻訳文を生成する際の辞書として用いられる。また、翻訳辞書データベース５２は、多言語の辞書データベースを有しており、原文言語と翻訳言語を指定して原文言語の文字文章から翻訳言語の翻訳文を生成する辞書として使用される。 Translation dictionary database 52 stores translation words corresponding to words in the source language as a database, and is used as a dictionary when generating translations of text. Translation dictionary database 52 also has a multilingual dictionary database, and is used as a dictionary when generating translations of text in the source language into a target language by specifying the source language and target language.

学習済み翻訳モデル保持部５３は、学習用原文文章とその翻訳文として正解となる正解翻訳文との組合せからなる多数の組合せを学習データとして用い、機械学習により生成させた学習済み翻訳モデルが保持されている。多言語に対応できるように、言語ごとの生成モデルが保持されており、原文言語と翻訳言語を指定して、この学習済み翻訳モデルに原文言語の文字文章を入力して演算させることによってその文字文章の翻訳文が出力されるようになっている。 The trained translation model storage unit 53 stores trained translation models generated by machine learning using as training data a large number of combinations of training source sentences and their correct translations. To support multiple languages, a generation model for each language is stored, and by specifying the source language and translation language, a text sentence in the source language is input into this trained translation model and calculations are performed to output a translation of the text sentence.
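
Holding one model per language and selecting it by the specified source and target languages can be sketched as a small registry. The class name `TranslationModelStore`, the two-letter language codes, and the toy lookup "model" below are assumptions for illustration, not details given in the document.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]


class TranslationModelStore:
    """Keeps one translation callable per (source language, target language) pair."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, str], Translator] = {}

    def register(self, source: str, target: str, model: Translator) -> None:
        self._models[(source, target)] = model

    def translate(self, text: str, source: str, target: str) -> str:
        if source == target:
            return text  # nothing to translate
        try:
            model = self._models[(source, target)]
        except KeyError:
            raise ValueError(f"no model registered for {source} -> {target}") from None
        return model(text)


if __name__ == "__main__":
    store = TranslationModelStore()
    # A toy lookup table stands in for a trained model.
    toy_ja_en = {"こんにちは": "Hello"}
    store.register("ja", "en", lambda t: toy_ja_en.get(t, t))
    print(store.translate("こんにちは", "ja", "en"))
```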

翻訳部５１は、翻訳サーバ制御部５０の制御に基づいて、通信部５８で受け付けた文字データから構成される文字文章、その文字文章の原文言語、翻訳言語について、翻訳辞書データベース５２を用いその文字文章から翻訳文を生成する。また、翻訳部５１は、文字文章、その言語、翻訳言語について、学習済み翻訳モデル保持部５３に保持されている学習済み翻訳モデルを用いその文字文章の翻訳文を生成する。その後、翻訳辞書データベース５２を用いて生成された翻訳文と、学習済み翻訳モデルを用いて生成された翻訳文との調整を行い、最適な翻訳文になるように修正する。なお、翻訳部５１では、翻訳辞書データベース５２と学習済み翻訳モデルの両方を利用する必要はなく、どちらか一方を利用して、文字文章の翻訳文を生成するようにしてもよい。 Under the control of the translation server control unit 50, the translation unit 51 uses the translation dictionary database 52 to generate a translation from a text sentence composed of character data received by the communication unit 58, and the source language and translation language of the text sentence. The translation unit 51 also generates a translation of the text sentence using a trained translation model stored in the trained translation model storage unit 53 for the text sentence, the language, and the translation language. The translation generated using the translation dictionary database 52 is then adjusted with the translation generated using the trained translation model to obtain an optimal translation. Note that the translation unit 51 does not need to use both the translation dictionary database 52 and the trained translation model; it may use either one to generate a translation of the text sentence.

翻訳サーバ制御部５０は、このように生成された文字文章の翻訳文の文字データを通信部５８からＷＥＢ／ＡＰＰサーバ２に向けて送信する。 The translation server control unit 50 transmits the text data of the translation of the text sentence generated in this way from the communication unit 58 to the WEB/APP server 2.

なお、翻訳部５１では、クライアント端末７のキーボードなどから入力された文章を翻訳して翻訳文を生成するようにしてもよい。 The translation unit 51 may also translate sentences entered via a keyboard or other device on the client terminal 7 to generate translated text.

＜画像音声分析サーバ＞ <Image and audio analysis server>
画像音声分析サーバ６は、会議参加者を撮影した画像データを分析して感情レベルを推定して画像分析結果を求める（画像分析処理）とともに、会議参加者の発言の音声データを分析して感情レベルおよびハラスメントレベルを推定して音声分析結果を求める（音声分析処理）。 The image and audio analysis server 6 analyzes image data of the conference participants to estimate their emotion levels and obtain image analysis results (image analysis processing), and also analyzes audio data of the conference participants' speech to estimate their emotion levels and harassment levels and obtain voice analysis results (voice analysis processing).

図６に示す概略ブロック図のように画像音声分析サーバ６は、画像音声分析サーバ制御部６０、画像分析部６１、画像分析辞書データベース６２、学習済み画像分析モデル保持部６３、音声分析部６４、音声分析辞書データベース６５、学習済み音声分析モデル保持部６６、通信部６８、記憶部６９を含むように構成されている。 As shown in the schematic block diagram of Figure 6, the image and audio analysis server 6 is configured to include an image and audio analysis server control unit 60, an image analysis unit 61, an image analysis dictionary database 62, a trained image analysis model storage unit 63, a voice analysis unit 64, a voice analysis dictionary database 65, a trained voice analysis model storage unit 66, a communication unit 68, and a memory unit 69.

画像音声分析サーバ制御部６０は、画像音声分析サーバ６を構成する各要素の制御などを行うＣＰＵ（図示せず）を含むように構成されている。記憶部６９は、補助記憶装置（図示せず）やＲＡＭ（図示せず）により構成されている。 The image and audio analysis server control unit 60 is configured to include a CPU (not shown) that controls each element that makes up the image and audio analysis server 6. The memory unit 69 is configured from an auxiliary storage device (not shown) and RAM (not shown).

通信部６８は、インターネット８に接続されて、ＷＥＢ／ＡＰＰサーバ２などとの間でデータの送受信を行う。この通信部６８では、画像音声分析サーバ制御部６０の制御に基づいて、ＷＥＢ／ＡＰＰサーバ２から送信されてくる参加者を撮影した画像データや参加者の発言の音声データを受け付ける。 The communication unit 68 is connected to the Internet 8 and transmits and receives data to and from the WEB/APP server 2 and the like. Based on the control of the image and audio analysis server control unit 60, this communication unit 68 receives image data of participants and audio data of participants' speech sent from the WEB/APP server 2.

画像分析辞書データベース６２には、人の目、眉、唇の両脇の部分である口角、頬、口元などの顔画像の部位の形状やその形状の変化に対応する感情レベルがデータベースとして記憶されており、このデータベース６２は参加者の画像データから抽出される顔画像に基づいて感情レベルを推定する際の辞書として用いられる。 The image analysis dictionary database 62 stores a database of emotional levels corresponding to the shapes of facial image parts such as the eyes, eyebrows, corners of the mouth (the parts on either side of the lips), cheeks, and mouth, as well as changes in these shapes. This database 62 is used as a dictionary when estimating emotional levels based on facial images extracted from the participant's image data.

学習済み画像分析モデル保持部６３は、学習用顔画像とそれに対応する、例えば、「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの正解となる感情の種類との組合せからなる多数の組合せを学習データとして用い、機械学習により生成させた学習済み画像分析モデルが保持されている。感情の種類としては、「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの少なくともいずれか一つの感情を用いればよい。この学習済み画像分析モデルに参加者の画像データから抽出される顔画像を入力して演算させることによってその顔画像に対応すると推定される感情レベルが出力されるようになっている。例えば、入力された顔画像についての「喜び」や「悲しみ」や「怒り」などの各感情の推定確率が出力される。 The trained image analysis model storage unit 63 stores a trained image analysis model generated by machine learning using as training data a large number of combinations of training face images and corresponding correct emotion types, such as "joy," "sadness," "anger," "disgust," "fear," and "surprise." At least one of the emotions, such as "joy," "sadness," "anger," "disgust," "fear," and "surprise," can be used as the emotion type. By inputting a face image extracted from the participant's image data into this trained image analysis model and performing a calculation, an estimated emotion level corresponding to the face image is output. For example, an estimated probability of each emotion, such as "joy," "sadness," or "anger," for the input face image is output.

画像分析部６１は、通信部６８で受け付けた参加者の画像データから顔画像を抽出し、その顔画像の目の形状、ビデオ会議中における目の形状の変化、眉の形状、会議中における眉の形状の変化、口角の形状、口角の形状の変化、頬の形状、頬の形状の変化、歯の出現頻度、歯の出現頻度の変化などを用いて分析を行い、感情レベルを推定して画像分析結果を求める。 The image analysis unit 61 extracts facial images from the image data of participants received by the communication unit 68, and analyzes the facial images using the eye shape, changes in eye shape during the video conference, eyebrow shape, changes in eyebrow shape during the conference, shape of the corners of the mouth, changes in shape of the corners of the mouth, shape of the cheeks, changes in cheek shape, frequency of teeth, changes in frequency of teeth, etc., to estimate the emotional level and obtain the image analysis results.

すなわち、画像分析部６１は、画像音声分析サーバ制御部６０の制御に基づいて、画像データから抽出された顔画像について画像分析辞書データベース６２を用いその顔画像から感情レベルを推定する。画像分析辞書データベース６２を用いると、顔画像を構成する目、眉、口角、頬、口元などの各部位の形状やその形状の変化に対応する感情レベルが求められ、それら部位ごとの感情レベルに重み付けなどの処理を行い、顔画像全体としての感情レベルが推定される。また、画像分析部６１は、抽出された顔画像について、学習済み画像分析モデル保持部６３に保持されている学習済み画像分析モデルを用いその顔画像から感情レベルを推定する。その後、画像分析辞書データベース６２を用いて推定された感情レベルと、学習済み画像分析モデルを用いて推定された感情レベルとの調整を行い、最適な感情レベルになるように修正する。なお、画像分析部６１では、画像分析辞書データベース６２と学習済み画像分析モデルの両方を利用する必要はなく、どちらか一方を利用して、顔画像から感情レベルを推定するようにしてもよい。 That is, under the control of the image and audio analysis server control unit 60, the image analysis unit 61 uses the image analysis dictionary database 62 to estimate an emotion level from the facial image extracted from the image data. The image analysis dictionary database 62 yields the emotion level corresponding to the shape of each part of the facial image, such as the eyes, eyebrows, corners of the mouth, cheeks, and mouth, and to changes in that shape; these per-part emotion levels are then weighted and otherwise processed to estimate the emotion level of the facial image as a whole. The image analysis unit 61 also estimates an emotion level from the extracted facial image using the trained image analysis model held in the trained image analysis model storage unit 63. The emotion level estimated with the image analysis dictionary database 62 is then reconciled with the emotion level estimated with the trained image analysis model, and corrected so as to arrive at the optimal emotion level. The image analysis unit 61 does not need to use both the image analysis dictionary database 62 and the trained image analysis model; it may use only one of them to estimate the emotion level from the facial image.

このようにして画像データを分析して推定された感情レベルが画像分析結果となる。画像音声分析サーバ制御部６０は、この画像分析結果を通信部６８からＷＥＢ／ＡＰＰサーバ２に向けて送信する。 The emotion level estimated by analyzing the image data in this way becomes the image analysis result. The image and audio analysis server control unit 60 transmits this image analysis result from the communication unit 68 to the WEB/APP server 2.

また、画像分析部６１には、参加者の画像データから抽出された顔画像に基づいて、性別や年齢を推定する構成（図示せず）を備えるようにしてもよい。 The image analysis unit 61 may also be equipped with a configuration (not shown) that estimates the gender and age of participants based on facial images extracted from their image data.

音声分析辞書データベース６５には、声の大きさ、声の高さ、話す速さ、他の発言者の言葉に被せて発言する頻度や、それら声の大きさ、高さ、話す速さなどの変化に対応する感情レベルやハラスメントレベルがデータベースとして記憶されており、このデータベース６５は参加者の音声データから感情レベルやハラスメントレベルを推定する際の辞書として用いられる。 The voice analysis dictionary database 65 stores a database of information on the volume of a voice, the pitch of a voice, speaking speed, the frequency of speaking over other speakers, and the emotional and harassment levels corresponding to changes in the volume, pitch, speaking speed, etc. This database 65 is used as a dictionary when estimating the emotional and harassment levels from the participants' voice data.

学習済み音声分析モデル保持部６６は、学習用音声データとそれに対応する正解感情データとの組合せ、学習用音声データとそれに対応する正解ハラスメントデータとの組合せからなる多数の組合せを学習データとして用い、機械学習により生成させた学習済み音声分析モデルが保持されている。この学習済み音声分析モデルは、学習済み音声分析感情生成モデルと学習済み音声分析ハラスメント生成モデルから構成される。学習済み音声分析感情生成モデルは、学習用音声データとそれに対応する、例えば、「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの正解となる感情の種類との組合せを学習の教師データとして用いて生成され、音声データを入力して演算することにより感情レベルが求められる。例えば、入力された音声データについての「喜び」や「悲しみ」などの各感情の推定確率が出力される。一方、学習済み音声分析ハラスメント生成モデルは、学習用音声データとそれに対応する、例えば、「パワーハラスメント」、「セクシャルハラスメント」、「ハラスメント無し」などの正解となるハラスメントの種類との組合せを教師データとして用いて生成され、音声データを入力して演算することによりハラスメントレベルが求められる。例えば、入力された音声データについての「パワーハラスメント」や「セクシャルハラスメント」などの各ハラスメントの推定確率が出力される。この学習済み音声分析モデルに参加者の音声データを入力して演算させることによってその音声データに対応すると推定される感情レベルやハラスメントレベルが出力されるようになっている。 The trained voice analysis model storage unit 66 holds a trained voice analysis model generated by machine learning, using as training data a large number of pairs of training voice data and their corresponding correct emotion data, and pairs of training voice data and their corresponding correct harassment data. This trained voice analysis model consists of a trained voice analysis emotion generation model and a trained voice analysis harassment generation model. The emotion generation model is generated using, as supervised training data, pairs of training voice data and their correct emotion types, such as "joy," "sadness," "anger," "disgust," "fear," and "surprise," and it yields an emotion level when voice data is input and the calculation is performed. For example, it outputs the estimated probability of each emotion, such as "joy" or "sadness," for the input voice data. The harassment generation model, in turn, is generated using, as supervised training data, pairs of training voice data and their correct harassment types, such as "power harassment," "sexual harassment," and "no harassment," and it yields a harassment level when voice data is input and the calculation is performed. For example, it outputs the estimated probability of each type of harassment, such as "power harassment" or "sexual harassment," for the input voice data. When a participant's voice data is input into this trained voice analysis model and the calculation is performed, the emotion level and harassment level estimated to correspond to that voice data are output.

音声分析部６４は、通信部６８で受け付けた参加者の音声データから、声の大きさ、ビデオ会議中における声の大きさの変化、声の高さ、会議中における声の高さの変化、話す速さ、話す速さの変化、他の参加者の言葉に被せて発言する頻度、他の参加者の言葉に被せて発言する頻度の変化などを用いて分析を行い、感情レベルやハラスメントレベルを推定して音声分析結果を求める。 The voice analysis unit 64 analyzes the participant's voice data received by the communication unit 68 using factors such as voice volume, changes in voice volume during the video conference, voice pitch, changes in voice pitch during the conference, speaking speed, changes in speaking speed, frequency of speaking over other participants' words, and changes in frequency of speaking over other participants' words, and estimates emotional level and harassment level to obtain voice analysis results.
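
One of the features listed above, the frequency of speaking over other participants, could be derived from timestamped speech segments as in the sketch below. The `(start, end)` segment representation in seconds, the function names, and the rule that an utterance counts as "speaking over" when it begins while another participant is still talking are all assumptions made here; the document does not define how this frequency is measured.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of one utterance, in seconds


def interruption_count(own: List[Segment], others: List[Segment]) -> int:
    """Count own utterances that begin while another participant is still speaking."""
    return sum(
        1
        for start, _end in own
        if any(o_start < start < o_end for o_start, o_end in others)
    )


def interruptions_per_minute(own: List[Segment],
                             others: List[Segment],
                             duration_s: float) -> float:
    """Normalize the count by the length of the observed interval."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return interruption_count(own, others) * 60.0 / duration_s


if __name__ == "__main__":
    own_segments = [(5.0, 8.0), (20.0, 22.0), (40.0, 45.0)]
    other_segments = [(0.0, 6.0), (30.0, 41.0)]
    print(interruptions_per_minute(own_segments, other_segments, duration_s=120.0))
```

Computing the same value over successive windows of the conference would give the change in this frequency over time, which the paragraph above also lists as an input.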

すなわち、音声分析部６４は、画像音声分析サーバ制御部６０の制御に基づいて、音声データについて音声分析辞書データベース６５を用い感情レベルやハラスメントレベルを推定する。音声分析辞書データベース６５を用いると、声の大きさ、その高さ、話す速さなどやそれらの変化に対応する感情レベルやハラスメントレベルが求められ、それら判定要素ごとの感情レベルやハラスメントレベルに重み付けなどの処理を行い、音声データ全体としての感情レベルやハラスメントレベルが推定される。また、音声分析部６４は、音声データについて、学習済み音声分析モデル保持部６６に保持されている学習済み音声分析モデルを用い感情レベルやハラスメントレベルを推定する。その後、音声分析辞書データベース６５を用いて推定された感情レベルやハラスメントレベルと、学習済み音声分析モデルを用いて推定された感情レベルやハラスメントレベルとの調整を行い、最適な感情レベルやハラスメントレベルになるように修正する。 That is, under the control of the image and audio analysis server control unit 60, the voice analysis unit 64 estimates the emotion level and harassment level of the voice data using the voice analysis dictionary database 65. The voice analysis dictionary database 65 yields the emotion level and harassment level corresponding to the volume of the voice, its pitch, the speaking speed, and the like, and to changes in them; the levels for each of these judgment factors are then weighted and otherwise processed to estimate the emotion level and harassment level of the voice data as a whole. The voice analysis unit 64 also estimates the emotion level and harassment level of the voice data using the trained voice analysis model held in the trained voice analysis model storage unit 66. The levels estimated with the voice analysis dictionary database 65 are then reconciled with the levels estimated with the trained voice analysis model, and corrected so as to arrive at the optimal emotion level and harassment level.

なお、音声分析部６４では、音声分析辞書データベース６５と学習済み音声分析モデルの両方を利用する必要はなく、どちらか一方を利用して、参加者の音声データから感情レベルなどを推定するようにしてもよい。また、音声分析部６４では、感情レベルまたはハラスメントレベルの何れか一方を推定するようにしてもよい。 The voice analysis unit 64 does not need to use both the voice analysis dictionary database 65 and the trained voice analysis model; it may use either one to estimate the emotional level, etc. from the participant's voice data. The voice analysis unit 64 may also estimate either the emotional level or the harassment level.

このようにして音声データを分析して推定された感情レベルやハラスメントレベルが音声分析結果となる。画像音声分析サーバ制御部６０は、この音声分析結果を通信部６８からＷＥＢ／ＡＰＰサーバ２に向けて送信する。 The emotion level and harassment level estimated by analyzing the voice data in this way constitute the voice analysis result. The image and audio analysis server control unit 60 transmits this voice analysis result from the communication unit 68 to the WEB/APP server 2.

＜クライアント端末＞ <Client terminal>
クライアント端末７は、ブラウザと呼ばれるソフトウェアによってインターネット８につながり、ＷＥＢ／ＡＰＰサーバ２などのビデオ会議を動作させるサーバに接続される。クライアント端末７からビデオ会議参加者の画像や発言の音声などが送信され、他の参加者の画像や音声などを受信して端末７で視聴することにより、ビデオ会議が行われる。クライアント端末７としては、パーソナルコンピュータやスマートフォンなどの情報端末が用いられる。 The client terminal 7 connects to the Internet 8 through software called a browser, and connects to a server that runs the video conference, such as the WEB/APP server 2. The video conference is carried out by transmitting images and speech audio of the video conference participants from the client terminal 7, and by receiving images and audio of the other participants and viewing them on the terminal 7. An information terminal such as a personal computer or a smartphone is used as the client terminal 7.

図７に示す概略ブロック図のようにクライアント端末７は、クライアント端末制御部７０、表示部７１、カメラ７２、マイクロホン７３、文字入力部７４、スピーカ７５、通信部７８、記憶部７９を含むように構成されている。 As shown in the schematic block diagram of Figure 7, the client terminal 7 is configured to include a client terminal control unit 70, a display unit 71, a camera 72, a microphone 73, a character input unit 74, a speaker 75, a communication unit 78, and a memory unit 79.

クライアント端末制御部７０は、クライアント端末７を構成する各要素の制御などを行うＣＰＵ（図示せず）を含むように構成されている。記憶部７９は、補助記憶装置（図示せず）やＲＡＭ（図示せず）により構成されている。 The client terminal control unit 70 is configured to include a CPU (not shown) that controls each element that makes up the client terminal 7. The memory unit 79 is configured from an auxiliary storage device (not shown) and RAM (not shown).

通信部７８は、インターネット８に接続されて、ＷＥＢ／ＡＰＰサーバ２などのビデオ会議を動作させるサーバ等との間でデータの送受信を行う。この通信部７８で送受信されるデータは、画像データや音声データなどである。 The communication unit 78 is connected to the Internet 8 and transmits and receives data to and from servers that operate video conferences, such as the WEB/APP server 2. The data transmitted and received by this communication unit 78 includes image data, audio data, and the like.

表示部７１は、液晶ディスプレイなどの表示装置により構成され、ＷＥＢ／ＡＰＰサーバ２の画面構成部２１で構成されたビデオ会議画面２００，２１０などを表示する。 The display unit 71 is composed of a display device such as an LCD display, and displays the video conference screens 200, 210, etc., created by the screen composition unit 21 of the WEB/APP server 2.

カメラ７２は、ＣＣＤイメージセンサやＣＭＯＳイメージセンサ等の固体撮像素子などにより構成され、会議参加者などを撮影する。マイクロホン７３は、参加者の発言などの音声を電気信号に変換して音声データを取得する。スピーカ７５は、他の参加者の発言の音声データを音声として発生させる。 The camera 72 is composed of a solid-state imaging device such as a CCD image sensor or CMOS image sensor, and captures images of conference participants. The microphone 73 converts the speech of participants into electrical signals to obtain audio data. The speaker 75 reproduces the audio data of speech by other participants as sound.

文字入力部７４は、キーボードなどの入力装置で構成され、参加者の文字の入力に用いられる。 The character input unit 74 consists of an input device such as a keyboard, and is used by participants to input characters.

カメラ７２で撮影された参加者の画像データ、マイクロホン７３で取得された参加者の発言の音声データ、文字入力部７４から参加者によって入力された文章を構成する文字データなどは、通信部７８からビデオ会議を動作させるサーバに向けて送信される。また、ビデオ会議の表示画面データや会議の音声データなどが通信部７８で受信されて、表示部７１に表示され、スピーカ７５からその音声が出力される。 Image data of participants captured by camera 72, audio data of participants' speech captured by microphone 73, and text data constituting sentences entered by participants via text input unit 74 are transmitted from communication unit 78 to the server running the video conference. Video conference display screen data and conference audio data are also received by communication unit 78 and displayed on display unit 71, with the audio output from speaker 75.

＜ビデオ会議画面＞ <Video conference screen>
図８と図９は、クライアント端末７の表示部７１に表示されるビデオ会議画面の例である。ＷＥＢ／ＡＰＰサーバ２の画面構成部２１で構成されたビデオ会議の表示画面が画面提供部２６からクライアント端末７に向けて送信され、この表示画面を受信したクライアント端末７がその表示部７１に表示したものである。 Figures 8 and 9 show examples of video conference screens displayed on the display unit 71 of the client terminal 7. The video conference display screen composed by the screen composition unit 21 of the WEB/APP server 2 is transmitted from the screen providing unit 26 to the client terminal 7, and the client terminal 7 that receives this display screen shows it on its display unit 71.

図８に示すビデオ会議画面２００には、会議に参加している７人が表示されている。各参加者を表示する領域には、その参加者を撮影した画像２０１、その領域の左上部に感情レベル表示２０２、左中央部に性別表示２０３、左下部に推定年齢表示２０４が表示されている。 The video conference screen 200 shown in Figure 8 displays the seven people participating in the conference. The area for each participant shows an image 201 of that participant, with an emotion level display 202 at the upper left of the area, a gender display 203 at the middle left, and an estimated age display 204 at the lower left.

感情レベル表示２０２は、「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの感情の種類を表現する絵文字と、その感情の強さのパーセント表示により構成されている。また、この他に、「喜び」や「悲しみ」等の感情の推定確率をパーセント表示してもよいし、感情の最も悪い状態を０％、最も良い状態を１００％としたときの感情の状態をパーセント表示してもよい。図１０（ａ）は、感情の最も悪い状態を０％、最も良い状態を１００％として感情の状態をパーセント表示によって表示する場合の感情レベルを表示する絵文字の例である。絵文字は、感情レベルの２０％刻みに対応するように５種類用意されている。 The emotion level display 202 is composed of emojis that express types of emotions such as "joy," "sadness," "anger," "disgust," "fear," and "surprise," and a percentage display of the strength of those emotions. Alternatively, the estimated probability of emotions such as "joy" or "sadness" may be displayed as a percentage, or the emotional state may be displayed as a percentage, with the worst emotional state being 0% and the best emotional state being 100%. Figure 10(a) shows an example of an emoji that displays the emotional level when the emotional state is displayed as a percentage, with the worst emotional state being 0% and the best emotional state being 100%. Five types of emojis are provided, corresponding to 20% increments of the emotion level.

また、ハラスメントレベル表示は、「パワーハラスメント」や「セクシャルハラスメント」などのハラスメントの種類を表現する絵文字と、そのハラスメントの強さのパーセント表示を表示するようにしてもよい。また、この他に「パワーハラスメント」や「セクシャルハラスメント」等のハラスメントの推定確率をパーセント表示してもよい。また、ハラスメントが無く適切な状態を０％、ハラスメントが有り不適切な状態を１００％としたときのパーセント表示を表示してもよい。図１０（ｂ）は、ハラスメント有りの状態を１００％、ハラスメント無しの状態を０％としてハラスメントの状態をパーセント表示によって表示する場合のハラスメントレベルを表示する絵文字の例である。絵文字は、ハラスメントレベルの２０％刻みに対応するように５種類用意されている。 The harassment level display may also display emojis that represent the type of harassment, such as "power harassment" or "sexual harassment," along with a percentage indication of the intensity of that harassment. Additionally, the estimated probability of harassment, such as "power harassment" or "sexual harassment," may also be displayed as a percentage. A percentage indication may also be displayed, where 0% represents an appropriate state with no harassment and 100% represents an inappropriate state with harassment. Figure 10(b) shows an example of emojis that display the harassment level as a percentage, with 100% representing a state with harassment and 0% representing a state without harassment. Five types of emojis are available, corresponding to 20% increments of the harassment level.
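
Choosing one of five emojis in 20% increments amounts to bucketing a percentage. The sketch below shows one way to do this; the icon names and the choice to place each boundary value (20, 40, and so on) in the higher bucket are assumptions, since the document does not reproduce the icons of Figure 10 or state where the boundaries fall.

```python
# Hypothetical icon names, ordered from the 0% end to the 100% end of the scale.
EMOTION_ICONS = ["very_bad", "bad", "neutral", "good", "very_good"]


def icon_index(percent: float, steps: int = 5) -> int:
    """Map a level in [0, 100] to one of `steps` equal-width buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    width = 100 / steps
    # 100% would fall into a sixth bucket, so clamp it into the last one.
    return min(int(percent // width), steps - 1)


def emotion_icon(percent: float) -> str:
    return EMOTION_ICONS[icon_index(percent, len(EMOTION_ICONS))]


if __name__ == "__main__":
    for level in (0, 19.9, 20, 55, 100):
        print(level, emotion_icon(level))
```

The same `icon_index` function could serve the harassment level display with a second list of five icons ordered from no harassment (0%) to harassment (100%).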

図９に示すビデオ会議画面２１０のように、発言中の参加者をクライアント端末７の表示部７１に拡大して表示するようにしてもよい。このビデオ会議画面２１０では、参加者の画像２０１が表示され、その下方に文字文章表示２０５が配置されている。この文字文章表示２０５には、その参加者の発言の内容を示す文字文章が文字で表示される。翻訳を行う設定になっている場合には、この領域に文字文章の翻訳文を表示させるようにしてもよい。 As shown in the video conference screen 210 in Figure 9, a participant who is speaking may be displayed enlarged on the display unit 71 of the client terminal 7. On this video conference screen 210, an image 201 of the participant is displayed, with a text display 205 located below it. This text display 205 displays a text message indicating the content of the participant's remarks. If translation is enabled, a translation of the text message may be displayed in this area.

また、文字文章表示２０５の下方左部には、感情レベル表示２０６が配置され、その表示２０６の右側には、ハラスメントレベル表示２０７が配置されている。 Furthermore, an emotion level display 206 is located below and to the left of the text display 205, and a harassment level display 207 is located to the right of that display 206.

感情レベル表示２０６やハラスメントレベル表示２０７は、絵文字とパーセント表示で構成するだけでなく、例えば、図１１に示すように、詳細な分析結果を表示するようにしてもよい。この図１１は、感情レベル表示２０８の例であるが、感情を構成する「喜び」、「悲しみ」、「怒り」、「嫌悪」、「恐怖」、「驚き」などの項目の推定確率を表示するようになっている。 The emotion level display 206 and the harassment level display 207 need not consist only of an emoji and a percentage; they may also show detailed analysis results, for example as shown in Figure 11. Figure 11 is an example of an emotion level display 208, which shows the estimated probability of each item that makes up the emotion, such as "joy," "sadness," "anger," "disgust," "fear," and "surprise."

なお、感情レベル表示２０２，２０６，２０８に表示する感情レベルや、ハラスメントレベル表示２０７に表示するハラスメントレベルは、文字文章分析サーバ４の文字文章分析部４１において参加者の発言の内容を示す文字文章を分析して求められた発言分析結果の感情レベルやハラスメントレベルを採用してもよい。また、画像音声分析サーバ６の画像分析部６１において参加者の画像データを分析して求められた画像分析結果の感情レベルを採用してもよい。また、音声分析部６４において参加者の音声データを分析して求められた音声分析結果の感情レベルやハラスメントレベルを採用してもよい。また、ＷＥＢ／ＡＰＰサーバ２の判定部２４において発言分析結果、画像分析結果および音声分析結果を総合評価して求めた総合判定結果の総合感情レベルや総合ハラスメントレベルを採用してもよい。 The emotion levels displayed in emotion level displays 202, 206, and 208 and the harassment level displayed in harassment level display 207 may be emotion levels or harassment levels obtained as a result of utterance analysis by analyzing text indicating the content of participants' utterances in the text analysis unit 41 of the text analysis server 4. Alternatively, the emotion levels may be obtained as an image analysis result by analyzing image data of participants in the image analysis unit 61 of the image and audio analysis server 6. Alternatively, the emotion levels or harassment levels may be obtained as an audio analysis result by analyzing audio data of participants in the audio analysis unit 64. Alternatively, the overall emotion levels or overall harassment levels may be obtained as a comprehensive judgment result by comprehensively evaluating the utterance analysis results, image analysis results, and audio analysis results in the judgment unit 24 of the WEB/APP server 2.
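The text above does not say how the judgment unit 24 combines the three results into an overall level. As one possible sketch, assuming each result is a percentage and assuming a weighted mean over whichever results are available, the combination could look like this. The function name `overall_level` and the weights are hypothetical.

```python
from typing import Optional


def overall_level(
    utterance: Optional[float],
    image: Optional[float],
    audio: Optional[float],
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Optional[float]:
    """Weighted mean of the available analysis results (an assumed rule, not the patent's).

    A result that was not computed is passed as None and skipped. For example, the
    image analysis yields no harassment level, so `image` would be None in that case.
    """
    pairs = [(v, w) for v, w in zip((utterance, image, audio), weights) if v is not None]
    if not pairs:
        return None
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight
```

With equal weights, an overall harassment level computed from an utterance result of 80% and an audio result of 40% would be 60%.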

＜ビデオ会議分析システムの動作＞ <Operation of the video conference analysis system>
次に、本実施の形態１に係るビデオ会議分析システム１を含むビデオ会議システム１００の動作を説明する。以下に、ビデオ会議開始時の動作と、ビデオ会議参加者が発言したときの動作について説明する。 Next, a description will be given of the operation of the video conference system 100 including the video conference analysis system 1 according to the present embodiment 1. The following describes the operation at the start of a video conference and the operation when a video conference participant speaks.

＜ビデオ会議開始時の動作＞ <Operation at the start of a video conference>
図１２には、ビデオ会議分析システム１においてビデオ会議開始時の概略フローチャートが示されている。 FIG. 12 shows a schematic flowchart of the operation of the video conference analysis system 1 at the start of a video conference.

まず、ビデオ会議の開始に先立って、会議参加者のうちの一人が、その参加者のクライアント端末７を用いて会議参加者、参加者の言語、参加者の発言の音声を文字に変換して画面に表示するか否か、翻訳の有無などの会議設定を行う（Ｓ１００ステップ）。会議設定には、参加者の発言の内容を示す文字文章を分析して感情レベルやハラスメントレベルの推定を行うか否か、参加者の画像を分析して感情レベルの推定を行うか否か、参加者の音声データを分析して感情レベルやハラスメントレベルの推定を行うか否か、総合感情レベルや総合ハラスメントレベルを求めるか否かなどのビデオ会議の分析内容も設定される。この会議設定は、クライアント端末７からＷＥＢ／ＡＰＰサーバ２に向けて送信され、そのサーバ２に受信され会議設定部２２に記憶される（Ｓ１１０ステップ）。 First, prior to the start of a video conference, one of the conference participants uses his or her client terminal 7 to set up the conference, including the conference participants, the participant's language, whether to convert the participants' speech into text and display it on the screen, and whether to use translation (step S100). The conference settings also include settings for the analysis details of the video conference, such as whether to analyze text indicating the content of participants' speech to estimate emotional levels and harassment levels, whether to analyze images of participants to estimate emotional levels, whether to analyze participants' speech data to estimate emotional levels and harassment levels, and whether to calculate overall emotional levels and overall harassment levels. This conference setting is sent from the client terminal 7 to the WEB/APP server 2, received by the server 2, and stored in the conference setting unit 22 (step S110).
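The conference settings sent in step S100 and stored in step S110 can be pictured as a small record. The sketch below is an assumption about shape only: the class name `ConferenceSettings`, its field names, the default values, and the use of JSON as the transfer format do not come from the source.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ConferenceSettings:
    """Hypothetical container for the settings made before the conference starts."""
    participants: list[str]
    languages: dict[str, str]        # participant -> language code
    show_transcript: bool = True     # convert speech to text and display it
    translate: bool = False          # translate the text sentences
    analyze_text: bool = True        # emotion/harassment level from text
    analyze_image: bool = True       # emotion level from participant images
    analyze_audio: bool = True       # emotion/harassment level from audio
    overall_judgment: bool = True    # overall emotion/harassment level


def to_payload(settings: ConferenceSettings) -> str:
    """Serialize the settings for transmission to the server (JSON is an assumption)."""
    return json.dumps(asdict(settings), ensure_ascii=False)
```

A client could build `ConferenceSettings(participants=["A", "B"], languages={"A": "ja", "B": "en"}, translate=True)` and send the result of `to_payload` to the server.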

会議設定が終了すると、参加者はクライアント端末７に会議開始を入力する（Ｓ１０１ステップ）。会議開始の指示は、クライアント端末７からＷＥＢ／ＡＰＰサーバ２に送信され、そのサーバ２に受け付けられる（Ｓ１１１ステップ）。 Once the conference setup is complete, the participant enters a conference start instruction at the client terminal 7 (step S101). The conference start instruction is sent from the client terminal 7 to the WEB/APP server 2 and is accepted by that server 2 (step S111).

クライアント端末７からは、参加者を撮影した画像データがＷＥＢ／ＡＰＰサーバ２に送信される。ＷＥＢ／ＡＰＰサーバ２では、画像音声受付部２５で参加者の画像データを受け付ける（Ｓ１１２ステップ）。取得した参加者の画像を分析するため、ＷＥＢ／ＡＰＰサーバ２のＷＥＢ／ＡＰＰサーバ制御部２０は、参加者画像データを画像音声分析サーバ６に向けて送信する（Ｓ１１３ステップ）。 Image data of the participants is sent from the client terminal 7 to the WEB/APP server 2. In the WEB/APP server 2, the image and audio receiving unit 25 receives the image data of the participants (step S112). To analyze the acquired images of the participants, the WEB/APP server control unit 20 of the WEB/APP server 2 sends the participant image data to the image and audio analysis server 6 (step S113).

画像音声分析サーバ６の画像分析部６１では、受け付けた参加者画像データを分析して感情レベルを推定して画像分析結果を求める（Ｓ１３０ステップ）。また、画像分析部６１では、参加者画像データに基づいて性別や年齢を推定する。 The image analysis unit 61 of the image and audio analysis server 6 analyzes the received participant image data, estimates the emotional level, and obtains the image analysis results (step S130). The image analysis unit 61 also estimates the gender and age based on the participant image data.

画像音声分析サーバ６における参加者画像データの分析結果は、ＷＥＢ／ＡＰＰサーバ２に送信され、そのサーバ２で取得される（Ｓ１１４ステップ）。 The analysis results of the participant image data by the image and audio analysis server 6 are sent to the WEB/APP server 2 and acquired by that server 2 (step S114).

他の会議参加者のクライアント端末７からも、参加者の画像データがＷＥＢ／ＡＰＰサーバ２に送信され、ＷＥＢ／ＡＰＰサーバ２に受け付けられる（Ｓ１１５ステップ）。ＷＥＢ／ＡＰＰサーバ２は、同様に、参加者画像データを画像音声分析サーバ６に送信（Ｓ１１６ステップ）して分析させ（Ｓ１３１ステップ）、分析結果を取得する（Ｓ１１７ステップ）。 Image data of other conference participants is also sent from the client terminals 7 to the WEB/APP server 2, which accepts the data (step S115). The WEB/APP server 2 similarly sends the participant image data to the image and audio analysis server 6 (step S116), which analyzes the data (step S131), and obtains the analysis results (step S117).

すべての会議参加者のクライアント端末７から参加者画像データを受け付けて分析を行わせると、ＷＥＢ／ＡＰＰサーバ２のＷＥＢ／ＡＰＰサーバ制御部２０は、画面構成部２１に例えば、図８に示すようなビデオ会議画面２００を構成させる。ＷＥＢ／ＡＰＰサーバ制御部２０は、構成されたビデオ会議画面２００を画面提供部２６から、会議に参加しているすべてのクライアント端末７に向けて送信する（Ｓ１１８ステップ）。 After receiving and analyzing participant image data from the client terminals 7 of all conference participants, the WEB/APP server control unit 20 of the WEB/APP server 2 causes the screen composition unit 21 to compose a video conference screen 200, for example, as shown in FIG. 8. The WEB/APP server control unit 20 then transmits the composed video conference screen 200 from the screen providing unit 26 to all client terminals 7 participating in the conference (step S118).

こうすることにより、クライアント端末７の表示部７１には、ビデオ会議画面２００のような画面が表示され（Ｓ１０２ステップ、Ｓ１４０ステップ）、ビデオ会議が開始される。 By doing this, a screen like the video conference screen 200 is displayed on the display unit 71 of the client terminal 7 (steps S102 and S140), and the video conference begins.

＜ビデオ会議参加者が発言したときの動作＞ <Operation when a video conference participant speaks>
図１３には、ビデオ会議分析システム１においてビデオ会議参加者が発言したときの概略フローチャートが示されている。 FIG. 13 shows a schematic flowchart of the operation of the video conference analysis system 1 when a video conference participant speaks.

ビデオ会議中に、参加者が発言すると（Ｓ２００ステップ）、その参加者のクライアント端末７からＷＥＢ／ＡＰＰサーバ２に向けて発言の音声データが送信される。また、参加者の画像データは、常時、クライアント端末７からＷＥＢ／ＡＰＰサーバ２に向けて送信されるようになっている。 When a participant speaks during a video conference (step S200), the participant's voice data is sent from the participant's client terminal 7 to the WEB/APP server 2. In addition, the participant's image data is constantly sent from the client terminal 7 to the WEB/APP server 2.

ＷＥＢ／ＡＰＰサーバ２が、参加者の音声データと画像データを受け付けると（Ｓ２１０ステップ）、このサーバ２のＷＥＢ／ＡＰＰサーバ制御部２０は、音声データを文字に変換させるため、受け付けた音声データを音声文字変換サーバ３に送信する。 When the WEB/APP server 2 receives the participants' voice data and image data (step S210), the WEB/APP server control unit 20 of this server 2 transmits the received voice data to the voice-to-text conversion server 3 to convert the voice data into text.

音声文字変換サーバ３の音声文字変換部３１では、受け付けた音声データを文字に変換する（Ｓ２２０ステップ）。発言の音声データから変換された文字によって、発言の内容を示す文字文章が構成される。このように音声データを変換して得られた文字データは、ＷＥＢ／ＡＰＰサーバ２に向けて送信される。 The speech-to-text conversion unit 31 of the speech-to-text conversion server 3 converts the received voice data into text (step S220). The text converted from the voice data of the utterance forms a text sentence indicating the content of the utterance. The text data obtained by converting the voice data in this way is transmitted to the WEB/APP server 2.

ＷＥＢ／ＡＰＰサーバ２では、文字文章を構成する文字データを音声文字変換サーバ３から受け付けると（Ｓ２１１ステップ）、受け付けた文字文章を分析させるため、ＷＥＢ／ＡＰＰサーバ制御部２０が、受け付けた文字データを文字文章分析サーバ４に送信する。 When the WEB/APP server 2 receives character data constituting a text sentence from the speech-to-text conversion server 3 (step S211), the WEB/APP server control unit 20 transmits the received character data to the text analysis server 4 to analyze the received text sentence.

文字文章分析サーバ４の文字文章分析部４１では、受け付けた文字データで構成されている文字文章を分析して、感情レベルやハラスメントレベルを推定する（Ｓ２３０ステップ）。このように参加者の発言の内容を示す文字文章を分析した結果が、発言分析結果となる。文字文章の分析結果は、ＷＥＢ／ＡＰＰサーバ２に送信され、そのサーバ２に受け付けられる（Ｓ２１２ステップ）。 The text analysis unit 41 of the text analysis server 4 analyzes the text composed of the received text data to estimate the emotion level and harassment level (step S230). The result of this analysis of the text indicating the content of the participant's utterance becomes the utterance analysis result. The text analysis result is sent to the WEB/APP server 2 and accepted by that server 2 (step S212).

発言の内容を示す文字文章を翻訳する設定になっている場合、翻訳させるため、ＷＥＢ／ＡＰＰサーバ制御部２０は、文字文章を構成する文字データ、その文字文章の言語、翻訳言語などを翻訳サーバ５に送信する。 If the setting is to translate the text indicating the content of the statement, the WEB/APP server control unit 20 transmits the character data constituting the text, the language of the text, the translation language, etc. to the translation server 5 in order to perform the translation.

翻訳サーバ５の翻訳部５１では、受け付けた文字文章を翻訳言語に翻訳する（Ｓ２４０ステップ）。生成された文字文章の翻訳文は、ＷＥＢ／ＡＰＰサーバ２に送信され、そのサーバ２に受け付けられる（Ｓ２１３ステップ）。 The translation unit 51 of the translation server 5 translates the received text into the translation language (step S240). The generated translation of the text is sent to the WEB/APP server 2 and accepted by that server 2 (step S213).

次に、参加者の画像データや音声データを分析させるため、ＷＥＢ／ＡＰＰサーバ制御部２０は、クライアント端末７から受け付けた参加者画像データと音声データを画像音声分析サーバ６に向けて送信する。 Next, in order to analyze the participant's image data and audio data, the WEB/APP server control unit 20 transmits the participant image data and audio data received from the client terminal 7 to the image and audio analysis server 6.

画像音声分析サーバ６の画像分析部６１では、受け付けた参加者画像データを分析して感情レベルを推定する（Ｓ２５０ステップ）。このように参加者画像データを分析した結果が、画像分析結果となる。また、音声分析部６４では、受け付けた参加者の音声データを分析して感情レベルやハラスメントレベルを推定する（Ｓ２５０ステップ）。このように参加者音声データを分析した結果が、音声分析結果となる。 The image analysis unit 61 of the image and audio analysis server 6 analyzes the received participant image data to estimate the emotional level (step S250). The result of analyzing the participant image data in this way becomes the image analysis result. The audio analysis unit 64 analyzes the received participant audio data to estimate the emotional level and harassment level (step S250). The result of analyzing the participant audio data in this way becomes the audio analysis result.

画像音声分析サーバ６で求められた画像分析結果や音声分析結果は、ＷＥＢ／ＡＰＰサーバ２に送信され、そのサーバ２に受け付けられる（Ｓ２１４ステップ）。 The image analysis results and audio analysis results obtained by the image and audio analysis server 6 are sent to the WEB/APP server 2 and accepted by that server 2 (step S214).

ＷＥＢ／ＡＰＰサーバ制御部２０は、受け付けた文字文章、文字文章の分析結果、文字文章の翻訳文、画像分析結果、音声分析結果に基づいて、画面構成部２１に例えば、図９に示すようなビデオ会議画面２１０を構成させる。そして、ＷＥＢ／ＡＰＰサーバ制御部２０は、構成されたビデオ会議画面２１０を画面提供部２６から、会議に参加しているすべてのクライアント端末７に向けて送信する（Ｓ２１５ステップ）。 The WEB/APP server control unit 20 causes the screen composition unit 21 to compose a video conference screen 210, for example, as shown in FIG. 9, based on the received text, the analysis results of the text, the translation of the text, the image analysis results, and the audio analysis results. The WEB/APP server control unit 20 then transmits the composed video conference screen 210 from the screen providing unit 26 to all client terminals 7 participating in the conference (step S215).

こうすることにより、クライアント端末７の表示部７１には、ビデオ会議画面２１０のような画面が表示され（Ｓ２０１ステップ、Ｓ２６０ステップ）、ビデオ会議が進行する。 By doing this, a screen like the video conference screen 210 is displayed on the display unit 71 of the client terminal 7 (steps S201 and S260), and the video conference proceeds.
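The sequence from step S210 to step S215 can be summarized as a small orchestration routine. The sketch below is illustrative only: the servers 3 to 6 are represented by injected callables, and the names `Services` and `handle_utterance` as well as the dictionary layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Services:
    """Stand-ins for the remote servers; real calls would go over the network."""
    speech_to_text: Callable[[bytes], str]        # speech-to-text conversion server 3 (S220)
    analyze_text: Callable[[str], dict]           # text analysis server 4 (S230)
    translate: Callable[[str, str, str], str]     # translation server 5 (S240)
    analyze_image: Callable[[bytes], dict]        # image and audio analysis server 6 (S250)
    analyze_audio: Callable[[bytes], dict]        # image and audio analysis server 6 (S250)


def handle_utterance(
    audio: bytes,
    image: bytes,
    services: Services,
    source_lang: str,
    target_lang: Optional[str] = None,
) -> dict:
    """Collect everything the screen composition needs for one utterance."""
    text = services.speech_to_text(audio)
    result = {"text": text, "utterance_analysis": services.analyze_text(text)}
    # Translation happens only when the conference settings request it.
    if target_lang is not None:
        result["translation"] = services.translate(text, source_lang, target_lang)
    result["image_analysis"] = services.analyze_image(image)
    result["audio_analysis"] = services.analyze_audio(audio)
    return result
```

Passing the servers in as callables keeps the routine testable with stubs; whether the actual system performs these calls sequentially or in parallel is not stated in the text.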

＜本実施の形態１の効果＞ <Effects of the first embodiment>
本実施の形態１によれば、ビデオ会議参加者の発言の音声データが文字に変換され、その発言の内容を示す文字文章が構成される。また、構成されたこの文字文章に基づいて分析が行われて、感情レベルおよびハラスメントレベルが推定され発言分析結果が求められる。そして、得られた参加者の発言の内容を示す文字文章と発言分析結果を表示する表示画面が構成されて、この表示画面がビデオ会議の参加者に送信される。このように、参加者の発言に基づいて感情レベルおよびハラスメントレベルを分析することができ、その発言分析結果をビデオ会議の参加者の間で共有することができる。このため、参加者が感情的になった場合など、自らその状態を把握でき自制できるとともに、他の参加者もその状態を把握でき鎮静化を促すことができる。 According to the first embodiment, audio data of utterances made by video conference participants is converted into text, and a text sentence indicating the content of each utterance is constructed. An analysis is then performed on this text sentence, and the emotion level and harassment level are estimated to obtain an utterance analysis result. A display screen showing the text sentence and the utterance analysis result is then constructed and transmitted to the video conference participants. In this way, the emotion level and harassment level can be analyzed based on the participants' utterances, and the utterance analysis result can be shared among the participants. Therefore, when a participant becomes emotional, for example, that participant can recognize their own state and restrain themselves, and the other participants can also recognize that state and encourage the participant to calm down.

また、本実施の形態１によれば、音声文字変換サーバ３の音声文字変換部３１が、人工知能である機械学習により生成させた学習済み音声文字変換モデルを用いて、音声データから文字に変換している。このため、高精度に安定して、音声データを文字に変換できる。また、文字文章分析部４１が、人工知能である機械学習により生成させた学習済み文字文章分析モデルを用いて感情レベルおよびハラスメントレベルを求める。このため、精度よく安定して感情レベルとハラスメントレベルを推定できる。 Furthermore, according to this embodiment 1, the speech-to-text conversion unit 31 of the speech-to-text conversion server 3 converts audio data into text using a trained speech-to-text model generated by machine learning, a form of artificial intelligence. This allows audio data to be converted into text stably and with high accuracy. Likewise, the text analysis unit 41 determines the emotion level and harassment level using a trained text analysis model generated by machine learning. This allows the emotion level and harassment level to be estimated accurately and stably.

また、本実施の形態１によれば、会議参加者の発言の内容を示す文字文章が、所定の翻訳言語に翻訳されるため、使用する言語の異なる参加者同士でも翻訳文を参照することにより円滑な意思の疎通を図ることができる。 Furthermore, according to this first embodiment, text indicating the content of statements made by conference participants is translated into a specified translation language, allowing participants who speak different languages to refer to the translation to facilitate smooth communication.

また、本実施の形態１によれば、翻訳サーバ５の翻訳部５１が、機械学習により生成させた学習済み翻訳モデルを用いて、原文文章を翻訳する。このため、高い精度で確実に翻訳できる。 Furthermore, according to this embodiment 1, the translation unit 51 of the translation server 5 translates the source text using a trained translation model generated by machine learning. This allows for highly accurate and reliable translation.

また、本実施の形態１によれば、画像音声分析サーバ６の画像分析部６１が、機械学習により生成させた学習済み画像分析モデルを用いて、画像データから感情レベルを推定する。このため、高精度に安定して画像データから感情レベルを求められる。 Furthermore, according to this first embodiment, the image analysis unit 61 of the image and audio analysis server 6 estimates the emotion level from the image data using a trained image analysis model generated by machine learning. This makes it possible to stably determine the emotion level from the image data with high accuracy.

また、本実施の形態１によれば、会議参加者を撮影した画像データに基づいて感情レベルが推定され画像分析結果が求められる。そして、この画像分析結果を表示する画像分析表示画面が構成されて、ビデオ会議の参加者に送信される。このように、参加者の画像データに基づいて感情レベルが推定され、会議の参加者にその画像分析結果が共有される。参加者の画像データから画像分析結果が求められるため、参加者の発言の内容を示す文字文章と異なるデータを用いて感情レベルを分析することができ、多面的に分析結果を得ることができる。 Furthermore, according to this first embodiment, emotional levels are estimated based on image data of the conference participants, and image analysis results are obtained. An image analysis display screen that displays these image analysis results is then constructed and transmitted to the video conference participants. In this way, emotional levels are estimated based on the image data of the participants, and the image analysis results are shared with the conference participants. Because the image analysis results are obtained from the image data of the participants, emotional levels can be analyzed using data other than text that indicates the content of the participants' remarks, and multifaceted analysis results can be obtained.

また、本実施の形態１によれば、画像音声分析サーバ６の音声分析部６４が、機械学習により生成させた学習済み音声分析モデルを用いて、音声データから感情レベルおよびハラスメントレベルを推定する。このため、精度よく確実に音声データから感情レベルやハラスメントレベルを検出できる。 Furthermore, according to this embodiment 1, the audio analysis unit 64 of the image and audio analysis server 6 estimates the emotion level and harassment level from the audio data using a trained audio analysis model generated by machine learning. This makes it possible to accurately and reliably detect the emotion level and harassment level from the audio data.

また、本実施の形態１によれば、会議参加者の発言の音声データに基づいて感情レベルおよびハラスメントレベルが推定され音声分析結果が求められる。そして、この音声分析結果を表示する音声分析表示画面が構成されて、ビデオ会議の参加者に送信される。このように、参加者の発言の音声データに基づいて感情レベルおよびハラスメントレベルが推定され、会議の参加者にその音声分析結果が共有される。参加者の音声データから音声分析結果が求められるため、参加者の発言の内容を示す文字文章と異なり、参加者の音声データそのものを用いて感情レベルおよびハラスメントレベルを分析することができ、多面的に分析結果を得ることができる。 Furthermore, according to this first embodiment, emotion and harassment levels are estimated based on the audio data of the conference participants' utterances, and audio analysis results are obtained. An audio analysis display screen displaying these audio analysis results is then constructed and transmitted to the video conference participants. In this way, emotion and harassment levels are estimated based on the audio data of the participants' utterances, and the audio analysis results are shared with the conference participants. Because the audio analysis results are obtained from the participants' audio data, the emotion and harassment levels can be analyzed using the audio data itself, as opposed to text that indicates the content of the utterances, and analysis results can be obtained from multiple perspectives.

また、本実施の形態１によれば、参加者の発言の内容を示す文字文章に基づいて求められた発言分析結果、参加者の画像データに基づいて求められた画像分析結果および参加者の発言の音声データに基づいて求められた音声分析結果が総合評価されて総合判定結果が求められる。そして、この総合判定結果を表示する総合判定表示画面が構成されて、ビデオ会議の参加者に送信される。このように、発言分析結果、画像分析結果および音声分析結果が総合されるため、より多面的な分析結果を得ることができる。 Furthermore, according to this first embodiment, a comprehensive assessment result is obtained by comprehensively evaluating the utterance analysis result obtained based on the text indicating the content of the participants' utterances, the image analysis result obtained based on the image data of the participants, and the audio analysis result obtained based on the audio data of the participants' utterances. Then, a comprehensive assessment display screen is constructed to display this comprehensive assessment result, and is transmitted to the participants in the video conference. In this way, the utterance analysis result, image analysis result, and audio analysis result are combined, thereby obtaining a more multifaceted analysis result.

また、本実施の形態１によれば、参加者の発言の内容を示す文字文章に不適切な語句が含まれているか検出され、不適切な語句が検出された場合には、不適切語句遮断措置が行われる。この不適切語句遮断措置により、他人に不快感を与えるような不適切な語句が会議参加者に伝達されなくなるため、参加者は安心して会議に参加することができる。 Furthermore, according to this first embodiment, the text indicating the content of participants' remarks is checked for the inclusion of inappropriate words, and if inappropriate words are detected, an inappropriate word blocking measure is taken. This inappropriate word blocking measure prevents inappropriate words that may offend others from being communicated to conference participants, allowing participants to participate in the conference with peace of mind.
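One conceivable form of the inappropriate word blocking measure is to mask listed words before the text is shown. The sketch below assumes a plain word list and substring matching (which also suits Japanese text, where words are not separated by spaces); the function name `mask_inappropriate` and the masking with asterisks are illustrative assumptions, not the measure defined in the specification.

```python
import re


def mask_inappropriate(text: str, banned: list[str]) -> tuple[str, int]:
    """Replace each banned word with asterisks and report how many were found."""
    if not banned:
        return text, 0
    # Longer entries first so that a longer phrase wins over a shorter word inside it.
    alternatives = sorted(banned, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in alternatives), re.IGNORECASE)
    masked, count = pattern.subn(lambda m: "*" * len(m.group()), text)
    return masked, count
```

The returned count can feed a frequency check such as the one described in the next paragraph.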

また、本実施の形態１によれば、不適切な語句の検出頻度が所定の閾値を超えた場合、警告表示画面が参加者に向けて送信される。この警告表示画面により、会議参加者は、不適切な発言が多くなっていることを客観的に認識することができ、休憩するなどの対策を講じることができる。 Furthermore, according to this first embodiment, if the detection frequency of inappropriate words and phrases exceeds a predetermined threshold, a warning display screen is sent to the participants. This warning display screen allows meeting participants to objectively recognize that inappropriate remarks are increasing, and they can take measures such as taking a break.
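The frequency check can be sketched as a sliding time window over detection events. The window length, the class name `InappropriateWordMonitor`, and the interpretation of "frequency" as "detections within the last N seconds" are assumptions made for illustration.

```python
from collections import deque


class InappropriateWordMonitor:
    """Counts detections inside a sliding time window (an assumed definition of frequency)."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: deque[float] = deque()

    def record(self, timestamp: float, count: int = 1) -> bool:
        """Register detections at `timestamp`; return True if a warning should be sent."""
        for _ in range(count):
            self._events.append(timestamp)
        # Drop detections that have fallen out of the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        # "Exceeds the threshold" is read as strictly greater than.
        return len(self._events) > self.threshold
```

With `threshold=2` and `window_seconds=60`, a third detection within one minute triggers the warning, while an isolated detection much later does not.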

［発明の実施の形態２］ [Second Embodiment]
次に、この発明の実施の形態２について、図１４～図１７を用いて説明する。ただし、上述の実施の形態１と同一または対応する要素には、同一の符号を付し、重複する説明は省略する。 Next, a second embodiment of the present invention will be described with reference to Figures 14 to 17. However, elements that are the same as or correspond to those in the first embodiment described above are given the same reference numerals, and duplicated descriptions will be omitted.

図１４は、本実施の形態２に係るビデオ会議分析システム１Ａを含むビデオ会議システム１００Ａを概略的に示す構成ブロック図である。このビデオ会議システム１００Ａは、上述の実施形態１に係るビデオ会議システム１００にビデオ会議を動作させる機能を有するビデオ会議サーバ９が追加された構成になっている。また、このビデオ会議分析システム１Ａに含まれるＷＥＢ／ＡＰＰサーバ２Ａが、実施形態１のＷＥＢ／ＡＰＰサーバ２のようにビデオ会議を動作させるのではなく、ビデオ会議を分析するプログラムをＡＰＩとして提供するようになっている。それ以外は、実施形態１に係るビデオ会議システム１００とほぼ同様の構成となっている。 Figure 14 is a block diagram showing a schematic configuration of a videoconference system 100A including a videoconference analysis system 1A according to the second embodiment. This videoconference system 100A is configured by adding a videoconference server 9 having the function of operating videoconferences to the videoconference system 100 according to the first embodiment described above. Furthermore, the WEB/APP server 2A included in this videoconference analysis system 1A does not operate videoconferences like the WEB/APP server 2 in the first embodiment, but instead provides a program for analyzing videoconferences as an API. Otherwise, the configuration is substantially the same as that of the videoconference system 100 according to the first embodiment.

このビデオ会議分析システム１Ａでは、ビデオ会議サーバ９がインターネット８を介して会議参加者のクライアント端末７に接続され、ビデオ会議が動作するようになっている。会議参加者の発言や画像や音声を分析して感情レベルやハラスメントレベルを推定するときに、ビデオ会議サーバ９が、ＷＥＢ／ＡＰＰサーバ２Ａの提供するビデオ会議分析ＡＰＩを利用して、分析結果を取得するようになっている。 In this videoconference analysis system 1A, a videoconference server 9 is connected to the client terminals 7 of the conference participants via the Internet 8, enabling the videoconference to operate. When the remarks, images, and audio of the conference participants are analyzed to estimate emotion and harassment levels, the videoconference server 9 obtains the analysis results by using the videoconference analysis API provided by the WEB/APP server 2A.

＜ＷＥＢ／ＡＰＰサーバ＞ <WEB/APP server>
ＷＥＢ／ＡＰＰサーバ２Ａは、参加者の画像や音声を分析して感情レベルやハラスメントレベルを推定するビデオ会議分析プログラムをビデオ会議分析ＡＰＩの形式で提供する。 The WEB/APP server 2A provides a video conference analysis program in the form of a video conference analysis API that analyzes images and sounds of participants to estimate their emotional levels and harassment levels.

図１５に示す概略ブロック図のようにＷＥＢ／ＡＰＰサーバ２Ａは、ビデオ会議分析ＡＰＩ提供部２７を有しており、それ以外は、実施形態１に係るＷＥＢ／ＡＰＰサーバ２とほぼ同様の構成である（図２参照）。 As shown in the schematic block diagram of Figure 15, the WEB/APP server 2A has a videoconference analysis API providing unit 27; otherwise, its configuration is substantially the same as that of the WEB/APP server 2 according to embodiment 1 (see Figure 2).

図１６には、ビデオ会議分析ＡＰＩ提供部２７から提供されるビデオ会議分析ＡＰＩの仕様の例が示されている。このビデオ会議分析ＡＰＩへの入力である引数は、分析内容の設定、参加者の音声データ、参加者の画像データ、参加者がキーボード等から入力した文字データなどである。また、このビデオ会議分析ＡＰＩからの出力である戻り値は、音声変換後の文字、翻訳文、感情レベル、感情分析の詳細、ハラスメントレベル、ハラスメント分析の詳細などである。ここで、感情分析の詳細とは、図１１に示すような感情を構成する「喜び」、「悲しみ」などの各項目の割合のことである。ハラスメント分析の詳細も同様に、ハラスメントを構成する各項目の割合のことである。 Figure 16 shows an example of the specifications of the video conference analysis API provided by the video conference analysis API providing unit 27. The arguments input to this video conference analysis API include the analysis content setting, participant audio data, participant image data, and text data entered by participants via a keyboard or the like. The return values output from this video conference analysis API include the text obtained by speech conversion, the translated text, the emotion level, the emotion analysis details, the harassment level, and the harassment analysis details. Here, the emotion analysis details are the proportions of the items that make up the emotion, such as "joy" and "sadness," as shown in Figure 11. Similarly, the harassment analysis details are the proportions of the items that make up the harassment.
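The arguments and return values listed above can be pictured as a request and a response record. The sketch below is a reading of that list only; the class names `AnalysisRequest` and `AnalysisResponse`, the field names, and the types are hypothetical, and the actual specification of Figure 16 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnalysisRequest:
    """Hypothetical argument set for the video conference analysis API."""
    setting: int                         # analysis content setting, 1 to 12
    audio: Optional[bytes] = None        # participant audio data
    image: Optional[bytes] = None        # participant image data
    text: Optional[str] = None           # text typed by the participant


@dataclass
class AnalysisResponse:
    """Hypothetical return values; fields not produced by a setting stay empty."""
    transcript: Optional[str] = None             # text after speech conversion
    translation: Optional[str] = None            # translated text
    emotion_level: Optional[float] = None        # percentage
    emotion_details: dict[str, float] = field(default_factory=dict)      # e.g. "joy": 0.6
    harassment_level: Optional[float] = None     # percentage
    harassment_details: dict[str, float] = field(default_factory=dict)
```

Leaving unused fields as `None` or empty lets one response shape serve all analysis settings.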

図１６に示すビデオ会議分析ＡＰＩの引数である分析内容の設定によって、１２パターンの分析内容を選択できるようになっている。 By setting the analysis content, which is an argument to the video conference analysis API shown in Figure 16, it is possible to select from 12 patterns of analysis content.

設定（１）では、音声データが文字に変換される（Ｐ１処理）。この設定では、ビデオ会議分析ＡＰＩの引数として入力される音声データが、ＷＥＢ／ＡＰＰサーバ２Ａの画像音声受付部２５に受け付けられて（画像音声受付処理）、このサーバ２ＡのＷＥＢ／ＡＰＰサーバ制御部２０が、受け付けた音声データを音声文字変換サーバ３に送信する。音声文字変換サーバ３の音声文字変換部３１では、受け付けた音声データを文字に変換する処理（音声文字変換処理）が行われる。このように音声データから変換された文字データは、音声文字変換サーバ３からＷＥＢ／ＡＰＰサーバ２Ａに送信され、そのサーバ２Ａに受け付けられる。 In setting (1), audio data is converted into text (P1 processing). In this setting, audio data input as an argument to the videoconference analysis API is accepted by the image and audio accepting unit 25 of the WEB/APP server 2A (image and audio accepting processing), and the WEB/APP server control unit 20 of this server 2A sends the accepted audio data to the speech-to-text conversion server 3. The speech-to-text conversion unit 31 of the speech-to-text conversion server 3 performs processing to convert the accepted audio data into text (speech-to-text conversion processing). The text data converted from the audio data in this way is sent from the speech-to-text conversion server 3 to the WEB/APP server 2A, and is accepted by that server 2A.

設定（２）では、文字文章が分析され感情レベルやハラスメントレベルが推定される（Ｐ２処理）。この設定では、ビデオ会議分析ＡＰＩの引数として入力される文字データが、ＷＥＢ／ＡＰＰサーバ２Ａの通信部２８等に受け付けられ、ＷＥＢ／ＡＰＰサーバ制御部２０が、受け付けた文字データを文字文章分析サーバ４に送信する。文字文章分析サーバ４の文字文章分析部４１では、受け付けた文字データで構成される文字文章を分析し感情レベルやハラスメントレベルを推定して文字文章の分析結果を求める（文字文章分析処理）。このようにして求められた文字文章の分析結果は、文字文章分析サーバ４からＷＥＢ／ＡＰＰサーバ２Ａに送信され、そのサーバ２Ａに受け付けられる。 In setting (2), text is analyzed and emotion and harassment levels are estimated (P2 processing). In this setting, text data input as an argument to the videoconference analysis API is received by the communication unit 28 or the like of the WEB/APP server 2A, and the WEB/APP server control unit 20 transmits the received text data to the text analysis server 4. The text analysis unit 41 of the text analysis server 4 analyzes the text composed of the received text data, estimates the emotion and harassment levels, and obtains the text analysis results (text analysis processing). The text analysis results obtained in this way are transmitted from the text analysis server 4 to the WEB/APP server 2A, where they are received.

設定（３）では、原文が翻訳され翻訳文が求められる（Ｐ３処理）。この設定では、ビデオ会議分析ＡＰＩの引数として入力される文字データが、ＷＥＢ／ＡＰＰサーバ２Ａの通信部２８等に受け付けられ、ＷＥＢ／ＡＰＰサーバ制御部２０が、受け付けた文字データを翻訳サーバ５に送信する。翻訳サーバ５の翻訳部５１では、受け付けた文字データで構成される原文が翻訳され翻訳文が生成される（翻訳処理）。このようにして求められた翻訳文は、翻訳サーバ５からＷＥＢ／ＡＰＰサーバ２Ａに送信され、そのサーバ２Ａに受け付けられる。 In setting (3), the original text is translated and a translation is obtained (P3 processing). In this setting, character data input as an argument to the videoconference analysis API is accepted by the communication unit 28 of the WEB/APP server 2A, and the WEB/APP server control unit 20 sends the accepted character data to the translation server 5. The translation unit 51 of the translation server 5 translates the original text composed of the accepted character data and generates a translation (translation processing). The translation obtained in this way is sent from the translation server 5 to the WEB/APP server 2A, and is accepted by that server 2A.

設定（４）では、画像や音声が分析され感情レベルやハラスメントレベルが推定される（Ｐ４処理）。この設定では、ビデオ会議分析ＡＰＩの引数として入力される画像データや音声データが、ＷＥＢ／ＡＰＰサーバ２Ａの画像音声受付部２５に受け付けられ（画像音声受付処理）、ＷＥＢ／ＡＰＰサーバ制御部２０が、受け付けた画像データや音声データを画像音声分析サーバ６に送信する。画像音声分析サーバ６の画像分析部６１では、受け付けた画像データを分析し感情レベルを推定して画像分析結果を求める（画像分析処理）。また、音声分析部６４では、受け付けた音声データを分析し感情レベルやハラスメントレベルを推定して音声分析結果を求める（音声分析処理）。このようにして求められた画像分析結果や音声分析結果は、画像音声分析サーバ６からＷＥＢ／ＡＰＰサーバ２Ａに送信され、そのサーバ２Ａに受け付けられる。 In setting (4), images and audio are analyzed to estimate emotional and harassment levels (P4 processing). In this setting, image data and audio data input as arguments to the videoconference analysis API are accepted by the image and audio acceptance unit 25 of the WEB/APP server 2A (image and audio acceptance processing), and the WEB/APP server control unit 20 transmits the accepted image data and audio data to the image and audio analysis server 6. The image analysis unit 61 of the image and audio analysis server 6 analyzes the accepted image data, estimates the emotional level, and obtains image analysis results (image analysis processing). The audio analysis unit 64 analyzes the accepted audio data, estimates the emotional and harassment levels, and obtains audio analysis results (audio analysis processing). The image and audio analysis results obtained in this way are transmitted from the image and audio analysis server 6 to the WEB/APP server 2A, where they are accepted.

設定（５）では、このビデオ会議分析ＡＰＩの引数として入力される音声データが文字に変換され（音声文字変換処理）、その変換された文字で構成される文字文章が分析され感情レベルなどの分析結果が求められ（文字文章分析処理）、さらに、この文字文章が翻訳され翻訳文が求められる（翻訳処理）（Ｐ５処理）。 In setting (5), the audio data input as an argument to this videoconference analysis API is converted into text (speech-to-text conversion processing), the text composed of the converted characters is analyzed to obtain analysis results such as the emotion level (text analysis processing), and this text is then translated to obtain a translation (translation processing) (P5 processing).

設定（６）では、このＡＰＩの引数として入力される音声データが文字に変換され（音声文字変換処理）、その変換された文字で構成される文字文章が分析され感情レベルなどの分析結果が求められる（文字文章分析処理）（Ｐ６処理）。 In setting (6), the voice data input as an argument to this API is converted into text (voice-to-text conversion process), and the text composed of the converted characters is analyzed to obtain analysis results such as emotional level (text analysis process) (P6 process).

設定（７）では、このＡＰＩの引数として入力される音声データが文字に変換され（音声文字変換処理）、その変換された文字で構成される文字文章が翻訳され翻訳文が求められる（翻訳処理）（Ｐ７処理）。 In setting (7), the voice data input as an argument to this API is converted into text (voice-to-text conversion process), and the text sentence composed of the converted characters is translated to obtain a translated sentence (translation process) (P7 process).

設定（８）では、このビデオ会議分析ＡＰＩの引数として入力される文字データから構成される文字文章が分析され感情レベルなどの分析結果が求められ（文字文章分析処理）、この文字文章が翻訳され翻訳文が求められる（翻訳処理）（Ｐ８処理）。 In setting (8), the text sentences composed of text data input as arguments to this video conference analysis API are analyzed to obtain analysis results such as emotional levels (text sentence analysis process), and these text sentences are translated to obtain translations (translation process) (P8 process).

設定（９）は、設定（５）と設定（４）の組合せである。このビデオ会議分析ＡＰＩの引数として入力される音声データが文字に変換され（音声文字変換処理）、その変換された文字で構成される文字文章が分析され感情レベルなどの分析結果が求められる（文字文章分析処理）。さらに、この文字文章が翻訳され翻訳文が求められる（翻訳処理）。そして、さらに、このＡＰＩの引数として入力される画像データを分析し感情レベルを推定して画像分析結果を求め（画像分析処理）、このＡＰＩに入力される音声データを分析し感情レベルやハラスメントレベルを推定して音声分析結果を求める（音声分析処理）。 Setting (9) is a combination of settings (5) and (4). Audio data input as an argument to this video conference analysis API is converted to text (speech-to-text conversion processing), and the text composed of the converted characters is analyzed to obtain analysis results such as emotional level (text analysis processing). This text is then translated to obtain a translation (translation processing). Image data input as an argument to this API is then analyzed to estimate emotional level and obtain image analysis results (image analysis processing), and audio data input to this API is analyzed to estimate emotional level and harassment level and obtain audio analysis results (audio analysis processing).

設定（１０）は設定（６）と設定（４）の組合せであり、設定（１１）は設定（７）と設定（４）の組合せであり、設定（１２）は設定（８）と設定（４）の組合せである。 Setting (10) is a combination of settings (6) and (4), setting (11) is a combination of settings (7) and (4), and setting (12) is a combination of settings (8) and (4).
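The way settings (9) to (12) are built from the earlier settings can be sketched as a small lookup table. This is only an illustrative sketch: the names `Process`, `BASE_SETTINGS`, `COMBINED_SETTINGS` and `processes_for` are hypothetical and do not appear in the specification. The contents of settings (4) and (5) are inferred here from the description of setting (9) above, and settings (1) to (3) are omitted because they are not described in this passage.

```python
from enum import Enum


class Process(Enum):
    SPEECH_TO_TEXT = "speech-to-text conversion"
    TEXT_ANALYSIS = "text analysis"
    TRANSLATION = "translation"
    IMAGE_ANALYSIS = "image analysis"
    AUDIO_ANALYSIS = "audio analysis"


# Settings that are described directly as a list of processes.
# Settings (4) and (5) are an inference from the description of setting (9).
BASE_SETTINGS = {
    4: frozenset({Process.IMAGE_ANALYSIS, Process.AUDIO_ANALYSIS}),
    5: frozenset({Process.SPEECH_TO_TEXT, Process.TEXT_ANALYSIS, Process.TRANSLATION}),
    6: frozenset({Process.SPEECH_TO_TEXT, Process.TEXT_ANALYSIS}),
    7: frozenset({Process.SPEECH_TO_TEXT, Process.TRANSLATION}),
    8: frozenset({Process.TEXT_ANALYSIS, Process.TRANSLATION}),
}

# Settings that are described as a combination of two other settings.
COMBINED_SETTINGS = {9: (5, 4), 10: (6, 4), 11: (7, 4), 12: (8, 4)}


def processes_for(setting: int) -> frozenset:
    """Return the set of processes that a given setting number enables."""
    if setting in BASE_SETTINGS:
        return BASE_SETTINGS[setting]
    if setting in COMBINED_SETTINGS:
        first, second = COMBINED_SETTINGS[setting]
        return BASE_SETTINGS[first] | BASE_SETTINGS[second]
    raise ValueError(f"setting ({setting}) is not covered by this sketch")


if __name__ == "__main__":
    for number in sorted(BASE_SETTINGS.keys() | COMBINED_SETTINGS.keys()):
        names = sorted(p.value for p in processes_for(number))
        print(f"setting ({number}): {', '.join(names)}")
```

Representing each setting as a set makes the "combination" wording literal: setting (10), for example, is simply the union of the processes of settings (6) and (4).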

＜ビデオ会議サーバ＞ <Video conference server>
ビデオ会議サーバ９は、会議参加者の各クライアント端末７から送信されてくる参加者の画像や音声のデータを受け付けて、この画像や音声のデータに基づいてビデオ会議画面や会議の音声を構成して、各クライアント端末７に向けて送信する。このような動作を行うことにより、会議参加者は、各クライアント端末７を介して会議の画面や音声を視聴することができ、ビデオ会議を進行させることができる。このように、このビデオ会議サーバ９は、ビデオ会議を実現する機能を有している。 The video conference server 9 receives image and audio data of the conference participants transmitted from each client terminal 7, composes a video conference screen and audio of the conference based on this image and audio data, and transmits it to each client terminal 7. By performing this operation, the conference participants can view the conference screen and audio via each client terminal 7, and the video conference can proceed. In this way, the video conference server 9 has the function of realizing a video conference.

図１７に示す概略ブロック図のようにビデオ会議サーバ９は、ビデオ会議サーバ制御部９０、画面構成部９１、会議設定部９２、議事記録部９３、画像音声受付部９５、画面提供部９６、ビデオ会議分析ＡＰＩ呼出部９７、通信部９８、記憶部９９を含むように構成されている。 As shown in the schematic block diagram of Figure 17, the video conference server 9 is configured to include a video conference server control unit 90, a screen configuration unit 91, a conference setting unit 92, a minutes recording unit 93, an image and audio receiving unit 95, a screen providing unit 96, a video conference analysis API calling unit 97, a communication unit 98, and a memory unit 99.

ビデオ会議サーバ制御部９０は、ビデオ会議サーバ９を構成する各要素の制御などを行うＣＰＵ（図示せず）を含むように構成されている。記憶部９９は、補助記憶装置（図示せず）やＲＡＭ（図示せず）により構成されている。 The videoconference server control unit 90 is configured to include a CPU (not shown) that controls each element that makes up the videoconference server 9. The memory unit 99 is configured from an auxiliary storage device (not shown) and RAM (not shown).

通信部９８は、インターネット８に接続されて、このビデオ会議分析システム１Ａを構成する各サーバや、各クライアント端末７との間でデータの送受信を行う。 The communication unit 98 is connected to the Internet 8 and exchanges data with each of the servers that make up this video conference analysis system 1A and with each client terminal 7.

画像音声受付部９５は、実施形態１に係るＷＥＢ／ＡＰＰサーバ２の画像音声受付部２５とほぼ同様の機能を行う。この画像音声受付部９５は、ビデオ会議サーバ制御部９０の制御に基づいて、会議参加者の各クライアント端末７からインターネット８を介して送信されてくる参加者を撮影した画像や参加者の発言の音声データを受け付ける。 The image and audio receiving unit 95 performs substantially the same functions as the image and audio receiving unit 25 of the WEB/APP server 2 in embodiment 1. Based on the control of the video conference server control unit 90, this image and audio receiving unit 95 receives captured images of participants and audio data of participants' speech transmitted via the Internet 8 from each client terminal 7 of the conference participants.

画面構成部９１は、実施形態１に係るＷＥＢ／ＡＰＰサーバ２の画面構成部２１とほぼ同様の機能を行う。この画面構成部９１は、ビデオ会議サーバ制御部９０の制御に基づいて、各クライアント端末７に表示されるビデオ会議画面などを構成する。 The screen composition unit 91 performs substantially the same functions as the screen composition unit 21 of the WEB/APP server 2 according to the first embodiment. The screen composition unit 91 composes the video conference screen to be displayed on each client terminal 7 under the control of the video conference server control unit 90.
画面提供部９６は、実施形態１に係るＷＥＢ／ＡＰＰサーバ２の画面提供部２６とほぼ同様の機能を行う。この画面提供部９６は、ビデオ会議サーバ制御部９０の制御に基づいて、画面構成部９１で構成された画面を参加者の各クライアント端末７に向けて送信して提供する。 The screen providing unit 96 performs substantially the same function as the screen providing unit 26 of the WEB/APP server 2 according to the first embodiment. Under the control of the video conference server control unit 90, the screen providing unit 96 transmits the screen composed by the screen composition unit 91 to each participant's client terminal 7.

会議設定部９２は、実施形態１に係るＷＥＢ／ＡＰＰサーバ２の会議設定部２２とほぼ同様の機能を行う。この会議設定部９２には、会議に先立って、会議参加者や会議の分析内容などの設定を行う。 The conference setting unit 92 performs substantially the same functions as the conference setting unit 22 of the WEB/APP server 2 in embodiment 1. This conference setting unit 92 is used to set conference participants, conference analysis content, and other settings prior to the conference.

議事記録部９３は、実施形態１に係るＷＥＢ／ＡＰＰサーバ２の議事記録部２３とほぼ同様の機能を行う。この議事記録部９３は、画像音声受付部９５で受け付けた参加者の発言の音声データを記録したり、参加者の音声データを文字に変換した文字データを記録したりする。さらに、参加者の画像や音声データに基づいて分析される分析結果を記録するようにしてもよい。 The minutes recording unit 93 performs substantially the same functions as the minutes recording unit 23 of the WEB/APP server 2 in embodiment 1. This minutes recording unit 93 records the audio data of participants' remarks received by the image and audio receiving unit 95, and records text data obtained by converting the participants' audio data into text. Furthermore, it may also be configured to record the results of analysis based on the participants' images and audio data.

ビデオ会議分析ＡＰＩ呼出部９７は、参加者の画像データや音声データを分析する際、ＷＥＢ／ＡＰＰサーバ２Ａから提供されているビデオ会議分析ＡＰＩを呼び出す。このビデオ会議分析ＡＰＩ呼出部９７は、ビデオ会議分析ＡＰＩに、分析内容の設定、参加者の音声データ、参加者の画像データなどの入力を行い、ビデオ会議分析ＡＰＩから分析結果などを取得する。 When analyzing participant image data and voice data, the video conference analysis API call unit 97 calls the video conference analysis API provided by the WEB/APP server 2A. This video conference analysis API call unit 97 inputs analysis content settings, participant voice data, participant image data, etc. into the video conference analysis API, and obtains analysis results, etc. from the video conference analysis API.
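The role of the video conference analysis API calling unit 97 (pass in the analysis setting together with audio, image or text data, then receive the analysis results) can be sketched as follows. The specification does not define a wire format, so everything here is an assumption made for illustration: the `AnalysisRequest` type, the JSON field names, the base64 encoding, and the `transport` callable that stands in for the network call to the WEB/APP server 2A are all hypothetical.

```python
import base64
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AnalysisRequest:
    """Inputs that the calling unit hands to the analysis API (hypothetical shape)."""
    setting: int                   # analysis setting number, e.g. 6 or 9
    audio: Optional[bytes] = None  # participant audio data
    image: Optional[bytes] = None  # participant image data
    text: Optional[str] = None     # text data, for settings that take text input

    def to_json(self) -> str:
        # Binary data is base64-encoded so it can travel inside a JSON body.
        payload = {"setting": self.setting}
        if self.audio is not None:
            payload["audio_b64"] = base64.b64encode(self.audio).decode("ascii")
        if self.image is not None:
            payload["image_b64"] = base64.b64encode(self.image).decode("ascii")
        if self.text is not None:
            payload["text"] = self.text
        return json.dumps(payload)


def call_analysis_api(request: AnalysisRequest,
                      transport: Callable[[str], str]) -> dict:
    """Send the request through `transport` and decode the JSON reply."""
    return json.loads(transport(request.to_json()))


if __name__ == "__main__":
    # A stand-in transport that returns a fixed reply instead of using a network.
    def fake_transport(body: str) -> str:
        fields = sorted(json.loads(body))
        return json.dumps({"received_fields": fields, "emotion_level": 3})

    reply = call_analysis_api(AnalysisRequest(setting=6, audio=b"\x00\x01"),
                              fake_transport)
    print(reply)
```

Injecting the transport as a callable keeps the sketch independent of any particular HTTP library and makes the calling unit easy to exercise without a running server.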

＜本実施の形態２の効果＞ <Effects of the Second Embodiment>
本実施の形態２によれば、本実施の形態１とほぼ同様の効果が得られる。 According to the second embodiment, substantially the same effects as those of the first embodiment can be obtained.

本実施の形態２によれば、ビデオ会議参加者の発言の音声データが文字に変換され、その発言の内容を示す文字文章が構成される。また、構成されたこの文字文章に基づいて分析が行われて、感情レベルおよびハラスメントレベルが推定され発言分析結果が求められる。このように、ビデオ会議参加者の発言に基づいて感情レベルおよびハラスメントレベルを分析することができる。 According to the second embodiment, the audio data of speeches made by video conference participants is converted into text, and a text sentence indicating the content of the speech is constructed. Analysis is then performed based on the constructed text sentence, and the emotional level and harassment level are estimated to obtain a speech analysis result. In this way, the emotional level and harassment level can be analyzed based on the speeches made by video conference participants.

また、ビデオ会議分析プログラムをＡＰＩとして提供することができるため、様々なビデオ会議システムでこのＡＰＩを利用することができ、汎用性を持たせることができる。 In addition, the video conferencing analysis program can be provided as an API, which means that the API can be used in a variety of video conferencing systems, making it versatile.

また、本実施の形態２によれば、会議参加者の発言の内容を示す文字文章が、翻訳されるため、使用する言語の異なる参加者同士でも翻訳文を参照することにより円滑な意思の疎通を図ることができる。 Furthermore, according to this second embodiment, the text indicating the content of the statements made by the conference participants is translated, so participants who speak different languages can refer to the translations to facilitate smooth communication.

また、本実施の形態２によれば、会議参加者を撮影した画像データに基づいて感情レベルが推定され画像分析結果が求められる。このため、参加者の発言の内容を示す文字文章と異なるデータを用いて感情レベルを分析することができ、多面的に分析結果を得ることができる。 Furthermore, according to this second embodiment, emotional levels are estimated based on image data of conference participants, and image analysis results are obtained. This makes it possible to analyze emotional levels using data other than text that indicates the content of participants' remarks, and obtain multifaceted analysis results.

また、本実施の形態２によれば、会議参加者の発言の音声データに基づいて感情レベルおよびハラスメントレベルが推定され音声分析結果が求められる。このため、参加者の発言の内容を示す文字文章と異なり、参加者の音声データそのものを用いて感情レベルおよびハラスメントレベルを分析することができ、多面的に分析結果を得ることができる。 Furthermore, according to this second embodiment, emotional and harassment levels are estimated based on the audio data of the speech of the conference participants, and audio analysis results are obtained. Therefore, unlike text that indicates the content of the participants' speech, the emotional and harassment levels can be analyzed using the participants' audio data itself, and multifaceted analysis results can be obtained.

［発明のその他の実施の形態］ [Other embodiments of the invention]
なお、実施形態１や２に係るビデオ会議分析システム１，１Ａは、機能別に複数のサーバを設置した構成になっているが、全ての機能を統合して１台のサーバで実現してもよい。 Although the videoconference analysis systems 1 and 1A according to the first and second embodiments are configured with multiple servers installed for different functions, all functions may be integrated into a single server.

また、「ネットワーク」は、インターネット８に限定されるものでなく、ローカルエリアネットワーク（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）やワイドエリアネットワーク（ＷＡＮ:ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）などその他のネットワークで構成するようにしてもよい。 Furthermore, the "network" is not limited to the Internet 8, but may be composed of other networks such as a local area network or a wide area network (WAN).

１００，１００Ａ…ビデオ会議システム、１…ビデオ会議分析システム、２，２Ａ…ＷＥＢ／ＡＰＰサーバ、３…音声文字変換サーバ、４…文字文章分析サーバ、５…翻訳サーバ、６…画像音声分析サーバ、７_１，７_２，７_３，・・・，７_ｎ，７…クライアント端末、８…インターネット（ネットワーク）、９…ビデオ会議サーバ、２０…ＷＥＢ／ＡＰＰサーバ制御部（制御部）、２１…画面構成部、２２…会議設定部、２３…議事記録部、２４…判定部、２５…画像音声受付部、２６…画面提供部、２７…ビデオ会議分析ＡＰＩ提供部、３１…音声文字変換部、３４…不適切語句記憶部、３５…不適切語句検出部、４１…文字文章分析部、５１…翻訳部、６１…画像分析部、６４…音声分析部、７１…表示部、７２…カメラ、７３…マイクロホン、７４…文字入力部、７５…スピーカ、９７…ビデオ会議分析ＡＰＩ呼出部、２００，２１０…ビデオ会議画面、２０１…参加者画像、２０２，２０６，２０８…感情レベル表示、２０５…文字文章表示、２０７…ハラスメントレベル表示
100, 100A...Video conference system, 1...Video conference analysis system, 2, 2A...WEB/APP server, 3...Speech-to-text conversion server, 4...Text analysis server, 5...Translation server, 6...Image and audio analysis server, 7_1, 7_2, 7_3, ..., 7_n, 7...Client terminal, 8...Internet (network), 9...Video conference server, 20...WEB/APP server control unit (control unit), 21...Screen composition unit, 22...Conference setting unit, 23...Minutes recording unit, 24...Determination unit, 25...Image and audio receiving unit, 26...Screen providing unit, 27...Video conference analysis API providing unit, 31...Speech-to-text conversion unit, 34...Inappropriate phrase storage unit, 35...Inappropriate phrase detection unit, 41...Text analysis unit, 51...Translation unit, 61...Image analysis unit, 64...Audio analysis unit, 71...Display unit, 72...Camera, 73...Microphone, 74...Text input unit, 75...Speaker, 97...Video conference analysis API calling unit, 200, 210...Video conference screen, 201...Participant image, 202, 206, 208...Emotion level display, 205...Text display, 207...Harassment level display

Claims

A video conference analysis system for analyzing statements made by participants in a video conference held over a network, comprising:
an image and audio receiving unit that receives audio data of the participants' speeches via the network;
a voice-to-text conversion unit that converts the voice data received by the image and audio receiving unit into text;
a character sentence analysis unit that analyzes character sentences that indicate the contents of the participants' statements, which are composed of characters converted from the voice data by the voice-to-text conversion unit, and obtains a statement analysis result;
a screen configuration unit that configures a display screen that displays the text and the utterance analysis result obtained by the text analysis unit;
a screen providing unit that transmits the display screen configured by the screen configuration unit to the participants participating in the video conference via the network;
a control unit that converts the voice data received by the image and voice receiving unit into text using the voice-to-text conversion unit to obtain the text sentence, analyzes the text sentence using the text sentence analysis unit to obtain the utterance analysis result, configures the display screen using the screen configuration unit, and controls operations to transmit the display screen from the screen providing unit,
The character and sentence analysis unit
A video conference analysis system characterized by performing an analysis based on the text indicating the content of the participants' remarks, estimating an emotional level indicating their emotional state and a harassment level indicating the degree of harassment, and obtaining the remark analysis results.

The speech-to-text conversion unit
a trained speech-to-text conversion model is generated by machine learning using a combination of training speech data and correct character data corresponding to the training speech data as training data, and the speech data received by the image/speech receiving unit is input and converted into characters by calculation to obtain the character sentence;
The character and sentence analysis unit
The videoconference analysis system of claim 1, characterized in that the acquired text is input into a trained text analysis model generated by machine learning using combinations of training texts and correct emotion data corresponding to the training texts, and combinations of the training texts and correct harassment data corresponding to the training texts, and the acquired text is calculated to estimate the emotion level and the harassment level and obtain the utterance analysis results.

a translation unit that translates an original text written in a source language into a predetermined translation language;
The control unit
Controlling an operation of translating the text indicating the content of the speech of the participant by the translation unit to obtain a translated text;
The screen configuration unit
a translation display screen that displays the translation of the character sentence translated by the translation unit;
The screen providing unit
3. The video conference analysis system according to claim 1, wherein the translation display screen constructed by the screen construction unit is transmitted to the participants participating in the video conference via the network.

The translation unit
The videoconferencing analysis system according to claim 3, characterized in that a combination of a training original sentence and a correct translation sentence that is the correct translation of the training original sentence is used as training data, and the original sentence is input into a trained translation model generated by machine learning, and a translation of the training original sentence is obtained by performing calculations on the trained translation model.

an image analysis unit that performs analysis based on image data of the participants and estimates the emotion level to obtain an image analysis result;
The image and sound receiving unit
receiving the image data of the participants via the network;
The control unit
Controlling an operation of causing the image analysis unit to analyze the image data received by the image and audio receiving unit;
The image analysis unit
obtaining the image analysis results by inputting the image data into a trained image analysis model generated by machine learning using a combination of a training face image and a correct emotion type corresponding to the training face image as training data, and performing a calculation;
The screen configuration unit
configuring an image analysis display screen that displays the image analysis results obtained by the image analysis unit;
The screen providing unit
The video conference analysis system according to any one of claims 1 to 4, characterized in that the image analysis display screen constructed by the screen construction unit is transmitted to the participants participating in the video conference via the network.

an image analysis unit that performs analysis based on image data of the participants and estimates the emotion level to obtain an image analysis result;
The image and sound receiving unit
receiving the image data of the participants via the network;
The control unit
Controlling an operation of causing the image analysis unit to analyze the image data received by the image and audio receiving unit;
The image analysis unit
extracting a face image from the image data, and performing analysis using at least one of the following: eye shape, changes in the eye shape, eyebrow shape, changes in the eyebrow shape, shapes of the corners of the mouth which are the sides of the lips, changes in the shape of the corners of the mouth, cheek shape, changes in the cheek shape, frequency of appearance of teeth, and changes in the frequency of appearance of teeth; and obtaining the image analysis results;
The screen configuration unit
configuring an image analysis display screen that displays the image analysis results obtained by the image analysis unit;
The screen providing unit
The video conference analysis system according to any one of claims 1 to 4, characterized in that the image analysis display screen constructed by the screen construction unit is transmitted to the participants participating in the video conference via the network.

a voice analysis unit that performs analysis based on the voice data received by the image and voice receiving unit, estimates the emotion level and the harassment level, and obtains a voice analysis result;
The control unit
Controlling an operation to cause the voice analysis unit to analyze the voice data;
The voice analysis unit
a trained voice analysis model is generated by machine learning using a combination of training voice data and correct emotion data corresponding to the training voice data, and a combination of the training voice data and correct harassment data corresponding to the training voice data, and the received voice data is inputted into the trained voice analysis model, and calculations are performed to estimate the emotion level and the harassment level, thereby obtaining the voice analysis result;
The screen configuration unit
configuring a voice analysis display screen that displays the voice analysis result obtained by the voice analysis unit;
The screen providing unit
The video conference analysis system according to any one of claims 1 to 4, characterized in that the voice analysis display screen constructed by the screen construction unit is transmitted to the participants participating in the video conference via the network.

a voice analysis unit that performs analysis based on the voice data received by the image and voice receiving unit, estimates the emotion level and the harassment level, and obtains a voice analysis result;
The control unit
Controlling an operation to cause the voice analysis unit to analyze the voice data;
The voice analysis unit
performing an analysis using at least one of the following: voice volume, changes in voice volume, voice pitch, changes in voice pitch, speaking rate, changes in speaking rate, frequency of speaking over the words of other participants, and changes in the frequency of speaking over the words of the other participants; and obtaining the voice analysis result;
The screen configuration unit
configuring a voice analysis display screen that displays the voice analysis result obtained by the voice analysis unit;
The screen providing unit
The video conference analysis system according to any one of claims 1 to 4, characterized in that the voice analysis display screen constructed by the screen construction unit is transmitted to the participants participating in the video conference via the network.

an image analysis unit that performs analysis based on image data of the participants and estimates the emotion level to obtain an image analysis result;
a voice analysis unit that performs analysis based on the voice data received by the image and voice receiving unit, estimates the emotion level and the harassment level, and obtains a voice analysis result;
a judgment unit that comprehensively evaluates the statement analysis result analyzed by the character and sentence analysis unit, the image analysis result obtained by the image analysis unit, and the voice analysis result obtained by the voice analysis unit, and estimates at least one of a comprehensive emotion level and a comprehensive harassment level to obtain a comprehensive judgment result,
The image and sound receiving unit
receiving the image data of the participants via the network;
The control unit
The image data is analyzed by the image analysis unit, the voice data is analyzed by the voice analysis unit, and the judgment unit comprehensively evaluates the utterance analysis result analyzed by the character and sentence analysis unit, the image analysis result obtained by the image analysis unit, and the voice analysis result obtained by the voice analysis unit, thereby controlling the operation of obtaining the comprehensive judgment result;
The screen configuration unit
configuring a comprehensive judgment display screen that displays the comprehensive judgment result obtained by the judgment unit;
The screen providing unit
2. The video conference analysis system according to claim 1, wherein the comprehensive judgment display screen constructed by the screen construction unit is transmitted to the participants participating in the video conference via the network.

an inappropriate phrase detection unit that detects whether the text contains inappropriate phrases that have been registered in advance;
When the inappropriate phrase is detected by the inappropriate phrase detection unit,
The control unit
The video conference analysis system of any one of claims 1 to 9, characterized in that it takes measures to block inappropriate phrases, including at least one of stopping transmission of the audio data corresponding to the inappropriate phrases sent via the network to the participants participating in the video conference, deleting the inappropriate phrases included in the text, and replacing the inappropriate phrases included in the text with appropriate phrases corresponding to the inappropriate phrases.

When the detection frequency of the inappropriate phrase detected by the inappropriate phrase detection unit exceeds a predetermined threshold,
The control unit
11. The video conference analysis system of claim 10, wherein the screen composition unit composes a warning display screen, and the screen providing unit controls an operation of transmitting the warning display screen composed by the screen composition unit to the participants participating in the video conference via the network.

A video conference analysis program for analyzing statements made by participants in a video conference conducted via a network,
an image and voice receiving process for receiving voice data of the participants via the network;
a voice-to-text conversion process for converting the voice data received in the image and voice receiving process into text;
a text analysis process for analyzing text indicating the content of the speech of the participant, which is composed of characters converted from the voice data in the voice-to-text conversion process, to obtain a speech analysis result;
The character sentence analysis process includes:
A video conference analysis program characterized by performing an analysis based on the text indicating the content of the participants' remarks, estimating an emotional level indicating their emotional state and a harassment level indicating the degree of harassment, and obtaining the remark analysis results.

A translation process is provided for translating an original text written in a source language into a predetermined translation language,
The translation process includes:
13. The video conference analysis program according to claim 12, wherein the translated sentence is generated by translating the text indicating the content of the speech of the participant.

an image analysis process for performing an analysis based on image data of the participants, estimating the emotion level, and obtaining an image analysis result;
The image and sound reception process includes:
receiving the image data of the participants via the network;
The image analysis process includes:
13. The video conference analysis program according to claim 12, wherein a facial image is extracted from the image data, and an analysis is performed using at least one of the following: eye shape, changes in the eye shape, eyebrow shape, changes in the eyebrow shape, shapes of the corners of the mouth (the parts on both sides of the lips), changes in the shape of the corners of the mouth, cheek shape, changes in the cheek shape, frequency of appearance of teeth, and changes in the frequency of appearance of teeth, to obtain the image analysis result.

a voice analysis process for performing an analysis based on the voice data received in the image and voice receiving process, and estimating the emotion level and the harassment level to obtain a voice analysis result;
The voice analysis process includes:
13. The video conference analysis program according to claim 12, wherein the voice analysis result is obtained by performing analysis using at least one of the following: voice volume, changes in voice volume, voice pitch, changes in voice pitch, speaking rate, changes in speaking rate, frequency of speaking over the speech of other participants, and changes in frequency of speaking over the speech of the other participants.

an image analysis process for performing an analysis based on image data of the participants, estimating the emotion level, and obtaining an image analysis result;
a voice analysis process for performing an analysis based on the voice data received in the image and voice receiving process, and estimating the emotion level and the harassment level to obtain a voice analysis result;
a judgment process for comprehensively evaluating the statement analysis result obtained by the character and sentence analysis process, the image analysis result obtained by the image analysis process, and the voice analysis result obtained by the voice analysis process, and estimating at least one of a comprehensive emotion level and a comprehensive harassment level to obtain a comprehensive judgment result;
The image and sound reception process includes:
receiving the image data of the participants via the network;
the image analysis process analyzes the image data to obtain the image analysis result;
the voice analysis process analyzes the voice data to obtain the voice analysis result;
13. The video conference analysis program according to claim 12, wherein the judgment process determines the overall judgment result by comprehensively evaluating the utterance analysis result obtained by the character sentence analysis process, the image analysis result obtained by the image analysis process, and the audio analysis result obtained by the audio analysis process.