JP7553408B2

JP7553408B2 - Drawing device and program

Info

Publication number: JP7553408B2
Application number: JP2021118976A
Authority: JP
Inventors: 晴久加藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2024-09-18
Anticipated expiration: 2041-07-19
Also published as: JP2023014813A

Description

The present invention relates to a drawing device and a program for drawing avatars that can be used for remote communication and the like.

In a system that estimates a user's movements and applies them to an avatar in order to realize remote communication over a network, communication becomes smoother if transmission delays can be suppressed. Patent Documents 1 and 2 are examples of related art that address this issue, and they disclose the following techniques.

Patent Document 1 discloses a method, for virtual reality (VR) intended for gaming, that estimates the future movements of two users and controls fast-forwarding and delaying of the video so that the moments at which the avatars' gazes meet coincide when the users look at each other. Patent Document 2 discloses a method, for applications in which users participate in a streamed video as avatars, that suppresses transmission delay by transmitting reference posture identification data, which can identify information about the viewing user's posture with a small amount of data, instead of posture feature data, which requires a large amount of data.

Patent Document 1: JP 2020-009295 A
Patent Document 2: Japanese Patent No. 6731532

However, the conventional technologies described above have difficulty realizing smooth remote communication on the premise that delays caused by transmission and the like exist.

In Patent Document 1, the avatars' gazes meet, but the user whose video is fast-forwarded is not yet actually looking at the other user's avatar, and conversely the user whose video is delayed must wait for their movements to be reflected, so smooth communication is not possible. In addition, the amount of delay in transmission and reception fluctuates with network conditions, which makes it difficult to predict the timing at which the gazes should be made to coincide.

Patent Document 2 attempts to suppress transmission delay by reducing the amount of transmitted data, but the time required to recognize the posture and the time required for transmission unavoidably introduce delay, so the method cannot avoid or suppress the effect of the delay itself.

In view of the above problems with the conventional technology, an object of the present invention is to provide a drawing device and a program that can draw avatars while effectively suppressing the effects of delays caused by transmission and the like.

To achieve the above object, the present invention is a drawing device whose first feature is that it comprises: an estimation unit that receives, in time series from a user whose avatar is to be drawn, control parameters for avatar drawing as recognition information obtained by recognizing the state of the user, and estimates the amount of delay between the time of the most recent past recognition information among the received recognition information and the current time; a prediction unit that predicts, from the history of the received recognition information, the recognition information corresponding to the current time as the recognition information located on the future side of the most recent past recognition information by the amount of delay; and a drawing unit that uses the predicted recognition information to draw the user's avatar as the avatar at the current time.

A second feature is that the prediction unit predicts the recognition information corresponding to the current time using a prediction model selected from a plurality of types of prediction models prepared in advance, and that it applies each of the plurality of types of prediction models to the history of the received recognition information, evaluates the prediction accuracy of each prediction model, and selects the prediction model with the best prediction accuracy. The present invention is also characterized by being a program that causes a computer to function as the drawing device.

According to the first feature, the amount of delay is estimated, the recognition information at the current time is predicted from the history of received recognition information as lying on the future side by this amount of delay, and the avatar is drawn from the predicted recognition information, so the prediction effectively reduces the effect of delay on avatar drawing. According to the second feature, the prediction accuracy of a plurality of types of prediction models is evaluated and the model with the best prediction accuracy is used, which makes highly accurate prediction possible.

FIG. 1 is a configuration diagram of a communication system according to an embodiment.
FIG. 2 is a flowchart of the operation of the communication system according to an embodiment.
FIG. 3 is a functional block diagram of a recognition device and a drawing device in the communication system according to an embodiment.
FIG. 4 is a flowchart of the operation of the recognition device according to an embodiment.
FIG. 5 is a flowchart of the operation of the drawing device according to an embodiment.
FIG. 6 is a diagram showing a schematic example of estimating the amount of delay.
FIG. 7 is a diagram showing a schematic example in which the prediction unit prepares a plurality of prediction models in advance according to the delay value to be predicted and the model type used for prediction.
FIG. 8 is a diagram showing a schematic example in which the prediction unit actually applies each prediction model and evaluates its prediction accuracy in order to select the optimal one from among a plurality of types of prediction models.
FIG. 9 is a diagram showing a schematic example in which the distribution of delay values is a normal distribution, as an example of setting values that can realistically occur from the distribution of delay values so that the prediction unit can effectively prepare a small number of delay value candidates.
FIG. 10 is a diagram showing a modification of FIG. 7.
FIG. 11 is an example in which, for three candidate delay values, the time series of actual values and the predicted time series obtained by applying prediction for each candidate delay value are plotted together.
FIG. 12 is a diagram showing the hardware configuration of a general computer.

FIG. 1 is a configuration diagram of a communication system 100 according to one embodiment. The communication system 100 includes a recognition device 10 as a first terminal used by a first user U1, and a drawing device 20 as a second terminal used by a second user U2. The recognition device 10 and the drawing device 20 can communicate with each other via a network NW configured as the Internet and/or a LAN (local area network) or the like. Both the recognition device 10 and the drawing device 20 can be configured as mobile terminals such as smartphones, tablets, and laptop computers, or as computer devices of any other configuration.

FIG. 2 is a flowchart of the operation of the communication system 100 according to one embodiment. As also shown schematically by the illustrations in FIG. 1, the steps of FIG. 2, which represent the overall operation of the communication system 100, are as follows.

The recognition device 10 includes an imaging unit 11, a camera that provides an imaging function. In step S1, it captures an image of the first user U1, recognizes the user's facial expression and the like, and transmits this recognition information to the drawing device 20 in real time via the network NW. The drawing device 20 includes a presentation unit 24, a display that provides at least a display function. In step S2, it receives the recognition information transmitted from the recognition device 10, predicts and draws the avatar A1 of the first user U1 for the current time, taking into account the history of recognition information up to the current real-time instant (the history that has been transmitted from the recognition device 10 and already received) and the performance of prediction as evaluated on this history, and displays the avatar A1 on the presentation unit 24, thereby providing an avatar display to the second user U2.

In step S3, the current time i is updated to the next time i+1 and the process returns to step S1, so that the same steps are repeated at each real-time instant. That is, for each real-time instant i=1,2,…, the steps of FIG. 2 repeatedly acquire the recognition information p(i) of the first user U1 in step S1 and draw and display the avatar A1(i) of the first user in step S2. By watching the avatar A1(i), which moves in real time because it is drawn in real time, the second user U2 can grasp in real time the state of the remote first user U1 (whole-body pose and/or facial expression, etc.) and can communicate remotely with the first user U1 in real time via the avatar A1(i).

Here, the recognition information p(i) and the like transmitted from the recognition device 10 via the network NW is received by the drawing device 20 with a delay due to communication. (In addition, each process in the recognition device 10 and the drawing device 20 takes a certain amount of time to complete, however short, so processing delay also occurs.) The flow of FIG. 2 can be executed with the processing timing synchronized between the recognition device 10 and the drawing device 20 at common times i=1,2,…. However, when the avatar A1(i) is drawn in step S2 at the current time i, the drawing device 20 has not yet received the recognition information p(i) of the current time i because of delays due to communication and the like. It has completed reception only of the recognition information p(1),p(2),…,p(i-K-1),p(i-K) at the past times 1,2,…,i-K-1,i-K, that is, up to K steps (K>0) before the current time. (Here, time i=1 is the initial time at which the communication system 100 starts the operation of FIG. 2.)

Therefore, the drawing device 20 predicts the recognition information p(i) of the current time i (recognition information such as the facial expression of the first user U1 at the current time i) from the reception history of recognition information up to the time K steps in the past, and draws the avatar A1(i) according to this predicted recognition information p(i). This makes it possible to realize smooth real-time communication using avatar drawing on the premise that delay exists. The details are described later. (Note that the recognition information p(i) predicted by the drawing device 20 normally differs from the measured recognition information p(i) that the recognition device 10 actually recognized and that the drawing device 20 receives at a time later than the current time i. Because it is clear from context whether p(i) and the like denote measured or predicted values, the same notation "p(i)" is used for both to keep the notation simple. Where a distinction is necessary, notations such as p(i)_[measured] and p(i)_[predicted] are used.)
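The idea of predicting p(i) from a history that ends at p(i-K) can be illustrated with a minimal sketch. Linear extrapolation is used here purely as an assumption for illustration; the document describes its actual prediction models later, and the function name `extrapolate` is hypothetical.

```python
def extrapolate(history, k):
    """Predict the parameter vector k time steps after the last received one.

    history: list of equal-length parameter vectors, oldest first,
             i.e. p(1), ..., p(i-K).
    k:       estimated delay in time steps (K > 0).
    Linear extrapolation is an illustrative assumption, not the
    document's prediction model.
    """
    if not history:
        raise ValueError("history must contain at least one vector")
    last = history[-1]
    if len(history) < 2:
        # No velocity information is available: hold the last value.
        return list(last)
    prev = history[-2]
    # Per-parameter velocity estimated from the two most recent samples.
    return [b + k * (b - a) for a, b in zip(prev, last)]
```

For example, with the last two received vectors `[0.0, 1.0]` and `[1.0, 1.0]` and K=3, the first parameter continues its upward trend while the second stays constant.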

In FIG. 2, step S2 is placed after step S1 for convenience of presentation as a flowchart, but step S2 need not wait for step S1 to complete. That is, at each real-time instant i=1,2,…, the recognition device 10 can execute step S1 while the drawing device 20 executes step S2 in parallel. In other words, steps S1 and S2 may be executed in parallel at each real-time instant i=1,2,…, whose timing is managed in step S3.

In the configuration of FIGS. 1 and 2, the first user U1 is the one whose avatar A1 is drawn, and the second user U2 is the one who views the drawn avatar A1. By adding a configuration in which the roles are swapped, two-way remote communication between the first user U1 and the second user U2 via both avatars becomes possible. (That is, two-way communication becomes possible when the recognition device 10, the first terminal used by the first user U1, is additionally provided with functions equivalent to the drawing device 20 for drawing the avatar A2 of the second user U2 (not shown in FIG. 1), and the drawing device 20, the second terminal used by the second user U2, is additionally provided with functions equivalent to the recognition device 10 for recognizing the facial expression and other aspects of the second user U2.) Similarly, the number of participating users is not limited to the two users U1 and U2; the avatars of any number of users can be drawn simultaneously for remote communication.

For clarity and brevity, however, the following description is limited to the case in which the number of people and the roles in avatar-based remote communication are as shown in FIGS. 1 and 2 (two people, one direction). As noted above, two-way avatar-based remote communication among any number of people can be carried out in the same manner.

FIG. 3 is a functional block diagram of the recognition device 10 and the drawing device 20 in the communication system 100 according to one embodiment. The recognition device 10 includes an imaging unit 11 and a recognition unit 12, and a recording unit 13 as an optional component. The drawing device 20 includes an estimation unit 21, a prediction unit 22, a drawing unit 23, a presentation unit 24, and a storage unit 25.

FIG. 4 is a flowchart of the operation of the recognition device 10 according to one embodiment, and FIG. 5 is a flowchart of the operation of the drawing device 20 according to one embodiment. Steps S11 to S13 in FIG. 4 correspond to step S1 in FIG. 2, and steps S21 to S24 in FIG. 5 correspond to step S2 in FIG. 2. Similarly, step S14 in FIG. 4 and step S25 in FIG. 5 correspond to step S3 in FIG. 2; they manage the timing of the common real-time instants i=1,2,… in the recognition device 10 and the drawing device 20, respectively.

The functional units of the recognition device 10 and the drawing device 20 in FIG. 3 are described in detail below, following the steps of FIGS. 4 and 5. As preprocessing before the recognition device 10 starts the flow of FIG. 4 and the drawing device 20 starts the flow of FIG. 5, the two devices synchronize their clocks using an existing method such as NTP (Network Time Protocol). The recognition device 10 and the drawing device 20 then use the synchronized clock (a timekeeping function provided by the computer's operating system or the like) to perform processing at each real-time instant. As in the description of FIG. 2, the common real-time instants are denoted i=1,2,… in the description of FIGS. 4 and 5, and the time at which the flow starts (the initial time) is i=1. In real time, these integer-indexed times i=1,2,… are discrete processing timings spaced at, for example, 1/30-second intervals, corresponding to a rate such as 30 fps (frames per second).
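The relationship between the synchronized clock and the discrete time i can be sketched as follows. This is only an illustration under stated assumptions: the clocks are taken to be already synchronized (for example by NTP at the operating-system level), and the function name `time_index` and the default rate of 30 fps are hypothetical choices based on the example in the text.

```python
import math

FPS = 30  # example rate mentioned in the text (30 frames per second)

def time_index(elapsed_seconds, fps=FPS):
    """Map seconds elapsed since the flow started to the discrete time i.

    The initial time is i = 1, and each later time follows at 1/fps-second
    intervals. Both devices are assumed to share the same start instant.
    """
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must be non-negative")
    return math.floor(elapsed_seconds * fps) + 1
```

Because both devices derive i from the same synchronized clock and the same start instant, they obtain the same index at the same real-time moment.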

<Step S11...imaging unit 11 (and recording unit 13)>
When the flow of FIG. 4 starts (initial time i=1), or at each subsequent real-time instant i (i≧2), in step S11 the imaging unit 11 captures an image of the first user U1 to obtain a frame image F(i) associated with the current time i, outputs this image F(i) to the recognition unit 12, and the process proceeds to step S12. In terms of hardware, the imaging unit 11 is a camera capable of capturing images of the first user U1.

As an additional embodiment, in step S11 the recording unit 13 may also record the speech of the first user U1 and acquire the audio information v(i) for the time i (the audio information in the real-time period between the discrete processing timing i and the following time i+1, or the preceding time i-1). In terms of hardware, the recording unit 13 is a microphone; by recording in the environment where the first user U1 is present, it obtains audio information that includes the speech of the first user U1. When obtaining the audio information v(i), an existing method may be applied to remove environmental sounds other than the speech of the first user U1 from the recorded data as noise.

<Step S12...recognition unit 12>
In step S12, the recognition unit 12 recognizes the first user U1 captured in the image F(i) obtained in step S11 and calculates the recognition information p(i) as the avatar control parameters needed to draw the first user U1 as the avatar A1. The process then proceeds to step S13. As the notation makes clear, the recognition information p(i) is obtained in association with the current time i, the processing timing (which corresponds to the capture time of the image F(i)).

The recognition information p(i) calculated by the recognition unit 12 can be of any type suited to the predetermined model used for the avatar drawn by the drawing device 20. In general, it can be calculated as an N-dimensional vector listing the values of N parameters at the current time i, as follows, where p_k(i) is the k-th (1≦k≦N) parameter.
p(i) = (p_1(i), p_2(i), …, p_N(i))

For example, when drawing the avatar's facial expression, the recognition information p(i) may be calculated as the parameters of blendshapes, an existing method. With blendshapes, the expression of the user being drawn as an avatar is represented by a weighted combination of multiple mesh models (each mesh model represents some expression state), so the weight of the k-th (1≦k≦N) mesh model may be calculated as the vector element p_k(i). As another example, when drawing the avatar's body pose, the recognition information p(i) may be calculated as parameters expressing the coordinates of each of the user's joints, using any predetermined human skeleton joint model. In this case, the coordinates of the k-th (1≦k≦N) joint (two-dimensional image coordinates (u,v) or three-dimensional coordinates (x,y,z)) may be calculated as the vector element p_k(i). When both the facial expression and the whole-body pose are drawn for the avatar, the combination of these blendshape weights and joint coordinates may be used as the recognition information p(i).
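How blendshape weights p_k(i) turn into a drawn shape can be sketched as a weighted combination of mesh vertices. The sketch below follows the description above literally (a weighted sum of mesh models). Many practical blendshape implementations instead apply the weights to offsets from a neutral mesh, so this is an illustrative assumption, and the function name `blend` is hypothetical.

```python
def blend(meshes, weights):
    """Combine mesh models by weights.

    meshes:  list of N meshes, each a list of (x, y, z) vertices with the
             same vertex count and ordering.
    weights: list of N blendshape weights, i.e. p_1(i), ..., p_N(i).
    Returns the blended list of (x, y, z) vertices.
    """
    if len(meshes) != len(weights) or not meshes:
        raise ValueError("need one weight per mesh and at least one mesh")
    n_vertices = len(meshes[0])
    blended = []
    for v in range(n_vertices):
        # Weighted sum of the v-th vertex over all mesh models, per axis.
        vertex = tuple(
            sum(w * mesh[v][axis] for mesh, w in zip(meshes, weights))
            for axis in range(3)
        )
        blended.append(vertex)
    return blended
```

With two meshes weighted 0.5 each, every blended vertex lies midway between the corresponding vertices of the two meshes.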

As for the specific method by which the recognition unit 12 calculates the recognition information p(i) as avatar control parameters from the image F(i), an existing method suited to the type of avatar control parameter may be used. For example, facial feature points may be extracted from the image F(i) using a pre-trained deep learning network or the like and the distribution of these feature points converted into blendshape weights, or a method may be used in which a deep learning network or the like outputs the blendshape weights directly from the image F(i) without going through feature point extraction. Joint coordinates can likewise be output from the image F(i) using a pre-trained deep learning network or the like.

<Step S13...transmission processing>
In step S13, the recognition information p(i) obtained in step S12 is transmitted from the recognition device 10 to the drawing device 20, and the process proceeds to step S14. If the recording unit 13 recorded audio in step S11, the audio information v(i) is also transmitted from the recognition device 10 to the drawing device 20 in step S13. The transmitted recognition information p(i) is stored in the storage unit 25 of the drawing device 20, where it is referenced and used. The transmitted audio information v(i) is used as speech data that is played back in synchronization with the avatar's movement when the drawing device 20 draws the avatar in real time.

When both the recognition information p(i) and the audio information v(i) are transmitted, each may be sent through the communication interface as soon as the recognition device 10 acquires it. For example, if, within the real-time period corresponding to the current time i (the period between the discrete processing timing i and the next time i+1), the audio information v(i) is acquired first and the recognition information p(i) later, the audio information v(i) may be packetized as an audio packet and transmitted as soon as it is acquired, after which the recognition information p(i) is packetized as a data packet in a predetermined format and transmitted. In addition to the processing timing i, the recognition information p(i) and the audio information v(i) may also be given a timestamp t recording when the communication interface packetized and sent them, and transmitted with it from the recognition device 10 to the drawing device 20. The timestamp t is marked at finer intervals than the times i=1,2,… (for example, at a 100-fold subdivision, t=0.01,0.02,0.03,…).
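A packet that carries both the processing time i and the finer send timestamp t might look like the following sketch. The JSON layout and the field names (`kind`, `i`, `t_send`, `payload`) are assumptions made for illustration; the document only says that a predetermined format is used.

```python
import json

def make_packet(kind, i, payload, t_send):
    """Serialize one packet to bytes.

    kind:    "recognition" for p(i) or "audio" for v(i) (hypothetical labels).
    i:       discrete processing time the payload belongs to.
    payload: JSON-serializable content, e.g. the parameter vector p(i).
    t_send:  send timestamp, finer-grained than the time index i.
    """
    message = {"kind": kind, "i": i, "t_send": t_send, "payload": payload}
    return json.dumps(message).encode("utf-8")

def parse_packet(data):
    """Inverse of make_packet: decode bytes back into a dictionary."""
    return json.loads(data.decode("utf-8"))
```

The receiver can then attach its own receive timestamp to the parsed dictionary before storing it.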

<Step S14...time update>
In step S14, the current time i is updated to the next time i+1 and the process returns to step S11, so that the above steps are repeated in the same way for the next real-time instant i+1.

The steps of FIG. 4 have been described above as the processing of the recognition device 10 according to one embodiment. Next, the steps of FIG. 5 are described as the processing of the drawing device 20 according to one embodiment.

<Step S21...reception processing>
When the flow of FIG. 5 starts (initial time i=1), or at each subsequent real-time instant i (i≧2), in step S21 the drawing device 20 receives the recognition information p(j) (and the audio information v(j)) transmitted from the recognition device 10 in step S13 of FIG. 4, and the process proceeds to step S22. As explained for the overall operation of the communication system 100 with reference to FIG. 2, the latest recognition information p(j) that the drawing device 20 can receive at the current time i is, because of delay and the like, the recognition information p(i-K) (K>0) from K steps in the past as seen from the current time i. Depending on how the delay occurs, the drawing device 20 may receive recognition information p(j) for one or more times j at the current time i, or may receive none at all.

As with transmission by the recognition device 10, on reception the drawing device 20 may associate the recognition information p(j) (and the audio information v(j)) received at the communication interface with a timestamp t giving the reception time of the packet (generally finer-grained than the discrete processing timings i and j).

The received recognition information p(j) is accumulated and stored in the storage unit 25, which in terms of hardware is a storage medium. That is, the storage unit 25 stores as a history the recognition information p(1),p(2),…,p(j) received continuously up to the current time i, and makes this information available for reference in the processing of the estimation unit 21 and the prediction unit 22 described later. Some of the recognition information p(1),p(2),…,p(j) may be missing because of packet loss during transmission or the like (or may be absent at the current time i because it is delayed much more than other packets and has not yet been received).
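A history that tolerates missing or late entries can be sketched by keying the stored recognition information on its time index. The class and method names below are hypothetical, and the dictionary-based layout is only one possible assumption about how the storage unit 25 might be organized.

```python
class RecognitionHistory:
    """Stores received recognition information p(j), keyed by time index j."""

    def __init__(self):
        self._by_time = {}

    def store(self, j, p, t_recv=None):
        # Late or out-of-order packets simply fill their own slot.
        self._by_time[j] = (p, t_recv)

    def get(self, j):
        """Return (p, t_recv) for time j, or None if it was never received."""
        return self._by_time.get(j)

    def latest_time(self):
        """Time index of the most recent received entry, or None if empty."""
        return max(self._by_time) if self._by_time else None

    def missing_times(self):
        """Time indices up to the latest one that have not been received."""
        latest = self.latest_time()
        if latest is None:
            return []
        return [j for j in range(1, latest + 1) if j not in self._by_time]
```

A packet that arrives after a gap has been reported fills the gap on its next `store` call, so the history converges as delayed packets come in.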

<Step S22...Estimation Unit 21>
In step S22, the estimation unit 21 refers to the storage unit 25, estimates the delay amount, and outputs this delay amount to the prediction unit 22, and then the process proceeds to step S23.

This delay amount is the delay of the recognition information and the like actually available at the current time i, which must be taken into account so that the drawing unit 23 and the presentation unit 24 described later can draw and display the avatar A1(i) for the current time i. It can be estimated by any of the following embodiments. In the first embodiment, if the latest recognition information (that of the most recent past) stored in the storage unit 25 is the recognition information p(i-K), the delay amount may be estimated as K (K times the interval between the discrete processing timings designated as times i = 1, 2, ...).
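The first-embodiment estimate can be illustrated with a minimal sketch. The function name `estimate_delay_steps` and the representation of the stored history as a set of received time indices j are assumptions made for illustration and do not appear in the embodiment.

```python
def estimate_delay_steps(current_step, received_steps):
    """Return K = i - (latest received j), or None if nothing has arrived yet.

    current_step   -- the current discrete time i
    received_steps -- iterable of time indices j whose p(j) is in storage
                      (hypothetical representation of the history)
    """
    received = list(received_steps)
    if not received:
        return None  # no recognition information received so far
    latest = max(received)  # index of the most recent p(j), i.e. i-K
    return current_step - latest


# Example: at i = 10 the newest stored packet is p(7); p(4) and p(6) were lost.
K = estimate_delay_steps(10, {1, 2, 3, 5, 7})
```

Lost or late packets do not affect the result, because only the newest stored index is used.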

In the second embodiment, the timestamps attached to packets at transmission and reception are used: the delay amount may be the difference between the transmission timestamp t_[sent] and the reception timestamp t_[received] of the most recently received recognition information p(i-K) (corresponding to the transmission time of the data packet carrying the recognition information).

In the third embodiment, as shown schematically in Fig. 6, in addition to the transmission time t2, the recognition time t1 of the recognition unit 12, the prediction time t3 of the prediction unit 22, and the drawing and presentation time t4 of the drawing unit 23 and the presentation unit 24 may also be estimated and added to the transmission time t2 to give a delay amount (= t1+t2+t3+t4) in real time (as distinct from the discrete processing timings). For the transmission time t2, the value from the first or second embodiment is used. For the processing times t1, t3, and t4 of the functional units (the time each needs to complete its processing), fixed values obtained in advance by preliminary experiments may be used, or actual values (such as an average) over a predetermined recent period may be recorded and used. Likewise, for the transmission time t2 from the first or second embodiment, only the single most recent value may be used, or actual values (such as an average) over a predetermined recent period may be recorded and used.
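As a hedged illustration of the second and third embodiments, the sketch below keeps a rolling average of the transmission time t2 derived from timestamps and adds fixed processing times t1, t3, and t4. The class `RollingMean`, the functions `transmission_time` and `total_delay`, the window length, and all numeric values are assumptions for illustration. The timestamp difference is only meaningful if the clocks of both devices are synchronized, which is also assumed here.

```python
from collections import deque
from statistics import mean


class RollingMean:
    """Average of the most recent `window` measurements (seconds)."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)  # oldest sample is dropped automatically

    def add(self, value):
        self.samples.append(value)

    def value(self):
        return mean(self.samples)  # raises StatisticsError if no sample was added


def transmission_time(sent_ts, received_ts):
    """Second-embodiment t2: reception timestamp minus transmission timestamp."""
    return received_ts - sent_ts


def total_delay(t1, t2, t3, t4):
    """Third-embodiment delay amount t1 + t2 + t3 + t4 in real time."""
    return t1 + t2 + t3 + t4


t2 = RollingMean(window=3)
t2.add(transmission_time(10.000, 10.040))
t2.add(transmission_time(10.100, 10.160))
# t1, t3, t4 are fixed values here, as if measured in a preliminary experiment.
delay = total_delay(0.020, t2.value(), 0.005, 0.015)
```

Using the single most recent t2 instead of an average corresponds to `window=1`.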

The fourth embodiment is a modification of the third embodiment. It presupposes that the recording unit 13 of the recognition device 10 records audio, and that the voice information v(j) is transmitted and received in addition to the recognition information p(j) so that the avatar is drawn in synchronization with the audio. Assuming that the voice information v(j) can be received earlier than the recognition information p(j) by the time required for the recognition processing and the like that produces p(j), the delay amount of the recognition information minus the delay amount of the voice information can be taken anew as the delay amount of the recognition information. This synchronizes the recognition information with the voice information and also reduces the delay amount, which has the effect of improving the accuracy of the prediction unit 22.

In Fig. 6, the fourth embodiment is shown on the lower side, in parentheses. Whereas the delay amount in the third embodiment is "t1+t2+t3+t4", in the fourth embodiment the delay amount can be estimated as "t1+t2+t3+t4-t5", obtained by subtracting the delay t5 of the voice information. The delay t5 of the voice information may be obtained in exactly the same way as the second embodiment uses the difference between the transmission timestamp t_[sent] and the reception timestamp t_[received] of the recognition information, namely as the difference t5 between the transmission and reception timestamps of the packet of the voice information v(j). For the time difference t5, only the single most recent value may be used, or actual values (such as an average) over a predetermined recent period may be recorded and used.

In the fourth embodiment, the definition of real time used when the drawing device 20 draws the avatar is shifted toward the past, away from the accurate real time managed by NTP or the like, by the transmission time t5 of the voice information (that is, the time at which the voice information arrives is regarded as real time for the purpose of avatar drawing). This shortens the prediction time width by t5, so the delay amount can be reduced while the comfort of avatar-based communication is maintained. (Prediction by the prediction unit 22 described later is generally expected to be more accurate the smaller the delay, that is, the time width over which prediction is applied, so the fourth embodiment can obtain this benefit.) Similarly, in addition to the transmission time t5 of the voice information, the time t6 (an actual value or a value determined in advance) required for the predetermined processing performed before the voice information is actually played back by the presentation unit 24 (for example, decoding compressed voice information into a playable audio waveform such as pulse code modulation) may also be taken into account, and "t1+t2+t3+t4-t5-t6" may be estimated as the delay amount.
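The fourth-embodiment delay amount can be written as a one-line calculation. In the sketch below, the function name `adjusted_delay` and the numeric values are illustrative assumptions. Clamping the result at zero is also an assumption added here so that the prediction width never becomes negative; the embodiment does not state how that case is handled.

```python
def adjusted_delay(t1, t2, t3, t4, t5, t6=0.0):
    """Delay amount t1+t2+t3+t4-t5-t6 (seconds), clamped at zero (assumption).

    t5 -- transmission time of the voice information
    t6 -- optional time needed to decode the voice information for playback
    """
    return max(0.0, t1 + t2 + t3 + t4 - t5 - t6)


# Third embodiment would give 0.090 s; subtracting the audio delay shortens it.
without_decode = adjusted_delay(0.020, 0.050, 0.005, 0.015, t5=0.030)
with_decode = adjusted_delay(0.020, 0.050, 0.005, 0.015, t5=0.030, t6=0.010)
```

The shorter value is what the prediction unit would use as its prediction width.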

<Step S23...Prediction Unit 22>
In step S23, the prediction unit 22 predicts the current recognition information p(i) from the history of past recognition information (p(1), p(2), ..., p(i-K)), outputs the predicted recognition information p(i) to the drawing unit 23, and then proceeds to step S24.

Here, to explain the processing of the prediction unit 22, terms are defined as follows. The most recent (latest) past time i-K for which the history of recognition information is available at the current time i is called the "current time t" that serves as the reference for prediction, and the time t+n (≒ i) that is the target of prediction is called the "future time t+n" relative to that "current time t".

The prediction unit 22 uses the delay amount output by the estimation unit 21 as the value of n (n > 0) that determines the future time t+n from the current time t. In the first embodiment of the delay amount, the "current time t" coincides with the "past time i-K at which the most recent recognition information p(i-K) exists", and therefore n = K. In the second to fourth embodiments of the delay amount, the value of n is expected to be close to K (the time width of K intervals between the discrete processing timings i = 1, 2, ...), but it does not necessarily equal K.

As shown schematically in table form in Fig. 7, the prediction unit 22 constructs in advance a plurality of prediction models M_ab (a = 1, 2, ..., b = 1, 2, ...), each designated by a candidate a for the delay value n (here a = 1, 2, ... is an ID of the delay value, not the delay value itself) and a model type b used for prediction (b = 1, 2, ... is an ID of the model). The prediction unit 22 selects the optimal model from among these prediction models M_ab and uses the selected model M_ab to predict the recognition information at the future time t+n as viewed from the current time t. (An embodiment in which there is only one model type b, rather than two or more, is also possible.)

In the example of Fig. 7, three candidates a for the delay value (candidates a that determine the delay value T_a) are set in advance according to the range of delay values: a = 1 (delay value T₁ = 0.5 seconds), a = 2 (delay value T₂ = 1 second), and a = 3 (delay value T₃ = 2 seconds). Three model types b used for prediction are also set in advance: b = 1 (linear interpolation), b = 2 (CNN (convolutional neural network)), and b = 3 (RNN (recurrent neural network)). A total of 3 × 3 = 9 prediction models M_ab are therefore prepared in advance. When a CNN or an RNN is used, the network parameters are learned in advance using training data.

Regarding which of the multiple prediction models M_ab to use, the prediction unit 22 may determine the candidate a for the delay value as the one to which the delay amount n obtained from the estimation unit 21 corresponds (or, if none matches exactly, the closest one). For example, with the prediction models M_ab of Fig. 7, if the delay amount is n = 1.2 seconds, a = 2 (a delay of 1 second) may be selected.
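The nearest-candidate rule can be sketched in a few lines. The function name `select_delay_candidate` and the dictionary `CANDIDATES` are illustrative assumptions; the delay widths are the ones given for Fig. 7.

```python
# Delay widths T_a in seconds, keyed by candidate ID a (values from Fig. 7).
CANDIDATES = {1: 0.5, 2: 1.0, 3: 2.0}


def select_delay_candidate(delay, candidates=CANDIDATES):
    """Return the ID a whose delay width T_a is closest to the estimated delay."""
    return min(candidates, key=lambda a: abs(candidates[a] - delay))


a = select_delay_candidate(1.2)  # 1.2 s is closest to T_2 = 1 s
```

If two candidates are equally close, `min` returns the one that appears first in the dictionary; the embodiment does not specify a tie-breaking rule.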

Regarding which of the multiple prediction models M_ab to use, the prediction unit 22 may determine the model type b as follows. For each model type b, the recognition information T_a ahead is predicted at each of the L+1 past times t-L-T_a, t-L+1-T_a, t-L+2-T_a, ..., t-T_a (that is, over the fixed period from T_a to L+T_a steps back), the resulting predictions p(t-L), p(t-L+1), ..., p(t-1), p(t) are compared with the actual values to obtain a performance score score(b), and the model type with the best score(b) is adopted. For example, letting T_a be the prediction delay width of the prediction model M_ab (in the example of Fig. 7, T₁ = 0.5 seconds, T₂ = 1 second, and T₃ = 2 seconds for a = 1, 2, 3), the prediction error error(b) of model b at that delay width can be calculated, for example, as a sum of squared Euclidean distances as follows. The score can then be calculated as score(b) = f(error(b)) using a predetermined decreasing function f, so that the smaller the prediction error, the higher the score.

$$\mathrm{error}(b) = \frac{1}{L+1} \sum_{k=T_a}^{L+T_a} \left\| p(t-k+T_a)_{[\text{prediction by model } b]} - p(t-k+T_a) \right\|^2$$

The range k = T_a, ..., L+T_a over which the above sum is calculated is the range closest to the current time for which actual values usable for evaluating the predictions over the delay width T_a have in fact been received. (As described above, p(t) is the received actual value closest to the current time, so the value predicted at the past time t-T_a for the future time t-T_a+T_a can be evaluated against it.) The number of values of k in this range is L+1, which serves as the normalization term. The above sum assumes that the delay width T_a is an integer, like the integer-valued discrete processing timings i = 1, 2, ... of Fig. 5 and elsewhere. The delay width T_a of each model M_ab may be defined so as to satisfy this constraint. If it does not, the number of discrete processing timings i = 1, 2, ... to which the real-time delay width T_a corresponds may be truncated to an integer with the floor function floor(T_a), and that value used.

Each predicted value "p(t-k+T_a)_[prediction by model b]" used in calculating the above sum may be the value predicted for the future T_a ahead at time t-k, that is, at the point when the recognition information up to time t-k had been received and was the latest data available for prediction.
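The score evaluation described above can be sketched as a small backtest. Everything named here is an assumption for illustration: the two stand-in models (`hold_last` and `linear_extrapolation`, used in place of the linear, CNN, and RNN models of Fig. 7), the decreasing function f(e) = 1/(1+e), and the representation of the received history as a list of vectors indexed by discrete time with no gaps.

```python
def hold_last(history, horizon):
    """Stand-in model: repeat the latest observation."""
    return history[-1]


def linear_extrapolation(history, horizon):
    """Stand-in model: extend the slope of the last step by `horizon` steps."""
    if len(history) < 2:
        return history[-1]
    prev, last = history[-2], history[-1]
    return [x1 + horizon * (x1 - x0) for x0, x1 in zip(prev, last)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def backtest_error(model, series, horizon, L):
    """Mean squared error over the L+1 most recent targets p(t-L), ..., p(t).

    Requires len(series) >= L + horizon + 1 so every backtest has some history.
    """
    t = len(series) - 1  # index of the latest received value p(t)
    total = 0.0
    for k in range(horizon, L + horizon + 1):
        history = series[: t - k + 1]      # data received up to time t-k
        target = series[t - k + horizon]   # actual value at time t-k+T_a
        total += squared_distance(model(history, horizon), target)
    return total / (L + 1)


def select_model(models, series, horizon, L):
    """Return (best model ID, scores) with score(b) = 1 / (1 + error(b))."""
    scores = {b: 1.0 / (1.0 + backtest_error(m, series, horizon, L))
              for b, m in models.items()}
    return max(scores, key=scores.get), scores


series = [[float(x)] for x in range(10)]  # a one-dimensional linear ramp
models = {"hold": hold_last, "linear": linear_extrapolation}
best, scores = select_model(models, series, horizon=2, L=3)
```

On this ramp the extrapolating model predicts every target exactly, so it receives the highest score and is selected.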

Fig. 8 is a schematic example of evaluating prediction accuracy. It shows recognition information being predicted by one model b (only one element of the multidimensional vector is shown, as an example, on the vertical axis, with time on the horizontal axis): the graph of actual values (the part up to the current time t) is drawn with a solid line, and the graph of predicted values (the part after the current time t) with a dashed line. Fig. 8 shows n1, n2, and n3 as three examples of the delay width T_a.

From the viewpoint of prediction accuracy, it is desirable to provide the delay widths T_a that define the models M_ab exhaustively, in large numbers and at fine intervals. On the other hand, if there are many candidates a, the number of evaluations of the performance score score(b) and the like also increases, and the amount of calculation in the prediction unit 22 is expected to grow.

Therefore, as shown schematically in Fig. 9 for the case of a normal distribution, the distribution of delay values may be obtained in advance by experiments or the like, and delay values (or ranges) that can realistically occur within that distribution may be set discretely as the candidates a. In the example of Fig. 9, three candidates a are set (0 < n1 < n2 < n3): the mean n2 of the normal distribution, and n1 and n3, each at a fixed width from the mean n2.
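One way to derive three such candidates from measured delays is sketched below. The function name `delay_candidates`, the choice of one sample standard deviation as the "fixed width", and the sample values are all assumptions for illustration; the embodiment only states that n1 and n3 lie at a fixed width from the mean n2.

```python
from statistics import mean, stdev


def delay_candidates(samples, width_in_sigmas=1.0):
    """Return (n1, n2, n3): the mean delay and points a fixed width either side.

    Raises ValueError if n1 would not be positive, since 0 < n1 is required.
    """
    n2 = mean(samples)
    width = width_in_sigmas * stdev(samples)  # sample standard deviation
    n1, n3 = n2 - width, n2 + width
    if n1 <= 0:
        raise ValueError("width too large: n1 must be positive")
    return n1, n2, n3


# Hypothetical delays (seconds) collected in a preliminary experiment.
n1, n2, n3 = delay_candidates([0.08, 0.10, 0.12])
```

A smaller `width_in_sigmas` places the candidates closer to the mean, where most delays fall.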

In the embodiments above, for a given prediction model b, a different model M_ab is prepared for each candidate a of the delay value by performing learning and the like separately for each candidate a, in order to ensure prediction accuracy. In another embodiment, when the prediction model b is, for example, an RNN (recurrent neural network), a method that predicts the recognition information for a plurality of delay value candidates a all together and a method that predicts them individually may be selected adaptively. Past prediction errors can be used for this selection. For example, when the error of past prediction results is judged to be small, the recognition information is predicted together, which reduces the processing load; when the error of past prediction results is judged to be large, the recognition information is predicted individually, which improves prediction accuracy.

Fig. 10 shows a modification of the example of Fig. 7 as an example of this other embodiment. For the RNN of prediction model b = 3, in addition to the models M₁₃, M₂₃, and M₃₃ that individually predict the recognition information at the respective delay widths a = 1, 2, 3, a model M₀₃ that predicts the recognition information at all three delay widths comprehensively is prepared in advance. (That is, the model M₀₃ predicts and outputs at once the three pieces of recognition information at the three future delay widths a = 1, 2, 3.) Which of the prediction models of Fig. 10 to use may be decided by a predetermined rule based on the evaluated prediction errors. For example, for a given candidate a of the delay value (delay width T_a), the prediction errors of the linear interpolation model M_a1, the CNN model M_a2, and the comprehensive RNN model M₀₃ may be evaluated, and only if the comprehensive RNN model M₀₃ is evaluated as the most accurate, the prediction error of the individual RNN model M_a3 may additionally be evaluated and the most accurate model selected. Alternatively, the errors of all four models M_a1, M_a2, M_a3, and M₀₃ may be evaluated from the outset and the best model selected. Because the comprehensive RNN model M₀₃ evaluates the three times a = 1, 2, 3 at once, it need not be evaluated at every discrete processing timing i = 1, 2, ... indicated in step S25 of Fig. 5, and may instead be evaluated intermittently, at intervals spanning these three times. In the example of Fig. 10, a model that predicts multiple times at once is added for the RNN, but a batch prediction model may likewise be added for other learned models such as a CNN.
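The two-stage rule in this example can be sketched as follows. The function name `choose_with_joint_model`, the string keys, and the callback `evaluate_individual_rnn` are hypothetical; the errors would come from an evaluation such as the score calculation of step S23.

```python
def choose_with_joint_model(errors, evaluate_individual_rnn):
    """Pick a model among linear (M_a1), CNN (M_a2), joint RNN (M_03),
    evaluating the individual RNN (M_a3) only if the joint RNN is leading.

    errors                  -- dict with keys "linear", "cnn", "rnn_joint"
    evaluate_individual_rnn -- zero-argument callable returning error of M_a3
    """
    best = min(errors, key=errors.get)
    if best == "rnn_joint":
        # Second stage: the extra evaluation is paid for only in this case.
        if evaluate_individual_rnn() < errors["rnn_joint"]:
            return "rnn_individual"
    return best


choice = choose_with_joint_model(
    {"linear": 0.5, "cnn": 0.3, "rnn_joint": 0.2},
    evaluate_individual_rnn=lambda: 0.1,
)
```

Deferring the individual RNN evaluation saves computation whenever the linear or CNN model is already the most accurate.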

<Step S24...Drawing Unit 23 and Presentation Unit 24>
In step S24, the drawing unit 23 draws the avatar A1 of the first user U1 using the recognition information p(t+n) predicted for the current time i by the prediction unit 22, the presentation unit 24 displays the drawn avatar A1, and the process then proceeds to step S25. If the voice information v(i-K) has been obtained from the recording unit 13, the presentation unit 24 also plays back this voice information v(i-K) in synchronization with the display of the avatar A1.

The drawing unit 23 may draw by reflecting the facial expression, pose, and the like of the recognition information p(t+n) in the avatar model set in advance by the first user U1. The avatar model may be a photorealistic model of the first user U1 or any other character, and it may be drawn on the image plane either two-dimensionally or three-dimensionally (by projecting a three-dimensional model onto a two-dimensional image plane). When the avatar is drawn in a three-dimensional virtual space, it may be drawn on the image plane with the three-dimensional virtual space determined by a virtual viewpoint specified by the second user, who uses the drawing device 20. Existing methods from augmented reality and three-dimensional computer graphics can be used for this drawing, and the presentation unit 24 may display the avatar superimposed on a background image or the like captured separately on the drawing device 20 side. If the display constituting the presentation unit 24 is an optical see-through head-mounted display or the like, only the avatar may be drawn and displayed superimposed on the real-world background.

<Step S25...Time Update>
In step S25, the current time i is updated to the next time i+1 and the process returns to step S21, so that the above steps are repeated in the same way for the next real-time time i+1.

As described above, according to the embodiment of the present invention, the not-yet-received recognition information for the current time is predicted, in accordance with the delay, from the recognition information transmitted from the recognition device 10, so that it approximates the actual recognition information of the first user U1 on the recognition device 10 side at the current time. This effectively conceals transmission delays and processing delays and enables smooth communication. In addition, when presenting the avatar, the movement of the other party is predicted based on an adaptive selection among multiple prediction methods and the avatar for the time of reception is presented, so that transmission delay can be suppressed.

Fig. 11 shows example graphs EX1, EX2, and EX3, one for each of three candidate values of the delay (that is, the prediction time), n1 = 30 ms, n2 = 100 ms, and n3 = 250 ms. Each graph plots together a sample time series (actual values) of one parameter of the recognition information and the time series of values predicted by applying an RNN to those actual values. On a common time axis, the actual values are drawn with a thick line and the predicted values with a thin line, and the two can be seen to agree well. Because recognition information can be predicted with high accuracy in this way using a pre-trained RNN or the like, the embodiment of the present invention can absorb delay through accurate prediction and enable smooth communication.

Various supplementary, additional, and alternative examples are described below.

(1) According to the embodiment of the present invention, remote communication with a sense of presence can be realized. This makes it possible to hold remote conferences and the like without necessarily requiring actual travel to a remote location. Saving the energy resources needed for user travel reduces carbon dioxide emissions, which makes it possible to contribute to Goal 13 of the United Nations-led Sustainable Development Goals (SDGs), "Take urgent action to combat climate change and its impacts."

(2) With the candidates a for the delay value K of the multiple prediction models M_ab prepared in advance in the prediction unit 22, as illustrated in Fig. 7, each model yields only one predicted value, at a discrete time (for example, 1 second later, 2 seconds later, 3 seconds later, and so on). Given these predicted values at discrete delay widths, only the single discrete time closest to the specified delay time may be chosen, or the values at the nearest times on either side may be interpolated. For example, if the specified delay time is 1.2 seconds, the discrete predicted value for the closest time, 1 second later, may be used. Alternatively, taking the two closest predicted values, the discrete predicted value for 1 second later and that for 2 seconds later may be combined in a weighted sum with a ratio of 0.8:0.2, and the result used as the output of the prediction unit 22.
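The interpolation between the two nearest discrete predictions can be sketched as below. The function name `interpolate_prediction` and the sample values are illustrative assumptions, and returning the nearest endpoint when the delay lies outside the candidate range is an assumption added here, since the embodiment does not address that case.

```python
def interpolate_prediction(delay, predictions):
    """Linearly interpolate predicted vectors between neighbouring delay widths.

    predictions -- dict mapping a delay width (seconds) to a predicted vector
    """
    widths = sorted(predictions)
    if delay <= widths[0]:
        return list(predictions[widths[0]])
    if delay >= widths[-1]:
        return list(predictions[widths[-1]])
    for lo, hi in zip(widths, widths[1:]):
        if lo <= delay <= hi:
            w_hi = (delay - lo) / (hi - lo)  # 0.2 for delay 1.2 between 1 and 2
            return [(1.0 - w_hi) * a + w_hi * b
                    for a, b in zip(predictions[lo], predictions[hi])]


# Hypothetical one-element predictions for 1 s and 2 s ahead.
blended = interpolate_prediction(1.2, {1.0: [10.0], 2.0: [20.0]})
```

For a delay of 1.2 seconds the weights are 0.8 for the 1-second prediction and 0.2 for the 2-second prediction, matching the ratio in the text.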

(3) As additional processing so that the avatar moves smoothly and naturally even when the prediction method b of the model M_ab is switched, a cross-fade of the predicted values of the prediction models before and after the switch may be used as the predicted value around the switch. For example, suppose the predicted value of the b = 1 prediction model is adopted up to time t = TA-1, and from time t = TA onward the b = 2 prediction model is determined to be the best. For each time t = TA, TA+1, ..., TA+N in the subsequent fixed period of N+1 steps, let p1(t) be the value of the b = 1 prediction model used before the switch and p2(t) the value of the b = 2 prediction model used after the switch. The following cross-faded weighted sum may then be used so that the output changes gradually from the value of the b = 1 prediction model to the value of the new b = 2 prediction model.
{(t-TA+1)*p2(t)+(TA+N-t)*p1(t)}/(N+1)
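The cross-fade formula can be checked with a short sketch. The function name `crossfade` and the scalar sample values are assumptions for illustration; returning p1 before TA and p2 after TA+N simply reflects which model is in use outside the transition period.

```python
def crossfade(t, TA, N, p1, p2):
    """Blend old-model value p1 and new-model value p2 at discrete time t.

    Implements {(t-TA+1)*p2 + (TA+N-t)*p1} / (N+1) for TA <= t <= TA+N.
    """
    if t < TA:
        return p1  # before the switch: old model only
    if t > TA + N:
        return p2  # after the transition period: new model only
    return ((t - TA + 1) * p2 + (TA + N - t) * p1) / (N + 1)


# With constant p1 = 0 and p2 = 4, the output ramps up over N+1 = 4 steps.
ramp = [crossfade(t, TA=10, N=3, p1=0.0, p2=4.0) for t in range(9, 15)]
```

The two weights always sum to 1, and the weight of the new model reaches 1 at t = TA+N, so the hand-over completes without a jump.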

(4) Fig. 12 shows an example of the hardware configuration of a general computer device 70. Each of the recognition device 10 and the drawing device 20 constituting the communication system 100 can be realized as one or more computer devices 70 having such a configuration. When either the recognition device 10 or the drawing device 20 is realized by two or more computer devices 70, the information required for processing may be exchanged over a network. The computer device 70 includes a CPU (central processing unit) 71 that executes predetermined instructions; a GPU (graphics processing unit) 72 serving as a dedicated processor that executes some or all of the instructions of the CPU 71 in place of, or in cooperation with, the CPU 71; a RAM 73 serving as a main storage device that provides a work area for the CPU 71 (and the GPU 72); a ROM 74 serving as an auxiliary storage device; a communication interface 75; a display 76; an input interface 77 that accepts user input via a mouse, keyboard, touch panel, or the like; a camera 78; a microphone 79; a speaker 80; and a bus BS for exchanging data among these components.

Each functional unit of the communication system 100 can be realized by the CPU 71 and/or the GPU 72 reading from the ROM 74 and executing a predetermined program corresponding to the function of that unit. The CPU 71 and the GPU 72 are both types of arithmetic unit (processor). When display-related processing is performed, the display 76 additionally operates in conjunction, and when communication-related processing for data transmission and reception is performed, the communication interface 75 additionally operates in conjunction. The avatar display function of the presentation unit 24 may be realized by the display 76, and its audio playback function by the speaker 80. The recording function of the recording unit 13 may be realized by the microphone 79. The imaging unit 11 may be realized as a camera.

100: communication system, 10: recognition device, 20: drawing device
11: imaging unit, 12: recognition unit, 13: recording unit, 21: estimation unit, 22: prediction unit, 23: drawing unit, 24: presentation unit, 25: storage unit

Claims

1. A drawing device that receives, in chronological order, control parameters for drawing an avatar from a user for whom the avatar is to be drawn, as recognition information obtained by recognizing the state of the user, the drawing device comprising:
an estimation unit that estimates a delay amount between the time of the most recent recognition information among the already received recognition information and the current time;
a prediction unit that predicts recognition information corresponding to the current time based on a history of the received recognition information; and
a drawing unit that draws the avatar of the user at the current time using the predicted recognition information,
wherein the prediction unit predicts the recognition information corresponding to the current time by using a prediction model selected from a plurality of types of prediction models prepared in advance, and
the prediction unit selects, from among a plurality of prediction models prepared in advance for respective candidate delay amounts, the prediction model for the candidate delay amount corresponding to the estimated delay amount.
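
The selection of a prediction model by candidate delay amount can be illustrated with a minimal sketch. The claim does not specify how a candidate "corresponds" to the estimated delay; the nearest-candidate rule, the 33 ms frame interval, the linear extrapolation model, and all names below are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a predictor maps a history of parameter vectors to one predicted vector.
Predictor = Callable[[Sequence[List[float]]], List[float]]


def make_linear_predictor(steps: int) -> Predictor:
    """Build an illustrative model that extrapolates the last per-frame change `steps` frames ahead."""
    def predict(history: Sequence[List[float]]) -> List[float]:
        if len(history) < 2:
            return list(history[-1])  # not enough context: hold the last value
        prev, last = history[-2], history[-1]
        return [l + steps * (l - p) for p, l in zip(prev, last)]
    return predict


def select_model(models_by_delay_ms: Dict[int, Predictor], delay_ms: float) -> Predictor:
    """Pick the model whose candidate delay is closest to the estimated delay (assumed rule)."""
    nearest = min(models_by_delay_ms, key=lambda candidate: abs(candidate - delay_ms))
    return models_by_delay_ms[nearest]


FRAME_MS = 33  # assumed frame interval in milliseconds
models = {k * FRAME_MS: make_linear_predictor(k) for k in (1, 2, 3)}

history = [[0.0, 1.0], [0.1, 0.9]]  # two received control-parameter vectors
predicted = select_model(models, delay_ms=70.0)(history)
```

Here an estimated delay of 70 ms falls closest to the 66 ms candidate, so the two-frame-ahead model is used.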

2. A drawing device that receives, in chronological order, control parameters for drawing an avatar from a user for whom the avatar is to be drawn, as recognition information obtained by recognizing the state of the user, the drawing device comprising:
an estimation unit that estimates a delay amount between the time of the most recent recognition information among the already received recognition information and the current time;
a prediction unit that predicts recognition information corresponding to the current time based on a history of the received recognition information; and
a drawing unit that draws the avatar of the user at the current time using the predicted recognition information,
wherein the prediction unit predicts the recognition information corresponding to the current time by using a prediction model selected from a plurality of types of prediction models prepared in advance, and
for a predetermined type of prediction model, the prediction unit selects between a model that makes predictions for each of a plurality of candidate delay amounts individually and a model that makes predictions for the plurality of candidate delay amounts collectively.

3. The drawing device according to claim 2, wherein the predetermined type of prediction model is a recurrent neural network model.

4. The drawing device according to any one of claims 1 to 3, wherein the prediction unit applies each of the plurality of types of prediction models to the history of the already received recognition information, evaluates the prediction accuracy of each prediction model, and selects the prediction model with the best prediction accuracy.
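
One way to picture the accuracy-based selection is a backtest over the received history: each model predicts values that have already arrived, and the model with the lowest error is chosen. The sketch below is illustrative only; the two toy models, the mean-absolute-error metric, and all function names are assumptions, not details taken from the claims.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: (history so far, steps ahead) -> predicted scalar parameter.
Predictor = Callable[[Sequence[float], int], float]


def hold_last(history: Sequence[float], steps: int) -> float:
    """Zero-order hold: assume the parameter stays at its last value."""
    return history[-1]


def linear(history: Sequence[float], steps: int) -> float:
    """Linear extrapolation from the last two samples."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + steps * (history[-1] - history[-2])


def backtest_error(model: Predictor, history: Sequence[float], steps: int,
                   min_context: int = 2) -> float:
    """Mean absolute error when predicting already-received samples `steps` ahead."""
    errors = []
    for t in range(min_context, len(history) - steps + 1):
        predicted = model(history[:t], steps)   # only samples before index t are visible
        actual = history[t + steps - 1]         # sample that arrived `steps` later
        errors.append(abs(predicted - actual))
    return sum(errors) / len(errors) if errors else float("inf")


def select_best(models: Dict[str, Predictor], history: Sequence[float], steps: int) -> str:
    """Return the name of the model with the lowest backtest error."""
    return min(models, key=lambda name: backtest_error(models[name], history, steps))


models = {"hold_last": hold_last, "linear": linear}
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]        # steadily changing parameter
flicker = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]     # rapidly alternating parameter
best_for_ramp = select_best(models, ramp, steps=2)
best_for_flicker = select_best(models, flicker, steps=1)
```

On the steadily changing series the linear model wins, while on the alternating series linear extrapolation overshoots and the hold-last model is selected.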

5. A drawing device that receives, in chronological order, control parameters for drawing an avatar from a user for whom the avatar is to be drawn, as recognition information obtained by recognizing the state of the user, the drawing device comprising:
an estimation unit that estimates a delay amount between the time of the most recent recognition information among the already received recognition information and the current time;
a prediction unit that predicts recognition information corresponding to the current time based on a history of the received recognition information; and
a drawing unit that draws the avatar of the user at the current time using the predicted recognition information,
wherein, on the side of the user for whom the avatar is to be drawn, time-series frame images obtained by imaging the user are analyzed to acquire the recognition information, and the drawing device receives the recognition information with the timestamp of the corresponding frame image associated with it,
the estimation unit estimates the delay amount as the difference between the timestamp and the current time,
on the side of the user for whom the avatar is to be drawn, recordings of the user's utterances are acquired in chronological order with timestamps associated with the recordings,
the drawing device receives the recordings, and
the estimation unit estimates the delay amount by further subtracting, from the difference, a delay that includes at least the transmission delay until the recording is received.
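
The timestamp arithmetic in this claim can be sketched as follows. This is a simplified illustration that assumes the sender and receiver clocks are synchronized and that all times are in milliseconds; the function names and the clamping at zero are assumptions added for the example.

```python
def audio_delay_ms(received_at_ms: float, recording_timestamp_ms: float) -> float:
    """Delay until a recording is received, including at least the transmission delay."""
    return received_at_ms - recording_timestamp_ms


def estimate_delay_ms(now_ms: float, frame_timestamp_ms: float,
                      audio_delay: float = 0.0) -> float:
    """Delay amount: (current time - frame timestamp), minus the audio delay.

    Clamping at zero is an assumption of this sketch, so that the
    prediction horizon never becomes negative.
    """
    difference = now_ms - frame_timestamp_ms
    return max(0.0, difference - audio_delay)


# Latest recognition information carries a frame timestamp of 850 ms; current time is 1000 ms.
# A recording stamped 900 ms was received at 940 ms, so the audio delay is 40 ms.
delay = estimate_delay_ms(
    now_ms=1000.0,
    frame_timestamp_ms=850.0,
    audio_delay=audio_delay_ms(received_at_ms=940.0, recording_timestamp_ms=900.0),
)
```

With these example numbers the difference is 150 ms and the audio delay is 40 ms, giving a delay amount of 110 ms.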

6. The drawing device according to claim 1, wherein the recognition information is obtained as representing a facial expression and/or a body pose of the user.

7. The drawing device according to claim 1, wherein the prediction unit predicts recognition information corresponding to the current time by using a prediction model selected from a plurality of types of prediction models prepared in advance, and, when the type of the selected prediction model is switched at a certain time, predicts, for a certain period after the switch, the recognition information corresponding to the current time as a weighted sum of the recognition information predicted for the current time by the first type of prediction model used before the switch and the recognition information predicted for the current time by the second type of prediction model used after the switch.
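
The weighted sum applied after a model switch can be sketched as a cross-fade between the two models' predictions. The claim only requires a weighted sum over a certain period; the linear ramp of the weight, the 1000 ms transition period, and the names used here are assumptions for illustration.

```python
from typing import List, Sequence


def blend_weight(elapsed_ms: float, transition_ms: float) -> float:
    """Weight of the new model: ramps linearly from 0 to 1 over the transition (assumed schedule)."""
    if transition_ms <= 0:
        return 1.0
    return min(1.0, max(0.0, elapsed_ms / transition_ms))


def blended_prediction(old_pred: Sequence[float], new_pred: Sequence[float],
                       elapsed_ms: float, transition_ms: float) -> List[float]:
    """Weighted sum of the pre-switch and post-switch models' predictions for the current time."""
    w = blend_weight(elapsed_ms, transition_ms)
    return [(1.0 - w) * o + w * n for o, n in zip(old_pred, new_pred)]


# 250 ms after the switch, with an assumed 1000 ms transition period,
# the new model contributes 25% and the old model 75%.
mixed = blended_prediction([0.0, 10.0], [10.0, 0.0], elapsed_ms=250.0, transition_ms=1000.0)
```

Blending in this way avoids a visible jump in the avatar when the two models disagree at the moment of the switch; once the transition period has elapsed, only the new model's prediction is used.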

8. A program for causing a computer to function as the drawing device according to claim 1.