JP6059614B2 - Gesture generation device, gesture generation system, gesture generation method, and computer program

Info

Publication number: JP6059614B2
Application number: JP2013159655A
Authority: JP
Inventors: 建鋒徐; 茂之酒澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2013-07-31
Filing date: 2013-07-31
Publication date: 2017-01-11
Anticipated expiration: 2033-07-31
Also published as: JP2015032032A

Description

The present invention relates to a gesture generation device, a gesture generation system, a gesture generation method, and a computer program.

In recent years, voice-interactive interfaces that run on mobile terminals have been realized. In addition, displaying a character on the terminal's display screen as an anthropomorphic agent, and having that character speak and move in accordance with information, is under study. For example, the conventional technique described in Non-Patent Document 1 automatically generates video content with TVML (TV program Making language) using APE (Automatic Production Engine): computer graphics (CG) animation is generated automatically by preparing templates for specific scenarios such as news and weather forecasts. Techniques are also known that automatically generate CG animation from the text data of e-mail or blogs (see, for example, Patent Documents 1 and 2), or that generate, as CG animation, sign language corresponding to text data entered by a user (see, for example, Patent Document 3).

Patent Document 1: JP 2004-88335 A
Patent Document 2: JP 2008-107904 A
Patent Document 3: JP 2011-175598 A

Non-Patent Document 1: Hayashi, M.; Douke, M.; Hamaguchi, N., "Automatic TV program production with APEs," Creating, Connecting and Collaborating through Computing, 2004. Proceedings. Second International Conference on, pp. 20, 25, 29-30 Jan. 2004.
Non-Patent Document 2: McNeill, D. 1992. Hand and Mind. Chicago: University of Chicago Press.
Non-Patent Document 3: Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama and Kyoko Kanzaki, Development of Japanese WordNet. In LREC-2008, Marrakech.
Non-Patent Document 4: Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990). Introduction to Algorithms (2st ed.). MIT Press and McGraw-Hill. ISBN 0-262-03141-8. pp. 323-69.

However, with the conventional techniques described above, when the text data of a line of dialogue and the voice data of that line are given, it is difficult to automatically generate natural CG animation that matches the voice of the line. The techniques of Patent Documents 1 to 3 can generate CG animation that matches text data, but cannot generate CG animation that matches the timing of the voice of that text. The technique of Non-Patent Document 1 does not incorporate a synchronization technique, so the timing of the CG animation cannot be matched to the given voice.

The present invention has been made in view of these circumstances, and its object is to provide a gesture generation device, a gesture generation system, a gesture generation method, and a computer program capable of generating a moving image of natural gestures matched to the voice of a line.

(1) A gesture generation device according to the present invention receives, as input, line data that is the text data of a line, voice data of the line, and a motion graph in which a plurality of pieces of gesture data are connected on the basis of their connectivity, and generates gesture data matched to the voice of the line. The device comprises a gesture data generation unit that selects the minimum-cost path on the motion graph on the basis of the duration of the line and adjusts the gesture data of the selected path to match the voice of the line.

(2) In the gesture generation device of (1) above, the gesture data generation unit adjusts the start timing and end timing of the stroke of the gesture data of the selected path to match the voice data of the keyword in the line that corresponds to the stroke.

(3) In the gesture generation device of (2) above, the gesture data generation unit adjusts the length of the gesture data of the selected path to match the duration of the line.

(4) In the gesture generation device of (3) above, the gesture data generation unit stretches or shrinks the duration of the preparation phase of the gesture data of the selected path so that the end timing of the preparation phase matches the start timing of the stroke.

(5) In the gesture generation device of (3) or (4) above, the gesture data generation unit stretches or shrinks the duration of the retraction phase of the gesture data of the selected path so that its start timing matches the end timing of the stroke and its end timing matches the end timing of the voice data of the line.

(6) In the gesture generation device of any one of (1) to (5) above, the gesture data generation unit sets, as the start node, the node that has the best connectivity with the last pose of the gesture data, from among the first nodes of the gesture data included in the motion graph.

(7) In the gesture generation device of any one of (1) to (6) above, there are a plurality of motion graphs, one per category, and the gesture data generation unit generates the gesture data matched to the voice of the line using the motion graph whose category is the same as that of the keyword in the line.

(8) In the gesture generation device of any one of (1) to (7) above, the gesture data generation unit replaces frames of the preparation phase or the retraction phase of the gesture data of the selected path with similar frames taken from predetermined gesture data.

(9) In the gesture generation device of any one of (1) to (8) above, for gesture data that has only a stroke, the gesture data generation unit uses a predetermined steady pose to add a preparation phase and a retraction phase of a fixed duration before and after the stroke.

(10) In the gesture generation device of any one of (1) to (8) above, for gesture data that has no retraction phase, the gesture data generation unit uses the pose of the preparation phase to add a retraction phase of a fixed duration after the stroke.

(11) In the gesture generation device of any one of (1) to (8) above, for gesture data that has no preparation phase, the gesture data generation unit uses the pose of the retraction phase to add a preparation phase of a fixed duration before the stroke.

(12) In the gesture generation device of any one of (1) to (11) above, when the length of the stroke of the motion graph is at least a predetermined multiple of the duration of the line, the gesture data generation unit either switches to a predetermined steady motion graph or inserts a silent interval of a fixed duration immediately after the line in the voice data corresponding to the stroke of the motion graph.

(13) A gesture generation system according to the present invention comprises: the gesture generation device of any one of (1) to (12) above; a motion graph generation unit that sets the phase boundaries of the preparation phase, stroke, and retraction phase of input gesture data as nodes and generates a motion graph in which edges are provided on the basis of the connectivity between the nodes; and a motion graph database that stores the motion graph.

(14) A gesture generation method according to the present invention is a method for a gesture generation device that receives, as input, line data that is the text data of a line, voice data of the line, and a motion graph in which a plurality of pieces of gesture data are connected on the basis of their connectivity, and that generates gesture data matched to the voice of the line. In the method, the gesture generation device selects the minimum-cost path on the motion graph on the basis of the duration of the line and adjusts the gesture data of the selected path to match the voice of the line.

(15) A computer program according to the present invention causes the computer of a gesture generation device, which receives as input line data that is the text data of a line, voice data of the line, and a motion graph in which a plurality of pieces of gesture data are connected on the basis of their connectivity, and which generates gesture data matched to the voice of the line, to execute a step of selecting the minimum-cost path on the motion graph on the basis of the duration of the line and adjusting the gesture data of the selected path to match the voice of the line.

According to the present invention, a moving image of natural gestures matched to the voice of a line can be generated.

FIG. 1 is a block diagram showing the configuration of a gesture generation system 1 according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of an example definition of gesture data according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram showing the flow of a motion graph generation method according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram of the blending process according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining the blending process according to an embodiment of the present invention.
FIG. 6 is a flowchart showing the flow of a gesture data generation method according to an embodiment of the present invention.
FIG. 7 is an explanatory diagram of a gesture data adjustment method according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a gesture generation system 1 according to an embodiment of the present invention. In FIG. 1, the gesture generation system 1 has a gesture generation device 10, a motion graph generation unit 20, and a motion graph database 30. The gesture generation device 10 uses the motion graph database 30 to generate gesture data matched to the voice of the line in the input data (line data and voice data), and outputs the generated gesture data. The motion graph generation unit 20 generates motion graphs from input gesture data. The motion graph database 30 stores the motion graphs generated by the motion graph generation unit 20.

Gesture data according to this embodiment is described here. Gesture data is motion data representing the movement of a person, an animal, or the like; in particular, it is motion data representing the kind of movement called a gesture. In general, a gesture consists of a sequence of three phases: it begins with a preparation phase (preparation), passes through an execution phase (the stroke), and ends with a retraction phase (retraction) (see, for example, Non-Patent Document 2). The preparation phase runs from the first pose of the gesture (hereinafter, the steady pose) until the stroke begins; an example is the movement of a person's hand from where it was resting (the steady pose) up to the start of the stroke. The stroke is the main phase of the gesture; an example is a strong swing of the hand. The retraction phase is the phase after the stroke in which the gesture reaches its final pose; an example is returning to the steady pose after the stroke. The preparation and retraction phases are not essential and may be absent.

FIG. 2 is a schematic diagram of an example definition of gesture data according to this embodiment. In the example of FIG. 2, skeleton-type motion data of a human body is used as the gesture data. Skeleton-type motion data of a human body is based on the human skeleton and uses bones and the points where bones connect (joints): one joint is taken as the root, and the structure of bones connected in sequence from the root through the joints is defined as a tree. FIG. 2 shows only part of the definition. In FIG. 2, joint 100 is the waist and is defined as the root. Joint 101 is the left elbow, joint 102 the left wrist, joint 103 the right elbow, joint 104 the right wrist, joint 105 the left knee, joint 106 the left ankle, joint 107 the right knee, joint 108 the right ankle, joint 109 the clavicle, joints 110 and 111 the shoulders, joint 112 the head, and joints 113 and 114 the hip joints.

Skeleton-type motion data records the movement of each joint of a skeleton-type object; a human body, an animal, or the like can serve as the object. Skeleton-type motion data is generated, for example, from motion capture data.

In this embodiment, the skeleton-type motion data of a human body illustrated in FIG. 2 is used as the gesture data. Gesture data represents a person's sequence of movements as a succession of poses. One pose corresponds to one frame and records the position information of all joints. One frame x(t) is expressed by Equation (1).

Here, $p^k(t)$ is the position of the k-th joint at time t, expressed in three-dimensional coordinates. Time t is the time of the frame, and K is the number of joints. Therefore x(t) is a 3K-dimensional vector.

Gesture data X consisting of T frames is expressed by Equation (2).

X is a 3K × T matrix. In this embodiment, time t is treated simply as a frame index, so t takes the values 0, 1, 2, ..., T−1. T is the number of frames contained in the gesture data.
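The frame and matrix layout described above can be sketched in a few lines of Python. The equation images for Equations (1) and (2) are not reproduced in this text, so the sketch assumes the reading that the definitions suggest: x(t) stacks the K joint positions into one 3K-dimensional vector, and X places the T frame vectors side by side as columns. The function names `make_frame` and `make_gesture` are illustrative and do not come from the patent.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def make_frame(joint_positions: Sequence[Vec3]) -> List[float]:
    """Flatten K joint positions into one 3K-dimensional frame vector x(t)."""
    frame: List[float] = []
    for x, y, z in joint_positions:
        frame.extend([x, y, z])
    return frame


def make_gesture(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Arrange T frame vectors as the columns of a 3K x T matrix X (assumed layout)."""
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    # Row r of X holds coordinate r for every frame index t = 0 .. T-1.
    return [[f[r] for f in frames] for r in range(dim)]


# Two joints (K = 2) over three frames (T = 3); joint 1 moves along y.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
]
X = make_gesture([make_frame(p) for p in poses])
```

With this layout, column t of `X` is the pose at frame index t, which matches the treatment of t as a plain frame index.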

As another example of how a frame may be defined, the amount of movement from a base pose can be expressed for each joint. One frame specifies one pose in which each joint's movement has been applied to the base pose, so the succession of poses specified by the frames specifies a person's sequence of movements. In this case, angle information is used as the amount of movement, and joint positions are calculated from the base pose data and the frame data contained in the angle information data. The base pose data holds information that specifies the base pose, such as the position of the root and of each joint in the base pose and the length of each bone. The frame data holds, for each joint, angle information representing the amount of movement from the base pose. The position $p^k(t)$ of the k-th joint at time t is calculated by Equations (3) and (4).

Here, the 0th joint (i = 0) is the root. $R_{axis}^{i-1,i}(t)$ is the coordinate rotation matrix between the i-th joint and its parent joint (the (i−1)-th joint) and is included in the base pose data. A local coordinate system is defined for each joint, and the coordinate rotation matrix expresses the correspondence between the local coordinate systems of joints in a parent-child relationship. $R^i(t)$ is the rotation matrix of the i-th joint in its own local coordinate system; it is the angle information included in the frame data. $T^i(t)$ is the transition matrix between the i-th joint and its parent joint and is included in the base pose data; it represents the length of the bone between the i-th joint and its parent.
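To make the roles of these three matrices concrete, here is a minimal forward-kinematics sketch for a single root-to-joint chain. Equations (3) and (4) are not reproduced in this text, so this is not the patent's formula: it assumes the common convention that each bone offset is rotated by the accumulated rotation of its ancestors, and that each joint contributes its base-pose axis rotation followed by its per-frame rotation. All names (`chain_positions`, `axis_rots`, `frame_rots`, `offsets`) are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]
Vector = List[float]

IDENTITY: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[r][i] * b[i][c] for i in range(3)) for c in range(3)]
            for r in range(3)]


def mat_vec(a: Matrix, v: Sequence[float]) -> Vector:
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[r][i] * v[i] for i in range(3)) for r in range(3)]


def rot_z(angle: float) -> Matrix:
    """Rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def chain_positions(root_pos: Sequence[float],
                    axis_rots: Sequence[Matrix],
                    frame_rots: Sequence[Matrix],
                    offsets: Sequence[Sequence[float]]) -> List[Vector]:
    """Positions of joints 0..k along one chain (assumed convention).

    axis_rots[i]  : base-pose rotation between joint i and its parent
    frame_rots[i] : per-frame rotation of joint i in its local coordinates
    offsets[i]    : bone vector from the parent to joint i (offsets[0] unused)
    """
    positions: List[Vector] = [list(root_pos)]
    world_rot = mat_mul(axis_rots[0], frame_rots[0])
    for i in range(1, len(offsets)):
        # The bone to joint i is oriented by the rotations of its ancestors.
        step = mat_vec(world_rot, offsets[i])
        positions.append([p + s for p, s in zip(positions[-1], step)])
        world_rot = mat_mul(world_rot, mat_mul(axis_rots[i], frame_rots[i]))
    return positions


# Two unit bones along x; rotating the root 90 degrees about z points them along y.
pos = chain_positions(
    [0.0, 0.0, 0.0],
    [IDENTITY, IDENTITY, IDENTITY],
    [rot_z(math.pi / 2), IDENTITY, IDENTITY],
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
)
```

In a full skeleton the same accumulation would run down every branch of the joint tree rather than a single chain.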

This concludes the description of gesture data. The description now returns to FIG. 1.

[Motion graph generation unit]
Input gesture data is input to the motion graph generation unit 20. The input gesture data consists of gesture data and metadata. The metadata contains a category identifier (category ID) indicating the category of the gesture data and phase identifiers (phase IDs) that divide the gesture data into the three phases of a gesture. The phase IDs are "P" for the preparation phase, "S" for the stroke, and "R" for the retraction phase; they identify the preparation phase, stroke, and retraction phase of the gesture data.

An example of a method for classifying gesture data into categories is described below. For the set of available lines, the text of each line is morphologically analyzed to extract keywords. Each keyword is then given a category label using a concept dictionary (semantic dictionary); for example, "WordNet", described in Non-Patent Document 3, can be used as the concept dictionary. For example, the keywords "ohayou" (good morning, casual), "ohayou gozaimasu" (good morning, polite), "konnichiwa" (hello), and "konbanwa" (good evening) are given the category ID of the category "greeting". In this way a category set is created for the set of lines. Next, each piece of gesture data in the set of available gesture data is given the category ID of a category in the category set. This assignment of category IDs to gesture data is done by hand. For example, the gesture data for "bow" is given the category ID of "greeting".
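The two labelling steps can be pictured as two lookup tables. The sketch below is only an illustration: the small dictionary `KEYWORD_TO_CATEGORY` stands in for the concept dictionary, the keywords are assumed to have already been extracted by morphological analysis, and `GESTURE_TO_CATEGORY` stands in for the hand-assigned category IDs. None of these names appear in the patent.

```python
from typing import Dict, List

# Stand-in for the concept dictionary: keyword -> category label (illustrative).
KEYWORD_TO_CATEGORY: Dict[str, str] = {
    "ohayou": "greeting",
    "ohayou gozaimasu": "greeting",
    "konnichiwa": "greeting",
    "konbanwa": "greeting",
}

# Stand-in for the manual labelling of gesture data: gesture -> category ID.
GESTURE_TO_CATEGORY: Dict[str, str] = {
    "bow": "greeting",
}


def categories_for_keywords(keywords: List[str]) -> List[str]:
    """Return the distinct categories of the given keywords, in first-seen order."""
    seen: List[str] = []
    for keyword in keywords:
        category = KEYWORD_TO_CATEGORY.get(keyword)
        if category is not None and category not in seen:
            seen.append(category)
    return seen


def gestures_in_category(category: str) -> List[str]:
    """Return the gestures that were labelled with the given category."""
    return sorted(g for g, c in GESTURE_TO_CATEGORY.items() if c == category)
```

Because a motion graph is built per category, a lookup like `gestures_in_category` is what would determine which gesture data feeds into a given graph.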

The division of gesture data into the three phases of a gesture (preparation phase, stroke, and retraction phase) is done by hand. According to this division, the phase IDs "P" (preparation phase), "S" (stroke), and "R" (retraction phase) are assigned to the gesture data. Some gesture data, however, may have no preparation phase or no retraction phase.

The motion graph generation unit 20 generates motion graphs from the input gesture data. A motion graph is generated for each category. Accordingly, only the input gesture data carrying the category ID of a given category is used to generate the motion graph of that category.

FIG. 3 is a conceptual diagram showing the flow of the motion graph generation method according to this embodiment. The operation by which the motion graph generation unit 20 generates a motion graph is described below with reference to FIG. 3.

[Frame extraction step]
First, in the frame extraction step, all frames that fall on phase boundaries of the gesture data are extracted from all the input gesture data of the category for which the motion graph is being generated. The set of extracted phase-boundary frames is denoted $F_B^{iALL}$.

[Connectivity calculation step]
Next, in the connectivity calculation step, every frame in the set $F_B^{iALL}$ is set as a node of the motion graph. The initial number of nodes of the motion graph therefore equals the number of frames in $F_B^{iALL}$. Distances are then calculated for every pair of nodes using Equation (5) or Equation (6). The distance between a node $F_B^i$ and a node $F_B^j$ is denoted $d(F_B^i, F_B^j)$.

Here, $q_{i,k}$ is the quaternion of the k-th joint of node $F_B^i$, and $w_k$ is the weight for the k-th joint. The weights $w_k$ are set in advance.

Here, $p_{i,k}$ is the vector of the position of the k-th joint of node $F_B^i$ relative to the root. In other words, $p_{i,k}$ is the position vector of the k-th joint of node $F_B^i$ computed without regard to the position and orientation of the root.

The distance between nodes can also be calculated as a weighted average of the differences in physical quantities such as the position, velocity, acceleration, angle, angular velocity, and angular acceleration of each joint making up the poses at the nodes concerned.
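A position-based node distance of the kind described above might look like the following. Equation (6) is not reproduced in this text, so the exact form is an assumption: the sketch takes the weighted sum, over joints, of the Euclidean distance between root-relative joint positions. The function name `pose_distance` and the weight values are illustrative.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pose_distance(pose_i: Sequence[Vec3],
                  pose_j: Sequence[Vec3],
                  weights: Sequence[float]) -> float:
    """Weighted sum of per-joint Euclidean distances (assumed form of the distance).

    pose_i, pose_j : root-relative positions of the K joints at the two nodes
    weights        : preset per-joint weights w_k
    """
    if not (len(pose_i) == len(pose_j) == len(weights)):
        raise ValueError("poses and weights must all have K entries")
    return sum(w * math.dist(p, q) for w, p, q in zip(weights, pose_i, pose_j))


# Two joints; only the second joint differs, by a (0, 3, 4) offset of length 5.
d = pose_distance(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)],
    [1.0, 0.5],
)
```

Using root-relative positions means two poses that differ only in where the character stands or which way it faces still compare as similar.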

Connectivity is then calculated using Equation (7). The connectivity between a node $F_B^i$ and a node $F_B^j$ is denoted $c(F_B^i, F_B^j)$.

Here, $d(F_B^i)$ is the distance between the frame preceding and the frame following node $F_B^i$ (calculated with a formula analogous to Equation (5) or Equation (6)). TH is a threshold set in advance.

When the connectivity $c(F_B^i, F_B^j)$ is 1, the pose of node $F_B^i$ and the pose of node $F_B^j$ can be judged to be similar. When the connectivity $c(F_B^i, F_B^j)$ is 0, the two poses cannot be judged to be similar.

[Motion graph construction step]
Next, in the motion graph construction step, a bidirectional edge is provided between node $F_B^i$ and node $F_B^j$ when the connectivity $c(F_B^i, F_B^j)$ is 1. When the connectivity $c(F_B^i, F_B^j)$ is 0, no bidirectional edge is provided between them.

Next, a unidirectional edge is provided between adjacent nodes within the same gesture data. The unidirectional edge runs from the temporally earlier node to the later node.
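The two edge rules can be sketched as follows. The connectivity test of Equation (7) is not reproduced in this text, so the sketch takes it as a caller-supplied predicate `connected(a, b)` that returns True when the connectivity is 1. The graph representation (a dictionary from each node to its set of successor nodes) and the name `build_motion_graph` are assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, List, Set

Node = Hashable


def build_motion_graph(clips: Dict[str, List[Node]],
                       connected: Callable[[Node, Node], bool]) -> Dict[Node, Set[Node]]:
    """Build a directed adjacency map from phase-boundary nodes.

    clips     : gesture id -> its phase-boundary nodes in time order
    connected : predicate standing in for connectivity c = 1
    """
    nodes = [n for sequence in clips.values() for n in sequence]
    graph: Dict[Node, Set[Node]] = {n: set() for n in nodes}

    # Bidirectional edge for every pair of nodes judged to have similar poses.
    for index, a in enumerate(nodes):
        for b in nodes[index + 1:]:
            if connected(a, b):
                graph[a].add(b)
                graph[b].add(a)

    # Unidirectional edge from each node to the next node of the same gesture.
    for sequence in clips.values():
        for earlier, later in zip(sequence, sequence[1:]):
            graph[earlier].add(later)
    return graph


# Toy input: the end of "bow" resembles the start of "wave".
similar_pairs = {frozenset(("b2", "w0"))}
graph = build_motion_graph(
    {"bow": ["b0", "b1", "b2"], "wave": ["w0", "w1"]},
    lambda a, b: frozenset((a, b)) in similar_pairs,
)
```

In this toy graph a path can play "bow" from `b0` to `b2` and then continue into "wave" through the bidirectional edge between `b2` and `w0`.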

Next, a blending process is performed on the gesture data associated with the nodes at both ends of each bidirectional edge. The blending process is performed separately for each direction of the bidirectional edge, so two blending processes are performed for one bidirectional edge, as shown in FIG. 4(1) and FIG. 4(2). FIG. 4 is a conceptual diagram of the blending process for a bidirectional edge between node i and node j. FIG. 4(1) shows the blending process for the direction from node i to node j, and FIG. 4(2) shows the blending process for the direction from node j to node i.

FIG. 5 is a conceptual diagram for explaining the blending process and corresponds to FIG. 4(1). The blending process for the direction from node i to node j shown in FIG. 4(1) is described here as an example, with reference to FIG. 5.

In the blending process, for gesture data 1, which contains node i, and gesture data 2, which contains node j, interpolation data (blending data) 1_2 is generated by mixing the portions where the two pieces of gesture data join, so that the motion does not connect unnaturally. In this embodiment, the joined portion is interpolated by quaternion spherical linear interpolation over the frames of a fixed time span. Specifically, the blending data 1_2 for the connection section joining gesture data 1 and gesture data 2 (section length m, where m is a predetermined value) is generated from data 1_m, the section of length m of gesture data 1 centered on node i, and data 2_m, the section of length m of gesture data 2 centered on node j.

Here, frame i of data 1_m and frame j of data 2_m, each corresponding to distance u, are mixed according to the ratio (u/m) of the distance u from the start of the connection section to the section length m. Specifically, each frame of the blending data 1_2 is generated by Equations (8) and (9). Equation (8) is written for a single bone.

Here, m is the total number (a predetermined value) of frames (blending frames) that make up the blending data 1_2, u is the position of a blending frame counted from the first (1 ≤ u ≤ m), q(k, u) is the quaternion of the k-th bone in the u-th blending frame, q(k, i) is the quaternion of the k-th bone in frame i, and q(k, j) is the quaternion of the k-th bone in frame j. Blending is not applied to the root. Equation (9) is the formula for slerp (spherical linear interpolation).

The blending data 1_2 serves as the data for the connecting portion between gesture data 1 and gesture data 2.
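Equations (8) and (9) are not reproduced in this text, so the following is only a minimal sketch of the blending described above, assuming that the interpolation weight is the ratio u/m and that the root bone is copied unchanged from data 1_m (the text says only that the root is not blended). The names `slerp`, `blend_section`, and `root_index` are illustrative and do not come from the specification.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one quaternion so interpolation takes the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:  # nearly identical rotations: normalized linear interpolation
        mixed = [a + t * (b - a) for a, b in zip(q0, q1)]
        norm = math.sqrt(sum(c * c for c in mixed))
        return tuple(c / norm for c in mixed)
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))

def blend_section(data_1m, data_2m, root_index=0):
    """Blend two m-frame sections; each frame is a list of per-bone quaternions."""
    m = len(data_1m)
    if len(data_2m) != m:
        raise ValueError("both sections must contain m frames")
    blended = []
    for u in range(1, m + 1):
        frame_i, frame_j = data_1m[u - 1], data_2m[u - 1]
        weight = u / m  # assumed weight: ratio of position u to section length m
        blended.append([
            # assumption: the root bone is taken from data 1_m without blending
            frame_i[k] if k == root_index else slerp(frame_i[k], frame_j[k], weight)
            for k in range(len(frame_i))
        ])
    return blended
```

With this weighting the last blending frame (u = m) coincides with data 2_m for every non-root bone, so the transition ends on the pose of gesture data 2.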

Next, dead ends are removed from the motion graph. A dead end is a node whose degree is 1. In the motion graph, the number of edges connected to a node is called its degree; the number of edges entering a node is its in-degree, and the number of edges leaving a node is its out-degree. Removing a dead end may create new dead ends, so removal is repeated until no dead ends remain.
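The repeated removal can be sketched as follows. This is an illustration only: it assumes the graph is held as a set of node names and a set of node pairs, and that each edge (unidirectional or bidirectional) counts once toward the degree of both of its endpoints. The function name `remove_dead_ends` is hypothetical.

```python
def remove_dead_ends(nodes, edges):
    """Repeatedly drop nodes of degree 1 until none remain.

    nodes: iterable of node names; edges: iterable of (a, b) pairs.
    Returns the pruned (nodes, edges) as new sets.
    """
    nodes, edges = set(nodes), set(edges)
    while True:
        degree = {n: 0 for n in nodes}
        for a, b in edges:
            degree[a] += 1
            degree[b] += 1
        dead = {n for n, d in degree.items() if d == 1}
        if not dead:
            return nodes, edges
        nodes -= dead
        # removing a dead end also removes its edge, which may expose new dead ends
        edges = {(a, b) for a, b in edges if a not in dead and b not in dead}
```

For a triangle A, B, C with a tail C-D-E, the first pass removes E, the second removes D, and the triangle remains.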

Next, metadata is attached to each edge of the motion graph. Each bidirectional edge is given, as its weight, the distance d(F^i_B, F^j_B) between the nodes F^i_B and F^j_B that it connects. Each unidirectional edge is given, as labels, a phase ID and the duration associated with the edge.

This concludes the description of the motion graph generation process, which produces one motion graph per category. The motion graph generation unit 20 also generates a steady motion graph as a special motion graph. The steady motion graph does not belong to any specific category; it is generated by the same motion graph generation process described above, using gesture data for the steady state.

[Motion graph database]
The motion graph database 30 stores the motion graphs generated by the motion graph generation unit 20, namely the category-specific motion graphs and the steady motion graph.

Next, the gesture generation device 10 shown in FIG. 1 is described. In FIG. 1, the gesture generation device 10 includes an input processing unit 11, a metadata generation unit 12, and a gesture data generation unit 13.

A pair of dialogue data and voice data is input to the gesture generation device 10. The dialogue data is the text data of a dialogue line. The voice data is the audio of the line contained in the dialogue data of the same pair.

[Input processing unit]
The input processing unit 11 performs morphological analysis on the dialogue data in the input and outputs a keyword string as the result. For example, for the dialogue data "じゃあ今日はウォーキングしなきゃね" ("Well then, I have to go walking today"), it outputs the keyword string "じゃあ｜今日｜は｜ウォーキング｜し｜なきゃ｜ね". Next, the input processing unit 11 sets the temporal correspondence between the voice data and the keyword string. The temporal correspondence between the dialogue data and the voice data is set in advance. When the voice data is synthesized speech, the correspondence between the audio and the line is obtained during speech synthesis and is used as is. When the voice data is not synthesized speech (that is, recorded speech), the correspondence between the audio and the line is set manually. Finally, the input processing unit 11 records the duration of the line.

[Metadata generation unit]
The metadata generation unit 12 uses a concept dictionary to attach a category label to each keyword in the keyword string output by the input processing unit 11. For example, "WordNet", described in Non-Patent Document 3, can be used as the concept dictionary. Next, for each keyword, the metadata generation unit 12 selects from the motion graph database 30 a motion graph of the same category. If more than one motion graph matches, one of them is selected, for example at random. If no motion graph matches, the steady motion graph is selected.
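The selection rule can be sketched as follows. The concept dictionary and the database are stood in for by plain Python dictionaries, and `STEADY_GRAPH`, `select_graphs`, and the graph names are hypothetical; a real concept dictionary such as WordNet would need its own lookup code.

```python
import random

STEADY_GRAPH = "steady"  # hypothetical identifier for the steady motion graph

def select_graphs(keywords, concept_dictionary, graph_db, rng=random):
    """Pick one motion graph per keyword, falling back to the steady graph."""
    selected = []
    for keyword in keywords:
        category = concept_dictionary.get(keyword)   # category label, or None
        candidates = graph_db.get(category, [])      # graphs of the same category
        if candidates:
            selected.append(rng.choice(candidates))  # several matches: pick one at random
        else:
            selected.append(STEADY_GRAPH)            # no match: steady motion graph
    return selected
```

Passing a seeded `random.Random` instance as `rng` makes the random choice reproducible, which is convenient when comparing generated gestures.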

Next, the metadata generation unit 12 determines the timing of the gesture stroke. Specifically, for the selected motion graph, it sets the start timing and end timing of the keyword of the same category as the stroke timing. If the steady motion graph was selected, however, the stroke timing for the steady motion graph is set to infinity (that is, left unspecified).

The metadata generation unit 12 takes the following as the metadata: the temporal correspondence between the voice data and the keyword string, the duration of the line, the selected motion graph, and the stroke timing for that motion graph.
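One possible in-memory layout for this metadata is sketched below. The class and field names, and the use of `math.inf` to represent the unspecified stroke timing of the steady motion graph, are assumptions made for illustration; the specification does not prescribe a data format.

```python
import math
from dataclasses import dataclass

@dataclass
class KeywordTiming:
    keyword: str
    start: float  # seconds from the start of the voice data
    end: float

@dataclass
class GestureMetadata:
    keyword_timings: list   # temporal correspondence between voice and keywords
    line_duration: float    # duration of the line, in seconds
    motion_graph: str       # identifier of the selected motion graph
    stroke_start: float
    stroke_end: float

def build_metadata(timings, line_duration, graph, stroke_keyword=None):
    """Assemble metadata; stroke timing follows the keyword matched to the graph."""
    if stroke_keyword is None:
        # steady motion graph: stroke timing is left unspecified (infinity)
        return GestureMetadata(timings, line_duration, graph, math.inf, math.inf)
    hit = next(t for t in timings if t.keyword == stroke_keyword)
    return GestureMetadata(timings, line_duration, graph, hit.start, hit.end)
```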

Alternatively, a user may create this metadata for the input data (dialogue data and voice data) by hand, using an authoring tool. In that case, the user freely selects the motion graph, selects the keyword to be matched to the stroke of that motion graph (that is, sets the stroke timing), sets the temporal correspondence between the voice data and the keyword string, and so on.

[Gesture data generation unit]
The gesture data generation unit 13 uses the metadata generated by the metadata generation unit 12 to generate gesture data matched to the voice of the line. FIG. 6 is a flowchart showing the flow of the gesture data generation method according to this embodiment. The operation by which the gesture data generation unit 13 generates gesture data is described below with reference to FIG. 6.

(Step S11) The gesture data generation unit 13 selects, from the motion graph, the node that serves as the starting point of the gesture data. For example, among the nodes in the motion graph that are first nodes of gesture data, the node with the smallest distance from (that is, the best connectivity with) the last pose of the gesture data is taken as the start node.
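A minimal sketch of this selection is shown below. It assumes each candidate first node is stored with its pose, and that a pose distance function is supplied by the caller; the test data uses the Euclidean distance between small vectors purely as a stand-in for the pose distance defined elsewhere in the specification.

```python
import math

def select_start_node(first_nodes, last_pose, distance=math.dist):
    """Return the first node whose pose is closest to the last pose.

    first_nodes: dict mapping node name -> pose
    distance: hypothetical pose distance; smaller means better connectivity
    """
    return min(first_nodes, key=lambda node: distance(first_nodes[node], last_pose))
```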

(Step S12) The gesture data generation unit 13 searches for optimal paths from the start node on the motion graph and selects the minimum-cost path. The search uses the path search technique described in Non-Patent Document 4, which finds the optimal path from a starting point by dynamic programming. The optimal path search step is described in detail below.

First, the cost of each path from the start node u to every node i on the motion graph is calculated by Equation (10). This initial shortest-path calculation for the start node u is the first operation.

Here, shortestPath(i, 1) is the cost of the path from the start node u to node i obtained by the first shortest-path calculation, and edgeCost(u, i) is the cost of the edge from node u to node i. The edge cost is recalculated each time, using Equation (11).

For the steady motion graph, the edge cost is calculated by Equation (12).

Next, in the k-th shortest-path calculation (k ≥ 2), the cost of the optimal path from the start node u to every node v on the motion graph is calculated by Equation (13).

Here, V is the set of nodes on the motion graph, shortestPath(v, k) is the cost of the optimal path from the start node u to node v obtained by the k-th shortest-path calculation, and edgeCost(i, v) is the cost of the edge from node i to node v.

The second and subsequent shortest-path calculations using Equation (13) are repeated until the end conditions of the optimal path search are satisfied.
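Equations (10) and (13) are not reproduced in this text. Read from the prose, the first operation sets shortestPath(i, 1) to edgeCost(u, i), and the k-th operation takes, over all i in V, the minimum of shortestPath(i, k-1) + edgeCost(i, v). The sketch below implements that reading only. Because Equations (11) and (12) are also not shown, the edge cost is a caller-supplied function, and treating a missing edge as infinite cost is an assumption.

```python
import math

def shortest_path_table(nodes, start, edge_cost, rounds):
    """Dynamic-programming table: table[k][v] is the minimum cost of reaching v
    from start in exactly k edges.

    edge_cost(i, v) is a hypothetical callable returning the cost of edge i -> v,
    or math.inf when there is no such edge.
    """
    # first operation: cost of the single edge from the start node to each node
    table = {1: {i: edge_cost(start, i) for i in nodes}}
    for k in range(2, rounds + 1):
        prev = table[k - 1]
        # k-th operation: extend every (k-1)-edge path by one more edge
        table[k] = {
            v: min(prev[i] + edge_cost(i, v) for i in nodes)
            for v in nodes
        }
    return table
```

In practice `rounds` would not be fixed in advance; the loop would stop once the end conditions of the search are met, and the predecessor of each minimum would be recorded so the path itself can be recovered.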

(Step S13) The gesture data generation unit 13 evaluates the end conditions of the optimal path search. The end conditions (a) to (d) are as follows.
(a) If the path length exceeds a predetermined number of frames N (corresponding to the duration of the line) at a node other than the last node in the motion graph, the path found is discarded as "exceeded".
(b) If the last node in the motion graph is reached, the path found is saved as an optimal path candidate.
(c) From the optimal path candidates, those whose path length differs from N by no more than a predetermined range are extracted. If the extracted candidates include both candidates with a path length less than N and candidates with a path length greater than N, the candidates with a path length less than N are selected.
(d) If more than one candidate is selected in (c), the candidate with the lowest cost is taken as the optimal path.
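Conditions (a), (c), and (d) can be sketched as two small functions. The `PathCandidate` type and the parameter names `target_frames` (for N) and `tolerance` (for the predetermined range) are illustrative, and the handling of a candidate whose length equals N exactly is an assumption, since the text does not address that case.

```python
from dataclasses import dataclass

@dataclass
class PathCandidate:
    nodes: list
    length: int   # path length in frames
    cost: float

def exceeds_budget(path_length, at_last_node, target_frames):
    """Condition (a): a path that overruns N before the last node is discarded."""
    return (not at_last_node) and path_length > target_frames

def choose_optimal_path(candidates, target_frames, tolerance):
    """Conditions (c) and (d) applied to the candidates saved under (b)."""
    near = [c for c in candidates if abs(c.length - target_frames) <= tolerance]
    shorter = [c for c in near if c.length < target_frames]
    longer = [c for c in near if c.length > target_frames]
    # (c) when both shorter and longer candidates exist, keep the shorter ones
    pool = shorter if (shorter and longer) else near
    # (d) among the remaining candidates, take the lowest cost
    return min(pool, key=lambda c: c.cost) if pool else None
```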

(Step S14) If the end conditions (a) to (d) of the optimal path search are satisfied, the process proceeds to step S15. Otherwise, the process returns to step S12.

(Step S15) Based on the selected optimal path, the gesture data generation unit 13 identifies the gesture data in the motion graph that corresponds to the optimal path. It then adjusts that gesture data to match the voice of the line. FIG. 7 is an explanatory diagram of the gesture data adjustment method according to this embodiment, which is described below with reference to FIG. 7.

As shown in FIG. 7, in the gesture data before adjustment, the stroke timing does not match the timing of the voice data for the corresponding keyword "walking" (ウォーキング). Therefore, the stroke is first shifted so that its start timing matches the start timing of the voice data for the keyword "walking". Next, the duration of the stroke is expanded or contracted so that the stroke ends at the end timing of the voice data for the keyword "walking". The allowable range of this expansion/contraction ratio is set in advance so that the motion does not become unnatural.

Next, the preparation period is adjusted. Because the start timing of the preparation period coincides with the start timing of the voice data of the line, the duration of the preparation period is expanded or contracted so that its end timing matches the start timing of the stroke. The allowable range of this expansion/contraction ratio is set in advance so that the motion does not become unnatural.

Next, the end period is adjusted. The duration of the end period is expanded or contracted so that its start timing matches the end timing of the stroke and its end timing matches the end timing of the voice data of the line. The allowable range of this expansion/contraction ratio is set in advance so that the motion does not become unnatural.
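The three adjustments amount to computing one stretch ratio per phase. The sketch below assumes the line's voice data starts at time 0, measures all times in seconds, and only reports whether each ratio lies inside the preset range; what happens when a ratio falls outside the range is not stated in the text, so no corrective action is taken here. The function name and the default range of 0.5 to 2.0 are hypothetical.

```python
def fit_phases(durations, keyword_start, keyword_end, line_duration,
               ratio_range=(0.5, 2.0)):
    """Compute the expansion/contraction ratio needed for each phase.

    durations: dict with the original lengths of "preparation", "stroke", "end"
    Returns a dict mapping phase -> (ratio, within_allowed_range).
    """
    targets = {
        "preparation": keyword_start,             # line start (t = 0) to stroke start
        "stroke": keyword_end - keyword_start,    # stroke spans the keyword audio
        "end": line_duration - keyword_end,       # stroke end to end of the line
    }
    low, high = ratio_range
    result = {}
    for phase, target in targets.items():
        ratio = target / durations[phase]
        result[phase] = (ratio, low <= ratio <= high)
    return result
```

For example, with a 0.5 s stroke and a keyword spoken from 0.8 s to 1.4 s, the stroke must be stretched by a factor of about 1.2.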

(Step S16) The gesture data generation unit 13 introduces randomness into the adjusted gesture data. For this process, several short pieces of gesture data (referred to as random gesture data) are prepared in advance. If the preparation period or the end period contains a frame that resembles a piece of random gesture data, the frames starting there are replaced with that random gesture data. Specifically, the distance between the first frame of each piece of random gesture data and each frame of the preparation period or end period is calculated by Equation (6). When the distance calculated for a given frame and a given piece of random gesture data is at or below a threshold, the frames of the preparation period or end period, from that frame through the duration of that random gesture data, are replaced with the random gesture data. In the example of FIG. 7, random gesture data 1000 replaces frames in both the preparation period and the end period.
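The replacement rule can be sketched as follows. Equation (6) is not reproduced in this part of the text, so the frame distance is a caller-supplied function. Skipping past a clip once it has been inserted, and ignoring a clip that would run past the end of the period, are assumptions made to keep the example well defined.

```python
def introduce_randomness(frames, random_gestures, distance, threshold):
    """Replace runs of frames with random gesture clips that start similarly.

    frames: frames of one preparation or end period
    random_gestures: list of non-empty clips (each a list of frames)
    distance: hypothetical frame distance standing in for Equation (6)
    """
    out = list(frames)
    t = 0
    while t < len(out):
        for clip in random_gestures:
            fits = t + len(clip) <= len(out)
            if fits and distance(clip[0], out[t]) <= threshold:
                out[t:t + len(clip)] = clip   # swap in the clip for its full duration
                t += len(clip) - 1            # continue after the inserted clip
                break
        t += 1
    return out
```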

Examples of random gesture data include tilting the head, swaying the body, and sticking out the tongue. Replacing frames of the preparation period or end period with such random gesture data adds an accent to that period. The preparation period and end period are sometimes silent sections in which the character is waiting for a line, and introducing randomness gives these silent sections an accent that keeps the user from becoming bored.

The gesture data generation unit 13 outputs the generated gesture data together with the input data (dialogue data and voice data). With this output gesture data, a video of natural gestures matched to the voice of the input line can be played back.

An embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like that do not depart from the gist of the present invention.

For example, for gesture data that has only a stroke, the gesture data generation unit 13 may use a predetermined steady pose to add a preparation period and an end period of fixed length before and after the stroke. For gesture data that has no end period, it may use the pose of the preparation period to add an end period of fixed length after the stroke. For gesture data that has no preparation period, it may use the pose of the end period to add a preparation period of fixed length before the stroke.

In addition, when the stroke length of the motion graph is a predetermined multiple or more of the duration of the line, the device may switch to the steady motion graph, or may insert a silent section of fixed length immediately after the part of the line's voice data that corresponds to the stroke.

A computer program for realizing the gesture generation system 1 described above may be recorded on a computer-readable recording medium, and the program recorded on that medium may be read into and executed by a computer system. The "computer system" referred to here may include an OS and hardware such as peripheral devices.

The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a DVD (Digital Versatile Disk), or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" also includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that acts as a server or a client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.
The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line.
The program may implement only some of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

Description of reference numerals: 1: gesture generation system, 10: gesture generation device, 11: input processing unit, 12: metadata generation unit, 13: gesture data generation unit, 20: motion graph generation unit, 30: motion graph database

Claims

A gesture generation device that receives, as input, dialogue data that is the text data of a dialogue line, voice data of the line, and a motion graph in which a plurality of gesture data are connected based on the connectivity of the plurality of gesture data, and that generates gesture data matched to the voice of the line, the gesture generation device comprising:
a gesture data generation unit that selects a minimum-cost path on the motion graph based on the duration of the line and adjusts the gesture data of the selected path to match the voice of the line.

The gesture generation device according to claim 1, wherein the gesture data generation unit aligns the start timing and end timing of a stroke of the gesture data of the selected path with the voice data of the keyword in the line that corresponds to the stroke.

The gesture generation device according to claim 2, wherein the gesture data generation unit matches the length of the gesture data of the selected path to the duration of the line.

The gesture generation device, wherein the gesture data generation unit expands or contracts the duration of the preparation period of the gesture data of the selected path so that the end timing of the preparation period matches the start timing of the stroke.

The gesture generation device according to claim 3, wherein, for the end period of the gesture data of the selected path, the gesture data generation unit expands or contracts the duration of the end period so that its start timing matches the end timing of the stroke and its end timing matches the end timing of the voice data of the line.

The gesture generation device according to any one of the claims up to and including claim 5, wherein the gesture data generation unit takes as the start node, among the first nodes of gesture data included in the motion graph, the node having the best connectivity with the last pose of the gesture data.

The gesture generation device according to any one of claims 1 to 6, wherein a plurality of the motion graphs exist, one per category, and
the gesture data generation unit generates the gesture data matched to the voice of the line using the motion graph whose category is the same as that of a keyword in the line.

The gesture generation device according to any one of the above claims, wherein the gesture data generation unit replaces frames of the preparation period or the end period of the gesture data of the selected path with predetermined gesture data that resembles those frames.

The gesture generation device according to any one of claims 1 to 8, wherein, for gesture data that has only a stroke, the gesture data generation unit uses a predetermined steady pose to add a preparation period and an end period of fixed length before and after the stroke.

The gesture generation device according to any one of claims 1 to 8, wherein, for gesture data that has no end period, the gesture data generation unit uses the pose of the preparation period to add an end period of fixed length after the stroke.

The gesture generation device according to any one of claims 1 to 8, wherein, for gesture data that has no preparation period, the gesture data generation unit uses the pose of the end period to add a preparation period of fixed length before the stroke.

The gesture generation device according to any one of claims 1 to 11, wherein, when the stroke length of the motion graph is equal to or longer than the duration of the line, the gesture data generation unit switches to a predetermined steady motion graph, or inserts a silent section of fixed length immediately after the part of the line's voice data that corresponds to the stroke of the motion graph.

A gesture generation system comprising:
the gesture generation device according to any one of claims 1 to 12;
a motion graph generation unit that sets, as nodes, the boundaries between the phases (preparation period, stroke, and end period) of input gesture data and generates a motion graph with edges based on the connectivity between the nodes; and
a motion graph database that stores the motion graph.

A gesture generation method for a gesture generation device that receives, as input, dialogue data that is the text data of a dialogue line, voice data of the line, and a motion graph in which a plurality of gesture data are connected based on the connectivity of the plurality of gesture data, and that generates gesture data matched to the voice of the line, wherein
the gesture generation device selects a minimum-cost path on the motion graph based on the duration of the line and adjusts the gesture data of the selected path to match the voice of the line.

A computer program for causing a computer of a gesture generation device that receives, as input, dialogue data that is the text data of a dialogue line, voice data of the line, and a motion graph in which a plurality of gesture data are connected based on the connectivity of the plurality of gesture data, and that generates gesture data matched to the voice of the line, to execute
a step of selecting a minimum-cost path on the motion graph based on the duration of the line and adjusting the gesture data of the selected path to match the voice of the line.