JP7817014B2

JP7817014B2 - Information processing device and program

Info

Publication number: JP7817014B2
Application number: JP2022029644A
Authority: JP
Inventors: 雄一郎山本; 清貴田中
Original assignee: Tribawl
Current assignee: Tribawl
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2026-02-18
Anticipated expiration: 2042-02-28
Also published as: JP2023125517A

Description

The present invention relates to an information processing device and a program.

Techniques are known for estimating skeletal information of a person in an image captured by a camera (see, for example, Patent Document 1). Techniques are also known for remotely controlling a robot, based on a person's movements, so that it performs specific operations (see, for example, Patent Document 2).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2021-189946 (特開２０２１－１８９９４６号公報)
Patent Document 2: Japanese Unexamined Patent Application Publication No. H07-160310 (特開平７－１６０３１０号公報)

Robots such as those described above are controlled so that they move in line with a person's physical movements or operations. Under this kind of motion control, the robot's motion is not only something the person making the physical movements (including operations) has anticipated in advance, but also depends solely on that person. Depending on the robot's use, it may be desirable to increase the diversity of the robot's motions, taking into account convenience for people and the impression made on them. The same applies when an image such as an avatar is moved by a person's movements.

An object of the present invention is to provide an information processing device and a program capable of increasing the diversity of motions of at least one of an avatar and a robot driven by a person's movements.

An information processing device according to one aspect of the present disclosure includes: image data acquisition means for acquiring image data obtained by imaging a detection target; skeletal information generation means for generating, based on a detection target image represented by the image data, skeletal information including positional information of joints in the detection target image and relationship information indicating relationships between the joints; analysis means for analyzing motion content in the detection target image based on the skeletal information; text information generation means for generating text information based on the result of analyzing the motion content; and motion control means for causing, based on the text information, at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, to move as a motion target.

A program according to one aspect of the present disclosure causes an information processing device to execute processing of: generating, based on a detection target image represented by image data obtained by imaging that assumes a detection target, skeletal information including positional information of joints in the detection target image and relationship information indicating relationships between the joints; analyzing motion content in the detection target image based on the skeletal information; generating text information based on the result of analyzing the motion content; and causing, based on the text information, at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, to move as a motion target.

The present invention can increase the diversity of motions of at least one of an avatar and a robot driven by a person's movements.

FIG. 1 is a diagram illustrating an example of a service providing system constructed by applying the present invention and the content of the service provided by that system.
FIG. 2 is a diagram illustrating an example of a motion, including a posture, for causing a motion target to perform another motion.
FIG. 3 is a diagram illustrating an example of a network environment to which an AP server according to an embodiment of the information processing device of the present invention is connected.
FIG. 4 is a diagram illustrating an example of a venue for games and the like.
FIG. 5 is a block diagram showing an example of the hardware configuration of the AP server according to an embodiment of the information processing device of the present invention.
FIG. 6 is a functional block diagram showing an example of the functional configuration realized on the AP server according to an embodiment of the information processing device of the present invention.
FIG. 7 is a flowchart showing an example of the video display process executed by the CPU of the AP server, which is the information processing device according to the present embodiment.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described are merely examples, and the technical scope of the present invention is not limited to them. The technical scope of the present invention also includes various modifications.

FIG. 1 is a diagram illustrating an example of a service providing system constructed by applying the present invention and the content of the service provided by that system. In the following, the service provided by this service providing system SY is referred to as "the present service."

In the example shown in FIG. 1, the subjects that the camera C is assumed to image (as video) are animals, mainly humans. In this example, a person is the detection target KT subject to skeleton detection, and information to be conveyed is delivered to the detection target KT in a timely manner by audio output, according to the motion content of the detection target KT.

To check the motion content of the detection target KT, skeletal information of the detection target KT is generated. This skeletal information includes positional information of the person's joints and relationship information indicating the relationships between joints. The relationship information represents, for example, the distance and direction between joints that are adjacent on the skeleton. The skeletal information is generated by analyzing the target video ST1 obtained by imaging with the camera C; that is, the skeletal information of the detection target KT is generated from the target video ST1. The skeleton image SG shown in FIG. 1 shows an example of the joint positions and of the joint pairs for which relationship information is generated. The generation of skeletal information itself is performed using well-known techniques.
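As a rough illustration of the data described above, the following Python sketch builds skeletal information from joint positions and derives the distance and direction between adjacent joints. The joint names, the list of adjacent pairs (`BONES`), the 2D coordinates, and the dictionary layout are assumptions made for this example; the text does not specify a skeleton layout or data format.

```python
import math
from dataclasses import dataclass

# Hypothetical adjacency list: which joints are neighbors on the skeleton.
BONES = [("shoulder_r", "elbow_r"), ("elbow_r", "wrist_r")]


@dataclass
class Relation:
    parent: str
    child: str
    distance: float    # length of the segment between the two joints
    direction: tuple   # unit vector pointing from parent to child


def build_skeleton_info(joints, bones=BONES):
    """Combine joint positions with distance/direction for each adjacent pair."""
    relations = []
    for parent, child in bones:
        (px, py), (cx, cy) = joints[parent], joints[child]
        dx, dy = cx - px, cy - py
        dist = math.hypot(dx, dy)
        # Coincident joints have no meaningful direction; use a zero vector.
        direction = (dx / dist, dy / dist) if dist else (0.0, 0.0)
        relations.append(Relation(parent, child, dist, direction))
    return {"joints": dict(joints), "relations": relations}


info = build_skeleton_info({
    "shoulder_r": (0.0, 0.0),
    "elbow_r": (3.0, 4.0),
    "wrist_r": (3.0, 9.0),
})
for rel in info["relations"]:
    print(rel.parent, "->", rel.child, rel.distance, rel.direction)
```

Storing direction as a unit vector keeps it independent of segment length, which is convenient if the skeleton is rescaled later.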

Multiple cameras C may be used for imaging. When multiple cameras C are installed at different positions, the target video ST1 obtained from each camera C makes it possible to see parts of the detection target KT that fall in the blind spot of one or more of the cameras C. Skeletal information can therefore be generated more accurately. This is particularly useful, for example, when there are two or more detection targets KT. For convenience, the following description assumes a single camera C.

As shown in FIG. 1, the service providing system SY includes a skeletal information generation system H1, a posture evaluation system H2, and a posture management system H3.
In FIG. 1, systems H1 to H3 are shown separately, but these systems (their functions) may be implemented on the same information processing device, or two of them may be implemented on the same device. Here, as shown in FIG. 1, each of systems H1 to H3 is assumed to be implemented on a different information processing device; that is, the service providing system SY is assumed to be built from three or more information processing devices.

The skeletal information generation system H1 is implemented on an information processing device running an OS (Operating System) H11. The OS H11 includes a camera device video acquisition and management switch (SW) H111. This switch H111 is a function that switches the destination to which various information received by the skeletal information generation system H1, including the target video ST1, is passed. The target video ST1 received by the skeletal information generation system H1 is thus passed by the switch H111 to the processing engine H112 as the target video ST2. The processing engine H112 and the skeleton detection unit H113 are both functions implemented, for example, by application programs (hereinafter "applications") running on the OS H11.

The processing engine H112 is a function that performs overall control for providing the present service. Through a skeleton detection process call ST3, the processing engine H112 causes the skeleton detection unit H113 to execute the skeleton detection process and generate skeletal information from the target video ST2. To that end, the processing engine H112 passes, for example, the target video ST2 to the skeleton detection unit H113 with the call ST3. What is passed may instead be video information covering only the portion that contains the detection target image, rather than the whole target video ST2. Here, the target video ST2 is assumed to be passed from the processing engine H112 to the skeleton detection unit H113.

The skeleton detection unit H113 executes the skeleton detection process to generate skeletal information and returns the processing result ST4 to the processing engine H112. This processing result ST4 is the skeletal information, generated in the form of a skeletal representation data string.

The processing engine H112 can output the skeletal information returned from the skeleton detection unit H113 as the processing result ST4 to both the posture evaluation system H2 and the posture management system H3, as a skeletal representation data string ST5. The detection target image KTG, which is an image of the detection target KT, is also output to the posture evaluation system H2 as the image ST6 of the detection target.
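The data flow inside the skeletal information generation system H1 can be pictured with the following Python sketch. It is only a schematic under stated assumptions: the class `ProcessingEngine`, the sink lists, and `stub_detector` are hypothetical names, frames are represented as raw bytes, and the detector is a placeholder rather than a real pose estimator.

```python
from typing import Callable, Dict, List, Tuple

Frame = bytes                               # stand-in for one frame of target video ST2
Skeleton = Dict[str, Tuple[float, float]]   # joint name -> (x, y)


class ProcessingEngine:
    """Hypothetical model of processing engine H112."""

    def __init__(self, detect: Callable[[Frame], Skeleton]):
        self.detect = detect                                     # skeleton detection unit H113
        self.skeleton_sinks: List[Callable[[Skeleton], None]] = []  # receivers of ST5
        self.image_sinks: List[Callable[[Frame], None]] = []        # receivers of ST6

    def on_frame(self, frame: Frame) -> None:
        skeleton = self.detect(frame)      # call ST3, result ST4
        for sink in self.skeleton_sinks:   # ST5 goes to H2 and H3
            sink(skeleton)
        for sink in self.image_sinks:      # ST6 goes to H2 only
            sink(frame)


def stub_detector(frame: Frame) -> Skeleton:
    # Placeholder: a real detector would estimate joints from pixel data.
    return {"wrist_r": (float(len(frame)), 0.0)}


h2_skeletons, h2_images, h3_skeletons = [], [], []
engine = ProcessingEngine(stub_detector)
engine.skeleton_sinks += [h2_skeletons.append, h3_skeletons.append]
engine.image_sinks.append(h2_images.append)
engine.on_frame(b"abc")
```

Registering H2 and H3 as separate sinks mirrors the option, mentioned in the text, of running the three systems on the same device or on different ones.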

The posture evaluation system H2 can output either the image ST6 of the detection target received from the skeletal information generation system H1, or a character image generated from the skeletal representation data string ST5, to the image display system GH as part of the image ST8 to be displayed. The image ST8 is, for example, one screen's worth of display content, and the character image is placed within that screen. The character image is assumed to be an alter ego of the detection target KT.

Even when other character images are displayed, those characters may include the alter ego of another detection target KT or the alter ego of a virtual personality. A virtual personality is, specifically, one conceived for providing the present service. Hereinafter, a character positioned as such an alter ego is referred to as an "avatar." Unless otherwise noted, "character" refers to a character other than an avatar.

The image display system GH displays a screen on the display D using the image ST8 received from the posture evaluation system H2. The posture evaluation system H2 can therefore freely change the content of the screen shown on the display D through the image ST8 it outputs to the image display system GH. FIG. 1 shows a state in which the detection target image KTG represented by the image ST8 is displayed on the display D. An avatar image generated from the skeletal information may be displayed instead of the detection target image KTG.

The display D is, for example, a large display device or a screen. If the display D is a screen, the image display system GH also includes a projector capable of projecting onto it.
Meanwhile, the posture evaluation system H2 generates the text information to be output as audio through analysis using the skeletal representation data string ST5. To do so, the posture evaluation system H2 performs a difference-from-model AI (Artificial Intelligence) analysis SH2.
As described in detail later, this text information also functions as control information for controlling various actions of the motion target, which is an avatar image or a robot. These actions include not only actions that visually change the motion target but also speech actions made by display without such visual change or by emitting sound.

The difference-from-model AI analysis SH2 is performed on the premise that the posture (pose) the detection target KT should take at a given timing is known in advance. For this analysis, model data TD representing the model posture is prepared. Movements for which the posture to be taken is known in advance include dance; games that require imitating a model's movements or taking one or more predetermined postures; and exercise (for example, sports) for which an ideal movement can be defined. Hereinafter, a series of movements arising from such requirements or exercise is collectively referred to as a "posture-requiring movement."

The model data TD includes skeletal information in addition to the model image OG, which is an image of the model posture. The difference-from-model AI analysis SH2 therefore computes the difference between the skeletal representation data string ST5 and the skeletal information in the model data TD, and performs AI analysis using that difference. Through this analysis, the posture represented by the detection target image KTG is evaluated, and text information corresponding to the evaluation result is generated. In the example shown in FIG. 1, the difference correction instruction ST9 output from the posture evaluation system H2 to the audio output system TO instructs that the text information be emitted as audio. Specifically, the difference correction instruction ST9 is either an audio signal generated for emission or a command or the like containing the information needed to generate that audio signal.

For the analysis using the difference, the AI undergoes deep learning with training data that represents the relationship between differences and the text information to be generated. With this deep learning, the difference-from-model AI analysis SH2 can generate appropriate text information once the difference has been generated (calculated). Because the ratio between the size of the detection target image KTG and the size of the model image OG represented by the model data TD is not necessarily within an appropriate range, the difference is generated, for example, after enlarging or reducing one of the two sets of skeletal information.
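The enlarge-or-reduce step before taking the difference can be sketched as follows. This is an illustrative assumption, not the method fixed by the text: the detected skeleton is rescaled so that its neck-to-hip length matches the model's, both skeletons are expressed relative to the hip, and the per-joint offset is returned. The joint names and the choice of reference segment are hypothetical.

```python
import math


def skeleton_size(joints, a="neck", b="hip"):
    """Reference length for scaling: distance between two assumed joints."""
    (ax, ay), (bx, by) = joints[a], joints[b]
    return math.hypot(ax - bx, ay - by)


def normalized_difference(detected, model, origin="hip"):
    """Per-joint (dx, dy) of the detected pose relative to the model pose,
    after scaling the detected skeleton to the model's size."""
    detected_size = skeleton_size(detected)
    if detected_size == 0:
        raise ValueError("detected skeleton has zero reference length")
    scale = skeleton_size(model) / detected_size
    dox, doy = detected[origin]
    mox, moy = model[origin]
    diff = {}
    for name, (mx, my) in model.items():
        if name not in detected:
            continue  # joint not observed in this frame
        dx = (detected[name][0] - dox) * scale
        dy = (detected[name][1] - doy) * scale
        diff[name] = (dx - (mx - mox), dy - (my - moy))
    return diff


# Image coordinates are assumed, so y grows downward.
model = {"hip": (0.0, 0.0), "neck": (0.0, -10.0), "wrist_r": (5.0, -12.0)}
detected = {"hip": (100.0, 200.0), "neck": (100.0, 180.0), "wrist_r": (110.0, 184.0)}
diff = normalized_difference(detected, model)
print(diff["wrist_r"])  # positive dy: the detected wrist is lower than the model's
```

In the described system, the resulting difference would be the input to the trained model that produces text information; the sketch stops at the difference itself.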

FIG. 1 shows the model image OG displayed on the display D in addition to the detection target image KTG.
The part of the detection target image KTG that differs most from the model image OG is the right hand. The left hand also clearly differs, but to a lesser degree than the right hand. In this case, therefore, the difference-from-model AI analysis SH2 generates text information representing a character string such as "Right hand a little higher." This character string is then output as audio, that is, emitted as sound, from the audio output system TO. The audio output is performed treating the detection target image KTG or the model image OG as an avatar image.
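The mapping from a pose difference to a message such as the one above is learned by deep learning in the described system. Purely to illustrate the input and output of that step, the following Python sketch substitutes a hand-written rule: it picks the joint with the largest deviation and phrases a correction. The thresholds, the label table, and the function name `correction_text` are hypothetical, and image coordinates with y growing downward are assumed.

```python
import math

# Hypothetical display names for joints.
LABELS = {"wrist_r": "right hand", "wrist_l": "left hand"}


def correction_text(diff, small=2.0, large=8.0):
    """Rule-based stand-in for the trained model.

    diff maps joint name -> (dx, dy), the detected position minus the
    model position after scale normalization.
    """
    name, (dx, dy) = max(diff.items(), key=lambda kv: math.hypot(*kv[1]))
    size = math.hypot(dx, dy)
    if size < small:
        return None  # pose is close enough; no correction message
    amount = "a little" if size < large else "much"
    if abs(dy) >= abs(dx):
        # Positive dy means the joint is too low on screen, so ask for higher.
        way = "higher" if dy > 0 else "lower"
    else:
        way = "toward the left of the image" if dx > 0 else "toward the right of the image"
    return f"Move your {LABELS.get(name, name)} {amount} {way}"


message = correction_text({"wrist_r": (0.0, 4.0), "wrist_l": (1.0, -2.5)})
print(message)
```

A learned model could also take into account combinations of joints or the history of the movement, which a single-joint rule like this cannot express.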

The audio output by the audio output system TO is directed at the detection target KT. The detection target KT thus obtains, in a timely manner and by audio, information useful for taking the appropriate posture, as if it came from the detection target image KTG or the model image OG displayed as an avatar image on the display D. In effect, the displayed detection target image KTG or model image OG is treated as the motion target, and generating text information from the detection target image KTG causes that motion target to virtually perform a speech action.

The action of the motion target need not be a speech action, that is, audio output. For example, the motion target may be made to perform the movement the detection target KT needs in order to approach the posture of the model image OG. The character string may also be displayed as the avatar image's speech. Audio output (output of a message or the like) may be performed only when the posture in the detection target image KTG is evaluated as inappropriate; when the posture is evaluated as appropriate, a sound effect or staging sound may be emitted so that the detection target KT can recognize this. Accordingly, the text information may also represent the emission of a sound effect or staging sound. A speech action may be expressed by either sound emission or display.

As described above, the posture evaluation system H2 uses the skeletal representation data string ST5 received from the skeletal information generation system H1 to evaluate the posture of the detection target KT, generates text information, and realizes an action of the avatar image according to that text information. In this way, the motion target is made to perform a different action that is not among the detection target KT's own movements, and information useful to the detection target KT is provided through that action. This not only increases the diversity of the motion target's actions but also improves convenience for the detection target KT, who can more easily perform a more appropriate posture-requiring movement.

To move their body appropriately, the detection target KT must pay attention to the audio output, which demands greater concentration. Because they must move in that state, the detection target KT can more easily reach a deeper sense of immersion. As a result, the detection target KT can be given a better impression or a higher level of satisfaction.

The posture management system H3, on the other hand, uses the skeletal representation data string ST5 received from the skeletal information generation system H1 for motion control in which the motion target is at least one of a robot, which is a physical machine, and an avatar image displayed on a screen. The motion target thereby moves in line with the movement of the detection target image KTG, so the detection target KT can control the motion target through their own movement. The detection target KT is thus provided with an environment in which they can enjoy moving the motion target as entertainment, or check their own posture or movement. A physical machine here is one equipped with one or more power sources such as motors. A robot that is a physical machine can change its overall shape using the power transmitted from its power sources.

Motion control of the robot is performed, for example, by having the posture management system H3 send control information for taking the next posture as a motion/display instruction ST7. This control information is mainly information for operating the power sources. Motion control of the avatar image is performed, for example, by having the posture management system H3 send image information representing the screen to be displayed, as a motion/display instruction ST7, to an image and audio generation system capable of displaying a screen containing the avatar image as video.

For both a robot and an avatar image used as the motion target, the proportion of each body part to the whole usually differs from that in the detection target image KTG. For example, if the figure in the detection target image KTG is eight heads tall, the motion target is often less than eight heads tall, so the proportions normally differ. For this reason, the posture management system H3 performs a scale conversion process SH3 that adjusts the received skeletal information to the proportions of the motion target. This adjustment is made with reference to parameters P prepared according to the difference in proportions. By generating and sending the motion/display instruction ST7 from the adjusted skeletal information, the motion target moves in line with the movement represented by the detection target image KTG without appearing unnatural.
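One simple way to picture the scale conversion process SH3 is to keep each bone's direction and scale its length by a per-bone ratio. The Python sketch below does this under stated assumptions: the bone list, the joint names, and the representation of the parameters P as a dictionary of length ratios are all hypothetical, since the text does not define the form of P.

```python
# Hypothetical bone list, ordered so that each parent appears before its children.
BONES = [("hip", "neck"), ("neck", "head"), ("neck", "wrist_r")]


def retarget(joints, ratios, bones=BONES, root="hip"):
    """Rebuild the pose with each bone length multiplied by its ratio.

    ratios plays the role of the parameters P: (parent, child) -> length ratio
    of the motion target relative to the detected person. Missing bones use 1.0.
    """
    out = {root: joints[root]}
    for parent, child in bones:
        px, py = joints[parent]
        cx, cy = joints[child]
        k = ratios.get((parent, child), 1.0)
        ox, oy = out[parent]
        # Same direction as the detected bone, length scaled by k.
        out[child] = (ox + (cx - px) * k, oy + (cy - py) * k)
    return out


person = {
    "hip": (0.0, 0.0),
    "neck": (0.0, -10.0),
    "head": (0.0, -13.0),
    "wrist_r": (6.0, -8.0),
}
# A stubbier avatar: half-length torso, double-size head.
params = {("hip", "neck"): 0.5, ("neck", "head"): 2.0}
avatar_pose = retarget(person, params)
print(avatar_pose)
```

Because each child is rebuilt from its already converted parent, a shortened torso automatically moves the head and arms with it, which keeps the pose connected.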

The posture management system H3 may also be provided with model data TD and may generate and output text information. This allows, for example, information on whether key postures are good or bad, or on the next posture to take, to be conveyed to the detection target KT in a timely manner. When information on the next posture to take is to be conveyed, the model data TD may be used to determine whether the detection target image KTG has reached a posture at which text information should be generated.
The different action of the motion target described above is triggered by comparison with a single posture represented by a single set of model data TD. However, the different action may instead be triggered by comparison with a motion that includes a posture.

FIG. 2 is a diagram illustrating an example of a motion, including a posture, for causing a motion target to perform another motion. Each circle drawn in FIG. 2 represents a different joint.
The example shown in FIG. 2 assumes a ninja throwing a shuriken, with the shuriken thrown by the left hand. The left-hand diagram in FIG. 2 shows an example of the posture taken when throwing a shuriken. In this posture, the right elbow is bent, the right hand holding shuriken is positioned near the waist at waist height, and the left hand has grasped a shuriken. Hereinafter, this posture is referred to, for convenience, as the "initial assumed posture."

The right-hand diagram in FIG. 2, in contrast, shows an example of the posture after the shuriken has been thrown. This is the posture reached from the initial assumed posture shown in the left-hand diagram by moving the left hand toward the right, as seen facing the figure, and releasing the shuriken. To throw the shuriken, the body has rotated counterclockwise as viewed from above.

To throw a shuriken held in the left hand, the left hand must be moved a relatively large distance, as shown in FIG. 2. Taking the initial assumed posture as the starting point and checking how far the hand assumed to hold the shuriken has moved from it therefore makes it possible to determine whether the detection target KT has performed the throwing motion. When it is determined that the throwing motion has been performed, generating text information makes it possible to emit, for example, the whistling sound of the thrown shuriken or a staging sound. Emitting such sounds makes it possible to provide an environment with a greater sense of realism, particularly in games played by moving the body. The emission of such a sound can itself be regarded as an action separate from the shuriken-throwing motion that was part of the detection target image KTG. Beyond determining whether the throwing motion was performed, the position at which the shuriken was released and the direction in which it was thrown can also be determined.

To make a movement such as the one shown in FIG. 2 detectable, the model data TD may include, for example, the following data: a first condition for regarding a posture as the initial assumed posture; a second condition for regarding the prescribed motion as having been performed from that initial assumed posture; and the time range allowed from when the first condition ceases to be satisfied until the second condition is satisfied.
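The three items of model data listed above (first condition, second condition, and time range) map naturally onto a small state machine. The following Python sketch is one possible reading under stated assumptions: the class names, the threshold values, the use of only the horizontal wrist offset from the hip, and the text label are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Skeleton = Dict[str, Tuple[float, float]]


@dataclass
class MotionModel:
    first: Callable[[Skeleton], bool]    # holds while in the initial assumed posture
    second: Callable[[Skeleton], bool]   # holds once the prescribed motion is complete
    min_s: float                         # allowed time range, lower bound (seconds)
    max_s: float                         # allowed time range, upper bound (seconds)
    text: str                            # text information generated on detection


class MotionDetector:
    def __init__(self, model: MotionModel):
        self.model = model
        self.in_initial = False
        self.left_at: Optional[float] = None  # time the first condition stopped holding

    def update(self, t: float, skeleton: Skeleton) -> Optional[str]:
        m = self.model
        if m.first(skeleton):
            self.in_initial, self.left_at = True, None
            return None
        if self.in_initial:                   # first condition just ceased to hold
            self.in_initial, self.left_at = False, t
        if self.left_at is None:
            return None
        elapsed = t - self.left_at
        if elapsed > m.max_s:                 # too slow: abandon this attempt
            self.left_at = None
            return None
        if m.second(skeleton) and elapsed >= m.min_s:
            self.left_at = None
            return m.text
        return None


def offset_x(s: Skeleton) -> float:
    return s["wrist_l"][0] - s["hip"][0]


throw = MotionModel(
    first=lambda s: abs(offset_x(s)) < 0.1,   # left hand near the waist
    second=lambda s: offset_x(s) > 0.5,       # left hand swung far to one side
    min_s=0.05, max_s=0.6,
    text="[SFX] shuriken whoosh",
)
detector = MotionDetector(throw)
frames = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.7), (0.4, 0.8)]   # (time, wrist x)
events = [detector.update(t, {"hip": (0.0, 0.0), "wrist_l": (x, 0.0)}) for t, x in frames]
print(events)
```

The upper time bound rejects a hand that drifts slowly into the end position, and requiring the initial posture again before the next detection prevents one throw from being counted twice.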

ゲーム等において、このような手裏剣を投げる動作を検知可能にする場合、動作対象は、対戦相手等とするアバター画像とすることが考えられる。そのような動作対象では、手裏剣を投げる動作の確認により、手裏剣が投げられたことに対する発言を行わせるとともに、手裏剣に対応するための動作を行わせるようにしても良い。つまり、一つのテキスト情報の生成により、更に一つ以上のテキスト情報を生成するようにしても良い。 When the action of throwing a shuriken is made detectable in a game or the like, the operation target could be an avatar image representing an opponent or the like. When the throwing action is confirmed, such an operation target may be made to utter a remark about the shuriken having been thrown and to perform an action in response to the shuriken. In other words, the generation of one piece of text information may trigger the generation of one or more further pieces of text information.

このようなこともあり、テキスト情報を生成する検知対象画像ＫＴＧの動き、その動きにより生成するテキスト情報の内容、及び生成したテキスト情報を出力させた後の動作制御等は、様々なものが考えられる。例えば手裏剣以外の武器の使用を想定した動作に着目し、仕様を考えても良い。しかし、どのような仕様であっても、動作対象の動作による表現の幅がより広がるだけでなく、検知対象ＫＴに提供される情報量もより大きくなる。その結果、検知対象ＫＴの満足度もより高くさせることができるようになる。 For this reason, various possibilities are conceivable for the movement of the detection target image KTG that generates text information, the content of the text information generated by that movement, and the movement control after the generated text information is output. For example, specifications could be considered that focus on movements that anticipate the use of weapons other than shuriken. However, regardless of the specifications, not only will the range of expression through the movements of the movement target be wider, but the amount of information provided to the detection target KT will also be greater. As a result, the satisfaction of the detection target KT can be increased.

以降は、図３～図７を参照しつつ、図１に例示する本サービスを提供する具体的な実現方法について詳細に説明する。
図３は、本発明の情報処理装置の一実施形態に係るＡＰ（APplication）サーバが接続されたネットワーク環境の一例を説明する図である。 Hereinafter, a specific method for providing the present service illustrated in FIG. 1 will be described in detail with reference to FIGS. 3 to 7.
FIG. 3 is a diagram illustrating an example of a network environment to which an AP (Application) server is connected according to an embodiment of the information processing apparatus of the present invention.

ＡＰサーバ１は、本サービスを提供するサービス提供会社ＳＫが設置の情報処理装置である。サービス提供会社ＳＫは、例えばゲーム等の娯楽を提供する企業等の組織と契約し、契約した組織への本サービスの提供を通して、その組織を利用する訪問者の満足度がより高くなるように支援する。
図３では、ＡＰサーバ１は、サービス提供会社ＳＫ内に設置されたものとして表しているが、別の場所に設置されていても良い。例えばクラウドサービスにより提供されるものであっても良い。 The AP server 1 is an information processing device installed by a service provider SK that provides this service. The service provider SK enters into a contract with an organization, such as a company that provides entertainment such as games, and by providing this service to the contracted organization, helps to increase the satisfaction of visitors who use the organization.
In FIG. 3, the AP server 1 is shown as being installed within the service provider SK, but it may be installed elsewhere. For example, it may be provided by a cloud service.

ＡＰサーバ１は、実際には、例えばプロキシ等の他の情報処理装置を介してネットワークＮと接続されている。しかし、ここでは、説明上、便宜的に、ＡＰサーバ１とネットワークＮとの間に介在する情報処理装置は無視することとする。つまり、ＡＰサーバ１は直接的にネットワークＮと接続されているものとする。ネットワークＮは、例えばインターネットを含むものである。 AP server 1 is actually connected to network N via another information processing device, such as a proxy. However, for the sake of convenience, information processing devices intervening between AP server 1 and network N will be ignored here. In other words, it is assumed that AP server 1 is directly connected to network N. Network N includes, for example, the Internet.

顧客企業ＫＫは、サービス提供会社ＳＫと契約した組織である。この顧客企業ＫＫは、利用者が身体を使ったゲーム等を行える場を提供する。本サービスは、その場を利用する利用者に対し、より満足できる環境の提供を可能にする。以降、顧客企業ＫＫは、契約した組織の総称として用いる。顧客企業ＫＫは、複数、存在する。 Customer company KK is an organization that has entered into a contract with service provider SK. This customer company KK provides a space where users can play physical games and the like. This service makes it possible to provide a more satisfying environment for users of that space. Hereinafter, "customer company KK" is used as a general term for the contracted organizations. There are multiple customer companies KK.

顧客企業ＫＫには、上記カメラＣの他に、サーバ２、プロジェクター３、サウンドシステム４、及びマイクロホン（図３では「マイク」と略記。以降、この略記を用いる）５が設置されている。サーバ２を除く全ては、ゲーム等が行える一つの場に設置されている。そのため、サーバ２を除く全ては、複数、存在するのが普通である。 In addition to the camera C, customer company KK is equipped with a server 2, a projector 3, a sound system 4, and a microphone (abbreviated as "mic" in Figure 3; this abbreviation will be used hereafter) 5. All of these except the server 2 are installed in a single space where games and the like can be played. For this reason, there are normally multiple units of each of them, except for the server 2.

図４は、ゲーム等のための場の例を説明する図である。
ゲーム等のための場（以降「ゲーム場」と表記）は、図４に示すように、スクリーンＳＣが設置されているか、或いは設置可能な空間である。その空間には、スクリーンＳＣへの投影が可能なプロジェクター３、各種音の放音が可能なサウンドシステム４、カメラＣ、及びマイク５が設置されている。カメラＣは、スクリーンＳＣ側から利用者を撮像可能なように設置されている。 FIG. 4 is a diagram illustrating an example of a place for games and the like.
A place for games and the like (hereinafter referred to as a "game center") is a space where a screen SC is installed or can be installed, as shown in Fig. 4. In the space, a projector 3 capable of projecting onto the screen SC, a sound system 4 capable of emitting various sounds, a camera C, and a microphone 5 are installed. The camera C is installed so that it can capture images of users from the screen SC side.

サーバ２は、カメラＣ、及びマイク５と接続されている。プロジェクター３、及びサウンドシステム４は、ネットワークＮを介してＡＰサーバ１と接続されている。
サウンドシステム４は、放音装置であるスピーカを含むシステムであり、ネットワークＮを介した通信が可能な端末として機能する情報処理装置も含まれる。それにより、サウンドシステム４は、ＡＰサーバ１から送信される音声信号により、スピーカから音を放音させる。 The server 2 is connected to a camera C and a microphone 5. The projector 3 and sound system 4 are connected to the AP server 1 via a network N.
The sound system 4 is a system including a speaker, which is a sound emitting device, and also includes an information processing device that functions as a terminal capable of communication via the network N. As a result, the sound system 4 emits sound from the speaker in response to an audio signal transmitted from the AP server 1.

プロジェクター３は、ネットワークＮを介した通信機能を備えた情報処理装置である、ＡＰサーバ１から送信される映像信号により、投影すべき画面を生成し、生成した画面をスクリーンＳＣ上に投影させることができる。
カメラＣは、ゲーム場の利用者の動画撮影に用いられる。この撮影結果は、画像データ（動画情報）として、サーバ２からＡＰサーバ１に送信される。この利用者の全て、或いは少なくとも一人は、検知対象ＫＴである。カメラＣがスクリーンＳＣ側に設置されているのは、検知対象ＫＴである利用者はスクリーンＳＣのほうを向いて、身体を動かすゲームを行うものと想定しているからである。 The projector 3 is an information processing device equipped with a communication function via the network N, and can generate a screen to be projected using a video signal transmitted from the AP server 1, and project the generated screen onto the screen SC.
Camera C is used to capture video of users at the game center. The captured video is sent as image data (video information) from server 2 to AP server 1. All or at least one of these users is a detection target KT. Camera C is installed on the screen SC side because it is assumed that users who are detection targets KT will face the screen SC and move their bodies while playing the game.

マイク５は、ゲーム場の利用者が発する音声を拾うために設置されている。マイク５から出力された音声信号は、サーバ２により音声情報に変換され、ＡＰサーバ１に送信される。 Microphone 5 is installed to pick up sounds made by game center users. The audio signal output from microphone 5 is converted into audio information by server 2 and transmitted to AP server 1.

図５は、本発明の情報処理装置の一実施形態に係るＡＰサーバ１のハードウェア構成の一例を示すブロック図である。次に図５を参照し、ＡＰサーバ１のハードウェア構成例について具体的に説明する。なお、この構成例は一例であり、ＡＰサーバ１のハードウェア構成はこれに限定されない。 Figure 5 is a block diagram showing an example of the hardware configuration of the AP server 1 according to an embodiment of the information processing device of the present invention. Next, with reference to Figure 5, an example of the hardware configuration of the AP server 1 will be described in detail. Note that this example configuration is merely an example, and the hardware configuration of the AP server 1 is not limited to this.

ＡＰサーバ１は、図５に示すように、ＣＰＵ（Central Processing Unit）１１と、ＲＯＭ（Read Only Memory）１２と、ＲＡＭ（Random Access Memory）１３と、バス１４と、入出力インターフェース１５と、出力部１６、入力部１７と、記憶部１８と、通信部１９と、及びドライブ２０と、を備えている。 As shown in FIG. 5, the AP server 1 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a bus 14, an input/output interface 15, an output unit 16, an input unit 17, a storage unit 18, a communication unit 19, and a drive 20.

ＣＰＵ１１は、ＲＯＭ１２に記録されているプログラム、及び記憶部１８からＲＡＭ１３にロードされたプログラムに従って各種の処理を実行する。記憶部１８からＲＡＭ１３にロードされるプログラムには、例えばＯＳ、及びそのＯＳ上で動作する各種アプリケーション・プログラムが含まれる。各種アプリケーション・プログラムには、本サービスの提供用に開発されたものが１つ以上、含まれる。 The CPU 11 executes various processes in accordance with programs stored in the ROM 12 and programs loaded from the storage unit 18 to the RAM 13. The programs loaded from the storage unit 18 to the RAM 13 include, for example, an OS and various application programs that run on the OS. The various application programs include one or more programs developed for the purpose of providing this service.

ＲＡＭ１３には、ＣＰＵ１１が各種の処理を実行する上において必要なデータ等も適宜記憶される。そのデータには、ＣＰＵ１１が実行する各種プログラムも含まれる。
ＣＰＵ１１、ＲＯＭ１２及びＲＡＭ１３は、バス１４を介して相互に接続されている。このバス１４にはまた、入出力インターフェース１５も接続されている。入出力インターフェース１５には、出力部１６、入力部１７、記憶部１８、通信部１９、及びドライブ２０が接続されている。 The RAM 13 also stores data and the like necessary for the CPU 11 to execute various processes, including various programs executed by the CPU 11.
The CPU 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output interface 15 is also connected to the bus 14. An output unit 16, an input unit 17, a storage unit 18, a communication unit 19, and a drive 20 are connected to the input/output interface 15.

出力部１６は、例えば液晶等のディスプレイを含む構成である。出力部１６は、ＣＰＵ１１の制御により、各種画像、或いは各種画面を表示する。出力部１６は、ＡＰサーバ１に搭載されたものであっても良いが、必要に応じて接続されるものであっても良い。つまり、出力部１６は、必須の構成要素ではない。 The output unit 16 includes a display such as an LCD. The output unit 16 displays various images or screens under the control of the CPU 11. The output unit 16 may be built into the AP server 1, or may be connected as needed. In other words, the output unit 16 is not a required component.

入力部１７は、例えばキーボード等の各種ハードウェア釦等を含む構成のものである。その構成には、マウス等のポインティングデバイスが１つ以上、含まれていても良い。操作者は、入力部１７を介して各種情報を入力することができる。この入力部１７も、ＡＰサーバ１に搭載されたものであっても良いが、必要に応じて接続されるものであっても良い。つまり、入力部１７も、必須の構成要素ではない。 The input unit 17 is configured to include various hardware buttons, such as a keyboard. It may also include one or more pointing devices, such as a mouse. The operator can input various information via the input unit 17. This input unit 17 may also be built into the AP server 1, but may also be connected as needed. In other words, the input unit 17 is not an essential component.

記憶部１８は、例えばハードディスク装置、或いはＳＳＤ（Solid State Drive）等の補助記憶装置である。データ量の大きいデータは、この記憶部１８に記憶される。
通信部１９は、ネットワークＮを介した他の情報処理装置との間の通信を可能にする。図３に示すサーバ２、プロジェクター３、及びサウンドシステムを構成する情報処理装置は全て、他の情報処理装置に相当する。 The storage unit 18 is, for example, a hard disk drive or an auxiliary storage device such as an SSD (Solid State Drive). Large amounts of data are stored in the storage unit 18.
The communication unit 19 enables communication with other information processing devices via the network N. The server 2, the projector 3, and the information processing device included in the sound system 4 shown in FIG. 3 all correspond to such other information processing devices.

ドライブ２０は、磁気ディスク、光ディスク、光磁気ディスク、或いは半導体メモリカード等のリムーバブルメディア２５が着脱可能な装置である。ドライブ２０は、例えば装着されたリムーバブルメディア２５からの情報の読み取り、及びリムーバブルメディア２５への情報の書き込みが可能である。それにより、リムーバブルメディア２５に記録されたプログラムは、ドライブ２０を介して、記憶部１８に記憶させることができる。また、ドライブ２０に装着されたリムーバブルメディア２５は、記憶部１８に記憶されている各種データのコピー先、或いは移動先として用いることができる。 The drive 20 is a device into which removable media 25, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory card, can be loaded and from which it can be removed. The drive 20 can, for example, read information from and write information to the loaded removable media 25. As a result, programs recorded on the removable media 25 can be stored in the storage unit 18 via the drive 20. In addition, the removable media 25 loaded in the drive 20 can be used as a copy or move destination for various data stored in the storage unit 18.

本サービス用に開発されたアプリケーション・プログラムは、リムーバブルメディア２５に記録させて配布しても良い。ネットワークＮ等を介して配布可能にしても良い。このことから、アプリケーション・プログラムを記録した記録媒体としては、ネットワークＮに直接的、或いは間接的に接続された情報処理装置に搭載、若しくは装着されたもの、或いは外部のアクセス可能な装置に搭載、若しくは装着されたものであっても良い。 Application programs developed for this service may be recorded on removable media 25 and distributed. They may also be available for distribution via a network N, etc. For this reason, the recording medium on which the application program is recorded may be one that is installed or attached to an information processing device that is directly or indirectly connected to the network N, or one that is installed or attached to an externally accessible device.

ＡＰサーバ１が備えるハードウェア資源は、アプリケーション・プログラムを含む各種プログラムによって制御される。その結果、ＡＰサーバ１は、顧客企業ＫＫに対し、本サービスを提供することができる。 The hardware resources of the AP server 1 are controlled by various programs, including application programs. As a result, the AP server 1 is able to provide this service to customer company KK.

図６は、本発明の情報処理装置の一実施形態に係るＡＰサーバ１上に実現される機能的構成の一例を示す機能ブロック図である。次に図６を参照しつつ、ＡＰサーバ１上に実現される機能的構成の例について詳細に説明する。
ここでは、混乱を避けるために、ゲーム場は一つのみと想定する。また、ゲーム場で行われるのはゲームであると想定する。動作対象であるアバター画像、及びロボットの何れも、検知対象の画像（以降、検知対象画像ＫＴＧとも表記する。）から生成される骨格情報を用いて動作させるものと想定する。 FIG. 6 is a functional block diagram showing an example of a functional configuration realized on the AP server 1 according to an embodiment of the information processing device of the present invention. Next, with reference to FIG. 6, an example of the functional configuration realized on the AP server 1 will be described in detail.
To avoid confusion, we assume here that there is only one game center. We also assume that what is played at the game center is a game. We further assume that both the avatar image and the robot, which are the operation targets, are operated using skeleton information generated from an image of the detection target (hereinafter also referred to as the detection target image KTG).

ＡＰサーバ１のＣＰＵ１１上には、機能的構成として、図６に示すように、設定部１１１、画像抽出部１１２、骨格検知部１１３、姿勢評価部１１４、テキスト変換部１１５、動画生成部１１６、姿勢制御部１１７、音声認識部１１８、会話処理部１１９、画像認識部１２０、及び顔認識部１２１が実現される。 As shown in FIG. 6, the CPU 11 of the AP server 1 has a functional configuration including a setting unit 111, an image extraction unit 112, a skeleton detection unit 113, a posture evaluation unit 114, a text conversion unit 115, a video generation unit 116, a posture control unit 117, a voice recognition unit 118, a conversation processing unit 119, an image recognition unit 120, and a face recognition unit 121.

これらは、本サービスの提供用に開発されたアプリケーション・プログラムを含む各種プログラムをＣＰＵ１１が実行することにより実現される。その結果として、或いは本サービスの提供のために、記憶部１８には、基準画像格納部１８１、キャラクタ画像格納部１８２、背景画像格納部１８３、辞書格納部１８４、及び制御情報格納部１８５が確保される。 These functions are realized by the CPU 11 executing various programs, including application programs developed for providing this service. As a result, or in order to provide this service, a reference image storage unit 181, a character image storage unit 182, a background image storage unit 183, a dictionary storage unit 184, and a control information storage unit 185 are reserved in the storage unit 18.

設定部１１１は、ゲーム場で行われるゲームの設定、設定されたゲームの開始、或いは終了、設定されたゲームを行ううえでの各種設定、等のための機能である。これら設定のための各種要求等は、例えばサーバ２から送信される。 The setting unit 111 has functions for setting up games played at the game center, starting or ending the set game, and various settings required for playing the set game. These various requests for setting up the game are transmitted, for example, from the server 2.

図３、及び図４では、このような設定を可能にする装置は省略している。この装置は、サーバ２と接続されており、サーバ２は、この装置からの要求を処理して、ＡＰサーバ１に送信すべき要求を生成し送信する。
サーバ２は、例えばゲームが開始される前からそのゲームが終了するまでの間、カメラＣによる動画撮影を行わせ、その動画撮影により得られる画像データ（動画情報）をＡＰサーバ１に送信し続ける。その間、マイク５からの音声信号を変換して得られる音声情報もサーバ２からＡＰサーバ１に送信される。 FIGS. 3 and 4 omit the device that enables such settings. This device is connected to the server 2, which processes requests from this device and generates and transmits the requests to be sent to the AP server 1.
For example, from before the game starts until the game ends, the server 2 causes the camera C to take moving images and continues to transmit image data (moving image information) obtained by taking moving images to the AP server 1. During this time, the server 2 also transmits audio information obtained by converting an audio signal from the microphone 5 to the AP server 1.

画像抽出部１１２は、サーバ２から受信した画像データが表す画像中の検知対象画像ＫＴＧの範囲を特定し、特定した範囲を抽出する。この範囲の抽出は、例えば画像データから、定めた時間間隔毎に静止画像データを生成することにより、生成した静止画像データを対象に行われる。このようにして抽出された画像データは、以降「検知対象画像データ」と表記する。 The image extraction unit 112 identifies the range of the detection target image KTG in the image represented by the image data received from the server 2, and extracts the identified range. This range is extracted, for example, by generating still image data from the image data at predetermined time intervals, and targeting the generated still image data. Image data extracted in this way will hereinafter be referred to as "detection target image data."
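
The sampling and extraction step described above can be sketched roughly as follows. This is a hedged illustration only: the text does not say how the range of the detection target image KTG is identified, so the bounding box is simply passed in, and the frame representation (rows of pixel values) and the function names `sample_stills` and `crop` are assumptions made for this sketch.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Frame = List[List[int]]           # hypothetical stand-in: rows of pixel values
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive


def sample_stills(
    frames: Iterable[Tuple[float, Frame]], interval: float
) -> Iterator[Tuple[float, Frame]]:
    """Yield one still frame per `interval` seconds from timestamped video frames."""
    next_t: Optional[float] = None
    for t, frame in frames:
        if next_t is None or t >= next_t:
            yield t, frame
            next_t = t + interval


def crop(frame: Frame, box: Box) -> Frame:
    """Extract the identified range; the result plays the role of detection target image data."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]
```

In practice the box would come from whatever person-detection method is chosen, and the interval trades responsiveness against processing load on the AP server 1.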

骨格検知部１１３は、例えば検知対象画像データ毎に、それが表す検知対象画像ＫＴＧから骨格情報を生成する（例えば、特許文献１参照）。
姿勢評価部１１４は、骨格検知部１１３が生成する骨格情報を用いて、検知対象画像ＫＴＧの姿勢、或いは姿勢を含む動きを評価する。この評価は、例えば図１におけるお手本との差分ＡＩ分析ＳＨ２によるものと基本的に同じものか、或いは図２に示すような初期想定姿勢、及びその後の動きによるものである。この評価を行うために、お手本データＴＤに相当する基準画像情報が基準画像格納部１８１に格納されている。それにより、評価には、対応する基準画像情報も用いられる。 The skeleton detection unit 113 generates skeleton information from the detection target image KTG represented by each detection target image data, for example (see, for example, Patent Document 1).
The posture evaluation unit 114 uses the skeleton information generated by the skeleton detection unit 113 to evaluate the posture of the detection target image KTG, or its movement including the posture. This evaluation is, for example, basically the same as the AI analysis SH2 of the difference from the model in FIG. 1, or is based on the initial assumed posture and the subsequent movement as shown in FIG. 2. To perform this evaluation, reference image information corresponding to the model data TD is stored in the reference image storage unit 181, and the corresponding reference image information is also used in the evaluation.

テキスト変換部１１５は、姿勢評価部１１４による評価結果を変換する形でテキスト情報を生成する。生成されるテキスト情報は、図１、及び図２を参照しての説明の通り、評価に実際に用いられた基準画像情報、及びその基準画像情報が表す動作内容と検知対象画像ＫＴＧが表す動作内容との間の相違に依存する。辞書格納部１８４には、想定するゲームに合わせ、テキスト情報等の生成のための辞書が格納されている。それにより、例えば辞書に登録された文章、或いは文節等の文字列か、或いは２つ以上の文字列の組み合わせが、テキスト情報として生成される。 The text conversion unit 115 generates text information by converting the evaluation results from the posture evaluation unit 114. As explained with reference to Figures 1 and 2, the generated text information depends on the reference image information actually used for the evaluation and the difference between the action content represented by the reference image information and the action content represented by the detection target image KTG. The dictionary storage unit 184 stores a dictionary for generating text information, etc., tailored to the intended game. As a result, for example, a sentence or phrase or other character string registered in the dictionary, or a combination of two or more character strings, is generated as text information.
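
The dictionary-based conversion described above might look like the following minimal sketch. The dictionary keys (a reference motion identifier paired with a difference label), the sample strings, and the function name `to_text` are hypothetical; the text only states that registered character strings, alone or combined, are output as text information.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary contents: (reference motion id, difference label)
# -> one or more registered character strings.
DICTIONARY: Dict[Tuple[str, str], List[str]] = {
    ("shuriken_throw", "match"): ["Nice throw!"],
    ("shuriken_throw", "arm_too_low"): ["Raise your arm higher.", "Then try again."],
}


def to_text(
    reference_id: str,
    difference: str,
    dictionary: Dict[Tuple[str, str], List[str]] = DICTIONARY,
) -> str:
    """Convert an evaluation result into text by combining registered strings."""
    parts = dictionary.get((reference_id, difference), [])
    return " ".join(parts)  # empty string when nothing is registered


print(to_text("shuriken_throw", "arm_too_low"))
```

Keeping one dictionary per game, as the text suggests, lets the same evaluation result produce different wording in different games.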

動画生成部１１６は、プロジェクター３に投影させる画像（画面）を生成し、生成した画像を画像データとして、通信部１９に送信させる。
動画生成部１１６は、検知対象ＫＴに聞かせることを想定する音楽、効果音、演出音、及び音声の放音のための情報も必要に応じて生成し、生成した情報を通信部１９に送信させる。それにより、各種音が放音されるなかで、検知対象ＫＴがゲーム等を行える環境を提供する。以降、各種音を放音させる情報は「音声情報」と総称する。 The video generation unit 116 generates an image (screen) to be projected by the projector 3, and causes the communication unit 19 to transmit the generated image as image data.
The video generation unit 116 also generates, as necessary, information for emitting music, sound effects, dramatic sounds, and voices that the detection target KT is intended to hear, and causes the communication unit 19 to transmit the generated information. This provides an environment in which the detection target KT can play games and the like while various sounds are being emitted. Hereinafter, information for emitting these various sounds will be collectively referred to as "audio information."

キャラクタ画像格納部１８２には、アバター画像を含む各種キャラクタ画像の生成のための各種画像情報が格納されている。その画像情報は、例えばアバター画像を構成するパーツ毎の３次元表現の画像情報である。背景画像格納部１８３には、各種キャラクタ画像以外の画像の生成のための画像情報が背景画像情報として格納されている。それにより、動画生成部１１６は、背景画像格納部１８３に格納されている各種画像情報から背景画像を生成するとともに、骨格情報、及びキャラクタ画像格納部１８２に格納されている各種画像情報を用いて各アバター画像を生成する。動画生成部１１６は、これらを生成した後、背景画像上に各アバター画像を配置する形で、１画面分の画像を生成する。 The character image storage unit 182 stores various image information for generating various character images, including avatar images. This image information is, for example, image information representing three-dimensional representations of each part that makes up an avatar image. The background image storage unit 183 stores image information for generating images other than the various character images as background image information. As a result, the video generation unit 116 generates a background image from the various image information stored in the background image storage unit 183, and generates each avatar image using skeletal information and the various image information stored in the character image storage unit 182. After generating these, the video generation unit 116 generates one screen's worth of images by placing each avatar image on the background image.

姿勢制御部１１７は、骨格情報を用いてロボットを動作させるための姿勢制御情報を生成する。この姿勢制御部１１７により、アバター画像の代わりに、或いはアバター画像とともに、ロボットを検知対象画像ＫＴＧの動きに合わせて動作させることが可能である。姿勢制御情報の生成は、制御情報格納部１８５に格納された別の制御情報を参照して行われる。この制御情報は、姿勢制御情報の生成のためのものであり、図１に示すパラメータＰは制御情報の一つとして制御情報格納部１８５に格納されている。 The posture control unit 117 generates posture control information for operating the robot using the skeleton information. This posture control unit 117 can operate the robot in accordance with the movement of the detection target image KTG, instead of or in addition to the avatar image. The posture control information is generated by referencing other control information stored in the control information storage unit 185. This control information is used to generate the posture control information, and the parameter P shown in Figure 1 is stored in the control information storage unit 185 as one piece of control information.

音声認識部１１８は、例えばサーバ２から受信した音声情報を用いた自然言語処理により音声認識を行う。この音声認識により、音声認識部１１８は、検知対象ＫＴの発言内容を表すテキスト情報である文字列を生成する。文字列の生成は、パターンマッチングにより、音声情報が表す各文字を特定することで行われる。この文字列が本実施形態における第１の文字列に相当する。
会話処理部１１９は、生成された文字列を検知対象ＫＴの発言内容と見なし、その発言内容に対して返すべき文字列を応答文として生成する。そのために、会話処理部１１９は、生成された文字列に対し、形態素解析、構文解析、意味解析、文脈解析、及び照応解析を順次、行い、その文字列が表す内容を特定する。会話処理部１１９は、この特定結果を用いて、例えばその特定結果に対応付けられた文字列を応答文として生成する。応答文として生成される文字列は本実施形態における第２の文字列に相当する。 The speech recognition unit 118 performs speech recognition by natural language processing using speech information received from the server 2, for example. Through this speech recognition, the speech recognition unit 118 generates a character string, which is text information representing the content of the speech of the detection target KT. The character string is generated by identifying each character represented by the speech information through pattern matching. This character string corresponds to a first character string in this embodiment.
The conversation processing unit 119 regards the generated character string as the content of a statement made by the detection target KT, and generates a character string to be returned in response to the content of the statement as a response sentence. To this end, the conversation processing unit 119 sequentially performs morphological analysis, syntactic analysis, semantic analysis, contextual analysis, and anaphora analysis on the generated character string to identify the content represented by the character string. Using the identification result, the conversation processing unit 119 generates, for example, a character string associated with the identification result as a response sentence. The character string generated as a response sentence corresponds to the second character string in this embodiment.
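
The last step above, returning a character string associated with the identification result, can be illustrated with a deliberately simplified sketch. Note the substitution: a plain keyword match stands in for the morphological, syntactic, semantic, contextual, and anaphora analyses, which are not reproduced here. The tables, labels, and function names are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical tables. KEYWORDS is a crude stand-in for the analysis stages;
# RESPONSES maps an identification result to a registered response string.
KEYWORDS: Dict[str, str] = {"hello": "greeting", "help": "ask_help"}
RESPONSES: Dict[str, str] = {
    "greeting": "Hello! Ready to play?",
    "ask_help": "Try moving your arm faster.",
}
FALLBACK = "Could you say that again?"


def identify(first_string: str) -> Optional[str]:
    """Identify what the first character string means (keyword match only)."""
    for word in first_string.lower().split():
        label = KEYWORDS.get(word.strip(".,!?"))
        if label is not None:
            return label
    return None


def respond(first_string: str) -> str:
    """Return the second character string (response sentence) for the first one."""
    label = identify(first_string)
    if label is None:
        return FALLBACK
    return RESPONSES.get(label, FALLBACK)


print(respond("I need help."))
```

A real conversation processing unit would replace `identify` with the full analysis chain; the lookup from identification result to response sentence is the part this sketch is meant to show.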

会話処理部１１９によって生成される応答文、及びテキスト変換部１１５によって生成されるテキスト情報を放音させるための音声情報は、例えば、動画生成部１１６により生成され、通信部１９から送信される。それにより、検知対象ＫＴは、プロジェクター３によりスクリーンＳＣに投影された画像（動画）を見るだけでなく、サウンドシステム４により放音される会話等のための音声を聞くことができる。
放音される音には、音声の他に、音楽、効果音、及び演出音等も含まれる。ここでは、特に断らない限り、音声情報は、それらの音、及び音声を含む各種音を放音させる情報の総称として用いる。
動作対象がロボットであった場合、音声情報の生成は、姿勢制御部１１７によって行われる。それにより、ロボットを動作させる場合であっても、検知対象ＫＴは、会話等の音声を聞くことができる。 The audio information for voicing the response sentences generated by the conversation processing unit 119 and the text information generated by the text conversion unit 115 is generated, for example, by the video generation unit 116 and transmitted from the communication unit 19. As a result, the detection target KT can not only see the image (video) projected onto the screen SC by the projector 3, but also hear the voices for conversation and the like emitted by the sound system 4.
The sounds that are emitted include not only voice but also music, sound effects, dramatic sounds, and the like. Here, unless otherwise specified, "audio information" is used as a general term for information that causes these various sounds, including voice, to be emitted.
If the operation target is a robot, the generation of audio information is performed by the posture control unit 117. As a result, even when the robot is operated, the detection target KT can hear audio such as conversation.

音声認識部１１８、及び会話処理部１１９は、検知対象ＫＴと、スクリーンＳＣ上に画像が投影されたキャラクタとの間の会話を可能とさせる。それにより、検知対象ＫＴは、キャラクタとの会話をしながら、ゲームを進めることができる。その会話により、検知対象ＫＴは、例えば考えていなかった動作、或いはゲームの進め方等の選択も可能になる。このことから、検知対象ＫＴにとっては、ゲームへの没入感をより高くさせられるようにするだけでなく、楽しみ方の幅も広がって、より高い満足度も得られるようになる。検知対象ＫＴが考えていなかった動作とは、自身の発言による他のアバター画像、或いはキャラクタ画像の出現（及び出現後の動作）、自身のアバター画像の発言内容に応じた動作、或いは変化、背景画像の変化、等を挙げることができる。
なお、音声認識部１１８、及び会話処理部１１９にはともに、周知の技術が採用されている。 The voice recognition unit 118 and the conversation processing unit 119 enable conversation between the detection target KT and a character whose image is projected on the screen SC. This allows the detection target KT to progress through the game while conversing with the character. Through this conversation, the detection target KT can, for example, make actions or choose a way to progress through the game that they had not considered. This not only allows the detection target KT to feel more immersed in the game, but also broadens the range of ways to enjoy it and achieve greater satisfaction. Examples of actions that the detection target KT had not considered include the appearance (and actions after appearance) of another avatar image or character image due to the detection target KT's own utterances, actions or changes in the detection target KT's own avatar image in response to the content of the utterances, changes in the background image, etc.
It should be noted that well-known technologies are used for both the voice recognition unit 118 and the conversation processing unit 119.

画像認識部１２０は、検知対象画像ＫＴＧ内に含まれる、或いは近傍に画像として存在する物体を認識する。この物体は、検知対象ＫＴとは別と見なすべきものである。その物体は、主に、検知対象ＫＴが身につけている物か、或いは手にしている物である。この物体画像を認識した後、物体画像の位置、姿勢、或いは大きさ等の変化に着目することにより、画像認識部１２０は、物体画像の動作内容を特定する。動作内容の特定は、周知の技術を画像認識部１２０に採用しても行うことができる。
その動作内容の特定により、例えば検知対象ＫＴの物体の扱いに応じて、物体画像を変化させることができる。具体的には、物体が剣と見なす剣状物体であった場合、例えば検知対象ＫＴとの位置関係に応じて、物体画像の表現を変更する、剣状物体を振る早さに応じて、物体画像に演出を加える、等のことを行っても良い。 The image recognition unit 120 recognizes an object that is included in the detection target image KTG or exists as an image nearby. This object should be considered separate from the detection target KT. The object is mainly an object worn by the detection target KT or held in the hand. After recognizing this object image, the image recognition unit 120 identifies the action content of the object image by focusing on changes in the position, posture, size, etc. of the object image. Identification of the action content can also be performed by employing well-known technology in the image recognition unit 120.
By specifying the action content, it is possible to change the object image depending on how the object of the detection target KT is handled, for example. Specifically, if the object is a sword-like object that is regarded as a sword, it is possible to change the representation of the object image depending on the positional relationship with the detection target KT, or to add effects to the object image depending on the speed at which the sword-like object is swung.

顔認識部１２１は、検知対象画像ＫＴＧから顔の部分を抽出し、検知対象ＫＴの性別、及び年齢等の認識を行う。この顔認識部１２１にも周知の技術を採用することができる。
なお、物体としては、加速度センサ、及び通信装置等を搭載させた専用のものであっても良い。そのようなセンサの検知結果を画像認識部１２０による画像認識に用いる場合、物体画像の動作制御をより高精度に行えるようになる。通信装置から物体の種別等を表す情報を例えばサーバ２を介して送信させることにより、ＡＰサーバ１側は検知対象ＫＴが持っている、或いは身につけている物体の種類を認識することができる。 The face recognition unit 121 extracts a face portion from the detection target image KTG and recognizes the gender, age, etc. of the detection target KT. Well-known technology can also be used for this face recognition unit 121.
The object may be a dedicated object equipped with an acceleration sensor, a communication device, etc. If the detection results of such a sensor are used for image recognition by the image recognition unit 120, the movement control of the object image can be performed with higher precision. By transmitting information indicating the type of object from the communication device via, for example, the server 2, the AP server 1 can recognize the type of object that the detection target KT is holding or wearing.

画像認識部１２０による物体の認識結果は、物体を持つアバター画像の生成に用いることができる。それにより、例えば検知対象ＫＴが刀形状の物体を手に持っていた場合、刀を持ったアバター画像を表示させるようにしても良い。或いは検知対象ＫＴが投げた物体があった場合には、投げられた物体が動く様子を動画で表現するようにしても良い。物体を表す画像は、例えばキャラクタ画像格納部１８２に格納し、物体の認識結果、或いは骨格情報により特定する検知対象画像ＫＴＧの動作内容等から選択させるようにすれば良い。この選択は、例えば画像認識部１２０に行わせることが考えられる。検知対象ＫＴが物体を持っているか否かの判定は、例えば検知対象画像ＫＴＧと物体画像とが重なっていると見なす箇所が存在し、且つそれらの大きさが変化する度合いに設定以上の変化が生じているか否かにより行うことができる。
例えば図２に示すように手裏剣を投げる場合、左手を動かす方向から、手裏剣を投げる方向、更には的の位置等も推定することができる。このような推定結果を手裏剣の動きに反映させても良い。 The object recognition results by the image recognition unit 120 can be used to generate an avatar image holding the object. For example, if the detection target KT is holding a sword-shaped object, an avatar image holding the sword may be displayed. Alternatively, if the detection target KT has thrown an object, the movement of the thrown object may be depicted in a video. The image representing the object may be stored, for example, in the character image storage unit 182, and selected from the object recognition results or the movement content of the detection target image KTG identified by skeletal information. This selection may be performed, for example, by the image recognition unit 120. Whether the detection target KT is holding an object can be determined, for example, by whether there is a location where the detection target image KTG and the object image are considered to overlap, and whether the degree of change in their sizes exceeds a preset value.
For example, when throwing a shuriken as shown in Figure 2, the direction in which the left hand is moved can be used to estimate the throwing direction of the shuriken and even the position of the target. Such estimation results can be reflected in the movement of the shuriken.
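
As a rough illustration of the estimation just described, the throwing direction can be approximated from successive hand positions taken from the skeleton information. This is a sketch under stated assumptions: two-dimensional coordinates, a straight-line approximation between the first and last tracked positions, and the function names `throw_direction` and `project` are all introduced here and are not taken from the embodiment.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def throw_direction(hand_track: Sequence[Point]) -> Point:
    """Unit vector from the first to the last tracked hand position."""
    (x0, y0), (x1, y1) = hand_track[0], hand_track[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("hand did not move; direction is undefined")
    return dx / norm, dy / norm


def project(release: Point, direction: Point, distance: float) -> Point:
    """Point reached by travelling `distance` from the release position."""
    return (
        release[0] + direction[0] * distance,
        release[1] + direction[1] * distance,
    )
```

Projecting from the release position along this direction gives a candidate target position, which could then drive the on-screen movement of the shuriken image.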

このような画面制御は、会話処理部１１９が生成した応答文に応じた処理を動画生成部１１６に行わせることで実現できる。ロボットの姿勢制御は、その応答文に応じた処理を姿勢制御部１１７に行わせることで実現できる。応答文に応じて動作させるものは、基本的に、応答文を発言させる想定のアバター画像、或いはロボットとなる。 This type of screen control can be achieved by having the video generation unit 116 perform processing in accordance with the response sentence generated by the conversation processing unit 119. The robot's posture control can be achieved by having the posture control unit 117 perform processing in accordance with the response sentence. What moves in accordance with the response sentence is basically an avatar image or robot that is expected to utter the response sentence.

顔認識部１２１による顔の認識結果は、画像として表示させるアバターの性別を含む種類、顔の表情等の決定に用いるようにしても良い。その場合、検知対象ＫＴにとって、より望ましいアバターの選択が可能になるとともに、アバターの顔の表現を検知対象ＫＴの顔の表情に応じて変化させるようなことも可能となる。
上記のような制御は何れも、検知対象ＫＴにとっては、ゲームへの没入感を更に高くさせられるようにし、且つ更に高い満足度も得られるように作用する。 The results of face recognition by the face recognition unit 121 may be used to determine the type (including gender) of the avatar to be displayed as an image, as well as the facial expression, etc. In this case, it becomes possible to select an avatar that is more desirable for the detection target KT, and it also becomes possible to change the facial expression of the avatar depending on the facial expression of the detection target KT.
All of the above-mentioned controls act to further enhance the sense of immersion in the game for the detection target KT, and also to provide a higher level of satisfaction.

上記のような機能構成では、骨格検知部１１３は、本実施形態における骨格情報生成手段に相当する。同様に、姿勢評価部１１４は解析手段、テキスト変換部はテキスト情報生成手段、動画生成部１１６、及び姿勢制御部１１７は動作制御手段、音声認識部１１８は音声認識手段、会話処理部１１９は会話処理手段、にそれぞれ相当する。ＣＰＵ１１自体は、画像データ取得手段、及び音声情報取得手段に相当する。 In the above functional configuration, the skeleton detection unit 113 corresponds to the skeleton information generation means in this embodiment. Similarly, the posture evaluation unit 114 corresponds to the analysis means, the text conversion unit 115 corresponds to the text information generation means, the video generation unit 116 and the posture control unit 117 correspond to the movement control means, the voice recognition unit 118 corresponds to the voice recognition means, and the conversation processing unit 119 corresponds to the conversation processing means. The CPU 11 itself corresponds to the image data acquisition means and the voice information acquisition means.

図７は、本実施形態に係る情報処理装置であるＡＰサーバ１に搭載のＣＰＵによって実行される動画表示処理の一例を示すフローチャートである。この動画表示処理は、例えばサーバ２等からゲームの開始が要求された場合に、検知対象ＫＴにそのゲームを行わせるために実行される処理である。この処理は、本サービスの提供用に開発されたアプリケーション・プログラムをＣＰＵ１１が実行することで実現される。最後に図７を参照し、この動画表示処理について詳細に説明する。 Figure 7 is a flowchart showing an example of video display processing executed by the CPU installed in the AP server 1, which is an information processing device according to this embodiment. This video display processing is executed, for example, when a request to start a game is made by the server 2 or the like, in order to have the detection target KT play the game. This processing is realized by the CPU 11 executing an application program developed for providing this service. Finally, this video display processing will be described in detail with reference to Figure 7.

図７にフローチャート例を示す動画表示処理は、混乱を避けるために、図４に示すような一つのゲーム場を想定したものとしている。説明は、この想定を前提に行う。それにより、カメラＣによる動画撮影による動画情報（画像データ）、及びマイク５による音声情報がサーバ２からＡＰサーバ１に送信されると想定し説明する。アバター画像は、検知対象ＫＴの動きに沿って動かすものと想定する。処理を実行する主体としてはＣＰＵ１１を想定する。 To avoid confusion, the video display process shown in the example flowchart in Figure 7 is based on the assumption of a single game center like the one shown in Figure 4. The explanation will be based on this assumption. Therefore, the explanation will be based on the assumption that video information (image data) captured by camera C and audio information from microphone 5 are sent from server 2 to AP server 1. The avatar image is assumed to move in accordance with the movement of the detection target KT. The CPU 11 is assumed to be the entity that executes the process.

先ず、ステップＳ１では、ＣＰＵ１１は、サーバ２から動画情報を受信したか否か判定する。動画情報を受信した場合、ステップＳ１の判定はＹＥＳとなってステップＳ２に移行する。動画情報を受信していない場合、ステップＳ１の判定はＮＯとなってステップＳ６に移行する。
ステップＳ２では、ＣＰＵ１１は、例えば受信した動画情報から静止画像を生成し、その静止画像上の検知対象画像ＫＴＧの抽出を行う。続くステップＳ３では、ＣＰＵ１１は、抽出した検知対象画像ＫＴＧから骨格情報を生成するための骨格検知処理を実行する。次に移行するステップＳ４では、ＣＰＵ１１は、生成した骨格情報から、検知対象画像ＫＴＧの姿勢の評価を行う。この姿勢の評価には、必要に応じて、既に生成済みの骨格情報も用いられる。 First, in step S1, the CPU 11 determines whether or not video information has been received from the server 2. If video information has been received, the determination in step S1 is YES, and the process proceeds to step S2. If video information has not been received, the determination in step S1 is NO, and the process proceeds to step S6.
In step S2, the CPU 11 generates a still image from, for example, the received video information and extracts a detection target image KTG from the still image. In the following step S3, the CPU 11 executes a skeleton detection process to generate skeleton information from the extracted detection target image KTG. In the next step S4, the CPU 11 evaluates the posture of the detection target image KTG from the generated skeleton information. This posture evaluation may also use previously generated skeleton information, if necessary.

ステップＳ４に続くステップＳ５では、ＣＰＵ１１は、姿勢の評価結果に応じてテキスト情報を生成することにより、その評価結果をテキスト化する。このテキスト化の後、ステップＳ６に移行する。
ステップＳ６では、ＣＰＵ１１は、サーバ２から音声情報を受信したか否か判定する。音声情報を受信した場合、ステップＳ６の判定はＹＥＳとなってステップＳ７に移行する。音声情報を受信していない場合、ステップＳ６の判定はＮＯとなってステップＳ１０に移行する。 In step S5 following step S4, the CPU 11 converts the evaluation result into text by generating text information according to the evaluation result of the posture. After converting the evaluation result into text, the process proceeds to step S6.
In step S6, the CPU 11 determines whether or not voice information has been received from the server 2. If voice information has been received, the determination in step S6 is YES, and the process proceeds to step S7. If voice information has not been received, the determination in step S6 is NO, and the process proceeds to step S10.

ステップＳ７では、ＣＰＵ１１は、受信した音声情報、及び過去に受信した音声情報を用いた音声認識処理を実行する。ここでの音声認識処理には、音声情報からの文字列の生成の他に、その文字列の内容の特定も含まれる。続くステップＳ８では、ＣＰＵ１１は、検知対象ＫＴによる１会話分の発言内容が特定できたか否か判定する。１会話分の発言内容、つまり発言開始から発言終了までの発言内容が特定できた場合、ステップＳ８の判定はＹＥＳとなってステップＳ９に移行する。１会話分の発言内容が特定できていない場合、ステップＳ８の判定はＮＯとなってステップＳ１０に移行する。
なお、発言終了の判定は、例えば検知対象ＫＴの音声が確認できない状態が設定時間、継続した場合に行われる。それにより、ステップＳ８の判定がＹＥＳとなった場合、ステップＳ７で最後に特定された内容が、検知対象ＫＴによる発言内容となる。
ステップＳ９では、ＣＰＵ１１（会話処理部１１９）は、特定した発言内容から、検知対象ＫＴに返すべき文字列を応答文として生成する。その生成後、ステップＳ１０に移行する。 In step S7, the CPU 11 executes a voice recognition process using the received voice information and previously received voice information. The voice recognition process here includes not only generating a character string from the voice information but also identifying the content of the character string. In the following step S8, the CPU 11 determines whether or not the content of one conversation made by the detection target KT has been identified. If the content of one conversation, that is, the content of the speech from the start to the end of the speech, has been identified, the determination in step S8 is YES and the process proceeds to step S9. If the content of one conversation has not been identified, the determination in step S8 is NO and the process proceeds to step S10.
The end of speech is determined, for example, when no voice from the detection target KT has been detected for a set time. Accordingly, when the determination in step S8 is YES, the content last identified in step S7 is taken as the content of the speech by the detection target KT.
In step S9, the CPU 11 (conversation processing unit 119) generates a character string to be returned to the detection target KT as a response sentence from the identified utterance content. After the generation, the process proceeds to step S10.
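The end-of-speech determination used in step S8 can be illustrated with a small sketch. It assumes that each recognition result from step S7 arrives with a timestamp; the class name `UtteranceTracker` and the parameter `silence_timeout` are hypothetical and stand in for the "set time" mentioned above.

```python
class UtteranceTracker:
    """Tracks the latest recognized content and reports it as one finished
    utterance once no voice has been detected for silence_timeout seconds."""

    def __init__(self, silence_timeout):
        self.silence_timeout = silence_timeout
        self.latest_text = None       # content last identified (step S7)
        self.last_voice_time = None   # time the voice was last detected

    def on_recognized(self, text, now):
        """Record a recognition result obtained at time `now`."""
        self.latest_text = text
        self.last_voice_time = now

    def poll(self, now):
        """Return the finished utterance (step S8: YES) or None (step S8: NO)."""
        if self.latest_text is None:
            return None
        if now - self.last_voice_time < self.silence_timeout:
            return None
        finished = self.latest_text
        self.latest_text = None
        self.last_voice_time = None
        return finished
```

A caller would invoke `poll` once per pass through the loop and hand any returned string to the response generation of step S9.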

ステップＳ１０では、ＣＰＵ１１は、ゲームが終了したか否か判定する。検知対象ＫＴ等によるゲームの終了指示、或いはゲーム内でのゲームを終了させるべきイベントの発生等である終了イベントが発生した場合、ステップＳ１０の判定はＹＥＳとなってステップＳ１１に移行する。終了イベントが発生していない場合、ステップＳ１０の判定はＮＯとなってステップＳ１２に移行する。 In step S10, the CPU 11 determines whether the game has ended. If an end event occurs, such as an instruction to end the game from the detection target KT or the like, or an event that should end the game has occurred within the game, the determination in step S10 is YES and the process proceeds to step S11. If an end event has not occurred, the determination in step S10 is NO and the process proceeds to step S12.

ステップＳ１１では、ＣＰＵ１１は、ゲームの終了を検知対象ＫＴに通知するための終了画面をサーバ２に送信する。この終了画面の送信後、動画表示処理が終了する。
ステップＳ１２では、ＣＰＵ１１は、ゲームの進行のために必要なその他の処理を実行する。
ゲームのうちには、ゲームの途中で利用者にアイテム等の選択肢を提示し、提示させた選択肢のうちの何れかを選択させるようになっているものがある。このような選択肢の提示、選択肢のうちからの選択は、ステップＳ１２でのその他の処理を実行することで実現される。 In step S11, the CPU 11 transmits an end screen for notifying the detection target KT of the end of the game to the server 2. After transmitting this end screen, the video display process ends.
In step S12, the CPU 11 executes other processes necessary for the progress of the game.
Some games present the user with options such as items during the game and allow the user to select one of the options. The presentation of options and the selection from among the options are realized by executing other processing in step S12.

続くステップＳ１３では、ＣＰＵ１１は、動画情報、及び音声情報の生成をそれぞれ行い、生成した動画情報、及び音声情報をそれぞれプロジェクター３、及びサウンドシステム４に送信させる。その後、上記ステップＳ１に戻る。
動画情報が表す画面上にアバター画像が存在する場合、そのアバター画像は、ステップＳ３の骨格検知処理の実行により生成される骨格情報を用いて生成されたものである。１画面分の画像は、各アバター画像の他に、各キャラクタ画像、及び背景画像を生成し、背景画像上に、各アバター画像、及び各キャラクタ画像を配置させる形で生成される。何れかのアバター画像、或いは何れかのキャラクタ画像の生成には、ステップＳ５で生成されたテキスト情報、或いはステップＳ９で生成された応答文が必要に応じて反映される。それにより、アバター画像の動作は、生成されるテキスト情報、及び応答文のうちの少なくとも一方によって変化させることが可能となっている。このような画面生成が行われるステップＳ１３で生成される動画情報は、例えば生成した画面に圧縮処理を行って生成される情報である。 In the following step S13, the CPU 11 generates moving image information and audio information, and transmits the generated moving image information and audio information to the projector 3 and sound system 4, respectively. Then, the process returns to step S1.
If an avatar image is present on the screen represented by the video information, the avatar image is generated using the skeleton information generated by executing the skeleton detection process in step S3. An image for one screen is generated by generating each avatar image, character image, and background image, and arranging each avatar image and character image on the background image. The text information generated in step S5 or the response sentence generated in step S9 is reflected in the generation of any avatar image or any character image, as necessary. This allows the movement of the avatar image to be changed by at least one of the generated text information and the response sentence. The video information generated in step S13, where such screen generation is performed, is information generated, for example, by compressing the generated screen.
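The control flow of steps S1 to S13 can be summarized in the following sketch. It only mirrors the branching described for Figure 7; every processing step is passed in as a placeholder callable, and the names (`video_display_loop`, the keys of `steps`, the shape of each `tick`) are assumptions made for illustration.

```python
def video_display_loop(ticks, steps):
    """Run one pass of the flow for each tick.

    ticks: iterable of dicts with optional 'video', 'audio' and 'end' keys
           (hypothetical stand-ins for data received from the server 2).
    steps: dict of callables standing in for the individual processes.
    """
    outputs = []
    for tick in ticks:
        skeleton = text_info = response = None

        video = tick.get("video")
        if video is not None:                                  # S1: video received?
            target_image = steps["extract"](video)             # S2: extract target image
            skeleton = steps["detect_skeleton"](target_image)  # S3: skeleton detection
            evaluation = steps["evaluate_posture"](skeleton)   # S4: posture evaluation
            text_info = steps["to_text"](evaluation)           # S5: convert to text

        audio = tick.get("audio")
        if audio is not None:                                  # S6: audio received?
            utterance = steps["recognize"](audio)              # S7: voice recognition
            if utterance is not None:                          # S8: one utterance done?
                response = steps["respond"](utterance)         # S9: response sentence

        if tick.get("end"):                                    # S10: end event?
            outputs.append(("end_screen",))                    # S11: send end screen
            break

        steps["other"](tick)                                   # S12: other game processing
        outputs.append(steps["render"](skeleton, text_info, response))  # S13
    return outputs
```

Because the text information and the response sentence are both handed to the rendering step, either one can change how the avatar image is drawn in that pass.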

ステップＳ５でテキスト情報が生成されるか、或いはステップＳ９で応答文が生成された場合、ステップＳ１３における上述の音声情報が生成される。
送信する音声情報は、例えばデジタルの音声信号である。デジタルの音声信号を音声信号として送信する場合、その音声信号は、例えばテキスト→音声変換を行うアプリケーション（以降「音声変換ソフト」）によって生成される。その場合、ステップＳ１３では、その変換の対象となるテキスト情報、或いは応答文を指定してのその音声変換ソフトの呼出、及び音声信号の取得が行われることになる。取得された音声信号の音声情報としての送付は、送信すべき音声信号が無くなるまで継続して行われる。この音声変換ソフトを搭載したサーバ２を介してサウンドシステム４による放音を行わせる場合、テキスト情報、及び応答文は音声情報として送信させることができる。このこともあり、音声情報は、特に限定されない。 If text information is generated in step S5, or if a response sentence is generated in step S9, the above-mentioned voice information is generated in step S13.
The audio information to be transmitted is, for example, a digital audio signal. When a digital audio signal is transmitted as the audio information, the audio signal is generated, for example, by an application that performs text-to-speech conversion (hereinafter referred to as "audio conversion software"). In this case, in step S13, the audio conversion software is called with the text information or response sentence to be converted specified, and the resulting audio signal is acquired. The acquired audio signal continues to be sent as audio information until no audio signal remains to be transmitted. When the sound system 4 emits the sound via a server 2 equipped with this audio conversion software, the text information and the response sentence can themselves be transmitted as the audio information. For this reason, the form of the audio information is not particularly limited.

動作対象がロボットであった場合、動画表示処理の代わりに、姿勢制御処理が実行される。その姿勢制御処理の流れは、基本的に、上述の動画表示処理と同じである。その動画表示処理との相違点は、動画情報の代わりに、姿勢制御情報を生成することである。このようなことから、姿勢制御処理についての詳細な説明は省略する。 If the object to be operated is a robot, posture control processing is executed instead of video display processing. The flow of this posture control processing is basically the same as the video display processing described above. The difference from this video display processing is that posture control information is generated instead of video information. For this reason, a detailed explanation of the posture control processing will be omitted.

本実施形態では、身体を動かして行うゲームに着目する形で説明を行ったが、本発明は、上記のように、取るべき姿勢が予め一つ以上、判明している姿勢要求動作を身体に行わせるものであれば、幅広く適用させることができる。
例えば手指を動かして行われる手話は、定められた姿勢、その姿勢からの定められた動き等により、話として伝えたい内容を表現するためのものである。このことから、骨格情報を用いて、手話で表される内容をテキスト情報に変換し、そのテキスト情報を出力させるようにしても良い。具体的には、例えば手話の動きを動作対象で再現させるとともに、その動作対象で再現する手話の内容を表示、及び音声のうちの少なくとも一方で出力させるようにしても良い。耳が不自由な人、及びそうでない人の両方が利用することを考慮し、その両方でテキスト情報を出力させるようにすることが望ましい。 In this embodiment, the explanation has focused on games played by moving the body, but as described above, the present invention can be widely applied to anything that has the body perform a posture-requiring action, that is, an action for which one or more postures to be taken are known in advance.
For example, sign language, which is performed by moving the hands and fingers, expresses the content to be communicated through predetermined postures and predetermined movements from those postures. Therefore, the content expressed in sign language may be converted into text information using skeletal information, and the text information may be output. Specifically, for example, the sign language movements may be reproduced by the movement target, and the content of the sign language reproduced by the movement target may be output as at least one of a display and a sound. Considering use by both people who are hearing-impaired and people who are not, it is desirable to output the text information in both forms, that is, as a display and as a sound.

スポーツでは、各種走り（短距離走、長距離走、ハードル競争等）、及びゴルフ等が姿勢要求動作に相当する。このようなスポーツでは、例えば骨格情報が表す動作を動作対象に行わせつつ、お手本画像ＯＧとの対比により、改善すべき点、或いは良い点、等を検知対象ＫＴに伝えるためのテキスト情報を生成し、そのテキスト情報を表示、及び音声のうちの少なくとも一方で出力させるようにしても良い。走っている検知対象ＫＴには、テキスト情報をタイムリに伝えることは困難であることから、動作対象の動作を表す動画情報をテキスト情報とともに保存し、再生可能にさせるようにすることが望ましい。 In sports, posture-requiring movements include various types of running (sprints, long-distance running, hurdle races, etc.) and golf. In such sports, for example, while the movement target performs the movement represented by the skeletal information, text information can be generated to inform the detection target KT of areas for improvement or good points by comparing it with a model image OG, and the text information can be output as at least one of a display and a sound. Since it is difficult to convey text information in a timely manner to a running detection target KT, it is desirable to save video information representing the movement of the movement target along with the text information and make it playable.
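As a rough illustration of generating text information from the comparison with a model, the sketch below compares joint angles computed from skeleton keypoints with model angles. It is an assumption-laden example: the use of joint angles, the tolerance of 10 degrees, the message wording, and the names `joint_angle` and `feedback` are all hypothetical and are not taken from the embodiment.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b formed by the 2D points a-b-c
    (for example hip-knee-ankle for a knee angle)."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang) % 360
    return 360 - ang if ang > 180 else ang

def feedback(measured, model, tolerance_deg=10.0):
    """Build one message per joint from the difference between the measured
    angle and the model angle (180 degrees means a fully straight joint)."""
    messages = []
    for joint, target in model.items():
        diff = measured[joint] - target
        if abs(diff) <= tolerance_deg:
            messages.append(f"{joint}: good")
        elif diff > 0:
            messages.append(f"{joint}: bend about {diff:.0f} degrees more")
        else:
            messages.append(f"{joint}: straighten about {-diff:.0f} degrees")
    return messages
```

The resulting messages could be stored together with the video of the movement target so that a runner can review them after the run, as suggested above.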

また、骨格情報を用いることで、検知対象ＫＴの状態をタイムリに特定することが可能となる。このことを利用し、検知対象ＫＴの今後の動きを推定し、その推定結果に応じたテキスト情報の生成、及びその出力を行うようにしても良い。それにより、例えば触れるべきでない展示品に触れようとする検知対象ＫＴへの警告、移動すべきでない場所への移動への警告、等を行うようにしても良い。店舗に来店するお客を検知対象ＫＴとし、検知対象ＫＴによる万引き等の犯罪行為の防止、或いは犯罪行為を行った検知対象ＫＴの摘発等に利用することも考えられる。何れであっても、より多くの情報が提供されるようになるとともに、利用する者における利便性はより高いものとなる。 Furthermore, by using skeletal information, it is possible to identify the state of the detection target KT in a timely manner. This can be used to estimate the future movements of the detection target KT, and to generate and output text information based on the estimation results. In this way, for example, a warning may be given to a detection target KT who is about to touch an exhibit that should not be touched, or who is moving toward a place that should not be entered. It is also conceivable to treat customers visiting a store as detection targets KT and to use this approach to prevent criminal acts such as shoplifting by a detection target KT, or to identify a detection target KT who has committed a criminal act. In any case, more information can be provided, and convenience for users is increased.

動きの推定では、骨格情報から検知対象画像ＫＴＧの移動速度、及び移動方向等を推定し、その推定結果を動作対象の動作に反映させるようにしても良い。そのような推定（予測）を行うことにより、レスポンスを改善させることが期待できる。検知対象ＫＴの危険行動の検知に適用する場合、その検知をより早く行えるようになる。 When estimating movement, the movement speed and direction of the detection target image KTG may be estimated from skeletal information, and the estimation results may be reflected in the movement of the moving target. By making such estimations (predictions), it is expected that response will be improved. When applied to detecting risky behavior of the detection target KT, this detection can be performed more quickly.
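A minimal sketch of this kind of estimation is shown below. It assumes that the 2D position of one reference joint (for example the center of the hips) is taken from two successive sets of skeleton information and that motion is roughly linear over a short interval; the function names and the choice of linear extrapolation are assumptions, not something the embodiment specifies.

```python
import math

def estimate_motion(p_prev, p_curr, dt):
    """Estimate velocity, speed and heading (degrees) of a reference joint
    from its positions in two successive frames dt seconds apart."""
    vx = (p_curr[0] - p_prev[0]) / dt
    vy = (p_curr[1] - p_prev[1]) / dt
    speed = math.hypot(vx, vy)
    heading = math.degrees(math.atan2(vy, vx))
    return (vx, vy), speed, heading

def predict_position(p_curr, velocity, lookahead):
    """Linearly extrapolate the position `lookahead` seconds ahead."""
    return (p_curr[0] + velocity[0] * lookahead,
            p_curr[1] + velocity[1] * lookahead)
```

Driving the movement target from the predicted position rather than the last observed one is what can hide processing latency, and the same prediction can be checked against a restricted area to raise a warning earlier.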

この動きの推定は、対戦ゲーム、或いはＲＰＧ（Role-Playing Game）等のゲームにおける対戦相手、或いは対峙させるキャラクタ等のアバターの動作制御に用いても良い。対戦相手には、骨格情報が表す動作に応じて、防御する際の動きを行わせることが考えられる。しかし、動きの推定を行うことにより、防御する際の動きをより自然なものとすることができる。対峙させるキャラクターでは、検知対象ＫＴの動きに応じた動作を行わせることにより、より自身に対峙しているという印象を検知対象ＫＴに与えられるようになる。動きの推定結果をそのキャラクターの動きの制御に用いた場合、そのキャラクターの動きをより自然なものとでき、実際に相手が対峙しているというより強い印象を検知対象ＫＴに与えられるようになる。何れにおいても、骨格情報を用いた姿勢評価によるテキスト情報の生成、生成したテキスト情報に応じた動作、自然会話を行わせるようにするのが望ましい。 This movement estimation may be used to control the movements of an avatar, such as an opponent in a fighting game or a game such as an RPG (Role-Playing Game), or a character that is confronted. It is conceivable that the opponent will perform defensive movements in accordance with the movements represented by the skeletal information. However, by estimating movement, it is possible to make the defensive movements more natural. By having the opposing character perform movements in accordance with the movements of the detection target KT, the detection target KT will be given the impression that they are confronting the character. When the movement estimation results are used to control the character's movement, the character's movements can be made more natural, and the detection target KT will be given a stronger impression that they are actually confronting an opponent. In either case, it is desirable to generate text information by posture evaluation using skeletal information, and to have the character perform movements and converse naturally in accordance with the generated text information.

また、本発明は、無人店舗、或いは無人受付におけるアバター、或いはロボットの動作制御にも適用させることができる。このアバター、及びロボットの何れも、検知対象ＫＴの動作に応じて、その動作とは異なる動作を行わせ、その検知対象ＫＴが望むサービスを提供するために用いられる。
その適用としては、例えば画像認識技術と組み合わせ、商品、或いは荷物等の有無を認識し、その認識結果に応じて、異なる対応をさせるようにしても良い。より具体的には、例えば検知対象ＫＴが何らかの商品を見せるような動作をした場合、その商品の種類を特定し、その特定結果を、以降の会話に反映させるようにしても良い。これは、商品を見せるような動作をした検知対象ＫＴには、他の色違いの商品、それとは同じカテゴリの別の商品、或いは在庫等について確認したいという意図がある可能性が高いためである。 The present invention can also be applied to the control of the movement of avatars or robots in unmanned stores or unmanned reception desks. Both the avatars and robots are made to perform different actions depending on the movement of the detection target KT, and are used to provide the service desired by the detection target KT.
As an application thereof, for example, it may be combined with image recognition technology to recognize the presence or absence of products or luggage, etc., and different responses may be taken depending on the recognition results. More specifically, for example, if the detection target KT makes a movement as if showing a product, the type of product may be identified, and the identification result may be reflected in the subsequent conversation. This is because the detection target KT who makes a movement as if showing a product is likely to have the intention of checking other products of different colors, other products in the same category, or inventory, etc.

上記のように、動作対象としては、検知対象ＫＴの動作に沿って動作をさせるもの（以降「第１動作対象」と表記）と、検知対象ＫＴの動作とは異なる動作をさせるもの（以降「第２動作対象」と表記）と、の２種類に大別される。そのため、生成されるテキスト情報も、第１動作対象を想定したものと、第２動作を想定したものと、の２種類に大別される。これは、お手本データＴＤ（基準画像情報）も同様である。 As mentioned above, movement targets can be broadly divided into two types: those that perform movements in accordance with the movements of the detection target KT (hereinafter referred to as "first movement targets"), and those that perform movements different from the movements of the detection target KT (hereinafter referred to as "second movement targets"). Therefore, the generated text information can also be broadly divided into two types: text information intended for the first movement target, and text information intended for the second movement target. The same is true for the model data TD (reference image information).

第１動作対象、及び第２動作対象は、ともに動作制御を行っても良いものである。その二つを同時に動作制御する場合、その二つの動作制御に用いることを想定し、テキスト情報を生成するようにしても良い。例えばお手本データＴＤが表す変身ポーズを検知対象ＫＴが取った場合、テキスト情報を生成して、第１動作対象には「変身」の音声出力、第２動作対象は動作の停止、或いは設定の姿勢への動作を行わせるようにしても良い。音声認識技術、及び自然会話（言語）技術も利用する場合、検知対象ＫＴによる「変身」の発言を条件に、検知対象ＫＴが変身ポーズを取ったか否か評価し、テキスト情報の生成を行うようにしても良い。この場合、テキスト情報は、第１動作対象における演出音の放音、第２動作対象に対する同様の動作制御に用いても良い。このようなこともあり、テキスト情報を生成する条件、生成するテキスト情報の内容、テキスト情報による動作対象の動作制御には、様々な変形が可能である。 Movement control may be performed on both the first movement target and the second movement target. When both are controlled simultaneously, the text information may be generated on the assumption that it will be used to control both. For example, when the detection target KT takes the transformation pose represented by the model data TD, text information may be generated so that the first movement target outputs the voice "Transform" and the second movement target stops moving or moves to a set posture. When voice recognition technology and natural conversation (language) technology are also used, the evaluation of whether the detection target KT has taken the transformation pose, and the generation of the text information, may be performed on the condition that the detection target KT has uttered "Transform". In this case, the text information may be used to emit a sound effect from the first movement target and to perform the same kind of movement control on the second movement target. Thus, various modifications are possible for the conditions for generating text information, the content of the generated text information, and the movement control of the movement targets using the text information.
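The transformation-pose example can be sketched as a small decision function. This is a hypothetical illustration only: the function name `plan_actions`, the command dictionaries, and the keyword check on the utterance are assumptions introduced here, and the pose evaluation itself is represented by a boolean.

```python
def plan_actions(pose_matched, utterance=None, require_utterance=False):
    """Decide the control commands for the first and second movement targets.

    pose_matched: result of evaluating the skeleton information against the
                  transformation pose of the model data (assumed boolean).
    require_utterance: when True, the pose is only evaluated on the
                  condition that the utterance contains "transform".
    Returns None when no text information is generated.
    """
    if require_utterance:
        if utterance is None or "transform" not in utterance.lower():
            return None  # condition for evaluating the pose is not met
    if not pose_matched:
        return None

    text_info = "transformation pose detected"
    if require_utterance:
        # The detection target already said the word, so the first movement
        # target only emits a sound effect.
        first = {"sound_effect": "transform"}
    else:
        # The first movement target speaks on behalf of the detection target.
        first = {"say": "Transform"}
    second = {"motion": "stop"}  # could instead be a move to a set posture
    return {"text_info": text_info, "first_target": first, "second_target": second}
```

Keeping the generated text information in the returned plan reflects the idea that a single piece of text information can drive both movement targets at once.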

１ＡＰサーバ、２サーバ、３プロジェクター、４サウンドシステム、５マイクロホン、１１ＣＰＵ、１８記憶部、１９通信部、１１１設定部、１１２画像抽出部、１１３骨格検知部、１１４姿勢評価部、１１５テキスト変換部、１１６動画生成部１１６、１１７姿勢制御部、１１８音声認識部、１１９会話処理部１１９、１２０画像認識部、１２１顔認識部、１８１基準画像格納部、１８２キャラクタ画像格納部、１８３背景画像格納部、１８４辞書格納部、１８５制御情報格納部、Ｃカメラ、Ｈ１骨格情報生成システム、Ｈ２姿勢評価システム、Ｈ３姿勢管理システム、ＫＫ顧客企業、ＫＴ検知対象、ＫＴＧ検知対象画像、ＯＧお手本画像、ＳＫサービス提供会社、ＳＹサービス提供システム 1 AP server, 2 Server, 3 Projector, 4 Sound system, 5 Microphone, 11 CPU, 18 Memory unit, 19 Communication unit, 111 Setting unit, 112 Image extraction unit, 113 Skeleton detection unit, 114 Posture evaluation unit, 115 Text conversion unit, 116 Video generation unit, 117 Posture control unit, 118 Voice recognition unit, 119 Conversation processing unit, 120 Image recognition unit, 121 Face recognition unit, 181 Reference image storage unit, 182 Character image storage unit, 183 Background image storage unit, 184 Dictionary storage unit, 185 Control information storage unit, C Camera, H1 Skeleton information generation system, H2 Posture evaluation system, H3 Posture management system, KK Client company, KT Detection target, KTG Detection target image, OG Model image, SK Service provider, SY Service provision system

Claims

image data acquisition means for acquiring image data obtained by capturing an image of the detection target;
a skeleton information generating means for generating skeleton information including position information of joints in the detection target image and relationship information indicating relationships between joints, based on the detection target image represented by the image data;
an analysis means for analyzing the details of the motion in the detection target image based on the skeleton information;
a text information generating means for generating text information based on the analysis result of the operation content;
an action control means for controlling at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, as an action target based on the text information;
and
the analyzing means analyzes the motion details using a difference between the skeletal information and information indicating a posture that the detection target should take;
the text information generating means generates the text information in accordance with the difference;
The action control means
performing a scale conversion process on the skeleton information to convert it in accordance with the ratio of each part of the action target, and causing the action target to reproduce the movement represented by the detection target using the skeleton information after the scale conversion process;
Based on the text information, the action target is made to perform an action different from the action represented by the image of the detection target by a speech action of the action target by at least one of displaying a message according to whether the posture of the detection target is appropriate or emitting a sound.
Information processing device.

a voice information acquisition means for acquiring voice information representing a voice emitted by the detection target;
a voice recognition means for performing voice recognition using the voice information and generating a first character string from the voice information;
a conversation processing means for generating a second character string to be a response to the first character string,
the action control means outputs the second character string as a speech content in the speech action of the action target;
The information processing device according to claim 1.

the text information generation means generates the text information representing the content of the sign language when the finger movement in the detection target image is identified as a movement for sign language as a result of the analysis of the content of the action,
the action control means causes the action target to perform a finger motion in accordance with the finger motion using the skeletal information, and causes the action target to output the text information.
3. The information processing device according to claim 1 or 2.

In the information processing device,
a skeletal information generating step of generating skeletal information including position information of joints in a detection target image and relationship information indicating relationships between joints, based on the detection target image represented by image data obtained by capturing an image of an assumed detection target;
an analyzing step of analyzing the details of the motion in the detection target image based on the skeleton information;
a text information generating step of generating text information based on the analysis result of the operation content;
a motion control step of controlling at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, as a motion target based on the text information;
Including,
In the analyzing step, the motion content is analyzed using a difference between the skeletal information and information indicating a posture that the detection target should take;
In the text information generating step, the text information is generated in accordance with the difference;
In the motion control step,
performing a scale conversion process on the skeleton information to convert it in accordance with the ratio of each part of the motion target, and causing the motion target to reproduce the movement represented by the detection target using the skeleton information after the scale conversion process;
Based on the text information, the motion target is made to perform an action different from the action represented by the image of the detection target by a speech action of the motion target by at least one of displaying a message according to whether the posture of the detection target is appropriate or emitting a sound.
A program that executes a process.