JP7628528B2

JP7628528B2 - Content Processing Device

Info

Publication number: JP7628528B2
Application number: JP2022193806A
Authority: JP
Inventors: 和夫五十嵐
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2022-12-02
Filing date: 2022-12-02
Publication date: 2025-02-10
Anticipated expiration: 2042-12-02
Also published as: JP2024080528A

Description

The present invention relates to a content processing device, and more particularly to a content processing device that makes it easy to provide content focused on one person within a group that is the subject of a moving image.

Content such as concerts and plays has conventionally been distributed over networks by live or on-demand distribution (see, for example, Non-Patent Document 1).

Much of the content distributed in recent years is intended not to feature a particular idol or actor as its main attraction, but to let each fan enjoy focusing on a favorite performer among many idols, actors, and other performers.

In addition, recent advances in VR (Virtual Reality) technology have made it possible to create virtual spaces that combine real idols and actors with avatars, increasing the need for image processing that focuses on a specific performer displayed within content.

Furthermore, a technique has been proposed for extracting multiple people from an image and tracking each of the extracted people (see, for example, Non-Patent Document 2).

Non-Patent Document 1: https://livr.jp/app-banner
Non-Patent Document 2: https://www.programmersought.com/article/17005126187/

However, many idol groups have a large number of members; for example, several dozen members may belong to the same group. In such cases, creating content that focuses on just one of several dozen idols dancing vigorously on stage takes time and money.

It is also difficult to automatically detect the one idol of interest within the content being played. For example, even if facial image recognition is used to identify each member of an idol group, in a large group the number of pixels in each person's facial region is quite small, making it difficult to obtain a clear image. Furthermore, when factors such as changes in image quality due to stage lighting and changes in costume design are taken into account, it is difficult to characterize each person's features.

In addition, the members of an idol group are usually of the same age group, sex, and nationality, and differ little in characteristics such as skin tone and body type, which again makes it difficult to identify each person automatically.

An object of one aspect of the present invention is to realize a technology that makes it easy to provide content focused on one person within a group that is the subject of a moving image.

A content processing device according to one aspect of the present invention includes: a content input receiving unit that receives input of content in which moving images are played together with audio; a section detection unit that detects, from the input content, a section in which audio similar to the audio corresponding to pre-provided audio data of a predetermined length is played; an object detection unit that detects an object from an image played immediately after the detected section; and a label assignment unit that assigns a label to the detected object.

In one aspect of the present invention, the label assignment unit may estimate a label relating to the name of the object by a calculation using model parameters obtained by machine learning.
Each aspect of the present invention may be realized by a computer. In this case, a program that realizes the system on a computer by causing the computer to operate as each unit (software element) of the above-mentioned system, and a computer-readable recording medium on which the program is recorded, also fall within the scope of the present invention.

According to one aspect of the present invention, it is possible to realize a technology that makes it easy to provide content focused on one person within a group that is the subject of a moving image.

FIG. 1 is a block diagram showing an example of the functional configuration of a content processing device according to a first embodiment.
FIG. 2 is a diagram showing an example of an image during the performance of song A by an idol group.
FIG. 3 is a diagram showing an image of one scene in an idol group's performance.
FIG. 4 is a diagram showing an example of a GUI displayed on the display of the content processing device when an object is designated.
FIG. 5 is a diagram showing an example of an image that has been subjected to image processing by the object image processing unit.
FIG. 6 is a diagram showing another example of an image that has been subjected to image processing by the object image processing unit.
FIG. 7 is a flowchart illustrating an example of the flow of content playback processing.
FIG. 8 is a block diagram showing an example of the functional configuration of a content processing device according to a second embodiment.
FIG. 9 is a diagram showing an example of a virtual space.
FIG. 10 is a diagram showing an example of the configuration of a computer that executes the instructions of a program, which is software that realizes each function of the content processing device.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the drawings.
First Embodiment
(Content processing device)
FIG. 1 is a block diagram showing an example of the functional configuration of a content processing device 10 according to a first embodiment. As shown in the figure, the content processing device 10 has a content input receiving unit 11, a content playback unit 12, and an operation input receiving unit 13.

The content input receiving unit 11 receives input of content data. The content data is supplied, for example, via a network such as the Internet. The content processing device may also access a wide area wireless communication network such as a 5G communication system, and the content data may be supplied via the wide area wireless communication network.

As an example, the content is MP4 format data recorded at a concert, a play, or the like. The content data includes moving images and audio, and also includes information such as subtitles as necessary. In other words, the content input to the content processing device is content in which moving images are played together with audio. For example, the audio may be a piece of music performed at a concert.

The description here mainly uses as an example the case where the content is MP4 format data recorded at an idol group concert. At an idol group concert, a group consisting of multiple idols (people) sings and dances to the music being performed. Singing and dancing to the music is also called a performance.

The content playback unit 12 plays back the content whose input has been received by the content input receiving unit 11. The played-back content is displayed, for example, on the display 50. The display 50 is also equipped with a speaker, and the audio of the played-back content is output from the speaker of the display 50.

The operation input receiving unit 13 receives a user's operation input to the content processing device 10. The operation input receiving unit 13 may be configured with a keyboard, a mouse, or the like. Alternatively, the display 50 may include a touch sensor, and the operation input receiving unit 13 may be configured to detect operation input on a GUI (Graphical User Interface) displayed on the display 50.

As shown in FIG. 1, the content processing device 10 also has a section detection unit 31, an object detection unit 32, a label assignment unit 33, and an object image processing unit 34.

(Section detection unit)
The section detection unit 31 detects, from the input content, a section in which audio similar to the audio corresponding to pre-provided audio data of a predetermined length is played.

The audio data provided in advance corresponds to audio whose duration is sufficiently short compared with the overall duration of the input content. For example, if the input content is an idol group concert, audio data corresponding to part of a song performed at the concert, such as the intro, an interlude, or the chorus, is provided in advance. However, the audio of the audio data is not limited to these, and may be, for example, any several consecutive bars of a song. The audio data may be stored, for example, in a storage unit (not shown) of the content processing device 10.

The audio may include instrumental sounds, singing, sound effects, and the like.

The section detection unit 31 calculates the similarity between two audio signals, for example, by comparing the features of the audio signal of the pre-provided audio data with the playback signal of the audio of the content being played. During playback of the content, the section detection unit 31 continuously extracts audio signals of the same duration as the audio data, compares the features of each extracted audio signal with the features of the audio signal of the audio data, and detects a section in which the similarity is equal to or greater than a threshold. The section detected here therefore has the same duration as the audio data.
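The comparison described above can be pictured as a sliding window over per-frame audio features. The following is a minimal sketch for illustration only and is not taken from the specification: the use of cosine similarity, one scalar feature per audio frame, the threshold value, and the names `cosine_similarity` and `detect_sections` are all assumptions introduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature sequences (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a silent window is treated as dissimilar
    return dot / (norm_a * norm_b)


def detect_sections(content_features, reference_features, threshold=0.9):
    """Return (start, end) index pairs, end exclusive, of matching windows.

    content_features: hypothetical per-frame audio features of the content.
    reference_features: features of the pre-provided audio data.
    Each window has the same length as the reference, so every detected
    section has the same duration as the reference audio.
    """
    n = len(reference_features)
    sections = []
    for start in range(len(content_features) - n + 1):
        window = content_features[start:start + n]
        if cosine_similarity(window, reference_features) >= threshold:
            sections.append((start, start + n))
    return sections


if __name__ == "__main__":
    content = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
    reference = [1.0, 2.0, 3.0]
    print(detect_sections(content, reference, threshold=0.99))
```

A practical system would likely use richer features (for example, spectral features per frame) and merge overlapping hits, but the window-and-threshold structure would stay the same.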

(Object detection unit)
The object detection unit 32 detects an object from an image played immediately after the detected section.

The image played immediately after the section detected by the section detection unit 31 is an image played later in time than that section. For example, it may be one or more of the images 1 to 30 frames after the final frame among the frames included in the moving image of that section. For example, if the input content is an idol group concert, the image played immediately after the detected section shows several idols (people), and the images of these people are detected as objects by the object detection unit 32.
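The "1 to 30 frames after the final frame" window can be expressed as a small index calculation. This is an illustrative sketch only; the function name `frames_after_section` and its parameters are hypothetical, and the clipping to the content length is an added assumption.

```python
def frames_after_section(section_last_frame, total_frames,
                         min_offset=1, max_offset=30):
    """Indices of the frames that follow a detected section.

    section_last_frame: index of the final frame of the detected section.
    total_frames: number of frames in the content (indices 0..total_frames-1).
    Returns the frames min_offset..max_offset after the final frame,
    clipped so that no index runs past the end of the content.
    """
    first = section_last_frame + min_offset
    last = min(section_last_frame + max_offset, total_frames - 1)
    return list(range(first, last + 1))


if __name__ == "__main__":
    candidates = frames_after_section(100, 1000)
    print(candidates[0], candidates[-1], len(candidates))
```

Object detection would then run on one or more of the returned frames.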

As an example, object detection can be performed using the graph cut method. In the graph cut method, the boundary of the region constituting the foreground object image to be cut out is first calculated from the color distributions and pixel color gradients of two kinds of images: a foreground object image containing the object to be cut out, and a background image. The image is then cut along the calculated boundary, and the desired foreground object image is extracted.

(Label assignment unit)
The label assignment unit 33 assigns a label to the object detected by the object detection unit 32.

The processing of the label assignment unit 33 is described here using the content of an idol group concert as an example. In this case, the detected object is a person, and the label assignment unit 33 assigns a label relating to the person's name. In other words, the person's name is assigned to the object as a label.

The label assignment unit 33 is configured, for example, by a CNN (convolutional neural network) or the like, and executes processing to estimate the label of an object in an input image. In doing so, the label assignment unit 33 estimates the label of the object using model parameters obtained by machine learning executed in advance.
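The final step of such an estimate, turning per-class scores into a single label, can be sketched as follows. This is a hedged illustration only: the network itself is not shown, the scores are assumed to be the raw outputs of some trained classifier, and the softmax step, the `min_confidence` cutoff, and the function names are assumptions that do not appear in the specification.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def estimate_label(scores, labels, min_confidence=0.5):
    """Pick the most probable label, or None if the model is not confident.

    scores: hypothetical raw outputs of a trained classifier, one per label.
    labels: candidate names (or identification numbers) in the same order.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None, probs[best]
    return labels[best], probs[best]


if __name__ == "__main__":
    members = ["AAA", "BBB", "CCC"]
    print(estimate_label([0.1, 3.0, 0.2], members))
```

Returning `None` for low-confidence results is one possible design choice; it leaves an object unlabeled rather than risking a wrong name.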

In this machine learning, for example, data in which correct labels have been assigned to objects detected from previously recorded images of an idol group concert is used as training data. In general, idol group concerts are often held frequently at the same venue. In this case, a large amount of video data of the same idol group recorded from the same viewpoint may exist.

Also, when an idol group holds concerts frequently, the same songs are usually performed at each concert. Idol groups often dance in a fixed formation with fixed choreography for each song. Therefore, for example, when song A is performed, the same few people stand in the front row and take the same pose immediately after the interlude.

FIG. 2 is a diagram showing an example of an image during the performance of song A by an idol group. The figure shows an image 101 displayed on the display 50; image 101 is the image immediately after the interlude of song A. Three people, person 111, person 112, and person 113, are displayed, each posing with the right hand raised.

During a song, the members of an idol group often move vigorously, but there are moments in the song when each member is almost still. The image in FIG. 2, for example, corresponds to such a scene. In a performance in which an idol group dances according to fixed choreography, the positional relationship of the members and the poses they take in such a scene are known in advance, which makes it relatively easy to estimate the labels of the objects in the image.

In FIG. 2, person 111, person 112, and person 113 are each detected as an object in the image. Data to which the names (name, nickname, etc.) of person 111, person 112, and person 113 are assigned as labels is used as training data in the machine learning of the model parameters used by the label assignment unit 33. If a large amount of video data of the same idol group recorded from the same viewpoint exists, a large amount of such training data can also be generated. Machine learning using a large amount of training data improves the accuracy of the estimation results of the label assignment unit 33.

In the example shown in FIG. 2, the three people take the same pose, but if, for example, there is a scene in which each person stands still in a different pose, the individual people become easier to distinguish. Machine learning using images of such scenes can further improve the accuracy of the estimation results of the label assignment unit 33.

(Object tracking)
Furthermore, the object detection unit 32 tracks the labeled object within the moving image of the content that is played later in time than the section detected by the section detection unit 31.

In object tracking, for example, the similarity between an object detected from the image of a certain frame and an object detected from the image one frame later is calculated. For example, the similarity of the objects is calculated by comparing features relating to the color, shape, and pattern of the objects. Objects whose similarity is equal to or greater than a threshold are then identified as the same object, and the object is thereby tracked.
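The frame-to-frame matching described above can be sketched as carrying each label forward to the most similar object in the next frame. This is an illustrative sketch under stated assumptions, not the specification's method: the feature vectors are hypothetical stand-ins for color, shape, and pattern features, and cosine similarity, greedy one-to-one matching, the threshold value, and the name `track_labels` are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def track_labels(prev_objects, curr_features, threshold=0.8):
    """Carry labels from the previous frame to objects in the current frame.

    prev_objects: dict mapping label -> feature vector in the previous frame.
    curr_features: list of feature vectors detected in the current frame.
    Returns a dict mapping label -> index into curr_features. Pairs below
    the threshold are never matched; matching is greedy, best pair first,
    and each label and each current object is used at most once.
    """
    candidates = []
    for label, prev_vec in prev_objects.items():
        for index, curr_vec in enumerate(curr_features):
            score = cosine_similarity(prev_vec, curr_vec)
            if score >= threshold:
                candidates.append((score, label, index))
    candidates.sort(reverse=True)

    assigned, used = {}, set()
    for score, label, index in candidates:
        if label in assigned or index in used:
            continue
        assigned[label] = index
        used.add(index)
    return assigned


if __name__ == "__main__":
    previous = {"AAA": [1.0, 0.0, 0.2], "BBB": [0.0, 1.0, 0.1]}
    current = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.2]]
    print(track_labels(previous, current))
```

A label that finds no match above the threshold is simply absent from the result, which corresponds to a person who has become hidden or left the frame.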

Because objects are tracked in this way, a person who has once been labeled continues to carry the same label in the moving images displayed during playback of the content.

Note that objects may be tracked more accurately by, for example, performing object detection multiple times during the playback of the same song. For example, first audio data corresponding to the audio played in the intro and second audio data corresponding to the audio played in the interlude of the same song may be provided in advance.

In this case, first audio data corresponding to a first audio, which is the audio of a section played earlier in the content, and second audio data corresponding to a second audio, which is the audio of a section played later in the content, are provided in advance. The section detection unit 31 then detects a first section in which audio similar to the first audio is played and a second section in which audio similar to the second audio is played.

In this case, the label assignment unit 33 estimates the name of the object detected by the object detection unit 32 from the image immediately after the first section and assigns a label, and assigns a label again to the object detected by the object detection unit 32 from the image immediately after the second section. The object detection unit 32 then tracks both the object labeled in correspondence with the first section and the object labeled in correspondence with the second section.

For example, if an idol group has many members, it is difficult to obtain an image that shows all of the members in a single scene. FIG. 3 is a diagram showing an image 131, which is displayed on the display 50 and shows one scene in a performance by an idol group consisting of eight members. FIG. 3 corresponds, for example, to the scene immediately after the intro of a certain song.

In the scene shown in FIG. 3, all eight members are almost still and take the same pose, but person 142 is positioned behind person 141, and person 144 is positioned behind person 143. It is difficult to detect and track person 142 and person 144 as objects from the image of the scene shown in FIG. 3. In addition, when the label assignment unit 33 estimates the names of the people detected in the scene shown in FIG. 3, it is difficult to estimate the names of person 142 and person 144 correctly.

On the other hand, in idol group performances there are often several scenes during a single song in which all members are almost still. Because the formation changes from scene to scene, scenes can be selected so that a person who is not visible in one scene is visible in another.

For example, if audio data corresponding to the audio played at the moments when the idol group changes formation within a song is provided in advance, the names of the people positioned in the front row of the stage can be estimated each time the formation changes. Likewise, each of the people positioned in the front row of the stage can be tracked each time the formation changes.

That is, multiple pieces of audio data (first audio data, second audio data, ...) may be provided in advance, and labels may be assigned each time in the first scene, second scene, ... played immediately after the section corresponding to each piece of audio data. For example, if people are detected and labeled in each of the multiple scenes, each person can be tracked more accurately during playback of the content.
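How labeling at several scenes complements a single scene can be sketched as a simple merge of per-scene results. This is an illustration only: the data layout (a dict of name to bounding box per scene), the `(x, y, width, height)` box format, and the name `merge_scene_labels` are assumptions introduced here.

```python
def merge_scene_labels(scenes):
    """Combine the labeling results of several scenes in playback order.

    scenes: list of dicts, one per labeled scene, each mapping a person's
    label to the bounding box (x, y, width, height) found in that scene.
    Returns a dict mapping label -> (scene_index, box) for the most recent
    scene in which that person was labeled. A person hidden in one scene is
    picked up as soon as any later scene shows them, and a person labeled
    again in a later scene has their position refreshed.
    """
    latest = {}
    for scene_index, detections in enumerate(scenes):
        for label, box in detections.items():
            latest[label] = (scene_index, box)
    return latest


if __name__ == "__main__":
    after_intro = {"AAA": (10, 20, 50, 120)}            # "BBB" is hidden here
    after_interlude = {"AAA": (200, 20, 50, 120),
                       "BBB": (60, 25, 48, 118)}
    print(merge_scene_labels([after_intro, after_interlude]))
```

Tracking would then restart from the refreshed positions after each labeled scene.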

Note that if, for example, the same audio is played repeatedly within a song, labels can still be assigned each time in the first scene, second scene, ... even when only one piece of audio data is provided.

(Object image processing unit)
The object image processing unit 34 applies predetermined image processing to the object designated by the user from among the multiple objects detected by the object detection unit 32. The object is designated, for example, on the basis of the label assigned to the object.

FIG. 4 is a diagram showing an example of a GUI displayed on the display 50 of the content processing device 10 when an object is designated. In this example, "Idol group XXX 8th generation member list" is displayed. Here, "XXX" is the name of the idol group performing in the concert of the content being played. This idol group consists of, for example, 20 members, and the member list is updated each time at least one member is replaced. The member list at the time the group was formed is the first generation, and thereafter the list is updated to the second generation, third generation, ... each time at least one member is replaced.

In the member list shown in FIG. 4, the leftmost column is headed "Members" and gives the name of each of the 20 members of idol group XXX. Here, the names are shown as "AAA," "BBB," "CCC," .... In practice, each person's name is associated with an identification number, and each identification number corresponds to a label assigned by the label assignment unit 33.

In the member list shown in FIG. 4, the center column is headed "Profile" and gives the profile of each member.

In the member list shown in FIG. 4, the rightmost column is headed "Attention," and the user makes the designation in this column. For example, the user inputs an operation for designating a person via the operation input receiving unit 13 and designates, in the member list shown in FIG. 4, the person the user wants to focus on. In this example, the user is focusing on the member "CCC," and a star indicating that this person ("CCC") has been designated is displayed in the "Attention" column.

The object image processing unit 34 applies predetermined image processing to the image of the person the user is focusing on within the images of the content being played. FIG. 5 is a diagram showing an image 161, which is displayed on the display 50 and is an example of an image that has been subjected to image processing by the object image processing unit 34.

Here, it is assumed, for example, that the user has designated person 113 as the person of interest using the GUI described above with reference to FIG. 4. In the example of FIG. 5, a mark 171 (in this example, a heart-shaped mark) is superimposed and displayed near person 113. As one example of image processing by the object image processing unit 34, a predetermined image (in this example, mark 171) is superimposed and displayed within a predetermined range near the designated object, as shown in FIG. 5.
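Placing a mark "within a predetermined range near the designated object" amounts to computing an overlay position from the object's bounding box. The sketch below is illustrative only: placing the mark centered above the box, the `gap` value, the `(x, y, width, height)` box format, and the name `mark_position` are assumptions, and the actual drawing step is left out.

```python
def mark_position(box, mark_size, frame_size, gap=4):
    """Top-left pixel position for a mark drawn just above an object.

    box: (x, y, width, height) of the designated object.
    mark_size: (width, height) of the mark image.
    frame_size: (width, height) of the video frame.
    The mark is centered horizontally over the box, placed `gap` pixels
    above it, and clamped so that it stays fully inside the frame.
    """
    x, y, w, h = box
    mark_w, mark_h = mark_size
    frame_w, frame_h = frame_size

    mark_x = x + (w - mark_w) // 2
    mark_y = y - mark_h - gap

    mark_x = max(0, min(mark_x, frame_w - mark_w))
    mark_y = max(0, min(mark_y, frame_h - mark_h))
    return mark_x, mark_y


if __name__ == "__main__":
    print(mark_position((100, 200, 60, 160), (20, 20), (1920, 1080)))
```

Recomputing this position for every frame from the tracked bounding box keeps the mark next to the person as they move.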

FIG. 6 is a diagram showing another example of an image that has been subjected to image processing by the object image processing unit 34. In the example of FIG. 6, an image 191 in which person 113 is enlarged is displayed on the display 50. In other words, only person 113, designated by the user, is enlarged and displayed on the display 50. As one example of image processing by the object image processing unit 34, the designated object is enlarged and displayed, as shown in FIG. 6.
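One way to realize the enlarged display is to crop a region around the designated object and scale that crop to fill the screen. The following sketch computes only the crop rectangle and is an illustration under assumptions: the `margin` value, the choice to keep the frame's aspect ratio, the `(x, y, width, height)` box format, and the name `zoom_crop` are not from the specification.

```python
def zoom_crop(box, frame_size, margin=0.2):
    """Crop rectangle (left, top, width, height) for an enlarged view.

    box: (x, y, width, height) of the designated object.
    frame_size: (width, height) of the video frame.
    The crop contains the box plus a margin on every side, has the same
    aspect ratio as the frame so it can be scaled to fill the display
    without distortion, and is clamped to lie inside the frame.
    """
    x, y, w, h = box
    frame_w, frame_h = frame_size

    # Grow the box by the margin on each side.
    crop_w = w * (1 + 2 * margin)
    crop_h = h * (1 + 2 * margin)

    # Widen or heighten the crop until it matches the frame aspect ratio.
    aspect = frame_w / frame_h
    if crop_w / crop_h < aspect:
        crop_w = crop_h * aspect
    else:
        crop_h = crop_w / aspect

    # Shrink uniformly if the crop would be larger than the frame.
    scale = min(1.0, frame_w / crop_w, frame_h / crop_h)
    crop_w *= scale
    crop_h *= scale

    # Center on the object, then clamp to the frame.
    center_x = x + w / 2
    center_y = y + h / 2
    left = min(max(center_x - crop_w / 2, 0), frame_w - crop_w)
    top = min(max(center_y - crop_h / 2, 0), frame_h - crop_h)
    return round(left), round(top), round(crop_w), round(crop_h)


if __name__ == "__main__":
    print(zoom_crop((900, 400, 100, 200), (1920, 1080)))
```

Applied to each frame with the tracked bounding box, this yields a view that follows the designated person.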

Furthermore, a mark such as that shown in FIG. 5 may be superimposed on an enlarged image such as that shown in FIG. 6.

Note that the image processing by the object image processing unit 34 is applied continuously to the image of each frame constituting the moving image of the content being played. For example, when the mark 171 is superimposed as shown in FIG. 5, the mark 171 is displayed near person 113, who is dancing to the music, throughout playback of the content. Likewise, when the object is enlarged as shown in FIG. 6, an image showing almost only person 113 is displayed on the display throughout playback of the content.

次に、図７のフローチャートを参照して、コンテンツ処理装置１０によるコンテンツ再生処理の流れの例について説明する。この処理は、コンテンツ入力受付部１１により、コンテンツのデータの入力が受け付けられた後で実行される。ここでは、ライブ配信されたアイドルグループのコンサートのコンテンツのデータが入力された場合の例について説明する。 Next, an example of the flow of content playback processing by the content processing device 10 will be described with reference to the flowchart in FIG. 7. This processing is executed after the content input receiving unit 11 receives input of content data. Here, an example will be described in which content data of a live-streamed concert by an idol group is input.

ステップＳ１０１において、コンテンツ再生部１２は、入力されたコンテンツを再生する。 In step S101, the content playback unit 12 plays the input content.

ステップＳ１０２において、区間検出部３１は、予め与えられた音声データに対応する音声と、再生されるコンテンツの音声との類似度を算出する。なお、上述したように、コンサートで演奏された楽曲の一部であって、例えば、イントロ、間奏、サビなどに対応する音声の音声データが予め与えられている。区間検出部３１は、例えば、予め与えられた音声データの音声信号の特徴量と、再生中のコンテンツの音声の再生信号を比較することで、２つの音声信号の類似度を算出する。 In step S102, the section detection unit 31 calculates the similarity between the sound corresponding to the pre-given audio data and the sound of the content being played. As described above, audio data of parts of a piece of music played at a concert, such as an intro, interlude, chorus, etc., is given in advance. The section detection unit 31 calculates the similarity between the two audio signals, for example, by comparing the features of the audio signal of the pre-given audio data with the playback signal of the audio of the content being played.

ステップＳ１０３において、区間検出部３１は、類似する区間が検出されたか否かを判定する。上述したように、区間検出部３１は、コンテンツの再生中に、音声データと同じ時間的長さの音声信号を連続して抽出し、抽出した音声信号の特徴量と音声データの音声信号の特徴量とを比較する。そして、区間検出部３１は、例えば、類似度が閾値以上となる区間を、類似する区間として検出する。 In step S103, the section detection unit 31 determines whether a similar section has been detected. As described above, during playback of the content, the section detection unit 31 continuously extracts audio signals of the same time length as the audio data and compares the features of each extracted audio signal with the features of the audio signal of the audio data. The section detection unit 31 then detects, for example, a section in which the similarity is equal to or greater than a threshold value as a similar section.

また、複数の音声データが予め与えられている場合、複数の音声データの音声信号の特徴量のそれぞれと、再生中のコンテンツの音声の音声信号を比較することで、音声信号の類似度が算出される。このようにして、例えば、第１の音声データの音声と類似する区間、または、第２の音声データの音声と類似する区間、・・・が類似する区間として検出されることになる。 In addition, when multiple pieces of audio data are provided in advance, the similarity of the audio signals is calculated by comparing each of the features of the audio signals of the multiple pieces of audio data with the audio signal of the audio of the content being played. In this way, for example, a section similar to the audio of the first audio data, or a section similar to the audio of the second audio data, etc. are detected as similar sections.
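The threshold-based section detection of steps S102 and S103 can be sketched as follows. This is a minimal illustration, not the method of the embodiment: it assumes that each audio signal has already been reduced to a sequence of numeric feature values and that cosine similarity serves as the similarity measure, neither of which is specified in the description. The names `cosine_similarity` and `detect_similar_sections` and the default threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature sequences (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_similar_sections(
    content: Sequence[float],
    references: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed value; the description only says "a threshold"
) -> List[Tuple[int, int, int]]:
    """Slide a window of each reference's length over the content features and
    return (reference_index, start, end) for every window whose similarity
    is equal to or greater than the threshold."""
    hits: List[Tuple[int, int, int]] = []
    for ref_index, ref in enumerate(references):
        length = len(ref)
        for start in range(len(content) - length + 1):
            window = content[start:start + length]
            if cosine_similarity(window, ref) >= threshold:
                hits.append((ref_index, start, start + length))
    return hits
```

With several reference clips, each hit carries the index of the clip it matched, which corresponds to distinguishing a section similar to the first audio data from one similar to the second.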

ステップＳ１０３において、類似する区間が検出されたと判定された場合、ステップＳ１０４の処理が実行される。 If it is determined in step S103 that a similar section has been detected, the process proceeds to step S104.

ステップＳ１０４において、オブジェクト検出部３２は、検出された前記区間の直後に再生される画像からオブジェクトを検出する。このとき、例えば、検出された区間の直後に再生される画像に写った一人または複数人の人物の画像が、オブジェクト検出部３２によりオブジェクトとして検出される。 In step S104, the object detection unit 32 detects an object from the image played immediately after the detected section. At this time, for example, an image of one or more people appearing in the image played immediately after the detected section is detected as an object by the object detection unit 32.

ステップＳ１０５において、ラベル付与部３３は、オブジェクトの名称を推定する。このとき、ステップＳ１０４の処理でオブジェクトとして検出された人物の名称が推定される。上述したように、ラベル付与部３３は、例えば、ＣＮＮなどによって構成され、入力された画像のオブジェクトのラベルを推定する処理を実行する。この際、ラベル付与部３３は、予め実行された機械学習により得られたモデルパラメータを用いてオブジェクトの名称を推定する。 In step S105, the label assignment unit 33 estimates the name of the object. At this time, the name of the person detected as the object in the processing of step S104 is estimated. As described above, the label assignment unit 33 is configured, for example, by a CNN, and executes processing to estimate the label of the object of the input image. At this time, the label assignment unit 33 estimates the name of the object using model parameters obtained by machine learning executed in advance.

ステップＳ１０６において、ラベル付与部３３は、ステップＳ１０４の処理により検出されたオブジェクトに、ステップＳ１０５の処理により推定された名称をラベルとして付与する。すなわち、人物の名称がラベルとしてオブジェクトに付与される。 In step S106, the label assignment unit 33 assigns the name estimated in step S105 as a label to the object detected in step S104. In other words, the name of the person is assigned to the object as a label.

ステップＳ１０７において、オブジェクト検出部３２は、ステップＳ１０６の処理でラベルが付与されたオブジェクトを追跡する。このとき、例えば、あるフレームの画像から検出されたオブジェクトと、１フレーム後の画像から検出されたオブジェクトとの類似度が算出され、閾値以上の類似度を有するオブジェクトを同一のオブジェクトとして同定することによりオブジェクトの追跡が行われる。 In step S107, the object detection unit 32 tracks the object to which the label was assigned in the process of step S106. At this time, for example, the similarity between an object detected from an image of a certain frame and an object detected from an image one frame later is calculated, and objects having a similarity equal to or greater than a threshold value are identified as the same object, thereby tracking the object.
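The frame-to-frame identification in step S107 can be sketched as follows. This is a minimal illustration under stated assumptions: objects are represented by bounding boxes, and the overlap ratio (intersection over union) of two boxes stands in for the "similarity" between objects, which the description does not define. The names `iou` and `track_step` and the default threshold are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_step(
    tracked: Dict[str, Box],
    detections: List[Box],
    threshold: float = 0.3,  # assumed value
) -> Dict[str, Box]:
    """Carry each label from the previous frame to the most similar unclaimed
    detection in the next frame, provided the similarity reaches the threshold.
    Labels with no sufficiently similar detection are dropped."""
    updated: Dict[str, Box] = {}
    used: Set[int] = set()
    for label, prev_box in tracked.items():
        best_index, best_score = -1, threshold
        for i, det in enumerate(detections):
            if i in used:
                continue
            score = iou(prev_box, det)
            if score >= best_score:
                best_index, best_score = i, score
        if best_index >= 0:
            updated[label] = detections[best_index]
            used.add(best_index)
    return updated
```

The greedy matching here is chosen for brevity; a label that is dropped would be reassigned the next time steps S104 to S106 run.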

ステップＳ１０８において、オブジェクト画像処理部３４は、オブジェクト検出部３２が検出した複数のオブジェクトのうち、ユーザが指定したオブジェクトに所定の画像処理を施す。 In step S108, the object image processing unit 34 applies a predetermined image processing to the object specified by the user from among the multiple objects detected by the object detection unit 32.

これにより、例えば、図５を参照して上述したように、指定したオブジェクトの近傍にマークが表示される。あるいは、例えば、図６を参照して上述したように、ユーザが指定した人物のみが拡大されて表示される。さらに、図６のように拡大された画像において、図５に示されるようなマークが重畳表示されるようにしてもよい。 As a result, for example, as described above with reference to FIG. 5, a mark is displayed near the specified object. Alternatively, for example, as described above with reference to FIG. 6, only the person specified by the user is enlarged and displayed. Furthermore, a mark such as that shown in FIG. 5 may be superimposed on the image enlarged as in FIG. 6.
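The geometry behind the two kinds of image processing described above can be sketched as follows. This is a minimal illustration, assuming the specified object is given as a pixel bounding box; the functions `crop_region` and `mark_anchor`, the margin, and the offset are hypothetical and are not taken from the description.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def crop_region(box: Box, frame_w: int, frame_h: int, margin: float = 0.2) -> Box:
    """Region to enlarge: the object's box grown by `margin` of its size on
    each side and clamped to the frame."""
    x1, y1, x2, y2 = box
    pad_x = int((x2 - x1) * margin)
    pad_y = int((y2 - y1) * margin)
    return (
        max(0, x1 - pad_x),
        max(0, y1 - pad_y),
        min(frame_w, x2 + pad_x),
        min(frame_h, y2 + pad_y),
    )


def mark_anchor(box: Box, offset: int = 10) -> Tuple[int, int]:
    """Position for a superimposed mark: horizontally centered on the box,
    `offset` pixels above its top edge, clamped to the top of the frame."""
    x1, y1, x2, _ = box
    return ((x1 + x2) // 2, max(0, y1 - offset))
```

Recomputing both values for every frame from the tracked box keeps the mark, or the enlarged view, on the specified person throughout playback.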

なお、オブジェクト画像処理部３４による画像処理は、再生されるコンテンツの動画像を構成する各フレームの画像に連続して施される。 The image processing by the object image processing unit 34 is performed continuously on each frame of images that make up the moving image of the content being played.

なお、オブジェクトの指定が行われていない場合、ステップＳ１０８の処理は、スキップされる。 If no object is specified, step S108 is skipped.

また、ステップＳ１０３において、類似する区間が検出されなかったと判定された場合、ステップＳ１０４乃至ステップＳ１０６の処理は、スキップされ、ステップＳ１０７の処理が実行される。まだラベルが付与されていない場合は、ステップＳ１０７の処理およびステップＳ１０８の処理も実質的に実行できないので、これらの処理もスキップされる。 If it is determined in step S103 that no similar section has been detected, steps S104 to S106 are skipped and step S107 is executed. If no label has yet been assigned, steps S107 and S108 cannot be executed, and so these steps are also skipped.

ステップＳ１０９において、コンテンツ再生部１２は、コンテンツを最後まで再生したか否かを判定する。ステップＳ１０９において、まだ最後まで再生されていないと判定された場合、処理は、ステップＳ１０２に戻り、区間検出部３１により、抽出した音声信号の特徴量と音声データの音声信号の特徴量とが比較される。そして、区間検出部３１は、類似度が閾値以上となる区間を、類似する区間として検出する。 In step S109, the content playback unit 12 determines whether the content has been played to the end. If it is determined in step S109 that the content has not yet been played to the end, the process returns to step S102, and the section detection unit 31 compares the features of the extracted audio signal with the features of the audio signal of the audio data. The section detection unit 31 then detects a section whose similarity is equal to or greater than a threshold value as a similar section.

このように、ステップＳ１０２乃至ステップＳ１０９の処理が繰り返し実行されることにより、予め与えられた音声データに対応する区間が１回または複数回検出され（ステップＳ１０３）、検出された区間の直後の画像からオブジェクトが検出される（ステップＳ１０４）。そして、検出されたオブジェクトの名称が推定され（ステップＳ１０５）、ラベルが付与される（ステップＳ１０６）。一度、ラベルが付与されたオブジェクトは、再生されるコンテンツの画像の中で追跡され（ステップＳ１０７）、指定されたオブジェクトには、画像処理が施される（ステップＳ１０８）。 In this way, by repeatedly executing the processes of steps S102 to S109, a section corresponding to pre-given audio data is detected once or multiple times (step S103), and an object is detected from the image immediately following the detected section (step S104). Then, the name of the detected object is estimated (step S105), and a label is assigned (step S106). Once a label has been assigned to the object, it is tracked within the image of the content being played back (step S107), and image processing is performed on the designated object (step S108).
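The branching of one pass through the loop of steps S102 to S109 can be sketched as follows. This is a minimal illustration of the control flow only: the function `steps_for_pass` is hypothetical, and it assumes that steps S104 to S106 always yield at least one labeled object when a similar section is detected.

```python
from typing import List, Optional


def steps_for_pass(
    section_detected: bool,
    has_labels: bool,
    selected: Optional[str],
) -> List[str]:
    """Return the steps of FIG. 7 that run during one pass of the loop."""
    steps = ["S102", "S103"]
    if section_detected:
        # Detect objects, estimate names, and assign labels.
        steps += ["S104", "S105", "S106"]
        has_labels = True  # assumption: labeling succeeds for at least one object
    if has_labels:
        steps.append("S107")  # track the labeled objects
        if selected is not None:
            steps.append("S108")  # process only the object the user specified
    steps.append("S109")  # check whether playback has reached the end
    return steps
```

For example, before any section has been detected the pass reduces to S102, S103, and S109, which matches the skipping behavior described for the case where no label has yet been assigned.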

ステップＳ１０４乃至ステップＳ１０６の処理は、予め与えられた音声データの音声と類似した区間が検出される都度、実行されるので、例えば、第１の場面では、検出できなかった人物を、第２の場面で検出することも可能となる。また、第１の場面において人物の名称の推定が誤っていた場合でも、第２の場面で正しいラベルに修正されるようにすることも可能となる。さらに、第１の場面の後、誤って異なるオブジェクト（人物）が追跡されてしまった場合でも、第２の場面以後は、正しいオブジェクト（人物）が追跡されるようにすることも可能となる。 The processing of steps S104 to S106 is performed each time a section similar to the sound of pre-given audio data is detected, so that, for example, a person who could not be detected in the first scene can be detected in the second scene. Also, even if the estimation of the person's name is incorrect in the first scene, it is possible to correct the label to the correct one in the second scene. Furthermore, even if a different object (person) is mistakenly tracked after the first scene, it is possible to track the correct object (person) from the second scene onwards.

ステップＳ１０９において、コンテンツを最後まで再生したと判定された場合、コンテンツ再生処理は終了する。 If it is determined in step S109 that the content has been played to the end, the content playback process ends.

このようにして、コンテンツ再生処理が実行される。 In this way, the content playback process is carried out.

なお、以上の説明では、主に、コンテンツがアイドルグループのコンサートを撮影したＭＰ４形式のデータの場合を例として説明したが、例えば、演劇を撮影したコンテンツについて同様の処理が実施されるようにしてもよい。演劇には、通常、複数の俳優が出演するが、例えば、特定のセリフや効果音などに対応する音声データが予め与えられるようにすれば、やはりコンテンツの画像の中で検出されるオブジェクトである人物の名称を推定しやすくなる。 The above description has mainly used the example in which the content is MP4-format data recording a concert by an idol group, but the same processing may also be performed on, for example, content recording a play. A play usually features multiple actors, and if audio data corresponding to specific lines, sound effects, or the like is provided in advance, it likewise becomes easier to estimate the names of the people who are the objects detected in the images of the content.

すなわち、演劇の中で、特定のセリフや効果音などが発せられる場面において、各人がどのような位置関係にあり、どのようなポーズをとるかなどが予め分かっており、画像の中のオブジェクトのラベルの推定も比較的容易になる。 In other words, for a scene in a play in which a specific line is spoken or a specific sound effect is produced, the positional relationship of the performers and the poses they take are known in advance, so estimating the labels of the objects in the image also becomes relatively easy.

（第一実施形態の効果） (Effects of the First Embodiment)
以上に説明したように、本実施形態によれば、再生されるコンテンツの画像の中で検出されたオブジェクトにラベルを付与することができる。この際、区間検出部３１により、予め与えられた音声データの音声に類似した音声の区間が検出されるので、再生されるコンテンツの画像の中でどのオブジェクトがどの位置で検出されるかを予め予測しやすくなる。また、検出されるオブジェクトである人物がどのようなポーズをとっているかも予め予測しやすくなる。 As described above, according to this embodiment, a label can be assigned to an object detected in an image of the content being played. Because the section detection unit 31 detects a section of audio similar to the audio of the pre-given audio data, it becomes easier to predict in advance which object will be detected at which position in the image of the content being played. It also becomes easier to predict in advance what pose a person, who is an object to be detected, will be in.

従って、オブジェクトである人物の位置、ポーズなどの特徴をもとにオブジェクトのラベルを推定することが可能になるので、例えば、コンテンツの画像の画質が低くても、精度の高いラベルの推定が可能となる。あるいは、コンサートにおける照明、衣装などが変更になっても、やはり精度の高いラベルの推定が可能となる。 As a result, it is possible to estimate the label of an object based on features such as the position and pose of the person that is the object, making it possible to estimate the label with high accuracy even if the image quality of the content is low. Or, even if the lighting or costumes at a concert are changed, it is still possible to estimate the label with high accuracy.
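The idea of estimating labels from known positions can be sketched as follows. This is a simplified stand-in and not the CNN-based estimation described for the label assignment unit 33: it assumes that the expected stage position of each performer at the detected section is available as a table, and it assigns each name to the nearest detected position. The function `label_by_formation` and the placeholder names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]  # normalized (x, y) position in the frame


def label_by_formation(
    formation: Dict[str, Point],
    detections: List[Point],
) -> Dict[str, Point]:
    """Greedy nearest-position assignment: pairs are visited in order of
    increasing distance, and each name and each detection is used at most once."""
    pairs = sorted(
        (math.dist(pos, det), name, i)
        for name, pos in formation.items()
        for i, det in enumerate(detections)
    )
    labels: Dict[str, Point] = {}
    used: Set[int] = set()
    for _, name, i in pairs:
        if name in labels or i in used:
            continue
        labels[name] = detections[i]
        used.add(i)
    return labels
```

Because only positions are compared, the result in this sketch does not depend on image quality, lighting, or costumes, which is the property the paragraph above attributes to position- and pose-based estimation.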

さらに、教師データのオブジェクトと、同じ位置関係にあり、同じポーズをとったオブジェクトについてラベルの推定が行われることになるので、少量の教師データによる学習であっても、ラベルが推定できる可能性が高くなる。従って、ラベル付与部３３が用いるモデルパラメータの学習の程度に係らず、精度の高いラベルの推定も可能となる。 Furthermore, since labels are estimated for objects that are in the same positional relationship and have the same pose as objects in the training data, there is a high possibility that labels can be estimated even when learning with a small amount of training data. Therefore, highly accurate label estimation is possible regardless of the degree of learning of the model parameters used by the label assignment unit 33.

このように、推定されたラベルが付されてオブジェクトが追跡されるようにすることで、指定した人物に注目した画像処理を施すことが可能となる。このようにすることで、例えば、多数のアイドルや俳優などの出演者の中で、ファンのそれぞれが好む出演者に注目して楽しむことが可能となる。 In this way, by tracking objects with estimated labels, it becomes possible to apply image processing that focuses on a specified person. In this way, for example, fans can enjoy focusing on their favorite performers among many performers such as idols and actors.

従って、本実施形態によれば、動画像の被写体であるグループの中で、１人に注目したコンテンツを簡単に提供することができる。 Therefore, according to this embodiment, it is possible to easily provide content that focuses on one person in a group that is the subject of a video image.

＜第二実施形態＞ <Second Embodiment>
図８は、第二実施形態に係るコンテンツ処理装置１０の機能的構成例を示すブロック図である。なお、第一実施形態にて説明した構成要素と同じ機能を有する構成要素については、同じ符号を付し、その説明を適宜省略する。 FIG. 8 is a block diagram showing an example of the functional configuration of the content processing device 10 according to the second embodiment. Components having the same functions as those described in the first embodiment are denoted by the same reference numerals, and their descriptions are omitted as appropriate.

図８に示されるコンテンツ処理装置１０には、動き解析部６１およびアニメーション画像生成部７１が含まれている。その他の構成は、図１に示されるコンテンツ処理装置１０の構成と同様である。 The content processing device 10 shown in FIG. 8 includes a motion analysis unit 61 and an animation image generation unit 71. The rest of the configuration is the same as that of the content processing device 10 shown in FIG. 1.

（動き解析部） (Motion analysis unit)
動き解析部６１は、コンテンツの画像を解析することで、画像の中で検出されたオブジェクト（例えば、人物）の動きを解析する。また、動き解析部６１は、例えば、指定されたオブジェクトの動きを特定する。例えば、コンテンツの中で所定の時間的長さを有する区間における人物の腕、脚、頭などの動きを特定する。なお、指定されたオブジェクトは、例えば、図４を参照して上述したユーザの操作によって注目する人物として指定されたオブジェクトであってもよい。 The motion analysis unit 61 analyzes the images of the content to analyze the motion of an object (e.g., a person) detected in the images. The motion analysis unit 61 also identifies, for example, the motion of a specified object. For example, it identifies the motion of the person's arms, legs, head, and so on in a section of the content having a predetermined time length. The specified object may be, for example, an object specified as the person of interest by the user operation described above with reference to FIG. 4.

一例として、動き解析部６１は、画像の中で検出されたオブジェクトである人物の関節位置を特定し、人物の腕、脚、頭などの各部位がどのように動いているかを解析する。なお、動き解析部６１による処理では、例えば、事前の機械学習により得られたモデルパラメータを用いて、人物の関節位置、および／または各部位の動きが推定されるようにしてもよい。 As an example, the motion analysis unit 61 identifies the joint positions of a person, which is an object detected in an image, and analyzes how each part of the person, such as the arms, legs, and head, is moving. Note that in the processing by the motion analysis unit 61, the joint positions of the person and/or the movement of each part may be estimated using model parameters obtained by prior machine learning, for example.
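How the movement of a body part can be derived from joint positions can be sketched as follows. This is a minimal illustration, assuming that two-dimensional joint positions (for example, shoulder, elbow, and wrist) have already been estimated for each frame; the function `joint_angle` is hypothetical and the description does not prescribe this representation.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at joint b between the segments b->a and b->c,
    e.g. the elbow angle for a=shoulder, b=elbow, c=wrist."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot gives an unsigned angle in [0, 180] degrees.
    return math.degrees(math.atan2(abs(cross), dot))
```

Evaluating such angles frame by frame yields a time series per joint, which is one possible form for the "information representing the movement of the person".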

動き解析部６１の解析結果は、人物の動きを表す情報としてアニメーション画像生成部７１に供給される。 The analysis results of the motion analysis unit 61 are supplied to the animation image generation unit 71 as information representing the movement of the person.

（アニメーション画像生成部） (Animation image generation unit)
アニメーション画像生成部７１は、動き解析部６１の解析結果に基づいてアニメーション画像を生成する。ここで、生成されるアニメーション画像は、例えば、動くキャラクターの画像である。一例として、キャラクターは、人間の体形と同様に腕、脚、頭を有し、人間と同様に関節を動かすものとされる。 The animation image generation unit 71 generates an animation image based on the analysis result of the motion analysis unit 61. Here, the generated animation image is, for example, an image of a moving character. As an example, the character has arms, legs, and a head like a human body, and moves its joints in the same way as a human.

アニメーション画像生成部７１は、動き解析部６１から供給された人物の動きを表す情報に基づいて、キャラクターを動かすアニメーション画像を生成する。すなわち、動き解析部６１により特定された人物の動きと同じ動きをするキャラクターのアニメーション画像が生成される。 The animation image generation unit 71 generates an animation image that moves a character based on the information representing the person's movement supplied from the motion analysis unit 61. That is, an animation image is generated of a character that makes the same movements as those of the person identified by the motion analysis unit 61.
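Transferring a person's movement to a character whose proportions differ can be sketched as follows. This is a minimal two-dimensional illustration under stated assumptions: a limb is given as a chain of joint positions, and the character reuses the direction of each bone while keeping its own bone lengths. The function `retarget_chain` is hypothetical and is not described in the embodiment.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def retarget_chain(
    person_joints: Sequence[Point],
    avatar_root: Point,
    avatar_bone_lengths: Sequence[float],
) -> List[Point]:
    """Copy the direction of each bone in the person's joint chain onto the
    avatar's chain, starting at avatar_root and using the avatar's bone lengths."""
    out: List[Point] = [avatar_root]
    bones = zip(person_joints, person_joints[1:])
    for (p0, p1), length in zip(bones, avatar_bone_lengths):
        dx, dy = p1[0] - p0[0], p1[1] - p0[1]
        norm = math.hypot(dx, dy)
        # A zero-length bone has no direction; leave the next joint in place.
        ux, uy = (dx / norm, dy / norm) if norm > 0 else (0.0, 0.0)
        x, y = out[-1]
        out.append((x + ux * length, y + uy * length))
    return out
```

Applying this to every frame of the analyzed section produces a sequence of character poses that follows the person's movement regardless of the character's size.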

なお、図８の例では、アニメーション画像生成部７１がコンテンツ処理装置１０の内部に設けられているが、アニメーション画像生成部７１は、例えば、コンテンツ処理装置１０とは異なる装置に設けられるようにしてもよい。 Note that in the example of FIG. 8, the animation image generating unit 71 is provided inside the content processing device 10, but the animation image generating unit 71 may be provided, for example, in a device different from the content processing device 10.

すなわち、動き解析部６１は、オブジェクト検出部３２が検出した複数の人物のうち、ユーザが指定した人物の動きを解析し、指定した人物の動きと同じ動きをするアニメーション画像を生成するアニメーション画像生成部７１に、動き解析部６１の解析結果が供給される。 That is, the motion analysis unit 61 analyzes the motion of a person designated by the user from among the multiple people detected by the object detection unit 32, and the analysis result of the motion analysis unit 61 is supplied to the animation image generation unit 71, which generates an animation image that moves in the same manner as the designated person.

（キャラクターの表示） (Character display)
動くキャラクターのアニメーションは、例えば、ディスプレイ５０に表示されるようにしてもよい。このようにすることで、例えば、ユーザは、アイドルグループの中で注目する人物の動きをまねることができる。例えば、自分が気に入ったアイドルと、楽曲に合わせて一緒に踊ることができる。 The animation of the moving character may be displayed, for example, on the display 50. In this way, for example, the user can imitate the movements of a person of interest in an idol group. For example, the user can dance along to a song with the idol that the user likes.

また、動くキャラクターのアニメーションは、例えば、仮想空間に表示されるアバターとして利用されるようにしてもよい。すなわち、アニメーション画像生成部７１により生成されるアニメーション画像は、ユーザのアバターの画像であってもよい。図９は、仮想空間の例を示す図である。図９に示される仮想空間２００は、イベント会場を模して造られている。同図には、また、アバターとして用いられるキャラクター２１１乃至キャラクター２１６が表示されている。 In addition, the animation of a moving character may be used, for example, as an avatar displayed in a virtual space. That is, the animation image generated by the animation image generating unit 71 may be an image of a user's avatar. FIG. 9 is a diagram showing an example of a virtual space. The virtual space 200 shown in FIG. 9 is modeled after an event venue. The figure also shows characters 211 to 216 used as avatars.

アバターのそれぞれは、仮想空間２００における各ユーザの分身となるキャラクターであり、例えば、ユーザの操作等に基づいて仮想空間２００内を動くように設定されている。すなわち、キャラクター２１１乃至キャラクター２１６のそれぞれは、異なるユーザに対応付けられ、それらのユーザの操作に従って、仮想空間２００の中を移動したり、体を動かしたりする。 Each of the avatars is a character that serves as the alter ego of a user in the virtual space 200, and is set to move within the virtual space 200 based on, for example, the user's operations. That is, each of the characters 211 to 216 is associated with a different user and, in accordance with that user's operations, moves around the virtual space 200 or moves its body.

例えば、仮想空間２００の中にあるステージ２００ａに、アイドルグループのファンであるユーザのアバターが集まって、楽曲に合わせて踊ることも可能である。例えば、各ユーザが、アイドルグループの中で自分が注目する人物の動きを動き解析部６１で解析し、アニメーション画像生成部７１によって、自分のアバターであるキャラクターを動かすアニメーション画像を生成する。このアニメーション画像を仮想空間２００のステージ２００ａ上で再生することで、アバターがアイドルグループと同じ振付で踊る画像を楽しむことができる。 For example, avatars of users who are fans of an idol group can gather on stage 200a in virtual space 200 and dance to the music. For example, each user can analyze the movements of a person in the idol group that catches their attention with motion analysis unit 61, and generate an animation image of the user's avatar character moving with animation image generation unit 71. By playing this animation image on stage 200a in virtual space 200, users can enjoy watching their avatars dancing to the same choreography as the idol group.

さらに、仮想空間２００の画像にアイドルグループのパフォーマンスの画像が重畳されて表示されるようにしてもよい。 Furthermore, an image of an idol group's performance may be superimposed and displayed on the image of the virtual space 200.

（第二実施形態の効果） (Effects of the Second Embodiment)
以上に説明したように、本実施形態によれば、コンテンツの画像から検出されるオブジェクトのうち、指定したオブジェクトの動きを解析して、当該オブジェクトと同じ動きをするアニメーション画像を生成することができる。このようにすることで、例えば、多数のアイドルや俳優などの出演者の中で、ファンのそれぞれが好む出演者に注目した楽しみ方のバリエーションが増える。 As described above, according to this embodiment, it is possible to analyze the movement of a specified object among objects detected from an image of a content, and generate an animation image that moves in the same way as the specified object. In this way, for example, the variety of ways in which fans can enjoy the content increases, focusing on their favorite performers among a large number of performers such as idols and actors.

＜その他の実施形態＞ <Other embodiments>
上述した実施形態において、コンテンツ処理装置１０は、例えば、パーソナルコンピュータ、ゲーム機などによって構成されるようにしてもよいし、スマートフォンなどにより構成されるようにしてもよい。あるいは、コンテンツ処理装置１０の一部の機能が、パーソナルコンピュータなどによって実現され、他の機能がスマートフォンなどによって実現されるようにしてもよい。 In the above-described embodiment, the content processing device 10 may be configured, for example, by a personal computer, a game machine, etc., or may be configured by a smartphone, etc. Alternatively, some functions of the content processing device 10 may be realized by a personal computer, etc., and other functions may be realized by a smartphone, etc.

また、上述した実施形態においては、コンテンツ処理装置１０により再生されるコンテンツの撮影に用いられるカメラは任意に選択されるようにしてよい。 In addition, in the above-described embodiment, the camera used to capture the content to be played back by the content processing device 10 may be selected arbitrarily.

一方で、例えば、コンテンツの撮影において、立体視用のカメラが用いられるようにしてもよい。このようにすることで、視差のある画像を撮影することができ、コンテンツ処理装置１０により再生されるコンテンツが３Ｄ表示されるようにすることが可能となる。なお、２台のカメラを用いてコンテンツの撮影が行われて、視差のある画像を得るようにしてもよい。 On the other hand, for example, a stereoscopic camera may be used to capture content. In this way, an image with parallax can be captured, and the content played back by the content processing device 10 can be displayed in 3D. Note that content may be captured using two cameras to obtain an image with parallax.

また、例えば、コンテンツの撮影において、３６０度カメラが用いられるようにしてもよい。３６０度カメラは、いわゆる魚眼レンズなどの広角レンズを有し、パノラマビューの画像を撮影することができる。３６０度カメラを用いて撮影された画像から生成されるコンテンツにおいては、例えば、任意の視点から見た画像を表示させることが可能となる。 For example, a 360-degree camera may be used to capture content. A 360-degree camera has a wide-angle lens, such as a fisheye lens, and can capture panoramic view images. In content generated from images captured using a 360-degree camera, it is possible to display images viewed from any viewpoint, for example.

このようにすることで、例えば、ユーザが指定する人物が特に目立って写る視点からの画像を再生することも可能となる。また、２台の３６０度カメラを用いてコンテンツを撮影することで、指定する人物に注目した視差のある画像を表示させることも可能となる。 In this way, for example, it is possible to play back an image from a viewpoint in which a person specified by the user stands out. Also, by shooting content using two 360-degree cameras, it is possible to display an image with parallax that focuses on a specified person.

＜ソフトウェアによる実現例＞ <Example of software implementation>
上述したコンテンツ処理装置１０は、コンピュータを機能させるためのプログラムであって、コンテンツ処理装置１０としてコンピュータを機能させるためのプログラムにより実現することができる。この場合、コンテンツ処理装置１０は、上記プログラムを実行するためのハードウェアとして、少なくとも１つの制御装置（例えばプロセッサ）と少なくとも１つの記憶装置（例えばメモリ）を有するコンピュータを備えている。このようなコンピュータの一例を図１０に示す。 The content processing device 10 described above can be realized by a program for causing a computer to function, specifically a program for causing a computer to function as the content processing device 10. In this case, the content processing device 10 includes, as hardware for executing the program, a computer having at least one control device (e.g., a processor) and at least one storage device (e.g., a memory). An example of such a computer is shown in FIG. 10.

コンピュータ５００は、少なくとも１つのプロセッサ５０１と、少なくとも１つのメモリ５０２と、を備えている。メモリ５０２には、コンピュータ５００をコンテンツ処理装置１０として動作させるためのプログラム５２０が記録されている。コンピュータ５００において、プロセッサ５０１は、このプログラム５２０をメモリ５０２から読み取って実行することにより、コンテンツ処理装置１０の各機能が実現される。 The computer 500 includes at least one processor 501 and at least one memory 502. The memory 502 stores a program 520 for operating the computer 500 as a content processing device 10. In the computer 500, the processor 501 reads and executes this program 520 from the memory 502, thereby realizing each function of the content processing device 10.

プロセッサ５０１としては、例えば、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、ＧＰＵ（ＧｒａｐｈｉｃＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、ＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）、ＭＰＵ（ＭｉｃｒｏＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、ＦＰＵ（ＦｌｏａｔｉｎｇｐｏｉｎｔｎｕｍｂｅｒＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、ＰＰＵ（ＰｈｙｓｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、マイクロコントローラ、又は、これらの組み合わせなどを用いることができる。 The processor 501 may be, for example, a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination of these.

メモリ５０２としては、例えば、フラッシュメモリ、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）、又は、これらの組み合わせなどを用いることができる。 For example, the memory 502 may be a flash memory, a hard disk drive (HDD), a solid state drive (SSD), or a combination of these.

なお、コンピュータ５００は、プログラム５２０を実行時に展開したり、各種データを一時的に記憶したりするためのＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）を更に備えていてもよい。また、コンピュータ５００は、他の装置との間でデータを送受信するための通信インターフェースを更に備えていてもよい。また、コンピュータ５００は、キーボードやマウス、ディスプレイやプリンタなどの入出力機器を接続するための入出力インターフェースを更に備えていてもよい。 The computer 500 may further include a RAM (Random Access Memory) for expanding the program 520 during execution and for temporarily storing various data. The computer 500 may further include a communication interface for transmitting and receiving data to and from other devices. The computer 500 may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.

また、コンピュータ５００をコンテンツ処理装置１０として動作させるためのプログラム５２０は、コンピュータ５００が読み取り可能な、一時的でない有形の記録媒体５３０に記録することができる。このような記録媒体５３０としては、例えば、テープ、ディスク、カード、半導体メモリ、又はプログラマブルな論理回路などを用いることができる。コンピュータ５００は、このような記録媒体５３０を介してプログラム５２０を取得することができる。 The program 520 for operating the computer 500 as the content processing device 10 can be recorded on a non-transitory, tangible recording medium 530 that can be read by the computer 500. For example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used as such a recording medium 530. The computer 500 can obtain the program 520 via such a recording medium 530.

また、コンピュータ５００をコンテンツ処理装置１０として動作させるためのプログラム５２０は、伝送媒体を介して伝送することができる。このような伝送媒体としては、例えば、通信ネットワーク、又は放送波などを用いることができる。コンピュータ５００は、このような伝送媒体を介してプログラム５２０を取得することもできる。 The program 520 for operating the computer 500 as the content processing device 10 can be transmitted via a transmission medium. For example, a communication network or broadcast waves can be used as such a transmission medium. The computer 500 can also obtain the program 520 via such a transmission medium.

また、コンテンツ処理装置１０の各機能の一部または全部は、論理回路により実現することも可能である。例えば、上記各制御ブロックとして機能する論理回路が形成された集積回路も本発明の範疇に含まれる。この他にも、例えば量子コンピュータにより上記各制御ブロックの機能を実現することも可能である。 Furthermore, some or all of the functions of the content processing device 10 can be realized by logic circuits. For example, an integrated circuit in which logic circuits that function as each of the above control blocks are formed is also included in the scope of the present invention. In addition, it is also possible to realize the functions of each of the above control blocks by, for example, a quantum computer.

以上説明してきた本発明の各態様によれば、上述した作用効果を奏することにより、持続可能な開発目標（ＳＤＧｓ）の目標９「産業と技術革新の基盤をつくろう」の達成に貢献できる。 According to each aspect of the present invention described above, the above-mentioned effects can be achieved, thereby contributing to the achievement of Goal 9 of the Sustainable Development Goals (SDGs), "Build resilient infrastructure, promote inclusive and sustainable industrialization and innovation."

なお、本発明は上述した各実施形態に限定されるものではなく、請求項に示した範囲で種々の変更が可能であり、異なる実施形態にそれぞれ開示された技術的手段を適宜組み合わせて得られる実施形態についても本発明の技術的範囲に含まれる。 The present invention is not limited to the above-described embodiments, and various modifications are possible within the scope of the claims. The technical scope of the present invention also includes embodiments obtained by appropriately combining the technical means disclosed in the different embodiments.

〔まとめ〕 [Summary]
本発明の態様１に係るコンテンツ処理装置は、音声とともに動画像が再生されるコンテンツの入力を受け付けるコンテンツ入力受付部と、予め与えられた所定の長さの音声データに対応する音声と類似する音声が再生される区間を、入力されたコンテンツから検出する区間検出部と、検出された前記区間の直後に再生される画像からオブジェクトを検出するオブジェクト検出部と、検出された前記オブジェクトにラベルを付与するラベル付与部とを備える。 A content processing device according to aspect 1 of the present invention includes a content input receiving unit that receives input of content in which moving images are played together with audio, a section detection unit that detects a section from the input content in which audio similar to audio corresponding to audio data of a predetermined length that has been given in advance is played, an object detection unit that detects an object from an image that is played immediately after the detected section, and a label assignment unit that assigns a label to the detected object.

本発明の態様２に係るコンテンツ処理装置は、上記の態様１において、前記オブジェクトが人物であり、前記ラベル付与部は、前記人物の名称に係るラベルを付与する。 In the content processing device according to aspect 2 of the present invention, in the above aspect 1, the object is a person, and the label assignment unit assigns a label related to the name of the person.

本発明の態様３に係るコンテンツ処理装置は、上記の態様１または２において、前記ラベル付与部は、機械学習により得られたモデルパラメータを用いた演算により、前記オブジェクトのラベルを推定する。 In the content processing device according to aspect 3 of the present invention, in the above aspect 1 or 2, the label assignment unit estimates the label of the object by performing a calculation using model parameters obtained by machine learning.

本発明の態様４に係るコンテンツ処理装置は、上記の態様１乃至３のいずれかにおいて、前記動画像とともに再生される音声が楽曲である。 A content processing device according to aspect 4 of the present invention is any one of aspects 1 to 3 above, in which the audio played together with the moving image is music.

本発明の態様５に係るコンテンツ処理装置は、上記の態様１乃至４のいずれかにおいて、前記オブジェクト検出部は、さらに、前記ラベルが付与された前記オブジェクトを、前記区間検出部により検出された前記区間より時間的に後に再生される前記コンテンツの動画像の中で追跡する。 In the content processing device according to aspect 5 of the present invention, in any one of aspects 1 to 4 above, the object detection unit further tracks the object to which the label has been assigned in a video of the content that is played back after the section detected by the section detection unit.

本発明の態様６に係るコンテンツ処理装置は、上記の態様５において、前記コンテンツの中で時間的に先に再生される区間の音声である第１の音声に対応する第１の音声データと、前記コンテンツの中で時間的に後に再生される区間の音声である第２の音声に対応する第２の音声データが予め与えられ、前記区間検出部は、前記第１の音声と類似する音声が再生される第１の区間と、前記第２の音声と類似する音声が再生される第２の区間とをそれぞれ検出し、前記オブジェクト検出部は、前記第１の区間に対応してラベルが付与されたオブジェクト、および前記第２の区間に対応してラベルが付与されたオブジェクトをそれぞれ追跡する。 In the content processing device according to aspect 6 of the present invention, in the above aspect 5, first audio data corresponding to a first audio, which is an audio of a section played earlier in time in the content, and second audio data corresponding to a second audio, which is an audio of a section played later in time in the content, are provided in advance, the section detection unit detects a first section in which an audio similar to the first audio is played and a second section in which an audio similar to the second audio is played, and the object detection unit tracks an object to which a label is assigned corresponding to the first section and an object to which a label is assigned corresponding to the second section, respectively.

本発明の態様７に係るコンテンツ処理装置は、上記の態様５または態様６において、前記オブジェクト検出部が検出した複数のオブジェクトのうち、ユーザが指定したオブジェクトに所定の画像処理を施すオブジェクト画像処理部をさらに備える。 The content processing device according to aspect 7 of the present invention is the above-mentioned aspect 5 or aspect 6, further comprising an object image processing unit that performs a predetermined image processing on an object designated by a user from among the multiple objects detected by the object detection unit.

本発明の態様８に係るコンテンツ処理装置は、上記の態様７において、前記オブジェクト画像処理部は、前記指定したオブジェクトを拡大して表示する画像処理を施す。 In the content processing device according to aspect 8 of the present invention, in the above aspect 7, the object image processing unit performs image processing to enlarge and display the specified object.

本発明の態様９に係るコンテンツ処理装置は、上記の態様７または８において、前記オブジェクト画像処理部は、前記指定したオブジェクトの近傍の所定の範囲内に予め決められた画像を重畳して表示する画像処理を施す。 In the content processing device according to aspect 9 of the present invention, in the above aspect 7 or 8, the object image processing unit performs image processing to superimpose and display a predetermined image within a predetermined range in the vicinity of the specified object.

本発明の態様１０に係るコンテンツ処理装置は、上記の態様５乃至９のいずれかにおいて、前記検出されたオブジェクトの動きを解析する動き解析部をさらに備える。 The content processing device according to aspect 10 of the present invention is any one of aspects 5 to 9 above, further comprising a motion analysis unit that analyzes the motion of the detected object.

本発明の態様１１に係るコンテンツ処理装置は、上記の態様１０において、前記オブジェクトが人物であり、前記動き解析部は、前記オブジェクト検出部が検出した複数の人物のうち、ユーザが指定した人物の動きを解析し、指定した前記人物の動きと同じ動きをするアニメーション画像を生成するアニメーション画像生成部に、前記動き解析部の解析結果が供給される。 In the content processing device according to aspect 11 of the present invention, in the above aspect 10, the object is a person, and the motion analysis unit analyzes the motion of a person designated by a user from among a plurality of people detected by the object detection unit, and the analysis result of the motion analysis unit is supplied to an animation image generation unit that generates an animation image that moves in the same manner as the designated person.

本発明の態様１２に係るコンテンツ処理装置は、上記の態様１１において、前記アニメーション画像は、前記ユーザのアバターの画像である。 In a content processing device according to aspect 12 of the present invention, in the above aspect 11, the animation image is an image of the user's avatar.

本発明の態様１３に係るコンテンツ処理方法は、音声とともに動画像が再生されるコンテンツの入力を受け付けるステップと、予め与えられた所定の長さの音声データに対応する音声と類似する音声が再生される区間を、入力されたコンテンツから検出するステップと、検出された前記区間の直後に再生される画像からオブジェクトを検出するステップと、検出された前記オブジェクトにラベルを付与するステップとを含む。 The content processing method according to aspect 13 of the present invention includes the steps of accepting input of content in which moving images are played together with sound, detecting a section from the input content in which sound similar to sound corresponding to a predetermined length of audio data is played, detecting an object from an image played immediately after the detected section, and assigning a label to the detected object.

The program according to aspect 14 of the present invention causes a computer to function as a content processing device comprising: an input receiving unit that receives input of content in which a moving image is played back together with audio; a section detection unit that detects, from the input content, a section in which audio similar to the audio corresponding to previously supplied audio data of a predetermined length is played back; an object detection unit that detects an object from an image played back immediately after the detected section; and a label assignment unit that assigns a label to the detected object.

REFERENCE SIGNS LIST
10 Content processing device
11 Content input receiving unit
12 Content playback unit
13 Operation input receiving unit
31 Section detection unit
32 Object detection unit
33 Label assignment unit
34 Object image processing unit
50 Display
61 Motion analysis unit
71 Animation image generation unit

Claims

A content processing device comprising:
an input receiving unit that receives input of content in which a moving image of a plurality of people is played back together with audio;
a section detection unit that detects, from the input content, a section in which audio similar to the audio corresponding to previously supplied audio data of a predetermined length is played back;
an object detection unit that detects, as objects, the images of the plurality of people appearing in an image played back immediately after the detected section; and
a label assignment unit that, using model parameters obtained by machine learning performed in advance, assigns to each object detected by the object detection unit a label corresponding to the name of that object.

The content processing device according to claim 1, wherein the audio played back together with the moving image is a piece of music.

The content processing device according to claim 1, wherein the object detection unit further tracks and detects the object to which the label has been assigned within the moving image of the content played back after the section detected by the section detection unit.

The content processing device according to claim 3, wherein:
first audio data corresponding to a first sound, which is the sound of a section played back earlier in the content, and second audio data corresponding to a second sound, which is the sound of a section played back later in the content, are provided in advance;
the section detection unit detects a first section in which a sound similar to the first sound is played back and a second section in which a sound similar to the second sound is played back; and
the label assignment unit assigns a label to each object detected by the object detection unit from the image immediately after the first section, and assigns a label again to each object detected by the object detection unit from the image immediately after the second section.
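The re-labeling behavior recited in this claim can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the claimed implementation: sections are represented only by the index of their last frame, detection and labeling are collapsed into one caller-supplied function, and the names `relabel_at_sections` and `labels_in_effect` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def relabel_at_sections(frames: Sequence[object],
                        section_end_frames: Sequence[int],
                        detect_and_label: Callable[[object], List[str]]
                        ) -> Dict[int, List[str]]:
    """Assign fresh labels on the frame immediately after each detected section."""
    labels_by_frame: Dict[int, List[str]] = {}
    for end in sorted(section_end_frames):
        index = end + 1  # frame immediately after the section (assumed convention)
        if index < len(frames):
            labels_by_frame[index] = detect_and_label(frames[index])
    return labels_by_frame


def labels_in_effect(labels_by_frame: Dict[int, List[str]],
                     frame_index: int) -> List[str]:
    """Return the most recent labels assigned at or before the given frame."""
    keys = [k for k in labels_by_frame if k <= frame_index]
    return labels_by_frame[max(keys)] if keys else []
```

With two sections, frames after the first section carry the first label set until the second section ends, at which point the newly assigned labels replace them.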

The content processing device according to claim 3, further comprising an object image processing unit that performs predetermined image processing on an object designated by a user from among the plurality of objects detected by the object detection unit.

The content processing device according to claim 5, wherein the object image processing unit performs image processing that enlarges and displays the designated object.
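One way to picture the enlargement is as a crop-and-scale computation on the designated object's bounding box. The Python sketch below assumes the object is described by an axis-aligned box `(x, y, width, height)` in pixel coordinates and that the enlarged view must fit the full frame while keeping the aspect ratio. The function name `enlarge_region` and the `margin` parameter are hypothetical and do not come from the specification.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def enlarge_region(box: Box, frame_size: Tuple[float, float],
                   margin: float = 0.1) -> Tuple[Box, float]:
    """Return the crop rectangle around the box and the scale that fits it to the frame."""
    x, y, w, h = box
    frame_w, frame_h = frame_size
    pad_x, pad_y = w * margin, h * margin
    # Expand the box by the margin, then clamp it to the frame boundaries.
    left = max(0.0, x - pad_x)
    top = max(0.0, y - pad_y)
    right = min(frame_w, x + w + pad_x)
    bottom = min(frame_h, y + h + pad_y)
    crop = (left, top, right - left, bottom - top)
    # Uniform scale so the crop fills the frame without distortion.
    scale = min(frame_w / crop[2], frame_h / crop[3])
    return crop, scale
```

The crop rectangle and scale factor would then be handed to whatever image library performs the actual resampling.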

The content processing device according to claim 5, wherein the object image processing unit superimposes and displays a predetermined image within a predetermined range near the designated object.

The content processing device according to claim 3, further comprising a motion analysis unit that analyzes the motion of the detected object.

The content processing device according to claim 8, wherein the motion analysis unit analyzes the motion of a person designated by a user from among the plurality of people detected as objects by the object detection unit, and the analysis result of the motion analysis unit is supplied to an animation image generation unit that generates an animation image moving in the same way as the designated person.