JP4003940B2

JP4003940B2 - VIDEO-RELATED CONTENT GENERATION DEVICE, VIDEO-RELATED CONTENT GENERATION METHOD, AND VIDEO-RELATED CONTENT GENERATION PROGRAM

Info

Publication number: JP4003940B2
Application number: JP2002167419A
Authority: JP
Inventors: 俊彦三須; 昌秀苗村; 文濤鄭
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2002-06-07
Filing date: 2002-06-07
Publication date: 2007-11-07
Anticipated expiration: 2022-06-07
Also published as: JP2004015523A

Description

【０００１】
【発明の属する技術分野】
本発明は、映像を映像メディア以外のメディアに供給することができる映像関連コンテンツを生成する映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムに関する。
【０００２】
【従来の技術】
従来、映像（映像コンテンツ）から、映像コンテンツをそのまま提示することができる映像メディア（テレビ放送等）以外の、例えばデータ放送、ＷＷＷ（ＷｏｒｌｄＷｉｄｅＷｅｂ）、携帯端末等の他のメディアで提示することができるコンテンツを制作し、配信する場合、元となる映像から特定の大きさを切り出したり、伝送フレームレート等を変換することで、他のメディア用の映像コンテンツを制作し、配信を行っている。この映像コンテンツを、映像メディア以外のメディア用コンテンツへ変換する方法は、解像度を除いては基本的に同種（映像）のコンテンツに変換することしか行われていない。
【０００３】
また、従来、音声（音声コンテンツ）から、文字放送等で提示するコンテンツを制作する場合、元となる音声を音声認識によって文字情報に変換して、文字コンテンツとする方法がある。このように、音声コンテンツでは音声コンテンツから文字情報という異種のコンテンツに変換することが行われている。
【０００４】
【発明が解決しようとする課題】
しかし、前記従来の技術では、異なるメディア用のコンテンツに変換する場合、映像から映像への解像度変換が主流であり、その変換の前後では解像度の違いを除いては基本的には同一の映像コンテンツである。また、異種コンテンツへの変換は、音声認識に基づく音声コンテンツから文字コンテンツへの変換が主流である。すなわち、映像から映像コンテンツ以外のコンテンツに変換する手法は考えられていない。
【０００５】
このため、映像コンテンツに関連した、ＷＷＷ等で使用されるコンテンツ記述言語で記述されたテキストベースのコンテンツ、音声コンテンツ、あるいは、映像コンテンツと関連しているが異なる画像を有する画像コンテンツ等を制作する場合、映像コンテンツを利用することができず、最初から制作を行わなければならないという問題があった。
【０００６】
本発明は、以上のような問題点に鑑みてなされたものであり、映像（映像コンテンツ）を、その映像の内容に関連する座標情報、音声情報、画像情報等の異種情報に変換した映像関連コンテンツを生成することを可能にする映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムを提供することを目的とする。
【０００７】
【課題を解決するための手段】
本発明は、前記目的を達成するために創案されたものであり、まず、請求項１に記載の映像関連コンテンツ生成装置は、映像信号から、その映像信号の映像内容に関連する情報を映像関連コンテンツとして生成する映像関連コンテンツ生成装置であって、映像信号を解析して、映像特徴量を抽出する映像シーン解析手段と、この映像シーン解析手段で抽出された映像特徴量を、テキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツを生成するコンテンツ生成手段と、を備え、コンテンツ生成手段は、映像特徴量と、その映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースと、映像特徴量に基づいて、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語のテキスト領域に、特徴量文字列を埋め込む文字列埋め込み手段と、を備える構成とした。
【０００８】
かかる構成によれば、映像関連コンテンツ生成装置は、映像シーン解析手段によって、映像信号から映像特徴量を抽出する。そして、コンテンツ生成手段によって、映像特徴量をテキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツとして生成する。
そして、映像関連コンテンツ生成装置は、文字列埋め込み手段によって、文字列変換データベースを参照して、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語のテキスト領域に、映像特徴量に対応する特徴量文字列を埋め込む。
【０００９】
ここで、映像特徴量とは、映像シーンを構成するフレームを特徴付ける数量のことで、例えば、明るさ（輝度値）、色味（色特徴量）、動き（動きベクトル量）、テクスチャ、映像オブジェクトの位置座標、映像オブジェクト数等を数値化したもの、あるいはその統計量である。
【００１０】
また、映像特徴量をテキストデータに変換する場合、テキストベースのコンテンツ記述言語に変換すると、そのコンテンツ記述言語の再生装置によって、コンテンツを再生することが可能になり都合がよい。このコンテンツ記述言語には、例えば、ＨＴＭＬ（HyperText Markup Language）、ＶＲＭＬ（Virtual Reality Modeling Language）、ＢＭＬ（Broadcast Markup Language）、ＲｅａｌＡｕｄｉｏメタファイル等がある。
また、コンテンツ記述言語のテキスト領域には、映像特徴量に対応する予め定められた置換対象文字列を記述しておき、文字列変換データベースには、映像特徴量の種類とその映像特徴量の値毎に、置換対象文字列と特徴量文字列（置換文字列）とを対応付けておくことで、コンテンツ記述言語のテキスト領域である置換対象文字列を容易に特徴量文字列に置き換えることができる。
【００１１】
また、請求項２に記載の映像関連コンテンツ生成装置は、請求項１に記載の映像関連コンテンツ生成装置において、映像シーン解析手段が、映像シーンに含まれる映像オブジェクトの位置座標を、映像特徴量として検出する映像オブジェクト位置検出手段を備える構成とした。
【００１２】
かかる構成によれば、映像関連コンテンツ生成装置は、映像オブジェクト位置検出手段によって、映像シーンに含まれる映像オブジェクトの位置座標を、映像特徴量として検出する。この位置座標は、映像オブジェクトの特定の位置（例えば、左上座標、中心座標等）でもよいし、映像オブジェクトの重心座標としてもよい。
【００１３】
さらに、請求項３に記載の映像関連コンテンツ生成装置は、請求項１又は請求項２に記載の映像関連コンテンツ生成装置において、映像シーン解析手段が、映像シーンに含まれる映像オブジェクトを特徴付ける特徴量を、映像特徴量として抽出する映像オブジェクト特徴量抽出手段を備える構成とした。
【００１４】
かかる構成によれば、映像関連コンテンツ生成装置は、映像オブジェクト特徴量抽出手段によって、映像シーンに含まれる映像オブジェクトを特徴付ける映像特徴量を抽出する。この映像特徴量（映像オブジェクト特徴量）は、明るさ（輝度値）、色味（色特徴量）、動き（動きベクトル量）、テクスチャ等の映像オブジェクト毎の特徴量である。
【００１７】
また、請求項４に記載の映像関連コンテンツ生成装置は、請求項２に記載の映像関連コンテンツ生成装置において、コンテンツ生成手段が、映像オブジェクト位置検出手段で検出された映像オブジェクトの位置座標に、映像オブジェクトに関連する画像データを合成する画像合成手段を備える構成とした。
【００１８】
かかる構成によれば、映像関連コンテンツ生成装置は、画像合成手段によって、映像オブジェクト位置検出手段で検出された映像オブジェクトの位置座標に、画像データを合成することで、映像オブジェクトの位置座標を可視化したコンテンツを生成する。
【００１９】
さらに、請求項５に記載の映像関連コンテンツ生成装置は、請求項１乃至請求項４のいずれか１項に記載の映像関連コンテンツ生成装置において、コンテンツ生成手段が、映像特徴量に対応付けて、複数の音声データを蓄積した音声データ蓄積手段と、映像特徴量に基づいて、音声データ蓄積手段に蓄積されている音声データを選択する音声選択手段と、この音声選択手段で選択された音声データを出力する音声出力手段と、を備える構成とした。
【００２０】
かかる構成によれば、映像関連コンテンツ生成装置は、映像特徴量に対応付けて、複数の音声データを蓄積した音声データ蓄積手段から、音声選択手段が、映像特徴量に基づいて音声データを選択する。
ここで、音声データ蓄積手段に蓄積されている音声データは、映像特徴量の値に対応付けて、例えば、輝度値等による映像の明るさを映像特徴量とする場合は、明るい映像に対して、楽しい音楽を対応付ける。あるいは、映像オブジェクトの移動量による動きの激しさを映像特徴量とする場合は、映像オブジェクトの動きの激しい映像に対しては、テンポの速い音楽を対応付けることも可能である。
【００２１】
また、請求項６に記載の映像関連コンテンツ生成方法は、映像信号から、その映像信号の映像内容に関連する情報を映像関連コンテンツとして生成する映像関連コンテンツ生成方法であって、映像信号の映像シーンを解析して、映像特徴量を抽出する映像シーン解析ステップと、映像特徴量とその映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースから、映像シーン解析ステップで抽出した映像特徴量に対応する特徴量文字列を検索する文字列検索ステップと、特徴量文字列を埋め込むテキスト領域をテンプレート化した、コンテンツ記述言語を入力するコンテンツ記述言語入力ステップと、映像特徴量に基づいて、コンテンツ記述言語のテキスト領域に文字列検索ステップで検索した特徴量文字列を埋め込む文字列埋め込みステップと、を含んでいることを特徴とする。
【００２２】
かかる方法によれば、映像関連コンテンツ生成方法は、映像シーン解析ステップによって、映像シーンを構成するフレームを特徴付ける数量である映像特徴量を抽出し、文字列検索ステップによって、映像特徴量と映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースから、映像シーン解析ステップで抽出した映像特徴量に対応する特徴量文字列を検索する。
そして、コンテンツ記述言語入力ステップによって、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語を入力し、文字列埋め込みステップによって、コンテンツ記述言語のテキスト領域に特徴量文字列を埋め込んで映像関連コンテンツを生成する。
【００２３】
さらに、請求項７に記載の映像関連コンテンツ生成プログラムは、映像信号から、その映像信号の映像内容に関連する情報を映像関連コンテンツとして生成するために、コンピュータを、映像信号の映像シーンを解析して、映像特徴量を抽出する映像シーン解析手段、この映像シーン解析手段で抽出された映像特徴量を、テキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツを生成するコンテンツ生成手段、として機能させ、コンテンツ生成手段は、映像特徴量と、その映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースを参照し、映像特徴量に基づいて、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語のテキスト領域に、特徴量文字列を埋め込むことを特徴とする。
【００２４】
かかる構成によれば、映像関連コンテンツ生成プログラムは、映像シーン解析手段によって、映像シーンを構成するフレームを特徴付ける数量である映像特徴量を抽出し、コンテンツ生成手段によって、映像特徴量をテキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツとして生成する。
なお、映像関連コンテンツ生成プログラムは、コンテンツ生成手段が、映像特徴量と、その映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースを参照し、映像特徴量に基づいて、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語のテキスト領域に、特徴量文字列を埋め込む。
【００２５】
【発明の実施の形態】
以下、本発明の実施の形態について図面を参照して説明する。
（映像関連コンテンツ生成装置の構成）
図１は、本発明における映像関連コンテンツ生成装置の構成を示したブロック図である。図１に示すように映像関連コンテンツ生成装置１は、入力された映像（映像信号）の映像シーンを解析することで、その映像シーンの特徴量（映像特徴量）を抽出し、その抽出した映像特徴量に基づいて、映像内容に関連する情報をコンテンツ記述言語で記述したコンテンツ、画像ファイル又は映像ストリーム、音声ファイル又は音声ストリームを映像関連コンテンツとして生成するものである。
【００２６】
この映像関連コンテンツ生成装置１は、映像シーン解析手段２と、コンテンツ記述言語生成手段４、画像合成手段５及び音声合成手段６を含んだコンテンツ生成手段３と、を備える構成とした。
【００２７】
映像シーン解析手段２は、映像オブジェクト位置検出部２１と、映像オブジェクト特徴量抽出部２２と、映像シーン特徴量抽出部２３とを備え、入力された映像信号から、映像信号の解析を行い、映像特徴量を抽出するものである。
【００２８】
映像オブジェクト位置検出部（映像オブジェクト位置検出手段）２１は、映像シーンから、その映像シーンに含まれる映像オブジェクトを検出するものである。ここでは、この映像オブジェクト位置検出部２１は、映像オブジェクトのフレーム内における重心の位置座標を検出すると同時に、個々の映像オブジェクトに固有の識別子を割り当て、この位置座標、識別子、並びに映像オブジェクトを検出した時刻を映像特徴量としてコンテンツ生成手段３へ出力するものとする。
【００２９】
なお、映像オブジェクトに識別子を割り当てるのは、映像オブジェクトの位置座標に基づいて、例えば、画面上で左から表示される順番に連番を付けることも可能である。あるいは、映像オブジェクトが人物の場合、一般的な顔認識の技術によって、連番の代わりに人物名を識別子として用いることも可能である。
【００３０】
映像オブジェクト特徴量抽出部（映像オブジェクト特徴量抽出手段）２２は、映像シーンから、映像オブジェクト位置検出部２１で検出された映像オブジェクトの映像オブジェクト特徴量を抽出するものである。この映像オブジェクト特徴量は、明るさ（輝度値）、色味（色特徴量）、動き（動きベクトル量）、テクスチャ、形状パラメータ等の映像オブジェクト毎の特徴量である。この映像オブジェクト特徴量抽出部２２は、この映像オブジェクト特徴量とその映像オブジェクト固有の識別子を映像特徴量としてコンテンツ生成手段３へ出力するものである。
なお、映像オブジェクトの検出や特徴量の抽出は、本願出願人において「動画像のオブジェクト抽出装置（特開２００１−３０７１０４）」又は「映像オブジェクト検出・追跡装置（特願２００１−１６６５２５）」として開示されている技術を用いて実現することができる。
【００３１】
映像シーン特徴量抽出部２３は、映像シーンから、フレーム毎の映像特徴量を抽出するものである。この映像シーン特徴量抽出部２３は、フレーム全体の特徴や、映像オブジェクト特徴量抽出部２２で抽出した個々の映像オブジェクト特徴量を統計した情報をフレームの映像特徴量としてコンテンツ生成手段３へ出力するものである。
例えば、映像シーン特徴量抽出部２３は、フレームの各画素の輝度値を、フレーム全体に渡って平均をとったフレームの平均輝度値や、フレーム内の映像オブジェクトの数等を映像特徴量として出力する。
【００３２】
コンテンツ生成手段３は、コンテンツ記述言語生成手段４、画像合成手段５及び音声合成手段６を備え、映像シーン解析手段２から入力される映像特徴量から、映像シーンに関連する情報を映像関連コンテンツとして出力するものである。
【００３３】
コンテンツ記述言語生成手段４は、文字列変換データベース４１と、特徴量文字列変換部４２と、テンプレート文字列置換部４３とを備え、外部から入力されるコンテンツ記述言語のテンプレート（コンテンツ記述言語テンプレート４４ａ）の置換対象文字列を、映像シーン解析手段２で解析され、抽出された映像特徴量に対応する置換文字列に置換して、映像シーンに関連するコンテンツ記述言語で書かれたコンテンツを生成するものである。
【００３４】
このコンテンツ記述言語には、例えば、ＨＴＭＬ、ＶＲＭＬ、ＢＭＬ、ＲｅａｌＡｕｄｉｏメタファイル等がある。ここでは、ＨＴＭＬを代表して説明を行うが、他のコンテンツ記述言語においても同様の構成で実現することが可能である。
【００３５】
文字列変換データベース４１は、映像特徴量を文字列として表現するための変換ルール（文字列変換ルール４１ａ）を蓄積したデータベースで、映像特徴量の種類及びその数値と、コンテンツ記述言語テンプレート４４ａに記述された置換対象文字列と、その置換対象文字列を置換する置換文字列とを対応付けて蓄積したものである。
【００３６】
ここで、図２及び図３を参照して、コンテンツ記述言語テンプレート４４ａ及び文字列変換ルール４１ａについて説明する。図２は、コンテンツ記述言語テンプレート４４ａの一例を示すＨＴＭＬで記述したテンプレート（雛型）であり、図３は、文字列変換ルール４１ａの内容の一例を示す図である。
【００３７】
図２に示すように、コンテンツ記述言語テンプレート４４ａは、コンテンツ記述言語（ここではＨＴＭＬ）で記述したテキストファイルであり、入力される映像シーンの内容に関連する部分を置換対象文字列として記述しておき、あとからその置換対象文字列を置換することができるテンプレートである。ここでは、「部屋の情景」を説明するＨＴＭＬのテンプレートを例としており、置換対象文字列４４ｂとして、「」を用い、映像シーンの明るさに関する映像特徴量に基づいて文字列を置換する領域を示している。また、置換対象文字列４４ｃとして、「」を用い、映像シーン内の数に関する映像特徴量に基づいて文字列を置換する領域を示している。
【００３８】
図３の文字列変換ルール４１ａでは、映像特徴量の種類として、フレームの各画素の輝度値を、フレーム全体に渡って平均をとったフレームの平均輝度値（輝度値の平均値）と、フレーム内の映像オブジェクトの数（オブジェクトの個数）を用い、その映像特徴量の値に置換対象文字列と置換文字列（特徴量文字列）とを対応付けている。
【００３９】
図３（ａ）では、例えば、映像を構成する画素の輝度値を０から２５５の２５６値で表したとき、映像特徴量の値である輝度値の平均値が、「９０未満」の場合は、置換対象文字列が「」、置換文字列が「暗い」であることを示している。これによって、輝度値の平均値が、「９０未満」の場合は、コンテンツ記述言語テンプレート４４ａ（図２）の置換対象文字列４４ｂが「暗い」に置換される。また、オブジェクトの個数が「２以上」の場合は、コンテンツ記述言語テンプレート４４ａの置換対象文字列４４ｃは、「たくさんあります」に置換される。
【００４０】
図３（ｂ）では、図３（ａ）の置換文字列を「暗い」、「たくさんあります」等の日本語文字列で表すのではなく、「<img src="1.png">」等のＨＴＭＬの埋め込み画像として指定する場合の例を示している。このように、置換文字列は、日本語文字列だけではなく画像ファイル、音声ファイル、スクリプト等のファイル名をコンテンツ記述言語に埋め込む置換文字列として記述することとしてもよい。なお、このスクリプトには、ＪａｖａＳｃｒｉｐｔ（登録商標）、ＥＣＭＡＳｃｒｉｐｔ等がある。
図１に戻って説明を続ける。
【００４１】
特徴量文字列変換部（文字列埋め込み手段）４２は、映像シーン解析手段２から入力された映像特徴量に基づいて、文字列変換データベース４１の文字列変換ルール４１ａを参照し、その映像特徴量に対応する置換対象文字列と、置換文字列とをテンプレート文字列置換部４３へ通知するものである。
【００４２】
テンプレート文字列置換部（文字列埋め込み手段）４３は、外部から入力されるコンテンツ記述言語テンプレート４４ａと、特徴量文字列変換部４２から通知される置換対象文字列及び置換文字列とに基づいて、コンテンツ記述言語テンプレート４４ａに記述されている置換対象文字列を置換文字列に置換することで、ＨＴＭＬファイル等のコンテンツ記述言語を生成するものである。
【００４３】
なお、置換文字列で映像オブジェクトの位置を表す場合には、その位置座標の時刻毎の位置座標リストを置換対象文字列としたＶＲＭＬのＰｏｓｉｔｉｏｎＩｎｔｅｒｐｏｌａｔｏｒノードを用いて記述することも可能である。
【００４４】
画像合成手段５は、位置提示画像合成部５１と、画像出力部５２とを備え、映像シーン解析手段２で解析され、抽出された映像特徴量に関連する画像を合成して画像ファイル又は映像ストリームとして出力するものである。
【００４５】
位置提示画像合成部５１は、映像シーン解析手段２の映像オブジェクト位置検出部２１から、映像オブジェクトの位置座標、識別子並びに検出時刻を映像特徴量として入力し、その識別子で区別された映像オブジェクトがある時刻においてどの位置に存在していたかを提示する位置提示画像を合成するものである。ここで合成された画像は画像出力部５２へ出力される。
【００４６】
例えば、予めアイコン画像を蓄積した画像蓄積手段（図示せず）から、アイコン画像を読み込んで、無地の画像上の位置座標で示される位置にアイコン画像を合成する。また、例えば、映像シーン解析手段２の映像オブジェクト特徴量抽出部２２で抽出される映像オブジェクトの画像をそのままアイコン画像として合成することとしてもよい。
【００４７】
画像出力部５２は、位置提示画像合成部５１で合成された画像を画像ファイルとして出力するものである。なお、画像出力部５２は、位置提示画像合成部５１から画像が時系列に入力される場合は、その時系列画像を映像オブジェクトが時刻によって変化する映像ストリームとして出力する。
【００４８】
音声合成手段６は、音声データ蓄積部６１と、音声選択部６２と、音声出力部６３とを備え、映像シーン解析手段２で解析され、抽出された映像特徴量に関連する音声を音声ファイル又は音声ストリームとして出力するものである。
【００４９】
音声データ蓄積部（音声データ蓄積手段）６１は、予め映像シーンに関連する音声データ６１ａを識別番号に対応付けて蓄積しておくものであり、ハードディスク等で構成されるものである。この音声データ６１ａは、映像シーンに関連して映像シーンを表現するための音声データであり、例えば、ＢＧＭ（Back Ground Music）、効果音、人の声等である。
また、この音声データ蓄積部６１は、映像シーンの映像特徴量に基づいた音声データ６１ａを複数保持している。例えば、輝度値に対応付けて「明るさ」のレベルを表現する音声データ６１ａを音声ファイルとして保持している。
【００５０】
音声選択部（音声選択手段）６２は、映像シーン解析手段２から入力される映像シーンの映像特徴量に基づいて、音声データ蓄積部６１に蓄積されている音声データ６１ａを選択して、音声出力部６３へ音声データ６１ａの識別番号を通知するものである。
この音声選択部６２は、映像シーンの映像特徴量（例えば輝度値の平均値）から、映像シーンを表現する音声データ蓄積部６１に蓄積されている音声データ６１ａの識別番号を音声出力部６３へ通知する。例えば、音声選択部６２は、輝度値の平均値に基づいて、「明るさ」のレベルを判定して、その「明るさ」に対応する音声データ６１ａを選択する。あるいは、映像オブジェクトの位置座標に基づいて、予め設定された領域に映像オブジェクトが入ったときに、特定の音声データ６１ａを選択することとしてもよい。
【００５１】
音声出力部（音声出力手段）６３は、音声選択部６２で選択され、識別番号で通知された音声データ蓄積部６１内の音声データ６１ａを読み込んで、音声ファイル又は音声ストリームとして出力するものである。
このように、コンテンツ記述言語生成手段４から出力されるコンテンツ記述言語（ＨＴＭＬファイル等）、画像合成手段５から出力される画像ファイル（又は映像ストリーム）、音声合成手段６から出力される音声ファイル（又は音声ストリーム）は、個々に出力する形態であっても構わないし、複数の出力を映像関連コンテンツとして出力する形態であっても構わない。
【００５２】
以上、一実施形態に基づいて、映像関連コンテンツ生成装置１の構成について説明したが、映像関連コンテンツ生成装置１は、コンピュータにおいて各手段を各機能プログラムとして実現することも可能であり、各機能プログラムを結合して映像関連コンテンツ生成プログラムとして動作させることも可能である。
【００５３】
（映像関連コンテンツ生成装置の動作：コンテンツ記述言語生成例）
次に、映像関連コンテンツ生成装置１の動作について説明する。
まず、図１及び図７を参照して、映像関連コンテンツ生成装置１がコンテンツ記述言語を生成する動作例について説明する。図７は、映像関連コンテンツ生成装置１がコンテンツ記述言語を生成する動作を示すフローチャートである。
【００５４】
映像関連コンテンツ生成装置１は、入力された映像信号から、映像シーン解析手段２が、映像信号を解析し、映像特徴量を抽出する（ステップＳ１１）。
【００５５】
そして、コンテンツ記述言語生成手段４において、特徴量文字列変換部４２が文字列変換データベース４１内の文字列変換ルール４１ａに基づいて、ステップＳ１１で抽出した映像特徴量に対応する置換対象文字列及び置換文字列を検索して、テンプレート文字列置換部４３に通知する（ステップＳ１２）。
【００５６】
置換対象文字列及び置換文字列を通知されたテンプレート文字列置換部４３は、外部からコンテンツ記述言語テンプレート４４ａを読み込み（ステップＳ１３）、コンテンツ記述言語テンプレート４４ａ内の置換対象文字列を検索する（ステップＳ１４）。
そして、置換対象文字列が存在するかどうかを判定し（ステップＳ１５）、存在する場合（Ｙｅｓ）は、置換対象文字列を置換文字列に置き換えて（ステップＳ１６）、ステップＳ１４に戻ってさらに置換対象文字列を検索する。
【００５７】
一方、置換対象文字列が存在しない場合（ステップＳ１５でＮｏ）は、置換対象文字列をすべて置換文字列に置き換えたものとして、その置換文字列に置き換えたコンテンツ記述言語（ＨＴＭＬファイル等）を出力して（ステップＳ１７）、動作を終了する。
以上のステップによって、映像信号からその映像内容に関連する情報を、コンテンツ記述言語で記述したテキストベースのコンテンツを生成することができる。
【００５８】
次に、図４を参照して、コンテンツ記述言語生成手段４を中心にして、コンテンツ記述言語生成の具体的な動作について説明する。図４は、映像特徴量からＨＴＭＬファイルを生成する例を示す概念図である。
【００５９】
コンテンツ記述言語生成手段４は、まず、映像特徴量２ａを入力する。ここで、映像特徴量２ａとして、輝度値の平均値が「１００」、オブジェクトの個数が「１」であったとすると、コンテンツ記述言語生成手段４は、文字列変換ルール４１ａ（図３（ａ））を参照して、輝度値の平均値「１００」及びオブジェクトの個数「１」に対応する置換対象文字列及び置換文字列を検索し、輝度値の平均値「１００」に対応する置換対象文字列「」並びに置換文字列「薄暗い」と、オブジェクトの個数「１」に対応する置換対象文字列「<!--number-->」並びに置換文字列「一つあります」とを得る。
【００６０】
そして、コンテンツ記述言語生成手段４は、外部から入力されるコンテンツ記述言語テンプレート４４ａ（図２）の置換対象文字列「」並びに「」をそれぞれ「薄暗い」並びに「一つあります」に変換することにより、「部屋の情景」を説明するＨＴＭＬファイル４ａを生成する。
【００６１】
（映像関連コンテンツ生成装置の動作：画像合成例）
次に、図１及び図８を参照して、映像関連コンテンツ生成装置１が映像信号に関連する画像を合成して出力する動作例について説明する。図８は、映像関連コンテンツ生成装置１が合成画像を時系列化した映像ストリームを生成する動作を示すフローチャートである。
【００６２】
まず、映像関連コンテンツ生成装置１は、入力された映像信号に基づいて、映像シーン解析手段２が、映像信号を解析し、映像特徴量である映像シーンの時刻、映像オブジェクトの位置座標を抽出する（ステップＳ２１）。
そして、画像合成手段５において、位置提示画像合成部５１が映像シーンのある時刻における映像オブジェクトの位置座標にアイコン画像を合成した合成画像を生成する（ステップＳ２２）。
【００６３】
映像シーンの全ての時刻における合成画像の生成を完了したかどうかを判定し（ステップＳ２３）、まだ、完了していない場合（Ｎｏ）は、ステップＳ２２へ戻って、次の時刻の合成画像を生成する。一方、すべての時刻における合成画像の生成を完了した場合（Ｙｅｓ）は、映像シーンの時刻毎に生成した合成画像を映像ストリームとして出力して（ステップＳ２４）、動作を終了する。
以上のステップによって、映像信号から、映像オブジェクトの位置のみを視覚化した映像ストリームとして生成することができる。
【００６４】
次に、図５及び図６を参照して、画像合成手段５を中心にして、映像信号から画像ファイル又は映像ストリームを生成する具体的な動作について説明する。図５は、映像特徴量から同一画像上に時系列に変化する映像オブジェクトの位置を視覚化した、画像ファイルを生成する例を示す概念図である。図６は、映像特徴量から時系列に変化する映像オブジェクトの位置を別々の画像ファイルとして生成する、又は、映像ストリームとして生成する例を示す概念図である。
【００６５】
図５に示すように、画像合成手段５は、まず、映像特徴量２ａを入力する。ここで、映像特徴量２ａとして、時刻が「１」、「２」、「３」及び「４」、その時刻に対応する映像オブジェクトの位置座標が「（１，１）」、「（１，２）」、「（２，３）」及び「（３，２）」であったとする。
【００６６】
画像合成手段５は、まず、無地画像の位置座標「（１，１）」、「（１，２）」、「（２，３）」及び「（３，２）」にアイコン画像Ｃ１、Ｃ２、Ｃ３及びＣ４（Ｃ４のみ異なるアイコン画像を使用）を合成し、各アイコン画像（Ｃ１〜Ｃ４）間を直線で結んだ画像ファイル５ａを生成する。これによって、映像オブジェクトがＣ１の位置座標からＣ４の位置座標へ移動したことを表現することができる。
【００６７】
また、図６に示した例では、図５と同様の映像特徴量２ａを入力しているが、画像合成手段５の出力が、一枚の画像ファイルではなく、複数の画像ファイルあるいは映像ストリームとしているところが異なっている。
図６において、画像合成手段５は、時刻「１」から「４」に対応する無地画像の位置座標「（１，１）」、「（１，２）」、「（２，３）」及び「（３，２）」にアイコン画像Ｃ４を合成し、４枚の画像を生成する。なお、画像合成手段５は、この４枚の画像を個々の画像ファイル５ａとして出力することも可能であるし、個々の画像を連続したストリームデータとした映像ストリーム５ｂとして出力することも可能である。
【００６８】
（映像関連コンテンツ生成装置の動作：音声合成例）
次に、図１及び図９を参照して、映像関連コンテンツ生成装置１が映像信号に関連する音声を合成して出力する動作例について説明する。図９は、映像関連コンテンツ生成装置１が音声データを出力する動作を示すフローチャートである。
【００６９】
映像関連コンテンツ生成装置１は、一定の時間間隔又はカット点検出技術により得られるカット点のタイミングに基づいて、映像シーン解析手段２が、映像信号を解析し、映像特徴量を抽出する（ステップＳ３１）。
【００７０】
そして、音声合成手段６において、音声選択部６２が映像特徴量の値に応じて、映像シーンを表現する音声データ蓄積部６１に蓄積されている音声データ６１ａを選択し、その識別番号を音声出力部６３へ通知する（ステップＳ３２）。
その識別番号を通知された音声出力部６３は、識別番号に基づいて選択された音声データ６１ａを音声データ蓄積部６１から読み込んで、音声ファイル又は音声ストリームとして出力する（ステップＳ３３）。
以上のステップによって、映像信号からその映像内容に関連する音声データを、音声ファイル又は音声ストリームとして出力することができる。
【００７１】
【発明の効果】
以上説明したとおり、本発明に係る映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムでは、以下に示す優れた効果を奏する。
【００７２】
請求項１、請求項６又は請求項７に記載の発明によれば、入力された映像信号から、映像シーンの解析を行い、その映像の内容に関連する座標情報、音声情報、画像情報等の異種情報に変換した映像関連コンテンツを生成することができる。これによって、今まで音声認識による字幕作成データに限られていた自然入力データに基づく自動コンテンツ制作を、映像入力においても適用することが可能になる。
また、請求項１、請求項６又は請求項７に記載の発明によれば、テンプレート化したＨＴＭＬ等のコンテンツ記述言語から、映像シーンの内容に関連した情報をコンテンツ記述言語として生成することができ、定型化したコンテンツをテンプレート化して準備しておくことで、コンテンツ制作の制作時間の短縮を行うことが可能になる。
【００７３】
請求項２又は請求項３に記載の発明によれば、映像内の映像オブジェクトを検出し、映像特徴量として抽出することができる。そして、映像オブジェクト毎の位置情報や特徴量を、視覚化したテキスト、音声、画像等によって表現することができるため、映像のデータ量を削減したコンテンツを生成することが可能になる。また、ＷＷＷや携帯端末で使用可能なコンテンツを生成することができ、データのアクセシビリティを向上させることができる。
【００７５】
請求項４に記載の発明によれば、映像オブジェクト毎の位置を、他の画像等によって表現することができるため、映像のデータ量を削減したコンテンツを生成することが可能になる。また、ＷＷＷや携帯端末で使用可能なコンテンツを生成することができ、データのアクセシビリティを向上させることができる。
【００７６】
請求項５に記載の発明によれば、映像シーンに適する音声を適宜出力することができるため、映像だけでは表現できない効果を演出することが可能になる。これによって、コンテンツ制作にかける労力を低減させることができる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る映像関連コンテンツ生成装置の全体構成を示すブロック図である。
【図２】コンテンツ記述言語のテンプレートの例を説明する説明図である。
【図３】文字列変換ルールの例を説明するための説明図である。
【図４】本発明の実施の形態に係るコンテンツ記述言語生成手段の動作例を説明するための説明図である。
【図５】本発明の実施の形態に係る画像合成手段の動作を模式的に示した模式図（その１）である。
【図６】本発明の実施の形態に係る画像合成手段の動作を模式的に示した模式図（その２）である。
【図７】本発明の実施の形態に係る映像からコンテンツ記述言語を生成する動作を示すフローチャートである。
【図８】本発明の実施の形態に係る映像から合成画像を生成する動作を示すフローチャートである。
【図９】本発明の実施の形態に係る映像から合成音声を生成する動作を示すフローチャートである。
【符号の説明】
１……映像関連コンテンツ生成装置
２……映像シーン解析手段
２１……映像オブジェクト位置検出部（映像オブジェクト位置検出手段）
２２……映像オブジェクト特徴量抽出部（映像オブジェクト特徴量抽出手段）
２３……映像シーン特徴量抽出部
３……コンテンツ生成手段
４……コンテンツ記述言語生成手段
４１……文字列変換データベース
４２……特徴量文字列変換部（文字列埋め込み手段）
４３……テンプレート文字列置換部（文字列埋め込み手段）
５……画像合成手段
５１……位置提示画像合成部
５２……画像出力部
６……音声合成手段
６１……音声データ蓄積部（音声データ蓄積手段）
６２……音声選択部（音声選択手段）
６３……音声出力部（音声出力手段）
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a video-related content generation device, a video-related content generation method, and a video-related content generation program for generating video-related content that allows video to be supplied to media other than video media.
[0002]
[Prior art]
Conventionally, when content that can be presented on media other than video media (such as television broadcasting, which can present video content as it is), for example data broadcasting, the WWW (World Wide Web), or mobile terminals, is produced from video (video content) and distributed, the video content for those other media is produced by cropping a region of a specific size from the original video or by converting the transmission frame rate or the like. With this approach, video content is only ever converted into content of basically the same type (video), differing from the original only in resolution.
[0003]
Similarly, when content to be presented by teletext or the like is produced from audio (audio content), a conventional method converts the original audio into text information by speech recognition to obtain text content. For audio content, then, conversion into a different type of content, namely text information, is already performed.
[0004]
[Problems to be solved by the invention]
However, in the conventional technology, conversion to content for a different medium is mainly resolution conversion from video to video, and the video content is basically the same before and after the conversion except for the difference in resolution. Conversion to a different type of content, in turn, is mainly conversion from audio content to text content based on speech recognition. In other words, no method has been considered for converting video into content other than video content.
[0005]
For this reason, when producing content related to video content, such as text-based content written in a content description language used on the WWW or the like, audio content, or image content that is related to the video content but contains different images, the video content cannot be used and production has to start from scratch.
[0006]
The present invention has been made in view of the above problems, and its object is to provide a video-related content generation device, a video-related content generation method, and a video-related content generation program that make it possible to generate video-related content in which video (video content) is converted into heterogeneous information related to the content of the video, such as coordinate information, audio information, and image information.
[0007]
[Means for Solving the Problems]
The present invention was created to achieve the above object. First, the video-related content generation device according to claim 1 is a video-related content generation device that generates, from a video signal, information related to the video content of the video signal as video-related content, and comprises: video scene analysis means for analyzing the video signal and extracting a video feature amount; and content generation means for generating the video-related content by converting the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data. The content generation means comprises: a character string conversion database that stores video feature amounts in association with feature amount character strings expressing those video feature amounts as character strings; and character string embedding means for embedding, based on the video feature amount, the feature amount character string in the text region of a content description language in which the text region for embedding the feature amount character string has been made into a template.
[0008]
According to this configuration, the video-related content generation device extracts a video feature amount from the video signal by the video scene analysis means. The content generation means then converts the video feature amount into at least one of text data, image data, and audio data to generate the video-related content.
Further, by the character string embedding means, the video-related content generation device refers to the character string conversion database and embeds the feature amount character string corresponding to the video feature amount in the text region of a content description language in which the text region for embedding the feature amount character string has been made into a template.
[0009]
Here, the video feature amount is a quantity that characterizes the frames constituting a video scene, for example a numerical value representing brightness (luminance value), color (color feature amount), motion (motion vector amount), texture, the position coordinates of a video object, or the number of video objects, or a statistic of such values.
[0010]
When the video feature amount is converted into text data, it is convenient to convert it into a text-based content description language, because the content can then be played back by a playback device for that content description language. Examples of such content description languages include HTML (HyperText Markup Language), VRML (Virtual Reality Modeling Language), BML (Broadcast Markup Language), and RealAudio metafiles.
In addition, a predetermined replacement target character string corresponding to the video feature amount is written in the text region of the content description language, and the character string conversion database associates the replacement target character string with a feature amount character string (replacement character string) for each type of video feature amount and each value of that video feature amount. The replacement target character string, which constitutes the text region of the content description language, can thereby be easily replaced with the feature amount character string.
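As a rough illustration of the replacement scheme just described, the following Python sketch looks up a replacement character string for each video feature value and substitutes it into a template. The threshold of 90 and the split between one object and two or more objects follow the example rules given for FIG. 3(a). Everything else is an assumption made for this sketch: the placeholder strings `<!--brightness-->` and `<!--number-->`, the upper threshold of 170, the English replacement strings, and the names `RULES`, `lookup_replacements`, and `fill_template`.

```python
from typing import Callable, Dict, List, Tuple

# One rule: (feature name, predicate on the value, placeholder, replacement).
Rule = Tuple[str, Callable[[float], bool], str, str]

# Hypothetical stand-in for the character string conversion database.
# Placeholders, the 170 threshold, and the strings are illustrative only.
RULES: List[Rule] = [
    ("mean_luminance", lambda v: v < 90, "<!--brightness-->", "dark"),
    ("mean_luminance", lambda v: 90 <= v < 170, "<!--brightness-->", "dim"),
    ("mean_luminance", lambda v: v >= 170, "<!--brightness-->", "bright"),
    ("object_count", lambda v: v == 0, "<!--number-->", "no objects"),
    ("object_count", lambda v: v == 1, "<!--number-->", "one object"),
    ("object_count", lambda v: v >= 2, "<!--number-->", "many objects"),
]


def lookup_replacements(features: Dict[str, float]) -> Dict[str, str]:
    """Return {placeholder: replacement} for the given feature values."""
    found: Dict[str, str] = {}
    for name, matches, placeholder, replacement in RULES:
        if name in features and matches(features[name]):
            # Keep the first matching rule for each placeholder.
            found.setdefault(placeholder, replacement)
    return found


def fill_template(template: str, features: Dict[str, float]) -> str:
    """Replace every placeholder that has a matching rule."""
    for placeholder, replacement in lookup_replacements(features).items():
        template = template.replace(placeholder, replacement)
    return template


TEMPLATE = "<p>The room is <!--brightness-->. It contains <!--number-->.</p>"
html = fill_template(TEMPLATE, {"mean_luminance": 100, "object_count": 1})
```

Keeping the rules as data rather than code mirrors the role of the database: the same template can be reused with a different rule set, for instance one whose replacement strings are image tags instead of words.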
[0011]
The video-related content generation device according to claim 2 is the video-related content generation device according to claim 1, wherein the video scene analysis means comprises video object position detection means for detecting, as the video feature amount, the position coordinates of a video object included in the video scene.
[0012]
According to this configuration, the video-related content generation device detects, by the video object position detection means, the position coordinates of a video object included in the video scene as the video feature amount. The position coordinates may be those of a specific point of the video object (for example, its upper-left corner or its center), or the coordinates of the centroid of the video object.
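The three choices of position coordinate mentioned here can be compared with a small Python sketch. It assumes, purely for illustration, that a video object is available as a binary mask (a list of rows, nonzero meaning "object pixel") and that the "center" is the center of the bounding box; the specification fixes neither the representation nor that definition, and the function name `object_positions` is hypothetical.

```python
from typing import Dict, List, Tuple


def object_positions(mask: List[List[int]]) -> Dict[str, Tuple[float, float]]:
    """Return candidate (x, y) position coordinates for one object mask."""
    pixels = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not pixels:
        raise ValueError("mask contains no object pixels")
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return {
        # Upper-left corner of the bounding box.
        "upper_left": (min(xs), min(ys)),
        # Center of the bounding box (one possible reading of "center").
        "center": ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2),
        # Centroid: mean position of all object pixels.
        "centroid": (sum(xs) / len(xs), sum(ys) / len(ys)),
    }


# An L-shaped object: the centroid is pulled away from the box center.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
pos = object_positions(mask)
```

For a symmetric object the center and the centroid coincide; for an irregular shape such as the one above they differ, which is why the choice matters when the coordinate is later used to place an icon.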
[0013]
Furthermore, the video-related content generation device according to claim 3 is the video-related content generation device according to claim 1 or claim 2, wherein the video scene analysis means comprises video object feature amount extraction means for extracting, as the video feature amount, a feature amount that characterizes a video object included in the video scene.
[0014]
According to this configuration, the video-related content generation device extracts, by the video object feature amount extraction means, the video feature amounts that characterize the video objects included in the video scene. These video feature amounts (video object feature amounts) are per-object feature amounts such as brightness (luminance value), color (color feature amount), motion (motion vector amount), and texture.
[0017]
The video-related content generation device according to claim 4 is the video-related content generation device according to claim 2, wherein the content generation means comprises image synthesis means for compositing image data related to the video object at the position coordinates of the video object detected by the video object position detection means.
[0018]
According to this configuration, the video-related content generation device generates content that visualizes the position coordinates of the video object by compositing, by the image synthesis means, image data at the position coordinates of the video object detected by the video object position detection means.
[0019]
Furthermore, the video-related content generation device according to claim 5 is the video-related content generation device according to any one of claims 1 to 4, wherein the content generation means comprises: audio data storage means that stores a plurality of pieces of audio data in association with video feature amounts; audio selection means for selecting, based on the video feature amount, audio data stored in the audio data storage means; and audio output means for outputting the audio data selected by the audio selection means.
[0020]
According to this configuration, in the video-related content generation device, the audio selection means selects audio data based on the video feature amount from the audio data storage means, which stores a plurality of pieces of audio data in association with video feature amounts.
Here, the audio data stored in the audio data storage means is associated with values of the video feature amount. For example, when the brightness of the video, given by a luminance value or the like, is used as the video feature amount, cheerful music is associated with bright video. Alternatively, when the intensity of motion, given by the amount of movement of a video object, is used as the video feature amount, fast-tempo music can be associated with video in which the video object moves vigorously.
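A minimal Python sketch of such a selection is shown below. It is only one possible arrangement: the identification numbers, the file names, the two thresholds, the precedence of motion over brightness, and the names `AUDIO_STORE` and `select_audio_id` are all assumptions introduced for this example.

```python
from typing import Dict

# Hypothetical audio data store: identification number -> audio file name.
AUDIO_STORE: Dict[int, str] = {
    1: "calm.wav",
    2: "cheerful.wav",
    3: "fast_tempo.wav",
}

# Illustrative thresholds; the specification does not give concrete values.
BRIGHT_THRESHOLD = 170.0   # mean luminance on a 0-255 scale
MOTION_THRESHOLD = 8.0     # object movement in pixels per frame


def select_audio_id(mean_luminance: float, motion_magnitude: float) -> int:
    """Pick the identification number of the audio data to output."""
    if motion_magnitude >= MOTION_THRESHOLD:
        return 3   # vigorous motion -> fast-tempo music
    if mean_luminance >= BRIGHT_THRESHOLD:
        return 2   # bright scene -> cheerful music
    return 1       # otherwise -> calm background music


def output_audio(audio_id: int) -> str:
    """Stand-in for the audio output step: return the selected file name."""
    return AUDIO_STORE[audio_id]


chosen = output_audio(select_audio_id(mean_luminance=200.0, motion_magnitude=1.0))
```

Separating selection (which returns only an identification number) from output (which reads the stored data) keeps the selection logic independent of how and where the audio is stored.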
[0021]
The video-related content generation method according to claim 6 is a video-related content generation method for generating, from a video signal, information related to the video content of the video signal as video-related content, and includes: a video scene analysis step of analyzing a video scene of the video signal and extracting a video feature amount; a character string search step of searching a character string conversion database, which stores video feature amounts in association with feature amount character strings expressing those video feature amounts as character strings, for the feature amount character string corresponding to the video feature amount extracted in the video scene analysis step; a content description language input step of inputting a content description language in which a text region for embedding the feature amount character string has been made into a template; and a character string embedding step of embedding, based on the video feature amount, the feature amount character string found in the character string search step in the text region of the content description language.
[0022]
According to this method, the video scene analysis step extracts a video feature amount, which is a quantity characterizing the frames constituting a video scene, and the character string search step searches the character string conversion database, which stores video feature amounts in association with feature amount character strings expressing those video feature amounts as character strings, for the feature amount character string corresponding to the video feature amount extracted in the video scene analysis step.
Then, the content description language input step inputs a content description language in which a text region for embedding the feature amount character string has been made into a template, and the character string embedding step embeds the feature amount character string in the text region of the content description language to generate the video-related content.
[0023]
Furthermore, the video-related content generation program according to claim 7 causes a computer, in order to generate, from a video signal, information related to the video content of the video signal as video-related content, to function as: video scene analysis means for analyzing a video scene of the video signal and extracting a video feature amount; and content generation means for generating the video-related content by converting the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data. The content generation means refers to a character string conversion database that stores video feature amounts in association with feature amount character strings expressing those video feature amounts as character strings, and, based on the video feature amount, embeds the feature amount character string in the text region of a content description language in which the text region for embedding the feature amount character string has been made into a template.
[0024]
According to this configuration, the video-related content generation program extracts, by the video scene analysis means, a video feature amount that is a quantity characterizing the frames constituting a video scene, and converts, by the content generation means, the video feature amount into at least one of text data, image data, and audio data to generate the video-related content.
In the video-related content generation program, the content generation means refers to the character string conversion database, which stores video feature amounts in association with feature amount character strings expressing those video feature amounts as character strings, and, based on the video feature amount, embeds the feature amount character string in the text region of a content description language in which the text region for embedding the feature amount character string has been made into a template.
[0025]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(Configuration of video-related content generation device)
FIG. 1 is a block diagram showing the configuration of a video-related content generation device according to the present invention. As shown in FIG. 1, the video-related content generation device 1 analyzes the video scenes of an input video (video signal) to extract feature amounts of those video scenes (video feature amounts) and, based on the extracted video feature amounts, generates as video-related content: content in which information related to the video content is described in a content description language; an image file or video stream; and an audio file or audio stream.
[0026]
The video-related content generation device 1 comprises video scene analysis means 2 and content generation means 3, the latter including content description language generation means 4, image synthesis means 5, and audio synthesis means 6.
[0027]
The video scene analysis means 2 comprises a video object position detection unit 21, a video object feature amount extraction unit 22, and a video scene feature amount extraction unit 23, and analyzes the input video signal to extract video feature amounts.
[0028]
The video object position detection unit (video object position detection means) 21 detects the video objects included in a video scene. Here, the video object position detection unit 21 detects the position coordinates of the centroid of each video object within the frame, at the same time assigns a unique identifier to each video object, and outputs the position coordinates, the identifier, and the time at which the video object was detected to the content generation means 3 as video feature amounts.
[0029]
Identifiers can be assigned to the video objects based on their position coordinates, for example as serial numbers in the order in which the objects appear on the screen from the left. Alternatively, when a video object is a person, the person's name can be used as the identifier instead of a serial number by means of general face recognition technology.
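The left-to-right numbering option can be sketched in a few lines of Python. The function name `assign_identifiers`, the use of (x, y) tuples, and the tie-break on the y coordinate are assumptions made for this illustration; the specification only states that serial numbers may follow the on-screen order from the left.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def assign_identifiers(centroids: List[Point]) -> Dict[int, Point]:
    """Number objects 1, 2, 3, ... from left to right by x coordinate.

    Objects with the same x coordinate are ordered by y (an assumed tie-break).
    """
    ordered = sorted(centroids, key=lambda p: (p[0], p[1]))
    return {number: point for number, point in enumerate(ordered, start=1)}


ids = assign_identifiers([(30.0, 5.0), (10.0, 8.0), (20.0, 1.0)])
```

Note that numbering by position alone gives stable identifiers only while the objects keep their left-to-right order; tracking objects across frames would need additional logic that is outside this sketch.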
[0030]
The video object feature quantity extraction unit (video object feature quantity extraction means) 22 extracts the video object feature quantity of the video object detected by the video object position detection unit 21 from the video scene. This video object feature amount is a feature amount for each video object, such as brightness (luminance value), color (color feature amount), motion (motion vector amount), texture, and shape parameter. The video object feature quantity extraction unit 22 outputs the video object feature quantity and an identifier unique to the video object to the content generation unit 3 as a video feature quantity.
The detection of video objects and the extraction of their feature quantities can be realized using technology disclosed by the applicant of the present application in “Moving image object extraction device” (Japanese Patent Application Laid-Open No. 2001-307104) or “Video object detection / tracking device” (Japanese Patent Application No. 2001-166525).
[0031]
The video scene feature quantity extraction unit 23 extracts a video feature quantity for each frame from the video scene. It outputs information about the characteristics of the entire frame, together with the individual video object feature quantities extracted by the video object feature quantity extraction unit 22, to the content generation unit 3 as the video feature quantities of the frame.
For example, the video scene feature quantity extraction unit 23 outputs, as video feature quantities, the average luminance value of the frame (the luminance values of its pixels averaged over the entire frame), the number of video objects in the frame, and the like.
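The two frame-level feature quantities named above can be sketched as follows. This is an illustration only: the frame is assumed to be a 2-D list of luminance values, and the name `frame_features` and the dictionary keys are hypothetical.

```python
def frame_features(luma, objects):
    """Return frame-level video feature quantities.

    luma:    2-D list of per-pixel luminance values (0-255), one list per row.
    objects: list of detected video objects (any representation).
    """
    pixel_count = sum(len(row) for row in luma)
    total = sum(sum(row) for row in luma)
    return {
        "average_luminance": total / pixel_count,  # mean over the entire frame
        "object_count": len(objects),              # number of video objects
    }


features = frame_features([[0, 100], [200, 100]], ["person"])
```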
[0032]
The content generation unit 3 includes a content description language generation unit 4, an image synthesis unit 5, and a voice synthesis unit 6. From the video feature amounts input from the video scene analysis unit 2, it outputs information related to the video scene as video-related content.
[0033]
The content description language generation unit 4 includes a character string conversion database 41, a feature amount character string conversion unit 42, and a template character string replacement unit 43. It replaces the replacement target character strings in a content description language template (content description language template 44a) input from the outside with replacement character strings corresponding to the extracted video feature amounts, thereby generating content that is written in a content description language and related to the video scene.
[0034]
Examples of the content description language include HTML, VRML, BML, and RealAudio metafiles. The description here uses HTML as the representative example, but the same configuration can be used for other content description languages.
[0035]
The character string conversion database 41 is a database that stores conversion rules (character string conversion rules 41a) for expressing video feature values as character strings. It stores, in association with one another, the type and value of each video feature value, the replacement target character string described in the content description language template 44a, and the replacement character string that replaces that replacement target character string.
[0036]
Here, the content description language template 44a and the character string conversion rule 41a will be described with reference to FIGS. 2 and 3. FIG. 2 shows a template written in HTML as an example of the content description language template 44a, and FIG. 3 shows an example of the contents of the character string conversion rule 41a.
[0037]
As shown in FIG. 2, the content description language template 44a is a text file written in a content description language (in this case, HTML). The parts that depend on the contents of the input video scene are written as replacement target character strings so that they can be replaced later. The example here is an HTML template that describes a “room scene”. “<!-Brightness->” is used as the replacement target character string 44b and marks the area to be replaced with a character string based on the video feature amount related to the brightness of the video scene. Similarly, “<!-Number->” is used as the replacement target character string 44c and marks the area to be replaced with a character string based on the video feature amount related to the number of video objects in the video scene.
[0038]
In the character string conversion rule 41a of FIG. 3, the types of video feature quantity used are the average luminance value of the frame (the luminance values of the pixels averaged over the entire frame) and the number of video objects in the frame (number of objects). For each value of a video feature quantity, a replacement target character string and a replacement character string (feature amount character string) are associated with each other.
[0039]
In FIG. 3A, for example, the luminance values of the pixels constituting the video are represented by 256 levels from 0 to 255. When the average value of the luminance values, as the video feature value, is “less than 90”, the replacement target character string is “<!-Brightness->” and the replacement character string is “dark”. Accordingly, when the average value of the luminance values is “less than 90”, the replacement target character string 44b of the content description language template 44a (FIG. 2) is replaced with “dark”. Likewise, when the number of objects is “2 or more”, the replacement target character string 44c of the content description language template 44a is replaced with “There are many”.
[0040]
FIG. 3B shows an example in which the replacement character strings of FIG. 3A are specified not as Japanese character strings such as “dark” or “many” but as HTML embedded images, such as “<img src="1.png">”. In this way, a replacement character string is not limited to a Japanese character string; it may embed the file name of an image file, an audio file, a script, or the like in the content description language. Examples of scripts include JavaScript (registered trademark) and ECMAScript.
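A rule table of the kind shown in FIG. 3 might be sketched as below. Only three facts come from the description: an average luminance below 90 maps to “dark”, an average luminance of 100 maps to “dim”, and two or more objects map to “There are many”. The remaining boundaries, the feature names, and the function `lookup` are assumptions made for this illustration.

```python
# Each rule: (feature type, predicate on the value, replacement target, replacement string).
RULES = [
    ("average_luminance", lambda v: v < 90, "<!-Brightness->", "dark"),
    # Assumption: every value of 90 or more maps to "dim";
    # the description only confirms that 100 maps to "dim".
    ("average_luminance", lambda v: v >= 90, "<!-Brightness->", "dim"),
    # Assumption: exactly one object maps to "there is one".
    ("object_count", lambda v: v == 1, "<!-Number->", "there is one"),
    ("object_count", lambda v: v >= 2, "<!-Number->", "There are many"),
]


def lookup(features):
    """Return {replacement target: replacement string} for the given feature values."""
    result = {}
    for kind, matches, target, replacement in RULES:
        if kind in features and matches(features[kind]) and target not in result:
            result[target] = replacement
    return result


pairs = lookup({"average_luminance": 100, "object_count": 1})
```

Swapping a replacement string for an embedded image tag such as `<img src="1.png">` would give the FIG. 3B style of rule without changing the lookup itself.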
Returning to FIG. 1, the description will be continued.
[0041]
The feature amount character string conversion unit (character string embedding unit) 42 refers to the character string conversion rule 41a in the character string conversion database 41 on the basis of the video feature amount input from the video scene analysis unit 2, and notifies the template character string replacement unit 43 of the replacement target character string and the replacement character string corresponding to that video feature amount.
[0042]
The template character string replacement unit (character string embedding unit) 43 takes the content description language template 44a input from the outside, together with the replacement target character string and the replacement character string notified by the feature amount character string conversion unit 42, and replaces each replacement target character string described in the content description language template 44a with the replacement character string, thereby generating content in the content description language, such as an HTML file.
[0043]
When the position of a video object is to be expressed by a replacement character string, it can be described using a VRML PositionInterpolator node in which the list of position coordinates at each time is the replacement target character string.
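The following sketch shows one way the time-stamped position list could be rendered as the text of a VRML97 PositionInterpolator node, using its `key` and `keyValue` fields. It rests on several assumptions that are not in the description: times are normalized to the 0 to 1 key range, z is fixed at 0 because the detected positions are 2-D, and the node name `ObjectPath` and the function name are hypothetical.

```python
def position_interpolator(samples, name="ObjectPath"):
    """Render (time, (x, y)) samples as a VRML97 PositionInterpolator node.

    Times are normalized to the 0..1 key range; z is fixed at 0 (assumption,
    since the detected positions are 2-D frame coordinates).
    """
    times = [t for t, _ in samples]
    t0 = times[0]
    span = (times[-1] - times[0]) or 1  # avoid division by zero for one sample
    keys = ", ".join(f"{(t - t0) / span:.3f}" for t in times)
    values = ", ".join(f"{x} {y} 0" for _, (x, y) in samples)
    return (
        f"DEF {name} PositionInterpolator {{\n"
        f"  key [ {keys} ]\n"
        f"  keyValue [ {values} ]\n"
        f"}}"
    )


node = position_interpolator([(1, (1, 1)), (2, (1, 2)), (3, (2, 3)), (4, (3, 2))])
```

In a template-based flow, the `keyValue` list would be the portion substituted into the template, while the surrounding node text would already be present in the template.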
[0044]
The image synthesizing unit 5 includes a position presentation image synthesizing unit 51 and an image output unit 52. It synthesizes an image related to the video feature amount extracted by the video scene analyzing unit 2 and outputs the image as an image file or a video stream.
[0045]
The position presentation image composition unit 51 receives, as video feature amounts, the position coordinates, identifiers, and detection times of the video objects from the video object position detection unit 21 of the video scene analysis unit 2, and synthesizes a position presentation image that shows where each video object, distinguished by its identifier, was located at each time. The synthesized image is output to the image output unit 52.
[0046]
For example, an icon image is read from an image storage unit (not shown) in which icon images are stored in advance, and the icon image is composited onto a plain image at the position indicated by the position coordinates. Alternatively, the image of the video object extracted by the video object feature amount extraction unit 22 of the video scene analysis unit 2 may be used directly as the icon image.
[0047]
The image output unit 52 outputs the image synthesized by the position presentation image synthesis unit 51 as an image file. In addition, when an image is input from the position presentation image composition unit 51 in time series, the image output unit 52 outputs the time series image as a video stream in which the video object changes with time.
[0048]
The voice synthesizing unit 6 includes a voice data storage unit 61, a voice selection unit 62, and a voice output unit 63. It outputs audio related to the video feature amount extracted by the video scene analysis unit 2 as an audio file or an audio stream.
[0049]
The audio data storage unit (audio data storage means) 61 stores, in advance, audio data 61a related to a video scene in association with an identification number, and is implemented with a hard disk or the like. The audio data 61a is audio data for expressing the video scene, for example BGM (background music), sound effects, or a human voice.
The audio data storage unit 61 holds a plurality of audio data 61a based on the video feature amount of the video scene. For example, audio data 61a expressing the level of “brightness” in association with the luminance value is held as an audio file.
[0050]
The audio selection unit (audio selection means) 62 selects audio data 61a stored in the audio data storage unit 61 on the basis of the video feature quantity of the video scene (for example, the average value of the luminance values) input from the video scene analysis unit 2, and notifies the audio output unit 63 of the identification number of the selected audio data 61a.
For example, the audio selection unit 62 determines the level of “brightness” from the average value of the luminance values and selects the audio data 61a corresponding to that “brightness”. Alternatively, specific audio data 61a may be selected when a video object enters a preset area, as judged from the position coordinates of the video object.
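The two selection criteria described above (brightness level and entry into a preset area) can be sketched as follows. The store contents, the identification numbers, the file names, and the reuse of the boundary value 90 from FIG. 3 are all assumptions made for illustration.

```python
# Hypothetical store: identification number -> audio file name.
AUDIO_STORE = {1: "dark.wav", 2: "dim.wav", 3: "alert.wav"}


def select_audio(average_luminance, position=None, area=None):
    """Return the identification number of the audio data to output.

    area: optional (x_min, y_min, x_max, y_max) rectangle. When the object
          position falls inside it, the area-specific audio takes priority.
    """
    if position is not None and area is not None:
        x, y = position
        x_min, y_min, x_max, y_max = area
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return 3  # object entered the preset area
    # Assumption: 90 is reused as the boundary between two brightness levels.
    return 1 if average_luminance < 90 else 2


chosen = select_audio(100)
in_area = select_audio(100, position=(5, 5), area=(0, 0, 10, 10))
```

Returning only the identification number mirrors the notification from the selection unit to the output unit; reading the actual audio data is left to the caller.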
[0051]
The audio output unit (audio output means) 63 reads, from the audio data storage unit 61, the audio data 61a selected by the audio selection unit 62 and identified by the notified identification number, and outputs it as an audio file or an audio stream.
As described above, the content description language (HTML file or the like) output from the content description language generation unit 4, the image file (or video stream) output from the image synthesis unit 5, and the audio file (or audio stream) output from the voice synthesis unit 6 may each be output individually, or may be output together as a plurality of video-related contents.
[0052]
The configuration of the video-related content generation apparatus 1 has been described above on the basis of one embodiment. The video-related content generation apparatus 1 can also be realized by implementing each unit as a function program on a computer, and these function programs can be combined to operate as a video-related content generation program.
[0053]
(Operation of video-related content generation device: Content description language generation example)
Next, the operation of the video related content generation device 1 will be described.
First, an operation example in which the video-related content generation device 1 generates a content description language will be described with reference to FIGS. 1 and 7. FIG. 7 is a flowchart showing an operation in which the video-related content generation device 1 generates a content description language.
[0054]
In the video-related content generation device 1, the video scene analysis means 2 analyzes the input video signal and extracts a video feature amount (step S11).
[0055]
Then, in the content description language generation means 4, the feature amount character string conversion unit 42 searches the character string conversion rule 41a in the character string conversion database 41 for the replacement target character string and the replacement character string corresponding to the video feature amount extracted in step S11, and notifies the template character string replacement unit 43 of them (step S12).
[0056]
The template character string replacement unit 43, having been notified of the replacement target character string and the replacement character string, reads the content description language template 44a from the outside (step S13) and searches for the replacement target character string in the content description language template 44a (step S14).
It then determines whether or not the replacement target character string exists (step S15). If it exists (Yes), the replacement target character string is replaced with the replacement character string (step S16), and the process returns to step S14 to search for further occurrences of the replacement target character string.
[0057]
On the other hand, if the replacement target character string does not exist (No in step S15), all replacement target character strings are regarded as having been replaced, the content description language (HTML file or the like) after replacement is output (step S17), and the operation ends.
Through the above steps, text-based content in which information related to the video content is described in the content description language can be generated from the video signal.
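The search-and-replace loop of steps S14 to S17 can be sketched as follows. The template text is a hypothetical stand-in for FIG. 2, whose exact contents are not reproduced here, and the function name `fill_template` is likewise an assumption. The search resumes after each inserted string so that a replacement string containing its own target cannot cause an endless loop.

```python
# Hypothetical template standing in for FIG. 2.
TEMPLATE = (
    "<html><body><h1>Room scene</h1>"
    "<p>The room is <!-Brightness->.</p>"
    "<p>People: <!-Number->.</p></body></html>"
)


def fill_template(template, pairs):
    """Replace every replacement target character string in the template.

    pairs: dict mapping replacement target -> replacement string.
    """
    text = template
    for target, replacement in pairs.items():
        if not target:
            continue                            # nothing to search for
        start = 0
        while True:
            index = text.find(target, start)    # step S14: search
            if index == -1:                     # step S15: No
                break
            # step S16: replace this occurrence, then search again
            text = text[:index] + replacement + text[index + len(target):]
            start = index + len(replacement)
    return text                                 # step S17: output


html = fill_template(
    TEMPLATE,
    {"<!-Brightness->": "dim", "<!-Number->": "there is one"},
)
```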
[0058]
Next, with reference to FIG. 4, the specific operation of content description language generation will be described with the content description language generation means 4 as the center. FIG. 4 is a conceptual diagram illustrating an example in which an HTML file is generated from video feature amounts.
[0059]
The content description language generation unit 4 first receives the video feature amount 2a. Assume here that, as the video feature quantity 2a, the average value of the luminance values is “100” and the number of objects is “1”. The content description language generation means 4 searches the character string conversion rule 41a (FIG. 3A) for the replacement target character strings and replacement character strings corresponding to the average luminance value “100” and the number of objects “1”. It thereby obtains the replacement target character string “<!-Brightness->” and the replacement character string “dim” for the average luminance value “100”, and the replacement target character string “<!-Number->” and the replacement character string “one” for the number of objects “1”.
[0060]
Then, the content description language generation means 4 replaces the replacement target character strings “<!-Brightness->” and “<!-Number->” in the content description language template 44a (FIG. 2) input from the outside with “dim” and “there is one”, respectively, thereby generating an HTML file 4a that describes the “room scene”.
[0061]
(Operation of video related content generation device: image composition example)
Next, an operation example in which the video-related content generation device 1 synthesizes and outputs an image related to a video signal will be described with reference to FIGS. 1 and 8. FIG. 8 is a flowchart illustrating an operation in which the video-related content generation device 1 generates a video stream from time-series composite images.
[0062]
First, in the video-related content generation device 1, the video scene analysis unit 2 analyzes the input video signal and extracts the times of the video scene and the position coordinates of the video object, which are video feature quantities (step S21).
In the image synthesizing unit 5, the position presentation image synthesizing unit 51 generates a composite image in which the icon image is composited at the position coordinates of the video object at a given time of the video scene (step S22).
[0063]
It is then determined whether or not generation of the composite images for all times of the video scene has been completed (step S23). If it has not yet been completed (No), the process returns to step S22 to generate the composite image at the next time. On the other hand, when the generation of the composite images for all times is completed (Yes), the composite images generated at each time of the video scene are output as a video stream (step S24), and the operation ends.
Through the above steps, a video stream in which only the position of the video object is visualized can be generated from the video signal.
[0064]
Next, a specific operation for generating an image file or a video stream from a video signal will be described with reference to FIGS. 5 and 6. FIG. 5 is a conceptual diagram showing an example of generating, from the video feature amount, an image file in which the positions of a video object that change in time series are visualized on a single image. FIG. 6 is a conceptual diagram illustrating an example in which the positions of a video object that change in time series are generated from the video feature amount as separate image files or as a video stream.
[0065]
As shown in FIG. 5, the image synthesizing means 5 first receives a video feature amount 2a. Here, as the video feature amount 2a, the times are “1”, “2”, “3”, and “4”, and the position coordinates of the video object corresponding to those times are “(1, 1)”, “(1, 2)”, “(2, 3)”, and “(3, 2)”.
[0066]
The image composition means 5 composites icon images C1, C2, C3, and C4 (a different icon image is used only for C4) at the position coordinates “(1, 1)”, “(1, 2)”, “(2, 3)”, and “(3, 2)” of a plain image, and generates an image file 5a in which the icon images (C1 to C4) are connected by straight lines. This expresses that the video object has moved from the position coordinates of C1 to the position coordinates of C4.
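The FIG. 5 style of output (icons at each position, joined by straight lines, with a different icon for the latest position) can be sketched on a character grid. Everything about the rendering is an assumption for illustration: the grid stands in for a plain image, `o` and `@` stand in for icon images, `*` marks the connecting lines, and the coordinates are scaled so that the lines are visible.

```python
def draw_trajectory(points, width, height, scale=3):
    """Draw icons at each position on a plain image and join them with lines.

    The "image" is a list of strings: '.' plain background, '*' line,
    'o' icon for earlier positions, '@' a different icon for the latest one.
    """
    grid = [["." for _ in range(width)] for _ in range(height)]
    scaled = [(x * scale, y * scale) for x, y in points]
    for (x0, y0), (x1, y1) in zip(scaled, scaled[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):            # simple linear interpolation
            x = x0 + (x1 - x0) * i // steps
            y = y0 + (y1 - y0) * i // steps
            grid[y][x] = "*"
    for x, y in scaled[:-1]:                  # icons are drawn over the lines
        grid[y][x] = "o"
    last_x, last_y = scaled[-1]
    grid[last_y][last_x] = "@"
    return ["".join(row) for row in grid]


image = draw_trajectory([(1, 1), (1, 2), (2, 3), (3, 2)], 12, 12)
```

Printing `"\n".join(image)` gives a quick visual check of the path from the first position to the last.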
[0067]
In the example shown in FIG. 6, the same video feature amount 2a as in FIG. 5 is input, but the output of the image synthesizing means 5 differs in that it is a plurality of image files or a video stream rather than a single image file.
In FIG. 6, the image synthesizing unit 5 composites the icon image C4 at the position coordinates “(1, 1)”, “(1, 2)”, “(2, 3)”, and “(3, 2)” corresponding to the times “1” to “4”, generating four images. The image synthesizing means 5 can output these four images as individual image files 5a, or can output them as a video stream 5b in which the images form continuous stream data.
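The FIG. 6 style of output, one image per time, can be sketched in the same character-grid form. The grid size, the icon character, the file names, and the representation of a video stream as a time-ordered list are all assumptions made for this illustration.

```python
def render_frames(track, width, height, icon="@"):
    """Yield (time, image) pairs, one plain image per time, with the icon
    composited at that time's position coordinates."""
    for time, (x, y) in sorted(track.items()):
        grid = [["." for _ in range(width)] for _ in range(height)]
        grid[y][x] = icon
        yield time, ["".join(row) for row in grid]


track = {1: (1, 1), 2: (1, 2), 3: (2, 3), 4: (3, 2)}
frames = list(render_frames(track, 5, 5))

# Separate image "files": one entry per time (hypothetical names).
image_files = {f"frame_{t}.txt": img for t, img in frames}
# A "video stream": the same images in time order.
video_stream = [img for _, img in frames]
```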
[0068]
(Operation of video-related content generation device: speech synthesis example)
Next, an operation example in which the video-related content generation device 1 synthesizes and outputs audio related to the video signal will be described with reference to FIGS. 1 and 9. FIG. 9 is a flowchart illustrating an operation in which the video-related content generation device 1 outputs audio data.
[0069]
In the video-related content generation device 1, the video scene analysis means 2 analyzes the video signal and extracts the video feature amount at fixed time intervals or at the timing of cut points obtained by a cut point detection technique (step S31).
[0070]
Then, in the voice synthesizer 6, the voice selection unit 62 selects, according to the value of the video feature amount, the voice data 61a stored in the voice data storage unit 61 that represents the video scene, and notifies the voice output unit 63 of its identification number (step S32).
The audio output unit 63 notified of the identification number reads the audio data 61a selected based on the identification number from the audio data storage unit 61 and outputs it as an audio file or an audio stream (step S33).
Through the above steps, audio data related to the video content can be output from the video signal as an audio file or an audio stream.
[0071]
[Effects of the invention]
As described above, the video-related content generation apparatus, the video-related content generation method, and the video-related content generation program according to the present invention have the following excellent effects.
[0072]
According to the invention described in claim 1, claim 6, or claim 7, a video scene is analyzed from the input video signal, and video-related content converted into heterogeneous information such as coordinate information, audio information, and image information related to the content of the video can be generated. This makes it possible to apply automatic content production based on natural input data, which has so far been limited to the creation of subtitle data by voice recognition, to video input as well.
Furthermore, according to the invention described in claim 1, claim 6, or claim 7, information related to the contents of the video scene can be generated as a content description language from a templated content description language such as HTML. By preparing standardized content as a template, the time required for content production can be shortened.
[0073]
According to the invention described in claim 2 or claim 3, it is possible to detect a video object in a video and extract it as a video feature amount. Since the position information and feature amount for each video object can be expressed by visualized text, sound, image, and the like, it is possible to generate content with a reduced video data amount. In addition, it is possible to generate content that can be used on the WWW or a portable terminal, and to improve data accessibility.
[0075]
According to the invention described in claim 4, since the position of each video object can be expressed by another image or the like, it is possible to generate content with a reduced video data amount. In addition, it is possible to generate content that can be used on the WWW or a portable terminal, and to improve data accessibility.
[0076]
According to the invention described in claim 5, since sound suited to the video scene can be output as appropriate, an effect that cannot be expressed by video alone can be produced. This can also reduce the labor required for content production.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a video-related content generation apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram illustrating an example of a content description language template.
FIG. 3 is an explanatory diagram for explaining an example of a character string conversion rule.
FIG. 4 is an explanatory diagram for explaining an example of the operation of the content description language generating means according to the embodiment of the present invention.
FIG. 5 is a schematic diagram (part 1) schematically showing an operation of the image composition unit according to the embodiment of the present invention.
FIG. 6 is a schematic diagram (part 2) schematically showing the operation of the image composition unit according to the embodiment of the present invention.
FIG. 7 is a flowchart showing an operation of generating a content description language from video according to the embodiment of the present invention.
FIG. 8 is a flowchart showing an operation of generating a composite image from a video according to an embodiment of the present invention.
FIG. 9 is a flowchart showing an operation of generating synthesized audio from video according to the embodiment of the present invention.
[Explanation of symbols]
1 …… Video-related content generation device
2 …… Video scene analysis means
21 …… Video object position detection unit (video object position detection means)
22 …… Video object feature amount extraction unit (video object feature amount extraction means)
23 …… Video scene feature amount extraction unit
3 …… Content generation means
4 …… Content description language generation means
41 …… Character string conversion database
42 …… Feature amount character string conversion unit (character string embedding means)
43 …… Template character string replacement unit (character string embedding means)
5 …… Image synthesis means
51 …… Position presentation image synthesis unit
52 …… Image output unit
6 …… Voice synthesis means
61 …… Voice data storage unit (voice data storage means)
62 …… Voice selection unit (voice selection means)
63 …… Voice output unit (voice output means)

Claims

A video-related content generation device that generates information related to video content of a video signal as video-related content from a video signal,
Video scene analysis means for analyzing the video signal and extracting video feature values;
Content generating means for converting the video feature amount extracted by the video scene analyzing means into at least one of text data, image data, and audio data to generate the video-related content;
The content generation means includes
A character string conversion database in which the video feature amount and a feature amount character string expressing the video feature amount as a character string are associated and stored;
A character string embedding unit for embedding the feature amount character string in the text region of the content description language in which the text region for embedding the feature amount character string is made into a template based on the video feature amount;
A video-related content generation device comprising:

The video scene analysis means includes
Video object position detecting means for detecting position coordinates of a video object included in the video scene as the video feature amount;
The video-related content generation device according to claim 1, further comprising:

The video scene analysis means includes
Video object feature quantity extraction means for extracting a feature quantity characterizing a video object included in the video scene as the video feature quantity;
The video-related content generation device according to claim 1, further comprising:

The content generation means includes
Image combining means for combining image data related to the video object with the position coordinates of the video object detected by the video object position detecting means;
The video-related content generation device according to claim 2, further comprising:

The content generation means includes
Audio data storage means for storing a plurality of audio data in association with the video feature amount;
Audio selection means for selecting the audio data stored in the audio data storage means based on the video feature amount;
Voice output means for outputting the voice data selected by the voice selection means;
The video-related content generation device according to any one of claims 1 to 4, further comprising:

A video-related content generation method for generating information related to video content of a video signal as video-related content from a video signal,
A video scene analysis step of analyzing a video scene of the video signal and extracting a video feature;
A character string search step of searching a character string conversion database, in which the video feature quantity and a feature quantity character string expressing the video feature quantity as a character string are stored in association with each other, for the feature quantity character string corresponding to the video feature quantity extracted in the video scene analysis step;
A content description language input step for inputting a content description language, which is a template of a text region in which the feature amount character string is embedded,
A character string embedding step of embedding the feature amount character string searched in the character string search step in the text region of the content description language based on the video feature amount;
A video-related content generation method characterized by comprising:

A video-related content generation program for causing a computer, in order to generate information related to the video content of a video signal as video-related content from the video signal, to function as:
Video scene analysis means for analyzing a video scene of the video signal and extracting a video feature amount; and
Content generation means for converting the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data to generate the video-related content,
Wherein the content generation means
Refers to a character string conversion database in which the video feature amount and a feature amount character string expressing the video feature amount as a character string are stored in association with each other, and embeds the feature amount character string, based on the video feature amount, in the text region of a content description language in which the text region for embedding the feature amount character string is templated.