JP4064902B2 - Meta information generation method, meta information generation device, search method, and search device

Info

Publication number: JP4064902B2
Application number: JP2003320940A
Authority: JP
Inventors: 庄三磯部; 将之芦川; 浩平桃崎; 康之正井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-09-12
Filing date: 2003-09-12
Publication date: 2008-03-19
Anticipated expiration: 2023-09-12
Also published as: JP2005092295A

Description

The present invention relates to a search apparatus that generates meta information describing the characteristics of content data including video data and audio data, and that uses the meta information to search the content data for a desired scene.

With the spread of digital broadcasting such as CS and BS, and of large-capacity recorders that use media such as DVDs and HDDs, a style of viewing has become possible in which a user receives and stores a large number of digital broadcast programs and, when the user wants to watch a favorite scene, searches for it among them and views it.

There has long been strong demand for searching for and playing back a desired video scene by specifying search conditions relating to scenes in video content, and MPEG7 (Moving Pictures Experts Group part 7), for example, has already been proposed as a standard to meet this demand. MPEG7 is a standard for describing features of video content as meta information. For example, by registering video content and its meta information in a database, video content whose meta information satisfies a given condition can be searched for within a local system or via a network; it was standardized by the W3C in 2001. The meta information used when searching for video content is information describing the characteristics of the video content (information about information).

MPEG7 is also an XML (Extensible Markup Language) based standard. XML is a standard meta information description language offering flexible extensibility and interoperability, with the following features. (A1) Because the structure of content data and its display style can be defined separately, the data structure is easy to change and the content data is easy to reuse. (A2) Because the structure of content data can be defined strictly using tags, XML is well suited to accurate conditional searches that use tags. A further feature is that a schema (data structure definition) need not necessarily be defined for the content data.

Keywords to be attached to video content are generally created by hand. In recent years, however, with improvements in speech recognition and image recognition, cases have begun to appear in which keywords are extracted automatically from the recognition results for the video content. Automatic keyword extraction has the major advantage of saving the effort of creating keywords and tags.

A technique has also been disclosed that automatically extracts keywords using speech recognition or image recognition and searches for video content (see, for example, Patent Document 1). In this technique, a keyword obtained by speech recognition of audio information is compared with keywords registered in advance in a database or the like in association with each item of content data, and when the two match, that content data is returned as the search result. That is, a keyword is assigned to each item of content in advance and stored in a database, the user specifies a keyword in the form of speech or an image, and the content in the database that matches the specified keyword is returned as the search result. Thus the keyword used as the search key is extracted automatically, but the keywords attached to the content stored in the database are not assumed to be extracted automatically.

On the other hand, by extending the automatic keyword extraction function described above, automatic keyword extraction can also be applied to each scene of the video data to assign keywords to it. This can be achieved, for example, by the following steps.

(Step 1) For each stream section constituting the video content, one representative video scene is selected and associated with that section. (Step 2) For each video scene selected in Step 1, speech recognition or image recognition is performed on the audio stream or image stream constituting the corresponding stream section, and keywords are extracted. (Step 3) The keywords thus obtained are assigned to the video scene as meta information.

However, applying the above method raises several problems. First, it is essential that the speech recognition result maintain at least a certain level of accuracy, that is, that the reliability of speech recognition be high. Material with little noise, such as news narration, poses no difficulty, but when a stream section contains many sound elements that act as noise for speech recognition (utterance recognition), such as background music (BGM), the reliability of the speech recognition result in that section generally falls.

If keyword information extracted under conditions of low speech recognition reliability is registered as the meta information attached to each scene, and the user searches for scenes by specifying a keyword, the following problems may occur. (B1) Search omission: as a result of a recognition error, a different keyword is extracted and registered, and a video scene that should be found with the specified keyword is not hit. (B2) Erroneous search: as a result of a recognition error, a different keyword is extracted and registered, and a scene that should not be found with the specified keyword is found. Of these two problems, search omission tends to occur in tag search, in which tag conditions such as those of XML are specified to perform a highly precise search, whereas erroneous search tends to occur in keyword search, which has looser conditions than tag search and finds items containing the specified keyword anywhere in the full text. A method is therefore desired that reduces search omissions and erroneous searches, for example by using tag search or keyword search as the situation requires.
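The difference between the two failure modes can be illustrated with a small sketch. The scene records, field names, and misrecognized readings below are hypothetical and are not taken from the patent; the sketch only contrasts a strict tag match with a loose full-text match.

```python
# Hypothetical scene records; field names and values are illustrative only.
SCENES = [
    {"id": 1, "player_name": "さとう", "speech_text": "さとうがうちました"},
    # Assumed recognition error: another name was recognized as "さとう"
    # in the free text, and no player_name tag was assigned.
    {"id": 2, "player_name": None, "speech_text": "さとうがなげました"},
    # Assumed recognition error: the player "さとう" was registered as "かとう".
    {"id": 3, "player_name": "かとう", "speech_text": "かとうがうちました"},
]


def tag_search(scenes, tag, value):
    """Strict search: the named tag must hold exactly the given value."""
    return [s["id"] for s in scenes if s.get(tag) == value]


def keyword_search(scenes, value):
    """Loose search: the value may appear anywhere in the full text."""
    return [s["id"] for s in scenes if value in s["speech_text"]]


tag_hits = tag_search(SCENES, "player_name", "さとう")
keyword_hits = keyword_search(SCENES, "さとう")
print(tag_hits)      # scene 3 is missed: a search omission of type (B1)
print(keyword_hits)  # scene 2 is returned: an erroneous hit of type (B2)
```

In this toy data the tag search returns only scene 1, while the keyword search also returns scene 2; neither finds scene 3, which is the situation that motivates a further relaxed search mode.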

Furthermore, in sections where the reliability of speech recognition is low, search omissions are likely even in keyword search, so a method that switches to a search scheme with still more relaxed conditions is also desired.

As described above, there is strong demand for describing meta information for video search in an XML format typified by MPEG7, storing it in a database together with the content data, searching within a local system or via a network for scenes in the video data that match given conditions, displaying the video scenes found, and playing back a scene when it is selected.

On the other hand, when a tag search or keyword search by keyword specification is performed on the basis of speech recognition results, the reliability of the recognition result generally falls in stream sections that contain many sound elements acting as noise for speech recognition (utterance recognition), such as BGM. That is, if keywords obtained by speech recognition are used as meta information for searching video scenes (tag search, keyword search), search omissions and erroneous searches are likely. In sections with low speech recognition reliability in particular, search omissions are likely even in keyword search.
Patent Document 1: JP 2001-229180 A

Thus, conventionally, when keywords recognized from the audio corresponding to video content are used as meta information for that content, and in particular when the keywords are obtained by recognizing audio in sections where the recognition result is unreliable because the audio contains noise such as BGM, search omissions and erroneous searches are likely to occur.

In view of the above problems, an object of the present invention is to provide a meta information generation method and apparatus that generate meta information for content data with which search omissions and erroneous searches can be avoided as far as possible even when the content data, consisting of video data and audio data, includes sections with low speech recognition reliability, and a search method and search apparatus that use the meta information to search for content data including a desired scene.

(1) The present invention generates meta information consisting of a plurality of element data describing characteristics of content data including video data and audio data. (a) For each of a plurality of sections into which the content data is divided, a speech text, which is the speech recognition result for the audio data of the section, and a telop text, which is the character recognition result for the telop (on-screen caption) contained in the video data of the section, are obtained. (b) From the speech text, a keyword contained in the speech text and the category to which the keyword belongs are obtained. (c) For each of the sections, the reliability of the speech text of the section is determined to be high when the speech text and the telop text obtained from the section both contain a homophone (a word with the same reading), and low when they do not. (d) For each of the sections, meta information is generated that includes at least first element data describing an identifier of the section, second element data describing the speech text, third element data describing the reliability of the speech text, and, when the speech text contains the keyword, fourth element data describing the keyword and the category to which it belongs in association with each other.

(2) In the present invention, (a) content data including video data and audio data is stored in first storage means. (b) A plurality of meta information data, one for each of a plurality of sections into which the content data is divided, are stored in second storage means. Each meta information data consists of a plurality of element data describing characteristics of the content data within one of the sections, including first element data describing an identifier of the section, second element data describing a speech text that is the speech recognition result for the audio data of the section, third element data describing the reliability of the speech text as either high or low, and fourth element data describing a keyword extracted from the speech text and the category to which the keyword belongs in association with each other. (c) When a first character string and a second character string are specified as search conditions, the meta information data with high reliability among those stored in the second storage means are taken as the search target, and meta information data are retrieved whose fourth element data describes, in association with each other, a category with the same reading as the first character string and a keyword with the same reading as the second character string. (d) When, of the first and second character strings, only the second character string is specified as the search condition, a plurality of third character strings are obtained, each having the same or a similar sound to the second character string; the meta information data stored in the second storage means are taken as the search target, and meta information data whose speech text contains any of the third character strings are retrieved. (e) From the content data stored in the first storage means, the content data of the sections corresponding to the identifiers contained in the meta information data retrieved in (c) and (d) are retrieved.
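The switching between search modes (c) and (d) might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `SectionMeta` class, the `search` function, and the `confusable` table of similar-sounding readings are all hypothetical names, and each section is assumed to hold at most one keyword per category.

```python
from dataclasses import dataclass, field


@dataclass
class SectionMeta:
    section_id: int          # first element data: section identifier
    speech_text: str         # second element data: recognized speech (hiragana)
    reliability: str         # third element data: "high" or "low"
    keywords: dict = field(default_factory=dict)  # fourth: category -> keyword


def search(metas, keyword, category=None, confusable=None):
    """Return the IDs of sections matching the search condition."""
    if category is not None:
        # (c) Both strings given: strict category/keyword match,
        # restricted to sections whose speech text is reliable.
        return [m.section_id for m in metas
                if m.reliability == "high"
                and m.keywords.get(category) == keyword]
    # (d) Keyword only: expand it to same or similar-sounding strings
    # (assumed lookup table) and match anywhere in the speech text.
    candidates = [keyword] + (confusable or {}).get(keyword, [])
    return [m.section_id for m in metas
            if any(c in m.speech_text for c in candidates)]


METAS = [
    SectionMeta(1, "さとうがうちました", "high", {"player_name": "さとう"}),
    SectionMeta(2, "かとうがうちました", "low"),
    SectionMeta(3, "つぎのばったーです", "high"),
]

strict_hits = search(METAS, "さとう", category="player_name")
relaxed_hits = search(METAS, "さとう", confusable={"さとう": ["かとう"]})
print(strict_hits, relaxed_hits)
```

Step (e) would then use the returned section IDs to fetch the corresponding stretches of content data from the first storage means.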

According to the present invention, search omissions and erroneous searches can be avoided when searching for a desired scene in content data consisting of video data and audio data.

Before describing embodiments of the present invention, XML is briefly explained. For simplicity, the explanation takes as its example the meta information (video meta information) used when searching for content data. The structure of the content data can likewise be expressed in XML; in that case it includes at least a tag that can specify the time from the start of the video.

The content data includes an image (video) stream (video data) and an audio stream (audio data), which is continuous audio data.

FIG. 6 shows an example of video meta information described in XML. XML uses tags to express structure. There are start tags and end tags; by enclosing a component of the structural information between a start tag and an end tag, the boundaries of a character string (text) and the structural component to which the text belongs can be described unambiguously.

A start tag is an element name enclosed in the symbols "<" and ">", and an end tag is an element name enclosed in "</" and ">". The content of the component that follows a tag is text (a character string) or a repetition of child components. Attribute information can be set in a start tag, as in "<element-name attribute="attribute value">". A component that contains no text, such as "<meta-information></meta-information>", can also be written in the abbreviated form "<meta-information/>".

The video meta information shown in FIG. 6 has the element beginning with the "section information" tag as its root, and a set of elements beginning with "section" tags as its child elements. Each "section" tag has as child elements a set of elements beginning with tags such as "start time", "end time", "speech text", "telop", and "player name", and has "ID" and "reliability" as attributes. An element beginning with the "player name" tag, for example, can hold a single text (character string) such as "さとう" (Sato) as its text value. Each element can also be expressed as a multi-level hierarchy.
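FIG. 6 itself is not reproduced in this text, so the following is only a hypothetical sample that follows the structure just described. The tag names are English renderings chosen for illustration (with underscores, since XML element names cannot contain spaces), and the times and text values are invented. The sketch reads the sample with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the description of FIG. 6.
SAMPLE = """
<section_info>
  <section ID="1" reliability="high">
    <start_time>00:00:00</start_time>
    <end_time>00:00:10</end_time>
    <speech_text>さとうがうちました</speech_text>
    <telop>佐藤 本塁打</telop>
    <player_name>さとう</player_name>
  </section>
</section_info>
"""

root = ET.fromstring(SAMPLE.strip())

sections = []
for section in root.findall("section"):
    record = dict(section.attrib)       # attributes: ID, reliability
    for child in section:               # child elements: tag -> text value
        record[child.tag] = child.text
    sections.append(record)

print(sections[0]["ID"], sections[0]["reliability"], sections[0]["player_name"])
```

Each "section" element becomes one flat record here; a nested hierarchy would need a recursive walk instead.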

Embodiments of the invention are described below with reference to the drawings. FIG. 1 shows an example of the schematic configuration of an entire system including the search device 11 according to this embodiment, in which the search device 11 and a playback device 7 are connected via a predetermined network 10.

The search device 11 comprises a video meta information generation unit 1, a content data storage unit 2, a video meta information storage unit 3, a content data search unit 4, a video meta information search unit 5, and a word dictionary 6.

The content data storage unit 2 stores a plurality of content data. Each content data consists of one or more stream sections. For each stream section, information such as a section ID identifying it and its start time and end time is described in a format such as XML and stored in association with the content data.

The video meta information generation unit 1 generates, from the content data stored in the content data storage unit 2, the video meta information used when searching that content data. The generated video meta information is stored in the video meta information storage unit 3.

The content data search unit 4 uses the words registered in the word dictionary 6 to generate a query for retrieving the video meta information needed to obtain content data (video) satisfying the search conditions transmitted from the playback device 7, and outputs the query to the video meta information search unit 5. Using the section IDs and start/end times obtained by the video meta information search unit 5, it also retrieves the corresponding content data from the content data storage unit 2. The content data thus obtained is transmitted to the playback device 7 via the network 10.

The playback device 7 comprises a search request specifying unit 8 and a playback unit 9. The search request specifying unit 8 is used to input search conditions for the desired content data. The search conditions are transmitted via the network 10 to the content data search unit 4 of the search device 11. The playback unit 9 plays back the content data transmitted from the content data search unit 4 as the search result.

FIG. 2 shows a configuration example of the video meta information generation unit 1, which comprises a demultiplexer unit 101, a speech recognition unit 102, a telop recognition unit 103, a keyword matching degree determination unit 104, a scene information extraction unit 105, a data generation unit 106, and a recognition dictionary 107. Each of the components shown in FIG. 2 can be implemented in software.

The content data stored in the content data storage unit 2 is input to the video meta information generation unit 1. The demultiplexer unit 101 separates the input content data into an audio stream and an image stream. The audio stream is input to the speech recognition unit 102 and the image stream to the telop recognition unit 103. The content data is also input to the scene information extraction unit 105, which extracts from it the section ID, start time, and end time of each stream section and outputs them to the data generation unit 106.

The speech recognition unit 102 performs speech recognition on the audio stream using the words registered in the recognition dictionary 107, and outputs the speech text that is the recognition result, the words (keywords) contained in the speech text, and the categories of those keywords. A keyword extracted here is one that matches a word registered in advance under a predetermined category. The speech recognition unit 102 stores in advance a word dictionary in which a plurality of words are registered for each category and rules for determining whether a word is a keyword; by referring to this dictionary and these rules, it obtains from the speech text the keywords and the categories to which they belong.
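The dictionary lookup described above can be sketched as a simple substring match against a per-category word list. The dictionary contents and the `extract_keywords` function are hypothetical, and the patent's additional keyword determination rules are not modeled here. The same lookup would apply to telop text.

```python
# Hypothetical category dictionary: category name -> registered words.
CATEGORY_DICTIONARY = {
    "player_name": ["さとう", "すずき"],
    "team_name": ["たいがーす"],
}


def extract_keywords(text, dictionary):
    """Return (category, keyword) pairs for registered words found in text."""
    found = []
    for category, words in dictionary.items():
        for word in words:
            if word in text:
                found.append((category, word))
    return found


pairs = extract_keywords("たいがーすのさとうがうちました", CATEGORY_DICTIONARY)
print(pairs)
```

A real system would likely match on morphological word boundaries rather than raw substrings; substring matching is used here only to keep the sketch short.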

The telop recognition unit 103 performs character recognition on the telops in the image stream, and outputs the telop text that is the recognition result, the words (keywords) contained in the telop text, and their categories. A keyword extracted here is one that matches a word registered in advance under a predetermined category. The telop recognition unit 103 stores in advance a word dictionary in which a plurality of words are registered for each category and rules for determining such keywords; by referring to this dictionary and these rules, it obtains from the telop text the keywords and the categories to which they belong.

The keyword matching degree determination unit 104 compares, for each stream section constituting the content data, the speech text output from the speech recognition unit 102 with the telop text output from the telop recognition unit 103, and determines the speech recognition reliability of the recognition result (speech text). For example, when the speech text and the telop text both contain a matching character string, that is, one with the same sound (the same reading), hereinafter called a common word, the reliability is considered high and the speech recognition reliability is determined to be "high"; when they contain no common word, it is determined to be "low". For this comparison, the telop text is first converted to hiragana using a dictionary and then compared with the hiragana speech text.
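A minimal sketch of this reliability judgment follows. The reading table, the candidate vocabulary, and the function names are assumptions made for illustration; in particular, the patent does not state which strings are tested as common-word candidates, so a fixed vocabulary of hiragana readings is assumed here.

```python
# Hypothetical kanji-to-hiragana reading table used to normalize telop text.
READINGS = {"佐藤": "さとう", "本塁打": "ほんるいだ"}


def to_hiragana(text, readings):
    """Replace each known surface form with its hiragana reading."""
    for surface, reading in readings.items():
        text = text.replace(surface, reading)
    return text


def judge_reliability(speech_text, telop_text, vocabulary, readings):
    """Return ("high" or "low", common words) for one stream section."""
    telop_reading = to_hiragana(telop_text, readings)
    common = [w for w in vocabulary
              if w in speech_text and w in telop_reading]
    return ("high" if common else "low"), common


VOCABULARY = ["さとう", "ほんるいだ"]  # assumed candidate readings

good = judge_reliability("さとうがうちました", "佐藤 本塁打", VOCABULARY, READINGS)
bad = judge_reliability("かとうがうちました", "佐藤 本塁打", VOCABULARY, READINGS)
print(good, bad)
```

The first section is judged "high" because the reading "さとう" appears in both texts; the second is judged "low" because the misrecognized speech text shares no reading with the telop.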

The keyword matching degree determination unit 104 outputs the speech recognition reliability and the common words for each stream section. The common words are output in hiragana here, but they may instead be unified into kanji notation on the basis of a dictionary.

The data generation unit 106 generates video meta information data such as that shown in FIG. 6 from the speech text, keywords, and keyword categories output from the speech recognition unit 102; the telop text, keywords, and keyword categories output from the telop recognition unit 103; the speech recognition reliability and common words output from the keyword matching degree determination unit 104; and the section IDs, start/end times, and so on output from the scene information extraction unit 105.

As shown in FIG. 6, the video meta information describes, for each stream section, the section ID extracted from that section, the speech recognition reliability (reliability), the start time, the end time, the speech text, the telop text, the common words, the keywords, and so on. The video meta information for one stream section is described as a component beginning with a "section" tag.

The section ID and the speech recognition reliability (reliability) are described as attributes of the "section" tag. The start time and end time are each described as a component beginning with the corresponding tag name. The speech text is described as the value of the element whose tag name (element name) is "speech text", and the telop text as the value of the element whose tag name is "telop". A common word is described as the value of the element whose tag name is "common tag value". A keyword in the speech text or telop text (here, a keyword that matches a word registered in advance under a predetermined category) is described as the value of an element whose tag name is the category to which the keyword belongs.
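The mapping from recognition outputs to one "section" element could be sketched as below using Python's standard `xml.etree.ElementTree`. The function name, its parameters, the English tag names, and the sample values are assumptions for illustration and do not reproduce FIG. 6.

```python
import xml.etree.ElementTree as ET


def build_section(section_id, reliability, start, end,
                  speech_text, telop_text, common_words, keywords):
    """Build one <section> element; keywords is a list of (category, word)."""
    # Section ID and reliability become attributes of the section tag.
    section = ET.Element("section",
                         {"ID": str(section_id), "reliability": reliability})
    for tag, value in [("start_time", start), ("end_time", end),
                       ("speech_text", speech_text), ("telop", telop_text)]:
        ET.SubElement(section, tag).text = value
    for word in common_words:
        ET.SubElement(section, "common_tag_value").text = word
    # Each keyword is stored under a tag named after its category.
    for category, word in keywords:
        ET.SubElement(section, category).text = word
    return section


elem = build_section(1, "high", "00:00:00", "00:00:10",
                     "さとうがうちました", "佐藤 本塁打",
                     ["さとう"], [("player_name", "さとう")])
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```

Using the category as the tag name is what makes a later tag search such as "player name equals さとう" possible.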

The meta information for each stream section is expressed, for example, in an XML format typified by MPEG7. MPEG7 is an XML-based standard that is becoming popular as a meta information standard for video data. Here the meta information is expressed in the XML format shown in FIG. 6. This format is not MPEG7-compliant, but that does not affect the following description.

Next, the processing of the video meta information generation unit 1 is described with reference to the flowchart in FIG. 3. The content data input here is assumed to be a video stream in which an audio stream and an image stream are multiplexed, as defined by the MPEG standards (MPEG2 or MPEG4), but it is not limited to this.

First, the demultiplexer unit 101 demultiplexes the input content data into an audio stream and an image stream (step S1). For simplicity, the image stream obtained here is assumed to consist of three sections obtained, as described later, by determining two video scene switching points through, for example, video recognition processing, and to contain the three sections (stream sections) and audio (shown here as text for convenience) illustrated in FIGS. 4(a) to 4(c). The section IDs of the three stream sections are "1", "2", and "3". The sections may instead be delimited at silent portions of the audio, or at arbitrary positions chosen by the user.

Next, the speech recognition unit 102 performs speech recognition on the obtained audio stream and obtains text (speech text) corresponding to the audio data in the stream (step S2). Typically, the recognition rate is raised by using a dictionary for speech recognition (recognition dictionary 107) to find the word that best matches the candidate phoneme combinations in the audio data. Conventional techniques may be used for this speech recognition; since it is not the gist of the present invention, its description is omitted. The speech text obtained in this way for each stream section (FIGS. 4(a) to 4(c)) is shown in FIGS. 5(a) to 5(c). The speech recognition unit 102 also extracts any of the above-mentioned keywords found in this speech text.

In parallel with the speech recognition processing, the telop recognition unit 103 performs character recognition on a predetermined area of each image frame of the obtained image stream in which a telop (on-screen caption) may be displayed (for example, the lower quarter of the screen) and obtains telop text (step S3).

The telop recognition unit 103 performs video recognition processing on the image stream to determine video scene switching points, and uses these switching points as section boundaries in the image stream. A scene switching point is determined, for example, as a point in time at which the pixel values, luminance, color, or the like change between image frames by at least a predetermined threshold. Alternatively, the boundaries may be determined by any suitable method based on breaks in the audio stream or in the telop recognition results. Here, it is assumed that two scene switching points are found, so that there are three stream sections. FIGS. 4(a) to 4(c) show, for each stream section obtained in this way, an image from the video recognition result on which the audio and telop in that section are displayed as text. Telop recognition is then applied to each obtained stream section to extract text (telop text). An existing technique may be used to recognize telops and extract the text data (see, for example, JP-A-2001-285716). The telop text obtained in this way for each stream section (FIGS. 4(a) to 4(c)) is shown in FIGS. 5(a) to 5(c). The telop recognition unit 103 also extracts any of the above-mentioned keywords found in this telop text.
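
The threshold-based determination of scene switching points can be sketched as follows. This is a minimal illustration only: the frame representation (a flat list of luminance values per frame), the mean absolute difference measure, and the function names `scene_change_points` and `split_into_sections` are assumptions made for this sketch and are not specified in the description.

```python
def scene_change_points(frames, threshold):
    """Return the indices i at which frame i differs from frame i-1 by at
    least `threshold` (mean absolute luminance difference, an assumed measure)."""
    points = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mean_abs_diff >= threshold:
            points.append(i)
    return points


def split_into_sections(num_frames, points):
    """Turn switching points into (section_id, first_frame, last_frame) tuples."""
    bounds = [0] + points + [num_frames]
    return [(k + 1, bounds[k], bounds[k + 1] - 1) for k in range(len(bounds) - 1)]


# Six tiny hypothetical frames containing two abrupt changes.
frames = [
    [10, 10, 10, 10], [11, 10, 10, 9],
    [200, 190, 210, 205], [201, 190, 210, 205],
    [50, 60, 55, 40], [50, 61, 55, 40],
]
points = scene_change_points(frames, threshold=30)
sections = split_into_sections(len(frames), points)
print(points)
print(sections)
```

With two switching points the sketch yields three sections with IDs 1 to 3, mirroring the three stream sections assumed in the description.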

Next, the keyword matching degree determination unit 104 compares the speech text with the telop text. When both contain a common word, that is, a word whose reading matches (a homophone), the unit determines the speech recognition reliability to be "high"; when they contain no such common word, it determines the reliability to be "low". The keyword matching degree determination unit 104 outputs the speech recognition reliability and the common words for each stream section (step S4).

For example, the speech text obtained from the audio stream of the stream section with section ID "1" contains "のざき・せんしゅ" (nozaki senshu, "player Nozaki") and "に・あんだ" (ni anda, "two hits"), as shown in FIG. 5(a), and the telop text obtained from the image stream of the same section contains "野崎選手" and "２安打", as shown in FIG. 5(b), so the two agree in each case. The accuracy of speech recognition in this stream section is therefore considered high. That is, the speech recognition reliability of this section is "high", and the common words are "のざき・せんしゅ" and "に・あんだ".
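
The reliability determination of step S4 might be sketched as below. The description does not say how telop words written in kanji are mapped to readings, so the `READINGS` table, the `normalize` helper, and the telop text used for the second call are hypothetical stand-ins introduced only for this sketch.

```python
# Hypothetical reading table: telop word (kanji) -> kana reading.
READINGS = {
    "野崎選手": "のざきせんしゅ",
    "２安打": "にあんだ",
}


def normalize(word):
    """Drop the "・" separators that appear in the speech text."""
    return word.replace("・", "")


def judge_reliability(speech_words, telop_words, readings=READINGS):
    """Return ("high" or "low", common_words) for one stream section."""
    telop_readings = {readings.get(w, w) for w in telop_words}
    common = [w for w in speech_words if normalize(w) in telop_readings]
    return ("high" if common else "low"), common


# Section 1 of the description: both words have a matching telop reading.
section1 = judge_reliability(["のざき・せんしゅ", "に・あんだ"], ["野崎選手", "２安打"])
# A misrecognized word has no matching reading, so the result is "low".
section2 = judge_reliability(["のざき・さんしゃ"], ["野崎選手"])
print(section1)
print(section2)
```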

The data generation unit 106 then generates video meta information data such as that shown in FIG. 6 (step S5). It does so from the outputs obtained as described above: the speech text, keywords, and keyword categories output from the speech recognition unit 102; the telop text, keywords, and keyword categories output from the telop recognition unit 103; the speech recognition reliability and common words output from the keyword matching degree determination unit 104; and the section ID, start/end times, and so on output from the scene information extraction unit 105.

In the video meta information shown in FIG. 6 for the stream section with section ID "1", the common words "のざき・せんしゅ" and "に・あんだ" are described in elements with the tag name "共通タグ値" ("common tag value").

The speech text extracted from the stream section with section ID "1" contains "のざき・せんしゅ". This satisfies the rule: when the text reads "AAAせんしゅが", "AAAせんしゅは", or "AAAせんしゅの" (where せんしゅ is 選手, "player"), "AAA" is judged to be a player name. If, in addition, "のざき" is registered under the category "選手名" ("player name") in the word dictionary, the speech recognition unit 102 extracts "のざき" as a keyword in the speech text. The category of this keyword is "player name". In the video meta information shown in FIG. 6, the keyword "のざき" obtained in this way by the speech recognition unit 102 is described as the value of an element whose tag name is "選手名".
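
The rule and dictionary lookup described above can be illustrated with a regular expression. This is a sketch under assumptions: the pattern, the `WORD_DICTIONARY` contents, the function name `extract_keywords`, and the full sentences passed in are invented for illustration; the description only gives the fragments "のざき・せんしゅ" and "のざき・さんしゃ".

```python
import re

# "AAA" followed by せんしゅ and one of the particles が / は / の.
# The optional "・" allows for the separators shown in the speech text.
PLAYER_RULE = re.compile(r"([ぁ-ん]+?)・?せんしゅ・?(?:が|は|の)")

# Hypothetical word dictionary: category -> registered readings.
WORD_DICTIONARY = {"選手名": {"のざき", "さとう"}}


def extract_keywords(speech_text):
    """Return (category, keyword) pairs that satisfy both the rule and the dictionary."""
    return [
        ("選手名", name)
        for name in PLAYER_RULE.findall(speech_text)
        if name in WORD_DICTIONARY["選手名"]
    ]


hit = extract_keywords("のざき・せんしゅ・が・に・あんだ")
miss = extract_keywords("のざき・さんしゃ・が")
print(hit)
print(miss)
```

The second call shows why a misrecognized word produces no tag: the string no longer matches the rule, so no keyword is extracted.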

In contrast, the speech text and telop text extracted from the stream sections with section IDs "2" and "3" contain no common words, so the speech recognition reliability of each is determined to be "low".

The speech text extracted from the stream section with section ID "3" contains "さとう・せんしゅ" (sato senshu, "player Sato"). This satisfies the rule: when the text reads "AAAせんしゅが", "AAAせんしゅは", or "AAAせんしゅの", "AAA" is judged to be a player name. If, in addition, "さとう" is registered under the category "選手名" ("player name") in the word dictionary, the speech recognition unit 102 extracts "さとう" as a keyword in the speech text. The category of this keyword is "player name". In the video meta information shown in FIG. 6, the keyword "さとう" obtained in this way by the speech recognition unit 102 is described as the value of an element whose tag name is "選手名".

In FIG. 6, the keyword "のざき" belonging to the category "選手名" ("player name") is associated with the category as a tag name and its element value (that is, <選手名>のざき</選手名>), but the association is not limited to this form. The category may instead be expressed as an attribute name (that is, <カテゴリ 選手名="のざき"/>), or hierarchically (that is, <選手 名前="のざき"/>). In the latter case, the category "選手" ("player") combined with its subcategory "名前" ("name") is equivalent to the category "選手名".
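
The three ways of associating a keyword with its category can be compared side by side using Python's standard `xml.etree.ElementTree` module. The sketch assumes that the attribute-style forms are an element "カテゴリ" with attribute "選手名" and an element "選手" with attribute "名前"; FIG. 6 itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Form used in FIG. 6: category as tag name, keyword as element value.
element_form = ET.Element("選手名")
element_form.text = "のざき"

# Alternative: category expressed as an attribute name on a generic element.
attribute_form = ET.Element("カテゴリ", {"選手名": "のざき"})

# Alternative: hierarchical category (element "選手", sub-category attribute "名前").
hierarchical_form = ET.Element("選手", {"名前": "のざき"})

for node in (element_form, attribute_form, hierarchical_form):
    print(ET.tostring(node, encoding="unicode"))
```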

The rules for deriving a category from a keyword, and the category output function, may be provided in the speech recognition unit 102 or the telop recognition unit 103.

Video meta information such as that shown in FIG. 6, generated by the video meta information generation unit 1, is stored in the video meta information storage unit 3.

Next, the processing from searching for desired content data using video meta information such as that shown in FIG. 6 through to playing it back is described with reference to the flowchart shown in FIG. 7.

Search conditions for retrieving the desired content data are entered at the playback device 7. The search request designation unit 8 of the playback device 7 displays a screen such as that shown in FIG. 8. On this screen, "item name" is a field for specifying a tag name (contained in the video meta information), and "item value" is a field for specifying a desired character string. From the screen in FIG. 8, the user can specify as the search condition either a tag name together with a character string contained in the value of the element with that tag name, or only a character string contained in the value of any element. The screen in FIG. 8 also has an area R1 for playing back and displaying the content data found by the search.

First, the case where the user specifies both a tag name and a character string as the search condition is described. When filling in the "item name" field, for example, the user selects the search button B1 on the screen shown in FIG. 8, and a list of the tag names contained in the video meta information that can be selected as search conditions is displayed in a pull-down menu or the like. By choosing a tag name from this list, the user enters it in the "item name" field. The list also includes "not specified"; choosing it means that no tag name is specified as a search condition.

Suppose that, as shown in FIG. 8, the user enters "選手名" ("player name") in the "item name" field and "のざき" in the "item value" field (step S11). When the user then selects button B2, a search request containing the search condition of tag name "選手名" and character string "のざき" is transmitted to the search device 11, and the content data search unit 4 receives the search condition contained in the request.

Because the search condition contains a tag name (step S12), the content data search unit 4 generates a query (for database search) that targets the stream sections with high reliability (speech recognition reliability) (step S13). The query in this example is written in XQuery, but another query language such as SQL may be used.

FIG. 9 shows the query generated at this point. The query means: "among the section data (the video meta information of each section), from the video meta information whose reliability is 'high', find all video meta information in which the element value of the '選手名' tag is 'のざき'." In the query of FIG. 9, the video meta information to be searched is thus limited to that whose reliability is "high".

The query shown in FIG. 9 is output to the video meta information search unit 5. Among the stream sections with high speech recognition reliability stored in the video meta information storage unit 3, the video meta information search unit 5 performs a tag search for the video meta information (also called section information) of stream sections having an element with the tag name specified in the search condition whose value contains the specified character string. With the query of FIG. 9, this yields the section information with section ID "1", whose reliability is "high" and whose "選手名" element value contains the string "のざき". The video meta information search unit 5 takes the section ID or the start/end times from this section information and passes them to the content data search unit 4 (step S14).
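
The effect of the tag search can be sketched in Python rather than XQuery. This is not the query of FIG. 9: the in-memory `SECTIONS` list stands in for the video meta information storage unit 3, and its field names (`id`, `reliability`, `tags`) are assumptions made for this illustration.

```python
# Hypothetical in-memory stand-in for the video meta information storage unit 3.
SECTIONS = [
    {"id": 1, "reliability": "high", "tags": {"選手名": "のざき"}},
    {"id": 2, "reliability": "low", "tags": {}},
    {"id": 3, "reliability": "low", "tags": {"選手名": "さとう"}},
]


def tag_search(sections, tag_name, value):
    """Return IDs of high-reliability sections whose `tag_name` element contains `value`."""
    return [
        s["id"]
        for s in sections
        if s["reliability"] == "high" and value in s["tags"].get(tag_name, "")
    ]


result = tag_search(SECTIONS, "選手名", "のざき")
print(result)
```

Only section 1 is returned: section 2 has no "選手名" element because tag extraction failed there, and low-reliability sections are excluded from the tag search in any case.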

The content data search unit 4 retrieves the content data corresponding to the obtained section ID or start/end times from the content data storage unit 2 (step S15). Here, the content data of the stream section with section ID "1", shown in FIG. 4(a), is retrieved and transmitted to the playback device 7 via the network 10.

On receiving the content data of the stream section with section ID "1" shown in FIG. 4(a), the playback unit 9 of the playback device 7 plays it back in the search result display area R1, as shown in FIG. 10 (step S16).

Next, the case where the user specifies only a character string as the search condition, without a tag name, is described. Suppose that, as shown in FIG. 11, the user enters "not specified" in the "item name" field and "のざき" in the "item value" field (step S11). When the user then selects the search button B2, a search request containing the character string "のざき" as the search condition is transmitted to the search device 11, and the content data search unit 4 receives the search condition contained in the request.

Because the search condition contains no tag name (step S12), the content data search unit 4 generates a query (for database search) that targets the speech text of all stream sections, without distinguishing between low and high reliability (speech recognition reliability). To do so, the content data search unit 4 first refers to the word dictionary 6 and obtains character strings (similar character strings) whose reading is the same as (homophonic with) or similar to that of the character string specified as the search condition (step S17).

The word dictionary 6 registers a plurality of words and, for each of them, the (readings of) words whose reading is the same as or similar to that of the word. For example, suppose that for the specified character string "のざき" (nozaki), the word dictionary 6 registers "のざき", which has the same reading, and "おざき" (ozaki), which has a similar reading.

Using the obtained similar character strings "のざき" and "おざき", the content data search unit 4 generates the queries shown in FIG. 12 (step S18). The query in FIG. 12(a) means: "from the video meta information of both low-reliability and high-reliability stream sections (all stream sections), find all video meta information whose '音声テキスト' (speech text) element contains the string 'のざき' in its value." The query in FIG. 12(b) has the same meaning with the string 'おざき' in place of 'のざき'.

The queries shown in FIG. 12 are output to the video meta information search unit 5. The video meta information search unit 5 searches all the section information stored in the video meta information storage unit 3 for video meta information (section information) whose speech text contains a similar character string. Combining the results of the queries in FIG. 12 yields two pieces of section information, with section IDs "1" and "2". The video meta information search unit 5 takes the section ID or the start/end times from this section information and passes them to the content data search unit 4 (step S19).
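
Steps S17 to S19 can be sketched as a keyword search over the speech text of every section, repeated for each similar character string and merged. The `SECTIONS` records, the full speech sentences, and the `WORD_DICTIONARY` mapping are hypothetical stand-ins for the storage unit 3 and the word dictionary 6; the description gives only fragments of the speech text.

```python
# Hypothetical stand-in for the stored section information.
SECTIONS = [
    {"id": 1, "reliability": "high", "speech_text": "のざき・せんしゅ・が・に・あんだ"},
    {"id": 2, "reliability": "low", "speech_text": "のざき・さんしゃ・が"},
    {"id": 3, "reliability": "low", "speech_text": "さとう・せんしゅ・の"},
]

# Hypothetical word dictionary 6: reading -> same or similar readings.
WORD_DICTIONARY = {"のざき": ["のざき", "おざき"]}


def similar_strings(keyword):
    """Step S17: look up same-sounding and similar-sounding strings."""
    return WORD_DICTIONARY.get(keyword, [keyword])


def keyword_search(sections, keyword):
    """Steps S18 and S19: one substring query per similar string, results merged."""
    hits = set()
    for variant in similar_strings(keyword):
        hits.update(s["id"] for s in sections if variant in s["speech_text"])
    return sorted(hits)


result = keyword_search(SECTIONS, "のざき")
print(result)
```

Unlike the tag search, this search also reaches section 2, whose speech text still contains "のざき" even though the following word was misrecognized.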

The content data search unit 4 retrieves the content data corresponding to the obtained section IDs or start/end times from the content data storage unit 2 (step S15). Here, the content data of the stream sections with section IDs "1" and "2", shown in FIGS. 4(a) and 4(b), is retrieved and transmitted to the playback device 7 via the network 10. As shown in FIG. 13, the playback unit 9 of the playback device 7 plays back the two sections in the search result display area R1 (step S16). In this example the two retrieved sections are played back simultaneously, but they may be played back one at a time, or thumbnails representing the two sections may be displayed first so that the user can select which section to play back.

As described above, when the user specifies a tag name as a search condition in step S11 of FIG. 7, a tag search restricted to the high-reliability stream sections is performed in steps S13 and S14; when the user specifies only a character string (a keyword serving as the search key) without a tag name, a keyword search is performed on the speech text of all stream sections.

The advantage of tag search is that it reduces the ambiguity in the meaning of the keyword used as the search key and so enables precise retrieval. For example, when the keyword "川崎" (Kawasaki) is specified as the search key and text containing it is searched, it is unclear whether it is a place name, as in Kawasaki City, or a personal name, as in Mr. Kawasaki, so the results include noise that the user did not want. In tag search, the data states explicitly that the value of the "場所" ("place") tag is "川崎", as in <場所>川崎</場所>, and specifying the condition as 場所="川崎" eliminates such search noise.

However, for data to be tagged in this way, as in <場所>川崎</場所>, the text processing applied to the original data must also be accurate. When text processing is applied to speech text obtained by speech recognition, as in this embodiment, recognition accuracy may drop because of noise or the like, and tag generation may fail even in a section from which a tag could otherwise have been extracted. For example, in the section with section ID "2" shown in FIG. 4(b), speech recognition misrecognizes "のざき選手" (nozaki senshu) as "のざき・さんしゃ" (nozaki sansha), as shown in FIG. 5(b).

The string "のざき・さんしゃ" does not match the rule that "AAA" is judged to be a player name when the text reads "AAAせんしゅが", "AAAせんしゅは", or "AAAせんしゅの", so 選手名="のざき" is not extracted. A search with the condition 選手名="のざき" therefore does not hit the stream section with section ID "2". In sections where the speech recognition accuracy (speech recognition reliability) can be presumed low, it is therefore better to use keyword search instead of tag search.

By switching the search method (tag search or keyword search) according to the level of speech recognition accuracy, a precise narrowing search and a search with few omissions can each be used where appropriate, which allows flexible scene retrieval.

In step S17 of FIG. 7, the content data search unit 4 obtains character strings (similar character strings) whose reading is the same as or similar to that of the character string specified as the search key. This is done to allow for the ambiguity of speech recognition results.

Speech text contains many character strings (keywords) that are ambiguous from the standpoint of speech recognition. These are likely to have been recognized as strings different from the original audio, so searching them directly carries a high risk of missed results. For this reason, the speech recognition unit 102 of this embodiment does not convert kana into kanji.

In steps S17 to S19 of FIG. 7, a common query is generated for sections with different speech recognition reliability (high and low), but the present invention is not limited to this. A query for the high-reliability sections and a query for the low-reliability sections may be generated separately. In that case, the queries shown in FIGS. 12(c) and 12(d) are generated instead of the query in FIG. 12(a), and the query shown in FIG. 12(e) is generated instead of the query in FIG. 12(b).

This makes it possible to retrieve high-reliability sections that contain the specified character string and low-reliability sections that contain a character string similar to it.
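
The variant with separate handling per reliability level might look like the following sketch. The sample sections, including the hypothetical misrecognition "おざき" in section 2 and the unrelated player in section 3, are invented to show the difference, and `SIMILAR_READINGS` again stands in for the word dictionary 6.

```python
# Hypothetical word dictionary 6: reading -> same or similar readings.
SIMILAR_READINGS = {"のざき": ["のざき", "おざき"]}

# Invented sample data: section 2 is a low-reliability misrecognition,
# section 3 is a reliably recognized, different player.
SECTIONS = [
    {"id": 1, "reliability": "high", "speech_text": "のざき・せんしゅ・が"},
    {"id": 2, "reliability": "low", "speech_text": "おざき・せんしゅ・は"},
    {"id": 3, "reliability": "high", "speech_text": "おざき・せんしゅ・の"},
]


def reliability_aware_search(sections, keyword):
    """Exact string for high-reliability sections, similar strings for low ones."""
    variants = SIMILAR_READINGS.get(keyword, [keyword])
    hits = []
    for s in sections:
        candidates = [keyword] if s["reliability"] == "high" else variants
        if any(c in s["speech_text"] for c in candidates):
            hits.append(s["id"])
    return hits


result = reliability_aware_search(SECTIONS, "のざき")
print(result)
```

Section 3 is not returned because its recognition result is trusted and does not contain the exact string, whereas section 2 is returned through the similar reading.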

As described above, according to this embodiment, the video meta information generation unit 1 extracts speech text and telop text from each stream section of content data consisting of audio and video, and from these determines the reliability of the speech recognition result (speech recognition reliability).

For example, when the speech text and the telop text both contain a word with a matching reading (a common word), the speech recognition reliability of the speech text is determined to be high; when no common word is found, it is determined to be low. In addition, keywords and their categories are obtained from the speech text using rules and a word dictionary stored in advance.

Using the data obtained in this way, video meta information that is used when searching for the stream section (and that describes the features of the stream section) is generated for each stream section. The video meta information contains the speech text, the keywords obtained from the speech text, the categories of those keywords, the telop text, the speech recognition reliability and common words, the ID, the start/end times, and so on. A keyword extracted from the speech text is described as the value of an element whose tag name is the category of that keyword.

When searching the content data for a desired scene, the user specifies either a tag name and a keyword, or only a keyword, as the search key. In the former case, a tag search is performed on the stream sections whose speech recognition result has high reliability; in the latter case, a text search is performed on the speech text of all stream sections, regardless of whether the reliability of the speech recognition result is high or low.

By switching the search method according to the reliability of the speech recognition result, with precise tag search for stream sections whose reliability is high and keyword search and fuzzy search for those whose reliability is low, missed and erroneous results can be avoided as far as possible.

The way of switching search methods is not limited to this. For example, the search screen may provide buttons for specifying the search precision, so that a high-precision mode performs a reliable tag search on the high-reliability stream sections, a normal mode performs a keyword search on all sections, and a fuzzy mode performs a fuzzy search on all sections.

The method of the present invention described in the embodiment (see FIGS. 3 and 7) can also be distributed as a computer-executable program stored on a recording medium such as a magnetic disk (flexible disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory.

The present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all those shown in the embodiment, and constituent elements from different embodiments may be combined as appropriate.

The present invention relates to, for example, a home server.

FIG. 1 is a diagram showing a configuration example of the entire system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the video meta information generation unit.
FIG. 3 is a flowchart for explaining the processing operation of the video meta information generation unit.
FIG. 4 is a diagram showing examples of images, from the video recognition results for the three stream sections contained in the content data, on which the audio and telop in each stream section are displayed as text.
FIG. 5 is a diagram showing the speech text and telop text obtained from each of the three stream sections of FIG. 4.
FIG. 6 is a diagram showing an example of meta information.
FIG. 7 is a flowchart for explaining the processing operation when searching for content data containing a desired scene.
FIG. 8 is a diagram showing an example of a search screen that displays search conditions and search results, together with an example of a search condition.
FIG. 9 is a diagram showing an example of a tag search query generated by the content data search unit 4.
FIG. 10 is a diagram showing a display example of a search result.
FIG. 11 is a diagram showing another example of a search condition entered on the search screen.
FIG. 12 is a diagram showing another example of a query, for speech text search, generated by the content data search unit 4.
FIG. 13 is a diagram showing another display example of a search result.

Explanation of symbols

1 ... video meta information generation unit, 2 ... content data storage unit, 3 ... video meta information storage unit, 4 ... content data search unit, 5 ... video meta information search unit, 6 ... word dictionary, 7 ... playback device, 8 ... search request specifying unit, 9 ... playback unit, 10 ... network, 11 ... search device.

Claims

1. A meta information generation method in a meta information generating apparatus comprising:
speech recognition means for recognizing, for each of a plurality of sections into which content data is divided, the audio data in the content data of the section;
character recognition means for recognizing, for each of the plurality of sections of the content data, a telop included in the video data in the content data of the section;
determination means for determining a reliability of the speech recognition result using the speech recognition result of the speech recognition means and the character recognition result of the character recognition means; and
generating means for generating meta information consisting of a plurality of element data describing characteristics of the content data, based on the speech recognition result of the speech recognition means, the character recognition result of the character recognition means, and the determination result of the determination means,
the method comprising:
a first step in which the speech recognition means obtains, for each of the plurality of sections of the content data, speech text that is a speech recognition result of the audio data of the section;
a second step in which the speech recognition means obtains, from the speech text, a keyword included in the speech text and a category to which the keyword belongs;
a third step in which the character recognition means obtains, for each of the plurality of sections of the content data, telop text that is a character recognition result of the telop included in the video data of the section;
a fourth step in which the determination means determines, for each of the plurality of sections, that the reliability of the speech text of the section is high when the speech text and the telop text obtained from the section both contain a word with the same pronunciation (a homophone), and that the reliability of the speech text of the section is low when no homophone is contained; and
a fifth step in which the generating means generates, for each of the plurality of sections, the meta information including at least first element data describing an identifier of the section, second element data describing the speech text, third element data describing the reliability of the speech text, and, when the speech text contains the keyword, fourth element data describing the keyword and the category to which the keyword belongs.
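The fourth and fifth steps of the method above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: the pronunciation table `READINGS`, the word dictionary `WORD_DICTIONARY`, whitespace tokenization, and the names `reading`, `shares_homophone`, `SectionMeta`, and `generate_meta` are all hypothetical. Speech text and telop text are assumed to have already been produced by the recognizers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical pronunciation table (surface form -> reading); illustrative only.
READINGS = {"knight": "nait", "night": "nait"}

# Hypothetical word dictionary (keyword -> category); illustrative only.
WORD_DICTIONARY = {"knight": "chess"}


def reading(word: str) -> str:
    """Return the reading of a word, falling back to the lowercased word itself."""
    key = word.lower()
    return READINGS.get(key, key)


def shares_homophone(speech_text: str, telop_text: str) -> bool:
    """True when some word in the speech text sounds the same as a word in the telop text."""
    speech_readings = {reading(w) for w in speech_text.split()}
    telop_readings = {reading(w) for w in telop_text.split()}
    return bool(speech_readings & telop_readings)


@dataclass
class SectionMeta:
    section_id: str                  # first element data: section identifier
    speech_text: str                 # second element data: speech text
    reliability: str                 # third element data: "high" or "low"
    keyword: Optional[str] = None    # fourth element data (only if a keyword is found)
    category: Optional[str] = None


def generate_meta(section_id: str, speech_text: str, telop_text: str) -> SectionMeta:
    """Build the meta information for one section (fourth and fifth steps)."""
    reliability = "high" if shares_homophone(speech_text, telop_text) else "low"
    # Take the first word of the speech text that appears in the word dictionary.
    keyword = next(
        (w.lower() for w in speech_text.split() if w.lower() in WORD_DICTIONARY),
        None,
    )
    category = WORD_DICTIONARY[keyword] if keyword is not None else None
    return SectionMeta(section_id, speech_text, reliability, keyword, category)
```

With these illustrative tables, `generate_meta("s1", "the knight arrives", "Night game")` marks the section as high reliability because "knight" and "Night" share a reading, while a section whose telop shares no reading with the speech text is marked low.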

2. A meta information generating apparatus that generates meta information consisting of a plurality of element data describing characteristics of content data including video data and audio data, the apparatus comprising:
means for obtaining, for each of a plurality of sections into which the content data is divided, speech text that is a speech recognition result of the audio data of the section and telop text that is a character recognition result of a telop included in the video data of the section;
means for obtaining, from the speech text, a keyword included in the speech text and a category to which the keyword belongs;
means for determining, for each of the plurality of sections, that the reliability of the speech text of the section is high when the speech text and the telop text obtained from the section both contain a word with the same pronunciation (a homophone), and that the reliability of the speech text of the section is low when no homophone is contained; and
means for generating, for each of the plurality of sections, the meta information including at least first element data describing an identifier of the section, second element data describing the speech text, third element data describing the reliability of the speech text, and, when the speech text contains the keyword, fourth element data describing the keyword and the category to which the keyword belongs.

3. The meta information generating apparatus according to claim 2, wherein the meta information corresponding to each of the plurality of sections includes fifth element data describing telop text obtained from the section.
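As a rough illustration of meta information carrying the five kinds of element data, the sketch below serializes one section to XML with Python's standard `xml.etree.ElementTree`. The element and attribute names (`Section`, `SpeechText`, `Keyword`, `TelopText`, `reliability`, `category`) and the function `section_to_xml` are hypothetical choices made for this example; they are not taken from the patent's figures or from the MPEG-7 schema.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def section_to_xml(
    section_id: str,
    speech_text: str,
    reliability: str,
    telop_text: str,
    keyword: Optional[str] = None,
    category: Optional[str] = None,
) -> ET.Element:
    """Build one section's meta information as an XML element (hypothetical layout)."""
    # First element data: the section identifier, stored as an attribute.
    section = ET.Element("Section", id=section_id)
    # Second and third element data: the speech text and its reliability.
    ET.SubElement(section, "SpeechText", reliability=reliability).text = speech_text
    # Fourth element data: present only when a keyword was found in the speech text.
    if keyword is not None:
        ET.SubElement(section, "Keyword", category=category or "").text = keyword
    # Fifth element data: the telop text obtained from the section.
    ET.SubElement(section, "TelopText").text = telop_text
    return section


element = section_to_xml(
    "s1", "the knight arrives", "high", "Night game", keyword="knight", category="chess"
)
print(ET.tostring(element, encoding="unicode"))
```

Keeping the telop text alongside the speech text means a later search could consult either field, which is one plausible reason for recording the fifth element data.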

4. A search method in a search device comprising:
first storage means for storing content data including video data and audio data;
second storage means for storing a plurality of meta information data, each corresponding to one of a plurality of sections into which the content data is divided and each consisting of a plurality of element data describing characteristics of the content data in that section, the plurality of element data including first element data describing an identifier of the section, second element data describing speech text that is a speech recognition result of the audio data of the section, third element data describing whether the reliability of the speech text is high or low, and fourth element data describing a keyword extracted from the speech text and a category to which the keyword belongs; and
search means,
the method comprising:
a first step in which, when a first character string and a second character string are designated as search conditions, the search means takes as a search target the meta information data having high reliability among the plurality of meta information data stored in the second storage means, and searches the search target for meta information data including the fourth element data describing a category having the same sound as the first character string and a keyword having the same sound as the second character string;
a second step in which, when only the second character string, of the first character string and the second character string, is designated as the search condition, the search means obtains a plurality of third character strings each having the same sound as, or a sound similar to, the second character string;
a third step in which the search means takes the meta information data stored in the second storage means as a search target and searches the search target for meta information data whose speech text contains any of the plurality of third character strings; and
a fourth step in which the search means retrieves, from the content data stored in the first storage means, the content data of the section corresponding to the identifier included in the meta information data found in the first and third steps.
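The two search paths of the method above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the table `SIMILAR_SOUNDS`, the case-insensitive comparison used as a stand-in for "same sound", substring matching against the speech text, and the names `SectionMeta`, `expand_by_sound`, `search_sections`, and `fetch_content` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SectionMeta:
    section_id: str
    speech_text: str
    reliability: str                 # "high" or "low"
    keyword: Optional[str] = None
    category: Optional[str] = None


# Hypothetical table of same-sounding or similar-sounding strings; illustrative only.
SIMILAR_SOUNDS = {"night": ["night", "knight", "nite"]}


def same_sound(a: str, b: str) -> bool:
    """Stand-in for a pronunciation comparison: case-insensitive equality."""
    return a.lower() == b.lower()


def expand_by_sound(text: str) -> List[str]:
    """Second step: obtain strings with the same or a similar sound."""
    return SIMILAR_SOUNDS.get(text.lower(), [text.lower()])


def search_sections(
    records: List[SectionMeta], keyword: str, category: Optional[str] = None
) -> List[str]:
    """Return identifiers of matching sections."""
    if category is not None:
        # First step: category and keyword given, so search only high-reliability
        # records and match against the keyword/category element data.
        hits = [
            r for r in records
            if r.reliability == "high"
            and r.keyword is not None
            and r.category is not None
            and same_sound(r.category, category)
            and same_sound(r.keyword, keyword)
        ]
    else:
        # Second and third steps: keyword only, so expand it by sound and
        # search the speech text of every record, regardless of reliability.
        variants = expand_by_sound(keyword)
        hits = [
            r for r in records
            if any(v in r.speech_text.lower() for v in variants)
        ]
    return [r.section_id for r in hits]


def fetch_content(store: Dict[str, bytes], section_ids: List[str]) -> List[bytes]:
    """Fourth step: retrieve the content data of the matched sections."""
    return [store[s] for s in section_ids if s in store]
```

The design point the sketch tries to capture is that the structured keyword/category match is trusted only for high-reliability records, whereas the keyword-only path compensates for possible recognition errors by widening the query to similar-sounding strings.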

5. A search device comprising:
first storage means for storing content data including video data and audio data;
second storage means for storing a plurality of meta information data, each corresponding to one of a plurality of sections into which the content data is divided and each consisting of a plurality of element data describing characteristics of the content data in that section, the plurality of element data including first element data describing an identifier of the section, second element data describing speech text that is a speech recognition result of the audio data of the section, third element data describing whether the reliability of the speech text is high or low, and fourth element data describing a keyword extracted from the speech text and a category to which the keyword belongs;
first search means for, when a first character string and a second character string are designated as search conditions, taking as a search target the meta information data having high reliability among the plurality of meta information data stored in the second storage means, and searching the search target for meta information data including the fourth element data describing a category having the same sound as the first character string and a keyword having the same sound as the second character string;
means for obtaining, when only the second character string, of the first character string and the second character string, is designated as the search condition, a plurality of third character strings each having the same sound as, or a sound similar to, the second character string;
second search means for taking the meta information data stored in the second storage means as a search target and searching the search target for meta information data whose speech text contains any of the plurality of third character strings; and
third search means for retrieving, from the content data stored in the first storage means, the content data of the section corresponding to the identifier included in the meta information data found by the first and second search means.

6. The search device according to claim 5, further comprising:
means for obtaining, for each of the plurality of sections of the content data, speech text that is a speech recognition result of the audio data of the section and telop text that is a character recognition result of a telop included in the video data of the section;
means for obtaining, from the speech text, a keyword included in the speech text and a category to which the keyword belongs;
means for determining, for each of the plurality of sections, that the reliability of the speech text of the section is high when the speech text and the telop text obtained from the section both contain a word with the same pronunciation (a homophone), and that the reliability of the speech text of the section is low when no homophone is contained; and
generating means for generating, for each of the plurality of sections, the meta information data including at least first element data describing an identifier of the section, second element data describing the speech text, third element data describing the reliability of the speech text, and, when the speech text contains the keyword, fourth element data describing the keyword and the category to which the keyword belongs,
wherein the second storage means stores the meta information data generated by the generating means.