JP7606866B2

JP7606866B2 - Genre-specific text collection device and its program

Info

Publication number: JP7606866B2
Application number: JP2020204235A
Authority: JP
Inventors: 剛三島; 智康小森; 裕明佐藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2024-12-26
Anticipated expiration: 2040-12-09
Also published as: JP2022091412A

Description

The present invention relates to a genre-specific text collection device that collects text data by genre, and to a program therefor.

Building language models for speech recognition and performing analysis in natural language processing require a large amount of natural language text (hereinafter referred to as a text corpus). One way to obtain such a corpus is to collect text data posted on web pages. This approach is often used because a large amount of information, including text data, is published on the Internet and most of it can be accessed freely.

For example, Patent Document 1 discloses a method that selects a word set from a previously collected text corpus for the speech recognition target and uses that word set as a search engine query, thereby collecting text data related to the speech recognition target from the Internet.

As another method, Patent Document 2 discloses collecting text data for a specific purpose by extracting, from a text corpus containing multiple natural language sentences, word strings that match word string templates prepared in advance, and converting them with conversion rules into word strings in a format suited to the purpose.

Patent Document 3 discloses a method for collecting the subtitles of broadcast programs as training data for a language model.

Patent Document 1: JP 2012-83543 A
Patent Document 2: JP 2012-78647 A
Patent Document 3: JP 2019-8315 A

The method disclosed in Patent Document 1 requires a document format for the character strings used to select word sets to be created by hand. Collecting text data by genre with this method therefore means creating a separate document format for each genre. In addition, the text data collected depends on the document format and on the accuracy of the search engine.

The method disclosed in Patent Document 2 likewise requires word string templates and conversion rules to be created by hand, which is laborious, and the text data collected depends on the characteristics of those templates and rules.

The method disclosed in Patent Document 3 can collect a large amount of text data from the subtitles of broadcast programs. However, broadcast programs span a variety of genres, such as news, information, drama, and animation. With this method, programs must be sorted by genre by hand, which makes it difficult to collect a large amount of text data by genre.

The present invention was made in view of these problems. Its object is to provide a genre-specific text collection device, and a program therefor, that can collect large amounts of genre-specific text data accurately without depending on document formats, templates, or the like for identifying genres.

To solve the above problem, the genre-specific text collection device according to the present invention collects genre-specific text from subtitle text multiplexed into a digital broadcast, and comprises a broadcast receiving means, a subtitle information extraction means, an EPG information extraction means, a program information identification means, and a text extraction means.

In this configuration, the genre-specific text collection device receives a digital broadcast with the broadcast receiving means and demodulates it into a TS (transport stream) signal.
The genre-specific text collection device then uses the subtitle information extraction means to extract, from the TS signal demodulated by the broadcast receiving means, subtitle information that includes subtitle text and time information for presenting that text. The subtitle information is multiplexed into the TS signal, and its data format is standardized by ARIB (Association of Radio Industries and Businesses). The subtitle information extraction means can therefore extract the multiplexed subtitle information by analyzing the data format of the TS signal.

The genre-specific text collection device also uses the EPG information extraction means to extract the EPG information of broadcast programs from the TS signal. EPG information is the information used to generate an electronic program guide, and it contains the time information, genre, and so on of each broadcast program. Like the subtitle information, the EPG information is multiplexed into the TS signal, and its data format is standardized by ARIB. The EPG information extraction means can therefore extract the multiplexed EPG information by analyzing the data format of the TS signal.

The genre-specific text collection device then uses the program information identification means to identify the time information and genre of each broadcast program from the EPG information. Although the genre is set by each broadcasting station, it describes the broadcast content, so it is unlikely to differ between stations. The genre extracted from the EPG information is therefore a highly accurate label for the extracted text.
A genre consists of a higher-level classification and a lower-level classification that subdivides it. The program information identification means identifies only the higher-level classification as the genre. In addition, one or more genres are set for each broadcast program in the EPG information. When multiple genres are set, the program information identification means identifies the higher-level classification that is set most often for the broadcast program as the genre.

The genre-specific text collection device then uses the text extraction means to extract, from the subtitle information, the subtitle text in the time interval specified by the time information of the broadcast program, and associates it with the genre of the broadcast program to produce genre-specific text.

In this way, the genre-specific text collection device can extract subtitle text from the broadcast signal by genre, based on the EPG information.
The genre-specific text collection device can also be operated by a program that causes a computer to function as each of the means described above.

The present invention provides the following advantageous effects.
According to the present invention, a large amount of subtitle text can be collected by genre, based on the genre and time information of broadcast programs set in the EPG information.
As a result, a highly accurate, genre-classified text corpus of the kind needed for speech recognition, natural language processing, and the like can be obtained without manual effort.

FIG. 1 is a block diagram showing the configuration of a genre-specific text collection device according to an embodiment of the present invention.
FIG. 2 is a data structure diagram showing the schematic structure of a transport stream (TS) signal.
FIG. 3 is a diagram showing an example of the subtitle information that the subtitle information extraction means extracts from the TS signal.
FIG. 4 is a diagram showing an example of the EPG information that the EPG information extraction means extracts from the TS signal.
FIG. 5 is an explanatory diagram of the genre classification.
FIG. 6 is an explanatory diagram of an example in which the subtitle text extraction means extracts subtitle text by genre from the subtitle information.
FIG. 7 is an explanatory diagram of the regular expression filter processing performed by the formatting means.
FIG. 8 is a flowchart showing the overall operation of the genre-specific text collection device according to the embodiment of the present invention.
FIG. 9 is a flowchart showing the detailed operation of the genre identification in FIG. 8.

An embodiment of the present invention is described below with reference to the drawings.
<Configuration of genre-specific text collection device>
The configuration of a genre-specific text collection device 1 according to an embodiment of the present invention is described with reference to FIG. 1.

The genre-specific text collection device 1 collects genre-specific text from subtitle text multiplexed into a digital broadcast.
As shown in FIG. 1, the genre-specific text collection device 1 includes a broadcast receiving means 10, a broadcast information extraction means 11, a storage means 12, a program information identification means 13, and a text extraction means 14.

The broadcast receiving means 10 receives and demodulates the broadcast wave of a digital broadcast. The broadcast wave may be terrestrial digital, satellite, cable, or the like, and may be wireless or wired.
The broadcast receiving means 10 is a television tuner that receives the broadcast wave of a digital broadcast and decodes it into a broadcast signal in the form of an MPEG-2 transport stream signal (hereinafter referred to as a TS signal).

As shown in FIG. 2, the TS signal multiplexes video and audio information 200, subtitle information 201, EPG information 202, data broadcast information 203, and so on. Because the TS signal is standardized by ARIB (Association of Radio Industries and Businesses), a detailed description is omitted here.
The broadcast receiving means 10 demodulates the broadcast signal of a specified channel from the broadcast wave. The number of channels demodulated is not limited to one, and the broadcast receiving means 10 may be configured as multiple tuners.
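The multiplexing shown in FIG. 2 can be illustrated with a minimal sketch. It relies only on general MPEG-2 TS facts (188-byte packets, sync byte 0x47, a 13-bit PID in the header) and is not taken from this description. The helper names and the PID values in the demo are hypothetical, and real subtitle or EPG extraction additionally requires the ARIB-defined section and PES parsing, which is not shown.

```python
from collections import Counter

TS_PACKET_SIZE = 188  # fixed MPEG-2 TS packet length in bytes
SYNC_BYTE = 0x47      # first byte of every TS packet


def packet_pid(packet: bytes) -> int:
    """Return the 13-bit PID carried in the header of one TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid TS packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def count_pids(stream: bytes) -> Counter:
    """Count packets per PID in a byte string of back-to-back TS packets."""
    counts = Counter()
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        counts[packet_pid(stream[offset:offset + TS_PACKET_SIZE])] += 1
    return counts


def make_packet(pid: int) -> bytes:
    """Build a dummy packet (header plus zero padding), for illustration only."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))


# Hypothetical demo stream: the PID values are arbitrary examples.
demo_stream = make_packet(0x0012) + make_packet(0x0100) + make_packet(0x0012)
pid_counts = count_pids(demo_stream)
```

Grouping packets by PID in this way is the first step a demultiplexer takes before it can reassemble the individual streams.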

The genre-specific text collection device 1 uses only the subtitle information and the EPG information in the broadcast information extraction means 11 described later. The broadcast receiving means 10 is therefore not limited to full-segment broadcasting. It may be a television tuner for one-segment broadcasting, which handles less data, needs no copy control by a CAS (Conditional Access System), and can be obtained inexpensively.

The broadcast receiving means 10 outputs the demodulated TS signal to the broadcast information extraction means 11. When multiple channels are received, the broadcast receiving means 10 outputs a TS signal for each channel to the broadcast information extraction means 11.

The broadcast information extraction means 11 extracts the subtitle information and the EPG information for a specified time interval from the TS signal demodulated by the broadcast receiving means 10.
The specified time interval is the interval over which the user wants to collect text. It is specified externally as a collection start time and a collection end time (each given, for example, as year, month, day, hour, minute, and second). Alternatively, the interval may run from a start instruction to an end instruction given by an external switch or the like, or it may be specified as a start instruction and a length of time.
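The two explicit ways of specifying the collection interval (start plus end, or start plus length) could be normalized as in the following minimal sketch. The function name `collection_interval` and its parameters are hypothetical, and the switch-driven variant is not covered.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def collection_interval(start: datetime,
                        end: Optional[datetime] = None,
                        length: Optional[timedelta] = None) -> Tuple[datetime, datetime]:
    """Return (start, end) from either an explicit end time or a length."""
    if (end is None) == (length is None):
        raise ValueError("give exactly one of 'end' or 'length'")
    if end is None:
        end = start + length
    if end <= start:
        raise ValueError("the interval must have a positive length")
    return start, end


# One day of collection starting at 2020/07/08 06:00:00 (example values).
one_day = collection_interval(datetime(2020, 7, 8, 6, 0, 0), length=timedelta(days=1))
```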

The length of this time interval can be set freely, for example to one day, one week, one month, or one year. The broadcast information extraction means 11 includes a timing means (timer, not shown) that times the interval.
As shown in FIG. 1, the broadcast information extraction means 11 includes a subtitle information extraction means 110 and an EPG information extraction means 111.

The subtitle information extraction means 110 extracts, from the TS signal, subtitle information that includes subtitle text and time information for presenting that text.
The subtitle information extraction means 110 analyzes the TS signal and extracts the subtitle information for the specified time interval. FIG. 3 shows an example of the subtitle information extracted by the subtitle information extraction means 110.

As shown in FIG. 3, the subtitle information extracted by the subtitle information extraction means 110 consists of a date 300, a start time 301, and subtitle text 302.
The date 300 is the date (year/month/day) on which the corresponding subtitle text 302 is presented on the television screen.
The start time 301 is the time (hours:minutes:seconds) at which the corresponding subtitle text 302 is presented on the television screen.
The subtitle text 302 is the character string of the subtitle presented on the television screen at the corresponding date 300 and start time 301.
For example, FIG. 3 shows that the subtitle "今や時代の先端をゆくメガロポリスに。" ("Now a megalopolis at the cutting edge of the times.") is presented at the start time 06:00:12 on the date 2020/07/08.
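One row of the subtitle information in FIG. 3 could be held in a record such as the one sketched below. The class and function names are hypothetical, and the input formats assume the year/month/day and hours:minutes:seconds notation given above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SubtitleEntry:
    presented_at: datetime  # date 300 combined with start time 301
    text: str               # subtitle text 302


def parse_subtitle_row(date: str, start_time: str, text: str) -> SubtitleEntry:
    """Combine the date and start time columns into one timestamp."""
    presented_at = datetime.strptime(f"{date} {start_time}", "%Y/%m/%d %H:%M:%S")
    return SubtitleEntry(presented_at, text)


# The example row described for FIG. 3.
entry = parse_subtitle_row("2020/07/08", "06:00:12", "今や時代の先端をゆくメガロポリスに。")
```

Storing the presentation time as a single timestamp makes the later comparison against a program's time interval a plain range check.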

The subtitle information extraction means 110 writes the extracted subtitle information to the storage means 12. When there are TS signals for multiple channels, the subtitle information extraction means 110 extracts subtitle information from each TS signal and stores it in the storage means 12. In that case it may store the information in a separate storage area per channel, or it may store it without distinguishing between channels.

The EPG information extraction means 111 extracts the EPG (Electronic Program Guide) information of broadcast programs from the TS signal.
The EPG information extraction means 111 analyzes the TS signal and extracts the EPG information for the specified time interval. FIG. 4 shows an example of the EPG information extracted by the EPG information extraction means 111.

As shown in FIG. 4, the EPG information extracted by the EPG information extraction means 111 consists of a date 400, a start time 401, a duration 402, a genre identifier 403, and a title 404.
The date 400 is the date (year/month/day) on which the broadcast program with the corresponding title 404 is broadcast.
The start time 401 is the time (hours:minutes:seconds) at which the broadcast program with the corresponding title 404 is broadcast.
The duration 402 is the running time (hours:minutes:seconds) of the broadcast program with the corresponding title 404.

The genre identifier 403 is an identifier indicating the genre of the broadcast program with the corresponding title 404. The genre classifies the content of a broadcast program by field, such as news, sports, or drama. Here, the genre identifier consists of a higher-level classification and a lower-level classification that subdivides it. For example, if the higher-level classification is "sports," the lower-level classifications are "baseball," "soccer," and so on. The genre identifier 403 is expressed as a two-digit hexadecimal number whose upper digit is the higher-level classification and whose lower digit is the lower-level classification.
As shown in FIG. 4, multiple genre identifiers may be set for one broadcast program.

The title 404 is the name of the broadcast program. If the program is broadcast with subtitles, a predetermined marker (here, [字]) is appended to the title 404. If the program is a rebroadcast, a predetermined marker (here, [再]) is appended to the title 404.
For example, FIG. 4 shows that the program titled "２度目のタイ「バンコク編」［字］" ("Second Time in Thailand: Bangkok Edition," with the subtitle marker), with a duration of 30 minutes (00:30:00) and the genre identifiers "0x25", "0xa0", and "0x86", is broadcast from the start time 06:00:00 on the date 2020/07/08.
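A row of the EPG information in FIG. 4 could be represented as in the following minimal sketch. The class name `EpgEntry`, the function `parse_epg_row`, and the derived `end` property are hypothetical; the field formats follow the notation described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence, Tuple


@dataclass(frozen=True)
class EpgEntry:
    start: datetime              # date 400 combined with start time 401
    duration: timedelta          # duration 402
    genre_ids: Tuple[int, ...]   # genre identifiers 403, in EPG order
    title: str                   # title 404

    @property
    def end(self) -> datetime:
        """End of the program's time interval (start plus duration)."""
        return self.start + self.duration


def parse_epg_row(date: str, start_time: str, duration: str,
                  genre_ids: Sequence[str], title: str) -> EpgEntry:
    start = datetime.strptime(f"{date} {start_time}", "%Y/%m/%d %H:%M:%S")
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    length = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    ids = tuple(int(genre_id, 16) for genre_id in genre_ids)  # "0x25" -> 0x25
    return EpgEntry(start, length, ids, title)


# The example row described for FIG. 4.
program = parse_epg_row("2020/07/08", "06:00:00", "00:30:00",
                        ["0x25", "0xa0", "0x86"], "２度目のタイ「バンコク編」［字］")
```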

The genre identifier is now described in more detail with reference to FIG. 5.
The standard defined by ARIB (ARIB STD-B10) can be used for the genre identifier.
FIG. 5 shows the genre classification items defined in ARIB STD-B10 Part 2, Appendix H.

The major genre classification 500 is the higher-level classification of genres, and the medium genre classification 501 is the lower-level classification that further subdivides the major genre classification 500. Following ARIB terminology, the higher-level classification is called the major genre classification and the lower-level classification is called the medium genre classification here.
Predetermined values are defined for each: the major genre classification 500 is the upper digit of a two-digit hexadecimal number, and the medium genre classification 501 is the lower digit.

For example, as upper-digit values of the major genre classification 500, "0x0" indicates "news/reporting" and "0x1" indicates "sports."
Likewise, under the major genre classification 500 value "0x2" ("information/variety show"), the lower-digit value "0x0" of the medium genre classification 501 indicates "entertainment/variety show" and "0x1" indicates "fashion."

The genre identifier is a two-digit hexadecimal (8-bit) value with the major genre classification 500 as the upper digit (upper 4 bits) and the medium genre classification 501 as the lower digit (lower 4 bits).
For example, the genre identifier "0x21" indicates the genre "information/variety show" + "fashion."
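The split of the 8-bit genre identifier into its two 4-bit parts can be sketched as follows. The function names are hypothetical, and the name tables contain only the few entries quoted in this description, not the full ARIB STD-B10 table.

```python
from typing import Tuple

# Only the entries quoted in this description; the full table is in ARIB STD-B10.
MAJOR_NAMES = {0x0: "news/reporting", 0x1: "sports", 0x2: "information/variety show"}
MEDIUM_NAMES = {(0x2, 0x0): "entertainment/variety show", (0x2, 0x1): "fashion"}


def split_genre_id(genre_id: int) -> Tuple[int, int]:
    """Return (major, medium): the upper and lower 4 bits of the identifier."""
    if not 0 <= genre_id <= 0xFF:
        raise ValueError("genre identifier must fit in 8 bits")
    return genre_id >> 4, genre_id & 0x0F


def describe_genre(genre_id: int) -> str:
    """Render an identifier as 'major + medium', falling back to hex digits."""
    major, medium = split_genre_id(genre_id)
    major_name = MAJOR_NAMES.get(major, f"major 0x{major:x}")
    medium_name = MEDIUM_NAMES.get((major, medium), f"medium 0x{medium:x}")
    return f"{major_name} + {medium_name}"


label = describe_genre(0x21)
```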
Returning to FIG. 1, the description of the configuration of the genre-specific text collection device 1 will be continued.

The EPG information extraction means 111 writes the extracted EPG information to the storage means 12. When there are TS signals for multiple channels, the EPG information extraction means 111 extracts EPG information from each TS signal and stores it in the storage means 12. In that case it may store the information in a separate storage area per channel, or it may store it without distinguishing between channels.

The subtitle information extraction means 110 and the EPG information extraction means 111 can each extract their information with existing techniques. For example, ariblib (reference: https://pypi.org/project/ariblib/), a library for the Python programming language, can be used.

After extracting the subtitle information and the EPG information for the specified time interval, the broadcast information extraction means 11 notifies the program information identification means 13 that extraction is complete (extraction completion notification). This notification is not required if the user instructs the program information identification means 13 directly.

The storage means 12 stores the subtitle information (see FIG. 3) and the EPG information (see FIG. 4) extracted by the broadcast information extraction means 11.
The storage means 12 can be a general storage medium such as a hard disk or a semiconductor memory.

The program information identification means 13 identifies the time information and genre of each broadcast program from the EPG information. It operates when it receives the extraction completion notification from the broadcast information extraction means 11, or when instructed by the user.
From the EPG information (see FIG. 4) stored in the storage means 12, the program information identification means 13 identifies the genre of each broadcast program and its time information (here, the date, start time, and duration).

The program information identification means 13 identifies the genre of a broadcast program according to a preset classification criterion. The criterion is whether genres are classified by major genre classification (higher-level classification) or by medium genre classification (lower-level classification).

When the classification criterion is the major genre classification (higher-level classification), the program information identification means 13 uses the major genre classification value (upper digit) of the genre identifier as the genre value.
When the classification criterion is the medium genre classification (lower-level classification), the program information identification means 13 uses the genre identifier itself as the genre value.
The two cases are described in detail below.

(When the classification criterion is the major genre classification [higher-level classification])
When the classification criterion is the major genre classification, the program information identification means 13 drops the lower digit (lower 4 bits) of each genre identifier set for the broadcast program, which is the medium genre classification, and uses the upper digit (upper 4 bits), which is the major genre classification, as the identifier.
If the broadcast program has only one such identifier, the program information identification means 13 takes it as the genre of the broadcast program.

On the other hand, when multiple genre identifiers are set for one broadcast program and several of them share the same major genre classification (upper 4 bits), the program information identification means 13 counts how often each upper-digit identifier appears and takes the most frequent one as the genre of the broadcast program. If several identifiers tie for the highest frequency, it takes the one that appears first in the order in which the genre identifiers are set in the EPG information.
In this way, the main genre of a broadcast program can be identified even when multiple genre identifiers are set for it.

However, if several identifiers tie for the highest frequency, the program information identification means 13 may, instead of settling on a single genre, associate the time information of the same broadcast program with each of the tied genres.
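The major-classification rule described above (count the upper 4 bits, take the most frequent value, break ties by EPG order) and its variant that keeps every tied genre can be sketched as follows. The function names are hypothetical, and this is one possible reading of the rule rather than the patented implementation.

```python
from collections import Counter
from typing import List, Sequence


def top_major_genres(genre_ids: Sequence[int]) -> List[int]:
    """Return every major classification tied for the highest frequency, in EPG order."""
    if not genre_ids:
        raise ValueError("at least one genre identifier is required")
    majors = [genre_id >> 4 for genre_id in genre_ids]  # drop the lower 4 bits
    counts = Counter(majors)
    highest = max(counts.values())
    tied: List[int] = []
    for major in majors:  # iterate in EPG order so the first hit wins ties
        if counts[major] == highest and major not in tied:
            tied.append(major)
    return tied


def major_genre(genre_ids: Sequence[int]) -> int:
    """Return the single genre: most frequent, first in EPG order on a tie."""
    return top_major_genres(genre_ids)[0]


# FIG. 4 example: 0x25, 0xa0, 0x86 have majors 0x2, 0xa, 0x8, each appearing once.
single = major_genre([0x25, 0xA0, 0x86])
all_tied = top_major_genres([0x25, 0xA0, 0x86])
```

With the FIG. 4 example every major classification appears once, so the single-genre rule falls back to the first identifier in EPG order, while the variant keeps all three.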

(When the classification criterion is the medium genre classification [lower-level classification])
When the classification criterion is the medium genre classification, the program information identification means 13 takes the genre identifier set for the broadcast program as the genre of the broadcast program.
When multiple genre identifiers are set for one broadcast program, the program information identification means 13 takes the genre identifier that appears first in the order in which they are set in the EPG information as the genre of the broadcast program.

However, when multiple genre identifiers are set for one broadcast program, the program information identification means 13 may, instead of settling on a single genre, associate the time information of the same broadcast program with each of the genres.
In this way, the program information identification means 13 identifies the genre of each broadcast program from the EPG information and outputs the genre and time information of the broadcast program to the text extraction means 14.

When the program information identification means 13 determines from the EPG information that a broadcast program is a rebroadcast, it may exclude that broadcast program from the targets for which time information and genre are identified. For example, the program information identification means 13 can refer to the EPG information (FIG. 4) and determine whether a broadcast program is a rebroadcast based on whether a predetermined marker indicating a rebroadcast (here, ［再］, meaning "rebroadcast") is set in the title 404.
By excluding rebroadcast programs from the targets of text extraction in this way, the same text is prevented from being extracted twice for the same genre.
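The rebroadcast check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (a dict with a `title` key) and the function names are hypothetical, and the sketch accepts both the full-width marker ［再］ and a half-width `[再]` variant, which the description does not specify.

```python
# Hypothetical sketch: filter out rebroadcast programs by a title marker.
# The marker "[再]" comes from the description; everything else is assumed.
REBROADCAST_MARK = "[再]"


def is_rebroadcast(title: str) -> bool:
    """Return True if the title carries the rebroadcast marker."""
    # Normalize full-width brackets so both bracket styles are detected.
    normalized = title.replace("［", "[").replace("］", "]")
    return REBROADCAST_MARK in normalized


def exclude_rebroadcasts(programs):
    """Keep only programs whose title does not carry the marker."""
    return [p for p in programs if not is_rebroadcast(p["title"])]
```

A caller would apply `exclude_rebroadcasts` to the stored EPG records before genre identification, so that later steps never see the excluded programs.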

The program information identification means 13 may also identify the time information and genre of a broadcast program only when it determines from the EPG information that the broadcast program is a subtitled broadcast. Note that identifying the time information and genre of a broadcast program that is not a subtitled broadcast does not affect the subsequent processing, because no subtitle information exists for that program.

The text extraction means 14 extracts, from the subtitle information, the subtitle text of the time section specified by the time information of a broadcast program, and outputs it as genre-specific text in association with the genre of the broadcast program.
As shown in FIG. 1, the text extraction means 14 includes a subtitle text extraction means 140 and a formatting means 141.

The subtitle text extraction means 140 extracts, from the subtitle information, the subtitle text corresponding to the time information of a broadcast program, for each genre of the broadcast program identified by the program information identification means 13.
Based on the program information identified by the program information identification means 13, the subtitle text extraction means 140 extracts, from the subtitle information stored in the storage means 12, the subtitle text of the time section specified by the time information of the broadcast program, for each genre of the broadcast program.

For example, suppose that subtitle information such as that shown in FIG. 6 is stored in the storage means 12, and that the program information identification means 13 notifies 0x02 (genre major classification [higher classification]) as the genre, together with the date (2020/07/08), start time (06:00:00), and duration (00:30:00) as the time information.
In this case, the subtitle text extraction means 140 extracts the subtitle text 302 whose date 300 is July 8, 2020 and whose start time 301 falls within the 30 minutes from 6:00, as the subtitle text corresponding to the genre 0x02 (information/wide show). The same applies to the other genre, 0x08 (documentary/education).

Because the subtitle information stored in the storage means 12 associates the subtitle text 302 with time information (date 300, start time 301), the subtitle text extraction means 140 can extract the subtitle text corresponding to a genre from the genre and time information notified by the program information identification means 13.
The subtitle text extraction means 140 outputs the extracted genre-specific subtitle text to the formatting means 141.
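The time-window lookup described above can be illustrated with a minimal Python sketch. The record layout (tuples of date, start time, and text), the string formats, and the half-open interval (start inclusive, end exclusive) are assumptions made for illustration; the description does not state how boundary timestamps are handled.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"  # assumed format, modeled on the example values


def extract_subtitle_text(subtitle_records, date, start_time, duration):
    """Return subtitle texts presented within [start, start + duration).

    subtitle_records: iterable of (date, start_time, text) tuples (hypothetical layout).
    """
    start = datetime.strptime(f"{date} {start_time}", TIME_FORMAT)
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    end = start + timedelta(hours=hours, minutes=minutes, seconds=seconds)

    selected = []
    for rec_date, rec_time, text in subtitle_records:
        shown_at = datetime.strptime(f"{rec_date} {rec_time}", TIME_FORMAT)
        if start <= shown_at < end:
            selected.append(text)
    return selected
```

With the example values from the description, a call such as `extract_subtitle_text(records, "2020/07/08", "06:00:00", "00:30:00")` would return the texts presented from 06:00:00 up to, but not including, 06:30:00.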

The formatting means 141 converts the subtitle text into formatted text by deleting text other than spoken text (meta-information) from the genre-specific subtitle text extracted by the subtitle text extraction means 140.

The meta-information used in subtitles falls into a limited set of patterns, such as speaker notation and scene notation. The formatting means 141 can therefore format the subtitle text by applying predetermined regular expression filters.

An example of the regular expression filter processing of the formatting means 141 will be described with reference to FIG. 7.
FIG. 7(a) shows an example of deleting speaker notation. For a pattern of "speaker" + "≫", such as "アナ≫" (an announcer), the formatting means 141 treats the text from the beginning of the sentence to just before the "≫" as the speaker, and deletes everything from the beginning of the sentence through the "≫".
FIG. 7(b) shows an example of deleting scene notation. For a scene notation string enclosed in parentheses, such as (applause and cheers), the formatting means 141 deletes the scene notation string together with the parentheses.

FIG. 7(c) shows an example of joining split phrases. In subtitles, a single sentence may be split across scenes, and a notation such as ［⇒］ is used to indicate that the sentence continues in the next phrase. In this case, the formatting means 141 joins the phrases by replacing "［⇒］" with the Japanese comma "、".
FIG. 7(d) shows an example of deleting background sound notation. For example, subtitles use a symbol 600 indicating a ringing telephone, a symbol 601 indicating someone speaking in the background, and so on. In this case, the formatting means 141 deletes the symbols 600 and 601 indicating background sounds.
In this way, the formatting means 141 can convert the subtitle text into text that contains only the spoken content, free of subtitle-specific notation.
Returning to FIG. 1, the description of the configuration of the genre-specific text collection device 1 continues.
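The four filters of FIG. 7 can be sketched with Python's `re` module as follows. This is an illustrative sketch, not the filter set of the embodiment: the exact patterns are assumptions, both full-width and half-width brackets are accepted for convenience, and because the background sound symbols 600 and 601 appear only in the figure, the sketch takes them as a caller-supplied parameter (`sound_symbols`, a hypothetical name).

```python
import re

# (a) Speaker notation: everything from the start of the line through "≫".
SPEAKER = re.compile(r"^[^≫]*≫")
# (b) Scene notation: a parenthesized string (full-width or half-width).
SCENE = re.compile(r"（[^）]*）|\([^)]*\)")
# (c) Continuation mark: replaced with the Japanese comma.
CONTINUATION = re.compile(r"［⇒］|\[⇒\]")


def format_subtitle(line: str, sound_symbols: str = "") -> str:
    """Strip subtitle-specific notation and keep only the spoken text."""
    line = SPEAKER.sub("", line)
    line = SCENE.sub("", line)
    line = CONTINUATION.sub("、", line)
    # (d) Background sound symbols: placeholders supplied by the caller.
    for symbol in sound_symbols:
        line = line.replace(symbol, "")
    return line.strip()
```

The speaker pattern assumes that "≫" is used only as a speaker delimiter; a corpus in which "≫" also occurs inside spoken text would need a stricter pattern.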

The formatting means 141 outputs the text formatted for each genre as genre-specific text. The output destination of the formatting means 141 is, for example, a storage device (not shown) that is connected directly or via a network.

With the configuration described above, the genre-specific text collection device 1 can collect genre-specific text from subtitle text based on EPG information, simply by receiving broadcast waves.
In addition, because genres are defined by a broadcasting standard, they vary little between broadcasting stations, so the genre-specific text collection device 1 can collect, for each genre, a large, high-quality text corpus usable for training language models and for natural language processing.

The genre-specific text collection device 1 can be operated by a program (a genre-specific text collection program) that causes a computer to function as each of the means described above.

<Operation of the genre-specific text collection device>
Next, the operation of the genre-specific text collection device 1 according to the embodiment of the present invention will be described with reference to FIGS. 8 and 9 (see FIG. 1 as appropriate for the configuration).

(Overall operation)
First, the overall operation of the genre-specific text collection device 1 will be described with reference to FIG. 8.
In step S1, the broadcast receiving means 10 receives the broadcast wave of a digital broadcast and demodulates it into a TS signal. The broadcast receiving means 10 demodulates the broadcast signals of the designated channels, of which there may be one or more.

In step S2, the subtitle information extraction means 110 of the broadcast information extraction means 11 analyzes the TS signal demodulated in step S1 and extracts subtitle information from the TS signal throughout the designated time section. Here, as shown in FIG. 3, the subtitle information extraction means 110 extracts the date 300, start time 301, and subtitle text 302 from the TS signal and stores them in the storage means 12.

In step S3, the EPG information extraction means 111 of the broadcast information extraction means 11 analyzes the TS signal demodulated in step S1 and extracts EPG information from the TS signal throughout the designated time section. Here, as shown in FIG. 4, the EPG information extraction means 111 extracts the date 400, start time 401, duration 402, genre identifier 403, and title 404 from the TS signal and stores them in the storage means 12.
Steps S2 and S3 need not be performed in this order; they may be performed in the order S3, S2, or in parallel.

In step S4, the broadcast information extraction means 11 determines whether the designated time section has ended.
If the designated time section has not yet ended (No in step S4), the genre-specific text collection device 1 returns to step S2 and continues the operation.

On the other hand, if the designated time section has ended (Yes in step S4), in step S5 the program information identification means 13 reads out the EPG information of one broadcast program at a time from the EPG information stored in the storage means 12 in step S3.

In step S6, the program information identification means 13 identifies the genre of the broadcast program according to a preset classification criterion. The detailed genre identification operation of step S6 is described later (see FIG. 9).

In step S7, the subtitle text extraction means 140 of the text extraction means 14 extracts, from the subtitle information stored in the storage means 12 in step S2, the subtitle text corresponding to the time information (here, date, start time, and duration) contained in the EPG information of the broadcast program read out in step S5.

In step S8, the formatting means 141 of the text extraction means 14 formats the subtitle text extracted in step S7 by applying predetermined regular expression filters.
In step S9, the formatting means 141 outputs the formatted text, together with the genre identified in step S6, as genre-specific text.

In step S10, the program information identification means 13 determines whether the storage means 12 still holds EPG information of a broadcast program that has not yet been read out.
If EPG information of a broadcast program that has not yet been read out exists (Yes in step S10), the genre-specific text collection device 1 returns to step S5 and continues the operation.
On the other hand, if all the EPG information has been read out (No in step S10), the genre-specific text collection device 1 ends the operation.
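The per-program loop of steps S5 to S10 can be summarized in a short Python sketch. Everything named here is hypothetical: the three callables stand in for the program information identification means 13, the subtitle text extraction means 140, and the formatting means 141, and collecting the output in a dictionary keyed by genre is an assumption, since the description only says that the genre-specific text is output to a storage device.

```python
def collect_genre_texts(epg_records, identify_genre, extract_text, format_text):
    """Run steps S5 to S10 over the stored EPG records.

    epg_records:    iterable of per-program EPG records (layout is up to the caller)
    identify_genre: callable(program) -> genre            (step S6)
    extract_text:   callable(program) -> list of lines    (step S7)
    format_text:    callable(line) -> formatted line      (step S8)
    """
    collected = {}
    # Iterating until the records are exhausted plays the role of S5 and S10.
    for program in epg_records:
        genre = identify_genre(program)
        lines = extract_text(program)
        formatted = [format_text(line) for line in lines]
        # Step S9: output the formatted text together with its genre.
        collected.setdefault(genre, []).extend(formatted)
    return collected
```

Passing the stages in as callables keeps the loop independent of how the EPG and subtitle records are actually stored.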

(Genre identification operation)
Next, the operation of step S6 (FIG. 8) performed by the program information identification means 13 will be described in more detail with reference to FIG. 9.

In step S61, the program information identification means 13 determines, according to the preset classification criterion, whether genres are to be classified by the genre major classification (higher classification) or by the genre middle classification (lower classification).
If genres are to be classified by the genre middle classification (lower classification) (No in step S61), the program information identification means 13 advances the operation to step S63.

On the other hand, if genres are to be classified by the genre major classification (higher classification) (Yes in step S61), in step S62 the program information identification means 13 deletes the lower digits of each genre identifier.
In step S63, the program information identification means 13 determines whether exactly one genre identifier is set for the broadcast program.

If there is exactly one genre identifier (Yes in step S63), in step S64 the program information identification means 13 identifies that genre identifier (its upper digits, in the case of the genre major classification) as the genre of the broadcast program.

On the other hand, if there are multiple genre identifiers (No in step S63), in step S65 the program information identification means 13 counts the occurrence frequency of each of the genre identifiers (their upper digits, in the case of the genre major classification).
In step S66, the program information identification means 13 determines whether exactly one genre identifier (upper digits, in the case of the genre major classification) has the maximum frequency counted in step S65.
If exactly one genre identifier has the maximum frequency (Yes in step S66), in step S67 the program information identification means 13 identifies the genre identifier with the maximum frequency (its upper digits, in the case of the genre major classification) as the genre of the broadcast program.

On the other hand, if multiple genre identifiers share the maximum frequency (No in step S66), in step S68 the program information identification means 13 identifies the genre identifier (its upper digits, in the case of the genre major classification) that appears first in the EPG information set for the broadcast program as the genre of the broadcast program.

After step S64, S67, or S68, the program information identification means 13 ends the operation of step S6 and proceeds to step S7 (FIG. 8).
Through the above operation, the genre-specific text collection device 1 can collect genre-specific text from subtitle text based on EPG information, simply by receiving broadcast waves.
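The decision flow of steps S61 to S68 can be sketched in Python as follows. Two points are assumptions rather than statements of the embodiment: each genre identifier is represented as a hypothetical `(major, middle)` tuple, so that "deleting the lower digits" becomes keeping only `major`; and in the tie case of step S68 the sketch returns the first-appearing identifier among those tied for the maximum frequency, which is one reading of "appears first in the EPG information".

```python
from collections import Counter


def identify_genre(genre_ids, use_major):
    """Identify one genre from the identifiers set for a program.

    genre_ids: list of (major, middle) tuples in EPG order (hypothetical layout).
    use_major: True for the major classification, False for the middle one.
    """
    if not genre_ids:
        raise ValueError("at least one genre identifier is required")

    # S61/S62: for the major classification, drop the lower (middle) part.
    keys = [major if use_major else (major, middle) for major, middle in genre_ids]

    # S63/S64: a single identifier is used as is.
    if len(keys) == 1:
        return keys[0]

    # S65: count occurrences; Counter preserves first-appearance order.
    counts = Counter(keys)
    top = max(counts.values())
    winners = [key for key, count in counts.items() if count == top]

    # S66/S67: a unique maximum wins. S68: on a tie, the tied identifier that
    # appears first in EPG order wins. Both cases reduce to the first winner.
    return winners[0]
```

Under the middle classification, distinct identifiers each occur once, so the tie rule selects the identifier that appears first, which matches the behavior described earlier for that criterion.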

REFERENCE SIGNS LIST
1 Genre-specific text collection device
10 Broadcast receiving means
11 Broadcast information extraction means
110 Subtitle information extraction means
111 EPG information extraction means
12 Storage means
13 Program information identification means
14 Text extraction means
140 Subtitle text extraction means
141 Formatting means

Claims

1. A genre-specific text collection device that collects genre-specific text from subtitle text multiplexed in digital broadcasting, comprising:
a broadcast receiving means for receiving and demodulating the digital broadcast;
a subtitle information extraction means for extracting subtitle information including the subtitle text and time information for presenting the subtitle text from the signal demodulated by the broadcast receiving means;
an EPG information extraction means for extracting EPG information of a broadcast program from the demodulated signal;
a program information identification means for identifying time information and a genre of the broadcast program from the EPG information; and
a text extraction means for extracting, from the subtitle information, the subtitle text of a time section specified by the time information of the broadcast program, and associating the extracted subtitle text with the genre of the broadcast program to generate a genre-specific text,
wherein the genres are composed of higher-level classifications classified by higher-level items and lower-level classifications subdivided from the higher-level classifications, and one or more genres are set for each broadcast program in the EPG information, and
the program information identification means identifies only the higher-level classification as the genre and, if multiple genres are set in the EPG information, identifies the higher-level classification that is set most frequently for the broadcast program as the genre.

2. The genre-specific text collection device according to claim 1, wherein, when multiple higher-level classifications are tied for the largest number of settings for the broadcast program, the program information identification means identifies the higher-level classification that appears first in the EPG information as the genre.

3. The genre-specific text collection device according to claim 1, wherein the text extraction means deletes meta-information other than spoken text from the subtitle text.

4. The genre-specific text collection device according to any one of claims 1 to 3, wherein, if the program information identification means determines from the EPG information that the broadcast program is a rebroadcast, it excludes the broadcast program from the targets for which time information and genre are identified.

5. The genre-specific text collection device according to any one of claims 1 to 4, wherein the program information identification means identifies the time information and genre of the broadcast program only when it determines from the EPG information that the broadcast program is a subtitled broadcast.

6. A genre-specific text collection program for causing a computer to function as the genre-specific text collection device according to any one of claims 1 to 5.