JP6865701B2 - Speech recognition error correction support device and its program

Info

Publication number: JP6865701B2
Application number: JP2018023711A
Authority: JP
Inventors: 三島　剛; 剛三島; 庄衛佐藤; 麻乃一木; 伊藤　均; 均伊藤; 愛子所澤; 彰夫小林
Original assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2017-04-18
Filing date: 2018-02-14
Publication date: 2021-04-28
Anticipated expiration: 2038-02-14
Also published as: JP2018180519A

Description

The present invention relates to a speech recognition error correction support device that supports the correction of speech recognition errors, and to a program therefor.

When the audio of material recorded during program coverage (including video and audio material) is to be used as text, transcribing that audio is an indispensable task. Usually, an operator listens to the recorded audio and types the text on a terminal keyboard or the like. In doing so, the operator frequently starts and stops playback and listens to the same passage many times; even for a skilled operator, this work is said to take about six times the recorded duration of the material.

Conventionally, as a technique for supporting transcription, a technique has been disclosed that applies speech recognition to each sentence (cell) obtained by dividing the input audio into arbitrary units, compares each recognition result with the corresponding audio, and corrects speech recognition errors (see Patent Document 1).
In this technique, audio is played back cell by cell, the operator corrects the recognition result cell by cell, and cells are corrected with ordinary text editor operations. The operator does not need to learn any special operations; after correcting a cell, the operator plays the audio from the beginning of the cell to check whether the recognition result has been corrected properly.

As another conventional technique for supporting transcription, techniques have been disclosed that associate the recognition result with the audio word by word and correct it in units of words (see Patent Documents 2 and 3).
These techniques are effective for error correction that requires real-time performance, such as closed captioning for broadcasts, and for correcting recognition results that contain few errors.

Japanese Unexamined Patent Publication No. 2015-184564
Japanese Unexamined Patent Publication No. 2004-226910
Japanese Unexamined Patent Publication No. 2005-228178

Because the technique disclosed in Patent Document 1 plays back audio and corrects recognition results cell by cell, the audio must be played from the beginning of the cell to check whether the corrected text matches the audio, even when only a few places were corrected.
As a result, the operator has to wait until playback reaches a correction target located partway through the cell. In addition, because the operator must pick out, within the cell, the audio that corresponds to the recognition result, it becomes difficult to associate the audio with the correction target as recognition quality deteriorates.

Techniques that correct recognition results word by word, such as those disclosed in Patent Documents 2 and 3, allow the recognition result to be corrected and the audio to be checked quickly. However, when a recognition error spans multiple words, the operator must designate and correct the words one after another, which complicates the procedure and means that it takes time to become accustomed to the operation.

An object of the present invention is therefore to provide a speech recognition error correction support device, and a program therefor, that can quickly play back the audio at a correction target and allow speech recognition errors to be corrected with simple operations.

To solve the above problem, the speech recognition error correction support device according to the present invention corrects speech recognition errors in the audio of content and comprises recognition result dividing means, recognition result display control means, error correction means, and audio reproduction means.

In this configuration, the recognition result dividing means divides the recognition result into segments according to a predetermined criterion, using the recognition result of the audio, which is text data, and the time information for each word constituting the recognition result.

The recognition result display control means then displays, together with item information, a button that specifies whether to display the word string contained in a segment. Depending on the selection made with the button, the recognition result display control means either displays an editing area and expands the segment's word string in it, or hides the editing area. Thus, rather than displaying the entire recognition result, the recognition result display control means lets the operator designate the segment to be edited from a list of items, and expands the word string of that segment in the editing area for presentation to the operator.

The error correction means then corrects errors in the segment in the editing area. In doing so, the error correction means causes the audio reproduction means to play the audio of the content corresponding to the time information from the word position designated in the editing area. The error correction means thus plays the audio from the word at the designated position, so that the operator can quickly check the audio corresponding to the recognition result or to its correction.
The speech recognition error correction support device can be operated by a speech recognition error correction support program that causes a computer to function as each of the means described above.

The present invention provides the following advantageous effects.
According to the present invention, the speech recognition result of the material content is divided and displayed as a list of items, so the operator can quickly select, with a simple operation, the recognition result to be checked for errors.
Further, according to the present invention, the corresponding audio is played back by the simple operation of designating a word position in the editing area, so errors in the speech recognition result can be found, and corrections confirmed, quickly.
The present invention thus makes it possible to correct errors in speech recognition results without requiring special skills.

The drawings, in order, are as follows.
A block diagram showing the configuration of the speech recognition error correction support device according to an embodiment of the present invention.
An explanatory diagram of the contents stored by the material information storage means.
A screen layout diagram showing an example of the material content selection screen for selecting material content.
A screen layout diagram showing an example of the item list screen, which lists the items obtained by dividing the speech recognition result of the material content.
A screen layout diagram showing an example in which the speech recognition result is expanded in the editing area on the item list screen.
An explanatory diagram of an example of editing work in the editing area.
An explanatory diagram of an example in which the display attributes of words in the editing area are changed in step with audio playback.
An explanatory diagram of an example in which the operations available for editing work in the editing area are presented.
An explanatory diagram of an example in which repeated audio playback is specified in the editing area.
A flowchart showing the segment information generation operation, in which the speech recognition error correction support device according to the embodiment generates the speech recognition result in units of segments.
A flowchart showing the segment information presentation operation, in which the speech recognition error correction support device according to the embodiment presents the speech recognition result on the display device in units of segments.
A flowchart showing the segment correction operation, in which the speech recognition error correction support device according to the embodiment corrects the recognition result while playing back audio.
A block diagram showing the configuration of a speech recognition error correction support device according to a modified embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the speech recognition error correction support device]
First, the configuration of a speech recognition error correction support device 1 according to an embodiment of the present invention is described with reference to FIG. 1.
The speech recognition error correction support device 1 supports the correction of speech recognition errors in material content that contains at least audio. In the present embodiment, the material content is content consisting of video and audio, for example, broadcast material.

As shown in FIG. 1, the speech recognition error correction support device 1 includes material content input means 10, speech recognition means 11, recognition result dividing means 12, item information extraction means 13, material information storage means 14, editing means 15, and transcription result output means 16.

The material content input means 10 inputs material content.
The material content input means 10 may, for example, input the material content from an external storage medium or via a communication line.
The material content input means 10 outputs the audio of the input material content to the speech recognition means 11. It also writes the input material content (video and audio) to the material information storage means 14 so that it can be used for correction work in the editing means 15, described later.

Alternatively, after writing the material content to the material information storage means 14, the material content input means 10 may notify the speech recognition means 11 that writing is complete, and the speech recognition means 11 may then read the audio from the material information storage means 14.

The speech recognition means 11 recognizes the audio of the material content input by the material content input means 10 and generates a recognition result, which is text data, together with time information for each word constituting the recognition result.
The speech recognition means 11 performs speech recognition using a language model, an acoustic model, and a pronunciation dictionary (not shown), and generates the recognized words together with time information indicating, for each word, the elapsed time from the beginning of the audio. The speech recognition means 11 outputs the recognized words and their time information to the recognition result dividing means 12. As the speech recognition method, the speech recognition means 11 may use, for example, a method that recognizes a word string from audio and outputs the result, as disclosed in Japanese Unexamined Patent Publication No. 2010-175765.

The recognition result dividing means 12 divides the recognition result (word string) produced by the speech recognition means 11 according to a predetermined criterion. Hereinafter, each chunk of the divided recognition result generated by the recognition result dividing means 12 is called a segment.
Any criterion may be predetermined as the division criterion used by the recognition result dividing means 12.
For example, silent sections of the audio can be used as the division criterion. In this case, the recognition result dividing means 12 detects silent sections in the audio stored in the material information storage means 14 from an acoustic feature such as power, and divides the recognition result produced by the speech recognition means 11 before and after each silent section.
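The silence-based division can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the use of mean squared amplitude as the "power" feature, the fixed threshold, the minimum silence duration, and the choice to split at the midpoint of each silent section are all assumptions made for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, float]  # (recognized word, start time in seconds)

def frame_powers(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (assumed power feature)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def silent_intervals(powers: List[float], frame_sec: float,
                     threshold: float, min_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of low-power runs lasting at least min_sec."""
    intervals, run_start = [], None
    # The infinite sentinel closes a silent run that reaches the end of the audio.
    for i, p in enumerate(powers + [float("inf")]):
        if p < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            start, end = run_start * frame_sec, i * frame_sec
            if end - start >= min_sec:
                intervals.append((start, end))
            run_start = None
    return intervals

def split_at_silence(words: List[Word],
                     intervals: List[Tuple[float, float]]) -> List[List[Word]]:
    """Split the timed word list at the midpoint of each silent interval."""
    boundaries = sorted((s + e) / 2 for s, e in intervals)
    segments, current = [], []
    for text, start in words:
        # Close the current segment once a word starts past the next boundary.
        while boundaries and start >= boundaries[0]:
            boundaries.pop(0)
            if current:
                segments.append(current)
                current = []
        current.append((text, start))
    if current:
        segments.append(current)
    return segments
```

In practice the threshold and minimum duration would need tuning for the recording conditions; the same `split_at_silence` step works with boundary times obtained from any other criterion.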

Alternatively, cut points of the video can be used as the division criterion. In this case, the recognition result dividing means 12 detects as a cut point any frame, in the video stored in the material information storage means 14, whose image features differ from those of the adjacent frame by more than a predetermined threshold, and divides the recognition result before and after the time of each cut point.
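As a rough illustration of cut point detection, the sketch below compares per-frame feature vectors and reports the times at which adjacent frames differ by more than a threshold. The patent does not specify the image feature or the distance measure; the mean absolute difference used here, and the names `frame_difference` and `detect_cut_times`, are assumptions.

```python
from typing import List, Sequence

def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def detect_cut_times(features: List[Sequence[float]], fps: float,
                     threshold: float) -> List[float]:
    """Times (s) of frames that differ from their predecessor by more than threshold."""
    return [
        i / fps
        for i in range(1, len(features))
        if frame_difference(features[i - 1], features[i]) > threshold
    ]
```

The feature vectors could be, for example, color histograms or downsampled pixel values; the returned times would then serve as division boundaries for the timed word list.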

Alternatively, meta information attached to the material content in advance may be used as the division criterion. Examples of meta information include GPS (Global Positioning System) position information (geotags). In this case, the recognition result dividing means 12 divides the recognition result at the points in time where the position information indicates that the location at which the material content was shot or recorded changes.
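One way to decide that the location "changes" is to compare each GPS fix with the first fix of the current segment and start a new segment once the distance exceeds a threshold. The sketch below makes that assumption; the patent does not say how a location change is determined, and the `Fix` tuple layout, the haversine distance, and the `min_move_m` parameter are illustrative choices.

```python
import math
from typing import List, Tuple

Fix = Tuple[float, float, float]  # (time in seconds, latitude, longitude)

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters on a sphere of mean Earth radius."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def location_change_times(fixes: List[Fix], min_move_m: float) -> List[float]:
    """Times at which the position has moved at least min_move_m from the
    reference fix of the current segment (assumed definition of a change)."""
    boundaries: List[float] = []
    if not fixes:
        return boundaries
    _, ref_lat, ref_lon = fixes[0]
    for t, lat, lon in fixes[1:]:
        if haversine_m(ref_lat, ref_lon, lat, lon) >= min_move_m:
            boundaries.append(t)
            ref_lat, ref_lon = lat, lon  # the new location becomes the reference
    return boundaries
```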

The recognition result dividing means 12 outputs the segments into which the speech recognition result has been divided to the item information extraction means 13. The recognition result dividing means 12 also writes the words and their time information, segment by segment, to the material information storage means 14.

The item information extraction means 13 extracts, for each segment produced by the recognition result dividing means 12, a feature word contained in that segment as an item.
A feature word is a word that is characteristic of the segment. For example, the item information extraction means 13 extracts words that characterize a segment using the TF-IDF method (TF: term frequency; IDF: inverse document frequency). TF-IDF is a kind of weight assigned to words in a document (in the present embodiment, a segment) and is used mainly in fields such as information retrieval and text summarization.
Specifically, the item information extraction means 13 calculates the term frequency tf(w, s) of a word w in a segment s by the following equation (1).

In equation (1), n_{w,s} is the number of occurrences of a word w in segment s, and Σ_{t∈s} n_{t,s} is the sum of the numbers of occurrences of all words in segment s.
The item information extraction means 13 also calculates the inverse document frequency idf(w) of a word w by the following equation (2).

In equation (2), N is the total number of segments in the material content, and df(w) is the number of segments (documents) of the material content in which the word w appears.
Then, as shown in the following equation (3), the item information extraction means 13 takes as the feature word of a segment the word in the segment for which the product of the tf value from equation (1) and the idf value from equation (2) is largest, or any word for which that product exceeds a predetermined reference value.
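Equations (1) to (3) are not reproduced in this text. Based on the symbol definitions given in the surrounding paragraphs, they would plausibly take the following form. Equation (1) follows directly from the stated definitions; equation (2) is written here in the conventional logarithmic form, which is an assumption (the original may use a different base or a smoothing term); and the name tfidf(w, s) for the product in equation (3) is introduced only for illustration.

```latex
\begin{align}
\mathrm{tf}(w, s) &= \frac{n_{w,s}}{\sum_{t \in s} n_{t,s}} \tag{1} \\
\mathrm{idf}(w) &= \log \frac{N}{\mathrm{df}(w)} \tag{2} \\
\mathrm{tfidf}(w, s) &= \mathrm{tf}(w, s) \cdot \mathrm{idf}(w) \tag{3}
\end{align}
```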

The item information extraction means 13 writes the extracted items to the material information storage means 14 in association with their segments.
Instead of using the TF-IDF method, the item information extraction means 13 may perform morphological analysis on the segment and extract nouns or proper nouns as feature words.
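The TF-IDF based selection of a feature word described above could be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: segments are assumed to be already tokenized into word lists, the idf term uses the conventional natural-log form log(N / df(w)), only the single highest-scoring word per segment is returned, and the function name `feature_words` is hypothetical.

```python
import math
from collections import Counter
from typing import List

def feature_words(segments: List[List[str]]) -> List[str]:
    """Return, for each segment, the word with the largest tf * idf score.

    An empty segment yields an empty string.
    """
    n_segments = len(segments)
    # df[w]: number of segments in which word w appears at least once.
    df = Counter(w for seg in segments for w in set(seg))
    result = []
    for seg in segments:
        if not seg:
            result.append("")
            continue
        counts = Counter(seg)
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_segments / df[w])
            for w, c in counts.items()
        }
        result.append(max(scores, key=scores.get))
    return result
```

A word that appears in every segment gets an idf of zero under this form, so common function words are never selected unless all words in a segment are equally common. The threshold-based variant mentioned in the text would instead return every word whose score exceeds the reference value.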

When the material content includes video, the item information extraction means 13 may also extract, in addition to the feature word, a thumbnail image from the video in the time interval corresponding to the segment. For example, the item information extraction means 13 extracts the first frame of the video in that time interval as the thumbnail image. The item information extraction means 13 writes the extracted thumbnail image to the material information storage means 14 in association with the segment.

The material information storage means (storage means) 14 stores the material content whose speech recognition errors are to be corrected, together with various information obtained by dividing the material content into segments. The material information storage means 14 can be implemented with a general storage medium such as a hard disk or semiconductor memory.

The material information stored in the material information storage means 14 is now described in detail with reference to FIG. 2 (and FIG. 1 as appropriate).
As shown in FIG. 2, the material information storage means 14 stores the material content (video and audio) A, B, ... whose speech recognition errors are to be corrected. The material content (video and audio) A, B, ... is stored there by the material content input means 10.

As also shown in FIG. 2, the material information storage means 14 stores, for each item of material content, the information obtained by dividing its speech recognition result into segments.
In the example of FIG. 2, segments (identification information a1, a2, ..., b1, ...) are associated with each item of identification information of the material content (here, file names A, B, ...).
Each segment contains a plurality of words w and items of time information t, with each word associated with its time information.
The words w and time information t of each segment are the words and time information associated with each other by the speech recognition means 11, as divided by the recognition result dividing means 12.
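The stored material information can be pictured as a small data model. The Python sketch below is only an illustration of the associations described for FIG. 2: the class and field names, the `media_path` field, and the use of a dictionary keyed by file name are hypothetical, and the per-segment item and thumbnail fields correspond to the feature word and thumbnail image that the item information extraction means 13 stores in association with each segment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TimedWord:
    text: str         # word w
    start_sec: float  # time information t (elapsed time from the start of the audio)

@dataclass
class Segment:
    segment_id: str                                   # e.g. "a1", "a2", "b1"
    words: List[TimedWord] = field(default_factory=list)
    item: Optional[str] = None                        # feature word extracted for the segment
    thumbnail: Optional[bytes] = None                 # encoded thumbnail image, if any

@dataclass
class MaterialContent:
    file_name: str                                    # identification information, e.g. "A"
    media_path: str                                   # hypothetical location of the video/audio
    segments: List[Segment] = field(default_factory=list)

# Hypothetical in-memory store keyed by file name.
store: Dict[str, MaterialContent] = {}

def add_content(content: MaterialContent) -> None:
    """Register material content under its identifying file name."""
    store[content.file_name] = content
```

As the description notes, the media itself and the segment information need not live in the same storage; only the association by identifier matters.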

Each segment also includes an item k and a thumbnail image g. The item k is the feature word extracted by the item information extraction means 13. The thumbnail image g is a frame image that the item information extraction means 13 extracted from the video of the material content at the time corresponding to the time information at the beginning of the segment.
Here, the material content and the segments obtained by dividing its speech recognition result are stored in the same storage means, but they may be stored in separate storage means.
Returning to FIG. 1, the description of the configuration of the speech recognition error correction support device 1 continues.

The editing means 15 allows an operator to correct the speech recognition results stored in the material information storage means 14 using an externally connected correction terminal (input device 2, display device 3, and speaker 4). The display device 3 of the correction terminal may include a touch panel.
As shown in FIG. 1, the editing means 15 includes material content selection means 150, recognition result display control means 151, error correction means 152, and video/audio reproduction means 153.

The material content selection means 150 selects the material content to be corrected. For example, as shown in FIG. 3, the material content selection means 150 displays on the display device 3 a material content selection screen 30 that includes selection buttons 301 for selecting one of the items of material content A, B, and C stored in the material information storage means 14. The material content selection means 150 then selects the material content to be corrected when a selection button 301 on the material content selection screen 30 is pressed. The material content selection means 150 outputs identification information of the selected material content, such as its file name, to the recognition result display control means 151.

The recognition result display control means 151 displays, for each segment, the item and a selection button that specifies whether to display the word string contained in the segment, and controls whether the segment's word string is displayed according to presses of the selection button.

Examples of the screens displayed by the recognition result display control means 151 are now described, together with the associated control, with reference to FIGS. 4 and 5 (and FIG. 1 as appropriate).
As shown in FIG. 4, the recognition result display control means 151 displays an item list screen 31 on the screen of the display device 3.
The item list screen 31 consists of a selection button 311, an item display field 312, a thumbnail image display area 313, a timetable display field 314, and a scroll bar display field 315.

The selection button 311 is a button for choosing, for each segment, whether to display its word string.
The item display field 312 is an area that displays the item extracted from the segment. The recognition result display control means 151 reads the item corresponding to the segment (item k in FIG. 2) from the material information storage means 14 and displays it in the item display field 312.
The thumbnail image display area 313 is an area that displays the thumbnail image extracted for the segment. The recognition result display control means 151 reads the thumbnail image corresponding to the segment (thumbnail image g in FIG. 2) from the material information storage means 14 and displays it in the thumbnail image display area 313.

The timetable display field 314 is a field that displays a timetable indicating the position of each segment on the time axis of the material content. The recognition result display control means 151 generates and displays the timetable by referring to the time information of the segments (time information t in FIG. 2) in the material information storage means 14.
The scroll bar display field 315 is a field that displays a scroll bar indicating which portion of the segments is currently shown when the item list does not fit on the screen. The recognition result display control means 151 updates the item list on the screen as the scroll bar is moved up or down.
By displaying the item list screen 31 in this way, the operator can check the items and can select the segment whose speech recognition result is to be checked more easily than when the entire speech recognition result is displayed at once.

When the operator presses the selection button 311 ("open" in FIG. 4) on the item list screen 31, either by clicking with the mouse of the input device 2 or by touching the touch panel of the display device 3, the recognition result display control means 151 displays, within the item list screen 31, an editing area 316 (see FIG. 5) in which the word string of the segment is corrected.

FIG. 5 shows an example of an item list screen 31B in which the editing area 316 is displayed.
Compared with the item list screen 31 described with reference to FIG. 4, the item list screen 31B additionally displays a moving image display area 313B and the editing area 316 for the selected segment.

The moving image display area 313B is an area in which the material content corresponding to the segment is played back. At the time the segment is selected, the recognition result display control means 151 refers to the time information of the segment (time information t in FIG. 2) in the material information storage means 14 and displays the first frame of the corresponding video of the material content in the moving image display area 313B. When the image area of the moving image display area 313B is clicked with a mouse or the like, or the playback start button st is pressed, the recognition result display control means 151 instructs the video/audio reproduction means 153 to play back the material content.

The editing area 316 is an area that displays the word string corresponding to the segment and in which that word string is edited. The recognition result display control means 151 expands, in the editing area 316, the word string corresponding to the segment (the sequence of words w in FIG. 2) stored in the material information storage means 14.
At this time, the recognition result display control means 151 changes the selection button 311 into a button for hiding the editing area 316 ("close" in FIG. 4). When the selection button 311 ("close" in FIG. 4) is pressed, the recognition result display control means 151 hides the editing area 316, turns the moving image display area 313B back into the thumbnail image display area 313, and returns the display to the item list screen 31 of FIG. 4.
Returning to FIG. 1, the description of the configuration of the voice recognition error correction support device 1 continues.

The error correction means 152 corrects errors in the word string of the segment in the editing area 316 (FIG. 5) according to the operator's editing operations. For editing the word string, the error correction means 152 functions as an ordinary text editor (screen editor). In addition, it has a function of reproducing audio while the word string is being corrected.

Specifically, when a word in the editing area 316 (FIG. 5) is selected by a mouse click or a touch on the touch panel, the error correction means 152 reproduces the audio starting from the selected word. If an arbitrary position is selected again during audio reproduction, the error correction means 152 stops the reproduction.
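The click-to-play and click-to-stop behaviour can be illustrated with a minimal sketch. It assumes each word carries a start time in seconds; the `Word` and `ClickToPlay` names, and the sample words and times, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # assumed: start time of the word in the material content, in seconds


class ClickToPlay:
    """Hypothetical controller: a click starts playback, a click during playback stops it."""

    def __init__(self, words: List[Word]) -> None:
        self.words = words
        self.playing_from: Optional[float] = None  # None means playback is stopped

    def on_click(self, word_index: int) -> str:
        if self.playing_from is not None:
            # A selection made while audio is playing stops the playback.
            self.playing_from = None
            return "stopped"
        # Otherwise playback starts at the start time of the clicked word.
        self.playing_from = self.words[word_index].start
        return f"playing from {self.playing_from:.2f}s"


# Illustrative data only.
words = [Word("3月", 12.0), Word("に", 12.4), Word("なり", 12.5)]
controller = ClickToPlay(words)
first = controller.on_click(1)   # starts playback at the second word
second = controller.on_click(2)  # a click during playback stops it
```

In a real implementation the controller would forward the start time to the audio player rather than only recording it.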

FIG. 6 is an explanatory diagram for explaining an example of editing work in the editing area.
For example, when "3月" ("March") is selected in the editing area 316 of FIG. 6, the error correction means 152 refers to the time information of the segment in the material information storage means 14 (time information t in FIG. 2) and instructs the video/audio reproduction means 153 to reproduce the audio from the corresponding position in the material segment. At this time, the video corresponding to the audio reproduction time may also be reproduced in the moving image display area 313B (FIG. 5) in conjunction with the audio.
Here, when the operator finds an error (in this example, the misrecognition "ハタ寒い") and clicks the place to be corrected with the mouse or the like, the error correction means 152 stops the audio reproduction and displays the cursor C. The error correction means 152 then corrects the erroneous "ハタ寒い" to "肌寒い" ("chilly") according to the operator's editing operation, and replaces the erroneous word stored in the material information storage means 14 with the corrected word. Because the stored word is updated as part of the edit, the voice recognition error correction support device 1 makes a separate save operation by the operator after the correction unnecessary.

The error correction means 152 also reproduces the audio from a word position designated by a mouse click or the like.
FIG. 7 is an explanatory diagram for explaining an example of changing the display attributes of words in the editing area in conjunction with audio reproduction. For example, as shown in FIG. 7, when the position at which audio reproduction should start is selected in the editing area 316 with a mouse or the like, the error correction means 152 refers to the time information of the segment in the material information storage means 14 (time information t in FIG. 2) and instructs the video/audio reproduction means 153 to reproduce the audio from the selected word until an instruction to stop is given.
Then, as shown in FIG. 7, the error correction means 152 changes the display attribute of each word corresponding to the reproduced audio in conjunction with the reproduction, so that both the current reproduction position and which parts of the segment have been reproduced are clearly indicated. For example, the error correction means 152 displays the words corresponding to the audio in reverse video or in a predetermined color.
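One way to decide which words to highlight is to look up the current playback time in the sorted list of word start times. The sketch below assumes such a list is available; the function name `highlighted_range` and the sample times are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def highlighted_range(word_starts: List[float], clicked: int, position: float) -> Tuple[int, int]:
    """Return the (first, last) indices, inclusive, of the words to highlight.

    word_starts must be sorted in ascending order (seconds);
    clicked is the index of the word where playback started;
    position is the current playback time in seconds.
    """
    # Index of the last word whose start time is not after the playback position.
    current = bisect_right(word_starts, position) - 1
    # Never highlight anything before the word where playback started.
    return clicked, max(clicked, current)


# Illustrative start times only.
starts = [0.0, 0.5, 1.1, 1.8, 2.4]
span = highlighted_range(starts, clicked=1, position=2.0)
```

Calling this on every playback tick and changing the display attribute of the words in the returned range keeps the highlight in step with the audio.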

At this time, the error correction means 152 presents feedback on the operator's operations on the screen. For example, as shown in FIG. 8, the error correction means 152 displays a pop-up message pop1 indicating the start of audio reproduction at the selected word position, and a pop-up message pop2 indicating the end of audio reproduction at the word position where the audio stopped. As a result, even an inexperienced operator can see what operation has been performed and can work with confidence.

The error correction means 152 can also repeatedly reproduce the audio corresponding to a designated word or word string.
For example, as shown in FIG. 9, when the word or word string whose audio is to be reproduced is selected in the editing area 316 with a mouse or the like (the reverse-video region in the figure), the error correction means 152 displays a pop-up menu pm, and when "Repeat playback" is selected, it repeatedly reproduces the audio of the corresponding word or word string.
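Repeated playback amounts to computing the time span of the selection and wrapping the playback position inside it. The sketch below assumes each word has both a start and an end time; the source only speaks of time information per word, so the `end` field and all names here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds (assumed to be available)


def loop_range(words: List[TimedWord], first: int, last: int) -> Tuple[float, float]:
    """Time span covering the selected words first..last (inclusive)."""
    if not 0 <= first <= last < len(words):
        raise IndexError("selection is outside the word string")
    return words[first].start, words[last].end


def position_in_loop(span: Tuple[float, float], elapsed: float) -> float:
    """Playback position after `elapsed` seconds of repeated playback."""
    start, end = span
    if end <= start:
        return start  # degenerate selection: stay at the start
    return start + elapsed % (end - start)


# Illustrative data only.
words = [TimedWord("a", 0.0, 0.5), TimedWord("b", 0.5, 1.0), TimedWord("c", 1.0, 2.0)]
span = loop_range(words, 1, 2)
pos = position_in_loop(span, 2.0)
```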
Returning to FIG. 1, the description of the configuration of the voice recognition error correction support device 1 continues.

The video/audio reproduction means 153 reproduces the video and audio of the material content. It reproduces the material content (video and audio) from the position designated by the recognition result display control means 151 or the error correction means 152.

The transcription result output means 16 outputs the voice recognition result corrected by the editing means 15 (the transcription result) to the outside.
When the file name of a material content, or the identification number of a segment within a material content, is specified, the transcription result output means 16 reads the word string of the corresponding material content or segment from the material information storage means 14 and outputs it.
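As a rough illustration of this lookup, the sketch below reads either one segment or the whole material content from an in-memory store. The store layout (file name, then segment identification number, then word list), the `transcript` function, and the file name are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical layout: file name -> segment identification number -> word list.
Store = Dict[str, Dict[int, List[str]]]


def transcript(store: Store, file_name: str,
               segment_id: Optional[int] = None, sep: str = "") -> str:
    """Return the word string of one segment, or of the whole content if no segment is given."""
    segments = store[file_name]
    if segment_id is not None:
        return sep.join(segments[segment_id])
    # Whole content: one line per segment, in ascending segment order.
    return "\n".join(sep.join(segments[k]) for k in sorted(segments))


# Illustrative data only; sep=" " suits space-delimited languages, sep="" suits Japanese.
store: Store = {"news.mp4": {1: ["good", "morning"], 2: ["it", "is", "chilly"]}}
one_segment = transcript(store, "news.mp4", 2, sep=" ")
whole = transcript(store, "news.mp4", sep=" ")
```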

With the voice recognition error correction support device 1 configured as described above, voice recognition errors can be corrected with simple text editing operations while checking each recognized word against the audio from which it was recognized. In addition, the voice recognition error correction support device 1 allows errors in the material content to be corrected partially, segment by segment.
The voice recognition error correction support device 1 can be realized by running, on a computer, a voice recognition error correction support program that causes the computer to function as each of the means described above.

[Operation of voice recognition error correction support device]
Next, the operation of the voice recognition error correction support device 1 according to the embodiment of the present invention will be described with reference to FIGS. 10 to 12. Three operations are described here: a segment information generation operation, which generates the voice recognition result for the material content in segment units; a segment information presentation operation, which presents the recognition result on the display device 3 in segment units; and a segment correction operation, which corrects the recognition result while reproducing the audio.

(Segment information generation operation)
First, the segment information generation operation of the voice recognition error correction support device 1 will be described with reference to FIG. 10 (and FIG. 1 as appropriate).
In step S1, the material content input means 10 inputs the material content to be subjected to voice recognition. At this time, the material content input means 10 writes the input material content to the material information storage means 14.

In step S2, the voice recognition means 11 recognizes the audio of the material content input in step S1 and generates the recognition result, which is text data, in association with time information for each word constituting the recognition result.

In step S3, the recognition result dividing means 12 divides the recognition result obtained in step S2 into segments according to a predetermined criterion, for example, at cut points in the video or at silent sections in the audio. At this time, the recognition result dividing means 12 associates the words of the recognition result with their time information segment by segment and writes them to the material information storage means 14.
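Division at silent sections can be sketched as splitting the timed word sequence wherever the pause between two consecutive words is long enough. The 1.0 second threshold, the `end` field, and the names below are assumptions made for illustration; the patent does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds (assumed to be available)


def split_on_silence(words: List[Word], min_gap: float = 1.0) -> List[List[Word]]:
    """Start a new segment whenever the pause before a word is at least min_gap seconds."""
    segments: List[List[Word]] = []
    for word in words:
        if segments and word.start - segments[-1][-1].end < min_gap:
            # Short pause: the word continues the current segment.
            segments[-1].append(word)
        else:
            # First word, or a long pause: open a new segment.
            segments.append([word])
    return segments


# Illustrative data only: a 1.6 s pause separates "b" and "c".
words = [Word("a", 0.0, 0.4), Word("b", 0.5, 0.9), Word("c", 2.5, 3.0), Word("d", 3.1, 3.5)]
segments = split_on_silence(words)
texts = [[w.text for w in seg] for seg in segments]
```

Division at video cut points would follow the same pattern, with cut times instead of pauses deciding where a new segment begins.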

In step S4, for each segment obtained in step S3, the item information extracting means 13 extracts a feature word contained in the segment as an item and extracts a thumbnail image from the video corresponding to the segment. At this time, the item information extracting means 13 writes the extracted item and thumbnail image to the material information storage means 14 in association with the segment.
Through the above operations, the voice recognition error correction support device 1 stores, in the material information storage means 14, the material content and the various information obtained by dividing it into segments, as shown in FIG. 2.
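The claims name the TF-IDF method for extracting feature words. A minimal sketch of that idea, treating each segment as one document, might look as follows; the exact weighting used in the patent is not given, so the plain tf × log(N/df) form, the tie-breaking rule, and the sample words are assumptions.

```python
import math
from collections import Counter
from typing import List


def feature_words(segments: List[List[str]], top_n: int = 1) -> List[List[str]]:
    """Pick the top_n words per segment by TF-IDF, treating each segment as a document."""
    n = len(segments)
    # Document frequency: the number of segments in which each word occurs.
    df = Counter(word for seg in segments for word in set(seg))
    result = []
    for seg in segments:
        tf = Counter(seg)
        scores = {w: (c / len(seg)) * math.log(n / df[w]) for w, c in tf.items()}
        # Sort by descending score, then alphabetically so that ties are deterministic.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_n])
    return result


# Illustrative data only.
segments = [
    ["weather", "today", "cold", "cold"],
    ["traffic", "today", "jam"],
    ["weather", "sports", "sports", "today"],
]
items = feature_words(segments)
```

A word such as "today" that occurs in every segment scores zero, so it is never chosen ahead of a word that is specific to one segment.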

(Segment information presentation operation)
Next, the segment information presentation operation of the voice recognition error correction support device 1 will be described with reference to FIG. 11 (and FIG. 1 as appropriate).
In step S10, the material content selection means 150 displays, on the display device 3, the material content selection screen 30 (see FIG. 3), which includes selection buttons for selecting one of the material contents stored in the material information storage means 14.

In step S11, the material content selection means 150 waits until a selection button is pressed on the screen (No in step S11). When a selection button is pressed (Yes in step S11), it transfers control to the recognition result display control means 151, which performs the control from step S12 onward.

In step S12, based on the various information stored in the material information storage means 14, the recognition result display control means 151 displays, on the display device 3, the item list screen 31 (see FIG. 4), which includes, for each segment, the item and a selection button for specifying whether or not to display the word string contained in that segment.

In step S13, the recognition result display control means 151 waits until a selection button (open) is pressed on the item list screen (No in step S13).
When a selection button (open) is pressed (Yes in step S13), in step S14 the recognition result display control means 151 displays the editing area 316 for the selected segment, as shown in FIG. 5, and expands in the editing area 316 the word string that is the recognition result for that segment stored in the material information storage means 14.

After this operation, the voice recognition error correction support device 1 enters a state in which the operator can correct the editing result on the screen. When the editing area 316 has been displayed by pressing the selection button (open), the recognition result display control means 151 can hide the editing area 316 at any time in response to pressing the selection button (close); this hiding operation is not shown in the figure. The description of the reproduction of the material content in the moving image display area 313B of the item list screen 31B (see FIG. 5) is also omitted here.
Through the above operations, the voice recognition error correction support device 1 makes it possible to correct voice recognition errors in the material content segment by segment.

(Segment correction operation)
Next, the segment correction operation of the voice recognition error correction support device 1 will be described with reference to FIG. 12 (and FIG. 1 as appropriate). Because the operator may carry out the segment correction in any order, an example in which audio reproduction and correction are performed together is described here.

In step S20, in response to the operator's mouse click or touch on the touch panel, the error correction means 152 selects the word or word string in the editing area 316 (FIG. 5) whose audio is to be reproduced. At this time, the error correction means 152 refers to the time information of the segment in the material information storage means 14 and, via the video/audio reproduction means 153, reproduces the audio for the time corresponding to the word or word string. This allows the operator to compare the audio with the recognized word string.

In step S21, the error correction means 152 accepts the designation of the position to be corrected through the operator's mouse click or touch on the touch panel. At this time, if the audio is still playing, either because it has not yet reached the end of the word string or because it is being played repeatedly, the error correction means 152 stops the audio reproduction.

In step S22, the error correction means 152 displays a cursor at the designated position in the editing area and corrects the recognition error according to the operator's editing operations, such as deleting and inserting characters. Here, the error correction means 152 updates the word in the material information storage means 14 with the corrected result.

In step S23, the error correction means 152 accepts the designation of the position of the corrected portion through the operator's mouse click or touch on the touch panel. At this time, the error correction means 152 refers to the time information of the segment in the material information storage means 14 and, via the video/audio reproduction means 153, reproduces the audio for the time corresponding to the word or word string. This allows the operator to check whether the correction is right.

Although not shown in the figure, if the operator's check in step S23 shows that the portion has not yet been corrected properly, the process returns to step S21 and the operation is repeated.
Through the above operations, when correcting a voice recognition error, the voice recognition error correction support device 1 can quickly reproduce the audio of the portion to be corrected and allows the error to be corrected with simple operations.

Although an embodiment of the present invention has been described above, the present invention is not limited to this embodiment.
The material content has been described here as including video and audio, but it may be audio only.
In that case, the item information extracting means 13 extracts only the items and does not extract thumbnail images, and the video/audio reproduction means 153 may be an audio reproduction means that reproduces only audio.

Also, although the correction terminal (input device 2, display device 3, and speaker 4) is connected directly to the voice recognition error correction support device 1 in this description, these may instead be connected via a network.

The voice recognition error correction support device 1 may also be provided with a plurality of correction terminals. In that case, the recognition result display control means 151 performs exclusive control so that a segment being corrected on one correction terminal cannot be selected for correction on another, for example, by not displaying the selection button for that segment on the other correction terminals.
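The exclusive control between correction terminals can be pictured as a table recording which terminal currently holds each segment. The sketch below is an assumption-laden illustration: the `SegmentLocks` class, its method names, and the terminal identifiers are hypothetical, and it is not thread-safe, so a real server would need to make these operations atomic.

```python
from typing import Dict, List, Optional


class SegmentLocks:
    """Hypothetical lock table: segment id -> id of the terminal editing that segment."""

    def __init__(self) -> None:
        self._owner: Dict[int, str] = {}

    def open_segment(self, segment_id: int, terminal: str) -> bool:
        """Try to claim a segment; fails if another terminal already holds it."""
        owner: Optional[str] = self._owner.get(segment_id)
        if owner is not None and owner != terminal:
            return False
        self._owner[segment_id] = terminal
        return True

    def close_segment(self, segment_id: int, terminal: str) -> None:
        """Release the segment, but only if this terminal holds it."""
        if self._owner.get(segment_id) == terminal:
            del self._owner[segment_id]

    def selectable(self, segment_ids: List[int], terminal: str) -> List[int]:
        """Segments for which this terminal should be shown a selection button."""
        return [s for s in segment_ids if self._owner.get(s) in (None, terminal)]


# Illustrative scenario: terminal A opens segment 2, so B cannot select it.
locks = SegmentLocks()
a_opened = locks.open_segment(2, "A")
b_opened = locks.open_segment(2, "B")
visible_to_b = locks.selectable([1, 2, 3], "B")
locks.close_segment(2, "A")
visible_after_close = locks.selectable([1, 2, 3], "B")
```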

Further, the editing means 15 of the voice recognition error correction support device 1 may act as a server that corrects recognition results and provides a user interface for screen control, with a plurality of correction terminals connected via a network functioning as clients that operate through that user interface. This makes it possible to correct voice recognition errors at multiple locations over the network.

The voice recognition means 11 may also be provided outside the voice recognition error correction support device 1.
For example, the configuration of the voice recognition error correction support device 1B shown in FIG. 13 may be used. In the voice recognition error correction support device 1B, the voice recognition means 11 of the voice recognition error correction support device 1 (FIG. 1) is provided externally as a voice recognition device. In this case, the recognition result dividing means 12 receives the voice recognition result output from the voice recognition means 11, together with the time information for each word constituting the recognition result, via the recognition result input means 17, which is an input interface.
The voice recognition error correction support device 1B can likewise be realized by running, on a computer, a voice recognition error correction support program that causes the computer to function as each of the means described above.

1, 1B Voice recognition error correction support device
10 Material content input means
11 Voice recognition means
12 Recognition result dividing means
13 Item information extracting means
14 Material information storage means (storage means)
15 Editing means
150 Material content selection means
151 Recognition result display control means
152 Error correction means
153 Video/audio reproduction means (audio reproduction means)
16 Transcription result output means
17 Recognition result input means

Claims

A voice recognition error correction support device that corrects voice recognition errors in the voice contained in content, comprising:
a recognition result dividing means that divides a recognition result into segments according to a predetermined criterion, based on the recognition result of the voice, which is text data, and time information for each word constituting the recognition result;
a recognition result display control means that displays, together with item information, a button for specifying whether or not to display the word string included in a segment and that, in response to selection of the button, controls whether to display an editing area and expand the word string of the segment in it or to hide the editing area;
an error correction means that corrects errors in the segment in the editing area; and
a voice reproduction means that reproduces the voice corresponding to the segment in the editing area,
wherein the recognition result dividing means divides the recognition result at a change point of position information or time information included in the content, and
wherein the error correction means causes the voice reproduction means to reproduce the voice of the content corresponding to the time information, starting from the word position designated in the editing area.

A voice recognition error correction support device that corrects voice recognition errors in the voice contained in content, comprising:
a recognition result dividing means that divides a recognition result into segments according to a predetermined criterion, based on the recognition result of the voice, which is text data, and time information for each word constituting the recognition result;
a recognition result display control means that displays, together with item information, a button for specifying whether or not to display the word string included in a segment and that, in response to selection of the button, controls whether to display an editing area and expand the word string of the segment in it or to hide the editing area;
an error correction means that corrects errors in the segment in the editing area; and
a voice reproduction means that reproduces the voice corresponding to the segment in the editing area,
wherein the content includes video, and the recognition result dividing means divides the recognition result at a cut point of the video, and
wherein the error correction means causes the voice reproduction means to reproduce the voice of the content corresponding to the time information, starting from the word position designated in the editing area.

A voice recognition error correction support device that corrects voice recognition errors in the voice contained in content, comprising:
a recognition result dividing means that divides a recognition result into segments according to a predetermined criterion, based on the recognition result of the voice, which is text data, and time information for each word constituting the recognition result;
a recognition result display control means that displays, together with item information, a button for specifying whether or not to display the word string included in a segment and that, in response to selection of the button, controls whether to display an editing area and expand the word string of the segment in it or to hide the editing area;
an error correction means that corrects errors in the segment in the editing area;
a voice reproduction means that reproduces the voice corresponding to the segment in the editing area; and
an item information extraction means that extracts, for each of the segments, a feature word as the item information by the TF-IDF method from the words included in the plurality of segments,
wherein the recognition result display control means displays a list of the item information including the button for specifying whether or not to display the word string included in the segment, and
wherein the error correction means causes the voice reproduction means to reproduce the voice of the content corresponding to the time information, starting from the word position designated in the editing area.

A voice recognition error correction support device that corrects voice recognition errors in the voice contained in content, comprising:
a recognition result dividing means that divides a recognition result into segments according to a predetermined criterion, based on the recognition result of the voice, which is text data, and time information for each word constituting the recognition result;
a recognition result display control means that displays, together with item information, a button for specifying whether or not to display the word string included in a segment and that, in response to selection of the button, controls whether to display an editing area and expand the word string of the segment in it or to hide the editing area;
an error correction means that corrects errors in the segment in the editing area; and
a voice reproduction means that reproduces the voice corresponding to the segment in the editing area,
wherein the error correction means causes the voice reproduction means to reproduce the voice of the content corresponding to the time information, starting from the word position designated in the editing area; stops the reproduction of the voice by the voice reproduction means when an arbitrary word position in the editing area is designated during voice reproduction of the content; displays a pop-up message indicating the start of voice reproduction at the word position designated in the editing area; and displays a pop-up message indicating the end of voice reproduction at the word position where the voice stopped.

The voice recognition error correction support device according to any one of claims 1 to 3, wherein the error correction means stops the reproduction of the voice by the voice reproduction means when an arbitrary word position in the editing area is designated during voice reproduction of the content.

The voice recognition error correction support device according to any one of claims 1 to 5, wherein the error correction means changes the display attribute of the words in the editing area corresponding to the reproduced voice in conjunction with the voice reproduction of the content.

The voice recognition error correction support device according to any one of claims 1 to 6, wherein the error correction means causes the voice reproduction means to repeatedly reproduce the voice of the content corresponding to the time information of a word designated in the editing area or of a word string in a designated section.

A voice recognition error correction support device that corrects voice recognition errors in the voice contained in content, comprising:
a recognition result dividing means that divides a recognition result into segments at each change in utterance content, based on the recognition result of the voice, which is text data, and time information for each word constituting the recognition result;
a recognition result display control means that displays, together with item information, a button for specifying whether or not to display the word string included in a segment and that, in response to selection of the button, controls whether to display an editing area and expand the word string of the segment in it or to hide the editing area;
an error correction means that corrects errors in the segment in the editing area; and
a voice reproduction means that reproduces the voice corresponding to the segment in the editing area,
wherein the error correction means causes the voice reproduction means to reproduce the voice of the content corresponding to the time information, starting from the word position designated in the editing area.

A voice recognition error correction support program for causing a computer to function as the voice recognition error correction support device according to any one of claims 1 to 8.