JP6440967B2

JP6440967B2 - End-of-sentence estimation apparatus, method and program thereof

Info

Publication number: JP6440967B2
Application number: JP2014105124A
Authority: JP
Inventors: 厚志安藤; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2014-05-21
Filing date: 2014-05-21
Publication date: 2018-12-19
Anticipated expiration: 2034-05-21
Also published as: JP2015219480A

Description

The present invention relates to a technique for estimating sentence-end symbols that give semantic information to a speech recognition result.

There is demand for applying speech recognition technology to automatically create an utterance memo for each utterance of the participants in a meeting. Utterance memos created by speech recognition make it possible to look back on the content of individual utterances during the meeting, which facilitates discussion. They also make it easier to review the meeting afterward and reduce the effort of preparing minutes.

To realize automatic creation of utterance memos, it is not sufficient to convert only the phonological information obtained by speech recognition into text. One reason is that phonological information alone does not reveal where sentences break. As a result, disadvantages arise: the time needed to prepare minutes increases because the text is less readable, and minutes containing semantic errors are produced because meanings are misinterpreted.

For this reason, Non-Patent Document 1 discloses a technique for automatically adding punctuation marks to a speech recognition result. FIG. 13 shows the processing flow of the automatic punctuation technique of Non-Patent Document 1. That technique uses the words, parts of speech, phrase boundaries, and dependency information obtained by morphological analysis of the speech recognition result, together with the time intervals between utterances, and estimates the positions where punctuation marks should be added using a machine learning method called a conditional random field. Punctuation marks are then added to the speech recognition result at the estimated positions.

A dialogue between two or more people, such as a meeting, contains many utterances that carry the meaning of a question or of emphasis. As a result, there are utterances whose phonological information is identical but whose meanings differ. For example, the phonological information 「そうですか」 can be extracted both from the question 「そうですか？」 ("Is that so?") and from the acknowledgment 「そうですか。」 ("I see."). If the conventional technique adds the same punctuation mark to the end of both, the information about the meaning of the utterance is lost and the utterance may be misunderstood. In this example, every utterance with the phonological information 「そうですか」 could be interpreted as an acknowledgment. Minutes containing errors in the meaning of utterances would then be produced, and the people reading them would be misled. Consequently, when a dialogue between two or more people such as a meeting is assumed, the speech recognition result needs to carry semantic information.

One way to give semantic information to text is to use sentence-end symbols. For example, adding a question mark "?" to the end of a sentence gives it the meaning of a question. Therefore, if a plurality of sentence-end symbols are prepared and the symbol that matches the meaning of each utterance can be added automatically, the text can be said to carry semantic information.

When sentence-end symbols are to be added automatically, simply extending the target of the conventional automatic punctuation technique from punctuation marks to sentence-end symbols does not make highly accurate assignment possible. This is because the utterance tendencies of the speakers participating in a dialogue and the tendencies with which sentence-end symbols appear depend on the dialogue situation, that is, on the dialogue setting and on each speaker's role.

FIG. 14 shows an example of the relationship between the dialogue situation, utterance tendencies, and the tendencies with which sentence-end symbols appear. For example, a lecturer giving a lecture tends to show small prosodic variation, and most of the utterances are calm statements with few questions, so periods appear frequently and question marks rarely. When sentence-end symbols are added automatically to a lecturer's utterances, it is therefore effective to detect even small prosodic changes and use them for symbol assignment, and to set criteria under which periods appear readily and question marks do not. In contrast, participants in a free discussion tend to show large prosodic variation and produce many emotionally expressive utterances such as questions and emphasis, so question marks and exclamation marks appear readily. When sentence-end symbols are added automatically to the utterances of participants in a free discussion, it is therefore more effective to detect only large prosodic changes and use them for symbol assignment, and to set criteria under which question marks and exclamation marks appear readily and periods do not.

Non-Patent Document 1: Yuya Akita and Tatsuya Kawahara, "Automatic Insertion of Commas Based on Multiple Annotations for Lectures", IPSJ Journal, Vol. 54, No. 2, pp. 463-470, Feb. 2013.

From the above, achieving highly accurate sentence-end symbol assignment requires assignment criteria that are matched to the dialogue situation. However, the conventional technique assumes a single speaker, so the dialogue situation is not taken into account and sentence-end symbols are always assigned according to the same criteria. As a result, the accuracy of automatic sentence-end symbol assignment may be reduced.

An object of the present invention is to provide a sentence-end symbol estimation apparatus, method, and program that estimate sentence-end symbols using dialogue situation features.

In a sentence-end symbol estimation apparatus according to one aspect of the present invention, the central speaker degree of each speaker in a dialogue conducted by a plurality of speakers is an index indicating that speaker's share of the utterances in the dialogue, the speaker bias degree of the dialogue is an index indicating how unevenly utterance length is distributed among the speakers in the dialogue, the dialogue severity of the dialogue is an index indicating how strict the speakers' tone is during the dialogue, and at least one of the central speaker degree, the speaker bias degree, and the dialogue severity is used as a dialogue situation feature. The apparatus comprises: a dialogue situation feature calculation device that calculates the dialogue situation feature of the dialogue; and a sentence-end symbol estimation unit that selects, from among a plurality of sentence-end symbol assignment norms, the norm corresponding to the dialogue on the basis of the dialogue situation feature calculated by the dialogue situation feature calculation device, and estimates sentence-end symbols for the text representing the utterance content of the dialogue using the selected norm together with the acoustic features and linguistic features of the dialogue.

Dialogue situation features used for estimating the sentence-end symbol that corresponds to the meaning of an utterance can be calculated. In addition, the sentence-end symbol that corresponds to the meaning of an utterance can be estimated.

FIG. 1: Block diagram illustrating an example of the sentence-end symbol estimation apparatus.
FIG. 2: Flowchart illustrating an example of the sentence-end symbol estimation method.
FIG. 3: Block diagram illustrating an example of the dialogue situation feature calculation device.
FIG. 4: Flowchart illustrating an example of the dialogue situation feature calculation method.
FIG. 5: Diagram showing an example of the relationship between dialogue severity, speaker bias degree, and real-world dialogue settings.
FIG. 6: Diagram showing an example of the central speaker degree and speaker roles for each dialogue setting.
FIG. 7: Diagram showing an example of the utterance sections in each speaker's speech.
FIG. 8: Diagram showing an example of the central speaker degree and the speaker bias degree.
FIG. 9: Diagram showing an example of the overall utterance sections and non-speech sections in a dialogue.
FIG. 10: Block diagram illustrating an example of the sentence-end symbol assignment model generation unit 7.
FIG. 11: Block diagram illustrating an example of the sentence-end symbol assignment model generation unit 7.
FIG. 12: Block diagram illustrating an example of the regression coefficient learning unit 17.
FIG. 13: Block diagram illustrating the conventional automatic punctuation technique.
FIG. 14: Diagram showing an example of the relationship between the dialogue situation, utterance tendencies, and the tendencies with which sentence-end symbols appear.

[Overall flow]
First, using the speech of each speaker who participated in the dialogue, the dialogue setting is estimated from measures that represent how unevenly utterance length is distributed among the speakers and how strict the speakers' tone is during the dialogue. In addition, each speaker's share of the utterances is analyzed to estimate that speaker's role in the dialogue setting.

Next, a sentence-end symbol assignment model is trained for each dialogue setting and speaker role. First, a speech database containing speech from a variety of dialogue settings and speaker roles is prepared, together with transcribed text data with sentence-end symbols corresponding to each item of speech data. Based on the estimated dialogue setting and speaker role, the speech database is divided so that speech with similar dialogue settings and speaker roles falls into the same group. For each of the resulting speech databases, a sentence-end symbol assignment model is trained using, as training data, the acoustic features of the dialogue speech and text data in which a sentence-end symbol is assigned at each word boundary of the speech recognition result. This sentence-end symbol assignment model serves as the sentence-end symbol assignment norm for a given dialogue setting and speaker role. The sentence-end symbol assignment model does not necessarily have to be trained; in that case, thresholds on acoustic features or linguistic features are used as the sentence-end symbol assignment norm.

When sentence-end symbols are assigned automatically, the dialogue setting and speaker roles are estimated from the input dialogue speech, and sentence-end symbols are assigned using the sentence-end symbol assignment norm whose dialogue setting and speaker role are closest. This yields speech recognition result text to which sentence-end symbols suited to each speaker's role in the dialogue have been added automatically.

[Dialogue situation features]
Three measures are defined that can express real-world dialogue situations and can be calculated from the speech of the dialogue participants: "dialogue severity", "speaker bias degree", and "central speaker degree". These three measures are collectively called "dialogue situation features".

"Dialogue severity" is a measure of how strict the tone of the dialogue participants is. In other words, the dialogue severity of a dialogue is an index representing how strict that dialogue is. It is obtained, for example, from the magnitude of the participants' prosodic changes and the length of the non-speech sections in the dialogue as a whole.

"Speaker bias degree" is a measure of whether the lengths of the utterance sections are unevenly distributed among the speakers in a dialogue. In other words, the speaker bias degree of a dialogue is an index representing how unevenly utterance length is distributed among the speakers in that dialogue. It is obtained from the utterance ratio of the speaker who spoke the most, relative to the dialogue as a whole.

The "dialogue severity" and "speaker bias degree" measures express the setting in which the dialogue took place. For example, in a lecture both the dialogue severity and the speaker bias degree are high, whereas in a parliamentary session the dialogue severity is high but the speaker bias degree is low. FIG. 5 shows an example of the relationship between dialogue severity, speaker bias degree, and real-world dialogue settings.

"Central speaker degree" is a measure of how large a speaker's utterance ratio is in a dialogue, and it is related to the speaker's role in a given dialogue setting. In other words, the central speaker degree of a speaker in a dialogue is an index indicating that speaker's share of the utterances in the dialogue. It is obtained from each participant's utterance ratio relative to the dialogue as a whole. For example, in a lecture, a speaker with a high central speaker degree is the lecturer, and a speaker with a low central speaker degree is a questioner.

FIG. 6 shows an example of the central speaker degree and speaker roles for each dialogue setting. Whereas one dialogue severity and one speaker bias degree are obtained for the dialogue as a whole, a central speaker degree is obtained for each speaker participating in the dialogue. These dialogue situation features are assumed not to change within a dialogue: using the entire span from the start of the dialogue to its end, one dialogue severity, one speaker bias degree, and as many central speaker degrees as there are speakers are obtained.

A plurality of sentence-end symbol assignment norms are prepared in advance, keyed by the values of the dialogue situation features. As described later, a sentence-end symbol assignment norm may be, for example, the probability values of a sentence-end symbol assignment model trained in advance on speech collected only from dialogues whose dialogue situation features take similar values, or it may be threshold processing on acoustic features and linguistic features. When sentence-end symbols are assigned automatically, the dialogue situation features are estimated automatically from the input speech, the sentence-end symbol assignment norm prepared for nearby feature values is selected, and sentence-end symbols are estimated with it. Changing the sentence-end symbol assignment norm according to the dialogue situation features in this way makes it possible to assign sentence-end symbols suited to the dialogue situation, which improves assignment accuracy.
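The selection step described above can be illustrated with a small Python sketch. The text only says that the norm prepared for nearby feature values is selected; the use of squared Euclidean distance, the function name `select_norm`, and the tuple layout `(dialogue severity, speaker bias degree, central speaker degree)` are assumptions made for illustration.

```python
def select_norm(features, norms):
    """Pick the assignment norm whose reference features are closest.

    features: (dialogue_severity, speaker_bias, central_speaker_degree)
    norms:    list of (reference_features, norm) pairs prepared in advance
    Distance measure (squared Euclidean) is an assumption of this sketch.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # min() returns the (reference_features, norm) pair with the smallest distance.
    return min(norms, key=lambda item: sq_dist(features, item[0]))[1]


# Hypothetical norms: the labels stand in for trained models or threshold sets.
example_norms = [
    ((1.0, 1.0, 1.0), "lecturer"),
    ((0.0, 0.0, 0.5), "free-discussion participant"),
]
chosen = select_norm((0.9, 0.8, 1.0), example_norms)
```

In practice the second element of each pair would be a trained sentence-end symbol assignment model or a set of thresholds rather than a label.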

[Embodiment]
Embodiments of the sentence-end symbol estimation apparatus and method are described below.

As shown in FIG. 1, the sentence-end symbol estimation apparatus includes, for example, a dialogue situation feature calculation device 1, a speech recognition unit 2, an acoustic feature extraction unit 3, a text analysis unit 4, a sentence-end symbol estimation unit 5, and a sentence-end symbol assignment unit 6. The sentence-end symbol assignment unit 6 may be omitted.

The sentence-end symbol estimation method is realized, for example, by the sentence-end symbol estimation apparatus performing the processing of steps S1 to S6 in FIG. 2.

In this embodiment, the input is speech in which a dialogue among a plurality of speakers has been recorded. The speech of each speaker is assumed to be recorded separately. The input may be speech recorded with each speaker wearing a close-talking microphone such as a headset, or it may be sound recorded with a single microphone or a plurality of microphones and then separated into per-speaker speech using speaker classification or sound source separation techniques (for example, Japanese Patent No. 4964204). The number of speakers whose speech was recorded is taken as the number of speakers N used by the dialogue situation feature extraction unit. Speakers who were present at the dialogue but never spoke, and speakers whose individual speech was not recorded, are not counted. The input speech of each speaker is supplied to the dialogue situation feature calculation device 1 and the speech recognition unit 2.

<Dialogue situation feature calculation device 1 (FIGS. 1 and 3)>
The dialogue situation feature calculation device 1 calculates the dialogue situation features using the input speech of each speaker (step S1). The calculated dialogue situation features are output to the sentence-end symbol estimation unit 5.

As shown in FIG. 3, the dialogue situation feature calculation device 1 includes, for example, an utterance section detection unit 11, a fundamental frequency extraction unit 12, an overall utterance section detection unit 13, a central speaker degree and speaker bias degree calculation unit 14, a dialogue severity estimation feature calculation unit 15, and a dialogue severity calculation unit 16.

The dialogue situation feature calculation method is realized, for example, by the dialogue situation feature calculation device performing the processing of steps S11 to S17 in FIG. 4.

The details of each unit of the dialogue situation feature calculation device 1 are described below. In the utterance section detection unit 11 and the fundamental frequency extraction unit 12 described below, the input speech is analyzed, for example, by dividing it into short frames of about 10 msec each.

<<Utterance section detection unit 11 (FIG. 3)>>
The utterance section detection unit 11 detects the utterance sections of each speaker using the input speech of each speaker (step S11). Information about the detected utterance sections is output to the overall utterance section detection unit 13, the fundamental frequency extraction unit 12, and the central speaker degree and speaker bias degree calculation unit 14.

An utterance section is the section from the start time to the end time of one utterance by a speaker, and the speech of each speaker is assumed to contain one or more utterance sections. A short pause, such as a pause for breath, is included in the utterance section, whereas a long pause, such as a period of listening to another person's utterance, is not. Whether a pause is included in an utterance section is decided, for example, by thresholding the time between utterances: a pause of 1 second or less is included in the utterance section, and a pause longer than 1 second is not. FIG. 7 shows an example of the utterance sections in each speaker's speech. In this embodiment, utterance sections are detected by thresholding short-time speech power, but any existing utterance section detection method may be used.
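As a minimal sketch of the procedure just described (power thresholding followed by merging across short pauses), the following Python function works on a list of per-frame power values for one speaker. The function name, the parameter names, and the choice to do the gap comparison in whole frames are assumptions of this sketch; the 10 msec frame length and the 1 second pause limit are the example values given in the text.

```python
def detect_utterances(frame_power, power_threshold, frame_sec=0.01, max_gap_sec=1.0):
    """Return a list of (start_sec, end_sec) utterance sections for one speaker."""
    # Step 1: find runs of consecutive frames whose power exceeds the threshold.
    runs = []
    start = None
    for i, p in enumerate(frame_power):
        if p > power_threshold and start is None:
            start = i
        elif p <= power_threshold and start is not None:
            runs.append((start, i))  # end index is exclusive
            start = None
    if start is not None:
        runs.append((start, len(frame_power)))

    # Step 2: merge runs separated by a pause of at most max_gap_sec,
    # so that short pauses (e.g. breaths) stay inside one utterance section.
    max_gap_frames = round(max_gap_sec / frame_sec)
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap_frames:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Convert frame indices to seconds.
    return [(s * frame_sec, e * frame_sec) for s, e in merged]
```

With 0.5 second frames, for example, a one-frame pause is bridged while a three-frame pause splits the speech into two utterance sections.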

<<Fundamental frequency extraction unit 12 (FIG. 3)>>
The fundamental frequency extraction unit 12 extracts the fundamental frequency of each speaker using the input speech of each speaker and the input information about the utterance sections (step S12). This produces a time series of fundamental frequencies. Information about the extracted fundamental frequencies is output to the dialogue severity estimation feature calculation unit 15.

Fundamental frequency extraction is performed on each utterance section of each speaker's speech. For example, the fundamental frequency is extracted using the autocorrelation method. Of course, any existing fundamental frequency extraction method may be used.
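The autocorrelation method mentioned above can be sketched for a single analysis frame as follows. This is a deliberately simple version: the search range of 60 to 400 Hz, the function name `estimate_f0`, and the rule of returning `None` when no positive correlation peak is found are assumptions of this sketch, and practical extractors add windowing, normalization, and voicing decisions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Returns None when no lag in the search range has positive correlation
    (for example, for a silent frame).
    """
    lag_min = int(sample_rate / f0_max)                       # shortest period searched
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)  # longest period searched
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag is not None else None


# Example: a 200 Hz sine sampled at 8 kHz (50 ms frame).
sample_rate = 8000
sine = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(400)]
f0 = estimate_f0(sine, sample_rate)
```

Applying this function frame by frame within each utterance section yields the fundamental frequency time series used by the later stages.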

<<Overall utterance section detection unit 13 (FIG. 3)>>
The overall utterance section detection unit 13 detects the overall utterance sections using the input utterance sections of all speakers (step S13). Information about the detected overall utterance sections is output to the central speaker degree and speaker bias degree calculation unit 14 and the dialogue severity estimation feature calculation unit 15.

An overall utterance section is a section of the dialogue that falls within the utterance section of at least one speaker. FIG. 7 shows an example of detecting the overall utterance sections. In other words, the overall utterance sections are the union of the utterance sections of all speakers.
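Because the overall utterance sections are the union of the per-speaker sections, they can be computed with a standard interval-merging pass. The sketch below assumes each section is a `(start, end)` pair in seconds; the function name is hypothetical.

```python
def merge_all_speakers(sections_per_speaker):
    """Union of all speakers' utterance sections.

    sections_per_speaker: list (one entry per speaker) of lists of (start, end).
    Returns sorted, non-overlapping (start, end) sections.
    """
    # Pool every speaker's sections and sort them by start time.
    intervals = sorted(iv for sections in sections_per_speaker for iv in sections)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous section: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


overall = merge_all_speakers([[(0, 3), (10, 12)], [(2, 5), (6, 7)]])
```

Here the overlapping sections (0, 3) and (2, 5) from two speakers become a single overall section (0, 5).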

<<Central speaker degree and speaker bias degree calculation unit 14 (FIG. 3)>>
The central speaker degree and speaker bias degree calculation unit 14 calculates the central speaker degree and the speaker bias degree using the input information about the utterance sections of all speakers and the input information about the overall utterance sections (steps S14 and S15). Because the central speaker degree is calculated for each speaker, it is sometimes written as the "central speaker degree of each speaker".

First, the central speaker degree and speaker bias degree calculation unit 14 obtains the utterance ratio of each speaker. This is obtained by dividing the total length of a speaker's utterance sections by the total length of the overall utterance sections. The utterance ratio r_n is expressed as follows, where N is the number of speakers, T_n (n = 1, ..., N) is the total length of the utterance sections of speaker n, and T is the total length of the overall utterance sections.

$$ r_n = \frac{T_n}{T} $$

Next, the central speaker degree is obtained from the utterance ratio of each speaker (step S14). This is obtained by dividing each speaker's utterance ratio by the maximum utterance ratio. The central speaker degree c_n is expressed as follows.

$$ c_n = \frac{r_n}{\max_{m = 1, \dots, N} r_m} $$

Finally, the central speaker degree and speaker bias degree calculation unit 14 obtains the speaker bias degree (step S15). This is obtained by subtracting the ratio that would result if all participants spoke equally, 1/N, from the maximum utterance ratio and scaling the result to a value between 0 and 1. The speaker bias degree B is expressed as follows.

$$ B = \frac{\max_{n = 1, \dots, N} r_n - \frac{1}{N}}{1 - \frac{1}{N}} $$

The central speaker degree represents each speaker's utterance ratio relative to the speaker who spoke the longest in the dialogue, whose value is 1. The speaker bias degree represents the share of the overall utterance sections taken up by the speaker who spoke the longest in the dialogue. A speaker bias degree of 0 indicates that all speakers who participated in the dialogue spoke equally. A speaker bias degree of 1 indicates that a single speaker spoke throughout. FIG. 8 shows an example of the central speaker degree and the speaker bias degree.
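The three quantities computed by unit 14 can be illustrated with a short Python sketch. The utterance ratio and central speaker degree follow the verbal definitions directly. The scaling used for the speaker bias degree, dividing by 1 - 1/N, is an assumption of this sketch chosen so that equal shares give 0 and a single dominant speaker gives 1, as the text describes. The function name is hypothetical, and at least two speakers are assumed.

```python
def speaker_statistics(speaker_totals, overall_total):
    """Compute utterance ratios, central speaker degrees, and speaker bias degree.

    speaker_totals: total utterance-section length T_n of each speaker (seconds)
    overall_total:  total length T of the overall utterance sections (seconds)
    Requires at least two speakers.
    """
    n = len(speaker_totals)
    ratios = [t / overall_total for t in speaker_totals]   # r_n = T_n / T
    r_max = max(ratios)
    centrality = [r / r_max for r in ratios]               # c_n = r_n / max r
    # Assumed scaling: (max r - 1/N) mapped linearly from [1/N, 1] onto [0, 1].
    bias = (r_max - 1.0 / n) / (1.0 - 1.0 / n)
    return ratios, centrality, bias


ratios, centrality, bias = speaker_statistics([60.0, 20.0, 20.0], 100.0)
```

In this example one speaker holds 60% of the speech, giving central speaker degrees of 1, 1/3, and 1/3 and a speaker bias degree of about 0.4.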

<<Dialogue severity estimation feature calculation unit 15 (FIG. 3)>>
The dialogue severity estimation feature calculation unit 15 calculates the average fundamental frequency time change, the average fundamental frequency acceleration, and the non-speech section ratio using the input fundamental frequency time series of all speakers and the input overall utterance sections (step S16). The calculated average fundamental frequency time change, average fundamental frequency acceleration, and non-speech section ratio are output to the dialogue severity calculation unit 16.

The dialogue severity estimation feature calculation unit 15 extracts the features used for dialogue severity estimation in the subsequent stage. First, the time change and the acceleration of the fundamental frequency are obtained from the fundamental frequency time series of each speaker. Because the fundamental frequency is given in discrete time, the time change is calculated as the first-order difference and the acceleration as the second-order difference. The absolute values of the time change and of the acceleration are each averaged over all utterance sections and all speakers, giving values that represent the magnitude of prosodic change in the dialogue as a whole. The former is called the average fundamental frequency time change, and the latter the average fundamental frequency acceleration.
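The first-order and second-order difference features described above can be sketched as follows. Each track is assumed to be the fundamental frequency sequence (in Hz, one value per frame) of one utterance section; differences are taken within a track only, and the absolute values are pooled across all tracks before averaging. Pooling at the frame level, rather than averaging per section first, is an assumption of this sketch, as is the function name.

```python
def f0_dynamics(f0_tracks):
    """Mean absolute first and second differences of F0 over all tracks.

    f0_tracks: list of per-utterance-section F0 sequences, pooled over speakers.
    Returns (average F0 time change, average F0 acceleration).
    """
    deltas, accels = [], []
    for track in f0_tracks:
        d1 = [b - a for a, b in zip(track, track[1:])]  # first-order difference
        d2 = [b - a for a, b in zip(d1, d1[1:])]        # second-order difference
        deltas.extend(abs(v) for v in d1)
        accels.extend(abs(v) for v in d2)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(deltas), mean(accels)


mean_delta, mean_accel = f0_dynamics([[100, 110, 105], [200, 200]])
```

For the two short tracks in the example, the absolute first differences are 10, 5, and 0, and the only second difference has magnitude 15.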

Among the overall utterance sections, the start time of the first utterance is taken as the dialogue start time and the end time of the last utterance as the dialogue end time. Within the span from the dialogue start time to the dialogue end time, a section in which no one is speaking is a non-speech section. The non-speech section ratio is the ratio of the total length of the non-speech sections to the length from the dialogue start time to the dialogue end time. FIG. 9 shows an example of the overall utterance sections and non-speech sections in a dialogue.
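Given the overall utterance sections, the non-speech section ratio follows directly from this definition. The sketch below assumes the sections are already sorted and non-overlapping `(start, end)` pairs in seconds; the function name is hypothetical.

```python
def non_speech_ratio(overall_sections):
    """Ratio of non-speech time between dialogue start and dialogue end.

    overall_sections: sorted, non-overlapping (start, end) overall utterance sections.
    """
    dialogue_start = overall_sections[0][0]   # start of the first utterance
    dialogue_end = overall_sections[-1][1]    # end of the last utterance
    duration = dialogue_end - dialogue_start
    speech = sum(end - start for start, end in overall_sections)
    return (duration - speech) / duration


silence_ratio = non_speech_ratio([(0, 5), (6, 7), (10, 12)])
```

In the example, 8 of the 12 seconds contain speech, so the non-speech section ratio is one third.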

<<対話厳格度計算部１６（図３）>> << Dialogue severity calculator 16 (Fig. 3) >>
対話厳格度計算部１６は、入力された平均基本周波数時間変化、入力された平均基本周波数加速度及び入力された非発話区間の割合及び入力された対話厳格度推定のための回帰係数を用いて、対話厳格度を計算する（ステップＳ１７）。 The dialogue severity calculator 16 uses the input average fundamental frequency time change, the input average fundamental frequency acceleration, the input ratio of non-speech intervals, and the input regression coefficients for dialogue severity estimation to calculate the dialogue severity (step S17).

一般に、厳格な対話（議会など）であるほど基本周波数の変動が小さくなり、非発話区間が長くなる傾向にある。対話厳格度は上記を表現する尺度であり、１から０までの値を取るものとする。対話厳格度が１であれば厳格な対話を、０であれば厳格でない対話（自由討論など）を表す。 In general, the stricter the dialogue (a parliamentary session, for example), the smaller the fluctuation of the fundamental frequency and the longer the non-speech intervals tend to be. The dialogue severity is a measure of this tendency and takes a value between 0 and 1. A dialogue severity of 1 represents a strict dialogue, and 0 represents a non-strict dialogue (such as a free discussion).

対話厳格度の計算はしきい値処理により実現可能である。例えば、平均基本周波数時間変化及び平均基本周波数加速度が一定値より小さく非発話区間が別の一定値より大きい場合は対話厳格度を１とする。もちろん、ロジスティック回帰等の統計的回帰モデルにより対話厳格度の計算を行ってもよい。ただし、統計的回帰モデルを適用する場合、その出力の値を０から１に正規化する処理が加わるものとする。また統計的回帰モデルを用いて対話厳格度を推定する場合、事前に回帰係数を学習する必要がある。回帰係数の事前学習法については後述する。 The dialogue severity can be calculated by threshold processing. For example, when the average fundamental frequency time change and the average fundamental frequency acceleration are smaller than a certain value and the ratio of non-speech intervals is larger than another certain value, the dialogue severity is set to 1. Of course, the dialogue severity may instead be calculated with a statistical regression model such as logistic regression. When a statistical regression model is applied, however, a step that normalizes its output to the range 0 to 1 is added. In addition, when the dialogue severity is estimated with a statistical regression model, the regression coefficients must be learned in advance. The method for learning the regression coefficients in advance is described later.
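The two options above, threshold processing and a logistic regression model, might look like the following sketch. All function names, parameter names, and default threshold values are hypothetical; the specification gives no concrete thresholds. With a logistic model the sigmoid already maps the output into the range 0 to 1, so no separate normalization step is needed in that particular case.

```python
import math


def severity_by_threshold(mean_f0_change, mean_f0_accel, silence_ratio,
                          f0_limit=5.0, silence_limit=0.3):
    """Return 1.0 for a strict dialogue, 0.0 otherwise.

    f0_limit and silence_limit are placeholder values chosen for illustration.
    """
    strict = (mean_f0_change < f0_limit
              and mean_f0_accel < f0_limit
              and silence_ratio > silence_limit)
    return 1.0 if strict else 0.0


def severity_by_logistic(features, weights, bias):
    """Logistic regression output in (0, 1) from pre-learned coefficients."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```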

<音声認識部２（図１）> <Speech recognition unit 2 (Fig. 1)>
音声認識部２は、入力された話者ごとの音声を用いて、音声認識結果テキストを出力する（ステップＳ２）。音声認識結果テキストは、テキスト解析部４及び文末記号付与部６に出力される。 The speech recognition unit 2 outputs a speech recognition result text using the input speech for each speaker (step S2). The speech recognition result text is output to the text analysis unit 4 and the sentence ending symbol assigning unit 6.

音声認識結果テキストは、話者ごとの音声に対し音声認識を適用し、音声波形を文字へと変換することにより例えば生成される。 The speech recognition result text is generated, for example, by applying speech recognition to the speech for each speaker and converting the speech waveform into characters.

<音響特徴抽出部３（図１）> <Acoustic feature extraction unit 3 (Fig. 1)>
音響特徴抽出部３は、入力された話者ごとの音声を用いて、音響特徴を抽出する（ステップＳ３）。抽出された音響特徴は、文末記号推定部５に出力される。 The acoustic feature extraction unit 3 extracts acoustic features using the input speech for each speaker (step S3). The extracted acoustic features are output to the sentence ending symbol estimation unit 5.

音響特徴は、基本周波数、短時間信号パワー、音声スペクトル包絡及び間の長さの少なくとも１つである。 The acoustic feature is at least one of the fundamental frequency, the short-time signal power, the speech spectral envelope, and the pause length.

音響特徴抽出部３は、各時刻での音声に対し、基本周波数・短時間信号パワー・音声スペクトル包絡（MFCC）を抽出する。また、発話区間検出を用いて発話と発話の間の長さを抽出する。間の長さとは、発話区間検出部１１における「息継ぎなどの、発話区間に含まれる短い間」の時間を指す。人間が発話への意味情報を付与する場合、発話の基本周波数や短時間パワーに変化を付けることが多いが、音声スペクトル包絡にもその変化が表れることが知られている。例えば、リラックスして発声した場合と緊張して発声した場合などで音声スペクトル包絡に違いが表れる。また、間の情報は文末かどうかを判断する大きな基準となる。以上から、文末記号推定の際には例えばこれら４種類の音響特徴を用いる。 The acoustic feature extraction unit 3 extracts the fundamental frequency, the short-time signal power, and the speech spectral envelope (MFCC) from the speech at each time. It also extracts the length of the pauses between utterances using utterance interval detection. The pause length refers to the duration of “a short pause included in an utterance interval, such as a breath” as handled by the utterance interval detection unit 11. When a person adds semantic information to an utterance, the fundamental frequency and short-time power of the utterance are often varied, and this variation is known to appear in the speech spectral envelope as well. For example, the speech spectral envelope differs between speech uttered in a relaxed state and speech uttered under tension. Pause information is also an important criterion for judging whether a position is the end of a sentence. For these reasons, these four types of acoustic features are used, for example, when estimating sentence ending symbols.

<テキスト解析部４（図１）> <Text analysis unit 4 (Fig. 1)>
テキスト解析部４は、入力された音声認識結果テキストを用いて、言語特徴を求める（ステップＳ４）。求まった言語特徴は、文末記号推定部５に出力される。 The text analysis unit 4 obtains language features using the input speech recognition result text (step S4). The obtained language features are output to the sentence ending symbol estimation unit 5.

言語特徴は、単語、品詞及び係り受け構造の少なくとも１つである。例えば、単語、品詞及び係り受け構造の全てが言語特徴とされる。 The language feature is at least one of a word, a part of speech, and a dependency structure. For example, words, parts of speech, and dependency structures are all language features.

テキスト解析部４は、形態素解析器を用いて音声認識結果のテキストを単語ごとに分割し、単語ごとの品詞を求める。音声認識結果に含まれる全ての三単語の連鎖及び三品詞の連鎖を作成し、これを単語および品詞の言語特徴としてもよい。また、テキスト全体を構文解析し、単語ごとの係り受け構造を求め、これも言語特徴としてもよい。なお、単語及び品詞にはそれぞれ時刻情報が付与されており、音響特徴との時間的対応が取れているものとする。 The text analysis unit 4 divides the text of the speech recognition result for each word using a morphological analyzer, and obtains a part of speech for each word. It is also possible to create a chain of all three words and a chain of three parts of speech included in the speech recognition result, and use this as a linguistic feature of the words and parts of speech. Also, the entire text is parsed to obtain a dependency structure for each word, which may also be a language feature. It is assumed that time information is assigned to each word and part of speech, and temporal correspondence with the acoustic feature is taken.
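The three-word and three-part-of-speech chains mentioned above can be illustrated with a small helper. The sketch assumes that the morphological analyzer has already produced parallel lists of words and part-of-speech tags; the function name and the example tags are hypothetical.

```python
def trigrams(items):
    """Return every chain of three consecutive items as a tuple."""
    return [tuple(items[i:i + 3]) for i in range(len(items) - 2)]


# Hypothetical analyzer output for one recognized utterance.
words = ["今日", "は", "晴れ", "です"]
pos_tags = ["noun", "particle", "noun", "auxiliary"]

word_trigrams = trigrams(words)     # chains of three words
pos_trigrams = trigrams(pos_tags)   # chains of three parts of speech
```

The same helper serves both feature types because only the sequence contents differ.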

<文末記号推定部５（図１）> <Sentence ending symbol estimation unit 5 (Fig. 1)>
文末記号推定部５は、入力された音響特徴、入力された言語特徴及び入力された対話状況特徴を用いて、単語境界ごとの文末記号付与判定を行う（ステップＳ５）。単語境界ごとの文末記号付与判定は、文末記号付与部６に出力される。 The sentence ending symbol estimation unit 5 performs sentence ending symbol assignment determination for each word boundary using the input acoustic features, the input language features, and the input dialogue situation features (step S5). The sentence ending symbol assignment determination for each word boundary is output to the sentence ending symbol assigning unit 6.

文末記号推定部５は、対話状況特徴に基づいて選択された文末記号付与規範を用いて、単語境界ごとの最適文末記号の推定を行う。文末記号付与規範は、対話状況特徴に基づいて選択される。文末記号付与規範とは、文末記号付与基準又は文末記号付与モデルのことである。文末記号付与基準は、例えば音響特徴・言語特徴のしきい値処理により最適な文末符号を推定するルールベースの手法を利用する。文末記号付与モデルは、例えば条件付き確率場やサポートベクターマシンなどの機械学習により学習した文末記号の出現確率を表すモデル及び識別器を表す。 The sentence ending symbol estimation unit 5 estimates the optimum sentence ending symbol for each word boundary using a sentence ending symbol assignment norm. The sentence ending symbol assignment norm is selected based on the dialogue situation features. A sentence ending symbol assignment norm is either a sentence ending symbol assignment criterion or a sentence ending symbol assignment model. A sentence ending symbol assignment criterion uses, for example, a rule-based method that estimates the optimum sentence ending symbol by threshold processing of acoustic and language features. A sentence ending symbol assignment model is, for example, a model representing the appearance probability of sentence ending symbols, or a classifier, learned by machine learning such as a conditional random field or a support vector machine.

文末記号と音響特徴及び言語特徴には強い関連性があることが知られている。例えば、疑問符が付与される場合には、基本周波数の上昇や助詞・格助詞の出現が増加する傾向がある。しかし、対話状況によって文末記号と音響特徴や言語特徴との関連性は変化する。例えば、厳格な会議では質問以外の場面での基本周波数の変化が少ないため、主に基本周波数を用いて疑問符を推定すべきである。しかし、厳格でない会議の場合は様々な場面で基本周波数の変化が生じるため、主に言語特徴を用いて疑問符を推定すべきである。上記の変化への自動的な対応を可能とすることを目的とし、対話状況特徴の自動推定と対話状況特徴を用いた文末記号推定規範の選択を導入する。 It is known that there is a strong relationship between the end-of-sentence symbol and the acoustic and language features. For example, when a question mark is given, there is a tendency that the fundamental frequency increases and the appearance of particles / case particles increases. However, the relationship between the sentence ending symbol and the acoustic or language features changes depending on the conversation situation. For example, in a strict meeting, there is little change in the fundamental frequency in scenes other than questions, so the question mark should be estimated mainly using the fundamental frequency. However, in the case of a non-strict meeting, the fundamental frequency changes in various situations, so the question mark should be estimated mainly using language features. For the purpose of enabling automatic response to the above changes, we introduce automatic estimation of dialogue situation features and selection of sentence ending symbol estimation criteria using dialogue situation features.

また、音響特徴及び言語特徴の複数の要因に基づいて文末記号が決定する場合も多い。例えば、基本周波数の上昇と、疑問を表す助詞の出現とが同時に発生した場合に疑問符が付与される。このため、音響特徴や言語特徴を単純にしきい値処理するだけでは誤検出が頻出する可能性がある。そのため、複合的な要因も考慮することが可能な、機械学習により学習した文末記号推定モデルを用いて文末記号推定を行うことも有効である。 In many cases, a sentence ending symbol is determined based on a plurality of factors of acoustic features and language features. For example, a question mark is given when an increase in fundamental frequency and the appearance of a particle indicating a question occur simultaneously. For this reason, there is a possibility that false detections frequently occur by simply thresholding acoustic features and language features. Therefore, it is also effective to perform sentence ending symbol estimation using a sentence ending symbol estimation model learned by machine learning, which can take into account complex factors.

なお、文末記号付与モデルを用いて最適な文末記号を推定する場合、モデルの事前学習が必要となる。このときの事前学習の概要については後述する。 Note that when an optimal sentence ending symbol is estimated using a sentence ending symbol assignment model, prior learning of the model is required. The outline of the pre-learning at this time will be described later.

例えば、付与する文末記号は、疑問符「？」、感嘆符「！」、三点リーダ「…」、笑い記号「(笑)」、句点「。」、読点「、」の６種類とし、選択された文末記号付与規範に基づいて、単語境界ごとに６種類の文末記号と「何も付与しない」の７種類のどれが適切かを分類する。 For example, the sentence ending symbols to be assigned are of six types: the question mark “?”, the exclamation mark “!”, the ellipsis “…”, the laughter symbol “(laugh)”, the period “.”, and the comma “,”. Based on the selected sentence ending symbol assignment norm, each word boundary is classified into whichever of seven classes is appropriate: the six sentence ending symbols or “assign nothing”.

このように、文末記号推定部５は、対話状況特徴計算装置１で計算された対話状況特徴に基づいて複数の文末記号付与規範の中からその対話状況に対応する文末記号付与規範を選択し、選択された文末記号付与規範、その対話の音響特徴及び言語特徴を用いてその対話の音声認識結果テキストに対する文末記号を推定する。 In this way, the sentence ending symbol estimation unit 5 selects, from a plurality of sentence ending symbol assignment norms, the norm corresponding to the dialogue situation based on the dialogue situation features calculated by the dialogue situation feature calculation device 1, and estimates the sentence ending symbols for the speech recognition result text of the dialogue using the selected norm and the acoustic and language features of the dialogue.

<文末記号付与部６（図１）> <Sentence ending symbol assigning unit 6 (Fig. 1)>
文末記号付与部６は、入力された単語境界ごとの文末記号付与判定及び入力された音声認識結果テキストを用いて、文末記号付き音声認識結果を生成する（ステップＳ６）。 The sentence ending symbol assigning unit 6 generates a speech recognition result with sentence ending symbols using the input sentence ending symbol assignment determination for each word boundary and the input speech recognition result text (step S6).

具体的には、文末記号付与部６は、音声認識結果テキストに対し文末記号の付与を行うことにより文末記号付き音声認識結果を生成する。その際、文末付与の基準として単語境界ごとの文末記号付与判定が用いられる。 Specifically, the sentence ending symbol assigning unit 6 generates the speech recognition result with sentence ending symbols by assigning sentence ending symbols to the speech recognition result text. The sentence ending symbol assignment determination for each word boundary is used as the basis for this assignment.
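A minimal sketch of this assignment step is shown below. It assumes that the recognition result is available as a list of words and that the determination for the boundary after each word is given as a symbol string, with an empty string meaning that nothing is assigned. The function name and this data layout are assumptions.

```python
def attach_symbols(words, decisions, separator=""):
    """Insert the decided symbol after each word and join the result.

    decisions[i] is the symbol for the boundary following words[i];
    an empty string means "assign nothing". separator is "" for Japanese
    text, which is written without spaces between words.
    """
    if len(words) != len(decisions):
        raise ValueError("one decision per word boundary is required")
    return separator.join(word + symbol for word, symbol in zip(words, decisions))
```

For example, `attach_symbols(["今日", "は", "晴れ", "です", "か"], ["", "", "", "", "？"])` yields the recognized text with a question mark appended.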

<文末記号付与モデル生成部７> <Sentence ending symbol assignment model generation unit 7>
文末記号付与モデルを事前学習により生成する機能が文末記号推定装置に設けられていてもよい。 A function for generating a sentence ending symbol assignment model by prior learning may be provided in the sentence ending symbol estimation device.

文末記号付与モデル生成部７は、図１０及び図１１に示すように、対話状況特徴計算装置７１、音声データベース分割部７２、文末記号正解ラベル作成部７３、音声認識部７４、音響特徴抽出部７５、テキスト解析部７６及び文末記号付与モデル生成部７７を例えば備えている。 As shown in FIGS. 10 and 11, the sentence ending symbol assigning model generation unit 7 includes a dialog situation feature calculation device 71, a speech database division unit 72, a sentence ending symbol correct label creation unit 73, a speech recognition unit 74, and an acoustic feature extraction unit 75. For example, a text analysis unit 76 and a sentence ending symbol addition model generation unit 77 are provided.

文末記号付与モデル生成部７による事前学習には、話者ごとの音声が収録された音声データベースと、各音声データに対応した文末記号付きの書き起こしとが用いられる。この音声データベースは、後述する対話厳格度推定のための回帰係数の事前学習に用いる音声データベースであってもよい。また、文末記号付きの書き起こしは、人が音声を聞き作成したテキストデータであって、単語境界ごとに、話者ごとの音声データベースの音声と対応付け可能な時刻情報が付与されているものとする。 The pre-learning by the sentence ending symbol assignment model generation unit 7 uses a speech database in which the speech of each speaker is recorded and a transcript with sentence ending symbols corresponding to each piece of speech data. This speech database may be the same speech database that is used for pre-learning the regression coefficients for dialogue severity estimation, described later. The transcript with sentence ending symbols is text data created by a person listening to the speech, and each word boundary is assumed to carry time information that can be associated with the speech in the per-speaker speech database.

対話状況特徴計算装置７１は、対話状況特徴計算装置１と同様にして、対話状況特徴を計算する。計算された対話状況特徴は、音声データベース分割部７２に出力される。 The dialogue situation feature calculation device 71 calculates the dialogue situation feature in the same manner as the dialogue situation feature calculation device 1. The calculated dialog situation feature is output to the voice database dividing unit 72.

音声データベース分割部７２は、入力された話者ごとの音声データベース、入力された文末記号付き書き起こし及び入力された対話状況特徴を用いて、対話状況特徴の閾値処理により対話状況特徴が近い音声のデータベースを出力する。例えば、中心話者度が0.7以上、話者偏り度が0.5以上、対話厳格度が0.5以上などの閾値を設定し、それらを満たす音声を一つのデータベースとする。上記の例の場合、対話厳格度・話者偏り度が高い対話の場である「講演」の、中心話者度が高い「講演者」の音声をデータベースから分割することを意図している。対話状況特徴に基づいて分割した個々のデータベースは、発話内容や発話方式が類似した音声の集合とみなすことができる。なお、各データベースに含まれる音声との対応が取れる形で文末記号付き書き起こしも分割されるものとする。 The speech database dividing unit 72 uses the input per-speaker speech database, the input transcript with sentence ending symbols, and the input dialogue situation features, and outputs a database of speech with similar dialogue situation features by threshold processing of the dialogue situation features. For example, thresholds such as a central speaker degree of 0.7 or more, a speaker bias degree of 0.5 or more, and a dialogue severity of 0.5 or more are set, and the speech satisfying them is treated as one database. This example is intended to separate from the database the speech of a “lecturer”, who has a high central speaker degree, in a “lecture”, which is a dialogue setting with high dialogue severity and high speaker bias degree. Each database divided on the basis of the dialogue situation features can be regarded as a set of speech with similar utterance content and speaking style. The transcript with sentence ending symbols is also divided so that it remains aligned with the speech contained in each database.
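The threshold-based split in the example above can be sketched as follows. The thresholds 0.7, 0.5, and 0.5 come from the text; the record layout (a dict with `features` and `transcript` keys), the feature key names, and the function name are assumptions made for illustration.

```python
# Thresholds from the "lecturer in a lecture" example in the text.
LECTURER_THRESHOLDS = {
    "central_speaker_degree": 0.7,
    "speaker_bias_degree": 0.5,
    "dialogue_severity": 0.5,
}


def split_by_thresholds(records, thresholds):
    """Split records into those meeting every minimum and the remainder.

    Each record is a dict holding the speech, its transcript, and a
    "features" dict (hypothetical layout), so the transcript stays paired
    with the speech it belongs to.
    """
    matched, remaining = [], []
    for record in records:
        features = record["features"]
        meets_all = all(features[name] >= minimum
                        for name, minimum in thresholds.items())
        (matched if meets_all else remaining).append(record)
    return matched, remaining
```

Repeating the split with other threshold sets on the remainder would give one database per dialogue situation, which is one possible way to realize the grouping.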

このようにして、対話状況特徴が近い音声のデータベース及び対応する文末記号付き書き起こしがグループ化される。各グループに含まれる音声データベース及び対応する文末記号付き書き起こしのそれぞれについて以下の処理が行われ、各グループの「ある対話状況での文末記号付与モデル」が生成される。 In this way, speech databases with similar dialog status features and corresponding transcripts with end-of-sentence symbols are grouped together. The following processing is performed for each of the speech database included in each group and the corresponding transcript with a sentence ending symbol, and a “sentence ending symbol assignment model in a certain dialog situation” of each group is generated.

文末記号正解ラベル作成部７３は、入力された文末記号付き書き起こしを用いて、文末記号正解ラベルを生成する。生成された文末記号正解ラベルは、文末記号付与モデル生成部７７に出力される。 The sentence ending symbol correct label creating unit 73 generates a sentence ending symbol correct label using the input transcript with the sentence ending symbol. The generated sentence ending symbol correct label is output to the sentence ending symbol assignment model generation unit 77.

文末記号正解ラベルとは、単語境界に入る文末記号の種類を指し、例えば、疑問符「？」、感嘆符「！」、三点リーダ「…」、笑い記号「(笑)」、句点「。」、読点「、」、何も付与しないの７種類の何れかであるとする。 The sentence ending symbol correct label indicates the type of sentence ending symbol that falls on a word boundary and is, for example, one of seven types: the question mark “?”, the exclamation mark “!”, the ellipsis “…”, the laughter symbol “(laugh)”, the period “.”, the comma “,”, and “assign nothing”.

文末記号正解ラベル作成部７３は、具体的には、文末記号付き書き起こしを形態素解析し、単語ごとに分割する。その後、文末記号を除く全単語に対して単語境界にどの文末記号が入っているかを求め、文末記号正解ラベルとする。 Specifically, the sentence ending symbol correct label creating unit 73 performs morphological analysis on the transcript with sentence ending symbols and divides it into words. Then, for every word other than the sentence ending symbols, it determines which sentence ending symbol appears at the word boundary and uses that as the sentence ending symbol correct label.
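The labeling step can be sketched as below, assuming the transcript has already been split into tokens by a morphological analyzer. The function name, the `"NONE"` label for “assign nothing”, and the handling of edge cases (a symbol before any word is ignored, and only the first of several consecutive symbols is kept) are assumptions not stated in the text.

```python
SENTENCE_ENDING_SYMBOLS = {"？", "！", "…", "(笑)", "。", "、"}
NONE_LABEL = "NONE"  # stands for "assign nothing"


def make_correct_labels(tokens):
    """Return (words, labels) where labels[i] is the symbol following words[i]."""
    words, labels = [], []
    for token in tokens:
        if token in SENTENCE_ENDING_SYMBOLS:
            # Attach the symbol to the boundary after the preceding word.
            if words and labels[-1] == NONE_LABEL:
                labels[-1] = token
        else:
            words.append(token)
            labels.append(NONE_LABEL)
    return words, labels
```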

音声認識部７４、音響特徴抽出部７５及びテキスト解析部７６の処理は、それぞれ音声認識部２、音響特徴抽出部３及びテキスト解析部４の処理と同様であるため、ここでは重複説明を省略する。音響特徴抽出部３で抽出された音響特徴及びテキスト解析部４で求められた言語特徴は、文末記号付与モデル生成部７７に出力される。 The processes of the speech recognition unit 74, the acoustic feature extraction unit 75, and the text analysis unit 76 are the same as those of the speech recognition unit 2, the acoustic feature extraction unit 3, and the text analysis unit 4, respectively, so redundant description is omitted here. The acoustic features extracted by the acoustic feature extraction unit 3 and the language features obtained by the text analysis unit 4 are output to the sentence ending symbol assignment model generation unit 77.

文末記号付与モデル生成部７７は、入力された音響特徴、入力された言語特徴及び入力された文末記号正解ラベルを用いて、ある対話状況での文末記号付与モデルを生成する。 The sentence ending symbol assignment model generation unit 77 generates a sentence ending symbol assignment model in a certain dialogue situation using the input acoustic features, the input language features, and the input sentence ending symbol correct labels.

文末記号付与モデル生成部７７は、対話状況特徴が近い音声のデータベースに含まれる各音声の音響特徴と言語特徴を入力データ、文末記号正解ラベルを教師データとし、機械学習により文末記号付与モデルを学習する。機械学習手法として条件付き確率場やサポートベクターマシンの利用を想定するが、分類問題を解くことが可能であればどの機械学習手法を用いてもよい。 The sentence ending symbol adding model generating unit 77 learns the sentence ending symbol assigning model by machine learning using the acoustic features and language features of each speech included in the speech database having similar conversation situation features as input data and the sentence ending symbol correct answer label as teacher data. To do. As the machine learning method, it is assumed that a conditional random field or a support vector machine is used, but any machine learning method may be used as long as it can solve the classification problem.

このようにして、文末記号付与モデル生成部７７は、対話状況特徴に基づいて各対話状況を推定し、推定された対話ごとにその対話の音響特徴、言語特徴及び文末記号正解ラベルを教師データとして学習することにより、複数の文末記号付与規範である複数の文末記号付与モデルを生成する。 In this way, the sentence ending symbol assignment model generation unit 77 estimates each conversation situation based on the conversation situation feature, and uses the acoustic feature, language feature, and sentence ending symbol correct label of the dialogue as teacher data for each estimated conversation. By learning, a plurality of sentence ending symbol assignment models, which are a plurality of sentence ending symbol assignment norms, are generated.

<対話厳格度推定のための回帰係数学習部１７> <Regression coefficient learning unit 17 for dialogue severity estimation>
対話厳格度推定のための回帰係数学習部１７が対話状況特徴計算装置及び文末記号推定装置に設けられていてもよい。 The regression coefficient learning unit 17 for dialogue severity estimation may be provided in the dialogue situation feature calculation device and the sentence ending symbol estimation device.

回帰係数学習部１７の例を図１２に示す。回帰係数学習部１７は、発話区間検出部１７１、基本周波数抽出部１７２、全体発話区間検出部１７３、対話厳格度推定特徴計算部１７４及び回帰分析部１７５を例えば備えている。 An example of the regression coefficient learning unit 17 is shown in FIG. The regression coefficient learning unit 17 includes, for example, an utterance section detection unit 171, a fundamental frequency extraction unit 172, an entire utterance section detection unit 173, a dialogue severity estimation feature calculation unit 174, and a regression analysis unit 175.

事前学習の際には、様々な対話を含む音声データベースを用意する。ただし、データベースに含まれる各対話において話者ごとの音声の個別収録と対話厳格度正解ラベルの付与が行われているものとする。対話厳格度正解ラベルは人手での付与を行い、人が対話を聞いて厳格であると感じれば１を、感じなければ０を与える。なお、対話厳格度正解ラベルは対話単位で与えるものとする。音声データベースに含まれる全ての対話と全ての対話厳格度正解ラベルを用いて回帰分析を行い、対話厳格度推定のための回帰係数を求める。 For pre-learning, a speech database containing a variety of dialogues is prepared. It is assumed that, for each dialogue in the database, the speech of each speaker has been recorded individually and a dialogue severity correct label has been assigned. The dialogue severity correct label is assigned manually: 1 is given if a person who listens to the dialogue feels it is strict, and 0 otherwise. The dialogue severity correct label is given per dialogue. Regression analysis is performed using all the dialogues in the speech database and all the dialogue severity correct labels to obtain the regression coefficients for dialogue severity estimation.

発話区間検出部１７１、基本周波数抽出部１７２、全体発話区間検出部１７３及び対話厳格度推定特徴計算部１７４の処理は、それぞれ発話区間検出部１１、基本周波数抽出部１２、全体発話区間検出部１３及び対話厳格度推定特徴計算部１５の処理と同様であるため、これらの重複説明を省略する。ここでは、回帰分析部１７５の説明のみを行う。 The processing of the utterance interval detection unit 171, the fundamental frequency extraction unit 172, the overall utterance interval detection unit 173, and the dialogue severity estimation feature calculation unit 174 is the same as that of the utterance interval detection unit 11, the fundamental frequency extraction unit 12, the overall utterance interval detection unit 13, and the dialogue severity estimation feature calculation unit 15, respectively, so redundant description is omitted. Only the regression analysis unit 175 is described here.

回帰分析部１７５は、入力された平均基本周波数変化量、非発話区間の割合及び対話厳格度正解ラベルを用いて、対話厳格度推定のための回帰係数を計算する。 The regression analysis unit 175 calculates a regression coefficient for estimating the dialogue severity using the input average fundamental frequency change amount, the ratio of the non-speech interval, and the dialogue severity correct answer label.

具体的には、回帰分析部１７５は、例えば以下のようにして対話厳格度推定のための回帰係数の事前学習を行う。話者ごとの音声から求めた平均基本周波数時間変化、平均基本周波数加速度及び非発話区間の割合を説明変数、正解ラベルを従属変数として回帰分析を適用し、回帰係数を求める。なお、回帰分析の際には対話厳格度計算部１６と同一の回帰モデル（ロジスティック回帰モデルなど）を用いる必要がある。 Specifically, the regression analysis unit 175 performs pre-learning of regression coefficients for estimating dialogue severity, for example, as follows. The regression coefficient is obtained by applying regression analysis with the average fundamental frequency time change, the average fundamental frequency acceleration and the ratio of the non-speech interval obtained from the speech for each speaker as explanatory variables and the correct label as a dependent variable. In the regression analysis, it is necessary to use the same regression model (logistic regression model or the like) as the dialogue severity calculation unit 16.
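As one possible illustration of this pre-learning, the sketch below fits a logistic regression model by batch gradient descent, with the three features as explanatory variables and the 0/1 correct label as the dependent variable. The specification does not prescribe a fitting procedure; gradient descent, the learning rate, the number of epochs, and the function names are all assumptions, and in practice the features would likely need scaling to comparable ranges.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Return (weights, bias) learned by batch gradient descent.

    samples: list of [mean_f0_change, mean_f0_accel, silence_ratio] per dialogue.
    labels:  list of 0/1 dialogue severity correct labels, one per dialogue.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        # Average the gradient over the dialogues and take one step.
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias
```

The learned coefficients would then be passed to the same logistic model at estimation time, in line with the requirement that both stages use the same regression model.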

[変形例等] [Variations]
装置及び方法において説明した処理は、記載の順にしたがって時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。 The processes described for the apparatus and method need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as required.

また、各装置における各処理をコンピュータによって実現する場合、その各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、その各処理がコンピュータ上で実現される。 Further, when each process in each device is realized by a computer, the processing contents of the functions that each device should have are described by a program. Then, by executing this program on a computer, each process is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

また、各処理手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each processing means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

その他、この発明の趣旨を逸脱しない範囲で適宜変更が可能であることはいうまでもない。 Needless to say, other modifications are possible without departing from the spirit of the present invention.

発話の意味に対応した文末記号を付与することにより、意味の誤解を防止し、場の雰囲気も理解可能な議事録を作成することが可能となる。また副次的な効果として、文末符号を利用した特定発話の検索（例えば、質問部分のみを検索するなど）が可能となり、議事録作成の効率が向上する。 By assigning sentence ending symbols that correspond to the meaning of each utterance, it becomes possible to prevent misunderstanding of meaning and to create minutes from which the atmosphere of the meeting can also be understood. As a secondary effect, specific utterances can be searched for using the sentence ending symbols (for example, retrieving only the questions), which improves the efficiency of creating minutes.

DESCRIPTION OF SYMBOLS
１対話状況特徴計算装置 1 Dialogue situation feature calculation device
１１発話区間検出部 11 Utterance interval detection unit
１２基本周波数抽出部 12 Fundamental frequency extraction unit
１３全体発話区間検出部 13 Overall utterance interval detection unit
１４中心話者度話者偏り度計算部 14 Central speaker degree and speaker bias degree calculation unit
１５対話厳格度推定特徴計算部 15 Dialogue severity estimation feature calculation unit
１６対話厳格度計算部 16 Dialogue severity calculator
１７回帰係数学習部 17 Regression coefficient learning unit
１７１発話区間検出部 171 Utterance interval detection unit
１７２基本周波数抽出部 172 Fundamental frequency extraction unit
１７３全体発話区間検出部 173 Overall utterance interval detection unit
１７４対話厳格度推定特徴計算部 174 Dialogue severity estimation feature calculation unit
１７５回帰分析部 175 Regression analysis unit
２音声認識部 2 Speech recognition unit
３音響特徴抽出部 3 Acoustic feature extraction unit
４テキスト解析部 4 Text analysis unit
５文末記号推定部 5 Sentence ending symbol estimation unit
６文末記号付与部 6 Sentence ending symbol assigning unit
７文末記号付与モデル生成部 7 Sentence ending symbol assignment model generation unit
７１対話状況特徴計算装置 71 Dialogue situation feature calculation device
７２音声データベース分割部 72 Speech database dividing unit
７３文末記号正解ラベル作成部 73 Sentence ending symbol correct label creating unit
７４音声認識部 74 Speech recognition unit
７５音響特徴抽出部 75 Acoustic feature extraction unit
７６テキスト解析部 76 Text analysis unit
７７文末記号付与モデル生成部 77 Sentence ending symbol assignment model generation unit

Claims

A dialogue situation feature calculation device that calculates a dialogue situation feature of a dialogue conducted by a plurality of speakers, where the central speaker degree of each speaker is an index representing the proportion of the dialogue occupied by that speaker's utterances, the speaker bias degree of the dialogue is an index representing the degree of bias in the utterance lengths of the speakers in the dialogue, the dialogue severity of the dialogue is an index representing the strictness of the speakers' tone during the dialogue, and the dialogue situation feature is at least one of the central speaker degree, the speaker bias degree, and the dialogue severity; and
a sentence ending symbol estimation unit that, based on the dialogue situation feature calculated by the dialogue situation feature calculation device, selects a sentence ending symbol assignment norm corresponding to the dialogue from a plurality of sentence ending symbol assignment norms, and estimates sentence ending symbols for text representing the utterance content of the dialogue using the selected sentence ending symbol assignment norm and the acoustic features and language features of the dialogue;
A sentence ending symbol estimation device comprising the above.

The sentence ending symbol estimation device according to claim 1,
further comprising a sentence ending symbol assignment model generation unit that estimates the situation of each dialogue based on the dialogue situation feature and, for each estimated dialogue, learns from the acoustic features, language features, and sentence ending symbol correct labels of the dialogue as teacher data, thereby generating a plurality of sentence ending symbol assignment models.

A dialogue situation feature calculating step in which a dialogue situation feature calculation device calculates a dialogue situation feature of a dialogue conducted by a plurality of speakers, where the central speaker degree of each speaker is an index representing the proportion of the dialogue occupied by that speaker's utterances, the speaker bias degree of the dialogue is an index representing the degree of bias in the utterance lengths of the speakers in the dialogue, the dialogue severity of the dialogue is an index representing the strictness of the speakers' tone during the dialogue, and the dialogue situation feature is at least one of the central speaker degree, the speaker bias degree, and the dialogue severity; and
a sentence ending symbol estimating step in which a sentence ending symbol estimation unit, based on the dialogue situation feature calculated by the dialogue situation feature calculation device, selects a sentence ending symbol assignment norm corresponding to the dialogue from a plurality of sentence ending symbol assignment norms, and estimates sentence ending symbols for text representing the utterance content of the dialogue using the selected sentence ending symbol assignment norm and the acoustic features and language features of the dialogue;
A sentence ending symbol estimation method comprising the above steps.

The sentence ending symbol estimation method according to claim 3,
further comprising a sentence ending symbol assignment model generating step in which a sentence ending symbol assignment model generation unit estimates the situation of each dialogue based on the dialogue situation feature and, for each estimated dialogue, learns from the acoustic features, language features, and sentence ending symbol correct labels of the dialogue as teacher data, thereby generating a plurality of sentence ending symbol assignment models as the plurality of sentence ending symbol assignment norms.

A program for causing a computer to function as each unit of the sentence ending symbol estimation device according to claim 1 or 2.