JP7814892B2 - Information processing device, information processing method, and program

Info

Publication number: JP7814892B2
Application number: JP2021185190A
Authority: JP
Inventors: 寛基浦島
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2026-02-17
Anticipated expiration: 2041-11-12
Also published as: JP2023072557A

Description

The present invention relates to information processing technology for natural language processing.

In recent years, advances in AI technology have driven progress in natural language processing, a field in which computers analyze documents written in human spoken or written language. Natural language processing technology is expected to be applied in a variety of fields that handle document data, such as document summarization, translation, voice dialogue, and data analysis.

One applied technology of natural language processing is named entity extraction, which extracts the values of predefined items (named entities) from document data. For example, if a corporate name and an expiration date are defined as named entities, the character strings corresponding to the corporate name and to the expiration date are extracted from the document data.

Transformer-based natural language processing models typified by BERT, which are currently mainstream in natural language processing, break down the character strings contained in document data into units called tokens and take the vectorized tokens as input data. However, because there is an upper limit on the number of tokens that such a model can process at one time, long document data containing more tokens than this limit must be divided into two or more token groups, which are then input and processed separately. If the tokens contained in a single piece of document data are simply divided to fit the model's input limit, the keywords and context (the character strings surrounding a named entity) needed to distinguish named entities may be lost, and the accuracy of named entity estimation may decrease.

In Patent Document 1, document data is divided into units such as chapters, sections, and paragraphs, and named entities are extracted using a natural language processing model for each token group that is expected to retain a certain amount of context.

Patent Document 1: Japanese Patent Application Laid-Open No. 2021-64143

In Patent Document 1, the tokens contained in document data are divided by section so that the number of tokens processed at one time falls within the input limit of the natural language processing model. However, because the divided sections are processed separately, the context contained in adjacent sections is lost, which may reduce estimation accuracy.

The present invention therefore aims to suppress a decrease in estimation accuracy in natural language processing when the tokens contained in document data are divided.

The technology of the present disclosure is an information processing device for extracting named entities from a plurality of tokens obtained by decomposing an input character string, the information processing device comprising: a division means for dividing the plurality of tokens into two or more token groups when the number of tokens obtained by decomposing the input character string exceeds a predetermined upper limit, wherein in each token group a predetermined number of tokens overlap with another token group; an extraction means for extracting the named entities for each of the token groups; and a determination means for determining the extraction result of the named entities in the overlapping portion based on the extraction results obtained by the extraction means for that portion, wherein, of the extraction results obtained from the two overlapping token groups, the determination means adopts the result from the token group having the larger number of tokens as the extraction result of the named entities in the overlapping portion.

According to the present invention, it is possible to suppress a decrease in estimation accuracy in natural language processing when the tokens contained in document data are divided.

A block diagram showing an example of the functional and hardware configuration of the named entity extraction device 100.
A diagram showing an example of document data received by the receiving unit 102.
A table showing an example of tokens acquired by the control unit 101.
A table showing an example of named entities and the limit number of tokens calculated by the calculation unit 103.
A flowchart showing an example of processing executed by the control unit 101.
A flowchart showing an example of processing executed by the control unit 101 in the second embodiment.
A table showing an example of the limit number of tokens calculated by the calculation unit 103 in the third embodiment.
A flowchart showing an example of processing executed by the control unit 101 in the third embodiment.
A flowchart showing an example of processing executed by the control unit 101 in the fourth embodiment.

The best mode for carrying out the present invention will be described below with reference to the drawings. Note that the configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.

[Embodiment 1]
Embodiment 1 describes an example of a system that extracts character strings from document data, breaks the extracted character strings down into tokens, and uses a natural language processing model to extract and output named entities from the resulting tokens. If all of the tokens can be input into the natural language processing model at once, the results can be estimated efficiently. However, a natural language processing model has an input upper limit on the number of tokens it can accept at one time. Therefore, when a long text yields more tokens than the input upper limit, the tokens must be divided into multiple token groups before being input into the model. If the tokens are simply divided to fit the input upper limit, the keywords and context needed to distinguish named entities may be lost, and in such cases the accuracy of named entity extraction may decrease.

In Embodiment 1, therefore, when the number of tokens exceeds the input upper limit of the natural language processing model, the tokens are first divided into two or more partially overlapping token groups based on the limit number of tokens with which named entities can be correctly extracted. Named entities are then extracted from each token group, and the named entities to be adopted for the overlapping portions are determined, which suppresses a decrease in the accuracy of named entity extraction.

Figure 1(a) shows a functional block diagram of an example of the named entity extraction device 100 according to this embodiment. The named entity extraction device 100 is an information processing device that includes a control unit 101, a receiving unit 102, a calculation unit 103, a division unit 104, and an extraction unit 105.

The control unit 101 is composed of a CPU 111 and other components. It reads programs and data stored in the ROM 113 into the RAM 112 and executes processing such as named entity extraction.

The receiving unit 102 receives the document data to be subjected to named entity extraction in response to operation of the input device 114 provided in the named entity extraction device 100. The document data may be read from the storage device 116 or obtained from the network 118 via the network interface 117.

The calculation unit 103 calculates the limit number of tokens with which the control unit 101 can correctly extract named entities from document data.

The division unit 104 divides the tokens converted from the document data into two or more token groups so that the number of tokens input at one time does not exceed the input upper limit of the natural language processing model.

The extraction unit 105 uses the natural language processing model to extract named entities from the divided token groups. The extracted named entities are stored in the storage device 116 and displayed on the output device 115, such as a display.

Figure 2(a) shows an example of document data received by the receiving unit 102. Document data is composed of multiple elements of different kinds within a page, including character strings, symbols, and ruled lines. Document data used for training and evaluating a natural language processing model is usually accompanied by correct-answer data called GT (ground truth). In the GT attached to document data 210, the named entity types (or attributes) corporate name, corporate name, and expiration date are defined for character strings 211, 212, and 213, respectively, which are enclosed in dashed rectangles, to indicate that these strings are named entities.

Document data 210 is typically organized page by page and includes symbols, ruled lines, and the like, but it may span multiple pages, or it may be data that holds only character information without layout information. In other words, the document data may be in any format from which character string information can be obtained.

Figure 2(b) shows another example of document data received by the receiving unit 102. Document data 220 is of the same kind as that in Figure 2(a) and has a similar layout, and differs only in that no GT (correct-answer data) is attached.

Figures 3(a) and 3(b) show examples of the tokens acquired by the control unit 101. Each token is represented by an identifier 311, a token character string 312, and a GT 313.

Figure 3(a) shows token table 310, which lists the 261 tokens obtained by extracting character strings from document data 210, to which correct-answer data is attached, and decomposing them into tokens by morphological analysis. Each token is given a GT in IOB (Inside-Outside-Beginning) format. This embodiment uses three named entity types, corporate name (ORG), personal name (PERSON), and expiration date (DATE), but other types may be defined and used.

Because a named entity may consist of multiple tokens, "B-" is prefixed to the GT of the first token of the named entity and "I-" to the GT of each token that follows. For example, "ABC Co., Ltd." in character string 211 of Figure 2(a) is a named entity of type corporate name (ORG) and consists of two tokens, T1_003 ("ABC") and T1_004 ("Co., Ltd."). Token T1_003 is therefore given the GT "B-ORG", indicating the first token, and token T1_004 is given the GT "I-ORG", indicating a subsequent token. Tokens that have no named entity type are given the GT "O", indicating that they are not part of a named entity. Although this embodiment assigns a GT to each token in IOB format as described above, other methods may be used to assign GTs to named entities that span multiple tokens.
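The IOB labeling described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `iob_tags`, the span representation, and the two trailing example tokens are assumptions made for this sketch; only "ABC", "Co., Ltd.", and the ORG type come from the text.

```python
def iob_tags(tokens, entities):
    """Return one IOB tag per token.

    tokens:   list of token strings
    entities: list of (start, end_exclusive, type) spans over token indices
              (a hypothetical representation chosen for this sketch)
    """
    tags = ["O"] * len(tokens)           # default: not part of any named entity
    for start, end, ent_type in entities:
        tags[start] = f"B-{ent_type}"    # first token of the named entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"    # tokens that continue the named entity
    return tags


# "ABC" + "Co., Ltd." form one ORG entity; the last two tokens are made up.
tokens = ["ABC", "Co., Ltd.", "hereby", "agrees"]
tags = iob_tags(tokens, [(0, 2, "ORG")])
print(list(zip(tokens, tags)))
```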

Figure 3(b) shows token table 320, which lists the 264 tokens obtained when the control unit 101 extracts the input character string from the document data shown in Figure 2(b) and decomposes it into tokens by morphological analysis. Because table 320 is based on document data 220, to which no GT is attached, each token has an identifier 321 and a token character string 322 but no GT.

Figure 4(a) shows table 410, an example of the named entities acquired by the calculation unit 103. Each named entity consists of an identifier 411, a character string 412, and a type 413. The minimum number of tokens 414 is the smallest number of tokens surrounding the named entity that is needed to extract it correctly, and is calculated by the calculation unit 103 from multiple tokens to which GTs are attached. Named entity NE_001 corresponds to character string 211 in document data 210 and has the character string "ABC Co., Ltd." and the type "corporate name (ORG)". Similarly, named entity NE_002 corresponds to character string 212 in document data 210 and has the character string "DEF Co., Ltd." and the type "corporate name (ORG)". Named entity NE_003 corresponds to character string 213 in document data 210 and has the character string "October 31st" and the type "expiration date (DATE)". Named entities NE_004 to NE_006 are defined in other document data and have the types personal name, corporate name, and expiration date, respectively.

Figure 4(b) shows the limit number of tokens that the calculation unit 103 derives for the example in Figure 4(a). The limit number of tokens is the largest of the minimum numbers of tokens 414 over all named entities registered in table 410.

Figure 5(a) is a flowchart showing an example of the limit token number derivation process executed by the control unit 101 in this embodiment. This flowchart is executed in the named entity extraction device 100 at initialization. Alternatively, the limit number of tokens may be calculated on a device other than the named entity extraction device 100 and then acquired from that device.

In S511, the control unit 101 acquires the document data 210, to which a GT is attached, from the receiving unit 102.

In S512, the control unit 101 extracts the input character string from the acquired document data and decomposes it into tokens by morphological analysis.

In S513, the control unit 101 uses the calculation unit 103 to identify, for each named entity defined in the attached GT, the minimum number of tokens with which the named entity is correctly extracted. In this step, named entities are first acquired and stored based on the GT attached to the tokens of each piece of document data. For document data 210, "ABC Co., Ltd.", "DEF Vacuum Co., Ltd.", and "October 31st" are each stored in table 410 as named entities. Each acquired named entity is given an identifier 411 and stored in association with its character string 412 and type 413.

Next, named entity extraction is performed on the document data, and for each named entity the minimum number of surrounding tokens with which extraction is correct is identified. Specifically, the number of tokens before and after the named entity is initially set to 128, and extraction is executed using that many surrounding tokens. If the extraction is correct, the number of surrounding tokens is reduced by one and extraction is executed again; this is repeated until the named entity can no longer be extracted or the result is incorrect. The smallest number of tokens for which extraction succeeded is stored as the minimum number of tokens 414 in table 410. The initial number of surrounding tokens may be a fixed value, or may be determined from the number of tokens in the document data or from the results for other document data. Besides reducing the number of tokens one at a time, a binary search or any other method capable of finding the minimum number of tokens may be used.
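The decrement-by-one search in S513 can be sketched as below. The names `min_context_tokens` and `is_correct` are hypothetical; in particular, `is_correct` stands in for running the natural language processing model on the window and comparing its output with the GT, which the toy example replaces with a simple keyword check.

```python
from typing import Callable, List, Optional, Tuple


def min_context_tokens(
    tokens: List[str],
    span: Tuple[int, int],
    is_correct: Callable[[List[str], Tuple[int, int]], bool],
    initial: int = 128,
) -> Optional[int]:
    """Smallest number of tokens on each side of `span` for which extraction
    is still correct, or None if it already fails at `initial`."""
    start, end = span                      # entity occupies tokens[start:end]
    best = None
    for n in range(initial, -1, -1):       # 128, 127, ..., 0
        lo = max(0, start - n)
        hi = min(len(tokens), end + n)
        window = tokens[lo:hi]
        local_span = (start - lo, end - lo)
        if not is_correct(window, local_span):
            break                          # stop at the first failure
        best = n                           # last context size that succeeded
    return best


# Toy stand-in for the model: the date is "extracted correctly" only while
# the keyword "expires" is still inside the window.
tokens = ["The", "license", "expires", "on", "the", "date", "of",
          "October", "31st", "."]
needs_keyword = lambda window, _span: "expires" in window
result = min_context_tokens(tokens, (7, 9), needs_keyword)
print(result)
```

A binary search over `n`, which the text also permits, would need the assumption that correctness is monotonic in the context size.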

In S514, the calculation unit 103 stores the maximum of the minimum numbers of tokens identified for the named entities in table 420 as the limit number of tokens. Because the largest minimum number of tokens in table 410 is "7", "7" is stored as the limit number of tokens.

Although the maximum of the minimum numbers of tokens is used here as the limit number of tokens, the limit may instead be the smallest number of tokens that is equal to or greater than the minimum number of tokens for a predetermined proportion of the named entities. For example, if the limit is set to the number of tokens with which 80% of the named entities are correctly extracted, the limit number of tokens is "6", the value that covers 80% of the minimum numbers of tokens 414.
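Both ways of deriving the limit number of tokens can be expressed with one small function. The function name `limit_tokens` and the six minimum values below are hypothetical; the text does not list the contents of column 414, so the values were chosen only to be consistent with the stated results of "7" (maximum) and "6" (80% coverage).

```python
import math


def limit_tokens(min_counts, coverage=1.0):
    """Smallest context size that is >= the minimum number of tokens for at
    least `coverage` (0 < coverage <= 1) of the named entities."""
    ordered = sorted(min_counts)
    k = max(1, math.ceil(coverage * len(ordered)))  # entities that must be covered
    return ordered[k - 1]


# Hypothetical minimum numbers of tokens for NE_001 .. NE_006.
min_counts = [3, 5, 7, 4, 6, 6]
limit_all = limit_tokens(min_counts)        # maximum of the minimums
limit_80 = limit_tokens(min_counts, 0.8)    # covers 5 of the 6 entities
print(limit_all, limit_80)
```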

Figure 5(b) is a flowchart showing an example of processing executed by the control unit 101 in this embodiment. This flowchart is executed in the named entity extraction device 100 in response to an instruction to perform named entity extraction on document data.

In S521, the control unit 101 extracts the input character string from the document data, decomposes it into tokens, and proceeds to S522. In this step, the input character string is extracted from document data 220, to which no correct-answer data is attached, and decomposed into tokens.

In S522, the control unit 101 compares the number of tokens with the input upper limit of the natural language processing model that serves as the named entity extractor. If the number of tokens obtained by decomposing the extracted input character string exceeds the input upper limit, processing proceeds to S523; otherwise, it proceeds to S524. When the input upper limit of the natural language processing model is 256, the 264 tokens stored in table 320 exceed it, so processing proceeds to S523.

In S523, the division unit 104 divides the tokens contained in the document data into two or more partially overlapping token groups based on the limit number of tokens identified in S514, and processing proceeds to S524. In this step, the limit number of tokens stored in table 420 is first acquired; a preset value may be used instead. The tokens are then divided into two or more token groups that overlap by 14 tokens, which is twice the limit number of tokens "7". The overlap is set to twice the limit number of tokens in order to secure seven tokens as the tokens surrounding a named entity. The average number of tokens per named entity may additionally be added to this value.

In the example of table 320, the tokens are divided into token group 1, from T2_001 to T2_256, and token group 2, from T2_243 to T2_264, so that the 14 tokens from T2_243 to T2_256 overlap. Alternatively, the value "14" calculated from the limit number of tokens may be treated as a lower bound on the number of overlapping tokens, and the tokens may be divided so that the number of tokens in each group is maximized without increasing the number of groups. For example, the tokens in table 320 may be divided into token group 1, from T2_001 to T2_256, and token group 2, from T2_009 to T2_264. In either case, it suffices that the token groups are divided so that they overlap by at least the lower bound calculated from the limit number of tokens.
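The basic split in S523 can be sketched as a sliding window whose step is the input upper limit minus the overlap. The function name `split_with_overlap` is hypothetical, and the sketch covers only the fixed-overlap variant, not the variant that enlarges the last group.

```python
def split_with_overlap(tokens, max_len, overlap):
    """Split `tokens` into groups of at most `max_len` tokens in which
    consecutive groups share `overlap` tokens."""
    if len(tokens) <= max_len:
        return [tokens]                    # fits in one pass, no split needed
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    groups = []
    start = 0
    while True:
        groups.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break                          # this group reached the last token
        start += step
    return groups


# 264 tokens, input upper limit 256, overlap 2 * 7 = 14, as in the example.
ids = [f"T2_{i:03d}" for i in range(1, 265)]
groups = split_with_overlap(ids, max_len=256, overlap=14)
for g in groups:
    print(g[0], "to", g[-1], f"({len(g)} tokens)")
```

With these inputs the sketch reproduces the first division given in the text: group 1 runs from T2_001 to T2_256 and group 2 from T2_243 to T2_264.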

Although this embodiment treats the limit number of tokens as a single value, the numbers of tokens needed before and after a named entity may be calculated separately, and the token groups may be overlapped by their sum.

In S524, the extraction unit 105 executes named entity extraction for each of the divided token groups. In this step, named entity extraction is executed on token group 1 and on token group 2, and the named entities are acquired. Suppose that "GHI Co., Ltd." and "JKL Transport Company" are extracted from token group 1 as corporate names (ORG) and "March 5th" as an expiration date (DATE), and that "March 5th" is also extracted from token group 2 as an expiration date (DATE). Suppose further that in both cases "March 5th" is extracted from the portion where token group 1 and token group 2 overlap.

In S525, the control unit 101 determines the named entities to be adopted for the overlapping portion and ends the processing. Specifically, if the same result is extracted for the shared tokens in the overlapping portion, only one of the results is output; if different results are extracted, the result from the token group with the larger number of surrounding tokens is output. If one of the token groups detects nothing in the overlapping portion, the result from the token group with the larger number of tokens is likewise given priority. In the example in S524, "March 5th" is extracted as an expiration date (DATE) from the overlapping portion of both token group 1 and token group 2, so only one of the results is output. The extraction results for table 320 are therefore "GHI Co., Ltd." and "JKL Transport Company" as corporate names (ORG) and "March 5th" as an expiration date (DATE).
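The determination in S525 can be sketched at the level of per-token tags. The sketch assumes, as in the summary of the disclosure, that "more tokens" is measured by the size of the token group; the function name `merge_overlap`, the `(start_index, tags)` representation, and the five-token example are hypothetical.

```python
def merge_overlap(group_a, group_b):
    """Merge per-token tags from two overlapping token groups.

    Each group is (start_index, tags), where start_index is the position of
    the group's first token in the whole document. For tokens covered by both
    groups, the tag from the group with more tokens is kept (ties keep A).
    """
    start_a, tags_a = group_a
    start_b, tags_b = group_b
    merged = {}                                   # index -> (tag, group size)
    for i, tag in enumerate(tags_a):
        merged[start_a + i] = (tag, len(tags_a))
    for i, tag in enumerate(tags_b):
        idx = start_b + i
        if idx not in merged or len(tags_b) > merged[idx][1]:
            merged[idx] = (tag, len(tags_b))
    return [merged[i][0] for i in sorted(merged)]


# Group A (4 tokens) sees the full date; group B (3 tokens) starts mid-entity
# and disagrees on the shared tokens at document indices 2 and 3.
group_a = (0, ["O", "B-DATE", "I-DATE", "O"])
group_b = (2, ["B-DATE", "O", "O"])
merged_tags = merge_overlap(group_a, group_b)
print(merged_tags)
```

When both groups agree on the shared tokens, the same code simply keeps one copy, which matches the "March 5th" example.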

As described above, when named entities are extracted from document data containing a large amount of text, dividing the tokens into two or more overlapping token groups prevents the loss of the keyword strings and context surrounding a named entity and suppresses a decrease in the accuracy of named entity extraction.

[Embodiment 2]
In Embodiment 1, the minimum number of tokens for each named entity was identified by gradually reducing the number of tokens to find the smallest number with which extraction is still correct. In contrast, this embodiment obtains the limit number of tokens based on the degree of association between tokens that appears in the network of the natural language processing model.

Figure 6 is a flowchart showing an example of processing executed by the control unit 101 in this embodiment. Steps S511, S512, and S514 in the flowchart are the same as the steps with the same reference numerals in Figure 5(a), so their description is omitted here.

In S611, the calculation unit 103 calculates, for each named entity defined in the GT, the minimum number of tokens with which the named entity is correctly extracted, based on the degree of association between tokens that appears in the network of the natural language processing model. In this step, named entities are first acquired based on the GT attached to the tokens of each piece of document data and stored in table 410 shown in Figure 4. Table 410 stores the identifier 411, character string 412, and type 413 of each named entity. Next, the minimum number of tokens 414 with which extraction is correct is calculated for each named entity.

Specifically, the tokens contained in the document data are first divided into two or more token groups so that the number of tokens before and after each named entity is maximized. Named entities are then extracted from the resulting token groups using a natural language processing model that has a self-attention mechanism, typified by Transformer-based BERT. If a named entity is correctly extracted, the relationship between the tokens of that named entity and the surrounding tokens is measured by the attention strength that appears in the network of the natural language processing model. The smallest number of surrounding tokens whose attention strength is equal to or greater than a predetermined threshold is taken as the minimum number of tokens for the named entity. The calculated value is stored as the minimum number of tokens 414 in table 410.
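The text leaves some room in how the attention-based minimum is computed. One possible reading, sketched below, is the smallest number of tokens on each side of the named entity that still contains every surrounding token whose attention strength reaches the threshold. The function name `min_context_from_attention`, the attention values, and the threshold are hypothetical, and the weights are plain numbers rather than values read from a real model.

```python
def min_context_from_attention(attn, span, threshold):
    """Smallest per-side context size that keeps every surrounding token
    whose attention weight is >= threshold.

    attn: one attention weight per token, as seen from the named entity
    span: (start, end_exclusive) token indices of the named entity
    """
    start, end = span
    radius = 0
    for i, weight in enumerate(attn):
        if start <= i < end or weight < threshold:
            continue                      # entity token itself, or weak link
        # distance from the entity boundary needed to include token i
        dist = start - i if i < start else i - end + 1
        radius = max(radius, dist)
    return radius


# Hypothetical weights for 8 tokens; the entity occupies indices 4 and 5.
attn = [0.01, 0.30, 0.02, 0.05, 0.40, 0.35, 0.02, 0.12]
radius = min_context_from_attention(attn, (4, 6), threshold=0.1)
print(radius)
```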

As described above, when named entities are extracted from document data with a large number of characters, dividing the tokens into two or more overlapping token groups prevents the loss of keyword strings and context surrounding a named entity and suppresses the drop in extraction accuracy. Furthermore, calculating the minimum number of tokens from information in the network of the natural language processing model makes that calculation easier.

[Embodiment 3]
In the first embodiment, a single limit number of tokens shared by all named entities is calculated. In the present embodiment, by contrast, a limit number of tokens is determined for each type of named entity.

An example of the limit number of tokens determined by the calculation unit 103 is described using table 700 in Figure 7. Each entry consists of a named entity type 701 and a limit number of tokens 702. The value calculated for each named entity type from the named entity table 410 is entered in 702. In table 700, the limit numbers of tokens for corporate name (ORG), personal name (PERSON), and expiration date (DATE) are defined in 703, 704, and 705, respectively.

Figure 8(a) is a flowchart showing an example of the processing executed by the control unit 101 in this embodiment. Steps S511, S512, and S513 are the same as the steps with the same reference numerals in Figure 5(a), so their description is omitted.

In S811, the calculation unit 103 determines the limit number of tokens for each type of named entity and ends the process. It obtains the values stored as the minimum number of tokens 414 in the named entity table 410, finds the maximum value for each named entity type, and saves that value in table 700 as the limit number of tokens for the type. Although the maximum value is used here, a number of tokens that exceeds the minimum number of tokens for a predetermined percentage of the named entities may be used as the limit number of tokens instead.
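The per-type aggregation in S811 can be sketched as below. This is a minimal illustration, not the patented implementation: the rows stand in for the type 413 and minimum number of tokens 414 columns of table 410, the ORG and DATE values are invented, and the `ratio` parameter is a hypothetical way to express the "predetermined percentage" variant (here read as the smallest value that covers at least that share of the named entities of a type). With `ratio=1.0` the function reduces to the per-type maximum.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def limit_per_type(
    rows: Iterable[Tuple[str, int]], ratio: float = 1.0
) -> Dict[str, int]:
    """Map each named entity type to its limit number of tokens.

    rows: (entity_type, minimum_token_count) pairs, one per named entity.
    ratio: share of named entities per type that the limit must cover
           (1.0 means the maximum over all entities of that type).
    """
    by_type: Dict[str, List[int]] = defaultdict(list)
    for entity_type, min_tokens in rows:
        by_type[entity_type].append(min_tokens)

    limits: Dict[str, int] = {}
    for entity_type, counts in by_type.items():
        counts.sort()
        # Index of the smallest value that covers `ratio` of the entities.
        k = max(1, math.ceil(ratio * len(counts)))
        limits[entity_type] = counts[k - 1]
    return limits


# Hypothetical contents of table 410 (type, minimum number of tokens).
rows = [("ORG", 5), ("ORG", 8), ("PERSON", 2), ("PERSON", 1), ("DATE", 4)]
table_700 = limit_per_type(rows)              # per-type maximum
relaxed = limit_per_type(rows, ratio=0.5)     # covers half of each type
```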

Figure 8(b) is a flowchart showing an example of the processing executed by the control unit 101 in this embodiment. Steps S521, S522, S524, and S525 are the same as the correspondingly labeled steps in Figure 5(b), so their description is omitted.

In S821, the receiving unit 102 accepts the type (or attribute) of named entity to be extracted, and the process proceeds to S521. Here it is assumed that the user of the named entity extraction device has requested extraction of named entities of the personal name type from the document data.

In S822, the division unit 104 divides the tokens contained in the document data into two or more partially overlapping token groups based on the limit number of tokens corresponding to the named entity type accepted in S821, and the process proceeds to S524. In this step, the limit number of tokens stored in table 700 is obtained for the type accepted in S821. If multiple types were accepted, the largest of their limit numbers of tokens is used. Here, because the personal name type was accepted in S821, the tokens are divided into two or more partially overlapping token groups based on the corresponding limit number of tokens, "2".
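The division in S822 can be sketched as a sliding window, under the assumption that the number of tokens shared by adjacent groups equals the limit number of tokens (the text says only that the division is "based on" that limit). The names `split_with_overlap` and `overlap_for_types`, the group size `max_len`, and the ORG and DATE limits are hypothetical; only the PERSON limit of 2 comes from the description.

```python
from typing import Dict, List, Sequence


def overlap_for_types(limits: Dict[str, int], requested: Sequence[str]) -> int:
    """Largest limit number of tokens among the requested entity types."""
    return max(limits[t] for t in requested)


def split_with_overlap(
    tokens: Sequence[str], max_len: int, overlap: int
) -> List[List[str]]:
    """Split tokens into groups of at most max_len tokens, where adjacent
    groups share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(tokens) <= max_len:
        return [list(tokens)]  # no division needed below the upper limit

    step = max_len - overlap
    groups: List[List[str]] = []
    start = 0
    while True:
        groups.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last group reaches the end of the document
        start += step
    return groups


# Hypothetical table 700; only the PERSON value is taken from the text.
limits = {"ORG": 8, "PERSON": 2, "DATE": 4}
tokens = [f"t{i}" for i in range(10)]
overlap = overlap_for_types(limits, ["PERSON"])
groups = split_with_overlap(tokens, max_len=4, overlap=overlap)
```

A smaller overlap means a larger step between groups, which is why restricting the limit to the requested type reduces the number of groups the model has to process.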

By restricting the limit number of tokens in this way to the minimum number of tokens corresponding to the named entity type accepted as the extraction target, the number of token groups produced by the division is kept down and the number of tokens processed during named entity extraction is reduced.

As described above, when named entities are extracted from document data with a large number of characters, dividing the tokens into two or more overlapping token groups prevents the loss of keyword strings and context surrounding a named entity and suppresses the drop in extraction accuracy. Furthermore, setting the limit number of tokens to the minimum number of tokens corresponding to the type of named entity to be extracted reduces the number of divisions of the document data and therefore the amount of computation.

[Embodiment 4]
In the first embodiment, all tokens are divided into two or more token groups using a limit number of tokens determined in advance. In the present embodiment, by contrast, an example is described in which the limit number of tokens is updated based on how the estimated named entities are actually used.

Figure 9 is a flowchart showing an example of the processing executed by the control unit 101 in this embodiment. This flowchart is executed after the named entity extraction device has performed named entity extraction on document data and the user of the device has used an extracted named entity. Steps S512 and S513 are the same as the steps with the same reference numerals in Figure 5(a), so their description is omitted.

In S911, document data is acquired to which a GT has been assigned that additionally defines the tokens selected by the device user as a named entity, and the process proceeds to S512. When named entities have been extracted from the document data and the user selects the extracted corporate name (ORG) "GHI Co., Ltd.", a GT is assigned that additionally defines the corresponding tokens in table 320 as a named entity. Specifically, tokens T2_003 and T2_004 are given the GTs "B_ORG" and "I_ORG", respectively, all other tokens are given the GT "O", and the result is acquired as the document data.
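The GT assignment in S911 follows a begin/inside/outside labeling pattern and can be sketched as below. This is an illustrative sketch only: the function name `assign_gt` is hypothetical, the token identifiers other than T2_003 and T2_004 are invented, and the selected tokens are assumed to be contiguous.

```python
from typing import Dict, Sequence


def assign_gt(
    token_ids: Sequence[str], selected: Sequence[str], entity_type: str
) -> Dict[str, str]:
    """Label the first selected token B_<type>, the following selected
    tokens I_<type>, and every other token O."""
    selected_set = set(selected)
    gt: Dict[str, str] = {}
    inside = False  # True while the previous token was part of the entity
    for token_id in token_ids:
        if token_id in selected_set:
            gt[token_id] = ("I_" if inside else "B_") + entity_type
            inside = True
        else:
            gt[token_id] = "O"
            inside = False
    return gt


# Hypothetical token identifiers standing in for a row range of table 320.
token_ids = ["T2_001", "T2_002", "T2_003", "T2_004", "T2_005", "T2_006"]
gt = assign_gt(token_ids, selected=["T2_003", "T2_004"], entity_type="ORG")
```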

In S912, the calculation unit 103 updates the limit number of tokens and ends the process. Specifically, the results of S512 and S513 for the document data acquired in S911 are reflected in the named entity table 410, which now includes the minimum number of tokens for the named entity selected by the user, and the limit number of tokens table 420 is updated based on that table.

As described above, when named entities are extracted from document data with a large number of characters, dividing the tokens into two or more overlapping token groups prevents the loss of keyword strings and context surrounding a named entity and suppresses the drop in extraction accuracy. Furthermore, by continually updating the limit number of tokens, the drop in accuracy can also be suppressed for previously unseen documents.

(Other Examples)
The present invention can also be realized by supplying a program that implements one or more of the functions of the above-described embodiments to a system or device via a network or a storage medium, and having one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (e.g., an ASIC) that implements one or more of the functions.

100 Named entity extraction device
101 Control unit
102 Receiving unit
103 Calculation unit
104 Division unit
105 Extraction unit

Claims

1. An information processing device for extracting named entities from a plurality of tokens obtained by decomposing an input character string, comprising:
dividing means for dividing the plurality of tokens into two or more token groups when the number of the plurality of tokens exceeds a predetermined upper limit, wherein each of the token groups has a predetermined number of tokens overlapping with another token group;
extraction means for extracting the named entities from each of the token groups; and
determination means for determining an extraction result of the named entity in the overlapping portion based on the extraction results obtained by the extraction means for the portion overlapping with the other token group,
wherein the determination means determines, among the extraction results of the named entity obtained from each of the two token groups that overlap in the overlapping portion, the extraction result obtained from the token group having the greater number of tokens as the extraction result of the named entity in the overlapping portion.

2. The information processing device according to claim 1, wherein the predetermined upper limit is the number of tokens that the extraction means can process at one time.

3. The information processing device according to claim 1, wherein the predetermined number is set based on a limit number of tokens required for the extraction means to extract the named entity.

4. The information processing device according to claim 3, wherein the limit number of tokens is the smallest number of tokens among the numbers of tokens input when the extraction means correctly extracts a named entity from an input character string to which correct answer data defining the named entity has been added.

5. The information processing device according to claim 3, wherein the limit number of tokens is the smallest number of tokens among the numbers of tokens whose relevance to the tokens corresponding to named entities defined in the correct answer data assigned to the input character string is equal to or greater than a predetermined value in the natural language processing model used by the extraction means.

6. The information processing device according to claim 4, wherein, when the correct answer data defines a plurality of named entities, the limit number of tokens is the largest of the plurality of minimum numbers of tokens corresponding to the plurality of named entities.

7. The information processing device according to claim 4, wherein, when the correct answer data defines a plurality of named entities, the limit number of tokens is the smallest number of tokens among the numbers of tokens input when a predetermined percentage of the plurality of named entities are correctly extracted.

8. The information processing device according to claim 3, further comprising receiving means for receiving a type of named entity to be extracted by the extraction means, wherein the limit number of tokens is the smallest number of tokens input when a named entity of the type received by the receiving means is correctly extracted.

9. The information processing device according to claim 4, wherein the correct answer data is obtained by additionally defining a named entity selected by a user from among the named entities extracted from the input character string by the extraction means.

10. An information processing method for extracting named entities from a plurality of tokens obtained by decomposing an input character string, comprising:
a dividing step in which an information processing device divides the plurality of tokens into two or more token groups when the number of the plurality of tokens exceeds a predetermined upper limit, wherein each of the token groups has a predetermined number of tokens overlapping with another token group;
an extraction step in which the information processing device extracts the named entities from each of the token groups; and
a determining step in which the information processing device determines an extraction result of the named entity in the overlapping portion based on the extraction results obtained in the extraction step for the portion overlapping with the other token group,
wherein the determining step determines, among the extraction results of the named entity obtained from each of the two token groups that overlap in the overlapping portion, the extraction result obtained from the token group having the greater number of tokens as the extraction result of the named entity in the overlapping portion.

11. A program for causing a computer to function as the information processing device according to any one of claims 1 to 9.