JP7724676B2 - Information processing method, program, and information processing device

Info

Publication number: JP7724676B2
Application number: JP2021164224A
Authority: JP
Inventors: 有華青野; 幸史市川
Original assignee: Japan Research Institute Ltd
Current assignee: Japan Research Institute Ltd
Priority date: 2021-10-05
Filing date: 2021-10-05
Publication date: 2025-08-18
Anticipated expiration: 2041-10-05
Also published as: JP2023055098A

Description

The present invention relates to an information processing method, a program, and an information processing device.

Information processing methods for recognizing information in documents that contain accounting information and the like have been proposed (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Application Laid-Open No. 2021-34030

In a region of a document where character strings span multiple lines, a line break symbol is inserted between the characters on either side of each line boundary so that line break positions can be recognized. As a result, when a single item spans multiple lines, the character string on the upper line and the character string on the lower line may be recognized as separate items.

An object of the present disclosure is to provide an information processing method and the like that can plausibly recognize character strings item by item, even when a line break falls between characters within a single item in a region of a document.

The information processing method according to the present disclosure acquires a character string group that includes a plurality of character strings located across multiple lines within one region of a document and codes, inserted between the character strings, that indicate division, and, based on information contained in the region, deletes each code whose preceding and following character strings should be joined.

The information processing method according to the present disclosure determines, based on the character strings before and after the code, whether those character strings should be joined, and deletes the code when it determines that they should be joined.

The information processing method according to the present disclosure refers to a table in which predetermined character strings are recorded to determine whether the character strings before and after the code should be joined.

The information processing method according to the present disclosure inputs the plurality of character strings into a learning model trained to output, when a plurality of character strings is input, the probability that the character strings before and after the code should be joined, and outputs that probability.

The information processing method according to the present disclosure compares the probability with a threshold and deletes the code based on the result of the comparison.

The information processing method according to the present disclosure inputs the character string group into a learning model trained to output, when a character string group is input, a character string group from which each code whose preceding and following character strings should be joined has been deleted, and outputs the resulting character string group.

The information processing method according to the present disclosure deletes the code based on information contained in a related region that is related to the region containing the character strings.

The program according to the present disclosure causes a computer to execute the steps of: acquiring a character string group that includes a plurality of character strings located across multiple lines within one region of a document and codes, inserted between the character strings, that indicate division; and deleting a code when, based on information contained in the region, the character strings before and after that code should be joined.

The information processing device according to the present disclosure includes an acquisition unit that acquires a character string group that includes a plurality of character strings located across multiple lines within one region of a document and codes, inserted between the character strings, that indicate division, and a deletion unit that deletes a code when, based on information contained in the region, the character strings before and after that code should be joined.

With the information processing method according to an embodiment of the present disclosure, even when a line break falls between characters within a single item in a region of a document, the character strings can be plausibly recognized item by item.

FIG. 1 is a block diagram of the information processing device according to Embodiment 1.
FIG. 2 is an explanatory diagram showing an example of a document to be read.
FIG. 3 is a flowchart showing an example of a processing procedure performed by the processing unit of the information processing device according to Embodiment 1.
FIG. 4 is an explanatory diagram showing character strings recognized from within a region of interest of a document.
FIG. 5 is an explanatory diagram showing a processing result output by the processing unit of the information processing device.
FIG. 6 is a block diagram of the information processing device according to Embodiment 2.
FIG. 7 is an explanatory diagram of the learning model according to Embodiment 2.
FIG. 8 is a flowchart showing an example of a processing procedure performed by the processing unit of the information processing device according to Embodiment 2.
FIG. 9 is an explanatory diagram of the learning model according to Embodiment 3.
FIG. 10 is a flowchart showing an example of a processing procedure performed by the processing unit of the information processing device according to Embodiment 3.
FIG. 11 is an explanatory diagram showing an example of a region of interest and a related region that is related to the region of interest.

(Embodiment 1)
Embodiment 1 is described below with reference to the drawings. FIG. 1 is a block diagram of the information processing device according to Embodiment 1. The information processing device 1 is a computer that includes a processing unit 11, a storage unit 12, and an input/output I/F 13. The information processing device 1 is, for example, a server computer, a personal computer, a tablet computer, a smartphone, or a cloud server connected via an external network.

The processing unit 11 is a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like. The processing unit 11 reads and executes the program P and data stored in advance in the storage unit 12, thereby performing various control processes, arithmetic processes, and the like.

The storage unit 12 includes a volatile storage area such as an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or a flash memory, and a non-volatile storage area such as an EEPROM or a hard disk. The storage unit 12 stores in advance the program P and the data referred to during processing. The program P may be a program that was read from a recording medium 12a, on which it is stored so as to be readable by the processing unit 11, and then stored in the storage unit 12. Alternatively, the program may be downloaded from an external computer (not shown) connected to a communication network (not shown) and stored in the storage unit 12.
The storage unit 12 includes an item table 121. The item table 121 stores item strings, which are predetermined character strings that represent, for example, the items appearing in the financial statements to be read. Nouns extracted from a group of books in the same field as the document to be read may be registered in the item table 121 as item strings, or nouns selected by an expert in the field to which the document belongs may be registered as item strings. The processing unit 11 refers to the item table 121 during the pattern matching process described below.

The input/output I/F 13 is a group of interfaces for communicating with devices external to the information processing device 1, and is, for example, a standardized communication interface such as USB or DSUB. An input unit 131 and an output unit 132 are connected to the input/output I/F 13. The input unit 131 is, for example, a scanner, which reads a paper document, converts it into a document file that contains layout information, such as a PDF, and outputs the file to the processing unit 11. The input unit 131 may also acquire a document file from an external network or external device (not shown) and output the acquired file to the processing unit 11.

The processing unit 11 performs various processes on the document file acquired from the input unit 131 and outputs the processing result to the output unit 132. The processes performed by the processing unit 11 are described later. The output unit 132 is, for example, a display, which acquires the processing result from the processing unit 11 and notifies the user by displaying it. The output unit 132 may also output the processing result acquired from the processing unit 11 to an external network or external device (not shown).

FIG. 2 is an explanatory diagram showing an example of a document to be read. The document to be read is a document that contains regions arranged in columns and rows, such as a financial statement or an accounting table. Based on position information contained in the layout information of the document file, the processing unit 11 of the information processing device 1 treats a region from which information needs to be acquired as a region of interest and performs the processing described below. The region of interest may be determined based on information the user enters into the input unit 131, or each region included in a line in which a predetermined title is written in the region of the topmost row may be treated as a region of interest. In the following, the region enclosed by the broken line in FIG. 2 is described as the region of interest. The region of interest contains, for example, the item "関係会社短期貸付金" (short-term loans to affiliated companies), but this item spans multiple lines and is split between an upper line, "関係会社短期" (affiliated company short-term), and a lower line, "貸付金" (loans). When a single item spans multiple lines in this way, the character strings on the upper and lower lines may be recognized as separate items. That is, "関係会社短期" and "貸付金" may each be recognized as an item in its own right. By performing the processing described below, the information processing device 1 recognizes "関係会社短期貸付金" as a single item.

FIG. 3 is a flowchart showing an example of a processing procedure performed by the processing unit of the information processing device according to Embodiment 1. FIG. 4 is an explanatory diagram showing character strings recognized from within a region of interest of a document. FIG. 5 is an explanatory diagram showing a processing result output by the processing unit of the information processing device.
The processing unit 11 of the information processing device 1 acquires the scanned document file from the input unit 131 (S1). As shown in FIG. 4, the processing unit 11 extracts the character strings in the region of interest of the acquired document file by OCR (Optical Character Recognition) or the like (S2), and inserts a code indicating division (for example, <SEP>) between the characters on either side of each line boundary (S3). Hereinafter, this code is called a division code, and the sequence formed by arranging the character strings in the region of interest from the top line down, with division codes inserted between them, is called a character string group.
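As a minimal sketch of step S3, the character string group can be modeled as a flat list in which a division code sits between consecutive line strings. The function name `build_string_group`, the list representation, and every sample line other than "関係会社短期" and "貸付金" are assumptions made for illustration; the OCR step (S2) is taken as already done.

```python
SEP = "<SEP>"  # division code; <SEP> is the example given in the description


def build_string_group(lines: list[str]) -> list[str]:
    """Arrange line strings top to bottom with a division code between each pair."""
    group: list[str] = []
    for index, text in enumerate(lines):
        if index > 0:
            group.append(SEP)  # one division code per line boundary
        group.append(text)
    return group


# Hypothetical OCR result for a region of interest, one string per line.
ocr_lines = ["現金及び預金", "関係会社短期", "貸付金", "棚卸資産"]
group = build_string_group(ocr_lines)
print(group)
```

At this stage every line boundary carries a division code, so an item split across two lines is still indistinguishable from two separate items.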

As shown in FIG. 5, the processing unit 11 of the information processing device 1 refers to the item table 121 and performs pattern matching on the combined character string formed by joining the character strings before and after each division code (S4). If an item string that matches the combined character string is registered in the item table 121, the processing unit 11 determines that the character strings before and after the division code should be joined and deletes that division code (S5). As described above, the item table 121 stores, for example, the item strings that appear in the financial statements to be read. The processing unit 11 compares the combined character string for each division code against every item name registered in the item table 121 to perform the pattern matching.
For example, as shown in FIG. 5, if the item name "関連会社短期貸付金" (short-term loans to affiliated companies), formed by joining "関連会社短期" (affiliated company short-term) and "貸付金" (loans), is registered in the item table 121, the processing unit 11 determines that the two character strings should be joined and deletes the division code that was inserted between "関連会社短期" and "貸付金".
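Steps S4 and S5 might be sketched as follows. The description compares each combined string against every registered item name; this sketch swaps in a Python set lookup, which gives the same result for exact matches. The function name, the list representation of the character string group, and the sample table contents are assumptions, not part of the original.

```python
SEP = "<SEP>"  # division code


def delete_codes_by_table(group: list[str], item_table: set[str]) -> list[str]:
    """Delete a division code when the joined neighbors form a registered item string."""
    result: list[str] = []
    for token in group:
        joins_previous = (
            token != SEP
            and len(result) >= 2
            and result[-1] == SEP
            and (result[-2] + token) in item_table
        )
        if joins_previous:
            result.pop()         # S5: delete the division code
            result[-1] += token  # join the strings before and after it
        else:
            result.append(token)
    return result


# Hypothetical item table and character string group.
item_table = {"現金及び預金", "関係会社短期貸付金", "棚卸資産"}
group = ["現金及び預金", SEP, "関係会社短期", SEP, "貸付金", SEP, "棚卸資産"]
merged = delete_codes_by_table(group, item_table)
print(merged)
```

One design choice to note: after a merge, the merged string becomes the "before" string for the next division code, so an item spread over three lines is joined only if its two-line prefix is also registered.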

The processing unit 11 then recognizes each item based on the division codes that remain after the above processing, converts the result into a table format with one item per row (S6), and outputs it to the output unit 132 (S7). The format in which the processing unit 11 outputs the character string group to the output unit 132 is not limited to this; for example, the character string group may be converted into a format such as HTML, XML, or CSV before being output.
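A minimal sketch of steps S6 and S7, assuming the character string group is a flat list and that CSV (one of the formats named above) is the output format. The helper names and the sample data are illustrative only.

```python
import csv
import io

SEP = "<SEP>"  # division code


def to_items(group: list[str]) -> list[str]:
    """Each string left between the remaining division codes is one item."""
    return [token for token in group if token != SEP]


def to_csv(items: list[str]) -> str:
    """Write one item per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for item in items:
        writer.writerow([item])
    return buffer.getvalue()


# Hypothetical character string group after the unnecessary division codes were deleted.
group = ["現金及び預金", SEP, "関係会社短期貸付金", SEP, "棚卸資産"]
items = to_items(group)
table_csv = to_csv(items)
print(table_csv)
```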

When nouns are extracted from a group of books in the same field as the document to be read and registered in the item table 121, duplicates may be allowed. That is, each noun is registered in the item table 121 as an item string as many times as it appears in the group of books. In this case, in S5 the processing unit 11 deletes a division code if the combined character string formed by joining the character strings before and after it is an item string registered in the item table 121 at least a certain number of times.
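The duplicate-allowing variant can be sketched with a frequency count. `collections.Counter` stands in for an item table that holds repeated entries; the function name `should_join`, the parameter `min_count`, and the sample nouns are hypothetical, and the description does not specify the required count.

```python
from collections import Counter


def should_join(before: str, after: str, item_counts: Counter, min_count: int) -> bool:
    """Join only if the combined string is registered at least min_count times."""
    return item_counts[before + after] >= min_count


# Hypothetical nouns extracted from books in the same field, duplicates kept.
extracted_nouns = [
    "関係会社短期貸付金",
    "関係会社短期貸付金",
    "関係会社短期貸付金",
    "貸付金",
    "短期貸付金",
]
item_counts = Counter(extracted_nouns)

print(should_join("関係会社短期", "貸付金", item_counts, min_count=2))
print(should_join("短期", "貸付金", item_counts, min_count=2))
```

Requiring a minimum count filters out combined strings that appear only incidentally in the source books.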

With the above method, each item can be plausibly recognized even when a single item spans multiple lines within the region of interest of a document.

(Embodiment 2)
Embodiment 2 of the present invention is described below. Components of Embodiment 2 that are the same as in Embodiment 1 are given the same reference numerals, and their detailed description is omitted.
The processing unit 11 according to Embodiment 1 determines which division codes to delete by referring to the item table 121. Alternatively, a learning model may be used to calculate the probability that the combined character string formed by joining the character strings before and after a division code is a single item and should therefore be joined (the combination probability), and the division codes to delete may be determined based on that probability. The combination probability represents a probability and takes a value from 0 to 1.

FIG. 6 is a block diagram of the information processing device according to Embodiment 2. The storage unit 12 of the information processing device 1 according to Embodiment 2 stores the entity files (instance files of a neural network (NN)) that make up the learning model 122 described below. These entity files may be configured as one part of the program.

FIG. 7 is an explanatory diagram of the learning model according to Embodiment 2. The learning model 122 according to Embodiment 2 is, for example, a model that extracts feature values of character strings with a neural network. The input layer of the learning model 122 has a plurality of neurons that accept, as input data, the division codes inserted into the character string group that the processing unit 11 recognized from the document file, together with information on the character strings before and after each division code, and passes the input character strings to the intermediate layer. The intermediate layer has a plurality of neurons that extract feature values of the character strings and passes the extracted feature values to the output layer. The output layer has one or more neurons that output the combination probability of the division codes contained in the character strings, and outputs the combination probability of each division code based on the feature values output from the intermediate layer. Alternatively, the learning model 122 may input the feature values output from the intermediate layer into an SVM (support vector machine) to output the combination probability. The neural network trained with training data (the learning model 122) is intended to be used as a program module that forms part of artificial intelligence software. The learning model 122 is used in the information processing device 1, which includes the processing unit 11 and the storage unit 12 as described above, and a neural network system is formed when the model is executed on the information processing device 1 with its arithmetic processing capability. That is, the processing unit 11 of the information processing device 1, following instructions from the learning model 122 stored in the storage unit 12, performs computations that extract the feature values of the character strings before and after the division codes input to the input layer, and outputs the combination probability of each division code from the output layer.
The learning model 122 may also be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory), a Transformer, or the like. It may also be a pre-trained model such as BERT (Bidirectional Encoder Representations from Transformers).

For the training data, for example, a character string group is used in which division codes that a user has labeled with combination probabilities are inserted into character strings recognized from some of the regions of interest. In this case, the training data is created by labeling each division code between item names with a combination probability of 0 and each division code between characters within an item name with a combination probability of 1.
The storage unit 12 according to Embodiment 2 may also include the item table 121, and the training data may be created using the item table 121. Specifically, a plurality of character strings is prepared by randomly inserting division codes labeled with a combination probability of 1, without duplication, between the characters of item names registered in the item table 121. Next, division codes labeled with a combination probability of 0 are inserted between the prepared character strings, and the prepared character strings are joined. The data of the character string group created in this way is used as the training data.
The processing unit 11 reads the training data described above, compares the predicted combination probability of each division code with the combination probability labeled on that division code, which is the correct value, and learns the weights in the intermediate layer so as to reduce the difference, thereby generating the learning model 122.
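The table-based way of creating training data might look like the sketch below. It reads "without duplication" as "at distinct character positions", which is an interpretation; the function name `make_example`, the parameter `max_codes`, and the parallel token and label lists are assumptions made for illustration.

```python
import random

SEP = "<SEP>"  # division code


def make_example(
    item_names: list[str], rng: random.Random, max_codes: int = 1
) -> tuple[list[str], list[int]]:
    """Return (tokens, labels), with one label per division code in order of appearance."""
    tokens: list[str] = []
    labels: list[int] = []
    for index, name in enumerate(item_names):
        if index > 0:
            tokens.append(SEP)
            labels.append(0)  # code between item names: combination probability 0
        count = min(max_codes, len(name) - 1)
        # Distinct cut positions strictly inside the item name.
        cuts = sorted(rng.sample(range(1, len(name)), count))
        start = 0
        for cut in cuts:
            tokens.append(name[start:cut])
            tokens.append(SEP)
            labels.append(1)  # code inside an item name: combination probability 1
            start = cut
        tokens.append(name[start:])
    return tokens, labels


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens, labels = make_example(["現金及び預金", "関係会社短期貸付金"], rng)
print(tokens, labels)
```

Removing every division code from `tokens` reproduces the original item names in order, so the labels stay consistent with the table by construction.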

As shown in FIG. 7, the processing unit 11 extracts the character strings in the region of interest of the financial statement (see FIG. 2) by OCR or the like, inputs the division codes inserted between the character strings, together with information on the character strings before and after each division code, into the learning model 122, and outputs the combination probability of each division code.

If the combination probability of a division code is equal to or greater than a threshold, the processing unit 11 determines that the character strings before and after that division code should be joined and deletes the division code. The threshold is, for example, 0.5. In FIG. 7, therefore, the division code inserted between "関連会社短期" (affiliated company short-term) and "貸付金" (loans) is deleted.
The processing unit 11 recognizes the items based on the division codes that remain after the above processing, converts the result into a table format with one item per row, and outputs it to the output unit 132 (see FIG. 5).

FIG. 8 is a flowchart showing an example of a processing procedure performed by the processing unit of the information processing device according to Embodiment 2. The processing unit 11 acquires the scanned document file from the input unit 131 (S11). The processing unit 11 extracts the character strings in the region of interest of the document file (S12) and inserts division codes at the line break positions (S13). The processing unit 11 inputs the character strings in the character string group into the learning model (S14) and acquires the combination probability of each division code (S15). The processing unit 11 determines whether the combination probability is equal to or greater than the threshold (S16). If it is (S16: YES), the processing unit 11 deletes the division code (S17) and converts the character string group into a table format (S18). If the combination probability is less than the threshold (S16: NO), the processing unit 11 proceeds to S18. The processing unit 11 outputs the character string group converted into a table format to the output unit 132 (S19) and ends the processing.

The learning model 122 according to Embodiment 2 calculates the probability that the combined character string formed by joining the character strings before and after a division code is a single item (the combination probability), but it may instead calculate the probability that the combined character string is not a single item (the non-combination probability). In that case, the processing unit 11 deletes each division code whose non-combination probability is less than the threshold.
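The threshold step (S16, S17) and the non-combination variant can be sketched together. The learning model itself is not implemented here: the probabilities are hypothetical stand-ins for its output, and the function name, the `non_combination` flag, and the list representation are assumptions. The default threshold of 0.5 is the example value from the description.

```python
SEP = "<SEP>"  # division code


def delete_codes_by_probability(
    group: list[str],
    probabilities: list[float],
    threshold: float = 0.5,
    non_combination: bool = False,
) -> list[str]:
    """probabilities holds one value per division code, in order of appearance."""
    result: list[str] = []
    remaining = iter(probabilities)
    for token in group:
        if token == SEP:
            p = next(remaining)
            # Combination probability: delete when p >= threshold.
            # Non-combination probability: delete when p < threshold.
            delete = (p < threshold) if non_combination else (p >= threshold)
            if not delete:
                result.append(token)
        elif result and result[-1] != SEP:
            result[-1] += token  # the code in between was deleted, so join
        else:
            result.append(token)
    return result


group = ["現金及び預金", SEP, "関連会社短期", SEP, "貸付金"]
joined = delete_codes_by_probability(group, [0.1, 0.9])
joined_alt = delete_codes_by_probability(group, [0.9, 0.1], non_combination=True)
print(joined)
print(joined_alt)
```

With these stand-in values both calls keep the first division code and delete the second, since a non-combination probability of 0.1 carries the same meaning as a combination probability of 0.9.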

(Embodiment 3)
Embodiment 3 of the present invention is described below. Components of Embodiment 3 that are the same as in Embodiment 2 are given the same reference numerals, and their detailed description is omitted.
The learning model 122 according to Embodiment 2 calculates the combination probability of each division code based on the features of the character strings, and the processing unit 11 determines which division codes to delete based on the combination probability. In contrast, the learning model 122 according to Embodiment 3 outputs, based on the feature values of the character strings, a character string group from which every division code whose preceding and following character strings form a single item when joined has been deleted.

FIG. 9 is an explanatory diagram of the learning model according to Embodiment 3. The learning model according to Embodiment 3 is a model capable of outputting a character string group or a sentence, such as an RNN, an LSTM, or a Transformer. The input layer of the learning model 122 has a plurality of neurons that accept a character string group into which division codes have been inserted, and passes the input character string group to the intermediate layer. The intermediate layer has a plurality of neurons that extract feature values of the character strings and passes the extracted feature values to the output layer. The output layer has one or more neurons and, based on the feature values output from the intermediate layer, outputs a character string group from which every division code whose preceding and following character strings form a single item when joined has been deleted.
The processing unit 11 recognizes the items based on the division codes that remain after the above processing, converts the result into a table format with one item per row, and outputs it to the output unit 132 (see FIG. 5).

FIG. 10 is a flowchart showing an example of a processing procedure performed by the processing unit of the information processing device according to Embodiment 3. Steps S21 to S24 are the same as steps S11 to S14. The processing unit 11 acquires a character string group from which every division code whose preceding and following character strings form a single item when joined has been deleted (S25), and converts the character string group into a table format (S26). The processing unit 11 outputs the character string group converted into a table format to the output unit 132 (S27) and ends the processing.

(Modification)
In each of the above-described embodiments, the processing unit 11 processes the information of the character strings in the region of interest. It may, however, additionally refer to information in a related region, which contains information related to the character strings in the region of interest, when deleting division codes.

FIG. 11 is an explanatory diagram showing an example of a region of interest and a related region related to the region of interest. The processing unit 11 of the information processing device takes, for example, the region located to the right of the region of interest as the related region and extracts the character strings contained in the related region using OCR or the like, in the same manner as for the region of interest. The processing unit 11 inserts a division code between characters that span lines to obtain a group of character strings. The related region may be determined based on information input by the user to the input unit 131, or a region that meets a preset condition may be taken as the related region based on position information included in the layout information of the document file. The condition is set as, for example, "a region located to the right of the region of interest that contains only numeric strings." In a financial statement such as that shown in FIG. 11, a numeric string representing an amount rarely spans multiple lines, so one item in the region of interest is estimated to correspond to one line of numeric string in the related region.
Therefore, a region that meets the above condition is taken as the related region, and the number of division codes to delete from the group of character strings in the region of interest is determined based on the number of division codes in the group of character strings extracted from the related region, which is information contained in the related region. Performing the processing according to each embodiment with this number makes it possible to obtain, with higher accuracy, a group of character strings in which division codes are inserted at the necessary and sufficient positions. Furthermore, when the processing unit 11 associates each item in the region of interest one-to-one with each line in the related region and outputs the result to the output unit 132, information that can be used for statistical processing, analysis, and the like can be obtained.
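One way to read this step is sketched below. The rule used here is an assumption, not something stated in this document: because one line of the related region is taken to correspond to one item, the region of interest is assumed to keep as many division codes as the related region contains, and the surplus is deleted. Choosing which codes to delete by a per-code joining score (as produced in the second embodiment) is likewise an illustrative choice, and the function names are hypothetical.

```python
def codes_to_delete(roi_code_count: int, related_code_count: int) -> int:
    """Assumed rule: delete the surplus of division codes in the region of
    interest over the division codes found in the related region."""
    return max(roi_code_count - related_code_count, 0)


def pick_codes(join_scores: list[float], k: int) -> list[int]:
    """Return the positions of the k codes with the highest joining scores.

    join_scores[i] is a hypothetical score that the character strings
    before and after the i-th division code belong to the same item.
    """
    ranked = sorted(range(len(join_scores)), key=lambda i: join_scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Region of interest: 4 division codes; related region: 2 division codes.
    k = codes_to_delete(4, 2)
    print(k, pick_codes([0.9, 0.1, 0.8, 0.2], k))
```

With this reading, the related region only fixes how many codes are removed; which codes are removed is still decided by the processing of each embodiment.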
The information contained in the related region is not limited to the number of division codes and may be, for example, the character strings within the related region. In this case, in the first embodiment, the storage unit 12 holds a plurality of item tables, and the processing unit 11 selects the item table to reference based on the degree of match between the character strings in the related region and the item character strings registered in each item table, which enables highly accurate pattern matching. Furthermore, in the second or third embodiment, the character strings in the related region are treated as the information contained in the related region and are input to the learning model 122, so that the joining probability of each division code or the group of character strings can be output with higher accuracy.
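The item table selection can be illustrated with a small sketch. The document does not define how the degree of match is computed, so the measure below (the fraction of related-region character strings that are registered in a table) is an assumption, and the table names and contents are hypothetical sample data.

```python
def match_degree(related_strings: list[str], table: set[str]) -> float:
    """Assumed measure: fraction of related-region strings registered in the table."""
    if not related_strings:
        return 0.0
    hits = sum(1 for s in related_strings if s in table)
    return hits / len(related_strings)


def select_table(related_strings: list[str], tables: dict[str, set[str]]) -> str:
    """Return the name of the item table with the highest degree of match."""
    return max(tables, key=lambda name: match_degree(related_strings, tables[name]))


if __name__ == "__main__":
    # Hypothetical item tables and related-region strings.
    tables = {
        "balance_sheet": {"Assets", "Liabilities", "Net assets"},
        "income_statement": {"Net sales", "Cost of sales", "Operating income"},
    }
    related = ["Net sales", "Operating income", "Other"]
    print(select_table(related, tables))
```

Any other similarity measure, such as partial string matching, could be substituted in `match_degree` without changing the selection step.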

In each embodiment, whether to combine the character strings before and after a division code is determined based on the character strings in the region of interest, but the determination may also be based on other information. For example, the margins between character strings and the indentation may be recognized based on position information included in the layout information, and the determination may be made based on the recognized margins and indentation.
The determination may also be made based on style information of the character strings in the region of interest, such as the font, or on colors, frames, underlines, or ruled lines.
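A layout-based determination could look like the following sketch. The heuristic is an assumption chosen for illustration (a line whose left edge is indented further than that of the preceding line is treated as a continuation of the same item); the document only states that margins and indentation may be used, and the `Line` type and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float  # x coordinate of the line's left edge, taken from layout information


def should_join(prev: Line, cur: Line, indent_threshold: float = 5.0) -> bool:
    """Assumed heuristic: a line indented deeper than the previous line
    by at least indent_threshold continues the previous item."""
    return cur.left - prev.left >= indent_threshold


if __name__ == "__main__":
    first = Line("Notes and accounts", left=10.0)
    second = Line("receivable", left=20.0)
    third = Line("Inventories", left=10.0)
    print(should_join(first, second), should_join(second, third))
```

Documents that use a different convention, for example aligning continuation lines with the first line, would need a different rule or a combination with the string-based determination of the embodiments.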

In each embodiment, the document to be read is a financial statement, but the document is not limited to this. The document to be read may be, for example, an accounting table, various statistical materials, personnel data, or a customer information document.

The embodiments disclosed herein are illustrative in all respects and should not be considered restrictive. The technical features described in the embodiments can be combined with one another, and the scope of the present invention is intended to include all modifications within the scope of the claims and the range of equivalents of the claims.

REFERENCE SIGNS LIST
1 Information processing device
11 Processing unit
12 Storage unit
121 Item table
122 Learning model
12a Recording medium
13 Input/output I/F
131 Input unit
132 Output unit
P Program

Claims

1. An information processing method in which a computer executes processing comprising:
acquiring a group of character strings that includes a plurality of character strings, which are located across multiple lines within one area of a document containing areas made up of columns and rows and which constitute a plurality of items, and codes indicating divisions inserted between the character strings;
deleting, from among the codes in the group of character strings and based on information included in the area, each code whose preceding and following character strings should be combined; and
recognizing a character string between two of the codes in the group of character strings, from which each code whose preceding and following character strings should be combined has been deleted, as one item of the plurality of items.

2. An information processing method in which a computer executes processing comprising:
acquiring a group of character strings that includes a plurality of character strings located across multiple lines within one area of a document and codes indicating divisions inserted between the character strings;
determining the number of codes to be deleted from the group of character strings in the area based on the number of codes indicating divisions in a group of character strings extracted from a related area related to the area; and
deleting, based on information included in the area, each code whose preceding and following character strings should be combined.

3. The information processing method according to claim 1, further comprising:
determining, based on the character strings before and after the code, whether the character strings before and after the code should be combined; and
deleting the code when it is determined that the character strings before and after the code should be combined.

4. The information processing method according to claim 3, further comprising referring to a table in which predetermined character strings are recorded to determine whether the character strings before and after the code should be combined.

5. The information processing method according to claim 3, further comprising using a learning model trained to output, when the character strings before and after the code are input, a probability that the character strings before and after the code should be combined, and outputting the probability.

6. The information processing method according to claim 5, further comprising:
comparing the probability with a threshold; and
deleting the code based on the result of the comparison between the probability and the threshold.

7. The information processing method according to claim 1, further comprising:
inputting the group of character strings into a learning model trained to output, when a group of character strings is input, a group of character strings from which each code whose preceding and following character strings should be combined has been deleted; and
outputting the group of character strings from which each code whose preceding and following character strings should be combined has been deleted.

8. The information processing method according to claim 1, further comprising deleting the code based on information included in a related area related to the area containing the character strings.

9. A program that causes a computer to execute processing comprising:
acquiring a group of character strings that includes a plurality of character strings, which are located across multiple lines within one area of a document containing areas made up of columns and rows and which constitute a plurality of items, and codes indicating divisions inserted between the character strings;
deleting, from among the codes in the group of character strings and based on information included in the area, each code whose preceding and following character strings should be combined; and
recognizing a character string between two of the codes in the group of character strings, from which each code whose preceding and following character strings should be combined has been deleted, as one item of the plurality of items.

10. A program that causes a computer to execute processing comprising:
acquiring a group of character strings that includes a plurality of character strings located across multiple lines within one area of a document and codes indicating divisions inserted between the character strings;
determining the number of codes to be deleted from the group of character strings in the area based on the number of codes indicating divisions in a group of character strings extracted from a related area related to the area; and
deleting the code when, based on information included in the area, the character strings before and after the code should be combined.

11. An information processing device comprising:
an acquisition unit that acquires a group of character strings that includes a plurality of character strings, which are located across multiple lines within one area of a document containing areas made up of columns and rows and which constitute a plurality of items, and codes indicating divisions inserted between the character strings;
a deletion unit that deletes, from among the codes in the group of character strings and based on information included in the area, each code whose preceding and following character strings should be combined; and
a recognition unit that recognizes a character string between two of the codes in the group of character strings, from which each code whose preceding and following character strings should be combined has been deleted, as one item of the plurality of items.

12. An information processing device comprising:
an acquisition unit that acquires a group of character strings that includes a plurality of character strings located across multiple lines within one area of a document and codes indicating divisions inserted between the character strings;
a determination unit that determines the number of codes to be deleted from the group of character strings in the area based on the number of codes indicating divisions in a group of character strings extracted from a related area related to the area; and
a deletion unit that deletes the code when, based on information included in the area, the character strings before and after the code should be combined.