JP5771971B2

JP5771971B2 - Information processing apparatus, information processing method, and computer program

Info

Publication number: JP5771971B2
Application number: JP2010278634A
Authority: JP
Inventors: 淑隆林
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2010-12-14
Filing date: 2010-12-14
Publication date: 2015-09-02
Anticipated expiration: 2030-12-14
Also published as: JP2012128603A

Description

本発明は、複数の文字列をデータ構造に効率良く格納し、当該データ構造へ文字列を追加して格納し、当該データ構造から文字列を検索するのに好適な技術に関する。 The present invention relates to a technique suitable for efficiently storing a plurality of character strings in a data structure, adding a character string to the data structure, storing the character string, and retrieving the character string from the data structure.

従来から、複数の文字列の集合に指定された文字列が存在するか否かを高速に判定するためのデータ構造として、トライ（ｔｒｉｅ）構造を用いた手法が提案されている。 Conventionally, a method using a trie structure has been proposed as a data structure for quickly determining whether or not a designated character string exists in a set of a plurality of character strings.

トライ構造は、複数の文字列集合における共通接頭辞を状態遷移構造として表現することで、重複した記憶領域を削減し、且つ検索処理における無駄な処理も削減する。 The trie structure expresses a common prefix in a plurality of character string sets as a state transition structure, thereby reducing duplicate storage areas and reducing unnecessary processing in search processing.

このトライ構造を効率的に実装し且つ高速に検索する方法として、ダブル配列法が開示されている（非特許文献１）。ダブル配列法は２つの整数型配列（以下、ＢＡＳＥ配列及びＣＨＥＣＫ配列）にトライ構造を効率的に格納し、トライ構造上における状態遷移を、ＢＡＳＥ配列上の加算演算とＣＨＥＣＫ配列上の比較演算で実現する非常に高速な検索手法を開示している。さらに接尾辞部分について、状態遷移の分岐が発生しない状態遷移を１つの文字列型配列（以下、ＴＡＩＬ配列）に集約して格納することで、状態遷移を表すＢＡＳＥ配列とＣＨＥＣＫ配列の記憶領域を削減し且つ検索速度の向上を図っている。 A double array method is disclosed as a method for efficiently mounting this trie structure and performing a high-speed search (Non-Patent Document 1). In the double array method, the trie structure is efficiently stored in two integer type arrays (hereinafter referred to as the BASE array and the CHECK array), and the state transition on the trie structure is performed by an addition operation on the BASE array and a comparison operation on the CHECK array. A very fast search technique is disclosed. Furthermore, for the suffix part, state transitions that do not cause branching of state transitions are aggregated and stored in one character string type array (hereinafter referred to as TAIL array), so that the storage areas of the BASE array and the CHECK array representing the state transition are stored. To improve the search speed.

また、非特許文献２では前記ダブル配列法を改良し、よりコンパクトなダブル配列構造を実装する方法が開示されている。具体的には、前述したＴＡＩＬ配列について、後方一致する共通接尾辞を併合することで、ＴＡＩＬ配列のさらなる圧縮を行っている。 Non-Patent Document 2 discloses a method for improving the double arrangement method and mounting a more compact double arrangement structure. Specifically, the TAIL sequence is further compressed by merging a common suffix that matches the rear end of the TAIL sequence.

このようにダブル配列法は、トライ構造上の状態遷移において、分岐の発生しない部分をＴＡＩＬ配列として実装することで、空間領域の削減と検索速度の向上を図っている。 As described above, in the double array method, a portion where no branch occurs in the state transition on the tri structure is mounted as a TAIL array, thereby reducing the space area and improving the search speed.

一方、トライ構造の空間効率を重視した実装方法として、簡潔木法が開示されている（非特許文献３）。簡潔木法はトライ構造上で幅優先探索を行い、トライ構造をビット列に変換する。このビット列構造とトライ構造上の状態遷移を表す文字集合を合わせて格納することで、非常にコンパクトなトライ構造の実装が実現できる。 On the other hand, a concise tree method is disclosed as a mounting method that places importance on the space efficiency of the trie structure (Non-Patent Document 3). The concise tree method performs a breadth-first search on the trie structure and converts the trie structure into a bit string. By storing the bit string structure and the character set representing the state transition on the trie structure together, a very compact trie structure can be implemented.

しかしながら、簡潔木法は、常にビット列構造を先頭からカウントする関数を用いて状態遷移を行う。具体的には、先頭からのビット１の数を返すrank₁関数、及び先頭からｎ番目の０ビットの位置を返すselect₀関数を利用する。これらの状態遷移関数は算出コストが高く、特にselect₀関数は非常に算出コストが高い。従って、前述したビット列構造に対して、任意の大きさの区画ごとに当該ビット数を予め算出した索引を併用する実装が一般的である。 However, the simple tree method always performs state transition using a function that counts the bit string structure from the top. Specifically, a rank ₁ function that returns the number of bits 1 from the head and a select ₀ function that returns the position of the nth 0 bit from the head are used. These state transition functions have a high calculation cost, and in particular, the select ₀ function has a very high calculation cost. Therefore, in general, the above-described bit string structure is implemented by using an index in which the number of bits is calculated in advance for each arbitrarily sized partition.

青江順一、「ダブル配列による高速ディジタル検索アルゴリズム」、電子情報通信学会論文誌、Vol.J71-D、No.4、pp.1592-1600、1988Junichi Aoe, "High-speed digital search algorithm using double array", IEICE Transactions, Vol.J71-D, No.4, pp.1592-1600, 1988 矢田晋ほか、「ダブル配列におけるキャッシュの効率化」、FIT2006、pp.71-72、2006Satoshi Yada et al. “Efficiency of cache in double array”, FIT2006, pp.71-72, 2006 G.Jacobson、「Space-efficent static trees and graphs」、In Proc. 30th FOCS、pp.549-554、1989G. Jacobson, "Space-efficent static trees and graphs", In Proc. 30th FOCS, pp. 549-554, 1989

ダブル配列法は、非常に高速な検索手法が特徴であるが、元になる文字列集合の大きさに対して、同等もしくは大きなサイズのデータ構造になる。また、簡潔木法は元になる文字列集合の大きさに対して、非常にコンパクトなデータ構造になるが、ビット列の探索において検索手法が高速にならない。 The double array method is characterized by a very high-speed search method. However, the double array method has a data structure having the same or larger size than the size of the original character string set. In addition, the concise tree method has a very compact data structure with respect to the size of the original character string set, but the search method is not accelerated in the search for bit strings.

本発明は上記の課題を解決するためになされたものであり、指定された文字列を用いて、データ構造として登録された文字列を検索する処理の効率化を図る技術を提供することを目的とする。 The present invention has been made to solve the above-described problem, and an object of the present invention is to provide a technique for improving the efficiency of processing for searching a character string registered as a data structure using a specified character string. And

上記した目的を達成するために、本発明の情報処理装置は、検索対象となる複数の第１のキーワードをデータとしてダブル配列へ登録する登録手段と、前記登録手段で登録されたダブル配列のデータを用いて幅優先探索を行うための簡潔木構造に登録するためのデータを作成する作成手段と、前記作成手段によって作成したデータを備えた簡潔木構造を記憶する記憶手段と、前記複数の第１のキーワードを検索するための検索キーワードである第２のキーワードの入力を受け付ける受付手段と、前記受付手段で受け付けた第２のキーワードが前記複数の第１のキーワードに含まれるか、を前記記憶手段によって記憶された前記簡潔木構造を用いて検索する検索手段と、を備え、前記作成手段は、前記ダブル配列のトライ構造を構成するノードに対するＴＡＩＬ配列へのリンクを前記簡潔木構造のノードに対する前記ＴＡＩＬ配列へのリンクとしてデータを作成し、前記検索手段は、前記受付手段で受け付けた第２のキーワードの最初の文字を読み込み、前記簡潔木構造のルートノードを第１のノードに設定し、該第１のノードの子ノードに前記読み込んだ文字を示す子ノードがあるかを判定し、前記読み込んだ文字を示す子ノードがあると判定した場合には、当該子ノードを第１のノードに設定し、前記第２のキーワードの次の１文字を読み込み、変更後の第１のノードに該読み込んだ１文字を示す子ノードが存在するかを判定する処理を繰り返し、前記読み込んだ文字を示す子ノードがないと判定した場合には、前記子ノードから前記ＴＡＩＬ配列へのリンクを参照することで前記ＴＡＩＬ配列に遷移し、前記ＴＡＩＬ配列に残りの文字列が登録されているかを判定することにより、前記第２のキーワードが前記複数の第１のキーワードに含まれるかを検索することを備えることを特徴とする。 In order to achieve the above-described object, an information processing apparatus according to the present invention includes a registration unit that registers a plurality of first keywords to be searched as data in a double array, and a double-array data registered by the registration unit. Creating means for creating data to be registered in a concise tree structure for performing breadth-first search, storage means for storing a concise tree structure comprising data created by the creating means, and the plurality of first Receiving means for receiving an input of a second keyword which is a search keyword for searching for the keyword, and whether the second keyword received by the receiving means is included in the plurality of first keywords. Search means for searching using the concise tree structure stored by the node, wherein the creation means comprises nodes constituting the double array trie structure. The data is generated by using the link to the TAIL array as the link to the TAIL array for the node of the concise tree structure, and the search means reads the first character of the second keyword accepted by the accepting means, and The root node of the tree structure is set as the first node, it is determined whether there is a child node indicating the read character in the child node of the first node, and it is determined that there is a child node indicating the read character In this case, the child node is set as the first node, the next character after the second keyword is read, and the child node indicating the read one character exists in the changed first node. If it is determined that there is no child node indicating the read character, the previous node is referred to by referring to the link from the child node to the TAIL array. It transitions to TAIL sequence by determining whether the rest of the string is registered in the TAIL sequence that comprises searching whether the second keyword is included in the plurality of first keywords Features.

上記した目的を達成するために、本発明の情報処理方法は、データの検索処理を行う情報処理装置によって行われる情報処理方法であって、前記情報処理装置は、検索対象となる複数の第１のキーワードをデータとしてダブル配列へ登録する登録工程と、前記登録工程で登録されたダブル配列のデータを用いて幅優先探索を行うための簡潔木構造に登録するためのデータを作成する作成工程と、前記作成工程によって作成したデータを備えた簡潔木構造を記憶する記憶工程と、前記複数の第１のキーワードを検索するための検索キーワードである第２のキーワードの入力を受け付ける受付工程と、前記受付工程で受け付けた第２のキーワードが前記複数の第１のキーワードに含まれるか、を前記記憶工程によって記憶された前記簡潔木構造を用いて検索する検索工程と、を実行し、前記作成工程は、前記ダブル配列のトライ構造を構成するノードに対するＴＡＩＬ配列へのリンクを前記簡潔木構造のノードに対する前記ＴＡＩＬ配列へのリンクとしてデータを作成し、前記検索工程は、前記受付手段で受け付けた第２のキーワードの最初の文字を読み込み、前記簡潔木構造のルートノードを第１のノードに設定し、該第１のノードの子ノードに前記読み込んだ文字を示す子ノードがあるかを判定し、前記読み込んだ文字を示す子ノードがあると判定した場合には、当該子ノードを第１のノードに設定し、前記第２のキーワードの次の１文字を読み込み、変更後の第１のノードに該読み込んだ１文字を示す子ノードが存在するかを判定する処理を繰り返し、前記読み込んだ文字を示す子ノードがないと判定した場合には、前記子ノードから前記ＴＡＩＬ配列へのリンクを参照することで前記ＴＡＩＬ配列に遷移し、前記ＴＡＩＬ配列に残りの文字列が登録されているかを判定することにより、前記第２のキーワードが前記複数の第１のキーワードに含まれるかを検索することを実行することを特徴とする。 In order to achieve the above object, an information processing method of the present invention is an information processing method performed by an information processing apparatus that performs data search processing, and the information processing apparatus includes a plurality of first information to be searched . A registration step of registering the keyword in the double array as data, and a creation step of creating data for registering in a simple tree structure for performing a breadth-first search using the double array data registered in the registration step; A storage step for storing a concise tree structure including data created by the creation step, a reception step for receiving input of a second keyword that is a search keyword for searching the plurality of first keywords, and the reception Whether the second keyword received in the step is included in the plurality of first keywords is the simple tree structure stored in the storage step. And a search step for searching, and the creation step creates data as a link to the TAIL array for the node constituting the tri-structure of the double array as a link to the TAIL array for the node of the concise tree structure The search step reads the first character of the second keyword accepted by the accepting means, sets the root node of the concise tree structure as the first node, and sets the child node of the first node as the child node. It is determined whether or not there is a child node indicating the read character. If it is determined that there is a child node indicating the read character, the child node is set as the first node, and the next keyword after the second keyword. The process of determining whether there is a child node indicating the read one character in the first node after the change is repeated to indicate the read character When it is determined that there is no node, the transition to the TAIL array is made by referring to the link from the child node to the TAIL array, and it is determined whether the remaining character string is registered in the TAIL array. , Searching whether the second keyword is included in the plurality of first keywords .

上記した目的を達成するために、本発明のコンピュータプログラムは、データの検索処理を行う情報処理装置において読取り実行可能なコンピュータプログラムであって、前記情報処理装置を、検索対象となる複数の第１のキーワードをデータとしてダブル配列へ登録する登録手段と、前記登録手段で登録されたダブル配列のデータを用いて幅優先探索を行うための簡潔木構造に登録するためのデータを作成する作成手段と、前記作成手段によって作成したデータを備えた簡潔木構造を記憶する記憶手段と、前記複数の第１のキーワードを検索するための検索キーワードである第２のキーワードの入力を受け付ける受付手段と、前記受付手段で受け付けた第２のキーワードが前記複数の第１のキーワードに含まれるか、を前記記憶手段によって記憶された前記簡潔木構造を用いて検索する検索手段と、して機能させ、前記作成手段は、前記ダブル配列のトライ構造を構成するノードに対するＴＡＩＬ配列へのリンクを前記簡潔木構造のノードに対する前記ＴＡＩＬ配列へのリンクとしてデータを作成し、前記検索手段は、前記受付手段で受け付けた第２のキーワードの最初の文字を読み込み、前記簡潔木構造のルートノードを第１のノードに設定し、該第１のノードの子ノードに前記読み込んだ文字を示す子ノードがあるかを判定し、前記読み込んだ文字を示す子ノードがあると判定した場合には、当該子ノードを第１のノードに設定し、前記第２のキーワードの次の１文字を読み込み、変更後の第１のノードに該読み込んだ１文字を示す子ノードが存在するかを判定する処理を繰り返し、前記読み込んだ文字を示す子ノードがないと判定した場合には、前記子ノードから前記ＴＡＩＬ配列へのリンクを参照することで前記ＴＡＩＬ配列に遷移し、前記ＴＡＩＬ配列に残りの文字列が登録されているかを判定することにより、前記第２のキーワードが前記複数の第１のキーワードに含まれるかを検索することとして機能させるためのコンピュータプログラムである。 In order to achieve the above object, a computer program of the present invention is a computer program that can be read and executed by an information processing apparatus that performs data search processing, and the information processing apparatus is a plurality of first targets to be searched . Registering means for registering the keywords in the double array as data, creating means for creating data for registering in a concise tree structure for performing breadth-first search using the double array data registered by the registering means, Storage means for storing a concise tree structure having data created by the creating means, accepting means for accepting input of a second keyword which is a search keyword for retrieving the plurality of first keywords, and accepting the accepting Whether the second keyword received by the means is included in the plurality of first keywords is determined by the storage means. And a search means for searching using the stored simple tree structure, wherein the creating means sets a link to the TAIL array for the nodes constituting the double array trie structure for the nodes of the simple tree structure. Creating data as a link to the TAIL array, the search means reads the first character of the second keyword accepted by the acceptance means, sets the root node of the concise tree structure as the first node, It is determined whether there is a child node indicating the read character in the child node of the first node, and if it is determined that there is a child node indicating the read character, the child node is set as the first node. Setting, reading the next one character of the second keyword, and determining whether there is a child node indicating the read one character in the first node after the change. If it is determined that there is no child node indicating the read character, the transition to the TAIL array is made by referring to the link from the child node to the TAIL array, and the remaining character strings are stored in the TAIL array. It is a computer program for functioning as a search to determine whether the second keyword is included in the plurality of first keywords by determining whether the second keyword is registered .

本発明によれば、巨大な文字列集合を単体の計算機上で実現可能なサイズまでコンパクトに実装した辞書構造ファイルを実行できるシステムを提供でき、さらには、組み込み機器のような限られたリソースを持つ環境では、辞書をよりコンパクトに実装し且つ実用的に検索できる、等の効果を奏する。 According to the present invention, it is possible to provide a system capable of executing a dictionary structure file in which a large set of character strings is compactly implemented to a size that can be realized on a single computer, and further, limited resources such as embedded devices can be provided. In an environment that has a dictionary, it is possible to implement a dictionary more compactly and search practically.

本発明の実施形態における文字列登録検索システムの構成を示す図である。It is a figure which shows the structure of the character string registration search system in embodiment of this invention. 本発明の実施形態における各種端末のハードウェア構成を示す図である。It is a figure which shows the hardware constitutions of the various terminals in embodiment of this invention. 本発明の実施形態における文字列登録処理のフローチャートである。It is a flowchart of the character string registration process in embodiment of this invention. 本発明の実施形態におけるＴＡＩＬ併合処理のフローチャートである。It is a flowchart of the TAIL merge process in the embodiment of the present invention. 本発明の実施形態における簡潔木生成処理のフローチャートである。It is a flowchart of the simple tree production | generation process in embodiment of this invention. 本発明の実施形態における遷移可能集合追加処理のフローチャートである。It is a flowchart of the transition possible set addition process in the embodiment of the present invention. 本発明の実施形態におけるキーワード集合の一例である。It is an example of the keyword set in the embodiment of the present invention. 本発明の実施形態における併合ＴＡＩＬ配列を用いたダブル配列構造の一例である。It is an example of the double arrangement | sequence structure using the merged TAIL arrangement | sequence in embodiment of this invention. 本発明の実施形態における併合ＴＡＩＬ配列を用いた簡潔木構造の一例である。It is an example of the simple tree structure using the merged TAIL arrangement | sequence in embodiment of this invention. 本発明の実施形態における簡潔木検索処理のフローチャートである。It is a flowchart of the simple tree search process in the embodiment of the present invention.

以下、図面を参照して本発明の実施の形態の一例について説明する。 Hereinafter, an example of an embodiment of the present invention will be described with reference to the drawings.

図１は、本発明の実施形態における文字列登録検索装置の構成を示す図である。 FIG. 1 is a diagram showing a configuration of a character string registration / retrieval apparatus according to an embodiment of the present invention.

文字列登録検索装置１００は、ダブル配列構築部１０２と、簡潔木構築部１０３と、簡潔木データ１０４と、検索結果表示部１０５と、簡潔木検索部１０６とを備える。尚、簡潔木データ１０４は後述する外部メモリ２１１等の記憶装置に記憶されている。 The character string registration / retrieval apparatus 100 includes a double array construction unit 102, a concise tree construction unit 103, a concise tree data 104, a search result display unit 105, and a concise tree search unit 106. The concise tree data 104 is stored in a storage device such as an external memory 211 described later.

ダブル配列構築部１０２は、キーワード集合１０１を入力としてダブル配列を構築する。構築されたダブル配列は簡潔木構築部１０３に送られて、簡潔木データ１０４を生成する。簡潔木検索部１０６は、検索キーワード１０７を入力として簡潔木データ１０４を検索し、検索結果を検索結果表示部１０５に表示する。これら一連のデータ登録処理及びデータ検索処理については、詳しく後述する。 The double array construction unit 102 constructs a double array using the keyword set 101 as an input. The constructed double array is sent to the concise tree construction unit 103 to generate concise tree data 104. The simple tree search unit 106 searches the simple tree data 104 with the search keyword 107 as an input, and displays the search result on the search result display unit 105. The series of data registration processing and data search processing will be described later in detail.

次に、図１の文字列登録検索装置１００のハードウェア構成について、図２を用いて説明する。 Next, the hardware configuration of the character string registration / retrieval apparatus 100 in FIG. 1 will be described with reference to FIG.

図中、ＣＰＵ２０１は、システムバス２０４に接続される後述の各デバイスやコントローラを統括的に制御する。また、ＲＯＭ２０３あるいは外部メモリ２１１には、ＣＰＵ２０１の制御プログラムであるＢＩＯＳ（Basic Input / Output System）やオペレーティングシステムプログラム（以下、ＯＳ）や、文字列登録検索装置１００に後述する各種の処理を実行させるために必要な各種プログラムやデータ等が記憶されている。ＲＡＭ２０２は、ＣＰＵ２０１の主メモリ、ワークエリア等として機能する。 In the figure, a CPU 201 comprehensively controls each device and controller described later connected to a system bus 204. In addition, the ROM 203 or the external memory 211 causes a BIOS (Basic Input / Output System) or an operating system program (hereinafter referred to as OS), which is a control program of the CPU 201, or the character string registration / retrieval apparatus 100 to execute various processes described later. Various programs and data necessary for this are stored. The RAM 202 functions as a main memory, work area, and the like for the CPU 201.

ＣＰＵ２０１は、処理の実行に際して必要なプログラム等をＲＡＭ２０２にロードして、プログラムを実行することで後述する各種処理を実現するものである。また、入力コントローラ（入力Ｃ）２０５は、キーボードやポインティングデバイス等で構成される入力装置２０９からの入力を制御する。ビデオコントローラ（ＶＣ）２０６は、ディスプレイ装置２１０等の表示装置への表示を制御する。ディスプレイ装置２１０は、例えばＣＲＴディスプレイや液晶ディスプレイ等で構成される。 The CPU 201 implements various processes to be described later by loading a program or the like necessary for executing the process into the RAM 202 and executing the program. An input controller (input C) 205 controls input from an input device 209 configured with a keyboard, a pointing device, and the like. A video controller (VC) 206 controls display on a display device such as the display device 210. The display device 210 is composed of, for example, a CRT display or a liquid crystal display.

メモリコントローラ（ＭＣ）２０７は、ブートプログラム、ブラウザソフトウエア、各種のアプリケーション、フォントデータ、ユーザファイル、編集ファイル、各種データ等を記憶するハードディスク（ＨＤ）やフロッピーディスク（登録商標ＦＤ）或いはＰＣＭＣＩＡカードスロットにアダプタを介して接続されるコンパクトフラッシュメモリ等の外部メモリ２１１へのアクセスを制御する。 A memory controller (MC) 207 is a hard disk (HD), floppy disk (registered trademark FD) or PCMCIA card slot for storing boot programs, browser software, various applications, font data, user files, editing files, various data, and the like. Controls access to an external memory 211 such as a compact flash memory connected via an adapter.

通信Ｉ／Ｆコントローラ（通信Ｉ／ＦＣ）２０８は、ネットワークを介して、外部機器と接続・通信するものであり、ネットワークでの通信制御処理を実行する。例えば、ＴＣＰ／ＩＰを用いたインターネット通信等が可能である。 A communication I / F controller (communication I / FC) 208 is connected to and communicates with an external device via a network, and executes communication control processing in the network. For example, Internet communication using TCP / IP is possible.

なお、ＣＰＵ２０１は、例えばＲＡＭ２０２内の表示情報用領域へアウトラインフォントの展開（ラスタライズ）処理を実行することにより、ディスプレイ装置２１０上での表示を可能としている。また、ＣＰＵ２０１は、ディスプレイ装置２１０上の不図示のマウスカーソル等でのユーザ指示を可能とする。以上が、文字列登録検索装置１００のハードウェア構成の説明であるが、後述する各種の処理を実行可能であれば、必ずしも図２に記載のハードウェア構成を有していなくとも構わないことは言うまでもない。 Note that the CPU 201 enables display on the display device 210 by executing outline font rasterization processing on a display information area in the RAM 202, for example. Further, the CPU 201 enables a user instruction with a mouse cursor (not shown) on the display device 210. The above is the description of the hardware configuration of the character string registration / retrieval device 100. However, the hardware configuration illustrated in FIG. 2 is not necessarily required as long as various processes described below can be performed. Needless to say.

次に、文字列登録検索装置１００における文字列登録処理について、図３から図９を用いて、詳しく説明する。 Next, the character string registration process in the character string registration / retrieval apparatus 100 will be described in detail with reference to FIGS.

図３は、文字列登録処理の全体フローチャートを示す図である。文字列登録処理では、ＣＰＵ２０１は、キーワード集合１０１を読み込み、ダブル配列構造を構築してから簡潔木データ構造を作成する処理を行う。この文字列登録処理は、ＣＰＵ２０１をダブル配列構築部１０２、簡潔木構築部１０３として機能させるためのプログラムによる制御に従って行われる処理である。文字列登録処理の説明の前に、この処理で扱うキーワード集合の一例を、図７を参照して説明する。 FIG. 3 is an overall flowchart of the character string registration process. In the character string registration process, the CPU 201 reads the keyword set 101, builds a double array structure, and then creates a concise tree data structure. This character string registration processing is processing performed according to control by a program for causing the CPU 201 to function as the double array construction unit 102 and the simple tree construction unit 103. Prior to the description of the character string registration process, an example of a keyword set handled in this process will be described with reference to FIG.

図７は、キーワード集合１０１の一例である。キーワード７０１は、文字列「山形県」に対して数値「１０」が定義されている。同様に、キーワード７０２は文字列「山梨県」に対して数値「２０」が定義され、キーワード７０３は文字列「大阪府大阪市」に数値「３０」が定義されている。キーワード７０４から７０６についても同様である。前記数値は、例えばＩＤ番号を示すものである、等であり、キーワード集合を利用するソフトウェア上において意味を持つ数値である（以下、ＯＵＴＰＵＴ値とする）。 FIG. 7 is an example of the keyword set 101. The keyword 701 has a numerical value “10” defined for the character string “Yamagata Prefecture”. Similarly, the keyword 702 has a numerical value “20” defined for the character string “Yamanashi Prefecture”, and the keyword 703 has a numerical value “30” defined for the character string “Osaka City, Osaka Prefecture”. The same applies to the keywords 704 to 706. The numerical value is, for example, an ID number, and the like, and is a numerical value having meaning on software using a keyword set (hereinafter referred to as an OUTPUT value).

図３に戻って、ステップＳ３０１において、ＣＰＵ２０１は、図７に示すキーワード集合を１行ずつ読み込む。このとき、行情報は所定の書式に従って文字列とＯＵＴＰＵＴ値に分解され、ステップＳ３０２において、ダブル配列に追加される。本実施形態におけるダブル配列は、文字列による検索処理が成功すると、ＯＵＴＰＵＴ値が返される。 Returning to FIG. 3, in step S301, the CPU 201 reads the keyword set shown in FIG. 7 line by line. At this time, the line information is decomposed into a character string and an OUTPUT value according to a predetermined format, and added to the double array in step S302. The double array in this embodiment returns an OUTPUT value when a search process using a character string is successful.

続いて、ステップＳ３０３において、ＣＰＵ２０１は、すべてのキーワード集合について登録処理が完了したかどうかを判定し、未処理情報があると判定した場合には処理をステップＳ３０１に進める。すべてのキーワードをダブル配列へ登録完了したと判定した場合（ステップＳ３０３で「ＹＥＳ」の場合）は、ステップＳ３０４に進み、ＴＡＩＬ併合処理を実施する。 Subsequently, in step S303, the CPU 201 determines whether or not the registration process has been completed for all keyword sets. If it is determined that there is unprocessed information, the process proceeds to step S301. If it is determined that all keywords have been registered in the double array ("YES" in step S303), the process proceeds to step S304, and TAIL merge processing is performed.

ＴＡＩＬ併合処理は、非特許文献２で開示されている共通接尾辞を併合したＴＡＩＬ配列のデータ構造について、ダブル配列への追加処理（ステップＳ３０２）の際に発生するＴＡＩＬ配列内の未使用領域を削減しつつ当該データ構造を作成する処理を示す。この処理の詳細は図４を参照して説明する。 In the TAIL merge processing, an unused area in the TAIL sequence that is generated when the data structure of the TAIL sequence merged with the common suffix disclosed in Non-Patent Document 2 is added to the double array (step S302). A process of creating the data structure while reducing the number is shown. Details of this processing will be described with reference to FIG.

ここで、図４を参照して、図３のステップＳ３０４のＴＡＩＬ併合処理の詳細について説明する。図４はＴＡＩＬ併合処理の詳細を示すフローチャートである。ＴＡＩＬ併合処理では、ＴＡＩＬ配列内の未使用領域を回収するために、ＢＡＳＥ配列からＴＡＩＬ配列へのリンクが定義されているものを探索する。 Here, the details of the TAIL merging process in step S304 in FIG. 3 will be described with reference to FIG. FIG. 4 is a flowchart showing details of the TAIL merging process. In the TAIL merging process, in order to collect unused areas in the TAIL sequence, a search is made for those in which a link from the BASE sequence to the TAIL sequence is defined.

ＣＰＵ２０１は、ステップＳ４０１で、当該リンクを登録するためのリストを初期化し、続くステップＳ４０２からＢＡＳＥ配列の探索を開始する。具体的には、ＢＡＳＥ配列の値が負数になっているものがＴＡＩＬ配列へのリンクを表現しており、その絶対値がＴＡＩＬ配列の添字である。従って、有効なＢＡＳＥ配列の中から負値を持つものについて、そのリンク先となるＴＡＩＬ配列に登録されたＴＡＩＬ文字列を取得し（ステップＳ４０３）、取得したＴＡＩＬ文字列を反転させた反転ＴＡＩＬ文字列とリンク元となるＢＡＳＥ配列の添字をセットにして前記リストに追加する（ステップＳ４０４）。 In step S401, the CPU 201 initializes a list for registering the link, and starts searching for a BASE array from the subsequent step S402. Specifically, a negative value of the BASE array represents a link to the TAIL array, and the absolute value is a subscript of the TAIL array. Therefore, for a valid BASE array having a negative value, a TAIL character string registered in the linked TAIL array is acquired (step S403), and the inverted TAIL character obtained by inverting the acquired TAIL character string A subscript of the BASE array as a column and a link source is added as a set to the list (step S404).

すべてのＴＡＩＬ文字列を追加した後、続くステップＳ４０５において、反転ＴＡＩＬ文字列をキーとして、前記リストを逆順ソートする。即ち、前記リストを順に辿ることで、反転ＴＡＩＬ文字列を逆辞書順に取得することができる。 After all the TAIL character strings have been added, in the subsequent step S405, the list is sorted in reverse order using the inverted TAIL character string as a key. That is, the inverted TAIL character string can be acquired in reverse dictionary order by tracing the list in order.

続くステップＳ４０６で、ＣＰＵ２０１は新しくＴＡＩＬ配列を格納する領域（以下、併合ＴＡＩＬ配列）を確保し、ステップＳ４０７で前記リストから先頭要素を取得する。続くステップＳ４０８で、ステップＳ４０６で確保した併合ＴＡＩＬ配列に、反転していない元の正ＴＡＩＬ文字列を登録する。元の正ＴＡＩＬ文字列は、反転ＴＡＩＬ文字列を反転させて得ることもできるし、ステップＳ４０４において、別要素にセットして前記リストに追加してもよい。続くステップＳ４０９で、ＢＡＳＥ配列の更新を行う。前記反転ＴＡＩＬ文字列とセットで追加したＢＡＳＥ配列の添字を参照することで、当該ＢＡＳＥ配列の値を併合ＴＡＩＬ配列への添字の負数に更新し、ＢＡＳＥ配列から併合ＴＡＩＬ配列へのリンクを更新する。 In the subsequent step S406, the CPU 201 secures an area for storing a new TAIL array (hereinafter referred to as a merged TAIL array), and acquires the first element from the list in step S407. In subsequent step S408, the original non-inverted original TAIL character string is registered in the merged TAIL array secured in step S406. The original positive TAIL character string can be obtained by inverting the inverted TAIL character string, or may be set in another element and added to the list in step S404. In the subsequent step S409, the BASE array is updated. By referring to the subscript of the BASE array added as a set with the inverted TAIL character string, the value of the BASE array is updated to the negative number of the subscript to the merged TAIL array, and the link from the BASE array to the merged TAIL array is updated. .

具体例をあげて説明する。ＢＡＳＥ［２６］＝−１２であるとき、ＴＡＩＬ［１２］から正ＴＡＩＬ文字列「ＡＢＣ」が取得できたとする。このとき、ステップＳ４０４で前記リストに登録するセットは、（２６、「ＣＢＡ」）である。また、併合ＴＡＩＬ配列に登録した先がＴＡＩＬ［１］の場合、ステップＳ４０９で、ＢＡＳＥ［２６］＝−１に更新する。 A specific example will be described. When BASE [26] = − 12, it is assumed that the normal TAIL character string “ABC” can be acquired from TAIL [12]. At this time, the set to be registered in the list in step S404 is (26, “CBA”). Also, if the destination registered in the merged TAIL array is TAIL [1], it is updated to BASE [26] = − 1 in step S409.

その後、ＣＰＵ２０１はステップＳ４１０で、ＢＡＳＥ配列をすべて探索したかを判定し、すべての探索が終了したと判定した場合すれば（ステップＳ４１０で「ＹＥＳ」の場合）、ＴＡＩＬ併合処理を終了する。探索が終了していないと判定した場合には（ステップＳ４１０で「ＮＯ」の場合）、ステップＳ４１１に進める。 Thereafter, in step S410, the CPU 201 determines whether or not all BASE arrays have been searched. If it is determined that all searches have been completed (in the case of “YES” in step S410), the TAIL merging process ends. If it is determined that the search has not ended (in the case of “NO” in step S410), the process proceeds to step S411.

ＣＰＵ２０１は、ステップＳ４１１において、前記リストから次の要素セットを取得し、続くステップＳ４１２で、取得したセットの反転ＴＡＩＬ文字列と、前記ステップＳ４０７で取得したセットの反転ＴＡＩＬ文字列を比較する。このとき、ステップＳ４１１で取得した反転ＴＡＩＬ文字列がステップＳ４０７で取得した反転ＴＡＩＬ文字列に含まれていると判定した場合（共通接頭辞になっている場合、即ち、ステップＳ４１２で「ＹＥＳ」の場合）、ステップＳ４０９に戻り、ＢＡＳＥ配列の更新のみを実施する。ステップＳ４１２で「ＮＯ」の場合、ステップＳ４０８に戻り、併合ＴＡＩＬ配列に正ＴＡＩＬ文字列に登録して、続くステップＳ４０９でＢＡＳＥ配列の更新を実施する。 In step S411, the CPU 201 acquires the next element set from the list, and in subsequent step S412, compares the inverted TAIL character string of the acquired set with the inverted TAIL character string of the set acquired in step S407. At this time, if it is determined that the inverted TAIL character string acquired in step S411 is included in the inverted TAIL character string acquired in step S407 (if it is a common prefix, that is, “YES” in step S412) ), The process returns to step S409, and only the BASE array is updated. If “NO” in the step S 412, the process returns to the step S 408 to register the original TAL character string in the merged TAIL array, and the BASE array is updated in the subsequent step S 409.

前述の具体例で説明する。先頭要素の反転ＴＡＩＬ文字列「ＣＢＡ」である場合に、次要素のセットが（７、「ＣＢ」）である場合、ステップＳ４１２において、共通接頭辞であると判断し（「ＹＥＳ」の場合）、ＢＡＳＥ［７］＝−２とする。即ち、ＴＡＩＬ［１］で正ＴＡＩＬ文字列「ＡＢＣ」を得ることができ、ＴＡＩＬ［２］で正ＴＡＩＬ文字列「ＢＣ」を得ることができる。また、次要素のセットが（７、「ＤＡ」）である場合は、ステップＳ４１２において「ＮＯ」となり、ＴＡＩＬ［５］＝ＡＤを登録し（ステップＳ４０８）、ＢＡＳＥ［７］＝−５となる（ステップＳ４０９）。 This will be described with reference to the above specific example. If it is the inverted TAIL character string “CBA” of the first element and the set of the next element is (7, “CB”), it is determined in step S412 that it is a common prefix (in the case of “YES”) , BASE [7] = − 2. That is, TAIL [1] can obtain a positive TAIL character string “ABC”, and TAIL [2] can obtain a positive TAIL character string “BC”. If the next element set is (7, “DA”), “NO” in step S412, TAIL [5] = AD is registered (step S408), and BASE [7] = − 5. (Step S409).

このように、前記リストに追加したすべての要素セットについて、ＢＡＳＥ配列の更新処理と併合ＴＡＩＬ配列への登録処理を実施することで、非特許文献２で開示されているデータ構造を得ることができる。以上が、ＴＡＩＬ併合処理の詳細な説明である。 In this way, the data structure disclosed in Non-Patent Document 2 can be obtained by performing the BASE array update process and the merged TAIL array registration process for all element sets added to the list. . The above is the detailed description of the TAIL merge process.

図８に、図７で示したキーワード集合７００を入力にした、併合ＴＡＩＬ配列を用いたダブル配列構造の一例を示す。ダブル配列構造は、ＢＡＳＥ配列部８０１、ＣＨＥＣＫ配列部８０２、併合ＴＡＩＬ配列部８０３、ＯＵＴＰＵＴ部（索引部８０４及び値部８０５）で構成される。変換表８０６は、ダブル配列構造に登録される文字列に対して、文字を数値に置き換えたものである。通常は文字コードを使用するが、本発明では以降の説明の簡略化のために、便宜的に定義した。 FIG. 8 shows an example of a double array structure using the merged TAIL array with the keyword set 700 shown in FIG. 7 as an input. The double array structure includes a BASE array section 801, a CHECK array section 802, a merged TAIL array section 803, and an OUTPUT section (index section 804 and value section 805). The conversion table 806 is obtained by replacing characters with numerical values with respect to character strings registered in the double array structure. Normally, character codes are used, but in the present invention, they are defined for the sake of convenience in order to simplify the following description.

ＯＵＴＰＵＴ索引部８０４は、ＢＡＳＥ配列部８０１において併合ＴＡＩＬ配列部８０３へのリンクが定義されているものについて１を定義し、その他は０となるビット列で表現される。即ち、当該ＯＵＴＰＵＴ値を得るためには、ＯＵＴＰＵＴ索引部８０４においてrank₁関数を用いればよい。例えば、ＢＡＳＥ［４］＝−１であることから、ＢＡＳＥ［４］はＴＡＩＬ［１］へのリンクを意味している。このとき、ＯＵＴＰＵＴ索引部８０４において、ＢＡＳＥ配列の添字４を引数にしてrank₁(4)を求めると２であることから、ＯＵＴＰＵＴ［２］を参照し、ＯＵＴＰＵＴ値「２０」を得る（図７におけるキーワード７０２の検索結果を意味する）。 The OUTPUT index unit 804 defines 1 for those in which a link to the merged TAIL array unit 803 is defined in the BASE array unit 801, and the others are represented by a bit string of 0. That is, in order to obtain the OUTPUT value may be used rank ₁ function at OUTPUT index portion 804. For example, since BASE [4] = − 1, BASE [4] means a link to TAIL [1]. At this time, in the OUTPUT index unit 804, rank ₁ (4) is obtained with the subscript 4 of the BASE array as an argument, which is 2. Therefore, OUTPUT [2] is referenced to obtain the OUTPUT value “20” (FIG. 7). The search result of the keyword 702 in FIG.

図３の説明に戻る。ＣＰＵ２０１は、ＴＡＩＬ併合処理（ステップＳ３０４）が終了すると、処理をステップＳ３０５に進め、前記併合ＴＡＩＬ配列を用いたダブル配列構造を入力として、簡潔木生成処理を実施する。この処理の詳細は図５を参照して説明する。 Returning to the description of FIG. When the TAIL merging process (step S304) ends, the CPU 201 advances the process to step S305, and performs a concise tree generation process with the double array structure using the merged TAIL array as an input. Details of this processing will be described with reference to FIG.

ここで、図５を参照して、図３のステップＳ３０５の簡潔木生成処理の詳細について説明する。図５は、簡潔木生成処理の詳細を示すフローチャートである。簡潔木生成処理では、ダブル配列構造を幅優先探索方式で辿ることにより簡潔木を生成する。 Here, with reference to FIG. 5, the details of the concise tree generation process in step S305 of FIG. 3 will be described. FIG. 5 is a flowchart showing details of the concise tree generation process. In the succinct tree generation process, a succinct tree is generated by tracing a double array structure using a breadth-first search method.

まずＣＰＵ２０１は、ステップＳ５０１で簡潔木構造における先頭ビット列「１０」を追加する。そして、続くステップＳ５０２において、状態キューを初期化する。前記状態キューはダブル配列構造において幅優先探索を実現するために使用される。 First, in step S501, the CPU 201 adds the first bit string “10” in the simple tree structure. In step S502, the status queue is initialized. The state queue is used to implement a breadth-first search in a double array structure.

ステップＳ５０３では、状態ｓを１に初期化し、ステップＳ５０４で遷移可能集合追加処理を実施する。遷移可能集合追加処理では、状態ｓから遷移可能な文字を探索し、前記状態キューに追加する処理を実施する。この遷移可能集合追加処理の詳細については、図６を参照して説明することにする。 In step S503, the state s is initialized to 1, and a transitionable set addition process is performed in step S504. In the transitionable set addition process, a process for searching for a transitionable character from the state s and adding it to the state queue is performed. Details of this transitionable set addition processing will be described with reference to FIG.

ここで、図６を参照して、図５のステップＳ５０４（及びステップＳ５１１）の遷移可能集合追加処理の詳細について説明する。図６は遷移可能集合追加処理の詳細を示すフローチャートである。 Here, with reference to FIG. 6, the details of the transitionable set addition processing in step S504 (and step S511) in FIG. 5 will be described. FIG. 6 is a flowchart showing details of the transitionable set addition processing.

まずＣＰＵ２０１はステップＳ６０１で、遷移可能な文字を保持するためのリストを初期化する。続くステップＳ６０２からステップＳ６０３において、状態ｓから遷移可能なすべての文字を探索する。前述したように、ダブル配列構造はＣＨＥＣＫ配列の値が遷移元の状態番号を示すことが開示されているため、有効な文字集合の範囲内で、ＣＨＥＣＫ配列の値がＢＡＳＥ配列添字になるものを探索すればよい。 First, in step S601, the CPU 201 initializes a list for holding transitionable characters. In subsequent steps S602 to S603, all characters that can be changed from the state s are searched. As described above, since it is disclosed that the value of the CHECK array indicates the state number of the transition source in the double array structure, the value of the CHECK array becomes a BASE array index within the valid character set. Just search.

具体的には、例えば、有効な文字集合をアルファベット２６文字とし且つＡを１、Ｂを２・・・Ｚを２６、と定義した場合において、ＢＡＳＥ［１］＝２のとき、ＢＡＳＥ［１］＋１からＢＡＳＥ［１］＋２６の範囲、即ち、ＣＨＥＣＫ［３］からＣＨＥＣＫ［２８］の範囲において、ＣＨＥＣＫ［ｓ］＝１であるｓを満たす文字を探索する。 Specifically, for example, when a valid character set is 26 alphabets and A is defined as 1, B is defined as 2 ... Z is defined as 26, and BASE [1] = 2, BASE [1] In the range from +1 to BASE [1] +26, that is, in the range from CHECK [3] to CHECK [28], a character satisfying s where CHECK [s] = 1 is searched.

その後、ＣＰＵ２０１はステップＳ６０４において、前記条件を満たす文字Ciを前記リストに追加する。 Thereafter, in step S604, the CPU 201 adds a character Ci satisfying the condition to the list.

続くステップＳ６０５以降の処理において、前記リストに追加されたすべての文字Ciについて、図５のステップＳ５０２で初期化した前記状態キューに追加する。 In subsequent processing in step S605, all the characters Ci added to the list are added to the state queue initialized in step S502 of FIG.

ここで、ＢＡＳＥ［ｓ］が正値の場合（ステップＳ６０６で「ＹＥＳ」の場合）、ダブル配列構造において分岐を持った遷移状態を意味しているので、ステップＳ６０７に進み、遷移先の状態番号を示すＢＡＳＥ［ｓ］＋Ciとペアにして、状態（ＢＡＳＥ［ｓ］＋Ci、Ci）を前記状態キューに追加する。ＢＡＳＥ［ｓ］が正値でない場合（ステップＳ６０６で「ＮＯ」の場合）、ＢＡＳＥ［ｓ］はＴＡＩＬ配列へのリンク、或いは後述する終端状態を示しているので、ステップＳ６０８に進み、状態（ＢＡＳＥ［ｓ］、Ci）を前記状態キューに追加する。 Here, if BASE [s] is a positive value (“YES” in step S606), this means a transition state having a branch in the double array structure, so the process proceeds to step S607, and the state number of the transition destination The state (BASE [s] + Ci, Ci) is added to the state queue. If BASE [s] is not a positive value (in the case of “NO” in step S606), since BASE [s] indicates a link to the TAIL array or a termination state described later, the process proceeds to step S608 and the state (BASE [S], Ci) is added to the state queue.

以上の遷移可能集合追加処理によって、状態ｓから遷移可能なすべての文字が遷移先の状態番号と共に、前記状態キューに追加されたことになる。続くステップＳ６０９において、状態ｓからの遷移探索が完了したことを意味する終端状態（０、０）を前記状態キューに追加する。以上が、遷移可能集合追加処理の詳細な説明である。 Through the above-described transitionable set addition process, all characters that can be transitioned from the state s are added to the state queue together with the state number of the transition destination. In the subsequent step S609, a terminal state (0, 0) indicating that the transition search from the state s is completed is added to the state queue. The above is the detailed description of the transitionable set addition process.

図５の説明に戻る。ステップＳ５０４の遷移可能集合追加処理終了後、ＣＰＵ２０１は、ステップＳ５０５において、前記状態キューが空かどうかを判定する。空でないと判定した場合（ステップＳ５０５で「ＮＯ」の場合）、処理をステップＳ５０６に進め、前記状態キューから状態（ｓ、ｃ）をひとつ取得する。 Returning to the description of FIG. After completion of the transitionable set addition process in step S504, the CPU 201 determines in step S505 whether the state queue is empty. If it is determined that it is not empty (“NO” in step S505), the process proceeds to step S506, and one state (s, c) is acquired from the state queue.

続くステップＳ５０７の判定処理において、ＣＰＵ２０１が状態ｓを正値であると判定した場合（ステップＳ５０７で「ＹＥＳ」の場合）、ステップＳ５０８に進む。ここでＢＡＳＥ［ｓ］が正数の場合（ステップＳ５０８で「ＮＯ」の場合）、ステップＳ５０６で取得した状態ｓは分岐を持った遷移状態であるので、ステップＳ５１１に進み、前述した遷移可能集合追加処理を再帰的に実施して、ステップＳ５１２に進む。このように前記状態キューを使用することで、ダブル配列構造において幅優先探索を実現している。 If the CPU 201 determines that the state s is a positive value in the determination process in the subsequent step S507 (if “YES” in the step S507), the process proceeds to a step S508. Here, when BASE [s] is a positive number (in the case of “NO” in step S508), the state s acquired in step S506 is a transition state having a branch, so the process proceeds to step S511, and the transitionable set described above The addition process is recursively performed, and the process proceeds to step S512. By using the state queue in this way, a breadth-first search is realized in a double array structure.

また、ステップＳ５０８で「ＹＥＳ」とＣＰＵ２０１が判定した場合、即ち、ＢＡＳＥ［ｓ］が負数の場合は前記併合ＴＡＩＬ配列へのリンクを示す状態であるため、ステップＳ５０９に進み、新たに併合ＴＡＩＬ配列へのリンクを作成する。簡潔木構造では、状態ｓに対応するビット番号に対して併合ＴＡＩＬ配列へのリンクを設定する必要がある。従って、前述した図８におけるＯＵＴＰＵＴ索引部８０４とＯＵＴＰＵＴ値部８０５のように、rank₁関数を用いた索引部とリンク部の構造を持つように併合ＴＡＩＬ配列へのリンクを設定する。当該リンク構造の詳細については、後述する。 If the CPU 201 determines “YES” in step S508, that is, if BASE [s] is a negative number, the state indicates a link to the merged TAIL array, so the process proceeds to step S509 to newly merge the merged TAIL array. Create a link to In the simple tree structure, it is necessary to set a link to the merged TAIL array for the bit number corresponding to the state s. Therefore, as in the OUTPUT index unit 804 and the OUTPUT value unit 805 in FIG. 8 described above, the link to the merged TAIL array is set so as to have the structure of the index unit and the link unit using the rank ₁ function. Details of the link structure will be described later.

ステップＳ５０９の処理終了後、処理をステップＳ５１０に進め、状態ｓから分岐できる遷移は存在しないため（併合ＴＡＩＬ配列への分岐なし状態遷移のみ存在する）、前記状態キューに終端状態（０、０）を追加し、ステップＳ５１２に進む。 After the process of step S509 is completed, the process proceeds to step S510, and there is no transition that can branch from the state s (there is only a state transition without branching to the merged TAIL array). And proceeds to step S512.

ステップＳ５０７の判定処理で戻って「ＮＯ」と判定した場合、即ち、状態ｓが正値でない場合には、処理をステップＳ５１２に進める。 If it is determined as “NO” in the determination process in step S507, that is, if the state s is not a positive value, the process proceeds to step S512.

ステップＳ５１２では、ステップＳ５０６で取得した文字ｃをエッジ文字として簡潔木に登録する。文字ｃが０でない場合は、簡潔木のビット列に１を追加して、文字ｃをエッジ文字列に追加する。文字ｃが０の場合は（終端状態）、簡潔木のビット列に０を追加する。 In step S512, the character c acquired in step S506 is registered as an edge character in the concise tree. If the character c is not 0, 1 is added to the bit string of the concise tree, and the character c is added to the edge character string. If the character c is 0 (terminal state), 0 is added to the bit string of the concise tree.

ステップＳ５１２での文字ｃの登録処理が終了すると、処理をステップＳ５０５に進め、前記状態キューが空になるまで、前述した処理を繰り返す。 When the registration process of the character c in step S512 is completed, the process proceeds to step S505, and the above-described process is repeated until the state queue becomes empty.

前記状態キューが空になった場合（ステップＳ５０５で「ＹＥＳ」の場合）、ステップＳ５１３に進み、索引生成処理を実施する。索引生成処理は、ビット列におけるrank₁関数やselect₀関数の算出コストを削減するために、任意の大きさの区画ごとに当該ビット数を予め算出した索引を生成する処理である。 If the status queue is empty (in the case of “YES” in step S505), the process proceeds to step S513, and index generation processing is performed. The index generation process is a process of generating an index in which the number of bits is calculated in advance for each arbitrarily sized partition in order to reduce the calculation cost of the rank ₁ function and the select ₀ function in the bit string.

以上の処理により、ダブル配列構造において幅優先探索を実施し、簡潔木構造への変換が実現される。 Through the above processing, a breadth-first search is performed in the double array structure, and conversion to a simple tree structure is realized.

図９に、図８に示したダブル配列構造から生成した簡潔木構造を一例として示す。簡潔木構造は、トライ構造を表すビット配列９０１とエッジ文字列９０２、併合ＴＡＩＬ配列へのリンクを示すＴＡＩＬビット配列９０３とＴＡＩＬリンク配列９０４、ＯＵＴＰＵＴ値を保持するＯＵＴＰＵＴ配列９０５、及び併合ＴＡＩＬ配列９０６から構成される。併合ＴＡＩＬ配列９０６は、図８における前記併合ＴＡＩＬ配列８０３と同等のものである。また、エッジ文字列９０２及び併合ＴＡＩＬ配列９０６内に保持されている文字は、図８の前記変換表８０６を用いて数値に置き換えている。 FIG. 9 shows a simple tree structure generated from the double array structure shown in FIG. 8 as an example. The concise tree structure includes a bit array 901 and an edge character string 902 representing a trie structure, a TAIL bit array 903 and a TAIL link array 904 indicating a link to the merged TAIL array, an OUTPUT array 905 holding an OUTPUT value, and a merged TAIL array 906. Consists of The merged TAIL sequence 906 is equivalent to the merged TAIL sequence 803 in FIG. Further, the characters held in the edge character string 902 and the merged TAIL array 906 are replaced with numerical values using the conversion table 806 in FIG.

ビット配列９０１及びＴＡＩＬビット配列９０３はビット列で表現される。従って、前述したように、これらのデータにはrank₁関数やselect₀関数の算出コストを削減するための索引情報が付与されることになる。 The bit array 901 and the TAIL bit array 903 are represented by bit strings. Therefore, as described above, index information for reducing the calculation cost of the rank ₁ function and the select ₀ function is given to these data.

図１０に、前期簡潔木構造の検索方法のフローチャートを示す。なお、図９を合わせて用いることで、具体的に検索方法を説明する。 FIG. 10 shows a flowchart of the search method for the simple tree structure in the previous period. The search method will be specifically described with reference to FIG.

まず、ＣＰＵ２０１は、ステップＳ１００１で検索キーワードから文字ｃを取得する。文字ｃが取得できた場合（ステップＳ１００２で「ＹＥＳ」の場合）、ステップＳ１００３に進み、簡潔木構造における最初の子ノードを探索する。最初の子ノードの探索は、次の関数として非特許文献３に開示されている。 First, the CPU 201 acquires the character c from the search keyword in step S1001. If the character c is obtained (“YES” in step S1002), the process proceeds to step S1003 to search for the first child node in the concise tree structure. The search for the first child node is disclosed in Non-Patent Document 3 as the following function.

first-child (n) = select₀(rank₁(n))＋１ first-child (n) = select ₀ (rank ₁ (n)) + 1

図９を用いて具体的に説明する。検索開始位置はビット配列９０１の１番目であり、rank₁(1)＝１である。従って、select₀(1)＝２となり、first-child (1)＝３であるから、最初の子ノード番号が３とわかる。ここでビット配列９０１の３番目を見ると１であるため、子ノードが存在することがわかる。０や領域外の場合は、子ノードは存在しない。 This will be specifically described with reference to FIG. The search start position is the first bit array 901 and rank ₁ (1) = 1. Therefore, since select ₀ (1) = 2 and first-child (1) = 3, it is known that the first child node number is 3. Here, when the third bit arrangement 901 is seen, it is 1 and it can be seen that there is a child node. If 0 or out of the area, there are no child nodes.

続くステップＳ１００４で、子ノードが存在するかどうかを確認する。子ノードが存在する場合（ステップＳ１００４で「ＹＥＳ」の場合）、ステップＳ１００５に進む。ステップＳ１００５で当該子ノードに対応したエッジ文字を取得する。エッジ文字は図９におけるエッジ文字列９０２に登録されており、その算出式は以下のようになる。 In a succeeding step S1004, it is confirmed whether or not a child node exists. If there is a child node (“YES” in step S1004), the process proceeds to step S1005. In step S1005, an edge character corresponding to the child node is acquired. The edge character is registered in the edge character string 902 in FIG. 9, and the calculation formula is as follows.

e = edge[rank₁(n)−１] e = edge [rank ₁ (n) -1]

図９の例で説明する。ビット配列９０１の３番目のエッジ文字は、rank₁(3)＝２であるため、edge[1]＝１が選択される。文字コード「１」は、図８における変換表８０６において、文字「山」であることがわかる。従って、ビット配列９０１の３番目のエッジ文字は「山」となる。 This will be described with reference to the example of FIG. Since the third edge character of the bit array 901 is rank ₁ (3) = 2, edge [1] = 1 is selected. It can be seen that the character code “1” is the character “mountain” in the conversion table 806 in FIG. Therefore, the third edge character of the bit array 901 is “mountain”.

図１０のステップＳ１００５ではさらに、取得したエッジ文字ｅとステップＳ１００１で取得した検索キーワードの文字ｃが一致するかを判定する。ＣＰＵ２０１が一致したと判定した場合（ステップＳ１００５で「ＹＥＳ」の場合）、状態遷移が成立したため、ステップＳ１００１に戻り、検索キーワードにおける次の文字を取得し、その文字について、簡潔木構造上での状態遷移を確認する。 In step S1005 in FIG. 10, it is further determined whether the acquired edge character e matches the character c of the search keyword acquired in step S1001. If the CPU 201 determines that they match (in the case of “YES” in step S1005), since the state transition is established, the process returns to step S1001, acquires the next character in the search keyword, and the character is displayed on the concise tree structure. Check the state transition.

ステップＳ１００５の判定処理で、取得したエッジ文字ｅとステップＳ１００１で取得した検索キーワードの文字ｃが一致しないとＣＰＵ２０１が判定した場合（ステップＳ１００５で「ＮＯ」の場合）、ステップＳ１００６に進み、兄弟ノードの探索を実施する。簡潔木構造は幅優先探索を用いて作成されているため、兄弟ノードの探索は、当該ビット列において、隣接するビットが１であれば探索に成功する。図９のビット配列９０１の例では、子ノード番号３に隣接する子ノード番号４のビットが１であることから、子ノード番号４は子ノード番号３の兄弟ノードであるとわかる。 If the CPU 201 determines in the determination processing in step S1005 that the acquired edge character e and the character c of the search keyword acquired in step S1001 do not match (in the case of “NO” in step S1005), the process proceeds to step S1006, and the sibling node Perform a search for. Since the simple tree structure is created using the breadth-first search, the search for sibling nodes is successful if the adjacent bit is 1 in the bit string. In the example of the bit array 901 in FIG. 9, since the bit of the child node number 4 adjacent to the child node number 3 is 1, it can be understood that the child node number 4 is a sibling node of the child node number 3.

ステップＳ１００７で兄弟ノードが存在しないと判定した場合（ステップＳ１００７で「ＮＯ」の場合）、当該検索キーワードを用いた簡潔木構造において、もはや状態遷移が成立し得ないため、検索失敗として簡潔木検索処理を終了する。 If it is determined in step S1007 that no sibling node exists (in the case of “NO” in step S1007), the state transition can no longer be established in the concise tree structure using the search keyword, and the concise tree search is determined as a search failure. The process ends.

一方、ステップＳ１００７で、兄弟ノードが存在するとＣＰＵ２０１が判定した場合（ステップＳ１００７で「ＹＥＳ」の場合）には、ステップＳ１００５に戻り、当該ビット番号のエッジ文字と当該検索キーワードから取得した文字ｃとの比較を実施する。 On the other hand, if the CPU 201 determines in step S1007 that a sibling node exists (“YES” in step S1007), the process returns to step S1005, and the edge character of the bit number and the character c acquired from the search keyword are returned. Perform a comparison of

このように、当該検索キーワードを１文字ずつ簡潔木構造上で探索していくが、子ノードが存在しない場合（ステップＳ１００４で「ＮＯ」の場合）、或いは、当該検索キーワードすべての文字を探索した場合（ステップＳ１００２で「ＮＯ」の場合）、ステップＳ１００８へ進む。 In this way, the search keyword is searched on the concise tree structure character by character, but when there is no child node (in the case of “NO” in step S1004), or all characters of the search keyword are searched. In the case ("NO" in step S1002), the process proceeds to step S1008.

ステップＳ１００８以降では、ビット配列９０１上における状態遷移が終了し、遷移分岐のない併合ＴＡＩＬ配列９０６を用いた探索を実施する。ステップＳ１００８において、併合ＴＡＩＬ配列９０６へのリンクを取得する。例えば、検索キーワードが「山形県」の場合、ビット配列９０１上を、１⇒３⇒６⇒１３と遷移し bit[13]＝０であることから、ビット番号６までの状態遷移が成立する。ここで、ＴＡＩＬビット配列９０３に対してrank₁(6)を算出すると１であるので、ＴＡＩＬリンク配列９０４を参照して
併合ＴＡＩＬ配列９０６へのリンク値 tail_link [1] ＝１を得る。 In step S1008 and after, the state transition on the bit array 901 is completed, and a search using the merged TAIL array 906 having no transition branch is performed. In step S1008, a link to the merged TAIL array 906 is acquired. For example, when the search keyword is “Yamagata Prefecture”, the bit sequence 901 transitions from 1⇒3⇒6⇒13 and bit [13] = 0, so that the state transition up to bit number 6 is established. Here, when rank ₁ (6) is calculated with respect to the TAIL bit array 903, it is 1. Therefore, the link value tail_link [1] = 1 to the merged TAIL array 906 is obtained with reference to the TAIL link array 904.

続くステップＳ１００９で、前記リンク値が取得できたかどうかを判定する。取得できなかったと判定した場合（ステップＳ１００９で「ＮＯ」の場合）、併合ＴＡＩＬ配列へのリンクが存在しないため、検索失敗として簡潔木検索処理を終了する。取得できたと判定した場合（ステップＳ１００９で「ＹＥＳ」の場合）は、ステップＳ１０１０に進む。 In a succeeding step S1009, it is determined whether or not the link value has been acquired. If it is determined that acquisition has failed (in the case of “NO” in step S1009), since there is no link to the merged TAIL array, the concise tree search process ends as a search failure. If it is determined that it has been acquired (in the case of “YES” in step S1009), the process proceeds to step S1010.

ステップＳ１０１０では、前記検索キーワードの未探索部分である接尾辞部と、併合ＴＡＩＬ配列９０６に格納された文字列が一致するかどうかを調べる。一致しなかった場合（ステップＳ１０１０で「ＮＯ」の場合）、前記検索キーワードが存在しないことから、検索失敗として簡潔木検索処理を終了する。 In step S1010, it is checked whether or not the suffix part, which is an unsearched part of the search keyword, matches the character string stored in the merged TAIL array 906. If they do not match (in the case of “NO” in step S1010), the search keyword does not exist, so the concise tree search process is terminated as a search failure.

前述の例では、検索キーワード「山形県」において、「山形」までがビット配列９０１上で確認されており、残りの接尾辞「県」が併合ＴＡＩＬ配列９０６との比較対象となる。前記ＴＡＩＬリンク配列９０４から得たリンク値が１であるため、併合ＴＡＩＬ配列９０６の１番目を先頭に文字列比較を実施する。Tail[1]＝10であることから、変換表８０６を参照すると「県」であり、続くTail[2]＝０で終端となる。従って、前記検索キーワードの接尾辞「県」と併合ＴＡＩＬ配列９０６に登録された文字列「県」が一致することがわかる。 In the above example, in the search keyword “Yamagata Prefecture”, up to “Yamagata” is confirmed on the bit array 901, and the remaining suffix “prefecture” is to be compared with the merged TAIL array 906. Since the link value obtained from the TAIL link array 904 is 1, character string comparison is performed starting from the first of the merged TAIL array 906. Since Tail [1] = 10, referring to the conversion table 806, it is “prefecture”, and the next Tail [2] = 0 ends. Therefore, it can be seen that the suffix “prefecture” of the search keyword matches the character string “prefecture” registered in the merged TAIL array 906.

ステップＳ１０１０で「ＹＥＳ」と判定した場合、即ち、文字列比較が一致した場合はステップＳ１０１１に進む。ステップＳ１０１１では、ＯＵＴＰＵＴ値をＯＵＴＰＵＴ配列９０５から取得する。取得する方法は、ＴＡＩＬリンク配列９０４と同様であり、ＴＡＩＬビット配列９０３を用いて取得する。 If “YES” is determined in the step S1010, that is, if the character string comparisons match, the process proceeds to a step S1011. In step S1011, the OUTPUT value is acquired from the OUTPUT array 905. The acquisition method is the same as that of the TAIL link array 904 and is acquired using the TAIL bit array 903.

以上の処理をもって、検索キーワードの検索が成功し、簡潔木検索処理はＯＵＴＰＵＴ値を返すことができる。 With the above processing, the search keyword search is successful, and the concise tree search processing can return an OUTPUT value.

以上説明したように、本実施形態によれば、ダブル配列構造を経由させて簡潔木構造を構築することによって、トライ構造をコンパクトに実装できる。また、簡潔木構造の検索において、その計算コストが大きいビット配列を使った状態遷移を最小限に留め、検索速度の向上に寄与している。 As described above, according to the present embodiment, the trie structure can be compactly implemented by constructing the simple tree structure via the double array structure. Further, in the search of a simple tree structure, state transition using a bit array having a high calculation cost is minimized, and the search speed is improved.

以上、実施形態例を詳述したが、本発明は、例えば、システム、装置、方法、プログラムもしくは記憶媒体等としての実施態様をとることが可能であり、具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although the embodiments have been described in detail above, the present invention can take an embodiment as, for example, a system, an apparatus, a method, a program, or a storage medium, and specifically includes a plurality of devices. The present invention may be applied to a system that is configured, or may be applied to an apparatus that includes a single device.

なお、上述した各種データの構成及びその内容はこれに限定されるものではなく、用途や目的に応じて、様々な内容で構成されることは言うまでもない。 It should be noted that the configuration and contents of the various data described above are not limited to this, and it is needless to say that they are configured with various contents according to applications and purposes.

また、本発明は、システム或いは装置にプログラムを供給することにとって達成される場合にも適用できることは言うまでもない。この場合、本発明を達成するためのソフトウェアによって表されるプログラムを格納した記録媒体を該システム或いは装置に読み出すことによって、そのシステム或いは装置が、本発明の効果を享受することが可能となる。 Needless to say, the present invention can also be applied to a case where the present invention is achieved by supplying a program to a system or apparatus. In this case, by reading out a recording medium storing a program represented by software for achieving the present invention to the system or apparatus, the system or apparatus can enjoy the effects of the present invention.

さらに、本発明を達成するためのソフトウェアによって表されるプログラムをネットワーク上のサーバー、データーベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステム或いは装置が、本発明の効果を享受することが可能となる。 Furthermore, by downloading and reading a program represented by software for achieving the present invention from a server, database, etc. on a network using a communication program, the system or apparatus can enjoy the effects of the present invention. It becomes possible.

なお、上述した各実施形態及びその変形例を組み合わせた構成もすべて本発明に含まれるものである。 In addition, all the structures which combined each embodiment mentioned above and its modification are also included in this invention.

１００文字列登録検索装置
１０１キーワード集合
１０２ダブル配列構築部
１０３簡潔木構築部
１０４簡潔木データ
１０５検索結果表示部
１０６簡潔木検索部
１０７検索キーワード
２０１ＣＰＵ
２０２ＲＡＭ
２０３ＲＯＭ
２０４システムバス
２０５入力コントローラ
２０６ビデオコントローラ
２０７メモリコントローラ
２０８通信Ｉ／Ｆ（インターフェース）コントローラ
２０９キーボード
２１０ディスプレイ装置
２１１外部メモリ DESCRIPTION OF SYMBOLS 100 Character string registration search apparatus 101 Keyword set 102 Double arrangement | sequence construction part 103 Simple tree construction part 104 Simple tree data 105 Search result display part 106 Simple tree search part 107 Search keyword 201 CPU
202 RAM
203 ROM
204 System Bus 205 Input Controller 206 Video Controller 207 Memory Controller 208 Communication I / F (Interface) Controller 209 Keyboard 210 Display Device 211 External Memory

Claims

Registration means for registering a plurality of first keywords to be searched as data in a double array;
Creating means for creating data for registration in a concise tree structure for performing breadth-first search using double array data registered by the registration means;
Storage means for storing a concise tree structure comprising data created by the creating means;
Receiving means for receiving an input of a second keyword that is a search keyword for searching for the plurality of first keywords;
Search means for searching whether the second keyword received by the receiving means is included in the plurality of first keywords using the simple tree structure stored by the storage means;
With
The creation means creates data as a link to the TAIL array for the node constituting the tri-structure of the double array as a link to the TAIL array for the node of the concise tree structure,
The search means reads the first character of the second keyword received by the reception means, sets the root node of the concise tree structure as the first node, and reads the read into the child node of the first node Determine if there is a child node indicating a character,
When it is determined that there is a child node indicating the read character, the child node is set as the first node, the next character after the second keyword is read, and the changed first node is set. Repeating the process of determining whether or not there is a child node indicating the read one character;
If it is determined that there is no child node indicating the read character, the transition to the TAIL array is made by referring to the link from the child node to the TAIL array, and the remaining character strings are registered in the TAIL array. And determining whether the second keyword is included in the plurality of first keywords by determining whether the second keyword is included.

2. The information processing apparatus according to claim 1, wherein the creation unit registers data for a node that can transition from a node constituting the double array trie structure in the node of the simple tree structure.

3. The creation means, wherein a node that can transition from a predetermined node is a transition source node obtained from a value of a CHECK array that constitutes the double array for the transitionable node, as the predetermined node. The information processing apparatus described in 1.

The information according to any one of claims 1 to 3, wherein the creating means creates the TAIL array by merging suffix portions that are not common among the plurality of first keywords. Processing equipment.

An information processing method performed by an information processing apparatus that performs data search processing,
The information processing apparatus includes:
A registration step of registering a plurality of first keywords to be searched as data in a double array;
A creation step of creating data for registering in a concise tree structure for performing a breadth-first search using double array data registered in the registration step;
A storage step of storing a concise tree structure comprising data created by the creation step;
An accepting step of receiving an input of a second keyword that is a search keyword for searching for the plurality of first keywords;
A search step for searching whether the second keyword received in the receiving step is included in the plurality of first keywords using the simple tree structure stored in the storage step;
Run
The creating step creates data as a link to a TAIL array for a node constituting the tri-structure of the double array as a link to the TAIL array for a node of the concise tree structure,
The search step reads the first character of the second keyword accepted in the acceptance step , sets the root node of the concise tree structure as the first node, and reads the child node of the first node Determine if there is a child node indicating a character,
When it is determined that there is a child node indicating the read character, the child node is set as the first node, the next character after the second keyword is read, and the changed first node is set. Repeating the process of determining whether or not there is a child node indicating the read one character;
If it is determined that there is no child node indicating the read character, the transition to the TAIL array is made by referring to the link from the child node to the TAIL array, and the remaining character strings are registered in the TAIL array. And determining whether or not the second keyword is included in the plurality of first keywords by determining whether the second keyword is included.

A computer program that can be read and executed in an information processing apparatus that performs data search processing,
The information processing apparatus;
Registration means for registering a plurality of first keywords to be searched as data in a double array;
Creating means for creating data for registration in a concise tree structure for performing breadth-first search using double array data registered by the registration means;
Storage means for storing a concise tree structure comprising data created by the creating means;
Receiving means for receiving an input of a second keyword that is a search keyword for searching for the plurality of first keywords;
Search means for searching whether the second keyword received by the receiving means is included in the plurality of first keywords using the simple tree structure stored by the storage means;
To function,
The creation means creates data as a link to the TAIL array for the node constituting the tri-structure of the double array as a link to the TAIL array for the node of the concise tree structure,
The search means reads the first character of the second keyword received by the reception means, sets the root node of the concise tree structure as the first node, and reads the read into the child node of the first node Determine if there is a child node indicating a character,
When it is determined that there is a child node indicating the read character, the child node is set as the first node, the next character after the second keyword is read, and the changed first node is set. Repeating the process of determining whether or not there is a child node indicating the read one character;
If it is determined that there is no child node indicating the read character, the transition to the TAIL array is made by referring to the link from the child node to the TAIL array, and the remaining character strings are registered in the TAIL array. A computer program for functioning as a search to determine whether the second keyword is included in the plurality of first keywords by determining whether the second keyword is included.