JP4569178B2

JP4569178B2 - Classification code processor

Info

Publication number: JP4569178B2
Application number: JP2004166212A
Authority: JP
Inventors: 奨本間
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-06-03
Filing date: 2004-06-03
Publication date: 2010-10-27
Anticipated expiration: 2024-06-03
Also published as: JP2005346485A

Description

The present invention relates to a classification code processing apparatus that supports the task of assigning classification codes, such as International Patent Classification codes, to documents.

The volume of accumulated literature, such as scientific and technical literature and patent literature, grows every year, and techniques are needed for finding a desired document within this huge body of material. For patent documents, for example, the International Patent Classification (IPC) makes a desired document easier to find by allowing a search to be restricted to a technical field. Patent Document 1 discloses a technique that supports the assignment of classifications to documents by referring to a classification table that associates keywords with the fields closely related to them.

Patent Document 1: JP-A-6-75995

However, the conventional technique for supporting classification assignment requires information on the distances between keywords to be generated in advance. Keywords must therefore be chosen from among already known keywords, which makes effective classification difficult for scientific and technical literature, where new terms are coined every day, and for literature whose vocabulary is inherently diverse.

The present invention has been made in view of the above circumstances, and one of its objects is to provide a classification code processing apparatus capable of performing effective classification processing even on scientific and technical literature and similar documents in which a wide variety of words appear.

To solve the problems of the conventional example described above, the present invention is a classification code processing apparatus that is connected so as to be able to access a document database holding a plurality of documents associated with classification codes, and that supports the assignment of a classification code to a document to be processed. The apparatus includes: keyword extraction means for extracting, as keywords, at least one key character string from the document to be processed; document group extraction means for extracting, from the document database, a document group related to each of the extracted keywords; and list generation means for referring to the document group extracted for each keyword, acquiring at least some of the classification codes associated with that document group, and generating a list of classification codes for each keyword. The apparatus is characterized in that the lists of classification codes are supplied to processing that supports the assignment of a classification code to the document to be processed.

As described above, in the present embodiment, the documents stored in the document database are searched directly by keyword, without using predetermined information on the distances between keywords, and the classification codes assigned to the documents found by that search are supplied to processing that supports the assignment of a classification code to the document to be processed. Effective classification processing can therefore be performed even on scientific and technical literature and similar documents in which a wide variety of words appear.

The apparatus is further characterized in that the keyword extraction means identifies the portion of the document to be processed from which keywords are to be extracted, and extracts the character strings serving as keywords from that identified portion.

The apparatus may also hold a dictionary database that stores keywords in association with the classification codes to be assigned to documents concerning those keywords, and may include search means for searching the dictionary database for classification codes held in association with the keywords extracted by the keyword extraction means. In that case, when the search by the search means finds no classification code held in association with a keyword extracted by the keyword extraction means, the document group extraction means extracts, from the document database, a document group related to each of the extracted keywords, and the list generation means refers to the document group extracted for each keyword, acquires at least some of the classification codes associated with that document group, and generates a list of classification codes for each keyword.

Further, when the documents held in the document database are each selectively assigned at least one of a plurality of types of classification codes, the document group extraction means may extract from the document database the group of documents that have been assigned the type of classification code to be assigned to the document to be processed.

The apparatus may further include second list generation means for acquiring, on the basis of the extracted keywords, a central concept word related to the content of the document to be processed, acquiring at least some of the classification codes related to that central concept word, and generating a second list of classification codes for each keyword. The per-keyword lists of classification codes and the second list may then both be supplied to the processing that supports the assignment of a classification code to the document to be processed.

In this case, at least some of the extracted keywords may be selected according to a predetermined rule, and the selected keywords may be acquired as the central concept words related to the content of the document to be processed.

A classification code processing method according to one aspect of the present invention uses a computer connected so as to be able to access a document database holding a plurality of documents associated with classification codes, and supports the assignment of a classification code to a document to be processed. The method causes the computer to execute: a keyword extraction step of extracting, as keywords, at least one key character string from the document to be processed; a document group extraction step of extracting, from the document database, a document group related to each of the extracted keywords; and a list generation step of referring to the document group extracted for each keyword, acquiring at least some of the classification codes associated with that document group, and generating a list of classification codes for each keyword. The method is characterized in that the lists of classification codes are supplied to processing that supports the assignment of a classification code to the document to be processed.

In the keyword extraction step, the portion of the document to be processed from which keywords are to be extracted may be identified, and the character strings serving as keywords may be extracted from that identified portion.

Further, the computer may be connected so as to be able to access a dictionary database that stores keywords in association with the classification codes to be assigned to documents concerning those keywords, and the method may include a step of searching the dictionary database for classification codes held in association with the keywords extracted by the keyword extraction means. When the search finds no classification code held in association with an extracted keyword, the document group extraction step extracts, from the document database, a document group related to each of the extracted keywords, and the list generation step refers to the document group extracted for each keyword, acquires at least some of the classification codes associated with that document group, and generates a list of classification codes for each keyword.

In the document group extraction step, when the documents held in the document database are each selectively assigned at least one of a plurality of types of classification codes, the group of documents that have been assigned the type of classification code to be assigned to the document to be processed may be extracted from the document database.

The computer may further be caused to execute a second list generation step of acquiring, on the basis of the extracted keywords, a central concept word related to the content of the document to be processed, acquiring at least some of the classification codes related to that central concept word, and generating a second list of classification codes for each keyword. The per-keyword lists of classification codes and the second list may then both be supplied to the processing that supports the assignment of a classification code to the document to be processed.

A program according to another aspect of the present invention causes a computer, connected so as to be able to access a document database holding a plurality of documents associated with classification codes, to execute processing that supports the assignment of a classification code to a document to be processed. The program causes the computer to execute: a keyword extraction procedure of extracting, as keywords, at least one key character string from the document to be processed; a document group extraction procedure of extracting, from the document database, a document group related to each of the extracted keywords; and a list generation procedure of referring to the document group extracted for each keyword, acquiring at least some of the classification codes associated with that document group, and generating a list of classification codes for each keyword. The program is characterized in that the lists of classification codes are supplied to processing that supports the assignment of a classification code to the document to be processed.

An embodiment of the present invention will be described with reference to the drawings. As shown in FIG. 1, the classification code processing apparatus according to the embodiment includes a control unit 11, a memory unit 12, a storage unit 13, an operation unit 14, and a display unit 15.

The control unit 11 is implemented by a CPU (Central Processing Unit) or the like and operates according to a program stored in the memory unit 12. In this embodiment, the control unit 11 treats the document to which a classification code is to be assigned as the processing target (hereinafter called the "processing target document" for clarity) and executes the following: keyword extraction processing, which extracts keywords from the processing target document; search processing, which searches for classification codes associated in advance with the extracted keywords; classification code list generation processing, which extracts from the document database a document group related to each extracted keyword and generates lists of classification codes; processing that selects, from the extracted keywords, central concept words representing the content of the processing target document and acquires classification codes related to the selected central concept words; and processing that presents to the user at least one classification code obtained through these processes. The details of each process are described later.

The memory unit 12 includes a memory element such as a RAM (Random Access Memory) and stores the program executed by the control unit 11. The memory unit 12 also serves as a work memory that holds the various data used in the course of the processing by the control unit 11.

The storage unit 13 includes a computer-readable recording medium such as a hard disk device and stores the program executed by the control unit 11. When the control unit 11 performs processing, the program is read from the storage unit 13, stored in the memory unit 12, and executed from there. The storage unit 13 also stores a document database holding a plurality of documents to which classification codes have been assigned in advance. In addition, the storage unit 13 holds a dictionary database that stores keywords in association with the classification codes to be assigned to documents concerning those keywords.

Specifically, the document database held in the storage unit 13 holds a plurality of documents together with the classification codes associated with each document (FIG. 2). In the description of this embodiment, the documents are assumed, as a concrete example, to be patent documents and scientific and technical documents. Each document is assigned classification codes of a type that depends on when and by which organization it was issued. For a patent document, for example, these are codes from a particular edition of the International Patent Classification (e.g. the seventh edition for documents from January 2000 onward). A document may be assigned more than one type of classification code, for example both International Patent Classification codes and FI classification codes.

As shown in FIG. 3, the storage unit 13 may also hold a dictionary database that associates keywords with the classification codes to be assigned to documents concerning those keywords. The dictionary database may be updated, for example, when a new document is added to the document database. The update is executed by the control unit 11, and its details are also described later.

The operation unit 14 is a mouse, a keyboard, or the like, and outputs the content of the user's instruction operations to the control unit 11. The display unit 15 is a display or the like, and presents information to the user in accordance with instructions received from the control unit 11.

The processing executed by the control unit 11 is now described, beginning with the keyword extraction processing. The control unit 11 first accepts the input of a processing target document. The processing target document may be received from a server on a network via a communication unit (not shown) or read from a portable recording medium (not shown). Text entered from the operation unit 14 can also be handled as a processing target document in this embodiment.

The control unit 11 extracts keywords from the processing target document. A keyword here is a character string that serves as a key for searching the documents in the document database. Specifically, the control unit 11 divides the text contained in the processing target document into units of subject-predicate relation (generally units larger than a word) and stores each resulting character string in the memory unit 12. In Japanese, for example, the text can be divided at particles (case particles and binding particles) such as the so-called "te-ni-o-ha". Unneeded portions such as the particles themselves are removed from each of the resulting character strings. In addition, a stop word database listing unneeded words (stop words) is held in advance in the storage unit 13, and the listed stop words are removed from the strings held in the memory unit 12.

For example, consider the text "The interior of a teapot is divided in two by a partition extending to the vicinity of the opening, and each chamber formed by the partition is provided with one spout, improving convenience so that a single teapot can pour into two tea bowls at once." Dividing the Japanese text yields character strings such as "急須の" ("teapot" plus the particle "の") and "内部を" ("interior" plus the particle "を"). Removing the particles and similar portions gives "急須" (teapot), "内部" (interior), and so on. Next, strings such as "内部" (interior) and "２分" (dividing in two) are removed as predetermined stop words, that is, strings that can appear in any document. The result is the group of character strings "teapot", "vicinity of the opening", "partition", "each chamber", "spout", "teapot", "tea bowl", and "pouring in". The control unit 11 selects these character strings as keywords.
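The particle removal and stop word filtering in this example can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the text has already been divided into phrases, and the names `PARTICLES`, `STOP_WORDS`, `strip_particle`, and `extract_keywords`, as well as the particle list itself, are hypothetical.

```python
# Hypothetical particle and stop word lists, for illustration only.
PARTICLES = ("の", "を", "は", "が", "に", "で", "と", "へ", "も")
STOP_WORDS = {"内部", "２分"}


def strip_particle(phrase):
    """Drop one trailing particle from a phrase, if one is present."""
    for particle in PARTICLES:
        if phrase.endswith(particle) and len(phrase) > len(particle):
            return phrase[: -len(particle)]
    return phrase


def extract_keywords(phrases, stop_words=STOP_WORDS):
    """Strip particles, then discard stop words; duplicates are kept."""
    keywords = []
    for phrase in phrases:
        word = strip_particle(phrase)
        if word and word not in stop_words:
            keywords.append(word)
    return keywords


# Phrases assumed to come from an earlier segmentation step.
sample_phrases = ["急須の", "内部を", "注ぎ口を", "茶わんに"]
sample_keywords = extract_keywords(sample_phrases)
```

A real system would need a proper segmenter, since splitting on particle characters alone would also cut words that merely contain those characters.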

The control unit 11 may extract keywords from the entire processing target document, or it may select part of the processing target document as a target portion and extract keywords from that portion only. For a patent document, for example, the claims section and the sections describing embodiments and examples may be selectively identified as the target, and keywords may be extracted from those identified portions.

Although keywords are extracted here by dividing the text into units of subject-predicate relation, the text may instead be divided into words by morphological analysis, after which, for example, adjacent kanji words are concatenated and the resulting character string is extracted as a keyword. This prevents the loss of search precision that would result from splitting a term such as "紫外線照射装置" (ultraviolet irradiation device) into "紫外線" (ultraviolet), "照射" (irradiation), and "装置" (device), in which case the search results would include, for instance, documents about any kind of irradiation regardless of whether ultraviolet light is involved.
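The concatenation of adjacent kanji words can be sketched as below. The sketch assumes that a morphological analyzer has already produced the token list. The function names and the use of the CJK Unified Ideographs range U+4E00 to U+9FFF as the test for "kanji" are assumptions made for illustration.

```python
def is_kanji_word(token):
    """True if every character lies in the CJK Unified Ideographs block."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def merge_adjacent_kanji(tokens):
    """Join runs of consecutive all-kanji tokens into single strings."""
    merged = []
    previous_was_kanji = False
    for token in tokens:
        current_is_kanji = is_kanji_word(token)
        if current_is_kanji and previous_was_kanji:
            merged[-1] += token  # extend the current kanji run
        else:
            merged.append(token)
        previous_was_kanji = current_is_kanji
    return merged


# Tokens assumed to come from a morphological analyzer.
sample_tokens = ["紫外線", "照射", "装置", "を", "用いる"]
sample_merged = merge_adjacent_kanji(sample_tokens)
```

Here the three kanji tokens are rejoined into the single search key "紫外線照射装置", while the particle and the verb are left as separate tokens.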

Next, the control unit 11 searches for classification codes associated in advance with the extracted keywords. That is, the control unit 11 looks up each keyword in the dictionary database. When the dictionary database holds one or more classification codes in association with a keyword, the control unit 11 stores the list of those classification codes in the memory unit 12 in association with the keyword. If some of the classification codes are registered in the dictionary database as important codes, only those important codes may be selectively taken out to form the classification code list.

If the search shows that the keyword is not held in the dictionary database (that is, no classification code is associated with the keyword), the control unit 11 performs the processing that extracts a document group for that keyword.

In the classification code list generation processing, the control unit 11 extracts, from the document database, a document group related to each extracted keyword by full-text search or a similar method. For each keyword, the documents containing that keyword are thus retrieved from the document database. The control unit 11 then takes out at least some of the classification codes assigned to each retrieved document. For example, when a document has been assigned several classification codes, the first-listed one is taken. When documents are selectively assigned at least one of a plurality of types of classification codes, the control unit 11 takes out at least some of a retrieved document's codes if they include the type to be assigned to the processing target document, and takes no classification code from that document otherwise.

In this way, the group of documents that have been assigned the type of classification code to be assigned to the processing target document is extracted from the document database, and the classification codes are taken from that group of documents.

As a concrete case, suppose the document database stores both documents assigned codes from the sixth edition of the International Patent Classification and documents assigned codes from the seventh edition, and the processing target document is to be assigned a seventh edition code. Then, from the documents retrieved from the document database using the keywords extracted from the processing target document as keys, those assigned seventh edition codes are selected, and the classification codes assigned to the selected documents are taken out.
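The edition filtering described here might be sketched as follows. The record layout (a `text` field plus a `codes` mapping keyed by scheme labels such as `"IPC7"`), the function name, and the sample documents are all hypothetical. The pairing of texts with codes is made-up sample data, not taken from the patent, and substring matching stands in for the full-text search.

```python
def codes_of_scheme(keyword, documents, scheme):
    """Collect the first-listed code of the requested scheme from each
    document whose text contains the keyword; documents lacking that
    scheme contribute nothing."""
    result = []
    for doc in documents:
        if keyword not in doc["text"]:
            continue
        codes = doc["codes"].get(scheme)
        if codes:
            result.append(codes[0])
    return result


# Made-up sample records for illustration only.
DOCUMENTS = [
    {"text": "teapot with two spouts",
     "codes": {"IPC7": ["A47G 19/22", "A47J 31/06"]}},
    {"text": "teapot strainer",
     "codes": {"IPC6": ["A47J 31/06"]}},
    {"text": "tea leaves for a teapot",
     "codes": {"IPC7": ["A23F 3/06"], "IPC6": ["A23F 3/06"]}},
    {"text": "lunch box",
     "codes": {"IPC7": ["A45C 11/20"]}},
]

ipc7_codes = codes_of_scheme("teapot", DOCUMENTS, "IPC7")
```

The second document matches the keyword but carries only sixth edition codes, so it is skipped when seventh edition codes are requested.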

Through this processing the control unit 11 obtains a list of classification codes for each keyword and stores each list in the memory unit 12 in association with its keyword. Specifically, when results are obtained for each of "teapot", "vicinity of the opening", "partition", "each chamber", "spout", "teapot", "tea bowl", and "pouring in" as in the example above, a list containing seventh edition International Patent Classification codes such as "A23F 3/06" and "A45C 11/20" is stored for "teapot", and a list containing "A11C 11/02" and others is stored for "partition", as shown in FIG. 4. The lists in the figure are drawn with duplicate classification codes included, but the codes may be sorted in a predetermined order such as alphabetical order and duplicate lines then removed (processing equivalent to the uniq command of UNIX (registered trademark)). This produces a list free of duplicates.
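The sort-then-deduplicate step can be sketched in a few lines. The function name `sort_and_uniq` is hypothetical, and `itertools.groupby` is used here only because, like `uniq`, it collapses adjacent equal items and therefore relies on the input being sorted first.

```python
from itertools import groupby


def sort_and_uniq(codes):
    """Sort the codes, then collapse runs of adjacent duplicates."""
    return [code for code, _ in groupby(sorted(codes))]


# Sample list with a duplicate, for illustration only.
raw_codes = ["A47G 19/22", "A23F 3/06", "A47G 19/22", "A45C 11/20"]
unique_codes = sort_and_uniq(raw_codes)
```

In Python, `sorted(set(codes))` would give the same result. The two-step form is shown because it mirrors the sort-and-uniq procedure the text describes.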

The control unit 11 then registers in the dictionary database each keyword obtained here together with its list of related classification codes. Lists of classification codes for keywords not yet registered in the dictionary database can thus be added to it. In other words, the dictionary database can be regarded as a cache of the results of the classification code list generation processing performed by the control unit 11.
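The dictionary lookup, the fallback to a document search, and the registration of the result can be sketched together as a simple cache. In this sketch a plain `dict` stands in for the dictionary database and a list of records for the document database. The function names, record layout, and sample data are assumptions, and substring matching stands in for the full-text search.

```python
def search_codes(keyword, documents):
    """Stand-in for the full-text search: take the first-listed code of
    every document whose text contains the keyword."""
    return [doc["codes"][0] for doc in documents
            if keyword in doc["text"] and doc["codes"]]


def codes_for_keyword(keyword, dictionary, documents):
    """Return cached codes when the dictionary has them; otherwise search
    the documents and register the result in the dictionary."""
    cached = dictionary.get(keyword)
    if cached:
        return cached
    codes = search_codes(keyword, documents)
    dictionary[keyword] = codes
    return codes


# Made-up sample data for illustration only.
sample_dictionary = {"spout": ["A47G 19/22"]}
sample_documents = [
    {"text": "teapot with a spout", "codes": ["A47J 31/06"]},
    {"text": "tea leaves for a teapot", "codes": ["A23F 3/06"]},
]

hit = codes_for_keyword("spout", sample_dictionary, sample_documents)
miss = codes_for_keyword("teapot", sample_dictionary, sample_documents)
```

The first call is answered from the dictionary without touching the documents. The second call falls through to the search and leaves a new "teapot" entry behind for later lookups.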

At least one of the classification codes in a list may be registered as an important code in a way that distinguishes it from the other classification codes. Specifically, before duplicates are removed, the frequency of occurrence of each classification code is counted, and codes whose frequency is at or above a predetermined threshold are registered as important codes. The control unit 11 may further restrict the important codes to those classification codes that, in addition to meeting the frequency threshold, are specific to the keyword. A code is specific here if it is not obtained when the document database is searched with any other keyword as the key, or is obtained only with a frequency below a predetermined value.
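One way to read the frequency and specificity conditions is sketched below. The function name, the parameter names `min_freq` and `other_limit`, and the sample data are hypothetical. The patent does not fix the threshold values, so they are left as parameters.

```python
from collections import Counter


def important_codes(results, keyword, min_freq, other_limit=None):
    """Codes occurring at least min_freq times for `keyword`. When
    other_limit is given, keep only codes that occur fewer than
    other_limit times in every other keyword's results."""
    own = Counter(results[keyword])
    frequent = [code for code, n in own.items() if n >= min_freq]
    if other_limit is None:
        return sorted(frequent)
    others = [Counter(codes) for kw, codes in results.items() if kw != keyword]
    return sorted(code for code in frequent
                  if all(counts[code] < other_limit for counts in others))


# Made-up search results (duplicates kept), for illustration only.
sample_results = {
    "teapot": ["A47G 19/22", "A47G 19/22", "A47G 19/22",
               "A23F 3/06", "A23F 3/06", "A45C 11/20"],
    "partition": ["A23F 3/06", "A23F 3/06", "A11C 11/02"],
}

by_frequency = important_codes(sample_results, "teapot", min_freq=2)
by_specificity = important_codes(sample_results, "teapot", min_freq=2, other_limit=2)
```

With these sample values, "A23F 3/06" passes the frequency test for "teapot" but is dropped by the specificity test because it is equally frequent for "partition".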

Referring to the results of the search processing and the classification code list generation processing, the control unit 11 picks out the classification codes that are shared across keywords. Specifically, the control unit 11 generates a check table in which, for each keyword, the columns of the classification codes obtained for that keyword are checked (FIG. 5). In FIG. 5 the checked cells are marked with the symbol "○". Having generated the check table, the control unit 11 counts the checks for each classification code. If, for example, five keywords are related to a particular classification code, the column for that code contains five checks in the check table.

The control unit 11 then selects the classification code or codes with the most checks as the first assignment candidate classification code. In the example above, where results exist for each of "teapot", "vicinity of the opening", "partition", "each chamber", "spout", "teapot", "tea bowl", and "pouring in", the seventh edition International Patent Classification code "A47G 19/22" is obtained as the code corresponding to five of these seven keywords (duplicates excluded).
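The check table and the selection of the most-checked code can be sketched as follows. The table is represented here as a mapping from each classification code to the set of keywords that produced it. That representation, the function names, and the three-keyword sample are assumptions made for illustration.

```python
def build_check_table(codes_by_keyword):
    """Map each classification code to the set of keywords that yielded it."""
    table = {}
    for keyword, codes in codes_by_keyword.items():
        for code in set(codes):
            table.setdefault(code, set()).add(keyword)
    return table


def most_checked(table):
    """Return every code tied for the largest number of checks."""
    if not table:
        return []
    best = max(len(keywords) for keywords in table.values())
    return sorted(code for code, keywords in table.items()
                  if len(keywords) == best)


# Made-up per-keyword results for illustration only.
sample_lists = {
    "teapot": ["A47G 19/22", "A23F 3/06"],
    "spout": ["A47G 19/22", "A47J 31/06"],
    "tea bowl": ["A47G 19/22", "A47J 31/06"],
}

check_table = build_check_table(sample_lists)
first_candidates = most_checked(check_table)
```

Because the keywords are dictionary keys, a keyword that was extracted twice (such as "teapot" in the example) is counted only once, matching the "duplicates excluded" count in the text.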

Although the classification code with the most checks is selected here, the first assignment candidate classification code may instead be selected under other conditions, for example that the number of checks be at least a predetermined proportion of the number of keywords, or that the number of checks be at least a predetermined threshold.

例えば、ここでは５０％以上の比率以上のチェックがあるものとの条件では７個のキーワードの５０％、つまり「３．５」個以上（ただし、個数は必ず整数であるので「４」個以上と言換えることができる）の分類符号として、４つのキーワードに該当する「A47J 31/06」が得られる。 For example, under the condition that the checks amount to a ratio of 50% or more, the required count is 50% of the seven keywords, that is, “3.5” or more (since the count is always an integer, this can be restated as “4” or more). In addition to the above, “A47J 31/06”, which corresponds to four keywords, is then obtained as a classification code meeting this condition.
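The rounding in the ratio condition above can be checked with a short sketch. The helper names are hypothetical. The counts for "A47G 19/22" (5) and "A47J 31/06" (4) come from the description, while the entry for "B65D 47/06" is an invented filler value.

```python
def required_checks(num_keywords, percent):
    """Smallest whole number of checks that is at least percent% of the keywords.

    Integer arithmetic is used so that the rounding up is exact.
    """
    return (num_keywords * percent + 99) // 100


def codes_meeting_ratio(check_counts, num_keywords, percent):
    """Return the codes whose check count satisfies the ratio condition."""
    need = required_checks(num_keywords, percent)
    return sorted(code for code, n in check_counts.items() if n >= need)


# "B65D 47/06" and its count are assumptions made for illustration.
check_counts = {"A47G 19/22": 5, "A47J 31/06": 4, "B65D 47/06": 1}

need = required_checks(7, 50)                        # 3.5 rounded up
selected = codes_meeting_ratio(check_counts, 7, 50)
```

Here `need` is 4, so both the five-check and the four-check codes satisfy the 50% condition.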

さらに制御部１１は、処理対象ドキュメントから抽出したキーワードのうちの一部を、中心概念語として選択し、この中心概念語に基づいて第２の付与候補分類符号を選択する。ここで中心概念語とは、処理対象ドキュメントの内容を表すキーワードであり、キーワード抽出処理によって抽出されたキーワードのうちから所定の条件に基づいて選択されたものである。 Further, the control unit 11 selects a part of the keywords extracted from the processing target document as a central concept word, and selects a second grant candidate classification code based on the central concept word. Here, the central concept word is a keyword representing the content of the document to be processed, and is selected based on a predetermined condition from the keywords extracted by the keyword extraction process.

ここで所定の条件は、例えば次のようなものである。すなわち制御部１１は、検索処理と分類符号リスト生成処理とによって得られた分類符号のリストを記憶部１２から読出して、当該リストに含まれる分類符号の集合の論理和を生成する。これにより、抽出したキーワードに関連して取り出された分類符号の群が得られる。制御部１１は、キーワードごとに、当該キーワードに関連して取り出した分類符号の個数が、上記生成した論理和に含まれる分類符号の個数に対して占める割合（分類符号分布割合）を演算する。例えば論理和に含まれる分類符号の個数が１００個で、キーワード「急須」に関連して得られた分類符号の個数が３０個である場合、その分類符号分布割合は３０％ということになる。 Here, the predetermined condition is, for example, as follows. That is, the control unit 11 reads out a list of classification codes obtained by the search process and the classification code list generation process from the storage unit 12, and generates a logical sum of a set of classification codes included in the list. Thereby, a group of classification codes extracted in association with the extracted keyword is obtained. For each keyword, the control unit 11 calculates a ratio (classification code distribution ratio) that the number of classification codes extracted in association with the keyword occupies with respect to the number of classification codes included in the generated logical sum. For example, when the number of classification codes included in the logical sum is 100 and the number of classification codes obtained in association with the keyword “teapot” is 30, the classification code distribution ratio is 30%.
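As a rough sketch of the classification code distribution ratio described above: the function name and the placeholder codes `C0` to `C9` are hypothetical, and counting each code once per keyword (distinct codes) is an assumption, since the description does not state how duplicates inside a list are counted.

```python
def distribution_ratios(codes_by_keyword):
    """Ratio of each keyword's distinct codes to the union of all codes."""
    union = set().union(*codes_by_keyword.values())
    if not union:
        return {}
    return {
        keyword: len(set(codes)) / len(union)
        for keyword, codes in codes_by_keyword.items()
    }


# Placeholder codes, not real IPC symbols: the union holds 10 codes,
# of which "teapot" contributes 3.
codes_by_keyword = {
    "teapot": ["C0", "C1", "C2"],
    "spout": [f"C{i}" for i in range(2, 10)],
}

ratios = distribution_ratios(codes_by_keyword)
```

This mirrors the 30-out-of-100 example on a smaller scale: "teapot" gets a ratio of 3/10.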

制御部１１は、さらに抽出したキーワードの群（重複を排除する前の群）から、各キーワードの出現頻度を調べる。具体的に上記の例のように、「急須」、「開口部近傍」、「仕切」、「各室」、「注ぎ口」、「急須」、「茶わん」、「注ぎ入れ」の各々の結果があった場合、「急須」について頻度が「２」、その他のキーワードについては頻度は「１」となる。そこで制御部１１は、この出現頻度の順に、注目キーワードを選択し、当該注目キーワードに関して演算された分類符号分布割合が所定の割合しきい値（例えば４０％）を下回っているか否かを調べる。ここで所定の割合しきい値を下回っている場合は、当該注目キーワードを中心概念語として選択する。また、割合しきい値を下回っていない場合は、次の注目キーワードを選択する。なお、キーワード群のうちでの出現頻度が所定頻度しきい値より小さいキーワードについては、注目キーワードとして選択しないようにしてもよい。 The control unit 11 further examines the appearance frequency of each keyword in the extracted keyword group (the group before duplicates are removed). Specifically, when the extraction yields “teapot”, “near the opening”, “partition”, “each room”, “spout”, “teapot”, “tea bowl”, and “pour”, as in the above example, the frequency is “2” for “teapot” and “1” for each of the other keywords. The control unit 11 then selects a keyword of interest in order of appearance frequency, and checks whether the classification code distribution ratio calculated for that keyword is below a predetermined ratio threshold (for example, 40%). If it is below the threshold, the keyword of interest is selected as the central concept word. If it is not, the next keyword of interest is selected. Note that keywords whose appearance frequency in the keyword group is smaller than a predetermined frequency threshold may be excluded from selection as the keyword of interest.

ここでは、頻度が最大となっている「急須」を注目キーワードとして、当該「急須」に関する分類符号分布割合が３０％となっているので、この３０％が所定の割合しきい値を下回っていれば、この「急須」を中心概念語として選択する。 Here, “teapot”, which has the highest frequency, is taken as the keyword of interest. Its classification code distribution ratio is 30%, so if this 30% is below the predetermined ratio threshold, “teapot” is selected as the central concept word.

なお制御部１１は、条件に合致するものがなければ、中心概念語の選択をせず、中心概念語に基づく第２の付与候補分類符号を選択しないこととしてもよい。また、中心概念語は、必ずしも一つでなくてもよい。 Note that if no keyword satisfies the condition, the control unit 11 may refrain from selecting a central concept word and from selecting the second assignment candidate classification code based on it. Moreover, there need not be exactly one central concept word.
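The selection rule for the central concept word can be sketched as below. This is one possible reading of the rule, not a definitive implementation: the function name and parameters are hypothetical, only a single word is returned (the description allows more than one), and keywords with no known distribution ratio are treated as not qualifying, which is an assumption.

```python
from collections import Counter


def select_central_concept_word(keywords, ratios, ratio_threshold=0.4,
                                min_frequency=1):
    """Pick the most frequent keyword whose distribution ratio is below the threshold.

    keywords: extracted keywords, duplicates kept.
    ratios:   classification code distribution ratio per keyword.
    Returns None when no keyword satisfies the condition.
    """
    frequency = Counter(keywords)
    # most_common() yields keywords in descending frequency order.
    for keyword, count in frequency.most_common():
        if count < min_frequency:
            break  # all remaining keywords are rarer still
        # Assumption: a keyword with no known ratio never qualifies.
        if ratios.get(keyword, 1.0) < ratio_threshold:
            return keyword
    return None


keywords = ["teapot", "near the opening", "partition", "each room",
            "spout", "teapot", "tea bowl", "pour"]

central = select_central_concept_word(keywords, {"teapot": 0.30})
none_found = select_central_concept_word(keywords, {"teapot": 0.50})
```

With the 30% ratio and the example 40% threshold, "teapot" is chosen. With a 50% ratio, nothing qualifies and the function returns `None`, matching the case where no second candidate is selected.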

制御部１１は、中心概念語について得られた分類符号リストから、当該分類符号リスト上で出現する各分類符号の出現頻度を演算する。そして、出現頻度が、所定の出現割合しきい値を上回っている分類符号を第２の付与候補分類符号として選択する。例えば中心概念語「急須」について得られた分類符号リストに、８５個の分類符号があり、そのうちの３５個（約４１％）が「A47G 19/14」であり、２４個（２８％）が「A47J 31/06」であり、…といった場合、出現割合しきい値を３５％と定めておくと、「A47G 19/14」が第２付与候補分類符号として選択される。 The control unit 11 calculates, from the classification code list obtained for the central concept word, the appearance frequency of each classification code appearing in that list. It then selects, as the second assignment candidate classification code, each classification code whose appearance frequency exceeds a predetermined appearance ratio threshold. For example, suppose the classification code list obtained for the central concept word “teapot” contains 85 classification codes, of which 35 (about 41%) are “A47G 19/14”, 24 (28%) are “A47J 31/06”, and so on. If the appearance ratio threshold is set to 35%, “A47G 19/14” is selected as the second grant candidate classification code.
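The appearance-ratio test above can be sketched as follows. The function name is hypothetical and the sample list is invented and smaller than the one in the description. It contains 20 codes, with the `OTHER n` entries standing in for unrelated codes.

```python
from collections import Counter


def second_candidates(code_list, appearance_threshold):
    """Return the codes whose share of the list exceeds the threshold."""
    if not code_list:
        return []
    counts = Counter(code_list)
    total = len(code_list)
    return sorted(
        code for code, n in counts.items() if n / total > appearance_threshold
    )


# Invented sample list for the central concept word: 8 + 5 + 7 = 20 codes.
code_list = (
    ["A47G 19/14"] * 8                      # 8 / 20 = 40%
    + ["A47J 31/06"] * 5                    # 5 / 20 = 25%
    + [f"OTHER {i}" for i in range(7)]      # 1 / 20 = 5% each
)

selected = second_candidates(code_list, 0.35)
```

Only "A47G 19/14" exceeds the 35% threshold in this sample.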

以上の流れを整理すると、制御部１１は図６に示すような動作を行っていることになる。まず、制御部１１は処理対象ドキュメントからキーワードを抽出する（Ｓ１）。そして抽出したキーワードごとに、各キーワードに関連するドキュメントデータベース内のドキュメントに予め付与されている分類符号のリストを取得する（Ｓ２）。この処理Ｓ２の検索においては、予めキャッシュされて辞書データベースに関連する分類符号リストが登録されているキーワードについては、当該辞書データベースを参照して分類符号リストを取得し、キャッシュされていないものについては、当該キーワードをキーとしてドキュメントデータベースを検索し、検索の結果、得られたドキュメント群から当該ドキュメント群に含まれるドキュメントに付与されている分類符号の少なくとも一部を取り出して、分類符号リストを取得する。 To summarize the above flow, the control unit 11 operates as shown in FIG. 6. First, the control unit 11 extracts keywords from the processing target document (S1). Then, for each extracted keyword, it acquires a list of the classification codes assigned in advance to the documents in the document database that are related to that keyword (S2). In the search of process S2, for a keyword whose related classification code list has already been cached, that is, registered in the dictionary database, the classification code list is acquired by referring to the dictionary database. For a keyword that is not cached, the document database is searched using the keyword as a key, and the classification code list is acquired by extracting at least a part of the classification codes assigned to the documents in the resulting document group.
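The cache-then-search lookup of process S2 can be sketched as follows. The function name and the data layout are hypothetical. The document database is modeled as a list of dicts, and a plain substring match stands in for whatever full-text search the real system uses.

```python
def codes_for_keyword(keyword, dictionary_db, document_db):
    """Return the classification code list for a keyword (process S2).

    dictionary_db: cached keyword -> code list mapping.
    document_db:   list of {"text": ..., "codes": [...]} records.
    """
    # Cached keyword: read the list straight from the dictionary database.
    if keyword in dictionary_db:
        return list(dictionary_db[keyword])
    # Uncached keyword: search the documents and collect their codes.
    codes = []
    for document in document_db:
        if keyword in document["text"]:  # stand-in for full-text search
            codes.extend(document["codes"])
    return codes


# Invented sample data for illustration.
document_db = [
    {"text": "a teapot with a partition near the opening",
     "codes": ["A47G 19/14"]},
    {"text": "a spout for pouring tea",
     "codes": ["A47G 19/22", "A47J 31/06"]},
]
dictionary_db = {"teapot": ["A47G 19/14", "A47J 31/06"]}

cached = codes_for_keyword("teapot", dictionary_db, document_db)
searched = codes_for_keyword("spout", dictionary_db, document_db)
```

In this sketch "teapot" is answered from the dictionary without touching the documents, while "spout" falls through to the search.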

制御部１１は、ここで取得した分類符号のリストに含まれる各分類符号について、いくつのキーワードに関連して取得されているかを調べる。そして例えば全キーワードに対して所定の割合の個数のキーワードに共通して関連づけられている分類符号を取り出し、第１の付与候補分類符号（本発明のキーワード毎の分類符号のリストに相当する）として選択する（Ｓ３）。 For each classification code included in the lists acquired here, the control unit 11 checks how many keywords it was acquired for. It then extracts, for example, the classification codes that are associated in common with a predetermined proportion of all the keywords, and selects them as the first assignment candidate classification code (corresponding to the list of classification codes for each keyword of the present invention) (S3).

次に制御部１１は、処理Ｓ１で抽出したキーワードのうちから中心概念語を選択する（Ｓ４）。中心概念語の選択は、抽出したキーワード群中の各キーワードの出現頻度と、各キーワードに関連して取り出された分類符号の分布（分類符号リストの論理和中で占める、各キーワードに関連して取り出された分類符号の割合など）とに基づく所定のルールに従って行われる。 Next, the control unit 11 selects a central concept word from the keywords extracted in process S1 (S4). The central concept word is selected according to a predetermined rule based on the appearance frequency of each keyword in the extracted keyword group and on the distribution of the classification codes extracted for each keyword (for example, the proportion of the logical sum of the classification code lists that is occupied by the classification codes extracted for that keyword).

そして中心概念語に関する分類符号リストから、その出現頻度に基づいて、例えば所定の出現割合しきい値を越える出現頻度の分類符号を第２付与候補分類符号（本発明の第２のリストに相当する）として選択する（Ｓ５）。なお、この処理Ｓ５において、上記出現割合しきい値を越える分類符号がなければ、第２の付与候補分類符号は必ずしも選択する必要はない。 Then, from the classification code list for the central concept word, the control unit 11 selects, based on appearance frequency, for example each classification code whose appearance frequency exceeds a predetermined appearance ratio threshold as the second grant candidate classification code (corresponding to the second list of the present invention) (S5). If no classification code exceeds the appearance ratio threshold in process S5, the second grant candidate classification code need not be selected.

制御部１１はさらに、処理Ｓ３で選択した第１の付与候補分類符号と、処理Ｓ５で選択した第２の付与候補分類符号との論理和を演算して、その結果を付与候補分類符号として記憶部１２に格納し（Ｓ６）、処理を終了する。制御部１１は、こうして取得した付与候補分類符号を表示部１５に表示して利用者に提示する。 The control unit 11 further calculates the logical sum of the first grant candidate classification code selected in process S3 and the second grant candidate classification code selected in process S5, stores the result in the storage unit 12 as the grant candidate classification codes (S6), and ends the process. The control unit 11 displays the grant candidate classification codes thus acquired on the display unit 15 and presents them to the user.
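The logical sum of process S6 amounts to a set union of the two candidate groups. A minimal sketch follows, assuming the candidates are held as lists. The function name is hypothetical, and keeping first-seen order is a presentation choice made here, not something the description requires.

```python
def merge_candidates(first_candidates, second_candidates):
    """Union of both candidate groups, without duplicates, in first-seen order."""
    return list(dict.fromkeys([*first_candidates, *second_candidates]))


grant_candidates = merge_candidates(
    ["A47G 19/22"],
    ["A47G 19/14", "A47G 19/22"],
)
```

A code that appears in both groups, such as "A47G 19/22" here, is listed only once in the result.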

また、制御部１１は、処理対象ドキュメントに、当該付与候補分類符号の少なくとも一部を含む分類符号情報を関連付けて、ストレージ部１３のドキュメントデータベースに格納することとしてもよい。具体的には、表示された付与候補分類符号から、実際に付与する分類符号として利用者が選択操作した分類符号を処理対象ドキュメントに関連付けてストレージ部１３のドキュメントデータベースに格納する。この際、利用者が付与候補分類符号に含まれない別の分類符号を入力した場合は、当該入力された分類符号を併せて処理対象ドキュメントに関連付けし、ストレージ部１３のドキュメントデータベースに格納してもよい。 In addition, the control unit 11 may associate classification code information including at least a part of the grant candidate classification codes with the processing target document and store it in the document database of the storage unit 13. Specifically, the classification code that the user selects from the displayed grant candidate classification codes as the code to be actually assigned is associated with the processing target document and stored in the document database of the storage unit 13. At this time, if the user inputs another classification code that is not included in the grant candidate classification codes, the input classification code may also be associated with the processing target document and stored in the document database of the storage unit 13.

次に、制御部１１による辞書データベースの更新処理について説明する。制御部１１は、ドキュメントデータベースに新たなドキュメントが追加されると、辞書データベースの更新処理として、次の処理を行うようにしてもよい。 Next, dictionary database update processing by the control unit 11 will be described. When a new document is added to the document database, the control unit 11 may perform the following process as the dictionary database update process.

すなわち、制御部１１は辞書データベースに既に登録されているキーワードのリストを生成する。そして、当該リストに含まれる各キーワードを順次キーとして選択し、選択したキーを用いてドキュメントデータベースを検索する。そしてドキュメントデータベースから、当該選択したキーを含むドキュメントを抽出する。ここで抽出したドキュメントに関連づけられている分類符号の少なくとも一部を取り出して、分類符号のリストを生成し、キーとして選択したキーワードと、当該生成した分類符号のリストとを関連付けて、辞書データベースに格納する。なお、当該キーワードに関連付けて格納されている既存の情報に上書きする。これにより辞書データベースが更新される。 That is, the control unit 11 generates a list of the keywords already registered in the dictionary database. It then selects each keyword in the list in turn as a key, searches the document database using the selected key, and extracts from the document database the documents that contain the selected key. It takes out at least a part of the classification codes associated with the extracted documents to generate a classification code list, and stores the keyword selected as the key in the dictionary database in association with the generated list. The existing information stored in association with that keyword is overwritten. The dictionary database is thereby updated.
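The update procedure above can be sketched as follows. The function name and the data layout are hypothetical. The dictionary database is modeled as a plain dict, the document database as a list of dicts, and a substring match stands in for the document search.

```python
def refresh_dictionary(dictionary_db, document_db):
    """Rebuild the code list of every registered keyword from the documents."""
    for keyword in list(dictionary_db):  # keywords already registered
        codes = []
        for document in document_db:
            if keyword in document["text"]:  # stand-in for the search
                codes.extend(document["codes"])
        dictionary_db[keyword] = codes  # overwrite the existing entry
    return dictionary_db


# Invented sample data: the second document was added after the
# dictionary entry for "teapot" had been cached.
dictionary_db = {"teapot": ["A47G 19/14"]}
document_db = [
    {"text": "a teapot with a partition", "codes": ["A47G 19/14"]},
    {"text": "a teapot with a strainer", "codes": ["A47J 31/06"]},
]

refresh_dictionary(dictionary_db, document_db)
```

After the refresh, the entry for "teapot" also carries the code of the newly added document.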

本実施の形態の分類符号処理装置は、上述のように動作するので、例えば多様な術語を含む科学技術文献であっても、既存のドキュメントに対する全文検索を利用し、かつ処理対象ドキュメントを特徴づける中心概念語を用いて分類処理を遂行することができる。 Since the classification code processing apparatus of the present embodiment operates as described above, for example, even a scientific and technical literature including various technical terms uses a full-text search for an existing document and characterizes the processing target document. The classification process can be performed using the central concept word.

なお、ここまでの説明では、例えば処理対象ドキュメントに国際特許分類の第７版の分類符号を付与する場合、ドキュメントデータベースに格納されているドキュメントのうち、国際特許分類第７版の分類符号が付与されているものを選択的に抽出し、当該抽出したドキュメントに付与されている分類符号を参照して、付与候補分類符号を選択する例について述べたが、これに代えて次のようにしてもよい。すなわち、各種類の分類符号間に対応表（いわゆるコンコーダンス表）を設定しておき、国際特許分類第６版の分類符号が付与されているドキュメントについては、この表を参照しながら当該付与されている分類符号を第７版の分類符号に変換する。そして、当該変換後の分類符号を参照して、付与候補分類符号を選択するようにしてもよい。 The description so far has covered an example in which, when a classification code of the seventh edition of the International Patent Classification is to be assigned to the processing target document, the documents to which seventh-edition classification codes have been assigned are selectively extracted from the documents stored in the document database, and the grant candidate classification codes are selected by referring to the classification codes assigned to the extracted documents. Instead, the following approach may be used. A correspondence table (a so-called concordance table) is set up between the kinds of classification codes, and for a document to which a classification code of the sixth edition of the International Patent Classification has been assigned, the assigned classification code is converted into a seventh-edition classification code by referring to this table. The grant candidate classification codes may then be selected by referring to the converted classification codes.
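The concordance-based conversion can be sketched as below. Everything here is hypothetical: the function name, the table layout, and above all the codes `X00X 1/00` and `X00X 1/02`, which are made-up placeholders and not entries from any real IPC concordance. Returning `None` for a code with no mapping is also an assumption, since the description does not say how unmapped codes are handled.

```python
def to_target_edition(code, edition, target_edition, concordance):
    """Convert a code to the target edition using a concordance table.

    concordance maps (from_edition, to_edition) to {old_code: new_code}.
    Returns None when no mapping is known (an assumption of this sketch).
    """
    if edition == target_edition:
        return code
    return concordance.get((edition, target_edition), {}).get(code)


# Placeholder concordance entry, NOT a real IPC mapping.
concordance = {(6, 7): {"X00X 1/00": "X00X 1/02"}}

converted = to_target_edition("X00X 1/00", 6, 7, concordance)
unchanged = to_target_edition("A47G 19/22", 7, 7, concordance)
unmapped = to_target_edition("X00X 9/99", 6, 7, concordance)
```

Sixth-edition documents can then contribute their converted codes to the candidate selection alongside documents already classified under the seventh edition.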

本発明の実施の形態に係る分類符号処理装置の構成例を表すブロック図である。 It is a block diagram showing a configuration example of the classification code processing apparatus according to the embodiment of the present invention.
ドキュメントデータベースの内容例を表す説明図である。 It is an explanatory diagram showing an example of the contents of the document database.
辞書データベースの内容例を表す説明図である。 It is an explanatory diagram showing an example of the contents of the dictionary database.
キーワード毎の分類符号のリストの例を表す説明図である。 It is an explanatory diagram showing an example of the list of classification codes for each keyword.
チェックテーブルの一例を表す説明図である。 It is an explanatory diagram showing an example of the check table.
本発明の実施の形態に係る分類符号処理装置の処理例を表すフローチャート図である。 It is a flowchart showing a processing example of the classification code processing apparatus according to the embodiment of the present invention.

Explanation of symbols

１１制御部、１２記憶部、１３ストレージ部、１４操作部、１５表示部。
11 control unit, 12 storage unit, 13 storage unit, 14 operation unit, 15 display unit.

Claims

A classification code processing apparatus that is connected to a document database that holds a plurality of documents associated with a classification code and supports the assignment of a classification code to a document to be processed,
A keyword extracting means for extracting, as a keyword, at least one character string serving as a key from a document to be processed;
A document group extracting means for extracting a document group related to each of the extracted keywords from the document database;
A list generation unit that refers to the document group extracted for each keyword, obtains at least a part of the classification code associated with the document group, and generates a list of classification codes for each keyword;
Means for counting the number of related keywords for each classification code from the list of classification codes for each keyword, and selecting a first grant candidate classification code based on this number;
Means for obtaining a logical sum of the classification codes included in the lists of classification codes for each keyword, calculating, as a distribution ratio of classification codes for each keyword, the ratio of the number of classification codes included in the list of classification codes for that keyword to the number of classification codes included in the logical sum, obtaining a keyword to serve as a central concept word from among the keywords extracted by the keyword extracting means based on the calculated distribution ratio of classification codes for each keyword, and selecting a second grant candidate classification code from the list of classification codes for each keyword based on the appearance frequency of the classification codes related to the central concept word;
the apparatus including the above means,
wherein the apparatus performs a process of supporting the assignment of a classification code to the processing target document based on the first grant candidate classification code and the second grant candidate classification code.

The classification code processing device according to claim 1,
wherein the keyword extracting means identifies a portion of the processing target document from which keywords are to be extracted,
and extracts a character string as a keyword from the identified portion of the document.

The classification code processing device according to claim 1 or 2,
A dictionary database that stores the keywords and the classification codes to be assigned to the documents related to the keywords in association with each other;
Search means for searching for a classification code held in the dictionary database in association with the keyword extracted by the keyword extraction means;
wherein, when the search by the search means finds no classification code held in association with a keyword extracted by the keyword extracting means, the document group extracting means extracts from the document database the document group related to each extracted keyword, and the means for selecting the first grant candidate classification code selects the first grant candidate classification code from among the classification codes for each keyword that are obtained by referring to the document group extracted for each keyword and acquiring at least a part of the classification codes associated with that document group.

The classification code processing device according to any one of claims 1 to 3,
wherein, when at least one of a plurality of kinds of classification codes is selectively assigned to the documents held in the document database, the document group extracting means extracts from the document database a group of documents to which the kind of classification code to be assigned to the processing target document has been assigned.

A program causing a computer, which is connected to a document database that holds a plurality of documents associated with a classification code, to execute processing for supporting the assignment of a classification code to a document to be processed,
A keyword extracting means for extracting, as a keyword, at least one character string serving as a key from a document to be processed;
A document group extracting means for extracting a document group related to each of the extracted keywords from the document database;
A list generation unit that refers to the document group extracted for each keyword, obtains at least a part of the classification code associated with the document group, and generates a list of classification codes for each keyword;
Means for counting the number of related keywords for each classification code from the list of classification codes for each keyword, and selecting a first grant candidate classification code based on this number;
Means for obtaining a logical sum of the classification codes included in the lists of classification codes for each keyword, calculating, as a distribution ratio of classification codes for each keyword, the ratio of the number of classification codes included in the list of classification codes for that keyword to the number of classification codes included in the logical sum, obtaining a keyword to serve as a central concept word from among the keywords extracted by the keyword extracting means based on the calculated distribution ratio of classification codes for each keyword, and selecting a second grant candidate classification code from the list of classification codes for each keyword based on the appearance frequency of the classification codes related to the central concept word;
the program causing the computer to function as the above means,
wherein the program causes the computer to perform processing that supports the assignment of a classification code to the processing target document based on the first grant candidate classification code and the second grant candidate classification code.