JP5282880B2

JP5282880B2 - Search system, search method, and program

Info

Publication number: JP5282880B2
Application number: JP2008315158A
Authority: JP
Inventors: 展久白石; 威有熊
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-12-11
Filing date: 2008-12-11
Publication date: 2013-09-04
Anticipated expiration: 2028-12-11
Also published as: JP2010140209A

Abstract

PROBLEM TO BE SOLVED: To display documents that give an overview of the field of an input search keyword at the top of the search results of a search system.

SOLUTION: A search system includes: a similarity determination unit that divides each document of a document group to be searched into its components and calculates the similarity between components; a group analysis unit that groups similar components based on the similarity calculated by the similarity determination unit, calculates, for each group, the deviation between the component at the center of the group and the other components in the group, and, for every group, quantifies the degree of similarity of the components forming the group based on the deviation; and a diversion degree calculation unit that calculates, for each component contained in a document to be searched, a score based on the product of the deviation and the degree of similarity, and totals the document's score by accumulating the scores of the components it contains. The documents to be searched are output in an order based on the document scores totaled by the diversion degree calculation unit.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a search system that searches document data based on an input search keyword and, more particularly, to a search system, a search method, and a program that use a document relevance score based on citation relationships between documents.

In recent years, many researchers have studied search systems and search engines that use an information processing apparatus to search for documents based on keywords input by a user or the like.

One example of a search method used in such systems retrieves desired documents from a document group stored in a database as follows: the document group is indexed and stored in advance; a search keyword is accepted from the user; documents containing the keyword are listed; a score based on the keyword content rate is calculated for each listed document; and the documents are output in an order based on the calculated scores.

Patent Document 1 is a related patent document. It describes a document management apparatus comprising: information acquisition means for acquiring information from an information source; information storage means for storing the information acquired from the information source; information source acquisition means for acquiring identification information of the information source; information source embedding means for attaching and embedding the identification information of the information source when information is added to a document; information source extraction means for extracting the identification information embedded by the information source embedding means; and systematization means for generating systematization information that organizes documents based on the identification information of the information sources.

The document management apparatus described in Patent Document 1 manages both a document itself and the identification information of its information sources. By embedding the identification information of a document's information sources when the document is added to the apparatus, it organizes and manages documents based on the citation relationships between them and makes it possible to search for related documents using the degree to which information sources are shared.

JP 2007-072723 A

However, the search system described above cannot display, at the top of the search results, a document that gives an overview of the field of the input search keyword.

Specifically, the document management apparatus described in Patent Document 1 can only trace a lineage of documents backward. That is, it cannot find and display, for each field covered by the documents, a document containing the figures, tables, paragraphs, and the like that concisely explain the field and are frequently used (inserted, diverted, cited, or reused).

Here, a document that gives an overview of a field is a document containing many document blocks that are frequently diverted (reused) across the documents describing that field. A document block is a component of a document, such as a figure, a paragraph, or a page. For example, in a presentation file such as a PowerPoint (registered trademark) file, the document blocks are the individual slide pages and the figures within each slide page. In a business document such as a Word document, they are the individual sentences and the paragraphs made up of multiple sentences. In such documents, figures and slide pages that are often used to explain the field are document blocks with a high diversion degree. The diversion degree is the frequency with which document blocks similar in content to a given document block appear in documents. In other words, the more documents that contain a document block similar in content to a given document block, the higher the diversion degree of that document block is defined to be.

An object of the present invention is to provide a search system that displays a document giving an overview of the field of an input search keyword at the top of the search results.

Another object of the present invention is to provide a search method that displays a document giving an overview of the field of an input search keyword at the top of the search results.

Yet another object of the present invention is to provide a program capable of displaying a document giving an overview of the field of an input search keyword at the top of the search results.

The search system of the present invention comprises: a similarity determination unit that divides each document of a document group to be searched into its components and calculates the similarity between components; a component group analysis unit that, based on the similarity calculated by the similarity determination unit, groups similar components, calculates the deviation between the component at the center of each group and the other components in the group, and, for every group, quantifies the degree of similarity of the components forming the group based on the deviation; and a diversion degree calculation unit that, for each component contained in a document to be searched, calculates a component score based on the product of the deviation and the degree of similarity, and totals a document score by accumulating the scores of the contained components.

According to the present invention, it is possible to provide a search system that displays a document giving an overview of the field of an input search keyword at the top of the search results.

According to the present invention, it is also possible to provide a search method that displays a document giving an overview of the field of an input search keyword at the top of the search results.

Furthermore, according to the present invention, it is possible to provide a program capable of displaying a document giving an overview of the field of an input search keyword at the top of the search results.

A search system according to the present invention comprises: a similarity determination unit that divides each document of a document group to be searched into its components and calculates the similarity between the components extracted from the documents; a group analysis unit that, based on the similarity between components calculated by the similarity determination unit, groups similar components, calculates the deviation between the central component of each group and each of the other components in that group, and quantifies, for each group, the degree of similarity of the components forming the group with reference to the calculated deviations; and a diversion degree calculation unit that, for each of the one or more components contained in each document to be searched, calculates a score based on the product of the corresponding deviation and degree of similarity, and totals a score for each document by accumulating the scores of the components it contains. In response to a search request, the search system refers to the per-document scores totaled by the diversion degree calculation unit and outputs the documents of the search target document group in the order called for by the request.
A search method performed by an information processing system according to the present invention comprises: dividing each document of a document group to be searched into its components and calculating the similarity between the components extracted from the documents; grouping similar components based on the calculated similarity between components and calculating the deviation between the central component of each group and each of the other components in that group; quantifying, for each group, the degree of similarity of the components forming the group with reference to the calculated deviations; calculating, for each of the one or more components contained in each document to be searched, a score based on the product of the corresponding deviation and degree of similarity; totaling a score for each document by accumulating the scores of the components it contains and recording the totals in a storage unit; and, in response to a search request accepted from an input unit, referring to the totaled per-document scores and outputting, from an output unit, the matching documents of the search target document group in the order called for by the request.
A program according to the present invention causes a control unit to function as: similarity determination means for dividing each document of a document group to be searched into its components and calculating the similarity between the components extracted from the documents; group analysis means for grouping similar components based on the similarity between components calculated by the similarity determination means, calculating the deviation between the central component of each group and each of the other components in that group, and quantifying, for each group, the degree of similarity of the components forming the group with reference to the calculated deviations; and diversion degree calculation means for calculating, for each of the one or more components contained in each document to be searched, a score based on the product of the corresponding deviation and degree of similarity, and totaling a score for each document by accumulating the scores of the components it contains. Based on the per-document scores totaled by the diversion degree calculation means, the search system can output the documents of the search target document group in an order responsive to a search request.

FIG. 1 is a functional block diagram showing the configuration of a search system 10 according to the first embodiment.
As shown in FIG. 1, the search system 10 includes a similarity determination unit 20, a group analysis unit 30, and a diversion degree calculation unit 40. It is connected to a database 50 that stores the search target document group and is configured so that a document search unit (not shown) can retrieve desired documents.

The similarity determination unit 20 divides each document in the search target document group into its components and calculates the similarity between components.

The group analysis unit 30 collects similar components into one or more groups based on the similarity calculated by the similarity determination unit 20. For each group, it calculates the deviation between the component at the center of the group and the other components in the group. For every group, it also quantifies the degree of similarity of the components forming the group based on the deviation.

For each component contained in a search target document, the diversion degree calculation unit 40 calculates a component score based on the product of the deviation and the degree of similarity. It then cumulatively adds the scores of the components contained in the document to obtain the document's score.
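The scoring rule of the diversion degree calculation unit 40 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: all function and variable names are hypothetical, the "deviation" term is assumed to be each block's similarity to the central block of its group, the "degree of similarity" is assumed to be one value per group, and blocks that belong to no group are assumed to contribute nothing.

```python
from typing import Dict, List


def block_score(center_similarity: float, group_degree: float) -> float:
    """Score of one component: deviation term times its group's degree of similarity."""
    return center_similarity * group_degree


def document_score(
    blocks: List[str],
    center_similarity: Dict[str, float],  # block ID -> similarity to its group's center (assumed deviation term)
    block_group: Dict[str, str],          # block ID -> group ID
    group_degree: Dict[str, float],       # group ID -> degree of similarity of the group
) -> float:
    """Document score: cumulative sum of the scores of the blocks the document contains."""
    total = 0.0
    for block in blocks:
        group = block_group.get(block)
        if group is None:
            continue  # assumption: ungrouped blocks add nothing to the score
        total += block_score(center_similarity[block], group_degree[group])
    return total
```

With hypothetical values, a document whose blocks are `["A1", "A2"]`, where only `A2` is grouped (similarity 1.0 to its center, group degree 0.8), would receive a score of about 0.8.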

Based on the document scores totaled by the diversion degree calculation unit 40, the search system 10 orders the search target documents stored in the database 50 and outputs the documents with high scores as the search results.

The search system 10 may calculate the document block deviations and degrees of similarity in advance, as preprocessing for search, when a search target document is registered in the database 50, or it may calculate them at search time. The search system 10 may also store the deviation and degree-of-similarity values as table information, either in separate tables or in a single table.

The document search unit may also perform a first search process on the documents stored in the database 50 based on the keyword content rate of the input keyword, and then perform a second search process that ranks the documents extracted by the first search process based on the document scores totaled by the diversion degree calculation unit 40.
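The two-stage search can be sketched as below. The names are hypothetical, and the definition of the keyword content rate (occurrences of the keyword divided by the number of whitespace-separated tokens) is an assumption made for illustration; the text does not define it.

```python
from typing import Dict, List


def keyword_content_rate(text: str, keyword: str) -> float:
    """Assumed definition: fraction of whitespace-separated tokens equal to the keyword."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


def two_stage_search(
    docs: Dict[str, str],          # document ID -> text
    doc_scores: Dict[str, float],  # document ID -> precomputed diversion-based score
    keyword: str,
    min_rate: float = 0.0,
) -> List[str]:
    # First search process: keep documents whose keyword content rate exceeds min_rate.
    hits = [d for d, text in docs.items() if keyword_content_rate(text, keyword) > min_rate]
    # Second search process: order the hits by document score, highest first.
    return sorted(hits, key=lambda d: doc_scores.get(d, 0.0), reverse=True)
```

In this sketch a document that never mentions the keyword is dropped in the first stage even if its diversion-based score is high.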

An example of document blocks is described with reference to FIG. 2, which is a diagram explaining document blocks. FIG. 2 shows a document file A consisting of three pages; its first page contains a title, paragraphs A to C, a table, and a graph. The document blocks of document file A are its components: pages, titles, paragraphs, figures, and tables.

FIG. 3 is a functional block diagram showing the configuration of the search system 10 of the first embodiment in detail.
Referring to FIG. 3, the configuration consists of the search system 10, on which a user or the like executes searches, and the database 50, which stores the search target document group.
The search system 10 comprises a search interface 110, a search unit 111, a document index creation unit 112, a document index 113, a document block analysis unit 120, a document block diversion degree calculation unit 140, a document block similarity table 131, an intra-group document block deviation table 132, and a document block group kurtosis table 133.

The document block analysis unit 120 includes a document block similarity determination unit 121, a document block group analysis unit 122, a document block similarity graph 123, and a document block similarity distribution 124.

The document block similarity determination unit 121 computes the similarity between the document blocks of the documents in the search target document group 151 and stores the resulting document block similarity information in the document block similarity table 131.

Based on the similarity information for each document block stored in the document block similarity table 131, the document block group analysis unit 122 builds the document block similarity graph 123 and the document block similarity distribution 124. Based on these, it stores in the intra-group document block deviation table 132 the similarity information (deviation information) between each document block and the central document block of the document block group to which it belongs, and stores in the document block group kurtosis table 133 the kurtosis information of the similarity of each document block group.

When a user issues a document search request through the search interface 110, the document block diversion degree calculation unit 140 calculates the document block diversion degree (score) of each document block of each document in the search results.

Each table and data model is described below by way of example.
FIG. 4 is an explanatory diagram showing the document block similarity table.
As shown in FIG. 4, the document block similarity table 131 is a table of document block similarity information indicating the similarity between document blocks. It includes a document block ID field 301, a target document block ID field 302, and a similarity field 303.
The document block ID field 301 and the target document block ID field 302 record the IDs assigned to the document blocks contained in each document of the search target document group stored in the database 50. The similarity field 303 records the similarity of the block in the target document block ID field 302 to the block in the document block ID field 301. This similarity is calculated by the document block similarity determination unit 121. In the illustrated example, the similarity of document block B2 to document block A2 is 0.8; that is, document block B2 is quite similar to document block A2. Document block B4, on the other hand, has a similarity of 0.2 to document block A2 and is not very similar.
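One way to hold the rows of the document block similarity table 131 in memory is sketched below. The class, field, and function names are hypothetical, and treating the similarity as symmetric (so that a lookup works in either direction) is an assumption; the two rows are the ones given in the FIG. 4 example.

```python
from typing import List, NamedTuple, Optional


class SimilarityRow(NamedTuple):
    block_id: str          # document block ID field 301
    target_block_id: str   # target document block ID field 302
    similarity: float      # similarity field 303


TABLE: List[SimilarityRow] = [
    SimilarityRow("A2", "B2", 0.8),
    SimilarityRow("A2", "B4", 0.2),
]


def lookup(table: List[SimilarityRow], a: str, b: str) -> Optional[float]:
    """Return the recorded similarity of the pair (a, b), or None if no row exists."""
    for row in table:
        # Assumption: similarity is symmetric, so the pair is matched in either order.
        if {row.block_id, row.target_block_id} == {a, b}:
            return row.similarity
    return None
```

For example, `lookup(TABLE, "B2", "A2")` finds the A2/B2 row, while a pair with no recorded row yields `None`.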

FIG. 5 is an explanatory diagram showing the document block similarity graph.
As shown in FIG. 5, the document block similarity graph 123 is a model that represents the document blocks as a graph based on the similarity between them.

FIG. 6 is an explanatory diagram showing the document block similarity distribution.
As shown in FIG. 6, the document block similarity distribution 124 is a distribution graph for each document block group whose axes are the content distance from the central document block of the group and the number of document blocks in the group. In each group's distribution, the more document blocks there are with a high similarity to the central document block, the greater the degree of similarity. In other words, the more document blocks there are at a short content distance from the central document block, the higher the kurtosis of the group's distribution. Here, kurtosis is the degree to which the contents of the document blocks forming a group of similar document blocks resemble one another: the more document blocks there are at a short distance from (that is, similar to) the central document block, the higher the document block group kurtosis of the group is defined to be.

The document block similarity graph 123 and the document block similarity distribution 124 are data models; the data actually stored need only be numerical values. For example, as a data model the document block similarity distribution 124 can be generated by plotting, for each document block group, the content distance from the central document block on the horizontal axis and the number of document blocks at that content distance on the vertical axis. In an implementation, it may instead be stored in a memory table or the like as a correspondence table of document block group ID, content distance from the central document block, and number of corresponding document blocks. The models illustrated in FIGS. 5 and 6 are described in detail later.

FIG. 7 is an explanatory diagram showing the intra-group document block deviation table.
As shown in FIG. 7, the intra-group document block deviation table 132 includes a document block ID field 601, a target document block group ID field 602, and a content distance field 603 holding the content distance from the central document block. For each document block listed in the document block ID field 601, the content distance field 603 records the content distance between that block and the central document block of the document block group to which it belongs.

In the illustrated table, the central document block of the document block group is document block A2, so the similarity of document block A2 to the central document block is 1 (identical); likewise, the similarity of document block B2 is 0.8 and that of document block C5 is 0.7. The content distance recorded in the content distance field 603 is the value obtained from the relation "similarity = 1 - content distance". Although this embodiment uses "similarity = 1 - content distance", any other function under which similarity decreases monotonically as content distance increases may be substituted.
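The relation "similarity = 1 - content distance" can be applied to the example values as in the sketch below. The function name, the group ID `"G1"`, and the rounding to three decimals (used only to hide floating-point noise) are assumptions for illustration.

```python
def content_distance(similarity: float) -> float:
    """Invert 'similarity = 1 - content distance' for a similarity in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


# Similarity of each block to the central block A2, as given in the example.
center_similarity = {"A2": 1.0, "B2": 0.8, "C5": 0.7}

# Rows shaped like the deviation table: (block ID 601, group ID 602, content distance 603).
# "G1" is a hypothetical group ID.
deviation_rows = [
    (block, "G1", round(content_distance(sim), 3))
    for block, sim in center_similarity.items()
]
# deviation_rows == [("A2", "G1", 0.0), ("B2", "G1", 0.2), ("C5", "G1", 0.3)]
```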

FIG. 8 is an explanatory diagram showing the document block group kurtosis table.
As shown in FIG. 8, the document block group kurtosis table 133 includes a document block group ID field 701 and a document block group kurtosis field 702. It is a table of the kurtosis of each document block group. The document block group kurtosis quantifies how similar the contents of the document blocks forming a group are. The central document block of the group is identified, and the more document blocks there are at a small distance from it, the higher the document block group kurtosis.
The group kurtosis of each document block group may be calculated by computing the average content distance of the group's document blocks from the central document block and taking group kurtosis = 1 - average content distance.
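The group kurtosis formula can be sketched as follows. Note that this "kurtosis" is the quantity defined in the text (one minus the mean content distance), not the statistical fourth moment. The function name is hypothetical, and whether the central block itself (distance 0) is included in the average is not stated in the text; this sketch includes whatever distances it is given.

```python
from statistics import mean
from typing import Sequence


def group_kurtosis(content_distances: Sequence[float]) -> float:
    """Group kurtosis = 1 - average content distance from the central document block."""
    if not content_distances:
        raise ValueError("a group must contain at least one document block")
    return 1.0 - mean(content_distances)


# Content distances of A2, B2, and C5 from the central block A2 (center included by assumption).
kurtosis_g1 = group_kurtosis([0.0, 0.2, 0.3])
```

A tighter group, whose blocks all lie close to the center, yields a value nearer to 1.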

Next, the overall operation of this embodiment is described in detail with reference to FIG. 3 and the flowcharts of FIGS. 9 to 11.

First, the document block similarity determination process, in which the document block similarity determination unit 121 computes the similarity between the document blocks of the documents in the search target document group 151 and stores the document block similarity information in the document block similarity table 131, is described with reference to FIG. 3 and the flowchart of FIG. 9.

FIG. 9 is a flowchart showing the document block similarity determination process.
The document block similarity determination unit 121 divides each document in the search target document group 151 into document blocks and assigns IDs to the documents and to the document blocks (S901).

In the order of the assigned IDs, the document block similarity determination unit 121 compares each document block contained in a document with each document block of the other documents to compute their similarity (S902). The documents compared against are those for which similarity has not yet been calculated; no comparison is made with documents whose similarity has already been computed.

The document block similarity determination unit 121 stores the computed similarity between document blocks in the similarity field 303 of the document block similarity table 131 (S903).

The document block similarity determination unit 121 determines whether the document block similarity evaluation has been completed for all documents in the search target document group 151 (S904), and repeats steps S902 to S903 until it has.
When the evaluation has been completed for all documents in the search target document group 151, the document block similarity determination process ends.
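Steps S901 to S904 can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how a document is split into blocks or which similarity measure is used, so splitting on blank lines and the word-set Jaccard measure are both assumptions, as are all function names.

```python
from itertools import combinations


def split_into_blocks(doc_id, text):
    """S901: split a document into blocks and assign IDs such as 'A1', 'A2'.

    Splitting on blank lines is an assumption made for this sketch.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {f"{doc_id}{i}": p for i, p in enumerate(paragraphs, start=1)}


def jaccard(a, b):
    """Word-set Jaccard similarity in [0, 1] (an illustrative measure only)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_similarity_table(documents):
    """S902 to S904: compare the blocks of every pair of documents exactly once."""
    blocks = {doc_id: split_into_blocks(doc_id, text)
              for doc_id, text in documents.items()}
    table = {}  # (block_id, block_id) -> similarity, standing in for table 131
    # combinations() visits each unordered document pair once, so a document
    # that has already been compared is never compared again.
    for doc_a, doc_b in combinations(sorted(blocks), 2):
        for id_a, text_a in blocks[doc_a].items():
            for id_b, text_b in blocks[doc_b].items():
                table[(id_a, id_b)] = jaccard(text_a, text_b)  # S903
    return table
```

Because each pair is stored once, a lookup has to try both key orders, or the table can be mirrored if symmetric access is more convenient.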

Next, the document block group analysis process will be described with reference to FIGS. 3, 5, and 6 and the flowchart of FIG. 10. In this process, the document block group analysis unit 122 constructs the document block similarity graph 123 and the document block similarity distribution 124 from the similarity information of each document block stored in the document block similarity table 131. Based on the constructed information, it stores in the in-group document block deviation table 132 the similarity between each document block and the central document block of the document block group to which it belongs, and stores in the document block group kurtosis table 133 the kurtosis of the similarity of each document block group.

FIG. 10 is a flowchart showing the document block group analysis process.
The document block group analysis unit 122 constructs the document block similarity graph 123 from the similarity information between document blocks stored in the document block similarity table 131 (S1001).

In the document block similarity graph 123, the document block group analysis unit 122 groups document blocks whose similarity is equal to or higher than a certain threshold into a document block group (S1002).

Grouping of document blocks will now be described using the document block similarity graph 123 illustrated in FIG. 5. The grouping threshold is assumed to be 0.5.
In the example of FIG. 5, the similarity between document block A2 (501), the second block (paragraph B) of document file A, and document block B2 (502), the second block of document B, is 0.8. Likewise, the similarity between document block A2 (501) and document block C5 (503), the fifth block of document C, is 0.7; the similarity between document block B2 (502) and document block C5 (503) is 0.6; and the similarity between document block A2 (501) and document block B4 (504), the fourth block of document B, is 0.2.
Since the grouping threshold is 0.5, document blocks A2 (501), B2 (502), and C5 (503) are grouped as document block group G1 (510). Document block B4, on the other hand, is not grouped.
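The FIG. 5 example can be reproduced with a short sketch. The text only says that blocks at or above the threshold are grouped; treating groups as connected components of the thresholded similarity graph (via union-find) is an assumption of this sketch, and `group_blocks` is a hypothetical name.

```python
def group_blocks(similarities, threshold=0.5):
    """Group blocks linked by an edge whose similarity is >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), sim in similarities.items():
        root_a, root_b = find(a), find(b)  # also registers both blocks
        if sim >= threshold:
            parent[root_a] = root_b

    groups = {}
    for block in parent:
        groups.setdefault(find(block), set()).add(block)
    # A block with no sufficiently similar partner stays ungrouped.
    return [members for members in groups.values() if len(members) > 1]


# Similarities read from the FIG. 5 example in the text.
FIG5_SIMILARITIES = {
    ("A2", "B2"): 0.8,
    ("A2", "C5"): 0.7,
    ("B2", "C5"): 0.6,
    ("A2", "B4"): 0.2,
}

print(group_blocks(FIG5_SIMILARITIES))
```

With the threshold of 0.5, the only group returned contains A2, B2, and C5, and B4 is left out, matching the description above. A stricter rule (for example, requiring every pair in a group to exceed the threshold) would be an equally plausible reading.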

The document block group analysis unit 122 identifies the central document block of each document block group by processing such as a centroid calculation (S1003). In the document block similarity graph 123 illustrated in FIG. 5, document block A2 (501) is identified as the central document block from the centroid of document block group G1 (510), and flag processing or the like is performed as necessary.
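One simple way to realize step S1003 is to pick the block with the largest total similarity to the other members of its group (a medoid). The text only says "processing such as a centroid calculation", so this choice, and the name `central_block`, are assumptions of the sketch.

```python
def central_block(group, similarities):
    """Return the block with the largest total similarity to the rest of its group."""

    def sim(a, b):
        # The similarity table stores each pair once, in either order.
        return similarities.get((a, b), similarities.get((b, a), 0.0))

    def total(block):
        return sum(sim(block, other) for other in group if other != block)

    # sorted() makes the result deterministic when totals tie.
    return max(sorted(group), key=total)


# Similarities read from the FIG. 5 example in the text.
FIG5_SIMILARITIES = {
    ("A2", "B2"): 0.8,
    ("A2", "C5"): 0.7,
    ("B2", "C5"): 0.6,
}

print(central_block({"A2", "B2", "C5"}, FIG5_SIMILARITIES))
```

For group G1 the totals are 1.5 for A2, 1.4 for B2, and 1.3 for C5, so A2 is selected, which agrees with the central document block named in the FIG. 5 example.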

For each document block group, the document block group analysis unit 122 calculates the content distance of every other document block in the group from the central document block, and stores it in the content distance field 603 of the in-group document block deviation table 132 (S1004).

The document block group analysis unit 122 computes the kurtosis of each document block group from the similarity distribution 124 between the group's central document block and each of its document blocks, and stores it as document block group kurtosis information in the document block group kurtosis field 702 of the document block group kurtosis table 133 (S1005). In the document block similarity distribution 124 illustrated in FIG. 6, the kurtosis of document block group G1 (601) is 0.8 and that of document block group G2 (602) is 0.5; that is, G1 (601) has the higher kurtosis.
When the kurtosis has been computed for all document block groups in S1005, the document block group analysis process ends.
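The text does not give the formula behind the kurtosis values 0.8 and 0.5, which look like a normalized measure. As a stand-in, the sketch below uses the standard population kurtosis (fourth central moment divided by the squared variance) over the similarities between the central block and the other blocks of a group; it illustrates the idea that a sharply peaked distribution scores higher than a flat one and is not claimed to reproduce the values in FIG. 6.

```python
def kurtosis(values):
    """Population kurtosis: fourth central moment divided by squared variance.

    `values` is assumed to hold the similarities between a group's central
    block and each other block in the group.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    # A group whose similarities are all identical has no spread; return 0.0.
    return m4 / (m2 ** 2) if m2 > 0 else 0.0


peaked = [0.5, 0.5, 0.5, 0.5, 0.1, 0.9]  # most blocks equally close to the center
flat = [0.1, 0.3, 0.5, 0.7, 0.9]         # similarities spread evenly
print(kurtosis(peaked) > kurtosis(flat))
```

Any real implementation would need a rule for mapping this statistic onto the scale used in the kurtosis table 133; that mapping is not described in the text.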

Next, the document block diversion degree calculation process will be described with reference to FIGS. 3, 5, and 6 and the flowchart of FIG. 11. In this process, when a user or the like issues a document search request to the search system 10, the document block diversion degree calculation unit 140 calculates the diversion degree of each document block of each search result document.

FIG. 11 is a flowchart showing the document block diversion degree calculation process.
When a user issues a document search to the search system 10, the search unit 111 searches the document index 113 using the search keyword entered through the search interface 110 and, based on the keyword content rate, extracts the documents whose score is equal to or higher than a predetermined value (S1101).

For each document block contained in the search result documents extracted by the search unit 111, the document block diversion degree calculation unit 140 refers to the in-group document block deviation table 132 and the document block group kurtosis table 133, and computes the block's score from the product of the document block group kurtosis and the similarity information (deviation information) with respect to the central document block (S1102).
The score of a document block can be made more accurate by using, in addition to the document block group kurtosis and the document block similarity, the number of document blocks in the document block group that contains the block. For example, in S1102, the score of each document block is obtained by multiplying the total number of document blocks in its document block group by the product of the group's kurtosis and the block's similarity to the central document block, with an appropriate coefficient applied to each of the two factors. Calculated this way, the score of a document block that belongs to a group containing many document blocks can be high even when the group's kurtosis is low.

A value may also be added to the score of a document block when the block is the central document block of the group to which it belongs. Likewise, a value may be added to the score when the kurtosis of the group to which the block belongs is equal to or greater than a predetermined value.
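The block scoring of S1102 and its optional variants can be sketched in one function. The base score (kurtosis times similarity to the central block) follows the text; the coefficient and bonus parameters, their defaults, and the function name are hypothetical, since the text only speaks of "appropriate coefficients" and of "adding" to the score.

```python
def block_score(kurtosis, similarity_to_center, group_size=None,
                size_coeff=1.0, product_coeff=1.0,
                is_center=False, center_bonus=0.0,
                kurtosis_threshold=None, kurtosis_bonus=0.0):
    """Score one document block (S1102) with the optional variants."""
    # Base score: group kurtosis x similarity (deviation) to the central block.
    score = kurtosis * similarity_to_center
    # Variant: weight by the number of blocks in the group, with one
    # coefficient on the group size and one on the base product.
    if group_size is not None:
        score = (size_coeff * group_size) * (product_coeff * score)
    # Variant: add a bonus when the block is its group's central block.
    if is_center:
        score += center_bonus
    # Variant: add a bonus when the group's kurtosis reaches a set value.
    if kurtosis_threshold is not None and kurtosis >= kurtosis_threshold:
        score += kurtosis_bonus
    return score


# A block in a low-kurtosis but large group can outscore one in a sharper, small group.
print(block_score(0.5, 0.5, group_size=8), block_score(0.8, 0.7, group_size=2))
```

Keeping every variant behind an optional parameter leaves the default call equal to the plain product described for S1102.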

The document block diversion degree calculation unit 140 accumulates the document block scores calculated in S1102 for each search result document to obtain that document's document score (S1103).

The document block diversion degree calculation unit 140 determines whether the document score calculation of S1103 has been completed for all documents in the search result document group (S1104), and repeats steps S1102 to S1103 until it has.

When the document score calculation has been completed for all documents in the search result document group, the document block diversion degree calculation process ends.

The search system 10 then orders the retrieved documents by the calculated document scores and presents them to the user.
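The accumulation of S1103 and the final ordering can be sketched as follows. The data layout (a mapping from block ID to score and another from block ID to document ID) and the tie-break on document ID are assumptions made for illustration.

```python
from collections import defaultdict


def rank_documents(block_scores, block_to_doc):
    """Sum block scores per document (S1103) and order documents by total score."""
    totals = defaultdict(float)
    for block_id, score in block_scores.items():
        totals[block_to_doc[block_id]] += score
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


block_scores = {"A1": 0.25, "A2": 0.5, "B1": 0.5, "C1": 0.125}
block_to_doc = {"A1": "A", "A2": "A", "B1": "B", "C1": "C"}
print(rank_documents(block_scores, block_to_doc))
```

A document whose blocks are reused widely accumulates many block scores and therefore rises in the ranking, which is the behavior the text aims for.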

In this embodiment, the document block similarity determination unit 121 determines the similarity between the document blocks contained in the documents; the document block group analysis unit 122 groups similar document blocks, calculates the deviation between the central document block of each group and the other document blocks in the group, and quantifies, for every group, the degree of similarity of the document blocks constituting the group based on that deviation; and the document block diversion degree calculation unit 140 calculates the score of each document block contained in the search target documents and uses those scores to calculate the score of each whole document. As a result, documents that give an overview of the field of the search keyword entered by a user or the like can be displayed at the top of the search results.

Also in this embodiment, the document block group analysis unit 122 can identify the central document block of each group of similar document blocks by performing, for example, a centroid calculation on the similarities of the document blocks in the document block similarity graph 123.

Also in this embodiment, the document block group analysis unit 122 can use the document block similarity distribution 124 to compute the kurtosis of the deviation in similarity between the central document block and each document block in a group, that is, in a set of similar document blocks.

Next, a second embodiment of the present invention will be described. The second embodiment shares parts with the first embodiment; those parts are given the same reference numerals, and their detailed description is omitted.

In addition to the configuration of the search system of the first embodiment, the search system of the second embodiment includes a similarity determination unit that divides each document of the document group to be searched into the components of the document and calculates the similarity between components based on the editing history of each component. The differences are described in detail below.

FIG. 12 is a functional block diagram showing in detail the configuration of the search system 200 of the second embodiment. Referring to FIG. 12, the second embodiment adds a document operation history / revision history 251 to the database 250, on top of the configuration of the first embodiment.

The document operation history / revision history 251 is an editing history, such as the operation history and revision history left by the creators and operators of the search target documents stored in the database 250 (operation information, history information about operations, creation date information, editing date information, creator information, operator information, and so on).

The document block similarity determination unit 221 of this embodiment determines the similarity of the document blocks contained in the search target document group based on the editing history stored in the document operation history / revision history 251 of the database (document system) 250, and calculates their similarity. That is, instead of the content-based similarity determination that the document block similarity determination unit 121 performs on the document blocks of the search target document group, it divides each document of the document group to be searched into document blocks and calculates the similarity between document blocks based on the editing history of each document block.
In all other respects, the configuration and operation of the search system 200 are the same as those of the search system 10 of the first embodiment.
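The text does not specify how an editing history is turned into a similarity value. Purely as an illustration, the sketch below treats each block's history as a set of (kind, value) events and measures the overlap of two such sets; the event kinds, the sample values, and the function name are all hypothetical.

```python
def history_similarity(history_a, history_b):
    """Jaccard overlap of two blocks' edit-history event sets, in [0, 1]."""
    events_a, events_b = set(history_a), set(history_b)
    union = events_a | events_b
    return len(events_a & events_b) / len(union) if union else 0.0


# Made-up history records: both blocks share a creator and a copy source.
block_a2 = {("creator", "user1"), ("copied_from", "X3"), ("edited", "2008-10-01")}
block_b2 = {("creator", "user1"), ("copied_from", "X3"), ("edited", "2008-11-15")}
print(history_similarity(block_a2, block_b2))
```

Because the result has the same range as a content-based similarity, it could be stored in the same similarity table and fed to the unchanged group analysis and scoring steps, which is consistent with the statement that the rest of the system 200 works as in the first embodiment.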

In this embodiment, the document block similarity determination unit 221 determines the similarity of the document blocks contained in the search target document group based on the history information in the document operation history / revision history 251 of the database 250; the document block group analysis unit 122 groups similar document blocks, calculates the deviation between the central document block of each group and the other document blocks in the group, and quantifies, for every group, the degree of similarity of the document blocks constituting the group based on that deviation; and the document block diversion degree calculation unit 140 calculates the score of each document block contained in the search target documents and uses those scores to calculate the score of each whole document. As a result, documents that give an overview of the field of the entered search keyword can be displayed at the top of the search results.
As in the first embodiment, the document block group analysis unit 122 can identify the central document block of each group of similar document blocks by performing, for example, a centroid calculation on the similarities of the document blocks in the document block similarity graph 123. Likewise, the document block group analysis unit 122 can use the document block similarity distribution 124 to compute the kurtosis of the deviation in similarity between the central document block and each document block in a group, that is, in a set of similar document blocks.

As described above, various component (document block) similarity determination means, based on various kinds of information about the search target documents and on various algorithms, can be incorporated into the present invention as the document block similarity determination unit. Examples are the similarity determination unit 121, which calculates a content-based similarity between the components (document blocks) contained in the documents of the document group to be searched, and the similarity determination unit 221, which calculates a similarity based on the editing history of each component.

The specific configuration of the present invention is not limited to the embodiments described above; modifications that do not depart from the gist of the invention are also included in the invention.
For example, each unit and each means of each search system may be realized by hardware or by a combination of hardware and software. In the combined form, a program is loaded into RAM, and hardware such as a control unit operates according to the program to realize each unit and means. The program may be recorded on a recording medium and distributed. The program recorded on the recording medium is read into a storage unit over a wired or wireless connection or directly from the recording medium, and operates the control unit and the like. Examples of the recording medium include optical disks, magnetic disks, semiconductor memory devices, and hard disks. As a specific example, the search system can be realized on a general-purpose computer as shown in FIG. 13. The search system shown in FIG. 13 is connected over a network to a database storing the search target group, and operates as a search system when the programs stored in the auxiliary storage device are loaded into RAM and read by the control unit. Based on the programs loaded into RAM, the control unit functions as document block similarity determination means, document block group analysis means, document block diversion degree calculation means, document search means, document index creation means, and so on. Using a search keyword entered through the input unit or the network interface, the computer operating as the search system orders and outputs documents from the search target group recorded in the database according to the document score values calculated by these means, so that documents giving an overview of the field of the search keyword are displayed at the top of the search results.

In other words, the search system can be built with a control unit, a storage unit that stores the document group to be searched, and an output unit that outputs search results, where the control unit: divides each document of the document group into the components of the document and calculates the similarity between components; based on the calculated similarity, groups similar components and calculates the deviation between the component at the center of each group and the other components in the group; quantifies, for every group, the degree of similarity of the components constituting the group based on the deviation; calculates, for each component contained in a document to be searched, a score based on the product of the deviation and the degree of similarity; totals a score value for each document by accumulating the score values of the components it contains; and outputs the documents to be searched to the output unit in order based on the totaled document score values. The second embodiment can be built in the same way.

The search target group may also be stored in the auxiliary storage device instead of in a database connected over a network.

As described above, the present invention can provide a search system that displays documents giving an overview of the field of the entered search keyword at the top of the search results.
The present invention can also provide a search method that displays documents giving an overview of the field of the entered search keyword at the top of the search results.
Furthermore, the present invention can provide a program capable of displaying documents giving an overview of the field of the entered search keyword at the top of the search results.

FIG. 1 is a functional block diagram showing the configuration of the search system 10 of the first embodiment.
FIG. 2 is a diagram explaining document blocks.
FIG. 3 is a functional block diagram showing in detail the configuration of the search system 10 of the first embodiment.
FIG. 4 is an explanatory diagram visually showing the document block similarity table.
FIG. 5 is an explanatory diagram visually showing the document block similarity graph.
FIG. 6 is an explanatory diagram visually showing the document block similarity distribution.
FIG. 7 is an explanatory diagram visually showing the in-group document block deviation table.
FIG. 8 is an explanatory diagram visually showing the document block group kurtosis table.
FIG. 9 is a flowchart showing the document block similarity determination process.
FIG. 10 is a flowchart showing the document block group analysis process.
FIG. 11 is a flowchart showing the document block diversion degree calculation process.
FIG. 12 is a functional block diagram showing in detail the configuration of the search system 200 of the second embodiment.
FIG. 13 is a functional block diagram showing the configuration of a search system implemented on a computer.

Explanation of symbols

10 Search system
20 Similarity determination unit
30 Group analysis unit
40 Diversion degree calculation unit
50 Database
110 Search interface
111 Search unit
112 Document index creation unit
113 Document index
120 Document block analysis unit
121 Document block similarity determination unit
122 Document block group analysis unit
123 Document block similarity graph
124 Document block similarity distribution
131 Document block similarity table
132 In-group document block deviation table
133 Document block group kurtosis table
200 Search system
220 Document block analysis unit
221 Document block similarity determination unit
250 Database (document system)
251 Document operation history / revision history (history information about operations)

Claims

1. A search system comprising: a similarity determination unit that divides each document of a document group to be searched into the components of the document and calculates the similarity between the components extracted from the documents;
a group analysis unit that, based on the similarity between components calculated by the similarity determination unit, groups similar components, calculates for each group the deviation between the component at the center of the group and each of the other components in the group, and, with reference to the calculated deviations, quantifies for each group the degree of similarity of the components constituting the group; and
a diversion degree calculation unit that, for one or more components included in each document to be searched, calculates a score based on the product of the corresponding deviation and degree of similarity, and totals a score value for each document by accumulating the score values of the components it contains,
wherein, in response to a search request, the search system refers to the score value of each document totaled by the diversion degree calculation unit and outputs, in order, the documents of the search target document group that match the request.

2. A search system comprising: a similarity determination unit that divides each document of a document group to be searched into the components of the document and calculates the similarity between the components extracted from the documents based on the editing history of each component;
a group analysis unit that, based on the similarity between components calculated by the similarity determination unit, groups similar components, calculates for each group the deviation between the component at the center of the group and each of the other components in the group, and, with reference to the calculated deviations, quantifies for each group the degree of similarity of the components constituting the group; and
a diversion degree calculation unit that, for one or more components included in each document to be searched, calculates a score based on the product of the corresponding deviation and degree of similarity, and totals a score value for each document by accumulating the score values of the components it contains,
wherein, in response to a search request, the search system refers to the score value of each document totaled by the diversion degree calculation unit and outputs, in order, the documents of the search target document group that match the request.

3. The search system according to claim 1, further comprising a document search unit that performs a first search process on the search target document group based on the keyword content rate of a received keyword, performs a second search process on the documents extracted by the first search process based on the score value of each document totaled by the diversion degree calculation unit, and outputs the results in order.

4. A search system that receives a keyword as input, searches a document group to be searched with a score calculation process, and outputs the results in order document by document, the search system comprising:
a document block similarity determination unit that divides each document to be searched into document blocks and calculates the similarity between document blocks by comparing the document blocks of each divided document with the document blocks of the other divided documents;
a document block group analysis unit that, using the similarities between document blocks calculated by the document block similarity determination unit, groups similar document blocks into document block groups, identifies the central document block at the center of each document block group, calculates for each document block group deviation information on the similarity between the central document block and the other document blocks in the same group,
and calculates kurtosis information for every document block group using the similarity distribution, contained in the deviation information, between the central document block and the other document blocks in the group;
a document search unit that extracts a plurality of documents from the document group to be searched based on the keyword content rate of the received keyword; and
a document block diversion degree calculation unit that, for each document block contained in each of the documents extracted by the document search unit, calculates the score of the document block based on the product of the kurtosis information of its document block group and the deviation information,
and totals the score value of each document by accumulating the calculated score values of its document blocks,
wherein the extracted documents are output in order with reference to the totaled score value of each document.

The search system according to claim 4, wherein the document block similarity determination unit calculates, for each document block of each divided document, the similarity between document blocks based on an editing history of operations performed by the creator and/or an operator of the document.
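
One conceivable way to derive a similarity from an editing history is sketched below. The claim does not specify the form of the editing history; this sketch assumes, purely for illustration, that the history is a set of recorded copy operations between block identifiers (copy_events) and that the similarity of two linked blocks is the text ratio given by Python's standard difflib.SequenceMatcher. All names are hypothetical.

```python
from difflib import SequenceMatcher

def history_similarity(block_a, block_b, texts, copy_events):
    """Similarity between two blocks, gated by the editing history.

    texts: dict mapping block id to its current text.
    copy_events: set of (source_id, target_id) pairs recorded when an
                 author or operator copied one block into another
                 (assumed history format, not taken from the claims).
    """
    linked = (block_a, block_b) in copy_events or (block_b, block_a) in copy_events
    if not linked:
        # No recorded diversion between the two blocks.
        return 0.0
    # Linked blocks: measure how much of the copied text survived editing.
    return SequenceMatcher(None, texts[block_a], texts[block_b]).ratio()

if __name__ == "__main__":
    texts = {"a1": "install the driver", "b1": "install the new driver"}
    print(history_similarity("a1", "b1", texts, {("a1", "b1")}))
```

Under these assumptions, blocks that merely look alike but were never copied from one another receive a similarity of zero, which is one way of reading "based on an editing history".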

The search system according to claim 4 or 5, wherein the document block diversion degree calculation unit calculates the score of each document block by multiplying each of the kurtosis of the document block group to which the document block belongs and the similarity information with respect to the central document block by an appropriate coefficient and taking their product, and accumulates the score values taking into account the total number of document blocks in the document block group that contains the document block whose score is to be calculated.
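
The weighted score described above might, for example, take the following form. The coefficients alpha and beta and the decision to multiply by the group size are assumptions of this sketch; the claim states only that appropriate coefficients are applied and that the total number of document blocks in the group is taken into account.

```python
def block_score(kurtosis, similarity, group_size, alpha=1.0, beta=1.0):
    """Score of one document block.

    alpha, beta: hypothetical coefficients applied to the kurtosis and to
                 the similarity with the central block, respectively.
    group_size:  total number of blocks in the block's group; here it is
                 used as a simple multiplicative weight (an assumption).
    """
    return (alpha * kurtosis) * (beta * similarity) * group_size

def document_score(blocks, alpha=1.0, beta=1.0):
    """Accumulate block scores; blocks is a list of
    (kurtosis, similarity, group_size) tuples for one document."""
    return sum(block_score(k, s, n, alpha, beta) for k, s, n in blocks)

if __name__ == "__main__":
    print(document_score([(2.0, 0.5, 4), (1.0, 1.0, 2)]))
```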

7. The search system described in 1, wherein the document block diversion degree calculating unit adds a score when the document block whose score is to be calculated is the central document block of the group to which it belongs.

8. The search system according to any one of the above claims, wherein the document block diversion degree calculating unit adds a score when the kurtosis of the group to which the document block whose score is to be calculated belongs is equal to or greater than a predetermined value.
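
The two score additions described above (for a central document block, and for a group whose kurtosis reaches a predetermined value) could be combined as in the following sketch. The bonus amounts and the threshold value are placeholders chosen for illustration and are not taken from the claims.

```python
def score_with_bonus(kurtosis, deviation, is_central,
                     central_bonus=1.0,
                     kurtosis_threshold=3.0,
                     kurtosis_bonus=0.5):
    """Base score plus optional additions.

    central_bonus, kurtosis_threshold and kurtosis_bonus are
    hypothetical tuning values.
    """
    score = kurtosis * deviation          # base: product of the two values
    if is_central:
        score += central_bonus            # block is the center of its group
    if kurtosis >= kurtosis_threshold:
        score += kurtosis_bonus           # group kurtosis reaches threshold
    return score

if __name__ == "__main__":
    print(score_with_bonus(4.0, 0.5, is_central=True))
```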

A search method performed by an information processing system, comprising:
dividing each document of a document group to be searched into the components of the document, and calculating the similarity between the components extracted from the documents;
grouping similar components based on the calculated similarity between components, and calculating, in each group, the deviation between the component serving as the center of the group and the other components in the group;
quantifying, for each group, the degree of similarity of the components constituting the group with reference to the calculated deviations between components;
calculating, for the one or more components contained in each document to be searched, a score based on the product of the corresponding deviation and degree of similarity, accumulating the score values of the contained components for each document, and totaling the score value of each document and recording it in a storage unit; and
in response to a search request received from an input unit, outputting from an output unit the relevant documents from the document group that is the search target, ordered with reference to the totaled score value of each document.
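
The grouping step and the calculation of the deviation from the central component might be sketched as follows. The claim does not name a grouping algorithm; this sketch assumes a simple threshold rule (components are placed in the same group when they are connected by pairwise similarities at or above a threshold) and takes as the center the component with the largest total similarity to the rest of its group. The function names and the threshold are hypothetical.

```python
from itertools import combinations

def group_components(ids, sim, threshold=0.5):
    """Group components connected by similarity >= threshold (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def central_and_deviation(group, sim):
    """Return the central component and the similarity of every other
    component in the group to that center (the 'deviation')."""
    central = max(group, key=lambda c: sum(sim(c, o) for o in group if o != c))
    return central, {o: sim(central, o) for o in group if o != central}

if __name__ == "__main__":
    pairs = {frozenset("ab"): 0.9, frozenset("ac"): 0.8, frozenset("bc"): 0.6}
    sim = lambda x, y: pairs.get(frozenset((x, y)), 0.0)
    for g in group_components(["a", "b", "c", "d"], sim):
        print(g, central_and_deviation(g, sim))
```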

A search method performed by an information processing system, comprising:
dividing each document of a document group to be searched into the components of the document, and calculating the similarity between the components based on the editing history of each component extracted from the documents;
grouping similar components based on the calculated similarity between components, and calculating, in each group, the deviation between the component serving as the center of the group and the other components in the group;
quantifying, for each group, the degree of similarity of the components constituting the group with reference to the calculated deviations between components;
calculating, for the one or more components contained in each document to be searched, a score based on the product of the corresponding deviation and degree of similarity, accumulating the score values of the contained components for each document, and totaling the score value of each document and recording it in a storage unit; and
in response to a search request received from an input unit, outputting from an output unit the relevant documents from the document group that is the search target, ordered with reference to the totaled score value of each document.

For a keyword to be searched that is received from the input unit, a document group serving as output candidates is retrieved, based on the keyword content rate, from the document group that is the search target,
and the document group of output candidates extracted by that search process is further searched and narrowed down based on the score values of the individual documents, and the results are output in order. The search method described.
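
The two-stage search described above could be sketched as follows. The definition of the keyword content rate used here (occurrences of the keyword divided by the number of whitespace-separated words) and the min_rate cutoff are assumptions of this sketch, as are the function names; the per-document scores are taken as already computed.

```python
def keyword_content_rate(text, keyword):
    """Fraction of words in text equal to keyword (assumed definition)."""
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

def two_stage_search(docs, doc_scores, keyword, min_rate=0.0):
    """docs: dict doc_id -> text; doc_scores: dict doc_id -> total score."""
    # Stage 1: extract output candidates by keyword content rate.
    candidates = [d for d, text in docs.items()
                  if keyword_content_rate(text, keyword) > min_rate]
    # Stage 2: order the candidates by the precomputed document scores.
    return sorted(candidates, key=lambda d: doc_scores.get(d, 0.0), reverse=True)

if __name__ == "__main__":
    docs = {"A": "printer driver setup",
            "B": "network printer printer",
            "C": "coffee machine"}
    print(two_stage_search(docs, {"A": 2.3, "B": 0.4, "C": 9.0}, "printer"))
```

Note that in this sketch document C is excluded in the first stage even though it has the highest score, because it does not contain the keyword at all.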

A search method performed by an information processing system that receives a keyword from an input unit, executes a search involving a score calculation process on a document group that is the search target, orders the results by document, and outputs them to an output unit, the method comprising:
dividing each document that is a search target into document blocks, and calculating the similarity between document blocks by comparing the document blocks of each divided document with the document blocks of the other documents divided in the same manner;
grouping similar document blocks into document block groups with reference to the calculated similarities between document blocks, identifying a central document block serving as the center of each document block group, and calculating, for each document block group, the similarity between the central document block and the other document blocks of the same group as deviation information;
calculating kurtosis information for every document block group with reference to the distribution of similarities, contained in the deviation information, between the central document block and the other document blocks in the group;
extracting a plurality of documents from the document group that is the search target based on the keyword content rate of the keyword received from the input unit;
performing, for each of the plurality of documents extracted by the extraction process, the score calculation process on each document block constituting the document based on the product of the kurtosis information of its document block group and the deviation information;
accumulating the score values calculated for the document blocks, totaling the score value of each document as a whole, and recording it in a storage unit; and
ordering the plurality of extracted documents with reference to the totaled score value of each document as a whole, and outputting them from the output unit.
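
The kurtosis information of a document block group might be computed from the similarity distribution as in the sketch below. The claims do not give a formula; this sketch assumes the conventional fourth standardized moment (the fourth central moment divided by the squared variance) and returns 0.0 for an empty or constant distribution, which is a choice made here for convenience.

```python
def kurtosis(similarities):
    """Fourth standardized moment of a list of similarity values.

    similarities: similarities between the central block and the other
                  blocks of one group (the deviation information).
    """
    n = len(similarities)
    if n == 0:
        return 0.0
    mean = sum(similarities) / n
    m2 = sum((s - mean) ** 2 for s in similarities) / n   # variance
    if m2 == 0.0:
        return 0.0  # all values identical: kurtosis undefined, use 0.0
    m4 = sum((s - mean) ** 4 for s in similarities) / n   # 4th central moment
    return m4 / (m2 * m2)

def group_kurtosis(deviation_by_group):
    """deviation_by_group: dict group_id -> {block_id: similarity}."""
    return {g: kurtosis(list(dev.values())) for g, dev in deviation_by_group.items()}

if __name__ == "__main__":
    print(kurtosis([0.5, 0.5, 0.5, 0.5, 0.0, 1.0]))  # peaked distribution
    print(kurtosis([0.0, 0.0, 1.0, 1.0]))            # two-point distribution
```

With this definition a group whose similarities cluster tightly around one value, with a few outliers, yields a larger kurtosis than a group whose similarities are spread evenly.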

The search method according to claim 12, wherein the calculation of the similarity between document blocks calculates the similarity for each of the divided document blocks based on an editing history of operations performed by the creator and/or an operator of the document.

A program that causes a control unit to function as:
similarity determination means for dividing each document of a document group to be searched into the components of the document, and calculating the similarity between the components extracted from the documents;
group analysis means for grouping similar components based on the similarity between components calculated by the similarity determination means, calculating, in each group, the deviation between the component serving as the center of the group and the other components in the group, and quantifying, for each group, the degree of similarity of the components constituting the group with reference to the calculated deviations between components; and
diversion degree calculation means for calculating, for the one or more components contained in each document to be searched, a score based on the product of the corresponding deviation and degree of similarity, accumulating the score values of the contained components for each document, and totaling the score value of each document,
thereby enabling a search system to output the document group that is the search target in order, in response to a search request, based on the score value of each document totaled by the diversion degree calculation means.

A program that causes a control unit to function as:
similarity determination means for dividing each document of a document group to be searched into the components of the document, and calculating the similarity between the components based on the editing history of each component extracted from the documents;
group analysis means for grouping similar components based on the similarity between components calculated by the similarity determination means, calculating, in each group, the deviation between the component serving as the center of the group and the other components in the group, and quantifying, for each group, the degree of similarity of the components constituting the group with reference to the calculated deviations between components; and
diversion degree calculation means for calculating, for the one or more components contained in each document to be searched, a score based on the product of the corresponding deviation and degree of similarity, accumulating the score values of the contained components for each document, and totaling the score value of each document,
thereby enabling a search system to output the document group that is the search target in order, in response to a search request, based on the score value of each document totaled by the diversion degree calculation means.

The program according to claim 14 or 15,
further causing the control unit to function as
document search means for performing a first search process on the document group that is the search target based on the keyword content rate of a keyword input via an input unit, performing a second search process on the document group extracted by the first search process based on the score values of the individual documents totaled by the diversion degree calculation means, and outputting the results in order.

A program used in a search system that receives a keyword as input, executes a search involving a score calculation process on a document group that is the search target, and outputs the results ordered by document,
the program causing a control unit to function as:
document block similarity determination means for dividing each document to be searched into document blocks and calculating the similarity between document blocks by comparing the document blocks of each divided document with the document blocks of the other divided documents;
document block group analysis means for grouping similar document blocks into document block groups using the similarities between document blocks calculated by the document block similarity determination means, identifying a central document block serving as the center of each document block group, calculating, for each document block group, the similarity between the central document block and the other document blocks of the same group as deviation information, and calculating kurtosis information for every document block group using the distribution of similarities, contained in the deviation information, between the central document block and the other document blocks in the group;
document search means for extracting a plurality of documents from the document group that is the search target based on the keyword content rate of the input keyword; and
document block diversion degree calculation means for performing, for each of the plurality of documents extracted by the document search means, the score calculation process on each document block constituting the document based on the product of the kurtosis information of its document block group and the deviation information, accumulating the score values calculated for the document blocks, and totaling the score value of each document,
thereby enabling the search system to output the plurality of extracted documents in order, in response to a search request, with reference to the totaled score value of each document.

The program according to claim 17, wherein the document block similarity determination means calculates, for each document block of each divided document, the similarity between document blocks based on an editing history of operations performed by the creator and/or an operator of the document.