JP4423385B2

JP4423385B2 - Document classification support apparatus and computer program

Info

Publication number: JP4423385B2
Application number: JP2002309555A
Authority: JP
Inventors: 健一郎山本; 啓北内
Original assignee: NTT Data Corp; National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology; NTT Data Group Corp
Priority date: 2002-10-24
Filing date: 2002-10-24
Publication date: 2010-03-03
Anticipated expiration: 2022-10-24
Also published as: JP2004145626A

Description

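[0001]
[Technical field of the invention]
The present invention relates to a document classification support apparatus that supports the classification of document data.
[0002]
[Prior art]
Conventionally, automatic document classification by computer often uses a document set in which each document has already been assigned to a category in order to classify a given document into a more detailed and appropriate category. Many conventional methods of this kind extract features such as words and phrases from a document and automatically classify the document into an appropriate category using feature quantities such as their frequency of appearance.
Patent Document 1 describes a technique that extracts from a document noun-like expressions classified into groups such as "company name" and "product name", together with verb-like expressions likewise classified into groups, and classifies the document by using these expressions and the locations where they appear as feature quantities of the document.
Patent Document 2 describes a technique that classifies documents by using as feature quantities, in addition to the words contained in a document, words representing the topic of the document and accompanying attribute information such as the speaker and the creation date of the document.
Patent Document 3 describes a technique that extracts a plurality of word sets from a single document as subjects, calculates the similarity between two documents in consideration of the plurality of subjects, and uses this inter-document similarity calculation method to perform document retrieval and to cluster document sets.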
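[0003]
[Patent Document 1]
Japanese Patent Laid-Open No. 2002-108893 (paragraphs 0014 to 0079, FIGS. 1 to 24)
[Patent Document 2]
Japanese Patent Laid-Open No. 2001-60199 (paragraphs 0029 to 0080, FIGS. 1 to 7)
[Patent Document 3]
Japanese Patent Laid-Open No. 2000-123041 (paragraphs 0048 to 0106, FIGS. 4 to 10)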
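[0004]
[Problems to be solved by the invention]
Suppose that a document is to be classified into a document set such as a reference book or an instruction manual. Reference books and instruction manuals are often organized into hierarchical sections such as chapters and sections. Chapters and sections are ordered: the content becomes more advanced as the sections progress, and later content is written on the premise of what has been described earlier. Consequently, an important word that appears in one section also appears in later sections, and the important words that appear tend to accumulate as the sections progress. In addition, higher-level sections such as chapters share little related content with one another. When a classification target document is automatically classified into a section of such a document set, it should therefore be classified into the section in which its content is first described (called the "first-appearance section"). Sections after the first-appearance section describe more advanced content and are not appropriate classification destinations. For example, suppose that one section of a science textbook explains "current", the next section explains "voltage" using "current", and a later section explains "resistance" using "current" and "voltage". A classification target document that discusses "current" should then be classified into the first-appearance section in which "current" is first explained, not into the sections explaining "voltage" or "resistance".
In this situation, the conventional classification methods shown in Patent Documents 1 to 3 have the following problems.
(1) Although a classification target document may contain several subjects, a single classification destination is determined, so the user needs time to find the correct classification destination. For example, using the subject extraction method of Patent Document 3, a classification destination could be determined for each extracted subject. However, that method extracts subjects only from the word distribution within a single document and does not use the word distribution in the documents of the document set already classified into categories. As a result, the extracted subjects do not take the content of the categories into account, and the two often fail to match well.
(2) Important words related to the content of the classification target document often appear more frequently in sections after the first-appearance section. With conventional methods that rely on word frequency or the like, important words are therefore often classified into a section after the first-appearance section, and correcting the result to the right classification destination costs the user considerable effort.
(3) A classification destination is presented for the classification target document as a whole, but which part of the document relates to the destination category is not presented. To judge whether a classification destination is correct, the user therefore has to consult the entire classification target document, which is laborious.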
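[0005]
The present invention has been made in view of the above circumstances, and its object is to provide a document classification support apparatus capable of presenting classification destination candidates for classifying a new classification target document into an appropriate section of document data in which the content becomes more advanced as the sections progress.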
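[0006]
[Means for solving the problems]
The present invention has been made to solve the above problems. The invention according to claim 1 is a document classification support apparatus comprising: a storage unit that stores a classification destination document composed of hierarchical sections, a classification target document to be classified into a lower-level section of the classification destination document, and an association among an important word, the upper-level section containing the lower-level section in which the important word appears in the classification destination document, and a number indicating the order of appearance of that lower-level section within the upper-level section; a first important word extraction unit that reads the classification destination document from the storage unit, extracts important words, and writes each important word to the storage unit in association with the upper-level section containing the lower-level section in which the important word appears and with the number of that lower-level section; a second important word extraction unit that reads the classification target document and the important words from the storage unit and extracts the read important words from the classification target document; a subject extraction unit that, based on the important words extracted by the second important word extraction unit and on the upper-level sections and lower-level section numbers stored in the storage unit for the lower-level sections in which those important words appear in the classification destination document, extracts as subjects, for each upper-level section, one or more sets that are constructed so that important words appearing in the same lower-level section belong to the same set and that are minimized so that no important word is shared between sets; a classification destination deriving unit that, based on the important word group constituting a subject of the classification target document extracted by the subject extraction unit and on the lower-level section numbers stored in the storage unit for the sections in which those important words appear in the classification destination document, derives as the classification destination lower-level section the rearmost of the lower-level sections in which the respective important words constituting the subject first appear in the classification destination document; and a display unit that displays the important word group constituting the subject of the classification target document extracted by the subject extraction unit and the classification destination lower-level section derived for that subject by the classification destination deriving unit.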
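[0008]
The invention according to claim 2 is the document classification support apparatus according to claim 1, wherein the first important word extraction unit extracts important words based on a predetermined part of speech, a sentence expression indicating an important matter, or the word distribution in the classification destination document.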
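[0010]
The invention according to claim 3 is a computer program for causing a computer used as a document classification support apparatus to execute: a step of reading a classification destination document from a storage unit that stores the classification destination document composed of hierarchical sections, a classification target document to be classified into a lower-level section of the classification destination document, and an association among an important word, the upper-level section containing the lower-level section in which the important word appears in the classification destination document, and a number indicating the order of appearance of that lower-level section within the upper-level section; a step of extracting important words from the read classification destination document and writing each important word to the storage unit in association with the upper-level section containing the lower-level section in which it appears and with the number of that lower-level section; a step of reading the classification target document and the important words from the storage unit and extracting the read important words from the classification target document; a step of extracting as subjects, based on the important words extracted from the classification target document and on the upper-level sections and lower-level section numbers stored in the storage unit for the lower-level sections in which those important words appear in the classification destination document, one or more sets for each upper-level section that are constructed so that important words appearing in the same lower-level section belong to the same set and that are minimized so that no important word is shared between sets; a step of deriving as the classification destination lower-level section, based on the important word group constituting an extracted subject of the classification target document and on the lower-level section numbers stored in the storage unit for the sections in which those important words appear in the classification destination document, the rearmost of the lower-level sections in which the respective important words constituting the subject first appear in the classification destination document; and a step of displaying the important word group constituting the subject of the classification target document and the classification destination lower-level section of the subject.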
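[0011]
[Embodiments of the invention]
An embodiment of the present invention is described below with reference to the drawings.
First, the characteristics of the document (hereinafter, "classification destination document") that serves as the classification destination of the document whose classification the document classification support apparatus of this embodiment supports (hereinafter, "classification target document") are described. The classification destination document is a document whose content becomes gradually more advanced, such as a reference book or an instruction manual, and has the following characteristics.
(1) It is composed of hierarchical sections consisting of chapters and sections. The finest-grained, lowest-level units are called sections, and the units above them in the hierarchy are called chapters. A chapter at the lowest chapter level therefore consists of a plurality of sections.
(2) Within any one lowest-level chapter, the content becomes gradually more advanced as the sections progress. That is, later content is written on the premise of what has been described earlier. Consequently, an important word that appears in one section also appears in later sections, and the important words that appear accumulate as the sections progress.
(3) Chapters share little related content. Taking textbooks as an example, a given grade and school subject can also be regarded as one chapter.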
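[0012]
FIG. 1 is a functional block diagram showing the configuration of a document classification support apparatus according to an embodiment of the present invention.
The classification destination document database (DB) 101 (storage unit) stores the classification destination document, which is a set of digitized document data, together with the section in which each document is described, that is, information on the chapter and section to which each item of document data belongs. The classification destination document is, for example, a textbook, a reference book, or an operation manual.
The important word database (DB) 103 (storage unit) stores information on the important words extracted from the classification destination document and information on important word candidates, which are candidates for important words.
The classification target document storage unit 201 (storage unit) stores the classification target document, which is digitized document data. The classification target document is, for example, a newspaper article, a column, or part of an operation manual.
The important word extraction unit 102 (first important word extraction unit) reads the classification destination document from the classification destination document DB 101, extracts important words and important word candidates, and writes them to the important word DB 103.
The important word extraction unit 202 (second important word extraction unit) reads the classification target document from the classification target document storage unit 201 and extracts from it the important words and important word candidates registered in the important word DB 103.
The subject extraction unit 203 extracts the subjects of the classification target document using the important words that the important word extraction unit 202 extracted from the classification target document.
The classification destination deriving unit 204 derives, based on the subjects extracted by the subject extraction unit 203, the section of the classification destination document into which the important words should be classified.
The description range deriving unit 205 derives the range of the classification target document in which the important words are described.
The display unit 206 controls output to the display of the document classification support apparatus and displays the processing results of the classification destination deriving unit 204 and the description range deriving unit 205.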
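[0013]
Next, the processing procedure of the document classification support apparatus according to this embodiment is described. The procedure consists of two stages: the "important word extraction from the classification destination document" stage and the "classification support for the classification target document" stage.
FIG. 2 shows the processing procedure for extracting important words from the classification destination document. In the "important word extraction from the classification destination document" stage, as a step preliminary to classification, important words are extracted section by section from a classification destination document such as a reference book or an instruction manual.
Step S110:
First, the important word extraction unit 102 reads from the classification destination document DB 101 the classification destination document and the information on the chapter and section to which each item of document data in it belongs, divides the text into words by morphological analysis, and identifies the part of speech of each word.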
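[0014]
Step S120:
Next, the important word extraction unit 102 extracts important words using the parts of speech of the words obtained in step S110 and the sentence expressions and word distribution in the classification destination document. Specifically, a word that satisfies "(1) part-of-speech condition" below and also satisfies either "(2a) sentence expression condition" or "(2b) word distribution condition" is extracted as an important word. In addition, the important word extraction unit 102 extracts as an important word candidate any word that does not satisfy the conditions for an important word but satisfies "(1) part-of-speech condition" alone.
(1) Part-of-speech condition
Words with specific parts of speech are extracted. For example, words whose part of speech is noun, verb, or adjective are extracted.
(2a) Sentence expression condition
Important words are extracted based on sentence expressions that indicate an important matter. For example, when the morphological analysis result contains the sentence expression
"を/case particle A/noun と/case particle いい/verb ます/auxiliary verb"
(roughly, "... is called A"), the word A is extracted as an important word. Other sentence expressions that indicate an important matter include the following.
"A/noun と/case particle は/binding particle (several words) の/case particle こと/noun です/auxiliary verb" (roughly, "A means ..."; the word A is an important word)
"A/noun に/case particle なる/verb と/conjunctive particle" (roughly, "when ... becomes A"; the word A is an important word)
(2b) Word distribution condition
In general, a word that appears in many sections is often not an important word. Put differently, a word that appears concentrated in and around one place and rarely elsewhere is often important. A word that satisfies the following two conditions is therefore extracted as an important word.
- The ratio of the sections in which the word appears to all sections in the document is at or below a predetermined threshold. The threshold is, for example, 1/5 to 1/10.
- When all sentences in the classification destination document are numbered consecutively, the variance of the numbers of the sentences in which the word appears is at or below a predetermined threshold.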
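[0015]
Step S130:
The important word extraction unit 102 registers the information on the important words and important word candidates extracted in step S120 in the important word DB 103. That is, it writes to the important word DB 103 important word information, consisting of each important word, its part of speech, the chapters and sections of the classification destination document in which it appears, and its frequency of appearance in each of those sections, and important word candidate information, consisting of each important word candidate, its part of speech, the chapters and sections of the classification destination document in which it appears, and its frequency of appearance in each of those sections.
In this example, the following important word information is assumed to have been written.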

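[0016]
FIG. 3 shows the processing procedure for classification support for the classification target document. In "classification support for the classification target document", subjects, each composed of a group of related important words, are first extracted from the classification target document, and each subject is classified into a section of the classification destination document. The range of the classification target document that corresponds to each subject is then determined and presented. Finally, the classification destination section in the classification destination document is corrected and decided through user operations.
Step S210:
First, the important word extraction unit 202 reads the classification target document from the classification target document storage unit 201, divides it into words by morphological analysis, and identifies the part of speech of each word.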
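[0017]
Step S220:
The important word extraction unit 202 reads the important word information and important word candidate information from the important word DB 103 and extracts from the classification target document those words, among the words obtained in step S210, that match a read important word or important word candidate.
In this example, important word 1, important word 2, important word 3, important word 4, important word 5, important word 6, and important word 7 are assumed to have been extracted from the classification target document as important words, and word 8, word 9, word 10, word 11, and word 12 as important word candidates.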
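[0018]
Step S230:
The subject extraction unit 203 extracts the subjects of the classification target document using the chapters and sections of the classification destination document in which the important words extracted from the classification target document by the important word extraction unit 202 in step S220 appear. That is, the subject extraction unit 203 extracts the important word groups constituting the subjects in the following two stages.
(1) For each chapter of the classification destination document, the group of important words appearing in it is obtained. One important word may be included in more than one chapter.
(2) The important word group of each chapter is clustered (partitioned) under the condition that "important words appearing in the same section belong to the same cluster", and the smallest clusters satisfying this condition are obtained. Each resulting cluster represents one subject, and the important words in the same cluster form the important word group constituting that subject.
A concrete explanation follows, using the important word information read from the important word DB 103 in step S220 and the example important words extracted from the classification target document. In chapter 1 of the classification destination document, important word 4 appears in section 1.1 and important words 4 and 5 appear in sections 1.2 and 1.3, and nowhere does another important word appear in the same section as important word 4 or important word 5. The important word group consisting of important words 4 and 5 therefore represents one subject ("subject B"). Likewise, important word 3 appears in sections 1.4 and 1.5, important words 2 and 3 in section 1.6, important words 1 and 2 in section 1.7, and important words 1, 2, and 3 in section 1.8, and in chapter 1 there is no section in which important word 1, 2, or 3 appears together with any other important word. The important word group consisting of important words 1, 2, and 3 therefore represents one subject ("subject A"). Similarly, for chapter 2, the important word group consisting of important words 6 and 7 represents one subject ("subject C").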
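The two stages of step S230 can be read as finding connected components: within each chapter, important words that share a section are merged, and each resulting group is one subject. The following is a minimal Python sketch under that reading, not the patented implementation. The function names, the union-find approach, and the short labels `w1` to `w5` (standing for important words 1 to 5 of chapter 1 in the example) are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_by_shared_section(section_words):
    """Merge words that share a section and return the smallest such clusters."""
    parent = {}

    def find(word):
        # Follow parent links to the cluster representative (with path halving).
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for words in section_words.values():
        ordered = sorted(words)
        for word in ordered:
            parent.setdefault(word, word)
        for word in ordered[1:]:
            # Words in the same section must end up in the same cluster.
            parent[find(word)] = find(ordered[0])

    clusters = defaultdict(set)
    for word in parent:
        clusters[find(word)].add(word)
    return sorted(sorted(cluster) for cluster in clusters.values())


def extract_subjects(occurrences, extracted_words):
    """occurrences maps a word to a list of (chapter, section number) pairs."""
    # Stage 1: collect, per chapter, the words found in each section.
    per_chapter = defaultdict(lambda: defaultdict(set))
    for word in extracted_words:
        for chapter, section in occurrences.get(word, []):
            per_chapter[chapter][section].add(word)
    # Stage 2: cluster the words of each chapter.
    return {chapter: cluster_by_shared_section(sections)
            for chapter, sections in per_chapter.items()}


# Chapter 1 of the example: (1, 7) stands for section 1.7.
occurrences = {
    "w1": [(1, 7), (1, 8)],
    "w2": [(1, 6), (1, 7), (1, 8)],
    "w3": [(1, 4), (1, 5), (1, 6), (1, 8)],
    "w4": [(1, 1), (1, 2), (1, 3)],
    "w5": [(1, 2), (1, 3)],
}
subjects = extract_subjects(occurrences, ["w1", "w2", "w3", "w4", "w5"])
print(subjects)  # {1: [['w1', 'w2', 'w3'], ['w4', 'w5']]}
```

With the chapter 1 data of the example, the two clusters correspond to subject A and subject B.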
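[0019]
Step S240:
The classification destination deriving unit 204 derives the classification destination section of each subject. That is, for each subject, among the sections in which the important words constituting the subject first appear in the classification destination document (their "first-appearance sections"), the rearmost one is taken as the classification destination section.
Concretely, the important word group of subject A consists of important words 1, 2, and 3, whose first-appearance sections are section 1.7, section 1.6, and section 1.4, respectively. Section 1.7, the first-appearance section of important word 1, is therefore the rearmost first-appearance section in the group and becomes the classification destination section of subject A. Similarly, the classification destination section of subject B is section 1.2, the first-appearance section of important word 4, and that of subject C is section 2.3, the first-appearance section of important word 6.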
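The rule of step S240 amounts to taking, for each important word of a subject, the earliest section in which it appears, and then taking the latest of those sections. A minimal Python sketch follows. The function name, the `(chapter, section number)` tuple encoding, and the labels `w1` to `w3` (important words 1 to 3 of subject A) are illustrative assumptions, and the sketch assumes that all words of one subject lie in the same chapter.

```python
def derive_destination(subject_words, occurrences):
    """Return the classification destination and each word's first-appearance section.

    occurrences maps a word to the (chapter, section number) pairs where it appears.
    """
    # First-appearance section of each word: its earliest section.
    first_appearance = {word: min(occurrences[word]) for word in subject_words}
    # Classification destination: the rearmost of the first-appearance sections.
    destination = max(first_appearance.values())
    return destination, first_appearance


# Subject A of the example: (1, 7) stands for section 1.7.
occurrences_a = {
    "w1": [(1, 7), (1, 8)],
    "w2": [(1, 6), (1, 7), (1, 8)],
    "w3": [(1, 4), (1, 5), (1, 6), (1, 8)],
}
destination_a, first_appearance_a = derive_destination(["w1", "w2", "w3"], occurrences_a)
print(destination_a)       # (1, 7)
print(first_appearance_a)  # {'w1': (1, 7), 'w2': (1, 6), 'w3': (1, 4)}
```

The result, section 1.7, agrees with the destination stated for subject A.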
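[0020]
Step S250:
The classification destination deriving unit 204 instructs the display unit 206 to display visually the classification destination section of each subject, the sections in which the important words appear, and so on. Specifically, the document classification support screen is displayed as follows.
(1) The important word group constituting each subject is displayed together with the group of important word candidates that were extracted in step S220 and that appear in the same sections as the important words of that group.
(2) For each important word, in order starting from the important word whose first-appearance section is rearmost, the sections in which it appears, its frequency of appearance in each, and its first-appearance section are displayed. The check box corresponding to the classification destination section derived in step S240 is set to ON. The check boxes corresponding to sections are used to decide the classification destination section.
(3) In the list of chapters and sections of the classification destination document, the classification destination section of each subject is highlighted, for example by displaying it in reverse video or in a color different from the other sections.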
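[0021]
FIG. 4 shows an image of the document classification support screen.
On the document classification support screen, the chapters of the classification destination document and the sections under them are listed vertically as a tree, and a check box indicating whether the section is the classification destination of a subject is displayed beside each section. Subjects A, B, and C are arranged horizontally, and for each subject the important word group constituting it and the important word candidates appearing in the same sections as those important words are shown. In the figure, subject A consists of the important word group of important words 1, 2, and 3 and the important word candidate group of words 8 and 9. Subject B consists of the important word group of important words 4 and 5 and the important word candidate group of words 10 and 11, and subject C consists of the important word group of important words 6 and 7 and the important word candidate group of word 12.
For the important word group of each subject, the sections in which each important word appears and its frequency of appearance are presented in order starting from the important word whose first-appearance section is rearmost. The first-appearance section of each important word is highlighted. In the figure, important word 1 of subject A appears twice in its first-appearance section 1.7 and four times in section 1.8; important word 2 appears three times in its first-appearance section 1.6, twice in section 1.7, and three times in section 1.8; and important word 3 appears four times in its first-appearance section 1.4, once in section 1.5, three times in section 1.6, and four times in section 1.8. Section 1.7, the first-appearance section of important word 1, whose first-appearance section is the rearmost in subject A, is highlighted and the check box beside it is ON, indicating that it is the classification destination section of subject A. Similarly, for subject B, section 1.2, the first-appearance section of important word 4, is highlighted as the classification destination section and its check box is ON, and for subject C, section 2.3, the first-appearance section of important word 6, is highlighted and its check box is ON. This makes it possible to grasp the order in which the important words appear and to see at a glance that the important word with the rearmost first-appearance section determines the classification destination.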
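[0022]
Returning to step S250 of FIG. 3, the classification destination deriving unit 204 further, in response to user operations on the displayed document classification support screen, reads from the classification destination document DB 101 the classification destination document and the information on the chapter and section to which each item of document data in it belongs, and instructs the display unit 206 to display the following screens supporting document classification.
(1) When the user selects a chapter or section from the list of chapters and sections of the classification destination document with a mouse click, the full text of that chapter or section is displayed. A list of the important words appearing in the displayed chapter or section and their frequencies of appearance is also displayed. The important words displayed here include important words that are not contained in the classification target document.
(2) When the user selects an important word from the important word group of a subject with a mouse click, the sentences of the classification destination document in which that important word appears are displayed together with the surrounding sentences.
(3) When the user clicks the frequency-of-appearance field of a section in which an important word appears, the sentences in which the selected important word appears in that section of the classification destination document are displayed together with the surrounding sentences.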
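[0023]
When the user changes an important word to an important word candidate, or an important word candidate to an important word, by a mouse drag-and-drop operation on the important word group of a subject, the processing from step S240 onward is performed again in response to the change: the sections of the classification destination document in which the newly designated important words appear, their frequencies of appearance, and the classification destination section of the subject are extracted, and display of the document classification support screen is instructed. The check box corresponding to the new classification destination section is set to ON.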
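[0024]
Step S260:
The description range deriving unit 205 determines the description range of each subject within the classification target document by the following procedure and notifies the display unit 206, which displays it on the display. That is, the set of sentences in the classification target document that contain one or more of the important words constituting a subject is selected as the description range corresponding to that subject. The important word group of each subject in the classification target document and the description range of the subject are then presented.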
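The selection rule of step S260 can be sketched as a filter over the sentences of the classification target document: a sentence belongs to the description range of a subject if it contains at least one important word of that subject. The Python sketch below assumes that sentences are already tokenized; the function name, the toy sentences, and the labels `w1` to `w7` (important words 1 to 7) are hypothetical.

```python
def description_ranges(sentences, subjects):
    """Map each subject name to the indices of the sentences that mention it.

    sentences: list of tokenized sentences (lists of words).
    subjects:  dict of subject name -> set of important words.
    """
    ranges = {}
    for name, words in subjects.items():
        # A sentence is in the range if it shares at least one word with the subject.
        ranges[name] = [index for index, sentence in enumerate(sentences)
                        if words & set(sentence)]
    return ranges


# Toy classification target document with four tokenized sentences.
target_sentences = [
    ["w6", "is", "introduced"],
    ["nothing", "relevant", "here"],
    ["w4", "and", "w5", "together"],
    ["w1", "depends", "on", "w2"],
]
subject_words = {
    "A": {"w1", "w2", "w3"},
    "B": {"w4", "w5"},
    "C": {"w6", "w7"},
}
ranges = description_ranges(target_sentences, subject_words)
print(ranges)  # {'A': [3], 'B': [2], 'C': [0]}
```

Returning sentence indices rather than text keeps the sketch independent of how the display unit highlights each range.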
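[0025]
FIG. 5 shows an image of the screen displaying the description range of each subject within the classification target document. In the figure, the description range in the classification target document that contains important words 6 and 7 and corresponds to subject C is presented. Also presented are the description range that contains important words 5 and 4 and corresponds to subject B and the description range that contains important words 1, 2, and 3 and corresponds to subject A.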
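[0026]
Step S270:
In step S270 of FIG. 3, the classification destination deriving unit 204 decides the classification destination in accordance with the corrections and selections that the user makes through the following operations on the document classification support screen.
(1) The user again changes an important word of a subject to an important word candidate, or an important word candidate to an important word. In response, the classification destination section is corrected automatically and the check box corresponding to the new classification destination section is set to ON.
(2) The user selects the classification destination by clicking the check box corresponding to a section to set it ON or OFF.
(3) After selecting the classification destination, the user confirms it, for example by clicking the "Decide classification destination" button with the mouse. The classification destination deriving unit 204 stores internally the sections set to ON as the classification destination sections of the subject.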
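[0027]
The document classification support apparatus of this embodiment can be used, for example, as follows.
(1) It supports a school teacher in classifying documents such as newspaper articles published on a network into textbook sections (reasonably coherent units of study) so that they can be used as supplementary teaching material in class. Because a classification destination section is presented for each subject contained in a document, the correct classification destination section can be found efficiently, and supplementary teaching material for each textbook section can be accumulated in a short time.
(2) When a person who is about to use a device comes across a sentence or term they do not understand while reading its instruction manual, the apparatus automatically classifies that sentence or term into a section of the instruction manual and so supports understanding of the content. By consulting the explanation in the classification destination section, the person can understand the sentence or term.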
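[0028]
According to this embodiment, in the automatic classification of a document into a document set such as a reference book or an instruction manual, the classification target document can be classified using the first-appearance sections of the document set. The classification target document can therefore be classified into a more appropriate section (category) than with conventional automatic classification methods.
In addition, multiple subjects of the classification target document can be represented by important word groups extracted from the classification destination document. Displaying the important word group constituting each subject of the classification target document therefore lets the user grasp at a glance what subjects the document contains and improves the efficiency of classification work.
Furthermore, displaying the important words of each subject in order starting from the one whose first-appearance section is rearmost shows at a glance that this important word determines the classification destination, which improves the efficiency of classification work.
Moreover, because the sections of the classification destination document are presented for each important word, the user can consult the section for an important word whose meaning is unclear, which helps in understanding the classification target document.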
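[0029]
The classification destination document DB 101 and the classification target document storage unit 201 may instead store the storage locations of documents, such as the URI (Universal Resource Identifier) at which a document is published. In that case, the document indicated by the storage location is read and the above processing is performed.
The part-of-speech condition in step S120 and the thresholds associated with the word distribution condition may be changed by user operation.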
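[0030]
The document classification support apparatus described above contains a computer system. The operations of the apparatus described above are stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system referred to here includes an OS and hardware such as peripheral devices.
[0031]
The "computer-readable recording medium" refers to a ROM, to portable media such as a magnetic disk, a magneto-optical disk, a CD-ROM, or a DVD-ROM, and to storage devices such as a hard disk built into a computer system. The "computer-readable recording medium" also includes media that hold the program for a certain time, such as the volatile memory (RAM) inside a computer system that acts as the system or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
[0032]
The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program is a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line (communication wire) like a telephone line.
The program may also realize only part of the functions described above. It may further be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.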
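[0033]
[Effects of the invention]
According to the present invention, in the automatic classification of a document into a document set such as a reference book or an instruction manual, the classification target document can be classified using the first-appearance sections of the document set. The classification target document can therefore be classified into a more appropriate section (category) than with conventional automatic classification methods.
In addition, multiple subjects of the classification target document can be represented by important word groups extracted from the classification destination document. Displaying the important word group constituting each subject of the classification target document therefore lets the user grasp at a glance what subjects the document contains and improves the efficiency of classification work.
[Brief description of the drawings]
[FIG. 1] A functional block diagram showing the configuration of a document classification support apparatus according to an embodiment of the present invention.
[FIG. 2] A diagram showing the processing procedure for extracting important words from the classification destination document in the embodiment.
[FIG. 3] A diagram showing the processing procedure for classification support for the classification target document in the embodiment.
[FIG. 4] A diagram showing an image of the document classification support screen in the embodiment.
[FIG. 5] A diagram showing an image of the screen displaying the description range of each subject within the classification target document in the embodiment.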
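[Explanation of reference numerals]
101 … Classification destination document DB (database)
102, 202 … Important word extraction unit
103 … Important word DB (database)
201 … Classification target document storage unit
203 … Subject extraction unit
204 … Classification destination deriving unit
205 … Description range deriving unit
206 … Display unit
[0001]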
[Technical field of the invention]
The present invention relates to a document classification support apparatus that supports document data classification.
[0002]
[Prior art]
Conventionally, automatic document classification by a computer often classifies a given document into a more detailed and appropriate category by using a document set in which each document is already classified into a category. Many of the conventional methods for automatically classifying documents extract features such as words and phrases from a document and automatically classify them into appropriate categories using feature quantities such as their appearance frequency.
Patent Document 1 describes a technique that extracts from a document noun-like expressions classified into groups such as "company name" and "product name", together with verb-like expressions likewise classified into groups, and classifies the document by using these expressions and the locations where they appear as feature quantities of the document.
Patent Document 2 describes a technique that classifies documents by using as feature quantities, in addition to the words contained in a document, words representing the topic of the document and accompanying attribute information such as the speaker and the creation date of the document.
Patent Document 3 describes a technique that extracts a plurality of word sets from a single document as subjects, calculates the similarity between two documents in consideration of the plurality of subjects, and uses this inter-document similarity calculation method to perform document retrieval and to cluster document sets.
[0003]
[Patent Document 1]
Japanese Patent Laid-Open No. 2002-108893 (paragraphs 0014 to 0079, FIGS. 1 to 24)
[Patent Document 2]
Japanese Patent Laid-Open No. 2001-60199 (paragraphs 0029-0080, FIGS. 1-7)
[Patent Document 3]
Japanese Patent Laid-Open No. 2000-123041 (paragraphs 0048-0106, FIGS. 4-10)
[0004]
[Problems to be solved by the invention]
Suppose that a document is to be classified into a document set such as a reference book or an instruction manual. Reference books and instruction manuals are often organized into hierarchical sections such as chapters and sections. Chapters and sections are ordered: the content becomes more advanced as the sections progress, and later content is written on the premise of what has been described earlier. Consequently, an important word that appears in one section also appears in later sections, and the important words that appear tend to accumulate as the sections progress. In addition, higher-level sections such as chapters share little related content with one another. When a classification target document is automatically classified into a section of such a document set, it should therefore be classified into the section in which its content is first described (called the "first-appearance section"). Sections after the first-appearance section describe more advanced content and are not appropriate classification destinations. For example, suppose that one section of a science textbook explains "current", the next section explains "voltage" using "current", and a later section explains "resistance" using "current" and "voltage". A classification target document that discusses "current" should then be classified into the first-appearance section in which "current" is first explained, not into the sections explaining "voltage" or "resistance".
In this situation, the conventional classification methods shown in Patent Documents 1 to 3 have the following problems.
(1) Although a classification target document may contain several subjects, a single classification destination is determined, so the user needs time to find the correct classification destination. For example, using the subject extraction method of Patent Document 3, a classification destination could be determined for each extracted subject. However, that method extracts subjects only from the word distribution within a single document and does not use the word distribution in the documents of the document set already classified into categories. As a result, the extracted subjects do not take the content of the categories into account, and the two often fail to match well.
(2) Important words related to the content of the classification target document often appear more frequently in sections after the first-appearance section. With conventional methods that rely on word frequency or the like, important words are therefore often classified into a section after the first-appearance section, and correcting the result to the right classification destination costs the user considerable effort.
(3) A classification destination is presented for the classification target document as a whole, but which part of the document relates to the destination category is not presented. To judge whether a classification destination is correct, the user therefore has to consult the entire classification target document, which is laborious.
[0005]
The present invention has been made in view of the above circumstances, and its object is to provide a document classification support apparatus capable of presenting classification destination candidates for classifying a new classification target document into an appropriate section of document data in which the content becomes more advanced as the sections progress.
[0006]
[Means for Solving the Problems]
The present invention has been made to solve the above problems. The invention according to claim 1 is a document classification support apparatus comprising: a storage unit that stores a classification destination document composed of hierarchical sections, a classification target document to be classified into a lower-level section of the classification destination document, and an association among an important word, the upper-level section containing the lower-level section in which the important word appears in the classification destination document, and a number indicating the order of appearance of that lower-level section within the upper-level section; a first important word extraction unit that reads the classification destination document from the storage unit, extracts important words, and writes each important word to the storage unit in association with the upper-level section containing the lower-level section in which the important word appears and with the number of that lower-level section; a second important word extraction unit that reads the classification target document and the important words from the storage unit and extracts the read important words from the classification target document; a subject extraction unit that, based on the important words extracted by the second important word extraction unit and on the upper-level sections and lower-level section numbers stored in the storage unit for the lower-level sections in which those important words appear in the classification destination document, extracts as subjects, for each upper-level section, one or more sets that are constructed so that important words appearing in the same lower-level section belong to the same set and that are minimized so that no important word is shared between sets; a classification destination deriving unit that, based on the important word group constituting a subject of the classification target document extracted by the subject extraction unit and on the lower-level section numbers stored in the storage unit for the sections in which those important words appear in the classification destination document, derives as the classification destination lower-level section the rearmost of the lower-level sections in which the respective important words constituting the subject first appear in the classification destination document; and a display unit that displays the important word group constituting the subject of the classification target document extracted by the subject extraction unit and the classification destination lower-level section derived for that subject by the classification destination deriving unit.
[0008]
The invention according to claim 2 is the document classification support apparatus according to claim 1, wherein the first important word extraction unit extracts important words based on a predetermined part of speech, a sentence expression indicating an important matter, or the word distribution in the classification destination document.
[0010]
The invention according to claim 3 is a computer program for causing a computer used as a document classification support apparatus to execute: a step of reading a classification destination document from a storage unit that stores the classification destination document composed of hierarchical sections, a classification target document to be classified into a lower-level section of the classification destination document, and an association among an important word, the upper-level section containing the lower-level section in which the important word appears in the classification destination document, and a number indicating the order of appearance of that lower-level section within the upper-level section; a step of extracting important words from the read classification destination document and writing each important word to the storage unit in association with the upper-level section containing the lower-level section in which it appears and with the number of that lower-level section; a step of reading the classification target document and the important words from the storage unit and extracting the read important words from the classification target document; a step of extracting as subjects, based on the important words extracted from the classification target document and on the upper-level sections and lower-level section numbers stored in the storage unit for the lower-level sections in which those important words appear in the classification destination document, one or more sets for each upper-level section that are constructed so that important words appearing in the same lower-level section belong to the same set and that are minimized so that no important word is shared between sets; a step of deriving as the classification destination lower-level section, based on the important word group constituting an extracted subject of the classification target document and on the lower-level section numbers stored in the storage unit for the sections in which those important words appear in the classification destination document, the rearmost of the lower-level sections in which the respective important words constituting the subject first appear in the classification destination document; and a step of displaying the important word group constituting the subject of the classification target document and the classification destination lower-level section of the subject.
[0011]
[Embodiments of the invention]
An embodiment of the present invention is described below with reference to the drawings.
First, the characteristics of the document (hereinafter, "classification destination document") that serves as the classification destination of the document whose classification the document classification support apparatus of this embodiment supports (hereinafter, "classification target document") are described. The classification destination document is a document whose content becomes gradually more advanced, such as a reference book or an instruction manual, and has the following characteristics.
(1) It is composed of hierarchical sections consisting of chapters and sections. The finest-grained, lowest-level units are called sections, and the units above them in the hierarchy are called chapters. A chapter at the lowest chapter level therefore consists of a plurality of sections.
(2) Within any one lowest-level chapter, the content becomes gradually more advanced as the sections progress. That is, later content is written on the premise of what has been described earlier. Consequently, an important word that appears in one section also appears in later sections, and the important words that appear accumulate as the sections progress.
(3) Chapters share little related content. Taking textbooks as an example, a given grade and school subject can also be regarded as one chapter.
[0012]
FIG. 1 is a functional block diagram showing the structure of a document classification support apparatus according to an embodiment of the present invention.
The classification destination document database (DB) 101 (storage unit) stores the classification destination document, which is a set of digitized document data, together with the section in which each document is described, that is, information on the chapter and section to which each item of document data belongs. The classification destination document is, for example, a textbook, a reference book, or an operation manual.
The important word database (DB) 103 (storage unit) stores information on important words extracted from the classification destination document and information on important word candidates that are candidates for important words.
The classification target document storage unit 201 (storage unit) stores a classification target document which is digitized document data. The classification target document is, for example, a newspaper article, a column, or a part of an operation manual.
The important word extraction unit 102 (first important word extraction unit) has a function of reading out a classification destination document from the classification destination document DB 101, extracting an important word and an important word candidate, and writing them into the important word DB 103.
The important word extraction unit 202 (second important word extraction unit) has a function of reading out the classification target document from the classification target document storage unit 201 and extracting from it the important words and important word candidates registered in the important word DB 103.
The subject extraction unit 203 has a function of extracting the subjects of the classification target document using the important words extracted from the classification target document by the important word extraction unit 202.
The classification destination derivation unit 204 has a function of deriving a section of the classification destination document to which the important word should be classified based on the subject extracted by the subject extraction unit 203.
The description range deriving unit 205 has a function of deriving a description range of important words in the classification target document.
The display unit 206 has a function of controlling output to a display included in the document classification support apparatus and displaying processing results of the classification destination deriving unit 204 and the description range deriving unit 205.
[0013]
Next, the processing procedure of the document classification support apparatus according to this embodiment will be described. The processing procedure of the document classification support apparatus is composed of two stages: an “important word extraction from a classification destination document” stage and a “classification support for classification target document” stage.
FIG. 2 is a diagram showing a processing procedure for extracting important words from the classification destination document. In the “important word extraction from the classification destination document” stage, first, as a pre-classification stage, an important word is extracted for each section from the classification destination document such as a reference book or an instruction manual.
Step S110:
First, the important word extraction unit 102 reads from the classification destination document DB 101 the classification destination document and the information on the chapter and section to which each item of document data in it belongs, divides the text into words by morphological analysis, and identifies the part of speech of each word.
[0014]
Step S120:
Next, the important word extraction unit 102 extracts important words using the parts of speech of the words obtained in step S110 and the sentence expressions and word distribution in the classification destination document. Specifically, a word that satisfies "(1) part-of-speech condition" below and also satisfies either "(2a) sentence expression condition" or "(2b) word distribution condition" is extracted as an important word. In addition, the important word extraction unit 102 extracts as an important word candidate any word that does not satisfy the conditions for an important word but satisfies "(1) part-of-speech condition" alone.
(1) Part-of-speech condition
Words with specific parts of speech are extracted. For example, words whose part of speech is noun, verb, or adjective are extracted.
(2a) Sentence expression condition
Important words are extracted based on sentence expressions that indicate an important matter. For example, when the morphological analysis result contains the sentence expression
"を/case particle A/noun と/case particle いい/verb ます/auxiliary verb"
(roughly, "... is called A"), the word A is extracted as an important word. Other sentence expressions that indicate an important matter include the following.
"A/noun と/case particle は/binding particle (several words) の/case particle こと/noun です/auxiliary verb" (roughly, "A means ..."; the word A is an important word)
"A/noun に/case particle なる/verb と/conjunctive particle" (roughly, "when ... becomes A"; the word A is an important word)
(2b) Word distribution condition
In general, a word that appears in many sections is often not an important word. Put differently, a word that appears concentrated in and around one place and rarely elsewhere is often important. A word that satisfies the following two conditions is therefore extracted as an important word.
- The ratio of the sections in which the word appears to all sections in the document is at or below a predetermined threshold. The threshold is, for example, 1/5 to 1/10.
- When all sentences in the classification destination document are numbered consecutively, the variance of the numbers of the sentences in which the word appears is at or below a predetermined threshold.
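The two word distribution conditions in (2b) can be checked with a few lines of Python. This is a minimal sketch rather than the implementation used in the embodiment: it assumes the document is already split into sections, sentences, and words, it uses the population variance (the text does not say which variance is meant), and the variance threshold `max_variance=4.0` is a hypothetical value, since only the section ratio threshold (1/5 to 1/10) is given.

```python
from statistics import pvariance


def satisfies_distribution_condition(word, sections,
                                     max_section_ratio=1 / 5,
                                     max_variance=4.0):
    """Check the two word distribution conditions for one word.

    sections: list of sections; each section is a list of sentences;
              each sentence is a list of words (already tokenized).
    max_variance is a hypothetical value; the text gives no number.
    """
    sentence_numbers = []   # serial numbers of the sentences containing the word
    sections_with_word = 0
    serial = 0
    for sentences in sections:
        found_in_section = False
        for sentence in sentences:
            serial += 1     # sentences are numbered across the whole document
            if word in sentence:
                sentence_numbers.append(serial)
                found_in_section = True
        if found_in_section:
            sections_with_word += 1

    if not sentence_numbers:
        return False        # the word never appears

    section_ratio = sections_with_word / len(sections)
    variance = pvariance(sentence_numbers)  # population variance (assumption)
    return section_ratio <= max_section_ratio and variance <= max_variance


# Toy document: 5 sections with 2 sentences each.
toy_sections = [
    [["the", "cell"], ["the", "wire"]],
    [["the", "ohm", "law"], ["the", "ohm", "unit"]],
    [["the", "lamp"], ["the", "switch"]],
    [["the", "meter"], ["the", "scale"]],
    [["the", "motor"], ["the", "coil"]],
]
print(satisfies_distribution_condition("ohm", toy_sections))  # True
print(satisfies_distribution_condition("the", toy_sections))  # False
```

In the toy document, "ohm" appears in one of five sections and only in adjacent sentences, so it passes both conditions, while "the" appears in every section and fails the ratio condition.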
[0015]
Step S130:
The important word extraction unit 102 registers the information on the important words and important word candidates extracted in step S120 in the important word DB 103. That is, it writes to the important word DB 103 important word information, consisting of each important word, its part of speech, the chapters and sections of the classification destination document in which it appears, and its frequency of appearance in each of those sections, and important word candidate information, consisting of each important word candidate, its part of speech, the chapters and sections of the classification destination document in which it appears, and its frequency of appearance in each of those sections.
In this example, the following important word information is assumed to have been written.

[0016]
FIG. 3 is a diagram illustrating a classification support processing procedure for a document to be classified. In “classification support for classification target documents”, first, a subject composed of a group of related important words is extracted from the classification target document, and each subject is classified into sections in the classification destination document. Then, the description range of the classification target document corresponding to each subject is obtained and presented. Further, the classification destination section in the classification destination document is corrected and determined by the user's operation.
Step S210:
First, the important word extraction unit 202 reads the classification target document from the classification target document storage unit 201, divides it into words by morphological analysis, and specifies the part of speech of each word.
[0017]
Step S220:
The important word extraction unit 202 reads the important word information and the important word candidate information from the important word DB 103 and, from among the words obtained in step S210, extracts those that match a read important word or important word candidate.
In this embodiment, it is assumed that important words 1, 2, 3, 4, 5, 6, and 7 are extracted from the classification target document as important words, and that words 8, 9, 10, 11, and 12 are extracted as important word candidates.
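The matching in step S220 can be sketched in Python as follows. This is an illustrative sketch only: it assumes the classification target document has already been split into word tokens, and the function name and the use of plain sets for the contents of the important word DB are assumptions, not part of the described apparatus.

```python
def extract_known_words(tokens, important_words, candidate_words):
    """Return the important words and candidates found among the tokens.

    tokens: words of the classification target document, in document order.
    important_words / candidate_words: sets read from the important word DB.
    Each result list keeps the order of first occurrence and has no duplicates.
    """
    unique_tokens = dict.fromkeys(tokens)  # keeps order, drops repeated tokens
    found_important = [t for t in unique_tokens if t in important_words]
    found_candidates = [t for t in unique_tokens if t in candidate_words]
    return found_important, found_candidates


# Hypothetical token stream for a classification target document.
tokens = ["word8", "important1", "other", "important2", "important1", "word9"]
important, candidates = extract_known_words(
    tokens, {"important1", "important2", "important3"}, {"word8", "word9"}
)
```

Words that are in neither set, such as `other` in the example, are simply ignored.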
[0018]
Step S230:
The subject extraction unit 203 extracts the subjects of the classification target document using the chapters and sections of the classification destination document in which the important words extracted from the classification target document in step S220 appear. That is, the subject extraction unit 203 extracts the important word groups constituting the subjects in the following two steps.
(1) For each chapter of the classification destination document, the group of important words appearing in that chapter is obtained. One important word may be included in multiple chapters.
(2) The important word group of each chapter is clustered (divided) under the condition that “important words appearing in the same section are included in the same cluster”, so as to obtain the smallest clusters that satisfy the condition. Each cluster obtained represents one subject, and the important words included in the same cluster form the important word group constituting that subject.
A specific description is given below using the important word information read from the important word DB 103 in step S220 and the example important words extracted from the classification target document. In chapter 1 of the classification destination document, important word 4 appears in section 1.1, important words 4 and 5 appear in sections 1.2 and 1.3, and there is no other section in which important word 4 or important word 5 appears together with another important word. Therefore, the important word group consisting of important words 4 and 5 represents one subject (referred to as “subject B”). Furthermore, important word 3 appears in sections 1.4 and 1.5, important words 2 and 3 appear in section 1.6, important words 1 and 2 appear in section 1.7, and important words 1, 2, and 3 appear in section 1.8, and there is no section in chapter 1 in which important word 1, 2, or 3 appears together with any other important word. Therefore, the important word group consisting of important words 1, 2, and 3 represents one subject (referred to as “subject A”). Similarly, for chapter 2, the important word group consisting of important words 6 and 7 represents one subject (referred to as “subject C”).
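The two-step grouping of step S230 amounts to finding, within each chapter, the connected groups of important words linked by co-occurrence in a section. The following Python sketch shows one way to do this under stated assumptions: the function name and data layout are hypothetical, the chapter 1 data follows the example above, and the chapter 2 sections for important words 6 and 7 are invented for illustration because the description does not list them.

```python
from collections import defaultdict


def extract_subjects(appearances, extracted_words):
    """Group extracted important words into subjects, chapter by chapter.

    appearances: dict mapping an important word to the (chapter, section) pairs
        in which it appears in the classification destination document.
    extracted_words: important words found in the classification target document.
    Returns a dict mapping each chapter to a list of word sets, one per subject.
    """
    # Step (1): per chapter, collect the extracted words present in each section.
    words_by_section = defaultdict(lambda: defaultdict(set))
    for word in extracted_words:
        for chapter, section in appearances.get(word, []):
            words_by_section[chapter][section].add(word)

    subjects = {}
    for chapter, sections in words_by_section.items():
        # Step (2): words sharing a section must share a cluster, so each
        # section's word set is merged with every cluster it overlaps.
        clusters = []
        for words in sections.values():
            merged = set(words)
            untouched = []
            for cluster in clusters:
                if cluster & merged:
                    merged |= cluster
                else:
                    untouched.append(cluster)
            clusters = untouched + [merged]
        subjects[chapter] = clusters
    return subjects


appearances = {
    "w1": [("1", "1.7"), ("1", "1.8")],
    "w2": [("1", "1.6"), ("1", "1.7"), ("1", "1.8")],
    "w3": [("1", "1.4"), ("1", "1.5"), ("1", "1.6"), ("1", "1.8")],
    "w4": [("1", "1.1"), ("1", "1.2"), ("1", "1.3")],
    "w5": [("1", "1.2"), ("1", "1.3")],
    "w6": [("2", "2.3")],               # assumed sections for chapter 2
    "w7": [("2", "2.2"), ("2", "2.3")],  # assumed sections for chapter 2
}
subjects = extract_subjects(appearances, list(appearances))
```

Because the clusters are kept disjoint at every step, a section's word set can only ever join clusters it directly overlaps, which keeps each resulting cluster as small as the co-occurrence condition allows.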
[0019]
Step S240:
The classification destination derivation unit 204 derives the classification destination section of each subject. That is, for each subject, among the sections in which the important words constituting the subject appear for the first time in the classification destination document (“first-appearing sections”), the last one is set as the classification destination section.
More specifically, the important word group of subject A consists of important words 1, 2, and 3. The first-appearing section of important word 1 is section 1.7, that of important word 2 is section 1.6, and that of important word 3 is section 1.4. Therefore, the first-appearing section 1.7 of important word 1 is the last first-appearing section in the important word group constituting subject A and is the section into which subject A is classified. Similarly, the section into which subject B is classified is the first-appearing section 1.2 of important word 4, and the section into which subject C is classified is the first-appearing section 2.3 of important word 6.
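The rule of step S240, "take the latest of the first-appearing sections", can be written as a short Python sketch. The function names and the dotted section labels are assumptions for illustration; the data for subject A is taken from the example above.

```python
def section_key(section):
    """Turn a section label such as '1.10' into a sortable tuple (1, 10)."""
    return tuple(int(part) for part in section.split("."))


def derive_destination(subject_words, sections_by_word):
    """Return (destination section, deciding word) for one subject.

    sections_by_word: dict mapping each important word to the sections of the
        classification destination document in which it appears.
    """
    # First-appearing section of every important word in the subject.
    first_sections = {
        word: min(sections_by_word[word], key=section_key)
        for word in subject_words
    }
    # The classification destination is the latest of those first appearances.
    deciding_word = max(first_sections, key=lambda w: section_key(first_sections[w]))
    return first_sections[deciding_word], deciding_word


subject_a = {
    "important word 1": ["1.7", "1.8"],
    "important word 2": ["1.6", "1.7", "1.8"],
    "important word 3": ["1.4", "1.5", "1.6", "1.8"],
}
destination, deciding_word = derive_destination(list(subject_a), subject_a)
```

Comparing sections as integer tuples rather than strings avoids ordering section 1.10 before section 1.2.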
[0020]
Step S250:
The classification destination derivation unit 204 instructs the display unit 206 to visually display the classification destination section of each subject, the sections in which the important words appear, and so on. Specifically, the document classification support screen is displayed as follows.
(1) The important word group constituting each subject is displayed, together with the important word candidates that were extracted in step S220 and appear in the same sections as the important words of that group.
(2) The important words are displayed in order starting from the one whose first-appearing section is latest, together with the sections in which each important word appears, its appearance frequency, and its first-appearing section. In addition, the check box corresponding to the classification destination section derived in step S240 is set to ON. The check box corresponding to a section is used to determine the classification destination section.
(3) In the list of chapters and sections of the classification destination document, the classification destination section of each subject is highlighted so that it is displayed differently from the other sections.
[0021]
FIG. 4 is a diagram showing a document classification support screen image.
The document classification support screen displays, vertically in tree form, a list of the chapters that make up the classification destination document and their subordinate sections, and next to each section displays a check box indicating whether that section is the classification destination of a subject. Subjects A, B, and C are displayed side by side horizontally, each with the important word group that constitutes it and the important word candidate group appearing in the same sections as those important words. In the figure, subject A is composed of an important word group consisting of important words 1, 2, and 3 and an important word candidate group consisting of words 8 and 9. Subject B is composed of an important word group consisting of important words 4 and 5 and an important word candidate group consisting of words 10 and 11, and subject C is composed of an important word group consisting of important words 6 and 7 and an important word candidate group consisting of word 12.
Within the important word group of each subject, the important words are presented in order starting from the one whose first-appearing section is latest, together with the sections in which each important word appears and its appearance frequency in each. The first-appearing section of each important word is also highlighted. In the figure, important word 1 of subject A appears twice in its first-appearing section 1.7 and four times in section 1.8; important word 2 appears three times in its first-appearing section 1.6, twice in section 1.7, and three times in section 1.8; and important word 3 appears four times in its first-appearing section 1.4, once in section 1.5, three times in section 1.6, and four times in section 1.8. The first-appearing section 1.7 of important word 1, which is the last first-appearing section in subject A, is highlighted, and the check box beside it is set to ON to indicate that it is the classification destination section of subject A. Similarly, for subject B, the first-appearing section 1.2 of important word 4 is highlighted as the classification destination section and the check box beside it is ON, and for subject C, the first-appearing section 2.3 is highlighted and the check box beside it is ON. This makes it possible to grasp the order in which the important words first appear and to recognize at a glance that the important word with the latest first-appearing section determines the classification destination.
[0022]
Returning to step S250 of FIG. 3, the classification destination derivation unit 204 further reads, from the classification destination document DB 101, the classification destination document and the information on the chapter and section to which each piece of document data in the classification destination document belongs, in accordance with the user's operations on the displayed document classification support screen, and instructs the display unit 206 to display the following screens that support document classification.
(1) When the user selects a chapter or section from the list of chapters and sections of the classification destination document by clicking the mouse, the full text of the corresponding chapter or section in the classification destination document is displayed. In addition, a list of the important words appearing in the displayed chapter or section and their appearance frequencies is displayed. The important words displayed at this time include important words that are not contained in the classification target document.
(2) When the user selects an important word from the important word group constituting a subject by clicking the mouse, the sentences of the classification destination document in which that important word appears and the surrounding sentences are displayed.
(3) When the user selects the appearance frequency shown for a section in which an important word appears by clicking the mouse, the sentences in that section of the classification destination document in which the selected important word appears and the surrounding sentences are displayed.
[0023]
In addition, when the user changes an important word constituting a subject into an important word candidate, or an important word candidate into an important word, by dragging and dropping it with the mouse, the processing from step S240 is performed again in response to the change in the important word group. The sections of the classification destination document in which the newly designated important words appear, their appearance frequencies, and the classification destination section of the subject are extracted, and display of the document classification support screen is instructed. The check box corresponding to the new classification destination section is also set to ON.
[0024]
Step S260:
The description range deriving unit 205 obtains the description range in the classification target document of each subject according to the following procedure and notifies the display unit 206 of the description range, and the display unit 206 displays it on the display. That is, a set of sentences in the classification target document including one or more important words from the important word group constituting each subject is selected as a description range corresponding to the subject to which the important word belongs. Then, an important word group for each subject in the classification target document and a description range of the subject are presented.
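The selection rule of step S260 can be sketched in Python as below. The sketch assumes the classification target document is available as a list of sentences that have already been split into words; the function name and data layout are hypothetical and chosen only for illustration.

```python
def derive_description_ranges(sentences, subjects):
    """Return, for each subject, the indices of the sentences that describe it.

    sentences: list of sentences of the classification target document, each
        given as a list of words.
    subjects: dict mapping a subject name to the set of important words
        constituting that subject.
    """
    ranges = {name: [] for name in subjects}
    for index, words in enumerate(sentences):
        word_set = set(words)
        for name, important_words in subjects.items():
            # A sentence belongs to a subject's description range when it
            # contains at least one important word of that subject.
            if word_set & important_words:
                ranges[name].append(index)
    return ranges


sentences = [
    ["w6", "and", "w7"],
    ["nothing", "relevant"],
    ["w5", "then", "w4"],
    ["w1", "w2", "w3"],
]
subjects = {"A": {"w1", "w2", "w3"}, "B": {"w4", "w5"}, "C": {"w6", "w7"}}
ranges = derive_description_ranges(sentences, subjects)
```

A sentence that mentions important words from two subjects is included in both description ranges, since the rule tests each subject independently.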
[0025]
FIG. 5 is a diagram showing a display screen image of the description range of each subject in the classification target document. In the figure, the description range corresponding to subject C, which contains important words 6 and 7, the description range corresponding to subject B, which contains important words 5 and 4, and the description range corresponding to subject A, which contains important words 1, 2, and 3, are presented within the classification target document.
[0026]
Step S270:
In step S270 of FIG. 3, the classification destination derivation unit 204 determines a classification destination according to correction and selection of the classification destination by the following operation performed by the user on the document classification support screen.
(1) The user again changes the important word of each subject to an important word candidate or changes the important word candidate to an important word. In accordance with this operation, the classification destination section is automatically corrected, and the check box corresponding to the new classification destination section is turned ON.
(2) The user clicks the check box corresponding to the section of the classification destination to set ON / OFF and select the classification destination.
(3) After selecting the classification destination, the user performs an operation such as clicking a “classification destination determination” button with the mouse to determine the classification destination. The classification destination derivation unit 204 internally stores each section whose check box is set to ON as a classification destination section of the subject.
[0027]
The following are examples of usage images of the document classification support apparatus according to the present embodiment.
(1) To allow school teachers to use documents such as newspaper articles published on the network as supplementary teaching materials for classes, the apparatus supports the classification of such documents into textbook sections (units covering a certain range of learning). Since a classification destination section is presented for each subject contained in a document, the correct classification destination section can be found efficiently, and supplementary teaching materials corresponding to each section of the textbook can be accumulated in a short time.
(2) When a person trying to use a device encounters a sentence or term that does not make sense, the sentence or term is automatically classified into a section of the manual for that device to assist understanding of its content. By referring to the explanation in the classification destination section, the person can understand the content of the sentence or term.
[0028]
According to this embodiment, in the automatic classification of documents into document sets such as reference books and instruction manuals, the classification target documents can be classified using the first-appearing sections in the document set. It is therefore possible to classify the classification target documents into more appropriate sections (categories) than with conventional automatic classification methods.
In addition, the plural subjects of a classification target document can be represented by important word groups extracted from the classification destination document. Therefore, by displaying the important word group constituting each subject of the classification target document, the user can grasp at a glance which subjects the classification target document contains, and the efficiency of the classification work increases.
In addition, by displaying the important words of each subject in order starting from the one whose first-appearing section is latest, the user can see at a glance that the important word with the latest first-appearing section determines the classification destination, which improves the efficiency of the classification work.
In addition, since the sections of the classification destination document in which each important word appears are presented, the user can deepen their understanding of the classification target document by referring to the sections of the classification destination document that contain an important word whose meaning is unknown.
[0029]
The classification destination document DB 101 and the classification target document storage unit 201 may instead store the storage location of a document, such as a URI (Uniform Resource Identifier) at which the document is published. In this case, the document indicated by the storage location is read and the above processing is performed.
In addition, the threshold associated with the part-of-speech condition or the word distribution condition in step S120 may be changed by a user operation.
[0030]
The document classification support apparatus described above has a computer system inside. The operation process of the document classification support apparatus described above is stored in a computer-readable recording medium in the form of a program, and the above-described processing is performed by the computer system reading and executing this program. The computer system here includes an OS and hardware such as peripheral devices.
[0031]
The “computer-readable recording medium” refers to a ROM, a portable medium such as a magnetic disk, a magneto-optical disk, a CD-ROM, or a DVD-ROM, or a storage device such as a hard disk built into a computer system. The “computer-readable recording medium” also includes media that hold a program for a certain period of time, such as the volatile memory (RAM) in a computer system serving as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
[0032]
The program may be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the “transmission medium” for transmitting the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line.
The program may realize only a part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.
[0033]
[Effects of the invention]
According to the present invention, in the automatic classification of documents into document sets such as reference books and instruction manuals, the classification target documents can be classified using the first-appearing sections in the document set. It is therefore possible to classify the classification target documents into more appropriate sections (categories) than with conventional automatic classification methods.
In addition, the plural subjects of a classification target document can be represented by important word groups extracted from the classification destination document. Therefore, by displaying the important word group constituting each subject of the classification target document, the user can grasp at a glance which subjects the classification target document contains, and the efficiency of the classification work increases.
[Brief description of the drawings]
FIG. 1 is a functional block diagram showing the configuration of a document classification support apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing a processing procedure for extracting important words from a classification destination document according to the embodiment.
FIG. 3 is a diagram showing a processing procedure for classification support of a classification target document according to the embodiment.
FIG. 4 is a diagram showing a document classification support screen image according to the embodiment.
FIG. 5 is a diagram showing a display screen image of the description range of each subject in the classification target document according to the embodiment.
[Explanation of symbols]
101 ... Classification destination document DB (database)
102, 202 ... Important word extraction unit
103 ... Important word DB (database)
201 ... Classification target document storage unit
203 ... Subject extraction unit
204 ... Classification destination derivation unit
205 ... Description range deriving unit
206 ... Display unit

Claims

A document classification support apparatus comprising:
a storage unit that stores a classification destination document composed of hierarchized sections, a classification target document to be classified into a lower-hierarchy section of the classification destination document, important words, and, for each important word, a correspondence between the upper-hierarchy section containing the lower-hierarchy section in which the important word appears in the classification destination document and a number indicating the order of appearance of that lower-hierarchy section within the upper-hierarchy section;
a first important word extraction unit that reads the classification destination document from the storage unit, extracts important words, and writes each important word to the storage unit in association with the upper-hierarchy section of the lower-hierarchy section in which the important word appears and the number of that lower-hierarchy section;
a second important word extraction unit that reads the classification target document and the important words from the storage unit and extracts the read important words from the classification target document;
a subject extraction unit that, based on the important words extracted by the second important word extraction unit and on the upper-hierarchy sections and lower-hierarchy section numbers stored in the storage unit for the lower-hierarchy sections in which those important words appear in the classification destination document, extracts as subjects, for each upper-hierarchy section, one or more sets of important words that are formed so that important words appearing in the same lower-hierarchy section are included in the same set and that are minimal subject to sharing no important word with one another;
a classification destination derivation unit that, based on the important words constituting a subject of the classification target document extracted by the subject extraction unit and on the numbers, stored in the storage unit, of the lower-hierarchy sections in which those important words appear in the classification destination document, derives as the classification destination lower-hierarchy section the last of the lower-hierarchy sections in which the important words constituting the subject each appear for the first time in the classification destination document; and
a display unit that displays the important word group constituting each subject of the classification target document extracted by the subject extraction unit and the classification destination lower-hierarchy section derived for that subject by the classification destination derivation unit.

The document classification support apparatus according to claim 1, wherein the first important word extraction unit extracts important words based on a predetermined part of speech, a sentence expression indicating an important matter, or the distribution of words in the classification destination document.

A computer program that causes a computer used as a document classification support apparatus to execute the steps of:
reading a classification destination document from a storage unit that stores the classification destination document composed of hierarchized sections, a classification target document to be classified into a lower-hierarchy section of the classification destination document, important words, and, for each important word, a correspondence between the upper-hierarchy section containing the lower-hierarchy section in which the important word appears in the classification destination document and a number indicating the order of appearance of that lower-hierarchy section within the upper-hierarchy section;
extracting important words from the read classification destination document and writing each important word to the storage unit in association with the upper-hierarchy section of the lower-hierarchy section in which the important word appears and the number of that lower-hierarchy section;
reading the classification target document and the important words from the storage unit and extracting the read important words from the classification target document;
extracting as subjects, based on the important words extracted from the classification target document and on the upper-hierarchy sections and lower-hierarchy section numbers stored in the storage unit for the lower-hierarchy sections in which those important words appear in the classification destination document, for each upper-hierarchy section, one or more sets of important words that are formed so that important words appearing in the same lower-hierarchy section are included in the same set and that are minimal subject to sharing no important word with one another;
deriving, based on the important word group constituting an extracted subject of the classification target document and on the numbers, stored in the storage unit, of the lower-hierarchy sections in which those important words appear in the classification destination document, as the classification destination lower-hierarchy section, the last of the lower-hierarchy sections in which the important words constituting the subject each appear for the first time in the classification destination document; and
displaying the important word group constituting each subject of the classification target document and the classification destination lower-hierarchy section of that subject.