JP3847992B2

JP3847992B2 - Database management apparatus and program recording medium

Info

Publication number: JP3847992B2
Application number: JP36637598A
Authority: JP
Inventors: 定彦三須; 敬正二木; 祐一川西; 公敏内藤; 愼宮澤; 辰也伊藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1998-12-24
Filing date: 1998-12-24
Publication date: 2006-11-22
Anticipated expiration: 2018-12-24
Also published as: JP2000187604A

Description

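[0001]
[Technical field to which the invention belongs]
The present invention relates to a database management apparatus that, when entry data is registered in a database managing entry data having registration IDs, checks for duplicate entries in that database, and to a program recording medium on which a program used to implement the apparatus is recorded. In particular, it relates to a database management apparatus that can check for duplicate entries at high speed, and to a program recording medium on which a program used to implement that apparatus is recorded.
[0002]
[Prior art]
The substance of a gene is a material called DNA. DNA consists of four kinds of bases linked in a single row, and this ordering of bases is called a base sequence. Proteins are the main substances that make up the body of an organism and carry out various functions in vivo, such as those of enzymes, and the linear sequence of the 20 kinds of amino acids that make up a protein is determined by the base sequence of the gene. Specifically, each set of three consecutive bases corresponds to one amino acid.
[0003]
A gene sequence, expressed as a DNA base sequence or a protein amino acid sequence, can be written as a character string in which each base or each amino acid is one character. For example, the four kinds of bases are represented by A, T, G, and C, the initial letters of their names.
[0004]
Gene sequences written as character strings are accumulated in DNA databases. The gene base sequences determined to date are registered in the DDBJ / EMBL / GenBank international nucleotide sequence database, which is built jointly by Japan, Europe, and the United States.
[0005]
FIG. 9 shows the internal structure of this DNA database. Each DNA data item registered in the DNA database is given a unique accession number (U75865 in the figure) and, in addition to the base sequence information of the gene, describes related information such as the name of the organism that has the gene and the name of the protein produced from the gene. Similarly, amino acid sequence databases such as PIR / SWISS-PROT have been built for proteins.
[0006]
Accession numbers used to consist of six characters, "one letter + five digits". As the volume of DNA data registered in the DNA database has grown, eight-character accession numbers, "two letters + six digits", have recently come into use.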
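The two accession number formats described above can be told apart mechanically. The following is a minimal sketch, assuming uppercase ASCII letters and using a hypothetical helper name `split_accession`; neither detail comes from the patent text.

```python
import re

# Assumption: letters are uppercase ASCII; any other shape is rejected.
OLD_FORMAT = re.compile(r"[A-Z][0-9]{5}")     # one letter + five digits
NEW_FORMAT = re.compile(r"[A-Z]{2}[0-9]{6}")  # two letters + six digits


def split_accession(accession: str) -> tuple[str, str]:
    """Return (letter part, numeric part) of an accession number."""
    if OLD_FORMAT.fullmatch(accession):
        return accession[:1], accession[1:]
    if NEW_FORMAT.fullmatch(accession):
        return accession[:2], accession[2:]
    raise ValueError(f"unrecognized accession number: {accession!r}")


print(split_accession("U75865"))    # ('U', '75865')
print(split_accession("AB012345"))  # ('AB', '012345')
```

The numeric part is kept as a string so that leading zeros survive.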
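[0007]
The DNA database consists of a release database, which is released periodically, and a daily update database, which stores the DNA data registered after the release database was issued. DNA data is registered in the daily update database every day, and every few months the DNA data in the daily update database is added to the release database and released.
[0008]
The DNA data registered in the daily update database includes, in addition to new DNA data, DNA data that corrects or changes DNA data already registered in the release database. As a result, DNA data with the same accession number can exist in both the daily update database and the release database. Furthermore, when the same DNA data is corrected or changed several times in succession, multiple DNA data items with the same accession number exist in the daily update database.
[0009]
Because, among other reasons, the organizations that issue the DNA database (DDBJ / EMBL / GenBank) do not necessarily assign accession numbers to new DNA data consecutively, it is not possible to tell whether DNA data (an accession number) registered in the daily update database is DNA data (an accession number) already registered in the release database.
[0010]
Data retrieval by accession number is common in biological research, but when multiple DNA data items have the same accession number, retrieval does not give correct results. The DNA data registered in the daily update database must therefore be made non-redundant when it is added to the release database.
[0011]
Conventionally, non-redundancy has been achieved by first splitting all DNA data into entries (one per accession number), expanding each entry on disk as a file named after its accession number, and expanding the new DNA data and the updated DNA data into the same area in the same way.
[0012]
With this method, a duplicate entry produces more than one file with the same name, and the newer file overwrites the older one, which achieves non-redundancy. Merging the expanded files again then yields a completely non-redundant DNA database.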
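The conventional file-based method can be illustrated with a small sketch. It assumes each entry is an (accession number, data) pair and uses a temporary directory as the expansion area; the function name `make_non_redundant` and the sample data are hypothetical.

```python
import tempfile
from pathlib import Path


def make_non_redundant(release, updates):
    """Conventional approach: one file per accession number, newer writes overwrite older ones."""
    with tempfile.TemporaryDirectory() as workdir:
        # Expand release entries first, then update entries, into the same area.
        for accession, data in list(release) + list(updates):
            (Path(workdir) / accession).write_text(data)  # same file name => overwrite
        # Merge the expanded files back into a single entry list.
        return [(p.name, p.read_text()) for p in sorted(Path(workdir).iterdir())]


release = [("AB000001", "old sequence"), ("AB000002", "unchanged")]
updates = [("AB000001", "corrected sequence"), ("AB000003", "new entry")]
print(make_non_redundant(release, updates))
```

Every entry costs one file write and one file read, whether or not it is a duplicate, which is the cost the following paragraphs identify as the problem.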
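[0013]
[Problems to be solved by the invention]
Recently, the DNA data registered in the DNA database has been growing by a factor of 1.5 to 2 per year and is expected to grow further.
[0014]
Consequently, when non-redundancy is achieved with the conventional technique, splitting the entries and merging the files take so much time that it has become difficult to finish the non-redundancy processing within 24 hours.
[0015]
This problem is not limited to DNA databases. It arises in the same way whenever a large amount of entry data is registered in a database that manages entry data in random order, without regard to registration ID.
[0016]
The present invention has been made in view of these circumstances. Its object is to provide a new database management apparatus that can check at high speed for duplicate entries in a database when entry data is registered in a database managing entry data having registration IDs, and to provide a new program recording medium on which a program used to implement the apparatus is recorded.
[0017]
[Means for solving the problems]
To achieve this object, the database management apparatus of the present invention, in order to check for duplicate entries in a database when entry data is registered in a database managing entry data having registration IDs, comprises: (1) creating means that extracts, from the registration ID of the entry data to be registered, the character string information at a prescribed position in that registration ID, thereby creating the character string information used to judge whether a registration ID may be registered in the database; (2) first judging means that reads the registration IDs of the entry data from the database one at a time and judges whether the character string information at the prescribed position in each registration ID matches the character string information created by the creating means, thereby judging whether that registration ID may be the registration ID of the entry data to be registered; and (3) second judging means that judges whether a registration ID that the first judging means has judged to be a possible match is identical to the registration ID of the entry data to be registered, thereby judging whether the entry data to be registered is duplicate entry data.
In this configuration, when there are multiple entry data items to be registered, the creating means may extract the character string information at the prescribed position in the registration ID of each entry data item and create a table that records the list of extracted character string information, as the character string information used to judge whether a registration ID may be registered in the database. In that case, the first judging means judges whether the character string information at the prescribed position in a registration ID read from the database is recorded in the table, thereby judging whether that registration ID may be the registration ID of an entry data item to be registered.
Alternatively, in this configuration, when there are multiple entry data items to be registered, the creating means may extract the character string information at the prescribed position in the registration ID of each entry data item and create the table by recording information indicating the presence of each extracted character string in a table that lists every value the character string at the prescribed position can take and records, for each value, whether it is present. In that case, the first judging means judges whether the character string information at the prescribed position in a registration ID read from the database is recorded in the table as present, thereby judging whether that registration ID may be the registration ID of an entry data item to be registered.
The functions of the database management apparatus of the present invention are specifically implemented by a program. The program is stored on a floppy disk or the like, or on a disk or the like of a server or the like, is installed from there into the database management apparatus of the present invention, and runs in memory, thereby realizing the present invention.
Next, the present invention is described in more detail with reference to the principle configuration diagram shown in FIG. 1.
The principle configuration diagram in FIG. 1 assumes that there are multiple entry data items to be registered.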
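[0018]
In the figure, 1 is a database management apparatus embodying the present invention; 2 is a database that manages entry data in random order, without regard to registration ID; and 3 is a registration database that stores the entry data to be registered in the database 2.
[0019]
The database management apparatus 1 of the present invention checks for duplicate entries in the database 2 when the large amount of entry data stored in the registration database 3 is registered in the database 2. To implement this function, it comprises creating means 10, a character list table 11, a hash table 12, judging means 13, and search means 14.
[0020]
The creating means 10 creates, from the registration IDs of the entry data to be registered, the character list table 11 and the hash table 12 used to search for whether a registration ID may be registered in the database 2.
[0021]
The character list table 11 is created by the creating means 10 and manages the list of characters found at a prescribed position in the registration IDs of the entry data to be registered. The hash table 12 is created by the creating means 10 and manages the same kind of list by recording presence or absence against a list of every character string that can occur at the prescribed position of a registration ID.
[0022]
The judging means 13 reads the registration IDs of the entry data from the database 2 one at a time and judges whether each registration ID read may be the registration ID of entry data stored in the registration database 3. The search means 14 searches for the duplicate entry data contained in the entry data to be registered.
[0023]
The functions of the database management apparatus 1 of the present invention are specifically implemented by a program. The program is stored on a floppy disk or the like, or on a disk or the like of a server or the like, is installed from there into the database management apparatus 1, and runs in memory, thereby realizing the present invention.
[0024]
In the database management apparatus 1 of the present invention configured in this way, when a request is issued to register the entry data stored in the registration database 3 in the database 2, the creating means 10 creates the character list table 11 and the hash table 12 from the registration IDs of the entry data to be registered.
[0025]
Following this creation processing, the judging means 13 reads the registration IDs of the entry data from the database 2 one at a time and uses the created character list table 11 and hash table 12 to judge whether each registration ID read may be the registration ID of entry data stored in the registration database 3.
[0026]
Following this judgment processing, the search means 14 compares each registration ID that the judging means 13 has judged to be a possible match with the registration IDs of the individual entry data items stored in the registration database 3, thereby finding the duplicate entry data stored in the registration database 3.
[0027]
Thus, when multiple entry data items are registered in the database 2, which manages entry data having registration IDs, the database management apparatus 1 of the present invention creates the character list table 11 and the hash table 12 from the registration IDs of the entry data to be registered, thereby creating the character string information used to judge whether a registration ID may be registered in the database 2. It uses this information to extract, from the registration IDs of the entry data managed in the database 2, those that may be registration IDs of entry data to be registered, and compares the extracted registration IDs with the registration IDs of the entry data to be registered to find the duplicate entry data contained in the entry data to be registered. Duplicate entries in the database 2 can therefore be checked at high speed.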
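The two-stage principle described above (a cheap positional filter, then an exact comparison only for the survivors) can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the prescribed position is taken to be the first three characters, and the function name `find_duplicates` is hypothetical.

```python
def find_duplicates(database_ids, new_ids, position=slice(0, 3)):
    """Two-stage check: positional filter first, exact comparison second."""
    new_ids = list(new_ids)
    # Creating means: collect the substrings found at the prescribed position.
    keys = {reg_id[position] for reg_id in new_ids}
    duplicates = []
    for reg_id in database_ids:            # read the database IDs one at a time
        if reg_id[position] not in keys:   # judging means: cannot be a duplicate
            continue
        if reg_id in new_ids:              # search means: exact comparison
            duplicates.append(reg_id)
    return duplicates


database_ids = ["AB000001", "CD123456", "AB000009", "AB999999"]
new_ids = ["AB000001", "EF000002"]
print(find_duplicates(database_ids, new_ids))  # ['AB000001']
```

Here `AB000009` passes the filter but fails the exact comparison, while `CD123456` and `AB999999` are rejected by the filter alone and never reach the more expensive search.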
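[0028]
[Embodiments of the invention]
The present invention is described in detail below according to embodiments.
[0029]
FIG. 2 shows one embodiment of the database management apparatus 1 of the present invention. In the figure, elements identical to those described for FIG. 1 are given the same reference symbols.
[0030]
2a is a DNA release database, which manages DNA data and is released periodically; 3a is a DNA update database, which stores the DNA data registered after the DNA release database 2a was issued.
[0031]
The database management apparatus 1 of the present invention registers the DNA data stored in the DNA update database 3a in the DNA release database 2a. It comprises a table creation program 100, the character list table 11 and hash table 12 created by the table creation program 100, a duplicate entry search program 200, a deletion list 300 created by the duplicate entry search program 200, and an entry data registration program 400.
[0032]
The table creation program 100, the duplicate entry search program 200, and the entry data registration program 400 are installed via a floppy disk, a communication line, or the like.
[0033]
In the following, for convenience of explanation, accession numbers of DNA data are assumed to use only the eight-character form, "two letters + six digits".
[0034]
As shown in FIG. 2, the character list table 11 consists of, for example, a first character list table 110, a second character list table 111, and a third character list table 112, and the hash table 12 consists of, for example, a first hash table 120, a second hash table 121, and a third hash table 122.
[0035]
FIG. 3 shows one embodiment of the character list table 11, and FIG. 4 shows one embodiment of the hash table 12.
[0036]
As shown in FIG. 3, the first character list table 110 covers the accession numbers of the DNA data stored in the DNA update database 3a and manages the list of their first characters. The second character list table 111 covers the same accession numbers and manages the list of their first two characters. The third character list table 112 covers the same accession numbers and manages the list of their first three characters.
[0037]
For example, when the accession numbers of the DNA data stored in the DNA update database 3a have four different first characters, A, C, D, and W, the first character list table 110 manages these four characters.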
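The three character list tables can be pictured as three sets of prefixes. The sketch below assumes Python sets as the table representation and uses a hypothetical function name and made-up accession numbers chosen to reproduce the A, C, D, W example.

```python
def build_character_lists(accessions):
    """Return three sets: the leading 1, 2 and 3 characters of every accession number."""
    tables = (set(), set(), set())
    for accession in accessions:
        for length, table in enumerate(tables, start=1):
            table.add(accession[:length])
    return tables


# Hypothetical update-database accession numbers.
updates = ["AB111222", "CA123456", "DX222000", "WZ111999", "AB123000"]
first, second, third = build_character_lists(updates)
print(sorted(first))   # ['A', 'C', 'D', 'W']
print(sorted(second))  # ['AB', 'CA', 'DX', 'WZ']
print(sorted(third))   # ['AB1', 'CA1', 'DX2', 'WZ1']
```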
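[0038]
As shown in FIG. 4, the first hash table 120 has a table structure made up of pairs, each consisting of a digit string that the upper three digits of the numeric part of an accession number can take (000 to 999) and a flag indicating whether an accession number with that digit string exists. Covering the accession numbers of the DNA data stored in the DNA update database 3a, it manages the upper three digits of their numeric parts by setting to "1" the flag paired with each such digit string.
[0039]
The second hash table 121 has the same table structure for the middle three digits of the numeric part of an accession number (000 to 999). Covering the accession numbers of the DNA data stored in the DNA update database 3a, it manages the middle three digits of their numeric parts by setting to "1" the flag paired with each such digit string.
[0040]
The third hash table 122 has the same table structure for the lower three digits of the numeric part of an accession number (000 to 999). Covering the accession numbers of the DNA data stored in the DNA update database 3a, it manages the lower three digits of their numeric parts by setting to "1" the flag paired with each such digit string.
[0041]
For example, when the accession numbers of the DNA data stored in the DNA update database 3a have three different digit strings, "111", "123", and "222", in the upper three digits of the numeric part, the first hash table 120 manages them by setting to "1" the flag paired with "111", the flag paired with "123", and the flag paired with "222".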
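Each hash table described above is in effect a 1000-slot presence array indexed directly by a three-digit string. The sketch below makes one assumption that the text leaves open: with a six-digit numeric part, "middle three digits" is taken to mean digits 2 to 4. The function name and the sample accession numbers are hypothetical.

```python
def build_flag_tables(accessions):
    """Return three 1000-slot flag lists for the upper, middle and lower digit triples."""
    # Assumption: the numeric part is the last six characters,
    # and "middle" means digits 2-4 of that part.
    windows = (slice(0, 3), slice(1, 4), slice(3, 6))
    tables = [[0] * 1000 for _ in windows]
    for accession in accessions:
        digits = accession[-6:]
        for table, window in zip(tables, windows):
            table[int(digits[window])] = 1  # the digit string indexes its own flag
    return tables


upper, middle, lower = build_flag_tables(["AB111222", "CA123456", "DX222000"])
print([n for n, flag in enumerate(upper) if flag])  # [111, 123, 222]
```

Because the digit string is its own index, a membership test is a single array lookup, with no scan over a list of values.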
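[0042]
FIGS. 5 and 6 show one embodiment of the processing flow executed by the table creation program 100, and FIGS. 7 and 8 show one embodiment of the processing flow executed by the duplicate entry search program 200. The processing of the database management apparatus 1 of the present invention configured in this way is described in detail below according to these processing flows.
[0043]
When a request is issued to register the DNA data stored in the DNA update database 3a in the DNA release database 2a, the table creation program 100, as shown in the processing flow of FIGS. 5 and 6, first sets the variable i to the initial value "1" in step 1.
[0044]
Next, in step 2, it reads the accession number of the i-th DNA data item stored in the DNA update database 3a. In step 3, it extracts the first character of that accession number. In step 4, it judges whether the extracted first character is already recorded in the first character list table 110. If it is not yet recorded, the program proceeds to step 5 and records the extracted first character in the first character list table 110; if it is already recorded, step 5 is skipped.
[0045]
In step 6, it extracts the first two characters of the accession number read in step 2. In step 7, it judges whether the extracted first two characters are already recorded in the second character list table 111. If they are not yet recorded, the program proceeds to step 8 and records them in the second character list table 111; if they are already recorded, step 8 is skipped.
[0046]
In step 9, it extracts the first three characters of the accession number read in step 2. In step 10, it judges whether the extracted first three characters are already recorded in the third character list table 112. If they are not yet recorded, the program proceeds to step 11 and records them in the third character list table 112; if they are already recorded, step 11 is skipped.
[0047]
In step 12 (processing flow of FIG. 6), it extracts the upper three digits of the numeric part of the accession number read in step 2. In step 13, it judges whether the flag in the first hash table 120 indicated by the extracted digit string is already "1". If it is not yet "1", the program proceeds to step 14 and sets the flag to "1", thereby recording the extracted digit string in the first hash table 120; if it is already "1", step 14 is skipped.
[0048]
In step 15, it extracts the middle three digits of the numeric part of the accession number read in step 2. In step 16, it judges whether the flag in the second hash table 121 indicated by the extracted digit string is already "1". If it is not yet "1", the program proceeds to step 17 and sets the flag to "1", thereby recording the extracted digit string in the second hash table 121; if it is already "1", step 17 is skipped.
[0049]
In step 18, it extracts the lower three digits of the numeric part of the accession number read in step 2. In step 19, it judges whether the flag in the third hash table 122 indicated by the extracted digit string is already "1". If it is not yet "1", the program proceeds to step 20 and sets the flag to "1", thereby recording the extracted digit string in the third hash table 122; if it is already "1", step 20 is skipped.
[0050]
In step 21, it increments the variable i by one, and in step 22 it judges whether i exceeds the number of DNA data items stored in the DNA update database 3a. If it does not, the program returns to step 2; if it does, all processing ends.
[0051]
In this way, when a request is issued to register the DNA data stored in the DNA update database 3a in the DNA release database 2a, the table creation program 100 creates the first character list table 110, the second character list table 111, the third character list table 112, the first hash table 120, the second hash table 121, and the third hash table 122 from the accession numbers of the DNA data stored in the DNA update database 3a.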
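The flow of FIGS. 5 and 6 can be followed step by step in a short sketch. It assumes eight-character accession numbers held in a list, sets for the character list tables, and digits 2 to 4 of the numeric part as the "middle three digits"; these choices and the name `create_tables` are illustrative assumptions.

```python
def create_tables(update_accessions):
    """Sketch of the table creation flow for 'two letters + six digits' accession numbers."""
    char_lists = [set(), set(), set()]             # tables 110, 111, 112
    flag_tables = [[0] * 1000 for _ in range(3)]   # tables 120, 121, 122
    # Upper, middle (assumed: digits 2-4) and lower three digits of the numeric part.
    digit_windows = (slice(2, 5), slice(3, 6), slice(5, 8))
    i = 1                                          # step 1
    while i <= len(update_accessions):             # loop test corresponding to step 22
        accession = update_accessions[i - 1]       # step 2
        for n, table in enumerate(char_lists, start=1):
            prefix = accession[:n]                 # steps 3, 6, 9
            if prefix not in table:                # steps 4, 7, 10
                table.add(prefix)                  # steps 5, 8, 11
        for table, window in zip(flag_tables, digit_windows):
            index = int(accession[window])         # steps 12, 15, 18
            if table[index] != 1:                  # steps 13, 16, 19
                table[index] = 1                   # steps 14, 17, 20
        i += 1                                     # step 21
    return char_lists, flag_tables


char_lists, flag_tables = create_tables(["AB111222", "CA123456"])
print(sorted(char_lists[2]))  # ['AB1', 'CA1']
```

The whole update database is read exactly once, and the resulting tables stay small regardless of how many entries it holds.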
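[0052]
When the duplicate entry search program 200 is started on completion of the table creation program 100, it first sets the variable i to the initial value "1" in step 1, as shown in the processing flow of FIGS. 7 and 8.
[0053]
Next, in step 2, it reads the accession number of the i-th DNA data item stored in the DNA release database 2a. In step 3, it extracts the first character of that accession number. In step 4, it judges whether the extracted first character is recorded in the first character list table 110. If it is not recorded, the program judges that no DNA data with that accession number exists in the DNA update database 3a and proceeds to step 18 (processing flow of FIG. 8), described later.
[0054]
If, in step 4, the first character extracted in step 3 is found to be recorded in the first character list table 110, the program proceeds to step 5 and extracts the first two characters of the accession number read in step 2. In step 6, it judges whether the extracted first two characters are recorded in the second character list table 111. If they are not recorded, the program judges that no DNA data with that accession number exists in the DNA update database 3a and proceeds to step 18 (processing flow of FIG. 8), described later.
[0055]
If, in step 6, the first two characters extracted in step 5 are found to be recorded in the second character list table 111, the program proceeds to step 7 and extracts the first three characters of the accession number read in step 2. In step 8, it judges whether the extracted first three characters are recorded in the third character list table 112. If they are not recorded, the program judges that no DNA data with that accession number exists in the DNA update database 3a and proceeds to step 18 (processing flow of FIG. 8), described later.
[0056]
If, in step 8, the first three characters extracted in step 7 are found to be recorded in the third character list table 112, the program proceeds to step 9 and extracts the upper three digits of the numeric part of the accession number read in step 2. In step 10, it judges whether the flag in the first hash table 120 indicated by the extracted digit string is "1". If the flag is "0", the program judges that no DNA data with that accession number exists in the DNA update database 3a and proceeds to step 18 (processing flow of FIG. 8), described later.
[0057]
If, in step 10, the flag in the first hash table 120 indicated by the digit string extracted in step 9 is found to be "1", the program proceeds to step 11 and extracts the middle three digits of the numeric part of the accession number read in step 2. In step 12 (processing flow of FIG. 8), it judges whether the flag in the second hash table 121 indicated by the extracted digit string is "1". If the flag is "0", the program judges that no DNA data with that accession number exists in the DNA update database 3a and proceeds to step 18, described later.
[0058]
If, in step 12, the flag in the second hash table 121 indicated by the digit string extracted in step 11 is found to be "1", the program proceeds to step 13 and extracts the lower three digits of the numeric part of the accession number read in step 2. In step 14, it judges whether the flag in the third hash table 122 indicated by the extracted digit string is "1". If the flag is "0", the program judges that no DNA data with that accession number exists in the DNA update database 3a and proceeds to step 18, described later.
[0059]
If, in step 14, the flag in the third hash table 122 indicated by the digit string extracted in step 13 is found to be "1", the program judges that DNA data with the accession number read in step 2 may exist in the DNA update database 3a, proceeds to step 15, and searches the DNA update database 3a for DNA data with that accession number.
[0060]
In step 16, according to the search result of step 15, it judges whether DNA data with the accession number read in step 2 exists in the DNA update database 3a. If it exists, that is, if the DNA data with the accession number read in step 2 is to be updated, the program proceeds to step 17 and records that accession number in the deletion list 300; if it does not exist, step 17 is skipped.
[0061]
In step 18, it increments the variable i by one, and in step 19 it judges whether i exceeds the number of DNA data items stored in the DNA release database 2a. If it does not, the program returns to step 2; if it does, all processing ends.
[0062]
In this way, once the table creation program 100 has finished, the duplicate entry search program 200 uses the character list table 11 and the hash table 12 created by the table creation program 100 to judge whether each DNA data item stored in the DNA release database 2a may be updated. Where it may be, the program searches the DNA update database 3a to judge whether it actually is updated, and records the accession numbers of the DNA data to be updated in the deletion list 300.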
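The cascade of checks in FIGS. 7 and 8 can be sketched as below. For brevity the tables are rebuilt with set comprehensions rather than flag arrays, the "middle three digits" are again assumed to be digits 2 to 4 of the numeric part, and the names `build_tables` and `find_updated_entries` are hypothetical.

```python
# Upper, middle (assumed: digits 2-4) and lower three digits of an 8-character accession number.
WINDOWS = (slice(2, 5), slice(3, 6), slice(5, 8))


def build_tables(update_accessions):
    """Compact stand-in for the table creation step (sets instead of flag arrays)."""
    prefixes = [{a[:n] for a in update_accessions} for n in (1, 2, 3)]
    digit_sets = [{a[w] for a in update_accessions} for w in WINDOWS]
    return prefixes, digit_sets


def find_updated_entries(release_accessions, update_accessions):
    """Return the deletion list: release accession numbers that also occur in the updates."""
    update_accessions = list(update_accessions)
    prefixes, digit_sets = build_tables(update_accessions)
    deletion_list = []
    for accession in release_accessions:  # steps 1, 2, 18, 19
        # Steps 3-14: six cheap checks; all() stops at the first one that fails.
        may_exist = all(accession[:n] in prefixes[n - 1] for n in (1, 2, 3)) and all(
            accession[w] in digits for w, digits in zip(WINDOWS, digit_sets)
        )
        if may_exist and accession in update_accessions:  # steps 15, 16
            deletion_list.append(accession)               # step 17
    return deletion_list


release = ["AB111222", "AC111222", "ZZ000000"]
updates = ["AB111222", "AC111999"]
print(find_updated_entries(release, updates))  # ['AB111222']
```

In this example `AC111222` passes all six table checks even though it is not in the update list, which is why the exact search of steps 15 and 16 is still needed; `ZZ000000` is rejected by the very first check.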
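[0063]
Once the deletion list 300 has been created by the processing of the duplicate entry search program 200, the entry data registration program 400 registers the DNA data stored in the DNA update database 3a in the DNA release database 2a while deleting from the DNA release database 2a the DNA data whose accession numbers are recorded in the deletion list 300.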
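The registration step can be sketched as a filter followed by an append. The sketch assumes entries are (accession number, data) pairs held in lists; the function name `register_entries` is hypothetical, and the handling of several update entries that share one accession number is not covered here.

```python
def register_entries(release, updates, deletion_list):
    """Drop release entries named in the deletion list, then add the update entries."""
    to_delete = set(deletion_list)
    kept = [(accession, data) for accession, data in release if accession not in to_delete]
    return kept + list(updates)


release = [("AB111222", "old sequence"), ("ZZ000000", "unchanged")]
updates = [("AB111222", "corrected sequence"), ("AC111999", "new entry")]
print(register_entries(release, updates, ["AB111222"]))
```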
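[0064]
The present invention has been described according to the illustrated embodiment, but it is not limited to that embodiment. For example, the embodiment applies the invention to a DNA database, but the invention is not limited to DNA databases.
[0065]
[Effects of the invention]
As described above, when multiple entry data items are registered in a database that manages entry data having registration IDs, the present invention creates, from the registration IDs of the entry data to be registered, a table recording the character string information used to judge whether a registration ID may be registered in the database. It uses this information to extract, from the registration IDs of the entry data managed in the database, those that may be registration IDs of entry data to be registered, and compares the extracted registration IDs with the registration IDs of the entry data to be registered to find the duplicate entry data contained in the entry data to be registered. Duplicate entries in the database can therefore be checked at high speed.
[Brief description of the drawings]
[FIG. 1] A principle configuration diagram of the present invention.
[FIG. 2] One embodiment of the present invention.
[FIG. 3] One embodiment of the character list table.
[FIG. 4] One embodiment of the hash table.
[FIG. 5] A processing flow of the table creation program.
[FIG. 6] A processing flow of the table creation program.
[FIG. 7] A processing flow of the duplicate entry search program.
[FIG. 8] A processing flow of the duplicate entry search program.
[FIG. 9] An explanatory diagram of the DNA database.
[Explanation of symbols]
1 Database management apparatus
2 Database
3 Registration database
10 Creating means
11 Character list table
12 Hash table
13 Judging means
14 Search means
[0001]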
BACKGROUND OF THE INVENTION
The present invention relates to a database management apparatus that checks for duplicate entries in a database when entry data is registered in a database that manages entry data having registration IDs, and to a program recording medium on which a program used to implement the apparatus is recorded. In particular, it relates to a database management apparatus that enables high-speed checking of duplicate entries, and to a program recording medium on which a program used to implement that apparatus is recorded.
[0002]
[Prior art]
The body of a gene is a substance called DNA. DNA is composed of four types of bases connected in a row. This sequence of bases is called a base sequence. Proteins are the main substances that make up the body of organisms and perform various functions in vivo, such as those of enzymes, and the sequence of the 20 kinds of amino acids that make up a protein is determined by the base sequence of the gene. In practice, a set of three consecutive bases corresponds to one amino acid.
[0003]
A gene sequence represented as a DNA base sequence or a protein amino acid sequence can be described by a character string with each base or each amino acid as one character. For example, four types of bases are represented by A, T, G, and C by taking the initial letter of the name.
[0004]
Gene sequences expressed in character strings are accumulated in a DNA database. The nucleotide sequences of genes clarified so far are registered in the DDBJ / EMBL / GenBank international nucleotide sequence database constructed jointly by Japan, Europe and the United States.
[0005]
FIG. 9 illustrates the internal structure of this DNA database. The DNA data registered in this DNA database is given a unique accession number (U75865 in the figure) and, in addition to the base sequence information of the gene, describes related information such as the name of the organism having the gene and the name of the protein produced from the gene. Similarly, amino acid sequence databases such as PIR / SWISS-PROT have been constructed for proteins.
[0006]
Up to now, the above accession number has consisted of 6 characters, "1 alphabet letter + 5 digits". As the amount of DNA data registered in the DNA database has become large, an 8-character form, "2 alphabet letters + 6 digits", has recently come into use.
[0007]
The DNA database is composed of a release database that is periodically released and a daily update database that stores DNA data registered after the release database is issued. In the daily update database, DNA data is registered every day, and the DNA data in the daily update database is added to the release database and released every several months.
[0008]
In addition to new DNA data, DNA data registered in the daily update database includes DNA data obtained by correcting or changing DNA data already registered in the release database. As a result, DNA data having the same accession number exists in the daily update database and the release database. Further, when the same DNA data is continuously corrected or changed, a plurality of DNA data having the same accession number exist in the daily update database.
[0009]
Because, among other reasons, the DNA database issuers (DDBJ / EMBL / GenBank) do not necessarily assign consecutive accession numbers to new DNA data, it is not possible to determine whether DNA data (an accession number) registered in the daily update database is DNA data (an accession number) already registered in the release database.
[0010]
For biology research, data retrieval by accession number is generally performed, but if there are multiple DNA data with the same accession number, correct results cannot be obtained by data retrieval. When adding DNA data registered in the daily update database to the release database, it is therefore necessary to make it non-redundant.
[0011]
Conventionally, all DNA data is first divided into entries (accession numbers), each entry is expanded on disk as a file with the accession number as the file name, and the new DNA data and the updated DNA data are expanded into the same area by the same method to achieve this non-redundancy.
[0012]
If duplicate entries exist according to this method, there will be multiple files with the same name, and non-redundantization will be realized by overwriting old files with new files. Then, by merging the expanded files again, a completely non-redundant DNA database is completed.
[0013]
[Problems to be solved by the invention]
Recently, DNA data registered in the DNA database has been increasing at a rate of 1.5 to 2 times per year, and is expected to increase further in the future.
[0014]
Consequently, if non-redundancy is achieved according to the conventional technology, dividing the entries and combining the files take so much time that it has become difficult to complete the non-redundancy process within 24 hours.
[0015]
This problem is not limited to the DNA database. The same happens whenever a processing configuration is adopted in which a large amount of entry data is registered in a database that manages entry data randomly regardless of the registration ID.
[0016]
The present invention has been made in view of such circumstances. Its object is to provide a new database management apparatus that enables high-speed checking of duplicate entries in a database when entry data is registered in a database that manages entry data having registration IDs, and to provide a new program recording medium on which a program used to implement the apparatus is recorded.
[0017]
[Means for Solving the Problems]
In order to achieve this object, the database management apparatus of the present invention, which checks for duplicate entries in a database when registering entry data in a database that manages entry data having a registration ID, comprises: (1) creating means for creating the character string information used to determine whether a registration ID may be registered in the database, by extracting the character string information at a specified position in the registration ID of the entry data to be registered; (2) first judging means for reading the registration IDs of the entry data from the database one by one and determining whether the character string information at the specified position in each registration ID matches the character string information created by the creating means, thereby determining whether that registration ID may be the registration ID of the entry data to be registered; and (3) second judging means for determining whether a registration ID that the first judging means has determined to be a possible match is identical to the registration ID of the entry data to be registered, thereby determining whether the entry data to be registered is duplicate entry data.
When adopting this configuration, when there are a plurality of entry data to be registered, the creating means may extract the character string information at the specified position in the registration ID of each entry data to be registered and create a table that records a list of the extracted character string information, as the character string information used for determining whether a registration ID may be registered in the database. In this case, the first judging means determines whether the character string information at the specified position in a registration ID read from the database is recorded in the table, thereby determining whether that registration ID may be a registration ID of the entry data to be registered.
Further, when adopting this configuration, when there are a plurality of entry data to be registered, the creating means may extract the character string information at the specified position in the registration ID of each entry data to be registered and create the table by recording information indicating the presence of each extracted character string in a table that lists every value the character string at the specified position can take and records presence or absence for each value. In this case, the first judging means determines whether the character string information at the specified position in a registration ID read from the database is recorded in the table as present, thereby determining whether that registration ID may be a registration ID of the entry data to be registered.
Here, the function of the database management apparatus of the present invention is specifically realized by a program. This program is stored on a floppy disk or the like, or on a disk or the like of a server or the like, is installed from there in the database management apparatus of the present invention, and operates in memory, thereby realizing the present invention.
Next, the present invention will be described in more detail with reference to the principle configuration diagram shown in FIG.
Here, in the principle configuration diagram shown in FIG. 1, it is assumed that there are a plurality of entry data to be registered.
[0018]
In the figure, 1 is a database management apparatus embodying the present invention, 2 is a database that manages entry data randomly regardless of registration ID, and 3 is a registration database that stores the entry data to be registered in the database 2.
[0019]
The database management apparatus 1 according to the present invention performs processing for checking duplicate entries in the database 2 when registering a large amount of entry data stored in the registration database 3 with respect to the database 2. In order to realize the function, a creation unit 10, a character list table 11, a hash table 12, a determination unit 13, and a search unit 14 are provided.
[0020]
The creating means 10 creates, from the registration IDs of the entry data to be registered, the character list table 11 and the hash table 12 used for searching whether a registration ID may be registered in the database 2.
[0021]
The character list table 11 is created by the creating means 10 and manages a list of characters located at the specified position of the registration IDs of the entry data to be registered. The hash table 12 is created by the creating means 10 and manages the same kind of list by recording presence or absence against a list of the character strings that can occur at the specified position of a registration ID.
[0022]
The judging means 13 reads out the registration IDs of the entry data one by one from the database 2 and determines whether each read registration ID may be the registration ID of entry data stored in the registration database 3. The search means 14 searches for duplicate entry data included in the entry data to be registered.
[0023]
Here, the functions of the database management apparatus 1 of the present invention are specifically realized by a program. This program is stored on a floppy disk or the like, or on a disk or the like of a server or the like, is installed from there in the database management apparatus 1, and operates in memory, thereby realizing the present invention.
[0024]
In the database management apparatus 1 of the present invention configured as described above, when a request for registering the entry data stored in the registration database 3 in the database 2 is issued, the creating means 10 creates the character list table 11 and the hash table 12 from the registration IDs of the entry data to be registered.
[0025]
In response to the creation processing of the creating means 10, the judging means 13 reads the registration IDs of the entry data one by one from the database 2 and uses the created character list table 11 and hash table 12 to determine whether each read registration ID may be the registration ID of entry data stored in the registration database 3.
[0026]
In response to the determination processing of the judging means 13, the search means 14 compares each registration ID that the judging means 13 has determined to be a possible match with the registration ID of each entry data stored in the registration database 3, thereby searching for the duplicate entry data stored in the registration database 3.
[0027]
As described above, when a plurality of entry data are registered in the database 2 that manages entry data having registration IDs, the database management apparatus 1 of the present invention creates the character list table 11 and the hash table 12 from the registration IDs of the entry data to be registered, thereby creating the character string information used for determining whether a registration ID may be registered in the database 2. It uses this information to extract, from the registration IDs of the entry data managed in the database 2, those that may be the registration IDs of entry data to be registered, and compares the extracted registration IDs with the registration IDs of the entry data to be registered to search for the duplicate entry data included in the entry data to be registered. Duplicate entries in the database 2 can therefore be checked at high speed.
[0028]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail according to embodiments.
[0029]
FIG. 2 shows an embodiment of the database management apparatus 1 of the present invention. In the figure, the same components as those described in FIG. 1 are indicated by the same symbols.
[0030]
2a is a DNA release database, which manages DNA data and is released periodically, and 3a is a DNA update database, which stores DNA data registered after the issuance of the DNA release database 2a.
[0031]
The database management apparatus 1 according to the present invention performs processing for registering the DNA data stored in the DNA update database 3a in the DNA release database 2a, and comprises a table creation program 100, a character list table 11 and a hash table 12 created by the table creation program 100, a duplicate entry search program 200, a deletion list 300 created by the duplicate entry search program 200, and an entry data registration program 400.
[0032]
Here, the table creation program 100, the duplicate entry search program 200, and the entry data registration program 400 are installed via a floppy disk, a line, or the like.
[0033]
Hereinafter, for convenience of explanation, it is assumed that only 8 digits of “2 alphabet letters + 6 digits” are used as the accession number of DNA data.
[0034]
As shown in FIG. 2, the character list table 11 includes, for example, a first character list table 110, a second character list table 111, and a third character list table 112, and the hash table 12 includes, for example, a first hash table 120, a second hash table 121, and a third hash table 122.
[0035]
FIG. 3 shows an embodiment of the character list table 11, and FIG. 4 shows an embodiment of the hash table 12.
[0036]
As shown in FIG. 3, the first character list table 110 manages accession numbers of DNA data stored in the DNA update database 3a as management targets, and manages a list of leading characters of those accession numbers. . Then, the second character list table 111 manages the list of the first two characters of those accession numbers with those accession numbers as management targets. The third character list table 112 manages a list of the first three characters of the accession numbers with the accession numbers as management targets.
[0037]
For example, when the accession number of the DNA data stored in the DNA update database 3a has four types of leading characters A, C, D, and W, the first character list table 110 stores the A, C, D , W are managed.
[0038]
On the other hand, as shown in FIG. 4, the first hash table 120 includes a numeric string (000 to 999) that can take three digits in the numeric part of the accession number and an accession number having the numeric string. A table structure composed of pair data with a flag indicating whether or not the data is stored, and the accession numbers of the DNA data stored in the DNA update database 3a are managed, and the numerical part of the accession numbers By setting the flag paired with the upper 3 digit number string to “1”, the upper 3 digit number sequence of the accession number is managed.
[0039]
The second hash table 121 includes a numeric string (000 to 999) that can take three digits in the numeric part of the accession number, and a flag indicating whether or not an accession number having the numeric string exists. The accession number of the DNA data stored in the DNA update database 3a is a management target, and is stored in a three-digit number string in the number part of the accession number. By setting the paired flag to “1”, a 3-digit number string in the number part of those accession numbers is managed.
[0040]
The third hash table 122 includes a number string (000 to 999) that can be taken by the last three digits of the accession number and a flag indicating whether or not an accession number having the number string exists. The accession number of the DNA data stored in the DNA update database 3a has a table structure composed of paired data, and the number of the lower three digits of the accession number is paired with the accession number. By setting the flag to “1”, the number string of the last three digits of the accession number is managed.
[0041]
For example, when the accession numbers of the DNA data stored in the DNA update database 3a have three different number strings, “111”, “123”, and “222”, in the upper three digits of their numeric parts, the first hash table 120 manages the upper three-digit number strings of those accession numbers by setting the flag paired with the number string “111” to “1”, the flag paired with the number string “123” to “1”, and the flag paired with the number string “222” to “1”.
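The flag table in this example can be sketched as a list of 1000 flags indexed by the three-digit number string. This is an illustration under assumptions: the names `new_flag_table` and `mark` are hypothetical, and a Python list stands in for the pair-data table structure of FIG. 4.

```python
def new_flag_table():
    """Create a table with one flag for each number string 000 to 999."""
    return [0] * 1000


def mark(table, digits3):
    """Set the flag paired with a three-digit number string to 1."""
    table[int(digits3)] = 1


# Upper three-digit strings taken from the example in the text.
first_hash = new_flag_table()
for number_string in ("111", "123", "222"):
    mark(first_hash, number_string)

print(first_hash[123], first_hash[124])  # 1 0
```

Because the number string itself is the index, both setting and testing a flag take constant time, regardless of how many accession numbers were recorded.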
[0042]
FIGS. 5 and 6 show an embodiment of the processing flow executed by the table creation program 100, and FIGS. 7 and 8 show an embodiment of the processing flow executed by the duplicate entry search program 200. Next, the processing of the database management apparatus 1 of the present invention configured as described above will be described in detail according to these processing flows.
[0043]
When a request is issued to register the DNA data stored in the DNA update database 3a in the DNA release database 2a, the table creation program 100 first sets the initial value “1” in the variable i in step 1, as shown in the processing flow of FIGS. 5 and 6.
[0044]
Subsequently, in step 2, the accession number of the i-th DNA data stored in the DNA update database 3a is read. Subsequently, in step 3, the first character of the read accession number is extracted. Subsequently, in step 4, it is determined whether or not the extracted first character has already been recorded in the first character list table 110. When it is determined that it has not yet been recorded, the process proceeds to step 5, where the extracted first character is recorded in the first character list table 110; when it is determined that it has already been recorded, the process of step 5 is omitted.
[0045]
Subsequently, in step 6, the first two characters of the accession number read in step 2 are extracted. Subsequently, in step 7, it is determined whether or not the extracted first two characters have already been recorded in the second character list table 111. When it is determined that they have not yet been recorded, the process proceeds to step 8, where the extracted first two characters are recorded in the second character list table 111; when it is determined that they have already been recorded, the process of step 8 is omitted.
[0046]
Subsequently, in step 9, the first three characters of the accession number read in step 2 are extracted. Subsequently, in step 10, it is determined whether or not the extracted first three characters have already been recorded in the third character list table 112. When it is determined that they have not yet been recorded, the process proceeds to step 11, where the extracted first three characters are recorded in the third character list table 112; when it is determined that they have already been recorded, the process of step 11 is omitted.
[0047]
Subsequently, in step 12 (processing flow of FIG. 6), the upper three digits of the numeric part of the accession number read in step 2 are extracted. Subsequently, in step 13, it is determined whether or not the flag of the first hash table 120 indicated by the extracted number string is already “1”. When it is determined that it is not yet “1”, the process proceeds to step 14, where the flag is set to “1” so that the extracted number string is recorded in the first hash table 120; when it is already “1”, the process of step 14 is omitted.
[0048]
Subsequently, in step 15, the three-digit number string at the specified position in the numeric part of the accession number read in step 2 is extracted. Subsequently, in step 16, it is determined whether or not the flag of the second hash table 121 indicated by the extracted number string is already “1”. When it is determined that it is not yet “1”, the process proceeds to step 17, where the flag is set to “1” so that the extracted number string is recorded in the second hash table 121; when it is already “1”, the process of step 17 is omitted.
[0049]
Subsequently, in step 18, the lower three digits of the numeric part of the accession number read in step 2 are extracted. Subsequently, in step 19, it is determined whether or not the flag of the third hash table 122 indicated by the extracted number string is already “1”. When it is determined that it is not yet “1”, the process proceeds to step 20, where the flag is set to “1” so that the extracted number string is recorded in the third hash table 122; when it is already “1”, the process of step 20 is omitted.
[0050]
Subsequently, in step 21, the value of the variable i is incremented by one. In the following step 22, it is determined whether or not the value of the variable i exceeds the number of DNA data stored in the DNA update database 3a. When it is determined that it has not been exceeded, the process returns to step 2, and when it is determined that it has been exceeded, all processing is terminated.
[0051]
In this way, when a request is issued to register the DNA data stored in the DNA update database 3a in the DNA release database 2a, the table creation program 100 creates the first character list table 110, the second character list table 111, the third character list table 112, the first hash table 120, the second hash table 121, and the third hash table 122 from the accession numbers of the DNA data stored in the DNA update database 3a.
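Steps 1 to 22 of the table creation flow can be condensed into a single loop, sketched below under stated assumptions. The names `Tables`, `digit_windows`, and `create_tables` are hypothetical. Accession numbers are assumed to consist of uppercase letters followed by at least three digits. The text does not pin down which three digits feed the second hash table 121, so this sketch assumes the centered three digits; that choice is an illustration only. The explicit "already recorded?" checks of the flow are absorbed by set insertion and flag assignment, which are idempotent.

```python
import re
from dataclasses import dataclass, field

# Assumed format: one or more uppercase letters followed by digits.
ACCESSION = re.compile(r"([A-Z]+)(\d+)")


@dataclass
class Tables:
    # Stand-ins for character list tables 110, 111 and 112.
    prefixes: list = field(default_factory=lambda: [set(), set(), set()])
    # Stand-ins for hash tables 120, 121 and 122.
    flags: list = field(default_factory=lambda: [[0] * 1000 for _ in range(3)])


def digit_windows(acc):
    """Return the upper, centered (assumed) and lower three digits."""
    match = ACCESSION.fullmatch(acc)
    if match is None or len(match.group(2)) < 3:
        raise ValueError(f"unexpected accession number: {acc!r}")
    digits = match.group(2)
    mid = (len(digits) - 3) // 2
    return digits[:3], digits[mid:mid + 3], digits[-3:]


def create_tables(update_ids):
    """Build all six tables from the update database's accession numbers."""
    tables = Tables()
    for acc in update_ids:                      # steps 2, 21 and 22
        for n in range(3):                      # steps 3 to 11
            tables.prefixes[n].add(acc[:n + 1])
        for table, window in zip(tables.flags, digit_windows(acc)):
            table[int(window)] = 1              # steps 12 to 20
    return tables


tables = create_tables(["U75865", "AB123456"])
print(digit_windows("U75865"))  # ('758', '586', '865')
```

The tables are built in one pass over the update database, so their cost is linear in the number of DNA data to be registered.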
[0052]
When the duplicate entry search program 200 is started upon completion of the processing of the table creation program 100, it first sets the initial value “1” in the variable i in step 1, as shown in the processing flow of FIGS. 7 and 8.
[0053]
Subsequently, in step 2, the accession number of the i-th DNA data stored in the DNA release database 2a is read. Subsequently, in step 3, the first character of the read accession number is extracted. Subsequently, in step 4, it is determined whether or not the extracted first character is recorded in the first character list table 110. When it is determined that it is not recorded, it is determined that the DNA update database 3a contains no DNA data having the read accession number, and the process proceeds to step 18 (processing flow of FIG. 8) described later.
[0054]
On the other hand, when it is determined in step 4 that the first character extracted in step 3 is recorded in the first character list table 110, the process proceeds to step 5, where the first two characters of the accession number read in step 2 are extracted. Subsequently, in step 6, it is determined whether or not the extracted first two characters are recorded in the second character list table 111. When it is determined that they are not recorded, it is determined that the DNA update database 3a contains no DNA data having the read accession number, and the process proceeds to step 18 (processing flow of FIG. 8) described later.
[0055]
On the other hand, when it is determined in step 6 that the first two characters extracted in step 5 are recorded in the second character list table 111, the process proceeds to step 7, where the first three characters of the accession number read in step 2 are extracted. Subsequently, in step 8, it is determined whether or not the extracted first three characters are recorded in the third character list table 112. When it is determined that they are not recorded, it is determined that the DNA update database 3a contains no DNA data having the read accession number, and the process proceeds to step 18 (processing flow of FIG. 8) described later.
[0056]
On the other hand, when it is determined in step 8 that the first three characters extracted in step 7 are recorded in the third character list table 112, the process proceeds to step 9, where the upper three digits of the numeric part of the accession number read in step 2 are extracted. Subsequently, in step 10, it is determined whether or not the flag of the first hash table 120 indicated by the extracted number string is “1”. When it is determined that it is “0”, it is determined that the DNA update database 3a contains no DNA data having the read accession number, and the process proceeds to step 18 (processing flow of FIG. 8) described later.
[0057]
On the other hand, when it is determined in step 10 that the flag of the first hash table 120 indicated by the number string extracted in step 9 is “1”, the process proceeds to step 11, where the three-digit number string at the specified position in the numeric part of the accession number read in step 2 is extracted. Subsequently, in step 12 (processing flow of FIG. 8), it is determined whether or not the flag of the second hash table 121 indicated by the extracted number string is “1”. When it is determined that it is “0”, it is determined that the DNA update database 3a contains no DNA data having the read accession number, and the process proceeds to step 18 described later.
[0058]
On the other hand, when it is determined in step 12 that the flag of the second hash table 121 indicated by the number string extracted in step 11 is “1”, the process proceeds to step 13, where the lower three digits of the numeric part of the accession number read in step 2 are extracted. Subsequently, in step 14, it is determined whether or not the flag of the third hash table 122 indicated by the extracted number string is “1”. When it is determined that it is “0”, it is determined that the DNA update database 3a contains no DNA data having the read accession number, and the process proceeds to step 18 described later.
[0059]
On the other hand, when it is determined in step 14 that the flag of the third hash table 122 indicated by the number string extracted in step 13 is “1”, it is determined that DNA data having the accession number read in step 2 may exist in the DNA update database 3a, and the process proceeds to step 15, where the DNA update database 3a is searched to find whether DNA data having the read accession number is stored in it.
[0060]
Subsequently, in step 16, it is determined, according to the search result of step 15, whether or not DNA data having the accession number read in step 2 exists in the DNA update database 3a. When it is determined that it exists, that is, when the DNA data having the accession number read in step 2 is to be updated, the process proceeds to step 17, where the accession number is recorded in the deletion list 300; when it is determined that it does not exist, the process of step 17 is omitted.
[0061]
Subsequently, in step 18, the value of the variable i is incremented by 1. In the following step 19, it is determined whether or not the value of the variable i exceeds the number of DNA data stored in the DNA release database 2a. When it is determined that it has not been exceeded, the process returns to step 2, and when it is determined that it has been exceeded, all processing is terminated.
[0062]
In this way, when the table creation program 100 finishes its processing, the duplicate entry search program 200 uses the character list table 11 and the hash table 12 created by that program to determine whether each DNA data stored in the DNA release database 2a may be subject to updating. When it may be, the program searches the DNA update database 3a to determine whether the DNA data is actually to be updated, and records the accession number of each DNA data to be updated in the deletion list 300.
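The two-stage check performed by the duplicate entry search program can be sketched as a cheap table filter followed by an actual search only for the survivors. This is an illustration under assumptions, not the program of the embodiment: the function names are hypothetical, accession numbers are assumed to be uppercase letters followed by at least three digits, the second hash table is assumed to hold the centered three digits, and a linear scan of a Python list stands in for the search of the DNA update database 3a in step 15.

```python
import re


def windows(acc):
    """Upper, centered (assumed) and lower three digits of the numeric part."""
    match = re.fullmatch(r"[A-Z]+(\d{3,})", acc)
    if match is None:
        raise ValueError(f"unexpected accession number: {acc!r}")
    digits = match.group(1)
    mid = (len(digits) - 3) // 2
    return digits[:3], digits[mid:mid + 3], digits[-3:]


def create_tables(update_ids):
    """Build prefix sets and flag tables from the update database IDs."""
    prefixes = [set(), set(), set()]
    flags = [[0] * 1000 for _ in range(3)]
    for acc in update_ids:
        for n in range(3):
            prefixes[n].add(acc[:n + 1])
        for table, window in zip(flags, windows(acc)):
            table[int(window)] = 1
    return prefixes, flags


def may_be_updated(acc, prefixes, flags):
    """Steps 3 to 14: return False as soon as any table rules the ID out."""
    for n in range(3):
        if acc[:n + 1] not in prefixes[n]:
            return False
    return all(t[int(w)] == 1 for t, w in zip(flags, windows(acc)))


def build_deletion_list(release_ids, update_ids):
    """Return the deletion list and the number of full searches performed."""
    prefixes, flags = create_tables(update_ids)
    deletion_list, searched = [], 0
    for acc in release_ids:                 # steps 2, 18 and 19
        if not may_be_updated(acc, prefixes, flags):
            continue                        # filtered out without a search
        searched += 1
        if acc in update_ids:               # steps 15 and 16 (linear scan)
            deletion_list.append(acc)       # step 17
    return deletion_list, searched


release = ["U75865", "U75866", "X00001", "AB123456"]
update = ["U75865", "AB123456", "AC000001"]
print(build_deletion_list(release, update))  # (['U75865', 'AB123456'], 2)
```

In this sample, "X00001" is rejected by the first character list and "U75866" by the lower three-digit flags, so only two of the four release entries trigger a search. The filter can pass an ID that is not actually present in the update database, which is why the final search remains necessary.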
[0063]
When the deletion list 300 has been created by the processing of the duplicate entry search program 200, the entry data registration program 400 deletes from the DNA release database 2a the DNA data having the accession numbers recorded in the deletion list 300, and then registers the DNA data stored in the DNA update database 3a in the DNA release database 2a.
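The delete-then-register step can be sketched as follows. The function name `register_updates` and the representation of each database as a list of (accession number, data) pairs are assumptions made for illustration; the specification does not describe the storage format.

```python
def register_updates(release_db, update_db, deletion_list):
    """Drop superseded entries, then append everything from the update database.

    Both databases are modelled as lists of (accession_number, data) pairs.
    """
    stale = set(deletion_list)
    kept = [(acc, data) for acc, data in release_db if acc not in stale]
    return kept + list(update_db)


release_db = [("U75865", "old record"), ("U75866", "unchanged record")]
update_db = [("U75865", "revised record"), ("AC000001", "new record")]
merged = register_updates(release_db, update_db, ["U75865"])
print([acc for acc, _ in merged])  # ['U75866', 'U75865', 'AC000001']
```

Deleting first means the release database never holds both the old and the revised record for the same accession number after registration.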
[0064]
Although the present invention has been described with reference to the illustrated embodiment, the present invention is not limited to it. For example, the embodiment describes the present invention as applied to a DNA database, but its application is not limited to DNA databases.
[0065]
[Effects of the Invention]
As explained above, in the present invention, when a plurality of entry data are registered in a database that manages entry data having registration IDs, a table is created that records character string information used to determine whether the registration ID of entry data managed in the database may be the registration ID of entry data to be registered. Using this table, those registration IDs of the entry data managed in the database that may be registration IDs of the entry data to be registered are extracted, and the extracted registration IDs are compared with the registration IDs of the entry data to be registered in order to search for duplicate entry data contained in the entry data to be registered. Because this configuration is adopted, duplicate entries in the database can be checked at high speed.
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is an embodiment of the present invention.
FIG. 3 is an example of a character list table.
FIG. 4 is an example of a hash table.
FIG. 5 is a processing flow of a table creation program.
FIG. 6 is a processing flow of a table creation program.
FIG. 7 is a processing flow of a duplicate entry search program.
FIG. 8 is a processing flow of a duplicate entry search program.
FIG. 9 is an explanatory diagram of a DNA database.
[Explanation of symbols]
1 Database management device
2 Database
3 Registration database
10 Creating means
11 Character list table
12 Hash table
13 Judgment means
14 Search means

Claims

A database management apparatus that performs processing for checking duplicate entries in a database when registering entry data in the database, which manages entry data having registration IDs, the apparatus being characterized by comprising:
creating means for, when there are a plurality of entry data to be registered, extracting character string information at a specified position constituting the registration IDs of the entry data to be registered, and creating a table that records a list of the extracted character string information;
first determination means for reading the registration IDs of the entry data from the database one by one and determining whether or not the character string information at the specified position constituting each registration ID is recorded in the table, thereby determining whether or not the registration ID may be a registration ID of entry data to be registered; and
second determination means for determining whether or not a registration ID that the first determination means has determined to be a possible registration ID of entry data to be registered matches a registration ID of the entry data to be registered, thereby determining whether or not the entry data to be registered is duplicate entry data.

A database management apparatus that performs processing for checking duplicate entries in a database when registering entry data in the database, which manages entry data having registration IDs, the apparatus being characterized by comprising:
creating means for, when there are a plurality of entry data to be registered, extracting character string information at a specified position constituting the registration IDs of the entry data to be registered, and creating a table by recording information indicating the presence of the extracted character string information in a table whose structure records a list of the character string information that can occur at the specified position together with presence/absence information for each such character string information;
first determination means for reading the registration IDs of the entry data from the database one by one and determining whether or not the character string information at the specified position constituting each registration ID is recorded in the table as present, thereby determining whether or not the registration ID may be a registration ID of entry data to be registered; and
second determination means for determining whether or not a registration ID that the first determination means has determined to be a possible registration ID of entry data to be registered matches a registration ID of the entry data to be registered, thereby determining whether or not the entry data to be registered is duplicate entry data.

A program recording medium on which is recorded a program used to realize a database management apparatus that performs processing for checking duplicate entries in a database when registering entry data in the database, which manages entry data having registration IDs, the program recording medium being characterized in that the program causes a computer to execute:
a creation process of, when there are a plurality of entry data to be registered, extracting character string information at a specified position constituting the registration IDs of the entry data to be registered, and creating a table that records a list of the extracted character string information;
a first determination process of reading the registration IDs of the entry data from the database one by one and determining whether or not the character string information at the specified position constituting each registration ID is recorded in the table, thereby determining whether or not the registration ID may be a registration ID of entry data to be registered; and
a second determination process of determining whether or not a registration ID determined in the first determination process to be a possible registration ID of entry data to be registered matches a registration ID of the entry data to be registered, thereby determining whether or not the entry data to be registered is duplicate entry data.

A program recording medium on which is recorded a program used to realize a database management apparatus that performs processing for checking duplicate entries in a database when registering entry data in the database, which manages entry data having registration IDs, the program recording medium being characterized in that the program causes a computer to execute:
a creation process of, when there are a plurality of entry data to be registered, extracting character string information at a specified position constituting the registration IDs of the entry data to be registered, and creating a table by recording information indicating the presence of the extracted character string information in a table whose structure records a list of the character string information that can occur at the specified position together with presence/absence information for each such character string information;
a first determination process of reading the registration IDs of the entry data from the database one by one and determining whether or not the character string information at the specified position constituting each registration ID is recorded in the table as present, thereby determining whether or not the registration ID may be a registration ID of entry data to be registered; and
a second determination process of determining whether or not a registration ID determined in the first determination process to be a possible registration ID of entry data to be registered matches a registration ID of the entry data to be registered, thereby determining whether or not the entry data to be registered is duplicate entry data.