JP5324903B2

JP5324903B2 - Similarity calculation apparatus, method and program, data search system and method

Info

Publication number: JP5324903B2
Application number: JP2008314498A
Authority: JP
Inventors: 彰中山; 貞大石崎; 望阿部; 典史片渕
Original assignee: NTT Docomo Business Inc; NTT Communications Corp
Current assignee: NTT Docomo Business Inc
Priority date: 2008-12-10
Filing date: 2008-12-10
Publication date: 2013-10-23
Anticipated expiration: 2028-12-10
Also published as: JP2010140162A

Description

The present invention relates to a technique for calculating a similarity, an index of how similar two data items are to each other, and to a technique for using that similarity to provide users with meaningful search results.

The technique described in Non-Patent Document 1 is known as a way of calculating a similarity, an index of how similar two Web pages are to each other.
In that technique, the link structure of Web page A and the link structure of Web page B are each extracted, and the closer the two link structures are, the higher the similarity assigned to Web pages A and B.
[Non-Patent Document 1] Satoshi Kurihara and three others, "Estimating Similarity between Web Pages Using Link Information", Computer Software, Japan Society for Software Science and Technology, November 22, 2001, Vol. 18, No. 6, pp. 15-26

The technique of Non-Patent Document 1 determines similarity without considering the behavior of the users who actually accessed Web pages A and B.
An object of the present invention is to provide a similarity calculation device, method, and program that calculate similarity in consideration of the behavior of users who actually accessed data such as Web pages, and a data search system and method that use that similarity to provide users with meaningful search results.

A similarity calculation device according to one aspect of the present invention includes: an access order assigning unit that uses an access log, read from an access log storage unit storing access logs on each user's accesses to each data item, to determine the order in which each user accessed each data item; an access order difference calculation unit that, taking two different data items as data n and data m, obtains the difference N_list(n, m) between the position in the order at which user i accessed data n and the position at which user i subsequently accessed data m; an access time difference calculation unit that obtains the difference t_margin(n, m) between the time at which user i accessed data n and the time at which user i subsequently accessed data m; and an access pattern similarity calculation unit that obtains an access pattern similarity α_{n-m}, based on the access pattern for data n and m, by evaluating a function f_1 that is monotonically decreasing in N_list(n, m) and t_margin(n, m) at those values. The device further includes a similarity determining unit that judges data n and data m to be similar when both the access pattern similarity α_{n-m} and the access pattern similarity α_{m-n} are higher than a predetermined threshold.

A data search device according to the present invention includes the similarity calculation device and a search device. The search device includes a data information acquisition unit that acquires information about the data corresponding to a received query, a similar data information acquisition unit that acquires, from the similarity calculation device, information about data similar to the data corresponding to the query, and an output unit that outputs the information about the similar data together with the information about the data corresponding to the query.

Because the similarity is calculated from the access order difference and the access time difference, both obtained by referring to the access log of each user's accesses to each data item, the similarity reflects the behavior of the users who actually accessed the data.

[Similarity Calculation Device and Method]
An embodiment of the similarity calculation device and method according to the present invention will be described. FIG. 1 is a functional block diagram illustrating the similarity calculation device, and FIG. 4 is a flowchart illustrating the similarity calculation method.

The similarity calculation device and method calculate the similarity between two data items. "Data" means information such as a Web page, image data, sound data, video data, or data used by any application, for which a character string identifying that information is defined. When the data is a Web page, its URL (Uniform Resource Locator) is the character string that identifies the Web page. When the data is something other than a Web page, the file name of the data or a similar string identifies it. The following description takes the case where the data is a Web page as an example. That is, the similarity calculation device 100 and method described below calculate the similarity between two Web pages.

The access log storage unit 1 stores an access log of each user's accesses to each data item. The access log may take any form as long as the access order difference and access time difference described later can be calculated from it. For example, each line of the access log contains at least information identifying the user who made the access (a user ID, user group ID, or the like), information identifying the accessed data (a URL or the like), and information on the time at which the user accessed the data.

Specifically, the access log that an Internet Service Provider records automatically when its subscribers connect to the Internet can be used. The access log recorded automatically by a proxy server at a company, school, or the like may also serve as the basis for calculating the access order difference and access time difference. When the data is, for example, a file stored on a file server, the access log recorded automatically when users connect to that file server may be used.

A user may be a single user or a user made up of multiple users (in other words, a user group). That is, the access log stored in the access log storage unit 1 may record accesses from multiple users as accesses from the same user (user group).

The similarity calculation device 100 may or may not include the access log storage unit 1. It is sufficient that the similarity calculation device 100 can access an access log storage unit 1 located outside the device, such as on a server where the access log is stored.

The access order assigning unit 2 uses the access log read from the access log storage unit 1 to determine, for each data item a user accessed, the order in which that user accessed it (step S1). Information on this order (also called the access order) is sent to the access order difference calculation unit 3. The access order is described, for example, as in the access pattern file shown in FIG. 10.

An example of the processing of the access order assigning unit 2 will be described with reference to FIG. 5. The access order assigning unit 2 reads the access log of a user for whom an access pattern file has not yet been created (step S12). For example, when the access log is recorded line by line, all lines relating to that user's accesses are extracted from the access log stored in the access log storage unit 1 and combined to create that user's access log file. When users are represented by the IP address assigned to the terminal they use, or by a hash value of that address, the lines are extracted using, for example, the user's IP address or its hash value as the key.
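
The per-user extraction in step S12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `LogRecord` layout, the field names, and the use of epoch seconds for timestamps are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Hypothetical record layout; the patent only requires these three kinds of information.
    user_key: str     # user ID, IP address, or hash of the IP address
    url: str          # character string identifying the accessed data
    timestamp: float  # access time (assumed here to be epoch seconds)


def group_by_user(records):
    """Collect each user's log lines and sort them by access time."""
    per_user = defaultdict(list)
    for rec in records:
        per_user[rec.user_key].append(rec)
    for recs in per_user.values():
        recs.sort(key=lambda r: r.timestamp)
    return dict(per_user)
```

Each value of the returned dictionary plays the role of one user's access log file.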

Note that the access frequency calculation unit 10 may calculate, from the access log read from the access log storage unit 1, how often each data item was accessed, so that only pages accessed at least a predetermined number of times are sent to the access order assigning unit 2 (step S11). That is, data accessed fewer than the predetermined number of times may be excluded from similarity determination. For example, when the access log is recorded line by line, the access frequency calculation unit 10 extracts from the stored access log only the lines relating to data accessed at least the predetermined number of times.
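
As a rough sketch of the optional filtering in step S11, the function below keeps only log lines whose data item reaches a minimum access count. The tuple layout `(user_key, url, timestamp)` and the parameter name `min_count` are assumptions for illustration.

```python
from collections import Counter


def filter_by_access_count(records, min_count):
    """Keep only records whose data item was accessed at least min_count times.

    records: iterable of (user_key, url, timestamp) tuples (assumed layout).
    """
    records = list(records)
    # Count accesses per data item across all users.
    counts = Counter(url for _, url, _ in records)
    return [rec for rec in records if counts[rec[1]] >= min_count]
```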

The session dividing unit 21 of the access order assigning unit 2 divides the user's sequence of data accesses into sessions (step S13). A session is a sequence of accesses to data by the same user in which the time until the next data item is accessed is shorter than a predetermined time. The session dividing unit 21 determines whether the difference between the time at which the user accessed one data item and the time at which the user accessed the next data item exceeds a predetermined threshold Th_session. If it does, the unit judges that the session was split between the access to the first data item and the access to the next, and divides the sequence there.
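
The session division of step S13 can be sketched as below. The input layout, a time-sorted list of `(url, timestamp)` pairs for one user, and the name `th_session` for the threshold Th_session are assumptions for the example.

```python
def split_sessions(accesses, th_session):
    """Split one user's accesses into sessions.

    accesses: list of (url, timestamp) sorted by timestamp (assumed layout).
    A new session starts whenever the gap to the previous access
    exceeds th_session.
    """
    sessions = []
    for url, ts in accesses:
        # Open a new session on the first access or when the gap is too long.
        if not sessions or ts - sessions[-1][-1][1] > th_session:
            sessions.append([])
        sessions[-1].append((url, ts))
    return sessions
```

A gap exactly equal to the threshold does not split the session here, matching the "exceeds the threshold" wording of the description.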

The access order assigning unit 2 assigns an access order within each session. That is, it selects a session to which an access order has not yet been assigned and assigns an access order to that session (step S14). For example, the first access in the session is given the access number "000001", and each subsequent access in the session is given a number one greater than the previous one. The unit then determines whether an access order has been assigned to every session (step S16) and repeats step S14 until it has. This produces the access pattern file illustrated in FIG. 10. Note that, after step S14, the elapsed time assigning unit 22 of the access order assigning unit 2 may attach to each access the time elapsed since the start of the session (step S15).
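
Steps S14 and S15 can be sketched together as follows. The six-digit zero-padded numbering follows the "000001" example in the text; the row layout `(access_number, url, elapsed)` is only an assumption about what an access pattern file row might hold, since FIG. 10 is not reproduced here.

```python
def number_session(session):
    """Assign access numbers and elapsed times within one session.

    session: list of (url, timestamp) in access order (assumed layout).
    Returns rows of (access_number, url, elapsed_since_session_start).
    """
    if not session:
        return []
    start = session[0][1]  # timestamp of the first access in the session
    return [
        (f"{i:06d}", url, ts - start)
        for i, (url, ts) in enumerate(session, start=1)
    ]
```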

The access order assigning unit 2 then determines whether an access pattern file has been created for every user, and returns to step S12 if not. Once an access pattern file has been created for every user, the similarity calculation device finishes step S1 and proceeds to step S2 (FIG. 4).

Taking two different data items as data n and data m, the access order difference calculation unit 3 uses the access order determined above to obtain the difference N_list(n, m) between the position in the order at which user i (i = 1, ..., I, where I is the total number of users) accessed data n and the position at which user i subsequently accessed data m (step S2, FIG. 4). For example, I is the total number of users appearing in the access log stored in the access log storage unit 1. When, as in this example, the access order is assigned per session, the access order difference calculation unit 3 obtains the difference N_list(n, m) between the position at which user i accessed data n and the position at which user i subsequently accessed data m within the same session as that access to data n. The access order difference N_list(n, m) is thus calculated for each session of each user. Information on the calculated access order difference N_list(n, m) is sent to the access pattern similarity calculation unit 5.

A specific example of the processing of the access order difference calculation unit 3 will be described with reference to FIG. 6. The access order difference calculation unit 3 selects a user for whom the access order difference N_list(n, m) has not yet been calculated (step S21). The selected user is denoted user i.

The access order difference calculation unit 3 selects a session of user i for which the access order difference N_list(n, m) has not yet been calculated and, based on the access order assigned by the access order assigning unit 2, obtains for that session the difference N_list(n, m) between the position at which user i accessed data n and the position at which user i subsequently accessed data m (step S22). As illustrated by access pattern 1 in FIG. 11, when data n is http://aaa.bbb.com/content01.html with access order 1 and data m is http://aaa.bbb.com/content02.html with access order 100, the access order difference is N_list(n, m) = 100 - 1 = 99.

By determining whether the access order difference N_list(n, m) has been calculated for all of user i's sessions (step S23), step S22 is repeated until N_list(n, m) has been calculated for each session of user i.

The access order difference calculation unit 3 determines whether the access order difference N_list(n, m) has been calculated for all users (step S24). If not, processing returns to step S21. Once N_list(n, m) has been calculated for all users, processing proceeds to step S3.

The access time difference calculation unit 4 uses the access log to obtain the difference t_margin(n, m) between the time at which user i accessed data n and the time at which user i subsequently accessed data m (step S3). The calculated access time difference t_margin(n, m) is sent to the access pattern similarity calculation unit 5.

A specific example of the processing of the access time difference calculation unit 4 will be described with reference to FIG. 7. The access time difference calculation unit 4 selects a user for whom the access time difference t_margin(n, m) has not yet been calculated (step S31). The selected user is denoted user i.

The access time difference calculation unit 4 selects a session of user i for which the access time difference t_margin(n, m) has not yet been calculated and obtains, for that session, the access time difference t_margin(n, m), that is, the difference between the time at which user i accessed data n and the time at which user i subsequently accessed data m (step S32). The access time difference t_margin(n, m) may be obtained by referring to the access log read from the access log storage unit 1 or, if the elapsed time since the session start was attached to each access in step S15 (FIG. 5), by referring to those elapsed times.

By determining whether the access time difference t_margin(n, m) has been calculated for all of user i's sessions (step S33), step S32 is repeated until t_margin(n, m) has been calculated for each session of user i.

The access time difference calculation unit 4 determines whether the access time difference t_margin(n, m) has been calculated for all users (step S34). If not, processing returns to step S31. Once t_margin(n, m) has been calculated for all users, processing proceeds to step S4.

Note that in steps S22 and S32, when data n and data m each appear more than once in the same session, the access order difference N_list(n, m) and the access time difference t_margin(n, m) are calculated for every combination of an access to data n and an access to data m made after it in that session. That is, when data n and data m appear in one session as illustrated in FIG. 12, N_list(n, m) and t_margin(n, m) are calculated for each of three pairs: from data n at access order 1 to data m at access order 2, from data n at access order 1 to data m at access order 4, and from data n at access order 3 to data m at access order 4.
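
The pairing rule used in steps S22 and S32 can be sketched as follows. The row layout `(order, url, timestamp)` and the function name are assumptions; the logic pairs every access to data n with every later access to data m in the same session.

```python
def pair_differences(session, n, m):
    """Return (N_list, t_margin) for each n-then-m pair in one session.

    session: list of (order, url, timestamp) in access order (assumed layout).
    """
    diffs = []
    for i, (order_n, url_n, ts_n) in enumerate(session):
        if url_n != n:
            continue
        # Pair this access to n with every later access to m.
        for order_m, url_m, ts_m in session[i + 1:]:
            if url_m == m:
                diffs.append((order_m - order_n, ts_m - ts_n))
    return diffs
```

For a session in which n is accessed at orders 1 and 3 and m at orders 2 and 4, this yields order differences 1, 3, and 1, matching the three pairs listed for FIG. 12. Calling it with the arguments swapped gives the pairs needed for the reverse direction, α_{m-n}.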

The access order difference N_list(n, m) and the access time difference t_margin(n, m) can serve as indicators of how closely data n and data m are related. When N_list(n, m) and t_margin(n, m) are large, as in access pattern 1 of FIG. 11, data n and data m can be judged to be weakly related. When they are small, as in access pattern 2 of FIG. 11, data n and data m can be judged to be strongly related.

The access pattern similarity calculation unit 5 obtains the access pattern similarity α_{n-m}, based on the access pattern for data n and m, by evaluating a function f_1, monotonically decreasing in the access order difference N_list(n, m) and the access time difference t_margin(n, m), at those values (step S4). The function f_1 is given, for example, as the composition of a monotonically decreasing function f_2 and a monotonically increasing function f_3, both described below.

Here, a monotonically decreasing function is a function f such that f(x_1) ≥ f(x_2) for any x_1, x_2 with x_1 < x_2. Similarly, a monotonically increasing function is a function f such that f(x_1) ≤ f(x_2) for any x_1, x_2 with x_1 < x_2.

First, the first integration unit 51 of the access pattern similarity calculation unit 5 evaluates a function f_2, monotonically decreasing in the access order difference N_list(n, m) and the access time difference t_margin(n, m), at those values. With A a predetermined positive real number and e Napier's constant, f_2 is given, for example, by the following equation.

When, as in this example, the access order difference N_list(n, m) and the access time difference t_margin(n, m) are calculated for each session of each user, the above calculation is performed separately for each session of each user using that session's N_list(n, m) and t_margin(n, m). When two or more pairs of N_list(n, m) and t_margin(n, m) have been calculated within the same session, the calculation is performed separately for each pair. The results are sent to the second integration unit 52 of the access pattern similarity calculation unit 5.

The second integration unit 52 obtains the access pattern similarity α_{n-m} by evaluating a function f_3, monotonically increasing in each of the above results, at those results. For example, the sum of the results is taken as the access pattern similarity α_{n-m}. Alternatively, the maximum of the results may be taken as α_{n-m}. Using the maximum allows similar-page determination to take into account hidden connections between data that show up only in one user's distinctive access pattern.
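
The two-stage calculation can be sketched as below. The equation for f_2 is not reproduced in this text, so the exponential form used here is purely an assumption, chosen only because it uses a positive constant A and e and decreases in both arguments. The default value of `a` and the return value of 0.0 when no pairs exist are likewise assumptions. The sum and maximum options for f_3 follow the description.

```python
import math


def f2(n_list, t_margin, a=0.01):
    # ASSUMED form of f_2: the source equation is not available here.
    # It decreases as either the order difference or the time difference grows.
    return math.exp(-a * (n_list + t_margin))


def access_pattern_similarity(diffs, a=0.01, combine="sum"):
    """Combine per-pair f_2 values into alpha_{n-m}.

    diffs: list of (N_list, t_margin) over all users, sessions, and pairs.
    combine: "sum" or "max", the two examples of f_3 given in the text.
    """
    scores = [f2(n_list, t_margin, a) for n_list, t_margin in diffs]
    if not scores:
        return 0.0  # assumption: no co-occurrence means zero similarity
    return sum(scores) if combine == "sum" else max(scores)
```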

The access pattern similarity α_{n-m} obtained in this way is calculated solely from the log of each user's accesses to data n and m, and so reflects the behavior of the users who actually accessed the data. It also has the advantage that the similarity can be calculated without the relatively heavy processing of examining the contents of data n and m.

In the example above, the access order assigning unit 2 divided user i's access sequence into sessions and assigned an access order within each session. Alternatively, without dividing into sessions, access order numbers that increase by one may be assigned sequentially to each access in user i's access sequence.
Note that the similarity between data n and m may also be obtained taking into account a distance similarity based on the distance between the character strings that identify the data.

The distance calculation unit 61 (FIG. 1) calculates the distance L_{n-m} between the character string identifying data n and the character string identifying data m (step S51, FIG. 8). The calculated distance L_{n-m} is sent to the distance similarity calculation unit 62. When the data is a Web page, the character string identifying the data is its URL (Uniform Resource Locator). The Levenshtein distance, for example, is used as the distance L_{n-m} (see, for example, Reference 1).
[Reference 1] [Retrieved December 4, 2008], Internet <URL: http://en.wikipedia.org/wiki/Levenshtein_distance>
The Levenshtein distance defines the distance between two character strings based on how many character insertions, deletions, and substitutions must be applied to one string to produce the other. That number of insertions, deletions, and substitutions is the distance between the two strings.

For example, in character string similarity pattern 1 of FIG. 13, the string "http://aaa.bbb.com/content02.html" can be produced from the string "http://aaa.bbb.com/content01.html" by changing the character "1" to the character "2", that is, by one substitution. The Levenshtein distance between the two strings in character string similarity pattern 1 is therefore 1.

In character string similarity pattern 2 of FIG. 13, the string "http://aaa.bbb.com/aaa.html" can be produced from the string "http://aaa.bbb.com/content01.html" by substituting "a" for each of the characters "c", "o", and "n" and deleting the characters "t", "e", "n", "t", "0", and "1", that is, by three substitutions and six deletions. The Levenshtein distance between the two strings in character string similarity pattern 2 is therefore 9.
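
The two worked examples can be checked with a standard dynamic-programming implementation of the Levenshtein distance. The function below is a generic textbook sketch, not code from the patent.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


base = "http://aaa.bbb.com/content01.html"
print(levenshtein(base, "http://aaa.bbb.com/content02.html"))  # pattern 1
print(levenshtein(base, "http://aaa.bbb.com/aaa.html"))        # pattern 2
```

The two printed values are 1 and 9, agreeing with the distances stated for patterns 1 and 2.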

The distance similarity calculation unit 62 obtains the distance similarity β_{n-m} by evaluating a function f_4, monotonically decreasing in the distance L_{n-m}, at the calculated distance L_{n-m} (step S52). The resulting distance similarity β_{n-m} is sent to the similarity integration unit 7. With B a predetermined positive real number and e Napier's constant, f_4 is given, for example, by the following equation.

The similarity integration unit 7 obtains the similarity R_{n-m} by evaluating a function f_5, monotonically increasing in the access pattern similarity α_{n-m} and the distance similarity β_{n-m}, at those values (step S7, FIG. 4). With k_α and k_β predetermined positive real numbers, f_5 is given, for example, by the following equation. For example, k_α = k_β = 1.
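
The equations for f_4 and f_5 are not reproduced in this text, so the sketch below substitutes assumed forms that merely satisfy the stated properties: an exponential decay in the distance using a positive constant B, and a weighted sum with positive weights k_α and k_β. The default value of `b` is also an assumption.

```python
import math


def distance_similarity(distance, b=0.1):
    # ASSUMED form of f_4: decreases monotonically as the string distance grows.
    return math.exp(-b * distance)


def integrate(alpha, beta, k_alpha=1.0, k_beta=1.0):
    # ASSUMED form of f_5: a weighted sum, which increases in both alpha and beta.
    return k_alpha * alpha + k_beta * beta
```

With these assumed forms, a URL pair at distance 1 scores a higher β_{n-m} than a pair at distance 9, and the default weights correspond to the k_α = k_β = 1 example.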

In this way, the similarity between data n and m may be obtained by also taking into account the distance similarity based on the distance between the character strings that identify the data.
The similarity between data n and m may also be obtained by taking into account the similarity between the content of data n and the content of data m. For example, when data n and m each include text, the content similarity is taken into account as follows.

The first noun extraction unit 71 extracts each noun included in the text of data n. When the data is a Web page, the character string enclosed by the <body> tag is taken from the source of data n and the HTML tags are removed. The first N characters are then taken, where N is a predetermined positive integer. Morphological analysis is performed on these first N characters, and each noun they contain is extracted. The extracted nouns are sent to the duplicate noun count calculation unit 73.

The second noun extraction unit 72 extracts each noun included in the text of data m in the same manner as the first noun extraction unit 71. The extracted nouns are sent to the duplicate noun count calculation unit 73.
The duplicate noun count calculation unit 73 calculates the duplicate noun count, which is the number of nouns common to the nouns extracted by the first noun extraction unit 71 and the nouns extracted by the second noun extraction unit 72, and uses it as the content similarity γ_{n−m}. The calculated content similarity γ_{n−m} is sent to the similarity integration unit 7.
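The flow through units 71 to 73 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `toy_noun_extractor` is a hypothetical stand-in that merely looks words up in a small hand-made list, whereas the embodiment performs morphological analysis; the function names and sample pages are also invented for the example.

```python
import re
from typing import Callable, Set


def body_text(html: str, n_chars: int) -> str:
    """Take the <body> content, remove HTML tags, and keep the first n_chars characters."""
    match = re.search(r"<body[^>]*>(.*?)</body>", html, flags=re.DOTALL | re.IGNORECASE)
    inner = match.group(1) if match else html
    # Tags are replaced by a space so that adjacent words do not run together.
    text = re.sub(r"<[^>]+>", " ", inner)
    return text[:n_chars]


def toy_noun_extractor(text: str) -> Set[str]:
    """Hypothetical placeholder for morphological analysis (assumption)."""
    known_nouns = {"camera", "lens", "battery", "price", "review"}
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w in known_nouns}


def content_similarity(html_n: str, html_m: str, n_chars: int,
                       extract: Callable[[str], Set[str]] = toy_noun_extractor) -> int:
    """Duplicate noun count: number of nouns common to both texts."""
    return len(extract(body_text(html_n, n_chars)) & extract(body_text(html_m, n_chars)))


page_n = "<html><body><h1>Camera review</h1><p>The lens and battery are good.</p></body></html>"
page_m = "<html><body><p>Battery price and lens price.</p></body></html>"
print(content_similarity(page_n, page_m, n_chars=200))  # 2 ("lens" and "battery")
```

Passing the extractor as a parameter keeps the counting logic independent of whichever morphological analyzer is actually used.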

The similarity integration unit 7 obtains the similarity R_{n−m} by inputting the calculated access pattern similarity α_{n−m}, the calculated distance similarity β_{n−m}, and the calculated content similarity γ_{n−m} into a monotonically increasing function f_6 of α_{n−m}, β_{n−m}, and γ_{n−m}.

For example, when α_{n−m} + β_{n−m} is equal to or greater than a predetermined threshold Th3 and the content similarity γ_{n−m} is equal to or greater than a predetermined threshold Th4, the similarity is set to R_{n−m} = α_{n−m} + β_{n−m}; otherwise, it is set to R_{n−m} = 0. This makes it possible to raise the similarity of, for example, a series of data written about the same subject.
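The threshold example above can be written directly as a small Python function. The function name `gated_similarity` and the numeric values in the usage lines are assumptions chosen for illustration.

```python
def gated_similarity(alpha: float, beta: float, gamma: float,
                     th3: float, th4: float) -> float:
    """Return alpha + beta when both thresholds are met, and 0 otherwise."""
    combined = alpha + beta
    if combined >= th3 and gamma >= th4:
        return combined
    return 0.0


# Enough shared nouns (gamma = 3 >= Th4 = 2): the combined value is kept.
print(gated_similarity(0.5, 0.25, gamma=3, th3=0.5, th4=2))
# Too few shared nouns (gamma = 1 < Th4 = 2): the similarity is set to 0.
print(gated_similarity(0.5, 0.25, gamma=1, th3=0.5, th4=2))
```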

A similarity determination unit 8 may be provided that determines whether the access pattern similarity α_{n−m} or the similarity R_{n−m} is equal to or greater than a predetermined threshold. If it is, the similarity determination unit 8 determines that data n and m are similar and outputs information to that effect. Otherwise, it determines that data n and m are not similar and outputs information to that effect.

The similarity determination unit 8 may also determine that data n and m are similar, and output information to that effect, when the access pattern similarity α_{n−m} or the similarity R_{n−m} is equal to or greater than a predetermined threshold Th1 and the access pattern similarity α_{m−n} or the similarity R_{m−n} is equal to or greater than a predetermined threshold Th2. The thresholds Th1 and Th2 need not be equal.

That is, the similarity determination unit 8 may determine that data n and data m are similar when one or both of the similarities α_{n−m} and α_{m−n} are higher than a predetermined first threshold, or when one or both of the similarities R_{n−m} and R_{m−n} are higher than a predetermined second threshold.
Furthermore, the similarity determination unit 8 may determine that data n and data m are similar when one or both of α_{n−m} and α_{m−n} are higher than the predetermined first threshold and one or both of the content similarities γ_{n−m} and γ_{m−n} are higher than a predetermined third threshold, or when one or both of R_{n−m} and R_{m−n} are higher than the predetermined second threshold and one or both of γ_{n−m} and γ_{m−n} are higher than the predetermined third threshold.
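The decision variants described for the similarity determination unit 8 can be sketched as simple predicates. The function and parameter names below are hypothetical; `sim_nm` and `sim_mn` stand for either α or R in the two directions.

```python
def similar_in_both_directions(sim_nm: float, sim_mn: float,
                               th1: float, th2: float) -> bool:
    """Variant with Th1 and Th2: both directions must reach their thresholds."""
    return sim_nm >= th1 and sim_mn >= th2


def similar_in_either_direction(sim_nm: float, sim_mn: float,
                                threshold: float) -> bool:
    """Variant where one or both directions exceed a single threshold."""
    return sim_nm > threshold or sim_mn > threshold


def similar_with_content(sim_nm: float, sim_mn: float,
                         gamma_nm: float, gamma_mn: float,
                         threshold: float, content_threshold: float) -> bool:
    """Variant that additionally requires a content similarity above the third threshold."""
    return (similar_in_either_direction(sim_nm, sim_mn, threshold)
            and (gamma_nm > content_threshold or gamma_mn > content_threshold))


print(similar_in_both_directions(0.8, 0.3, th1=0.5, th2=0.5))    # False
print(similar_in_either_direction(0.8, 0.3, threshold=0.5))      # True
print(similar_with_content(0.8, 0.3, 1, 1, threshold=0.5, content_threshold=2))  # False
```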

In this example, the access pattern similarity α_{n−m} or the similarity R_{n−m} is calculated for data n and m, which is one combination of two different data. However, the similarity may be calculated for every combination of two different data included in the access log, and whether each combination is similar may then be determined. Because α_{n−m} or R_{n−m} does not necessarily equal α_{m−n} or R_{m−n}, not only the similarity from data n to data m (α_{n−m} or R_{n−m}) but also the similarity from data m to data n (α_{m−n} or R_{m−n}) may be calculated when making this determination.
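Because the similarity is directional, evaluating every combination amounts to iterating over ordered pairs. The sketch below illustrates this with `itertools.permutations`; the callable `similarity`, the data identifiers, and the scores are hypothetical placeholders for the calculation described above.

```python
from itertools import permutations
from typing import Callable, Dict, Iterable, Tuple


def decide_all_pairs(data_ids: Iterable[str],
                     similarity: Callable[[str, str], float],
                     threshold: float) -> Dict[Tuple[str, str], bool]:
    """Evaluate every ordered pair (n, m) with n != m against the threshold."""
    decisions = {}
    for n, m in permutations(data_ids, 2):
        # (n, m) and (m, n) are separate entries because the two
        # directions do not necessarily yield the same similarity.
        decisions[(n, m)] = similarity(n, m) >= threshold
    return decisions


# Hypothetical directed scores for two data items.
scores = {("a", "b"): 0.9, ("b", "a"): 0.2}
decisions = decide_all_pairs(["a", "b"], lambda n, m: scores[(n, m)], threshold=0.5)
print(decisions)  # {('a', 'b'): True, ('b', 'a'): False}
```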

[Data search system and method]
An embodiment of the data search system and method according to the present invention will be described with reference to FIGS. 3 and 9. FIG. 3 is a block diagram illustrating the data search system, and FIG. 9 is a flowchart illustrating the data search method.
The input unit 201 of the search device 200 acquires a query, such as a search word, from the user terminal 300 (step A1). The user terminal 300 is, for example, an information terminal device such as a mobile phone, or a personal computer.

The data information acquisition unit 202 of the search device 200 acquires information about the data corresponding to the received query (step A2). The acquired information is sent to the similar data information acquisition unit 203. For example, when the data is a Web page, Web pages containing the search word given as the query are acquired using any Web page search technique. The search device 200 may acquire this information using an external search engine 400 (for example, goo (registered trademark)) or, if it has a built-in search engine function, using that function.

The similar data information acquisition unit 203 acquires, from the similarity calculation device 100, information about data similar to the data corresponding to the query (step A3).

For example, the similarity calculation device 100 calculates in advance the similarity for every combination of two different data n and m included in the access log, and the similarity determination unit 8 determines whether each combination of data n and m is similar. The search device 200 sends information about the data corresponding to the query to the similarity calculation device 100. The similarity calculation device 100 sends information about data similar to that data back to the search device 200, and the similar data information acquisition unit 203 acquires it.

The output unit 204 of the search device 200 outputs, to the user terminal 300, the information about the data corresponding to the query together with the information about data similar to that data (step A4).

FIG. 14 shows an example of a screen displayed on the display of the user terminal 300 based on the information provided from the output unit 204 when the data is a Web page. As in this example, information about pages similar to each of the top-ranked Web pages in the search results is provided to, for example, the user terminal 300.
When there are multiple similar data for the data corresponding to the query, those multiple similar data may be provided to the user terminal 300. Furthermore, the similar data information acquisition unit 203 may send the information about the acquired similar data to the similarity calculation device 100, acquire information about data similar to that similar data, and provide it to the user terminal 300 as well.
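Steps A2 to A4, including the optional retrieval of data similar to the similar data, can be sketched as follows. This is an illustration under stated assumptions: the in-memory `index` and `similar_to` dictionaries, the substring match used as a search, and all names are hypothetical stand-ins for the search engine and the similarity calculation device 100.

```python
from typing import Dict, List


def related(page: str, similar_to: Dict[str, List[str]], hops: int = 1) -> List[str]:
    """Collect similar data, following similar-of-similar links up to `hops` times."""
    seen = {page}
    frontier = [page]
    found: List[str] = []
    for _ in range(hops):
        next_frontier = []
        for current in frontier:
            for candidate in similar_to.get(current, []):
                if candidate not in seen:
                    seen.add(candidate)
                    found.append(candidate)
                    next_frontier.append(candidate)
        frontier = next_frontier
    return found


def search_with_similar(query: str, index: Dict[str, str],
                        similar_to: Dict[str, List[str]], hops: int = 1) -> List[dict]:
    hits = [page for page, text in index.items() if query in text]   # step A2
    return [{"page": page, "similar": related(page, similar_to, hops)}  # step A3
            for page in hits]                                        # returned for step A4


index = {"a": "digital camera", "c": "camera lens", "d": "recipes"}
similar_to = {"a": ["b"], "b": ["a", "c"], "c": []}
print(search_with_similar("camera", index, similar_to))
```

Setting `hops=2` corresponds to also providing data similar to the similar data; the `seen` set keeps the original page and duplicates out of the result.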

Publicly available protocols, such as POST, can be used for communication between the search device 200 and the similarity calculation device 100 and for communication between the search device 200 and the external search engine. Publicly available formats, such as XML, HTML, or a structured table, can likewise be used.

When the above configuration is implemented by a computer, the processing performed by each unit of the similarity calculation device is described by a program. Executing this program on the computer realizes the functions of the units on the computer.

That is, the CPU reads and executes the program sequentially, thereby realizing the access order assigning unit 2, the session division unit 21, the elapsed time assigning unit 22, the access order difference calculation unit 3, the access time difference calculation unit 4, the access pattern similarity calculation unit 5, the first integration unit 51, the second integration unit 52, the distance calculation unit 61, the distance similarity calculation unit 62, the similarity integration unit 7, the first noun extraction unit 71, the second noun extraction unit 72, the duplicate noun count calculation unit 73, and the similarity determination unit 8. The auxiliary storage device or the memory functions as the access log storage unit 1.

The CPU functioning as each unit of the similarity calculation device processes data read from the memory or the auxiliary storage device and stores the processed data in the memory or the auxiliary storage device. In other words, data is exchanged between the units of the similarity calculation device via the memory or the auxiliary storage device.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

As an alternative to the embodiment described above, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be performed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented in hardware.
The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device executing them or as required.
Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

Functional block diagram illustrating the similarity calculation device.
Functional block diagram illustrating the access pattern similarity calculation unit.
Functional block diagram illustrating the search system.
Flowchart illustrating the similarity calculation method.
Flowchart of an example of step S1.
Flowchart of an example of step S2.
Flowchart of an example of step S3.
Flowchart of an example of step S5.
Flowchart illustrating the search method.
Diagram illustrating the access pattern table.
Diagram for explaining the access pattern similarity.
Diagram for explaining the calculation when multiple combinations of data n and m appear in the same session.
Diagram illustrating information provided to the user.
Diagram for explaining the Levenshtein distance.

Explanation of symbols

1 Access log storage unit
2 Access order assigning unit
3 Access order difference calculation unit
4 Access time difference calculation unit
5 Access pattern similarity calculation unit
51 First integration unit
52 Second integration unit
61 Distance calculation unit
62 Distance similarity calculation unit
7 Similarity integration unit
71 First noun extraction unit
72 Second noun extraction unit
73 Duplicate noun count calculation unit
8 Similarity determination unit
10 Access frequency calculation unit
100 Similarity calculation device
200 Search device
201 Input unit
202 Data information acquisition unit
203 Similar data information acquisition unit
204 Output unit

Claims

1. A similarity calculation apparatus comprising:
an access order assigning unit that determines, using an access log read from an access log storage unit storing an access log of accesses to each data by each user, the order in which each user accessed each data that the user accessed;
an access order difference calculation unit that, taking two different data as data n and m, obtains the difference N_list(n, m) between the order in which user i accessed data n and the order in which user i subsequently accessed data m;
an access time difference calculation unit that obtains the difference t_margin(n, m) between the time at which user i accessed data n and the time at which user i subsequently accessed data m; and
an access pattern similarity calculation unit that obtains an access pattern similarity α_{n−m}, based on the access pattern for data n and m, by inputting the access order difference N_list(n, m) and the access time difference t_margin(n, m) into a monotonically decreasing function f_1 of the access order difference N_list(n, m) and the access time difference t_margin(n, m),
the apparatus further comprising a similarity determination unit that determines that data n and data m are similar when both the access pattern similarity α_{n−m} and the access pattern similarity α_{m−n} are higher than a predetermined threshold.

2. The similarity calculation apparatus according to claim 1, wherein the access pattern similarity calculation unit includes:
a first integration unit that obtains, for each user i, a calculation result by inputting the access order difference N_list(n, m) and the access time difference t_margin(n, m) into a monotonically decreasing function f_2 of the access order difference N_list(n, m) and the access time difference t_margin(n, m); and
a second integration unit that obtains the access pattern similarity α_{n−m} by inputting the calculation results into a monotonically increasing function f_3 of the calculation results.

3. The similarity calculation apparatus according to claim 2, wherein:
a session is a sequence of accesses to data by the same user in which the time until the next data is accessed is shorter than a predetermined time;
the access order difference calculation unit obtains the difference N_list(n, m) between the order in which user i accessed data n and the order in which user i subsequently accessed data m in the same session as the access to data n;
the access time difference calculation unit obtains the difference t_margin(n, m) between the time at which user i accessed data n and the time at which user i subsequently accessed data m in the same session as the access to data n; and
the first integration unit obtains, for each session of each user i, a calculation result by inputting the access order difference N_list(n, m) and the access time difference t_margin(n, m) into the monotonically decreasing function f_2 of the access order difference N_list(n, m) and the access time difference t_margin(n, m).

4. The similarity calculation apparatus according to any one of claims 1 to 3, further comprising:
a distance calculation unit that calculates a distance L_{n−m} between a character string that identifies data n and a character string that identifies data m;
a distance similarity calculation unit that obtains a distance similarity β_{n−m} by inputting the distance L_{n−m} into a monotonically decreasing function f_4 of the distance L_{n−m}; and
a similarity integration unit that obtains a similarity R_{n−m} by inputting the access pattern similarity α_{n−m} and the distance similarity β_{n−m} into a monotonically increasing function f_5 of the access pattern similarity α_{n−m} and the distance similarity β_{n−m}.

5. The similarity calculation apparatus according to any one of claims 1 to 4, wherein the data includes text, the apparatus further comprising:
a first noun extraction unit that extracts each noun included in the text of data n;
a second noun extraction unit that extracts each noun included in the text of data m; and
a duplicate noun count calculation unit that calculates, as a content similarity γ_{n−m}, a duplicate noun count, which is the number of nouns common to the nouns extracted by the first noun extraction unit and the nouns extracted by the second noun extraction unit,
wherein the similarity determination unit determines that data n and data m are similar when one or both of the access pattern similarity α_{n−m} and the access pattern similarity α_{m−n} are higher than a predetermined first threshold and one or both of the content similarity γ_{n−m} and the content similarity γ_{m−n} are higher than a predetermined third threshold, or when one or both of the similarity R_{n−m} and the similarity R_{m−n} are higher than a predetermined second threshold and one or both of the content similarity γ_{n−m} and the content similarity γ_{m−n} are higher than the predetermined third threshold.

6. A data search system comprising:
the similarity calculation apparatus according to claim 5; and
a search device including a data information acquisition unit that acquires information about data corresponding to a received query, a similar data information acquisition unit that acquires, from the similarity calculation apparatus, information about data similar to the data corresponding to the query, and an output unit that outputs the information about the similar data together with the information about the data corresponding to the query.

7. A similarity calculation method comprising:
an access order assigning step in which an access order assigning unit determines, using an access log read from an access log storage unit storing an access log of accesses to each data by each user, the order in which each user accessed each data that the user accessed;
an access order difference calculation step in which an access order difference calculation unit, taking two different data as data n and m, obtains the difference N_list(n, m) between the order in which user i accessed data n and the order in which user i subsequently accessed data m;
an access time difference calculation step in which an access time difference calculation unit obtains the difference t_margin(n, m) between the time at which user i accessed data n and the time at which user i subsequently accessed data m; and
an access pattern similarity calculation step in which an access pattern similarity calculation unit obtains an access pattern similarity α_{n−m}, based on the access pattern for data n and m, by inputting the access order difference N_list(n, m) and the access time difference t_margin(n, m) into a monotonically decreasing function f_1 of the access order difference N_list(n, m) and the access time difference t_margin(n, m),
the method further comprising a similarity determination step in which a similarity determination unit determines that data n and data m are similar when both the access pattern similarity α_{n−m} and the access pattern similarity α_{m−n} are higher than a predetermined threshold.

8. A data search method comprising:
each step of the similarity calculation method according to claim 7;
an output step in which an output unit outputs information about similar data together with information about data corresponding to a query;
a data information acquisition step in which a data information acquisition unit acquires the information about the data corresponding to the received query; and
a similar data information acquisition step in which a similar data information acquisition unit acquires the information about data similar to the data corresponding to the query.

9. A similarity calculation program for causing a computer to function as the similarity calculation apparatus according to claim 1.