CN115409155A

CN115409155A - Information cascade prediction system and method based on Transformer enhanced Hooke process

Info

Publication number: CN115409155A
Application number: CN202210983826.9A
Authority: CN
Inventors: 周帆; 余柳; 钟婷; 徐增; 匡平
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2022-11-29
Anticipated expiration: 2042-08-17
Also published as: CN115409155B

Abstract

The invention discloses an information cascading prediction system and method based on Transformer-enhanced Hawkes process, including a user embedding module, a global dependency module, a local dependency module, an intensity function acquisition module and a popularity prediction module. Firstly, the user position- Wise embedding and time coding are used as user embedding, and then from the perspective of topology, the path-aware assumption is introduced, and two layers of attention are designed from the perspectives of global embedding and local mode, parameterizing the strength function of the Hawkes process, and combining Hawkes process, global embedding and local patterns, learning the temporal and topological stochastic properties of information cascade diffusion process coupling for popularity prediction; the present invention extends the traditional Hawkes process and effectively learns from the continuous time domain Gain knowledge to improve the accuracy of popularity predictions.

Description

Information cascade prediction system and method based on Transformer enhanced Hooke process

Technical Field

The invention belongs to the technical field of Information processing, relates to Information diffusion (Information diffusion), information Cascade (Information Cascade) and Popularity Prediction (Popularity Prediction) based on Deep Learning (Deep Learning), and particularly relates to a method and a system (Hawkesformer) for introducing Transformer into a Hooke Process (Hawkes Process) to perform coupling modeling on time characteristics and topological structure of Cascade and finally realize Popularity Prediction (Cascade Prediction).

Background

Content sharing through social media platforms such as Twitter, facebook and microblog has become a main channel for expression opinions and reaction opinions of individuals on various topics. This trend has produced a large amount of data describing the manner in which information is disseminated, and has brought unprecedented convenience for generating, communicating, and disseminating information items. The initial distribution of information (e.g., tweet) and subsequent sharing and forwarding (e.g., retweet) forms a concatenation of information, representing what is commonly referred to as the information dissemination process. Similar phenomena are found in other (non-social media) settings, paper quotations, blog spaces, and email forwarding. The large number of user sharing behaviors promotes the rapid and massive spread of information concatenation. A typical task of information concatenation is to predict the scale of a potentially affected user after a certain period of time (tweet, microblog, etc.), i.e., popularity prediction (size of information concatenation), which brings enormous economic and social value in many downstream applications. For example, knowing which type of advertisement has maximized influence may make ad placement more accurate; predicting rumors' potential impact users so that administrators can intervene early to avoid serious consequences, etc.

The existing methods are mainly divided into three categories: (i) Feature-based models that mainly explore feature sets extracted from structures, time series, user content and profiles; however, they rely on a large number of hand-made feature projects and cannot be generalized from one field to another; (ii) Probabilistic generative models, aimed at modeling events by classical examples of point-in-time processes (TPPs), including poisson and hokes processes; however, the existing method makes a strong assumption on a cascading diffusion mechanism, and has limited learning capacity on large-scale cascading data; (iii) The deep learning model focuses more on static time cascade diffusion in a specific snapshot in a discrete time domain, has little focus on the whole evolution process and fully utilizes all cascade dynamics, and mainly uses a simple event sequence model (such as RNN) to model cascade events or is assisted by GNN to separately model time characteristics and topological characteristics; but the two are coupled in nature, and the mode of separated modeling loses cross-domain information and reduces the expression and prediction capability of the model.

Disclosure of Invention

Aiming at the problems of limited learning capacity, cross-domain information loss and the like of the current information cascade prediction method, the invention aims to provide a novel modeling mode for the cascade diffusion Process, generalize the cascade diffusion Process into a Directed Acyclic Graph (DAG) which is continuously diffused in a continuous time domain, introduce a Transformer (attention mechanism) into a Hawkes Process, introduce an attention mechanism into the Hawkes Process to model time characteristics and topological characteristics in the continuous time domain in a coupling manner, realize effective modeling of the cascade diffusion Process and improve the accuracy of popularity prediction.

The general idea of the invention is as follows: two levels of attention modules are connected together to parameterize the intensity function of the hokes process. The strength function ensures that the coupling time and topological dependence of the cascade is modeled in the continuous time domain. Specifically, at a first level, a global (global) dependent module is designed to dynamically capture long-term diffusion processes of events that have occurred in the past in the cascade, and main/non-main path assumptions are proposed to adaptively integrate the diffusion processes onto the underlying DAG. This allows each node participating in the cascade anywhere to update its current hidden state. At the second level, a local module is designed to capture the short-term evolution rate of the information cascade by encoding local patterns in a fixed time slice window. Through the design of two levels, the representation of the nodes on the global and local is taken, and finally, a full connection layer is fed for popularity prediction.

Before a particular flow is given, some basic terms and definitions are given. Given a concatenation of information C _k And it is at observation time t _o Forwarding history within a network

Wherein t is _j ∈(0,t _o ]L is within the observation time window (t) ₀ ,t ₀ +t _o ) Number of forwarding of t ₀ Indicating the start time. Each triplet (t) _j ,v _j ,u _j ) Is represented at t _j Time of day, user u _j Forwards user v _j . For example, the forwarding history of FIG. 1 is:

(t ₄ ,u ₁ ,u ₄ ),(t ₅ ,u ₃ ,u ₅ ) For simplicity, the following is omitted

In addition, we use

Representing the set of timestamps before the current time.

Based on the above inventive concept, the information cascade prediction system based on the Transformer enhanced hokes process provided by the invention comprises:

the User embedding module (Initial User indexes) is used for acquiring User position-wise embedding and time coding and taking the User position-wise embedding and the time coding as User embedding;

a Global Dependency Module (Global Dependency Module) for distinguishing all predecessor nodes as main path nodes and non-main path nodes for each user node in user embedding, and obtaining attention probability matrixes of the main path nodes and the non-main path nodes, further obtaining Global Dependency of each node, and obtaining Global embedding of all nodes through full connection;

a Local Dependency Module (Local Dependency Module) for obtaining a Local mode of each node according to a set time slice window, and then performing pooling operation on the Local mode of each node to obtain a pooled representation of each node; finally, generating a hidden vector of each node local mode based on a mask attention mechanism;

the intensity function acquisition module is used for parameterizing an intensity function of the Hox process according to the global embedding vector and the local mode hiding vector;

and the popularity prediction module is used for generating a popularity prediction result by combining the integral of the Hox process and the full-connection processing result of the global embedding and local mode hiding vectors.

The user embedding module, the user embedding X, is composed of position-wise embedding and time coding and is used as the input of a subsequent module (a global dependency module).

The user embedded module comprises a Position-wise embedded acquisition submodule and a time code acquisition submodule.

Position-wise Embedded acquisition submodule, including shared matrix

Where D is the user-embedded dimension and L is the maximum number of hops. The jth column of U represents the jth forwarding user, and each cascade of nodes in the same order on the time axis share an initial embedding. The shared matrix U is first randomly initialized and then subsequent context embeddings related to the underlying cascade are dynamically updated by subsequent global dependency modules. The shared matrix U greatly reduces the computational overhead, making the system more efficient. In a cascade, for the j-th forwarded user u _j Let v _j Representing its one-hot index vector on the shared matrix U, then user U _j Position-wise embedding of (A) is determined by (U) v _j The index is obtained, wherein

Representing the set of one-hot indexes embedded by all users. Thus, the Position-wise embedding of the user, which is derived by the Position-wise embedding derivation submodule, is UV based on the shared matrix U.

2. The time code acquisition submodule (Temporal Encoding), a key component of the present invention, is a two-level attention module, and since the attention mechanism does not consider the sequence dependency, the time code acquisition submodule characterizes the sequence position by using the time code z, which is defined as follows:

the temporal coding is a variant from the original transform absolute position coding. Wherein

Is time-coded, D is the dimension of the time-coding, [ z (t) _j )] _i Represents the value of the ith dimension of the code, i ∈ [1, D ]]。

Is a collection of temporal codes.

Thus, the forwarding sequence

The initial user embedding of (a) can eventually be written as: x = UV + Z. The initial user embedding X consists of a learnable shared position-wise embedding UV together with a fixed time coding Z.

The global dependency module is referred to as a global module for short; the global dependency of the information cascade can be seen as an indication that a certain user may promote or suppress another user during long-term diffusion.

The global dependency module includes:

the path perception attention layer is used for distinguishing all the precursor nodes into a main path node and a non-main path node for each user node in user embedding, acquiring long-term dependence attention scores of the main path node and the non-main path node and constructing an attention probability matrix of the main path and the non-main path; acquiring global dependence of each node according to the attention probability matrix of the main path and the non-main path of each node;

and the first output network is used for carrying out full connection processing on the global dependence of each node to obtain global embedding of all nodes.

After the user embedding is obtained, a path perception attention layer is designed in the global module to parameterize the intensity function of the Hawkesformer, the long-term diffusion dependence between the input cascade nodes is captured, and the time topological dependence of the coupling of the main path node and the non-main path node is extracted by using the user complete diffusion dependence extraction sub-module.

The theory of the hokes process states that: the final popularity of information concatenation is related to the past participation of the user, which means that the current node affects not only its immediate (immedate) forwarder, but also non-immediate (non-immedate) forwarders by transitivity. In order to indicate the influence of each node on the future popularity of the cascade from a topological point of view, a path-aware assumption, i.e. a primary path/non-primary path assumption, is therefore proposed in the global module: consider a node on the main path to be current node u _j The contribution of (b) is larger, the node on the non-main path is next to u _j The arrival contribution of the node is small but still has an impact on the final popularity, which also follows the assumptions of the hokes process. For the current node u _j All predecessor nodes

Are divided into two categories: (1) Nodes on the main path, denoted

It includes a front node u _j Backtracking to root node u ₀ All previous nodes on the entire forwarding path; (2) Nodes on the non-primary path, i.e. all nodes except the primary path node, are marked as

An example is given in fig. 2, where for the current node u ₄ The nodes on the primary path are represented by (1) and (2), and the nodes on the non-primary path are represented by (3).

Based on the above explanation, the implementation steps of the global dependency module are as follows:

given a current node u _j To obtain a node between it and a main path

Long term dependent attention score of (c):

wherein a, b and

respectively representing user embedded matrix X main path Q (Query) matrix

Non-main path Q (Query) matrix

And K (Key) matrix

The transformed users embed vectors, and the three transformation matrices are used for distinguishing roles of the users in long-term dependence. Wherein,<·,·>representing the inner product function. Similarly, current node u _j A node in the non-main path

Long term dependence attention scores of (a) are:

then, attention weighting and calculation are performed through the following formula to obtain the user u _j Complete diffusion dependence e _j ：

Finally, the context embedding of each node is obtained according to the attention scores of the primary path and the non-primary path of each node: e = { E = { E) ₀ ,e ₁ ,…,e _j ,…,e _L Each line in E represents the global dependency of a certain node in the cascade;

embedding matrix X via V (Value) matrix on behalf of user

The converted user embeds the vector.

Above for the current single user u _j Is calculated. In practical operation, the above attention is relied on using matrix parallel to speed up the operation. The matrix calculation flow of global dependency can refer to fig. 3.

Firstly, based on the path perception attention layer, the attention scores of the current node on the main path and the non-main path are calculated through the following formulas:

wherein M is ^P ，

One-hot matrix for respectively reserving attention scores of nodes on the main path and the non-main path if

Then element

Accordingly, if

Finally, the attention probability matrix for the primary and non-primary paths can be derived as:

where D is the dimension of the input embedding. Mask matrix

To prevent the system from peeping into future data if i<j，M _i,j =0, otherwise M _i,j And = - ∞. Matrix M forces the softmax function to be only at u _j Distributing valid long-term dependent attention on previous forwarders to subsequent users (i.e., later than t) _j Active user) is assigned 0. Each column vector in matrix A represents u _j To its precursor node

Including the main path

And a non-main path

A node of (c).

Then, a matrix form of the global dependency of each node is obtained according to the following formula:

E＝(W ^V X)A

in order to learn more semantic information, the path perception attention layer can be designed as a module based on a multi-head attention mechanism, context embedding and mapping are carried out on different subspaces, the subspaces can use the matrix multiplication parallel operation given in the foregoing, and then a plurality of heads E are calculated ₁ ,E ₂ ,…,E _H Then connecting them into E = W ^O Concat(E ₁ ,E ₂ ,…,E _H ) Wherein

Is an aggregation matrix. Finally, we feed E into the first output network (here the fully connected neural network FCNN) resulting in a global embedding of all cascades:

h(t _j )＝H(:,j)

wherein,

is a parameter of the neural network; h (t) _j ) Is that

J-th column in (1) represents user u _j Is indicative of long term dependence.

The local dependency module may be referred to as a local module for short; the rate of increase of the forwarding number in the information concatenation is not stable, and the final popularity is greatly influenced by short-term outbreaks in the propagation process. To better embed the evolution rate of the information cascade, pooled attention is used here to capture local patterns. And performing a pooling operation on the local mode of each current node based on the learned user global embedded representation in the global dependency module.

The local dependency module includes:

the local mode acquisition submodule is used for intercepting a part positioned in front of each node from the global dependence according to a set time slice window to obtain a local mode of each node;

the pooling submodule is used for performing pooling operation on the local mode of each node to obtain pooling representation of each node;

and the mask attention layer is used for dynamically aggregating the local mode pooling representation of each node and the nodes before the node, and generating the hidden vector of the local mode of each node.

Each current node has a local mode, and a fixed time slice window t is set _s Utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from the global dependency to obtain a local mode of each node; thus, the local mode of the current node is determined by the current node u _j And a global dependency component of some predecessor nodes, denoted as

Representing the set of all local patterns of a piece of information concatenation.

For each current node u _j Local mode to it by pooling submodules

Performing a pooling operation to polymerize a time slice t _s Generating a new representation of the feature in (1)

Therefore, all local modes

The total pooled representation of (a) may be formalized as:

each column in (b) corresponds to a pooled representation of the local pattern.

To capture the evolutionary rate associated with cascading short-term outbreaks/falls, the present invention masks by designThe code attention layer incorporates the impact of the local patterns. Dynamically aggregating current time t with a mask attention layer based on a mask attention mechanism _j All previous local mode embedding (i.e. pooling representation), generating a concealment vector to summarize the evolution process of all previous modes:

g (-) is a linear transformation function, γ = { v (t) = ₀ ),v(t ₁ ),…,v(t _j ),…,v(t _L ) Is the set of local dependencies of the evolution rate. The similarity function f (·, ·) is specified as:

the intensity function obtaining module parameterizes the intensity function of the hokes process through the global module and the local module to obtain the intensity function lambda (t) of the Hawkesformer. Given a concatenation of information C _k And it is at observation time t _o Forwarding history within

Representing a time stamp t _j Before but not including time t _j The sequence of (a). Here, the intensity function obtaining module models the continuous dynamics of the cascade diffusion using the following conditional intensity functions:

wherein, theta represents the parameter of the model, and the forwarding time t is defined in the interval [ t ] _j ,t _j+1 ) Where f (x) = β lg (1 + exp (x/β)) is the softplus function, and the parameter β is used to constrain the strength function to be positive; the first current represents the evolution process of the continuous time domain, and alpha controls the importance degree of interpolation nodes in two timestamps; w is a ₁ 、w ₂ The learning weights of the global dependency and the local mode at the time t are respectively, and the strength function lambda (t) represents the inferred text arrival rate at the time t. Both global and local dependencies of the Hawkesformer have expressive power and flexibility, and the design of the intensity function enables the Hawkesformer to capture coupled spatio-temporal dynamics from the continuous time domain.

The popularity prediction module is mainly used for predicting the popularity of the information. According to the Hawkes theory, the survival function at time t is:

the function is represented at t _n ,t]Without conditional probability of an event occurring in between. Then, we can deduce the probability of observing an event at any time t':

finally, the final popularity is predicted through a popularity prediction module by combining the time point process and the output of two levels of attention mechanism modules (a global module and a local module) according to the following formula

The first term Λ represents the integral of the hokes process, representing the expected magnitude of popularity; t is t _o Represents the observation time; t is t _p Representing a predicted time; the second term represents the output result of the global embedding and local pattern hidden vector splicing result after passing through the second output network (here, the fully connected neural network FCNN).

The information cascade prediction system based on the transform enhanced Hooke process also comprises an optimization module used for optimizing parameters in the system training process.

A cascade C _k Is forwarded toTime

The joint likelihood of (c) can be expressed as:

wherein

Represents until time t _j The time set of (c).

Assume that there are N information cascade label samples for training, { C ₁ ,C ₂ ,…,C _N And the optimization module optimizes the loss of the popularity degree and the maximization of the log-likelihood of all information cascades by minimizing Mean Square Logarithm Error (MSLE):

y _k representing the true popularity of the kth information concatenation label exemplar,

and the predicted popularity corresponding to the kth information concatenation label sample is shown.

The Adam gradient descent algorithm is then used to optimize the system network parameters.

The invention also provides an information cascade prediction method based on the Transformer enhanced Hox process, which comprises the following steps:

s1, acquiring user position-wise embedding and time coding, and taking the user position-wise embedding and the time coding as user embedding;

s2, for each user node in user embedding, distinguishing all precursor nodes as a main path node and a non-main path node, respectively obtaining attention probability matrixes of the main path node and the non-main path node, further obtaining the global dependence of each node, and then obtaining the global embedding of all the nodes through full connection processing;

s3, acquiring a local mode of each node according to a set time slice window, and then performing pooling operation on the local mode of each node to obtain pooling representation of each node; then based on a mask attention mechanism, generating a local mode hidden vector of each node;

s4, parameterizing a strength function of the Hox process according to the global embedding vector and the local mode hiding vector;

and S5, integrating the Hox process and the full-connection processing result of the global embedding vector and the local mode hiding vector to generate a popularity prediction result.

The present invention thus achieves a system and method for coupled modeling and prediction of time features and topological features in the continuous time domain by introducing a two-tier attention mechanism into the hokes process. At a first level, a DAG is distinguished according to a main path and a non-main path, and the layer is used for capturing global dependency relationships between nodes, namely coupled time and topological dependency; at the second level, a local pool attention module is proposed to embed the evolution rate of the cascade to simulate bursts and sags of short-term forwarding numbers.

Compared with the prior art, the invention has the following beneficial effects:

1. according to the method, user position-wise embedding and time coding are firstly obtained to serve as user embedding, then a path perception hypothesis is introduced from the view angle of a topological structure, two attention layers are designed from the two angles of global embedding and a local mode, the intensity function of a Hox process is parameterized, and the coupled time and topological random characteristics of an information cascade diffusion process are learned by combining the Hox process, the global embedding and the local mode to predict popularity; the invention expands the traditional Hox process, effectively acquires knowledge from the continuous time domain and improves the popularity prediction accuracy.

2. The invention proposes a two-level attention architecture, which parameterizes the intensity function of the hokes process; firstly, a global module attention layer based on path perception is designed to capture a long-term diffusion dependency relationship between nodes in a dynamic diffusion topology, and then a local module is provided to capture a short-term evolution rate in information cascade.

3. The invention customizes a learnable position-wise user embedding method for the information cascade directed acyclic graph, thereby improving the parameter efficiency and reducing the calculation burden;

4. the method is extremely important for understanding the evolution process of the social network platform and explaining the popular reasons of the cascade; for example, the method and the system can predict the forwarding amount of a certain microblog in a future period of time, explain popular factors, and be used for downstream tasks such as marketing design, rumor prediction, influence maximization and the like.

Drawings

FIG. 1 is a process of information cascade diffusion presenting a Directed Acyclic Graph (DAG). Existing models model the temporal features of node occurrences and the topological features of the occurrences in the graph separately, e.g., temporal features are fed into the RNN and topological features are fed into the GNN. But the two are in a coupling mode in nature, and when the forwarding time of a certain node is determined, the topological structure in the DAG is also determined at the same time; e.g. u ₂ Is dependent on u ₀ Rather than its previous node u in time ₁ (ii) a Thus, DAGs form coupled temporal topological dependencies, whereas separate modeling cannot capture coupled dependencies.

FIG. 2 is a hypothetical of primary and non-primary paths in a global dependency module, distinguished from a chain structure based on an RNN sequence model; this assumption can directly capture the long term dependencies on the primary path (non-primary path).

Fig. 3 is a detailed diagram of the implementation of the assumption of the primary path and the non-primary path in the global dependency module, that is, the implementation of fig. 2.

FIG. 4 is a model diagram of an information cascade prediction system based on a Transformer enhanced Hooke process, which includes an initial user embedding module, a global dependency module, a local dependency module and a prediction module.

FIG. 5 is a visual rendering of an attention probability matrix A representing thermodynamic diagrams (Heatmaps) and cascade structures corresponding to three cascades randomly selected in a Weibo dataset; the diagonal line in the thermodynamic diagram represents the attention of the current node to the current node, and each row represents the attention of the current node to all the predecessor nodes; each entry ranges from [0,1], the darker the entry color, the higher the attention, and the 1 for each row.

Interpretation of terms

Information Cascade (Information Cascade): figure 1 illustrates this process in one example: after a root node issues a piece of public content, friends and followers of the root node can see the public content and then forward the public content one by one. In this way, public content propagates through the edges of the social network and creates information cascades, a typical task of which is to predict the size of a cascade (tweet, microblog, etc.) that is potentially affected after an observation period. The theoretical basis can be found in the literature [ J.Cheng, L.Adamic, P.A.Dow, J.M.Kleinberg, and J.Leskovec.Can cassettes be predictedlin Proc.of WWW,2014 ]

Hokes process (HawkesProcess, HP): the hokes process is a mainstream Point-in-time process (TPPs). HP has three key factors: (1) The influence of the users is greater, and the influence users contribute to the newly forwarded arrival rate, which indicates that the tweets forwarded by the influence users are often forwarded more; (2) A self-excitation mechanism, wherein each forwarding can influence the arrival rate of future new forwarding; and (3) time attenuation effect. The effect of forwarding decays with increasing time. But traditional HPs simplify the arrival rate of events, model the arrival of events using, for example, exponential or power law functions, and make strong assumptions about diffusion mechanisms in cascading events, limiting their learning capabilities on large-scale cascading data.

Self-Attention Mechanism (Self-Attention Mechanism): the self-Attention mechanism is a special case of the Attention mechanism recently proposed in transform [ vaswanni, ashish, et al ] "Attention is all you needed." Advances in neural information processing systems 30 (2017) ], which has been widely applied to various NLP tasks such as machine translation, language modeling and the like, and brings about considerable improvement. It encodes the input sequence by correlating the input tokens, which consist of the matrices Q (Queries), K (Keys) and V (Values). The matrix form of the attention function is:

Detailed Description

The invention is further described with reference to the accompanying drawings.

Example 1

For convenience, we will use Twitter (Twitter) concatenation as an example setup in the rest of the invention, but the invention is also applicable to other types of information concatenation, such as micro blogging, academic citations, etc., and we will give examples in the experimental part.

The information cascade prediction system based on the transform enhanced hoxox process provided by this embodiment, as shown in fig. 4, includes a user embedding module, a global dependency module, a local dependency module, a strength function obtaining module, and a popularity prediction module.

(1) User embedded module

And the user embedding module is used for acquiring the user position-wise embedding and the time coding and taking the user position-wise embedding and the time coding as the user embedding.

Position-wise embedded acquisition submodule comprising a shared matrix

In a cascade, for the j-th forwarded user u _j Let v _j Representing its one-hot index vector on the shared matrix U, then user U _j Position-wise embedding of (A) is determined by (U) v _j The index is obtained, wherein

Representing the set of one-hot indices embedded by all users. Therefore, based on the shared matrix U, position-wise embedding the acquisition submoduleThe Position-wise embedding of the user to the UV.

The temporal code acquisition submodule characterizes the sequence position by means of a temporal code z, defined as follows:

Is time coding, D is the dimension of time coding, [ z (t) _j )] _i Represents the value of the ith dimension of the code, i ∈ [1, D ]]。

Is a collection of temporal codes.

Thus, the user embedding can ultimately be written as: x = UV + Z. The initial user embedding X consists of a learnable shared position-wise embedding UV together with a fixed time coding Z.

The shared matrix U is initially initialized and its parameters are dynamically updated together by the designed optimization module using back-propagation to get subsequent context embedding related to the underlying cascade.

(2) Global dependency module

And the global dependency module is used for distinguishing all precursor nodes into a main path node and a non-main path node for each user node in the user embedding process, acquiring the attention probability matrix of the main path node and the non-main path node, further acquiring the global dependency of each node, and then performing full connection processing to obtain the global embedding of all the nodes.

The global dependency module includes a path-aware attention layer and a first output network.

The path perception attention layer is used for distinguishing all precursor nodes into a main path node and a non-main path node for each user node in user embedding, acquiring long-term dependence attention scores of the main path node and the non-main path node and constructing an attention probability matrix of the main path and the non-main path; and then obtaining the global dependence according to the attention probability matrix of the main path and the non-main path of each node.

As shown in fig. 2 and 3, a path-aware hypothesis, i.e., a primary path/non-primary path hypothesis, is proposed in the global module. Given a current node u _j To obtain it and a certain node in the main path

Long term dependent attention score of (c):

wherein a, b and

respectively representing user embedded matrix X main path Q (Query) matrix

Non-primary path Q (Query) matrix

And K (Key) matrix

The transformed users embed vectors, and the three transformation matrices are used for distinguishing roles of the users in long-term dependence. Wherein,<·,·>represents the inner product function. Similarly, the current node u _j A node in the non-main path

The long term dependent attention score of (a):

And finally, constructing the global dependency of all nodes according to the complete diffusion dependency of each node. The context embedding of each node is obtained according to the attention scores of the primary path and the non-primary path of each node: e = { E = ₀ ,e ₁ ,…,e _j ,…,e _L Each line in E represents the global dependency of a certain node in the cascade;

embedding matrix X via V (Value) matrix on behalf of user

The converted user embeds the vector.

The above is for the current single user u _j Is calculated. In practical operation, the above attention is relied on using matrix parallel to speed up the operation.

Based on the path perception attention layer, the attention scores of the current node on the main path and the non-main path are calculated through the following formulas:

wherein M is ^P ，

One-hot matrices for preserving the attention scores of the nodes on the primary path and the non-primary path, respectively, if

Then element

Accordingly, if

where D is the dimension of the input embedding. Mask matrix

Including the main path

And nodes on non-primary paths

E＝(W ^V X)A

in order to learn more of the semantic information,the path perception attention layer is designed into a module based on a multi-head attention mechanism, context is embedded and mapped to different subspaces, and the subspaces can use the matrix multiplication parallel operation given in the foregoing to further calculate a plurality of heads E ₁ ,E ₂ ,…,E _H Then connecting them into E = W ^O Concat(E ₁ ,E ₂ ,…,E _H ) In which

Is an aggregation matrix. And the first output network is used for carrying out full connection processing on the global dependence of all the nodes to obtain the global embedding of all the nodes. In this embodiment, the fully-connected neural network FCNN is used as the first output network, and the global dependencies E of all nodes are sent to the first output network to obtain global embeddings of all nodes:

h(t _j )＝H(:,j)

wherein,

is a parameter of the neural network; h (t) _j ) Is that

J-th column in (1) represents user u _j Is indicative of long term dependence.

(3) Local dependency module

The local dependence module is used for acquiring the local mode of each node according to the set time slice window, and then performing pooling operation on the local mode of each node to obtain pooling representation of each node; a hidden vector for each node local pattern is then generated based on the masked attention mechanism.

The local dependency module includes a local pattern acquisition submodule, a pooling submodule, and a masking attention layer.

Local mode acquisition submodule forAnd according to the set time slice window, intercepting the part positioned in front of each node from the global dependency to obtain the local mode of each node. Setting a fixed time slice window t _s Utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from the global dependency to obtain a local mode of each node; thus, the local mode of the current node is determined by the current node u _j And a global dependency component of some predecessor nodes, denoted as

And the pooling sub-module is used for performing pooling operation on the local mode of each node to obtain a pooled representation of each node. For each current node u _j Local mode to it by pooling submodules

Performing a pooling operation to polymerize a time slice t _s Generating a new representation form by the features in (1)

Therefore, all local modes

The total pooled representation of (a) may be formalized as:

each column in (a) corresponds to a pooled representation of the local pattern.

And the mask attention layer is used for dynamically aggregating the local mode pooling representation of each node and the nodes before the node, and generating the hidden vector of the local mode of each node. Dynamically aggregating current time t with a mask attention layer based on a mask attention mechanism _j All previous local mode embedding (i.e. pooling representation), generating a concealment vector to summarize the evolution process of all previous modes:

g (-) is a linear transformation function, γ = { v (t) ₀₁ ),v(t ₁ ),…,v(tj),…,v(t _L ) Is a set of local dependencies of the evolution rate. The similarity function f (·, ·) is specified as:

(4) Intensity function acquisition module

And the strength function acquisition module is used for parameterizing the strength function of the traditional Hox process according to the global embedding vector and the local mode hiding vector. Here, the intensity function obtaining module models the continuous dynamics of the cascade diffusion using the following conditional intensity functions:

wherein, theta represents the parameter of the model, and the forwarding time t is defined in the interval [ t ] _j ,t _j+1 ) Where f (x) = β lg (1 + exp (x/β)) is the softplus function, and the parameter β is used to constrain the strength function to be positive; the first current represents the evolution process of the continuous time domain, and alpha controls the importance degree of interpolation nodes in two timestamps; w is a ₁ 、w ₂ The learning weights of the global dependency and the local mode at the time t are respectively, and lambda (t) represents the inferred text arrival rate at the time t.

(5) Popularity prediction module

And the popularity prediction module is used for generating a popularity prediction result by combining the integral of the Hox process and the full-connection processing result of the global embedding and local mode hiding vectors. Here, the final popularity is predicted according to the following formula

The first term Λ represents the integral of the hokes process, representing the desired magnitude of popularity; t is t _o Represents the observation time; t is t _p Representing a predicted time; the second term represents the output result of the global embedding and local pattern hidden vector splicing result after passing through the second output network (here, the fully connected neural network FCNN).

Example 2

The embodiment is a further improvement on the embodiment 1.

The information cascade prediction system based on the transform enhanced hoxon process provided by the embodiment further includes an optimization module, which is used for parameter optimization in a system training process.

and the predicted popularity corresponding to the kth information concatenation label sample is represented.

Example 3

The present embodiment provides an information cascade prediction method based on a transform enhanced hokes process, which uses the information cascade prediction system of claim 1, and comprises the following steps:

s1, acquiring user position-wise embedding and time coding, and taking the user position-wise embedding and the time coding as user embedding.

In the step, a user embedding module is used for respectively acquiring a sharing position-wise embedding UV and a fixed time code Z, and a final user embedding X = UV + Z is obtained.

S2, for each user node in user embedding, distinguishing all the precursor nodes as a main path node and a non-main path node, respectively obtaining long-term dependence attention scores of the main path node and the non-main path node, further obtaining the global dependence of each node, and then obtaining the global embedding of all the nodes through full connection processing.

Firstly, distinguishing all predecessor nodes as a main path node and a non-main path node for each user node in user embedding by utilizing a path perception attention layer, and acquiring attention probability matrixes of the main path node and the non-main path node; then constructing global dependence of all nodes according to the attention probability matrix of the main path and the non-main path of each node; and then the global embedding of all the nodes is obtained through full connection processing of the first output network.

S3, acquiring a local mode of each node according to a set time slice window, and then performing pooling operation on the local mode of each node to obtain pooling representation of each node; then, based on the masking attention mechanism, a local mode hidden vector for each node is generated.

Firstly, utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from global dependence according to a set time slice window to obtain a local mode of each node; then, performing pooling operation on the local mode of each node through a pooling submodule to obtain pooling representation of each node; and dynamically aggregating each node and the local mode pooling representation of the nodes before the node by using the mask attention layer to generate the hidden vector of the local mode of each node.

S4 is used to parameterize the strength function of the hokes process in terms of global embedding and local pattern concealment vectors.

Here, the strength function of the traditional hokes process is parameterized by using a strength function acquisition module according to the global embedding and local mode hiding vectors, and the strength function of the Hawkesformer model is obtained.

Here, the popularity prediction module combines the integral of the hokes process and the full-concatenation processing result of the global embedding and local pattern hiding vectors to generate a popularity prediction result.

Application example

The present application uses three different authentic datasets (Twitter, weibo, and APS, first dataset origin reference [ Lilian Weng, filippo Menczer, and Yong-YooloAhn.2013. Visual prediction and community structure in social networks. Scientific Reports3 (2013) ] second dataset origin reference [ Qi Cao, huawei Shen, keting Cen, wentao Ouyang, and Xueqi Cheng.2017. Deep-Hawkeys: bridging the gateway prediction and interpretation of information sources. CIKM.1149-1158 ] third dataset origin reference/transitions.

The three data sets are divided into a training set, a verification set and a test set according to the proportion of 70%, 15% and 15% respectively.

Firstly, training and optimizing the information cascade prediction system (Hawkesformer) based on the transform enhanced hokes process provided in embodiment 2 by using a training set, and verifying the trained system by using a verification set every time the training is finished until the loss function of the system is reached

And tends to be stable.

Then, the trained Hawkesformer is tested according to the information cascade prediction method based on the transform enhanced hokes process provided in example 3, and compared with 7 different baseline models (Feature-Deep, deep hawkes, casCN, forrest, vaCas, tempCas, hawkesformer), and a common measurement method MSLE is adopted (the smaller the value, the better the prediction effect).

Table 1: performance comparison of Hawkesformer with baseline model on three datasets

Note: the bold values indicate better performance than other settings. All experimental results were best in 10 trials with statistical significance ρ <0.05.

The 7 baseline methods in table 1 are presented below:

Feature-Deep: structural features, temporal features, user features are extracted and input into the two-layer MLP. [ Xu, xovee, et al. "Casflow: expanding resonant structures and propagation unreserved for casting prediction" IEEE Transactions on Knowledge and Data Engineering (2021) "]

DeepHawkes: the deep learning and the Hawkes self-excitation point process are combined, and the gap between the prediction performance and the interpretability is closed. [ Qi Cao, huawei Shen, keting Cen, wentao Ouyang, and Xueqi Cheng.2017.Deep-Hawkes: bridging the gap between prediction and interpretation of information and casting. InCIKM.1149-1158 ]

And (3) CasCN: is a hybrid model, which uses RNN to simulate the evolution process of diffusion by stacking GNNs to obtain node embedding. [ Chen, xueqin, et al. "Information differentiation prediction video recording cassettes conjugation." 2019IEEE 35th international conjugation on data engineering (ICDE). IEEE,2019 ]

FOREST: the multi-scale cascade prediction problem is handled in combination with reinforcement learning and RNN. "Multi-scale Information Diffusion with a Reinforced Recurrent networks" IJCAI.2019 "by Yang, cheng, et al

VaCas: the hierarchical diffusion model and the space-time structure characteristics of information cascade are integrated, and diffusion uncertainty is captured. [ Fan Zhou, xovee Xu, kunpeng Zhang, gocetrajcevski, and Ting Zhong.2020.Variation information diffusion for probabilistic cases compression. ININFOCOM.1618-1627 ]

The TempCas: a heuristic approach is implemented using RNNs to embed the cascade graphs and use LSTM on specially designed attention-based CNNs to learn historical short-term variation trends. "proceeding of the AAAI Conference on Artificial Intelligence insight. Vol.35.No. 1.2021" ]

From the experimental results in table 1, it can be seen that the information cascade prediction system based on the transform enhanced hokes process provided by the invention can greatly improve the accuracy of popularity prediction compared with other baseline models.

In addition, in order to study the effectiveness of the global dependency module provided by the present invention in capturing the long-term diffusion dependency relationship between nodes in the dynamic diffusion topology, the present application example further obtains an attention probability matrix a, as shown in fig. 5. The stepwise analysis was performed as follows:

(1) For smaller (13 users) information concatenation C ₁ The information cascade prediction system provided by the invention focuses more on the main path and early users participating in diffusion. For each row of matrix A, we can see that the users on the main path usually have the most prominent color, which verifies our dominant/non-dominant hypothesis in the information propagation (a similar phenomenon is also shown in cascade C) ₂ And C ₃ ). For cascade C ₁ Node 9, hawkesformer detects that 0 → 2 → 9 is the primary path, indicating that node 0 and node 2 have a greater impact on node 9, while other nodes on the non-primary path have a lesser impact on the forwarding behavior of node 9. For node 8, the hawkesformer detects that 0 → 4 → 8 is the primary path, indicating that node 0 and node 4 have a greater impact on node 8, while other nodes on the non-primary node have a lesser impact on the forwarding behavior of node 8.

(2) Information concatenation C for medium (58 users) ₂ Long-term dependencies between the root user and subsequent nodes are observed. This suggests that long-term dependencies between nodes in a dynamic diffusion topology need to be modeled, because sequence models such as RNNs are difficult to train on long sequences,and when the sequence is too long it faces the gradient disappearance/explosion problem. In addition, several short-term bursts were observed, which again validates the assumption of the present invention to learn local patterns for information concatenation.

(3) At the same time, a virus transmission cascade C is shown ₃ It has 106 users and a maximum depth of 8. Most pushups in this cascade (84%) are from indirect users to the root. We have found that even "tail" users at the end of the diffusion tree may trigger outbreaks (e.g., users 19 and 43). Some users act as hubs to connect other influential users bringing them into the information dissemination process.

Therefore, the information cascade prediction system based on the transform enhanced Hooke process, which is provided by the invention, not only can improve the accuracy of popularity prediction, but also has strong interpretability. Experiments on three real data sets demonstrate the superior performance of the present invention over the most advanced baseline model. The improvement of the performance of the proposed scheme shows that the combination of the advantages of deep learning (Transformer) and a probability model (hokes process) is a new direction for effectively simulating the information cascade diffusion process and improving the prediction accuracy.

In summary, the modeling mode of the information propagation process is important for understanding the information propagation mechanism and the popularity prediction task, the evolution of information diffusion presents a Directed Acyclic Graph (DAG), and the topological features and time are coupled in nature, but the two most important features (topological features and time features) of the cascade are modeled separately in the existing research, so that cross-domain information is lost, and the expression and prediction capability of the model is reduced. The method links a multi-level attention mechanism with the Hox self-excitation point process to realize more effective and reasonable modeling of information cascade so as to achieve more accurate popularity prediction. In particular, using a two-layer attention structure to parameterize the intensity function of the hokes process has been achieved to efficiently acquire representation knowledge from a continuous time domain.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. An information cascade prediction system based on Transformer enhanced Hawkes process, characterized in that, comprising:

The user embedding module is used to obtain user position-wise embedding and time code, and use user position-wise embedding and time code as user embedding;

The global dependency module is used to distinguish all precursor nodes from main path nodes and non-main path nodes for each user node in user embedding, and obtain the attention probability matrix of main path nodes and non-main path nodes, and then obtain each node The global dependence of , and then get the global embedding of all nodes through full connection;

The local dependency module is used to obtain the local pattern of each node according to the set time slice window, and then perform pooling operation on the local pattern of each node to obtain the pooled representation of each node; finally, based on the mask attention mechanism , generating hidden vectors for each node’s local pattern;

An intensity function acquisition module for parameterizing the intensity function of the Hawkes process according to the global embedding and the local pattern hidden vector;

The popularity prediction module is used to combine the integral of the Hawkes process, and the fully connected processing results of the global embedding and the local pattern hidden vector to generate the popularity prediction result.

2. the information cascading prediction system based on Transformer enhanced Hawkes process according to claim 1, is characterized in that, described user embedding module comprises Position-wise embedding acquisition submodule and time code acquisition submodule;

The Position-wise embedding acquisition sub-module includes the shared matrix U. In a cascade, for the user u _j forwarded for the jth time, let v _j represent its one-hot index vector on the shared matrix U, then the user The position-wise embedding of u _j is obtained by Uv _j index, wherein V represents the one-hot index set embedded by all users; the user Position-wise embedding obtained by the Position-wise embedding acquisition sub-module is UV;

The time code acquisition submodule uses the time code z to represent the sequence position, which is defined as follows:

in

is the time encoding, D is the dimension of the time encoding, [z(t _j )] _i represents the value of the i-th dimension of the encoding,

is a collection of time codes;

The user embedding is finally written as: X=UV+Z.

3. the information cascading prediction system based on Transformer enhanced Hawkes process according to claim 1, is characterized in that, described global dependence module comprises:

The path-aware attention layer is used for each user node in the user embedding, distinguishing all the precursor nodes from the main path nodes and non-main path nodes, and obtaining the long-term dependent attention scores of the main path nodes and non-main path nodes, and constructing the main path node. The attention probability matrix of the path and non-main path; and then obtain its global dependence according to the attention probability matrix of the main path and non-main path of each node;

The first output network is used to perform full-connection processing on the global dependencies of each node to obtain the global embedding of all nodes.

4. the information cascading prediction system based on Transformer enhanced Hawkes process according to claim 3, it is characterized in that, at first, based on the path-aware attention layer, the following formula is used to calculate the current node respectively in the main path, non-main path Attention scores on paths:

P _score ＝(W ^P X) ^T (W ^K X)⊙M ^P

N _score ＝(W ^N X) ^T (W ^K X)⊙M ^N

where M ^P ,

The one-hot matrix used to retain the attention scores of nodes on the main path and non-main path, respectively, if

Represents the main path node set, the element

Correspondingly, if

Represents the set of non-primary path nodes,

Finally, the attention probability matrix of the main path and non-main path is derived as:

where D is the dimensionality of the input embedding; the mask matrix

It is used to prevent the system from peeping future data. If i<j, M _i,j =0, otherwise M _i,j =-∞; each column vector in matrix A represents u _j to its pioneer node

attention, including the main path

and nodes on non-primary paths

Then, the matrix form of each node's global dependency is obtained through the user's global dependency extraction submodule: E=(W ^V X)A;

E is fully connected through the first output network to obtain the global embedding of nodes in all cascades:

h(t _j )=H(:,j)

in,

is the parameter of neural network; h(t _j ) is

The jth column in represents the long-term dependence of user u _j .

5. The information cascading prediction system based on Transformer enhanced Hawkes process according to any one of claims 1 to 4, characterized in that, said local dependence module comprises:

The local mode acquisition sub-module is used to intercept the part before each node from the global dependency according to the set time slice window, and obtain the local mode of each node;

The pooling sub-module is used to perform a pooling operation on the local pattern of each node to obtain a pooling representation of each node;

The masked attention layer is used to dynamically aggregate the local pattern pooling representation of each node and its previous nodes to generate the hidden vector of each node's local pattern.

6. the information cascade prediction system based on Transformer enhanced Hawkes process according to claim 5, is characterized in that, utilizes mask attention layer to dynamically aggregate all local pattern embeddings before current time _tj , generates a hidden vector To summarize the evolution of all previous schemas:

g( ) is a linear transformation function, γ={v(t ₀ ),v(t ₁ ),…,v(t _j ),…,v(t _L )} is a collection of local dependencies of the evolution rate; The similarity function f(·,·) is specified as:

7. The information cascade prediction system based on Transformer enhanced Hawkes process according to claim 6, wherein the intensity function acquisition module uses the following conditional intensity function to model the continuous dynamics of cascade diffusion:

Among them, Θ represents the parameters of the model, and the forwarding time t is defined in the interval [t _j ,t _j+1 ), where f(x)=βlg(1+exp(x/β)) is the softplus function, and the parameter β is used to constrain The intensity function is positive; the first term current represents the evolution process in the continuous time domain, and α controls the importance of the interpolation nodes within the two time stamps; w ₁ and w ₂ are the learning weights of global dependencies and local patterns at time t, respectively, λ(t) represents the arrival rate of tweets at time t.

8. the information cascade prediction system based on Transformer enhanced Hawkes process according to claim 7, is characterized in that, popularity prediction module predicts final popularity according to the following formula

The first term Λ represents the integral of the Hawkes process, which represents the expected popularity; t _o represents the observation time; t _p represents the prediction time; the second term represents the splicing result of the global embedding and local pattern hidden vector through the second output network After the output results.

9. the information cascade prediction system based on Transformer enhanced Hawkes process according to claim 8, is characterized in that, also comprises optimization module, is used for the parameter optimization in system training process; Optimization module is by minimizing mean square pair The logarithmic error optimizes the loss of popularity and maximizes the log-likelihood of all information cascades:

y _k represents the true popularity of the kth information cascade label sample,

Indicates the predicted popularity corresponding to the kth information cascade label sample, l(C _k ) indicates the joint likelihood of a cascade C _k , and N indicates the number of information cascade label samples.

10. An information cascade prediction method based on Transformer enhanced Hawkes process, which comprises the following steps:

S1 obtains user position-wise embedding and time code, and uses user position-wise embedding and time code as user embedding;

S2 For each user node in the user embedding, distinguish all the precursor nodes as main path nodes and non-main path nodes, and obtain the long-term dependency attention scores of the main path nodes and non-main path nodes respectively, and then obtain the global dependency of each node , and then get the global embedding of all nodes through full connection processing;

S3 obtains the local pattern of each node according to the set time slice window, and then performs pooling operation on the local pattern of each node to obtain the pooled representation of each node; then generates each node based on the mask attention mechanism The local pattern hidden vector of ;

S4 is used to parameterize the strength function of the Hawkes process in terms of the global embedding and the local pattern hidden vector;

S5 is used to combine the integral of the Hawkes process, and the fully connected processing results of the global embedding and the local pattern hidden vector to generate the popularity prediction result.