Disclosure of Invention
Aiming at the problems of limited learning capacity, cross-domain information loss and the like of the current information cascade prediction method, the invention aims to provide a novel modeling mode for the cascade diffusion Process, generalize the cascade diffusion Process into a Directed Acyclic Graph (DAG) which is continuously diffused in a continuous time domain, introduce a Transformer (attention mechanism) into a Hawkes Process, introduce an attention mechanism into the Hawkes Process to model time characteristics and topological characteristics in the continuous time domain in a coupling manner, realize effective modeling of the cascade diffusion Process and improve the accuracy of popularity prediction.
The general idea of the invention is as follows: two levels of attention modules are connected together to parameterize the intensity function of the hokes process. The strength function ensures that the coupling time and topological dependence of the cascade is modeled in the continuous time domain. Specifically, at a first level, a global (global) dependent module is designed to dynamically capture long-term diffusion processes of events that have occurred in the past in the cascade, and main/non-main path assumptions are proposed to adaptively integrate the diffusion processes onto the underlying DAG. This allows each node participating in the cascade anywhere to update its current hidden state. At the second level, a local module is designed to capture the short-term evolution rate of the information cascade by encoding local patterns in a fixed time slice window. Through the design of two levels, the representation of the nodes on the global and local is taken, and finally, a full connection layer is fed for popularity prediction.
Before a particular flow is given, some basic terms and definitions are given. Given a concatenation of information C
k And it is at observation time t
o Forwarding history within a network
Wherein t is
j ∈(0,t
o ]L is within the observation time window (t)
0 ,t
0 +t
o ) Number of forwarding of t
0 Indicating the start time. Each triplet (t)
j ,v
j ,u
j ) Is represented at t
j Time of day, user u
j Forwards user v
j . For example, the forwarding history of FIG. 1 is:
(t
4 ,u
1 ,u
4 ),(t
5 ,u
3 ,u
5 ) For simplicity, the following is omitted
In addition, we use
Representing the set of timestamps before the current time.
Based on the above inventive concept, the information cascade prediction system based on the Transformer enhanced hokes process provided by the invention comprises:
the User embedding module (Initial User indexes) is used for acquiring User position-wise embedding and time coding and taking the User position-wise embedding and the time coding as User embedding;
a Global Dependency Module (Global Dependency Module) for distinguishing all predecessor nodes as main path nodes and non-main path nodes for each user node in user embedding, and obtaining attention probability matrixes of the main path nodes and the non-main path nodes, further obtaining Global Dependency of each node, and obtaining Global embedding of all nodes through full connection;
a Local Dependency Module (Local Dependency Module) for obtaining a Local mode of each node according to a set time slice window, and then performing pooling operation on the Local mode of each node to obtain a pooled representation of each node; finally, generating a hidden vector of each node local mode based on a mask attention mechanism;
the intensity function acquisition module is used for parameterizing an intensity function of the Hox process according to the global embedding vector and the local mode hiding vector;
and the popularity prediction module is used for generating a popularity prediction result by combining the integral of the Hox process and the full-connection processing result of the global embedding and local mode hiding vectors.
The user embedding module, the user embedding X, is composed of position-wise embedding and time coding and is used as the input of a subsequent module (a global dependency module).
The user embedded module comprises a Position-wise embedded acquisition submodule and a time code acquisition submodule.
Position-wise Embedded acquisition submodule, including shared matrix
Where D is the user-embedded dimension and L is the maximum number of hops. The jth column of U represents the jth forwarding user, and each cascade of nodes in the same order on the time axis share an initial embedding. The shared matrix U is first randomly initialized and then subsequent context embeddings related to the underlying cascade are dynamically updated by subsequent global dependency modules. The shared matrix U greatly reduces the computational overhead, making the system more efficient. In a cascade, for the j-th forwarded user u
j Let v
j Representing its one-hot index vector on the shared matrix U, then user U
j Position-wise embedding of (A) is determined by (U) v
j The index is obtained, wherein
Representing the set of one-hot indexes embedded by all users. Thus, the Position-wise embedding of the user, which is derived by the Position-wise embedding derivation submodule, is UV based on the shared matrix U.
2. The time code acquisition submodule (Temporal Encoding), a key component of the present invention, is a two-level attention module, and since the attention mechanism does not consider the sequence dependency, the time code acquisition submodule characterizes the sequence position by using the time code z, which is defined as follows:
the temporal coding is a variant from the original transform absolute position coding. Wherein
Is time-coded, D is the dimension of the time-coding, [ z (t)
j )]
i Represents the value of the ith dimension of the code, i ∈ [1, D ]]。
Is a collection of temporal codes.
Thus, the forwarding sequence
The initial user embedding of (a) can eventually be written as: x = UV + Z. The initial user embedding X consists of a learnable shared position-wise embedding UV together with a fixed time coding Z.
The global dependency module is referred to as a global module for short; the global dependency of the information cascade can be seen as an indication that a certain user may promote or suppress another user during long-term diffusion.
The global dependency module includes:
the path perception attention layer is used for distinguishing all the precursor nodes into a main path node and a non-main path node for each user node in user embedding, acquiring long-term dependence attention scores of the main path node and the non-main path node and constructing an attention probability matrix of the main path and the non-main path; acquiring global dependence of each node according to the attention probability matrix of the main path and the non-main path of each node;
and the first output network is used for carrying out full connection processing on the global dependence of each node to obtain global embedding of all nodes.
After the user embedding is obtained, a path perception attention layer is designed in the global module to parameterize the intensity function of the Hawkesformer, the long-term diffusion dependence between the input cascade nodes is captured, and the time topological dependence of the coupling of the main path node and the non-main path node is extracted by using the user complete diffusion dependence extraction sub-module.
The theory of the hokes process states that: the final popularity of information concatenation is related to the past participation of the user, which means that the current node affects not only its immediate (immedate) forwarder, but also non-immediate (non-immedate) forwarders by transitivity. In order to indicate the influence of each node on the future popularity of the cascade from a topological point of view, a path-aware assumption, i.e. a primary path/non-primary path assumption, is therefore proposed in the global module: consider a node on the main path to be current node u
j The contribution of (b) is larger, the node on the non-main path is next to u
j The arrival contribution of the node is small but still has an impact on the final popularity, which also follows the assumptions of the hokes process. For the current node u
j All predecessor nodes
Are divided into two categories: (1) Nodes on the main path, denoted
It includes a front node u
j Backtracking to root node u
0 All previous nodes on the entire forwarding path; (2) Nodes on the non-primary path, i.e. all nodes except the primary path node, are marked as
An example is given in fig. 2, where for the current node u
4 The nodes on the primary path are represented by (1) and (2), and the nodes on the non-primary path are represented by (3).
Based on the above explanation, the implementation steps of the global dependency module are as follows:
given a current node u
j To obtain a node between it and a main path
Long term dependent attention score of (c):
wherein a, b and
respectively representing user embedded matrix X main path Q (Query) matrix
Non-main path Q (Query) matrix
And K (Key) matrix
The transformed users embed vectors, and the three transformation matrices are used for distinguishing roles of the users in long-term dependence. Wherein,<·,·>representing the inner product function. Similarly, current node u
j A node in the non-main path
Long term dependence attention scores of (a) are:
then, attention weighting and calculation are performed through the following formula to obtain the user u j Complete diffusion dependence e j :
Finally, the context embedding of each node is obtained according to the attention scores of the primary path and the non-primary path of each node: e = { E = { E)
0 ,e
1 ,…,e
j ,…,e
L Each line in E represents the global dependency of a certain node in the cascade;
embedding matrix X via V (Value) matrix on behalf of user
The converted user embeds the vector.
Above for the current single user u j Is calculated. In practical operation, the above attention is relied on using matrix parallel to speed up the operation. The matrix calculation flow of global dependency can refer to fig. 3.
Firstly, based on the path perception attention layer, the attention scores of the current node on the main path and the non-main path are calculated through the following formulas:
wherein M is
P ,
One-hot matrix for respectively reserving attention scores of nodes on the main path and the non-main path if
Then element
Accordingly, if
Finally, the attention probability matrix for the primary and non-primary paths can be derived as:
where D is the dimension of the input embedding. Mask matrix
To prevent the system from peeping into future data if i<j,M
i,j =0, otherwise M
i,j And = - ∞. Matrix M forces the softmax function to be only at u
j Distributing valid long-term dependent attention on previous forwarders to subsequent users (i.e., later than t)
j Active user) is assigned 0. Each column vector in matrix A represents u
j To its precursor node
Including the main path
And a non-main path
A node of (c).
Then, a matrix form of the global dependency of each node is obtained according to the following formula:
E=(W V X)A
in order to learn more semantic information, the path perception attention layer can be designed as a module based on a multi-head attention mechanism, context embedding and mapping are carried out on different subspaces, the subspaces can use the matrix multiplication parallel operation given in the foregoing, and then a plurality of heads E are calculated
1 ,E
2 ,…,E
H Then connecting them into E = W
O Concat(E
1 ,E
2 ,…,E
H ) Wherein
Is an aggregation matrix. Finally, we feed E into the first output network (here the fully connected neural network FCNN) resulting in a global embedding of all cascades:
h(t j )=H(:,j)
wherein,
is a parameter of the neural network; h (t)
j ) Is that
J-th column in (1) represents user u
j Is indicative of long term dependence.
The local dependency module may be referred to as a local module for short; the rate of increase of the forwarding number in the information concatenation is not stable, and the final popularity is greatly influenced by short-term outbreaks in the propagation process. To better embed the evolution rate of the information cascade, pooled attention is used here to capture local patterns. And performing a pooling operation on the local mode of each current node based on the learned user global embedded representation in the global dependency module.
The local dependency module includes:
the local mode acquisition submodule is used for intercepting a part positioned in front of each node from the global dependence according to a set time slice window to obtain a local mode of each node;
the pooling submodule is used for performing pooling operation on the local mode of each node to obtain pooling representation of each node;
and the mask attention layer is used for dynamically aggregating the local mode pooling representation of each node and the nodes before the node, and generating the hidden vector of the local mode of each node.
Each current node has a local mode, and a fixed time slice window t is set
s Utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from the global dependency to obtain a local mode of each node; thus, the local mode of the current node is determined by the current node u
j And a global dependency component of some predecessor nodes, denoted as
Representing the set of all local patterns of a piece of information concatenation.
For each current node u
j Local mode to it by pooling submodules
Performing a pooling operation to polymerize a time slice t
s Generating a new representation of the feature in (1)
Therefore, all local modes
The total pooled representation of (a) may be formalized as:
each column in (b) corresponds to a pooled representation of the local pattern.
To capture the evolutionary rate associated with cascading short-term outbreaks/falls, the present invention masks by designThe code attention layer incorporates the impact of the local patterns. Dynamically aggregating current time t with a mask attention layer based on a mask attention mechanism j All previous local mode embedding (i.e. pooling representation), generating a concealment vector to summarize the evolution process of all previous modes:
g (-) is a linear transformation function, γ = { v (t) = 0 ),v(t 1 ),…,v(t j ),…,v(t L ) Is the set of local dependencies of the evolution rate. The similarity function f (·, ·) is specified as:
the intensity function obtaining module parameterizes the intensity function of the hokes process through the global module and the local module to obtain the intensity function lambda (t) of the Hawkesformer. Given a concatenation of information C
k And it is at observation time t
o Forwarding history within
Representing a time stamp t
j Before but not including time t
j The sequence of (a). Here, the intensity function obtaining module models the continuous dynamics of the cascade diffusion using the following conditional intensity functions:
wherein, theta represents the parameter of the model, and the forwarding time t is defined in the interval [ t ] j ,t j+1 ) Where f (x) = β lg (1 + exp (x/β)) is the softplus function, and the parameter β is used to constrain the strength function to be positive; the first current represents the evolution process of the continuous time domain, and alpha controls the importance degree of interpolation nodes in two timestamps; w is a 1 、w 2 The learning weights of the global dependency and the local mode at the time t are respectively, and the strength function lambda (t) represents the inferred text arrival rate at the time t. Both global and local dependencies of the Hawkesformer have expressive power and flexibility, and the design of the intensity function enables the Hawkesformer to capture coupled spatio-temporal dynamics from the continuous time domain.
The popularity prediction module is mainly used for predicting the popularity of the information. According to the Hawkes theory, the survival function at time t is:
the function is represented at t
n ,t]Without conditional probability of an event occurring in between. Then, we can deduce the probability of observing an event at any time t':
finally, the final popularity is predicted through a popularity prediction module by combining the time point process and the output of two levels of attention mechanism modules (a global module and a local module) according to the following formula
The first term Λ represents the integral of the hokes process, representing the expected magnitude of popularity; t is t o Represents the observation time; t is t p Representing a predicted time; the second term represents the output result of the global embedding and local pattern hidden vector splicing result after passing through the second output network (here, the fully connected neural network FCNN).
The information cascade prediction system based on the transform enhanced Hooke process also comprises an optimization module used for optimizing parameters in the system training process.
A cascade C
k Is forwarded toTime
The joint likelihood of (c) can be expressed as:
wherein
Represents until time t
j The time set of (c).
Assume that there are N information cascade label samples for training, { C 1 ,C 2 ,…,C N And the optimization module optimizes the loss of the popularity degree and the maximization of the log-likelihood of all information cascades by minimizing Mean Square Logarithm Error (MSLE):
y
k representing the true popularity of the kth information concatenation label exemplar,
and the predicted popularity corresponding to the kth information concatenation label sample is shown.
The Adam gradient descent algorithm is then used to optimize the system network parameters.
The invention also provides an information cascade prediction method based on the Transformer enhanced Hox process, which comprises the following steps:
s1, acquiring user position-wise embedding and time coding, and taking the user position-wise embedding and the time coding as user embedding;
s2, for each user node in user embedding, distinguishing all precursor nodes as a main path node and a non-main path node, respectively obtaining attention probability matrixes of the main path node and the non-main path node, further obtaining the global dependence of each node, and then obtaining the global embedding of all the nodes through full connection processing;
s3, acquiring a local mode of each node according to a set time slice window, and then performing pooling operation on the local mode of each node to obtain pooling representation of each node; then based on a mask attention mechanism, generating a local mode hidden vector of each node;
s4, parameterizing a strength function of the Hox process according to the global embedding vector and the local mode hiding vector;
and S5, integrating the Hox process and the full-connection processing result of the global embedding vector and the local mode hiding vector to generate a popularity prediction result.
The present invention thus achieves a system and method for coupled modeling and prediction of time features and topological features in the continuous time domain by introducing a two-tier attention mechanism into the hokes process. At a first level, a DAG is distinguished according to a main path and a non-main path, and the layer is used for capturing global dependency relationships between nodes, namely coupled time and topological dependency; at the second level, a local pool attention module is proposed to embed the evolution rate of the cascade to simulate bursts and sags of short-term forwarding numbers.
Compared with the prior art, the invention has the following beneficial effects:
1. according to the method, user position-wise embedding and time coding are firstly obtained to serve as user embedding, then a path perception hypothesis is introduced from the view angle of a topological structure, two attention layers are designed from the two angles of global embedding and a local mode, the intensity function of a Hox process is parameterized, and the coupled time and topological random characteristics of an information cascade diffusion process are learned by combining the Hox process, the global embedding and the local mode to predict popularity; the invention expands the traditional Hox process, effectively acquires knowledge from the continuous time domain and improves the popularity prediction accuracy.
2. The invention proposes a two-level attention architecture, which parameterizes the intensity function of the hokes process; firstly, a global module attention layer based on path perception is designed to capture a long-term diffusion dependency relationship between nodes in a dynamic diffusion topology, and then a local module is provided to capture a short-term evolution rate in information cascade.
3. The invention customizes a learnable position-wise user embedding method for the information cascade directed acyclic graph, thereby improving the parameter efficiency and reducing the calculation burden;
4. the method is extremely important for understanding the evolution process of the social network platform and explaining the popular reasons of the cascade; for example, the method and the system can predict the forwarding amount of a certain microblog in a future period of time, explain popular factors, and be used for downstream tasks such as marketing design, rumor prediction, influence maximization and the like.
Detailed Description
The invention is further described with reference to the accompanying drawings.
Example 1
For convenience, we will use Twitter (Twitter) concatenation as an example setup in the rest of the invention, but the invention is also applicable to other types of information concatenation, such as micro blogging, academic citations, etc., and we will give examples in the experimental part.
The information cascade prediction system based on the transform enhanced hoxox process provided by this embodiment, as shown in fig. 4, includes a user embedding module, a global dependency module, a local dependency module, a strength function obtaining module, and a popularity prediction module.
(1) User embedded module
And the user embedding module is used for acquiring the user position-wise embedding and the time coding and taking the user position-wise embedding and the time coding as the user embedding.
The user embedded module comprises a Position-wise embedded acquisition submodule and a time code acquisition submodule.
Position-wise embedded acquisition submodule comprising a shared matrix
In a cascade, for the j-th forwarded user u
j Let v
j Representing its one-hot index vector on the shared matrix U, then user U
j Position-wise embedding of (A) is determined by (U) v
j The index is obtained, wherein
Representing the set of one-hot indices embedded by all users. Therefore, based on the shared matrix U, position-wise embedding the acquisition submoduleThe Position-wise embedding of the user to the UV.
The temporal code acquisition submodule characterizes the sequence position by means of a temporal code z, defined as follows:
the temporal coding is a variant from the original transform absolute position coding. Wherein
Is time coding, D is the dimension of time coding, [ z (t)
j )]
i Represents the value of the ith dimension of the code, i ∈ [1, D ]]。
Is a collection of temporal codes.
Thus, the user embedding can ultimately be written as: x = UV + Z. The initial user embedding X consists of a learnable shared position-wise embedding UV together with a fixed time coding Z.
The shared matrix U is initially initialized and its parameters are dynamically updated together by the designed optimization module using back-propagation to get subsequent context embedding related to the underlying cascade.
(2) Global dependency module
And the global dependency module is used for distinguishing all precursor nodes into a main path node and a non-main path node for each user node in the user embedding process, acquiring the attention probability matrix of the main path node and the non-main path node, further acquiring the global dependency of each node, and then performing full connection processing to obtain the global embedding of all the nodes.
The global dependency module includes a path-aware attention layer and a first output network.
The path perception attention layer is used for distinguishing all precursor nodes into a main path node and a non-main path node for each user node in user embedding, acquiring long-term dependence attention scores of the main path node and the non-main path node and constructing an attention probability matrix of the main path and the non-main path; and then obtaining the global dependence according to the attention probability matrix of the main path and the non-main path of each node.
As shown in fig. 2 and 3, a path-aware hypothesis, i.e., a primary path/non-primary path hypothesis, is proposed in the global module. Given a current node u
j To obtain it and a certain node in the main path
Long term dependent attention score of (c):
wherein a, b and
respectively representing user embedded matrix X main path Q (Query) matrix
Non-primary path Q (Query) matrix
And K (Key) matrix
The transformed users embed vectors, and the three transformation matrices are used for distinguishing roles of the users in long-term dependence. Wherein,<·,·>represents the inner product function. Similarly, the current node u
j A node in the non-main path
The long term dependent attention score of (a):
then, attention weighting and calculation are performed through the following formula to obtain the user u j Complete diffusion dependence e j :
And finally, constructing the global dependency of all nodes according to the complete diffusion dependency of each node. The context embedding of each node is obtained according to the attention scores of the primary path and the non-primary path of each node: e = { E =
0 ,e
1 ,…,e
j ,…,e
L Each line in E represents the global dependency of a certain node in the cascade;
embedding matrix X via V (Value) matrix on behalf of user
The converted user embeds the vector.
The above is for the current single user u j Is calculated. In practical operation, the above attention is relied on using matrix parallel to speed up the operation.
Based on the path perception attention layer, the attention scores of the current node on the main path and the non-main path are calculated through the following formulas:
wherein M is
P ,
One-hot matrices for preserving the attention scores of the nodes on the primary path and the non-primary path, respectively, if
Then element
Accordingly, if
Finally, the attention probability matrix for the primary and non-primary paths can be derived as:
where D is the dimension of the input embedding. Mask matrix
To prevent the system from peeping into future data if i<j,M
i,j =0, otherwise M
i,j And = - ∞. Matrix M forces the softmax function to be only at u
j Distributing valid long-term dependent attention on previous forwarders to subsequent users (i.e., later than t)
j Active user) is assigned 0. Each column vector in matrix A represents u
j To its precursor node
Including the main path
And nodes on non-primary paths
Then, a matrix form of the global dependency of each node is obtained according to the following formula:
E=(W V X)A
in order to learn more of the semantic information,the path perception attention layer is designed into a module based on a multi-head attention mechanism, context is embedded and mapped to different subspaces, and the subspaces can use the matrix multiplication parallel operation given in the foregoing to further calculate a plurality of heads E
1 ,E
2 ,…,E
H Then connecting them into E = W
O Concat(E
1 ,E
2 ,…,E
H ) In which
Is an aggregation matrix. And the first output network is used for carrying out full connection processing on the global dependence of all the nodes to obtain the global embedding of all the nodes. In this embodiment, the fully-connected neural network FCNN is used as the first output network, and the global dependencies E of all nodes are sent to the first output network to obtain global embeddings of all nodes:
h(t j )=H(:,j)
wherein,
is a parameter of the neural network; h (t)
j ) Is that
J-th column in (1) represents user u
j Is indicative of long term dependence.
(3) Local dependency module
The local dependence module is used for acquiring the local mode of each node according to the set time slice window, and then performing pooling operation on the local mode of each node to obtain pooling representation of each node; a hidden vector for each node local pattern is then generated based on the masked attention mechanism.
The local dependency module includes a local pattern acquisition submodule, a pooling submodule, and a masking attention layer.
Local mode acquisition submodule forAnd according to the set time slice window, intercepting the part positioned in front of each node from the global dependency to obtain the local mode of each node. Setting a fixed time slice window t
s Utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from the global dependency to obtain a local mode of each node; thus, the local mode of the current node is determined by the current node u
j And a global dependency component of some predecessor nodes, denoted as
Representing the set of all local patterns of a piece of information concatenation.
And the pooling sub-module is used for performing pooling operation on the local mode of each node to obtain a pooled representation of each node. For each current node u
j Local mode to it by pooling submodules
Performing a pooling operation to polymerize a time slice t
s Generating a new representation form by the features in (1)
Therefore, all local modes
The total pooled representation of (a) may be formalized as:
each column in (a) corresponds to a pooled representation of the local pattern.
And the mask attention layer is used for dynamically aggregating the local mode pooling representation of each node and the nodes before the node, and generating the hidden vector of the local mode of each node. Dynamically aggregating current time t with a mask attention layer based on a mask attention mechanism j All previous local mode embedding (i.e. pooling representation), generating a concealment vector to summarize the evolution process of all previous modes:
g (-) is a linear transformation function, γ = { v (t) 01 ),v(t 1 ),…,v(tj),…,v(t L ) Is a set of local dependencies of the evolution rate. The similarity function f (·, ·) is specified as:
(4) Intensity function acquisition module
And the strength function acquisition module is used for parameterizing the strength function of the traditional Hox process according to the global embedding vector and the local mode hiding vector. Here, the intensity function obtaining module models the continuous dynamics of the cascade diffusion using the following conditional intensity functions:
wherein, theta represents the parameter of the model, and the forwarding time t is defined in the interval [ t ] j ,t j+1 ) Where f (x) = β lg (1 + exp (x/β)) is the softplus function, and the parameter β is used to constrain the strength function to be positive; the first current represents the evolution process of the continuous time domain, and alpha controls the importance degree of interpolation nodes in two timestamps; w is a 1 、w 2 The learning weights of the global dependency and the local mode at the time t are respectively, and lambda (t) represents the inferred text arrival rate at the time t.
(5) Popularity prediction module
And the popularity prediction module is used for generating a popularity prediction result by combining the integral of the Hox process and the full-connection processing result of the global embedding and local mode hiding vectors. Here, the final popularity is predicted according to the following formula
The first term Λ represents the integral of the hokes process, representing the desired magnitude of popularity; t is t o Represents the observation time; t is t p Representing a predicted time; the second term represents the output result of the global embedding and local pattern hidden vector splicing result after passing through the second output network (here, the fully connected neural network FCNN).
Example 2
The embodiment is a further improvement on the embodiment 1.
The information cascade prediction system based on the transform enhanced hoxon process provided by the embodiment further includes an optimization module, which is used for parameter optimization in a system training process.
Assume that there are N information cascade label samples for training, { C 1 ,C 2 ,…,C N And the optimization module optimizes the loss of the popularity degree and the maximization of the log-likelihood of all information cascades by minimizing Mean Square Logarithm Error (MSLE):
y
k representing the true popularity of the kth information concatenation label exemplar,
and the predicted popularity corresponding to the kth information concatenation label sample is represented.
The Adam gradient descent algorithm is then used to optimize the system network parameters.
Example 3
The present embodiment provides an information cascade prediction method based on a transform enhanced hokes process, which uses the information cascade prediction system of claim 1, and comprises the following steps:
s1, acquiring user position-wise embedding and time coding, and taking the user position-wise embedding and the time coding as user embedding.
In the step, a user embedding module is used for respectively acquiring a sharing position-wise embedding UV and a fixed time code Z, and a final user embedding X = UV + Z is obtained.
S2, for each user node in user embedding, distinguishing all the precursor nodes as a main path node and a non-main path node, respectively obtaining long-term dependence attention scores of the main path node and the non-main path node, further obtaining the global dependence of each node, and then obtaining the global embedding of all the nodes through full connection processing.
Firstly, distinguishing all predecessor nodes as a main path node and a non-main path node for each user node in user embedding by utilizing a path perception attention layer, and acquiring attention probability matrixes of the main path node and the non-main path node; then constructing global dependence of all nodes according to the attention probability matrix of the main path and the non-main path of each node; and then the global embedding of all the nodes is obtained through full connection processing of the first output network.
S3, acquiring a local mode of each node according to a set time slice window, and then performing pooling operation on the local mode of each node to obtain pooling representation of each node; then, based on the masking attention mechanism, a local mode hidden vector for each node is generated.
Firstly, utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from global dependence according to a set time slice window to obtain a local mode of each node; then, performing pooling operation on the local mode of each node through a pooling submodule to obtain pooling representation of each node; and dynamically aggregating each node and the local mode pooling representation of the nodes before the node by using the mask attention layer to generate the hidden vector of the local mode of each node.
S4 is used to parameterize the strength function of the hokes process in terms of global embedding and local pattern concealment vectors.
Here, the strength function of the traditional hokes process is parameterized by using a strength function acquisition module according to the global embedding and local mode hiding vectors, and the strength function of the Hawkesformer model is obtained.
And S5, integrating the Hox process and the full-connection processing result of the global embedding vector and the local mode hiding vector to generate a popularity prediction result.
Here, the popularity prediction module combines the integral of the hokes process and the full-concatenation processing result of the global embedding and local pattern hiding vectors to generate a popularity prediction result.
Application example
The present application uses three different authentic datasets (Twitter, weibo, and APS, first dataset origin reference [ Lilian Weng, filippo Menczer, and Yong-YooloAhn.2013. Visual prediction and community structure in social networks. Scientific Reports3 (2013) ] second dataset origin reference [ Qi Cao, huawei Shen, keting Cen, wentao Ouyang, and Xueqi Cheng.2017. Deep-Hawkeys: bridging the gateway prediction and interpretation of information sources. CIKM.1149-1158 ] third dataset origin reference/transitions.
The three data sets are divided into a training set, a verification set and a test set according to the proportion of 70%, 15% and 15% respectively.
Firstly, training and optimizing the information cascade prediction system (Hawkesformer) based on the transform enhanced hokes process provided in
embodiment 2 by using a training set, and verifying the trained system by using a verification set every time the training is finished until the loss function of the system is reached
And tends to be stable.
Then, the trained Hawkesformer is tested according to the information cascade prediction method based on the transform enhanced hokes process provided in example 3, and compared with 7 different baseline models (Feature-Deep, deep hawkes, casCN, forrest, vaCas, tempCas, hawkesformer), and a common measurement method MSLE is adopted (the smaller the value, the better the prediction effect).
Table 1: performance comparison of Hawkesformer with baseline model on three datasets
Note: the bold values indicate better performance than other settings. All experimental results were best in 10 trials with statistical significance ρ <0.05.
The 7 baseline methods in table 1 are presented below:
Feature-Deep: structural features, temporal features, user features are extracted and input into the two-layer MLP. [ Xu, xovee, et al. "Casflow: expanding resonant structures and propagation unreserved for casting prediction" IEEE Transactions on Knowledge and Data Engineering (2021) "]
DeepHawkes: the deep learning and the Hawkes self-excitation point process are combined, and the gap between the prediction performance and the interpretability is closed. [ Qi Cao, huawei Shen, keting Cen, wentao Ouyang, and Xueqi Cheng.2017.Deep-Hawkes: bridging the gap between prediction and interpretation of information and casting. InCIKM.1149-1158 ]
And (3) CasCN: is a hybrid model, which uses RNN to simulate the evolution process of diffusion by stacking GNNs to obtain node embedding. [ Chen, xueqin, et al. "Information differentiation prediction video recording cassettes conjugation." 2019IEEE 35th international conjugation on data engineering (ICDE). IEEE,2019 ]
FOREST: the multi-scale cascade prediction problem is handled in combination with reinforcement learning and RNN. "Multi-scale Information Diffusion with a Reinforced Recurrent networks" IJCAI.2019 "by Yang, cheng, et al
VaCas: the hierarchical diffusion model and the space-time structure characteristics of information cascade are integrated, and diffusion uncertainty is captured. [ Fan Zhou, xovee Xu, kunpeng Zhang, gocetrajcevski, and Ting Zhong.2020.Variation information diffusion for probabilistic cases compression. ININFOCOM.1618-1627 ]
The TempCas: a heuristic approach is implemented using RNNs to embed the cascade graphs and use LSTM on specially designed attention-based CNNs to learn historical short-term variation trends. "proceeding of the AAAI Conference on Artificial Intelligence insight. Vol.35.No. 1.2021" ]
From the experimental results in table 1, it can be seen that the information cascade prediction system based on the transform enhanced hokes process provided by the invention can greatly improve the accuracy of popularity prediction compared with other baseline models.
In addition, in order to study the effectiveness of the global dependency module provided by the present invention in capturing the long-term diffusion dependency relationship between nodes in the dynamic diffusion topology, the present application example further obtains an attention probability matrix a, as shown in fig. 5. The stepwise analysis was performed as follows:
(1) For smaller (13 users) information concatenation C 1 The information cascade prediction system provided by the invention focuses more on the main path and early users participating in diffusion. For each row of matrix A, we can see that the users on the main path usually have the most prominent color, which verifies our dominant/non-dominant hypothesis in the information propagation (a similar phenomenon is also shown in cascade C) 2 And C 3 ). For cascade C 1 Node 9, hawkesformer detects that 0 → 2 → 9 is the primary path, indicating that node 0 and node 2 have a greater impact on node 9, while other nodes on the non-primary path have a lesser impact on the forwarding behavior of node 9. For node 8, the hawkesformer detects that 0 → 4 → 8 is the primary path, indicating that node 0 and node 4 have a greater impact on node 8, while other nodes on the non-primary node have a lesser impact on the forwarding behavior of node 8.
(2) Information concatenation C for medium (58 users) 2 Long-term dependencies between the root user and subsequent nodes are observed. This suggests that long-term dependencies between nodes in a dynamic diffusion topology need to be modeled, because sequence models such as RNNs are difficult to train on long sequences,and when the sequence is too long it faces the gradient disappearance/explosion problem. In addition, several short-term bursts were observed, which again validates the assumption of the present invention to learn local patterns for information concatenation.
(3) At the same time, a virus transmission cascade C is shown 3 It has 106 users and a maximum depth of 8. Most pushups in this cascade (84%) are from indirect users to the root. We have found that even "tail" users at the end of the diffusion tree may trigger outbreaks (e.g., users 19 and 43). Some users act as hubs to connect other influential users bringing them into the information dissemination process.
Therefore, the information cascade prediction system based on the transform enhanced Hooke process, which is provided by the invention, not only can improve the accuracy of popularity prediction, but also has strong interpretability. Experiments on three real data sets demonstrate the superior performance of the present invention over the most advanced baseline model. The improvement of the performance of the proposed scheme shows that the combination of the advantages of deep learning (Transformer) and a probability model (hokes process) is a new direction for effectively simulating the information cascade diffusion process and improving the prediction accuracy.
In summary, the modeling mode of the information propagation process is important for understanding the information propagation mechanism and the popularity prediction task, the evolution of information diffusion presents a Directed Acyclic Graph (DAG), and the topological features and time are coupled in nature, but the two most important features (topological features and time features) of the cascade are modeled separately in the existing research, so that cross-domain information is lost, and the expression and prediction capability of the model is reduced. The method links a multi-level attention mechanism with the Hox self-excitation point process to realize more effective and reasonable modeling of information cascade so as to achieve more accurate popularity prediction. In particular, using a two-layer attention structure to parameterize the intensity function of the hokes process has been achieved to efficiently acquire representation knowledge from a continuous time domain.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.