CN115409155B

CN115409155B - Information Cascade Prediction System and Method Based on Transformer Enhanced Hawkes Process

Info

Publication number: CN115409155B
Application number: CN202210983826.9A
Authority: CN
Inventors: 周帆; 余柳; 钟婷; 徐增; 匡平
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2026-01-09
Anticipated expiration: 2042-08-17
Also published as: CN115409155A

Abstract

This invention discloses an information cascading prediction system and method based on a Transformer-enhanced Hawkes process, comprising a user embedding module, a global dependency module, a local dependency module, an intensity function acquisition module, and a popularity prediction module. First, user position-wise embeddings and temporal encodings are acquired as user embeddings. Then, from a topological perspective, a path-aware hypothesis is introduced, and two attention layers are designed from the perspectives of global embedding and local patterns. The intensity function of the Hawkes process is parameterized, and by combining the Hawkes process, global embedding, and local patterns, the temporal and topological stochastic characteristics of the information cascading diffusion process are learned for popularity prediction. This invention extends the traditional Hawkes process and effectively acquires knowledge from the continuous time domain, improving the accuracy of popularity prediction.

Description

Information cascade prediction system and method based on transform reinforced Hox process

Technical Field

The invention belongs to the technical field of information processing, relates to information diffusion (Information diffusion), information cascading (Information Cascade) and popularity prediction (Popularity Prediction) based on deep learning (DEEP LEARNING), and particularly relates to a method and a system (Hawkesformer) for introducing a transducer into a Hoxwell process (Hawkes Process) to perform coupling modeling on cascading time characteristics and topological structures and finally realizing popularity prediction (Cascade Prediction).

Background

Content sharing through Twitter, facebook and social media platforms such as microblogs has become a main channel for individuals to express opinions and respond to various topics. This trend produces a large amount of data describing the way information is propagated and provides unprecedented convenience in generating, delivering, and propagating information items. The initial distribution of information (e.g., tweet) and the subsequent sharing forwarding (e.g., retweet) form a cascade of information representing what is known as an information dissemination process. Similar phenomena are found in other (non-social media) settings, paper quotation, blog space, and email forwarding. The large number of user sharing activities contributes to the rapid and massive propagation of the information cascade. A typical task of information cascading is to predict a certain cascade (tweet, microblog, etc.), which after a certain period of time, is the scale of the potentially affected users, i.e. popularity prediction (size of the information cascade), which can bring great economic and social value in many downstream applications. For example, knowing what type of advertisement is maximally influential may make advertisement delivery more accurate, predicting potential impact users of rumors, allowing administrators to intervene early to avoid serious consequences, etc.

The existing methods are mainly divided into three categories (i) feature-based models, mainly explored feature sets extracted from structures, time sequences, user content and profiles, however, they rely on a large number of hand-made feature projects, which cannot be generalized from one domain to another, (ii) probability generation models, aimed at modeling events, including poisson and hough processes, by classical examples of Time Point Processes (TPPs), but the existing methods make powerful assumptions on the diffusion mechanisms of cascades and have limited learning capacity for large-scale cascade data, (iii) deep learning models, which focus on static time cascades diffusion within specific snapshots in discrete time domains, have little focus on the whole evolution process and fully utilizing all cascade dynamics, mainly using simple event sequence models (e.g. RNNs) to model cascade events, or aided with GNNs to separately model time features and topology features, but in essence both are coupled, such separate modeling approaches lose cross-domain information and reduce the expression and predictive capacity of models.

Disclosure of Invention

Aiming at the problems of limited learning capability, cross-domain information loss and the like of the current information cascade prediction method, the invention aims to provide a novel modeling mode for a cascade diffusion process, summarize the cascade diffusion process into a Directed Acyclic Graph (DAG) which is continuously diffused in a continuous time domain, introduce a transducer (attention mechanism) into a Hooke process (Hawkes Process), and perform coupled modeling on time characteristics and topological characteristics in the continuous time domain by introducing an attention mechanism into the Hooke process, thereby realizing effective modeling of the cascade diffusion process and improving the accuracy of popularity prediction.

The general idea of the invention is to connect two levels of attention modules together to parameterize the intensity function of the hough process. The intensity function ensures the coupling time and topology dependence of the modeling cascade in the continuous time domain. Specifically, at a first level, a global dependency module is designed to dynamically capture long-term diffusion processes of events in the cascade that have occurred in the past, and main/non-main path assumptions are proposed to adaptively integrate the diffusion process onto the underlying DAG. This allows each node participating in the cascade to update its current hidden state at any location. At the second level, a local module is designed to capture the short-term evolution rate of the information cascade by encoding local patterns in a fixed time slice window. Through the design of two levels, we take the representation of the node on the global and local, and finally send a full connection layer to conduct popularity prediction.

Before a particular procedure is given, some basic terms and definitions are given. Given an information cascade C _k and its forwarding history over observation time t _o Where t _j∈(0,t_o ], L is the number of forwarding within the observation time window (t ₀,t₀+t_o), and t ₀ denotes the start time. Each triplet (t _j,v_j,u_j) represents that at time t _j, user u _j forwarded user v _j. For example, the forwarding history of fig. 1 is: (t ₄,u₁,u₄),(t₅,u₃,u₅) }, omitted hereinafter for simplicity In addition, we use Representing the set of timestamps preceding the current moment.

Based on the inventive concept, the information cascade prediction system based on the transform enhanced hoxwell process provided by the invention comprises:

The user embedding module (Initial User Embeddings) is used for acquiring the position-wise embedding and the time coding of the user and taking the position-wise embedding and the time coding of the user as the user embedding;

the global dependence module (Global Dependency Module) is used for distinguishing all precursor nodes from main path nodes and non-main path nodes for each user node in the user embedding, acquiring attention probability matrixes of the main path nodes and the non-main path nodes, further acquiring global dependence of each node, and obtaining global embedding of all nodes through full connection;

The local dependency module (Local Dependency Module) is used for acquiring the local mode of each node according to the set time slice window, and then carrying out pooling operation on the local mode of each node to obtain pooled representation of each node;

The intensity function acquisition module is used for parameterizing the intensity function of the Hox process according to the global embedding and local mode hiding vectors;

And the popularity prediction module is used for generating popularity prediction results by combining integration of the Hox process and full-connection processing results of the global embedding and local mode hiding vectors.

The user embedding module is characterized in that the user embedding X consists of position-wise embedding and time coding and is used as the input of a follow-up module (global dependency module).

The user embedding module comprises a Position-wise embedding acquisition sub-module and a time coding acquisition sub-module.

Position-wise embedding acquisition sub-module comprising a shared matrixWhere D is the dimension of user embedding and L is the maximum number of forwarding times. The j-th column of U represents the user of the j-th forwarding, and each cascade of nodes with the same order on the time axis shares an initial embedding. The shared matrix U is first randomly initialized and then subsequent context embedding associated with the underlying cascade is dynamically updated by subsequent global dependency modules. The shared matrix U greatly reduces the computational overhead, making the system more efficient. In one cascade, for user U _j forwarded at the j-th time, let v _j represent its one-hot index vector on shared matrix U, then the position-wise embedding of user U _j is obtained by the Uv _j index, whereRepresenting a set of all user-embedded one-hot indexes. Therefore, based on the shared matrix U, the Position-wise embedding acquisition sub-module obtains the Position-wise embedding of the user as UV.

2. A time-code acquisition sub-module (Temporal Encoding), a key component of the invention is a two-level attention module, which characterizes sequence positions with a time code z, since the order dependency is not considered by the attention mechanism, defined as follows:

The temporal coding is a variant from the original transducer absolute position coding. Wherein the method comprises the steps of Is a time code, D is the dimension of the time code, [ z (t _j)]_i represents the value encoding the ith dimension, i.e. [1, D ]. Is a set of temporal encodings.

Thus, the sequence is forwardedThe initial user-embedding of (c) can ultimately be noted as x=uv+z. The initial user-embedded X consists of a learning-shared position-wise embedded UV together with a fixed time code Z.

The global dependency of the information cascade may be regarded as an indication that a certain user may promote or suppress another user during the long-term diffusion.

The global dependency module includes:

The path perception attention layer is used for distinguishing all precursor nodes from main path nodes and non-main path nodes for each user node in user embedding, acquiring long-term dependent attention scores of the main path nodes and the non-main path nodes, and constructing attention probability matrixes of the main path and the non-main path;

and the first output network is used for carrying out full connection processing on the global dependence of each node to obtain global embedding of all the nodes.

After the user is embedded, a path perception attention layer is designed in the global module to parameterize Hawkesformer the intensity function, which is used for capturing the long-term diffusion dependence between the input cascade nodes, and the user complete diffusion dependence extraction sub-module is utilized to extract the time topology dependence of the coupling of the main path nodes and the non-main path nodes.

Theory of the hox process shows that the final popularity of the information cascade is related to the past participation of the user, which means that the current node affects not only its immediate (immediate) forwarder, but also non-immediate (non-immediate) forwarder by transitivity. In order to represent the effect of each node on the cascading future popularity from a topology perspective, a path-aware assumption is therefore made in the global module, namely a main path/non-main path assumption that the nodes on the main path are considered to have a larger contribution to the current node u _j, and the nodes on the non-main path are considered to have a smaller contribution to the arrival of the u _j node, but still have an effect on the final popularity, which also conforms to the assumption of the hox process. For the current node u _j, all precursor nodes Is divided into two categories (1) nodes on the main path, noted asWhich includes all previous nodes on the entire forwarding path from the previous node u _j back to the root node u ₀, (2) nodes on the non-primary path, i.e., all nodes except the primary path node, are noted asAn example is given in fig. 2, where for the current nodes u ₄,① and ② nodes on the main path are represented, and ③ nodes on the non-main path are represented.

Based on the above explanation, the implementation steps of the global dependency module are as follows:

giving a current node u _j to obtain the current node u _j and a certain node in the main path Long-term dependent attention score of (2):

Wherein a, b and Representing the user embedding matrix X via a main path Q (Query) matrix, respectivelyNon-primary path Q (Query) matrixAnd K (Key) matrixThe transformed user embeds vectors, and the three transformation matrices are used to distinguish the roles of the users in long-term dependence. Wherein < ·, · > represents the inner product function. Similarly, the current node u _j is not a node in the non-primary pathThe long-term dependent attention score of (2) is:

Then, the attention weighted sum is calculated by the following formula, resulting in the complete diffusion dependency e _j of user u _j:

Finally, the context embedding of each node is obtained according to the attention scores of the main path and the non-main path of each node, wherein E= { E ₀,e₁,…,e_j,…,e_L }, and each row in E represents the global dependence of a certain node in the cascade; embedding matrix X via V (Value) matrix on behalf of user The converted user embeds the vector.

The above is calculated for the dependence of the current single user u _j. In actual operation, the above attention is relied on to use a matrix parallel manner to accelerate the operation speed. The matrix calculation flow of global dependencies can be referred to fig. 3.

Firstly, based on the path-aware attention layer, the attention scores of the current node on the main path and the non-main path are calculated through the following formulas:

wherein M ^P is a derivative of the formula, One-hot matrix for preserving the attention scores of nodes on the main path and the non-main path, respectively, ifThen the elementCorrespondingly, ifFinally, the attention probability matrices for the main and non-main paths can be derived as:

where D is the dimension of the input embedding. Mask matrix To prevent the system from peeping future data if i < j, M _i,j =0, otherwise M _i,j = - ≡. The matrix M forces the softmax function to assign a valid long-term dependent attention only on forwarders before u _j, while assigning 0 to subsequent users (i.e., users later than t _j activates). Each column vector in matrix a represents u _j for its predecessor nodesIncludes the main pathAnd non-primary pathThe upper node.

Then, a matrix form of global dependence of each node is obtained according to the following formula:

E=(W^VX)A

For learning more semantic information, the path aware attention layer can also be designed here as a module based on a multi-headed attention mechanism, mapping context embedding to different subspaces, which can be computed in parallel using the matrix multiplication given above, and thus compute a plurality of heads E ₁,E₂,…,E_H, which are then connected to e=w ^OConcat(E₁,E₂,…,E_H), where Is an aggregation matrix. Finally, we feed E into the first output network (here fully connected neural network FCNN) resulting in all cascaded global embeddings:

h(t_j)=H(:,j)

Wherein, the Is a parameter of the neural network, h (t _j) isRepresents a long-term dependent representation of user u _j.

The local dependence module can be called as local module for short, the increasing rate of the transfer quantity in the information cascade is not stable, and the short-term burst in the propagation process has great influence on the final popularity. To better embed the evolution rate of the information cascade, pooled attention is utilized here to capture local patterns. The pooling operation is performed on the local patterns of each current node based on the user global embedded representation that has been learned in the global dependency module.

The local dependency module includes:

the local mode acquisition sub-module is used for intercepting a part positioned in front of each node from the global dependence according to a set time slice window to obtain a local mode of each node;

chi Huazi module, configured to perform pooling operation on the local mode of each node to obtain pooled representation of each node;

and the mask attention layer is used for dynamically aggregating each node and the local mode pooling representation of the node before the node to generate a hidden vector of the local mode of each node.

Each current node has a local mode, a fixed time slice window t _s is set, a local mode acquisition submodule is utilized to intercept the part positioned before each node from the global dependence to obtain the local mode of each node, and therefore, the local mode of the current node consists of the global dependence of the current node u _j and part of precursor nodes and is expressed asRepresenting a collection of all local patterns of a cascade of information.

For each current node u _j, its local pattern is pooled by the pooling sub-modulePooling operations are performed to aggregate features in time slices t _s to generate new representationsThus, all local modesThe overall pooled representation of (a) may be formalized as: corresponds to a pooled representation of the local pattern.

To capture the evolution rate associated with cascading short-term bursts/dips, the present invention incorporates the effects of local patterns through a designed masking attention layer. Based on the mask attention mechanism, all local pattern embeddings (i.e., pooled representations) before the current time t _j are dynamically aggregated with the mask attention layer, generating a hidden vector to summarize the evolution of all previous patterns:

g (·) is a linear transformation function, γ= { v (t ₀),v(t₁),…,v(t_j),…,v(t_L) } is a set of local dependencies of the evolution rate. The similarity function f (·, ·) is specified as:

The intensity function obtaining module parameterizes the intensity function of the hox process by the global module and the local module to obtain the intensity function λ (t) of the invention Hawkesformer. Given an information cascade C _k and its forwarding history over observation time t _o Representing a sequence preceding time stamp t _j but excluding time t _j. Here, the intensity function acquisition module models the continuous dynamics of the cascade diffusion using the following conditional intensity function:

Wherein Θ represents a parameter of the model, the forwarding time t is defined in a section [ t _j,t_j+1 ], f (x) =βlg (1+exp (x/β)) is a softplus function, the parameter β is used for constraining the intensity function to be positive, a first current represents an evolution process of a continuous time domain, α controls importance of interpolation nodes in two time stamps, w ₁、w₂ is learning weight of a global dependency and a local mode at time t respectively, and the intensity function λ (t) represents a push arrival rate at time t. Both global and local dependencies of Hawkesformer have expressive and flexible properties, the design of the intensity function being such that Hawkesformer captures the spatio-temporal dynamics of the coupling from a continuous time domain.

The popularity prediction module is mainly used for predicting information popularity. According to Hawkes theory, the survival function at time t is: the function represents a conditional probability that no event has occurred between t _n, t. Subsequently, we can derive the probability that an event is observed at any time t':

Finally, by means of the popularity prediction module, the final popularity is predicted according to the following formula in combination with the time point process and the outputs of the two-level attention mechanism modules (global module and local module)

The first term Λ represents the integral of the hough process, the desired popularity, t _o the observation time, t _p the prediction time, and the second term the output of the global embedded and local pattern hidden vector splice via the second output network (here fully connected neural network FCNN).

The information cascade prediction system based on the transform enhanced Hox process also comprises an optimization module used for parameter optimization in the system training process.

Forwarding time of cascade C _k The joint likelihood of (2) can be expressed as:

Wherein the method comprises the steps of Representing a set of times up to time t _j.

Assuming that there are N information cascade label samples for training, { C ₁,C₂,…,C_N }, the optimization module optimizes by minimizing the Mean Square Log Error (MSLE) loss of the fluidity and maximizing the log likelihood of all information cascades:

y _k represents the true popularity of the kth information cascading label sample, And representing the predictive popularity corresponding to the kth information cascading label sample.

The Adam gradient descent algorithm is then used to optimize the system network parameters.

The invention also provides an information cascade prediction method based on a transform enhanced Hox process, which comprises the following steps:

s1, acquiring a user position-wise embedding and a time code, and taking the user position-wise embedding and the time code as a user embedding;

s2, distinguishing all precursor nodes as main path nodes and non-main path nodes for each user node in the user embedding, respectively acquiring attention probability matrixes of the main path nodes and the non-main path nodes, further acquiring global dependence of each node, and then acquiring global embedding of all nodes through full connection processing;

S3, obtaining a local mode of each node according to a set time slice window, and carrying out pooling operation on the local mode of each node to obtain pooled representation of each node;

S4, parameterizing an intensity function of the Hox process according to the global embedding and the local mode hiding vectors;

s5, integrating the Hox process and generating a popularity prediction result by combining the full connection processing result of the global embedded and local mode hidden vectors.

The present invention thus implements a system and method for modeling and predicting the coupling of time and topology features over a continuous time domain by introducing a two-layer attention mechanism into the hough process. At the first level we distinguish DAGs from non-primary paths in terms of primary paths, which layer is used to capture global dependencies between nodes, i.e. time and topology dependencies of couplings, and at the second level we propose a local pool attention module to embed the evolution rate of the cascade to simulate bursts and dips of short-term forwarding numbers.

Compared with the prior art, the invention has the following beneficial effects:

1. The method comprises the steps of firstly obtaining position-wise embedding and time coding of a user as user embedding, then introducing a path perception assumption from the view angle of a topological structure, designing two layers of attention layers from two angles of global embedding and local mode, parameterizing the intensity function of the Hox process, combining the Hox process, the global embedding and the local mode, learning the time and the topological random characteristics coupled in the information cascade diffusion process so as to conduct popularity prediction.

2. The invention provides a two-level attention architecture which parameterizes the intensity function of a Hox process, firstly designs a global module attention layer based on path perception to capture the long-term diffusion dependency relationship between nodes in a dynamic diffusion topology, and then provides a local module to capture the short-term evolution rate in information cascade.

3. The invention customizes a position-wise user embedding method capable of learning for the information cascade directed acyclic graph, improves the parameter efficiency and reduces the calculation burden;

4. The method is extremely important for understanding the evolution process of the social network platform and explaining the cascade epidemic reasons, for example, the method predicts the forwarding quantity of a certain microblog in a future period, explains the epidemic factors, and can be used for downstream tasks such as marketing design, rumor prediction, influence maximization and the like.

Drawings

FIG. 1 is an information cascade diffusion process presenting a Directed Acyclic Graph (DAG). The modeling mode of the existing model separately models the time feature of the node and the topology feature appearing in the graph, for example, the time feature is fed into RNN, and the topology feature is fed into GNN. But both are in coupled form in nature, the topology in the DAG is determined simultaneously when the forwarding time of a certain node is determined, e.g. u ₂'s forwarding behavior depends on u ₀ instead of its previous node u ₁ in time, so the DAG forms a time topology dependency of the coupling, whereas separate modeling cannot capture the dependency of the coupling.

FIG. 2 is a main path and non-main path hypothesis in the global dependency module, which can directly acquire long-term dependencies on the main path (non-main path), unlike the chained structure based on the RNN sequence model.

FIG. 3 is an implementation detail diagram of the primary and non-primary path hypotheses in the global dependency module, i.e., the implementation detail of FIG. 2.

FIG. 4 is a model diagram of an information cascade prediction system based on a transform enhanced Hox process of the present invention, including an initial user embedding, a global dependency module, a local dependency module, and a prediction module.

FIG. 5 is a visual depiction of an attention probability matrix A, representing three cascade-corresponding thermodynamic diagrams (Heatmaps) and cascade structures randomly chosen in Weibo dataset, wherein the diagonal lines in the thermodynamic diagrams represent the attention of the current node to itself, each row represents the attention of the current node to all predecessor nodes, each entry ranges from [0,1], the darker the entry color, the higher the attention, and the sum of each row is 1.

Interpretation of the terms

Information concatenation Information Cascade figure 1 illustrates this process in an example where after a root node publishes a piece of public content, friends and caregivers of the root node will see the public content and forward it one after the other. Thus, public content propagates through the edges of a social network and creates a cascade of information, a typical task of which is to predict the size of a certain cascade (tweets, microblogs, etc.), potentially affected users after a period of observation. The theory basis can be referred to in the literature 【J.Cheng,L.Adamic,P.A.Dow,J.M.Kleinberg,and J.Leskovec.Can cascades be predictedIn Proc.of WWW,2014.】

Hox process (HawkesProcess, HP) the Hox process is a mainstream time point process (Temporal Point Processes, TPPs). HP has three key factors, namely (1) user influence, more contribution of the user with influence to the arrival rate of new forwarding, more forwarding of the message forwarded by the user with influence, a self-excitation mechanism, each forwarding can influence the arrival rate of new forwarding in the future, and (3) a time attenuation effect. The impact of forwarding decays with increasing time. However, conventional HP simplifies the arrival rate of events, models the arrival of events using, for example, exponential or power law functions, and makes strong assumptions about the diffusion mechanisms in cascading events, limiting their learning ability over large-scale cascading data.

Self-attention mechanism (Self-Attention Mechanism) the Self-attention mechanism, a special case of the attention mechanism recently proposed in Transformer【Vaswani,Ashish,et al."Attention is all you need."Advances in neural information processing systems 30(2017).】, has been widely used in various NLP tasks such as machine translation, language modeling, etc., and has produced considerable improvements. It encodes the input sequence by correlating the input token, which consists of the matrices Q (query), K (Keys) and V (Values). The matrix form of the attention function is:

Detailed Description

The invention will be further described with reference to the accompanying drawings.

Example 1

For convenience, in the rest of the invention we set out the Twitter (Twitter) cascade as an example, but the invention is also applicable to other types of information cascades, such as microblogging, academic citations, etc., and we will give examples in the experimental section.

The information cascade prediction system based on the transform enhanced hoxwell process provided in this embodiment, as shown in fig. 4, includes a user embedding module, a global dependency module, a local dependency module, an intensity function obtaining module, and a popularity prediction module.

(1) User embedded module

And the user embedding module is used for acquiring the user position-wise embedding and the time code and taking the user position-wise embedding and the time code as the user embedding.

Position-wise embedding acquisition sub-module comprising a shared matrixIn one cascade, for user U _j forwarded at the j-th time, let v _j represent its one-hot index vector on shared matrix U, then the position-wise embedding of user U _j is obtained by the Uv _j index, whereRepresenting a set of all user-embedded one-hot indexes. Therefore, based on the shared matrix U, the Position-wise embedding acquisition sub-module obtains the Position-wise embedding of the user as UV.

The time code acquisition submodule characterizes the sequence position by using the time code z, and is defined as follows:

Thus, the user embedding can ultimately be noted as x=uv+z. The initial user-embedded X consists of a learning-shared position-wise embedded UV together with a fixed time code Z.

The shared matrix U is initially initialized, and its parameters are dynamically updated together by the designed optimization module using back propagation to obtain subsequent context embedding associated with the underlying cascade.

(2) Global dependency module

And the global dependence module is used for distinguishing all the precursor nodes from the main path nodes and the non-main path nodes for each user node in the user embedding, acquiring the attention probability matrixes of the main path nodes and the non-main path nodes, further acquiring the global dependence of each node, and obtaining the global embedding of all the nodes through full connection processing.

The global dependency module includes a path aware attention layer and a first output network.

The path perception attention layer is used for distinguishing all precursor nodes from main path nodes and non-main path nodes for each user node in user embedding, acquiring long-term dependent attention scores of the main path nodes and the non-main path nodes, constructing attention probability matrixes of the main path and the non-main path, and acquiring global dependence according to the attention probability matrixes of the main path and the non-main path of each node.

As shown in fig. 2 and 3, a path aware hypothesis, i.e., a main path/non-main path hypothesis, is set forth in the global module. Giving a current node u _j to obtain the current node u _j and a certain node in the main pathLong-term dependent attention score of (2):

Finally, the global dependencies of all nodes are constructed according to the complete diffusion dependencies of each node. E= { E ₀,e₁,…,e_j,…,e_L }, each row in E representing the global dependence of a certain node in the cascade, the context embedding of each node is obtained according to the attention scores of the main path and the non-main path of each node; embedding matrix X via V (Value) matrix on behalf of user The converted user embeds the vector.

The above is calculated for the dependence of the current single user u _j. In actual operation, the above attention is relied on to use a matrix parallel manner to accelerate the operation speed.

Based on the path-aware attention layer, the attention scores of the current node on the main path and the non-main path are calculated according to the following formula:

where D is the dimension of the input embedding. Mask matrix To prevent the system from peeping future data if i < j, M _i,j =0, otherwise M _i,j = - ≡. The matrix M forces the softmax function to assign a valid long-term dependent attention only on forwarders before u _j, while assigning 0 to subsequent users (i.e., users later than t _j activates). Each column vector in matrix a represents u _j for its predecessor nodesIncludes the main pathAnd nodes on non-primary paths

E=(W^VX)A

To learn more semantic information, the path aware attention layer is designed here as a module based on a multi-headed attention mechanism, mapping context embedding into different subspaces, which can be computed in parallel using the matrix multiplication given above, and thus compute a plurality of heads E ₁,E₂,…,E_H, which are then connected to e=w ^OConcat(E₁,E₂,…,E_H), where Is an aggregation matrix. And the first output network is used for carrying out full connection processing on the global dependence of all the nodes to obtain global embedding of all the nodes. In this embodiment, using the fully connected neural network FCNN as the first output network, the global dependencies E of all nodes are fed into the first output network, so as to obtain global embedments of all nodes:

h(t_j)=H(:,j)

(3) Local dependency module

The local dependency module is used for acquiring the local mode of each node according to the set time slice window, carrying out pooling operation on the local mode of each node to obtain pooled representation of each node, and generating a hidden vector of the local mode of each node based on a mask attention mechanism.

The local dependency module includes a local pattern retrieval sub-module, a pooling sub-module, and a mask attention layer.

And the local mode acquisition sub-module is used for intercepting a part positioned in front of each node from the global dependence according to the set time slice window to obtain the local mode of each node. Setting a fixed time slice window t _s, utilizing a local mode acquisition submodule to intercept the part positioned before each node from the global dependence to obtain the local mode of each node, so that the local mode of the current node is formed by the global dependence of the current node u _j and part of precursor nodes and is expressed asRepresenting a collection of all local patterns of a cascade of information.

And Chi Huazi module, configured to perform pooling operation on the local mode of each node to obtain a pooled representation of each node. For each current node u _j, its local pattern is pooled by the pooling sub-modulePooling operations are performed to aggregate features in time slices t _s to generate new representationsThus, all local modesThe overall pooled representation of (a) may be formalized as: corresponds to a pooled representation of the local pattern.

And the mask attention layer is used for dynamically aggregating each node and the local mode pooling representation of the node before the node to generate a hidden vector of the local mode of each node. Based on the mask attention mechanism, all local pattern embeddings (i.e., pooled representations) before the current time t _j are dynamically aggregated with the mask attention layer, generating a hidden vector to summarize the evolution of all previous patterns:

g (·) is a linear transformation function, y= { v (t ₀₁),v(t₁),…,v(tj),…,v(t_L) } is a set of local dependencies of evolution rate. The similarity function f (·, ·) is specified as:

(4) Intensity function acquisition module

And the intensity function acquisition module is used for parameterizing the intensity function of the traditional Hox process according to the global embedding and the local mode hiding vectors. Here, the intensity function acquisition module models the continuous dynamics of the cascade diffusion using the following conditional intensity function:

Wherein Θ represents a parameter of the model, the forwarding time t is defined in a section [ t _j,t_j+1 ], f (x) =βlg (1+exp (x/β)) is a softplus function, the parameter β is used for constraining the intensity function to be positive, a first current represents an evolution process of a continuous time domain, α controls importance of interpolation nodes in two time stamps, w ₁、w₂ is learning weight of a global dependence and a local pattern at time t, and λ (t) represents a push arrival rate at time t.

(5) Popularity prediction module

And the popularity prediction module is used for generating popularity prediction results by combining integration of the Hox process and full-connection processing results of the global embedding and local mode hiding vectors. Here, the final popularity is predicted according to the following formula

Example 2

This example is a further improvement over example 1.

The information cascade prediction system based on the transform enhanced Hox process further comprises an optimization module, wherein the optimization module is used for optimizing parameters in the system training process.

Example 3

The embodiment provides an information cascade prediction method based on a transform reinforced Hox process, which comprises the following steps:

s1, acquiring user position-wise embedding and time coding, and taking the user position-wise embedding and the time coding as user embedding.

The step uses a user embedding module to respectively acquire a shared position-wise embedding UV and a fixed time code Z, and obtains an end user embedding x=uv+z.

S2, distinguishing all precursor nodes as main path nodes and non-main path nodes for each user node in the user embedding, respectively acquiring long-term dependency attention scores of the main path nodes and the non-main path nodes, further acquiring global dependencies of each node, and obtaining global embedding of all nodes through full connection processing.

The method comprises the steps of firstly distinguishing all precursor nodes from each user node in user embedding by using a path perception attention layer, obtaining attention probability matrixes of the main path nodes and the non-main path nodes, constructing global dependence of all nodes according to the attention probability matrixes of the main path and the non-main path of each node, and obtaining global embedding of all nodes through full connection processing of a first output network.

S3, obtaining a local mode of each node according to a set time slice window, carrying out pooling operation on the local mode of each node to obtain pooled representation of each node, and generating a local mode hiding vector of each node based on a mask attention mechanism.

The method comprises the steps of firstly utilizing a local mode acquisition submodule to intercept a part positioned in front of each node from global dependence according to a set time slice window to obtain a local mode of each node, then carrying out pooling operation on the local mode of each node through a pooling submodule to obtain pooled representation of each node, and dynamically aggregating each node and the local mode pooled representation of the node before the node by using a mask attention layer to generate a hidden vector of the local mode of each node.

S4 is used to parameterize the intensity function of the hough process in terms of global embedding and local mode concealment vectors.

Here, the intensity function acquisition module is used to parameterize the intensity function of the conventional hough process according to the global embedding and local mode concealment vectors, resulting in the intensity function of the Hawkesformer model.

Here, integration of the hough process and fully connected processing results of the global embedded and local pattern hidden vectors are combined to generate a popularity prediction result by a popularity prediction module.

Application example

The application uses three different real data sets (Twitter, weibo and APS, the first data set source reference 【Lilian Weng,Filippo Menczer,and Yong-YeolAhn.2013.Virality prediction andcommunity structure in social networks.Scientific Reports3(2013)】, the second data set source reference 【Qi Cao,Huawei Shen,Keting Cen,Wentao Ouyang,and Xueqi Cheng.2017.Deep-Hawkes:Bridging the gap between prediction and understanding of informationcascades.InCIKM.1149–1158】) and the third data set source reference [ https:// journ als. Org/datasets ] to describe the prediction effect of the information cascade prediction system based on the transform enhanced Hox process.

Three data sets were divided into training, validation and test sets at ratios of 70%, 15%, respectively.

Firstly, training and optimizing an information cascade prediction system (Hawkesformer) based on a transducer enhanced Hox process provided in the embodiment 2 by using a training set, and verifying a system obtained by training by using a verification set every time training is completed until a loss function of the systemTend to stabilize.

Then, the trained Hawkesformer was tested on the test set according to the information cascade prediction method based on the transform enhanced houx procedure provided in example 3, and compared with 7 different baseline models (Feature-Deep, deepHawkes, casCN, FOREST, vaCas, tempCas, hawkesformer), and the commonly used measurement method MSLE was used (smaller value, better prediction effect).

TABLE 1 comparison of HawkesFormer versus baseline model Performance on three data sets

Note that bold values indicate better results than other settings. All experimental results were best obtained in 10 trials with statistical significance ρ <0.05.

The description of the 7 baseline methods in table 1 is as follows:

Feature-Deep extracting structural features, temporal features, user features, and inputting them into the bilayer MLP.【Xu,Xovee,et al."CasFlow:Exploring hierarchical structures and propagation uncertainty for cascade prediction."IEEE Transactions on Knowledge and Data Engineering(2021).】

DEEPHAWKES the self-excitation point process of deep learning and Hawkes is combined to bridge the gap between predictive performance and interpretability .【Qi Cao,Huawei Shen,Keting Cen,Wentao Ouyang,and Xueqi Cheng.2017.Deep-Hawkes:Bridging the gap between prediction and understanding of informationcascades.InCIKM.1149–1158.】

CasCN is a hybrid model, by stacking GNNs to obtain node embedding, and using RNNs to simulate the evolution of diffusion .【Chen,Xueqin,et al."Information diffusion prediction via recurrent cascades convolution."2019IEEE 35th international conference on data engineering(ICDE).IEEE,2019.】

Forest, combining reinforcement learning and RNN to address multi-scale cascade prediction problem .【Yang,Cheng,et al."Multi-scale Information Diffusion Prediction with Reinforced Recurrent Networks."IJCAI.2019.】

VaCas integration of hierarchical diffusion model and spatio-temporal structural features of information concatenation while also capturing diffusion uncertainty .【Fan Zhou,Xovee Xu,Kunpeng Zhang,GoceTrajcevski,and Ting Zhong.2020.Variational information diffusion for probabilistic cascades prediction.InINFOCOM.1618–1627.】

TempCas implementing a heuristic approach to embedding cascading graphs using RNNs and learning historical short-term trends using LSTM on specially designed attention-based CNNs .【Tang,Xiangyun,et al."Fully exploiting cascade graphs for real-time forwarding prediction."Proceedings of the AAAI Conference on Artificial Intelligence.Vol.35.No.1.2021.】

As can be seen from the experimental results in Table 1, the information cascade prediction system based on the transform enhanced Hox process provided by the invention can greatly improve the accuracy of popularity prediction compared with other baseline models.

In addition, in order to study the validity of the long-term diffusion dependency relationship between the nodes in the global dependency module capturing dynamic diffusion topology, the application example further obtains an attention probability matrix A, as shown in fig. 5. The following stepwise analysis was performed:

(1) For smaller (13 users) information cascades C ₁, the information cascade prediction system provided by the invention focuses more on the main path and early users participating in the diffusion. For each row of matrix A we can see that the user on the main path typically has the most prominent color, which verifies our main/non-main assumption in the information propagation process (similar phenomena are also shown in cascades C ₂ and C ₃). For node 9 in cascade C ₁, hawkesFormer detects that 0→2→9 is the primary path, indicating that node 0 and node 2 have a greater impact on node 9 than other nodes on the primary path have less impact on the forwarding behavior of node 9. For node 8, hawkesFormer detects that 0→4→8 is the main path, indicating that node 0 and node 4 have a greater impact on node 8, and that other nodes on the non-main node have less impact on the forwarding behavior of node 8.

(2) For a medium (58 users) information cascade C ₂, a long-term dependency between the root user and the following node is observed. This suggests that long-term dependencies between nodes in a dynamic diffusion topology need to be modeled, because sequence models such as RNNs are difficult to train on long sequences and face gradient vanishing/explosion problems when the sequences are too long. In addition, several short bursts were observed, which again verifies the hypothesis of the present invention to learn local patterns for information concatenation.

(3) Also shown is a viral propagation cascade C ₃, which has 106 users with a maximum depth of 8. Most of the rotations (84%) in this cascade are from indirect user to root. We have found that even the "tail" users at the end of the diffusion tree may trigger bursts (e.g., users 19 and 43). Some users act as hubs connecting other influential users, bringing them into the information dissemination process.

Therefore, the information cascade prediction system based on the transform reinforced Hox process provided by the invention not only can improve the accuracy of popularity prediction, but also has stronger interpretability. Experiments on three real data sets demonstrate the superior performance of the present invention over the most advanced baseline model. The improvement in performance of the proposed solution shows that combining the advantages of deep learning (transform) with probability models (hough process) is a new direction to effectively simulate the information cascade diffusion process and improve the prediction accuracy.

In summary, the modeling mode of the information propagation process is crucial to understanding the information propagation mechanism and popularity prediction task, the evolution of the information propagation presents a directed acyclic graph (DIRECTED ACYCLIC GRAPH, DAG), and the topological features and time are coupled in nature, but the existing research adopts a mode of separately modeling the two most important features (topological features and time features) of the cascade, so that cross-domain information is lost, and the expression and prediction capability of the model are reduced. The invention relates to a multi-level attention mechanism and a Hoxwell self-excitation point process so as to realize more effective and reasonable modeling of information cascade and achieve more accurate popularity prediction. In particular, using a two-layer attention structure to parameterize the intensity function of the hough process has been achieved to efficiently acquire representation knowledge from a continuous time domain.

Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims

1. An information cascade prediction system based on Transformer-enhanced Hawkes processes, characterized in that it comprises:

The user embedding module is used to obtain the user position-wise embedding and time encoding, and to use the user position-wise embedding and time encoding as the user embedding.

The global dependency module is used to distinguish all predecessor nodes from main path nodes and non-main path nodes for each user node in the user embedding, and obtain the attention probability matrix of main path nodes and non-main path nodes, thereby obtaining the global dependency of each node, and then obtaining the global embedding of all nodes through a fully connected layer.

The local dependency module is used to obtain the local pattern of each node according to the set time slice window, and then perform a pooling operation on the local pattern of each node to obtain the pooled representation of each node; finally, based on the mask attention mechanism, the hidden vector of the local pattern of each node is generated.

The intensity function acquisition module is used to parameterize the intensity function of the Hawkes process based on the global embedding and local pattern hiding vectors.

The popularity prediction module combines the integral of the Hawkes process with the fully connected processing results of global embedding and local pattern hiding vectors to generate popularity prediction results.

2. The information cascade prediction system based on Transformer-enhanced Hawkes process according to claim 1, wherein the user embedding module includes a position-wise embedding acquisition submodule and a time-encoding acquisition submodule;

The Position-wise embedding acquisition submodule includes a shared matrix U. In a cascade, for user _uj in the j-th forward, let _vj represent its one-hot index vector on the shared matrix U. Then the position-wise embedding of user _uj is obtained by indexing _Uvj , where V represents the set of one-hot indices of all user embeddings. The user position-wise embedding obtained by the Position-wise embedding acquisition submodule is UV.

The time-encoding acquisition submodule uses time-encoding z to represent the sequence position, as defined below:

in It is a time-encoded sequence, where D is the dimension of the time-encoded sequence, and [z( _tj )] _i represents the value of the i-th dimension. It is a collection of time codes;

User embedding is ultimately denoted as: X = UV + Z.

3. The information cascade prediction system based on Transformer-enhanced Hawkes processes according to claim 1, characterized in that the global dependency module comprises:

The path-aware attention layer is used to distinguish all predecessor nodes from main path nodes and non-main path nodes for each user node in the user embedding, and to obtain the long-term dependency attention scores of main path nodes and non-main path nodes, and to construct the attention probability matrix of main path and non-main path; then, based on the attention probability matrix of main path and non-main path of each node, its global dependency is obtained.

The first output network is used to perform fully connected processing on the global dependencies of each node, resulting in the global embedding of all nodes.

4. The information cascade prediction system based on the Transformer-enhanced Hawkes process according to claim 3, characterized in that, firstly, based on the path-aware attention layer, the attention scores of the current node on the main path and non-main path are calculated using the following formula:

P _score =(W ^P X) ^T (W ^K X)⊙M ^P

N _score =(W ^N X) ^T (W ^K X)⊙M ^N

Among them, ^MP , One-hot matrices are used to retain the attention scores of nodes on the main path and non-main path, respectively. If the set represents the main path node set, then the elements are... Correspondingly, if Represents the set of non-main path nodes. Finally, the attention probability matrices for the main path and non-main paths are derived as follows:

Where D is the dimension of the input embedding; the mask matrix To prevent the system from spying on future data, if i < j, M_i,j = 0, otherwise M _i,j = -∞; each column vector in matrix A represents u_j relative to its predecessor node. Attention, including the main path Nodes on non-main paths

Then, the matrix form of the global dependencies of each node is obtained through the user global dependency extraction submodule: E = (W ^V X)A;

By performing a fully connected operation on E through the first output network, the global embeddings of all nodes in the cascade are obtained:

h(t _j )＝H(:,j )

in, These are the parameters of the neural network; h( _tj ) is... The j-th column in the table represents the long-term dependency representation of user _uj .

5. The information cascade prediction system based on the Transformer-enhanced Hawkes process according to any one of claims 1 to 4, characterized in that the local dependency module comprises:

The local mode acquisition submodule is used to extract the portion of the global dependencies that precedes each node based on a set time slice window, thus obtaining the local mode of each node;

The pooling submodule is used to perform pooling operations on the local patterns of each node to obtain the pooled representation of each node;

The masked attention layer is used to dynamically aggregate the local pattern pooled representations of each node and its preceding nodes, generating a hidden vector for the local pattern of each node.

6. The information cascade prediction system based on the Transformer-enhanced Hawkes process according to claim 5, characterized in that, a masked attention layer is used to dynamically aggregate all local pattern embeddings before the current time _tj to generate a hidden vector that summarizes the evolution process of all previous modes:

g(·) is a linear transformation function, γ＝{v(t ₀ ),v(t ₁ ),…,v(t _j ),…,v(t _L )} is the set of local dependencies of the evolution rate; the similarity function f(·,·) is specified as:

7. The information cascade prediction system based on the Transformer-enhanced Hawkes process according to claim 6, characterized in that the intensity function acquisition module uses the following conditional intensity function to model the continuous dynamics of cascade diffusion:

Where Θ represents the model parameters, the forwarding time t is defined in the interval [ _tj , tj ₊₁ ), where f(x)＝βlg(1+exp(x/β)) is the softplus function, the parameter β is used to constrain the strength function to be positive; the first term current represents the evolution process in the continuous time domain, α controls the importance of the interpolation nodes within the two timestamps; _w1 and _w2 are the learning weights of global dependency and local pattern at time t, respectively, and λ(t) represents the tweet arrival rate at time t.

8. The information cascade prediction system based on the Transformer-enhanced Hawkes process according to claim 7, characterized in that the popularity prediction module predicts the final popularity according to the following formula.

The first term Λ represents the integral of the Hawkes process, indicating the expected popularity; t _o represents the observation time; t _p represents the prediction time; the second term represents the output of the concatenation of the global embedding and local pattern hiding vectors through the second output network.

9. The information cascade prediction system based on the Transformer-enhanced Hawkes process according to claim 8, characterized in that it further includes an optimization module for parameter optimization during system training; the optimization module optimizes by minimizing the loss of popularity due to the mean square logarithmic error and maximizing the log-likelihood of all information cascades:

y _k represents the true popularity of the k-th message cascaded label sample. Let l(C_k ) represent the predicted popularity of the k-th cascaded label sample, l(Ck) represent the joint likelihood of a cascaded C _k , and N represent the number of cascaded label samples.

10. An information cascade prediction method based on Transformer-enhanced Hawkes processes, comprising the following steps:

S1 obtains the user position-wise embedding and time encoding, and uses the user position-wise embedding and time encoding as the user embedding;

S2 distinguishes all predecessor nodes from main path nodes and non-main path nodes for each user node in the user embedding, and obtains the long-term dependency attention scores of main path nodes and non-main path nodes respectively, thereby obtaining the global dependency of each node, and then obtains the global embedding of all nodes through fully connected processing.

S3 obtains the local pattern of each node according to the set time slice window, and then performs a pooling operation on the local pattern of each node to obtain the pooled representation of each node; then, based on the mask attention mechanism, it generates the local pattern hidden vector of each node.

S4 is used to parameterize the intensity function of the Hawkes process based on the global embedding and local pattern hiding vectors;

S5 is used to combine the integral of the Hawkes process with the fully connected processing results of global embedding and local pattern hiding vectors to generate popularity prediction results.