Disclosure of Invention
In order to solve the technical problems, the application provides a data analysis method, a system, a terminal and a storage medium, so as to meet the analysis requirements of user behavior data under a super-large data scale, improve the stability and efficiency of data analysis and improve the accuracy of data analysis products.
The application provides a data analysis method, which comprises the following steps: acquiring behavior data of a target object, and storing the behavior data to a ClickHouse cluster according to a preset rule; acquiring data analysis task information, and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node; and analyzing the behavior data of each node according to the task scheduling arrangement to generate a data analysis product.
In one embodiment, storing the behavior data to a clickwouse cluster according to a preset rule includes: performing hash fragmentation on the behavior data according to the identity identification number of the target object, and writing the behavior data of each target object into a corresponding node of the ClickHouse cluster; and storing the behavior data of each target object according to a preset storage mode.
In an embodiment, the step of storing the behavior data of each target object according to a preset storage mode includes: pre-ordering the behavior data of each target object according to a three-level index sequence; wherein, the first-level index is the event number of the behavior data; the secondary index is the identity identification number of the target object to which the behavior data belongs; the third-level index is the log time of the behavior data.
In one embodiment, the data analysis task information includes a task type of the data analysis task; wherein the task type comprises at least one of event statistics, portrait analysis, funnel analysis, behavior path analysis, table structure change and cleaning outdated data.
In one embodiment, the task scheduling is performed on each node of the clickwouse cluster, and includes at least one of: according to the task execution priority, sequentially executing different types of data analysis tasks; and according to the task type of the data analysis task, adopting a corresponding load balancing strategy, wherein the load balancing strategy comprises random, polling and minimum load.
In one embodiment, executing a corresponding load balancing policy according to a task type of a data analysis task includes: if the task type is event statistics and/or portrait analysis and/or funnel analysis and/or behavior path analysis, a minimum load strategy is adopted; if the task type is the change of the table structure, a random strategy is adopted; and if the task type is to clear out-of-date data, adopting a polling strategy.
In an embodiment, the task scheduling for each node of the ClickHouse cluster further includes: acquiring the reading line number and the execution time of a data analysis task; and stopping the data analysis task when the reading line number of the data analysis task exceeds the maximum reading line number and/or the execution time of the data analysis task exceeds the maximum execution time.
The application also provides a data analysis system, which comprises a data writing module, a data storage module, a task scheduling module and a data analysis module; the data writing module is used for acquiring behavior data of a target object and writing the behavior data into the ClickHouse cluster according to a first preset rule; the data storage module is used for storing the behavior data written into the ClickHouse cluster according to a second preset rule; the task scheduling module is used for acquiring data analysis task information and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node; and the data analysis module is used for analyzing the behavior data of each node according to task scheduling arrangement to generate a data analysis product.
The present application further provides a terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the analysis method when executing the computer program.
The present application also provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described analysis method.
According to the data analysis method, the system, the terminal and the storage medium, behavior data are stored by using the ClickHouse cluster, and a plurality of load balancing strategies are adopted to schedule data analysis tasks, so that the analysis requirements of user behavior data under a super-large data scale can be met, the stability and the efficiency of data analysis are improved, and the accuracy of data analysis products is improved.
Detailed Description
The technical solution of the present application is further described in detail with reference to the drawings and specific embodiments of the specification. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, "and/or" includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present application. As shown in fig. 1, the data analysis method of the present application may include the steps of:
step S101: acquiring behavior data of a target object, and storing the behavior data to a ClickHouse cluster according to a preset rule;
in one embodiment, storing behavior data to a clickwouse cluster according to a preset rule includes:
performing hash fragmentation on the behavior data according to the identity identification number of the target object, and writing the behavior data of each target object into a corresponding node of the ClickHouse cluster;
and storing the behavior data of each target object according to a preset storage mode.
Optionally, an open source stream processing framework (Flink) service is adopted to insert data into the ClickHouse cluster in a Java database connection (jdbc) mode; further, in order to ensure the calculation efficiency during query and reduce the writing pressure as much as possible, the data is subjected to hash fragmentation according to the identity identification number (user ID) of the target object and then directly inserted into the local surface of the click House cluster, so that the problem of write amplification of the distributed table is avoided; finally, all behavior data of the same user are landed on the same machine, and a large amount of network read-write (IO) transmission during calculation is avoided, so that localized calculation is realized.
In one embodiment, the step of storing the behavior data of each target object according to a preset storage mode includes:
pre-sorting the behavior data of each target object according to a three-level index;
wherein, the first-level index is an event number of the behavior data; the secondary index is an identity identification number of a target object to which the behavior data belongs; the tertiary index is the log time of the behavioral data.
Optionally, in terms of primary key selection, to ensure retrieval efficiency, an event number (event _ id) of the behavior data is selected as a primary index; in order to ensure the high efficiency of data query, an identity number (xxHash32 (distint _ id) of a target object to which behavior data belongs is selected as a secondary index, and log time (log _ time) of the behavior data is selected as a tertiary index, so that all data are pre-ordered in advance according to the tertiary index sequence in a storage layer.
In an embodiment, the step of storing the behavior data of each target object according to a preset storage mode further includes:
in terms of engine selection, selecting a merge tree (mergeTree) engine as a distributed computing engine;
in the aspect of partition selection, selecting the generation date of behavior data as a partition field to realize horizontal partition;
in the aspect of sampling field selection, in order to ensure that the estimated scene has a small millisecond-level response, an identity identification number (distint _ id) of a target object is selected as a sampling field, and a sampling result obtained through a hash function (xxHash32 (distint _ id)) is used as a sampling reference;
in terms of Time To Live (TTL), considering local storage limitation and data use frequency, defining data within a preset Time as hot data, such as about 2 months, defining data outside the preset Time as cold data, such as 2 months ago, and providing data analysis service for the hot data, wherein the cold data can be subjected To data analysis through a hadoop distributed computing platform;
in terms of data granularity, official recommendations are used, i.e. an index is generated every 8192 rows.
Step S102: acquiring data analysis task information, and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node;
optionally, the data analysis task information includes task type and task configuration information; the task type comprises at least one of event statistics, portrait analysis, funnel analysis, behavior path analysis, table structure change and outdated data cleaning; the task configuration information is the analysis dimension or the analysis range of each type of task, for example, the configuration information of the portrait analysis task is the region, the academic calendar, the gender, the age and the like; the configuration information of the funnel analysis task is the hierarchical relation of user login, order placement and payment.
Illustratively, the events are counted as statistics of the number of users performing operations on the application or web page, such as login, placing an order, payment, etc.; portrait analysis is to analyze portrait information of registered users, such as the number of Hangzhou regional males among the registered users; the funnel analysis is to analyze the conversion rate from login, order placing to payment of the number of users, if 100 universal users log in, 80 universal users place orders and 50 universal users pay, the conversion rate from login to order placing is 80%, and the conversion rate from order placing to payment is 62.5%; analyzing the behavior path to analyze an operation track of the user from the first moment to the nth moment, such as a click track or a browsing track of an application program or webpage content; the table structure change comprises addition and deletion of a table, addition and deletion of columns in the table, addition, deletion, modification and the like of fields in the table, and the cleaning of expired data is to clean data exceeding the data lifetime.
In one embodiment, task scheduling is performed on each node of the ClickHouse cluster, and the task scheduling includes at least one of the following steps:
according to the task execution priority, sequentially executing different types of data analysis tasks;
and according to the task type of the data analysis task, adopting a corresponding load balancing strategy, wherein the load balancing strategy comprises random, polling and minimum load.
Optionally, executing a corresponding load balancing policy according to the task type of the data analysis task, including:
if the task type is event statistics and/or portrait analysis and/or funnel analysis and/or behavior path analysis, a minimum load strategy is adopted;
if the task type is the change of the table structure, a random strategy is adopted;
and if the task type is to clear out-of-date data, adopting a polling strategy.
Optionally, because event statistics, portrait analysis, funnel analysis and behavior path analysis occupy more computing resources, a minimum load strategy is adopted, optionally, the tasks of event statistics, portrait analysis, funnel analysis and behavior path analysis are identified in advance by self-defined task query ID, and the tasks of event statistics, portrait analysis, funnel analysis and behavior path analysis are distributed to nodes with fewer computing tasks in the ClickHouse cluster, so that the load pressure of the ClickHouse cluster is reduced, and the minimum load is realized; because the table structure change occupies less computing resources, a random distribution strategy is adopted for the table structure change task; since the cleaning of the stale data requires the tabulation analysis of each node of the ClickHouse cluster, a polling strategy is adopted.
In an embodiment, the task scheduling for each node of the ClickHouse cluster further includes:
acquiring the reading line number and the execution time of a data analysis task;
and stopping the data analysis task when the reading line number of the data analysis task exceeds the maximum reading line number and/or the execution time of the data analysis task exceeds the maximum execution time.
Optionally, the reading line number of the data analysis task is the sum of the reading line numbers of the node tasks; the execution time of the data analysis task is the sum of the time for sending the data analysis task to the main node, the main node sending the data analysis task to the child nodes, and the time for summarizing the behavior data of the child nodes to the main node in the whole process; the maximum reading line number and the maximum execution time of the data analysis task are preset values.
Step S103: and analyzing the behavior data of each node according to the task scheduling arrangement to generate a data analysis product.
The data analysis product is a data analysis result obtained by analyzing the behavior data of each node according to the task scheduling arrangement.
Optionally, a task keyword is generated according to the data analysis task information, and the task keyword is associated with the generated data analysis product, so that other data analysis tasks can hit the existing data analysis product through the task keyword, and a large amount of repeated calculation is reduced;
optionally, setting a validity period for the data analysis task, and automatically failing when the task expires;
optionally, the data analysis task and the clickwouse cluster interact in an asynchronous manner, so that a situation of thread blocking in the data analysis process is avoided.
In the data analysis method provided by the first embodiment of the application, a ClickHouse cluster is used as basic computing service, and all data are stored in a local disk; in a data storage layer, data are fragmented according to user IDs, and all behavior data of the same user are stored in the same machine under the user dimension, so that the localized calculation under a distributed cluster mode is realized; introducing independent scheduling service for the ClickHouse cluster, wherein all data analysis tasks pass through the scheduling service, and the concurrency of the data analysis tasks is effectively controlled; in the aspect of a single task, the maximum reading line number limit and the maximum execution time limit are introduced, and the maximum resources which can be occupied by the single task are effectively controlled; various load balancing strategies are introduced to ensure the pressure balance of each node of the cluster; the method solves the problems of the calculation performance bottleneck of user behavior data analysis under the super-large data scale and instability of the ClickHouse cluster when the calculation peak value is reached, can meet the analysis requirement of the user behavior data under the super-large data scale, effectively improves the stability and efficiency of data analysis, and improves the accuracy of data analysis products.
Fig. 2 is a schematic structural diagram of a data analysis system provided in this application. As shown in fig. 2, the analysis system of the present application includes a data writing module 11, a data storage module 12, a task scheduling module 13, and a data analysis module 14;
the data writing module 11 is configured to obtain behavior data of a target object, and write the behavior data into the clickwouse cluster according to a first preset rule;
the data storage module 12 is configured to store the behavior data written in the ClickHouse cluster according to a second preset rule;
the task scheduling module 13 is configured to obtain data analysis task information, and perform task scheduling on each node of the ClickHouse cluster according to the data analysis task information, so as to balance the load of each node;
the data analysis module 14 is configured to analyze the behavior data of each node according to the task scheduling arrangement, and generate a data analysis product.
Optionally, the task scheduling module 13 includes a distributed queue storage module 130, a task scheduler 131, and a clickwouse client module 132;
the distributed queue storage module 130 stores the distributed data analysis task queue by using a relational database management system (MySQL);
the task scheduler 131 is configured to sequentially execute different types of data analysis tasks according to task execution priorities, ensure that core tasks are preferentially executed, and effectively control task concurrency of the ClickHouse cluster;
the ClickHouse client module 132 is configured to adopt a corresponding load balancing policy according to a task type of the data analysis task, where the load balancing policy includes random, polling, and minimum load;
the ClickHouse client module 132 is further configured to obtain the reading line number and the execution time of the data analysis task, and stop the data analysis task when the reading line number of the data analysis task exceeds the maximum reading line number and/or the execution time of the data analysis task exceeds the maximum execution time.
The specific implementation process of this embodiment refers to the first embodiment, and is not described herein again.
The data analysis system provided by the embodiment of the application can meet the analysis requirement of user behavior data under a super-large data scale through interaction among the data writing module, the data storage module, the task scheduling module and the data analysis module, effectively improves the stability and efficiency of data analysis, and improves the accuracy of data analysis products. In addition, the availability ratio of the behavior data analysis system reaches 99.99%, and the data analysis efficiency can be improved from a minute level to a second level.
Fig. 3 is a schematic structural diagram of a terminal according to a third embodiment of the present application. The terminal of the application includes: a processor 210, a memory 211, and a computer program 212 stored in the memory 211 and executable on the processor 210. The processor 210, when executing the computer program 212, implements the steps in the above-described embodiments of the data analysis method, such as the steps S101 to S103 shown in fig. 1.
The terminal may include, but is not limited to, a processor 210, a memory 211. Those skilled in the art will appreciate that fig. 3 is only an example of a terminal and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components, e.g., the terminal may also include input-output devices, network access devices, buses, etc.
The Processor 210 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 211 may be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory 211 may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal. Further, the memory 211 may also include both an internal storage unit and an external storage device of the terminal. The memory 211 is used for storing the computer program and other programs and data required by the terminal. The memory 211 may also be used to temporarily store data that has been output or is to be output.
The present application also provides a storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the data analysis method as described above.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, including not only those elements listed, but also other elements not expressly listed.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.