CN115048466A - Data analysis method, system, terminal and storage medium - Google Patents

Data analysis method, system, terminal and storage medium

Info

Publication number
CN115048466A
Authority
CN
China
Prior art keywords
data
task
analysis
data analysis
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210504417.6A
Other languages
Chinese (zh)
Other versions
CN115048466B (en
Inventor
张园
田舟贤
邵克华
李利椿
强琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Hangzhou Youxing Technology Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Hangzhou Youxing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd and Hangzhou Youxing Technology Co Ltd
Priority to CN202210504417.6A
Publication of CN115048466A
Application granted
Publication of CN115048466B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application relates to a data analysis method, system, terminal and storage medium. The data analysis method includes: acquiring behavior data of a target object and storing the behavior data in a ClickHouse cluster according to preset rules; acquiring data analysis task information and, according to that information, scheduling tasks across the nodes of the ClickHouse cluster so that the load of the nodes is balanced; and analyzing the behavior data on each node according to the task schedule to generate a data analysis product. By storing behavior data in a ClickHouse cluster and scheduling data analysis tasks with multiple load balancing strategies, the data analysis method, system, terminal and storage medium provided by this application can meet the analysis requirements of user behavior data at very large data scale, improve the stability and efficiency of data analysis, and improve the accuracy of data analysis products.

Figure 202210504417

Description

Data analysis method, system, terminal and storage medium
Technical Field
The present application belongs to the technical field of data analysis, and in particular, to a data analysis method, system, terminal, and storage medium.
Background
In the field of massive user behavior data analysis, native big data computing engines such as Hive, Spark, Presto, Impala and Elasticsearch fall short. Well-known enterprises in the industry often carry out secondary development on big data computing engines to analyze massive user behavior data efficiently, but the cost of secondary development is extremely high, the computing efficiency is often unsatisfactory, and the subsequent maintenance cost is also very high. Most enterprises instead buy commercial products to cover this gap, but the high price of commercial products and data privacy concerns also become potential risks for enterprise development.
The prior art, for example patent CN202011006169.X, provides a method for implementing OLAP analysis based on ClickHouse and describes table creation specifications, data writing, SQL queries and the like in detail, but it does not address how to build a stable and efficient data product on ClickHouse in the face of massive user behavior data. In addition, problems such as the computing performance bottleneck of user behavior data analysis at very large data scale and the instability of the ClickHouse cluster at peak computation remain.
Disclosure of Invention
In order to solve the technical problems, the application provides a data analysis method, a system, a terminal and a storage medium, so as to meet the analysis requirements of user behavior data under a super-large data scale, improve the stability and efficiency of data analysis and improve the accuracy of data analysis products.
The application provides a data analysis method, which comprises the following steps: acquiring behavior data of a target object, and storing the behavior data to a ClickHouse cluster according to a preset rule; acquiring data analysis task information, and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node; and analyzing the behavior data of each node according to the task scheduling arrangement to generate a data analysis product.
In one embodiment, storing the behavior data to a ClickHouse cluster according to a preset rule includes: performing hash sharding on the behavior data according to the identity identification number of the target object, and writing the behavior data of each target object into a corresponding node of the ClickHouse cluster; and storing the behavior data of each target object according to a preset storage mode.
In an embodiment, the step of storing the behavior data of each target object according to a preset storage mode includes: pre-ordering the behavior data of each target object according to a three-level index sequence; wherein, the first-level index is the event number of the behavior data; the secondary index is the identity identification number of the target object to which the behavior data belongs; the third-level index is the log time of the behavior data.
In one embodiment, the data analysis task information includes a task type of the data analysis task; wherein the task type comprises at least one of event statistics, portrait analysis, funnel analysis, behavior path analysis, table structure change and cleaning outdated data.
In one embodiment, performing task scheduling on each node of the ClickHouse cluster includes at least one of: executing different types of data analysis tasks in order of task execution priority; and adopting a load balancing strategy corresponding to the task type of the data analysis task, wherein the load balancing strategies include random, polling and minimum load.
In one embodiment, executing a corresponding load balancing policy according to a task type of a data analysis task includes: if the task type is event statistics and/or portrait analysis and/or funnel analysis and/or behavior path analysis, a minimum load strategy is adopted; if the task type is the change of the table structure, a random strategy is adopted; and if the task type is to clear out-of-date data, adopting a polling strategy.
In an embodiment, the task scheduling for each node of the ClickHouse cluster further includes: acquiring the reading line number and the execution time of a data analysis task; and stopping the data analysis task when the reading line number of the data analysis task exceeds the maximum reading line number and/or the execution time of the data analysis task exceeds the maximum execution time.
The application also provides a data analysis system, which comprises a data writing module, a data storage module, a task scheduling module and a data analysis module; the data writing module is used for acquiring behavior data of a target object and writing the behavior data into the ClickHouse cluster according to a first preset rule; the data storage module is used for storing the behavior data written into the ClickHouse cluster according to a second preset rule; the task scheduling module is used for acquiring data analysis task information and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node; and the data analysis module is used for analyzing the behavior data of each node according to task scheduling arrangement to generate a data analysis product.
The present application further provides a terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the analysis method when executing the computer program.
The present application also provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described analysis method.
According to the data analysis method, the system, the terminal and the storage medium, behavior data are stored by using the ClickHouse cluster, and a plurality of load balancing strategies are adopted to schedule data analysis tasks, so that the analysis requirements of user behavior data under a super-large data scale can be met, the stability and the efficiency of data analysis are improved, and the accuracy of data analysis products is improved.
Drawings
Fig. 1 is a schematic flow chart of a data analysis method according to a first embodiment of the present application;
Fig. 2 is a schematic structural diagram of a data analysis system according to a second embodiment of the present application;
Fig. 3 is a schematic structural diagram of a terminal according to a third embodiment of the present application.
Detailed Description
The technical solution of the present application is further described in detail with reference to the drawings and specific embodiments of the specification. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, "and/or" includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present application. As shown in fig. 1, the data analysis method of the present application may include the steps of:
step S101: acquiring behavior data of a target object, and storing the behavior data to a ClickHouse cluster according to a preset rule;
in one embodiment, storing behavior data to a ClickHouse cluster according to a preset rule includes:
performing hash fragmentation on the behavior data according to the identity identification number of the target object, and writing the behavior data of each target object into a corresponding node of the ClickHouse cluster;
and storing the behavior data of each target object according to a preset storage mode.
Optionally, a service built on the open-source stream processing framework Flink inserts data into the ClickHouse cluster over Java Database Connectivity (JDBC). Further, to keep queries efficient and reduce write pressure as much as possible, the data are hash-sharded by the identity identification number (user ID) of the target object and then inserted directly into the local tables of the ClickHouse cluster, which avoids the write amplification of the distributed table. As a result, all behavior data of the same user land on the same machine, which avoids heavy network input/output (IO) during computation and enables localized computation.
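The routing idea in this paragraph can be sketched in a few lines of Python. This is an illustration only, not the patented Flink/JDBC writer: the function names, the event dictionaries and the use of `zlib.crc32` as the hash are assumptions made for the example.

```python
import zlib


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index; the same ID always maps to the same shard."""
    # crc32 is a stand-in hash chosen for this sketch (assumption).
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


def route_events(events, num_shards):
    """Group events by target shard so each batch can go to that shard's local table."""
    batches = {index: [] for index in range(num_shards)}
    for event in events:
        batches[shard_for_user(event["user_id"], num_shards)].append(event)
    return batches


# Hypothetical events; both events of user "u1" end up in the same batch.
events = [
    {"user_id": "u1", "event_id": 1},
    {"user_id": "u2", "event_id": 1},
    {"user_id": "u1", "event_id": 2},
]
batches = route_events(events, num_shards=4)
```

Because the shard depends only on the user ID, a per-user query never needs rows from another node, which is the localized computation the paragraph describes.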
In one embodiment, the step of storing the behavior data of each target object according to a preset storage mode includes:
pre-sorting the behavior data of each target object according to a three-level index;
wherein, the first-level index is an event number of the behavior data; the secondary index is an identity identification number of a target object to which the behavior data belongs; the tertiary index is the log time of the behavioral data.
Optionally, in terms of primary key selection, the event number (event_id) of the behavior data is selected as the first-level index to ensure retrieval efficiency; to ensure efficient data queries, the hashed identity identification number (xxHash32(distinct_id)) of the target object to which the behavior data belongs is selected as the second-level index, and the log time (log_time) of the behavior data is selected as the third-level index, so that all data are pre-sorted in the storage layer in the order of the three-level index.
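To make the three-level ordering concrete, the following Python sketch sorts a few hypothetical rows by a composite key of event number, hashed user identifier and log time. The field names follow the identifiers in the text as best they can be read; `zlib.crc32` stands in for xxHash32, which is not in the Python standard library, and the row values are invented.

```python
import zlib


def three_level_key(row):
    """Composite sort key: event number, hashed user identifier, log time."""
    # crc32 is only a stand-in for xxHash32 in this sketch (assumption).
    hashed_id = zlib.crc32(row["distinct_id"].encode("utf-8"))
    return (row["event_id"], hashed_id, row["log_time"])


# Hypothetical rows; log_time is a plain integer timestamp here.
rows = [
    {"event_id": 2, "distinct_id": "a", "log_time": 5},
    {"event_id": 1, "distinct_id": "b", "log_time": 9},
    {"event_id": 1, "distinct_id": "b", "log_time": 3},
]
ordered = sorted(rows, key=three_level_key)
```

After sorting, rows of the same event are contiguous, and within an event the rows of one user appear in time order, which is what lets a per-event, per-user scan read a narrow range.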
In an embodiment, the step of storing the behavior data of each target object according to a preset storage mode further includes:
in terms of engine selection, the merge tree (MergeTree) engine is selected as the distributed computing engine;
in terms of partition selection, the generation date of the behavior data is selected as the partition field to achieve horizontal partitioning;
in terms of sampling field selection, to ensure millisecond-level responses in estimation scenarios, the identity identification number (distinct_id) of the target object is selected as the sampling field, and the value obtained through the hash function (xxHash32(distinct_id)) is used as the sampling reference;
in terms of time to live (TTL), considering local storage limits and data access frequency, data within a preset period (for example, the most recent 2 months) are defined as hot data and data older than the preset period (for example, more than 2 months old) are defined as cold data; the data analysis service is provided for hot data, while cold data can be analyzed on a Hadoop distributed computing platform;
in terms of index granularity, the officially recommended value is used, that is, one index entry is generated every 8192 rows.
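The storage choices listed above could map onto a ClickHouse table definition along the following lines. The Python helper only assembles the DDL text. The table name, column types and the two-month TTL expression are assumptions for illustration, and the statement has not been checked against a particular ClickHouse version. Note that a TTL clause of this form deletes expired rows, whereas the text routes cold data to Hadoop, so a real deployment would need an archival step that is not shown here.

```python
def build_local_table_ddl(table: str = "user_events_local", hot_months: int = 2) -> str:
    """Assemble a CREATE TABLE statement reflecting the listed storage choices."""
    # Table name and column types are hypothetical; only the clauses mirror the text.
    return (
        f"CREATE TABLE {table} (\n"
        "    event_id UInt32,\n"
        "    distinct_id String,\n"
        "    log_time DateTime\n"
        ") ENGINE = MergeTree()\n"
        "PARTITION BY toDate(log_time)\n"                         # partition by generation date
        "ORDER BY (event_id, xxHash32(distinct_id), log_time)\n"  # three-level index order
        "SAMPLE BY xxHash32(distinct_id)\n"                       # sampling on the hashed user ID
        f"TTL log_time + INTERVAL {hot_months} MONTH\n"           # keep only hot data locally
        "SETTINGS index_granularity = 8192"                       # one index mark per 8192 rows
    )


ddl = build_local_table_ddl()
```

The sampling expression appears inside the sorting key because ClickHouse requires the `SAMPLE BY` expression to be part of the primary key.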
Step S102: acquiring data analysis task information, and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node;
Optionally, the data analysis task information includes a task type and task configuration information. The task type includes at least one of event statistics, portrait analysis, funnel analysis, behavior path analysis, table structure change and expired-data cleaning. The task configuration information is the analysis dimension or analysis range of each type of task; for example, the configuration information of a portrait analysis task is region, education level, gender, age and the like, and the configuration information of a funnel analysis task is the hierarchical relation of user login, order placement and payment.
Illustratively, event statistics count the number of users performing operations on an application or web page, such as logging in, placing an order or paying. Portrait analysis analyzes profile information of registered users, such as the number of male registered users in the Hangzhou region. Funnel analysis analyzes the conversion rate of the number of users from login through order placement to payment: if 1,000,000 users log in, 800,000 place orders and 500,000 pay, the conversion rate from login to order placement is 80% and the conversion rate from order placement to payment is 62.5%. Behavior path analysis analyzes a user's operation track from the first moment to the n-th moment, such as the click track or browsing track through an application or web page content. Table structure changes include adding and deleting tables, adding and deleting columns in a table, and adding, deleting and modifying fields in a table. Cleaning expired data means removing data that have exceeded their time to live.
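The funnel arithmetic in this example can be checked with a short Python sketch. The step names are hypothetical labels, and the counts are in the same 100 : 80 : 50 proportion as the login, order and payment figures in the text.

```python
def funnel_conversion(step_counts):
    """Return the step-to-step conversion rate for each adjacent pair of funnel steps."""
    rates = []
    for (prev_name, prev_count), (name, count) in zip(step_counts, step_counts[1:]):
        # Guard against an empty previous step to avoid division by zero.
        rate = count / prev_count if prev_count else 0.0
        rates.append((f"{prev_name}->{name}", rate))
    return rates


steps = [("login", 100), ("order", 80), ("payment", 50)]
rates = funnel_conversion(steps)
```

Each rate is computed relative to the immediately preceding step, so the payment rate is 50 / 80 = 62.5% rather than 50 / 100.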
In one embodiment, task scheduling is performed on each node of the ClickHouse cluster, and the task scheduling includes at least one of the following steps:
according to the task execution priority, sequentially executing different types of data analysis tasks;
and according to the task type of the data analysis task, adopting a corresponding load balancing strategy, wherein the load balancing strategy comprises random, polling and minimum load.
Optionally, executing a corresponding load balancing policy according to the task type of the data analysis task, including:
if the task type is event statistics and/or portrait analysis and/or funnel analysis and/or behavior path analysis, a minimum load strategy is adopted;
if the task type is the change of the table structure, a random strategy is adopted;
and if the task type is to clear out-of-date data, adopting a polling strategy.
Optionally, because event statistics, portrait analysis, funnel analysis and behavior path analysis consume more computing resources, a minimum load strategy is adopted for them. Optionally, these tasks are identified in advance by a custom task query ID and are dispatched to the nodes of the ClickHouse cluster that are running fewer computing tasks, which reduces load pressure on the cluster and achieves minimum load. Because a table structure change consumes few computing resources, a random strategy is adopted for table structure change tasks. Because cleaning expired data requires working through the tables on every node of the ClickHouse cluster, a polling (round-robin) strategy is adopted.
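A minimal Python sketch of choosing a node by task type follows. The class, the task-type strings and the in-memory counter of running tasks are hypothetical; how the real system measures node load (for example through the custom task query IDs) is not specified here, and "polling" is read as round-robin.

```python
import itertools
import random

# Task types that the text assigns to the minimum load strategy (names are illustrative).
ANALYSIS_TASKS = {
    "event_statistics",
    "portrait_analysis",
    "funnel_analysis",
    "behavior_path_analysis",
}


class NodeSelector:
    """Pick a cluster node for a task using the strategy tied to its task type."""

    def __init__(self, nodes, seed=None):
        self.nodes = list(nodes)
        self.running = {node: 0 for node in self.nodes}  # tasks in flight per node
        self._round_robin = itertools.cycle(self.nodes)
        self._rng = random.Random(seed)

    def pick(self, task_type):
        if task_type in ANALYSIS_TASKS:
            # Minimum load: the node with the fewest running tasks.
            node = min(self.nodes, key=lambda n: self.running[n])
        elif task_type == "table_structure_change":
            node = self._rng.choice(self.nodes)   # random strategy
        elif task_type == "clean_expired_data":
            node = next(self._round_robin)        # polling (round-robin) strategy
        else:
            raise ValueError(f"unknown task type: {task_type}")
        self.running[node] += 1
        return node

    def finish(self, node):
        """Record that a task on the given node has completed."""
        self.running[node] -= 1


selector = NodeSelector(["node-1", "node-2", "node-3"], seed=0)
first = selector.pick("funnel_analysis")
second = selector.pick("funnel_analysis")
```

Keeping the three strategies behind one `pick` method means the caller only supplies the task type, which matches the way the text ties each strategy to a class of tasks.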
In an embodiment, the task scheduling for each node of the ClickHouse cluster further includes:
acquiring the reading line number and the execution time of a data analysis task;
and stopping the data analysis task when the reading line number of the data analysis task exceeds the maximum reading line number and/or the execution time of the data analysis task exceeds the maximum execution time.
Optionally, the reading line number of a data analysis task (that is, the number of rows it reads) is the sum of the rows read by its node-level tasks. The execution time of a data analysis task is the total time of the whole process: sending the task to the master node, the master node dispatching it to the child nodes, and aggregating the child nodes' behavior data back to the master node. The maximum reading line number and the maximum execution time of the data analysis task are preset values.
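The stop condition can be expressed as a small check. In this Python sketch the `TaskLimits` type, the function name and the numeric limits are hypothetical; only the rule itself (sum of per-node rows against a maximum, end-to-end time against a maximum, stop on either) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class TaskLimits:
    """Preset limits for a single data analysis task (values are illustrative)."""
    max_read_rows: int
    max_execution_seconds: float


def should_stop(node_read_rows, elapsed_seconds, limits):
    """Return True when the task exceeds either the row limit or the time limit."""
    total_rows = sum(node_read_rows)  # rows read across all node-level tasks
    return (
        total_rows > limits.max_read_rows
        or elapsed_seconds > limits.max_execution_seconds
    )


limits = TaskLimits(max_read_rows=1_000_000, max_execution_seconds=30.0)
within_limits = should_stop([200_000, 300_000], elapsed_seconds=12.0, limits=limits)
over_rows = should_stop([600_000, 500_000], elapsed_seconds=12.0, limits=limits)
```

ClickHouse itself exposes per-query settings such as `max_rows_to_read` and `max_execution_time`; whether the described system relies on them or enforces the limits in its own client is not stated in the text.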
Step S103: and analyzing the behavior data of each node according to the task scheduling arrangement to generate a data analysis product.
The data analysis product is a data analysis result obtained by analyzing the behavior data of each node according to the task scheduling arrangement.
Optionally, a task keyword is generated according to the data analysis task information, and the task keyword is associated with the generated data analysis product, so that other data analysis tasks can hit the existing data analysis product through the task keyword, and a large amount of repeated calculation is reduced;
optionally, a validity period is set for the data analysis task, and the task automatically becomes invalid when it expires;
optionally, the data analysis task interacts with the ClickHouse cluster asynchronously, which avoids thread blocking during data analysis.
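One way to picture the task keyword and the validity period is a keyed result cache with expiry. The Python sketch below is an assumption-laden illustration: deriving the keyword as a SHA-256 digest of the canonicalized task information, and treating the validity period as an expiry on the stored product, are choices made for this example rather than details given in the text.

```python
import hashlib
import json
import time


def task_key(task_info: dict) -> str:
    """Derive a stable keyword from task information, independent of key order."""
    canonical = json.dumps(task_info, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResultCache:
    """Store analysis products by task keyword and drop them after a validity period."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable clock, which makes expiry easy to test
        self._store = {}

    def put(self, task_info, result):
        self._store[task_key(task_info)] = (self.clock() + self.ttl, result)

    def get(self, task_info):
        key = task_key(task_info)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries are removed on access
            return None
        return result


cache = ResultCache(ttl_seconds=600)
cache.put({"type": "funnel_analysis", "steps": ["login", "order"]}, {"rate": 0.8})
hit = cache.get({"steps": ["login", "order"], "type": "funnel_analysis"})
```

Two requests with the same type and configuration produce the same keyword, so the second one can reuse the stored product instead of querying the cluster again.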
In the data analysis method provided by the first embodiment of the application, a ClickHouse cluster serves as the basic computing service, and all data are stored on local disks. In the data storage layer, data are sharded by user ID so that all behavior data of the same user are stored on the same machine, which enables localized computation in a distributed cluster. An independent scheduling service is introduced for the ClickHouse cluster; all data analysis tasks pass through it, so task concurrency is effectively controlled. For a single task, a maximum reading line number and a maximum execution time are introduced, which bound the resources a single task can occupy. Multiple load balancing strategies are introduced to keep the load on the cluster nodes balanced. The method thereby addresses the computing performance bottleneck of user behavior data analysis at very large data scale and the instability of the ClickHouse cluster at peak computation, meets the analysis requirements of user behavior data at very large data scale, effectively improves the stability and efficiency of data analysis, and improves the accuracy of data analysis products.
Fig. 2 is a schematic structural diagram of a data analysis system provided in the second embodiment of the present application. As shown in fig. 2, the analysis system of the present application includes a data writing module 11, a data storage module 12, a task scheduling module 13, and a data analysis module 14;
the data writing module 11 is configured to obtain behavior data of a target object, and write the behavior data into the ClickHouse cluster according to a first preset rule;
the data storage module 12 is configured to store the behavior data written in the ClickHouse cluster according to a second preset rule;
the task scheduling module 13 is configured to obtain data analysis task information, and perform task scheduling on each node of the ClickHouse cluster according to the data analysis task information, so as to balance the load of each node;
the data analysis module 14 is configured to analyze the behavior data of each node according to the task scheduling arrangement, and generate a data analysis product.
Optionally, the task scheduling module 13 includes a distributed queue storage module 130, a task scheduler 131, and a ClickHouse client module 132;
the distributed queue storage module 130 stores the distributed data analysis task queue by using a relational database management system (MySQL);
the task scheduler 131 is configured to sequentially execute different types of data analysis tasks according to task execution priorities, ensure that core tasks are preferentially executed, and effectively control task concurrency of the ClickHouse cluster;
the ClickHouse client module 132 is configured to adopt a corresponding load balancing policy according to a task type of the data analysis task, where the load balancing policy includes random, polling, and minimum load;
the ClickHouse client module 132 is further configured to obtain the reading line number and the execution time of the data analysis task, and stop the data analysis task when the reading line number of the data analysis task exceeds the maximum reading line number and/or the execution time of the data analysis task exceeds the maximum execution time.
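The scheduler's priority ordering and concurrency control can be sketched with an in-memory priority queue. This Python example is a simplification: the text stores the queue in MySQL, whereas the sketch uses `heapq`, and the class name, the priority numbers (lower runs first) and the concurrency cap are hypothetical.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Hand out tasks in priority order while capping how many run at once."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.running = 0
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, priority, task):
        """Queue a task; a lower priority number means it runs earlier."""
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def next_task(self):
        """Return the next task, or None if the cap is reached or the queue is empty."""
        if self.running >= self.max_concurrent or not self._heap:
            return None
        _, _, task = heapq.heappop(self._heap)
        self.running += 1
        return task

    def done(self):
        """Record that one running task has finished, freeing a slot."""
        self.running -= 1


queue = PriorityTaskQueue(max_concurrent=2)
queue.submit(5, "clean_expired_data")
queue.submit(1, "funnel_analysis")
queue.submit(1, "portrait_analysis")
started = [queue.next_task(), queue.next_task(), queue.next_task()]
```

With a cap of two, the two priority-1 tasks start first and the cleanup task waits until `done()` frees a slot, which is the behavior the scheduler description asks for: core tasks first, bounded concurrency on the cluster.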
The specific implementation process of this embodiment refers to the first embodiment, and is not described herein again.
The data analysis system provided by the embodiment of the application can meet the analysis requirements of user behavior data at very large data scale through interaction among the data writing module, the data storage module, the task scheduling module and the data analysis module, effectively improves the stability and efficiency of data analysis, and improves the accuracy of data analysis products. In addition, the availability of the behavior data analysis system reaches 99.99%, and data analysis time can be reduced from the minute level to the second level.
Fig. 3 is a schematic structural diagram of a terminal according to a third embodiment of the present application. The terminal of the application includes: a processor 210, a memory 211, and a computer program 212 stored in the memory 211 and executable on the processor 210. The processor 210, when executing the computer program 212, implements the steps in the above-described embodiments of the data analysis method, such as the steps S101 to S103 shown in fig. 1.
The terminal may include, but is not limited to, a processor 210, a memory 211. Those skilled in the art will appreciate that fig. 3 is only an example of a terminal and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components, e.g., the terminal may also include input-output devices, network access devices, buses, etc.
The processor 210 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 211 may be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory 211 may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal. Further, the memory 211 may also include both an internal storage unit and an external storage device of the terminal. The memory 211 is used for storing the computer program and other programs and data required by the terminal. The memory 211 may also be used to temporarily store data that has been output or is to be output.
The present application also provides a storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the data analysis method as described above.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, including not only those elements listed, but also other elements not expressly listed.
The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (10)

1. A method of data analysis, comprising:
acquiring behavior data of a target object, and storing the behavior data to a ClickHouse cluster according to a preset rule;
acquiring data analysis task information, and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node;
and analyzing the behavior data of each node according to the task scheduling arrangement to generate a data analysis product.
2. The analysis method of claim 1, wherein storing the behavior data to a ClickHouse cluster according to a preset rule comprises:
performing hash sharding on the behavior data according to the identification number (ID) of the target object, and writing the behavior data of each target object into a corresponding node of the ClickHouse cluster;
and storing the behavior data of each target object according to a preset storage mode.
3. The analysis method according to claim 2, wherein storing the behavior data of each target object according to a preset storage mode comprises:
pre-sorting the behavior data of each target object according to a three-level index order;
wherein the first-level index is the event number of the behavior data; the second-level index is the identification number (ID) of the target object to which the behavior data belongs; and the third-level index is the log time of the behavior data.
4. The analysis method of claim 1, wherein the data analysis task information includes a task type of a data analysis task; wherein the task type comprises at least one of event statistics, portrait analysis, funnel analysis, behavior path analysis, table structure change and cleaning outdated data.
5. The analysis method of claim 1, wherein performing task scheduling on each node of the ClickHouse cluster comprises at least one of:
executing different types of data analysis tasks in sequence according to task execution priority;
and adopting a corresponding load balancing strategy according to the task type of the data analysis task, wherein the load balancing strategy comprises a random strategy, a polling strategy and a minimum load strategy.
6. The analysis method of claim 5, wherein adopting the corresponding load balancing strategy according to the task type of the data analysis task comprises:
if the task type is event statistics and/or portrait analysis and/or funnel analysis and/or behavior path analysis, adopting a minimum load strategy;
if the task type is table structure change, adopting a random strategy;
and if the task type is cleaning outdated data, adopting a polling strategy.
7. The analysis method of claim 1, wherein performing task scheduling on each node of the ClickHouse cluster further comprises:
acquiring the number of rows read and the execution time of a data analysis task;
and stopping the data analysis task when the number of rows read by the data analysis task exceeds a maximum number of rows to read and/or the execution time of the data analysis task exceeds a maximum execution time.
8. A data analysis system, characterized by comprising a data writing module, a data storage module, a task scheduling module and a data analysis module;
the data writing module is used for acquiring behavior data of a target object and writing the behavior data into the ClickHouse cluster according to a first preset rule;
the data storage module is used for storing the behavior data written into the ClickHouse cluster according to a second preset rule;
the task scheduling module is used for acquiring data analysis task information and performing task scheduling on each node of the ClickHouse cluster according to the data analysis task information so as to balance the load of each node;
and the data analysis module is used for analyzing the behavior data of each node according to task scheduling arrangement to generate a data analysis product.
9. A terminal, characterized in that the terminal comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the analysis method according to any one of claims 1 to 7 when executing the computer program.
10. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the analysis method according to any one of claims 1 to 7.
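The storage and scheduling rules recited in claims 2, 3, 5, 6 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function and class names, the task-type strings, the choice of MD5 as the hash, and the use of a per-node running-task count as the "load" are all assumptions made for illustration. Polling is modeled as round-robin.

```python
import hashlib
import random
from itertools import cycle

# Hypothetical task-type labels; the claims name the types but not their encoding.
ANALYSIS_TASKS = {"event_statistics", "portrait_analysis",
                  "funnel_analysis", "behavior_path_analysis"}


def shard_for(user_id: str, num_nodes: int) -> int:
    """Claim 2: hash-shard by target-object ID so one object's rows share a node.
    MD5 is an assumed hash; the claims do not name one."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def sort_key(row: dict) -> tuple:
    """Claim 3: three-level order (event number, target-object ID, log time)."""
    return (row["event_id"], row["user_id"], row["log_time"])


class Scheduler:
    """Claims 5-7: choose a node by task type and enforce per-task limits."""

    def __init__(self, nodes, max_rows, max_seconds, seed=None):
        nodes = list(nodes)
        self.load = {n: 0 for n in nodes}  # assumed load metric: assigned tasks
        self._round_robin = cycle(nodes)
        self._rng = random.Random(seed)
        self.max_rows = max_rows
        self.max_seconds = max_seconds

    def pick_node(self, task_type: str) -> str:
        if task_type in ANALYSIS_TASKS:
            node = min(self.load, key=self.load.get)   # minimum load strategy
        elif task_type == "table_structure_change":
            node = self._rng.choice(list(self.load))   # random strategy
        elif task_type == "clean_outdated_data":
            node = next(self._round_robin)             # polling strategy
        else:
            raise ValueError(f"unknown task type: {task_type}")
        self.load[node] += 1
        return node

    def should_stop(self, rows_read: int, elapsed_seconds: float) -> bool:
        """Claim 7: stop when either limit is exceeded."""
        return rows_read > self.max_rows or elapsed_seconds > self.max_seconds
```

In this sketch, `shard_for("user-42", 4)` always maps the same ID to the same node index, `sorted(rows, key=sort_key)` yields the three-level order, and a caller would poll `should_stop` with a task's current counters to decide whether to cancel it.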
CN202210504417.6A 2022-05-10 2022-05-10 Data analysis method, system, terminal and storage medium Active CN115048466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210504417.6A CN115048466B (en) 2022-05-10 2022-05-10 Data analysis method, system, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210504417.6A CN115048466B (en) 2022-05-10 2022-05-10 Data analysis method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN115048466A true CN115048466A (en) 2022-09-13
CN115048466B CN115048466B (en) 2025-05-06

Family

ID=83158231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210504417.6A Active CN115048466B (en) 2022-05-10 2022-05-10 Data analysis method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN115048466B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115757603A (en) * 2022-11-23 2023-03-07 重庆长安汽车股份有限公司 A visual data modeling system and method
CN115982211A (en) * 2022-12-02 2023-04-18 北京凌云雀科技有限公司 MySQL data query analysis method and device based on cloud primitive
CN116414917A (en) * 2023-04-14 2023-07-11 杭州微风企科技有限公司 Data transmission method, device, equipment and storage medium based on Myhouse database
CN118535592A (en) * 2024-05-24 2024-08-23 丰贺信息科技(上海)有限公司 Memory-based multidimensional model rolling calculation method, system, medium and equipment

Citations (11)

Publication number Priority date Publication date Assignee Title
CN110213358A (en) * 2019-05-23 2019-09-06 深圳壹账通智能科技有限公司 Method, node, equipment and the storage medium of cluster resource scheduling
CN110908986A (en) * 2019-11-08 2020-03-24 欧冶云商股份有限公司 Layering method and device for computing tasks, distributed scheduling method and device and electronic equipment
US20200137151A1 (en) * 2017-06-30 2020-04-30 Huawei Technologies Co., Ltd. Load balancing engine, client, distributed computing system, and load balancing method
CN111177189A (en) * 2019-12-20 2020-05-19 航天云网科技发展有限责任公司 Client optimization system and method based on user behavior analysis
CN111367872A (en) * 2018-12-25 2020-07-03 北京嘀嘀无限科技发展有限公司 User behavior analysis method and device, electronic equipment and storage medium
CN111488261A (en) * 2020-03-11 2020-08-04 北京健康之家科技有限公司 User behavior analysis system, method, storage medium and computing device
CN112416991A (en) * 2020-11-30 2021-02-26 腾讯科技(深圳)有限公司 Data processing method and device and storage medium
CN113111083A (en) * 2021-03-31 2021-07-13 北京沃东天骏信息技术有限公司 Method, device, equipment, storage medium and program product for data query
CN113918436A (en) * 2021-11-12 2022-01-11 中国工商银行股份有限公司 Log processing method and device
US20220060369A1 (en) * 2020-08-24 2022-02-24 Juniper Networks, Inc. Intent-based distributed alarm service
CN114238360A (en) * 2021-12-24 2022-03-25 上海观安信息技术股份有限公司 User behavior analysis system

Patent Citations (11)

Publication number Priority date Publication date Assignee Title
US20200137151A1 (en) * 2017-06-30 2020-04-30 Huawei Technologies Co., Ltd. Load balancing engine, client, distributed computing system, and load balancing method
CN111367872A (en) * 2018-12-25 2020-07-03 北京嘀嘀无限科技发展有限公司 User behavior analysis method and device, electronic equipment and storage medium
CN110213358A (en) * 2019-05-23 2019-09-06 深圳壹账通智能科技有限公司 Method, node, equipment and the storage medium of cluster resource scheduling
CN110908986A (en) * 2019-11-08 2020-03-24 欧冶云商股份有限公司 Layering method and device for computing tasks, distributed scheduling method and device and electronic equipment
CN111177189A (en) * 2019-12-20 2020-05-19 航天云网科技发展有限责任公司 Client optimization system and method based on user behavior analysis
CN111488261A (en) * 2020-03-11 2020-08-04 北京健康之家科技有限公司 User behavior analysis system, method, storage medium and computing device
US20220060369A1 (en) * 2020-08-24 2022-02-24 Juniper Networks, Inc. Intent-based distributed alarm service
CN112416991A (en) * 2020-11-30 2021-02-26 腾讯科技(深圳)有限公司 Data processing method and device and storage medium
CN113111083A (en) * 2021-03-31 2021-07-13 北京沃东天骏信息技术有限公司 Method, device, equipment, storage medium and program product for data query
CN113918436A (en) * 2021-11-12 2022-01-11 中国工商银行股份有限公司 Log processing method and device
CN114238360A (en) * 2021-12-24 2022-03-25 上海观安信息技术股份有限公司 User behavior analysis system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
达微: "ClickHouse深度解密", pages 1 - 8, Retrieved from the Internet <URL:https://www.jianshu.com/p/8e94ebc7d725> *

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN115757603A (en) * 2022-11-23 2023-03-07 重庆长安汽车股份有限公司 A visual data modeling system and method
CN115982211A (en) * 2022-12-02 2023-04-18 北京凌云雀科技有限公司 MySQL data query analysis method and device based on cloud primitive
CN115982211B (en) * 2022-12-02 2023-09-26 北京凌云雀科技有限公司 Cloud-protogenesis-based MySQL data query analysis method and device
CN116414917A (en) * 2023-04-14 2023-07-11 杭州微风企科技有限公司 Data transmission method, device, equipment and storage medium based on Myhouse database
CN116414917B (en) * 2023-04-14 2026-01-27 杭州微风企科技有限公司 Data transmission method, device, equipment and storage medium based on Myhouse database
CN118535592A (en) * 2024-05-24 2024-08-23 丰贺信息科技(上海)有限公司 Memory-based multidimensional model rolling calculation method, system, medium and equipment

Also Published As

Publication number Publication date
CN115048466B (en) 2025-05-06

Similar Documents

Publication Publication Date Title
US20220188332A1 (en) Distributed transaction database log with immediate reads and batched writes
CN115048466B (en) Data analysis method, system, terminal and storage medium
CA2822900C (en) Filtering queried data on data stores
US9875272B1 (en) Method and system for designing a database system for high event rate, while maintaining predictable query performance
US20190114350A1 (en) Systems, Methods, and Apparatuses for Implementing Concurrent Dataflow Execution with Write Conflict Protection Within a Cloud Based Computing Environment
US10445324B2 (en) Systems and methods for tracking sensitive data in a big data environment
Liu et al. Quantitative analysis of consistency in NoSQL key-value stores
CN117271513A (en) Data processing method, data query method and device
CN118350042A (en) Random noise-based data privacy protection method, device and medium
Gao et al. Chaindb: Ensuring integrity of querying off-chain data on blockchain
Ji et al. Query execution optimization in spark SQL
CN115718571B (en) Data management method and device based on multidimensional features
US20240256700A1 (en) Mechanisms to predict system resource consumption of transactions
CN118708608A (en) Processing engine selection method, device, computer equipment, and storage medium
CN118860587A (en) Task processing method, device, electronic device, storage medium and program product
CN120045574B (en) Data storage method, server and storage medium
US20260056972A1 (en) Hybrid transactional/analytical processing (htap) database with shared smart storage for multiple tenants
US20250110765A1 (en) Identifying data lake ownership using write access logs
Zhong et al. Big data workloads drawn from real-time analytics scenarios across three deployed solutions
Leonard Cost-effective Data Pipelines: Balancing Trade-offs when Developing Pipelines in the Cloud
Lwin et al. Definition and Scope of Big Data Problem
CN118916359A (en) Data storage method and related device based on attribute classification
CN119557380A (en) Method, device, computer equipment, readable storage medium and program product for executing data processing tasks
CN120610952A (en) Data distribution control method, device and computer-readable storage medium
CN116126217A (en) Storage resource allocation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant