CN105608228A

CN105608228A - High-efficiency distributed RDF data storage method

Info

Publication number: CN105608228A
Application number: CN201610064516.1A
Authority: CN
Inventors: 吴志坚; 黎建辉; 周园春; 侯艳飞; 韩岳岐
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2016-01-29
Filing date: 2016-01-29
Publication date: 2016-05-25
Anticipated expiration: 2036-01-29
Also published as: CN105608228B

Abstract

The invention discloses an efficient distributed RDF data storage method. The method is as follows: 1) for each triple to be uploaded, the user selects an existing named graph or sets a new one, and defines the effective predicates and their triples according to business requirements; 2) the data control system parses each triple in the RDF data uploaded by the user and extracts the predicate of the triple and the effective predicates of the triple's named graph; then, according to the effective predicates, it splits the triples into two sets that share the same unique identifier: the complete-predicate triples of a subject and the effective-predicate triples of the same subject, where the effective predicates are a part of the complete predicates; 3) the data control system stores the resulting complete-predicate triple data and effective-predicate triple data of the same subject in different database clusters. The invention improves the high availability of data.

Description

An Efficient Distributed RDF Data Storage Method

Technical Field

The invention relates to the technical field of RDF data storage, in particular to an efficient distributed RDF data storage method, and belongs to the field of computer software.

Background Art

With the rapid development of Internet technology, the Internet is applied ever more widely and has grown into a huge knowledge network base, which also brings many challenges. The concept of the Semantic Web was proposed in order to connect knowledge network bases of different forms and let computers understand the relationships between data. The goal of the Semantic Web is to make information resources on the network understandable by machines, so that network information resources can be processed automatically and their rapid growth can be accommodated.

The Semantic Web defines a Resource Description Framework (RDF) to describe information resources on the network. RDF is a data model of network resource objects and the relationships between them; it provides a general data model to support the description of network resources. RDF uses triples (subject, predicate and object) to describe the various resources on the network and the relationships between them. Viewed as a graph, the model consists of nodes and the edges between them: nodes represent subjects and objects, and edges represent predicates, so nodes can represent resources and edges can represent the attributes of resources.

At present, RDF data is generally stored in stand-alone RDF database management systems such as GraphDB, Stardog and AllegroGraph. This storage approach can manage a large amount of triple data, but with the rapid growth of Internet information resources, the limited storage capacity of a single machine can no longer meet the current demand for storing massive triple data. Scholars have proposed various schemes for storing massive triple data, but all of them are still at the research stage. One example is storing triple data in a Hadoop or HBase distributed cluster, since both natively offer storage and management of massive data, and using MapReduce to emulate data queries. With this storage approach, however, the triple data of one subject is scattered, that is, triples of the same subject may be stored on multiple machines. In addition, the relationships in RDF data are complex, and any triple may be related to any other, so a MapReduce-emulated query has to perform a large amount of join filtering. Current storage schemes therefore cannot query data at high speed and their query performance is low; especially with large data volumes, a simple query may take more than ten seconds to execute, which cannot meet actual business query requirements.

Summary of the Invention

To address the problems encountered in RDF data storage described above, the present invention proposes an efficient distributed RDF data storage method that solves the limited storage capacity and the scattering of triple data in existing RDF data storage approaches.

To solve the above problems, the present invention proposes an efficient distributed RDF data storage method, which mainly comprises the following steps:

1) A data parser parses the RDF data uploaded by the user and converts each triple into a triple object in a unified format. The parsed data is then processed: the predicates in the triples are parsed and extracted, and the effective predicates of the named graph are retrieved. Effective predicates are defined by the user's business requirements: the user determines, according to specific business requirements, the predicates whose triples are currently in use, that is, the triples that make up the effective-predicate data. According to the effective predicates of the named graph, the triple data of one subject is split into two parts: the complete-predicate triple data of the subject and the effective-predicate triple data of the same subject. The complete-predicate triple data is the full set of triples of the subject, while the effective-predicate triple data contains only the triples of some of its predicates, so the latter is a subset of the former. A unique ID is also generated for each subject's triples to uniquely identify them; the complete-predicate triple data and the effective-predicate triple data of the same subject share this unique ID.

2) The data is stored and managed in two parts, that is, the complete-predicate triple data and the effective-predicate triple data of the same subject are stored separately. An open-source distributed NoSQL database cluster stores the complete-predicate triple data of each subject; this guarantees the integrity of the data, so that the effective-predicate triple data can be extended or reduced when predicate requirements change in the future. An RDF database cluster stores the effective-predicate triple data of each subject. Because this data is a subset of the complete-predicate triple data, with storage capacity unchanged the system can store and manage more triple data, and the smaller number of triples improves query efficiency. The RDF database cluster consists of data nodes, routing nodes and configuration nodes.

3) The effective-predicate triple data in the RDF database cluster can be extended dynamically. The RDF database cluster stores only the effective-predicate triple data of each subject, and the effective predicates can change dynamically. When they change, the user first submits a predicate update task. The system's predicate update task monitoring module watches for such tasks; once one is submitted, it starts the predicate update task in the background and detects which predicates have changed, since the triples stored in the RDF database cluster must change accordingly. When the effective predicates change, the data management module imports the triple data of the changed predicates from the complete-predicate triple data held in the distributed NoSQL database cluster into the RDF database cluster, ensuring the completeness of the stored triple data.
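The split described in step 1) can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Triple` class, the function name `split_by_effective_predicates` and the dictionary layout of the two resulting parts are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Triple:
    """A parsed triple in a unified format (hypothetical representation)."""
    subject: str
    predicate: str
    obj: str


def split_by_effective_predicates(
    triples: Iterable[Triple],
    effective_predicates: Set[str],
    unique_id: int,
) -> Tuple[Dict, Dict]:
    """Split one subject's triples into a complete part and an effective part.

    Both parts carry the same unique_id so they can be linked later.
    """
    all_triples: List[Triple] = list(triples)
    if len({t.subject for t in all_triples}) > 1:
        raise ValueError("expected the triples of a single subject")

    # Complete part: every triple of the subject (destined for the NoSQL cluster).
    complete = {"id": unique_id, "triples": all_triples}
    # Effective part: only triples whose predicate is currently in use
    # (destined for the RDF database cluster); always a subset of the complete part.
    effective = {
        "id": unique_id,
        "triples": [t for t in all_triples if t.predicate in effective_predicates],
    }
    return complete, effective
```

For example, with the effective predicates `{"ex:name"}`, a subject that has `ex:name`, `ex:age` and `ex:hobby` triples would yield a complete part with three triples and an effective part with one, both tagged with the same ID.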

Further, regarding the triples and the named graph (graphname): the basic structure of RDF data is a set of triples, each consisting of a subject, a predicate and an object, where the predicate expresses the relationship between the subject and the object. A collection of such triples is called an RDF graph, and the name given to an RDF graph is its named graph (graphname). A named graph is the space in which data is kept, equivalent to the concept of a database in a relational database system. It is defined according to business requirements when the user uploads data: the user can select an existing named graph or add a new one.

Further, regarding the complete predicates and the effective predicates: the invention divides the predicates of one subject's triples into two groups, complete predicates and effective predicates. Complete predicates are all the predicates contained in a named graph. Effective predicates are defined by the user according to business requirements, that is, the predicates in a named graph that the user's current requirements actually use. Based on this predicate information, the triples of one subject are divided into two parts: the complete-predicate triples and the effective-predicate triples.

Further, the complete-predicate triple data and the effective-predicate triple data of the same subject are stored and managed separately. A subject generally has triples with many predicates, and in practice most of these predicates are redundant: they are not used by existing business requirements but may be needed when requirements change in the future. To preserve data integrity this part of the data cannot be discarded, so the data is partitioned as follows: the complete-predicate triple data and the effective-predicate triple data are stored separately and linked by the unique ID. An open-source distributed NoSQL database cluster stores the complete-predicate triple data, and an RDF database cluster stores the effective-predicate triples.

Further, the RDF database cluster consists of data nodes, routing nodes and configuration nodes. The data nodes mainly store data and are made up of multiple open-source stand-alone RDF databases. The routing (router) node controls the data nodes, including data update, data node selection, data sharding and data synchronization. The configuration (config) node manages the configuration information of the data nodes, including each data node's IP address and port, name, named graphs, predicate information, number of stored triples, maximum fill factor and master/slave flag.

Further, regarding data sharding and data node selection: when storing triple data, to address data scattering, the triple data of one subject is stored on the same data node, and the data of one named graph is stored on the same data node as long as the node's maximum storage capacity is not exceeded. This reduces the computation of distributed queries and the data transfer between nodes, and so speeds up queries. During sharding, the triple data of one subject is treated as an atomic unit, and a data node is selected to store it based on each data node's current data volume, storage capacity, maximum fill factor and the distribution of graphs.
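Treating one subject's triples as an atomic unit can be pictured as grouping the parsed triples by subject before any node is chosen, so that a group is never divided across data nodes. The following Python sketch is illustrative only; the tuple representation and the function name `group_by_subject` are assumptions, not part of the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation: a triple is a (subject, predicate, object) tuple.
TripleTuple = Tuple[str, str, str]


def group_by_subject(triples: Iterable[TripleTuple]) -> Dict[str, List[TripleTuple]]:
    """Group triples by subject; each group is one atomic sharding unit."""
    groups: Dict[str, List[TripleTuple]] = defaultdict(list)
    for subject, predicate, obj in triples:
        groups[subject].append((subject, predicate, obj))
    return dict(groups)
```

Each value of the returned dictionary would then be handed as a whole to the node selection logic, which keeps all triples of a subject on one data node.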

Compared with the prior art, the positive effects of the present invention are:

For the storage of large-scale RDF data, the present invention proposes a new distributed RDF data storage scheme that stores and manages the data in two parts, keeping the complete-predicate triple data separate from the effective-predicate triple data. It increases RDF storage capacity so that massive RDF data can be managed. It improves the high availability of data: the RDF database cluster shards and backs up data, so the system keeps running normally without interruption when a data node fails. The sharding strategy treats the triple data of one subject as an atomic unit and performs sharding and data node selection according to the named graph and the subject, which reduces the scattering of triple data across data nodes, lowers query complexity and the volume of data transferred between nodes, and improves query efficiency.

Brief Description of the Drawings

The accompanying drawing is a system architecture diagram of the efficient distributed RDF data storage method of the present invention.

Detailed Description

To present the method of the invention more clearly and intuitively, the invention is described in further detail below with reference to the accompanying drawing. The efficient distributed RDF data storage method of the present invention comprises the following steps:

1) Data access: provides a unified external data access interface through which data is accessed. It mainly includes interfaces for data upload, data update, data query, predicate extension and predicate information query.

2) Data control: provides control and processing of data, mainly including data management, predicate management and data storage management.

Data management provides management functions for RDF data, including control of RDF data upload, update and query. RDF upload control comprises the RDF data parser, the RDF data splitting module and unique ID generation. When data is uploaded, the RDF data parser first parses the RDF data. It supports RDF data in multiple formats, including XML, JSON and NT, and converts the data into RDF data objects in a unified format according to the format the user uploaded. Next, the RDF data splitting module splits the unified-format RDF data objects produced by parsing. The user specifies the named graph name of the RDF data, which determines the named graph the uploaded data is saved to; the list of effective predicates is obtained from this named graph, and the data is split according to that list into two parts: the complete-predicate triple object and the effective-predicate triple object of the same subject. Finally, a unique ID is generated to uniquely identify the subject's triples and to link the complete-predicate triples with the effective-predicate triples of the same subject. The ID is generated with an auto-increment strategy: a custom unique ID generator returns the next auto-increment ID of the named graph, and a triple containing this ID is generated and added to both the complete-predicate triple object and the effective-predicate triple object of the subject.
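A per-named-graph auto-increment ID generator of the kind described above could look like the following Python sketch. The class name `GraphIdGenerator`, the in-memory counters and the predicate IRI `urn:example:uid` used for the ID triple are all assumptions made for illustration; the patent does not specify how the generator is implemented or which predicate carries the ID.

```python
import itertools
import threading
from collections import defaultdict
from typing import Tuple

# Hypothetical predicate used to attach the unique ID to a subject.
ID_PREDICATE = "urn:example:uid"


class GraphIdGenerator:
    """Hands out auto-increment IDs, counted separately for each named graph."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # One independent counter per named graph, each starting at 1.
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_id(self, graph_name: str) -> int:
        # The lock keeps concurrent uploads from receiving the same ID.
        with self._lock:
            return next(self._counters[graph_name])


def id_triple(subject: str, uid: int) -> Tuple[str, str, str]:
    """Build the triple that carries the unique ID of a subject's triples."""
    return (subject, ID_PREDICATE, str(uid))
```

The triple returned by `id_triple` would be added to both the complete-predicate object and the effective-predicate object, so the two parts can later be matched by ID. A production system would presumably persist the counters rather than keep them in memory.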

Predicate management provides management functions for the predicates of RDF data, including predicate extension, predicate reduction and predicate information query. Predicate extension means extending the effective predicates. Because the RDF database cluster stores triples for only some predicates, when the user needs a predicate of a named graph that is not among the effective predicates, the effective predicates must be extended and the triples of these predicates added to the database. The predicate extension steps are: the user submits the predicates of a named graph to be extended; the predicate management module obtains the submitted named graph and its extension predicates, compares them with the effective predicates already in the named graph, and determines which predicates actually need to be extended. This check ensures that the existing effective predicates do not already include the submitted extension predicates and so serves as validation of the user's input. A predicate extension task is then submitted through the predicate extension scheduler and executed asynchronously in the background to import the data: the corresponding triple data is read from the NoSQL database, the triples of the extension predicates are extracted, and they are imported into the RDF database cluster.
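The predicate extension steps can be sketched in Python as a set difference followed by an import. This is a simplified, synchronous illustration: the two stores are modeled as plain dictionaries keyed by the shared unique ID, and the function names are hypothetical. The asynchronous task scheduling described above is deliberately left out.

```python
from typing import Dict, Iterable, List, Set, Tuple

TripleTuple = Tuple[str, str, str]  # (subject, predicate, object)


def predicates_to_extend(submitted: Iterable[str], effective: Iterable[str]) -> Set[str]:
    """Keep only the submitted predicates that are not already effective."""
    return set(submitted) - set(effective)


def extend_predicates(
    complete_store: Dict[int, List[TripleTuple]],
    rdf_store: Dict[int, List[TripleTuple]],
    effective: Set[str],
    submitted: Set[str],
) -> Set[str]:
    """Import triples of newly effective predicates into the RDF store.

    complete_store stands in for the NoSQL cluster, rdf_store for the RDF
    database cluster. Returns the updated set of effective predicates.
    """
    new_predicates = predicates_to_extend(submitted, effective)
    for uid, triples in complete_store.items():
        # Extract only the triples whose predicate has just become effective.
        extra = [t for t in triples if t[1] in new_predicates]
        if extra:
            rdf_store.setdefault(uid, []).extend(extra)
    return set(effective) | new_predicates
```

Because predicates that are already effective are filtered out first, resubmitting an existing predicate does not import its triples a second time.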

Data storage management carries out database operations on behalf of the data management module and the predicate management module. All database operations go through this module, which provides a unified data access interface and separates data processing from data storage. Its functions include querying, updating and uploading data in the databases, importing data for predicate extension, and querying, updating and uploading predicate information.

3) Data persistence: responsible for the physical storage of data, that is, saving data to disk. The data is persisted in two parts, using a NoSQL database cluster and an RDF database cluster. The NoSQL database cluster is an open-source distributed NoSQL database cluster whose capacity for managing massive data is used to store the complete-predicate triple data and thereby guarantee data integrity; when the effective predicates change, the triple data of the corresponding predicates is read from it and imported into the RDF database cluster.

The RDF database cluster consists of multiple data nodes, routing nodes and configuration nodes. The data nodes mainly store triple data and are made up of multiple stand-alone open-source RDF databases. The routing node controls the data nodes, including data update, data node selection, data sharding and data synchronization; it manages the RDF database cluster as its central node and controls each RDF database data node. The configuration node manages the configuration information of the data nodes, including each data node's IP address and port, name, named graphs, predicate information, number of stored triples, maximum fill factor and master/slave flag.

The fill factor is the ratio of the amount of data stored to the maximum data capacity; the maximum fill factor is the largest permitted fill factor value, and the current fill factor is the ratio of the amount of data currently stored to the maximum data capacity. When triple data is uploaded, the routing node uses the named graph of the triples and the configuration information of the configuration node to find the data nodes that hold the data of that named graph. If the named graph's data is not stored on any data node, the named graph is new, and the data node with the smallest current fill factor among all data nodes is selected to store the uploaded triple data. If the named graph is stored on some data nodes, the one with the smallest current fill factor among them is selected. If that smallest current fill factor is greater than or equal to the maximum fill factor, the named graph's data must be sharded, and the data node with the smallest fill factor among the other data nodes is selected to store the uploaded triple data; otherwise the selected node stores the uploaded triple data directly. After the data is stored on a data node, the corresponding configuration information is updated, including the named graph information and the data node's stored triple count.

Implementation example of data upload:

1. Prepare the triple data and define its named graph (graphname), that is, decide which named graph the data is to be uploaded to. Call the data upload interface to upload the triple data and its named graph to the data management module.

2. The data management module calls the data parser to parse the triple data and wrap it into triple data objects in a unified format.

3. The data management module calls the data splitting module and, through the predicate management module, queries the effective predicate list of the named graph. According to this list, the uploaded triple data object is split into two parts: the complete-predicate triple data object and the effective-predicate triple data object.

4. The data management module uses the auto-increment unique ID generator to generate a unique ID for the uploaded triple data and adds the ID value to both the complete-predicate triple data object and the effective-predicate triple data object.

5. The data storage control module is called to store the complete-predicate triple data in the NoSQL database cluster and the effective-predicate triple data in the RDF database cluster. The complete-predicate triple data is stored directly in the NoSQL database cluster; storage of the effective-predicate triple data is controlled by the routing node of the RDF database cluster.

6. The routing node of the RDF database cluster calls the configuration node to obtain the data nodes that hold the named graph. If the named graph's data is not stored on any data node, the named graph is new; the data node with the smallest current fill factor among all data nodes is selected to store the uploaded triple data, and processing continues with step 10.

7. If the named graph is stored on some data nodes, the data node with the smallest current fill factor among them is selected.

8. If the current fill factor of the selected data node is greater than or equal to the maximum fill factor, the named graph's data must be sharded: the data node with the smallest current fill factor among the other data nodes is selected to store the uploaded triple data, and processing continues with step 10.

9. If the current fill factor of the selected data node is less than the maximum fill factor, the selected data node stores the uploaded triple data directly, and processing continues with step 10.

10. After the data is stored on the data node, the corresponding configuration information is updated: the named graph information, the data node's stored triple count and its current fill factor.
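Steps 6 to 10 amount to a small selection routine driven by fill factors. The Python sketch below is one possible reading under stated assumptions: the `DataNode` fields stand in for the configuration node's records, capacity is measured in triples, the default maximum fill factor of 0.8 is an arbitrary illustrative value, and raising an error when no other node exists is a choice made here because the patent does not describe that case.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DataNode:
    """Configuration record of one data node (hypothetical field names)."""
    name: str
    capacity: int                     # maximum number of triples the node can hold
    stored: int = 0                   # number of triples currently stored
    max_fill_factor: float = 0.8      # illustrative default, not from the patent
    graphs: Set[str] = field(default_factory=set)

    @property
    def fill_factor(self) -> float:
        # Current fill factor = stored data volume / maximum data capacity.
        return self.stored / self.capacity


def select_node(nodes: List[DataNode], graph_name: str) -> DataNode:
    """Choose the data node for an upload to graph_name (steps 6 to 9)."""
    holders = [n for n in nodes if graph_name in n.graphs]
    if not holders:
        # Step 6: new named graph, take the least-filled node overall.
        return min(nodes, key=lambda n: n.fill_factor)
    # Step 7: least-filled node among those already holding the graph.
    best = min(holders, key=lambda n: n.fill_factor)
    if best.fill_factor >= best.max_fill_factor:
        # Step 8: shard the graph onto the least-filled of the other nodes.
        others = [n for n in nodes if graph_name not in n.graphs]
        if not others:
            raise RuntimeError("no other data node available for sharding")
        return min(others, key=lambda n: n.fill_factor)
    # Step 9: the selected holder still has room.
    return best


def store(nodes: List[DataNode], graph_name: str, triple_count: int) -> DataNode:
    """Store triple_count triples and update the configuration (step 10)."""
    node = select_node(nodes, graph_name)
    node.stored += triple_count       # fill_factor is derived, so it updates too
    node.graphs.add(graph_name)
    return node
```

Because `fill_factor` is computed from `stored` and `capacity`, updating the stored triple count in step 10 keeps the current fill factor consistent without a separate write.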

Claims

1. An efficient distributed RDF data storage method, the steps of which are:

1) for each triple to be uploaded, the user selects an existing named graph or sets a new named graph, and sets the effective predicates and their triples for the triple according to business requirements;

2) the data control system parses each triple in the RDF data uploaded by the user and extracts the predicate of the triple and the effective predicates of the triple's named graph; then, according to the effective predicates, the triples are split into two sets of triples with the same unique identifier: the complete-predicate triples of a subject and the effective-predicate triples of the same subject; wherein the complete predicates are all the predicates contained in the named graph of the triples, and the effective predicates are a part of the complete predicates;

3) the data control system stores the resulting complete-predicate triple data and effective-predicate triple data of the same subject in different database clusters.

2. The method according to claim 1, characterized in that an open-source distributed NoSQL database cluster is used to store the complete-predicate triple data of the same subject, and an RDF database cluster is used to store the effective-predicate triple data of the same subject.

3. The method according to claim 2, wherein when the data control system receives a predicate update task, it detects the changed predicates according to the predicate update information in the update task and then updates the predicates in the corresponding triples.

4. The method according to claim 2 or 3, wherein the RDF database cluster includes data nodes, routing nodes and configuration nodes; wherein the data nodes are used for data storage; the routing nodes are used for controlling the data nodes, including data update, data node selection, data sharding and data synchronization; and the configuration node is used to manage the configuration information of the data nodes, including each data node's IP address and port, name, named graphs, predicate information, number of stored triples, maximum fill factor and master/slave flag.

5. The method of claim 4, wherein the data control system stores triplet data of the same subject to the same data node.

6. The method according to claim 5, wherein the data control system stores the data of the same named graph into the same data node within the maximum storage capacity of the data node.

7. The method according to claim 4, wherein the routing node obtains the data nodes holding the data of the named graph according to the named graph of the triple and the configuration information of the configuration node; wherein, if the data of the named graph is not stored on any data node, the data node with the smallest current fill factor is selected from all data nodes to store the uploaded triple data; if the named graph is stored on some data nodes, the data node with the smallest current fill factor is selected from those data nodes; if that smallest current fill factor is greater than or equal to the maximum fill factor, the data of the named graph is sharded and the data node with the smallest fill factor is selected from the other data nodes to store the uploaded triple data; otherwise the data node with the smallest current fill factor is selected to store the uploaded triple data.

8. The method according to claim 7, wherein after the data node stores a triple, the corresponding configuration information is updated, including the named graph information, the stored triple count and the current fill factor.

9. The method according to claim 1, wherein the data control system extends the extracted effective predicates: for the predicates of a named graph submitted by the user for extension, the data control system obtains the submitted named graph and its extension predicates, compares them with the effective predicates in the named graph, and determines the predicates to be extended.