Research on Data Distribution Strategy and Reliability Based on Ceph Cloud Storage System

Abstract

With the rapid development of the Internet of Things, big data, and cloud computing, the volume of Internet data is growing exponentially, and data resources are becoming increasingly complex and diverse. A safe and efficient data storage scheme is therefore essential. Because traditional centralized storage is limited in capacity, cost, and security, massive data is now stored mainly in distributed systems built for cloud storage, such as GFS, HDFS, Ceph, and TFS. Among them, Ceph, a milestone in open-source storage, has been widely adopted for its high performance, high reliability, and easy scalability.

CRUSH, the core data distribution algorithm of the Ceph storage system, maps the physical cluster into a hierarchical table. When data is stored, a weighted computation over this mapping table and the node weights yields the set of storage nodes for that data. The algorithm is mainly used to map data to storage devices efficiently and to handle the data migration caused by changes in the cluster structure. However, CRUSH still has shortcomings. First, its implementation is complex and its code base is large. Second, in small-scale Ceph clusters, the pseudo-randomness of CRUSH distributes placement groups (PGs) unevenly across storage nodes, which in turn leads to uneven data distribution between nodes. This thesis addresses these issues as follows:

(1) To address the complexity of object distribution algorithms in distributed storage clusters, the uneven distribution of objects, and the data migration that follows node failure, this thesis presents an object distribution algorithm based on two-dimensional arrays. The algorithm is simple, supports weights and data redundancy, and can represent a one-to-many functional relationship between data objects and storage nodes. Jump Hash is used to locate the storage node for a data object, and a duplicate detection mechanism prevents different replicas of the same object from being stored in the same fault domain.

(2) To solve the uneven distribution of PGs caused by the pseudo-randomness of CRUSH in small clusters, a weighted tree-based object distribution algorithm is proposed. The algorithm adopts the hierarchical cluster mapping of CRUSH to map the actual storage environment of the cluster into a weighted tree, and then applies the addressing method of the two-dimensional array-based algorithm so that PGs are distributed evenly across storage nodes, which in turn yields an even distribution of data.
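
The addressing idea in (1) can be illustrated with a minimal sketch. The abstract does not spell out the layout of the two-dimensional array, so the code below makes its own assumptions: node weights are expanded into a flat slot list, the object name is rehashed with an attempt counter, and a candidate is skipped when its fault domain is already used. The names `key64`, `place`, and `EXAMPLE_NODES` are hypothetical; only `jump_hash` follows a published algorithm (jump consistent hash by Lamping and Veach).

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the published algorithm.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def key64(name: str, attempt: int) -> int:
    """Hypothetical helper: derive a 64-bit key from an object name and attempt number."""
    digest = hashlib.sha256(f"{name}#{attempt}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(name, nodes, replicas, max_attempts=64):
    """Pick `replicas` nodes for an object, each in a different fault domain.

    `nodes` is a list of (node_id, fault_domain, integer_weight) tuples.
    This slot layout is an assumption, not the thesis's actual array.
    """
    # Weight support: a node with weight w occupies w slots.
    slots = [(node_id, domain)
             for node_id, domain, weight in nodes
             for _ in range(weight)]
    chosen, used_domains = [], set()
    for attempt in range(max_attempts):
        node_id, domain = slots[jump_hash(key64(name, attempt), len(slots))]
        if domain in used_domains:
            continue  # duplicate detection: fault domain already holds a replica
        chosen.append(node_id)
        used_domains.add(domain)
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("could not find enough distinct fault domains")


# Illustrative cluster: two light nodes share rack-a, two heavier nodes stand alone.
EXAMPLE_NODES = [
    ("osd.0", "rack-a", 1),
    ("osd.1", "rack-a", 1),
    ("osd.2", "rack-b", 2),
    ("osd.3", "rack-c", 2),
]

if __name__ == "__main__":
    print(place("object-42", EXAMPLE_NODES, replicas=3))
```

Because Jump Hash only moves a key to the newly added bucket when the bucket count grows, appending slots for a new node leaves most existing placements untouched. That property is one plausible reason for choosing it when low migration is a goal.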

Both solutions were evaluated experimentally. The results show that the object distribution algorithm based on two-dimensional arrays is simple to implement, distributes data evenly, and causes little data migration. The weighted tree-based algorithm built on it distributes PGs more evenly than the default algorithm, which leads to more even data distribution among the cluster nodes. In addition, the weighted tree-based algorithm ensures that replicas of the same data are stored in different fault domains. A distributed storage system built on this algorithm can therefore ensure the security and reliability of its data.
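
The abstract does not state which metrics the experiments used. As one possible way to quantify "uniform distribution" and "less migration", the sketch below computes the coefficient of variation of per-node PG counts and the fraction of PGs that change node between two placement maps. Both function names and the single-node-per-PG map format are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def load_cv(placement, nodes):
    """Coefficient of variation of per-node PG counts; 0.0 means perfectly even.

    `placement` maps a PG id to the node that holds it (one node per PG,
    a simplification); `nodes` lists every node, including empty ones.
    """
    counts = Counter(placement.values())
    loads = [counts.get(node, 0) for node in nodes]
    avg = mean(loads)
    return pstdev(loads) / avg if avg else 0.0


def migrated_fraction(before, after):
    """Share of PGs present in both maps whose node differs between them."""
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    moved = sum(1 for pg in common if before[pg] != after[pg])
    return moved / len(common)
```

A lower `load_cv` indicates a more even PG distribution, and a lower `migrated_fraction` after a node is added or removed indicates less data movement.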

Keywords: Massive data; Cloud storage; Ceph distributed file system; CRUSH; Jump Hash

Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

§1.1 Research Background and Significance

§1.2 Research Status at Home and Abroad

§1.3 Research Objectives and Content

§1.4 Organization of the Thesis

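Chapter 2 Related Technologies

§2.1 Ceph Distributed File System

§2.1.1 Ceph System Architecture

§2.1.2 Ceph Core Components

§2.1.3 Data Addressing Process

§2.1.4 Data Consistency

§2.1.5 Data Fault Tolerance Mechanism

§2.2 Introduction to the CRUSH Algorithm

§2.2.1 Cluster Hierarchical Mapping

§2.2.2 Replica Placement Rules

§2.3 Chapter Summary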

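Chapter 3 Object Distribution Strategy Based on Two-Dimensional Arrays in Cloud Storage

§3.1 Problem Description of the Object Distribution Algorithm

§3.2 Object Distribution Algorithm Based on Two-Dimensional Arrays

§3.2.1 Constructing the Two-Dimensional Array

§3.2.2 Object Replica Addressing Process

§3.2.3 Data Object Distribution

§3.2.4 Analysis of Data Migration Volume

§3.3 Experiments and Analysis

§3.4 Chapter Summary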

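Chapter 4 Design of the Weighted Tree-Based Object Distribution Algorithm

§4.1 Problem Description of the Object Distribution Algorithm

§4.2 Weighted Tree-Based Object Distribution Algorithm

§4.2.1 Constructing the Weighted Tree

§4.2.2 Custom Replica Settings

§4.2.3 Replica Node Selection

§4.3 Experiments and Analysis

§4.3.1 Experimental Environment

§4.3.2 Experimental Results and Analysis

§4.4 Chapter Summary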

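Chapter 5 Design and Implementation of a Weighted Tree-Based Data Distribution System

§5.1 Overall System Design

§5.2 Design and Implementation of the Web Client Front-End Access Module

§5.3 Design and Implementation of the Storage Module

§5.3.1 Cluster Functional Node Design

§5.3.2 Cluster Data Distribution Strategy

§5.4 Data Operation Workflow

§5.5 System File Index

§5.6 System Testing

§5.6.1 System Environment

§5.6.2 System Functional Testing

§5.6.3 Data Storage Testing

§5.7 Chapter Summary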

Chapter 6 Summary and Outlook

§6.1 Summary of Work

§6.2 Future Work

References

Acknowledgements

Main Research Achievements of the Author During the Master's Program