
LPIC3 DIPLOMA Linux Clustering – LAB NOTES: Lesson Ceph Centos7 – Ceph CRUSH Map

LAB on Ceph Clustering on Centos7

 

These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering. They are in “rough format”, presented as they were written.

 

This lab uses the ceph-deploy tool to set up the ceph cluster.  However, note that ceph-deploy is now an outdated Ceph tool and is no longer being maintained by the Ceph project. It is also not available for Centos8. The notes below relate to Centos7.

 

For versions of Centos higher than 7, the Ceph project advises using the cephadm tool for installing Ceph on cluster nodes.

 

At the time of writing (2021) knowledge of ceph-deploy is a stipulated syllabus requirement of the LPIC3-306 Clustering Diploma Exam, hence this Centos7 Ceph lab refers to ceph-deploy.

 

As Ceph is a large and complex subject, these notes have been split into several different pages.

 

Overview of Cluster Environment 

 

The cluster comprises three nodes installed with Centos7 and housed on a KVM virtual machine system on a Linux Ubuntu host. We are installing with Centos7 rather than a more recent version because later versions are not compatible with the ceph-deploy tool.

 

CRUSH is a crucial part of Ceph’s storage system as it’s the algorithm Ceph uses to determine how data is stored across the nodes in a Ceph cluster.

 

Ceph stores client data as objects within storage pools.  Using the CRUSH algorithm, Ceph calculates in which placement group the object should best be stored and then also calculates which Ceph OSD node should store the placement group.

The CRUSH algorithm also enables the Ceph Storage Cluster to scale, rebalance, and recover dynamically from faults.
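
As a quick illustration (the pool and object names below are placeholders only), the mapping that CRUSH calculates for a given object can be displayed with the ceph osd map command:

ceph osd map {pool-name} {object-name}

e.g.

ceph osd map mypool testobject

The output shows the placement group the object maps to and the acting set of OSDs that CRUSH has selected to store it.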

 

The CRUSH map is a hierarchical cluster storage resource map representing the available storage resources.  CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server. As CRUSH uses an algorithmically determined method of storing and retrieving data, the CRUSH map allows Ceph to scale without performance bottlenecks, scalability problems or single points of failure.

 

Ceph uses three storage concepts for data management:

 

Pools
Placement Groups, and
CRUSH Map

 

Pools

 

Ceph stores data within logical storage groups called pools. Pools manage the number of placement groups, the number of replicas, and the ruleset deployed for the pool.
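
For example (the pool name and PG numbers below are purely illustrative), a replicated pool can be created and its replica count displayed and changed as follows:

ceph osd pool create mypool 64 64 replicated

ceph osd pool get mypool size
ceph osd pool set mypool size 3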

 

Placement Groups

 

Placement groups (PGs) are the shards or fragments of a logical object pool that store objects as a group on OSDs. Placement groups reduce the amount of metadata to be processed whenever Ceph reads or writes data to OSDs.

 

NOTE: Deploying a larger number of placement groups (e.g. around 100 PGs per OSD) generally results in better load balancing.
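
The PG count of an existing pool can be checked and, if required, increased (the pool name and values below are examples only; note that on older releases pg_num can be increased but not decreased):

ceph osd pool get mypool pg_num

ceph osd pool set mypool pg_num 128
ceph osd pool set mypool pgp_num 128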

 

The CRUSH map contains a list of OSDs (physical disks), a list of buckets for aggregating the devices into physical locations, and a list of rules that define how CRUSH will replicate data in the Ceph cluster.

 

Buckets can contain any number of OSDs. Buckets can themselves also contain other buckets, enabling them to form interior nodes in a storage hierarchy.

 

OSDs and buckets have numerical identifiers and weight values associated with them.

 

This structure can be used to reflect the actual physical organization of the cluster installation, taking into account such characteristics as physical proximity, common power sources, and shared networks.

 

When you deploy OSDs they are automatically added to the CRUSH map under a host bucket named for the node on which they run. This ensures that replicas or erasure code shards are distributed across hosts and that a single host or other failure will not affect service availability.

 

The main practical advantages of CRUSH are:

 

Avoiding consequences of device failure. This is a big advantage over RAID.

 

Fast — CRUSH placement calculations take only microseconds.

 

Stability and Reliability — very little data movement occurs when the cluster topology changes.

 

Flexibility — replication, erasure codes, complex placement schemes are all possible.

 

 

The CRUSH Map Structure

 

The CRUSH map consists of a hierarchy that describes the physical topology of the cluster and a set of rules defining data placement policy.

 

The hierarchy has devices (OSDs) at the leaves, and internal nodes corresponding to other physical features or groupings:

 

hosts, racks, rows, datacenters, etc.

 

The rules describe how replicas are placed in terms of that hierarchy (e.g., ‘three replicas in different racks’).
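
As a sketch of how such a rule might look in a decompiled CRUSH map (the rule name and id are arbitrary, and this assumes rack buckets exist in the hierarchy), a replicated rule placing each replica in a different rack could be written as:

rule replicated_racks {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}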

 

Devices

 

Devices are individual OSDs that store data, usually one for each storage drive. Devices are identified by an id (a non-negative integer) and a name, normally osd.N where N is the device id.

 

Types and Buckets

 

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, racks, rows, etc.

 

The CRUSH map defines a series of types used to describe these nodes.

 

The default types include:

 

osd (or device)

 

host

 

chassis

 

rack

 

row

 

pdu

 

pod

 

room

 

datacenter

 

zone

 

region

 

root

 

Most clusters use only a handful of these types, and others can be defined as needed.

 

 

CRUSH Rules

 

CRUSH Rules define policy about how data is distributed across the devices in the hierarchy. They define placement and replication strategies or distribution policies that allow you to specify exactly how CRUSH places data replicas.

 

To display what rules are defined in the cluster:

 

ceph osd crush rule ls

 

You can view the contents of the rules with:

 

ceph osd crush rule dump
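
Rules can also be created from the CLI without hand-editing the map. For example (the rule and pool names are placeholders), to create a replicated rule using host as the failure domain under the default root and then assign it to a pool:

ceph osd crush rule create-replicated myrule default host

ceph osd pool set mypool crush_rule myrule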

 

The weights associated with each node in the hierarchy can be displayed with:

 

ceph osd tree

 

 

To Modify the CRUSH Map

 

To add or move an OSD in the CRUSH map of a running cluster:

 

ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} …]

 

 

For example, the following command adds osd.0 to the hierarchy, or moves it from a previous location:

 

ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

 

To Remove an OSD from the CRUSH Map

 

To remove an OSD from the CRUSH map of a running cluster, execute the following:

 

ceph osd crush remove {name}
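
For example, to remove osd.3 (one of the OSDs in this lab cluster) from the CRUSH hierarchy. Note that this only removes the CRUSH entry; completely deleting an OSD additionally requires ceph auth del and ceph osd rm:

ceph osd crush remove osd.3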

 

To Add, Move or Remove a Bucket in the CRUSH Map

 

To add a bucket in the CRUSH map of a running cluster, execute the ceph osd crush add-bucket command:

 

ceph osd crush add-bucket {bucket-name} {bucket-type}

 

To move a bucket to a different location or position in the CRUSH map hierarchy:

 

ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, […]

 

 

To remove a bucket from the CRUSH hierarchy, use:

 

ceph osd crush remove {bucket-name}

 

Note: A bucket must be empty before removing it from the CRUSH hierarchy.

 

 

 

How To Tune CRUSH 

 

 

CRUSH uses sets of parameters known as tunables, grouped into named profiles, in order to tune the behavior of the CRUSH map.

 

As of the Octopus release these are:

 

legacy: the legacy behavior from argonaut and earlier.

 

argonaut: the legacy values supported by the original argonaut release

 

bobtail: the values supported by the bobtail release

 

firefly: the values supported by the firefly release

 

hammer: the values supported by the hammer release

 

jewel: the values supported by the jewel release

 

optimal: the best (i.e. optimal) values of the current version of Ceph

 

default: the default values of a new cluster installed from scratch. These values, which depend on the current version of Ceph, are hardcoded and are generally a mix of optimal and legacy values. They generally match the optimal profile of the previous LTS release, or the most recent release for which most users are likely to have up-to-date clients.

 

You can apply a profile to a running cluster with the command:

 

ceph osd crush tunables {PROFILE}
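
For example, to switch a running cluster to the optimal profile and then display the tunables currently in effect (be aware that changing profiles may trigger data movement):

ceph osd crush tunables optimal

ceph osd crush show-tunables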

 

 

How To Determine a CRUSH Location

 

The location of an OSD within the CRUSH map’s hierarchy is known as the CRUSH location.

 

This location specifier takes the form of a list of key and value pairs.

 

E.g. if an OSD is in a specific row, rack, chassis and host, and is part of the ‘default’ CRUSH root (as is usual for most clusters), its CRUSH location will be:

 

root=default row=a rack=a2 chassis=a2a host=a2a1

 

The CRUSH location for an OSD can be defined by adding the crush location option in ceph.conf.
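
A minimal sketch of such a ceph.conf entry for osd.0 (the location values are taken from the example above and are illustrative only):

[osd.0]
crush location = root=default row=a rack=a2 chassis=a2a host=a2a1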

 

Each time the OSD starts, it checks that it is in the correct location in the CRUSH map. If it is not then it moves itself.

 

To disable this automatic CRUSH map management, edit ceph.conf and add the following in the [osd] section:

 

osd crush update on start = false

 

 

 

However, note that in most cases it is not necessary to manually configure this.

 

 

How To Edit and Modify the CRUSH Map

 

It is more convenient to modify the CRUSH map at runtime with the Ceph CLI than to edit the CRUSH map manually.

 

However, you may sometimes wish to edit the CRUSH map manually, for example in order to change the default bucket types, or to use a bucket algorithm other than straw.

 

 

The steps in overview:

 

Get the CRUSH map.

 

Decompile the CRUSH map.

 

Edit at least one of: Devices, Buckets or Rules.

 

Recompile the CRUSH map.

 

Set the CRUSH map.

 

 

Get a CRUSH Map

 

ceph osd getcrushmap -o {compiled-crushmap-filename}

 

This writes (-o) a compiled CRUSH map to the filename you specify.

 

However, as the CRUSH map is in compiled form, you first need to decompile it.

 

Decompile a CRUSH Map

 

Use the crushtool utility:

 

crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

 

 

 

The CRUSH Map has six sections:

 

tunables: The preamble at the top of the map describes any tunables for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These correct for old bugs, optimizations, or other changes in behavior made over the years to CRUSH.

 

devices: Devices are individual ceph-osd daemons that store data.

 

types: Bucket types define the types of buckets used in the CRUSH hierarchy. Buckets consist of a hierarchical aggregation of storage locations (e.g., rows, racks, chassis, hosts, etc.) together with their assigned weights.

 

buckets: Once you define bucket types, you must define each node in the hierarchy, its type, and which devices or other nodes it contains.

 

rules: Rules define policy about how data is distributed across devices in the hierarchy.

 

choose_args: Choose_args are alternative weights associated with the hierarchy that have been adjusted to optimize data placement.

 

A single choose_args map can be used for the entire cluster, or alternatively one can be created for each individual pool.

 

 

Display the current CRUSH hierarchy with:

 

ceph osd tree

 

[root@ceph-mon ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       0.00757 root default
-3       0.00378     host ceph-osd0
 0   hdd 0.00189         osd.0        down        0 1.00000
 3   hdd 0.00189         osd.3          up  1.00000 1.00000
-5       0.00189     host ceph-osd1
 1   hdd 0.00189         osd.1          up  1.00000 1.00000
-7       0.00189     host ceph-osd2
 2   hdd 0.00189         osd.2          up  1.00000 1.00000
[root@ceph-mon ~]#

 

 

 

To edit the CRUSH map:

 

ceph osd getcrushmap -o crushmap.txt

 

crushtool -d crushmap.txt -o crushmap-decompile

 

nano crushmap-decompile

 

 

 

Edit at least one of Devices, Buckets and Rules:

 

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

 

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd

 

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-osd0 {
    id -3              # do not change unnecessarily
    id -4 class hdd    # do not change unnecessarily
    # weight 0.004
    alg straw2
    hash 0             # rjenkins1
    item osd.0 weight 0.002
    item osd.3 weight 0.002
}
host ceph-osd1 {
    id -5              # do not change unnecessarily
    id -6 class hdd    # do not change unnecessarily
    # weight 0.002
    alg straw2
    hash 0             # rjenkins1
    item osd.1 weight 0.002
}
host ceph-osd2 {
    id -7              # do not change unnecessarily
    id -8 class hdd    # do not change unnecessarily
    # weight 0.002
    alg straw2
    hash 0             # rjenkins1
    item osd.2 weight 0.002
}
root default {
    id -1              # do not change unnecessarily
    id -2 class hdd    # do not change unnecessarily
    # weight 0.008
    alg straw2
    hash 0             # rjenkins1
    item ceph-osd0 weight 0.004
    item ceph-osd1 weight 0.002
    item ceph-osd2 weight 0.002
}

 

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

 

# end crush map
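
Once the edits are complete, recompile the map with crushtool and inject it back into the cluster with ceph osd setcrushmap (the compiled output filename here is arbitrary):

crushtool -c crushmap-decompile -o crushmap-compiled

ceph osd setcrushmap -i crushmap-compiled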

 

 

To add racks to the cluster CRUSH layout:

 

ceph osd crush add-bucket rack01 rack
ceph osd crush add-bucket rack02 rack

 

[root@ceph-mon ~]# ceph osd crush add-bucket rack01 rack
added bucket rack01 type rack to crush map
[root@ceph-mon ~]# ceph osd crush add-bucket rack02 rack
added bucket rack02 type rack to crush map
[root@ceph-mon ~]#
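
As a possible next step (a sketch only; the rack assignments are arbitrary), the new racks could be placed under the default root and the host buckets moved into them, after which the new layout can be verified:

ceph osd crush move rack01 root=default
ceph osd crush move rack02 root=default
ceph osd crush move ceph-osd0 rack=rack01
ceph osd crush move ceph-osd1 rack=rack01
ceph osd crush move ceph-osd2 rack=rack02

ceph osd tree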

 

 

 
