LPIC3 DIPLOMA Linux Clustering – LAB NOTES: GlusterFS Configuration on Centos

How To Install GlusterFS on Centos7


Choose a package source: either the CentOS Storage SIG or Gluster.org


Using CentOS Storage SIG Packages



yum search centos-release-gluster


yum install centos-release-gluster9


Pick a release package that appears in the search output below; centos-release-gluster37 is not among those listed.


yum install glusterfs glusterfs-cli glusterfs-libs glusterfs-server




[root@glusterfs1 ~]# yum search centos-release-gluster
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
================================================= N/S matched: centos-release-gluster =================================================
centos-release-gluster-legacy.noarch : Disable unmaintained Gluster repositories from the CentOS Storage SIG
centos-release-gluster40.x86_64 : Gluster 4.0 (Short Term Stable) packages from the CentOS Storage SIG repository
centos-release-gluster41.noarch : Gluster 4.1 (Long Term Stable) packages from the CentOS Storage SIG repository
centos-release-gluster5.noarch : Gluster 5 packages from the CentOS Storage SIG repository
centos-release-gluster6.noarch : Gluster 6 packages from the CentOS Storage SIG repository
centos-release-gluster7.noarch : Gluster 7 packages from the CentOS Storage SIG repository
centos-release-gluster8.noarch : Gluster 8 packages from the CentOS Storage SIG repository
centos-release-gluster9.noarch : Gluster 9 packages from the CentOS Storage SIG repository

Name and summary matches only, use “search all” for everything.
[root@glusterfs1 ~]#



Alternatively, using Gluster.org Packages


# yum update -y



Download the latest glusterfs-epel repository from gluster.org:


yum install wget -y



[root@glusterfs1 ~]# yum install wget -y
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
Package wget-1.14-18.el7_6.1.x86_64 already installed and latest version
Nothing to do
[root@glusterfs1 ~]#




wget -P /etc/yum.repos.d/ http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo


Also install the latest EPEL repository from fedoraproject.org to resolve all dependencies:


yum install http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm



[root@glusterfs1 ~]# yum repolist
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
repo id repo name status
base/7/x86_64 CentOS-7 – Base 10,072
centos-ceph-nautilus/7/x86_64 CentOS-7 – Ceph Nautilus 609
centos-nfs-ganesha28/7/x86_64 CentOS-7 – NFS Ganesha 2.8 153
ceph-noarch Ceph noarch packages 184
epel/x86_64 Extra Packages for Enterprise Linux 7 – x86_64 13,638
extras/7/x86_64 CentOS-7 – Extras 498
updates/7/x86_64 CentOS-7 – Updates 2,579
repolist: 27,733
[root@glusterfs1 ~]#



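Then install GlusterFS Server on all GlusterFS storage cluster nodes. Note that in the run below yum reports that gluster-cli and glusterfs-server are not available: the CLI package is named glusterfs-cli, and the repolist above shows no Gluster repository enabled to supply glusterfs-server.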

[root@glusterfs1 ~]# yum install glusterfs gluster-cli glusterfs-libs glusterfs-server


Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
No package gluster-cli available.
No package glusterfs-server available.
Resolving Dependencies
--> Running transaction check
---> Package glusterfs.x86_64 0:6.0-49.1.el7 will be installed
---> Package glusterfs-libs.x86_64 0:6.0-49.1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

Package Arch Version Repository Size
glusterfs x86_64 6.0-49.1.el7 updates 622 k
glusterfs-libs x86_64 6.0-49.1.el7 updates 398 k

Transaction Summary
Install 2 Packages

Total download size: 1.0 M
Installed size: 4.3 M
Is this ok [y/d/N]: y
Downloading packages:
(1/2): glusterfs-libs-6.0-49.1.el7.x86_64.rpm | 398 kB 00:00:00
(2/2): glusterfs-6.0-49.1.el7.x86_64.rpm | 622 kB 00:00:00
Total 2.8 MB/s | 1.0 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : glusterfs-libs-6.0-49.1.el7.x86_64 1/2
Installing : glusterfs-6.0-49.1.el7.x86_64 2/2
Verifying : glusterfs-6.0-49.1.el7.x86_64 1/2
Verifying : glusterfs-libs-6.0-49.1.el7.x86_64 2/2

glusterfs.x86_64 0:6.0-49.1.el7 glusterfs-libs.x86_64 0:6.0-49.1.el7

[root@glusterfs1 ~]#






Pacemaker & Corosync Cluster Commands Cheat Sheet

 Config files for Corosync and Pacemaker


/etc/corosync/corosync.conf – config file for corosync cluster membership and quorum


/var/lib/pacemaker/crm/cib.xml – config file for cluster nodes and resources


Log files


/var/log/cluster/corosync.log – log file for corosync


/var/log/messages – also used by other cluster services, including crmd and pengine



Pacemaker Cluster Resources and Resource Groups


A cluster resource refers to any object or service which is managed by the Pacemaker cluster.


A number of different resources are defined by Pacemaker:


Primitive: this is the basic resource managed by the cluster.


Clone: a resource which can run on multiple nodes simultaneously.


Multi-state or Master/Slave: a resource in which one instance serves as master and the other as slave. A common example of this is DRBD.


Resource Group: a set of primitives grouped together for easier administration.


Resource Classes:


OCF or Open Cluster Framework: this is the most commonly used resource class for Pacemaker clusters
Service: used for implementing systemd, upstart, and lsb commands
Systemd: used for systemd commands
Fencing: used for Stonith fencing resources
Nagios: used for Nagios plugins
LSB or Linux Standard Base: these are for the older Linux init script operations. Now deprecated


Resource stickiness: a score that expresses how strongly a resource prefers to stay on the node where it is currently running, for example after the node it originally ran on has failed and later been repaired. Setting it is advised, since migrating resources between nodes should generally be avoided.



Constraints: a set of rules that defines where resources or resource groups may run, which of them run together, and in which order they are started.

Constraint Types:


Location: A location constraint defines on which node a resource should run – or not run, if the score is set to -INFINITY.

Colocation: A colocation constraint defines which resources should be started together – or not started together, in the case of -INFINITY.

Order: Order constraints define in which order resources should be started. This is to allow for pre-conditional services to be started first.


Resource Order Priority Scores:


These are used with the constraint types above.


The score can be set to any value from -1,000,000 (-INFINITY: the event will never happen) up to 1,000,000 (INFINITY: the event must happen).


A node whose score for a resource is negative will not run that resource.
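
When several constraints apply to the same resource, their scores are added together. The following is a minimal sketch of that arithmetic, assuming INFINITY is 1,000,000 as stated above and that -INFINITY overrides everything else. The helper names to_int and add_scores are hypothetical and are not part of pcs or Pacemaker.

```shell
#!/bin/sh
# Sketch of Pacemaker-style score addition (assumption: INFINITY = 1000000).
# to_int and add_scores are illustrative helpers, not real cluster commands.

INF=1000000

# Translate the INFINITY keywords into plain integers.
to_int() {
    case "$1" in
        INFINITY|+INFINITY) printf '%s\n' "$INF" ;;
        -INFINITY)          printf '%s\n' "-$INF" ;;
        *)                  printf '%s\n' "$1" ;;
    esac
}

# Add two scores; the body runs in a subshell so variables stay local.
add_scores() (
    a=$(to_int "$1")
    b=$(to_int "$2")
    if [ "$a" -le "-$INF" ] || [ "$b" -le "-$INF" ]; then
        # -INFINITY wins over everything, including +INFINITY.
        printf '%s\n' "-INFINITY"
    elif [ "$a" -ge "$INF" ] || [ "$b" -ge "$INF" ]; then
        printf '%s\n' "INFINITY"
    else
        sum=$((a + b))
        # Finite sums are clamped to the +/-INFINITY range.
        if [ "$sum" -ge "$INF" ]; then
            printf '%s\n' "INFINITY"
        elif [ "$sum" -le "-$INF" ]; then
            printf '%s\n' "-INFINITY"
        else
            printf '%s\n' "$sum"
        fi
    fi
)

add_scores 100 50               # prints 150
add_scores 100 INFINITY         # prints INFINITY
add_scores INFINITY -INFINITY   # prints -INFINITY
```

The last case is the one worth remembering: a mandatory preference (INFINITY) combined with a ban (-INFINITY) still results in a ban.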



Cluster Admin Commands

On RedHat Pacemaker Clusters, the pcs command is used to manage the cluster. pcs stands for “Pacemaker Configuration System”:


pcs status – View cluster status.
pcs config – View and manage cluster configuration.
pcs cluster – Configure cluster options and nodes.
pcs resource – Manage cluster resources.
pcs stonith – Manage fence devices.
pcs constraint – Manage resource constraints.
pcs property – Manage pacemaker properties.
pcs node – Manage cluster nodes.
pcs quorum – Manage cluster quorum settings.
pcs alert – Manage pacemaker alerts.
pcs pcsd – Manage pcs daemon.
pcs acl – Manage pacemaker access control lists.


Pacemaker Cluster Installation and Configuration Commands:


To install packages:


yum install pcs -y
yum install fence-agents-all -y


echo CHANGE_ME | passwd --stdin hacluster


systemctl start pcsd
systemctl enable pcsd


To authenticate new cluster nodes:


pcs cluster auth \
node1.example.com node2.example.com node3.example.com
Username: hacluster
node1.example.com: Authorized
node2.example.com: Authorized
node3.example.com: Authorized


To create and start a new cluster:

pcs cluster setup <option> <member> …




pcs cluster setup --start --enable --name mycluster \
node1.example.com node2.example.com node3.example.com

To enable cluster services to start on reboot:


pcs cluster enable --all


To enable cluster services on specific nodes:


pcs cluster enable [--all] [node] [...]


To disable cluster services on specific nodes:


pcs cluster disable [--all] [node] [...]


To display cluster status:


pcs status
pcs config


pcs cluster status
pcs quorum status
pcs resource show
crm_verify -L -V


crm_mon – Pacemaker's own cluster status monitor, which works independently of pcs and is also the tool used on crmsh-managed clusters



To delete a cluster:

pcs cluster destroy [--all]


To start/stop a cluster:


pcs cluster start --all
pcs cluster stop --all


To start/stop a cluster node:


pcs cluster start <node>
pcs cluster stop <node>



To carry out maintenance on a specific node, switch it to standby mode:


pcs cluster standby <node>

Then to restore the node from standby mode to the cluster service:

pcs cluster unstandby <node>


To set a cluster property


pcs property set <property>=<value>


To disable stonith fencing: NOTE: you should usually not do this on a live production cluster!


pcs property set stonith-enabled=false



To reenable the stonith fencing:


pcs property set stonith-enabled=true


To configure firewalling for the cluster:


firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload


To add a node to the cluster:


First, on the new node, check that the hacluster user and password are set and that pcsd is running:


systemctl status pcsd


Then on an active node:


pcs cluster auth node4.example.com
pcs cluster node add node4.example.com


Then, on the new node:


pcs cluster start
pcs cluster enable


To display the xml configuration


pcs cluster cib


To display current cluster status:


pcs status


To manage cluster resources:


pcs resource <tab>


To relocate resources or resource groups:


pcs resource move <resource>


or alternatively with:


pcs resource relocate <resource>


To clear the constraint created by the move, so that the resource can return to its original node:


pcs resource clear <resource>


pcs constraint <type> <option>


To create a new resource:


pcs resource create <resource_name> <resource_type> <resource_options>


To create new resources, reference the appropriate resource agents or RAs.


To list ocf resource types:


(example below with ocf:heartbeat)


pcs resource list heartbeat


To display the options of a resource type or agent:


pcs resource describe <resource_type>
pcs resource describe ocf:heartbeat:IPaddr2


pcs resource create vip_cluster ocf:heartbeat:IPaddr2 ip= --group myservices
pcs resource create apache-ip ocf:heartbeat:IPaddr2 ip= cidr_netmask=24



To display a resource:


pcs resource show


Cluster Troubleshooting

Logging functions:




tail -f /var/log/messages


tail -f /var/log/cluster/corosync.log


Debug information commands:


pcs resource debug-start <resource>
pcs resource debug-stop <resource>
pcs resource debug-monitor <resource>
pcs resource failcount show <resource>



To update a resource after modification:


pcs resource update <resource> <options>


To reset the failcount:


pcs resource cleanup <resource>


To move a resource away from its current node (optionally to a named node):


pcs resource move <resource> [ <node> ]


To start a resource or a resource group:


pcs resource enable <resource>


To stop a resource or resource group:


pcs resource disable <resource>



To create a resource group and add a new resource:


pcs resource create <resource_name> <resource_type> <resource_options> --group <group>


To delete a resource:


pcs resource delete <resource>


To add a resource to a group:


pcs resource group add <group> <resource>
pcs resource group list
pcs resource list


To add a constraint to a resource group:


pcs constraint colocation add apache-group with ftp-group -100000
pcs constraint order apache-group then ftp-group
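
The location constraint type described earlier has no command in this list. The following sketch shows how one might be added with the prefers and avoids forms of pcs constraint location. The group and node names are placeholders, the score of 200 is an arbitrary example, and the run and pin_group helpers are hypothetical. With DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: location constraints to go with the colocation and order examples.
# Names and scores are placeholders; run/pin_group are illustrative helpers.

DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Prefer one node with a finite score and ban another node outright.
pin_group() {
    group=$1
    preferred=$2
    banned=$3
    run pcs constraint location "$group" prefers "$preferred=200"
    run pcs constraint location "$group" avoids "$banned=INFINITY"
}

pin_group apache-group node1.example.com node3.example.com
```

Set DRY_RUN=0 to apply the constraints on a real cluster, and check the result afterwards with the constraint listing command shown further down.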



To reset a constraint for a resource or a resource group:


pcs resource clear <resource>


To list resource agent (RA) classes:


pcs resource standards


To list available RAs:


pcs resource agents ocf | service | stonith


To list specific resource agents of a specific RA provider:


pcs resource agents ocf:pacemaker


To list RA information:


pcs resource describe RA
pcs resource describe ocf:heartbeat:RA


To create a resource:


pcs resource create ClusterIP IPaddr2 ip= cidr_netmask=24 op monitor interval=60s

To delete a resource:


pcs resource delete resourceid


To display a resource (example with ClusterIP):


pcs resource show ClusterIP


To start a resource:


pcs resource enable ClusterIP


To stop a resource:


pcs resource disable ClusterIP


To remove a resource:


pcs resource delete ClusterIP


To modify a resource:


pcs resource update ClusterIP clusterip_hash=sourceip


To delete parameters for a resource (resource specific, here for ClusterIP):


pcs resource update ClusterIP ip=


To list the current resource defaults:


pcs resource defaults


To set resource defaults:


pcs resource defaults resource-stickiness=100


To list current operation defaults:


pcs resource op defaults


To set operation defaults:


pcs resource op defaults timeout=240s


To set colocation:


pcs constraint colocation add ClusterIP with WebSite INFINITY


To set colocation with roles:


pcs constraint colocation add Started AnotherIP with Master WebSite INFINITY


To set constraint ordering:


pcs constraint order ClusterIP then WebSite


To display constraint list:


pcs constraint list --full


To show a resource failure count:


pcs resource failcount show RA


To reset a resource failure count:


pcs resource failcount reset RA


To create a resource clone:


pcs resource clone ClusterIP globally-unique=true clone-max=2 clone-node-max=2


To manage a resource:


pcs resource manage RA


To unmanage a resource:


pcs resource unmanage RA



Fencing (Stonith) commands:

ipmitool -H rh7-node1-irmc -U admin -P password power on


fence_ipmilan --ip=rh7-node1-irmc.localdomain --username=admin --password=password --action=status

Status: ON

pcs stonith


pcs stonith describe fence_ipmilan


pcs stonith create ipmi-fencing1 fence_ipmilan \
pcmk_host_list="rh7-node1.localdomain" \
ipaddr= \
login=admin passwd=password \
op monitor interval=60s


pcs property set stonith-enabled=true
pcs stonith fence pcmk-2
stonith_admin --reboot pcmk-2


To display fencing resources:


pcs stonith show



To display Stonith RA information:


pcs stonith describe fence_ipmilan


To list available fencing agents:


pcs stonith list


To add a filter to list available resource agents for Stonith:


pcs stonith list <string>


To setup properties for Stonith:


pcs property set no-quorum-policy=ignore
pcs property set stonith-action=poweroff # default is reboot


To create a fencing device:


pcs stonith create stonith-rsa-node1 fence_rsa action=off ipaddr="node1_rsa" login=<user> passwd=<pass> pcmk_host_list=node1 secure=true


To display fencing devices:



pcs stonith show


To fence a node off from the rest of the cluster:


pcs stonith fence <node>


To modify a fencing device:


pcs stonith update stonithid [options]


To display fencing device options:


pcs stonith describe <stonith_ra>


To delete a fencing device:


pcs stonith delete stonithid


LPIC3 DIPLOMA Linux Clustering – LAB NOTES: Lesson Ceph Centos7 – Ceph CRUSH Map

LAB on Ceph Clustering on Centos7


These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering. They are in “rough format”, presented as they were written.


This lab uses the ceph-deploy tool to set up the ceph cluster.  However, note that ceph-deploy is now an outdated Ceph tool and is no longer being maintained by the Ceph project. It is also not available for Centos8. The notes below relate to Centos7.


For CentOS versions higher than 7, the Ceph project advises using the cephadm tool to install Ceph on cluster nodes.


At the time of writing (2021) knowledge of ceph-deploy is a stipulated syllabus requirement of the LPIC3-306 Clustering Diploma Exam, hence this Centos7 Ceph lab refers to ceph-deploy.


As Ceph is a large and complex subject, these notes have been split into several different pages.


Overview of Cluster Environment 


The cluster comprises three nodes installed with Centos7 and housed on a KVM virtual machine system on a Linux Ubuntu host. We are installing with Centos7 rather than the recent version because the later versions are not compatible with the ceph-deploy tool.


CRUSH is a crucial part of Ceph’s storage system as it’s the algorithm Ceph uses to determine how data is stored across the nodes in a Ceph cluster.


Ceph stores client data as objects within storage pools.  Using the CRUSH algorithm, Ceph calculates in which placement group the object should best be stored and then also calculates which Ceph OSD node should store the placement group.

The CRUSH algorithm also enables the Ceph Storage Cluster to scale, rebalance, and recover dynamically from faults.
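
The key point is that placement is calculated, not looked up. The toy sketch below illustrates only the first half of that calculation, mapping an object name to a placement group. It assumes cksum as a stand-in for Ceph's real hash function and a plain modulo for Ceph's own mapping, so it is not Ceph's actual algorithm. The helper object_to_pg and the object names are hypothetical.

```shell
#!/bin/sh
# Toy illustration of "object name -> placement group".
# Assumption: cksum (a CRC) stands in for Ceph's real hash, and a plain
# modulo stands in for Ceph's own PG mapping. NOT the real algorithm.

object_to_pg() (
    name=$1
    pg_num=$2
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    printf '%s\n' $(( hash % pg_num ))
)

# Any client running this arrives at the same PG for the same name,
# without asking a central server.
for obj in vm-disk-1 vm-disk-2 backup.tar; do
    printf '%s -> pg %s\n' "$obj" "$(object_to_pg "$obj" 128)"
done
```

In Ceph itself, the second half of the calculation (placement group to OSDs) is where the CRUSH map and its rules come in.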


The CRUSH map is a hierarchical cluster storage resource map representing the available storage resources.  CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server. As CRUSH uses an algorithmically determined method of storing and retrieving data, the CRUSH map allows Ceph to scale without performance bottlenecks, scalability problems or single points of failure.


Ceph uses three storage concepts for data management:


Pools, Placement Groups, and the CRUSH Map


Pools


Ceph stores data within logical storage groups called pools. Pools manage the number of placement groups, the number of replicas, and the ruleset deployed for the pool.


Placement Groups


Placement groups (PGs) are the shards or fragments of a logical object pool that store objects as a group on OSDs. Placement groups reduce the amount of metadata to be processed whenever Ceph reads or writes data to OSDs.


NOTE: Deploying a larger number of placement groups (e.g. 100 PGs per OSD) will result in better load balancing.
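
A common rule of thumb turns the "100 PGs per OSD" figure into a pool's PG count: multiply the number of OSDs by the target, divide by the number of replicas, and round up to the next power of two. The sketch below assumes a single pool using all OSDs; pg_count is a hypothetical helper, not a Ceph command.

```shell
#!/bin/sh
# Rule-of-thumb PG count: (OSDs * target PGs per OSD) / replicas,
# rounded up to the next power of two. Assumes one pool over all OSDs.

pg_count() (
    osds=$1
    replicas=$2
    per_osd=${3:-100}
    # Ceiling division, so the estimate is never rounded down.
    target=$(( (osds * per_osd + replicas - 1) / replicas ))
    pgs=1
    while [ "$pgs" -lt "$target" ]; do
        pgs=$((pgs * 2))
    done
    printf '%s\n' "$pgs"
)

pg_count 4 3      # 4 OSDs, 3 replicas: prints 256
pg_count 200 3    # 200 OSDs, 3 replicas: prints 8192
```

With several pools the total should be shared between them, so treat the result as an upper bound rather than a per-pool value.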


The CRUSH map contains a list of OSDs (physical disks), a list of buckets for aggregating the devices into physical locations, and a list of rules that define how CRUSH will replicate data in the Ceph cluster.


Buckets can contain any number of OSDs. Buckets can themselves also contain other buckets, enabling them to form interior nodes in a storage hierarchy.


OSDs and buckets have numerical identifiers and weight values associated with them.


This structure can be used to reflect the actual physical organization of the cluster installation, taking into account such characteristics as physical proximity, common power sources, and shared networks.


When you deploy OSDs they are automatically added to the CRUSH map under a host bucket named for the node on which they run. This ensures that replicas or erasure code shards are distributed across hosts and that a single host or other failure will not affect service availability.


The main practical advantages of CRUSH are:


Avoiding consequences of device failure. This is a big advantage over RAID.


Speed: data locations are calculated rather than looked up, and the calculation takes microseconds.


Stability and reliability: very little data movement occurs when the topology changes.


Flexibility: replication, erasure codes and complex placement schemes are all possible.



The CRUSH Map Structure


The CRUSH map consists of a hierarchy that describes the physical topology of the cluster and a set of rules defining data placement policy.


The hierarchy has devices (OSDs) at the leaves, and internal nodes corresponding to other physical features or groupings:


hosts, racks, rows, datacenters, etc.


The rules describe how replicas are placed in terms of that hierarchy (e.g., ‘three replicas in different racks’).




Devices


Devices are individual OSDs that store data, usually one for each storage drive. Devices are identified by an id (a non-negative integer) and a name, normally osd.N where N is the device id.


Types and Buckets


A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, racks, rows, etc.


The CRUSH map defines a series of types used to describe these nodes.


The default types include:


osd (or device)

host

chassis

rack

row

pdu

pod

room

datacenter

region

root

Most clusters use only a handful of these types, and others can be defined as needed.





CRUSH Rules


CRUSH Rules define policy about how data is distributed across the devices in the hierarchy. They define placement and replication strategies or distribution policies that allow you to specify exactly how CRUSH places data replicas.


To display what rules are defined in the cluster:


ceph osd crush rule ls


You can view the contents of the rules with:


ceph osd crush rule dump


The weights associated with each node in the hierarchy can be displayed with:


ceph osd tree



To modify the CRUSH MAP


To add or move an OSD in the CRUSH map of a running cluster:


ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} …]





The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.


ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


To Remove an OSD from the CRUSH Map


To remove an OSD from the CRUSH map of a running cluster, execute the following:


ceph osd crush remove {name}


To Add, Move or Remove a Bucket to the CRUSH Map


To add a bucket in the CRUSH map of a running cluster, execute the ceph osd crush add-bucket command:


ceph osd crush add-bucket {bucket-name} {bucket-type}


To move a bucket to a different location or position in the CRUSH map hierarchy:


ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, […]



To remove a bucket from the CRUSH hierarchy, use:


ceph osd crush remove {bucket-name}


Note: A bucket must be empty before removing it from the CRUSH hierarchy.




How To Tune CRUSH 



CRUSH behavior is tuned through matched sets of settings known as tunables, which are grouped into named profiles.


As of the Octopus release these are:


legacy: the legacy behavior from argonaut and earlier.


argonaut: the legacy values supported by the original argonaut release


bobtail: the values supported by the bobtail release


firefly: the values supported by the firefly release


hammer: the values supported by the hammer release


jewel: the values supported by the jewel release


optimal: the best (i.e. optimal) values for the current version of Ceph


default: the default values of a new cluster installed from scratch. These values, which depend on the current version of Ceph, are hardcoded and are generally a mix of optimal and legacy values. They generally match the optimal profile of the previous LTS release, or the most recent release for which most users are likely to have up-to-date clients.


You can apply a profile to a running cluster with the command:


ceph osd crush tunables {PROFILE}



How To Determine a CRUSH Location


The location of an OSD within the CRUSH map’s hierarchy is known as the CRUSH location.


This location specifier takes the form of a list of key and value pairs.


For example, if an OSD is in a specific row, rack, chassis and host, and is part of the ‘default’ CRUSH root (as usual for most clusters), its CRUSH location will be:


root=default row=a rack=a2 chassis=a2a host=a2a1
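
Because the location is just a space-separated list of key=value pairs, it is easy to pick apart in a script. The following sketch looks up one level of the hierarchy in such a string; crush_loc_get is a hypothetical helper, not a Ceph command.

```shell
#!/bin/sh
# Sketch: read one key out of a CRUSH location string.
# crush_loc_get is an illustrative helper, not part of Ceph.

crush_loc_get() (
    loc=$1
    key=$2
    # Unquoted on purpose: split the location into key=value words.
    for pair in $loc; do
        case "$pair" in
            "$key"=*)
                # Strip everything up to and including the first "=".
                printf '%s\n' "${pair#*=}"
                exit 0
                ;;
        esac
    done
    # Key not present in this location.
    exit 1
)

loc="root=default row=a rack=a2 chassis=a2a host=a2a1"
crush_loc_get "$loc" rack    # prints a2
crush_loc_get "$loc" host    # prints a2a1
```

The helper returns a non-zero status when the requested bucket type is absent, which makes it usable in shell conditionals.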


The CRUSH location for an OSD can be defined by adding the crush location option in ceph.conf.


Each time the OSD starts, it checks that it is in the correct location in the CRUSH map. If it is not then it moves itself.


To disable this automatic CRUSH map management, edit ceph.conf and add the following in the [osd] section:


osd crush update on start = false




However, note that in most cases it is not necessary to manually configure this.



How To Edit and Modify the CRUSH Map


It is more convenient to modify the CRUSH map at runtime with the Ceph CLI than editing the CRUSH map manually.


However, you may sometimes wish to edit the CRUSH map manually, for example in order to change the default bucket types, or to use a bucket algorithm other than straw.



The steps in overview:


Get the CRUSH map.


Decompile the CRUSH map.


Edit at least one: Devices, Buckets or Rules.


Recompile the CRUSH map.


Set the CRUSH map.



Get a CRUSH Map


ceph osd getcrushmap -o {compiled-crushmap-filename}


This writes (-o) a compiled CRUSH map to the filename you specify.


However, as the CRUSH map is in compiled form, you first need to decompile it.


Decompile a CRUSH Map


Use crushtool:


crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}




The CRUSH Map has six sections:


tunables: The preamble at the top of the map describes any tunables for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These correct for old bugs, optimizations, or other changes in behavior made over the years to CRUSH.


devices: Devices are individual ceph-osd daemons that store data.


types: Bucket types define the types of buckets used in the CRUSH hierarchy. Buckets consist of a hierarchical aggregation of storage locations (e.g., rows, racks, chassis, hosts, etc.) together with their assigned weights.


buckets: Once you define bucket types, you must define each node in the hierarchy, its type, and which devices or other nodes it contains.


rules: Rules define policy about how data is distributed across devices in the hierarchy.


choose_args: Choose_args are alternative weights associated with the hierarchy that have been adjusted to optimize data placement.


A single choose_args map can be used for the entire cluster, or alternatively one can be created for each individual pool.



Display the current crush hierarchy with:


ceph osd tree


[root@ceph-mon ~]# ceph osd tree
-1 0.00757 root default
-3 0.00378 host ceph-osd0
0 hdd 0.00189 osd.0 down 0 1.00000
3 hdd 0.00189 osd.3 up 1.00000 1.00000
-5 0.00189 host ceph-osd1
1 hdd 0.00189 osd.1 up 1.00000 1.00000
-7 0.00189 host ceph-osd2
2 hdd 0.00189 osd.2 up 1.00000 1.00000
[root@ceph-mon ~]#
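
The tree output above shows osd.0 as down. On a larger cluster it can help to filter the tree for such entries. The sketch below assumes the column layout shown above (id, class, weight, name, status); down_osds is a hypothetical helper that reads the tree from standard input.

```shell
#!/bin/sh
# Sketch: list OSDs reported as "down" in `ceph osd tree` output.
# Assumes OSD rows look like: <id> <class> <weight> osd.N <status> ...
# down_osds is an illustrative helper, not a Ceph command.

down_osds() {
    awk '$4 ~ /^osd\.[0-9]+$/ && $5 == "down" { print $4 }'
}

# On a live cluster this would be: ceph osd tree | down_osds
# Here the sample output from above is fed in instead.
down_osds <<'EOF'
-1 0.00757 root default
-3 0.00378 host ceph-osd0
0 hdd 0.00189 osd.0 down 0 1.00000
3 hdd 0.00189 osd.3 up 1.00000 1.00000
-5 0.00189 host ceph-osd1
1 hdd 0.00189 osd.1 up 1.00000 1.00000
EOF
# prints: osd.0
```

Host and root rows are skipped because their fourth column does not have the osd.N form.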




To edit the CRUSH map:


ceph osd getcrushmap -o crushmap.txt


crushtool -d crushmap.txt -o crushmap-decompile


nano crushmap-decompile




Edit at least one of Devices, Buckets and Rules:


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54


# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd


# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-osd0 {
    id -3 # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    # weight 0.004
    alg straw2
    hash 0 # rjenkins1
    item osd.0 weight 0.002
    item osd.3 weight 0.002
}
host ceph-osd1 {
    id -5 # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 0.002
    alg straw2
    hash 0 # rjenkins1
    item osd.1 weight 0.002
}
host ceph-osd2 {
    id -7 # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 0.002
    alg straw2
    hash 0 # rjenkins1
    item osd.2 weight 0.002
}
root default {
    id -1 # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.008
    alg straw2
    hash 0 # rjenkins1
    item ceph-osd0 weight 0.004
    item ceph-osd1 weight 0.002
    item ceph-osd2 weight 0.002
}


# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}


# end crush map
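
After editing, the map still has to be recompiled and loaded back into the cluster, which are the last two steps of the overview. A minimal sketch of those two steps follows. The file names are placeholders and crush_apply is a hypothetical helper. The CEPH and CRUSHTOOL variables are an assumption added here so that the commands can be previewed (for example with CEPH="echo ceph") on a machine without a cluster.

```shell
#!/bin/sh
# Sketch of the final steps: recompile the edited map and set it.
# crush_apply is an illustrative helper; file names are placeholders.

CEPH=${CEPH:-ceph}
CRUSHTOOL=${CRUSHTOOL:-crushtool}

crush_apply() {
    decompiled=$1
    compiled=$2
    # -c compiles the edited text map back into binary form.
    # $CRUSHTOOL and $CEPH are left unquoted so an override such as
    # "echo ceph" splits into a command plus its first argument.
    $CRUSHTOOL -c "$decompiled" -o "$compiled" || return 1
    # -i loads the compiled map into the running cluster.
    $CEPH osd setcrushmap -i "$compiled"
}

# Example, not run here:
#   crush_apply crushmap-decompile crushmap-new.bin
```

Stopping when compilation fails matters: a map with a syntax error should never reach the setcrushmap step.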



To add racks to the cluster CRUSH layout:


ceph osd crush add-bucket rack01 rack
ceph osd crush add-bucket rack02 rack


[root@ceph-mon ~]# ceph osd crush add-bucket rack01 rack
added bucket rack01 type rack to crush map
[root@ceph-mon ~]# ceph osd crush add-bucket rack02 rack
added bucket rack02 type rack to crush map
[root@ceph-mon ~]#
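
The new racks are created empty and outside the default root, so they have no effect until buckets are moved into place. The sketch below shows one possible next step using the crush move syntax described earlier, plus a replicated rule with rack as the failure domain. The host-to-rack assignment and the rule name rack_rule are assumptions, and run and rack_layout are hypothetical helpers. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: attach the racks to the root, move hosts into them, and create
# a rule that spreads replicas across racks.
# Assumptions: host-to-rack assignment and the rule name "rack_rule".

DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

rack_layout() {
    # Place the racks under the default root.
    run ceph osd crush move rack01 root=default
    run ceph osd crush move rack02 root=default
    # Move each host bucket (with its OSDs) into a rack.
    run ceph osd crush move ceph-osd0 rack=rack01
    run ceph osd crush move ceph-osd1 rack=rack01
    run ceph osd crush move ceph-osd2 rack=rack02
    # Replicated rule: root "default", failure domain "rack".
    run ceph osd crush rule create-replicated rack_rule default rack
}

rack_layout
```

A rule with rack as the failure domain needs at least as many racks as replicas, so with only two racks it suits a pool of size 2. Moving buckets triggers data movement, so check the result with ceph osd tree before assigning the rule to a pool.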




LPIC3-306 COURSE NOTES: CEPH – An Overview

These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering.

They are in “rough format”, presented as they were written.



LPIC3-306 Clustering – 363.2 Ceph Syllabus Requirements



Exam Weighting: 8


Description: Candidates should be able to manage and maintain a Ceph Cluster. This
includes the configuration of RGW, RBD devices and CephFS.


Key Knowledge Areas:
• Understand the architecture and components of Ceph
• Manage OSD, MGR, MON and MDS
• Understand and manage placement groups and pools
• Understand storage backends (FileStore and BlueStore)
• Initialize a Ceph cluster
• Create and manage Rados Block Devices
• Create and manage CephFS volumes, including snapshots
• Mount and use an existing CephFS
• Understand and adjust CRUSH maps


• Configure high availability aspects of Ceph
• Scale up a Ceph cluster
• Restore and verify the integrity of a Ceph cluster after an outage
• Understand key concepts of Ceph updates, including update order, tunables and


Partial list of the used files, terms and utilities:
• ceph-deploy (including relevant subcommands)
• ceph.conf
• ceph (including relevant subcommands)
• rados (including relevant subcommands)
• rbd (including relevant subcommands)
• cephfs (including relevant subcommands)
• ceph-volume (including relevant subcommands)
• ceph-authtool
• ceph-bluestore-tool
• crushtool




What is Ceph


Ceph is an open-source, massively scalable, software-defined storage (SDS) system.


It provides object, block and file system storage via a single clustered high-availability platform.


The intention of Ceph is to be a fully distributed system with no single point of failure which is self-healing and self-managing. Although production environment Ceph systems are best run on a high-grade hardware specification,  Ceph runs on standard commodity computer hardware.


An Overview of Ceph  



When Ceph services start, the initialization process activates a series of daemons that run in the background.


A Ceph Cluster runs with a minimum of three types of daemons:


Ceph Monitor (ceph-mon)


Ceph Manager (ceph-mgr)


Ceph OSD Daemon (ceph-osd)


Ceph Storage Clusters that support the Ceph File System also run at least one Ceph Metadata Server (ceph-mds).


Clusters that support Ceph Object Storage run Ceph RADOS Gateway daemons (radosgw) as well.



OSD or Object Storage Daemon:  An OSD stores data, handles data replication, recovery, backfilling, and rebalancing. An OSD also provides monitoring data for Ceph Monitors by checking other Ceph OSD Daemons for an active heartbeat.  A Ceph Storage Cluster requires at least two Ceph OSD Daemons in order to maintain an active + clean state.


Monitor or Mon: maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.


Ceph also maintains a history or “epoch” of each state change in the Monitors, Ceph OSD Daemons, and the PGs.


Metadata Server or MDS: The MDS holds metadata relating to the Ceph Filesystem and enables POSIX file system users to execute standard POSIX commands such as ls, find, etc. without creating overhead on the Ceph Storage Cluster. MDS is only required if you are intending to run CephFS. It is not necessary if only block and object storage is to be used.  


A Ceph Storage Cluster requires at least one Ceph Monitor, one Ceph Manager, and at least one Ceph OSD (Object Storage Daemon); two or more OSDs are needed for replication. A Ceph Metadata Server (MDS) is additionally required only if CephFS is used.


Ceph stores data in the form of objects within logical storage pools. The CRUSH algorithm is used by Ceph to decide which placement group should contain the object and which Ceph OSD Daemon should store the placement group.


The CRUSH algorithm is also used by Ceph to scale, rebalance, and recover from failures.


Note that, at the time of writing, the newer versions of Ceph are not supported on Debian. Ceph is in general much better supported on CentOS, since Red Hat sponsors both CentOS and Ceph.


Ceph-deploy now replaced by cephadm


NOTE that ceph-deploy is now an outdated tool and is no longer maintained. It is also not available for CentOS 8. You should use either an installation method such as the above or the cephadm tool to install Ceph on cluster nodes. However, at the time of writing, a working knowledge of ceph-deploy is still required for the LPIC3 exam.


For more on cephadm see https://ceph.io/ceph-management/introducing-cephadm/




The client nodes know about monitors, OSDs and MDSs but have no knowledge of object locations. Ceph clients communicate directly with the OSDs rather than going through a dedicated server.


The OSDs (Object Storage Daemons) store the data. They can be up and in the map, or down and out if they have failed. An OSD can also be down but still in the map, which means that its PGs have not yet been remapped. When OSDs come online they inform the monitor.


The Monitor nodes store a master copy of the cluster map.



RADOS (Reliable Autonomic Distributed Object Store)


RADOS  makes up the heart of the scalable object storage service. 


In addition to accessing RADOS via the defined interfaces, it is also possible to access RADOS directly via a set of library calls.



CRUSH (Controlled Replication Under Scalable Hashing)


The CRUSH map contains the topology of the system and is location aware. Objects are mapped to Placement Groups, and Placement Groups are in turn mapped to OSDs. This allows for dynamic rebalancing and controls which Placement Group holds the objects. It also defines which of the OSDs should hold the Placement Group.


The CRUSH map holds a list of OSDs, buckets and rules that hold replication directives.


CRUSH tries to move as little data as possible during rebalancing, whereas a plain hash function would be likely to cause much greater data movement.



The CRUSH map allows for different resiliency models. The osd crush chooseleaf type setting in ceph.conf selects one of them by number:


#0 for a 1-node cluster.


#1 for a multi node cluster in a single rack


#2 for a multi node, multi chassis cluster with multiple hosts in a chassis


#3 for a multi node cluster with hosts across racks, etc.


osd crush chooseleaf type = {n}
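

As a rough illustration of the setting above, this shell sketch prints a ceph.conf fragment for a chosen failure domain. The function name chooseleaf_fragment is made up for this example, and it assumes that the names osd, host, chassis and rack correspond to types 0 to 3 of the default CRUSH hierarchy.

```shell
# Print a ceph.conf fragment for the given failure domain.
# Assumption: default CRUSH type numbering (0=osd, 1=host, 2=chassis, 3=rack).
chooseleaf_fragment() (
    case $1 in
        osd)     n=0 ;;   # 1-node cluster: replicas on different OSDs of one host
        host)    n=1 ;;   # multi node cluster in a single rack
        chassis) n=2 ;;   # multiple hosts per chassis
        rack)    n=3 ;;   # hosts spread across racks
        *) echo "unknown failure domain: $1" >&2; exit 1 ;;
    esac
    printf '[global]\nosd crush chooseleaf type = %s\n' "$n"
)

chooseleaf_fragment host
```

For a single-node lab you would call chooseleaf_fragment osd instead, so that replicas may land on different OSDs of the same host.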




Buckets are a hierarchical structure of storage locations; a bucket in the CRUSH map context is a location.


Placement Groups (PGs)


Ceph subdivides a storage pool into placement groups, assigning each individual object to a placement group, and then assigns the placement group to a primary OSD.


If an OSD node fails or the cluster re-balances, Ceph is able to replicate or move a placement group and all the objects stored within it without the need to move or replicate each object individually. This allows for an efficient re-balancing or recovery of the Ceph cluster.


Objects are mapped to Placement Groups by hashing the object’s name along with the replication factor and a bitmask.
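

The idea can be tried out with a toy sketch. This is not Ceph's real algorithm: cksum (a CRC) stands in for Ceph's own hash function, the replication factor is left out, and the function name object_to_pg is invented for this example. It only shows how a hash of the object name plus a bitmask gives a stable PG number.

```shell
# Toy sketch: cksum (a CRC) stands in for Ceph's own object-name hash,
# and the pool's replication factor is left out.
object_to_pg() (
    name=$1      # object name
    pg_num=$2    # placement groups in the pool; assumed to be a power of two
    hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    # With a power-of-two pg_num, masking with pg_num - 1 equals hash % pg_num.
    echo $(( hash & (pg_num - 1) ))
)

object_to_pg my-object 128
```

The same object name always lands in the same PG, so a client can work out the location itself without asking a central lookup server.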



When you create a pool, Ceph automatically creates a number of placement groups for it. If you don’t specify the number of placement groups, Ceph uses the default value of 8, which is extremely low.


A more useful default value is 128. For example:


osd pool default pg num = 128
osd pool default pgp num = 128


You need to set both pg_num (the total number of placement groups) and pgp_num (the number of placement groups used for data placement) to the same value. As a general guide, use the following values:


Fewer than 5 OSDs: set pg_num and pgp_num to 128.
Between 5 and 10 OSDs: set pg_num and pgp_num to 512.
Between 10 and 50 OSDs: set pg_num and pgp_num to 4096.
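

The guide values above can be wrapped in a small helper. This is only a sketch of the table, not an official sizing formula. The function name suggest_pg_num is invented here, and since the table overlaps at exactly 10 OSDs, the sketch assumes the lower value there.

```shell
# Return the guide value for pg_num/pgp_num from the table above.
# Assumption: exactly 10 OSDs falls into the "5 to 10" row.
suggest_pg_num() (
    osds=$1
    if [ "$osds" -lt 5 ]; then
        echo 128
    elif [ "$osds" -le 10 ]; then
        echo 512
    elif [ "$osds" -le 50 ]; then
        echo 4096
    else
        echo "more than 50 OSDs: the table gives no value" >&2
        exit 1
    fi
)

suggest_pg_num 3
```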



To specifically define the number of PGs:


To set pg_num for a pool:


ceph osd pool set {pool-name} pg_num {pg_num}



To set pgp_num for a pool:


ceph osd pool set {pool-name} pgp_num {pgp_num}
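

Since both values should match, the two commands can be run as a pair. The following is a minimal sketch: the helper names run and set_pool_pgs and the pool name mypool are made up for illustration, and by default it only prints the commands (DRY_RUN=1) so it can be tried without a cluster.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Set pg_num first, then pgp_num to the same value.
set_pool_pgs() (
    pool=$1
    pgs=$2
    run ceph osd pool set "$pool" pg_num "$pgs" &&
    run ceph osd pool set "$pool" pgp_num "$pgs"
)

set_pool_pgs mypool 128
```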


How To Create OSD Nodes on Ceph Using ceph-deploy



BlueStore is now the default storage backend used for Ceph OSDs.


Before you add a BlueStore OSD node to Ceph, first delete all data on the device/s that will serve as OSDs.


You can do this with the zap command:


$CEPH_CONFIG_DIR/ceph-deploy disk zap node device


Replace node with the node name or host name where the disk is located.


Replace device with the path to the device on the host where the disk is located.


For example, to delete the data on a device named /dev/sdc on a node named ceph-node3 in the Ceph Storage Cluster, use:


$CEPH_CONFIG_DIR/ceph-deploy disk zap ceph-node3 /dev/sdc



Next, to create a BlueStore OSD, enter:


$CEPH_CONFIG_DIR/ceph-deploy osd create --data device node


This creates a volume group and logical volume on the specified disk. Both data and journal are stored on the same logical volume.




For example, using /dev/sdc on ceph-node3:


$CEPH_CONFIG_DIR/ceph-deploy osd create --data /dev/sdc ceph-node3




How To Create A FileStore OSD Manually


Quoted from the Ceph website:


FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.


FileStore is well-tested and widely used in production but suffers from many performance deficiencies due to its overall design and reliance on a traditional file system for storing object data.


Although FileStore is generally capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we only recommend that XFS be used. Both btrfs and ext4 have known bugs and deficiencies and their use may lead to data loss. By default all Ceph provisioning tools will use XFS.


The official Ceph default storage system is now BlueStore. Prior to the Luminous release of Ceph, the default (and only option available) was FileStore.



Note the instructions below create a FileStore and not a BlueStore system!


To create a FileStore OSD manually, i.e. without using ceph-deploy or cephadm, first create the required partitions on the OSD node concerned: one for data, one for journal.


This example creates a 40 GB data partition on /dev/sdc1 and a journal partition of roughly 13 GB on /dev/sdc2:



parted /dev/sdc --script -- mklabel gpt
parted --script /dev/sdc mkpart primary 0MB 40000MB
parted --script /dev/sdc mkpart primary 42000MB 55000MB


dd if=/dev/zero of=/dev/sdc1 bs=1M count=1000


sgdisk --zap-all --clear --mbrtogpt -g -- /dev/sdc2


ceph-volume lvm zap /dev/sdc2




From the deployment node, create the FileStore OSD. To select FileStore and its file system type, use --filestore and --fs-type.


For example, to create a FileStore OSD with an XFS filesystem:


$CEPH_CONFIG_DIR/ceph-deploy osd create --filestore --fs-type xfs --data /dev/sdc1 --journal /dev/sdc2 ceph-node2



What is BlueStore?


Any new OSDs (e.g., when the cluster is expanded) can be deployed using BlueStore. This is the default behavior so no specific change is needed.


There are two methods OSDs can use to manage the data they store: FileStore and BlueStore.


The default is now BlueStore. Prior to the Luminous release of Ceph, the default (and only option available) was FileStore.


BlueStore is a new back-end object storage system for Ceph OSD daemons. The original object store used by Ceph, FileStore, required a file system placed on top of raw block devices. Objects were then written to the file system.


By contrast, BlueStore does not require a file system for itself, because BlueStore stores objects directly on the block device. This improves cluster performance as it removes file system overhead.


BlueStore can use different block devices for storing different data. For example, it can use Hard Disk Drive (HDD) storage for data, Solid-state Drive (SSD) storage for metadata, and Non-volatile Memory (NVM) or Non-volatile RAM (NVRAM) for the RocksDB WAL (write-ahead log).


In the simplest implementation, BlueStore resides on a single storage device which is partitioned into two parts: one containing the OSD metadata and one containing the actual data.


The OSD metadata partition is formatted with XFS and holds information about the OSD, such as its identifier, the cluster it belongs to, and its private keyring.


The data partition contains the actual OSD data and is managed by BlueStore. This primary partition is identified by a block symbolic link in the data directory.


Two additional devices can also be implemented:


A WAL (write-ahead-log) device: This contains the BlueStore internal journal or write-ahead log and is identified by the block.wal symbolic link in the data directory.


Best practice is to use an SSD to implement a WAL device in order to provide optimum performance.



A DB device: this stores BlueStore internal metadata. The embedded RocksDB database will then place as much metadata as possible on the DB device instead of on the primary device to optimize performance.


Only if the DB device becomes full will it then place metadata on the primary device. As with the WAL device, best practice for the BlueStore DB device is to use an SSD.
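

On the command line, the extra devices are passed to ceph-volume when the OSD is created. The sketch below only builds and prints the command line. It uses the --bluestore, --data, --block.db and --block.wal options of ceph-volume lvm create; the function name bluestore_create_cmd and the device names are hypothetical examples, not values from this lab.

```shell
# Build a "ceph-volume lvm create" command line for a BlueStore OSD.
# Arguments: data device, optional DB device, optional WAL device.
# Pass "" as the DB device to give a WAL device only.
bluestore_create_cmd() (
    data=$1
    db=${2:-}
    wal=${3:-}
    cmd="ceph-volume lvm create --bluestore --data $data"
    if [ -n "$db" ]; then
        cmd="$cmd --block.db $db"
    fi
    if [ -n "$wal" ]; then
        cmd="$cmd --block.wal $wal"
    fi
    echo "$cmd"
)

# Hypothetical layout: HDD for data, two SSD partitions for DB and WAL.
bluestore_create_cmd /dev/sdb /dev/sdc1 /dev/sdc2
```

Check the printed command before running it on the OSD node, as it will take over the devices named.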




Starting and Stopping Ceph


To start all Ceph daemons:


[root@admin ~]# systemctl start ceph.target


To stop all Ceph daemons:


[root@admin ~]# systemctl stop ceph.target


To restart all Ceph daemons:


[root@admin ~]# systemctl restart ceph.target


To start, stop, and restart individual Ceph daemons:



On Ceph Monitor nodes:


systemctl start ceph-mon.target


systemctl stop ceph-mon.target


systemctl restart ceph-mon.target


On Ceph Manager nodes:


systemctl start ceph-mgr.target


systemctl stop ceph-mgr.target


systemctl restart ceph-mgr.target


On Ceph OSD nodes:


systemctl start ceph-osd.target


systemctl stop ceph-osd.target


systemctl restart ceph-osd.target


On Ceph Object Gateway nodes:


systemctl start ceph-radosgw.target


systemctl stop ceph-radosgw.target


systemctl restart ceph-radosgw.target



To start, stop, or restart a specific Ceph monitor, manager, OSD or object gateway instance:


On a Ceph Monitor node:


systemctl start ceph-mon@$MONITOR_HOST_NAME
systemctl stop ceph-mon@$MONITOR_HOST_NAME
systemctl restart ceph-mon@$MONITOR_HOST_NAME


On a Ceph Manager node:

systemctl start ceph-mgr@$MANAGER_HOST_NAME
systemctl stop ceph-mgr@$MANAGER_HOST_NAME
systemctl restart ceph-mgr@$MANAGER_HOST_NAME



On a Ceph OSD node:


systemctl start ceph-osd@$OSD_NUMBER
systemctl stop ceph-osd@$OSD_NUMBER
systemctl restart ceph-osd@$OSD_NUMBER


Substitute $OSD_NUMBER with the ID number of the Ceph OSD.


On a Ceph Object Gateway node:


systemctl start ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl stop ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl restart ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
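

The unit naming pattern used above can be summed up in a small helper. This is just a sketch: the function name ceph_unit and the short type names mon, mgr, osd and rgw are invented for this example, and it covers only the four daemon types shown in this section.

```shell
# Print the systemd unit name for one Ceph daemon instance.
# $1 = daemon type (mon, mgr, osd, rgw); $2 = host name, or OSD ID for osd.
ceph_unit() (
    kind=$1
    id=$2
    case $kind in
        mon|mgr|osd) echo "ceph-$kind@$id" ;;
        rgw)         echo "ceph-radosgw@rgw.$id" ;;
        *) echo "unknown daemon type: $kind" >&2; exit 1 ;;
    esac
)

ceph_unit osd 3
```

It can then be combined with systemctl, for example systemctl restart "$(ceph_unit osd 3)".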



Placement Group (PG) Information


To display the number of placement groups in a pool:


ceph osd pool get {pool-name} pg_num



To display statistics for the placement groups in the cluster:


ceph pg dump [--format {format}]



How To Check Status of the Ceph Cluster



To check the status and health of the cluster from the administration node, use:


ceph health
ceph status


Note that it can often take several minutes for the cluster to stabilize before the cluster health indicates HEALTH_OK.


You can also check the cluster quorum status of the cluster monitors:


ceph quorum_status --format json-pretty
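

Because the cluster can take a while to settle, it can be handy to poll ceph health rather than re-running it by hand. The following is a minimal sketch: the function name wait_for_health_ok and its two parameters (number of checks, seconds between checks) are assumptions made for this example.

```shell
# Poll "ceph health" until it reports HEALTH_OK.
# $1 = maximum number of checks (default 30); $2 = seconds between checks (default 10).
wait_for_health_ok() (
    tries=${1:-30}
    delay=${2:-10}
    i=0
    status=unknown
    while [ "$i" -lt "$tries" ]; do
        status=$(ceph health 2>/dev/null) || status=unknown
        case $status in
            HEALTH_OK*) echo "$status"; exit 0 ;;
        esac
        i=$(( i + 1 ))
        sleep "$delay"
    done
    echo "cluster not healthy after $tries checks: $status" >&2
    exit 1
)

# Usage (on the administration node):
#   wait_for_health_ok 30 10 && ceph status
```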



For more Ceph admin commands, see https://sabaini.at/pages/ceph-cheatsheet.html#monit



The ceph.conf File


Each Ceph daemon looks for a ceph.conf file that contains its configuration settings.  For manual deployments, you need to create a ceph.conf file to define your cluster.


ceph.conf contains the following definitions:


Cluster membership
Host names
Host addresses
Paths to keyrings
Paths to journals
Paths to data
Other runtime options


The default ceph.conf locations in sequential order are as follows:


$CEPH_CONF (i.e., the path following the $CEPH_CONF environment variable)


-c path/path (i.e., the -c command line argument)


/etc/ceph/ceph.conf


~/.ceph/ceph.conf


./ceph.conf (i.e., in the current working directory)
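

The search order can be modelled with a short function. This is only a sketch of the documented order, not of how the Ceph tools actually handle their arguments. The function name find_ceph_conf is invented here, and the /etc/ceph/ceph.conf and ~/.ceph/ceph.conf entries are assumptions taken from the Ceph documentation for the default cluster name "ceph".

```shell
# Print the first ceph.conf found, following the documented search order.
# $1 = path that would be given with -c (optional).
find_ceph_conf() (
    explicit=${1:-}
    for candidate in \
        "${CEPH_CONF:-}" \
        "$explicit" \
        /etc/ceph/ceph.conf \
        "${HOME:-}/.ceph/ceph.conf" \
        ./ceph.conf
    do
        # Skip unset entries and paths that do not exist.
        if [ -n "$candidate" ] && [ -f "$candidate" ]; then
            echo "$candidate"
            exit 0
        fi
    done
    echo "no ceph.conf found" >&2
    exit 1
)

# Usage:
#   find_ceph_conf                # default locations only
#   find_ceph_conf /tmp/lab.conf  # as if -c /tmp/lab.conf had been given
```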


ceph-conf is a utility for getting information from a ceph configuration file.


As with most Ceph programs, you can specify which Ceph configuration file to use with the -c flag.



For example, ceph-conf -L lists all sections.



Ceph Journals 


Note that journals are used only on FileStore.


BlueStore does not use a separate journal (it has its own internal write-ahead log instead), so journals are not explicitly defined for BlueStore systems.



How To List Your Cluster Pools


To list your cluster pools, execute:


ceph osd lspools


Rename a Pool


To rename a pool, execute:


ceph osd pool rename <current-pool-name> <new-pool-name>
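

Before renaming, it is worth checking that the old pool exists and the new name is free. The sketch below assumes that ceph osd pool ls prints one pool name per line; the function names pool_exists and safe_rename_pool are made up for this example.

```shell
# Succeeds if a pool with exactly this name exists.
# Assumption: "ceph osd pool ls" prints one pool name per line.
pool_exists() {
    ceph osd pool ls 2>/dev/null | grep -Fqx -- "$1"
}

# Rename a pool only if the old name exists and the new name is unused.
safe_rename_pool() (
    old=$1
    new=$2
    if ! pool_exists "$old"; then
        echo "no such pool: $old" >&2
        exit 1
    fi
    if pool_exists "$new"; then
        echo "pool already exists: $new" >&2
        exit 1
    fi
    ceph osd pool rename "$old" "$new"
)

# Usage:
#   safe_rename_pool <current-pool-name> <new-pool-name>
```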



How To Install Pacemaker and Corosync on Centos

This article sets out how to install the clustering management software Pacemaker and the cluster membership software Corosync on Centos version 8.


For this example, we are setting up a three node cluster using virtual machines on the Linux KVM hypervisor platform.


The virtual machines have the KVM names and hostnames centos1, centos2, and centos3.


Each node has two network interfaces: one for the KVM bridged NAT network (KVM network name: default via eth0) and the other for the cluster subnet (KVM network name:network- via eth1). DHCP is not used for either of these interfaces. Pacemaker and Corosync require static IP addresses.


The machine centos1 will be our current designated co-ordinator (DC) cluster node.


First, make sure you have created an SSH key for root on the first node:


[root@centos1 .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:********** root@centos1.localdomain


then copy the ssh key to the other nodes:


ssh-copy-id centos2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed


/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
(if you think this is a mistake, you may want to use -f option)


[root@centos1 .ssh]#


Next, you need to enable the HighAvailability repository:


[root@centos1 ~]# yum repolist all | grep -i HighAvailability
ha CentOS Stream 8 - HighAvailability disabled
[root@centos1 ~]# dnf config-manager --set-enabled ha
[root@centos1 ~]# yum repolist all | grep -i HighAvailability
ha CentOS Stream 8 - HighAvailability enabled
[root@centos1 ~]#


Next, install the following packages:


[root@centos1 ~]# yum install epel-release


[root@centos1 ~]# yum install pcs fence-agents-all


Next, stop and disable the firewall (for lab testing convenience only):


[root@centos1 ~]# systemctl stop firewalld
[root@centos1 ~]#
[root@centos1 ~]# systemctl disable firewalld
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@centos1 ~]#


then check with:


[root@centos1 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)


Next, enable pcsd, the daemon of the pcs (Pacemaker/Corosync configuration system) tool:


[root@centos1 ~]# systemctl enable --now pcsd
Created symlink /etc/systemd/system/multi-user.target.wants/pcsd.service → /usr/lib/systemd/system/pcsd.service.
[root@centos1 ~]#


Then set the password for user hacluster (on every node):


echo | passwd --stdin hacluster


Changing password for user hacluster.

passwd: all authentication tokens updated successfully.
[root@centos2 ~]#


Then, on only ONE of the nodes (I am doing it on centos1 on the KVM cluster, as this will be the default DC for the cluster), authenticate the nodes:


pcs host auth centos1.localdomain centos2.localdomain centos3.localdomain


NOTE: the correct command is pcs host auth, not pcs cluster auth as shown in some instruction material. The syntax has since changed.


[root@centos1 .ssh]# pcs host auth centos1.localdomain centos2.localdomain centos3.localdomain
Username: hacluster
centos1.localdomain: Authorized
centos2.localdomain: Authorized
centos3.localdomain: Authorized
[root@centos1 .ssh]#


Next, on centos1, as this will be our default DC (designated coordinator node) we create a corosync secret key:


[root@centos1 corosync]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 2048 bits for key from /dev/urandom.
Writing corosync key to /etc/corosync/authkey.
[root@centos1 corosync]#


Then copy the key to the other two nodes:


scp /etc/corosync/authkey centos2:/etc/corosync/
scp /etc/corosync/authkey centos3:/etc/corosync/


Next, set up the cluster (named hacluster here):


[root@centos1 corosync]# pcs cluster setup hacluster centos1.localdomain addr= centos2.localdomain addr= centos3.localdomain addr=
Sending 'corosync authkey', 'pacemaker authkey' to 'centos1.localdomain', 'centos2.localdomain', 'centos3.localdomain'
centos1.localdomain: successful distribution of the file 'corosync authkey'
centos1.localdomain: successful distribution of the file 'pacemaker authkey'
centos2.localdomain: successful distribution of the file 'corosync authkey'
centos2.localdomain: successful distribution of the file 'pacemaker authkey'
centos3.localdomain: successful distribution of the file 'corosync authkey'
centos3.localdomain: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'centos1.localdomain', 'centos2.localdomain', 'centos3.localdomain'
centos1.localdomain: successful distribution of the file 'corosync.conf'
centos2.localdomain: successful distribution of the file 'corosync.conf'
centos3.localdomain: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
[root@centos1 corosync]#


Note that I had to specify the IP addresses for the nodes. This is because each of these nodes has TWO network interfaces with separate IP addresses. If the nodes had only one network interface, you could leave out the addr= setting.
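

To make the node/addr= pairing clearer, here is a small sketch that builds the setup command from name=address pairs. The function name cluster_setup_cmd is invented, and the 192.0.2.x addresses are placeholders from the documentation address range, not the addresses used in this lab. It only prints the command.

```shell
# Build a "pcs cluster setup" command line.
# $1 = cluster name; remaining arguments are node=address pairs.
cluster_setup_cmd() (
    name=$1
    shift
    cmd="pcs cluster setup $name"
    for pair in "$@"; do
        node=${pair%%=*}   # text before the first "="
        addr=${pair#*=}    # text after the first "="
        cmd="$cmd $node addr=$addr"
    done
    echo "$cmd"
)

# Placeholder addresses: replace with the cluster subnet address of each node.
cluster_setup_cmd hacluster \
    centos1.localdomain=192.0.2.11 \
    centos2.localdomain=192.0.2.12 \
    centos3.localdomain=192.0.2.13
```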


Next you can start the cluster:


[root@centos1 corosync]# pcs cluster start
Starting Cluster…
[root@centos1 corosync]#
[root@centos1 corosync]#
[root@centos1 corosync]# pcs cluster status
Cluster Status:
Cluster Summary:
* Stack: unknown
* Current DC: NONE
* Last updated: Mon Feb 22 12:57:37 2021
* Last change: Mon Feb 22 12:57:35 2021 by hacluster via crmd on centos1.localdomain
* 3 nodes configured
* 0 resource instances configured
Node List:
* Node centos1.localdomain: UNCLEAN (offline)
* Node centos2.localdomain: UNCLEAN (offline)
* Node centos3.localdomain: UNCLEAN (offline)


PCSD Status:
centos1.localdomain: Online
centos3.localdomain: Online
centos2.localdomain: Online
[root@centos1 corosync]#



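The Node List shows every node as “UNCLEAN”. Without arguments, pcs cluster start only starts the cluster on the local node (pcs cluster start --all starts it on all nodes in one step).


So I started each node explicitly: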


pcs cluster start centos1.localdomain
pcs cluster start centos2.localdomain
pcs cluster start centos3.localdomain
pcs cluster status


After that, the cluster was running in a clean state:


[root@centos1 cluster]# pcs cluster status
Cluster Status:
Cluster Summary:
* Stack: corosync
* Current DC: centos1.localdomain (version 2.0.5-7.el8-ba59be7122) - partition with quorum
* Last updated: Mon Feb 22 13:22:29 2021
* Last change: Mon Feb 22 13:17:44 2021 by hacluster via crmd on centos1.localdomain
* 3 nodes configured
* 0 resource instances configured
Node List:
* Online: [ centos1.localdomain centos2.localdomain centos3.localdomain ]


PCSD Status:
centos1.localdomain: Online
centos2.localdomain: Online
centos3.localdomain: Online
[root@centos1 cluster]#
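

The difference between the two status outputs above can also be checked in a script. This is a rough sketch that looks only at the text of pcs cluster status as shown in this article; the function name cluster_is_clean is made up for this example.

```shell
# Reads "pcs cluster status" output on stdin.
# Succeeds only if no node is UNCLEAN and the partition has quorum.
cluster_is_clean() (
    status=$(cat)
    case $status in
        *UNCLEAN*) exit 1 ;;
    esac
    case $status in
        *"partition with quorum"*) exit 0 ;;
    esac
    exit 1
)

# Usage (on a cluster node):
#   pcs cluster status | cluster_is_clean && echo "cluster is up"
```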
