LPIC3-306 COURSE NOTES: CEPH – An Overview
These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering.
They are in “rough format”, presented as they were written.
LPIC3-306 Clustering – 363.2 Ceph Syllabus Requirements
Exam Weighting: 8
Description: Candidates should be able to manage and maintain a Ceph Cluster. This
includes the configuration of RGW, RBD devices and CephFS.
Key Knowledge Areas:
• Understand the architecture and components of Ceph
• Manage OSD, MGR, MON and MDS
• Understand and manage placement groups and pools
• Understand storage backends (FileStore and BlueStore)
• Initialize a Ceph cluster
• Create and manage Rados Block Devices
• Create and manage CephFS volumes, including snapshots
• Mount and use an existing CephFS
• Understand and adjust CRUSH maps
• Configure high availability aspects of Ceph
• Scale up a Ceph cluster
• Restore and verify the integrity of a Ceph cluster after an outage
• Understand key concepts of Ceph updates, including update order, tunables and features
Partial list of the used files, terms and utilities:
• ceph-deploy (including relevant subcommands)
• ceph.conf
• ceph (including relevant subcommands)
• rados (including relevant subcommands)
• rbd (including relevant subcommands)
• cephfs (including relevant subcommands)
• ceph-volume (including relevant subcommands)
• ceph-authtool
• ceph-bluestore-tool
• crushtool
What is Ceph?
Ceph is an open-source, massively scalable, software-defined storage system or “SDS”
It provides object, block and file system storage via a single clustered high-availability platform.
The intention of Ceph is to be a fully distributed system with no single point of failure which is self-healing and self-managing. Although production environment Ceph systems are best run on a high-grade hardware specification, Ceph runs on standard commodity computer hardware.
An Overview of Ceph
When Ceph services start, the initialization process activates a series of daemons that run in the background.
A Ceph Cluster runs with a minimum of three types of daemons:
Ceph Monitor (ceph-mon)
Ceph Manager (ceph-mgr)
Ceph OSD Daemon (ceph-osd)
Ceph Storage Clusters that support the Ceph File System also run at least one Ceph Metadata Server (ceph-mds).
Clusters that support Ceph Object Storage run Ceph RADOS Gateway daemons (radosgw) as well.
OSD or Object Storage Daemon: An OSD stores data, handles data replication, recovery, backfilling, and rebalancing. An OSD also provides monitoring data for Ceph Monitors by checking other Ceph OSD Daemons for an active heartbeat. A Ceph Storage Cluster requires at least two Ceph OSD Daemons in order to maintain an active + clean state (three or more with the default replica count of three).
Monitor or Mon: maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
Ceph also maintains a history or “epoch” of each state change in the Monitors, Ceph OSD Daemons, and the PGs.
Metadata Server or MDS: The MDS holds metadata relating to the Ceph Filesystem and enables POSIX file system users to execute standard POSIX commands such as ls, find, etc. without creating overhead on the Ceph Storage Cluster. MDS is only required if you are intending to run CephFS. It is not necessary if only block and object storage is to be used.
A Ceph Storage Cluster requires at least one Ceph Monitor, one Ceph Manager, and at least one (preferably two or more) Ceph OSD or Object Storage Daemon servers; a Ceph Metadata Server (MDS) is additionally required if CephFS is used.
Ceph stores data in the form of objects within logical storage pools. The CRUSH algorithm is used by Ceph to decide which placement group should contain the object and which Ceph OSD Daemon should store the placement group.
The CRUSH algorithm is also used by Ceph to scale, rebalance, and recover from failures.
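The various cluster maps mentioned above can be inspected from a node holding the admin keyring, for example with:
ceph mon dump
ceph osd dump
ceph pg dump
ceph osd crush dump
These print the monitor map, the OSD map, placement group statistics and the CRUSH map (in JSON form) respectively.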
Note that newer versions of Ceph are not well supported on Debian. Ceph is in general much better supported on CentOS, since Red Hat maintains both CentOS and Ceph.
Ceph-deploy now replaced by cephadm
NOTE that ceph-deploy is now an outdated tool and is no longer maintained. It is also not available for CentOS 8. You should either use an installation method such as the above or, alternatively, use the cephadm tool to install Ceph on cluster nodes. However, a working knowledge of ceph-deploy is, at the time of writing, still required for the LPIC3 exam.
For more on cephadm see https://ceph.io/ceph-management/introducing-cephadm/
The Client nodes know about monitors, OSDs and MDS’s but have no knowledge of object locations. Ceph clients communicate directly with the OSDs rather than going through a dedicated server.
The OSDs (Object Storage Daemons) store the data. They can be up and in the map, or down and out if they have failed. An OSD can be down but still in the map, which means that the PG has not yet been remapped. When OSDs come online they inform the monitor.
The Monitor nodes store a master copy of the cluster map.
RADOS (Reliable Autonomic Distributed Object Store)
RADOS makes up the heart of the scalable object storage service.
In addition to accessing RADOS via the defined interfaces, it is also possible to access RADOS directly via a set of library calls.
CRUSH (Controlled Replication Under Scalable Hashing)
The CRUSH map contains the topology of the system and is location aware. Objects are mapped to Placement Groups, and Placement Groups are in turn mapped to OSDs. This allows for dynamic rebalancing and controls which Placement Group holds the objects and which of the OSDs should hold the Placement Group.
The CRUSH map holds a list of OSDs, buckets and rules that hold replication directives.
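To inspect or adjust these entries, the compiled CRUSH map can be extracted from the cluster, decompiled with crushtool, edited and re-injected. A typical sequence (the file names are only examples) is:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
(edit crushmap.txt as required, then recompile and re-inject it)
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin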
CRUSH will try not to move data during rebalancing whereas a true hash function would be likely to cause greater data movement.
The CRUSH map allows for different resiliency models such as:
#0 for a 1-node cluster.
#1 for a multi node cluster in a single rack
#2 for a multi node, multi chassis cluster with multiple hosts in a chassis
#3 for a multi node cluster with hosts across racks, etc.
osd crush chooseleaf type = {n}
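For example, a ceph.conf fragment for a multi-node cluster within a single rack (replication across hosts) might contain:
[global]
osd crush chooseleaf type = 1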
Buckets
Buckets are a hierarchical structure of storage locations; a bucket in the CRUSH map context is a location.
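Buckets can be created and moved within the hierarchy from the command line. For example, assuming a hypothetical rack named rack1 and a host named ceph-node1:
ceph osd crush add-bucket rack1 rack
ceph osd crush move ceph-node1 rack=rack1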
Placement Groups (PGs)
Ceph subdivides a storage pool into placement groups, assigning each individual object to a placement group, and then assigns the placement group to a primary OSD.
If an OSD node fails or the cluster re-balances, Ceph is able to replicate or move a placement group and all the objects stored within it without the need to move or replicate each object individually. This allows for an efficient re-balancing or recovery of the Ceph cluster.
Objects are mapped to Placement Groups by hashing the object’s name along with the replication factor and a bitmask.
When you create a pool, a number of placement groups are automatically created by Ceph for the pool. If you don’t directly specify a number of placement groups, Ceph uses the default value of 8 which is extremely low.
A more useful default value is 128. For example:
osd pool default pg num = 128
osd pool default pgp num = 128
You need to set both pg_num (the total number of placement groups) and pgp_num (the number of placement groups considered for data placement) to the same value. As a general guide, use the following values:
Less than 5 OSDs: set pg_num and pgp_num to 128.
Between 5 and 10 OSDs: set pg_num and pgp_num to 512.
Between 10 and 50 OSDs: set pg_num and pgp_num to 1024.
To explicitly set the number of PGs for an existing pool:
ceph osd pool set {pool-name} pg_num {pg_num}
ceph osd pool set {pool-name} pgp_num {pgp_num}
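For example, assuming a pool named mypool:
ceph osd pool set mypool pg_num 128
ceph osd pool set mypool pgp_num 128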
How To Create OSD Nodes on Ceph Using ceph-deploy
BlueStore is now the default storage backend used for Ceph OSDs.
Before you add a BlueStore OSD node to Ceph, first delete all data on the device(s) that will serve as OSDs.
You can do this with the zap command:
$CEPH_CONFIG_DIR/ceph-deploy disk zap node device
Replace node with the node name or host name where the disk is located.
Replace device with the path to the device on the host where the disk is located.
Eg to delete the data on a device named /dev/sdc on a node named ceph-node3 in the Ceph Storage Cluster, use:
$CEPH_CONFIG_DIR/ceph-deploy disk zap ceph-node3 /dev/sdc
Next, to create a BlueStore OSD, enter:
$CEPH_CONFIG_DIR/ceph-deploy osd create --data device node
This creates a volume group and logical volume on the specified disk. Both the object data and the internal metadata are stored on the same logical volume.
Eg
$CEPH_CONFIG_DIR/ceph-deploy osd create --data /dev/sdc ceph-node3
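To verify the result, you can for example run ceph-volume lvm list on the OSD node to show the newly created LVM-backed OSD, and ceph osd tree on the admin node to confirm that the OSD is up and in:
ceph-volume lvm list
ceph osd tree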
How To Create A FileStore OSD Manually
Quoted from the Ceph website:
FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.
FileStore is well-tested and widely used in production but suffers from many performance deficiencies due to its overall design and reliance on a traditional file system for storing object data.
Although FileStore is generally capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we only recommend that XFS be used. Both btrfs and ext4 have known bugs and deficiencies and their use may lead to data loss. By default all Ceph provisioning tools will use XFS.
The official Ceph default storage system is now BlueStore. Prior to Ceph version Luminous, the default (and only option available) was Filestore.
Note the instructions below create a FileStore and not a BlueStore system!
To create a FileStore OSD manually, i.e. without using ceph-deploy or cephadm, first create the required partitions on the OSD node concerned: one for data, one for the journal.
This example creates a 40 GB data partition on /dev/sdc1 and a journal partition of 12GB on /dev/sdc2:
parted /dev/sdc --script -- mklabel gpt
parted --script /dev/sdc mkpart primary 0MB 40000MB
parted --script /dev/sdc mkpart primary 42000MB 55000MB
dd if=/dev/zero of=/dev/sdc1 bs=1M count=1000
sgdisk --zap-all --clear --mbrtogpt -g -- /dev/sdc2
ceph-volume lvm zap /dev/sdc2
From the deployment node, create the FileStore OSD. To specify the OSD file type, use --filestore and --fs-type.
Eg, to create a FileStore OSD with XFS filesystem:
$CEPH_CONFIG_DIR/ceph-deploy osd create --filestore --fs-type xfs --data /dev/sdc1 --journal /dev/sdc2 ceph-node2
What is BlueStore?
Any new OSDs (e.g., when the cluster is expanded) can be deployed using BlueStore. This is the default behavior so no specific change is needed.
There are two methods OSDs can use to manage the data they store.
The default is now BlueStore. Prior to Ceph version Luminous, the default (and only option available) was Filestore.
BlueStore is a new back-end object storage system for Ceph OSD daemons. The original object store used by Ceph, FileStore, required a file system placed on top of raw block devices. Objects were then written to the file system.
By contrast, BlueStore does not require a file system for itself, because BlueStore stores objects directly on the block device. This improves cluster performance as it removes file system overhead.
BlueStore can use different block devices for storing different data. As an example, Hard Disk Drive (HDD) storage for data, Solid-state Drive (SSD) storage for metadata, Non-volatile Memory (NVM) or persistent or Non-volatile RAM (NVRAM) for the RocksDB WAL (write-ahead log).
In the simplest implementation, BlueStore resides on a single storage device which is partitioned into two parts: one containing the OSD metadata and one containing the actual data.
The OSD metadata partition is formatted with XFS and holds information about the OSD, such as its identifier, the cluster it belongs to, and its private keyring.
The data partition contains the actual OSD data and is managed by BlueStore. The primary partition is identified by the block symbolic link in the data directory.
Two additional devices can also be implemented:
A WAL (write-ahead-log) device: This contains the BlueStore internal journal or write-ahead log and is identified by the block.wal symbolic link in the data directory.
Best practice is to use an SSD to implement a WAL device in order to provide optimum performance.
A DB device: this stores BlueStore internal metadata. The embedded RocksDB database will then place as much metadata as possible on the DB device instead of on the primary device to optimize performance.
Only if the DB device becomes full will it then place metadata on the primary device. As for WAL, best practice for the Bluestore DB device is to deploy an SSD.
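As an illustration, a BlueStore OSD with separate DB and WAL devices can be created directly with ceph-volume (the device names below are examples only and must match your own hardware):
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2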
Starting and Stopping Ceph
To start all Ceph daemons:
[root@admin ~]# systemctl start ceph.target
To stop all Ceph daemons:
[root@admin ~]# systemctl stop ceph.target
To restart all Ceph daemons:
[root@admin ~]# systemctl restart ceph.target
To start, stop, and restart individual Ceph daemons:
On Ceph Monitor nodes:
systemctl start ceph-mon.target
systemctl stop ceph-mon.target
systemctl restart ceph-mon.target
On Ceph Manager nodes:
systemctl start ceph-mgr.target
systemctl stop ceph-mgr.target
systemctl restart ceph-mgr.target
On Ceph OSD nodes:
systemctl start ceph-osd.target
systemctl stop ceph-osd.target
systemctl restart ceph-osd.target
On Ceph Object Gateway nodes:
systemctl start ceph-radosgw.target
systemctl stop ceph-radosgw.target
systemctl restart ceph-radosgw.target
To perform stop, start, restart actions on specific Ceph monitor, manager, OSD or object gateway node instances:
On a Ceph Monitor node:
systemctl start ceph-mon@$MONITOR_HOST_NAME
systemctl stop ceph-mon@$MONITOR_HOST_NAME
systemctl restart ceph-mon@$MONITOR_HOST_NAME
On a Ceph Manager node:
systemctl start ceph-mgr@$MANAGER_HOST_NAME
systemctl stop ceph-mgr@$MANAGER_HOST_NAME
systemctl restart ceph-mgr@$MANAGER_HOST_NAME
On a Ceph OSD node:
systemctl start ceph-osd@$OSD_NUMBER
systemctl stop ceph-osd@$OSD_NUMBER
systemctl restart ceph-osd@$OSD_NUMBER
Substitute $OSD_NUMBER with the ID number of the Ceph OSD.
On a Ceph Object Gateway node:
systemctl start ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl stop ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl restart ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
Placement Group (PG) Information
To display the number of placement groups in a pool:
ceph osd pool get {pool-name} pg_num
To display statistics for the placement groups in the cluster:
ceph pg dump [--format {format}]
How To Check Status of the Ceph Cluster
To check the status and health of the cluster from the administration node, use:
ceph health
ceph status
Note that it can often take several minutes for the cluster to stabilize before the cluster health indicates HEALTH_OK.
You can also check the cluster quorum status of the cluster monitors:
ceph quorum_status --format json-pretty
For more Ceph admin commands, see https://sabaini.at/pages/ceph-cheatsheet.html#monit
The ceph.conf File
Each Ceph daemon looks for a ceph.conf file that contains its configuration settings. For manual deployments, you need to create a ceph.conf file to define your cluster.
ceph.conf contains the following definitions:
Cluster membership
Host names
Host addresses
Paths to keyrings
Paths to journals
Paths to data
Other runtime options
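A minimal ceph.conf for a small manual deployment might look like the following (all values are examples only and must be adapted to your cluster):
[global]
fsid = 2d4e6f88-41dc-4a92-9f77-9c1a6b1d2e33
mon initial members = ceph-node1
mon host = 192.168.1.10
public network = 192.168.1.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx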
The default ceph.conf locations in sequential order are as follows:
$CEPH_CONF (i.e., the path following the $CEPH_CONF environment variable)
-c path/path (i.e., the -c command line argument)
/etc/ceph/ceph.conf
~/.ceph/config
./ceph.conf (i.e., in the current working directory)
ceph-conf is a utility for getting information from a ceph configuration file.
As with most Ceph programs, you can specify which Ceph configuration file to use with the -c flag.
ceph-conf -L = lists all sections
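For example, assuming the default configuration file path:
ceph-conf -c /etc/ceph/ceph.conf -L
ceph-conf -c /etc/ceph/ceph.conf --lookup mon_host
The first command lists all sections in the file; the second prints the value of the mon_host setting.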
Ceph Journals
Note that journals are used only by FileStore.
BlueStore does not use a journal (it uses a RocksDB write-ahead log instead), so no journal is explicitly defined for BlueStore systems.
How To List Your Cluster Pools
To list your cluster pools, execute:
ceph osd lspools
Rename a Pool
To rename a pool, execute:
ceph osd pool rename <current-pool-name> <new-pool-name>
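For example, assuming an existing pool named mypool:
ceph osd pool rename mypool mynewpool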