LPIC3-306 COURSE NOTES: CEPH – An Overview


These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering.

They are in “rough format”, presented as they were written.

 

 

LPIC3-306 Clustering – 363.2 Ceph Syllabus Requirements

 

 

Exam Weighting: 8

 

Description: Candidates should be able to manage and maintain a Ceph Cluster. This includes the configuration of RGW, RBD devices and CephFS.

 

Key Knowledge Areas:
• Understand the architecture and components of Ceph
• Manage OSD, MGR, MON and MDS
• Understand and manage placement groups and pools
• Understand storage backends (FileStore and BlueStore)
• Initialize a Ceph cluster
• Create and manage Rados Block Devices
• Create and manage CephFS volumes, including snapshots
• Mount and use an existing CephFS
• Understand and adjust CRUSH maps

 

• Configure high availability aspects of Ceph
• Scale up a Ceph cluster
• Restore and verify the integrity of a Ceph cluster after an outage
• Understand key concepts of Ceph updates, including update order, tunables and features

 

Partial list of the used files, terms and utilities:
• ceph-deploy (including relevant subcommands)
• ceph.conf
• ceph (including relevant subcommands)
• rados (including relevant subcommands)
• rbd (including relevant subcommands)
• cephfs (including relevant subcommands)
• ceph-volume (including relevant subcommands)
• ceph-authtool
• ceph-bluestore-tool
• crushtool

 

 

 

What is Ceph?

 

Ceph is an open-source, massively scalable, software-defined storage system or “SDS”

 

It provides object, block and file system storage via a single clustered high-availability platform.

 

The intention of Ceph is to be a fully distributed system with no single point of failure which is self-healing and self-managing. Although production environment Ceph systems are best run on a high-grade hardware specification,  Ceph runs on standard commodity computer hardware.

 

An Overview of Ceph  

 

 

When Ceph services start, the initialization process activates a series of daemons that run in the background.

 

A Ceph Cluster runs with a minimum of three types of daemons:

 

Ceph Monitor (ceph-mon)

 

Ceph Manager (ceph-mgr)

 

Ceph OSD Daemon (ceph-osd)

 

Ceph Storage Clusters that support the Ceph File System also run at least one Ceph Metadata Server (ceph-mds).

 

Clusters that support Ceph Object Storage run Ceph RADOS Gateway daemons (radosgw) as well.

 

 

OSD or Object Storage Daemon: An OSD stores data and handles data replication, recovery, backfilling and rebalancing. An OSD also provides monitoring data for the Ceph Monitors by checking other Ceph OSD Daemons for an active heartbeat. A Ceph Storage Cluster requires at least two Ceph OSD Daemons (when the cluster keeps two copies of the data) in order to reach an active + clean state; with the default of three replicas, at least three OSDs are needed.

 

Monitor or Mon: maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
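
Each of these maps can be displayed from the admin node with the corresponding ceph subcommand, for example:

ceph mon dump
ceph osd dump
ceph pg dump
ceph osd crush dump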

 

Ceph also maintains a history or “epoch” of each state change in the Monitors, Ceph OSD Daemons, and the PGs.

 

Metadata Server or MDS: The MDS holds metadata relating to the Ceph Filesystem and enables POSIX file system users to execute standard POSIX commands such as ls, find, etc. without creating overhead on the Ceph Storage Cluster. MDS is only required if you are intending to run CephFS. It is not necessary if only block and object storage is to be used.  

 

A Ceph Storage Cluster requires at least one Ceph Monitor, one Ceph Manager, and at least one (preferably two or more) Ceph OSD or Object Storage Daemon; a Ceph Metadata Server (MDS) is also required if CephFS is to be used.

 

Ceph stores data in the form of objects within logical storage pools. The CRUSH algorithm is used by Ceph to decide which placement group should contain the object and which Ceph OSD Daemon should store the placement group.

 

The CRUSH algorithm is also used by Ceph to scale, rebalance, and recover from failures.

 

Note that the newer versions of Ceph are not packaged for Debian. Ceph is in general better supported on CentOS, since Red Hat maintains both CentOS and Ceph.

 

Ceph-deploy now replaced by cephadm

 

NOTE that ceph-deploy is now an outdated tool and is no longer maintained. It is also not available for CentOS 8. You should either use a manual installation method, or alternatively, use the cephadm tool for installing Ceph on cluster nodes. However, a working knowledge of ceph-deploy is, at the time of writing, still required for the LPIC3 exam.

 

For more on cephadm see https://ceph.io/ceph-management/introducing-cephadm/

 

 

 

The Client nodes know about monitors, OSDs and MDSs but have no knowledge of object locations. Ceph clients communicate directly with the OSDs rather than going through a dedicated server.

 

The OSDs (Object Storage Daemons) store the data. They can be up and in the map, or down and out if they have failed. An OSD can be down but still in the map, which means that its PGs have not yet been remapped. When OSDs come online they inform the monitor.
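
The current up/in and down/out state of the OSDs can be checked from the admin node, for example:

ceph osd stat
ceph osd tree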

 

The Monitor nodes store a master copy of the cluster map.

 

 

RADOS (Reliable Autonomic Distributed Object Store)

 

RADOS makes up the heart of Ceph's scalable object storage service.

 

In addition to accessing RADOS via the higher-level interfaces (RBD, RGW and CephFS), it is also possible to access RADOS directly via the librados library calls.

 

 

CRUSH (Controlled Replication Under Scalable Hashing)

 

The CRUSH map contains the topology of the system and is location aware. Objects are mapped to Placement Groups, and Placement Groups are in turn mapped to OSDs. This allows for dynamic rebalancing and controls which Placement Group holds the objects, and which of the OSDs should hold the Placement Group.

 

The CRUSH map holds a list of OSDs, buckets and rules that hold replication directives.
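
To inspect these OSDs, buckets and rules, the compiled CRUSH map can be extracted and decompiled with crushtool; a typical sequence (the file names here are arbitrary) is:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

After editing crushmap.txt, recompile and re-inject it:

crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin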

 

CRUSH will try not to move data during rebalancing whereas a true hash function would be likely to cause greater data movement.

 

 

The CRUSH map allows for different resiliency models such as:

 

#0 for a 1-node cluster.

 

#1 for a multi node cluster in a single rack

 

#2 for a multi node, multi chassis cluster with multiple hosts in a chassis

 

#3 for a multi node cluster with hosts across racks, etc.

 

osd crush chooseleaf type = {n}
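
For example, to replicate across hosts within a single rack (type 1), the setting could be placed in the [global] section of ceph.conf:

[global]
osd crush chooseleaf type = 1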

 

Buckets

 

Buckets are a hierarchical structure of storage locations; a bucket in the CRUSH map context is a location.

 

Placement Groups (PGs)

 

Ceph subdivides a storage pool into placement groups, assigns each individual object to a placement group, and then assigns the placement group to a primary OSD.

 

If an OSD node fails or the cluster re-balances, Ceph is able to replicate or move a placement group and all the objects stored within it without the need to move or replicate each object individually. This allows for an efficient re-balancing or recovery of the Ceph cluster.

 

Objects are mapped to Placement Groups by hashing the object’s name along with the replication factor and a bitmask.
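
The resulting object-to-PG-to-OSD mapping can be queried with ceph osd map; the pool and object names below are purely illustrative:

ceph osd map mypool myobject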

 

 

When you create a pool, a number of placement groups are automatically created by Ceph for the pool. If you don’t directly specify a number of placement groups, Ceph uses the default value of 8 which is extremely low.

 

A more useful default value is 128. For example:

 

osd pool default pg num = 128
osd pool default pgp num = 128

 

You need to set both pg_num (the total number of placement groups) and pgp_num (the number of placement groups considered for placement) to the same value. As a general guide, use the following values:

 

Less than 5 OSDs: set pg_num and pgp_num to 128.
Between 5 and 10 OSDs: set pg_num and pgp_num to 512.
Between 10 and 50 OSDs: set pg_num and pgp_num to 1024.
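
For example, to create a pool with pg_num and pgp_num both set explicitly at creation time (the pool name is arbitrary):

ceph osd pool create testpool 128 128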

 

 

To specifically define the number of PGs:

 

set pool x pg_num to {pg_num}

 

ceph osd pool set {pool-name} pg_num {pg_num}

 

 

set pool x pgp_num to {pgp_num}

 

ceph osd pool set {pool-name} pgp_num {pgp_num}

 

How To Create OSD Nodes on Ceph Using ceph-deploy

 

 

BlueStore is now the default storage backend used for Ceph OSDs.

 

Before you add a BlueStore OSD node to Ceph, first delete all data on the device/s that will serve as OSDs.

 

You can do this with the zap command:

 

$CEPH_CONFIG_DIR/ceph-deploy disk zap node device

 

Replace node with the node name or host name where the disk is located.

 

Replace device with the path to the device on the host where the disk is located.

 

For example, to delete the data on a device named /dev/sdc on a node named ceph-node3 in the Ceph Storage Cluster, use:

 

$CEPH_CONFIG_DIR/ceph-deploy disk zap ceph-node3 /dev/sdc

 

 

Next, to create a BlueStore OSD (the default), enter:

 

$CEPH_CONFIG_DIR/ceph-deploy osd create --data device node

 

This creates a volume group and logical volume on the specified disk. Both the data and the BlueStore metadata are stored on the same logical volume.

 

For example:

 

$CEPH_CONFIG_DIR/ceph-deploy osd create --data /dev/sdc ceph-node3

 

 

 

How To Create A FileStore OSD Manually

 

Quoted from the Ceph website:

 

FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.

 

FileStore is well-tested and widely used in production but suffers from many performance deficiencies due to its overall design and reliance on a traditional file system for storing object data.

 

Although FileStore is generally capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we only recommend that XFS be used. Both btrfs and ext4 have known bugs and deficiencies and their use may lead to data loss. By default all Ceph provisioning tools will use XFS.

 

The official Ceph default storage system is now BlueStore. Prior to Ceph version Luminous, the default (and only option available) was Filestore.

 

 

Note the instructions below create a FileStore and not a BlueStore system!

 

To create a FileStore OSD manually, i.e. without using ceph-deploy or cephadm:

 

first create the required partitions on the OSD node concerned: one for data, one for journal.

 

This example creates a 40 GB data partition on /dev/sdc1 and a journal partition of approximately 13 GB on /dev/sdc2:

 

 

parted /dev/sdc --script -- mklabel gpt
parted --script /dev/sdc mkpart primary 0MB 40000MB
parted --script /dev/sdc mkpart primary 42000MB 55000MB

 

dd if=/dev/zero of=/dev/sdc1 bs=1M count=1000

 

sgdisk --zap-all --clear --mbrtogpt -g -- /dev/sdc2

 

ceph-volume lvm zap /dev/sdc2

 

 

 

From the deployment node, create the FileStore OSD. To specify OSD file type, use --filestore and --fs-type.

 

For example, to create a FileStore OSD with an XFS filesystem:

 

$CEPH_CONFIG_DIR/ceph-deploy osd create --filestore --fs-type xfs --data /dev/sdc1 --journal /dev/sdc2 ceph-node2

 

 

What is BlueStore?

 

Any new OSDs (e.g., when the cluster is expanded) can be deployed using BlueStore. This is the default behavior so no specific change is needed.

 

There are two methods OSDs can use to manage the data they store.

 

The default is now BlueStore. Prior to Ceph version Luminous, the default (and only option available) was Filestore.

 

BlueStore is a new back-end object storage system for Ceph OSD daemons. The original object store used by Ceph, FileStore, required a file system placed on top of raw block devices. Objects were then written to the file system.

 

By contrast, BlueStore does not require a file system for itself, because BlueStore stores objects directly on the block device. This improves cluster performance as it removes file system overhead.

 

BlueStore can use different block devices for storing different data: for example, Hard Disk Drive (HDD) storage for data, Solid-State Drive (SSD) storage for metadata, and Non-Volatile Memory (NVM) or Non-Volatile RAM (NVRAM) for the RocksDB WAL (write-ahead log).

 

In the simplest implementation, BlueStore resides on a single storage device which is partitioned into two parts: one containing the OSD metadata and the other containing the actual data.

 

The OSD metadata partition is formatted with XFS and holds information about the OSD, such as its identifier, the cluster it belongs to, and its private keyring.

 

The data partition contains the actual OSD data and is managed by BlueStore. The primary partition is identified by a block symbolic link in the data directory.

 

Two additional devices can also be implemented:

 

A WAL (write-ahead-log) device: This contains the BlueStore internal journal or write-ahead log and is identified by the block.wal symbolic link in the data directory.

 

Best practice is to use an SSD to implement the WAL device in order to provide optimum performance.

 

 

A DB device: this stores BlueStore internal metadata. The embedded RocksDB database will then place as much metadata as possible on the DB device instead of on the primary device to optimize performance.

 

Only if the DB device becomes full will it then place metadata on the primary device. As with the WAL device, best practice for the BlueStore DB device is to deploy an SSD.
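
As a rough sketch, such an OSD could be provisioned with ceph-volume, assuming /dev/sdb as the data device and two SSD/NVMe partitions for the DB and WAL devices (the device names are examples only):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2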

 

 

 

Starting and Stopping Ceph

 

To start all Ceph daemons:

 

[root@admin ~]# systemctl start ceph.target

 

To stop all Ceph daemons:

 

[root@admin ~]# systemctl stop ceph.target

 

To restart all Ceph daemons:

 

[root@admin ~]# systemctl restart ceph.target

 

To start, stop, and restart individual Ceph daemons:

 

 

On Ceph Monitor nodes:

 

systemctl start ceph-mon.target

 

systemctl stop ceph-mon.target

 

systemctl restart ceph-mon.target

 

On Ceph Manager nodes:

 

systemctl start ceph-mgr.target

 

systemctl stop ceph-mgr.target

 

systemctl restart ceph-mgr.target

 

On Ceph OSD nodes:

 

systemctl start ceph-osd.target

 

systemctl stop ceph-osd.target

 

systemctl restart ceph-osd.target

 

On Ceph Object Gateway nodes:

 

systemctl start ceph-radosgw.target

 

systemctl stop ceph-radosgw.target

 

systemctl restart ceph-radosgw.target

 

 

To perform stop, start, restart actions on specific Ceph monitor, manager, OSD or object gateway node instances:

 

On a Ceph Monitor node:

 

systemctl start ceph-mon@$MONITOR_HOST_NAME
systemctl stop ceph-mon@$MONITOR_HOST_NAME
systemctl restart ceph-mon@$MONITOR_HOST_NAME

 

On a Ceph Manager node:

systemctl start ceph-mgr@$MANAGER_HOST_NAME
systemctl stop ceph-mgr@$MANAGER_HOST_NAME
systemctl restart ceph-mgr@$MANAGER_HOST_NAME

 

 

On a Ceph OSD node:

 

systemctl start ceph-osd@$OSD_NUMBER
systemctl stop ceph-osd@$OSD_NUMBER
systemctl restart ceph-osd@$OSD_NUMBER

 

Substitute $OSD_NUMBER with the ID number of the Ceph OSD.
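
For example, to restart the OSD with ID 1:

systemctl restart ceph-osd@1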

 

On a Ceph Object Gateway node:

 

systemctl start ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl stop ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl restart ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME

 

 

Placement Group (PG) Information

 

To display the number of placement groups in a pool:

 

ceph osd pool get {pool-name} pg_num

 

 

To display statistics for the placement groups in the cluster:

 

ceph pg dump [--format {format}]

 

 

How To Check Status of the Ceph Cluster

 

 

To check the status and health of the cluster from the administration node, use:

 

ceph health
ceph status

 

Note that it can often take several minutes for the cluster to stabilize before the cluster health will indicate HEALTH_OK.

 

You can also check the quorum status of the cluster monitors:

 

ceph quorum_status --format json-pretty

 

 

For more Ceph admin commands, see https://sabaini.at/pages/ceph-cheatsheet.html#monit

 

 

The ceph.conf File

 

Each Ceph daemon looks for a ceph.conf file that contains its configuration settings.  For manual deployments, you need to create a ceph.conf file to define your cluster.

 

ceph.conf contains the following definitions:

 

Cluster membership
Host names
Host addresses
Paths to keyrings
Paths to journals
Paths to data
Other runtime options
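
A minimal ceph.conf sketch illustrating these definitions (the fsid, host name and addresses are placeholder values):

[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
mon initial members = ceph-node1
mon host = 192.168.1.10
public network = 192.168.1.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2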

 

The default ceph.conf locations in sequential order are as follows:

 

$CEPH_CONF (i.e., the path following the $CEPH_CONF environment variable)

 

-c path/path (i.e., the -c command line argument)

 

/etc/ceph/ceph.conf

 

~/.ceph/config

 

./ceph.conf (i.e., in the current working directory)

 

ceph-conf is a utility for getting information from a ceph configuration file.

 

As with most Ceph programs, you can specify which Ceph configuration file to use with the -c flag.

 

 

ceph-conf -L = lists all sections
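
A single value can also be looked up from a given configuration file, for example (the key shown is just an illustration):

ceph-conf -c /etc/ceph/ceph.conf --lookup "mon host"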

 

 

Ceph Journals 

 

Note that journals are used only on FileStore.

 

Journals are not used by BlueStore (which uses a RocksDB write-ahead log instead) and so are not explicitly defined for BlueStore systems.

 

 

How To List Your Cluster Pools

 

To list your cluster pools, execute:

 

ceph osd lspools

 

Rename a Pool

 

To rename a pool, execute:

 

ceph osd pool rename <current-pool-name> <new-pool-name>
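
For example (the pool names here are arbitrary):

ceph osd pool rename testpool datapool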

 

 
