
GlusterFS Lab on CentOS 7

Replicated GlusterFS Cluster with 3 Nodes

 

First, we have a 3 node gluster cluster consisting of:

 

glusterfs1
glusterfs2
glusterfs3

 

 

# GlusterFS VMs
192.168.122.70 glusterfs1
192.168.122.71 glusterfs2
192.168.122.72 glusterfs3

 

Brick – the basic storage unit (a directory) on a server in the trusted storage pool.

 

Volume – a logical collection of bricks. Most gluster operations, such as reading and writing, are done on the volume.

 

 

GlusterFS supports different types of volumes, for scaling the storage size, improving performance, or both.

 

 

In this lab we will configure a replicated GlusterFS volume on CentOS 7.

 

Replicated Glusterfs Volume is similar to RAID 1. The volume maintains exact copies of the data on all bricks.

 

You can set the number of replicas when creating the volume.

 

 

You need at least two bricks to create a volume with two replicas, or three bricks to create a volume with three replicas.

 

 

I created a local disk /dev/vdb on each of the 3 machines, 200MB each, with a single partition vdb1 spanning 100% of the disk.
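The partitioning itself isn't shown in these notes; a minimal sketch with parted (just one possible way, assuming an msdos disk label) would be:

parted -s /dev/vdb mklabel msdos
parted -s /dev/vdb mkpart primary ext4 1MiB 100%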

 

Then I created /STORAGE/BRICK1 on each machine as the local mountpoint, and ran mkfs.ext4 /dev/vdb1 on each node.

 

then added to the fstab:

 

[root@glusterfs1 STORAGE]# echo '/dev/vdb1 /STORAGE/BRICK1 ext4 defaults 1 2' >> /etc/fstab
[root@glusterfs1 STORAGE]#
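After updating the fstab, the brick filesystem can be mounted and checked on each node (a quick sanity check, not shown in the original session):

mount -a
df -h /STORAGE/BRICK1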

 

 

Next, firewalling….

 

The gluster processes on the nodes need to be able to communicate with each other. To simplify this setup, configure the firewall on each node to accept all traffic from the other nodes.

 

# iptables -I INPUT -p all -s <ip-address> -j ACCEPT

 

where ip-address is the address of the other node.
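In this lab, using the addresses from the hosts file above, that means running the following on glusterfs1 (and the corresponding commands for the other two peers on glusterfs2 and glusterfs3):

# iptables -I INPUT -p all -s 192.168.122.71 -j ACCEPT
# iptables -I INPUT -p all -s 192.168.122.72 -j ACCEPT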

 

 

Then configure the trusted pool

 

From “server1”

 

# gluster peer probe server2
# gluster peer probe server3

 

Note: When using hostnames, the first server needs to be probed from one other server to set its hostname.

 

From “server2”

 

# gluster peer probe server1

Note: Once this pool has been established, only trusted members may probe new servers into the pool. A new server cannot probe the pool, it must be probed from the pool.

 

 

so in our case we do:

 

 

[root@glusterfs1 etc]# gluster peer probe glusterfs2
peer probe: success
[root@glusterfs1 etc]# gluster peer probe glusterfs3
peer probe: success
[root@glusterfs1 etc]#

 

[root@glusterfs2 STORAGE]# gluster peer probe glusterfs1
peer probe: Host glusterfs1 port 24007 already in peer list
[root@glusterfs2 STORAGE]# gluster peer probe glusterfs2
peer probe: Probe on localhost not needed
[root@glusterfs2 STORAGE]#

 

[root@glusterfs3 STORAGE]# gluster peer probe glusterfs1
peer probe: Host glusterfs1 port 24007 already in peer list
[root@glusterfs3 STORAGE]# gluster peer probe glusterfs2
peer probe: Host glusterfs2 port 24007 already in peer list
[root@glusterfs3 STORAGE]#

 

 


 

Check the peer status on server1

 

# gluster peer status

 

[root@glusterfs1 etc]# gluster peer status
Number of Peers: 2

 

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)

 

Hostname: glusterfs3
Uuid: 28a7bf8e-e2b9-4509-a45f-a95198139a24
State: Peer in Cluster (Connected)
[root@glusterfs1 etc]#

 

 

next, we set up a GlusterFS volume

 

 

On all servers do:

 

# mkdir -p /data/brick1/gv0

From any single server:

 

# gluster volume create gv0 replica 3 server1:/data/brick1/gv0 server2:/data/brick1/gv0 server3:/data/brick1/gv0
volume create: gv0: success: please start the volume to access data
# gluster volume start gv0
volume start: gv0: success

 

Confirm that the volume shows “Started”:

 

# gluster volume info

 

on each machine:

 

 

mkdir -p /STORAGE/BRICK1/GV0

 

 

then on ONE gluster node ONLY:

 

 

gluster volume create GV0 replica 3 glusterfs1:/STORAGE/BRICK1/GV0 glusterfs2:/STORAGE/BRICK1/GV0 glusterfs3:/STORAGE/BRICK1/GV0

 

 

[root@glusterfs1 etc]# gluster volume create GV0 replica 3 glusterfs1:/STORAGE/BRICK1/GV0 glusterfs2:/STORAGE/BRICK1/GV0 glusterfs3:/STORAGE/BRICK1/GV0
volume create: GV0: success: please start the volume to access data
[root@glusterfs1 etc]# gluster volume info

 

Volume Name: GV0
Type: Replicate
Volume ID: c0dc91d5-05da-4451-ba5e-91df44f21057
Status: Created
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: glusterfs1:/STORAGE/BRICK1/GV0
Brick2: glusterfs2:/STORAGE/BRICK1/GV0
Brick3: glusterfs3:/STORAGE/BRICK1/GV0
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@glusterfs1 etc]#

 

Note: If the volume does not show “Started”, check the log at /var/log/glusterfs/glusterd.log in order to debug and diagnose the situation. The logs can be checked on one or all of the configured servers.
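For example, to look at the most recent entries (log path as given above):

# tail -n 50 /var/log/glusterfs/glusterd.log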

 

 

# gluster volume start gv0
volume start: gv0: success

 

 

gluster volume start GV0

 

 


 

 

 

[root@glusterfs1 glusterfs]# gluster volume start GV0
volume start: GV0: success
[root@glusterfs1 glusterfs]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 49152 0 Y 1933
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1820
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1523
Self-heal Daemon on localhost N/A N/A Y 1950
Self-heal Daemon on glusterfs2 N/A N/A Y 1837
Self-heal Daemon on glusterfs3 N/A N/A Y 1540

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

 

[root@glusterfs1 glusterfs]#

 

 

[root@glusterfs2 /]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 49152 0 Y 1933
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1820
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1523
Self-heal Daemon on localhost N/A N/A Y 1837
Self-heal Daemon on glusterfs1 N/A N/A Y 1950
Self-heal Daemon on glusterfs3 N/A N/A Y 1540

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs2 /]#

 

[root@glusterfs3 STORAGE]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 49152 0 Y 1933
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1820
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1523
Self-heal Daemon on localhost N/A N/A Y 1540
Self-heal Daemon on glusterfs2 N/A N/A Y 1837
Self-heal Daemon on glusterfs1 N/A N/A Y 1950

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

 

[root@glusterfs3 STORAGE]#
[root@glusterfs3 STORAGE]#
[root@glusterfs3 STORAGE]#
[root@glusterfs3 STORAGE]#

 

 

You only need to run the gluster volume start command from ONE node; the volume then starts automatically on every node.

 

 

Testing the GlusterFS volume

 

We will use one of the servers to mount the volume. Typically you would do this from an external machine, ie a “client”. Since using this method requires additional packages to be installed on the client machine, we will instead use one of the servers to test, as if it were an actual separate client machine.
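On a genuine separate client you would typically first install the Gluster client packages, e.g. on CentOS 7 (not needed here, since we are testing from one of the server nodes):

# yum install glusterfs glusterfs-fuse -y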

 

 

[root@glusterfs1 glusterfs]# mount -t glusterfs glusterfs2:/GV0 /mnt
[root@glusterfs1 glusterfs]#

 

 

# mount -t glusterfs server1:/gv0 /mnt

# for i in `seq -w 1 100`; do cp -rp /var/log/messages /mnt/copy-test-$i; done

First, check the client mount point:

 

# ls -lA /mnt/copy* | wc -l

 

You should see 100 files returned. Next, check the GlusterFS brick mount points on each server:

 

# ls -lA /data/brick1/gv0/copy*

 

You should see 100 files on each server using the method above.  Without replication, with a distribute-only volume (not detailed here), you would instead see about 33 files on each machine.

 

 

kevin@asus:~$ sudo su
root@asus:/home/kevin# ssh glusterfs1
^C

 

glusterfs1 is not yet booted… so let’s have a look at the glusterfs system before we boot the 3rd machine:

 

root@asus:/home/kevin# ssh glusterfs2
Last login: Wed May 4 18:04:05 2022 from asus
[root@glusterfs2 ~]#
[root@glusterfs2 ~]#
[root@glusterfs2 ~]#
[root@glusterfs2 ~]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1114
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1227
Self-heal Daemon on localhost N/A N/A Y 1129
Self-heal Daemon on glusterfs3 N/A N/A Y 1238

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

 

third machine glusterfs1 is now booted and live:

 

[root@glusterfs2 ~]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 N/A N/A N N/A
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1114
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1227
Self-heal Daemon on localhost N/A N/A Y 1129
Self-heal Daemon on glusterfs1 N/A N/A Y 1122
Self-heal Daemon on glusterfs3 N/A N/A Y 1238

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

 

[root@glusterfs2 ~]#

 

 

a little while later….

[root@glusterfs2 ~]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 49152 0 Y 1106
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1114
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1227
Self-heal Daemon on localhost N/A N/A Y 1129
Self-heal Daemon on glusterfs3 N/A N/A Y 1238
Self-heal Daemon on glusterfs1 N/A N/A Y 1122

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs2 ~]#

 

 

testing…

 

[root@glusterfs2 ~]# mount -t glusterfs glusterfs2:/GV0 /mnt
[root@glusterfs2 ~]#
[root@glusterfs2 ~]#
[root@glusterfs2 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 753612 0 753612 0% /dev
tmpfs 765380 0 765380 0% /dev/shm
tmpfs 765380 8860 756520 2% /run
tmpfs 765380 0 765380 0% /sys/fs/cgroup
/dev/mapper/centos-root 8374272 2421908 5952364 29% /
/dev/vda1 1038336 269012 769324 26% /boot
/dev/vdb1 197996 2084 181382 2% /STORAGE/BRICK1
tmpfs 153076 0 153076 0% /run/user/0
glusterfs2:/GV0 197996 4064 181382 3% /mnt
[root@glusterfs2 ~]# cd /mnt
[root@glusterfs2 mnt]# ls
[root@glusterfs2 mnt]#
[root@glusterfs2 mnt]# for i in `seq -w 1 100`; do cp -rp /var/log/messages /mnt/copy-test-$i; done
[root@glusterfs2 mnt]#
[root@glusterfs2 mnt]#
[root@glusterfs2 mnt]# ls -l
total 30800
-rw——- 1 root root 315122 May 4 19:41 copy-test-001
-rw——- 1 root root 315122 May 4 19:41 copy-test-002
-rw——- 1 root root 315122 May 4 19:41 copy-test-003
-rw——- 1 root root 315122 May 4 19:41 copy-test-004
-rw——- 1 root root 315122 May 4 19:41 copy-test-005

.. .. ..
.. .. ..

-rw——- 1 root root 315122 May 4 19:41 copy-test-098
-rw——- 1 root root 315122 May 4 19:41 copy-test-099
-rw——- 1 root root 315122 May 4 19:41 copy-test-100
[root@glusterfs2 mnt]#

You should see 100 files returned.

 

Next, check the GlusterFS brick mount points on each server:

 

ls -lA /STORAGE/BRICK1/GV0/copy*

 

You should see 100 files on each server using the method we listed here. Without replication, in a distribute only volume (not detailed here), you should see about 33 files on each one.

 

sure enough, we have 100 files on each server
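A quick way to confirm this is to run the following on each of glusterfs1, glusterfs2 and glusterfs3; each node should report 100:

ls /STORAGE/BRICK1/GV0/copy* | wc -l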

 

 

Adding a New Brick To Gluster 

 

I then added a new brick on just one node, glusterfs1:

Device Boot Start End Blocks Id System
/dev/vdc1 2048 419431 208692 83 Linux

 

 

[root@glusterfs1 ~]# mkfs.ext4 /dev/vdc1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
52208 inodes, 208692 blocks
10434 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=33816576
26 block groups
8192 blocks per group, 8192 fragments per group
2008 inodes per group
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729, 204801

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

[root@glusterfs1 ~]#

 

then create mount point and add to fstab:

 

mkdir -p /STORAGE/BRICK2

and then add it to the fstab:

 

[root@glusterfs1 STORAGE]# echo '/dev/vdc1 /STORAGE/BRICK2 ext4 defaults 1 2' >> /etc/fstab

[root@glusterfs1 etc]# cat fstab

#
# /etc/fstab
# Created by anaconda on Mon Apr 26 14:28:43 2021
#
# Accessible filesystems, by reference, are maintained under ‘/dev/disk’
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root / xfs defaults 0 0
UUID=e8756f1e-4d97-4a5b-bac2-f61a9d49d0f6 /boot xfs defaults 0 0
/dev/mapper/centos-swap swap swap defaults 0 0
/dev/vdb1 /STORAGE/BRICK1 ext4 defaults 1 2
/dev/vdc1 /STORAGE/BRICK2 ext4 defaults 1 2
[root@glusterfs1 etc]#

 

 

next you need to mount the new brick manually for this session (unless you reboot)

 

 

mount -a

 

 

the filesystem is now mounted:

 

[root@glusterfs1 etc]# df
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 753612 0 753612 0% /dev
tmpfs 765380 0 765380 0% /dev/shm
tmpfs 765380 8908 756472 2% /run
tmpfs 765380 0 765380 0% /sys/fs/cgroup
/dev/mapper/centos-root 8374272 2422224 5952048 29% /
/dev/vda1 1038336 269012 769324 26% /boot
/dev/vdb1 197996 27225 156241 15% /STORAGE/BRICK1
tmpfs 153076 0 153076 0% /run/user/0
/dev/vdc1 197996 1806 181660 1% /STORAGE/BRICK2
[root@glusterfs1 etc]#

 

 

next we need to add the brick to the gluster volume:

 

volume add-brick <VOLNAME> <NEW-BRICK> …

Add the specified brick to the specified volume.

 

gluster volume add-brick GV0 /STORAGE/BRICK2

 

[root@glusterfs1 etc]# gluster volume add-brick GV0 /STORAGE/BRICK2
Wrong brick type: /STORAGE/BRICK2, use <HOSTNAME>:<export-dir-abs-path>

 

Usage:
volume add-brick <VOLNAME> [<stripe|replica> <COUNT> [arbiter <COUNT>]] <NEW-BRICK> … [force]

[root@glusterfs1 etc]#

 

gluster volume add-brick GV0 replica /STORAGE/BRICK2

 

 

[root@glusterfs1 BRICK1]# gluster volume add-brick GV0 replica 4 glusterfs1:/STORAGE/BRICK2/
volume add-brick: failed: The brick glusterfs1:/STORAGE/BRICK2 is a mount point. Please create a sub-directory under the mount point and use that as the brick directory. Or use ‘force’ at the end of the command if you want to override this behavior.
[root@glusterfs1 BRICK1]#

 

 

[root@glusterfs1 BRICK2]# mkdir GV0
[root@glusterfs1 BRICK2]#
[root@glusterfs1 BRICK2]#
[root@glusterfs1 BRICK2]# gluster volume add-brick GV0 replica 4 glusterfs1:/STORAGE/BRICK2/
volume add-brick: failed: The brick glusterfs1:/STORAGE/BRICK2 is a mount point. Please create a sub-directory under the mount point and use that as the brick directory. Or use ‘force’ at the end of the command if you want to override this behavior.
[root@glusterfs1 BRICK2]#
[root@glusterfs1 BRICK2]# gluster volume add-brick GV0 replica 4 glusterfs1:/STORAGE/BRICK2/GV0
volume add-brick: success
[root@glusterfs1 BRICK2]#

 

 

we now have four bricks in the volume GV0:

 

[root@glusterfs2 mnt]# gluster volume info

Volume Name: GV0
Type: Replicate
Volume ID: c0dc91d5-05da-4451-ba5e-91df44f21057
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glusterfs1:/STORAGE/BRICK1/GV0
Brick2: glusterfs2:/STORAGE/BRICK1/GV0
Brick3: glusterfs3:/STORAGE/BRICK1/GV0
Brick4: glusterfs1:/STORAGE/BRICK2/GV0
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
[root@glusterfs2 mnt]#

 

[root@glusterfs1 etc]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 49152 0 Y 1221
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1298
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1220
Brick glusterfs1:/STORAGE/BRICK2/GV0 49153 0 Y 1598
Self-heal Daemon on localhost N/A N/A Y 1615
Self-heal Daemon on glusterfs3 N/A N/A Y 1498
Self-heal Daemon on glusterfs2 N/A N/A Y 1717

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs1 etc]#

 

 

You can't unmount the bricks while they belong to the gluster volume:

 

[root@glusterfs1 etc]# cd ..
[root@glusterfs1 /]# umount /STORAGE/BRICK1
umount: /STORAGE/BRICK1: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
[root@glusterfs1 /]# umount /STORAGE/BRICK2
umount: /STORAGE/BRICK2: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
[root@glusterfs1 /]#

 

 

Another example of adding a new brick to gluster:

 

 

gluster volume add-brick REPVOL replica 4 glusterfs4:/DISK2/BRICK

[root@glusterfs2 DISK2]# gluster volume add-brick REPVOL replica 4 glusterfs4:/DISK2/BRICK
volume add-brick: success
[root@glusterfs2 DISK2]#

[root@glusterfs2 DISK2]# gluster volume status
Status of volume: DDVOL
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/DISK1/EXPORT1 49152 0 Y 1239
Brick glusterfs2:/DISK1/EXPORT1 49152 0 Y 1022
Brick glusterfs3:/DISK1/EXPORT1 49152 0 Y 1097
Self-heal Daemon on localhost N/A N/A Y 1039
Self-heal Daemon on glusterfs4 N/A N/A Y 1307
Self-heal Daemon on glusterfs3 N/A N/A Y 1123
Self-heal Daemon on glusterfs1 N/A N/A Y 1261

Task Status of Volume DDVOL
——————————————————————————
There are no active volume tasks

Status of volume: REPVOL
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/DISK2/BRICK 49153 0 Y 1250
Brick glusterfs2:/DISK2/BRICK 49153 0 Y 1029
Brick glusterfs3:/DISK2/BRICK 49153 0 Y 1108
Brick glusterfs4:/DISK2/BRICK 49152 0 Y 1446
Self-heal Daemon on localhost N/A N/A Y 1039
Self-heal Daemon on glusterfs4 N/A N/A Y 1307
Self-heal Daemon on glusterfs3 N/A N/A Y 1123
Self-heal Daemon on glusterfs1 N/A N/A Y 1261

Task Status of Volume REPVOL
——————————————————————————
There are no active volume tasks

[root@glusterfs2 DISK2]#

 

 

Detaching a Peer From Gluster

 

 

[root@glusterfs3 ~]# gluster peer help

 

gluster peer commands
======================

 

peer detach { <HOSTNAME> | <IP-address> } [force] – detach peer specified by <HOSTNAME>
peer help – display help for peer commands
peer probe { <HOSTNAME> | <IP-address> } – probe peer specified by <HOSTNAME>
peer status – list status of peers
pool list – list all the nodes in the pool (including localhost)

 

 

[root@glusterfs2 ~]#
[root@glusterfs2 ~]# gluster pool list
UUID Hostname State
02855654-335a-4be3-b80f-c1863006c31d glusterfs1 Connected
28a7bf8e-e2b9-4509-a45f-a95198139a24 glusterfs3 Connected
5fd324e4-9415-441c-afea-4df61141c896 localhost Connected
[root@glusterfs2 ~]#

 

peer detach <HOSTNAME>
Detach the specified peer.

 

gluster peer detach glusterfs1

 

[root@glusterfs2 ~]# gluster peer detach glusterfs1

 

All clients mounted through the peer which is getting detached need to be remounted using one of the other active peers in the trusted storage pool to ensure client gets notification on any changes done on the gluster configuration and if the same has been done do you want to proceed? (y/n) y

 

peer detach: failed: Peer glusterfs1 hosts one or more bricks. If the peer is in not recoverable state then use either replace-brick or remove-brick command with force to remove all bricks from the peer and attempt the peer detach again.

 

[root@glusterfs2 ~]#

 

 

[root@glusterfs3 ~]# gluster peer detach glusterfs4
All clients mounted through the peer which is getting detached need to be remounted using one of the other active peers in the trusted storage pool to ensure client gets notification on any changes done on the gluster configuration and if the same has been done do you want to proceed? (y/n) y
peer detach: success
[root@glusterfs3 ~]#

 

 

[root@glusterfs3 ~]# gluster peer status
Number of Peers: 2

 

Hostname: glusterfs1
Uuid: 02855654-335a-4be3-b80f-c1863006c31d
State: Peer in Cluster (Connected)

 

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)

[root@glusterfs3 ~]#

 

[root@glusterfs3 ~]# gluster pool list
UUID Hostname State
02855654-335a-4be3-b80f-c1863006c31d glusterfs1 Connected
5fd324e4-9415-441c-afea-4df61141c896 glusterfs2 Connected
28a7bf8e-e2b9-4509-a45f-a95198139a24 localhost Connected
[root@glusterfs3 ~]#

 

 

 

Adding a Node to a Trusted Storage Pool

 

 

[root@glusterfs3 ~]#
[root@glusterfs3 ~]# gluster peer probe glusterfs4
peer probe: success

[root@glusterfs3 ~]#

[root@glusterfs3 ~]# gluster pool list
UUID Hostname State
02855654-335a-4be3-b80f-c1863006c31d glusterfs1 Connected
5fd324e4-9415-441c-afea-4df61141c896 glusterfs2 Connected
2bfe642f-7dfe-4072-ac48-238859599564 glusterfs4 Connected
28a7bf8e-e2b9-4509-a45f-a95198139a24 localhost Connected

[root@glusterfs3 ~]#

[root@glusterfs3 ~]# gluster peer status
Number of Peers: 3

 

Hostname: glusterfs1
Uuid: 02855654-335a-4be3-b80f-c1863006c31d
State: Peer in Cluster (Connected)

 

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)

 

Hostname: glusterfs4
Uuid: 2bfe642f-7dfe-4072-ac48-238859599564
State: Peer in Cluster (Connected)
[root@glusterfs3 ~]#

 

 

 

 

Removing a Brick

 

 

 

volume remove-brick <VOLNAME> <BRICK> …

 

 

[root@glusterfs1 etc]# gluster volume remove-brick DRVOL 1 glusterfs1:/STORAGE/EXPORT1 stop
wrong brick type: 1, use <HOSTNAME>:<export-dir-abs-path>

 

Usage:
volume remove-brick <VOLNAME> [replica <COUNT>] <BRICK> … <start|stop|status|commit|force>

 

[root@glusterfs1 etc]# gluster volume remove-brick DRVOL glusterfs1:/STORAGE/EXPORT1 stop
volume remove-brick stop: failed: Volume DRVOL needs to be started to perform rebalance
[root@glusterfs1 etc]#

 

 

[root@glusterfs1 etc]# gluster volume remove-brick DRVOL glusterfs1:/STORAGE/EXPORT1 force
Remove-brick force will not migrate files from the removed bricks, so they will no longer be available on the volume.
Do you want to continue? (y/n) n
[root@glusterfs1 etc]# gluster volume rebalance

 

Usage:
volume rebalance <VOLNAME> {{fix-layout start} | {start [force]|stop|status}}

 

[root@glusterfs1 etc]# gluster volume rebalance start

 

Usage:
volume rebalance <VOLNAME> {{fix-layout start} | {start [force]|stop|status}}

 

[root@glusterfs1 etc]#
[root@glusterfs1 etc]# gluster volume rebalance DRVOL start
volume rebalance: DRVOL: success: Rebalance on DRVOL has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 939c3ec2-7634-46b4-a1ad-9e99e6da7bf2
[root@glusterfs1 etc]#
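The rebalance progress can then be checked with the status option referred to in the output above:

gluster volume rebalance DRVOL status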

 

 

 

 

I then shut down the glusterfs1 and glusterfs2 nodes.

 

[root@glusterfs3 ~]#
[root@glusterfs3 ~]# gluster peer status
Number of Peers: 2

Hostname: glusterfs1
Uuid: 02855654-335a-4be3-b80f-c1863006c31d
State: Peer in Cluster (Disconnected)

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Disconnected)
[root@glusterfs3 ~]#

 

 

 

this means we now just have

 

[root@glusterfs3 ~]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1220
Self-heal Daemon on localhost N/A N/A Y 1498

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs3 ~]#

 

 

 

and tried on glusterfs3 to mount the volume GV0:

 

 

[root@glusterfs3 ~]# mount -t glusterfs glusterfs3:/GV0 /mnt
Mount failed. Check the log file for more details.
[root@glusterfs3 ~]#
[root@glusterfs3 ~]#

 

 

I then restarted just one more node ie glusterfs1:
[root@glusterfs3 ~]# gluster peer status
Number of Peers: 2

Hostname: glusterfs1
Uuid: 02855654-335a-4be3-b80f-c1863006c31d
State: Peer in Cluster (Connected)

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Disconnected)
[root@glusterfs3 ~]# gluster volume info

Volume Name: GV0
Type: Replicate
Volume ID: c0dc91d5-05da-4451-ba5e-91df44f21057
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glusterfs1:/STORAGE/BRICK1/GV0
Brick2: glusterfs2:/STORAGE/BRICK1/GV0
Brick3: glusterfs3:/STORAGE/BRICK1/GV0
Brick4: glusterfs1:/STORAGE/BRICK2/GV0
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
[root@glusterfs3 ~]# mount -t glusterfs glusterfs3:/GV0 /mnt
[root@glusterfs3 ~]#

 

 

I was then able to mount the glusterfs volume:

 

glusterfs3:/GV0 197996 29211 156235 16% /mnt
[root@glusterfs3 ~]#

 

[root@glusterfs3 ~]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/BRICK1/GV0 49152 0 Y 1235
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1220
Brick glusterfs1:/STORAGE/BRICK2/GV0 49153 0 Y 1243
Self-heal Daemon on localhost N/A N/A Y 1498
Self-heal Daemon on glusterfs1 N/A N/A Y 1256

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs3 ~]#

 

 

I then shut down glusterfs1, as it has 2 bricks, and started up glusterfs2, which has only 1 brick:

 

[root@glusterfs3 ~]# gluster peer status
Number of Peers: 2

Hostname: glusterfs1
Uuid: 02855654-335a-4be3-b80f-c1863006c31d
State: Peer in Cluster (Disconnected)

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)
[root@glusterfs3 ~]#

 

 

[root@glusterfs3 ~]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1093
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1220
Self-heal Daemon on localhost N/A N/A Y 1498
Self-heal Daemon on glusterfs2 N/A N/A Y 1108

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs3 ~]#
[root@glusterfs3 ~]#

 

I removed one brick from glusterfs1 (which has 2 bricks):

 

[root@glusterfs1 /]# gluster volume remove-brick GV0 replica 3 glusterfs1:/STORAGE/BRICK1/GV0 force
Remove-brick force will not migrate files from the removed bricks, so they will no longer be available on the volume.
Do you want to continue? (y/n) y
volume remove-brick commit force: success
[root@glusterfs1 /]#

 

 

it now looks like this:

 

[root@glusterfs1 /]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1018
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1098
Brick glusterfs1:/STORAGE/BRICK2/GV0 49153 0 Y 1249
Self-heal Daemon on localhost N/A N/A Y 1262
Self-heal Daemon on glusterfs3 N/A N/A Y 1114
Self-heal Daemon on glusterfs2 N/A N/A Y 1028

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs1 /]#

 

 

Note that you have to include the full brick path, i.e. /STORAGE/BRICK1/GV0 and not just /STORAGE/BRICK1, else it won't work.

 

You also have to specify the new replica count: in this case 3 instead of the previous 4.

 

 

 

 

To Stop and Start a Gluster Volume

 

To stop a volume:

 

gluster volume stop GV0

 

[root@glusterfs1 /]# gluster volume stop GV0
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: GV0: success
[root@glusterfs1 /]#

 

[root@glusterfs2 /]# gluster volume status
Volume GV0 is not started

[root@glusterfs2 /]#

 

 

to start a volume:

[root@glusterfs1 /]# gluster volume start GV0
volume start: GV0: success
[root@glusterfs1 /]#

 

[root@glusterfs2 /]# gluster volume status
Status of volume: GV0
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs2:/STORAGE/BRICK1/GV0 49152 0 Y 1730
Brick glusterfs3:/STORAGE/BRICK1/GV0 49152 0 Y 1788
Brick glusterfs1:/STORAGE/BRICK2/GV0 49152 0 Y 2532
Self-heal Daemon on localhost N/A N/A Y 1747
Self-heal Daemon on glusterfs1 N/A N/A Y 2549
Self-heal Daemon on glusterfs3 N/A N/A Y 1805

Task Status of Volume GV0
——————————————————————————
There are no active volume tasks

[root@glusterfs2 /]#

 

Deleting a Gluster Volume 

 

to delete a volume:

 

[root@glusterfs1 etc]#
[root@glusterfs1 etc]# gluster volume delete GV0
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: GV0: failed: Volume GV0 has been started.Volume needs to be stopped before deletion.
[root@glusterfs1 etc]#

 

 

[root@glusterfs1 etc]# gluster volume stop GV0
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: GV0: success
[root@glusterfs1 etc]#

 

[root@glusterfs1 etc]# gluster volume delete GV0
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: GV0: success
[root@glusterfs1 etc]#

[root@glusterfs1 etc]#
[root@glusterfs1 etc]# gluster volume status
No volumes present
[root@glusterfs1 etc]#

 

 

note we still have our gluster cluster with 3 nodes, but no gluster volume anymore:

 

[root@glusterfs1 etc]# gluster peer status
Number of Peers: 2

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)

Hostname: glusterfs3
Uuid: 28a7bf8e-e2b9-4509-a45f-a95198139a24
State: Peer in Cluster (Connected)
[root@glusterfs1 etc]#

 

 

 

Creating a Distributed Replicated Gluster Volume

 

 

Next, we want to build a distributed replicated volume:

 

first we will add another virtual machine to the gluster cluster:

 

glusterfs4

 

to make this process quicker we will clone glusterfs1 in KVM:

 

first we switch off glusterfs1, then clone it with the name glusterfs4 with the same hardware config as glusterfs1:

 

and then switch on glusterfs4

 

glusterfs4 needs to be given an IP address, and its definition added to the /etc/hosts files on all machines and distributed, e.g. with scp /etc/hosts <machine>
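As a sketch (the actual IP address used for glusterfs4 is not recorded in these notes, so 192.168.122.73 below is just an assumed value, and the scp loop assumes root ssh access between the nodes):

echo '192.168.122.73 glusterfs4' >> /etc/hosts
for h in glusterfs1 glusterfs2 glusterfs3; do scp /etc/hosts $h:/etc/hosts; done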

 

[root@glusterfs4 ~]# gluster pool list
UUID Hostname State
5fd324e4-9415-441c-afea-4df61141c896 glusterfs2 Connected
28a7bf8e-e2b9-4509-a45f-a95198139a24 glusterfs3 Connected
02855654-335a-4be3-b80f-c1863006c31d localhost Connected
[root@glusterfs4 ~]#

 

we first have to get this machine to join the gluster pool ie the cluster

 

BUT we have a problem: as a result of the cloning, the UUID is the same as for glusterfs1!

 

[root@glusterfs1 ~]# gluster system:: uuid get
UUID: 02855654-335a-4be3-b80f-c1863006c31d
[root@glusterfs1 ~]#

 

[root@glusterfs4 /]# gluster system:: uuid get
UUID: 02855654-335a-4be3-b80f-c1863006c31d
[root@glusterfs4 /]#

 

 

so first we have to change this and generate a new uuid for glusterfs4:

 

Use the 'gluster system:: uuid reset' command to reset the UUID of the local glusterd of the machine, and then 'peer probe' will run ok.

 

 

[root@glusterfs4 /]# gluster system:: uuid reset
Resetting uuid changes the uuid of local glusterd. Do you want to continue? (y/n) y
trusted storage pool has been already formed. Please detach this peer from the pool and reset its uuid.
[root@glusterfs4 /]#

 

 

This was a bit complicated, because the new machine glusterfs4 had the same UUID as glusterfs1. We had to detach it in gluster, but we could only do that by temporarily renaming it glusterfs1, and also temporarily editing the /etc/hosts files on all gluster nodes so that the name glusterfs1 pointed to the new machine. We could then go to another node and detach “glusterfs1” (in reality, of course, our new glusterfs4 machine) from the cluster.

 

see below

5fd324e4-9415-441c-afea-4df61141c896 localhost Connected
[root@glusterfs2 etc]# gluster peer detach glusterfs1
All clients mounted through the peer which is getting detached need to be remounted using one of the other active peers in the trusted storage pool to ensure client gets notification on any changes done on the gluster configuration and if the same has been done do you want to proceed? (y/n) y
peer detach: success
[root@glusterfs2 etc]#
[root@glusterfs2 etc]#
[root@glusterfs2 etc]#

 

 

then, having done that, we create a new uuid for the node:

 

[root@glusterfs1 ~]# gluster system:: uuid reset
Resetting uuid changes the uuid of local glusterd. Do you want to continue? (y/n) y
resetting the peer uuid has been successful
[root@glusterfs1 ~]#

 

we now have a new unique uuid for this machine:

 

[root@glusterfs1 ~]# cat /var/lib/glusterd/glusterd.info
UUID=2bfe642f-7dfe-4072-ac48-238859599564
operating-version=90000
[root@glusterfs1 ~]#

 

 

then, we can switch the name and host file definitions back to glusterfs4 for this machine:

 

 

 

and then we can do:

 

[root@glusterfs2 etc]#
[root@glusterfs2 etc]# gluster peer probe glusterfs1
peer probe: success
[root@glusterfs2 etc]# gluster peer probe glusterfs4
peer probe: success
[root@glusterfs2 etc]# gluster peer probe glusterfs3
peer probe: Host glusterfs3 port 24007 already in peer list
[root@glusterfs2 etc]#

 

[root@glusterfs2 etc]# gluster pool list
UUID Hostname State
28a7bf8e-e2b9-4509-a45f-a95198139a24 glusterfs3 Connected
02855654-335a-4be3-b80f-c1863006c31d glusterfs1 Connected
2bfe642f-7dfe-4072-ac48-238859599564 glusterfs4 Connected
5fd324e4-9415-441c-afea-4df61141c896 localhost Connected
[root@glusterfs2 etc]#

 

and we now have a 4-node gluster cluster.

 

Note from Redhat:

 

Support for two-way replication is planned for deprecation and removal in future versions of Red Hat Gluster Storage. This will affect both replicated and distributed-replicated volumes.

 

Support is being removed because two-way replication does not provide adequate protection from split-brain conditions. While a dummy node can be used as an interim solution for this problem, Red Hat recommends that all volumes that currently use two-way replication are migrated to use either arbitrated replication or three-way replication.

 

 

NOTE:  Make sure you start your volumes before you try to mount them or else client operations after the mount will hang.

 

GlusterFS will fail to create a replicated volume if more than one brick of a replica set is present on the same peer, e.g. a four-node replicated volume where two bricks of the same replica set would be placed on the same peer.

 

BUT NOTE!! you can use an “Arbiter brick”….

 

Arbiter configuration for replica volumes

Arbiter volumes are replica 3 volumes where the 3rd brick acts as the arbiter brick. This configuration has mechanisms that prevent occurrence of split-brains.

 

It can be created with the following command:

 

`# gluster volume create <VOLNAME> replica 2 arbiter 1 host1:brick1 host2:brick2 host3:brick3`

 

 

 

Note: The number of bricks for a distributed-replicated Gluster volume should be a multiple of the replica count.

 

Also, the order in which bricks are specified has an effect on data protection.

 

Each group of replica_count consecutive bricks in the list you give will form a replica set, with all replica sets combined into a volume-wide distribute set. For example, with replica 2 the first two bricks listed form one replica set, the next two bricks form the second replica set, and so on.

 

To make sure that replica-set members are not placed on the same node, list the first brick on every server, then the second brick on every server in the same order, and so on.

 

 

example

 

# gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
Creation of test-volume has been successful
Please start the volume to access data.

 

 

compared with ordinary replicated:

 

# gluster volume create test-volume replica 2 transport tcp server1:/exp1 server2:/exp2
Creation of test-volume has been successful
Please start the volume to access data.

 

 

[root@glusterfs3 mnt]# gluster volume status
No volumes present
[root@glusterfs3 mnt]#

 

 

so, now we add 2 more peers to the trusted pool:

 

glusterfs1 and glusterfs2

 

[root@glusterfs3 mnt]#
[root@glusterfs3 mnt]# gluster peer probe glusterfs1
peer probe: success
[root@glusterfs3 mnt]# gluster peer probe glusterfs2
peer probe: success
[root@glusterfs3 mnt]# gluster peer status
Number of Peers: 3

 

Hostname: glusterfs4
Uuid: 2bfe642f-7dfe-4072-ac48-238859599564
State: Peer in Cluster (Connected)

 

Hostname: glusterfs1
Uuid: 02855654-335a-4be3-b80f-c1863006c31d
State: Peer in Cluster (Connected)

 

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)
[root@glusterfs3 mnt]#

 

so we now have a 4 node trusted pool consisting of glusterfs1,2,3 & 4.

 

 

Next, we can create our distributed replicated volume across the 4 nodes:

 

 

gluster volume create DRVOL replica 2 transport tcp glusterfs1:/STORAGE/EXPORT1 glusterfs2:/STORAGE/EXPORT2 glusterfs3:/STORAGE/EXPORT3 glusterfs4:/STORAGE/EXPORT4

 

[root@glusterfs1 ~]# gluster volume create DRVOL replica 2 transport tcp glusterfs1:/STORAGE/EXPORT1 glusterfs2:/STORAGE/EXPORT2 glusterfs3:/STORAGE/EXPORT3 glusterfs4:/STORAGE/EXPORT4
Replica 2 volumes are prone to split-brain. Use Arbiter or Replica 3 to avoid this. See: http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/.
Do you still want to continue?
(y/n) y
volume create: DRVOL: failed: /STORAGE/EXPORT1 is already part of a volume
[root@glusterfs1 ~]# gluster volume status
No volumes present
[root@glusterfs1 ~]#

 

The REASON for this error is that the brick directories already existed before running the volume create command (left over from our earlier lab exercises). These directories contain a .glusterfs subdirectory, and this blocks the creation of bricks with these names.

 

Solution: remove the old export subdirectories under /STORAGE/ on each node, i.e. /STORAGE/EXPORTn, which contain the .glusterfs subdirectory.

 

eg (on all machines!)

 

[root@glusterfs3 STORAGE]# rm -r -f EXPORT3/
[root@glusterfs3 STORAGE]#

 

then run the command again:

 

[root@glusterfs1 ~]# gluster volume create DRVOL replica 2 transport tcp glusterfs1:/STORAGE/EXPORT1 glusterfs2:/STORAGE/EXPORT2 glusterfs3:/STORAGE/EXPORT3 glusterfs4:/STORAGE/EXPORT4
Replica 2 volumes are prone to split-brain. Use Arbiter or Replica 3 to avoid this. See: http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/.
Do you still want to continue?
(y/n) y
volume create: DRVOL: success: please start the volume to access data
[root@glusterfs1 ~]#

 

 

(Ideally you should have at least 6 nodes, i.e. a 3-way replica, to avoid split-brain, but we will just go with 4 nodes for this example.)

 

 

so, now successfully created:

 

[root@glusterfs3 STORAGE]# gluster volume status
Status of volume: DRVOL
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/STORAGE/EXPORT1 49152 0 Y 1719
Brick glusterfs2:/STORAGE/EXPORT2 49152 0 Y 1645
Brick glusterfs3:/STORAGE/EXPORT3 49152 0 Y 2054
Brick glusterfs4:/STORAGE/EXPORT4 49152 0 Y 2014
Self-heal Daemon on localhost N/A N/A Y 2071
Self-heal Daemon on glusterfs4 N/A N/A Y 2031
Self-heal Daemon on glusterfs1 N/A N/A Y 1736
Self-heal Daemon on glusterfs2 N/A N/A Y 1662

Task Status of Volume DRVOL
——————————————————————————
There are no active volume tasks

[root@glusterfs3 STORAGE]#

 

 

[root@glusterfs3 STORAGE]# gluster volume info

Volume Name: DRVOL
Type: Distributed-Replicate
Volume ID: 570cdad3-39c3-4fb4-bce6-cc8030fe8a65
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: glusterfs1:/STORAGE/EXPORT1
Brick2: glusterfs2:/STORAGE/EXPORT2
Brick3: glusterfs3:/STORAGE/EXPORT3
Brick4: glusterfs4:/STORAGE/EXPORT4
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@glusterfs3 STORAGE]#

 

 

 

Mounting Gluster Volumes on Clients

The volume must first be started on the Gluster cluster.

 

(and of course the respective bricks must also be mounted on all participating node servers in the Gluster).

 

For this example we can use one of our Gluster servers to mount the volume.

 

Usually you would mount on a Gluster client machine. Since using this method requires additional packages to be installed on the client machine, we will instead use one of the servers to test, as if it were an actual separate client machine.

 

For our example, we will mount the volume on glusterfs1, using glusterfs1 as the mount server (but we could just as well specify glusterfs2, 3 or 4 as the server in the mount command):

 

mount -t glusterfs glusterfs1:/DRVOL /mnt

 

Note that we mount the volume by its Gluster volume name – NOT the underlying brick directory!

 

 

[root@glusterfs1 /]# mount -t glusterfs glusterfs1:/DRVOL /mnt
[root@glusterfs1 /]#
[root@glusterfs1 /]# df
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 753612 0 753612 0% /dev
tmpfs 765380 0 765380 0% /dev/shm
tmpfs 765380 8912 756468 2% /run
tmpfs 765380 0 765380 0% /sys/fs/cgroup
/dev/mapper/centos-root 8374272 2424712 5949560 29% /
/dev/vda1 1038336 269012 769324 26% /boot
/dev/vdb1 197996 2084 181382 2% /STORAGE
tmpfs 153076 0 153076 0% /run/user/0
glusterfs1:/DRVOL 395992 8128 362764 3% /mnt
[root@glusterfs1 /]#
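To make a client mount like this persistent across reboots, an fstab entry of the following form could be used (a sketch only, not part of the original lab; the _netdev option makes the mount wait for the network):

glusterfs1:/DRVOL /mnt glusterfs defaults,_netdev 0 0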

 

 

To Stop and Start a Gluster Volume

 

check volume status with:

 

gluster volume status

 

list available volumes with:

 

gluster volume info

 

 

[root@glusterfs1 ~]# gluster volume info all
 
 
Volume Name: DDVOL
Type: Disperse
Volume ID: 37d79a1a-3d24-4086-952e-2342c8744aa4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: glusterfs1:/DISK1/EXPORT1
Brick2: glusterfs2:/DISK1/EXPORT1
Brick3: glusterfs3:/DISK1/EXPORT1
Options Reconfigured:
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
[root@glusterfs1 ~]# 

 

 

 

check the peers with:

 

gluster peer status

 

[root@glusterfs1 ~]# gluster peer status
Number of Peers: 3

 

Hostname: glusterfs3
Uuid: 28a7bf8e-e2b9-4509-a45f-a95198139a24
State: Peer in Cluster (Connected)

 

Hostname: glusterfs4
Uuid: 2bfe642f-7dfe-4072-ac48-238859599564
State: Peer in Cluster (Disconnected)

 

Hostname: glusterfs2
Uuid: 5fd324e4-9415-441c-afea-4df61141c896
State: Peer in Cluster (Connected)
[root@glusterfs1 ~]#

 

 

 

gluster volume status all

 

[root@glusterfs1 ~]# gluster volume status all
Status of volume: DDVOL
Gluster process TCP Port RDMA Port Online Pid
——————————————————————————
Brick glusterfs1:/DISK1/EXPORT1 49152 0 Y 1403
Brick glusterfs2:/DISK1/EXPORT1 49152 0 Y 1298
Brick glusterfs3:/DISK1/EXPORT1 49152 0 Y 1299
Self-heal Daemon on localhost N/A N/A Y 1420
Self-heal Daemon on glusterfs2 N/A N/A Y 1315
Self-heal Daemon on glusterfs3 N/A N/A Y 1316

Task Status of Volume DDVOL
——————————————————————————
There are no active volume tasks

[root@glusterfs1 ~]#

 

 

 

to stop a gluster volume:

 

gluster volume stop <volname>

 

to start a gluster volume:

 

gluster volume start <volname>

 

 

To stop the Gluster system:

 

systemctl stop glusterd

 

 

[root@glusterfs1 ~]# systemctl status glusterd
● glusterd.service – GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-05-13 18:11:19 CEST; 13min ago
Docs: man:glusterd(8)
Process: 967 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 974 (glusterd)
CGroup: /system.slice/glusterd.service
└─974 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

 

May 13 18:11:18 glusterfs1 systemd[1]: Starting GlusterFS, a clustered file-system server…
May 13 18:11:19 glusterfs1 systemd[1]: Started GlusterFS, a clustered file-system server.
[root@glusterfs1 ~]#

 

 

 

 

[root@glusterfs1 ~]# systemctl stop glusterd
[root@glusterfs1 ~]# systemctl status glusterd
● glusterd.service – GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Fri 2022-05-13 18:24:59 CEST; 2s ago
Docs: man:glusterd(8)
Process: 967 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 974 (code=exited, status=15)

 

May 13 18:11:18 glusterfs1 systemd[1]: Starting GlusterFS, a clustered file-system server…
May 13 18:11:19 glusterfs1 systemd[1]: Started GlusterFS, a clustered file-system server.
May 13 18:24:59 glusterfs1 systemd[1]: Stopping GlusterFS, a clustered file-system server…
May 13 18:24:59 glusterfs1 systemd[1]: Stopped GlusterFS, a clustered file-system server.
[root@glusterfs1 ~]#

 

 

If there are still problems, do:

 

systemctl stop glusterd

 

mv /var/lib/glusterd/glusterd.info /tmp/.
rm -rf /var/lib/glusterd/*
mv /tmp/glusterd.info /var/lib/glusterd/.

systemctl start glusterd

 

 

 


LPIC3 DIPLOMA Linux Clustering – LAB NOTES: GlusterFS Configuration on Centos

How To Install GlusterFS on CentOS 7

 

Choose a package source: either the CentOS Storage SIG or Gluster.org

 

Using CentOS Storage SIG Packages

 

 

yum search centos-release-gluster

 

yum install centos-release-gluster37

 


 

yum install glusterfs gluster-cli glusterfs-libs glusterfs-server

 

 

 

[root@glusterfs1 ~]# yum search centos-release-gluster
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
================================================= N/S matched: centos-release-gluster =================================================
centos-release-gluster-legacy.noarch : Disable unmaintained Gluster repositories from the CentOS Storage SIG
centos-release-gluster40.x86_64 : Gluster 4.0 (Short Term Stable) packages from the CentOS Storage SIG repository
centos-release-gluster41.noarch : Gluster 4.1 (Long Term Stable) packages from the CentOS Storage SIG repository
centos-release-gluster5.noarch : Gluster 5 packages from the CentOS Storage SIG repository
centos-release-gluster6.noarch : Gluster 6 packages from the CentOS Storage SIG repository
centos-release-gluster7.noarch : Gluster 7 packages from the CentOS Storage SIG repository
centos-release-gluster8.noarch : Gluster 8 packages from the CentOS Storage SIG repository
centos-release-gluster9.noarch : Gluster 9 packages from the CentOS Storage SIG repository

Name and summary matches only, use “search all” for everything.
[root@glusterfs1 ~]#

 

 

Alternatively, using Gluster.org Packages

 

# yum update -y

 

 

Download the latest glusterfs-epel repository from gluster.org:

 

yum install wget -y

 

 

[root@glusterfs1 ~]# yum install wget -y
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
Package wget-1.14-18.el7_6.1.x86_64 already installed and latest version
Nothing to do
[root@glusterfs1 ~]#

 

 

 

wget -P /etc/yum.repos.d/ http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo

 

Also install the latest EPEL repository from fedoraproject.org to resolve all dependencies:

 

yum install http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

 

 

[root@glusterfs1 ~]# yum repolist
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
repo id repo name status
base/7/x86_64 CentOS-7 – Base 10,072
centos-ceph-nautilus/7/x86_64 CentOS-7 – Ceph Nautilus 609
centos-nfs-ganesha28/7/x86_64 CentOS-7 – NFS Ganesha 2.8 153
ceph-noarch Ceph noarch packages 184
epel/x86_64 Extra Packages for Enterprise Linux 7 – x86_64 13,638
extras/7/x86_64 CentOS-7 – Extras 498
updates/7/x86_64 CentOS-7 – Updates 2,579
repolist: 27,733
[root@glusterfs1 ~]#

 

 

Then install GlusterFS Server on all glusterfs storage cluster nodes.

[root@glusterfs1 ~]# yum install glusterfs gluster-cli glusterfs-libs glusterfs-server

 

Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.xtom.de
* centos-ceph-nautilus: mirror1.hs-esslingen.de
* centos-nfs-ganesha28: ftp.agdsn.de
* epel: mirrors.xtom.de
* extras: mirror.netcologne.de
* updates: mirrors.xtom.de
No package gluster-cli available.
No package glusterfs-server available.
Resolving Dependencies
--> Running transaction check
---> Package glusterfs.x86_64 0:6.0-49.1.el7 will be installed
---> Package glusterfs-libs.x86_64 0:6.0-49.1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=======================================================================================================================================
Package Arch Version Repository Size
=======================================================================================================================================
Installing:
glusterfs x86_64 6.0-49.1.el7 updates 622 k
glusterfs-libs x86_64 6.0-49.1.el7 updates 398 k

Transaction Summary
=======================================================================================================================================
Install 2 Packages

Total download size: 1.0 M
Installed size: 4.3 M
Is this ok [y/d/N]: y
Downloading packages:
(1/2): glusterfs-libs-6.0-49.1.el7.x86_64.rpm | 398 kB 00:00:00
(2/2): glusterfs-6.0-49.1.el7.x86_64.rpm | 622 kB 00:00:00
—————————————————————————————————————————————
Total 2.8 MB/s | 1.0 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : glusterfs-libs-6.0-49.1.el7.x86_64 1/2
Installing : glusterfs-6.0-49.1.el7.x86_64 2/2
Verifying : glusterfs-6.0-49.1.el7.x86_64 1/2
Verifying : glusterfs-libs-6.0-49.1.el7.x86_64 2/2

Installed:
glusterfs.x86_64 0:6.0-49.1.el7 glusterfs-libs.x86_64 0:6.0-49.1.el7

Complete!
[root@glusterfs1 ~]#

 

 

 

 

 


Pacemaker & Corosync Cluster Commands Cheat Sheet

 Config files for Corosync and Pacemaker

 

/etc/corosync/corosync.conf – config file for corosync cluster membership and quorum

 

/var/lib/pacemaker/crm/cib.xml – config file for cluster nodes and resources

 

Log files

 

/var/log/cluster/corosync.log

 

/var/log/pacemaker.log

 

/var/log/pcsd/pcsd.log

 

/var/log/messages – used for some other services including crmd and pengine etc.

 

 

Pacemaker Cluster Resources and Resource Groups

 

A cluster resource refers to any object or service which is managed by the Pacemaker cluster.

 

A number of different resources are defined by Pacemaker:

 

Primitive: this is the basic resource managed by the cluster.

 

Clone: a resource which can run on multiple nodes simultaneously.

 

Multistate or Master/Slave: a resource in which one instance serves as master and the other as slave. A common example of this is DRBD.

 

 

Resource Group: a set of primitives or clones which is used to group resources together for easier administration.

 

Resource Classes:

 

OCF or Open Cluster Framework: this is the most commonly used resource class for Pacemaker clusters
Service: used for implementing systemd, upstart, and lsb commands
Systemd: used for systemd commands
Fencing: used for Stonith fencing resources
Nagios: used for Nagios plugins
LSB or Linux Standard Base: these are for the older Linux init script operations. Now deprecated

 

Resource stickiness: this refers to keeping a resource running on the same cluster node even after a problem with that node occurs and is later rectified. This is advised since migrating resources to other nodes should generally be avoided.
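For example, a cluster-wide default stickiness can be set as follows (the same command appears again at the end of this cheat sheet):

pcs resource defaults resource-stickiness=100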

 

Constraints

Constraints: A set of rules that sets out how resources or resource groups should be started.

Constraint Types:

 

Location: A location constraint defines on which node a resource should run, or should not run if the score is set to -INFINITY.

Colocation: A colocation constraint defines which resources should be started together, or not started together in the case of -INFINITY.

Order: Order constraints define the order in which resources should be started, so that prerequisite services are started first.

 

Resource Order Priority Scores:

 

These are used with the constraint types above.

 

The priority score can be set to a value from -1,000,000 (-INFINITY, meaning the event will never happen) up to 1,000,000 (INFINITY, meaning the event must happen).

 

A score of -INFINITY will prevent the resource from running on that node; any other negative score simply makes that placement less preferred.
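For example, location constraints with explicit scores (the resource and node names below are the example names used elsewhere in this cheat sheet):

pcs constraint location apache-group prefers node1.example.com=INFINITY
pcs constraint location apache-group avoids node3.example.com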

 

 

Cluster Admin Commands

On Red Hat Pacemaker Clusters, the pcs command is used to manage the cluster. pcs stands for “Pacemaker/Corosync Configuration System”:

 

pcs status – View cluster status.
pcs config – View and manage cluster configuration.
pcs cluster – Configure cluster options and nodes.
pcs resource – Manage cluster resources.
pcs stonith – Manage fence devices.
pcs constraint – Manage resource constraints.
pcs property – Manage pacemaker properties.
pcs node – Manage cluster nodes.
pcs quorum – Manage cluster quorum settings.
pcs alert – Manage pacemaker alerts.
pcs pcsd – Manage pcs daemon.
pcs acl – Manage pacemaker access control lists.
 

 

Pacemaker Cluster Installation and Configuration Commands:

 

To install packages:

 

yum install pcs -y
yum install fence-agents-all -y

 

echo CHANGE_ME | passwd --stdin hacluster

 

systemctl start pcsd
systemctl enable pcsd

 

To authenticate new cluster nodes:

 

pcs cluster auth \
node1.example.com node2.example.com node3.example.com
Username: hacluster
Password:
node1.example.com: Authorized
node2.example.com: Authorized
node3.example.com: Authorized

 

To create and start a new cluster:

pcs cluster setup <option> <member> …

 

eg

 

pcs cluster setup --start --enable --name mycluster \
node1.example.com node2.example.com node3.example.com

To enable cluster services to start on reboot:

 

pcs cluster enable --all

 

To enable cluster service on a specific node[s]:

 

pcs cluster enable [--all] [node] […]

 

To disable cluster services on a node[s]:

 

pcs cluster disable [--all] [node] […]

 

To display cluster status:

 

pcs status
pcs config

 

pcs cluster status
pcs quorum status
pcs resource show
crm_verify -L -V

 

crm_mon – the equivalent status monitoring command used with the crmsh/crmd version of Pacemaker

 

 

To delete a cluster:

pcs cluster destroy <cluster>

 

To start/stop a cluster:

 

pcs cluster start --all
pcs cluster stop --all

 

To start/stop a cluster node:

 

pcs cluster start <node>
pcs cluster stop <node>

 

 

To carry out maintenance on a specific node:

 

pcs cluster standby <node>

Then to restore the node to the cluster service:

pcs cluster unstandby <node>

 

To switch a node to standby mode:

 

pcs cluster standby <node1>

 

To restore a node from standby mode:

 

pcs cluster unstandby <node1>

 

To set a cluster property

 

pcs property set <property>=<value>

 

To disable stonith fencing: NOTE: you should usually not do this on a live production cluster!

 

pcs property set stonith-enabled=false

 

 

To reenable the stonith fencing:

 

pcs property set stonith-enabled=true

 

To configure firewalling for the cluster:

 

firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload

 

To add a node to the cluster:

 

check hacluster user and password

 

systemctl status pcsd

 

Then on an active node:

 

pcs cluster auth node4.example.com
pcs cluster node add node4.example.com

 

Then, on the new node:

 

pcs cluster start
pcs cluster enable

 

To display the xml configuration

 

pcs cluster cib

 

To display current cluster status:

 

pcs status

 

To manage cluster resources:

 

pcs resource <tab>

 

To enable, disable and relocate resource groups:

 

pcs resource move <resource>

 

or alternatively with:

 

pcs resource relocate <resource>

 

To move the resource back to its original node by clearing the constraints created by move or relocate:

 

pcs resource clear <resource>

 

pcs constraint <type> <option>

 

To create a new resource:

 

pcs resource create <resource_name> <resource_type> <resource_options>

 

To create new resources, reference the appropriate resource agents or RAs.

 

To list ocf resource types:

 

(example below with ocf:heartbeat)

 

pcs resource list heartbeat

 

ocf:heartbeat:IPaddr2
ocf:heartbeat:LVM
ocf:heartbeat:Filesystem
ocf:heartbeat:oracle
ocf:heartbeat:apache
To display the options and details of a resource type or agent:

 

pcs resource describe <resource_type>
pcs resource describe ocf:heartbeat:IPaddr2

 

pcs resource create vip_cluster ocf:heartbeat:IPaddr2 ip=192.168.125.10 --group myservices
pcs resource create apache-ip ocf:heartbeat:IPaddr2 ip=192.168.125.20 cidr_netmask=24

 

 

To display a resource:

 

pcs resource show

 

Cluster Troubleshooting

Logging functions:

 

journalctl

 

tail -f /var/log/messages

 

tail -f /var/log/cluster/corosync.log

 

Debug information commands:

 

pcs resource debug-start <resource>
pcs resource debug-stop <resource>
pcs resource debug-monitor <resource>
pcs resource failcount show <resource>

 

 

To update a resource after modification:

 

pcs resource update <resource> <options>

 

To reset the failcount:

 

pcs resource cleanup <resource>

 

To move a resource away from its current node (optionally specifying a destination node):

 

pcs resource move <resource> [ <node> ]

 

To start a resource or a resource group:

 

pcs resource enable <resource>

 

To stop a resource or resource group:

 

pcs resource disable <resource>

 

 

To create a resource group and add a new resource:

 

pcs resource create <resource_name> <resource_type> <resource_options> --group <group>
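
For example, a sketch of building a simple web service group (the resource names, IP address and paths here are hypothetical, not taken from this lab):

pcs resource create web-ip ocf:heartbeat:IPaddr2 ip=192.168.125.30 cidr_netmask=24 --group web-group
pcs resource create web-fs ocf:heartbeat:Filesystem device=/dev/vg1/lv_web directory=/var/www/html fstype=ext4 --group web-group
pcs resource create web-server ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf --group web-group

Resources are added to the group in the order given, which is also the order in which they are started within the group.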

 

To delete a resource:

 

pcs resource delete <resource>

 

To add a resource to a group:

 

pcs resource group add <group> <resource>
pcs resource group list
pcs resource list

 

To add a constraint to a resource group:

 

pcs constraint colocation add apache-group with ftp-group -100000
pcs constraint order apache-group then ftp-group

 

 

To reset a constraint for a resource or a resource group:

 

pcs resource clear <resource>

 

To list resource agent (RA) classes:

 

pcs resource standards

 

To list available RAs:

 

pcs resource agents ocf | service | stonith

 

To list specific resource agents of a specific RA provider:

 

pcs resource agents ocf:pacemaker

 

To list RA information:

 

pcs resource describe RA
pcs resource describe ocf:heartbeat:RA

 

To create a resource:

 

pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.100.125 cidr_netmask=24 op monitor interval=60s

To delete a resource:

 

pcs resource delete resourceid

 

To display a resource (example with ClusterIP):

 

pcs resource show ClusterIP

 

To start a resource:

 

pcs resource enable ClusterIP

 

To stop a resource:

 

pcs resource disable ClusterIP

 

To remove a resource:

 

pcs resource delete ClusterIP

 

To modify a resource:

 

pcs resource update ClusterIP clusterip_hash=sourceip

 

To change a parameter for a resource (resource-specific; here changing the IP of ClusterIP):

 

pcs resource update ClusterIP ip=192.168.100.25

 

To list the current resource defaults:

 

pcs resource defaults

 

To set resource defaults:

 

pcs resource defaults resource-stickiness=100

 

To list current operation defaults:

 

pcs resource op defaults

 

To set operation defaults:

 

pcs resource op defaults timeout=240s

 

To set colocation:

 

pcs constraint colocation add ClusterIP with WebSite INFINITY

 

To set colocation with roles:

 

pcs constraint colocation add Started AnotherIP with Master WebSite INFINITY

 

To set constraint ordering:

 

pcs constraint order ClusterIP then WebSite

 

To display constraint list:

 

pcs constraint list --full

 

To show a resource failure count:

 

pcs resource failcount show RA

 

To reset a resource failure count:

 

pcs resource failcount reset RA

 

To create a resource clone:

 

pcs resource clone ClusterIP globally-unique=true clone-max=2 clone-node-max=2

 

To manage a resource:

 

pcs resource manage RA

 

To unmanage a resource:

 

pcs resource unmanage RA

 

 

Fencing (Stonith) commands:

ipmitool -H rh7-node1-irmc -U admin -P password power on

 

fence_ipmilan --ip=rh7-node1-irmc.localdomain --username=admin --password=password --action=status

Status: ON

pcs stonith

 

pcs stonith describe fence_ipmilan

 

pcs stonith create ipmi-fencing1 fence_ipmilan \
pcmk_host_list="rh7-node1.localdomain" \
ipaddr=192.168.100.125 \
login=admin passwd=password \
op monitor interval=60s

 

pcs property set stonith-enabled=true
pcs stonith fence pcmk-2
stonith_admin --reboot pcmk-2

 

To display fencing resources:

 

pcs stonith show

 

 

To display Stonith RA information:

 

pcs stonith describe fence_ipmilan

 

To list available fencing agents:

 

pcs stonith list

 

To filter the list of available Stonith fence agents by a search string:

 

pcs stonith list <string>

 

To setup properties for Stonith:

 

pcs property set no-quorum-policy=ignore
pcs property set stonith-action=poweroff # default is reboot

 

To create a fencing device:

 

pcs stonith create stonith-rsa-node1 fence_rsa action=off ipaddr="node1_rsa" login=<user> passwd=<pass> pcmk_host_list=node1 secure=true

 

To display fencing devices:

 

 

pcs stonith show

 

To fence a node off from the rest of the cluster:

 

pcs stonith fence <node>

 

To modify a fencing device:

 

pcs stonith update stonithid [options]

 

To display fencing device options:

 

pcs stonith describe <stonith_ra>

 

To delete a fencing device:

 

pcs stonith delete stonithid

 


LPIC3 DIPLOMA Linux Clustering – LAB NOTES: Lesson Ceph Centos7 – Ceph CRUSH Map

LAB on Ceph Clustering on Centos7

 

These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering. They are in “rough format”, presented as they were written.

 

This lab uses the ceph-deploy tool to set up the ceph cluster.  However, note that ceph-deploy is now an outdated Ceph tool and is no longer being maintained by the Ceph project. It is also not available for Centos8. The notes below relate to Centos7.

 

For OS versions of Centos higher than 7 the Ceph project advise you to use the cephadm tool for installing ceph on cluster nodes. 

 

At the time of writing (2021) knowledge of ceph-deploy is a stipulated syllabus requirement of the LPIC3-306 Clustering Diploma Exam, hence this Centos7 Ceph lab refers to ceph-deploy.

 

As Ceph is a large and complex subject, these notes have been split into several different pages.

 

Overview of Cluster Environment 

 

The cluster comprises three nodes installed with Centos7 and housed on a KVM virtual machine system on a Linux Ubuntu host. We are installing with Centos7 rather than a more recent version because the later versions are not compatible with the ceph-deploy tool.

 

CRUSH is a crucial part of Ceph’s storage system as it’s the algorithm Ceph uses to determine how data is stored across the nodes in a Ceph cluster.

 

Ceph stores client data as objects within storage pools.  Using the CRUSH algorithm, Ceph calculates in which placement group the object should best be stored and then also calculates which Ceph OSD node should store the placement group.

The CRUSH algorithm also enables the Ceph Storage Cluster to scale, rebalance, and recover dynamically from faults.

 

The CRUSH map is a hierarchical cluster storage resource map representing the available storage resources.  CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server. As CRUSH uses an algorithmically determined method of storing and retrieving data, the CRUSH map allows Ceph to scale without performance bottlenecks, scalability problems or single points of failure.

 

Ceph uses three storage concepts for data management:

 

Pools
Placement Groups, and
CRUSH Map

 

Pools

 

Ceph stores data within logical storage groups called pools. Pools manage the number of placement groups, the number of replicas, and the ruleset deployed for the pool.

 

Placement Groups

 

Placement groups (PGs) are the shards or fragments of a logical object pool that store objects as a group on OSDs. Placement groups reduce the amount of metadata to be processed whenever Ceph reads or writes data to OSDs.

 

NOTE: Deploying a larger number of placement groups (e.g. 100 PGs per OSD) will result in better load balancing.

 

The CRUSH map contains a list of OSDs (physical disks), a list of buckets for aggregating the devices into physical locations, and a list of rules that define how CRUSH will replicate data in the Ceph cluster.

 

Buckets can contain any number of OSDs. Buckets can themselves also contain other buckets, enabling them to form interior nodes in a storage hierarchy.

 

OSDs and buckets have numerical identifiers and weight values associated with them.

 

This structure can be used to reflect the actual physical organization of the cluster installation, taking into account such characteristics as physical proximity, common power sources, and shared networks.

 

When you deploy OSDs they are automatically added to the CRUSH map under a host bucket named for the node on which they run. This ensures that replicas or erasure code shards are distributed across hosts and that a single host or other failure will not affect service availability.

 

The main practical advantages of CRUSH are:

 

Avoiding consequences of device failure. This is a big advantage over RAID.

 

Fast: reads and writes occur in microseconds.

 

Stability and reliability: very little data movement occurs when the topology changes.

 

Flexibility — replication, erasure codes, complex placement schemes are all possible.

 

 

The CRUSH Map Structure

 

The CRUSH map consists of a hierarchy that describes the physical topology of the cluster and a set of rules defining data placement policy.

 

The hierarchy has devices (OSDs) at the leaves, and internal nodes corresponding to other physical features or groupings:

 

hosts, racks, rows, datacenters, etc.

 

The rules describe how replicas are placed in terms of that hierarchy (e.g., ‘three replicas in different racks’).

 

Devices

 

Devices are individual OSDs that store data, usually one for each storage drive. Devices are identified by an id (a non-negative integer) and a name, normally osd.N where N is the device id.

 

Types and Buckets

 

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, racks, rows, etc.

 

The CRUSH map defines a series of types used to describe these nodes.

 

The default types include:

 

osd (or device)

 

host

 

chassis

 

rack

 

row

 

pdu

 

pod

 

room

 

datacenter

 

zone

 

region

 

root

 

Most clusters use only a handful of these types, and others can be defined as needed.

 

 

CRUSH Rules

 

CRUSH Rules define policy about how data is distributed across the devices in the hierarchy. They define placement and replication strategies or distribution policies that allow you to specify exactly how CRUSH places data replicas.

 

To display what rules are defined in the cluster:

 

ceph osd crush rule ls

 

You can view the contents of the rules with:

 

ceph osd crush rule dump
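
New replicated rules can also be created from the command line rather than by editing the map. A minimal sketch (the rule name replicated_hosts is hypothetical) creating a rule that places replicas on separate hosts under the default root:

ceph osd crush rule create-replicated replicated_hosts default host

On recent Ceph versions the rule can then be assigned to a pool with ceph osd pool set {pool-name} crush_rule replicated_hosts.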

 

The weights associated with each node in the hierarchy can be displayed with:

 

ceph osd tree

 

 

To modify the CRUSH MAP

 

To add or move an OSD in the CRUSH map of a running cluster:

 

ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} …]

 

 

eg

 

The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.

 

ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

 

To Remove an OSD from the CRUSH Map

 

To remove an OSD from the CRUSH map of a running cluster, execute the following:

 

ceph osd crush remove {name}

 

To Add, Move or Remove a Bucket to the CRUSH Map

 

To add a bucket in the CRUSH map of a running cluster, execute the ceph osd crush add-bucket command:

 

ceph osd crush add-bucket {bucket-name} {bucket-type}

 

To move a bucket to a different location or position in the CRUSH map hierarchy:

 

ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, […]

 

 

To remove a bucket from the CRUSH hierarchy, use:

 

ceph osd crush remove {bucket-name}

 

Note: A bucket must be empty before removing it from the CRUSH hierarchy.

 

 

 

How To Tune CRUSH 

 

 

CRUSH uses matched sets of profile values, known as tunables, to tune the behaviour of the CRUSH algorithm.

 

As of the Octopus release these are:

 

legacy: the legacy behavior from argonaut and earlier.

 

argonaut: the legacy values supported by the original argonaut release

 

bobtail: the values supported by the bobtail release

 

firefly: the values supported by the firefly release

 

hammer: the values supported by the hammer release

 

jewel: the values supported by the jewel release

 

optimal: the best (ie optimal) values of the current version of Ceph

 

default: the default values of a new cluster installed from scratch. These values, which depend on the current version of Ceph, are hardcoded and are generally a mix of optimal and legacy values. They generally match the optimal profile of the previous LTS release, or of the most recent release for which most users are likely to have up-to-date clients.

 

You can apply a profile to a running cluster with the command:

 

ceph osd crush tunables {PROFILE}
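
For example, to apply the optimal profile for the running Ceph version (note that on an existing cluster this may trigger data movement):

ceph osd crush tunables optimal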

 

 

How To Determine a CRUSH Location

 

The location of an OSD within the CRUSH map’s hierarchy is known as the CRUSH location.

 

This location specifier takes the form of a list of key and value pairs.

 

Eg if an OSD is in a specific row, rack, chassis and host, and is part of the ‘default’ CRUSH root (as usual for most clusters), its CRUSH location will be:

 

root=default row=a rack=a2 chassis=a2a host=a2a1

 

The CRUSH location for an OSD can be defined by adding the crush location option in ceph.conf.
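
A minimal sketch of such an entry in ceph.conf (the location values below are hypothetical):

[osd]
crush location = root=default rack=a2 host=a2a1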

 

Each time the OSD starts, it checks that it is in the correct location in the CRUSH map. If it is not then it moves itself.

 

To disable this automatic CRUSH map management, edit ceph.conf and add the following in the [osd] section:

 

osd crush update on start = false

 

 

 

However, note that in most cases it is not necessary to manually configure this.

 

 

How To Edit and Modify the CRUSH Map

 

It is more convenient to modify the CRUSH map at runtime with the Ceph CLI than editing the CRUSH map manually.

 

However, you may sometimes wish to edit the CRUSH map manually, for example in order to change the default bucket types, or to use an alternative bucket algorithm to straw.

 

 

The steps in overview:

 

Get the CRUSH map.

 

Decompile the CRUSH map.

 

Edit at least one: Devices, Buckets or Rules.

 

Recompile the CRUSH map.

 

Set the CRUSH map.

 

 

Get a CRUSH Map

 

ceph osd getcrushmap -o {compiled-crushmap-filename}

 

This writes (-o) a compiled CRUSH map to the filename you specify.

 

However, as the CRUSH map is in compiled form, you first need to decompile it.

 

Decompile a CRUSH Map

 

Use the crushtool utility:

 

crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

 

 

 

The CRUSH Map has six sections:

 

tunables: The preamble at the top of the map describes any tunables for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These correct for old bugs, optimizations, or other changes in behavior made over the years to CRUSH.

 

devices: Devices are individual ceph-osd daemons that store data.

 

types: Bucket types define the types of buckets used in the CRUSH hierarchy. Buckets consist of a hierarchical aggregation of storage locations (e.g., rows, racks, chassis, hosts, etc.) together with their assigned weights.

 

buckets: Once you define bucket types, you must define each node in the hierarchy, its type, and which devices or other nodes it contains.

 

rules: Rules define policy about how data is distributed across devices in the hierarchy.

 

choose_args: Choose_args are alternative weights associated with the hierarchy that have been adjusted to optimize data placement.

 

A single choose_args map can be used for the entire cluster, or alternatively one can be created for each individual pool.

 

 

Display the current crush hierarchy with:

 

ceph osd tree

 

[root@ceph-mon ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.00757 root default
-3 0.00378 host ceph-osd0
0 hdd 0.00189 osd.0 down 0 1.00000
3 hdd 0.00189 osd.3 up 1.00000 1.00000
-5 0.00189 host ceph-osd1
1 hdd 0.00189 osd.1 up 1.00000 1.00000
-7 0.00189 host ceph-osd2
2 hdd 0.00189 osd.2 up 1.00000 1.00000
[root@ceph-mon ~]#

 

 

 

To edit the CRUSH map:

 

ceph osd getcrushmap -o crushmap.txt

 

crushtool -d crushmap.txt -o crushmap-decompile

 

nano crushmap-decompile

 

 

 

Edit at least one of Devices, Buckets and Rules:

 

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

 

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd

 

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-osd0 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.004
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.002
item osd.3 weight 0.002
}
host ceph-osd1 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.002
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.002
}
host ceph-osd2 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.002
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.002
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 0.008
alg straw2
hash 0 # rjenkins1
item ceph-osd0 weight 0.004
item ceph-osd1 weight 0.002
item ceph-osd2 weight 0.002
}

 

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

 

# end crush map
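
Once the edits are complete, recompile the map and load it back into the cluster, reusing the decompiled filename from above (crushmap-compiled is simply an arbitrary output filename):

crushtool -c crushmap-decompile -o crushmap-compiled
ceph osd setcrushmap -i crushmap-compiled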

 

 

To add racks to the cluster CRUSH layout:

 

ceph osd crush add-bucket rack01 rack
ceph osd crush add-bucket rack02 rack

 

[root@ceph-mon ~]# ceph osd crush add-bucket rack01 rack
added bucket rack01 type rack to crush map
[root@ceph-mon ~]# ceph osd crush add-bucket rack02 rack
added bucket rack02 type rack to crush map
[root@ceph-mon ~]#
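
A likely next step, sketched here rather than taken from the captured lab output, is to move the host buckets under the new racks and the racks under the default root, using the move command shown earlier:

ceph osd crush move ceph-osd0 rack=rack01
ceph osd crush move ceph-osd1 rack=rack02
ceph osd crush move ceph-osd2 rack=rack02
ceph osd crush move rack01 root=default
ceph osd crush move rack02 root=default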

 

 

 


LPIC3-306 COURSE NOTES: CEPH – An Overview

These are my notes made during my lab practical as part of my LPIC3 Diploma course in Linux Clustering.

They are in “rough format”, presented as they were written.

 

 

LPIC3-306 Clustering – 363.2 Ceph Syllabus Requirements

 

 

Exam Weighting: 8

 

Description: Candidates should be able to manage and maintain a Ceph Cluster. This
includes the configuration of RGW, RDB devices and CephFS.

 

Key Knowledge Areas:
• Understand the architecture and components of Ceph
• Manage OSD, MGR, MON and MDS
• Understand and manage placement groups and pools
• Understand storage backends (FileStore and BlueStore)
• Initialize a Ceph cluster
• Create and manage Rados Block Devices
• Create and manage CephFS volumes, including snapshots
• Mount and use an existing CephFS
• Understand and adjust CRUSH maps

 

Configure high availability aspects of Ceph
• Scale up a Ceph cluster
• Restore and verify the integrity of a Ceph cluster after an outage
• Understand key concepts of Ceph updates, including update order, tunables and
features

 

Partial list of the used files, terms and utilities:
• ceph-deploy (including relevant subcommands)
• ceph.conf
• ceph (including relevant subcommands)
• rados (including relevant subcommands)
• rbd (including relevant subcommands)
• cephfs (including relevant subcommands)
• ceph-volume (including relevant subcommands)
• ceph-authtool
• ceph-bluestore-tool
• crushtool

 

 

 

What is Ceph

 

Ceph is an open-source, massively scalable, software-defined storage system or “SDS”

 

It provides object, block and file system storage via a single clustered high-availability platform.

 

The intention of Ceph is to be a fully distributed system with no single point of failure which is self-healing and self-managing. Although production environment Ceph systems are best run on a high-grade hardware specification,  Ceph runs on standard commodity computer hardware.

 

An Overview of Ceph  

 

 

When Ceph services start, the initialization process activates a series of daemons that run in the background.

 

A Ceph Cluster runs with a minimum of three types of daemons:

 

Ceph Monitor (ceph-mon)

 

Ceph Manager (ceph-mgr)

 

Ceph OSD Daemon (ceph-osd)

 

Ceph Storage Clusters that support the Ceph File System also run at least one Ceph Metadata Server (ceph-mds).

 

Clusters that support Ceph Object Storage run Ceph RADOS Gateway daemons (radosgw) as well.

 

 

OSD or Object Storage Daemon:  An OSD stores data, handles data replication, recovery, backfilling, and rebalancing. An OSD also provides monitoring data for Ceph Monitors by checking other Ceph OSD Daemons for an active heartbeat.  A Ceph Storage Cluster requires at least two Ceph OSD Daemons in order to maintain an active + clean state.

 

Monitor or Mon: maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.

 

Ceph also maintains a history or “epoch” of each state change in the Monitors, Ceph OSD Daemons, and the PGs.

 

Metadata Server or MDS: The MDS holds metadata relating to the Ceph Filesystem and enables POSIX file system users to execute standard POSIX commands such as ls, find, etc. without creating overhead on the Ceph Storage Cluster. MDS is only required if you are intending to run CephFS. It is not necessary if only block and object storage is to be used.  

 

A Ceph Storage Cluster requires at least one Ceph Monitor, one Ceph Manager, and at least one (preferably two or more) Ceph OSD or Object Storage Daemon servers; a Ceph Metadata Server or MDS is additionally required if CephFS is to be used.

 

Ceph stores data in the form of objects within logical storage pools. The CRUSH algorithm is used by Ceph to decide which placement group should contain the object and which Ceph OSD Daemon should store the placement group.

 

The CRUSH algorithm is also used by Ceph to scale, rebalance, and recover from failures.

 

Note that at the time of writing the newer versions of Ceph are not packaged for Debian. Ceph is in general better supported on CentOS, since Red Hat maintains both CentOS and Ceph.

 

Ceph-deploy now replaced by cephadm

 

NOTE that ceph-deploy is now an outdated tool and is no longer maintained. It is also not available for Centos8. You should either use an installation method such as the above, or alternatively, use the cephadm tool for installing ceph on cluster nodes. However, a working knowledge of ceph-deploy is at time of writing still required for the LPIC3 exam.

 

For more on cephadm see https://ceph.io/ceph-management/introducing-cephadm/

 

 

 

The Client nodes know about monitors, OSDs and MDS’s but have no knowledge of object locations. Ceph clients communicate directly with the OSDs rather than going through a dedicated server.

 

The OSDs (Object Storage Daemons) store the data. They can be up and in the map or can be down and out if they have failed. An OSD can be down but still in the map which means that the PG has not yet been remapped. When OSDs come on line they inform the monitor.

 

The Monitor nodes store a master copy of the cluster map.

 

 

RADOS (Reliable Autonomic Distributed Object Store)

 

RADOS  makes up the heart of the scalable object storage service. 

 

In addition to accessing RADOS via the defined interfaces, it is also possible to access RADOS directly via a set of library calls.

 

 

CRUSH (Controlled Replication Under Scalable Hashing)

 

The CRUSH map contains the topology of the system and is location aware. Objects are mapped to Placement Groups, and Placement Groups are in turn mapped to OSDs. This allows for dynamic rebalancing and controls which Placement Group holds the objects. It also defines which of the OSDs should hold the Placement Group.

 

The CRUSH map holds a list of OSDs, buckets and rules that hold replication directives.

 

CRUSH will try not to move data during rebalancing whereas a true hash function would be likely to cause greater data movement.

 

 

The CRUSH map allows for different resiliency models such as:

 

#0 for a 1-node cluster.

 

#1 for a multi node cluster in a single rack

 

#2 for a multi node, multi chassis cluster with multiple hosts in a chassis

 

#3 for a multi node cluster with hosts across racks, etc.

 

osd crush chooseleaf type = {n}
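
For example, a sketch of the corresponding ceph.conf entry for a multi-node cluster within a single rack:

[global]
osd crush chooseleaf type = 1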

 

Buckets

 

Buckets are a hierarchical structure of storage locations; a bucket in the CRUSH map context is a location.

 

Placement Groups (PGs)

 

Ceph subdivides a storage pool into placement groups, assigning each individual object to a placement group, and then assigns the placement group to a primary OSD.

 

If an OSD node fails or the cluster re-balances, Ceph is able to replicate or move a placement group and all the objects stored within it without the need to move or replicate each object individually. This allows for an efficient re-balancing or recovery of the Ceph cluster.

 

Objects are mapped to Placement Groups by hashing the object’s name along with the replication factor and a bitmask.

 

 

When you create a pool, a number of placement groups are automatically created by Ceph for the pool. If you don’t directly specify a number of placement groups, Ceph uses the default value of 8 which is extremely low.

 

A more useful default value is 128. For example:

 

osd pool default pg num = 128
osd pool default pgp num = 128

 

You need to set both pg_num (the total number of placement groups) and pgp_num (the number of placement groups considered for placement) to the same value. As a general guide use the following values:

 

Less than 5 OSDs: set pg_num and pgp_num to 128.
Between 5 and 10 OSDs: set pg_num and pgp_num to 512
Between 10 and 50 OSDs: set pg_num and pgp_num to 4096

 

 

To specifically define the number of PGs:

 

set pool x pg_num to {pg_num}

 

ceph osd pool set {pool-name} pg_num {pg_num}

 

 

set pool x pgp_num to {pgp_num}

 

ceph osd pool set {pool-name} pgp_num {pgp_num}
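
For example, for a hypothetical pool named mypool on a cluster with fewer than 5 OSDs:

ceph osd pool set mypool pg_num 128
ceph osd pool set mypool pgp_num 128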

 

How To Create OSD Nodes on Ceph Using ceph-deploy

 

 

BlueStore is now the default storage backend used for Ceph OSDs.

 

Before you add a BlueStore OSD node to Ceph, first delete all data on the device/s that will serve as OSDs.

 

You can do this with the zap command:

 

$CEPH_CONFIG_DIR/ceph-deploy disk zap node device

 

Replace node with the node name or host name where the disk is located.

 

Replace device with the path to the device on the host where the disk is located.

 

Eg to delete the data on a device named /dev/sdc on a node named ceph-node3 in the Ceph Storage Cluster, use:

 

$CEPH_CONFIG_DIR/ceph-deploy disk zap ceph-node3 /dev/sdc

 

 

Next, to create an OSD (BlueStore by default), enter:

 

$CEPH_CONFIG_DIR/ceph-deploy osd create --data device node

 

This creates a volume group and logical volume on the specified disk. Both data and journal are stored on the same logical volume.

 

Eg

 

$CEPH_CONFIG_DIR/ceph-deploy osd create --data /dev/sdc ceph-node3

 

 

 

How To Create A FileStore OSD Manually

 

Quoted from the Ceph website:

 

FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.

 

FileStore is well-tested and widely used in production but suffers from many performance deficiencies due to its overall design and reliance on a traditional file system for storing object data.

 

Although FileStore is generally capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we only recommend that XFS be used. Both btrfs and ext4 have known bugs and deficiencies and their use may lead to data loss. By default all Ceph provisioning tools will use XFS.

 

The official Ceph default storage system is now BlueStore. Prior to Ceph version Luminous, the default (and only option available) was Filestore.

 

 

Note the instructions below create a FileStore and not a BlueStore system!

 

To create a FileStore OSD manually ie without using ceph-deploy or cephadm:

 

first create the required partitions on the OSD node concerned: one for data, one for journal.

 

This example creates a 40 GB data partition on /dev/sdc1 and a journal partition of 12GB on /dev/sdc2:

 

 

parted /dev/sdc --script -- mklabel gpt
parted --script /dev/sdc mkpart primary 0MB 40000MB
parted --script /dev/sdc mkpart primary 42000MB 55000MB

 

dd if=/dev/zero of=/dev/sdc1 bs=1M count=1000

 

sgdisk --zap-all --clear --mbrtogpt -g -- /dev/sdc2

 

ceph-volume lvm zap /dev/sdc2

 

 

 

From the deployment node, create the FileStore OSD. To specify OSD file type, use –filestore and –fs-type.

 

Eg, to create a FileStore OSD with XFS filesystem:

 

$CEPH_CONFIG_DIR/ceph-deploy osd create --filestore --fs-type xfs --data /dev/sdc1 --journal /dev/sdc2 ceph-node2

 

 

What is BlueStore?

 

Any new OSDs (e.g., when the cluster is expanded) can be deployed using BlueStore. This is the default behavior so no specific change is needed.

 

There are two methods OSDs can use to manage the data they store.

 

The default is now BlueStore. Prior to Ceph version Luminous, the default (and only option available) was Filestore.

 

BlueStore is a new back-end object storage system for Ceph OSD daemons. The original object store used by Ceph, FileStore, required a file system placed on top of raw block devices. Objects were then written to the file system.

 

By contrast, BlueStore does not require a file system for itself, because BlueStore stores objects directly on the block device. This improves cluster performance as it removes file system overhead.

 

BlueStore can use different block devices for storing different data. As an example, Hard Disk Drive (HDD) storage for data, Solid-state Drive (SSD) storage for metadata, Non-volatile Memory (NVM) or persistent or Non-volatile RAM (NVRAM) for the RocksDB WAL (write-ahead log).

 

In the simplest implementation, BlueStore resides on a single storage device which is partitioned into two parts: one containing OSD metadata, and the other the actual data partition.

 

The OSD metadata partition is formatted with XFS and holds information about the OSD, such as its identifier, the cluster it belongs to, and its private keyring.

 

The data partition contains the actual OSD data and is managed by BlueStore. The primary partition is identified by a block symbolic link in the data directory.

 

Two additional devices can also be implemented:

 

A WAL (write-ahead-log) device: This contains the BlueStore internal journal or write-ahead log and is identified by the block.wal symbolic link in the data directory.

 

Best practice is to use an SSD disk to implement a WAL device in order to provide optimum performance.

 

 

A DB device: this stores BlueStore internal metadata. The embedded RocksDB database will then place as much metadata as possible on the DB device instead of on the primary device to optimize performance.

 

Only if the DB device becomes full will it then place metadata on the primary device. As for WAL, best practice for the Bluestore DB device is to deploy an SSD.

 

 

 

Starting and Stopping Ceph

 

To start all Ceph daemons:

 

[root@admin ~]# systemctl start ceph.target

 

To stop all Ceph daemons:

 

[root@admin ~]# systemctl stop ceph.target

 

To restart all Ceph daemons:

 

[root@admin ~]# systemctl restart ceph.target

 

To start, stop, and restart individual Ceph daemons:

 

 

On Ceph Monitor nodes:

 

systemctl start ceph-mon.target

 

systemctl stop ceph-mon.target

 

systemctl restart ceph-mon.target

 

On Ceph Manager nodes:

 

systemctl start ceph-mgr.target

 

systemctl stop ceph-mgr.target

 

systemctl restart ceph-mgr.target

 

On Ceph OSD nodes:

 

systemctl start ceph-osd.target

 

systemctl stop ceph-osd.target

 

systemctl restart ceph-osd.target

 

On Ceph Object Gateway nodes:

 

systemctl start ceph-radosgw.target

 

systemctl stop ceph-radosgw.target

 

systemctl restart ceph-radosgw.target

 

 

To perform stop, start, restart actions on specific Ceph monitor, manager, OSD or object gateway node instances:

 

On a Ceph Monitor node:

 

systemctl start ceph-mon@$MONITOR_HOST_NAME
systemctl stop ceph-mon@$MONITOR_HOST_NAME
systemctl restart ceph-mon@$MONITOR_HOST_NAME

 

On a Ceph Manager node:

systemctl start ceph-mgr@MANAGER_HOST_NAME
systemctl stop ceph-mgr@MANAGER_HOST_NAME
systemctl restart ceph-mgr@MANAGER_HOST_NAME

 

 

On a Ceph OSD node:

 

systemctl start ceph-osd@$OSD_NUMBER
systemctl stop ceph-osd@$OSD_NUMBER
systemctl restart ceph-osd@$OSD_NUMBER

 

Substitute $OSD_NUMBER with the ID number of the Ceph OSD.
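
For example, to restart the daemon for osd.1 (one of the OSD IDs shown in the ceph osd tree output earlier):

systemctl restart ceph-osd@1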

 

On a Ceph Object Gateway node:

 

systemctl start ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl stop ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME
systemctl restart ceph-radosgw@rgw.$OBJ_GATEWAY_HOST_NAME

 

 

Placement Groups PG Information

 

To display the number of placement groups in a pool:

 

ceph osd pool get {pool-name} pg_num

 

 

To display statistics for the placement groups in the cluster:

 

ceph pg dump [--format {format}]

 

 

How To Check Status of the Ceph Cluster

 

 

To check the status and health of the cluster from the administration node, use:

 

ceph health
ceph status

 

Note that it can often take several minutes for the cluster to stabilize before the cluster health indicates HEALTH_OK.

 

You can also check the cluster quorum status of the cluster monitors:

 

ceph quorum_status --format json-pretty

 

 

For more Ceph admin commands, see https://sabaini.at/pages/ceph-cheatsheet.html#monit

 

 

The ceph.conf File

 

Each Ceph daemon looks for a ceph.conf file that contains its configuration settings.  For manual deployments, you need to create a ceph.conf file to define your cluster.

 

ceph.conf contains the following definitions:

 

Cluster membership
Host names
Host addresses
Paths to keyrings
Paths to journals
Paths to data
Other runtime options
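
A minimal sketch of a ceph.conf for a small cluster (all values below are illustrative placeholders, not taken from this lab):

[global]
fsid = 11111111-2222-3333-4444-555555555555
mon initial members = ceph-mon
mon host = 192.168.122.100
public network = 192.168.122.0/24
osd pool default size = 3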

 

The default ceph.conf locations in sequential order are as follows:

 

$CEPH_CONF (i.e., the path following the $CEPH_CONF environment variable)

 

-c path/path (i.e., the -c command line argument)

 

/etc/ceph/ceph.conf

 

~/.ceph/config

 

./ceph.conf (i.e., in the current working directory)

 

ceph-conf is a utility for getting information from a ceph configuration file.

 

As with most Ceph programs, you can specify which Ceph configuration file to use with the -c flag.

 

 

ceph-conf -L lists all sections of the configuration file.

 

 

Ceph Journals 

 

Note that journals are used only on FileStore.

 

BlueStore does not use journals (it uses a write-ahead log instead), so journals are not explicitly defined for BlueStore systems.

 

 

How To List Your Cluster Pools

 

To list your cluster pools, execute:

 

ceph osd lspools

 

Rename a Pool

 

To rename a pool, execute:

 

ceph osd pool rename <current-pool-name> <new-pool-name>

 

 


How To Install Pacemaker and Corosync on Centos

This article sets out how to install the clustering management software Pacemaker and the cluster membership software Corosync on Centos version 8.

 

For this example, we are setting up a three node cluster using virtual machines on the Linux KVM hypervisor platform.

 

The virtual machines have the KVM names and hostnames centos1, centos2, and centos3.

 

Each node has two network interfaces: one for the KVM bridged NAT network (KVM network name: default, via eth0) and the other for the cluster subnet 10.0.8.0 (KVM network name: network-10.0.8.0, via eth1). DHCP is not used for either of these interfaces. Pacemaker and Corosync require static IP addresses.

 

The machine centos1 will be our current designated co-ordinator (DC) cluster node.

 

First, make sure you have first created an ssh-key for root on the first node:

 

[root@centos1 .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:********** root@centos1.localdomain

 

then copy the ssh key to the other nodes:

 

ssh-copy-id centos2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: “/root/.ssh/id_rsa.pub”
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

 

/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
(if you think this is a mistake, you may want to use -f option)

 

[root@centos1 .ssh]#
First you need to enable the HighAvailability repository

 

[root@centos1 ~]# yum repolist all | grep -i HighAvailability
ha CentOS Stream 8 – HighAvailability disabled
[root@centos1 ~]# dnf config-manager --set-enabled ha
[root@centos1 ~]# yum repolist all | grep -i HighAvailability
ha CentOS Stream 8 – HighAvailability enabled
[root@centos1 ~]#

 

Next, install the following packages:

 

[root@centos1 ~]# yum install epel-release

 

[root@centos1 ~]# yum install pcs fence-agents-all

 

Next, STOP and DISABLE Firewall for lab testing convenience:

 

[root@centos1 ~]# systemctl stop firewalld
[root@centos1 ~]#
[root@centos1 ~]# systemctl disable firewalld
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@centos1 ~]#

 

then check with:

 

[root@centos1 ~]# systemctl status firewalld
● firewalld.service – firewalld – dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)

 

Next we enable pcsd, the pcs configuration daemon:

 

[root@centos1 ~]# systemctl enable --now pcsd
Created symlink /etc/systemd/system/multi-user.target.wants/pcsd.service → /usr/lib/systemd/system/pcsd.service.
[root@centos1 ~]#

 

then change the default password for user hacluster:

 

echo <new-password> | passwd --stdin hacluster

 

Changing password for user hacluster.

passwd: all authentication tokens updated successfully.
[root@centos2 ~]#

 

Then, on only ONE of the nodes, I am doing it on centos1 on the KVM cluster, as this will be the default DC for the cluster:

 

pcs host auth centos1.localdomain centos2.localdomain centos3.localdomain

 

NOTE: the correct command here is pcs host auth, not pcs cluster auth as shown in some older instruction material; the syntax has since changed.

 

[root@centos1 .ssh]# pcs host auth centos1.localdomain centos2.localdomain centos3.localdomain
Username: hacluster
Password:
centos1.localdomain: Authorized
centos2.localdomain: Authorized
centos3.localdomain: Authorized
[root@centos1 .ssh]#

 

Next, on centos1, as this will be our default DC (designated coordinator node) we create a corosync secret key:

 

[root@centos1 corosync]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 2048 bits for key from /dev/urandom.
Writing corosync key to /etc/corosync/authkey.
[root@centos1 corosync]#

 

Then copy the key to the other two nodes:

 

scp /etc/corosync/authkey centos2:/etc/corosync/
scp /etc/corosync/authkey centos3:/etc/corosync/

 

[root@centos1 corosync]# pcs cluster setup hacluster centos1.localdomain addr=10.0.8.11 centos2.localdomain addr=10.0.8.12 centos3.localdomain addr=10.0.8.13
Sending ‘corosync authkey’, ‘pacemaker authkey’ to ‘centos1.localdomain’, ‘centos2.localdomain’, ‘centos3.localdomain’
centos1.localdomain: successful distribution of the file ‘corosync authkey’
centos1.localdomain: successful distribution of the file ‘pacemaker authkey’
centos2.localdomain: successful distribution of the file ‘corosync authkey’
centos2.localdomain: successful distribution of the file ‘pacemaker authkey’
centos3.localdomain: successful distribution of the file ‘corosync authkey’
centos3.localdomain: successful distribution of the file ‘pacemaker authkey’
Sending ‘corosync.conf’ to ‘centos1.localdomain’, ‘centos2.localdomain’, ‘centos3.localdomain’
centos1.localdomain: successful distribution of the file ‘corosync.conf’
centos2.localdomain: successful distribution of the file ‘corosync.conf’
centos3.localdomain: successful distribution of the file ‘corosync.conf’
Cluster has been successfully set up.
[root@centos1 corosync]#

 

Note I had to specify the IP addresses for the nodes. This is because these nodes each have TWO network interfaces with separate IP addresses. If the nodes only had one network interface, then you can leave out the addr= setting.
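
After the setup command completes, the generated /etc/corosync/corosync.conf contains a nodelist built from the addresses given above, roughly of this form (a sketch; the exact layout may differ between corosync versions):

nodelist {
    node {
        ring0_addr: 10.0.8.11
        name: centos1.localdomain
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.8.12
        name: centos2.localdomain
        nodeid: 2
    }
    node {
        ring0_addr: 10.0.8.13
        name: centos3.localdomain
        nodeid: 3
    }
}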

 

Next you can start the cluster:

 

[root@centos1 corosync]# pcs cluster start
Starting Cluster…
[root@centos1 corosync]#
[root@centos1 corosync]#
[root@centos1 corosync]# pcs cluster status
Cluster Status:
Cluster Summary:
* Stack: unknown
* Current DC: NONE
* Last updated: Mon Feb 22 12:57:37 2021
* Last change: Mon Feb 22 12:57:35 2021 by hacluster via crmd on centos1.localdomain
* 3 nodes configured
* 0 resource instances configured
Node List:
* Node centos1.localdomain: UNCLEAN (offline)
* Node centos2.localdomain: UNCLEAN (offline)
* Node centos3.localdomain: UNCLEAN (offline)

 

PCSD Status:
centos1.localdomain: Online
centos3.localdomain: Online
centos2.localdomain: Online
[root@centos1 corosync]#

 

 

The Node List says "UNCLEAN".

 

So I did:

 

pcs cluster start centos1.localdomain
pcs cluster start centos2.localdomain
pcs cluster start centos3.localdomain
pcs cluster status

 

then the cluster was started in clean running state:

 

[root@centos1 cluster]# pcs cluster status
Cluster Status:
Cluster Summary:
* Stack: corosync
* Current DC: centos1.localdomain (version 2.0.5-7.el8-ba59be7122) – partition with quorum
* Last updated: Mon Feb 22 13:22:29 2021
* Last change: Mon Feb 22 13:17:44 2021 by hacluster via crmd on centos1.localdomain
* 3 nodes configured
* 0 resource instances configured
Node List:
* Online: [ centos1.localdomain centos2.localdomain centos3.localdomain ]

 

PCSD Status:
centos1.localdomain: Online
centos2.localdomain: Online
centos3.localdomain: Online
[root@centos1 cluster]#
