Troubleshoot Issues with Ceph and Rook
These topics provide instructions for troubleshooting issues with Ceph and Rook.
Troubleshoot OSD Creation Failure
Problem
Installation fails because object storage daemons (OSDs) are not created.
Solution
A common reason for installation failure is that the object storage daemons (OSDs) are not created. An OSD configures the storage on a cluster node. OSDs might not be created when disk resources are unavailable, either because of insufficient resources or because of incorrectly partitioned disk space. To prevent installation from failing due to insufficient disk space, ensure that the nodes have sufficient unformatted disk space available.
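A quick way to verify this on a node is to list the block devices and their filesystem signatures; a disk that Ceph can consume shows an empty FSTYPE column. The device name /dev/sdb below is only an example; substitute the disk you intend to give to Ceph.
$ lsblk -f
$ lsblk -f /dev/sdb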
Debug Disk Formatting Issues
Problem
Installation fails when a disk needs to be formatted.
Solution
Ensure sufficient unformatted disk space.
Examine the logs of the rook-ceph-osd-prepare-hostname-* jobs to determine whether you need to reformat the disk or partition.
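You can view these logs with kubectl; the label selector below assumes the default app=rook-ceph-osd-prepare label that Rook applies to the prepare pods.
$ kubectl logs -n rook-ceph -l app=rook-ceph-osd-prepare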
To reformat the disk or partition and restart Rook:
- Use one of the following methods to reformat an existing disk or partition:
CAUTION: These commands completely reformat the disks or partitions that you are using, and you will lose all data on them.
- If you have a block storage device that should have been used for Ceph, but wasn't used because it was in an unusable state, you can reformat the disk completely.
$ sgdisk --zap-all /dev/disk
$ dd if=/dev/zero of=/dev/disk bs=1M count=100
- If you have a disk partition that should have been used for Ceph, you can clear the data on the partition completely.
$ wipefs -a -f /dev/partition
$ dd if=/dev/zero of=/dev/partition bs=1M count=100
- Restart Rook to save the changes and reattempt the OSD creation process.
$ kubectl rollout restart deploy -n rook-ceph rook-ceph-operator
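After the operator restarts, you can confirm that the OSD pods were created. The app=rook-ceph-osd label selector below is an assumption based on the default labels that Rook applies to OSD pods; a plain kubectl get po -n rook-ceph works as well.
$ kubectl get po -n rook-ceph -l app=rook-ceph-osd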
Troubleshoot Ceph OSD Failure
Problem
The Ceph OSD fails.
Solution
You must identify the failed OSD and remove it. You can then reformat or replace the disk, either partially or completely.
To troubleshoot disk failure:
- Run the following command to check the status of the Rook and Ceph pods installed in the rook-ceph namespace.
# kubectl get po -n rook-ceph
- If a rook-ceph-osd-* pod is in the Error or CrashLoopBackOff state, then you must repair the disk. Follow these steps to repair the disk:
- Stop rook-ceph-operator.
# kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=0
- Remove the failing OSD processes.
# kubectl delete deploy -n rook-ceph rook-ceph-osd-number
- Connect to the toolbox.
$ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
- Identify the failing OSD.
# ceph osd status
- Mark the failed OSD out.
[root@rook-ceph-tools-/]# ceph osd out 5
marked out osd.5.
[root@rook-ceph-tools-/]# ceph osd status
ID  HOST          USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  10.xx.xx.210  4856M  75.2G       0        0       0        0  exists,up
 1  10.xx.xx.215  2986M  77.0G       0        0       1       89  exists,up
 2  10.xx.xx.98   3243M  76.8G       0        0       1       15  exists,up
 3  10.xx.xx.195  4945M  75.1G       0        0       0        0  exists,up
 4  10.xx.xx.170  5053M  75.0G       0        0       0        0  exists,up
 5  10.xx.xx.197      0      0       0        0       0        0  exists
- Remove the failed OSD, where number is the ID of the failed OSD (5 in the example above).
# ceph osd purge number --yes-i-really-mean-it
- Connect to the node that hosted the failed OSD, and do one of the following:
- Replace the hard disk in case of a hardware failure.
- Reformat the disk completely.
$ sgdisk --zap-all /dev/disk
$ dd if=/dev/zero of=/dev/disk bs=1M count=100
- Reformat the partition completely.
$ wipefs -a -f /dev/partition
$ dd if=/dev/zero of=/dev/partition bs=1M count=100
- Restart rook-ceph-operator.
# kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=1
- Monitor the OSD pods.
# kubectl get po -n rook-ceph
If the OSD does not recover, use the same procedure to remove the OSD, and then remove the disk or delete the partition before restarting rook-ceph-operator.
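Once the replacement rook-ceph-osd-* pod is running, you can also confirm from the toolbox that the new OSD has rejoined the cluster; it should be listed with the exists,up state.
# ceph osd status
# ceph osd tree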
Debug Issues with Rook and Ceph Pods
Problem
Installation can fail when Rook and Ceph pods are in the error state.
Solution
Underpowered hardware can cause Rook and Ceph pod errors, and these errors can in turn cause installation to fail. To solve most issues with Rook and Ceph pods, ensure that the installed pods are in the Running state. To confirm this:
- Run the following command to check the status of the Rook and Ceph pods installed in the rook-ceph namespace.
# kubectl get po -n rook-ceph
- Ensure that the following pods are in the Running state:
  - rook-ceph-mon-*: Typically, three monitor pods
  - rook-ceph-mgr-*: One manager pod
  - rook-ceph-osd-*: Three or more OSD pods
  - rook-ceph-mds-cephfs-*: Metadata servers
  - rook-ceph-rgw-object-store-*: ObjectStore gateway
  - rook-ceph-tools*: For additional debugging options
To connect to the toolbox, use this command:
$ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
For more information on additional common commands in the toolbox, see Table 1.
Table 1: Additional Commands

Command                    Description
# ceph status              View cluster status.
# ceph osd status          View a summary of the OSD map.
# ceph osd df              View details of disk usage (global and per pool).
# ceph osd utilization     View OSD utilization.
# ceph osd pool stats      View disk pool usage.
# ceph osd tree            View the OSD tree.
# ceph pg stat             View placement group (PG) status and performance.
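If you prefer not to open an interactive shell, you can run any of these commands in a single step from outside the toolbox. The deploy/rook-ceph-tools target below assumes the default name of the toolbox deployment; adjust it if your deployment is named differently.
$ kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status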