Troubleshoot Issues with Ceph and Rook
These topics provide instructions for troubleshooting issues with Ceph and Rook.
Troubleshoot OSD Creation Failure
Problem
Installation fails because object storage daemons (OSDs) are not created.
Solution
A common reason for installation failure is that the object storage daemons (OSDs) are not created. An OSD configures the storage on a cluster node. OSDs might not be created when disk resources are unavailable, either because of insufficient resources or because of incorrectly partitioned disk space. To prevent installation from failing due to insufficient disk space, ensure that the nodes have sufficient unformatted disk space available.
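A quick way to verify this on a node is to list the block devices and their filesystem signatures; a disk that Ceph can consume shows an empty FSTYPE column. The device name /dev/sdb below is only an example; substitute the disk you intend to give to Ceph.
$ lsblk -f
$ lsblk -f /dev/sdb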
Debug Disk Formatting Issues
Problem
Installation fails when a disk needs to be formatted.
Solution
Ensure sufficient unformatted disk space.
Examine the logs of the rook-ceph-osd-prepare-hostname-* jobs to determine whether you need to reformat the disk or partition.
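You can view these logs with kubectl; the label selector below assumes the default app=rook-ceph-osd-prepare label that Rook applies to the prepare pods.
$ kubectl logs -n rook-ceph -l app=rook-ceph-osd-prepare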
To reformat the disk or partition and restart Rook:
- Use one of the following methods to reformat an existing disk or partition:
CAUTION: These commands completely reformat the disks or partitions that you are using, and you will lose all data on them.
- If you have a block storage device that should have been used for Ceph, but wasn't used because it was in an unusable state, you can reformat the disk completely.
$ sgdisk --zap-all /dev/disk
$ dd if=/dev/zero of=/dev/disk bs=1M count=100
- If you have a disk partition that should have been used for Ceph, you can clear the data on the partition completely.
$ wipefs -a -f /dev/partition
$ dd if=/dev/zero of=/dev/partition bs=1M count=100
- Restart Rook to save the changes and reattempt the OSD creation process.
$ kubectl rollout restart deploy -n rook-ceph rook-ceph-operator
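After the operator restarts, you can confirm that the OSD pods were created. The app=rook-ceph-osd label selector below is an assumption based on the default labels that Rook applies to OSD pods; a plain kubectl get po -n rook-ceph works as well.
$ kubectl get po -n rook-ceph -l app=rook-ceph-osd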
Troubleshoot Ceph OSD Failure
Problem
The Ceph OSD fails.
Solution
You must identify the failed OSD and remove it. You can then reformat or replace the disk, either partially or completely.
To troubleshoot disk failure:
- Run the following command to check the status of the Rook and Ceph pods installed in the rook-ceph namespace.
# kubectl get po -n rook-ceph
- If a rook-ceph-osd-* pod is in the Error or CrashLoopBackOff state, then you must repair the disk. Follow these steps to repair the disk:
- Stop rook-ceph-operator.
# kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=0
- Remove the failing OSD processes.
# kubectl delete deploy -n rook-ceph rook-ceph-osd-number
- Connect to the toolbox.
$ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
- Identify the failing OSD.
# ceph osd status
- Mark the failed OSD out.
[root@rook-ceph-tools-/]# ceph osd out 5
marked out osd.5.
[root@rook-ceph-tools-/]# ceph osd status
ID  HOST          USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  10.xx.xx.210  4856M  75.2G       0        0       0        0  exists,up
 1  10.xx.xx.215  2986M  77.0G       0        0       1       89  exists,up
 2  10.xx.xx.98   3243M  76.8G       0        0       1       15  exists,up
 3  10.xx.xx.195  4945M  75.1G       0        0       0        0  exists,up
 4  10.xx.xx.170  5053M  75.0G       0        0       0        0  exists,up
 5  10.xx.xx.197      0      0       0        0       0        0  exists
- Remove the failed OSD, where number is the ID of the failed OSD (5 in the example above).
# ceph osd purge number --yes-i-really-mean-it
- Connect to the node that hosted the failed OSD, and do one of the following:
- Replace the hard disk in case of a hardware failure.
- Reformat the disk completely.
$ sgdisk --zap-all /dev/disk
$ dd if=/dev/zero of=/dev/disk bs=1M count=100
- Reformat the partition completely.
$ wipefs -a -f /dev/partition
$ dd if=/dev/zero of=/dev/partition bs=1M count=100
- Restart rook-ceph-operator.
# kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=1
- Monitor the OSD pods.
# kubectl get po -n rook-ceph
If the OSD does not recover, use the same procedure to remove the OSD, and then remove the disk or delete the partition before restarting rook-ceph-operator.
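Once the replacement rook-ceph-osd-* pod is running, you can also confirm from the toolbox that the new OSD has rejoined the cluster; it should be listed with the exists,up state.
# ceph osd status
# ceph osd tree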
Debug Issues with Rook and Ceph Pods
Problem
Installation can fail when Rook and Ceph pods are in the error state.
Solution
Underpowered hardware can cause Rook and Ceph pod errors, and these errors can in turn cause installation to fail. To solve most issues with Rook and Ceph pods, ensure that the installed pods are in the Running state. To confirm this:
- Run the following command to check the status of the Rook and Ceph pods installed in the rook-ceph namespace.
# kubectl get po -n rook-ceph
- Ensure that the following pods are in the Running state:
  - rook-ceph-mon-*: Typically, three monitor pods
  - rook-ceph-mgr-*: One manager pod
  - rook-ceph-osd-*: Three or more OSD pods
  - rook-ceph-mds-cephfs-*: Metadata servers
  - rook-ceph-rgw-object-store-*: ObjectStore gateway
  - rook-ceph-tools*: For additional debugging options
To connect to the toolbox, use this command:
$ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
For more information on additional common commands in the toolbox, see Table 1.
Table 1: Additional Commands

Command                    Description
# ceph status              View cluster status.
# ceph osd status          View a summary of the OSD map.
# ceph osd df              View details of disk usage (global and per pool).
# ceph osd utilization     View OSD utilization.
# ceph osd pool stats      View disk pool usage.
# ceph osd tree            View the OSD tree.
# ceph pg stat             View placement group (PG) status and performance.
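If you prefer not to open an interactive shell, you can run any of these commands in a single step from outside the toolbox. The deploy/rook-ceph-tools target below assumes the default name of the toolbox deployment; adjust it if your deployment is named differently.
$ kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph status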