Troubleshoot Issues with Ceph and Rook

10-Jan-23

These topics provide instructions for troubleshooting issues with Ceph and Rook.

Troubleshoot OSD Creation Failure

Problem

Installation fails because object storage daemons (OSDs) are not created.

Solution

A common reason for installation failure is that the OSDs are not created. An OSD configures the storage on a cluster node. OSDs might not be created when disk resources are unavailable, either because the node has too little disk space or because the disk space is incorrectly partitioned. To prevent this failure, ensure that the nodes have sufficient unformatted disk space available.
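One way to check a node for unformatted space is to list the block devices that carry no filesystem signature. The following is a minimal sketch rather than part of the product tooling. The helper name `unformatted_devices` is hypothetical, and the sketch assumes that `lsblk` from util-linux is available on the node and supports the NAME, PKNAME, TYPE, and FSTYPE columns. It skips any disk that already has partitions.

```shell
# Hypothetical helper: print disks and partitions that have no filesystem
# signature and no child partitions.
# Input (stdin): output of `lsblk -P -o NAME,PKNAME,TYPE,FSTYPE`.
unformatted_devices() {
  awk '
    {
      name = ""; pk = ""; type = ""; fs = ""
      # Each field looks like KEY="value"; split it and strip the quotes.
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        gsub(/"/, "", kv[2])
        if (kv[1] == "NAME")   name = kv[2]
        if (kv[1] == "PKNAME") pk   = kv[2]
        if (kv[1] == "TYPE")   type = kv[2]
        if (kv[1] == "FSTYPE") fs   = kv[2]
      }
      order[NR] = name; fstype[name] = fs; kind[name] = type
      # Remember every device that is the parent of another device.
      if (pk != "") parent[pk] = 1
    }
    END {
      for (n = 1; n <= NR; n++) {
        d = order[n]
        if ((kind[d] == "disk" || kind[d] == "part") &&
            fstype[d] == "" && !(d in parent))
          print d
      }
    }'
}

# Usage on a cluster node:
#   lsblk -P -o NAME,PKNAME,TYPE,FSTYPE | unformatted_devices
```

An empty result suggests that the node has no raw device for an OSD to use. A listed device is only a candidate, so confirm its size and purpose before relying on it.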

Debug Disk Formatting Issues

Problem

Installation fails when a disk needs to be formatted.

Solution

Ensure that the nodes have sufficient unformatted disk space.

Examine the logs of the rook-ceph-osd-prepare-hostname-* jobs to determine whether you need to reformat the disk or partition. To reformat the disk or partition and restart Rook:

  1. Use one of the following methods to reformat an existing disk or partition:

    CAUTION:

    These commands completely reformat the disk or partitions that you are using, and you will lose all data on them.

    • If you have a block storage device that should have been used for Ceph, but was not used because it was in an unusable state, you can reformat the disk completely.
      $ sgdisk --zap-all /dev/disk
      $ dd if=/dev/zero of=/dev/disk bs=1M count=100
    • If you have a disk partition that should have been used for Ceph, you can clear the data on the partition completely.
      $ wipefs -a -f /dev/partition
      $ dd if=/dev/zero of=/dev/partition bs=1M count=100
  2. Restart Rook to save the changes and reattempt the OSD creation process.

    $ kubectl rollout restart deploy -n rook-ceph rook-ceph-operator
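Before reformatting anything, you can collect the OSD prepare logs mentioned above in one pass. The following is a minimal sketch. The helper name `osd_prepare_logs` is hypothetical, and the sketch assumes that the prepare pods carry the label app=rook-ceph-osd-prepare, so verify the label in your cluster first.

```shell
# Hypothetical helper: print the last lines of every OSD prepare pod log.
# Assumption: the prepare pods are labeled app=rook-ceph-osd-prepare.
osd_prepare_logs() {
  ns="${1:-rook-ceph}"      # namespace, defaults to rook-ceph
  lines="${2:-50}"          # number of log lines to show per pod
  for pod in $(kubectl get pods -n "$ns" -l app=rook-ceph-osd-prepare -o name); do
    echo "== $pod =="
    kubectl logs -n "$ns" "$pod" --tail="$lines"
  done
}

# Usage:
#   osd_prepare_logs
#   osd_prepare_logs rook-ceph 200
```

Messages about a device being skipped, for example because it already holds a filesystem or partitions, indicate which disk or partition is a candidate for reformatting.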

Troubleshoot Ceph OSD Failure

Problem

The Ceph OSD fails.

Solution

Identify the failed OSD and remove it. You can then reformat the affected partition or the entire disk, or replace the disk.

To troubleshoot disk failure:

  1. Run the following command to check the status of Rook and Ceph pods installed in the rook-ceph namespace.

    # kubectl get po -n rook-ceph

  2. If a rook-ceph-osd-* pod is in the Error or CrashLoopBackOff state, then you must repair the disk.

    Follow these steps to repair the disk:

    1. Stop rook-ceph-operator.

      # kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=0

    2. Remove the failing OSD processes.

      # kubectl delete deploy -n rook-ceph rook-ceph-osd-number

    3. Connect to the toolbox.

      $ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash

    4. Identify the failing OSD.

      # ceph osd status

    5. Mark the failed OSD out.

      [root@rook-ceph-tools-/]# ceph osd out 5 
      marked out osd.5. 
      [root@rook-ceph-tools-/]# ceph osd status
      ID  HOST           USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE      
       0  10.xx.xx.210  4856M  75.2G      0        0       0        0   exists,up  
       1  10.xx.xx.215  2986M  77.0G      0        0       1       89   exists,up  
       2  10.xx.xx.98   3243M  76.8G      0        0       1       15   exists,up  
       3  10.xx.xx.195  4945M  75.1G      0        0       0        0   exists,up  
       4  10.xx.xx.170  5053M  75.0G      0        0       0        0   exists,up  
       5  10.xx.xx.197     0      0       0        0       0        0   exists     
    6. Remove the failed OSD.

      # ceph osd purge number --yes-i-really-mean-it

    7. Connect to the node that hosted the failed OSD, and do one of the following:
      • Replace the hard disk in case of a hardware failure.
      • Reformat the disk completely.
        $ sgdisk --zap-all /dev/disk
        $ dd if=/dev/zero of=/dev/disk bs=1M count=100
      • Reformat the partition completely.
        $ wipefs -a -f /dev/partition
        $ dd if=/dev/zero of=/dev/partition bs=1M count=100
    8. Restart rook-ceph-operator.

      # kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=1

    9. Monitor the OSD pods.

      # kubectl get po -n rook-ceph

      If the OSD does not recover, use the same procedure to remove the OSD, and then remove the disk or delete the partition before restarting rook-ceph-operator.
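The identification steps in this procedure can be scripted. The following is a minimal sketch with hypothetical helper names (`failing_osd_pods`, `osd_id_from_pod`, `down_osds`). It assumes that OSD pod names follow the pattern rook-ceph-osd-number-suffix and that `ceph osd status` prints the plain column layout shown in the sample output above.

```shell
# Hypothetical helpers for identifying a failed OSD.

# Print OSD pods in the Error or CrashLoopBackOff state.
# Input (stdin): output of `kubectl get po -n rook-ceph --no-headers`.
# The numeric part of the pattern excludes rook-ceph-osd-prepare-* pods.
failing_osd_pods() {
  awk '$1 ~ /^rook-ceph-osd-[0-9]+-/ &&
       ($3 == "Error" || $3 == "CrashLoopBackOff") { print $1 }'
}

# Print the numeric OSD ID embedded in an OSD pod name.
osd_id_from_pod() {
  echo "$1" | sed -n 's/^rook-ceph-osd-\([0-9][0-9]*\)-.*/\1/p'
}

# Print the IDs of OSDs whose STATE column does not include "up".
# Input (stdin): output of `ceph osd status`, run inside the toolbox.
down_osds() {
  awk '$1 ~ /^[0-9]+$/ && $NF !~ /(^|,)up(,|$)/ { print $1 }'
}

# Usage:
#   kubectl get po -n rook-ceph --no-headers | failing_osd_pods
#   ceph osd status | down_osds
```

The ID printed by `osd_id_from_pod` or `down_osds` is the number to substitute in the rook-ceph-osd-number deployment name and in the `ceph osd out` and `ceph osd purge` commands. Cross-check both sources before removing an OSD.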

Debug Issues with Rook and Ceph Pods

Problem

Installation can fail when Rook and Ceph pods are in the error state.

Solution

Underpowered hardware can cause errors in Rook and Ceph pods, and pods in the error state can cause installation to fail. To resolve most issues with Rook and Ceph pods, ensure that the installed pods are in the running state. To confirm this:

  1. Run the following command to check the status of Rook and Ceph pods installed in the rook-ceph namespace.

    # kubectl get po -n rook-ceph

  2. Ensure that the following pods are in the running state:

    • rook-ceph-mon-*—Typically, three monitor pods
    • rook-ceph-mgr-*—One manager pod

    • rook-ceph-osd-*—Three or more OSD pods

    • rook-ceph-mds-cephfs-*—Metadata servers

    • rook-ceph-rgw-object-store-*—ObjectStore gateway

    • rook-ceph-tools*—For additional debugging options

      To connect to the toolbox, use this command:

      $ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash

      For more information on additional common commands in the toolbox, see Table 1.

      Table 1: Additional Commands

      Command                  Description
      # ceph status            View cluster status.
      # ceph osd status        View summary of OSD map.
      # ceph osd df            View disk usage for each OSD.
      # ceph osd utilization   View OSD utilization.
      # ceph osd pool stats    View disk pool usage.
      # ceph osd tree          View OSD tree.
      # ceph pg stat           View placement group (PG) status and performance.
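The pod check in this procedure can be scripted as a quick first pass. The following is a minimal sketch. The helper name `missing_running_pods` is hypothetical, and the name prefixes are taken from the pod list above. The sketch reports a prefix only when no pod with that prefix is in the Running state, and it does not verify the expected counts (three monitors, three or more OSDs).

```shell
# Hypothetical helper: print each expected pod-name prefix that has no pod
# in the Running state.
# Input (stdin): output of `kubectl get po -n rook-ceph --no-headers`.
missing_running_pods() {
  awk '
    BEGIN {
      n = split("rook-ceph-mon- rook-ceph-mgr- rook-ceph-osd- " \
                "rook-ceph-mds-cephfs- rook-ceph-rgw-object-store- " \
                "rook-ceph-tools", want, " ")
    }
    # Column 3 of the kubectl output is the pod STATUS.
    $3 == "Running" {
      for (i = 1; i <= n; i++)
        if (index($1, want[i]) == 1) seen[i] = 1
    }
    END {
      for (i = 1; i <= n; i++)
        if (!(i in seen)) print want[i]
    }'
}

# Usage:
#   kubectl get po -n rook-ceph --no-headers | missing_running_pods
```

No output means that every pod type has at least one Running pod. Completed rook-ceph-osd-prepare-* pods are not counted as OSD pods because the sketch considers only pods in the Running state.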