Due to a lack of proper planning, a developer purchased more than 100TB of NAS storage directly from a public cloud provider. During use, the data was stored without any organization: all business data was written into the same directory, and the subdirectories had no structure, containing user pictures, videos, access logs, and even backups. As the number of applications increased, the cost of using the public cloud skyrocketed. Considering cost and controllability, it was decided to migrate the NAS data and all applications from the public cloud to a self-built Proxmox VE hyper-converged cluster.
A virtual machine was created in the Proxmox VE cluster and allocated 40TB of disk space, which was mounted as a single large partition. This approach was not the preferred one, and the user was advised to clean up unnecessary data first. However, the user replied that the task was urgent and the directory tree was too unstructured to sort through on short notice, so all the data on the public cloud NAS had to be copied to the virtual machine on Proxmox VE as-is. With a 300 Mbps link between the hosting data center and the public cloud, the copy took nearly a month. During operation, the Proxmox VE cluster servers unexpectedly rebooted simultaneously, causing a Ceph OSD failure, and the command “ceph pg repair” was ineffective. Since there were already businesses running on the cluster, rebuilding the Ceph pool and copying the data from the public cloud NAS again was not feasible. Because of the huge single partition allocated to the virtual machine, various repair methods were tried without success, and the only option was to wait for Ceph to recover on its own, which took nearly another month.
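For reference, while waiting for a recovery like this, progress can be followed with the standard Ceph status commands below (a minimal sketch; the exact output naturally varies per cluster):

ceph -s              # overall cluster health plus recovery/backfill progress
ceph health detail   # details on degraded, inconsistent, or stuck placement groups
ceph pg stat         # one-line summary of placement group states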
To solve the shared storage issue, dedicated hardware was purchased: an ordinary server with a large number of hard drives, an SSD for the operating system, and NVMe drives for caching. TrueNAS was deployed on it, and the data from the large partition on the Proxmox VE cluster was migrated over, completely removing that risk from the cluster.
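As an illustration of how such storage can be attached (the storage ID, server address, and export path here are hypothetical), an NFS share exported by TrueNAS can be added to the Proxmox VE cluster with pvesm:

pvesm add nfs truenas-nfs --server 192.168.1.50 --export /mnt/tank/vmdata --content images,backup

Once added, the share appears as shared storage on every node, so VM disks and backups no longer have to live on the Ceph pool.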
In the past few days, the process of releasing this enormous virtual machine began. Deletion turned out to be extremely slow, taking almost 24 hours to reach just over 40%.
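To get a sense of why deleting such an image takes so long, the amount of data actually allocated in it can be checked beforehand (a quick sketch; the pool and image names match those that appear later in this post):

rbd info hdd_pool/vm-100-disk-0   # image size, object size, and features
rbd du hdd_pool/vm-100-disk-0     # provisioned vs. actually used space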
Since this was too slow, an attempt was made to delete the VM disk directly from the Proxmox VE web management interface, but the operation failed with an error.
After logging into the host system of any node in the cluster, the command “rbd ls -l hdd_pool” was executed to check for abnormal virtual machine disk images.
The output confirmed that problematic virtual machine disk images did exist. The command “rbd rm vm-100-disk-0 -p hdd_pool” was used to remove the first of them.
The other problematic disk image, “vm-127-disk-1”, was removed in the same way, and running “rbd ls -l hdd_pool” again showed the remaining virtual machine disk images listed normally.
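Putting it together, the cleanup run on the node was essentially the following (the second rm command is inferred from the description above):

rbd ls -l hdd_pool                 # list images in the pool and spot the leftovers
rbd rm vm-100-disk-0 -p hdd_pool   # remove the first orphaned disk image
rbd rm vm-127-disk-1 -p hdd_pool   # remove the second orphaned disk image
rbd ls -l hdd_pool                 # verify that only healthy images remain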