Earlier this year I documented the rebuild of my home lab environment using Gluster and RHV. This post is a final wrap-up of that rebuild: one or two things have changed significantly since the last part was written, and I have been meaning to write this follow-up for some time…
Issues with the original setup
If you recall from my series on the rebuild, I was using three nodes, each with an internal RAID set acting as a Gluster brick, giving three bricks per volume.
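For context, with that layout each volume would have been created with something roughly along these lines, assuming a straight replica-3 arrangement (the volume name, brick paths and the third hostname are illustrative placeholders, not the exact ones I used):

```
# One brick per node, each sitting on that node's internal RAID set
gluster volume create vmstore replica 3 \
  baremetal1:/gluster/bricks/vmstore \
  baremetal2:/gluster/bricks/vmstore \
  baremetal3:/gluster/bricks/vmstore

# Apply the stock "virt" tuning group, typical for VM image workloads
gluster volume set vmstore group virt

gluster volume start vmstore
```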
Well, that setup worked really well, UNTIL one Sunday when both main nodes (baremetal1 and baremetal2) decided to run an mdadm scan on their soft-RAID volumes at the same time (thanks, cron).
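For anyone wondering where those Sunday scans come from: on RHEL-family hosts the weekly mdadm consistency check is driven by a packaged cron job, and because it is packaged it fires at the same time on every machine by default. From memory it looks something like this (the exact path and timing may differ between releases):

```
# /etc/cron.d/raid-check (stock entry, RHEL/CentOS 7 era)
# Kicks off a full consistency check of every md array at 01:00 on Sundays
0 1 * * Sun root /usr/sbin/raid-check
```

Staggering that schedule per node would have stopped both arrays being scrubbed at once, but as you will see below I decided to change the storage layout instead.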
What happened was that disk I/O latency went south big time, and the cluster pretty much ground to a halt. This resulted in RHV taking all the volumes offline, and the manager itself stopping. The two hosted-engine hypervisors then went into spasms trying to relaunch the engine, and I was spammed by a couple of hundred emails from the cluster over the space of several hours.
I was able to stabilise things once the mdadm scans had finished, but this was far from a usable solution for me. With the cluster stable, I stood up a temporary filestore on my NAS via iSCSI and relocated all VM images over to that with the exception of the ovirt-engine.
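RHV does the iSCSI discovery and login for you when you add an iSCSI storage domain through the admin portal, but for the curious the steps it performs on each hypervisor boil down to roughly the following (the portal IP and target IQN here are made-up placeholders for my NAS):

```
# Discover the targets the NAS is exposing
iscsiadm -m discovery -t sendtargets -p 192.168.0.50:3260

# Log in to the temporary target
iscsiadm -m node -T iqn.2000-01.com.example:nas.rhv-temp -p 192.168.0.50:3260 --login

# The LUN then appears as a local block device for RHV to build the domain on
lsblk
```

Once the temporary domain is active, relocating the images is just a matter of moving each VM's disks between storage domains, which RHV supports directly from the admin portal; the hosted engine stayed where it was.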
Then I trashed the cluster and rebuilt it a little differently.