Gluster on ZFS with Geo-Replication

I've been fighting with Gluster geo-replication on ZFS for several months now, but I believe I've finally stumbled on a configuration that works well and gives much better Gluster performance on ZFS.

First a peek at the landscape
I probably don't have your typical storage needs. On this particular cluster I'm storing around 15 million files averaging 20MB each. No more than 255 files or sub-directories in a given directory.

My Gluster configuration uses a two-brick configuration with replica 2 (mirroring) and a single geo-replication slave over SSH.
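
For context, standing up a volume like that looks roughly like the following. The host names, brick paths, and volume names here are placeholders, and the geo-replication syntax varies a bit between Gluster releases (this assumes the passwordless SSH / pem setup is already in place):

# create and start the two-brick replica 2 volume (hypothetical hosts and paths)
gluster volume create gv0 replica 2 gluster1:/tank/brick gluster2:/tank/brick
gluster volume start gv0

# set up the SSH-based geo-replication session to the slave and start it
gluster volume geo-replication gv0 geoslave::gv0 create push-pem
gluster volume geo-replication gv0 geoslave::gv0 start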

Each Gluster brick uses a ZFS RAID-Z volume spanning four SATA3 7200 RPM drives, with a single 256 GB SSD configured as a cache (L2ARC) device for the pool. The Gluster servers are connected to each other on a private gigabit network segment, and for the initial data load the geo-replication server was connected to the same segment.
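
For what it's worth, a pool along those lines is built with something like this (the pool and device names are just examples):

# four-drive RAID-Z pool with a single SSD as a cache (L2ARC) device
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd cache /dev/sde

# dataset that will back the Gluster brick
zfs create tank/brick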

The Geo-replication slave also uses ZFS RAID-Z, but does not have a cache drive.

The goal with this configuration is to enable snapshot backups on either brick, plus geo-replication backups to a distant datacenter.
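
The snapshot side is plain ZFS; something like this run from cron on each brick server covers it (the dataset and snapshot names are examples):

# point-in-time snapshot of the brick dataset, named by date
zfs snapshot tank/brick@$(date +%Y%m%d-%H%M)

# review existing snapshots, and prune old ones as needed
zfs list -t snapshot
zfs destroy tank/brick@20140101-0000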

All of my data is compressed and encrypted prior to being placed in the Gluster share, so ZFS compression can't do anything for me. All clients access Gluster with the native Gluster client.

So, to summarize: there are a ton of moderately sized files (not virtual-machine sized, and not tiny) in a ton of directories across two Gluster bricks, each using ZFS to store its data.

The problem
With the straight-out-of-the-package configuration and no tuning, I ran into several major problems. At first Gluster was fast, but very quickly even reading a directory listing (remember, fewer than 255 entries, and usually fewer than 100) would take 15 seconds or more. Geo-replication would get up to about 2TB of data and then just... almost stop, crawling along at one replicated file every hour or so.

The solutions
Keep in mind I'm not a Gluster or ZFS developer, but from what I've been able to cobble together it appears that ZFS doesn't handle small files very well in its default configuration. At first blush you'd think, "So what? I'm not using small files!" The problem is that Gluster does. Both the volume replication and the geo-replication features of Gluster attach a ton of extended attributes (xattrs) to the files on each brick, and with the default ZFS settings those end up as a ton of small files. Gluster assumes it's cheaply tagging files with xattrs, but ZFS is translating that into file storage instead of inode storage.
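
You can get a feel for just how many xattrs Gluster hangs on each file by inspecting a file directly on a brick (the path below is just an example):

# dump the trusted.* attributes Gluster stores on a brick file
getfattr -d -m . -e hex /tank/brick/path/to/some/file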

So, here's the tuning required for my situation. The primary change is the xattr setting, and possibly the sync setting. I turned off exec and access-time updates because I don't need either in my situation:

zfs set atime=off [volname]
zfs set xattr=sa [volname]
zfs set exec=off [volname]
zfs set sync=disabled [volname]
zfs set devices=off [volname]
zfs set recordsize=64K [volname]
zfs set setuid=off [volname]
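
Once set, it's worth confirming the properties actually took on the dataset before loading any data:

zfs get atime,xattr,exec,sync,devices,recordsize,setuid [volname]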

Please keep in mind that setting sync=disabled can put your data at risk, and you should investigate exactly how that will impact your build-out. For me it's not an issue: all servers are on UPS with generator backup, and all data is verified by the application servers after it has been written. I also must state that I have a complete offline backup of this data prior to it being loaded into these servers. The geo-replication and snapshots are for faster disaster recovery and are not an attempt to replace real backups. With this setup I could effectively have a recovery time of 10 minutes or less even if both of my replicated bricks were destroyed, and if any one of the three servers dies I have zero downtime.

The painful part
The additional problem is that unless those settings, particularly the xattr setting, are configured before the data is loaded, they don't take effect for the existing files. That means I had to copy all of my data over to another set of Gluster servers that already had the settings configured; I couldn't just enable them on the existing volumes and expect a sudden performance increase.
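
One way to do that kind of copy is through native client mounts of both volumes, roughly like this (hostnames, volume names, and mount points are hypothetical):

# mount the old and new volumes with the native Gluster client
mount -t glusterfs oldgluster1:/gv0 /mnt/old
mount -t glusterfs newgluster1:/gv0 /mnt/new

# copy everything across, keeping partial files so an interrupted run can resume
rsync -a --partial /mnt/old/ /mnt/new/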

I'm not 100% done transferring data to the new servers, mostly because the first set of Gluster bricks is just so slow at reading, but at first blush the change appears to have dramatically fixed the problem. Directory listings on the new pair of Gluster servers are nearly instant, versus the 15 seconds I see on the old pair. But I'm only a quarter of the way through loading the data, so whether that holds up remains to be seen.