ZFS Is 42.

For those who don't get the reference (and I bet among my readers that's very few), 42 is the answer to life, the universe and everything.

For the past few weeks I've been playing with the ZFS file system on Linux. And I keep upping the ante, so to speak. I started by simply creating a home file server running three 2TB hard drives in a RAIDZ configuration with a 128GB SSD boot drive. RAIDZ is ZFS-specific and roughly equivalent to RAID-5.

ZFS is unique in that you don't format your drives like you would with a typical filesystem such as FAT, NTFS or EXT4. You simply give it hard drives and say "here." You can then create datasets (which look like ordinary directories) within the root of that pool of drives, which comes in handy if you want to use snapshots.
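To give an early taste of why that matters, taking a snapshot of a dataset and rolling back to it is only a couple of commands. This is just a sketch using the pool and dataset names created later in this post, with an example date:

# take a snapshot of the dataset (names and date are examples)
zfs snapshot data/storage1@2013-05-16

# list existing snapshots
zfs list -t snapshot

# roll the dataset back to its most recent snapshot if something goes wrong
zfs rollback data/storage1@2013-05-16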

Save that Data!

ZFS was developed from the ground up for data stability. Its disk check utility, "scrub", runs on your active, mounted filesystem. With RAIDZ, you get all the advantages of RAID-5 without the headache of proprietary controllers or FAKERAID.
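Kicking off a scrub is a single command against the live pool, and you check on its progress the same way (using the pool name created later in this post):

zpool scrub data
zpool status data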

Getting Ready

Here's how you go about installing ZFS for Linux on a recent Ubuntu distribution. First install the prerequisites:

apt-get update
apt-get install build-essential linux-headers-generic python-software-properties

Next add the PPA repository and install ubuntu-zfs:

apt-add-repository ppa:zfs-native/stable
apt-get update
apt-get install ubuntu-zfs

This may take a few minutes while it compiles the kernel module for ZFS. Any time you install a new Linux kernel from the updates, it'll have to compile that module again, but it happens automatically.
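If you want to confirm that the module built and loaded correctly, something like this should show it:

modprobe zfs
lsmod | grep zfs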

Creating Your Storage Pool

Next, you add your drives to the pool of available drives for ZFS. No need to reboot. Use something like:

zpool create data /dev/sdb /dev/sdc /dev/sdd

For a production-grade server you might instead want to use the by-path references, such as:

zpool create data /dev/disk/by-path/pci-0000:00:1f.2-scsi-1:0:0:0 \
    /dev/disk/by-path/pci-0000:00:1f.2-scsi-2:0:0:0 \
    /dev/disk/by-path/pci-0000:00:1f.2-scsi-3:0:0:0

Now you could stop there, but it's a better idea to place ZFS "datasets" inside that pool rather than putting your data directly in the pool. This helps to logically group your data and makes managing snapshots easier:

zfs create data/storage1

Now on your filesystem you'll have something like:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       235G  2.5G  221G   2% /
udev            5.9G  4.0K  5.9G   1% /dev
tmpfs           2.4G  324K  2.4G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            5.9G     0  5.9G   0% /run/shm
data            5.4T     0  5.4T   0% /data
data/storage1   5.4T     0  5.4T   0% /data/storage1

Note that the available disk size is shared between /data and /data/storage1. If I create more ZFS datasets in that pool, they all share the total available pool size.
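You can see the same thing from the ZFS side with zfs list, which shows each dataset along with the shared pool space available to all of them:

zfs list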

Going Production

After my initial tests, I moved on to a test server. The plan is to use two identical servers with ZFS on the back end and GlusterFS on top of it, mirroring the data across both servers. This gives me RAID redundancy on a single server using RAIDZ, covering me in the event of a single drive failure. It also gives me complete server redundancy and high availability (HA), since GlusterFS ensures a duplicate of the data is on each physical server and always accessible.
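I'm not covering the GlusterFS setup in detail here, but the rough shape of it, assuming two servers named zfs1 and zfs2 that each expose a directory on the ZFS dataset as a brick (the hostnames, brick paths and volume name are all placeholders), looks something like this:

# from one server, probe the other node and create a 2-way replicated volume
gluster peer probe zfs2
gluster volume create gv0 replica 2 zfs1:/data/storage1/brick zfs2:/data/storage1/brick
gluster volume start gv0

# clients (or the servers themselves) then mount the replicated volume
mount -t glusterfs zfs1:/gv0 /mnt/shared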

In addition, I want to be notified in the event that ZFS has a disk failure. ZFS is great in that, just like RAID-5, if a disk fails it will keep serving data, but you really need to know right away when that happens.

Initially I installed a 256GB SSD boot drive and three 2TB hard drives and configured them in a standard, non-RAID ZFS pool. That gave me roughly 6TB of available storage. With this it's trivial to add a new drive to the pool and have its size immediately available for storage without needing to unmount a running file system. If you're using a hotswap-capable server, that is. Throwing a fourth 2TB drive in, I tested it with "zpool add data /path/to/device" and my available /data mount immediately grew by nearly 2TB. No need to wait for things to "format".
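In concrete terms that looked something like this (the device name is just an example from my setup):

# add the new disk to the (non-RAIDZ) pool, then confirm the new capacity
zpool add data /dev/sde
zpool list data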

Next, just to see what would happen, I pulled one of the drives out of the server. My mounted /data path vanished and zpool reported that it was now bad and couldn't recover. This was what I expected given that it wasn't a RAIDZ pool and had no redundancy. All data was lost.

With RAIDZ there's no automatic size expansion, so you must start your pool over if you want to expand, and of course you lose the equivalent of one hard drive's worth of space to redundancy. After those initial tests I recreated the pool as a RAIDZ of four 2TB drives, giving 6TB of usable space.

Creating it with RAIDZ

To create a RAIDZ pool you simply use the raidz keyword such as:

zpool create data raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde

Next I pulled a drive out of the system and the /data mount continued humming right along. I wrote some data, then popped the drive back in and told zpool to make it active again. All was good in the world and there was no downtime.
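The exact sequence depends on how the pool sees the disk, but with the same drive going back into the same slot it was roughly this (the device name is an example):

zpool status data               # pool shows DEGRADED with the missing disk
zpool online data /dev/sdc      # tell ZFS the reinserted disk is back
zpool clear data                # clear the old error counters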

Adding an SSD Cache Drive

Finally, I decided I wanted to improve read performance on the existing pool, so I added a 128GB SSD to the server (hotswap, again no downtime) and told zpool to add it to the pool as cache. ZFS immediately made that SSD available as a fast read cache for the most active and most recently accessed data - a relatively cheap performance booster.

Adding it was as simple as:

zpool add data cache /dev/sdf
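If you're curious whether the cache is actually being used, you can watch per-device activity, including the cache device, refreshing every 5 seconds:

zpool iostat -v data 5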

Monitoring for Failures

Then, for monitoring the status, I modified a script I found online. The original script didn't quite work right: a path it referenced doesn't exist on Ubuntu, and the zpool status command writes to stderr when there's an error condition, which the original script didn't capture:

#!/bin/sh
#
# This script is called by cron and monitors the status of the ZFS file system.
# It creates /root/zpool.status to keep from paging more than once per event.
#

REPORT_EMAIL=me@example.com
SERVER=`hostname`
STATUSFILE="/root/zpool.status"

# Make sure the status file exists on the first run
[ -f $STATUSFILE ] || printf 0 > $STATUSFILE

# Capture stderr too, since zpool status reports some error conditions there
ZPOOL_STATUS=`/sbin/zpool status -x 2>&1`

if [ "$ZPOOL_STATUS" = "all pools are healthy" -o "$ZPOOL_STATUS" = "no pools available" ]
then
        # healthy again: reset the flag so the next failure pages us
        # (printf rather than echo -n, since dash's echo doesn't support -n)
        printf 0 > $STATUSFILE
else
        # only page once per event
        if [ `cat $STATUSFILE` -eq 0 ]
        then
                /sbin/zpool status | mail -s "$SERVER ZPOOL NOT HEALTHY" $REPORT_EMAIL
                printf 1 > $STATUSFILE
        fi
fi

I threw this into a cron job to run every 5 minutes, and it will notify me if any drive in the system goes down. I did more testing by jerking drives out of the server while it was running and letting the monitoring script page me.
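The cron side is nothing fancy. Assuming the script is saved somewhere like /root/zpool-monitor.sh (the path and name here are just examples) and made executable, a root crontab entry along these lines runs it every 5 minutes:

# path and filename are examples - point this at wherever you saved the script
*/5 * * * * /root/zpool-monitor.sh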

The ZFS snapshot feature lets me create daily snapshots without much additional drive usage, so I can roll files back to any given day as if I'd spent hours doing a nightly backup.

With GlusterFS replicating the data to two separate ZFS file servers, I have the safety of high availability, true data redundancy, RAID striping performance and safety, daily backups and complete hardware failure protection, with about 6TB of file storage. It doesn't get much better than this for under $5k. With the nightly offsite backups I already run of this data, I'm now covered for every event: a single drive failure (quick rebuild), a catastrophic single-server failure (slower resync), and an asteroid hitting my datacenter (restore from the automated offsite backups).

Posted by Tony on May 16, 2013 | Servers, ZFS