ZFS Pool setup for SuperMicro server.

For this SuperMicro server, I'm using three-way mirroring, based mainly on the advice here. A storage efficiency of 33% is acceptable for this system, partly because the 2TB SAS drives were bought used, so their reliability is unknown; since they were fairly inexpensive (around $30 each), having an extra mirror in each vdev can't hurt.

Choice of Hard Drive

The BPN-SAS-846A backplane accepts both SATA and SAS hard drives, although inexpensive used SAS drives are surprisingly tricky to find.

An example of an inexpensive used SAS drive is the Seagate 2TB ST32000445SS, which can be had for around $25 these days (in lots of ten drives). Here I'll be using 24 of them, for a total of around $650, plus $100 for four spare drives to swap in when drives fail. Even at these lowish unit prices, the total cost of all the hard drives adds up, rivaling the price of the server itself.

The Seagate 2TB ST32000445SS is actually a "SED", i.e. a self-encrypting drive.

See here for how to use the SED feature with hdparm on SATA drives. Since the Seagate 2TB ST32000445SS is a SAS drive and hdparm only works with SATA drives (see here), I'm not sure yet whether there's any way to encrypt the disk encryption key from Linux.

It seems it might only be possible to perform a secure erase of the Seagate 2TB ST32000445SS drives using proprietary software (see section "13.7 LSI MegaRAID SafeStore Encryption Services" of the "MegaRAID SAS Software User Guide" manual) and a different controller card with the SafeStore software pre-installed. That wouldn't be of much use except for secure-erasing the drives prior to disposing of them, as we obviously want to use the M1015 card with Linux.

The Seagate 2TB ST32000445SS has a sector size of 512 bytes, so the ZFS pool is created with ashift=9. That's a little old-fashioned compared to modern "Advanced Format" drives, which have a 4K sector size and use ashift=12, but these Seagate 2TB ST32000445SS SAS drives seem reliable and fairly speedy, and they stay below roughly 40C even in the hot summertime.
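
If you'd rather be explicit than rely on sector-size autodetection, ashift can also be set at pool creation time. A minimal sketch (the pool and device names here are placeholders, not the ones used below):

  sudo zpool create -o ashift=9 testpool \
    mirror scsi-EXAMPLEDRIVE1 scsi-EXAMPLEDRIVE2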

Since it's a SAS drive, remember that in smartmontools you have to use the -A flag to smartctl to read the drive's S.M.A.R.T. info. In Debian Jessie, the drives show up with names like /dev/disk/by-id/scsi-35000c500342222cb, which is not the serial number (it appears to be the drive's WWN), making things a little inconvenient when it comes to locating a specific drive.
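
If you need to map one of these by-id names back to the label on a physical drive, smartctl can print the serial number; a quick sketch using one of the drives from the tables below:

  smartctl -i /dev/disk/by-id/scsi-35000c500342222cb | grep -i 'serial number'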

Configuring the ZFS Pool

The ZFS pool consists of eight three-drive mirror vdevs, giving 16TB of usable space out of 48TB raw. If a single controller fails, only one group of eight drives is lost out of the 24, one per vdev, so each vdev still has one drive of redundancy remaining. Admittedly, on the surface, three-way mirroring seems wasteful of both power and storage, as only a third of the raw capacity is available, but the eight additional drives don't use much power, and a proper "enterprise-level" of redundancy with a simple configuration that also provides very high performance is a requirement for this system. A three-way mirror also allows the "splitting the mirror" technique to create an immediate off-site backup (resilvering the replacement drives only takes a few hours); a sketch of the split command is below.
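
As a sketch of the mirror-splitting idea (I haven't run this on the pool described here, and the new pool name is just a placeholder): zpool split detaches one device from each mirror vdev and forms a new pool from them, which can then be taken off-site and imported elsewhere.

  sudo zpool split sminception sminception_offsite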

It's a good idea to take a photo of the label on each SAS hard drive before placing it into a drive slot, so that it'll be easier later on to figure out which drive has failed. I've connected the SAS-846A backplane to the controller cards as follows:

SAS slots     M1015 controller   iPass connections
#0 to #7      #0                 JSM1 to controller #0 port #0, JSM2 to controller #0 port #1
#8 to #15     #1                 JSM3 to controller #1 port #0, JSM4 to controller #1 port #1
#16 to #23    #2                 JSM5 to controller #2 port #0, JSM6 to controller #2 port #1

To connect the drives in the SAS-846A backplane to the M1015 controller cards, I needed to hunt down the proper Mini SAS to Mini SAS SFF-8087 "iPass" cables, which also turned out to be quite tricky; I resorted to reading the Molex data sheet to find the magical part number: 79576-2104. These elusive 79576-2104 cables are what's needed to connect the M1015 cards to the backplane. They're basically four SATA cables "rolled into one", and the one-meter length is an eye-popping $16 each. I can't believe they even consider them to be "enterprise" connectors: they seem very flimsy compared to proper SCSI connectors, but I guess this is how things are in the SAS / SATA world -- they need to keep the costs down however they can.

Drives on controller #0:

Device-by-id name Linux device name Backplane slot number
scsi-35000c50034157a8f /dev/sdi SAS #0
scsi-35000c5003424a11f /dev/sdh SAS #1
scsi-35000c50034241d4f /dev/sdg SAS #2
scsi-35000c5003425027f /dev/sdf SAS #3
scsi-35000c500342222cb /dev/sde SAS #4
scsi-35000c5003417889f /dev/sdd SAS #5
scsi-35000c50034241cd3 /dev/sdc SAS #6
scsi-35000c50034249a1b /dev/sdb SAS #7

Drives on controller #1:
Device-by-id name Linux device name Backplane slot number
scsi-35000c50034248657 /dev/sdq SAS #8
scsi-35000c50034150623 /dev/sdp SAS #9
scsi-35000c50034247a63 /dev/sdo SAS #10
scsi-35000c50034157d6f /dev/sdn SAS #11
scsi-35000c5003424a0e7 /dev/sdm SAS #12
scsi-35000c5003423f1ff /dev/sdl SAS #13
scsi-35000c50034149b6b /dev/sdk SAS #14
scsi-35000c5003414d617 /dev/sdj SAS #15

Drives on controller #2:
Device-by-id name Linux device name Backplane slot number
scsi-35000c500342497a3 /dev/sdr SAS #16
scsi-35000c50034192787 /dev/sds SAS #17
scsi-35000c50034247a43 /dev/sdy SAS #18
scsi-35000c5003423c29b /dev/sdx SAS #19
scsi-35000c50034249537 /dev/sdw SAS #20
scsi-35000c50034241d67 /dev/sdv SAS #21
scsi-35000c5003423f337 /dev/sdu SAS #22
scsi-35000c5003418ca53 /dev/sdt SAS #23

Creating the ZFS pool

I created the ZFS pool in two steps, first setting it up as a two-way mirror, so that controller #2 could be used for the initial import of the data (since making a copy from drives attached to a local controller is far quicker than a network copy). So the initial pool layout was like this:

VDEV name SAS drive member SAS drive member
mirror-0 SAS_#0 SAS_#8
mirror-1 SAS_#1 SAS_#9
mirror-2 SAS_#2 SAS_#10
mirror-3 SAS_#3 SAS_#11
mirror-4 SAS_#4 SAS_#12
mirror-5 SAS_#5 SAS_#13
mirror-6 SAS_#6 SAS_#14
mirror-7 SAS_#7 SAS_#15

The command to create the ZFS pool uses the /dev/disk/by-id device names, as follows:

sudo zpool create -f sminception \
  mirror scsi-35000c50034157a8f scsi-35000c50034248657 \
  mirror scsi-35000c5003424a11f scsi-35000c50034150623 \
  mirror scsi-35000c50034241d4f scsi-35000c50034247a63 \
  mirror scsi-35000c5003425027f scsi-35000c50034157d6f \
  mirror scsi-35000c500342222cb scsi-35000c5003424a0e7 \
  mirror scsi-35000c5003417889f scsi-35000c5003423f1ff \
  mirror scsi-35000c50034241cd3 scsi-35000c50034149b6b \
  mirror scsi-35000c50034249a1b scsi-35000c5003414d617

At this point, it's always a good idea to export the ZFS pool and import it again, to make sure nothing strange happens and that the devices are always shown by their device id in the zpool status output (rather than by /dev/sdb-style device names, which can change after drives are added or removed). When importing your newly-created ZFS pool, you may get a ridiculous error such as "One or more devices are missing from the system." with a nonsensical suggestion such as "The pool cannot be imported. Attach the missing devices and try again." and a useless link to the generic Oracle documentation.

When exporting and then re-importing the pool, I found that you always need to pass the -d /dev/disk/by-id option to the zpool import command; otherwise the ZFS pool cannot be properly imported and devices mysteriously "go missing" (but only intermittently). Creating the ZFS pool "by path" rather than "by id" does not work any better -- the devices still occasionally go missing -- so the trick is to always use the "-d /dev/disk/by-id" option, e.g. as follows:

  zpool export sminception
  zpool import -d /dev/disk/by-id sminception

So at this point, after exporting and importing the newly-created zpool, you can check the pool's health using zpool status sminception, e.g.:

root@sm:~# zpool status sminception
  pool: sminception
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        sminception                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000c50034157a8f  ONLINE       0     0     0
            scsi-35000c50034248657  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000c5003424a11f  ONLINE       0     0     0
            scsi-35000c50034150623  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            scsi-35000c50034241d4f  ONLINE       0     0     0
            scsi-35000c50034247a63  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            scsi-35000c5003425027f  ONLINE       0     0     0
            scsi-35000c50034157d6f  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            scsi-35000c500342222cb  ONLINE       0     0     0
            scsi-35000c5003424a0e7  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            scsi-35000c5003417889f  ONLINE       0     0     0
            scsi-35000c5003423f1ff  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            scsi-35000c50034241cd3  ONLINE       0     0     0
            scsi-35000c50034149b6b  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            scsi-35000c50034249a1b  ONLINE       0     0     0
            scsi-35000c5003414d617  ONLINE       0     0     0

A quick test of sequential I/O performance at this point (two-way mirrors) gives the following results: local sequential writes run at about 700 MB/s and local sequential reads at around 950 MB/s. Obviously this is measured locally, not over the network.

dd if=/dev/zero of=test conv=fsync bs=1M count=1000000
222198+0 records in
222198+0 records out
232991490048 bytes (233 GB) copied, 321.869 s, 724 MB/s

dd if=test of=/dev/null conv=fsync bs=1M
127933+0 records in
127932+0 records out
134146424832 bytes (134 GB) copied, 139.599 s, 961 MB/s

At this point, I created a new "dataset" called inception and copied the data into this new dataset.

zfs create sminception/inception

Make sure not to fill your ZFS pools too much; they should keep plenty of free space, as ZFS doesn't cope well when pools get too full. The usable space looks like this at the moment; a quick zpool list check is sketched after the df output.

$ df
Filesystem              1K-blocks       Used  Available Use% Mounted on
sminception/inception 15325971968 6738102656 8587869312  44% /mnt/sm/inception
sminception            8587869312          0 8587869312   0% /sminception

$ df -hl
Filesystem             Size  Used Avail Use% Mounted on
sminception/inception   15T  6.3T  8.0T  44% /mnt/sm/inception
sminception            8.0T     0  8.0T   0% /sminception
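
The zpool list check mentioned above; the CAP column reports how full the pool is:

  zpool list sminception
  zfs list -r -o name,used,available sminception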

Adding eight more drives, to make a three-way mirror

Now that the copy of the data into the pool has completed, to increase the pool's redundancy I populate the remaining eight slots and attach the drives on controller #2 to the mirrored pool sminception, so that the layout becomes:

VDEV name SAS drive member SAS drive member SAS drive member
mirror-0 SAS_#0 SAS_#8 SAS_#16
mirror-1 SAS_#1 SAS_#9 SAS_#17
mirror-2 SAS_#2 SAS_#10 SAS_#18
mirror-3 SAS_#3 SAS_#11 SAS_#19
mirror-4 SAS_#4 SAS_#12 SAS_#20
mirror-5 SAS_#5 SAS_#13 SAS_#21
mirror-6 SAS_#6 SAS_#14 SAS_#22
mirror-7 SAS_#7 SAS_#15 SAS_#23

The commands used were:

sudo zpool attach -f sminception \
  scsi-35000c50034248657 scsi-35000c500342497a3

sudo zpool attach -f sminception \
  scsi-35000c50034150623 scsi-35000c50034192787

sudo zpool attach -f sminception \
  scsi-35000c50034247a63 scsi-35000c50034247a43

sudo zpool attach -f sminception \
  scsi-35000c50034157d6f scsi-35000c5003423c29b

sudo zpool attach -f sminception \
  scsi-35000c5003424a0e7 scsi-35000c50034249537

sudo zpool attach -f sminception \
  scsi-35000c5003423f1ff scsi-35000c50034241d67

sudo zpool attach -f sminception \
  scsi-35000c50034149b6b scsi-35000c5003423f337

sudo zpool attach -f sminception \
  scsi-35000c5003414d617 scsi-35000c5003418ca53

After the above additions, the resilvering progressed as follows, which appears to be ridiculously slow; apparently the default settings are designed to conserve bandwidth, at the cost of leaving your data exposed for longer to the pool's reduced redundancy during the resilver. See here, here and here for some explanation and suggestions on tuning.

# zpool status sminception

  pool: sminception
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jul 22 10:39:55 2015
    1.21G scanned out of 4.38T at 177M/s, 7h13m to go
    1.18G resilvered, 0.03% done
config:

        NAME                        STATE     READ WRITE CKSUM
        sminception                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000c50034157a8f  ONLINE       0     0     0
            scsi-35000c50034248657  ONLINE       0     0     0
            scsi-35000c500342497a3  ONLINE       0     0     0  (resilvering)
          mirror-1                  ONLINE       0     0     0
            scsi-35000c5003424a11f  ONLINE       0     0     0
            scsi-35000c50034150623  ONLINE       0     0     0
            scsi-35000c50034192787  ONLINE       0     0     0  (resilvering)
          mirror-2                  ONLINE       0     0     0
            scsi-35000c50034241d4f  ONLINE       0     0     0
            scsi-35000c50034247a63  ONLINE       0     0     0
            scsi-35000c50034247a43  ONLINE       0     0     0  (resilvering)
          mirror-3                  ONLINE       0     0     0
            scsi-35000c5003425027f  ONLINE       0     0     0
            scsi-35000c50034157d6f  ONLINE       0     0     0
            scsi-35000c5003423c29b  ONLINE       0     0     0  (resilvering)
          mirror-4                  ONLINE       0     0     0
            scsi-35000c500342222cb  ONLINE       0     0     0
            scsi-35000c5003424a0e7  ONLINE       0     0     0
            scsi-35000c50034249537  ONLINE       0     0     0  (resilvering)
          mirror-5                  ONLINE       0     0     0
            scsi-35000c5003417889f  ONLINE       0     0     0
            scsi-35000c5003423f1ff  ONLINE       0     0     0
            scsi-35000c50034241d67  ONLINE       0     0     0  (resilvering)
          mirror-6                  ONLINE       0     0     0
            scsi-35000c50034241cd3  ONLINE       0     0     0
            scsi-35000c50034149b6b  ONLINE       0     0     0
            scsi-35000c5003423f337  ONLINE       0     0     0  (resilvering)
          mirror-7                  ONLINE       0     0     0
            scsi-35000c50034249a1b  ONLINE       0     0     0
            scsi-35000c5003414d617  ONLINE       0     0     0
            scsi-35000c5003418ca53  ONLINE       0     0     0  (resilvering)

errors: No known data errors

Speedy Resilvering Tunable zfs_resilver_delay

The tunables I decided to use to speed up the resilver (and scrubs) were as follows on Debian Jessie:

ZFS tunable's purpose                                                 Command to change it on Debian Jessie
Prioritize resilvering by setting the delay to zero (default is 2)   echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
Prioritize scrubs by setting the delay to zero (default is 4)        echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
Maximum number of in-flight I/Os (default is 32; adjust to taste)    echo 128 > /sys/module/zfs/parameters/zfs_top_maxinflight
Resilver for five seconds per TXG (default is 3000 ms)               echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
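
A small helper script to apply all four settings at once (a sketch; like the OSX tunables below, these revert to their defaults at the next reboot unless made persistent):

  #!/bin/sh
  # Prioritize resilver/scrub at the expense of normal pool I/O.
  echo 0    > /sys/module/zfs/parameters/zfs_resilver_delay
  echo 0    > /sys/module/zfs/parameters/zfs_scrub_delay
  echo 128  > /sys/module/zfs/parameters/zfs_top_maxinflight
  echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms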

dd if=/dev/zero of=test conv=fsync bs=1M count=100000
55871+0 records in
55871+0 records out
58584989696 bytes (59 GB) copied, 75.5945 s, 775 MB/s

dd if=test of=/dev/null conv=fsync bs=1M
55871+0 records in
55871+0 records out
58584989696 bytes (59 GB) copied, 56.3685 s, 1.0 GB/s

Speedy Resilvering Tunable zfs_resilver_delay on Apple OSX Yosemite

On Apple OSX Yosemite, see here for how to do it.

The following command sets kernel tunables that prioritize resilvering and scrubbing at the expense of everything else, so they will likely reduce performance noticeably if you want to use the pool for anything else during a resilver or a scrub.

sudo /usr/sbin/sysctl -w \
  kstat.zfs.darwin.tunable.scrub_max_active=6 \
  kstat.zfs.darwin.tunable.zfs_resilver_delay=0 \
  kstat.zfs.darwin.tunable.zfs_scrub_delay=0

Note that the above tunables revert to their defaults after a reboot, but if you need to restore the defaults sooner, they are:

Tunable              Default
scrub_max_active     2
zfs_resilver_delay   2
zfs_scrub_delay      4
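
So to put the defaults back by hand before a reboot, the same sysctl invocation can be used:

  sudo /usr/sbin/sysctl -w \
    kstat.zfs.darwin.tunable.scrub_max_active=2 \
    kstat.zfs.darwin.tunable.zfs_resilver_delay=2 \
    kstat.zfs.darwin.tunable.zfs_scrub_delay=4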

Temperature of Hard Drives

The following script can be used to check the temperatures of the drives in the slots, to see whether any are overheating. In hot weather, the drives may reach around 40C. (A more compact variant of the script is sketched after it.)

Slot temperatures
~~~~~~~~~~~~~~~~~
echo "Drives on controller 0"
echo "~~~~~~~~~~~~~~~~~~~~~~"
echo "scsi-35000c50034249a1b SAS #7"
smartctl /dev/disk/by-id/scsi-35000c50034249a1b -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034241cd3 SAS #6"
smartctl /dev/disk/by-id/scsi-35000c50034241cd3 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003417889f SAS #5"
smartctl /dev/disk/by-id/scsi-35000c5003417889f -A|grep 'Current Drive Temperature'
echo "scsi-35000c500342222cb SAS #4"
smartctl /dev/disk/by-id/scsi-35000c500342222cb -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003425027f SAS #3"
smartctl /dev/disk/by-id/scsi-35000c5003425027f -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034241d4f SAS #2"
smartctl /dev/disk/by-id/scsi-35000c50034241d4f -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003424a11f SAS #1"
smartctl /dev/disk/by-id/scsi-35000c5003424a11f -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034157a8f SAS #0"
smartctl /dev/disk/by-id/scsi-35000c50034157a8f -A|grep 'Current Drive Temperature'

echo "Drives on controller 1"
echo "~~~~~~~~~~~~~~~~~~~~~~"
echo "scsi-35000c5003414d617 SAS #15"
smartctl /dev/disk/by-id/scsi-35000c5003414d617 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034149b6b SAS #14"
smartctl /dev/disk/by-id/scsi-35000c50034149b6b -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003423f1ff SAS #13"
smartctl /dev/disk/by-id/scsi-35000c5003423f1ff -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003424a0e7 SAS #12"
smartctl /dev/disk/by-id/scsi-35000c5003424a0e7 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034157d6f SAS #11"
smartctl /dev/disk/by-id/scsi-35000c50034157d6f -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034247a63 SAS #10"
smartctl /dev/disk/by-id/scsi-35000c50034247a63 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034150623 SAS #9"
smartctl /dev/disk/by-id/scsi-35000c50034150623 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034248657 SAS #8"
smartctl /dev/disk/by-id/scsi-35000c50034248657 -A|grep 'Current Drive Temperature'

echo "Drives on controller 2"
echo "~~~~~~~~~~~~~~~~~~~~~~"
echo "scsi-35000c500342497a3 SAS #16"
smartctl /dev/disk/by-id/scsi-35000c500342497a3 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034192787 SAS #17"
smartctl /dev/disk/by-id/scsi-35000c50034192787 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034247a43 SAS #18"
smartctl /dev/disk/by-id/scsi-35000c50034247a43 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003423c29b SAS #19"
smartctl /dev/disk/by-id/scsi-35000c5003423c29b -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034249537 SAS #20"
smartctl /dev/disk/by-id/scsi-35000c50034249537 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034241d67 SAS #21"
smartctl /dev/disk/by-id/scsi-35000c50034241d67 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003423f337 SAS #22"
smartctl /dev/disk/by-id/scsi-35000c5003423f337 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003418ca53 SAS #23"
smartctl /dev/disk/by-id/scsi-35000c5003418ca53 -A|grep 'Current Drive Temperature'
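
The compact variant mentioned above, as a sketch: it assumes all the pool's drives appear under /dev/disk/by-id/ with the scsi-35000c5 prefix used in the tables, and it loses the slot labels, so the long form is still handy when hunting down a specific drive.

  for d in /dev/disk/by-id/scsi-35000c5*; do
    case "$d" in *-part*) continue ;; esac   # skip the partition entries ZFS creates
    printf '%s: ' "${d##*/}"
    smartctl -A "$d" | grep 'Current Drive Temperature'
  done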

The output of the script looks like this initially (on a hot day):

Drives on controller 0
~~~~~~~~~~~~~~~~~~~~~~
scsi-35000c50034249a1b SAS #7
Current Drive Temperature:     30 C
scsi-35000c50034241cd3 SAS #6
Current Drive Temperature:     30 C
scsi-35000c5003417889f SAS #5
Current Drive Temperature:     29 C
scsi-35000c500342222cb SAS #4
Current Drive Temperature:     30 C
scsi-35000c5003425027f SAS #3
Current Drive Temperature:     30 C
scsi-35000c50034241d4f SAS #2
Current Drive Temperature:     30 C
scsi-35000c5003424a11f SAS #1
Current Drive Temperature:     29 C
scsi-35000c50034157a8f SAS #0
Current Drive Temperature:     30 C

Drives on controller 1
~~~~~~~~~~~~~~~~~~~~~~
scsi-35000c5003414d617 SAS #15
Current Drive Temperature:     30 C
scsi-35000c50034149b6b SAS #14
Current Drive Temperature:     30 C
scsi-35000c5003423f1ff SAS #13
Current Drive Temperature:     30 C
scsi-35000c5003424a0e7 SAS #12
Current Drive Temperature:     30 C
scsi-35000c50034157d6f SAS #11
Current Drive Temperature:     29 C
scsi-35000c50034247a63 SAS #10
Current Drive Temperature:     29 C
scsi-35000c50034150623 SAS #9
Current Drive Temperature:     30 C
scsi-35000c50034248657 SAS #8
Current Drive Temperature:     30 C

Drives on controller 2
~~~~~~~~~~~~~~~~~~~~~~
scsi-35000c500342497a3 SAS #16
Current Drive Temperature:     30 C
scsi-35000c50034192787 SAS #17
Current Drive Temperature:     29 C
scsi-35000c50034247a43 SAS #18
Current Drive Temperature:     30 C
scsi-35000c5003423c29b SAS #19
Current Drive Temperature:     30 C
scsi-35000c50034249537 SAS #20
Current Drive Temperature:     31 C
scsi-35000c50034241d67 SAS #21
Current Drive Temperature:     31 C
scsi-35000c5003423f337 SAS #22
Current Drive Temperature:     30 C
scsi-35000c5003418ca53 SAS #23
Current Drive Temperature:     29 C

The output looks like this after the machine has been up for an hour doing a zfs send, on a cold winter day (14C ambient temperature, 9C outside); the warmest drive is around 25C:

Drives on controller 0
~~~~~~~~~~~~~~~~~~~~
scsi-35000c50034249a1b SAS #7
Current Drive Temperature:     24 C
scsi-35000c50034241cd3 SAS #6
Current Drive Temperature:     25 C
scsi-35000c5003417889f SAS #5
Current Drive Temperature:     21 C
scsi-35000c500342222cb SAS #4
Current Drive Temperature:     23 C
scsi-35000c5003425027f SAS #3
Current Drive Temperature:     24 C
scsi-35000c50034241d4f SAS #2
Current Drive Temperature:     24 C
scsi-35000c5003424a11f SAS #1
Current Drive Temperature:     24 C
scsi-35000c50034157a8f SAS #0
Current Drive Temperature:     25 C

Drives on controller 1
~~~~~~~~~~~~~~~~~~~~
scsi-35000c5003414d617 SAS #15
Current Drive Temperature:     23 C
scsi-35000c50034149b6b SAS #14
Current Drive Temperature:     24 C
scsi-35000c5003423f1ff SAS #13
Current Drive Temperature:     24 C
scsi-35000c5003424a0e7 SAS #12
Current Drive Temperature:     24 C
scsi-35000c50034157d6f SAS #11
Current Drive Temperature:     22 C
scsi-35000c50034247a63 SAS #10
Current Drive Temperature:     23 C
scsi-35000c50034150623 SAS #9
Current Drive Temperature:     24 C
scsi-35000c50034248657 SAS #8
Current Drive Temperature:     24 C

Drives on controller 2
~~~~~~~~~~~~~~~~~~~~
scsi-35000c500342497a3 SAS #16
Current Drive Temperature:     23 C
scsi-35000c50034192787 SAS #17
Current Drive Temperature:     22 C
scsi-35000c50034247a43 SAS #18
Current Drive Temperature:     25 C
scsi-35000c5003423c29b SAS #19
Current Drive Temperature:     24 C
scsi-35000c50034249537 SAS #20
Current Drive Temperature:     24 C
scsi-35000c50034241d67 SAS #21
Current Drive Temperature:     25 C
scsi-35000c5003423f337 SAS #22
Current Drive Temperature:     24 C
scsi-35000c5003418ca53 SAS #23
Current Drive Temperature:     22 C

Monitoring performance

Note you can use zpool iostat as well as plain iostat.
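
For example (the trailing number is the refresh interval in seconds; iostat comes from the sysstat package):

  zpool iostat -v sminception 5
  iostat -x 5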

Checking the S.M.A.R.T. status

Since these are SAS drives, the -A flag to the smartctl command is used to see the "Elements in grown defect list" counter. Note that SAS devices do not provide SATA S.M.A.R.T. attributes like "Reallocated Sector Count".

sudo apt-get install smartmontools
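
Then, to check the grown defect list on one of the drives listed above, something like:

  smartctl -A /dev/disk/by-id/scsi-35000c50034157a8f | grep 'grown defect list'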

Setting the mount point for a ZFS filesystem

By default, a ZFS filesystem's mountpoint is inherited from its parent, which may not be what you want if you have your own convention for where filesystems are mounted on your system.

So if we look at the inception filesystem's mount point, it's like this by default (after the above default setup steps):

zfs get all sminception/inception|grep mountpoint

sminception/inception  mountpoint        /sminception/inception       default

To change that, so that the mountpoint follows, for example, the convention of /mnt/<server>/<dataset>, we can use the following command, which automatically attempts to unmount the filesystem from where it's currently mounted and re-mounts it at the new place. Note that this only affects local mounting of the filesystem; NFS clients can decide for themselves where to mount it, but it's obviously best if they follow the same convention.

zfs set mountpoint=/mnt/sm/inception sminception/inception
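
Afterwards the new value (now a locally-set property rather than an inherited one) can be confirmed with:

  zfs get mountpoint sminception/inception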

Sharing over NFS

To export the filesystem over NFS, the commands are as follows. On Debian Jessie, statd is started properly; note that statd is required for Apple OSX Yosemite NFS clients to work correctly (otherwise they hang).

 apt-get install nfs-kernel-server
 echo '/dummy_for_etc_exports_moronic localhost(ro,subtree_check)' >> /etc/exports
 zfs set sharenfs="rw=@192.168.1.0/24,insecure" sminception/inception
 zfs share sminception/inception
 showmount -e
 zfs get sharenfs

# showmount -e
Export list for sm:
/mnt/sm/inception              192.168.1.0/24
/dummy_for_etc_exports_moronic localhost

# zfs get sharenfs
NAME                   PROPERTY  VALUE                        SOURCE
sminception            sharenfs  off                          default
sminception/inception  sharenfs  rw=@192.168.1.0/24,insecure  local
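
On the client side it's then a normal NFS mount; a sketch, assuming the server is reachable by the hostname sm and the mount point already exists on the client:

  mount -t nfs sm:/mnt/sm/inception /mnt/sm/inception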

zdb

The output of the zdb command is like this:

root@sm:~# zdb
sminception:
    version: 5000
    name: 'sminception'
    state: 0
    txg: 59992
    pool_guid: 7993001279504182280
    errata: 0
    hostid: 8323329
    hostname: 'sm'
    vdev_children: 8
    vdev_tree:
        type: 'root'
        id: 0
        guid: 7993001279504182280
        children[0]:
            type: 'mirror'
            id: 0
            guid: 10523391604328058912
            metaslab_array: 43
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 6139581275235454379
                path: '/dev/disk/by-id/scsi-35000c50034157a8f-part1'
                whole_disk: 1
                DTL: 364
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 17713344311634408801
                path: '/dev/disk/by-id/scsi-35000c50034248657-part1'
                whole_disk: 1
                DTL: 363
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 7461526471984904477
                path: '/dev/disk/by-id/scsi-35000c500342497a3-part1'
                whole_disk: 1
                DTL: 345
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 13489788898704216389
            metaslab_array: 41
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 9764520435028323644
                path: '/dev/disk/by-id/scsi-35000c5003424a11f-part1'
                whole_disk: 1
                DTL: 360
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 6587559338994911545
                path: '/dev/disk/by-id/scsi-35000c50034150623-part1'
                whole_disk: 1
                DTL: 359
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 14631248896257805436
                path: '/dev/disk/by-id/scsi-35000c50034192787-part1'
                whole_disk: 1
                DTL: 347
                create_txg: 4
        children[2]:
            type: 'mirror'
            id: 2
            guid: 18127905325371103265
            metaslab_array: 40
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 43503354591006241
                path: '/dev/disk/by-id/scsi-35000c50034241d4f-part1'
                whole_disk: 1
                DTL: 358
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15701981273058322080
                path: '/dev/disk/by-id/scsi-35000c50034247a63-part1'
                whole_disk: 1
                DTL: 357
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 17147423204722466029
                path: '/dev/disk/by-id/scsi-35000c50034247a43-part1'
                whole_disk: 1
                DTL: 367
                create_txg: 4
        children[3]:
            type: 'mirror'
            id: 3
            guid: 2871032375477364202
            metaslab_array: 39
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 4236933644699323838
                path: '/dev/disk/by-id/scsi-35000c5003425027f-part1'
                whole_disk: 1
                DTL: 356
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 9004728264698295353
                path: '/dev/disk/by-id/scsi-35000c50034157d6f-part1'
                whole_disk: 1
                DTL: 355
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 18428541736462747800
                path: '/dev/disk/by-id/scsi-35000c5003423c29b-part1'
                whole_disk: 1
                DTL: 370
                create_txg: 4
        children[4]:
            type: 'mirror'
            id: 4
            guid: 1048981656851707422
            metaslab_array: 38
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 4260926722300526777
                path: '/dev/disk/by-id/scsi-35000c500342222cb-part1'
                whole_disk: 1
                DTL: 354
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 14783070303109957676
                path: '/dev/disk/by-id/scsi-35000c5003424a0e7-part1'
                whole_disk: 1
                DTL: 353
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 10595685646790827725
                path: '/dev/disk/by-id/scsi-35000c50034249537-part1'
                whole_disk: 1
                DTL: 372
                create_txg: 4
        children[5]:
            type: 'mirror'
            id: 5
            guid: 14789964356999802181
            metaslab_array: 37
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 65763515804082926
                path: '/dev/disk/by-id/scsi-35000c5003417889f-part1'
                whole_disk: 1
                DTL: 352
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 6982179362716627328
                path: '/dev/disk/by-id/scsi-35000c5003423f1ff-part1'
                whole_disk: 1
                DTL: 351
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 16167140138866994948
                path: '/dev/disk/by-id/scsi-35000c50034241d67-part1'
                whole_disk: 1
                DTL: 375
                create_txg: 4
        children[6]:
            type: 'mirror'
            id: 6
            guid: 17364531108284266094
            metaslab_array: 36
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 15427417627729432411
                path: '/dev/disk/by-id/scsi-35000c50034241cd3-part1'
                whole_disk: 1
                DTL: 350
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 8043972884445530254
                path: '/dev/disk/by-id/scsi-35000c50034149b6b-part1'
                whole_disk: 1
                DTL: 349
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 13316974147855006133
                path: '/dev/disk/by-id/scsi-35000c5003423f337-part1'
                whole_disk: 1
                DTL: 378
                create_txg: 4
        children[7]:
            type: 'mirror'
            id: 7
            guid: 755944469145100657
            metaslab_array: 34
            metaslab_shift: 34
            ashift: 9
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 15959964977682225405
                path: '/dev/disk/by-id/scsi-35000c50034249a1b-part1'
                whole_disk: 1
                DTL: 362
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 12475628906210417449
                path: '/dev/disk/by-id/scsi-35000c5003414d617-part1'
                whole_disk: 1
                DTL: 361
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 1030968181531841641
                path: '/dev/disk/by-id/scsi-35000c5003418ca53-part1'
                whole_disk: 1
                DTL: 380
                create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

Scrubbing the pool

The pool can be checked using zpool scrub sminception. The scrub starts out quite slow and speeds up later on; here it's almost finished, and in the end it took less than two hours to complete.

  pool: sminception
 state: ONLINE
  scan: scrub in progress since Fri Jul 24 17:30:21 2015
    2.93T scanned out of 4.38T at 643M/s, 0h39m to go
    0 repaired, 66.82% done
config:

The iostat output looks like this during the scrub:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.17    0.00   23.47    0.15    0.00   76.21

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdc             710.40     90505.20         5.25     905052         52
sdk             709.40     90426.75         5.25     904267         52
sdb             703.00     88750.55         5.25     887505         52
sdj             704.10     88841.80         5.25     888418         52
sdr             703.30     88922.65         5.25     889226         52
sds             707.00     90069.05         5.25     900690         52
sdd             705.60     89016.15         0.40     890161          4
sdl             707.00     89023.20         0.40     890232          4
sdt             704.70     88888.20         0.40     888882          4
sde             730.00     89814.85         0.00     898148          0
sdf             730.70     90748.90         0.00     907489          0
sdn             731.10     90701.45         0.00     907014          0
sdm             724.00     89781.85         0.00     897818          0
sdv             730.00     90927.75         0.00     909277          0
sdh             720.60     89937.45         2.90     899374         29
sdu             725.50     89973.85         0.00     899738          0
sdg             723.90     89723.25         2.90     897232         29
sdp             715.10     90004.75         2.90     900047         29
sdo             717.20     89686.45         2.90     896864         29
sdx             713.60     89812.70         2.90     898127         29
sdw             712.50     89134.05         2.90     891340         29
sdi             716.00     90203.45         7.75     902034         77
sdq             714.50     90416.30         7.75     904163         77
sdy             719.40     91029.65         7.75     910296         77
dm-0              0.00         0.00         0.00          0          0

When running zpool status, remember to include -T d to show the timestamp, and make sure your hard drives are shown "by-id" and not by their Linux device names.

# zpool status -v -T d
Sun Aug  9 11:41:40 PDT 2015
  pool: sminception
 state: ONLINE
  scan: scrub repaired 0 in 1h56m with 0 errors on Fri Jul 24 19:27:20 2015
config:

        NAME                        STATE     READ WRITE CKSUM
        sminception                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000c50034157a8f  ONLINE       0     0     0
            scsi-35000c50034248657  ONLINE       0     0     0
            scsi-35000c500342497a3  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000c5003424a11f  ONLINE       0     0     0
            scsi-35000c50034150623  ONLINE       0     0     0
            scsi-35000c50034192787  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            scsi-35000c50034241d4f  ONLINE       0     0     0
            scsi-35000c50034247a63  ONLINE       0     0     0
            scsi-35000c50034247a43  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            scsi-35000c5003425027f  ONLINE       0     0     0
            scsi-35000c50034157d6f  ONLINE       0     0     0
            scsi-35000c5003423c29b  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            scsi-35000c500342222cb  ONLINE       0     0     0
            scsi-35000c5003424a0e7  ONLINE       0     0     0
            scsi-35000c50034249537  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            scsi-35000c5003417889f  ONLINE       0     0     0
            scsi-35000c5003423f1ff  ONLINE       0     0     0
            scsi-35000c50034241d67  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            scsi-35000c50034241cd3  ONLINE       0     0     0
            scsi-35000c50034149b6b  ONLINE       0     0     0
            scsi-35000c5003423f337  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            scsi-35000c50034249a1b  ONLINE       0     0     0
            scsi-35000c5003414d617  ONLINE       0     0     0
            scsi-35000c5003418ca53  ONLINE       0     0     0

errors: No known data errors
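
Since a scrub finishes in under two hours on this pool, it's cheap enough to schedule regularly; a sketch of a monthly cron entry (assuming zpool is installed in /sbin, as it is on Debian Jessie with ZFS on Linux):

  # /etc/cron.d/zfs-scrub -- scrub on the 1st of each month at 02:00
  0 2 1 * * root /sbin/zpool scrub sminception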

Quirks in ZFS

During resilvering, zpool iostat 1 doesn't show the write bandwidth; see here for the apparently unanswered question as to why not, and for the suggestion to use the -v flag to see the write bandwidth to the resilvering drives. Perhaps drives that are in the process of being resilvered are not yet considered full members of the pool, so their write bandwidth isn't reported, but it seems strange that -v would be used in this fashion, as normally it just means "verbose".

zpool iostat -v 1

Further info...

For more on ZFS, Aaron's guide to ZFS is a good place to start. Also, his article about parchive is interesting.