Tuesday, 22 December 2015

ZFS, like a work of art

A subtle thing with ZFS is that the drive LEDs flash quite differently from those in typical storage arrays. Once you understand more of what goes on under the hood you'll know why: ZFS gathers writes into transaction groups and flushes them to disk periodically (every few seconds by default), so the drives tend to light up together in bursts. So just by looking around a data centre you could tell which servers are running ZFS. You can see this type of effect here: https://www.youtube.com/watch?v=LS3cfl-7n-4

Of course, that's ZFS on Linux. If it's the FUSE-based port (zfs-fuse), it runs in user space, which is less efficient than a filesystem in kernel space, as elaborated in various posts, for example https://lkml.org/lkml/2007/4/16/133 and https://lkml.org/lkml/2007/4/16/83. (The native ZFS on Linux port is a kernel module rather than FUSE.)

Here's an example pool using raidz2 with hot spares, which will automatically take over if one or two drives fail. Using brace expansion like c4t{0..1}d0 always makes creation easier. You also have to get the order of the arguments right (pool name, then vdev type, then devices), or you may end up second-guessing yourself:

# zpool create data c0t50004CF210AD1C22d0 c0t50004CF210BE51F1d0 c0t50004CF210BE51F3d0 c0t50004CF210BE5214d0 c4t{0..1}d0 raidz2
Unable to build pool from specified devices: invalid vdev specification: raidz2 requires at least 3 devices

# zpool create -O atime=off -O compress=lz4 data raidz2 c0t50004CF210AD1C22d0 c0t50004CF210BE51F1d0 c0t50004CF210BE51F3d0 c0t50004CF210BE5214d0 c4t{0..1}d0
# zpool add data spare c4t3d0 c5t3d0
# zpool status
  pool: data
 state: ONLINE
  scan: none requested

        NAME                       STATE     READ WRITE CKSUM
        data                       ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t50004CF210AD1C22d0  ONLINE       0     0     0
            c0t50004CF210BE51F1d0  ONLINE       0     0     0
            c0t50004CF210BE51F3d0  ONLINE       0     0     0
            c0t50004CF210BE5214d0  ONLINE       0     0     0
            c4t0d0                 ONLINE       0     0     0
            c4t1d0                 ONLINE       0     0     0
          c4t3d0                   AVAIL  
          c5t3d0                   AVAIL
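Because the c4t{0..1}d0 shorthand is expanded by the shell rather than by zpool, a dry run with echo is a cheap way to check the argument order and the device count before committing. This is a minimal sketch, assuming bash (brace expansion isn't available in a strict POSIX sh). It also folds the spares into the create command using zpool's spare keyword, which should be equivalent to the separate zpool add used above.

```shell
#!/usr/bin/env bash
# Dry run: the shell expands c4t{0..1}d0 before zpool ever sees it,
# so echoing the command shows exactly which devices zpool will receive.
disks=(c0t50004CF210AD1C22d0 c0t50004CF210BE51F1d0 c0t50004CF210BE51F3d0
       c0t50004CF210BE5214d0 c4t{0..1}d0)
spares=(c4t3d0 c5t3d0)

# Order matters: pool name, then the vdev type, then that vdev's disks.
# The "spare" keyword starts a new group, so spares can go in the same command.
echo zpool create data raidz2 "${disks[@]}" spare "${spares[@]}"

# raidz2 needs at least 3 disks; check what the expansion produced.
echo "raidz2 members: ${#disks[@]}"
```

Once the echoed line looks right, drop the echo and run it for real.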

Then, as always, test the assumption. I've got hot-swap capability, so I pulled a drive out to simulate a failure and then tried writing some data. It looks to have worked: the spare c4t3d0 was brought in automatically and resilvered.

# zpool status -xv
  pool: data
 state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or 'fmadm repaired', or replace the device
    with 'zpool replace'.
  scan: resilvered 136K in 1s with 0 errors on Wed Dec 23 05:53:44 2015


    NAME                         STATE     READ WRITE CKSUM
    data                         DEGRADED     0     0     0
      raidz2-0                   DEGRADED     0     0     0
        c0t50004CF210AD1C22d0    ONLINE       0     0     0
        c0t50004CF210BE51F1d0    ONLINE       0     0     0
        spare-2                  DEGRADED     0     0     0
          c0t50004CF210BE51F3d0  UNAVAIL      0    24     0
          c4t3d0                 ONLINE       0     0     0
        c0t50004CF210BE5214d0    ONLINE       0     0     0
        c4t0d0                   ONLINE       0     0     0
        c4t1d0                   ONLINE       0     0     0
      c4t3d0                     INUSE  
      c5t3d0                     AVAIL  

device details:

    c0t50004CF210BE51F3d0      UNAVAIL       too many errors
    status: FMA has faulted this device.
    action: Run 'fmadm faulty' for more information. Clear the errors
        using 'fmadm repaired'.
       see: http://support.oracle.com/msg/ZFS-8000-FD for recovery
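Pulling a drive on purpose is easy to spot, but in day-to-day use you want something to notice a DEGRADED pool for you. Here's a minimal sketch of a cron-style check. It relies on `zpool status -x` printing "all pools are healthy" when nothing is wrong. The function name check_pools is hypothetical, and the alerting step is left as a placeholder.

```shell
#!/usr/bin/env bash
# check_pools (hypothetical helper): returns 0 if `zpool status -x` reported
# all pools healthy, 1 otherwise. It takes the status text as an argument so
# the logic can be tried without a real pool.
check_pools() {
    local status_text="$1"
    if [ "$status_text" = "all pools are healthy" ]; then
        return 0
    fi
    echo "ZFS problem detected:" >&2
    # Print just the pool and state lines to keep the alert short.
    printf '%s\n' "$status_text" | grep -E '^ *(pool|state):' >&2 || true
    return 1
}

# On a real system, e.g. from cron (the alerting step is up to you):
# check_pools "$(zpool status -x)" || echo "send an alert here"
```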

Tuesday, 3 November 2015

ZFS born in Zion

Interesting videos from the recent OpenZFS Summit 2015. I recommend you watch these: https://www.youtube.com/watch?v=dcV2PaMTAJ4&index=6&list=PLaUVvul17xSedlXipesHxfzDm74lXj0ab

As Jeff Bonwick explains, around the time of ZFS's conception it had links to The Matrix. That's why Oracle's documentation mentions names like Neo, Trinity, tank and Morpheus in its examples. An amazing film with memorable quotes:

Morpheus: "You're faster than this. Don't think you are, know you are."
Morpheus: "I'm trying to free your mind, Neo. But I can only show you the door. You're the one that has to walk through it."

Let's not forget he was also Cowboy Curtis - https://www.youtube.com/watch?v=3jsCxNK4vAc 

Laurence and Samuel aren't the same person...

Sunday, 1 November 2015

Hardware or Software RAID?

About 4-5 years ago, when I first started learning and using Linux, one of the questions was about RAID, since there's more than one way to skin a cat, so to speak. Which way to skin it?
I was told by a manager (and he said this with 100% certainty) that "hardware RAID IS the best RAID". I have yet to see this proven.

Loose Background

Years ago, hardware RAID was the better option. CPUs were considerably slower, so software RAID, which runs constantly, consumed a fair amount of CPU (additional overhead). Combine that with the lack of well-designed software RAID (or, for example, the firmware RAID on older motherboards) and you were better off paying for a dedicated card. Such a card also has a battery backup unit (BBU) and cache, so it can reorder write operations before flushing them to disk, while keeping pending writes safe even if power is temporarily lost, so the array stays in a consistent state.
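To give a feel for the CPU work involved, here's a toy illustration of single parity (RAID-5 or raidz1 style), assuming bash integer arithmetic. Each "block" is just one byte here; real implementations XOR whole sectors, and raidz2's second parity uses a different, more CPU-intensive calculation that this sketch doesn't cover.

```shell
#!/usr/bin/env bash
# Toy single-parity example: each "block" is a single byte.
d0=0xA5
d1=0x3C
d2=0xF0

# Parity is the XOR of all data blocks (computed on every write).
parity=$(( d0 ^ d1 ^ d2 ))

# Simulate losing d1: XOR the parity with the surviving blocks to rebuild it.
rebuilt_d1=$(( parity ^ d0 ^ d2 ))

printf 'parity=0x%02X rebuilt_d1=0x%02X\n' "$parity" "$rebuilt_d1"
```

Multiply that by every sector written and every sector read back during a rebuild, and it's clear why slow CPUs once made a dedicated card attractive.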

Questions arose, such as:
What if the hardware RAID card fails?
If software RAID improves, can we spend less money on hardware?
Can rebuilds be done faster in software than with hardware RAID?
Should the volume manager (LVM) and filesystem layers be integrated?
Should software RAID be done in user space or kernel space?
Can software reorder I/Os the way a hardware controller does?
What happens to the state of the array if the battery-backed cache is gone after 72 hours without power?

Linux mdadm is quite a lot better now, and you can also use Btrfs or ZFS. I've played around with removing drives and rebuilding arrays using mdadm, but I no longer bother, as I just use ZFS for all my storage needs.

In short, software RAID is now at a stage where it can be faster than hardware RAID. ZFS in particular provides end-to-end checksumming (so silent data corruption is detected, and repaired where redundancy exists), organizes writes to turn random writes into sequential ones (with dynamic block allocation), and can be very efficient in its resource usage.
Here's a test comparing software and hardware RAID by Robert: http://milek.blogspot.co.uk/2006/08/hw-raid-vs-zfs-software-raid-part-ii.html
This is also referenced in the "Unix and Linux System Administration Handbook", fourth edition.
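The idea behind end-to-end checksumming can be sketched in a few lines: store a checksum when a block is written, verify it on every read, and treat a mismatch as corruption rather than trusting whatever the disk returned. This is only an illustration of the principle using the POSIX cksum tool. ZFS uses its own checksum algorithms and keeps each block's checksum in its parent block pointer.

```shell
#!/usr/bin/env bash
# Principle of end-to-end checksums, illustrated with a temp file as the "block".
block=$(mktemp)
trap 'rm -f "$block"' EXIT

printf 'important data\n' > "$block"
# "Write": remember the checksum separately from the data.
stored=$(cksum < "$block")

# Simulate silent corruption: the disk hands back different bytes, no error.
printf 'importent data\n' > "$block"

# "Read": recompute and compare before trusting the data.
if [ "$(cksum < "$block")" = "$stored" ]; then
    result=ok
else
    result=corrupt
fi
echo "$result"
```

On a redundant ZFS pool a mismatch like this isn't just reported: the good copy is read from another disk and the bad one rewritten.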

Saturday, 31 October 2015

Microsoft is Evil!

This link is funny


and among the pages it links to is my favorite message

from - http://toastytech.com/evil/errwindows.html

You never know, maybe messages like that really exist!

The best saying about Storage

When I read this quote I quite liked it.

"There are two things about hard drives, either they are going to fail, or they have failed."

Thinking of it that way means you won't (or shouldn't) rely on known failure-rate statistics, or think "my RAID has such a low chance of failing that I'll be fine", because at some point you know they will fail, enterprise quality or not.

It's all well and good having a RAID array that can survive several drives failing at the same time, with spares ready to rebuild, but have you asked what happens if another one fails before the rebuild finishes? What if they all fail? Ask this because, in my experience and others', when one thing goes wrong it tends to be exactly when you need it most (I think this is known as Murphy's Law). I've heard people tell me the chances are so low... followed by "but it just so happened on this one occasion...". Also, I recently had several drives fail within a month of one another after about 5-6 years of use (more on that in another post).
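To see where the "chances are so low" argument comes from, here's the naive calculation people usually have in mind. Every number is an illustrative assumption (5% annual failure rate per drive, 5 surviving drives, a 24-hour rebuild), and the model assumes drives fail independently at a constant rate.

```shell
#!/usr/bin/env bash
# Naive estimate of another drive failing while an array rebuilds.
# ASSUMPTIONS (illustrative only): independent failures at a constant rate,
# 5% annual failure rate per drive, 5 surviving drives, 24-hour rebuild.
afr=0.05
survivors=5
rebuild_hours=24

prob=$(awk -v afr="$afr" -v n="$survivors" -v h="$rebuild_hours" 'BEGIN {
    p = afr * h / 8760                # chance one drive fails within the rebuild window
    printf "%.6f", 1 - (1 - p) ^ n    # chance at least one of n survivors fails
}')
echo "P(another failure during rebuild) ~ $prob"
```

The result comes out well under 0.1%, which is exactly why it feels safe. But the independence assumption is the weak point: drives bought together, from the same batch and with the same hours on them, tend to fail together, and a rebuild stresses every survivor at once. So the real risk is higher than the naive figure.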

Friday, 30 October 2015

NVMe (focus on M.2) the latest paradigm shift

I heard about this a few months back from my adviser, and only yesterday Samsung released the 950 Pro NVMe M.2 SSD, in 256GB and 512GB versions. This emerging tech will have dramatic effects on the industry. Others don't appear to have realised, or even be aware of, the implications of NVMe (based on the lack of comments on the posts I follow and from the people I've spoken with), but then again I haven't checked everywhere.

This is why I've got myself a motherboard with two M.2 slots to make use of it (ASRock X99 Extreme11), probably for use as L2ARC. I'll hold off a bit longer, though, as prices will most certainly drop (the 512GB version is about £300).

What will it cause?

The next generation of laptops, smartphones and other devices will integrate this (in fact, the iPhone 6S already does). This will probably let next-gen hardware be up to 10x faster than existing tech, since storage I/O is what most operations end up waiting on. Because the form factor is so small, it will replace more and more existing SSDs, such as 2.5" SATA ones, as it becomes more commonplace (why would you not want something much faster and more power efficient?). Because it is very efficient in terms of wattage, running costs at larger scales will also be lower. The space required is much smaller too, as additional layers are stacked on the silicon (as with Samsung's 3D V-NAND) instead of the older planar (flat) approach. Just compare the size of a typical 3.5" or 2.5" drive with something the size of a large stick of chewing gum, which at some point will hold TBs.
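To put the interface side of that in rough numbers, here's a back-of-envelope comparison of the link ceilings, using SATA III (6 Gb/s, 8b/10b encoding) and a PCIe 3.0 x4 M.2 slot (8 GT/s per lane, 128b/130b encoding). These are theoretical maxima after line encoding only; real drives and protocol overheads land below them.

```shell
#!/usr/bin/env bash
# Theoretical link ceilings in MB/s after line encoding (not drive benchmarks).
sata_mbs=$(awk 'BEGIN { printf "%d", 6000 * 8 / 10 / 8 }')         # SATA III: 6 Gb/s, 8b/10b
pcie_mbs=$(awk 'BEGIN { printf "%d", 8000 * 128 / 130 / 8 * 4 }')  # PCIe 3.0 x4: 8 GT/s per lane, 128b/130b
echo "SATA III ~${sata_mbs} MB/s, PCIe 3.0 x4 ~${pcie_mbs} MB/s"
```

That's roughly a 6.5x higher ceiling on bandwidth alone. The rest of the gain comes from NVMe dropping the AHCI legacy: AHCI has a single queue of 32 commands, while NVMe allows up to 64K queues, which cuts latency under parallel load.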

What is the future?

I'm aware that more production facilities are being built to produce this on a larger scale with additional layers. Next year Samsung will almost certainly release a 1TB model with faster speeds, not to mention other vendors competing directly. For starters, laptops not using this will be phased out. My question is: what is the maximum number of layers that can be added?