Nimble Storage uses a new filesystem architecture called CASL™ (Cache Accelerated Sequential Layout). As I described in a previous post, CASL was designed from the ground up to provide a powerful combination of capacity and performance optimizations.

There is no dearth of well-designed filesystems. The vast majority of filesystems are “write in place” (WIP). When an application updates a block, the filesystem overwrites the block’s existing location on disk.

CASL belongs to a different class of filesystems that may be called “write in free space” (WIFS). Two well-known examples of this class are NetApp’s WAFL and Sun’s ZFS. In this post, I will describe how CASL provides all the benefits of WIFS while overcoming the shortcomings of existing WIFS filesystems.

A WIFS filesystem does not overwrite blocks in place. Instead, it redirects each write to free space, and updates its index to point to the new location.  This enables the filesystem to coalesce logically random writes into a physically sequential write on disk.
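
To make the redirection concrete, here is a minimal sketch in Python. It illustrates the general WIFS idea only; the class and variable names are hypothetical, and this is not CASL's actual implementation.

```python
# Minimal write-in-free-space sketch: writes are appended to free space and an
# index maps each logical block address (LBA) to its latest physical location.

class WifsVolume:
    def __init__(self):
        self.log = []      # physical layout: blocks in the order they were written
        self.index = {}    # LBA -> position of the current version in the log

    def write(self, lba, data):
        # Redirect the write to free space; the old copy (if any) becomes a hole.
        self.index[lba] = len(self.log)
        self.log.append(data)

    def read(self, lba):
        return self.log[self.index[lba]]


vol = WifsVolume()
for lba in (7, 3, 9, 3):          # logically random writes...
    vol.write(lba, "version of block %d" % lba)
print(vol.read(3))                # ...always resolve to the latest copy
```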

Furthermore, because WIFS does not overwrite the old versions of blocks, it provides a simple and efficient method to take snapshots. These snapshots are often called “redirect on write” (ROW). On the other hand, a WIP filesystem generally creates “copy on write” (COW) snapshots, wherein the first write to each block after a snapshot triggers a copy of the old version to a separate location.
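
The difference between the two snapshot styles can be sketched in a few lines of Python. This is a hedged illustration of the general ROW and COW mechanisms, not the code of any particular filesystem.

```python
import copy

# Redirect-on-write (ROW) on a WIFS layout: old block versions are never
# overwritten, so a snapshot is just a frozen copy of the index.
class RowVolume:
    def __init__(self):
        self.log, self.index = [], {}

    def write(self, lba, data):
        self.index[lba] = len(self.log)
        self.log.append(data)

    def snapshot(self):
        return copy.copy(self.index)   # metadata only; no data is moved

# Copy-on-write (COW) on a write-in-place layout: the first write to each
# block after a snapshot must first copy the old version to a separate area.
class CowVolume:
    def __init__(self, nblocks):
        self.blocks = [None] * nblocks
        self.preserved = {}            # block -> pre-snapshot version

    def snapshot(self):
        self.preserved = {}

    def write(self, lba, data):
        if lba not in self.preserved:
            self.preserved[lba] = self.blocks[lba]   # extra copy before overwriting
        self.blocks[lba] = data
```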

However, most existing WIFS filesystems such as WAFL and ZFS are “hole filling” in nature.  When most of the space is free, the filesystem is able to write in full stripes across the disk group, providing good performance.  Over time, as random blocks are overwritten and snapshots are deleted, free space gets fragmented into “holes” of various sizes, resulting in a Swiss cheese pattern. The filesystem redirects writes into these holes, resulting in random writes. Furthermore, even sequential reads of data written in this manner turn into random reads on disk.

Some WIFS filesystems attempt to overcome these shortcomings, for example by periodically defragmenting the free space. However, the process is heavyweight and does not ensure sequentiality. ZFS attempts to reduce the impact of hole filling on parity updates by fitting a full RAID stripe within the hole. However, this RAID stripe does not span the whole disk group, so it still results in random writes and reads.

CASL is a WIFS filesystem, but it is NOT hole filling.  It always writes in full stripes spanning the whole disk group.  It employs a lightweight sweeping process to consolidate small holes into free full stripes. Its internal data structures are designed ground-up to run sweeping efficiently, and it caches these data structures in flash for additional speed.
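
The sweeping idea can be sketched as follows. This is a simplified, hypothetical illustration of consolidating holes into free full stripes, not CASL's actual sweeping code.

```python
STRIPE_BLOCKS = 4   # hypothetical number of block slots per full stripe

def sweep(stripes, index):
    """stripes: list of stripes, each a list of (lba, data) slots;
    index: lba -> (stripe_no, slot) of the block's current version."""
    live = []
    for s_no, stripe in enumerate(stripes):
        for slot, (lba, data) in enumerate(stripe):
            if index.get(lba) == (s_no, slot):   # still referenced -> live block
                live.append((lba, data))
            # otherwise the slot is a hole and is simply dropped

    # Repack the live blocks densely into new full stripes.
    new_stripes = [live[i:i + STRIPE_BLOCKS]
                   for i in range(0, len(live), STRIPE_BLOCKS)]
    new_index = {lba: (s_no, slot)
                 for s_no, stripe in enumerate(new_stripes)
                 for slot, (lba, _) in enumerate(stripe)}
    reclaimed = len(stripes) - len(new_stripes)  # returned to the pool as full stripes
    return new_stripes, new_index, reclaimed
```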

An important side-benefit of always writing in full stripes is that CASL can coalesce blocks of different sizes into a stripe. Among other significant benefits, this enables a particularly efficient and elegant form of compression. The resulting layout is shown below.
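
To illustrate how variable-sized compressed blocks can be coalesced into a fixed-size stripe, here is a hypothetical Python sketch; the stripe size and function names are made up for illustration and are not CASL's code.

```python
import zlib

STRIPE_BYTES = 64 * 1024   # hypothetical stripe size, for illustration only

def build_stripe(pending):
    """Compress incoming blocks and pack as many as fit into one full stripe.
    pending: list of (lba, raw_bytes); returns (stripe_bytes, index_entries)."""
    stripe, entries, used = bytearray(), {}, 0
    while pending:
        lba, raw = pending[0]
        blob = zlib.compress(raw)            # blocks shrink to different sizes
        if used + len(blob) > STRIPE_BYTES:
            break                            # stripe is full; the rest waits for the next one
        entries[lba] = (used, len(blob))     # index records offset and compressed length
        stripe += blob
        used += len(blob)
        pending.pop(0)
    return bytes(stripe).ljust(STRIPE_BYTES, b"\0"), entries
```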

On the other hand, hole-filling filesystems are forced to use less efficient mechanisms.  E.g., since they write in units of blocks and not full stripes, they may try to compress a bunch of successive blocks into a smaller number of slots. Now imagine what would happen if an application updates one of those blocks.  The filesystem would need to do a read-modify-write on the bunch: read the old bunch, decompress it into blocks, update the one block, and re-compress and re-write the bunch.
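
For illustration, here is roughly what that read-modify-write path looks like in Python; the names and block size are hypothetical, and this is a sketch of the general problem rather than any vendor's actual code.

```python
import zlib

BLOCK = 4096   # hypothetical uncompressed block size

def rmw_update(disk, group_addr, block_no, new_data):
    """Update one block inside a compressed group on a hole-filling layout."""
    old_group = zlib.decompress(disk[group_addr])    # 1. read and decompress the whole group
    blocks = bytearray(old_group)
    blocks[block_no * BLOCK:(block_no + 1) * BLOCK] = new_data   # 2. update the one block
    disk[group_addr] = zlib.compress(bytes(blocks))  # 3. recompress and rewrite the whole group
```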

The read-modify-write has a big impact on performance, making compression in hole-filling filesystems unsuitable for workloads with random block updates, such as databases. In contrast, CASL supports compression with little impact on performance. This turns out to be a huge benefit, because databases often compress very well (2-4x).

Overall, CASL provides the best of both worlds: big capacity savings and consistently good performance, even for random workloads.

10 Responses to A Comparison of Filesystem Architectures

  1. Iñigo says:

    Hello Umesh,

    If I understand your article well, CASL maximizes flash/solid state drive write cycles, while WIP and hole-filling WIFS filesystems may trash / reduce your flash drive lifetime (which will also grow your support costs).

    Am I right?

    • admin says:

      Iñigo,

      Yes, that is right. CASL writes in full stripes on both SSDs and hard disks. On SSDs it reduces wear, and on hard disks it provides superb performance even for random writes.

      Umesh

  2. John Martin says:

    Disclosure – NetApp Employee

    Interesting post, if a little light on the details, though I’m not sure that your case for superior efficiency vs “hole filling” is entirely convincing, possibly because you’ve had to abstract out a lot of the details.

    Now imagine what would happen if an application updates one of those blocks on a compressed stripe on a Nimble array.

    I would infer from the information you’ve provided above that the entire stripe would need to be read and the new data added to it. If the resulting information no longer fit neatly into another full stripe, you’d wait for more data to arrive, combine the new data with the old, write out another full stripe, and invalidate the old stripe.

    To my way of thinking, the requirement to read a whole stripe before committing even a small amount of new data to the back-end disks seems excessive. If those small block updates were randomly distributed across a large number of stripes, then the resulting amount of disk read activity required to destage writes from cache might be enough to start causing performance problems. Even if these stripe writes/reads are sequential, a large enough number of concurrent sequential operations would end up looking pretty random from a head-positioning point of view.

    Large caches help a lot with this, but I’m not sure it would be sufficient to keep a busy OLTP database happy.

    Of course I could be completely wrong, and trying to use my internal model of how WAFL works to understand how Nimble works might well be leading me to invalid conclusions. Nonetheless, at this point in time I remain unconvinced that always reading and writing full stripes is an effective strategy for a broad range of workloads.

    Regards
    John

    • admin says:

      John,

      When an application updates a single block, CASL does not need to rewrite the old stripe holding the old image of that block. It writes only the new block, along with other newly written blocks. This creates a hole in the old stripe. A sweeping process running in the background sweeps the holes into full free stripes. CASL is designed to run this sweeping process efficiently.

      Umesh Maheshwari
      CTO

  3. Raghu S says:

    I am sure your architecture does a great job handling small random IOs. I am curious to know if it can perform well with large sequential reads (I understand they are not frequent). Large sequential reads should translate into small random reads when they hit the Nimble array, right? I understand Log Structured file systems in general are not good at dealing with large sequential reads and that’s not the primary requirement an LFS tries to satisfy. Just wondering if Nimble does anything to handle this case better — I know read caching doesn’t help this since only the working set is cached by Nimble.

    ZFS can address this issue by using a large block size (say 128KB) and performing a copy-on-write when the write arrives (thus making sure that data is laid out in large chunks). But that defeats the “log-structuredness” of the file system since it introduces random reads before writes can be committed on disk.

    Thanks
    Raghu

    • admin says:

      Hello Raghu,

      When laying out data on disk, one can optimize for at most one of the two:
      1. Writes and reads that are sequential in the logical address space.
      2. Writes that are random or sequential, and reads that follow a similar pattern as writes.

      Write-in-place filesystems optimize for the first. Log-structured filesystems optimize for the second.

      Two reasons are shifting the need towards the second. First, large read caches are making it more important to optimize random writes to balance write and read performance. Second, increasing layers of virtualization are adding randomness and making the accesses that hit the disk subsystem more like the second.

      That said, the Nimble filesystem does take a few extra steps to help sequential reads. It buffers writes in NVRAM and sorts them by logical address before writing them to disk. Furthermore, on detecting sequential reads, it starts to prefetch data into the read cache. Finally, like ZFS, it supports an application-tunable block size, currently up to 32KB.

      Umesh

      • Raghu S says:

        Makes sense. However, I believe that application-tunable block size is pointless for a log structured FS (unless you want to do a copy-on-write like ZFS does to make sure that data is laid out in large chunks). Correct me if I am missing something.

        Thanks
        Raghu

        • admin says:

          Hello Raghu,

          The application-tunable block size helps minimize the amount of block metadata. But you are right that, in a log-structured filesystem, one would not set the block size to any higher than the common size of write requests issued by the application/OS.

          Thanks,
          Umesh

  4. [...] This is the engineering challenge that Nimble has addressed to finally deliver a disk-based file system that is truly optimized for writes. You can read more about it here. [...]

  5. [...] UCS SmartPlay with VMware Horizon View for 300 users and up. Built upon the flash-optimized, hybrid CASL architecture, the Nimble Storage CS-Series packs adaptive performance VDI needs – such as with [...]
