Posts Tagged ‘open storage’

h1

Sun Adds De-Duplication to ZFS

November 3, 2009

Yesterday Jeff Bonwick (Sun) announced that deduplication is now officially part of ZFS – Sun’s Zettabyte File System that is at the heart of Sun’s Unified Storage platform and NexentaStor. In his post, Jeff touched on the major issues surrounding deduplication in ZFS:

Deduplication in ZFS is Block-level

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS’s 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

Deduplication in ZFS is Synchronous

ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

Deduplication in ZFS is Per-Dataset

Like all zfs properties, the ‘dedup’ property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis. Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help.

Deduplication in ZFS is based on a SHA256 Hash

Chunks of data — files, blocks, or byte ranges — are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.

Deduplication in ZFS can be Verified

[If you are paranoid about potential “hash collisions”] ZFS provies a ‘verify’ option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not.

Deduplication in ZFS is Scalable

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you’re so inclined. The performace of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk — but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.

Jeff Bonwick’s Blog, November 2, 2009

What does this mean for ZFS users? That depends on the application, but highly duplicated environments like virtualization stand to gain significant storage-related value from this small addition to ZFS. Considering the various ways virtualization administrators deal with virtual machine cloning, even the basic VMware template approach (not using linked-clones) will now result in significant storage savings. This restores parity between storage and compute in the virtualization stack.

What does it mean for ZFS-based storage vendors? More main memory and processor threads will be necessary to limit the impact on performance. With 6-core and 8-thread CPU’s available in the mainstream, this problem is very easily resolved. Just like the L2ARC tables consume main memory, the DDT’s will require an increase in main memory for larger datasets. Testing and configuration convergence will likely take 2-3 months once dedupe is mainstream.

When can we expect to see dedupe added to ZFS (i.e. OpenSolaris)? According to Jeff, “in roughly a month.”

Updated: 11/04/2009 – Link to Nexenta corrected. Was incorrectly linked to “nexent.com” – typo – now correctly linked to “http://www.nexenta.com”

h1

In-the-Lab: Full ESX/vMotion Test Lab in a Box, Part 4

August 26, 2009

In Part 3 of this series we showed how to install and configure a basic NexentaStor VSA using iSCSI and NFS storage. We also created a CIFS bridge for managing ISO images that are available to our ESX servers using NFS. We now have a fully functional VSA with working iSCSI target (unmounted as of yet) and read-only NFS export mounted to the hardware host.

In this segment, Part 4, we will create an ESXi instance on NFS along with an ESX instance on iSCSI, and, using writable snapshots, turn both of these installations into quick-deploy templates. We’ll then mount our large iSCSI target (created in Part 3) and NFS-based ISO images to all ESX/ESXi hosts (physical and virtual), and get ready to install our vCenter virtual machine.

Part 4, Making an ESX Cluster-in-a-Box

With a lot of things behind us in Parts 1 through 3, we are going to pick-up the pace a bit. Although ZFS snapshots are immediately available in a hidden “.zfs” folder for each snapshotted file system, we are going to use cloning and mount the cloned file systems instead.

Cloning allows us to re-use a file system as a template for a copy-on-write variant of the source. By using the clone instead of the original, we can conserve storage because only the differences between the two file systems (the clone and the source) are stored to disk. This process allows us to save time as well, leveraging “clean installations” as starting points (templates) along with their associate storage (much like VMware’s linked-clone technology for VDI.) While VMware’s “template” capability allows us save time by using a VM as a “starting point” it does so by copying storage, not cloning it, and therefore conserves no storage.

Using clones in NexentaStor to aid rapid deployment and testing.

Using clones in NexentaStor to conserve storage and aid rapid deployment and testing. Only the differences between the source and the clone require additional storage on the NexentaStor appliance.

While the ESX and ESXi use cases might not seem the “perfect candidates” for cloning in a “production” environment, in the lab it allows for an abundance of possibilities in regression and isolation testing. In production you might find that NFS and iSCSI boot capabilities could make cloned hosts just as effective for deployment and backup as they are in the lab (but that’s another blog).

Here’s the process we will continue with for this part in the lab series:

  1. Create NFS folder in NexentaStor for the ESXi template and share via NFS;
  2. Modify the NFS folder properties in NexentaStor to:
    1. limit access to the hardware ESXi host only;
    2. grant the hardware ESXi host “root” access;
  3. Create a folder in NexentaStor for the ESX template and create a Zvol;
  4. From VI Client’s “Add Storage…” function, we’ll add the new NFS and iSCSI volumes to the Datastore;
  5. Create ESX and ESXi clean installations in these “template” volumes as a cloning source;
  6. Unmount the “template” volumes using the VI Client and unshare them in NexentaStore;
  7. Clone the “template” Zvol and NFS file systems using NexentaStore;
  8. Mount the clones with VI Client and complete the ESX and ESXi installations;
  9. Mount the main Zvol and ISO storage to ESX and ESXi as primary shared storage;
Basic storage architecture for the ESX-on-ESX lab.

Basic storage architecture for the ESX-on-ESX lab.

Read the rest of this entry ?

h1

In-the-Lab: Full ESX/vMotion Test Lab in a Box, Part 3

August 21, 2009

In Part 2 of this series we introduced the storage architecture that we would use for the foundation of our “shared storage” necessary to allow vMotion to do its magic. As we have chosen NexentaStor for our VSA storage platform, we have the choice of either NFS or iSCSI as the storage backing.

In Part 3 of this series we will install NexentaStor, make some file systems and discuss the advantages and disadvantages of NFS and iSCSI as the storage backing. By the end of this segment, we will have everything in place for the ESX and ESXi virtual machines we’ll build in the next segment.

Part 3, Building the VSA

For DRAM memory, our lab system has 24GB of RAM which we will apportion as follows: 2GB overhead to host, 4GB to NexentaStor, 8GB to ESXi, and 8GB to ESX. This leaves 2GB that can be used to support a vCenter installation at the host level.

Our lab mule was configured with 2x250GB SATA II drives which have roughly 230GB each of VMFS partitioned storage. Subtracting 10% for overhead, the sum of our virtual disks will be limited to 415GB. Because of our relative size restrictions, we will try to maximize available storage while limiting our liability in case of disk failure. Therefore, we’ll plan to put the ESXi server on drive “A” and the ESX server on drive “B” with the virtual disks of the VSA split across both “A” and “B” disks.

Our VSA Virtual Hardware

For lab use, a VSA with 4GB RAM and 1 vCPU will suffice. Additional vCPU’s will only serve to limit CPU scheduling for our virtual ESX/ESXi servers, so we’ll leave it at the minimum. Since we’re splitting storage roughly equally across the disks, we note that an additional 4GB was taken-up on disk “A” during the installation of ESXi, therefore we’ll place the VSA’s definition and “boot” disk on disk “B” – otherwise, we’ll interleave disk slices equally across both disks.

NexentaStor-VSA-virtual-hardware

  • Datastore – vLocalStor02B, 8GB vdisk size, thin provisioned, SCSI 0:0
  • Guest Operating System – Solaris, Sun Solaris 10 (64-bit)
  • Resource Allocation
    • CPU Shares – Normal, no reservation
    • Memory Shares – Normal, 4096MB reservation
  • No floppy disk
  • CD-ROM disk – mapped to ISO image of NexentaStor 2.1 EVAL, connect at power on enabled
  • Network Adapters – Three total¬†
    • One to “VLAN1 Mgt NAT” and
    • Two to “VLAN2000 vSAN”
  • Additional Hard Disks – 6 total
    • vLocalStor02A, 80GB vdisk, thick, SCSI 1:0, independent, persistent
    • vLocalStor02B, 80GB vdisk, thick, SCSI 2:0, independent, persistent
    • vLocalStor02A, 65GB vdisk, thick, SCSI 1:1, independent, persistent
    • vLocalStor02B, 65GB vdisk, thick, SCSI 2:1, independent, persistent
    • vLocalStor02A, 65GB vdisk, thick, SCSI 1:2, independent, persistent
    • vLocalStor02B, 65GB vdisk, thick, SCSI 2:2, independent, persistent

NOTE: It is important to realize here that the virtual disks above could have been provided by vmdk’s on the same disk, vmdk’s spread out across multiple disks or provided by RDM’s mapped to raw SCSI drives. If your lab chassis has multiple hot-swap bays or even just generous internal storage, you might want to try providing NexentaStor with RDM’s or 1-vmdk-per-disk vmdk’s for performance testing or “near” production use. CPU, memory and storage are the basic elements of virtualization and there is no reason that storage must be the bottleneck. For instance, this environment is GREAT for testing SSD applications on a resource limited budget.

Read the rest of this entry ?

h1

In-the-Lab: Full ESX/vMotion Test Lab in a Box, Part 2

August 19, 2009

In Part 1 of this series we introduced the basic Lab-in-a-Box platform and outlined how it would be used to provide the three major components of a vMotion lab: (1) shared storage, (2) high speed network and (3) multiple ESX hosts. If you have followed along in your lab, you should now have an operating VMware ESXi 4 system with at least two drives and a properly configured network stack.

In Part 2 of this series we’re going to deploy a Virtual Storage Appliance (VSA) based on an open storage platform which uses Sun’s Zetabyte File System (ZFS) as its underpinnings. We’ve been working with Nexenta’s NexentaStor SAN operating system for some time now and will use it – with its web-based volume management – instead of deploying OpenSolaris and creating storage manually.

Part 2, Choosing a Virtual Storage Architecture

To get started on the VSA, we want to identify some key features and concepts that caused us to choose NexentaStor over a myriad of other options. These are:

  • NexentaStor is based on open storage concepts and licensing;
  • NexentaStor comes in a “free” developer’s version with 4TB 2TB of managed storage;
  • NexentaStor developer’s version includes snapshots, replication, CIFS, NFS and performance monitoring facilities;
  • NexentaStor is available in a fully supported, commercially licensed variant with very affordable $/TB licensing costs;
  • NexentaStor has proven extremely reliable and forgiving in the lab and in the field;
  • Nexenta is a VMware Technology Alliance Partner with VMware-specific plug-ins (commercial product) that facilitate the production use of NexentaStor with little administrative input;
  • Sun’s ZFS (and hence NexentaStor) was designed for commodity hardware and makes good use of additional RAM for cache as well as SSD’s for read and write caching;
  • Sun’s ZFS is designed to maximize end-to-end data integrity – a key point when ALL system components live in the storage domain (i.e. virtualized);
  • Sun’s ZFS employs several “simple but advanced” architectural concepts that maximize performance capabilities on commodity hardware: increasing IOPs and reducing latency;

While the performance features of NexentaStor/ZFS are well outside the capabilities of an inexpensive “all-in-one-box” lab, the concepts behind them are important enough to touch on briefly. Once understood, the concepts behind ZFS make it a compelling architecture to use with virtualized workloads. Eric Sproul has a short slide deck on ZFS that’s worth reviewing.

ZFS and Cache – DRAM, Disks and SSD’s

Legacy SAN architectures are typically split into two elements: cache and disks. While not always monolithic, the cache in legacy storage typically are single-purpose pools set aside to hold frequently accessed blocks of storage – allowing this information to be read/written from/to RAM instead of disk. Such caches are generally very expensive to expand (when possible) and may only accomodate one specific cache function (i.e. read or write, not both). Storage vendors employ many strategies to “predict” what information should stay in cache and how to manage it to effectively improve overall storage throughput.

New cache model used by ZFS allows main memory and fast SSDs to be used as read cache and write cache, reducing the need for large DRAM cache facilities.

New cache model used by ZFS allows main memory and fast SSDs to be used as read cache and write cache, reducing the need for large DRAM cache facilities.

Read the rest of this entry ?

h1

Installing: Xtravirt Virtual SAN

February 10, 2009

Today we’re looking at the Xtravirt Virtual SAN Appliance (VSA) solution for use with VMware ESX Server 3. It is designed to be a simple to deploy, redundant (DRBD synchronization), high-availability iSCSI SAN between two ESX servers. We are installing it on two ESXi servers, each with local storage, running the latest patch update (3.5.0 build 143129).

Initial Installation

XVS requires a virtual machine import: either using Converter or manual process. We follow the manual process. We used the command line to convert the imported disk into a ESXi-compliant format:

vmkfstools -i XVS.vmdk -d thin XVSnew.vmdk
rm -f XVS.vmdk XVS-*

After conversion, you have a 2GB virtual machine (times two) ready for configuration. We removed the legacy ethernet and hard disk that came with the inventory import. Then add the “existing” disk and new Ethernet (flex) controller.

We then added a 120GB virtual disk to each node using the local storage controllers: LSI 1068SAS (RAID1) for node 1 and NVidia MCP55Pro (RAID1) for node 2. Node 1 and 2 are using the same 250GB Seagate ES.2 (RAID edition) drives. Read the rest of this entry ?

h1

SME Stack V0.1, Part 3 – Storage Solutions

January 2, 2009

If storage is the key, shared storage the key that opens all locks. In the early days of file servers, shared storage meant a common file store presented to users over the network infrastructure. This storage was commonly found in a DAS array – usually RAID1 or RAID5 – and managed by a general purpose server operating system (like Windows or Netware). Eventually, such storage paradigms adopted clustering technologies for high-availability, but the underlying principles remained much the same: an extrapolation of a general purpose implementation.

Today, shared storage means something completely different. In fact, the need for “file servers” of old has not disappeared but the dependency on DAS for the place where stored data is ultimately placed has moved to the network. Network attached storage – in the form of filers and network block devices – are replacing DAS as companies retire legacy systems and expand their data storage and business continuity horizons. Today, commercial and open source software options abound that provide stability, performance, scalability, redundancy and feature sets that can provide increased functionality and accelerated ROI to their adopters. Read the rest of this entry ?