SOLORI’s Laws of SMB Virtual Architecture
March 2, 2009
- Single points of failure must be eliminated.
- Start simple, add complexity gradually.
- Improve stability and reliability first.
- Improve capacity only after achieving stability and reliability.
- Start with 50% more storage than you “need.”
- Start with 4GB of RAM per CPU core.
- Start with at least 3 CPU cores.
- Avoid front-side bus architectures.
- Use as many disks as possible to achieve your storage target.
- Secure your management network.
Law 1: Single Points of Failure Must Be Eliminated
This could have many interpretations, but here’s mine: Noah was right, everything must come in pairs. At the most basic level, this means two switches, two “hosts” and two Gigabit Ethernet ports per trunk, per “host” – minimum.
Redundant Host Power Supplies
At the very low-end, computer chassis do not come with redundant power supplies. Since undersizing, line transients and heat build-up are leading causes of power supply failure, this risk can be mitigated with aftermarket power supply replacements, which will run under $250/server.
For switch redundancy, an alternative to switch stacking is 802.3ad link aggregation between two non-stacking switches. Today most “web managed” VLAN-capable switches include 802.3ad, LACP or at least static trunking between switches. This allows multiple ports to be “bonded together” to form a single “logical link” between switches. This class of switch in 24-port Gigabit Ethernet can be found for $300-500 each.
When trunking between non-stacked switches, create a trunk for your NAS traffic (should be a separate segment) and a separate trunk for all other traffic. Start with a two-port trunk between switches for VLAN traffic and a two-port trunk for NAS traffic. Add another port to the NAS trunk for every additional hypervisor (host) beyond the first two – never use more than 8 ports per trunk.
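The trunk sizing rule above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb: the function and constant names are made up for the example, and it assumes the VLAN trunk stays at two ports while only the NAS trunk grows.

```python
MAX_PORTS_PER_TRUNK = 8  # upper bound from the rule of thumb above


def trunk_ports(hosts):
    """Return inter-switch trunk sizes for a given number of hypervisor hosts.

    Hypothetical helper: VLAN trunk fixed at 2 ports (an assumption),
    NAS trunk starts at 2 ports and gains 1 port per host beyond the first two.
    """
    if hosts < 2:
        raise ValueError("the design assumes at least two hosts")
    nas = 2 + (hosts - 2)
    return {"vlan": 2, "nas": min(nas, MAX_PORTS_PER_TRUNK)}


if __name__ == "__main__":
    for n in (2, 4, 12):
        print(n, "hosts ->", trunk_ports(n))
```

Once the NAS trunk reaches the 8-port cap, the sketch simply stops growing it; at that point a second storage segment or a faster interconnect is probably the better answer.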
Redundant Host Gigabit NIC(s)
The network interface cards used in the hypervisor host are important. While many Broadcom BCM5800 and NVidia MCP55 variants are supported, the Intel e1000 presents the least complicated solution. Given that a dual-port e1000 NIC can be had for less than $150, it’s good insurance to populate your hosts with these cards. On a budget, the single-port PCI version can be had for less than $30 each.
Once installed into the host system, the simple solution is to plug one port into each switch and let VMware’s basic port sharing take care of the redundancy piece. Without cross-switch trunking (stacking), any other approach will not be reliable AND/OR simple – a “second law” violation.
Factor two Gigabit Ethernet ports for VLAN and management traffic, and two ports for NAS traffic – per host. If you have more budget, add two additional ports to separate management (and later VMotion) traffic from the VLAN networks. That’s six Gigabit ports per host.
On the low-end, two non-redundant NAS boxes will usually cost much less than one redundant box. This can be an advantage as two storage systems can potentially provide twice the throughput. By apportioning storage across multiple appliances you can deliver better targeted performance on lesser equipment.
Here is where a new challenge presents itself: how to split-up network ports for the storage device. In most low-end devices, the choice is simple: it has only one port. However, in newer devices like the Thecus N5200B or QNAP TS-509 Pro you get two Gigabit Ethernet interfaces for fail-over and load distribution.
Make sure that your data can be easily backed-up between the devices on a regular basis if “replication” is not an option or requires too much of a performance impact. Of course, you can BYO storage system using server hardware and lots more memory to improve performance and save money, but this is not “getting started” level stuff. Before going that route, have a look at the 7-drive Thecus N7700 or the 8-drive QNAP TS-809 Pro.
Let’s face it, outside of the lab useful loads will be handling company applications and data. If only one host is used – and it goes down – the recovery time after failure could be unpredictable. It is always a good idea to start a production virtualization system in pairs – or at least a system with NBD replacement service.
The SOLORI Eco-System servers should run 3-5 years problem free depending on deployment environment and power quality. Low-end “white box” servers will have a less predictable lifespan and maintenance history. When in doubt: plan for the system to fail or need maintenance on a “regular” basis. Prime failure modes for low-end “white box” systems:
- Under-rated power supplies & supply fans,
- CPU fans,
- Chip set fans
When the power supply on the “white box” goes, replace it with a better class of supply – possibly redundant if the case will allow. Likewise the CPU fan: after about 6 months of continuous operation, check it and replace with a “cooling tower” model – lots of fins and surface area.
That said, you’ll need a second server to run your operations on while maintenance is taking place. Actually, you’d need a second server anyway (law 1 violation) and it’s good for load distribution and isolation as well. For any system of N virtual hosts, make sure your workloads and memory footprints do not scale beyond the capability of N-1 boxes.
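The N-1 rule can be checked with a quick back-of-the-envelope script. The sketch below is an assumption-laden simplification: it only compares total guest memory against the RAM left when the largest host is down, and it ignores hypervisor overhead, CPU load and how guests actually pack onto individual hosts. The function name is hypothetical.

```python
def survives_one_host_failure(host_ram_gb, guest_ram_gb):
    """True if total guest RAM fits on the hosts left after losing the largest one.

    host_ram_gb:  list of RAM sizes (GB), one entry per host
    guest_ram_gb: list of RAM sizes (GB), one entry per guest
    """
    if len(host_ram_gb) < 2:
        return False  # a single host has no N-1 capacity at all
    remaining = sum(host_ram_gb) - max(host_ram_gb)  # worst case: biggest host fails
    return sum(guest_ram_gb) <= remaining


if __name__ == "__main__":
    print(survives_one_host_failure([16, 16], [4, 4, 4]))  # 12 GB of guests, 16 GB left
    print(survives_one_host_failure([16, 16], [8, 8, 8]))  # 24 GB of guests, 16 GB left
```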
Law 2: Start Simple – Add Complexity Gradually
As you ease into the job of “virtualization master” don’t let your yin get ahead of your yang. A lot of virtualization projects stumble trying to get too sophisticated too soon. Unless you’re a guru at networks or storage and fully understand the implications of vmnics and vswitches, take the simple approach with networking and file system design; then branch-out a little at a time.
Isolate your management network as the “default VLAN” and put nothing else on it. This makes management and trouble-shooting much easier. Putting management on a “tagged” VLAN requires additional setup with the ESX console but can be done fairly easily. Make sure your switch is configured properly before making such changes.
Add one LUN at a time with storage. If you have “thin provisioning” employ it during this phase so when you blow-away your first two or three VMFS file systems, it won’t take too long. If you’re using NFS, so much the better, but eventually you’ll be separating LUNs based on the performance requirements of your workloads.
If you are using ESXi in a diskless environment, this can make for a very simple and resilient deployment. However, remember that you’ll need an NFS or iSCSI/VMFS volume for scratch space. If you have a single LUN for scratch, make sure each ESX host has its own directory for scratch files and you point the host to that directory from the advanced settings window.
Law 3: Improve Stability and Reliability First
It’s not quite tortoise and hare, but speed gains that sacrifice stability and reliability doom your platform to maintenance hell. The don’ts of hypervisor deployment as related to stability and reliability:
- Never overclock your host CPU,
- Never use RAM timings that deviate from the package specs,
- Never overclock your HyperTransport or PCI bus,
- Never use “desktop” drives in your RAID arrays,
- Never use “jumbo frames” on your network.
The small percentage gains from all of these “performance enhancements” – combined – are not worth the headaches of an unreliable system. This sounds like Law 2, but it’s good advice. If you MUST tweak any of these things, do it VERY gradually and regression test heavily.
Law 4: Achieving Stability and Reliability, Then Capacity
This could be considered a corollary to the “tweaker’s law” (Law #3): it’s easier to make changes in small increments. Given the importance of your platform’s longevity and life-cycle, the more VMs you’re running on your host, the more catastrophic the result of a failure. Regression test your “tweaks” with a small number of workloads before deploying them across your cluster or farm.
Law 5: Start with 50% More Storage
Storage predictions are based on today’s business practices: those practices will change dramatically as virtualization takes hold of your infrastructure. Ease of deployment, ease of management, ease of performance tracking, etc. will all lead to an increase in systems utilization once consolidation is completed. Plan on this eventuality by factoring in 50% more storage than your initial estimate. On the low-end, deduplication is rarely an option, so factor that into your backup plan if NAS-to-NAS replication will be used. If your 3rd party backup has deduplication, base your storage requirements on the 50% bonus we mention above.
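As a rough planning aid, the 50% rule can be written down as follows. This is a sketch under my own assumptions: the function name is hypothetical, and it treats each NAS-to-NAS replica as a full, non-deduplicated copy of the primary target.

```python
def storage_plan_gb(estimate_gb, headroom=0.5, replicas=1):
    """Apply the 50% headroom rule and size full-copy replicas.

    estimate_gb: initial storage estimate in GB
    headroom:    extra fraction to add (0.5 = 50% more)
    replicas:    number of non-deduplicated replica copies (assumption)
    """
    primary = estimate_gb * (1 + headroom)
    return {"primary": primary, "replica_total": primary * replicas}


if __name__ == "__main__":
    print(storage_plan_gb(2000))  # a 2TB estimate becomes a 3TB target
```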
Law 6: Start with 4GB of RAM per Core.
Even though the 4GB sticks are a bit more “premium” in price, so is throwing out your 2GB sticks only to replace them with 4GB sticks 6 months later. If a 4x4GB kit is $320 and a 4x2GB kit is $120 today, 6 months from now you’ll be spending $240 on 4x4GB sticks anyway ($120 + $240 = $360 versus $320 up front) – net loss: $40. Operationally, memory is cheap compared to the consequences of running out.
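The memory pricing argument is simple arithmetic, and it is easy to rerun with your own quotes. The helper below is hypothetical; the prices in the example are the ones quoted above, not current market data.

```python
def upgrade_penalty(small_now, large_now, large_later):
    """Extra spend from buying the small kit now and the large kit later,
    compared with buying the large kit up front."""
    return (small_now + large_later) - large_now


if __name__ == "__main__":
    # 4x2GB today at $120, 4x4GB today at $320, 4x4GB in six months at $240
    print(upgrade_penalty(small_now=120, large_now=320, large_later=240))  # 40
```

A negative result would mean waiting is cheaper on hardware alone, though it still ignores the operational cost of running short on memory in the meantime.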
Even with VMware’s page sharing technology and memory balloon driver (requires VMware Tools installed in each guest) you will hit demand periods where machines struggle to allocate free memory. This, of course, assumes that you’re running in an oversubscribed memory environment, and for low-end hosts that max-out at 8-16GB of RAM this is almost a certainty (see my blog on SBS 2008).
If you want a more mathematical way of looking at it, assume you have a 3:1 guest-to-core consolidation ratio. That’s 2x1GB plus 1x2GB guests for every core. If your guests have higher minimum requirements, you’re going to need to up your memory. On a 4-core processor, that’s 9 guest operating systems (4 cores minus 1 supervisor core, times 3) and 12-16GB of RAM.
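Here is that math as a small sketch. The names are hypothetical, and the 12-16GB range is my reading of the numbers: the low end covers guest memory on the usable cores only, the high end is 4GB for every core including the one reserved for the supervisor.

```python
def guest_capacity(cores, guests_per_core=3, reserved_cores=1):
    """Guests supported at the given consolidation ratio, after reserving
    cores for the hypervisor, network and NAS traffic."""
    usable = max(cores - reserved_cores, 0)
    return usable * guests_per_core


def ram_range_gb(cores, guest_mix_gb=(1, 1, 2), reserved_cores=1):
    """(low, high) RAM estimate in GB for the per-core guest mix."""
    per_core = sum(guest_mix_gb)  # 2x1GB + 1x2GB = 4GB per core
    low = max(cores - reserved_cores, 0) * per_core
    high = cores * per_core
    return low, high


if __name__ == "__main__":
    for cores in (2, 3, 4):
        print(cores, "cores:", guest_capacity(cores), "guests,",
              ram_range_gb(cores), "GB RAM")
```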
Law 7: Start with 3+ Cores
OK. I tipped you off to this thinking. Assume a worst-case deployment scenario where the network and NAS traffic consume an entire core. In a dual-core system with 3:1 consolidation, you can only bank on running 3 guest operating systems of any useful size.
With a triple-core system you double that number to 6 guest OS’s plus some slack for unused CPU cycles. If you’re into saving money, two X3 processors at 2.8GHz (6MB L3 cache) cost $100 less than an equivalent pair of X4 processors (remember, two systems).
Law 8: Avoid FSB Architecture
This is more a plug for good system architecture than AMD’s HyperTransport. The fact is, AMD has proven its architecture is superior for virtualization over front side bus (FSB) and the “direct connect architecture” is a primary factor. Look at the top dogs on the virtualization benchmarks to see what we’re talking about.
Law 9: Use as Many Disks as Possible for Your Storage
Low-end storage is dominated by SATA, and SATA has performance problems. To get near-SAS-like performance out of SATA, some vendors have taken to using only the outer tracks of the drive and ignoring the inner tracks. The higher data rate of the outer tracks improves throughput, and the reduction in track seeks (fewer active tracks) improves random access performance.
In SMB applications, it is unlikely that you will be willing to sacrifice storage capacity for a boost in disk performance. It is more cost effective to spread the disk load across more disk spindles: more disks = more available on-demand bandwidth. The whole concept of RAID was based on “inexpensive disks” after all. You WILL lose the advantage of SATA’s effective storage density when LUN-loading comes into play.
There is a great article on sizing VMFS storage to provide maximum performance at VM/ETC. It is incumbent on the administrator to manage separate, physical disk domains (LUNs) to segment performance. While this is easy to do with mixed arrays in high-end SAN fabrics, this is a manual process for low-end storage.
LUN-loading is really more about managing I/O operations per second (IOPS) and this is what ultimately governs random access performance – especially for shared storage paradigms like VMFS and NFS. Ultimately, the SMB storage application will be limited by:
- How many servers actively use the same shared volume
- How many “spindles” define the shared volume
- The cumulative load each server places on shared storage
For better mid-range performance (not low-end) you might consider DIY storage using a tiered storage approach (Nexenta’s use of tiers with ZFS is one example) where IOPS can be boosted using a small number of high-performance disks as write-through-storage. While SSDs are being used to “supercharge” storage in this arena, the use of RAM-based disks could be more reliable and increase performance on random access functions.
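To get a feel for how spindle count and server count interact, here is a rough IOPS estimate. It uses the common approximation that one random I/O costs an average seek plus half a rotation. The function names, the 8.5ms seek time and the even split across servers are all assumptions for illustration; RAID write penalties, caching and controller limits are ignored.

```python
def disk_iops(rpm, avg_seek_ms):
    """Approximate random IOPS for one disk: 1 / (avg seek + avg rotational latency)."""
    rot_latency_ms = 0.5 * 60_000 / rpm  # half a revolution, in milliseconds
    return 1000 / (avg_seek_ms + rot_latency_ms)


def per_server_iops(spindles, servers, rpm=7200, avg_seek_ms=8.5):
    """Even share of a shared volume's random IOPS per active server (assumed split)."""
    return spindles * disk_iops(rpm, avg_seek_ms) / servers


if __name__ == "__main__":
    print(round(disk_iops(7200, 8.5)))         # roughly 79 IOPS for one 7200 RPM disk
    print(round(per_server_iops(4, 4)))        # 4 spindles shared by 4 servers
    print(round(per_server_iops(8, 4)))        # doubling spindles doubles the share
```

The point of the sketch is the ratio, not the absolute numbers: adding spindles raises the ceiling, while adding servers to the same volume divides it.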
Law 10: Secure Your Management Network
Consider the hypervisor console a key to your data center: literally. Just like conventional data centers use HID and biometric keys to keep physical intruders out, so should you consider the use of strong security to protect your hypervisor’s control and management interface.
Security gurus will tell you that “given access to the physical machine and console, any system can be broken into…” The management console for your hypervisor allows anyone with authorized access to gain admission to the console of your virtual machines. Using the security tools provided by the hypervisor is only the first step.
Keeping unauthorized access away from the management console is also a priority. Make sure to segment the network used to manage the hypervisor console using the same level of authorization and identity policies you would apply to a physical data center:
- Firewall the console’s network interface(s)
- Use atomic grouping to control user access
- Use unique, per-user access models
- Limit user source address to local and VPN only
- Provide strong password requirements
- Use multi-factor authentication where possible
- Log access to separate security logs
- Review access logs regularly
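As one small illustration of the "local and VPN only" item, the check below tests a source address against an allow-list using Python's standard ipaddress module. The two subnets are placeholders I made up for the example; in practice this rule belongs in your firewall, not in a script.

```python
import ipaddress

# Hypothetical allow-list: a local management subnet and a VPN address pool.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.168.10.0/24"),  # assumed local management LAN
    ipaddress.ip_network("10.8.0.0/24"),      # assumed VPN client pool
]


def source_allowed(addr):
    """True if addr falls inside one of the allowed management source networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ALLOWED_SOURCES)


if __name__ == "__main__":
    for candidate in ("192.168.10.25", "10.8.0.7", "203.0.113.9"):
        print(candidate, "allowed" if source_allowed(candidate) else "denied")
```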