SOLORI’s Take: The most interesting aspect of the EX benchmark is its clock-adjusted scaling factor: between 70% and 91% versus a 2P/8-core Nehalem-EP reference (Cisco UCS, B200 M1, 25.06@17 tiles). The unpredictable nature of Intel’s “turbo” feature – varying with thermal loads and per-core conditions – makes an exact clock-for-clock comparison difficult. However, if the scaling factor is 90%, the EX blows away our previous expectations about the platform’s scalability. Where did we go wrong when we predicted a conservative 44@39 tiles? We’re looking at three things: (1) a bad assumption about the effectiveness of “turbo” in the EP VMmark case (setting Ref_EP_Clock to 3.33 GHz), and (2) underestimating EX’s scaling efficiency (assumed 70%), (3) assuming a 2.26GHz clock for EX.
Correcting for the as-tested clock/turbo numbers, and using AMD’s 2P-to-4P VMmark scaling efficiency of 83%, and shifting to the new UCS baseline (with newer ESX version) the Nehalem-EX prediction factors to:
Clearly, this approach either overestimates the scaling efficiency or underestimates the “turbo” mode. IBM claims that a 2.93 GHz “turbo” setting is viable where Intel suggests 2.67 GHz is the maximum, so there is a source of potential bias. Looking at the tiles-per-core ratio of the VMmark result, the Nehalem-EX drops from 2.13 tiles per core on EP/2P platforms to 1.5 tiles per core on EX/4P platforms – about a 30% drop in per-core loading efficiency. That indicator matches well with our initial 75% scaling efficiency moving from 2P to 4P – something that AMD demonstrated with Istanbul last August. Given the high TDP of EX and IBM’s 2.93 GHz “turbo” specification, it’s possible that “turbo” is adding clock cycles (and power consumption) and compensating for a “lower” scaling efficiency than we’ve assumed. Looking at the same estimation with 2.93GHz “clock” and 71% efficiency (1.5/2.13), the numbers fall in line with VMmark:
This give us a good basis for evaluating 2P vs. 4P Nehalem systems: scaling factor of 71% and capable of pushing clock towards the 3GHz mark within its thermal envelope. Both of these conclusions fit typical 2P-to-4P norms and Intel’s process history.
That’s nowhere near good enough to top the current 8P, 48-core Istanbul VMmark at 53.73@35 tiles, so we’ll likely have to wait for faster 6100 parts to see any new AMD records. However, assuming AMD’s proposition is still “value 4P” so about 200 VM’s at under $18K/server gets you around $90/VM or less.
Given that VMmark is a single node test harness, the difference between rack server and blade server architectures is a non-issue. However, more than just rack vs. blade is going on in this comparison. The Cisco UCS system is being fed by a pair of 10GE converged network adapters – used both for host network access and Fiber Channel bus access – and a monolithic storage array in the guise of a CLARiiON CX4-240 complete with a complement of 20, 73GB STEC SSD’s – just to sweeten the pot.
VMmark Network Configuration for the UCS B200-M1
While it is clear from past VMmark posts that the network speed (beyond 1Gbps) has little to do with the results, it is nice to see the confidence Cisco has in the CNA approach (Cisco UCS M71KR-Q) to go with the “eggs in a basket” solution. Given the storage demands on the CNA, the VMmark result should remove any doubt about the viability (performance) of high-capacity tandems (we’ll leave the physical link security concerns for another day.)
However, where the “rubber meets the road” in this contest is storage I/O and this solution – in our opinion is just plain showing off. With just 41 disks to build from, the CX4-240 has been configured to deliver 37 LUNs – nearly one LUN per unit disk. Before any awards are given out for storage of the year, we need to consider that 36 of those LUNs are RAID0 – yielding a testing platform with no real-world analog (hence “showing off”.)
CLARiiON CX4-240 Storage Build-out for UCS B200-M1 VMmark
Given the ease at which RAID0 can be replaced by RAID1+0, it may be safe to assume that the same results could have been obtained by using 77 disks instead of 41 – at which point the CX4-240 would still be less than half the size of the top VMmark’s 172-disk solution. The reason is clear: SSD’s accelerate I/O loads incredibly well in architectures that support them. If anything, this “runner-up” proves that SSD adoption is on the verge of becoming mainstream.
But what does this test show about UCS? Firstly, it shows that Cisco’s platform can compete with the best solutions out there on CPU and I/O performance (what’s a half a percentage point across 102 virtual machines?) It’s not really a surprise given that the UCS platform was designed to do just that – and within a neatly managed framework. Secondly, it shows that the choice of EMC as a partner was an excellent one. As Martin Glassborow commented on his Storagebod’s Blog, EMC’s involvement in VMware has energized the storage vendor to take bold and innovative steps towards Cloud Computing solutions that it might not have done otherwise (like the RAID0 SSD array). Thirdly and most importantly, it underscores the importance of predictable performance in a virtualization solution. Given the UCS/vBlock approach to systems organization, it can be very difficult not to draw solid parallels between the benchmarks and expectations for net new builds based on the criterion.
NEC’s venerable Express5800/A1160 is back at the top VMmark chart, this time establishing the brand-new 64-core category with a score of 48.23@32 tiles – surpassing its 48-core 3rd place posting by over 30%. NEC’s new 16-socket, 64-core, 256GB “Dunnington” X7460 Xeon-based score represents a big jump in performance over its predecessor with a per tile ratio of 1.507 – up 6% from the 48-core ratio of 1.419.
At $500/core, NEC’s gambit may represent an expensive form of “core liposuction” but it was a necessary one to meet VMware’s “logical processor per host” limitation of 64. That’s right, currently VMware’s vSphere places a limit on logical processors based on the following formula:
CPU_Sockets X Cores_Per_Socket X Threads_Per_Core =< 64
According to VMware, the other 32 cores would have been “ignored” by vSphere had they been enabled. Since “ignored” is a nebulous term (aka “undefined”), NEC did the “scientific” thing by disabling 32 cores and calling the system a 64-core server. The win here: a net 6% improvement in performance per tile over the 6-core configuration – ostensibly from the reduced core loading on the 16MB of L3 cache per socket and reduction in memory bus contention.
Moving forward to 2010, what does this mean for vSphere hardware configurations in the wake of 8-core, 16-thread Intel Nehalem-EX and 12-core, 12-thread AMD Magny-Cours processors? With a 4-socket Magny-Cours system limitation, we won’t be seeing any VMmarks from the boys in green beyond 48-cores. Likewise, the boys in blue will be trapped by a VMware limitation (albeit, a somewhat arbitrary and artificial one) into a 4-socket, 64-thread (HT) configuration or an 8-socket, 64-core (HT-disabled) configuration for their Nehalem-EX platform – even if using the six-core variant of EX. Looks like VMware will need to lift the 64-LCPU embargo by Q2/2010 just to keep up.
Fujitsu’s RX300 S5 rack server takes the top spot in VMware’s VMmark for 8-core systems today with a score of 25.16@17 tiles. Loaded with two of Intel’s top-bin 3.33GHz, 130W Nehalem-EP processors (W5590, turbo to 3.6GHz per core) and 96GB of DDR3-1333 R-ECC memory, the RX300 bested the former champ – the HP ProLiant BL490c G6 blade – by only about 2.5%.
With 17 tiles and 102 virtual machines on a single 2U box, the RX300 S5 demonstrates precisely how well vSphere scales on today’s x86 commodity platforms. It also appears to demonstrate both the value and the limits of Intel’s “turbo mode” in its top-bin Nehalem-EP processors – especially in the virtualization use case – we’ll get to that later. In any case, the resulting equation is:
More * (Threads + Memory + I/O) = Dense Virtualization
We could have added “higher execution rates” to that equation, however, virtualization is a scale-out applications where threads, memory pool and I/O capabilities dominate the capacity equation – not clock speed. Adding 50% more clock provides less virtualization gains than adding 50% more cores, and reducing memory and context latency likewise provides better gains that simply upping the clock speed. That’s why a dual quad-core Nehalem 2.6GHz processor will crush a quad dual-core 3.5GHz (ill-fated) Tulsa.
Speaking of Tulsa, unlike Tulsa’s rather anaemic first-generation hyper-threading, Intel’s improved SMT in Nehalem “virtually” adds more core “power” to the Xeon by contributing up to 100% more thread capacity. This is demonstrated by Nehalem-EP’s 2 tiles per core contributions to VMmark where AMD’s Istanbul quad-core provides only 1 tile per core. But exactly what is a VMmark tile and how does core versus thread play into the result?
The Illustrated VMmark "Tile" Load
As you can see, a “VMmark Tile” – or just “tile” for short – is composed of 6 virtual machines, half running Windows, half running SUSE Linux. Likewise, half of the tiles are running in 64-bit mode while the other half runs in 32-bit mode. As a whole, the tile is composed of 10 virtual CPUs, 5GB of RAM and 62GB of storage. Looking at how the parts contribute to the whole, the tile is relatively balanced:
Operating System / Mode
Windows Server 2003 R2
SUSE Linux Enterprise Server 10 SP2
If we stop here and accept that today’s best x86 processors from AMD and Intel are capable of providing 1 tile for each thread, we can look at the thread count and calculate the number of tiles and resulting memory requirement. While that sounds like a good “rule of thumb” approach, it ignores specific use case scenarios where synthetic threads (like HT and SMT) do not scale linearly like core threads do where SMT accounts for only about 12% gains over single-threaded core, clock-for-clock. For this reason, processors from AMD and Intel in 2010 will feature more cores – 12 for AMD and 8 for Intel in their Magny-Cours and Nehalem-EX (aka “Beckton”), respectively.
Learning from the Master
If we want to gather some information about a specific field, we consult an expert, right? Judging from the results, Fujitsu’s latest dual-processor entry has definitely earned the title ‘Master of VMmark” in 2P systems – at least for now. So instead of the usual VMmark $/VM analysis (which are well established for recent VMmark entries), let’s look at the solution profile and try to glean some nuggets to take back to our data centers.
It’s Not About Raw Speed
First, we’ve noted that the processor used is not Intel’s standard “rack server” fare, but the more workstation oriented W-series Nehalem at 130W TDP. With “turbo mode” active, this CPU is capable of driving the 3.33GHz core – on a per-core basis – up to 3.6GHz. Since we’re seeing only a 2.5% improvement in overall score versus the ProLiant blade at 2.93GHz, we can extrapolate that the 2.93GHz X5570 Xeon is spending a lot of time at 3.33GHz – its “turbo” speed – while the power-hungry W5590 spends little time at 3.6GHz. How can we say this? Looking at the tile ratio as a function of the clock speed.
We know that the X5570 can run up to 3.33GHz, per core, according to thermal conditions on the chip. With proper cooling, this could mean up to 100% of the time (sorry, Google). Assuming for a moment that this is the case in the HP test environment (and there is sufficient cause to think so) then the ratio of the tile score to tile count and CPU frequency is 0.433 (24.54/17/3.33). If we examine the same ratio for the W5590, assuming the clock speed of 3.33GHz, we get 0.444 – a difference of 2.5%, or the contribution of “turbo” in the W5590. Likewise, if you back-figure the “apparent speed” of the X5570 using the ratio of the clock-locked W5590, you arrive at 3.25GHz for the W5570 (an 11% gain over base clock). In either case, it is clear that “turbo” is a better value at the low-end of the Nehalem spectrum as there isn’t enough thermal headroom for it to work well for the W-series.
VMmark Equals Meager Network Use
Second, we’re not seeing “fancy” networking tricks out of VMmark submissions. In the past, we’ve commented on the use of “consumer grade” switches in VMmark tests. For this reason, we can consider VMmark’s I/O dependency as related almost exclusively to storage. With respect to networking, the Fujitsu team simply interfaced three 1Gbps network adapter ports to the internal switch of the blade enclosure used to run the client-side load suite and ran with the test. Here’s what that looks like:
Networking Simplified: The "leaders" simple virtual networking topology.
Note that the network interfaces used for the VMmark trial are not from the on-board i82575EB network controller but from the PCI-Express quad-port adapter using its older cousin – the i82571EB. What is key here is that VMmark is tied to network performance issues, and it is more likely that additional network ports might increase the likelihood of IRQ sharing and reduced performance more so than the “optimization” of network flows.
Keeping Storage “Simple”
Third, Fujitsu’s approach to storage is elegantly simple: several “inexpensive” arrays with intelligent LUN allocation. For this, Fujistu employed eight of its ETERNUS DX80 Disk Storage Systems with 7 additional storage shelves for a total of 172 working disks and 23 LUNs. For simplicity, Fujistu used a pair of 8Gbps FC ports to feed ESX and at least one port per DX80 – all connected through a Brocade 5100 fabric switch. The result looked something like this:
And yes, the ESX server is configured to boot from SAN, using no locally attached storage. Note that the virtual machine configuration files, VM swap and ESX boot/swap are contained in a separate DX80 system. This “non-default” approach allows the working VMDKs of the virtual machines to be isolated – from a storage perspective – from the swap file overhead, about 5GB per tile. Again, this is a benchmark scenario, not an enterprise deployment, so trade-offs are in favour of performance, not CAPEX or OPEX.
Even if the DX80 solution falls into the $1K/TB range, to say that this approach to storage is “economic” requires a deeper look. At 33 rack units for the solution – including the FC switch but not including the blade chassis – this configuration has a hefty datacenter footprint. In contrast to the old-school server/blade approach, 1 rack at 3 servers per U is a huge savings over the 2 racks of blades or 3 racks of 1U rack servers. Had each of those servers of blades had a mirror pair, we’d be talking about 200+ disks spinning in those racks versus the 172 disks in the ETERNUS arrays, so that still represents a savings of 15.7% in storage-related power/space.
When will storage catch up?
Compared to a 98% reduction in network ports, a 30-80% reduction server/storage CAPEX (based on $1K/TB SAN), a 50-75% reduction in overall datacenter footprint, why is a 15% reduction in datacenter storage footprint acceptable? After all, storage – in the Fujitsu VMmark case – now represents 94% of the datacenter footprint. Even if the load were less aggressively spread across five ESX servers (a conservative 20:1 loading), the amount of space taken by storage only falls to 75%.
How can storage catch up to virtualization densities. First, with 2.5″ SAS drives, a bank of 172 disks can be made to occupy only 16U with very strong performance. This drops storage to only 60% of the datacenter footprint – 10U for hypervisor, 16U for storage, 26U total for this example. Moving from 3.5″ drives to 2.5″ drives takes care of the physical scaling issue with acceptable returns, but results in only minimal gains in terms of power savings.
Saving power in storage platforms is not going to be achieved by simply shrinking disk drives – shrinking the NUMBER of disks required per “effective” LUN is what’s necessary to overcome the power demands of modern, high-performance storage. This is where non-traditional technology like FLASH/SSD is being applied to improve performance while utilizing fewer disks and proportionately less power. For example, instead of dedicating disks on a per LUN basis, carving LUNs out of disk pools accelerated by FLASH (a hybrid storage pool) can result in a 30-40% reduction in disk count – when applied properly – and that means 30-40% reduction in datacenter space and power utilization.
Here are our “take aways” from the Fujitsu VMmark case:
1) Top-bin performance is at the losing end of diminishing returns. Unless your budget can accommodate this fact, purchasing decisions about virtualization compute platforms need to be aligned with $/VM within an acceptable performance envelope. When shopping CPU, make sure the top-bin’s “little brother” has the same architecture and feature set and go with the unit priced for the mainstream. (Don’t forget to factor memory density into the equation…) Regardless, try to stick within a $190-280/VM equipment budget for your hypervisor hardware and shoot for a 20-to-1 consolidation ratio (that’s at least $3,800-5,600 per server/blade).
2) While networking is not important to VMmark, this is likely not the case for most enterprise applications. Therefore, VMmark is not a good comparison case for your network-heavy applications. Also, adding more network ports increases capacity and redundancy but does so at the risk of IRQ-sharing (ESX, not ESXi) problems, not to mention the additional cost/number of network switching ports. This is where we think 10GE will significantly change the equation in 2010. Remember to add up the total number of in use ports – including out-of-band management – when factoring in switch density. For net new instalments, look for a switch that provides 10GE/SR or 10GE/CX4 options and go with !0GE/SR if power savings are driving your solution.
3) Storage should be simple, easy to manage, cheap (relatively speaking), dense and low-power. To meet these goals, look for storage technologies that utilize FLASH memory, tiered spindle types, smart block caching and other approaches to limit spindle count without sacrificing performance. Remember to factor in at least the cost of DAS when approximating your storage budget – about $150/VM in simple consolidation cases and $750/VM for more mission critical applications (that’s a range of $9,000-45,000 for a 3-server virtualization stack). The economies in managed storage come chiefly from the administration of the storage, but try to identify storage solutions that reduce datacenter footprint including both rack space and power consumption. Here’s where offerings from Sun and NexentaStor are showing real gains.
We’d like to see VMware update VMmark to include system power specifications so we can better gage – from the sidelines – what solution stack(s) perform according to our needs. VMmark served its purpose by giving the community a standard from which different platforms could be compared in terms of the resultant performance. With the world’s eyes on power consumption and the ecological impact of datacenter choices, adding a “power utilization component” to the “server-side” of the VMmark test would not be that significant of a “tweak.” Here’s how we think it can be done:
Require power consumption of the server/VMmark related components be recorded, including:
the ESX platform (rack server, blade & blade chassis, etc.)
the storage platform providing ESX and test LUN(s) (all heads, shelves, switches, etc.)
the switching fabric (i.e. Ethernet, 10GE, FC, etc.)
Power delivered to the test harness platforms, client load machines, etc. can be ignored;
Power measurements should be recorded at the following times:
All equipment off (validation check);
Single tile load;
100% tile capacity;
75% tile capacity;
50% tile capacity;
Power measurements should be recorded using a time-power data-logger with readings recorded as 5-minute averages;
Notations should be made concerning “cache warm-up” intervals, if applicable, where “cache optimized” storage is used.
Why is this important? In the wake of the VCE announcement, solution stacks like VCE need to be measured against each other in an easy to “consume” way. Is VCE the best platform versus a component solution provided by your local VMware integrator? Given that the differentiated VCE components are chiefly UCS, Cisco switching and EMC storage, it will be helpful to have a testing platform that can better differentiate “packaged solutions” instead of uncorrelated vendor “propaganda.”
Let us know what your thoughts are on the subject, either on Twitter or on our blog…
Most importantly for virtualization systems architects is how the vCPU scheduling affects “measured” performance. The telling piece comes from the difference in comparison results where vCPU scheduling is equalized:
AnandTech's Quad Sockets v. Dual Sockets Comparison. Oct 6, 2009.
When comparing the results, De Gelas hits on the I/O factor which chiefly separates VMmark from vAPUS:
The result is that VMmark with its huge number of VMs per server (up to 102 VMs!) places a lot of stress on the I/O systems. The reason for the Intel Xeon X5570′s crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation may be that the VMDq (multiple queues and offloading of the virtual switch to the hardware) implementation of the Intel NICs is better than the Broadcom NICs that are typically found in the AMD based servers.
This is yet another issue that VMware architects struggle with in complex deployments. The latency in “Dunnington” is a huge contributor to its downfall and why the Penryn architecture was a dead-end. Combined with 8 additional threads in the 2P form factor, Nehalem delivers twice the number of hardware execution contexts than Shanghai, resulting in significant efficiencies for Nehalem where small working data sets are involved.
When larger sets are used – as in vAPUS – the Istanbul’s additional cores allows it to close the gap to within the clock speed difference of Nehalem (about 12%). In contrast to VMmark which implies a 3:2 advantage to Nehalem, the vAPUS results suggest a closer performance gap in more aggressive virtualization use cases.
HP’s ProLiant BL490c G6 server blade now tops the VMware VMmark table for 8-core systems – just squeaking past rack servers from Lenovo and Dell with a score of 24.54@17 tiles: a new 8-core record. The half-height blade was equipped with two, quad-core Intel Xeon X5570 (Nehalem-EP, 130W TDP) and 96GB ECC Registered DDR3-1333 (12x 8GB, 2-DIMM/channel) memory.
In our follow-up, we found that HP’s on-line configuration tool does not allow for DDR3-1333 memory so we went to the street for a comparison. For starters, we examined the on-line price from HP with DDR3-1066 memory and the added QLogic QMH2462 Fiber Channel adapter ($750) and additional NC360m dual-port Gigabit Ethernet controller ($320) which came to a grand total of $28,280 for the blade (about $277/VM, not including Blade chassis or SAN storage).
Stripping memory from the build-out results in a $7,970 floor to the hardware, sans memory. Going to the street to find 8GB sticks with DDR3-1333 ratings and HP support yielded the Kingston KTH-PL313K3/24G kit (3x 8GB DIMMs) of which we would need three to complete the build-out. At $4,773 per kit, the completed system comes to $22,289 (about $218/VM, not including chassis or storage) which may do more to demonstrate Kingston’s value in the market place rather than HP’s penchant for “over-priced” memory.
Now, the interesting disclosure from HP’s testing team is this:
SOLORI’s Take:Those of you following closely may be asking yourselves: “Why did HP choose to over-clock the memory controller in this run by pushing a 1066MHz, 2DPC limit to 1333MHz?” It would appear the answer is self-evident: the extra 6% was needed to put them over the Lenovo machine. This issue raises a new question about the VMmark validation process: “Should out of specification configurations be allowed in the general benchmark corpus?” It is our opinion that VMmark should represent off-the-shelf, fully-supported configurations only – not esoteric configuration tweaks and questionable over-clocking practices.
Should there be as “unlimited” category in the VMmark arena? Who knows? How many enterprises knowingly commit their mission critical data and processes to systems running over-clocked processors and over-driven memory controllers? No hands? That’s what we thought… Congratulations anyway to HP for clawing their way to the top of the VMmark 8-core heap…
Dell R710 w/redundant high-output power supply, ($18,209)
2 x Intel Xeon X5570 Processors (included)
96GB ECC DDR3/1066 (12×8GB) (included)
2 x Broadcom NexXtreme II 5709 dual-port GigabitEthernet w/TOE (included)
1 x Intel PRO 1000VT quad-port GigabitEthernet (1x PCIe-x4 slot, $529)
3 x QLogic QLE2462 FC HBA (1x PCIe slot, $1,219/ea)
1 x LSI1078 SAS Controller (on-board)
8 x 15K SAS OS drive, RAID10 (included)
Required ProSupport package ($2,164)
Total as Configured: $24,559 ($241/VM, not including storage)
Three Dell/EMC CX3-40f arrays were used as the storage backing of the test. The storage system included 8GB cache, 2 enclosures and 15, 15K disks per array delivering 19 LUNs at about 300GB each. Intel’s Hyper-Threading and “Turbo Boost” were enabled for 8-thread, 3.33GHz core clocking as was VT; however embedded SATA and USB were disabled as is common practice.
At about $1,445/tile ($241/VM) the new “second dog” delivers its best at a 20% price premium over Lenovo’s “top dog” – although the non-standard OS drive configuration makes-up a half of the difference, with Dell’s mandatory support package making-up the remainder. Using a simple RAID1 SAS and eliminating the support package would have droped the cost to $20,421 – a dead heat with Lenovo at $182/VM.
Comparing the Dell R710 the 2P, 12-core benchmark HP DL385 G6 Istanbul system at 15.54@11 tiles:
2 x Broadcom 5709 dual-port GigabitEthernet (on-board)
1 x Intel 82571EB dual-port GigabitEthernet (1x PCIe slot, $150/ea)
1 x QLogic QLE2462 FC HBA (1x PCIe slot, $1,219/ea)
1 x HP SAS Controller (on-board)
2 x SAS OS drive (included)
$10,673/system total (versus $14,696 complete from HP)
Direct pricing shows Istanbul’s numbers at $1,336/tile ($223/VM) which is a 7.5% savings per-VM over the Dell R710. Going to the street – for memory only – changes the Istanbul picture to $970/tile ($162/VM) representing a 33% savings over the R710.
SOLORI’s Take: Istanbul continues to offer a 20-30% CAPEX value proposition against Nehalem in the virtualization use case – even without IOMMU and higher memory bandwidth promised in upcoming Magny-Cours. With the HE parts running around $500 per processor, the OPEX benefits are there for Istanbul too. It is difficult to understand why HP wants to charge $900/DIMM for 8GB PC-5300 sticks when they are available on the street for 50% less – that’s a 100% markup. Looking at what HP charges for 8GB DDR3/1066 – $1,700/DIM – they are at least consistent. HP’s memory pricing practice makes one thing clear – customers are not buying large memory configurations from their system vendors…
On the contrary, Dell appears to be happy to offer decent prices on 8GB DDR3/1066 with their R710 at approximately $837/DIMM – almost par with street prices. Looking to see if this parity held up with Dell’s AMD offerings, we examined the prices offered with Dell’s R805: while – at $680/DIMM – Dell’s prices were significantly better than HP’s, they still exceeded the market by 50%. Still, we were able to configure a Dell R805 with AMD 2435′s for much less than the equivalent HP system:
1 x Intel PRO 100oPT dual-port GigabitEthernet (1x PCIe slot, included)
1 x QLogic QLE2462 FC HBA (1x PCIe slot, included)
1 x Dell PERC SAS Controller (on-board)
2 x SAS OS drive (included)
$10,678/system total (versus $12,702 complete from Dell)
This offering from Dell should be able to deliver equivalent performance with HP’s DL385 G6 and likewise savings/VM compared to the Nehalem-based R710. Even at the $12,702 price as delivered from Dell, the R805 represents a potential $192/VM price point – about $50/VM (25%) savings over the R710.
Using the same 8-processor HP ProLiant DL785 G6 platform as in the previous run – complete with 2.8GHz AMD Opteron 8439 SE 6-core chips and 256GB DDR2/667 – the new score comes with significant performance bumps in the javaserver, mailserver and database results achieved by the same system configuration as the previous attempt – including the same ESX 4.0 version (164009). So what changed to add an additional 5 tiles to the team’s run? It would appear that someone was unsatisfied with the storage configuration on the mailserver run.
Given that the tile ratio of the previous run ran about 6% higher than its 24-core counterpart, there may have been a small indication that untapped capacity was available. According to the run notes, the only reported changes to the test configuration – aside from the addition of the 5 LUNs and 5 clients needed to support the 5 additional tiles – was a notation indicating that the “data drive and backup drive for all mailserver VMs” we repartitioned using AutoPart v1.6.
The change in performance numbers effectively reduces the virtualization cost of the system by 15% to about $257/VM – closing-in on its 24-core sibling to within $10/VM and stretching-out its lead over “Dunnington” rivals to about $85/VM. While virtualization is not the primary application for 8P systems, this demonstrates that 48-core virtualization is definitely viable.
SOLORI’s Take: HP’s performance team has done a great job tuning its flagship AMD platform, demonstrating that platform performance is not just related to hertz or core-count but requires balanced tuning and performance all around. This improvement in system tuning demonstrates an 18% increase in incremental scalability – approaching within 3% of the 12-core to 24-core scaling factor, making it actually a viable consideration in the virtualization use case.
In recent discussions with AMD about the SR5690 chipset applications for Socket-F, AMD re-iterated that the mainstream focus for SR5690 has been Magny-Cours and the Q1/2010 launch. Given the close relationship between Istanbul and Magny-Cours – detailed nicely by Charlie Demerjian at Semi-Accurate – the bar is clearly fixed for 2P and 4P virtualization systems designed around these chips. Extrapolating from the similarities and improvements to I/O and memory bandwidth, we expect to see 2P VMmarks besting 32@23 and 4P scores over 54@39 from HP, AMD and Magny-Cours.
SOLORI’s 2nd Take: Intel has been plugging away with its Nehalem-EX for 8-way systems and – delivering 128-threads – promises to deliver some insane VMmarks. Assuming Intel’s EX scales as efficiently as AMD’s new Opterons have, extrapolations indicate performance for the 4P, 64-thread Nehalem-EX shoud fall between 41@29 and 44@31 given the current crop of speed and performance bins. Using the same methods, our calculus predicts an 8P, 128-thread EX system should deliver scores between 64@45 and 74@52.
With EX expected to clock at 2.66GHz with 140W TDP and AMD’s MCM-based Magny-Cours doing well to hit 130W ACP in the same speed bins, CIO’s balancing power and performance considerations will need to break-out the spreadsheets to determine the winners here. With both systems running 4-channel DDR3, there will be no power or price advantage given on either side to memory differences: relative price-performance and power consumption of the CPU’s will be major factors. Assuming our extrapolations are correct, we’re looking at a slight edge to AMD in performance-per-watt in the 2P segment, and a significant advantage in the 4P segment.
SOLORI’s Take: While the September timing of the release might imply a G6 with AMD’s SR5690 and IOMMU, we’re doubtful that the timing is anything but a coincidence: even though such a pairing would enable PCIe 2.0 and highly effective 10Gbps solutions. The modular design of the DL785 series – with its ability to scale from 4P to 8P in the same system – mitigates the economic realities of the dwindling 8P segment, and HP has delivered the pinnacle of performance for this technology.
We are also impressed with HP’s performance team and their ability to scale Shanghai to Istanbul with relative efficiency. Moving from DL785 G5 quad-core to DL785 G6 six-core was an almost perfect linear increase in capacity (95% of theoretical increase from 32-core to 48-core) while performance-per-tile increased by 6%. This further demonstrates the “home run” AMD has hit with Istanbul and underscores the excellent value proposition of Socket-F systems over the last several years.
Unfortunately, while they demonstrate a 91% scaling efficiency from 12-core to 24-core, HP and Istanbul have only achieved a 75% incremental scaling efficiency from 24-cores to 48-cores. When looking at tile-per-core scaling using the 8-core, 2P system as a baseline (1:1 tile-to-core ratio), 2P, 4P and 8P Istanbul deliver 91%, 83% and 62.5% efficiencies overall, respectively. However, compared to the %58 and 50% tile-to-core efficiencies of Dunnington 4P and 8P, respectively, Istanbul clearly dominates the 4P and 8P performance and price-performance landscape in 2009.
In today’s age of virtualization-driven scale-out, SOLORI’s calculus indicates that multi-socket solutions that deliver a tile-to-core ratio of less than 75% will not succeed (economically) in the virtualization use case in 2010, regardless of socket count. That said – even at a 2:3 tile-to-core ratio – the 8P, 48-core Istanbul will likely reign supreme as the VMmark heavy-weight champion of 2009.
SOLORI’s 2nd Take: HP and AMD’s achievements with this Istanbul system should be recognized before we usher-in the next wave of technology like Magny-Cours and Socket G34. While the DL785 G6 is not a game changer, its footnote in computing history may well be as a preview of what we can expect to see out of Magny-Cours in 2H/2010. If 12-core, 4P system price shrinks with the socket count we could be looking at a $150/VM price-point for a 4P system: now that would be a serious game changer.
NEC’s venerable Express5800/A1160 tops the 48-core VMmark category today with a score of 34.05@24 tiles to wrest the title away from IBM who established the category back in June, 2009. NEC’s new “Dunnington” X7460 Xeon-based score represents a performance per tile ratio of 1.41 and a tile to core efficiency of 50% using 128GB of ECC DDR2 RAM.
Compared to the leading 24-core “Dunnington” results – held by IBM’s x3850 M2 at 20.41@14 tiles – the NEC benchmark sets a scalability factor of 85.7% when moving from 4-socket to 8-socket systems. Both servers from NEC and IBM are scalable systems allowing for multiple chassis to be interconnected to achieve greater CPU-per-system numbers – each scaling in 4-CPU increments – ostensibly for OLTP advantages. The NEC starts at around $70K for 128GB and 48-cores resulting in a $486/VM cost to VMmark.
If AMD’s Istanbul scales to 8-socket at least as efficiently as Dunnington, we should be seeing some 48-core results in the 43.8@30 tile range in the next month or so from HP’s 785 G6 with 8-AMD 8439 SE processors. You might ask: what virtualization applications scale to 48-cores when $/VM is doubled at the same time? We don’t have that answer, and judging by Intel and AMD’s scale-by-hub designs coming in 2010, that market will need to be created at the OEM level.
Based on the performance we’re seeing in 8-socket systems relative to 4-socket and the upcoming “massively mult-core” processors in 2010, the law of diminishing returns seems to favor the 4-socket system as the limit for anything but massive OLTP workloads. Even then, we expect to see 48-core in a “4-way” box more efficient than the same number of cores in an 8-way box. The choice in virtualization will continue to be workload biased, with 2P systems offering the best “small footprint” $/VM solution and 4P systems offering the best “large footprint” $/VM solution.
SOLORI's Take and Quick Take posts express my personal opinion unless explicitly attributed to other sources. Where possible, supporting facts are presented to properly frame and ground these opinions, however they are presented "AS-IS" without regard to warranty or promise: expressed or implied.
Comments are open to all registered users and may be edited for decorum. Spam is deleted with prejudice.