Posts Tagged ‘magny-cours’


VMware Management Assistant Panics on Magny Cours

August 11, 2010

VMware’s current version of its vSphere Management Assistant – also known as vMA (pronounced “vee mah”) – will crash when run on an ESX host using AMD Magny Cours processors. This behavior was discovered recently when installing the vMA on an AMD Opteron 6100 system (aka. Magny Cours) causing a “kernel panic” on boot after deploying the OVF template. Something of note is the crash also results in 100% vCPU utilization until the VM is either powered-off or reset:

vMA Kernel Panic on Import

vMA Kernel Panic on Import

As it turns out, no manner of tweaks to the virtual machine’s virtualization settings nor OS boot/grub settings (i.e. noapic, etc.) seem to cure the ills for vMA. However, we did discover that the OVF deployed appliance was configured as a VMware Virtual Machine Hardware Version 4 machine:

vMA 4.1 defaults to Hardware Version 4

vMA 4.1 defaults to Virtual Machine Hardware Version 4

Since our lab vMA deployments have all been upgraded to Virtual Machine Harware Version 7 for some time (and for functional benefits as well), we tried to update the vMA to Version 7 and try again:

Upgrade vMA Virtual Machine Version...

Upgrade vMA Virtual Machine Version...

This time, with Virtual Hardware Version 7 (and no other changes to the VM), the vMA boots as it should:

vMA Booting after Upgrade to Virtual Hardware Version 7

vMA Booting after Upgrade to Virtual Hardware Version 7

Since the Magny Cours CPU is essentially a pair of tweaked 6-core Opteron CPUs in a single package, we took the vMA into the lab and deployed it to an ESX server running on AMD 2435 6-core CPUs: the vMA booted as expected, even with Virtual Hardware Version 4. A quick check of the community and support boards show a few issues with older RedHat/Centos kernels (like vMA’s) but no reports of kernel panic with Magny Cours. Perhaps there are just not that many AMD Opteron 6100 deployments out there with vMA yet…


vSphere 4 Update 2 Released

June 11, 2010

VMware vSphere 4, Update 2 has been released with the following changes to ESXi:

The following information provides highlights of some of the enhancements available in this release of VMware ESXi:

  • Enablement of Fault Tolerance Functionality for Intel Xeon 56xx Series processors— vSphere 4.0 Update 1 supports the Intel Xeon 56xx Series processors without Fault Tolerance. vSphere 4.0 Update 2 enables Fault Tolerance functionality for the Intel Xeon 56xx Series processors.
  • Enablement of Fault Tolerance Functionality for Intel i3/i5 Clarkdale Series and Intel Xeon 34xx Clarkdale Series processors— vSphere 4.0 Update 1 supports the Intel i3/i5 Clarkdale Series and Intel Xeon 34xx Clarkdale Series processors without Fault Tolerance. vSphere 4.0 Update 2 enables Fault Tolerance functionality for the Intel i3/i5 Clarkdale Series and Intel Xeon 34xx Clarkdale Series processors.
  • Enablement of IOMMU Functionality for AMD Opteron 61xx and 41xx Series processors— vSphere 4.0 Update 1 supports the AMD Opteron 61xx and 41xx Series processors without input/output memory management unit (IOMMU). vSphere 4.0 Update 2 enables IOMMU functionality for the AMD Opteron 61xx and 41xx Series processors.
  • Enhancement of the resxtop utility— vSphere 4.0 U2 includes an enhancement of the performance monitoring utility, resxtop. The resxtop utility now provides visibility into the performance of NFS datastores in that it displays the following statistics for NFS datastores: Reads/swrites/sMBreads/sMBwrtn/scmds/sGAVG/s (guest latency).
  • Additional Guest Operating System Support— ESX/ESXi 4.0 Update 2 adds support for Ubuntu 10.04. For a complete list of supported guest operating systems with this release, see the VMware Compatibility Guide.

Resolved Issues In addition, this release delivers a number of bug fixes that have been documented in theResolved Issues section.

ESXi 4 Update 2 Release Notes

Noted in the release is the official support for AMD’s IOMMU in Opteron 6100 and 4100 processors – available in 1P, 2P and 4P configurations. This finally closes the (functional) gap between AMD Opteron and Intel’s Nehalem line-up. Likewise, FT support for many new Intel processors has been added. Also, the addition of NFS performance counters in ESXTop will make storage troubleshooting a bit easier. Grab you applicable update at VMware’s download site now (SnS required.)


Quick Take: IBM Tops VMmark, Crushes Record with 4P Nehalem-EX

April 7, 2010

It was merely a matter of time before one of the new core-rich titans – the Intel’s 8-core “Beckton” Nehalem-EX (Xeon 7500) or AMD’s 12-core “Magny-Cours” (Opteron 6100) – was to make a name for itself on VMware’s VMmark benchmark. Today, Intel draws first blood in the form of an 4-processor, 32-core, 64-thread, monster from IBM: the x3850 X5 running four Xeon X7560 (2.266GHz – 2.67GHz w/turbo, 130W TDP, each) and 384GB of DDR3-1066 low-power registered DIMMs. Weighing-in at 70.78@48 tiles, the 4P IBM System x3850 handily beats the next highest system – the 48-core DL785 G5 which set the record of 53.73@35 tiles back in August, 2009 – and bests it by over 30%.

At $3,800+ per socket for the tested Beckton chip, this is no real 2P alternative. In fact, a pair of Cisco UCS B250 M2 blades will get 52 tiles running for much less money. Looking at processor and memory configurations alone, this is a $67K+ enterprise server, resulting in a moderately-high $232/VM price point for the IBM x3850 X5.

SOLORI’s Take: The most interesting aspect of the EX benchmark is its clock-adjusted scaling factor: between 70% and 91% versus a 2P/8-core Nehalem-EP reference (Cisco UCS, B200 M1, 25.06@17 tiles). The unpredictable nature of Intel’s “turbo” feature – varying with thermal loads and per-core conditions – makes an exact clock-for-clock comparison difficult. However, if the scaling factor is 90%, the EX blows away our previous expectations about the platform’s scalability. Where did we go wrong when we predicted a conservative 44@39 tiles? We’re looking at three things: (1) a bad assumption about the effectiveness of “turbo” in the EP VMmark case (setting Ref_EP_Clock to 3.33 GHz), and (2) underestimating EX’s scaling efficiency (assumed 70%), (3) assuming a 2.26GHz clock for EX.

Chosing our minimum QPI/HT3 scalability factor of 75%, the predicted performance was derived this way from HP Proliant BL490 G6 as a baseline:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 75% ) * EX_Clock( 2.26 ) / EP_Clock( 2.93 ) = 39 tiles

Est. Score = Est_Tiles( 40 ) * EP_Score_per_Tile( 1.43 ) * Est_EX_Clock( 2.26 ) / Ref_EP_Clock( 2.93 ) = 44.12

Est. Nehalem-EX VMmark -> 44.12@39 tiles

Correcting for the as-tested clock/turbo numbers, and using AMD’s 2P-to-4P VMmark scaling efficiency of 83%, and shifting to the new UCS baseline (with newer ESX version) the Nehalem-EX prediction factors to:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 83% ) * EX_Clock( 2.67 ) / EP_Clock( 2.93 ) = 51 tiles

Est. Score = Est_Tiles( 51 ) * EP_Score_per_Tile( 1.47 ) * Est_EX_Clock( 2.67 ) / Ref_EP_Clock( 2.93 ) = 68.32

Est. Nehalem-EX VMmark -> 68.3@51 tiles

Clearly, this approach either overestimates the scaling efficiency or underestimates the “turbo” mode. IBM claims that a 2.93 GHz “turbo” setting is viable where Intel suggests 2.67 GHz is the maximum, so there is a source of potential bias. Looking at the tiles-per-core ratio of the VMmark result, the Nehalem-EX drops from 2.13 tiles per core on EP/2P platforms to 1.5 tiles per core on EX/4P platforms – about a 30% drop in per-core loading efficiency. That indicator matches well with our initial 75% scaling efficiency moving from 2P to 4P – something that AMD demonstrated with Istanbul last August. Given the high TDP of EX and IBM’s 2.93 GHz “turbo” specification, it’s possible that “turbo” is adding clock cycles (and power consumption) and compensating for a “lower” scaling efficiency than we’ve assumed. Looking at the same estimation with 2.93GHz “clock” and 71% efficiency (1.5/2.13), the numbers fall in line with VMmark:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 71% ) * EX_Clock( 2.93 ) / EP_Clock( 2.93 ) = 48 tiles

Est. Score = Est_Tiles( 48 ) * EP_Score_per_Tile( 1.47 ) * Est_EX_Clock( 2.93 ) / Ref_EP_Clock( 2.93 ) = 70.56

Est. Nehalem-EX VMmark -> 70.56@48 tiles

This give us a good basis for evaluating 2P vs. 4P Nehalem systems: scaling factor of 71% and capable of pushing clock towards the 3GHz mark within its thermal envelope. Both of these conclusions fit typical 2P-to-4P norms and Intel’s process history.

SOLORI’s 2nd Take: So where does that leave AMD’s newest 12-core chip? To date, no VMmark exists for AMD’s Magny-Cours, and AMD chips tend not to do as well in VMmark as their Intel peers do to the benchmarks SMT-friendly loads. However, we can’t resist using the same analysis against AMD/MC’s 2.4GHz Opteron 6174SE (theoretical) using the 2P HP DL385 G6 as a baseline for core loading and the HP DL785 G6 for tile performance (best of the best cases):

Est. Tiles = HP_Tiles_per_core( 0.92 ) * 48 cores * Scaling_Efficiency( 83% ) * MC_Clock( 2.3 ) / HP_Clock( 2.6 ) = 33 tiles

Est. Score = Est_Tiles( 34 ) * HP_Score_per_Tile( 1.54 ) * Est_MC_Clock( 2.3 ) / Ref_HP_Clock( 2.8 ) = 41.8

Est. 4P Magny-Cours VMmark -> 41.8@33 tiles

That’s nowhere near good enough to top the current 8P, 48-core Istanbul VMmark at 53.73@35 tiles, so we’ll likely have to wait for faster 6100 parts to see any new AMD records. However, assuming AMD’s proposition is still “value 4P” so about 200 VM’s at under $18K/server gets you around $90/VM or less.


Quick-Take: AMD Dodeca-core Opteron, Real Soon Now

March 3, 2010

In a recent blog, John Fruehe recounted a few highlights from the recent server analyst event at AMD/Austin concerning the upcoming release of AMD’s new 12-core (dodeca) Opteron 6100 series processor – previously knows as Magny-Cours. While not much “new” was officially said outside of NDA privilege, here’s what we’re reading from his post:

1. Unlike previous launches, AMD is planning to have “boots on the ground” this time with vendors and supply alignments in place to be able to ship product against anticipated demand. While it is now well known that Magny-Cours has been shipping to certain OEM and institutional customers for some time, our guess is that 2000/8000 series 6-core HE series have been hard to come by for a reason – and that reason has 12-cores not 6;

Obviously the big topic was the new AMD Opteron™ 6000 Series platforms that will be launching very soon.  We had plenty of party favors – everyone walked home with a new 12-core AMD Opteron 6100 Series processor, code name “Magny-Cours”.

– Fruehe on AMD’s pending launch

2. Timing is right! With Intel’s Nehalem-EX 8-core and Core i7/Nehalem-EP 6-core being demoed about, there is more pressure than ever for AMD to step-up with a competitive player. Likewise, DDR3 is neck-and-neck with DDR2 in affordability and way ahead with low-power variants that more than compensate for power-hungry CPU profiles. AMD needs to deliver mainstream performance in 24-cores and 96GB DRAM within the power envelope of 12-cores and 64GB to be a player. With 1.35V DDR3 parts paired to better power efficiency in the 6100, this could be a possibility;

We demonstrated a benchmark running on two servers, one based on the Six-Core AMD Opteron processor codenamed “Istanbul,” and one 12-core “Magny-Cours”-based platform.  You would have seen that the power consumption for the two is about the same at each utilization level.  However, there is one area where there was a big difference – at idle.  The “Magny-Cours”-based platform was actually lower!

– AMD’s Fruehe on Opteron 6100’s power consumption

3. Performance in scaled virtualization matters – raw single-threaded performance is secondary. In virtual architectures, clusters of systems must perform as one in an orchestrated ballet of performance and efficiency seeking. For some clusters, dynamic load migration to favour power consumption is a priority – relying on solid power efficiency under high load conditions. For other clusters, workload is spread to maximize performance available to key workloads – relying on solid power efficiency under generally light loads. For many environments, multi-generational hardware will be commonplace and AMD is counting on its wider range of migration compatibility to hold-on to customers that have not yet jumped ship for Intel’s Nehalem-EP/EX.

“We demonstrated Microsoft Hyper-V running on two different servers, one based on a Quad-Core AMD Opteron processor codenamed “Barcelona” (circa 2007) and a brand new “Magny-Cours”-based system. …companies might have problems moving a 2010 VM to a 2007 server without limiting the VM features. (For example, in order to move a virtual machine from an Intel  “Nehalem”-based system to a “Harpertown” [or earlier] platform, the customer must not enable nested paging in the “Nehalem” virtual machine, which can reduce the overall performance of the VM.)”

– AMD’s Fruehe, extolling the virtues of Opteron generational compatibility

SOLORI’s Take: It would appear that Magny-Cours has more under the MCM hood than a pair of Istanbul processors (as previously charged.) To manage better idle performance and constant power performance in spite of a two-to-one core ratio and similar 45nm process, AMD’s process and feature set must include much better power management as well, however, core speed is not one of them. With the standard “Maranello” 6100 series coming in at 1.9, 2.1 and 2.2 GHz with an HE variant at 1.7GHz and SE version running at 2.3GHz, finding parity in an existing cluster of 2.4, 2.6 and 2.8 GHz six-core servers may be difficult. Still, Maranello/G34 CPUs will be at 85, 115 and 140W TDP.

That said, Fruehe has a point on virtualization platform deployment and processor speed: it is not necessary to trim-out an entire farm with top-bin parts – only a small portion of the cluster needs to operate with top-band performance marks. The rest of the market is looking for predictable performance, scalability and power efficiency per thread. While SMT makes a good run at efficiency per thread, it does so at the expense of predictable performance. Here’s hoping that AMD’s C1E (or whatever their power-sipping special sauce will be called) does nothing to interfere with predictable performance…

As we’ve said before, memory capacity and bandwidth (as a function of system power and core/thread capacity) are key factors in a CPU’s viability in a virtualization stack. With 12 DIMM slots per CPU (3-DPC, 4-channel), AMD inherits an enviable position over Intel’s current line-up of 2P solutions by being able to offer 50% more memory per cluster node without resorting to 8GB DIMMs. That said, it’s up to OEM’s to deliver rack server designs that feature 12 DIMM per CPU and not hold-back with only 8 DIMM variants. In the blade and 1/2-size market, cramming 8 DIMM per board (effectively 1-DPC for 2P Magny-Cours) can be a challenge let alone 24 DIMMs! Perhaps we’ll see single-socket blades with 12 DIMMs (12-cores, 48/96GB DDR3) or 2P blades with only one 12 DIMM memory bank (one-hop, NUMA) in the short term.

SOLORI’s 2nd Take: It makes sense that AMD would showcase their leading OEM partners because their success will be determined on what those OEM’s bring to market. With VDI finally poised to make a big market impact, we’d expect to see the first systems delivered with 2-DPC configurations (8 DIMM per CPU, economically 2.5GB/core) which could meet both VDI and HPC segments equally. However, with Window7 gaining momentum, what’s good for HPC might not cut it for long in the VDI segment where expectations of 4-6 VM’s per core at 1-2GB/VM are mounting.

Besides the launch date, what wasn’t said was who these OEM’s are and how many systems they’ll be delivering at launch. Whoever they are, they need to be (1) financially stronger than AMD, (2) in an aggressive marketing position with respect to today’s key growth market (server and desktop virtualization), and (3) willing to put AMD-based products “above the fold” on their marketing and e-commerce initiatives. AMD needs to “represent” in a big way before a tide of new technologies makes them yesterday’s news. We have high hopes that AMD’s recent “perfect” execution streak will continue.


Quick Take: Year-end DRAM Price Follow-up, Thoughts on 2010

December 30, 2009

Looking at memory prices one last time before the year is out and prices of our “benchmark” Kingston DDR3 server DIMMs are on the decline. While the quad rank 8G DDR3/1066 DIMMs are below the $565 target price (at $514) we predicted back in August, the dual rank equivalent (on our benchmark list) are still hovering around $670 each. Likewise, while retail price on the 8G DDR2/667 parts continue to rise, inventory and promotional pricing has managed to keep them flat at $433 each, giving large foot print DDR2 systems a $2,000 price advantage (based on 64GB systems).

Benchmark Server (Spot) Memory Pricing – Dual Rank DDR2 Only
DDR2 Reg. ECC Series (1.8V) Price Jun ’09 Price Sep ’09 Price
Dec ’09

4GB 800MHz DDR2 ECC Reg with Parity CL6 DIMM Dual Rank, x4
(5.400W operating)
$100.00 $117.00
up 17%
up 23%

(Promo price, retail $162)

4GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (5.940W operating)
$80.00 $103.00
up 29%
down 5%

(retail $160)

8GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (7.236W operating)
$396.00 $433.00 $433.00

(Promo price, retail $515)
Benchmark Server (Spot) Memory Pricing – Dual Rank DDR3 Only
DDR3 Reg. ECC Series (1.5V) Price Jun ’09 Price Sep ’09 Price
Dec ’09

4GB 1333MHz DDR3 ECC Reg w/Parity CL9 DIMM Dual Rank, x4 w/Therm Sen (3.960W operating)
$138.00 $151.00
up 10%


down 10%

4GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (5.09W 5.085W operating)
$132.00 $151.00
up 15%
down 9%(retail $162)

8GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (6.36W 4.110W operating)
$1035.00 $917.00
down 11.5%
down 28%

(avail. 1/10)

As the year ends, OEMs are expected to “pull up inventory,” according to DRAMeXchange, in advance of a predicted market short fall somewhere in Q2/2010. Demand for greater memory capacities are being driven by Windows 7 and 64-bit processors with 4GB as the well established minimum system foot print ending 2009. With Server 2008 systems demanding 6GB+ and increased shift towards large memory foot print virtualization servers and blades, the market price for DDR3 – just turning the corner in Q1/2010 versus DDR2 – will likely flatten based on growing demand.

SOLORI’s Take: With Samsung and Hynix doubling CAPEX spending in 2010, we’d be surprised to see anything more than a 30% drop in retail 4GB and 8GB server memory by Q3/2010 given the anticipated demand. That puts 8G DDR3/10666 at $470/stick versus $330 for 2x 4GB and on track with August 2009 estimates. The increase in compute, I/O and memory densities in 2010 will be market changing and memory demand will play a small (but significant) role in that development.

In the battle to “feed” the virtualization servers of 2H/2010, the 4-channel “behemoth” Magny-Cours system could have a serious memory/price advantage with 8 (2-DPC) or 12 (3-DPC) configurations of 64GB (2.6GB/thread) and 96GB (3.9GB/thread) DDR3/1066 using only 4GB sticks (assumes 2P configuration). Similar GB/thread loads on Nehalem-EP6 “Gulftown” (6-core/12-thread) could be had with 72GB DDR3/800 (18x 4GB, 3-DPC) or 96GB DDR3/1066 (12x 8GB, 2-DPC), providing the solution architect with a choice between either a performance (memory bandwidth) or price (about $2,900 more) crunch. This means Magny-Cours could show a $2-3K price advantage (per system) versus Nehalem-EP6 in $/VM optimized VDI implementations.

Where the rubber starts to meet the road, from a virtualization context, is with (unannounced) Nehalem-EP8 (8-core/16-thread) which would need 96GB (12x 8GB, 2-DPC) to maintain 2.6GB/thread capacity with Magny-Cours. This creates a memory-based price differential – in Magny-Cours’ favor – of about $3K per system/blade in the 2P space. At the high-end (3.9GB/thread), the EP8 system would need a full 144GB (running DDR3/800 timing) to maintain GB/thread parity with 2P Magny-Cours – this creates a $5,700 system price differential and possibly a good reason why we’ll not actually see an 8-core/16-thread variant of Nehalem-EP in 2010.

Assuming that EP8 has 30% greater thread capacity than Magny-Cours (32-threads versus 24-threads, 2P system), a $5,700 difference in system price would require a 2P Magny-Cours system to cost about $19,000 just to make it an even value proposition. We’d be shocked to see a MC processor priced above $2,600/socket, making the target system price in the $8-9K range (24-core, 2P, 96GB DDR3/1066). That said, with VDI growth on the move, a 4GB/thread baseline is not unrealistic (4 VM/thread, 1GB per virtual desktop) given current best practices. If our numbers are conservative, that’s a $100 equipment cost per virtual desktop – about 20% less than today’s 2P equivalents in the VDI space. In retrospect, this realization makes VMware’s decision to license VDI per-concurrent-user and NOT per socket a very forward-thinking one!

Of course, we’re talking about rack servers and double-size and non-standard blades here: after all, where can we put 24 DIMM slots (2P, 3-DPC, 4-channel memory) on a SFF blade? Vendors will have a hard enough time with 8-DIMM per processor (2P, 2-DPC, 4-channel memory) configurations today. Plus, all that dense compute and I/O will need to get out of the box somehow (10GE, IB, etc.) It’s easy to see that HPC and virtualization platforms demands are converging, and we think that’s good for both markets.

SOLORI’s 2nd Take: Why does 8GB of DRAM require less than 4GB at the same speed and voltage??? The 4GB stick is based on 36x 256M x 4-bit DDR3-1066 FBGA’s (60nm) and the 8GB stick is based on 36x 512M x 4-bit DDR3-1066 FBGA’s (likely 50nm). According to SAMSUNG, the smaller feature size offers nearly 40% improvement in power consumption (per FBGA). Since the sticks use the same number of FBGA components (1Gb vs 2Gb), the 20% power savings seems reasonable.

The prospect of lower power at higher memory densities will drive additional market share to modules based on 2Gb DRAM modules. The gulf between DDR2 will continue to expand as tooling shifts to majority-DDR3 production and the technology. While minority leader Hynix announced a 50nm 2Gb DDR2 part earlier this year (2009), the chip giant Samsung continues to use 60-nm for its 2Gb DDR2. Recently, Hynix announced a successful validation of its 40-nm class 2Gb DDR3 module operating at 1333MHz and saving up to 40% power from the 50nm design. Similarly, Samsung’s leading the DRAM arms race with 30nm, 4Gb DDR3 production which will show-up in 1.35V, 16GB UDIMM and RDIMM in 2010 offering additional power saving benefits over 40-50nm designs. Meanwhile, Samsung has all but abandoned advances on DDR2 feature sizes.

The writing is on the wall for DDR2 systems: unit costs are rising, demand is shrinking, research is stagnant and a new wave of DDR3-based hardware is just over the horizon (1H/2010). While these things will show the door to DDR2-based systems (which enjoyed a brief resurgence in 2009 due to DDR3 supply problems and marginal power differences) as demand and DDR3 advantages heat-up in later 2010, it’s kudos to AMD for calling the adoption curve, spot on!


NEC Offers “Dunnington” Liposuction, Tops 64-Core VMmark

November 19, 2009

NEC’s venerable Express5800/A1160 is back at the top VMmark chart, this time establishing the brand-new 64-core category with a score of 48.23@32 tiles – surpassing its 48-core 3rd place posting by over 30%. NEC’s new 16-socket, 64-core, 256GB “Dunnington” X7460 Xeon-based score represents a big jump in performance over its predecessor with a per tile ratio of 1.507 – up 6% from the 48-core ratio of 1.419.

To put this into perspective, the highest VMmark achieved, to date, is the score of 53.73@35 tiles (tile ratio 1.535) from the 48-core HP DL785 G6 in August, 2009. If you are familiar with the “Dunnington” X7460, you know that it’s a 6-core, 130W giant with 16MB L2 cache and a 1000’s price just south of $3,000 per socket. So that raises the question: how does 6-cores X 16-sockets = 64? Well, it’s not pro-rationing from the Obama administration’s “IT fairness” czar. NEC chose to disable the 4th and 6th core of each socket to reduce the working cores from 96 to 64.

At $500/core, NEC’s gambit may represent an expensive form of “core liposuction” but it was a necessary one to meet VMware’s “logical processor per host” limitation of 64. That’s right, currently VMware’s vSphere places a limit on logical processors based on the following formula:

CPU_Sockets X Cores_Per_Socket X Threads_Per_Core =< 64

According to VMware, the other 32 cores would have been “ignored” by vSphere had they been enabled. Since “ignored” is a nebulous term (aka “undefined”), NEC did the “scientific” thing by disabling 32 cores and calling the system a 64-core server. The win here: a net 6% improvement in performance per tile over the 6-core configuration – ostensibly from the reduced core loading on the 16MB of L3 cache per socket and reduction in memory bus contention.

Moving forward to 2010, what does this mean for vSphere hardware configurations in the wake of 8-core, 16-thread Intel Nehalem-EX and 12-core, 12-thread AMD Magny-Cours processors? With a 4-socket Magny-Cours system limitation, we won’t be seeing any VMmarks from the boys in green beyond 48-cores. Likewise, the boys in blue will be trapped by a VMware limitation (albeit, a somewhat arbitrary and artificial one) into a 4-socket, 64-thread (HT) configuration or an 8-socket, 64-core (HT-disabled) configuration for their Nehalem-EX platform – even if using the six-core variant of EX. Looks like VMware will need to lift the 64-LCPU embargo by Q2/2010 just to keep up.


Fujistu RX300 S5 Rack Server Takes 8-core VMmark Lead

November 11, 2009

Fujitsu’s RX300 S5 rack server takes the top spot in VMware’s VMmark for 8-core systems today with a score of 25.16@17 tiles. Loaded with two of Intel’s top-bin 3.33GHz, 130W Nehalem-EP processors (W5590, turbo to 3.6GHz per core) and 96GB of DDR3-1333 R-ECC memory, the RX300 bested the former champ – the HP ProLiant BL490c G6 blade – by only about 2.5%.

With 17 tiles and 102 virtual machines on a single 2U box, the RX300 S5 demonstrates precisely how well vSphere scales on today’s x86 commodity platforms. It also appears to demonstrate both the value and the limits of Intel’s “turbo mode” in its top-bin Nehalem-EP processors – especially in the virtualization use case – we’ll get to that later. In any case, the resulting equation is:

More * (Threads + Memory + I/O) = Dense Virtualization

We could have added “higher execution rates” to that equation, however, virtualization is a scale-out applications where threads, memory pool and I/O capabilities dominate the capacity equation – not clock speed. Adding 50% more clock provides less virtualization gains than adding 50% more cores, and reducing memory and context latency likewise provides better gains that simply upping the clock speed. That’s why a dual quad-core Nehalem 2.6GHz processor will crush a quad dual-core 3.5GHz (ill-fated) Tulsa.

Speaking of Tulsa, unlike Tulsa’s rather anaemic first-generation hyper-threading, Intel’s improved SMT in Nehalem “virtually” adds more core “power” to the Xeon by contributing up to 100% more thread capacity. This is demonstrated by Nehalem-EP’s 2 tiles per core contributions to VMmark where AMD’s Istanbul quad-core provides only 1 tile per core. But exactly what is a VMmark tile and how does core versus thread play into the result?


The Illustrated VMmark "Tile" Load

As you can see, a “VMmark Tile” – or just “tile” for short – is composed of 6 virtual machines, half running Windows, half running SUSE Linux. Likewise, half of the tiles are running in 64-bit mode while the other half runs in 32-bit mode. As a whole, the tile is composed of 10 virtual CPUs, 5GB of RAM and 62GB of storage. Looking at how the parts contribute to the whole, the tile is relatively balanced:

Operating System / Mode 32-bit 64-bit Memory vCPU Disk
Windows Server 2003 R2 67% 33% 45% 50% 58%
SUSE Linux Enterprise Server 10 SP2 33% 67% 55% 50% 42%
32-bit 50% N/A 30% 40% 58%
64-bit N/A 50% 70% 60% 42%

If we stop here and accept that today’s best x86 processors from AMD and Intel are capable of providing 1 tile for each thread, we can look at the thread count and calculate the number of tiles and resulting memory requirement. While that sounds like a good “rule of thumb” approach, it ignores specific use case scenarios where synthetic threads (like HT and SMT) do not scale linearly like core threads do where SMT accounts for only about 12% gains over single-threaded core, clock-for-clock. For this reason, processors from AMD and Intel in 2010 will feature more cores – 12 for AMD and 8 for Intel in their Magny-Cours and Nehalem-EX (aka “Beckton”), respectively.

Learning from the Master

If we want to gather some information about a specific field, we consult an expert, right? Judging from the results, Fujitsu’s latest dual-processor entry has definitely earned the title ‘Master of VMmark” in 2P systems – at least for now. So instead of the usual VMmark $/VM analysis (which are well established for recent VMmark entries), let’s look at the solution profile and try to glean some nuggets to take back to our data centers.

It’s Not About Raw Speed

First, we’ve noted that the processor used is not Intel’s standard “rack server” fare, but the more workstation oriented W-series Nehalem at 130W TDP. With “turbo mode” active, this CPU is capable of driving the 3.33GHz core – on a per-core basis – up to 3.6GHz. Since we’re seeing only a 2.5% improvement in overall score versus the ProLiant blade at 2.93GHz, we can extrapolate that the 2.93GHz X5570 Xeon is spending a lot of time at 3.33GHz – its “turbo” speed – while the power-hungry W5590 spends little time at 3.6GHz. How can we say this? Looking at the tile ratio as a function of the clock speed.

We know that the X5570 can run up to 3.33GHz, per core, according to thermal conditions on the chip. With proper cooling, this could mean up to 100% of the time (sorry, Google). Assuming for a moment that this is the case in the HP test environment (and there is sufficient cause to think so) then the ratio of the tile score to tile count and CPU frequency is 0.433 (24.54/17/3.33). If we examine the same ratio for the W5590, assuming the clock speed of 3.33GHz, we get 0.444 – a difference of 2.5%, or the contribution of “turbo” in the W5590. Likewise, if you back-figure the “apparent speed” of the X5570 using the ratio of the clock-locked W5590, you arrive at 3.25GHz for the W5570 (an 11% gain over base clock). In either case, it is clear that “turbo” is a better value at the low-end of the Nehalem spectrum as there isn’t enough thermal headroom for it to work well for the W-series.

VMmark Equals Meager Network Use

Second, we’re not seeing “fancy” networking tricks out of VMmark submissions. In the past, we’ve commented on the use of “consumer grade” switches in VMmark tests. For this reason, we can consider VMmark’s I/O dependency as related almost exclusively to storage. With respect to networking, the Fujitsu team simply interfaced three 1Gbps network adapter ports to the internal switch of the blade enclosure used to run the client-side load suite and ran with the test. Here’s what that looks like:


Networking Simplified: The "leaders" simple virtual networking topology.

Note that the network interfaces used for the VMmark trial are not from the on-board i82575EB network controller but from the PCI-Express quad-port adapter using its older cousin – the i82571EB. What is key here is that VMmark is tied to network performance issues, and it is more likely that additional network ports might increase the likelihood of IRQ sharing and reduced performance more so than the “optimization” of network flows.

Keeping Storage “Simple”

Third, Fujitsu’s approach to storage is elegantly simple: several “inexpensive” arrays with intelligent LUN allocation. For this, Fujistu employed eight of its ETERNUS DX80 Disk Storage Systems with 7 additional storage shelves for a total of 172 working disks and 23 LUNs. For simplicity, Fujistu used a pair of 8Gbps FC ports to feed ESX and at least one port per DX80 – all connected through a Brocade 5100 fabric switch. The result looked something like this:


Fujitsu's VMmark Storage Topology: 8 Controllers, 7 Shelves, 172 Disks and 23 LUNs.

And yes, the ESX server is configured to boot from SAN, using no locally attached storage. Note that the virtual machine configuration files, VM swap and ESX boot/swap are contained in a separate DX80 system. This “non-default” approach allows the working VMDKs of the virtual machines to be isolated – from a storage perspective – from the swap file overhead, about 5GB per tile. Again, this is a benchmark scenario, not an enterprise deployment, so trade-offs are in favour of performance, not CAPEX or OPEX.

Even if the DX80 solution falls into the $1K/TB range, to say that this approach to storage is “economic” requires a deeper look. At 33 rack units for the solution – including the FC switch but not including the blade chassis – this configuration has a hefty datacenter footprint. In contrast to the old-school server/blade approach, 1 rack at 3 servers per U is a huge savings over the 2 racks of blades or 3 racks of 1U rack servers. Had each of those servers of blades had a mirror pair, we’d be talking about 200+ disks spinning in those racks versus the 172 disks in the ETERNUS arrays, so that still represents a savings of 15.7% in storage-related power/space.

When will storage catch up?

Compared to a 98% reduction in network ports, a 30-80% reduction server/storage CAPEX (based on $1K/TB SAN), a 50-75% reduction in overall datacenter footprint, why is a 15% reduction in datacenter storage footprint acceptable? After all, storage – in the Fujitsu VMmark case – now represents 94% of the datacenter footprint. Even if the load were less aggressively spread across five ESX servers (a conservative 20:1 loading), the amount of space taken by storage only falls to 75%.

How can storage catch up to virtualization densities. First, with 2.5″ SAS drives, a bank of 172 disks can be made to occupy only 16U with very strong performance. This drops storage to only 60% of the datacenter footprint – 10U for hypervisor, 16U for storage, 26U total for this example. Moving from 3.5″ drives to 2.5″ drives takes care of the physical scaling issue with acceptable returns, but results in only minimal gains in terms of power savings.

Saving power in storage platforms is not going to be achieved by simply shrinking disk drives – shrinking the NUMBER of disks required per “effective” LUN is what’s necessary to overcome the power demands of modern, high-performance storage. This is where non-traditional technology like FLASH/SSD is being applied to improve performance while utilizing fewer disks and proportionately less power. For example, instead of dedicating disks on a per LUN basis, carving LUNs out of disk pools accelerated by FLASH (a hybrid storage pool) can result in a 30-40% reduction in disk count – when applied properly – and that means 30-40% reduction in datacenter space and power utilization.

Lessons Learned

Here are our “take aways” from the Fujitsu VMmark case:

1) Top-bin performance is at the losing end of diminishing returns. Unless your budget can accommodate this fact, purchasing decisions about virtualization compute platforms need to be aligned with $/VM within an acceptable performance envelope. When shopping CPU, make sure the top-bin’s “little brother” has the same architecture and feature set and go with the unit priced for the mainstream. (Don’t forget to factor memory density into the equation…) Regardless, try to stick within a $190-280/VM equipment budget for your hypervisor hardware and shoot for a 20-to-1 consolidation ratio (that’s at least $3,800-5,600 per server/blade).

2) While networking is not important to VMmark, this is likely not the case for most enterprise applications. Therefore, VMmark is not a good comparison case for your network-heavy applications. Also, adding more network ports increases capacity and redundancy but does so at the risk of IRQ-sharing (ESX, not ESXi) problems, not to mention the additional cost/number of network switching ports. This is where we think 10GE will significantly change the equation in 2010. Remember to add up the total number of in use ports – including out-of-band management – when factoring in switch density. For net new instalments, look for a switch that provides 10GE/SR or 10GE/CX4 options and go with !0GE/SR if power savings are driving your solution.

3) Storage should be simple, easy to manage, cheap (relatively speaking), dense and low-power. To meet these goals, look for storage technologies that utilize FLASH memory, tiered spindle types, smart block caching and other approaches to limit spindle count without sacrificing performance. Remember to factor in at least the cost of DAS when approximating your storage budget – about $150/VM in simple consolidation cases and $750/VM for more mission critical applications (that’s a range of $9,000-45,000 for a 3-server virtualization stack). The economies in managed storage come chiefly from the administration of the storage, but try to identify storage solutions that reduce datacenter footprint including both rack space and power consumption. Here’s where offerings from Sun and NexentaStor are showing real gains.

We’d like to see VMware update VMmark to include system power specifications so we can better gage – from the sidelines – what solution stack(s) perform according to our needs. VMmark served its purpose by giving the community a standard from which different platforms could be compared in terms of the resultant performance. With the world’s eyes on power consumption and the ecological impact of datacenter choices, adding a “power utilization component” to the “server-side” of the VMmark test would not be that significant of a “tweak.” Here’s how we think it can be done:

  1. Require power consumption of the server/VMmark related components be recorded, including:
    1. the ESX platform (rack server, blade & blade chassis, etc.)
    2. the storage platform providing ESX and test LUN(s) (all heads, shelves, switches, etc.)
    3. the switching fabric (i.e. Ethernet, 10GE, FC, etc.)
  2. Power delivered to the test harness platforms, client load machines, etc. can be ignored;
  3. Power measurements should be recorded at the following times:
    1. All equipment off (validation check);
    2. Start-up;
    3. Single tile load;
    4. 100% tile capacity;
    5. 75% tile capacity;
    6. 50% tile capacity;
  4. Power measurements should be recorded using a time-power data-logger with readings recorded as 5-minute averages;
  5. Notations should be made concerning “cache warm-up” intervals, if applicable, where “cache optimized” storage is used.

Why is this important? In the wake of the VCE announcement, solution stacks like VCE need to be measured against each other in an easy to “consume” way. Is VCE the best platform versus a component solution provided by your local VMware integrator? Given that the differentiated VCE components are chiefly UCS, Cisco switching and EMC storage, it will be helpful to have a testing platform that can better differentiate “packaged solutions” instead of uncorrelated vendor “propaganda.”

Let us know what your thoughts are on the subject, either on Twitter or on our blog…


Quick Take: Magny-Cours Spotted, Pushed to 3GHz for wPrime

September 13, 2009

Andreas Galistel at NordicHardware posted an article showing a system running a pair of engineering samples of the Magny-Cours processor running at 3.0GHz. Undoubtedly these images were culled from a report “leaked” on XtremeSystems forums showing a “DINAR2” motherboard with SR5690 chipset – in single and dual processor installation – running Magny-Cours at the more typical pre-release speed of 1.7GHz.

We know that Magny-Cours is essentially a MCM of Istanbul delivered in the rectangular socket G34 package. One thing illuminating about the two posts is the reported “reduction” in L3 cache from 12MB (6MB x 2 in MCM) to 10MB (2 x 5MB in MCM). Where did the additional cache go? That ‘s easy: since a 2P Magny-Cours installation is essentially a 4P Istanbul configuration, these processors have the new HT Assist feature enabled – giving 1MB of cache from each chip in the MCM to HT Assist.

“wPrime uses a recursive call of Newton’s method for estimating functions, with f(x)=x2-k, where k is the number we’re sqrting, until Sgn(f(x)/f'(x)) does not equal that of the previous iteration, starting with an estimation of k/2. It then uses an iterative calling of the estimation method a set amount of times to increase the accuracy of the results. It then confirms that n(k)2=k to ensure the calculation was correct. It repeats this for all numbers from 1 to the requested maximum.”

wPrime site

Another thing intriguing about the XtremeSystems post in particular is the reported wPrime 32M and 1024M completion times. Compared to the hyper-threading-enabled 2P Xeon W5590 (130W TDP) running wPrime 32M at 3.33GHz (3.6GHz turbo)  in 3.950 seconds, the 2P 3.0GHz Magny-Cours completed wPrime 32M in an unofficial 3.539 seconds – about 10% quicker while running a 10% slower clock. From the myopic lens of this result, it would appear AMD’s choice of “real cores” versus hyper-threading delivers its punch.

SOLORI’s Take: As a “reality check” we can compared the reigning quad-socked, quad-core Opteron 8393 SE result in wPrime 32M and wPrime 1024M at 3.90 and 89.52  seconds, respectively. Adjusted for clock and core count versus its Shanghai cousin, the Magny-Cours engineering samples – at 3.54 and 75.77 seconds, respectively – turned-in times about 10% slower than our calculus predicted. While still “record breaking” for 2P systems, we expected the Magny-Cours/Istanbul cores to out-perform Shanghai clock-per-clock – even at this stage of the game.

Due to the multi-threaded nature of the wPrime benchmark, it is likely that the HT Assist feature – enabled in a 2P Magny-Cours system by default – is the cause of the discrepancy. By reducing the available L3 cache by 1MB per die – 4MB of L3 cache total – HT Assist actually could be creating a slow-down. However, there are several things to remember here:

  • These are engineering samples qualified for 1.7GHz operation
  • Speed enhancements were performed with tools not yet adapted to Magny-Cours
  • The author indicated a lack of control over AMD’s Cool ‘n Quiet technology which could have made “as tested” core clocks somewhat lower than what CPUz reported (at least during the extended tests)
  • It is speculated that AMD will release Magny-Cours at 2.2GHz (top bin) upon release, making the 2.6+ GHz results non-typical
  • The BIOS and related dependencies are likely still being “baked”

Looking at the more “typical” engineering sample speed tests posted on the XtremeSystems’ forum tracks with the 3.0GHz overclock results at a more “typical” clock speed of 2.6GHz for 2P Magny-Cours: 3.947 seconds and 79.625 seconds for wPrime 32M and 1024M, respectively. Even at that speed, the 24-core system is on par with the 2P Nehalem system clocked nearly a GHz faster. Oddly, Intel reports the W5590  as not supporting “turbo” or hyper-threading although it is clear that Intel’s marketing is incorrect based on actual testing.

Assuming Magny-Cours improves slightly on its way to market, we already know how 24-core Istanbul stacks-up against 16-thread Nehalem in VMmark and what that means for Nehalem-EP. This partly explains the marketing shift as Intel tries to position Nehalep-EP as a destined for workstations instead of servers. Whether or not you consider this move a prelude to the ensuing Nehalem-EX v. Magny-Cours combat to come or an attempt to keep Intel’s server chip power average down by eliminating the 130W+ parts from the “server” list,  Intel and AMD will each attempt win the war before the first shot is fired. Either way, we see nothing that disrupts the price-performance and power-performance comparison models that dominate the server markets.

[Ed: The 10% difference is likely due to the fact that the author was unable to get “more than one core” clocked at 3.0GHz. Likewise, he was uncertain that all cores were reliably clocking at 2.6GHz for the longer wPrime tests. Again, this engineering sample was designed to run at 1.7GHz and was not likely “hand picked” to run at much higher clocks. He speculated that some form of dynamic core clocking linked to temperature was affecting clock stability – perhaps due to some AMD-P tweaks in Magny-Cours.]


Quick-Take: VMworld 2009 Wrap-Up

September 8, 2009

VMworld 2009 in San Franciso started off with a crash and a fist fight, but ended without further incident. If you’re looking for what happened, it would be hard to beat Duncan Epping’s link-summary of the San Francisco VMworld 2009 at Yellow-Bricks, so we won’t even try. Likewise, Chad Sakacc has some great EMC view point on his Virtualgeek blog, and – fresh from his new book releaseScott Lowe has some great detail about the VMworld keynotes, events and sessions he attended.

There is a great no-spin commentary on VMworld’s “softer underbelly” on Jon William Toigo’s Drunken Data blog – especially the post about Xsigo’s participation in VMworld 2009. Also, Brian Madden has a great wrap-up video of of interviews from the VMworld floor including VMware’s Client Virtualization Platform (CVP) and the software implementation of Teradici’s PC-over-IP.

AMD’s IOMMU was on display using a test mule with two 12-core 6100 processors and a SR5690 chipset. The targets were a FirePro graphics card and a Solarflare 10GE NIC. For IOMMU-based virtualization to have broad appeal, hardware device segmentation must be supported in a manner compatible with vMotion (live migration.) No segmentation was hinted at in AMD’s demo (for FirePro), but the fact that vSphere+IOMMU+Magny-Cours equated to enough stability to be openly demonstrating the technology says a lot about the maturity of AMD’s upcoming chips and chipsets. On the other hand, Solarflare’s demonstration previewed – in 10GE – what could be possible in a future version of IOV for GPU’s:

“The flexible vNIC demonstration will highlight the Solarstorm server adapter’s scalable, virtualized architecture, supporting 100s of virtual machines and 1000s of vNICs. The Solarstorm vNIC architecture provides flexible mapping of vNICs, so that each guest OS can have its own vNIC, as well as traffic management, enabling prioritization and isolation of IP flows between vNICs.”

– Solarflare Press Release

SOLORI’s Take: The controversy surrounding VMware’s “focus” on the VMware “sphere” of products was a non-starter. The name VMworld does not stand for “Virtualization World” – it stands for “VMware World” and denying competitor’s “marketing access” to that venue seems like a reasonable restriction. While it may seem like a strong-arm tactic to some, insisting that vendors/partners are there “for VMworld only” – and hence restricting cross-marketing efforts in and around the venue – makes it more difficult for direct competitors to play the “NASCAR-style marketing” (as Toigo calls it) game.

VMworld is a showcase for technologies driving the virtualization eco-system as seen from VMware’s perspective. While there are a growing number of competitors for virtualization mind-share, VMware’s pace and vision – to date – has been driven by careful observation of use-case more so than innovation for innovation’s sake. It is this attention to business need that has made VMware successful and what defines VMworld’s focus – and it is in that light that VMworld 2009 looks like a great success.


Quick Take: HP’s Sets Another 48-core VMmark Milestone

August 26, 2009

Not satisfied with a landmark VMmark score that crossed the 30 tile mark for the first time, HP’s performance team went back to the benches two weeks later and took another swing at the performance crown. Well, the effort paid off, and HP significantly out-paced their two-week-old record with a score of 53.73@35 tiles in the heavy weight, 48-core category.

Using the same 8-processor HP ProLiant DL785 G6 platform as in the previous run – complete with 2.8GHz AMD Opteron 8439 SE 6-core chips and 256GB DDR2/667 – the new score comes with significant performance bumps in the javaserver, mailserver and database results achieved by the same system configuration as the previous attempt – including the same ESX 4.0 version (164009). So what changed to add an additional 5 tiles to the team’s run? It would appear that someone was unsatisfied with the storage configuration on the mailserver run.

Given that the tile ratio of the previous run ran about 6% higher than its 24-core counterpart, there may have been a small indication that untapped capacity was available. According to the run notes, the only reported changes to the test configuration – aside from the addition of the 5 LUNs and 5 clients needed to support the 5 additional tiles – was a notation indicating that the “data drive and backup drive for all mailserver VMs” we repartitioned using AutoPart v1.6.

The change in performance numbers effectively reduces the virtualization cost of the system by 15% to about $257/VM – closing-in on its 24-core sibling to within $10/VM and stretching-out its lead over “Dunnington” rivals to about $85/VM. While virtualization is not the primary application for 8P systems, this demonstrates that 48-core virtualization is definitely viable.

SOLORI’s Take: HP’s performance team has done a great job tuning its flagship AMD platform, demonstrating that platform performance is not just related to hertz or core-count but requires balanced tuning and performance all around. This improvement in system tuning demonstrates an 18% increase in incremental scalability – approaching within 3% of the 12-core to 24-core scaling factor, making it actually a viable consideration in the virtualization use case.

In recent discussions with AMD about the SR5690 chipset applications for Socket-F, AMD re-iterated that the mainstream focus for SR5690 has been Magny-Cours and the Q1/2010 launch. Given the close relationship between Istanbul and Magny-Cours – detailed nicely by Charlie Demerjian at Semi-Accurate – the bar is clearly fixed for 2P and 4P virtualization systems designed around these chips. Extrapolating from the similarities and improvements to I/O and memory bandwidth, we expect to  see 2P VMmarks besting 32@23 and 4P scores over 54@39 from HP, AMD and Magny-Cours.

SOLORI’s 2nd Take: Intel has been plugging away with its Nehalem-EX for 8-way systems and – delivering 128-threads – promises to deliver some insane VMmarks. Assuming Intel’s EX scales as efficiently as AMD’s new Opterons have, extrapolations indicate performance for the 4P, 64-thread Nehalem-EX shoud fall between 41@29 and 44@31 given the current crop of speed and performance bins. Using the same methods, our calculus predicts an 8P, 128-thread EX system should deliver scores between 64@45 and 74@52.

With EX expected to clock at 2.66GHz with 140W TDP and AMD’s MCM-based Magny-Cours doing well to hit 130W ACP in the same speed bins, CIO’s balancing power and performance considerations will need to break-out the spreadsheets to determine the winners here. With both systems running 4-channel DDR3, there will be no power or price advantage given on either side to memory differences: relative price-performance and power consumption of the CPU’s will be major factors. Assuming our extrapolations are correct, we’re looking at a slight edge to AMD in performance-per-watt in the 2P segment, and a significant advantage in the 4P segment.