Posts Tagged ‘64-thread’


Quick Take: IBM Tops VMmark, Crushes Record with 4P Nehalem-EX

April 7, 2010

It was merely a matter of time before one of the new core-rich titans – Intel’s 8-core “Beckton” Nehalem-EX (Xeon 7500) or AMD’s 12-core “Magny-Cours” (Opteron 6100) – made a name for itself on VMware’s VMmark benchmark. Today, Intel draws first blood in the form of a 4-processor, 32-core, 64-thread monster from IBM: the x3850 X5 running four Xeon X7560 processors (2.266GHz – 2.67GHz w/turbo, 130W TDP each) and 384GB of DDR3-1066 low-power registered DIMMs. Weighing in at 70.78@48 tiles, the 4P IBM System x3850 X5 handily beats the next-highest system – the 48-core DL785 G6, which set the previous record of 53.73@35 tiles back in August 2009 – by over 30%.

At $3,800+ per socket for the tested Beckton chip, this is no real 2P alternative. In fact, a pair of Cisco UCS B250 M2 blades will get 52 tiles running for much less money. Looking at processor and memory configurations alone, this is a $67K+ enterprise server, resulting in a moderately high $232/VM price point for the IBM x3850 X5.
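For reference, a VMmark (1.x) tile is six virtual machines, so the rough per-VM math behind that figure is simply:

Est. $/VM = Server_Cost( $67,000 ) / ( Tiles( 48 ) * VMs_per_Tile( 6 ) ) ≈ $233/VM

which lines up with the $232 figure above, given the “$67K+” estimate.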

SOLORI’s Take: The most interesting aspect of the EX benchmark is its clock-adjusted scaling factor: between 70% and 91% versus a 2P/8-core Nehalem-EP reference (Cisco UCS, B200 M1, 25.06@17 tiles). The unpredictable nature of Intel’s “turbo” feature – varying with thermal loads and per-core conditions – makes an exact clock-for-clock comparison difficult. However, if the scaling factor is 90%, the EX blows away our previous expectations about the platform’s scalability. Where did we go wrong when we predicted a conservative 44@39 tiles? We’re looking at three things: (1) a bad assumption about the effectiveness of “turbo” in the EP VMmark case (setting Ref_EP_Clock to 3.33 GHz), (2) underestimating EX’s scaling efficiency (assumed 70%), and (3) assuming a 2.26GHz clock for EX.

Choosing our minimum QPI/HT3 scalability factor of 75%, the predicted performance was derived this way, using the HP ProLiant BL490 G6 as a baseline:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 75% ) * EX_Clock( 2.26 ) / EP_Clock( 2.93 ) = 39 tiles

Est. Score = Est_Tiles( 40 ) * EP_Score_per_Tile( 1.43 ) * Est_EX_Clock( 2.26 ) / Ref_EP_Clock( 2.93 ) = 44.12

Est. Nehalem-EX VMmark -> 44.12@39 tiles
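For readers who want to plug in their own numbers, a minimal Python sketch of the estimation above might look like this (the function and variable names are mine, not from any published tool – the math is exactly the two “Est.” lines):

# Minimal sketch of the tile/score estimation used in this post.
def est_tiles(ref_tiles_per_core, cores, scaling_eff, clock, ref_clock):
    # Scale the reference platform's tiles-per-core by core count,
    # scaling efficiency, and relative clock speed.
    return ref_tiles_per_core * cores * scaling_eff * clock / ref_clock

def est_score(tiles, ref_score_per_tile, clock, ref_clock):
    # Scale the reference platform's score-per-tile by tile count
    # and relative clock speed.
    return tiles * ref_score_per_tile * clock / ref_clock

# Conservative prediction: BL490 G6 baseline, 75% scaling, 2.26GHz EX clock
tiles = est_tiles(2.13, 32, 0.75, 2.26, 2.93)   # ~39.4, reported as 39 tiles
score = est_score(40, 1.43, 2.26, 2.93)         # ~44.1 (the post rounds the tile count up to 40 here)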

Correcting for the as-tested clock/turbo numbers, using AMD’s 2P-to-4P VMmark scaling efficiency of 83%, and shifting to the new UCS baseline (with the newer ESX version), the Nehalem-EX prediction works out to:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 83% ) * EX_Clock( 2.67 ) / EP_Clock( 2.93 ) = 51 tiles

Est. Score = Est_Tiles( 51 ) * EP_Score_per_Tile( 1.47 ) * Est_EX_Clock( 2.67 ) / Ref_EP_Clock( 2.93 ) = 68.32

Est. Nehalem-EX VMmark -> 68.3@51 tiles
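Running the corrected inputs through the same sketch:

# Corrected prediction: UCS baseline, 83% scaling, 2.67GHz turbo clock
tiles = est_tiles(2.13, 32, 0.83, 2.67, 2.93)   # ~51.5, reported as 51 tiles
score = est_score(51, 1.47, 2.67, 2.93)         # ~68.3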

Clearly, this approach either overestimates the scaling efficiency or underestimates the “turbo” mode. IBM claims that a 2.93 GHz “turbo” setting is viable where Intel suggests 2.67 GHz is the maximum, so there is a potential source of bias. Looking at the tiles-per-core ratio of the VMmark result, the Nehalem-EX drops from 2.13 tiles per core on EP/2P platforms to 1.5 tiles per core on EX/4P platforms – about a 30% drop in per-core loading efficiency. That indicator matches well with our initial 75% scaling-efficiency assumption moving from 2P to 4P – something AMD demonstrated with Istanbul last August. Given the high TDP of EX and IBM’s 2.93 GHz “turbo” specification, it’s possible that “turbo” is adding clock cycles (and power consumption) and compensating for a “lower” scaling efficiency than we’ve assumed. Re-running the estimate with a 2.93GHz “clock” and 71% efficiency (1.5/2.13), the numbers fall in line with VMmark:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 71% ) * EX_Clock( 2.93 ) / EP_Clock( 2.93 ) = 48 tiles

Est. Score = Est_Tiles( 48 ) * EP_Score_per_Tile( 1.47 ) * Est_EX_Clock( 2.93 ) / Ref_EP_Clock( 2.93 ) = 70.56

Est. Nehalem-EX VMmark -> 70.56@48 tiles
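And the back-fitted case, again with the helper sketched earlier:

# Back-fitted case: 71% scaling (1.5/2.13) and the 2.93GHz "turbo" clock
tiles = est_tiles(2.13, 32, 0.71, 2.93, 2.93)   # ~48.4, reported as 48 tiles
score = est_score(48, 1.47, 2.93, 2.93)         # 70.56 -- matching the published result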

This gives us a good basis for evaluating 2P versus 4P Nehalem systems: a scaling factor of 71%, with the clock capable of pushing toward the 3GHz mark within its thermal envelope. Both of these conclusions fit typical 2P-to-4P norms and Intel’s process history.

SOLORI’s 2nd Take: So where does that leave AMD’s newest 12-core chip? To date, no VMmark exists for AMD’s Magny-Cours, and AMD chips tend not to do as well in VMmark as their Intel peers due to the benchmark’s SMT-friendly loads. However, we can’t resist running the same analysis against AMD/MC’s 2.4GHz Opteron 6174SE (theoretical), using the 2P HP DL385 G6 as a baseline for core loading and the HP DL785 G6 for tile performance (best of the best cases):

Est. Tiles = HP_Tiles_per_core( 0.92 ) * 48 cores * Scaling_Efficiency( 83% ) * MC_Clock( 2.3 ) / HP_Clock( 2.6 ) = 33 tiles

Est. Score = Est_Tiles( 33 ) * HP_Score_per_Tile( 1.54 ) * Est_MC_Clock( 2.3 ) / Ref_HP_Clock( 2.8 ) = 41.8

Est. 4P Magny-Cours VMmark -> 41.8@33 tiles
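Feeding the Magny-Cours inputs through the same sketch:

# Theoretical 4P Magny-Cours: DL385 G6 core-loading baseline, DL785 G6 tile performance
tiles = est_tiles(0.92, 48, 0.83, 2.3, 2.6)     # ~32.4, reported as 33 tiles
score = est_score(33, 1.54, 2.3, 2.8)           # ~41.7, in line with the 41.8 above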

That’s nowhere near good enough to top the current 8P, 48-core Istanbul VMmark of 53.73@35 tiles, so we’ll likely have to wait for faster 6100 parts to see any new AMD records. However, assuming AMD’s proposition is still “value 4P,” about 200 VMs at under $18K per server gets you around $90/VM or less.
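Again assuming six VMs per VMmark tile, the “value 4P” math works out roughly as:

Est. VMs = Tiles( 33 ) * VMs_per_Tile( 6 ) ≈ 200 VMs

Est. $/VM = Server_Cost( $18,000 ) / Est_VMs( 200 ) = $90/VM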


VMware PartnerExchange 2010 – Day 1-2

February 9, 2010

View of the Mandalay Bay from VMware's Alumni Lounge

It’s my second day at the beautiful Mandalay Bay in Las Vegas, Nevada and VMware PartnerExchange 2010. Yesterday was filled with travel and a generous “Tailgate Party” with burgers, dogs, beverages and lots of VMware geeks! I managed to catch the last quarter of the game from the Mandalay Bay Poker Room where I added to my chip stack at the 1/2 No-Limit Texas Hold ‘Em tables. Then it was early to bed – about 9PM PST – where I studied for the upcoming VCP410 exam.

Today (Monday) was occupied with a partners-only VMware Certified Professional, Version 4, Preparation Course, which outlined the VCP4 Blueprint, question examples and test-taking strategies. The “best answer” multiple-choice format of the VCP410 exam promises to offer me some challenges as I apply black-and-white logic to a few shades-of-grey questions. The best strategy to overcome such an obstacle: read the question in its entirety, eliminate all wrong answers, then choose the answer(s) that best satisfy the entire question. A key example is this one from the on-line “mock-up” exam:

What is the maximum number of vNetwork switch ports per ESX host and vCenter Server instance?

a.  4,088 for vNetwork standard switches; 4,096 for vNetwork Distributed switches

b.  4,096 for both types of switches

c.  4,088 for vNetwork standard switches; 6,000 for vNetwork distributed switches

d.  512 for both types of virtual switches

Well, it might have been obvious that “c” is the “correct” answer, but “a” is right off of page 6 of the vSphere Configuration Maximums guide. Both are solidly “correct” answers; it’s just that “c” speaks to both the ESX question and the vCenter question, making it more correct. However, neither is completely correct, since vDS ports are bound by vCenter and ESX host, while vSS ports are bound only by ESX host. Since neither answer “a” nor “c” specifies which limitation it is answering – host or vCenter – it is left to subjective reasoning to infer the intent. According to Jon Hall (VMware, Florida), the most ports any vNetwork switch can have in any one host is 4,088 – regardless of type. Therefore, to reach the “total virtual network ports per host (vDS and vSS ports)” maximum, at least one switch of each type must exist; alone, either type can only reach 4,088 ports. However, the Configuration Maximums document never spells this out for the vNetwork Distributed Switch. Hopefully this exception will be foot-noted in the next revision of the document. [Note: the additional information from Jon about vDS-type vNetwork switches logically invalidates “a” as a response.]

Following the VCP4 Prep Course, I “recharged” in the Alumni Lounge. VMware had snacks and drinks to quell the appetite and lots of power outlets to restore my iPhone and laptop. While I waited, I contacted the wife and got the 4-1-1 on our baby, checked e-mail and ran through the “mock-up” exam a couple of times. Then it was off to the Welcome Reception at the VMware Experience Hall where sponsors and exhibitors had their wares on display.

iPhone Screen Capture of the ESX Host Running Nehalem-EX, 4P/16C/32T

iPhone Screen Capture of the ESX Host Running Nehalem-EX, 4P/32C/64T

Just inside the Hall – across from the closest beverage station – was Intel’s booth, and the boys in blue were demonstrating vMotion over 10GE NICs. Yes, it was fast (as you’d expect), but the real kick was the “upcoming” 10GE Base-T adapters to challenge the current price-performance leader: the 10GE Base-CR (also supporting SFP+). At under $400/port for 10GE, it’s hard to remember a reason for using 1Gbps NICs… Oh yes, the prohibitive per-port cost of 10GE switches. Arista Networks to the rescue???

Intel was also showing their “modular server” system. Unfortunately, the current offering doesn’t allow for SAS JBOD expansion in a meaningful way (read: running NexentaStor on one or two of the “blades”), but after discussing the SAS issue with the guys in the blue booth, interest was piqued. Evan, expect a call from Intel’s server group… Seriously, with 14x 2.5″ drives in a SAS Expander-interconnected chassis, NexentaStor + SSD + 15K SAS would rock!

Last but not least, Intel was proudly showing their 4P Nehalem-EX running VMware ESX with 512GB of RAM (DDR3) and demonstrating 64 active threads (pictured). This build-out offers lots of virtualization goodness at a heretofore unknown price point. Suffice it to say, at 1.8GHz it’s not a screamer, but the RAS features are headed in the right direction. When you rope together 64 threads (about 125-250 VMs) and 1TB worth of VMs (yes, 1TB of RAM – about $250K worth using “on-loan Samsung parts”), you are talking about a lot of eggs in one basket. By enhancing the RAS capabilities of these giant systems, component failure is becoming less catastrophic – eventually allowing only a few VMs to be impacted by a point failure instead of ALL running VMs on the box.

vCenter ESX Host Status Showing 512GB of RAM

In case you haven’t seen an ESX host with 512GB of available RAM, check out the screen capture (excuse the iPhone quality) to the right. That’s about $33K worth of DDR3 memory sitting in that box; assuming the EX processors run $2K apiece and allowing $6K for the remainder of the system, that’s nearly $6K/VM in this demo: fantastically decadent! Of course – and in all due fairness to the boys in blue – VM density was not the goal in this demonstration: RAS was, and the 2-bit error scrubbing – while as painful as watching paint dry – is pretty cool and soon to be needed (as indicated above) for systems with this capacity.

Other vendors visited were Wyse and Xsigo. The boys in yellow (Wyse) were pimping their thin/zero clients with some compelling examples of PCoIP (Wyse 20p) and MMR (Wyse r90lew). The PCoIP demos featured end-to-end hardware Teradici cards displaying clips from Avatar, while the MMR demo featured 720p movie clips from an IMAX cut of dog-fight training. While the PCoIP demo was polished and flawless, the upcoming MMR enhancements – though flawed in the beta I saw – were nothing short of impressive.

No, that's not Xsigo's secret sauce: it's the chocolate fountain at VMware's Welcome Reception.

Considering that the MMR-capable thin client was running a 1.5GHz AMD Sempron, the 720p Windows Media stream looked all the better. Looking back at the virtual machine from the ESX console, only about 10-15% of a core was being consumed to “render” the video. But that’s the beauty of MMR: redirect the processor-intensive decoding to the end-point and just send the stream un-decoded. While PCoIP is a win in LANs with knowledge workers and call-center applications, the MMR-based thin clients look pretty good for education and YouTube-happy C-level employees looking to catch up on their Hulu…

I managed to catch the Xsigo boys as the night wound down, and they assured me that “mom’s cooking” back at HQ. “Very soon” we should be hearing about a Xsigo I/O Director option that is a better fit for ROBO and SME deployments. The best part about Xsigo’s I/O virtualization technology in VMware applications: it delivers without a proprietary blade or server requirement! I’m really looking forward to getting some Xsigo into the SOLORI lab this summer…