Archive for the ‘Servers’ Category

Quick Take: Nehalem/Istanbul Comparison at AnandTech

October 7, 2009

Johan De Gelas and crew present an interesting comparison of Dunnington, Shanghai, Istanbul and Nehalem in a new post at AnandTech this week. In the test line-up are the “top bin” parts from Intel and AMD in 4-core and 6-core incarnations:

  • Intel Nehalem-EP Xeon, X5570 2.93GHz, 4-core, 8-thread
  • Intel “Dunnington” Xeon, X7460, 2.66GHz, 6-core, 6-thread
  • AMD “Shanghai” Opteron 2389/8389, 2.9GHz, 4-core, 4-thread
  • AMD “Istanbul” Opteron 2435/8435, 2.6GHz, 6-core, 6-thread

Most important for virtualization systems architects is how vCPU scheduling affects “measured” performance. The telling piece comes from the difference in comparison results where vCPU scheduling is equalized:

AnandTech's Quad Sockets v. Dual Sockets Comparison. Oct 6, 2009.

When comparing the results, De Gelas hits on the I/O factor which chiefly separates VMmark from vAPUS:

The result is that VMmark with its huge number of VMs per server (up to 102 VMs!) places a lot of stress on the I/O systems. The reason for the Intel Xeon X5570’s crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation may be that the VMDq (multiple queues and offloading of the virtual switch to the hardware) implementation of the Intel NICs is better than the Broadcom NICs that are typically found in the AMD based servers.

Johan De Gelas, AnandTech, Oct 2009

This is yet another issue that VMware architects struggle with in complex deployments. The latency in “Dunnington” is a huge contributor to its downfall and why the Penryn architecture was a dead-end. Combined with 8 additional threads in the 2P form factor, Nehalem delivers twice the number of hardware execution contexts as Shanghai, resulting in significant efficiencies for Nehalem where small working data sets are involved.

When larger sets are used – as in vAPUS – Istanbul’s additional cores allow it to close the gap to within the clock speed difference of Nehalem (about 12%). In contrast to VMmark, which implies a 3:2 advantage to Nehalem, the vAPUS results suggest a closer performance gap in more aggressive virtualization use cases.

SOLORI’s Take: We differ with De Gelas on the reduction in vAPUS’ data set to accommodate the “cheaper” memory build of the Nehalem system. While this offers some advantages in testing, it also diminishes one of Opteron’s greatest strengths: access to cheap and abundant memory. Here we have the testing conundrum: fit the test around the competitors or the competitors around the test. The former approach presents a bias on the “pure performance” aspect of the competitors, while the latter is more typical of use-case testing.

We do not construe this issue as intentional bias on AnandTech’s part; however, it is another vector to consider in the evaluation of the results. De Gelas delivers a report worth reading in its entirety, and we view this as a primer to the issues that will define the first half of 2010.

Quick Take: HP Blade Tops 8-core VMmark w/OC’d Memory

September 25, 2009

HP’s ProLiant BL490c G6 server blade now tops the VMware VMmark table for 8-core systems – just squeaking past rack servers from Lenovo and Dell with a score of 24.54@17 tiles: a new 8-core record. The half-height blade was equipped with two quad-core Intel Xeon X5570 processors (Nehalem-EP, 130W TDP) and 96GB ECC Registered DDR3-1333 (12x 8GB, 2-DIMM/channel) memory.

In our follow-up, we found that HP’s on-line configuration tool does not allow for DDR3-1333 memory so we went to the street for a comparison. For starters, we examined the on-line price from HP with DDR3-1066 memory and the added QLogic QMH2462 Fiber Channel adapter ($750) and additional NC360m dual-port Gigabit Ethernet controller ($320) which came to a grand total of $28,280 for the blade (about $277/VM, not including Blade chassis or SAN storage).

Stripping memory from the build-out results in a $7,970 floor to the hardware, sans memory. Going to the street to find 8GB sticks with DDR3-1333 ratings and HP support yielded the Kingston KTH-PL313K3/24G kit (3x 8GB DIMMs) of which we would need three to complete the build-out. At $4,773 per kit, the completed system comes to $22,289 (about $218/VM, not including chassis or storage) which may do more to demonstrate Kingston’s value in the marketplace than HP’s penchant for “over-priced” memory.
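The build-out arithmetic above can be re-traced with a few lines of Python. This is only a sketch: the variable and function names are ours, the kit count is taken as stated in the text, and the 6-VMs-per-tile figure is an assumption inferred from the 17-tile, 102-VM result.

```python
# Figures quoted in the post; names are illustrative, not from HP or VMware.
BASE_SANS_MEMORY = 7_970   # blade hardware without memory, USD
KIT_PRICE = 4_773          # Kingston KTH-PL313K3/24G kit (3x 8GB), USD
KITS = 3                   # kit count as stated in the post
HP_ONLINE_TOTAL = 28_280   # HP on-line price with DDR3-1066, USD
TILES = 17                 # from the 24.54@17 tiles result
VMS_PER_TILE = 6           # assumption: 17 tiles correspond to 102 VMs


def build_cost(base: int, kit_price: int, kits: int) -> int:
    """Hardware floor plus street-priced memory kits."""
    return base + kit_price * kits


def cost_per_vm(total: float, tiles: int, vms_per_tile: int = VMS_PER_TILE) -> float:
    """Acquisition cost spread across every VM in the benchmark run."""
    return total / (tiles * vms_per_tile)


street_total = build_cost(BASE_SANS_MEMORY, KIT_PRICE, KITS)
street_per_vm = cost_per_vm(street_total, TILES)
hp_per_vm = cost_per_vm(HP_ONLINE_TOTAL, TILES)

print(f"street build: ${street_total:,} (${street_per_vm:.2f}/VM)")
print(f"HP on-line:   ${HP_ONLINE_TOTAL:,} (${hp_per_vm:.2f}/VM)")
```

Neither figure includes the blade chassis or SAN storage, matching the caveat in the text.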

Now, the interesting disclosure from HP’s testing team is this:

Notes from HP's VMmark submission.

While this appears to boost memory performance significantly for HP’s latest run (compared to the 24.24@17 tiles score back in May, 2009) it does so at the risk of running the Nehalem-EP memory controller out of specification – essentially, driving the controller beyond the rated load. It is hard for us to imagine that this specific configuration would be vendor supported if used in a problematic customer installation.

SOLORI’s Take: Those of you following closely may be asking yourselves: “Why did HP choose to over-clock the memory controller in this run by pushing a 1066MHz, 2DPC limit to 1333MHz?” It would appear the answer is self-evident: the extra 6% was needed to put them over the Lenovo machine. This issue raises a new question about the VMmark validation process: “Should out of specification configurations be allowed in the general benchmark corpus?” It is our opinion that VMmark should represent off-the-shelf, fully-supported configurations only – not esoteric configuration tweaks and questionable over-clocking practices.

Should there be an “unlimited” category in the VMmark arena? Who knows? How many enterprises knowingly commit their mission critical data and processes to systems running over-clocked processors and over-driven memory controllers? No hands? That’s what we thought… Congratulations anyway to HP for clawing their way to the top of the VMmark 8-core heap…

AMD Chipsets Launched: Fiorano and Kroner Platforms to Follow

September 21, 2009

The Channel Register is reporting on the launch of AMD’s motherboard chipsets which will drive new socket-F based Fiorano and Kroner platforms as well as the socket G34 and C32 based Maranello and San Marino platforms. The Register also points out that no tier one PC maker is announcing socket-F solutions based on the new chipsets today. However, motherboard and “barebones” maker Supermicro is also announcing new A+ server, blade and workstation variants using the new AMD SR5690 and SP5100 chipsets, enabling:

  • GPU-optimized designs: Support up to four double-width GPUs along with two CPUs and up to 3 additional high-performance add-on cards.
  • Up to 10 quad-processor (MP) or dual-processor (DP) Blades in a 7U enclosure: Industry-leading density and power efficiency with up to 240 processor cores and 640GB memory per 7U enclosure.
  • 6Gb/s SAS 2.0 designs: Four-socket and two-socket server and workstation solutions with double the data throughput of previous generation storage architectures.
  • Universal I/O designs: Provide flexible I/O customization and investment protection.
  • QDR InfiniBand support option: Integrated QDR IB switch and UIO add-on card solution for maximum I/O performance.
  • High memory capacity: 16 DIMM models with high capacity memory support to dramatically improve memory and virtualization performance.
  • PCI-E 2.0 Slots plus Dual HT Links (HT3) to CPUs: Enhance motherboard I/O bandwidth and performance. Optimal for QDR IB card support.
  • Onboard IPMI 2.0 support: Reduces remote management costs.

Eco-Systems based on Supermicro’s venerable AS2021M – based on the NVidia nForce Pro 3600 chipset – can now be augmented with the Supermicro AS2021A variant based on AMD’s SR5690/SP5100 pairing. Besides offering HT3.0 and on-board Winbond WPCM450 KVM/IP BMC module, the new iteration includes support for the SR5690’s IOMMU function (experimentally supported by VMware), 16 DDR2 800/667/533 DIMMs, and four PCI-E 2.0 slots – all in the same, familiar 2U chassis with eight 3.5″ hot-swap bays.

AMD’s John Fruehe outlines AMD’s market approach for the new chipsets in his “AMD at Work” blog today. Based on the same basic logic/silicon, the SR5690, SR5670 and SR5650 all deliver PCI-E 2.0 and HT3.0 but at differing levels of power consumption and PCI Express lanes to their respective platforms. Paired with the appropriate “power and speed” Opteron variant, these platforms offer system designers, virtualization architects and HPC vendors greater control over price-performance and power-performance constraints that drive their respective environments.

AMD chose the occasion of the Embedded Systems Conference in Boston to announce its new chipset to the world. Citing performance-per-watt advantages that could enhance embedded systems in the telecom, storage and security markets, AMD’s press release highlighted three separate vendors with products ready to ship based on the new AMD chipsets.

Quick Take: DRAM Price Follow-Up

September 14, 2009

As anticipated, global DRAM prices have continued their upward trend through September, 2009. We reported on August 4, 2009 about the DDR3 and DDR2 price increases that – coupled with a short-fall in DDR3 production – have caused a temporary shift of the consumer market towards DDR2-based designs.

Last week, the Inquirer also reported that DRAM prices were on the rise and that the trend will result in parity between DDR2 and DDR3 prices. MaximumPC ran the Inquirer’s story urging its readers to buy now as the tide rises on both fronts. DRAMeXchange is reporting a significant revenue gain to the major players in the DRAM market as a result of this well orchestrated ballet of supply and demand. The net result for consumers is higher prices across the board as the DDR2/DDR3 production cross-over point is reached.

2Q2009 worldwide DRAM revenue.

SOLORI’s Take: DDR2 is a fading bargain in the server markets, and DIMM vendors like Kingston are working to maintain a stable source of DDR2 components through the end of 2009. Looking at our benchmark tracking components, we project 8GB DIMMs to average $565/DIMM by the end of 2009. In the new year, expect 8GB/DDR2 to hit $600/DIMM by the end of H2/2010 with lower pricing on 8GB/DDR3-1066 – in the $500/DIMM range (if supply can keep up with new system demands created by continued growth in the virtualization market).

Benchmark Server Memory Pricing (Jun ’09 vs. Sep ’09)

DDR2 Series (1.8V):

  • 4GB 800MHz DDR2 ECC Reg with Parity CL6 DIMM Dual Rank, x4 (5.4W): $100.00 in Jun ’09, $117.00 in Sep ’09 (up 17%)
  • 4GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (5.94W): $80.00 in Jun ’09, $103.00 in Sep ’09 (up 29%)
  • 8GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (7.236W): $396.00 in Jun ’09, $433.00 in Sep ’09 (up 9%)

DDR3 Series (1.5V):

  • 4GB 1333MHz DDR3 ECC Reg w/Parity CL9 DIMM Dual Rank, x4 w/Therm Sen (3.96W): $138.00 in Jun ’09, $151.00 in Sep ’09 (up 10%)
  • 4GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (5.09W): $132.00 in Jun ’09, $151.00 in Sep ’09 (up 15%)
  • 8GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (6.36W): $1035.00 in Jun ’09, $917.00 in Sep ’09 (down 11.5%)
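For readers who want to re-derive the percentage changes in the pricing figures above, a minimal Python sketch follows. The short part labels are our own shorthand for the DIMM descriptions, and the exact results differ slightly from the rounded figures we published.

```python
# (June 2009 price, September 2009 price) in USD, taken from the pricing data above.
# The short labels are shorthand for the full DIMM descriptions.
PRICES = {
    "4GB DDR2-800": (100.00, 117.00),
    "4GB DDR2-667": (80.00, 103.00),
    "8GB DDR2-667": (396.00, 433.00),
    "4GB DDR3-1333": (138.00, 151.00),
    "4GB DDR3-1066": (132.00, 151.00),
    "8GB DDR3-1066": (1035.00, 917.00),
}


def pct_change(old: float, new: float) -> float:
    """Signed percentage change from the old price to the new price."""
    return (new - old) / old * 100.0


changes = {part: pct_change(jun, sep) for part, (jun, sep) in PRICES.items()}

for part, delta in changes.items():
    direction = "up" if delta >= 0 else "down"
    print(f"{part:14s} {direction} {abs(delta):.1f}%")
```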

SOLORI’s 2nd Take: Samsung has been driving the DRAM roller coaster in an effort to dominate the market. With Samsung’s 40-nm 2Gb DRAM production ramping by year end, the chip maker’s influence could create a disruptive position in the PC and server markets by driving 8GB/DDR3 prices into the sub-$250/DIMM range by 2H/2010. Meanwhile Hynix, the #2 market leader, chases with 40-nm 1Gb DDR3 giving Samsung the opportunity to repeat its 2008/2009 gambit in 2010 making it increasingly harder for competitors to get a foot-hold in the DDR3 market.

Samsung has their eye on the future with 16GB and 32GB DIMMs already exhibited with 50-nm 2Gb parts claiming a 20% power savings over the current line of memory. With 40-nm 2Gb parts, Samsung is claiming up to 30% additional power savings. To put this into perspective, eight 32GB DIMMs would consume about 60% of the power consumed by 32 8GB DIMMs (requiring a 4P+ server). In a virtualization context, this is enough memory to enable 100 virtual machines with 2.5GB of memory each without over subscription. Realistically, we expect to see 16GB DDR3 DIMMs at $1,200/DIMM by 2H/2010 – if everything goes according to plan.

Quick Take: Magny-Cours Spotted, Pushed to 3GHz for wPrime

September 13, 2009

Andreas Galistel at NordicHardware posted an article showing a system running a pair of engineering samples of the Magny-Cours processor running at 3.0GHz. Undoubtedly these images were culled from a report “leaked” on XtremeSystems forums showing a “DINAR2” motherboard with SR5690 chipset – in single and dual processor installation – running Magny-Cours at the more typical pre-release speed of 1.7GHz.

We know that Magny-Cours is essentially a MCM of Istanbul delivered in the rectangular socket G34 package. One thing illuminating about the two posts is the reported “reduction” in L3 cache from 12MB (6MB x 2 in MCM) to 10MB (2 x 5MB in MCM). Where did the additional cache go? That’s easy: since a 2P Magny-Cours installation is essentially a 4P Istanbul configuration, these processors have the new HT Assist feature enabled – giving 1MB of cache from each chip in the MCM to HT Assist.

“wPrime uses a recursive call of Newton’s method for estimating functions, with f(x)=x^2-k, where k is the number we’re sqrting, until Sgn(f(x)/f'(x)) does not equal that of the previous iteration, starting with an estimation of k/2. It then uses an iterative calling of the estimation method a set amount of times to increase the accuracy of the results. It then confirms that n(k)^2=k to ensure the calculation was correct. It repeats this for all numbers from 1 to the requested maximum.”

wPrime site
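To make the quoted description concrete, here is a minimal Python sketch of Newton's method applied to f(x) = x^2 - k, starting from an estimate of k/2. It is not wPrime's code: the fixed iteration count, the tolerance and the function names are our assumptions, and the sign-change stopping rule from the quote is replaced by a simple fixed number of refinement steps.

```python
def newton_sqrt(k: float, iterations: int = 30) -> float:
    """Estimate sqrt(k) with Newton's method on f(x) = x^2 - k.

    Starts from k/2 as in the quoted description; the iteration count
    is an arbitrary choice for this sketch.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    x = k / 2.0
    for _ in range(iterations):
        # Newton step: x_next = x - f(x) / f'(x), with f'(x) = 2x.
        x = x - (x * x - k) / (2.0 * x)
    return x


def check_range(maximum: int, rel_tol: float = 1e-9) -> bool:
    """Confirm that the estimate squared reproduces k for every k in 1..maximum."""
    return all(
        abs(newton_sqrt(k) ** 2 - k) <= rel_tol * k
        for k in range(1, maximum + 1)
    )


if __name__ == "__main__":
    print(newton_sqrt(2.0))
    print(check_range(1000))
```

Because every k from 1 to the maximum is independent of the others, the outer loop divides cleanly across threads, which is why the benchmark rewards core count.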

Another thing intriguing about the XtremeSystems post in particular is the reported wPrime 32M and 1024M completion times. Compared to the hyper-threading-enabled 2P Xeon W5590 (130W TDP) running wPrime 32M at 3.33GHz (3.6GHz turbo)  in 3.950 seconds, the 2P 3.0GHz Magny-Cours completed wPrime 32M in an unofficial 3.539 seconds – about 10% quicker while running a 10% slower clock. From the myopic lens of this result, it would appear AMD’s choice of “real cores” versus hyper-threading delivers its punch.

SOLORI’s Take: As a “reality check” we can compare the reigning quad-socket, quad-core Opteron 8393 SE result in wPrime 32M and wPrime 1024M at 3.90 and 89.52 seconds, respectively. Adjusted for clock and core count versus its Shanghai cousin, the Magny-Cours engineering samples – at 3.54 and 75.77 seconds, respectively – turned-in times about 10% slower than our calculus predicted. While still “record breaking” for 2P systems, we expected the Magny-Cours/Istanbul cores to out-perform Shanghai clock-per-clock – even at this stage of the game.

Due to the multi-threaded nature of the wPrime benchmark, it is likely that the HT Assist feature – enabled in a 2P Magny-Cours system by default – is the cause of the discrepancy. By reducing the available L3 cache by 1MB per die – 4MB of L3 cache total – HT Assist actually could be creating a slow-down. However, there are several things to remember here:

  • These are engineering samples qualified for 1.7GHz operation
  • Speed enhancements were performed with tools not yet adapted to Magny-Cours
  • The author indicated a lack of control over AMD’s Cool ‘n Quiet technology which could have made “as tested” core clocks somewhat lower than what CPUz reported (at least during the extended tests)
  • It is speculated that AMD will release Magny-Cours at 2.2GHz (top bin) upon release, making the 2.6+ GHz results non-typical
  • The BIOS and related dependencies are likely still being “baked”

The engineering sample speed tests posted on the XtremeSystems forum at a more “typical” clock speed of 2.6GHz track with the 3.0GHz overclock results for 2P Magny-Cours: 3.947 seconds and 79.625 seconds for wPrime 32M and 1024M, respectively. Even at that speed, the 24-core system is on par with the 2P Nehalem system clocked nearly a GHz faster. Oddly, Intel reports the W5590 as not supporting “turbo” or hyper-threading although it is clear that Intel’s marketing is incorrect based on actual testing.

Assuming Magny-Cours improves slightly on its way to market, we already know how 24-core Istanbul stacks-up against 16-thread Nehalem in VMmark and what that means for Nehalem-EP. This partly explains the marketing shift as Intel tries to position Nehalem-EP as destined for workstations instead of servers. Whether you consider this move a prelude to the coming Nehalem-EX v. Magny-Cours combat or an attempt to keep Intel’s server chip power average down by eliminating the 130W+ parts from the “server” list, Intel and AMD will each attempt to win the war before the first shot is fired. Either way, we see nothing that disrupts the price-performance and power-performance comparison models that dominate the server markets.

[Ed: The 10% difference is likely due to the fact that the author was unable to get “more than one core” clocked at 3.0GHz. Likewise, he was uncertain that all cores were reliably clocking at 2.6GHz for the longer wPrime tests. Again, this engineering sample was designed to run at 1.7GHz and was not likely “hand picked” to run at much higher clocks. He speculated that some form of dynamic core clocking linked to temperature was affecting clock stability – perhaps due to some AMD-P tweaks in Magny-Cours.]

Quick Take: Dell/Nehalem Take #2, 2P VMmark Spot

September 9, 2009

The new 1st runner-up spot for VMmark in the “8 core” category was taken yesterday by Dell’s R710 – just edging-out the previous second spot HP ProLiant BL490 G6 by 0.1% – a virtual dead heat. Equipped with a pair of Xeon X5570 ($1386/ea, bulk list) and 96GB registered DDR3/1066 (12x8GB), the 2U, rack mount R710 weighs-in with a tile ratio of 1.43 over 102 VMs:

  • Dell R710 w/redundant high-output power supply, ($18,209)
  • 2 x Intel Xeon X5570 Processors (included)
  • 96GB ECC DDR3/1066 (12×8GB) (included)
  • 2 x Broadcom NetXtreme II 5709 dual-port GigabitEthernet w/TOE (included)
  • 1 x Intel PRO 1000VT quad-port GigabitEthernet (1x PCIe-x4 slot, $529)
  • 3 x QLogic QLE2462 FC HBA (1x PCIe slot, $1,219/ea)
  • 1 x LSI1078 SAS Controller (on-board)
  • 8 x 15K SAS OS drive, RAID10 (included)
  • Required ProSupport package ($2,164)
  • Total as Configured: $24,559 ($241/VM, not including storage)
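The configuration total above can be tallied with a short Python sketch. The dictionary keys are our own labels for the line items, and the tile count is an assumption derived from 102 VMs at 6 VMs per tile.

```python
# Priced line items from the Dell R710 configuration above, in USD.
# Items marked "included" or "on-board" are part of the base price.
R710_ITEMS = {
    "R710 base with redundant power": 18_209,
    "Intel PRO 1000VT quad-port NIC": 529,
    "QLogic QLE2462 FC HBA (x3)": 3 * 1_219,
    "ProSupport package": 2_164,
}

VMS = 102                    # from the VMmark result
VMS_PER_TILE = 6             # assumption: one tile is six VMs
TILES = VMS // VMS_PER_TILE  # 17 tiles

r710_total = sum(R710_ITEMS.values())
r710_per_vm = r710_total / VMS
r710_per_tile = r710_total / TILES

print(f"total:    ${r710_total:,}")
print(f"per VM:   ${r710_per_vm:.2f}")
print(f"per tile: ${r710_per_tile:.2f}")
```

Storage is excluded from the tally, as in the configured total.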

Three Dell/EMC CX3-40f arrays were used as the storage backing of the test. The storage system included 8GB cache, 2 enclosures and 15, 15K disks per array delivering 19 LUNs at about 300GB each. Intel’s Hyper-Threading and  “Turbo Boost” were enabled for 8-thread, 3.33GHz core clocking as was VT; however embedded SATA and USB were disabled as is common practice.

At about $1,445/tile ($241/VM) the new “second dog” delivers its best at a 20% price premium over Lenovo’s “top dog” – although the non-standard OS drive configuration makes up half of the difference, with Dell’s mandatory support package making up the remainder. Using a simple RAID1 SAS and eliminating the support package would have dropped the cost to $20,421 – a dead heat with Lenovo at $182/VM.

Comparing the Dell R710 to the 2P, 12-core benchmark HP DL385 G6 Istanbul system at 15.54@11 tiles:

  • HP DL385 G6  ($5,840)
  • 2 x AMD 2435 Istanbul Processors (included)
  • 64GB ECC DDR2/667 (8×8GB) ($433/DIMM)
  • 2 x Broadcom 5709 dual-port GigabitEthernet (on-board)
  • 1 x Intel 82571EB dual-port GigabitEthernet (1x PCIe slot, $150/ea)
  • 1 x QLogic QLE2462 FC HBA (1x PCIe slot, $1,219/ea)
  • 1 x HP SAS Controller (on-board)
  • 2 x SAS OS drive (included)
  • $10,673/system total (versus $14,696 complete from HP)

Direct pricing shows Istanbul’s numbers at $1,336/tile ($223/VM) which is  a 7.5% savings per-VM over the Dell R710. Going to the street – for memory only – changes the Istanbul picture to $970/tile ($162/VM) representing a 33% savings over the R710.
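The per-tile and per-VM comparison above reduces to a few divisions. The following Python sketch re-traces it using the totals from the two configurations; the function names are ours, and the 6-VMs-per-tile figure is an assumption consistent with the VM counts quoted in these posts.

```python
VMS_PER_TILE = 6  # assumption: one VMmark tile is six VMs


def per_tile_and_vm(total: float, tiles: int) -> tuple[float, float]:
    """Return (cost per tile, cost per VM) for a system total in USD."""
    return total / tiles, total / (tiles * VMS_PER_TILE)


def savings(candidate: float, baseline: float) -> float:
    """Fractional savings of the candidate cost relative to the baseline."""
    return 1.0 - candidate / baseline


# Totals from the configurations above.
_, r710_vm = per_tile_and_vm(24_559, 17)                # Dell R710, 17 tiles
direct_tile, direct_vm = per_tile_and_vm(14_696, 11)    # HP DL385 G6, HP direct
street_tile, street_vm = per_tile_and_vm(10_673, 11)    # HP DL385 G6, street memory

direct_savings = savings(direct_vm, r710_vm)
street_savings = savings(street_vm, r710_vm)

print(f"direct: ${direct_tile:.0f}/tile, ${direct_vm:.0f}/VM, {direct_savings:.1%} below the R710")
print(f"street: ${street_tile:.0f}/tile, ${street_vm:.0f}/VM, {street_savings:.1%} below the R710")
```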

SOLORI’s Take: Istanbul continues to offer a 20-30% CAPEX value proposition against Nehalem in the virtualization use case – even without IOMMU and higher memory bandwidth promised in upcoming Magny-Cours. With the HE parts running around $500 per processor, the OPEX benefits are there for Istanbul too. It is difficult to understand why HP wants to charge $900/DIMM for 8GB PC-5300 sticks when they are available on the street for 50% less – that’s a 100% markup. Looking at what HP charges for 8GB DDR3/1066 – $1,700/DIMM – they are at least consistent. HP’s memory pricing practice makes one thing clear – customers are not buying large memory configurations from their system vendors…

In contrast, Dell appears to be happy to offer decent prices on 8GB DDR3/1066 with their R710 at approximately $837/DIMM – almost par with street prices. Looking to see if this parity held up with Dell’s AMD offerings, we examined the prices offered with Dell’s R805: while – at $680/DIMM – Dell’s prices were significantly better than HP’s, they still exceeded the market by 50%. Still, we were able to configure a Dell R805 with AMD 2435’s for much less than the equivalent HP system:

  • Dell R805 w/redundant power ($7,214)
  • 2 x AMD 2435 Istanbul Processors (included)
  • 64GB ECC DDR2/667 (8×8GB) ($433/ea, street)
  • 4 x Broadcom 5708 GigabitEthernet (on-board)
  • 1 x Intel PRO 1000PT dual-port GigabitEthernet (1x PCIe slot, included)
  • 1 x QLogic QLE2462 FC HBA (1x PCIe slot, included)
  • 1 x Dell PERC SAS Controller (on-board)
  • 2 x SAS OS drive (included)
  • $10,678/system total (versus $12,702 complete from Dell)

This offering from Dell should be able to deliver performance equivalent to HP’s DL385 G6 and similar savings/VM compared to the Nehalem-based R710. Even at the $12,702 price as delivered from Dell, the R805 represents a potential $192/VM price point – about $50/VM (25%) savings over the R710.

Quick Take: HP Sets Another 48-core VMmark Milestone

August 26, 2009

Not satisfied with a landmark VMmark score that crossed the 30 tile mark for the first time, HP’s performance team went back to the benches two weeks later and took another swing at the performance crown. Well, the effort paid off, and HP significantly out-paced their two-week-old record with a score of 53.73@35 tiles in the heavy weight, 48-core category.

Using the same 8-processor HP ProLiant DL785 G6 platform as in the previous run – complete with 2.8GHz AMD Opteron 8439 SE 6-core chips and 256GB DDR2/667 – the new score comes with significant performance bumps in the javaserver, mailserver and database results achieved by the same system configuration as the previous attempt – including the same ESX 4.0 version (164009). So what changed to add an additional 5 tiles to the team’s run? It would appear that someone was unsatisfied with the storage configuration on the mailserver run.

Given that the tile ratio of the previous run ran about 6% higher than its 24-core counterpart, there may have been a small indication that untapped capacity was available. According to the run notes, the only reported changes to the test configuration – aside from the addition of the 5 LUNs and 5 clients needed to support the 5 additional tiles – was a notation indicating that the “data drive and backup drive for all mailserver VMs” were repartitioned using AutoPart v1.6.

The change in performance numbers effectively reduces the virtualization cost of the system by 15% to about $257/VM – closing-in on its 24-core sibling to within $10/VM and stretching-out its lead over “Dunnington” rivals to about $85/VM. While virtualization is not the primary application for 8P systems, this demonstrates that 48-core virtualization is definitely viable.

SOLORI’s Take: HP’s performance team has done a great job tuning its flagship AMD platform, demonstrating that platform performance is not just related to hertz or core-count but requires balanced tuning and performance all around. This improvement in system tuning demonstrates an 18% increase in incremental scalability – approaching within 3% of the 12-core to 24-core scaling factor, making it actually a viable consideration in the virtualization use case.

In recent discussions with AMD about the SR5690 chipset applications for Socket-F, AMD re-iterated that the mainstream focus for SR5690 has been Magny-Cours and the Q1/2010 launch. Given the close relationship between Istanbul and Magny-Cours – detailed nicely by Charlie Demerjian at Semi-Accurate – the bar is clearly fixed for 2P and 4P virtualization systems designed around these chips. Extrapolating from the similarities and improvements to I/O and memory bandwidth, we expect to  see 2P VMmarks besting 32@23 and 4P scores over 54@39 from HP, AMD and Magny-Cours.

SOLORI’s 2nd Take: Intel has been plugging away with its Nehalem-EX for 8-way systems and – delivering 128-threads – promises to deliver some insane VMmarks. Assuming Intel’s EX scales as efficiently as AMD’s new Opterons have, extrapolations indicate performance for the 4P, 64-thread Nehalem-EX should fall between 41@29 and 44@31 given the current crop of speed and performance bins. Using the same methods, our calculus predicts an 8P, 128-thread EX system should deliver scores between 64@45 and 74@52.

With EX expected to clock at 2.66GHz with 140W TDP and AMD’s MCM-based Magny-Cours doing well to hit 130W ACP in the same speed bins, CIOs balancing power and performance considerations will need to break-out the spreadsheets to determine the winners here. With both systems running 4-channel DDR3, there will be no power or price advantage given on either side to memory differences: relative price-performance and power consumption of the CPUs will be major factors. Assuming our extrapolations are correct, we’re looking at a slight edge to AMD in performance-per-watt in the 2P segment, and a significant advantage in the 4P segment.

Quick Take: HP Plants the Flag with 48-core VMmark Milestones

August 12, 2009

Last month, we predicted that HP could easily claim the VMmark summit with its DL785 G6 using AMD’s Istanbul processors:

If AMD’s Istanbul scales to 8-socket at least as efficiently as Dunnington, we should be seeing some 48-core results in the 43.8@30 tile range in the next month or so from HP’s 785 G6 with 8-AMD 8439 SE processors. You might ask: what virtualization applications scale to 48-cores when $/VM is doubled at the same time? We don’t have that answer, and judging by Intel and AMD’s scale-by-hub designs coming in 2010, that market will need to be created at the OEM level.

Well, HP didn’t make us wait too long. Today, the PC maker cleared two significant VMmark milestones: crossing the 30 tile barrier in a single system (180 VMs) and exceeding the 40 mark on VMmark score. With a score of 47.77@30 tiles, the HP DL785 G6 – powered by 8 AMD Istanbul 8439 SE processors and 256GB of DDR2/667 memory – set the bar well beyond the competition and does so with better performance than we expected – most likely due to AMD’s “HT assist” technology increasing its scalability.

Not available until September 14, 2009, the HP DL785 G6 is a pricey competitor. We estimate – based on today’s processor and memory prices – that a system as well appointed as the VMmark-configured version (additional NICs, HBA, etc) will run at least $54,000 or around $300/VM (about $60/VM higher than the 24-core contender and about $35/VM lower than HP’s Dunnington “equivalent”).

SOLORI’s Take: While the September timing of the release might imply a G6 with AMD’s SR5690 and IOMMU, we’re doubtful that the timing is anything but a coincidence: even though such a pairing would enable PCIe 2.0 and highly effective 10Gbps solutions. The modular design of the DL785 series – with its ability to scale from 4P to 8P in the same system – mitigates the economic realities of the dwindling 8P segment, and HP has delivered the pinnacle of performance for this technology.

We are also impressed with HP’s performance team and their ability to scale Shanghai to Istanbul with relative efficiency. Moving from DL785 G5 quad-core to DL785 G6 six-core was an almost perfect linear increase in capacity (95% of theoretical increase from 32-core to 48-core) while performance-per-tile increased by 6%. This further demonstrates the “home run” AMD has hit with Istanbul and underscores the excellent value proposition of Socket-F systems over the last several years.

Unfortunately, while they demonstrate a 91% scaling efficiency from 12-core to 24-core, HP and Istanbul have only achieved a 75% incremental scaling efficiency from 24-cores to 48-cores. When looking at tile-per-core scaling using the 8-core, 2P system as a baseline (1:1 tile-to-core ratio), 2P, 4P and 8P Istanbul deliver 91%, 83% and 62.5% efficiencies overall, respectively. However, compared to the 58% and 50% tile-to-core efficiencies of Dunnington 4P and 8P, respectively, Istanbul clearly dominates the 4P and 8P performance and price-performance landscape in 2009.
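The efficiency figures above can be reproduced with a small Python sketch. The 2P (11 tiles) and 8P (30 tiles) counts come from the VMmark results cited in these posts; the 4P count of 20 tiles is our assumption, back-calculated from the 83% figure, and the function names are ours.

```python
# cores -> tiles for the Istanbul systems discussed above.
# 11 and 30 tiles are quoted results; 20 tiles for 4P is an assumption
# inferred from the 83% tile-to-core efficiency in the text.
ISTANBUL_TILES = {12: 11, 24: 20, 48: 30}


def overall_efficiency(cores: int, tiles: int) -> float:
    """Tile-to-core ratio against the 1:1 baseline of the 8-core 2P system."""
    return tiles / cores


def incremental_efficiency(small: tuple[int, int], large: tuple[int, int]) -> float:
    """Tiles achieved by the larger system versus perfect scaling of the smaller."""
    small_cores, small_tiles = small
    large_cores, large_tiles = large
    ideal_tiles = small_tiles * large_cores / small_cores
    return large_tiles / ideal_tiles


overall = {c: overall_efficiency(c, t) for c, t in ISTANBUL_TILES.items()}
step_12_to_24 = incremental_efficiency((12, 11), (24, 20))
step_24_to_48 = incremental_efficiency((24, 20), (48, 30))

for cores, eff in overall.items():
    print(f"{cores}-core overall: {eff:.1%}")
print(f"12 -> 24 cores: {step_12_to_24:.1%}")
print(f"24 -> 48 cores: {step_24_to_48:.1%}")
```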

In today’s age of virtualization-driven scale-out, SOLORI’s calculus indicates that multi-socket solutions that deliver a tile-to-core ratio of less than 75% will not succeed (economically) in the virtualization use case in 2010, regardless of socket count. That said – even at a 2:3 tile-to-core ratio – the 8P, 48-core Istanbul will likely reign supreme as the VMmark heavy-weight champion of 2009.

SOLORI’s 2nd Take: HP and AMD’s achievements with this Istanbul system should be recognized before we usher-in the next wave of technology like Magny-Cours and Socket G34. While the DL785 G6 is not a game changer, its footnote in computing history may well be as a preview of what we can expect to see out of Magny-Cours in 2H/2010. If 12-core, 4P system price shrinks with the socket count we could be looking at a $150/VM price-point for a 4P system: now that would be a serious game changer.

Quick Take: 6-core “Gulftown” Nehalem-EP Spotted, Tested

August 10, 2009

TechReport is reporting on a Taiwanese overclocker who may be testing a pair of Nehalem 6-core processors (2P) slated for release early in 2010. Likewise, AlienBabelTech mentions a Chinese website, HKEPC, that has preliminary testing completed on the desktop (1P) variant of the 6-core. While these could be different 32nm silicon parts, it is more likely – judging from the CPU-Z outputs and provided package pictures – that these are the same sample SKUs tested as 1P and 2P LGA-1366 components.

What does this mean for AMD and the only 6-core shipping today? Since Intel’s still projecting Q2/2010 for the server part, AMD has a decent opportunity to grow market share for Istanbul. Intel’s biggest rival will be itself – facing a wildly growing number of SKUs across its i-line from i5, i7, i8 and i9 “families” with multiple speed and feature variants. Clearly, the non-HT version would stand as a direct competitor to Istanbul’s native 6-core SKUs. Likewise, Istanbul may be no match for the 6-core Nehalem with HT and “turbo core” feature set.

However, with an 8-core “Beckton” Nehalem variant on the horizon, it might be hard to understand just where the Gulftown fits in Intel’s picture. Intel faces a serious production issue, filling fab capacity with 4-core, 6-core and 8-core processors, each with speed, power, socket and HT variants from which to supply high-speed, high-power SKUs and lower-speed, low-power SKUs for 1P, 2P and 4P+ destinations. Doing the simple math with 3 SKUs per part, Intel would be offering the market a minimum of 18 base parts according to their current marketing strategy: 9 with HT/turbo, 9 without HT/turbo. For socket LGA-1366, this could easily mean 40+ SKUs with 1xQPI and 2xQPI variants included (up from 23).
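The “simple math” above is a straightforward combination count, sketched here in Python. The three bin labels are hypothetical placeholders for “3 SKUs per part”; they are not Intel product tiers.

```python
from itertools import product

CORE_COUNTS = (4, 6, 8)                  # core counts named in the text
BINS = ("bin-1", "bin-2", "bin-3")       # hypothetical labels for 3 SKUs per part
HT_TURBO = (True, False)                 # with and without HT/turbo

# Every combination of core count, bin, and HT/turbo setting is one base part.
base_parts = list(product(CORE_COUNTS, BINS, HT_TURBO))

with_ht = [part for part in base_parts if part[2]]
without_ht = [part for part in base_parts if not part[2]]

print(len(base_parts), len(with_ht), len(without_ht))
```

Adding a further two-way split for 1xQPI and 2xQPI variants would double the count again, which is roughly how the 40+ SKU estimate for LGA-1366 arises.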

SOLORI’s take: Intel will have to create some interesting “crippling or pricing tricks” to keep Gulftown from cannibalizing the Gainestown market. If they follow their “normal” play book, we predict the next 10 months will play out like this:

  1. Initially there will be no 8-core product for 1P and 2P systems (LGA-1366), allowing for artificially high margins on the 8-core EX chip (LGA-1567), slowing the inevitable cannibalization of the 4-core/2P market, and easing production burdens;
  2. Intel will silently and abruptly kill Itanium in favor of “hyper-scale” Nehalem-EX variants;
  3. Gulftown will remain high-power (90-130W TDP) and be positioned against AMD’s G34 systems and Magny-Cours – plotting 12-core against 12-thread;
  4. Intel creates a “socket refresh” (LGA-1566?) to enable “inexpensive” 2P-4P platforms from its Gulftown/Beckton line-up in 2H/2010 (ostensibly to maintain parity with G34) without hurting EX;
  5. Revised, lower-power variants of Gainestown will be positioned against AMD’s C32 target market;
  6. Intel will cut SKUs in favor of higher margins, increasing speed and features for “same dollar” cost;
  7. Non-HT parts will begin to disappear in 4-core configurations completely;
  8. Intel’s AES enhancements in Gulftown will allow it to further differentiate itself in storage and security markets.

It would be a mistake for Intel to continue growing SKU count or provide too much overlap between 4-core HT and 6-core non-HT offerings. If purchasing trends soften in 4Q/09 and remain (relatively) flat through 2Q/10, Intel will benefit from a leaner, well differentiated line-up. AMD has already announced a “leaner” plan for G34/C32. If all goes well at the fabs, 1H/2010 will be a good ole fashioned street fight between blue and green.

Quick Take: DDR3 Prices on the Rise

August 4, 2009

In the current server-class arms race, Intel and AMD have secured separate quarters: Intel’s rival QPI architecture coupled to a 3-channel DDR3 memory bus and functional hyper-threading cores (top bin parts) holds the pure performance sector; while AMD’s improved Istanbul cores can be delivered 6 at a time and paired with inexpensive DDR2 memory to achieve better price-performance (acquisition). Both solutions deliver about the same economies in power consumption under virtualized loads.

All in all, the Twin2 with Xeon L5520 CPUs is the best platform for those seeking an affordable server with an excellent performance/watt ratio at an affordable price. On the other hand, if performance/price is the most important criterion followed by performance/watt, we would probably opt for the six-core Opteron version of the Twin2. Supermicro has “a blade killer” available with the Twin², especially for those people who like to keep the hardware costs low.

Johan De Gelas, AnandTech, July 22, 2009

Global DDR2 and DDR3 Capacity

Meanwhile, the cost differential between DDR3 and DDR2 continues to widen due to increased demand in the notebook sector and reduced supply (capacity). According to DRAMeXchange, the trend will continue into Q4/09 as suppliers are expected to commit up to 30% of capacity to DDR3 by that time.

At the same time, DDR3 prices continue to inch up, by 5% in July, while DDR2 prices have appeared to bottom-out. This trend in DDR3 pricing is consistent across all speed ratings (1066/1333/1600) and, despite artificial downward price pressure from Samsung, has managed to drift upward 20% since May, 2009.

DRAMeXchange, DDR3 1Gb 128Mx8 1333MHz Price Chart

DDR3 Price Trend, May to August, 2009

Because low-end, lower-priced 2GB DDR3/1066 ($60/stick) memory shows little advantage over 2GB DDR2/800 ($35/stick), the 70% price premium keeps DDR2 in demand. With the added economic pressures of the world economy and the cautious growth outlook of the manufacturing sector, the cross-over from DDR2 to DDR3 will come at a significant cost: either to the consumer or the supplier.

Until the cross-over, DDR2-based systems will continue to be a favorite in price-sensitive applications (i.e., where total system cost plays a significant role in purchasing decisions). As an example of this economic inequality, let’s take the HP DL380 G6 and DL385 G6 as a comparison. Adding 16GB to the DL380 adds about $760 to the price tag (4x4GB DDR3-1066), while adding the same amount of memory to the DL385 adds only $410 (4x4GB DDR2-800). This comparison demonstrates an 85% price premium of DDR3 versus DDR2, a bit higher (percentage-wise) than the desktop norm of 70%.
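The premium figures above follow directly from the quoted prices. As a minimal sketch (the function and variable names are hypothetical; the dollar amounts are the ones cited in the text), the premium is the DDR3 price relative to the DDR2 price, minus one:

```python
def premium_pct(ddr3_price, ddr2_price):
    """Percentage by which the DDR3 price exceeds the DDR2 price."""
    return (ddr3_price / ddr2_price - 1.0) * 100.0


# Desktop example from the text: 2GB DDR3/1066 at $60 vs. 2GB DDR2/800 at $35.
desktop = premium_pct(60.0, 35.0)

# Server example from the text: 16GB upgrade (4x4GB) on DL380 G6 vs. DL385 G6.
server = premium_pct(760.0, 410.0)

print(f"desktop premium: {desktop:.1f}%")
print(f"server premium:  {server:.1f}%")
```

The desktop case works out to a little over 71%, which the post rounds to 70%, and the server case lands at about 85%.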

SOLORI’s Take: While the cost of memory in desktop systems typically represents a small portion of the overall system cost, the same cannot be said for virtualization systems where entry configurations weigh in at 16GB and often run from 48GB to 72GB in “fully loaded” systems. This, as our calculus has shown, is where the sweet-spot of $/VM is delivered.

In such configurations, the cost of DDR3 memory can triple the system cost ($6,370 for 2P, L5506 w/12x4GB DDR3-1066R vs. $5,201 for 2P 2427 w/12x4GB DDR2-800). Moving to the higher memory footprint in 2P systems is typically not cost effective because core count cannot keep up with the memory needs of the virtual machine inventory. However, if it were possible to utilize additional memory in the 2P platform, our benchmark 8GB DDR3-1066 versus DDR2-667 price comparison tells another story. At $900/stick, 8GB DDR3 still costs roughly 235% of the price of 8GB DDR2 (about a 135% premium), making 96GB DDR3 systems (2P Xeon w/HT) nearly $6,200 per server more costly than their DDR2 counterparts (2P Istanbul) based on memory pricing alone.
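The roughly $6,200 gap can be cross-checked from the $900/stick price. The sketch below rests on two assumptions that the post does not state outright: that a 96GB server uses 12 sticks of 8GB, and that the 235% figure is a price ratio (DDR3 price = 2.35 x DDR2 price). The second assumption is chosen because it is the reading that reproduces the quoted difference; a literal 235% premium gives a noticeably larger gap. All names are hypothetical.

```python
DDR3_STICK = 900.0       # $/stick for 8GB DDR3-1066, from the text
STICKS = 96 // 8         # assumption: 96GB built from 12 x 8GB sticks
FIGURE = 2.35            # the "235%" figure from the text


def per_server_delta(ddr3_price, ddr2_price, sticks):
    """Extra memory cost per server when using DDR3 instead of DDR2."""
    return sticks * (ddr3_price - ddr2_price)


# Reading 1 (assumed): DDR3 costs 2.35 times the DDR2 price.
ddr2_as_ratio = DDR3_STICK / FIGURE
delta_as_ratio = per_server_delta(DDR3_STICK, ddr2_as_ratio, STICKS)

# Reading 2: DDR3 carries a literal 235% premium (3.35 times the DDR2 price).
ddr2_as_premium = DDR3_STICK / (1.0 + FIGURE)
delta_as_premium = per_server_delta(DDR3_STICK, ddr2_as_premium, STICKS)

print(f"ratio reading:   DDR2 ~${ddr2_as_ratio:.0f}/stick, delta ~${delta_as_ratio:.0f}")
print(f"premium reading: DDR2 ~${ddr2_as_premium:.0f}/stick, delta ~${delta_as_premium:.0f}")
```

Under the ratio reading the implied DDR2 price is a little over $380 per stick and the per-server difference comes out at about $6,200, in line with the figure quoted above.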

SOLORI’s 2nd Take: We’re hoping to see Tyan and Supermicro release SR5690 chipset-based systems – promised in Q3/2009 – to take advantage of this pricing trend and round-out the Istanbul offering before Q1/2010 ushers-in the next wave of multi-core systems. With 10G prices on the decline, we think today’s virtualization applications make Istanbul+IOMMU a good price-performance and price-feature fit in the 32-64GB memory footprint space, leaving Nehalem-EP with only the performance niche to its credit. The only question is: where is SR5690?