Intel HD 5500 Architecture and Performance

Broadwell is the architecture behind the latest 5th generation of Intel Core products. Building on the foundations of the Haswell architecture launched in mid-2013, Broadwell shrinks it down to the company’s new 14nm manufacturing process – reducing die size and power consumption – whilst also introducing the company’s latest Gen 8 graphics architecture.

Broadwell itself was subject to a number of delays. Although a full launch was originally expected in mid-2014, the first products didn’t arrive on the market until late in the year, in the form of the extremely low-power, tablet-focused Core M chips, also known as ‘Broadwell-Y’. The first of the Ultrabook-focused chips, known as ‘Broadwell-U’, were launched at CES in January 2015, with laptops utilising the HD 5500 graphics core arriving on the market in late January and early February – and it is these which we are looking at today.

The Broadwell-U ‘GT2’ die

Inside Intel’s Gen 8 graphics

While relatively little has changed on the CPU side between Haswell and Broadwell, the move to Gen 8 has brought a number of revisions to Intel’s graphics core. At a high level this means updated API support, up to DirectX 11.2 and OpenCL 2.0, but the internal arrangement has also seen a number of changes which are worth further investigation. Intel’s graphics architecture is divided into a number of sub-sections which can be scaled up or down to create different graphics solutions.

HD 5500 block diagram

Front-end

The Command Streamer and Global Thread Dispatcher handle receiving instructions from the graphics driver and balancing the load across the chip’s resources. Fixed function units responsible for tasks such as triangle setup also live in the front-end. Unfortunately, it is difficult to find accurate data on the exact setup rate of these units, or on whether the size of the front-end is scaled for GT2 and GT3 variants.

Slices

The front-end units feed into the graphics ‘slices’, which contain further fixed function units responsible for tasks such as raster operations (ROPs) and texturing. This means that with Broadwell, as with Haswell, the Pixel Fill and Texel Rates scale with the number of slices present in the graphics configuration.

Broadwell does offer some improvements over Haswell in this regard, with at least a 50% increase in both the Pixel Fill and Texel Rates.

The slice also contains a 384KB L3 data cache which the sub-slice samplers can read from, and Shared Local Memory (64KB per sub-slice) which the sub-slices’ Data Ports can read and write to.

For the HD 5500 ‘GT2’ configuration you have a single slice, with ‘GT3’ based SKUs (HD 6000 and Iris 6100) using two.

Sub-slices

The sub-slices contain the Execution Units used for processing shader instructions, along with a Local Thread Dispatcher for distributing incoming instructions to them, a Sampler for fetching textures from the slice’s shared memory, and a Data Port for carrying out memory load/store operations.

The Execution Units are arranged into groups of 8 per sub-slice for GT2 and GT3 based Broadwells, down from 10 per sub-slice in Haswell. This increases the effective bandwidth available to the slice’s shared memory by reducing contention for each sub-slice’s Sampler and Data Port.

As this alone would lead to a reduction in the total number of Execution Units compared to Haswell, the number of sub-slices per slice has been increased from two in Haswell to three in Broadwell. This gives a 20% increase in the total number of Execution Units, but a 50% increase in bandwidth available from the slice’s shared local memory. For GT2 configurations this means 24 Execution Units in Broadwell (up from 20 in Haswell), and for GT3 it means 48 (up from 40 in Haswell).

There are some caveats here. i3 Broadwell chips have a single Execution Unit disabled, giving a total of 23 Execution Units across 3 sub-slices, and Broadwell-based Celerons and Pentiums feature a 1 slice, 2 sub-slice arrangement with 6 EUs per sub-slice – presumably through fusing out unused units.
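The Execution Unit counts quoted above all follow from slices × sub-slices per slice × EUs per sub-slice, minus any disabled units. The short Python sketch below simply reproduces that arithmetic from the figures in this article; the `GpuConfig` class and the configuration labels are illustrative names of our own, not anything taken from Intel’s documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical helper describing one graphics configuration."""
    name: str
    slices: int
    subslices_per_slice: int
    eus_per_subslice: int
    disabled_eus: int = 0  # e.g. the single EU disabled on i3 parts

    @property
    def total_subslices(self) -> int:
        return self.slices * self.subslices_per_slice

    @property
    def total_eus(self) -> int:
        return self.total_subslices * self.eus_per_subslice - self.disabled_eus


# Figures as quoted in the article text.
CONFIGS = [
    GpuConfig("Haswell GT2", slices=1, subslices_per_slice=2, eus_per_subslice=10),
    GpuConfig("Haswell GT3", slices=2, subslices_per_slice=2, eus_per_subslice=10),
    GpuConfig("Broadwell GT2", slices=1, subslices_per_slice=3, eus_per_subslice=8),
    GpuConfig("Broadwell GT3", slices=2, subslices_per_slice=3, eus_per_subslice=8),
    GpuConfig("Broadwell GT2 (i3)", 1, 3, 8, disabled_eus=1),
    GpuConfig("Broadwell Celeron/Pentium", 1, 2, 6),
]

EU_TOTALS = {cfg.name: cfg.total_eus for cfg in CONFIGS}

# Relative gains of Broadwell GT2 over Haswell GT2.
EU_GAIN = EU_TOTALS["Broadwell GT2"] / EU_TOTALS["Haswell GT2"] - 1
SUBSLICE_GAIN = CONFIGS[2].total_subslices / CONFIGS[0].total_subslices - 1

if __name__ == "__main__":
    for name, eus in EU_TOTALS.items():
        print(f"{name}: {eus} EUs")
    print(f"EU gain: {EU_GAIN:.0%}, sub-slice gain: {SUBSLICE_GAIN:.0%}")
```

The 50% figure falls out of the sub-slice count alone: each sub-slice brings its own Sampler and Data Port, so going from two to three sub-slices per slice adds half again as many of each.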

Execution Units

The actual operation of the Execution Units is beyond the scope of this article, but the peak throughput for FP32 operations is given as 2 x SIMD-4 FPU Fused Multiply+Add operations per clock, per Execution Unit. With each Fused Multiply+Add counting as two floating point operations, that works out to 2 x 4 x 2 = 16 FLOPs per clock cycle.

For GT2 this means a theoretical peak rate of 384 FLOPs per clock for Broadwell, up from 320 FLOPs per clock cycle on GT2 Haswell.
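As a quick check of these figures, the Python sketch below multiplies out the per-EU throughput and scales it by the EU count. The function names are our own, and the clock speed used in the usage line is purely a placeholder assumption to show how a per-clock figure converts to GFLOPS; it is not a specification quoted in this article.

```python
# Per-EU FP32 throughput as described in the text:
# 2 FPUs x SIMD-4 lanes x 2 operations per Fused Multiply+Add.
FPUS_PER_EU = 2
SIMD_WIDTH = 4
FLOPS_PER_FMA = 2
FLOPS_PER_EU_PER_CLOCK = FPUS_PER_EU * SIMD_WIDTH * FLOPS_PER_FMA


def flops_per_clock(execution_units: int) -> int:
    """Theoretical peak FP32 operations per clock for a given EU count."""
    return execution_units * FLOPS_PER_EU_PER_CLOCK


def peak_gflops(execution_units: int, clock_mhz: float) -> float:
    """Theoretical peak GFLOPS at a given GPU clock (clock is an input, not a spec)."""
    return flops_per_clock(execution_units) * clock_mhz / 1000.0


if __name__ == "__main__":
    print("Broadwell GT2:", flops_per_clock(24), "FLOPs/clock")
    print("Haswell GT2:  ", flops_per_clock(20), "FLOPs/clock")
    # 900 MHz is an arbitrary illustrative clock, chosen only for the example.
    print("Broadwell GT2 at an assumed 900 MHz:", peak_gflops(24, 900.0), "GFLOPS")
```

These are theoretical peaks only; real workloads will be limited by memory bandwidth, the fixed function units and how well the shader code keeps both FPUs busy.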

A fuller explanation of the Execution Units’ operation can be found in Intel’s IDF14 presentation, The Compute Architecture of Intel® Processor Graphics Gen8.
