Nvidia Maxwell Architecture Analysis – Delivering Double the Performance Per Watt on 28nm

OK, the NDA has finally lifted and the official architecture documentation is up, so it's finally time to take an in-depth look at Maxwell. It manages to double performance per watt while staying on the same 28nm process, which, there is no other way to put it, is nothing short of a miracle. Let's see exactly how it manages that.

Maxwell 28nm Miracle – How the Architecture makes Doubling Performance Per Watt Possible

Firstly, as I am sure most of you are aware, Maxwell does not use SMX units. It uses SMMs, short for Maxwell Streaming Multiprocessors. Each SMM houses 128 CUDA cores, as opposed to the 192 housed by an SMX. Unlike Kepler, where all the CUDA cores of an SMX sit in a single shared pool, Maxwell splits the CUDA cores of each SMM into 4 subsets, almost 4 separate “cores” within the SMM. Let's call these “major cores” (as opposed to CUDA cores) to avoid confusion. Do realize that this only refers to the first generation of Maxwell, and the division by four could change in the next generation. The major-core tactic allows Nvidia to achieve much higher efficiency and to raise performance per core to 135% of Kepler's, that is, roughly 35% more per core. Take a look at this diagram of Maxwell SMMs.
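To put rough numbers on the per-core claim, here is a minimal Python sketch. The core counts are the ones quoted above; reading the 135% figure as "one Maxwell core does the work of 1.35 Kepler cores" is an assumption, and the function names are purely illustrative.

```python
# Back-of-the-envelope comparison of one Kepler SMX and one Maxwell SMM.
# Core counts come from the article; PER_CORE_GAIN is an assumed reading
# of the "135% per core" figure (1.35x Kepler), not a measured value.

SMX_CORES = 192        # CUDA cores per Kepler SMX
SMM_CORES = 128        # CUDA cores per Maxwell SMM
SMM_PARTITIONS = 4     # "major cores" per SMM
PER_CORE_GAIN = 1.35   # assumption: one Maxwell core = 1.35 Kepler cores


def cores_per_partition(total_cores: int, partitions: int) -> int:
    """Number of CUDA cores in each partition of a multiprocessor."""
    if total_cores % partitions != 0:
        raise ValueError("cores do not divide evenly across partitions")
    return total_cores // partitions


def relative_smm_throughput() -> float:
    """Throughput of one SMM relative to one SMX (SMX = 1.0)."""
    return (SMM_CORES * PER_CORE_GAIN) / SMX_CORES


if __name__ == "__main__":
    print(cores_per_partition(SMM_CORES, SMM_PARTITIONS))  # 32
    print(round(relative_smm_throughput(), 2))             # 0.9
```

Under that assumption, an SMM with a third fewer cores would land at about 90% of an SMX's throughput, so any efficiency gain would have to come from the smaller SMM costing less power and area.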

Now, since we already know that each SMM has 128 CUDA cores, simple maths tells you this block diagram is of the GTX 750 Ti (128 × 5 = 640, the 750 Ti's CUDA core count). But the thing we are interested in is the division. Notice how each SMM is divided into 4 dedicated “major cores”. This is one of the biggest changes the architecture has seen since Kepler, whose SMX would have consisted of just one big sheet. Let's zoom in, straight into a Maxwell Streaming Multiprocessor.
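The core-count arithmetic can be written out as a small Python sketch. The constants are the figures from the text; the helper names are hypothetical and only there for illustration.

```python
# Core-count arithmetic for a first-generation Maxwell GPU.
# Constants are the article's figures; helper names are illustrative.

CORES_PER_SMM = 128
PARTITIONS_PER_SMM = 4   # the "major cores" in each SMM


def total_cuda_cores(smm_count: int) -> int:
    """Total CUDA cores for a chip with the given number of SMMs."""
    return smm_count * CORES_PER_SMM


def smm_count_for(total_cores: int) -> int:
    """Work backwards from a CUDA core count to the number of SMMs."""
    count, remainder = divmod(total_cores, CORES_PER_SMM)
    if remainder:
        raise ValueError("core count is not a whole number of SMMs")
    return count


def total_partitions(smm_count: int) -> int:
    """Total number of 'major cores' across the whole chip."""
    return smm_count * PARTITIONS_PER_SMM


if __name__ == "__main__":
    print(total_cuda_cores(5))   # 640, the GTX 750 Ti's core count
    print(smm_count_for(640))    # 5
    print(total_partitions(5))   # 20
```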

Maxwell Architecture Streaming Maxwell Multiprocessor SMM

If you were to count the CUDA cores, you would count exactly 128. It is also very interesting how they have divided up the memory interface width (bus) between the major cores, giving 32 bits to each. The memory interface width of course adds up to 128 bits. Now here's the interesting part: there are two L1 caches, and each is shared by two major cores along with 4 texture units. The 64 KB of shared memory is shared between all 4 major cores, i.e. the entire SMM.
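The sharing arrangement described above can be modelled in a few lines of Python. This is only a sketch of the layout as the article describes it: the class and field names are hypothetical, "4 texture units" is read as four per L1 cache, and pairing partitions (0, 1) and (2, 3) on the same L1 is an assumption about which major cores sit together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMMLayout:
    """Toy model of one SMM; figures are taken from the article's description."""
    partitions: int = 4            # "major cores"
    cores_per_partition: int = 32
    l1_caches: int = 2             # each shared by a pair of partitions
    texture_units_per_l1: int = 4  # assumed to mean four per L1 cache
    shared_memory_kb: int = 64     # shared by the entire SMM

    @property
    def total_cores(self) -> int:
        return self.partitions * self.cores_per_partition

    @property
    def total_texture_units(self) -> int:
        return self.l1_caches * self.texture_units_per_l1

    def l1_for_partition(self, partition: int) -> int:
        """Index of the L1 cache serving a partition.

        Assumes adjacent partitions are paired: (0, 1) and (2, 3).
        """
        if not 0 <= partition < self.partitions:
            raise IndexError("no such partition")
        partitions_per_l1 = self.partitions // self.l1_caches
        return partition // partitions_per_l1


if __name__ == "__main__":
    smm = SMMLayout()
    print(smm.total_cores)                                 # 128
    print(smm.total_texture_units)                         # 8
    print([smm.l1_for_partition(p) for p in range(4)])     # [0, 0, 1, 1]
```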
Here is the Kepler SMX in comparison:

By this point the major revolution of the Maxwell architecture should be becoming clear: division, division and more division. You might also have noticed that, unlike in the Kepler SMX, each warp scheduler has control over only its own ‘major core’. Nothing is shared between the 4 major cores except the FP64 and texture units (by the warp schedulers). With Maxwell taking power in numbers to an art form, it raises an interesting question: would applying the same division tactics to other architectures yield the same benefits? It also raises the question of whether, if we were somehow able to split the 128 CUDA cores into not 4 but 128 major cores, with 1 bit each, we would have the perfectly efficient architecture.

I would also like to mention, in conclusion, that there is something in the Maxwell architecture that Nvidia is not telling us. It is the ‘secret sauce’ approach, if you will; childish though it is, no one can argue with its effectiveness.
