Nvidia Maxwell Architecture Analysis – Delivering Double the Performance Per Watt on 28nm
OK, the NDA has finally lifted and the official architecture documentation is up, so it's time to take an in-depth look at Maxwell. It manages to double the performance per watt while staying on the same 28nm process, which, there is no other way to put it, is nothing short of a miracle. Let's see exactly how it manages that.
Maxwell 28nm Miracle – How the Architecture makes Doubling Performance Per Watt Possible
Since we already know that each SMM has 128 CUDA cores, simple maths tells you this block diagram is of the GTX 750 Ti (128 × 5 = 640, the 750 Ti’s CUDA core count). But the thing we are interested in is the division. Notice how each SMM is divided into 4 dedicated “major cores”. This is one of the biggest changes the architecture has seen since Kepler, whose SMX consisted of just one big sheet. Let's zoom in, straight into a Maxwell Streaming Multiprocessor.
If you were to count the CUDA cores, you would count exactly 128, or 32 per major core. It is also very interesting how they have divided up the memory interface width (bus) between the major cores, giving 32 bits to each; the memory interface width of course adds up to 128 bits. Now here’s the interesting part: there are two L1 caches, and each is shared by two major cores along with 4 texture units. The 64 KB of shared memory is shared between all 4 major cores, i.e. the entire SMM.
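To make the counting above concrete, here is a minimal Python sketch of the SMM layout as this article reads it off the block diagram. The class name `SMMLayout` and its field names are made up for illustration, and the figures are simply the ones quoted in the text (128 cores and 4 major cores per SMM, 2 L1 caches with 4 texture units each, 64 KB of shared memory, 5 SMMs on the GTX 750 Ti). It is not an official Nvidia specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMMLayout:
    """Hypothetical model of one SMM, using the numbers quoted in the article."""
    cores_per_smm: int = 128       # CUDA cores counted in the block diagram
    partitions_per_smm: int = 4    # the four "major cores"
    l1_caches_per_smm: int = 2     # each L1 is shared by two major cores
    texture_units_per_l1: int = 4  # texture units attached to each L1
    shared_memory_kb: int = 64     # shared by the whole SMM

    @property
    def cores_per_partition(self) -> int:
        # CUDA cores owned by a single major core
        return self.cores_per_smm // self.partitions_per_smm

    @property
    def partitions_per_l1(self) -> int:
        # How many major cores share one L1 cache
        return self.partitions_per_smm // self.l1_caches_per_smm

    @property
    def texture_units_per_smm(self) -> int:
        return self.l1_caches_per_smm * self.texture_units_per_l1

    def total_cores(self, smm_count: int) -> int:
        # CUDA cores on a chip built from `smm_count` SMMs
        return smm_count * self.cores_per_smm


layout = SMMLayout()
GTX_750_TI_SMM_COUNT = 5  # from the article: 128 * 5 = 640

print("CUDA cores on the chip:", layout.total_cores(GTX_750_TI_SMM_COUNT))
print("CUDA cores per major core:", layout.cores_per_partition)
print("Major cores per L1 cache:", layout.partitions_per_l1)
print("Texture units per SMM:", layout.texture_units_per_smm)
```

Changing `partitions_per_smm` is an easy way to play with how the per-partition numbers move if the SMM were split differently.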
Here is the Kepler SMX for comparison:
By this point the major revolution of the Maxwell architecture should be becoming clear: division, division and more division. You might also have noticed that, unlike in the Kepler SMX, each warp scheduler has control over only its own ‘major core’. Apart from the caches and shared memory described above, nothing is shared between the 4 major cores except the FP64 and texture units (by the warp schedulers). With power in numbers taken to an art form, this raises an interesting question: would applying the same division tactics to other architectures yield the same benefits? It also raises the question of whether, if we were somehow able to split the 128 CUDA cores into not 4 but 128 major cores with 1 bit each, we would have the perfectly efficient architecture.
I would also like to mention, in conclusion, that there is something in the Maxwell architecture that Nvidia is not telling us. It is the ‘secret sauce’ approach, if you will; though it's childish, no one can argue with its effectiveness.