Hiroshige Goto wrote an article regarding AVX, Bulldozer architecture (maybe inspired by my blog, since he quotes some of the same patent applications), new instructions and their encoding:
A few more interesting patent applications have appeared, and my first take on them can be read here (as always): http://citavia.blog.de/
They cover the multithreading aspect of Bulldozer.
Thank you for posting this, very interesting. Some of these architectural features might actually be intended for improved versions of Bulldozer in 2012 or 2013.
Regards, Hans
Thanks for your comment. I will include it in my blog soon.
Do you have any idea what all this decoding power might be good for? On P3DNow we developed an idea, because there are already several patent applications regarding µCode caching, hierarchical µCode, safe known-good code execution, etc., to implement architectural changes:
For so much µCode they had better have some way to decode it without the limitations of K7-K10. The optimum would be to be able to use up to all available decoders (µCode + fast path) for decoding. That's an important point in those new patent applications, I think. They'd essentially maximize the input bandwidth to the decoders (both from instruction fetch to fast path decode and from µOp code storage to µOp decode) by increasing the output bandwidth, since the paths to the clusters are already there.
Currently I'm checking academic research papers regarding CMT architectures (benefits, power profile, performance profile etc.) and updating my blog again ;)
Thanks Dresdenboy, very good and informative blog posts about BD you wrote there ;)
1) Slide 6: one "AVX HIGH" unit on port 0 and one "AVX LOW" unit on port 1. So it looks like 256-bit AVX will have the same throughput as 128-bit SSE on current cores: 2 clocks for a 256-bit vmulps/pd plus a 256-bit vaddps/pd, instead of 1 clock for a 128-bit mulps/pd plus a 128-bit addps/pd (i.e. the same SP/DP flops per clock with balanced add/mul). So if Sandy Bridge, unlike Conroe, Penryn and Nehalem, can't issue a mul and an add in the same clock, will it actually be less efficient than these previous cores on the large body of legacy 128-bit SSE code? If that is indeed the case, Agner was pretty much right after all.
2) Slide 6: only 128-bit paths from the L1D cache to the execution units (I was hoping for full-featured 256-bit paths). A few consequences:
- the extra load port will help legacy 128-bit SSE or 128-bit AVX as much as 256-bit AVX; same 48 B/clock maximum L1 bandwidth
- loop fission will probably no longer be a good optimization if intermediate results are stored in L1D; it is probably better to overflow the LSD than the L1D, particularly with multiple threads fighting for L1D access
- more incentive to use 64-bit code, to have 16 ymm registers instead of 8 and minimize L1D accesses
3) Slide 54: 64 B cache lines (unchanged), so:
- aligning memory is still important (more important than on Nehalem); otherwise 1 in 2 accesses will incur a cache line split
4) Slide 58: masked moves considered harmful; replace vmaskmovps with vblendvps + vmovaps, just like in legacy SSE4 code?
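To make the "1 in 2 accesses" figure from point 3 concrete, here is a small sketch of the address arithmetic only. It assumes 64 B lines (as on slide 54) and a stream of back-to-back accesses; the function names `crosses_line` and `split_fraction` are made up for this illustration, and nothing here models the actual penalty of a split on real hardware.

```python
CACHE_LINE = 64  # bytes, the line size quoted from slide 54


def crosses_line(addr, size, line=CACHE_LINE):
    """Return True if the access [addr, addr + size) touches two cache lines."""
    return addr // line != (addr + size - 1) // line


def split_fraction(base, size, count, line=CACHE_LINE):
    """Fraction of `count` back-to-back accesses of `size` bytes,
    starting at address `base`, that straddle a cache line boundary."""
    splits = sum(crosses_line(base + i * size, size, line) for i in range(count))
    return splits / count


if __name__ == "__main__":
    # 256-bit (32 B) accesses from a 32-byte aligned base: never split
    print(split_fraction(0, 32, 1024))
    # 256-bit accesses from a base that is only 16-byte aligned: every other one splits
    print(split_fraction(16, 32, 1024))
    # 128-bit (16 B) accesses from a 16-byte aligned base: never split
    print(split_fraction(16, 16, 1024))
```

Under these assumptions, data that is only 16-byte aligned is harmless for 128-bit accesses but makes every second 256-bit access straddle two lines, which is one way to read the remark that alignment matters more here than on Nehalem.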
That would match what is shown below, which does not show any doubled FP resources.