Well, I trust that Mark does know :^) (thanks Mark)
Interestingly, Sandy Bridge's core-to-L3 ratio is smaller than Westmere's rather than larger, so there doesn't seem to be a lot of room for extra hardware.
Of course one could double pump all the SSE circuits instead of doubling the actual hardware....
Clarkdale die shot is not very easy to read. Does it have a memory controller (can it be used without IGP)?
According to Hiroshige Goto (ごとう ひろしげ) of PC Watch, it's something like this:
Seems ok...
Regards, Hans
Yes, that's in line with Intel's slides. However, can you verify this from the die shot? There is plenty of space reserved for I/O, and it seems to be too much for just one QPI link.
If the size of each bar indicates its width in bits, does that mean that SB can do two 256-bit loads per cycle but only one 128-bit store (i.e. it can do one AVX store every two cycles)?
> If the sizes of each bar indicates its width in bits does that mean that SB can do two 256-bit loads per cycle but only one 128-bit store (i.e. it can do one AVX every two cycles)?
To do two 32-byte loads per cycle, you'd need more than 48 bytes/cycle from the L1d.
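The arithmetic behind that remark can be sketched in a few lines of C. The 48 bytes/cycle figure and the 32-byte load size are simply the numbers quoted in this thread, not measurements, and the function names are made up for illustration.

```c
/* Figures taken from the discussion above (assumptions, not measured). */
enum {
    L1D_BYTES_PER_CYCLE = 48, /* total L1D bandwidth quoted in the thread */
    AVX_LOAD_BYTES      = 32  /* one 256-bit load */
};

/* Bytes per cycle needed to sustain `loads` 256-bit loads every cycle. */
int required_bytes_per_cycle(int loads)
{
    return loads * AVX_LOAD_BYTES;
}

/* Returns 1 if that demand fits in the quoted L1D bandwidth, else 0. */
int fits_in_l1d_bandwidth(int loads)
{
    return required_bytes_per_cycle(loads) <= L1D_BYTES_PER_CYCLE;
}
```

With these numbers, two loads per cycle would need 64 bytes/cycle, which is more than the 48 bytes/cycle quoted, while a single 256-bit load per cycle fits.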
According to Mark Buxton's answer here http://software.intel.com/en-us/forums/ ... pic/68554/ the average throughput for two 256-bit loads is 1.5 cycles, i.e. loads can use all of the available 48 B/cycle bandwidth.
It's possible that 256-bit stores have a 2-clock throughput, i.e. stores can use only 1/3 of the bandwidth; that is definitely how the new chart looks.
The main impact on code optimization will be to avoid useless stores to L1D. Some optimizations like loop fission must be reverted to big loops without the intermediate stores; I will wait for the real chips to run that kind of test, though. One reason to use loop fission was to fit within the LSD limits; that may also change in Sandy Bridge if there is indeed a small trace cache (only rumors so far) instead of the tiny LSD.
Last edited by Eric Bron on Tue Oct 20, 2009 9:35 am, edited 1 time in total.
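For readers unfamiliar with the loop fission point above, here is a minimal sketch in C of the two forms being compared. The function names, the `tmp` buffer, and the scale-then-add computation are hypothetical and chosen only for illustration; whether the fused form is actually faster on Sandy Bridge is exactly what would need testing on real chips.

```c
#include <stddef.h>

/* Fissioned form: two small loops joined by an intermediate array.
   Every element of tmp is stored to memory and then loaded again. */
void scale_add_fissioned(const float *a, const float *b, float *tmp,
                         float *out, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        tmp[i] = a[i] * k;       /* intermediate store */
    for (size_t i = 0; i < n; ++i)
        out[i] = tmp[i] + b[i];  /* intermediate reload */
}

/* Fused form: one larger loop, the intermediate value can stay in a
   register, so there is one store per element instead of two. */
void scale_add_fused(const float *a, const float *b,
                     float *out, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        float t = a[i] * k;
        out[i] = t + b[i];
    }
}
```

Both functions compute the same result; the fused version halves the number of stores per element, at the cost of a bigger loop body.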
The important point to notice in Mark's post, if you are writing ASM code, is the part about disambiguation and the masked move (MASKMOV). This can really hurt if you don't pay attention to it. On the other hand, the compiler will do an awesome job of managing this. The execution units are a full 256 bits wide, and I can only tell you that they rock and roll :)
Man, I love this new set of instructions, awesome toy!
Francois
Neither disambiguation nor MASKMOV nor an awesome compiler nor extra amounts of cheering from a random Intel guy on the web address the fact that Sandy Bridge cannot perform two 256-bit loads per cycle.