Are you telling me there are any non-geeks who have any idea what SSEx is, let alone who buy a CPU to get a faster "PMSADBW"? Furthermore, what law says SSE5 must be 100% binary backwards compatible with SSE4/3/2/1? I'm not a fan of this naming mess either, but let's not be ridiculous here.
I don't get his problem either ... Intel's no better .. does anyone remember the "Supplemental Streaming SIMD Extensions 3", called SSSE3? In the first reports these commands were called SSE4, then suddenly they were renamed to SSSE3, and there was another SSE4 extension. Or the Pentium 4 ... the first batch for Socket 423 was not really faster than a high-clocked Pentium III. Or the first DDR2-400 memory: slower than DDR1-400 despite the "2" ... naming confusion in the computer business was common, is common and will be common ...
Not good for the customer, but one gets used to it.
Edit:
Thx @Hans for the link, 2004 seems sufficient to have "enough" support, but as you wrote, we do not know ;-)
cheers
Opteron
P.S: The PCGH article also states that AMD would like to include the SSE4.1 commands, but they mention that this depends on Intel only.
Last edited by Opteron on Sun Mar 16, 2008 8:47 pm, edited 4 times in total.
Excellent work, Hans! What strikes me is the L3 cache density. Why is AMD's L3 cache about as dense as the L2? It's not as if it's very fast, if you think of Barcelona...?
I seem to remember that they basically reused the cells designed for the L2 and yeah, it's not very fast.
Barcelona uses the same SRAM cells for L2 and L3. You have to remember that AMD is not nearly as aggressive on cache design as Intel, and they have to contend with hysteresis from SOI.
Since they are using the same cells for L2 and L3, I would expect that there is timing slack in the L3 design which is traded away to lower the power draw.
You are trying to mix everything up together, a very effective FUD technique.
Back to the point: AMD's SSE5 does not include SSE4. This is misleading.
Huh .. now you are mixing things up. Any SSEx standard denotes just a handful of commands; nobody says that it has to include its predecessors. However, most CPUs do, so please do not confuse processors that support several command sets with the name of a single command subset.
cheers
Opteron
As part of the group of people who designed SSE through SSE4, I am telling you this is the golden rule: you have to include the previous instruction set before moving to the next level.
Will you tell me you are more qualified than the people who named it and did the work?
who?/ Francois
This is my own opinion; my employer is not responsible for this posting.
Since it looks more and more as if the previously leaked Gesher details
were indeed for the processor now codenamed Nehalem, we may actually
have some numbers for the cache access times of Nehalem.
L1: 3 cycles.
L2: 9 cycles.
L3: 33 cycles.
The sheet lists the cache per core as L1=32kB, L2=512kB and L3=2-3 MB.
The Nehalem numbers are L1=32kB, L2=256kB, L3=2MB, with the L3
shared and maybe the L1 split into two 16kB halves per thread in SMT mode.
Since it looks more and more as if the previously leaked Gesher details were indeed for the processor now codenamed Nehalem, we may actually have some numbers for the cache access times of Nehalem.
There is at least one big thing on the Gesher slide that says it is not Nehalem: 7 FP ops/cycle using SSEx. Does Nehalem have an internal ring bus?
The SSE units on Nehalem are redesigned. They are not the same as
on Penryn. I would expect accumulate extensions to the multiplier, like
in SSE5, that alone would bring the number of double precision FP ops to 6.
The reason to bring the FP accumulator inside the multiplier is that you
can then do effectively single-cycle FP adds (instead of 4 to 5 cycles) using
a few tricks. The whole idea is that one can start a new MAC every cycle,
accumulating all products together. I designed FP hardware like that
some 15 years ago already.
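
The data flow described above can be sketched in software as a dot product in which every step is one multiply-accumulate into a single running sum. This is only an illustration, not a model of the Nehalem or SSE5 hardware. The names `mac_dot` and `dependent_chain_cycles` are made up for this sketch, and the cycle model is a deliberately crude assumption: it counts only the accumulate latency of a dependent chain (4 cycles with a separate adder, 1 cycle with the accumulator inside the multiplier) and ignores multiplier pipeline fill.

```python
def mac_dot(xs, ys):
    """Dot product written as a chain of multiply-accumulate (MAC) steps."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y  # one MAC: multiply, then add into the running sum
    return acc


def dependent_chain_cycles(n_products, accumulate_latency):
    """Toy cycle count for summing n products into ONE accumulator.

    Assumption: each accumulate has to wait for the previous result, so the
    chain costs n * latency cycles. Multiplier pipeline fill is ignored.
    """
    return n_products * accumulate_latency


if __name__ == "__main__":
    print(mac_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print(dependent_chain_cycles(1000, 4))  # separate FP adder, 4-cycle latency
    print(dependent_chain_cycles(1000, 1))  # accumulate inside the multiplier
```

Under these assumptions, a long sum into one accumulator stops being limited by adder latency once a new MAC can start every cycle, which is the point of the design.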
The ring unit would be the L3 intercommunication: 4 x 64B for 4 cores
each hopping from one read/write buffer section to the other.
Regards, Hans
Last edited by Hans de Vries on Mon Mar 17, 2008 12:29 am, edited 1 time in total.
Will you tell me you are more qualified than the people who named it and did the work?
Sure I will, because you are top professionals in processor design or instruction set implementation, but not in naming, semantics or technical writing ;-)
Here we can see all the new instructions added to the Intel ISA since the Pentium era.
The first thing that catches the eye is that it says "Intel processor ... additions", so the context of these extensions is Intel CPUs.
Second: the number of instructions is counted independently of the predecessors' additions.
Then there are text passages like these:
1.
Quote:
Intel will also introduce new sets of instructions designed to optimize the performance and lower the power needs of a broad range of existing and new applications. To effectively get the benefit of these new instruction, existing applications will need to be recompiled with ....
"New set of instructions" does not sound like these instructions are dependent on any other instructions. They are new, that's it ...
2.
Quote:
SSE4 is Intel’s largest ISA extension in terms of scope and impact since SSE2.
So it is an "ISA" extension. What is the ISA? I interpret ISA as "standard IA-32", or Intel 64 (formerly known as IA-32e, not to be confused with IA-64). Your argument would imply that "ISA" always means the latest and greatest with all the previous SSEx bells and whistles ...
All in all, the only conclusion one can draw from the picture above is that Intel supports all SSEx instructions in its own processors. That any other company does the same with its processors is not a universally valid assumption. All SSE instructions are just independent ISA additions; one can implement all of them, some of them, none ... anything is possible...
Your golden rule is nice, makes sense, and I naturally believe you, too. But as long as it is not written anywhere that *any* CPU has to implement all SSEx instructions to be "officially" compatible with the SSE standard (if there is one), its area of influence is quite limited.
The whole thing boils down to a matter of interpretation or, in other words, nitpicking. Quite annoying ... let's agree that several views of a text or a problem are always possible, and let's stop the discussion at this point. I prefer technical arguments to nitpicking ;-)
For example Hans' findings / speculations on Nehalem's FP Units are much more interesting ;-)
I'm still skeptical that Nehalem will do >4 FP/cycle. There are tons of presentations from mid 2007 that all show Gesher distinct from Nehalem.
Here we can see all the new instructions added to the Intel ISA since the Pentium era.
In the Pentium era things were not so simple. I remember the Pentium MMX and the Pentium Pro being available in the same period, with the PPro lacking MMX and the PMMX lacking CMOV and the other new PPro instructions.
As for SSEx, these are indeed clearly segregated (i.e. SSEn is not included in SSEn+1). That is how it is in the CPUID feature flags and in all official Intel technical documentation I'm aware of. In fact it's quite common for developers to say things like "I use SSE, SSE2 and SSE4.1" (i.e. neither SSE3 nor SSSE3).
If who? were right, the feature flags would be replaced by a simple version number and things would be way simpler for developers.
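
To make the feature-flag point concrete, here is a minimal sketch of how the CPUID leaf 1 feature bits are decoded. The bit positions are the ones from Intel's CPUID documentation; everything else is an assumption for illustration: the helper names `decode_features` and `supports` are made up, and the register values are hypothetical ones with only the bits of interest set, not dumps from real CPUs. Each extension has its own bit, so a higher-numbered SSE level says nothing about the lower ones.

```python
# CPUID leaf 1 feature bits: name -> (register, bit position).
FEATURE_BITS = {
    "CMOV":   ("edx", 15),
    "MMX":    ("edx", 23),
    "SSE":    ("edx", 25),
    "SSE2":   ("edx", 26),
    "SSE3":   ("ecx", 0),
    "SSSE3":  ("ecx", 9),
    "SSE4.1": ("ecx", 19),
    "SSE4.2": ("ecx", 20),
}


def decode_features(ecx, edx):
    """Return the set of feature names whose bit is set in ECX/EDX."""
    regs = {"ecx": ecx, "edx": edx}
    return {
        name
        for name, (reg, bit) in FEATURE_BITS.items()
        if (regs[reg] >> bit) & 1
    }


def supports(features, required):
    """True only if every required extension is individually present."""
    return set(required) <= features


# Hypothetical register values (illustration only).
# A CPU reporting SSE, SSE2 and SSE4.1 but neither SSE3 nor SSSE3:
mixed = decode_features(ecx=1 << 19, edx=(1 << 25) | (1 << 26))
# The Pentium Pro / Pentium MMX split: CMOV without MMX, and MMX without CMOV.
ppro = decode_features(ecx=0, edx=1 << 15)
pmmx = decode_features(ecx=0, edx=1 << 23)

if __name__ == "__main__":
    print(sorted(mixed))
    print(supports(mixed, ["SSE", "SSE2", "SSE4.1"]))
    print(supports(mixed, ["SSE3"]))
    print(sorted(ppro), sorted(pmmx))
```

Code that wants "SSE, SSE2 and SSE4.1" therefore has to test three separate bits; a single version number would only work if every level implied all earlier ones.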
And things are even less simple now: despite its name, SSE4a is AMD's own small extension rather than a subset of Intel's SSE4, and SSE5 is AMD's new instruction set that adds better profiling support and 3-operand instructions. AMD will call it SSE5 and it may or may not be supported by Intel.
Isn't x86 becoming incredibly convoluted now? Intel and AMD are battling to inject the most instructions, and Larrabee is coming with a subset of x86 instructions. I guess Atom will only support a subset too, plus the various AMD subset implementations. And all this to get data-parallel performance boosts that are just as likely to eventually come from circuitry other than general-purpose cores.
I wouldn't be surprised if either AMD or Intel came out with chips codenamed Nimrod or Tower.