Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour



Post subject: Consequences of extending XMM registers to YMM
PostPosted: Tue Jun 17, 2008 3:35 pm

Joined: Fri Sep 07, 2007 10:31 am
Posts: 25
Location: Denmark
Intel has announced that the 128-bit vector registers (XMM) will be expanded to 256 bits (YMM) in the forthcoming AVX instruction set. (See my preceding posting in the thread http://aceshardware.freeforums.org/inte ... .html#6824)

Register sizes have been doubled several times in the history of the x86 ISA. Every time registers are extended to a larger size, we run into the problem of partial register access. When the size of a register is doubled, it must still be possible to write to the lower half of the extended register for the sake of backwards compatibility. When an instruction writes to the lower half of an extended register, it has to wait for any previous instruction writing to the same register, because the previous instruction may write something to the upper half that has to be combined with whatever the second instruction writes to the lower half. This can prevent out-of-order execution. This is known as a false dependency.
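
A minimal sketch of the effect, in Python rather than assembly so that it can be run anywhere. It is a toy timing model, not a description of any real pipeline: the function name `finish_times`, the register names and the latencies are all made up for illustration. Each instruction is a (destination, latency, sources) tuple. With `partial_writes_merge=True`, a write also waits for the previous writer of its destination, which is the false dependency described above.

```python
def finish_times(instrs, partial_writes_merge):
    """Return the cycle at which each instruction completes (toy model).

    instrs: list of (dest_register, latency, source_registers).
    partial_writes_merge: if True, a write to `dest` must wait for the
    previous writer of `dest`, because the old upper half has to be merged
    in. If False, the write starts a fresh value (renamed or zero-extended).
    """
    ready = {}  # register name -> cycle at which its latest value is ready
    done = []
    for dest, latency, sources in instrs:
        # True dependencies: wait for all source operands.
        start = max((ready.get(s, 0) for s in sources), default=0)
        if partial_writes_merge:
            # False dependency: wait for the previous writer of dest.
            start = max(start, ready.get(dest, 0))
        ready[dest] = start + latency
        done.append(ready[dest])
    return done


# Hypothetical program: a slow 20-cycle write to r0, then an independent
# 1-cycle partial write to r0, then a 1-cycle instruction that reads r0.
program = [("r0", 20, []), ("r0", 1, []), ("r1", 1, ["r0"])]
```

In this model the merging machine finishes the three instructions at cycles 20, 21 and 22, while the non-merging machine finishes them at cycles 20, 1 and 2: the second and third instructions no longer wait for the slow one.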

The possible solutions to this problem can be summarized as follows:

    (1). Make the new registers independent of the previous smaller registers. This is the solution that was used in the transition from 64-bit vectors (MMX) to 128-bit vectors (XMM). The advantage is that there is no false dependency. The disadvantage is that there are more registers to save on every task switch and that we need new instructions for moving data between the new and the old register set. The now-obsolete MMX registers and all instructions relating to them are still supported for the sake of backwards compatibility although they are rarely used.

    (2). Allow the hardware to split the register in two. A write to the lower half of an extended register is resolved by splitting the register in two independent registers of smaller size. This method is used in Intel Pentium Pro through Pentium M to handle the extension of the general-purpose registers from 8 or 16 to 32 bits. There is no false dependency as long as the two partial registers can be kept apart. But there is a penalty if the two halves have to be joined again by an instruction that reads the full register (for example for saving it on the stack). The two partial registers cannot be joined together until both have retired to the permanent register file, which takes 5 - 7 clock cycles.

    (3). Allow the hardware to split the register in two, but join them together again at the register read stage if needed. This method is used in Core 2 for the 8-16-32 bit registers. The register read stage in the pipeline will automatically insert an extra micro-operation when needed for joining the two partial registers into one. The delay is 2 - 3 clock cycles.

    (4). Don't split the register into parts. This method is used in AMD processors and in Intel Pentium 4 for the 8-16-32 bit registers. There is no penalty for managing partial registers and for joining them together, but every write to a partial register has a false dependency on previous writes to the same register or any part of it. The instruction scheduler has an extra dependency to keep track of.

    (5). Any write to a partial register causes the rest of the register to be set to zero. This method is used for the transition from 32 to 64-bit general-purpose registers (x64 instruction set). There is no false dependency and no splitting into partial registers. The processor has separate 32-bit and 64-bit modes that cannot be mixed. Thus, there is no possibility that a procedure using 64-bit instructions can lose the upper part of a register by calling a legacy 32-bit function which saves and restores a 32-bit partial register and thereby erases the upper part of the full 64-bit register. Such a call is simply not possible, but the disadvantage is that all function libraries must have a 32-bit version and a 64-bit version.

    (6). The programmer (or compiler) can remove false dependencies by zeroing the full register, or the upper part of it, before accessing the full register. The disadvantage is that the value of the full register cannot be preserved across a call to a legacy function that saves and restores the lower part of the register.

The planned extension from 128-bit XMM to 256-bit YMM will use a combination of the above methods, according to the preliminary info published by Intel (http://softwareprojects.intel.com/avx/). All instructions that write to an XMM register will have two versions: a legacy version that modifies the lower half of the 256-bit register and leaves the upper half unchanged, and a new version of the same instruction with a VEX prefix that zeroes the upper half of the register. So the VEX version of a 128-bit instruction uses method (5). It is not clear whether the legacy version of 128-bit instructions will use method (2), (3) or (4). A new instruction, VZEROUPPER, clears the upper half of all the YMM registers, according to method (6).
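
The difference between the two encodings can be sketched with a small Python model that treats a YMM register as a 256-bit integer. The class and method names (`YmmRegister`, `write_legacy_sse`, `write_vex128`) are invented for this illustration and only mirror the behaviour summarized above from Intel's preliminary documentation; nothing here models timing.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


class YmmRegister:
    """Toy model of one 256-bit YMM register stored as a Python integer."""

    def __init__(self, value=0):
        self.value = value & MASK256

    def write_legacy_sse(self, low128):
        # Legacy (non-VEX) 128-bit write: replace bits 0-127, keep bits 128-255.
        self.value = (self.value & ~MASK128) | (low128 & MASK128)

    def write_vex128(self, low128):
        # VEX-prefixed 128-bit write: replace bits 0-127, zero bits 128-255.
        self.value = low128 & MASK128

    @property
    def upper(self):
        return self.value >> 128

    @property
    def lower(self):
        return self.value & MASK128


def vzeroupper(registers):
    """Zero the upper half of every register, keeping the lower halves."""
    for reg in registers:
        reg.value &= MASK128
```

A legacy write leaves `upper` untouched, which is what forces the hardware to merge old and new data, while a VEX write or `vzeroupper` discards it.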

Now, I wonder if we really need the complexity of having two versions of all 128-bit instructions. Let's look at where this is needed.

The possibility of writing to the lower half of a YMM register and leaving the upper half unchanged is needed only in the following scenario: A function using a full YMM register calls a legacy function which is unaware of the YMM extension but saves the corresponding XMM register before using it, and restores the value before returning. The calling function can then rely on the full YMM register being unchanged.

However, this scenario is only relevant if the legacy function saves and restores the XMM register. Whether a function has to save and restore a particular register is specified in the ABI (Application Binary Interface) standard for the operating system in question. The ABI for 64-bit Windows specifies that registers XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.

The proposed design of the AVX instruction set allows a possible amendment to the ABI for 64-bit Windows, specifying that YMM6 - YMM15 should have callee-save status. The advantage of callee-save registers is that local variables can be saved in registers rather than in memory across a call to a library function.

The disadvantage of this hypothetical specification of callee-save status to YMM6 - YMM15 in a future Windows 64 ABI is that we will have a penalty for using a full YMM register after saving and restoring the partial register. The cost of this is unknown at present because we don't know if method (2), (3) or (4) will be used. It is clear, however, that the penalty will not be insignificant because Intel wouldn't have defined the VZEROUPPER instruction and recommended the use of it unless there is some situation where the penalty of partial register access is higher than the cost of zeroing the upper half of all sixteen YMM registers. But if VZEROUPPER is used for reducing the penalty of partial register access then we have destroyed the advantage of callee-save status because all the YMM registers are destroyed anyway. This is a catch-22 situation! If there is any significant penalty to partial register access then there is no point in defining callee-save status to YMM registers.

So the advantage of having two different versions of all 128-bit instructions is minimal at best. Now, let's look at the disadvantages:

    * There will be a penalty for mixing the legacy XMM instructions using partial register writes with any of the full YMM instructions.

    * Compilers will need a switch for compiling 128-bit XMM instructions with or without VEX prefix. Software developers will have problems avoiding the penalty of mixing code with and without VEX prefixes.

    * It will be very hard for programmers using vector intrinsics to avoid mixing the different kinds of instructions.

    * All function libraries using vector registers should have two versions of every function: a legacy version for backwards compatibility, and a version with VEX prefixes on all XMM instructions for use in procedures that use YMM registers.

But if we have two versions of every library function then we don't have to care about YMM registers being saved across a call to a legacy library function, because the compiler will insert a call to the VEX version of the library, which can save and restore the full YMM registers if required by the ABI.

The solution of having two versions of all XMM instructions looks to me like a shortsighted patch in an otherwise well designed and future-oriented ISA extension. The problem will appear again in all future extensions of the size of the vector registers.

I wonder if Intel designers have thought all these problems through. They are not in the habit of commenting on discussions like this, but let's hope that they are reading this. I don't know whom to contact for comments.

Intel's specification of the AVX extension looks to me like a draft, published prematurely as a response to AMD's SSE5 stunt. Maybe there is still time to discuss which solution is optimal?

Agner Fog http://www.agner.org/optimize


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Tue Jun 17, 2008 6:02 pm

Joined: Fri Mar 21, 2008 4:07 pm
Posts: 51
Agner wrote:
...

I wonder if Intel designers have thought all these problems through. They are not in the habit of commenting on discussions like this, but let's hope that they are reading this. I don't know whom to contact for comments.
...


:?:

Go here : http://softwarecommunity.intel.com/isn/ ... us/Forums/

This is the direct link to AVX subforum : http://softwarecommunity.intel.com/isn/ ... Forum.aspx


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Wed Jun 18, 2008 9:08 am

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 102
> I don't know whom to contact for comments.

The "Intel Architecture Press Briefing" presentation from March 2008
lists Ronak Singhal as a Principal Engineer on its first page. His Intel
e-mail address is pretty easy to guess...


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Wed Jun 18, 2008 10:17 am

Joined: Wed Jun 27, 2007 10:19 am
Posts: 320
Location: Milano, Italy
Agner wrote:
* It will be very hard for programmers using vector intrinsics to avoid mixing the different kinds of instructions.

From a software development POV the proliferation of x86 vector extensions has become such an ugly mess that I really don't want to touch it anymore, not even with a stick. There's no way to maintain various paths written using intrinsics; testing and development would take too much time.
Personally I'm thinking of replacing the hardcoded intrinsic-based vector code with JIT-compiled code optimized for the host machine using LLVM. Feeding it with proper 'vector friendly' code seems to yield very good results and entirely removes the problem of having multiple paths. Besides, it has the non-trivial advantages of being extremely flexible and offering support for non-x86 processors.
Naturally, since the LLVM libraries are C++, this doesn't currently work for C projects. I'm looking into bridging it, if it's not too tricky.

On a side note, Larrabee is supposed to have yet another vector extension, so I am under the impression that Intel is in the process of making x86 SIMD extensions an even worse mess than they already are.


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Wed Jun 18, 2008 3:14 pm

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 784
Location: Great white north
Gabriele Svelto wrote:
On a side note, Larrabee is supposed to have yet another vector extension, so I am under the impression that Intel is in the process of making x86 SIMD extensions an even worse mess than they already are.


I think the odds are pretty good that Intel will totally revamp the SIMD
portion of the IA64 ISA starting with Poulson. Whether it resembles
AVX, Larrabee, or something altogether new remains to be seen but
there would be advantages in compiler support for common features
and capabilities from a high level point of view.


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Wed Jun 18, 2008 4:35 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 650
Gabriele Svelto wrote:
Agner wrote:
* It will be very hard for programmers using vector intrinsics to avoid mixing the different kinds of instructions.

From a software development POV the proliferation of x86 vector extensions has become such an ugly mess that I really don't want to touch it anymore, not even with a stick. There's no way to maintain various paths written using intrinsics; testing and development would take too much time.
Personally I'm thinking of replacing the hardcoded intrinsic-based vector code with JIT-compiled code optimized for the host machine using LLVM. Feeding it with proper 'vector friendly' code seems to yield very good results and entirely removes the problem of having multiple paths. Besides, it has the non-trivial advantages of being extremely flexible and offering support for non-x86 processors.
Naturally, since the LLVM libraries are C++, this doesn't currently work for C projects. I'm looking into bridging it, if it's not too tricky.

On a side note, Larrabee is supposed to have yet another vector extension, so I am under the impression that Intel is in the process of making x86 SIMD extensions an even worse mess than they already are.


What do you propose to make it better: LLVM?? You are kidding, right?
Look at the level of optimization of the applications using LLVM and you'll know why it is not good. (look at tunes ... on windows... hummhummm)

http://llvm.org/docs/LangRef.html
Does not even support PSAD ... very sad :) say bye bye to your videos ... lol

The ISA is complex because we cover a lot of ground and please a lot of people. Tell me that you don't need MPSADBW when you do motion estimation.
The ISA is not for every programmer, but having it that complex is priceless. We saw what happened to RISC ... they had to do AltiVec ...

who?


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Wed Jun 18, 2008 8:02 pm

Joined: Wed Jun 27, 2007 10:19 am
Posts: 320
Location: Milano, Italy
who? wrote:
What do you propose to make it better: LLVM?? You are kidding, right?
Look at the level of optimization of the applications using LLVM and you'll know why it is not good. (look at tunes ... on windows... hummhummm)

I suppose you haven't looked into the project. The code coming out of LLVM's JIT is much better than what many compilers produce.
Quote:
http://llvm.org/docs/LangRef.html
Does not even support PSAD ... very sad :)

MPSADBW is not a vector instruction, it is a SIMD instruction which happens to do stuff which is not easily modelled in a compiler back end nor really useful for all the code which is not video encoding. And also for video encoding it works only on a single data format.
Quote:
say bye bye to your videos ... lol

You mean that without that instruction I cannot encode video? Funny I happen to encode video on machines which do not have it.
Quote:
The ISA is complex because we cover a lot of ground and please a lot of people. Tell me that you don't need MPSADBW when you do motion estimation.

Nope, it is not required. It can *accelerate* motion estimation, it is not *needed* for it.
Quote:
The ISA is not for every programmer, but having it that complex is priceless. We saw what happened to RISC ... they had to do AltiVec ...

Yes, not 7 or 8 different extensions. While AltiVec is far from perfect, it's much better than anything I've seen from Intel or AMD.


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Thu Jun 19, 2008 1:53 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 650
That is, as usual, your biased opinion. Did you see the graph that Apple showed when moving to Intel? In Steve's presentation: 4X, 2X. If AltiVec was that great, why was it so slow? http://uk.youtube.com/watch?v=ghdTqnYnFyg
http://uk.youtube.com/watch?v=I6JWqllbhXE
who?


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Thu Jun 19, 2008 5:13 pm

Joined: Wed Jun 27, 2007 10:19 am
Posts: 320
Location: Milano, Italy
who? wrote:
That is, as usual, your biased opinion.

If I'm biased, I wonder how you would define yourself.
Quote:
Did you see the graph that Apple showed when moving to Intel? In Steve's presentation: 4X, 2X. If AltiVec was that great, why was it so slow? http://uk.youtube.com/watch?v=ghdTqnYnFyg
http://uk.youtube.com/watch?v=I6JWqllbhXE
who?

Three things:
a) Pulling Steve Jobs into a technical discussion is gross
b) I thought we were talking about instruction sets not implementations
c) Can you please stick to the topic for once?


Post subject: Re: Consequences of extending XMM registers to YMM
PostPosted: Thu Jun 19, 2008 6:32 pm

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 196
Location: Switzerland
Gabriele Svelto wrote:
MPSADBW is not a vector instruction,


and one of the few instructions without a VEX.256 variant, so not very interesting going forward and not very relevant when discussing AVX


Post subject:
PostPosted: Thu Jun 19, 2008 7:50 pm

Joined: Fri Mar 21, 2008 4:07 pm
Posts: 51
I've seen that you posted this in the Intel Software forum. Can you comment on their reply?


Post subject:
PostPosted: Thu Jun 19, 2008 8:41 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 650
savantu wrote:
I've seen that you posted this in the Intel Software forum. Can you comment on their reply?


Here is the answer from my buddy Mark.

http://softwarecommunity.intel.com/isn/ ... 57153.aspx
Mark Buxton on Intel forum wrote:
Hi Professor Fog,

Thank you for your detailed and insightful comments.

Among the important boundary conditions to add to your list is that some of today’s drivers use legacy SSE. These are being reached via interrupt and the existing drivers can’t save the upper part of the live YMM registers. Theoretically a new OS could take care of this for the ISR, but we didn’t want to penalize users of the legacy architecture and could not mandate a major OS re-write.

Over a decade ago, when we defined the legacy SSE instructions, we had no vision of how we were going to extend the vector length. The new Intel AVX 128 instructions are defined as zeroing the upper part of the register (bit 128 to infinity), and we have a new state management (XSAVE/XRSTOR) that can manage state in a forward compatible manner. So if and when we extend the vector size further, the magnitude of this issue should be much smaller (driver writers, you have been warned :) ).

So in brief, our solution was:

1) Legacy and new code can intermix; the first transition from 256b usage to legacy 128b will have a transition penalty (in the Sandy Bridge implementation it will cost several tens of cycles depending on what’s in the pipe; in extreme cases it can be much longer). The HW will take care to save the upper parts and later restore them (with a similar performance penalty) when returning to 256b code.

2) The compiler will map C or 128b intrinsics to either the legacy 128b or the new zeroing-upper 128b forms based on a switch. A programmer could also write in assembly, and they need to keep the above in mind. Inline assembly can also be encoded to the new zeroing-upper 128b forms based on a switch. The expected result of this is that mixing legacy SSE and new 256b instructions would mostly happen at or above the function level.

3) The vZeroUpper instruction is cheap and the recommended ABI is to save the live upper part of the YMM register and issue vZeroUpper prior to calling an external function. BTW, it will be good programming practice to "deallocate" the upper part (using vZeroUpper) or the whole registers (using vZeroAll) when finishing a part of the program that uses these registers. (Optimization recommendation: this will help free up resources in the OOO machine and reduce time spent in task switches.) The result will be that the small overhead of vZeroUpper is the only cost of this scheme.

4) We will add performance events that count the two transitions from #1 above to help with debugging any SSE->256b transitions.

To help understand the decision process a bit better, allow me to provide some historical context. Over the past several years we evaluated all the options you have above, but particularly options 1 and 5 (and of course the current option...). Options involving partial dependencies or partial state management were difficult to make fast and/or power efficient. It also did not seem right that the new state should penalize all users of the legacy architecture.

Your "option 1" is in a couple of ways a desirable architecture. It avoids any concern of legacy code perturbing your new state. On the other hand, there are advantages to being able to interoperate cleanly with the current XMM register state: it is entrenched in the ABIs (as you note, XMM is used for argument passing), and we wanted to bring the encoding advantages of the NDS (non-destructive source) and new operations to scalar and short vector code (scalar dominates most FP code today). The big problem is that doubling the name space for what is, in effect, the same functionality is inefficient both inside the processor and in the state save image.

Your "option 5" (the legacy instructions to zero the new upper state) is really attractive. The real issue here is that popular operating systems do not preserve all state on switching to an interrupt service routine (ISR). A surprising number of Ring-0 components active during ISR use XMM's - if only to move or initialize data. There are only two solutions to this problem: the OS must save all the new state or the drivers must be rewritten. There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider for example the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding state management overhead to ISR servicing on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors. Architecturally, therefore, we had to prevent legacy drivers from having side effects on the new (upper) state. This means they had to merge. New drivers, aware of how to use XSAVE, can manage AVX state in a forward compatible manner so as not to break apps. So for the AVX-prefixed (short vector or scalar) instructions we allow a zeroing behavior.

What we decided to do was to optimize for the common scenario of uniform blocks of 128-bit (or all scalar) code separated from uniform blocks of 256-bit code. We maintain an internal record of when we transition between states where the upper bits contain something nonzero - to a point where the state is guaranteed to be zero. We give you a fast (1* cycle throughput) way to the second state –VZEROUPPER (though VZEROALL, XRSTOR and reboot also work). Once you're in that state of Zeroed-upperness, you can execute 128-bit code (or scalar) - VEX prefixed or not - and you pay no transition penalty. You can also transition back to executing 256-bit instructions – also with no penalty. You can transition freely between any VEXed instruction of any width and pay no penalty. The downside is that if you try to move from 256-bit instructions to legacy 128-bit instructions without that VZEROUPPER, you're going to pay. The way we chose to make you pay is to optimize for the common use of long blocks of SSE code – you pay once during the transition to legacy 128-bit code instead of on every instruction. We do it by copying the upper 128-bits of all 16 registers to a special scratchpad and this copying takes time (something like 50 cycles - still TBD). Then the legacy SSE code can operate as long as it wants with no penalty. When you transition back to a VEX-prefixed instruction, you have to pay the penalty again to restore state from that scratchpad.
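
As a rough back-of-the-envelope illustration of why this matters, the following Python sketch compares the two cases for code that repeatedly calls a legacy SSE function from 256-bit code. The figures come from this reply and are explicitly preliminary (about 50 cycles per transition, still TBD, and roughly 1 cycle for VZEROUPPER). The function name, and the assumption that every call causes exactly one save and one restore, are illustrative only; the cost of spilling any live upper halves is ignored.

```python
def transition_overhead(calls, use_vzeroupper,
                        penalty_cycles=50, vzeroupper_cycles=1):
    """Estimate extra cycles spent on state transitions (toy arithmetic).

    calls: number of calls from 256-bit code into a legacy SSE function.
    penalty_cycles and vzeroupper_cycles are the preliminary figures
    quoted in the text, not measured values.
    """
    if use_vzeroupper:
        # One VZEROUPPER before each call; no save/restore is triggered.
        return calls * vzeroupper_cycles
    # One save on entering legacy code, one restore on returning to VEX code.
    return calls * 2 * penalty_cycles
```

Under these assumptions, 1000 calls cost about 100000 cycles of transition overhead without VZEROUPPER and about 1000 cycles with it.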

The solution for this problem is for software to use VZEROUPPER prior to leaving your basic block of 256-bit code. Use it prior to calling any (ABI compliant) function, and prior to any blind return jump.

We've also expanded the ABI in a backwards compatible way - functions can declare and pass 256-bit arguments, but >>new state is caller save<<. Making ymm6-15 callee save on Windows-64 would require changes in several runtime libraries related to exception handling and stack unwinding, making the changes incompatible with the current libraries. Having all new state as caller-save is slightly inefficient but the gain due to full backward compatibility outweighs the loss of performance. For full details of the ABI extensions for each OS, see the Spring Intel Developer Forum AVX presentation (https://intel.wingateweb.com/SHchina/pu ... 0r_eng.pdf). For code that doesn't vectorize, a mixture of scalar (or 128-bit vector) AVX and legacy 128-bit instructions is completely painless so long as the upper state has been zeroed beforehand (I would rely on compliant callers to issue VZEROUPPER, or if paranoid you could zero it yourself prior to executing the function body).

So back to your scenario: A function using 256-bit AVX instructions cannot assume the callee will not modify the high bits in the YMM’s. Bits 128-255 of the YMM’s are caller save. The caller must also issue a VZEROUPPER in case the callee hasn't been ported to use AVX. There's always a tradeoff in caller/callee save, and the benefit here is that a callee that wants to use 256-bit instructions doesn't have to worry about preserving this new state.
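
The calling convention described here can be sketched in Python by modelling each YMM register as a 256-bit integer. This is only an illustration of the caller-save rule: `call_with_caller_save` and its parameters are hypothetical names, and a real compiler would spill to the stack with ordinary store instructions.

```python
MASK128 = (1 << 128) - 1


def call_with_caller_save(regs, live, callee):
    """Call `callee(regs)` while preserving bits 128-255 of the live registers.

    regs: list of 256-bit integers, one per YMM register (modified in place).
    live: indices of registers whose upper halves the caller still needs.
    callee: any function taking the register list; it may clobber upper halves.
    """
    # 1. Caller saves the upper halves it still needs.
    saved_upper = {i: regs[i] >> 128 for i in live}
    # 2. VZEROUPPER: zero the upper half of every register before the call.
    for i in range(len(regs)):
        regs[i] &= MASK128
    # 3. The callee runs; it is free to destroy bits 128-255.
    callee(regs)
    # 4. Caller restores the saved upper halves on top of the current lower halves.
    for i, upper in saved_upper.items():
        regs[i] = (upper << 128) | (regs[i] & MASK128)
```

The lower halves are left to the existing ABI rules (for example, callee-save XMM6 - XMM15 on 64-bit Windows); this sketch only covers the new upper state.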

> Now, I wonder if we really need the complexity of having two versions of all 128-bit instructions

The non-destructive source and the compact encoding are >major< performance features that apply to scalar and short vector forms. Indeed, I expect the performance upside of these on general purpose, compiled code (that doesn't always vectorize so well) to match the upside of the wider vector width. New operations like broadcast and true masked loads and stores and the 4th operand (for the new 3-source instructions) are only available under the VEX prefix. You can freely intermix 128-bit VEX instructions with legacy 128-bit instructions. And you can intermix blocks of 256-bit instructions with legacy 128-bit instructions so long as you follow two rules: use VZEROUPPER prior to leaving a basic block containing 256-bit and adhere to caller-save ABI semantics on new state.

> There will be a penalty for mixing the legacy XMM instructions using partial register writes with any of the full YMM instructions. Is there a penalty only when reading a full register after writing to the partial register, or are there other situations where mixing instructions with and without VEX causes delays?

Without VZEROUPPER, there will be a penalty when moving from 256-bit instructions to legacy 128-bit instructions, and another penalty when moving from 128-bit instructions back to 256-bit instructions. Consider this code:

VADDPS ymm0, ymm1, ymm2

ADDSS xmm3, xmm4

VSUBPS ymm0, ymm0, ymm2

Here we would have a penalty on the ADDSS and a second penalty on the VSUBPS. (In this case it would be best just to use the VEX form of ADDSS).
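
The transition rules described in this reply can be summarized as a small state machine. The Python sketch below is an interpretation of the text, not Intel's specification: the state names, the instruction categories (`vex256`, `vex128`, `legacy`, `vzeroupper`) and the function name are invented, and it assumes that VZEROUPPER always leads to the clean state without a penalty.

```python
CLEAN = "clean"  # upper halves known to be zero
DIRTY = "dirty"  # upper halves may hold live data
SAVED = "saved"  # upper halves parked in the scratchpad while legacy code runs


def count_transition_penalties(stream, state=CLEAN):
    """Count save/restore penalties for a sequence of instruction kinds."""
    penalties = 0
    for kind in stream:
        if kind == "vzeroupper":
            state = CLEAN
        elif kind == "legacy":
            if state == DIRTY:
                penalties += 1  # copy upper halves to the scratchpad
                state = SAVED
            # In CLEAN or SAVED state, legacy code runs without penalty.
        elif kind in ("vex128", "vex256"):
            if state == SAVED:
                penalties += 1  # restore upper halves from the scratchpad
                state = DIRTY
            elif kind == "vex256":
                state = DIRTY   # 256-bit results make the upper state live
            # A vex128 instruction leaves CLEAN or DIRTY unchanged.
        else:
            raise ValueError("unknown instruction kind: " + kind)
    return penalties
```

For the three-instruction example above (`vex256`, `legacy`, `vex256`) this model counts two penalties; with the VEX form of ADDSS (`vex256`, `vex128`, `vex256`) it counts none.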

> Compilers will need a switch for compiling 128-bit XMM instructions with or without VEX prefix. Software developers will have problems avoiding the penalty of mixing code with and without VEX prefixes.

Absolutely correct on the compiler switch (the compiler has a switch to select between AVX and legacy SSE generation – on the Intel compiler it’s QxG). But since the compilers (at least for high level languages) will generate VZEROUPPER and caller-save semantics on new state whenever they succeed in autovectorizing, there won't be penalties. There are still transition penalties that happen due to asynchronous events (such as ISR's that use XMM's) but these are not common enough to be a significant performance concern.

The tools will not generate VZEROUPPER on intrinsics or assembly - here, we give the developer the full right/control to shoot their performance in the foot. It is a particular concern of mine that people writing intrinsics be aware they still have to use VZEROUPPER!

> It will be very hard for programmers using vector intrinsics to avoid mixing the different kinds of instructions.

I hope it will not be too hard, but it is a concern. I've ported a number of applications with both an internal version of the Intel(R) compiler - with and without intrinsic and assembly versions of some hotspots - and the transitions haven't shown up as a noticeable contribution. Autovectorizing compilers like the Intel(R) compiler don't have any problems. In the Intel compiler, when you compile with QxG, all your intrinsics (128-bit or 256-bit) get a VEX prefix. The only diligence required on the part of the programmer is to ensure that you have a VZEROUPPER prior to leaving that block of intrinsics. We have a tool used internally (the Intel(R) software development emulator) that identifies any transitions. We also plan to have a performance monitoring counter in the hardware to allow tools like Intel(R) VTune to show transition penalties.

> Do you expect all function libraries using XMM registers to have two versions of every function: a legacy version for backwards compatibility, and a version with VEX prefixes on all XMM instructions for calling from procedures that use YMM registers?

What we wouldn't do is have lots of library versions just to work around the lack of a callee-save ABI. We would (and do today) support multiple versions of our performance libraries optimized for different processors. But the 80/20 rule applies: most library functions are not critical performance bottlenecks and the legacy (non-VEXed) implementations work there just fine - with no transition penalties to apps that use 256-bit AVX. In the long run of course, there are no advantages to the non-VEX forms and we would like to move the industry in the direction of AVX.

> The solution of having two versions of all XMM instructions looks to me like a shortsighted patch in an otherwise well designed and future-oriented ISA extension

> The problem will appear again in all future extensions of the size of the vector registers. How do you plan to solve the problem next time the register size is increased? Will we have two versions of every YMM instruction when ZMM is introduced, and two versions of every ZMM instruction when.... This would be a waste of the few unused bits that are left in the VEX prefix.

I have a much stronger feeling that we have reached an excellent compromise. There are significant benefits to the non-destructive destination and new instruction forms that apply to scalar and short vector operations - and if you don't want/need to port, we interoperate at very high performance with the legacy instructions. Software and tools developers have very few rules to follow to avoid performance overhead: 1) when you use 256-bit forms of instructions, make sure you VZEROUPPER when leaving the block and 2) realize that the ABI remains caller save for new state. And of course - don't intentionally write code like the example above :)

I'm glad you raised the future of ZMM. If/when we adopt the next natural extension (to 512 bits or whatever), we would face the same problems with 'Legacy 256-bit' so long as the asynchronous parts of the system are not saving state in a forward-looking way, or the OS does not decide to step in and manage state on ISR's. So this is a call to the industry planning to adopt AVX in their interrupt service routines: use XSAVE. It's there to guarantee you can save both today’s <and> tomorrow's state!
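To make the "forward-looking" aspect of XSAVE concrete: code that saves state with XSAVE does not hard-code the size of the save area but asks the processor for it through CPUID leaf 0DH, so the same code keeps working when wider registers add more state. The following is a minimal sketch only; the function name `xsave_area_size` is hypothetical, and it assumes GCC or Clang with the `<cpuid.h>` helpers. It only queries the size and does not execute XSAVE itself.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns the number of bytes an XSAVE area needs
   for the state components currently enabled in XCR0, or 0 if XSAVE is
   not available (or the target is not x86). */
uint32_t xsave_area_size(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1: ECX bit 26 reports XSAVE support. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 26)))
        return 0;

    /* CPUID leaf 0DH, sub-leaf 0: EBX holds the save area size for the
       features enabled in XCR0. */
    if (!__get_cpuid_count(0x0D, 0, &eax, &ebx, &ecx, &edx))
        return 0;

    return ebx;
#else
    return 0;
#endif
}
```

A caller would allocate a 64-byte aligned buffer of the returned size rather than a fixed 512-byte FXSAVE buffer, which is what lets the same routine cover both today's and tomorrow's state.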

> It is probably too late to change the AVX spec now, although it looks to me like a draft, published prematurely as a response to AMD's SSE5 stunt?

Let's keep having the dialogue, but I believe the software apps will be greatly benefitted by the present architecture (and microarchitecture). As we did with the Nehalem instructions, we believe it's best for the software to have lead time to prepare the OS, compilers, tools, and apps early. The AVX spec is not a draft :)

Regards, Mark






 Post subject:
PostPosted: Fri Jun 20, 2008 9:44 pm 
Offline

Joined: Fri Sep 07, 2007 10:31 am
Posts: 25
Location: Denmark
savantu wrote:
I've seen you posted this in the Intel Software forum. Can you comment on their reply?


I am very happy with his reply. Explains everything in detail.


 Post subject:
PostPosted: Fri Jun 20, 2008 10:47 pm 
Offline

Joined: Fri Mar 21, 2008 4:07 pm
Posts: 51
Agner wrote:
savantu wrote:
I've seen you posted this in the Intel Software forum. Can you comment on their reply?


I am very happy with his reply. Explains everything in detail.


In light of this, how do you view the choices Intel made?

