Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour





 Post subject: Intel AVX kills AMD SSE5
PostPosted: Tue Jun 17, 2008 3:28 pm 

Joined: Fri Sep 07, 2007 10:31 am
Posts: 41
Location: Denmark
When AMD published their new ISA extension named SSE5 in late August 2007, they also introduced a new instruction code format for instructions with 3 or 4 operands. When Intel presented their AVX extension in April this year they introduced another code format that also supports 3 or 4 operands. These two formats are very different. We are now in a position where AMD and Intel are using completely different coding schemes for the same instructions. This is every programmer's nightmare! I cannot imagine any significant number of programmers making three versions of their code: one for AMD, one for Intel, and one for compatibility with older processors.

The forking of instruction sets and coding schemes is one of the less desirable consequences of free competition. We would all prefer some kind of international standardization committee that could approve new instruction codes. Such a committee would be reluctant to accept new shortsighted patches that add just another complication to instruction decoding. They would have weeded out the bizarre undocumented instructions from the old 8086 days that are still supported. And they might not accept the addition of new instructions to the already bulging instruction set mainly for marketing reasons with little technical benefit. Unfortunately, there is little hope that such a committee will be formed.

I have looked into the details of the two competing instruction formats and made a comparison:

    * Both ISA extensions are compatible with all existing code.

    * SSE5 supports 3 operands for new instructions only. AVX extends existing instructions to 3 operands as well. Almost all existing instructions on XMM registers are extended to 3 operands, and the code format makes room for also extending general-purpose register instructions to 3 operands.

    * SSE5 supports instructions with 4 operands, but only if two of the operands are the same register. AVX supports any combination of 4 registers by adding an extra code byte. Future extension to 5 operands is possible.

    * SSE5 makes instructions longer. AVX makes some instructions longer and some instructions shorter, but most instructions keep the same length as before despite containing one more register operand and other new information.

    * SSE5 adds yet another complication to the already very complicated instruction decoding procedure. AVX makes instruction decoding simpler by sanitizing a lot of old patches. The many prefixes and escape bytes that pester the current instruction set are joined together into a single "VEX" prefix that is 2 or 3 bytes long.

    * AVX supports the extension of the 128-bit vector registers (XMM registers) to 256 bits (YMM registers) with room for further extensions in the future. SSE5 has no room for new extensions.

    * AVX has 3 unused bits for future extensions to the now overloaded opcode map. This means no new shortsighted patches for the foreseeable future.

Before I saw the AVX documentation, I would have denied that it was possible to add so much new information without making instructions longer. The trick is that it makes one long prefix instead of many short prefixes. One or a few bits in the new VEX prefix contain the same information as a whole 8-bit or even 16-bit prefix or escape code in the current coding scheme. The two VEX prefixes are made out of two obsolete instructions, LDS and LES, which are valid in 16- and 32-bit mode but invalid in 64-bit mode. Certain bits in the VEX prefix that indicate register extensions available only in 64-bit mode are placed in such a way in the VEX prefix that the only values valid in 32-bit mode form an invalid register operand if interpreted as a legacy LDS or LES instruction. This is a solution no less ingenious than the x64 extension invented by AMD.
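The disambiguation rule described above can be sketched in a few lines of C++. This is only an illustration of the idea, not a real decoder: the names `Kind` and `classify` are made up for the example, and it assumes that byte 0xC4 (legacy LES) introduces the 3-byte VEX prefix and 0xC5 (legacy LDS) the 2-byte one.

```cpp
#include <cstdint>

// Hypothetical classification of a byte pair at the start of an instruction.
enum class Kind { LegacyLesLds, Vex2, Vex3, Other };

// first  : the opcode/prefix byte (0xC4 or 0xC5 are the interesting cases)
// second : the byte that follows it
// mode64 : true when decoding 64-bit code
Kind classify(std::uint8_t first, std::uint8_t second, bool mode64) {
    if (first != 0xC4 && first != 0xC5) return Kind::Other;

    // In 64-bit mode LES/LDS do not exist, so the byte is always VEX.
    // In 16/32-bit mode LES/LDS need a memory operand, so a following byte
    // whose top two bits are 11 (the register form of ModRM) cannot be a
    // legacy instruction and is treated as part of a VEX prefix.
    const bool isVex = mode64 || (second & 0xC0) == 0xC0;
    if (!isVex) return Kind::LegacyLesLds;

    return first == 0xC4 ? Kind::Vex3 : Kind::Vex2;
}
```

For example, the pair C5 F8 classifies as a 2-byte VEX prefix in 32-bit mode, while C4 06 (LES with a memory operand) stays a legacy instruction there and becomes a VEX prefix only in 64-bit mode.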

Looking at the advantages of AVX over SSE5 there can be no doubt that AMD has no choice but to adopt AVX. There is no way AMD can stay in competition without supporting the new 256-bit vectors and the 3-operand version of all existing XMM instructions. And, incidentally, it will be easier to implement the new 3-operand instructions for AMD than it is for Intel because the current Intel microarchitecture does not allow micro-operations with more than two inputs, while the AMD microarchitecture has no such limitation.

Let me explain the advantage of 3-operand instructions to those who don't know what this is about. Most of the current instructions place the result of a calculation in the same register as one of the input operands, e.g.:
A = A * B.
With a 3-operand version, you can do:
C = A * B.
This gives the programmer the freedom to reuse the original value of A in other calculations without having to copy it to another register. The result is fewer register-to-register moves and hence more efficient and compact code.
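As a rough illustration of the point above, consider a function that needs A both in a product and in a later sum. The function name `mul_and_add` and the register assignments in the comments are invented for the example, and the assembly in the comments is a hand-written sketch rather than actual compiler output.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Hypothetical example: computes c = a * b and d = a + b element-wise.
// 'a' is still needed after the multiplication, so a destructive
// 2-operand multiply (A = A * B) forces a copy of 'a' first.
//
// Roughly what a 2-operand (SSE) encoding needs:
//     movaps xmm2, xmm0        ; extra copy to preserve a
//     mulps  xmm2, xmm1        ; xmm2 = a * b
//     addps  xmm0, xmm1        ; xmm0 = a + b
//
// With a non-destructive 3-operand (AVX) form the copy disappears:
//     vmulps xmm2, xmm0, xmm1  ; xmm2 = a * b, xmm0 untouched
//     vaddps xmm3, xmm0, xmm1  ; xmm3 = a + b
void mul_and_add(const Vec4& a, const Vec4& b, Vec4& c, Vec4& d) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] * b[i];
        d[i] = a[i] + b[i];
    }
}
```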

The SSE5 instructions will suffer the same fate as AMD's 3DNow instructions. Nobody ever used the 3DNow instructions because they are not supported in Intel processors. They are superseded by the more efficient SSE instructions, but AMD have to keep supporting them in all their future processors for the sake of backwards compatibility. Let's hope that AMD have the guts to drop SSE5 altogether before it's too late. There has been some speculation that they might.

Too bad that AMD haven't seen this coming before they published their SSE5 spec. Intel must have been able to keep their plans secret despite the patent sharing agreement between AMD and Intel. Maybe there is no patent on AVX?

See also my second posting on the software and hardware consequences of extending the size of the vector registers in the thread http://aceshardware.freeforums.org/cons ... .html#6825


 
 
 Post subject:
PostPosted: Wed Jun 18, 2008 12:59 am 

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
I tend to agree with you. Let's hope they are going to be gentlemen and do the right thing.

who?


 
 Post subject:
PostPosted: Wed Jun 18, 2008 9:04 am 

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 119
> Let's hope they are going to be gentlemen and do the right thing.

So AVX will suck in those parts of SSE5 that it's missing?

And Larrabee will adopt AVX, instead of using its own encoding?

Yeah... right.


 
 Post subject:
PostPosted: Wed Jun 18, 2008 1:46 pm 

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Larrabee is closer than CUDA, isn't it?

who?


 
 Post subject:
PostPosted: Mon Jun 23, 2008 6:56 pm 

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 86
Agner wrote:
...This is every programmer's nightmare! I cannot imagine any significant number of programmers making three versions of their code: one for AMD, one for Intel, and one for compatibility with older processors.
I guess this would be an illustration of your nightmare:

Image

Then ... would the following be kind of a programmer's dream? :

Image
http://sseplus.sourceforge.net/index.html

Project description:
Quote:
SSEPlus is a SIMD function library. It provides optimized emulation for newer SSE instructions. It also provides a rich set of high performance routines for common operations such as arithmetic, bitwise logic, and data packing and unpacking.


At least it sounds to me like a solution. It should be easy to add AVX instructions later, too.

Benefits, according to the project's PDF:
  • Develop with new instructions before hardware is available
  • Optimize once for target hardware, other platforms are easy
  • Ensure generated code conforms to target hardware
  • Stop worrying about instruction sets. Use instructions that match your algorithm
  • Open source: If a function is missing -> add it
  • Feedback loop: High value added functions may become hardware instructions

Quote:
And, incidentally, it will be easier to implement the new 3-operand instructions for AMD than it is for Intel because the current Intel microarchitecture does not allow micro-operations with more than two inputs, while the AMD microarchitecture has no such limitation.

Could you explain this a little bit further? Sounds interesting, as I also speculate that AMD will launch SSE5 already with Shanghai, i.e. this winter. My reasons for this are:

a) The above open source project (they have already coded a native SSE5 part; I wonder why they would do this now if SSE5 were really to be released only with Bulldozer, sometime in 2010/2011).

b) Mention of SSE5 in AMD's CPUID documentation (alongside SSE4.1 and SSSE3):
http://www.amd.com/us-en/assets/content ... /25481.pdf

c) A report in a German magazine: they interviewed some AMD people during CeBIT and, according to the report, the AMD people stated that the 45nm CPUs are equipped with SSE5 (unfortunately, I lost the link :( ).
However, some AMD marketing guy in the US denied everything the day after ^^

Quote:
The SSE5 instructions will suffer the same fate as AMD's 3DNow instructions. Nobody ever used the 3DNow instructions because they are not supported in Intel processors. They are superseded by the more efficient SSE instructions, but AMD have to keep supporting them in all their future processors for the sake of backwards compatibility. Let's hope that AMD have the guts to drop SSE5 altogether before it's too late. There has been some speculation that they might.
Without the above SSEPlus approach I would agree. With it, however, an easy upgrade path from SSEx to AVX is possible, though I wonder what kind of support AMD could get. I had never heard of this before; I just found it accidentally while using Google ;-)

cheers

Opteron


 
 Post subject:
PostPosted: Mon Jun 23, 2008 8:01 pm 

Joined: Mon Jul 23, 2007 1:48 am
Posts: 81
SSEPlus looks nice, but I have my doubts about it. If someone (or some entity) is willing to ensure that there is good support, documentation and updates, then maybe it'll have more potential. It also seems a bit risky for a company to try SSEPlus. I haven't really looked into the site too much, but what happens if SSEPlus breaks or the developers walk away?


 
 Post subject:
PostPosted: Mon Jun 23, 2008 10:20 pm 

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 86
cornelius785 wrote:
SSEPlus looks nice, but I have my doubts about it. If someone (or some entity) is willing to ensure that there is good support, documentation and updates, then maybe it'll have more potential. It also seems a bit risky for a company to try SSEPlus. I haven't really looked into the site too much, but what happens if SSEPlus breaks or the developers walk away?


It is officially backed by AMD, just found the project's "homepage":
http://developer.amd.com/cpu/Libraries/ ... fault.aspx

cheers

Opteron


 
 Post subject:
PostPosted: Tue Jun 24, 2008 4:43 am 

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 119
> I also speculate that AMD will launch SSE5 already with Shanghai [...]

Judging from the sample chips, your speculation is incorrect. :)


 
 Post subject:
PostPosted: Tue Jun 24, 2008 5:14 pm 

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Opteron wrote:
Agner wrote:
...This is every programmer's nightmare! I cannot imagine any significant number of programmers making three versions of their code: one for AMD, one for Intel, and one for compatibility with older processors.
I guess this would be an illustration of your nightmare:

Image


I guess the AMD guys need to look at arrays of functions when they do different CPU code paths ... their way adds many conditional jumps to get to their SSE5#!@$ ... especially because you usually do optimization in the critical path ...
I am saying this ... but many here will agree that I don't know what I am talking about .... lol!

switch case ... lol!!!! Be serious please!

who?


 
 Post subject:
PostPosted: Tue Jun 24, 2008 5:37 pm 

Joined: Thu Sep 06, 2007 3:48 pm
Posts: 235
who? wrote:
switch case ... lol!!!! Be serious please!


Isn't this similar to what Intel does in MKL or IPP when it supports multiple processors and switches based on cpu instruction set support? Or what you get when you use the Intel compiler with the -ax flag?


 
 Post subject:
PostPosted: Tue Jun 24, 2008 5:44 pm 

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
martinw wrote:
who? wrote:
switch case ... lol!!!! Be serious please!


Isn't this similar to what Intel does in MKL or IPP when it supports multiple processors and switches based on cpu instruction set support? Or what you get when you use the Intel compiler with the -ax flag?


Well, there is a little difference ... compilers are automatic systems, and here we have hand-written C/C++ code ...
IPP does use arrays of functions ... In the sample code posted here, the code shows 7 Jxx, which is a little too much ... GCC, ICC and MSVC will all put the SSE5 code path as the last Jxx.
This is manual C code, so it has to be an array of functions ... probably a marketing guy doing coding over the weekend ... This would not even compile ...
I am sure there are AMD guys able to do this, just not the guy who did the slides.
Edit: I forgot to add that the -ax flag does it only between 2 ISAs, so only one Jxx.


who?


 
 Post subject:
PostPosted: Tue Jun 24, 2008 6:04 pm 

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
For those who need to understand arrays of functions, here is a very simplistic example:
http://www.learning-computer-programmin ... tions.html

Quote:
// Program to demonstrate
// an array of functions
#include <iostream>

// -- FUNCTION PROTOTYPES --
void func1();
void func2();
void func3();
void func4();
void func5();
// -- ENDS --

int main()
{
    // array of five pointers to functions
    // that take no arguments and return void
    void (*ptr[5])();

    // each array element is made to point
    // at the respective function
    ptr[0] = func1;
    ptr[1] = func2;
    ptr[2] = func3;
    ptr[3] = func4;
    ptr[4] = func5;

    // now the array elements
    // point to different functions,
    // which are called just like
    // we access the elements of
    // an array
    for (int i = 0; i < 5; i++)
        (*ptr[i])();

    return 0;
}

// -- FUNCTION DEFINITIONS --
void func1()
{
    std::cout << "Called Func1!\n"; // TRY SSE CODE
}

void func2()
{
    std::cout << "Called Func2!\n"; // TRY SSE2 CODE
}

void func3()
{
    std::cout << "Called Func3!\n"; // TRY SSE3 CODE ...
}

void func4()
{
    std::cout << "Called Func4!\n";
}

void func5()
{
    std::cout << "Called Func5!\n";
}
// -- ENDS --

This is the efficient way to do multiple code paths.
You create one array of functions for each instruction set, you set up the pointer for each optimized code path, and you are done.
You can use C++ member function overriding to get the same result.

who?


 
 Post subject:
PostPosted: Wed Jun 25, 2008 12:08 am 

Joined: Sat Mar 22, 2008 5:10 pm
Posts: 370
who? wrote:
I guess the AMD guys need to look at arrays of functions when they do different CPU code paths ... their way adds many conditional jumps to get to their SSE5#!@$ ... especially because you usually do optimization in the critical path ...
I am saying this ... but many here will agree that I don't know what I am talking about .... lol!

switch case ... lol!!!! Be serious please!

who?


Do you understand the difference between an oversimplified example in a PDF for journalists and an actual implementation?
who? wrote:
Well, there is a little difference ... compilers are automatic systems, and here we have hand-written C/C++ code ...
IPP does use arrays of functions ... In the sample code posted here, the code shows 7 Jxx, which is a little too much ... GCC, ICC and MSVC will all put the SSE5 code path as the last Jxx.
This is manual C code, so it has to be an array of functions ... probably a marketing guy doing coding over the weekend ... This would not even compile ...
I am sure there are AMD guys able to do this, just not the guy who did the slides.
Edit: I forgot to add that the -ax flag does it only between 2 ISAs, so only one Jxx.


who?

Actually, in that case the mentioned compilers would create a jump table or a cascade, and at the beginning of the program there must be a "switch" to build the correct function table.

who? wrote:
For those who need to understand arrays of functions, here is a very simplistic example:
http://www.learning-computer-programmin ... tions.html

...

This is the efficient way to do multiple code paths.
You create one array of functions for each instruction set, you set up the pointer for each optimized code path, and you are done.
You can use C++ member function overriding to get the same result.

who?

Nice try, but wrong example. If you implement it in C, using a structure of function pointers would be far more readable and wouldn't need a cast on every call.
Also, populating it at the beginning of the program instead of just getting a pointer would save an indirection when calling each function.
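A minimal sketch of that suggestion follows, written as C++ that stays close to C. Everything in it is hypothetical: the `Isa` levels, the `Kernels` structure and the routine names are invented for illustration, the "SSE2" routine is plain C++ that only mimics a 4-wide loop, and a real program would choose the level from CPUID feature flags instead of a hard-coded value.

```cpp
#include <cstddef>

// Hypothetical instruction-set levels. A real program would derive this
// from CPUID feature flags; here it is just a parameter.
enum class Isa { Generic, Sse2 };

// Stand-ins for separately optimized code paths. Both are plain C++ so the
// sketch runs anywhere; the second one only mimics a 4-wide vector loop.
float sum_generic(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

float sum_sse2_standin(const float* p, std::size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) lane[k] += p[i + k];
    float s = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; ++i) s += p[i];  // leftover elements
    return s;
}

// One named member per routine: calls need no index and no cast.
struct Kernels {
    float (*sum)(const float*, std::size_t);
    const char* name;
};

// The only place where the instruction set is examined.
Kernels make_kernels(Isa isa) {
    switch (isa) {
        case Isa::Sse2: return Kernels{sum_sse2_standin, "sse2"};
        default:        return Kernels{sum_generic, "generic"};
    }
}

// Filled in once, by value, at program start. A call then goes straight
// through g_kernels.sum instead of first loading a pointer to a table.
Kernels g_kernels = make_kernels(Isa::Generic);
```

A caller would then simply write `g_kernels.sum(data, count)`, and the switch runs once rather than on every call in the critical path.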


 
 Post subject:
PostPosted: Wed Jun 25, 2008 7:31 am 

Joined: Wed Jun 27, 2007 10:19 am
Posts: 367
Location: Milano, Italy
who? wrote:
This is the efficient way to do multiple code paths.

I rest my case: JIT-compiling the critical path at runtime with LLVM is easier, cleaner and future-proof, doesn't bloat the executable, doesn't need jump tables or whatever, and can be tuned not only for an instruction set but also for the processor running it.


 
 Post subject:
PostPosted: Wed Jun 25, 2008 3:02 pm 

Joined: Fri Sep 07, 2007 10:31 am
Posts: 41
Location: Denmark
SSEPlus is a function library with CPU dispatcher inside. This is obviously a good way to take advantage of different instruction sets. Most math libraries already have a CPU dispatcher.

The math libraries supplied by Intel have a CPU dispatcher with the funny feature that it always chooses the worst possible path when running on an AMD processor. If you replace the CPU dispatcher inside the Intel math libraries with a non-discriminating CPU dispatcher then the program that is compiled with Intel's compiler suddenly runs much faster on an AMD machine. See my programming manual for details http://www.agner.org/optimize/optimizing_cpp.pdf.

SSEPlus has an AMD logo on it so it is obviously sponsored by AMD. It differs from Intel's function libraries by being open source. Therefore they cannot put any secret CrippleMyCompetitor() function into it. I would recommend that everybody use optimized open source function libraries instead of Intel's libraries.

The colorful picture you copied from the SSEPlus doc shows the instruction sets like a linear sequence. This deliberately obscures the fact that the instruction sets do not form the linear progression that we would like them to do, but indeed a fork between AMD and Intel.

There is no mention of the future Intel AVX and FMA instruction sets in the SSEPlus doc (yet). This has nothing to do with whether it would be advantageous for AMD to change the SSE5 spec to make it compatible with the more efficient coding scheme proposed in AVX.

If you are a programmer wanting to optimize your application you may be lucky that there is a ready made function library that does exactly what you need, and there is a CPU dispatcher inside the library. But somebody still has to build and maintain these libraries. And these libraries become huge if different CPU vendors use different codes for the same instructions.

The huge code makes the executable big and makes code caching less efficient. In many cases the optimal solution is to have only two branches in the CPU dispatcher: One branch uses the oldest instruction set that has all the instructions you need for that particular purpose and one branch for supporting older computers. No need to make a branch for SSE4.2 or whatever if there are no instructions beyond SSE3 that are useful for that particular application.
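The two-branch arrangement can be sketched as a function pointer that initially points at the dispatcher itself and is redirected on the first call. This is only an illustration under stated assumptions: `cpu_has_sse3` is a stub standing in for a real CPUID check, `total_sse3` is plain C++ standing in for an optimized routine, and the sketch ignores thread safety.

```cpp
#include <cstddef>

// Assumption: stub in place of a real CPUID feature query.
bool cpu_has_sse3() { return false; }

// Branch for older computers.
long total_baseline(const int* p, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Stand-in for the branch that uses the oldest instruction set containing
// everything this routine needs (plain C++ here, so it runs anywhere).
long total_sse3(const int* p, std::size_t n) {
    long s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) { s0 += p[i]; s1 += p[i + 1]; }
    if (i < n) s0 += p[i];
    return s0 + s1;
}

long total_dispatch(const int* p, std::size_t n);

// Callers always go through this pointer. It starts out pointing at the
// dispatcher and points at the chosen branch after the first call.
long (*total)(const int*, std::size_t) = total_dispatch;

// Runs once: picks one of exactly two branches, then forwards the call.
long total_dispatch(const int* p, std::size_t n) {
    total = cpu_has_sse3() ? total_sse3 : total_baseline;
    return total(p, n);
}
```

With only two branches the executable carries one extra copy of the routine, and every call after the first costs a single indirect call.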

Just-in-time compilation of byte code obviously removes the runtime CPU dispatcher, but the JIT compiler itself becomes bigger and more difficult to maintain the more different branches it has to support. JIT compilation in general has several disadvantages. The intermediate step in the compilation process makes it more difficult for the compiler to find possibilities for optimization. I have never seen a JIT compiled application that runs faster than a similar directly compiled binary. Furthermore, the JIT compiler and runtime framework often use much more resources than the application itself. I haven't tried LLVM, but my experience with .NET and Java applications is that they are incredibly slow compared to binaries compiled from C++. The better you want a compiler to optimize, the more time it takes to compile. JIT compilation puts this burden on the end user.

The solution to the programmer's nightmare is neither bloated function libraries with excessive branches nor JIT compilers with support for an endless number of mutually incompatible CPUs.

The optimal solution is that we all put maximal pressure on competing CPU vendors to make their instructions compatible with each other rather than going for the short-term PR gain of inventing funny new instructions before their competitor.

