Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour





Post new topic Reply to topic  [ 163 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7, 8 ... 11  Next
Author Message
 Post subject:
PostPosted: Mon Jan 26, 2009 4:04 pm 
Offline

Joined: Fri Sep 07, 2007 10:31 am
Posts: 25
Location: Denmark
fivemack wrote:
I really hope the wheel of reincarnation doesn't bring us back to the Cray world which includes registers (the 64x64 matrix in the BMM unit) too large to save automatically and defined to be destroyed by function calls.

The only way to make your code compatible with future extensions of the vector registers is to define that the vector registers, or their high part, are not saved across function calls.
The rules are already in place: The 128-bit registers xmm6-xmm15 are saved across function calls in 64-bit Windows only. No vector registers are saved in 32-bit Windows or in Unix. The extensions to YMM and future extensions to ZMM etc. are not saved across function calls. See http://www.agner.org/optimize/calling_conventions.pdf for details.
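As a rough illustration of the rules above, here is a small Python sketch that encodes which vector registers a callee preserves under each convention. The platform labels and the helper name `caller_must_save` are made up for this example; only the register ranges come from the post, and the linked PDF remains the authoritative source.

```python
# Callee-saved XMM registers per platform, as stated in the post above.
# The platform labels are hypothetical names chosen for this sketch.
CALLEE_SAVED_XMM = {
    "win64": set(range(6, 16)),  # xmm6-xmm15 are preserved across calls
    "win32": set(),              # no vector registers are preserved
    "unix": set(),               # no vector registers are preserved
}


def caller_must_save(platform, live_regs, uses_upper_halves=False):
    """Return the vector register numbers the caller has to spill before a call.

    live_regs: indexes of vector registers that hold live values.
    uses_upper_halves: True if the live values extend beyond 128 bits
    (YMM or wider); those upper parts are never preserved by the callee.
    """
    live = set(live_regs)
    if uses_upper_halves:
        # Only the low 128 bits can be callee-saved. For simplicity this
        # sketch spills the whole register rather than just the upper part.
        return sorted(live)
    return sorted(live - CALLEE_SAVED_XMM[platform])
```

For example, `caller_must_save("win64", [0, 6, 7])` only reports register 0, while the same live set holding 256-bit values has to be spilled entirely.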


 Post subject:
PostPosted: Mon Jan 26, 2009 5:09 pm 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
fivemack wrote:
I don't know how big (in square microns) register files are in contemporary hardware. 16 512-bit registers is a kilobyte, three read and one write port per AVX unit feels as if it might be visible to the naked eye even in 32nm.

Various discussions on the linux-kernel mailing list suggest that 512-bit vector registers are thought of as an explicit but dim possibility; there's a medium bag on the side of AVX so that code which doesn't know about the top halves of YMM registers doesn't destroy them, combined with an indication that there won't be this special case for ZMM.

I really hope the wheel of reincarnation doesn't bring us back to the Cray world which includes registers (the 64x64 matrix in the BMM unit) too large to save automatically and defined to be destroyed by function calls.


Don't worry about this, we've got an elegant and efficient solution.


 Post subject:
PostPosted: Mon Jan 26, 2009 8:41 pm 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
fivemack wrote:
I don't know how big (in square microns) register files are in contemporary hardware. 16 512-bit registers is a kilobyte, ...


why only 16 registers ? already in 2000, Willamette featured 128 (physical) 128-bit registers, IIRC

it's beyond me why people think it's normal for a 45 nm Larrabee to have 16/24/32 x 512-bit vector units (+ texture units) but think it's "impossible" for a 32 nm Sandy Bridge to have 6/8 such units


 Post subject:
PostPosted: Mon Jan 26, 2009 10:58 pm 
Offline

Joined: Fri Aug 17, 2007 2:55 pm
Posts: 352
Presumably architectural registers, not physical registers.


 Post subject:
PostPosted: Mon Jan 26, 2009 11:02 pm 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
TacoBell wrote:
Presumably architectural registers, not physical registers.


sure, but fivemack is talking about die area, not the ISA; the minimum will be 32 physical registers just for the architected state of 2 threads

he makes it sound like a 1 KB register file is something incredible when the very first P4 already had 2 KB
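The sizes being compared here are simple to check. The following Python sketch just multiplies register count by width; the function name is an assumption of this example, and the Willamette register count is the figure recalled from memory earlier in the thread, not a verified number.

```python
def register_file_bytes(num_registers, bits_per_register):
    """Raw storage of a register file in bytes (data bits only)."""
    return num_registers * bits_per_register // 8


# 16 architectural 512-bit registers (fivemack's figure)
arch_zmm = register_file_bytes(16, 512)
# 128 physical 128-bit registers (the Willamette figure quoted from memory above)
p4_xmm = register_file_bytes(128, 128)
# architected state only, for 2 threads of 16 x 512-bit registers each
two_threads = register_file_bytes(2 * 16, 512)

print(arch_zmm, p4_xmm, two_threads)  # 1024 2048 2048
```

This counts storage bits only. It says nothing about the read and write ports fivemack mentions, which also cost die area.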


Last edited by Eric Bron on Mon Jan 26, 2009 11:14 pm, edited 1 time in total.

 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Tue Jan 27, 2009 12:19 am 
Offline

Joined: Sat Mar 22, 2008 5:10 pm
Posts: 220
Agner wrote:
The AVX instruction format has plenty of room for expansion into bigger vectors. Hopefully, they will use the same instruction codes on the GPU.

But the code I write today targeting 256-bit AVX won't be very useful when processors reach 512-bit vectors...

Code:
gvs rax ; instruction to Get Vector Size
dec rax
;...
;for(int i = 0; i < length; i++)
mov rcx, length
xor edx, edx
loop_start:
vpmov ZMM0, buffer[rdx]
vpaddb ZMM0, ZMM0, buffer2[rdx]
vpscb ZMM1, lut[ZMM0] ;scatter/gather-style lookup, gets indexes zero-extended from ZMM0 instead of a GPR
;...
;if(a > b)
vpcmpgtb ZMM2, ZMM1, ZMM0
vptest ZMM2, ZMM2
jz endif
;...everything done on ZMM3
vpblend ZMM1, ZMM3, ZMM1, ZMM2
endif:
mov rbx, rcx
sub rbx, rdx
vplmovb buffer[rdx], ZMM1, rbx; move up to rbx bytes from ZMM1 to memory
add rdx, rax
cmp rcx, rdx
ja loop_start ; didn't check the order of any of compare and blend instructions, just the concept is important here
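To make the concept concrete, here is a scalar Python model of the same loop. It is only a sketch: `get_vector_size` stands in for the hypothetical `gvs` instruction, `then_fn` is a made-up placeholder for the elided "everything done on ZMM3" part, and the compare is treated as unsigned, which is an assumption. The point it illustrates is that the result does not depend on the vector size, because the last iteration only stores the bytes that remain.

```python
def get_vector_size():
    """Stand-in for the hypothetical `gvs` instruction: bytes per vector."""
    return 64  # pretend this CPU has 512-bit vectors


def process(buffer, buffer2, lut, then_fn, vector_size=None):
    """Scalar model of the vector-length-agnostic loop; returns a new list."""
    if vector_size is None:
        vector_size = get_vector_size()
    length = len(buffer)
    out = list(buffer)
    offset = 0
    while offset < length:
        # Valid bytes in this iteration; the tail may be shorter than a vector.
        n = min(vector_size, length - offset)
        # Byte-wise add with wraparound (the vpaddb step).
        a = [(buffer[offset + i] + buffer2[offset + i]) & 0xFF for i in range(n)]
        # Table lookup using the sums as indexes.
        b = [lut[x] for x in a]
        # Packed compare, treated as unsigned in this sketch.
        mask = [bv > av for av, bv in zip(a, b)]
        if any(mask):
            # Hypothetical "then" branch, computed for all lanes and blended in.
            c = [then_fn(av, bv) & 0xFF for av, bv in zip(a, b)]
            b = [cv if m else bv for cv, bv, m in zip(c, b, mask)]
        # Masked store: write back only the n valid bytes.
        out[offset:offset + n] = b
        offset += vector_size
    return out
```

Calling `process` on the same inputs with `vector_size=16` and `vector_size=64` gives identical output, which is the property the proposed instructions are meant to provide.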


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Tue Jan 27, 2009 9:23 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
EduardoS wrote:
gvs rax ; instruction to Get Vector Size

it's an invention of yours, right ?

EduardoS wrote:
dec rax


why ?


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Tue Jan 27, 2009 9:22 pm 
Offline

Joined: Sat Mar 22, 2008 5:10 pm
Posts: 220
Eric Bron wrote:
EduardoS wrote:
gvs rax ; instruction to Get Vector Size

it's an invention of yours, right ?

Yes, it's how I would like to program for a 128-bit vector CPU and have the code still be usable on a 512-bit vector CPU.

Eric Bron wrote:
EduardoS wrote:
dec rax


why ?

Because I made some mistakes in the code...
It isn't supposed to be functional, just to show how I would like it to work.


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Wed Jan 28, 2009 1:48 am 
Offline

Joined: Tue Jul 24, 2007 10:12 pm
Posts: 59
who? wrote:
jim_cox wrote:
Agner wrote:
Gabriele Svelto wrote:
Whatever happens it is very likely that the final decision on this matter will be taken by Microsoft.

The new 256-bit vector registers need OS support. Everything else will work in existing operating systems, including SSE5 as well as AVX and FMA instructions as long as they are used on 128-bit registers. The final decision will be taken by software producers supporting Intel or AMD instructions or both. Microsoft cares more about C# and Visual Basic than about native C++, which is the platform most likely to use the new instruction sets.


I am not sure why we need the 256-bit register expansion, or a whole new instruction set extension that will probably have a short useful life. Why the "in-between" stage? Why not jump to supporting something like a Larrabee (or other GPU-style) co-processor or its instruction extensions? If code really benefits from 256-bit registers, then wouldn't it run a whole lot faster on a Larrabee co-processor? I guess supporting asymmetric cores in the OS may take some time, but I don't think most people need 4 CPU cores, much less the 8-core beasts that are coming soon. Four cores and a vector co-processor make a lot more sense.

With AMD continuing to lose money, does anyone think Intel will push IA-64 for everyone again if AMD gets too weak? They seem to have taken AMD's strategy of x86 everywhere, but I don't know if everyone at Intel is happy about that.


I am happy :)

well, if you mean adding Larrabee-like cores to the CPU, you need to reach a certain financial position before you can do this; your die size and your average selling price have to make sense.
The success of Intel is mainly based on those kinds of choices. For example, at 65nm, a native quad core did not make any sense financially, and we all saw what happened to AMD when they tried. A native quad core with serious single-threaded performance only became feasible at 45nm; AMD learned that the hard way.

just my 2 cents.


Intel has made the same mistake. The original P4, given its architecture, needed more than 256 KB of L2 cache. Phenom needed more than 2 MB of L3 cache. AMD still seems to be way behind in cache design regardless of the mistake with Phenom. IMO, faster and larger caches are a big part of Intel's success.

As far as integrating Larrabee vector processors or units into a general-purpose CPU, this is more of a software issue than a manufacturing (die size) issue. Initial Larrabee implementations are probably going to have something on the order of 32 cores, possibly 48 at 32 nm, so including a couple of these cores or units on a CPU would not be an issue as far as die size is concerned. At 32 nm, 4-core CPUs will be mainstream, and 8-core chips high-end. How many Larrabee cores would fit in the space taken by 4 CPU cores?


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Wed Jan 28, 2009 1:57 am 
Offline

Joined: Tue Jul 24, 2007 10:12 pm
Posts: 59
Eric Bron wrote:
jim_cox wrote:
I am not sure why we need the 256-bit register expansion, or a whole new instruction set extension that will probably have a short useful life.


I think exactly the same: it would be far better to have one single ISA for Larrabee and Sandy Bridge. they say Larrabee is x86 "compatible", yet in fact we must plan for two more code paths, one for x86 CPUs and one for the x86 "compatible" GPU/GPGPU/whatever

if the die area argument stands, they should have done what they did with SSE in Katmai: process each vector in two 256-bit parts in the 1st generation of 512-bit CPUs


I am not familiar with programming for vector processors. For HPC it is fine to have people re-compile and/or rewrite code for new implementations, but this isn't the best for the consumer space. Are we going to get a wider vector ISA extension every 2 years? How would this be handled? Something like what is done with GPUs at the moment (support compilation of an intermediate language into code for your specific implementation)?


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Wed Jan 28, 2009 7:04 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
EduardoS wrote:
Yes, it's how I would like to program for a 128-bit vector CPU and have the code still be usable on a 512-bit vector CPU.


yes, it looks rather elegant, but it goes beyond the goal of an ISA; it's something you'd typically achieve with an intermediate code (with higher-level semantics), from which a JIT compiler generates the best fit

btw what will you do for instructions like MOVMSKPS/VMOVMSKPS ?


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Wed Jan 28, 2009 7:22 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
jim_cox wrote:
it is fine to have people re-compile and/or rewrite code for new implementations, but this isn't the best for the consumer space.


the problem we face here is that we not only have to re-compile and (partially) rewrite the code, we also have to target two new ISAs next year with Sandy Bridge and Larrabee. 256-bit vs 512-bit is really the easy part: after some serious refactoring it's possible to just change a flag and recompile to support 128-512+ (with cleaner source code as a welcome side effect). the more difficult issues come from differences we can expect, like scatter/gather instructions present in the Larrabee ISA but missing in AVX


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Wed Jan 28, 2009 11:51 pm 
Offline

Joined: Sat Mar 22, 2008 5:10 pm
Posts: 220
Eric Bron wrote:
yes, it looks rather elegant, but it goes beyond the goal of an ISA; it's something you'd typically achieve with an intermediate code (with higher-level semantics), from which a JIT compiler generates the best fit

Yeah, a JIT would help in a situation like the current one, where Intel expects us to support three different extensions just for their processors (SSE, AVX and Larrabee). A JIT for vectors isn't new, current GPUs do that, but they force the JIT, so there is no need to worry about supporting a lot of obsolete instructions as in Intel's case.
When the instructions are user mode I still prefer a more flexible option.

Eric Bron wrote:
btw what will you do for instructions like MOVMSKPS/VMOVMSKPS ?

They aren't really necessary, and even in their current format they could allow up to 1024-bit vectors in 32-bit mode; a simple solution could be to make the dest operand a memory location.
SSE/AVX have too many problems for me to worry about VMOVMSK...
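The 1024-bit figure follows from one mask bit per 32-bit lane landing in a 32-bit destination register. A small Python model of that arithmetic is sketched below; the function names are made up for this example, and `movmskps` here only models the sign-bit packing, not the real instruction's encoding.

```python
def movmskps(lanes):
    """Model of MOVMSKPS: pack the sign bit of each 32-bit lane into an int.

    lanes: list of unsigned 32-bit lane values, lane 0 first.
    """
    mask = 0
    for i, lane in enumerate(lanes):
        mask |= ((lane >> 31) & 1) << i
    return mask


def max_vector_bits(dest_bits, lane_bits=32):
    """Widest vector whose per-lane mask still fits in a dest_bits register."""
    return dest_bits * lane_bits


print(max_vector_bits(32))  # 1024
```

With a 64-bit destination the same reasoning would give 2048 bits, and a memory destination, as suggested above, would remove the limit altogether.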


 Post subject:
PostPosted: Thu Jan 29, 2009 12:00 am 
Offline

Joined: Sun Jan 18, 2009 10:35 pm
Posts: 81
We are moving to extensive development; there are more and more extensions. And everything is multithreaded, by the way. CPU clocks have been around 3 GHz for years already.


 Post subject: Re: Intel removed the 4-operand FMA support from AVX spec
PostPosted: Thu Jan 29, 2009 10:47 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
EduardoS wrote:
SSE/AVX have too many problems for me to worry about VMOVMSK...


my case is quite simple: I have tons of legacy SSE code, and these days I have to estimate how many man-days it will take to port it to AVX and Larrabee and, more importantly, define how to write future code that targets all x86 platforms from the same source code

there is a lot of MOVMSK in the legacy code. as a matter of fact, in a lot of cases it leads to faster code than branch elimination with ANDPS/ANDNPS/ORPS or the new BLENDVPS, where you *have to compute both sides of the branch*. it's typically used to test whether all flags are set/reset after a packed compare and to branch to special code (very typically the common case accounts for more than 90% of the actual branches, and thus gets very good branch prediction). one way to avoid these is to use PTEST, which is quite new, but in many cases I have 2 passes: a 1st loop which stores the integer masks and a 2nd loop which does the actual processing. it was just faster this way
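The trade-off described here can be sketched with a scalar Python model. Nothing below is real SIMD code, and the names `blend_select` and `movmsk_branch` are invented for this example. The first function mimics the ANDPS/ANDNPS/ORPS or BLENDVPS style, where both sides are always evaluated. The second mimics the MOVMSK style, where an all-set or all-clear mask skips one side entirely.

```python
def blend_select(cond, fast, slow, xs):
    """Branch-free style: evaluate both sides for every lane, then select."""
    mask = [cond(x) for x in xs]
    a = [fast(x) for x in xs]
    b = [slow(x) for x in xs]  # always computed, even if never selected
    return [av if m else bv for m, av, bv in zip(mask, a, b)]


def movmsk_branch(cond, fast, slow, xs):
    """MOVMSK style: test the whole mask and branch to specialised code."""
    mask = [cond(x) for x in xs]
    if all(mask):
        return [fast(x) for x in xs]  # common case: slow side skipped
    if not any(mask):
        return [slow(x) for x in xs]  # fast side skipped
    # Mixed mask: fall back to computing both sides and blending.
    return blend_select(cond, fast, slow, xs)
```

Both functions return the same values; they differ only in how often the expensive side runs, which is where the speed-up in the common case comes from.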

