Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour





Posted: Thu Mar 20, 2008 9:40 am

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
Eric Bron wrote:
who? wrote:
hahahaha! are you telling me that you are not going to use MULPS and ADDPS when you do Structure of Arrays?


huh ? no I'm not, I was thinking that your PACK (all caps, wtf ?) was referring to packing/swizzling, not the packed add/mul instructions, my bad

anyway ADDPS/MULPS will be used with a SoA or SoS layout, it's much like you're trying to hide your clumsy comment about SoA by going back to a [SoA, SoS] vs [AoS] debate

if you can't see why your idea to still promote SoS in 2008 is not very smart, I suggest you really *think* about how you'll teach people to target SSEx and AVX from the same code

who? wrote:
... packed instructions ...


yes it's more clear with the terms used in the field


Sometimes I wonder why I have to explain the "whole story".
Look at the link I gave you:

Quote:
NumOfGroups = NumOfVertices/SIMDwidth
typedef struct{
float x[SIMDwidth];
float y[SIMDwidth];
float z[SIMDwidth];
} VerticesCoordList;
typedef struct{
int a[SIMDwidth];
int b[SIMDwidth];
int c[SIMDwidth];
. . .
} VerticesColorList;
VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];


who?


there is no timing information at your link, it looks like an old 2004 paper revamped in a rush, it even still talks about "fewer prefetch"

with AVX, sizeof(VerticesCoordList) = 96 bytes, not very smart with a 64-byte cache line size if you ask me, without precise timings comparing the two approaches


your speculation on sizes is not correct, but I am not going to comment, obviously :)

software prefetching is still awesome in the right hands.

the future will show you, stay tuned

[edit]
I got a funny email and want to share it ... one of the readers here mailed me and asked me to "stop beating the dead dog". I did not know the expression, funny actually!
So, you are right ... everything I plan is totally dumb, and you are the only one who knows better than I do! Anything to add? You even know the instructions I designed better than I do!
[/edit]

who?


Posted: Thu Mar 20, 2008 11:02 am

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:

Eric Bron wrote:

with AVX, sizeof(VerticesCoordList) = 96 bytes,



who? wrote:
your speculation on sizes is not correct




with 256-bit AVX SIMDwidth = 8, so we have 8 * 3 * sizeof(float) = 96, that's not very high level maths

anyway, you still try to avoid my main point: cross compatibility between libraries, maybe next time you'll tell us that the MKL 10.0 isn't using SoA ?
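
The size arithmetic above can be checked with a small C sketch. It assumes SIMDwidth = 8 (256-bit AVX, single precision), a 4-byte float, and the 64-byte cache line figure quoted earlier in the thread; `SIMD_WIDTH`, `CACHE_LINE`, `start_offset` and `lines_touched` are illustrative names, not taken from the linked article. It says nothing about the line size of any unreleased CPU.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>

#define SIMD_WIDTH 8   /* 256-bit AVX: 8 single-precision floats per register */
#define CACHE_LINE 64  /* assumption: the 64-byte line size quoted in the thread */

typedef struct {
    float x[SIMD_WIDTH];
    float y[SIMD_WIDTH];
    float z[SIMD_WIDTH];
} VerticesCoordList;

/* 8 * 3 * sizeof(float) = 96 when float is 4 bytes. */
static_assert(sizeof(VerticesCoordList) == 96, "expected a 96-byte group");

/* Byte offset of group i inside its cache line, for an array whose
   first group starts exactly on a line boundary. */
size_t start_offset(size_t i)
{
    return (i * sizeof(VerticesCoordList)) % CACHE_LINE;
}

/* Number of cache lines that group i overlaps, same assumption. */
size_t lines_touched(size_t i)
{
    size_t first = (i * sizeof(VerticesCoordList)) / CACHE_LINE;
    size_t last  = ((i + 1) * sizeof(VerticesCoordList) - 1) / CACHE_LINE;
    return last - first + 1;
}
```

Under these assumptions the groups start alternately at offset 0 and offset 32 of a line, and every group overlaps two lines. Whether that matters in practice is exactly what timings would have to show.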


Posted: Thu Mar 20, 2008 2:34 pm

Joined: Wed Jun 27, 2007 10:19 am
Posts: 324
Location: Milano, Italy
who? wrote:
well, you are really clueless on this, aren't you? there is no SOA or SOS without Pack instructions, the purpose of SoS or SoA is TO USE the pack instructions!!!!!!!

No. The purpose of SoA and SoS is to use packed instructions.

Quote:
here is some lecture that will correct your mis-understanding of the situation...
http://softwarecommunity.intel.com/articles/eng/3592.htm

This link is very, very interesting. You did argue up to this point that intra-vector dot-product instructions were useful because games were using AoS, now you post a link showing dot-products done on a SoS structure w/o using the instructions you mentioned. Are you having fun contradicting what you previously said? Oh and BTW, you didn't respond to my comment where I pointed out that you were confusing culling and clipping. What about that, game development expert?


Posted: Thu Mar 20, 2008 2:49 pm

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
Gabriele Svelto wrote:
This link is very, very interesting. You did argue up to this point that intra-vector dot-product instructions were useful because games were using AoS,


thinking exactly the same, this paper looks like an evolution of the "coding for SIMD architectures" part of the old P4 optimization guide dating back to Willamette, it was at the time these well presented arguments which convinced me to go SoA (and to try what we call SoS in this thread) whenever possible and to avoid AoS like the plague, the horizontal instructions in SSE3 and SSE4.1 are so slow anyway (besides the unused computation slots argument) that I don't even understand how someone can defend their use on a public forum

just another note for you Gabriele, in the past we were talking about the disappointing latency for BLENDVPS in SSE4.1 based on a latency dump someone posted here, now after extensive tests on a real machine I can confirm that it's indeed not faster than an equivalent sequence of SSE andps/andnps/orps
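
For readers who have not seen it, the andps/andnps/orps sequence mentioned above can be sketched with plain SSE intrinsics. This is only an illustration of the select idiom, not the code that was timed; `select_ps` and `min_ps` are hypothetical names. One difference to keep in mind: BLENDVPS looks only at the sign bit of each mask lane, while the and/andn/or form needs every bit of a lane to be set or clear, which is what the SSE compare instructions produce.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Per-lane select: result lane = mask lane ? a lane : b lane.
   Each mask lane must be all ones or all zeros (e.g. a compare result).
   Maps to ANDPS + ANDNPS + ORPS. */
__m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

/* Example use: per-element minimum built from a compare plus the select.
   count must be a multiple of 4 in this sketch. */
void min_ps(const float *p, const float *q, float *out, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        __m128 a = _mm_loadu_ps(p + i);
        __m128 b = _mm_loadu_ps(q + i);
        __m128 mask = _mm_cmplt_ps(a, b);   /* all ones where a < b */
        _mm_storeu_ps(out + i, select_ps(mask, a, b));
    }
}
```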


Posted: Thu Mar 20, 2008 3:52 pm

Joined: Wed Jun 27, 2007 10:19 am
Posts: 324
Location: Milano, Italy
Eric Bron wrote:
thinking exactly the same, this paper looks like an evolution of the "coding for SIMD architectures" part of the old P4 optimization guide dating back to Willamette, it was at the time these well presented arguments which convinced me to go SoA (and to try what we call SoS in this thread) whenever possible and to avoid AoS like the plague, the horizontal instructions in SSE3 and SSE4.1 are so slow anyway (besides the unused computation slots argument) that I don't even understand how someone can defend their use on a public forum

Completely agreed. Besides if you have to push the data on to the GPU (when using vertex buffers) I'm under the impression that SoA will always be much better than AoS (and also SoS with a width <32 elements) as GPUs enjoy coarse grained access & processing. I don't have a way to verify it unfortunately.

Quote:
just another note for you Gabriele, in the past we were talking about the disappointing latency for BLENDVPS in SSE4.1 based on a latency dump someone posted here, now after extensive tests on a real machine I can confirm that it's indeed not faster than an equivalent sequence of SSE andps/andnps/orps

Thank you for the info. I tend to believe that the bottleneck lies in the decode stage since it is likely that a BLENDVPS issues more than one uop.


Posted: Thu Mar 20, 2008 4:52 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Gabriele Svelto wrote:
who? wrote:
well, you are really clueless on this, aren't you? there is no SOA or SOS without Pack instructions, the purpose of SoS or SoA is TO USE the pack instructions!!!!!!!

No. The purpose of SoA and SoS is to use packed instructions.

Quote:
here is some lecture that will correct your mis-understanding of the situation...
http://softwarecommunity.intel.com/articles/eng/3592.htm

This link is very, very interesting. You did argue up to this point that intra-vector dot-product instructions were useful because games were using AoS, now you post a link showing dot-products done on a SoS structure w/o using the instructions you mentioned. Are you having fun contradicting what you previously said? Oh and BTW, you didn't respond to my comment where I pointed out that you were confusing culling and clipping. What about that, game development expert?


The vast majority of video game programmers are still on Array of Structures, so we did DPPS to help them.
SoS is the dream solution, but adoption is fairly low, and that is easy to understand: the consoles are not very big on SIMD, and most programmers focus on threading. SIMD optimization comes only when the console falls short (Xbox 1).

DPPS gives you a quick optimization that does not change your data structure, but gives a pretty good boost.

I am not contradicting myself, I am just living with reality: the dream case is SoS, and if you do that, you don't need DPPS. If you don't do SoS and you stick to AoS, then DPPS is a quick and very efficient way to get faster. I give more choice to our customers, it can't be bad.
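
To make the two layouts under discussion concrete, here is a minimal plain-C sketch. It uses no SIMD instructions and makes no performance claim. `VertexAoS`, `VertexGroup`, `aos_to_groups`, `dot_aos` and `dot_group` are hypothetical names, and `SIMD_WIDTH` is assumed to be 4 (one 128-bit SSE register of floats). The per-vertex dot product is the horizontal pattern an instruction like DPPS targets; the per-group loop is the vertical pattern that packed MULPS/ADDPS handle.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_WIDTH 4  /* assumption: 4 floats per 128-bit register; 8 for 256-bit */

/* AoS: one 12-byte x,y,z tuple per vertex. */
typedef struct { float x, y, z; } VertexAoS;

/* Hybrid SoA ("SoS" in this thread): one struct per SIMD_WIDTH vertices. */
typedef struct {
    float x[SIMD_WIDTH];
    float y[SIMD_WIDTH];
    float z[SIMD_WIDTH];
} VertexGroup;

/* Repack n AoS vertices into groups; n must be a multiple of SIMD_WIDTH. */
void aos_to_groups(const VertexAoS *in, VertexGroup *out, size_t n)
{
    assert(n % SIMD_WIDTH == 0);
    for (size_t i = 0; i < n; ++i) {
        VertexGroup *g = &out[i / SIMD_WIDTH];
        size_t lane = i % SIMD_WIDTH;
        g->x[lane] = in[i].x;
        g->y[lane] = in[i].y;
        g->z[lane] = in[i].z;
    }
}

/* AoS: the three products of ONE vertex are summed (horizontal). */
float dot_aos(const VertexAoS *v, const float n[3])
{
    return v->x * n[0] + v->y * n[1] + v->z * n[2];
}

/* Hybrid SoA: every lane is independent, so each statement in the loop
   body corresponds to one packed multiply or add across all lanes. */
void dot_group(const VertexGroup *g, const float n[3], float out[SIMD_WIDTH])
{
    for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
        out[lane] = g->x[lane] * n[0] + g->y[lane] * n[1] + g->z[lane] * n[2];
}
```

Changing `SIMD_WIDTH` to 8 leaves the rest of the sketch untouched.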

This is part of what I call the "new Intel": we made an instruction that the community had been asking for for years, and we are seeing that it is working, they are planning to use it. We try to answer community feedback with instructions when needed, and when it does not break the logic of the instruction set.

Of course, I would prefer to see people using SoS, but if they don't, I am happy to see them going DPPS.

I have the feeling of chatting with 2 kids that try to pick on everything. From the email feedback I am getting from famous press guys, I am not the only one :) lol !

The smart people perfectly understood what I meant, and I am done with chatting with you 2 on this topic, after all, you are the experts ;-) lol!!!

make sense?

who?


Posted: Thu Mar 20, 2008 5:12 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:

Eric Bron wrote:

with AVX, sizeof(VerticesCoordList) = 96 bytes,



who? wrote:
your speculation on sizes is not correct




with 256-bit AVX SIMDwidth = 8, so we have 8 * 3 * sizeof(float) = 96, that's not very high level maths

anyway, you still try to avoid my main point: cross compatibility between libraries, maybe next time you'll tell us that the MKL 10.0 isn't using SoA ?


MKL uses SoA, yippee!

Congratulations on calculating the size of a SIMDwidth in bytes, pretty impressive ... lol, but you have no clue about the cache line size of SandyB, so your comment is useless. And you are not going to make me say anything about it :)

and in any case, I will let you meditate on the prefetch of (Cacheline + 1) theory. There are more than 50 PhD theses written on this, but I am sure you have the absolute answer to this, as you always do.

for me, it is end of THREAD here.

Good luck with this.

who?


Posted: Thu Mar 20, 2008 5:21 pm

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
DPPS gives you a quick optimization that does not change your data structure, but gives a pretty good boost.


hey cool, now we are very near a concrete discussion, just provide us numbers instead of "pretty good boost" PR talk

I'll be really glad if you provide us some concrete timings with AoS and DPPS vs AoS and on-the-fly swizzling (using good old MOVHPS, SHUFPS, which are very fast on Penryn), let's say with AoS not aligned to 16-byte boundaries, as is typically the case with 12 bytes per x,y,z tuple (hint: the bottleneck is MOVUPS)

we know that with 16-byte-aligned SoA ("SoS" if you want) we get > 3x speedup on Conroe/Penryn vs ADDSS/MULSS, now which speedup do you get with DPPS, 1.2x ? even less ?
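
As an illustration of what "on-the-fly swizzling" means here, the following sketch loads four packed 12-byte x,y,z tuples, shuffles them into x, y and z registers with SHUFPS, and computes four dot products with packed multiplies and adds. It is one possible variant under stated assumptions, not the code anyone in this thread measured: it uses three unaligned loads (`_mm_loadu_ps`) rather than the MOVHPS-based loads mentioned above, and `dot3_aos_swizzled` is a hypothetical name. No timings are implied.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* xyz: count packed vertices (x0 y0 z0 x1 y1 z1 ...), 12 bytes each.
   n:   the constant vector to dot every vertex with.
   out: count results. count must be a multiple of 4 in this sketch. */
void dot3_aos_swizzled(const float *xyz, size_t count,
                       const float n[3], float *out)
{
    assert(count % 4 == 0);
    const __m128 nx = _mm_set1_ps(n[0]);
    const __m128 ny = _mm_set1_ps(n[1]);
    const __m128 nz = _mm_set1_ps(n[2]);

    for (size_t i = 0; i < count; i += 4) {
        const float *p = xyz + 3 * i;
        __m128 v0 = _mm_loadu_ps(p);      /* x0 y0 z0 x1 */
        __m128 v1 = _mm_loadu_ps(p + 4);  /* y1 z1 x2 y2 */
        __m128 v2 = _mm_loadu_ps(p + 8);  /* z2 x3 y3 z3 */

        /* Intermediate shuffles. */
        __m128 t0 = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(1, 0, 2, 1)); /* y0 z0 y1 z1 */
        __m128 t1 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(2, 1, 3, 2)); /* x2 y2 x3 y3 */

        /* Final SoA registers. */
        __m128 x = _mm_shuffle_ps(v0, t1, _MM_SHUFFLE(2, 0, 3, 0));  /* x0 x1 x2 x3 */
        __m128 y = _mm_shuffle_ps(t0, t1, _MM_SHUFFLE(3, 1, 2, 0));  /* y0 y1 y2 y3 */
        __m128 z = _mm_shuffle_ps(t0, v2, _MM_SHUFFLE(3, 0, 3, 1));  /* z0 z1 z2 z3 */

        /* Four dot products at once with vertical MULPS/ADDPS. */
        __m128 dot = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, nx), _mm_mul_ps(y, ny)),
                                _mm_mul_ps(z, nz));
        _mm_storeu_ps(out + i, dot);
    }
}
```

The AoS data in memory is left untouched; only the inner loop changes.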


Last edited by Eric Bron on Thu Mar 20, 2008 5:34 pm, edited 4 times in total.

Posted: Thu Mar 20, 2008 5:25 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:

so, please provide us some concrete timings .....


That is a luxury that only the director of my group can have .... lol! reality check dude!

who?
I am out of here.


Posted: Thu Mar 20, 2008 5:30 pm

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
That is a luxury that only the director of my group can have .... lol! reality check dude!


so you agree it's less than 1.2 x speedup ? that's a really worthless optimization


Posted: Fri Mar 21, 2008 12:21 am

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
That is a luxury that only the director of my group can have .... lol! reality check dude!


so you agree it's less than 1.2 x speedup ? that's a really worthless optimization


I don't agree

who?


Posted: Fri Mar 21, 2008 7:37 am

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
I don't agree
who?


OK, so you claim that you get >= 20% faster code with DPPS (vs. optimized on-the-fly swizzling) keeping vertex data in AoS, it's *interesting*, now I suggest you post precise timings and code snippets of the two inner loops you are comparing, my understanding is that the baseline (without DPPS) isn't a well-optimized SSE code path, but I'll be pleased to be proven wrong by factual data

DPPS has been disclosed for a full year, and Harpertown CPUs have been publicly available for more than 6 months, so there is nothing to hide there


Posted: Fri Mar 21, 2008 4:16 pm

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
I don't agree
who?


OK, so you claim that you get >= 20% faster code with DPPS (vs. optimized on-the-fly swizzling) keeping vertex data in AoS, it's *interesting*, now I suggest you post precise timings and code snippets of the two inner loops you are comparing, my understanding is that the baseline (without DPPS) isn't a well-optimized SSE code path, but I'll be pleased to be proven wrong by factual data

DPPS has been disclosed for a full year, and Harpertown CPUs have been publicly available for more than 6 months, so there is nothing to hide there


Since you are a super coder ... why don't you do your experimentation yourself? You sound like a very bad manager to me: you scream at people, you tell them they are lying, and then you ask them to do work for you ...

my advice: if you want people to do stuff with you, you are going to have to work on your communication skills! (I don't mean spelling! hahahah)

stop digging your hole.

who?


Posted: Fri Mar 21, 2008 5:07 pm

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
why don't you do your experimentation yourself?


hey why do you think I'm not experimenting ? trust me, I carefully experimented with SSE4.1 when it was something new, my conclusion is that it's worthless for any 3D purpose

It looks very much like the "1" in "SSE4.1" is for the number of instructions *clearly* showing a speedup, man that's only one: mpsadbw

now some new instructions looked very interesting, like blendps/pd, blendvps/pd and ptest, unfortunately these lead to no speedup (when compared with industry standard SSE) even on tight loops specifically targeted at showing their benefit, the fact is that Penryn is so fast with legacy SSE (1/3 clock throughput for some instructions) that new instructions in SSE4.1 with 2 clock throughput are just plain useless to give us a speedup

dppd/ps are not interesting for me since I deal exclusively with SoA and Hybrid SoA (what you call "SoS") layouts, so I have not tested them. Since we have known for a while that the future is wider vectors, AoS and horizontal computations are a dead end anyway. Still, I'll be interested to see some code and timings: my guesstimate is that you can't get better than 20% faster code with 12 bytes per vertex (your example) than optimized SSE. Moreover, I suppose the dpps code path will not be significantly simpler than on-the-fly swizzling (on-the-fly swizzling can give a nice speedup without changing the AoS layout, with only local changes like dpps, but after that developers already have one foot in true SIMD). So I'll advise developers who want to optimize their code to target industry standard SSE, and to start planning for wider vectors and generalized SoA, since it's the way forward


who? wrote:
you tell them they are lying


huh ?


Posted: Fri Mar 21, 2008 6:22 pm

Joined: Tue Jun 26, 2007 8:55 pm
Posts: 706
Quote:
you are going to have to work on your communication skills!

[..]

stop digging your hole.


Oh the irony!

