hahahaha! are you telling me that you are not going to use MULPS and ADDPS when you do Structure of Arrays?
huh? no I'm not, I thought your PACK (all caps, wtf?) was referring to packing/swizzling, not the packed add/mul instructions, my bad
anyway ADDPS/MULPS will be used with either a SoA or a SoS layout, so it looks very much like you're trying to hide your clumsy comment about SoA by going back to a [SoA, SoS] vs [AoS] debate
if you can't see why your idea of still promoting SoS in 2008 is not very smart, I suggest you really *think* about how you'll teach people to target SSEx and AVX from the same code
who? wrote:
... packed instructions ...
yes, it's clearer with the terms used in the field
Sometimes I wonder why I have to explain the whole story. Look at the link I gave you:
Quote:
NumOfGroups = NumOfVertices/SIMDwidth

typedef struct{
    float x[SIMDwidth];
    float y[SIMDwidth];
    float z[SIMDwidth];
} VerticesCoordList;

typedef struct{
    int a[SIMDwidth];
    int b[SIMDwidth];
    int c[SIMDwidth];
    . . .
} VerticesColorList;

VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];
who?
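For readers following along, here is a minimal sketch of how the quoted "SoS" layout might be used. It is not taken from the linked paper: the `SIMD_WIDTH` macro and the `dot_all` function are hypothetical names introduced for illustration, and the loop is plain C rather than hand-written MULPS/ADDPS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time SIMD width: 4 floats for 128-bit SSE,
   8 floats for 256-bit AVX. Not part of the quoted snippet. */
#ifndef SIMD_WIDTH
#define SIMD_WIDTH 4
#endif

/* One group of SIMD_WIDTH vertices, as in the quoted VerticesCoordList. */
typedef struct {
    float x[SIMD_WIDTH];
    float y[SIMD_WIDTH];
    float z[SIMD_WIDTH];
} VerticesCoordList;

/* Dot product of every vertex with the vector (nx, ny, nz).
   num_vertices must be a multiple of SIMD_WIDTH. The inner loop reads
   the same lane of x, y and z, so each group maps onto packed
   multiplies and adds that handle SIMD_WIDTH vertices at a time. */
void dot_all(const VerticesCoordList *groups, size_t num_vertices,
             float nx, float ny, float nz, float *out)
{
    assert(num_vertices % SIMD_WIDTH == 0);
    size_t num_groups = num_vertices / SIMD_WIDTH;
    for (size_t g = 0; g < num_groups; ++g) {
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane) {
            out[g * SIMD_WIDTH + lane] = groups[g].x[lane] * nx
                                       + groups[g].y[lane] * ny
                                       + groups[g].z[lane] * nz;
        }
    }
}
```

In this sketch the group size is baked into the struct, so switching `SIMD_WIDTH` from 4 to 8 changes the memory layout itself. That is the cross-width compatibility concern raised earlier in the thread.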
there is no timing information at your link, it looks like an old 2004 paper revamped in a rush, it even still talks about "fewer prefetch"
with AVX, sizeof(VerticesCoordList) = 96 bytes, which is not very smart with a 64-byte cache line size if you ask me, especially without precise timings comparing the two approaches
your speculations on sizes are not correct, but I am not going to comment, obviously :)
software prefetch is still awesome in the right hands.
the future will show you, stay tuned
[edit]
I got a funny email and want to share it ... one of the readers here mailed me and asked me to "stop beating the dead dog". I did not know the expression, funny actually!
So, you are right ... everything I plan is totally dumb, and you are the only one who knows better than I do! anything to add? you even know the instructions I designed better than I do!
[/edit]
with 256-bit AVX SIMDwidth = 8, so we have 8 * 3 * sizeof(float) = 96, that's not very high level maths
anyway, you're still trying to avoid my main point: cross compatibility between libraries, maybe next time you'll tell us that MKL 10.0 isn't using SoA?
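The size arithmetic above can be checked with a small sketch. Assumptions here: a 4-byte `float`, a group array whose base is aligned to the cache line, and the 64-byte line size used in the thread (which who? disputes for the future part). The names `VerticesCoordListAVX`, `lines_touched` and `lines_for_group` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define AVX_WIDTH 8     /* floats per 256-bit register */
#define CACHE_LINE 64   /* bytes; the figure assumed in the thread */

/* The quoted coordinate group with SIMDwidth = 8. */
typedef struct {
    float x[AVX_WIDTH];
    float y[AVX_WIDTH];
    float z[AVX_WIDTH];
} VerticesCoordListAVX;

/* Number of cache lines of size `line` overlapped by the byte range
   [offset, offset + size). size and line must be non-zero. */
size_t lines_touched(size_t offset, size_t size, size_t line)
{
    assert(size > 0 && line > 0);
    size_t first = offset / line;
    size_t last = (offset + size - 1) / line;
    return last - first + 1;
}

/* Cache lines overlapped by group `index` of an array of groups whose
   base address is aligned to CACHE_LINE. */
size_t lines_for_group(size_t index)
{
    return lines_touched(index * sizeof(VerticesCoordListAVX),
                         sizeof(VerticesCoordListAVX), CACHE_LINE);
}
```

Under these assumptions `sizeof(VerticesCoordListAVX)` is 96, every group overlaps two cache lines, and each pair of consecutive groups covers three lines in total. Whether that costs anything measurable is exactly what the requested timings would have to show.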
Joined: Wed Jun 27, 2007 10:19 am Posts: 324 Location: Milano, Italy
who? wrote:
well, you are really clueless on this, aren't you? there is no SoA or SoS without Pack instructions, the purpose of SoS or SoA is TO USE the pack instructions!!!!!!!
No. The purpose of SoA and SoS is to use packed instructions.
This link is very, very interesting. Up to this point you argued that intra-vector dot-product instructions were useful because games were using AoS; now you post a link showing dot products done on a SoS structure without using the instructions you mentioned. Are you having fun negating what you previously said? Oh, and BTW, you didn't respond to my comment where I pointed out that you were confusing culling and clipping. What about that, game development expert?
This link is very, very interesting. Up to this point you argued that intra-vector dot-product instructions were useful because games were using AoS,
thinking exactly the same, this paper looks like an evolution of the "coding for SIMD architectures" part of the old P4 optimization guide dating back to Willamette. At the time it was these well-presented arguments that convinced me to go SoA (and to try what we call SoS in this thread) whenever possible and to avoid AoS like the plague. The horizontal instructions in SSE3 and SSE4.1 are so slow anyway (besides the unused computation slots argument) that I don't even understand how someone can defend their use on a public forum
just another note for you Gabriele: in the past we were talking about the disappointing latency of BLENDVPS in SSE4.1, based on a latency dump someone posted here. Now, after extensive tests on a real machine, I can confirm that it's indeed not faster than an equivalent sequence of SSE andps/andnps/orps
Joined: Wed Jun 27, 2007 10:19 am Posts: 324 Location: Milano, Italy
Eric Bron wrote:
thinking exactly the same, this paper looks like an evolution of the "coding for SIMD architectures" part of the old P4 optimization guide dating back to Willamette. At the time it was these well-presented arguments that convinced me to go SoA (and to try what we call SoS in this thread) whenever possible and to avoid AoS like the plague. The horizontal instructions in SSE3 and SSE4.1 are so slow anyway (besides the unused computation slots argument) that I don't even understand how someone can defend their use on a public forum
Completely agreed. Besides, if you have to push the data to the GPU (when using vertex buffers), I'm under the impression that SoA will always be much better than AoS (and also than SoS with a width < 32 elements), as GPUs enjoy coarse-grained access and processing. Unfortunately I don't have a way to verify it.
Quote:
just another note for you Gabriele: in the past we were talking about the disappointing latency of BLENDVPS in SSE4.1, based on a latency dump someone posted here. Now, after extensive tests on a real machine, I can confirm that it's indeed not faster than an equivalent sequence of SSE andps/andnps/orps
Thank you for the info. I tend to believe that the bottleneck lies in the decode stage since it is likely that a BLENDVPS issues more than one uop.
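To make the comparison concrete, here is a scalar C model of the two select idioms being discussed. It is only a sketch: it uses no intrinsics, the function names are invented for this example, and it says nothing about timings. It assumes a 32-bit `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4   /* 32-bit lanes in a 128-bit SSE register */

_Static_assert(sizeof(float) == sizeof(uint32_t), "model assumes 32-bit float");

static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Model of the three-instruction select: out = (mask & a) | (~mask & b).
   Each mask lane must be all ones or all zeros, as a packed compare
   such as CMPPS produces. */
void select_and_andn_or(const uint32_t mask[LANES], const float a[LANES],
                        const float b[LANES], float out[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        assert(mask[i] == 0u || mask[i] == 0xFFFFFFFFu);
        uint32_t kept_a = mask[i] & float_bits(a[i]);    /* ANDPS  */
        uint32_t kept_b = ~mask[i] & float_bits(b[i]);   /* ANDNPS */
        out[i] = bits_float(kept_a | kept_b);            /* ORPS   */
    }
}

/* Model of a BLENDVPS-style select: only the sign bit of each mask lane
   decides which source is taken. */
void select_sign_bit(const uint32_t mask[LANES], const float a[LANES],
                     const float b[LANES], float out[LANES])
{
    for (int i = 0; i < LANES; ++i)
        out[i] = (mask[i] >> 31) ? a[i] : b[i];
}
```

With compare-generated masks the two functions return the same lanes, which is why the older sequence is a drop-in alternative. The corresponding intrinsics are `_mm_and_ps`, `_mm_andnot_ps`, `_mm_or_ps` and, for SSE4.1, `_mm_blendv_ps`.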
well, you are really clueless on this, aren't you? there is no SoA or SoS without Pack instructions, the purpose of SoS or SoA is TO USE the pack instructions!!!!!!!
No. The purpose of SoA and SoS is to use packed instructions.
This link is very, very interesting. Up to this point you argued that intra-vector dot-product instructions were useful because games were using AoS; now you post a link showing dot products done on a SoS structure without using the instructions you mentioned. Are you having fun negating what you previously said? Oh, and BTW, you didn't respond to my comment where I pointed out that you were confusing culling and clipping. What about that, game development expert?
The large majority of video game programmers are still on Array of Structures, so we did DPPS to help them.
SoS is the dream solution, but adoption is fairly low, which is easy to understand: the consoles are not very big on SIMD, and most programmers focus on threading. SIMD optimization comes only when the console falls short (Xbox 1).
DPPS gives you a quick optimization that does not change your data structure, but gives a pretty good boost.
I am not contradicting myself, I am just living with reality. The dream case is SoS, and if you do that, you don't need DPPS. If you don't do SoS and you stick to AoS, then DPPS is a quick and very efficient way to get faster. I give more choice to our customers, it can't be bad.
This is part of what I call the "new intel": we made an instruction that the community had been requesting for years, and we are seeing that it is working, they are planning to use it. We try to answer community feedback with instructions when needed, and when it does not break the logic of the Instruction Set.
of course, I would prefer to see people using SoS, but if they don't, I am happy to see them going DPPS.
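For readers who have not used the instruction, a scalar C model of what DPPS computes may help frame the AoS case. This is an illustration only, with an invented function name. The summation order of the real instruction may differ from this loop, which can matter for rounding.

```c
#include <assert.h>

/* Scalar model of DPPS dst, src, imm8 on four-float vectors.
   Bits 7:4 of imm8 select which lane products enter the sum.
   Bits 3:0 select which output lanes receive the sum; the other
   output lanes are set to zero. */
void dpps_model(const float a[4], const float b[4], unsigned imm8,
                float out[4])
{
    assert(imm8 <= 0xFFu);
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];
    }
    for (int i = 0; i < 4; ++i)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

For an AoS vertex stored as x, y, z, w, an immediate of 0x71 gives the three-component dot product in lane 0 and ignores w. One instruction therefore yields one dot product, where the SoS arrangement yields one per lane from a few packed multiplies and adds.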
I have the feeling I'm chatting with 2 kids who try to pick on everything. From the email feedback I am getting from famous press guys, I am not the only one :) lol!
The smart people perfectly understood what I meant, and I am done with chatting with you 2 on this topic, after all, you are the experts ;-) lol!!!
with 256-bit AVX SIMDwidth = 8, so we have 8 * 3 * sizeof(float) = 96, that's not very high level maths
anyway, you're still trying to avoid my main point: cross compatibility between libraries, maybe next time you'll tell us that MKL 10.0 isn't using SoA?
MKL uses SoA, yippee!
Congratulations on calculating the size of a SIMDwidth in bytes, pretty impressive... lol, but you have no clue about the cache line size of SandyB, so your comment is useless. And you are not going to make me say anything about it :)
and in any case, I will let you meditate on the prefetch of (cache line + 1) theory. There are more than 50 PhD theses written on this, but I am sure you have the absolute answer to this, as you always do.
DPPS gives you a quick optimization that does not change your data structure, but gives a pretty good boost.
hey cool, now we are very near a concrete discussion, just provide us numbers instead of "pretty good boost" PR talk
I'll be really glad if you provide us some concrete timings for AoS with DPPS vs AoS with on-the-fly swizzling (using good old MOVHPS and SHUFPS, which are very fast on Penryn), let's say with AoS not aligned to 16-byte boundaries, as is typically the case with 12 bytes per x,y,z tuple (hint: the bottleneck is MOVUPS)
we know that with 16-byte-aligned SoA ("SoS" if you want) we get a > 3x speedup on Conroe/Penryn vs ADDSS/MULSS, now what speedup do you get with DPPS, 1.2x? even less?
Last edited by Eric Bron on Thu Mar 20, 2008 5:34 pm, edited 4 times in total.
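As a rough picture of what "on-the-fly swizzling" means here, the following scalar C sketch keeps the packed 12-byte AoS layout in memory and transposes four vertices at a time into x, y and z rows before doing the math. It is a model of the data movement only, not the MOVHPS/SHUFPS sequence itself. `Vertex3`, `swizzle_group` and `dot_aos_swizzled` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP 4   /* vertices per 128-bit SSE register */

/* Packed AoS vertex: 12 bytes with a 4-byte float, so with a 16-byte
   aligned array only every fourth vertex starts on a 16-byte boundary. */
typedef struct { float x, y, z; } Vertex3;

/* Transpose GROUP consecutive AoS vertices into three SoA rows. With SSE
   this is the step done in registers with loads and shuffles. */
static void swizzle_group(const Vertex3 *v, float xs[GROUP],
                          float ys[GROUP], float zs[GROUP])
{
    for (size_t i = 0; i < GROUP; ++i) {
        xs[i] = v[i].x;
        ys[i] = v[i].y;
        zs[i] = v[i].z;
    }
}

/* Dot product of n AoS vertices with (nx, ny, nz). The AoS layout in
   memory is untouched; each group is swizzled locally and then handled
   lane by lane, as packed multiplies and adds would do.
   n must be a multiple of GROUP. */
void dot_aos_swizzled(const Vertex3 *v, size_t n,
                      float nx, float ny, float nz, float *out)
{
    assert(n % GROUP == 0);
    for (size_t g = 0; g < n; g += GROUP) {
        float xs[GROUP], ys[GROUP], zs[GROUP];
        swizzle_group(v + g, xs, ys, zs);
        for (size_t i = 0; i < GROUP; ++i)
            out[g + i] = xs[i] * nx + ys[i] * ny + zs[i] * nz;
    }
}
```

Only the loop body changes while the vertex array stays AoS, so the change is as local as swapping in DPPS. Which of the two is faster is the open question the timings request is about.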
OK, so you claim that you get >= 20% faster code with DPPS (vs. optimized on-the-fly swizzling) keeping vertex data in AoS, it's *interesting*. Now I suggest you post precise timings and code snippets of the two inner loops you are comparing. My understanding is that the baseline (without DPPS) isn't a well-optimized SSE code path, but I'll be pleased to be proven wrong by factual data
DPPS has been disclosed for a full year, and Harpertown CPUs have been publicly available for more than 6 months, so there is nothing to hide there
OK, so you claim that you get >= 20% faster code with DPPS (vs. optimized on-the-fly swizzling) keeping vertex data in AoS, it's *interesting*. Now I suggest you post precise timings and code snippets of the two inner loops you are comparing. My understanding is that the baseline (without DPPS) isn't a well-optimized SSE code path, but I'll be pleased to be proven wrong by factual data
DPPS has been disclosed for a full year, and Harpertown CPUs have been publicly available for more than 6 months, so there is nothing to hide there
Since you are a super coder ... why don't you do your experimentation yourself? You sound like a very bad manager to me: you scream at people, you tell them they are lying, and then you ask them to do work for you ...
my advice: if you want people to do stuff with you, you are going to have to work on your communication skills! (I don't mean spelling! hahahah)
hey, why do you think I'm not experimenting? trust me, I experimented carefully with SSE4.1 when it was something new, and my conclusion is that it's worthless for any 3D purpose
It looks very much like the "1" in "SSE4.1" stands for the number of instructions *clearly* showing a speedup, man that's only one: mpsadbw
now some new instructions looked very interesting, like blendps/pd, blendvps/pd and ptest. Unfortunately these lead to no speedup (compared with industry standard SSE) even on tight loops specifically targeted at showing their benefit. The fact is that Penryn is so fast with legacy SSE (1/3 clock throughput for some instructions) that new SSE4.1 instructions with 2 clock throughput are just plain useless for giving us a speedup
dppd/ps are not interesting for me since I deal exclusively with SoA and hybrid SoA (what you call "SoS") layouts, so I have not tested them. Since we have known for a while that the future is wider vectors, AoS and horizontal computations are a dead end anyway. Still, I'll be interested to see some code and timings: my guesstimate is that with 12 bytes per vertex (your example) you can't get code more than 20% faster than optimized SSE. Moreover, I suppose the dpps code path will not be significantly simpler than on-the-fly swizzling (on-the-fly swizzling can give a nice speedup without changing the AoS layout, with only local changes like dpps, but after that developers already have one foot in true SIMD). So I'll advise developers who want to optimize their code to target industry standard SSE, and to start planning for wider vectors and generalized SoA since it's the way going forward