Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour



Author Message
 Post subject:
PostPosted: Tue Mar 18, 2008 4:07 pm 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Gabriele Svelto wrote:
who? wrote:
As usual, you speak of your little lalaland and generalize from it:
Check DivX and Windows Media for LDDQU, together they represent more than half the video codec market.

My lalaland is a fairly popular Linux distro. BTW, I did the same thing here at work with Debian and found no SSE3 code either, which means that there isn't any in Ubuntu either. But I guess that your employer doesn't care about Linux, does it? Oh, and would you care to point to some hard data proving your claim that DivX and WMV (using which codec? and which version?) represent half of the market (what market? Stuff downloaded from P2P networks?).

Quote:
Your other comments are not any better.
Please give me a game that uses SoA ... Please!!! lol! Most of them use Array of Structures, not Structure of Arrays, dude!
I guess you have never put your hands on any game code, and if you did, please tell me which one, because I think you did not. Talking about stuff without knowing again? I know many of those engines for sure; I work with them almost every day.

Sure you do. And how exactly do engines using SSEx deal with arrays of structures? Do they unpack the vertices (and their attributes, something you curiously forgot) every time they have to deal with them?

Quote:
For your cellphone , well, look at the result of the video encoded, and the bit rate ... lol!

LOL? You should pay attention to what you say: recent SoCs are capable of encoding 720p HD video, and that's what is expected from them since they will no longer be used as 'just' phones. And your company is quite aware of it even if you aren't.

Quote:
I forgot, Structure of Array is a stupid system

That 'stupid system' is warmly recommended by your employer's optimization manuals.
Quote:
it gives you something like
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
(here you open a lot of memory channels ... load ports don't like it!)
in the memory,

That's why your company designs processors with multi-way set-associative caches and multiple memory prefetchers, in case you were wondering about those.

Quote:
Array of structure is xyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyz
Better for Memory port, but bad for SIMD.

The solution is SOS (Thanks AlexK) (Structure of Structure)
XXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZ

This gives you perfect data locality, very good for load ports, and awesome for SIMD...

Eric already pointed out how wonderful it is to assume the size of your vectors (what about your comments on optimizing for AMD processors? If it is so bad to choose a granularity of 64 bits, why choose 128 bits when you know you're going to port it to 256?). Oh, and finally, how can your SoS structure be used with inner-vector dot products, because that was what we were talking about? How do you use those instructions with SoS? Do you de-interleave your data every time you use it? Vertex data can also be 8- or 16-bit integers; how do you interleave those? Do you throw away the potential memory savings and extend them to 32 bits? And in which part of your 3D pipeline do you use data formatted that way, considering that most of the grunt work on vertices today is done on the GPU *anyway*?

Quote:
PS: I copied your posting style, hirritating , isn't it?

I usually answer your posts by addressing every point you make. You quoted the whole text of my post and didn't address many of the points I raised, so I don't understand what exactly you copied. Is coping with 5 or 6 different code paths easy? What are inner-vector dot products used for, since the way you store data in memory to use them prevents vectorization? Oh, and the word you were looking for is irritating. Without an 'h'.


The H is for my accent :)

Well, if you can't see how to use DPPS with SoS, knowing that the data coming into your algorithm is Array of Structures, I can't help you ...
Actually I can ... what about a conversion from AoS to SoS while doing your view clipping in the driver ... (just one example)
Yes, my friend, this is the format that DX requires...
Do I always have to break things down to the ASM level, or will you stop picking on every detail?
I showed you here that I know what I am talking about; if you are interested in picking on spelling details ... I can't help you.

You like picking on whatever is blue ... in an unfair manner. This ranks the value of your opinion at about nothing. Be fair next time ...

Oh, I forgot, for those who need total assistance, here: XXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZ
Yes yes, 4 variables in 128 bits ... this means they are 32-bit single-precision floats...
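
For readers trying to follow the three layouts argued about in this thread, here is a minimal sketch of where component `comp` (0 = x, 1 = y, 2 = z) of vertex number `vertex` ends up in a flat float array under each layout. The function names and the fixed width of 4 floats are assumptions made for illustration, not anything taken from a real engine or from the posters' code.

```c
#include <assert.h>
#include <stddef.h>

enum { SIMD_WIDTH = 4 };  /* assumption: 4 x 32-bit floats = one 128-bit register */
enum { COMPONENTS = 3 };  /* x, y, z */

/* AoS (Array of Structures): xyzxyzxyz... */
static size_t aos_index(size_t vertex, size_t comp)
{
    return vertex * COMPONENTS + comp;
}

/* SoA (Structure of Arrays): all x, then all y, then all z.
   n is the total number of vertices. */
static size_t soa_index(size_t vertex, size_t comp, size_t n)
{
    assert(vertex < n);
    return comp * n + vertex;
}

/* "SoS" as drawn in the post: XXXXYYYYZZZZ XXXXYYYYZZZZ ...
   Vertices are grouped SIMD_WIDTH at a time. */
static size_t sos_index(size_t vertex, size_t comp)
{
    size_t group = vertex / SIMD_WIDTH;  /* which XXXXYYYYZZZZ block */
    size_t lane  = vertex % SIMD_WIDTH;  /* position inside XXXX */
    return group * (COMPONENTS * SIMD_WIDTH) + comp * SIMD_WIDTH + lane;
}
```

Note that `sos_index` is the only one of the three that depends on `SIMD_WIDTH`, which is the portability concern raised elsewhere in the thread.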

Best regards,

who?
PS: What is he going to argue this time ???


 Post subject:
PostPosted: Tue Mar 18, 2008 4:30 pm 
Offline

Joined: Wed Jun 27, 2007 10:19 am
Posts: 324
Location: Milano, Italy
who? wrote:
The H is for my accent :)

well, if you can t see how to use DPPS in SoS, knowing you got data coming in your algorythm is Array of structure, I can't help you ...
actually I can ... what about a conversion from AOS to SoS
while doing your view clipping in the driver ... (just one example)

I see. You mean when dealing with hardware that doesn't support vertex shaders, so you have to do it in software. Oh. That doesn't sound too 'next generation' if you ask me.

Quote:
yes, my friend, this is the format that DX requires...

You mean DirectX 7, right? Because clipping is done after the vertex shader stage, and nobody in their right mind would do screen-space clipping by hand today. And if you meant frustum culling, then that's not something that parallelizes well, because of the hierarchical nature of the scene graph, and besides, it's not exactly a CPU hog either.

Quote:
Do I have to always break down to the ASM level, or you will stop picking on every details?

Do you consider testing and supporting 5-6 different code paths in an application 'picking on details'? Anyway I cannot since you didn't address the points I made.

Quote:
Oh , i forgot, for those who need total assist, here: XXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZ
yes yes, 4 variables in 128bits ... this mean, it is 32 bits float single precision...

So you agree, we'll have to rewrite that code if vectors are going to be 256-bit wide.


 Post subject:
PostPosted: Tue Mar 18, 2008 4:41 pm 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Gabriele Svelto wrote:
who? wrote:
The H is for my accent :)

well, if you can t see how to use DPPS in SoS, knowing you got data coming in your algorythm is Array of structure, I can't help you ...
actually I can ... what about a conversion from AOS to SoS
while doing your view clipping in the driver ... (just one example)

I see. You mean when dealing with hardware that doesn't support vertex shaders and you have to deal it in software. Oh. That doesn't sound too 'next generation' if you ask me.

Quote:
yes, my friend, this is the format that DX requires...

You mean DirectX 7 right? Because clipping is down after the vertex shader stage and just nobody in its sane mind would do screen-space clipping by hand today. And if you meant frustum-culling then that's not something which gets very parallel by the hierarchical nature of the scene graph and besides that's not exactly a CPU hog either.

Quote:
Do I have to always break down to the ASM level, or you will stop picking on every details?

Do you consider testing and supporting 5-6 different code paths in an application 'picking on details'? Anyway I cannot since you didn't address the points I made.

Quote:
Oh , i forgot, for those who need total assist, here: XXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZXXXXYYYYZZZZ
yes yes, 4 variables in 128bits ... this mean, it is 32 bits float single precision...

So you agree, we'll have to rewrite that code if vectors are going to be 256-bit wide.


Yes, you have to rewrite the code, as always when you get a new instruction set ... In the case of SSE4 SoS going to AVX, it is pretty simple, correct? The loop that was split into groups of 4 (XXXX) just needs to be adjusted to XXXXXXXX; it is very likely to be simple, don't you think?
Until IDF I can't say much, but do not worry, it is not going to be a mountain to climb; it is easy and simple.

Modern video encoders, for example, have many code paths: MMX, SSE, SSE2, SSE3, SSE4.1, and very soon SSE4.2.
Most of those critical paths have to be bit-for-bit compatible, so in the case of motion estimation it is fairly easy to validate many different code paths. You develop a path when you get the new SDK, and you validate it when you get the new CPU. You don't develop 6 code paths in one week; you do it over time, as the CPUs come out.
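
To make the "bit-for-bit compatible" point concrete, here is a hedged sketch of how one alternate code path might be checked against a scalar reference for the sum of absolute differences (SAD) commonly used in motion estimation. The function names are hypothetical, and the "four lanes" variant is only a portable stand-in for a real SIMD path, not actual encoder code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference path: plain scalar sum of absolute differences. */
static uint32_t sad_reference(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Stand-in for an optimized path: four partial sums, the way four SIMD
   lanes would accumulate, followed by a scalar tail. */
static uint32_t sad_four_lanes(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t lane[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; ++k)
            lane[k] += (uint32_t)abs((int)a[i + k] - (int)b[i + k]);
    uint32_t sum = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; ++i)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Returns 1 when both paths produce exactly the same result. */
static int paths_match(const uint8_t *a, const uint8_t *b, size_t n)
{
    return sad_reference(a, b, n) == sad_four_lanes(a, b, n);
}
```

Because the arithmetic is on integers, the order of accumulation does not change the result, which is one reason this kind of kernel lends itself to exact comparison across paths. Floating-point kernels would need more care.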

Make sense?
who?


 Post subject:
PostPosted: Tue Mar 18, 2008 5:06 pm 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Gabriele Svelto wrote:

Quote:
yes, my friend, this is the format that DX requires...

You mean DirectX 7 right? Because clipping is down after the vertex shader stage and just nobody in its sane mind would do screen-space clipping by hand today. And if you meant frustum-culling then that's not something which gets very parallel by the hierarchical nature of the scene graph and besides that's not exactly a CPU hog either.



Well, clipping your view is still "recommended", even in DX10. When you don't do it, you get a crisis ... if you see what I mean. You can do it with a tree, and then get down to a smaller granularity.

When you develop a super high-end game, you will overload the GPU, and it will take any help you can give it. Removing half of a tree in Crysis is removing many thousands of polygons.

You still have to trim your vertex list and triangle list down to the minimum if you want the GPU to do OK.

You can skip this step if you want; after DX7 it is "supported", but it is still very much recommended to trim by yourself (and most likely the trimming happens in the driver anyway).

Do not confuse the "list of supported features" of DX n with the right thing to do. Sometimes the supported feature is not as good as what you can do manually, because it is more generic than your own code.

I am in the process of helping a game developer with this problem on a DX10 title. It does help, for one reason: your GPU is overloaded, especially if it is a mid-range GPU. Trimming by clipping helps the game run on lower-end GPUs, helping the video game company increase its market for the title. Understand the goal?
It is nice to do pure DX10, but then your market is limited to the G80 8800 ...
a very small target ...

Do you work with video game companies, or did you just repeat what is in the DX10 SDK?

who?


 Post subject:
PostPosted: Tue Mar 18, 2008 7:18 pm 
Offline

Joined: Wed Jun 27, 2007 10:19 am
Posts: 324
Location: Milano, Italy
who? wrote:
well, clipping your view is still "recommended" , even in DX10. when you don t do it, you get a crisis ... if you see what I mean. you can do it with a tree, and then, get down to smaller granularity.

when you develope a super high end game, you will overwarm the GPU, and it will take any help you can give it. Removing half of a tree in crisis is removing many thousands of polygons.

you got to trim down your vertex list and triagle list to the minimum if you want the GPU to do ok, still.

That's called frustum culling and it is not a CPU intensive activity.

Quote:
You can, if you want skip this step, after DX7, it is "supported", but it is very recommanded to still trim by yourself. (and mostlikely, the trimming happen in the driver anyway)

You are mixing things up. Frustum culling, and other forms of hidden-geometry culling, will be 'recommended' even in DX20 the day it sees the light. But that's simply because it's easy, takes almost no CPU time, and remarkably lightens the load on the GPU compared with unconditionally sending everything. Clipping is another thing: it happens *after* the vertex shader stage and hence *inside* the GPU, so I really don't see how the CPU can help with it. Unless the driver has to emulate the vertex shader because the hardware doesn't support it, but in that case overall performance is going to suck anyway.
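
To illustrate why frustum culling is cheap on the CPU, here is a minimal sketch of the usual bounding-sphere test against a set of planes. The `Plane` and `Sphere` types and the function name are assumptions for this example, and a real engine would typically walk a scene hierarchy rather than test objects one by one.

```c
#include <assert.h>

/* Plane with unit normal (nx, ny, nz); a point p is on the inside
   when n.p + d >= 0. Hypothetical types for illustration. */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float cx, cy, cz, radius; } Sphere;

/* Returns 0 if the sphere lies completely outside at least one plane
   (so the object can be skipped), 1 otherwise. Objects that straddle a
   plane are kept; cutting their triangles is clipping, a separate step. */
static int sphere_visible(const Sphere *s, const Plane *planes, int plane_count)
{
    assert(plane_count >= 0);
    for (int i = 0; i < plane_count; ++i) {
        float dist = planes[i].nx * s->cx
                   + planes[i].ny * s->cy
                   + planes[i].nz * s->cz
                   + planes[i].d;
        if (dist < -s->radius)
            return 0;  /* fully outside this plane */
    }
    return 1;
}
```

The cost is a handful of multiply-adds per object and plane, which is the sense in which it is "not a CPU hog".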

Quote:
Do not confuse the "list of supported feature" of DX n, and the right thing to do, some time, the supported feature is not as good as what you can do manually, because more generic than your own code.

No confusion on my side. On the other hand, seeing you confuse clipping and culling gives very little credibility to the claim that you are working on a game. But hey, you were also the one who claimed that FB-DIMM had lower latency than equivalent unregistered DDR2/3.


 Post subject:
PostPosted: Tue Mar 18, 2008 8:25 pm 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
simple, correct? the loop that splitted in 4 XXXX just need to be adjusted to XXXXXXXX, it is very likely to be simple, don t you think?


that's not one loop but N loops: the producer which initializes the static data and all the consumers. also, the data structures must be changed and will probably be cluttered with "#ifdef" thingies

now imagine a real-world project with DLLs produced by different teams and having to change the data layout... oughhh, all the teams have to recompile all their code, the new DLLs can't be mixed with older components, etc. with SoA you can change each DLL independently by merely recompiling after changing a single constant (the loop increment), so, please, trust me, if someone wants to design *clean libraries*: SoA is the *only option* for all the data passed in the interfaces, not a "stupid system" (sic)

also note I have used such "SoS" layouts in the past (already on the P!!!) for tactical advantages (more effective for explicit sw prefetch), though I'll *not* advise this for new code. there are always cases at some point where you don't use X,Y,Z,W,whatever together and you lose some cache locality due to the useless data. trust me if you want, but in a lot of cases you end up going back to pure & simple SoA because it's in fact faster for code using only X or only W etc. + on Conroe/Penryn the load port arguments don't stand IMHO, since for top performance you have to make extensive use of loop fission and you end up with a lot of simple loops where load port pressure is quite low vs. CPU capacity (it was quite different on the P4)
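
A rough sketch of the "single constant" argument, under the assumption that the interface passes plain per-component arrays. `VerticesSoA`, `translate_x`, and `VECTOR_WIDTH` are hypothetical names; the inner loop stands in for one packed instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build-time constant: 4 for 128-bit vectors, 8 for 256-bit.
   Only the loop step depends on it; the data layout does not. */
#ifndef VECTOR_WIDTH
#define VECTOR_WIDTH 4
#endif

/* SoA interface: one contiguous array per component. */
typedef struct {
    float *x;
    float *y;
    float *z;
    size_t count;
} VerticesSoA;

/* A consumer that only needs X touches only the X array, so no cache
   space is spent on Y and Z. */
static void translate_x(VerticesSoA *v, float dx)
{
    assert(v != NULL && v->x != NULL);
    size_t i = 0;
    for (; i + VECTOR_WIDTH <= v->count; i += VECTOR_WIDTH)
        for (size_t lane = 0; lane < VECTOR_WIDTH; ++lane)
            v->x[i + lane] += dx;   /* stands in for one packed add */
    for (; i < v->count; ++i)       /* scalar tail */
        v->x[i] += dx;
}
```

Rebuilding this with `-DVECTOR_WIDTH=8` changes the loop step but not `VerticesSoA`, so a caller compiled with a different width would still agree on the layout.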


 Post subject:
PostPosted: Tue Mar 18, 2008 11:11 pm 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
simple, correct? the loop that splitted in 4 XXXX just need to be adjusted to XXXXXXXX, it is very likely to be simple, don t you think?


that's not one loop but N loops, the producer which initialize the static data and all consumers, also the data structures must be changed and will be probably cluttered with "#ifdef" thingies

now imagine a realworld project with DLLs produced by different teams and having to change the data layout... oughhh all the teams have to recompile all their code, the new DLLs can't be mixed with older components, etc. with SoA you can change each DLL independently by merelly recompiling with just changing a single constant (the loop increment), so, please, trust me, if someone want to design *clean libraries* : SoA is the *only option* for all the data passed in the interfaces, not a "stupid system" (sic)

also note I have used such "SoS" layouts in the past (already on the P!!!) for tactical advantages (more effective for explicit sw prefetch) though I'll *not* advise this for new code, there is always cases at some point where you don't use X,Y,Z,W,whatever together and you lose some cache locality due to the useless data, trust me if you want but in a lot of cases you finish to go back to pure & simple SoA because it's in fact faster for codes using only X or only W etc. + on Conroe/Penryn the load port arguments don't stand IMHO since for top performance you have to make extensive use of loop fission, you end up with a lot of simple loops where load port pressure is quite low vs. CPU capacity (it was quite different on the P4)


Please name a commercial game using Structure of Array please!!!!
(I got the list, and it is very small, let's see if you can name one)

who?


 Post subject:
PostPosted: Wed Mar 19, 2008 7:53 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
I got the list, and it is very small


yes sure, now I can even understand why: you teach game developers that SoA is a "stupid system" (sic) and you're even paid for that


 Post subject:
PostPosted: Wed Mar 19, 2008 6:47 pm 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
I got the list, and it is very small


yes sure, now I can even undertand why : you teach game developers that SoA is a "stupid system" (sic) and you're even paid for that


well, I did not say that SoS was a small list :) ...

so, where is your List of games using Structure of Array! come on! don't change the subject ...

I think you spoke one more time without knowing, based on what you read on the net.

Very simple: you tell me the name of the game, I'll VTune it and post the VTune profile here. It should show a lot of PACK SSEx instructions if you are right and Structure of Arrays is used. I can even post the piece of ASM code that is responsible for the Structure of Arrays, if there is one ... hahahah

You know better than anybody else, so tell me, where is this game????

Stop dodging the question. Since you know so much, give me a game that uses it, since you are so sure!

[edit]
Let me help you: one of the games using SoA is Moto Racer, developed by Delphine Software. Check the date, you'll smile :)
It was optimized with the help of my friend Alexis ...
Now, your turn, give me a name!
[/edit]

who?
PS: You BS too often.


Last edited by who? on Thu Mar 20, 2008 4:56 am, edited 1 time in total.

 Post subject:
PostPosted: Wed Mar 19, 2008 8:38 pm 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
It should show a lot of PACK SSEx instruction if [...] the Structure of array is used.


nope, there are a lot of ways to initialize the SoA static data, simply with vanilla scalar MOVSS for ex. there are typically no hot spots in initialization code, trust me, that's typically not the code you will want to optimize. btw for most targets (except Penryn/Harpertown) pack instructions will probably lead to slower code anyway
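
As an illustration of initializing SoA data without any pack/shuffle instructions, here is a minimal sketch of a one-time AoS-to-SoA conversion using plain scalar copies. `VertexAoS` and `aos_to_soa` are hypothetical names, and which instructions a compiler actually emits for this loop depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Input as it often arrives from assets or an API: one struct per vertex. */
typedef struct { float x, y, z; } VertexAoS;

/* One-time conversion into three separate component arrays. Each statement
   is a single-float load and store; nothing here requires packing or
   swizzling instructions. */
static void aos_to_soa(const VertexAoS *in, size_t n,
                       float *x, float *y, float *z)
{
    assert(n == 0 || (in != NULL && x != NULL && y != NULL && z != NULL));
    for (size_t i = 0; i < n; ++i) {
        x[i] = in[i].x;
        y[i] = in[i].y;
        z[i] = in[i].z;
    }
}
```

Since this runs once at load time, it is unlikely to show up as a hot spot in a profile, which is why counting pack instructions says little about the layout used by the hot loops.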

btw pack instructions are no more no less useful for your beloved SoS so I'll suggest you to fix your methodology to build your "lists of games"

since (bogus or not) these lists tell us nothing about clean code going forward (with wider vectors around the corner) I'll also suggest to stop this "discussion"


 Post subject:
PostPosted: Thu Mar 20, 2008 2:13 am 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
It should show a lot of PACK SSEx instruction if [...] the Structure of array is used.


nope, there is a lot of ways to initialize the SoA static data, simply with vanilla scalar MOVSS for ex.



hahahaha! are you telling me that you are not going to use the MULPS and ADDPS when you do Structure of Array? hahahah .. you are getting funnier and funnier ...

Quote:
btw pack instructions are no more no less useful for your beloved SoS so I'll suggest you to fix your methodology to build your "lists of games"

Well, you are really clueless on this, aren't you? There is no SoA or SoS without Pack instructions; the purpose of SoS or SoA is TO USE the pack instructions!!!!!!!
Here is some reading that will correct your misunderstanding of the situation...
http://softwarecommunity.intel.com/articles/eng/3592.htm
(Please notice the packed instructions ... hehehe)
Now, when it is about instruction sets, please keep your opinion to yourself; you argue for the sake of arguing, without the knowledge to go with it.

Where is your SoA game?

who?


 Post subject: more interesting
PostPosted: Thu Mar 20, 2008 4:53 am 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Let's go back to the original subject: SH big die.

who?


 Post subject:
PostPosted: Thu Mar 20, 2008 7:53 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
hahahaha! are you telling me that you are not going to use the MULPS and ADDPS when you do Structure of Array?


huh? no I'm not, I was thinking that your PACK (all caps, wtf?) was referring to packing/swizzling, not the packed add/mul instructions, my bad

anyway ADDPS/MULPS will be used with a SoA or SoS layout, it looks very much like you are trying to hide your clumsy comment about SoA by going back to a [SoA, SoS] vs [AoS] debate

if you can't see why your idea to still promote SoS in 2008 is not very smart, I suggest you really *think* about how you'll teach people to target SSEx and AVX from the same code

who? wrote:
... packed instructions ...


yes it's more clear with the terms used in the field


 Post subject:
PostPosted: Thu Mar 20, 2008 8:31 am 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
hahahaha! are you telling me that you are not going to use the MULPS and ADDPS when you do Structure of Array?


huh ? no I'm not, I was thinking that your PACK (all caps, wtf ?) was refering to packing/swizzling not the packed add/mul instructions, my bad

anyway ADDPS/MULPS will be used with a SoA or SoS layout, it's much like if you try to hide your clumsy comment about SoA by going back to a [SoA, SoS] vs [AoS] debate

if you can't see why your idea to still promote SoS in 2008 is not very smart, I'll suggest you to really *think* how you'll teach people to target SSEx and AVX from the same code

who? wrote:
... packed instructions ...


yes it's more clear with the terms used in the field


Sometimes I wonder why I have to explain the whole story.
Look at the link I gave you:

Quote:
NumOfGroups = NumOfVertices/SIMDwidth
typedef struct{
float x[SIMDwidth];
float y[SIMDwidth];
float z[SIMDwidth];
} VerticesCoordList;
typedef struct{
int a[SIMDwidth];
int b[SIMDwidth];
int c[SIMDwidth];
. . .
} VerticesColorList;
VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];


A simple change of SIMDwidth will let you support AVX ... hummm hummm
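
A hedged sketch of what a consumer of that grouped layout could look like when the width is a compile-time constant. `SIMD_WIDTH`, `CoordGroup`, and `squared_lengths` are assumed names modelled on the quoted structs, not code from the linked article, and the inner loop stands in for packed multiply/add instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed compile-time width: 4 floats for 128-bit SSE, 8 for 256-bit. */
#ifndef SIMD_WIDTH
#define SIMD_WIDTH 4
#endif

/* One XXXX YYYY ZZZZ block, in the spirit of VerticesCoordList above. */
typedef struct {
    float x[SIMD_WIDTH];
    float y[SIMD_WIDTH];
    float z[SIMD_WIDTH];
} CoordGroup;

/* Squared length of every vertex, one group at a time. */
static void squared_lengths(const CoordGroup *groups, size_t num_groups,
                            float *out)
{
    assert(num_groups == 0 || (groups != NULL && out != NULL));
    for (size_t g = 0; g < num_groups; ++g)
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            out[g * SIMD_WIDTH + lane] =
                groups[g].x[lane] * groups[g].x[lane] +
                groups[g].y[lane] * groups[g].y[lane] +
                groups[g].z[lane] * groups[g].z[lane];
}
```

The source of this loop indeed does not change with the width, but `sizeof(CoordGroup)` does, so every producer and consumer of the data has to be rebuilt with the same `SIMD_WIDTH`.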

That's OK, I am in this every day, so it sounds obvious to me; I can't expect everybody to get it in a few seconds...

who?


 Post subject:
PostPosted: Thu Mar 20, 2008 9:34 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
Eric Bron wrote:
who? wrote:
hahahaha! are you telling me that you are not going to use the MULPS and ADDPS when you do Structure of Array?


huh ? no I'm not, I was thinking that your PACK (all caps, wtf ?) was refering to packing/swizzling not the packed add/mul instructions, my bad

anyway ADDPS/MULPS will be used with a SoA or SoS layout, it's much like if you try to hide your clumsy comment about SoA by going back to a [SoA, SoS] vs [AoS] debate

if you can't see why your idea to still promote SoS in 2008 is not very smart, I'll suggest you to really *think* how you'll teach people to target SSEx and AVX from the same code

who? wrote:
... packed instructions ...


yes it's more clear with the terms used in the field


Some time, I wonder why i got to explain the "all story"
look at the link i gave you,

Quote:
NumOfGroups = NumOfVertices/SIMDwidth
typedef struct{
float x[SIMDwidth];
float y[SIMDwidth];
float z[SIMDwidth];
} VerticesCoordList;
typedef struct{
int a[SIMDwidth];
int b[SIMDwidth];
int c[SIMDwidth];
. . .
} VerticesColorList;
VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];


who?


there is no timing information at your link, it looks like an old 2004 paper revamped in a rush, it even still talks about "fewer prefetch"

with AVX, sizeof(VerticesCoordList) = 96 bytes, not very smart with a 64-byte cache line size if you ask me, at least not without precise timings comparing the two approaches
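
The 96-byte figure can be checked with a bit of arithmetic. A minimal sketch, assuming 4-byte floats, a 64-byte cache line, three float components per group, and an array that starts on a cache-line boundary; `group_bytes` and `lines_touched` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

enum { CACHE_LINE_BYTES = 64 };  /* assumption: 64-byte cache lines */

/* Size of one x[w], y[w], z[w] group of floats. */
static size_t group_bytes(size_t simd_width)
{
    assert(simd_width > 0);
    return 3 * simd_width * sizeof(float);
}

/* Number of cache lines spanned by group number `group`, for an array of
   groups whose first byte sits on a cache-line boundary. */
static size_t lines_touched(size_t group, size_t simd_width)
{
    size_t first = group * group_bytes(simd_width);
    size_t last  = first + group_bytes(simd_width) - 1;
    return last / CACHE_LINE_BYTES - first / CACHE_LINE_BYTES + 1;
}
```

With a width of 8, each group is 96 bytes and always spans two cache lines. With a width of 4, groups are 48 bytes, and some of them also straddle a line boundary (group 1 covers bytes 48 to 95), so timings would still be needed to say how much this matters in practice.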

