Aceshardware

(not so) temporary home for the aceshardware community
 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sat Jan 02, 2010 1:20 pm
Joined: Wed Aug 08, 2007 11:26 am
Posts: 7
Ah, ok Agner. Nice that they "fixed" it...

Unfortunately, I no longer have the email from AMD (either it was lost in my early "Thunderbird ate my Yahoo inbox" tragedy, or it was left on an employer's Exchange server).


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Fri Jan 08, 2010 6:11 am
Joined: Wed Jan 06, 2010 6:46 pm
Posts: 1
Here is the real story on the history of the CPUID family value; it was influenced more by bugs in NT 4:
http://lkml.org/lkml/2009/10/27/441
http://lkml.org/lkml/2009/10/27/453


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sat Jan 09, 2010 12:33 pm
Joined: Fri Sep 07, 2007 10:31 am
Posts: 41
Location: Denmark
yuhong wrote:
Here is the real story on the history of the CPUID family value; it was influenced more by bugs in NT 4:
http://lkml.org/lkml/2009/10/27/441
http://lkml.org/lkml/2009/10/27/453


Thanks for this valuable info.
It just confirms my point that it is bad programming practice to make software check for known CPUs and model numbers. Any software using a list of known CPUs will surely be obsolete in a short time. Why is there no CPUID bit for x64 support?


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sat Jan 09, 2010 5:43 pm
Joined: Sun Oct 07, 2007 6:22 pm
Posts: 119
> Why is there no CPUID bit for x64 support?

CPUID function 0x80000001, EDX, bit 29 (the long mode flag).
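For illustration, that check in C as a minimal sketch (the leaf and bit are as above; using __get_cpuid, the GCC/Clang helper from <cpuid.h>, is my own assumption):

Code:
/* Check for x86-64 (long mode) support via CPUID. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is unavailable. */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)
        && (edx & (1u << 29)))        /* EDX bit 29 = long mode */
        puts("x64 supported");
    else
        puts("x64 not supported");
    return 0;
}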


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Tue Jan 12, 2010 12:22 am
Joined: Mon Dec 14, 2009 12:53 pm
Posts: 4
Agner wrote:
Thanks for this valuable info.
It just confirms my point that it is bad programming practice to make software check for known CPUs and model numbers. Any software using a list of known CPUs will surely be obsolete in a short time. Why is there no CPUID bit for x64 support?


I respectfully disagree... I realise it's very easy to jump to conclusions in this particular case, but Francois is correct. I have a background in compiler development for SPARC and embedded OS development on ARM, so even if I don't begin to approach the legendary experience of some of the present forum members, I am definitely overqualified to answer this question.

Using the ubiquitous gcc as an example, why do you think they have switches such as -march=core2?

Yes, correct: instruction timings. This simple example I copied from your guide (instruction_tables.pdf):
BSF [Atom]: 16 cycles
BSF [Core i7]: 3/1 cycles (latency/throughput)

A compiler can NEVER do a good job from only knowing which instructions are available, and it is even more wrong, from a software design viewpoint, to make assumptions such as "75% of the processors supporting instruction X have a fast instruction Y". I'm sorry, but for statically compiled binaries, a CPUID-based dispatcher (which makes this very same choice, only at runtime) is the proper way, just as it is for hardware workarounds in OSes.
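To make the runtime-dispatch point concrete, a minimal sketch, assuming GCC 4.8+ and its __builtin_cpu_supports/__builtin_cpu_is builtins (the work functions are hypothetical):

Code:
#include <stdio.h>

static void work_generic(void) { puts("generic branch"); }
static void work_popcnt(void)  { puts("POPCNT branch"); }

int main(void)
{
    __builtin_cpu_init();   /* only strictly needed before constructors run; harmless here */

    /* Dispatch on what the CPU can do... */
    if (__builtin_cpu_supports("popcnt"))
        work_popcnt();
    else
        work_generic();

    /* ...and, where timings differ, on what the CPU is. */
    if (__builtin_cpu_is("atom"))
        puts("selecting Atom-tuned code");
    return 0;
}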

Thus ICC's default behaviour is perfectly valid, and it is definitely NOT a "Cripple AMD" function, but rather a "not as much time invested in optimizing and certifying for AMD as would have been nice" feature. Besides, what company except Microsoft releases unverified products? In my experience, at least, there is no way in hell enterprise-class software gets released to end customers without proper verification and validation, and we have no right to demand that Intel walk that extra mile.

As a developer I am however very grateful for this information.


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Wed Jan 13, 2010 9:35 am
Joined: Fri Sep 07, 2007 10:31 am
Posts: 41
Location: Denmark
Zyr wrote:
Instruction timings. This simple example I copied from your guide (instruction_tables.pdf):
BSF [Atom]: 16 cycles
BSF [Core i7]: 3/1 cycles (latency/throughput)

A compiler can NEVER do a good job from only knowing which instructions are available,


The BSF instruction is a good example. One Intel engineer who contributes optimized string functions to the glibc library had a problem with the strlen function being slow on the Atom processor. He finally found out that the problem was the BSF instruction (Bit Scan Forward). The obvious thing to do is to make a separate branch for Atom. Now, let me explain why this is a bad idea:

Consider the time it takes to develop a separate branch in the library for the Atom processor. Add to this the time it takes for the updated version of glibc to penetrate to general distributions. Add to this the time interval with which an average programmer updates his compiler tools. Add to this the development time for an average software product that uses the strlen function in critical code. Add to this the time it takes to market the software. Add to this the time before the average user decides to update this software. By then, the Atom processor will surely be obsolete and the user is likely to have some other processor. We don't know whether the successor of Atom will be slow or fast on the BSF instruction, so the glibc library wouldn't know which branch is fastest on the new processor.

What is the solution then? Normally, I would just optimize for the newest processor and add an extra branch for old processors that don't have the necessary instruction sets. If the difference in speed is so big that I have to make a separate branch for processors with slow BSF instructions, then the CPU dispatcher should simply test the speed of the BSF instruction, or test which branch is fastest. But even here I would think twice before implementing such a solution. After all, the BSF instruction is executed only once in the strlen function. We are talking about a few nanoseconds here. The strlen function would have to be called millions or billions of times before the difference even matters. What program would have so many strings? And who would run such a big job on an Atom?
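A sketch of that measure-and-pick dispatcher, assuming RDTSC via __rdtsc() from <x86intrin.h>; the two strlen variants and all names are hypothetical stand-ins, not glibc's actual code:

Code:
#include <stddef.h>
#include <string.h>
#include <x86intrin.h>

typedef size_t (*strlen_fn)(const char *);

static size_t strlen_bsf(const char *s)   { return strlen(s); }  /* stand-in for the BSF branch */
static size_t strlen_plain(const char *s)                        /* stand-in for the BSF-free branch */
{
    size_t n = 0;
    while (s[n]) n++;
    return n;
}

static unsigned long long cycles(strlen_fn f, const char *s)
{
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < 1000; i++)
        f(s);                  /* indirect call keeps the loop from being folded away */
    return __rdtsc() - t0;
}

/* Run once at startup; all later calls go through the winner. */
static strlen_fn pick_strlen(void)
{
    static const char probe[] = "a probe string long enough to exercise the inner loop";
    return cycles(strlen_bsf, probe) <= cycles(strlen_plain, probe)
           ? strlen_bsf : strlen_plain;
}

int main(void)
{
    strlen_fn my_strlen = pick_strlen();
    return (int)my_strlen("hello") - 5;   /* exits 0 when the chosen branch works */
}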

Having hundreds of branches for hundreds of different processors is just plain foolish. It bloats your software and pollutes your code cache with the result that performance goes down rather than up. And it is impossible to test, verify, debug and maintain so many different branches of code.

The only case where I would dispatch for a specific CPU model number is if the CPU has a bug that must be avoided, for example the Pentium FDIV bug.


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Tue Sep 20, 2011 9:59 am
Joined: Wed Aug 10, 2011 1:05 pm
Posts: 4
A good example, Zyr. And I completely agree with the idea that it is bad programming practice to make software check for known CPUs and model numbers.


Last edited by Christiande on Thu Nov 03, 2011 7:39 am, edited 2 times in total.

 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Thu Sep 29, 2011 11:15 am
Joined: Fri Mar 21, 2008 4:07 pm
Posts: 74
Agner wrote:
Zyr wrote:
A compiler can NEVER do a good job from only knowing which instructions are available,

The BSF instruction is a good example. [...] The only case where I would dispatch for a specific CPU model number is if the CPU has a bug that must be avoided, for example the Pentium FDIV bug.


But given Intel's (and more recently AMD's) strategy of having at most two CPU microarchitectures at a time, you don't need to care about more than three generations: the current performance uarch, the current low-power uarch, and the last legacy uarch.

For example, Intel's ICC needs only three branches: the Nehalem generation (branch 1), the Atom generation (branch 2), and Core (legacy, branch 3).

I really doubt the compiler knows the difference between a Core i3 and a Core i7.


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sat Oct 01, 2011 12:50 pm
Joined: Mon Jan 24, 2011 3:22 pm
Posts: 119
Agner wrote:
The BSF instruction is a good example. [...] What is the solution then? Normally, I would just optimize for the newest processor and add an extra branch for old processors that don't have the necessary instruction sets.


This is quite laughable. This is a textbook example of why you should optimize data structures before spending time on costly micro-optimizations. The proper way to "optimize" strlen() is to not call it at all: use a string representation that remembers the string length (which, as an additional bonus, allows \0 as a valid string character and eliminates a whole class of interesting security exploits). I expect most languages other than C got this right.
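A minimal sketch of such a counted-string representation (names are illustrative, not any particular library):

Code:
#include <stdlib.h>
#include <string.h>

/* Length is stored, so "strlen" becomes an O(1) field read,
 * and the bytes may legally contain '\0'. */
typedef struct {
    size_t len;
    char  *data;    /* not necessarily NUL-terminated */
} counted_str;

static counted_str cs_make(const char *bytes, size_t len)
{
    counted_str s = { 0, malloc(len) };
    if (s.data) {
        memcpy(s.data, bytes, len);
        s.len = len;
    }
    return s;
}

static size_t cs_len(counted_str s) { return s.len; }   /* the whole "strlen" */

int main(void)
{
    counted_str s = cs_make("has\0embedded", 12);   /* '\0' is a valid byte */
    return cs_len(s) == 12 ? 0 : 1;
}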

Micro-optimizing strlen() with assembler code is like optimizing a bubble sort implementation. A ridiculous waste of time.

Yes, there are other use cases that would probably benefit from such micro-optimizations. But the fact that such a discussion is about strlen() is really priceless and says a lot about the mindset of *some* (not all) hardware engineers and low-level programmers.


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sun Oct 02, 2011 5:25 am
Joined: Sun Sep 23, 2007 1:29 am
Posts: 175
Location: Los Angeles, CA
lol.... total agreement there. libc string functions are inherently inefficient; anybody who builds their apps around them is a moron.

I've been working on a new database library for OpenLDAP lately; it mmaps all data, so data fetches do no mallocs or memcpys. The thing is blindingly fast, ~85% faster than our previous BerkeleyDB-based code. (Our LDAP server now runs at line rate, handling as many queries per second as the network hardware can deliver. There's practically no overhead for data fetches: CPU to spare while burning up the LAN.) I spent a couple of days porting the DB library into SQLite as well. It dropped the size of the SQLite binary by about 60KB (of an overall 2MB) and shaved just 3% off SQLite's runtime. Profiling shows that 90% of the CPU time is eaten up in printf and other string-handling functions. Dumb string handling is the most common mistake I see in C code. It will take more than a few days' effort to rewrite enough of the ridiculously slow code in here for the actual database performance to make a measurable difference; by then the code will barely resemble the original SQLite code. (But it will also be at least an order of magnitude faster, once all the idiot string handling is eliminated.)
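Reduced to a sketch, the zero-copy fetch idea looks roughly like this (plain POSIX mmap; illustrative only, not the actual OpenLDAP code):

Code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a data file once; a "fetch" is then just a pointer into the
 * mapping, with no malloc or memcpy per lookup. */
static const char *map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    void *p;

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping survives the close */
    if (p == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return p;                         /* records are read in place */
}

int main(int argc, char **argv)
{
    size_t n;
    const char *base = argc > 1 ? map_file(argv[1], &n) : NULL;
    if (!base)
        return 1;
    printf("mapped %zu bytes at %p\n", n, (const void *)base);
    return 0;
}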


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sun Oct 02, 2011 6:07 am
Joined: Sat Mar 22, 2008 5:10 pm
Posts: 370
SQLite has so many performance problems that I'm surprised someone looked first at its string functions...

Somewhat as Foo_'s posts suggest: SQLite has inefficient disk access and inefficient query algorithms, and it likely calls those slow string functions far more often than necessary, even more often than those inefficient algorithms would require. And then someone decides to optimize the string functions...


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sun Oct 02, 2011 7:21 am
Joined: Thu Sep 20, 2007 8:13 pm
Posts: 67
Foo_ wrote:
This is quite laughable. [...] The proper way to "optimize" strlen() is to not call it at all: use a string representation that remembers the string length. [...]


To be fair, we're talking about an Intel engineer here, not someone developing an actual application. The fact is that there are people using those functions, so it makes sense for Intel to optimize them, even if it would be better for everyone if developers just didn't use them.


 Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sun Oct 02, 2011 5:23 pm
Joined: Mon Jan 24, 2011 3:22 pm
Posts: 119
Alexko wrote:
To be fair, we're talking about an Intel engineer here, not someone developing an actual application. The fact is that there are people using those functions, so it makes sense for Intel to optimize them, even if it would be better for everyone if developers just didn't use them.


Agreed, but either string handling is not performance-critical, and having a slowish strlen() is fine; or it's performance-critical, and the developer had better switch to something else.

