Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Sat Jan 02, 2010 1:20 pm
Joined: Wed Aug 08, 2007 11:26 am Posts: 7
Ah, ok Agner. Nice that they "fixed" it...
Unfortunately, I do not have the email from AMD anymore (either it was lost in my early "Thunderbird ate my Yahoo inbox" tragedy, or it was left on an employer's Exchange server).
Thanks for this valuable info. It just confirms my point that it is bad programming practice to make software check for known CPUs and model numbers. Any software using a list of known CPUs will surely be obsolete in a short time. Why is there no CPUID bit for x64 support?
Post subject: Re: CPUID family bits added because of flaw in Intel compiler
Posted: Tue Jan 12, 2010 12:22 am
Joined: Mon Dec 14, 2009 12:53 pm Posts: 4
Agner wrote:
Thanks for this valuable info. It just confirms my point that it is bad programming practice to make software check for known CPUs and model numbers. Any software using a list of known CPUs will surely be obsolete in a short time. Why is there no CPUID bit for x64 support?
I respectfully disagree... I realise it's very easy to jump to conclusions in this particular case, but Francois is correct. I have a background in compiler development for SPARC and embedded OS development on ARM, so even if I don't begin to approach the legendary experience of some of the present forum members, I am definitely overqualified to answer this question.
Using the ubiquitous gcc as an example, why do you think they have switches such as -march=core2?
Yes, correct: instruction timings. This simple example is copied from your guide (instruction_tables.pdf):
BSF [Atom]: 16 cycles
BSF [Core i7]: 3/1 cycles (latency/throughput)
A compiler can NEVER do a good job from only knowing which instructions are available, and it's even more wrong from a software design viewpoint to make assumptions such as: 75% of processors supporting instruction X have a fast instruction Y. I'm sorry, but for statically compiled binaries (which make this very same choice, but at runtime), CPUID is the proper way, as well as for hardware workarounds in OSes.
Thus ICC's default behaviour is perfectly valid and it's definitely NOT a "Cripple AMD" function, but rather a "not enough time invested in optimizing/certifying for AMD as would have been nice to have" feature. Besides, what company except Microsoft releases unverified products? At least in my experience there is no way in hell enterprise-class software gets released to end customers without proper verification and validation, but we don't have the right to demand from Intel that they walk that extra mile.
As a developer I am however very grateful for this information.
Zyr wrote:
Instruction timings, this simple example I copied from your guide (instruction_tables.pdf): BSF [Atom] 16 cycles, BSF [Core i7] 3/1 cycles (latency/throughput).
A compiler can NEVER do a good job from only knowing which instructions are available,
The BSF instruction is a good example. One Intel engineer who contributes optimized string functions to the glibc library had a problem with the strlen function being slow on the Atom processor. He finally found out that the problem was the BSF instruction (Bit Scan Forward). The obvious thing to do is to make a separate branch for Atom. Now, let me explain why this is a bad idea:
Consider the time it takes to develop a separate branch in the library for the Atom processor. Add to this the time it takes for the updated version of glibc to penetrate to general distributions. Add to this the time interval with which an average programmer updates his compiler tools. Add to this the development time for an average software product that uses the strlen function in critical code. Add to this the time it takes to market the software. Add to this the time before the average user decides to update this software. By then, the Atom processor will surely be obsolete and the user is likely to have some other processor. We don't know whether the successor of Atom will be slow or fast on the BSF instruction, so the glibc library wouldn't know which branch is fastest on the new processor.
What is the solution then? Normally, I would just optimize for the newest processor and add an extra branch for old processors that don't have the necessary instruction sets. If the difference in speed is so big that I have to make a separate branch for processors with slow BSF instructions then the CPU dispatcher should simply test the speed of the BSF instruction or test which branch is fastest. But even here I would think twice before implementing such a solution. After all, the BSF instruction is executed only once in the strlen function. We are talking about a few nanoseconds here. The strlen function would have to be called millions or billions of times before the difference even matters. What program would have so many strings? And who would run such a big job on an Atom?
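As a rough illustration of the "test which branch is fastest" idea, here is a minimal C sketch. It is not code from glibc or from anyone in this thread. All names (len_bytewise, len_bitscan, pick_fastest_len) are hypothetical, both branches take an explicit buffer size so the word-wise loads stay in bounds, and __builtin_ctzll assumes GCC or Clang (on x86 it typically compiles to BSF or TZCNT). Note that the bit scan runs only once per call, on the word that contains the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Both branches return the index of the first zero byte in buf[0..cap),
   or cap if there is none. */
typedef size_t (*len_fn)(const char *buf, size_t cap);

/* Branch A: plain byte loop, no bit scan at all. */
size_t len_bytewise(const char *buf, size_t cap) {
    size_t i = 0;
    while (i < cap && buf[i] != '\0') i++;
    return i;
}

/* Assemble 8 bytes into a word with byte 0 in the low bits,
   independent of the host byte order. */
static uint64_t load_le64(const unsigned char *p) {
    uint64_t w = 0;
    for (int i = 0; i < 8; i++) w |= (uint64_t)p[i] << (8 * i);
    return w;
}

/* Branch B: 8 bytes per step; one bit scan locates the terminator. */
size_t len_bitscan(const char *buf, size_t cap) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = 0;
    for (; i + 8 <= cap; i += 8) {
        uint64_t w = load_le64(p + i);
        /* The lowest set bit marks the first zero byte in w. */
        uint64_t zero = (w - ones) & ~w & highs;
        if (zero != 0)
            return i + (size_t)(__builtin_ctzll(zero) >> 3);
    }
    while (i < cap && p[i] != 0) i++;   /* tail shorter than one word */
    return i;
}

/* Crude timing of one branch using the coarse clock() timer. */
static clock_t time_branch(len_fn f, const char *buf, size_t cap, int reps) {
    volatile size_t sink = 0;           /* keeps the calls from being dropped */
    clock_t start = clock();
    for (int r = 0; r < reps; r++) sink += f(buf, cap);
    return clock() - start;
}

/* Dispatcher: measure both branches once and return the faster one. */
len_fn pick_fastest_len(void) {
    static char sample[4096];           /* zero-initialized */
    memset(sample, 'x', sizeof sample - 1);
    assert(len_bytewise(sample, sizeof sample) ==
           len_bitscan(sample, sizeof sample));
    clock_t a = time_branch(len_bytewise, sample, sizeof sample, 2000);
    clock_t b = time_branch(len_bitscan, sample, sizeof sample, 2000);
    return b < a ? len_bitscan : len_bytewise;
}
```

A program would call pick_fastest_len() once at startup and keep the returned pointer. clock() is a coarse timer, so a real dispatcher would need a more careful measurement; the point is only that the choice follows measured behaviour instead of a list of known CPU models.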
Having hundreds of branches for hundreds of different processors is just plain foolish. It bloats your software and pollutes your code cache with the result that performance goes down rather than up. And it is impossible to test, verify, debug and maintain so many different branches of code.
The only case where I would dispatch for a specific CPU model number is if the CPU has a bug that must be avoided, for example the Pentium FDIV bug.
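The policy described in this post, branching on instruction-set support instead of on CPU model numbers, might look roughly like the following C sketch. The function names are hypothetical, sum_newer is only a plain-C placeholder for a branch that would really use newer instructions, and the detection relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports on x86 (other targets fall back to the baseline here).

```c
#include <assert.h>
#include <stddef.h>

typedef long (*sum_fn)(const int *v, size_t n);

/* Baseline branch: must run on every CPU the program supports. */
long sum_generic(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}

/* Placeholder for the branch tuned for newer CPUs. A real version would
   use the newer instruction set; this one is plain C so the sketch
   builds on any target. */
long sum_newer(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    long a = 0, b = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) { a += v[i]; b += v[i + 1]; }
    if (i < n) a += v[i];
    return a + b;
}

/* Ask the CPU what it supports; no model or family numbers involved. */
int cpu_has_avx2(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;   /* assumption: unknown targets get the baseline branch */
#endif
}

/* Dispatcher: newest branch if the feature is reported, baseline otherwise. */
sum_fn select_sum(void) {
    return cpu_has_avx2() ? sum_newer : sum_generic;
}
```

With this shape, a future processor that reports the feature gets the newer branch without any update to the software, and only two branches need to be tested and maintained.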
Post subject: Re: CPUID family bits added because of flaw in Intel compile
Posted: Thu Sep 29, 2011 11:15 am
Joined: Fri Mar 21, 2008 4:07 pm Posts: 74
Agner wrote:
Zyr wrote:
Instruction timings, this simple example I copied from your guide. (instruction_tables.pdf) BSF [Atom] 16 Cycles BSF [Core i7] 3/1 Cycles(Latency/Throughput)
A compiler can NEVER do a good job from only knowing which instructions are available,
The BSF instruction is a good example. One Intel engineer who contributes optimized string functions to the glibc library had a problem with the strlen function being slow on the Atom processor. He finally found out that the problem was the BSF instruction (Bit Scan Forward). The obvious thing to do is to make a separate branch for Atom. Now, let me explain why this is a bad idea:
Consider the time it takes to develop a separate branch in the library for the Atom processor. Add to this the time it takes for the updated version of glibc to penetrate to general distributions. Add to this the time interval with which an average programmer updates his compiler tools. Add to this the development time for an average software product that uses the strlen function in critical code. Add to this the time it takes to market the software. Add to this the time before the average user decides to update this software. By then, the Atom processor will surely be obsolete and the user is likely to have some other processor. We don't know whether the successor of Atom will be slow or fast on the BSF instruction, so the glibc library wouldn't know which branch is fastest on the new processor.
What is the solution then? Normally, I would just optimize for the newest processor and add an extra branch for old processors that don't have the necessary instruction sets. If the difference in speed is so big that I have to make a separate branch for processors with slow BSF instructions then the CPU dispatcher should simply test the speed of the BSF instruction or test which branch is fastest. But even here I would think twice before implementing such a solution. After all, the BSF instruction is executed only once in the strlen function. We are talking about a few nanoseconds here. The strlen function would have to be called millions or billions of times before the difference even matters. What program would have so many strings? And who would run such a big job on an Atom?
Having hundreds of branches for hundreds of different processors is just plain foolish. It bloats your software and pollutes your code cache with the result that performance goes down rather than up. And it is impossible to test, verify, debug and maintain so many different branches of code.
The only case where I would dispatch for a specific CPU model number is if the CPU has a bug that must be avoided, for example the Pentium FDIV bug.
But given Intel's (and more recently AMD's) strategy of having at most 2 CPU uarchs, you don't need to care about more than 3 generations: the current performance uarch, the current low-power uarch and the last legacy uarch.
For example, Intel's ICC needs only 3 branches: the Nehalem generation (branch 1), the Atom generation (branch 2) and Core (legacy, branch 3).
I really doubt the compiler knows the difference between a Core i3 and a Core i7.
Post subject: Re: CPUID family bits added because of flaw in Intel compile
Posted: Sat Oct 01, 2011 12:50 pm
Joined: Mon Jan 24, 2011 3:22 pm Posts: 119
Agner wrote:
The BSF instruction is a good example. One Intel engineer who contributes optimized string functions to the glibc library had a problem with the strlen function being slow on the Atom processor. He finally found out that the problem was the BSF instruction (Bit Scan Forward). The obvious thing to do is to make a separate branch for Atom. Now, let me explain why this is a bad idea:
Consider the time it takes to develop a separate branch in the library for the Atom processor. Add to this the time it takes for the updated version of glibc to penetrate to general distributions. Add to this the time interval with which an average programmer updates his compiler tools. Add to this the development time for an average software product that uses the strlen function in critical code. Add to this the time it takes to market the software. Add to this the time before the average user decides to update this software. By then, the Atom processor will surely be obsolete and the user is likely to have some other processor. We don't know whether the successor of Atom will be slow or fast on the BSF instruction, so the glibc library wouldn't know which branch is fastest on the new processor.
What is the solution then? Normally, I would just optimize for the newest processor and add an extra branch for old processors that don't have the necessary instruction sets.
This is quite laughable. This is a schoolbook example of why you should optimize data structures before spending time on costly micro-optimizations. The proper way to "optimize" strlen() is to not call it at all, and instead use a string representation that remembers the string length (and, as an additional bonus point, allows \0 as a valid string character - and eliminates a whole array of interesting security exploits). I expect most languages except C got it right.
Micro-optimizing strlen() with assembler code is like optimizing a bubble sort implementation. A ridiculous waste of time.
Yes, there are other use cases that would probably benefit from such micro-optimizations. But the fact that such a discussion is about strlen() is really priceless and tells a lot about the mindset of *some* (not all) hardware engineers and low-level programmers.
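A string representation that remembers its length could be sketched in C as follows. This is only an illustration of the idea in this post; the lstring type and its functions are hypothetical names, not part of libc or any library mentioned here. The sketch keeps a trailing NUL after the data purely for convenience when passing the buffer to C APIs, which is a design assumption of this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A string that carries its own length, so the length is never
   recomputed by scanning and '\0' is an ordinary character. */
typedef struct {
    size_t len;    /* number of bytes in data, excluding the extra NUL */
    char  *data;   /* len bytes followed by one NUL; NULL if allocation failed */
} lstring;

/* Copy len bytes into a new lstring. */
lstring lstring_from(const void *bytes, size_t len) {
    assert(bytes != NULL || len == 0);
    lstring s = { 0, malloc(len + 1) };
    if (s.data != NULL) {
        if (len > 0) memcpy(s.data, bytes, len);
        s.data[len] = '\0';
        s.len = len;
    }
    return s;
}

/* O(1): no scan, regardless of how long the string is. */
size_t lstring_len(const lstring *s) {
    return s->len;
}

/* Append len bytes; returns 0 on success, -1 if memory runs out
   (in which case the string is left unchanged). */
int lstring_append(lstring *s, const void *bytes, size_t len) {
    assert(bytes != NULL || len == 0);
    char *p = realloc(s->data, s->len + len + 1);
    if (p == NULL) return -1;
    if (len > 0) memcpy(p + s->len, bytes, len);  /* end is known, no strlen */
    s->data = p;
    s->len += len;
    s->data[s->len] = '\0';
    return 0;
}

void lstring_free(lstring *s) {
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

For example, lstring_from("ab\0cd", 5) yields a string of length 5, while strlen on the same bytes would stop at 2. Appending also needs no scan, since the end position is already known.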
Post subject: Re: CPUID family bits added because of flaw in Intel compile
Posted: Sun Oct 02, 2011 5:25 am
Joined: Sun Sep 23, 2007 1:29 am Posts: 175 Location: Los Angeles, CA
lol.... total agreement there. libc string functions are inherently inefficient; anybody who builds their apps around them is a moron.
I've been working on a new database library for OpenLDAP lately; it mmap's all data so data fetches do no malloc's or memcpy's. The thing is blindingly fast, ~85% faster than our previous BerkeleyDB-based code. (Our LDAP server now runs at line rate, handles as many queries/second as the network hardware can deliver. There's practically no overhead for data fetches, CPU to spare while burning up the LAN.)
I spent a couple of days porting the DB library into SQLite as well. It dropped the size of the SQLite binary by about 60KB (of an overall 2MB) and shaved just 3% off SQLite's runtime. Profiling shows that 90% of the CPU time is eaten up in printf and other string-handling functions. Dumb string handling is the most common mistake I usually see in C code. It will take more than a few more days' effort to rewrite enough of the ridiculously slow code in here for the actual database performance to make a measurable difference. By then the code will barely resemble the original SQLite code. (But it will also be at least an order of magnitude faster, once all the idiot string handling is eliminated.)
Post subject: Re: CPUID family bits added because of flaw in Intel compile
Posted: Sun Oct 02, 2011 6:07 am
Joined: Sat Mar 22, 2008 5:10 pm Posts: 370
SQLite has so many performance problems that I'm surprised someone looked first at its string functions...
Somewhat as Foo_ posted: SQLite has inefficient disk access and inefficient query algorithms, and it is likely calling those slow string functions far more often than even those inefficient algorithms need, and then someone decides to optimize the string functions...
Post subject: Re: CPUID family bits added because of flaw in Intel compile
Posted: Sun Oct 02, 2011 7:21 am
Joined: Thu Sep 20, 2007 8:13 pm Posts: 67
Foo_ wrote:
Agner wrote:
The BSF instruction is a good example. One Intel engineer who contributes optimized string functions to the glibc library had a problem with the strlen function being slow on the Atom processor. He finally found out that the problem was the BSF instruction (Bit Scan Forward). The obvious thing to do is to make a separate branch for Atom. Now, let me explain why this is a bad idea:
Consider the time it takes to develop a separate branch in the library for the Atom processor. Add to this the time it takes for the updated version of glibc to penetrate to general distributions. Add to this the time interval with which an average programmer updates his compiler tools. Add to this the development time for an average software product that uses the strlen function in critical code. Add to this the time it takes to market the software. Add to this the time before the average user decides to update this software. By then, the Atom processor will surely be obsolete and the user is likely to have some other processor. We don't know whether the successor of Atom will be slow or fast on the BSF instruction, so the glibc library wouldn't know which branch is fastest on the new processor.
What is the solution then? Normally, I would just optimize for the newest processor and add an extra branch for old processors that don't have the necessary instruction sets.
This is quite laughable. This is a schoolbook example of why you should optimize data structures before spending time on costly micro-optimizations. The proper way to "optimize" strlen() is to not call it at all, and instead use a string representation that remembers the string length (and, as an additional bonus point, allows \0 as a valid string character - and eliminates a whole array of interesting security exploits). I expect most languages except C got it right.
Micro-optimizing strlen() with assembler code is like optimizing a bubble sort implementation. A ridiculous waste of time.
Yes, there are other use cases that would probably benefit from such micro-optimizations. But the fact that such a discussion is about strlen() is really priceless and tells a lot about the mindset of *some* (not all) hardware engineers and low-level programmers.
To be fair, we're talking about an Intel engineer here, not someone developing an actual application. The fact is there are people using those functions, so it makes sense for Intel to optimize them, even if it would be better for everyone that developers just don't use them.
Post subject: Re: CPUID family bits added because of flaw in Intel compile
Posted: Sun Oct 02, 2011 5:23 pm
Joined: Mon Jan 24, 2011 3:22 pm Posts: 119
Alexko wrote:
To be fair, we're talking about an Intel engineer here, not someone developing an actual application. The fact is there are people using those functions, so it makes sense for Intel to optimize them, even if it would be better for everyone that developers just don't use them.
Agreed, but either string handling is not performance-critical, and having a slowish strlen() is fine; or it's performance-critical, and the developer had better switch to something else.