Joined: Wed Jun 27, 2007 10:19 am Posts: 331 Location: Milano, Italy
Agner wrote:
Just-in-time compilation of byte code obviously removes the runtime CPU dispatcher, but the JIT compiler itself becomes bigger and more difficult to maintain the more different branches it has to support. JIT compilation in general has several disadvantages. The intermediate step in the compilation process makes it more difficult for the compiler to find possibilities for optimization.
That is true only if you JIT-compile at run time. You can just as easily use LLVM to do ahead-of-time (AOT) compilation at install time, when the target CPU can already be detected, and so remove the overhead at run time.
Quote:
I have never seen a JIT compiled application that runs faster than a similar directly compiled binary. Furthermore, the JIT compiler and runtime framework often use much more resources than the application itself. I haven't tried LLVM, but my experience with .NET and Java applications is that they are incredibly slow compared to binaries compiled from C++.
Think about it again. Java and .NET impose specific semantics which can significantly slow down the code if you cannot optimize them out (for example range checking and full dynamic method dispatch). But LLVM doesn't use a 'managed' bytecode; it can be used to compile low-level code. llvm-gcc, for example, compares very favourably to gcc when compiling C and C++. Besides, it does not consume many resources - at least not much more than your average bloated C++ app - and you can always remove the overhead with AOT compilation. This doesn't require you to ship source with your program: you can distribute stripped LLVM bytecode, which is just as obscure as a binary, and AOT-compile it at install time. LLVM can also do inter-module optimization, which is a very nice bonus.
The huge code makes the executable big and makes code caching less efficient.
If the functions inside the library are well positioned (all of branch1 together at the beginning, then branch2, branch3, etc.) the "useless" part may not even be loaded into memory.
Agner wrote:
JIT compilation in general has several disadvantages.
It has several advantages too:
- Perform inter-module optimizations (very important for OOP);
- Profile and change code at run time to match users needs;
- The native binary may be cached at install time and only recompiled when some modules are upgraded;
- As already mentioned, target "every" machine.
Joined: Sun Sep 23, 2007 1:29 am Posts: 126 Location: Los Angeles, CA
Gabriele Svelto wrote:
Besides, it does not consume many resources - at least not much more than your average bloated C++ app - and you can always remove the overhead with AOT compilation. This doesn't require you to ship source with your program: you can distribute stripped LLVM bytecode, which is just as obscure as a binary, and AOT-compile it at install time. LLVM can also do inter-module optimization, which is a very nice bonus.
So you're saying that when you need code that actually performs well, you still have to carve it into stone instead of JIT it. What's the point of this intermediate layer then? Saying that LLVM bytecode is just as obscure as a binary is a pretty weak statement. 1) if that were really true, then it would be pretty difficult to optimize. 2) recompiling any binary into any other target is a pretty well understood problem these days.
Joined: Wed Jun 27, 2007 10:19 am Posts: 331 Location: Milano, Italy
hyc wrote:
So you're saying that when you need code that actually performs well, you still have to carve it into stone instead of JIT it.
What I was saying is that you don't need to pay for the runtime overhead of a JIT compiler if you do not want to. LLVM can be used in AOT mode to 'tune' the application for the target machine it will be compiled on. You don't need to, it's just an option.
Quote:
What's the point of this intermediate layer then?
It still saves you from the hassle of writing and maintaining ten different code paths written/tuned for different processors and has the advantage of being already future proof.
Quote:
Saying that LLVM bytecode is just as obscure as a binary is a pretty weak statement. 1) if that were really true, then it would be pretty difficult to optimize. 2) recompiling any binary into any other target is a pretty well understood problem these days.
It's a type-rich low-level representation in SSA form. If the types and function names are obscured (i.e. you haven't compiled debugging information in) it can be even worse than disassembled code from a human POV. For a compiler it's a great format which provides all the information it needs, and it is fairly compact too in its latest iterations (they recently introduced variable-length encoding of the bytecode).
Anyway, check for yourself. It may not suit everybody's needs, but it's a great project and a very solid alternative for performance-conscious code. The only drawback from my POV is that the JIT compiler modules are not very easy to bridge to a C application, so if you're not on C++ it might be tricky. AOT compilation, on the other hand, will work just fine.
Larrabee is currently slideware, with no actual silicon until 2009/2010.
Cuda is an evolving programming language for nVidia series 8000 and above GPUs. It is currently in C with C++/Fortran features being added.
Who, how does that encode sample code you spout run on Larrabee?
Not very well as it doesn't exist and won't for another 12 to 18 months.
That is three more generations in the fast changing GPU world. I'm sure you believe that nVidia won't be doing any more enhancements to their GPUs and that CUDA will stay unchanged and unused so that your beloved Intel can get their great ray tracing GPU out.
I'm sure you believe that once Larrabee finally comes out that everyone will throw away all of their laptops and desktops and buy complete new systems just to get the Intel Larrabee.
Not going to happen.
It would seem that all those multi-millions of laptops and desktops will already have a nVidia GPU or could just add one to an open PCIe slot for much much less $$$.
That is three more generations in the fast changing GPU world
That's being a bit nice to the GPU world.
Nvidia just this week released its first meaningfully new generation since Nov 2006. Every 8x00 and 9x00 GPU is clearly rooted in the 8800GTX/GTS, with only mild tweaks such as getting PureVideo HD to work and process shrinks.
Quote:
add one to an open PCIe slot for much much less $$$.
Why wouldn't you just add a Larrabee into your PCIe slot?
I guess the AMD guys need to look at the array of functions when they do different CPU code path ... their way adds many conditional jumps to get to their SSE5#!@$ ... especially since you usually optimize in the critical path ... I am saying this ... but many here will agree that I don't know what I am talking about .... lol!
switch case ... lol!!!! Be serious please!
who?
Do you understand the difference between an oversimplified example in a PDF for journalists and an actual implementation?
You missed the point ... If I had posted the same code here, some "smart a.." would have picked on the way I did it ... missing the concept and going after the actual code ... You are right, it is conceptual code. Maybe after this, the "smart a.." will stop picking on every detail and spelling mistake ... lol!
I guess the AMD guys need to look at the array of functions when they do different CPU code path ...
[...]
switch case ... lol!!!! Be serious please!
You do know most compilers optimize switch/case constructs as jump tables, I hope?
You do know that he is the lead architect of Nehalem, the lead designer of all Intel compilers since version 3, and was the main thinker behind Silverthorne and Larrabee? Every instruction set extension since 1997 was his idea. :lol:
I guess the AMD guys need to look at the array of functions when they do different CPU code path ...
[...]
switch case ... lol!!!! Be serious please!
You do know most compilers optimize switch/case constructs as jump tables, I hope?
There's actually a wide range of code generation strategies that can be used for switch/case, depending on the situation (how sparse the values are, how many there are, the probabilities of the individual cases, etc.).
For example, a binary search with cmp and conditional branches might well be faster, because it gives the CPU's branch predictor several separate branches to predict, unlike the single indirect jump of a jump table.
When compilers have profile feedback they can also do better.
I would expect a good compiler, especially with profile feedback, to be able to outperform most programmers on this.
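To make the two options being argued about concrete, here is a minimal sketch of both: a switch/case dispatcher and an explicit array of function pointers. The path names (kGeneric, kSSE2, kSSE3) and the run_* functions are hypothetical placeholders, not taken from the AMD example or from any library.

```cpp
#include <array>
#include <cstddef>

// Hypothetical code paths, one per instruction-set level (names are assumptions).
enum Path : std::size_t { kGeneric = 0, kSSE2 = 1, kSSE3 = 2, kPathCount = 3 };

// Placeholder implementations; each returns a different value so the
// chosen path is visible to the caller.
static int run_generic(int x) { return x + 1; }
static int run_sse2(int x)    { return x + 2; }
static int run_sse3(int x)    { return x + 3; }

// Variant 1: switch/case. The compiler is free to turn this into a jump
// table, a chain of compares, or a binary search of conditional branches.
int dispatch_switch(Path p, int x) {
    switch (p) {
        case kSSE3: return run_sse3(x);
        case kSSE2: return run_sse2(x);
        default:    return run_generic(x);
    }
}

// Variant 2: explicit array of function pointers, indexed by the path.
using Fn = int (*)(int);
static const std::array<Fn, kPathCount> kTable = { run_generic, run_sse2, run_sse3 };

int dispatch_table(Path p, int x) {
    return kTable[p](x);
}
```

Both variants select the same function for a given path. The difference is only in who picks the code generation strategy: with the switch the compiler chooses, with the table the programmer has fixed it to a pointer lookup.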
SSEPlus is a function library with CPU dispatcher inside. This is obviously a good way to take advantage of different instruction sets. Most math libraries already have a CPU dispatcher.
The math libraries supplied by Intel have a CPU dispatcher with the funny feature that it always chooses the worst possible path when running on an AMD processor. If you replace the CPU dispatcher inside the Intel math libraries with a non-discriminating CPU dispatcher then the program that is compiled with Intel's compiler suddenly runs much faster on an AMD machine. See my programming manual for details http://www.agner.org/optimize/optimizing_cpp.pdf.
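The difference between a vendor-checking and a non-discriminating dispatcher can be illustrated with a deliberately simplified sketch. This is not Intel's actual code: the CpuInfo struct, the feature bits and both choose_* functions are assumptions made up for the illustration. Only the vendor strings "GenuineIntel" and "AuthenticAMD" are the real values reported by CPUID.

```cpp
#include <cstdint>
#include <string>

// Hypothetical feature bits; a real program would fill them in from CPUID.
constexpr std::uint32_t kHasSSE2 = 1u << 0;
constexpr std::uint32_t kHasSSE3 = 1u << 1;

// Hypothetical description of the CPU the program is running on.
struct CpuInfo {
    std::string vendor;      // e.g. "GenuineIntel" or "AuthenticAMD"
    std::uint32_t features;  // bitwise OR of the kHas* bits above
};

enum class Level { Generic, SSE2, SSE3 };

// Non-discriminating dispatcher: looks only at what the CPU can execute.
Level choose_by_features(const CpuInfo& cpu) {
    if (cpu.features & kHasSSE3) return Level::SSE3;
    if (cpu.features & kHasSSE2) return Level::SSE2;
    return Level::Generic;
}

// Vendor-checking dispatcher: any CPU that does not report the expected
// vendor string falls back to the generic path, whatever it supports.
Level choose_vendor_checked(const CpuInfo& cpu) {
    if (cpu.vendor != "GenuineIntel") return Level::Generic;
    return choose_by_features(cpu);
}
```

For a CPU that reports "AuthenticAMD" together with both feature bits, choose_by_features returns Level::SSE3 while choose_vendor_checked returns Level::Generic, which is the kind of gap described above.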
...
The huge code makes the executable big and makes code caching less efficient. In many cases the optimal solution is to have only two branches in the CPU dispatcher: One branch uses the oldest instruction set that has all the instructions you need for that particular purpose and one branch for supporting older computers. No need to make a branch for SSE4.2 or whatever if there are no instructions beyond SSE3 that are useful for that particular application.
...
Could someone explain how a "CPU dispatcher" works? Is there a way to tell the hardware which set of SSE instructions you would like to use?
I have heard that MS's D3DX math lib uses the same method to determine which SSE version is supported, at run-time.
In any case, I don't understand how it is possible to create code that is scheduled as well as code that is written using purely intrinsics for a target SSE version, wherein 'inline' vectorized functions can be inlined.
So you generate multiple versions of functions at compile-time, and choose the appropriate one(s) at runtime using CPUID. I guess this is not as awesome as my naive vision of certain SSE(N) instructions being replaced by fallback SSE(N-1) instructions in the instruction stream.. somehow.
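As a rough sketch of that scheme: two versions of one function, and a function pointer that is set once from a feature mask. Everything named here (kFeatSSE3, sum_generic, sum_sse3, init_dispatch) is hypothetical. In a real program the mask would come from CPUID and the SSE3 version would be built with SSE3 intrinsics or compiler flags; here both versions are plain C++ so the example runs anywhere.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kFeatSSE3 = 1u << 0;  // hypothetical feature bit

// Fallback version for older CPUs.
static float sum_generic(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Stand-in for an SSE3 version: four partial sums, mimicking a 4-wide
// vector accumulator followed by a horizontal add.
static float sum_sse3(const float* v, std::size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; ++k) acc[k] += v[i + k];
    float s = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; ++i) s += v[i];  // leftover elements
    return s;
}

using SumFn = float (*)(const float*, std::size_t);
static SumFn g_sum = sum_generic;  // safe default until the dispatcher runs

// Run once at startup with the detected feature mask.
void init_dispatch(std::uint32_t features) {
    g_sum = (features & kFeatSSE3) ? sum_sse3 : sum_generic;
}

// Reports which version is currently selected.
bool using_sse3_path() { return g_sum == sum_sse3; }

// Public entry point: one indirect call, no feature test per call.
float sum(const float* v, std::size_t n) { return g_sum(v, n); }
```

The cost per call is a single indirect call through g_sum. The pointer cannot be inlined through, so this works best for functions that do a reasonable amount of work per call.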
My biggest beef with the CPU dispatch method, then, is that you cannot write a small function that you really want to be inlined within larger functions, if there are multiple versions that must be chosen from at runtime. Sigh, guess I would just have to stick with SSE2 intrinsics then, and just replace low-level inlinable functions with SSE3 (and higher) intrinsics where possible, and compile a new binary for those procs.
This link applies to the Intel compiler. This is the only compiler I know that can do CPU dispatching for you. Unfortunately, the Intel compiler doesn't like AMD. The compiler makers deliberately made it select the worst possible path when a non-Intel processor is detected, even if the AMD processor is capable of running the SSE2 or SSE3 or whatever path. Using the Intel compiler doesn't solve the problem of Intel and AMD having different codes for the same instructions. (See my C++ manual at http://www.agner.org/optimize/ for details about CPU dispatching in the Intel compiler)
You have to do the branching manually outside the innermost loop. This means more code to test and maintain. Every programmer's nightmare!
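One way to keep the branch outside the innermost loop while still letting a small function inline is to instantiate the whole loop once per instruction-set level. The sketch below assumes a hypothetical Level enum and a hypothetical madd kernel. Real code would specialize madd with intrinsics for each level; here every instantiation is plain C++ and computes the same result.

```cpp
#include <cstddef>

enum class Level { Generic, SSE3 };  // hypothetical instruction-set levels

// Small kernel we want inlined into the loop body. A real version would be
// specialized per level; this placeholder is identical for all levels.
template <Level L>
inline float madd(float a, float b, float c) { return a * b + c; }

// One copy of the loop per level. madd<L> is resolved at compile time, so
// the loop body contains no run-time branch and no indirect call.
template <Level L>
static void saxpy_impl(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] = madd<L>(a, x[i], y[i]);
}

// The only run-time branch sits here, outside the loop.
void saxpy(Level level, float a, const float* x, float* y, std::size_t n) {
    if (level == Level::SSE3) saxpy_impl<Level::SSE3>(a, x, y, n);
    else                      saxpy_impl<Level::Generic>(a, x, y, n);
}
```

This does not remove the testing and maintenance burden mentioned above, since every level is still a separate code path. It only keeps the duplication inside the compiler instead of in hand-copied loops.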