Aceshardware Forum Index Aceshardware
(not so) temporary home for the aceshardware community
 

Nvidia's CUDA
who?



Joined: 01 Sep 2007
Posts: 540

Posted: Wed Jun 25, 2008 2:00 am

They only need to recompile the world :)

who?
martinw



Joined: 06 Sep 2007
Posts: 139

Posted: Wed Jun 25, 2008 5:18 am

JoeP wrote:
I've talked with two professors from different universities who have grad students who know C and are trying to work with CUDA. By far, it has been the hardest "paradigm" to learn to work with. You need to create a new algorithm and a new mapping for your problem at the same time. And you need to know the hardware, too.


Yes, it really exposes everything to you. The paradigm is really weird for anyone used to coding standard C as well. An example code snippet can look something like this (in pseudocode):

initialize x[i];
access x[i+1];

How does that work? Well, multiple threads are working in parallel, each with a different value of i, so every thread first initializes one element of x. The threads of a warp execute in lockstep, so by the time the accessor line runs, every element of x handled by that warp has already been initialized and there is no risk in accessing an element that was initialized by a different thread (across warps you need a __syncthreads() barrier between the two lines to get the same guarantee). I guess you get used to it after a while, but it just feels so wrong. :-)
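
A minimal sketch of the two-line pattern above, written in Python rather than CUDA so that it runs without a GPU. It makes the ordering explicit with a barrier instead of relying on execution timing; `threading.Barrier` stands in here for the role `__syncthreads()` plays inside a CUDA thread block. The names `worker`, `N`, `x` and `out`, the wrap-around at the last element, and the choice of `i * i` as the initial value are all assumptions made for illustration.

```python
import threading

N = 8                            # one thread per element (arbitrary size)
x = [None] * N                   # shared array, not yet initialized
out = [None] * N                 # what each thread reads from its neighbour
barrier = threading.Barrier(N)   # all N threads must arrive before any continues

def worker(i):
    x[i] = i * i                 # "initialize x[i]"
    barrier.wait()               # wait until every thread has written its element
    out[i] = x[(i + 1) % N]      # "access x[i+1]" (wrapping around at the end)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(out)
```

Without the `barrier.wait()` line, a thread could read its neighbour's slot before the neighbour has written it and see `None`.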

Quote:
If there's anything promising in GPGPU, it's Apple's OpenCL.
...
and after looking that up, it's still about a year away, AND it looks like it's only for OS X.


The more I read about OpenCL the better it looks. It really does depend though on whether the hardware vendors pick it up and support it on other platforms. Talked to a couple of Apple engineers about it, and they have high hopes, but we'll see how it plays out.
JoeP



Joined: 28 Jul 2007
Posts: 65

Posted: Wed Jun 25, 2008 7:54 am

martinw wrote:
I guess you get used to it after a while, but it just feels so wrong. :-)


That's the strange thing about it. All along, we've been taught to have tight controls over access to variables and memory locations, then we have something loose like this.

Not only that, but most (if not all) CS and SE courses will teach that high level languages and their compilers throw away the need to know the hardware implementation the programs are to be run on. If you wanted to code stuff near the hardware level, you'd be working on compilers or software for DSPs and such.

If many of these instructions are to be executed in a certain time, and other things like that are known, can't the language be a little more high level and implementation independent by putting that burden on the compiler/assembler to figure out when to execute what?
Del



Joined: 09 Aug 2007
Posts: 121

Posted: Wed Jun 25, 2008 6:48 pm

who? wrote:
Try to use their CUDA H264 encoders, especially when you try to do SD video at less than 1.5Mbps, and have fun with the mosaic ...
no no, it is not a filter special effect ...
DivX goes down to 700kbps without any issue, no mosaic ...
This encoder is good when you don't compress ... the point of video compression is to compress, no?

who?
Care to comment on this article:
http://www.anandtech.com/video/showdoc.aspx?i=3339&p=1
and please make an effort to keep your signal to noise ratio under control when answering.
ajensen



Joined: 01 Sep 2007
Posts: 133

Posted: Wed Jun 25, 2008 9:03 pm

@Del:
To me it seems like they are in the same ballpark on performance/watt, which is surprisingly good for a general-purpose CPU when matched against a special-purpose design one would expect to have an advantage on the chosen workload. When future CPU designs get more throughput/watt focus and less performance/thread focus, the GPU will find it increasingly hard to find new tasks to conquer from the CPU.
EduardoS



Joined: 22 Mar 2008
Posts: 110

Posted: Wed Jun 25, 2008 10:45 pm

JoeP wrote:
That's the strange thing about it. All along, we've been taught to have tight controls over access to variables and memory locations, then we have something loose like this.

Not only that, but most (if not all) CS and SE courses will teach that high level languages and their compilers throw away the need to know the hardware implementation the programs are to be run on. If you wanted to code stuff near the hardware level, you'd be working on compilers or software for DSPs and such.

If many of these instructions are to be executed in a certain time, and other things like that are known, can't the language be a little more high level and implementation independent by putting that burden on the compiler/assembler to figure out when to execute what?


I didn't find CUDA or Brook+ so different; they resemble SQL or high-level languages that have a "foreach" operator. Although many of the hardware limitations are still visible, it's not much more difficult than writing a (huge) query in SQL, and far simpler than my first attempt with MMX (in assembly).
who?



Joined: 01 Sep 2007
Posts: 540

Posted: Thu Jun 26, 2008 4:37 am

Del wrote:
who? wrote:
Try to use their CUDA H264 encoders, especially when you try to do SD video at less than 1.5Mbps, and have fun with the mosaic ...
no no, it is not a filter special effect ...
DivX goes down to 700kbps without any issue, no mosaic ...
This encoder is good when you don't compress ... the point of video compression is to compress, no?

who?
Care to comment on this article:
http://www.anandtech.com/video/showdoc.aspx?i=3339&p=1
and please make an effort to keep your signal to noise ratio under control when answering.


Well, if you use the CUDA encoder, the best compression you can get is over 1.4 Mbps ... for an iPhone video ... I have nothing more to add :)

no noise ...

[edit] Just to make sure everybody understands ... DivX encodes "DVD personal videos of 1.5 hours" at 720x480 at 700 kbps ...

somebody forgot that in a codec ... you've got to compress! lol!

[/edit]

who?
Del



Joined: 09 Aug 2007
Posts: 121

Posted: Fri Jun 27, 2008 12:14 pm

ajensen wrote:
@Del:
To me it seems like they are in the same ballpark on performance/watt, which is surprisingly good for a general-purpose CPU when matched against a special-purpose design one would expect to have an advantage on the chosen workload. When future CPU designs get more throughput/watt focus and less performance/thread focus, the GPU will find it increasingly hard to find new tasks to conquer from the CPU.
Yes, also interesting to see the performance in line with earlier assumptions, i.e., a best case scenario yielding lower than 10x performance increase. On perf/watt it is indeed a close race.
who? wrote:
Del wrote:
who? wrote:
Try to use their CUDA H264 encoders, especially when you try to do SD video at less than 1.5Mbps, and have fun with the mosaic ...
no no, it is not a filter special effect ...
DivX goes down to 700kbps without any issue, no mosaic ...
This encoder is good when you don't compress ... the point of video compression is to compress, no?

who?
Care to comment on this article:
http://www.anandtech.com/video/showdoc.aspx?i=3339&p=1
and please make an effort to keep your signal to noise ratio under control when answering.


Well, if you use the CUDA encoder, the best compression you can get is over 1.4 Mbps ... for an iPhone video ... I have nothing more to add :)

no noise ...

[edit] Just to make sure everybody understands ... DivX encodes "DVD personal videos of 1.5 hours" at 720x480 at 700 kbps ...

somebody forgot that in a codec ... you've got to compress! lol!

[/edit]

who?
I am trying to make you clarify what you really mean. Let me try to help with some basic questions:
Is the applicability of CUDA for H.264 encoding on par with a Penryn below some compression threshold, but superior performance-wise when doing basic ripping like Anand did?
Maybe you simply meant that people using their iPhone for video viewing are better off without CUDA?
Do you believe that there is something inherent in CUDA and nVidia's current cards limiting its potential for encoding?
If so, I would prefer that you try, to the best of your capabilities, to explain why, with particular emphasis on encoding-type loads. It simply isn't enough to take your word for it, and not all of us are up to date on current encoding algorithms.
who?



Joined: 01 Sep 2007
Posts: 540

Posted: Fri Jun 27, 2008 2:41 pm

Most of the people who tried to run this H264 encoder saw a bit rate of 2.6 Mbit/s on the iPhone profile... I was being nice when I posted the lower bit rate, which you only get at the worst quality level.

When you do motion estimation and you have to thread it, you will have threads each processing a slice of the video. Each one finds the best-matching macroblocks for its slice of the frame. This gives you one list of macroblocks per slice.
After getting the n lists of macroblocks (n is the number of threads), a less parallel task is required ... you have to look for the macroblocks of thread (i) in the entire set of lists from the n threads, to avoid duplicated macroblocks.
In this encoder, this part is not happening, because it is unfriendly to today's GPU architecture. This explains the high bit rate seen, more than double that of commercial H263 codecs. 1 hour of video is around 1170 MBytes, defeating the purpose of H264, which is supposed to be better than MP4_H263 ...
You can see encoding at around 90 to 100 FPS with this software, while you get 55 to 60 on a quad-core Yorkfield using the same video, but with double the compression efficiency.
If you look at the result of the encoding, you can see horizontal bars in the video, due to the lack of deblocking across thread borders. In the H264 algorithm, you are supposed to use the X-1, Y-1 , X, Y-1 and X-1, Y-1 macroblocks to avoid blocking effects ... in this encoder, at each thread border, the X,Y-1 is ignored.

To conclude, the PSNR is horrible, the colors are washed out, and the contrast is lower than in any other codec I have seen in the H264 space; it is even worse than the worst H263 codec.

On an 8 GB iPod, you'll only fit 4 movies (assuming 1h30min each), while with CPU-compressed movies you'll fit 10 or more if you compress better (slower). There is a benefit for users in getting more videos with better quality.
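
The figures in this post can be checked with a little arithmetic. The sketch below is an illustration only: it assumes a constant bit rate, counts 1 MB as 10^6 bytes, and treats an 8 GB player as exactly 8000 MB of usable space (a real device offers less). The function names `size_mbytes` and `movies_that_fit` are made up for this example.

```python
def size_mbytes(bitrate_mbps, seconds):
    """Size in MB (10**6 bytes) of a stream at a constant bit rate in Mbit/s."""
    return bitrate_mbps * seconds / 8   # 8 bits per byte

def movies_that_fit(capacity_mbytes, bitrate_mbps, seconds):
    """Whole number of equally sized movies that fit in the given capacity."""
    return int(capacity_mbytes // size_mbytes(bitrate_mbps, seconds))

HOUR = 3600          # seconds
MOVIE = 90 * 60      # a 1h30min movie, in seconds
CAPACITY = 8000      # assumed usable space of an "8 GB" player, in MB

print(size_mbytes(2.6, HOUR))                 # one hour at 2.6 Mbit/s
print(movies_that_fit(CAPACITY, 2.6, MOVIE))  # GPU encoder bit rate
print(movies_that_fit(CAPACITY, 0.7, MOVIE))  # 700 kbit/s, the DivX figure
```

Under those assumptions, 2.6 Mbit/s works out to about 1170 MB per hour and four 90-minute movies in 8000 MB, in line with the numbers quoted in the post; at 700 kbit/s the same space holds more than ten.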

I forgot ... I wonder why they are using 100% of 2 cores on my 4 GHz overclocked quad core ... there is some threading possible there! :)

Less compression, less quality, faster ... that is easy to do. I could create a codec doing this, but it is against consumers' interest. Do you want to look at a badly compressed video on your new PS3 with a new HDTV? No, you want the best reproduction possible.

who?
Opteron



Joined: 16 Mar 2008
Posts: 55

Posted: Sat Jun 28, 2008 4:13 pm

martinw wrote:
The more I read about OpenCL the better it looks. It really does depend though on whether the hardware vendors pick it up and support it on other platforms. Talked to a couple of Apple engineers about it, and they have high hopes, but we'll see how it plays out.


Seems like it is the basis of an industry standard:
Quote:
June 16th 2008 – San Francisco, CA – The Khronos™ Group announced today the formation of a new Compute Working Group to create royalty-free, open standards for programming heterogeneous data and task parallel computing across GPUs and CPUs. The creation of this open standard is intended to enable and encourage diverse applications to leverage all available platform compute resources on a wide range of platforms. Initial participants in the working group include 3Dlabs, AMD, Apple, ARM, Codeplay, Ericsson, Freescale, Graphic Remedy, IBM, Imagination Technologies, Intel, Nokia, NVIDIA, Motorola, QNX, Qualcomm, Samsung, Seaweed, TI, and Umeå University. Any company is welcome to join the Khronos Group to participate in this and the other Khronos working groups that are creating an ecosystem of open standards for graphics and media authoring and acceleration. For more details please visit www.khronos.org.

The Compute Working Group will follow proven Khronos processes and invite member contributions as a basis for standardization efforts. Apple has proposed the Open Computing Language (OpenCL) specification to enable any application to tap into the vast gigaflops of GPU and CPU resources through an approachable C-based language.....

http://www.khronos.org/news/press/releases/khronos_launches_heterogeneous_computing_initiative/

cheers

Opteron
Johan



Joined: 23 Jul 2007
Posts: 162
Location: Belgium

Posted: Sat Jun 28, 2008 4:54 pm    Post subject: Re: FUD

Tvar' wrote:
Charlie -

GPUs are turning into general purpose manycore processors - which is why Intel has to make Larrabee. Nvidia understands this well, your FUD to the contrary. They're not going to break forward compatibility with CUDA and PTX.

I'm actually a little surprised at your unfounded FUD. Too bad.


The experiences of people who are actually programming with CUDA are not very good (I heard that from quite a few people who now rely on custom-made boards for their video encoding/streaming needs): software works on the simulator and then breaks when it has to run on the real thing, without giving any decent feedback.

Sounds like Larrabee might appeal a lot to these people. Just my first impressions.
martinw



Joined: 06 Sep 2007
Posts: 139

Posted: Sun Jun 29, 2008 1:43 pm

Opteron wrote:

Seems like it is the basis of an industry standard:


Yes, I saw that press release. Signing up to a working group is easy though; we'll have to see how much they actually commit to in terms of implementation on various platforms. Also, you'll note one rather large software company missing from that list.
who?



Joined: 01 Sep 2007
Posts: 540

Posted: Sun Jun 29, 2008 5:04 pm

I think we should all put a big effort into this


who?
up



Joined: 06 Oct 2007
Posts: 38

Posted: Wed Jul 02, 2008 1:39 am

Quote:
Gelsinger claims that the ISVs (independent software vendors) that are currently dealing with Larrabee have responded with ‘nothing but sheer passion and enthusiasm for that direction.’ As such, he added that ‘we expect things like CUDA and CTM will end up in the same interesting footnotes in the history of computing annals – they had great promise and there were a few applications that were able to take advantage of them, but generally an evolutionary compatible computing model, such as we’re proposing with Larrabee, we expect will be the right answer long term.’

http://www.custompc.co.uk/news/602868/intel-cuda-will-be-just-a-footnote-in-computing-history.html
... :lol:
new_username



Joined: 22 Jun 2008
Posts: 4

Posted: Wed Jul 09, 2008 8:07 am

up wrote:

http://www.custompc.co.uk/news/602868/intel-cuda-will-be-just-a-footnote-in-computing-history.html
... :lol:


How much memory bandwidth will Larrabee get? That's one advantage that GPUs have -- re-freaking-diculous aggregate RAM bandwidth.

General purpose computing on them? Always a bit in doubt. Special tasks that need massive memory throughput and are not amenable to the CPU-style cache hierarchy? Hmmm.

Problem is, that's not general purpose. It's certainly not video encoding. Some scientific tasks, and some MapReduce tasks for large cluster computing on gargantuan data sets, can potentially take advantage of this sort of thing. But it is, again, not general purpose.
I can imagine a cluster of image recognition machines doling out MapReduce jobs with multiple concurrent image processing tasks to GPUs as a good use of a GPU's true strengths -- natural ability to process 2D data structures at high volume.

Or, if it can be turned into a MapReduce algorithm, you could also just get 3x as many machines (with fewer CPU cores -- max out bandwidth per core, and no GPU in a smaller, cheaper blade form factor) and save the cost burden of using new or fragile technology and write it in a couple days with Java and Hadoop.
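
For anyone who has not seen the shape of a MapReduce job, here is a toy single-machine sketch in Python. It is not Hadoop and not GPU code; it only shows the split into an independent per-image "map" step and a merging "reduce" step. The tiny 2x2 "images" and the names `map_image` and `reduce_counts` are invented for illustration.

```python
from collections import Counter
from functools import reduce

def map_image(image):
    """Map step: count pixel values in one image (a list of rows)."""
    return Counter(pixel for row in image for pixel in row)

def reduce_counts(left, right):
    """Reduce step: merge two partial histograms into one."""
    return left + right

# Two made-up 2x2 images; each map call is independent of the others,
# which is what lets a real framework farm them out to many machines.
images = [
    [[0, 1], [1, 1]],
    [[0, 0], [2, 1]],
]

histogram = reduce(reduce_counts, map(map_image, images), Counter())
print(dict(histogram))
```

In a real cluster the map calls would run on different machines (or GPUs) and only the small partial histograms would travel over the network.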
All times are GMT + 1 Hour