Aceshardware
(not so) temporary home for the aceshardware community

Nehalem Preview

savantu
Joined: 21 Mar 2008
Posts: 36

Posted: Mon Jun 09, 2008 8:58 am

who? wrote:
savantu wrote:
who? wrote:
Bottom line, there is no issue to do a fast L3 cache.

who?


Nor is Nehalem's.

At 39 cycles, it is slow.


Try to do 8 MB at 39 cycles by yourself and call me back :) Those are not running at low frequency. Did you ever participate in any CPU design? I think you are forgetting a little too easily that these CPUs are the most complex machines built by humans. It is still faster than the other L3s on the market!

who?


No, I was comparing it to Penryn and Itanium.

- Penryn: 6 MB L2, 15 cycles, 4.68 ns, 2 cores.
- Itanium: 12 MB L3, 14 cycles, 8.43 ns, 2-way threaded very wide core.
- Nehalem: 8 MB L3, 39 cycles, 14.66 ns, four 2-way threaded cores.

It is fairly obvious that Nehalem's L3 is under far more pressure and has more complex arbitration, but 39 cycles seems like a lot. To match Itanium's L3 latency in ns, Nehalem would have to run at 4.6 GHz.
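
The arithmetic behind these figures can be checked with a short sketch. It only uses the cycle counts and ns latencies quoted above; the helper name `implied_clock_ghz` is made up for illustration, and the last step assumes the L3 would still take 39 cycles at the higher clock, which is an assumption rather than anything Intel has stated.

```python
# Latency figures quoted in the post: name -> (cycles, nanoseconds).
caches = {
    "Penryn L2": (15, 4.68),
    "Itanium L3": (14, 8.43),
    "Nehalem L3": (39, 14.66),
}


def implied_clock_ghz(cycles, latency_ns):
    """Clock (GHz) at which `cycles` cycles take `latency_ns` nanoseconds."""
    return cycles / latency_ns


# Clock speed implied by each (cycles, ns) pair.
for name, (cycles, ns) in caches.items():
    print(f"{name}: {implied_clock_ghz(cycles, ns):.2f} GHz")

# Clock Nehalem would need for a 39-cycle L3 to match Itanium's 8.43 ns
# (assumes the cycle count does not change with frequency).
required = implied_clock_ghz(39, 8.43)
print(f"Nehalem clock to match Itanium's L3: {required:.1f} GHz")
```

Dividing cycles by nanoseconds gives GHz directly, so the three data points correspond to parts clocked at roughly 3.2, 1.66 and 2.66 GHz.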

who?
Joined: 01 Sep 2007
Posts: 540

Posted: Wed Jun 11, 2008 4:00 am

savantu wrote:
who? wrote:
savantu wrote:
who? wrote:
Bottom line, there is no issue to do a fast L3 cache.

who?


Nor is Nehalem's.

At 39 cycles, it is slow.


Try to do 8 MB at 39 cycles by yourself and call me back :) Those are not running at low frequency. Did you ever participate in any CPU design? I think you are forgetting a little too easily that these CPUs are the most complex machines built by humans. It is still faster than the other L3s on the market!

who?


No, I was comparing it to Penryn and Itanium.

- Penryn: 6 MB L2, 15 cycles, 4.68 ns, 2 cores.
- Itanium: 12 MB L3, 14 cycles, 8.43 ns, 2-way threaded very wide core.
- Nehalem: 8 MB L3, 39 cycles, 14.66 ns, four 2-way threaded cores.

It is fairly obvious that Nehalem's L3 is under far more pressure and has more complex arbitration, but 39 cycles seems like a lot. To match Itanium's L3 latency in ns, Nehalem would have to run at 4.6 GHz.


I can't say more yet, but this is not an issue at all. It is like the idiotic FSB debate ... Core 2 still beats the c..p out of any Kx, without the memory controller on die.
It is all about "theoretical benchmarks" versus real applications. A few people will keep focusing on the low bandwidth or higher latency from synthetic benchmarks, but this is not what matters: this is a very narrow vision that can only create processors with "theoretical performance".

You are better off designing a cache + prefetcher subsystem that catches 99.9999% of the memory accesses than a short-latency cache. Core 2 is the perfect example versus K8 and K10. Now, you can dance around this, but this was the lesson of Core 2.
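
The hit rate versus latency trade-off argued here can be put in numbers with the textbook average memory access time (AMAT) formula. This is only a sketch: the function name `amat` and every hit rate and latency below are made-up, illustrative values, not measurements of Nehalem, Core 2, K8 or K10.

```python
def amat(levels, memory_latency):
    """Average memory access time, in cycles, for a cache hierarchy.

    `levels` is a list of (hit_latency, local_hit_rate) tuples ordered
    from the fastest level outward; `memory_latency` is the extra cost
    of missing every cache level.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_latency, hit_rate in levels:
        total += reach * hit_latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_latency


MEMORY_LATENCY = 200  # cycles, hypothetical

# Hypothetical designs: same L1, different last-level cache (LLC).
designs = {
    "fast LLC, 80% hit rate": [(3, 0.95), (15, 0.80)],
    "slow LLC, 80% hit rate": [(3, 0.95), (39, 0.80)],
    "slow LLC, 99% hit rate": [(3, 0.95), (39, 0.99)],
}

for name, levels in designs.items():
    print(f"{name}: {amat(levels, MEMORY_LATENCY):.2f} cycles")
```

With these invented inputs, the lower-latency cache wins when the hit rates are equal, while a sufficiently higher hit rate can outweigh the extra cycles. Which of the two cases describes real hardware is exactly what is being argued in this thread.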

Life and real-life applications are not theoretical; you'll understand better later.

who?

Pjotr
Joined: 06 Aug 2007
Posts: 159

Posted: Wed Jun 11, 2008 2:19 pm

who? wrote:
I can't say more yet, but this is not an issue at all. It is like the idiotic FSB debate ... Core 2 still beats the c..p out of any Kx, without the memory controller on die.


Duh, why is this? Because the L2 latency on Kx is so much worse. Since Kx has faster access to memory than Core 2, it proves the L2 latency can be more important than an IMC. The numbers speak for themselves. Lower latency = better performance. Cache hit % is quite high on these modern CPUs and the latency does matter. This will be a disadvantage to Nehalem. If Nehalem had kept the lower latency of Core 2, it would be even faster of course. Are you disputing this?

Paul DeMone
Joined: 29 Aug 2007
Posts: 530
Location: Great white north

Posted: Wed Jun 11, 2008 3:26 pm

Pjotr wrote:
If Nehalem had kept the lower latency of Core 2, it would be even faster of course. Are you disputing this?


That is true. However obviously Intel thought the benefits of not doing
so outweighed the benefits of keeping the latency the same.

The problem is right now we can only see the negative side of a design
trade-off and do not know what was gained for it. Only a newb or an
Intel basher should tee off on Nehalem without knowing the other side
of this equation.

The data cache load bypass to ALU is typically the critical timing path
(or close to it) in most high performance superscalar processors so
an obvious benefit could be better clock scaling at a given supply
voltage and/or keeping clock rate the same while lowering the supply
voltage and significantly reducing core power. This choice would be
evaluated for both 45 nm and 32 nm implementations of the uarch.
That's the flip side of tick-tock.

Consider this. Increasing L1 latency from 3 to 4 cycles reduces IPC
of a deeply OOOE processor like Nehalem by only a few percent for
most classes of applications. If that change allows clock frequency
to stay the same as Penryn while reducing core voltage by 10%
then that gives about a 20% reduction in the core component of
TDP. The effect of core voltage reduction minus IPC loss increases
computational power efficiency of the Nehalem core by perhaps 17
to 18%. That could allow Intel to

1) integrate CSI and IMC without a TDP penalty vs Penryn
2) increase device frequency slightly for a given TDP
3) reduce device TDP for a given frequency

or some combination of the above. Things should be much clearer
when Intel discloses the frequency and TDP of Nehalem SKUs,
especially those of dual core products.
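
The power estimate above can be sanity-checked with a rough sketch. It assumes the usual dynamic CMOS power relation (power proportional to capacitance times voltage squared times frequency), ignores leakage, and takes "a few percent" IPC loss to mean 2 to 3 percent; the function names are made up for illustration.

```python
def relative_dynamic_power(voltage_scale, frequency_scale=1.0):
    """Dynamic power relative to baseline, assuming P ~ C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


def relative_energy_per_work(voltage_scale, ipc_scale, frequency_scale=1.0):
    """Energy per unit of work relative to baseline (power / throughput)."""
    power = relative_dynamic_power(voltage_scale, frequency_scale)
    throughput = ipc_scale * frequency_scale
    return power / throughput


# 10% lower core voltage at an unchanged clock.
power = relative_dynamic_power(0.90)
print(f"core power: {1 - power:.0%} lower")

# Assumed IPC losses from the longer L1 latency (hypothetical 2% and 3%).
for ipc_loss in (0.02, 0.03):
    energy = relative_energy_per_work(0.90, 1 - ipc_loss)
    print(f"IPC loss {ipc_loss:.0%}: energy per unit of work {1 - energy:.1%} lower")
```

Under these assumptions the results land close to the figures in the post: a bit under 20% for core power and somewhere around 17% for power efficiency.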

who?
Joined: 01 Sep 2007
Posts: 540

Posted: Thu Jun 12, 2008 2:41 am

Pjotr wrote:
who? wrote:
I can't say more yet, but this is not an issue at all. It is like the idiotic FSB debate ... Core 2 still beats the c..p out of any Kx, without the memory controller on die.


Duh, why is this? Because the L2 latency on Kx is so much worse. Since Kx has faster access to memory than Core 2, it proves the L2 latency can be more important than an IMC. The numbers speak for themselves. Lower latency = better performance. Cache hit % is quite high on these modern CPUs and the latency does matter. This will be a disadvantage to Nehalem. If Nehalem had kept the lower latency of Core 2, it would be even faster of course. Are you disputing this?


That is a simplistic view. "Lower latency = better performance" is true only if there are no other parameters in the equation. If you give up a few cycles to get 500% faster in 99% of the cases with the prefetcher's help, your statement is wrong. The remaining 1% is taken care of by the multi-level prefetcher. The SPEC suite will make the point later :)

Just trash that old thinking; Core 2 made the point. I can't give any details, but we do simulate every choice with incredible accuracy. Do not expect us to repeat past mistakes.

who?

Johan
Joined: 23 Jul 2007
Posts: 162
Location: Belgium

Posted: Thu Jun 12, 2008 6:38 am

who? wrote:
Pjotr wrote:
who? wrote:
I can't say more yet, but this is not an issue at all. It is like the idiotic FSB debate ... Core 2 still beats the c..p out of any Kx, without the memory controller on die.


Duh, why is this? Because the L2 latency on Kx is so much worse. Since Kx has faster access to memory than Core 2, it proves the L2 latency can be more important than an IMC. The numbers speak for themselves. Lower latency = better performance. Cache hit % is quite high on these modern CPUs and the latency does matter. This will be a disadvantage to Nehalem. If Nehalem had kept the lower latency of Core 2, it would be even faster of course. Are you disputing this?


That is a simplistic view. "Lower latency = better performance" is true only if there are no other parameters in the equation. If you give up a few cycles to get 500% faster in 99% of the cases with the prefetcher's help, your statement is wrong. The remaining 1% is taken care of by the multi-level prefetcher. The SPEC suite will make the point later :)

Just trash that old thinking; Core 2 made the point. I can't give any details, but we do simulate every choice with incredible accuracy. Do not expect us to repeat past mistakes.

who?


Then I would really like to hear which kind of applications really benefit from the HW prefetcher. In many server apps, it is best to leave it off, as it gets in the way of the "normal" bandwidth.

For example, it has a big negative impact on SPECjbb, almost all database benchmarking, and several others that, I admit, I have to retest to be sure.

Faster with the prefetcher 99% of the time? I hope you are talking about Nehalem with its ample bandwidth, because it is complete crap for the Core 2 generation.

So far, IMHO, the HW prefetcher should be enabled dynamically (by the OS) to be really efficient.

who?
Joined: 01 Sep 2007
Posts: 540

Posted: Thu Jun 12, 2008 7:10 am

Johan wrote:
who? wrote:
Pjotr wrote:
who? wrote:
I can't say more yet, but this is not an issue at all. It is like the idiotic FSB debate ... Core 2 still beats the c..p out of any Kx, without the memory controller on die.


Duh, why is this? Because the L2 latency on Kx is so much worse. Since Kx has faster access to memory than Core 2, it proves the L2 latency can be more important than an IMC. The numbers speak for themselves. Lower latency = better performance. Cache hit % is quite high on these modern CPUs and the latency does matter. This will be a disadvantage to Nehalem. If Nehalem had kept the lower latency of Core 2, it would be even faster of course. Are you disputing this?


That is a simplistic view. "Lower latency = better performance" is true only if there are no other parameters in the equation. If you give up a few cycles to get 500% faster in 99% of the cases with the prefetcher's help, your statement is wrong. The remaining 1% is taken care of by the multi-level prefetcher. The SPEC suite will make the point later :)

Just trash that old thinking; Core 2 made the point. I can't give any details, but we do simulate every choice with incredible accuracy. Do not expect us to repeat past mistakes.

who?


Then I would really like to hear which kind of applications really benefit from the HW prefetcher. In many server apps, it is best to leave it off, as it gets in the way of the "normal" bandwidth.

For example, it has a big negative impact on SPECjbb, almost all database benchmarking, and several others that, I admit, I have to retest to be sure.

Faster with the prefetcher 99% of the time? I hope you are talking about Nehalem with its ample bandwidth, because it is complete crap for the Core 2 generation.

So far, IMHO, the HW prefetcher should be enabled dynamically (by the OS) to be really efficient.


Well, the "crap" is your opinion, and it is always easy to spit in the soup when you are not the one making it. Get to the retest :), your understanding is probably totally OFF.

As for the hardware prefetcher ... what about games, office applications, video editing, video encoding, 3D rendering, Java engines, sequential databases, bio-computing, financial analysis... it is an endless list...

Your statement to "turn it off" is only true for pointer-chasing applications, and with Nehalem, it is not going to be an issue anymore. sooooo?

Just one of your regular postings, a lot of opinion, and very little data to back it up.

As I said at Computex, I have never felt so comfortable since I have been in the seat I am in; we just need to get the stuff cleanly out of the door. You can dance as much as you want, oupssss, we did it again.


who?

Del
Joined: 09 Aug 2007
Posts: 121

Posted: Thu Jun 12, 2008 7:11 am

Johan wrote:

Then I would really like to hear which kind of applications really benefit from the HW prefetcher.

Don't get your hopes up; the whole thing may only be some acid humour, you know. Funny how he only finds fanatic support for his employer funny; kind of a narrow sense of humour.

At least we now know that IMC is immaterial, the FSB is supreme if you only get that prefetching right. Sounds like Nehalem and Tukwila are just going to be some boring downgrades of architecture.

who?
Joined: 01 Sep 2007
Posts: 540

Posted: Thu Jun 12, 2008 7:13 am

Del wrote:
Johan wrote:

Then I would really like to hear which kind of applications really benefit from the HW prefetcher.

Don't get your hopes up; the whole thing may only be some acid humour, you know. Funny how he only finds fanatic support for his employer funny; kind of a narrow sense of humour.

At least we now know that IMC is immaterial, the FSB is supreme if you only get that prefetching right. Sounds like Nehalem and Tukwila are just going to be some boring downgrades of architecture.


I was missing your smart post :)

who?

JS
Joined: 28 Jun 2007
Posts: 23
Location: Germany

Posted: Thu Jun 12, 2008 7:24 am

who? wrote:
Johan wrote:


Then I would really like to hear which kind of applications really benefit from the HW prefetcher. In many server apps, it is best to leave it off, as it gets in the way of the "normal" bandwidth.

For example, it has a big negative impact on SPECjbb, almost all database benchmarking, and several others that, I admit, I have to retest to be sure.

Faster with the prefetcher 99% of the time? I hope you are talking about Nehalem with its ample bandwidth, because it is complete crap for the Core 2 generation.

So far, IMHO, the HW prefetcher should be enabled dynamically (by the OS) to be really efficient.


Well, the "crap" is your opinion, and it is always easy to spit in the soup when you are not the one making it. Get to the retest :), your understanding is probably totally OFF.

As for the hardware prefetcher ... what about games, office applications, video editing, video encoding, 3D rendering, Java engines, sequential databases, bio-computing, financial analysis... it is an endless list...

Your statement to "turn it off" is only true for pointer-chasing applications, and with Nehalem, it is not going to be an issue anymore. sooooo?

Just one of your regular postings, a lot of opinion, and very little data to back it up.

As I said at Computex, I have never felt so comfortable since I have been in the seat I am in; we just need to get the stuff cleanly out of the door. You can dance as much as you want, oupssss, we did it again.


who?


Reading both posts, Johan backs his opinion up. The statement "and with
Nehalem, it is not going to be an issue anymore. sooooo?", on the other hand, is not backed up.

So I am wondering about you saying "Just one of your regular postings, a
lot of opinion, and very little data to back it up." Could you provide
equivalent data so that we can judge your statement?

Gabriele Svelto
Joined: 27 Jun 2007
Posts: 290
Location: Milano, Italy

Posted: Thu Jun 12, 2008 7:24 am

who? wrote:
As for the hardware prefetcher ... what about games, office applications, video editing, video encoding, 3D rendering, Java engines, sequential databases, bio-computing, financial analysis... it is an endless list...

Your statement to "turn it off" is only true for pointer-chasing applications, and with Nehalem, it is not going to be an issue anymore. sooooo?

Then you must know that SPECjbb is not a pointer-chasing application; in fact, cache-wise it is very well behaved, and that's exactly why the hardware prefetcher lowers performance significantly. That is especially true for C2Q processors, where the prefetchers of the two dies are unaware that they are stealing bandwidth from each other in bandwidth-constrained situations. But you probably already know this.
Quote:
Just one of your regular postings, a lot of opinion, and very little data to back it up.

And where's *your* data to back up *your* claims? You listed a series of workloads which you consider better off with the prefetcher than without it; can you post data for those workloads comparing performance with the prefetcher on and off? Or is this going to be a replay of the "FB-DIMM has lower latency than DDR2/3" thread?

who?
Joined: 01 Sep 2007
Posts: 540

Posted: Thu Jun 12, 2008 8:09 am

Gabriele Svelto wrote:
who? wrote:
As for the hardware prefetcher ... what about games, office applications, video editing, video encoding, 3D rendering, Java engines, sequential databases, bio-computing, financial analysis... it is an endless list...

Your statement to "turn it off" is only true for pointer-chasing applications, and with Nehalem, it is not going to be an issue anymore. sooooo?

Then you must know that SPECjbb is not a pointer-chasing application; in fact, cache-wise it is very well behaved, and that's exactly why the hardware prefetcher lowers performance significantly. That is especially true for C2Q processors, where the prefetchers of the two dies are unaware that they are stealing bandwidth from each other in bandwidth-constrained situations. But you probably already know this.
Quote:
Just one of your regular postings, a lot of opinion, and very little data to back it up.

And where's *your* data to back up *your* claims? You listed a series of workloads which you consider better off with the prefetcher than without it; can you post data for those workloads comparing performance with the prefetcher on and off? Or is this going to be a replay of the "FB-DIMM has lower latency than DDR2/3" thread?



Well, I do post a lot of data all the time; pro journalists get a lot from me all the time. It is just too early to do so on Nehalem. You can ask Anand, Charlie or any other serious press guys (even Fudo agrees on this): I have never misled them, I am always spot on. I have nothing to gain from overpositioning. Here it is simple: you are worried about the latency, and I can't post data due to the NDAs and my company's strategy. The only thing I am trying to say is that there is nothing to worry about.

Now, you can keep fudding, or you can trust the guy with the top-end chip in his hand ... up to you. What Anand has in his hand may not be the best tuned ... hummm hummmmm. Maybe it was :) maybe not ... hehehehe

who?

P4man
Joined: 26 Jun 2007
Posts: 540

Posted: Thu Jun 12, 2008 8:20 am

No one here doubts Nehalem's performance, but I think most people here seriously doubt your analysis, Mr "fastest L3 cache, oops, but L3 latency don't matter", "lowest RAM latency, oops, memory latency don't matter", "FSB don't matter", "prefetcher is great even if it sucks", "I have a chip, so trust me, even if I'm almost always wrong on just about everything except the chip being fast", "hehe, lololol, just kidding".

You are a funny guy, just not the way you think you are.

who?
Joined: 01 Sep 2007
Posts: 540

Posted: Thu Jun 12, 2008 8:50 am

P4man wrote:
No one here doubts Nehalem's performance, but I think most people here seriously doubt your analysis, Mr "fastest L3 cache, oops, but L3 latency don't matter", "lowest RAM latency, oops, memory latency don't matter", "FSB don't matter", "prefetcher is great even if it sucks*", "I have a chip, so trust me, even if I'm almost always wrong on just about everything except the chip being fast", "hehe, lololol, just kidding".

*never said that

You are a funny guy, just not the way you think you are.


Maybe you have a hard time reading between the lines ... Only a few people have this problem :)
Always the same gang hoping for bad news ... Sorry, no bad news.

Let's take your FUD example. What I said in the case of the FSB is simple: on the desktop, the FSB was not the limitation. (Note the context, and stop removing it.) (Yes, the cache + prefetcher is doing its job; get VTune and check for yourself ...)

Then, with SPEC, I said it is going to make a point ... (Note the context: SPEC is not a good example of a "desktop workload", is it?)
So, maybe you guys always remove the context when you "pick and quote", but that is not very professional. Keep the context with it.

Anyway, I got my 3 grandpas from the Muppet Show back :) ... Same habit: not doing any design, but spitting on whatever is not the color you like.

Anyway, I am comfortable with my tuning choices on Smarkover and Nehalem; both teams rock.

For the rest, I will let Penn & Teller tell you what I think about your criticisms (the personal ones): their most common sentence ...


who?

stupid dog
Joined: 08 Aug 2007
Posts: 6

Posted: Thu Jun 12, 2008 11:34 am

Hans de Vries wrote:
The performance increase from Hyper Threading seems to be in the
same range as that of the Pentium 4 (see the table at the end of
http://www.xbitlabs.com/articles/cpu/display/pentium4-3066_2.html)

And why is that? The table shows the performance gain while running multiple tasks simultaneously. Anand didn't check this. On the other hand, the "old HT" shows little to no performance gain in multithreaded apps. So how can you conclude that this is in the "same range as that of the Pentium 4"?

All times are GMT + 1 Hour
Page 4 of 6