Discussion:
Help about L1 data misses collected by Performance Counters
Anton Ertl
2004-11-25 10:54:03 UTC
Hi all,
I am using PAPI (the Performance API, an API for collecting data from the
hardware performance counters, a set of registers in the CPU) to measure
cache activity on a PowerPC 750, but some of the cache numbers seem really
puzzling, for example the L1 data misses. The PPC 750 user manual describes
this event as "Number of L1 data cache misses. Does not include cache ops".
I wrote two C programs and measured their L1 load misses. For comparison, I
also ran the same programs on an Intel Celeron (Coppermine) and used PAPI to
get the L1 data misses there. I put the platform info in parentheses right after the
1 #define SIZ 8096000
....
2 register int i, b, j;
3 char *buffer;
4 buffer = (char *)malloc(SIZ);
5
6 for(j=0;j<SIZ;j++)
7 buffer[j]=0x03;
8
9 //begin counting using PAPI_start_counters
10 for(j=0;j<20;j++) {
11 for(i=0;i<SIZ;i++) {
12 b=buffer[i];
13 }
14 }
15 //end counting using PAPI_stop_counters
16
17 return b;
...
I don't know the reason. Why do the various line 6-8 variants of
Program 1 affect the L1 data misses so much? I would be very grateful
if someone could help me with this.
If you don't initialize the array, all page table entries will point
to the same zero-filled page (it does not matter if you allocated the
array with malloc, or implicitly as uninitialized data). The caches
on the CPUs you are looking at are physically tagged, so the page
needs only one copy in the cache, and it will satisfy all the accesses
to the array (which have the same physical address). Therefore you
don't see many cache misses: basically, those for startup, those
for loading in the zero-filled page (with 4K pages and 32-byte lines,
that's only 128 misses).

Once you write to the array, the pages will be copied-on-write, and
you will get distinct physical pages (even if the content is the same;
google for mergemem for a remedy). Then the CPU needs to load the
physical memory into the L1 cache for the loads, and will flush out
other cache lines, so you see the behaviour you expected.

Followups set to colds.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Anton Ertl
2004-11-26 10:17:39 UTC
#define SIZ 4*1024
static char buffer[SIZ];
int main(void)
{
register int i, b = 0, j;
//start counting L1 data misses
for(i=0;i<SIZ;i++)
b+=buffer[i];
//stop counting L1 data misses
return b;
}
SIZ            L1 data misses
1*1024                     33
4*1024                    128
6*1024                    200
8*1024                    264
12*1024                   268
40*1024                   297
400*1024                  677
4*1024*1024              4701
8*1024*1024              8829
According to what you said, the L1 data misses should remain almost
unchanged when I increase SIZ. But you can see here that the L1 data
misses grow as SIZ grows. The PowerPC 750 L1 data cache is
32 KBytes, 8-way set-associative; the tags are physical, but the set
index is taken from the effective address.
TLB misses? 1 D-cache miss/KB seems to be a little high, though; I
would expect 1 TLB miss per page resulting in 2 D-cache misses or so
(i.e. 1/2KB).

BTW, please stop crossposting and please quote properly. Read
http://www.complang.tuwien.ac.at/anton/mail-news-errors.html#quoting
for details. And I don't need mail copies of your posts.

Followups set to colds.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html