Anton Ertl
2004-11-25 10:54:03 UTC
Hi all,
I am using PAPI (the Performance API, which reads the hardware
performance counters, a set of registers in the CPU) to collect data
about cache activity on a PowerPC 750, but some of the cache data is
really puzzling. Take the L1 data misses, for example. The PPC 750 user
manual describes this event as "Number of L1 data cache misses. Does
not include cache ops". I wrote two C programs and measured their L1
load misses. For comparison, I also ran the same programs on an Intel
Celeron (Coppermine) and used PAPI to get the L1 data misses. I put the
platform info in parentheses right after the
 1 #define SIZ 8096000
....
 2 register i,b,j;
 3 char *buffer;
 4 buffer = (char *)malloc(SIZ);
 5
 6 for(j=0;j<SIZ;j++)
 7     buffer[j]=0x03;
 8
 9 //begin counting using PAPI_start_counters
10 for(j=0;j<20;j++) {
11     for(i=0;i<SIZ;i++) {
12         b=buffer[i];
13     }
14 }
15 //end counting using PAPI_stop_counters
16
17 return b;
I don't know the reason. Why do the various variants of lines 6-8 of
Program 1 affect the L1 data misses so much? I would much appreciate it
if someone could help me with this.
If you don't initialize the array, all page table entries will point
to the same zero-filled page (it does not matter if you allocated the
array with malloc, or implicitly as uninitialized data). The caches
on the CPUs you are looking at are physically tagged, so the page
needs only one copy in the cache, and it will satisfy all the accesses
to the array (which have the same physical address). Therefore you
don't see many cache misses: basically, those for startup, those
for loading in the zero-filled page (with 4K pages and 32-byte lines,
that's only 128 misses).
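
The arithmetic behind that figure can be sketched in C. This is only an illustration: the function names are hypothetical, and the 4K page and 32-byte line sizes are the ones assumed in the paragraph above.

```c
#include <assert.h>
#include <stddef.h>

/* Cache lines needed to hold one page: the number of compulsory misses
 * if every access ends up in the same physical page.
 * (Hypothetical helper, for illustration only.) */
static size_t lines_per_page(size_t page_bytes, size_t line_bytes)
{
    assert(line_bytes > 0 && page_bytes % line_bytes == 0);
    return page_bytes / line_bytes;
}

/* Cache lines touched by one sequential pass over a buffer whose pages
 * are all physically distinct (rounded up to whole lines).
 * (Hypothetical helper, for illustration only.) */
static size_t lines_per_pass(size_t buffer_bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (buffer_bytes + line_bytes - 1) / line_bytes;
}
```

With these, lines_per_page(4096, 32) gives the 128 misses mentioned above, while lines_per_pass(8096000, 32) gives 253000 lines for a single pass over the SIZ-byte buffer once its pages are distinct.
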
Once you write to the array, the pages will be copied-on-write, and
you will get distinct physical pages (even if the content is the same;
google for mergemem for a remedy). Then the CPU needs to load the
physical memory into the L1 cache for the loads, and will flush out
other cache lines, so you see the behaviour you expected.
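
A minimal sketch of the two cases in C, under some assumptions: the helper names and PAGE_BYTES are hypothetical, the page size is taken to be 4K, and the buffer is assumed to come from an allocator that hands out untouched zero-filled pages (for example a large calloc on Linux, which typically behaves this way). It is not the original test program.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u /* assumed page size, as in the explanation above */

/* Write one byte into every page of the buffer. Each write forces the
 * kernel to give that page its own physical frame (copy-on-write).
 * Returns the number of pages written. */
static size_t touch_every_page(volatile unsigned char *buf, size_t bytes,
                               unsigned char value)
{
    size_t pages = 0;
    assert(buf != NULL);
    for (size_t off = 0; off < bytes; off += PAGE_BYTES) {
        buf[off] = value;
        pages++;
    }
    return pages;
}

/* Sequential read pass, like the measured loop in Program 1. The sum is
 * returned and the pointer is volatile so the loads are not removed. */
static unsigned long read_pass(const volatile unsigned char *buf, size_t bytes)
{
    unsigned long sum = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < bytes; i++)
        sum += buf[i];
    return sum;
}
```

Calling read_pass on the untouched buffer should show few L1 misses, while calling it again after touch_every_page should show roughly one miss per cache line, if the counters are started and stopped around each read_pass call.
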
Followups set to colds.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html