Re: A factoring algorithm
Available news archives: comp.lang.tcl - comp.lang.python - comp.security.firewalls - sci.crypt - comp.lang.php - comp.lang.javascript
Google
 
Web news.hping.org


sci.crypt archive

Re: A factoring algorithm

From: Phil Carmody <thefatphil_demunged@yahoo.co.uk>
Date: Mon Jan 09 2006 - 19:25:23 CET

"Chris Card" <ctcard@hotmail.com> writes:
> > > <wbhart@sbcglobal.net> writes:
> > Great, I'll see if I can get it and compare. My implementation is not 90
> > digit
> > optimized, but is cache friendly. I of course use cache blocking, which Pari
> > does not.
> > Whilst for 60 digits, Pari uses essentially only the L2 cache on my Athlon,
> > it does
> > not get the sieve into the L1 cache. My code does. However, my sieve code is
> > very complex. Pari uses sieve code of around 4 lines. Last time I checked,
> > my
> > sieve code was numbers of pages. I get about a 30% speed increase over Pari
> > for sieving even before I apply cache blocking.
> >
> I'd like to understand writing cache friendly code for my GNFS
> implementation, so..
> - how can you tell that the sieve is in the L1 cache? just because it's
> a lot faster?
> - what is "cache blocking"?
> - what's the best reference to learn this stuff from?

Cache blocking is subdividing tasks that would normally want
to write to a large region of memory into smaller tasks that
want to write to a small region of memory, and then performing
all of the subtasks for each small region of memory togther

e.g. if you could keep an array of 250 in the fastest cache, then
the followin 4 tasks (and these are random numbers, not QS-related):
1) write to 145 204 297 305 781
2) write to 134 299 528 621 687 704 711 837 921
3) write to 7 421 576 601 660 686 715
4) write to 282 287 749 957 970 977

could be partitioned into the following subtasks:
1) write to 145 204 | 297 305 | | 781
2) write to 134 | 299 | 528 621 687 704 711 | 837 921
3) write to 7 | 421 | 576 601 660 686 715 |
4) write to | 282 287 | 749 | 957 970 977

and actually performed as:
a) write to 145 204, write to 134, write to 7
b) write to 297 305, write to 299, write to 421, write to 282 287
c) write to 528 621 687 704 711, write to 576 601 660 686 715, write to 749
d) write to 781, write to 837, 921, write to 957, 970, 977

This reduces the number of times you need to go off and fetch
data from outside the fastest cache, which is normally a
relatively expensive operation. (The hardware I am presently
working with could perform 200 multiplies in the time it takes
to access main memory, for example, and yet L1 cache access
would have no delay at all.) Previously there would have been 13,
now there are 4.

You can tell if a the memory used in a critical part of your code
is in the L1 cache by using performance monitoring counters, assuming
your processor has such a feature. Just set them up to count memory
accesses (or L1 hits) and L1 misses.

Phil

-- 
What is it: is man only a blunder of God, or God only a blunder of man?
-- Friedrich Nietzsche (1844-1900), The Twilight of the Gods
Received on Tue Jan 17 16:49:15 2006