karl malbrain wrote:
> Of note is the table organization: 32 or 8 bit. Large tables cause
> problems. Tiny table implementations are immune because the whole
> table fits in 1/4 of the cache lines needed by large tables.
The problem [as I understand it] has to do with cache-bank conflicts.
Because the memory is dual ported you can perform two [upto] 64-bit
reads at a time. But they cannot be in the same bank on a cache line.
[I'm going from memory, I have the specs around somewhere at the
office].
... found the opt guide [public knowledge, this is on the AMD website]
on my AMD laptop ...
Section 5.9 [page 130 of the PDF]
"Utilize pair loads that do not have a bank conflict in the L1 data
cache to improve load throughput".
Bank bits are defined as bits 3 through 5 [zero based, 0 == least
significant] of the address. So banks are 8 bytes long. An "index" is
defined as bits 6 through 14 of the address. A bank conflict occurs
when the index of the two reads is different but the bank is the same.
So reading from 0x00000000 and 0x00000040 would cause a stall.
The attack works because indexes are always different (the four tables
are offset by 0x400 bytes) but the bank has a 1/8 chance of
conflicting.
The single 8x32 case uses 0x400 bytes and is split across 16 indexes.
So not only can loads be in the same index but the extra rotates
required to emulate the three other tables are adding work that masks
the input-variant bank delays.
To truly be immune your table would have to be <= 64 bytes in size. Or
you would have to ensure that no two reads are within three cycles or
to offset the tables.
For instance, you could spread the 1KB 8x32 table over more space.
There are 256 entries. If you split them over the 8 banks each 4-byte
entry would take 64 bytes. That makes the 1KB table 16KB. It would
still fit in the L1 and yet your chances of hitting are much lower. Of
course indexing such tables would be a pain...
You could also stick with the four tables of 8x32 but then encode each
of the four tables into the same array. Offset each table by one bank.
That would make good use of EA in the x86 world as basically you
multiply the index by 16 to get the address.
Oddly enough this would probably make the code faster for
micro-benchmarks since more of the loads would be successful.
The problem is different cpus have different hazards. So while you
could optimize this strictly for AMD [not a bad idea :-)] it could just
as easily become MORE insecure on Intel, PPC, ARM or the others.
The real trick with cached loads to avoid input-variant stalls is to
just make sure you're not doing back to back reads [or writes] faster
than the latency of the L1.
Tom
Received on Mon May 1 02:03:18 2006