Re: Salsa20 altivec timings
sci.crypt archive

From: D. J. Bernstein <djb@cr.yp.to>
Date: Fri Sep 30 2005 - 08:06:24 CEST

Paul Rubin wrote:
> But I think the obvious XMM code uses seven.

One can imagine XMM code using as few as six registers: four for the
previous data, one for a sum, one for a shifted sum (which wouldn't be
necessary if there were a rotate instruction).
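To illustrate the register count above, here is a minimal sketch in portable C rather than actual XMM assembly. The vec4 type and the vec_* and salsa_step names are hypothetical stand-ins introduced for this example; they model one 128-bit register as four 32-bit lanes. It shows one Salsa20-style step, target ^= rotl(a + b, n), where the missing rotate instruction forces a second temporary for the right-shifted copy of the sum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit XMM register: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4;

static vec4 vec_add(vec4 x, vec4 y) {
    for (int i = 0; i < 4; i++) x.lane[i] += y.lane[i];
    return x;
}

static vec4 vec_xor(vec4 x, vec4 y) {
    for (int i = 0; i < 4; i++) x.lane[i] ^= y.lane[i];
    return x;
}

static vec4 vec_shl(vec4 x, int n) {
    for (int i = 0; i < 4; i++) x.lane[i] <<= n;
    return x;
}

static vec4 vec_shr(vec4 x, int n) {
    for (int i = 0; i < 4; i++) x.lane[i] >>= n;
    return x;
}

/* One step: target ^= rotl(a + b, n), applied to all four lanes.
 * Without a rotate instruction the rotation is built from two shifts,
 * so two temporaries are live besides the data registers. */
static vec4 salsa_step(vec4 target, vec4 a, vec4 b, int n) {
    assert(n > 0 && n < 32);               /* shifting by 32 would be undefined */
    vec4 sum = vec_add(a, b);              /* temporary 1: the sum */
    vec4 shifted = vec_shr(sum, 32 - n);   /* temporary 2: the shifted sum */
    sum = vec_shl(sum, n);
    target = vec_xor(target, sum);         /* XOR in the high part of the rotation */
    target = vec_xor(target, shifted);     /* XOR in the wrapped-around low bits */
    return target;
}
```

Because the two shifted halves occupy disjoint bit positions, XORing them into the target one after the other gives the same result as XORing in the rotated sum. With a hardware rotate, the second temporary would not be needed.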

I used two extra registers (totalling all eight available) in my Pentium
4 XMM code to avoid delays from the horrible 4-cycle copy latency. One
register is an early copy of a register to be used for the next sum
(which wouldn't be necessary if there were a three-operand addition).
The other register is 0, allowing an add to replace a subsequent copy.

I still haven't figured out why the Pentium 4 f12 is taking 12 cycles,
rather than the expected 10, for each quarter-round. Subsequent Pentium
4 revisions take slightly under 12 cycles but still more than 10.

---D. J. Bernstein, Professor, Mathematics, Statistics,
and Computer Science, University of Illinois at Chicago
Received on Sat Oct 15 04:37:57 2005