Re: Salsa20 and SSE2?
sci.crypt archive

From: D. J. Bernstein <djb@cr.yp.to>
Date: Mon Aug 29 2005 - 04:02:51 CEST

Paul Rubin wrote:
> I wonder if anyone (DJB?) has tried coding Salsa20 using SSE2
> instructions along the lines of Matthijs van Duin's Altivec
> implementation.

Yes:

   http://cr.yp.to/salsa20/salsa20_word_p4.q (qhasm)
   http://cr.yp.to/salsa20/salsa20_word_p4.s (traditional asm)

See also Section 7 of the ``Salsa20 speed'' document.

This code takes 48 cycles per round on the Pentium 4 f12. I don't know
why; the only obvious bottleneck is 40 cycles per round for arithmetic.
Perhaps I'm missing something in the Intel documentation, or perhaps
Intel has failed to document some relevant bottleneck. I would, in any
case, be interested in understanding where the number 48 comes from.
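
For readers who want to see the arithmetic being counted, here is a minimal
portable C sketch of one Salsa20 quarter-round and one column round. It is not
the qhasm or asm code linked above, and the function names (rotl32,
salsa20_quarterround, salsa20_columnround) are illustrative choices, not names
taken from those files. Each quarter-round is 4 additions, 4 rotations, and
4 xors, so a round of four quarter-rounds is 16 of each. SSE2 has no packed
32-bit rotate instruction, so a vector version has to build each rotation from
two shifts plus a combining operation.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by c bits; callers use 0 < c < 32. */
static uint32_t rotl32(uint32_t x, unsigned c)
{
    return (x << c) | (x >> (32 - c));
}

/* One Salsa20 quarter-round: 4 adds, 4 rotates, 4 xors, in place. */
static void salsa20_quarterround(uint32_t *y0, uint32_t *y1,
                                 uint32_t *y2, uint32_t *y3)
{
    *y1 ^= rotl32(*y0 + *y3, 7);
    *y2 ^= rotl32(*y1 + *y0, 9);
    *y3 ^= rotl32(*y2 + *y1, 13);
    *y0 ^= rotl32(*y3 + *y2, 18);
}

/* One column round on the 4x4 state stored row by row in x[0..15].
 * The four quarter-rounds touch disjoint words, which is what lets
 * a SIMD implementation process them side by side. */
static void salsa20_columnround(uint32_t x[16])
{
    salsa20_quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
    salsa20_quarterround(&x[5],  &x[9],  &x[13], &x[1]);
    salsa20_quarterround(&x[10], &x[14], &x[2],  &x[6]);
    salsa20_quarterround(&x[15], &x[3],  &x[7],  &x[11]);
}
```

The sketch only shows the operation count per round; it says nothing about
how those operations are scheduled on the Pentium 4.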

---D. J. Bernstein, Professor, Mathematics, Statistics,
and Computer Science, University of Illinois at Chicago
Received on Thu Sep 29 21:51:27 2005