Re: Salsa20 and SSE2?
Available news archives: comp.lang.tcl - comp.lang.python - comp.security.firewalls - sci.crypt - comp.lang.php - comp.lang.javascript
Google
 
Web news.hping.org


sci.crypt archive

Re: Salsa20 and SSE2?

From: D. J. Bernstein <djb@cr.yp.to>
Date: Mon Aug 29 2005 - 08:49:46 CEST

Paul Rubin wrote:
> It's disappointing that the SSE2 instructions are so slow.

The biggest part of the problem is the number of instructions required:

   * XMM instructions can't simply add the way that LEA can. They have
     to copy, then add.

   * XMM instructions can't simply rotate the way that ROL can. They
     have to copy, then shift, then shift, then xor.

   * XMM instructions can't even copy in one instruction without serious
     P4 latency problems. They have to copy a 0, then xor.

Even if you aren't worried about P4 latency, the simple add-rotate-xor
in Salsa20 turns into at least seven XMM instructions. Each round has
four add-rotate-xor operations and three shuffles, for a total of at
least 31 XMM instructions per round.

For comparison, traditional non-XMM code takes 26.75 cycles per round on
the Athlon, and completely straightforward serial code takes 24.5 cycles
per round on the PowerPC G4.

> Alternatively, any chance of a "salsa40" function that uses 64-bit
> scalar operations? What are the obvious pitfalls in designing one
> that more or less imitates salsa20?

The main problem is that the 64-bit speedup is accompanied by a severe
32-bit slowdown. Shuffling sixteen 64-bit variables into x86 registers
takes a lot of time; also, add-with-carry is often fairly slow.

---D. J. Bernstein, Professor, Mathematics, Statistics,
and Computer Science, University of Illinois at Chicago
Received on Thu Sep 29 21:51:29 2005