The longest arithmetic operation

So far this is the absolute record for the binary size of one division/remainder/multiplication operation:

35 instructions, 87 bytes of code just to calculate a remainder of division by 2:

long long smod(long long x) { return x % 2; }

(compiled by gcc 3.4.4)
Anyone to come up with a longer single arithmeric operation?

19 thoughts on “The longest arithmetic operation”

  1. MSVC is not really much smarter.
    00000 6a 00 push 0
    00002 6a 02 push 2
    00004 ff 74 24 10 push DWORD PTR _a$[esp+8]
    00008 ff 74 24 10 push DWORD PTR _a$[esp+8]
    0000c e8 00 00 00 00 call __allrem
    00011 c2 08 00 ret 8
    And then if you expand _allrem, the code is actually longer than the case you mentioned 🙂
    I wonder what’s the smallest number of bytes you need to implement that signed modulo 2 function. Shortest one that I can come with is 18 bytes. (Uses _stdcall calling convention).

  2. Well, as soon as you see the name “__allrem”, you are done – you know the meaning of the code and can proceed.
    With a long sequence of instructions without any names you are at your own.

  3. Outrageous. So it’s back to hand-optimizing code? At least, we should check out the binary every once in a while.

  4. In fact this is a very good code and runs almost optimally. Spending time on making it faster is not worth the trouble at all.
    I completely agree with you that binaries should be checked from time to time. Some of them hide quite unexpected things…

  5. I always thought %2 should just return the lsb… But you say this code is almost optimal? What am I missing?

  6. Thank you for posting that temp fix for the wmf vulnerability,my msn messenger went heywire, tried everything, antispy, anti-virus, without much success then i remembered reading about that exploit, so did some snooping and found your posting, i installed it and back to normal again,so do you think microsoft will post a fix in there upcomming malicious removal tool,did you say in your article to unistall the temp fix before running windows updates, the only trouble is you dont know if they have got it in the removal tool, any suggestions. Anyway nice to see there is guys who are savvy in coding fighting on the good side. Regards Mike!

  7. Hi Mike,
    Thanks for your nice words! I think this comment is better in the other entry. I’ll try to move it there if I find how.

  8. I see. Mathematically the sign needn’t be preserved, I think, but I can see how it might be useful in programming. It still doesn’t look very optimal to me, but ok 😉

  9. Oh, you are absolutely right about optimization! I should have checked it once more before saying anything 🙂

  10. I can’t believe people are saying this is anywhere near optimal. For starters, you can obviously delete redundant lines at 2c3, 2c9 and 2ce. And in reality before line eax only had one bit of useful information in it, the high bit. And all instructions from 2c7 to 2d1 either operate on edx (which is wiped out at 2de) or work on eax but don’t alter the top bit. So you can delete all instructions from 2c7 to 2d1.
    Personally, I came up with the following sequence. I haven’t used Intel since before the 386, but I’m assuming the result comes out in edx:eax (H:L).
    mov eax, dword ptr [ebp+arg_0]
    and eax, 01h
    mov edx, dword ptr [ebp+arg_0+4]
    add edx, 0
    jns lbl
    neg eax
    lbl: sar edx, 1Fh
    I don’t know how long it is, and again, I’m not an Intel jock, so maybe it could be even better.

  11. You are right and I am wrong. The code is far from being optimal. I used the optimization flag in the command line but obviously I dropped it later.

  12. The mentioned code sequence has the problem that it contains a jump, which is VERY BAD for performance.
    While i agree that the code sequence is not optimal (though i’ve seen about the same function (%8 in that case) in about ~100 instructions on an Atmel AVR), the sign preserval is the real problem.
    Solution: use “unsigned” not only to extend the range, but also to express what has to be done with negative numbers.
    checking a user supplied array index “%size” would be a huge security risk, if the index is declared signed.
    MSVC once had a pretty neat way of preserving the signs in ~4 instructions without any jumps (that was for 32bit, though). Watcom just used idiv.

  13. Yeah, I know it has a jump. I tried to remove it, but I messed up, so I posted the other version.
    Jump or no, this version is faster than the other version. Additionally, it’s smaller, and smaller (overall) means less paging from disk, less cache utilization, less memory bandwidth needed to fetch instructions. That means it’s even faster.
    I know how to preserve sign on other architectures without jumping. I’ll figure it out for Intel. It’ll probably some from some kind of bittest (which goes into the carry on Intel) and then a carrying add to zero, thus putting the bit into the low part of a word.
    Either way, gcc did a poor job here.

  14. Hmm, in 6502 code at least, %2 of a signed number would be:
    cmp #$80 ;high bit -> carry
    adc #$00 ;add carry to low bit, flipping it if set
    and #$01 ;retain only the low bit
    Should translate to x86 🙂

  15. Well, I managed to get it down to 20 bytes (11 instructions). No jumps, no idiv.
    Although to be honest, I can’t quite see why you’d need the modulus of a negative number anyway…
    I quite liked the fact that MSVC’s code to peform the function call to __allrem is actually longer than my entire routine 🙂
    Ludvig, would you consider posting your 16-byte version? I’d like to see it.
    pop ecx ; 59
    pop eax ; 58
    pop edx ; 5a
    rol edx ; d1 c2
    and eax, 1 ; 83 e0 01
    and edx, eax ; 21 c2
    sub eax, edx ; 29 d0
    sub eax, edx ; 29 d0
    cdq ; 99
    sub esp, 8 ; 83 ec 08
    jmp ecx ; ff e1

  16. arg, I seem to have gotten too caught up in the post above me, having been shown this page from elsewhere. 🙂 Anyway, in restitution, what about this method?
    Roll the upper bit into the lower bit:
    SxxxxxxxxxxxxxL (S=sign bit, L=LSB, too few x’s)
    Then AND by 3 and do a table lookup, holding the table right after the function return.
    (non-x86 guy goes elsewhere now) 🙂

Comments are closed.