Skip to content
This repository was archived by the owner on Oct 12, 2022. It is now read-only.

Rewrite core.checkedint functions. #1394

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

denis-sh
Copy link
Contributor

Use fast and simple sign comparison instead of long, slow and complicated value comparisons.

Also remove reference as it doesn't apply anymore.

P.S.

I planned to rewrite this module implementation since its creation because it's clearly written in a hurry (e.g. see r < x || r < y check in addu), sorry this took that long to get to it. IMHO this is one of the modules that have to show source code readers how simple, robust and efficient code should be written in D.

@MartinNowak
Copy link
Member

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

@dnadlinger
Copy link
Member

Why all the style changes? And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values, nor relying on operator precedence wherever possible.

Also, you should add your name after Walter's. He is still the primary author of the API.

@denis-sh denis-sh force-pushed the imprve-`core.checkedint` branch from a0b8b50 to 83ec881 Compare September 25, 2015 08:33
@denis-sh
Copy link
Contributor Author

Looks good, maybe add a few microbenchmarks to

OK, I will able to do it probably tomorrow.

Why all the style changes?

Do you mean const additions, spaces after cast(X), and ulong(x) * ulong(y) to cast(ulong) x * y changes? If so, what exactly is the problem here?

And because you specifically mention the module as a potential style showcase: I don't think we should encourage the use of e.g. bitwise not for boolean values,

No boolean values are used. All types are size_t.

nor relying on operator precedence wherever possible.

Hm, I don't see any complicated cases in my code, the only thing used are a & b || c && d (it's known logical operators has lower priority compared to arithmetic/bitwise) and a >>> b + c (C++'s cin/cout taught us about these cases).
Just a note: In contrast, previous file version had this: && ((r == y)? r : (r / x) != y), where the only probably needed for clarity parantheses pair is around r / x != y and is missed but three useless pairs are added just to complicate expression reading as (r == y) tells the reader ?: evaluates before == and then the reader may think the last != y compares the result of the whole previous text with y.

Also, you should add your name after Walter's. He is still the primary author of the API.

Sorry, I supposed it was alphabetically sorted for some reason. Fixed.

@@ -19,10 +19,9 @@
* relative to the cost of the operation itself, compiler implementations are free
* to recognize them and generate equivalent and faster code.
*
* References: $(LINK2 http://blog.regehr.org/archives/1139, Fast Integer Overflow Checks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why take out the reference? This is critical information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function implementations are completely rewritten so there is no connection with this blog any more. Also after looking into the blog it looks very silly: after the author performs bitshifts in checked_add_3 instead of using bitwise operators (like done in this pull) he uses this long and slow way:

(sa && sb && !sr) ||
(!sa && !sb && sr)

So is the really a reason to include a reference to this blog?

@WalterBright
Copy link
Member

As noted in the reference that was elided by this PR (and should be reinstated), doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.

I'm not saying your method is wrong, but I do suggest a careful review of the unittests to ensure that it really does work correctly and is tested thoroughly.

@denis-sh
Copy link
Contributor Author

...doing these checks is tricky because of the ways some compilers generate code. The way I wrote the code was to avoid various code generators invalidating the results.

Do you mean there are code generation bugs and these bugs should be considered in every module we write? Because I don't see when this pull rely on some concrete implementation of UB case.

@denis-sh
Copy link
Contributor Author

return cast(size_t) (n >>> T.sizeof * 8 - 1);

Many compilers generate very good code for bool b = a < b;. I suggest that instead of the shift.

size_t _differentSign(T)(in T a, in T b)

should return a bool

Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?

const r = x * y;
if ( ~x._differentSign(y) // x and y has the same sign
& r._signBit // and result is negative (covers min * -1)
|| x && r / x != y) // or perform simple check for x != 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something amazing going on with whitespace here. Quite dizzying to review at a glance.

@denis-sh denis-sh force-pushed the imprve-`core.checkedint` branch from 83ec881 to cdbc2c0 Compare September 27, 2015 11:30
@denis-sh
Copy link
Contributor Author

  • Improved the original version by using (a ^ b)._signBit instead of a._signBit ^ b._signBit to save one shift operation.
  • Restored reference to silly blog post (still against this change).

Do you suppose _signBit should be n => n < 0 instead of n => cast(size_t) (n >>> T.sizeof * 8 - 1) and _differentSign should be (a, b) => a._signBit != b._signBit instead of (a, b) => a._signBit ^ b._signBit?

  • Added commit which rewrites implementation this way. The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one) and very good on Clang ((a < 0) != (b < 0) is fully optimized but still there are two conditional jumps for a && b construction) (and sorry, didn't test LDC, only Clang).

@ibuclaw
Copy link
Member

ibuclaw commented Sep 27, 2015

The generated code still looks pretty good on GDC (just uses two conditional jumps instead of one)

Such a shame that GDC ignores these function bodies then. ;-)

if (r < int.min || r > int.max)
const r = x - y;
if ( x._differentSign(y) // x and y has different sign
&& x._differentSign(r)) // result and x has different sign
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really rather put the && at the end of the first line (which is recommended in style guides anyway, e.g. Code Complete) so that you don't need the funky whitespace in the first line. Just makes it harder for the brain to parse.

@dnadlinger
Copy link
Member

LDC won't use those function bodies either, except on platforms where there is no native instruction (e.g. long on x86).

@tsbockman
Copy link
Contributor

This version of core.checkedint still passes my preciseops test suite, which is considerably more thorough than the druntime unittests.

Performance-wise, I find it a bit slower with DMD than the old version was.

@denis-sh
Copy link
Contributor Author

Performance-wise, I find it a bit slower with DMD than the old version was.

With DMD, could you check the first commit (the one without the use of logical operations) please?

@tsbockman
Copy link
Contributor

(I should note that my tests assume no one has gotten around to implementing intrinsics for core.checkedint on DMD yet; I need to retest if this is incorrect.)

The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.

@denis-sh
Copy link
Contributor Author

The bitwise version was slightly slower than the logical version, even after I optimized it by combining some unnecessary shift operations.

What functions exactly are you measuring? Is adds for long slower? I.e. is ~cast(size_t)((x ^ y) >>> 63) & cast(size_t)((x ^ r) >>> 63) slower than x < 0 && y < 0 && r >= 0 || x >= 0 && y >= 0 && r < 0?

@tsbockman
Copy link
Contributor

I am running the benchmark found in main.d of my CheckedInt project. It uses checked arithmetic to do various make-work calculations like prime finding, integer square root, the collatz sequence, and quicksort.

It was designed to test the performance of my checked integer type, but in the process it exercises all the contents of core.checkedint. On DMD, most of the runtime is taken up by other things. But, when holding other factors constant, the performance differences between various core.checkedint implementations are discernible.

I am not really a big fan of micro-benchmarks, because it is so hard to get them right. If you really want, I can take a stab at breaking things down in more detail later. However, I think it's probably a waste of time; the old core.checkedint implementation is already fast enough that, with a good optimizer (by which I mean GDC), the performance boost from enabling intrinsics is barely noticeable.

Compared to bitwise equivalents, the kind of test logic used in the old implementation looks slow, but in reality it is highly amenable to optimization and superscalar execution.

If this pull request is accepted, it should be for some reason other than runtime performance.

@denis-sh denis-sh force-pushed the imprve-`core.checkedint` branch from cdbc2c0 to 2986f54 Compare September 29, 2015 17:06
@denis-sh
Copy link
Contributor Author

If this pull request is accepted, it should be for some reason other than runtime performance.

Agree. I tried to improve the whole module implementation style.

@denis-sh
Copy link
Contributor Author

This is a && b / c != d, parenthesis really does help make it clear at a glance just exactly what it is doing.

Readded.

I would really rather put the && at the end of the first line

Changed as I'm probably the only one here using such style and ignoring Code Complete.

@MartinNowak
Copy link
Member

I am not really a big fan of micro-benchmarks

Let's please add them anyhow. Number of instructions and latency is a good proxy, though you can hardly measure the impact on the branch cache.

@denis-sh
Copy link
Contributor Author

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Not sure about benchmarks. For now I only can describe resulting machine code.

This is what Clang does (one can see full (a < 0) != (b < 0) optimization applied):

adds(int, int, bool*):
    lea eax, [rsi + rdi]
    xor esi, edi
    js  .LBB0_3
    xor edi, eax
    jns .LBB0_3
    mov byte ptr [rdx], 1
.LBB0_3:
    ret

and this is what GCC does (just direct shifting and comparison, looks rather slow):

adds(int, int, bool*):
    leal    (%rdi,%rsi), %eax
    shrl    $31, %esi
    shrl    $31, %edi
    cmpb    %dil, %sil
    jne .L4
    movl    %eax, %ecx
    shrl    $31, %ecx
    cmpb    %cl, %sil
    je  .L4
    movb    $1, (%rdx)
.L4:
    rep ret

So yes, one have to use (a ^ b)._signBit or GCC will not optimize it (but if one does use ^, GCC will produce the same assembly as Clang's shown above).

@tsbockman
Copy link
Contributor

@MartinNowak

Looks good, maybe add a few microbenchmarks to https://github.com/D-Programming-Language/druntime/tree/master/benchmark.

Is there an equivalent standard benchmark for Phobos somewhere?

@denis-sh
Copy link
Contributor Author

~diffSign(x, y) && diffSign(x, r) is a wrong code if diffSign returns 1.

Further looking at resulting assembly is even more interesting:

~diffSign(x, y) && diffSign(x, r) gives us this with with Clang:

adds(int, int, bool*):
    add esi, edi               # looks unoptimal, could use EAX via lea
    xor edi, esi
    jns .LBB0_2
    mov byte ptr [rdx], 1
.LBB0_2:
    mov eax, esi               # as EAX wasn't used
    ret

and this with GCC (if (a ^ b)._signBit is used of course):

adds(int, int, bool*):
    lea eax, [rsi+rdi]
    xor edi, eax
    jns .L4
    mov BYTE PTR [rdx], 1
.L4:
    rep ret

GCC is the "winner" here. The instruction count winner who can't optimize (a < 0) != (b < 0).

Usage of ! instead of ~ or & instead of && increases instruction count, sometimes a lot.

@denis-sh
Copy link
Contributor Author

> Further looking at resulting assembly is even more interesting:

Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.

@denis-sh
Copy link
Contributor Author

Further looking at resulting assembly is even more interesting:

Good short assembly. Good except it's incorrect. Looks like it didn't take much time too find codegen bug.

Sorry for the noise, just my example code was incorrect. ~1 isn't zero.

@denis-sh denis-sh force-pushed the imprve-`core.checkedint` branch from 2986f54 to f57723c Compare September 29, 2015 18:37
Use fast and simple sign comparison instead of long, slow and complicated value comparisons.

Also remove reference as it doesn't apply anymore.
At least Clang knows `(a < 0) != (b < 0)` pattern.

The only possible shortcoming of this change is that use of logical operators currently causes compilers to emit two conditional jumps instead of one for bitwise operations.
@denis-sh
Copy link
Contributor Author

Changed the first commit to use sign-extending shift to make ~diffSign(x, y) && diffSign(x, r) construction valid too because otherwise there should be an warning comment and I don't want to add one just for the first commit.

@denis-sh
Copy link
Contributor Author

Not sure about benchmarks.

Will try to add one in a couple of days.

@MartinNowak
Copy link
Member

Is there an equivalent standard benchmark for Phobos somewhere?

Not that I'm aware of, but this PR is druntime code anyhow.

@tsbockman
Copy link
Contributor

@MartinNowak

Not that I'm aware of, but this PR is druntime code anyhow.

Sure. I was just wondering because it would be relevant to something else I'm working on.

@DmitryOlshansky
Copy link
Member

Ping @denis-sh

@denis-sh
Copy link
Contributor Author

Ping @denis-sh

version block styling issue or request for benchmark?

@DmitryOlshansky
Copy link
Member

version block styling issue or request for benchmark?

I think benchmarking is more important and was your last intention.

@dlang-bot dlang-bot added Needs Rebase needs a `git rebase` performed stalled Needs Work labels Jan 1, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Needs Rebase needs a `git rebase` performed Needs Work stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants