Try-catch speeding up my code?

(stackoverflow.com)

144 points · vishal0123 · 12y ago · 59 comments

gorhill12y ago· 16 in thread

I've actually had a similar question about JS for a while now (with Chromium 34). Consider these two pieces of code, which do exactly the same thing. One is a standalone function:

    var makeKeyCodepoint = function(word) {
        var len = word.length;
        if ( len > 255 ) { return undefined; }
        var i = len >> 2;
        return String.fromCharCode(
            (word.charCodeAt(    0) & 0x03) << 14 |
            (word.charCodeAt(    i) & 0x03) << 12 |
            (word.charCodeAt(  i+i) & 0x03) << 10 |
            (word.charCodeAt(i+i+i) & 0x03) <<  8 |
            len
        );
    };

The other a method:

    var MakeKeyCodepoint = function() {};
    MakeKeyCodepoint.prototype.makeKey = function(word) {
        var len = word.length;
        if ( len > 255 ) { return undefined; }
        var i = len >> 2;
        return String.fromCharCode(
            (word.charCodeAt(    0) & 0x03) << 14 |
            (word.charCodeAt(    i) & 0x03) << 12 |
            (word.charCodeAt(  i+i) & 0x03) << 10 |
            (word.charCodeAt(i+i+i) & 0x03) <<  8 |
            len
        );
    };
    var makeKeyCodepointObj = new MakeKeyCodepoint();

Now why does the standalone function run at just over 6.3M op/sec, while the method runs at 710M op/sec (on my computer)?

Try it: http://jsperf.com/makekey-concat-vs-join/3

chewxy12y ago

I could be wrong (and if so, pie my face), but I believe it's mostly due to one of the many inline cache optimizations that V8 employs.

Let's consider the receiver (i.e. the `this` value) of Example 1 and 2. The receiver of Example 1 is Benchmark, if invoked normally. The receiver of Example 2 is the empty function object function(){}.

When you call makeKeyCodepointObj.makeKey() - the VM looks up the object's prototype chain and finds the function. This call site is cached (think of it as a K:V store, where the key is "makeKeyCodepointObj.makeKey" and the value is the call site of the function.)

When you call makeKeyCodepoint(), the VM has to, for each call, look up the prototype chain until it finds the variable. The variable is then resolved into the function call site. Because of scoping issues in JS, I don't think this is cached (or if it's cached, it'd be invalidated a lot), and a lookup has to happen every time. (I know in my JS engine, I tried to perform caching optimization for global object properties and I gave up).

TL;DR: Function lookups happen all the time when the function is a method of the global object. When a function is a method of an object, the lookup is cached.

If I am talking out of my arse, please feel free to correct me.

Stratoscope12y ago

I don't think a global variable lookup is the reason for the difference. Here is the code that jsperf generates for the function version of the test:

    (Benchmark.uid1400600789397runScript || function() {})();
    Benchmark.uid1400600789397createFunction = function(window, t14006007893970) {
        
        var global = window,
            clearTimeout = global.clearTimeout,
            setTimeout = global.setTimeout;
            
        var r14006007893970, s14006007893970, m14006007893970 = this,
            f14006007893970 = m14006007893970.fn,
            i14006007893970 = m14006007893970.count,
            n14006007893970 = t14006007893970.ns;
        
        // Test Setup
        var makeKeyCodepoint = function(word) {
            var len = word.length;
            if (len > 255) {
                return undefined;
            }
            var i = len >> 2;
            return String.fromCharCode(
                (word.charCodeAt(    0) & 0x03) << 14 |
                (word.charCodeAt(    i) & 0x03) << 12 |
                (word.charCodeAt(  i+i) & 0x03) << 10 |
                (word.charCodeAt(i+i+i) & 0x03) <<  8 |
                len
            );
        };
        
        s14006007893970 = n14006007893970.now();
        while (i14006007893970--) {
            // Test Code
            var key;
            
            key = makeKeyCodepoint('www.wired.com');
            key = makeKeyCodepoint('www.youtube.com');
            key = makeKeyCodepoint('scorecardresearch.com');
            key = makeKeyCodepoint('www.google-analytics.com');
        }
        r14006007893970 = (n14006007893970.now() - s14006007893970) / 1e3;
        
        return {
            elapsed: r14006007893970,
            uid: "uid14006007893970"
        }
    }

The test setup and the test itself are all part of the same function, and makeKeyCodepoint is a local variable in that function.

thedufer12y ago

The first two tests on that jsperf don't show the same behavior, though, and they differ in the same way.

tantalor12y ago

A perf test without side effects is suspect because the compiler can remove dead code. You should add asserts on the return values.
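
A minimal sketch of what that could look like, assuming a hand-rolled timing loop rather than jsperf's generated harness. The `runBench` helper and its `sink` checksum are hypothetical names, not part of the original test; the only point is that every return value is checked and folded into something the caller can observe, so the engine cannot treat the calls as dead code.

```javascript
// Function under test, copied from the comment at the top of the thread.
var makeKeyCodepoint = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return String.fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};

// Hypothetical helper: calls fn on every word, asserts on each result,
// and folds the results into a checksum that is returned to the caller.
var runBench = function(fn, words, iterations) {
    var sink = 0;
    var start = Date.now();
    for ( var n = 0; n < iterations; n++ ) {
        for ( var j = 0; j < words.length; j++ ) {
            var key = fn(words[j]);
            if ( typeof key !== 'string' || key.length !== 1 ) {
                throw new Error('unexpected key for ' + words[j]);
            }
            sink = (sink + key.charCodeAt(0)) | 0;
        }
    }
    return { elapsedMs: Date.now() - start, sink: sink };
};

var words = [
    'www.wired.com',
    'www.youtube.com',
    'scorecardresearch.com',
    'www.google-analytics.com'
];
var result = runBench(makeKeyCodepoint, words, 100000);
console.log(result.elapsedMs + ' ms, checksum ' + result.sink);
```

Printing the checksum at the end matters as much as computing it: a value that is never read can still be optimized away.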

gorhill12y ago

I thought that was an interesting comment, I did wonder originally if this could be something like that, but didn't follow up.

So now I took the time to try to go around this by rearranging the calls, and all of a sudden results make more sense:

http://jsperf.com/makekey-concat-vs-join/10

Results:

1. Firefox 29 makeKeyConcat / makeKeyConcatObj = ~440 Mops/s

2. Firefox 29 makeKeyCodepoint / makeKeyCodepointObj = ~64 Mops/s

3. Chrome 34 makeKeyCodepoint / makeKeyCodepointObj = ~5.7 Mops/s

4. Chrome 34 makeKeyConcat / makeKeyConcatObj = ~2.2 Mops/s

mike-cardwell12y ago

On my 64bit Linux Firefox 29 desktop:

  codepoint    : 1,172,182,887 ops/sec
  codepoint obj: 1,168,116,461 ops/sec

No significant difference.

acdha12y ago

Chrome 34:

    function:      5,572,574
      method:    903,375,064

Firefox 29:

    function:  1,747,475,085 
      method:  1,727,244,041

thedufer12y ago

As bzbarsky points out, tests in the realm of 1e9 op/sec look like the entire function is being optimized away because there are no side effects. Something about the method allows the engine to do this, while it doesn't think it's okay for the function version.

One thing I found out is that dropping `String.fromCharCode` in favor of a local function (`var fromCharCode = String.fromCharCode.bind(String);`) causes neither of them to be optimizable. See http://jsperf.com/makekey-concat-vs-join/9
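
For reference, the variation described above might look roughly like this. It is a sketch only: the name `makeKeyCodepointBound` is made up here, and whether the bound call really prevents the optimization depends on the engine and its version.

```javascript
// Bound copy of String.fromCharCode, as in the comment above.
var fromCharCode = String.fromCharCode.bind(String);

// Hypothetical name. Same body as makeKeyCodepoint, except that it calls
// the local bound function instead of String.fromCharCode directly.
var makeKeyCodepointBound = function(word) {
    var len = word.length;
    if ( len > 255 ) { return undefined; }
    var i = len >> 2;
    return fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
};
```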

ciupicri12y ago

On Fedora 20 x86_64 with midori-0.5.8-1 (webkitgtk-2.2.7-1):

   concat    | concat obj | codepoint | codepoint obj
   ----------+------------+-----------+--------------
   2,243,214 | 1,983,801  | 1,823,882 | 1,746,316

On Fedora 20 x86_64 with epiphany-3.10.3-1 (webkitgtk3-2.2.7-1):

   concat    | concat obj | codepoint | codepoint obj
   ----------+------------+-----------+--------------
   2,515,750 | 2,280,291  | 2,187,448 | 1,957,199

vishal0123OP12y ago

If this is not enough, try http://jsperf.com/single-vs-multiple-times-2. Running the function four times is faster than running it a single time.

amalcon12y ago

Which JS VM are you using to test this? That matters a lot for this sort of thing.

I ran it past the one in my browser (current Firefox, Linux) and didn't see a significant difference.

acdha12y ago

Try Chrome 34 – the difference is massive. What's interesting is that the best result in Chrome is 45% slower than the worst result in Firefox 29 so it's probably a question of why v8 is failing to JIT the first 3 versions.

gorhill12y ago

Argh sorry, forgot to mention the browser. It's Chromium 34/Linux 64-bit.

NDizzle12y ago

Interesting. Chrome 34 here as well, and the scores from the jsperf link are 1.6M, 1.6M, 6.3M, 960M.

SixSigma12y ago

They are not exactly the same ergo they are different.

gorhill12y ago

The "two pieces of code" I am referring to are obviously the body of the function and method. (Following your comment I had to look again, I thought I missed something).

userbinator12y ago· 5 in thread

I wasn't surprised to see it had to do with register allocation, since I've encountered some extremely odd compiler output with similar issues before. "Why would it ever decide this was a good idea?" is the thought that often comes to mind when looking through the generated code.

Register allocation is one of those areas where I think compilers are pretty horrible compared to a good or even mid-level Asm programmer, and I've never understood why graph colouring is often the only way that is taught because it's clearly not the way that an Asm programmer does it, and is also completely unintuitive to me. It seems to assume that variables are allocated in a fixed fashion and a load-store architecture, which is overly restrictive for real architectures like x86. There's also no interaction between RA and instruction selection, despite them both influencing each other, whereas a human programmer will essentially combine those two steps together. The bulk of research appears to be stuck on "how do we improve graph colouring", when IMHO a completely new, more intuitive approach would make more sense. At least it would make odd behaviour like this one less of a problem, I think.

davidcuddeback12y ago

Register allocation is an NP-complete problem. Graph coloring works because it can be done with information available to a compiler from live variable analysis.

Problems are classified as NP-complete based on the Turing machine model. Since the human brain is not a Turing machine, it may be better suited for solving NP-complete problems than a computer. Solutions to NP-complete problems commonly employ heuristics to strike a balance between completeness and efficiency. The human brain seems (at least to me) to be better at solving problems involving heuristics. Chess is an obvious example.
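
To make the graph-coloring idea concrete, here is a toy sketch. It assumes the live ranges are already known as inclusive `[start, end]` instruction indexes (roughly what a live variable analysis would give you), builds an interference graph from overlapping ranges, and colors it greedily. The `allocateRegisters` function, the range format and the `'spill'` marker are all illustrative assumptions; production allocators use far more careful ordering and spilling heuristics.

```javascript
// Toy register allocation by greedy graph coloring.
// liveRanges: { name: [start, end] } with inclusive instruction indexes.
// numRegs: number of registers available.
function allocateRegisters(liveRanges, numRegs) {
    var names = Object.keys(liveRanges);

    // Interference graph: two variables interfere if their ranges overlap.
    var interferes = {};
    names.forEach(function(name) { interferes[name] = []; });
    for ( var x = 0; x < names.length; x++ ) {
        for ( var y = x + 1; y < names.length; y++ ) {
            var a = liveRanges[names[x]];
            var b = liveRanges[names[y]];
            if ( a[0] <= b[1] && b[0] <= a[1] ) {
                interferes[names[x]].push(names[y]);
                interferes[names[y]].push(names[x]);
            }
        }
    }

    // Greedy coloring, visiting the most-constrained variables first.
    var order = names.slice().sort(function(p, q) {
        return interferes[q].length - interferes[p].length;
    });
    var assignment = {};
    order.forEach(function(v) {
        // Registers already used by neighbours that have been colored.
        var taken = {};
        interferes[v].forEach(function(n) {
            if ( typeof assignment[n] === 'number' ) {
                taken[assignment[n]] = true;
            }
        });
        // Pick the lowest free register, or mark the variable as spilled.
        assignment[v] = 'spill';
        for ( var r = 0; r < numRegs; r++ ) {
            if ( !taken[r] ) { assignment[v] = r; break; }
        }
    });
    return assignment;
}

var regs = allocateRegisters(
    { a: [0, 4], b: [1, 2], c: [3, 5], d: [1, 5] },
    2
);
console.log(regs);
```

In this example `a` and `d` are live across almost everything, so with two registers they take both and `b` and `c` are marked as spills; with three registers, `b` and `c` can share the third because their ranges never overlap.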

userbinator12y ago

To me, whether it's NP-complete is of little concern, since humans have been allocating registers (and beating compilers) with little difficulty. On the contrary, I feel that being labeled as NP-complete has somehow slowed the development of better RA algorithms, on the basis that it's "too hard". There's a saying "one of the first steps to accomplishing something is to believe that it's possible", and if people believe that RA is a more difficult problem than it really is, then that has a discouraging effect. NP-completeness is only related to the complexity as the problem size increases, but in practice the problem sizes aren't that big --- e.g. within a function, having several dozen live variables at a time is probably extremely uncommon, and machines don't have that many registers - a few dozen at most.

I think the human brain is probably Turing-equivalent, but it also likely doesn't matter -- if I can describe the algorithm that I, as a human, take to perform compiler-beating register allocation and instruction selection, then a machine can probably do it just as well if not faster than me since it can consider many more alternatives and at a much faster rate.

I agree that heuristic approaches are the way to go, but in a "too far abstracted" model like graph colouring, some heuristics just can't be easily used; e.g. the technique of "clearing the table" --- setting up all the registers prior to a tight loop, so that the instructions within do not have to access memory at all. Using push/pop (x86) for "very ephemerally-spilled" values is another one.

mzl12y ago

The simple explanation why is that both problems are very hard to solve in isolation. Combining the problems makes it even harder.

That said, of course it is better to try to solve the real problem and not only the decomposition. I think that the focus on cheap compiler passes is a bit obsolete now. A cheap debug-build with some optimizations is good, but for a real production build where speed is needed I am willing to wait quite some extra time to get the last few drops of performance.

I've seen some researchers looking into solving the combined problem which feels promising. That kind of interest combined with the nice modularity of LLVM makes me quite optimistic that there will be nice results that are relevant both academically and industrially.

xxs12y ago

I always thought the advent of RISC was a side effect of the (very) low register count and how hard it is to actually make optimal use of them without manually writing Assembler.

The flat memory model on i386 at least made it somewhat easier compared to the segmented addressing mode that came with the 8086.

jmgrosen12y ago

Having just reverse engineered some 16-bit DOS code, I agree with your last statement wholeheartedly. That was scary.

nutjob212y ago· 5 in thread

It's a compiler bug.

teebot12y ago

I wish I could say that more often

hugi12y ago

No you don't :)

yaur12y ago

Maybe try compiling stuff with GCC 2.96 until you get that out of your system. I'm personally very happy that it's been years since some weird behavior turned out to be a compiler bug.

LeonM12y ago

Believe me, you really don't!

qntmfred12y ago

there is something oddly satisfying about stumbling upon compiler bugs. here's one i discovered a while back http://stackoverflow.com/questions/11303732/x86-vs-anycpu-re...

fulafel12y ago· 5 in thread

Puzzling that so many people still run in i386 mode. I haven't used a 32-bit system since shortly after x86 hardware went 64-bit, 10+ years ago. I guess in the Windows world it's because of XP?

ufmace12y ago

I code in C# mostly, and we end up doing almost all of our builds in x86 only mode because of dependencies. If you build in x64 or AnyCPU and any of your dependent DLLs aren't x64-compatible, then you'll crash.

Also, a lot of our customers apparently use XP 32bit and have no intention of updating anytime soon. Sigh...

listic12y ago

I run in i386 because it uses less memory (though I think I'll be switching).

davidw12y ago

Yeah, on hosting systems where memory costs money, 32 bit can be a significant savings. I wrote this a few years ago:

http://journal.dedasys.com/2008/11/24/slicehost-vs-linode/

dbaupp12y ago

Linux offers the x32 ABI[1] for this reason: small 32-bit pointers but maintaining the advantages of x86-64 (more/bigger registers etc).

[1]: http://en.wikipedia.org/wiki/X32_ABI

pron12y ago

Java uses 32-bit pointers in 64-bit mode by default.

http://docs.oracle.com/javase/7/docs/technotes/guides/vm/per...

stinos12y ago

One of the nice things about this question (apart from the serious in-depth answers) is that Eric Lippert himself comes back with an answer after discussing it directly with the people who can actually provide the proper fix. Q&A at its best!

Edit: the same goes for Jon Skeet of course, and while looking for info about him I came across this http://meta.stackexchange.com/questions/9134/jon-skeet-facts... which has some hilarious ones like

"Jon Skeet's SO reputation is only as modest as it is because of integer overflow (SQL Server does not have a datatype large enough)" and "When Jon Skeet points to null, null quakes in fear."

driax12y ago

Notice that this question is 2 years old. I would imagine that quite a lot has happened with Roslyn since then. (They even talk about some of what they were working on.) Nevertheless, quite interesting.

logn12y ago

I don't code in C# but it would be interesting to surround the code in just a block instead of a try-catch block and see if the same behavior is evident. If a plain block is still slow, then maybe branch prediction gets overwhelmed with considering having to unwind the stack all the way to main and dealing with open streams and objects on the stack.

edit: I don't know byte code or machine code well so my description of what happens unwinding the stack is probably wrong, but my point is just that it's simpler for the CPU not having the possibility of unwinding the stack beyond the code section OP called out.

batmansbelt12y ago

It's literally the best feeling in the world when Eric Lippert answers your c# question.

It's like if you cried out "dear God, why?" about your troubles but actually got a response.
