> We have presented the first limit study that tries to quantify the costs of various dynamic language features in Python.
This is spot on with what we were doing as well; it's great to have this as a reference.
> 1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?
We did not actually consider this, and it would be a very interesting direction to explore. For the unoptimized version of Cannoli we look up variables in a list of hash tables (which represent the current levels of scope). We also performed a scope optimization that uses indices to access scope elements instead, and this was much faster. However, it meant that features like `exec` and `del` were no longer permitted, since we would not be able to statically determine all scope elements at compile time (consider `exec(input())`: this could introduce anything into scope, and we can't track that).
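To make the two lookup strategies concrete, here is a minimal Rust sketch. The `Value` type and function names are illustrative assumptions, not Cannoli's actual code: the unoptimized path walks a stack of scope hash tables from innermost to outermost, while the optimized path resolves a name to a fixed slot index ahead of time so a read is a single vector access.

```rust
use std::collections::HashMap;

// Stand-in for Cannoli's Value enum (illustrative only).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
}

// Unoptimized lookup: search each scope table, innermost first.
fn lookup_by_name(scopes: &[HashMap<String, Value>], name: &str) -> Option<Value> {
    scopes.iter().rev().find_map(|table| table.get(name).cloned())
}

// Scope-optimized lookup: the name was resolved to a slot index at
// compile time, so the read is a direct index into a vector.
fn lookup_by_index(slots: &[Value], index: usize) -> Value {
    slots[index].clone()
}

fn main() {
    let mut globals = HashMap::new();
    globals.insert("x".to_string(), Value::Int(1));
    let mut locals = HashMap::new();
    locals.insert("y".to_string(), Value::Int(2));
    let scopes = vec![globals, locals];

    // Hash-table path: walks the scope chain.
    assert_eq!(lookup_by_name(&scopes, "y"), Some(Value::Int(2)));
    assert_eq!(lookup_by_name(&scopes, "x"), Some(Value::Int(1)));

    // Index path: "x" and "y" were assigned slots 0 and 1 statically.
    let slots = vec![Value::Int(1), Value::Int(2)];
    assert_eq!(lookup_by_index(&slots, 0), Value::Int(1));
}
```

The restriction on `exec` follows directly from this sketch: an `exec(input())` could introduce a name with no pre-assigned slot, which the index-based path cannot represent.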
If you know: how does CPython resolve scope if it maps variable names to indices? In the case of `exec(input())`, where the input string is say `x = 1`, how would it compile bytecode to allocate space for `x` and index into the value? I don't have much experience with the CPython source, so please excuse me if the question seems naive :)!
> 2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?
Elements of the `Value` enum (which encapsulates all types) relied on `Rc` and `RefCell` to defer borrow checking to run time. Consider a function that has a local variable that instantiates some object. Once that function call has finished, Cannoli will pop that local scope table, and all of its mappings will be dropped when it goes out of scope. The object, encapsulated in an `Rc`, will have its reference count decremented to 0 and be freed.
This is how I've interpreted the Rust borrow checker; I will say that this was the first time I had ever used Rust, so it's possible that I'm not completely right on this. But once that table goes out of scope, all of its elements should be dropped (strictly speaking, by Rust's compiler-inserted drop code rather than the borrow checker itself), and any `Rc` should be decremented and freed when its count reaches zero.
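A minimal Rust sketch of the drop behaviour described above (the names are illustrative, not Cannoli's generated code): a hash map stands in for the popped local scope table, and when it goes out of scope its `Rc` clone is released, so the strong count falls back to 1.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

fn main() {
    // The "object" a local variable would instantiate.
    let obj = Rc::new(RefCell::new(42));
    assert_eq!(Rc::strong_count(&obj), 1);

    {
        // Simulate pushing a local scope table that binds the object.
        let mut local_scope: HashMap<String, Rc<RefCell<i32>>> = HashMap::new();
        local_scope.insert("x".to_string(), Rc::clone(&obj));
        assert_eq!(Rc::strong_count(&obj), 2);
    } // scope table dropped here; its Rc clone is released

    // Only the original handle remains; dropping it would free the value.
    assert_eq!(Rc::strong_count(&obj), 1);
}
```

`Rc::strong_count` makes the decrement observable; in generated code the same mechanism would free the object as soon as the last scope table holding it is popped.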
> 3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.
I did defer a lot of borrow checking to run time with `Rc`, but I tried to use it as sparingly as possible to maximize the optimizations that static borrow checking might enable.
> 4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?
If you remove the 3 outlier benchmarks (which are slow because of Rust printing and a suboptimal implementation of slices), Cannoli isn't too far off from CPython. In fact, on the ray casting benchmark, Cannoli began to outperform CPython at scale. This leads me to believe that raw computation in Cannoli is faster than in CPython. However, there is still a lot of work to do to create a more performant version of Cannoli. The compiler itself was only developed for ~4 months; I have no doubt that more development time would yield better results.
That being said, I think the biggest slowdown comes from features of Rust that might not have been utilized. This is just speculation, but I think the use of lifetimes could benefit the compiled code a lot. I also think there may be more elegant solutions to some of the translations (e.g. slices) that could provide a speedup. But I can't say that there is one single thing causing the slowdown, and profiling the benchmarks (excluding the outliers) supports that.