Optimizations in Syntax Highlighting (opens in new tab)

(code.visualstudio.com)

257 pointsriejo9y ago54 comments

54 comments

38 comments · 9 top-level

mrgalaxy9y ago· 10 in thread

> there is no feasible way to interpret TextMate grammars in the browser even today

That doesn't sound right... but then again I don't know enough about TextMate grammars to argue.

alexdima9y ago

- all the regular expressions in TM grammars are based on oniguruma, a regular expression library written in C.

- the only way to interpret the grammars and get anywhere near original fidelity is to use the exact same regular expression library (with its custom syntax constructs) in VSCode, our runtime is node.js and we can use a node native module that exposes the library to JavaScript

- in the Monaco Editor, we are constrained to a browser environment where we cannot do anything similar

- we have experimented with Emscripten to compile the C library to asm.js, but performance was very poor even in Firefox (10x slower) and extremely poor in Chrome (100x slower).

- we can revisit this once WebAssembly gets traction in the major browsers, but we will still need to consider the browser matrix we support. i.e. if we support IE11 and only Edge will add WebAssembly support, what will the experience be in IE11, etc.

alimbada9y ago

Sorry if I'm missing something, but why do you care about supporting other browsers? Isn't VSCode built on Electron which is a self contained server/browser environment?

2 more replies

Groxx9y ago

Is there a reason the oniguruma syntax can't be translated into other regexes? It'd require a fairly feature-rich engine since oniguruma supports practically everything, but is Chrome missing something fundamental that can't be mimicked? E.g. [::alpha::] could be converted to [a-zA-Z] or the UTF equivalent (though I doubt this particular one is a problem).

1 more reply

gcp9y ago

Maybe slightly offtopic, but is there any chance we could see a VSCode based on Edge/Chakra? I prefer your font rendering over your competitions', and that's a pretty big issue for an editor.

trishume9y ago

The reason is that they basically rely on using the Oniguruma regret engine, and reimplementing that in JS would be hella slow.

algorithmsRcool9y ago

> using the Oniguruma regret engine

I'm surprised I've never seen that typo for regex before. It's wonderful.

2 more replies

jbmorgado9y ago

What I don't understand is why do we still use TextMate grammars.

Can anyone explain why this is the case? It's not only in VSCode, I remember seeing something about TextMate grammars also in other editors.

trishume9y ago

Copy-pasted from a comment I made elsewhere:

It's harder than it sounds if you want to support many languages. The Sublime syntaxes repo I use has 34,000 lines of grammars whereas my engine is only 3000 lines of code. If you count all the tmLanguage files for nice languages available online it's probably hundreds of thousands of lines, and that's in a pretty dense format. The whole point of using tmLanguage files is that people don't care about how fast other languages are if there is no highlighting for their language. I could get way better performance by rewriting all those grammars using compiled parsers in Rust (like Xi has as an option https://github.com/google/xi-editor) but it would take an absurd amount of effort.

1 more reply

mikewhy9y ago

Yeah I'm not totally sure what was meant by that. They're plain text, thus parsable.

Maybe they meant that browsers don't usually have access to the file system, but that's changing and also not applicable since they're using Electron and have NodeJS at their disposal.

lanna9y ago

plain text doesn't imply parsable; natural languages, for instance.

trishume9y ago· 9 in thread

Shameless plug: my implementation of Sublime's syntax highlighting engine in Rust has similar optimizations and more. I'm not at my computer to benchmark on the same files but it should be >2x as fast as their "after" numbers just based on lines/second for JS-like files.

This evening I'm even trying to port it to a pure Rust regex engine that should eliminate non-Rust code and make it substantially faster.

It also implements the sublime-syntax format which is a superset of tmlanguage that allows even nicer highlighting.

https://github.com/trishume/syntect

trishume9y ago

So I ran the sqlite3.c benchmark on my machine (a comparable "somewhat powerful" machine) and it took 6.7s with my engine vs 10.9s with the new VSCode one. Both are doing tokenization+theme matching but the machines and exact grammars are not necessarily the same. I'm using ST3's C grammar but they are using a different one. It could be that I ran it on a faster computer with an easier grammar, or it could be the opposite. It's close enough I'm not willing to claim my engine is substantially faster.

For context Sublime Text 3 takes 2 seconds with the same grammar and same file on the same computer due to using a better custom regex engine written specifically for highlighting.

Given what alexdima mentioned in a different comment about spending most of the time in the regex engine, I'm not sure that my engine would be substantially faster under exactly identical conditions since I'm also bottlenecked by Oniguruma.

However, maybe after I port my engine to https://github.com/google/fancy-regex I'll be substantially faster. And if I do it is likely they could also benefit from a fancy-regex port.

nhaehnle9y ago

Thanks for this. Clearly the original post describes good work, but I can't help feeling the JS community is slacking off when it comes to performance.

Just eyeballing the cited numbers, they take 3939ms to handle a 1.18MB input on "a somewhat powerful desktop machine". Assuming that that means a chip running at 2GHz, we're talking about over 6300 cycles per byte!

That's quite frankly ridiculous. An improvement by at least one order of magnitude should be possible. Where's the ambition?

(Yes, there's always a trade-off with these things. But I feel someone has to point this out when the OP is explicitly about getting kudos for performance work.)

alexdima9y ago

I hate slowness and inefficiency too, that's why I try to make the editor as fast as possible :), but at least in this case, it is not the dynamic nature of JS to blame, but rather the nature of TM grammars. TM grammars consist of rules that have regular expressions, which need to be constantly evaluated; and in order to implement a correct TM grammar interpreter, you must evaluate them.

I've looked in the past for optimization opportunities in the C land (mostly through better caching), which yielded quite nice results [1][2]. I would love if you'd want to take a look too.

At this point, in tokenization, 90% of the time is spent in C, matching regular expressions in oniguruma. More precisely, regular expressions are executed 3,933,859 times to tokenize checker.ts -- the 1.18MB file. That is with some very good caching in node-oniguruma and it just speaks to the inefficiency of the TM grammars regex based design, more than anything else.

It is definitely possible to write faster tokenizers, especially when writing them by hand (even in JS), see for example the Monaco Editor[3] where we use the TypeScript compiler's lexer as a tokenizer.

At least in this case, inefficiencies are not caused by our runtime.

[1] https://github.com/atom/node-oniguruma/pull/40

[2] https://github.com/atom/node-oniguruma/pull/46

[3] https://microsoft.github.io/monaco-editor/

1 more reply

trishume9y ago

I could get way better performance by rewriting all those grammars using compiled parsers in Rust (like Xi has as an option https://github.com/google/xi-editor/blob/master/rust/lang/Ca...) but it would take an absurd amount of effort.

The speed of highlighting tmLanguage files is limited mostly by the regex library. My code spends 50% of its time in Oniguruma. You can improve that a bit by using a fancy regex engine, which is how Sublime is faster than my engine, but this evening I'm going to try to port my library to a faster engine based on Rust's regex crate. But that library (https://github.com/google/fancy-regex) is brand new and almost untested. It's the only open source library that is faster than Oniguruma while supporting all the right features. In the future VSCode may be able to gain some speed by switching to it, but they have much more overhead over the regex library to eat away first than I do.

cel1ne9y ago

Reminds me of an "efficient programming" exercise back at university.

We had to implement getuid [0] as an executable in C, which just looks up /etc/passwd by name and returns a number.

By using specialised trie-like data structures and, in the end, also custom memory alignment instead of malloc, we squeezed the whole algorithm runtime down to an average of 45 cycles or so.

[0] http://man7.org/linux/man-pages/man2/getuid.2.html

shanemhansen9y ago

The dynamic languages people are always saying programmer time is more important than computer time. Yet they put out an editor that is wasteful with both.

octref9y ago

Sorry but I'm not familiar with Rust -- is it possible to run your lib in all platforms in a Node environment?

And, do you have an issue for the regex port? Would love to see the benchmark.

trishume9y ago

Yes it would be possible to write a C wrapper for my Rust library and link that from Node, however I expect you might lose a lot of performance to data conversion.

I have yet to do the regex engine port, I'm doing that later this evening. I'm porting it from Oniguruma to https://github.com/google/fancy-regex which accelerates common types of regexes using the awesome Rust regex crate which is super fast.

1 more reply

HeyImAlex9y ago

That moment when someone else already did your hobby project :congrats through tears:

iveqy9y ago· 5 in thread

So VSCode is great in many ways, and the article might be interesting. But I would never call it fast. It's still really really slow. Just see this comparision: https://www.youtube.com/watch?v=nDRBxtEUOFE

trishume9y ago

That's not a perfectly fair comparison. Vim's syntaxes are often super simple and do a much less nice job at highlighting than most tmLanguage syntaxes.

Also all that video tests for is the presence of an optimization where it updates the on-screen colours as soon as that part of the file is done instead of after the entire file is done. It tells nothing about the underlying speed of the highlighting engines. Perhaps an important optimization, but not much information here.

alexdima9y ago

When we started the project, we did write tokenizers by hand. I mention that in the blog post. You can write some very fast tokenizers by hand, even in JavaScript. Of course they won't be as fast as hand written tokenizers in C, but you'd be surprised how well the code of a hand written tokenizer in JavaScript can be optimized by a JS engine, at least I was :). IR Hydra 2 is a great tool to visualise v8's IR representation of JS code [1]. It is a shame it is not built into the Chrome Dev Tools.

In the end, we simply could not write tokenizers for all languages by hand. And our users wanted to take their themes with them when switching to VS Code. That's why we added support for TM grammars and TM themes, and in hindsight I still consider it to be a very smart decision.

[1] http://mrale.ph/irhydra/2/

atombender9y ago

What about a parser generator that takes something like a BNF-type language and generates optimal JS/TS code on the fly, similar to Lex/Yacc? (The BNF would be portable, the generated code would be a cache.)

I can see that reusing .tmLanguage files saves a lot of work, but that format is atrocious -- hard to both read and write. (I once wrote a parser/highlighter for it in ObjC, it was not a lot of fun.)

zitterbewegung9y ago

The article only makes the claim that they have made it faster that the previous technique. Most people say VSCode is faster than other electron editors like Atom. I am not sure who has said that VSCode is faster than Vim and it would be safe to assume that Vim would be faster.

nerdponx9y ago

Vim syntax highlighting is painfully slow on long lines, which makes it unusable as a Markdown or Tex editor unless you enable automatic explicit line breaking at 80 or 150 characters, instead of letting paragraphs wrap naturally.

joshschreuder9y ago· 2 in thread

Those who switched from ST to VS Code, did you stick with VS Code? Do you have any advice for making the transition easier - key bindings, packages etc.?

bicubic9y ago

I stuck with VS Code. To be honest, I don't think it's even in the same league as ST or Atom due to the amount of integrated tooling. The integrated debugger, git, and task runner management is a god send. VS Code is also the only editor I've used where javascript type lookups and auto completion/code doc parsing Just Works out of the box.

laurencerowe9y ago

I've switched over in the past couple of months. The motivating factor for me was my switch to writing more JS (the debugger works really well.) But I'm increasingly finding myself using it for everything I used to use Sublime for. It's come a long way in the past year or so since I last gave it a try.

Animats9y ago· 2 in thread

(Syntax highlighting) is the one feature that turns a text editor into a code editor.

Syntax highlighting is eye candy. Automatic indentation is what turns a text editor into a code editor.

farnsworth9y ago

Automatic indentation saves a few keystrokes. A languages service (go to definition, etc) is what turns a text editor into a code editor!

whatever_dude9y ago

A language service is for people with poor memory. A terminal window is what turns a text editor into a code editor.

1 more reply

eduren9y ago· 1 in thread

So with the comparisons at the end of the article, does this mean that there were a lot of edge cases where the theming wasn't being correctly applied prior to 1.9? Were there themes that incorporated less stylistic choices because of the limitations?

alexdima9y ago

Yes, those comparisons at the end show differences in rendering caused by the "approximations" used prior to VS Code 1.9. They were all caused by the difference between the ranking rules of CSS selectors and the ranking rules of TM scope selectors

nojvek9y ago

I once dabbled inside vscode tokenizer code. There is a lot going but putting tokens in a buffer took me by a surprise. It was a very smart implementation.

I think it's their focus on performance and good architecture that makes vscode stand out.

Sure sublime also has great engineering behind it, but being able to contribute and look under the hood of tools, we developers use is very exciting. It feels like a very democratic process.

I remember filing the mini-map bug in Monaco, I'm so glad to see they are working on implementing it in a performant way even though it will require large rewrites of their editor rendering code.

ex3ndr9y ago

Does VS Code support highlighting for "non-regexp" cases. For example: in code i can reference a Class by it's name, but in the same time if could be a function - how you can distinct one from another by just regexp when you don't use capitalization marker for class names? Some times Class is an Object in some languages (Scala, Kotlin...), how this case is handled?

pjmlp9y ago

Great article, specially regarding the type of optimisations that were applied.

j / k navigate · click thread line to collapse

54 comments

38 comments · 9 top-level

mrgalaxy9y ago· 10 in thread

> there is no feasible way to interpret TextMate grammars in the browser even today

That doesn't sound right... but then again I don't know enough about TextMate grammars to argue.

alexdima9y ago

- all the regular expressions in TM grammars are based on oniguruma, a regular expression library written in C.

- in the Monaco Editor, we are constrained to a browser environment where we cannot do anything similar

- we have experimented with Emscripten to compile the C library to asm.js, but performance was very poor even in Firefox (10x slower) and extremely poor in Chrome (100x slower).

alimbada9y ago

Sorry if I'm missing something, but why do you care about supporting other browsers? Isn't VSCode built on Electron which is a self contained server/browser environment?

2 more replies

Groxx9y ago

1 more reply

gcp9y ago

Maybe slightly offtopic, but is there any chance we could see a VSCode based on Edge/Chakra? I prefer your font rendering over your competitions', and that's a pretty big issue for an editor.

trishume9y ago

The reason is that they basically rely on using the Oniguruma regret engine, and reimplementing that in JS would be hella slow.

algorithmsRcool9y ago

> using the Oniguruma regret engine

I'm surprised I've never seen that typo for regex before. It's wonderful.

2 more replies

jbmorgado9y ago

What I don't understand is why do we still use TextMate grammars.

Can anyone explain why this is the case? It's not only in VSCode, I remember seeing something about TextMate grammars also in other editors.

trishume9y ago

Copy-pasted from a comment I made elsewhere:

1 more reply

mikewhy9y ago

Yeah I'm not totally sure what was meant by that. They're plain text, thus parsable.

Maybe they meant that browsers don't usually have access to the file system, but that's changing and also not applicable since they're using Electron and have NodeJS at their disposal.

lanna9y ago

plain text doesn't imply parsable; natural languages, for instance.

trishume9y ago· 9 in thread

This evening I'm even trying to port it to a pure Rust regex engine that should eliminate non-Rust code and make it substantially faster.

It also implements the sublime-syntax format which is a superset of tmlanguage that allows even nicer highlighting.

https://github.com/trishume/syntect

trishume9y ago

For context Sublime Text 3 takes 2 seconds with the same grammar and same file on the same computer due to using a better custom regex engine written specifically for highlighting.

However, maybe after I port my engine to https://github.com/google/fancy-regex I'll be substantially faster. And if I do it is likely they could also benefit from a fancy-regex port.

nhaehnle9y ago

Thanks for this. Clearly the original post describes good work, but I can't help feeling the JS community is slacking off when it comes to performance.

That's quite frankly ridiculous. An improvement by at least one order of magnitude should be possible. Where's the ambition?

(Yes, there's always a trade-off with these things. But I feel someone has to point this out when the OP is explicitly about getting kudos for performance work.)

alexdima9y ago

I've looked in the past for optimization opportunities in the C land (mostly through better caching), which yielded quite nice results [1][2]. I would love if you'd want to take a look too.

At least in this case, inefficiencies are not caused by our runtime.

[1] https://github.com/atom/node-oniguruma/pull/40

[2] https://github.com/atom/node-oniguruma/pull/46

[3] https://microsoft.github.io/monaco-editor/

1 more reply

trishume9y ago

cel1ne9y ago

Reminds me of an "efficient programming" exercise back at university.

We had to implement getuid [0] as an executable in C, which just looks up /etc/passwd by name and returns a number.

By using specialised trie-like data structures and, in the end, also custom memory alignment instead of malloc, we squeezed the whole algorithm runtime down to an average of 45 cycles or so.

[0] http://man7.org/linux/man-pages/man2/getuid.2.html

shanemhansen9y ago

The dynamic languages people are always saying programmer time is more important than computer time. Yet they put out an editor that is wasteful with both.

octref9y ago

Sorry but I'm not familiar with Rust -- is it possible to run your lib in all platforms in a Node environment?

And, do you have an issue for the regex port? Would love to see the benchmark.

trishume9y ago

Yes it would be possible to write a C wrapper for my Rust library and link that from Node, however I expect you might lose a lot of performance to data conversion.

1 more reply

HeyImAlex9y ago

That moment when someone else already did your hobby project :congrats through tears:

iveqy9y ago· 5 in thread

trishume9y ago

That's not a perfectly fair comparison. Vim's syntaxes are often super simple and do a much less nice job at highlighting than most tmLanguage syntaxes.

alexdima9y ago

[1] http://mrale.ph/irhydra/2/

atombender9y ago

I can see that reusing .tmLanguage files saves a lot of work, but that format is atrocious -- hard to both read and write. (I once wrote a parser/highlighter for it in ObjC, it was not a lot of fun.)

zitterbewegung9y ago

nerdponx9y ago

joshschreuder9y ago· 2 in thread

Those who switched from ST to VS Code, did you stick with VS Code? Do you have any advice for making the transition easier - key bindings, packages etc.?

bicubic9y ago

laurencerowe9y ago

Animats9y ago· 2 in thread

(Syntax highlighting) is the one feature that turns a text editor into a code editor.

Syntax highlighting is eye candy. Automatic indentation is what turns a text editor into a code editor.

farnsworth9y ago

Automatic indentation saves a few keystrokes. A languages service (go to definition, etc) is what turns a text editor into a code editor!

whatever_dude9y ago

A language service is for people with poor memory. A terminal window is what turns a text editor into a code editor.

1 more reply

eduren9y ago· 1 in thread

alexdima9y ago

nojvek9y ago

I once dabbled inside vscode tokenizer code. There is a lot going but putting tokens in a buffer took me by a surprise. It was a very smart implementation.

I think it's their focus on performance and good architecture that makes vscode stand out.

Sure sublime also has great engineering behind it, but being able to contribute and look under the hood of tools, we developers use is very exciting. It feels like a very democratic process.

I remember filing the mini-map bug in Monaco, I'm so glad to see they are working on implementing it in a performant way even though it will require large rewrites of their editor rendering code.

ex3ndr9y ago

pjmlp9y ago

Great article, specially regarding the type of optimisations that were applied.

j / k navigate · click thread line to collapse