To run a JavaScript program, the source text needs to be processed so V8 can understand it. V8 starts out by parsing the source into an abstract syntax tree (AST), a set of objects that represent the program structure. That AST gets compiled to bytecode by Ignition. The performance of these parse + compile phases is important: V8 cannot run code before compilation is done. In this series of blog posts, we focus on parsing, and the work done in V8 to ship a blazingly fast parser.
In fact, we start the series one stage before the parser. V8’s parser consumes ‘tokens’ provided by the ‘scanner’. Tokens are blocks of one or more characters that have a single semantic meaning: a string, an identifier, an operator like ++. The scanner constructs these tokens by combining consecutive characters in an underlying character stream.
The scanner consumes a stream of Unicode characters. These Unicode characters are always decoded from a stream of UTF-16 code units. Only a single encoding is supported to avoid branching or specializing the scanner and parser for various encodings, and we chose UTF-16 since that’s the encoding of JavaScript strings, and source positions need to be provided relative to that encoding. The scanner interfaces with the character stream through the Scanner::Advance() method, which decodes UTF-16 code units into full Unicode characters. The currently decoded character is buffered and picked up by scan methods such as Scanner::ScanString(). Whitespace (spaces, tabs, newlines, comments) is scanned just like any other token and returned as Token::WHITESPACE, and the scan loop simply keeps scanning until we find a token other than Token::WHITESPACE. This means that if the next token is not preceded by whitespace, we immediately start scanning the relevant token without needing to explicitly check for whitespace.
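As a rough illustration, a self-contained version of such a scan loop could look as follows. This is a minimal sketch, not V8’s actual code; the token set and the ScanSingleToken helper are simplified stand-ins.

```cpp
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

// Simplified token set; V8's real Token enum is much larger.
enum class Token { WHITESPACE, IDENTIFIER, NUMBER, UNKNOWN, END_OF_INPUT };

// Toy single-token scanner: scans exactly one token starting at `pos`.
// A run of whitespace is itself returned as Token::WHITESPACE.
Token ScanSingleToken(const std::string& source, size_t& pos) {
  if (pos >= source.size()) return Token::END_OF_INPUT;
  unsigned char c = source[pos];
  if (std::isspace(c)) {
    while (pos < source.size() && std::isspace((unsigned char)source[pos])) pos++;
    return Token::WHITESPACE;
  }
  if (std::isalpha(c) || c == '_' || c == '$') {
    while (pos < source.size() && (std::isalnum((unsigned char)source[pos]) ||
                                   source[pos] == '_' || source[pos] == '$')) {
      pos++;
    }
    return Token::IDENTIFIER;
  }
  if (std::isdigit(c)) {
    while (pos < source.size() && std::isdigit((unsigned char)source[pos])) pos++;
    return Token::NUMBER;
  }
  pos++;  // any other single character
  return Token::UNKNOWN;
}

// The scan loop: keep scanning until we find a token other than
// Token::WHITESPACE. If the next token is not preceded by whitespace, the
// very first iteration already produces it, so there is no separate
// "skip whitespace" step on the common path.
Token Scan(const std::string& source, size_t& pos) {
  Token token;
  do {
    token = ScanSingleToken(source, pos);
  } while (token == Token::WHITESPACE);
  return token;
}

int main() {
  std::string source = "  answer = 42";
  size_t pos = 0;
  for (Token t = Scan(source, pos); t != Token::END_OF_INPUT; t = Scan(source, pos)) {
    std::cout << static_cast<int>(t) << ' ';  // token kinds for: answer, =, 42
  }
  std::cout << '\n';
}
```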
The loop itself, however, adds overhead to each scanned token: it requires a branch to verify the token that we’ve just scanned. It would be better to continue the loop only if the token we have just scanned could be a Token::WHITESPACE; otherwise, we should just break out of the loop. We do this by moving the loop itself into a separate helper method from which we return immediately when we’re certain the token is not Token::WHITESPACE.

Identifiers

Identifiers are the most common token. Roughly, an identifier consists of a character that may start an identifier (one with the Unicode property ID_Start, plus $ and _) followed by characters that may continue one (those with the property ID_Continue, among a few others). Looking up whether a Unicode character has the property ID_Start or ID_Continue is quite expensive. By inserting a cache mapping from characters to their properties we can speed this up a bit.
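A minimal sketch of such a property cache, assuming a hypothetical SlowIsIdContinue() as a stand-in for the real Unicode table lookup (V8’s actual implementation differs):

```cpp
#include <cwctype>
#include <iostream>

// Stand-in for the expensive Unicode property lookup; a real implementation
// consults the Unicode ID_Start/ID_Continue tables (e.g., via ICU).
// std::iswalnum is used here only so the sketch compiles and runs.
bool SlowIsIdContinue(char32_t c) {
  return std::iswalnum(static_cast<std::wint_t>(c)) != 0;
}

// A small direct-mapped cache in front of the slow lookup: repeated queries
// for recently seen code points skip the expensive computation.
bool CachedIsIdContinue(char32_t c) {
  struct Entry { char32_t key; bool value; };
  static Entry cache[256] = {};
  Entry& entry = cache[c & 0xFF];
  if (entry.key != c) {
    entry.key = c;
    entry.value = SlowIsIdContinue(c);
  }
  return entry.value;
}

int main() {
  std::cout << CachedIsIdContinue(U'a') << ' '    // 1: computed and cached
            << CachedIsIdContinue(U'a') << ' '    // 1: served from the cache
            << CachedIsIdContinue(U'+') << '\n';  // 0: not an identifier character
}
```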
Most JavaScript source code, though, is written using ASCII characters. Of the ASCII-range characters, only a-z, A-Z, $ and _ are identifier start characters; ID_Continue additionally includes 0-9. We speed up identifier scanning by building a table with flags for each of the 128 ASCII characters indicating whether the character is an ID_Start character, an ID_Continue character, etc. While the characters we’re looking at are within the ASCII range, we look up the respective flags in this table and verify the property with a single branch. Characters are considered part of the identifier until we see the first character that does not have the ID_Continue property.
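A sketch of what such a flag table and the resulting scanning loop could look like; the flag names and helpers below are illustrative, not V8’s actual code:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// One flag byte per ASCII character (flag names are illustrative).
const uint8_t kIsIdentifierStart = 1 << 0;  // a-z, A-Z, '$', '_'
const uint8_t kIsIdentifierPart  = 1 << 1;  // identifier start characters plus 0-9

std::array<uint8_t, 128> MakeAsciiFlags() {
  std::array<uint8_t, 128> flags{};
  for (int c = 0; c < 128; c++) {
    bool start = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                 c == '$' || c == '_';
    bool part = start || (c >= '0' && c <= '9');
    flags[c] = (start ? kIsIdentifierStart : 0) | (part ? kIsIdentifierPart : 0);
  }
  return flags;
}

const std::array<uint8_t, 128> kAsciiFlags = MakeAsciiFlags();

// Scan the rest of an identifier: while we stay within the ASCII range, a
// single table load and one branch per character decides whether to continue.
// (The caller has already checked that the first character can start an
// identifier; the loop stops at the first non-ID_Continue character.)
size_t ScanIdentifier(const std::string& source, size_t pos) {
  while (pos < source.size()) {
    unsigned char c = source[pos];
    if (c >= 128 || !(kAsciiFlags[c] & kIsIdentifierPart)) break;
    pos++;
  }
  return pos;  // one past the last identifier character
}

int main() {
  std::string source = "minified$id123 + x";
  std::cout << source.substr(0, ScanIdentifier(source, 0)) << '\n';  // minified$id123
}
```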
All the improvements mentioned in this post add up to the following difference in identifier scanning performance:
It may seem counterintuitive that longer identifiers scan faster. That might make you think that it’s beneficial for performance to increase the identifier length. Scanning longer identifiers is simply faster in terms of MB/s because we stay longer in a very tight loop without returning to the parser. What you care about from the point-of-view of the performance of your application, however, is how fast we can scan full tokens. The following graph roughly shows the number of tokens we scan per second relative to the token length:
Here it becomes clear that using shorter identifiers is beneficial for the parse performance of your application: we’re able to scan more tokens per second. This means that sites that seem to parse faster in terms of MB/s simply have lower information density and actually produce fewer tokens per second.
Internalizing minified identifiers
All string literals and identifiers are deduplicated on the boundary between the scanner and the parser: if the parser requests the value of a string or identifier, it receives the same unique string object for every occurrence of a given literal value. This typically requires a hash table lookup. Since minified JavaScript code mostly uses short, often single-character identifiers, V8 avoids the hash table lookup for those by using a simple lookup table over single ASCII character strings.
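A sketch of that idea using plain standard containers instead of V8’s actual string table; the class and method names are made up for illustration:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <unordered_set>

// Illustrative internalization: every distinct literal value maps to exactly
// one shared string object, so later comparisons are simple pointer compares.
class StringTable {
 public:
  const std::string* Internalize(const std::string& value) {
    // Fast path: minified identifiers are very often a single ASCII
    // character, so a 128-entry array sidesteps the hash table lookup.
    if (value.size() == 1 && static_cast<unsigned char>(value[0]) < 128) {
      std::unique_ptr<std::string>& slot =
          single_char_[static_cast<unsigned char>(value[0])];
      if (!slot) slot = std::make_unique<std::string>(value);
      return slot.get();
    }
    // Slow path: deduplicate through a hash table.
    return &*table_.insert(value).first;
  }

 private:
  std::unique_ptr<std::string> single_char_[128];
  std::unordered_set<std::string> table_;
};

int main() {
  StringTable table;
  const std::string* a = table.Internalize("x");
  const std::string* b = table.Internalize("x");
  std::cout << (a == b) << '\n';  // 1: both requests yield the same object
}
```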
Keywords
Keywords are a special subset of identifiers defined by the language, e.g., if, else, and function. V8’s scanner returns different tokens for keywords than for identifiers. After scanning an identifier we need to recognize whether the identifier is a keyword. Since all keywords in JavaScript only contain lowercase characters a-z, we also keep flags indicating whether ASCII characters are possible keyword start and continue characters.
If an identifier can be a keyword according to the flags, we could find a subset of keyword candidates by switching over the first character of the identifier. There are more distinct first characters than lengths of keywords, so it reduces the number of subsequent branches. For each character, we branch based on the possible keyword lengths and only compare the identifier with the keyword if the length matches as well.
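A hand-written sketch of this dispatch, covering only a handful of keywords for brevity:

```cpp
#include <cstring>
#include <iostream>

// Switch on the first character, then branch on the length, and only then
// compare the bytes against the matching candidate keywords.
bool IsKeyword(const char* id, size_t length) {
  switch (id[0]) {
    case 'f':  // candidates: "for", "false", "finally", "function", ...
      if (length == 3) return std::memcmp(id, "for", 3) == 0;
      if (length == 5) return std::memcmp(id, "false", 5) == 0;
      if (length == 7) return std::memcmp(id, "finally", 7) == 0;
      if (length == 8) return std::memcmp(id, "function", 8) == 0;
      return false;
    case 'i':  // candidates: "if", "in", "import", "instanceof", ...
      if (length == 2) return std::memcmp(id, "if", 2) == 0 ||
                              std::memcmp(id, "in", 2) == 0;
      if (length == 6) return std::memcmp(id, "import", 6) == 0;
      if (length == 10) return std::memcmp(id, "instanceof", 10) == 0;
      return false;
    case 'n':  // candidates: "new", "null"
      if (length == 3) return std::memcmp(id, "new", 3) == 0;
      if (length == 4) return std::memcmp(id, "null", 4) == 0;
      return false;
    default:  // remaining first characters omitted in this sketch
      return false;
  }
}

int main() {
  std::cout << IsKeyword("function", 8) << ' ' << IsKeyword("fn", 2) << '\n';  // 1 0
}
```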
Better than switching over characters is to use a technique called perfect hashing. Since the set of keywords is static, we can compute a perfect hash function that maps each identifier to at most one candidate keyword; V8 uses gperf to compute this function from the keyword list. The scanner then only has to check the candidate’s length and compare the identifier against that single candidate to know whether it is a keyword.
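To illustrate the shape of such a lookup, here is a toy example over just five keywords (if, else, for, while, return). The hash used here (first character XOR length, modulo 8) happens to be collision-free for this particular subset and is of course not the function V8 actually uses:

```cpp
#include <cstring>
#include <iostream>

// Toy "perfect hash" table: (first character ^ length) & 7 maps each of the
// five keywords below to a distinct slot, so every slot holds at most one
// candidate. Unused slots stay empty.
const char* const kCandidates[8] = {
    nullptr, "else", "while", "if", "return", "for", nullptr, nullptr};

bool IsKeyword(const char* id, size_t length) {
  size_t slot = (static_cast<unsigned char>(id[0]) ^ length) & 7;
  const char* candidate = kCandidates[slot];
  // At most one candidate per slot: a single length check plus one memcmp.
  return candidate != nullptr && std::strlen(candidate) == length &&
         std::memcmp(id, candidate, length) == 0;
}

int main() {
  std::cout << IsKeyword("while", 5) << ' ' << IsKeyword("whale", 5) << '\n';  // 1 0
}
```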