Faster Chrome - Let The Compiler do the work
int foo();

int fiver(int num) {
  for (int j = 0; j < 5; j++)
    num = num + foo();
  return num;
}
But we can do more (ThinLTO)
That's a good start, but we can do better. Let's look at inlining - the compiler takes the code of a called function and inserts all of it at the call site.

inline int foo() { return 3; }

int fiver_inline(int num) {
  for (int j = 0; j < 5; j++)
    num = num + foo();
  return num;
}
When the compiler inlines foo(), it turns into
int fiver_inline(int num) {
  for (int j = 0; j < 5; j++)
    num = num + 3;
  return num;
}
Not bad - this saves us the function call and all the setup that goes with having a function. But the compiler can in fact do even better - because now all the information is in one place. The compiler can apply that knowledge and deduce that fiver_inline() adds the number three, and does so 5 times - and so the entire function boils down to

int fiver_inline(int num) {
  return num + 15;
}
Now we're back to compiling individual source files. Distributed/cached compilation works again, small changes don't cause a full rebuild, and since ThinLTO only inlines a select few functions across files, it adds relatively little overhead.
Of course, the question of "which functions should ThinLTO inline?" still needs to be answered. And the answer is still "the ones that are small and called a lot". Hey, we know those already - from the profiles we generated for Profile Guided Optimization (PGO). Talk about lucky coincidences!
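As a rough sketch of how the pieces fit together with Clang (the flags and tools are real, but the file names here are made up for illustration), a PGO-informed ThinLTO build looks roughly like this:

```shell
# 1. Build an instrumented binary and collect a profile
clang -fprofile-instr-generate -O2 -o app_instrumented main.c
./app_instrumented                      # run a representative workload
llvm-profdata merge -output=app.profdata default.profraw

# 2. Rebuild with ThinLTO, feeding the profile back in;
#    the profile tells the compiler which cross-file calls are hot
#    and therefore worth inlining
clang -flto=thin -fprofile-instr-use=app.profdata -O2 -o app main.c
```

The same profile data serves double duty: it drives PGO's own optimizations and tells ThinLTO which small, hot functions to inline across file boundaries.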
But wait, there's more! (Callgraph Sorting)
We've done a lot for inlined function calls. Is there anything we can do to speed up functions that haven't been inlined, too? Turns out there is.

One important factor is that the CPU doesn't fetch data byte by byte, but in chunks. So if a chunk of data doesn't just contain the function we need right now, but ideally also the ones we'll need next, the CPU has to go out and fetch chunks of data less often.
In other words, we want functions that are called one right after the other to also live next to each other in memory ("code locality"). And we already know which functions are called close to each other - because we ran our profiling and stored performance profiles for PGO.
We can then use that information to ensure that the right functions are next to each other when we link.
I.e.
g.c

extern int f1();
extern int f2();
extern int f3();

int g() {
  f1();
  for (...) {
    f3();
  }
  f1();
  f2();
}
could be interpreted as "g() calls f3() a lot - so keep that one really close. f1() is called twice, so… somewhat close. And if we can squeeze in f2(), even better". The calling sequence is a "call graph", and so this sorting process is called "call graph sorting".
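One concrete mechanism for this (shown here with lld; the ordering file contents are a hypothetical result for the g() example above) is to compile each function into its own section and hand the linker an explicit symbol order:

```shell
# Give each function its own section so the linker can move them freely
clang -ffunction-sections -O2 -c g.c f.c

# order.txt lists hot call sequences first: g and f3 end up adjacent
cat > order.txt <<'EOF'
g
f3
f1
f2
EOF

# lld places symbols in the order given by the file
clang -fuse-ld=lld -Wl,--symbol-ordering-file=order.txt g.o f.o -o app
```

In practice the ordering isn't written by hand - it's derived from the same PGO profiles mentioned above.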
Just changing the order of functions in memory might not sound like a lot, but it leads to measurable performance improvements.
And since this work requires changes to compilers and linkers, that would have meant changing the build - and testing it - across 5 compilers and 4 linkers. But, thankfully, we've simplified our toolchain (Simplicity - another one of the 4S's!). To be able to do this, we worked with the LLVM community to make Clang a great compiler for all the platforms Chrome supports (with further improvements still being worked on).
Sometimes, the best way to get more speed is not to change the code you wrote, but to change the way you build the software.
Posted by Rachel Blum, Engineering Director, Chrome Desktop