
Ralsina.Me — Roberto Alsina's website

A Tale of Optimization

I reimplemented Pygments in Crystal. It didn't quite go as I expected. I have already written about how I did it, but that left a large part of the story untold. You see, I am using Crystal, which compiles to native code, and yet my reimplementation was slower than Python. That's not supposed to happen.

I decided to figure out why, and fix it. This is the story of how I made my software, which looked "ok", 30x faster. Mind you, this is going to make me sound much smarter and more accurate than I am. I ran into 200 dead ends before finding each improvement.

The First Problem (v0.1.0)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine ../crycco/src/crycco.cr swapoff > x.html --standalone` | 533.8 ± 4.4 | 523.7 | 542.9 | 18.80 ± 0.92 |
| `chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html` | 28.4 ± 1.4 | 25.6 | 32.8 | 1.00 |
| `pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html` | 103.5 ± 2.8 | 95.6 | 109.1 | 3.65 ± 0.20 |

That benchmark (like all the rest) was done using hyperfine, running each command 50 times after a 10-run warmup. Not that it needs so much care; just look at those numbers. Not only is tartrazine almost 20 times slower than chroma, it's also 3.5 times slower than Pygments. And Pygments is written in Python!

Even without comparing, half a second to highlight a 100-line file is ridiculous.

What's going on here? To find out, let's get data. I used callgrind to profile the code, and then kcachegrind to visualize it.

$ valgrind --tool=callgrind bin/tartrazine ../crycco/src/crycco.cr swapoff

[Profiler screenshot: some functions called half a billion times]

As you can see, some functions are called half a billion times and account for 40% of the execution time. What are they?

A string in Crystal is always Unicode. The String::char_bytesize_at function converts an offset into the string from characters to bytes. That's needed because Unicode characters can have different "widths" in bytes. So in the string "123" the "3" is the 3rd byte, but in "áéí" the "í" starts at the 5th byte.
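The same mismatch is easy to see in any Unicode-aware language. Here is a small Ruby sketch of it (Ruby, like Crystal, stores strings as UTF-8 bytes but indexes them by character):

```ruby
s = "áéí"

# Three characters, but each takes two bytes in UTF-8:
raise unless s.length == 3
raise unless s.bytesize == 6

# Character offsets and byte offsets diverge: translating one
# into the other means scanning the string, which is what
# String::char_bytesize_at had to do on every regex call.
prefix = s[0, 2]                   # "áé", the characters before "í"
raise unless prefix.bytesize == 4  # so "í" starts at byte 4, i.e. the 5th byte
```

Each conversion is cheap on its own; it's doing it half a billion times that hurts.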

And why is it doing that? Because this code does a bazillion regex operations, and the underlying library (PCRE2) deals in bytes, not characters. So every time we need to do something like "match this regex against this string starting at the 9th position", we have to convert that offset to bytes. Then, when it finds a match at byte X, we convert that offset back to characters; to extract the data we do it two more times, and so on.

I decided to ... not do that. One nice thing about Crystal is that even though it's compiled, the whole standard library is right there in /usr/lib/crystal, so I could just go and read how the regular expression code was implemented and see what to do.

I ended up writing a version of Regex and Regex.match that worked on bytes, and made my code deal with bytes instead of characters. I only needed to convert to strings once I had actually produced a token, rather than in the thousands of failed attempts to produce one.
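My byte-based Regex is Crystal-specific, but the idea can be sketched in Ruby: hand the regex engine a BINARY (ASCII-8BIT) view of the string, where character positions and byte positions coincide, and both the starting offset you pass in and the match offsets you get back are already in bytes:

```ruby
# "áéí token" is 9 characters but 12 bytes; in the BINARY view
# every index is a byte index, so nothing needs converting.
text = "áéí token".b          # same bytes, BINARY encoding

# Start matching at byte offset 6 (the space); the /n flag
# makes the pattern binary too.
m = /token/n.match(text, 6)
raise unless m.begin(0) == 7  # the match starts at byte 7
```

The token text only becomes a proper UTF-8 string at the very end, once a match has actually succeeded.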

The Second Problem (commit 0626c86)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone` | 187.8 ± 4.4 | 183.2 | 204.1 | 7.48 ± 0.30 |
| `chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html` | 25.1 ± 0.8 | 23.6 | 28.5 | 1.00 |
| `pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html` | 89.9 ± 4.7 | 83.6 | 102.1 | 3.58 ± 0.22 |

While better, this still sucks. I made it 2.5 times faster, but it's still 7.5 times slower than chroma, and 3.6x slower than Python???

[Profiler screenshot: the regex library takes almost all the time]

Back to valgrind. This time, the profile was ... different. The regex library was taking almost all the execution time, which makes sense, since it's doing all the work. But why is it so slow?

[Profiler screenshot: it's calling valid_utf8 ALL THE TIME]

Almost all the time is spent in valid_utf8. That's a function that checks if a string is valid UTF-8. And it's called all the time. Why? Because the regex library is written in C, and it doesn't know that the strings it's working with are already valid UTF-8. So, it checks. And checks. And checks.

Solution? Let's not do that either. The PCRE2 library has a handy flag called NO_UTF_CHECK just for that. So, if you pass that flag when you are doing a regex match, it will not call valid_utf8 at all!

The Third Problem (commit 7db8fdc)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone` | 30.0 ± 2.2 | 25.5 | 36.6 | 1.15 ± 0.10 |
| `chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html` | 26.0 ± 1.0 | 24.1 | 29.2 | 1.00 |
| `pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html` | 96.3 ± 7.7 | 86.1 | 125.3 | 3.70 ± 0.33 |

Yay! My compiled program is finally faster than an interpreted one! And even in the same ballpark as the other compiled one!

I wonder what happens with a larger file!

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c` | 896.6 ± 71.6 | 709.6 | 1015.8 | 7.07 ± 0.58 |
| `chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html` | 126.8 ± 2.1 | 122.8 | 132.9 | 1.00 |
| `pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html` | 229.1 ± 4.5 | 219.5 | 238.9 | 1.81 ± 0.05 |

Clearly something very bad happens when my code processes larger files. I wonder what it is?

[Profiler screenshot: that's rather low level]

At first glance this is not very informative: most of the execution time is spent in libgc and libc functions. Not very helpful. But if you look a bit harder, you'll see the execution time is spent allocating memory.

[Profiler screenshot: hundreds of milliseconds spent allocating memory]

But memory is fast!

Yes, memory is fast. But allocating it is not. And I was allocating a lot of memory. Or rather, I was allocating memory over and over. See that memcpy there?

This took me about a day to figure out, but this piece of code is where everything became slow.

```crystal
matched, new_pos, new_tokens = rule.match(text_bytes, pos, self)
if matched
  # Move position forward, save the tokens,
  # tokenize from the new position
  pos = new_pos
  tokens += new_tokens
end
```

That looks innocent, right? It's not. The tokens array is created at the beginning of tokenization, and every time I find new tokens I just append them to it. BUT ... how does that work? Arrays are not infinite! Where does it put the new tokens?

Well, when an array grows, it allocates a new, larger array, copies all the elements there, and now you have a larger array with some room to spare. When you keep doing that, you are copying the whole array over and over. It's the cost you pay for the convenience of an array that can grow.
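The cost is easy to measure in Ruby (a sketch of the general copy-on-grow behavior, not of Crystal's allocator specifically). `acc += chunk` is `acc = acc + chunk`: it allocates a fresh array and copies everything accumulated so far, every single time, while an in-place append only reallocates occasionally:

```ruby
require "benchmark"

chunk = Array.new(100) { |i| i }

# `+=` copies the whole accumulated array on every iteration:
# O(n^2) copying overall.
slow = Benchmark.realtime do
  acc = []
  2_000.times { acc += chunk }
end

# `concat` appends in place; the array still reallocates as it
# grows, but rarely, so appends are amortized O(1).
fast = Benchmark.realtime do
  acc = []
  2_000.times { acc.concat(chunk) }
end

puts format("+=: %.4fs  concat: %.4fs", slow, fast)
raise unless fast < slow
```

The gap widens with the size of the input, which is exactly why the large-file benchmark fell off a cliff.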

Where was I calling this? In the formatter. The formatter needs to see the tokens to turn them into HTML. Here's the relevant code:

```crystal
lines = lexer.group_tokens_in_lines(lexer.tokenize(text))
```

All it does is group the tokens into lines (which, by the way, does the whole "grow an array" thing again), then it iterates the lines, then the tokens in each line, slaps some HTML around them, and writes them to a string (which, again, grows).

The solution is ... not to do that.

In Crystal we have iterators, so I changed the tokenizer so that rather than returning an array of tokens it's an iterator which generates them one after the other. So the formatter looks more like this:

```crystal
tokenizer.each do |token|
  outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
  if token[:value].ends_with? "\n"
    i += 1
    outp << line_label(i) if line_numbers?
  end
end
```

Rather than iterating an array, it iterates ... an iterator. So there is no tokens array in the middle. And instead of grouping tokens into lines, it spits out a line label after every newline character (yes, this is a bit more complicated under the hood).
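The same shape can be sketched in Ruby with an Enumerator (the tokenizer, its regex, and the token format below are made up for illustration): tokens are yielded one at a time, and the formatter consumes them with no intermediate array:

```ruby
# A toy tokenizer: yields one token at a time instead of
# returning an Array of all of them.
def tokenize(text)
  Enumerator.new do |y|
    text.scan(/\s+|\w+/) do |tok|
      y << { type: tok.strip.empty? ? :whitespace : :word, value: tok }
    end
  end
end

out = +""
line = 1
tokenize("two words\n").each do |token|
  out << "<span class=\"#{token[:type]}\">#{token[:value]}</span>"
  line += 1 if token[:value].end_with?("\n")
end

raise unless out.include?('<span class="word">two</span>')
raise unless line == 2
```

Lazily generating tokens also means a failed or aborted highlight never pays for tokens it didn't consume.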

There were several other optimizations, but this is the one that made the difference.

The Fourth Problem (commit ae03e46)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c` | 73.5 ± 5.9 | 67.7 | 104.8 | 1.00 |
| `chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html` | 123.1 ± 2.7 | 117.5 | 130.3 | 1.68 ± 0.14 |
| `pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html` | 222.0 ± 5.9 | 207.8 | 239.4 | 3.02 ± 0.26 |

Finally, tartrazine is the fastest one. On a large file! By a good margin! But is that all there is? Is there nothing else to improve? Well, no. I can do the same trick again!

You see, the formatter was returning a String by appending to it, and then we were writing that string to a file. That's the same problem as before! So, I changed the formatter to take an IO object and write to it directly.
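In Ruby terms the change looks like this (`format_tokens` is a made-up stand-in for the real formatter): the formatter writes into whatever IO-like object it is given, so writing to a file needs no big intermediate String at all:

```ruby
require "stringio"

# The formatter writes into the IO it is given instead of
# building and returning a String.
def format_tokens(tokens, io)
  tokens.each { |t| io << "<b>#{t}</b>" }
end

# The same code would serve a real file:
#   File.open("x.html", "w") { |f| format_tokens(%w[a b], f) }

# ...or an in-memory buffer when you really want a String:
buf = StringIO.new
format_tokens(%w[a b], buf)
raise unless buf.string == "<b>a</b><b>b</b>"
```

Accepting an IO also keeps the String behavior available for free, via an in-memory buffer.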

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c` | 70.3 ± 1.8 | 65.6 | 73.8 | 1.00 |
| `chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html` | 122.2 ± 3.0 | 116.6 | 131.1 | 1.74 ± 0.06 |
| `pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html` | 219.4 ± 4.4 | 212.2 | 235.5 | 3.12 ± 0.10 |

As you can see, that still gives a small improvement of just 3 milliseconds. But that's 5% of the total time. And it's a small change.

And this is where diminishing returns hit. I could probably make it faster, but even a 10% improvement would be just 7 milliseconds on a huge file. If I were GitHub then maybe this would be worth my time, but I am not, and it's not.

And how does the final version compare with the first one?

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bin/tartrazine-last ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone -l c` | 16.1 ± 1.1 | 13.1 | 21.8 | 1.00 |
| `bin/tartrazine-first ../crycco/src/crycco.cr swapoff` | 519.1 ± 5.4 | 508.1 | 533.7 | 32.29 ± 2.23 |

A speedup factor of 32.3x, for code that had nothing obviously wrong in it. I'd say that's pretty good.


Contents © 2000-2024 Roberto Alsina