I have been working on a bunch of personal code projects in the past few months and as with all things we humans do, trying to do a thing always implies doing a number of things you really don't want to do.
You want a burger: you need to kill a cow, and light a fire, and whatnot.
You want to write software: you need to release it every once in a while.
Mind you, it's possible to never release, and be happy that way, but if you want someone else to use it, then releases are needed.
The bad news: releases are boring and can be laborious. Let's use as an example a tool I wrote called Hacé.
It's a command line tool, so if I want to do a release it involves:

* Bumping the version in the project metadata
* Updating the changelog
* Running the linter and the tests
* Committing, tagging the release and pushing it to GitHub
* Building static binaries
* Creating a GitHub release with the binaries and the release notes
And if you do any of those things wrong, you get to google (again) how to remove a remote tag using git and whatever.
Here is a shell script that will do all of it, then I will explain each step.
#!/bin/bash
set -e
PKGNAME=$(basename "$PWD")
VERSION=$(git cliff --bumped-version |cut -dv -f2)
sed -i "s/^version:.*$/version: \"$VERSION\"/" shard.yml
git add shard.yml
hace lint test
git cliff --bump -o
git commit -a -m "bump: Release v$VERSION"
git tag "v$VERSION"
git push --tags
hace static
gh release create "v$VERSION" \
    "bin/$PKGNAME-static-linux-amd64" \
    "bin/$PKGNAME-static-linux-arm64" \
    --title "Release v$VERSION" \
    --notes "$(git cliff -l -s all)"
Whenever I want to do a release, I just run that script and it will do everything. Let's go over it bit by bit.
This runs using bash, and will stop on any errors.
#!/bin/bash
set -e
I want to use this same script on different projects so I get PKGNAME from the name of the folder it's running in. I always make that be the same as the name of the repository.
PKGNAME=$(basename "$PWD")
VERSION=$(git cliff --bumped-version |cut -dv -f2)
The VERSION variable is obtained from ... git cliff? What is git cliff?

It's a changelog generator. That means it will create the changelog (later) from commit messages, but in this case what it's doing is suggesting what the next version number should be. To use git cliff you need to adopt "conventional commits", which means every commit message has a prefix like "feat:" or "fix:" or "docs:".
This way the release I am making will be semantically correct according to semver. Some people dislike semantic versioning. That's ok, this is just one way to do things.
If you have commits that are not "conventional", cliff will ignore them and they will not be in the changelog. I am forcing myself to make all my commits conventional using pre-commit hooks but that's just my personal preference.
sed -i "s/^version:.*$/version: \"$VERSION\"/" shard.yml
git add shard.yml
This is a Crystal program, and you are supposed to keep your version in the shard.yml metadata file, so this patches the file using good old sed, and then marks it as needing to be committed.
Usually you also have the version set in your code, but for that I adopted a solution I saw in someone else's code and get it at build time by inserting it into a constant at compilation time.
VERSION = {{ `shards version #{__DIR__}`.chomp.stringify }}
Hacé is sort of a make replacement, so this runs tests and a linter.
hace lint test
As mentioned before, git cliff is a changelog generator. This command updates the CHANGELOG.md file with the changes in the new release.
git cliff --bump -o
Now, make a conventional commit with the version bump, create the git tag with the version number, and push everything to GitHub:
git commit -a -m "bump: Release v$VERSION"
git tag "v$VERSION"
git push --tags
This creates static binaries for Linux in ARM and Intel architectures. I wrote about the details in another article.
hace static
Finally, we use the GitHub command line tool gh to create the release, upload the static binaries, and attach the changelog for this version as the release notes:
gh release create "v$VERSION" \
    "bin/$PKGNAME-static-linux-amd64" \
    "bin/$PKGNAME-static-linux-arm64" \
    --title "Release v$VERSION" \
    --notes "$(git cliff -l -s all)"
And that's it. While this doesn't make the release process much faster (building the static binaries takes a few minutes) it does make it super reliable, and makes sure the changelog is up to date and the release is not something like "v0.3.4: changes and bugfixes".
You can see what a release looks like on the hacé releases page.
I reimplemented Pygments in Crystal. It didn't quite go as I expected. I have already written about how I did it but that left a large part of the story untold. You see, I am using Crystal, which compiles to native code. And yet my reimplementation was slower than Python. That's not supposed to happen.
I decided to figure out why, and fix it. This is the story of how I made my software that looked "ok" 30x faster. Mind you, this is going to make me sound much smarter and more accurate than I am. I ran into 200 dead ends before finding each improvement.
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine ../crycco/src/crycco.cr swapoff > x.html --standalone` | 533.8 ± 4.4 | 523.7 | 542.9 | 18.80 ± 0.92 |
| `chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html` | 28.4 ± 1.4 | 25.6 | 32.8 | 1.00 |
| `pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html` | 103.5 ± 2.8 | 95.6 | 109.1 | 3.65 ± 0.20 |
That benchmark (like all the rest) is done using hyperfine and running each command 50 times after a 10-run warmup. Not that it needs so much care, just look at those numbers. Not only is tartrazine almost 20 times slower than chroma, it's also 3.5 times slower than Pygments. And Pygments is written in Python!
Even without comparing, half a second to highlight a 100-line file is ridiculous.
What's going on here? To find out, let's get data. I used callgrind to profile the code, and then kcachegrind to visualize it.
$ valgrind --tool=callgrind bin/tartrazine ../crycco/src/crycco.cr swapoff
As you can see, some functions are called half a billion times and account for 40% of the execution time. What are they?
A string in Crystal is always unicode. The String::char_bytesize_at function is used to convert an offset into the string from characters to bytes. That's because unicode characters can be different "widths" in bytes. So in the string "123" the "3" is the 3rd byte, but in "áéí" the "í" starts in the 5th byte.
And why is it doing that? Because this code does a bazillion regex operations, and the underlying library (PCRE2) deals in bytes, not characters. So every time we need to do something like "match this regex against this string starting at the 9th character" we need to convert that offset to bytes, and when the library finds a match at byte X we need to convert that offset back to characters, and to extract the data we do it two more times, and so on.
I decided to ... not do that. One nice thing of Crystal is that even though it's compiled, the whole standard library is there in /usr/lib/crystal so I could just go and read how the regular expression code was implemented and see what to do.

Ended up writing a version of Regex and Regex.match that worked on bytes, and made my code deal with bytes instead of characters. I only needed to convert into strings when I had already generated a token, rather than in the thousands of failed attempts to generate it.
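As a minimal sketch of the idea (not tartrazine's actual code), using the stdlib's byte-oriented entry points (assuming Regex#match_at_byte_index, MatchData#byte_begin and String#byte_slice; the file path is just an example):

# Track the position as a byte offset, and only build Strings for the
# tokens we actually keep.
text = File.read("src/main.cr")  # any UTF-8 file will do
word = /[A-Za-z_]+/
pos = 0                          # byte offset, not character offset
tokens = [] of String

while match = word.match_at_byte_index(text, pos)
  start = match.byte_begin(0)
  length = match[0].bytesize
  tokens << text.byte_slice(start, length)  # a String is created only here
  pos = start + length
end

puts tokens.size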
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone` | 187.8 ± 4.4 | 183.2 | 204.1 | 7.48 ± 0.30 |
| `chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html` | 25.1 ± 0.8 | 23.6 | 28.5 | 1.00 |
| `pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html` | 89.9 ± 4.7 | 83.6 | 102.1 | 3.58 ± 0.22 |
While better, this still sucks. I made it 2.5 times faster, but it's still 7.5 times slower than chroma, and 3.6x slower than Python???
Back to valgrind. This time, the profile was ... different. The regex library was taking almost all the execution time, which makes sense, since it's doing all the work. But why is it so slow?
Almost all the time is spent in valid_utf8. That's a function that checks if a string is valid UTF-8. And it's called all the time. Why? Because the regex library is written in C, and it doesn't know that the strings it's working with are already valid UTF-8. So, it checks. And checks. And checks.

Solution? Let's not do that either. The PCRE2 library has a handy flag called NO_UTF_CHECK just for that. So, if you pass that flag when you are doing a regex match, it will not call valid_utf8 at all!
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone` | 30.0 ± 2.2 | 25.5 | 36.6 | 1.15 ± 0.10 |
| `chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html` | 26.0 ± 1.0 | 24.1 | 29.2 | 1.00 |
| `pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html` | 96.3 ± 7.7 | 86.1 | 125.3 | 3.70 ± 0.33 |
Yay! My compiled program is finally faster than an interpreted one! And even in the same ballpark as the other compiled one!
I wonder what happens with a larger file!
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c` | 896.6 ± 71.6 | 709.6 | 1015.8 | 7.07 ± 0.58 |
| `chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html` | 126.8 ± 2.1 | 122.8 | 132.9 | 1.00 |
| `pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html` | 229.1 ± 4.5 | 219.5 | 238.9 | 1.81 ± 0.05 |
Clearly something very bad happens when my code processes larger files. I wonder what it is?
At first glance this is not very informative. So, most of the execution time is spent in libgc and libc functions. That's not very helpful. But, if you look a bit harder, you'll see the execution time is spent allocating memory. Hundreds of milliseconds spent allocating memory!

Yes, memory is fast. But allocating it is not. And I was allocating a lot of memory. Or rather, I was allocating memory over and over. See that memcpy there?
This took me about a day to figure out, but this line of code is where everything became slow.
matched, new_pos, new_tokens = rule.match(text_bytes, pos, self)
if matched
  # Move position forward, save the tokens,
  # tokenize from the new position
  pos = new_pos
  tokens += new_tokens
end
That looks innocent, right? It's not. The tokens array is created at the beginning of tokenization, and every time I find new tokens I just append them to it. BUT ... how does that work? Arrays are not infinite! Where does it put the new tokens?
Well, when an array grows, it allocates a new, larger array, copies all the elements there and now you have a larger array with some room to spare. When you keep doing that, you are copying the whole array over and over. It's the cost you pay for the convenience of having an array that can grow.
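Incidentally, in Crystal tokens += new_tokens is sugar for tokens = tokens + new_tokens, which builds a brand new array and copies everything on every single call; even appending in place with concat still has to reallocate and copy when it runs out of capacity. A tiny illustration (plain Crystal, not tartrazine code):

# += rebinds the variable to a freshly allocated array on every call.
a = [] of Int32
original = a.object_id
3.times { |i| a += [i] }
puts a.object_id == original  # => false: "a" points to a new array now

# concat appends in place; it only reallocates when capacity runs out.
b = [] of Int32
original = b.object_id
3.times { |i| b.concat([i]) }
puts b.object_id == original  # => true: same array object the whole time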
Where was I calling this? In the formatter. The formatter needs to see the tokens to turn them into HTML. Here's the relevant code:
lines = lexer.group_tokens_in_lines(lexer.tokenize(text))
All it does is group them into lines (BTW, that again does the whole "grow an array" thing) and then it just iterates the lines, then the tokens in each line, slaps some HTML around them and writes them to a string (which again, grows).
The solution is ... not to do that.
In Crystal we have iterators, so I changed the tokenizer so that rather than returning an array of tokens it's an iterator which generates them one after the other. So the formatter looks more like this:
tokenizer.each do |token|
  outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
  if token[:value].ends_with? "\n"
    i += 1
    outp << line_label(i) if line_numbers?
  end
end
Rather than iterate an array, it iterates ... an iterator. So no tokens array in the middle. Instead of grouping them into lines, spit out a line label after every newline character (yes, this is a bit more complicated under the hood).
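The real tokenizer is more involved, but the Crystal pattern is just "include Iterator(T) and implement next". A minimal sketch with made-up names (this Token alias and Tokenizer are illustrative, not tartrazine's actual types):

# One token per call to #next, instead of accumulating them all in an array.
alias Token = NamedTuple(type: String, value: String)

class Tokenizer
  include Iterator(Token)

  def initialize(@text : String)
    @pos = 0
  end

  def next
    return stop if @pos >= @text.size
    # A real lexer would match rules here; this just emits one character per token.
    value = @text[@pos].to_s
    @pos += 1
    {type: "Text", value: value}
  end
end

Tokenizer.new("hi\n").each { |token| print token[:value] }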
There were several other optimizations but this is the one that made the difference.
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c` | 73.5 ± 5.9 | 67.7 | 104.8 | 1.00 |
| `chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html` | 123.1 ± 2.7 | 117.5 | 130.3 | 1.68 ± 0.14 |
| `pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html` | 222.0 ± 5.9 | 207.8 | 239.4 | 3.02 ± 0.26 |
Finally, tartrazine is the fastest one. On a large file! By a good margin! But is that all there is? Is there nothing else to improve? Well, no. I can do the same trick again!
You see, the formatter is returning a String by appending to it, and then we are writing the string to a file. That's the same problem as before! So, I changed the formatter to take an IO object and write to it directly.
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c` | 70.3 ± 1.8 | 65.6 | 73.8 | 1.00 |
| `chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html` | 122.2 ± 3.0 | 116.6 | 131.1 | 1.74 ± 0.06 |
| `pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html` | 219.4 ± 4.4 | 212.2 | 235.5 | 3.12 ± 0.10 |
As you can see, that still gives a small improvement, of just 3 milliseconds. But that's 5% of the total time. And it's a small change.
And this is where diminishing returns hit. I could probably make it faster, but even a 10% improvement would be just 7 milliseconds on a huge file. If I were GitHub then maybe this would be worth my time, but I am not and it's not.
And how does the final version compare with the first one?
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `bin/tartrazine-last ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone -l c` | 16.1 ± 1.1 | 13.1 | 21.8 | 1.00 |
| `bin/tartrazine-first ../crycco/src/crycco.cr swapoff` | 519.1 ± 5.4 | 508.1 | 533.7 | 32.29 ± 2.23 |
A speedup factor of 32.3x, for code that had nothing obviously wrong in it. I'd say that's pretty good.
This is a "what I did this weekend" post, but it's slightly more interesting than others, I think. So, I reimplemented a large chunk of Pygments in a lib called Tartrazine.
Because I wanted to highlight source code in Markterm, and I wanted to do it in Crystal, not using an external dependency.
I was using Chroma but it's running over a pipe and makes the code look ugly, and you need to install it, and so on.
So ... I knew Chroma was a Go port of Pygments. So I thought ... how hard can it be? They already did it!
Because I believe we need more libraries I just started writing the damned thing.
Pygments/Chroma consists of three parts: lexers, which split the code into tokens; styles, which decide what color each kind of token gets; and formatters, which turn the tokens into HTML, terminal escapes, or whatever the output format is.
The hard part seemed to be the lexers, so I started there.
I lied a little. I started trying to read the Pygments code. It was quickly clear that there are several kinds of lexers, but most of them (like, 90%) are "regex lexers". They are lexers that use a state machine and a bunch of regular expressions to tokenize the input.
I know and have implemented state machines. State machines are easy. So, I decided to just implement the regex lexers. They have the huge advantage that they have little to no code. They are just a bunch of regular expressions and a bunch of rules that say "if you see this, do that".
They are implemented as data. Here's what the ada lexer looks like:
tokens = {
'root': [
(r'[^\S\n]+', Text),
(r'--.*?\n', Comment.Single),
(r'[^\S\n]+', Text),
(r'function|procedure|entry', Keyword.Declaration, 'subprogram'),
(r'(subtype|type)(\s+)(\w+)',
bygroups(Keyword.Declaration, Text, Keyword.Type), 'type_def'),
(r'task|protected', Keyword.Declaration),
(r'(subtype)(\s+)', bygroups(Keyword.Declaration, Text)),
(r'(end)(\s+)', bygroups(Keyword.Reserved, Text), 'end'),
(r'(pragma)(\s+)(\w+)', bygroups(Keyword.Reserved, Text,
Comment.Preproc)),
(r'(true|false|null)\b', Keyword.Constant),
# builtin types
(words(BUILTIN_LIST, suffix=r'\b'), Keyword.Type),
(r'(and(\s+then)?|in|mod|not|or(\s+else)|rem)\b', Operator.Word),
(r'generic|private', Keyword.Declaration),
(r'package', Keyword.Declaration, 'package'),
(r'array\b', Keyword.Reserved, 'array_def'),
(r'(with|use)(\s+)', bygroups(Keyword.Namespace, Text), 'import'),
(r'(\w+)(\s*)(:)(\s*)(constant)',
bygroups(Name.Constant, Text, Punctuation, Text,
Keyword.Reserved)),
(r'<<\w+>>', Name.Label),
While utterly inscrutable, that's just data. Then I looked at how Pygments processes that data, and it was bad news. While it's OK, it's very idiomatic Python. Like, metaclasses and things jumping around the codebase. I had a feeling it couldn't be that hard.
After all, the excellent write your own lexer document explains it in about two pages of text!
So, I looked at Chroma's implementation. Let's say I am now distrustful of those who claim go code is simple, and worried I may be extremely dumb.
Sure, if I spent some time I could understand it, but I am not a go person, and I don't have plans to be one soon, so I had to make decisions.
And then I saw a magical folder...
A folder full of XML which is obviously the lexer definitions.
Chroma took the Pygments lexer definitions which were data in Python files, and turned them into data in actual data files.
And actually reading those XML files along with the Pygments doc did the trick. I now know how to write a lexer.
Let's look at how, using the definition of a very simple lexer, the "bash_session" one, as an example. A lexer, like I said, is a state machine. Each lexer has some metadata, such as its name, aliases, etc., and some instructions about how to process input.
In this case, it says input should end with a newline.
<lexer>
  <config>
    <name>Bash Session</name>
    <alias>bash-session</alias>
    <alias>console</alias>
    <alias>shell-session</alias>
    <filename>*.sh-session</filename>
    <mime_type>text/x-sh</mime_type>
    <ensure_nl>true</ensure_nl>
  </config>
Since a lexer is a state machine, it has states. The first state is always called root. Each state has rules. Because this lexer is very simple, it has only one state with two rules.
<rules>
  <state name="root">
    <rule pattern="^((?:\[[^]]+@[^]]+\]\s?)?[#$%>])(\s*)(.*\n?)">
      <bygroups>
        <token type="GenericPrompt"/>
        <token type="Text"/>
        <using lexer="bash"/>
      </bygroups>
    </rule>
    <rule pattern="^.+\n?">
      <token type="GenericOutput"/>
    </rule>
  </state>
</rules>
Each rule has a pattern (a regular expression) which decides if the rule applies or not.
The first rule says "if the line starts with a prompt, capture the prompt, capture the spaces after it, and then capture the rest of the line".
Then, inside the rule, we have "actions". This rule has one action, which is "bygroups". This action says "the first group we captured is a GenericPrompt, the second group is Text, and the third group we should ask the bash lexer to tokenize".
And that makes sense, since a bash session looks like this:
$ echo hello
hello
There you have "$" (the prompt), " " (text), and "echo hello" (bash code).
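To make that concrete, here is a rough Crystal sketch of what the bygroups action ends up doing for that rule (illustrative code, not tartrazine's implementation; the hand-off to the bash lexer is only hinted at in a comment):

# One token per capture group, with the last group delegated to another lexer.
line = "$ echo hello\n"
prompt_rule = /^((?:\[[^]]+@[^]]+\]\s?)?[#$%>])(\s*)(.*\n?)/

if match = prompt_rule.match(line)
  tokens = [
    {type: "GenericPrompt", value: match[1]},
    {type: "Text", value: match[2]},
  ]
  # In the real thing, the third group is tokenized by the bash lexer:
  # tokens += bash_lexer.tokenize(match[3])
  pp tokens
end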
The second rule is simpler. It says "capture a whole line".
So, when processing that example session, it works like this:
The state is "root" (it always starts there), and we look at the beginning of the file.
The first line matches the first rule, so we capture the prompt, the spaces, and the text. We generate the first two tokens: GenericPrompt and Text. Then we ask the bash lexer to tokenize the rest of the line. It will return a list of tokens, we keep those tokens too.
Because we matched, we move the "cursor" to the end of the match, which is at the beginning of the second line now.
And we start matching again.
The state is root. The first rule doesn't match at the position we're in. The second rule does. So we capture the whole line and generate a GenericOutput token. Move the cursor to the end of the match.
Oops, no more file. There, we tokenized.
Is that all there is to it? Well, no. Actions can also "push a state", which means change the current state to something else. States are kept in a stack, so if you were in state "root" and pushed the state "foo", now the stack is "root, foo" and the current state is "foo".
Of course you can also "pop" a state, which means "go back to the previous state".
There are some other bits, such as "include", which means "pretend the rules of that other state are here" so we don't have to write them many times in the XML, or the fact that you can pop more than one state, whatever. The basics are just: take the rules for the current state, try them in order, and when one matches, emit its tokens, move the position to the end of the match, and push or pop states if the rule says so. If nothing matches, emit an error token and move forward one character.
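As a rough sketch (with made-up types, not tartrazine's actual code), the whole engine is basically this loop plus a state stack:

# Minimal regex-lexer main loop. Rule, the states hash and the token
# NamedTuples are illustrative stand-ins for the real thing.
record Rule, pattern : Regex, token_type : String, push : String? = nil, pop : Bool = false

states = {
  "root" => [
    Rule.new(/[a-z]+/, "Name"),
    Rule.new(/\s+/, "Text"),
  ],
}

text = "hello world"
pos = 0
stack = ["root"]
tokens = [] of NamedTuple(type: String, value: String)

while pos < text.size
  matched = false
  states[stack.last].each do |rule|
    # A rule only applies if it matches exactly at the current position.
    if (m = rule.pattern.match(text, pos)) && m.begin(0) == pos
      tokens << {type: rule.token_type, value: m[0]}
      stack << rule.push.not_nil! if rule.push      # push a state
      stack.pop if rule.pop && stack.size > 1       # pop a state
      pos += m[0].size
      matched = true
      break
    end
  end
  unless matched
    # No rule matched: emit an error token and move forward one character.
    tokens << {type: "Error", value: text[pos].to_s}
    pos += 1
  end
end

pp tokens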
And that's it. That's how you write a lexer.
But suppose I wrote the lexer, how do I know if I am doing it right? I mean, I can't just run the lexer and see if it works, right?
Well, we could if we only had a whole pile of things to tokenize, and a tool that creates the tokens in a readable format!
Hey, we have those things. There is the pygments test suite and Chroma can output tokens in json!
So, let's do some extreme test-driven-development! After all, I have the tests written, now I just have to pass them, right?
I wrote enough lexer to spit out tokens, wrote a test rig that compared them to chroma's output, and started writing a lexer.
Hey, I can now parse plain text correctly! https://t.co/wMv5HuLYKF pic.twitter.com/rgESeyk5Ta
— Roberto H. Alsina (@ralsina) August 3, 2024
That up there is a two-day-long thread of me trying to pass tests. When I finished, over 96% of the test suite was passing and most of the failures were arguable (I think chroma is wrong in some cases).
So, I had written the RegexLexer. That code supports 241 languages, and it's about 300 lines of simple code.
In fact, I think someone (not me) should do what I did but write this thing in C, so it can be used from any language and both chroma and tartrazine are rendered obsolete.