Skip to main content

Ralsina.Me — Roberto Alsina's website

Using tooling to automate releases

I have been work­ing on a bunch of per­son­al code projects in the past few months and as with all things we hu­mans do, try­ing to do a thing al­ways im­plies do­ing a num­ber of things you re­al­ly don't want to do.

You want a burg­er: you need to kill a cow, and light a fire, and what­not.

You want to write soft­ware: you need to re­lease it ev­ery once in a while.

Mind you, it's pos­si­ble to nev­er re­lease, and be hap­py that way, but if you want some­one else to use it, then re­leas­es are need­ed.

The bad news: re­leas­es are bor­ing and can be la­bo­ri­ous. Let's use as an ex­am­ple a tool I wrote called Hacé.

It's a com­mand line tool, so if I want to do a re­lease it in­volves:

  • Run­ning tests to make sure it's in a good state
  • Up­date the changel­og to have some da­ta of what's in the re­lease
  • Bump the ver­sion num­ber in files that have it (usu­al­ly at least source code, changel­og and meta­data)
  • Com­mit the ver­sion bump, and tag it with the same ver­sion
  • Build stat­ic ver­sions of the CLI (or cre­ate dock­er im­ages, what­ev­er)
  • Cre­ate source as­sets
  • Make the re­lease in GitHub so peo­ple can see it
  • Use the changel­og in the re­lease text

And if you do any of those things wrong, you get to google (a­gain) how to re­move a re­mote tag us­ing git and what­ev­er.

Here is a shell script that will do all of it, then I will ex­plain each step.

#!/bin/bash
set e

PKGNAME=$(basename "$PWD")
VERSION=$(git cliff --bumped-version |cut -dv -f2)

sed s/^version:.*$/version: "$VERSION"/ -i shard.yml
git add shard.yml
hace lint test
git cliff --bump -o
git commit -a -m "bump: Release v$VERSION"
git tag "v$VERSION"
git push --tags
hace static
gh release create "v$VERSION" \
    "bin/$PKGNAME-static-linux-amd64" \
    "bin/$PKGNAME-static-linux-arm64" \
    --title "Release v$VERSION" \
    --notes "$(git cliff -l -s all)"

When­ev­er I want to do a re­lease, I just run that script and it will do ev­ery­thing. Let's go over it bit by bit.

This runs us­ing bash, and will stop on any er­rors.

#!/bin/bash
set e

I want to use this same script on dif­fer­ent projects so I get PKG­NAME from the name of the fold­er it's run­ning in. I al­ways make that be the same as the name of the repos­i­to­ry.

PKGNAME=$(basename "$PWD")
VERSION=$(git cliff --bumped-version |cut -dv -f2)

The VERSION variable is obtained from ... git cliff? What is git cliff?

It's a changelog generator. That means it will create the changelog (later) from commit messages, but in this case what it's doing is suggesting what the next version number should be. To use git cliff you need to adopt "conventional commits", which means every commit message has a prefix like "feat:" or "fix:" or "docs:"

  • If those com­mits in­clude a "fix" this will be at least a patch re­lease.
  • If the com­mits since the pre­vi­ous re­lease in­clude a "feat" this will be at least a mi­nor re­lease.
  • If the com­mits in­clude a "feat!" this will be a ma­jor re­lease (the ! marks it as a break­ing change)

This way the re­lease I am mak­ing will be se­man­ti­cal­ly cor­rect ac­cord­ing to semver. Some peo­ple dis­like se­man­tic ver­sion­ing. That's ok, this is just one way to do things.

If you have com­mits that are not "con­ven­tion­al", cliff will ig­nore them and they will not be in the changel­og. I am forc­ing my­self to make all my com­mits con­ven­tion­al us­ing pre-­com­mit hooks but that's just my per­son­al pref­er­ence.

sed s/^version:.*$/version: "$VERSION"/ -i shard.yml
git add shard.yml

This is a Crystal program, and you are supposed to keep your version in the shard.yml metadata file, so this patches the file using good old sed, and then marks it as needing to be committed.

Usu­al­ly you al­so have the ver­sion set in your code, but for that I adopt­ed a so­lu­tion I saw in some­one else's code and get it at build time by in­sert­ing it in­to a con­stant at com­pi­la­tion time.

 VERSION = {{ `shards version #{__DIR__}`.chomp.stringify }}

Hacé is sort of a make re­place­men­t, so this runs tests and a lin­ter.

hace lint test

As mentioned before, git cliff is a changelog generator. This command updates the CHANGELOG.md file with the changes in the new release.

git cliff --bump -o

Now, a con­ven­tion­al com­mit with a ver­sion bump, cre­ate the git tag with the ver­sion num­ber, and push ev­ery­thing to GitHub:

git commit -a -m "bump: Release v$VERSION"
git tag "v$VERSION"
git push --tags

This cre­ates stat­ic bi­na­ries for Lin­ux in ARM and In­tel ar­chi­tec­tures. I wrote about the de­tails in an­oth­er ar­ti­cle.

hace static

Finally, we use the GitHub command line tool gh to:

  • Cre­ate the re­lease (with the right ver­sion)
  • Up­load the stat­ic bi­na­ries
  • GitHub au­to­mat­i­cal­ly cre­ates source ar­ti­facts
  • Use this re­lease's changel­og as notes for the re­lease, so users can see what's new
gh release create "v$VERSION" \
    "bin/$PKGNAME-static-linux-amd64" \
    "bin/$PKGNAME-static-linux-arm64" \
    --title "Release v$VERSION" \
    --notes "$(git cliff -l -s all)"

And that's it. While this does­n't make the re­lease process much faster (build­ing the stat­ic bi­na­ries takes a few min­utes) it does make it su­per re­li­able, and makes sure the changel­og is up to date and the re­lease is not some­thing like "v0.3.4: changes and bug­fix­es".

You can see how a re­lease looks like in the hacé re­leas­es page

A Tale of Optimization

I reim­ple­ment­ed Pyg­ments in Crys­tal. It did­n't quite go as I ex­pect­ed. I have al­ready writ­ten about how I did it but that left a large part of the sto­ry un­told. You see, I am us­ing Crys­tal, which com­piles to na­tive code. And yet my reim­ple­men­ta­tion was slow­er than Python. That's not sup­posed to hap­pen.

I de­cid­ed to fig­ure out why, and fix it. This is the sto­ry of how I made my soft­ware that looked "ok" 30x faster. Mind you, this is go­ing to make me sound much smarter and ac­cu­rate than I am. I ran in­to 200 dead ends be­fore find­ing each im­prove­men­t.

The First Problem (v0.1.0)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine ../crycco/src/crycco.cr swapoff > x.html --standalone 533.8 ± 4.4 523.7 542.9 18.80 ± 0.92
chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html 28.4 ± 1.4 25.6 32.8 1.00
pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html 103.5 ± 2.8 95.6 109.1 3.65 ± 0.20

That bench­mark (like all the rest) is done us­ing hy­per­fine and run­ning each com­mand 50 times af­ter a 10-run warmup. Not that it needs so much care, just look at those num­ber­s. Not on­ly is tar­trazine al­most 20 times slow­er than chro­ma, it's al­so 3.5 times slow­er than Pyg­ments. And Pyg­ments is writ­ten in Python!

Even with­out com­par­ing, half a sec­ond to high­light a 100-­line file is ridicu­lous.

What's going on here? To find out, let's get data. I used callgrind to profile the code, and then kcachegrind to visualize it.

$ valgrind --tool=callgrind bin/tartrazine ../crycco/src/crycco.cr swapoff

Some functions called half a billion times

As you can see, some func­tions are called half a bil­lion times and ac­count for 40% of the ex­e­cu­tion time. What are they?

A string in Crystal is always unicode. The String::char_bytesize_at function is used to convert an offset into the string from characters to bytes. That's because unicode characters can be different "widths" in bytes. So in the string "123" the "3" is the 3rd byte, but in "áéí" the "í" starts in the 5th byte.

And why is it do­ing that? Be­cause this code does a bazil­lion regex op­er­a­tions, and the un­der­ly­ing li­brary (PCRE2) deals in bytes, not char­ac­ter­s, so ev­ery­time we need to do things like "match this regex in­to this string start­ing in the 9th po­si­tion" we need to con­vert that off­set to bytes, and then when it finds a match at byte X we need to con­vert that off­set to char­ac­ters and to ex­tract the da­ta we do it two more times, and so on.

I decided to ... not do that. One nice thing of Crystal is that even though it's compiled, the whole standard library is there in /usr/lib/crystal so I could just go and read how the regular expression code was implemented and see what to do.

Ended up writing a version of Regex and Regex.match that worked on bytes, and made my code deal with bytes instead of characters. I only needed to convert into strings when I had already generated a token, rather than in the thousands of failed attempts to generate it.

The Second Problem (commit 0626c86)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone 187.8 ± 4.4 183.2 204.1 7.48 ± 0.30
chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html 25.1 ± 0.8 23.6 28.5 1.00
pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html 89.9 ± 4.7 83.6 102.1 3.58 ± 0.22

While bet­ter this still suck­s. I made it 2.5 times faster, but it's still 7.5 times slow­er than chro­ma, and 3.6x slow­er than Python???

regex library takes all the time

Back to val­grind. This time, the pro­file was ... dif­fer­en­t. The regex li­brary was tak­ing al­most all the ex­e­cu­tion time, which makes sense, since it's do­ing all the work. But why is it so slow?

It's calling valid_utf8 ALL THE TIME?

Almost all the time is spent in valid_utf8. That's a function that checks if a string is valid UTF-8. And it's called all the time. Why? Because the regex library is written in C, and it doesn't know that the strings it's working with are already valid UTF-8. So, it checks. And checks. And checks.

Solution? Let's not do that either. The PCRE2 library has a handy flag called NO_UTF_CHECK just for that. So, if you pass that flag when you are doing a regex match, it will not call valid_utf8 at all!

The Third Problem (commit 7db8fdc)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone 30.0 ± 2.2 25.5 36.6 1.15 ± 0.10
chroma ../crycco/src/crycco.cr -l crystal -f html -s swapoff > x.html 26.0 ± 1.0 24.1 29.2 1.00
pygmentize ../crycco/src/crycco.cr -l crystal -O full,style=autumn -f html -o x.html 96.3 ± 7.7 86.1 125.3 3.70 ± 0.33

Yay! My com­piled pro­gram is fi­nal­ly faster than an in­ter­pret­ed one! And even in the same ball­park as the oth­er com­piled one!

I won­der what hap­pens with a larg­er file!

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c 896.6 ± 71.6 709.6 1015.8 7.07 ± 0.58
chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html 126.8 ± 2.1 122.8 132.9 1.00
pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html 229.1 ± 4.5 219.5 238.9 1.81 ± 0.05

Clear­ly some­thing very bad hap­pens when my code pro­cess­es larg­er files. I won­der what it is?

That's rather low level

At first glance this is not very informative. So, most of the execution time is spent in libgc and libc functions. That's not very helpful. But, if you look a bit harder, you'll see the execution time is spent allocating memory. Hundreds of milliseconds spent allocating memory

But memory is fast!

Yes, memory is fast. But allocating it is not. And I was allocating a lot of memory. Or rather, I was allocating memory over and over. See that memcpy there?

This took me about a day to fig­ure out, but this line of code is where ev­ery­thing be­came slow.

matched, new_pos, new_tokens = rule.match(text_bytes, pos, self)
if matched
    # Move position forward, save the tokens,
    # tokenize from the new position
    pos = new_pos
    tokens += new_tokens

That looks innocent, right? It's not. The tokens array is created at the beginning of tokenization, and every time I find new tokens I just append them to it. BUT ... how does that work? Arrays are not infinite! Where does it put the new tokens?

Well, when an ar­ray grows, it al­lo­cates a new, larg­er ar­ray, copies all the el­e­ments there and now you have a larg­er ar­ray with some room to spare. When you keep do­ing that, you are copy­ing the whole ar­ray over and over. It's the cost you pay for the con­ve­nience of hav­ing an ar­ray that can grow.

Where was I call­ing this? In the for­mat­ter. The for­mat­ter needs to see the to­kens to turn them in­to HTM­L. Here's the rel­e­vant code:

lines = lexer.group_tokens_in_lines(lexer.tokenize(text))

All it does is group them in­to lines (BTW, that again does the whole "grow an ar­ray" thing) and then it just it­er­ates the lines, then the to­kens in each line, slaps some HTML around them and writes them to a string (which again, grows).

The so­lu­tion is ... not to do that.

In Crys­tal we have it­er­a­tors, so I changed the to­k­eniz­er so that rather than re­turn­ing an ar­ray of to­kens it's an it­er­a­tor which gen­er­ates them one af­ter the oth­er. So the for­mat­ter looks more like this:

tokenizer.each do |token|
    outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
    if token[:value].ends_with? "\n"
    i += 1
    outp << line_label(i) if line_numbers?
    end
end

Rather than iterate an array, it iterates ... an iterator. So no tokens array in the middle. Instead of grouping them into lines, spit out a line label after every newline character (yes, this is a bit more complicated under the hood)

There were sev­er­al oth­er op­ti­miza­tions but this is the one that made the dif­fer­ence.

The Fourth Problem (commit ae03e46)

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c 73.5 ± 5.9 67.7 104.8 1.00
chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html 123.1 ± 2.7 117.5 130.3 1.68 ± 0.14
pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html 222.0 ± 5.9 207.8 239.4 3.02 ± 0.26

Fi­nal­ly, tar­trazine is the fastest one. On a large file! By a good margin! But is that all there is? Is there noth­ing else to im­prove? Well, no. I can do the same trick again!

You see, the formatter is returning a String by appending to it, and then we are writing the string to a file. That's the same problem as before!. So, I changed the formatter to take an IO object and write to it directly.

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine /usr/include/sqlite3.h -f html -t swapoff -o x.html --standalone -l c 70.3 ± 1.8 65.6 73.8 1.00
chroma /usr/include/sqlite3.h -l c -f html -s swapoff > x.html 122.2 ± 3.0 116.6 131.1 1.74 ± 0.06
pygmentize /usr/include/sqlite3.h -l c -O full,style=autumn -f html -o x.html 219.4 ± 4.4 212.2 235.5 3.12 ± 0.10

As you can see, that still gives a small im­prove­men­t, of just 3 mil­lisec­ond­s. But that's 5% of the to­tal time. And it's a small change.

And this is where di­min­ish­ing re­turns hit. I could prob­a­bly make it faster, but even a 10% im­prove­ment would be just 7 mil­lisec­onds on a huge file. If I were GitHub then maybe this would be worth my time, but I am not and it's not.

And how does the fi­nal ver­sion com­pare with the first one?

Command Mean [ms] Min [ms] Max [ms] Relative
bin/tartrazine-last ../crycco/src/crycco.cr -f html -t swapoff -o x.html --standalone -l c 16.1 ± 1.1 13.1 21.8 1.00
bin/tartrazine-first ../crycco/src/crycco.cr swapoff 519.1 ± 5.4 508.1 533.7 32.29 ± 2.23

A speedup fac­tor of 32.3x, for code that had noth­ing ob­vi­ous­ly wrong in it. I'd say that's pret­ty good.

Tartrazine: reimplementing pygments or my adventure in extreme test-driven-development

This is a "what I did this week­end" post, but it's slight­ly more in­ter­est­ing than oth­er­s, I think. So, I reim­ple­ment­ed a large chunk of Pyg­ments in a lib called Tar­trazine.

Why?

Be­cause I want­ed to high­light source code in Mark­term, and I want­ed to do it in Crys­tal, not us­ing an ex­ter­nal de­pen­den­cy.

I was us­ing Chro­ma but it's run­ning over a pipe and makes the code look ug­ly, and you need to in­stall it, and so on.

So ... I knew Chro­ma was a Go port of Pyg­ments. So I thought ... how hard can it be? They al­ready did it!

Be­cause I be­lieve we need more li­braries I just start­ed writ­ing the damned thing.

What?

Pyg­ments/Chro­ma con­sists of three part­s.

  • Lex­er­s, which turn a text and turn it in­to a pile of to­ken­s.
  • Styles, which when asked about a to­ken type, re­turn a col­or/bold/un­der­line/etc. "style".
  • For­mat­ter­s, which it­er­ate a list of to­ken­s, ap­ply styles and cre­ate a stream of text (for ex­am­ple HTML with pret­ty col­ors).

The hard part seemed to be the lex­er­s, so I start­ed there.

How?

I lied a lit­tle. I start­ed try­ing to read the Pyg­ments code. It was quick­ly clear that there are sev­er­al kinds of lex­er­s, but most of them (like, 90%) are "regex lex­er­s". They are lex­ers that use a state ma­chine and a bunch of reg­u­lar ex­pres­sions to to­k­enize the in­put.

I know and have im­ple­ment­ed state ma­chines.. State ma­chines are easy. So, I de­cid­ed to just im­ple­ment the regex lex­er­s. They have the huge ad­van­tage that they have lit­tle to no code. THey are just a bunch of reg­u­lar ex­pres­sions and a bunch of rules that say "if you see this, do that".

They are im­ple­ment­ed as da­ta. Here's what the ada lex­er looks like:

    tokens = {
        'root': [
            (r'[^\S\n]+', Text),
            (r'--.*?\n', Comment.Single),
            (r'[^\S\n]+', Text),
            (r'function|procedure|entry', Keyword.Declaration, 'subprogram'),
            (r'(subtype|type)(\s+)(\w+)',
             bygroups(Keyword.Declaration, Text, Keyword.Type), 'type_def'),
            (r'task|protected', Keyword.Declaration),
            (r'(subtype)(\s+)', bygroups(Keyword.Declaration, Text)),
            (r'(end)(\s+)', bygroups(Keyword.Reserved, Text), 'end'),
            (r'(pragma)(\s+)(\w+)', bygroups(Keyword.Reserved, Text,
                                             Comment.Preproc)),
            (r'(true|false|null)\b', Keyword.Constant),
            # builtin types
            (words(BUILTIN_LIST, suffix=r'\b'), Keyword.Type),
            (r'(and(\s+then)?|in|mod|not|or(\s+else)|rem)\b', Operator.Word),
            (r'generic|private', Keyword.Declaration),
            (r'package', Keyword.Declaration, 'package'),
            (r'array\b', Keyword.Reserved, 'array_def'),
            (r'(with|use)(\s+)', bygroups(Keyword.Namespace, Text), 'import'),
            (r'(\w+)(\s*)(:)(\s*)(constant)',
             bygroups(Name.Constant, Text, Punctuation, Text,
                      Keyword.Reserved)),
            (r'<<\w+>>', Name.Label),

While ut­ter­ly un­scrutable, that's just da­ta. Then I looked at how Pyg­ments pro­cess­es that data, and it was bad news. While it's ok it's very id­iomat­ic Python. Like, meta­class­es and things jump­ing around the code­base. I had a feel­ing it could­n't be that hard.

Af­ter al­l, the ex­cel­lent write your own lex­er doc­u­ment ex­plains it in about two pages of tex­t!

So, I looked at Chro­ma's im­ple­men­ta­tion. Let's say I am now dis­trust­ful of those who claim go code is sim­ple, and wor­ried I may be ex­treme­ly dum­b.

Sure, if I spent some time I could un­der­stand it, but I am not a go per­son, and I don't have plans to be one soon, so I had to make de­ci­sion­s.

And then I saw a mag­i­cal fold­er...

A fold­er full of XML which is ob­vi­ous­ly the lex­er def­i­ni­tion­s.

Chro­ma took the Pyg­ments lex­er def­i­ni­tions which were da­ta in Python files, and turned them in­to da­ta in ac­tu­al da­ta files.

And ac­tu­al­ly read­ing those XML files along the Pyg­ments doc did the trick. I now know how to write a lex­er.

But really, How?

Let's look at how, while look­ing at the def­i­ni­tion of a very sim­ple lex­er, the "bash_ses­sion" one. A lex­er, like I said, is a state ma­chine. Each lex­er has some meta­data, such as its name, alias­es, etc, and some in­struc­tions about how to process in­put.

In this case, it says in­put should end with a new­line.

<lexer>
  <config>
    <name>Bash Session</name>
    <alias>bash-session</alias>
    <alias>console</alias>
    <alias>shell-session</alias>
    <filename>*.sh-session</filename>
    <mime_type>text/x-sh</mime_type>
    <ensure_nl>true</ensure_nl>
  </config>

Since a lex­er is a state ma­chine, it has states. The first state is al­ways called root. Each state has rules. Be­cause this lex­er is very sim­ple, it has on­ly one state with two rules.

  <rules>
    <state name="root">
      <rule pattern="^((?:\[[^]]+@[^]]+\]\s?)?[#$%&gt;])(\s*)(.*\n?)">
        <bygroups>
          <token type="GenericPrompt"/>
          <token type="Text"/>
          <using lexer="bash"/>
        </bygroups>
      </rule>
      <rule pattern="^.+\n?">
        <token type="GenericOutput"/>
      </rule>
    </state>
  </rules>

Each rule has a pat­tern (a reg­u­lar ex­pres­sion) which de­cides if the rule ap­plies or not.

The first rule says "if the line starts with a promp­t, cap­ture the promp­t, cap­ture the spa­ces af­ter it, and then cap­ture the rest of the line".

Then, in­side the rule, we have "ac­tion­s". This rule has one ac­tion, which is "by­group­s". This ac­tion says "the first group we ca­pured is a Gener­icPromp­t, the sec­ond group is Tex­t, and the third group we should ask the bash lex­er to to­k­enize".

And that makes sense, since a bash ses­sion looks like this:

$ echo hello
hello

There you have "$" (the promp­t), " " (tex­t), and "e­cho hel­lo" (bash code).

The sec­ond rule is sim­pler. It says "cap­ture a whole line".

So, when pro­cess­ing that ex­am­ple ses­sion, it works like this:

The state is "root" (it al­ways starts there), and we look at the be­gin­ning of the file.

The first line match­es the first rule, so we cap­ture the promp­t, the spaces, and the tex­t. We gen­er­ate the first two to­ken­s: Gener­icPrompt and Tex­t. Then we ask the bash lex­er to to­k­enize the rest of the line. It will re­turn a list of to­ken­s, we keep those to­kens too.

Be­cause we matched, we move the "cur­sor" to the end of the match, which is at the be­gin­ning of the sec­ond line now.

And we start match­ing again.

The state is root. The first rule does­n't match at the po­si­tion we're in. The sec­ond rule does. So we cap­ture the whole line and gen­er­ate a Gener­i­cOut­put to­ken. Move the cur­sor to the end of the match.

Oop­s, no more file. There, we to­k­enized.

Just that?

Well, no. Ac­tions can al­so "push a state" which means change the cur­rent state to some­thing else. States are kept in a stack, so if you were in state "root" and pushed the state "foo" now the stack is "root, foo" and the cur­rent state is "foo".

Of course you can al­so "pop" a state, which means "go back to the pre­vi­ous state".

There are some oth­er bit­s, such as "in­clude" which means "pre­tend the rules of the oth­er lex­er are here" so we don't have to write them many times in the XM­L, or that you can pop more than one state, what­ev­er, the ba­sic is just:

  1. You are in a state
  2. Check rules un­til one match­es
  3. Use that rule's ac­tions (y­ou may end up in an­oth­er state)
  4. Col­lect any to­kens gen­er­at­ed
  5. Move the cur­sor to the end of the match
  6. Go back to 1.

And that's it. That's how you write a lex­er.

And then?

But sup­pose I wrote the lex­er, how do I know if I am do­ing it right? I mean, I can't just run the lex­er and see if it work­s, right?

Well, we could if we on­ly had a whole pile of things to to­k­enize, and a tool that cre­ates the to­kens in a read­able for­mat!

Hey, we have those things. There is the pyg­ments test suite and Chro­ma can out­put to­kens in json!

So, let's do some ex­treme test-­driven-de­vel­op­men­t! Af­ter al­l, I have the tests writ­ten, now I just have to pass them, right?

I wrote enough lex­er to spit out to­ken­s, wrote a test rig that com­pared them to chro­ma's out­put, and start­ed writ­ing a lex­er.

That up there is a two-­day-­long thread of me try­ing to pass test­s. When I fin­ished, over 96% of the test suite was pass­ing and most of the fail­ures were ar­guable (I think chro­ma is wrong in some cas­es).

So, I had writ­ten the RegexLex­er. Looks like this

That code sup­ports 241 lan­guages, and it's about 300 lines of sim­ple code.

In fac­t, I think some­one (not me) should do what I did but write this thing in C, so it can be used from any lan­guage and both chro­ma and tar­trazine are ren­dered ob­so­lete.


Contents © 2000-2024 Roberto Alsina