Using duckdb to make CSV files talk

2024-06-25 19:40

Sometimes you want to ask data questions. And often that data is in a CSV. Sure, you can write a quick Python script and use that to extract the information you want. Or you can import it into a database and use SQL.

But TIL the easiest thing is to just ask the duck.

The duck is DuckDB here.

Why? Because you can use SQL queries directly on CSV files.

For examples, let's use a random CSV called luarocks-packages.csv I have lying around:

It starts like this:

name,src,ref,server,version,luaversion,maintainers
alt-getopt,,,,,,arobyn
bit32,,,,5.3.0-1,5.1,lblasc
argparse,https://github.com/luarocks/argparse.git,,,,,
basexx,https://github.com/teto/basexx.git,,,,,
binaryheap,https://github.com/Tieske/binaryheap.lua,,,,,vcunat
busted,,,,,,
cassowary,,,,,,marsam alerque
cldr,,,,,,alerque
compat53,,,,0.7-1,,vcunat
cosmo,,,,,,marsam

And how do I query it? Well, suppose I want to find all packages where alerque is one of the maintainers:

> duckdb
v1.0.0 1f98600c2c
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select name from 'luarocks-packages.csv' where maintainers like '%alerque%';
┌───────────┐
│   name    │
│  varchar  │
├───────────┤
│ cassowary │
│ cldr      │
│ fluent    │
│ loadkit   │
│ penlight  │
└───────────┘

And boom! There you go. So, if you know even some very basic SQL (and you should!) you can leverage duckdb to extract information from CSV files quickly, reliably and in a repeatable manner.

Which is awesome!

Ralsina.Me — Roberto Alsina's website