As a Data Director in progressive politics, I swear by dbt. In fact, I refuse to work without it. I believe that dbt does more than just transform data; it can transform entire political data teams. Using dbt, we can quickly and easily spin up robust, well-tested infrastructure that can hold up through and beyond the campaign cycle. dbt enables political data practitioners to do more work with less code, freeing up our time to do what we do best: innovating new technological advancements for our organizations, delivering new insights on top of our cleaned and transformed data, and working with organizers.
What is dbt
It is difficult to explain dbt. It is a Python package consisting of SQL and YAML files that help users more effectively transform their data. It is also a new way of working with data. In fact, dbt Labs coined the term Analytics Engineering to describe the new profession that was instantiated with their tool. Finally, dbt is a way of approaching analytics work: it is its own philosophy.
The questions I receive about dbt from folk curious about the tool speak to the puzzling nature of this package/philosophy/practice Venn diagram. It is a Python package…that helps you write SQL? Is dbt a different flavor of SQL, like Redshift or BigQuery? Does dbt help you load data into your warehouse? Why does dbt have a cult-like following, and is it warranted?
Our friends at dbt Labs, the makers of dbt, released this blog post to address the well-warranted confusion around their product: dbt simply has to be experienced to be understood. I have heard some dbt Labs employees ask interviewers to "define dbt" and regard it as one of their more challenging interview questions. Despite being a fierce advocate for dbt for years, I still struggle to convey the power, magic, and joy of dbt to new users.
In essence, dbt is a tool that empowers data people to synthesize data from many different sources and transform it into the language of your organization. dbt allows you to define organization logic in SQL, the lingua franca of the political data world, add tests with ease to ensure robust data quality, and document your work so everyone within and outside of your data team is on the same page.
While this definition is adequate at sparking interest in the tool, I have found it far more illuminating to describe how dbt transforms data teams with concrete examples.
Political data professionals joke about how messy our organizing data are, because otherwise, we would cry. Everyone has a horror story about wrangling field data for a bulk upload, stripped 0s in zip codes, or myriad phone number variations in a single column. Messy data lay in every corner, and learning how to synthesize, clean, and normalize these data is the work of every practicing political data professional.
Let's say I need to make a dashboard for one of our endorsed candidates. This dashboard aggregates data from many different sources: I need volunteer and shifts data from Mobilize, attempts and canvass results from VAN, and additional voter data from the voter file. When I go to write my SQL, I need to remember that Mobilize IDs are not unique across parent accounts, so I will have to create synthetic IDs from combined columns. I also need to join participations to events in every query where I want to work with volunteer data. Oh, and I will want to convert all the time zones to Eastern Time, since that’s the timezone of our race.
After days of writing and rewriting my SQL, I finally have a dashboard live for my organizers. But–oh no!–an organizer Slacks me to tell me that our door numbers are wrong. We’ve totally knocked more doors than that. Panicked, I go back to my SQL and realize I’ve made a bad join. I then have to meticulously correct the join in every SQL chart cell in my dashboarding tool.
This is bad analytics engineering. Why?
Reliance on human memory. I have to commit the idiosyncrasies of the data to memory if I want to work with the data. It is on the onus of the analyst to remember how tables are joined together, what keys are unique, and more. Managers need to stay vigilant for these assumptions in code reviews, and repeatedly correct for them.
Repetition of code. Any time I want to know how many volunteers signed up for a particular event, I need to do the same join. Reducing code repetition is important for mitigating technical debt and potential human error.
Lack of tests. Our current workflow has no tests! If we write bad code, or the behavior in our underlying data changes, or something just plain breaks the way we are going to hear about it is from an upset organizer. Tests let us catch bugs before our organizers do.
Are we doomed to live like this, painstakingly writing out the same SQL over and over again to create our dashboards, constantly trying to squash bugs while our organizers slowly lose trust in our teams?
No–there is a better way. Enter dbt.
Let’s say we hear about dbt from one of our friends in the movement, and we decide to trial it for this race. Our end goal is to create a dashboard of our impact in the field–ideally, one that is accurate and won’t break during GOTV.
Staging our sources
Our dashboard synthesizes three sources: Mobilize, VAN, and the voter file. I know that the IDs in my Mobilize tables are not unique, and that any time I want to use these tables I need to create a synthetic ID.
But what if I could define this synthetic ID once and only once? Enter the dbt concept of “staging,” where we take raw tables and clean them up a little bit before we really get to work with them.
The tables in our data warehouse are what I call “raw” data. These data have been loaded into our warehouse via a SQL mirror or integration, but they have not yet been transformed by an analytics engineer. These tables contain idiosyncrasies that political data professionals commit to memory. “Oh, you can’t actually join on ‘’unique’ ID so you need to do it this other way” or “We need to convert all these time zones to Pacific time because that is how the vendor calculates their billing” or “This source puts a ‘+1’ in front of their phone numbers but we actually want to remove that for our analytics” are all Fun Data Facts I have had to remember while working with my data.
In dbt, we can define this logic once in our staging layer. Take a look at the code below, which is saved as `stg_mobilize__participations.sql`. I am staging the participations table from my Mobilize source. In this model I define a synthetic ID with the existing id and ‘wfp’.
Now, I can reference this staged model in any of my downstream analytical work. Instead of having to remember to clean the phone number column every time I work with our texting data, I can write that line of SQL once in my staging model and then never again.
Intermediate models and DRY code
Before I learned about dbt, I dreaded making dashboards. Each each time I needed to produce a stat or visualization in my dashboard tool contained unwieldy SQL queries. If I needed to make a change, and I often did, I would have to meticulously open each query and edit my queries one by one. It was a common occurrence for me to realize: oh, my where statements need another filter. Sure,let me spend another hour editing that in every place I call that table…😭
In dbt, we think a lot about DRY code, the software principle called “Don't Repeat Yourself.” To quote the illustrious dbt Docs:
“DRY code means you get to write duplicate code less often. You're saving lots of time writing the same thing over and over. Not only that, but you're saving your cognitive energy for bigger problems you'll end up needing to solve, instead of wasting that time and energy on tedious syntax.Sure, you might have to frontload some of your cognitive energy to create a good abstraction. But in the long run, it'll save you a lot of headaches. Especially if you're building something complex and one typo can be your undoing.”
Let’s return to my dashboard example. I need to join the Mobilize participations table to the events table in multiple places in order to produce analytics about my volunteers. In dbt, we can do this once in an “intermediate” model. The code below, saved as `int_participations_and_events`, defines the logic needed to join participations to events.
Now I have a central source of truth for my volunteers. If I need to make changes to this logic, I can edit it in one place. I can call this model in my dashboard with a simple `select * from schema.int_participations_and_events`
Bring on the tests
At this point, I have cleaned up some messy data in my staging models and defined my organization logic in my intermediate models. With these tools, I can create models for a central source of truth for my volunteer leadership, contact attempts, canvass results–you name it. But we aren’t done yet.
Analytics engineering claims to take the best practices from software engineering and blend them with data. My absolute favorite best practice from software engineering is tests. Tests assert assumptions about our data, which in turn help us deliver accurate analytics to our organizers.
It’s easy to add test coverage to your models in dbt. In the YAML excerpt below, we see tests set up for our staged Mobilize events model.
The three tests you see in the YAML above cover 90% of my testing needs. They are:
`unique`: Have you ever gone to join two tables together and only to realize the unique ID you thought was unique is actually, well, not? With this test, we can assert that every value in a specified column is unique.
`not_null`: Along with tests our IDs for uniqueness, we almost always also want to test to ensure that they are not null. Null IDs lead to unexpected, missing data in joins. This test asserts that there are no null values in a column.
`accepted_values`: My donation data will only ever have `pending` and `settled` as statuses…right? This test asserts that the column will contain a specified set of values.
And if those tests weren’t enough, you can install more tests from a package or write your own. There is no excuse for untested code with dbt.
Since we are such good analysts, we get the dashboard up and running by GOTV, and it's a hit. Our field director likes the dashboard so much that they want additional dashboards for three more candidates. And, while we’re at it, they decided to change the definition of a positive ID to also include anyone who has already voted.
But that’s not a problem. We are now dbt power users. We can spit out three more dashboards because our core logic is defined in intermediate models–meaning we don’t have to rewrite hundreds of lines of SQL. We can change the definition of our positive ID in our dbt project, and all our dashboards that reference this model will reflect that change. Better yet, one of our dbt tests caught some duplicate IDs one day, but we were quickly able to make the necessary changes all before our organizers checked the dashboard for the day. The organizing team now loves the data team.
Wrapping things up
At this point you may be thinking: “That’s all fine, but I can do all of this in Civis.” And you’re right. You can build modular, DRY code with tests in vanilla SQL and run it all in Civis. That would be a totally valid way of running your analytics program.
But, my god, why would you? You may have caught on in my code example, but there is a small function in dbt that sits at the heart of the magic: the `ref` function. `ref` is short for reference, and we use it to reference other models in our project. When it comes time to build all the models in our dbt project, dbt uses that `ref` function to magically build our DAG. This means that dbt can automatically infer the order in which your models need to be built.
I have done this work the hard way. I have built intricate workflows in Civis that run SQL queries in a sequence to build out a larger analytical project. And, let me tell you: it was a giant pain. I had to carefully construct my workflows in a janky GUI, or pour over lengthy YAML to do what dbt does effortlessly. I wasted hours of my life trying to replicate in Civis what dbt does out of the box.
And this is just the beginning of the magic of dbt. There is a reason this tool has the cult-like fan base it's famous for. dbt has saved me and my team countless hours of work, freeing up our time to focus on delivering results to our organizers. You can read about the time we used dbt to automate a tedious, time-sucking rote SQL task into a macro. This macro wound up saving my team (and future teams) hours of work, freeing up their time to find new ways to deliver value to organizers.
In my current role, dbt has helped me transform and clean tens of millions of rows of direct voter contact data. I was quickly able to normalize my call, text, doors, and relational data across multiple voter contact tools into one model of every attempt, contact, and positive ID in our universe across all dozens of our races. This means I can spin up dashboards for our key candidates and races in a matter of hours, not weeks.
I can sleep better at night knowing that we have tests that run every day on our dbt project. These tests have enabled me to catch bugs with third-party and vendor data well before anyone else. If a column drops, a sync doesn't run, or bad data gets into our systems, I am the first to know. I do not have to stress that the report I just sent might be wrong: my tests assure me that my work is correct.
When I think about the magical, life-altering difference dbt has made in my work, I can not help but wonder what other lessons we should be learning from our tech counterparts. My gander into capital T Tech has introduced me to various tools, techniques, and concepts that have been a massive boon to my work. dbt was the first modern data stack tool I learned about, and it is the tool I feel most passionately about, but I have also learned about alternative engineering orchestration platforms like and Prefect, analytical power tools like DuckDB, and a dashboarding tool that actually sparks joy in Hex.
dbt was built on the concept of taking the best practices of software engineering and blending them with analytics. I want to urge the political data community to think about what best practices from Tech we can take and blend with our work. To be clear, I am not calling for political tech to mirror what our for-profit tech counterparts do. We are solving fundamentally different problems. Harder problems. And doing it with a third of the resources in half the time. Yet, I cannot help but wonder if parts of our lives could be made easier by investing in the modern data stack. I have seen firsthand what dbt can do for small, under-resourced data teams in politics. At the time of writing, my team is wrapping up our Prefect migration, and I have high hopes for what real orchestration tooling can do for our team and our impact. What else is out there? And how can we support more progressive political data professionals, whose time is so limited, in leveling up their tech stacks?
Cover image art: Megan Krout
Thank you to Lauren McKay-Wechter and Soren Spicknall for helping create this blog post.