Write that down: What I've learned about documentation for data teams
I used to be a hypocrite. There are numerous records of me in Slack complaining about other people's lack of documentation. It would not be hard to track down the time I was venting furiously to a colleague about the nonexistent API docs for a tool I was building out analytics for on a tight deadline. Or my bewilderment at EveryAction for not having a data dictionary despite the near ubiquity of their CRM. Or how frustrated I get reading someone else's uncommented code.
But my code was not commented. Most of the knowledge about our data program's compliance setup only existed in my head. I knew all the tables in our warehouse like the back of my hand...but my team had no insight into where to look for their data.
I learned the value of documentation the hard way: by not investing in it for the first year after I was brought on as Data Director at Sunrise Movement. As a result, I accrued a massive amount of documentation debt: the buildup of documentation needs on a team that comes from focusing on short-term needs rather than long-term sustainability.
Lucky for me, I was able to take my team dark in December 2021. My team was given blanket permission to cancel all our meetings, not respond to Slack or email, and spend an entire month focusing on building the tools and systems we needed to sprint in 2022.
And during that month, I wrote a lot of documentation. I lost track of how many lines of YAML I wrote during December. But by the start of the new year, I had climbed my way out of my documentation debt and landed my team in a position where everyone knew what the columns in our analytics tables meant, we had data dictionaries for all our sources, and my team was invested in building and maintaining a culture of documentation.
Today I am going to share some of what we learned and the best practices my team uses for our documentation.
We love our Library. Out of all the tools we use to document our work, our Library site might just be my favorite.
The New York Times developed the open source Library to be “a collaborative newsroom documentation site, powered by Google Docs.” The product is simple: make a folder in your Google Drive, fill it with Google Documents, and Library turns it into a website.
The beauty of Library is that it makes documentation accessible to everyone on my team. You do not need to know how to read markdown or YAML in order to contribute to our documentation. Google Docs are the lingua franca of organizing. Everyone on my team is comfortable editing a Google Doc.
If you are worried about security, rest assured that you can set up OAuth on your Library site to ensure that only people in your domain/organization can access your documentation.
We put everything in our Library. Our Library site includes our internal facing Data Team documentation as well as all our guides, resources, and training materials for our organizing staff. We also make transparent to our staff every dashboard, report, survey, and analysis we have ever done.
While we love our Library, it is not the appropriate tool to document our code base. If you know me, you know I am a dbt fanatic. If you are new here, dbt is the tool I use on my team to supercharge our analytics engineering (you can read more about dbt here).
In dbt, you write YAML that gets spun up into an interactive documentation website that you can share with your team. You can see an example of a dbt documentation site here. I encourage my team to write their documentation as they write their code. When I am reviewing pull requests from my team, I check that they have written the associated documentation for any new models.
We tend to document two areas of our dbt project: sources and final models. That is to say, we document when data first enters our warehouse and when data leaves our warehouse.
As we stage data from our sources, we document every column in every table to build a data dictionary. Because we engage in this practice, we are lucky to have data dictionaries for EveryAction, Mobilize, NewMode, Spoke, Strive, ThruTalk, and ActBlue. For those of you who work in progressive politics, you may know what an accomplishment this is for my team.
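As a rough sketch of what one data dictionary entry can look like in dbt, here is a source definition in a YAML properties file. The source, table, and column names and their descriptions are hypothetical placeholders chosen for illustration, not Sunrise's actual schema.

```yaml
# models/staging/everyaction/src_everyaction.yml (hypothetical path)
version: 2

sources:
  - name: everyaction            # hypothetical source name
    description: "Raw CRM data synced into the warehouse."
    tables:
      - name: contacts           # hypothetical table name
        description: "One row per contact record."
        columns:
          - name: contact_id     # hypothetical column
            description: "Unique identifier for the contact."
          - name: date_created   # hypothetical column
            description: "Date the contact record was first created."
```

Each `description` written here shows up next to the matching source, table, or column in the generated documentation site.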
We also think it is a best practice to document our final views and tables, the ones that get queried by our analysts to create reports and dashboards. We want to provide our analysts with everything they need to know about the tables we create so there is no question as to what a column means.
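Documenting a final model follows the same pattern. The sketch below assumes a hypothetical model called `contact_summary`; the model name, columns, and descriptions are illustrative only.

```yaml
# models/marts/contact_summary.yml (hypothetical path)
version: 2

models:
  - name: contact_summary        # hypothetical model name
    description: "One row per contact, with lifetime engagement totals."
    columns:
      - name: contact_id         # hypothetical column
        description: "Unique identifier for the contact."
        tests:                   # newer dbt versions also accept `data_tests`
          - unique
          - not_null
      - name: total_events_attended   # hypothetical column
        description: "Count of events the contact has attended to date."
```

Running `dbt docs generate` followed by `dbt docs serve` builds the documentation site from files like this and serves it locally.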
Building a culture of documentation
Maintaining the Library
I set aside time at the start of every month to review our Library. Everyone on my team has a monthly recurring task in Asana (I think I will have to write a whole separate blog to cover my love of Asana) to update the documentation in our Library. That task looks like this:
Once a month, we take the time as a team to review our Library and:
> Update the dashboards page and ensure we have all dashboards accounted for with clear owners
> Check the Data Team FAQ and ensure that it's up to date. Add anything new that is needed.
> Add any new staff facing guides
> Add any new reports, surveys, and analyses
> Ask yourself "is there anything else I need to document this month?"
Folks on my team have about a week to meditate on the work they have done in the last month and determine if they need to add or update anything in our Library. We revisit this task during our Data Team stand-ups to ensure follow-through and talk through our documentation needs as a team.
If you are not using a project management tool on your team (I highly recommend using one), making documentation a recurring agenda item for your team meetings is a great option.
The magic of pull request templates
It might sound strange, but we view pull requests as a critical part of our documentation! If you are unfamiliar, a pull request is opened when a person using git/GitHub is ready to merge their code into the main branch. It is an opportunity for someone more senior on the team to review the code and provide feedback, and it ensures that bad code does not get merged into production.
But did you know that you can create a template in Markdown that will populate the pull request's body automatically? My mentor and friend of the movement Claire Carroll set Sunrise up with a pull request template and it has been so, so helpful to me as a manager and as someone who reviews a lot of pull requests.
Our pull request template has sections for description & motivation, to do before merge, screenshots, validation of models, and changes to existing models, but I want to talk to you about the checklist.
The checklist looks like this:
## Checklist:
<!---
This checklist is mostly useful as a reminder of small things that can easily be forgotten – it is meant as a helpful tool rather than hoops to jump through.
Put an `x` in all the items that apply, make notes next to any that haven't been addressed, and remove any items that are not relevant to this PR.
-->
- [ ] My pull request represents one logical piece of work.
- [ ] My commits are related to the pull request and look clean.
- [ ] My SQL follows the [Fishtown Analytics style guide](https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md).
- [ ] I have materialized my models appropriately. Large models should be materialized as tables and the rest should be views.
- [ ] I have added appropriate tests and documentation to any new models.
- [ ] I have documented this model in the appropriate YAML file.
This means that every time someone on my team wraps up a PR, and before that PR can be merged, they need to confirm that they have added the appropriate documentation! This is an excellent way to ensure we do not fall behind on our documentation.
I am now a documentation fanatic. My reports must be sick of me saying "...and could you add that to the Library?" But the result is beautiful. My team can work independently of me. They do not have to guess what a column in a table means; they can look it up in our dbt docs. Our compliance lore is now saved in our Library for anyone on my team to view. And as dark as it may sound, if something happened to me, my team would be able to carry on.
So what are your favorite documentation tools and best practices? Did I miss something obvious? Drop a line and let me know (: