Open Source Masters for Progressive Analytics
I dream about going to grad school. Every couple of months I ogle Masters and PhD programs at places like Carnegie Mellon, U Chicago, and UC Berkeley.
But I am not going to grad school.
While I love the idea of being a full time student, or dedicating myself to learning, I know that academia is not for me.
I also know that if I want to be a data scientist, a masters degree would be a poor use of time and money. I believe that everything I want to learn in data science can be learned for free (or very low cost) online. I also believe that a good portfolio and real world experience will trump a degree in most settings.
But I still wanted to go to grad school. For the prestige. For the hell of it. Because it was what my friends were doing and I was jealous.
So, I hatched a plan. I was going to reverse engineer the best data science masters programs and find the equivalent textbooks and resources for free. If I was really in love with the idea of learning, why not dedicate my free time to studying as if I were in a masters program?
Thus, the idea for my own Open Source Data Science Masters was born. Yes, I was absolutely inspired by the original open source data science masters, but I wanted a curriculum with more rigor. I have a degree in engineering, and I do not shy away from hard math or theory.
In addition, I’m not setting off on a traditional data science path. I have no desire to work at a place like Uber, Amazon, or Google. I’m not interested in working at the cutting edge of machine learning. I’m in this field to leverage analytics for social good.
I recently discovered the world of data for progressive politics and fell in love. I want to leverage data analytics to turn out young voters and elect progressive candidates to office. This doesn’t look like building neural nets. It looks like cleaning messy data, working with CRMs, producing visualizations, and a lot of statistical analysis.
Using this infamous guide, I devised my own masters program tailor made for working in progressive analytics. All the resources below are either free/open source or cost very little money.
I did a quick survey of the brightest minds I know in data science to determine what kind of math is truly necessary for the field. Turns out my intuition was right: statistics and linear algebra are the subjects to focus on. If you want to work with neural nets (I do not), consider adding Calc I-III to this set. I’ve already taken calculus up through multivariable as part of my engineering degree.
David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel
I used this textbook while in college and it ended up being the best textbook I ever used, and not because it was free. OpenIntro Statistics covers everything you need to know in an intro stats course and more, and comes with 9 open source R-based labs.
Distributions of random variables
Foundations for inference
Inference for numerical data
Inference for categorical data
Introduction to linear regression
Multiple and logistic regression
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
Anytime I mentioned wanting to double down on my stats knowledge, ISLR would inevitably come up. While OpenIntro is the equivalent of an undergrad intro course to statistics, ISLR is comparable to a grad course. This book is great for data scientists as it covers a slew of important modeling and prediction techniques. And like the textbook above, there are 9 open source R-based labs to accompany your learning.
Linear Model Selection and Regularization
Moving Beyond Linearity
Tree-Based Methods
Support Vector Machines
While not free, I was able to pick this up for under $3.00 on Amazon. I spent a long time determining the best resource for learning Linear Algebra. I tried online courses, but they didn’t provide the rigor I was looking for. I attempted the MITx linear algebra course, but I found the lecturer too obtuse and the textbook downright awful. Enter Steven Levandosky’s book, which came highly recommended by a slew of people online. Linear algebra was the only math class I didn’t take as part of my engineering degree, though I picked up pieces of it from other technical courses. To supplement the textbook, I recommend Khan Academy’s Linear Algebra course.
Vectors in Rn
Dot Products and Cross Products
Systems of Linear Independence
Null Space & Column space
Subspaces of Rn
Composition and Matrix Multiplication
Inverses, Determinants, & Transpose of a Matrix
Orthogonal Complements & Projections
Eigenvectors & Complex Eigenvalues and Eigenvectors
According to the Guide to Progressive Data Jobs, experimentation is necessary for determining the impact of variables on an outcome, which is especially important if we’re working on problems such as turning out voters. The guide suggests learning A/B Testing, Intent to Treat versus Treatment on Treated, Average Treatment Effect, and Differential Attrition. I didn’t find too many resources on the web, but I did stumble upon this Coursera course that covers a good chunk of experimentation.
Basic Experiment Design Concepts
Tests of Proportions
Validity in Design and Analysis
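To make a few of those terms concrete, here is a minimal Python sketch of a turnout experiment. Every number in it is made up for illustration, and the function names are my own, not from the guide or the course. It computes the Intent to Treat effect (the turnout difference between everyone assigned to be contacted and the control group), scales it by the contact rate to estimate Treatment on Treated (this assumes nobody in the control group was contacted), and runs a pooled two-proportion z-test on the turnout difference.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions, using a pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF written with the error function, so no extra libraries are needed.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


def intent_to_treat(voted_assigned, n_assigned, voted_control, n_control):
    """Turnout difference between everyone assigned to treatment and the control group."""
    return voted_assigned / n_assigned - voted_control / n_control


def treatment_on_treated(itt, contact_rate):
    """Scale ITT by the share actually contacted.

    Assumption: no one in the control group received the treatment.
    """
    return itt / contact_rate


# Hypothetical experiment: these counts are invented, not real results.
n_assigned, n_control = 10_000, 10_000
contacted = 3_000            # people in the assigned group we actually reached
voted_assigned, voted_control = 4_200, 4_000

itt = intent_to_treat(voted_assigned, n_assigned, voted_control, n_control)
tot = treatment_on_treated(itt, contacted / n_assigned)
z, p_value = two_proportion_z_test(voted_assigned, n_assigned, voted_control, n_control)

print(f"ITT: {itt:.3f}  TOT: {tot:.3f}  z: {z:.2f}  p: {p_value:.4f}")
```

The gap between the two estimates is the point: ITT tells you what the program did for everyone you tried to reach, while TOT estimates what it did for the people you actually reached.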
This is where I get to cheat a little. I have completed a 6 month online bootcamp in data science where I put my coding chops to the test. I’ve already completed two capstones with Python where I worked with real, messy data. That being said, I’m pretty sure my code is ugly. And inefficient. That needs to change. To learn better Python practices, and to learn how to code for real, I wanted to mix practical, hands-on coding experience with theory.
There are a ton of “teach yourself to code” online tutorials. Honestly, pick the one that you like the most. I opted for codewars, which presents short coding challenges. If there’s something I don’t know how to do, I’ll google it.
Think Python: How to Think Like a Computer Scientist, Allen B. Downey
This free, online book came highly recommended by a computer scientist who tutors kids. I opted for this book over all the other Python resources because it can be consumed quickly and comes with a myriad of free exercises.
The SQL Tutorial for Data Analysis, Mode Analytics
I used this tutorial when I learned SQL the first time, and can’t recommend it enough. Start here if you want to cover the syntax of beginner to advanced SQL.
There are so many SQL courses and tutorials out there that, once again, you should choose the one that you like the most. I would trust Sal from Khan Academy with my life, so I opted for his interactive tutorial. At the end of the day, you’re going to learn SQL from doing it.
Bonus: Get a code review buddy
I’m very lucky to know a handful of more senior data scientists, and even some PhD data scientists, who are willing to review my code. I don’t think I’ll ever truly write good code unless I’m willing to face some constructive feedback.
Learn Bash the Hard Way, Ian Miell
Look, just google “bash tutorial” or “how to learn bash”. Don’t pay for any online classes. There are so many free resources out there. I chose this book because, while it costs money, it has the rigor I’m looking for. I plan on supplementing this book with various online interactive tutorials.
pipes and redirects
scripts and startup
signals and traps
a real-world application
Mining the Social Web, Matthew A. Russell
Maybe the most expensive item on this list (at a whopping $9.00 on Amazon), Mining the Social Web was the most recommended resource when I was Googling how to learn web mining.
Mining Web Pages
Mining the Semantically Marked Up Web
Interactive Data Visualization for the Web, Scott Murray
Data visualization was what first piqued my interest in data science, and I’ve been itching to get my hands dirty with D3.js for a long time. This free, online resource was the top comment on some Reddit thread, and Reddit’s never been wrong… right?
Drawing with data
Updates, transitions, and motion
Getting Real: Projects
While my intent with reverse engineering a data science masters was to build a solid foundation in the math and theory needed to excel in this field, I don’t think any educational endeavor is complete without hands-on work.
One of my New Year’s resolutions for 2019 was to build a data science portfolio. While I work through these textbooks and online tutorials, I want to be working towards my goal of creating real world data science projects.
First, I want to complete a project that forces me to work with really, really messy data. I want to use both social web mining and SQL to gather data, and then leverage my stats knowledge and communications background to write some kind of report.
If it’s not inherent in the project above, I want to complete a data visualization project or two that shows off my D3.js skills, and potentially forces me to learn CSS for real and build a fancy website for it.
I have no idea what I’ll end up working on, but I want to hold myself to the goal of completing 3 data science projects before 2020.
Wrapping It Up
This is my pact with you, anonymous Internet reader. Over the next year, I plan on working through these resources and completing my made-up data science masters. What I’m truly excited about is that the end result of this program won’t be a degree, but a real world portfolio of projects and experience.
I’ll blog my way through the vicissitudes of my journey. By writing all this out for you, I hope that you will hold me accountable to my goals. See you next week when I dive into stats!