- Brittany Bennett

# Open Source Masters for Progressive Analytics

I dream about going to grad school. Every couple of months I ogle Masters and PhD programs at places like Carnegie Mellon, UChicago, and UC Berkeley.

But I am **not** going to grad school.

While I love the idea of being a full-time student, or dedicating myself to learning, I know that academia is not for me.

I also know that if I want to be a data scientist, a master's degree would be a poor use of time and money. I believe that everything I want to learn in data science can be learned for free (or at very low cost) online. I also believe that a good portfolio and real-world experience will trump a degree in most settings.

But I still wanted to go to grad school. For the prestige. For the hell of it. Because it was what my friends were doing and I was jealous.

So, I hatched a plan. **I was going to reverse engineer the best data science masters programs** and find the equivalent textbooks and resources for free. If I was really in love with the idea of learning, why not dedicate my free time to studying as if I were in a masters program?

Thus, the idea for my own Open Source Data Science Masters was born. Yes, I was absolutely inspired by the __original open source data science masters__, but I wanted a curriculum with more rigor. I have a degree in engineering, and I do not shy away from hard math or theory.

In addition, I’m not setting off on a traditional data science path. I have no desire to work at a place like Uber, Amazon, or Google. I’m not interested in working at the cutting edge of machine learning. I’m in this field to leverage analytics for social good.

I recently discovered the world of data for progressive politics and fell in love. I want to leverage data analytics to turn out young voters and elect progressive candidates to office. This doesn’t look like building neural nets. It looks like cleaning messy data, working with CRMs, producing visualizations, and a lot of statistical analysis.

Using this __infamous guide__, I devised my own masters program tailor-made for working in progressive analytics. All the resources below are either free/open source or cost very little money.

# Math Background

I did a quick survey of the brightest minds I know in data science to determine what kind of math is truly necessary for the field. Turns out my intuition was right: statistics and linear algebra are the subjects to focus on. If you want to work with neural nets (I do not), consider adding Calc I-III to this set. I already took calculus through multivariable as part of my engineering degree.

**OpenIntro Statistics**,

*David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel*

I used this textbook while in college and it ended up being the best textbook I ever used, and not because it was free. OpenIntro Statistics covers everything you need to know in an intro stats course and more, and comes with 9 open source R-based labs.

Topics covered:

Probability

Distributions of random variables

Foundations for inference

Inference for numerical data

Inference for categorical data

Introduction to linear regression

Multiple and logistic regression

**An Introduction to Statistical Learning**,

*Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani*

Anytime I mentioned wanting to double down on my stats knowledge, ISLR would inevitably come up. While OpenIntro is the equivalent of an undergrad intro course to statistics, ISLR is comparable to a grad course. This book is great for data scientists as it covers a slew of important modeling and prediction techniques. And like the textbook above, there are 9 open source R-based labs to accompany your learning.

Topics covered:

Statistical Learning

Linear Regression

Classification

Resampling Methods

Linear Model Selection and Regularization

Moving Beyond Linearity

Tree-Based Methods

Support Vector Machines

Unsupervised Learning

**Linear Algebra**,

*Steven Levandosky*

While it’s not free, I was able to pick this up for under $3.00 on Amazon. I spent a long time determining the best resource for learning linear algebra. I tried online courses, but they didn’t provide the rigor I was looking for. I attempted the MITx linear algebra course, but I found the lecturer too obtuse and the textbook downright awful. Enter Steven Levandosky’s book, which came highly recommended by a slew of people online. Linear algebra was the only math class I didn’t take as part of my engineering degree, though I picked up pieces of it from other technical courses. To supplement the textbook, I recommend Khan Academy’s Linear Algebra course.

Topics covered:

Vectors in Rn

Dot Products and Cross Products

Systems of Linear Equations & Linear Independence

Matrix-Vector Products

Null Space & Column space

Subspaces of Rn

Linear Transformations

Composition and Matrix Multiplication

Inverses, Determinants, & Transpose of a Matrix

Orthogonal Complements & Projections

Orthonormal Bases

Eigenvectors & Complex Eigenvalues and Eigenvectors

Symmetric Matrices

Quadratic Forms

__Designing, Running, and Analyzing Experiments__

*Coursera*

According to the Guide to Progressive Data Jobs, experimentation is necessary for determining the impact of variables on an outcome, which is especially important if we’re working on problems such as turning out voters. The guide suggests learning A/B Testing, Intent to Treat versus Treatment on Treated, Average Treatment Effect, and Differential Attrition. I didn’t find too many resources on the web, but I did stumble upon this Coursera course that covers a good chunk of experimentation.

Topics covered:

Basic Experiment Design Concepts

Tests of Proportions

The T-Test

Validity in Design and Analysis
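To make the Intent to Treat versus Treatment on Treated distinction a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the canvassing experiment, the counts, and the function names are made up and do not come from the guide or the course. It assumes voters were randomly assigned to the two groups, that nobody in the control group was contacted, and that being assigned (without being reached) has no effect on turnout. Under those assumptions, ITT is the difference in turnout by assignment, and TOT is ITT divided by the contact rate. A pooled two-proportion z-test is included as one common way to check whether the ITT difference could be noise.

```python
from math import erf, sqrt

# Hypothetical canvassing experiment (made-up counts, not real data).
# "assigned":  voters randomly assigned to the group
# "contacted": voters in the treatment group a canvasser actually reached
# "voted":     voters in the group who turned out
treatment = {"assigned": 5000, "contacted": 1500, "voted": 2150}
control = {"assigned": 5000, "voted": 2000}


def turnout_rate(group):
    """Share of assigned voters who voted."""
    return group["voted"] / group["assigned"]


def intent_to_treat(treatment, control):
    """ITT: turnout difference by assignment, ignoring who was actually reached."""
    return turnout_rate(treatment) - turnout_rate(control)


def treatment_on_treated(treatment, control):
    """TOT: ITT scaled by the contact rate.

    Assumes no one in control was contacted and that assignment
    without contact does not change turnout.
    """
    contact_rate = treatment["contacted"] / treatment["assigned"]
    return intent_to_treat(treatment, control) / contact_rate


def two_proportion_z_test(treatment, control):
    """Two-sided z-test for a difference in turnout, using a pooled standard error."""
    n_t, n_c = treatment["assigned"], control["assigned"]
    pooled = (treatment["voted"] + control["voted"]) / (n_t + n_c)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = intent_to_treat(treatment, control) / std_err
    # Standard normal CDF written with erf, so no third-party library is needed.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


itt = intent_to_treat(treatment, control)
tot = treatment_on_treated(treatment, control)
z, p = two_proportion_z_test(treatment, control)

print(f"ITT: {itt:.3f}")
print(f"TOT: {tot:.3f}")
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up counts, the treatment group turns out at 43% versus 40% for control, so the ITT estimate is 3 percentage points. Only 30% of the treatment group was reached, so the TOT estimate is 10 percentage points among voters who were actually contacted. The gap between the two numbers is the reason the guide lists them as separate concepts.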

# Programming Core

This is where I get to cheat a little. I have completed a 6-month online bootcamp in data science where I put my coding chops to the test. I’ve already completed two capstones with Python where I worked with real, messy data. That being said, I’m pretty sure my code is ugly. And inefficient. That needs to change. To learn better Python practices, and to learn how to code for real, I wanted to mix practical, hands-on coding experience with theory.

There are a ton of “teach yourself to code” online tutorials. Honestly, pick the one that you like the most. I opted for Codewars, which presents short coding challenges. If there’s something I don’t know how to do, I’ll google it.

__Think Python: How to Think Like a Computer Scientist__, Allen B. Downey

This free, online book came highly recommended by a computer scientist who tutors kids. I opted for this book over all the other Python resources because it can be consumed quickly and comes with a myriad of free exercises.

__The SQL Tutorial for Data Analysis__, Mode Analytics

I used this tutorial when I learned SQL the first time, and can’t recommend it enough. Start here if you want to cover the syntax of beginner to advanced SQL.

__Learn SQL interactively on Khan Academy__

There are so many SQL courses and tutorials out there that, once again, you should choose the one that you like the most. I would trust Sal from Khan Academy with my life, so I opted for his interactive tutorial. At the end of the day, you’re going to learn SQL by doing it.

**Bonus: Get a code review buddy**

I’m very lucky to know a handful of more senior data scientists, and even some PhD data scientists, who are willing to review my code. I don’t think I’ll ever truly write good code unless I’m willing to face some constructive feedback.

# Advanced Topics

__Learn Bash the Hard Way__, Ian Miell

Look, just google “bash tutorial” or “how to learn bash”. Don’t pay for any online classes. There are so many free resources out there. I chose this book because, while it costs money, it has the rigor I’m looking for. I plan on supplementing it with various online interactive tutorials.

Topics covered:

globbing

variables

functions

pipes and redirects

scripts and startup

command substitution

tests

loops

exit codes

the prompts

here docs

history

shortcuts

signals and traps

debugging

string manipulation

a real-world application

__Mining the Social Web__, Matthew A. Russell

Maybe the most expensive item on this list (at a whopping $9.00 on Amazon), *Mining the Social Web* was the most recommended resource when I was Googling how to learn web mining.

Topics covered:

Mining Twitter

Mining Facebook

Mining LinkedIn

Mining Web Pages

Mining Mailboxes

Mining GitHub

Mining the Semantically Marked Up Web

__Interactive Data Visualization for the Web__, Scott Murray

Data visualization was what first piqued my interest in data science, and I’ve been itching to get my hands dirty with D3.js for a long time. This free, online resource was the top comment on some Reddit thread, and Reddit’s never been wrong… right?

Topics covered:

Drawing with data

Scales

Axes

Updates, transitions, and motion

Interactivity

Layouts

Geomapping

Exporting

# Getting Real: Projects

While my intent with reverse engineering a data science masters was to build a solid foundation in the math and theory needed to excel in this field, I don’t think any educational endeavor is complete without hands-on work.

One of my New Year’s resolutions for 2019 was to build a data science portfolio. While I work through these textbooks and online tutorials, I want to be working towards my goal of creating real-world data science projects.

First, I want to complete a project that forces me to work with really, really messy data. I want to use both social web mining and SQL to gather data, and then leverage my stats knowledge and communications background to write some kind of report.

If it’s not inherent in the project above, I want to complete a data visualization project or two that shows off my D3.js skills, and potentially forces me to learn CSS for real and build a fancy website for it.

I have no idea what I’ll end up working on, but I want to hold myself to the goal of completing 3 data science projects before 2020.

# Wrapping It Up

This is my pact with you, anonymous Internet reader. Over the next year, I plan on working through these resources and completing my made-up data science masters. What I’m truly excited about is that the completion of this program won’t be a degree, but a real-world portfolio of projects and experience.

I’ll blog my way through the vicissitudes of my journey. By writing all this out for you, I hope that you will hold me accountable to my goals. See you next week when I dive into stats!