Guide to studying ColumbiaX Machine Learning on edX

Posted on Jan 17, 2018

If you’re planning on signing up for the Machine Learning course by Columbia University, or if you’ve already begun but now find yourself drowning in the deep end, this post is for you.

In September 2017, I signed up for CSMM.102x, a Machine Learning course offered by Columbia University on edX and taught by Prof. Paisley. It was the second online ML course I took that year, and I was looking forward to it. From reading the reviews, I had gathered that this course would be mathematically rigorous, and I was (naively) excited about this since the first course I'd done (the Stanford course by Prof. Ng on Coursera) had left me wanting more when it came to the underlying theory of ML. So I gleefully signed up for CSMM.102x on edX, and the following 3 months left me bruised, dazed and (eventually) fulfilled.

This course is essentially a graduate-level course and therefore has a list of prerequisites that is not insignificant. Despite having mastered Stanford's Machine Learning course (as well as studying Linear Algebra) just a few months earlier, I ended up having to cram heavily to do justice to this course. Over a period of 3 months, I spent 30-40 hours per week making progress in the course while also catching up on the prerequisites. If you find yourself similarly under-prepared and are up for the challenge, this post outlines the prerequisites you'll need to cover, as well as the chronological sequence in which these prerequisites become… requisite.

Disclaimer: I was being a bit obsessive and perfectionist about this course (I wanted to completely understand everything that was taught, and I wanted to score high), and moreover I had plenty of time to dedicate to the task. If your situation is different, you may choose to simply aim for a passing grade with somewhat less time and effort.

Prerequisites

Here’s what you need to know/study in order to grok this course:

  • Probability and Statistics: You need a firm grasp of expectations, variances and the meaning of discrete and continuous probability distributions. You need to be comfortable with conditional probability and Bayes' rule, which are core concepts in ML (there's a small worked example of Bayes' rule after this list). The Gaussian distribution is mentioned everywhere, and other distributions like the Binomial and Poisson make appearances too. You should also study Markov chains in advance, although Prof. Paisley covers them briefly too.
    • Before I even began the course, I watched MIT's Probabilistic Systems Analysis videos and did the class assignments. It took me 2 weeks of intense studying, but I'm glad I did it. You could skip or skim the topics on permutations and combinations, but the rest is worthwhile.
  • Linear Algebra: You need familiarity with vectors, transformation matrices, null spaces, eigenvalues — in fact everything up to and including Singular Value Decomposition (SVD). Singular matrices turn up quite a bit, and SVD is used as a tool to explain the magic behind certain ML techniques.
    • A few months earlier, I had binge-watched all of Khan Academy's Linear Algebra videos (which I loved) and also dug up some tutorials on Principal Component Analysis and SVD. So at the start of this course, I just did a quick refresher, which was sufficient (although I kept wanting to redo all of Khan Academy).
  • Multivariate Calculus: The math of Machine Learning often involves taking derivatives, in fact derivatives of vectors, to do things like finding minima/maxima, so it's good to have familiarity with gradients and partial derivatives in addition to the Calculus you studied in college. I will say, though, that you could get by for a while on basic Calculus alone, since extending the concept to vectors is mostly fairly intuitive. Lagrangian optimization and solving the dual problem become important at some point (see the small worked example after this list). Quadratic forms show up often, and there are mentions of Jacobian and Hessian matrices too.
    • While making progress with this ML edX course, I also spent some time every evening watching videos from Khan Academy's Multivariate Calculus course, which include nice 3D animated visualizations to help build the intuition. The first three sections there should suffice.
  • Python and Numpy: Although the course allows you to use Octave or Matlab instead of Python for the projects, the skeleton code is provided in Python, so that's what I recommend. The course assumes you are already familiar with programming in Python and with the Numpy library. (I had wrongly assumed that the course would teach me these, and only belatedly realized that I was on my own.) There are easy mistakes to make with Numpy (e.g. inner vs outer vs dot products) if you're not careful, so this may not be a cakewalk even if you're experienced in other languages; see the short Numpy sketch after this list.
    All said, however, I feel like this course cares more about whether you can implement the math than it does about whether you can code the algorithm.

    • I’m generally a good programmer, but I was new to Python. Reading a tutorial or two (e.g. the CS231n Python Numpy tutorial) got me functional enough to do the projects. My familiarity with handling matrices in Octave certainly helped.
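To make the Bayes' rule prerequisite concrete, here is the statement of the rule along with a quick worked example. The numbers below are my own illustration, not from the course:

```latex
% Bayes' rule: posterior = likelihood * prior / evidence
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

% Illustrative example: a disease with 1% prevalence, a test with
% 95% sensitivity and a 5% false-positive rate. The probability of
% actually having the disease given a positive test is only ~16%:
P(D \mid +) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99}
            = \frac{0.0095}{0.059} \approx 0.16
```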
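Since Lagrangian optimization was the prerequisite that intimidated me most, here is a minimal worked example of the technique (again my own illustration, not course material): maximize f(x, y) = xy subject to the constraint x + y = 10.

```latex
% Form the Lagrangian and set its partial derivatives to zero:
\mathcal{L}(x, y, \lambda) = xy - \lambda\,(x + y - 10)

\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 10) = 0

% The first two conditions give x = y; the constraint then gives:
x = y = \lambda = 5, \qquad f(5, 5) = 25
```

The same recipe (differentiate the Lagrangian, set the gradient to zero, solve) is what powers the constrained optimization you'll meet later in the course.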
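And here is a small sketch of the Numpy products that are easy to mix up, as mentioned above. This is my own illustration (assuming any recent Numpy version), not code from the course projects:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a * b)           # element-wise product: [ 4. 10. 18.]
print(np.dot(a, b))    # dot (inner) product, a scalar: 32.0
print(np.inner(a, b))  # same as np.dot for 1-D arrays: 32.0
print(np.outer(a, b))  # outer product, a 3x3 matrix

# With 2-D arrays the pitfalls multiply: * is still element-wise,
# while np.dot (or the @ operator) is true matrix multiplication.
A = np.arange(4).reshape(2, 2)
B = np.eye(2)
print(A * B)   # element-wise: [[0. 0.] [0. 3.]]
print(A @ B)   # matrix multiplication, equal to np.dot(A, B)
```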

The textbooks

The two so-called “intro level” textbooks recommended as references for the course are not “intro level” at all! (Duh!) Rather, they dive deep mathematically into topics that even Prof. Paisley doesn’t touch in this course! In fact, one should look at this course as a guide to studying the textbooks rather than the other way around. Seen this way, Prof. Paisley does quite a fair job of hand-holding us through the material, covering the important bits and skipping the excessive, and I highly recommend this course for that reason if no other. Having completed the course successfully, one is then perhaps finally ready to read through the textbooks and understand most of what's in them.

Elements of Statistical Learning by Hastie et al. – This is a good and important book, but it is quite unapologetic about the math, and it includes original research and advanced topics that are probably better skipped on a first read.

Pattern Recognition and Machine Learning by Bishop – This is a somewhat more down-to-earth textbook, although it doesn’t spare you the math either. Unfortunately, the nomenclature in this book differs from Hastie’s book and from what Prof. Paisley tends to use, so it can be a little frustrating to read.

The course page conveniently lists the chapters/sections of these books that are to be referenced for each week’s lectures. However, I would say reading the textbooks is optional. I tried to read or skim the relevant sections after each week’s lecture, but my experience was mixed – sometimes the books helped clarify the big picture for me (aha moments!), and other times they just muddied my understanding by making me forget what was taught in the lectures.

Chronology of Prerequisites

If you plan on eating your daily intake of green prerequisites in parallel with keeping up with this course (e.g. if your course has already begun), then it might be useful to know what you need to know, and in what order.

Below is the sequence in which the various prerequisite topics are first invoked:

Week | Probability and Statistics                            | Linear Algebra                       | Multivariate Calculus
2    | Expectations, variance; Bayes rule; Beta distribution | SVD                                  |
3    |                                                       |                                      | Lagrange multipliers
8    | (Kullback-Leibler divergence)                         |                                      |
9    | Dirichlet distribution; Poisson distribution          |                                      |
10   | Markov chains                                         | PCA; positive semi-definite matrices |
12   |                                                       |                                      | Taylor expansion

*work in progress*

My learning pattern

Over the weeks, I settled into my own optimal pattern of studying this ML course.

Each week is composed of two separate “lectures” and a single quiz at the end. I tackle the first lecture for 2-3 days before moving on to the second lecture for another 2-3 days, and then I do the quiz and the project (if any).

  • For every lecture, I first watch all the videos, pausing often to make sure I understand the main content and follow the math, searching online for any important concepts or terms mentioned that I’m unfamiliar with. I jot down any questions that pop into my head.
  • The next day, I re-watch all the videos, making sure I understand every word, and I ponder not only the “what” and “how” but also the “why”. I then summarize the entire lecture in one or two paragraphs while referencing the slides. If my earlier questions are still unanswered, I head to the discussion board.
  • If I have time for it, I read or skim the relevant sections of the recommended textbooks (usually ESL). I don’t attempt to absorb all the information there; I just focus on the gist or the big picture and call it a day.
  • I repeat this process for the second lecture of the week.
  • Before taking the week’s quiz, I re-watch all the above videos at 2x speed and maybe re-read my summary notes. (This is probably overkill, but I liked the feeling of wrapping up the week this way.)
  • The Quiz: <sigh> We need to talk about the quiz. See, despite all your preparations, the quiz will try its best to make you stumble and fall. It. Is. Frustrating. You will get some questions wrong despite knowing the right answer. #dealwithit. I try to make sure I am unhurried during the quiz – I allot a whole hour. I sometimes do further research (browsing the textbook or online resources) if the subject is not sufficiently covered in the lectures. (I gathered from the TAs that the quiz is somewhat “open book” and some questions are designed to get you to think and research deeper.) The good news is that the quiz scores aren’t weighted much, so don’t sweat it like I did.
  • The Project: If there’s a project for the week, I set aside at least a day for it, since I’m new to Python and the project may also require some math to be worked out. If you’re familiar with Python and Numpy, you could probably do it in 1-2 hours.
