# Random thoughts on topological sort, backpropagation and stuff you learn in school

I learned the topological sort algorithm about 10 years ago in my undergrad, and to me it seemed pretty bizarre. I remember it was in one of the Introduction to Algorithms modules, where we were dealing with tree and graph algorithms: binary search trees, red-black trees and so on. Then suddenly, out of nowhere, we got to learn topological sort: an algorithm for ordering partially ordered sets.

Erhh... what sets were those?

I mean, I got red-black trees: they are beautiful, and even though you will probably never need to implement one directly in your whole career, learning them will improve your algorithmic mindset. But giving an order to a set of things that might not have a well-defined total order? Sorry, but how useful is that going to be?

It turns out the algorithm is not particularly fancy. Like most textbook algorithms, it is elegant to some degree, but in a module that taught you dozens of them, it was difficult to pay any particular attention to topological sort alone.

Now, in retrospect, I think topological sort is the single algorithm from those modules that I have used most often. Okay, maybe I have used quicksort more often, but I never need to re-implement it (except in interviews, which don't count).

Imagine you want to run backpropagation in a neural network with multiple layers. It would be easy if your network were a simple chain of layers, but if you allow a layer to have multiple inputs (as in multi-modal nets), how are you going to do it in a generic way? The gradient of a layer can only be computed once the gradients of everything that consumes its output are available, which is exactly a dependency-ordering problem.

Another example is when you have a pipeline of operations, be it a business workflow, a data flow, etc., where every operation might depend on other operations. The generic way to figure out an order in which to execute those operations is topological sort.
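To make the idea concrete, here is a minimal sketch of Kahn's algorithm, which repeatedly picks a node whose dependencies have all been processed. The layer names in the toy graph are made up for illustration:

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn's algorithm. `edges` maps a node to the nodes that
    depend on it; raises if the graph contains a cycle."""
    indegree = {n: 0 for n in nodes}
    for src in edges:
        for dst in edges[src]:
            indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for dst in edges.get(n, []):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order

# A toy "multi-input layer" graph: `concat` depends on both branches.
layers = ["input", "conv", "embed", "concat", "output"]
deps = {"input": ["conv", "embed"], "conv": ["concat"],
        "embed": ["concat"], "concat": ["output"]}
print(topological_sort(layers, deps))
```

The same loop gives you a valid execution order for a pipeline, and running it over the reversed edges gives a valid order for backpropagation.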

If you have been using TensorFlow: yes, it implements this for its graphs. In TensorFlow there are several different kinds of connection, an Op can be an input to other Ops, or its execution may simply need to happen before some other Ops (when some variables need to be initialized, for instance), but the general idea doesn't change.

If there is any textbook algorithm that I keep re-implementing again and again, it is topological sort, and I couldn't be more thankful to the professors who taught me the fundamentals.

In fact, I realize it's super difficult to fully anticipate the beauty and potential applications of the stuff they taught you in school. I am pretty sure most of it is useless (I got to learn about XML databases and object-oriented databases somewhere in my education), but some of it will turn out to be unexpectedly valuable at some point in your career. This applies to many things you learn outside of school as well. The trick is knowing what to learn to maximize the payoff in the long run.

Unfortunately, I don't know the answer to that. My strategy so far is to learn as much as I can, from as many people as I can. We will see how far that goes (although I suspect it won't work for too long).

Before you ask: the next most used textbook technique on my list is state machines. When I was implementing my toy compiler around a decade ago, I didn't imagine I would need to write them again and again for the various kinds of parsers I have built since.
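As an illustration of the kind of hand-rolled state machine that keeps showing up in simple parsers, here is a minimal sketch; the states and token types are invented for the example:

```python
def tokenize(text):
    """A tiny state machine that splits a string into number and
    word tokens. States: "start", "number", "word"."""
    tokens, buf, state = [], "", "start"
    for ch in text + "\0":  # sentinel character flushes the last token
        # Transition out of an accepting state: emit the buffered token.
        if state == "number" and not ch.isdigit():
            tokens.append(("NUM", buf))
            buf, state = "", "start"
        elif state == "word" and not ch.isalpha():
            tokens.append(("WORD", buf))
            buf, state = "", "start"
        # Transition into a state, or stay and accumulate.
        if state == "start":
            if ch.isdigit():
                buf, state = ch, "number"
            elif ch.isalpha():
                buf, state = ch, "word"
        else:
            buf += ch
    return tokens

print(tokenize("abc 42 xyz"))
```

Real lexers track more states and handle errors, but the shape of the loop, a current state plus a transition on each input symbol, stays the same.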

And of course Viterbi: I had a good time playing around with various aspects of it.

# Yes, you should tune your pesky learning rate

For a ConvNet I trained recently, these are the learning curves when using the Adam optimizer with an initial learning rate of 0.01:

When using traditional SGD with an initial learning rate of 0.01, momentum of 0.9, and the learning rate decayed every 3 epochs with a decay rate of 0.96, the learning curves become:
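For concreteness, the decay schedule described above can be written as a staircase function; whether the original experiment used staircase or smooth decay is my assumption:

```python
def staircase_lr(initial_lr, epoch, decay_epochs=3, decay_rate=0.96):
    """Multiply the learning rate by `decay_rate` once every
    `decay_epochs` epochs (the schedule described in the text)."""
    return initial_lr * decay_rate ** (epoch // decay_epochs)

for epoch in (0, 3, 6, 30):
    print(epoch, staircase_lr(0.01, epoch))
```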

I hope you can see the drastic difference: with momentum we got a 10% error rate after 5 epochs, while with Adam we got a ~30% error rate after 20 epochs.

Now it might happen that Adam works better if we add fancy stuff like Batch Norm and the like to the network, which I didn't try. However, with everything else being equal, it feels to me that Adam was a bit too aggressive in decreasing the effective learning rate, which makes progress slow after a while.
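For reference, this is Adam's per-parameter update as given in the original paper, which is where the per-parameter scaling of the step size comes from: with gradient $g_t$ and decay rates $\beta_1, \beta_2$,

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

So the nominal learning rate $\alpha$ is divided by a running estimate of the gradient magnitude, which is why large gradients shrink the effective step size.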

Since the learning rate is more strongly regulated in Adam, perhaps we can be more lax in setting the initial learning rate? These are the learning curves for Adam with an initial learning rate of 0.1:

It went wild eventually.

But with an initial learning rate of 0.001, Adam gives this:

It is much better now.

Over the years, momentum with a decaying learning rate has been my first choice for tuning. I sometimes use Adagrad/RMSProp/Adam for cross-checking, but the best results are usually found with momentum, often with fewer training epochs.

The take-away message is that you should really tune your learning rate, hard. It is still one of the most important hyper-parameters. Although methods like Adam/Adagrad might give the impression that tuning the learning rate is easy, in fact it is very problem-dependent. Momentum has more knobs to tune, but when used wisely it is very flexible and powerful. Often you will end up in the same ballpark with any of those optimizers.

# Gradient-based Learning, or Optimization applied in ML

The training algorithms of many machine learning models are formally stated as an optimization problem; examples include linear/logistic regression, ANNs, SVMs, k-means... The common idea behind these formulations is that every model has some set of parameters $\theta$ whose values must be determined through a training process on data.

Long before ML took off, statisticians called this process parameter estimation, in which the parameter values are chosen so that they optimize some criterion decided by the model designer. Commonly used criteria are the likelihood, the posterior, or the mean squared error, corresponding to the methods MLE, MAP/EAP and MMSE respectively. We will (hopefully) have a separate post about these techniques, but in general they express the model designer's "belief" about what the best model looks like; there is no formal proof that one criterion is the right one. Still, the idea to take away is that we pick some criterion, then choose parameter values that optimize it. Optimization is clearly the field that provides the most solid theoretical foundation for this problem, so applying optimization to learn statistical models is an entirely natural idea.

For parametric models, the approach above is very natural. Although ML nowadays also studies nonparametric models, parametric models still play an extremely important role, so understanding their common principles is worthwhile. This post is about parameter estimation methods that use optimization techniques, collectively called gradient-based learning. Specifically, it covers gradient descent, stochastic gradient descent and minibatch gradient descent, together with some regularization techniques. Details of these algorithms can also be found in Andrew Ng's lectures in the Machine Learning class on Coursera.
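As a tiny preview of gradient descent, here is a sketch that fits a 1-D linear regression by minimizing the mean squared error; the data and hyper-parameters are made up for the example:

```python
import random

# Synthetic, noiseless data from a known line: w = 3.0, b = 0.5.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [3.0 * x + 0.5 for x in xs]

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach 3.0 and 0.5
```

Stochastic and minibatch gradient descent follow the same loop, except each step computes the gradient over one example or a small batch instead of the whole dataset.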