- Store the samples in column-order: every column is one sample, and the number of columns is equal to the batch size.
- If you store the samples row by row, the feedforward operation is simple: Y = X * W, where W is the weight matrix. However, that layout is extremely inconvenient on GPUs (cuBLAS expects column-major matrices, and cuDNN describes tensors with its own layout descriptors such as NCHW).
- So if you store the samples column by column, the feedforward operation becomes Y = W * X, where W and X are the transposes of their row-wise counterparts. Not that much different.
- A matrix should have at least 4 dimensions: number of samples x width x height x depth, so each sample has width x height x depth elements. This gives enough flexibility to implement feedforward nets, ConvNets and RNNs (which is pretty much everything). For video, we would probably need one more dimension. The lesson: many people have tried to build N-dimensional matrices, but it is extremely difficult to stay general when the number of dimensions is not fixed in advance, and that in turn makes the code difficult to optimize. Meanwhile, you rarely need more than 5 dimensions to implement DL algorithms.
- Rethink the edge and layer design.
- Rethink the metrics design. The metrics could probably be implemented as a layer that can be activated or deactivated during training and testing.
- Have separate flags for input and output layers.
- The model computes the gradients, the gradients are then adjusted by an adjuster (Adagrad, Momentum, RMSprop, etc.), and the model is updated based on the adjusted gradients.
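
As a rough illustration of the last point, here is a minimal Python sketch of the "compute, adjust, update" split. All names in it (`Momentum`, `Adagrad`, `adjust`, `QuadraticModel`, `train`) are made up for this example and not taken from any existing framework, and the toy model with loss 0.5 * sum(w^2) is only an assumption to keep the sketch self-contained.

```python
import math


class Momentum:
    """Hypothetical adjuster: classical momentum."""

    def __init__(self, lr=0.1, mu=0.9):
        self.lr, self.mu = lr, mu
        self.velocity = None  # allocated lazily, one entry per parameter

    def adjust(self, grads):
        if self.velocity is None:
            self.velocity = [0.0] * len(grads)
        # v <- mu * v + lr * g; the velocity is the step to subtract.
        self.velocity = [self.mu * v + self.lr * g
                         for v, g in zip(self.velocity, grads)]
        return list(self.velocity)


class Adagrad:
    """Hypothetical adjuster: per-parameter scaling by accumulated squared gradients."""

    def __init__(self, lr=0.1, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.accum = None

    def adjust(self, grads):
        if self.accum is None:
            self.accum = [0.0] * len(grads)
        self.accum = [a + g * g for a, g in zip(self.accum, grads)]
        return [self.lr * g / (math.sqrt(a) + self.eps)
                for g, a in zip(grads, self.accum)]


class QuadraticModel:
    """Toy model (assumption): loss = 0.5 * sum(w^2), so the gradient is w itself."""

    def __init__(self, weights):
        self.weights = list(weights)

    def gradients(self):
        return list(self.weights)

    def update(self, steps):
        # The model only applies the adjusted steps; it knows nothing about the adjuster.
        self.weights = [w - s for w, s in zip(self.weights, steps)]


def train(model, adjuster, iterations):
    for _ in range(iterations):
        grads = model.gradients()        # 1. model computes raw gradients
        steps = adjuster.adjust(grads)   # 2. adjuster rescales them
        model.update(steps)              # 3. model applies the adjusted steps
    return model
```

With this split, switching optimizers is a one-argument change, e.g. `train(QuadraticModel([1.0, -2.0]), Adagrad(), 200)` instead of passing `Momentum()`.
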
I have had the chance to implement DL frameworks from scratch twice in my life. It would get boring if I had to do it 10 times, but doing it 1 or 2 more times is probably still OK.
This list is nowhere near exhaustive. These are just some ideas off the top of my head, so that I get it right from the beginning next time. Some points here are already implemented in my systems.
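
To make the storage points from the list more concrete, here is a small pure-Python sketch of a fixed 4-dimensional tensor that keeps one sample per column of a flat, column-major buffer. The names `Tensor4D` and `feedforward` are hypothetical, and the ordering inside a sample (width fastest, then height, then depth) is just one possible assumption. A real implementation would hand the buffer to a BLAS routine instead of looping.

```python
class Tensor4D:
    """Hypothetical fixed-rank tensor: samples x width x height x depth."""

    def __init__(self, num_samples, width, height, depth):
        self.num_samples = num_samples
        self.width, self.height, self.depth = width, height, depth
        self.sample_size = width * height * depth  # rows of the matrix view
        self.data = [0.0] * (self.sample_size * num_samples)

    def offset(self, n, w, h, d):
        # Assumed ordering inside a sample: width varies fastest, then height, then depth.
        row = w + self.width * (h + self.height * d)
        # Column-major storage: column n (sample n) starts at n * sample_size.
        return row + self.sample_size * n

    def sample(self, n):
        # Each sample is one contiguous column of the buffer.
        start = n * self.sample_size
        return self.data[start:start + self.sample_size]

    def set_sample(self, n, values):
        assert len(values) == self.sample_size
        start = n * self.sample_size
        self.data[start:start + self.sample_size] = [float(v) for v in values]


def feedforward(weights, x):
    """Compute Y = W * X, where `weights` is a list of rows (outputs x sample_size)."""
    y = Tensor4D(x.num_samples, len(weights), 1, 1)
    for n in range(x.num_samples):
        column = x.sample(n)
        y.set_sample(n, [sum(w * v for w, v in zip(row, column))
                         for row in weights])
    return y
```

Because the rank is fixed at 4, `offset` is a single closed-form expression, which is the kind of simplification that a fully N-dimensional design gives up.
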