Key Machine Learning PreReq: Viewing Linear Algebra through the right lenses

Guest blog by Ashwin Rao. Ashwin is Vice President, Data Science & Optimization at Target.

The tech industry has gone berserk – everyone wants to develop “skills in Machine Learning and AI” but few are willing to put in the hard yards to develop the foundational understanding of the relevant Math and CS – Linear Algebra, Probability Theory, Multivariate Analysis, Data Structures, Algorithms, Optimization, Functional Programming. In the past few years, I have coached a few people specifically in Linear Algebra as I think this is the most important topic to master. I am happy to report that the coaching has gone well – it was relatively pain-free and some actually fell in love with Linear Algebra! So I thought I should write down the highlights of the advice I’ve been giving on how to learn Linear Algebra by viewing it through the right lenses.

  1. Think Sets and Functions, rather than manipulation of number arrays/rectangles: Linear Algebra is often introduced at the high-school level as computations one can perform on vectors and matrices – Matrix multiplication, Gaussian elimination, Determinants, sometimes even Eigenvalue calculations – and I believe this introduction is quite detrimental to one’s understanding of Linear Algebra. This computational approach continues on in many undergrad (and sometimes grad) level courses in Engineering and the Social Sciences. In fact, many Computer Scientists deal with Linear Algebra for decades of their professional life with this narrow (and in my opinion, harmful) view. I believe the right way to learn Linear Algebra is to view vectors as elements of a Set (Vector Space), and matrices as functions from one vector space to another. A vector of n numbers is an element of the vector space R^n, and an m x n matrix is a function from R^n to R^m. Beyond this, all one needs to understand is that vector addition, scalar multiplication and matrix transformation follow the usual linearity properties, i.e., M * (a*u + b*v) = a * (M * u) + b * (M * v) for scalars a, b and vectors u, v. This approach may seem abstract but if married with plenty of visual (2-D and 3-D) examples, it provides the student with the right intuition and prepares her for the more advanced concepts in ways far superior to the computational treatment.
  2. Matrix as a sequence of column vectors: The default view of matrices is typically row-wise because when we multiply a matrix M with a vector v (M * v), we do an inner-product of each row of the matrix with the vector thought of as a column (this intuition extends to matrix multiplication). However, a better view of M is to think of each of its columns as a vector, and M * v is simply a linear combination of M’s column vectors where the scalar multipliers are the elements of v. If you realign your mind to this view as the default view, it will serve you immensely as you navigate advanced topics.
  3. The 4 important subspaces and the Fundamental Theorem: Gilbert Strang has done a great service to the world of Linear Algebra (and Applied Math) by articulating the 4 important subspaces and what he calls the “fundamental theorem” so beautifully in his book (the visual representation of this in his book should be etched in the head of every student). I am referring to the Column Space, the Row Space, the Kernel (which Strang calls the nullspace) and the Co-Kernel (the left nullspace), and their relationships – isomorphism of the Kernel-Quotient Space and Column Space, and the corresponding orthogonality of the spaces (the Kernel is orthogonal to the Row Space, and the Co-Kernel is orthogonal to the Column Space). The rank-nullity theorem is then just a special case of this. I would even go as far as to say that if there is just one thing in Linear Algebra a student should firmly understand, it is this – what Gilbert Strang famously refers to as The Fundamental Theorem of Linear Algebra. Also, this dovetails nicely with the view of matrices as linear functions where the Kernel maps to 0, and the range of the function is the column space. Invertibility of a square matrix can be characterized as having a trivial Kernel, Pseudoinverse can be nicely visualized as inverting the bijective function that the matrix sets up between the row space and the column space, Transpose can be thought of with regard to the Row Space and the Co-Kernel, ... to list just a few of the powerful benefits of the Fundamental Theorem.
  4. Understand Matrix Factorizations as Compositions of “Simple” Functions: Most people will tell you that Linear Algebra is all about various forms of matrix factorizations. While that is true, the usual treatment is to simply teach you the recipe to factorize. This will help you implement the algorithm in code, but it will not teach you the mathematical essence of these factorizations. The purpose of factorization is to split a matrix into “simpler” matrices that have nice mathematical properties (diagonal, triangular, orthogonal, positive-definite etc.), and a general matrix (i.e. linear function) can be viewed as the composition of these “simpler” linear functions. The study of these “simple” linear functions forms the bulk of the analysis in linear algebra because if you have understood these simple functions (canonical matrices), then it’s simply a matter of putting them together (function compositions) to conceptualize arbitrary linear transformations.
  5. View Eigendecomposition (ED) and Singular Value Decomposition (SVD) as rotations and stretches: All great mathematicians will tell you that even the hardest, most abstract topic in Math requires geometric intuition. ED and SVD are probably the most used factorizations in Applied Math (and in real-life problems), but many a student has been frustrated by the opacity and dryness of the treatment in typical courses and books. Picturing them as rotations and stretches is the (in my opinion, only) way to go about understanding them. Eigenvectors are a basis of vectors (independent but not necessarily orthogonal) that the given matrix “purely stretches” (i.e., keeps each one on its own line through the origin), and eigenvalues are the stretch quantities. This makes our life extremely easy, but not all matrices can be ED-ed: the matrix must be square and must have n independent eigenvectors. But fear not – we have SVD that is more broadly usable albeit not as simple/nice as ED. SVD works on ANY rectangular matrix (ED works only for certain square matrices) and involves two different bases of vectors that both turn out to be orthonormal bases (orthogonality is of course very nice to have!). SVD basically tells us that the matrix simply sends one orthonormal basis to the other (modulo stretches), the stretch amounts being known as singular values (appearing on the middle diagonal matrix). So an arbitrary matrix applied on an arbitrary vector will first rotate the vector (as given by the orientation of the first orthonormal basis), then stretch the components of the resultant vector (by the singular values), and finally rotate the resultant vector (as given by the orientation of the second orthonormal basis), where a “rotation” may also include a reflection. There is also a nice connection between ED and SVD: for a matrix M, the squared non-zero singular values are the non-zero eigenvalues of both M^T M and M M^T, and the two orthonormal bases are the eigenvectors of M^T M and of M M^T respectively. There are some neat little animation tools out there that bring ED and SVD to life by vividly presenting the matrix operations as rotations/stretches. Do play with them while you learn this material.
  6. Positive Definite Matrices as Quadratic Forms with a bowl-like visual: People will tell you positive definite matrices (PDMs) are REALLY important but few can explain why they are important and few will go beyond the usual definition: v^T M v > 0 for every non-zero vector v. This definition, although accurate, confuses the hell out of students. The right way to understand PDMs is to interpret v^T M v as a “quadratic form”, i.e., a function (of v) from R^n to R in which every term is a product of exactly two components of v. Secondly, it’s best to graph several examples of quadratic forms for n = 2 (i.e., viewed as a 3-D graph). PDMs are those matrices M for which this graph is a nice bowl-like shape, i.e., the valley is a unique point from which you can only go up. The alternatives are a “flat valley” where you can walk horizontally, or a “saddle valley” from where you can climb up in some directions or climb down in other directions. PDMs are desirable because they are simple and friendly to optimization methods. The other very nice thing about PDMs is that, when they are symmetric (the usual convention), all of their eigenvalues are positive and their eigenvectors can be chosen to be orthogonal.
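
To make lenses 2 and 6 a little more concrete, here is a minimal sketch in plain Python (standard library only). It is only an illustration, not a prescribed method: the function names and the example matrices are my own assumptions, and the bowl/flat/saddle classifier is deliberately limited to symmetric 2 x 2 matrices, where the eigenvalues have a closed form.

```python
import math


def matvec_rows(M, v):
    """Row view: entry i of M * v is the inner product of row i with v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def matvec_columns(M, v):
    """Column view: M * v is a linear combination of M's columns,
    with the elements of v as the scalar multipliers."""
    result = [0.0] * len(M)
    for j, v_j in enumerate(v):
        column_j = [row[j] for row in M]
        result = [r_i + v_j * c_i for r_i, c_i in zip(result, column_j)]
    return result


def quadratic_form(M, v):
    """q(v) = v^T M v, a function from R^n to R."""
    return sum(v_i * w_i for v_i, w_i in zip(v, matvec_rows(M, v)))


def classify_symmetric_2x2(M, tol=1e-12):
    """Name the shape of the graph of v^T M v for a symmetric 2 x 2 matrix,
    using the closed-form eigenvalues of [[a, b], [b, d]]."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    smallest, largest = mean - radius, mean + radius
    if smallest > tol:
        return "bowl"          # positive definite
    if abs(smallest) <= tol and largest > tol:
        return "flat valley"   # positive semi-definite, not definite
    if smallest < -tol and largest > tol:
        return "saddle"        # indefinite
    return "other"             # e.g. negative definite (an upside-down bowl)


M = [[1, 2], [3, 4], [5, 6]]    # a function from R^2 to R^3
v = [10, 1]
print(matvec_rows(M, v), matvec_columns(M, v))   # both views agree

for name, Q in [("bowl", [[2, 1], [1, 2]]),
                ("flat", [[1, 1], [1, 1]]),
                ("saddle", [[1, 0], [0, -1]])]:
    print(name, classify_symmetric_2x2(Q), quadratic_form(Q, [1, -1]))
```

The two `matvec_*` functions compute the same vector; the only difference is which picture of M you hold in your head while computing it. Plotting `quadratic_form` over a grid of 2-D vectors for each of the three example matrices is a quick way to see the bowl, the flat valley and the saddle.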
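
Lenses 4 and 5 can be sketched in the same spirit. The snippet below (again plain Python, with hypothetical helper names and arbitrarily chosen angles and stretch amounts) builds a 2 x 2 matrix as the composition "rotate, stretch, rotate", then recovers the stretch amounts as the square roots of the eigenvalues of M^T M. In practice you would call a numerical library for SVD; this is only meant to show the geometry on a case small enough to compute by hand.

```python
import math


def rotation(theta):
    """2 x 2 matrix that rotates the plane counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transpose(A):
    return [list(column) for column in zip(*A)]


def matmul(A, B):
    """Matrix product A * B, i.e. the composition 'apply B, then A'."""
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
            for row in A]


def singular_values_2x2(M):
    """Singular values of a 2 x 2 matrix, computed as the square roots of the
    eigenvalues of the symmetric matrix M^T M (largest first)."""
    G = matmul(transpose(M), M)
    a, b, d = G[0][0], G[0][1], G[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))


# Compose "rotate, stretch, rotate": M = U * S * V^T.
U = rotation(math.radians(30))   # orientation of the output basis
S = [[3.0, 0.0], [0.0, 0.5]]     # stretch amounts (the singular values)
V = rotation(math.radians(70))   # orientation of the input basis
M = matmul(matmul(U, S), transpose(V))

sigma_1, sigma_2 = singular_values_2x2(M)
print(sigma_1, sigma_2)          # close to 3 and 0.5, up to rounding

# M sends the first input basis vector to 3 times the first output basis vector.
v_1 = [row[0] for row in V]
u_1 = [row[0] for row in U]
M_v_1 = [sum(m * x for m, x in zip(row, v_1)) for row in M]
print(M_v_1, [3.0 * x for x in u_1])
```

Changing the two angles leaves the recovered singular values untouched, which is the point: the rotations only decide which directions get stretched, while the diagonal matrix in the middle decides by how much.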

In fact, these six concepts learnt through the lenses I described serve as a quick introduction to Linear Algebra that prepares you for learning Machine Learning, Optimization, AI or more generally, Applied Math. You will of course run into other Linear Algebra details in the process of learning these Applied Math topics, but once grounded in these foundations, you will pick up those details on the fly pretty quickly (e.g., wiki will then be your great friend!).

Originally posted here