Chapter 2

Linear Algebra

Linear algebra is a branch of mathematics that is widely used throughout science

and engineering. However, because linear algebra is a form of continuous rather

than discrete mathematics, many computer scientists have little experience with it.

A good understanding of linear algebra is essential for understanding and working

with many machine learning algorithms, especially deep learning algorithms. We

therefore precede our introduction to deep learning with a focused presentation of

the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If

you have previous experience with these concepts but need a detailed reference

sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and

Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter

will teach you enough to read this book, but we highly recommend that you also

consult another resource focused exclusively on teaching linear algebra, such as

Shilov (1977). This chapter will completely omit many important linear algebra

topics that are not essential for understanding deep learning.

2.1 Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:

• Scalars

: A scalar is just a single number, in contrast to most of the other

objects studied in linear algebra, which are usually arrays of multiple numbers.

We write scalars in italics. We usually give scalars lower-case variable names.

When we introduce them, we specify what kind of number they are. For

31

CHAPTER 2. LINEAR ALGEBRA

example, we might say “Let

s ∈ R

be the slope of the line,” while deﬁning a

real-valued scalar, or “Let

n ∈ N

be the number of units,” while deﬁning a

natural number scalar.

• Vectors

: A vector is an array of numbers. The numbers are arranged in

order. We can identify each individual number by its index in that ordering.

Typically we give vectors lower case names written in bold typeface, such

as

x

. The elements of the vector are identiﬁed by writing its name in italic

typeface, with a subscript. The ﬁrst element of

x

is

x

1

, the second element

is

x

2

and so on. We also need to say what kind of numbers are stored in

the vector. If each element is in

R

, and the vector has

n

elements, then the

vector lies in the set formed by taking the Cartesian product of

R n

times,

denoted as

R

n

. When we need to explicitly identify the elements of a vector,

we write them as a column enclosed in square brackets:

x =

x

1

x

2

.

.

.

x

n

. (2.1)

We can think of vectors as identifying points in space, with each element

giving the coordinate along a diﬀerent axis.

Sometimes we need to index a set of elements of a vector. In this case, we

deﬁne a set containing the indices and write the set as a subscript. For

example, to access

x

1

,

x

3

and

x

6

, we deﬁne the set

S

=

{

1

,

3

,

6

}

and write

x

S

. We use the

−

sign to index the complement of a set. For example

x

−1

is

the vector containing all elements of

x

except for

x

1

, and

x

−S

is the vector

containing all of the elements of x except for x

1

, x

3

and x

6

.

• Matrices

: A matrix is a 2-D array of numbers, so each element is identiﬁed

by two indices instead of just one. We usually give matrices upper-case

variable names with bold typeface, such as

A

. If a real-valued matrix

A

has

a height of

m

and a width of

n

, then we say that

A ∈ R

m×n

. We usually

identify the elements of a matrix using its name in italic but not bold font,

and the indices are listed with separating commas. For example,

A

1,1

is the

upper left entry of

A

and

A

m,n

is the bottom right entry. We can identify all

of the numbers with vertical coordinate

i

by writing a “:” for the horizontal

coordinate. For example,

A

i,:

denotes the horizontal cross section of

A

with

vertical coordinate

i

. This is known as the

i

-th

row

of

A

. Likewise,

A

:,i

is

32

CHAPTER 2. LINEAR ALGEBRA

A =

A

1,1

A

1,2

A

2,1

A

2,2

A

3,1

A

3,2

⇒ A

=

A

1,1

A

2,1

A

3,1

A

1,2

A

2,2

A

3,2

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the

main diagonal.

the i-th column of A. When we need to explicitly identify the elements of

a matrix, we write them as an array enclosed in square brackets:

A

1,1

A

1,2

A

2,1

A

2,2

. (2.2)

Sometimes we may need to index matrix-valued expressions that are not just

a single letter. In this case, we use subscripts after the expression, but do

not convert anything to lower case. For example,

f

(

A

)

i,j

gives element (

i, j

)

of the matrix computed by applying the function f to A.

• Tensors

: In some cases we will need an array with more than two axes.

In the general case, an array of numbers arranged on a regular grid with a

variable number of axes is known as a tensor. We denote a tensor named “A”

with this typeface:

A

. We identify the element of

A

at coordinates (

i, j, k

)

by writing A

i,j,k

.

One important operation on matrices is the

transpose

. The transpose of a

matrix is the mirror image of the matrix across a diagonal line, called the

main

diagonal

, running down and to the right, starting from its upper left corner. See

ﬁgure 2.1 for a graphical depiction of this operation. We denote the transpose of a

matrix A as A

, and it is deﬁned such that

(A

)

i,j

= A

j,i

. (2.3)

Vectors can be thought of as matrices that contain only one column. The

transpose of a vector is therefore a matrix with only one row. Sometimes we

33

CHAPTER 2. LINEAR ALGEBRA

deﬁne a vector by writing out its elements in the text inline as a row matrix,

then using the transpose operator to turn it into a standard column vector, e.g.,

x = [x

1

, x

2

, x

3

]

.

A scalar can be thought of as a matrix with only a single entry. From this, we

can see that a scalar is its own transpose: a = a

.

We can add matrices to each other, as long as they have the same shape, just

by adding their corresponding elements: C = A + B where C

i,j

= A

i,j

+ B

i,j

.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just

by performing that operation on each element of a matrix:

D

=

a · B

+

c

where

D

i,j

= a · B

i,j

+ c.

In the context of deep learning, we also use some less conventional notation.

We allow the addition of matrix and a vector, yielding another matrix:

C

=

A

+

b

,

where

C

i,j

=

A

i,j

+

b

j

. In other words, the vector

b

is added to each row of the

matrix. This shorthand eliminates the need to deﬁne a matrix with

b

copied into

each row before doing the addition. This implicit copying of

b

to many locations

is called broadcasting.

2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two

matrices. The

matrix product

of matrices

A

and

B

is a third matrix

C

. In

order for this product to be deﬁned,

A

must have the same number of columns as

B

has rows. If

A

is of shape

m × n

and

B

is of shape

n × p

, then

C

is of shape

m × p

. We can write the matrix product just by placing two or more matrices

together, e.g.

C = AB. (2.4)

The product operation is deﬁned by

C

i,j

=

k

A

i,k

B

k,j

. (2.5)

Note that the standard product of two matrices is not just a matrix containing

the product of the individual elements. Such an operation exists and is called the

element-wise product or Hadamard product, and is denoted as A B.

The

dot product

between two vectors

x

and

y

of the same dimensionality

is the matrix product

x

y

. We can think of the matrix product

C

=

AB

as

computing C

i,j

as the dot product between row i of A and column j of B.

34

CHAPTER 2. LINEAR