NumPy

This tutorial was contributed by Kevin Tang

Credits: Much of this was copied or inspired by https://github.com/donnemartin/data-science-ipython-notebooks

This tutorial assumes that the reader is familiar with Python at a CS 1 level. It is meant to introduce NumPy and Matplotlib, as well as Jupyter notebooks, as a way to tackle machine learning problems.

NumPy is a Python package that makes numerical computing easy. We import it as follows:

In [1]:
import numpy as np

The most basic data type is the array. The array is akin to the matrix in MATLAB in that it is used for basically everything in NumPy. As in MATLAB, an array can only contain one datatype. We can create arrays as follows:

In [2]:
# Row Vector (or a 1D array)
a = np.array([1, 2, 3])
print(a)
print(a.dtype)
print(a.shape)
[1 2 3]
int32
(3,)
In [3]:
# Column vector
b = np.array([[10.],[20.],[30.]])
print(b)
print(b.dtype)
print(b.shape)
[[ 10.]
 [ 20.]
 [ 30.]]
float64
(3, 1)

Some very useful array-creation functions include np.zeros, np.ones, np.full, np.eye, and np.random.random. Look these up and see what they do, or play around with them.

In [4]:
c = np.eye(3)
print(c)

d = np.random.random((3,3))
print(d)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
[[ 0.97431746  0.88928946  0.63610583]
 [ 0.03358437  0.57664413  0.02183867]
 [ 0.31624219  0.76387628  0.84842094]]
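As a quick sketch of what the other creation functions produce (the shapes and fill values below are chosen arbitrarily for illustration):

```python
import numpy as np

# zeros/ones/full create arrays of a given shape with a fixed fill value
z = np.zeros((2, 3))        # 2x3 array of 0.0
o = np.ones((2, 3))         # 2x3 array of 1.0
f = np.full((2, 3), 7)      # 2x3 array of 7

# eye builds an identity matrix; random.random samples uniformly from [0, 1)
i = np.eye(4)               # 4x4 identity
r = np.random.random((4, 4))

print(z.shape)    # (2, 3)
print(i.trace())  # 4.0, the sum of the diagonal
```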

Array indexing and broadcasting

We can index arrays much like we index lists in Python (or matrices in MATLAB). Slicing is really cool too. Just remember that Python is 0-indexed, unlike MATLAB.

In [5]:
e = np.arange(1,10).reshape((3,3))
print(e)
print(e[1, 2])
print(e[0][1])
[[1 2 3]
 [4 5 6]
 [7 8 9]]
6
2
In [6]:
# slicing
print("Row: ", e[2])
print("Col: ", e[:,2])
Row:  [7 8 9]
Col:  [3 6 9]
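One subtlety worth knowing: basic slices are views into the original array, not copies, so writing through a slice modifies the array it came from. A minimal sketch, using a throwaway array g:

```python
import numpy as np

g = np.arange(1, 10).reshape((3, 3))

row = g[0]          # a view: shares memory with g
row[0] = 99         # this writes into g as well
print(g[0, 0])      # 99

safe = g[0].copy()  # an explicit copy does not share memory
safe[1] = -1
print(g[0, 1])      # still 2
```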
In [7]:
# broadcasting
f = a + b
print(f)
print(f.dtype)
print(f/[2, 5, 10])
print(np.sin(f/10))
[[ 11.  12.  13.]
 [ 21.  22.  23.]
 [ 31.  32.  33.]]
float64
[[  5.5   2.4   1.3]
 [ 10.5   4.4   2.3]
 [ 15.5   6.4   3.3]]
[[ 0.89120736  0.93203909  0.96355819]
 [ 0.86320937  0.8084964   0.74570521]
 [ 0.04158066 -0.05837414 -0.15774569]]
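Why does adding a shape-(3,) array to a shape-(3, 1) array produce a 3x3 result? Broadcasting aligns trailing dimensions and stretches any dimension of size 1 to match. A minimal sketch of the rule:

```python
import numpy as np

a = np.array([1, 2, 3])               # shape (3,)
b = np.array([[10.], [20.], [30.]])   # shape (3, 1)

# (3,) is treated as (1, 3); both operands are then virtually
# stretched to (3, 3) before the elementwise add.
f = a + b
print(f.shape)   # (3, 3)
print(f[2, 0])   # 31.0: b's third row plus a's first element
```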

Linear Algebra

All arrays can function as matrices, and a lot of matrix operations are built in!

In [8]:
# matrix multiplication
print(np.dot(e, f))
# matrix transposition
print(np.transpose(e))
# matrix inverse (beware: e is singular, so this result is numerically meaningless)
print(np.linalg.inv(e))
[[ 146.  152.  158.]
 [ 335.  350.  365.]
 [ 524.  548.  572.]]
[[1 4 7]
 [2 5 8]
 [3 6 9]]
[[ -4.50359963e+15   9.00719925e+15  -4.50359963e+15]
 [  9.00719925e+15  -1.80143985e+16   9.00719925e+15]
 [ -4.50359963e+15   9.00719925e+15  -4.50359963e+15]]
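Those enormous values are a red flag: e's rows are linearly dependent (row2 - row1 equals row1 - row0), so e is singular and np.linalg.inv returns numerical garbage. A quick way to check before inverting:

```python
import numpy as np

e = np.arange(1, 10).reshape((3, 3))

# the determinant of a singular matrix is 0 (up to floating-point error)
print(np.linalg.det(e))

# the rank confirms it: only 2 independent rows out of 3
print(np.linalg.matrix_rank(e))  # 2
```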
In [9]:
# Eigenvalues and right eigenvector
eigval, eigvec = np.linalg.eig(e)
print(eigval)
print(eigvec)
[  1.61168440e+01  -1.11684397e+00  -1.30367773e-15]
[[-0.23197069 -0.78583024  0.40824829]
 [-0.52532209 -0.08675134 -0.81649658]
 [-0.8186735   0.61232756  0.40824829]]
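We can sanity-check the decomposition: each column of eigvec should satisfy the eigenvalue equation e v = λ v. A minimal check:

```python
import numpy as np

e = np.arange(1, 10).reshape((3, 3))
eigval, eigvec = np.linalg.eig(e)

# for each eigenpair, e @ v should match eigval * v
for k in range(3):
    v = eigvec[:, k]
    print(np.allclose(np.dot(e, v), eigval[k] * v))  # True each time
```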
In [10]:
# norm (Frobenius by default)
print(np.linalg.norm(e))
# 0th order "norm" of the vector [1, 2, 3]: counts the nonzero entries
print(np.linalg.norm(a, 0))
16.881943016134134
3.0

Using Linear Algebra for common tasks

We can use NumPy's linear algebra to write some common tasks more concisely. For example, suppose we had a table of point pairs and wanted to calculate the slope between each pair of points

In [11]:
# generate points
lines = np.random.rand(10,4)*10
print("X1, X2, Y1, Y2")
print(lines)
# Use matrix multiplication for a linear transform into a 10x2 matrix of differences
transform = np.transpose([[1,-1,0,0],[0,0,1,-1]])
diffs = np.dot(lines, transform)
print("dx, dy")
print(diffs)
# elementwise division of the two difference columns
slope = diffs[:,1]/diffs[:,0]
print("dy/dx")
print(slope)
X1, X2, Y1, Y2
[[ 1.33126846  6.38351375  9.87461368  1.94584115]
 [ 2.53256218  0.33889075  8.52526668  8.7713605 ]
 [ 4.62041446  0.36131255  2.64113369  0.52721793]
 [ 7.44157535  2.32745562  9.00673957  9.8744161 ]
 [ 6.96290038  0.88831641  4.71713454  2.52125384]
 [ 9.4800438   9.61354815  2.95163108  0.66223627]
 [ 8.43020325  7.30600889  6.16575503  3.93993888]
 [ 4.33510169  2.98207505  0.93619343  2.71310048]
 [ 5.06539135  5.92966389  0.36308975  7.31024467]
 [ 7.10425889  5.60902825  6.84713215  9.28110825]]
dx, dy
[[-5.05224529  7.92877253]
 [ 2.19367143 -0.24609382]
 [ 4.25910191  2.11391577]
 [ 5.11411973 -0.86767653]
 [ 6.07458397  2.1958807 ]
 [-0.13350435  2.2893948 ]
 [ 1.12419435  2.22581615]
 [ 1.35302664 -1.77690706]
 [-0.86427253 -6.94715492]
 [ 1.49523064 -2.4339761 ]]
dy/dx
[ -1.56935621  -0.11218354   0.49632899  -0.16966293   0.3614866
 -17.14846611   1.97992112  -1.31328313   8.03815308  -1.62782653]
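The same slopes could be computed with an explicit Python loop over the rows; the vectorized version above is equivalent and much faster on large tables. A sketch with a small fixed table (the numbers below are arbitrary):

```python
import numpy as np

lines = np.array([[1., 3., 2., 6.],
                  [5., 1., 4., 2.]])   # columns: X1, X2, Y1, Y2

# vectorized: one matrix multiply plus one elementwise divide
transform = np.transpose([[1, -1, 0, 0], [0, 0, 1, -1]])
diffs = np.dot(lines, transform)
slope = diffs[:, 1] / diffs[:, 0]

# the loop equivalent, one pair of points at a time
slow = [(y1 - y2) / (x1 - x2) for x1, x2, y1, y2 in lines]

print(np.allclose(slope, slow))  # True
```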

Functions

Functions work just as well in NumPy.

In [12]:
def OLS(X, Y):
    """Implements ordinary least squares estimation for linear regression"""
    Xt = np.transpose(X)
    return np.dot(np.linalg.inv(np.dot(Xt,X)), np.dot(Xt, Y))
    

Let's test it out by randomly generating some data!

In [13]:
# Generate 100 random 2D points with coordinates in [0, 10)
X = np.random.rand(100, 2) * 10
# Generate Y
Y = np.dot(X,[[3],[4]])
# add some noise
Y += np.random.rand(100, 1) * .1
print(OLS(X,Y))
[[ 3.00421238]
 [ 4.00501352]]
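As a sanity check, NumPy ships its own least-squares solver; our OLS should agree with np.linalg.lstsq, which solves the same problem via a more numerically stable route than explicitly inverting X'X. A sketch rebuilding the data above:

```python
import numpy as np

def OLS(X, Y):
    """Ordinary least squares: (X'X)^-1 X'Y"""
    Xt = np.transpose(X)
    return np.dot(np.linalg.inv(np.dot(Xt, X)), np.dot(Xt, Y))

X = np.random.rand(100, 2) * 10
Y = np.dot(X, [[3], [4]]) + np.random.rand(100, 1) * .1

# lstsq returns (solution, residuals, rank, singular values)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(OLS(X, Y), beta))  # True
```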

Matplotlib

We can now plot stuff! Note the %matplotlib inline magic, which is needed for plots to show up inside a Jupyter notebook.

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 10)

plt.plot(x, x, 'o-', label='linear')
plt.plot(x, x ** 2, 'x-', label='quadratic')

plt.legend(loc='best')
plt.title('Linear vs Quadratic progression')
plt.xlabel('Input')
plt.ylabel('Output')
plt.show()