NumPy

This tutorial was contributed by Kevin Tang

Credits: Much of this was copied or inspired by https://github.com/donnemartin/data-science-ipython-notebooks

This tutorial assumes that the reader is familiar with Python at a CS 1 level. It is meant to introduce NumPy and Matplotlib, as well as Jupyter notebooks, as a way to tackle machine learning problems.

NumPy is a Python package that makes numerical computing easy. We import it as follows:

In [1]:
import numpy as np

The most basic data type is the array. The array is akin to the matrix in MATLAB in that it is used for basically everything in NumPy. As in MATLAB, an array can only contain one datatype. We can create arrays as follows:

In [2]:
# Row Vector (or a 1D array)
a = np.array([1, 2, 3])
print(a)
print(a.dtype)
print(a.shape)
[1 2 3]
int32
(3,)
In [3]:
# Column vector
b = np.array([[10.],[20.],[30.]])
print(b)
print(b.dtype)
print(b.shape)
[[ 10.]
 [ 20.]
 [ 30.]]
float64
(3, 1)

Some very useful array-creation functions include np.zeros, np.ones, np.full, np.eye, and np.random.random. Look these up and see what they do, or play around with them.

In [4]:
c = np.eye(3)
print(c)

d = np.random.random((3,3))
print(d)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
[[ 0.97431746  0.88928946  0.63610583]
 [ 0.03358437  0.57664413  0.02183867]
 [ 0.31624219  0.76387628  0.84842094]]
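As a quick sketch of what the other creation functions produce (the shapes and fill values below are chosen arbitrarily for illustration):

```python
import numpy as np

# zeros/ones/full create arrays of a given shape with a fixed fill value
z = np.zeros((2, 3))        # 2x3 array of 0.0
o = np.ones((2, 3))         # 2x3 array of 1.0
f = np.full((2, 3), 7)      # 2x3 array of 7

# eye builds an identity matrix; random.random samples uniformly from [0, 1)
i = np.eye(4)               # 4x4 identity
r = np.random.random((4, 4))

print(z.shape)    # (2, 3)
print(i.trace())  # 4.0, the sum of the diagonal
```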

Array indexing and broadcasting

We can index arrays much like we index lists in Python (or matrices in MATLAB). Slicing is really cool too. Just remember that Python is 0-indexed, unlike MATLAB.

In [5]:
e = np.arange(1,10).reshape((3,3))
print(e)
print(e[1, 2])
print(e[0][1])
[[1 2 3]
 [4 5 6]
 [7 8 9]]
6
2
In [6]:
# slicing
print("Row: ", e[2])
print("Col: ", e[:,2])
Row:  [7 8 9]
Col:  [3 6 9]
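One subtlety worth knowing: basic slices are views into the original array, not copies, so writing through a slice modifies the array it came from. A minimal sketch, using a throwaway array g:

```python
import numpy as np

g = np.arange(1, 10).reshape((3, 3))

row = g[0]          # a view: shares memory with g
row[0] = 99         # this writes into g as well
print(g[0, 0])      # 99

safe = g[0].copy()  # an explicit copy does not share memory
safe[1] = -1
print(g[0, 1])      # still 2
```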
In [7]:
# broadcasting
f = a + b
print(f)
print(f.dtype)
print(f/[2, 5, 10])
print(np.sin(f/10))
[[ 11.  12.  13.]
 [ 21.  22.  23.]
 [ 31.  32.  33.]]
float64
[[  5.5   2.4   1.3]
 [ 10.5   4.4   2.3]
 [ 15.5   6.4   3.3]]
[[ 0.89120736  0.93203909  0.96355819]
 [ 0.86320937  0.8084964   0.74570521]
 [ 0.04158066 -0.05837414 -0.15774569]]
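Why does adding a shape-(3,) array to a shape-(3, 1) array produce a 3x3 result? Broadcasting aligns trailing dimensions and stretches any dimension of size 1 to match. A minimal sketch of the rule:

```python
import numpy as np

a = np.array([1, 2, 3])               # shape (3,)
b = np.array([[10.], [20.], [30.]])   # shape (3, 1)

# (3,) is treated as (1, 3); both operands are then virtually
# stretched to (3, 3) before the elementwise add.
f = a + b
print(f.shape)   # (3, 3)
print(f[2, 0])   # 31.0: b's third row plus a's first element
```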

Linear Algebra

All arrays can function as matrices, and a lot of matrix operations are built in!

In [8]:
# matrix multiplication
print(np.dot(e, f))
# matrix transposition
print(np.transpose(e))
# matrix inverse (beware: e is singular, so this result is numerically meaningless)
print(np.linalg.inv(e))
[[ 146.  152.  158.]
 [ 335.  350.  365.]
 [ 524.  548.  572.]]
[[1 4 7]
 [2 5 8]
 [3 6 9]]
[[ -4.50359963e+15   9.00719925e+15  -4.50359963e+15]
 [  9.00719925e+15  -1.80143985e+16   9.00719925e+15]
 [ -4.50359963e+15   9.00719925e+15  -4.50359963e+15]]
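Those enormous values are a red flag: e's rows are linearly dependent (row2 - row1 equals row1 - row0), so e is singular and np.linalg.inv returns numerical garbage. A quick way to check before inverting:

```python
import numpy as np

e = np.arange(1, 10).reshape((3, 3))

# the determinant of a singular matrix is 0 (up to floating-point error)
print(np.linalg.det(e))

# the rank confirms it: only 2 independent rows out of 3
print(np.linalg.matrix_rank(e))  # 2
```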
In [9]:
# Eigenvalues and right eigenvector
eigval, eigvec = np.linalg.eig(e)
print(eigval)
print(eigvec)
[  1.61168440e+01  -1.11684397e+00  -1.30367773e-15]
[[-0.23197069 -0.78583024  0.40824829]
 [-0.52532209 -0.08675134 -0.81649658]
 [-0.8186735   0.61232756  0.40824829]]
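We can sanity-check the decomposition: each column of eigvec should satisfy the eigenvalue equation e v = λ v. A minimal check:

```python
import numpy as np

e = np.arange(1, 10).reshape((3, 3))
eigval, eigvec = np.linalg.eig(e)

# for each eigenpair, e @ v should match eigval * v
for k in range(3):
    v = eigvec[:, k]
    print(np.allclose(np.dot(e, v), eigval[k] * v))  # True each time
```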
In [10]:
# norm (Frobenius by default)
print(np.linalg.norm(e))
# 0th order "norm" of the vector [1, 2, 3]: counts the nonzero entries
print(np.linalg.norm(a, 0))
16.881943016134134
3.0

Using Linear Algebra for common tasks

We can use NumPy's linear algebra to write some common tasks more concisely. For example, suppose we had a table of point pairs and wanted to calculate the slope between each pair of points

In [11]:
# generate points
lines = np.random.rand(10,4)*10
print("X1, X2, Y1, Y2")
print(lines)
# Use matrix multiplication for a linear transform into a 10x2 matrix of differences
transform = np.transpose([[1,-1,0,0],[0,0,1,-1]])
diffs = np.dot(lines, transform)
print("dx, dy")
print(diffs)
# elementwise division of the two difference columns
slope = diffs[:,1]/diffs[:,0]
print("dy/dx")
print(slope)
X1, X2, Y1, Y2
[[ 1.33126846  6.38351375  9.87461368  1.94584115]
 [ 2.53256218  0.33889075  8.52526668  8.7713605 ]
 [ 4.62041446  0.36131255  2.64113369  0.52721793]
 [ 7.44157535  2.32745562  9.00673957  9.8744161 ]
 [ 6.96290038  0.88831641  4.71713454  2.52125384]
 [ 9.4800438   9.61354815  2.95163108  0.66223627]
 [ 8.43020325  7.30600889  6.16575503  3.93993888]
 [ 4.33510169  2.98207505  0.93619343  2.71310048]
 [ 5.06539135  5.92966389  0.36308975  7.31024467]
 [ 7.10425889  5.60902825  6.84713215  9.28110825]]
dx, dy
[[-5.05224529  7.92877253]
 [ 2.19367143 -0.24609382]
 [ 4.25910191  2.11391577]
 [ 5.11411973 -0.86767653]
 [ 6.07458397  2.1958807 ]
 [-0.13350435  2.2893948 ]
 [ 1.12419435  2.22581615]
 [ 1.35302664 -1.77690706]
 [-0.86427253 -6.94715492]
 [ 1.49523064 -2.4339761 ]]
dy/dx
[ -1.56935621  -0.11218354   0.49632899  -0.16966293   0.3614866
 -17.14846611   1.97992112  -1.31328313   8.03815308  -1.62782653]
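The same slopes could be computed with an explicit Python loop over the rows; the vectorized version above is equivalent and much faster on large tables. A sketch with a small fixed table (the numbers below are arbitrary):

```python
import numpy as np

lines = np.array([[1., 3., 2., 6.],
                  [5., 1., 4., 2.]])   # columns: X1, X2, Y1, Y2

# vectorized: one matrix multiply plus one elementwise divide
transform = np.transpose([[1, -1, 0, 0], [0, 0, 1, -1]])
diffs = np.dot(lines, transform)
slope = diffs[:, 1] / diffs[:, 0]

# the loop equivalent, one pair of points at a time
slow = [(y1 - y2) / (x1 - x2) for x1, x2, y1, y2 in lines]

print(np.allclose(slope, slow))  # True
```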

Functions

Functions work just as well in NumPy.

In [12]:
def OLS(X, Y):
    """Implements ordinary least squares estimation for linear regression"""
    Xt = np.transpose(X)
    return np.dot(np.linalg.inv(np.dot(Xt,X)), np.dot(Xt, Y))
    

Let's test it out by randomly generating some data!

In [13]:
# Generate 100 random 2D points with coordinates in [0, 10)
X = np.random.rand(100, 2) * 10
# Generate Y
Y = np.dot(X,[[3],[4]])
# add some noise
Y += np.random.rand(100, 1) * .1
print(OLS(X,Y))
[[ 3.00421238]
 [ 4.00501352]]
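As a sanity check, NumPy ships its own least-squares solver; our OLS should agree with np.linalg.lstsq, which solves the same problem via a more numerically stable route than explicitly inverting X'X. A sketch rebuilding the data above:

```python
import numpy as np

def OLS(X, Y):
    """Ordinary least squares: (X'X)^-1 X'Y"""
    Xt = np.transpose(X)
    return np.dot(np.linalg.inv(np.dot(Xt, X)), np.dot(Xt, Y))

X = np.random.rand(100, 2) * 10
Y = np.dot(X, [[3], [4]]) + np.random.rand(100, 1) * .1

# lstsq returns (solution, residuals, rank, singular values)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(OLS(X, Y), beta))  # True
```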

Matplotlib

We can now plot stuff! Note the %matplotlib inline magic, which is needed for plots to show up inside a Jupyter notebook.

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 10)

plt.plot(x, x, 'o-', label='linear')
plt.plot(x, x ** 2, 'x-', label='quadratic')

plt.legend(loc='best')
plt.title('Linear vs Quadratic progression')
plt.xlabel('Input')
plt.ylabel('Output')
plt.show()