diff --git a/notebooks/Gradient Descent.ipynb b/notebooks/Gradient Descent.ipynb new file mode 100644 index 00000000..9dcd2314 --- /dev/null +++ b/notebooks/Gradient Descent.ipynb @@ -0,0 +1,401 @@ +{ + "metadata": { + "language": "haskell", + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In supervised learning algorithms, we generally have some model (such as a neural network) with a set of parameters (the weights), a data set, and an error function which measures how well our parameters and model fit the data. With many models, the way to train the model and fit the parameters is through an iterative minimization procedure which uses the gradient of the error to find a local minimum in parameter space. This notebook will not go into the details of neural networks or how to compute the derivatives of error functions, but instead focuses on some of the simple minimization methods one can employ.
The goal of this notebook is to develop a simple yet flexible framework in Haskell in which we can implement gradient descent methods.\n", + "\n", + "Although you should keep in mind that the goal of these algorithms (for our purposes) is to train neural networks, for now we will just discuss some abstract function $f(\\vec x)$ for which we can compute all partial derivatives.\n", + "$\n", + "% Latex Setup and definitions\n", + "\\newcommand\\vector[1]{\\langle #1 \\rangle}\n", + "\\newcommand\\p[2]{\\frac{\\partial #1}{\\partial #2}}\n", + "$\n", + "\n", + "Gradient Descent\n", + "---\n", + "The simplest algorithm for iterative minimization of differentiable functions is known as just **gradient descent**.\n", + "Recall that the gradient of a function is defined as the vector of partial derivatives:\n", + "\n", + "$$\\nabla f(x) = \\vector{\\p{f}{x_1}, \\p{f}{x_2}, \\ldots, \\p{f}{x_n}}$$\n", + "\n", + "and that the gradient of a function always points in the direction of maximal increase at that point.\n", + "\n", + "Equivalently, the *negative* gradient points in the direction of maximal decrease - thus, if we start at any point and keep moving in the direction of the negative gradient, we will eventually reach a local minimum.\n", + "\n", + "This simple insight leads to the Gradient Descent algorithm. Outlined algorithmically, it looks like this:\n", + "\n", + "1. Pick a point $x_0$ as your initial guess.\n", + "2. Compute the gradient at your current guess:\n", + "$v_i = \\nabla f(x_i)$\n", + "3. Move by $\\alpha$ (your step size) *against* the direction of that gradient:\n", + "$x_{i+1} = x_i - \\alpha v_i$\n", + "4. Repeat steps 2-3 until your function value is close enough to zero (until $f(x_i) < \\varepsilon$ for some small tolerance $\\varepsilon$).\n", + "\n", + "Note that the step size, $\\alpha$, is simply a parameter of the algorithm and has to be fixed in advance.
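Before building the general framework, the update rule above can be sketched directly for a one-variable function. This is a standalone illustration with ad-hoc names (`step`, `descend`, and the central-difference derivative are choices made here, not part of the framework developed in this notebook):

```haskell
-- One gradient descent step for a one-variable function, using a
-- central finite difference to approximate the derivative f'.
step :: (Double -> Double) -> Double -> Double -> Double
step f alpha x = x - alpha * derivative f x
  where derivative g y = (g (y + h) - g (y - h)) / (2 * h)
        h = 1e-5

-- Run n steps of gradient descent from a starting guess x0.
descend :: (Double -> Double) -> Double -> Double -> Int -> Double
descend f alpha x0 n = iterate (step f alpha) x0 !! n
```

For example, `descend (\x -> (x - 3)^2) 0.1 0 100` lands very close to the minimum at $x = 3$, since each step contracts the distance to the minimum by a constant factor.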
\n", + "\n", + "Though this algorithm is simple, it will be a bit of a challenge to formalize it into executable Haskell code that we can later extend to other algorithms. First, note that gradient descent requires three things:\n", + "\n", + "- Something to optimize (a function)\n", + "- What to optimize over (the parameters)\n", + "- A way to compute partials of the function\n", + "\n", + "Note that we don't actually need to *call* the function itself - only the partial derivatives are necessary.\n", + "\n", + "We're going to define a single class for things on which we can run gradient descent. Although later we may want to modify this class, this serves as a beginning:" ] }, { "cell_type": "code", "collapsed": false, "input": [ ":set -XTypeFamilies\n", "class GradientDescent a where\n", " -- Type to represent the parameter space.\n", " data Params a :: *\n", " \n", " -- Compute the gradient at a location in parameter space.\n", " grad :: a -> Params a -> Params a\n", " \n", " -- Move in parameter space.\n", " paramMove :: Double -- Scaling factor.\n", " -> Params a -- Direction vector.\n", " -> Params a -- Original location.\n", " -> Params a -- New location." ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to use some type `a` with our gradient descent, we require that it be an instance of `GradientDescent`. This class requires a few things.\n", + "\n", + "First off, we use type families in order to define a representation for the parameter space. We want to be able to operate on points in the parameter space of our function; however, while something like a list of values might be nice and simple in one case, it is inappropriate and inefficient when storing the weights of a neural network. Thus, we let each class instance decide how to store its parameters by defining an associated type instance.
(We will see an instance of this later!)\n", + "\n", + "Next, `GradientDescent` requires a single function called `grad`, which takes the value of type `a` and the current point in parameter space (via a `Params a`) and outputs a set of partial derivatives. The partial derivatives have the same form and dimension as the point in parameter space, so they are also a `Params a`. \n", + "\n", + "Finally, we must be able to move around in parameter space, so `GradientDescent` defines a function `paramMove` which does exactly that - it takes a direction vector and an amount by which to move and uses these to generate a new position from an old one.\n", + "\n", + "Let's go ahead and create the simplest instantiation of this class and type family: a single-argument function. Note that this is just for demo purposes, and we're going to use numerical differentiation to compute the derivative." ] }, { "cell_type": "code", "collapsed": false, "input": [ "-- We need flexible instances for declarations like these.\n", ":set -XFlexibleInstances\n", "\n", "instance Floating a => GradientDescent (a -> a) where\n", " -- The parameter for a function is just its argument.\n", " data Params (a -> a) = Arg { unArg :: a }\n", "\n", " -- Use numeric differentiation (a backward difference) for the gradient.\n", " grad f (Arg value) = Arg $ (f value - f (value - epsilon)) / epsilon\n", " where epsilon = 0.0001\n", " \n", " paramMove scale (Arg vec) (Arg old) = Arg $ old + fromRational (toRational scale) * vec" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're getting closer to implementing the actual algorithm. However, we have yet to define when the algorithm *stops* - and, in order to give maximum flexibility to the user, we'll let the stopping condition be an argument.
This lets the user specify an error tolerance, as well as how they want to derive this error:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "-- Define a way to decide when to stop.\n", "-- This lets the user specify an error tolerance easily.\n", "-- The function takes the previous two sets of parameters and returns\n", "-- `True` to stop the descent and `False` to continue.\n", "newtype StopCondition a = StopWhen (Params a -> Params a -> Bool)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a single instance of our class, we can now implement our gradient descent algorithm:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "gradientDescent :: GradientDescent a => a -- What to optimize.\n", " -> StopCondition a -- When to stop.\n", " -> Double -- Step size (alpha).\n", " -> Params a -- Initial point (x0).\n", " -> Params a -- Return: Location of minimum.\n", "gradientDescent function (StopWhen stop) alpha x0 =\n", " let iterations = iterate takeStep x0\n", " iterationPairs = zip iterations $ tail iterations\n", " in\n", " -- Drop all pairs of successive parameters that do not meet the stop\n", " -- condition. Then, return the second element of the first pair that does.\n", " snd . head $ dropWhile (not . uncurry stop) iterationPairs\n", " where\n", " -- For each step...\n", " takeStep params = \n", " -- Compute the gradient.\n", " let gradients = grad function params in\n", " -- And move against the gradient with a step size alpha.\n", " paramMove (-alpha) gradients params" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's go ahead and try this for some simple functions. First, we need to create a stopping condition.
In this case, we're going to stop when successive updates to the parameters do not affect the outcome very much - namely, when the difference between the function values at successive parameters is below some $\\varepsilon$-tolerance. In other scenarios, we may want to use a more complicated stopping criterion." ] }, { "cell_type": "code", "collapsed": false, "input": [ "-- Create a stop condition that respects a given error tolerance.\n", "stopCondition :: (Double -> Double) -> Double -> StopCondition (Double -> Double)\n", "stopCondition f tolerance = StopWhen stop\n", " where stop (Arg prev) (Arg cur) =\n", " abs (f prev - f cur) < tolerance" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's finally try to minimize something, such as the relatively trivial function $f(x) = x^2 + 3x$. Its graph looks like this:\n", + "\n", + "
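As a sanity check, here is a self-contained sketch of how the pieces above fit together on this function. The framework definitions are repeated so the block can run on its own outside the notebook; the starting point `Arg 12.0` and the tolerance `1e-10` are arbitrary choices for illustration, not values taken from the notebook:

```haskell
{-# LANGUAGE TypeFamilies, FlexibleInstances #-}

-- Framework from the notebook, restated so this block stands alone.
class GradientDescent a where
  data Params a :: *
  grad :: a -> Params a -> Params a
  paramMove :: Double -> Params a -> Params a -> Params a

instance GradientDescent (Double -> Double) where
  data Params (Double -> Double) = Arg { unArg :: Double }
  -- Backward finite difference, as in the notebook's instance.
  grad f (Arg x) = Arg $ (f x - f (x - epsilon)) / epsilon
    where epsilon = 0.0001
  paramMove scale (Arg vec) (Arg old) = Arg $ old + scale * vec

newtype StopCondition a = StopWhen (Params a -> Params a -> Bool)

gradientDescent :: GradientDescent a => a -> StopCondition a -> Double -> Params a -> Params a
gradientDescent function (StopWhen stop) alpha x0 =
    snd . head $ dropWhile (not . uncurry stop) pairs
  where
    steps = iterate takeStep x0
    pairs = zip steps (tail steps)
    takeStep params = paramMove (-alpha) (grad function params) params

-- The function to minimize; its true minimum is at x = -1.5.
f :: Double -> Double
f x = x * x + 3 * x

-- Stop once successive function values differ by less than 1e-10.
answer :: Double
answer = unArg $ gradientDescent f stopWhenFlat 0.1 (Arg 12.0)
  where stopWhenFlat = StopWhen $ \(Arg a) (Arg b) -> abs (f a - f b) < 1e-10
-- answer is approximately -1.5
```

Note that because the backward difference slightly biases the gradient, the result lands a hair away from the exact minimum; a smaller `epsilon` or a central difference would tighten it.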