"In supervised learning algorithms, we generally have some model (such as a neural network) with a set of parameters (the weights), a data set, and an error function which measures how well our parameters and model fit the data. With many models, the way to train the model and fit the parameters is through an iterative minimization procedure which uses the gradient of the error to find the local minimum in parameter space. This notebook will not go into the details of neural networks or how to compute the derivatives of error functions, but instead focuses on some of the simple minimization methods one can employ. The goal of this notebook is to develop a simple yet flexible framework in Haskell in which we can develop gradient descent methods.\n",
"\n",
"Although you should keep in mind that the goal of these algorithms (for our purposes) is to train neural networks, for now we will just discuss some abstract function $f(\\vec x)$ for which we can compute all partial derivatives. Recall from vector calculus that the gradient of a function always points in the direction of maximal increase at that point.\n",
"\n",
"Equivalently, the *negative* gradient points in the direction of maximal decrease - thus, if we start at any point and keep moving in the direction of the negative gradient, we will eventually reach a local minimum.\n",
"\n",
"This simple insight leads to the Gradient Descent algorithm. Outlined algorithmically, it looks like this:\n",
"\n",
"1. Pick a point $x_0$ as your initial guess.\n",
"2. Compute the gradient at your current guess:\n",
"$v_i = \\nabla f(x_i)$\n",
"3. Move by $\\alpha$ (your step size) *against* the direction of that gradient:\n",
"$x_{i+1} = x_i - \\alpha v_i$\n",
"4. Repeat steps 2-3 until you converge - for instance, until successive values $f(x_i)$ and $f(x_{i+1})$ differ by less than some small tolerance $\\varepsilon$.\n",
"\n",
"Note that the step size, $\\alpha$, is simply a parameter of the algorithm and has to be fixed in advance. \n",
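Before building the general framework, the steps above can be sketched directly for a one-dimensional function, given its derivative. This is purely illustrative: `descend1D`, the constants, and the choice to stop when the derivative is small (one of several reasonable tests) are all assumptions of this sketch, not part of the framework developed below.

```haskell
-- A concrete sketch of the algorithm for a one-dimensional function,
-- given its derivative f'.
descend1D :: (Double -> Double)  -- The derivative f'.
          -> Double              -- Step size alpha.
          -> Double              -- Tolerance epsilon.
          -> Double              -- Current guess x_i.
          -> Double              -- Approximate local minimum.
descend1D f' alpha epsilon x
  | abs (f' x) < epsilon = x     -- The gradient is nearly flat: stop.
  | otherwise            = descend1D f' alpha epsilon (x - alpha * f' x)

main :: IO ()
main = print (descend1D (\x -> 2 * x + 3) 0.1 1e-6 12)  -- f(x) = x^2 + 3x
```

For $f(x) = x^2 + 3x$ this converges to the minimum at $x = -\frac{3}{2}$.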
"\n",
"Though this algorithm is simple, it will be a bit of a challenge to formalize it into executable Haskell code that we can later extend to other algorithms. First, note that gradient descent requires three things:\n",
"\n",
"- Something to optimize (a function)\n",
"- What to optimize over (the parameters)\n",
"- A way to compute partials of the function\n",
"\n",
"Note that we don't actually need to *call* the function itself - only the partial derivatives are necessary.\n",
"\n",
"We're going to define a single class for things on which we can run gradient descent. Although later we may want to modify this class, this serves as a beginning:"
"class GradientDescent a where\n",
"    -- The type of the parameters for type a.\n",
"    type Params a :: *\n",
"    \n",
"    -- Compute the gradient at a location in parameter space.\n",
"    grad :: a -> Params a -> Params a\n",
"    \n",
"    -- Move in parameter space.\n",
"    paramMove :: Double    -- Scaling factor.\n",
"              -> Params a  -- Direction vector.\n",
"              -> Params a  -- Original location.\n",
"              -> Params a  -- New location."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to use some type `a` with our gradient descent, we require that it is an instance of `GradientDescent`. This class requires a few things.\n",
"\n",
"First off, we use type families in order to define a representation for the parameter space. We want to be able to operate on points in the parameter space of our function; however, while something like a list of values might be nice and simple in one case, it is inappropriate and inefficient when storing the weights of a neural network. Thus, we let each class instance decide how to store its parameters by defining an associated type instance. (We will see an instance of this later!)\n",
"\n",
"Next, `GradientDescent` requires a function called `grad`, which takes the value of type `a` and the current point in parameter space (a `Params a`) and outputs the set of partial derivatives. The partial derivatives have the same form and dimension as the point in parameter space, so they are also a `Params a`.\n",
"\n",
"Finally, we must be able to move around in parameter space, so `GradientDescent` defines a function `paramMove` which does exactly that - it takes a scaling factor and a direction vector, and uses these to generate a new position from an old one.\n",
"\n",
"Let's go ahead and create the simplest instantiation of this class and type family: a single-argument function. Note that this is just for demo purposes, and we're going to use numerical differentiation to compute the derivative."
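One way such an instance might look, as a sketch: the `Arg` wrapper matches the `unArg` accessor used later in the notebook, the finite-difference width `dx` is an assumed constant, and the class is repeated here so the block stands alone.

```haskell
{-# LANGUAGE TypeFamilies, FlexibleInstances #-}

-- The class from above, repeated so this sketch is self-contained.
class GradientDescent a where
  type Params a :: *
  grad :: a -> Params a -> Params a
  paramMove :: Double -> Params a -> Params a -> Params a

-- A wrapper for the single argument of the function.
newtype Arg = Arg { unArg :: Double }

instance GradientDescent (Double -> Double) where
  -- The parameter space of a one-argument function is just its argument.
  type Params (Double -> Double) = Arg

  -- Approximate the derivative with a forward finite difference.
  grad f (Arg x) = Arg ((f (x + dx) - f x) / dx)
    where dx = 1e-4

  -- Moving in parameter space is scaled addition.
  paramMove scale (Arg direction) (Arg old) = Arg (old + scale * direction)

main :: IO ()
main = print (unArg (grad (\x -> x * x + 3 * x) (Arg 1.0)))
```

For $f(x) = x^2 + 3x$, the finite difference at $x = 1$ comes out close to the true derivative $f'(1) = 5$.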
"We're getting closer to implementing the actual algorithm. However, we have yet to define when the algorithm *stops* - and, in order to give maximum flexibility to the user, we'll let the stopping condition be an argument. This lets the user specify an error tolerance, as well as how they want to derive this error:"
" -- And move against the gradient with a step size alpha.\n",
" paramMove (-alpha) gradients params"
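Putting the pieces together, the full non-monadic descent driver might be sketched as follows. Names follow the surrounding code; the supporting class, instance, and `Arg` wrapper are repeated (with an assumed finite-difference width) so the block stands alone.

```haskell
{-# LANGUAGE TypeFamilies, FlexibleInstances #-}

class GradientDescent a where
  type Params a :: *
  grad :: a -> Params a -> Params a
  paramMove :: Double -> Params a -> Params a -> Params a

newtype Arg = Arg { unArg :: Double }

instance GradientDescent (Double -> Double) where
  type Params (Double -> Double) = Arg
  grad f (Arg x) = Arg ((f (x + dx) - f x) / dx) where dx = 1e-4
  paramMove scale (Arg direction) (Arg old) = Arg (old + scale * direction)

-- Decide, from the previous and current parameters, whether to stop.
newtype StopCondition a = StopWhen (Params a -> Params a -> Bool)

gradientDescent :: GradientDescent a
                => a                -- What to optimize.
                -> StopCondition a  -- When to stop.
                -> Double           -- Step size (alpha).
                -> Params a         -- Initial point (x0).
                -> Params a         -- Location of the minimum.
gradientDescent function (StopWhen stop) alpha x0 =
  let next = takeStep x0
  in if stop x0 next
       then next
       else gradientDescent function (StopWhen stop) alpha next
  where
    -- Move against the gradient with a step size alpha.
    takeStep params = paramMove (-alpha) (grad function params) params

main :: IO ()
main =
  let f x = x * x + 3 * x
      stop = StopWhen (\(Arg a) (Arg b) -> abs (f a - f b) < 1e-4)
  in print (unArg (gradientDescent f stop 0.1 (Arg 12)))
```

With $f(x) = x^2 + 3x$, step size $0.1$, and start $x_0 = 12$, the result lands near the true minimum at $-\frac{3}{2}$.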
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's go ahead and try this for some simple functions. First, we need to create a stopping condition. In this case, we're going to stop when successive updates to the parameters do not affect the outcome very much - namely, when the difference between the function value at successive parameters is below some $\\varepsilon$-tolerance. In other scenarios, we may want to use a more complicated stopping criterion."
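That stopping condition can be sketched as below; the supporting types are repeated (with a minimal method-free class, an assumption of this sketch) so the block stands alone.

```haskell
{-# LANGUAGE TypeFamilies, FlexibleInstances #-}

class GradientDescent a where
  type Params a :: *

newtype Arg = Arg { unArg :: Double }

instance GradientDescent (Double -> Double) where
  type Params (Double -> Double) = Arg

newtype StopCondition a = StopWhen (Params a -> Params a -> Bool)

-- Stop when the function values at successive parameter points differ
-- by less than the tolerance.
stopCondition :: (Double -> Double) -> Double
              -> StopCondition (Double -> Double)
stopCondition f tolerance = StopWhen stop
  where stop (Arg prev) (Arg current) = abs (f prev - f current) < tolerance

main :: IO ()
main =
  let StopWhen stop = stopCondition (\x -> x * x + 3 * x) 1e-4
  in print (stop (Arg 1.0) (Arg 1.0), stop (Arg 0.0) (Arg 1.0))  -- prints (True,False)
```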
"Finally, let's take the minimum. We're going to use a step size of $\\alpha = 0.1$, start at $x_0 = 12$, and stop with a tolerance of $1\\times 10^{-4}$:"
"unArg $ gradientDescent function (stopCondition function tolerance) alpha (Arg initValue)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Monadic Gradient Descent\n",
"---\n",
"\n",
"Although the above implementation of gradient descent works, we're going to run into problems when our functions are more complicated. For instance, suppose that computing the gradient required a lot of computation, and the computation required communicating with a distributed network of processing nodes. Or suppose that there were some regimes in which the function was non-differentiable, and we wanted to use the `Maybe` type to represent this. In order to support this, we can try to rewrite our class with *monadic* variants of its operations."
"class Monad m => GradientDescent m a where\n",
"    -- The type of the parameters for type a.\n",
"    type Params a :: *\n",
"    \n",
"    -- Compute the gradient at a location in parameter space.\n",
"    grad :: a -> Params a -> m (Params a)\n",
"    \n",
"    -- Move in parameter space.\n",
"    paramMove :: Double          -- Scaling factor.\n",
"              -> Params a        -- Direction vector.\n",
"              -> Params a        -- Original location.\n",
"              -> m (Params a)    -- New location.\n",
"\n",
"\n",
"-- Since we've redefined GradientDescent, we need to redefine StopCondition.\n",
"newtype StopCondition a = StopWhen (Params a -> Params a -> Bool)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to utilize this, we're going to have to rewrite our instance to run all computations in a monad. The implementation will look quite familiar, but we won't be able to use as many built-in functions, as they do not have monadic variants in the base packages."
"gradientDescent :: GradientDescent m a\n",
"                => a               -- What to optimize.\n",
"                -> StopCondition a -- When to stop.\n",
"                -> Double          -- Step size (alpha).\n",
"                -> Params a        -- Initial point.\n",
"                -> m (Params a)    -- Return: Location of minimum.\n",
"gradientDescent function (StopWhen stop) alpha x0 = do\n",
"  -- Take the next step.\n",
"  next <- takeStep x0\n",
"  \n",
"  -- If we should stop, do so; otherwise, recurse.\n",
"  if stop x0 next\n",
"    then return next\n",
"    else gradientDescent function (StopWhen stop) alpha next\n",
"  where\n",
"    takeStep params = do\n",
"      gradients <- grad function params\n",
"      paramMove (-alpha) gradients params"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try this for something simple. Suppose we're using our old $f(x) = x^2 + 3x$, but for some reason, we are incapable of differentiating when the argument is below zero. We'll use the `Maybe` monad to represent this - if the argument to the function is negative, we return `Nothing`; otherwise, we return `Just` the derivative.\n",
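A sketch of such a `Maybe`-based instance, together with a run of the monadic driver. The supporting definitions are repeated so the block stands alone, and the finite-difference width `dx` is an assumption carried over from the earlier instance.

```haskell
{-# LANGUAGE TypeFamilies, FlexibleInstances, MultiParamTypeClasses #-}

class Monad m => GradientDescent m a where
  type Params a :: *
  grad :: a -> Params a -> m (Params a)
  paramMove :: Double -> Params a -> Params a -> m (Params a)

newtype Arg = Arg { unArg :: Double }

newtype StopCondition a = StopWhen (Params a -> Params a -> Bool)

instance GradientDescent Maybe (Double -> Double) where
  type Params (Double -> Double) = Arg

  -- Pretend we cannot differentiate when the argument is negative.
  grad f (Arg x)
    | x < 0     = Nothing
    | otherwise = Just (Arg ((f (x + dx) - f x) / dx))
    where dx = 1e-4

  paramMove scale (Arg direction) (Arg old) = Just (Arg (old + scale * direction))

gradientDescent :: GradientDescent m a
                => a -> StopCondition a -> Double -> Params a -> m (Params a)
gradientDescent function (StopWhen stop) alpha x0 = do
  next <- takeStep x0
  if stop x0 next
    then return next
    else gradientDescent function (StopWhen stop) alpha next
  where
    takeStep params = do
      gradients <- grad function params
      paramMove (-alpha) gradients params

main :: IO ()
main =
  let f x = x * x + 3 * x
      stop = StopWhen (\(Arg a) (Arg b) -> abs (f a - f b) < 1e-4)
  in case gradientDescent f stop 0.1 (Arg 12) of
       Just (Arg x) -> print x
       Nothing      -> putStrLn "Nothing!"
```

Starting from $x_0 = 12$ with $\alpha = 0.1$, the iterates cross into the $x < 0$ region before converging, so the whole computation collapses to `Nothing` and `main` prints `Nothing!`.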
"let stopper = stopCondition function tolerance\n",
"case gradientDescent function stopper alpha x0 of\n",
" Just x -> print $ unArg x\n",
" Nothing -> putStrLn \"Nothing!\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We saw in the original example that the minimum is at $-\\frac{3}{2}$, so this gradient descent tries to move into the $x < 0$ region - at which point the differentiation returns `Nothing`, and the entire computation short-circuits to `Nothing`! This monadic gradient descent can be used to implement things such as bounded optimization, optimization that keeps track of the states it went through, optimization that uses networked IO to do its computation, and so on.\n",
"\n",
"That's it for now. In the next notebook, I'm going to try implementing conjugate gradient in this same framework."