Improvements
First, let's review improvements to Gradient Descent. By their nature they are all still GD; a minimal sketch of two of the update rules follows below the list.
- Momentum gradient descent (MGD)
- Nesterov accelerated gradient descent (NAG)
- Adaptive gradient (Adagrad)
- Root-mean-square gradient propagation (RMSprop)
- Adaptive moment estimation (Adam)
Source: the Backpropagation Neural Tree paper, which proposes an alternative ANN architecture.
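To make the differences concrete, here is a minimal NumPy sketch of two of these update rules, momentum and Adam. The hyperparameter defaults are common choices, not prescriptions:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    """Momentum GD: accumulate a velocity vector and step along it."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from running estimates of the gradient moments."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# toy usage: minimise f(x) = x^2 with Adam (lr raised for this trivial problem)
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # approaches 0
```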
Alternatives
When it comes to Global Optimisation tasks (i.e. attempting to find a global minimum of an objective function), you might want to take a look at:
- Pattern Search (also known as direct search, derivative-free search, or black-box search), which uses a pattern (a set of vectors) to determine the points to search at the next iteration.
- Genetic Algorithm, which uses the concepts of mutation, crossover and selection to define the population of points to be evaluated at the next iteration of the optimisation.
- Particle Swarm Optimisation, which defines a set of particles that "walk" through the search space looking for the minimum.
- Surrogate Optimisation, which uses a surrogate model to approximate the objective function. This method can be used when the objective function is expensive to evaluate.
- Multi-objective Optimisation (also known as Pareto optimisation), which can be used for problems that cannot be expressed with a single objective function (but rather a vector of objectives).
- Simulated Annealing, which uses the concept of annealing (or temperature) to trade off exploration and exploitation. It proposes new points for evaluation at each iteration, but as the number of iterations increases the "temperature" drops and the algorithm becomes less and less likely to explore the space, thus "converging" towards its current best candidate.
- Backpropagation Neural Tree (BNeuralT)
- Target Propagation
- Alternating Direction Method of Multipliers (ADMM) (you need 7000 CPU cores on a supercomputer, while I have 1280 GPU cores on my old laptop with a GeForce GTX 1060)
- Zeroth-Order Relaxed Backpropagation (ZORB)
- Finito https://arxiv.org/abs/1407.2710
- Stochastic Dual Coordinate Ascent (SDCA) https://arxiv.org/abs/1209.1873
- Stochastic Optimization with Variance Reduction https://hal.inria.fr/hal-01375816v1/document
- Spike timing-dependent plasticity (STDP) for Spiking Neural Networks
- Feedback alignment (FA)
As mentioned above, Simulated Annealing, Particle Swarm Optimisation and Genetic Algorithms are good global optimisation algorithms that navigate huge search spaces well. Unlike Gradient Descent, they do not need any information about the gradient, so they can be used with black-box objective functions and with problems that require running simulations.
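For example, here is a minimal sketch of Simulated Annealing for a black-box objective; the step size, cooling schedule and the Rastrigin test function are arbitrary illustrative choices:

```python
import numpy as np

def simulated_annealing(objective, x0, n_iter=10_000, step=0.5, t0=1.0, cooling=0.999, seed=0):
    """Minimise a black-box objective: accept worse points with a probability
    that shrinks as the temperature drops, trading exploration for exploitation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = objective(x)
    best_x, best_f = x.copy(), fx
    t = t0
    for _ in range(n_iter):
        candidate = x + rng.normal(scale=step, size=x.shape)  # propose a nearby point
        fc = objective(candidate)
        # always accept improvements; accept worse points with probability exp(-delta / t)
        if fc < fx or rng.random() < np.exp(-(fc - fx) / max(t, 1e-12)):
            x, fx = candidate, fc
            if fx < best_f:
                best_x, best_f = x.copy(), fx
        t *= cooling  # cool down: less exploration over time
    return best_x, best_f

def rastrigin(x):
    """A common multimodal test objective with many local minima; global minimum at 0."""
    x = np.asarray(x, dtype=float)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

print(simulated_annealing(rastrigin, x0=[3.0, -2.0]))  # best point found, typically near the origin
```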
Vityaev's method
Vityaev’s method gives more flexibility to the dendrite part of the system compared to how dendrites are modelled in perceptrons or deep networks: instead of a simple weight per synapse, a unit reacts to signal patterns, as in spiking neurons. Moreover, within this framework signal propagation can be interpreted with logic statements and probabilities, as in Bayesian networks. This opens up a wider choice of learning methods; learning can be, for example, some form of statistical analysis of the data.
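I am not aware of a reference implementation, so the following is only a toy illustration of the general idea described above (a unit that fires on a logical pattern over binary inputs, with its firing probability estimated from data by simple counting), not Vityaev's actual formalism:

```python
import numpy as np

class ProbabilisticRuleUnit:
    """Toy unit (illustration only): fires when a fixed logical pattern over
    binary inputs holds, with a firing probability estimated from data."""

    def __init__(self, positive_idx, negative_idx):
        self.positive_idx = positive_idx  # inputs that must be 1
        self.negative_idx = negative_idx  # inputs that must be 0
        self.p_fire = 0.5                 # P(output = 1 | pattern holds), learned from data

    def pattern_holds(self, x):
        x = np.asarray(x)
        return bool(np.all(x[self.positive_idx] == 1) and np.all(x[self.negative_idx] == 0))

    def fit(self, X, y):
        """Statistical 'learning': estimate P(y = 1 | pattern holds) by counting."""
        mask = np.array([self.pattern_holds(row) for row in X])
        if mask.any():
            self.p_fire = float(np.mean(np.asarray(y)[mask]))

    def forward(self, x, rng=None):
        rng = rng or np.random.default_rng(0)
        if not self.pattern_holds(x):
            return 0
        return int(rng.random() < self.p_fire)

# toy usage: rule "x0 AND NOT x2 -> fire", probability estimated from data
unit = ProbabilisticRuleUnit(positive_idx=[0], negative_idx=[2])
X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0])
unit.fit(X, y)
print(unit.p_fire, unit.forward([1, 1, 0]))
```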
Extreme Learning
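This heading presumably refers to Extreme Learning Machines (ELM), where the hidden-layer weights are set randomly and never trained, and only the linear output layer is fitted, typically with a least-squares solve. A minimal sketch, with the layer size, activation and test data being arbitrary choices:

```python
import numpy as np

def train_elm(X, y, n_hidden=100, seed=0):
    """Extreme Learning Machine: random fixed hidden layer, least-squares output layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # input-to-hidden weights, never trained
    b = rng.normal(size=n_hidden)                 # hidden biases, never trained
    H = np.tanh(X @ W + b)                        # hidden activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights, fitted in closed form
    return lambda X_new: np.tanh(X_new @ W + b) @ beta

# toy usage: fit y = sin(x) on a 1-D input
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()
predict = train_elm(X, y)
print(np.mean((predict(X) - y) ** 2))  # small training error expected
```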
Towards a more biologically plausible learning algorithm
Main article: Biologically-Plausible Learning Algorithms Can Scale to Large Datasets (link)
Nash’s equilibrium
Why is gradient descent the most popular technique, practically the only one in use? Because it’s stable: no matter what is in your training data, and in whatever order it arrives, you get a predictable outcome. Things like Adaptive Resonance Theory aren’t stable in that regard: they learn fast, but the order of the data matters a lot. But what if we apply Nash’s equilibrium here? Every system or component in the brain fights for its own truth and shifts the whole system; they play against each other, but we can determine when they must stop overwriting each other. Would that fix the problem with data order?
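To make the equilibrium idea a bit more concrete, here is a toy illustration of the Nash-equilibrium concept itself (best-response dynamics in a small two-player game), not a proposal for a training algorithm: the two players keep "overwriting" each other until neither can improve by changing its own choice alone.

```python
import numpy as np

def best_response_dynamics(payoff_a, payoff_b, max_rounds=100):
    """Two players alternately pick their best response to the other's current choice;
    they stop when neither can improve unilaterally (a pure Nash equilibrium)."""
    a, b = 0, 0                                     # initial strategies (row / column indices)
    for _ in range(max_rounds):
        new_a = int(np.argmax(payoff_a[:, b]))      # player A's best response to b
        new_b = int(np.argmax(payoff_b[new_a, :]))  # player B's best response to new_a
        if (new_a, new_b) == (a, b):                # nobody wants to overwrite the other any more
            return a, b
        a, b = new_a, new_b
    return None                                     # no pure equilibrium found (dynamics may cycle)

# toy usage: a 2x2 coordination game where both players prefer to match
payoff_a = np.array([[2, 0], [0, 1]])
payoff_b = np.array([[2, 0], [0, 1]])
print(best_response_dynamics(payoff_a, payoff_b))   # (0, 0)
```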