Roadmap
- Tabular (Lecture 20, Hugging Face course, develop it from scratch)
- DQN (read the DQN paper and implement something better than the existing Python implementations)
- PPO
- Distributional RL (good explanation)
- Rainbow (github, paper)
- Multi goal
- Multi agent
- Continuous time
- SAC
- REINFORCE
- Dreamer
- TD3
Environments
- OpenAI Gym (now maintained by the Farama Foundation as Gymnasium)
- PettingZoo - multi-agent
Sims:
- Isaac Sim
- MuJoCo
- Webots
- Gazebo
Walking bipeds:
Personal research
Articles
- How RL Agents Behave When Their Actions Are Modified, code
- Bipedal soccer robot recovering after a push with Deep Reinforcement Learning (site, paper)
- 2019 Neftci - Reinforcement learning in artificial and biological systems https://www.gwern.net/docs/reinforcement-learning/model-free/2019-neftci.pdf
- Neurorobotics can be used to explain how neural network activity leads to behavior. Neurorobots as a Means Toward Neuroethology and Explainable AI
- TensorFlow tutorials are probably weak on the theory part, but the Hugging Face course is definitely clearer.
- Explanation-based learning vs. Reinforcement learning, Dietterich 1997
- HOT! (According to Ignacio de Gregorio, pulp fiction writer: in a paper that has not even been officially presented yet, Google has announced pre-trained robots that are capable of doing multiple different activities and can also be easily trained for ambitious downstream tasks.) https://arxiv.org/pdf/2211.15144.pdf
- Review of clutches required in RL
- Smart Reinforcement learning in How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning. It is based on Markov decision processes (too wordy, but very simple articles with examples) that use the Bellman equation, value functions, and temporal goals in MDPs. I think one can rewrite this smelly Python code while using this popular Python library (470 stars) as a reference.
- Game theory in RL. Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning, where you need to know what the replicator equation is and how to plot 3-strategy evolutionary games.
- Policy gradient methods -> Proximal policy optimization
Questions
- Does anyone know if the reward function in reinforcement learning has ever been supported by logical reasoning instead of a priori given values?
Program
I have read the DQN paper and want to implement something better than the existing Python implementations.
Let's write a Reinforcement Learning program in Rust. In the beginning we will be solving the FrozenLake environment. We will use a grid world from rsrl.
cargo new qu-robot
cd qu-robot
cargo add rsrl
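Before wiring rsrl in, here is a minimal, self-contained sketch of tabular Q-learning on a deterministic 4x4 FrozenLake-style grid. The grid layout, constants, and the tiny xorshift random generator are assumptions for illustration only; nothing here uses the rsrl API.

// Minimal tabular Q-learning on a deterministic 4x4 FrozenLake-style grid.
// S = start, F = frozen, H = hole (episode ends, reward 0), G = goal (reward 1).
const GRID: [&str; 4] = ["SFFF", "FHFH", "FFFH", "HFFG"];
const N: usize = 4;
const ACTIONS: usize = 4; // 0 = left, 1 = down, 2 = right, 3 = up

// Tiny xorshift generator so the sketch has no external dependencies.
fn rand01(seed: &mut u64) -> f64 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    (*seed % 1_000_000) as f64 / 1_000_000.0
}

// Apply an action; moving into the border keeps the agent in place.
fn step(state: usize, action: usize) -> (usize, f64, bool) {
    let (r, c) = (state / N, state % N);
    let (nr, nc) = match action {
        0 => (r, c.saturating_sub(1)),
        1 => ((r + 1).min(N - 1), c),
        2 => (r, (c + 1).min(N - 1)),
        _ => (r.saturating_sub(1), c),
    };
    let next = nr * N + nc;
    match GRID[nr].as_bytes()[nc] {
        b'G' => (next, 1.0, true),
        b'H' => (next, 0.0, true),
        _ => (next, 0.0, false),
    }
}

fn main() {
    let (alpha, gamma, eps) = (0.1, 0.99, 0.1);
    let mut q = [[0.0f64; ACTIONS]; N * N];
    let mut seed = 42u64;
    for _episode in 0..5000 {
        let mut s = 0; // start in the top-left corner
        loop {
            // epsilon-greedy action selection
            let a = if rand01(&mut seed) < eps {
                (rand01(&mut seed) * ACTIONS as f64) as usize % ACTIONS
            } else {
                (0..ACTIONS)
                    .max_by(|&i, &j| q[s][i].partial_cmp(&q[s][j]).unwrap())
                    .unwrap()
            };
            let (s2, reward, done) = step(s, a);
            let best_next = if done {
                0.0
            } else {
                (0..ACTIONS).fold(f64::MIN, |m, i| m.max(q[s2][i]))
            };
            // Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[s][a] += alpha * (reward + gamma * best_next - q[s][a]);
            s = s2;
            if done { break; }
        }
    }
    println!("Q-values of the start state: {:?}", q[0]);
}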
Walls in the grid environment are an interesting problem. In the rsrl crate it was implemented so that the move into a wall is a possible action, but no movement actually occurs: the agent just stays in the same position. That's not good, because staying in place should be an intentional action rather than a loophole in the implementation. Staying in the same spot is an equilibrium, so we can call it a safe strategy.
What can we do to fix it? We can remove movements that go into any wall from the list of possible actions. But then the sum of probabilities over the remaining actions stops being equal to 1. That can be solved by normalization, but I have a more important argument against it. I think the agent should understand something about the environment and know what walls are, as they can be the limits of the environment and also any obstacle inside it. And if the agent finds a way to avoid the obstacles, then it can apply the same technique to the limits of the environment and leave the box...
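For reference, the masking-plus-renormalization option would look roughly like this (a sketch with hypothetical names, not anything from rsrl):

// Mask actions that would walk into a wall, then renormalize the policy so
// the remaining probabilities sum to 1 again.
fn masked_policy(probs: &[f64; 4], blocked: &[bool; 4]) -> [f64; 4] {
    let mut out = [0.0; 4];
    let mut total = 0.0;
    for i in 0..4 {
        if !blocked[i] {
            out[i] = probs[i];
            total += probs[i];
        }
    }
    if total > 0.0 {
        for p in out.iter_mut() {
            *p /= total; // renormalization step
        }
    }
    out
}

fn main() {
    let probs = [0.25, 0.25, 0.25, 0.25];      // uniform policy over left/down/right/up
    let blocked = [true, false, false, false]; // e.g. a wall to the left
    println!("{:?}", masked_policy(&probs, &blocked)); // roughly [0.0, 0.333, 0.333, 0.333]
}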
Then, if we add another state 'W' everywhere around our current grid, we only emphasize the problem that we had originally: every decision takes into consideration only the current state. If that state was near a wall, then we learned that that direction is bad. Instead of simply observing what is nearby, our agent is taking actions like a blind hedgehog in the dark. So we need to combine all the cells around the agent into one state.
But this increases our set of possible states, and the dimensions of the transition matrix, exponentially. Then let's only include cell values that carry a reward. That's not every state, right?
We skip the diagonal cells. Thus we have 5 cells (the agent's cell plus its 4 orthogonal neighbours) that together form one meta-state. With only 3 meaningful cell values, that gives 3^5 = 243 meta-states.
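A minimal sketch of packing the five cells into one meta-state index by treating the cell values as base-3 digits. The cell codes (0 = free, 1 = wall 'W', 2 = reward) are an assumption for illustration:

// Encode the agent's cell plus its 4 orthogonal neighbours into a single index.
// With 3 possible values per cell this yields 3^5 = 243 meta-states.
fn meta_state(cells: [u8; 5]) -> usize {
    // base-3 positional encoding: index = sum over i of cells[i] * 3^i
    cells.iter().rev().fold(0usize, |acc, &c| acc * 3 + c as usize)
}

fn main() {
    // center free, wall above, reward to the right, free below and to the left
    let idx = meta_state([0, 1, 2, 0, 0]);
    println!("meta-state index: {} of {}", idx, 3usize.pow(5)); // prints "meta-state index: 21 of 243"
}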
Further development
Sub goals
Goal-Conditioned Reinforcement Learning: Problems and Solutions by Liu et al. (paper, 2022). Goals are positions the robot should reach (with its body, or with parts like its arms). There are many end positions that correspond to different tasks or steps. These positions can be discovered with several methods, so there is no need to hard-code them. But there is no single answer for how to switch between goals. It resembles Bayesian learning very much, where conditional probabilities are learned from experience.
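A minimal sketch of what "goal-conditioned" means in a tabular setting: the value table is simply keyed by (state, goal, action) instead of (state, action). The types and names here are illustrative assumptions, not the paper's notation.

use std::collections::HashMap;

// Tabular goal-conditioned Q-function: the goal is part of the lookup key.
type State = usize;
type Goal = usize;
type Action = usize;

struct GoalConditionedQ {
    table: HashMap<(State, Goal, Action), f64>,
}

impl GoalConditionedQ {
    fn new() -> Self {
        Self { table: HashMap::new() }
    }
    fn get(&self, s: State, g: Goal, a: Action) -> f64 {
        *self.table.get(&(s, g, a)).unwrap_or(&0.0)
    }
    // One Q-learning style update toward the currently active goal.
    fn update(&mut self, s: State, g: Goal, a: Action, target: f64, alpha: f64) {
        let q = self.get(s, g, a);
        self.table.insert((s, g, a), q + alpha * (target - q));
    }
}

fn main() {
    let mut q = GoalConditionedQ::new();
    q.update(3, 7, 1, 1.0, 0.1);    // reward observed while pursuing goal 7
    println!("{}", q.get(3, 7, 1)); // 0.1
    println!("{}", q.get(3, 8, 1)); // 0.0 -> same state, different goal
}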
Nash equilibrium
Stabilization tricks
In actor-critic:
To stabilize training, the resulting sequence of returns is also standardized (i.e. to have zero mean and unit standard deviation).
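A minimal Rust sketch of that return-standardization step; the small epsilon floor on the standard deviation is my own assumption to avoid division by zero.

// Standardize a sequence of returns to zero mean and unit standard deviation.
fn standardize(returns: &[f64]) -> Vec<f64> {
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let var = returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt().max(1e-8); // epsilon floor to avoid dividing by zero
    returns.iter().map(|r| (r - mean) / std).collect()
}

fn main() {
    println!("{:?}", standardize(&[1.0, 2.0, 3.0, 4.0]));
}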
For the normalization layer (by Kazumi); here x is presumably a batch of activations reduced over dimension 0:
import torch
mean = x.mean(0).square()    # squared per-feature mean, (E[x])^2
square = x.square().mean(0)  # per-feature second moment, E[x^2]
# square - mean is the per-feature variance, so its square root is the standard deviation
self.activation = torch.mean(square - 2*torch.sqrt(square-mean))