Roadmap
- Tabular (Lecture 20, Hugging Face course, develop it from scratch)
- DQN (read the DQN paper and implement something better than the existing Python code)
- PPO
- Distributional RL (good explanation)
- Rainbow (github, paper)
- Multi-goal
- Multi-agent
- Continuous time
- SAC
- REINFORCE
- Dreamer
- TD3
- Planning (a practical course from Yandex)
Environments
- OpenAI Gym (now maintained by the Farama Foundation as Gymnasium)
- PettingZoo - multi-agent
Sims:
- Isaac Sim
- MuJoCo
- Webots
- Gazebo
Walking bipeds:
Personal research
Articles
- How RL Agents Behave When Their Actions Are Modified, code
- Bipedal soccer robot recovering after a push with Deep Reinforcement Learning (site, paper)
- 2019 Neftci - Reinforcement learning in artificial and biological systems https://www.gwern.net/docs/reinforcement-learning/model-free/2019-neftci.pdf
- Neurorobotics can be used to explain how neural network activity leads to behavior: Neurorobots as a Means Toward Neuroethology and Explainable AI
- TensorFlow tutorials are probably weak on the theory part, but the Hugging Face course is definitely clearer.
- Explanation-based learning vs. reinforcement learning, Dietterich 1997
- HOT! (According to Ignacio de Gregorio, pulp-fiction writer: in a paper that has not even been officially presented yet, Google has announced pre-trained robots that are capable of doing multiple different activities and can also be easily trained for ambitious downstream tasks.) https://arxiv.org/pdf/2211.15144.pdf
- Review of clutches required in RL
- Smart reinforcement learning in How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning. It is based on Markov decision processes (too wordy, but very simple articles with examples) that use the Bellman equation, value functions, and temporal goals in MDPs. I think one can rewrite this smelly Python code while using this popular Python library (470 stars) as a reference.
- Game theory in RL. Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning, where you need to know what the replicator equation is (see it written out after this list) and how to plot 3-strategy evolutionary games.
- Policy gradient methods -> Proximal policy optimization
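For reference, the replicator equation itself (standard definition, not taken from the Stratego paper): with strategy frequencies x_i and fitnesses f_i(x),
\dot{x}_i = x_i \left( f_i(x) - \bar{f}(x) \right), \qquad \bar{f}(x) = \sum_j x_j f_j(x).
For a 3-strategy game the frequencies live on the 2-simplex, which is why those evolutionary-game plots are drawn as triangles.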
Questions
- Does anyone know if the reward function in reinforcement learning was ever supported by logical reasoning instead of a priori given values?
Program
I have read the DQN paper and want to implement something better than any of the existing Python code.
Let's write a Reinforcement Learning program in Rust. In the beginning we will be solving the FrozenLake environment. We will use a grid world from the rsrl crate.
cargo new qu-robot
cd qu-robot
cargo add rsrl
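Before wiring rsrl in, here is a minimal hand-rolled sketch of the target: tabular Q-learning on the deterministic 4x4 FrozenLake map. None of the names below come from the rsrl API (I haven't checked its interface here), and the toy xorshift PRNG is only there to keep the sketch free of extra crates.

// Minimal tabular Q-learning on the deterministic 4x4 FrozenLake map (no crates needed).
// 'S' = start, 'F' = frozen, 'H' = hole (episode ends, reward 0), 'G' = goal (reward 1).
const MAP: [&str; 4] = ["SFFF", "FHFH", "FFFH", "HFFG"];
const N: usize = 4;        // grid side
const ACTIONS: usize = 4;  // 0 = up, 1 = down, 2 = left, 3 = right

// Toy xorshift PRNG so the sketch stays dependency-free.
fn rand01(seed: &mut u64) -> f64 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    (*seed % 10_000) as f64 / 10_000.0
}

// One environment step: moves off the edge leave the agent in place (clamped),
// holes and the goal end the episode.
fn step(state: usize, action: usize) -> (usize, f64, bool) {
    let (r, c) = (state / N, state % N);
    let (nr, nc) = match action {
        0 => (r.saturating_sub(1), c),
        1 => ((r + 1).min(N - 1), c),
        2 => (r, c.saturating_sub(1)),
        _ => (r, (c + 1).min(N - 1)),
    };
    let next = nr * N + nc;
    match MAP[nr].as_bytes()[nc] {
        b'G' => (next, 1.0, true),
        b'H' => (next, 0.0, true),
        _ => (next, 0.0, false),
    }
}

fn main() {
    let (alpha, gamma, eps) = (0.1_f64, 0.99_f64, 0.1_f64);
    let mut q = [[0.0_f64; ACTIONS]; N * N];
    let mut seed = 42_u64;
    for _ in 0..5_000 {
        let mut s = 0; // start at 'S' in the top-left corner
        loop {
            // epsilon-greedy action selection
            let a = if rand01(&mut seed) < eps {
                (rand01(&mut seed) * ACTIONS as f64) as usize % ACTIONS
            } else {
                (0..ACTIONS)
                    .max_by(|&x, &y| q[s][x].partial_cmp(&q[s][y]).unwrap())
                    .unwrap()
            };
            let (s2, reward, done) = step(s, a);
            let best_next = if done { 0.0 } else { q[s2].iter().cloned().fold(f64::MIN, f64::max) };
            // Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[s][a] += alpha * (reward + gamma * best_next - q[s][a]);
            s = s2;
            if done { break; }
        }
    }
    println!("Q(start) = {:?}", q[0]);
}

Run it with cargo run; the Q-values for the start state should drift away from zero once the goal has been reached a few times.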
Walls in the grid environment are an interesting problem. In the rsrl crate the movement into a wall is a legal action, but no movement actually occurs, so the agent just stays in the same position. That's not good, because staying in place should be an intentional action rather than a loophole in the implementation. Staying in the same spot is an equilibrium, so we can call it a safe strategy.
What can we do to fix it? We can remove movements that go into a wall from the list of possible actions. But then the sum of probabilities over the remaining actions stops being equal to 1. That can be solved by normalization (see the sketch below), but I have a more important argument against it: I think the agent should understand something about the environment and know what walls are, since they can be both the limits of the environment and any obstacle inside it. And if the agent finds a way to avoid obstacles, it can apply the same technique to the limits of the environment and leave the box...
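For completeness, the mask-and-renormalize alternative I just argued against would look roughly like this (a sketch with made-up names, not rsrl code): zero out the wall-directed actions in the policy distribution and rescale the rest so they sum to 1 again.

// Renormalize a policy distribution after masking out actions that lead into a wall.
// `probs` is the policy over 4 actions, `valid[a]` is false when action `a` hits a wall.
fn mask_and_renormalize(probs: [f64; 4], valid: [bool; 4]) -> [f64; 4] {
    let mut masked = [0.0; 4];
    for a in 0..4 {
        if valid[a] {
            masked[a] = probs[a];
        }
    }
    let total: f64 = masked.iter().sum();
    assert!(total > 0.0, "at least one action must stay valid");
    for p in masked.iter_mut() {
        *p /= total; // the remaining probabilities sum to 1 again
    }
    masked
}

// Example: in a corner with 'up' and 'left' blocked, a uniform policy becomes [0, 0.5, 0, 0.5]:
// let p = mask_and_renormalize([0.25; 4], [false, true, false, true]);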
Then if we add another state 'W' everywhere around our current grid, we only emphasize the problem we had originally: every decision takes into consideration only the current state. If that state was near a wall, then we learned that that direction is bad. Instead of just observing what is nearby, our agent takes actions like a blind hedgehog in the dark. So we need to combine all the surrounding cells into one state.
But that exponentially increases our set of possible states and the dimensions of the transition matrix. So let's only include the cell values that matter, i.e. the ones that carry reward information; that's not every value, right?
We skip the diagonal neighbours, so the current cell plus its 4 adjacent cells give 5 cells that form one meta-state. With only 3 meaningful values per cell that gives 3^5 = 243 meta-states.
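A sketch of the encoding, assuming the three meaningful per-cell values are free, wall, and reward (my own labels): treat the 5 observed cells as digits of a base-3 number, which enumerates exactly the 3^5 = 243 meta-states.

// Cell observations: only three values matter for the meta-state.
#[derive(Clone, Copy)]
enum Cell {
    Free = 0,
    Wall = 1,
    Reward = 2,
}

// Pack [current, up, down, left, right] into one index in 0..243 (base-3 encoding).
fn meta_state(cells: [Cell; 5]) -> usize {
    cells.iter().fold(0, |acc, &c| acc * 3 + c as usize)
}

// Example: everything free except a wall above -> a unique index among the 243 meta-states:
// let s = meta_state([Cell::Free, Cell::Wall, Cell::Free, Cell::Free, Cell::Free]);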
Further development
Sub goals
Goal-Conditioned Reinforcement Learning: Problems and Solutions by Liu et al. (paper, 2022). Goals are positions that the robot should reach (with its body, or with parts like arms). There are many end positions that correspond to different tasks or steps. These positions can be discovered with several methods, so there is no need to hard-code them. But there is no single answer for how to switch between goals. It closely resembles Bayesian learning, where conditional probabilities are learned from experience.
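As a toy illustration of what "goal-conditioned" means in the tabular case (my own sketch, not something from the survey): the value table simply gains a goal index, and switching goals is just reading a different slice of it.

// Tabular goal-conditioned Q: the value depends on the goal as well as the state.
struct GoalConditionedQ {
    n_goals: usize,
    n_actions: usize,
    table: Vec<f64>, // flattened as [state][goal][action]
}

impl GoalConditionedQ {
    fn new(n_states: usize, n_goals: usize, n_actions: usize) -> Self {
        Self { n_goals, n_actions, table: vec![0.0; n_states * n_goals * n_actions] }
    }
    // Switching goals means reading a different slice of the same table.
    fn q(&self, state: usize, goal: usize, action: usize) -> f64 {
        self.table[(state * self.n_goals + goal) * self.n_actions + action]
    }
}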
Nash equilibrium
Stabilization tricks
In actor-critic:
To stabilize training, the resulting sequence of returns is also standardized (i.e. to have zero mean and unit standard deviation).
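In code the standardization is just a shift by the batch mean and a division by the batch standard deviation; a sketch in Rust (the epsilon guard and the function name are mine):

// Standardize a batch of returns to zero mean and unit standard deviation.
fn standardize(returns: &mut [f64]) {
    if returns.is_empty() {
        return;
    }
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let var = returns.iter().map(|&r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt().max(1e-8); // guard against an all-equal batch
    for r in returns.iter_mut() {
        *r = (*r - mean) / std;
    }
}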
For a normalization layer (by Kazumi), where x is assumed to be a batch of activations of shape (batch, features):
import torch

mean_sq = x.mean(0).square()         # (E[x])^2 per feature
sq_mean = x.square().mean(0)         # E[x^2] per feature
std = torch.sqrt(sq_mean - mean_sq)  # per-feature standard deviation
self.activation = torch.mean(sq_mean - 2 * std)