Reinforcement Learning

Roadmap

Environments

Sims:

  • Isaac Sim
  • MuJoCo
  • Webots
  • Gazebo

Walking bipeds:

  • Kyle (code from "cat crime research")
  • ???

Personal research

Link

Articles

Questions

  • Does anyone know whether the reward function in reinforcement learning has ever been derived from logical reasoning instead of being given a priori?

Program

I have read the DQN paper and will implement something better than any Python code.

Let's write a reinforcement learning program in Rust. To begin with, we will solve the Frozen Lake environment. We will use a grid world from rsrl.

cargo new qu-robot
cd qu-robot
cargo add rsrl
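Before wiring in rsrl, here is a minimal, self-contained sketch of tabular Q-learning on a 4x4 Frozen Lake style grid in plain Rust. It does not use the rsrl API; the map layout, the hyperparameters, and the rand dependency (cargo add rand) are my own assumptions.

// Minimal tabular Q-learning on a 4x4 Frozen Lake style grid (sketch).
// 'S' start, 'F' frozen, 'H' hole, 'G' goal; transitions are deterministic.
use rand::Rng;

const MAP: [&str; 4] = ["SFFF", "FHFH", "FFFH", "HFFG"];
const N_ACTIONS: usize = 4; // 0: left, 1: down, 2: right, 3: up

fn step(state: usize, action: usize) -> (usize, f64, bool) {
    let (mut row, mut col) = (state / 4, state % 4);
    match action {
        0 => col = col.saturating_sub(1),
        1 => row = (row + 1).min(3),
        2 => col = (col + 1).min(3),
        _ => row = row.saturating_sub(1),
    }
    let next = row * 4 + col;
    match MAP[row].as_bytes()[col] {
        b'G' => (next, 1.0, true), // reached the goal
        b'H' => (next, 0.0, true), // fell into a hole
        _ => (next, 0.0, false),
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    let mut q = [[0.0f64; N_ACTIONS]; 16];
    let (alpha, gamma, epsilon) = (0.1, 0.99, 0.1);

    for _episode in 0..10_000 {
        let mut s = 0; // start in the top-left corner
        for _t in 0..200 {
            // epsilon-greedy action selection
            let a = if rng.gen::<f64>() < epsilon {
                rng.gen_range(0..N_ACTIONS)
            } else {
                (0..N_ACTIONS)
                    .max_by(|&i, &j| q[s][i].partial_cmp(&q[s][j]).unwrap())
                    .unwrap()
            };
            let (s2, r, done) = step(s, a);
            // Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            let best_next = q[s2].iter().cloned().fold(f64::MIN, f64::max);
            q[s][a] += alpha * (r + gamma * best_next - q[s][a]);
            s = s2;
            if done {
                break;
            }
        }
    }
    println!("Q(start, right) = {:.3}", q[0][2]);
}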

Walls in the grid environment are an interesting problem. In the rsrl crate, moving into a wall is allowed as an action, but no movement actually occurs, so the agent just stays in the same position. That's not good: staying in place should be an intentional action rather than a loophole in the implementation. Staying in the same spot is an equilibrium, so we can call it a safe strategy.

What can we do to fix it? We can remove movements that go into a wall from the list of possible actions. But then the sum of probabilities over the remaining actions is no longer equal to 1. That can be solved by normalization, but I have a more important argument against it: I think the agent should understand something about the environment and know what walls are. They can be the limits of the environment as well as any obstacle inside it. And if the agent finds a way to avoid obstacles, it can apply the same technique to the limits of the environment and leave the box...
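For completeness, the normalization fix would look roughly like this. This is only a sketch; masked_probs and its inputs are hypothetical and not part of rsrl.

// Zero out wall-directed actions and renormalize the remaining
// probabilities so they sum to 1 again.
fn masked_probs(probs: &[f64; 4], valid: &[bool; 4]) -> [f64; 4] {
    let mut out = [0.0; 4];
    let mut total = 0.0;
    for i in 0..4 {
        if valid[i] {
            out[i] = probs[i];
            total += probs[i];
        }
    }
    // Renormalize the kept probability mass.
    if total > 0.0 {
        for p in out.iter_mut() {
            *p /= total;
        }
    }
    out
}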

Now, if we add another state 'W' everywhere around our current grid, we only emphasize the problem we had originally: every decision takes into consideration only the current state. If that state is near a wall, we learn that the corresponding direction is bad. Instead of simply observing what is nearby, our agent takes actions like a blind hedgehog in the dark. So we need to combine all the surrounding cells into one state.

But this exponentially increases our set of possible states and the dimensions of the transition matrix. So let's include only the cell values that carry reward information. That's not every state, right?

We skip the diagonal cells. Thus the current cell plus its 4 neighbors give 5 cells that together form one meta-state. With only 3 meaningful values per cell, that gives 3^5 = 243 meta-states.
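A sketch of how such a meta-state could be packed into a single index; the 3-value cell encoding below is my own assumption.

// Pack the current cell and its 4 non-diagonal neighbors into one index.
// Each cell is reduced to one of 3 meaningful values, e.g. 0 = free,
// 1 = wall or hole, 2 = goal (assumed encoding), giving 3^5 = 243
// possible meta-states.
fn meta_state(cells: [u8; 5]) -> usize {
    // cells = [current, up, down, left, right], each value in 0..3
    cells.iter().fold(0usize, |acc, &c| acc * 3 + c as usize)
}

// Example: a wall above and the goal to the right of a free cell:
// meta_state([0, 1, 0, 0, 2]) == 1 * 27 + 2 == 29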

Further development

Subgoals

Goal-Conditioned Reinforcement Learning: Problems and Solutions by Liu et al. (paper, 2022). Goals are positions the robot should reach (with its body, or with parts like its arms). There are many end positions that correspond to different tasks or steps. These positions can be discovered with several methods, so there is no need to hard-code them. But there is no single answer for how to switch between goals. It very much resembles Bayesian learning, where conditional probabilities are learned from experience.
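The core idea, as I understand it, is that the value function is conditioned on a goal as well as on the state. A rough sketch follows; the types and the sparse reward are my own illustration, not taken from the paper.

use std::collections::HashMap;

type State = usize;
type Goal = usize;
type Action = usize;

struct GoalConditionedQ {
    // One table indexed by (state, goal, action) instead of (state, action).
    q: HashMap<(State, Goal, Action), f64>,
}

impl GoalConditionedQ {
    fn value(&self, s: State, g: Goal, a: Action) -> f64 {
        *self.q.get(&(s, g, a)).unwrap_or(&0.0)
    }
}

// A sparse goal-reaching reward: switching tasks means switching g,
// not redesigning the reward function.
fn reward(s: State, g: Goal) -> f64 {
    if s == g { 1.0 } else { 0.0 }
}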

Nash equilibrium

RL and Nash equilibrium

Stabilization tricks

In actor-critic:

To stabilize training, the resulting sequence of returns is also standardized (i.e. to have zero mean and unit standard deviation).

Actor-critic when calculating expected returns
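A sketch of that trick in Rust, assuming a slice of per-step returns; the epsilon guard value is an assumption.

// Standardize returns to zero mean and unit standard deviation before
// they enter the policy-gradient loss.
fn standardize_returns(returns: &[f64]) -> Vec<f64> {
    let n = returns.len() as f64;
    let mean = returns.iter().sum::<f64>() / n;
    let var = returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    // Small epsilon avoids division by zero for constant returns.
    returns.iter().map(|r| (r - mean) / (std + 1e-8)).collect()
}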

For a normalization layer (by Kazumi):

# x: layer input of shape (batch, features)
mean = x.mean(0).square()    # squared per-feature mean, (E[x])^2
square = x.square().mean(0)  # per-feature second moment, E[x^2]
# square - mean is the per-feature variance, so sqrt(square - mean) is the std
self.activation = torch.mean(square - 2 * torch.sqrt(square - mean))
