
QLearning

The QLearning algorithm is one of the many methods of reinforcement learning. It is an interesting one because its principle is very simple, which also makes it one of the easiest to implement.
The difficulty of this project was not the implementation itself, but rather deciding what to use it for and how to represent its parameters.

Group: Alexandre LHUILLIER (FIPA 2) - Pascal NUNES (FIP 15)

Explanation

Reinforcement learning is a way to teach an agent the best policy for carrying out its task. The agent has a finite set of actions and a set of states. Its goal is to figure out, for each state, the most fitting action. To reach this goal, every time it executes an action, its environment must be able to send it feedback about that action: either a reward or a penalty.


We can see the agent as a robot, capable of a finite set of actions and keeping a list of "Qualities". It is always performing an action; when this action has a consequence on it or on its environment, the environment sends back a reward (or a penalty). The agent then updates the quality value of the action it just performed. During the first iterations, the only actions that affect the computation are the ones that trigger an immediate reward. After a large number of iterations, the agent knows the sequence of actions that leads it to the best reward. The computation takes into account not only the immediate reward of the action, but also the quality of the best next action.

The quality of an action is computed this way:

Q(s, a) = Q(s, a) + α[r + τ*max Q(s', a') - Q(s, a)]

With s the current state, a the action just performed, r the reward (or penalty) received, s' the state reached after the action, a' the actions available from s', α the learning rate and τ the discount factor, which weights future rewards against immediate ones.
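To make this concrete, here is a minimal sketch of the learning loop in Java, the language used for the rest of the project. The "Environment" interface, the constants and their values are assumptions made for this illustration, not the actual API of our application: the agent keeps a table q[state][action] of qualities, picks an action (sometimes at random so that it keeps exploring), receives the reward, and applies the formula above.

    import java.util.Random;

    // Hypothetical environment interface, only here to show the structure of the loop.
    interface Environment {
        int stateCount();
        int actionCount();
        int currentState();
        double step(int action); // performs the action and returns the reward or penalty
    }

    class QLearner {
        final double[][] q;          // q[state][action]: the "Qualities" the agent stores
        final double alpha = 0.1;    // learning rate (assumed value)
        final double tau = 0.9;      // discount factor (assumed value)
        final double epsilon = 0.1;  // exploration rate (assumed value)
        final Random rng = new Random();

        QLearner(Environment env) {
            q = new double[env.stateCount()][env.actionCount()];
        }

        void learn(Environment env, int iterations) {
            for (int i = 0; i < iterations; i++) {
                int s = env.currentState();
                int a = chooseAction(s);
                double r = env.step(a);         // the environment sends back a reward or a penalty
                int sNext = env.currentState(); // state reached after the action
                double best = maxOf(q[sNext]);  // quality of the best next action
                q[s][a] = q[s][a] + alpha * (r + tau * best - q[s][a]);
            }
        }

        int chooseAction(int s) {
            if (rng.nextDouble() < epsilon) return rng.nextInt(q[s].length); // explore from time to time
            int best = 0;
            for (int a = 1; a < q[s].length; a++) if (q[s][a] > q[s][best]) best = a;
            return best;                                                     // otherwise exploit what was learned
        }

        static double maxOf(double[] values) {
            double m = values[0];
            for (double v : values) m = Math.max(m, v);
            return m;
        }
    }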

Examples


Our application

Our goal

If you have tried Thierry Masson's demos or similar ones, where a robot moves on a grid, the robot never needs to retrace its steps: it stops as soon as it reaches a reward.
We wanted to go further by allowing the robot to retrace its steps, for example to collect a reward at one end of a corridor, then turn back and reach the labyrinth's exit at the other end.


To get the best possible result, the agent must collect as many rewards as possible, retracing its steps if needed.


The implementation

To get this result, we had to rethink how we represent a state. While in the examples above a state is simply the agent's position on the grid, a state in our implementation is defined by the agent's position on the grid together with the set of mobile objects still present on the map.
What we mean by "mobile objects" in our implementation is simply a collectible reward, which disappears as soon as the agent walks on it.
This approach has a drawback: the state count depends on the number of mobile objects on the map. For example, on a 10 by 10 map with 4 hearts on it, the number of states the agent might explore is 10*10*2^4 = 1600: the area of the map multiplied by the number of possible combinations of hearts still present on it. Moreover, depending on the map's layout, backtracking isn't always necessary. More details to come.
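As an illustration, here is one possible way to encode such a state in Java, assuming the remaining hearts are tracked as a bitmask; the class and method names are made up for this sketch and are not necessarily how our application organises things. On a 10 by 10 map with 4 hearts, index() takes at most 10*10*16 = 1600 distinct values, which matches the count above.

    // One possible encoding of a state: the agent's position plus which hearts are still present.
    class GridState {
        final int x, y;       // agent position on the grid
        final int heartsMask; // bit i set = heart number i is still on the map

        GridState(int x, int y, int heartsMask) {
            this.x = x;
            this.y = y;
            this.heartsMask = heartsMask;
        }

        // Unique index usable as a row of the Q-table.
        // For a 10x10 map with 4 hearts: 10 * 10 * 2^4 = 1600 possible states.
        int index(int width, int height) {
            return (heartsMask * height + y) * width + x;
        }

        // State reached when the agent picks up heart number i: the heart disappears.
        GridState withHeartCollected(int i) {
            return new GridState(x, y, heartsMask & ~(1 << i));
        }
    }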

The software

The working application:
Our Java implementation


With the left toolbar, you can:
Remember that the robot is not aware of its environment: it only knows the rewards and penalties sent to it. Its behaviour can therefore change completely if you change the reward values. For example, if you set a penalty for walking on a heart and a reward for walking on a wall, the robot will avoid hearts and walk straight into walls.
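To illustrate that last point, the rewards can be thought of as a simple table from object type to value; the types and values below are made up for this sketch and are not the ones used in the application. Swapping the sign of the heart and wall entries is enough to invert the robot's behaviour.

    import java.util.EnumMap;
    import java.util.Map;

    // Illustrative reward table: the tile types and values are invented for this example.
    enum Tile { EMPTY, HEART, WALL, EXIT }

    class Rewards {
        static Map<Tile, Double> defaults() {
            Map<Tile, Double> r = new EnumMap<>(Tile.class);
            r.put(Tile.EMPTY, 0.0);
            r.put(Tile.HEART, 10.0); // collecting a heart is rewarded...
            r.put(Tile.WALL, -5.0);  // ...bumping into a wall is penalised
            r.put(Tile.EXIT, 20.0);
            return r;
        }
        // Swap the HEART and WALL values and the robot learns to avoid hearts
        // and to walk straight into walls instead.
    }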
With the top toolbar, you can:

After you choose a kind of object, you can drop it on the map by clicking on a tile. You can remove an object from the map by right-clicking on it.
Be careful when dropping hearts: as stated above, the more hearts on the map, the larger the number of possible states, so learning can take longer and use much more RAM.
Sources might be available later.

Bonus

A completely different application: two agents learn how to fight each other with a saber.
They have 5 possible actions:

Each agent's state is represented by the distance to its opponent and by the opponent's current pose.
The number of possible states is therefore fixed and can be represented by a two-dimensional array, just like a grid, but with a completely different result.
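As a sketch, assuming the distance is discretised into a small number of buckets, the Q-table of one of these fighters could look like the array below; the sizes and names are illustrative, not the ones used in the demo.

    // Illustrative Q-table for one fighter: the state is (distance bucket, opponent pose),
    // so the qualities fit in a fixed-size array. Sizes and names are made up.
    class FighterQTable {
        static final int DISTANCE_BUCKETS = 10; // discretised distance to the opponent (assumed)
        static final int OPPONENT_POSES = 5;    // one entry per possible pose of the opponent (assumed)
        static final int ACTIONS = 5;           // the 5 possible actions mentioned above

        // q[distance][pose][action]: quality of each action in each (distance, pose) state
        final double[][][] q = new double[DISTANCE_BUCKETS][OPPONENT_POSES][ACTIONS];

        int bestAction(int distanceBucket, int opponentPose) {
            double[] row = q[distanceBucket][opponentPose];
            int best = 0;
            for (int a = 1; a < row.length; a++) if (row[a] > row[best]) best = a;
            return best;
        }
    }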


Usage:
In the "vitesse"(speed) dropdown list, choose "10x" to speed up learning.
Push "Apprendre"("Learn").
After a few minutes, push "Stopper apprentissage"("Stop learning").
Put speed back to "1x".
Push "Exploiter" to appreciate the results of the learning.

With the buttons "Remplacer"("Replace") in the "Agent de gauche" et "Agent de droite" menus("Left Agent" and "Right Agent"), you can control directly one of the agents with these keyboard keys:

Will the man be stronger than the machine?

Still in Java