QLearning
The QLearning algorithm is one of the many methods of reinforcement learning. It is an interesting one because its principle is very simple, which makes it one of the easiest to implement. The difficulty of this project was not how to implement it, but rather what to use it for and how to represent its parameters.
Group: Alexandre LHUILLIER (FIPA 2) - Pascal NUNES (FIP 15)
Explanation
Reinforcement learning is a way to teach an agent the best policy to execute its task. The agent has a finite set of actions and a set of states. Its goal is to figure out, for each state, the most fitting action. To reach this goal, every time it executes an action its environment must send it feedback about that action, in the form of either a reward or a penalty.
We can see the agent as a robot, capable of a finite set of actions and owning a list of "Qualities". It is always doing an action; when this action has a consequence on it or its environment, the environment sends back a reward (or a penalty). The agent then computes the quality value of the action it just performed. During the first iterations, the only actions that affect the computation are the ones that provoke an immediate reward. After a large number of iterations, the agent knows which sequences of actions lead it to the best reward, because the computation takes into account not only the immediate reward of an action but also the quality of the best next action.
With
- s as the current state of the agent
- s’ the next state
- a its current action
- a’ its next action
- α a learning coefficient
- r the reward of the current (state, action) couple
- τ a discount coefficient, weighting the influence of future rewards
- Q(s, a) and Q(s’, a’) the "qualities" of, respectively, the current and the next (state, action) couple. max Q(s’, a’) represents the quality of the best action for state s’.
The quality of an action is computed this way:
Q(s, a) ← Q(s, a) + α × ( r + τ × max Q(s’, a’) − Q(s, a) )
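Since the sources are not published yet, here is only a minimal sketch of how this update can be written in Java; the class, the map-based Q-table and the parameter values are our own assumptions, not the project's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class QLearner {
    // Q(s, a) stored as a map from "state#action" to a quality value.
    private final Map<String, Double> qTable = new HashMap<>();
    private final double alpha = 0.1; // learning coefficient (value chosen arbitrarily here)
    private final double tau = 0.9;   // discount coefficient (value chosen arbitrarily here)

    private double q(String state, int action) {
        return qTable.getOrDefault(state + "#" + action, 0.0);
    }

    // max Q(s', a'): quality of the best action available in the next state.
    private double bestQuality(String nextState, int actionCount) {
        double best = q(nextState, 0);
        for (int a = 1; a < actionCount; a++) {
            best = Math.max(best, q(nextState, a));
        }
        return best;
    }

    // One QLearning update after taking `action` in `state`, receiving `reward`
    // and ending up in `nextState`.
    public void update(String state, int action, double reward, String nextState, int actionCount) {
        double oldQ = q(state, action);
        double target = reward + tau * bestQuality(nextState, actionCount);
        qTable.put(state + "#" + action, oldQ + alpha * (target - oldQ));
    }
}
```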
Examples
- QLearning implementations:
  - Applet showcasing the algorithm: http://thierry.masson.free.fr/IA/fr/qlearning_apropos.htm
  - The one-armed robot that learns how to go forward: http://www.applied-mathematics.net/qlearning/qlearning.html
- Reinforcement learning in general:
  - Robots that learn how to dribble: http://www.youtube.com/watch?v=nM1HTp_p3ly
Our application
Our goal
If you have tried Thierry Masson’s demo or similar ones, where a robot moves on a grid, you will notice that the robot never needs to retrace its steps: it stops as soon as it reaches a reward.
We wanted to go further by allowing the robot to retrace its steps, for example to grab a reward at one end of a corridor and then walk back to reach the labyrinth’s exit on the other side.
To obtain the best possible total reward, the agent must collect as many rewards as possible, retracing its steps if needed.
The implementation
To reach this result, we had to rethink how we represent a state. While in the examples above a state is represented only by the agent's position on the grid, a state in our implementation is defined by:
- Its coordinates on the grid
- The list of "mobile" objects still present on the map
This approach has a drawback: the state count depends on the number of mobile objects on the map. For example, on a 10 by 10 map containing 4 hearts, the number of states the agent might explore is 10 × 10 × 2^4 = 1600: the area of the map multiplied by the number of possible combinations of hearts still present on the map (each heart is either still there or already collected, giving 2^4 = 16 combinations). Moreover, depending on the map's layout, backtracking isn't always necessary. More details to come.
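As a purely illustrative sketch (the class and field names below are ours, not the ones used in the project), such a state can be modelled in Java as a position plus the set of mobile objects still on the map, so that it can serve as a key in the Q-table:

```java
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Illustrative state: grid position plus the set of mobile objects still on the map.
public final class GridState {
    private final int x;
    private final int y;
    private final Set<Integer> remainingObjects; // ids of the hearts still present

    public GridState(int x, int y, Set<Integer> remainingObjects) {
        this.x = x;
        this.y = y;
        // Sorted copy so that equal sets of objects always hash and compare the same way.
        this.remainingObjects = new TreeSet<>(remainingObjects);
    }

    // equals/hashCode let the state be used as a key of the Q-table (e.g. a HashMap).
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof GridState)) return false;
        GridState other = (GridState) o;
        return x == other.x && y == other.y && remainingObjects.equals(other.remainingObjects);
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y, remainingObjects);
    }
}
```

With 4 hearts, the set of remaining objects can take 2^4 = 16 different values, which is where the 1600 states mentioned above come from.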
The software
The working application:
Our Java implementation
With the left toolbar, you can:
- Start learning
- Reset all learning data
- Change the reward value for every type of object
- Change the max number of episodes before learning is considered finished
- Start an exploitation
- Change the starting position of the robot
With the top toolbar, you can:
- Zoom in or out on the map
- Choose what kind of object to drop on the map.
After you choose a kind of object, you can place it on the map by clicking on a tile. You can remove an object from the map by right-clicking on it.
Be careful when dropping hearts: as stated above, the more hearts on the map, the larger the number of possible states, so learning takes longer and uses much more RAM. With 10 hearts on the same 10 by 10 map, the state count already reaches 10 × 10 × 2^10 = 102,400.
Sources might be available later.
Bonus
A completely different application: two agents learn how to fight each other with sabers. They have 5 possible actions:
- go forward
- go backward
- attack
- parry
- wait
Their state is represented by the distance to their opponent and the opponent's current pose.
The number of possible states is therefore fixed and can be stored in a two-dimensional array, just like a grid, but used for a completely different purpose.
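For illustration, such a table can be laid out in Java roughly as follows (the array sizes and names are assumptions of ours, not taken from the demo's code):

```java
// Illustrative Q-table for the saber demo: a state is (distance, opponent pose),
// and there are 5 actions.
public class FightQTable {
    static final int DISTANCE_STEPS = 20; // discretized distances (assumed)
    static final int POSE_COUNT = 5;      // possible poses of the opponent (assumed)
    static final int ACTION_COUNT = 5;    // forward, backward, attack, parry, wait

    // qTable[distance][opponentPose][action] = quality of that action in that state.
    final double[][][] qTable = new double[DISTANCE_STEPS][POSE_COUNT][ACTION_COUNT];

    // Best action for a given state: the argmax over the last dimension.
    int bestAction(int distance, int opponentPose) {
        int best = 0;
        for (int a = 1; a < ACTION_COUNT; a++) {
            if (qTable[distance][opponentPose][a] > qTable[distance][opponentPose][best]) {
                best = a;
            }
        }
        return best;
    }
}
```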
In the "vitesse"(speed) dropdown list, choose "10x" to speed up learning.
Push "Apprendre"("Learn").
After a few minutes, push "Stopper apprentissage"("Stop learning").
Put speed back to "1x".
Push "Exploiter" to appreciate the results of the learning.
With the buttons "Remplacer"("Replace") in the "Agent de gauche" et "Agent de droite" menus("Left Agent" and "Right Agent"), you can control directly one of the agents with these keyboard keys:
- Left and Right arrows to go backwards or forward
- Up Arrow to attack
- Down Arrow to parry
Will man prove stronger than the machine?