Reinforcement Learning in Scala

This site contains the demos for my 'Reinforcement Learning in Scala' talk.

Links

The slides for the talk are available here.

The source code for all the demos is available on GitHub.

Demos

There are 3 demos, all of which use the same RL algorithm known as Q-learning.
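At its core, Q-learning maintains a table of Q(s, a) values and nudges each entry towards the observed reward plus the discounted value of the best action available in the next state. Below is a minimal, illustrative sketch of that update in Scala; the names and structure are assumptions for the example, not the demos' actual code.

```scala
object QLearningSketch {

  type QTable[S, A] = Map[(S, A), Double]

  // Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
  def update[S, A](
      q: QTable[S, A],
      state: S,
      action: A,
      reward: Double,
      nextState: S,
      allActions: Seq[A],
      alpha: Double, // learning rate: how far to move towards the new estimate
      gamma: Double  // discount factor: how much future reward is worth now
  ): QTable[S, A] = {
    val oldValue = q.getOrElse((state, action), 0.0)
    // best value the agent currently believes is achievable from the next state
    val maxNext  = allActions.map(a => q.getOrElse((nextState, a), 0.0)).max
    val newValue = oldValue + alpha * (reward + gamma * maxNext - oldValue)
    q.updated((state, action), newValue)
  }
}
```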

This is a continuous (non-episodic) problem with very simple rules:

Of course, the optimal policy is to always move towards A in order to pick up the reward of 10. If you run the demo, you should see the agent gradually learn this policy.

It may get stuck in a suboptimal policy (e.g. preferring the B cell) for a while, but it is guaranteed to eventually converge on the optimal policy. This is because the agent constantly explores the state space using ε-greedy action selection.
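For reference, ε-greedy action selection works like this: with a small probability ε the agent picks a random action (exploration); otherwise it picks the action with the highest current Q-value (exploitation). Here is an illustrative sketch, not the demo's actual code.

```scala
import scala.util.Random

object EpsilonGreedySketch {

  // With probability epsilon, explore (random action);
  // otherwise exploit (the action with the highest current Q-value).
  def chooseAction[S, A](
      q: Map[(S, A), Double],
      state: S,
      actions: Vector[A],
      epsilon: Double,
      rng: Random
  ): A =
    if (rng.nextDouble() < epsilon)
      actions(rng.nextInt(actions.size))
    else
      actions.maxBy(a => q.getOrElse((state, a), 0.0))
}
```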

The big table under the grid shows the agent's current Q(s, a) value for every state-action pair. This is the agent's estimate of the total (discounted) reward it expects to collect if it takes action a in state s.

The smaller table shows the same information summarised as a policy: for each state, which action(s) the agent currently believes to be best.
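One way to think about that summary: the greedy policy is simply, for each state, the set of actions whose current Q-value is highest. A rough sketch, with illustrative names rather than the demo's code:

```scala
object PolicySketch {

  // For each state, the action(s) with the highest current Q-value.
  // A set is returned because several actions may be tied.
  def greedyPolicy[S, A](q: Map[(S, A), Double]): Map[S, Set[A]] =
    q.groupBy { case ((s, _), _) => s }
      .map { case (s, entries) =>
        val best = entries.values.max
        s -> entries.collect { case ((_, a), v) if v == best => a }.toSet
      }
}
```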

This episodic problem is a classic in RL literature.

At every time step the agent must push the cart either to the left or the right. The goal is to stop the pole from toppling too far either to the left or the right, whilst also ensuring the cart does not crash into the walls.

The rules are as follows:

It's fascinating to see how quickly the agent learns, especially bearing in mind:

  1. Q-learning is a model-free algorithm, so the agent has no idea of the problem it's solving. It doesn't know anything about poles, carts, angular velocities, and so on. All it knows is that it has to pick one of two actions at every time step.
  2. The amount of feedback from the environment is very small. All the agent gets is a negative reward at the end of the episode, as sketched below.
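To make this concrete, here is a hypothetical sketch of everything the agent "sees" in this problem: an opaque state, a binary choice of action, and a reward that is zero on every step until the final, failing one. The names and the exact reward values are assumptions for illustration, not taken from the demo's source.

```scala
// Hypothetical sketch of the agent's view of the pole-balancing problem.
// Names and reward values are illustrative assumptions.
sealed trait Push
case object PushLeft  extends Push
case object PushRight extends Push

trait PoleBalancingEnv[State] {
  // Apply one action for one time step.
  // Returns (next state, reward, episode finished?).
  // The reward is 0.0 on every step; a negative reward arrives only when the
  // pole topples too far or the cart hits a wall, ending the episode.
  def step(state: State, action: Push): (State, Double, Boolean)
}
```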

To get a feel for the problem, you might want to try it yourself first. Use the Left and Right arrow keys on your keyboard to move the cart.

Next you can watch the agent learn. Use the buttons to run through a single time step, a single episode or continuously.

This one is an exercise for the reader.

The demo shows a very "dumb" agent. Its state space is enormous, so it has no chance of doing any meaningful learning.

See if you can improve the agent by redesigning its state space and putting it through some training.
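One way to approach it, as a hedged illustration: instead of keying the Q-table on the raw game state, map each state down to a handful of hand-picked features, so the agent has far fewer (state, action) pairs to estimate. All names below are made up for the example.

```scala
object StateSpaceSketch {

  // Illustrative only: shrinking a huge raw state into a small feature-based one.
  case class RawState() // stands in for everything the environment exposes

  // A compact, hand-designed state. With only a few distinct values per field,
  // the Q-table stays small enough for the agent to learn from limited experience.
  case class CompactState(
      nearestRewardDirection: Int, // e.g. 0-3 for the four compass directions
      dangerAdjacent: Boolean      // is something harmful in an adjacent cell?
  )

  // The mapping from raw to compact state is where the redesign happens.
  def compress(raw: RawState): CompactState = ???
}
```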

Take a look at the README for more details.