This site contains the demos for my 'Reinforcement Learning in Scala' talk.

The slides for the talk are available here.

The source code for all the demos is available on GitHub.

There are 3 demos, all of which use the same RL algorithm known as Q-learning.

The first demo is a continuous (non-episodic) problem with very simple rules:

- The agent (the red dot on the grid) can move up, down, left or right.
- If the agent tries to leave the edge of the grid, it stays in the same cell and gets a reward of -1.
- If the agent is in cell `A` and moves in any direction, it jumps to `A'` and gets a reward of 10.
- If the agent is in cell `B` and moves in any direction, it jumps to `B'` and gets a reward of 5.
- In all other cases, it gets no reward for moving around the grid.
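The rules above can be sketched as a single transition function. This is only an illustration, not the demo's actual source: the 5x5 grid size and the coordinates of `A`, `A'`, `B` and `B'` are assumptions (borrowed from the textbook gridworld layout), and all names are hypothetical.

```scala
object GridworldSketch {
  final case class Pos(x: Int, y: Int)

  sealed trait Move
  case object GoUp extends Move
  case object GoDown extends Move
  case object GoLeft extends Move
  case object GoRight extends Move

  // Assumed layout: the text does not give the grid size or cell positions.
  val Size = 5
  val A = Pos(1, 0)
  val APrime = Pos(1, 4)
  val B = Pos(3, 0)
  val BPrime = Pos(3, 2)

  /** Returns the next position and the reward for taking `move` in `pos`. */
  def step(pos: Pos, move: Move): (Pos, Double) =
    if (pos == A) (APrime, 10.0)      // any move from A jumps to A'
    else if (pos == B) (BPrime, 5.0)  // any move from B jumps to B'
    else {
      val next = move match {
        case GoUp    => pos.copy(y = pos.y - 1)
        case GoDown  => pos.copy(y = pos.y + 1)
        case GoLeft  => pos.copy(x = pos.x - 1)
        case GoRight => pos.copy(x = pos.x + 1)
      }
      val onGrid = next.x >= 0 && next.x < Size && next.y >= 0 && next.y < Size
      if (onGrid) (next, 0.0) // ordinary move, no reward
      else (pos, -1.0)        // tried to leave the grid: stay put, reward of -1
    }
}
```

Because the function never returns a terminal state, the agent can keep stepping indefinitely, which is what makes the problem non-episodic.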

Of course, the optimal policy is to always move towards `A` in order to pick up the reward of 10.
If you run the demo, you should see the agent gradually learn this policy.

It may get stuck in a local optimum (i.e. preferring the `B` cell) for a while,
but it is guaranteed to eventually converge on the optimal policy.
This is because the agent constantly explores the state space using the ε-greedy algorithm.
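As a rough illustration of ε-greedy action selection: with probability ε the agent picks a random action, otherwise it picks the action with the highest current estimate. The function name, the `Map` representation of the estimates and the explicit `Random` parameter are assumptions made for this sketch, not taken from the demo code.

```scala
import scala.util.Random

object EpsilonGreedySketch {
  /**
   * Picks an action for one state. `qValues` maps each available action to
   * its current Q estimate; `epsilon` is the probability of exploring.
   */
  def chooseAction[A](qValues: Map[A, Double], epsilon: Double, rng: Random): A = {
    val actions = qValues.keys.toVector
    if (rng.nextDouble() < epsilon)
      actions(rng.nextInt(actions.size)) // explore: uniformly random action
    else
      qValues.maxBy(_._2)._1             // exploit: highest current estimate
  }
}
```

With `epsilon = 0.0` the agent is purely greedy and can stay stuck on `B`; any positive value means every action keeps being tried occasionally.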

The big table under the grid shows the agent's current `Q(s, a)` for all state-action pairs.
This is the agent's current estimate of the value of being in state `s` and taking action `a`.
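For reference, the standard Q-learning update nudges `Q(s, a)` towards the observed reward plus the discounted best estimate for the next state. The sketch below assumes an immutable `Map` as the Q-table, a default estimate of 0 for unseen pairs, and caller-supplied `alpha` (step size) and `gamma` (discount factor); the demo's own representation and parameter values may differ.

```scala
object QLearningSketch {
  /** One Q-learning update for the transition (s, a) -> (reward, nextState). */
  def update[S, A](
      q: Map[(S, A), Double],
      actions: Seq[A], // all actions available in nextState (must be non-empty)
      s: S,
      a: A,
      reward: Double,
      nextState: S,
      alpha: Double,   // step size
      gamma: Double    // discount factor
  ): Map[(S, A), Double] = {
    val current = q.getOrElse((s, a), 0.0)
    // Value of the best action in the next state, according to current estimates
    val bestNext = actions.map(a2 => q.getOrElse((nextState, a2), 0.0)).max
    val target = reward + gamma * bestNext
    q.updated((s, a), current + alpha * (target - current))
  }
}
```

Each step of the demo would correspond to one such update, which is why individual cells of the table change one at a time.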

The smaller table shows the same information summarised as a policy. In other words, it shows, for each state, which action(s) the agent currently believes to be best.
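Deriving that summary from the Q-table could look roughly like this. It is a sketch under the assumption that the Q-table is a `Map` keyed by state-action pairs; it returns a set per state because several actions can be tied for best.

```scala
object GreedyPolicySketch {
  /** For each state, the actions whose Q estimate is (jointly) the highest. */
  def bestActions[S, A](q: Map[(S, A), Double]): Map[S, Set[A]] =
    q.groupBy { case ((s, _), _) => s }.map { case (s, entries) =>
      val best = entries.values.max
      // Exact comparison is enough here: tied entries hold the same Double
      val winners = entries.toSeq.collect { case ((_, a), v) if v == best => a }
      s -> winners.toSet
    }
}
```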

The second demo is an episodic problem, and a classic in the RL literature: balancing a pole on a moving cart.

At every time step the agent must push the cart either to the left or the right. The goal is to stop the pole from toppling too far either to the left or the right, whilst also ensuring the cart does not crash into the walls.

The rules are as follows:

- If the pole topples more than 12° from vertical, the agent gets a reward of -1 and the episode ends.
- If the cart hits the left or right wall, the agent gets a reward of -1 and the episode ends.
- In all other cases, the agent gets no reward.
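These rules amount to a small reward-and-termination check after every push. In the sketch below only the 12° threshold comes from the rules above; the `CartState` fields, the convention that the cart position is measured from the centre of the track, and the `wallDistance` parameter are hypothetical.

```scala
object PoleBalancingSketch {
  // Only the fields needed for the rules; a real simulation would also track velocities.
  final case class CartState(cartPosition: Double, poleAngleDegrees: Double)

  val MaxPoleAngleDegrees = 12.0

  /**
   * Returns (reward, episodeEnded) for the state reached after a push.
   * `wallDistance` is the assumed distance from the track centre to each wall.
   */
  def outcome(state: CartState, wallDistance: Double): (Double, Boolean) = {
    val poleToppled = math.abs(state.poleAngleDegrees) > MaxPoleAngleDegrees
    val hitWall = math.abs(state.cartPosition) >= wallDistance
    if (poleToppled || hitWall) (-1.0, true) // failure: reward of -1, episode ends
    else (0.0, false)                        // otherwise: no reward, keep going
  }
}
```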

It's fascinating to see how quickly the agent learns, especially bearing in mind:

- Q-learning is a model-free algorithm, so the agent has no idea of the problem it's solving. It doesn't know anything about poles, carts, angular velocities, and so on. All it knows is that it has to pick one of two actions at every time step.
- The amount of feedback from the environment is very small. All the agent gets is a negative reward at the end of the episode.

To get a feel for the problem, you might want to try it yourself first. Use the Left and Right arrow keys on your keyboard to move the cart.

Next you can watch the agent learn. Use the buttons to run through a single time step, a single episode, or continuously.

The third demo is an exercise for the reader.

The demo shows a very "dumb" agent. Its state space is enormous, so it has no chance of doing any meaningful learning.

See if you can improve the agent by redesigning its state space and putting it through some training.

Take a look at the README for more details.