Reinforcement Learning (Q-learning) - An Introduction (Part 1)

Have you heard about AI learning to play computer games on their own and giving tough competitions to expert Human gamers?

A very popular example being Deepmind whose AlphaGo program defeated the South Korean Go world champion in 2016. Other than this there are other AI agents developed with the intent of playing Atari games like Breakout, Pong, and Space Invaders.

These AI agents use Reinforcement Learning algorithms which is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

Note: If you are already aware of RL and Q-learning concepts, you may directly move to Part 2 which has an implementation of Q-learning using R from Scratch

Reinforcement learning (RL)

RL is an area of machine learning concerned with how software agents ought to take actions in an environment to maximize some notion of cumulative reward.

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then:

0 seconds wait time + 15 seconds fight time

On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in longer wait times. However, time-fighting other passengers are less. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:

5 second wait time + 0 second fight time.

Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy.

No alt text provided for this image

Many algorithms that come under Reinforcement Learning. For this article, we would focus on Q-learning which one of the most famous among other RL algorithms.

Q-Learning

Q-learning is an off-policy, model-free RL algorithm based on the well-known Bellman Equation.

Bellman’s Equation:

Where:

Alpha(α) – Learning rate (0<α≤1) – It is the rate by which the Q-values are updated. A high value of Alpha (close to 1) means the magnitude of the Q values will update fastly and take fewer iterations to learn. Similarly, low values of Alpha, will update Q values slowly and take more iterations to learn.

Gamma(γ) – Discount factor (0≤γ≤1) – Determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective award, whereas, a discount factor of 0 makes our agent consider only immediate reward, hence making it greedy.

Important terms to understand:

Action (A): All the possible moves that the agent can take
State (S): Current situation returned by the environment.
Reward (R): An immediate return sends back from the environment to evaluate the last action.
Policy (π): The strategy that the Agent employs to determine the next action based on the current state.
Value (V): The expected long-term return with discount, as opposed to the short-term reward under policy π.
Q-value or action-value (Q): Q-value is similar to Value, except that it takes an extra parameter, the current Action. Q(state, action) refers to the long-term return of the current State, taking Action under policy π.

Psuedo Code:

Ref: https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html

This procedural approach can be translated into simple language steps as follows:

Initialize the Q-values table, Q(s, a).
Observe the current state, s.
Choose an action, a, for that state based on one of the action selection policies(soft, greedy or softmax).
Take the action, and observe the reward, r, as well as the new state, s’.
Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. The updating is done according to the formula and parameters described above.
Set the state to the new state, and repeat the process until a terminal state is reached.

Reinforcement learning is an awesome and interesting set of algorithms but there are few of many scenarios where you should not use the reinforcement learning model:

When you have enough labeled data to solve the problem with a supervised learning method.
When you do not have the heavy computing power and learning time as RL algorithms are heavy and take longer to train.

I hope with this article we could get an overview of RL and Q-learning algorithms. If you are still curious and want to see this working, Check out Part 2 of Reinforcement Learning; which has an implementation of Q-learning using R from Scratch.

I would be happy to discuss any doubts related to the article and R programming. Please feel free to reach out to me via Linkedin or Email.

References:

https://en.wikipedia.org/wiki/Reinforcement_learning

https://en.wikipedia.org/wiki/Q-learning

https://towardsdatascience.com/introduction-to-various-reinforcemen…

https://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html