.

This the second part of Reinforcement Learning (Q-learning). If you would like to understand the RL, Q-learning, and key terms please read Part 1.

In this part, we will implement a simple example of Q learning using the R programming language from scratch. It is expected from you to understand the basics of R programming and complete the reading of Part 1 of this article.

We are coding the algorithms using the R base package only however we would need a few libraries to plot the various matrix and visualize the output.

install.packages('plot.matrix')install.packages('RColorBrewer')library(plot.matrix)

library(RColorBrewer)

The below function will create a plot of any R matrix using the plot.matrix library. Please refer CRAN page for details about arguments used in the plot function. Plot_matrix function takes matrix to plot and no. of digits(optional) as input arguments.

plot_matrix <- function(mat_pot, digits_arg){

digits_arg <- ifelse(missing(digits_arg),0,digits_arg)

plot(mat_pot, cex = 1.2, fmt.cell=paste0('%.',digits_arg,'f'),

col=brewer.pal(3,"Blues"), breaks=c(-500, 0, 0, 500),

xlab="States", ylab = "States",main = "")

}

Let's consider there are 5 states i.e. 1,2,3,4,5 and possible actions as left, right, up and down.

states <- seq(1, 5, by = 1)

state_seq <- cbind(merge(states,states), state =

seq(1,length(states)*length(states)))

state_mat <- matrix(state_seq$state, nrow = length(states), ncol= length(states))

plot_matrix(state_mat,1) ## matrix, digits

The above code will create a seq of states from 1 to 5. Since we can move from one state to state using the actions we will create a states * states matrix which will become our environment. So if we label each state to other state movements as numbers then it will be 1 to 5*5 =25. Below plot can help you understand this better:

Once we have our states and actions are defined, next, we need is the reward for each of the states. This can be negative or positive or any other set of values (numeric) based on the problem you are solving. Here, we have randomly given rewards to various states and creating a matrix.

rewards <- c(0,-100,-100,0,-100,10,10,10,10,-100,10,10,-100,10,-100,

10,10,10,10,-100,

-100,-100,10,10,100)

rewards_mat <- matrix(rewards, nrow = length(states), ncol= length(states))

plot_matrix(rewards_mat)

If we compare the rewards matric with the state matrix, the state numbers 2, 3, 5, 10, 13, 15, 20, 21, and 22 have rewards -100 points which means we do not want our Agent to go there or those are dangerous areas. And State 25 has reward 100 points so this would be our Goal state.

goal <- which(rewards==max(rewards), arr.ind=TRUE)

Further, we need to create a Q matrix with the same dimensions as the states matrix and initialize it with 0 values.

## Initialize Q-MatrixQ <- matrix(0, nrow = length(states), ncol= length(states))plot_matrix(Q,1)

Suppose Agent is at State 3, then based on actions (left, right, up, down) then next possible states will be states 2, 4, and 8. So here we have created a function that would let the Agent know what all are the possible next steps. Also, this will be required when we are calculating Q values using Bellman's equation. I will be explaining this later. :)

getNextStates <- function(cs) {stalen <- length(states);NS <- stalen*stalen

aa <- state_seq[state_seq$state == cs,]

if (aa$x == max(states)) {

ns <- c(cs - 1,

cs - stalen,

cs + stalen);

} else if (aa$x == min(states)) {

ns <- c(cs + 1,

cs - stalen,

cs + stalen);

} else {

ns <- c(cs + 1,

cs - 1,

cs - stalen,

cs + stalen);

}

nss <- sort(ns[ns > 0 & ns <= NS]);

return(nss);

}

The above function will return all the possible states that Agent can move based on its current states. Refer the function call example below:

> getNextStates(3)[1] 2 4 8> getNextStates(7)[1] 2 6 8 12

The episode is an iteration that starts with Agent's initial state to terminal state(Goal). So an agent will start from a random state, learn while moving to various states, collect rewards, and update Q value at each step. Psuedo code for the Q-learning is as below:

Following the above steps, R code is written for 10 episodes which can be configured by changing the value of N.

########### Episodes Execution ################N <- 10 # No. of Episode

alpha <- 0.8 # Learning Rate

gamma <- 0.7 # Discount Factor

for (i in 1:N) {

current_episode <- i;

cat("\nStart Episode: ", current_episode)

## choose next state from possible actions at current state

cs <- sample(state_seq$state, 1)

cat("\n\tCurrent state: ", cs)

step_num <- 1;

while (T) {

cat("\n\n\tStep no.: ", step_num)

cat("\n\t\tCurrent State: ", cs)

reward <- rewards[cs]

if(reward == 0 | is.na(reward) | length(reward) == 0 ){

reward <- 0

}

cat("\n\t\tReward CS: ", reward)

next.states <- getNextStates(cs);

cat("\n\t\tPossible next states: ", next.states)

# next.states

# If we have any states present, else choose randomly.

if (length(next.states) == 1) {

ns <- next.states

} else {

ns <- sample(next.states, 1)

}

cat("\n\t\tNext state: ", ns)

# Update Q values for next states.

Q[cs] <- round(Q[cs] + alpha * (reward +

gamma * max(Q[getNextStates(ns)])-Q[cs]),1);

cat("\n\t\tNew Q-Value: ", Q[cs])plot_matrix(Q,1)

Sys.sleep(0.2)

if (cs == goal | step_num > 20) {

break;

}

cs <- ns;

step_num <- step_num + 1;

}

cat("\nEnd Episode: ", current_episode)

}

This will execute 10 episodes updating the Q matrix. To understand the working of the code, we will now observe a few steps from Episode 1:

## Start Episode: 1## Initial state: 5## Step no.: 1## Current state: 5

## Reward CS: -100

## Possible next states: 4 10

## Next state: 10

## New Q-Value: -80

## Step no.: 2## Current state: 10

## Reward CS: -100

## Possible next states: 5 9 15

## Next state: 15

## New Q-Value: -80

In the first 2 steps of Episode 1, Agent has updated Q values for 2 states i.e. 5 and 10. Also, Q values updated are negative which will reflect what we have seen in the rewards matrix as we have negative rewards for these states. Similarly, it will keep moving, analyze the reward, and update the Q matrix. Let's fast forward and see the terminating step of Episode 1 which is when the Current state is State 25.

## Step no.: 7## Current State: 25## Reward CS: 100

## Possible next states: 20 24

## Next state: 24

## New Q-Value: 84.5

## End Episode: 1

In this step, Agent has updated the Q value for State 25 with the new Q value 84.5.

Further Agent will execute 10 episodes and terminate when it will reach state 25. It may also terminate when the number of steps within an episode reaches 20 because we do not want our agent to stay between some states.

*Final Q table*

Once the Q matrix is updated, Agent may choose max rewarding next states and reach the goal with the correct and shortest path.

Also, we can observe that our Goal state 25 has the maximum Q value.

We have reached the end of this tutorial. The code used in this code is available in my Git repo

I hope you have enjoyed the articles and have some takeaways from this. I would be happy to discuss any doubts related to the above code and R programming. Please feel free to reach out to me via Linkedin or Email.

Views: 2550

Tags: AI, Q, R, artificial, data, dsc_ai, dsc_code, dsc_tagged, intelligence, learning, More…programming, reinforcement, science

- Demand proliferates for low-code app development platforms
- US Senate mulling bill on data breach notifications
- AIOps network management requires vendor-buyer teamwork
- Deloitte SAP Industry Cloud apps aim to fill ERP gaps
- Compare 7 headless CMS offerings and their key differences
- 9 steps to a dynamic data architecture plan
- Building trustworthy AI is key for enterprises
- IoT and responsibility: Use digital for good
- 4 zero-trust IoT steps to scale security
- Apply hyperscale data center design principles to IT storage

Posted 27 July 2021

© 2021 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central