In this article the multiarmed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.
Given a set of actions with some unknown reward distributions, maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately. This is the traditional exploreexploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found sofar by exploring.
The following figure defines the problem mathematically and shows the explorationexploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.
The following figure shows a motivating application of the multiarmed bandit problem in drug discovery. Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.
The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.
The following code from the github repository of the same course shows how the basic bandit Framework can be defined:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103

import numpy as np import sys ### Interface class Environment( object ): def reset( self ): raise NotImplementedError( 'Inheriting classes must override reset.' ) def actions( self ): raise NotImplementedError( 'Inheriting classes must override actions.' ) def step( self ): raise NotImplementedError( 'Inheriting classes must override step' ) class ActionSpace( object ): def __init__( self , actions): self .actions = actions self .n = len (actions) ### BanditEnv Environment class BanditEnv(Environment): def __init__( self , num_actions = 10 , distribution = "bernoulli" , evaluation_seed = "387" ): super (BanditEnv, self ).__init__() self .action_space = ActionSpace( range (num_actions)) self .distribution = distribution np.random.seed(evaluation_seed) self .reward_parameters = None if distribution = = "bernoulli" : self .reward_parameters = np.random.rand(num_actions) elif distribution = = "normal" : self .reward_parameters = (np.random.randn(num_actions), np.random.rand(num_actions)) elif distribution = = "heavytail" : self .reward_parameters = np.random.rand(num_actions) else : print ( "Please use a supported reward distribution" ) #, flush = True) sys.exit( 0 ) if distribution ! = "normal" : self .optimal_arm = np.argmax( self .reward_parameters) else : self .optimal_arm = np.argmax( self .reward_parameters[ 0 ]) def reset( self ): self .is_reset = True return None def compute_gap( self , action): if self .distribution ! = "normal" : gap = np.absolute( self .reward_parameters[ self .optimal_arm]  self .reward_parameters[action]) else : gap = np.absolute( self .reward_parameters[ 0 ][ self .optimal_arm]  self .reward_parameters[ 0 ][action]) return gap def step( self , action): self .is_reset = False valid_action = True if (action is None or action = self .action_space.n): print ( "Algorithm chose an invalid action; reset reward to inf" ) #, flush = True) reward = float ( "inf" ) gap = float ( "inf" ) valid_action = False if self .distribution = = "bernoulli" : if valid_action: reward = np.random.binomial( 1 , self .reward_parameters[action]) gap = self .reward_parameters[ self .optimal_arm]  self .reward_parameters[action] elif self .distribution = = "normal" : if valid_action: reward = self .reward_parameters[ 0 ][action] + self .reward_parameters[ 1 ][action] * np.random.randn() gap = self .reward_parameters[ 0 ][ self .optimal_arm]  self .reward_parameters[ 0 ][action] elif self .distribution = = "heavytail" : if valid_action: reward = self .reward_parameters[action] + np.random.standard_cauchy() gap = self .reward_parameters[ self .optimal_arm]  self .reward_parameters[action] #HACK to compute expected gap else : print ( "Please use a supported reward distribution" ) #, flush = True) sys.exit( 0 ) return ( None , reward, self .is_reset, '') #Policy interface class Policy: #num_actions: (int) Number of arms [indexed by 0 ... num_actions1] def __init__( self , num_actions): self .num_actions = num_actions def act( self ): pass def feedback( self , action, reward): pass 
Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).
In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.
Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.
Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.
Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive RoundRobin algorithms to maximize the cumulative rewards over 10000 timesteps.
As can be seen from the next animations and figures
Again, let’s assume that we don’t know p_i values and we use the EpsilonGreedy (with different values of the hyperparameter ε) and the OptimisticGreedy (with different values of the hyperparameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.
As can be seen from the next animations and figures
ε = 0
ε = 0.05
ε = 0.1
ε = 0.15
ε = 1.0
R = 0
R = 1
R = 3
R = 5
R = 10000
The next figure shows two algorithms (UCB and Bayesian ThompsonBeta Posterior Sampling) that achieve logarithmic regret.
Again, let’s assume that we don’t know p_i values, we implement and use the UCB1 and ThompsonBeta algorithms to maximize the cumulative rewards over 10000 timesteps.
As can be seen from the next animations and figures
>
© 2018 Data Science Central ® Powered by
Badges  Report an Issue  Privacy Policy  Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central