In this article, the multi-armed bandit problem and a few algorithms for solving it are discussed. The problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft, and the problem description is taken from the assignment itself.
Given a set of actions with unknown reward distributions, the goal is to maximize the cumulative reward by taking actions sequentially, one action at each time step, obtaining a reward immediately after each action. This is the traditional explore-exploit trade-off in reinforcement learning: in order to find the optimal action, one needs to explore all the actions, but not too much. At the same time, one needs to exploit the best action found so far by exploring.
The following figure defines the problem mathematically and illustrates the exploration-exploitation dilemma in the general setting of reinforcement learning problems with sequential decisions under incomplete information.
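In symbols (one common formalization; the figure itself is not reproduced here): at each time step t, the learner chooses an action a_t from K arms, receives a reward r_t drawn from the unknown distribution of that arm, and tries to maximize the expected cumulative reward:

```latex
\max \; \mathbb{E}\!\left[ \sum_{t=1}^{T} r_{t} \right],
\qquad r_{t} \sim \mathcal{D}_{a_{t}}, \quad a_{t} \in \{1, \dots, K\}
```

Here $\mathcal{D}_i$ denotes the reward distribution of arm $i$, with mean $\mu_i$, both unknown to the learner.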
The following figure shows a motivating application of the multi-armed bandit problem in drug discovery. Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied to a sequence of patients, with a reward of 1 if a patient survives after the application of the drug and 0 if they die, the goal is to save as many patients as we can.
The following figures show the naive algorithm and a few variants of the greedy algorithm for maximizing the cumulative reward.
The following code, from the GitHub repository of the same course, shows how the basic bandit framework can be defined:
```python
import numpy as np
import sys

### Interface
class Environment(object):
    def reset(self):
        raise NotImplementedError('Inheriting classes must override reset.')
    def actions(self):
        raise NotImplementedError('Inheriting classes must override actions.')
    def step(self):
        raise NotImplementedError('Inheriting classes must override step.')

class ActionSpace(object):
    def __init__(self, actions):
        self.actions = actions
        self.n = len(actions)

### BanditEnv Environment
class BanditEnv(Environment):
    def __init__(self, num_actions=10, distribution="bernoulli", evaluation_seed=387):
        super(BanditEnv, self).__init__()
        self.action_space = ActionSpace(range(num_actions))
        self.distribution = distribution

        np.random.seed(evaluation_seed)

        self.reward_parameters = None
        if distribution == "bernoulli":
            self.reward_parameters = np.random.rand(num_actions)
        elif distribution == "normal":
            self.reward_parameters = (np.random.randn(num_actions),
                                      np.random.rand(num_actions))
        elif distribution == "heavytail":
            self.reward_parameters = np.random.rand(num_actions)
        else:
            print("Please use a supported reward distribution")
            sys.exit(0)

        if distribution != "normal":
            self.optimal_arm = np.argmax(self.reward_parameters)
        else:
            self.optimal_arm = np.argmax(self.reward_parameters[0])

    def reset(self):
        self.is_reset = True
        return None

    def compute_gap(self, action):
        if self.distribution != "normal":
            gap = np.absolute(self.reward_parameters[self.optimal_arm] -
                              self.reward_parameters[action])
        else:
            gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
                              self.reward_parameters[0][action])
        return gap

    def step(self, action):
        self.is_reset = False

        valid_action = True
        if action is None or action < 0 or action >= self.action_space.n:
            print("Algorithm chose an invalid action; reset reward to -inf")
            reward = float("-inf")
            gap = float("inf")
            valid_action = False

        if self.distribution == "bernoulli":
            if valid_action:
                reward = np.random.binomial(1, self.reward_parameters[action])
                gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action]
        elif self.distribution == "normal":
            if valid_action:
                reward = self.reward_parameters[0][action] + \
                         self.reward_parameters[1][action] * np.random.randn()
                gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
        elif self.distribution == "heavytail":
            if valid_action:
                reward = self.reward_parameters[action] + np.random.standard_cauchy()
                gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action]  # HACK to compute expected gap
        else:
            print("Please use a supported reward distribution")
            sys.exit(0)

        return (None, reward, self.is_reset, '')

#Policy interface
class Policy:
    #num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def act(self):
        pass

    def feedback(self, action, reward):
        pass
```
Now, in order to implement an algorithm, we just need to extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).
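As a concrete illustration, here is a minimal sketch of an epsilon-greedy policy built on that interface. This is not the course's reference solution; the Policy base class is repeated so the snippet is self-contained.

```python
import numpy as np

# Repeated here so the snippet runs on its own; in the article's framework
# this class is already defined above.
class Policy:
    def __init__(self, num_actions):
        self.num_actions = num_actions
    def act(self):
        pass
    def feedback(self, action, reward):
        pass

class EpsilonGreedy(Policy):
    """Pick a random arm with probability epsilon, else the best arm so far."""
    def __init__(self, num_actions, epsilon=0.1):
        super().__init__(num_actions)
        self.epsilon = epsilon
        self.counts = np.zeros(num_actions)   # number of pulls per arm
        self.values = np.zeros(num_actions)   # running mean reward per arm

    def act(self):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions)   # explore
        return int(np.argmax(self.values))               # exploit

    def feedback(self, action, reward):
        # incremental update of the sample mean for the pulled arm
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```

With ε = 0 this degenerates to the pure greedy policy; with ε = 1 it is uniform random exploration.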
In order to analyze the greedy algorithms theoretically and find algorithms with better performance guarantees, let's define the regret as the gap between the total expected reward obtained by the optimal policy (which assumes the reward distributions are known) and the cumulative reward obtained by the actions an algorithm actually chooses, as shown in the following figure. Hence, maximizing the cumulative reward is equivalent to minimizing the regret.
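In a common formulation (not necessarily the exact notation of the figure), with $\mu^{*}$ the mean reward of the best arm, the regret after $T$ steps is:

```latex
\rho(T) = T\,\mu^{*} - \sum_{t=1}^{T} \mathbb{E}\left[ r_{t} \right],
\qquad \mu^{*} = \max_{1 \le i \le K} \mu_{i}
```

The first term is what an oracle that always pulls the optimal arm would collect in expectation; the second is what the algorithm actually collects.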
Given that pure greedy exploits too much and epsilon-greedy explores too much, it can be shown that all of these greedy variants have regret that grows linearly in the number of time steps T.
Also, it was theoretically proven by Lai and Robbins that the lower bound on the regret is logarithmic in the number of time steps T.
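For Bernoulli arms with means $p_i$ and optimal mean $p^{*}$, the Lai-Robbins result is commonly stated as follows (a standard textbook form, not quoted from the course material), where $\rho(T)$ denotes the regret after $T$ steps and $\mathrm{KL}$ is the Kullback-Leibler divergence between Bernoulli distributions:

```latex
\liminf_{T \to \infty} \frac{\mathbb{E}[\rho(T)]}{\ln T}
\;\ge\; \sum_{i:\, p_i < p^{*}} \frac{p^{*} - p_i}{\mathrm{KL}(p_i \,\|\, p^{*})}
```

No uniformly good algorithm can do better than logarithmic regret, which is why the logarithmic-regret algorithms discussed later are, up to constants, optimal.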
Now, let's assume that we don't know the p_i values, and let's use the Greedy and the naive Round-Robin algorithms to maximize the cumulative reward over 10000 time steps.
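A self-contained toy simulation contrasting the two strategies (with made-up Bernoulli means, not the course's BanditEnv) might look like this:

```python
import numpy as np

def run(policy, p, T, rng):
    """Simulate T pulls of Bernoulli arms with means p; return cumulative reward."""
    K = len(p)
    counts = np.zeros(K)
    values = np.zeros(K)
    total = 0.0
    for t in range(T):
        if policy == "round_robin":
            a = t % K                  # cycle through all arms forever
        else:                          # "greedy": always the best sample mean so far
            a = int(np.argmax(values))
        r = rng.binomial(1, p[a])
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental mean update
        total += r
    return total

rng = np.random.default_rng(42)
p = [0.3, 0.5, 0.8]                    # hypothetical arm means
greedy_total = run("greedy", p, 10000, rng)
rr_total = run("round_robin", p, 10000, rng)
```

With these means, greedy tends to lock onto the first arm it tries and never discovers the 0.8 arm, while round-robin keeps paying for pulling the bad arms forever; both behaviors produce linear regret.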
As can be seen from the next animations and figures:
Again, let's assume that we don't know the p_i values, and let's use the Epsilon-Greedy algorithm (with different values of the hyperparameter ε) and the Optimistic-Greedy algorithm (with different values of the hyperparameter R) to maximize the cumulative reward over 10000 time steps.
As can be seen from the next animations and figures:
(Figure panels: ε = 0, 0.05, 0.1, 0.15, 1.0 for Epsilon-Greedy; R = 0, 1, 3, 5, 10000 for Optimistic-Greedy.)
The next figure shows two algorithms (UCB and Bayesian Thompson-Beta posterior sampling) that achieve logarithmic regret.
Again, let's assume that we don't know the p_i values; this time we implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative reward over 10000 time steps.
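A minimal self-contained sketch of the two policies, in their standard textbook forms (not necessarily the exact course implementation):

```python
import numpy as np

class UCB1:
    """Pull each arm once, then pick argmax of mean + sqrt(2 ln t / n_i)."""
    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.counts = np.zeros(num_actions)
        self.values = np.zeros(num_actions)
        self.t = 0

    def act(self):
        self.t += 1
        for a in range(self.num_actions):   # initialization: try every arm once
            if self.counts[a] == 0:
                return a
        bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(self.values + bonus))

    def feedback(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

class ThompsonBeta:
    """Beta(1,1) prior per arm; sample each posterior and pull the argmax."""
    def __init__(self, num_actions):
        self.successes = np.ones(num_actions)   # alpha parameters
        self.failures = np.ones(num_actions)    # beta parameters

    def act(self):
        samples = np.random.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def feedback(self, action, reward):
        # Bernoulli reward updates the conjugate Beta posterior in closed form
        if reward > 0:
            self.successes[action] += 1
        else:
            self.failures[action] += 1
```

UCB1 adds a confidence bonus that shrinks as an arm is pulled more, so under-explored arms are revisited just often enough; Thompson sampling achieves the same effect by sampling from the Beta posterior of each arm's success probability.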
As can be seen from the next animations and figures:
© 2020 TechTarget, Inc.