Why can the multi-armed bandit be regarded as a non-associative reinforcement learning task?




In Chapter 2 of Reinforcement Learning: An Introduction (by Sutton and Barto), the authors claim that the multi-armed bandit is a non-associative problem. However, in the action-value estimate update, we have


$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum\limits_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i=a}}{\sum\limits_{i=1}^{t-1} \mathbf{1}_{A_i=a}}.$$



This equation does not seem to satisfy the Markov property; in other words, the current action-value estimate depends on the previous actions $A_i$. Therefore, from this perspective, why can the multi-armed bandit be regarded as a non-associative reinforcement learning task?
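For concreteness, here is a minimal sketch of the sample-average estimate above. The Bernoulli reward distributions, the epsilon-greedy action selection, and all names are my own assumptions for illustration, not taken from the book:

```python
import numpy as np

def sample_average_bandit(bandit_probs, steps=1000, epsilon=0.1, seed=0):
    """Sample-average estimate Q_t(a) for a k-armed Bernoulli bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k = len(bandit_probs)
    Q = np.zeros(k)   # current action-value estimates Q_t(a)
    N = np.zeros(k)   # number of times each arm has been taken
    for _ in range(steps):
        # epsilon-greedy action selection (an assumption for this sketch)
        if rng.random() < epsilon:
            a = int(rng.integers(k))
        else:
            a = int(np.argmax(Q))
        reward = float(rng.random() < bandit_probs[a])  # Bernoulli reward
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]  # incremental form of the sample average
    return Q, N

# Example: Q, N = sample_average_bandit([0.2, 0.5, 0.8])
```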



By the way: in my opinion, the k-armed bandit problem is still associative. If we regard the state as $s^{(t)}=[s_1, s_2, \ldots, s_k]$, where $s_i$ is the accumulated reward for arm $i$ up to time $t$, then the state satisfies the Markov property and the update process can be explained reasonably.
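To illustrate what I mean, here is a minimal sketch that carries this augmented state alongside the run, under the same illustrative assumptions as above (Bernoulli rewards, epsilon-greedy selection, names are mine):

```python
import numpy as np

def augmented_state_run(bandit_probs, steps=1000, epsilon=0.1, seed=0):
    """Carry the proposed state s^(t) = [s_1, ..., s_k] of per-arm accumulated
    rewards (plus per-arm counts). Each transition depends only on the current
    state, the chosen action, and the sampled reward."""
    rng = np.random.default_rng(seed)
    k = len(bandit_probs)
    s = np.zeros(k)   # accumulated reward per arm: the proposed state s^(t)
    n = np.zeros(k)   # pull counts, needed to recover Q = s / n
    for _ in range(steps):
        q = np.divide(s, n, out=np.zeros(k), where=n > 0)
        a = int(rng.integers(k)) if rng.random() < epsilon else int(np.argmax(q))
        r = float(rng.random() < bandit_probs[a])
        s[a] += r      # state update uses only (s, a, r)
        n[a] += 1
    return s, n
```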


