Why can the multi-armed bandit be regarded as a non-associative reinforcement learning task?

In Chapter 2 of Reinforcement Learning: An Introduction (by Sutton and Barto), the authors state that the multi-armed bandit is a non-associative problem. However, in policy improvement (the action-value function update), we have
$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i=a}}.$$
This equation does not seem to satisfy the Markov property; in other words, the current action-value estimate depends on the previous actions $A_i$. From this perspective, why can the multi-armed bandit be regarded as a non-associative reinforcement learning task? (A minimal sketch of the update I mean is below.)
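Here is a minimal Python sketch of the sample-average estimate above, just for concreteness; the Gaussian reward distributions and the ε-greedy selection are my own assumptions for illustration, not something quoted from the book:

```python
import numpy as np

# Sketch of the sample-average estimate Q_t(a) for a k-armed bandit.
# Reward distributions and epsilon-greedy selection are assumptions for illustration.
rng = np.random.default_rng(0)
k = 10
true_means = rng.normal(0.0, 1.0, k)  # hypothetical true mean reward of each arm

counts = np.zeros(k)  # number of times each arm was taken prior to t
sums = np.zeros(k)    # sum of rewards when each arm was taken prior to t

def q_estimate(a):
    """Sample average: sum of rewards for arm a divided by number of pulls of a."""
    return sums[a] / counts[a] if counts[a] > 0 else 0.0

for t in range(1000):
    # epsilon-greedy selection over the current estimates
    if rng.random() < 0.1:
        a = int(rng.integers(k))
    else:
        a = int(np.argmax([q_estimate(i) for i in range(k)]))
    r = rng.normal(true_means[a], 1.0)  # reward depends only on the chosen arm
    counts[a] += 1
    sums[a] += r
```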
By the way: in my opinion, the k-armed bandit problem is still associative. If we regard the state as $s^{(t)}=[s_1, s_2, \ldots, s_k]$, where $s_i$ is the accumulated reward from arm $i$ up to time $t$, then the state satisfies the Markov property and the update process can be explained reasonably.