Reinforcement Learning and Bellman Equations

Concepts

The goal of RL is to learn a good policy with little or no supervision. We can use an MDP to model this process.

The main goal of an MDP is to optimize the agent's behavior so that it collects the maximum accumulated reward by the end of the game.

We first learn all the Q values of the MDP, then use these Q values to find the best action for any particular state. In other words, our goal is to find a policy function $\pi(s)$ that takes a state as input and outputs the action that gives us the maximum expected reward.

The policy function:

$$ \pi(state) \rightarrow action $$

$$ \pi(s) = \arg\max_a Q^*(s, a) $$

Markov Decision Process (MDP)

$MDP(S, A, T, R)$ = MDP(State, Action, Transition Probability, Reward)

  • a set of states $s \in S$
  • a set of actions $a \in A$
  • action-dependent transition probabilities $T(s,a,s')=P(s'|s, a)$, so that for each state $s$ and action $a$, $\displaystyle \sum _{s'\in S} T(s,a,s')=1$.
  • reward functions $R(s,a,s')$ representing the reward for starting in state $s$, taking action $a$ and ending up in state $s'$ after one step.
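
As a concrete illustration, here is a minimal sketch of how such a model could be stored in code; the two-state, two-action toy MDP and all its numbers are hypothetical, only the $(S, A, T, R)$ structure comes from the definition above.

import numpy as np

num_states, num_actions = 2, 2

# T[s, a, s'] = P(s' | s, a); every T[s, a, :] sums to 1
T = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions out of state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions out of state 1 under actions 0 and 1
])

# R[s, a, s'] = reward for starting in s, taking a, and ending up in s'
R = np.zeros((num_states, num_actions, num_states))
R[1, 1, 1] = 1.0                # hypothetical: the only rewarding transition

assert np.allclose(T.sum(axis=2), 1.0)   # each row of transition probabilities sums to 1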

Bellman equations

$$ V_i^*(s) = \max_a Q_i^*(s, a) $$

$$ Q_i^*(s, a) = \sum_{s' \in S} T(s, a, s')\left(R(s, a, s') + \gamma V_{i-1}^*(s')\right). $$

  • $V_i^*(s)$: the maximum expected reward starting from state $s$, at iteration $i$.
  • $Q_i^*(s, a)$: the maximum expected reward when starting at state $s$ and taking action $a$, at iteration $i$.
  • $T(s, a, s')$: the transition probability from state $s$ to state $s'$ when taking action $a$, with $\displaystyle \sum _{s'\in S} T(s,a,s')=1$.
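
The second Bellman equation is a single backup for one (state, action) pair. A minimal sketch, assuming `T` and `R` are NumPy arrays indexed as `T[s, a, s_next]` and `R[s, a, s_next]` (as in the hypothetical toy MDP above) and `V` holds the previous iteration's state values:

import numpy as np

def q_backup(T, R, V, s, a, gamma=0.9):
    # Q_i(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V_{i-1}(s'))
    return np.sum(T[s, a] * (R[s, a] + gamma * V))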

Discount Factor (Gamma)

Total discounted reward obtained is: $$\sum_{t=0}^{\infty}\gamma^{t} r_{t}.$$

Using a discount factor $0 \le \gamma < 1$ guarantees that the total reward is bounded and finite.
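
To see why, suppose every per-step reward satisfies $|r_t| \le R_{\max}$ for some constant $R_{\max}$ (a boundedness assumption added here for the argument); the geometric series then gives

$$\left|\sum_{t=0}^{\infty}\gamma^{t} r_{t}\right| \;\le\; R_{\max}\sum_{t=0}^{\infty}\gamma^{t} \;=\; \frac{R_{\max}}{1-\gamma}.$$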

Q Learning Steps

1. Initialize Q Values

Let all $Q_0(s,a)=0$, so that all $V_0(s)=0$.
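
In a tabular setting this is a single array allocation. A sketch, reusing the hypothetical `num_states` and `num_actions` from the toy MDP above:

import numpy as np

num_states, num_actions = 2, 2            # hypothetical sizes from the toy MDP
Q = np.zeros((num_states, num_actions))   # Q_0(s, a) = 0 for every pair
V = Q.max(axis=1)                         # hence V_0(s) = 0 for every state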

2. Learn Q Values

How good is a state-action pair?

$Q$ means the maximum expected reward given a state-action pair.

For every (state, action) pair, calculate $Q_1(s, a)$ using the second Bellman equation. $Q_1$ represents the first iteration of Q learning.

3. Learn V Values (Maximum Reward)

How good is a state?

$V$ means the maximum expected reward given a state.

According to the first Bellman equation, calculate all $V_i(s)$ from the $Q_i(s,a)$ values computed in the previous step.

$$ V_i^*(s) = \max_a Q_i^*(s, a) $$
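
With the Q values stored in a 2D array `Q` of shape `(num_states, num_actions)` (the layout assumed throughout these sketches), this step is a single reduction over the action axis:

import numpy as np

Q = np.zeros((2, 2))    # hypothetical Q table from the previous step
V = Q.max(axis=1)       # V_i(s) = max_a Q_i(s, a), one entry per state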

4. The Policy Function

The policy tells us how we should act in the real world (the game) or in practice.

Now that we have all the Q values, if we are at a particular state, the following policy tells us which action to take next.

$$ \pi(s) = \arg\max_a Q^*(s, a) $$
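
In code, the greedy policy is one argmax per state; a sketch under the same array-layout assumption:

import numpy as np

Q = np.zeros((2, 2))          # hypothetical converged Q table
policy = Q.argmax(axis=1)     # policy[s] = argmax_a Q*(s, a)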

5. Iteration and Convergence

Iterate over steps 2 and 3 until $V$ converges.
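
Putting steps 1-5 together gives the value-iteration sketch below. It assumes the model is known and stored as `T[s, a, s']` and `R[s, a, s']` (the toy layout from the MDP example); `gamma`, `tol`, and all names are illustrative assumptions.

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    num_states = T.shape[0]
    V = np.zeros(num_states)                                    # step 1: V_0(s) = 0
    while True:
        # step 2: Q_i(s, a) = sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V_{i-1}(s'))
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        # step 3: V_i(s) = max_a Q_i(s, a)
        V_new = Q.max(axis=1)
        # step 5: stop once the V values have converged
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # step 4: the greedy policy from the (near-)converged Q values
    policy = Q.argmax(axis=1)
    return V_new, Q, policy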

Implementation Concerns

Transitions and Rewards are Unknown

In the MDP(S, A, T, R) model, the transition probabilities and rewards are unknown beforehand, so we need to try and explore, then record our experience and estimate these values.
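
One simple approach, a sketch rather than the only option, is to log (s, a, r, s') experience tuples while exploring, then count transitions and average rewards; the tuple format and function name are assumptions:

import numpy as np

def estimate_model(experience, num_states, num_actions):
    # experience: list of (s, a, r, s_next) tuples recorded while exploring
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sum = np.zeros((num_states, num_actions, num_states))
    for s, a, r, s_next in experience:
        counts[s, a, s_next] += 1
        reward_sum[s, a, s_next] += r
    visits = counts.sum(axis=2, keepdims=True)
    # T_hat(s, a, s') = count(s, a, s') / count(s, a); zero where never visited
    T_hat = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    # R_hat(s, a, s') = average observed reward on that transition
    R_hat = np.divide(reward_sum, counts, out=np.zeros_like(reward_sum), where=counts > 0)
    return T_hat, R_hat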

Learning Rate for Training

The learning rate $\alpha$ controls the update rate of the $Q$ values.

$$Q_i(s,a) \leftarrow (1-\alpha)\,Q_i(s,a) + \alpha \left[R(s,a,s') + \gamma \cdot \max_{a' \in A} Q_{i-1}(s',a')\right]$$
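
As code, one such update after observing a single transition (s, a, r, s') might look like the sketch below; the in-place table update and the argument names are assumptions:

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q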

Epsilon-Greedy Exploration vs. Exploitation

With probability $\epsilon$ we try a random action (exploration); with probability $1-\epsilon$ we use the learned $Q(s,a)$ values, i.e. the current policy, to pick the best next action (exploitation).

At the beginning, $\epsilon$ should be set to 1 so that all actions explore; as time goes on and the Q values converge, $\epsilon$ should approach 0 so that we do more exploitation.


import numpy as np

def epsilon_greedy(state, Q, epsilon):
    # With probability epsilon explore, otherwise exploit.
    # Q is assumed to be a 2D array of shape (num_states, num_actions).
    coin = np.random.binomial(1, epsilon)
    if coin == 1:
        # Exploration: pick a uniformly random action.
        next_action = np.random.randint(Q.shape[1])
    else:
        # Exploitation: pick the action with the highest Q value in this state.
        next_action = np.argmax(Q[state])

    return next_action
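
A possible usage sketch with a simple multiplicative decay schedule for $\epsilon$; the decay constants, loop length, and Q-table shape are all assumptions:

import numpy as np

Q = np.zeros((5, 3))                              # hypothetical: 5 states, 3 actions
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995    # start fully exploratory

state = 0
for step in range(1000):
    action = epsilon_greedy(state, Q, epsilon)
    # ... act in the environment, observe reward and next state,
    # ... and update Q (e.g. with the learning-rate update above)
    epsilon = max(eps_min, epsilon * eps_decay)   # gradually shift toward exploitation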