Rigorous Q-Learning

One of the more popular machine learning algorithms is Q-Learning-based systems and its deep variant – Deep Q Learning. Through this blog post, I want to examine the more mathematical and theoretical aspects of it. Mainly why it does what it’s supposed to do and the mathematical intuition behind it.

Q-Learning

We have an agent, a fixed number of states and actions respectively represented by \(S\) and \(A\). There is a reward function that gives a reward when engaging in a specific action in a specific state. Let’s call this the reward function \(R\).

Our objective is to maximize the reward while interacting with the environment whether it has a terminal state or not. This is where the policy function comes represented by \(Q\). Mathematically, \(Q\) assigns a numerical value that represents the quality of a state-action pair.

\[Q: S \to A = \mathbb{R}\]

Remember that there is a cycle occurring throughout the interaction with the environment at a specific time step \(t\).

The Agent is in a state \(s_{t}\)
It takes an action \(a_{t}\) based on the exploration-exploitation tradeoff.
Receives a reward for the action \(r_{t}\)
Then it updates the policy function

What’s the exploration-exploitation tradeoff? Well, the agent can either exploit the existing policy function but that means that it will adhere to inherent biases in the existing policy and won’t learn anything new. Or it can explore – whereas in it selects random actions and figures out how that rewards the agent. The exploration-exploitation trade-off constant is a value from \(\epsilon\) where \(0 \leq \epsilon \leq 1\)where it determines how often should you explore or exploit – based on your preference.

From this, we are clear about the first three steps of the process. Then only one question remains – how do we update the policy table/function. Here comes the famous variant of the Bellman Equation

\[Q^{‘}(s_{t},a_{t}) = Q(s_{t},a_{t}) + \alpha{}(r_{t} + \gamma{}\max_{a}Q(s_{t+1},a) – Q(s_{t}, a_{t})\]

What’s the intuition behind this? Well, let’s decompose it into parts.

\(Q(s_{t},a_{t})\) – This is the existing reward. We can think of neural networks as functional approximators that slowly move from less accurate predictions to more accurate predictions. What’s happening here is basically that we’re taking the previous policy value/approximation and adding a small portion of the error (which I’ll explain below) that will provide a direction and help us reach closer and closer to the optimal value
\(\alpha{}(r_{t} + \gamma{}\max_{a}Q(s_{t+1},a) – Q(s_{t}, a_{t})\). This is where the interesting part comes in. The parameter \(\alpha\) is the learning rate. This determines how rapidly you want to learn. This is multiplied by the temporal difference error.

What’s the temporal difference error? It’s like any other error function whereas it measures how off are we from the original measurement. Here we calculate how far off is the policy function or table from the optimal policy function. We are making an assumption that the agent doesn’t only care about the current reward, but also the future reward as well. The second assumption is that we care more about the closer reward than the far-off reward. For instance, the reward in step 1 is more valuable than the reward in step 5. Ideally, we would take the accumulative reward as the sum of all rewards throughout the timesteps multiplied by a discount factor. The discount factor is a number between 0 and 1 which determines how relatively important the future action is to the agent. This method would be computationally expensive and in cases of games without a terminal state, impossible. It would look something like this

\[G_{t} = r_{t} + \gamma{}r_{t+1} + \gamma^{2}{}r_{t+2} + \gamma^{3}r_{t+3}….\]

To avoid this, we only take the immediate future reward. Therefore the reward becomes

\[G_{t} = r_{t} + \gamma\max_{a}{Q(s_{t+1}a})\]

Then we measure of far off this is from the actual policy predicted. Hence the term \(r_{t} + \gamma{}\max_{a}Q(s_{t+1},a) – Q(s_{t}, a_{t})\).

That’s the intuition behind the update and q-learning in general. Eventually, the model converges to an approximately optimal state.

Niranjan Krishna