Welcome to the Reinforcement Learning (RL) Roadmap with Programming Club IITK! Get ready for a hands-on, curiosity-driven deep dive into how intelligent agents learn to make decisions. This roadmap is crafted to give you both a solid theoretical basis and ample practical experience, tailored for newcomers and enthusiasts eager to build real RL intuition.
Week 1: Build your foundation by mastering the exploration-exploitation dilemma through various multi-armed bandit problems. This week is dedicated to understanding how an agent can learn the best actions in a static environment, a core concept that underpins all of RL.
Week 2: Transition from stateless bandits to environments with sequential decision-making. You will be introduced to Markov Decision Processes (MDPs), the mathematical framework for RL, and learn how to solve them using foundational methods like Dynamic Programming, Monte Carlo, and Temporal-Difference learning.
Week 3: Next we will study model-free methods, where agents learn optimal policies directly from experience without needing a model of the environment. You’ll implement and contrast cornerstone algorithms like Q-Learning, SARSA, and Policy Gradients.
Week 4: Scale up your knowledge to tackle complex, high-dimensional problems with Deep Reinforcement Learning. This week, you will combine neural networks with RL principles to build powerful agents using algorithms like Deep Q-Networks (DQN) and explore methods for continuous action spaces.
No matter which week you’re in, don’t hesitate to experiment! Document your learning, reflect frequently, and seek deeper understanding rather than shortcuts. Happy learning!
Week 1: The Foundations - Bandits and Basic RL
Week 2: From Bandits to Full Reinforcement Learning
Week 3: Model-Free Methods
Week 4: Deep RL and Advanced Algorithms
The idea that we learn by interacting with the world around us feels obvious. Think about an infant exploring its surroundings, reaching out, touching things, making sounds. No one is explicitly teaching every action, yet the child learns cause and effect. Some actions bring smiles and attention, while others bring discomfort or no response. Through these interactions, the child figures out what works to achieve certain goals.
This pattern continues throughout life. When we learn to ride a bike, cook a new recipe, or hold a conversation, we rely on trial, error, and feedback. We act, observe the outcome, and adjust. The better we get at understanding how our actions influence what happens next, the more effectively we can achieve our goals.
Reinforcement Learning is built on this same idea—learning through interaction and feedback. Instead of being told the right answer in advance, the learner discovers it by trying different things, seeing the results, and improving over time.
Why formalize this process? To study this kind of learning systematically, we need a clear way to describe what’s happening. When we act and learn from feedback, there’s always something we’re interacting with, choices we make, and outcomes that guide our decisions. In reinforcement learning, these ideas are formalized into concepts like the environment, actions, rewards, and strategies. To understand them better, let’s look at an example.
Imagine a small delivery robot that needs to deliver packages in a city. It doesn’t know the roads well, so it has to figure out how to reach destinations efficiently.
Working of the Robot: Every decision the robot makes shapes what it learns. Each choice, whether it takes a shortcut or sticks to a main road, brings consequences that reinforce or discourage that behavior. Through this constant cycle of action and feedback, the robot builds an understanding of which strategies work best.
It doesn’t just aim for one quick delivery; its goal is to maximize success in the long run, that is, delivering packages faster and safer while avoiding failures like dead batteries or traffic delays. To achieve this, it must balance two competing needs: exploring new possibilities to discover better routes and exploiting what it already knows works well.
What’s remarkable is that no one hands the robot a map or the correct answers. It learns by doing, by interacting with its environment, observing the outcomes, and adjusting its strategy based on experience. This ability to discover effective behavior through trial and error, driven by rewards and penalties, is the essence of reinforcement learning.
Now that you understand the core idea of reinforcement learning, the best way to deepen your understanding is to see it in action and start with simple problems before tackling complex ones.
Watch RL in Action Before diving into technical details, it helps to see reinforcement learning in the real world. RL isn’t just theory; it powers some of the most exciting breakthroughs, from bots that master games to robots navigating complex environments. To explore these applications, check out the AI Warehouse YouTube channel.
Bonus for cat lovers: Want something fun? Watch this short video that explains RL through cats; it’s simple, intuitive, and adorable.
Build a Strong Foundation Once you’ve seen RL in action, it’s time to understand the core ideas step by step. Start with this StatQuest video; it’s one of the most beginner-friendly introductions, perfect for building intuition.
After that, reinforce your understanding by reading this GeeksforGeeks article. It summarizes the key concepts in simple, clear language—great for a quick refresher after the video.
Finally, if you want a solid reference to guide you throughout your learning journey, dive into Chapter 1 of Reinforcement Learning: An Introduction by Sutton & Barto. This book is considered the gold standard for RL and will provide the depth you need once you’ve grasped the basics.
Imagine you walk into a casino with 10 slot machines lined up. Each one hides an unknown probability of giving you a payout. You have only 100 coins. Every pull of a lever costs one coin.
Your goal: win as much as possible before you run out of coins.
But here’s the dilemma: do you keep pulling the machine that has paid off so far, or spend precious coins trying others that might pay even more?
This simple setup captures one of the core challenges of decision-making under uncertainty: Should you exploit what seems good or explore the unknown for something better?
Why this Matters? This is the multi-armed bandit problem, the simplest form of reinforcement learning. There are no states or transitions, just choices with uncertain rewards. Yet, solving it well requires intelligent strategies to manage the exploration–exploitation trade-off.
Resources: Here’s a cool video: Multi-Armed Bandits: A Cartoon Introduction (This video also introduces a few strategies; don’t worry, we’ll explore them in detail later.)
For a deeper read (just the intuition and examples, no need to worry about math yet): Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 2.1 – The Multi-Armed Bandit Problem
Imagine you’re playing a new game, and you’re not sure which moves score the most points. How do you figure it out? For that, we’ll explore methods for estimating action values and using them to choose actions intelligently.
Key Idea: For each action $a$: \(Q(a) \approx \mathbb{E}[R \mid A = a]\)
This means that $Q(a)$ is our estimate of the expected (average) reward we would receive if we kept taking action $a$.
If you’re new to expectation or probability, check this first: Interactive Guide: Seeing Theory – Basic Probability
The simplest estimate is the sample average: \(Q_n(a) = \frac{\sum_{i=1}^{N_n(a)} R_i}{N_n(a)}\) where $N_n(a)$ is the number of times action $a$ has been selected up to step $n$, and $R_i$ is the reward received on the $i$-th selection of $a$.
Read: Sutton & Barto, Section 2.2
Read: Sutton & Barto, Section 2.3
Recomputing averages from scratch is costly. Use an incremental formula:
\[Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)\]For nonstationary problems (where reward probabilities change over time), use a constant step size:
\[Q_{n+1} = Q_n + \alpha(R_n - Q_n), \; 0 < \alpha \le 1\]General RL pattern: \(\text{New Estimate} \leftarrow \text{Old Estimate} + \text{Step Size} \times (\text{Target} - \text{Old Estimate})\)
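To make this concrete, here is a minimal sketch of an ε-greedy agent on a 10-armed testbed that uses the incremental update (the testbed setup and variable names are illustrative, not taken from the roadmap’s notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10                                  # number of arms
true_means = rng.normal(0, 1, k)        # hidden mean reward of each arm (stationary testbed)
Q = np.zeros(k)                         # value estimates
N = np.zeros(k)                         # how often each arm has been pulled
epsilon = 0.1

for step in range(1000):
    # epsilon-greedy action selection
    if rng.random() < epsilon:
        a = int(rng.integers(k))
    else:
        a = int(np.argmax(Q))
    R = rng.normal(true_means[a], 1.0)  # sample a noisy reward

    # incremental sample-average update: NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    N[a] += 1
    Q[a] += (R - Q[a]) / N[a]
    # for a nonstationary problem, use a constant step size instead:
    # Q[a] += alpha * (R - Q[a])
```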
Read: Sutton & Barto, Section 2.4
Implementation: Notebook with all the above implemented.
The exploration in ε-greedy is “blind.” It wastes time trying obviously suboptimal arms and can’t prioritize actions that are promising but uncertain. Today, we’ll explore smarter ways to explore.
By setting the initial action values $Q_1(a)$ to a number much higher than any possible reward, we can force a greedy agent to explore. Why? Because every real reward it receives will be “disappointing” compared to its optimistic initial belief, causing it to try every other action at least once before converging. This is a simple but powerful trick for encouraging initial exploration.
However, this strategy is not well-suited for non-stationary settings because the drive to explore is temporary.
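As a quick illustration (the value 5.0 and the reward distribution are assumptions made for this sketch, not prescribed by the text):

```python
import numpy as np

k = 10
Q = np.full(k, 5.0)   # optimistic start: well above any plausible reward (rewards ~ N(0, 1) here)
N = np.zeros(k)

# Even a purely greedy agent (a = np.argmax(Q)) now explores at first: every observed
# reward is "disappointing" relative to 5.0, so the chosen arm's estimate drops below
# the others and the agent moves on to try them. Once the optimism has washed out,
# the exploration stops, which is why this trick suits stationary problems best.
```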
Read: Sutton & Barto, Section 2.5
UCB addresses the shortcomings of ε-greedy by exploring intelligently. It chooses actions based on both their estimated value and the uncertainty in that estimate. \(A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]\) Here, $Q_t(a)$ is the current value estimate, $N_t(a)$ is the number of times action $a$ has been selected so far, $t$ is the time step, and $c > 0$ controls the degree of exploration.
The square root term is an “uncertainty bonus.” It’s large for actions that have been tried infrequently and shrinks as actions are tried more often. This makes UCB’s exploration strategic, not random.
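A minimal sketch of UCB action selection, assuming `Q` and `N` are NumPy arrays of value estimates and selection counts:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # try every arm once first (an untried arm has unbounded uncertainty)
    if np.any(N == 0):
        return int(np.argmin(N))
    bonus = c * np.sqrt(np.log(t) / N)   # uncertainty bonus: large for rarely tried arms
    return int(np.argmax(Q + bonus))
```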
Read: Sutton & Barto, Section 2.6
So far, we’ve focused on methods that estimate the value of actions. But what if we could learn a preference for each action directly? This shifts us from action-value methods to policy-based methods, where we learn a parameterized policy directly.
Key Idea: Instead of learning action values, we learn a numerical preference $H_t(a)$ for each action. These preferences are converted into action probabilities using a softmax distribution: \(\pi_t(a) = \text{Pr}\{A_t=a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}\)
Updating Preferences with Stochastic Gradient Ascent: We update these preferences using the reward signal. The update rules are \(H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \overline{R}_t)(1 - \pi_t(A_t))\) for the selected action $A_t$, and \(H_{t+1}(a) = H_t(a) - \alpha (R_t - \overline{R}_t)\pi_t(a)\) for all $a \neq A_t$.
Here, $\overline{R}_t$ is the average of all rewards up to time $t$, which serves as a baseline.
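Here is a small sketch of the gradient bandit update, assuming preferences are stored in a NumPy array `H`; the in-place form below is algebraically equivalent to the two update rules above:

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())              # subtract the max for numerical stability
    return z / z.sum()

def gradient_bandit_update(H, a, R, R_bar, alpha=0.1):
    pi = softmax(H)
    H = H - alpha * (R - R_bar) * pi     # push every preference down in proportion to pi(a)...
    H[a] += alpha * (R - R_bar)          # ...then add back for the taken action, giving the (1 - pi) factor
    return H
```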
Read: Sutton & Barto, Section 2.7
We’ve only considered nonassociative tasks, where we try to find the single best action overall. But what if the best action depends on the situation? This is the contextual bandit problem, a crucial step toward the full reinforcement learning problem.
Key Idea: The goal in a contextual bandit task is to learn a policy that maps situations (contexts) to the actions that are best in those situations.
Imagine a bank of slot machines, but their colors change randomly. You learn that “if the machine is red, pull arm 2” and “if the machine is blue, pull arm 1.” That mapping from observed context to best action is exactly what a contextual bandit agent learns.
This is different from the full RL problem because here, an action only affects the immediate reward, not the next context you see. We’ll tackle that final piece next week!
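A minimal contextual bandit sketch, assuming a toy setup where the context (e.g., machine colour) is drawn at random and is not affected by our actions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, k = 2, 3                       # e.g. machine colour (red/blue) and 3 arms
true_means = rng.normal(0, 1, (n_contexts, k))
Q = np.zeros((n_contexts, k))              # one row of value estimates per context
N = np.zeros((n_contexts, k))
epsilon = 0.1

for step in range(5000):
    s = int(rng.integers(n_contexts))      # the context is given; our action does not change it
    if rng.random() < epsilon:
        a = int(rng.integers(k))
    else:
        a = int(np.argmax(Q[s]))
    R = rng.normal(true_means[s, a], 1.0)
    N[s, a] += 1
    Q[s, a] += (R - Q[s, a]) / N[s, a]     # same bandit update, just indexed by context
```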
Read: Sutton & Barto, Section 2.8
This week, we move beyond single-step decisions and into the world of sequential problems, where an action affects not only the immediate reward but also all future situations. We’ll formalize this problem using Markov Decision Processes and explore the three fundamental approaches to solving it: Dynamic Programming, Monte Carlo, and Temporal-Difference Learning.
We’re now moving from single-step bandits to sequential decision problems, where an action has long-term consequences. To do this, we need a formal framework.
Today, we’ll introduce the core concepts that define this framework: the agent-environment interface, the goal of maximizing a cumulative return, and the Markov property that governs the environment’s dynamics. These ideas come together to form the Markov Decision Process (MDP).
For a full background on these foundational concepts, please read sections 3.1-3.4 in the Sutton & Barto text.
The key assumption that makes sequential problems solvable is the Markov property: the next state and reward depend only on the current state and action, not on the full history of how the agent got there.
Read: Sutton & Barto, Section 3.5 Watch for better understanding: Markov Property
An MDP is the mathematical framework for reinforcement learning. It’s a Markov Process with the addition of actions (choices) and rewards (goals).
An MDP is formally defined by a tuple containing:
- $\mathcal{S}$: the set of states
- $\mathcal{A}$: the set of actions
- $p(s', r \mid s, a)$: the probability of transitioning to state $s'$ with reward $r$, given state $s$ and action $a$
- $\gamma$: the discount factor
Read: Sutton & Barto, Section 3.6
To solve an MDP, we need to figure out how “good” each state is. We do this by learning value functions, which estimate the expected future return.
These value functions follow a recursive relationship known as the Bellman Expectation Equation: \(v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]\) In simple terms, the value of where you are is the expected immediate reward you get, plus the discounted value of where you’re likely to end up next. This equation is the foundation for almost all RL algorithms.
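To see the Bellman expectation equation used directly as an update rule, here is a sketch of iterative policy evaluation; the model format `P[s][a] = [(prob, next_state, reward), ...]` is an assumption made for this example:

```python
import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation via repeated Bellman expectation backups.

    P[s][a] is a list of (prob, next_state, reward) triples describing the model,
    and pi[s][a] is the probability that the policy picks action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(
                pi[s][a] * prob * (r + gamma * V[s_next])
                for a in range(len(P[s]))
                for prob, s_next, r in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                 # stop once the values barely change
            return V
```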
Resources:
Yesterday we formalized our problem as a Markov Decision Process (MDP). Today, we’ll learn how to solve it. Dynamic Programming (DP) is a collection of algorithms that can compute the optimal policy, given a perfect model of the environment.
The core idea of DP is to turn the Bellman equations we learned about into update rules to progressively find the optimal value functions.
Almost all reinforcement learning algorithms, including DP, follow a general pattern called Generalized Policy Iteration (GPI). It’s a dance between two competing processes: Policy Evaluation, which makes the value function consistent with the current policy, and Policy Improvement, which makes the policy greedy with respect to the current value function.
These two processes are repeated, interacting until they converge to a single joint solution: the optimal policy and the optimal value function, where neither can be improved further.
There are two classic DP algorithms that implement GPI:
Policy Iteration: This algorithm performs the full GPI dance. It alternates between completing a full Policy Evaluation step and then performing a Policy Improvement step. This process is repeated until the policy is stable.
Value Iteration: This is a more streamlined approach. Instead of waiting for the value function to fully converge, Value Iteration performs just one backup for each state before improving the policy. It combines the two steps of GPI by directly using the Bellman Optimality Equation as its update rule.
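A sketch of value iteration under the same assumed model format as the policy-evaluation example above:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    # P[s][a] is a list of (prob, next_state, reward) triples, as in policy_evaluation above
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # one Bellman optimality backup per state: evaluate each action, keep the best
            q = [sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                 for a in range(len(P[s]))]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # extract the greedy policy from the converged value function
    policy = [int(np.argmax([sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                             for a in range(len(P[s]))]))
              for s in range(n_states)]
    return V, policy
```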
DP provides the theoretical foundation for reinforcement learning and is guaranteed to find the optimal solution. However, it has two major drawbacks: it requires a perfect model of the environment’s dynamics, and its computational cost grows rapidly with the size of the state space.
Yesterday, we saw that Dynamic Programming can find the optimal policy if we have a perfect model of the environment. But what if we don’t have that model? This is where model-free reinforcement learning begins.
Monte Carlo (MC) methods are our first approach to learning without a model. The idea is simple: we learn the value of states by running many complete episodes and simply averaging the returns we get. To figure out the average time it takes to get home, just drive home many times and calculate the average. That’s the essence of Monte Carlo.
The first step is learning to predict the value function, $v_\pi(s)$, for a given policy.
Our real goal is to find the optimal policy. To do this in a model-free world, we must learn action-values, $q_\pi(s, a)$, because they tell us how good an action is without needing a model to look ahead. We follow the same GPI pattern as before: evaluate the current policy by averaging the returns observed for each state-action pair, then improve it by making it greedy (or ε-greedy, to keep exploring) with respect to those estimates.
This process, called on-policy Monte Carlo control, repeats, with the policy and action-value estimates gradually improving each other.
So far, we’ve seen two extremes: DP (needs a model) and MC (must wait for an episode to end). Temporal-Difference (TD) Learning combines the best of both.
TD learning is a model-free method, like MC, that learns from raw experience. But, like DP, it updates its estimates based on other learned estimates. This elegant idea of learning a “guess from a guess” is called bootstrapping and is central to modern reinforcement learning.
Instead of waiting for the final return $G_t$, TD updates its estimate $V(S_t)$ toward a TD Target.
The TD Target: This is an estimated return formed after one step: the immediate reward plus the estimated value of the next state: $R_{t+1} + \gamma V(S_{t+1})$.
The TD Error ($\delta_t$): The learning signal in TD is the difference between the TD target and our current estimate. \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) The TD(0) update rule uses this error to nudge the value of the current state: \(V(S_t) \leftarrow V(S_t) + \alpha \cdot \delta_t\)
This allows the agent to learn from every single step.
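The TD(0) update in code is just a few lines (a sketch, assuming `V` is an array or dict of state-value estimates):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    # bootstrap: the target uses the current estimate of the next state's value
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]          # TD error
    V[s] += alpha * delta          # nudge the estimate toward the target
    return V
```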
When we use TD for control, we learn action-values ($Q(s,a)$). This leads to two of the most famous algorithms in RL.
Sarsa (On-Policy): Sarsa learns the value of the policy the agent is currently following. Its name comes from the quintuple of experience it uses: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$. It’s “on-policy” because its update target depends on the action $A_{t+1}$ that the policy actually chooses next.
Q-Learning (Off-Policy): Q-Learning is an off-policy algorithm. It learns the value of the optimal policy, regardless of what exploratory actions the agent takes. Its update target uses the best possible action from the next state, represented by the $\max$ operator. \(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]\)
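Side by side, the two updates differ only in how the bootstrap target is formed (a sketch, assuming `Q` is a 2-D NumPy array indexed by state and action):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: bootstrap from the action the policy actually takes next
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: bootstrap from the best action available in the next state
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```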
Over the last few days, we’ve introduced the three foundational pillars for solving MDPs. Today, we’ll consolidate our understanding by directly comparing them.
The methods can be understood by looking at how they perform backups along two key dimensions: the depth and the width of the backup.
This dimension describes how far into the future an algorithm looks to get its update target.
Shallow Backups (Bootstrapping): DP and TD methods bootstrap. They create an update target using the reward from just one step, plus the estimated value of the next state. They are learning a “guess from a guess.”
Deep Backups (Full Returns): MC methods do not bootstrap. They use the entire sequence of rewards from a completed episode—the actual, unbiased return—as their update target.
This dimension describes whether an algorithm considers all possibilities or just one.
Full Backups (Model-Based): DP methods use full backups. They average over every possible next state and reward according to a model of the environment, $p(s', r \mid s, a)$. This is why DP requires a model. Sample Backups (Model-Free): MC and TD methods instead back up along a single sampled transition, which is why they can learn from experience alone.
This table summarizes the core trade-offs:
| Method | Requires a Model? | Bootstraps? | Key Advantage | Key Disadvantage | 
|---|---|---|---|---|
| Dynamic Programming | Yes | Yes | Guaranteed to find the optimal policy. | Requires a model; computationally expensive. | 
| Monte Carlo | No | No | Unbiased; conceptually simple. | High variance; must wait for the episode to end. | 
| Temporal-Difference | No | Yes | Low variance; learns online (step-by-step). | Biased (learns from an estimate). | 
Essentially, TD learning is the sweet spot that combines the model-free sampling of Monte Carlo with the step-by-step bootstrapping of Dynamic Programming.
Use this day to apply the concepts you’ve learned so far in a hands-on project.
Project: Design and solve a simple gridworld environment. The goal is to train an agent to find the shortest path from a start state to a goal state, avoiding obstacles.
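If you want a starting point, here is one possible gridworld skeleton (the layout, rewards, and method names are assumptions for this sketch, not a prescribed interface):

```python
import numpy as np

class GridWorld:
    """A tiny gridworld: reach the goal, avoid obstacles. Layout is illustrative."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

    def __init__(self, rows=4, cols=4, start=(0, 0), goal=(3, 3), obstacles=((1, 1), (2, 2))):
        self.rows, self.cols = rows, cols
        self.start, self.goal = start, goal
        self.obstacles = set(obstacles)
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.rows - 1)
        c = min(max(self.state[1] + dc, 0), self.cols - 1)
        if (r, c) in self.obstacles:                       # blocked: stay in place
            r, c = self.state
        self.state = (r, c)
        done = self.state == self.goal
        reward = 0.0 if done else -1.0                     # -1 per step encourages short paths
        return self.state, reward, done
```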
Q-learning is a fundamental algorithm in model-free reinforcement learning (RL). Unlike model-based methods, it does not require knowledge of the environment’s dynamics. Instead, the agent learns purely through trial-and-error interactions with the environment.
The central idea is the use of Q-values, denoted $Q(s,a)$, which estimate the value of taking action $a$ in state $s$. These estimates are updated repeatedly as the agent explores, gradually improving its understanding of which actions lead to higher rewards.
Q-learning uses an off-policy temporal-difference (TD) update rule, meaning it can learn the optimal action-value function even if the agent’s behavior policy is not optimal. Over time, Q-values converge to their true values, enabling the agent to derive the optimal policy.

Resources:
Although SARSA and Q-learning are both temporal-difference (TD) learning methods used in reinforcement learning, they differ in how they update the action-value function. Both algorithms aim to estimate the optimal policy, but the way they handle the next action distinguishes them:
Q-learning (off-policy):
\[Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]\]Here, the update rule uses the maximum future action-value regardless of the agent’s current policy. This makes Q-learning an off-policy method, since it learns the optimal greedy policy while potentially following a different exploratory policy during training.
SARSA (on-policy):
\[Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]\]In contrast, SARSA updates its value based on the actual action chosen by the current policy in the next state. This makes it an on-policy method, where learning is tied to the agent’s own behavior, including exploration.
To get a deeper understanding of their behavior, similarities, differences, and applications, go through the following resources:
So far, we’ve studied how an RL agent estimates how good each action is and then chooses the best one. Now, instead of estimating these values, it’ll learn how to act directly.
Policy gradient methods directly optimize the policy instead of estimating value functions. The REINFORCE algorithm is the simplest such method. It starts with a random policy, then runs several episodes using this policy and records the resulting states, actions, and rewards. Then, for each action taken, it updates the policy parameters using an update rule that makes high-return actions more likely. This process is repeated to reach the optimal policy.
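A minimal REINFORCE sketch for a tabular softmax policy, assuming an environment with a `reset()`/`step()` interface (like the gridworld above, with states flattened to integers so they can index rows of `theta`):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99, rng=np.random.default_rng()):
    """One REINFORCE update for a tabular softmax policy theta[state, action]."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # 1) roll out an episode with the current policy
        probs = softmax(theta[s])
        a = int(rng.choice(len(probs), p=probs))
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    G = 0.0
    for t in reversed(range(len(rewards))):           # 2) walk backwards, accumulating returns
        G = rewards[t] + gamma * G
        probs = softmax(theta[states[t]])
        grad_log = -probs
        grad_log[actions[t]] += 1.0                   # gradient of log pi(a|s) for a softmax policy
        theta[states[t]] += alpha * G * grad_log      # ascend the policy gradient
    return theta
```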
Go through the following resources to learn about policy gradient methods:
Actor–critic methods are TD methods that have a separate memory structure to explicitly represent the policy independent of the value function. The policy structure is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor.
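A one-step actor-critic sketch for the tabular case, where the critic’s TD error is used to update both the value estimates and the actor’s softmax preferences (variable names are illustrative):

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # critic: one-step TD error on the state-value estimate
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha_critic * delta

    # actor: move the softmax preferences in the direction the critic approves of
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log
    return theta, V
```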
Resources:
Hyperparameters are external parameters set before training begins. The learning rate α, discount factor γ, exploration rate ε, number of episodes, etc., are some of the common hyperparameters in RL models.
RL models often face convergence issues, that is, they fail to learn. One of the major reasons for this is a bad set of hyperparameters, as mentioned above. Check out this video summarizing a paper on the importance of hyperparameter tuning in RL.
There are three major optimization methods commonly used for tuning RL hyperparameters: grid search, random search, and Bayesian optimization.
Watch this video to learn how to implement hyperparameter optimization in RL models using Meta’s Ax.
Before diving into Deep Q-Networks (DQNs) in the upcoming Week 4, it’s important to understand Linear Function Approximation (LFA). This is one of the simplest yet most powerful approaches to approximating value functions in Reinforcement Learning. In RL, the state space can often be enormous or continuous. Take chess, for example, where the number of possible states is estimated to be around $10^{47}$. Storing a value for each state is clearly impossible. This explosion of possibilities is termed the Curse of Dimensionality: as the number of state variables grows, the total number of states grows exponentially.
To make value estimation feasible in such large spaces, we represent each state using a feature vector. A state $s$ is mapped into a vector:
\[\phi(s) = [\phi_1(s), \phi_2(s), ..., \phi_n(s)]^T\]where each component $\phi_i(s)$ captures some aspect of the state. This feature-based representation reduces the complexity of the state space and makes learning more tractable.
Once states are represented by feature vectors, we can approximate the value function as a linear combination of these features. Mathematically, this is written as:
\[\hat{v}(s; \theta) = \theta^T \phi(s)\]Here, $\theta$ is the weight vector (the parameters we learn), $\phi(s)$ is the feature vector of the state, and $\hat{v}(s; \theta)$ is the estimated value of state $s$. The aim is to learn weights $\theta$ that best approximate the true values of states.
Using the SGD update rule, we minimize the mean squared error between the predicted value and the target:
\(J(\theta) = \frac{1}{2} \big( v(s) - \hat{v}(s; \theta) \big)^2\)
and update the weights as: \(\theta \leftarrow \theta + \alpha \big( v(s) - \hat{v}(s; \theta) \big) \phi(s)\) where $\alpha$ is the learning rate and $v(s)$ is the target (either the actual return or a bootstrapped value estimate). For those unfamiliar with the SGD update rule, you can refer to the SGD update rule section in the ML Roadmap.
What makes linear approximation powerful is how easily it integrates with reinforcement learning algorithms. In TD(0), for instance, the target is a bootstrapped estimate $R + \gamma \hat{v}(s'; \theta)$. In SARSA and Q-learning, we can extend the idea to action-value functions, approximating them linearly as $\hat{q}(s,a;\theta) = \theta^T \phi(s,a)$.
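A sketch of semi-gradient TD(0) with linear function approximation, assuming feature vectors are supplied as NumPy arrays:

```python
import numpy as np

def semi_gradient_td0(theta, phi_s, r, phi_s_next, done, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step for a linear value function v_hat(s) = theta^T phi(s)."""
    v = theta @ phi_s
    v_next = 0.0 if done else theta @ phi_s_next
    target = r + gamma * v_next              # bootstrapped target
    theta += alpha * (target - v) * phi_s    # gradient of theta^T phi(s) w.r.t. theta is phi(s)
    return theta
```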
You can refer to the following resources for a deeper dive into LFA:
We’ll be integrating Q-learning with deep neural networks this week, so you’ll need to know about neural networks and some other basics of machine learning. If you’re new to these terms, here are a few resources that will help you get started:
Deep Q-Networks revolutionized reinforcement learning by combining Q-learning with deep neural networks, enabling agents to handle high-dimensional state spaces like raw pixel inputs. DQN uses a convolutional neural network that takes raw state observations as input and outputs Q-values for each possible action.
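As a rough sketch in PyTorch (an MLP variant for vector observations; the original DQN uses convolutional layers over pixel frames, and the layer sizes here are assumptions):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state observation to one Q-value per action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)                 # shape: (batch, n_actions)

def dqn_targets(target_net, rewards, next_obs, dones, gamma=0.99):
    # standard DQN target for a minibatch sampled from the replay buffer
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q
```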

To brush up your neural network concepts, refer to Week 6 of the ML roadmap.
Resources:
Key Learning Points:
Deep Q-Networks (DQN) represented a breakthrough in reinforcement learning, but several fundamental problems emerged that led to the development of improved variants.
Key Problems: overestimation of Q-values caused by the $\max$ operator, inefficient uniform sampling from the replay buffer, and a single output stream that cannot separate how good a state is from how much better one action is than another.
Solution: Double DQN Uses two separate “brains” - one picks the action, the other judges it. This prevents the overoptimistic feedback loop.
Solution: Prioritized Experience Replay Smart studying! Focuses more on important experiences and less on routine ones.
Solution: Dueling DQN Splits the network into two parts - one learns “how good is this situation?” and another learns “which action is relatively better?” Then combines them smartly.
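For example, the Double DQN target differs from the vanilla DQN target only in who picks the next action (a PyTorch sketch, assuming networks like those in the DQN example above):

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_obs, dones, gamma=0.99):
    with torch.no_grad():
        # the online network *picks* the next action, the target network *judges* it
        best_actions = online_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```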
Resources:
The Implementations of DDQN, Dueling DQN and PER can be found here.
While DQN improvements such as Double DQN, Dueling DQN, and Prioritized Experience Replay significantly enhanced the stability and efficiency of value-based learning, they still operate within fundamental constraints that limit their real-world applicability. These methods remain bound to discrete action spaces and continue to learn policies indirectly through value-function approximation.
Key Limitations of Value-Based Methods: they are restricted to discrete action spaces, and they learn the policy only indirectly, by first approximating a value function and then acting greedily with respect to it.
Policy-Based vs Value-Based Approaches: policy-based methods parameterize and optimize the policy directly, while value-based methods learn value estimates and derive the policy from them.
On-Policy vs Off-Policy Methods: on-policy methods learn about the policy they are currently executing, while off-policy methods learn about one policy (often the optimal one) from data generated by another.
Trust Region Policy Optimization (TRPO) TRPO prevents significant performance drops by keeping policy updates within a trust region, enforced through a KL-divergence constraint.
Mathematical Foundation: TRPO maximizes the surrogate objective \(\mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A(s, a)\right]\) subject to the constraint \(\mathbb{E}\left[D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\right] \le \delta\).
Resources:
Proximal Policy Optimization (PPO) PPO is an on-policy, policy gradient method that uses a clipped surrogate objective function to improve training stability by limiting policy changes at each step.
Key Innovation: Clipped probability ratio prevents large policy updates
Clipped Objective: \(\min\big(r(\theta)\, A,\ \text{clip}(r(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A\big)\), where \(r(\theta) = \frac{\pi_{\text{new}}(a \mid s)}{\pi_{\text{old}}(a \mid s)}\)
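A sketch of the clipped surrogate loss in PyTorch, assuming log-probabilities and advantages have already been computed for a batch:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    # r(theta) = pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # take the pessimistic (minimum) objective, then negate it to get a loss to minimize
    return -torch.min(unclipped, clipped).mean()
```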
Advantages over TRPO:
Resources:
**Practical Implementation Tips**
Implementations of both PPO and TRPO can be found here
Algorithms like PPO and TRPO work well when you have a limited number of clear choices (like “turn left,” “turn right,” “go straight”). But in real life, many problems need smooth, precise control.
DDPG and TD3 specifically tackle environments where actions are continuous vectors (like robot joint angles, steering wheel positions, or throttle controls) rather than discrete choices.
These algorithms can learn to output precise numerical values. Instead of choosing between 4 discrete actions, these algorithms can choose any number between -1 and +1 (or any range you need).
Deep Deterministic Policy Gradient (DDPG) DDPG combines ideas from DQN and policy gradient methods using an actor-critic architecture. The actor learns a deterministic policy mapping states to actions, while the critic evaluates actions by estimating Q-values
Twin Delayed DDPG (TD3) TD3 addresses DDPG’s brittleness with three critical improvements: clipped double-Q learning (two critics, taking the minimum of their estimates), delayed policy updates (the actor is updated less frequently than the critics), and target policy smoothing (adding clipped noise to target actions).
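A sketch of how TD3’s target value is computed, showing clipped double-Q and target policy smoothing together (network names, the critic signature `critic(obs, action)`, and the noise parameters are illustrative assumptions):

```python
import torch

def td3_targets(target_actor, target_q1, target_q2, rewards, next_obs, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        # target policy smoothing: perturb the target action with clipped noise
        next_action = target_actor(next_obs)
        noise = torch.clamp(noise_std * torch.randn_like(next_action), -noise_clip, noise_clip)
        next_action = torch.clamp(next_action + noise, -act_limit, act_limit)
        # clipped double-Q: take the smaller of the two critics' estimates
        q = torch.min(target_q1(next_obs, next_action), target_q2(next_obs, next_action))
        return rewards + gamma * (1.0 - dones) * q
```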
DDPG vs. TD3
| Feature | DDPG | TD3 | 
|---|---|---|
| Value Estimation | Single critic; prone to overestimation | Two critics; uses minimum value to reduce bias | 
| Update Schedule | Actor and critic update together | Actor updates are delayed and less frequent | 
| Stability | Brittle and sensitive to hyperparameters | Significantly more stable and robust | 
Resources:
So far, we’ve built agents that can learn optimal actions. But how does an agent discover these actions in the first place? This is the core of the exploration vs. exploitation dilemma. Should the agent exploit its current knowledge to get high rewards, or should it explore new, untried actions that might lead to even better rewards in the long run?
Agents like DDPG and TD3 are great at executing a known strategy, but without a good exploration plan, they can get stuck in a rut, repeating the first successful actions they find without ever discovering superior alternatives. This section covers the fundamental strategies agents use to explore their environment effectively.
Classic Exploration Strategies (for Discrete Actions): These methods are foundational and work best in environments with a limited set of distinct actions (like “left,” “right,” “up,” “down”).
Performance Insights: Research shows that softmax consistently performs best across different maze environments, while ε-greedy often underperforms compared to other strategies. UCB and pursuit strategies show competitive performance with proper parameter tuning.
Exploration in Continuous Action Spaces: For algorithms like DDPG and TD3, which operate in continuous action spaces (e.g., controlling a steering angle), exploration is handled differently. Since you can’t just pick a “random” action from an infinite set, exploration is typically achieved by adding noise to the policy’s output.
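For instance, a common scheme is to add clipped Gaussian noise to the deterministic policy’s output (a sketch; the noise scale and action bounds are assumptions):

```python
import numpy as np

def noisy_action(policy_action, noise_std=0.1, act_low=-1.0, act_high=1.0,
                 rng=np.random.default_rng()):
    # Gaussian exploration noise on top of the deterministic policy output
    # (DDPG originally used an Ornstein-Uhlenbeck process; uncorrelated Gaussian
    # noise is the simpler choice popularized by TD3)
    noise = rng.normal(0.0, noise_std, size=np.shape(policy_action))
    return np.clip(policy_action + noise, act_low, act_high)
```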
Advanced Exploration A more advanced concept is to create “curious” agents that are intrinsically motivated to explore. Instead of relying only on external rewards from the environment, the agent receives an intrinsic reward for visiting new or unpredictable states.
Further Reading:
Here are advanced project suggestions for Week 4:
Implementation Resources: