Policy Gradient Bandit Simulator (REINFORCE)

Bandit Setup (3 Arms)

Set the true probability of receiving a reward (1) for each arm. The agent does not know these probabilities; it must learn them by pulling arms and observing rewards.
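As a rough illustration of this setup, the sketch below models a 3-armed Bernoulli bandit in plain Python. The probabilities in `TRUE_PROBS` and the helper name `pull` are assumptions chosen for the example, not values taken from the simulator. Sampling each arm many times shows that the hidden probabilities can only be recovered from observed rewards.

```python
import random

# Hypothetical true reward probabilities for the 3 arms (illustrative values only).
TRUE_PROBS = [0.2, 0.5, 0.8]

def pull(arm, rng):
    """Return reward 1 with the arm's true probability, otherwise 0."""
    return 1 if rng.random() < TRUE_PROBS[arm] else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
pulls = 10_000

# Empirical reward rate per arm: the only information an agent can observe.
estimates = [sum(pull(arm, rng) for _ in range(pulls)) / pulls for arm in range(3)]
print(estimates)
```

With enough pulls, each empirical rate should land close to the corresponding entry of `TRUE_PROBS`.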

Learning Parameters
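The two parameters that matter most here are the learning rate and the number of episodes. The sketch below shows one way they could enter a REINFORCE update on a stateless bandit, assuming a softmax policy over per-arm preferences and no baseline. The function name `run_reinforce`, its default values, and the example probabilities are hypothetical and are not taken from the simulator itself.

```python
import math
import random

def softmax(prefs):
    """Convert preferences into action probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_reinforce(true_probs, learning_rate=0.1, episodes=5000, seed=0):
    """Run REINFORCE (no baseline) on a Bernoulli bandit; return the final policy."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_probs)  # policy parameters, one preference per arm
    for _ in range(episodes):
        probs = softmax(prefs)
        # Sample an arm from the current policy.
        arm = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = 1 if rng.random() < true_probs[arm] else 0
        # For a softmax policy, d/d(pref_k) of log pi(arm) is 1[k == arm] - pi(k).
        for k in range(len(prefs)):
            grad_log_pi = (1.0 if k == arm else 0.0) - probs[k]
            prefs[k] += learning_rate * reward * grad_log_pi
    return softmax(prefs)

final_policy = run_reinforce([0.2, 0.5, 0.8])
print(final_policy)
```

A larger learning rate moves the policy faster but makes individual runs noisier, and too few episodes may leave the policy close to uniform. Without a baseline, a run can also settle on a suboptimal arm, so results vary with the seed.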

Disclaimer: This calculator provides a simplified illustration of the REINFORCE policy gradient algorithm on a stateless multi-armed bandit problem. It is for educational purposes only. Real-world reinforcement learning usually involves stateful environments, often with function approximation (such as neural networks), and more advanced algorithms (e.g., Actor-Critic, PPO) for stability and efficiency. Results depend heavily on parameters such as the learning rate and the number of episodes.