- a set of environment and agent states, S;
- a set of actions, A, of the agent;
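- $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of transition (at time $t$) from state $s$ to state $s'$ under action $a$;
- $R_a(s, s')$ is the immediate reward after transition from $s$ to $s'$ with action $a$;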
- rules that describe what the agent observes
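As a rough illustration, the components listed above (states S, actions A, transition probabilities and immediate rewards) can be held in a small tabular structure. The sketch below is only one possible encoding: the class name `TabularMDP`, the field names `P` and `R`, and the two-state example are all hypothetical, and rewards that are not listed are assumed to be zero.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Hypothetical container for a small, finite MDP."""
    states: list    # S
    actions: list   # A
    P: dict         # P[(s, a)] -> {s_next: probability}
    R: dict         # R[(s, a, s_next)] -> immediate reward (assumed 0.0 if absent)

    def step(self, s, a, rng):
        """Sample a next state for (s, a) and return it with its immediate reward."""
        dist = self.P[(s, a)]
        next_states = list(dist)
        weights = [dist[x] for x in next_states]
        s_next = rng.choices(next_states, weights=weights)[0]
        return s_next, self.R.get((s, a, s_next), 0.0)


# Made-up two-state example, for illustration only.
mdp = TabularMDP(
    states=["low", "high"],
    actions=["wait", "charge"],
    P={
        ("low", "wait"): {"low": 1.0},
        ("low", "charge"): {"high": 0.9, "low": 0.1},
        ("high", "wait"): {"high": 0.8, "low": 0.2},
        ("high", "charge"): {"high": 1.0},
    },
    R={("high", "wait", "high"): 1.0, ("high", "wait", "low"): 1.0},
)

rng = random.Random(0)
next_state, reward = mdp.step("high", "wait", rng)
```

Here the agent is assumed to observe the state directly; a partially observable setting would need an extra observation function.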
- A model of the environment is known, but an analytic solution is not available;
- Only a simulation model of the environment is given (the subject of simulation-based optimization);
- The only way to collect information about the environment is to interact with it.
Algorithms for control learning
Criterion of optimality
- For each possible policy, sample returns while following it
- Choose the policy with the largest expected return
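The two steps above can be sketched for a toy problem as follows. This is a minimal illustration under stated assumptions, not a practical method: the two-state MDP, the finite horizon, the discount factor `gamma`, and the function names are all hypothetical, and enumerating every deterministic policy is only feasible because the example is tiny.

```python
import itertools
import random

# Hypothetical toy MDP: P[(state, action)] -> list of (probability, next_state, reward)
STATES = [0, 1]
ACTIONS = ["stay", "move"]
P = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.9, 1, 0.0), (0.1, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 1.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def sample_return(policy, rng, start=0, horizon=30, gamma=0.9):
    """Follow `policy` for `horizon` steps and return the discounted sum of rewards."""
    s, g, discount = start, 0.0, 1.0
    for _ in range(horizon):
        outcomes = P[(s, policy[s])]
        weights = [p for p, _, _ in outcomes]
        _, s_next, r = rng.choices(outcomes, weights=weights)[0]
        g += discount * r
        discount *= gamma
        s = s_next
    return g


def brute_force_search(n_samples=200, seed=0):
    """Estimate the expected return of every deterministic policy; keep the best."""
    rng = random.Random(seed)
    best_policy, best_value = None, float("-inf")
    for choice in itertools.product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, choice))
        value = sum(sample_return(policy, rng) for _ in range(n_samples)) / n_samples
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value


best_policy, best_value = brute_force_search()
```

The number of deterministic policies is `len(ACTIONS) ** len(STATES)`, which is why this approach does not scale beyond very small problems.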
Monte Carlo methods
- The procedure may spend too much time evaluating a suboptimal policy.
- It uses samples inefficiently in that a long trajectory improves the estimate only of the single state-action pair that started the trajectory.
- When the returns along the trajectories have high variance, convergence is slow.
- It works in episodic problems only;
- It works in small, finite MDPs only.
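For context, the Monte Carlo estimate these limitations refer to is an average of sampled returns. The sketch below shows an every-visit variant, in which each state-action pair along a trajectory contributes to its own estimate, one common way to use samples more efficiently than crediting only the first pair. The function name, the episode format and the hand-written episodes are assumptions made for illustration; the code relies on episodes terminating, which reflects the episodic restriction noted above.

```python
from collections import defaultdict


def mc_q_estimate(episodes, gamma=1.0):
    """Every-visit Monte Carlo: average the return observed after each (state, action)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:  # episode: list of (state, action, reward) tuples
        g = 0.0
        # Walk backwards so g is the return from each step to the end of the episode.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            totals[(state, action)] += g
            counts[(state, action)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}


# Hypothetical hand-written episodes, not data from any real environment.
episodes = [
    [("s0", "right", 0.0), ("s1", "right", 1.0)],
    [("s0", "right", 0.0), ("s1", "left", 0.0)],
]
q = mc_q_estimate(episodes)
```

With only two episodes the estimates are very noisy, which is the high-variance issue listed above in miniature.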
Temporal difference methods
Direct policy search
- adaptive methods that work with fewer (or no) parameters under a large number of conditions
- addressing the exploration problem in large MDPs
- large-scale empirical evaluations
- learning and acting under partial information (e.g., using Predictive State Representation)
- modular and hierarchical reinforcement learning
- improving existing value-function and policy search methods
- algorithms that work well with large (or continuous) action spaces
- transfer learning
- lifelong learning
- efficient sample-based planning (e.g., based on Monte Carlo tree search).
- bug detection in software projects
- Actor-Critic Reinforcement Learning
Comparison of reinforcement learning algorithms
|Algorithm||Description||Model||Policy||Action Space||State Space||Operator|
|Monte Carlo||Every-visit Monte Carlo||Model-Free||Either||Discrete||Discrete||Sample-means|
|Q-learning - Lambda||State–action–reward–state with eligibility traces||Model-Free||Off-policy||Discrete||Discrete||Q-value|
|SARSA - Lambda||State–action–reward–state–action with eligibility traces||Model-Free||On-policy||Discrete||Discrete||Q-value|
|DQN||Deep Q Network||Model-Free||Off-policy||Discrete||Continuous||Q-value|
|DDPG||Deep Deterministic Policy Gradient||Model-Free||Off-policy||Continuous||Continuous||Q-value|
|A3C||Asynchronous Advantage Actor-Critic Algorithm||Model-Free||On-policy||Continuous||Continuous||Advantage|
|NAF||Q-Learning with Normalized Advantage Functions||Model-Free||Off-policy||Continuous||Continuous||Advantage|
|TRPO||Trust Region Policy Optimization||Model-Free||On-policy||Continuous||Continuous||Advantage|
|PPO||Proximal Policy Optimization||Model-Free||On-policy||Continuous||Continuous||Advantage|
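The Policy column can be illustrated with the one-step tabular updates of Q-learning and SARSA. This is a simplified sketch that omits the eligibility traces of the Lambda variants in the table; the function names, the dictionary-based Q table, and the default step size `alpha` and discount `gamma` are assumptions chosen for the example.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action in s_next, whatever is taken next."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action a_next actually chosen in s_next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)


# Hypothetical starting values; both learners see the same transition.
actions = ["left", "right"]
Q_off = {("s1", "left"): 0.0, ("s1", "right"): 2.0}
Q_on = dict(Q_off)

q_learning_update(Q_off, "s0", "right", 1.0, "s1", actions)
sarsa_update(Q_on, "s0", "right", 1.0, "s1", a_next="left")
```

The two estimates for `("s0", "right")` differ because SARSA's target follows the exploratory action `"left"`, while Q-learning's target uses the greedy action in `"s1"`.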