Reinforcement learning

1. Intro to RL
Lecture: RL problems around us. Markov decision processes. Simple solutions through combinatorial optimization.
Seminar: FrozenLake with genetic algorithms
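A minimal sketch of the seminar idea, assuming the classic gym API (`env.reset()` / `env.step()` returning a 4-tuple) and the FrozenLake-v0 env id; it uses a crude (1+1)-style mutation search over deterministic policies rather than a full genetic algorithm:

```python
import numpy as np
import gym

env = gym.make("FrozenLake-v0")  # env id is an assumption; any small discrete env works
n_states, n_actions = env.observation_space.n, env.action_space.n

def evaluate(policy, n_games=50, t_max=100):
    """Average reward of a deterministic policy (one action per state)."""
    total = 0.0
    for _ in range(n_games):
        s = env.reset()
        for _ in range(t_max):
            s, r, done, _ = env.step(int(policy[s]))
            total += r
            if done:
                break
    return total / n_games

def mutate(policy, p=0.1):
    """Randomly resample a fraction of the actions."""
    new_policy = policy.copy()
    mask = np.random.rand(n_states) < p
    new_policy[mask] = np.random.randint(0, n_actions, mask.sum())
    return new_policy

# keep the mutant only if it scores better than the current best policy
best = np.random.randint(0, n_actions, n_states)
best_score = evaluate(best)
for i in range(200):
    candidate = mutate(best)
    score = evaluate(candidate)
    if score > best_score:
        best, best_score = candidate, score
```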
2. Crossentropy method and Monte Carlo algorithms
Lecture: Crossentropy method in general and for RL. Extension to continuous state & action space. Limitations.
Seminar: Tabular CEM for Taxi-v0, deep CEM for Box2D environments.
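A rough sketch of the tabular crossentropy method loop: sample sessions, keep the elite ones above a reward percentile, and refit the policy to elite state-action pairs. The env id, percentile and smoothing coefficient are assumptions, not the seminar's exact settings; the classic gym step API is assumed as well.

```python
import numpy as np
import gym

env = gym.make("Taxi-v2")  # env id is an assumption; any discrete-state env works
n_states, n_actions = env.observation_space.n, env.action_space.n

def generate_session(policy, t_max=200):
    """Play one episode with a stochastic tabular policy; return states, actions, total reward."""
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        a = int(np.random.choice(n_actions, p=policy[s]))
        new_s, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = new_s
        if done:
            break
    return states, actions, total_reward

policy = np.full((n_states, n_actions), 1.0 / n_actions)
for epoch in range(50):
    sessions = [generate_session(policy) for _ in range(250)]
    rewards = np.array([r for _, _, r in sessions])
    threshold = np.percentile(rewards, 70)               # keep the elite sessions
    elite = [(s, a) for st, ac, r in sessions if r >= threshold
             for s, a in zip(st, ac)]
    counts = np.zeros_like(policy)
    for s, a in elite:
        counts[s, a] += 1
    # refit the policy to elite state-action pairs (uniform where a state was never visited)
    visits = counts.sum(axis=1, keepdims=True)
    new_policy = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_actions)
    policy = 0.5 * policy + 0.5 * new_policy             # smooth the update a bit
```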
3. Temporal Difference
Lecture: Discounted reward MDPs. Value iteration. Q-learning. Temporal difference vs. Monte Carlo.
Seminar: Tabular Q-learning
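The core temporal-difference update from this week, as a minimal sketch; alpha, gamma and the epsilon-greedy exploration schedule are generic hyperparameters, not anything specific to the seminar code.

```python
import random
import numpy as np

def q_learning_step(Q, s, a, r, next_s, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: move Q[s, a] toward the TD target."""
    target = r + (0.0 if done else gamma * np.max(Q[next_s]))
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Behave greedily most of the time, explore with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[s]))
```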
4. Value-based algorithms
Lecture: SARSA. Off-policy vs. on-policy algorithms. N-step algorithms. Eligibility traces.
Seminar: Q-learning vs. SARSA vs. expected value SARSA in the wild
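The three TD targets compared in the seminar differ only in how they evaluate the next state; a sketch, assuming a tabular Q and an epsilon-greedy behavior policy:

```python
import numpy as np

def td_targets(Q, next_s, next_a, r, done, gamma=0.99, epsilon=0.1):
    """TD targets for Q-learning (off-policy), SARSA (on-policy) and expected value SARSA."""
    if done:
        return r, r, r
    n_actions = Q.shape[1]
    # epsilon-greedy action probabilities in next_s
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[next_s])] += 1.0 - epsilon
    q_learning = r + gamma * np.max(Q[next_s])               # max over next actions
    sarsa = r + gamma * Q[next_s, next_a]                    # action actually taken
    expected_sarsa = r + gamma * np.dot(probs, Q[next_s])    # expectation under the policy
    return q_learning, sarsa, expected_sarsa
```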
5. Deep learning recap
Lecture: Deep learning essentials: convolutional nets, batch normalization, dropout, data augmentation and other standard tricks.
Seminar: Theano/Lasagne on MNIST, simple deep Q-learning with CartPole (a TensorFlow version contribution is welcome)
6. Approximate reinforcement learning
Lecture: Infinite/continuous state space. Value function approximation. Convergence conditions. Multiple agents trick.
Seminar: Approximate Q-learning with experience replay. (CartPole, Acrobot, Doom)
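A sketch of the experience replay part of the seminar, independent of the particular function approximator; `train_on_batch` is a hypothetical placeholder for whatever network update you plug in.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, next_s, done) transitions."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, next_s, done):
        self.storage.append((s, a, r, next_s, done))

    def sample(self, batch_size=32):
        """Return batch_size random transitions as separate lists of s, a, r, next_s, done."""
        batch = random.sample(self.storage, batch_size)
        return [list(column) for column in zip(*batch)]

# inside the interaction loop (train_on_batch is a hypothetical network update):
# buffer.add(s, a, r, next_s, done)
# if len(buffer.storage) >= 1000:
#     train_on_batch(*buffer.sample(32))
```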
7. Deep reinforcement learning
Lecture: Deep Q-learning/SARSA and friends. Heuristics and the motivation behind them: experience replay, target networks, double/dueling/bootstrap DQN, etc.
Seminar: DQN on Atari
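One of the DQN heuristics from the lecture, target networks, amounts to computing TD targets with a delayed copy of the parameters. A framework-agnostic sketch with parameters stored as plain numpy arrays (the dict-of-weights layout is an assumption for illustration):

```python
import numpy as np

def hard_update(target_weights, online_weights):
    """Copy online parameters into the target network every few thousand steps."""
    for key in online_weights:
        target_weights[key] = online_weights[key].copy()

def soft_update(target_weights, online_weights, tau=0.01):
    """Polyak averaging: slowly track the online network instead of copying it."""
    for key in online_weights:
        target_weights[key] = (1 - tau) * target_weights[key] + tau * online_weights[key]

# The DQN TD target uses the *target* network:  y = r + gamma * max_a Q_target(s', a)
```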
8. Policy gradient methods
Lecture: Motivation for policy-based methods, policy gradient, the log-derivative trick, REINFORCE/crossentropy method, variance reduction (advantage), advantage actor-critic (incl. n-step advantage)
Seminar: REINFORCE implemented manually, advantage actor-critic for MountainCar - see week6/README.md
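A minimal tabular REINFORCE sketch: a softmax policy over a table of logits, updated with the log-derivative trick on full-episode returns. The learning rate and discount are illustrative assumptions, not the seminar's settings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

def reinforce_update(logits, states, actions, rewards, lr=0.1, gamma=0.99):
    """For a softmax policy, grad log pi(a|s) = onehot(a) - pi(.|s); scale it by the return."""
    for s, a, G in zip(states, actions, discounted_returns(rewards, gamma)):
        probs = softmax(logits[s])
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        logits[s] += lr * G * grad_log_pi
```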
9. RNN recap
Lecture: Recurrent neural networks for sequences. GRU/LSTM. Gradient clipping. Seq2seq.
Seminar: char-rnn and simple seq2seq
10. Partially observable MDPs
Lecture: POMDP intro. Model-based solvers. RNN solvers. RNN tricks: attention, problems with normalization methods, pre-training.
Seminar: Deep kung-fu & Doom with recurrent A3C and DRQN
11. Case studies 1
Lecture: Reinforcement Learning as a general way to optimize non-differentiable loss. Seq2seq tasks: g2p, machine translation, conversation models, image captioning.
Seminar: Simple neural machine translation with self-critical policy gradient
12. Advanced exploration methods
Lecture 1: Improved exploration methods for bandits. UCB, Thompson sampling, the Bayesian approach.
Lecture 2: Augmented rewards. Density-based models, UNREAL, variational information maximizing exploration, Bayesian optimization with BNNs.
Seminar: Bayesian exploration for contextual bandits
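A sketch of UCB1 from the bandit part of the lecture: pick the arm with the best optimistic estimate, i.e. mean reward plus an exploration bonus that shrinks as the arm gets pulled more often.

```python
import numpy as np

class UCB1:
    """Upper-confidence-bound action selection for a multi-armed bandit."""
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)   # running mean reward per arm
        self.t = 0

    def select(self):
        self.t += 1
        if np.any(self.counts == 0):     # pull every arm at least once first
            return int(np.argmin(self.counts))
        bonus = np.sqrt(2 * np.log(self.t) / self.counts)
        return int(np.argmax(self.values + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```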
13. Trust Region Policy Optimization
Lecture: Trust region policy optimization in detail. NPO/TRPO.
Seminar: Approximate TRPO vs. approximate Q-learning for gym Box2D environments (robotics-themed).
14. RL in large/continuous action spaces
Lecture: Continuous action space MDPs. Value-based approach (NAF). Special-case algorithms (DPG, SVG). Case study: finance. The large discrete action space problem. Action embeddings.
Seminar: Classic control and BipedalWalker with DDPG vs. qNAF. https://gym.openai.com/envs/BipedalWalker-v2 . Financial bot as a bonus track.
15. Advanced RL topics
Lecture 1: Hierarchical MDPs. MDPs vs. the real world. Sparse and delayed rewards. When Q-learning fails. Hierarchy as temporal abstraction. MDPs with symbolic reasoning.
Lecture 2: Knowledge transfer in RL and inverse reinforcement learning: basics; personalized medical treatment; robotics.