**1. Intro to RL**

Lecture: RL problems around us. Markov decision processes. Simple solutions through combinatorial optimization.

Seminar: FrozenLake with genetic algorithms

**2. Cross-entropy method and Monte Carlo algorithms**

Lecture: Cross-entropy method in general and for RL. Extension to continuous state & action spaces. Limitations.

Seminar: Tabular CEM for Taxi-v0, deep CEM for Box2D environments.
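
The tabular cross-entropy method boils down to two steps: keep the sessions whose total reward clears a percentile threshold, then refit the policy to the frequencies of elite state-action pairs. A minimal sketch (function names and signatures are illustrative, not the seminar's actual code):

```python
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=50):
    """Keep state-action pairs from sessions whose total reward
    is at or above the given percentile of the batch."""
    threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward >= threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions

def update_policy(elite_states, elite_actions, n_states, n_actions):
    """New policy = action frequencies among elite sessions;
    uniform for states never visited by an elite session."""
    policy = np.zeros((n_states, n_actions))
    for s, a in zip(elite_states, elite_actions):
        policy[s, a] += 1
    for s in range(n_states):
        total = policy[s].sum()
        policy[s] = policy[s] / total if total > 0 else np.full(n_actions, 1.0 / n_actions)
    return policy
```

In practice the new policy is usually smoothed toward the old one (e.g. `0.5 * new + 0.5 * old`) to avoid collapsing onto a lucky session.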

**3. Temporal Difference**

Lecture: Discounted reward MDPs. Value iteration. Q-learning. Temporal difference vs Monte Carlo.

Seminar: Tabular Q-learning
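
The core of tabular Q-learning is one update rule: nudge Q(s, a) toward the TD target `r + gamma * max_a' Q(s', a')`. A minimal agent sketch (class and method names are illustrative, not the seminar's actual code):

```python
import random
from collections import defaultdict

class QLearningAgent:
    def __init__(self, n_actions, alpha=0.5, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)  # (state, action) -> value, 0 by default
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def best_value(self, state):
        return max(self.q[(state, a)] for a in range(self.n_actions))

    def act(self, state):
        # epsilon-greedy exploration
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + self.gamma * self.best_value(s_next)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```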

**4. Value-based algorithms**

Lecture: SARSA. Off-policy vs on-policy algorithms. N-step algorithms. Eligibility traces.

Seminar: Q-learning vs SARSA vs expected value SARSA in the wild
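
The three algorithms differ only in the TD target they bootstrap from: Q-learning takes the greedy max over next-state values (off-policy), SARSA uses the action actually sampled (on-policy), and expected value SARSA averages over the behavior policy. A sketch for an epsilon-greedy policy (the helper and its signature are illustrative):

```python
import numpy as np

def td_targets(q_next, r, gamma, a_next, epsilon):
    """TD targets for the three algorithms.
    q_next: array of Q(s', a) for all actions; a_next: action taken in s'."""
    n = len(q_next)
    q_learning = r + gamma * q_next.max()      # off-policy: greedy max
    sarsa = r + gamma * q_next[a_next]         # on-policy: sampled action
    # expected SARSA: expectation of Q(s', .) under the eps-greedy policy
    probs = np.full(n, epsilon / n)
    probs[q_next.argmax()] += 1 - epsilon
    expected_sarsa = r + gamma * probs @ q_next
    return q_learning, sarsa, expected_sarsa
```

With `epsilon = 0` expected SARSA coincides with Q-learning; with a fully random policy it averages all actions uniformly.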

**5. Deep learning recap**

Lecture: Deep learning, convolutional nets, batchnorm, dropout, data augmentation and other standard tricks.

Seminar: Theano/Lasagne on MNIST, simple deep Q-learning with CartPole (a TensorFlow version is a welcome contribution)

**6. Approximate reinforcement learning**

Lecture: Infinite/continuous state spaces. Value function approximation. Convergence conditions. Multiple-agents trick.

Seminar: Approximate Q-learning with experience replay. (CartPole, Acrobot, Doom)
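
Experience replay stores transitions in a fixed-size buffer and trains on random minibatches, which breaks the temporal correlation between consecutive updates. A minimal buffer sketch (the class is illustrative, not the seminar's actual code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniform random minibatch, returned as parallel lists."""
        batch = random.sample(self.storage, min(batch_size, len(self.storage)))
        return tuple(map(list, zip(*batch)))
```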

**7. Deep reinforcement learning**

Lecture: Deep Q-learning/SARSA and friends. Heuristics & motivation behind them: experience replay, target networks, double/dueling/bootstrap DQN, etc.

Seminar: DQN on Atari
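
Of the heuristics above, target networks are the simplest to state: compute TD targets from a lagged copy of the weights, updated either by periodic hard copies or by Polyak averaging. A framework-agnostic sketch of the soft variant (the function is illustrative, with weights as plain floats):

```python
def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target.
    Small tau keeps the target network slowly trailing the online one."""
    return [tau * p + (1 - tau) * tp
            for p, tp in zip(online_params, target_params)]
```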

**8. Policy gradient methods**

Lecture: Motivation for policy-based methods, policy gradient, log-derivative trick, REINFORCE/cross-entropy method, variance reduction (advantage), advantage actor-critic (incl. n-step advantage)

Seminar: REINFORCE manually, advantage actor-critic for MountainCar (see week6/README.md)
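
The log-derivative trick gives the REINFORCE gradient `E[G_t * grad log pi(a_t|s_t)]`; for a tabular softmax policy, `grad log pi(a|s)` with respect to the logits is just `onehot(a) - softmax(theta[s])`. A numeric sketch of "REINFORCE manually" (names and shapes are illustrative):

```python
import numpy as np

def reinforce_grad(theta, states, actions, returns):
    """Policy gradient estimate for a tabular softmax policy.
    theta: (n_states, n_actions) logits; returns: per-step returns G_t."""
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        # softmax with the usual max-subtraction for numerical stability
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        grad_logp = -probs          # d/dtheta log softmax, part 1
        grad_logp[a] += 1.0         # ... plus onehot of the taken action
        grad[s] += g * grad_logp    # weight by the return
    return grad
```

Subtracting a baseline (e.g. the state value) from `g` reduces variance without biasing the estimate, which is where advantage actor-critic comes from.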

**9. RNN recap**

Lecture: Recurrent neural networks for sequences. GRU/LSTM. Gradient clipping. Seq2seq.

Seminar: char-rnn and a simple seq2seq model

**10. Partially observable MDPs**

Lecture: POMDP intro. Model-based solvers. RNN solvers. RNN tricks: attention, problems with normalization methods, pre-training.

Seminar: Deep kung-fu & Doom with recurrent A3C and DRQN

**11. Case studies 1**

Lecture: Reinforcement learning as a general way to optimize non-differentiable losses. Seq2seq tasks: grapheme-to-phoneme (g2p), machine translation, conversation models, image captioning.

Seminar: Simple neural machine translation with self-critical policy gradient

**12. Advanced exploration methods**

Lecture 1: Improved exploration methods for bandits. UCB, Thompson sampling, Bayesian approach.

Lecture 2: Augmented rewards. Density-based models, UNREAL, variational information maximizing exploration, Bayesian optimization with BNNs.

Seminar: Bayesian exploration for contextual bandits
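
UCB1 is the simplest of the exploration methods above: pick the arm with the highest optimistic estimate, mean reward plus a bonus that shrinks as the arm is pulled more often. A minimal sketch (the function is illustrative, not the seminar's actual code):

```python
import math

def ucb1_pick(counts, values, t):
    """UCB1 arm selection.
    counts: pulls per arm; values: empirical mean reward per arm; t: total pulls.
    Untried arms are always picked first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # optimism bonus sqrt(2 ln t / n_a) balances exploration vs exploitation
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
```

Thompson sampling replaces the deterministic bonus with a draw from the posterior over each arm's mean, which is what the Bayesian seminar generalizes to the contextual case.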

**13. Trust Region Policy Optimization.**

Lecture: Trust region policy optimization in detail. NPO/TRPO.

Seminar: approximate TRPO vs approximate Q-learning for gym box2d envs (robotics-themed).

**14. RL in Large/Continuous action spaces.**

Lecture: Continuous action space MDPs. Value-based approach (NAF). Special-case algorithms (DPG, SVG). Case study: finance. Large discrete action space problem. Action embedding.

Seminar: Classic Control and BipedalWalker with DDPG vs qNAF. https://gym.openai.com/envs/BipedalWalker-v2 . Financial bot as a bonus track.

**15. Advanced RL topics**

Lecture 1: Hierarchical MDPs. MDP vs the real world. Sparse and delayed rewards. When Q-learning fails. Hierarchy as temporal abstraction. MDPs with symbolic reasoning.

Lecture 2: Knowledge transfer in RL & inverse reinforcement learning: basics; personalized medical treatment; robotics.