# Timeline of Main Events/Concepts
This timeline is structured around the development of Reinforcement Learning (RL), focusing on the challenges, techniques, and applications discussed in the provided documents. It interleaves events with the introduction of influential concepts and the creation of benchmarks.
## Early Reinforcement Learning (Pre-2010s):
- **Core MDP Formulation:** RL is grounded in the Markov Decision Process (MDP), defined by states, actions, transition probabilities, rewards, and a discount factor (the standard notation is written out after this list). This foundational formulation underpins much of the work discussed.
- **Policy Iteration:** Policy Iteration, which alternates policy evaluation and policy improvement, emerges as a central method for solving MDPs (a minimal tabular sketch follows this list).
- **Monte Carlo (MC) Methods:** MC methods evaluate policies by constructing unbiased value estimates from complete sampled trajectories.
- **Temporal Difference (TD) Learning:** TD learning provides an alternative approach to policy evaluation that bootstraps from the value estimates of successor states; the two update rules are compared after this list.
- **Off-Policy Learning:** The concept of off-policy training is noted, where a policy learns from data generated by a different "behaviour" policy.
- **Early Deep Learning Approaches:** Some of the earliest deep RL work, using neural networks to approximate Q-functions and policies, appears during this period.
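For reference, the MDP formulation above is conventionally written as a tuple together with a discounted value function; the notation below is the standard textbook form rather than anything specific to the source documents.

```latex
% MDP as a tuple: state space, action space, transition kernel, reward, discount
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a), \quad R(s, a), \quad \gamma \in [0, 1)

% Discounted state-value function of a policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s \right]
```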
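A minimal tabular sketch of policy iteration, assuming the transition model is available as `P[s][a]`, a list of `(prob, next_state, reward)` tuples (a hypothetical data layout chosen for illustration):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular policy iteration: alternate policy evaluation and greedy improvement."""
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    V = np.zeros(n_states)

    def q_value(s, a):
        # Expected one-step return of taking action a in state s under the model.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep Bellman expectation backups until convergence.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V.
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q_value(s, a))
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a
        if stable:  # greedy policy unchanged: it is optimal for this model
            return policy, V
```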
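The Monte Carlo and TD(0) evaluation updates differ only in the target they regress toward; these are the standard forms, with a step size α assumed:

```latex
% Monte Carlo: regress toward the full sampled return G_t (unbiased, higher variance)
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad
V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]

% TD(0): bootstrap from the current estimate of the next state's value
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
```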
## Mid-2010s: Advances and Challenges in Deep Reinforcement Learning:
- **Deep Q-Networks (DQN):** Deep neural networks are used to approximate the Q-function, leading to breakthroughs in domains like Atari and marking a major step in handling high-dimensional state spaces (a simplified loss sketch follows this list).
- **Continuous Control:** Research extends RL to continuous action spaces, which are necessary for real-world robotic control (Lillicrap et al., 2015).
- **Partial Observability:** Techniques are developed to handle partially observable Markov Decision Processes (POMDPs), addressing cases where agents do not have complete state information (Hausknecht and Stone, 2015). This includes using recurrent neural networks to process history.
- **Risk-Aware RL:** Approaches emerge for considering risk and safety, leading to the development of Constrained MDPs (CMDPs) and methods for handling safety constraints (Tamar et al., 2015a, 2015b, 2016; Dalal et al., 2018); the constrained objective is written out after this list.
- **Multi-objective RL:** Methods to handle multiple reward functions are developed (Roijers et al., 2013).
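A simplified sketch of the DQN idea: a network outputs one Q-value per discrete action and is regressed toward a one-step TD target computed with a frozen target network. The MLP and mean-squared-error loss below are simplifications for illustration; the original work used a convolutional network over Atari frames, experience replay, and further refinements.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One-step TD loss against a periodically synced, frozen target network."""
    s, a, r, s_next, done = batch  # tensors sampled from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    with torch.no_grad():
        # Bootstrap target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```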
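The CMDP formulation mentioned above augments the usual objective with a bound on an expected cumulative cost; the cost function C and threshold d below are generic symbols in the standard form, not taken from the source documents.

```latex
% Constrained MDP: maximize return subject to a budget d on expected cumulative cost
\max_{\pi} \ \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} R(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} C(s_t, a_t) \right] \le d
```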
## Late 2010s: Focus on Real-World Applications and Robustness:
- **Batch RL:** Emphasis shifts to learning from batches of pre-collected data for training policies. This includes offline and off-policy training regimes, as well as batch RL training frameworks (Scherrer et al., 2012).
- **High-Dimensional Action Spaces:** The challenges of dealing with large and continuous state and action spaces come to the fore. Techniques like action elimination are explored.
- **Robustness:** Research focuses on dealing with noisy, non-stationary environments. Approaches like domain randomization and system identification are developed to create robust policies (Iyengar, 2005; Peng et al., 2018; Finn et al., 2017; Nagabandi et al., 2018); see the randomization sketch after this list.
- **Model-Based RL:** Model-based approaches, which learn a model of the environment to plan against, become an important focus of research (Hafner et al., 2018; Chua et al., 2018); a simple planning sketch follows this list.
- **Text-Based Games (TGs) as Benchmarks:** Text-based games are identified as important testbeds for RL, introducing the challenge of natural language processing into sequential decision-making. Datasets and benchmarks appear, such as Zork and Treasure Hunter.
- **Deep Reinforcement Learning with Natural Language Action Spaces:** Techniques for handling natural language as an action space begin to emerge (He et al., 2015).
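A toy sketch of the domain-randomization idea from the robustness bullet: simulator parameters are resampled every episode so the policy is trained across a distribution of dynamics rather than a single setting. The `make_env` factory and the parameter names are hypothetical.

```python
import random

def sample_randomized_env(make_env):
    """Build one environment instance with randomly perturbed physics parameters."""
    params = {
        "friction":  random.uniform(0.5, 1.5),          # scale contact friction
        "link_mass": random.uniform(0.8, 1.2),           # scale body masses
        "obs_noise": random.uniform(0.0, 0.05),          # sensor noise std
        "latency":   random.choice([0.0, 0.01, 0.02]),   # actuation delay (s)
    }
    return make_env(**params)

# Training loop sketch: a freshly randomized environment per episode.
# for episode in range(num_episodes):
#     env = sample_randomized_env(make_env)
#     collect_rollout(policy, env)
```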
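One simple way to use a learned dynamics model for planning, in the spirit of the model-based bullet above, is random-shooting model predictive control; this is a deliberately simpler planner than those in the cited papers, and `dynamics_model` / `reward_fn` are assumed interfaces for this sketch.

```python
import numpy as np

def random_shooting_mpc(dynamics_model, reward_fn, state, action_dim,
                        horizon=10, n_candidates=500):
    """Sample action sequences, roll them through the learned model, and return
    the first action of the highest-scoring sequence (re-planned every step)."""
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, sequence in enumerate(candidates):
        s = state
        for a in sequence:
            s_next = dynamics_model(s, a)       # predicted next state
            returns[i] += reward_fn(s, a, s_next)
            s = s_next
    best = int(np.argmax(returns))
    return candidates[best, 0]  # execute only the first action, MPC-style
```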
## Early 2020s: Text-Based Environments and Language-Based RL:
- **Text-Based Game Surveys and Architectures:** Surveys and taxonomies of text-game environments, as well as descriptions of the agent architectures used in the field, are published.
- **ALFWorld:** The ALFWorld benchmark is developed, which aims to align text and embodied environments for interactive learning (Shridhar et al., 2020).
- **Language-Based RL (NLRL):** The Natural Language RL (NLRL) framework emerges, transferring RL concepts into the domain of natural language by using language descriptions for states, actions, and value functions.
- **Large Language Models (LLMs) in RL:** The potential of Large Language Models to act as policies and evaluators is discussed and investigated.
- **Retrieval-based RL:** Retrieval-Augmented Generation (RAG) is integrated with RL, using episodic memories of past trajectories to augment decision-making (Lewis et al., 2020; Goyal et al., 2022); see the retrieval sketch below.
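A toy sketch of the retrieval idea above: past trajectory snippets are embedded, and the entries closest to the current observation are retrieved to condition the agent's next decision (e.g. appended to its prompt or state). The `embed` function is an assumed text-encoder interface, not any specific paper's implementation.

```python
import numpy as np

class EpisodicMemory:
    """Store embedded (observation, action, outcome) records from past episodes
    and retrieve the nearest ones to the current observation."""
    def __init__(self, embed):
        self.embed = embed            # assumed: text -> 1-D numpy vector
        self.keys, self.records = [], []

    def add(self, observation, action, outcome):
        self.keys.append(self.embed(observation))
        self.records.append({"obs": observation, "action": action, "outcome": outcome})

    def retrieve(self, observation, k=3):
        if not self.records:
            return []
        query = self.embed(observation)
        keys = np.stack(self.keys)
        # Cosine similarity between the query and every stored key.
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [self.records[i] for i in top]
```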