# Assumptions

First, we assume that most real-world problems are inherently partially observable because: 1) the features used to define the state space do not fully describe the attributes or mechanisms of the target modelled system, and/or 2) the probabilities used in the environment's transition model are samples of an underlying distribution. For example, a medical diagnosis may be based on patient biometrics and treated as fully observable given the realistic limitations of data capture. However, a doctor may consider additional observable physical symptoms, such as skin colour and fatigue, or incorporate the patient's descriptions of recent symptoms not captured by the current biometric data.

Second, we assume that the system can be implemented online in order to interact with the live system. In this work, simulated problems are used for the evaluation, so we do not interact with the physical world. However, to align with this requirement, we still formalize training into two phases: 1) interaction with the live system, and 2) simulation from sampled experience (sketched below). The latter is defined to enable the agent to train with fewer live samples. To enforce safety constraints, a *Constrained-MDP* may be used with a pre-defined specification of the constraints; although this is not included in this work, it could be incorporated if required by a problem. Alternatively, Offline-RL \cite{OfflineSurvey} performs the training phase entirely offline. In this work, we instead specify that training from sampled experience happens in parallel with the live interaction, so actions are still executed online and the resultant experience data is derived from those interactions.

Lastly, we assume that the agent is model-free by design (i.e., it can apply any of the learning models commonly used in prior Reinforcement Learning work). For this reason, model-free tabular and neural Q-learning agents are used in the evaluation of this work. Although a model-based approach could be applied to the simulated sample experience, we require that the same agent trains on the simulation while also continuing to train on the live system in parallel. This accounts for situations where the interval at which experience is collected varies between problems.
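
To make the two-phase formalization concrete, the following is a minimal sketch that interleaves live interaction with extra updates replayed from sampled experience, in the spirit of Dyna-style planning. The tabular Q-learning agent, the toy `ToyChainEnv` environment, and parameters such as `replay_steps` are illustrative assumptions rather than the exact setup evaluated in this work.

```python
import random
from collections import defaultdict


class ToyChainEnv:
    """Hypothetical stand-in for the live system: a short chain with reward at the end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, anything else moves left; reward only at the chain's end
        self.state = max(0, min(self.length - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.length - 1
        return self.state, (1.0 if done else 0.0), done


def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.95):
    """One-step tabular Q-learning update with terminal-state masking."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def train(env, episodes=200, max_steps=200, n_actions=2, epsilon=0.2, replay_steps=10):
    Q = defaultdict(lambda: [0.0] * n_actions)
    experience = []  # transitions gathered from live interaction
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            # Phase 1: interact with the live system (epsilon-greedy action selection)
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            experience.append((s, a, r, s_next, done))
            q_update(Q, s, a, r, s_next, done)
            # Phase 2: extra updates from sampled experience ("simulation"),
            # intended to reduce the number of live samples the agent needs
            for ps, pa, pr, ps_next, pdone in random.sample(
                experience, min(replay_steps, len(experience))
            ):
                q_update(Q, ps, pa, pr, ps_next, pdone)
            s = s_next
            if done:
                break
    return Q


if __name__ == "__main__":
    Q = train(ToyChainEnv())
    print({s: [round(v, 2) for v in vals] for s, vals in sorted(Q.items())})
```

In an actual deployment the phase-2 updates would run in parallel with the live interaction rather than interleaved in a single loop, as specified above.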