In medical research, a dynamic treatment regime (DTR), adaptive intervention, or adaptive treatment strategy is a set of rules for choosing effective treatments for individual patients (Lei et al. 2012). The treatment choices made for a particular patient are based on that individual's characteristics and history, with the goal of optimizing his or her long-term clinical outcome. A dynamic treatment regime is analogous to a policy in the field of reinforcement learning, and analogous to a controller in control theory. While most work on dynamic treatment regimes has been done in the context of medicine, the same ideas apply to time-varying policies in other fields, such as education, marketing, and economics.
Historically, medical research and the practice of medicine tended to rely on an acute care model for the treatment of all medical problems, including chronic illness. More recently, the medical field has begun to look at long-term care plans to treat patients with a chronic illness. This shift, coupled with increased demand for evidence-based medicine and individualized care, has led to the application of sequential decision-making research to medical problems and to the formulation of dynamic treatment regimes.
The figure below illustrates a hypothetical dynamic treatment regime for Attention Deficit Hyperactivity Disorder (ADHD). There are two decision points in this DTR. The initial treatment decision depends on the patient's baseline disease severity. The second treatment decision is a "responder/non-responder" decision: At some time after receiving the first treatment, the patient is assessed for response, i.e. whether or not the initial treatment has been effective. If so, that treatment is continued. If not, the patient receives a different treatment. In this example, for those who did not respond to initial medication, the second "treatment" is a package of treatments—it is the initial treatment plus behavior modification therapy. "Treatments" can be defined as whatever interventions are appropriate, whether they take the form of medications or other therapies.
The decisions of a dynamic treatment regime are made in the service of producing favorable clinical outcomes in patients who follow it. To make this more precise, the following mathematical framework is used:
For a series of decision time points, $t = 1, \ldots, T$, define $A_t$ to be the treatment ("action") chosen at time point $t$, and define $O_t$ to be all clinical observations made at time $t$, immediately prior to treatment $A_t$. A dynamic treatment regime, $\pi = (\pi_1, \ldots, \pi_T)$, consists of a set of rules, one for each time point $t$, for choosing treatment $A_t$ based on clinical observations $O_t$. Thus $\pi_t(o_1, a_1, \ldots, a_{t-1}, o_t)$ is a function of the past and current observations $(o_1, \ldots, o_t)$ and past treatments $(a_1, \ldots, a_{t-1})$, which returns a choice of the current treatment, $a_t$.
Also observed at each time point is a measure of success called a reward $R_t$. The goal of a dynamic treatment regime is to make decisions that result in the largest possible expected sum of rewards, $R = \sum_t R_t$. A dynamic treatment regime $\pi^*$ is optimal if it satisfies
$$\pi^* = \arg\max_{\pi} E[R \mid \text{treatments are chosen according to } \pi],$$
where $E$ is an expectation over possible observations and rewards. The quantity $E[R \mid \text{treatments are chosen according to } \pi]$ is often referred to as the value of $\pi$.
In the example above, the possible first treatments for $A_1$ are "Low-Dose B-mod" and "Low-Dose Medication". The possible second treatments for $A_2$ are "Increase B-mod Dose", "Continue Treatment", and "Augment w/B-mod". The observations $O_1$ and $O_2$ are the labels on the arrows: the possible $O_1$ are "Less Severe" and "More Severe", and the possible $O_2$ are "Non-Response" and "Response". The rewards $R_1, R_2$ are not shown; one reasonable possibility for reward would be to set $R_1 = 0$ and set $R_2$ to a measure of classroom performance after a fixed amount of time.
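The two decision rules of the ADHD example can be written as plain functions of the patient's history. The following is a minimal sketch, not a clinical specification: the function names are hypothetical, and the mapping from baseline severity to first treatment, as well as the choice of "Increase B-mod Dose" for non-responders to behavior modification, are assumptions made for illustration because the figure itself is not reproduced in the text.

```python
from typing import Tuple


def first_treatment(baseline_severity: str) -> str:
    """Rule for decision point 1: map observation O_1 to treatment A_1."""
    # Assumed mapping; the text only says the choice depends on severity.
    rules = {
        "Less Severe": "Low-Dose B-mod",
        "More Severe": "Low-Dose Medication",
    }
    return rules[baseline_severity]


def second_treatment(a1: str, response: str) -> str:
    """Rule for decision point 2: map the history (A_1, O_2) to treatment A_2."""
    if response == "Response":
        return "Continue Treatment"
    # Non-responders to medication are augmented with B-mod (stated in the text).
    if a1 == "Low-Dose Medication":
        return "Augment w/B-mod"
    # Assumed: non-responders to B-mod receive a higher B-mod dose.
    return "Increase B-mod Dose"


def follow_regime(baseline_severity: str, response: str) -> Tuple[str, str]:
    """Apply both rules in sequence and return the treatments (A_1, A_2)."""
    a1 = first_treatment(baseline_severity)
    a2 = second_treatment(a1, response)
    return a1, a2
```

Note that the second rule takes the first treatment as an argument: a regime's later decisions are functions of the whole history, not only of the most recent observation.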
To find an optimal dynamic treatment regime, it might seem reasonable to find the optimal treatment that maximizes the immediate reward at each time point and then patch these treatment steps together to create a dynamic treatment regime. However, this approach is shortsighted and can result in an inferior dynamic treatment regime, because it ignores the potential for the current treatment action to influence the reward obtained at more distant time points.
A treatment may be desirable as a first treatment even if it does not achieve a high immediate reward. For example, when treating some kinds of cancer, a particular medication may not result in the best immediate reward (best acute effect) among initial treatments. However, this medication may impose sufficiently few side effects that some non-responders are able to become responders with further treatment. Similarly, a treatment that is less effective acutely may lead to better overall rewards if it encourages or enables non-responders to adhere more closely to subsequent treatments.
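The difference between a shortsighted choice and an optimal one can be illustrated with a toy two-stage problem solved by backward induction. All numbers below are hypothetical and chosen only to make the point; treatment "A" has the better acute effect, while treatment "B" leaves non-responders with a better second-stage option.

```python
# Hypothetical probability of responding to each first treatment.
# The first-stage reward is 1 for a response and 0 otherwise.
P_RESPONSE = {"A": 0.6, "B": 0.4}

# Hypothetical expected second-stage reward, keyed by
# (first treatment, responded?) and then by second treatment.
EXPECTED_R2 = {
    ("A", True): {"continue": 1.0, "switch": 0.5},
    ("A", False): {"continue": 0.0, "switch": 0.2},
    ("B", True): {"continue": 1.0, "switch": 0.5},
    ("B", False): {"continue": 0.1, "switch": 0.9},
}


def best_second_stage(a1, responded):
    """Return the best second treatment and its expected reward."""
    options = EXPECTED_R2[(a1, responded)]
    a2 = max(options, key=options.get)
    return a2, options[a2]


def value_of_first_treatment(a1):
    """Expected total reward when a1 is followed by the best second stage."""
    p = P_RESPONSE[a1]
    _, v_resp = best_second_stage(a1, True)
    _, v_non = best_second_stage(a1, False)
    return p + p * v_resp + (1 - p) * v_non


# Shortsighted choice: maximize the immediate expected reward only.
myopic_choice = max(P_RESPONSE, key=P_RESPONSE.get)
# Optimal choice: maximize the expected sum of rewards over both stages.
optimal_choice = max(P_RESPONSE, key=value_of_first_treatment)
```

With these numbers the shortsighted rule selects "A", whereas accounting for the second stage selects "B", whose expected total reward is about 1.34 compared with about 1.28 for "A".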
Dynamic treatment regimes can be developed in the framework of evidence-based medicine, where clinical decision making is informed by data on how patients respond to different treatments. The data used to find optimal dynamic treatment regimes consist of the sequence of observations and treatments $(o_1, a_1, o_2, a_2, \ldots, o_T, a_T)$ for multiple patients, along with those patients' rewards $R_1, \ldots, R_T$. A central difficulty is that intermediate outcomes both depend on previous treatments and determine subsequent treatment. However, if treatment assignment is independent of potential outcomes conditional on past observations (that is, if treatment is sequentially unconfounded), a number of algorithms exist to estimate the causal effect of time-varying treatments or dynamic treatment regimes.
While this type of data can be obtained through careful observation, it is often preferable to collect data through experimentation if possible. The use of experimental data, where treatments have been randomly assigned, is preferred because it helps eliminate bias caused by unobserved confounding variables that influence both the choice of the treatment and the clinical outcome. This is especially important when dealing with sequential treatments, since these biases can compound over time. Given an experimental data set, an optimal dynamic treatment regime can be estimated from the data using a number of different algorithms. Inference can also be done to determine whether the estimated optimal dynamic treatment regime results in significant improvements in expected reward over an alternative dynamic treatment regime.
Experimental designs of clinical trials that generate data for estimating optimal dynamic treatment regimes involve an initial randomization of patients to treatments, followed by re-randomizations at each subsequent time point to another treatment. The re-randomizations at each subsequent time point may depend on information collected after previous treatments, but prior to assigning the new treatment, such as how successful the previous treatment was. These types of trials were introduced and developed in Lavori & Dawson (2000), Lavori (2003) and Murphy (2005) and are often referred to as SMART trials (Sequential Multiple Assignment Randomized Trial). Some examples of SMART trials are the CATIE trial for treatment of Alzheimer's (Schneider et al. 2001) and the STAR*D trial for treatment of major depressive disorder (Lavori et al. 2001, Rush, Trivedi & Fava 2003).
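The randomization scheme of such a trial can be sketched in a few lines. This is a simplified simulation under stated assumptions, not a description of any real trial: the treatment labels, the response probabilities, and the rule that the second-stage options depend only on response status are all hypothetical.

```python
import random

# Hypothetical treatment options and response probabilities.
FIRST_STAGE_OPTIONS = ["A", "B"]
SECOND_STAGE_OPTIONS = {
    True: ["continue", "step down"],   # options for responders
    False: ["switch", "augment"],      # options for non-responders
}
RESPONSE_PROB = {"A": 0.5, "B": 0.4}


def simulate_smart(n_patients, seed=0):
    """Simulate treatment assignment in a two-stage SMART design."""
    rng = random.Random(seed)
    trajectories = []
    for patient in range(n_patients):
        # Initial randomization to a first-stage treatment.
        a1 = rng.choice(FIRST_STAGE_OPTIONS)
        # Intermediate outcome observed after the first treatment.
        responded = rng.random() < RESPONSE_PROB[a1]
        # Re-randomization among options that depend on the response.
        a2 = rng.choice(SECOND_STAGE_OPTIONS[responded])
        trajectories.append(
            {"patient": patient, "a1": a1, "responded": responded, "a2": a2}
        )
    return trajectories


trial = simulate_smart(200)
```

Each simulated patient contributes one trajectory of treatments and intermediate outcomes, which is the kind of data needed to compare regimes that adapt the second treatment to the response.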
SMART trials attempt to mimic the decision-making that occurs in clinical practice, but still retain the advantages of experimentation over observation. They can be more involved than single-stage randomized trials; however, they produce the data trajectories necessary for estimating optimal policies that take delayed effects into account. Several suggestions have been made to attempt to reduce complexity and resources needed. One can combine data over same treatment sequences within different treatment regimes. One may also wish to split up a large trial into screening, refining, and confirmatory trials (Collins et al. 2005). One can also use fractional factorial designs rather than a full factorial design (Nair et al. 2008), or target primary analyses to simple regime comparisons (Murphy 2005).
A critical part of finding the best dynamic treatment regime is the construction of a meaningful and comprehensive reward variable, $R_t$. To construct a useful reward, the goals of the treatment need to be well defined and quantifiable. The goals of the treatment can include multiple aspects of a patient's health and welfare, such as degree of symptoms, severity of side effects, time until treatment response, quality of life and cost. However, quantifying the various aspects of a successful treatment with a single function can be difficult, and work on providing useful decision making support that analyzes multiple outcomes is ongoing (Lizotte 2010). Ideally, the outcome variable should reflect how successful the treatment regime was in achieving the overall goals for each patient.
Analysis is often improved by the collection of any variables that might be related to the illness or the treatment. This is especially important when data are collected by observation, to avoid bias in the analysis due to unmeasured confounders. Consequently, more observation variables are collected than are actually needed to estimate optimal dynamic treatment regimes. Thus variable selection is often required as a preprocessing step on the data before the algorithms used to find the best dynamic treatment regime are employed.
Several algorithms exist for estimating optimal dynamic treatment regimes from data. Many of these algorithms were developed in the field of computer science to help robots and computers make optimal decisions in an interactive environment. These types of algorithms are often referred to as reinforcement learning methods (Sutton & Barto 1998). The most popular of these methods used to estimate dynamic treatment regimes is called Q-learning (Watkins 1989). In Q-learning, models are fit sequentially to estimate the value of the treatment regime used to collect the data, and then the models are optimized with respect to the treatments to find the best dynamic treatment regime. Many variations of this algorithm exist, including modeling only portions of the value of the treatment regime (Murphy 2003, Robins 2004). Using model-based Bayesian methods, the optimal treatment regime can also be calculated directly from posterior predictive inferences on the effect of dynamic policies (Zajonc 2010).
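The backward, stage-by-stage structure of Q-learning can be sketched on simulated data from a randomized two-stage trial. The sketch below makes several simplifying assumptions: observations and treatments are binary, a single reward is observed at the end, and the "model" fit at each stage is simply the sample mean within each history cell (a saturated model) rather than a regression. The data-generating process and all function names are hypothetical.

```python
import random
from collections import defaultdict


def simulate_trial(n, seed=0):
    """Simulate (o1, a1, o2, a2, y) with randomized binary treatments."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        o1 = rng.randint(0, 1)                  # baseline observation
        a1 = rng.randint(0, 1)                  # randomized first treatment
        p_resp = 0.7 if a1 == o1 else 0.3       # assumed response model
        o2 = int(rng.random() < p_resp)         # response indicator
        a2 = rng.randint(0, 1)                  # randomized second treatment
        # Assumed reward: a2 = 1 - o2 is the better second treatment.
        y = o2 + (1.0 if a2 == 1 - o2 else 0.0) + rng.gauss(0.0, 0.5)
        data.append((o1, a1, o2, a2, y))
    return data


def cell_means(pairs):
    """Average the values that share the same key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def q_learning(data):
    """Estimate both decision rules by working backward from stage 2."""
    # Stage 2: estimate Q_2(history, a2) and maximize over a2.
    q2 = cell_means(((o1, a1, o2, a2), y) for o1, a1, o2, a2, y in data)

    def best2(o1, a1, o2):
        return max((0, 1), key=lambda a2: q2[(o1, a1, o2, a2)])

    # Stage 1: the outcome is the estimated value of acting optimally at stage 2.
    q1 = cell_means(
        ((o1, a1), q2[(o1, a1, o2, best2(o1, a1, o2))])
        for o1, a1, o2, _, _ in data
    )

    def best1(o1):
        return max((0, 1), key=lambda a1: q1[(o1, a1)])

    return best1, best2


policy1, policy2 = q_learning(simulate_trial(4000))
```

In this simulation the estimated rules recover the ones built into the data-generating process: the first treatment matches the baseline observation, and the second treatment is the opposite of the response indicator. With continuous covariates, the cell means would typically be replaced by regression models.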
An alternative approach to developing dynamic treatment regimes is based on random-effects linear models and is grounded in decision theory rather than machine learning (Diaz et al. 2007, 2012 and 2012). Empirical studies and recent developments in pharmacokinetic theory suggest that random-effects linear models can describe not only patient populations but also individual patients simultaneously, and therefore that these models are suitable for designing dynamic treatment regimes. Because of this characteristic, random-effects linear models are promising tools for investigating drug dosage individualization in chronic diseases and for designing effective treatments based on each individual patient's characteristics and needs. The following is a theoretical framework for drug dosage individualization.
A useful starting point is a random-effects linear model, referred to here as model (1), for the steady-state drug plasma concentration $Y_D$ in response to drug dosage $D$. In this model, $\alpha$ is a characteristic constant that varies from patient to patient, $X$ is a vector of covariates (clinical, demographic, environmental or genetic), $\beta$ is a vector of regression coefficients that are treated as population constants, and $\epsilon$ is an intra-individual random error. Because $\alpha$ is a random intercept, model (1) is generally called a random intercept linear model. It can be used to design a clinical algorithm for finding the optimal drug dosage $D$ for a particular patient: the dosage is chosen to maximize the probability that the drug plasma concentration response takes a value in the therapeutic window, that is, a value between two pre-specified values $l_1$ and $l_2$. There is empirical evidence supporting model (1) and some of its generalizations, at least for some drugs. Model (1) can also be generalized to include covariates with random effects.
In the more general model, referred to here as model (2), $\epsilon$ is defined as in model (1), $\psi$ and $\eta$ are characteristic constants that vary from patient to patient, and $Z$ is a vector of covariates. To produce better personalized dosages, Diaz et al. proposed a clinical algorithm for drug dosage individualization that is based on model (2) and on the concept of Bayesian feedback. The algorithm assumes that model (2) adequately describes a population of patients. The population parameters (among them $\beta$ and $d$) must be estimated from a sample of patients before the algorithm is applied, so that the estimated model serves as empirical prior information. Next, the dosage regime must first be adapted to the patient's characteristics and comedication; this initial adaptation constitutes a prior individualization. Diaz et al.'s clinical algorithm is not a computer algorithm but a series of steps for finding an optimal dosage. In the first step of the algorithm, the clinician uses the parameter estimates and the patient's covariates to compute the initial dosage, with the quantities involved as defined by Diaz et al.
This dosage is administered to the patient for an appropriate time period and, once the steady state is reached, the response $Y_D$ is measured. Step $i$, for $i \ge 2$, is as follows: using the dosage-response pairs obtained in the previous $i-1$ steps, compute the $i$th dosage from an empirical Bayes predictor of $\alpha$; the formulas for the dosage and for the predictor are given by Diaz et al.
If model (2) holds, Diaz et al.'s algorithm is optimal in the sense that the obtained dosages minimize a Bayes risk. Diaz et al. also introduced the concept of an omega-optimum dosage, defined as a dosage $D$ that satisfies a condition, given by Diaz et al., involving a number $\omega$ between 0 and 1. This concept makes it possible to determine how many algorithm steps are necessary to obtain the optimal dosage for the patient, and to develop a theory of drug dosage individualization.
Diaz et al. showed through simulations and theoretical arguments that their proposed approach to drug dosage individualization in chronic diseases may produce better pharmacokinetic or pharmacodynamic responses than traditional approaches used in therapeutic drug monitoring.
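The therapeutic-window criterion described above, choosing the dosage that maximizes the probability that the response lies between $l_1$ and $l_2$, can be illustrated numerically. The sketch below is not Diaz et al.'s algorithm. It assumes, purely for illustration, that $\log(Y_D)$ is normally distributed with mean equal to a patient-specific intercept plus $\log(D)$ and with standard deviation sigma; this functional form, the function names, and the numbers used are all assumptions that do not come from the text.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_in_window(dose, intercept, sigma, l1, l2):
    """P(l1 < Y_D < l2) under the assumed log-normal response model."""
    mean = intercept + math.log(dose)  # assumed mean of log(Y_D)
    upper = normal_cdf((math.log(l2) - mean) / sigma)
    lower = normal_cdf((math.log(l1) - mean) / sigma)
    return upper - lower


def best_dose(intercept, l1, l2):
    """Dose that centers log(Y_D) on the midpoint of the log-scale window.

    Under the assumed model this maximizes prob_in_window, because the
    normal density is symmetric and unimodal around its mean.
    """
    return math.sqrt(l1 * l2) / math.exp(intercept)


# Hypothetical patient: intercept -2.0, window 50 to 150, sigma 0.3.
dose = best_dose(intercept=-2.0, l1=50.0, l2=150.0)
p_best = prob_in_window(dose, -2.0, 0.3, 50.0, 150.0)
```

Under these assumptions, any other dose gives a lower probability of landing in the window, which can be checked by evaluating `prob_in_window` at nearby doses. In the approach described in the text, the patient-specific intercept is not known in advance and is updated from observed dosage-response pairs.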