Reinforcement Learning Models Capture Human Decision-Making Processes

Post by Shireen Parimoo

What's the science?

How do people flexibly plan their actions in service of novel goals? According to reinforcement learning (RL) models of human behaviour, actions are chosen to maximize reward in the long run. In standard RL algorithms, actions are guided by knowledge of the environment, where the outcomes achieved are either known (model-based) or learned when they occur without much prior knowledge (model-free). These algorithms can require a lot of resources and also run the risk of under- or over-generalizing to new tasks. 

Two new algorithms have been proposed to model human cross-task generalization. One approach extracts similarities across tasks to inform future actions using universal value function approximators (UVFAs). Consider this example: you are both tired and hungry, and must choose between a burger shop (known for food), a coffee shop (known for coffee), and a diner (known for both). You know the burger shop is good when you are hungry and the coffee shop is good when you are tired, so you select an action using UVFA: you look for a place similar to both of these places (the diner). In other words, you use previously learned values to extrapolate and predict a new value.

Another approach is to keep track of the actions associated with commonly encountered tasks using a generalized policy improvement (GPI) algorithm, which predicts the outcome of an action from learned experience. Here’s an example of selecting an action using GPI: you have been frequenting the diner lately, and you have a ‘policy’ of going there when you are hungry. Now you are tired and would like to get coffee. You can envision the outcome of going to the diner: coffee is available there, and people tend to drink it. You might therefore apply the same policy of going to the diner to your decision to get coffee. In other words, you generalize the policy to a new task.

This is the distinction between UVFA and GPI: UVFA uses previously learned values to approximate a solution, while GPI evaluates a previously learned policy (the relationship between an action and an end state or solution) and applies it to a new situation.

This week in Nature Human Behaviour, Tomov and colleagues tested how well standard and new RL algorithms generalize across tasks and compared their performance to human behaviour.

How did they do it?

In a set of four online experiments, over 1100 participants played a resource-trading game set in a castle. Before each trial began, participants saw the “daily market price” for the resources: the amount of money they could expect to either receive or pay for each resource (wood, stone, and iron). For example, they might see “Wood: $1, Stone: -$2, Iron: $0”, indicating that they would receive $1 for each wood and pay $2 for each stone they had at the end of the trial. Each trial consisted of a two-step decision-making process: participants were instructed to a) choose between three doors to enter one of three rooms, and then b) choose between another three doors in that room to enter a final room containing resources. Importantly, each final door always led to the same amount of resources in the corresponding final room (e.g., door 2 in room 3 always led to 100 wood, 40 stone, and 0 iron across all trials). The amount of money received or paid, however, would change from trial to trial because the “daily market price” changed each trial. In our example, participants would receive $1 x 100 wood ($100), -$2 x 40 stone (-$80), and $0 x 0 iron ($0), for a total of $20.
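The payout arithmetic in this example can be written as a sum of price times quantity over the three resources. The short sketch below reproduces only the numbers quoted above; the function and variable names are illustrative and do not come from the study.

```python
# Daily market prices from the example above (dollars per unit).
# The negative stone price encodes "pay $2 for each stone".
prices = {"wood": 1, "stone": -2, "iron": 0}

# Resources behind one final door (the example's door 2 in room 3).
resources = {"wood": 100, "stone": 40, "iron": 0}


def payout(prices, resources):
    """Total money earned: sum of price x quantity over all resources."""
    return sum(prices[name] * resources[name] for name in prices)


total = payout(prices, resources)
print(total)  # 20
```

Because the resources behind each door never change, only the prices need to be swapped to see how the same door pays out on a different trial.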

The researchers carefully selected a few sets of daily market prices for the resources for each experiment in order to manipulate the computational demands required and to vary the degree of difficulty in successfully mapping actions to outcomes across the four experiments. In the training phase of each of the four experiments, participants completed 100 trials, each of which was randomly assigned one of the pre-selected sets of daily market prices. For example, the profit from finding wood might double earnings on one trial but cost participants money on the next trial. Thus, participants had to learn which final door would ensure maximum profit. 

The goal of the experiments was to determine which RL algorithm (model-free, model-based, UVFA, or GPI) participants had most likely used. The experiments were designed so that the door participants chose on a test trial, completed after the 100 training trials, would reflect the RL algorithm that best modelled how they chose actions to maximize reward. For example, in one experiment the third door in the second room would in fact yield the highest profit on the final test trial, but it would only be selected by a model-based learner who had successfully learned the entire structure of the environment.
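To see how a single test trial can separate two of these algorithms, here is a minimal sketch under stated assumptions. The castle layout, the training prices, and the test prices are all made up for illustration and are not the values used in the study. The model-based learner is simplified to "evaluate every final door under the new prices", and the GPI learner to "re-evaluate only the doors that were best on a training task".

```python
# Hypothetical castle: (room, door) -> (wood, stone, iron) behind that final door.
# All numbers are invented for illustration only.
FINAL_ROOMS = {
    (1, 1): (100, 0, 20), (1, 2): (0, 100, 0), (1, 3): (10, 10, 0),
    (2, 1): (0, 20, 10),  (2, 2): (20, 0, 10), (2, 3): (70, 70, 70),
    (3, 1): (90, 0, 60),  (3, 2): (0, 90, 0),  (3, 3): (30, 30, 0),
}

# Hypothetical price sets (wood, stone, iron) seen in training and at test.
TRAIN_PRICES = [(1, -1, 0), (-1, 1, 0)]
TEST_PRICES = (1, 1, 1)


def payout(prices, resources):
    """Money earned for one final room under one set of prices."""
    return sum(p * r for p, r in zip(prices, resources))


def best_door(prices, candidates):
    """The candidate door with the highest payout under the given prices."""
    return max(candidates, key=lambda door: payout(prices, FINAL_ROOMS[door]))


# Model-based: knows the whole map, so it evaluates every final door.
model_based_choice = best_door(TEST_PRICES, FINAL_ROOMS)

# GPI (simplified): remembers the policy that was best for each training
# task, then re-evaluates only those policies under the new prices.
learned_policies = {best_door(p, FINAL_ROOMS) for p in TRAIN_PRICES}
gpi_choice = best_door(TEST_PRICES, learned_policies)

print(model_based_choice)  # (2, 3)
print(gpi_choice)          # (1, 1)
```

In this toy layout, the third door in the second room was never the best choice during training, so the GPI learner does not consider it at test even though it pays the most under the new prices.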

What did they find?

In the first experiment, participants were most likely to select actions leading to the final doors predicted by the model-based and GPI algorithms. In the more difficult experiments, however, participants were far more likely to choose the final door predicted by GPI. Interestingly, the final door chosen by the model-based and UVFA algorithms would have been the most rewarding, yet participants did not choose it more often than would be expected by chance. In comparing the different algorithms, the authors found that participants learned to select the final door predicted by GPI faster than the door predicted by the model-based algorithm, which is consistent with the fact that model-based algorithms tend to require more resources. Finally, because GPI makes predictions based on learned experience, the authors compared participants’ choice history during the training phase to their actions on the test trial. Learned experience did indeed drive choice at test time: how often participants chose a given door during training predicted the probability that they would select that door on the test trial. This indicates that participants kept track of the different situations they encountered during the training phase, along with the associated action-state mapping, which informed their behaviour during the test.

What's the impact?

People use their knowledge of frequently encountered experiences to make predictions about future outcomes and inform their decisions. One of the interesting outcomes of this study is that people do not necessarily make the most rewarding decisions, but rather they tend to map previously used policies onto new scenarios. This finding provides exciting new insight into how reinforcement learning captures human decision-making processes in complex and changing environments.

Tomov et al. Multi-task reinforcement learning in humans. Nature Human Behaviour (2021).

How Sleep Helps Us Remember and Forget

Post by Amanda McFarlan

What’s the deal with sleep?

Humans spend approximately one third of their lives sleeping, so it is no surprise that we’re curious about it! Sleep has a wide variety of benefits, like repairing and regenerating tissues in the body, improving cognitive and physical performance, and consolidating memories. On the other hand, a chronic lack of sleep can put us at risk of developing health problems like cardiovascular disease, high blood pressure, diabetes, and depression. So, what happens when we sleep? Every night, when our heads hit the pillow, we enter into the first stage of ‘non-Rapid Eye Movement’ (non-REM) sleep. Non-REM sleep consists of 4 stages, with Stage 1 being the lightest sleep stage and Stage 4 being the deepest. Your body moves through the 4 stages of non-REM sleep and finally through REM sleep in a cycle that takes approximately 90 minutes, and this cycle is repeated throughout the night. Non-REM and REM sleep are characterized by different brain activity patterns, with non-REM sleep creating slow waves in its deepest stages, called ‘slow-wave sleep’, and REM sleep generating activity patterns that resemble wakefulness. The role of non-REM and REM sleep in the transfer and long-term storage of memories, known as memory consolidation, has been studied for many years. Here, we will discuss how sleep helps us remember or forget, as well as what goes wrong when we don’t sleep.

How does sleep help us remember?

Evidence strongly suggests that sleep is integral to memory consolidation. For example, a behavioural study in which participants performed a visual task, a motor sequence task, and a motor adaptation task found that performance improved greatly in participants who had a full night’s sleep compared to those who did not sleep. The degree of improvement on each type of task depended on a different stage of the sleep cycle. These findings suggest that non-REM and REM sleep both play an important role in memory consolidation. In line with this, other studies have shown that intensive learning of a new task is followed by increased time spent in REM sleep, resulting in subsequent task improvement, as well as by the amplification of slow waves during non-REM sleep. Sleep results in a reactivation of cells in the hippocampus, which subsequently reactivate the representation of a memory in the cortex, also known as an engram. Over time, after many reactivations, these memories become distributed and consolidated within the cortex.

Interestingly, research has shown that while we’re sleeping there is increased activity in the same hippocampal place cells (neurons that are activated when moving through specific locations in the environment) that were active throughout the day. This reactivation of hippocampal place cells during REM sleep follows a theta frequency band pattern of firing, hypothesized to be critical for memory consolidation. This hippocampal activity is mediated by neurons that release the neurotransmitter acetylcholine in the hippocampus. Acetylcholine, which plays a major role in altering the strength of synaptic connections (a process crucial for memory), is known to be elevated during REM sleep. REM sleep has also been associated with the upregulation of several calcium-dependent genes that are thought to be involved in synaptic plasticity and memory consolidation.

Compared to REM sleep, the conditions in non-REM sleep are less ideal for promoting synaptic plasticity. For example, during non-REM sleep, acetylcholine levels are low and calcium-dependent genes are expressed at low levels or not at all. However, researchers have proposed that non-REM sleep might be important for the later stages of memory consolidation, rather than the initial conversion of short-term memories to long-term memories. In support of this, protein synthesis, which is required for long-term but not short-term potentiation (strengthening) of synapses, is increased during non-REM sleep. Therefore, the induction of protein synthesis during non-REM sleep may act to strengthen the synapses that were sufficiently potentiated during wakefulness.

Although the majority of research on sleep and memory focuses on the role of the hippocampus in memory consolidation, a recent study has provided evidence that the thalamus might also play a role in memory consolidation during sleep. In this study, memory encoding (when memories are initially stored) during a visual task was shown to increase the activity of sensory relay nuclei of the thalamus in mice. Following a night of sleep, the primary visual cortex also showed evidence of a potentiated response to the visual task. Together, these findings suggest that task-related information may be passed from the thalamus to the primary visual cortex, resulting in the formation of a corresponding memory during sleep.

How does sleep help us forget?

Much sleep research focuses on how we remember. However, sleep arguably plays just as important a role in forgetting. The hippocampus serves as a temporary storage area for newly formed memories until they can be consolidated and integrated into long-term memory storage in the cortex. As a result, the hippocampus must be able to unlearn memories that have already been consolidated, or that are not pertinent, in order to store new memories. Research has shown that in addition to helping with memory consolidation, sleep is also important for unlearning memories. Studies in rats have shown that following sleep, there are widespread reductions in dendritic spines (protrusions on the dendrite that form synapses with nearby neurons) in the cortex, as well as a reduction in receptors on glutamatergic neurons that are critical for memory and learning.

Norepinephrine and serotonin are two neurotransmitters in the brain that are associated with the enhancement of synaptic plasticity. During REM sleep, however, norepinephrine and serotonin signaling is suppressed, suggesting that REM sleep may allow for the depotentiation — or weakening — of synapses.  

What happens when we don’t sleep?

We all know how difficult it is to get through the day after a sleepless night. Suddenly, concentrating on what was previously a trivial task can become very challenging. Neuroimaging data have shown that sleep-deprived individuals recruit more brain areas while performing the same cognitive task compared to individuals who slept normally. Moreover, brain imaging studies have revealed that hippocampal function is greatly reduced following one night of sleep deprivation, which suggests that losing sleep may actually disrupt our ability to learn new things. Sleep deprivation studies in rats have demonstrated the importance of REM sleep for learning, as well as for the induction and maintenance of long-term potentiation of synapses during learning. Additionally, REM sleep deprivation was shown to impair learning-dependent neurogenesis (the formation of new neurons) in the hippocampal dentate gyrus, which can impact future learning. The role of REM sleep in learning and memory is particularly relevant for individuals who are treated for depression with antidepressants, since these medications can greatly reduce the amount of time spent in REM sleep and may reduce the efficacy of memory consolidation.

How can we get a good night’s sleep?

Given what we know about the role of sleep in learning and memory, it’s important to ensure that we get a good night’s sleep. However, with the challenges of daily life, this is not always an easy feat. First, it is important to establish a regular sleep schedule where you go to sleep and wake up around the same time each day, even when traveling or on the weekends. This habit can reinforce your body’s circadian rhythms, which helps your body to prepare for sleep and wakefulness more efficiently. Second, it is important to avoid using electronic devices, such as a television, phone, or tablet, before bed. The blue light emitted by these devices tricks our bodies into thinking it is daylight, and, as a result, our bodies produce lower levels of melatonin, the hormone that promotes sleep. Third, use what you know about the science of sleep cycles to your advantage by timing your sleep in 90-minute intervals. For example, by setting your alarm for 7.5 hours of sleep (5 sleep cycles x 90 minutes each), you may actually feel more refreshed than if you slept for 8.5 hours and were awakened in the middle of a deep stage of sleep. Finally, avoiding caffeine and naps late in the afternoon or evening, as well as avoiding large meals or exercise right before bed, may help to promote better sleep.
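The 90-minute timing tip is simple arithmetic, and the sketch below lists alarm times that fall at the end of a whole number of cycles. It assumes that every cycle lasts exactly 90 minutes and that you fall asleep at the stated bedtime. Both are simplifications, since the cycle length is only approximate. The bedtime and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Approximate length of one sleep cycle, as quoted in the text.
CYCLE = timedelta(minutes=90)


def wake_times(bedtime, min_cycles=4, max_cycles=6):
    """Return (cycles, alarm time) pairs that end on a whole sleep cycle."""
    return [(n, bedtime + n * CYCLE) for n in range(min_cycles, max_cycles + 1)]


# Hypothetical bedtime: asleep at 11:00 pm.
bedtime = datetime(2021, 1, 26, 23, 0)

for n, alarm in wake_times(bedtime):
    hours = n * 1.5
    print(f"{n} cycles ({hours:g} h): {alarm:%H:%M}")
```

With an 11:00 pm bedtime, five cycles (7.5 hours) ends at 6:30 am, matching the example in the text.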

Now, time to consolidate all of this learning with a good night’s sleep!

Feld, G.B., & Born, J. Sculpting memory during sleep: concurrent consolidation and forgetting. Current opinion in neurobiology, 44, 20–27 (2017). https://doi.org/10.1016/j.conb.2017.02.012

Klinzing, J.G., Niethard, N. & Born, J. Mechanisms of systems memory consolidation during sleep. Nat Neurosci 22, 1598–1610 (2019). https://doi.org/10.1038/s41593-019-0467-3

Poe, G. R., Walsh, C. M., & Bjorness, T. E. Cognitive neuroscience of sleep. Progress in brain research, 185, 1–19 (2010). https://doi.org/10.1016/B978-0-444-53702-7.00001-4

Stickgold, R. Sleep-dependent memory consolidation. Nature 437, 1272–1278 (2005). https://doi.org/10.1038/nature04286

Non-Invasive Electrical Brain Stimulation Reduces Obsessive-Compulsive Behaviours

Post by D. Chloe Chung

What's the science?

Obsessive-compulsive (OC) behaviours are characterized by excessive, unreasonable thoughts and repetitive behaviours. While the exact underlying mechanisms are unclear, OC behaviours may result from excessive habit learning. Habit learning involves the brain’s medial orbitofrontal cortex (OFC) which is connected to the brain’s reward network. This week in Nature Medicine, Grover and colleagues show that non-invasive OFC stimulation, using a high-frequency current to target the high-frequency neural activity associated with reward processing, can modulate OC behaviours.

How did they do it?

In the first experiment, the authors wanted to determine the role of high-frequency beta-gamma rhythms in reward learning. To do this, they selected a monetary reinforcement learning task that included two trial types: “reward trials”, in which participants earn money (versus not earning any) upon making an optimal choice (choosing the correct image), and “punishment trials”, in which participants lose money (versus not losing any) upon making an incorrect choice (choosing the wrong image). Before the actual task, beta-gamma frequency band activity was measured in 60 participants using electroencephalography (EEG) while they learned to associate visual stimuli with monetary gain and loss. Next, participants were randomly assigned to receive either control neuromodulation (“passive” sham or “active” alpha frequency of ~10 Hz) or personalized beta-gamma neuromodulation (~27 Hz on average), and completed the reinforcement learning task for 30 minutes each before, during, and after the neuromodulation (90 minutes total).

In the second experiment, the authors aimed to evaluate whether chronic beta-gamma neuromodulation of the OFC can impact OC behaviours. To test this, 64 participants first completed a self-assessment of their OC behaviours and then received either control, alpha-frequency, or personalized beta-gamma-frequency neuromodulation for 5 days (30 minutes per day). The participants self-assessed their OC behaviours right after the last neuromodulation session, as well as 1, 2, and 3 months post-neuromodulation. After these two experiments, the authors analyzed the relationship between intrinsic beta-gamma rhythms and the changes in a) reward learning and b) OC behaviours caused by beta-gamma neuromodulation.

What did they find?

In the first experiment, the authors observed that reward behaviour was altered by personalized beta-gamma neuromodulation targeting the OFC. Participants made fewer optimal choices during the reward trials of the monetary reinforcement learning task, while no change was observed in either control condition. Importantly, beta-gamma neuromodulation changed behaviour during the reward trials but not during the punishment trials, indicating that beta-gamma frequency specifically modulates reward-related behaviours. These neuromodulation-induced changes in reward behaviour were found to be reversible, as participants made optimal choices at a similar rate before and after the neuromodulation.

In the second experiment, the authors found that the 5-day beta-gamma neuromodulation successfully reduced obsessive-compulsive behaviours for three months (based on self-assessment). Interestingly, participants with more severe OC symptoms displayed a more drastic reduction in their compulsive behaviours after neuromodulation. Lastly, by comparing the two experiments, the authors found that the neuromodulation-induced changes in reward behaviour and in OC behaviours rely on convergent mechanisms.

What’s the impact?

This study demonstrates that high-frequency neuromodulation can effectively regulate reward learning. Further, this work supports the link between reward learning and OC behaviours by highlighting shared mechanisms between the two. Findings from this study strongly suggest potential clinical benefits of personalized neuromodulation for obsessive-compulsive disorder (OCD) patients. It will be interesting for future studies to use additional methodologies, such as neuroimaging, to discover how neural processes are altered by beta-gamma neuromodulation.

Grover et al. High-frequency neuromodulation improves obsessive-compulsive behavior. Nature Medicine (2021).