1. Noel JP, Zhang R, Pitkow X, Angelaki DE. Dorsolateral prefrontal cortex drives strategic aborting by optimizing long-run policy extraction. bioRxiv 2024:2024.11.28.625897. [PMID: 39651243; PMCID: PMC11623693; DOI: 10.1101/2024.11.28.625897]
Abstract
Real-world choices often involve balancing decisions that are optimized for the short versus the long term. Here, we reason that apparently suboptimal single-trial decisions in macaques may in fact reflect long-term, strategic planning. We demonstrate that macaques freely navigating in VR for sequentially presented targets will strategically abort offers, forgoing more immediate rewards on individual trials to maximize session-long returns. This behavior is highly specific to the individual, demonstrating that macaques reason about their own long-run performance. Reinforcement-learning (RL) models suggest this behavior is algorithmically supported by modular actor-critic networks with a policy module that not only optimizes long-term value functions but is also informed of specific state-action values, allowing for rapid policy optimization. The behavior of artificial networks suggests that changes in policy for a matched offer ought to be evident as soon as offers are made, even if the aborting behavior occurs much later. We confirm this prediction by demonstrating that single units and population dynamics in macaque dorsolateral prefrontal cortex (dlPFC), but not parietal area 7a or the dorsomedial superior temporal area (MSTd), reflect the upcoming reward-maximizing aborting behavior upon offer presentation. These results cast dlPFC as a specialized policy module, and stand in contrast to recent work demonstrating the distributed and recurrent nature of belief networks.
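The modular actor-critic idea can be made concrete in a few lines. Below is a minimal Python sketch, not the authors' model: the state space, the reward rule, and the step that re-aligns the policy to the state-action values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2               # offer contexts x {pursue, abort}; sizes are illustrative
V = np.zeros(n_states)                   # critic: long-run state values
Q = np.zeros((n_states, n_actions))      # state-action values exposed to the policy module
theta = np.zeros((n_states, n_actions))  # actor: policy preferences
alpha, gamma = 0.1, 0.95

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(5000):
    s = rng.integers(n_states)
    a = rng.choice(n_actions, p=softmax(theta[s]))
    r = float(a == (s < 2))              # toy rule: aborting (a=1) pays off only for poor offers
    s_next = rng.integers(n_states)
    delta = r + gamma * V[s_next] - V[s]                  # TD error from the critic
    V[s] += alpha * delta                                 # critic update
    Q[s, a] += alpha * (r + gamma * V[s_next] - Q[s, a])  # track state-action values
    theta[s] += 0.5 * (Q[s] - theta[s])                   # policy reads Q directly: rapid re-optimization
```

Because the policy is re-aligned to Q on every step rather than nudged only by a scalar TD error, a change in offer statistics propagates into the policy within a few trials, which is the property the paper's network comparison turns on.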
2. Hocker D, Constantinople CM, Savin C. Compositional pretraining improves computational efficiency and matches animal behavior on complex tasks. bioRxiv 2024:2024.01.12.575461. [PMID: 38318205; PMCID: PMC10843159; DOI: 10.1101/2024.01.12.575461]
Abstract
Recurrent neural networks (RNNs) are ubiquitously used in neuroscience to capture both neural dynamics and the behaviors of living systems. However, when it comes to complex cognitive tasks, training RNNs with traditional methods can prove difficult and fall short of capturing crucial aspects of animal behavior. Here we propose a principled approach for identifying and incorporating compositional tasks as part of RNN training. Taking as our target a temporal wagering task previously studied in rats, we design a pretraining curriculum of simpler cognitive tasks that reflect relevant sub-computations. We show that this pretraining substantially improves learning efficacy and is critical for RNNs to adopt strategies similar to those of rats, including long-timescale inference of latent states, which conventional pretraining approaches fail to capture. Mechanistically, our pretraining supports the development of the slow dynamical-systems features needed for implementing both inference and value-based decision making. Overall, our approach is an important step toward endowing RNNs with relevant inductive biases, which matters when modeling complex behaviors that rely on multiple cognitive computations.
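The training recipe is easy to picture as code. The following is a minimal PyTorch sketch under stated assumptions: the sub-task and full-task generators are hypothetical stand-ins, not the paper's wagering task or its actual curriculum.

```python
import torch
import torch.nn as nn

class TaskRNN(nn.Module):
    def __init__(self, n_in=3, n_hid=64):
        super().__init__()
        self.rnn = nn.RNN(n_in, n_hid, batch_first=True)
        self.readout = nn.Linear(n_hid, 1)

    def forward(self, x):
        h, _ = self.rnn(x)       # h: (batch, time, n_hid)
        return self.readout(h)   # per-timestep prediction

def train_on(model, make_batch, steps, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        x, y = make_batch()
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

def timing_batch():
    # hypothetical sub-task: report elapsed "time" (an integral of one input stream)
    x = torch.rand(32, 50, 3)
    return x, torch.cumsum(x[..., :1], dim=1)

def full_task_batch():
    # toy stand-in for the full task: combine two input streams
    x = torch.rand(32, 50, 3)
    return x, x[..., :1] * x[..., 1:2]

model = TaskRNN()
train_on(model, timing_batch, steps=200)     # curriculum: pretrain on a sub-computation
train_on(model, full_task_batch, steps=300)  # then train on the composite task
```

The design choice is that the same weights carry over from sub-task to full task, so the pretrained dynamics act as an inductive bias rather than a separate module.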
3. Schiereck SS, Pérez-Rivera DT, Mah A, DeMaegd ML, Ward RM, Hocker D, Savin C, Constantinople CM. Neural dynamics in the orbitofrontal cortex reveal cognitive strategies. bioRxiv 2024:2024.10.29.620879. [PMID: 39554155; PMCID: PMC11565993; DOI: 10.1101/2024.10.29.620879]
Abstract
Behavior is sloppy: a multitude of cognitive strategies can produce similar behavioral readouts. An underutilized approach is to combine multifaceted behavioral analyses with neural recordings to resolve cognitive strategies. Here we show that rats performing a decision-making task exhibit distinct strategies over training, and that these cognitive strategies are decipherable from orbitofrontal cortex (OFC) neural dynamics. We trained rats to perform a temporal wagering task with hidden reward states. While naive rats passively adapted to reward statistics, expert rats inferred reward states. Electrophysiological recordings and novel methods for characterizing population dynamics identified latent neural factors that reflected inferred states in expert but not naive rats. In experts, these factors showed abrupt changes following single trials that were informative of state transitions. These dynamics were driven by neurons whose firing rates reflected single-trial inferences, and OFC inactivations showed that they were causal for behavior. These results reveal the neural signatures of inference.
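The inference the expert rats are credited with can be written as a two-state Bayesian filter. This is a minimal sketch with illustrative parameters (the state persistence and reward likelihoods are assumptions, not the task's actual statistics):

```python
import numpy as np

rng = np.random.default_rng(1)
p_stay = 0.98                  # assumed hidden-state persistence
p_rew = np.array([0.2, 0.8])   # assumed P(reward | state = low, high)
T = np.array([[p_stay, 1 - p_stay],
              [1 - p_stay, p_stay]])

belief = np.array([0.5, 0.5])  # P(state | reward history)
state, beliefs = 0, []
for t in range(500):
    if rng.random() > p_stay:          # latent state occasionally switches
        state = 1 - state
    r = rng.random() < p_rew[state]    # observe a reward or not
    prior = T @ belief                 # predict: propagate belief through transitions
    like = p_rew if r else 1 - p_rew   # update: likelihood of the observed outcome
    belief = prior * like
    belief /= belief.sum()
    beliefs.append(belief[1])
```

A single surprising trial can move the posterior a long way, which is the signature, abrupt single-trial changes in the latent factors, that the paper reports in expert but not naive rats.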
Affiliation(s)
- Andrew Mah
- Center for Neural Science, New York University; New York, NY 10003
- David Hocker
- Center for Neural Science, New York University; New York, NY 10003
- Cristina Savin
- Center for Neural Science, New York University; New York, NY 10003
- Center for Data Science, New York University; New York, NY 10003
4
|
Mah A, Golden CEM, Constantinople CM. Dopamine transients encode reward prediction errors independent of learning rates. Cell Rep 2024; 43:114840. [PMID: 39395170 PMCID: PMC11571066 DOI: 10.1016/j.celrep.2024.114840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2024] [Revised: 08/19/2024] [Accepted: 09/20/2024] [Indexed: 10/14/2024] Open
Abstract
Biological accounts of reinforcement learning posit that dopamine encodes reward prediction errors (RPEs), which are multiplied by a learning rate to update state or action values. These values are thought to be represented by corticostriatal synaptic weights, which are updated by dopamine-dependent plasticity. This suggests that dopamine release reflects the product of the learning rate and RPE. Here, we characterize dopamine encoding of learning rates in the nucleus accumbens core (NAcc) in a volatile environment. Using a task with semi-observable states offering different rewards, we find that rats adjust how quickly they initiate trials across states using RPEs. Computational modeling and behavioral analyses show that learning rates are higher following state transitions and scale with trial-by-trial changes in beliefs about hidden states, approximating normative Bayesian strategies. Notably, dopamine release in the NAcc encodes RPEs independent of learning rates, suggesting that dopamine-independent mechanisms instantiate dynamic learning rates.
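The arithmetic at stake is compact: the value update is the product of learning rate and RPE, and the paper's claim is that dopamine tracks the RPE alone while the rate varies trial to trial. A minimal sketch (the belief update and the mapping from belief change to learning rate are illustrative assumptions, not the paper's fitted model):

```python
import numpy as np

rng = np.random.default_rng(2)
V, belief = 0.0, 0.5               # value estimate; P(high-reward state)
alpha_min, alpha_gain = 0.05, 1.0
for t in range(200):
    r = float(rng.random() < 0.8)              # toy reward stream
    delta = r - V                              # RPE: what NAcc dopamine encodes
    new_belief = 0.9 * belief + 0.1 * r        # illustrative belief update
    alpha = alpha_min + alpha_gain * abs(new_belief - belief)  # dynamic learning rate
    V += alpha * delta                         # update = learning rate x RPE
    belief = new_belief
```

On this account the product is assembled downstream: dopamine release reports delta, and dopamine-independent mechanisms set alpha, which is largest just after inferred state transitions.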
Affiliation(s)
- Andrew Mah
- Center for Neural Science, New York University, New York, NY, USA
- Carla E M Golden
- Center for Neural Science, New York University, New York, NY, USA
5. Golden CEM, Martin AC, Kaur D, Mah A, Levy DH, Yamaguchi T, Lasek AW, Lin D, Aoki C, Constantinople CM. Estrogenic control of reward prediction errors and reinforcement learning. bioRxiv 2024:2023.12.09.570945. [PMID: 38105956; PMCID: PMC10723450; DOI: 10.1101/2023.12.09.570945]
Abstract
Gonadal hormones act throughout the brain [1], and neuropsychiatric disorders vary in symptom severity over the reproductive cycle, pregnancy, and perimenopause [2-4]. Yet how hormones influence cognitive processes is unclear. Exogenous 17β-estradiol modulates dopamine signaling in the nucleus accumbens core (NAcc) [5,6], which instantiates reward prediction errors (RPEs) for reinforcement learning [7-16]. Here we show that endogenous 17β-estradiol enhances RPEs and sensitivity to previous rewards by reducing dopamine reuptake proteins in the NAcc. Rats performed a task with different reward states; they adjusted how quickly they initiated trials across states, balancing effort against expected rewards. NAcc dopamine reflected RPEs that predicted and causally influenced initiation times. Elevated endogenous 17β-estradiol increased sensitivity to reward states by enhancing dopaminergic RPEs in the NAcc. Proteomics revealed reduced dopamine transporter expression. Finally, knockdown of midbrain estrogen receptors suppressed reinforcement learning. 17β-estradiol therefore controls RPEs via dopamine reuptake, mechanistically revealing how hormones influence neural dynamics for motivation and learning.
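The proposed mechanism (less transporter, slower clearance, larger and longer dopamine transients) can be illustrated with a first-order kinetics sketch. All rate constants here are made-up illustration values, not measurements from the paper:

```python
import numpy as np

dt, T = 0.01, 5.0                  # seconds
t = np.arange(0, T, dt)

def transient(k_clear, burst_amp=1.0, burst_idx=50):
    """Dopamine concentration after one phasic burst, with first-order reuptake."""
    da = np.zeros_like(t)
    for i in range(1, len(t)):
        bolus = burst_amp if i == burst_idx else 0.0   # phasic release at t = 0.5 s
        da[i] = da[i - 1] * (1 - k_clear * dt) + bolus
    return da

baseline = transient(k_clear=4.0)   # normal DAT-mediated clearance (assumed rate)
low_dat = transient(k_clear=1.5)    # reduced DAT, as with high estradiol (assumed rate)
print(np.trapz(baseline, t), np.trapz(low_dat, t))  # integrated signal grows as clearance falls
```

Integrated transient size scales roughly as 1/k_clear, so a modest drop in transporter expression is enough to amplify the effective RPE signal.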
6. Mah A, Golden CE, Constantinople CM. Dopamine transients encode reward prediction errors independent of learning rates. bioRxiv 2024:2024.04.18.590090. [PMID: 38659861; PMCID: PMC11042285; DOI: 10.1101/2024.04.18.590090]
Abstract
Biological accounts of reinforcement learning posit that dopamine encodes reward prediction errors (RPEs), which are multiplied by a learning rate to update state or action values. These values are thought to be represented in synaptic weights in the striatum, and updated by dopamine-dependent plasticity, suggesting that dopamine release might reflect the product of the learning rate and RPE. Here, we leveraged the fact that animals learn faster in volatile environments to characterize dopamine encoding of learning rates in the nucleus accumbens core (NAcc). We trained rats on a task with semi-observable states offering different rewards, and rats adjusted how quickly they initiated trials across states using RPEs. Computational modeling and behavioral analyses showed that learning rates were higher following state transitions, and scaled with trial-by-trial changes in beliefs about hidden states, approximating normative Bayesian strategies. Notably, dopamine release in the NAcc encoded RPEs independent of learning rates, suggesting that dopamine-independent mechanisms instantiate dynamic learning rates.
Affiliation(s)
- Andrew Mah
- Center for Neural Science, New York University
7. Boyd-Meredith JT, Piet AT, Kopec CD, Brody CD. A cognitive process model captures near-optimal confidence-guided waiting in rats. bioRxiv 2024:2024.06.07.597954. [PMID: 38895394; PMCID: PMC11185770; DOI: 10.1101/2024.06.07.597954]
Abstract
Rational decision-makers invest more time pursuing rewards they are more confident they will eventually receive. A series of studies have therefore used willingness to wait for delayed rewards as a proxy for decision confidence. However, interpretation of waiting behavior is limited because it is unclear how environmental statistics influence optimal waiting, and how sources of internal variability influence subjects' behavior. We trained rats to perform a confidence-guided waiting task, and derived expressions for optimal waiting that make the relevant environmental statistics explicit, including the travel time incurred moving from one reward opportunity to another. We found that rats waited longer than fully optimal agents, but that their behavior was closely matched by optimal agents whose travel times were constrained to match the rats' own. We developed a process model describing the decision to stop waiting as an accumulation-to-bound process, which allowed us to compare the effects of multiple sources of internal variability on waiting. Surprisingly, although mean wait times grew with confidence, variability did not, a pattern inconsistent with scalar-invariant timing and best explained by variability in the stopping bound. Our results describe a tractable process model that can capture the influence of environmental statistics and internal sources of variability on subjects' decision process during confidence-guided waiting.
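The signature result (means that grow with confidence while variability stays flat) falls out of an accumulation-to-bound model whose trial-to-trial noise sits in the bound. A minimal sketch; the mapping from confidence to bound height and every parameter value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def wait_time(confidence, drift=1.0, bound_sd=0.3, noise_sd=0.1, dt=0.01):
    # Higher confidence sets a higher stopping bound (assumed mapping).
    bound = max(0.1, rng.normal(4.0 * confidence, bound_sd))
    x, t = 0.0, 0.0
    while x < bound and t < 60.0:          # accumulate toward the stopping bound
        x += drift * dt + noise_sd * np.sqrt(dt) * rng.normal()
        t += dt
    return t

for c in (0.3, 0.6, 0.9):
    w = np.array([wait_time(c) for _ in range(500)])
    print(f"confidence {c}: mean wait {w.mean():.2f} s, sd {w.std():.2f} s")
```

Because the bound's variability is fixed while its height scales with confidence, the wait-time SD stays near bound_sd/drift across conditions; scalar-invariant timing would instead make the SD grow in proportion to the mean.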
Affiliation(s)
- J Tyler Boyd-Meredith
- Princeton Neuroscience Institute, Princeton University, Princeton, United States
- Sainsbury Wellcome Centre, University College London, London, UK
- Alex T Piet
- Allen Institute, Seattle, Washington, United States
- Chuck D Kopec
- Princeton Neuroscience Institute, Princeton University, Princeton, United States
- Carlos D Brody
- Princeton Neuroscience Institute, Princeton University, Princeton, United States
- Howard Hughes Medical Institute, Princeton University, Princeton, United States
8. Jang HJ, Ward RM, Golden CEM, Constantinople CM. Acetylcholine demixes heterogeneous dopamine signals for learning and moving. bioRxiv 2024:2024.05.03.592444. [PMID: 38746300; PMCID: PMC11092744; DOI: 10.1101/2024.05.03.592444]
Abstract
Midbrain dopamine neurons promote reinforcement learning and movement vigor. A major outstanding question is how dopamine-recipient neurons in the striatum parse these heterogeneous signals. Here we characterized dopamine and acetylcholine release in the dorsomedial striatum (DMS) of rats performing a decision-making task. We found that dopamine acted as a reward prediction error (RPE), modulating behavior and DMS spiking on subsequent trials when coincident with pauses in cholinergic release. In contrast, at task events that elicited coincident bursts of acetylcholine and dopamine, dopamine preceded contralateral movements and predicted movement vigor without inducing plastic changes in DMS firing rates. Our findings provide a circuit-level mechanism by which cholinergic modulation allows the same dopamine signals to be used for either movement or learning depending on instantaneous behavioral context.
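The circuit-level claim reduces to a coincidence-gated routing rule. The sketch below is a paraphrase of that logic, not the authors' model; the thresholds and the linear vigor mapping are assumptions:

```python
def route_dopamine(da, ach_level, pause_thresh=0.2, burst_thresh=0.8,
                   alpha=0.1, value=0.0):
    """Route one dopamine transient to learning or vigor based on coincident ACh."""
    vigor = 0.0
    if ach_level < pause_thresh:      # ACh pause: plasticity is permitted
        value += alpha * da           # dopamine acts as an RPE and updates value
    elif ach_level > burst_thresh:    # ACh burst: plasticity is gated off
        vigor = da                    # dopamine instead scales movement vigor
    return value, vigor

print(route_dopamine(da=0.5, ach_level=0.1))  # pause -> drives learning
print(route_dopamine(da=0.5, ach_level=0.9))  # burst -> drives vigor
```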