A theoretical analysis of temporal difference learning in the iterated Prisoner's dilemma game

Naoki Masuda*, Hisashi Ohtsuki

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

14 Citations (Scopus)

Abstract

Direct reciprocity is a chief mechanism of mutual cooperation in social dilemmas. Agents cooperate if future interactions with the same opponents are highly likely. Direct reciprocity has been explored mostly by evolutionary game theory based on natural selection. Our daily experience tells us, however, that real social agents, including humans, learn to cooperate based on experience. In this paper, we analyze a reinforcement learning model called temporal difference learning and study its performance in the iterated Prisoner's Dilemma game. Temporal difference learning is unique among a variety of learning models in that it inherently aims at increasing future payoffs, not immediate ones. It also has a neural basis. We analytically and numerically show that learners with only two internal states properly learn to cooperate with retaliatory players and to defect against unconditional cooperators and defectors. Four-state learners are more capable of achieving a high payoff against various opponents. Moreover, we numerically show that four-state learners can learn to establish mutual cooperation for sufficiently small learning rates.
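The setting described in the abstract can be illustrated with a minimal sketch: a two-state temporal difference learner (here, tabular Q-learning with discounting, a standard TD method; the paper's exact update rule and parameters are not reproduced here) whose state is the opponent's previous move, playing the iterated Prisoner's Dilemma with the usual payoffs T=5, R=3, P=1, S=0. Because the discounted value criterion weighs future payoffs, such a learner cooperates against a retaliatory Tit-for-Tat opponent and defects against an unconditional defector. All function and parameter names below are illustrative assumptions, not the authors' code.

```python
import random

# Illustrative payoffs for the row player: (my move, opponent's move) -> payoff.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
ACTIONS = ('C', 'D')

def train(opponent, steps=50000, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning sketch; state = opponent's previous move."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in ACTIONS for a in ACTIONS}
    state = 'C'  # assume the opponent opens with cooperation
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(state, x)])
        opp = opponent(state, a)          # opponent reacts to my move
        r = PAYOFF[(a, opp)]
        nxt = opp                          # next state: opponent's move
        # TD update toward the discounted one-step target
        Q[(state, a)] += alpha * (r + gamma * max(Q[(nxt, x)] for x in ACTIONS)
                                  - Q[(state, a)])
        state = nxt
    return Q

def greedy(Q, state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# Tit-for-Tat repeats my current move; ALLD always defects.
tft = lambda state, my_move: my_move
alld = lambda state, my_move: 'D'
```

Under these assumed payoffs and a discount factor of 0.9, sustained mutual cooperation against Tit-for-Tat is worth about 3/(1-0.9) = 30, whereas a one-shot exploitation (5, then punishment) is worth less, so the greedy policy after training is C in the cooperative state; against ALLD the learner settles on D.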

Original language: English
Pages (from-to): 1818-1850
Number of pages: 33
Journal: Bulletin of Mathematical Biology
Volume: 71
Issue number: 8
DOIs
Publication status: Published - Oct 2009
Externally published: Yes

Keywords

  • Cooperation
  • Direct reciprocity
  • Prisoner's dilemma
  • Reinforcement learning

ASJC Scopus subject areas

  • Neuroscience (all)
  • Immunology
  • Mathematics (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Environmental Science (all)
  • Pharmacology
  • Agricultural and Biological Sciences (all)
  • Computational Theory and Mathematics
