Uncovering Latent Human Wellbeing in Language Model Embeddings

Abstract

Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing whether scaling improves pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding of human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.
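To make the zero-shot evaluation concrete, here is a minimal sketch (not the paper's code) of the approach the abstract describes: embed both scenarios in each Utilitarianism pair, project the embeddings onto their top principal component, and predict that the scenario with the larger projection is the more pleasant one. The example pairs, data handling, and sign calibration below are illustrative assumptions.

```python
# Minimal sketch: score ETHICS Utilitarianism pairs using the top principal
# component of text-embedding-ada-002 embeddings. Example data and details
# are illustrative, not the paper's exact pipeline.
import numpy as np
from openai import OpenAI
from sklearn.decomposition import PCA

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of scenarios with text-embedding-ada-002."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

# Each ETHICS Utilitarianism example is a pair of scenarios, where the first
# was labeled the more pleasant of the two.
pairs = [
    ("I ate a delicious meal with friends.", "I ate a stale sandwich alone."),
    # ... more (more-pleasant, less-pleasant) pairs
]

emb_high = embed([p[0] for p in pairs])
emb_low = embed([p[1] for p in pairs])

# Fit PCA on all scenario embeddings, then project each scenario onto the
# first principal component to get a scalar "pleasantness" score.
pca = PCA(n_components=1).fit(np.vstack([emb_high, emb_low]))
score_high = pca.transform(emb_high)[:, 0]
score_low = pca.transform(emb_low)[:, 0]

# The sign of a principal component is arbitrary, so report accuracy under
# the better of the two orientations (an illustrative calibration choice).
acc = float(np.mean(score_high > score_low))
print(f"accuracy: {max(acc, 1 - acc):.3f}")
```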

ChengCheng Tan
Senior Communications Specialist

ChengCheng is a Senior Communications Specialist at FAR.AI.

Adam Gleave
CEO and President of the Board

Adam Gleave is the CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designer. He is particularly interested in improving methods for value learning and the robustness of deep RL. For more information, visit his website.

Scott Emmons
Research Scientist

Scott Emmons previously cofounded FAR.AI and served as a Research Advisor. He is now a research scientist at Google DeepMind focused on AI safety and alignment.