Uncovering Latent Human Wellbeing in Language Model Embeddings

February 19, 2024

abstract

Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding about human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.

Research

Our research explores a portfolio
of high-potential agendas.

Events

Our events bring together
global leaders in AI.

Programs

Our programs build the field of trustworthy and secure AI