Blog
Uncovering Latent Human Wellbeing in LLM Embeddings
A one-dimensional PCA projection of embeddings from OpenAI's text-embedding-ada-002 model achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable with the 74.6% accuracy of BERT-large finetuned on the entire ETHICS Util training dataset. This demonstrates that language models develop implicit representations of human utility purely from self-supervised learning. (A short illustrative sketch follows this entry.)
Pedro Freire, ChengCheng Tan, Dan Hendrycks, Scott Emmons
Last updated on Sep 13, 2023
12 min read
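The method described in the entry above lends itself to a brief illustration. The snippet below is a minimal sketch, not the authors' released code: it assumes scenario embeddings for each ETHICS Util comparison pair are already computed (for example via the text-embedding-ada-002 API), projects their differences onto the first principal component, and reads a pairwise prediction off the sign of that projection. The function name pca_utility_accuracy, the array shapes, and the sign-fixing convention are hypothetical choices for illustration.

```python
# Hedged sketch: pairwise utility classification from a 1-D PCA projection
# of precomputed scenario embeddings. Not the authors' code.
import numpy as np
from sklearn.decomposition import PCA

def pca_utility_accuracy(emb_a, emb_b, labels):
    """emb_a, emb_b: (n, d) arrays of embeddings for each scenario pair;
    labels: 1 if scenario A is judged more pleasant, else 0."""
    diffs = emb_a - emb_b                          # pairwise embedding differences
    proj = PCA(n_components=1).fit_transform(diffs).ravel()
    preds = (proj > 0).astype(int)                 # sign of the 1-D projection
    acc = (preds == labels).mean()
    # PCA's component direction is arbitrary; in practice the sign would be
    # fixed on training data rather than by taking the better of the two.
    return max(acc, 1 - acc)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(100, 1536))           # ada-002 embeddings are 1536-dim
    emb_b = rng.normal(size=(100, 1536))
    labels = rng.integers(0, 2, size=100)
    print(pca_utility_accuracy(emb_a, emb_b, labels))  # ~0.5 on random data
```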
Blog
Even Superhuman Go AIs Have Surprising Failure Modes
Our adversarial testing algorithm uncovers a simple, human-interpretable strategy that consistently beats superhuman Go AIs. We explore the implications this has for the robustness and safety of AI systems.
Adam Gleave, Euan McLean, Kellin Pelrine, Tony Wang, Tom Tseng
Last updated on Jul 18, 2023
17 min read
Blog
AI Safety in a World of Vulnerable Machine Learning Systems
All contemporary machine learning systems are vulnerable to adversarial attack. This poses serious problems for existing alignment proposals. We explore these issues and propose several research directions FAR is pursuing to overcome this challenge.
Adam Gleave
Last updated on Jul 18, 2023
43 min read