Auditing Games for Sandbagging


Abstract

Future AI systems could conceal their capabilities (‘sandbagging’) during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise, but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However, the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false positives. In the short term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer term, further research is needed to ensure the efficacy of training-based elicitation and to develop robust methods for sandbagging detection. We open-source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs. A demo illustrating the game can be played at https://sandbagging-demo.far.ai/.