Hidden Pitfalls of AI Scientist Agents

Summary

Atoosa Kasirzadeh tested two AI scientist systems and found they peek at test data, cherry-pick easy benchmarks, and violate basic scientific methodology; such AI-generated papers have already been accepted at major ML conferences.

Session Transcript

Okay, hello everyone. I'm going to share some work we've been doing with collaborators at CMU on AI scientist agents and the ethical and safety challenges they raise. So first of all, we are all talking about AI agents and what they are.
I just want to give a reference to this paper. I think many people use the notion of AI agents to mean very different things. We had a look at the computer science literature: the term AI agents was coined in 1976, and since then a lot of computer scientists have been trying to build AI agents. So how should we think about that? Not in a binary way. We should think about the notion of AI agents in terms of what we call agentic profiles, so that different AI agent systems, and their governance or alignment challenges, can be analyzed along four dimensions of agency.
We call them autonomy, generality, goal complexity, and causal efficacy, and you can read more about that in the paper. But I want to talk about AI agents in science. So what are these systems? Well, since 2024 we've had a new wave of development of AI scientist systems. These are typically multi-agent AI systems powered by various types of generative AI.
And despite earlier attempts at building various kinds of AI scientists or AI co-scientists, this recent surge of AI agents in science aims to automate the whole process of scientific practice or scientific discovery. By that, developers mean the generation of a hypothesis, the development of an algorithm, the evaluation of that algorithm, and the production of the final paper. So basically you give a set of broad directions, which you can think of as a prompt; the AI scientist system runs the scientific process, and a paper comes out.
Now why should we care about these kinds of systems? For many epistemological, social, and safety reasons. But one recent development has been that some fully AI-generated papers have been accepted at conferences such as ACL and ICLR, and these have primarily been computational machine learning papers. So a lot of questions arise about what we should do with these kinds of systems.
So we ask a very modest question, and this is the start of a long-term project. In this paper, with my collaborators Ziming Luo, the lead author, and Nihar Shah, we wanted to look at two of the open-source AI scientist systems available out there, Agent Laboratory and AI Scientist v2. And we wanted to know whether these systems really do good science, in the sense that they apply a rigorous scientific methodology, or whether they suffer from various methodological pitfalls.
We identified four pitfalls. Again, the generated papers are all machine learning papers, and we wanted to evaluate whether the systems suffer from these pitfalls. I'll be happy to talk with you about the details, but what were the four pitfalls we asked about? The first was inappropriate benchmark selection. We asked whether AI scientists select benchmark datasets that yield high performance more easily while ignoring harder or more representative benchmarks. I guess you would all agree that this could be a huge challenge. We designed experiments to isolate this particular concern, and for both Agent Laboratory and AI Scientist v2 we observed evidence of inappropriate benchmark selection; a sketch of what this failure mode looks like in code follows below.
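To make the pitfall concrete, here is a minimal hypothetical sketch of benchmark cherry-picking in Python with scikit-learn. It is not the actual code of Agent Laboratory or AI Scientist v2, and the benchmark suite is an assumption chosen purely for illustration: the "agent" evaluates every candidate benchmark and then reports only the most flattering one.

```python
# Hypothetical sketch of the benchmark cherry-picking pitfall: evaluate every
# candidate benchmark, then report only the one with the best score.
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

BENCHMARKS = {
    "iris": load_iris,
    "wine": load_wine,
    "breast_cancer": load_breast_cancer,
}

scores = {}
for name, loader in BENCHMARKS.items():
    X, y = loader(return_X_y=True)
    scores[name] = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()

# The pitfall: silently dropping the harder benchmarks instead of
# pre-registering the full evaluation suite and reporting all results.
best = max(scores, key=scores.get)
print(f"Reported benchmark: {best} (accuracy {scores[best]:.3f})")
print(f"Omitted benchmarks: {sorted(set(scores) - {best})}")
```

The honest alternative is to fix the benchmark suite before running any experiments and report every result, including the unflattering ones.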
The second question was data leakage: when conducting experiments, do AI scientists peek at test data during training? And we observed that yes, they do. The third issue is metric misuse, and the fourth is post hoc selection bias. I'll be happy to chat with you about these, but I want to tell you that with these pitfalls identified, we can think in a more empirically grounded way about mitigation strategies and various open problems; sketches of the leakage and post hoc selection patterns follow below.
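Again as a hypothetical sketch rather than the systems' actual code, the classic form of data leakage is fitting a preprocessing step on the full dataset before splitting, so test-set statistics leak into training:

```python
# Hypothetical sketch of the data-leakage pitfall: the scaler is fit on the
# full dataset, so statistics from the test rows leak into training.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky: preprocessing sees the test rows before the split.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
print("leaky:", LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te))

# Clean: split first, then fit the scaler on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=5000).fit(scaler.transform(X_tr), y_tr)
print("clean:", clf.score(scaler.transform(X_te), y_te))
```

Post hoc selection bias follows a similar pattern; one common form, sketched below under the assumption of a simple classifier and a single dataset, is rerunning the same experiment under many random seeds and reporting only the best score:

```python
# Hypothetical sketch of post hoc selection bias: reporting the best of many
# seeds instead of the mean and spread across seeds.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = []
for seed in range(50):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print(f"best-of-50 (biased): {max(scores):.3f}")
print(f"mean +/- std (honest): {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```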
I just want to end with one sentence: developers of AI scientists should not think that rigor is optional. The rigor and validity of AI scientist research needs to be studied much more thoroughly, and this area of AI research is still really underexplored. So hopefully more of you will be motivated to work on the rigor and validity of AI scientists. Thank you for your attention.