High Integrity Research Practices at Palisade

Summary

Palisade's research lead Dmitrii Volkov details their "bulletproof" methodology for high-profile AI risk demonstrations, featuring independent tracing, comprehensive data storage, and complete traceability.

Session Transcript

Dmitrii Volkov: I'm Dmitrii. I lead research at Palisade. We build demonstrations of AI risks for policymakers and the public. This talk is about the way we make our research solid. The background is that our work is sometimes high profile: appearances at Senate hearings, millions of Twitter views, endorsements by Bengio, press coverage in Time. And we sometimes make controversial statements, like that agents might scheme unprompted, or that the labs might under-elicit dangerous capabilities.
This exposes us to reputational risk. When we recently published a piece, there was a viral Twitter thread trying to take us down, using words like "parroting" and "extraordinary claims". We engaged with the author, showed them the data, and they agreed we were in the right and retracted the claim. What I mean to say is that our work needs to be bulletproof to stand up in the Senate. Oftentimes, AI work isn't. Many papers come without code or data. In the LLM world, it's common to publish transcripts of only a few selected runs.
When we started looking for inspiration on how to do this better, we looked at other scientific fields. In social sciences, it's common to record absolutely everything that happens with the data. You record the intake, you record every little manipulation, and sometimes investigators find inconsistencies in this chain, and people get fired. Every single day, researchers use this chain to reproduce the work or follow up on some particular part.
Let me ground this in a case study of a recent research piece we did, and how this way of thinking helped us with LLM research. This is our chess hacking result. It's your standard LLM experiment: first we have some agent runs, then we score them with an LLM, then we score them with a human, then we plot some charts, and we have a paper. Let's zoom into each step.
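To make the pipeline concrete, here is a rough sketch of what one run record might look like as it moves through those steps; the field names are illustrative assumptions, not Palisade's actual schema.

```python
# Illustrative sketch only: one experiment run as it flows through the pipeline
# (agent run -> LLM judge score -> human score -> chart -> paper).
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    run_id: str                            # stable identifier for one agent run
    trace_id: str                          # reference to the full LLM-call trace
    model: str                             # model driving the agent
    llm_judge_score: Optional[str] = None  # verdict assigned by the LLM judge
    human_score: Optional[str] = None      # verdict assigned by a human scorer
```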
In the first step, the agent runs, we record all LLM calls with an independent tracing provider. What this does is make sure you can't modify them after the fact; you can't claim the LLM said something it didn't. Next, when we iterate on the scoring of the experiments, we store all the experiment data. We check it into the repo. If it's not in the repo, it's not in the paper. This lets us see where any particular number came from.
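In practice this is a hosted tracing provider, but the essence is easy to sketch: every LLM call is appended to a tamper-evident log the moment it happens, so later edits are detectable. A minimal illustration under that assumption, with hypothetical names and paths rather than the actual setup:

```python
# Sketch of tamper-evident call logging; a hosted tracing provider plays this
# role in practice. Names and file paths here are illustrative assumptions.
import hashlib
import json
import time


def log_llm_call(prompt: str, response: str, log_path: str = "traces.jsonl") -> str:
    """Append one LLM call to a hash-chained JSONL log and return its entry hash."""
    # Link to the previous entry so any later modification breaks the chain.
    prev_hash = "genesis"
    try:
        with open(log_path, "rb") as f:
            lines = f.read().splitlines()
            if lines:
                prev_hash = json.loads(lines[-1])["entry_hash"]
    except FileNotFoundError:
        pass

    entry = {
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "prev_hash": prev_hash,
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()

    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["entry_hash"]
```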
As the experiment grows and sprawls and you have a bunch of different scoring designs, you basically do a monorepo or trunk-based thing where everything is checked in. The point is, merge often, merge early.
When we do the charts, we find it helpful to build them in continuous integration. This means we can always share a link with collaborators or other team members: they see the latest chart that's going into the paper, or a chart pinned to a previous commit. Again, you have a diff: what changed, what happened. Finally, the paper is built in CI too. This means there are no localhost-only patches which other people can't reproduce.
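As an illustration of the kind of chart-building step CI might run (the file name, column names, and verdict labels are assumptions, not the actual pipeline), the script reads only data checked into the repo and stamps its output with the commit it was built from:

```python
# Sketch: build a chart from checked-in data and stamp it with the git commit.
# "scores.csv" and its columns ("model", "verdict") are illustrative assumptions.
import csv
import subprocess
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # headless backend for CI
import matplotlib.pyplot as plt


def build_chart(data_path: str = "scores.csv", out_path: str = "hacking_rate.png") -> None:
    # Count judge verdicts per model from the checked-in scoring data.
    hacks = Counter()
    totals = Counter()
    with open(data_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["model"]] += 1
            if row["verdict"] == "hack":
                hacks[row["model"]] += 1

    models = sorted(totals)
    rates = [hacks[m] / totals[m] for m in models]

    # Record which commit produced this chart, so it can be diffed later.
    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"], capture_output=True, text=True
    ).stdout.strip()

    fig, ax = plt.subplots()
    ax.bar(models, rates)
    ax.set_ylabel("Hacking rate")
    ax.set_title(f"Built from commit {commit}")
    fig.savefig(out_path)
```

CI can then publish the chart as a build artifact, so every version is tied to a commit and can be compared against earlier ones.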
The point of this technical construction is that the data is threaded through the whole experiment pipeline. If you look at the chart, and the chart says 13%, you can always unravel it back to the judge that said so, or to the human scores that said so, and from there unravel it back to the traces recorded by an independent provider. This gives us some assurance that this is a real thing.
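One way to picture the "unravel" property: every aggregate number keeps references to the runs and traces behind it. A toy sketch under that assumption (the field and verdict names are made up for illustration):

```python
# Sketch: an aggregate figure that keeps provenance, so a number in a chart
# can be unravelled back to the scored runs and the underlying traces.
from dataclasses import dataclass, field


@dataclass
class Aggregate:
    value: float                                   # e.g. 0.13 for a 13% rate
    run_ids: list = field(default_factory=list)    # runs behind this number
    trace_ids: list = field(default_factory=list)  # independent traces backing those runs


def hacking_rate(scored_runs: list) -> Aggregate:
    """Compute a rate while keeping the provenance of every contributing run.

    Each item is assumed to look like:
    {"run_id": "...", "trace_id": "...", "verdict": "hack" or "no_hack"}
    """
    hacks = sum(1 for r in scored_runs if r["verdict"] == "hack")
    return Aggregate(
        value=hacks / len(scored_runs) if scored_runs else 0.0,
        run_ids=[r["run_id"] for r in scored_runs],
        trace_ids=[r["trace_id"] for r in scored_runs],
    )
```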
There are a few technical extensions where I want to take this further. For example, we are exploring using industrial observability tools, normally used for clusters, to monitor large-scale agent runs. I won't go into them now, but I'm happy to chat in the halls.
Finally, we are hiring, too. If you want to advance the state of AI security and safety integrity, you’re welcome to join. Thank you. [Applause]