Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Ever notice how some people pace when they’re deep in thought? Surprisingly, neural networks do something similar—and it boosts their performance! We made this discovery while exploring the planning behavior of a recurrent neural network (RNN) trained to play the complex puzzle game Sokoban.
Like prior work, we train an RNN with standard model-free reinforcement learning and give it extra thinking steps at test time. We find that without extra thinking steps, the RNN agent sometimes locks itself into an unsolvable position. With additional thinking time, however, it solves the same level, suggesting that it is planning.
In another case, when the agent is not given extra thinking time at the start, it begins the level by pacing around, as if “buying time”. With thinking time granted, its path to the solution becomes more direct and efficient, and it solves the puzzle faster.
These observations raise the question: why does this intriguing behavior emerge? To find out, we used linear probes to predict and even modify the network’s actions, revealing how it represents plans and adjusts them in real time. We also performed model surgery, allowing the network to generalize beyond its 10x10 input-image constraint and tackle more complex puzzles. This combination of techniques offers new insights into how neural networks plan. We open-source our trained agents and the source code that replicates prior work, and invite other interpretability researchers to study these “model organisms” of planning.
Understanding how neural networks reason is crucial for advancing AI and ensuring its alignment with human values, especially considering the concept of “mesa-optimizers”—neural networks that develop internal goals during training that may differ from their intended objectives. This study sheds light on the emergence of planning in deep neural networks, an important topic in AI safety debates, and provides crucial insights for developing safe and efficient AI systems.
This represents an important first step in our longer-term research agenda to automatically detect mesa-optimizers, understand their goals, and modify the goals or planning procedures to align with human values and objectives.
Training Setup
Sokoban, a classic puzzle game, is a benchmark for AI planning algorithms due to its simple rules and strategic complexity. In AI research, planning refers to an agent’s ability to think ahead and devise strategies to achieve a goal. This study reproduced and extended the findings of Guez et al. (2019), investigating the planning behavior of an RNN trained to play Sokoban using reinforcement learning. We trained a model with 1.28 million parameters using the Deep Repeating ConvLSTM (DRC) architecture they developed. The network was trained on a Sokoban environment with levels taken from the Boxoban dataset, which comprises levels of varying difficulty with easy levels over-represented. We pass 10x10 RGB images to the network as input and train it with the IMPALA algorithm. The agent received rewards for pushing boxes onto targets and a small penalty for every move, encouraging efficient planning and problem-solving. Over the course of 2 billion environment steps, the network gradually learned to plan ahead and solve puzzles effectively.
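For readers who want a concrete picture, the sketch below shows a minimal DRC-style recurrent core in PyTorch. The layer sizes, tick count, and encoder are illustrative assumptions, and details of Guez et al. (2019)’s architecture such as pool-and-inject are omitted; this is a sketch of the idea, not our training code.

```python
# Minimal sketch of a Deep Repeating ConvLSTM (DRC) core, with simplified,
# assumed hyperparameters; the real architecture (Guez et al., 2019) differs in details.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        # All four gates are computed jointly from the input and hidden state.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class DRC(nn.Module):
    """Stacked ConvLSTM layers, each ticked several times per environment step."""
    def __init__(self, depth=3, ticks=3, in_ch=32, hidden_ch=32):
        super().__init__()
        self.encoder = nn.Conv2d(3, in_ch, kernel_size=3, padding=1)  # RGB board in
        self.cells = nn.ModuleList(
            [ConvLSTMCell(in_ch if i == 0 else hidden_ch, hidden_ch) for i in range(depth)]
        )
        self.ticks = ticks

    def forward(self, obs, states):
        x = torch.relu(self.encoder(obs))
        for _ in range(self.ticks):          # internal "thinking" ticks per environment step
            inp = x
            for i, cell in enumerate(self.cells):
                states[i] = cell(inp, states[i])
                inp = states[i][0]
        return states[-1][0], states         # last layer's hidden state feeds the policy head

# Usage: one environment step on a single 10x10 board.
model = DRC()
states = [(torch.zeros(1, 32, 10, 10), torch.zeros(1, 32, 10, 10)) for _ in range(3)]
h_last, states = model(torch.rand(1, 3, 10, 10), states)
```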
Replicating the State-of-the-Art
Our results confirm the findings of Guez et al. (2019): giving DRCs extra thinking time at the start of an episode during inference leads to enhanced planning and efficiency. In particular, we demonstrate:
- Emergent Planning Behavior: The DRC agent demonstrated strategic thinking, benefiting greatly from additional thinking time early in training.
- Improved Performance with Thinking Time: With more thinking steps, the DRC agent solved puzzles more efficiently and outperformed a non-recurrent ResNet baseline, especially on complex puzzles.
- Training and Architecture Details: The DRC architecture, with its convolutional LSTM layers, proved effective for our Sokoban tasks, again outperforming the non-recurrent ResNet baseline. The training process, powered by the IMPALA algorithm, achieved strong performance; preliminary experiments with PPO yielded a substantially lower level-solution rate.
Behavioral Analysis: Planning vs Search
The distinction between planning and search is subtle but important. A plan is a sequence of actions that an agent intends to take; planning is the process of forming one. A search algorithm, on the other hand, evaluates many candidate plans and selects the best one based on expected outcomes. Our study focuses on planning.
In our RNN, we observed “pacing”—moving in cycles early in the episode without interacting with boxes. This seemingly repetitive behavior served a purpose: it gave the network time to think ahead, forming the best plan before committing to moves that could make the level unsolvable. The pacing behavior suggests that the network might be searching through different plans to figure out the best one.
However, when we introduced extra thinking steps, this pacing behavior diminished significantly. With additional time for computation, the RNN was able to form more efficient plans upfront, reducing the need for trial and error. Particularly in complex puzzles, these thinking steps allowed the network to solve problems more effectively from the outset.
The figures below visualize this behavior. They show the agent’s success rate for different numbers of thinking steps, with performance improving significantly as more steps are added, particularly on more complex levels, because the extra time enables more deliberate, long-term planning. Thinking steps also steer the network away from a greedy strategy of placing boxes as fast as possible and toward strategies that solve the level by placing all the boxes.
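To illustrate the mechanism, here is a rough sketch of how extra thinking steps can be given at test time, reusing the hypothetical `DRC` core from the earlier sketch: the recurrent state is ticked on the current observation several times before the first real action. The stand-in `policy_head`, the number of thinking steps, and the way non-acting steps are handled are assumptions, not our exact procedure.

```python
# Hedged sketch: run extra recurrent ticks on the initial observation before the
# first real action, so computation accumulates in the hidden state.
# `model` and `states` come from the DRC sketch above; `policy_head` is a stand-in MLP.
import torch

policy_head = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 10 * 10, 4))

def think_then_act(model, policy_head, obs, states, thinking_steps=6):
    for _ in range(thinking_steps):
        _, states = model(obs, states)             # update the hidden state, take no action
    h_last, states = model(obs, states)
    action = policy_head(h_last).argmax(dim=-1)    # then act greedily
    return action, states
```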
Probing and Controlling the Agent’s Plan
To investigate how the RNN forms and executes plans, we trained linear probes on its hidden states, following concurrent work by Bush et al. (2024), who trained direction probes for a DRC Sokoban agent. For each square in the game grid, the square’s activations are fed into probes that predict whether the agent or a box will move up, down, left, or right from that square.
These probes achieved high F1 scores (72.3% for the agent’s movements and 86.4% for box movements), revealing the RNN’s internal decision-making. Using activations from the first step alone, we can predict most of the agent’s path through the level. The arrows represent predicted moves: green arrows show predictions that are executed in future moves, while red arrows indicate predicted moves that never happen.
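As an illustration of the probing setup, the sketch below trains a per-square linear probe with scikit-learn. The array shapes, the extra “no movement” class, and the random placeholder data are assumptions; the real probes are fit on cached activations and movement labels from gameplay.

```python
# Minimal sketch of per-square linear probes, assuming cached last-layer hidden
# states of shape [N, C, 10, 10] and per-square labels giving the direction a box
# (or the agent) moves from that square: 0-3 for up/down/left/right, 4 for "none".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

hidden = np.random.randn(200, 32, 10, 10)              # placeholder activations
labels = np.random.randint(0, 5, size=(200, 10, 10))   # placeholder direction labels

# Each (timestep, square) pair becomes one training example for the linear probe.
X = hidden.transpose(0, 2, 3, 1).reshape(-1, 32)
y = labels.reshape(-1)

probe = LogisticRegression(max_iter=1000)
probe.fit(X, y)
print("macro F1:", f1_score(y, probe.predict(X), average="macro"))
```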
For example, in the video shown below, the agent initially plans to push the bottom-right box to the right, but after just one step it adjusts its strategy, realizing that this would block the alley and that it should instead move around that box. This adaptability shows the RNN’s ability to update its plan dynamically, as taking moves gives it additional computation steps.
What’s more, by intervening on the hidden state along the directions these probes identify, we found we could alter the planning process. However, a hierarchy exists: the box probes typically take priority over the agent probes. The agent’s actions are steered by the agent probes only when the box probes cannot fully explain the next steps, showing that box-related decisions drive much of the network’s planning.
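A minimal sketch of such an intervention, under the assumption that a probe’s class direction is added to the hidden state at a chosen square, could look like this; the square, direction, and scale are illustrative choices rather than the paper’s exact procedure.

```python
# Hedged sketch of a probe-based intervention: nudge the hidden state at one grid
# square along a probe's class direction so that the decoded plan changes there.
# `probe.coef_` comes from the probing sketch above; scale and indices are illustrative.
import torch

def steer_square(hidden, probe_coef, square_yx, direction, scale=2.0):
    """hidden: [C, H, W] tensor; probe_coef: [num_classes, C] array from the probe."""
    y, x = square_yx
    direction_vec = torch.as_tensor(probe_coef[direction], dtype=hidden.dtype)
    steered = hidden.clone()
    steered[:, y, x] += scale * direction_vec   # push this square toward the chosen move
    return steered
```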
Interpreting with Action Channels and Autoencoders
To better understand how the RNN makes decisions, we examined its internal representations using two techniques: directly inspecting the channels of the last layer’s hidden state, and training sparse autoencoders (SAEs) on those same activations. Both methods revealed valuable insights into how the network organizes information.
We found that the final recurrent layer of the RNN dedicates specific action channels to each of the four possible moves: up, down, left, and right. These channels are so distinct that the final decision layer (a multilayer perceptron) only needs to read the corresponding channel to output the next move. This clear separation makes the network’s decision-making highly interpretable, showing a streamlined internal structure.
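Conceptually, decoding the next move from these channels could look like the sketch below; the channel indices and the spatial-mean summary are hypothetical placeholders for illustration, not the network’s actual readout.

```python
# Hedged sketch of reading hypothetical action channels directly. In practice the
# relevant channels are identified empirically by checking which channel predicts
# each action; the indices and the mean-pooling summary here are assumptions.
ACTION_CHANNELS = {"up": 0, "down": 1, "left": 2, "right": 3}  # placeholder indices

def decode_action(h_last):
    """h_last: [C, H, W] last-layer hidden state at a single environment step."""
    scores = {a: h_last[ch].mean().item() for a, ch in ACTION_CHANNELS.items()}
    return max(scores, key=scores.get)          # the move whose channel is most active
```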
The SAEs trained on the same hidden state also contained these action features. Designed to pick out the most important patterns, they additionally identified elements such as static target locations, solved targets, and unsolved boxes and targets. However, the same features were already present in the raw channels. In fact, the SAE features were less monosemantic than the channels, meaning they were less effective than the channels at representing these distinct features.
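For reference, a sparse autoencoder of the kind described here can be trained on per-square channel vectors as in the sketch below; the dictionary size, sparsity penalty, and random placeholder data are illustrative assumptions rather than our exact setup.

```python
# Minimal sparse-autoencoder sketch on per-square channel vectors of the last hidden state.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=32, d_dict=256):
        super().__init__()
        self.enc = nn.Linear(d_in, d_dict)
        self.dec = nn.Linear(d_dict, d_in)

    def forward(self, x):
        feats = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, 32)                     # placeholder per-square activations
for _ in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1
    opt.zero_grad(); loss.backward(); opt.step()
```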
The RNN’s ability to generate clean, purpose-specific action channels suggests that its internal representations were already well optimized during training. Although the SAEs provided useful insights, the action channels were more interpretable and useful for understanding the network’s decision-making process.
Unlocking Generalization with Model Surgery
The RNN was initially trained only on 10x10 Sokoban puzzles with four boxes, limiting its ability to handle larger levels. Although the recurrent-convolutional layers could process input images of any size, the multilayer perceptron (MLP) layer flattened the 3D hidden states into a fixed-size vector, restricting the network to fixed grid dimensions.
To overcome this, we performed model surgery: we modified the architecture to bypass the MLP layer and predict actions directly from the action features represented in the last hidden state. This adjustment allowed the RNN to generalize to larger puzzles without retraining, removing the bottleneck imposed by the MLP.
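Combining the earlier sketches, the surgically modified forward pass might look as follows; the board size and the channel readout are illustrative assumptions, not our exact code.

```python
# Hedged sketch of running the modified agent on a larger level: the convolutional
# core accepts any board size, and the action-channel readout (decode_action above)
# replaces the fixed-size MLP head. `model` is the DRC sketch from earlier.
import torch

H, W = 20, 25                                    # larger than the 10x10 training boards
states = [(torch.zeros(1, 32, H, W), torch.zeros(1, 32, H, W)) for _ in range(3)]
h_last, states = model(torch.rand(1, 3, H, W), states)
action = decode_action(h_last[0])                # per-channel readout instead of the MLP
```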
With this modification, the RNN was able to solve much more complex puzzles, including levels with over 10 boxes on grids 2-3 times larger than the 10x10, four-box levels it sees during training. The level shown took more than 600 steps to complete, and the network could only solve it when given 128 thinking steps to plan its moves. Some bigger levels are also solvable without any thinking steps. This demonstrates that interpretability research can uncover latent generalization capabilities of AI models that observational evaluations alone would miss.
Implications & Conclusion
Just like people who pace to think through tough problems, neural networks benefit from a bit of ‘pacing’ or time to plan ahead for challenging tasks like solving Sokoban. Understanding how this planning behavior emerges in networks can help develop more resilient and reliable AI systems. Additionally, these insights can improve the interpretability of AI decision-making, making it easier to diagnose and address potential issues.
By revealing how neural networks develop planning strategies, we aim to provide empirical evidence for mesa-optimizers and insights that contribute to AI alignment and help reduce the risk of harmful misgeneralization. This work presents a promising model for further exploration into mechanistic interpretability and AI safety. Ultimately, this study contributes to the broader goal of creating trustworthy and aligned AI systems that can think ahead and plan effectively.
For more information, read our full paper “Planning in a recurrent neural network that plays Sokoban.” If you are interested in working on problems in AI safety, we’re hiring. We’re also open to exploring collaborations with researchers at other institutions – just reach out to hello@far.ai.
References
- Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga Alonso, and David Krueger. Interpreting emergent planning in model-free reinforcement learning. arXiv (forthcoming), 2024.
- Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, and Timothy Lillicrap. An investigation of model-free planning. arXiv, 2019.