Pacing Outside the Box: RNNs Learn to Plan in Sokoban


Ever notice how some people pace when they’re deep in thought? Surprisingly, neural networks do something similar—and it boosts their performance! We made this discovery while exploring the planning behavior of a recurrent neural network (RNN) trained to play the complex puzzle game Sokoban.

Like prior work, we train an RNN with standard model-free reinforcement learning and give it extra thinking steps at test time. Without the extra thinking steps, the RNN agent sometimes locks itself into an unsolvable position. With additional thinking time, it solves the same level, suggesting that it is planning.

Level 18 with 0 vs 6 steps of thinking

In another case, when given no extra thinking time at the beginning, the agent starts the level by pacing around as if “buying time”. Once thinking time is granted, its path to the solution becomes more direct and efficient, and it solves the puzzle faster.

Level 53 with 0 vs 6 steps of thinking

These observations raise the question: why does this intriguing behavior emerge? To find out, we conduct a detailed black-box behavioral analysis, providing novel insights into how planning behavior develops inside neural networks. However, the exact details of the internal planning process remain mysterious: we invite other interpretability researchers to study these “model organisms” of planning, open-sourcing our trained agents and source code that replicates prior work.

Understanding how neural networks reason is crucial for advancing AI and ensuring its alignment with human values, especially considering the concept of “mesa-optimizers”—neural networks that develop internal goals during training that may differ from their intended objectives. This study sheds light on the emergence of planning in deep neural networks, an important topic in AI safety debates, and provides crucial insights for developing safe and efficient AI systems.

This represents an important first step in our longer-term research agenda to automatically detect mesa-optimizers, understand their goals, and modify the goals or planning procedures to align with human values and objectives.

RNN agent solving Sokoban level 31215

Training Setup

Sokoban, a classic puzzle game, is a benchmark for AI planning algorithms due to its simple rules and strategic complexity. In AI research, planning refers to an agent’s ability to think ahead and devise strategies to achieve a goal. This study reproduced and extended the findings of Guez et al. (2019), investigating the planning behavior of an RNN trained to play Sokoban using reinforcement learning. We trained a model with 1.28 million parameters using the Deep Repeating ConvLSTM (DRC) architecture they developed. The network was trained on levels from the Boxoban dataset, which comprises levels of varying difficulty with easy levels over-represented. We pass 10x10 RGB images as input to the network and train it with the IMPALA algorithm. The agent received rewards for pushing boxes onto targets and a small penalty for every move, encouraging efficient planning and problem-solving. Over the course of 2 billion environment steps, the network gradually learned to plan ahead and solve puzzles effectively.
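
The reward structure can be sketched as follows. The exact constants are illustrative assumptions; the text above only states that the agent is rewarded for pushing boxes onto targets and penalized slightly for every move:

```python
# Illustrative reward shaping for a Sokoban environment.
# All constants are assumed values for illustration, not the paper's exact ones.
STEP_PENALTY = -0.1        # small per-move penalty, encouraging efficiency
BOX_ON_TARGET = 1.0        # reward for pushing a box onto a target
LEVEL_SOLVED_BONUS = 10.0  # bonus when all boxes are placed

def step_reward(boxes_on_target_before: int,
                boxes_on_target_after: int,
                solved: bool) -> float:
    """Reward for a single environment step."""
    reward = STEP_PENALTY
    # Pushing a box onto a target is rewarded; pushing one off is penalized,
    # since the difference below is then negative.
    reward += BOX_ON_TARGET * (boxes_on_target_after - boxes_on_target_before)
    if solved:
        reward += LEVEL_SOLVED_BONUS
    return reward
```

The per-step penalty is what makes "pacing" costly in expectation, so any pacing the trained agent exhibits must buy it something, which the analysis below investigates.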

Performance improves with additional thinking steps as compared to the ResNet baseline.

Replicating the State-of-the-Art

Our results confirm Guez et al. (2019)’s finding that giving DRCs extra thinking time at the start of an episode during inference leads to enhanced planning and efficiency. In particular, we demonstrate:

  • Emergent Planning Behavior: The DRC agent demonstrated strategic thinking, benefiting greatly from additional thinking time early in training.
  • Improved Performance with Thinking Time: With more thinking steps, the DRC agent solved puzzles more efficiently and outperformed a non-recurrent ResNet baseline, especially on complex puzzles.
  • Training and Architecture Details: The DRC architecture, with its convolutional LSTM layers, proved effective for our Sokoban tasks, outperforming the non-recurrent ResNet baseline (orange in the figure above). The training process, powered by the IMPALA algorithm, achieved strong performance; preliminary experiments with PPO yielded a substantially lower level solution rate.
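
Mechanically, "extra thinking time" means ticking the recurrent core on the initial observation without advancing the environment. A minimal sketch, assuming a hypothetical `policy(obs, state)` callable that returns an action and an updated hidden state (the real agent's core is the DRC ConvLSTM):

```python
def act_with_thinking(policy, obs, state, n_thinking_steps=6):
    """Tick the recurrent state on the same observation several times before
    committing to an action; the intermediate action outputs are discarded."""
    for _ in range(n_thinking_steps):
        _, state = policy(obs, state)  # hidden state keeps evolving
    return policy(obs, state)          # the (action, state) actually used

# Toy stand-in policy: the hidden state counts ticks, and the "action"
# is just the tick count, so we can see the thinking steps happen.
def toy_policy(obs, state):
    state = state + 1
    return state, state

action, state = act_with_thinking(toy_policy, obs=None, state=0,
                                  n_thinking_steps=6)
```

The environment is untouched during the thinking loop; only the agent's hidden state changes, which is why this costs no environment reward.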

Behavioral Analysis

Planning Solves More Levels: More thinking steps (x-axis, below) improved the success rate of the DRC agent up to 6 steps, after which it plateaued or slightly declined. The recurrent DRC agent substantially outperforms the non-recurrent ResNet baseline, even though the ResNet had more than 2x the parameters of the DRC agent, further demonstrating the utility of planning.
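
A success-rate curve like the one below can be produced with a simple sweep over thinking-step budgets. Here `run_episode(level, n_thinking)` is an assumed helper that rolls out the agent with the given budget and reports whether the level was solved:

```python
def success_rate_by_thinking_steps(run_episode, levels,
                                   budgets=(0, 1, 2, 4, 6, 8)):
    """Fraction of levels solved at each thinking-step budget.

    `run_episode(level, n_thinking) -> bool` is an assumed helper that
    rolls out one episode and reports success.
    """
    return {n: sum(bool(run_episode(lvl, n)) for lvl in levels) / len(levels)
            for n in budgets}
```

Comparing the resulting dictionary across budgets is how a plateau (here, around 6 steps) would show up.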

Success rate increases with more thinking steps, until 6 steps.

Solving Harder Levels: If extra thinking steps enable new levels to be solved, what’s special about those levels? We analyze the average length of the optimal solution (not necessarily the one played by the DRC agent) for levels first solved at a given number of thinking steps. We find that levels requiring more thinking steps tend to have longer optimal solutions, indicating those levels are harder than average. We conjecture that more thinking steps enabled the DRC agent to solve these harder levels by taking strategic moves that pay off in the long run, avoiding actions that are myopically good but cause the level to become unsolvable (e.g. getting a box “stuck”).
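
The grouping behind this analysis can be sketched as follows: each level is assigned to the smallest thinking-step budget at which it is first solved, and optimal solution lengths are averaged within each group. The level record format here is an assumption for illustration:

```python
from collections import defaultdict
from statistics import mean

def first_solve_budget(solved_at):
    """Smallest thinking-step budget at which the level is solved, or None.
    `solved_at` maps budget -> bool."""
    solved = [n for n, ok in sorted(solved_at.items()) if ok]
    return solved[0] if solved else None

def avg_optimal_length_by_budget(levels):
    """Group levels by first-solve budget and average optimal solution length.
    Assumed level format: {"optimal_len": int, "solved_at": {budget: bool}}."""
    groups = defaultdict(list)
    for lvl in levels:
        n = first_solve_budget(lvl["solved_at"])
        if n is not None:  # levels never solved are excluded
            groups[n].append(lvl["optimal_len"])
    return {n: mean(lengths) for n, lengths in groups.items()}
```

A rising average across budgets is the signal reported in the figure below.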

Average optimal solution length grouped by the number of thinking steps needed to first solve the level. Higher thinking steps correlate with solving harder levels.

Efficient Box Placement: Evidence for this conjecture comes from panel (b) of the figure below. In levels that are solved with 6 thinking steps but not with 0 steps, the time taken to place the first three boxes (B1, B2, B3) actually increases; the agent is seemingly less efficient. However, the time taken to place the final, fourth box (B4) decreases. This suggests the agent is taking actions that are better in the long run, enabling it to solve more levels (figure above) and to solve them faster (figure below), at the cost of delaying the instant gratification of placing the first few boxes.
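
Extracting these placement times from a trajectory is straightforward: record the first timestep at which the k-th box is on a target. A minimal sketch, assuming we log the number of boxes on target after every step:

```python
def placement_times(boxes_on_target_per_step):
    """First timestep at which the k-th box sits on a target, for each k.

    Input: number of boxes on target after each timestep of one episode.
    setdefault keeps only the first time each count is reached, so a box
    later pushed off a target does not overwrite its placement time.
    """
    times = {}
    for t, count in enumerate(boxes_on_target_per_step):
        for k in range(1, count + 1):
            times.setdefault(k, t)
    return times
```

Averaging `times[1]` through `times[4]` over many levels, separately for 0 and 6 thinking steps, yields the bars in the figure below.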

Average timesteps to place each box (B1 to B4) with different thinking steps. (a) All levels. (b) Levels solved by 6 thinking steps but not by 0 steps.

Cycle Reduction: 82.39% of cycles, or agent “pacing” behavior, disappeared when the network was made to think for N steps before starting an N-length cycle. This confirms that the network uses these cycles to formulate a plan.
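
Here a cycle means the agent returns to an exact earlier environment state (player position plus box configuration) without making progress. A minimal sketch of detecting such "pacing" spans in a trajectory, with the state representation left as an assumption:

```python
def find_cycles(states):
    """Return (start, end) index pairs where states[end] == states[start],
    i.e. the agent has revisited an earlier state without progress.

    States must be hashable, e.g. (player_pos, frozenset(box_positions)).
    """
    last_seen = {}
    cycles = []
    for t, s in enumerate(states):
        if s in last_seen:
            cycles.append((last_seen[s], t))  # cycle of length t - last_seen[s]
        last_seen[s] = t  # always track the most recent visit
    return cycles
```

With cycles identified, the intervention above amounts to inserting N thinking steps just before an N-length cycle would begin and checking whether the cycle still occurs.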

(Left) Histogram of cycle start times on medium-difficulty levels. (Right) Number of cycles in the first 5 steps with N initial thinking steps.

Implications & Conclusion

Just like people who pace to think through tough problems, neural networks benefit from a bit of ‘pacing’ or time to plan ahead for challenging tasks like solving Sokoban. Understanding how this planning behavior emerges in networks can help develop more resilient and reliable AI systems. Additionally, these insights can improve the interpretability of AI decision-making, making it easier to diagnose and address potential issues.

By revealing how neural networks develop planning strategies, we aim to provide insights that contribute to AI alignment and help reduce the risk of harmful misgeneralization. This work presents a promising model for further exploration into mechanistic interpretability and AI safety. Ultimately, this study contributes to the broader goal of creating trustworthy and aligned AI systems that can think ahead, plan effectively, and align with human values.

For more information, read our full paper “Planning behavior in a recurrent neural network that plays Sokoban.” If you are interested in working on problems in AI safety, we’re hiring. We’re also open to exploring collaborations with researchers at other institutions – just reach out at hello@far.ai.

Adrià Garriga-Alonso
Research Scientist

Adrià Garriga-Alonso is a scientist at FAR AI, working on understanding what learned optimizers want. Previously he worked at Redwood Research on neural network interpretability, and holds a PhD from the University of Cambridge.

Mohammad Taufeeque
Research Engineer

Mohammad Taufeeque is a research engineer at FAR AI. Taufeeque has a bachelor’s degree in Computer Science & Engineering from IIT Bombay, India. He has previously interned at Microsoft Research, working on adapting deployed neural text classifiers to out-of-distribution data.

ChengCheng Tan
Senior Communications Specialist

ChengCheng is a Senior Communications Specialist at FAR AI. She loves working with artificial intelligence, is passionate about lifelong learning, and is devoted to making technology education accessible to a wider audience. She brings over 20 years of experience consulting in UI/UX research, design and programming. At Stanford University, her MS in Computer Science specializing in Human Computer Interaction was advised by Terry Winograd and funded by the NSF. Prior to this, she studied computational linguistics at UCLA and graphic design at the Laguna College of Art + Design.

Adam Gleave
CEO and President of the Board

Adam Gleave is the CEO of FAR AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designer. He is particularly interested in improving methods for value learning, and robustness of deep RL. For more information, visit his website.