Notes on SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning
This is a summary of an important research paper. It was made interactively by a human and several AIs. The goal is to curate good ideas and provide a 10:1 time savings.
Link to paper: https://arxiv.org/abs/2307.06135
Paper published on: 2023-07-12
Paper's authors: Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, Niko Suenderhauf
GPT3 API Cost: $0.07
GPT4 API Cost: $0.15
Total Cost To Write This: $0.22
Time Savings: 33:1
Let's start with the main idea of the research. Picture a robot in a multi-story, multi-room building trying to carry out tasks based on natural language instructions. The tasks could be as simple as "Find an apple in the kitchen" or as complex as "Pick up a document in Peter's office and deliver it to Will's office". The challenge here is to have the robot understand these instructions and plan its actions in a way that is efficient and accurate. This is where the research introduces SayPlan, a framework that allows Large Language Models (LLMs) to plan and execute tasks in large-scale environments.
SayPlan essentially uses a 3D scene graph (3DSG) to represent the environment. Think of it as a detailed map of the building, with each room, object, and asset represented as a node in the graph. The beauty of this approach is that it allows the LLM to conduct a semantic search for task-relevant subgraphs. In other words, the LLM can quickly find the parts of the graph that are relevant to the task at hand.
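To make the hierarchy concrete, here is a minimal sketch of such a scene graph, assuming a simple room → asset → object containment structure. The node names, attribute keys, and helper function are illustrative stand-ins, not the paper's exact schema.

```python
# Toy 3D scene graph: nodes carry a type and a parent link,
# edges encode containment (room contains asset, asset contains object).
scene_graph = {
    "nodes": {
        "kitchen":    {"type": "room"},
        "fruit_bowl": {"type": "asset",  "parent": "kitchen"},
        "apple":      {"type": "object", "parent": "fruit_bowl"},
    },
    "edges": [
        ("kitchen", "fruit_bowl"),   # room contains asset
        ("fruit_bowl", "apple"),     # asset contains object
    ],
}

def objects_in_room(graph, room):
    """Return all object nodes whose parent asset sits in the given room."""
    assets = {b for a, b in graph["edges"] if a == room}
    return [
        node
        for node, attrs in graph["nodes"].items()
        if attrs["type"] == "object" and attrs.get("parent") in assets
    ]
```

With this structure, a query like `objects_in_room(scene_graph, "kitchen")` finds the apple by walking containment edges, which is the kind of graph lookup the LLM's search operates over.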
The SayPlan framework consists of two stages: semantic search and iterative replanning. During the semantic search stage, the LLM explores a collapsed representation of the full scene graph to find a subgraph that contains the necessary items for the task. For instance, if the task is to find an apple, the LLM would search the graph for nodes representing the kitchen and any fruit bowls within it.
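The expand/contract mechanics of the semantic search stage can be sketched as a simple loop: the model starts from a collapsed view (rooms only, contents hidden) and expands nodes until the task-relevant items are visible. Here `llm_choose_action` is a hypothetical keyword-matching stand-in for the real LLM call, and the two-room graph is invented for illustration.

```python
# Collapsed-graph semantic search, sketched under toy assumptions.
FULL_GRAPH = {
    "kitchen": ["fruit_bowl", "apple"],
    "office":  ["desk", "document"],
}

def llm_choose_action(task, visible):
    """Stand-in policy: expand any still-collapsed room whose name or
    contents match a task keyword (a real system queries an LLM here)."""
    for room, contents in visible.items():
        if contents is None and any(w in task for w in (room, *FULL_GRAPH[room])):
            return ("expand", room)
    return ("done", None)

def semantic_search(task):
    visible = {room: None for room in FULL_GRAPH}  # collapsed view: no contents
    while True:
        action, node = llm_choose_action(task, visible)
        if action == "done":
            break
        visible[node] = FULL_GRAPH[node]  # reveal the chosen room's contents
    # Return only the expanded, task-relevant subgraph
    return {room: c for room, c in visible.items() if c is not None}
```

The point of the loop is token economy: the LLM never has to ingest the full graph, only the collapsed skeleton plus whatever it chose to expand.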
The second stage, iterative replanning, involves generating long-horizon task plans by shortening the planning horizon and using a scene graph simulator to validate the plan. This is akin to making a detailed plan for the robot's actions and then testing it in a virtual environment to ensure it will work in the real world. If the simulator indicates that the plan is infeasible, the LLM can update the plan based on the feedback.
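The plan-check-replan cycle can be sketched as follows. The toy simulator, the `llm_plan` stand-in, and the failure messages are all illustrative assumptions; in the actual system the simulator checks plans against the scene graph and the feedback is fed back into the LLM's prompt.

```python
def simulate(plan, world):
    """Toy scene-graph simulator: pickup fails unless the agent has
    navigated to the room containing the object. Returns feedback on
    failure, or None if the plan is feasible."""
    location = world["start"]
    for verb, arg in plan:
        if verb == "goto":
            location = arg
        elif verb == "pickup" and world["objects"].get(arg) != location:
            return f"cannot pickup({arg}): agent is in {location}"
    return None

def llm_plan(task, feedback):
    """Stand-in for the LLM: the first attempt forgets to navigate;
    given simulator feedback, it emits a corrected plan."""
    if feedback:
        return [("goto", "kitchen"), ("pickup", "apple")]
    return [("pickup", "apple")]

def iterative_replanning(task, world, max_iters=5):
    feedback = None
    for _ in range(max_iters):
        plan = llm_plan(task, feedback)
        feedback = simulate(plan, world)
        if feedback is None:   # simulator accepts the plan
            return plan
    raise RuntimeError("no feasible plan found")
```

The design choice worth noting is that infeasibility is caught cheaply in simulation, so only validated plans ever reach the physical robot.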
To evaluate the performance of SayPlan, the research used two large-scale environments and compared its performance against a human baseline and a variant of SayPlan using GPT-3.5. The results showed that SayPlan was capable of grounding large-scale, long-horizon task plans from abstract, natural language instructions for a mobile manipulator robot to execute.
The paper situates SayPlan within a body of related work rather than introducing these systems itself. On the language-model side, it draws on PaLM-E, an embodied multimodal language model; few-shot grounded planning methods for embodied agents using large language models; and work on embodied reasoning through planning with language models. On the representation side, it builds on the 3D scene graph structure for unified semantics, 3D space, and camera.
For perception, the paper references Kimera, a spatial perception engine for 3D dynamic scene graph construction and optimization; Visual Graphs from Motion (VGfM), a method for scene understanding with object geometry reasoning; and Hydra, a real-time spatial perception engine for 3D scene graph construction and optimization.
For planning and reasoning, it cites work on prompting-based reasoning in large language models; the Planning Domain Definition Language (PDDL); research on planning, propositional logic, and stochastic search; and the 2014 International Planning Competition's progress and trends.
It also discusses Graph-Toolformer, a method for empowering large language models with graph reasoning ability. Turning to evaluation, the SayPlan system was tested in a 3D scene graph environment: an office floor with a mobile manipulator robot, assessing SayPlan's semantic search capabilities and graph-based reasoning.
The research then provided a full 3D scene graph representation of the office floor environment, which included various rooms, assets, and objects. The scene graph can be expanded or contracted to show different levels of detail. The research evaluated the effectiveness of semantic search in the office environment, with the evaluation including various search instructions and their success or failure.
Finally, the evaluation measured how well SayPlan's language-model-generated plans executed instructions in both office and home environments, covering simple and complex search tasks. SayPlan was compared against baseline variants, including LLM-As-Planner and LLM+P, on specific instructions such as finding meeting rooms, locating objects, identifying rooms with particular features, and performing actions like closing cabinets and refrigerating items. For each instruction, the research recorded the detailed sequence of explored graph nodes, indicating the steps taken to execute the task, and reported the success and failure rates of the generated plans.