We introduce D4RT, an integrated AI model for 4D scene reconstruction and tracking across space and time.
Every time we look at the world, we perform extraordinary feats of memory and prediction. We see and understand things as they are at one moment, as they were a moment ago, and as they will be in the next moment. Our mental models of the world are persistent representations of reality, and we use them to draw intuitive conclusions about causal relationships between the past, present, and future.
We can equip machines with cameras so they can see the world much as we do, but that only solves the input problem. To understand this input, a computer must solve a complex inverse problem: from a video, a sequence of flat 2D projections, it must recover a rich, three-dimensional world in motion.
Today we are introducing D4RT (Dynamic 4D Reconstruction and Tracking), a new AI model that unifies dynamic scene reconstruction and tracking in a single, efficient framework, bringing us closer to the next frontier in artificial intelligence: holistic perception of dynamic reality.
The challenge of the fourth dimension
To understand a dynamic scene captured in 2D video, an AI model must track every pixel of every object as it moves through three dimensions of space and the fourth dimension of time. It must also disentangle object motion from camera motion, maintaining a consistent representation even when objects pass behind one another or leave the frame entirely. Traditionally, capturing this level of geometry and motion from 2D video has required either a compute-intensive optimization process or a patchwork of specialized AI models (e.g. for depth, motion, and camera pose), resulting in slow and fragmented reconstruction.
D4RT’s simplified architecture and novel query mechanism put it at the forefront of 4D reconstruction, making it up to 300 times more efficient than traditional methods and fast enough for real-time applications such as robotics and augmented reality.
How D4RT works: A query-based approach
D4RT is built on a unified encoder-decoder Transformer architecture. The encoder first processes the input video into a compressed representation of the scene's geometry and motion. Unlike older systems that use separate modules for different tasks, D4RT uses a flexible query mechanism, centered on a single basic question, to compute only what is needed:
“Where is a particular pixel of the video, at any given time, located in 3D space, as seen from a chosen camera?”
Building on prior work, a lightweight decoder queries this representation to answer each specific instance of that question. Because queries are independent, they can be processed in parallel on modern AI hardware. This makes D4RT extremely fast and scalable, whether you’re tracking just a few points or reconstructing an entire scene.
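The encode-once, query-many pattern described above can be sketched in a few lines. This is a minimal illustration only: the function names, query format (pixel coordinates, source time, target camera time), and latent shapes are assumptions for the sake of the sketch, not D4RT's actual API, and NumPy stands in for the real neural networks.

```python
import numpy as np

def encode_video(video: np.ndarray) -> np.ndarray:
    """Stand-in encoder: compress a (T, H, W, 3) video into a latent scene
    representation (here, one random 64-dim token per frame for illustration)."""
    T, H, W, _ = video.shape
    rng = np.random.default_rng(0)
    return rng.standard_normal((T, 64))

def decode_query(latents: np.ndarray, query: tuple) -> np.ndarray:
    """Stand-in decoder: answer one query
    (u, v, t_source, t_target) -> 3D point as seen from the camera at t_target.
    A real decoder would attend over the latents; this returns a dummy point."""
    u, v, t_source, t_target = query
    feat = latents[t_source] + latents[t_target]
    return np.array([u, v, float(feat[0])])

video = np.zeros((8, 64, 64, 3))      # dummy 8-frame clip
latents = encode_video(video)         # encode once...

# ...then issue many independent queries, e.g. track one pixel across time.
# Independence is what allows them to be batched and run in parallel.
queries = [(0.5, 0.5, 0, t) for t in range(8)]
points = np.stack([decode_query(latents, q) for q in queries])
print(points.shape)  # (8, 3): one 3D position per queried timestep
```

The design choice this illustrates is that the expensive encoding happens once per video, while each answer costs only a cheap decoder pass, so the total cost scales with the number of questions you actually ask rather than with a full dense reconstruction.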
