Google DeepMind at NeurIPS 2024
Published December 5, 2024
Advances in adaptive AI agents, enhancements to 3D scene creation, and innovations in LLM training for a smarter, safer future
Next week, AI researchers from around the world will gather for the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), to be held in Vancouver from December 10th to 15th.
Two papers led by Google DeepMind researchers will receive Test of Time awards for their “undeniable impact” on the field. Ilya Sutskever will present Sequence to Sequence Learning with Neural Networks, co-authored with Oriol Vinyals, VP of Drastic Research at Google DeepMind, and Distinguished Scientist Quoc V. Le. Google DeepMind scientists Ian Goodfellow and David Warde-Farley will speak about Generative Adversarial Nets.
We will also present live demonstrations, including Gemma Scope, AI for music generation, weather forecasting, and more, showing how fundamental research translates into real-world applications.
The Google DeepMind team will publish more than 100 new papers on topics ranging from AI agents and generative media to innovative learning approaches.
Building adaptive, smart, and secure AI agents
LLM-based AI agents are showing promise at carrying out digital tasks via natural language commands. However, their success depends on precise interaction with complex user interfaces, which requires extensive training data. With AndroidControl, we share the most diverse UI control dataset to date, with over 15,000 human-collected demonstrations spanning more than 800 apps. AI agents trained on this dataset showed significant performance gains, and we hope it helps advance research toward more general AI agents.
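For a rough sense of what such a dataset contains, here is a minimal sketch in Python of how a single demonstration might be represented. The field names are hypothetical, chosen for illustration, and are not AndroidControl's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class UIAction:
    """One step of a demonstration, e.g. a tap or text entry."""
    action_type: str           # e.g. "tap", "scroll", "input_text"
    target: str | None = None  # label of the UI element acted on
    text: str | None = None    # text typed, for "input_text" actions

@dataclass
class Episode:
    """A single human-collected demonstration of a task in one app."""
    app: str          # identifier of the app being controlled
    instruction: str  # natural-language goal, e.g. "Set a 7am alarm"
    screenshots: list[bytes] = field(default_factory=list)  # one per step
    actions: list[UIAction] = field(default_factory=list)   # aligned with screenshots

# An agent is then trained to map (instruction, current screenshot) -> next action.
```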
For an AI agent to generalize across tasks, it needs to learn from each experience it encounters. We introduce an in-context abstraction learning method that helps agents grasp key task patterns and relationships from imperfect demonstrations and natural language feedback, improving their performance and adaptability.
A frame from a video of someone making a sauce, with individual elements identified and numbered: ICAL can extract the important aspects of the process.
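As a rough illustration of the idea, and not the paper's actual implementation, an agent might distill reusable lessons from a noisy demonstration and keep them in its prompt for future tasks. Here, `llm` is a placeholder for any text-generation callable, and the prompts are invented:

```python
def learn_abstraction(llm, demonstration: str, feedback: str, memory: list[str]) -> None:
    """Turn one imperfect demo plus human feedback into a general,
    reusable lesson, and store it in the agent's memory."""
    prompt = (
        "Here is an imperfect demonstration of a task:\n"
        f"{demonstration}\n"
        f"Human feedback on what went wrong or right:\n{feedback}\n"
        "State the general pattern or rule this illustrates, "
        "abstracting away task-specific details."
    )
    memory.append(llm(prompt))

def act(llm, task: str, memory: list[str]) -> str:
    """Condition the agent on its accumulated abstractions in-context."""
    context = "\n".join(memory)
    return llm(f"Useful lessons from past experience:\n{context}\n\nTask: {task}\nPlan:")
```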
Developing agentic AI that works to achieve users’ goals can make the technology more useful, but alignment is key when developing AI that acts on our behalf. To that end, we propose a theoretical method for measuring the goal-directedness of an AI system, and show how a model’s perception of its user can influence its safety filters. Together, these insights underline the importance of robust safeguards to prevent unintended or unsafe behavior, and to ensure AI agents behave safely and in line with their intended use.
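The formal measure is in the paper; purely as intuition, goal-directedness can be thought of as how much better than chance a system's actions are explained by assuming it pursues a given goal. The toy score below is an invented illustration of that intuition, not the proposed metric:

```python
import math

def goal_directedness_toy(action_log_probs_under_goal: list[float],
                          num_actions: int) -> float:
    """Toy score: average log-likelihood of the observed actions under a
    goal-seeking model, minus their log-likelihood under a uniform random
    policy. Higher means behavior is better explained by the goal."""
    goal_ll = sum(action_log_probs_under_goal) / len(action_log_probs_under_goal)
    random_ll = -math.log(num_actions)  # uniform-policy baseline
    return goal_ll - random_ll
```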
Advances in 3D scene creation and simulation
While demand for high-quality 3D content is increasing across industries such as gaming and visual effects, creating lifelike 3D scenes remains costly and time-consuming. Our recent work introduces new 3D generation, simulation, and control approaches to streamline content creation and enable faster, more flexible workflows.
Creating high-quality, realistic 3D assets and scenes typically requires capturing and modeling thousands of 2D photos. CAT3D is a system that can create 3D content from any number of images, even a single image or a text prompt, in as little as a minute. It accomplishes this with a multi-view diffusion model that generates additional consistent 2D images from many different viewpoints, then uses those generated images as input to traditional 3D modeling techniques. The results outperform previous methods in both speed and quality.
CAT3D allows you to create 3D scenes from any number of generated or real images.
From left to right: text to image to 3D, real photo to 3D, multiple photos to 3D.
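At a high level, the pipeline chains a multi-view diffusion model into a standard reconstruction step. The sketch below is schematic; the function bodies are placeholders standing in for CAT3D's components, not its actual API:

```python
from typing import List

def sample_novel_views(images: List[bytes], cameras: List[dict]) -> List[bytes]:
    """Stand-in for the multi-view diffusion model: given a few observed
    images, return additional 2D views consistent across the requested cameras."""
    raise NotImplementedError("placeholder for the multi-view diffusion model")

def reconstruct_3d(views: List[bytes]):
    """Stand-in for a conventional 3D reconstruction method (e.g. NeRF-style
    optimization) run on the combined real and generated views."""
    raise NotImplementedError("placeholder for standard 3D reconstruction")

def create_3d_scene(inputs: List[bytes], cameras: List[dict]):
    # 1. Amplify one or a few inputs into many consistent viewpoints.
    novel_views = sample_novel_views(inputs, cameras)
    # 2. Treat the generated views like real photos for reconstruction.
    return reconstruct_3d(inputs + novel_views)
```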
Simulating scenes with many rigid objects, such as a messy tabletop or rolling Lego bricks, remains computationally intensive. To overcome this obstacle, we introduce a technique called SDF-Sim that represents object shapes in a scalable way, speeding up collision detection and enabling efficient simulation of large, complex scenes.
Complex simulations of hundreds of objects falling and colliding, accurately modeled with SDF-Sim.
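To see why signed distance functions help, here is a minimal, self-contained sketch of SDF-based collision detection, using a sphere because its SDF has a closed form. This illustrates the general principle, not SDF-Sim itself:

```python
import numpy as np

def sphere_sdf(points: np.ndarray, center: np.ndarray, radius: float) -> np.ndarray:
    """Signed distance from each point to a sphere's surface:
    negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

def in_contact(surface_points: np.ndarray, center: np.ndarray,
               radius: float, tolerance: float = 1e-3) -> bool:
    """Collision test: object A touches object B if any of A's surface
    points lies within `tolerance` of (or inside) B's signed distance field.
    One cheap SDF query per point replaces expensive mesh-mesh intersection tests."""
    return bool(np.any(sphere_sdf(surface_points, center, radius) < tolerance))

# Example: points sampled on a unit circle at the origin vs. a sphere at x = 1.5.
theta = np.linspace(0.0, 2.0 * np.pi, 100)
points_a = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=-1)
print(in_contact(points_a, center=np.array([1.5, 0.0, 0.0]), radius=0.6))  # True
```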
AI image generators based on diffusion models struggle to control the 3D position and orientation of multiple objects. Our solution, Neural Assets, introduces object-specific representations that capture both appearance and 3D pose, learned by training on dynamic video data. Neural Assets lets users move, rotate, and swap objects between scenes, making it a useful tool for animation, gaming, and virtual reality.
Given a source image and an object’s 3D bounding box, you can move, rotate, and scale objects, or transfer objects and backgrounds between images.
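Conceptually, each object becomes a pairing of an appearance vector with a pose vector, and editing means recombining them. The sketch below is an invented illustration of that separation, not the paper's architecture:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NeuralAsset:
    """One object as two disentangled vectors (illustrative only):
    what it looks like, and where it sits in 3D."""
    appearance: np.ndarray  # learned appearance embedding
    pose: np.ndarray        # embedding of the 3D bounding box (position, size, orientation)

def move_object(asset: NeuralAsset, new_pose: np.ndarray) -> NeuralAsset:
    """Translate or rotate an object by swapping only its pose embedding."""
    return NeuralAsset(appearance=asset.appearance, pose=new_pose)

def swap_objects(a: NeuralAsset, b: NeuralAsset) -> tuple[NeuralAsset, NeuralAsset]:
    """Exchange two objects between scenes: each keeps its appearance
    but takes over the other's location and orientation."""
    return (NeuralAsset(a.appearance, b.pose), NeuralAsset(b.appearance, a.pose))
```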
Improving how LLMs learn and respond
We’ve also evolved how LLMs train, learn, and interact with users, improving performance and efficiency in several ways.
Larger context windows now allow LLMs to learn from thousands of examples at once, known as many-shot in-context learning (ICL). This process improves model performance on tasks such as mathematics, translation, and reasoning, but it often requires high-quality, human-generated data. To make training more cost-effective, we explore ways to adapt many-shot ICL that reduce reliance on manually curated data.

With so much data available for training language models, the main constraint for teams building them is the available compute. We address the important question of how to choose the right model size to achieve the best results within a fixed compute budget.
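For the compute-budget question, a back-of-the-envelope sketch helps fix intuitions. It uses the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) and the well-known Chinchilla-style rule of thumb of roughly 20 training tokens per parameter; these are standard heuristics, not this paper's result:

```python
def compute_optimal_split(compute_flops: float,
                          tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Given a fixed compute budget C (FLOPs), choose model size N and token
    count D using C ~= 6 * N * D and D ~= 20 * N. Illustrative heuristic only."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(5.76e23)
print(f"~{n:.1e} parameters, ~{d:.1e} tokens")  # ~6.9e10 parameters, ~1.4e12 tokens
```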
Another innovative approach, called time-reversed language models (TRLMs), considers pre-training and fine-tuning an LLM to work in reverse: given a traditional LLM’s responses as input, a TRLM generates queries that could have produced those responses. Combined with a traditional LLM, this method not only better ensures that responses follow user instructions, but also improves citation generation for summarized text and strengthens safety filters against harmful content.
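As a rough sketch of one way such a model could be used (the `generate` and `log_prob` interfaces are hypothetical, and this is not the paper's implementation), a reverse model can score how well each candidate response "points back" to the user's query, and that backward score can rerank forward generations:

```python
def rerank_with_trlm(forward_lm, reverse_lm, query: str, n_candidates: int = 4) -> str:
    """Generate several candidate responses with an ordinary (forward) LM,
    then keep the one whose text makes the original query most likely
    under a time-reversed LM, i.e. the highest log P(query | response)."""
    candidates = [forward_lm.generate(query) for _ in range(n_candidates)]
    return max(candidates, key=lambda r: reverse_lm.log_prob(text=query, given=r))
```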
Curating high-quality data is essential for training large AI models, but manual curation is difficult at scale. To address this, our Joint Example Selection (JEST) algorithm optimizes training by identifying the most learnable data within larger batches, enabling up to 13× fewer training rounds and 10× less computation while outperforming state-of-the-art multimodal pre-training baselines.
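The core scoring idea can be sketched as follows. This is a simplified, per-example version of learnability-based selection; the full JEST algorithm selects examples jointly over sub-batches, which this sketch omits:

```python
import numpy as np

def select_learnable(learner_losses: np.ndarray,
                     reference_losses: np.ndarray,
                     batch_size: int) -> np.ndarray:
    """Score each candidate example by 'learnability': high loss for the
    current learner but low loss for a strong pretrained reference model,
    i.e. examples that are neither already mastered nor unlearnable noise.
    Returns indices of the top-scoring examples to actually train on."""
    learnability = learner_losses - reference_losses
    return np.argsort(-learnability)[:batch_size]

# Example: pick the 2 most learnable of 5 candidates.
learner = np.array([2.0, 0.3, 1.5, 2.5, 0.9])
reference = np.array([0.5, 0.2, 1.4, 2.4, 0.1])
print(select_learnable(learner, reference, batch_size=2))  # [0 4]
```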
Planning tasks are another challenge for AI, especially in stochastic environments where outcomes are influenced by randomness and uncertainty. Researchers use various types of inference to develop plans, but there is no consistent approach. We demonstrate that planning itself can be viewed as a distinct type of probabilistic inference, and propose a framework for ranking inference techniques by how effectively they plan.
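A standard way to cast planning as inference, shown below in its simplest form, is to weight sampled action sequences by their exponentiated return and treat the result as a posterior over plans. This is a generic sketch of the "control as inference" view, not the paper's framework; `simulate` is a caller-supplied environment rollout returning a plan's total reward:

```python
import math
import random

def posterior_over_plans(simulate, actions, horizon: int, n_samples: int = 500):
    """Sample candidate action sequences from a uniform prior, weight each by
    exp(return) so high-reward trajectories get exponentially more posterior
    mass, then normalize the weights into a distribution over plans."""
    plans = [[random.choice(actions) for _ in range(horizon)]
             for _ in range(n_samples)]
    weights = [math.exp(simulate(plan)) for plan in plans]
    total = sum(weights)
    return [(plan, w / total) for plan, w in zip(plans, weights)]
```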
Uniting the global AI community
We are proud to be a Diamond Sponsor of the conference and to support Women in Machine Learning, LatinX in AI, and Black in AI, which help build communities around the world in AI, machine learning, and data science.
If you’re attending NeurIPS this year, stop by the Google DeepMind and Google Research booths to explore cutting-edge research in demos, workshops, and more throughout the conference.