There are two popular implementations of the Zero Redundancy Optimizer (ZeRO) algorithm in the community, one from DeepSpeed and the other from PyTorch (FSDP). Hugging Face Accelerate exposes both frameworks for end users to train or tune their models. This blog highlights the differences in how these backends are exposed through Accelerate. We have upstreamed precision-related changes and a concept guide so that users can seamlessly switch between the two backends.
Are FSDP and DeepSpeed interchangeable?
We recently ran a training pipeline with both DeepSpeed and PyTorch FSDP and noticed that the results differed. The model in question was based on Mistral-7B and was loaded in half precision (bfloat16). As seen in Figure 1, the DeepSpeed (blue) loss converged well, but the FSDP (orange) loss did not decrease.
We hypothesized that the learning rate might need to be scaled by the number of GPUs, and since we were using four GPUs, we increased the learning rate by a factor of four. We then observed the loss behavior shown in Figure 2.
Scaling the FSDP learning rate by the number of GPUs seemed to do the trick! However, when we tried a different learning rate (1e-5) without scaling, we observed similar loss and gradient norm behavior for both frameworks, as shown in Figure 3.
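For reference, a minimal sketch of this kind of setup is shown below; the checkpoint name, learning rate, and the omitted data/loop plumbing are illustrative stand-ins, not the exact training pipeline we ran.

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Illustrative values; the exact checkpoint and hyperparameters may differ.
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
BASE_LR = 1e-5

accelerator = Accelerator()  # FSDP or DeepSpeed is selected via the accelerate config

# Load the model in half precision (bfloat16), as in the experiments above.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# The learning-rate-scaling hypothesis from Figure 2: multiply the base
# learning rate by the number of GPUs (four in our runs).
scaled_lr = BASE_LR * accelerator.num_processes

optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr)
model, optimizer = accelerator.prepare(model, optimizer)
```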
Precision matters
We noticed that inside the DeepSpeed codebase, in particular in the implementation of DeepSpeedZeroOptimizer_Stage3 (which, as the name suggests, handles the stage-3 optimizer sharding), _setup_for_real_optimizer calls _create_fp32_partitions. As the fp32 in the name suggests, DeepSpeed performs upcasting internally and by design always keeps its master weights in fp32. This upcasting to full precision meant that the optimizer could converge at learning rates at which it would not converge in lower precision. The earlier observations were artifacts of this difference in precision.
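To make this concrete, here is a minimal, framework-agnostic sketch of the fp32 master-weights pattern that DeepSpeed implements internally. This is not DeepSpeed's actual code, just an illustration of why its optimizer step effectively runs in full precision:

```python
import torch

# Toy model held in bf16, as it would be after loading with torch_dtype=bfloat16.
model = torch.nn.Linear(16, 16).to(torch.bfloat16)

# Keep fp32 "master" copies of every trainable parameter for the optimizer.
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.AdamW(master_params, lr=1e-5)

def training_step(batch):
    # Forward/backward run in bf16 on the model weights.
    loss = model(batch).float().pow(2).mean()
    loss.backward()

    # Copy the bf16 gradients into the fp32 master params and step in full precision.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float()
        p.grad = None
    optimizer.step()
    optimizer.zero_grad()

    # Copy the updated fp32 master weights back into the bf16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master.to(p.dtype))

training_step(torch.randn(4, 16, dtype=torch.bfloat16))
```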
In FSDP and DeepSpeed, the model and optimizer parameters are first "flattened" into a one-dimensional tensor before being sharded across GPUs. FSDP and DeepSpeed use different dtypes for these "flattened" parameters, which has consequences for the PyTorch optimizer. Table 1 outlines the processes for both frameworks; the "Local" column indicates processes that occur per GPU, so any memory overhead from upcasting is amortized by the number of GPUs.
| Process | Local? | FSDP | DeepSpeed |
| --- | --- | --- | --- |
| Loading the model (e.g., AutoModel.from_pretrained(..., torch_dtype=torch_dtype)) | | follows torch_dtype | follows torch_dtype |
| Preparation, i.e., creation of the "flat params" | ✅ | created in torch_dtype | created in float32, ignoring torch_dtype |
| Optimizer initialization | ✅ | creates parameters in torch_dtype | creates parameters in float32 |
| Training step (forward, backward, reduction) | ❌ | follows fsdp.MixedPrecision | follows the mixed precision settings of deepspeed_config_file |
| Optimizer (pre-step) | ✅ | upcasting (if any) to torch_dtype | upcast everything to float32 |
| Optimizer (actual step) | ✅ | occurs in torch_dtype | occurs in float32 |
Table 1: Summary of how FSDP and DeepSpeed handle mixed precision
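As a concrete illustration of the fsdp.MixedPrecision row in Table 1, a bf16 policy for torch-native FSDP looks roughly like the sketch below; in our experiments this policy was driven by the Accelerate config rather than written by hand.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# bf16 mixed precision policy: parameters, gradient reduction, and buffers
# all stay in bfloat16 during the training step.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# `model` is assumed to be an nn.Module constructed inside an initialized
# process group; auto-wrap policies and other arguments are omitted for brevity.
# fsdp_model = FSDP(model, mixed_precision=bf16_policy)
```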
Some takeaway points:
As mentioned in 🤗 Accelerate, the rule of thumb when performing mixed precision is to keep the trainable parameters in float32. Upcasting, as done in DeepSpeed, may have a negligible effect on memory consumption when sharding over a large number of GPUs. However, when using DeepSpeed on a small number of GPUs, the 2x increase in memory consumption can be significant. The torch-native implementation of FSDP does not force upcasting, allowing users to run the PyTorch optimizer in low precision, which offers more flexibility than DeepSpeed's native upcasting.
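For users who want DeepSpeed-like behavior with torch-native FSDP by hand, a rough approach is to upcast the sharded parameters to float32 before constructing the optimizer, while a bf16 MixedPrecision policy keeps the forward/backward/reduction in low precision; this mirrors what the Accelerate change described in the next section does automatically. The helper below is illustrative, not the actual Accelerate implementation.

```python
import torch

def upcast_trainable_params(fsdp_model: torch.nn.Module) -> None:
    """Upcast FSDP flat parameters to float32 so that optimizer states are
    kept in full precision (DeepSpeed-like behavior).

    Assumes `fsdp_model` was wrapped with FSDP using a bf16 MixedPrecision
    policy, so compute still happens in bfloat16."""
    for param in fsdp_model.parameters():
        if param.dtype == torch.bfloat16:
            param.data = param.data.float()

# Usage sketch: upcast before creating the optimizer so its states are fp32.
# upcast_trainable_params(fsdp_model)
# optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-5)
```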
Harmonizing DeepSpeed and FSDP in 🤗 Accelerate
To better align DeepSpeed and FSDP in Accelerate, FSDP now automatically performs the upcasting when mixed precision is enabled. We contributed a pull request with this change, which is included in the 0.30.0 release.
The result of this PR is that FSDP can now operate in two modes: a "mixed precision" mode, like its DeepSpeed counterpart, and a low-precision mode for memory-constrained scenarios, as shown in Figure 4.
The two new FSDP modes are summarized in Table 2 and compared to DeepSpeed.
| Framework | Model loading (torch_dtype) | Mixed precision | Preparation (local) | Training | Optimizer (local) |
| --- | --- | --- | --- | --- | --- |
| FSDP (memory-constrained) | bf16 | default (none) | bf16 | bf16 | bf16 |
| FSDP (mixed precision mode) | bf16 | bf16 | fp32 | bf16 | fp32 |
| DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
Table 2: Summary of the two new FSDP modes and comparison with DeepSpeed
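In code, the difference between the two modes boils down to whether mixed precision is enabled on the Accelerator. The sketch below (meant to be run with accelerate launch) uses the FSDP plugin with its default settings and an illustrative checkpoint name; it is not the exact script used for our runs.

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from transformers import AutoModelForCausalLM

fsdp_plugin = FullyShardedDataParallelPlugin()  # default settings (full sharding)

# Mixed precision mode (DeepSpeed-like): flat params and optimizer in fp32,
# forward/backward/reduction in bf16.
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)

# Memory-constrained mode: no mixed precision, everything stays in bf16
# because the model itself is loaded in bf16.
# accelerator = Accelerator(mixed_precision="no", fsdp_plugin=fsdp_plugin)

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-7b-base",  # illustrative checkpoint name
    torch_dtype=torch.bfloat16,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = accelerator.prepare(model, optimizer)
```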
Throughput results
For throughput comparisons, we use the IBM Granite 7B model (which follows the Meta Llama2 architecture). We compare Model Flops Utilization (MFU) and tokens/sec/GPU metrics, reported for FSDP (full sharding) and DeepSpeed (ZeRO-3).
As before, we use four A100 GPUs with the following hyperparameters:
a batch size of 8, the model loaded in torch.bfloat16, and mixed precision in that same dtype.
Table 3 shows that, as expected, FSDP and DeepSpeed perform similarly.
We intend to follow up with a more comprehensive throughput comparison and with approaches to improve throughput (e.g., packing, torch.compile, selective activation checkpointing) for large-scale alignment techniques.
| Framework | Tokens/sec/device | Step time (s) | Model Flops Utilization (MFU) |
| --- | --- | --- | --- |
| FSDP (full sharding) | 3158.7 | 10.4 | 0.41 |
| DeepSpeed (ZeRO-3) | 3094.5 | 10.6 | 0.40 |
Table 3: Ballpark throughput comparison between FSDP and DeepSpeed on four A100 GPUs.
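As a sanity check on Table 3, MFU can be estimated from the tokens/sec/device figures using the common 6 x N flops-per-token approximation (which ignores attention flops) and the A100's roughly 312 TFLOPS bf16 peak; the parameter count and peak throughput below are approximations, which is why the estimate only roughly matches the table.

```python
# Rough MFU estimate for the FSDP row of Table 3 (approximate values).
N_PARAMS = 7e9                  # IBM Granite 7B, approximate parameter count
TOKENS_PER_SEC_DEVICE = 3158.7  # from Table 3
A100_BF16_PEAK_FLOPS = 312e12   # A100 peak bf16 throughput (approximate)

# ~6 flops per parameter per token for forward + backward, ignoring attention.
achieved_flops = 6 * N_PARAMS * TOKENS_PER_SEC_DEVICE
mfu = achieved_flops / A100_BF16_PEAK_FLOPS
print(f"Estimated MFU: {mfu:.2f}")  # ~0.43, in the same ballpark as Table 3
```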
Closing thoughts
We have provided a new concept guide to help users migrate between the two frameworks. The guide helps users answer questions such as:
- How do we achieve a comparable sharding strategy?
- How do we perform efficient model loading?
- How is weight prefetching managed in FSDP and DeepSpeed?
- What is the equivalent of FSDP wrapping in DeepSpeed?
We also cover the various modes in which these frameworks can be configured in Accelerate.
Toggling between FSDP and DeepSpeed in Accelerate is almost trivial; most of the work consists of modifying the Accelerate config file (see the new concept guide for instructions).
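The config-file route is the recommended one; for completeness, the same toggle can also be sketched in code through Accelerate's plugin objects. The arguments below are a minimal, illustrative subset, not a full configuration.

```python
from accelerate import Accelerator, DeepSpeedPlugin, FullyShardedDataParallelPlugin

USE_DEEPSPEED = True  # flip to switch backends

if USE_DEEPSPEED:
    # ZeRO stage 3 shards parameters, gradients, and optimizer states,
    # the DeepSpeed counterpart of FSDP full sharding (requires deepspeed installed).
    plugin = DeepSpeedPlugin(zero_stage=3)
    accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=plugin)
else:
    plugin = FullyShardedDataParallelPlugin()
    accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=plugin)
```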
Besides the config change, some of the other considerations (also outlined in the guide) are differences in how checkpoints are handled.
All experiments in this blog can be reproduced with the code from the original 🤗 Accelerate issue.
We also intend to follow up with throughput comparisons at larger scale and with techniques to better utilize these GPUs for tuning and alignment jobs while preserving model quality.
Acknowledgments
This work took the combined effort of several teams across multiple organizations. It started with IBM Research, in particular Aldo Pareja, who discovered the issue, and Fabian Lim, who identified the precision gaps and fixed the issue. Zach Mueller and Stas Bekman were incredible in providing feedback and the fixes in Accelerate. Less Wright from Meta's PyTorch team was very helpful with questions on FSDP parameters. Finally, we would like to thank the DeepSpeed team for providing feedback on this blog.