Streamlined and Secure Model Loading with Safetensors
Chapter 1: Introduction to Safetensors
In the realm of machine learning, model security and efficiency are paramount. Traditional methods of model storage often rely on Python's pickle module, which poses significant risks. According to the official Python documentation, using pickle can be dangerous:
Warning: The pickle module is not secure. Only unpickle data you trust.
The potential for executing arbitrary code during unpickling is a serious concern.
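To make the risk concrete, here is a minimal, deliberately harmless sketch (the class name and shell command are illustrative) of how unpickling executes code chosen by whoever produced the file:

import os
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # returning (callable, args) makes pickle.loads invoke that callable.
    def __reduce__(self):
        return (os.system, ("echo this ran during unpickling",))

data = pickle.dumps(Payload())
pickle.loads(data)  # runs the shell command: code execution, not just data

Furthermore, loading large models with pickle can be inefficient. The process involves several steps, sketched in code after the list: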
- An empty model is instantiated.
- The model weights are loaded into memory.
- These weights are then copied into the newly created model.
- The final model is transferred to the appropriate device for inference, such as a GPU.
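In PyTorch code, this pattern looks roughly like the following (a toy model and a placeholder checkpoint name, for illustration):

import torch
import torch.nn as nn

# 1) Instantiate an empty model (allocates one full copy of the weights).
model = nn.Linear(4096, 4096)
# 2) Unpickle the checkpoint into memory (a second full copy).
state_dict = torch.load("checkpoint.bin", map_location="cpu")
# 3) Copy the loaded weights into the model's parameters.
model.load_state_dict(state_dict)
# 4) Move the finished model to the inference device.
model = model.to("cuda:0")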
Because the weights exist twice during steps two and three, once as the unpickled checkpoint and once inside the instantiated model, PyTorch temporarily requires roughly double the model's size in memory. Fortunately, there are alternatives that enhance both security and efficiency. One such solution is safetensors, a format developed by Hugging Face designed for safer and more efficient model loading.
Chapter 2: What Makes Safetensors Unique?
The safetensors format is straightforward, comprising three components:
- A small segment containing the header size: an 8-byte unsigned integer (little-endian).
- A header segment in JSON format, describing each tensor's name, dtype, shape, and byte offsets.
- The main segment containing the tensor data as a single binary buffer.
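Because the layout is this simple, you can inspect a file's header with a few lines of standard-library Python. A minimal sketch (the file name is a placeholder):

import json
import struct

# Read only the header of a .safetensors file; the tensor data is never touched.
with open("model.safetensors", "rb") as f:
    header_size = struct.unpack("<Q", f.read(8))[0]  # 8-byte little-endian integer
    header = json.loads(f.read(header_size))         # the JSON header segment

for name, info in header.items():
    if name != "__metadata__":  # optional free-form metadata entry
        print(name, info["dtype"], info["shape"], info["data_offsets"])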
2.1 Why Is Safetensors Considered Safe?
Unlike pickle, loading a safetensors file never executes code: deserialization amounts to parsing a JSON header and reading a raw byte buffer, so there is no equivalent of pickle's ability to invoke arbitrary callables. The reference implementation is also written in Rust, a language known for its memory safety. Although Rust is not infallible and the implementation may still contain vulnerabilities, this design significantly reduces the risk compared to unpickling unknown binaries with Python.
2.2 Efficiency and Speed of Safetensors
In addition to being secure, safetensors is designed for speed and memory efficiency. Unlike PyTorch's method, which duplicates memory usage during loading, safetensors loads the model directly onto the specified device. For instance, if your model requires 100 GB of memory, the loading process will only use that amount, rather than the 200 GB needed with a pickled model. Additionally, safetensors supports lazy loading, allowing you to access portions of the model without loading it entirely.
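Here is a minimal sketch of lazy loading with the safetensors Python API (the file name and tensor names are placeholders):

from safetensors import safe_open

# Opening the file only parses the header; no tensor data is read yet.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    print(f.keys())  # tensor names, taken from the header
    weight = f.get_tensor("model.embed_tokens.weight")          # load one tensor
    first_rows = f.get_slice("model.embed_tokens.weight")[0:8]  # load part of one tensor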
Chapter 3: Utilizing Safetensors for Model Management
When loading models from the Hugging Face hub, the transformers library defaults to the safetensors format if available. For example, executing the following command will load the safetensors version of Llama 2 7B:
from transformers import AutoModelForCausalLM

# Downloads the .safetensors weights when available and places the model on GPU 0.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map={"": 0})
3.1 Loading and Saving Models
To save a model in the safetensors format, pass the safe_serialization=True parameter (recent versions of transformers already use it by default):
model.save_pretrained("llama2_safetensors", safe_serialization=True)
This command creates a directory containing the weights as one or more .safetensors files (large models are sharded across several), alongside the usual configuration files.
3.2 Converting Existing Models
If you have models stored in the pickled format, the Hugging Face Hub can automatically convert them to safetensors. You can also perform the conversion manually, but since this requires unpickling the original file, it's advisable to do so in an isolated environment such as Google Colab.
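A manual conversion can be a short script. A minimal sketch (the file names are placeholders; note that save_file rejects state dicts containing shared tensors, in which case safetensors.torch.save_model, which takes the model itself, is the safer choice):

import torch
from safetensors.torch import save_file

# Unpickling happens here, so only run this on checkpoints you trust.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
save_file(state_dict, "model.safetensors")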
Chapter 4: Benchmarking Safetensors
For detailed benchmark results comparing safetensors with traditional methods, refer to the original article on The Kaitchup.
Conclusion: The Future of Model Loading
Safetensors stands out as a faster, safer, and more memory-efficient alternative to the conventional PyTorch pickle method. While there are other options available, few offer the same level of efficiency and security that safetensors provides.