Generative Models by Stability AI
Generative Models by Stability AI is a collection of open-source models for generating various types of media, such as images, audio, video, and text. The models are based on the Stable Diffusion framework, which uses a diffusion process to create realistic and diverse samples from natural language prompts. The models can be accessed via the Stability AI platform, API, or GitHub repository.
The models are powered by Amazon SageMaker, which provides scalable and cost-effective compute resources for training and inference. They are also integrated with various tools and plugins, such as ClipDrop, DreamStudio, Photoshop, and Blender, to enable creative and practical applications of generative media.
✔️ Stable Diffusion XL (SDXL): A text-to-image model that can produce high-resolution images with fine details and complex compositions from natural language prompts.
✔️ Stable Diffusion Audio (SDA): A text-to-audio model that can generate realistic and expressive speech, music, and sound effects from natural language prompts.
✔️ Stable Diffusion Video (SDV): A text-to-video model that can generate realistic and dynamic videos with motion, lighting, and scene changes from natural language prompts.
✔️ Stable Diffusion Language (SDL): A text-to-text model that can generate natural and coherent texts with various styles, tones, and formats from natural language prompts.
Installation:
1. Clone the repo
git clone git@github.com:Stability-AI/generative-models.git
cd generative-models
2. Set up the virtualenv
Make sure you are in the generative-models directory after cloning it.
PyTorch 1.13
# install required packages from pypi
python3 -m venv .pt1
source .pt1/bin/activate
pip3 install wheel
pip3 install -r requirements_pt13.txt
PyTorch 2.0
# install required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install wheel
pip3 install -r requirements_pt2.txt
Inference:
You can use scripts/demo/sampling.py to run a streamlit demo for sampling from the text-to-image and image-to-image models. The currently supported models are SDXL-0.9-Base and SDXL-0.9-Refiner.
Weights for SDXL: To use these models for your research, fill out the application form for either the SDXL-0.9-Base model or the SDXL-0.9-Refiner. You will get access to both models if your application is approved. Make sure you sign in to your HuggingFace Account with your organization email before requesting access.
After obtaining the weights, put them into checkpoints/. Then launch the demo using:
streamlit run scripts/demo/sampling.py --server.port <your_port>
Invisible Watermark Detection
The code uses the invisible-watermark library to add a hidden watermark to the images produced by the model. We also include a script to detect the watermark easily. This watermark is different from the ones in the previous Stable Diffusion 1.x/2.x versions.
You can run the detection script in two ways: either install the full set of required packages as described above, or use an experimental, lighter-weight setup that needs fewer packages:
python -m venv .detect
source .detect/bin/activate
pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark
Make sure you have installed everything correctly as described above. Then you can use the script in the following ways (activate your virtual environment first, for example source .pt1/bin/activate):
# test a single file
python scripts/demo/detect.py <your filename here>
# test multiple files at once
python scripts/demo/detect.py <filename 1> <filename 2> ... <filename n>
# test all files in a specific folder
python scripts/demo/detect.py <your folder name here>/*
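Under the hood, detection relies on the invisible-watermark decoder. The following is a minimal Python sketch of that decoding step, assuming a 48-bit watermark and the dwtDct method; the exact bit pattern and settings checked by scripts/demo/detect.py are defined in that script and are not reproduced here.
import cv2
from imwatermark import WatermarkDecoder

# Read the image in BGR order, as expected by invisible-watermark.
bgr = cv2.imread("sample.png")

# Decode a fixed-length bit watermark; 48 bits and the "dwtDct" method are
# assumptions for illustration, not the authoritative settings of the repository's detector.
decoder = WatermarkDecoder("bits", 48)
bits = decoder.decode(bgr, "dwtDct")
print("recovered bits:", list(bits))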
Training:
You can find sample training configs in configs/example_training. To start training, execute the following command.
python main.py --base configs/<config1.yaml> configs/<config2.yaml>
The order of configs matters: the last one overrides the previous ones. This lets you mix and match configs for model, training, and data. You can also put everything in one config. For example, this command trains a class-conditional pixel-based diffusion model on MNIST.
python main.py --base configs/example_training/toy/mnist_cond.yaml
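The override behavior can be understood as a left-to-right merge of the YAML files. Below is a minimal sketch of that merge, assuming the configs are resolved with OmegaConf (as in comparable latent-diffusion-style codebases); the keys and values are placeholders, not contents of the shipped example configs.
from omegaconf import OmegaConf

# Later configs override earlier ones, mirroring
#   python main.py --base configs/<config1.yaml> configs/<config2.yaml>
base = OmegaConf.create({"model": {"base_learning_rate": 1e-4}, "data": {"batch_size": 16}})
override = OmegaConf.create({"data": {"batch_size": 64}})

merged = OmegaConf.merge(base, override)   # right-most values win on conflicts
print(OmegaConf.to_yaml(merged))           # data.batch_size is now 64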
Building New Diffusion Models
Conditioner
The conditioner_config sets up the GeneralConditioner, which holds a list of embedders (subclasses of AbstractEmbModel) used to condition the generative model. Each embedder specifies whether it is trainable (is_trainable, default False), a dropout rate for classifier-free guidance (ucg_rate, default 0), and an input key (input_key), such as txt for text conditioning or cls for class conditioning. When computing conditionings, the embedder receives batch[input_key] as input. Conditionings with two to four dimensions are supported, and the conditionings of different embedders are concatenated appropriately. Note that the order of the embedders in the conditioner_config matters.
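As an illustration, a conditioner_config along these lines could look as follows. The repository's configs are YAML; the sketch below writes the same structure as a Python dict, and the embedder class paths and parameter values are assumptions, not verified entries from the example configs.
# Minimal sketch of a conditioner_config (assumed class paths, hypothetical values).
conditioner_config = {
    "target": "sgm.modules.GeneralConditioner",
    "params": {
        "emb_models": [
            {   # text conditioning: reads batch["txt"]
                "is_trainable": False,        # keep the text encoder frozen (default)
                "ucg_rate": 0.1,              # drop the condition 10% of the time for CFG
                "input_key": "txt",
                "target": "sgm.modules.encoders.modules.FrozenCLIPEmbedder",  # assumed path
            },
            {   # class conditioning: reads batch["cls"]
                "is_trainable": True,
                "ucg_rate": 0.0,
                "input_key": "cls",
                "target": "sgm.modules.encoders.modules.ClassEmbedder",       # assumed path
                "params": {"embed_dim": 512, "n_classes": 10},                # hypothetical values
            },
        ]
    },
}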
Network
The network_config parameter defines the neural network architecture. It has been renamed from unet_config to make it more flexible for future experiments with different diffusion backbones, such as transformers.
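For illustration, a network_config follows the same target/params pattern; the UNet class path and hyperparameters below are assumptions, not values from a shipped config.
# Minimal sketch of a network_config (assumed class path, hypothetical hyperparameters).
network_config = {
    "target": "sgm.modules.diffusionmodules.openaimodel.UNetModel",  # assumed path
    "params": {
        "in_channels": 4,
        "out_channels": 4,
        "model_channels": 320,
        "num_res_blocks": 2,
        "attention_resolutions": [4, 2],
        "channel_mult": [1, 2, 4],
        "num_head_channels": 64,
    },
}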
Loss
To train a standard diffusion model, you need to specify the loss_config and the sigma_sampler_config. The loss_config determines how the loss is computed.
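A sketch of these two configs, again as Python dicts with assumed class paths and hypothetical parameters, might look like this; how the two are wired together in a full training config is defined by the example YAML files and is not shown here.
# Minimal sketch of a loss_config and a sigma_sampler_config
# (assumed class paths, hypothetical parameters).
loss_config = {
    "target": "sgm.modules.diffusionmodules.loss.StandardDiffusionLoss",  # assumed path
    "params": {},
}

sigma_sampler_config = {
    "target": "sgm.modules.diffusionmodules.sigma_sampling.EDMSampling",  # assumed path
    "params": {"p_mean": -1.2, "p_std": 1.2},  # hypothetical log-normal noise-level sampling
}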
Sampler config
The sampler is independent of the model. It is configured by choosing the numerical solver, the discretization method, the number of sampling steps, and, if needed, a classifier-free guidance wrapper.
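Concretely, a sampler config along these lines could pair a solver with a discretization and an optional classifier-free guidance wrapper; the class paths and values below are assumptions for illustration.
# Minimal sketch of a sampler config (assumed class paths, hypothetical values).
sampler_config = {
    "target": "sgm.modules.diffusionmodules.sampling.EulerEDMSampler",  # assumed solver
    "params": {
        "num_steps": 50,  # number of sampling steps
        "discretization_config": {
            "target": "sgm.modules.diffusionmodules.discretizer.EDMDiscretization",  # assumed
        },
        "guider_config": {
            "target": "sgm.modules.diffusionmodules.guiders.VanillaCFG",  # classifier-free guidance
            "params": {"scale": 7.5},  # hypothetical guidance scale
        },
    },
}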
Dataset Handling
To train large-scale models, we suggest using the data pipelines from the datapipelines project. This project is included in the requirements and installed automatically when you follow the steps in the Installation section. Small map-style datasets (for example, MNIST or CIFAR-10) should be defined directly in this repository and return a dictionary of data keys and values, for example:
example = {"jpg": x, # this is a tensor -1...1 chw
"txt": "a beautiful image"}
where images should be shown in a -1…1 channel-first format.
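A minimal map-style dataset that returns this dictionary format could look like the sketch below, assuming PyTorch is available; the class name and tensor layout are illustrative only.
import torch
from torch.utils.data import Dataset

class ToyCaptionedImages(Dataset):  # hypothetical example class
    """Map-style dataset returning {"jpg": CHW tensor in [-1, 1], "txt": caption}."""

    def __init__(self, images, captions):
        # images: uint8 tensor of shape [N, H, W, C]; captions: list of strings
        self.images = images
        self.captions = captions

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx].float() / 127.5 - 1.0  # scale uint8 [0, 255] to [-1, 1]
        img = img.permute(2, 0, 1)                    # HWC -> CHW
        return {"jpg": img, "txt": self.captions[idx]}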