Generative Models by Stability AI

Generative Models by Stability AI is a collection of open-source models for generating various types of media, such as images, audio, video, and text. The models are based on the Stable Diffusion framework, which uses a diffusion process to create realistic and diverse samples from natural language prompts. The models can be accessed via the Stability AI platform, API, or GitHub repository.

Generative Models by Stability AI are powered by Amazon SageMaker, which provides scalable and cost-effective compute resources for training and inference. The models are also integrated with various tools and plugins, such as ClipDrop, DreamStudio, Photoshop, and Blender, to enable creative and practical applications for generative media.

Models

✔️ Stable Diffusion XL (SDXL): A text-to-image model that can produce high-resolution images with fine details and complex compositions from natural language prompts.
✔️ Stable Diffusion Audio (SDA): A text-to-audio model that can generate realistic and expressive speech, music, and sound effects from natural language prompts.
✔️ Stable Diffusion Video (SDV): A text-to-video model that can generate realistic and dynamic videos with motion, lighting, and scene changes from natural language prompts.
✔️ Stable Diffusion Language (SDL): A text-to-text model that can generate natural and coherent texts with various styles, tones, and formats from natural language prompts.

Getting Started 🚀

Installation:

1. Clone the repo

git clone git@github.com:Stability-AI/generative-models.git
cd generative-models

2. Setting up the virtualenv

Make sure you are in the generative-models directory after cloning it.

NOTE: This is tested under python3.8 and python3.10. For other python versions, you might encounter version conflicts.

PyTorch 1.13

# install required packages from pypi
python3 -m venv .pt1
source .pt1/bin/activate
pip3 install wheel
pip3 install -r requirements_pt13.txt

PyTorch 2.0

# install required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install wheel
pip3 install -r requirements_pt2.txt

Inference:

You can use scripts/demo/sampling.py to run a Streamlit demo for sampling from text-to-image and image-to-image models. The supported models include the SDXL-0.9 base and refiner models, whose weights are described below.

Weights for SDXL: To use these models for your research, fill out the application form for either the SDXL-0.9-Base model or the SDXL-0.9-Refiner model. If your application is approved, you get access to both models. Make sure you sign in to your Hugging Face account with your organization email before requesting access.

After obtaining the weights, put them into checkpoints/. Then, launch the demo using:

streamlit run scripts/demo/sampling.py --server.port <your_port>

Invisible Watermark Detection

The code uses the invisible-watermark library to add a hidden watermark to the images produced by the model. We also include a script to detect the watermark easily. This watermark is different from the ones in the previous Stable Diffusion 1.x/2.x versions.

You can execute the script in two ways: either install the required packages as described above or use an experimental import that needs fewer packages.

python -m venv .detect
source .detect/bin/activate

pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark

Make sure you have installed everything correctly as described above. Then you can use the script in these ways (activate your virtual environment first, for example, source .pt1/bin/activate):

# test a single file
python scripts/demo/detect.py <your filename here>
# test multiple files at once
python scripts/demo/detect.py <filename 1> <filename 2> ... <filename n>
# test all files in a specific folder
python scripts/demo/detect.py <your folder name here>/*

Training:

You can find sample training configs in configs/example_training. To start training, execute the following command.

python main.py --base configs/<config1.yaml> configs/<config2.yaml>

The order of configs matters: the last one overrides the previous ones. This lets you mix and match configs for model, training, and data. You can also put everything in one config. For example, this command trains a class-conditional pixel-based diffusion model on MNIST.

python main.py --base configs/example_training/toy/mnist_cond.yaml
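The override rule can be illustrated with a small, self-contained sketch. It is not the repository's config-loading code: merge_configs and the example keys are hypothetical, and plain dictionaries stand in for the parsed YAML files.

```python
from copy import deepcopy


def merge_configs(*configs):
    """Recursively merge dicts; values from later configs win."""
    merged = {}
    for cfg in configs:
        for key, value in cfg.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are sections: merge them key by key.
                merged[key] = merge_configs(merged[key], value)
            else:
                # Scalars, lists, and new sections are simply replaced.
                merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for two YAML files passed to --base, in order.
model_cfg = {"model": {"base_learning_rate": 1e-4}, "data": {"batch_size": 64}}
override_cfg = {"data": {"batch_size": 16}}

merged = merge_configs(model_cfg, override_cfg)
print(merged["data"]["batch_size"])  # the value from the last config
```

Keys that appear only in the earlier config survive the merge, which is why model, training, and data settings can live in separate files.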

NOTE 1: To train with the non-toy-dataset configs configs/example_training/imagenet-f8_cond.yaml, configs/example_training/txt2img-clipl.yaml, and configs/example_training/txt2img-clipl-legacy-ucg-training.yaml, you need to edit them for your dataset (expected to be stored as tar files in webdataset format). Look for USER: comments in the configs to see what to change.

NOTE 2: This repository is compatible with both PyTorch 1.13 and PyTorch 2.0 for generative model training. For autoencoder training, such as in configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml, you need to use PyTorch 1.13.

NOTE 3: To train latent generative models (for example, with configs/example_training/imagenet-f8_cond.yaml), you need to download the checkpoint from Hugging Face and replace the CKPT_PATH placeholder in the config. Follow the same steps for the text-to-image configs.

Building New Diffusion Models

Conditioner

The conditioner_config sets up the GeneralConditioner, which holds a list of embedders (subclasses of AbstractEmbModel) used to condition the generative model. Each embedder specifies whether it is trainable (is_trainable, default False), a dropout rate for classifier-free guidance (ucg_rate, default 0), and an input key (input_key), such as txt for text conditioning or cls for class conditioning. When computing conditionings, the embedder takes batch[input_key] as input. We support conditionings with two to four dimensions and concatenate the conditionings of different embedders appropriately. The order of the embedders in the conditioner_config matters.
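The following toy sketch illustrates the idea of keyed embedders whose outputs are concatenated in order. It is an assumption-laden illustration, not the repository's GeneralConditioner: ToyEmbedder and ToyConditioner are hypothetical names, plain lists replace tensors, and only two-dimensional (batch by feature) conditionings are handled.

```python
import random


class ToyEmbedder:
    """Hypothetical stand-in for an AbstractEmbModel subclass."""

    def __init__(self, input_key, dim, is_trainable=False, ucg_rate=0.0):
        self.input_key = input_key
        self.dim = dim
        self.is_trainable = is_trainable
        self.ucg_rate = ucg_rate

    def __call__(self, values):
        # Fake "embedding": repeat the string length of each input dim times.
        return [[float(len(str(v)))] * self.dim for v in values]


class ToyConditioner:
    """Runs every embedder on batch[input_key] and concatenates the results."""

    def __init__(self, embedders, seed=0):
        self.embedders = embedders
        self.rng = random.Random(seed)

    def __call__(self, batch):
        batch_size = len(batch[self.embedders[0].input_key])
        rows = [[] for _ in range(batch_size)]
        for emb in self.embedders:  # embedder order fixes the feature layout
            out = emb(batch[emb.input_key])
            for row, vec in zip(rows, out):
                # Conditioning dropout: zero the embedding with prob. ucg_rate.
                if self.rng.random() < emb.ucg_rate:
                    vec = [0.0] * len(vec)
                row.extend(vec)
        return rows


conditioner = ToyConditioner([
    ToyEmbedder("txt", dim=3),                # never dropped (ucg_rate=0)
    ToyEmbedder("cls", dim=2, ucg_rate=1.0),  # always dropped, for illustration
])
batch = {"txt": ["a cat", "dog"], "cls": [7, 42]}
cond = conditioner(batch)
```

Swapping the two embedders would change which feature positions hold the text and class conditionings, which is one way to see why the order in the config matters.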

Network

The network_config parameter defines the neural network architecture. It has been renamed from unet_config to make it more flexible for future experiments with different diffusion backbones, such as transformers.

Loss

To train the standard diffusion model, you need to specify the loss_config and the sigma_sampler_config. The loss_config determines how the loss is computed.
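As a rough illustration of how a noise-level sampler and a loss fit together, here is a minimal sketch of one denoising training objective. Everything in it is an assumption made for illustration: the log-normal noise distribution, the unweighted mean squared error, and the names sample_sigma and denoising_loss are not taken from the repository's configs.

```python
import math
import random


def sample_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw a noise level from a log-normal distribution (assumed choice)."""
    return math.exp(rng.gauss(p_mean, p_std))


def denoising_loss(denoiser, x, sigma, rng):
    """Mean squared error between the denoiser output and the clean input."""
    # Corrupt the clean sample with Gaussian noise of standard deviation sigma.
    noisy = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    pred = denoiser(noisy, sigma)
    return sum((p - xi) ** 2 for p, xi in zip(pred, x)) / len(x)


rng = random.Random(0)
x = [0.5, -0.25, 1.0]
sigma = sample_sigma(rng)

# A denoiser that returns its noisy input unchanged leaves the noise in place.
identity_loss = denoising_loss(lambda noisy, s: noisy, x, sigma, rng)
# An oracle that returns the clean sample reaches zero loss.
oracle_loss = denoising_loss(lambda noisy, s: x, x, sigma, rng)
```

In this picture, the sampler decides which noise levels the model sees during training, and the loss decides how the prediction error at each level is scored.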

Sampler config

The sampler is independent of the model. We configure it by choosing the numerical solver, the discretization method, the number of steps, and, if needed, guidance wrappers for classifier-free guidance.
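The separation between sampler and model can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's sampler: linear_sigmas, with_guidance, euler_sample, and toy_denoiser are hypothetical names, the solver is a plain Euler method, and lists of floats stand in for tensors.

```python
def linear_sigmas(num_steps, sigma_max=10.0):
    """Hypothetical discretization: evenly spaced noise levels ending at 0."""
    return [sigma_max * (1 - i / num_steps) for i in range(num_steps + 1)]


def with_guidance(denoiser, scale):
    """Classifier-free guidance wrapper around a conditional denoiser."""
    def guided(x, sigma, cond):
        uncond = denoiser(x, sigma, None)
        conditional = denoiser(x, sigma, cond)
        # Move from the unconditional prediction toward the conditional one.
        return [u + scale * (c - u) for u, c in zip(uncond, conditional)]
    return guided


def euler_sample(denoiser, x, sigmas, cond):
    """Euler solver; works with any callable denoiser(x, sigma, cond)."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, sigma, cond)
        x = [xi + (xi - di) / sigma * (sigma_next - sigma)
             for xi, di in zip(x, denoised)]
    return x


def toy_denoiser(x, sigma, cond):
    """Toy model: predicts the conditioning target, or zeros without one."""
    return list(cond) if cond is not None else [0.0] * len(x)


target = [1.0, 0.5]
sample = euler_sample(with_guidance(toy_denoiser, scale=1.0),
                      x=[3.0, -2.0], sigmas=linear_sigmas(10), cond=target)
```

The solver only ever calls the denoiser as a function, so the step count, the discretization, or the guidance scale can change without touching the model.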

Dataset Handling

To train large-scale models, we suggest using the data pipelines from the datapipelines project. This project is listed in the requirements and is installed automatically when you follow the steps in the Installation section. Small map-style datasets (for example, MNIST or CIFAR-10) should be defined here in the repository and should return a dictionary of data keys and values, for example:

example = {"jpg": x,  # this is a tensor -1...1 chw
           "txt": "a beautiful image"}

where images are expected in channel-first format with values in the range -1…1.
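A small map-style dataset that follows this convention might look like the sketch below. ToyImageDataset is a hypothetical class, and nested lists stand in for tensors so that the example has no dependencies; a real dataset would typically return a tensor under the jpg key.

```python
class ToyImageDataset:
    """Hypothetical map-style dataset returning a dict of data keys."""

    def __init__(self, images, captions):
        # images: list of H x W x C nested lists with values in 0..255
        assert len(images) == len(captions)
        self.images = images
        self.captions = captions

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        hwc = self.images[idx]
        height, width, channels = len(hwc), len(hwc[0]), len(hwc[0][0])
        # Rescale 0..255 to -1..1 and reorder H x W x C to C x H x W.
        chw = [[[hwc[h][w][c] / 127.5 - 1.0 for w in range(width)]
                for h in range(height)]
               for c in range(channels)]
        return {"jpg": chw, "txt": self.captions[idx]}


# One 1 x 1 image with three channels, paired with a caption.
dataset = ToyImageDataset(images=[[[[0, 255, 127.5]]]],
                          captions=["a beautiful image"])
example = dataset[0]
```

The keys of the returned dictionary are the ones that embedders select through input_key, so a class-conditional dataset would return a cls entry in the same way.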
