NExT-GPT
Any-to-Any Multimodal LLM
NExT-GPT is a novel framework that connects a large language model with multimodal encoders and diffusion decoders to understand and generate content across text, images, audio, and video. It consists of three main stages:
– Multimodal Encoding Stage:
This stage uses state-of-the-art encoders to transform inputs from different modalities, such as text, images, audio, or video, into language-like representations that the language model can understand.
– LLM Understanding and Reasoning Stage:
This stage employs a pre-trained language model to process the encoded inputs and perform semantic understanding and reasoning. The language model outputs text tokens as well as special “modality signal” tokens that indicate which modality should be generated in the next stage and what content it should contain.
– Multimodal Generation Stage:
This stage takes the modality signal tokens from the previous stage and uses them to guide the generation of multimodal content. Depending on the signal, it routes to different decoders to produce text, images, audio, or video that match the input and the desired output.
To get started, clone the repository and set up the necessary environment by executing the following commands:
conda create -n nextgpt python=3.8
conda activate nextgpt
# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
git clone https://github.com/NExT-GPT/NExT-GPT.git
cd NExT-GPT
pip install -r requirements.txt
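To verify that PyTorch was installed with CUDA support, you can run a quick optional check inside the activated environment:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"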
Preparing Pre-trained Checkpoint
NExT-GPT is built on the foundations of several outstanding models. To get the checkpoints ready, please adhere to the guidelines below.
- ImageBind is the unified image/video/audio encoder. The pre-trained checkpoint can be downloaded from here with version huge. Afterward, put the imagebind_huge.pth file at [./ckpt/pretrained_ckpt/imagebind_ckpt/huge].
- Vicuna: first prepare the LLaMA weights by following the instructions [here]. Then put the pre-trained model at [./ckpt/pretrained_ckpt/vicuna_ckpt/].
- Image Diffusion is used to generate images. NExT-GPT uses Stable Diffusion with version v1-5. (It will be downloaded automatically.)
- Audio Diffusion is used to produce audio content. NExT-GPT employs AudioLDM with version l-full. (It will be downloaded automatically.)
- Video Diffusion is used for video generation. We employ ZeroScope with version v2_576w. (It will be downloaded automatically.)
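After following these steps, the manually prepared checkpoints should roughly match the layout sketched below (based on the paths above; the diffusion models are downloaded automatically and are therefore not shown):
./ckpt/pretrained_ckpt/
├── imagebind_ckpt/
│   └── huge/
│       └── imagebind_huge.pth
└── vicuna_ckpt/
    └── ... (Vicuna weights prepared from LLaMA)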
Preparing Dataset
Please download the following datasets used for model training:
A) T-X pairs data
- CC3M of text-image pairs: please follow the instruction [here], then put the data at [./data/T-X_pair_data/cc3m].
- WebVid of text-video pairs: see the [instruction]. The file should be saved at [./data/T-X_pair_data/webvid].
- AudioCap of text-audio pairs: see the [instruction]. Save the data in [./data/T-X_pair_data/audiocap].
B) Instruction data
- T+X-T
  - LLaVA of the visual instruction data: download it from here, and then put it at [./data/IT_data/T+X-T_data/llava].
  - Alpaca of the textual instruction data: download it from here, and then put it at [./data/IT_data/T+X-T_data/alpaca/].
  - VideoChat: download the video instruction data here, and then put it at [./data/IT_data/T+X-T_data/videochat/].
- T-X+T
  Run the following commands to construct the data. Please ensure the above T+X-T datasets are prepared. Afterward, the T-X+T file instruction_data.json will be saved at [./data/IT_data/T-T+X_data].
cd ./code/dataset/
python instruction_dataset.py
- MosIT
  Download the file from here and put it in [./data/IT_data/MosIT_data/]. (We are in the process of finalizing the data and handling the copyright issues; it will be released later.)
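Once everything above is in place, the data directory should look roughly like the sketch below (assembled from the paths listed above; the exact files inside each folder depend on the respective dataset instructions):
./data/
├── T-X_pair_data/
│   ├── cc3m/
│   ├── webvid/
│   └── audiocap/
└── IT_data/
    ├── T+X-T_data/
    │   ├── llava/
    │   ├── alpaca/
    │   └── videochat/
    ├── T-T+X_data/
    │   └── instruction_data.json
    └── MosIT_data/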
Precomputing Embeddings
NExT-GPT uses decoding-side alignment training to bring the representations of the signal tokens and the captions closer together. To reduce time and memory costs, NExT-GPT precomputes the text embeddings for image, audio, and video captions with the text encoders of the diffusion models.
Please run this command before training NExT-GPT; the produced embedding file will be saved at [./data/embed].
cd ./code/
python process_embeddings.py ../data/T-X_pair_data/cc3m/cc3m.json image ../data/embed/ runwayml/stable-diffusion-v1-5
Note of arguments:
- args[1]: path of the caption file;
- args[2]: modality, which can be image, video, or audio;
- args[3]: saving path of the embedding file;
- args[4]: name of the corresponding pre-trained diffusion model.
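The same script can be run for the other modalities. A sketch for video and audio captions, assuming the caption files are named webvid.json and audiocap.json and that the script accepts the Hugging Face model ids of the ZeroScope and AudioLDM versions named above:
python process_embeddings.py ../data/T-X_pair_data/webvid/webvid.json video ../data/embed/ cerspense/zeroscope_v2_576w
python process_embeddings.py ../data/T-X_pair_data/audiocap/audiocap.json audio ../data/embed/ cvssp/audioldm-l-full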
Training NExT-GPT
First, refer to the base configuration file [./code/config/base.yaml] for the basic system settings of the overall modules.
Then, the training of NExT-GPT starts with this script:
cd ./code
bash scripts/train.sh
The script specifies the following command:
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1 \
    --dataset cc3m \
    --data_path ../data/T-X_pair_data/cc3m/cc3m.json \
    --mm_root_path ../data/T-X_pair_data/cc3m/images/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/
where the key arguments are:
- --include: localhost:0 indicates that deepspeed runs on GPU 0.
- --stage: the training stage.
- --dataset: the name of the dataset used for training.
- --data_path: the path of the training data file.
- --mm_root_path: the path of the image/video/audio files.
- --embed_path: the path of the text embedding file.
- --save_path: the directory that saves the trained delta weights. This directory will be created automatically.
- --log_path: the directory that saves the log file.
The whole NExT-GPT training involves 3 steps:
- Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing ImageBind, the LLM, and the output projection layer.
  Just run the above train.sh script by setting (see the example command below):
  - --stage 1
  - --dataset x, where x varies over [cc3m, webvid, audiocap]
  - --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
  - --mm_root_path .../.../x, where x varies over [images, audios, videos]
Also refer to the running config file [./code/config/stage_1.yaml] and deepspeed config file [./code/dsconfig/stage_1.yaml] for more step-wise configurations.
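For instance, a Stage-1 run on the WebVid text-video pairs might look like the sketch below (the webvid.json file name and the videos/ sub-directory are assumptions that follow the data layout above):
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1 \
    --dataset webvid \
    --data_path ../data/T-X_pair_data/webvid/webvid.json \
    --mm_root_path ../data/T-X_pair_data/webvid/videos/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/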
- Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing ImageBind, the LLM, and the input projection layers.
  Just run the above train.sh script by setting (see the example command below):
  - --stage 2
  - --dataset x, where x varies over [cc3m, webvid, audiocap]
  - --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
  - --mm_root_path .../.../x, where x varies over [images, audios, videos]
Also refer to the running config file [./code/config/stage_2.yaml] and deepspeed config file [./code/dsconfig/stage_2.yaml] for more step-wise configurations.
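Analogously, a Stage-2 run on the AudioCap text-audio pairs could be sketched as follows (the audiocap.json file name and the audios/ sub-directory are assumptions):
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 2 \
    --dataset audiocap \
    --data_path ../data/T-X_pair_data/audiocap/audiocap.json \
    --mm_root_path ../data/T-X_pair_data/audiocap/audios/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/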
- Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer, and 3) the output projection layer on the instruction data.
  Just run the above train.sh script by setting (see the example command below):
  - --stage 3
  - --dataset instruction
  - --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/IT_data/T+X-T_data], [./data/IT_data/T-T+X_data], or [./data/IT_data/MosIT_data]
  - --mm_root_path .../.../x, where x varies over [images, audios, videos]
Also refer to the running config file [./code/config/stage_3.yaml] and deepspeed config file [./code/dsconfig/stage_3.yaml] for more step-wise configurations.
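As an example, Stage-3 instruction tuning on the LLaVA data prepared above might be launched as sketched below (the llava.json file name and the images/ sub-directory are assumptions; substitute the actual file names from your instruction data):
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 3 \
    --dataset instruction \
    --data_path ../data/IT_data/T+X-T_data/llava/llava.json \
    --mm_root_path ../data/IT_data/T+X-T_data/llava/images/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/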
Preparing Checkpoints
First, load the pre-trained NExT-GPT system:
- Step-1: load the frozen parameters. Please refer to Preparing Pre-trained Checkpoint above.
- Step-2: load the tunable parameters. Please put the NExT-GPT system in [./ckpt/delta_ckpt/nextgpt/7b_tiva_v0]. You may either 1) use the params trained by yourself, or 2) download our checkpoints from here. (We are still working hard on optimizing the system and will release the params shortly.)
Deploying Gradio Demo
Upon completion of the checkpoint loading, you can run the demo locally via:
cd ./code
bash scripts/app.sh
Specify the key argument:
- --nextgpt_ckpt_path: the path of the pre-trained NExT-GPT params.
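A minimal launch sketch, assuming scripts/app.sh forwards this argument to a Python launcher (the launcher file name app.py is an assumption; the checkpoint path follows the layout above):
# sketch only: the actual launcher invoked by scripts/app.sh may differ
python app.py --nextgpt_ckpt_path ../ckpt/delta_ckpt/nextgpt/7b_tiva_v0/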
Citation
@article{wu2023nextgpt,
title={NExT-GPT: Any-to-Any Multimodal LLM},
author={Shengqiong Wu and Hao Fei and Leigang Qu and Wei Ji and Tat-Seng Chua},
journal = {CoRR},
volume = {abs/2309.05519},
year={2023}
}