NExT-GPT is a novel framework that connects a language model with multimodal encoders and decoders to generate multimodal content. It consists of three main stages:
– Multimodal Encoding Stage:
This stage uses state-of-the-art encoders to transform inputs from different modalities, such as text, images, videos, or audio, into language-like representations that the language model can understand.
– LLM Understanding and Reasoning Stage:
This stage employs a pre-trained language model to process the encoded inputs and perform semantic understanding and reasoning. The language model outputs text tokens as well as special “modality signal” tokens that indicate what type of content to generate in the next stage and how to generate it.
– Multimodal Generation Stage:
This stage takes the modality signal tokens from the previous stage and uses them to guide the generation of multimodal content. Depending on the signal, it uses different decoders to produce text, images, videos, or audio that match the input and the desired output.
To get started, you need to clone the repository and set up the necessary environment. You can do this by executing these commands:
conda create -n nextgpt python=3.8
conda activate nextgpt
# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
git clone https://github.com/NExT-GPT/NExT-GPT.git
cd NExT-GPT
pip install -r requirements.txt
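Before moving on, it can help to confirm that the environment is set up as expected. This quick check is not part of the official instructions, just a sanity test:
# Check the installed PyTorch version, its CUDA build, and GPU visibility
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"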
Preparing Pre-trained Checkpoint
NExT-GPT is built on the foundations of several outstanding models. To get the checkpoints ready, please adhere to the guidelines below.
- ImageBind is the unified image/video/audio encoder. The pre-trained checkpoint can be downloaded from here with version huge. Afterward, put the imagebind_huge.pth file at [./ckpt/pretrained_ckpt/imagebind_ckpt/huge].
- Vicuna: first prepare the LLaMA weights by following the instructions [here]. Then put the pre-trained model at [./ckpt/pretrained_ckpt/vicuna_ckpt/].
- Image Diffusion is used to generate images. NExT-GPT uses Stable Diffusion with version v1-5. (will be automatically downloaded)
- Audio Diffusion is used to produce audio content. NExT-GPT employs AudioLDM with version l-full. (will be automatically downloaded)
- Video Diffusion is used for video generation. We employ ZeroScope with version v2_576w. (will be automatically downloaded)
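Once these are in place, the checkpoint directory should roughly contain the following. This is only a sketch based on the paths above; only ImageBind and Vicuna require manual placement, since the diffusion models are downloaded automatically:
mkdir -p ./ckpt/pretrained_ckpt/imagebind_ckpt/huge
mkdir -p ./ckpt/pretrained_ckpt/vicuna_ckpt
# after downloading:
#   ./ckpt/pretrained_ckpt/imagebind_ckpt/huge/imagebind_huge.pth
#   ./ckpt/pretrained_ckpt/vicuna_ckpt/<prepared Vicuna weights>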
Preparing Dataset
Please download the following datasets used for model training:
A) T-X pairs data
- CC3M of text-image pairs: please follow the instructions [here], then put the data at [./data/T-X_pair_data/cc3m].
- WebVid of text-video pairs: see the [instruction]; the data should be saved at [./data/T-X_pair_data/webvid].
- AudioCap of text-audio pairs: see the [instruction]; save the data in [./data/T-X_pair_data/audiocap].
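For orientation, the resulting layout should look roughly like this. The WebVid and AudioCap annotation file names are assumptions that mirror the cc3m.json file used later, so adjust them to whatever your download actually produces:
mkdir -p ./data/T-X_pair_data/{cc3m,webvid,audiocap}
# expected contents, approximately:
#   ./data/T-X_pair_data/cc3m/cc3m.json         plus cc3m/images/
#   ./data/T-X_pair_data/webvid/webvid.json     plus webvid/videos/
#   ./data/T-X_pair_data/audiocap/audiocap.json plus audiocap/audios/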
B) Instruction data
- T+X-T
  - LLaVA of the visual instruction data, download it from here, and then put it at [./data/IT_data/T+X-T_data/llava].
  - Alpaca of the textual instruction data, download it from here, and then put it at [./data/IT_data/T+X-T_data/alpaca/].
  - VideoChat, download the video instruction data here, and then put it at [./data/IT_data/T+X-T_data/videochat/].
- T-X+T
  - Run the following commands to construct the data. Please ensure the above T+X-T datasets are prepared. Afterward, the T-X+T file instruction_data.json will be saved at [./data/IT_data/T-T+X_data].
    cd ./code/dataset/
    python instruction_dataset.py
- MosIT
  - Download the file from here and put it in [./data/IT_data/MosIT_data/]. (We are in the process of finalizing the data and handling the copyright issue. Will release it later.)
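Putting the instruction-data paths above together, a directory sketch (nothing here beyond the paths already listed in this section):
mkdir -p ./data/IT_data/T+X-T_data/{llava,alpaca,videochat}
mkdir -p ./data/IT_data/T-T+X_data ./data/IT_data/MosIT_data
# ./data/IT_data/T+X-T_data/{llava,alpaca,videochat}/   downloaded instruction data
# ./data/IT_data/T-T+X_data/instruction_data.json       produced by instruction_dataset.py
# ./data/IT_data/MosIT_data/                            MosIT data (release pending)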
Precomputing Embeddings
NExT-GPT uses decoding-side alignment training to bring the representations of the signal tokens and the captions closer. To reduce time and memory costs, NExT-GPT precomputes the text embeddings of the image, audio, and video captions with the text encoder of the corresponding diffusion models.
Please run this command before training NExT-GPT; the produced embedding file will be saved at [./data/embed].
cd ./code/
python process_embeddings.py ../data/T-X_pair_data/cc3m/cc3m.json image ../data/embed/ runwayml/stable-diffusion-v1-5
Note of arguments:
- args[1]: path of the caption file;
- args[2]: modality, which can be image, video, or audio;
- args[3]: saving path of the embedding file;
- args[4]: name of the corresponding pre-trained diffusion model.
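Analogous commands for the other modalities would look like the following. The caption file names and the AudioLDM/ZeroScope model identifiers are assumptions, so substitute the names that match your local data and the settings in [./code/config/base.yaml]:
# text-video captions (WebVid); model id assumed to be the Hugging Face ZeroScope v2_576w repo
python process_embeddings.py ../data/T-X_pair_data/webvid/webvid.json video ../data/embed/ cerspense/zeroscope_v2_576w
# text-audio captions (AudioCap); model id assumed to be the Hugging Face AudioLDM l-full repo
python process_embeddings.py ../data/T-X_pair_data/audiocap/audiocap.json audio ../data/embed/ cvssp/audioldm-l-full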
Training NExT-GPT
First of all, kindly refer to the base configuration file [./code/config/base.yaml] for the basic system settings of the overall modules.
Then, the training of NExT-GPT starts with this script:
cd ./code
bash scripts/train.sh
The script runs the following command:
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1 \
    --dataset cc3m \
    --data_path ../data/T-X_pair_data/cc3m/cc3m.json \
    --mm_root_path ../data/T-X_pair_data/cc3m/images/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/
where the key arguments are:
- --include: localhost:0 indicates that deepspeed runs on GPU (CUDA device) 0.
- --stage: the training stage.
- --dataset: the dataset name for training the model.
- --data_path: the data path of the training file.
- --mm_root_path: the data path of the image/video/audio files.
- --embed_path: the data path of the text embedding file.
- --save_path: the directory that saves the trained delta weights. This directory will be created automatically.
- --log_path: the directory that saves the log file.
The whole NExT-GPT training involves 3 steps:
- Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing ImageBind, the LLM, and the output projection layer.
  Just run the above train.sh script by setting the following (an example command is shown after the config note below):
  - --stage 1
  - --dataset x, where x varies from [cc3m, webvid, audiocap]
  - --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
  - --mm_root_path .../.../x, where x varies from [images, audios, videos]
Also refer to the running config file [./code/config/stage_1.yaml] and deepspeed config file [./code/dsconfig/stage_1.yaml] for more step-wise configurations.
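As a concrete illustration, a stage-1 run on WebVid might look like this; webvid.json and the videos/ folder name follow the patterns stated above and are assumptions, not confirmed file names:
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1 \
    --dataset webvid \
    --data_path ../data/T-X_pair_data/webvid/webvid.json \
    --mm_root_path ../data/T-X_pair_data/webvid/videos/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/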
- Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing ImageBind, the LLM, and the input projection layer.
  Just run the above train.sh script by setting the following (an example command is shown after the config note below):
  - --stage 2
  - --dataset x, where x varies from [cc3m, webvid, audiocap]
  - --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
  - --mm_root_path .../.../x, where x varies from [images, audios, videos]
Also refer to the running config file [./code/config/stage_2.yaml] and deepspeed config file [./code/dsconfig/stage_2.yaml] for more step-wise configurations.
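For example, a stage-2 run on AudioCap could look like the following; again, audiocap.json and the audios/ folder name are assumed from the patterns above:
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 2 \
    --dataset audiocap \
    --data_path ../data/T-X_pair_data/audiocap/audiocap.json \
    --mm_root_path ../data/T-X_pair_data/audiocap/audios/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/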
- Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer, and 3) the output projection layer on the instruction datasets.
  Just run the above train.sh script by setting the following (an example command is shown after the config note below):
  - --stage 3
  - --dataset instruction
  - --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/IT_data/T+X-T_data], [./data/IT_data/T-T+X_data], or [./data/IT_data/MosIT_data]
  - --mm_root_path .../.../x, where x varies from [images, audios, videos]
Also refer to the running config file [./code/config/stage_3.yaml] and deepspeed config file [./code/dsconfig/stage_3.yaml] for more step-wise configurations.
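A stage-3 sketch, keeping the xxx.json placeholder from above since the instruction file name depends on which instruction dataset you use (the x/ media folder is likewise a placeholder):
# xxx.json and x/ below are placeholders; substitute the actual instruction file and media folder
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 3 \
    --dataset instruction \
    --data_path ../data/IT_data/T+X-T_data/xxx.json \
    --mm_root_path ../data/IT_data/T+X-T_data/x/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/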
Preparing Checkpoints
First, load the pre-trained NExT-GPT system.
- Step-1: load Frozen parameters. Please refer to the Preparing Pre-trained Checkpoint section above.
- Step-2: load Tunable parameters. Please put the NExT-GPT system at [./ckpt/delta_ckpt/nextgpt/7b_tiva_v0]. You may either 1) use the params trained by yourself, or 2) download our checkpoints from here. (We are still working hard on optimizing the system and will release the params shortly.)
Deploying Gradio Demo
Upon completion of the checkpoint loading, you can run the demo locally via:
cd ./code
bash scripts/app.sh
Specifying the key argument:
- --nextgpt_ckpt_path: the path of the pre-trained NExT-GPT params.
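Under the hood, app.sh presumably launches the Gradio app with that argument, along these lines (the demo_app.py script name is an assumption; check scripts/app.sh for the actual entry point):
python demo_app.py --nextgpt_ckpt_path ../ckpt/delta_ckpt/nextgpt/7b_tiva_v0/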
@article{wu2023nextgpt,
  title={NExT-GPT: Any-to-Any Multimodal LLM},
  author={Shengqiong Wu and Hao Fei and Leigang Qu and Wei Ji and Tat-Seng Chua},
  journal={CoRR},
  volume={abs/2309.05519},
  year={2023}
}