NExT-GPT: Any-to-Any Multimodal LLM



NExT-GPT is a framework that connects a language model with multimodal encoders and decoders so that it can both understand and generate multimodal content. It consists of three main stages:

– Multimodal Encoding Stage:

This stage uses pre-trained encoders to transform inputs from different modalities, such as text, images, audio, or videos, into language-like representations that the language model can understand.

– LLM Understanding and Reasoning Stage:

This stage employs a pre-trained language model to process the encoded inputs and perform semantic understanding and reasoning. The language model outputs text tokens as well as special “modality signal” tokens that indicate what type of content to generate in the next stage and how to generate it.

– Multimodal Generation Stage:

This stage takes the modality signal tokens from the previous stage and uses them to guide the generation of multimodal content. Depending on the signal, it routes the request to a different decoder to produce images, audio, or videos alongside the text response.
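The three stages can be pictured with a small toy pipeline. This is only an illustrative sketch: the signal token names (`<IMG>`, `<VID>`, `<AUD>`), the `toy_llm` function, and the string-returning decoders are hypothetical stand-ins, not the tokens or models NExT-GPT actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signal tokens; the real special tokens are defined in the NExT-GPT code.
SIGNAL_TOKENS = {"<IMG>": "image", "<VID>": "video", "<AUD>": "audio"}


@dataclass
class Output:
    modality: str
    content: str


def encode(inputs: Dict[str, str]) -> List[str]:
    """Stage 1 stand-in: map every input to a language-like representation."""
    return [f"[{modality}:{payload}]" for modality, payload in inputs.items()]


def toy_llm(encoded: List[str]) -> List[str]:
    """Stage 2 stand-in: emit text tokens and, if asked to draw, a signal token."""
    tokens = ["Here", "is", "the", "result"]
    if any("draw" in item for item in encoded):
        tokens.append("<IMG>")
    return tokens


def generate(tokens: List[str], decoders: Dict[str, Callable[[str], str]]) -> List[Output]:
    """Stage 3: plain tokens become text; each signal token is routed to a decoder."""
    text = " ".join(t for t in tokens if t not in SIGNAL_TOKENS)
    outputs = [Output("text", text)]
    for token in tokens:
        if token in SIGNAL_TOKENS:
            modality = SIGNAL_TOKENS[token]
            outputs.append(Output(modality, decoders[modality](text)))
    return outputs


if __name__ == "__main__":
    # Placeholder decoders that only describe what they would generate.
    decoders = {
        m: (lambda prompt, m=m: f"<{m} conditioned on '{prompt}'>")
        for m in ("image", "video", "audio")
    }
    for out in generate(toy_llm(encode({"text": "draw a cat"})), decoders):
        print(out.modality, "->", out.content)
```

The point of the sketch is the routing step: the language model never produces pixels or waveforms itself, it only emits a signal that selects which decoder runs.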

Environment Preparation

To get started, clone the repo and set up the necessary environment by running these commands:

conda create -n nextgpt python=3.8

conda activate nextgpt

# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

git clone

pip install -r requirements.txt
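After installation, it can be useful to confirm that the pinned versions are the ones actually present in the environment. The helper below is a minimal sketch using only the standard library; the function name `check_versions` is made up for this example, and it assumes the packages are registered under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions taken from the conda install command above.
EXPECTED = {"torch": "1.13.1", "torchvision": "0.14.1", "torchaudio": "0.13.1"}


def check_versions(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Return {package: (wanted, found, ok)}; found is None if not installed."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "+cu116" are ignored in the comparison.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, ok)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {wanted}, found {found} ({status})")
```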

Training NExT-GPT on your own

Preparing Pre-trained Checkpoint

NExT-GPT is built on top of several existing models. To get the checkpoints ready, follow the steps below.

  • ImageBind is the unified image/video/audio encoder. Download the pre-trained checkpoint, version huge, from here. Afterward, put the imagebind_huge.pth file at [./ckpt/pretrained_ckpt/imagebind_ckpt/huge].
  • Vicuna: first prepare LLaMA by following the instructions [here]. Then put the pre-trained model at [./ckpt/pretrained_ckpt/vicuna_ckpt/].
  • Image Diffusion is used to generate images. NExT-GPT uses Stable Diffusion, version v1-5 (downloaded automatically).
  • Audio Diffusion is used to generate audio. NExT-GPT uses AudioLDM, version l-full (downloaded automatically).
  • Video Diffusion is used to generate video. NExT-GPT uses ZeroScope, version v2_576w (downloaded automatically).
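Only the ImageBind and Vicuna checkpoints have to be placed by hand, so a quick check of those two locations can catch a misplaced file before training starts. The following is a sketch under the assumption that the paths are relative to the repository root; `missing_checkpoints` is a hypothetical helper, not part of the NExT-GPT code.

```python
from pathlib import Path
from typing import List


def missing_checkpoints(repo_root: str) -> List[str]:
    """List the manually placed checkpoints that are not where they are expected."""
    root = Path(repo_root)
    imagebind = root / "ckpt/pretrained_ckpt/imagebind_ckpt/huge/imagebind_huge.pth"
    vicuna_dir = root / "ckpt/pretrained_ckpt/vicuna_ckpt"

    problems = []
    if not imagebind.is_file():
        problems.append(f"missing file: {imagebind}")
    # The Vicuna directory must exist and contain at least one entry.
    if not vicuna_dir.is_dir() or not any(vicuna_dir.iterdir()):
        problems.append(f"missing or empty directory: {vicuna_dir}")
    return problems


if __name__ == "__main__":
    issues = missing_checkpoints(".")
    print("All manual checkpoints found." if not issues else "\n".join(issues))
```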

Preparing Dataset

Please download the following datasets used for model training:

A) T-X pairs data

B) Instruction data

Precomputing Embeddings

NExT-GPT uses decoding-side alignment training to bring the representations of the signal tokens and of the captions closer together. To reduce time and memory costs, it precomputes the text embeddings for image, audio, and video captions with the text encoder of the corresponding diffusion model.

Run this command before training NExT-GPT; the produced embedding files will be saved at [./data/embed].

cd ./code/
python ../data/T-X_pair_data/cc3m/cc3m.json image ../data/embed/ runwayml/stable-diffusion-v1-5

Notes on the arguments:

  • args[1]: path of the caption file;
  • args[2]: modality, which can be image, video, or audio;
  • args[3]: saving path of the embedding file;
  • args[4]: name of the corresponding pre-trained diffusion model.
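The idea of precomputing can be sketched as a compute-once cache: each caption is encoded a single time and written to disk, so training only has to read the stored result. The code below is an illustrative stand-in, not the project's script. `toy_encoder` replaces the diffusion model's text encoder with a hash-based placeholder, and both the input layout (a JSON list of objects with a "caption" field) and the output file naming are assumptions.

```python
import hashlib
import json
import sys
from pathlib import Path
from typing import Callable, List

MODALITIES = ("image", "video", "audio")


def toy_encoder(caption: str) -> List[float]:
    """Placeholder for the diffusion text encoder; NOT a real embedding."""
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:8]]


def precompute(caption_file: str, modality: str, save_dir: str, model_name: str,
               encoder: Callable[[str], List[float]] = toy_encoder) -> int:
    """Encode every distinct caption once; return how many new files were written."""
    if modality not in MODALITIES:
        raise ValueError(f"modality must be one of {MODALITIES}, got {modality!r}")

    # Assumed layout: a JSON list of objects that each carry a "caption" field.
    records = json.loads(Path(caption_file).read_text(encoding="utf-8"))
    captions = sorted({record["caption"] for record in records})

    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for caption in captions:
        key = hashlib.sha256(caption.encode("utf-8")).hexdigest()[:16]
        target = out_dir / f"{modality}_{key}.json"  # hypothetical naming scheme
        if target.exists():
            continue  # already cached, nothing to recompute
        payload = {"model": model_name, "caption": caption, "embedding": encoder(caption)}
        target.write_text(json.dumps(payload), encoding="utf-8")
        written += 1
    return written


if __name__ == "__main__":
    # Same positional order as args[1] to args[4] above.
    caption_file, modality, save_dir, model_name = sys.argv[1:5]
    print(precompute(caption_file, modality, save_dir, model_name), "embeddings written")
```

Running it twice on the same caption file writes nothing the second time, which is the saving the precomputation step is after.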

Training NExT-GPT

First, refer to the base configuration file [./code/config/base.yaml] for the basic system settings shared by all modules.

Then, the training of NExT-GPT starts with this script:

cd ./code
bash scripts/

The script runs the following command:

deepspeed --include localhost:0 --master_addr --master_port 28459 \
    --model nextgpt \
    --stage 1 \
    --dataset cc3m \
    --data_path ../data/T-X_pair_data/cc3m/cc3m.json \
    --mm_root_path ../data/T-X_pair_data/cc3m/images/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/

where the key arguments are:

  • --include localhost:0: tells deepspeed to use GPU (CUDA device) number 0.
  • --stage: the training stage.
  • --dataset: the name of the dataset used for training.
  • --data_path: the path of the training data file.
  • --mm_root_path: the path of the image/video/audio files.
  • --embed_path: the path of the text embedding files.
  • --save_path: the directory where the trained delta weights are saved. This directory is created automatically.
  • --log_path: the directory where the log file is saved.
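Since the flags differ only by stage and dataset, they can be assembled by a small helper. This sketch uses only the flags listed above and reproduces the cc3m example exactly. The pairing of webvid with videos and audiocap with audios, and the assumption that those datasets follow the same folder layout as cc3m, are guesses for illustration. `training_args` is a hypothetical name, and the launcher part of the command is deliberately left out.

```python
from typing import List

# Assumed dataset-to-media-folder pairing; only cc3m/images appears in the command above.
MEDIA_DIR = {"cc3m": "images", "webvid": "videos", "audiocap": "audios"}


def training_args(stage: int, dataset: str, size: str = "7b") -> List[str]:
    """Build the model-side flags for an alignment stage on a T-X pair dataset."""
    # Stage 3 uses the instruction dataset, whose file layout is not shown here.
    if stage not in (1, 2):
        raise ValueError("this helper only covers stages 1 and 2")
    if dataset not in MEDIA_DIR:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(MEDIA_DIR)}")

    data_root = f"../data/T-X_pair_data/{dataset}"
    save_path = f"../ckpt/delta_ckpt/nextgpt/{size}/"
    return [
        "--model", "nextgpt",
        "--stage", str(stage),
        "--dataset", dataset,
        "--data_path", f"{data_root}/{dataset}.json",
        "--mm_root_path", f"{data_root}/{MEDIA_DIR[dataset]}/",
        "--embed_path", "../data/embed/",
        "--save_path", save_path,
        "--log_path", f"{save_path}log/",
    ]


if __name__ == "__main__":
    print(" ".join(training_args(1, "cc3m")))
```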

The whole NExT-GPT training involves 3 steps:

  • Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing ImageBind, the LLM, and the output projection layer.

    Run the above script with the following settings:

    • --stage 1
    • --dataset x, where x is one of [cc3m, webvid, audiocap]
    • --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
    • --mm_root_path .../.../x, where x is one of [images, audios, videos]

    Also refer to the running config file [./code/config/stage_1.yaml] and deepspeed config file [./code/dsconfig/stage_1.yaml] for more step-wise configurations.

  • Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing ImageBind, the LLM, and the input projection layers.

    Run the above script with the following settings:

    • --stage 2
    • --dataset x, where x is one of [cc3m, webvid, audiocap]
    • --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
    • --mm_root_path .../.../x, where x is one of [images, audios, videos]

    Also refer to the running config file [./code/config/stage_2.yaml] and deepspeed config file [./code/dsconfig/stage_2.yaml] for more step-wise configurations.

  • Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer, and 3) the output projection layer on the instruction dataset.

    Run the above script with the following setting:

    • --stage 3

    Also refer to the running config file [./code/config/stage_3.yaml] and deepspeed config file [./code/dsconfig/stage_3.yaml] for more step-wise configurations.
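The three steps differ mainly in which parts of the system receive gradient updates. The table below restates that as code. It is a sketch with made-up component names; treating ImageBind as frozen in step 3 is an assumption, since the description only lists what is tuned there.

```python
from typing import Dict

# Illustrative component names, not identifiers from the NExT-GPT code base.
COMPONENTS = ("imagebind", "llm_base", "llm_lora", "input_projection", "output_projection")

TRAINABLE = {
    1: frozenset({"input_projection"}),
    2: frozenset({"output_projection"}),
    # LoRA trains small adapter weights while the base LLM weights stay frozen.
    3: frozenset({"llm_lora", "input_projection", "output_projection"}),
}


def requires_grad(stage: int) -> Dict[str, bool]:
    """Map each component to True if it is updated in the given training stage."""
    if stage not in TRAINABLE:
        raise ValueError(f"stage must be 1, 2, or 3, got {stage}")
    return {name: name in TRAINABLE[stage] for name in COMPONENTS}


if __name__ == "__main__":
    for stage in (1, 2, 3):
        trained = [name for name, flag in requires_grad(stage).items() if flag]
        print(f"stage {stage}: trains {', '.join(trained)}")
```

In a real training loop, a mapping like this would typically drive which parameters have gradients enabled before the optimizer is built.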

Running NExT-GPT System

Preparing Checkpoints

First, load the pre-trained NExT-GPT system.

Deploying Gradio Demo

Once the checkpoints are loaded, you can run the demo locally via:

cd ./code
bash scripts/

The key argument is:

  • --nextgpt_ckpt_path: the path of pre-trained NExT-GPT params.

What can NExT-GPT do?

Here are some examples of NExT-GPT outputs

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-Any Multimodal LLM. CoRR, abs/2309.05519.
