
ChatGLM 2-6B is an open-source Chinese-English bilingual dialogue model that builds on the original ChatGLM-6B. It inherits the advantages of the first-generation model, such as fluent conversation and easy deployment.


Improved base model

ChatGLM2-6B uses the hybrid objective function of GLM, pre-trained on 1.4T bilingual tokens and aligned with human preferences. It improves substantially over the first-generation model on MMLU, C-Eval, GSM8K, and BBH.

Longer context

Based on FlashAttention, the base model's context length is extended from 2K in ChatGLM-6B to 32K, with a context length of 8K used during dialogue alignment. Understanding of long documents within a single turn is still limited.

More efficient inference

With Multi-Query Attention, inference is about 42% faster, and under INT4 quantization, 6 GB of GPU memory supports dialogues of 8K length.

More open license

The weights are fully open to academic research, and free commercial use is also allowed after registration.

Evaluation results

We evaluated ChatGLM2-6B on typical Chinese and English datasets: MMLU (English), C-Eval (Chinese), GSM8K (mathematics), and BBH (English). See the evaluation directory for C-Eval reproduction scripts.


MMLU

Model Average STEM Social Sciences Humanities Others
ChatGLM-6B 40.63 33.89 44.84 39.02 45.71
ChatGLM2-6B (base) 47.86 41.20 54.44 43.66 54.46
ChatGLM2-6B 45.46 40.06 51.61 41.23 51.24

The Chat model was tested using the zero-shot CoT (Chain-of-Thought) method, and the Base model was tested using the few-shot answer-only method.


C-Eval

Model Average STEM Social Sciences Humanities Others
ChatGLM-6B 38.9 33.3 48.3 41.3 38.0
ChatGLM2-6B (base) 51.7 48.6 60.5 51.3 49.8
ChatGLM2-6B 50.1 46.4 60.4 50.6 46.9


GSM8K

Model Accuracy Accuracy (Chinese)*
ChatGLM-6B 4.82 5.85
ChatGLM2-6B (base) 32.37 28.95
ChatGLM2-6B 28.05 20.45

All models were tested using the few-shot CoT method; the CoT prompt comes from the source linked in the original repository.

* We translated 500 questions and the CoT prompt in GSM8K using a machine-translation API and performed human proofreading.


BBH

Model Accuracy
ChatGLM-6B 18.73
ChatGLM2-6B (base) 33.68
ChatGLM2-6B 30.00

How It Works

Environment installation

To get started, clone this repository:

git clone https://github.com/THUDM/ChatGLM2-6B
cd ChatGLM2-6B

Then install the dependencies with pip install -r requirements.txt. Version 4.30.2 of the transformers library is recommended, and torch 2.0 or above is recommended for the best inference performance.

Load the model locally

You can generate dialogue with just a few lines of code using transformers (the basic usage; it assumes a CUDA GPU):

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)

Running this code automatically downloads the model implementation and checkpoints from Hugging Face Hub. If your internet connection is slow or unstable, the download may fail or stall; in that case, you can download the model to your local machine first and then load it from there.

To download the model from Hugging Face Hub, first install Git LFS, then run:

git clone https://huggingface.co/THUDM/chatglm2-6b

A faster way is to download only the model implementation from Hugging Face Hub, skipping the checkpoints:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b


Then manually download the model parameter files and place them in your local chatglm2-6b directory, overwriting the placeholder files.

To load the model from your local folder, replace THUDM/chatglm2-6b in the code above with the path of your local chatglm2-6b directory.

The model implementation may change over time. To pin a fixed implementation, pass the argument revision="v1.0" when you invoke from_pretrained. v1.0 is the latest version number; see the Change Log for all versions.

Web Demo

First install Gradio with pip install gradio, then run web_demo.py in the repository:


The program starts a web server and prints its address; open that address in a browser to use the demo.

By default, the demo does not create a public link. To enable access over the Internet, change share=False to share=True in the launch() call.

Thanks to @AdamBear for contributing a Streamlit-based web demo. To run it, first install the extra dependencies:

pip install streamlit streamlit-chat

Use this command to launch it:

streamlit run web_demo2.py

The Streamlit web demo runs more smoothly when the input prompt is long.

Command line Demo

Run cli_demo.py in the repository:


The program interacts through the command line: type a prompt and press Enter to generate a reply. Type clear to reset the conversation history, and stop to exit the program.
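The control flow of such a loop can be sketched without the model itself. In the sketch below, run_cli and fake_chat are hypothetical names used only for illustration, with fake_chat standing in for model.chat:

```python
def run_cli(chat_fn, inputs):
    # Minimal sketch of the command-line loop: "clear" resets the
    # conversation history, "stop" exits, anything else queries the model.
    history, transcript = [], []
    for text in inputs:
        if text.strip() == "stop":
            break
        if text.strip() == "clear":
            history = []
            continue
        response, history = chat_fn(text, history)
        transcript.append(response)
    return transcript

# Hypothetical stand-in for model.chat, so the loop can run without a GPU
def fake_chat(prompt, history):
    reply = f"reply to: {prompt}"
    return reply, history + [(prompt, reply)]

print(run_cli(fake_chat, ["你好", "clear", "hi", "stop"]))
```

The real demo streams tokens as they are generated, but the history handling follows the same pattern.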

API deployment

First install the extra dependencies with pip install fastapi uvicorn, then run api.py in the repository:


The deployment runs locally on port 8000 by default and is called via the POST method:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

The returned value is:

{
  "response":"你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。",
  "history":[["你好","你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。"]],
  "time":"2023-03-23 21:38:40"
}
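For multi-turn chat, a client passes the returned history back on the next call. A minimal sketch (the helper names build_request and next_turn are hypothetical; only the request and response JSON shapes come from the example above):

```python
import json

def build_request(prompt, history):
    # Body format expected by the POST deployment shown above
    return json.dumps({"prompt": prompt, "history": history}, ensure_ascii=False)

def next_turn(reply_json):
    # The returned "history" already contains the new turn, so it can be
    # sent back verbatim on the next request to continue the conversation.
    reply = json.loads(reply_json)
    return reply["response"], reply["history"]

print(build_request("你好", []))
```

Sending build_request(...) as the POST body and feeding the returned history into the next build_request call keeps the conversation state on the client side.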

Thanks to @hiyouga for contributing an OpenAI-format streaming API deployment, which can serve as a backend for any ChatGPT-based application such as ChatGPT-Next-Web. To deploy it, run openai_api.py in the repository:


This is the sample code for making an API call:

import openai

if __name__ == "__main__":
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "none"
    for chunk in openai.ChatCompletion.create(
        model="chatglm2-6b",
        messages=[{"role": "user", "content": "你好"}],
        stream=True
    ):
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)

Low-cost deployment

Model quantization

At FP16 precision, the model needs about 13 GB of GPU memory. To reduce GPU memory usage, you can load the model with quantization instead:

# Modify as needed; currently only 4-bit and 8-bit quantization are supported
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()

Quantization costs some model performance, but ChatGLM2-6B can still generate natural and fluent text under 4-bit quantization.

A quantized model can help you save memory. To use it, just load it directly:

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda()

CPU deployment

You can perform inference on the CPU if you lack GPU hardware, but it will take longer. Follow these steps to use it (you need about 32GB RAM).

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()

A quantized model can reduce memory consumption when the available memory is insufficient.

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).float()

Running the quantized model on the CPU also requires gcc and openmp, which are usually pre-installed on Linux. For Windows, make sure to select openmp when installing TDM-GCC (tested with TDM-GCC 10.3.0; the Linux test environment used gcc 11.3.0). For MacOS, refer to Q1 in the FAQ.
