ChatGLM2-6B

ChatGLM2-6B is the second-generation version of the open-source Chinese-English bilingual dialogue model ChatGLM-6B. It retains the advantages of the first-generation model, such as fluent conversation and easy deployment, and adds the features listed below.

Features

Improved base model

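ChatGLM2-6B uses the hybrid objective function of GLM and was pre-trained on 1.4T bilingual tokens, followed by human preference alignment. Compared with the first-generation model, it scores higher on MMLU, C-Eval, GSM8K, and BBH.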

Longer context

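Using FlashAttention, the context length of the base model is extended from the 2K of ChatGLM-6B to 32K, and the dialogue model is trained with an 8K context length. The current version still has limited ability to understand a single-round ultra-long document.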

More efficient inference

Using Multi-Query Attention, inference is 42% faster than in the first-generation model. With INT4 quantization, 6G of GPU memory supports a dialogue length of 8K.

More open license

The weights are open for academic research, and commercial use is allowed with permission. Donations are welcome for ChatGLM3.

Evaluation results

Using typical English and Chinese datasets, we evaluated ChatGLM2-6B on MMLU (English), C-Eval (Chinese), GSM8K (mathematics), and BBH (English). See evaluation for C-Eval scripts.

MMLU

Model	Average	STEM	Social Sciences	Humanities	Others
ChatGLM-6B	40.63	33.89	44.84	39.02	45.71
ChatGLM2-6B (base)	47.86	41.20	54.44	43.66	54.46
ChatGLM2-6B	45.46	40.06	51.61	41.23	51.24

The chat model was tested with the zero-shot CoT (Chain-of-Thought) method, and the base model was tested with the few-shot answer-only method.

C-Eval

Model	Average	STEM	Social Sciences	Humanities	Others
ChatGLM-6B	38.9	33.3	48.3	41.3	38.0
ChatGLM2-6B (base)	51.7	48.6	60.5	51.3	49.8
ChatGLM2-6B	50.1	46.4	60.4	50.6	46.9

GSM8K

Model	Accuracy	Accuracy (Chinese)*
ChatGLM-6B	4.82	5.85
ChatGLM2-6B (base)	32.37	28.95
ChatGLM2-6B	28.05	20.45

All models were tested with the few-shot CoT method, and the CoT prompts came from http://arxiv.org/abs/2201.11903.

* We translated 500 questions and the CoT prompts in GSM8K using a translation API and proofread the results manually.

BBH

Model	Accuracy
ChatGLM-6B	18.73
ChatGLM2-6B (base)	33.68
ChatGLM2-6B	30.00
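The tables above can also be read as relative gains over the first-generation model. The following is a small sketch that computes them for the base model, using only the scores copied from the tables; the helper name relative_gain and the SCORES dictionary are ours, not part of the repository.

```python
def relative_gain(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# (ChatGLM-6B, ChatGLM2-6B (base)) scores copied from the tables above.
SCORES = {
    "MMLU (Average)": (40.63, 47.86),
    "C-Eval (Average)": (38.9, 51.7),
    "GSM8K (Accuracy)": (4.82, 32.37),
    "BBH (Accuracy)": (18.73, 33.68),
}

for benchmark, (old, new) in SCORES.items():
    gain = relative_gain(old, new)
    print(f"{benchmark}: {old} -> {new} ({gain:+.1f}%)")
```

Swapping in the scores from the ChatGLM2-6B rows gives the same comparison for the chat model.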

How It Works

Environment installation

To get started, clone this repository:

git clone https://github.com/THUDM/ChatGLM2-6B
cd ChatGLM2-6B

Then install the required libraries with pip:

pip install -r requirements.txt

For optimal inference speed, use the recommended transformers version, 4.30.2, and the recommended torch version or higher.

Load the model locally

The full model implementation is hosted on Hugging Face Hub, and transformers downloads the model implementation and parameters automatically when the code runs. If your network connection is slow or unstable, the download may be slow or fail. In that case, download the model to your local machine first and then load it from there.

To download the model from Hugging Face Hub, first install Git LFS, then run:

git clone https://huggingface.co/THUDM/chatglm2-6b

If downloading the checkpoints from Hugging Face Hub is slow, you can download only the model implementation, without the checkpoints:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b

Then download the model parameter files manually from this link and save them in the local chatglm2-6b directory, overwriting the existing files.

To load the model from the local folder, replace THUDM/chatglm2-6b in the from_pretrained calls with the path of your local chatglm2-6b folder.

The model implementation may change over time. To pin the implementation and keep it compatible, pass the argument revision="v1.0" when you call from_pretrained. v1.0 is the most recent version number; see the Change Log for all versions.
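The local-loading steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's own code: the helper names resolve_model_source and load_model are hypothetical, and the check for config.json is only our heuristic for "this folder looks like a downloaded model". The transformers import is kept inside load_model so that the path logic can be tried without the library installed.

```python
import os


def resolve_model_source(local_dir="chatglm2-6b", hub_id="THUDM/chatglm2-6b"):
    """Return the local folder if it looks like a model checkout, else the Hub id."""
    # Heuristic (our assumption): a downloaded model folder contains config.json.
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hub_id


def load_model(source, revision=None):
    """Load the tokenizer and model from a local path or a Hub id."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    kwargs = {"trust_remote_code": True}
    if revision is not None:
        # Pins the model implementation, e.g. revision="v1.0".
        kwargs["revision"] = revision
    tokenizer = AutoTokenizer.from_pretrained(source, **kwargs)
    model = AutoModel.from_pretrained(source, **kwargs)
    return tokenizer, model.eval()
```

A typical call would be load_model(resolve_model_source(), revision="v1.0"); moving the returned model to the GPU or converting it for CPU use is left to the caller.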

Web Demo

First, install Gradio with pip install gradio, then run web_demo.py in the repository:

python web_demo.py

The program runs a web server and prints its address. Open that address in a browser to use the demo.

By default the demo starts with share=False, which does not create a public link. To allow access from the Internet, change share=False to share=True.

Thanks to @AdamBear for implementing a Streamlit-based web demo, web_demo2.py. To run it, first install these extra dependencies:

pip install streamlit streamlit-chat

Use this command to launch it:

streamlit run web_demo2.py

In testing, the Streamlit-based web demo runs more smoothly when the input prompt is long.

Command line Demo

Run the cli_demo.py in the repository:

python cli_demo.py

The program holds an interactive conversation in the command line. Type an instruction and press Enter to get a reply. Type clear to reset the conversation history, and type stop to exit the program.
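The command handling described above can be sketched as a small loop. This is only an illustration, assuming the clear and stop commands: it is not the code of cli_demo.py, and echo_reply is a hypothetical stand-in for the model so the loop can be tried without loading any weights.

```python
def run_turn(line, history, reply_fn):
    """Handle one line of user input.

    Returns (keep_running, history, response).
    """
    command = line.strip()
    if command == "stop":   # exit the program
        return False, history, None
    if command == "clear":  # reset the conversation history
        return True, [], None
    response = reply_fn(line, history)
    return True, history + [(line, response)], response


def echo_reply(query, history):
    """Hypothetical stand-in for the model: echoes the query."""
    return f"(turn {len(history) + 1}) you said: {query}"


def main():
    """Read lines from the terminal until the user types stop."""
    history = []
    running = True
    while running:
        running, history, response = run_turn(input("User: "), history, echo_reply)
        if response is not None:
            print("ChatGLM2-6B:", response)
```

Calling main() starts the loop. Keeping the per-turn logic in run_turn makes the clear and stop behaviour easy to check without a terminal.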

API deployment

First install the extra dependencies with pip install fastapi uvicorn, then run api.py in the repository:

python api.py

By default the service runs on local port 8000 and is called with the POST method:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

The returned value is:

{
  "response":"你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。",
  "history":[["你好","你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。"]],
  "status":200,
  "time":"2023-03-23 21:38:40"
}
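The same call can be made from Python. The following is a minimal sketch using only the standard library; the function names build_request and chat are ours, and the request and response fields (prompt, history, response) are taken from the curl example and the returned value shown above.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000"  # default address from the curl example


def build_request(prompt, history=None, url=API_URL):
    """Build the POST request for one turn of the conversation."""
    payload = {"prompt": prompt, "history": history or []}
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, history=None, url=API_URL):
    """Send one turn and return (response, updated_history)."""
    with urllib.request.urlopen(build_request(prompt, history, url)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]
```

With api.py running, a multi-turn conversation passes the returned history back in: reply, history = chat("你好") followed by chat("next question", history).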

Thanks to @hiyouga for implementing an OpenAI-format streaming API deployment, which can serve as the backend for any ChatGPT-based application, such as ChatGPT-Next-Web. To deploy it, run openai_api.py in the repository:

python openai_api.py

This is the sample code for making an API call:

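import openai

# Note: this sample uses the pre-1.0 interface of the openai package
# (openai.api_base, openai.ChatCompletion).
if __name__ == "__main__":
    # Point the client at the local openai_api.py server.
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "none"  # placeholder value
    # Stream the reply chunk by chunk.
    for chunk in openai.ChatCompletion.create(
        model="chatglm2-6b",
        messages=[
            {"role": "user", "content": "你好"}
        ],
        stream=True
    ):
        # Some chunks carry no text, so check before printing.
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)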

Low-cost deployment

Model quantization

By default the model is loaded with FP16 precision, which needs about 13GB of GPU memory. If your GPU memory is limited, you can load the model with quantization instead, as follows:

from transformers import AutoModel

# Modify as needed; only 4-bit and 8-bit quantization are currently supported
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()

Quantization costs some model performance, but in testing ChatGLM2-6B still generates natural and fluent text with 4-bit quantization.

If your memory is limited, you can also load the quantized model directly:

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4", trust_remote_code=True).cuda()
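The memory savings from quantization can be estimated with simple arithmetic. The sketch below counts the model weights only, and assumes roughly 6 billion parameters (taken from the "6B" in the model name, not a measured value). Real usage is higher, because activations, the attention cache, and framework overhead also take memory.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 6e9  # assumption: about 6 billion parameters

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{name}: about {gib:.1f} GiB for the weights")
```

Halving the number of bits per parameter halves the weight storage, which is why 8-bit and 4-bit quantization make the model fit on smaller GPUs.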

CPU deployment

If you have no GPU hardware, you can run inference on the CPU, but it will be slower. Load the model as follows (about 32GB of RAM is required):

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()

If your memory is insufficient, you can load the quantized model instead:

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4", trust_remote_code=True).float()

To run the quantized model on the CPU, you need gcc and openmp installed. Most Linux distributions have them pre-installed. On Windows, select openmp when you install TDM-GCC. The tested gcc versions are TDM-GCC 10.3.0 on Windows and gcc 11.3.0 on Linux. On MacOS, refer to Q1.
