ARB

ARB from Duck AI (Advanced Reasoning Benchmark)





Duck AI's ARB (Advanced Reasoning Benchmark) is a new dataset of complex reasoning tasks designed to measure LLM performance on text understanding and domain-specific reasoning. It is more difficult than previous benchmarks, as it includes questions that require deeper knowledge of mathematics, physics, biology, chemistry, and law.

GitHub: https://github.com/TheDuckAI/arb
Paper (PDF): https://arxiv.org/pdf/2307.13692.pdf
arXiv: https://arxiv.org/abs/2307.13692
API: https://app.swaggerhub.com/apis-docs/arb-dataset/arb-api/1.0.5
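
If you want to pull problems programmatically, the sketch below shows one way to query the ARB API documented above from Python. The base URL, endpoint path, and response field names here are illustrative assumptions, not the documented schema; check the SwaggerHub page linked above for the actual one.

import requests

# NOTE: the base URL, endpoint path, and field names below are assumptions for
# illustration only; see the SwaggerHub API docs linked above for the real schema.
BASE_URL = "https://example-arb-api.duckai.org"  # hypothetical base URL

def fetch_problems(subject: str, limit: int = 5) -> list:
    """Fetch a few ARB problems for a subject such as "math" or "physics" (hypothetical endpoint)."""
    resp = requests.get(f"{BASE_URL}/problems/{subject}", timeout=30)
    resp.raise_for_status()
    return resp.json()[:limit]

if __name__ == "__main__":
    for problem in fetch_problems("math"):
        # "statement" is an assumed field name for the problem text.
        print(problem.get("statement", "")[:200])
        print("-" * 40)
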
Sample Problems

Sample problem images from the benchmark: Math Symbolic, Math Proof-like, and Physics Symbolic.

Evaluation Results

The evaluation covers text-only problems, without multimodal tasks, using current large language models (LLMs) such as ChatGPT, GPT-3.5, GPT-4, and Claude. Prompts differ by problem type, each with specific instructions and reasoning steps. For multiple-choice questions, the model's answer is checked directly against the correct option; numerical, symbolic, and proof-like problems instead require extracting and parsing the model's answer, which can be complex and may call for mathematical tools and human grading. The study also tried two model-based grading methods: asking GPT-4 to judge whether two symbolic expressions are equivalent, and a rubric-based approach, which showed good results and makes it easier to evaluate more unstructured answers.
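
As a concrete illustration of the symbolic-checking step described above, the sketch below compares a model's final answer against a reference answer with SymPy. This is a generic equivalence check under stated assumptions, not the benchmark's actual grading code.

import sympy
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(model_answer: str, reference_answer: str) -> bool:
    """Return True if the two expressions simplify to the same thing.

    Generic illustration of symbolic answer checking, not the ARB grading code.
    """
    try:
        model_expr = parse_expr(model_answer)
        ref_expr = parse_expr(reference_answer)
    except (sympy.SympifyError, SyntaxError):
        return False  # unparseable answers would fall back to manual grading
    # simplify(a - b) == 0 is a common (though not complete) equivalence test
    return sympy.simplify(model_expr - ref_expr) == 0

# Example: different surface forms of the same expression
print(symbolically_equivalent("sin(x)**2 + cos(x)**2", "1"))  # True
print(symbolically_equivalent("2*x + 1", "x + 1"))            # False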

Model-based Rubric Evaluation

One challenge in evaluating large language models (LLMs) on complex reasoning tasks is grading symbolic answers and checking intermediate reasoning steps. The study proposes a method in which the model generates and applies rubrics to assess solutions, based on reference solutions and examples of human-written rubrics. The evaluation showed that GPT-4 generates effective rubrics, covering the key solution steps well but struggling with point allocation, and that it surpasses its predecessor, GPT-3.5-turbo.
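
To make the rubric idea more concrete, here is a hedged Python sketch of the two-step flow described above: generate a rubric from a reference solution, then grade a candidate answer against it. The prompts are illustrative assumptions, not the paper's actual prompts; the paper uses GPT-4 as the grader model.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_rubric(problem: str, reference_solution: str) -> str:
    """Step 1: ask the grader model to write a point-based rubric (illustrative prompt)."""
    prompt = (
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}\n\n"
        "Write a grading rubric that assigns points to the key steps of the solution, "
        "totalling 10 points."
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # grader model; the prompts here are assumptions, not the paper's
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def grade_with_rubric(problem: str, rubric: str, model_answer: str) -> str:
    """Step 2: apply the rubric to a candidate answer and return the graded breakdown."""
    prompt = (
        f"Problem:\n{problem}\n\nRubric:\n{rubric}\n\nCandidate answer:\n{model_answer}\n\n"
        "Score the candidate answer against the rubric, listing points awarded per step "
        "and the total."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content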


You can access the ARB dataset and explore more problems in the Duck AI interface!


