[vc_row][vc_column][vc_headings linewidth=”0″ borderwidth=”1″ borderclr=”#000000″ title=”ARB” google_fonts=”font_family:Comfortaa%3A300%2Cregular%2C700|font_style:700%20bold%20regular%3A700%3Anormal” titlesize=”60″ titleclr=”#000000″]
Advanced Reasoning Benchmark
[/vc_headings][vc_single_image image=”4855″ img_size=”medium” alignment=”center”][vc_column_text]Duck AI ARB (Advanced Reasoning Benchmark) is a new dataset that contains complex reasoning tasks to measure the performance of LLMs on text understanding and domain-specific reasoning. It is more difficult than previous benchmarks, as it includes questions that require deeper knowledge of mathematics, physics, biology, chemistry, and law.[/vc_column_text][vc_separator][/vc_column][/vc_row][vc_row][vc_column width=”1/4″][mvc_advanced_button align=”center” btn_text=”Github” icon_size=”25″ use_theme_fonts=”yes” btn_icon=”fab fa-github” btn_url=”url:https%3A%2F%2Fgithub.com%2FTheDuckAI%2Farb|target:_blank” btn_clr=”#ffffff” btn_bg=”#0a0a0a” btn_radius=”50″][/vc_column][vc_column width=”1/4″][mvc_advanced_button align=”center” btn_text=”Paper” icon_size=”25″ use_theme_fonts=”yes” btn_url=”url:https%3A%2F%2Farxiv.org%2Fpdf%2F2307.13692.pdf|target:_blank” btn_clr=”#ffffff” btn_bg=”#fa0f00″ btn_radius=”50″ btn_icon=”fas fa-file-pdf”][/vc_column][vc_column width=”1/4″][mvc_advanced_button align=”center” btn_text=”Arxiv” icon_size=”25″ use_theme_fonts=”yes” btn_icon=”fas fa-file-excel” btn_url=”url:https%3A%2F%2Farxiv.org%2Fabs%2F2307.13692|target:_blank” btn_clr=”#ffffff” btn_bg=”#600c00″ btn_radius=”50″][/vc_column][vc_column width=”1/4″][mvc_advanced_button align=”center” btn_text=”API” icon_size=”25″ use_theme_fonts=”yes” btn_icon=”fas fa-book” btn_url=”url:https%3A%2F%2Fapp.swaggerhub.com%2Fapis-docs%2Farb-dataset%2Farb-api%2F1.0.5|target:_blank” btn_clr=”#ffffff” btn_bg=”#7289da” btn_radius=”50″][/vc_column][/vc_row][vc_row][vc_column][vc_headings style=”theme4″ borderclr=”#000000″ style2=”image” title=”Sample Problems” google_fonts=”font_family:Comfortaa%3A300%2Cregular%2C700|font_style:700%20bold%20regular%3A700%3Anormal” lineheight=”3″ titlesize=”40″ titleclr=”#000000″ image_id=”2854″][/vc_headings][vc_column_text]
Math Symbolic

Math Proof-like

Physics Symbolic

Evaluation Results
The evaluation covers text-only problems, with no multimodal tasks, and tests models such as ChatGPT, GPT-3.5, GPT-4, and Claude. Prompts differ by problem type, each with its own instructions and reasoning steps. For multiple-choice questions, the model’s answer is checked against the correct choice. For numerical, symbolic, and proof-like problems, the model’s answer must first be extracted and parsed, which can be complex and may require mathematical tools and human grading. The study also tried two model-based grading methods: using GPT-4 to judge whether two symbolic expressions are equivalent, and a rubric-based method. The rubric-based method showed good results, making it easier to evaluate more unstructured answers.[/vc_column_text][vc_single_image image=”4859″ img_size=”full” alignment=”center”][vc_column_text]
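The answer-checking step described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual grading code: the "Final answer:" marker, the function names, and the 1% relative tolerance are all assumptions made for illustration, and symbolic or proof-like answers would still need mathematical tools or human grading.

```python
import math
import re


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Final answer:' marker, or '' if absent.

    The marker itself is a hypothetical prompt convention, not taken from ARB.
    """
    matches = re.findall(r"final answer:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else ""


def grade_multiple_choice(response: str, correct: str) -> bool:
    """Compare the extracted choice letter (A-D) with the correct one."""
    answer = extract_final_answer(response)
    # Accept 'B', '(B)', or 'b.', but not a word that merely starts with B.
    match = re.match(r"\(?([A-Da-d])\)?(?![A-Za-z])", answer)
    return bool(match) and match.group(1).upper() == correct.upper()


def grade_numerical(response: str, correct: float, rel_tol: float = 1e-2) -> bool:
    """Parse the first number in the final answer and compare within rel_tol."""
    answer = extract_final_answer(response)
    number = re.search(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", answer)
    if number is None:
        return False
    return math.isclose(float(number.group()), correct, rel_tol=rel_tol)
```

A response without the marker is graded as incorrect in this sketch; a real pipeline might instead flag it for manual review.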
Model-based Rubric Evaluation
One challenge in evaluating large language models (LLMs) on complex reasoning tasks is how to grade symbolic answers and check intermediate reasoning steps. The study proposes a method in which the model generates and applies rubrics to assess solutions, based on reference solutions and examples of human-written rubrics. The evaluation showed that GPT-4 generates effective rubrics and outperforms its predecessor, GPT-3.5-turbo: its rubrics cover key solution steps well, though it has difficulty with point allocation.[/vc_column_text][vc_single_image image=”4860″ img_size=”full” alignment=”center”][vc_message message_box_color=”black”]You can access the Duck AI interface to see more![/vc_message][vc_separator][vc_btn title=”Access Duck AI Interface” color=”danger” align=”center” i_icon_fontawesome=”fas fa-external-link-square-alt” add_icon=”true” link=”url:https%3A%2F%2Farb.duckai.org%2Fhome|target:_blank”][/vc_column][/vc_row][vc_row][vc_column][vc_separator][/vc_column][/vc_row]
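The rubric-based grading idea can be sketched in Python as follows. This only shows the bookkeeping once a rubric exists: the `RubricItem` type, the `score_solution` helper, and the example point values are hypothetical, and the per-item judgments that the study obtains from a model are stood in for by plain booleans.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RubricItem:
    """One gradable step of a solution and the points it is worth."""
    description: str
    points: int


def score_solution(
    rubric: Sequence[RubricItem], satisfied: Sequence[bool]
) -> Tuple[int, int]:
    """Return (earned, total) points given one judgment per rubric item.

    `satisfied[i]` is a placeholder for the grader's decision on rubric[i];
    in the study that decision comes from a model, not from this code.
    """
    if len(rubric) != len(satisfied):
        raise ValueError("need exactly one judgment per rubric item")
    earned = sum(item.points for item, ok in zip(rubric, satisfied) if ok)
    total = sum(item.points for item in rubric)
    return earned, total


# Illustrative rubric; the items and point values are made up.
example_rubric = [
    RubricItem("Sets up the governing equation", 4),
    RubricItem("Applies the boundary conditions", 3),
    RubricItem("Simplifies to the final expression", 3),
]
```

Keeping the points explicit per item makes point allocation, the part the study found hardest for the model, easy to inspect and correct by hand.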


