Language model benchmark

A language model benchmark is a benchmark that tests the capabilities of language models, such as large language models^[1]. These tests are intended to compare the abilities of different models in areas such as language understanding, generation, and reasoning.

A benchmark usually consists of a dataset and evaluation metrics. The dataset contains text samples and annotations, while the metrics measure model performance on tasks such as question answering, text classification, and machine translation.
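The dataset-plus-metric structure described above can be sketched in a few lines. This is a minimal illustration only: the toy dataset, the `toy_model` stand-in, and the function names are assumptions made for the example and do not come from any particular benchmark.

```python
from typing import Callable

# Hypothetical toy dataset: each sample pairs an input text with a gold annotation.
DATASET = [
    {"input": "2 + 2 =", "gold": "4"},
    {"input": "Capital of France?", "gold": "Paris"},
    {"input": "Opposite of hot?", "gold": "cold"},
]


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Metric: fraction of predictions that equal the gold annotation exactly."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)


def run_benchmark(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Run the model on every sample and score its outputs with the metric."""
    predictions = [model(sample["input"]) for sample in dataset]
    golds = [sample["gold"] for sample in dataset]
    return accuracy(predictions, golds)


def toy_model(prompt: str) -> str:
    """Stand-in for a real language model; it knows only two of the three answers."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


score = run_benchmark(toy_model, DATASET)
print(f"accuracy = {score:.2f}")
```

Keeping the model behind a plain callable means the same dataset and metric can be reused to compare different models, which is the purpose of a benchmark.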

Characteristics

Categories

Benchmarks can be grouped, according to various criteria, into the following categories:

Classical: these focus on statistical analysis and were often created before deep learning became popular. Examples include treebanks and BLEU.
Question answering: these tests consist of question-answer pairs, often in multiple-choice form^[2]^[3].
Reasoning: these test reasoning and knowledge^[4].
Agency: these test the ability to act as an agent that can perform operations such as running code^[5].

Evaluation

Three types of benchmark result evaluation can be distinguished^[6]:

Automatic evaluation, for example F1, exact match, or perplexity^[7]
Human evaluation, which allows a qualitative assessment of answers^[8]
LLM as a judge, an alternative to human evaluation^[9]
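The automatic metrics named above can be sketched as follows. This is a simplified illustration under stated assumptions: normalization here is only lowercasing and whitespace splitting (real benchmark scripts typically normalize more aggressively), and `perplexity` expects per-token natural-log probabilities supplied by the caller. The function names are chosen for this example.

```python
import math
from collections import Counter


def normalize(text: str) -> list[str]:
    """Simplified normalization (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


print(exact_match("Paris", "paris"))
print(token_f1("the city of Paris", "Paris"))
print(perplexity([math.log(0.25)] * 4))
```

Exact match rewards only fully correct answers, token-level F1 gives partial credit for overlapping words, and perplexity scores the probabilities the model assigns to a text rather than a generated answer.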

Criticism

One of the most common criticisms of benchmarks is that models become fitted to the test data^[10]^[11]. Goodhart's law is also invoked in this context^[12]. In addition, a set of questions and answers may contain errors^[13] or ambiguous answers, for which even humans would not be able to reach a score of 100%^[14]^[15]^[16]^[17].
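One simple way to look for test data that may have leaked into training text is to measure word n-gram overlap. The sketch below is a generic heuristic chosen for illustration, not the method of the cited works; the function names, the default `n = 5`, and the sample strings are assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, training_text: str, n: int = 5) -> float:
    """Share of the test item's n-grams that also occur in the training text."""
    item_ngrams = ngrams(test_item, n)
    if not item_ngrams:
        return 0.0  # the item is shorter than n words
    corpus_ngrams = ngrams(training_text, n)
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


# Hypothetical training snippet that happens to quote a benchmark question.
training_text = "forum post what is the capital city of france answer paris"

leaked = overlap_ratio("What is the capital city of France", training_text)
clean = overlap_ratio("Which river flows through the city of Vienna", training_text)
print(leaked, clean)
```

A high ratio only flags an item for closer inspection. Paraphrased or translated copies of a test item would not be caught by verbatim n-gram matching.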

Critics also point to the selective way in which model developers choose which benchmarks to report^[18].

Examples

SQuAD

SQuAD version 1.1 consists of over 100,000 questions based on more than 500 Wikipedia articles. Each task presents an article passage and a question, and the answer is a span of text taken from that passage^[19]. Version 2.0 adds over 50,000 unanswerable questions, each of which must be answered with an empty string^[20].
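The task format described above can be illustrated with two made-up records, one answerable and one unanswerable. The field names and contents are hypothetical and do not reproduce the official SQuAD file format; the scoring is a bare exact-match comparison.

```python
# Hypothetical records in the spirit of SQuAD 2.0; an empty gold answer
# marks a question that the passage does not answer.
EXAMPLES = [
    {
        "context": "The Eiffel Tower is in Paris.",
        "question": "Where is the Eiffel Tower?",
        "gold": "Paris",
    },
    {
        "context": "The Eiffel Tower is in Paris.",
        "question": "Who designed the bridge?",
        "gold": "",
    },
]


def score(prediction: str, gold: str) -> int:
    """1 if the prediction matches the gold answer, ignoring case and outer spaces."""
    return int(prediction.strip().lower() == gold.strip().lower())


# For answerable questions, the gold answer is a span of the passage.
assert all(ex["gold"] in ex["context"] for ex in EXAMPLES)

# A model that answers the unanswerable question anyway is penalized.
predictions = ["Paris", "Gustave Eiffel"]
results = [score(p, ex["gold"]) for p, ex in zip(predictions, EXAMPLES)]
print(results)
```

Here the second prediction scores 0 because the expected output for an unanswerable question is the empty string, so a model has to recognize when the passage does not contain the answer.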

GPQA

GPQA (Google-Proof Q&A) consists of 448 graduate-level multiple-choice questions written by experts in biology, physics, and chemistry. The "Diamond" subset contains the 198 hardest questions^[21]. OpenAI reported that human experts achieve an average score of 69.7% on this subset^[22].

Humanity's Last Exam

Humanity's Last Exam is an example of a benchmark in the reasoning category. It contains 3,000 multimodal questions covering more than a hundred academic subjects, with a held-out set that is not publicly released, to prevent contamination. 10% of the questions require understanding both text and an image, and the rest are text-only. 80% of the questions are scored by exact string match, and the rest are multiple-choice^[23].

References

[1] David Owen, How predictable is language model benchmark performance?, arXiv, 9 January 2024, DOI: 10.48550/arXiv.2401.04757 [accessed 2025-05-11].
[2] Danqi Chen, Wen-tau Yih, Open-Domain Question Answering, in: Agata Savary, Yue Zhang (eds.), Online: Association for Computational Linguistics, July 2020, pp. 34–37, DOI: 10.18653/v1/2020.acl-tutorials.8 [accessed 2025-05-11].
[3] Lilian Weng, How to Build an Open-Domain Question Answering System? [online], lilianweng.github.io, 29 October 2020 [accessed 2025-05-11].
[4] Tomohiro Sawada et al., ARB: Advanced Reasoning Benchmark for Large Language Models, arXiv, 28 July 2023, DOI: 10.48550/arXiv.2307.13692 [accessed 2025-05-11].
[5] Qian Huang et al., Benchmarking Large Language Models as AI Research Agents [online], 8 November 2023 [accessed 2025-05-11].
[6] Md Tahmid Rahman Laskar et al., A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations, arXiv, 3 October 2024, DOI: 10.48550/arXiv.2407.04069 [accessed 2025-05-11].
[7] Taojun Hu, Xiao-Hua Zhou, Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions, arXiv, 14 April 2024, DOI: 10.48550/arXiv.2404.09135 [accessed 2025-05-11].
[8] Chris van der Lee et al., Human evaluation of automatically generated text: Current trends and best practice guidelines, "Computer Speech & Language", 67, 2021, p. 101151, DOI: 10.1016/j.csl.2020.101151, ISSN 0885-2308 [accessed 2025-05-11].
[9] Cheng-Han Chiang, Hung-yi Lee, Can Large Language Models Be an Alternative to Human Evaluations?, arXiv, 3 May 2023, DOI: 10.48550/arXiv.2305.01937 [accessed 2025-05-11].
[10] Chunyuan Deng et al., Investigating Data Contamination in Modern Benchmarks for Large Language Models, arXiv, 3 April 2024, DOI: 10.48550/arXiv.2311.09783 [accessed 2025-05-11].
[11] Yanyang Li, lyy1994/awesome-data-contamination [online], 9 May 2025 [accessed 2025-05-11].
[12] Mostafa Dehghani et al., The Benchmark Lottery, arXiv, 14 July 2021, DOI: 10.48550/arXiv.2107.07002 [accessed 2025-05-11].
[13] Curtis G. Northcutt, Anish Athalye, Jonas Mueller, Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, arXiv, 7 November 2021, DOI: 10.48550/arXiv.2103.14749 [accessed 2025-05-11].
[14] Russell Richie, Sachin Grover, Fuchiang (Rich) Tsui, Inter-annotator agreement is not the ceiling of machine learning performance: Evidence from a comprehensive set of simulations, in: Dina Demner-Fushman et al. (eds.), Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 275–284, DOI: 10.18653/v1/2022.bionlp-1.26 [accessed 2025-05-11].
[15] Ron Artstein, Inter-annotator Agreement, in: Nancy Ide, James Pustejovsky (eds.), Dordrecht: Springer Netherlands, 2017, pp. 297–313, DOI: 10.1007/978-94-024-0881-2_11, ISBN 978-94-024-0881-2 [accessed 2025-05-11].
[16] Yixin Nie, Xiang Zhou, Mohit Bansal, What Can We Learn from Collective Human Opinions on Natural Language Inference Data?, in: Bonnie Webber et al. (eds.), Online: Association for Computational Linguistics, November 2020, pp. 9131–9143, DOI: 10.18653/v1/2020.emnlp-main.734 [accessed 2025-05-11].
[17] Ellie Pavlick, Tom Kwiatkowski, Inherent Disagreements in Human Textual Inferences, "Transactions of the Association for Computational Linguistics", 7, 2019, pp. 677–694, DOI: 10.1162/tacl_a_00293, ISSN 2307-387X [accessed 2025-05-11].
[18] Maria Eriksson et al., Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation, arXiv, 10 February 2025, DOI: 10.48550/arXiv.2502.06559 [accessed 2025-05-11].
[19] Pranav Rajpurkar et al., SQuAD: 100,000+ Questions for Machine Comprehension of Text, arXiv, 11 October 2016, DOI: 10.48550/arXiv.1606.05250 [accessed 2025-05-11].
[20] Pranav Rajpurkar, Robin Jia, Percy Liang, Know What You Don't Know: Unanswerable Questions for SQuAD, arXiv, 11 June 2018, DOI: 10.48550/arXiv.1806.03822 [accessed 2025-05-11].
[21] David Rein et al., GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv, 20 November 2023, DOI: 10.48550/arXiv.2311.12022 [accessed 2025-05-11].
[22] Learning to reason with LLMs [online], openai.com [accessed 2025-05-11].
[23] Humanity's Last Exam, lastexam.ai [accessed 2025-02-02].
