Add ability to evaluate multiple choice tasks (#5047)

* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess, the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with random chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leader board results for these two models are 42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr's available inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-01-21 14:42:44 +0200
committer: GitHub <noreply@github.com> 2024-01-21 14:42:44 +0200
commit: 7dcbe39d36b76389f6c5cd3b151928472b7e22ff (patch)
tree: d0b13b66cdd5046d5767b4791d183bed3e97c61c /common/common.h
parent: 726c0fa9a2da976e9c5d5c51e185d9dd453fc9e5 (diff)
1 files changed, 3 insertions, 0 deletions
diff --git a/common/common.h b/common/common.h
index 0ae9c18b..c69ad7e9 100644
--- a/common/common.h
+++ b/common/common.h
@@ -108,6 +108,9 @@ struct gpt_params {
     bool   winogrande      = false; // compute Winogrande score over random tasks from datafile supplied in prompt
     size_t winogrande_tasks= 0;     // number of tasks to use when computing the Winogrande score. If 0, all tasks will be computed
 
+    bool   multiple_choice = false; // compute TruthfulQA score over random tasks from datafile supplied in prompt
+    size_t multiple_choice_tasks = 0;     // number of tasks to use when computing the TruthfulQA score. If 0, all tasks will be computed
+
     bool mul_mat_q         = true;  // if true, use mul_mat_q kernels instead of cuBLAS
     bool random_prompt     = false; // do not randomize prompt if none provided
     bool use_color         = false; // use color to distinguish generations and inputs
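
The comment on `multiple_choice_tasks` says that 0 means all tasks are computed, and that otherwise a random subset is used. A minimal sketch of that selection rule follows. The helper `select_task_indices` and its `seed` parameter are hypothetical names introduced here for illustration; this is not the code from the commit, only one way the two new fields could be interpreted.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper, not part of this commit.
// Returns the indices of the tasks to evaluate out of n_total available tasks.
// n_requested == 0 (or >= n_total) means "use every task", mirroring the
// comment on gpt_params::multiple_choice_tasks.
static std::vector<size_t> select_task_indices(size_t n_total, size_t n_requested, unsigned seed) {
    std::vector<size_t> idx(n_total);
    std::iota(idx.begin(), idx.end(), 0); // 0, 1, ..., n_total - 1

    if (n_requested == 0 || n_requested >= n_total) {
        return idx; // all tasks, in their original order
    }

    // Shuffle with a seeded generator so a run can be repeated, then keep
    // the first n_requested indices as the random sample.
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(n_requested);
    return idx;
}
```

With `params.multiple_choice_tasks` passed as `n_requested`, a caller would score only the returned tasks. Seeding the generator keeps the sample stable between runs, which matters when comparing scores across models.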