Add Winogrande evaluation (#5015)

* winogrande: simple implementation

  It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leaderboard. The 1-sigma statistical uncertainty for 1267 tasks is ~1.4%, so there is no way the difference is due to statistics.

* winogrande: somewhat better

  The score for Mistral-7B is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before.

* winogrande: improving

  The Mistral-7B score is now 73.56. Still not quite 78.4, but getting there. We are also getting a lower score on HellaSwag compared to the HF leaderboard, so I'm not expecting we will get up to 78.4 anyway.

  It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood. This kind of makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing about the follow-up context, and this will skew the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence.

  It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
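The two quantitative points in the message above can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the code added by this commit: `accuracy_sigma_percent` applies the usual binomial approximation to reproduce the ~1.4 figure, and `avg_logprob_after_choice` is a hypothetical helper showing one way to average log-probabilities only over the tokens that follow the choice word(s), optionally leaving out trailing punctuation. The function names, the per-token log-probability vector, and the fallback used when the choice words come last are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 1-sigma statistical uncertainty, in percentage points, of an accuracy
// measured on n_tasks independent tasks (binomial approximation).
double accuracy_sigma_percent(double accuracy, std::size_t n_tasks) {
    assert(n_tasks > 0);
    assert(accuracy >= 0.0 && accuracy <= 1.0);
    return 100.0 * std::sqrt(accuracy * (1.0 - accuracy) / static_cast<double>(n_tasks));
}

// Hypothetical scoring helper (not the code from this commit).
// token_logprobs[i] holds log p(token i | tokens 0..i-1) for one candidate sentence.
// choice_end is the index of the first token after the choice word(s).
// n_tail_skip is the number of trailing tokens (e.g. final punctuation) to leave out.
// If the choice words are last in the sentence there is nothing after them to score,
// so this sketch falls back to averaging over every token (an assumption).
double avg_logprob_after_choice(const std::vector<double> & token_logprobs,
                                std::size_t choice_end,
                                std::size_t n_tail_skip) {
    assert(!token_logprobs.empty());
    std::size_t begin = 0;
    std::size_t end   = token_logprobs.size();
    if (choice_end < end) {
        begin = choice_end;
        // Drop trailing tokens only if at least one scored token remains.
        if (end - begin > n_tail_skip) {
            end -= n_tail_skip;
        }
    }
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        sum += token_logprobs[i];
    }
    return sum / static_cast<double>(end - begin);
}
```

With an accuracy of 0.60 on 1267 tasks the first function gives roughly 1.38 percentage points, so the 18-point gap to 78.4 is more than ten standard errors. For the second function, the candidate whose remaining tokens have the higher average log-probability would be taken as the model's answer.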
author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> 2024-01-18 13:46:27 +0200
committer: GitHub <noreply@github.com> 2024-01-18 13:46:27 +0200
commit: 682986a08eb5cb04865d2e713449f17304d266d8 (patch)
tree: 604a85edc16fa1e589db2587bf620bed473cbfc9 /common/common.h
parent: dcad445d0c83ad49bca1b58cf9c139cfcebee5d4 (diff)
1 files changed, 3 insertions, 0 deletions
diff --git a/common/common.h b/common/common.h
index 1f43e628..0ae9c18b 100644
--- a/common/common.h
+++ b/common/common.h
@@ -105,6 +105,9 @@ struct gpt_params {
     bool   hellaswag       = false; // compute HellaSwag score over random tasks from datafile supplied in prompt
     size_t hellaswag_tasks = 400;   // number of tasks to use when computing the HellaSwag score
 
+    bool   winogrande      = false; // compute Winogrande score over random tasks from datafile supplied in prompt
+    size_t winogrande_tasks= 0;     // number of tasks to use when computing the Winogrande score. If 0, all tasks will be computed
+
     bool mul_mat_q         = true;  // if true, use mul_mat_q kernels instead of cuBLAS
     bool random_prompt     = false; // do not randomize prompt if none provided
     bool use_color         = false; // use color to distinguish generations and inputs
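The comment on `winogrande_tasks` says that 0 means all tasks are computed, and the comment on `winogrande` mentions random tasks. A minimal sketch of how a caller might interpret these two fields is shown below. `select_task_indices` and its `seed` parameter are hypothetical names introduced for illustration and are not part of this diff; treating a request larger than the dataset as "all tasks" is also an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: choose which task indices to evaluate.
// n_requested == 0 means "use all tasks", mirroring the comment on winogrande_tasks.
// A request that is not smaller than the dataset is also treated as "all" (assumption).
std::vector<std::size_t> select_task_indices(std::size_t n_available,
                                             std::size_t n_requested,
                                             unsigned    seed) {
    std::vector<std::size_t> indices(n_available);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    if (n_requested == 0 || n_requested >= n_available) {
        return indices; // every task, in file order
    }
    // Otherwise take a random subset without repetition.
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(n_requested);
    return indices;
}
```

A caller would pass the number of tasks parsed from the data file together with `params.winogrande_tasks`, then evaluate only the returned indices.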