ParEval Leaderboard: Evaluating the Ability of Large Language Models to Generate Parallel Code

We introduced the ParEval benchmark in “Can Large Language Models Write Parallel Code?” to evaluate how well LLMs generate parallel code. Across a wide range of computational problems and parallel programming models, we found a significant gap between models' ability to generate sequential code and their ability to generate parallel code. On this page we maintain an up-to-date table tracking the progress of state-of-the-art LLMs on ParEval.
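
As a rough illustration of the kind of sequential-versus-parallel gap ParEval measures, the sketch below contrasts a simple serial reduction with an OpenMP version (one of the parallel programming models covered in the paper). This is only an illustrative example, not an actual ParEval prompt; the problem and function names are our own.

```cpp
#include <cstdio>
#include <vector>

// Sequential sum: the kind of serial task most models handle well.
double sum_serial(const std::vector<double>& x) {
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        total += x[i];
    return total;
}

// OpenMP counterpart: the model must also get the reduction clause
// right to avoid a data race on `total`.
double sum_openmp(const std::vector<double>& x) {
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < static_cast<long long>(x.size()); ++i)
        total += x[i];
    return total;
}

int main() {
    std::vector<double> x(1 << 20, 1.0);
    std::printf("serial: %.1f  openmp: %.1f\n", sum_serial(x), sum_openmp(x));
}
```

Built with an OpenMP flag such as `-fopenmp`, both functions return the same result; without the `reduction` clause the parallel loop would race on `total`.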

ParEval Results

| Model | No. Parameters | HumanEval pass@1 | ParEval Serial pass@1 | ParEval Parallel pass@1 |
| --- | --- | --- | --- | --- |
| StarCoder2-3B | 3B | 31.7 | 42.7 | 9.6 |
| StarCoder2-7B | 7B | 35.4 | 59.4 | 15.9 |
| CodeLlama-7B | 7B | 29.9 | 48.4 | 15.3 |
| CodeLlama-13B | 13B | 35.0 | 52.8 | 17.4 |
| StarCoder2-15B | 15B | 46.3 | 61.6 | 23.1 |
| StarCoderBase | 15.5B | 30.3 | 51.7 | 18.6 |
| CodeLlama-34B | 34B | 45.1 | 54.0 | 10.2 |
| Phind-V2 | 34B | 71.9 | 65.6 | 32.1 |
| Gemini-Pro | – | 67.7 | 59.3 | 25.1 |
| GPT-3.5 | – | 61.5 | 76.0 | 39.6 |
| GPT-4 | – | 84.1 | 76.1 | 37.8 |

Last updated March 5, 2024
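
The pass@1 columns report the standard functional-correctness metric popularized by HumanEval: the expected fraction of problems solved when a single generated sample is checked against the tests. Assuming the results follow the usual unbiased pass@k estimator of Chen et al. (n samples per problem, c of which pass the tests), a minimal sketch of that computation is:

```cpp
#include <cstdio>

// Unbiased pass@k estimator (Chen et al., 2021): given n generated samples,
// c of which pass the tests,
//   pass@k = 1 - C(n-c, k) / C(n, k)
// computed here in the numerically stable product form.
double pass_at_k(int n, int c, int k) {
    if (n - c < k) return 1.0;  // every size-k subset contains a correct sample
    double fail_prob = 1.0;
    for (int i = n - c + 1; i <= n; ++i)
        fail_prob *= 1.0 - static_cast<double>(k) / i;
    return 1.0 - fail_prob;
}

int main() {
    // e.g. 20 samples per problem, 5 correct -> pass@1 = 5/20 = 0.25
    std::printf("pass@1 = %.3f\n", pass_at_k(20, 5, 1));
    std::printf("pass@5 = %.3f\n", pass_at_k(20, 5, 5));
}
```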

If you would like a model added, you can reach out to dnicho@umd.edu.

Citing ParEval

```bibtex
@misc{nichols2024large,
      title={Can Large Language Models Write Parallel Code?},
      author={Daniel Nichols and Joshua H. Davis and Zhaojun Xie and
              Arjun Rajaram and Abhinav Bhatele},
      year={2024},
      eprint={2401.12554},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
```