ParEval Leaderboard: Evaluating the Ability of Large Language Models to Generate Parallel Code

We introduced the ParEval benchmark in “Can Large Language Models Write Parallel Code?” to evaluate how well LLMs generate parallel code. Across a wide range of computational problems and parallel programming models, we found a significant gap between models' ability to generate sequential code and their ability to generate parallel code. On this page we maintain an up-to-date table tracking the progress of state-of-the-art LLMs on ParEval.
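
As a rough illustration of the kind of sequential-versus-parallel gap ParEval measures, the sketch below contrasts a simple serial reduction with an OpenMP version (one of the parallel programming models covered in the paper). This is only an illustrative example, not an actual ParEval prompt; the problem and function names are our own.

```cpp
#include <cstdio>
#include <vector>

// Sequential sum: the kind of serial task most models handle well.
double sum_serial(const std::vector<double>& x) {
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        total += x[i];
    return total;
}

// OpenMP counterpart: the model must also get the reduction clause
// right to avoid a data race on `total`.
double sum_openmp(const std::vector<double>& x) {
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < static_cast<long long>(x.size()); ++i)
        total += x[i];
    return total;
}

int main() {
    std::vector<double> x(1 << 20, 1.0);
    std::printf("serial: %.1f  openmp: %.1f\n", sum_serial(x), sum_openmp(x));
}
```

Built with an OpenMP flag such as `-fopenmp`, both functions return the same result; without the `reduction` clause the parallel loop would race on `total`.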

ParEval Results

| Model | No. Parameters | HumanEval pass@1 | ParEval Serial pass@1 | ParEval Parallel pass@1 |
| --- | --- | --- | --- | --- |
| StarCoder2-3B | 3B | 31.7 | 42.7 | 9.6 |
| StarCoder2-7B | 7B | 35.4 | 59.4 | 15.9 |
| CodeLlama-7B | 7B | 29.9 | 48.4 | 15.3 |
| CodeLlama-13B | 13B | 35.0 | 52.8 | 17.4 |
| StarCoder2-15B | 15B | 46.3 | 61.6 | 23.1 |
| StarCoderBase | 15.5B | 30.3 | 51.7 | 18.6 |
| CodeLlama-34B | 34B | 45.1 | 54.0 | 10.2 |
| Phind-V2 | 34B | 71.9 | 65.6 | 32.1 |
| Gemini-Pro | – | 67.7 | 59.3 | 25.1 |
| GPT-3.5 | – | 61.5 | 76.0 | 39.6 |
| GPT-4 | – | 84.1 | 76.1 | 37.8 |

Last updated March 5, 2024
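
The pass@1 columns report the standard functional-correctness metric popularized by HumanEval: the expected fraction of problems solved when a single generated sample is checked against the tests. Assuming the results follow the usual unbiased pass@k estimator of Chen et al. (n samples per problem, c of which pass the tests), a minimal sketch of that computation is:

```cpp
#include <cstdio>

// Unbiased pass@k estimator (Chen et al., 2021): given n generated samples,
// c of which pass the tests,
//   pass@k = 1 - C(n-c, k) / C(n, k)
// computed here in the numerically stable product form.
double pass_at_k(int n, int c, int k) {
    if (n - c < k) return 1.0;  // every size-k subset contains a correct sample
    double fail_prob = 1.0;
    for (int i = n - c + 1; i <= n; ++i)
        fail_prob *= 1.0 - static_cast<double>(k) / i;
    return 1.0 - fail_prob;
}

int main() {
    // e.g. 20 samples per problem, 5 correct -> pass@1 = 5/20 = 0.25
    std::printf("pass@1 = %.3f\n", pass_at_k(20, 5, 1));
    std::printf("pass@5 = %.3f\n", pass_at_k(20, 5, 5));
}
```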

If you would like a model added, you can reach out to dnicho@umd.edu.

Citing ParEval

```bibtex
@misc{nichols2024large,
      title={Can Large Language Models Write Parallel Code?},
      author={Daniel Nichols and Joshua H. Davis and Zhaojun Xie and
              Arjun Rajaram and Abhinav Bhatele},
      year={2024},
      eprint={2401.12554},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
```