☕︎ DomainEval ☕️

An Auto-Constructed Benchmark for Multi-Domain Code Generation

GitHub · Paper · Space · Data
🏆 Leaderboard (# | Model | Pass)

📝 Submission

Thank you for your interest in DomainEval. We warmly welcome researchers to submit additional benchmark results, as collaborative efforts can significantly advance the study of Large Language Models and software engineering. For submission guidelines, please refer to our GitHub repository.

🤗 Acknowledgement

Thanks to EvalPlus for sharing the leaderboard template. In addition to the DomainEval leaderboard, we recommend evaluating LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: