Modern machine learning models still cannot code solutions to basic problems. The APPS benchmark measures the ability of models to take an arbitrary natural language specification and generate Python code that fulfills it. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems, so we find that recent models are beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.

Author(s) : Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

Links : PDF - Abstract

Keywords : models - code - benchmark - find - specification
