Introducing the NPHardEval leaderboard, powered by NPHardEval, a cutting-edge benchmark developed by researchers from the University of Michigan and Rutgers University.
NPHardEval introduces a dynamic, complexity-based framework for assessing the reasoning abilities of large language models (LLMs). It poses 900 algorithmic questions spanning complexity classes up to NP-hard, is designed to rigorously test LLMs, and is updated monthly to prevent overfitting!
A unique approach to LLM evaluation
NPHardEval stands out by grounding its evaluation in computational complexity classes, providing a quantifiable and robust measure of LLM reasoning skills. The benchmark's tasks mirror real-world decision-making challenges, enhancing its relevance and applicability. Regular monthly updates of the benchmark data points mitigate the risk of model overfitting and ensure a reliable evaluation.
NPHardEval's main contributions are a new benchmarking strategy (an automatic and dynamic benchmark) and a new way of evaluating LLM reasoning.
Regarding the benchmarking strategy, NPHardEval uses an automated mechanism to generate and verify the questions in the benchmark. Because the questions are based on algorithmically computable problems, no human intervention is required to determine the correctness of an LLM's response. This also allows NPHardEval to be a dynamic benchmark: since questions can be generated automatically, the benchmark can be refreshed every month. This monthly-refreshed benchmark helps prevent model overfitting, as new questions can always be generated at varying levels of difficulty for evaluation.
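To make this concrete, here is a minimal sketch (not NPHardEval's actual code; the function names and the shortest-path task shown are chosen for illustration) of how an algorithmic question could be generated randomly and graded automatically, with no human in the loop:

```python
# Hypothetical sketch of automatic question generation and grading; NPHardEval's
# real pipeline differs, but the principle is the same: because the task is
# algorithmically solvable, correctness can be checked without human labels.
import heapq
import random

def generate_instance(n_nodes: int, seed: int):
    """Create a random weighted graph; difficulty can scale with n_nodes."""
    rng = random.Random(seed)
    edges = {}
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            if rng.random() < 0.5:
                w = rng.randint(1, 10)
                edges.setdefault(u, []).append((v, w))
                edges.setdefault(v, []).append((u, w))
    return edges

def shortest_path_cost(edges, source, target):
    """Dijkstra's algorithm: the ground-truth answer is computed, not hand-labeled."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(queue, (d + w, v))
    return float("inf")

def grade(edges, source, target, model_answer: int) -> bool:
    """A model's numeric answer is correct iff it matches the computed optimum."""
    return model_answer == shortest_path_cost(edges, source, target)

# Fresh questions each month: simply change the seeds.
graph = generate_instance(n_nodes=6, seed=2024)
print(grade(graph, source=0, target=5, model_answer=7))
```

Regenerating instances with new seeds each month is what keeps the benchmark dynamic while the grading stays fully automatic.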
Regarding the evaluation of LLM reasoning, the questions themselves are grounded in the hierarchy of computational complexity, a well-established concept that has been extensively studied in theoretical computer science. This foundation lets us leverage existing research to measure the scope of logical reasoning in LLMs rigorously and quantitatively, by defining reasoning in terms of complexity classes. In addition, since numerical computation is a well-known hard task for LLMs, we exclude numerical calculations from the questions. Focusing on logical questions allows a more accurate assessment of an LLM's pure logical reasoning ability, as numerical questions can blur this assessment.
Data Synthesis
NPHardEval uses 100 questions for each of nine algorithmic tasks, each spanning 10 difficulty levels, for a total of 900 questions across complexity classes and difficulty levels. The nine tasks, comprising 3 P, 3 NP-complete, and 3 NP-hard problems, are characterized according to computational complexity theory. All 900 questions are synthesized and updated monthly.
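As a quick sanity check of these numbers (assuming the 100 questions per task are spread evenly over the 10 difficulty levels, i.e., 10 questions per level; the exact split is defined by the benchmark itself):

```python
# Back-of-the-envelope view of the benchmark's size; the per-level count of
# 10 questions is an assumption (100 questions per task / 10 levels).
tasks_per_class = {"P": 3, "NP-complete": 3, "NP-hard": 3}
levels_per_task = 10
questions_per_level = 10

total = sum(
    n_tasks * levels_per_task * questions_per_level
    for n_tasks in tasks_per_class.values()
)
print(total)  # 900
```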
These slides offer more background and insights.
Evaluation Metrics
We evaluate the reasoning ability of LLMs using two metrics: Weighted Accuracy and Failure Rate.
Weighted Accuracy (WA)
Weighted Accuracy (WA) is used to evaluate problem-solving accuracy. It is applied to each question either by comparing against a single correct answer or, for questions without a singular answer, by checking the result step by step. To reflect comparative accuracy more effectively, we assign weights to the different difficulty levels. Each level's weight corresponds to its relative importance or challenge, with higher difficulty levels receiving more weight in a linear progression (e.g., weight 1 for level 1, weight 2 for level 2, and so on).
The equation for Weighted Accuracy is:

$$\mathrm{WA} = \frac{\sum_{i=1}^{10} w_i \cdot A_i}{\sum_{i=1}^{10} w_i}$$

In this equation, $w_i$ represents the weight assigned to difficulty level $i$ (ranging from 1 to 10), and $A_i$ is the accuracy at that level.
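As a quick illustration of the formula, here is a small sketch using made-up per-level accuracies (not actual leaderboard numbers):

```python
# Weighted Accuracy with linearly increasing weights w_i = i over 10 levels.
# The accuracies below are placeholders, not real leaderboard results.
accuracies = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.45, 0.3, 0.2, 0.1]  # A_1 .. A_10

weights = range(1, 11)  # w_i = i
wa = sum(w * a for w, a in zip(weights, accuracies)) / sum(weights)
print(round(wa, 3))
```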
Failure Rate (FR)
Another important metric we consider is the Failure Rate (FR). This metric helps assess how frequently the model fails across different problems and difficulty levels. It is particularly useful for identifying instances where an LLM's output does not match the expected output format.
The failure rate is calculated as the proportion of failed attempts relative to the total number of attempts at each difficulty level. An attempt is counted as failed if the model produces results that cannot be successfully parsed across all endpoint calls; the maximum number of tries is set to 10. For each problem, the failure rate is then aggregated across all difficulty levels, considering the total of 10 attempts at each level.
The formal definition of Failure Rate is:

$$\mathrm{FR} = \frac{\sum_{i=1}^{10} f_i}{100}$$

Here, $f_i$ denotes the number of failed attempts at difficulty level $i$.
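And similarly for the Failure Rate, with placeholder failure counts out of the 10 attempts at each level:

```python
# Failure Rate: f_i failed attempts out of 10 tries at each of the 10 levels,
# aggregated over the 100 total attempts. The counts below are placeholders.
failed_attempts = [0, 0, 1, 1, 2, 2, 3, 4, 5, 6]  # f_1 .. f_10

fr = sum(failed_attempts) / 100  # 10 levels x 10 attempts = 100 total
print(fr)
```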
Experiments and insights
The benchmark includes comprehensive experiments analyzing LLMs across different complexity classes and difficulty levels. It delves into the nuances of LLM performance and provides valuable insights into their reasoning strengths and limitations. In general:
- Closed-source models generally perform better than open-source ones, with GPT-4 Turbo performing best overall.
- Models generally perform better on questions of lower complexity, i.e., easier complexity classes, although performance does not always decline linearly with complexity. Models such as Claude 2 perform best on NP-complete (medium-complexity) questions.
- Some open-source models can surpass closed-source models on specific questions. Leading open-source models include Yi-34B, Qwen-14B, phi-2, and Mistral-7B.


Reproducing NPHardEval benchmark results on your machine
To set up the NPHardEval benchmark, you need to follow a few steps:
Environment setup: after cloning the repository to your local machine, install the required Python libraries with conda:

conda create --name llm_reason python==3.10
conda activate llm_reason
git clone https://github.com/casmlab/nphardeval.git
pip install -r requirements.txt

Set up API keys: obtain your API keys and modify the corresponding entries in secrets.txt.

Then evaluate your model with the NPHardEval benchmark!
For example, to evaluate the GPT-4 Turbo model (gpt-4-1106-preview) on the Edit Distance Problem (EDP):
For the zero-shot experiment, you can use:

cd Close/run
python run_p_EDP.py gpt-4-1106-preview
For the few-shot experiment, we currently support examples drawn from the same question (self), and may support examples from other questions (other) in the future.
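Assuming the few-shot runner follows the same naming pattern as the zero-shot script above (please check the repository for the exact script name and flag), a few-shot run might look like:

cd Close/run
python run_p_EDP_few.py gpt-4-1106-preview --example "self"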
Join the conversation
The NPHardEval leaderboard, dataset, and code are available on GitHub and Hugging Face for the community to access and contribute to.
We look forward to the community's contributions and interest in the NPHardEval GitHub repository and the Hugging Face leaderboard.