Infosecurity Europe
3-5 June 2025
ExCeL London

Top Eight Large Language Models Benchmarks for Cybersecurity Practices

Infosecurity has selected eight benchmark suites to help you pick the best LLM for cybersecurity

Generative AI and large language models (LLMs) are increasingly being used in cybersecurity.

Many security products, from threat intelligence assistants to vulnerability enrichment tools, as well as phishing detection engines, rely on LLMs.

However, concerns about their reliability and accuracy remain a significant limitation in critical use cases.

Recently, private sector and academic researchers have been developing evaluation tools to compare LLMs on specific cybersecurity tasks.

These evaluation tools are known as benchmarks. A benchmark is a standardized test used to compare the performance of different systems or components, such as algorithms or software tools.

Infosecurity has selected eight benchmark suites developed on state-of-the-art LLMs to help you select the best LLM for cybersecurity.

Performance of LLMs in General Security Tasks

Meta Benchmark Suite CyberSecEval 2

CyberSecEval 2 is a benchmark suite that was introduced by Meta AI in April 2024.

The suite is designed to evaluate LLMs’ security risks and capabilities in the context of cybersecurity applications.

Its predecessor, CyberSecEval, was primarily dedicated to quantifying the potential security hazards of using LLMs. CyberSecEval 2 has introduced new tests and metrics allowing it to quantify LLMs' strengths in cybersecurity tasks.

CyberSecEval 2 takes a two-pronged approach:

  • Identifying cybersecurity risks: It includes various tests designed to uncover weaknesses in LLMs that could be exploited for malicious purposes. These tests include insecure coding practices, cyber-attack helpfulness and following instructions
  • Quantifying LLM capabilities:  It also includes tests that measure the strengths of LLMs in cybersecurity tasks, such as threat detection and security awareness training

“We find most LLMs failed to completely solve our test cases, but LLMs of increasing general coding ability scored better on our tests, suggesting more research is needed before LLMs can autonomously generate exploits, but that LLM exploitation capability may grow with overall LLM coding capability,” said the authors of the paper introducing CyberSecEval 2.

The Sophos Benchmark Suite

In March 2024, Sophos introduced a set of three benchmarks based on tasks its AI team believed are fundamental perquisites for most LLM-based defensive cybersecurity applications:

  • Incident investigation assistant: LLMs are evaluated on their ability to act as incident investigation assistants by converting natural language questions about telemetry into SQL statements
  • Incident summarisation: LLMs are asked to generate incident summaries from security operations centre (SOC) data
  • Incident severity evaluation: LLMs are asked to rate the severity of cyber incidents

Upon performing these three tests with ten LLMs (commercial and open source), the team behind this benchmark suite concluded that when it comes to summarising incident information from raw data, most LLMs perform adequately, though there is room for improvement through fine-tuning.

“However, evaluating individual artefacts or groups of artefacts remains a challenging task for pre-trained and publicly available LLMs. To tackle this problem, a specialised LLM trained specifically on cybersecurity data might be required,” the Sophos team said. 



How LLMs Evaluate Cybersecurity Knowledge

SECURE, the ICS-Specific Benchmark Suite

In May 2024, a team of researchers from the Rochester Institute of Technology (RIT) published a paper introducing Security Extraction, Understanding & Reasoning Evaluation (SECURE), a new benchmark suite designed to assess LLMs performance in realistic cybersecurity scenarios.

SECURE includes six datasets focused on industrial control systems (ICS) security. These include knowledge extraction, understanding, and reasoning based on industry-standard sources.

They tested seven LLMs – three commercial and four open source – and found that commercial LLMs, and especially ChatGPT 4, currently outperform open source options.

CyberMetric, the Q&A Benchmark

A team of five researchers from the United Arab Emirates and Norway developed benchmark suite similar to SECURE, which they introduced in a June 2024 paper.

They developed four multiple-choice Q&A benchmark datasets by training an LLM (ChatGPT 3.5) on a range of cybersecurity documents, including NIST standards, research papers, publicly accessible books and other publications in the cybersecurity domain, using retrieval-augmented generation (RAG).

RAG is an AI training method that combines LLMs with external knowledge bases to improve the accuracy and reliability of generated text.

The four benchmarks are called CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000. They comprise 80, 500, 2000, and 10,000 questions, respectively.

OpenAI’s GPT-4o provided the best results in all four benchmarks out of the 25 tested models.

Cyber Threat Intelligence Capabilities of LLMs

CTIBench, the Comprehensive CTI Benchmark Suite

In June 2024, a team of researchers from the Rochester Institute of Technology (RIT) launched CTIBench, a benchmark suite designed to assess the performance of LLMs in cyber threat intelligence (CTI) applications.

Some of the co-creators were involved in creating the SECURE benchmark suite.

The researchers described CTIBench as “a novel suite of benchmark tasks and datasets to evaluate LLMs in cyber threat intelligence.”

The final product is composed of four building blocks:

  • CTI Multiple Choice Questions (CTI-MCQ)
  • CTI Root Cause Mapping (CTI-RCM)
  • CTI Vulnerability Severity Prediction (CTI-VSP)
  • CTI Threat Actor Attribution (CTI-TAA)

Infosecurity Magazine explained the process behind the creation of CTIBench in more detail in a June 17 article

SEvenLLM, the Bilingual CTI Benchmark

Also in June 2024, a team of academics from the State Key Laboratory of Complex & Critical Software Environment at Beihang University in China introduced a cyber threat intelligence-focused framework called SEvenLLM.

SEvenLLM was built using a bilingual instruction corpus made of a curated collection of 6706 English and 1779 Chinese high-quality security reports.

Then, the researchers designed a pipeline to auto-select tasks from the tasks pool and convert the raw text into supervised corpora comprised of questions and responses.

The instruction dataset, SEvenLLM-Instruct, contains nearly 90,000 samples and is used to train cybersecurity LLMs with a multi-task learning objective (with 28 tasks) to augment the analysis of cybersecurity events.

Then, they developed a multiple-choice Q&A benchmark, SEvenLLM-Bench, with 1300 test samples.

In their paper, the researchers concluded: “The instruction dataset SEvenLLM-Instruct encompassing 28 well-conceptualized tasks is used to fine-tune SEVenLLM based on the different foundation models (Llama and Qwen).

“The extensive main and analytic experiments performed on a specialised curated cybersecurity benchmark, SEVenLLM-Bench, further corroborate the efficacy of SEVenLLM in improving the analytical capabilities and providing a robust response mechanism against cyber threats.”


ADVERTISEMENT


Evaluating the Use of LLMs in Other Cybersecurity Tasks

SecLLMHolmes, Assessing Vulnerability Detection Capabilities

SecLLMHolmes is a generalised, fully automated, and scalable framework for evaluating the performance of LLMs in vulnerability detection.

It was introduced in April 2024 by a team of researchers at IBM, Boston University, and the University of New South Wales (UNSW) in Sydney.

SecLLMHolmes’ objective is to evaluate the performance of LLMs at identifying vulnerable code.

For this, SecLLMHolmes assesses two key aspects:

  • Accuracy: it involves measuring how well an LLM understands and responds to queries accurately, ensuring it provides reliable information
  • Reasoning capabilities: it evaluates an LLM's ability to logically analyse information and draw sound conclusions

Upon testing five different LLMs — both commercial and open source — the researchers concluded that they are all currently unreliable at vulnerability detection and will answer incorrectly when asked to identify vulnerabilities in source code.

“Based on these results, we conclude that state-of-the-art LLMs are not yet ready to be used for vulnerability detection and urge future research to address and resolve the highlighted issues,” they added.

DebugBench, Assessing Debugging Capabilities

In June 2024, a team of Chinese academics and researchers at ModelBest and Siemens introduced DebugBench, an LLM debugging benchmark consisting of 4,253 instances and covering four major bug categories and 18 minor types in C++, Java, and Python.

To develop DebugBench, the researchers collected code snippets from the LeetCode community, implanted bugs into source data with GPT-4, and ensured rigorous quality checks.

They evaluated two commercial and four open-source models in a zero-shot scenario.

They drew the following conclusions:

  1. While closed-source models exhibit inferior debugging performance compared to humans, open-source models have relatively lower pass rate scores
  2. The complexity of debugging notably fluctuates depending on the bug category
  3. Incorporating runtime feedback has a clear impact on debugging performance, which is not always helpful

Conclusion 

By implementing these benchmarks, researchers and security professionals can collaboratively ensure that LLMs reach their full potential in cybersecurity and play a part in safeguarding our digital infrastructure.

This approach will be crucial for building trust and ensuring the reliable application of LLMs in critical cybersecurity tasks.

Enjoyed this article? Make sure to share it!



Looking for something else?


Tags


ADVERTISEMENT


ADVERTISEMENT