Evaluating Large Language Models for Cybersecurity Applications

A white paper published by Carnegie Mellon University's Software Engineering Institute (SEI) and OpenAI claims that large language models (LLMs) could be an asset for cybersecurity professionals, but must be evaluated using realistic and complex scenarios to better understand the technology's capabilities and risks.

While LLMs are excellent at recalling facts, the paper, "Considerations for Evaluating Large Language Models for Cybersecurity Tasks," argues that recall alone is not enough: an LLM may know a great deal, but it doesn't necessarily know how to deploy that information correctly and in the right order.

According to Techxplore, evaluations that focus on theoretical knowledge ignore the complexity and nuance of real-world cybersecurity tasks, leaving cybersecurity professionals unsure of how or when to incorporate LLMs into their operations.

The paper's proposed solution is to evaluate LLMs the way one would evaluate a human cybersecurity operator: by testing theoretical, practical, and applied knowledge. However, testing an artificial neural network this way is extremely challenging, since in a field as diverse as cybersecurity even defining the tasks is hard.
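
To make that three-level distinction concrete, here is a minimal sketch in Python of what probing each knowledge level might look like. The query_llm helper and the example prompts are hypothetical illustrations, not material from the white paper:

```python
# Hypothetical sketch: probing an LLM at three levels of cybersecurity knowledge.
# `query_llm` stands in for any chat-completion client; the prompts are
# illustrative only and are not taken from the SEI/OpenAI white paper.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API of your choice."""
    raise NotImplementedError

EVAL_ITEMS = {
    # Theoretical: fact recall, easy to grade automatically.
    "theoretical": "Which port does RDP use by default? Answer with a number only.",
    # Practical: applying knowledge to a concrete artifact.
    "practical": "Given this IDS alert excerpt, name the likely attack technique: ...",
    # Applied: open-ended, multi-step reasoning closer to a real operation.
    "applied": (
        "You found a suspicious scheduled task on a domain controller. "
        "Describe, in order, your first three containment steps."
    ),
}

def run_eval() -> dict:
    """Collect one response per level; grading the answers is the hard part."""
    return {level: query_llm(prompt) for level, prompt in EVAL_ITEMS.items()}
```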

Furthermore, once the tasks are defined, an evaluation must ask up to millions of questions, because LLMs store knowledge implicitly, much as the human brain does, and only high-volume sampling can reveal what they can actually do. While creating that volume of questions can be done through automation, there isn't a tool that can generate enough practical or applied scenarios for the LLM.
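
The automation gap is easier to see with an example. The following sketch shows the kind of template expansion that can mass-produce theoretical recall questions; the template and parameter lists are invented for illustration, and no comparable generator exists for hands-on scenarios, which is exactly the paper's point:

```python
# Hypothetical sketch of template-based question generation, the kind of
# automation that already works for theoretical recall questions.
from itertools import product

TEMPLATE = "Which {artifact} would you inspect first for signs of {behavior} on a {system}?"

ARTIFACTS = ["authentication log", "process list", "network capture"]
BEHAVIORS = ["credential dumping", "lateral movement", "persistence"]
SYSTEMS = ["Windows workstation", "Linux server", "domain controller"]

def generate_questions() -> list:
    """Enumerate every combination: 3 x 3 x 3 = 27 distinct questions."""
    return [
        TEMPLATE.format(artifact=a, behavior=b, system=s)
        for a, b, s in product(ARTIFACTS, BEHAVIORS, SYSTEMS)
    ]

print(len(generate_questions()))  # 27; larger parameter sets scale multiplicatively
```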

In the meantime, as the technology catches up, the white paper provides a framework for designing realistic cybersecurity evaluations of LLMs: define the real-world task that the evaluation should capture, represent tasks appropriately, make the evaluation robust, and frame results appropriately.
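
As a rough illustration of how those four recommendations might map onto an evaluation harness, consider the sketch below. The CyberEval class and its fields are this article's invention, not tooling from the white paper:

```python
# Hypothetical sketch mapping the paper's four recommendations onto a tiny
# evaluation harness; names and structure are illustrative assumptions.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class CyberEval:
    real_world_task: str      # 1. define the real-world task to capture
    task_representation: str  # 2. represent the task appropriately
    trials: list = field(default_factory=list)  # 3. robustness: repeated runs

    def record_trial(self, score: float) -> None:
        self.trials.append(score)

    def framed_result(self) -> str:
        # 4. frame results appropriately: report scope and spread, not one number
        if not self.trials:
            return "no trials run"
        return (
            f"{self.real_world_task}: mean score {mean(self.trials):.2f} "
            f"over {len(self.trials)} trials; valid only for "
            f"'{self.task_representation}'"
        )

# Example usage with made-up scores
ev = CyberEval("triage phishing alerts", "ranked inbox of 50 simulated emails")
for s in (0.72, 0.65, 0.70):
    ev.record_trial(s)
print(ev.framed_result())
```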

The paper's authors believe LLMs will eventually enhance human cybersecurity operators in a supporting role rather than work autonomously, and they emphasize that even then, LLMs will still need to be evaluated. They also hope the paper spurs a move toward practices that can inform the decision-makers in charge of integrating LLMs into cyber operations.