"Security and AI" - Part 1
ChatGPT and code analysis
LLMs open up new possibilities for analyzing and improving code. This article is the first of a three-part series on experiments that show how general-purpose language models that are already available influence security-relevant processes and activities.
Modern software uses a variety of different libraries and frameworks to provide new functions and enable a wide range of interfaces. However, these software components can also contain vulnerabilities and introduce new security gaps into a project. The correct configuration and maintenance of software components - for example through updates - during operation is essential for the secure operation of modern software. During development, ensuring code quality - for example through code analysis - plays just as important a role. For this reason, a wide range of tools for checking security and quality have become established over the years.
The surprising success of large language models has opened up new possibilities for analyzing and improving code. This article highlights the extent to which ChatGPT, a very generic AI model, can handle the problem of code analysis. ChatGPT is often credited with being able to solve complex problems due to its advanced language processing capabilities and good pattern recognition. This research aims to evaluate the actual performance of ChatGPT in code analysis.
The aim is to shed light on how it performs in practice when detecting and evaluating vulnerabilities in code. For a practical assessment, known vulnerabilities from the OWASP Top 10 list are used and the results of ChatGPT are compared with static analysis tools such as Snyk and DeepSource.
The methodology
To determine the influence of generative AI on the code analysis process in experimental tests, code snippets were created that contain known vulnerabilities from the OWASP Top 10 list. For the tests, 49 code snippets with vulnerabilities and 35 code snippets without vulnerabilities were generated. The experimental approach aims to evaluate the actual performance of the generative AI model under real-world conditions. The analyzed code snippets had to be designed realistically in order to simulate the conditions that developers encounter in everyday life as closely as possible. All code snippets are available online on GitHub (see web tip). The code snippets were then analyzed by ChatGPT, Snyk and DeepSource for possible vulnerabilities. The results of this experiment should show how well ChatGPT can be integrated into a real development environment. The tests were conducted in various programming languages - including C#, C++, Java, JavaScript, Go, PHP and Python - to assess the flexibility and efficiency of the generative AI tools.
ChatGPT-4 is considered one of the leading generative AI tools based on a large language model (LLM) and was therefore selected. Developed by OpenAI, it is characterized by its adaptability and pattern recognition. For the specific task of code analysis, a custom GPT was created and assigned the role of "cyber analyst". The custom GPT was instructed to find vulnerabilities in code snippets using static analysis, to categorize and explain them using the OWASP Top 10, and to suggest possible solutions. Custom GPTs allow specific instructions to be given to the AI once, which simplifies the analysis process. Since ChatGPT is trained with data from the Internet, the availability of tutorials and forum posts for a given language has a significant influence on the results for that language.
The aim of the evaluation is to show how well generative AI can support the code analysis process. The evaluation was carried out by providing both the created custom GPT and the static code analysis tools (default settings) with the same code snippets for analysis. These snippets contained vulnerabilities from the OWASP Top 10 and it was then evaluated how many of these vulnerabilities were detected by the respective tools (true positives) and how many remained undetected (false negatives). This approach enabled a direct comparison of the effectiveness of the various analysis tools under identical test conditions.
In addition to the vulnerability detection rate, the investigation also examined the false positives of the various tools, i.e. how often the tools falsely raise the alarm for secure code. Secure code for which no alarm is raised counts as a true negative. The higher the false positive rate, the more often a tool identifies secure code as vulnerable. For this purpose, secure variants of the previously insecure code snippets were created and given to the tools for analysis.
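
The two rates used in the evaluation can be written down in a few lines. The following C++ sketch is only an illustration of the definitions above: the struct ToolCounts, the function names and the counts in the usage note are assumptions made for this example and are not taken from the study.

```cpp
#include <cstddef>
#include <stdexcept>

// Outcome counts for one tool over the whole snippet set (hypothetical type).
struct ToolCounts {
    std::size_t true_positives;   // vulnerable snippets that were flagged
    std::size_t false_negatives;  // vulnerable snippets that were missed
    std::size_t false_positives;  // secure snippets that were flagged anyway
    std::size_t true_negatives;   // secure snippets that raised no alarm
};

// Share of vulnerable snippets the tool detected: TP / (TP + FN).
double detection_rate(const ToolCounts& c) {
    const std::size_t vulnerable = c.true_positives + c.false_negatives;
    if (vulnerable == 0) {
        throw std::invalid_argument("no vulnerable snippets in the test set");
    }
    return static_cast<double>(c.true_positives) / static_cast<double>(vulnerable);
}

// Share of secure snippets the tool wrongly flagged: FP / (FP + TN).
double false_positive_rate(const ToolCounts& c) {
    const std::size_t secure = c.false_positives + c.true_negatives;
    if (secure == 0) {
        throw std::invalid_argument("no secure snippets in the test set");
    }
    return static_cast<double>(c.false_positives) / static_cast<double>(secure);
}
```

With purely illustrative counts of 8 detected and 2 missed vulnerable snippets, detection_rate returns 0.8. With 1 falsely flagged and 9 correctly accepted secure snippets, false_positive_rate returns 0.1.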
The results
First, we focus on the general code analysis performance of ChatGPT, Snyk and DeepSource, presenting the observed detection rates and discussing possible reasons for them. This is followed by a more detailed code example.
Detection rate of vulnerabilities
Figure 1: The detection rates of the vulnerable code snippets. © Esslingen University of Applied Sciences
As can be seen in Figure 1, ChatGPT-4 shows a high detection rate (true positives) of around 88% for code vulnerabilities across all tested programming languages. ChatGPT classifies patterns and evaluates them statistically, so this high detection rate indicates that similar problems are well represented in its training data. In comparison, the detection rate of the static analysis tools Snyk and DeepSource was around 31%.
This difference stems from the way static code analysis tools work. They use data flow analysis, control flow graphs, taint analysis and lexical analysis, techniques borrowed from compiler technology, to collect information and match it against patterns of known vulnerabilities, errors or deviations from best practices. Vulnerabilities such as authentication issues, privilege escalation, broken access control and insecure use of cryptography are difficult to detect with these methods. A typical deviation from best practice is assigning too many rights to non-privileged users. This example in particular shows the contrast: generative AI recognizes the pattern from similar examples in its training data, whereas static analysis tools cannot check the logic of a rights assignment.
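
To make the limits of pure pattern matching concrete, the following C++ sketch implements a deliberately naive lexical check. It is a toy written for this illustration: the Rule type, the rule set and the scan function are assumptions, and real tools such as Snyk and DeepSource combine far more sophisticated techniques than substring search.

```cpp
#include <string>
#include <vector>

// One hypothetical lexical rule: a substring and the finding it triggers.
struct Rule {
    const char* pattern;
    const char* finding;
};

// Illustrative rule set, not taken from any real analysis tool.
const Rule kRules[] = {
    {"strcpy(", "unbounded copy, possible buffer overflow"},
    {"system(", "shell command execution, possible injection"},
};

// Return the finding of every rule whose pattern occurs in the source text.
std::vector<std::string> scan(const std::string& source) {
    std::vector<std::string> findings;
    for (const Rule& rule : kRules) {
        if (source.find(rule.pattern) != std::string::npos) {
            findings.push_back(rule.finding);
        }
    }
    return findings;
}
```

A line such as strcpy(buf, input); matches the first rule. A role check that grants too many rights contains no suspicious token at all, so a scanner of this kind reports nothing for it, because the problem lies in the logic rather than in the syntax.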
ChatGPT's ability to classify contextual information enables a deeper and more comprehensive analysis of the code. In our experiment, this shows in the identification of vulnerabilities that are not based only on superficial patterns but also involve the logic and structure of the code. ChatGPT also proposes a practical fix for each identified vulnerability. Both the fixes and the detection of the problems are mainly learned from forum entries. ChatGPT combines these independently, but it can only fall back on what is already documented on the Internet.
False positive rate
However, ChatGPT also showed a high false positive rate of 34%, meaning that secure code is often incorrectly marked as insecure (Figure 2). This tendency to flag secure code sections as vulnerable highlights the challenges of using generative AI in code analysis. The static analysis tools had a significantly lower false positive rate of 7%. The high false positive rate of ChatGPT can be attributed to the inherent tendency of AI models to look for an error when asked about one. In production use in the development process, this characteristic can cause a lot of manual checking and considerable frustration.
Evaluation of the sample code
To illustrate vulnerabilities that require contextual knowledge, a code snippet in C++ based on an example provided by the cybersecurity company Snyk is presented here. This illustrates the "Insecure Design" vulnerability:
#include <algorithm>
#include <fstream>
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::json;
// Note: db is assumed to be a database handle declared elsewhere in the application
// Open the permissions file and parse the roles
const json empl_role = [] {
    std::ifstream fin("./permissions.json");
    return json::parse(fin);
}();
void remove_employee(int employee_id, const std::string& person_request) {
    // Find the JSON object of the requesting person
    auto requester = std::find_if(empl_role.begin(), empl_role.end(),
        [&person_request](const auto& emp) {
            return emp["name"] == person_request;
        });
    // Return early if the requesting person is not found
    if (requester == empl_role.end()) {
        return;
    }
    // Get the role of the requesting person
    std::string role = requester->at("role");
    // Allow the roles 'developer' and 'owner' to delete employees
    if (role == "developer" || role == "owner") {
        db.query("DELETE FROM employees WHERE ID = ?", {employee_id},
            [](const auto& err, const auto& data) {
                if (err) throw err;
            });
    }
}
This code snippet demonstrates basic access control and user management in a system. First, a JSON file called permissions.json is loaded, which specifies the roles of the employees. The main part of the code is a function called remove_employee, which removes an employee from the database if the requesting user has the appropriate authorization. The requesting user is identified by name, which is searched for in the JSON object empl_role. If the function does not find the user, it returns. If the requesting user is found, their role is checked. Only users with the roles "developer" or "owner" may remove employees. If this is the case, the function executes a database command to delete the employee. The snippet illustrates how simple security checks can be implemented in a software application to ensure that only authorized persons can perform critical actions such as deleting employee data. However, a closer look reveals that removing employees is a critical process that should only be performed by administrators or "owners", not by "developers". Granting this right to the "developer" role therefore constitutes an insecure design.
| ChatGPT and the attackers |
|---|
| In the next two articles in this series, the experiment will be extended to the use of ChatGPT from the attackers' perspective. In particular, we will look at the generation of malware and the support of social engineering. Both are rapidly growing areas that developers need to understand in order to ensure the operation of a secure environment. |
When the code was scanned with the static tools Snyk and DeepSource, this vulnerability was not detected because these tools do not understand the context. ChatGPT, on the other hand, detected the vulnerability because it inferred from the context that the "developer" role should not be authorized to remove employees. This example illustrates why ChatGPT has a significantly higher detection rate: it recognizes more complex problems that cannot be identified by simple rule-based pattern matching.
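
A corrected authorization check for the example could look like the following C++ sketch. It is an assumption made for illustration and not one of the authors' secure test variants: the function name may_remove_employees and the role name "admin" are hypothetical, since the article only names the role "owner" and speaks of administrators in general.

```cpp
#include <string>

// Decide whether a role may delete employee records.
// Only "owner" and the hypothetical "admin" role are allowed.
// Every other role, including "developer", is denied by default.
bool may_remove_employees(const std::string& role) {
    return role == "owner" || role == "admin";
}
```

In remove_employee, the condition role == "developer" || role == "owner" would then be replaced by a call to may_remove_employees(role). Keeping the decision in one small function also makes the rights assignment easy to review and to test in isolation.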
The pros and cons
The results of the study highlight the transformative power of AI-based tools in the cybersecurity landscape. These tools are characterized by their ability to efficiently detect context-based and non-context-based vulnerabilities. These characteristics stand in stark contrast to traditional, static code analysis methods, whose capabilities largely involve rule-based pattern recognition.
The flexible analysis capability and high detection rate of ChatGPT underline the potential of AI-based approaches in code analysis. However, the high false positive rate and non-deterministic behavior pose challenges that could be addressed by more specialized training of generative AI and by targeted prompting. Some tools already offer such specialization, for example IntelliCode and GitHub Copilot.
An important finding is that the accessibility and the very high true positive rate are promising characteristics. The results indicate that AI tools are a valuable addition to traditional methods but still leave room for optimization. They offer security experts another way to proactively detect and mitigate threats. However, they also entail risks. In addition to the high false positive rate, the tools can also be used by attackers to discover and exploit vulnerabilities more easily. This can significantly lower the barrier to entry for cybercriminal activities, as it enables both experienced and less experienced attackers to identify security vulnerabilities.
The authors
Prof. Dr. Tobias Heer is Professor of IT Security and Dean of the IT Faculty at Esslingen University of Applied Sciences.
Lukas Bechtel is a research assistant for IT security at Esslingen University of Applied Sciences.
Dieter Holstein is a Master's student at Esslingen University of Applied Sciences.
Nils Lohmiller is a research assistant for IT security at Esslingen University of Applied Sciences.
















