Evaluate and improve code structures

ChatGPT in the industry - Part 8

Dr. Hans Egermeier | Alexandra Hose, 12.09.2024, 09:45

Are gpt-4o and Google Gemini also suitable for evaluating code quality and code structures? Or would developers be better off using traditional software tools? Questions that are worth a closer look.

Part 7 of this series showed how the LLM gpt-4o can be used to identify architectural structures in software and visualize them as UML diagrams. This is very useful with regard to architecture documentation. This article looks at how well gpt-4o and Google Gemini can be used to assess the quality of the underlying code.

An essential aspect of software development is the continuous improvement and optimization of code, code structures and architecture. Classic software analysis tools for determining quality metrics are available for this task. For example, they can determine how complex the software is, i.e. how difficult it is to understand and change. Another common metric is the Maintainability Index, which combines several metrics. This index indicates how easily software can be maintained and how high the risk of errors is when a change is made. However, these analysis tools are usually expensive or not accessible to the development team. It would therefore be beneficial if LLMs such as ChatGPT could also be used for simple code evaluation.

In terms of quantitative metrics, this is only possible to a very limited extent, because an LLM is not a calculation tool, a knowledge database or a deterministic rule automaton. Let's take a look at some evaluation experiments with the LLM 'gpt-4o' from OpenAI and the LLM 'Gemini' from Google.

Quantitative assessment

For our experiment, we use McCabe's Cyclomatic Complexity (CC), which is comparatively easy to determine. This metric measures the number of linearly independent execution paths through a piece of code and serves as an indicator of its complexity. The higher the complexity, the more time-consuming and risky it is to make changes and search for errors in the corresponding code. The following interpretation of the specific values is recommended:

1 - 10: Simple code, low risk of change, easy to understand, easy to test, potentially error-free
11 - 20: More complex code, moderate risk, higher familiarization and testing effort
21 - 50: Complex code, high risk of change, considerable familiarization and testing effort
> 50: Untestable code, very high risk of change, potentially error-prone
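
The counting principle and the interpretation bands above can be sketched in a few lines of Python. This is a simplified illustration based on the standard `ast` module, not the algorithm of any particular metrics tool. The function names and the choice of node types that count as decision points are assumptions made for this example, and nested functions are simply counted as part of their parent.

```python
import ast

# Node types treated as one decision point each (a simplification).
DECISION_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                  ast.ExceptHandler, ast.IfExp, ast.Assert)


def cyclomatic_complexity(func_node):
    """Return 1 plus the number of decision points found in one function."""
    complexity = 1
    for node in ast.walk(func_node):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" contributes two additional branches
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # one loop plus one branch per "if" filter
            complexity += 1 + len(node.ifs)
    return complexity


def complexity_per_function(source):
    """Map each function name in the source text to its complexity."""
    tree = ast.parse(source)
    return {node.name: cyclomatic_complexity(node)
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}


def risk_band(cc):
    """Translate a CC value into the interpretation bands listed above."""
    if cc <= 10:
        return "simple, low risk"
    if cc <= 20:
        return "more complex, moderate risk"
    if cc <= 50:
        return "complex, high risk"
    return "untestable, very high risk"


SAMPLE = '''
def classify(x, items):
    if x > 0 and x < 10:
        return "small"
    for item in items:
        if item is None:
            continue
    return "other"
'''

for name, cc in complexity_per_function(SAMPLE).items():
    print(name, cc, risk_band(cc))
```

For the sample function, the sketch counts two `if` statements, one `for` loop and one boolean operator on top of the base value of 1. Other tools may apply different counting rules and arrive at different values for the same code.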

The tests are carried out in the Python programming language. The test object is the open source code of the 'Chainlit' library. The reference for the code metrics is the Python software analysis tool 'Radon'. Each experiment with the LLMs is run three times in order to also show the variance within the LLM responses. Before the actual tests with source code, a separate prompt was used in each case to check that gpt-4o and Gemini were familiar in principle with the terms and the associated procedures for determining cyclomatic complexity according to McCabe. Each prompt was run in a new chat in order to rule out mutual interference. With Gemini, it was additionally necessary to use a new tab in the incognito mode of the web browser. Without this strict separation of the questions, the response behavior indicated that the different chats were not completely independent of each other.

The test prompt is as follows:

You are an expert code analyst with deep knowledge of cyclomatic complexity (CC) analysis. Please determine the McCabe CC of this code in the following format:

total_cc = sum of all cyclomatic complexities of all methods in the given file

num_methods = total number of methods detected in code

num_methods_greater_0 = number of methods with a CC greater than zero

average_cc = total_cc / num_methods_greater_0

```python
<code to be analyzed comes here>
```
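
The four quantities requested in the prompt are simple aggregates, so a reply can at least be checked for internal consistency. The following is a minimal sketch under the assumption that per-function CC values are available as a dictionary; the function names, the sample dictionary and the tolerance are hypothetical choices for this illustration.

```python
def summarize_cc(cc_by_function):
    """Aggregate per-function CC values into the format used in the prompt."""
    total_cc = sum(cc_by_function.values())
    num_methods = len(cc_by_function)
    num_methods_greater_0 = sum(1 for cc in cc_by_function.values() if cc > 0)
    average_cc = total_cc / num_methods_greater_0 if num_methods_greater_0 else 0.0
    return {
        "total_cc": total_cc,
        "num_methods": num_methods,
        "num_methods_greater_0": num_methods_greater_0,
        "average_cc": average_cc,
    }


def averages_match(total_cc, num_methods_greater_0, average_cc, tolerance=0.05):
    """Check whether a reported average is consistent with the reported sum."""
    if num_methods_greater_0 == 0:
        return False
    return abs(total_cc / num_methods_greater_0 - average_cc) <= tolerance


# Hypothetical per-function values, only to show the aggregation.
print(summarize_cc({"load": 3, "save": 5, "helper": 1}))

# Values reported in the first gpt-4o run for Server.py (see results table).
print(averages_match(total_cc=89, num_methods_greater_0=30, average_cc=3.0))
```

Such a check only shows that the numbers in a reply fit together. It says nothing about whether the underlying CC values are correct.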

The results in table form

Test file	Tool	Cyclomatic complexity (sum)	Cyclomatic complexity (average)	Number of functions found	Number of functions with CC > 0
Server.py (948 LOC)	gpt-4o (1st test)	89	3.0	30	30
	gpt-4o (2nd test)	107	3.8	28	28
	gpt-4o (3rd test)	81	2.9	28	28
	Gemini (1st test)	29-49	2.1-3.5	14	14
	Gemini (2nd test)	-	-	-	-
	Gemini (3rd test)	-	-	-	-
	Radon (McCabe)	116	3.9	30	30
Step.py (475 LOC)	gpt-4o (1st test)	51	2.8	18	18
	gpt-4o (2nd test)	50	2.8	18	18
	gpt-4o (3rd test)	60	4.0	15	15
	Gemini (1st test)	18	2	12	9
	Gemini (2nd test)	12	1.5	10	8
	Gemini (3rd test)	12	1.71	8	7
	Radon (McCabe)	83	4.2	20	20

The table immediately shows clear deviations between the values determined by the LLMs and those of the Radon metrics tool. On the one hand, the LLMs systematically underestimate the complexity compared with Radon. The explanation is that the simple prompt does not define every detail and parameter of how the McCabe CC is to be determined, so the calculation rules applied by Radon and by the LLMs already differ at the algorithmic level. In our examples, this tends to lead to lower LLM values. This is not surprising, as it cannot be assumed that the model's knowledge corresponds exactly to the algorithm hard-coded in Radon. If you want to get closer to the metrics tool of your choice, you have to list the analysis rules coded in that tool explicitly in the prompt. As the deviations are systematic, the orders of magnitude are reasonably consistent and therefore useful as a rough ballpark figure. However, an experienced developer should always take a critical look at the LLM metrics before drawing concrete conclusions from them.

On the other hand, the strongly fluctuating values are striking: for Server.py, for example, the three gpt-4o runs yield sums of 89, 107 and 81. This means that gpt-4o or Gemini can only serve as a rough guide rather than a concrete metric. So if you want to measure your code precisely, you will not be able to avoid the architecture and code analysis tools available on the market. At the current state of the art, the use of gpt-4o or Gemini is only conceivable for a very rough assessment or, in the absence of a metrics tool, as a makeshift replacement.
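
Both effects, the systematic underestimation and the fluctuation between runs, can be quantified directly from the table. The following sketch uses only the CC sums of the gpt-4o runs and the Radon reference; the function name and the two chosen measures (mean relative deviation and range relative to the reference) are assumptions made for this illustration.

```python
from statistics import mean

# Sums of cyclomatic complexity taken from the results table.
RADON_SUM = {"Server.py": 116, "Step.py": 83}
GPT4O_SUMS = {"Server.py": [89, 107, 81], "Step.py": [51, 50, 60]}


def compare_with_reference(runs, reference):
    """Return mean relative deviation and relative spread of repeated runs."""
    deviations = [(value - reference) / reference for value in runs]
    spread = (max(runs) - min(runs)) / reference
    return mean(deviations), spread


for file_name, runs in GPT4O_SUMS.items():
    bias, spread = compare_with_reference(runs, RADON_SUM[file_name])
    print(f"{file_name}: mean deviation {bias:+.0%}, spread {spread:.0%}")
```

For these figures, the gpt-4o sums lie on average about 20 % below Radon for Server.py and about 35 % below for Step.py, while the three runs differ from each other by roughly 22 % and 12 % of the reference value respectively.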

Qualitative assessment

In contrast, we can expect more help from the LLMs in the qualitative assessment of software structures and architectures. Even if the answers and assessments of the LLMs should only be understood as suggestions and not as absolute truth, it is still surprising what can be achieved with the corresponding prompts. Two design principles serve as examples to illustrate the capabilities of gpt-4o in evaluating software structures:

Single Responsibility Principle (SRP): This principle states that a class should have only one responsibility, i.e. only one reason to change. This improves the maintainability and comprehensibility of the code.
Separation of Concerns (SoC): This principle states that a program should be divided into different areas (concerns), each of which performs a specific function or task.

These two principles help to design software architectures that are robust, maintainable and extensible by reducing complexity and increasing reusability.

The code for our application example is again the backend of the open source software Chainlit. The test prompt used analyzes the entire code base in one step and generates an evaluation with potential suggestions for improvement. The unabridged prompt in our test setup comprises 6443 lines. The concrete test prompt that has proven itself for this task is as follows:

# RULES OF AN EXPERT ARCHITECTURE ANALYST
You are a helpfull architecture analyst who is familiar with following best practices of software architecture:
1. **Separation of Concerns**: Divide the system into distinct sections, each responsible for a specific aspect or concern, to reduce complexity and improve maintainability.

2 **Single Responsibility Principle**: Ensure each component or module has one, and only one, reason to change by focusing on a single responsibility or functionality.

# GIVENS
<your code comes here> (The test prompt with the entire code base of the Chainlit backend is not shown for space reasons.)

# Intention

generate proposals in order to enhance the quality of the given code with respect to strictly follow the "single responsibility principle" and the "separation of concerns"

# COMMANDS FOR ARCHITECTURE QUALITY ANALYSIS AND IMPROVEMENT

Your job is now:

* Remember you are strictly following your given RULES and SKILLS AS AN EXPERT ARCHITECTURE ANALYST

* Silently analyze the intended architectural structure and behavior of the entire given code base in detail and also analyze any further documentation in detail if given.

* return a table of your five top rated proposed enhancements sorted by the "Architecture enhancement rating" and formatted in the following way (one example line is given as reference):

| 1 | file.py | some_method | 30-65 | 0.9 | {here comes an explanation that on the level of a precise and detailed refactoring prompt} |

* In case you think information is missing to generate a sufficiently precise formulation, return a warning "WARNING: information is missing to correctly fulfill the job!" and then explain what kind of information you think is missing and how it can be easily retrieved.

The result of the prompt, which is not shown here, is an evaluation table with suggestions for refactorings to improve the architecture. In many cases, the suggestions are justified and lead to more modular and better structured software in the sense of SRP and SoC. In a practical application scenario, the suggestions would be checked by a human expert, and the corresponding code would then be changed manually in a refactoring step or adapted automatically by an LLM. However, as always with LLMs, it is important to bear in mind that one and the same prompt executed multiple times will generate different suggestions. This means that although the suggestions are very often useful, they should never be regarded as complete or as the absolute truth.
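
Because the prompt fixes the row format of the reply, the suggestions can be read into a small data structure before a human expert reviews them. The following is a minimal sketch of such a parser. The field names are assumptions, since the prompt only names the "Architecture enhancement rating" explicitly, and the sample reply is invented for illustration and is not an actual LLM output.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    rank: int
    file: str
    target: str
    first_line: int
    last_line: int
    rating: float
    explanation: str


def parse_proposals(reply):
    """Extract proposal rows from an LLM reply, skipping anything malformed."""
    proposals = []
    for line in reply.splitlines():
        # Split into at most six cells so "|" inside the explanation survives.
        cells = [c.strip() for c in line.strip().strip("|").split("|", 5)]
        if len(cells) != 6:
            continue
        first, _, last = cells[3].partition("-")
        try:
            proposals.append(Proposal(int(cells[0]), cells[1], cells[2],
                                      int(first), int(last or first),
                                      float(cells[4]), cells[5]))
        except ValueError:
            continue  # header, separator or otherwise malformed row
    return sorted(proposals, key=lambda p: p.rating, reverse=True)


# Invented sample reply in the row format requested by the prompt.
SAMPLE_REPLY = """
| 1 | file.py | some_method | 30-65 | 0.9 | Move file access into its own class |
| 2 | other.py | handler | 12 | 0.7 | Extract input validation |
"""

for proposal in parse_proposals(SAMPLE_REPLY):
    print(proposal.rank, proposal.file, proposal.target, proposal.rating)
```

Rows that do not match the expected format are skipped rather than repaired, which keeps the decision about unclear suggestions with the human reviewer.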

The author: Dr. Hans Egermeier is Managing Director of talsen team.

When it comes to determining code metrics, you should not rely too much on the LLM results. If you need more than a very rough indication of code quality, you should still carry out the concrete analysis and evaluation with classic software tools. The suggestions generated with LLMs, on the other hand, are always worth a try when it comes to improving the software structure, especially if the context is not chosen too large, so that the LLM response does not become too generic and therefore useless.
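
One way to keep the context small is to split the code base into batches of files that stay below a line budget and to run the analysis prompt once per batch. The following sketch shows only the grouping step; the function name, the input format and the budget of 500 lines are assumptions for this illustration and were not part of the experiments described above.

```python
def batch_by_line_budget(files, max_lines):
    """Group files (name -> source text) into batches below a line budget.

    A single file that exceeds the budget on its own gets its own batch.
    """
    batches, current, used = [], [], 0
    for name, text in files.items():
        num_lines = len(text.splitlines())
        if current and used + num_lines > max_lines:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += num_lines
    if current:
        batches.append(current)
    return batches


# Hypothetical file sizes, only to show the grouping.
example_files = {
    "a.py": "pass\n" * 300,
    "b.py": "pass\n" * 200,
    "c.py": "pass\n" * 400,
}
print(batch_by_line_budget(example_files, max_lines=500))
```

Grouping by whole files keeps each class or module intact within one prompt, at the price of the LLM no longer seeing dependencies between batches.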
