ChatGPT in the industry - Part 8
Evaluate and improve code structures
Are gpt-4o and Google Gemini also suitable for evaluating code quality and code structures? Or would developers be better off using traditional software tools? Questions that are worth a closer look.
Part 7 of this series showed how the LLM gpt-4o can be used to identify architectural structures in software and visualize them as UML diagrams. This is very useful with regard to architecture documentation. This article looks at how well gpt-4o and Google Gemini can be used to assess the quality of the underlying code.
An essential aspect of software development is the continuous improvement and optimization of code, code structures and architecture. Classic software analysis tools that determine quality metrics are available for this task. For example, they can determine how complex the software is, i.e. how difficult it is to understand and change. Another common metric is the Maintainability Index, which combines several metrics. This index indicates how easily software can be maintained and how high the risk of errors is when a change is made. However, these analysis tools are usually expensive or not accessible to the development team. It would therefore be helpful if LLMs such as ChatGPT could also be used for simple code evaluation.
For quantitative metrics, this is possible only to a very limited extent, because an LLM is neither a calculator nor a knowledge database nor a deterministic rule engine. Let's take a look at some evaluation experiments with the LLM 'gpt-4o' from OpenAI and the LLM 'Gemini' from Google.
Quantitative assessment
For our experiment, we use McCabe's Cyclomatic Complexity (CC), which is comparatively easy to determine. This metric counts the linearly independent execution paths through a program and serves as an indicator of its complexity. The higher the complexity, the more time-consuming and risky it is to change the code and to search for errors in it. The values are commonly interpreted as follows:
- 1 - 10: Simple code, low risk of change, easy to understand, easy to test, potentially error-free
- 11 - 20: More complex code, moderate risk, higher familiarization and testing effort
- 21 - 50: Complex code, high risk of change, considerable training and testing effort
- > 50: Untestable code, very high risk of change, potentially error-prone
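The bands above can be turned into a small lookup. The following is a minimal sketch; the function name `classify_cc` and the short labels are chosen here for illustration and are not part of any tool mentioned in this article.

```python
def classify_cc(cc: int) -> str:
    """Map a cyclomatic complexity value to one of the four risk bands."""
    if cc < 1:
        # A function with no branches still has exactly one path.
        raise ValueError("cyclomatic complexity is at least 1")
    if cc <= 10:
        return "simple, low risk"
    if cc <= 20:
        return "more complex, moderate risk"
    if cc <= 50:
        return "complex, high risk"
    return "untestable, very high risk"
```

Applied per function rather than per file, such a lookup makes it easy to list only the functions that fall into the upper bands.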
The tests are carried out in the Python programming language. The test object is the open source code of the 'Chainlit' library. The reference for the code metrics is the Python software analysis tool 'Radon'. Each experiment with the LLMs is run three times in order to also show the variance within the LLM responses. Before the actual tests with source code, a separate prompt was used in each case to check that gpt-4o and Gemini were familiar in principle with the terms and procedures for determining cyclomatic complexity according to McCabe. Each prompt was run in a new chat in order to rule out mutual interference. With Gemini, it was necessary to use a new tab in the incognito mode of the web browser. Without this strict separation, the response behavior indicated that the different chats were not completely independent of each other.
The test prompt is as follows:
You are an expert code analyst with deep knowledge of cyclomatic complexity (CC) analysis. Please determine the McCabe CC of this code in the following format:
total_cc = sum of all cyclomatic complexities of all methods in the given file
num_methods = total number of methods detected in code
num_methods_greater_0 = number of methods with a CC greater than zero
average_cc = total_cc / num_methods_greater_0
```python
<code to be analyzed comes here>
```
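For comparison, the four values requested in the prompt are straightforward to compute once the per-method complexities are known. The sketch below assumes these are available as a dictionary; the name `summarize` and the sample values are hypothetical and only illustrate the arithmetic.

```python
from __future__ import annotations


def summarize(cc_per_method: dict[str, int]) -> dict[str, float]:
    """Compute the aggregates requested in the test prompt."""
    total_cc = sum(cc_per_method.values())
    num_methods = len(cc_per_method)
    num_methods_greater_0 = sum(1 for cc in cc_per_method.values() if cc > 0)
    # Guard against division by zero for an empty file.
    average_cc = total_cc / num_methods_greater_0 if num_methods_greater_0 else 0.0
    return {
        "total_cc": total_cc,
        "num_methods": num_methods,
        "num_methods_greater_0": num_methods_greater_0,
        "average_cc": average_cc,
    }


# Hypothetical per-method values, not taken from the Chainlit code.
summary = summarize({"load": 3, "save": 5, "helper": 1})
```

The hard part for an LLM is therefore not the aggregation but the per-method counting that precedes it.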
The results in table form
| Test file | Tool | Cyclomatic Complexity (sum) | Cyclomatic Complexity (average) | Number of functions found | Number of functions with CC > 0 |
|---|---|---|---|---|---|
| Server.py (948 LOC) | gpt-4o (1st test) | 89 | 3.0 | 30 | 30 |
| | gpt-4o (2nd test) | 107 | 3.8 | 28 | 28 |
| | gpt-4o (3rd test) | 81 | 2.9 | 28 | 28 |
| | Gemini (1st test) | 29-49 | 2.1-3.5 | 14 | 14 |
| | Gemini (2nd test) | - | - | - | - |
| | Gemini (3rd test) | - | - | - | - |
| | Radon (McCabe) | 116 | 3.9 | 30 | 30 |
| Step.py (475 LOC) | gpt-4o (1st test) | 51 | 2.8 | 18 | 18 |
| | gpt-4o (2nd test) | 50 | 2.8 | 18 | 18 |
| | gpt-4o (3rd test) | 60 | 4.0 | 15 | 15 |
| | Gemini (1st test) | 18 | 2 | 12 | 9 |
| | Gemini (2nd test) | 12 | 1.5 | 10 | 8 |
| | Gemini (3rd test) | 12 | 1.71 | 8 | 7 |
| | Radon (McCabe) | 83 | 4.2 | 20 | 20 |
In the table, the clear deviations between the values determined by the LLMs and those of the Radon metrics tool are immediately apparent. On the one hand, the LLMs systematically underestimate the complexity compared with Radon. The explanation is that the simple prompt does not define every rule and parameter for determining the McCabe CC. The counting rules applied by Radon and by the LLMs therefore differ at the algorithmic level, which in our examples tends to lead to lower LLM values. This is not surprising, as it cannot be assumed that the model's knowledge corresponds exactly to the algorithm hard-coded in Radon. If you want to get closer to the metrics tool of your choice, you have to list the analysis rules coded in that tool explicitly in the prompt. Because the deviations are systematic, the orders of magnitude are reasonably consistent and therefore useful as a rough ballpark figure. However, an experienced developer should always review the LLM metrics critically before drawing concrete conclusions from them.
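How much the counting rules matter can be shown with a deliberately simplified counter based on Python's standard `ast` module. This is an illustrative sketch, not Radon's implementation: the function `cyclomatic_complexity`, the choice of node types, and the sample code are assumptions made for this example. The only difference between the two calls is whether boolean operators count as additional decision points.

```python
import ast

# Node types treated as decision points in this simplified counter.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str, count_bool_ops: bool) -> int:
    """Return 1 plus the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif count_bool_ops and isinstance(node, ast.BoolOp):
            # `a and b and c` adds two decision points, one per operator.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def check(user, items):
    if user is None or not user.active:
        return False
    for item in items:
        if item.price > 100 and item.stock == 0:
            return False
    return True
'''

with_bool_ops = cyclomatic_complexity(SAMPLE, count_bool_ops=True)      # 6
without_bool_ops = cyclomatic_complexity(SAMPLE, count_bool_ops=False)  # 4
```

A single rule that the prompt leaves unspecified already changes the result for this short function from 6 to 4. Summed over a whole file, differences of this kind are sufficient to produce a systematic offset.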
On the other hand, the strong fluctuation of the values between runs is striking. This means that gpt-4o or Gemini can only serve as a rough guide rather than a precise metric. So if you want to measure your code precisely, you cannot avoid the architecture and code analysis tools available on the market. At the current state of the art, gpt-4o or Gemini can be considered only for a very rough assessment or, in the absence of a metrics tool, as a makeshift replacement.
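Both effects can be quantified from the summed CC values in the table. The following sketch uses the gpt-4o and Radon sums reported above; the function name `deviation_report` and the two chosen measures (mean deviation from the reference, and spread between runs relative to the reference) are assumptions of this example rather than part of the original evaluation.

```python
from statistics import mean

# Summed CC values taken from the results table.
RADON = {"Server.py": 116, "Step.py": 83}
GPT4O_RUNS = {"Server.py": [89, 107, 81], "Step.py": [51, 50, 60]}


def deviation_report(runs: dict, reference: dict) -> dict:
    """Return mean deviation and run-to-run spread in percent of the reference."""
    report = {}
    for name, values in runs.items():
        ref = reference[name]
        report[name] = {
            # Negative values mean the LLM underestimates the reference.
            "mean_deviation_pct": round((mean(values) - ref) / ref * 100, 1),
            # Distance between the highest and lowest run.
            "spread_pct": round((max(values) - min(values)) / ref * 100, 1),
        }
    return report


report = deviation_report(GPT4O_RUNS, RADON)
```

For gpt-4o this gives a mean underestimation of roughly 20 % for Server.py and 35 % for Step.py, with a spread between runs of roughly 22 % and 12 % of the Radon value respectively.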
Qualitative assessment
In contrast, we can expect more help from the LLMs in the qualitative assessment of software structures and architectures. Even though the answers and assessments of the LLMs should be understood only as suggestions and not as absolute truth, it is still surprising what can be achieved with suitable prompts. Two design principles serve as evaluation criteria to illustrate the capabilities of gpt-4o:
- Single Responsibility Principle (SRP): a class should have only one responsibility, i.e. only one reason to change. This improves the maintainability and comprehensibility of the code.
- Separation of Concerns (SoC): This principle states that a program should be divided into different areas (concerns), each of which performs a specific function or task.
These two principles help to design software architectures that are robust, maintainable and extensible by reducing complexity and increasing reusability.
The code for our application example is again the backend of the open source software Chainlit. The test prompt used analyzes the entire code base in one step and generates an evaluation with potential suggestions for improvement. The unabridged prompt in our test setup comprises 6443 lines. The concrete test prompt that has proven itself for this task is as follows:
# RULES OF AN EXPERT ARCHITECTURE ANALYST
You are a helpful architecture analyst who is familiar with the following best practices of software architecture:
1. **Separation of Concerns**: Divide the system into distinct sections, each responsible for a specific aspect or concern, to reduce complexity and improve maintainability.
2. **Single Responsibility Principle**: Ensure each component or module has one, and only one, reason to change by focusing on a single responsibility or functionality.
# GIVENS
<your code comes here> Test prompt with the entire code base of the Chainlit backend is not shown for space reasons
# Intention
generate proposals in order to enhance the quality of the given code with respect to strictly follow the "single responsibility principle" and the "separation of concerns"
# COMMANDS FOR ARCHITECTURE QUALITY ANALYSIS AND IMPROVEMENT
Your job is now:
* Remember you are strictly following your given RULES and SKILLS AS AN EXPERT ARCHITECTURE ANALYST
* Silently analyze the intended architectural structure and behavior of the entire given code base in detail and also analyze any further documentation in detail if given.
* return a table of your five top rated proposed enhancements sorted by the "Architecture enhancement rating" and formatted in the following way (one example line is given as reference):
| Number | filename | method name | Line of code (from - to) | Architecture enhancement rating (0 - 1) | explanation of enhancement |
| 1 | file.py | some_method | 30-65 | 0.9 | {here comes an explanation that on the level of a precise and detailed refactoring prompt} |
* In case you think information is missing to generate a sufficiently precise formulation, return a warning "WARNING: information is missing to correctly fulfill the job!" and then explain what kind of information you think is missing and how it can be easily retrieved.
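The article does not show how the code base was inserted into the GIVENS section. One possible approach, sketched here under that assumption, is to concatenate all Python files and prefix each with its relative path, so that the model can fill the "filename" column of the requested table. The function name `build_givens` and the `## <path>` separator are hypothetical choices.

```python
from pathlib import Path


def build_givens(root: Path) -> str:
    """Concatenate all Python files below `root`, each preceded by its relative path."""
    parts = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        # The heading tells the model which file the following code belongs to.
        parts.append(f"## {rel}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

The returned string would replace the code placeholder in the prompt; its line count gives a quick indication of how large the context becomes.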
The result of the prompt, which is not shown here, is an evaluation table with suggestions for refactorings to improve the architecture. In many cases, the suggestions are justified and lead to more modular and better structured software in the sense of SRP and SoC. In a practical application scenario, a human expert would check the suggestions, and the corresponding code would then be changed manually in a refactoring step or adapted automatically by an LLM. As always with LLMs, however, one caveat applies: the same prompt executed multiple times will generate different suggestions. This means that although the suggestions are very often useful, they should never be regarded as complete or as the absolute truth.
When it comes to determining code metrics, you should not rely too much on the LLM results. If you need more than a very rough indication of code quality, you should still carry out the concrete analysis and evaluation with classic software tools. The suggestions generated with LLMs, on the other hand, are always worth a try for improving the software structure. This applies especially if the context is not chosen too large; otherwise, the LLM tends to give answers that are too generic to be useful.
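To hand the suggestions to a reviewer in a structured form, the returned table can be parsed into records. The following is a minimal sketch that assumes the model adheres to the six-column format requested in the prompt; the names `Proposal` and `parse_proposals` and the second sample row are hypothetical. An explanation that itself contains a `|` character would not be parsed correctly by this simple approach.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Proposal:
    number: int
    filename: str
    method: str
    lines: str
    rating: float
    explanation: str


def parse_proposals(markdown: str) -> list[Proposal]:
    """Extract proposal rows from a markdown table, sorted by rating (highest first)."""
    proposals = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 6 or not cells[0].isdigit():
            continue  # Skips the header, the separator row and any prose.
        try:
            rating = float(cells[4])
        except ValueError:
            continue  # Row does not follow the requested format.
        proposals.append(
            Proposal(int(cells[0]), cells[1], cells[2], cells[3], rating, cells[5])
        )
    return sorted(proposals, key=lambda p: p.rating, reverse=True)


# Sample response: first row from the prompt's example line, second row invented.
SAMPLE_RESPONSE = """
| Number | filename | method name | Line of code (from - to) | Architecture enhancement rating (0 - 1) | explanation of enhancement |
|---|---|---|---|---|---|
| 1 | file.py | some_method | 30-65 | 0.9 | Split parsing and persistence into separate functions. |
| 2 | other.py | another_method | 10-20 | 0.95 | Move logging setup out of the request handler. |
"""

proposals = parse_proposals(SAMPLE_RESPONSE)
```

Collecting the records from several runs of the same prompt would also make the variation between runs visible before any refactoring is started.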















