How good is LLM code? A self-evaluation done by AI

Vibe coding is fun. It is fast, creative, and sometimes genuinely exciting. In just a few hours, an LLM can generate features that would normally take days or weeks to build.

But that is not the real question.

The real question is this: how good is the code once the excitement is over?

Over the last few weeks, I built more than 20 projects with LLMs. Some were created with coding agents, some with agentic workflows, and some with a more direct prompt-driven approach. I enjoyed the experiment a lot. The feeling of coding ten times faster is a bit like driving a convertible on the highway for the first time: fun, powerful, and slightly unreal.

Some of these projects ended up with beautiful, modern user interfaces. After just four hours, they already had features that would normally require months of spare-time development. I work on large and complex financial systems, so I would never have imagined building a video editor in a few hours with Java and Angular. The same happened with WebAssembly: I built a small Kotlin/Wasm application without even looking at most of the generated code.

That is the magic part.

Then comes the review.

[Figure: architecture diagram. Browser frontend (React + TypeScript, Vite on port 5173, MediaRecorder; App.tsx, api.ts, types.ts) calls a Spring Boot backend (Java 21, port 8080) over REST/HTTP. Backend: MvpController and MediaController; PracticeService (orchestrator), ChallengeService, UserProfileService, ProgressService, DeliveryDiagnostics, MediaStorage, WhisperTranscription, and CoachingService (Ollama, LlamaCpp, Qwen2). State: UserProgress and UserProfile held in memory in a ConcurrentHashMap (no DB); audio/video files on the filesystem under ./media-storage/. Local tools: Ollama (llama3:8b, localhost:11434), llama.cpp text LLM (localhost:8080), Qwen2-Audio audio LLM (localhost:8090), Whisper CLI speech-to-text (ggml-base.en.bin), and FFmpeg for pause/silence detection.]

No cloud. All inference runs locally.

Fast code, lots of code

At the beginning, I tried to review everything carefully. Then I got bored. And after a few iterations, most of the code had already changed anyway.

To be clear, I used this approach only for pet projects. The goals were simple:

  • understand how LLMs are changing our work as developers
  • evaluate the quality of the generated code
  • see how these tools could affect real-world projects
  • build personal tools I always wanted, but never had time to implement

And that is exactly what happened: in a few weeks I created more than 20 projects with a huge amount of code, most of it unread until something started behaving strangely.

Agents, skills, and unreliable employees

Yes, I tried to follow the recommended practices. I created skills, used AGENTS.md and CLAUDE.md, and added project instructions where it made sense.

My “employees” were Claude, Codex, Gemini, and Mistral. A couple of them were clearly more capable than the others, so they got priority on the more important tasks.

What surprised me is how human they sometimes felt, especially in the worst way.

Like a normal employee, an agent can start a large refactoring, do half of it, and then disappear because the token budget is finished. That is a strange way to work with a teammate: they break the code, stop in the middle, and then basically say, “Either pay more or wait a few hours.”

Funny? Yes. Professional? Not really.

So what is the quality of the code?

I will leave the full collection of agent fights for another post. Sometimes they are brilliant. Sometimes they behave worse than a junior developer with too much confidence.

One common pattern is surprisingly consistent: retrieving too much data, then filtering it later in the backend or frontend as if database queries did not exist. Add Hibernate into the picture, and you can quickly get the usual N+1 disasters too.
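To make the over-fetching pattern concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not code from any of these projects: the sessions table, its columns, and both function names are invented for illustration. The first function mirrors what the agents tended to produce; the second pushes the same filter into the query.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical sessions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER, score REAL)"
    )
    conn.executemany(
        "INSERT INTO sessions (user_id, score) VALUES (?, ?)",
        [(1, 4.5), (2, 7.0), (1, 8.5), (3, 6.0), (1, 9.0)],
    )
    return conn


def good_sessions_overfetch(conn, user_id, min_score):
    """Anti-pattern: load every row, then filter in application code."""
    rows = conn.execute("SELECT id, user_id, score FROM sessions").fetchall()
    matching = [r for r in rows if r[1] == user_id and r[2] >= min_score]
    return sorted(matching, key=lambda r: r[0])


def good_sessions_query(conn, user_id, min_score):
    """Same result, but the database does the filtering and ordering."""
    return conn.execute(
        "SELECT id, user_id, score FROM sessions "
        "WHERE user_id = ? AND score >= ? ORDER BY id",
        (user_id, min_score),
    ).fetchall()


if __name__ == "__main__":
    db = make_demo_db()
    print(good_sessions_overfetch(db, 1, 8.0))
    print(good_sessions_query(db, 1, 8.0))
```

Both functions return the same rows on this toy data. The difference only becomes visible with realistic volumes, where the first version transfers the whole table just to throw most of it away.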

Modern development

But I want to focus on one project in particular because it was the one I enjoyed most. It used Java, React, Python, Ollama, and a few other tools. The goal was to build a tutor that could help users analyze their communication and language skills.

After two days of slot-machine development, I realized the project had reached a dead end.

The problem was not only the code. It was also the current state of the models. I tried one open-source model and one mainstream model, and while both were excellent at turning sounds into words, they were much weaker at understanding the broader quality of human communication. Transcription was good. Real coaching and nuanced feedback were much harder.

At that point, I decided to do something interesting: I asked the LLMs to evaluate the code they had written.

Mistral: 6.5/10

Mistral gave the project an overall 6.5.

It was generous in a few areas:

  • Architecture: 8/10
  • Error handling: 8/10

It clearly appreciated the layered structure and the fallback logic.

The weak points were more revealing:

  • Maintainability: 5/10
  • Readability: 6/10
  • Testing: 3/10

The frontend had become a monster. Almost everything ended up in App.tsx, which grew beyond 3,000 lines. I had explicitly asked to keep files under 300 lines, but that instruction did not survive contact with the agent.

So according to Mistral, the application was not terrible, but it was hard to read, hard to maintain, and barely tested.

In other words: a very human result.

Gemini: 4.0/10

Gemini was much harsher and gave the project an overall 4.0.

Maybe that is because it was less involved in the development and had to fix bugs left by the others. Or maybe it was simply more honest.

Its review was painful, but fair.

The backend received 6.0 for separation of concerns, but large classes and bloated services dragged the score down.

The frontend got 2.0.

That part was brutal:

  • App.tsx had grown to almost 4,000 lines
  • there were thousands of lines of hardcoded values
  • the components folder, which I had explicitly requested, was empty
  • too much JSON and configuration data lived directly in the frontend instead of a more appropriate place

Testing and reliability got 1.0, which honestly felt accurate.

Gemini also pointed out something important: the rules described in AGENTS.md were only partially followed. This is one of the most frustrating things about these tools. You can define rules, structure, and constraints, and they may still ignore them the moment the task becomes large enough.

On the positive side, it gave 8/10 for AI and innovation. The project ideas were good. The implementation discipline was not.
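A rule like the 300-line limit does not have to depend on the agent's goodwill; it can be checked by a script. The following is only a sketch of that idea, not something used in this project. The function name files_over_limit, the default suffixes, and the decision to skip node_modules are all assumptions.

```python
from pathlib import Path


def files_over_limit(root, max_lines=300, suffixes=(".ts", ".tsx")):
    """Return (relative_path, line_count) for every source file above max_lines."""
    root_path = Path(root)
    offenders = []
    for path in sorted(root_path.rglob("*")):
        # Skip dependencies and anything that is not a matching source file.
        if "node_modules" in path.parts:
            continue
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > max_lines:
            offenders.append((path.relative_to(root_path).as_posix(), line_count))
    return offenders


if __name__ == "__main__":
    import sys

    found = files_over_limit(sys.argv[1] if len(sys.argv) > 1 else ".")
    for name, count in found:
        print(f"{name}: {count} lines")
    # A non-zero exit code lets a pre-commit hook or CI job fail the change.
    sys.exit(1 if found else 0)
```

Run as a pre-commit hook or CI step, a check like this turns the instruction into a failing build instead of a suggestion the agent can drift away from.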

Claude: 5.5/10

Claude was one of the main contributors, so this was close to a self-evaluation.

Here is the short version of its review.

Overall score: 5.5/10

Backend: 7.5/10

Claude liked:

  • the layered architecture
  • centralized exception handling
  • the AI fallback chain
  • thread-safe in-memory state with ConcurrentHashMap
  • basic file upload protections
  • externalized configuration

It criticized:

  • PracticeService for doing too much
  • silent failures in some AI service fallbacks
  • missing validation annotations on DTOs
  • overly broad CORS configuration

That feels fair. The backend was not clean, but it was at least recognizable as software architecture.
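The "silent failures" point deserves a small illustration. The sketch below shows one way a fallback chain can stay loud about what went wrong. It is written in Python for brevity rather than in the project's Java, and it is an assumption about the general shape, not the project's actual CoachingService: the function coach_with_fallback, the exception class, and the provider callables are hypothetical.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("coaching")

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def coach_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, answer) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Log and remember the failure instead of swallowing it.
            logger.warning("provider %s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    # Nothing worked: surface every collected error to the caller.
    raise AllProvidersFailed("; ".join(errors) or "no providers configured")
```

Returning the provider name together with the answer also lets the caller show which model actually produced the feedback, which helps when the fallbacks differ a lot in quality.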

Frontend: 3.5/10

This was the real disaster.

Main issues:

  • App.tsx had 3,830 lines
  • it contained 68 useState hooks
  • only two tiny components had been extracted
  • AppContext.tsx was created and then abandoned
  • there was no ESLint, no Prettier, and no pre-commit hooks

Claude did recognize a few good parts:

  • api.ts was clean
  • useRecording.ts was well written
  • the TypeScript types were fairly complete and aligned with the backend DTOs

But those were islands in a monolith.

Tests: 1/10

Only one empty smoke test existed for the whole codebase.

That is not a testing strategy. That is decoration.

And this is one of the clearest patterns I have seen with LLM-generated projects: they are happy to generate visible features first, but unless you force them very explicitly, testing is one of the first things they neglect.

Codex: no review without billing

Codex did not complete the evaluation because the token budget was exhausted.

I was offered a very modern choice: wait for a reset, or pay more.

You can imagine my answer.

Does AI love monoliths? What I learned

I have to admit that I did not create 300 ultra-detailed instructions for the agents in this project. It was a prototype, and I did not want to spend too much time writing and validating rules before I even knew whether the idea was worth pursuing.

Still, the results were interesting.

The biggest lesson was how bad the frontend code became.

Coming from Angular, it feels almost unnatural to build everything as one giant monolith. But without strong instructions, LLMs often drift toward exactly that style: inline logic, mixed responsibilities, giant files, and almost no real component structure.

I noticed the same pattern in other stacks too. In Angular, they happily mix HTML, CSS, and TypeScript inline and keep growing the same class. In Python, if you do not force a multi-file structure, they keep adding more code into the same file forever.

A plausible theory is that monoliths are easier for the model. The context is all in one place, so it keeps appending code instead of designing structure.

That may be convenient for the LLM.

It is terrible for the human who has to debug the result later, especially when the agent runs out of tokens and leaves in the middle of a half-finished refactoring.

Skill issue? Are we Project Managers now?

Some people could argue that this is not a real problem, but simply a skill issue. In this project, I did not enforce strict rules or detailed constraints. In other projects, I use much more structured setups with agents, sub-agents, skills, and explicit guidelines.

I plan to repeat the same evaluation on larger and more important projects to see how much those constraints actually improve the outcome.

However, this raises a deeper concern.

If LLMs do not reliably internalize good engineering practices, and instead require carefully designed systems of rules, agents, and coordination layers, then our role starts to shift. We are no longer just developers writing software — we become orchestrators of virtual teams.

In other words, we start behaving like project managers.

And that is ironic, because for many engineers that is one of the least appealing roles (and the subject of a lot of jokes in our community).

Final verdict

After I asked three LLMs to evaluate code they had helped write, the pattern was surprisingly consistent:

Area             Score (out of 10)
Backend          7
Frontend         3
Testing          1
Maintainability  5
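As a rough cross-check of the summary, the individual scores quoted in the three reviews can be averaged. The grouping below is an assumption about which numbers belong together (for example, Mistral gave no separate backend or frontend score, so it is left out of those rows), and a plain mean does not have to match the rounded summary figures exactly.

```python
from statistics import mean

# Scores as quoted in the reviews; the grouping into areas is an assumption.
scores = {
    "overall": {"Mistral": 6.5, "Gemini": 4.0, "Claude": 5.5},
    "backend": {"Gemini": 6.0, "Claude": 7.5},
    "frontend": {"Gemini": 2.0, "Claude": 3.5},
    "testing": {"Mistral": 3.0, "Gemini": 1.0, "Claude": 1.0},
}


def average_by_area(by_area):
    """Return the mean score per area, rounded to two decimals."""
    return {area: round(mean(values.values()), 2) for area, values in by_area.items()}


if __name__ == "__main__":
    for area, value in average_by_area(scores).items():
        print(f"{area:<10} {value}")
```

With these inputs the means are 5.33 overall, 6.75 for the backend, 2.75 for the frontend, and 1.67 for testing, which is in line with the overall picture: a tolerable backend, a weak frontend, and almost no tests.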

That is the part I find most interesting.

LLMs are incredible accelerators. They can help you explore ideas, build prototypes, generate features quickly, and enter unfamiliar technical areas much faster than before.

But speed is not the same as quality.

In my experience, they are much better at creating momentum than at preserving structure. They are good at making progress look impressive. They are much less reliable when it comes to maintainability, testing, boundaries, and long-term clarity.

So, how good is LLM-generated code?

My current answer is: good enough to impress you, often too messy to trust, and very expensive to ignore once the project grows.

Or, to put it more bluntly: sometimes it looks like a normal human project — just created ten times faster, and often ten times bigger.