Oscar Follow I collect mechanical wristwatches and enjoy listening to jazz and techno music. As an open source enthusiast, I focus on legal compliance issues in the software world.

Beyond Simple Code Scanning: Advanced Semantic Analysis for AI-Generated Code Compliance

When Code Scanners Miss the Forest for the Trees

The rise of AI-powered coding tools is creating a new kind of compliance risk that most organizations are not prepared for.

These tools offer remarkable productivity gains, but they also introduce subtle and serious challenges. License violations, patent exposure, and unauthorized reproduction of proprietary algorithms can now happen without direct copying. Traditional code scanners are not equipped to detect these risks.

The Growing Challenge of AI Code Compliance

Most current scanners rely on pattern matching, hashing, and surface-level similarity. This works for detecting direct reuse, but not for transformed, translated, or restructured code.

For example, an AI system might generate an audio or video codec that implements patented algorithms. Even though the code looks different (using new names, structures, or even languages), the core logic may still be the same. This can lead to intellectual property violations without any obvious sign of duplication.

Why Traditional Scanners Fall Short

Transformation blindness

AI-generated code is rarely copied directly. Instead, it often rewrites logic in a different style, language, or structure. For example, a quicksort algorithm written in Python using list comprehensions may appear entirely distinct from one written in Java with traditional for loops, even though they perform the same task.

Modern software projects frequently span multiple languages. An algorithm might start in Python during prototyping and later be implemented in Rust or Java for production. Most scanning tools are not equipped to follow this kind of transformation, which is understandable, since they were never designed to do so.

Pattern convergence

AI models can recreate functional equivalents of proprietary code without ever directly accessing the original code. This kind of convergence introduces risk, even when no exact code has been copied. In most cases, the developer is completely unaware that their generated code may resemble or replicate protected logic.

What Kind of Tools Are Needed

Scanning code to extract evidence and understand transformations is possible, but not easy. To meet this challenge, organizations need code analysis systems that can:

Recognize algorithmic similarity across different programming languages
Understand semantic equivalence despite style or syntax changes.
Identify core logic based on structure and behavior.
Remain effective even after variable renaming, code reordering, or formatting changes.

The problem requires a deeper understanding of the code’s logic and intent, not just its appearance. It may not appeal to everyone, but for old dinosaurs like me who enjoy technical puzzles and the thrill of discovering (or creating new problems), it is worth spending time exploring.

Research Project: Semantic Code Analysis

As part of a side research project, I developed a prototype system that analyzes the meaning behind code. The goal was to explore whether semantic similarity detection could uncover hidden risks in AI-generated code.

The Scenario

Over 1,000 AI-generated code samples
Five languages: Python, Java, JavaScript, TypeScript, and C
Algorithms included sorting, search, compression, and multimedia codecs.
Each code sample was transformed through:

Only the generated output was evaluated, completely isolated from any training data. I used publicly available generic LLMs, so I had no control over the content they produced.

The challenge required designing a new mechanism and algorithm to support decomposition and analysis. I built a kind of “salad” using components from various Open Source tools and code analysis techniques. While I cannot share many details about the implementation, here is what I discovered:

Key Results

Transformation resistance

The new similarity mechanism detected a similarity range of 53 to 72 percent across transformed versions. It was designed to identify AI-generated reimplementations across different languages, coding styles, and workflows. In comparison, traditional hash-based tools detected only between 0% and 15%.

Cross-language detection

Python to Java implementations showed 53.5 percent similarity.
Shared algorithmic patterns resulted in a 73.3 percent overlap.
Function purpose matching reached 66.7 percent accuracy.

Pattern recognition

Sorting algorithms were identified across all five languages.
Mathematical operations were detected even when expressed in different ways.
Control flow and logic were recognized despite variations in syntax.

The results suggested that it might be possible to detect similarities between existing code, such as a proprietary codec, and unintentionally created AI-generated reimplementations. Based on that, I decided to apply the method to a real-world case and attempt to uncover potential reimplementations.

Case Study: Real-World Codec Analysis

To further test the approach, I used different publicly available LLMs to generate source code for audio codec implementations that could infringe on existing patents. The goal was to evaluate how the system performs when analyzing unknown code across multiple programming languages and to measure semantic similarity. Due to licensing restrictions, I am unable to share the exact code that was generated.

Additionally, I reimplemented some well-known algorithms used by commercial code scanners to benchmark their performance in the same scenarios. I recreated these algorithms using only public documentation, so the results may differ from those produced by proprietary tools.

What worked

Detected similar audio compression algorithms in both Python and Java
Identified MDCT (Modified Discrete Cosine Transform) logic regardless of language or library
Recognized Huffman coding, windowing, and frame-processing logic
Maintained reliability despite differences in naming, libraries, and language paradigms

What was challenging

Pattern matching mismatches reduced confidence in some cases. For example, masking thresholds in Java and psychoacoustic_model in Python referred to the same concept but used different terminology.
Library abstraction levels also presented challenges. High-level libraries required different detection strategies, which were too complex to fully resolve in a short timeframe.
Language-specific idioms added noise to the analysis, necessitating normalization and fine-tuning. This is where most of the meaningful work took place, and where the real value of the approach began to emerge.

Real-World Impact

In the codec analysis:

The Python version showed 28.6 percent similarity with known codecs
The Java version showed a zero percent match using traditional tools
Cross-language semantic similarity confirmed the presence of equivalent compression logic at 53.5 percent

This confirmed that surface-level scanners would have missed the intellectual property risk entirely, while the multi-tier analysis mechanism was able to detect it successfully.

Statistical Highlights

Analysis across 2,320 samples (464 algorithms in 5 languages):

Detection accuracy reached over 85 percent
Cross-language similarity averaged between 53 and 72 percent
False positives remained under 5 percent
Detection held strong even with:

Broader Applications

This kind of semantic analysis is valuable for much more than compliance. It can support:

Internal code clone detection across large codebases
Competitive algorithm analysis
Open source license compliance when derivative work is involved (the new legal challenge?)
Validation of originality in AI-generated code

As generative coding systems become more advanced, organizations may need analysis tools that match that level of complexity. The question is, do such tools already exist?

Final Thoughts

Traditional scanners are built to catch copied code. However, AI does not simply copy, it reconstructs, transforms, and reimplements.

To detect risks, we must analyze the code’s meaning, not just its appearance. This research shows that semantic detection is both possible and practical. It provides better insight, stronger compliance, and a deeper understanding of the algorithms behind the code.

The future of compliance lies in semantic understanding, not superficial matching. It may be time to implement “vibe compliance” to hunt code copycats. Then again, that might open a Pandora’s box.

This research was conducted independently. All code used for testing was explicitly generated for experimental purposes and was not sourced from any known proprietary dataset. I used generic, publicly available LLMs and did not use models whose creators claim to have excluded copyleft or other problematic code from their training data.

02 Aug 2025

« AI Supply Chain Security Risks and Legal Compliance Gaps BSA Security Scanning tool for AI Models »

Oscar's Blog