Oscar

I collect mechanical wristwatches and enjoy listening to jazz and techno music. As an open source enthusiast, I focus on legal compliance issues in the software world.

Detecting source code generated by AI using Machine Learning

AI has become disruptive in many ways, especially for developers who use AI agents to debug software, remediate errors, and even generate all the code for simple applications automatically.

Using AI to create software raises additional concerns about undeclared obligations, mainly because of the lack of clarity around copyright ownership, the undefined legal status of the service provider (contractor services vs. tooling), and the still-open question of whether the generated code is affected by the third-party licenses of the datasets used to train the model.

The Problem with AI and Copyright

The new paradigm has brought new challenges for lawyers and copyright experts, who were caught off guard by the technology and are rushing to define policies and strategies that help developers use AI tools responsibly, balancing ease of use against the management of possible legal risks.

Whatever set of rules an organization defines about using AI, there is still the practical problem of enforcing them. Companies must devise a way to verify whether code was written by an engineer or generated by an AI tool in order to ensure legal compliance and security standards. While many might suggest plagiarism detection for identifying AI-generated code, it’s apparent that matching code blocks against open source code is no longer helpful.

Detecting AI-generated code has become challenging, and as AI technology evolves, it will become more complex each day until it becomes impractical. Some AI agent services are looking to provide additional features that cover the legal aspects, like using “reftags” to track the provenance of the generated output or allowing users to filter the datasets used to answer a question. Still, not all players want to keep the game fair. To keep their advantage, some are fine with leaving these safeguards out when generating code, which will undoubtedly raise concern across the industry.

Plagiarism and Snippet Detection for AI could be obsolete.

Many vendors who offer snippet detection products are trying to add AI detection features. Still, AI technology has evolved beyond generating blocks of code. The most recent versions of AI tools can handle a programming language much as they handle written human language, which makes code snippet detection obsolete unless vendors add AI capabilities to their detection tools and implement additional techniques.

If plagiarism detection is obsolete, we should look at what a developer does when writing software instead of checking the code blocks. Every developer writes code differently, not because they use different structures, but because each has different preferences for the “style” in which the code is written. There are even studies that relate coding style preferences to the amount of money a developer makes.

The “styles” and preferences introduced when the code was written, together with the algorithm behind the source code, could be the input for an AI detection system.
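To make “style” a bit more concrete, here is a minimal sketch of one measurable style signal: how consistently the identifiers in a Python file follow a single naming convention. The function name `naming_consistency` and the two conventions it checks are assumptions chosen for illustration, not something taken from an existing tool.

```python
import ast
import re

# Two common naming conventions (illustrative choice, not an exhaustive list).
SNAKE = re.compile(r"^_*[a-z][a-z0-9]*(_[a-z0-9]+)*_*$")
CAMEL = re.compile(r"^_*[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$")


def naming_consistency(source: str) -> float:
    """Share of recognized names that follow the dominant convention (0.0 to 1.0)."""
    tree = ast.parse(source)
    names = []
    for node in ast.walk(tree):
        # Collect function names and every name that is assigned to.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.append(node.id)
    snake = sum(1 for name in names if SNAKE.match(name))
    camel = sum(1 for name in names if CAMEL.match(name))
    classified = snake + camel
    if classified == 0:
        return 1.0  # nothing to judge, so treat the file as consistent
    return max(snake, camel) / classified
```

A file that mixes `row_count` and `rowTotal` scores lower than one that sticks to a single convention. On its own this says nothing about who wrote the code; it is only one number among several that a detection system could look at.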

Using Machine Learning and feature extraction to identify AI

Implementing an AI detection system goes beyond what a blog post could cover, but simple techniques can be used.

While researching, I came up with the idea of implementing a “feature extraction” step that generates a score for each feature. Those scores can then be used to train a Machine Learning DecisionTreeClassifier that identifies AI vs. human code.
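As a rough illustration of what such a feature extraction step could look like, the sketch below turns a Python source string into three numbers: an approximate cyclomatic complexity, a comment ratio, and an indentation consistency score. All function names, the 4-space indentation rule, and the simplified complexity count are assumptions made for this example; this is not the code from the project.

```python
import ast
import io
import tokenize


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity: 1 plus the number of decision points found."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp)):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score


def comment_ratio(source: str) -> float:
    """Number of comment tokens divided by the number of non-blank lines."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    comments = sum(1 for tok in tokens if tok.type == tokenize.COMMENT)
    lines = [line for line in source.splitlines() if line.strip()]
    return comments / len(lines) if lines else 0.0


def indentation_consistency(source: str) -> float:
    """Share of indented lines that use spaces in multiples of 4 (assumed rule)."""
    indented = [line for line in source.splitlines()
                if line.strip() and line[0] in " \t"]
    if not indented:
        return 1.0
    good = 0
    for line in indented:
        lead = line[:len(line) - len(line.lstrip())]
        if "\t" not in lead and len(lead) % 4 == 0:
            good += 1
    return good / len(indented)


def extract_features(source: str) -> list:
    """Return one score per feature, in a fixed order."""
    return [
        float(cyclomatic_complexity(source)),
        comment_ratio(source),
        indentation_consistency(source),
    ]
```

Calling `extract_features` on the text of a file gives a fixed-length list of scores, which is the shape a classifier expects as input. Real scoring would likely need normalization and per-language handling, which this sketch leaves out.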

I’m extracting five features: cyclomatic complexity, style guideline consistency, repetitive patterns, comment quality, and code indentation. Each one provides a score that is later passed to a DecisionTreeClassifier implemented with scikit-learn. Feel free to suggest other features; you can find the whole project on GitHub.

It is a rudimentary method, and the code is ugly, but it’s a method that can be extended.
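The classification step can be sketched as follows, assuming each file has already been reduced to five scores in the order listed above. The score values and labels here are hypothetical, invented only so the example runs; they are not measurements and say nothing about what AI or human code really looks like. `train_detector` is an illustrative helper name, while `DecisionTreeClassifier`, `fit`, and `predict` are the standard scikit-learn calls.

```python
from sklearn.tree import DecisionTreeClassifier

# Column order of each score vector (one score per extracted feature).
FEATURES = ["complexity", "style_consistency", "repetition",
            "comment_quality", "indentation"]

# HYPOTHETICAL scores and labels, made up for this example only.
TRAIN_SCORES = [
    [0.20, 0.95, 0.70, 0.90, 1.00],
    [0.25, 0.90, 0.65, 0.85, 1.00],
    [0.60, 0.55, 0.30, 0.40, 0.80],
    [0.70, 0.60, 0.20, 0.30, 0.75],
]
TRAIN_LABELS = ["ai", "ai", "human", "human"]


def train_detector(scores, labels):
    """Fit a decision tree on per-file feature scores."""
    clf = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatable trees
    clf.fit(scores, labels)
    return clf


detector = train_detector(TRAIN_SCORES, TRAIN_LABELS)

# Scores for a new, unlabeled file (also hypothetical).
new_file_scores = [[0.22, 0.92, 0.68, 0.88, 1.00]]
print(detector.predict(new_file_scores)[0])
```

With only four training rows the tree simply memorizes the examples. A usable detector would need a large, labeled corpus and a held-out test set before its output means anything.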

NOTICE

This tool does not provide legal advice; I’m not a lawyer.

The code is an experimental implementation. Refrain from relying on the accuracy of the output of this tool.

23 Apr 2023


Oscar's Blog
