<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Oscar&apos;s Blog</title>
    <description>Free Software, GNU/Linux, License Compliance, and more.</description>
    <link>https://ovalenzuela.com/</link>
    <atom:link href="https://ovalenzuela.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 13 Mar 2026 20:35:52 +0000</pubDate>
    <lastBuildDate>Fri, 13 Mar 2026 20:35:52 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Launching SEMCL.ONE: Community-Driven Software Composition Analysis</title>
        <description>&lt;p&gt;After years of building compliance automation inside large organizations, I kept running into the same problem: the tools that exist are either too expensive, too rigid, or too disconnected from how software is actually built today.&lt;/p&gt;

&lt;p&gt;So I built something different.&lt;/p&gt;

&lt;h2 id=&quot;what-is-semclone&quot;&gt;What is SEMCL.ONE?&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://community.semcl.one&quot;&gt;SEMCL.ONE&lt;/a&gt; is a community-driven Software Composition Analysis platform. It combines open-source tools, shared infrastructure, and AI automation to make compliance management accessible and scalable.&lt;/p&gt;

&lt;p&gt;The core idea is simple: compliance should be automated, not manual.&lt;/p&gt;

&lt;p&gt;That means detection, analysis, validation, reporting, and continuous monitoring—all flowing through your development pipeline without constant human intervention.&lt;/p&gt;

&lt;h2 id=&quot;why-now&quot;&gt;Why now?&lt;/h2&gt;

&lt;p&gt;Traditional SCA tools were built for a world where developers copied libraries from known sources. But that world is changing fast. AI-generated code doesn’t always match existing packages. It transforms, adapts, and creates variations that slip past hash-based detection.&lt;/p&gt;

&lt;h2 id=&quot;whats-inside&quot;&gt;What’s inside?&lt;/h2&gt;

&lt;p&gt;The platform includes several specialized tools:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;PURL2SRC &amp;amp; SRC2PURL&lt;/strong&gt; — Convert between package identifiers (PURLs) and source code across 13+ ecosystems (see the PURL example after this list)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;OSSLILI&lt;/strong&gt; — License detection supporting 700+ SPDX identifiers with multiple detection methods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Semantic CopyCat&lt;/strong&gt; — Advanced IP contamination detection targeting AI-generated code transformations&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MCP-SEMCLONE&lt;/strong&gt; — IDE integration that brings conversational compliance to AI assistants like Cursor, Cline, and VS Code&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;OSPAC&lt;/strong&gt; — Policy engine with 712 licenses and compatibility checking&lt;/li&gt;
&lt;/ul&gt;
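
&lt;p&gt;For context, a PURL (Package URL) is the standard identifier these tools exchange. As a minimal sketch, assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;packageurl-python&lt;/code&gt; library, parsing one looks like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from packageurl import PackageURL  # pip install packageurl-python

# 'pkg:pypi/requests@2.31.0' identifies requests 2.31.0 in the PyPI ecosystem.
purl = PackageURL.from_string('pkg:pypi/requests@2.31.0')
print(purl.type, purl.name, purl.version)  # pypi requests 2.31.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;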

&lt;p&gt;Some tools are production-ready. Others are still in development. The project is about 79% complete, with OSSLILI, OSPAC, UPMEX, and MCP-SEMCLONE already live.&lt;/p&gt;

&lt;h2 id=&quot;community-driven-by-design&quot;&gt;Community-driven by design&lt;/h2&gt;

&lt;p&gt;This isn’t a closed product. Everything is open source. The reference databases are community-contributed. The output formats follow industry standards—SPDX, PURL, CycloneDX.&lt;/p&gt;

&lt;p&gt;If you want to contribute, there’s room for everything: bug reports, feature suggestions, code, documentation, detection patterns. The goal is to build something that works for everyone, not just those who can afford enterprise licenses.&lt;/p&gt;

&lt;h2 id=&quot;why-im-doing-this&quot;&gt;Why I’m doing this&lt;/h2&gt;

&lt;p&gt;SEMCL.ONE is compliance automation built on agentic AI principles, designed for the way software is actually written today.&lt;/p&gt;

&lt;p&gt;If you’re working on open source compliance, software supply chain security, or just curious about what AI-powered SCA looks like, come check it out.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://community.semcl.one&quot;&gt;Join the community&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;The views expressed in this document are solely my own and do not represent those of my current or past employers. If you identify an error, please contact me, and I will make the necessary updates.&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2026/01/launching-semclone-community-driven-sca.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2026/01/launching-semclone-community-driven-sca.html</guid>
        
        <category>sca</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>ai</category>
        
        <category>sbom</category>
        
        <category>semclone</category>
        
        <category>opensource</category>
        
        
        <category>security</category>
        
      </item>
    
      <item>
        <title>How to Use AI Without Wasting Time and Money</title>
        <description>&lt;p&gt;After thinking about it for a long time (very long time), I realized the “AI problem” isn’t new. It’s the same old problem we’ve seen in software for decades.&lt;/p&gt;

&lt;p&gt;AI gives you extra capacity. It speeds up your work, helps you move from an idea to a quick demo or a basic product, and reduces the time between each step. But the same speed also increases the number of mistakes. If your process was unclear before, AI will spread that confusion across every step.&lt;/p&gt;

&lt;p&gt;Most AI projects don’t fall apart because the tech is weak. The tech is here. Many projects fall apart for the same reasons teams have failed with past tools:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;No training&lt;/li&gt;
  &lt;li&gt;No real understanding of the tool&lt;/li&gt;
  &lt;li&gt;Trying to solve the wrong problem&lt;/li&gt;
  &lt;li&gt;False expectations pushed by hype&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;People expect AI to work like magic. They feed it missing context, unclear goals, or random data and expect it to “figure it out.” That never works.&lt;/p&gt;

&lt;p&gt;Still, every misstep teaches you something. Every prompt that doesn’t work tells you what you misunderstood. And every rough draft helps you shape your thinking. These are the lessons that helped me the most while building with AI.&lt;/p&gt;

&lt;h2 id=&quot;1-learn-the-basics-before-putting-ai-into-real-work&quot;&gt;1. Learn the basics before putting AI into real work&lt;/h2&gt;

&lt;p&gt;You don’t need deep technical knowledge, but you should still understand a few core ideas. It helps to know what a prompt is, what makes one clear or vague, and why the model reacts the way it does. It also helps to understand what RAG means, when it helps, and when it is unnecessary. The same goes for agents. A little preparation goes a long way, and skipping this early learning step usually leads to wasted money, wasted time, and a lot of confusion.&lt;/p&gt;
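
&lt;p&gt;For instance, here is a toy version of the retrieval step behind RAG. Word overlap stands in for the embeddings a real system would use, and the notes and question are invented for illustration:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;notes = [
    'Refunds are processed within 5 business days.',
    'Support is available Monday through Friday.',
]

def retrieve(question):
    # Pick the note sharing the most words with the question.
    q_words = set(question.lower().split())
    return max(notes, key=lambda note: len(q_words &amp; set(note.lower().split())))

question = 'How long do refunds take?'
# The model answers from retrieved context instead of guessing.
prompt = f'Context: {retrieve(question)}\n\nQuestion: {question}'
print(prompt)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;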

&lt;h2 id=&quot;2-expect-ai-to-be-wrong-and-expect-yourself-to-be-wrong-too&quot;&gt;2. Expect AI to be wrong, and expect yourself to be wrong too&lt;/h2&gt;

&lt;p&gt;You’ll get more value out of AI if you treat it as a partner that challenges your thinking instead of a tool that writes perfect answers. When you write a draft, try asking the model whether your argument actually addresses the real problem. Ask whether your reasoning makes sense and whether you’re overlooking something important. This works much better than simply asking it to clean up grammar. You get clarity, not just pretty text.&lt;/p&gt;

&lt;h2 id=&quot;3-keep-ai-systems-small&quot;&gt;3. Keep AI systems small&lt;/h2&gt;

&lt;p&gt;Large AI systems with too many features usually collapse under their own weight. They lose context, take too long to build, and fail quietly. Smaller projects do the opposite. A small, clear use case gives you fast results, shows you where the gaps are, and helps you stay focused. The most helpful AI tools are the simple ones that solve a real problem someone faces every day.&lt;/p&gt;

&lt;h2 id=&quot;4-build-simple-versions-first-and-release-them-quickly&quot;&gt;4. Build simple versions first and release them quickly&lt;/h2&gt;

&lt;p&gt;Classic software cycles move slowly, and AI work moves fast. If you wait months to ship something, the idea is already old. It’s far better to release a simple version, try it, fix what breaks, and release the next version soon after. When you don’t release small pieces, you end up working hard without making any real progress, as if you were pushing the brake and the gas at the same time.&lt;/p&gt;

&lt;h2 id=&quot;5-ai-does-not-know-your-goals&quot;&gt;5. AI does not know your goals&lt;/h2&gt;

&lt;p&gt;A model does not understand how your business works, what matters to you, or how you make decisions. You have to teach it. This is a good opportunity to map your own work. When you write down your process, your rules, and what a good outcome looks like, you not only help the model but you help yourself. Clear context always produces better results than any prompt trick.&lt;/p&gt;

&lt;h2 id=&quot;6-do-not-replace-people-or-processes&quot;&gt;6. Do not replace people or processes&lt;/h2&gt;

&lt;p&gt;AI should support the way people work, not erase it. It can help you draft early versions of documents, sort information, pull out key ideas, and record recurring steps so you can turn them into playbooks. But people should still make decisions. And before you connect an AI system to anything serious, especially a production database, stop and think about what could go wrong. Treat this as a risk question, not a convenience question.&lt;/p&gt;

&lt;h2 id=&quot;7-if-you-plan-to-use-ai-in-your-business-define-a-risk-model&quot;&gt;7. If you plan to use AI in your business, define a risk model&lt;/h2&gt;

&lt;p&gt;Different AI systems carry different levels of risk. A simple text classifier is not the same as a model that writes code. And a local model that stays within your network is not the same as a frontier model with broad reach. One way to think about this is to divide systems into several levels. At the lowest level, the concerns look similar to the usual software supply chain issues teams already understand. The next level might include smaller text tasks or basic predictions. Beyond that are systems that can produce code or guide decisions. At the top are advanced models with broad behavior and harder-to-predict outcomes.&lt;/p&gt;

&lt;p&gt;As you move up each level, you should raise the bar for security, testing, documentation, deployment limits, and monitoring. This keeps you from treating every AI system as if they all carried the same weight.&lt;/p&gt;
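
&lt;p&gt;As a sketch of the idea (the tier names and controls below are hypothetical, not a standard), such a risk model can start as a simple lookup table:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical risk tiers mapping system type to minimum required controls.
RISK_TIERS = {
    'classifier':     ['dependency scanning'],
    'text_tasks':     ['dependency scanning', 'output review'],
    'code_generator': ['dependency scanning', 'output review', 'sandboxed execution'],
    'frontier_model': ['dependency scanning', 'output review', 'sandboxed execution',
                       'deployment limits', 'continuous monitoring'],
}

def required_controls(system_type):
    # Unknown system types default to the strictest tier.
    return RISK_TIERS.get(system_type, RISK_TIERS['frontier_model'])

print(required_controls('code_generator'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;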

&lt;h2 id=&quot;8-start-small-and-grow-from-there&quot;&gt;8. Start small and grow from there&lt;/h2&gt;

&lt;p&gt;If you prefer workflows that behave predictably and only pull AI in as a helper, take a look at Strands Agents. When combined with a local Ollama setup, it lets you experiment without sending your data anywhere. It even works well when you’re offline on a plane. Once you’re comfortable, you can shift to cloud providers or bigger systems later.&lt;/p&gt;
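
&lt;p&gt;As a minimal sketch, assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strands-agents&lt;/code&gt; package and a local Ollama server (module paths follow the project’s quickstart at the time of writing; check the current docs):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from strands import Agent
from strands.models.ollama import OllamaModel  # pip install strands-agents

# Point the agent at a model served locally by Ollama; nothing leaves your machine.
local_model = OllamaModel(host='http://localhost:11434', model_id='llama3')
agent = Agent(model=local_model)
agent('List three risks of connecting an AI agent to a production database.')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;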

&lt;h2 id=&quot;final-thought&quot;&gt;Final thought&lt;/h2&gt;

&lt;p&gt;AI is just a tool. When it breaks, it’s almost always because the person asked the wrong thing, used the wrong tool, or started with a problem that wasn’t defined in the first place. Start small. Move fast. Give the model the information it needs. And make sure the thing you’re trying to solve is real.&lt;/p&gt;

&lt;p&gt;Have fun with AI, but stay grounded. Focus your time and money on the parts that give you results now, and grow from there.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;The views expressed in this document are solely my own and do not represent those of my current or past employers. If you identify an error, please contact me, and I will make the necessary updates.&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Tue, 18 Nov 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/11/how-to-use-ai-without-wasting-time-and-money.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/11/how-to-use-ai-without-wasting-time-and-money.html</guid>
        
        <category>ai</category>
        
        <category>productivity</category>
        
        <category>development</category>
        
        <category>bestpractices</category>
        
        <category>llm</category>
        
        <category>agentic</category>
        
        
      </item>
    
      <item>
        <title>BSA Security Scanning Tool for AI Models</title>
        <description>&lt;p&gt;Six months ago, I wrote about the massive security blind spot in AI adoption. Organizations download ML models from the internet. They deploy them in production. They trust them completely.&lt;/p&gt;

&lt;p&gt;The response was overwhelming. Security teams reached out, asking the same question: “How do we scan these models?”&lt;/p&gt;

&lt;p&gt;The truth was uncomfortable. No existing tools could do it properly.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-is-real&quot;&gt;The Problem is Real&lt;/h2&gt;
&lt;p&gt;A quick internet search reveals that over 60% of HuggingFace models lack license information. Traditional security scanners miss pickle files entirely. Supply chain attacks such as PoisonGPT and the PyTorch compromise have proven that the threat is real.&lt;/p&gt;

&lt;p&gt;However, pointing out problems without solutions felt incomplete, and even a proof-of-concept tool could make a significant difference when the gap is enormous.&lt;/p&gt;

&lt;h3 id=&quot;a-security-scanner-for-ai-models&quot;&gt;A Security Scanner for AI Models&lt;/h3&gt;
&lt;p&gt;I spent the last few weeks extending BinarySniffer (an existing project I use for Binary Static Analysis of Linux packages, firmware, and mobile apps) with ML security capabilities (file parsing, signature matching, reporting, etc.).&lt;/p&gt;

&lt;p&gt;The tool does what I wished existed when I wrote that first post. It analyzes ML models without executing them. It understands pickle files, PyTorch models, ONNX formats, and SafeTensors. It maps threats to the MITRE ATT&amp;amp;CK framework to produce “features” that are later matched against signatures I can generate using an AI Agent. Most importantly, it catches real attacks.&lt;/p&gt;

&lt;h3 id=&quot;testing-against-real-threats&quot;&gt;Testing Against Real Threats&lt;/h3&gt;
&lt;p&gt;I validated the tool against every known ML attack I could find. Results speak louder than marketing claims. 100% detection rate on malicious pickle exploits. Detects command execution patterns used in supply chain attacks. 94% confidence in detecting PyTorch model threats and 56% confidence in identifying suspicious XGBoost models. All of that can be achieved using simple string searching and signatures, which is pretty cool for a proof-of-concept approach.&lt;/p&gt;

&lt;p&gt;The tool identified over 50 different attack patterns across various ML formats, but it’s not perfect. It is a proof of concept that makes progress but needs significantly more work before it covers all the necessary features.&lt;/p&gt;

&lt;h2 id=&quot;making-it-practical&quot;&gt;Making It Practical&lt;/h2&gt;
&lt;p&gt;The implementation is straightforward: “pip it” and run it. Signatures and calibration update automatically as you use it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;pip install semantic-copycat-binarysniffer
binarysniffer ml-scan suspicious_model.pkl
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The output integrates with GitHub Actions. It generates SARIF reports for CI/CD pipelines, so security teams get results they can act on immediately or produce on demand. The tool can also be integrated as a Python library into another system. Simple as it gets.&lt;/p&gt;
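
&lt;p&gt;If you would rather drive it from Python without depending on its internal API, a minimal sketch can simply wrap the CLI shown above (the hypothetical wrapper below assumes the report is printed to stdout):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import subprocess

def ml_scan(model_path):
    # Run the CLI shown above and capture its report from stdout.
    result = subprocess.run(
        ['binarysniffer', 'ml-scan', model_path],
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout

exit_code, report = ml_scan('suspicious_model.pkl')
print(report)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;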

&lt;h3 id=&quot;what-this-means-for-your-organization&quot;&gt;What This Means for Your Organization&lt;/h3&gt;
&lt;p&gt;Every organization adopting AI faces the same choice. Wait for the first ML supply chain attack to hit your systems. Or start scanning your models today, and share your experience with others so we can improve tooling and research.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/posts/bsa_results.png&quot; alt=&quot;BinarySniffer scan results&quot; /&gt;&lt;/p&gt;

&lt;p&gt;BinarySniffer is Open Source under Apache-2.0 and available on GitHub. Feel free to submit comments, recommendations, examples, new signatures, and even feature requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it catches:&lt;/strong&gt; Basic attacks, known malware patterns, unobfuscated pickle exploits, and obvious command injection attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it misses:&lt;/strong&gt; Sophisticated obfuscation, weight-based backdoors, novel attack patterns, and steganographic payloads. Think of it as your first line of defense, not a complete solution.&lt;/p&gt;

&lt;h2 id=&quot;moving-forward&quot;&gt;Moving Forward&lt;/h2&gt;
&lt;p&gt;The question isn’t whether ML supply chain attacks will happen. They already are. The question is whether your organization will be protected when the next one hits. Don’t wait to find out. You don’t have to use this particular tool; any tool that works for your use case will do. But the time to start doing something is today.&lt;/p&gt;

&lt;p&gt;I welcome ideas to improve scanners or produce proof-of-concept tools. Many more are coming after this one.&lt;/p&gt;

&lt;h3 id=&quot;reference&quot;&gt;Reference:&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Pepe, F., et al. (2024). “How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study.” 32nd IEEE/ACM International Conference on Program Comprehension (ICPC). https://mdipenta.github.io/files/icpc2024.pdf&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/08/bsa-security-scanning-tool-for-ai-models.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/08/bsa-security-scanning-tool-for-ai-models.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        <category>ai</category>
        
        
        <category>security</category>
        
        <category>models</category>
        
        <category>sca</category>
        
      </item>
    
      <item>
        <title>Beyond Simple Code Scanning: Advanced Semantic Analysis for AI-Generated Code Compliance</title>
        <description>&lt;h2 id=&quot;when-code-scanners-miss-the-forest-for-the-trees&quot;&gt;When Code Scanners Miss the Forest for the Trees&lt;/h2&gt;
&lt;p&gt;The rise of AI-powered coding tools is creating a new kind of compliance risk that most organizations are not prepared for.&lt;/p&gt;

&lt;p&gt;These tools offer remarkable productivity gains, but they also introduce subtle and serious challenges. License violations, patent exposure, and unauthorized reproduction of proprietary algorithms can now happen without direct copying. Traditional code scanners are not equipped to detect these risks.&lt;/p&gt;

&lt;h2 id=&quot;the-growing-challenge-of-ai-code-compliance&quot;&gt;The Growing Challenge of AI Code Compliance&lt;/h2&gt;
&lt;p&gt;Most current scanners rely on pattern matching, hashing, and surface-level similarity. This works for detecting direct reuse, but not for transformed, translated, or restructured code.&lt;/p&gt;

&lt;p&gt;For example, an AI system might generate an audio or video codec that implements patented algorithms. Even though the code looks different (using new names, structures, or even languages), the core logic may still be the same. This can lead to intellectual property violations without any obvious sign of duplication.&lt;/p&gt;

&lt;h2 id=&quot;why-traditional-scanners-fall-short&quot;&gt;Why Traditional Scanners Fall Short&lt;/h2&gt;
&lt;h3 id=&quot;transformation-blindness&quot;&gt;Transformation blindness&lt;/h3&gt;
&lt;p&gt;AI-generated code is rarely copied directly. Instead, it often rewrites logic in a different style, language, or structure. For example, a quicksort algorithm written in Python using list comprehensions may appear entirely distinct from one written in Java with traditional for loops, even though they perform the same task.&lt;/p&gt;
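
&lt;p&gt;For illustration, the Python variant might look like the sketch below; a Java version with explicit loops and index arithmetic shares almost no tokens with it, yet implements the same algorithm:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def quicksort(items):
    # Recursive quicksort expressed with list comprehensions.
    if len(items) &lt;= 1:
        return items
    pivot, rest = items[0], items[1:]
    return (quicksort([x for x in rest if x &lt; pivot])
            + [pivot]
            + quicksort([x for x in rest if x &gt;= pivot]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;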

&lt;p&gt;Modern software projects frequently span multiple languages. An algorithm might start in Python during prototyping and later be implemented in Rust or Java for production. Most scanning tools are not equipped to follow this kind of transformation, which is understandable, since they were never designed to do so.&lt;/p&gt;

&lt;h3 id=&quot;pattern-convergence&quot;&gt;Pattern convergence&lt;/h3&gt;
&lt;p&gt;AI models can recreate functional equivalents of proprietary code without ever directly accessing the original code. This kind of convergence introduces risk, even when no exact code has been copied. In most cases, the developer is completely unaware that their generated code may resemble or replicate protected logic.&lt;/p&gt;

&lt;h3 id=&quot;what-kind-of-tools-are-needed&quot;&gt;What Kind of Tools Are Needed&lt;/h3&gt;
&lt;p&gt;Scanning code to extract evidence and understand transformations is possible, but not easy. To meet this challenge, organizations need code analysis systems that can:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Recognize algorithmic similarity across different programming languages&lt;/li&gt;
  &lt;li&gt;Understand semantic equivalence despite style or syntax changes&lt;/li&gt;
  &lt;li&gt;Identify core logic based on structure and behavior&lt;/li&gt;
  &lt;li&gt;Remain effective even after variable renaming, code reordering, or formatting changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem requires a deeper understanding of the code’s logic and intent, not just its appearance. It may not appeal to everyone, but for old dinosaurs like me who enjoy technical puzzles and the thrill of discovering (or creating) new problems, it is worth spending time exploring.&lt;/p&gt;
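
&lt;p&gt;The mechanism described later is not public, but as a toy illustration of the general idea, comparing the structure of code rather than its text already survives renaming:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import ast
import difflib

def shape(source):
    # Reduce code to its sequence of AST node types, ignoring names and formatting.
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

a = 'def total(items):\n    return sum(x * 2 for x in items)'
b = 'def aggregate(values):\n    return sum(v * 2 for v in values)'

# Close to 1.0 even though every identifier differs.
print(difflib.SequenceMatcher(None, shape(a), shape(b)).ratio())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;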

&lt;h3 id=&quot;research-project-semantic-code-analysis&quot;&gt;Research Project: Semantic Code Analysis&lt;/h3&gt;
&lt;p&gt;As part of a side research project, I developed a prototype system that analyzes the meaning behind code. The goal was to explore whether semantic similarity detection could uncover hidden risks in AI-generated code.&lt;/p&gt;

&lt;h3 id=&quot;the-scenario&quot;&gt;The Scenario&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Over 1,000 AI-generated code samples&lt;/li&gt;
  &lt;li&gt;Five languages: Python, Java, JavaScript, TypeScript, and C&lt;/li&gt;
  &lt;li&gt;Algorithms included sorting, search, compression, and multimedia codecs&lt;/li&gt;
  &lt;li&gt;Each code sample was transformed through multiple variations in naming, structure, and language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only the generated output was evaluated, completely isolated from any training data. I used publicly available generic LLMs, so I had no control over the content they produced.&lt;/p&gt;

&lt;p&gt;The challenge required designing a new mechanism and algorithm to support decomposition and analysis. I built a kind of “salad” using components from various Open Source tools and code analysis techniques. While I cannot share many details about the implementation, here is what I discovered:&lt;/p&gt;

&lt;h2 id=&quot;key-results&quot;&gt;Key Results&lt;/h2&gt;
&lt;h3 id=&quot;transformation-resistance&quot;&gt;Transformation resistance&lt;/h3&gt;
&lt;p&gt;The new mechanism detected similarity scores of 53 to 72 percent across transformed versions. It was designed to identify AI-generated reimplementations across different languages, coding styles, and workflows. In comparison, traditional hash-based tools detected only 0 to 15 percent.&lt;/p&gt;

&lt;h3 id=&quot;cross-language-detection&quot;&gt;Cross-language detection&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Python to Java implementations showed 53.5 percent similarity.&lt;/li&gt;
  &lt;li&gt;Shared algorithmic patterns resulted in a 73.3 percent overlap.&lt;/li&gt;
  &lt;li&gt;Function purpose matching reached 66.7 percent accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;pattern-recognition&quot;&gt;Pattern recognition&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Sorting algorithms were identified across all five languages.&lt;/li&gt;
  &lt;li&gt;Mathematical operations were detected even when expressed in different ways.&lt;/li&gt;
  &lt;li&gt;Control flow and logic were recognized despite variations in syntax.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results suggested that it might be possible to detect similarities between existing code, such as a proprietary codec, and unintentionally created AI-generated reimplementations. Based on that, I decided to apply the method to a real-world case and attempt to uncover potential reimplementations.&lt;/p&gt;

&lt;h3 id=&quot;case-study-real-world-codec-analysis&quot;&gt;Case Study: Real-World Codec Analysis&lt;/h3&gt;
&lt;p&gt;To further test the approach, I used different publicly available LLMs to generate source code for audio codec implementations that could infringe on existing patents. The goal was to evaluate how the system performs when analyzing unknown code across multiple programming languages and to measure semantic similarity. Due to licensing restrictions, I am unable to share the exact code that was generated.&lt;/p&gt;

&lt;p&gt;Additionally, I reimplemented some well-known algorithms used by commercial code scanners to benchmark their performance in the same scenarios. I recreated these algorithms using only public documentation, so the results may differ from those produced by proprietary tools.&lt;/p&gt;

&lt;h4 id=&quot;what-worked&quot;&gt;What worked&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Detected similar audio compression algorithms in both Python and Java&lt;/li&gt;
  &lt;li&gt;Identified MDCT (Modified Discrete Cosine Transform) logic regardless of language or library&lt;/li&gt;
  &lt;li&gt;Recognized Huffman coding, windowing, and frame-processing logic&lt;/li&gt;
  &lt;li&gt;Maintained reliability despite differences in naming, libraries, and language paradigms&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;what-was-challenging&quot;&gt;What was challenging&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Pattern matching mismatches reduced confidence in some cases. For example, masking thresholds in Java and psychoacoustic_model in Python referred to the same concept but used different terminology.&lt;/li&gt;
  &lt;li&gt;Library abstraction levels also presented challenges. High-level libraries required different detection strategies, which were too complex to fully resolve in a short timeframe.&lt;/li&gt;
  &lt;li&gt;Language-specific idioms added noise to the analysis, necessitating normalization and fine-tuning. This is where most of the meaningful work took place, and where the real value of the approach began to emerge.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;real-world-impact&quot;&gt;Real-World Impact&lt;/h3&gt;
&lt;p&gt;In the codec analysis:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The Python version showed 28.6 percent similarity with known codecs&lt;/li&gt;
  &lt;li&gt;The Java version showed a zero percent match using traditional tools&lt;/li&gt;
  &lt;li&gt;Cross-language semantic similarity confirmed the presence of equivalent compression logic at 53.5 percent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirmed that surface-level scanners would have missed the intellectual property risk entirely, while the multi-tier analysis mechanism was able to detect it successfully.&lt;/p&gt;

&lt;h4 id=&quot;statistical-highlights&quot;&gt;Statistical Highlights&lt;/h4&gt;
&lt;p&gt;Analysis across 2,320 samples (464 algorithms in 5 languages):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Detection accuracy reached over 85 percent&lt;/li&gt;
  &lt;li&gt;Cross-language similarity averaged between 53 and 72 percent&lt;/li&gt;
  &lt;li&gt;False positives remained under 5 percent&lt;/li&gt;
  &lt;li&gt;Detection held strong even with variable renaming, code reordering, and formatting changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;broader-applications&quot;&gt;Broader Applications&lt;/h4&gt;
&lt;p&gt;This kind of semantic analysis is valuable for much more than compliance. It can support:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Internal code clone detection across large codebases&lt;/li&gt;
  &lt;li&gt;Competitive algorithm analysis&lt;/li&gt;
  &lt;li&gt;Open source license compliance when derivative work is involved (the new legal challenge?)&lt;/li&gt;
  &lt;li&gt;Validation of originality in AI-generated code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As generative coding systems become more advanced, organizations may need analysis tools that match that level of complexity. The question is, do such tools already exist?&lt;/p&gt;

&lt;h3 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h3&gt;
&lt;p&gt;Traditional scanners are built to catch copied code. However, AI does not simply copy; it reconstructs, transforms, and reimplements.&lt;/p&gt;

&lt;p&gt;To detect risks, we must analyze the code’s meaning, not just its appearance. This research shows that semantic detection is both possible and practical. It provides better insight, stronger compliance, and a deeper understanding of the algorithms behind the code.&lt;/p&gt;

&lt;p&gt;The future of compliance lies in semantic understanding, not superficial matching. It may be time to implement “vibe compliance” to hunt code copycats. Then again, that might open a Pandora’s box.&lt;/p&gt;

&lt;p&gt;This research was conducted independently. All code used for testing was explicitly generated for experimental purposes and was not sourced from any known proprietary dataset. I used generic, publicly available LLMs and did not use models whose creators claim to have excluded copyleft or other problematic code from their training data.&lt;/p&gt;
</description>
        <pubDate>Sat, 02 Aug 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/08/advanced-semantic-analysis-for-ai-generated-code-compliance.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/08/advanced-semantic-analysis-for-ai-generated-code-compliance.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        <category>ai</category>
        
        
        <category>security</category>
        
        <category>models</category>
        
        <category>sca</category>
        
      </item>
    
      <item>
        <title>AI Supply Chain Security Risks and Legal Compliance Gaps</title>
        <description>&lt;h2 id=&quot;what-are-ai-models-and-how-are-they-distributed&quot;&gt;What Are AI Models and How Are They Distributed?&lt;/h2&gt;

&lt;p&gt;Artificial Intelligence models are trained software systems that make predictions or generate content based on input data. Unlike traditional software that follows explicit programming logic, AI models learn patterns from training data and encode this knowledge in mathematical weights and parameters.&lt;/p&gt;

&lt;p&gt;Organizations today consume AI models through several distribution channels that create unique security and legal risks.&lt;/p&gt;

&lt;h3 id=&quot;model-components-and-file-formats&quot;&gt;Model Components and File Formats&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tensors and Weights&lt;/strong&gt;
The core of any AI model consists of numerical parameters called tensors or weights. These contain the learned knowledge from training data. Think of them as the “brain” of the model that determines how inputs get processed into outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pickle Files (.pt, .pkl, .joblib)&lt;/strong&gt;
Python’s pickle format allows both data and executable code to be stored together in a single file. When you load a pickle file, any embedded code runs automatically. This creates a major security risk because malicious actors can hide executable payloads inside what appears to be a simple model file.&lt;/p&gt;
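
&lt;p&gt;As a minimal static-triage sketch (real scanners do far more), Python’s standard library can list a pickle’s opcodes without executing the file and flag the ones that can import modules or call functions during loading:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pickletools

# Opcodes that can trigger imports or calls when the pickle is loaded.
SUSPICIOUS = {'GLOBAL', 'STACK_GLOBAL', 'REDUCE', 'INST', 'OBJ', 'NEWOBJ'}

def risky_opcodes(path):
    # Walk the opcode stream statically; nothing in the file is executed.
    with open(path, 'rb') as f:
        data = f.read()
    return [(op.name, arg) for op, arg, pos in pickletools.genops(data)
            if op.name in SUSPICIOUS]

print(risky_opcodes('model.pkl'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;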

&lt;p&gt;&lt;strong&gt;SafeTensors Format&lt;/strong&gt;
A newer secure format designed specifically to store only model weights without executable code. SafeTensors files cannot execute arbitrary code during loading, making them safer for production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GGUF Format&lt;/strong&gt;
Used primarily with LLaMA models and llama.cpp implementations. While safer than pickle, GGUF files can still contain metadata that requires inspection before use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference Scripts and Configuration Files&lt;/strong&gt;
Models often include Python scripts that handle data preprocessing, model execution, and output formatting. These scripts can contain hidden logic or import malicious libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Adapters and Extensions&lt;/strong&gt;
Lightweight modifications that change model behavior without retraining. LoRA adapters, for example, apply small weight adjustments to base models. Extensions might add web APIs or connect models to external services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datasets&lt;/strong&gt;
Training and evaluation data often accompanies models. Datasets can contain biased, copyrighted, or sensitive information that creates legal liability.&lt;/p&gt;

&lt;h3 id=&quot;distribution-through-python-packages&quot;&gt;Distribution Through Python Packages&lt;/h3&gt;

&lt;p&gt;Many organizations distribute AI models inside standard Python packages (wheel files or tar.gz archives). This bundling approach creates a blind spot for traditional security tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Package Problem&lt;/strong&gt;
When developers install a Python package that contains AI models, traditional dependency scanners only check the package metadata and Python code. They miss the model files stored in data folders within the package.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Distribution Examples&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Transformers library bundles model configurations&lt;/li&gt;
  &lt;li&gt;Custom ML packages include pre-trained weights in data directories&lt;/li&gt;
  &lt;li&gt;Industry-specific packages embed domain models as package resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distribution method bypasses most security scanning because the models exist as data files rather than declared dependencies.&lt;/p&gt;
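
&lt;p&gt;A wheel is just a zip archive, so even a short script can surface bundled model files that dependency scanners never look at. A minimal sketch (the extension list is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import zipfile

# Extensions that suggest serialized models; illustrative, not exhaustive.
MODEL_EXTS = ('.pkl', '.pt', '.pth', '.joblib', '.onnx', '.safetensors', '.gguf')

def bundled_models(wheel_path):
    # Wheels are zip archives; list members that look like model files.
    with zipfile.ZipFile(wheel_path) as whl:
        return [name for name in whl.namelist()
                if name.lower().endswith(MODEL_EXTS)]

print(bundled_models('example_ml_package-1.0-py3-none-any.whl'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;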

&lt;h2 id=&quot;legal-compliance-risks&quot;&gt;Legal Compliance Risks&lt;/h2&gt;

&lt;p&gt;The AI model ecosystem creates new categories of legal risk that traditional software compliance programs do not address.&lt;/p&gt;

&lt;h3 id=&quot;license-documentation-gaps&quot;&gt;License Documentation Gaps&lt;/h3&gt;

&lt;p&gt;Research analyzing 159,132 models on HuggingFace reveals concerning compliance gaps¹:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing License Information&lt;/strong&gt;
Only 35% of HuggingFace models include any license information². This means 65% of available models exist in a legal gray area where usage rights remain undefined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-Specific Licensing Complexity&lt;/strong&gt;
New license types like OpenRAIL, CreativeML-OpenRAIL-M, and BigScience-BLOOM-RAIL include “responsible use” restrictions that traditional open source licenses do not contain³. These licenses may prohibit certain applications, require attribution for outputs, or restrict commercial use in specific industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;License Compatibility Violations&lt;/strong&gt;
Analysis found 707 GitHub projects using restrictively licensed models while distributing their own code under permissive licenses¹. This creates potential legal exposure for any organization using these projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specific Violation Examples&lt;/strong&gt;
It’s common to find repositories on GitHub for projects that use permissive licenses but include (or bundle) incompatible models. The most common cases are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Apache 2.0 licensed projects using GPL-3.0 licensed models&lt;/li&gt;
  &lt;li&gt;MIT licensed software incorporating CC-BY-SA-4.0 models&lt;/li&gt;
  &lt;li&gt;Commercial applications using non-commercial research models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;dataset-provenance-problems&quot;&gt;Dataset Provenance Problems&lt;/h3&gt;

&lt;p&gt;Training data creates additional legal complexity that most organizations ignore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Dataset Documentation&lt;/strong&gt;
Only 14% of models properly tag their training datasets¹. Manual analysis of popular models shows 58% provide some dataset information, but documentation quality varies widely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copyright Exposure&lt;/strong&gt;
Models trained on copyrighted content (books, articles, images, code) may create derivative works. Organizations using these models could face copyright infringement claims, especially for commercial applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy Compliance Risks&lt;/strong&gt;
Training datasets may contain personal information, biometric data, or other regulated content. Using models trained on such data could violate GDPR, CCPA, or industry-specific privacy requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Legal Challenges&lt;/strong&gt;
The Stability AI StableLM model sparked legal discussions about licensing validity when trained on copyrighted datasets⁴. Similar concerns affect most large language models trained on web-scraped content.&lt;/p&gt;

&lt;h3 id=&quot;bias-and-fairness-documentation&quot;&gt;Bias and Fairness Documentation&lt;/h3&gt;

&lt;p&gt;Only 18% of analyzed models document potential biases¹. This creates liability for organizations deploying models in regulated industries or customer-facing applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documented Bias Categories&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Population bias affecting demographic groups&lt;/li&gt;
  &lt;li&gt;Geographic bias favoring certain regions&lt;/li&gt;
  &lt;li&gt;Cultural and religious bias in content generation&lt;/li&gt;
  &lt;li&gt;Historical bias reflecting outdated social norms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Impact Examples&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Hiring tools showing gender bias in candidate ranking&lt;/li&gt;
  &lt;li&gt;Credit models discriminating against protected classes&lt;/li&gt;
  &lt;li&gt;Content generation systems producing culturally insensitive outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;security-threat-landscape&quot;&gt;Security Threat Landscape&lt;/h2&gt;

&lt;p&gt;AI models introduce attack vectors that traditional cybersecurity tools cannot detect or prevent. Although the known impact may be reduced or limited—potentially due to the absence of tools capable of detecting these issues at scale—there are some well-known real-world cases:&lt;/p&gt;

&lt;h3 id=&quot;weaponized-model-incidents&quot;&gt;Weaponized Model Incidents&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;December 2022: PyTorch Supply Chain Attack&lt;/strong&gt;
The torchtriton package on PyPI contained malicious code that stole environment variables and SSH keys from developers using PyTorch-nightly⁵. The attack used DNS exfiltration to steal credentials without triggering network monitoring tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;July 2023: PoisonGPT&lt;/strong&gt;
Researchers demonstrated a backdoored GPT-J model that produced misinformation when triggered by specific phrases⁶. The poisoned model appeared to function normally in most cases but generated false information about historical events when prompted with trigger words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;February 2024: HuggingFace Model Backdoors&lt;/strong&gt;
JFrog Security discovered approximately 100 malicious models on HuggingFace containing pickle files designed to open reverse shells⁷. These models appeared legitimate but executed malicious code when loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;December 2024: YOLOv8 Cryptominer&lt;/strong&gt;
Attackers compromised the YOLOv8 computer vision model repository through GitHub CI systems, injecting cryptocurrency mining malware into model downloads⁸.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;January 2025: Obfuscated Pickle Attack&lt;/strong&gt;
ReversingLabs identified sophisticated attacks using corrupted 7z archives to hide pickle backdoors⁹. These attacks evaded HuggingFace’s security scanners by obscuring malicious code within compressed archives.&lt;/p&gt;

&lt;h3 id=&quot;attack-vector-analysis&quot;&gt;Attack Vector Analysis&lt;/h3&gt;

&lt;p&gt;If everyone is adopting AI for everything, how is it possible that these problems occur? The honest truth is that, for the most part, no one is looking, verifying, or reviewing. As a result, attack vectors remain open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Serialization Exploits&lt;/strong&gt;
Pickle format attacks remain the most common threat vector. Malicious actors embed executable code within model files that runs automatically during loading. This code can:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Install persistent backdoors&lt;/li&gt;
  &lt;li&gt;Exfiltrate sensitive data&lt;/li&gt;
  &lt;li&gt;Establish command and control channels&lt;/li&gt;
  &lt;li&gt;Deploy additional malware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supply Chain Injection&lt;/strong&gt;
Attackers target model repositories, CI/CD pipelines, and distribution infrastructure to inject malicious code into legitimate models. These attacks affect downstream users who trust the model source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Poisoning&lt;/strong&gt;
Subtle modifications to model weights create backdoors that trigger on specific inputs. Poisoned models perform normally for most inputs but produce attacker-controlled outputs when triggered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency Confusion&lt;/strong&gt;
Malicious packages with names similar to legitimate AI libraries trick developers into installing compromised versions. These packages often contain backdoored models or steal credentials during installation.&lt;/p&gt;

&lt;h3 id=&quot;current-protection-gaps&quot;&gt;Current Protection Gaps&lt;/h3&gt;

&lt;p&gt;Industry technical leaders are actively trying to reduce or control the situation, yet these efforts are not widely recognized or discussed. I would assume that companies with Open Source Program Offices (OSPOs) or specialized security teams would have their own governance programs focused on handling AI models. Unfortunately, the sad reality is that very few understand the problem or attempt to implement best supply chain practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFace Security Measures&lt;/strong&gt;
HuggingFace implements several security controls:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Pickle scanning for known malicious patterns (limited)&lt;/li&gt;
  &lt;li&gt;Automated malware detection&lt;/li&gt;
  &lt;li&gt;Community reporting mechanisms&lt;/li&gt;
  &lt;li&gt;SafeTensors format promotion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Identified Weaknesses&lt;/strong&gt;
Research reveals significant gaps in current protections:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Static analysis cannot detect runtime-activated threats&lt;/li&gt;
  &lt;li&gt;Obfuscated payloads evade signature-based detection&lt;/li&gt;
  &lt;li&gt;Social engineering bypasses community review&lt;/li&gt;
  &lt;li&gt;Advanced serialization attacks exploit parser vulnerabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional Security Tool Limitations&lt;/strong&gt;
Standard cybersecurity tools fail to address AI-specific risks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Antivirus software cannot analyze model behavior&lt;/li&gt;
  &lt;li&gt;Network monitoring misses AI-specific exfiltration methods&lt;/li&gt;
  &lt;li&gt;Dependency scanners ignore bundled model files&lt;/li&gt;
  &lt;li&gt;Code analysis tools cannot inspect serialized weights&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-traditional-security-tools-fall-short&quot;&gt;Why Traditional Security Tools Fall Short&lt;/h2&gt;

&lt;p&gt;Software Composition Analysis (SCA) tools focus on declared dependencies in requirements.txt or setup.py files, while specialized code scanners or snippet analysis tools focus on hash matching against known signatures. However, they cannot:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze Package Contents&lt;/strong&gt;
SCA tools do not extract and examine files within wheel (.whl) or tar.gz packages. This means bundled model files remain invisible to security scanning. While some SCAs include features to inspect archive files, they are not capable of directly handling serialization and binary blobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understand Model Formats&lt;/strong&gt;
Traditional tools cannot parse pickle, ONNX, or other AI-specific file formats. They treat model files as generic binary data without security analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detect Runtime Behavior&lt;/strong&gt;
Model files may appear benign during static analysis but execute malicious code when loaded by AI frameworks. Current tools cannot predict this runtime behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assess Training Data Risks&lt;/strong&gt;
No existing tools analyze the legal or ethical implications of training datasets. Organizations remain blind to copyright, privacy, and bias risks embedded in model weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor AI-Specific Network Activity&lt;/strong&gt;
Models may communicate with external services through subtle channels that traditional network monitoring cannot detect. AI-specific exfiltration methods evade standard security controls.&lt;/p&gt;

&lt;h2 id=&quot;ai-governance-model&quot;&gt;AI Governance Model&lt;/h2&gt;

&lt;p&gt;So, what should we focus on to create an internal risk management strategy for AI models? Start by assessing your risks and establishing AI-oriented security processes.&lt;/p&gt;

&lt;h3 id=&quot;organizational-risk-assessment&quot;&gt;Organizational Risk Assessment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Immediate Threats&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Malicious models stealing credentials or data&lt;/li&gt;
  &lt;li&gt;Backdoored packages in AI development environments&lt;/li&gt;
  &lt;li&gt;License violations in production AI systems&lt;/li&gt;
  &lt;li&gt;Bias-related legal exposure from undocumented model behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-Term Concerns&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Increasing sophistication of AI-targeted attacks&lt;/li&gt;
  &lt;li&gt;Regulatory enforcement of AI compliance requirements&lt;/li&gt;
  &lt;li&gt;Supply chain attacks targeting AI infrastructure&lt;/li&gt;
  &lt;li&gt;Copyright litigation over training data usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Impact&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Financial liability from legal violations&lt;/li&gt;
  &lt;li&gt;Reputational damage from biased AI outputs&lt;/li&gt;
  &lt;li&gt;Operational disruption from compromised AI systems&lt;/li&gt;
  &lt;li&gt;Competitive disadvantage from security incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;recommendations-for-organizations&quot;&gt;Recommendations for Organizations&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Establish AI-Specific Security Processes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create dedicated review procedures for AI models and related packages. Traditional software review processes do not address AI-specific risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement Model Intake Controls&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Verify model source and authenticity&lt;/li&gt;
  &lt;li&gt;Convert pickle-based models to safer formats when possible&lt;/li&gt;
  &lt;li&gt;Scan model files for embedded executables&lt;/li&gt;
  &lt;li&gt;Test model behavior in isolated environments&lt;/li&gt;
  &lt;li&gt;Document training data sources and licenses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deploy Specialized Security Tools&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Use ModelScan for multi-format model file analysis&lt;/li&gt;
  &lt;li&gt;Implement Adversarial Robustness Toolbox for model testing&lt;/li&gt;
  &lt;li&gt;Deploy Garak for LLM vulnerability assessment&lt;/li&gt;
  &lt;li&gt;Add AI-specific monitoring to network security&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Develop Legal Compliance Programs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Audit existing AI model usage for license compliance&lt;/li&gt;
  &lt;li&gt;Create approval processes for new AI licensing models&lt;/li&gt;
  &lt;li&gt;Establish dataset provenance requirements&lt;/li&gt;
  &lt;li&gt;Document bias testing and mitigation efforts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Train Development Teams&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Educate developers about AI-specific security risks&lt;/li&gt;
  &lt;li&gt;Provide guidance on safe model handling practices&lt;/li&gt;
  &lt;li&gt;Create incident response procedures for AI security events&lt;/li&gt;
  &lt;li&gt;Establish secure AI development workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor the Threat Landscape&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Subscribe to AI security threat intelligence feeds&lt;/li&gt;
  &lt;li&gt;Participate in AI security community forums&lt;/li&gt;
  &lt;li&gt;Run regular security assessments of AI infrastructure&lt;/li&gt;
  &lt;li&gt;Stay current with emerging AI attack techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build Organizational Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Hire or train staff with AI security expertise&lt;/li&gt;
  &lt;li&gt;Invest in AI-specific security tooling&lt;/li&gt;
  &lt;li&gt;Develop partnerships with AI security vendors&lt;/li&gt;
  &lt;li&gt;Create internal AI security standards and policies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;p&gt;The AI revolution brings tremendous business opportunities alongside new categories of risk. Organizations that proactively address these challenges will gain competitive advantages while protecting themselves from emerging threats. Those that ignore AI-specific risks face increasing exposure to both security incidents and legal liability.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Pepe, F., Nardone, V., Mastropaolo, A., Canfora, G., Bavota, G., &amp;amp; Di Penta, M. (2024). How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study. 32nd IEEE/ACM International Conference on Program Comprehension (ICPC). https://mdipenta.github.io/files/icpc2024.pdf&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Mend.io. (2024). Quick Guide to Popular AI Licenses. https://www.mend.io/blog/quick-guide-to-popular-ai-licenses/&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Responsible AI Licenses. (2022). OpenRAIL Licenses. https://www.licenses.ai/&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;HuggingFace Community Discussion. (2023). StabilityAI/StableLM License Clarity Issue. https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b/discussions/6&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;PyTorch Team. (2022). PyTorch-nightly dependency compromised. https://pytorch.org/blog/compromised-nightly-dependency/&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Mithril Security. (2023). PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news. https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;JFrog Security Research. (2024). Malicious ML Models on Hugging Face. https://jfrog.com/blog/jfrog-and-hugging-face-join-forces/&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Wiz Research. (2024). Ultralytics YOLOv8 Cryptominer Supply Chain Attack. https://www.wiz.io/blog/ultralytics-ai-library-hacked-via-github-for-cryptomining&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ReversingLabs. (2025). Backdoored ML Models Evade Hugging Face Scanners. https://www.reversinglabs.com/newsroom/press-releases/reversinglabs-identifies-novel-ml-malware-hosted-on-leading-hugging-face-ai-model-platform&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
        <pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/06/ai-supply-chain-security-risks-and-legal-compliance-gaps.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/06/ai-supply-chain-security-risks-and-legal-compliance-gaps.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        <category>ai</category>
        
        
        <category>security</category>
        
      </item>
    
      <item>
        <title>Data Resources for Agentic AI in Open Source Security and Compliance</title>
        <description>&lt;p&gt;Following my &lt;a href=&quot;https://ovalenzuela.com/2025/04/ai-agents-the-missing-piece-in-sbom-compliance.html&quot;&gt;previous post&lt;/a&gt;, I want to expand on the topic with some ideas for data feeds that support open source compliance audits and risk assessments when using an Agentic AI approach. These resources are available today, and I’ve had the chance to test and demonstrate them in a recent talk.&lt;/p&gt;

&lt;h2 id=&quot;scanoss--software-transparency-foundation&quot;&gt;SCANOSS / Software Transparency Foundation&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.softwaretransparency.org/&quot;&gt;Software Transparency Foundation&lt;/a&gt; offers a public &lt;a href=&quot;https://docs.osskb.org/&quot;&gt;API&lt;/a&gt; that connects to the Open Source Software Knowledge Base (OSSKB), which is produced by SCANOSS and licensed through the STF for public use. This system uses code fingerprints to identify code components, even at the snippet level. In addition to the main scanning API, SCANOSS offers specialized open datasets focused on regulatory and legal risks, including the &lt;a href=&quot;https://github.com/scanoss/crypto_algorithms_open_dataset&quot;&gt;&lt;em&gt;Crypto Algorithms Open Dataset&lt;/em&gt;&lt;/a&gt; and the &lt;a href=&quot;https://www.scanoss.com/post/understanding-the-geo-provenance-dataset&quot;&gt;&lt;em&gt;Geo-Provenance dataset&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key data provided by SCANOSS includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Identified component identifiers (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PURLs&lt;/code&gt;) from code fingerprint matches&lt;/li&gt;
  &lt;li&gt;Detected licenses associated with the identified code&lt;/li&gt;
  &lt;li&gt;Hashes of matched code snippets or files&lt;/li&gt;
  &lt;li&gt;A catalog of known cryptographic algorithm implementations&lt;/li&gt;
  &lt;li&gt;Jurisdictional mappings of software components by country of origin&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data enables agents to detect reused open source code, verify license attribution, trace code origins, identify components that might require export licenses due to cryptography, and flag software sourced from regions under trade restrictions or subject to data regulations.&lt;/p&gt;
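
&lt;p&gt;As a quick illustration, here is a minimal sketch of submitting a fingerprint file to the public OSSKB scanning API. The endpoint URL and form-field name are assumptions based on my reading of the public documentation; check &lt;a href=&quot;https://docs.osskb.org/&quot;&gt;docs.osskb.org&lt;/a&gt; for the current contract.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hedged sketch: submit a WFP fingerprint file to the public OSSKB scan API.
# The endpoint and the &quot;file&quot; form field are assumptions taken from the
# public docs; verify them before relying on this.
import requests

OSSKB_SCAN_URL = &quot;https://api.osskb.org/scan/direct&quot;  # assumed endpoint

def scan_wfp(wfp_path: str) -&gt; dict:
    # POST a .wfp fingerprint file (generated with the scanoss-py client)
    # and return the JSON match report (PURLs, licenses, snippet hashes).
    with open(wfp_path, &quot;rb&quot;) as fh:
        resp = requests.post(OSSKB_SCAN_URL, files={&quot;file&quot;: fh}, timeout=120)
    resp.raise_for_status()
    return resp.json()

print(scan_wfp(&quot;fingerprints.wfp&quot;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;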

&lt;h2 id=&quot;software-heritage&quot;&gt;Software Heritage&lt;/h2&gt;
&lt;p&gt;Software Heritage is an open, non-profit initiative maintaining the world’s largest public archive of source code. Its mission of collecting and preserving all publicly available software makes it a powerful resource for compliance and cybersecurity. By offering a rich API and persistent identifiers (SWHIDs) for every piece of code, Software Heritage enables agentic systems to verify code provenance, integrity, and traceability at scale. In practice, an AI agent can leverage Software Heritage in several ways to bolster open source compliance and security:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Archive Verification: Agents can quickly check whether a given source code release or package (e.g., a tarball) is already archived in Software Heritage’s database, enabling use cases such as pointing to the archived copy for source compliance instead of publishing redundant tarballs (see the sketch after this list).&lt;/li&gt;
  &lt;li&gt;Verify Modifications: Internal forks of open-source packages accumulate technical debt, because changes must be cherry-picked whenever the software is updated to a newer upstream version; forking is still often necessary, at least temporarily. Forks also introduce compliance and IP risks. Software Heritage’s API and scanners allow agents to identify whether a local fork has been modified and where those modifications were introduced.&lt;/li&gt;
  &lt;li&gt;License Files Dataset: Software Heritage has built the largest dataset of licenses from all the archived projects, a valuable resource for benchmarking compliance tooling and training AI on license generation.&lt;/li&gt;
&lt;/ul&gt;
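
&lt;p&gt;To make the archive-verification bullet concrete, the sketch below computes a file’s git blob hash (the basis of its content SWHID) and asks the Software Heritage REST API whether that exact content is archived:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Check whether the exact content of a file is archived in Software Heritage.
# SWH indexes file contents by git blob hash (sha1_git), so we compute it
# locally and query the documented /api/1/content/ endpoint:
# HTTP 200 means archived, 404 means not found.
import hashlib
import requests

def sha1_git(data: bytes) -&gt; str:
    # Git blob hash: SHA-1 over b&quot;blob &lt;size&gt;\x00&quot; plus the content.
    return hashlib.sha1(b&quot;blob %d\x00&quot; % len(data) + data).hexdigest()

def is_archived(path: str) -&gt; bool:
    with open(path, &quot;rb&quot;) as fh:
        digest = sha1_git(fh.read())
    url = f&quot;https://archive.softwareheritage.org/api/1/content/sha1_git:{digest}/&quot;
    return requests.get(url, timeout=30).status_code == 200

print(is_archived(&quot;LICENSE&quot;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;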

&lt;p&gt;Software Heritage’s archival infrastructure and identifiers ultimately empower a more transparent and secure open-source supply chain. By integrating SWH into their workflows, agent-based systems gain a dependable memory: they can detect unrecorded code, validate integrity through immutable hashes, and link to historical versions for context or legal compliance.&lt;/p&gt;

&lt;h2 id=&quot;aboutcode&quot;&gt;AboutCode&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://aboutcode.org/&quot;&gt;AboutCode&lt;/a&gt; initiative encompasses several open datasets, tools, and APIs designed for in-depth analysis of software components concerning licenses, security posture, and provenance. Their key offerings include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/aboutcode-org/scancode.io/&quot;&gt;ScanCode.io&lt;/a&gt;, a self-hosted service with a RESTful API for scanning codebases&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://public.vulnerablecode.io/&quot;&gt;VulnerableCode&lt;/a&gt;, a public API that maps open source packages to known vulnerabilities&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/aboutcode-org/scancode-licensedb&quot;&gt;ScanCode LicenseDB&lt;/a&gt;, a comprehensive license database hosted on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools provide valuable data, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Detailed scan results from ScanCode.io in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JSON&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SPDX&lt;/code&gt; format, including detected licenses, package metadata, and file-level insights (Note: this service must be self-hosted.)&lt;/li&gt;
  &lt;li&gt;Vulnerability mappings from VulnerableCode, linking package identifiers (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PURLs&lt;/code&gt;) to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CVEs&lt;/code&gt;, along with references and fixed version details&lt;/li&gt;
  &lt;li&gt;An extensive collection of license texts, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SPDX&lt;/code&gt; identifiers, and detection rules from ScanCode LicenseDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents can leverage these tools to automate large-scale code scanning within CI/CD pipelines, query vulnerability data by package, and enrich license validation workflows or AI training sets with high-quality licensing metadata.&lt;/p&gt;
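
&lt;p&gt;For instance, an agent can look up known vulnerabilities by package identifier with a single query. A hedged sketch, assuming the public VulnerableCode instance accepts a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;purl&lt;/code&gt; query parameter (recent versions may also require a free API key):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hedged sketch: query the public VulnerableCode API for a package by PURL.
# The purl query parameter follows the documented API; the instance may
# require a free API key sent as an Authorization header.
import requests

def vulnerabilities_for(purl: str, api_key: str = &quot;&quot;) -&gt; list:
    headers = {&quot;Authorization&quot;: f&quot;Token {api_key}&quot;} if api_key else {}
    resp = requests.get(
        &quot;https://public.vulnerablecode.io/api/packages/&quot;,
        params={&quot;purl&quot;: purl},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    # Field names below match recent API versions; adjust if the schema moved.
    packages = resp.json().get(&quot;results&quot;, [])
    return [v for p in packages for v in p.get(&quot;affected_by_vulnerabilities&quot;, [])]

print(vulnerabilities_for(&quot;pkg:pypi/django@3.0&quot;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;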

&lt;h2 id=&quot;clearlydefined&quot;&gt;ClearlyDefined&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://clearlydefined.io/&quot;&gt;ClearlyDefined&lt;/a&gt; is a community-driven effort focused on aggregating and curating licensing and security metadata for open source packages. This information is available via a public &lt;a href=&quot;https://api.clearlydefined.io/definitions/&quot;&gt;API&lt;/a&gt;, offering structured data across many popular ecosystems (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npm&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Maven&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyPI&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NuGet&lt;/code&gt;, etc.).&lt;/p&gt;

&lt;p&gt;The data includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Declared and discovered software licenses (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SPDX&lt;/code&gt; identifiers)&lt;/li&gt;
  &lt;li&gt;URLs pointing to source code repositories&lt;/li&gt;
  &lt;li&gt;Copyright holder information&lt;/li&gt;
  &lt;li&gt;Full texts of detected licenses&lt;/li&gt;
  &lt;li&gt;Confidence scores indicating metadata quality&lt;/li&gt;
  &lt;li&gt;Per-file license information (accessible through specific query options)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ClearlyDefined project plays a key role in enriching SBOMs and compliance workflows by filling in missing or incomplete attribution data. To retrieve structured license insights, agents can query the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;API&lt;/code&gt; using standard package coordinates (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type/provider/namespace/name/version&lt;/code&gt;).&lt;/p&gt;
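
&lt;p&gt;For example, fetching the curated definition for a single npm package takes one GET request; the response fields below follow the documented definition schema:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Fetch the curated ClearlyDefined definition for one package, addressed by
# its type/provider/namespace/name/version coordinates.
import requests

def clearlydefined_summary(coordinates: str) -&gt; dict:
    url = f&quot;https://api.clearlydefined.io/definitions/{coordinates}&quot;
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    d = resp.json()
    return {
        &quot;declared_license&quot;: d.get(&quot;licensed&quot;, {}).get(&quot;declared&quot;),
        &quot;score&quot;: d.get(&quot;licensed&quot;, {}).get(&quot;score&quot;, {}).get(&quot;total&quot;),
        &quot;source&quot;: d.get(&quot;described&quot;, {}).get(&quot;sourceLocation&quot;),
    }

# &quot;-&quot; marks an empty namespace for unscoped npm packages.
print(clearlydefined_summary(&quot;npm/npmjs/-/lodash/4.17.21&quot;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;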

&lt;h2 id=&quot;depsdev-open-source-insights-by-google&quot;&gt;deps.dev (Open Source Insights by Google)&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://deps.dev/&quot;&gt;deps.dev&lt;/a&gt; (Open Source Insights by Google) provides structured metadata and dependency information for software packages across major ecosystems, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npm&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Maven&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyPI&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Go&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rust&lt;/code&gt;. Its public &lt;a href=&quot;https://docs.deps.dev/api/v3/&quot;&gt;API&lt;/a&gt; returns comprehensive metadata such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Complete dependency graphs, including direct and transitive relationships&lt;/li&gt;
  &lt;li&gt;Known vulnerabilities from the OSV database, mapped to affected version ranges&lt;/li&gt;
  &lt;li&gt;License data for each package version&lt;/li&gt;
  &lt;li&gt;Version history with release dates&lt;/li&gt;
  &lt;li&gt;Cryptographic hashes and links to source repositories&lt;/li&gt;
  &lt;li&gt;Integrated security signals, such as OpenSSF Scorecard results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This service helps agents analyze how dependencies are interconnected, track the entry points of vulnerabilities, verify licensing, monitor version freshness, and assess overall risk and health using standardized metrics.&lt;/p&gt;
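
&lt;p&gt;As a sketch, two calls cover most of this: one for a version’s metadata and one for its resolved dependency graph. Field names follow the v3 API documentation:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Query the deps.dev v3 API for a package version and its dependency graph.
import requests

BASE = &quot;https://api.deps.dev/v3/systems/{sys}/packages/{pkg}/versions/{ver}&quot;

def version_info(system: str, package: str, version: str) -&gt; dict:
    # Returns metadata such as licenses and links for one version.
    resp = requests.get(BASE.format(sys=system, pkg=package, ver=version), timeout=30)
    resp.raise_for_status()
    return resp.json()

def dependencies(system: str, package: str, version: str) -&gt; dict:
    # The :dependencies suffix returns direct and transitive nodes.
    url = BASE.format(sys=system, pkg=package, ver=version) + &quot;:dependencies&quot;
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

info = version_info(&quot;npm&quot;, &quot;react&quot;, &quot;18.2.0&quot;)
graph = dependencies(&quot;npm&quot;, &quot;react&quot;, &quot;18.2.0&quot;)
print(info.get(&quot;licenses&quot;), len(graph.get(&quot;nodes&quot;, [])))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;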

&lt;h2 id=&quot;openssf-security-scorecards&quot;&gt;OpenSSF Security Scorecards&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://openssf.org/projects/scorecard/&quot;&gt;OpenSSF Security Scorecards&lt;/a&gt; is a project by the Open Source Security Foundation that automatically evaluates open source repositories against security best practices. It assigns numeric scores to a set of defined checks, and makes results available through a public &lt;a href=&quot;https://api.securityscorecards.dev/&quot;&gt;API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The system evaluates aspects such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Use of code review and branch protection&lt;/li&gt;
  &lt;li&gt;Dependency pinning practices&lt;/li&gt;
  &lt;li&gt;Use of fuzzing and static analysis tools&lt;/li&gt;
  &lt;li&gt;CI/CD integration&lt;/li&gt;
  &lt;li&gt;Overall maintenance activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each check is scored from 0 to 10 and includes a detailed explanation of what was detected and how the score was derived.&lt;/p&gt;

&lt;p&gt;Scorecards offer a standardized, automated view of a project’s security maturity. This data can complement vulnerability and license scanning by flagging projects with weak security practices. Agentic systems can use this to prioritize high-risk components, identify gaps in development hygiene, and include security posture in broader risk assessments across the software supply chain.&lt;/p&gt;
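
&lt;p&gt;Retrieving a published result is a single unauthenticated GET against the public API:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Fetch the latest published Scorecard result for a GitHub repository.
import requests

def scorecard(owner: str, repo: str) -&gt; dict:
    url = f&quot;https://api.securityscorecards.dev/projects/github.com/{owner}/{repo}&quot;
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

result = scorecard(&quot;ossf&quot;, &quot;scorecard&quot;)
print(&quot;aggregate:&quot;, result[&quot;score&quot;])
for check in result[&quot;checks&quot;]:
    print(check[&quot;name&quot;], check[&quot;score&quot;])  # per-check 0-10 scores
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;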

&lt;h2 id=&quot;final-takeaways&quot;&gt;Final Takeaways&lt;/h2&gt;

&lt;p&gt;These data sources are available and mature enough to support agent-based compliance tooling today. You can automate large parts of OSS license validation, vulnerability triage, and legal risk assessment by connecting them into your pipelines or AI systems.&lt;/p&gt;

&lt;p&gt;If you’re exploring how to build or extend agentic systems for open source compliance, now is a great time to start integrating these feeds. The infrastructure is ready — it’s just connecting the dots.&lt;/p&gt;
</description>
        <pubDate>Tue, 08 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/04/data-resources-agentic-ai-open-source-security-compliance.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/04/data-resources-agentic-ai-open-source-security-compliance.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        <category>ai</category>
        
        <category>n8n</category>
        
        
        <category>security</category>
        
      </item>
    
      <item>
        <title>AI Agents: The Missing Piece in SBOM Compliance</title>
        <description>&lt;p&gt;Today, I gave a talk called “Taming the SBOM Chaos: Using AI Agents to Audit SBOMs for OSS Compliance.” The slides and materials are on my &lt;a href=&quot;https://github.com/oscarvalenzuelab/sbom_analysis_using_agentic&quot;&gt;GitHub account&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As Software Bill of Materials (SBOM) adoption grows, so does the complexity of managing, validating, and ensuring compliance with evolving regulatory frameworks. Many companies struggle with these challenges due to a lack of specialized in-house Subject Matter Experts (SMEs). My talk explored how AI-powered workflows, leveraging specialized models and OpenData APIs, can streamline compliance and audits. Attendees gained insights into how AI agents can assist in automating SBOM analysis, reducing human workload, and enhancing compliance strategies in an increasingly regulated software landscape.&lt;/p&gt;

&lt;p&gt;Here’s a quick write-up of the ideas behind the talk:&lt;/p&gt;

&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;
&lt;p&gt;Open source license compliance is complicated. A license might seem simple, but how you use the software changes everything. Using a library internally might involve minimal obligations, but shipping it within a product can trigger significant compliance requirements under the same license.&lt;/p&gt;

&lt;p&gt;So the same code, under the same license, can mean different obligations depending on its use. That’s why compliance isn’t just about reading the license or running a scanner to generate a report. You need to understand the whole picture—how the software is built, how it runs, and who it’s for. That makes scaling compliance across large projects very difficult. Every case is different, and making the right call takes time, experience, and judgment.&lt;/p&gt;

&lt;h2 id=&quot;tools-dont-solve-the-problem&quot;&gt;Tools don’t solve the problem.&lt;/h2&gt;
&lt;p&gt;We have tools that try to help but don’t go far enough. Most tools assume the input is perfect. They expect a clean SBOM with all the correct and complete data. They presume every package has a valid license string, every file has the correct hash, and every rule is black and white.&lt;/p&gt;

&lt;p&gt;But in real life, SBOMs are often broken or missing key information. Many tools struggle with this reality. Some crash, others give you results that look right but aren’t. Either way, they often don’t explain their reasoning. Even if you identify an error in the source data, correcting it within the tool isn’t always possible.&lt;/p&gt;

&lt;p&gt;Even commercial tools that promise complete automation (automated SCA audits) fall short. They might work for simple cases (like security vulnerability scans), but they struggle with edge cases or complex builds. And they often lock you into their thinking, acting as black boxes whose results you can’t easily question or improve.&lt;/p&gt;

&lt;h2 id=&quot;sboms-should-help-but-they-often-dont&quot;&gt;SBOMs should help. But they often don’t.&lt;/h2&gt;
&lt;p&gt;A good SBOM should tell you what’s inside your software: package names, versions, licenses, and hashes. But most SBOMs aren’t that helpful.&lt;/p&gt;

&lt;p&gt;Some use different formats. Others are missing key fields like license data or component hashes. Sometimes, the fields are there, but they’re wrong or manually edited. There’s no standard for what “good” looks like, and the quality varies significantly depending on how the SBOM was created.&lt;/p&gt;

&lt;p&gt;Many SBOMs create more work instead of providing answers, forcing teams to spend valuable time correcting inaccuracies rather than using the data for analysis.&lt;/p&gt;

&lt;h2 id=&quot;how-agentic-ai-helps&quot;&gt;How Agentic AI helps&lt;/h2&gt;
&lt;p&gt;This is where agentic AI comes in. It’s not just a script or a one-time tool. It’s a system that can think, plan, and take action to achieve a goal.&lt;/p&gt;

&lt;p&gt;If you provide an SBOM to an AI agent, it will read the data, even if the SBOM is incomplete or incorrectly formatted. It can figure out what’s missing and try to fill in the blanks. It can check other sources, like public license databases or internal records. It can pull together what it finds and generate a clear summary for assessment.&lt;/p&gt;

&lt;p&gt;In short, it doesn’t stop at parsing the file. It tries to answer the bigger question: “What are the risks, and what should we do about them?”&lt;/p&gt;

&lt;p&gt;For example, imagine you get an SBOM with no license info and bad hashes. A regular tool would either fail or give you incomplete results. An AI agent would keep going. It could search for license data using the package identifiers. It might cross-check with external databases. It could flag what it couldn’t verify and give you a report showing what it found and what’s still unknown.&lt;/p&gt;
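
&lt;p&gt;Here’s a hedged sketch of what that single enrichment step could look like, using ClearlyDefined as the external source; the coordinate mapping and status labels are illustrative, not a fixed design:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative enrichment step for an SBOM component with no license info:
# try one public metadata source, and flag the component if nothing is found.
import requests

def enrich_license(coordinates: str) -&gt; dict:
    # coordinates use ClearlyDefined style: type/provider/namespace/name/version
    url = f&quot;https://api.clearlydefined.io/definitions/{coordinates}&quot;
    declared = None
    try:
        body = requests.get(url, timeout=30).json()
        declared = body.get(&quot;licensed&quot;, {}).get(&quot;declared&quot;)
    except requests.RequestException:
        pass  # network failure: leave the gap visible instead of guessing
    return {
        &quot;coordinates&quot;: coordinates,
        &quot;license&quot;: declared,
        &quot;status&quot;: &quot;resolved&quot; if declared else &quot;unverified: needs human review&quot;,
    }

print(enrich_license(&quot;npm/npmjs/-/left-pad/1.3.0&quot;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;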

&lt;p&gt;That’s a big step forward.&lt;/p&gt;

&lt;h2 id=&quot;the-key-ingredient-expert-curated-data&quot;&gt;The Key Ingredient: Expert-Curated Data&lt;/h2&gt;
&lt;p&gt;Training an AI agent without data is like hiring an intern and asking them to handle compliance on day one—with no examples, training, or documentation. They won’t know where to start. Forced to provide answers without adequate guidance, they will produce anything from half-baked responses to outright hallucinations.&lt;/p&gt;

&lt;p&gt;Like any Open Source Compliance Engineer, AI agents need foundational knowledge and context to perform effectively. If we expect agents to make sense of licenses, assess risks, and produce useful reports, we must give them expert-curated information. That includes examples of license metadata, known issues with specific packages, rules for when obligations apply, and context about how software is used.&lt;/p&gt;

&lt;p&gt;These aren’t just “nice to have.” They’re the foundation. Without them, the agent can’t reason. It can’t tell if a component is risky. It can’t spot patterns. It can’t improve to help you better.&lt;/p&gt;

&lt;p&gt;So instead of building tools overloaded with underdeveloped features (suffering from ‘Swiss Army knife’ syndrome), we need to focus on building and maintaining strong data feeds of knowledge. These should come from people who know open source compliance, like compliance engineers, lawyers, and auditors, who have done the work and been in the trenches.&lt;/p&gt;

&lt;p&gt;The better the data, the better the agent. That’s what makes the difference. Pioneering organizations like &lt;a href=&quot;https://www.scanoss.com/&quot;&gt;SCANOSS&lt;/a&gt;, &lt;a href=&quot;https://nexb.com/&quot;&gt;NexB&lt;/a&gt;, and &lt;a href=&quot;https://www.softwareheritage.org/&quot;&gt;Software Heritage&lt;/a&gt; understand this. They lead the way by offering the curated knowledge and data feeds crucial for effective compliance. This foundational work will fuel the next wave of automation driven by AI.&lt;/p&gt;

&lt;p&gt;Other organizations, like OpenSSF, ClearlyDefined, and deps.dev, offer data sources covering security and compliance information.&lt;/p&gt;

&lt;h2 id=&quot;final-thought&quot;&gt;Final thought&lt;/h2&gt;
&lt;p&gt;AI won’t replace compliance engineers but will change how we work. Instead of spending hours looking for missing package metadata, checking GitHub for licenses, reading reports, or fixing bad SBOMs, we’ll design systems that do most of the heavy lifting for us. We’ll focus on the decisions that matter, and we’ll train agents to handle the rest.&lt;/p&gt;

&lt;p&gt;We’re already seeing that shift, and it’s worth building on. AI agents represent the next wave in compliance management. Now is the time to start riding it.&lt;/p&gt;
</description>
        <pubDate>Mon, 07 Apr 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/04/ai-agents-the-missing-piece-in-sbom-compliance.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/04/ai-agents-the-missing-piece-in-sbom-compliance.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        <category>ai</category>
        
        <category>n8n</category>
        
        
        <category>security</category>
        
      </item>
    
      <item>
        <title>Why Most SBOMs Fail and What to Do About It</title>
        <description>&lt;p&gt;SBOM adoption is accelerating. Regulatory pressure, threats to software supply chains, and transparency demands drive widespread use. But while SBOMs are becoming standard, their quality often falls short.&lt;/p&gt;

&lt;p&gt;Open Source License Compliance (OSLC) teams have tracked software components and licenses for years before SBOMs became mainstream, often using spreadsheets. Standardized formats like SPDX and CycloneDX promised automation and clarity, but SBOM-driven processes usually fail to deliver in practice.&lt;/p&gt;

&lt;h3 id=&quot;the-reality-sboms-are-often-incomplete-inaccurate-and-hard-to-use&quot;&gt;The Reality: SBOMs Are Often Incomplete, Inaccurate, and Hard to Use&lt;/h3&gt;

&lt;p&gt;Despite years of standardization and significant progress on structured formats, most SBOMs lack the consistency and depth needed for real-world use. They pass schema checks but fail basic usability and quality tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common SBOM issues fall into several categories:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Many files are incomplete, missing critical fields such as licenses, supplier names, or checksums.&lt;/li&gt;
  &lt;li&gt;Others contain duplicate components or rely heavily on “no-assertion” entries, offering little usable information.&lt;/li&gt;
  &lt;li&gt;License data is frequently inaccurate, a problem made worse by the overconfidence placed in automated SCA tools—often promoted by security departments that, by sheer coincidence, also happen to control the majority of the tooling budget.&lt;/li&gt;
  &lt;li&gt;Format incompatibility is also a frequent challenge. Although SPDX and CycloneDX aim for similar outcomes, their structural differences create friction during integration or conversion.&lt;/li&gt;
  &lt;li&gt;Compounding this, updates to SBOM standards introduce new fields and capabilities, but tools often lag in adopting them.&lt;/li&gt;
  &lt;li&gt;Many SBOMs are generated automatically by software composition tools and assumed to be accurate without further validation, leading to widespread trust in documents that may not meet compliance or quality requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;what-makes-an-sbom-high-quality&quot;&gt;What Makes an SBOM High Quality?&lt;/h3&gt;

&lt;p&gt;The OpenChain Telco SBOM Guide v1.1 offers a practical definition of SBOM quality. It emphasizes standardization, completeness, and transparency to support software supply chain management. It outlines recommendations and requirements for including key metadata, license data, and transitive dependencies. The core goal is simple: every SBOM should be clear, consistent, and complete at the point of delivery.&lt;/p&gt;

&lt;h3 id=&quot;testing-sbom-quality-how-can-you-measure-it&quot;&gt;Testing SBOM Quality: How Can You Measure It?&lt;/h3&gt;

&lt;p&gt;Here are practical methods to assess the quality of SBOMs for those using or generating them. These are general recommendations, and industry-specific practices may further enhance this framework. The following list represents basic validation checks that have proven effective in various scenarios.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Schema Validation&lt;/strong&gt;: Use tools, such as JSON or XML validation tools, to verify that your SBOM adheres to the SPDX or CycloneDX schemas.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;NTIA/CRA Compliance&lt;/strong&gt;: Verify that your SBOM includes the necessary fields for regulatory compliance, such as license information, supplier details, and versioning.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;License Verification&lt;/strong&gt;: Employ license validation tools to compare the licenses listed in your SBOM against established license datasets and identify any inconsistencies. In experimental scenarios, I’ve observed that even advanced SCA tools may struggle to achieve 100% accuracy in license identification. When a license cannot be definitively determined, the SBOM often contains multiple notations, “no-assertion” entries, or empty fields.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Evaluating No-Assertion Rates&lt;/strong&gt;: I have encountered SBOMs where every field of every component is marked as “no-assertion.” Verify this directly in the raw SBOM file (see the sketch after this list): many commercial tools try to infer missing data during SBOM import, potentially contaminating the SBOM with unverifiable assumptions, since the tool lacks access to the source code for confirmation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Legal Risk Detection&lt;/strong&gt;: Scan SBOMs against curated datasets of problematic packages to identify high-risk dependencies associated with known vulnerabilities. Some open-source communities maintain projects with datasets of problematic components (hidden binaries, incorrect license assertions, undeclared deep dependencies, etc.), which can aid in risk identification. Examples include VulnerableCode from NexB, ClearlyDefined from OSI, and OSSA (Open Source Software Advisory) from Xpertians. If your project uses any cryptography library, it’s a good time to check against the Crypto Algorithm dataset offered by SCANOSS.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cross-Format Compatibility&lt;/strong&gt;: When supporting multiple SBOM formats, conduct tests to preserve data integrity during conversions between SPDX and CycloneDX.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hashing&lt;/strong&gt;: Verify that all components include the same set of hashes and that the length and format of these hash strings correspond to the expected values.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Metadata Integrity&lt;/strong&gt;: Vendors sometimes provide heavily manipulated SBOMs. In such cases, the tool description may be absent, or the information may exhibit inconsistencies throughout the file.&lt;/li&gt;
&lt;/ul&gt;
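
&lt;p&gt;As a concrete example of the no-assertion check above, here is a small sketch that measures how many package entries in an SPDX 2.x JSON document carry “NOASSERTION” (or nothing at all) in their license fields:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Compute the no-assertion rate of an SPDX 2.x JSON SBOM: the share of
# packages whose license fields are NOASSERTION or missing entirely.
import json
import sys

def noassertion_rate(path: str) -&gt; float:
    with open(path, encoding=&quot;utf-8&quot;) as fh:
        doc = json.load(fh)
    packages = doc.get(&quot;packages&quot;, [])
    if not packages:
        return 0.0
    empty = sum(
        1 for p in packages
        if p.get(&quot;licenseConcluded&quot;, &quot;NOASSERTION&quot;) == &quot;NOASSERTION&quot;
        and p.get(&quot;licenseDeclared&quot;, &quot;NOASSERTION&quot;) == &quot;NOASSERTION&quot;
    )
    return empty / len(packages)

print(f&quot;{noassertion_rate(sys.argv[1]):.0%} of packages lack a usable license&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;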

&lt;h3 id=&quot;the-industry-problem-sboms-evolve-faster-than-tools&quot;&gt;The Industry Problem: SBOMs Evolve Faster Than Tools&lt;/h3&gt;

&lt;p&gt;The most significant barrier to SBOM reliability is the gap between evolving standards and stagnant tooling. While SPDX and CycloneDX continue to add new metadata and security features, most supporting tools—scanners, automation pipelines, and policy engines—struggle to keep up. This misalignment creates gaps in automation, format conversion, policy enforcement, and risk detection. The result is a fragmented ecosystem with inconsistent adoption and limited interoperability across the software supply chain.&lt;/p&gt;

&lt;h3 id=&quot;the-path-forward&quot;&gt;The Path Forward&lt;/h3&gt;

&lt;p&gt;SBOMs are essential, but generating a file isn’t enough. Usability and reliability require collaboration across standards bodies, toolmakers, and users. Schema validation is the floor, not the ceiling. We need clear quality benchmarks, better cross-format compatibility, and automation that flags low-quality SBOMs. Until a widely accepted quality standard emerges, teams must validate and refine the SBOMs they produce, consume, and share feedback to strengthen the ecosystem.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;What challenges have you faced with SBOMs? What does “quality” mean in your context? I’d love to hear how you validate and improve the SBOMs in your environment.&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Fri, 14 Feb 2025 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2025/02/why-most-sboms-fail-and-what-to-do-about-it.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2025/02/why-most-sboms-fail-and-what-to-do-about-it.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        
        <category>security</category>
        
      </item>
    
      <item>
        <title>The &apos;keep it simple SBoM&apos; is the perfect small first step for your organization.</title>
        <description>&lt;p&gt;SPDX and CycloneDX are excellent standards for handling Software Bill of Material (SBoM), but full adoption requires time, tooling, and correct intake processes. If your organization is not yet ready for this, consider using a simplified format like KissBOM (commonly known as Open Source Package Inventory or OSPI). It’s a practical choice that can ease your transition into SBOM management.&lt;/p&gt;

&lt;p&gt;For early adopters, KissBOM offers a simplified SBOM format that is easier to manage while still covering the essentials: package URLs (PURLs), license information, and optional copyright statements. Its design for ease of use makes it an excellent option for organizations that need to track their software components quickly, without the overhead of more complex formats. It even allows for the manual creation of an SBOM.&lt;/p&gt;
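
&lt;p&gt;To illustrate, here is a hypothetical minimal inventory along those lines; the field names are illustrative rather than an official KissBOM schema:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical minimal package inventory (illustrative field names, not an
# official KissBOM schema): one entry per component with PURL, license, and
# an optional copyright statement.
import json

inventory = [
    {&quot;purl&quot;: &quot;pkg:pypi/requests@2.31.0&quot;,
     &quot;license&quot;: &quot;Apache-2.0&quot;,
     &quot;copyright&quot;: &quot;Copyright Kenneth Reitz&quot;},
    {&quot;purl&quot;: &quot;pkg:npm/lodash@4.17.21&quot;,
     &quot;license&quot;: &quot;MIT&quot;},
]

print(json.dumps(inventory, indent=2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;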

&lt;p&gt;While KissBOM doesn’t support the complexity of managing relationships and metadata, it can help achieve two primary goals: documenting the package inventory with license information, and building organizational experience in using these artifacts to track and share package inventory data. Once best practices for generating, parsing, and tracking components are established, organizations can transition to the full standards, implementing the tooling and robust processes needed to create and receive SBOMs in a fully standardized format.&lt;/p&gt;

&lt;p&gt;Starting with a simplified format streamlines Open Source Compliance activities and allows risk screening without heavy tooling. For example, combining the package data from KissBOMs with data sources like VulnerableCode can simplify security risk analysis. I’ve been using KissBOMs to test tools that automatically produce Open Source Compliance artifacts, such as Legal Notices and Source Compliance bundles, to reduce the work of generating Attribution Documents. While all of these projects are still “Works in Progress,” the key is the adoption of a simplified SBoM.&lt;/p&gt;

&lt;p&gt;Adopting a full standard will be ideal, but if we want to start experimenting and aligning organizations, concepts like KissBOM are the perfect small step to consider today.&lt;/p&gt;
</description>
        <pubDate>Fri, 08 Nov 2024 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2024/11/keep-it-simple-sbom-is-the-perfect-small-first-step-for-your-organization.md.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2024/11/keep-it-simple-sbom-is-the-perfect-small-first-step-for-your-organization.md.html</guid>
        
        <category>sbom</category>
        
        <category>compliance</category>
        
        <category>oss</category>
        
        <category>supplychain</category>
        
        <category>openchain</category>
        
        
        <category>security</category>
        
      </item>
    
      <item>
        <title>Detecting source code generated by AI using Machine Learning</title>
        <description>&lt;p&gt;AI has become disruptive in many ways, especially for developers using AI agents to debug software, remediate errors, and even automatically generate the whole code for simple applications.&lt;/p&gt;

&lt;p&gt;Using AI to create software comes with additional concerns about undeclared obligations, mainly because of the lack of clarity around copyright ownership, the undefined legal role of the service provider (contractor services vs. tooling), and the still-open question of whether the generated code is affected by the third-party licenses of the datasets used to train the model.&lt;/p&gt;

&lt;h3 id=&quot;the-problem-with-ai-and-copyright&quot;&gt;The Problem with AI and Copyright&lt;/h3&gt;
&lt;p&gt;The new paradigm has brought new challenges for lawyers and copyright experts, who were surprised by the technology and are rushing to define policies and strategies that help developers use AI tools responsibly while balancing ease of use against possible legal risks.&lt;/p&gt;

&lt;p&gt;Now, whatever set of rules an organization defines about using AI, there’s the practical problem of enforcement: verifying whether source code includes AI contributions. Companies must devise a way to determine when code was generated by an AI tool rather than an engineer to ensure legal compliance and security standards.
While many would suggest plagiarism detection for identifying AI-generated code, it’s apparent that matching code blocks against Open Source alone is no longer helpful.&lt;/p&gt;

&lt;p&gt;Detecting AI-generated code has become challenging, and as the technology evolves, it will grow more complex until detection becomes impractical. Some AI services are looking to provide additional features so that legal aspects are covered, like using “reftags” to track the provenance of the generated output or allowing users to filter the datasets used to answer a question. Still, not all players want to keep the game fair; to keep their advantage, some are content to omit these safeguards when generating code, which will undoubtedly raise concern across the industry.&lt;/p&gt;

&lt;h3 id=&quot;plagiarism-and-snippet-detection-for-ai-could-be-obsolete&quot;&gt;Plagiarism and Snippet Detection for AI could be obsolete.&lt;/h3&gt;
&lt;p&gt;Many vendors who offer snippet detection products are trying to add AI detection features. Still, AI technology has evolved beyond generating fixed blocks of code. The most recent AI tools understand programming languages much as they understand natural language, making code snippet detection obsolete unless vendors add AI capabilities and additional techniques to their detection tools.&lt;/p&gt;

&lt;p&gt;If plagiarism detection is obsolete, we should look at what a developer does when writing software instead of checking the code blocks. Every developer writes code differently, not because they use different structures, but because each has their own &lt;a href=&quot;https://insanelab.com/blog/notes/spaces-vs-tabs/&quot;&gt;preferences&lt;/a&gt; for the “style” in which code is written. There are even &lt;a href=&quot;https://stackoverflow.blog/2017/06/15/developers-use-spaces-make-money-use-tabs/&quot;&gt;studies&lt;/a&gt; correlating coding style preferences with how much money a developer makes.&lt;/p&gt;

&lt;p&gt;These “styles” and preferences, together with the algorithm behind the source code, could be the input for an AI detection system.&lt;/p&gt;

&lt;h3 id=&quot;using-machine-learning-and-feature-extraction-to-identify-ai&quot;&gt;Using Machine Learning and feature extraction to identify AI&lt;/h3&gt;
&lt;p&gt;Implementing an AI detection system goes beyond what a blog post could cover, but simple techniques can be used.&lt;/p&gt;

&lt;p&gt;While researching, I came up with the idea of implementing feature extraction, generating a score for each feature, and using those scores to train a Machine Learning DecisionTreeClassifier that distinguishes AI from human code.&lt;/p&gt;

&lt;p&gt;I’m extracting five features: cyclomatic complexity, style guideline consistency, repetitive patterns, comment quality, and code indentation. Each produces a score that is later passed to a DecisionTreeClassifier implemented with scikit-learn. Feel free to suggest other features; you can find the whole project on &lt;a href=&quot;https://github.com/oscarvalenzuelab/botsniffer&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is a rudimentary method implemented in ugly code, but it can be extended.&lt;/p&gt;
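
&lt;p&gt;Here is a minimal sketch of the idea with simplified stand-in features; the real five-feature implementation lives in the repository linked above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Minimal sketch: turn source code into a few style scores and train a
# DecisionTreeClassifier to separate AI-like from human-like code.
# The features here are simplified stand-ins for the five used in botsniffer.
from sklearn.tree import DecisionTreeClassifier

def extract_features(source: str) -&gt; list:
    lines = [l for l in source.splitlines() if l.strip()]
    if not lines:
        return [0.0, 0.0, 0.0]
    comment_ratio = sum(l.lstrip().startswith(&quot;#&quot;) for l in lines) / len(lines)
    avg_line_len = sum(len(l) for l in lines) / len(lines)
    indent_ok = sum((len(l) - len(l.lstrip())) % 4 == 0 for l in lines) / len(lines)
    return [comment_ratio, avg_line_len, indent_ok]

# Toy training data: feature vectors labeled 1 = AI-generated, 0 = human.
X = [[0.30, 42.0, 1.00],   # heavily commented, uniform style
     [0.05, 61.0, 0.70]]   # sparse comments, mixed indentation
y = [1, 0]

clf = DecisionTreeClassifier().fit(X, y)
sample = &quot;def add(a, b):\n    # Add two numbers\n    return a + b\n&quot;
print(clf.predict([extract_features(sample)]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;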

&lt;p&gt;&lt;strong&gt;&lt;em&gt;NOTICE&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This tool does not provide legal advice; I’m not a lawyer.&lt;/p&gt;

&lt;p&gt;The code is an experimental implementation. Refrain from relying on the accuracy of the output of this tool.&lt;/p&gt;
</description>
        <pubDate>Sun, 23 Apr 2023 00:00:00 +0000</pubDate>
        <link>https://ovalenzuela.com/2023/04/detecting-source-code-generated-by-ai-using-machine-learning.html</link>
        <guid isPermaLink="true">https://ovalenzuela.com/2023/04/detecting-source-code-generated-by-ai-using-machine-learning.html</guid>
        
        <category>ai</category>
        
        <category>machinelearning</category>
        
        <category>python</category>
        
        <category>oss</category>
        
        
        <category>development</category>
        
        <category>compliance</category>
        
      </item>
    
  </channel>
</rss>
