In October, the OSI hosted the State of the Source Track at All Things Open, designed to connect developers with the big policy conversations shaping our ecosystem. Gabriel Toscano from Duke University led the State of the “Open” AI session, unpacking what it really means when models call themselves open.
State of the “Open” AI
Gabriel Toscano (Duke University)
The AI boom, driven by unprecedented investment and access to tools, has spawned a flood of models claiming to be “open.” The October 2024 release of the Open Source AI Definition (OSAID) sought to anchor the term in clear, unambiguous standards. Yet the concept and application of openness in AI remain nascent and inconsistently understood. Drawing on a growing set of current AI models described as “open,” this session explores how the term is applied in practice and describes how different conceptions of open AI relate to the use, sharing, study, and modification of these models.
Video summary
Introduction
My name is Gabriel Toscano. I’m a second-year master’s student in Public Policy at Duke University, specializing in technology policy—yes, that’s a real thing! I’m a recovering software engineer and philosopher, now working under the mentorship of the Open Source Initiative (OSI). I’m here as both a scholar and a developer, and I want to thank the OSI for giving me an outlet to explore ideas around open source and contribute research in this space.
I approach this topic with humility—it’s rapidly evolving, and there’s a lot we still don’t know. My goal is to engage the community and learn from experts to better understand what “open” really means in the context of AI.
The Focus: Openness in AI
Today, I’ll talk about the state of openness in AI. My work uses computational methods to study questions that traditionally belong to the social sciences—like how people interpret “openness.”
If I asked everyone here what “open AI” means, we’d probably hear a dozen different answers. I’m trying to systematically analyze how developers use the idea of openness when creating and publishing AI models.
The main goal of this research is to understand how open models are released, identify key trends, and explore how the term “open source AI” might develop into a defined concept within the field. The aim isn’t to judge which models are “truly” open or to evaluate the Open Source AI Definition itself—it’s to capture a snapshot of how the term is used in practice.
Methods
To do this, I collected metadata from Hugging Face, which hosts hundreds of thousands of public AI models. I analyzed about 20,000 models—roughly 10% of the total.
It’s important to note that “public” doesn’t mean “open source.” Developers can attach different licenses that define how their work may be used or shared. Understanding those licenses is essential: they tell users what’s allowed, what’s restricted, and under what terms.
I used the Hugging Face API to search for models tagged with the words “open” or “open source.” The results were noisy—some didn’t belong—but that’s also part of the story. The term “open” in AI is still used inconsistently, and we can expect these patterns to evolve.
I also analyzed the licensing practices of major AI labs, since their choices often set standards that ripple through the ecosystem. For each model, I collected metadata such as publication date, author, license, README content, and last update date, combining quantitative and qualitative analysis in Python.
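As a rough illustration of this pipeline, the sketch below shows how a model’s license can be read out of Hugging Face metadata, where it is exposed as a `license:<id>` tag on each model. The sample records here are hypothetical stand-ins for API results, not the actual dataset from the study.

```python
from collections import Counter

def extract_license(tags):
    """Return the license identifier from a Hugging Face tag list, or None.

    On Hugging Face, a model's declared license appears as a tag of the
    form "license:<id>", e.g. "license:apache-2.0".
    """
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None  # no license declared

# Hypothetical metadata records, standing in for Hub API search results.
models = [
    {"id": "lab/model-a", "tags": ["license:apache-2.0", "text-generation"]},
    {"id": "lab/model-b", "tags": ["license:mit"]},
    {"id": "lab/model-c", "tags": ["text-generation"]},  # unlicensed
]

counts = Counter(extract_license(m["tags"]) for m in models)
print(counts)
```

In the real study this kind of tally runs over tens of thousands of records, but the parsing step is the same: the license is just another piece of tag metadata, present or absent.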
Key Findings
- Most models build on existing ones. Few developers create models entirely from scratch, which means the largest and most popular models have an outsized influence downstream.
- Apache 2.0 is the most common OSI-approved license, followed by MIT. These are classic open-source licenses, especially for code components.
- Many models—surprisingly—use Creative Commons (CC BY) licenses, which are not recommended for software, since they lack clear rules for redistribution or modification.
- About half of the models had no license at all. That means users can’t easily tell what’s permitted. Whether this is oversight or intentional is unclear, but it complicates evaluation and reuse.
- The Qwen model family is the most common base model: about 2.5% of all models on Hugging Face build from it. That’s significant influence for one lineage.
- Custom licenses are increasingly popular, but many impose usage restrictions, creating tension with the spirit of open source.
Understanding “Open” in AI
The Open Source AI Definition focuses on ensuring the four freedoms: to use, study, modify, and share systems.
But in practice, we see confusion between terms like open access, open weights, and open source.
- Open access generally means the model is publicly available.
- Open weights means the trained parameters are accessible.
- Open source requires both of the above plus code, data, and licensing that guarantee the four freedoms.
In other words, open source AI sets a higher bar—it’s not just about access but also about legal and functional freedom to build upon the work.
License Trends
Here’s how license usage breaks down:
- Nearly 60% of models labeled “open” have no license.
- Apache 2.0 accounts for around 23%, followed by MIT.
- A small but growing share use custom or restrictive licenses, including those from Llama, Qwen, DeepSeek, and others.
Even though the data is messy, we do see a trend toward using open-source-style licenses. However, custom and hybrid licenses are muddying the waters.
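A breakdown like the one above reduces to a simple share computation over raw counts. The figures below are illustrative numbers chosen to roughly match the proportions reported in the talk, not the study’s exact dataset.

```python
def license_shares(counts):
    """Convert raw license counts into percentage shares of the total."""
    total = sum(counts.values())
    return {lic: round(100 * n / total, 1) for lic, n in counts.items()}

# Illustrative counts only; the category names and values are made up
# to mirror the approximate shares described in the talk.
counts = {"no license": 60, "apache-2.0": 23, "mit": 9, "other/custom": 8}
print(license_shares(counts))
```

The messiness in the real data comes earlier, in deciding which bucket each model belongs to; once bucketed, the trend reporting is straightforward arithmetic like this.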
Examples of Custom Licenses
Custom AI licenses often use open-source language (“use, study, modify, share”) but then add restrictions that break those principles.
For example:
- Qwen license: allows the four freedoms but limits usage for products with over 100 million active users—introducing a usage restriction inconsistent with open-source norms.
- Llama license: similar restrictions, plus a clause banning use within the EU under Llama 4, possibly in response to new regulations.
- DeepSeek and Grok licenses: include “acceptable use” policies, often linked externally. These can change over time, making them unstable and hard to track.
This growing pattern—embedding commercial or moral restrictions inside otherwise open-source licenses—creates confusion and legal uncertainty.
Summary
- Most models are built on existing ones, amplifying the influence of major labs.
- Apache 2.0 and MIT remain the dominant truly open licenses.
- Many developers still omit licenses entirely, which undermines transparency.
- Creative Commons licenses are misapplied to software.
- Custom licenses mimic open language but limit freedom through use restrictions or mutable terms.
- Qwen and similar models are shaping the open AI ecosystem despite restrictive terms.
Platforms like Hugging Face could help by requiring license selection and validation during model upload, ensuring clarity for both creators and users.
Next Steps
I plan to extend this study by:
- Conducting network analysis of model relationships—mapping how models build upon each other.
- Tracking license propagation to see whether downstream models carry forward upstream restrictions or quietly drop them.
- Analyzing download and reuse trends to measure real-world impact.
- Studying documentation practices and how developers describe openness.
- Connecting these findings to policy frameworks, since future AI legislation will hinge on how “open source AI” is defined and applied.
This is a living project, and I’m eager to collaborate. The repository will be updated with cleaned data, code, and documentation. I invite you to submit models that call themselves “open,” especially from sovereign or culturally specific AI initiatives.
Thank you—and I look forward to continuing this conversation as we work toward a clearer, more consistent understanding of openness in AI.