Unless you have been in an isolated yoga retreat for the last week, you will certainly have heard of DeepSeek. This is a new model from a Chinese startup that has taken the tech world by storm, inducing a Sputnik-like panic in the US and prompting a sudden drop in tech share values as the Silicon Valley oligarchs remember that there’s a big scary world outside their borders.
But what makes DeepSeek different? What has gotten everyone’s attention is its R1 model, a reasoning model akin to OpenAI’s o1 and Google’s Gemini Flash Thinking; unlike those models, it was trained at a fraction of the cost, and it has been released as an open source model. Reasoning models are seen as the future of AI development and the most likely route towards AGI, the Holy Grail of AI research.
You have most likely interacted with large language models (LLMs), but reasoning models operate at a different level. The difference between the two is somewhat nuanced and depends on how strictly you define each term; in practice, the lines can blur, especially as LLMs become more sophisticated. LLMs are designed to understand and generate human language: their core task is to predict the next word in a sequence based on the vast amounts of text they have been trained on. Reasoning models are designed to perform logical inference, deduction, problem-solving, planning, and other forms of structured reasoning, so they mimic human thought more comprehensively than simply guessing what the next word will be based on learned language patterns.
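To make the distinction concrete, here is a minimal sketch of the next-token objective that underlies every LLM. It uses GPT-2 via Hugging Face transformers purely as a small stand-in model; nothing here reflects DeepSeek’s actual architecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in: any small causal LM illustrates the same point.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The copyright symbol is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Everything the model "knows" is expressed as a probability
# distribution over the next token in the sequence.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  p={p.item():.3f}")
```

A reasoning model such as R1 is still a next-token predictor under the hood, but it is additionally trained (in R1’s case, largely through reinforcement learning) to emit long chains of intermediate “thinking” tokens before committing to a final answer.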
So, pretty big news. While everyone is scrambling to write about what it all means for the AI arms race, I wanted to take a look at what DeepSeek’s deployment may mean for the AI Copyright Wars.
Training data
The DeepSeek R1 research paper doesn’t specify which data it was trained on, but while the startup has only just burst into public attention, it has been in operation since May 2023 and had already trained other models, mostly LLMs. The papers for their first LLM and for their second generation of LLM models mention the use of CommonCrawl, but other than describing de-duplication efforts, there are no specifics about what their LLM dataset consists of, and one has to assume that it is not only CommonCrawl. This lack of specificity is not particularly surprising; after all, early mentions of specific datasets have been used in copyright complaints against companies such as OpenAI and Meta.
However, the paper for their Vision-Language (VL) model does contain an actual list of training data, and it holds quite a few surprises that may prove relevant for copyright purposes (thanks to Alexander Doria for pointing me in the right direction). A VLM aligns images and text: it is trained on large datasets of image–text pairs so that it can predict how well a piece of text matches an image (or vice versa). Because of this dual nature, access to large and diverse datasets is very important. Some of the datasets are unsurprising, including publicly available PDF and epub files, which have the required image–text duality. But a large part of the training data (70%) comes from DeepSeek’s own LLM dataset, the text-only corpus used to train its language models, and while there is no specific indication of what that corpus contains, there is a surprising mention of Anna’s Archive.
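For readers unfamiliar with how this alignment works, below is a minimal sketch of image–text matching using OpenAI’s CLIP through Hugging Face as a generic stand-in. DeepSeek’s VL model is a different and more complex architecture; this only illustrates why paired image–text data such as PDFs and epubs is so valuable.

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A public sample image from the COCO dataset.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of two cats", "a page from a scanned book"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image scores how well each caption matches the image;
# training pushes matched pairs together and mismatched pairs apart.
print(outputs.logits_per_image.softmax(dim=1))
```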
This is pretty interesting for various reasons. Anna’s Archive is arguably the world’s largest search aggregator of shadow libraries, including Z-Library, LibGen, and Sci-Hub. As of January 2025, the Archive links to over 40 million books and 98 million papers. While the Archive doesn’t host the works themselves, there is no doubt that sharing them constitutes a communication to the public of those works without the authors’ permission, which is why the site has been blocked in the Netherlands, Italy, and the UK. The use of Anna’s Archive in training would therefore prove controversial at the very least. Meta, for example, recently found itself in hot water when it was disclosed that it had trained on LibGen, one of the shadow libraries aggregated by Anna’s Archive.
It is important to stress that we do not know for sure whether Anna’s Archive was used in training the LLMs or the reasoning models, or how much weight those libraries carry in the overall training corpus. What is worth pointing out is that if DeepSeek is found to have trained on Anna’s Archive, it would be the first large model to openly do so. Whether this could result in legal action is harder to discern; as far as I can tell, DeepSeek only has offices in China, so any legal action would have to take place there. To what extent the use of an undisclosed amount of shadow-library material in training would be actionable in other countries is also unclear; personally, I think it would be difficult to prove specific damage, but it’s still early days. An interesting aside is that the latest version of the EU AI Act’s General Purpose Code of Conduct prohibits signatories from using pirated sources, and that includes shadow libraries. Would this result in DeepSeek not being available in the EU?
Another interesting aspect of DeepSeek’s training is that OpenAI has accused the startup of training on synthetic data acquired from OpenAI’s own models, a process known as model distillation. So far I have not seen any evidence to this effect beyond a couple of press reports citing an unnamed person at OpenAI and a few gleeful people on social media. Regardless, this would not be a copyright issue at all, though it could have interesting implications, as such an action is apparently not allowed by OpenAI’s terms of use; I am not sure it is something worth getting worked up about.
Whatever happens with that dispute, distillation could also shape the future of AI training, because it means relying more heavily on synthetic data. Synthetic data is artificially generated information that mimics real-world data in structure, patterns, and statistical properties, but is not derived from actual human-generated data. At some point it was argued that AI training would run out of human-generated data, which would act as an upper limit to development, but the potential use of synthetic data means that such a limit may not exist. In fact, DeepSeek has already used synthetic data successfully to train its Math model.
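For the curious, here is a minimal sketch of what sequence-level distillation looks like in practice: harvesting a teacher model’s completions as synthetic training data for a student. The client call follows the OpenAI Python SDK, but the model name and prompts are placeholder assumptions; this is not a claim about what DeepSeek actually did.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Explain the idea of fair use in two sentences.",
    "What is a shadow library?",
]

with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder teacher model
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Each teacher completion becomes a supervised example for the
        # student model: no human-generated text is directly involved.
        f.write(json.dumps({"prompt": prompt, "completion": reply}) + "\n")
```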
Open source
Even if DeepSeek is quickly overtaken by other developers and ends up being mostly hype, there is likely to be one lasting effect: it is proving to be the best advertising for open source AI development so far. One could argue that the current crop of AI copyright lawsuits is temporary; my argument has always been that after a few years of strife things will quiet down and stability will ensue (get it, stability? Get it? Huh? Oh, why do I bother?). The real battle, the one that counts for the long term, is the conflict between closed development and open source development.
On the closed side we have models that are trained behind closed doors, with no transparency; the models themselves are never released to the public, only offered as closed products that can’t be run locally, so you have to interact with them via an app, a web interface, or an API for larger commercial uses. Open source models, by contrast, are released to the public under an open source licence and can be run locally by anyone with adequate resources.
DeepSeek R1 is publicly available on HuggingFace under an MIT Licence, which has to make it one of the biggest open source releases since LLaMA. Yann LeCun, Meta’s Chief AI Scientist, commented that “Open source models are surpassing proprietary ones.”
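To illustrate what that openness means in practice, here is a minimal sketch of running one of the distilled R1 checkpoints locally with transformers. The repository name is taken from DeepSeek’s published distilled variants and should be treated as an assumption; larger checkpoints work the same way given enough memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the smaller distilled R1 checkpoints published by DeepSeek.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "How many r's are in 'strawberry'? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# No app, web interface, or API gatekeeping: the weights sit on your disk.
```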
So whatever happens with DeepSeek, we’re seeing some victories for openness. The only wrinkle is that R1 arguably doesn’t comply with the newly released Open Source AI Definition: while it is released under an open licence, it doesn’t include “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system.”
Policy
Besides what may or may not happen with DeepSeek and copyright directly, I think it may have a more lasting effect on the policy side of the debate. Whether justified or not, AI is seen as a technology of immense national interest for the US, China, and the EU. One of the first acts of the new Trump administration was to advertise over US$500 billion in infrastructure investment for AI development in something they call the “Stargate Project”. Just a few days later, DeepSeek announced a model that is cheaper than its US competitors, and to say that this freaked out a lot of people is an understatement. I even saw the R1 launch described as a “Sputnik moment”.
The implications for copyright policy should be evident. If we’re about to witness an AI arms race, it is unlikely that the countries involved will allow their foot soldiers to be bogged down in the courts with copyright infringement lawsuits. I have mentioned this before, but we could see some sort of legislation deployed in the US sooner rather than later, particularly if it turns out that some countries with less than perfect copyright enforcement mechanisms are direct competitors.
We could also see DeepSeek being used by policymakers in other countries to ensure that AI development continues unabated. In the EU this could mean doubling down on the reservation of rights in the DSM Directive, with a more lenient Code of Conduct for general-purpose models. And in the UK it could give the government more reasons to push forward with establishing an opt-out exception regime after the current consultation is over. I wouldn’t be surprised to see ministers putting forward arguments along the lines of “a British DeepSeek is impossible under the current copyright system”, or words to that effect.
Concluding
I’m not sure DeepSeek warrants the incredible level of hype we have seen recently. I’ve used it, and at least to my untrained eye it didn’t perform any better or worse than o1 or Gemini Flash, though I admit I have not put them to any sort of comprehensive test; I’m just speaking as a user. But I think there are a couple of interesting copyright implications to the launch that may warrant further examination, and I do believe that the Chinese model will have a policy effect, at least in the short term.
I asked DeepSeek for a copyright-related joke to finish this blogpost, and it said:
“DeepSeek is so good at finding information, it even found the copyright symbol on my original thoughts!”
Humans, we’re safe!