Training Data in Japan - AUGMXNT/shisa GitHub Wiki

Copyright

Currently, Japanese copyright law (PDF), re-affirmed as current policy in April 2023 by Keiko Nagaoka, the Japanese Minister of Education, Culture, Sports, Science, and Technology, permits all works to be used for the purposes of AI training.

In March 2024, the Japan Agency for Cultural Affairs (ACA) published its latest draft document on AI and Copyright (see also this summary). METI has its own documents/working group as well. See also the notes of the Japanese AI Strategy Council.

Here's some more analysis and color on this:

2023-07-11 Legal Issues in Generative AI under Japanese Law - three lawyers from the Japanese law firm Nishimura & Asahi give an overview
2024-02-24 The US should look at Japan’s unique approach to generative AI copyright law - a policy editorial that also does a good job of covering the state of AI training in Japan (as an argument for the US to adopt a similar policy)
2024-03-12 Japan’s New Draft Guidelines on AI and Copyright: Is It Really OK to Train AI Using Pirated Materials? - on the latest guidelines published by the ACA. "The committee essentially embraced Article 30-4 allowing the ingestion and analysis of copyrighted materials for AI learning to promote creative innovations in AI. It removes the need of acquiring consent from copyright holders, as long as it would not have a “material impact on the relevant markets” and that the AI usage does not “violate the interests of the copyright holders.”"
2024-05-01 Report on AI and Copyright Issues by Japanese Government - a full English summary of the latest ACA report
2024-05 General Understanding on AI and Copyright in Japan Overview (PDF) - a new English presentation published by the Legal Subcommittee under the Copyright Subdivision of the Cultural Council of the Agency for Cultural Affairs that summarizes the current thinking. It re-affirms Article 30-4, but expressly warns about collecting data from piracy distribution sites, and also covers infringement at the usage stage (which is understandably more stringent). It also touches on the copyrightability of AI-generated material, which largely falls within standard norms (AI-generated works are generally deemed non-creative and, to that extent, are not considered copyrighted works).

Terms of Service and Synthetic Data

On Japanese AI Twitter, I've noticed a lot of confusion and worry about using synthetic data generated by models due to Terms of Service violations (e.g., OpenAI's Terms of Service and the like). It's important to understand that a Terms of Service (TOS) is a contract that binds the two agreeing parties (see privity of contract or the Japanese term 契約上の関係 (Keiyaku-jō no Kankei)), and a third party cannot be bound by (or break) a TOS they haven't agreed to. Note that a Terms of Service, as its name implies, specifically regulates "access and use" of the service (not the generated output itself).

While, as a matter of course, everyone should respect the TOS that they agreed to with their service provider (or suffer potential liability/consequences), any data generated by a third party, whether synthetic or not, falls under the same copyright laws/policies in your jurisdiction as any other data and does not have any additional licensing or legal terms applied to it.

Notes:

There has been a recent trend of using synthetic data generated from completely open models (e.g., Mistral or CALM2-7B models). While this allows a developer to train their own models without TOS worries, from a practical standpoint, current open models are much weaker and generate poorer synthetic data, without necessarily providing any legal benefit.
As mentioned, due to the contractual nature of TOS, the idea of TOS transitivity or any downstream "data contamination" doesn't apply. Even if it did, using open models wouldn't help, as all AI models contain large amounts of TOS-constrained data (including OpenAI's models, of course).