Training Data in Japan - AUGMXNT/shisa GitHub Wiki

Copyright

Currently, Japanese copyright law (PDF), re-affirmed as current policy in April 2023 by Keiko Nagaoka, the Japanese Minister of Education, Culture, Sports, Science, and Technology, permits all works to be used for the purposes of AI training.

In March 2024, the Japan Agency for Cultural Affairs (ACA) published its latest draft document on AI and Copyright (see also this summary). METI has its own documents and working group as well. See also the notes of the Japanese AI Strategy Council.

Here's some more analysis and color on this:

Terms of Service and Synthetic Data

On Japanese AI Twitter, I've noticed a lot of confusion and worry about using synthetic data generated by models, due to possible Terms of Service violations (e.g., OpenAI's Terms of Service and the like). It's important to understand that a Terms of Service (TOS) is a contract that binds the two agreeing parties (see privity of contract or the Japanese term 契約上の関係 (Keiyaku-jō no Kankei)), and a third party cannot be bound by (or break) a TOS they haven't agreed to. Note that a Terms of Service, as its name implies, specifically regulates "access and use" of the service, not the generated output itself.

As a matter of course, everyone should respect the TOS they agreed to with their service provider (or face potential liability/consequences). However, any data generated by a third party, whether synthetic or not, falls under the same copyright laws/policies as any other data in your jurisdiction and does not have any additional licensing or legal terms applied to it.

Notes:

  • There has been a recent trend of using synthetic data generated from completely open models (e.g., Mistral or CALM2-7B models). While this allows a developer to train their own models without TOS worries, from a practical standpoint, open models are currently much weaker and generate poorer synthetic data, without necessarily providing any legal benefit.

  • As mentioned, due to the contractual nature of TOS, the idea of TOS transitivity or any downstream "data contamination" doesn't apply. Even if it did, using open models wouldn't help, as all AI models (including OpenAI's, of course) contain large amounts of TOS-constrained data.