this post was submitted on 22 Aug 2023

694 points (95.4% liked)


OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling's Harry Potter series (www.businessinsider.com)

submitted 1 year ago by L4s@lemmy.world to c/technology@lemmy.world

272 comments

A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.


[–] fubo@lemmy.world 80 points 1 year ago* (last edited 1 year ago) (22 children)

If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.

A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.

Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM's "brain" has not yet been adjudicated by any court anywhere.

If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it "hiding" is backwards.

[–] cantstopthesignal@sh.itjust.works 24 points 1 year ago* (last edited 1 year ago) (1 children)

You are a human, you are allowed to create derivative works under the law. Copyright law as it relates to machines regurgitating what humans have created is fundamentally different. Future legislation will have to address a lot of the nuance of this issue.


[–] Blapoo@lemmy.ml 75 points 1 year ago (8 children)

We have to distinguish between LLMs

Trained on copyrighted material and
Outputting copyrighted material

They are not one and the same

[–] Even_Adder@lemmy.dbzer0.com 20 points 1 year ago (2 children)

Yeah, this headline is trying to make it seem like training on copyrighted material is or should be wrong.

[–] scv@discuss.online 25 points 1 year ago

Legally the output of the training could be considered a derived work. We treat brains differently here, that's all.

I think the current intellectual property system makes no sense and AI is revealing that fact.


[–] Skanky@lemmy.world 67 points 1 year ago (2 children)

Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.


[–] rosenjcb@lemmy.world 38 points 1 year ago* (last edited 1 year ago)

The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It's historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.

Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne

[–] uzay@infosec.pub 35 points 1 year ago (12 children)

I hope OpenAI and JK Rowling take each other down


[–] Sentau@lemmy.one 35 points 1 year ago* (last edited 1 year ago) (6 children)

I think a lot of people are not getting it. AI/LLMs can train on whatever they want, but when these LLMs are used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour. Similar to how using copyrighted clips in a monetized video can get you a strike against your channel, but if the video is not monetized, the chances of YouTube taking action against you are lower.

Edit - If this was an open source model available for use by the general public at no cost, I would be far less bothered by claims of copyright infringement by the model

[–] Tyler_Zoro@ttrpg.network 25 points 1 year ago (15 children)

AI/LLMs can train on whatever they want, but when these LLMs are used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour.

And does this apply equally to all artists who have seen any of my work? Can I start charging all artists born after 1990, for training their neural networks on my work?

Learning is not and has never been considered a financial transaction.


[–] FMT99@lemmy.world 15 points 1 year ago (2 children)

But wouldn't this training and the subsequent output be so transformative that being based on the copyrighted work makes no difference? If I read a Harry Potter book and then write a story about a boy wizard who becomes a great hero, anyone trying to copyright strike that would be laughed at.


[–] 1ird@notyour.rodeo 11 points 1 year ago* (last edited 1 year ago) (2 children)

How is it any different from someone reading the books, being influenced by them and writing their own book with that inspiration? Should the author of the original book be paid for sales of the second book?


[–] AffineConnection@lemmy.world 9 points 1 year ago

using copyrighted clips in a monetized video can make you get a strike against your channel

Much of the time, the use of very brief clips is clearly fair use, but the people who issue DMCA claims don't care.

[–] ciwolsey@lemmy.world 8 points 1 year ago* (last edited 1 year ago)

You could run a paid training course using a paid-for book, that doesn't mean you're breaking copyright.


[–] paraphrand@lemmy.world 27 points 1 year ago (30 children)

Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?

[–] bamboo@lemm.ee 23 points 1 year ago (12 children)

Mostly because fuck corporations trying to milk their copyright. I have no particular love for OpenAI (though I do like their product), but I do have great disdain for already-successful corporations that would hold back the progress of humanity because they didn't get paid (again).


[–] uriel238@lemmy.blahaj.zone 22 points 1 year ago* (last edited 1 year ago) (3 children)

Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, no less!). And trying to challenge the veracity of generative AI systems on the notion that it was trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.

The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.

The problem is this is a shitty, unethical way to determine who gets to survive and who doesn't. All the current controversy about generative AI does is kick this can down the road a bit. But we're going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.

Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.


[–] ClamDrinker@lemmy.world 21 points 1 year ago* (last edited 1 year ago)

This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they've been doing for a while, hence why their AI models are known to be massively censored. I wouldn't call that 'hiding'. It's kind of hard to hide it was trained on copyrighted material, since that's common knowledge, really.

[–] RadialMonster@lemmy.world 21 points 1 year ago (5 children)

what if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc etc all over the place? They didn't ingest the material directly, or knowingly.


[–] GeneralEmergency@lemmy.world 20 points 1 year ago

So that explains the "problematic" responses.

[–] Thorny_Thicket@sopuli.xyz 19 points 1 year ago (2 children)

I don't get why this is an issue. Assuming they purchased a legal copy that it was trained on, then what's the problem? Like really. What does it matter that it knows a certain book from cover to cover or is able to imitate art styles etc.? That's exactly what people do too. We're just not quite as good at it.

[–] Hildegarde@lemmy.world 9 points 1 year ago (2 children)

A copyright holder has the right to control who has the right to create derivative works based on their copyright. If you want to take someone's copyright and use it to create something else, you need permission from the copyright holder.

The one major exception is Fair Use. It is unlikely that AI training is a fair use. However this point has not been adjudicated in a court as far as I am aware.

[–] FatCat@lemmy.world 19 points 1 year ago (3 children)

It is not a derivative work, it is a transformative work. Just like human artists "synthesise" art they see around them and make new art, so do LLMs.


[–] LordShrek@lemmy.world 9 points 1 year ago (3 children)

this is so fucking stupid though. almost everyone reads books and/or watches movies, and their speech is developed from that. the way we speak is modeled after characters and dialogue in books. the way we think is often from books. do we track down what percentage of each sentence comes from what book every time we think or talk?


[–] Default_Defect@midwest.social 19 points 1 year ago

They made it read Harry Potter? No wonder it's gonna kill us all one day.

[–] Technoguyfication@lemmy.ml 17 points 1 year ago (30 children)

People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been interpreted tens of thousands of times due to how many times it was reposted online (and therefore how many times it appeared in the training data).

Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.

[–] abbotsbury@lemmy.world 9 points 1 year ago

but on mega steroids with a nearly limitless capacity for information retention.

That sounds like redistributing copyrighted books


[–] Psythik@lemm.ee 17 points 1 year ago

Well of course it was. It has to learn somehow.

[–] afraid_of_zombies@lemmy.world 17 points 1 year ago (2 children)

I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books out of print.


[–] Tetsuo@jlai.lu 16 points 1 year ago* (last edited 1 year ago) (13 children)

If I'm not mistaken AI work was just recently considered as NOT copyrightable.

So I find it interesting that an AI learning from copyrighted work is an issue even though what will be generated will NOT be copyrightable.

So even if you generated some copy of Harry Potter you would not be able to copyright it. So in no way could you really compete with the original art.

I'm not saying that it makes it ok to train AIs but I think it's still an interesting aspect of this topic.

As others probably have stated, the AI may be creating content that is transformative and therefore under fair use. But even if that work is transformative it cannot be copyrighted because it wasn't created by a human.

[–] Even_Adder@lemmy.dbzer0.com 10 points 1 year ago* (last edited 1 year ago)

If you're talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can’t be an author or hold a copyright, but humans using one can still be copyright holders of any qualifying works.


[–] Jat620DH27@lemmy.world 10 points 1 year ago (1 children)

I thought everyone knew that OpenAI has the same access to books and knowledge that human beings have.

[–] Redditiscancer789@lemmy.world 13 points 1 year ago (10 children)

Yes, but it's what it is doing with it that is the murky grey area. Anyone can read a book, but you can't use those books for your own commercial stuff. Rowling and other writers are making the case that their works are being used in an inappropriate way commercially. Whether they have a case, I dunno, IANAL, but I could see the argument at least.
