this post was submitted on 10 Jul 2023
418 points (94.7% liked)

[–] jmcs@discuss.tchncs.de 25 points 1 year ago (2 children)

I guess they will get to analyze OpenAI's dataset during discovery. I bet OpenAI didn't have authorization to use even 1% of the content they used.

[–] maynarkh@feddit.nl 15 points 1 year ago

That's why they don't feel they can operate in the EU: the EU will require AI companies to publish which datasets they trained their solutions on.

[–] Jaded@lemmy.dbzer0.com 7 points 1 year ago (1 children)

Things might change, but right now you simply don't need anyone's authorization.

Hopefully it doesn't change, because only a handful of companies have the data or the funds to buy it. Requiring authorization would kill any kind of open-source or low-priced endeavour.

[–] Flaky@iusearchlinux.fyi 4 points 1 year ago

FWIW, Common Crawl - a free, openly available dataset of crawled internet pages - was used by OpenAI for GPT-3 (GPT-2 was trained on OpenAI's own WebText scrape instead) as well as for EleutherAI's GPT-NeoX. Maybe for GPT-3.5/ChatGPT as well, but they've been hush-hush about that.