this post was submitted on 26 Feb 2024
166 points (93.7% liked)

In case you didn't know, you can't train an AI on content generated by another AI without causing distortion that reduces the quality of the output. This phenomenon is usually called model collapse. It is also very difficult to filter AI text out of human text in a database.

So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn't be able to sell it for that purpose.
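As a rough illustration of the distortion described above, here is a toy sketch, not a simulation of any real language model. It assumes the "model" is just a normal distribution fitted to a small sample, and that each generation is trained only on the previous generation's output. The function name `simulate_collapse` and all of its parameters are made up for this example.

```python
import random
import statistics


def simulate_collapse(generations=100, sample_size=10, seed=0):
    """Toy stand-in for training each model on the previous model's output.

    Returns the fitted spread (standard deviation) after each generation,
    starting with the spread of the original "human" data.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the original data distribution
    spreads = [sigma]
    for _ in range(generations):
        # The current model generates a small synthetic dataset...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        spreads.append(sigma)
    return spreads


spreads = simulate_collapse()
print(f"spread at generation 0: {spreads[0]:.3f}")
print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.3g}")
```

In this toy setup the fitted spread tends to shrink from one generation to the next, so the later "models" produce far less varied output than the original data did. Real training pipelines mix in fresh human data and are much more complicated, so treat this as intuition only.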

[–] lvxferre@mander.xyz 5 points 10 months ago (1 children)

The PS archives are publicly available. If either OpenAI or Google were to use them, they wouldn't pay Reddit Inc. a single penny; and yet Google is paying it 60 million dollars to do so. This means that there's content that they cannot retrieve through the PS archives that would still be valuable as LLM data.

[–] FaceDeer@kbin.social 1 points 10 months ago (1 children)

They're paying Reddit to not sue them.

Regardless, the content that's available through PS is the content that people are talking about overwriting or deleting. They can't edit or delete stuff that PushShift couldn't see in the first place.

[–] lvxferre@mander.xyz 1 points 10 months ago (1 children)

> They’re paying Reddit to not sue them.

Given how many defences Google would have against that ant called Reddit suing it, ranging from actual fair points to "ackshyually", I find it unlikely.

> Regardless, the content that’s available through PS is the content that people are talking about overwriting or deleting. They can’t edit or delete stuff that PushShift couldn’t see in the first place.

Emphasis mine. Can you back up this claim?

I'm asking this because the content from PS only goes up to March 2023; it's literally a year old. There was a lot of activity on Reddit in the meantime, and my impression is that the people talking about this are the ones who already erased their content in the APIcalypse but kept using Reddit because there's some subject "stuck" there that they'd like to use.

[–] FaceDeer@kbin.social 1 points 10 months ago

Academic Torrents has Reddit data up to December 2023. This data isn't live-updated; my understanding is that it's scraped when it's first posted. That's how services like removeddit worked: they would show the "original" version of a post or comment from when it was scraped rather than the edited or deleted version that Reddit shows now.

The age isn't really the most important thing when it comes to training a base AI model. If you want to teach it about current events there are better ways to do that than social media scrapes. Stuff like Reddit is good for teaching an AI about how people talk to each other.
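To make the scrape-on-first-post point concrete, here is a minimal sketch of an archive that keeps only the first version of each comment it sees. It is an assumption about the general behaviour, not the actual Pushshift or removeddit code; `FirstSeenArchive`, `scrape`, `get` and the comment IDs are all hypothetical names.

```python
class FirstSeenArchive:
    """Hypothetical archive that stores only the first version it scrapes."""

    def __init__(self):
        self._snapshots = {}

    def scrape(self, comment_id, body):
        # setdefault stores the body only if this id has not been seen before,
        # so later edits on the live site never replace the snapshot.
        return self._snapshots.setdefault(comment_id, body)

    def get(self, comment_id):
        return self._snapshots.get(comment_id)


live_site = {"c1": "original comment text"}
archive = FirstSeenArchive()
archive.scrape("c1", live_site["c1"])      # scraped when first posted

live_site["c1"] = "[overwritten]"          # user edits the comment afterwards
archive.scrape("c1", live_site["c1"])      # a later scrape changes nothing

live_site["c2"] = "[generated filler]"     # posted as filler from the start
archive.scrape("c2", live_site["c2"])

print(archive.get("c1"))
print(archive.get("c2"))
```

Under these assumptions, overwriting an old comment leaves the archived copy untouched, while anything that was filler from the moment it was posted is archived as filler.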