[-] stembolts@programming.dev 136 points 1 week ago* (last edited 1 week ago)

This is similar to when I heard reddit was doing the API lockdown, I wrote an automation bot over the weekend that self-destructed my subreddit and the entire post history. The bot also automatically downloaded and archived all of the content on my local machine.

It was annoying because at first I couldn't get access to older posts, since at the time reddit had changed their API to only show the first X posts (100 or 1,000 or whatever). So I told my bot to delete each post as it archived it; as content was deleted, reddit had no choice but to populate the page with the older posts.

And that's how I archived my subreddit. Reddit banned me two days later for automation, lol. I did not break any of the reddit or reddit api ToS during this process but I guess I upset someone.
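
A minimal sketch of the archive-then-delete loop described above, run against an in-memory stand-in rather than the real reddit API. `FakeListing`, `PAGE_CAP`, and `archive_and_delete` are hypothetical names invented for illustration; the actual bot's code is not shown in this thread.

```python
from typing import Callable

PAGE_CAP = 3  # hypothetical cap on how many posts a listing returns


class FakeListing:
    """In-memory stand-in for a listing that only shows the newest PAGE_CAP posts."""

    def __init__(self, posts: list[str]) -> None:
        self._posts = list(posts)  # newest first

    def visible(self) -> list[str]:
        # Older posts stay hidden until newer ones are removed.
        return self._posts[:PAGE_CAP]

    def delete(self, post: str) -> None:
        self._posts.remove(post)


def archive_and_delete(listing: FakeListing, save: Callable[[str], None]) -> int:
    """Archive each visible post, then delete it so older posts surface."""
    archived = 0
    while True:
        page = listing.visible()
        if not page:
            break  # nothing left to surface
        for post in page:
            save(post)            # archive first, so nothing is lost
            listing.delete(post)  # then delete to make room for older posts
            archived += 1
    return archived


archive: list[str] = []
listing = FakeListing([f"post-{i}" for i in range(10, 0, -1)])
count = archive_and_delete(listing, archive.append)
print(count, len(listing.visible()))
```

The point of the ordering is that each post is saved before it is deleted, so a crash mid-run loses nothing that has not already been archived.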

[-] ubergeek77@lemmy.ubergeek77.chat 26 points 1 week ago

I don't think I've been banned, but I did a similar thing. I requested all my data from Reddit, then used that list of comment/post IDs to mass-edit them. I think I'm in the clear because I used the official third party API, with an official "app." If you used the private API or instrumented this via the browser, that may be why you were banned.

Anyway, if you or someone else wants their full history, Reddit will give it to you via a data export request.

[-] GBU_28@lemm.ee 18 points 1 week ago

Unfortunately they still have everything. It's good for the "human" visibility (or lack of it), but they still have the data.

[-] stembolts@programming.dev 13 points 1 week ago* (last edited 1 week ago)

Oh I know, I just wanted a copy too.

Deleting posts from the user PoV was the only way I could come up with to force the API to show them to me.

[-] henfredemars@infosec.pub 82 points 1 week ago

I feel like this content craze is going to evaporate soon because all the new content from here forward is sure to be polluted by LLM output already. AI is fast becoming a snake eating its own tail.

That reminds me. I should go update my licenses to spit in the face of AI training companies.

[-] CaptObvious@literature.cafe 74 points 1 week ago

Stack Overflow just earned a place under Reddit in the hosts block list.

[-] pe1uca@lemmy.pe1uca.dev 59 points 1 week ago

It's just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you

As if it didn't happen already

[-] darkphotonstudio@beehaw.org 44 points 1 week ago

I think people would have fewer issues with AI training if it were non-profit and for the common good. And there are open source AI projects, many in fact. But yeah, these deals by companies like this are sleazy.

[-] NeatNit@discuss.tchncs.de 21 points 1 week ago

OpenAI was literally that until it wasn't

[-] darkphotonstudio@beehaw.org 8 points 1 week ago

I don't think OpenAI actually released any FOSS code, did they?

Up until GPT3 they were quite open. When GPTs became good, they started claiming sharing the models would be risky and that there were ethical problems and that they would safekeep the technology. I believe they were even sued by one of their investors for sticking to their open mission at some point.

The source code they would provide would be pretty useless to most people anyway, unless you have a couple million lying around to spend on GPUs.

Plenty of AI companies do what OpenAI did, without ever sharing any models or writing any papers. We only hear about the open stuff. We see tons of open source AI stuff on Github that's all mostly based on research by either Google or OpenAI. All the Llama stuff exists only because Facebook shared their model (accidentally). All of this stuff is mostly open, even if it's not FOSS.

Compare that to what companies are doing internally. You bet data brokers and other shady shits are sucking up as much data as they can get their hands on to train their own, specialised AI, free from the burdens of "as an LLM I can't do that".

Data Rule Numero Uno:

Garbage in, garbage out.

Have fun training your LLM on a big steaming pile of hot garbage. That's 80% of Stack Overflow's content.

[-] harrys_balzac@lemmy.dbzer0.com 16 points 1 week ago

Mostly "this has been answered in another thread" and "why don't you Google it" comments in my experience.

[-] DarkDarkHouse@lemmy.sdf.org 9 points 1 week ago

Can’t wait until the top answer to every Google search is “just google it”

[-] LostXOR@fedia.io 7 points 1 week ago* (last edited 1 week ago)

The other 20% is mostly high quality however, and I'm sure they'd filter out the heavily downvoted crud.

[-] mnemonicmonkeys@sh.itjust.works 6 points 1 week ago

You say that as if the garbage gets downvoted

[-] mnemonicmonkeys@sh.itjust.works 6 points 1 week ago

One time I went on there to figure out an issue in Arduino. The answer one guy gave was "I don't know how to do this in Arduino, here's how you do this in Java". Not only did the mods prevent any other answers from being posted, I tried the guy's suggestion in Java and it didn't even work.

[-] hagar@lemmy.ml 41 points 1 week ago

StackOverflow: *grabs money on monetizing massive amounts of user-contributed content without consulting or compensating the users in any way*

Users: *try to delete it all to prevent it*

StackOverflow: *your contributions belong to the community, you can't do that*

Pretty fucked-up laws. A lot of lawsuits going on right now against AI companies for similar issues. In this case, StackOverflow is entitled to be compensated for its partnership, and because the answers are all CC BY-SA 3.0, no one can complain. Now, that SA? Whatever.

[-] 9point6@lemmy.world 15 points 1 week ago

That SA part needs to be tested in court against the AI models themselves

A lot of this shittiness would probably go away if there was a risk that ingesting certain content would mean you need to release the actual model to the public.

AI companies are hoping for a ruling that says content generated from a model trained on content is not a derivative work. So far, the Sarah Silverman lawsuit seems to be going that way, at least; the claimants were set back because they've been asked to prove the connection between AI output and their specific inputs.

If this does become jurisprudence or law in one or more countries, licenses don't mean jack. You can put the AGPL on your stuff and AI could suck it up into their model and use it for whatever they want, and you couldn't do anything about it.

The AI training sets for all common models contain copyrighted works like entire books, movies, and websites. Don't forget that most websites don't even have a license, and that unlicensed work, including internet comments, is as illegal to replicate as any book or movie normally would be. If AI data sets need to comply with copyright, all current AI will need to be retrained (except maybe for that image AI by that stock photo company, which is exclusively trained on licensed work).

[-] davel@lemmy.ml 37 points 1 week ago

Good luck with the deleting. It often just means UPDATE comments SET is_deleted = 1 WHERE ID = 666;.
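
To make the soft-delete point concrete, here is a small sketch using Python's built-in `sqlite3` with an in-memory database. The `comments` table and `is_deleted` column mirror the query above; the rest of the schema and the sample row are assumptions for illustration, not how any particular site stores its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO comments (id, body) VALUES (?, ?)", (666, "my answer"))

# What the user experiences as "delete": only a flag is flipped.
conn.execute("UPDATE comments SET is_deleted = 1 WHERE id = ?", (666,))

# The public view filters on the flag, so the comment disappears from the site...
visible = conn.execute("SELECT body FROM comments WHERE is_deleted = 0").fetchall()

# ...but the row, including its text, is still in the table.
stored = conn.execute("SELECT body FROM comments WHERE id = ?", (666,)).fetchall()

print(visible, stored)
```

Under this kind of schema, anything that reads the table directly (an export, a backup, a licensing deal) still sees the "deleted" text.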

[-] chiisana@lemmy.chiisana.net 13 points 1 week ago

There were similar things done on Reddit during the big exit. I doubt it achieved what people expected it to achieve. Even if they’re not visible externally, I’m sure they can easily access (thereby make deals to license) the data out of their backend / backup; just a matter of how hard they want to try (hint: it’s really not very hard).

[-] duncesplayed@lemmy.one 14 points 1 week ago

Yeah during the reddit exodus, people were recommending to overwrite your comment with garbage before deleting it. This (probably) forces them to restore your comment from backup. But realistically they were always going to harvest the comments stored in backup anyway, so I don't think it caused them any more work.

If anything, this probably just makes reddit's/SO's partnership more valuable because your comments are now exclusive to reddit's/SO's backend, and other companies can't scrape it.

[-] Lemongrab@lemmy.one 9 points 1 week ago

It was to make the data inaccessible to general people, therefore removing the reason people visit reddit. Even if reddit could still get the data, regular people would be inconvenienced (in theory) and look somewhere else.

[-] beyond@linkage.ds8.zone 27 points 1 week ago* (last edited 1 week ago)

There is, I believe, a fundamental misunderstanding as to what exactly a site like Stack Overflow is. It's not a forum; there's no such thing as "your posts." It's more like Wikipedia, as in a collaborative question-and-answer site, or a knowledgebase. Each question and answer can be edited like a mini wiki page. They aren't "yours" any more than the Wikipedia page you created ten years ago is; you contributed it to the commons, so (at least in theory) you don't have the right to take it back.

Whether whatever "Open"AI is doing is right is another question, of course. But I don't think destroying or poisoning the commons to strike back at it is helpful either; it feels like "destroying it to save it."

[-] tetris11@lemmy.ml 17 points 1 week ago

Fine, but when coding projects undergo licensing changes that the contributors are against, the code author has to remove those contributions and replace them.

[-] helenslunch@feddit.nl 25 points 1 week ago* (last edited 1 week ago)

Would be a shame if someone used ChatGPT to generate bad answers and a short script to resubmit them back to Stackoverflow. So awful.

Why now? Other people have been profiting off of your Stack Overflow answers for years. This is nothing new.

[-] wuphysics87@lemmy.ml 18 points 1 week ago

Those answers were given in good faith under the presumption that they would be read and used by another person. Not used to train something to remove the interactions which motivated the answer in the first place.

[-] haui_lemmy@lemmy.giftedmc.com 18 points 1 week ago

Simple answer: people vs corporations. A dev or homelabber getting help from you is very different from a company making billions just by mass shoveling your knowledge to the highest bidder.

The reason we need this as a fediverse service is that everyone can take in this knowledge and one corp doesn't have the ability to sell it. That's what the worth comes from: someone holding the key to it.

That's not what I mean. When you contribute content to Stack Exchange, it is licensed CC BY-SA. There are websites that scrape this content and rehost it, or at least there used to be. I've had a problem before where all the search results were unanswered Stack Overflow posts or copies of those posts on different sites. Maybe similar to Reddit they restricted access to the data so they could sell it to AI companies.

[-] mbirth@lemmy.mbirth.uk 13 points 1 week ago

Currently, all answers are properly attributed. But once OpenAI has trained and sells a “hackerman” persona, do you really think it will answer people’s questions with ”This answer was contributed by i_am_not_a_robot” or will it just sell this as its own answer?

[-] Taleya@aussie.zone 8 points 1 week ago

As a tech, I'm fucking howling because 99% of answers to any given question are already bullshit that ranges from useless to dangerous.

"The machine" can't tell the difference and it's going to be considered authoritative in its blithe stupidity. Hoover up SA all you want, you're just gonna aggregate it with bullshit and poison your own well anyway.

[-] delirious_owl@discuss.online 24 points 1 week ago

Like AI doesn't know how to use the Wayback Machine?

[-] scottmeme@sh.itjust.works 23 points 1 week ago

Based users

[-] baseless_discourse@mander.xyz 17 points 1 week ago* (last edited 1 week ago)

This is a violation of GDPR, no?

EDIT: user-created content is not directly protected under GDPR; only personally identifiable data is protected under GDPR.

[-] lemmyreader@lemmy.ml 16 points 1 week ago

Dunno. GDPR is a Europe-only thing, and isn't it only related to how your private data (like name, IP address, phone number) is handled?

[-] AccountMaker 7 points 1 week ago

Right, I think it only covers personal information: companies can only collect what they need to run their service, users can request to see their data etc. I don't think it applies to comments and posts.

[-] fluxc0@lemmy.world 16 points 1 week ago

This feels a little iffy to me. It rings of what happened with reddit.

[-] delirious_owl@discuss.online 15 points 1 week ago

This isn't really comparable to reddit, since users can just send a request to SO for all the content. Reddit locking down the API meant we lost access to our content.

[-] mhzawadi@lemmy.horwood.cloud 12 points 1 week ago

Why delete the answer? Why not edit it so that a human can see the answer but for AI it's a load of nonsense?

[-] chicken@lemmy.dbzer0.com 15 points 1 week ago

There's no way that would work either, they can just store the full edit history and auto-curate as needed.
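
A rough sketch of why editing alone does not help if the site keeps revisions. The `Post` class below is a hypothetical append-only model made up for illustration; it is not Stack Overflow's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """Hypothetical post that appends every edit instead of overwriting."""

    revisions: list[str] = field(default_factory=list)

    def edit(self, body: str) -> None:
        # An edit adds a new revision; earlier ones are never removed.
        self.revisions.append(body)

    @property
    def current(self) -> str:
        return self.revisions[-1]

    @property
    def original(self) -> str:
        return self.revisions[0]


post = Post()
post.edit("Use a context manager to close the file.")
post.edit("asdf qwerty lorem ipsum")  # the author overwrites the answer with noise

print(post.current)
print(post.original)
```

With a store like this, whoever holds the backend can pick whichever revision they consider the good one, regardless of what the page currently shows.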

[-] gjoel@programming.dev 8 points 1 week ago

People did that. Stack overflow reverted the change.

If that happened, I assume companies would just grab an older copy of the dumps from before people started editing their stuff because of the AI bullshit.

SA would ban everyone sabotaging their business plans and things would move on like normal, like what happened to Reddit.

[-] HexesofVexes@lemmy.world 8 points 1 week ago

I mean, here is a thought: if an AI tool uses Creative Commons data, then its derivatives fall under Creative Commons. I.e. stop charging for AI tools and people will stop complaining.

So what is the stack overflow replacement?

this post was submitted on 08 May 2024
585 points (99.0% liked)

Technology

This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.

