this post was submitted on 31 Aug 2023
570 points (98.0% liked)

I'm rather curious to see how the EU's privacy laws are going to handle this.

(Original article is from Fortune, but Yahoo Finance doesn't have a paywall)

[–] SomethingBurger@jlai.lu 5 points 1 year ago (6 children)

Can't they remove the data from the training set and start over?

[–] Zarxrax@lemmy.world 26 points 1 year ago

They can, but the article is talking about removing data from a model that is already in production. Like if someone emails ChatGPT and says "hey, remove my data from this", good luck, because it might be a year before they can release a newly trained model with the data removed.

[–] Jerkface@lemmy.world 9 points 1 year ago (2 children)

Indeed they can, but training a model can take a month or more and cost many millions of dollars, so it's not trivial.

[–] guyrocket@kbin.social 3 points 1 year ago (4 children)

What makes up the cost? Buying CPU cycles and storage? Just curious.

[–] garyyo@lemmy.world 4 points 1 year ago

Outside of the costs of hardware, it's just power. Running these sorts of computations is getting more efficient, but the sheer amount of computation means that it's gonna take a lot of electricity to run.

[–] nuke@yah.lol 3 points 1 year ago (1 children)

The GPU cluster. The H100 GPUs are about $40,000 each and you need many.

[–] guyrocket@kbin.social 2 points 1 year ago

Interesting. Thanks.

[–] Jerkface@lemmy.world 2 points 1 year ago

GPU cycles probably, but yeah. That makes up the bulk of the cost. The price of data is assuredly increasing as well, but that's slightly beside the issue.

[–] theneverfox@pawb.social 1 points 1 year ago

All of it. At that scale, you're paying for data access, network communication, layers of storage... Basically every single step of computation becomes a meaningful cost

[–] ComfortablyGlum@sh.itjust.works 3 points 1 year ago (1 children)

So the REAL issue is how much it costs to remove the info vs how much value the info has? Such as the average Joe's social security number vs a movie star's social security number vs the president's social security number.

[–] Jerkface@lemmy.world 3 points 1 year ago

I might change 'value the info has' to 'liability it creates', but I think you're right about the cost/benefit situation. Since our laws have not kept up with technology, there are a lot of unanswered questions making it hard to analyze.

[–] knotthatone@lemmy.one 8 points 1 year ago (1 children)

Not really, no. None of the source material is actually stored inside the model's dataset, so once it's in, it's in. Because of the way they are designed, you can't point to a particular document and just delete that one thing. It's like unscrambling an egg.

[–] snooggums@kbin.social 9 points 1 year ago (1 children)

They can remove ALL the data and start over.

[–] teradome@lemmy.one 1 points 1 year ago (1 children)

exactly.

removing one thing from a pile != removing the entire pile.

b/c the original goal was to not disturb the rest of the pile

[–] snooggums@kbin.social 2 points 1 year ago

If they can't remove individual pieces then they need to remove the whole pile, and rebuild the process in a way that does allow them to remove individual pieces.

No, I don't care how much time and effort it costs. That is on them for abusing other people's data.

[–] mo_ztt@lemmy.world 4 points 1 year ago (1 children)

Yes, but that's not easy... I can't remember exactly, but I think I saw an estimate that the compute time to train just one of the GPT models cost around $66 million. IDK whether that's total cost from scratch, or incremental cost to arrive at that model starting from an earlier model that was already built, but I do know that GPT is still to this day using that September 2021 cutoff which to me kind of implies that they're building progressively on top of already-assembled models and datasets (which makes sense, because to start from scratch without needing to would be insane).

You could, technically, start from scratch and spend 2 more years and however many million dollars retraining a new model that doesn't have the private data you're trying to excise, but I think the point the article is making is that that's a pretty difficult approach and it seems right now like that's the only way.

[–] skulblaka@kbin.social 5 points 1 year ago

Un-robbing a bank also isn't easy, but that doesn't mean I'm able to just say "it too hard :c" and then walk off into the sunset with my looted gains.

[–] Hildegarde@lemmy.world 2 points 1 year ago (1 children)

Yes. They can also reload a backup from before the data in question was added to the training data and retrain from that point. This is also what will need to be done if AI companies lose their copyright lawsuits.

None of this is impossible. It's just expensive. And these are expenses that AI companies could have avoided if they picked their datasets more carefully.
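
To make the "reload a backup and retrain from that point" idea concrete, here's a toy sketch. Everything in it is made up for illustration: `update` is a hypothetical stand-in for a real training step, the "weights" are a single float, and documents are consumed in a fixed order. Real training pipelines shuffle data and repeat epochs, so this only shows the bookkeeping, not how any actual lab does it.

```python
def update(weights, doc):
    # Hypothetical stand-in for one training step; order-dependent on purpose.
    return 0.9 * weights + len(doc)


def train(docs, weights=0.0, start_step=0, checkpoint_every=2):
    """Consume docs in order; return final weights and (step, weights) checkpoints.

    A checkpoint at step s holds the weights after the first s documents.
    """
    checkpoints = [(start_step, weights)]
    for step, doc in enumerate(docs, start=start_step + 1):
        weights = update(weights, doc)
        if step % checkpoint_every == 0:
            checkpoints.append((step, weights))
    return weights, checkpoints


def retrain_without(docs, checkpoints, bad_index):
    """Roll back to the latest checkpoint taken before docs[bad_index] was
    seen, then replay the later documents, skipping the bad one."""
    step, weights = max(
        (c for c in checkpoints if c[0] <= bad_index), key=lambda c: c[0]
    )
    remaining = [d for j, d in enumerate(docs) if j >= step and j != bad_index]
    final, _ = train(remaining, weights, start_step=step)
    return final, step


docs = ["aa", "bbb", "c", "dddd", "ee", "f"]
_, checkpoints = train(docs)
final, resumed_from = retrain_without(docs, checkpoints, bad_index=4)
print(f"resumed from step {resumed_from}, final weights {final}")
```

The catch is visible even in the toy: the earlier the offending document was seen, the further back you have to roll, and everything after that point gets recomputed.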

[–] assassin_aragorn@lemmy.world 2 points 1 year ago (1 children)

It's crazy that they aren't taking at least daily captures of the model or having it record what information it processes.

[–] Hildegarde@lemmy.world 3 points 1 year ago

I would be shocked if they don't. It's pretty critical for any software development, AI or not, to retain the ability to roll back changes in the case any change breaks something.

[–] Zeth0s@lemmy.world 2 points 1 year ago* (last edited 1 year ago)

Information leaking is a thing. Some information is spread across multiple sources without actually being in any of those. If you remove something, the model can still infer the information.

If Macron asks for his name to be deleted, you can retrieve his political opinion by simply knowing the history of interactions of other people with the French government. I just need to tell the model that the person it has no direct information about is named Macron, and it can profile him.

Same with a search engine. The only difference is that there the inference of missing information is done by human brains. The model can substitute for them.