this post was submitted on 07 Oct 2024

566 points (98.8% liked)


TikTok’s parent launched a web scraper that’s gobbling up the world’s online data 25-times faster than OpenAI (fortune.com)

submitted 1 month ago by LuuTuyen@lemmy.world to c/technology@lemmy.world


[–] dinckelman@lemmy.world 290 points 1 month ago (17 children)

It's illegal when a regular person steals something, but it's innovation and courage when a huge corporation steals something. Interesting how that works.

[–] b3an@lemmy.world 106 points 1 month ago (11 children)

Honestly it’s fucking angering. So much regulation and geo-restrictions and licensing schemes… but it’s cool that there are data brokers, and shit like this. On top of it all, Chrome is screwing us with Manifest V3 and killing ad blocking on Chrome. It’s already in the Canary build.

WHAT THE FUCK IS WRONG WITH THIS SPECIES?!

[–] MelodiousFunk 26 points 1 month ago

WHAT THE FUCK IS WRONG WITH THIS SPECIES?!

Yes.

[–] ZombieMantis@lemmy.world 23 points 1 month ago

WHAT THE FUCK IS WRONG WITH THIS SPECIES?!

Capitalism.


[–] Chozo@fedia.io 53 points 1 month ago (1 children)

They're not stealing your data, they're pirating it.

[–] GuyDudeman@lemmy.world 33 points 1 month ago (1 children)

They’re not pirating it. They’re collecting it.

[–] ieatpwns@lemmy.world 23 points 1 month ago (1 children)

They’re not collecting it. They’re archiving it.

[–] GuyDudeman@lemmy.world 14 points 1 month ago (1 children)

Oh, like the Wayback Machine?

[–] Evotech@lemmy.world 34 points 1 month ago (2 children)

No, that's stealing /s


[–] alphapuggle@programming.dev 36 points 1 month ago (3 children)

Aaron Swartz killed himself over punishments for less.

[–] _stranger_@lemmy.world 21 points 1 month ago* (last edited 1 month ago)

Worse punishments. For far less.

[–] brbposting@sh.itjust.works 11 points 1 month ago

RIP Aaron


[–] zod000@lemmy.ml 104 points 1 month ago (1 children)

We've had this thing hammering our servers. The scraper uses randomized user-agent strings (browser/OS combinations) and comes from a number of distinct IP ranges in different datacenters around the world, but all the IPs track back to ByteDance.
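Since the user-agent is randomized, matching on source IP range is about the only stable signal left. A rough sketch of that idea, assuming an access log where the client IP is the first whitespace-separated field. The CIDR blocks here are placeholder documentation ranges, not real ByteDance ranges, and the function names are made up for illustration:

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real ByteDance ranges.
# Swap in whatever ranges your own WHOIS/ASN lookups turn up.
SUSPECT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_suspect(ip: str) -> bool:
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SUSPECT_NETWORKS)


def count_suspect_hits(log_lines):
    """Count requests per suspect IP.

    Assumes the client IP is the first whitespace-separated field of each line.
    """
    counts = {}
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        ip = fields[0]
        try:
            if is_suspect(ip):
                counts[ip] = counts.get(ip, 0) + 1
        except ValueError:
            continue  # first field was not a valid IP address
    return counts
```

Feeding the result into a firewall or rate limiter is a separate step; this only shows the matching, which ignores the user-agent entirely.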

[–] UnderpantsWeevil@lemmy.world 38 points 1 month ago (1 children)

Wouldn't be surprised if they're just cashing out while TikTok is still public in the US. One last desperate grab at value-add for the parent company before the shutdown.

Also a great way to burn the infrastructure for subsequent use. After this, you can guarantee every data security company is going to add the TikTok servers to their firewalls and blacklists. So the American company that tries to harvest the property is going to be tripping over these legacy bulwarks for years after.

[–] Maggoty@lemmy.world 13 points 1 month ago

This has nothing to do with TikTok other than ByteDance being a shareholder in TikTok.

[–] BlackEco@lemmy.blackeco.com 82 points 1 month ago (3 children)

Also, it doesn't respect robots.txt (the file that tells bots whether or not a given page can be accessed), unlike most AI scraping bots.

[–] kboy101222@sh.itjust.works 52 points 1 month ago (7 children)

My personal website that primarily functions as a front end to my home server has been getting BEAT by these stupid web scrapers. Every couple of days the server is unusable because some web scraper demanded every single possible page and crashed the damn thing.

[–] assaultpotato@sh.itjust.works 16 points 1 month ago (1 children)

I do the same thing, and I've noticed my modem has been absolutely bricked probably 3-4 times this month. I wonder if this is why.


[–] glimse@lemmy.world 21 points 1 month ago (1 children)

most

I doubt that...

[–] DarkThoughts@fedia.io 14 points 1 month ago

People would be able to tell from the traffic on their websites.


[–] Dindonmasker@sh.itjust.works 57 points 1 month ago (8 children)

Not surprising that ByteDance would want to gobble up every bit of data they can as fast as possible.


[–] Breve@pawb.social 35 points 1 month ago (5 children)

They're too late. There's going to be way too much AI-generated garbage in their data, and so many social media platforms like Reddit and Twitter have already taken measures to curb scrapers.

[–] chickenf622@sh.itjust.works 18 points 1 month ago

Like those platforms aren't already full of AI garbage as well. Training new models will require a cut-off date before the genie was let out of the bottle.


[–] GnuLinuxDude@lemmy.ml 32 points 1 month ago* (last edited 1 month ago) (2 children)

As for what ByteDance plans to do with a new LLM, a person familiar with the company’s ambitions said one goal has to do with the search function for TikTok.

Last week, TikTok released an update to its current search function focused on [keywords for ads], basically allowing advertisers to search in real time for words that are trending on TikTok. It allows marketers to build an ad with relevant keywords that would ostensibly help the ad show up on the screens of more users.

…

“Given the audience and the amount of use, TikTok with a search environment that is a completely biddable space with keywords and topics, that would be very interesting to a lot of people spending a ton of money with Google right now,” the person said.

A dark vision just flashed in my mind. And I am certain this is what will happen. AI-generated ads done in real time based on the latest “trending” thing. Presented to users basically as soon as the topic has the slightest amount of “trend”.

Just emitting untold amounts of CO2 to show you generated ads in near real time.

[–] WhatYouNeed@lemmy.world 14 points 1 month ago

No wonder Google's ex-CEO was saying fuck climate goals.


[–] Soup@lemmy.cafe 31 points 1 month ago (3 children)

There it begins. Nothing good will ever come from this.


[–] Roflmasterbigpimp@lemmy.world 26 points 1 month ago (2 children)

I can't contribute anything here, I just came to say I really, really like the phrase "gobbling something up" :D

[–] GreenKnight23@lemmy.world 18 points 1 month ago (1 children)

better than "slammed"

[Image: Screenshot_20241008-091015_Firefox]


[–] affiliate@lemmy.world 21 points 1 month ago (2 children)

from the article:

Robots.txt is a line of code that publishers can put into a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website’s data.

i do understand that robots.txt is a very minor part of the article, but i think that’s a pretty rough explanation of robots.txt

[–] Corkyskog@sh.itjust.works 11 points 1 month ago (4 children)

Out of curiosity, how would you word it?

[–] affiliate@lemmy.world 19 points 1 month ago

i would probably word it as something like:

Robots.txt is a document that specifies which parts of a website bots are and are not allowed to visit. While it’s not a legally binding document, it has long been common practice for bots to obey the rules listed in robots.txt.

in that description, i’m trying to keep the accessible tone that they were going for in the article (so i wrote “document” instead of file format/IETF standard), while still trying to focus on the following points:

robots.txt is fundamentally a list of rules, not a single line of code
robots.txt can allow bots to access certain parts of a website, it doesn’t have to ban bots entirely
it’s not legally binding, but it is still customary for bots to follow it

i did also neglect to mention that robots.txt allows you to specify different rules for different bots, but that didn’t seem particularly relevant here.
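To make the "list of rules, per bot" point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the bot names (`ExampleAIBot`, `SomeOtherBot`) are invented for illustration and do not come from the article or from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is banned entirely, everyone else
# is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot: str, path: str) -> bool:
    """Return True if the parsed rules permit `bot` to fetch `path`."""
    return parser.can_fetch(bot, path)


print(allowed("ExampleAIBot", "/blog/post"))     # banned everywhere
print(allowed("SomeOtherBot", "/blog/post"))     # falls under the * rules
print(allowed("SomeOtherBot", "/private/notes")) # blocked by /private/
```

Note that this only reports what the file asks for. Nothing here stops a crawler that never checks, which is the whole complaint about this scraper.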


[–] Gammelfisch@lemmy.world 17 points 1 month ago

Another fucking CCP and PLA creation.
