this post was submitted on 05 Jul 2023
42 points (68.4% liked)
Fediverse
A community to talk about the Fediverse and all its related services using ActivityPub (Mastodon, Lemmy, KBin, etc).
If you want help with moderating your own community, head over to !moderators@lemmy.world!
Rules
- Posts must be on topic.
- Be respectful of others.
- Cite the sources used for graphs and other statistics.
- Follow the general Lemmy.world rules.
Learn more at these websites: Join The Fediverse Wiki, Fediverse.info, Wikipedia Page, The Federation Info (Stats), FediDB (Stats), Sub Rehab (Reddit Migration), Search Lemmy
founded 1 year ago
They can siphon your data no matter what you do. As I've said in other comments, everything on the internet has been crawled and scraped for literal decades. This post is already indexed by a bunch of different search engines and most likely by some other scrapers that harvest our data for AI or ad profiles. And you can do nothing about it without hurting your legitimate audience. Nothing at all. There's robots.txt as a mechanism to tell a crawler what it should or shouldn't index, but that's just asking nicely (it's mostly used to keep search engines from indexing pages that don't contain actual content). You could, in theory, block certain IP ranges or user agents, but those change faster than you can identify them. This dilemma is the whole reason why Twitter implemented rate limiting. They wanted to protect their stuff from scrapers. See where it got them.
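To make the "asking nicely" part concrete, here's a rough Python sketch using only the standard library. Everything specific in it is made up for illustration: the robots.txt contents, the `ExampleBot` name, the blocked IP range (a documentation-only range), and both helper functions. It's not how any particular site or crawler actually does it, just the general shape of the problem.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the bot name and paths are invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """The check a well-behaved crawler runs on itself before fetching.

    Nothing on the server enforces the answer; a scraper can simply
    skip this call.
    """
    return parser.can_fetch(user_agent, url)


# Hypothetical server-side blocklist (203.0.113.0/24 is reserved for docs).
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["examplebot"]


def server_blocks(ip: str, user_agent: str) -> bool:
    """Naive blocking by source IP range or User-Agent substring."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS)
```

The weak spot shows up right away: `server_blocks("198.51.100.7", "Mozilla/5.0")` returns `False`, so the same scraper coming from a fresh IP with a browser-looking user agent walks straight through, and the robots.txt check only ever runs if the crawler chooses to run it.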
Most important rule of the internet: if you don't want something archived forever, don't post it!