this post was submitted on 12 Jan 2024
69 points (98.6% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ


I’ve been looking online for ways to download websites (game wikis, mostly), in order to have them all in my collection and make sure they don’t get taken down or changed.

I tried Linkwarden, which is fine for single web pages, but you have to manually add each individual page of a wiki in order to turn it into a PDF.

With this in mind, the only other option I’ve found is running wget recursively. Do any of you have experience with this, or can you recommend alternatives? Any and all help is appreciated.

PS: I will most likely download the official game guides, which will cover most of the games, but I’m looking for something that covers my whole games library.

all 14 comments
[–] otter@lemmy.ca 20 points 9 months ago (1 children)

I've used HTTrack in the past

It worked well for the websites that I tried it on and it's pretty easy to run. You could give it a go?

https://en.wikipedia.org/wiki/HTTrack
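For reference, the command-line version is about as simple as it gets; something along these lines should do it (the URL and output folder here are just placeholders):

  httrack "https://example-wiki.com/" -O "./example-wiki-mirror"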

[–] ShellMonkey@lemmy.socdojo.com 5 points 9 months ago (1 children)

I've got a container set up for this. It drops the output on the NAS, where it can be accessed from any box on the local net. The only issue is that the container tends to need recycling every so often, like it just gets bored sitting there and quits. 🤔

[–] TropicalDingdong@lemmy.world 14 points 9 months ago

You wouldn't download a website, would you?

If you were going to do something like this, you might also consider keeping it under some kind of version control. You'd probably want it to update regularly (since a wiki changes), and if they ever try to paywall it, you've still got a copy.

Should be something you could knock out in Python in an afternoon.
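Just to sketch the idea (the URL and directory are made up, and a shell wrapper around wget would work just as well as Python):

  #!/bin/sh
  # re-mirror the wiki, then commit whatever changed into git
  wget --mirror --convert-links --adjust-extension --no-parent \
       --directory-prefix=wiki-mirror https://example-wiki.com/
  cd wiki-mirror || exit 1
  git add -A
  git commit -m "wiki snapshot $(date +%F)"

Run it from cron and you get a browsable history of every change.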

[–] redcalcium@lemmy.institute 11 points 9 months ago* (last edited 9 months ago)

The SingleFile extension can save a web page as a single HTML file with all media neatly inlined. You'll have to do this manually on each page though, so it's not ideal for saving a whole website.

If you're comfortable running commands in a terminal, you can use SingleFile CLI to crawl the whole website, e.g.:

  single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"

[–] hperrin@lemmy.world 11 points 9 months ago* (last edited 9 months ago) (2 children)

I’ve used wget to mirror websites. It works very well.
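For a wiki, something along these lines should work (the URL is a placeholder, and --wait is just optional politeness toward the server):

  wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --wait=1 https://example-wiki.com/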

[–] efreak@lemmy.dbzer0.com 4 points 9 months ago

Wget2 can also mirror websites, and it has several features that wget lacks (rough example below):

  • downloads multiple files in parallel, which speeds things up a lot
  • brotli and zstd compression support
  • can use multiple proxies for parallel downloads
  • supports sitemap indexes
  • HTTP/2 support
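I haven't double-checked every flag, but wget2 aims to stay option-compatible with wget, so a parallel mirror run should look roughly like this (the URL and thread count are placeholders; check the flags against your build):

  wget2 --mirror --convert-links --max-threads=8 https://example-wiki.com/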
[–] friek@sh.itjust.works 0 points 9 months ago

This is the way

[–] Rosco@sh.itjust.works 8 points 9 months ago

You should look into curl https://curl.se/
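It only fetches the URLs you give it (no recursive crawling), but for grabbing a single page it's as simple as (placeholder names):

  curl -L -o SomePage.html "https://example-wiki.com/wiki/SomePage"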

[–] vildis@lemmy.dbzer0.com 5 points 9 months ago

I use grab-site (unmaintained) for full-site archival and wget -p -k for simple, non-JavaScript single pages.
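Basic grab-site usage is just pointing it at the site (the URL is a placeholder):

  grab-site "https://example-wiki.com/"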

I've heard good things about HTTrack, SingleFile and ArchiveBox but don't have any experience with them.

ArchiveBox looks the most modern and intuitive, but it runs in Docker.

[–] WhyAUsername_1@lemmy.world 3 points 9 months ago

Check if the Wayback Machine solves your problem: https://archive.org/web/

[–] laserjet@lemmy.dbzer0.com 3 points 9 months ago
[–] Cinner@lemmy.world 1 points 9 months ago

You may need a program that drives a real browser (not headless Chrome) to get past Cloudflare. For sites without DDoS protection, any number of free, open-source tools will do.

[–] notasandwich1948@sh.itjust.works 1 points 9 months ago

grab-site works really well, can recommend