Trying to scrape a wedding gallery website for wedding images, what software would you suggest? (jlai.lu)

submitted 4 months ago by Arietty@jlai.lu to c/programming@programming.dev

The site seems very locked down lol. I guess they really want me to pay $300 for semi-high-resolution images, so I want to scrape the previews instead. It's probably some sort of script, since with NoScript on the site doesn't even load. It's even beaten my 'absolute enable right click' extension, and while I can still get the right click going and take a screenshot, I have no option to open the image itself in a new window.

Next up was a simple scraping extension. The one I use regularly is webscraper, but it's a huge process to use and can snag super easily, so I tried one called Download All images.

That one didn't grab anything besides headers and icons, and it seems to have gotten me IP banned. Thankfully I have a VPN, they didn't even revoke my gallery access, and I'm back at it again.

I have tried commercial scraping software before, but the problem is that, afaik, these tools are very big on following robots.txt, and that makes a lot of sites unscrapable.

So you've all heard my dilemma, and I'm curious, because at this point it's a game. How would you all approach this? What software would you use?

[–] Kissaki@programming.dev 7 points 4 months ago* (last edited 4 months ago)

You didn't even describe how the images are embedded in the website.

I would use the web browser's (Firefox) save page functionality.

Or open the web browser dev tools, run document.querySelectorAll('img'), and use the URLs you get from it.

Or the Page Info media tab.

Or the dev tools network tab, to identify and use the image web requests.

Or use Nushell with the query module enabled: http get the page, then query the HTML.

Or my own C# util.

But I suspect there's Auth in play, so the only easy access is within the browser session?
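
As a rough illustration of the "get the URLs from the img elements" idea, here is a minimal Python sketch that does the same thing offline on HTML you have already saved from the browser. The sample markup and the gallery.example.com / cdn.example.com addresses are made up for the example, and it only finds images that actually appear as img tags in the saved HTML, so it may find nothing if the site draws its images some other way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects the src of every <img> tag, similar to querySelectorAll('img')."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page address.
            self.urls.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return absolute URLs of all <img src=...> found in the given HTML text."""
    parser = ImgCollector(base_url)
    parser.feed(html)
    return parser.urls


# Hypothetical saved markup, only for demonstration.
sample = (
    '<div><img src="/previews/001.jpg">'
    '<img alt="no src">'
    '<img src="https://cdn.example.com/a.png"></div>'
)
print(image_urls(sample, "https://gallery.example.com/album/"))
```

For a real page you would read the saved file into a string and pass the address it was saved from as base_url.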

[–] 8263ksbr@lemmy.ml 3 points 4 months ago

Puppeteer and Playwright haven't been mentioned yet.

[–] echindod@programming.dev 3 points 4 months ago

I'd probably use Selenium. But that depends.

[–] ExperimentalGuy@programming.dev 2 points 4 months ago

Before scraping, I would check whether there is an HTTP API you can craft requests against instead of scraping the website. The results might be higher quality than what you can scrape. If there is no easy-to-use HTTP API, fall back to scraping. I would generally consider scraping the last option, unless it's a ridiculously easy website to scrape.
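
One way to look for such an API is to export the browser's network log as a HAR file and list the requests that returned JSON. The sketch below assumes the standard HAR layout (log.entries, each with request.url and response.content.mimeType); the sample entries and the gallery.example.com URLs are invented for illustration.

```python
def json_endpoints(har):
    """Return the unique request URLs whose response was some kind of JSON."""
    seen = []
    for entry in har["log"]["entries"]:
        mime = entry["response"]["content"].get("mimeType", "")
        url = entry["request"]["url"]
        if "json" in mime and url not in seen:
            seen.append(url)
    return seen


# Hypothetical, heavily trimmed HAR data; a real export has many more fields.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://gallery.example.com/album/42"},
                "response": {"content": {"mimeType": "text/html"}},
            },
            {
                "request": {"url": "https://gallery.example.com/api/album/42/photos"},
                "response": {"content": {"mimeType": "application/json; charset=utf-8"}},
            },
            {
                "request": {"url": "https://gallery.example.com/api/album/42/photos"},
                "response": {"content": {"mimeType": "application/json"}},
            },
        ]
    }
}
print(json_endpoints(sample_har))
```

With a real export you would load the file with json.load and pass the resulting dict in. Whether any of the listed URLs is a usable API is something you still have to check by hand.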

[–] MajorHavoc@programming.dev 1 points 4 months ago

Have a look at Robot Framework with the Selenium library. Anything you can manage manually, you can automate repetitively with Robot.

Also, have a look at the F12 Network tab, in case the real images are stored in a predictably named manner.
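
If the names do turn out to be predictable, building the list of addresses is only a few lines. This sketch assumes a purely hypothetical pattern (IMG_0001.jpg, IMG_0002.jpg, ...) on a made-up host; the real pattern, if there is one, has to come from what the Network tab shows.

```python
def candidate_urls(template, start, stop):
    """Fill the {n} placeholder in template with each number from start to stop inclusive."""
    return [template.format(n=n) for n in range(start, stop + 1)]


# Hypothetical naming scheme: zero-padded four-digit counter.
urls = candidate_urls("https://gallery.example.com/previews/IMG_{n:04d}.jpg", 1, 3)
for url in urls:
    print(url)
```

The {n:04d} part is ordinary Python format syntax for zero-padding to four digits; adjust or drop it to match whatever numbering the site really uses.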

[–] dbx12@programming.dev 1 points 4 months ago

Not an answer, but you don't need an extension to defeat right-click blocking scripts: shift-right-click usually does the trick.

[–] madeindjs@programming.dev 1 points 4 months ago

You can get pretty far using a bit of JS and Tampermonkey. You can even search existing user scripts in case someone already did it.