this post was submitted on 17 Oct 2024

Publishers Always Innovating (mander.xyz)

submitted 1 month ago by fossilesque@mander.xyz to c/science_memes@mander.xyz

[–] JackbyDev@programming.dev 19 points 1 month ago (3 children)

Oh boy, I sure am excited to see websites hosting PDFs! I love when the tool that everyone uses for hosting and viewing HTML gets to be blessed with the perfect format that is PDF!

I LOVE PDFS! I love two column PDFs! I love reading like this!

1 3
2 4
5 7
6 8

Instead of like this

1
2
3
4
5
6
7
8

It's amazing and such a good user experience!

I love that PDFs are so difficult to transform into HTML, too. I would never want to besmirch the publisher's perfect, one approved layout by resizing the window!

[–] keepthepace 6 points 1 month ago (1 children)

I love that PDFs are so difficult to transform into HTML, too

FYI, if that's relevant to your field, every new article published on arxiv.org now has an HTML render as well.

And for many older publications, changing "arxiv.org" to "ar5iv.org" in the URL leads to an HTML rendering from a best-effort experiment they ran for a while.
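For anyone who wants to script the URL swap described above, here is a minimal sketch in Python. The function name `to_ar5iv` and the example identifier are made up for illustration, and the sketch only rewrites the hostname; it does not check whether an HTML rendering actually exists for a given paper.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Swap the arxiv.org host for ar5iv.org and leave the rest of the URL alone.

    Hypothetical helper: it only rewrites the hostname and makes no request.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {url}")
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


# The paper identifier below is a placeholder, not a real reference.
print(to_ar5iv("https://arxiv.org/abs/1234.56789"))
```

Parsing the URL instead of doing a plain string replace avoids touching an "arxiv.org" that happens to appear in the path or query string.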

[–] JackbyDev@programming.dev 2 points 1 month ago (1 children)

That's really cool! What I really would like is a tool that converts PDFs to semantic HTML files. I took a peek there and it seems easier for them because they have the original LaTeX source.

I think for arbitrary PDF files the information just isn't there. I've looked into it a bit and it's sort of all over the place. A tool called pdf2htmlEX is pretty good, but it makes the HTML look exactly like the PDF.

[–] keepthepace 2 points 1 month ago (2 children)

Yes, PDFs are much more permissive and may not have any semantic information at all. Hell, some old publications are just scanned images!

PDF -> semantic seems to be a hard problem that basically requires OCR, like these people are doing

[–] JackbyDev@programming.dev 2 points 1 month ago

Oh nice, thanks for sharing that project. I haven't heard of it before!

[–] thevoidzero@lemmy.world 1 points 1 month ago

Not just semantics. PDFs don't even have segmentation like spaces, lines, or paragraphs. It's just text drawn at whatever locations the text processor or other software placed it. Many PDF editors just detect how close the characters are to each other to group them together.

And one step further, you can convert text to paths, which won't even have glyph (character) or font info; all the characters will just be geometric shapes. In that case you can't even copy the text. OCR is your only choice.

PDF is for finalizing something and printing/sharing without the ability to edit.
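To make the "group characters by closeness" idea above concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `Glyph` record, the coordinate units, and the two tolerance values are invented, and real PDF tools use font metrics and far more careful heuristics than this.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF coordinates grow upward
    width: float  # horizontal advance


def group_into_words(glyphs, gap_tolerance=1.5, line_tolerance=2.0):
    """Rebuild lines of words from positioned glyphs using geometry only."""
    # Bucket glyphs into lines: walk from top to bottom and start a new
    # line whenever the baseline moves by more than line_tolerance.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within the line
        words = [line[0].char]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_tolerance:
                words.append(cur.char)   # big gap: treat it as a space
            else:
                words[-1] += cur.char    # small gap: same word
        result.append(words)
    return result


# Glyphs given out of order on purpose: only the positions matter.
sample = [
    Glyph("k", 6, 88, 6), Glyph("P", 14, 100, 6), Glyph("H", 0, 100, 6),
    Glyph("F", 26, 100, 6), Glyph("o", 0, 88, 6), Glyph("i", 6, 100, 3),
    Glyph("D", 20, 100, 6),
]
print(group_into_words(sample))  # [['Hi', 'PDF'], ['ok']]
```

Note that nothing in the input says where a word or line ends. The spaces and line breaks are guessed from gaps, which is why tight kerning, justified text, or multiple columns can easily fool this kind of heuristic.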

[–] brianary@startrek.website 5 points 1 month ago

I've always called Word documents and PDFs "dead-end formats" (DEF). Once you export your data to them, there's no reliable way to retrieve your data from them for further transformation like you can for YAML, JSON, XML, HTML, Markdown, &c.

[–] werefreeatlast@lemmy.world 3 points 1 month ago

Choose your own adventure PDF! 1, 5, 7, 3, 9, 2, 0, 6, 4, 8! What an ending!