Ask HN: What is the current (open-source) way of converting HTML to PDF?

47 points by hhthrowaway1230 3 days ago

I'm using wkhtmltopdf, but it is painful to work with. What are other people using nowadays, e.g. canva or other tools?

Just print to PDF in a browser, or automate that using a browser automation tool. For a non-browser-based open source solution, WeasyPrint.

https://weasyprint.org/

For a proprietary solution, try Prince XML:

https://www.princexml.com/

rossdavidh - 2 hours ago

+1 to weasyprint; I have used weasyprint with a django production system for a few years now, and it works well enough that I never have to think about it. I'm not doing anything fancy, though, but for me it has worked well.
grounder - an hour ago

WeasyPrint works really well for me. It can support all of the languages and fonts I need. I run it on AWS Lambda and in Docker as a web service.
I previously used WKHTMLTOPDF, but it hasn't been supported for years and doesn't support the latest CSS, etc. It does support JS if you need it, but I'd probably look at headless Chromium or another solution for JS if needed.
Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826
sureglymop - 44 minutes ago

Prince XML looks nice but what about creating a PDF directly from a website? This often adds some problems, for example links still pointing to other pages on the web. But in my experience printing to PDF is often not good enough.
jmyeet - 34 minutes ago

I’ve had excellent experience with Prince XML and poor experience with everything else I’ve tried. Prince is fast, robust and reliable.
Yes it costs money. So does developer time.

kappadi3 - 3 days ago

Puppeteer and Playwright are the main open-source options nowadays, both solid for HTML → PDF once your print CSS is sorted. Don’t forget proper page breaks (break-before/after/inside) — e.g. break-after: page works in Chromium, while always doesn’t. For trickier pagination you can look at Paged.js, and I’d test layouts in Chrome/Edge before automating.
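To make the page-break advice concrete, here is a minimal sketch in Python that builds an HTML document with a print stylesheet. The `page` class name and the `build_document` helper are made up for illustration; only the `break-after` / `break-inside` properties come from the comment above.

```python
from html import escape

# Print stylesheet: each section ends with a page break (except the last),
# and figures/tables are kept from being split across pages.
PRINT_CSS = """
@media print {
  section.page { break-after: page; }
  section.page:last-child { break-after: auto; }
  figure, table { break-inside: avoid; }
}
"""


def build_document(title, sections):
    """Return an HTML string with one <section class="page"> per (heading, text) pair."""
    body = "\n".join(
        f'<section class="page"><h1>{escape(heading)}</h1><p>{escape(text)}</p></section>'
        for heading, text in sections
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title>'
        f"<style>{PRINT_CSS}</style></head>\n"
        f"<body>\n{body}\n</body></html>\n"
    )
```

Feed the resulting file to whichever renderer you use (Puppeteer, Playwright, headless Chrome); how strictly the break rules are honoured depends on that renderer.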

Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting. https://rapidapi.com/yakpdf-yakpdf/api/yakpdf

johnh-hn - 3 hours ago

Seconded. I went with C# + Playwright. I tried iTextSharp, iText, PDFSharp, and wkhtmltopdf, but they all had limitations. I had good results with Playwright in minutes, outside of tweaking the CSS like you mention.
I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.
[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp

Aachen - 2 hours ago

Please don't turn nice formats into a format that's similar to screenshots of text. Pandoc has an option to pack all images and styles needed to render the page into one html file:

    pandoc --self-contained input.html -o output.html

moralestapia - a few seconds ago

Please don't police what other people do.
crazygringo - an hour ago

Or, please do?
I use PDFs so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.
I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.
- mr_mitm - an hour ago
  
  You could, though. What you are describing are features of an editor, not a file format. I can imagine a browser addon performing the same tasks.
  - whenc - an hour ago
    
    PDF annotations sit within the file.
    
    mr_mitm - 18 minutes ago
    
    I know, even though that depends on the editor. Okular, for example, places them in an extra file, last I checked. That's not unique to PDFs. HTML files are modifiable. There is nothing preventing an editor from putting annotations in them as well.
agedclock - 2 hours ago

Pandoc would be my preferred tool. It is excellent at converting between other formats as well.
TylerE - 2 hours ago

Being (not so easily) edited is often a feature, not a bug.
- ryandrake - an hour ago
  
  Is this really that much of a motivation in 2025? Maybe in 2000 you could publish a PDF with the assurance that only the people who paid for Acrobat would be able to edit it, but today there are a lot of accessible ways to edit PDFs. I don't think I'd choose PDF if I, for whatever reason, wanted to limit others from editing.
- guywithahat - 2 hours ago
  
  I was thinking this too: PDFs exist so people don't mess with the document. That said, it's still a clever feature, and pandoc can also convert HTML into a PDF with a conversion engine, though I suspect it'll fail on anything sufficiently complex:
  pandoc input.html -o output.pdf --pdf-engine=<your engine>

delduca - 4 minutes ago

https://gotenberg.dev

lizimo - 9 minutes ago

If generating PDF dynamically is what you really care about, consider Typst. https://typst.app/ We use it in production to generate reports, and it is amazing.

Snawoot - 3 hours ago

chrome --headless --disable-gpu --print-to-pdf https://example.com

piptastic - 2 hours ago

same: google-chrome --headless --disable-gpu --no-pdf-header-footer --hide-scrollbars --print-to-pdf-margins="0,0,0,0" --print-to-pdf --window-size=1280,720 https://example.com
ended up using headless chrome specifically to make sure javascript things rendered properly
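For batch jobs, the command above can be wrapped in a small script. A rough sketch in Python, assuming a Chrome or Chromium binary is on PATH under one of the guessed names below; the function names are hypothetical, and the flags are the ones from the commands above (with the output path passed to `--print-to-pdf=`).

```python
import shutil
import subprocess
from pathlib import Path

# Guessed binary names; which one exists depends on the platform (assumption).
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def chrome_pdf_command(binary, url, output):
    """Build the argument list for one headless print-to-PDF run."""
    return [
        binary,
        "--headless",
        "--disable-gpu",
        "--no-pdf-header-footer",
        f"--print-to-pdf={output}",
        url,
    ]


def convert(url, output):
    """Run the conversion; raises if no browser is found or the browser exits non-zero."""
    binary = next((b for b in CHROME_CANDIDATES if shutil.which(b)), None)
    if binary is None:
        raise FileNotFoundError("no Chrome/Chromium binary found on PATH")
    subprocess.run(chrome_pdf_command(binary, url, str(output)), check=True, timeout=120)
    return Path(output)
```

Passing the arguments as a list (no shell) avoids quoting problems when URLs contain `&` or `?`.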
mmphosis - 35 minutes ago

Can Firefox do this?
with an elaborate script that relies on xdotool
HPsquared - 2 hours ago

Can Chromium do this?
Edit: it appears so- https://news.ycombinator.com/item?id=15131840

RiverCrochet - 2 hours ago

If you don't really need the PDF but just want to archive pages, SingleFile is better. It'll capture the entire page to a single HTML file and I find this is better than the PDF if I don't want to print it. It's a browser extension, but there's also a command line version (https://github.com/gildas-lormeau/single-file-cli) that uses Chrome or Chromium's headless mode.

juice_bus - an hour ago

I have Chromium shoved into an AWS Lambda Layer, when we need HTML to PDF conversion we shove it off onto that. It loads the HTML into Chromium then "prints" it to PDF.

freedomben - 32 minutes ago

I'd love to go the other way: convert a PDF into a self contained HTML page that renders properly in a browser. It's been way harder than I thought it would. Any advice?

mr_mitm - 8 minutes ago

You could embed it as a base64 blob, embed PDF.js (which is included by browsers anyway, I think) and use that to render it in the HTML. But I realize you probably meant a static HTML without JavaScript.
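A sketch of just the base64 part, in Python. Note the swap: instead of bundling PDF.js it uses a plain `<embed>` with a `data:` URI and relies on the browser's built-in viewer, which is an assumption on my part; whether a given browser renders it (and up to what file size) varies. `pdf_to_html` is a made-up name.

```python
import base64
from html import escape


def pdf_to_html(pdf_bytes, title="Embedded PDF"):
    """Wrap raw PDF bytes in a standalone HTML page via a base64 data URI."""
    payload = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        '<body style="margin:0">\n'
        # The browser's own PDF viewer does the rendering here (assumption),
        # so the page is self-contained but not a true HTML conversion.
        f'<embed type="application/pdf" src="data:application/pdf;base64,{payload}" '
        'style="width:100vw;height:100vh">\n'
        "</body></html>\n"
    )
```

The output is roughly a third larger than the PDF because of the base64 encoding.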
drabbiticus - 15 minutes ago

> renders properly
Depending on your requirements on both PDF input and HTML output, there is often no way to do this that is both easy and general. At their core, PDFs are not designed to be universally reflowable.

thangalin - 2 hours ago

Is this an XY problem? If you have the original document (in Markdown), one possibility would be to use my software, KeenWrite[1], to convert Markdown to XHTML then typeset XHTML to PDF via ConTeXt. See the user manual[2] for an example of a Markdown document typeset in this fashion, along with usage instructions.

If you only have HTML to work with, you can also use Flying Saucer[3], which is what KeenWrite uses to preview Markdown documents when rendered as HTML. Flying Saucer uses an open-source version of iText[4] to produce PDF documents (from HTML source docs).

Another possibility is to use pandoc and LaTeX.

[1]: https://keenwrite.com/

[2]: https://keenwrite.com/docs/user-manual.pdf

[3]: https://github.com/flyingsaucerproject/flyingsaucer

[4]: https://itextpdf.com/

etyhhgfff - 2 hours ago

What exactly is so painful about it? It is just one command, can be isolated in a container and runs on every Linux machine.

docker run alpine-wkhtmltopdf google.com - > test.pdf

Source: https://github.com/madnight/docker-alpine-wkhtmltopdf

pentium166 - 2 hours ago

wkhtmltopdf is pretty out of date at this point and headless Chrome/Chromium or something that wraps them is probably a better and safer, roughly equivalent, alternative. Docker might not be a great option if you're already running a containerized service and don't want to deal with getting them to play nice together.

bob1029 - an hour ago

If your HTML is simply an intermediary to get you to a PDF, you could consider just skipping straight to building the PDF directly:

https://pdfbox.apache.org

This would be far more efficient than spinning up an entire browser and printing PDFs to disk.

deaddodo - 20 minutes ago

Building PDF directly (unless you're creating documents, especially fillables) is non-intuitive. Most PDFs are people trying to capture live data in a cached manner. If not, using a preliminary format like Markdown/HTML/LaTeX/DocX/etc to generate your PDF is almost always more intuitive.

haft - 2 hours ago

A reverse of this question: what is the best way to convert PDF to HTML? We are required by accessibility law to make our PDFs WCAG compliant; however, it would be easier to convert these to HTML.

bencornia - 40 minutes ago

I have been using pdf2htmlex with some success. https://github.com/pdf2htmlEX/pdf2htmlEX

ratStallion - an hour ago

My website's content is xml, and I use Apache Fop to turn it into a PDF with page numbers and other nice things. It works nicely, but takes some setup.

nicoburns - 2 hours ago

https://github.com/plutoprint/plutobook was a recent Show HN and looks excellent

mightjustwork - 3 days ago

https://gotenberg.dev/ ...has been working well for me for the last few years. It's a headless instance of Google Chrome with a golang wrapper. Runs well in Docker or a cloud instance.

hansonkd - 2 hours ago

gotenberg is really rock solid for us. Easy to deploy as a docker container to any infrastructure.

haft - 2 hours ago

A reverse of this question: what is the best way to convert PDF to HTML? We are required by accessibility law to make our PDFs WCAG compliant; however, it would be easier to convert these to HTML.

hhthrowaway1230 - an hour ago

5k pdfs a month for archival purposes, must be pdf, customers demand this

throw03172019 - 3 days ago

I run chromium on my server and render the PDF from there using puppeteer.

zja - 3 days ago

pandoc

w10-1 - 2 hours ago

To reinforce this: pandoc has been the go-to for a long, long time and they have encountered and addressed tons of issues, which is especially important for two underspecified and over-provisioned formats like HTML and pdf.
Go through the revision and bug history to see a sample of issues you're avoiding by using a highly-trafficked, well-supported solution.
The only reason not to use it is when they say they don't support a given feature that you need; and the nice thing there is that they'll usually say it, and have a good reason why.
The other reason to use pandoc is that while you might currently want PDF as your outbound format, you might end up preferring some other format (structured logically instead of by layout); with pandoc that change would be easy.
Finally, pandoc is extensible. If you do find that you want different output in some respect, you can easily write a plugin (in Python or Haskell or ...) to make exactly the tweak you need.
beeforpork - 2 hours ago

Does pandoc do JavaScript? For stuff that is rendered (I don't want animated, interactive PDFs...).
hhthrowaway1230 - 3 days ago

doesn't pandoc rely on some engine itself?
- cpach - a day ago
  
  Yep, you need something like XeTeX in order to render the PDF.
- brudgers - a day ago
  
  Curious why that matters to you?
  I mean everything has dependencies (some of the solutions elsewhere require Chrome and other common solutions require the JVM). At least Pandoc is GPL.
  - kakokiyrvoooo - 12 hours ago
    
    It matters because pandoc is not rendering the website to pdf, it converts the html to latex and then uses a latex engine to render the pdf.
    
    brudgers - 23 minutes ago
    
    Forgive me but I don’t understand why that matters to you and am trying to understand what the issue with LaTeX is.
    Because lots of things work this way. For example, compilers built on LLVM use an intermediate language and Python uses bytecode.
    I suspect some HTML-to-PDF tools go through PostScript.
  - kreetx - 2 hours ago
    
    There are multiple ways to "depend", so if pandoc just executes some external tool that does all of the work, then you might as well use that external tool directly. You will get more control over how the conversion happens, know what to search for when in trouble, etc.
    
    brudgers - 19 minutes ago
    
    My understanding and experience is that LaTeX has a significant learning curve and Pandoc provides a more gentle front end.
    Of course LaTeX gives you fine control to hand tune the engine…but that doesn’t seem like what the OP is looking for.

ftchd - an hour ago

the only thing I found to work reliably well is simply Chromium's print feature

exabrial - 3 hours ago

openhtmltopdf is what we're using. Some outdated versions.

supersaw - 2 minutes ago

Been using this as well. It's worth noting that while the original project appears to have been abandoned, it has since been forked and is currently maintained here: https://github.com/openhtmltopdf/openhtmltopdf

fogzen - 15 hours ago

Don’t. Show a web page and open the print dialog, and tell people to save as PDF. All major browsers support this, and the browser HTML to PDF code is the most robust and accurate.

crazygringo - 2 hours ago

There's nothing in OP's question that suggests this is a one-off operation in response to a user action.
It's very likely to be a massive batch operation of a ton of HTML files that might not even be their own site.
- hhthrowaway1230 - an hour ago
  
  this is the case indeed
chibbell - 2 hours ago

That does make sense where possible. I do feel like OP's question is super relevant if you are doing anything where the PDF has to be rendered server side, like say as part of a larger data process when producing an exportable report in PDF format.

journal - 15 hours ago

if you are doing html to pdf, you might also need the ability to merge. a few more features and you're better off with a commercial solution.

crazygringo - 2 hours ago

Merge what?
- pentium166 - 2 hours ago
  
  I assume combining 2+ documents. For example, attaching a cover page with document owner/version control/lifecycle information to an existing PDF.
  - crazygringo - an hour ago
    
    That's the easiest thing in the world with free software.
    One way is to install poppler-utils and use pdfunite. There are many other open-source packages you can use as well.
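For the cover-page case, a minimal Python wrapper around pdfunite might look like this. It assumes poppler-utils is installed and `pdfunite` is on PATH; the helper names are hypothetical, and the two-input minimum is my own sanity check, not a pdfunite rule.

```python
import subprocess


def pdfunite_command(inputs, output):
    """Build the pdfunite argument list: input PDFs in order, output path last."""
    inputs = [str(p) for p in inputs]
    if len(inputs) < 2:
        # Local check only: with a single input there is nothing to merge.
        raise ValueError("need at least two PDFs to merge")
    return ["pdfunite", *inputs, str(output)]


def merge(inputs, output):
    """Run pdfunite; raises CalledProcessError if it exits non-zero."""
    subprocess.run(pdfunite_command(inputs, output), check=True)
```

Usage would be something like `merge(["cover.pdf", "report.pdf"], "merged.pdf")`, with pages appearing in the order the inputs are listed.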