Ask HN: What is nowadays (opensource) way of converting HTML to PDF?
47 points by hhthrowaway1230 3 days ago
I'm using wkhtmltopdf, but it is painful to work with. What are other people using nowadays, e.g. canva or other tools?
Just print to PDF in a browser, or automate that using a browser automation tool. For a non-browser-based open source solution, WeasyPrint. For a proprietary solution, try Prince XML:

+1 to weasyprint; I have used weasyprint with a django production system for a few years now, and it works well enough that I never have to think about it. I'm not doing anything fancy, though, but for me it has worked well.

WeasyPrint works really well for me. It can support all of the languages and fonts I need. I run it on AWS Lambda and in Docker as a web service. I previously used WKHTMLTOPDF, but it hasn't been supported for years and doesn't support the latest CSS, etc. It does support JS if you need it, but I'd probably look at headless Chromium or another solution for JS if needed. Edit: Previous post with some good discussion: https://news.ycombinator.com/item?id=26578826

Prince XML looks nice but what about creating a PDF directly from a website? This often adds some problems, for example links still pointing to other pages on the web. But in my experience printing to PDF is often not good enough.

I’ve had excellent experience with Prince XML and poor experience with everything else I’ve tried. Prince is fast, robust and reliable. Yes it costs money. So does developer time.

Puppeteer and Playwright are the main open-source options nowadays, both solid for HTML → PDF once your print CSS is sorted.
Don’t forget proper page breaks (break-before/after/inside): for example, break-after: page works in Chromium, while break-after: always doesn’t. For trickier pagination you can look at Paged.js, and I’d test layouts in Chrome/Edge before automating. Shameless plug: I run yakpdf.com, a hosted Puppeteer-based service if you want to avoid self-hosting.
https://rapidapi.com/yakpdf-yakpdf/api/yakpdf

Seconded. I went with C# + Playwright. I tried iTextSharp, iText, PDFSharp, and wkhtmltopdf, but they all had limitations. I had good results with Playwright in minutes, outside of tweaking the CSS like you mention. I documented the process here[0] if anyone needs examples of the CSS and loading web fonts. Apologies for the article being long-winded – it was the first one I published.
[0] https://johnh.co/blog/creating-pdfs-from-html-using-csharp

Please don't turn nice formats into a format that's similar to screenshots of text. Pandoc has an option to pack all images and styles needed to render the page into one html file:

Or, please do? I use PDFs so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact. I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.

You could, though. What you are describing are features of an editor, not a file format. I can imagine a browser addon performing the same tasks.

PDF annotations sit within the file.

I know, even though that depends on the editor. Okular for example places them in an extra file, last I checked. That's not unique to PDFs. HTML files are modifiable. There is nothing preventing an editor from putting annotations in them as well.

Pandoc would be my preferred tool. It is excellent at converting between other formats as well.

Being (not so easily) edited is often a feature, not a bug.

Is this really that much of a motivation in 2025? Maybe in 2000 you could publish a PDF with the assurance that only the people who paid for Acrobat would be able to edit it, but today there are a lot of accessible ways to edit PDFs. I don't think I'd choose PDF if I for whatever reason wanted to limit others from editing.

I was thinking this too, PDFs exist so people don't mess with the document.
That said, it's still a clever feature, and pandoc can convert html into a pdf as well with a conversion engine. That said, I suspect it'll fail on anything sufficiently complex.

pandoc input.html -o output.pdf --pdf-engine=<your engine>

If generating PDF dynamically is what you really care about, consider Typst. https://typst.app/
We use it in production to generate reports, and it is amazing.

chrome --headless --disable-gpu --print-to-pdf https://example.com

same:
google-chrome --headless --disable-gpu --no-pdf-header-footer --hide-scrollbars --print-to-pdf-margins="0,0,0,0" --print-to-pdf --window-size=1280,720 https://example.com

I ended up using headless chrome specifically to make sure javascript things rendered properly.

Can Chromium do this? Edit: it appears so: https://news.ycombinator.com/item?id=15131840

If you don't really need the PDF but just want to archive pages, SingleFile is better. It'll capture the entire page to a single HTML file and I find this is better than the PDF if I don't want to print it. It's a browser extension, but there's also a command line version (https://github.com/gildas-lormeau/single-file-cli) that uses Chrome or Chromium's headless mode.

I have Chromium shoved into an AWS Lambda Layer; when we need HTML to PDF conversion we shove it off onto that. It loads the HTML into Chromium then "prints" it to PDF.

I'd love to go the other way: convert a PDF into a self-contained HTML page that renders properly in a browser. It's been way harder than I thought it would. Any advice?

You could embed it as a base64 blob, embed PDF.js (which is included by browsers anyway, I think) and use that to render it in the HTML. But I realize you probably meant a static HTML without JavaScript.

> renders properly
Depending on your requirements on both PDF input and HTML output, there is often no way to do this that is both easy and general. At their core, PDFs are not designed to be universally reflowable. Is this an xy problem?

If you have the original document (in Markdown), one possibility would be to use my software, KeenWrite[1], to convert Markdown to XHTML then typeset XHTML to PDF via ConTeXt. See the user manual[2] for an example of a Markdown document typeset in this fashion, along with usage instructions. If you only have HTML to work with, you can also use Flying Saucer[3], which is what KeenWrite uses to preview Markdown documents when rendered as HTML.
Flying Saucer uses an open-source version of iText[4] to produce PDF documents (from HTML source docs). Another possibility is to use pandoc and LaTeX.
[2]: https://keenwrite.com/docs/user-manual.pdf

What exactly is so painful about it? It is just one command, can be isolated in a container and runs on every Linux machine.

docker run alpine-wkhtmltopdf google.com - > test.pdf

Source: https://github.com/madnight/docker-alpine-wkhtmltopdf

wkhtmltopdf is pretty out of date at this point and headless Chrome/Chromium or something that wraps them is probably a better and safer, roughly equivalent, alternative. Docker might not be a great option if you're already running a containerized service and don't want to deal with getting them to play nice together.

If your HTML is simply an intermediary to get you to a PDF, you could consider just skipping straight to building the PDF directly: This would be far more efficient than spinning up an entire browser and printing PDFs to disk.

Building PDF directly (unless you're creating documents, especially fillables) is non-intuitive. Most PDFs are people trying to capture live data in a cached manner. If not, using a preliminary format like Markdown/HTML/LaTeX/DocX/etc to generate your PDF is almost always more intuitive.

A reverse of this question; what is the best way to convert pdf to html? We are required by accessibility law to make our PDFs WCAG compliant however it would be easier to convert these to HTML.

I have been using pdf2htmlex with some success. https://github.com/pdf2htmlEX/pdf2htmlEX

My website's content is xml, and I use Apache Fop to turn it into a PDF with page numbers and other nice things. It works nicely, but takes some setup.

https://github.com/plutoprint/plutobook was a recent Show HN and looks excellent

https://gotenberg.dev/
...has been working well for me for the last few years. It's a headless instance of Google Chrome with a golang wrapper. Runs well in Docker or a cloud instance.

gotenberg is really rock solid for us. Easy to deploy as a docker container to any infrastructure.

A reverse of this question; what is the best way to convert pdf to html? We are required by accessibility law to make our PDFs WCAG compliant however it would be easier to convert these to HTML.

5k pdfs a month for archival purposes, must be pdf, customers demand this

I run chromium on my server and render the PDF from there using puppeteer.

pandoc

To reinforce this: pandoc has been the go-to for a long, long time and they have encountered and addressed tons of issues, which is especially important for two underspecified and over-provisioned formats like HTML and pdf. Go through the revision and bug history to see a sample of issues you're avoiding by using a highly-trafficked, well-supported solution. The only reason not to use it is when they say they don't support a given feature that you need; and the nice thing there is that they'll usually say it, and have a good reason why. The other reason to use pandoc is that while you might currently want PDF as your outbound format, you might end up preferring some other format (structured logically instead of by layout); with pandoc that change would be easy. Finally, pandoc is extensible. If you do find that you want different output in some respect, you can easily write a plugin (in Python or Haskell or ...) to make exactly the tweak you need.

Does pandoc do JavaScript? For stuff that is rendered (I don't want animated, interactive PDFs...).

Doesn't pandoc rely on some engine itself?

Curious why that matters to you? I mean everything has dependencies (some of the solutions elsewhere require Chrome and other common solutions require the JVM). At least Pandoc is GPL.
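On the engine question: pandoc does not produce PDF on its own; it writes an intermediate document and hands it to an external program selected with `--pdf-engine`. Here is a minimal Python sketch of building that command line, assuming pandoc and the chosen engine are installed. The helper name `build_pandoc_command` and the short engine lists are illustrative only; pandoc accepts more engines than the ones listed.

```python
import subprocess

# A few of the values pandoc accepts for --pdf-engine (not exhaustive).
# LaTeX engines typeset a LaTeX intermediate; HTML engines render an
# HTML intermediate instead.
LATEX_ENGINES = {"pdflatex", "xelatex", "lualatex"}
HTML_ENGINES = {"weasyprint", "wkhtmltopdf"}


def build_pandoc_command(src: str, dest: str, engine: str = "xelatex") -> list[str]:
    """Return the argv for converting src to a PDF at dest with the given engine."""
    if engine not in LATEX_ENGINES | HTML_ENGINES:
        raise ValueError(f"engine not covered by this sketch: {engine}")
    return ["pandoc", src, "-o", dest, f"--pdf-engine={engine}"]


def convert(src: str, dest: str, engine: str = "xelatex") -> None:
    """Run pandoc; raises CalledProcessError if pandoc or the engine fails."""
    subprocess.run(build_pandoc_command(src, dest, engine), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(build_pandoc_command("input.html", "output.pdf", engine="weasyprint"))
```

Picking an HTML-based engine keeps the page's CSS in play, while a LaTeX engine re-typesets the content and ignores most of the original styling.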
It matters because pandoc is not rendering the website to pdf, it converts the html to latex and then uses a latex engine to render the pdf.

Forgive me but I don’t understand why that matters to you and am trying to understand what the issue with LaTeX is. Because lots of things work this way. For example compilers built on LLVM use an intermediate language and Python uses byte code. I suspect some html to pdf tools go through PostScript.

There are multiple ways to "depend", so if pandoc hands all of the work to some external tool, you might as well use that external tool directly. You will get more control over how the conversion happens, know what to search for when in trouble, etc.

My understanding and experience is that LaTeX has a significant learning curve and Pandoc provides a more gentle front end. Of course LaTeX gives you fine control to hand tune the engine…but that doesn’t seem like what the OP is looking for.

openhtmltopdf is what we're using. Some outdated versions.

Been using this as well. It's worth noting that while the original project appears to have been abandoned, it has since been forked and is currently maintained here: https://github.com/openhtmltopdf/openhtmltopdf

Don’t. Show a web page and open the print dialog, and tell people to save as PDF. All major browsers support this, and the browser HTML to PDF code is the most robust and accurate.

There's nothing in OP's question that suggests this is a one-off operation in response to a user action. It's very likely to be a massive batch operation of a ton of HTML files that might not even be their own site.

That does make sense where possible. I do feel like OP's question is super relevant if you are doing anything where the PDF has to be rendered server side, like say as part of a larger data process when producing an exportable report in PDF format.

If you are doing html to pdf, you might also need the ability to merge. A few more features and you're better off with a commercial solution.
Merge what?

I assume combining 2+ documents. For example, attaching a cover page with document owner/version control/lifecycle information to an existing PDF.

That's the easiest thing in the world with free software. One way is to install poppler-utils and use pdfunite. There are many other open-source packages you can use as well.
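For the pdfunite route mentioned above, here is a minimal Python sketch, assuming poppler-utils is installed and `pdfunite` is on the PATH. The helper names (`build_pdfunite_command`, `merge_pdfs`) and the file names are made up for illustration. pdfunite takes the input files in order, followed by the output file. Refusing fewer than two inputs is a choice made in this sketch.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def build_pdfunite_command(inputs: Sequence[str], output: str) -> list[str]:
    """Return the argv for pdfunite: input PDFs in order, then the output file."""
    if len(inputs) < 2:
        raise ValueError("need at least two input PDFs to merge")
    if output in inputs:
        raise ValueError("output path must differ from every input path")
    return ["pdfunite", *inputs, output]


def merge_pdfs(inputs: Sequence[str], output: str) -> Path:
    """Run pdfunite (assumes poppler-utils is installed) and return the output path."""
    subprocess.run(build_pdfunite_command(inputs, output), check=True)
    return Path(output)


if __name__ == "__main__":
    # Hypothetical file names: put a cover page in front of an existing report.
    print(build_pdfunite_command(["cover.pdf", "report.pdf"], "merged.pdf"))
```

Keeping the command construction separate from the `subprocess.run` call makes the argument order easy to check without having poppler installed.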
pabs3 - 2 days ago
rossdavidh - 2 hours ago
grounder - an hour ago
sureglymop - 44 minutes ago
jmyeet - 34 minutes ago
kappadi3 - 3 days ago
johnh-hn - 3 hours ago
Aachen - 2 hours ago
pandoc --self-contained input.html -o output.html
crazygringo - an hour ago
mr_mitm - an hour ago
whenc - an hour ago
mr_mitm - 18 minutes ago
agedclock - 2 hours ago
TylerE - 2 hours ago
ryandrake - an hour ago
guywithahat - 2 hours ago
lizimo - 9 minutes ago
Snawoot - 3 hours ago
piptastic - 2 hours ago
HPsquared - 2 hours ago
RiverCrochet - 2 hours ago
juice_bus - an hour ago
freedomben - 32 minutes ago
mr_mitm - 8 minutes ago
drabbiticus - 15 minutes ago
thangalin - 2 hours ago
etyhhgfff - 2 hours ago
pentium166 - 2 hours ago
bob1029 - an hour ago
deaddodo - 20 minutes ago
haft - 2 hours ago
bencornia - 40 minutes ago
ratStallion - an hour ago
nicoburns - 2 hours ago
mightjustwork - 3 days ago
hansonkd - 2 hours ago
haft - 2 hours ago
hhthrowaway1230 - an hour ago
throw03172019 - 3 days ago
zja - 3 days ago
w10-1 - 2 hours ago
beeforpork - 2 hours ago
hhthrowaway1230 - 3 days ago
brudgers - a day ago
kakokiyrvoooo - 12 hours ago
brudgers - 23 minutes ago
kreetx - 2 hours ago
brudgers - 19 minutes ago
exabrial - 3 hours ago
supersaw - 2 minutes ago
fogzen - 15 hours ago
crazygringo - 2 hours ago
chibbell - 2 hours ago
journal - 15 hours ago
crazygringo - 2 hours ago
pentium166 - 2 hours ago
crazygringo - an hour ago