11:46 AM - Generating and cleaning HTML for PDF generation, Joomla import
At work, I've been working on software that generates a newsletter. The software had to produce several formats, including PDF, HTML, and text, each from its own template. The input comes from RSS feeds.
For the PDF generation, we went with a commercial product called PDF Reactor. It's a Java library that creates PDF documents from HTML and CSS. It runs on top of iText and does several post-processing tasks. As part of the process, it uses one of three configurable HTML formatters. We chose the default, which has problems with malformed HTML in some cases. An unclosed anchor tag causes grief. Since the input is arbitrary content from the internet, we needed a way to clean it up. I set up JTidy to process the input when it's HTML. That way the HTML is always well-formed going into the system.
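To show the kind of input that causes trouble, here is a minimal sketch that uses only the JDK's XML parser. It is not JTidy and not what PDF Reactor does internally. The class and method names are made up for illustration. It only reports whether a fragment is well-formed XML, which is stricter than HTML (named entities such as &nbsp; or a fragment with several root elements would also fail), so treat it as a rough pre-check rather than a cleaner.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: the name WellFormedCheck is not from any library.
class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The input comes from the internet, so refuse DOCTYPE declarations
            // (and with them external entities). Such input is reported as
            // not well-formed and would go to the cleanup step.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler
            // print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignore warnings */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Malformed markup, for example an anchor tag that is never closed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p><a href=\"x\">link</a></p>"));
        System.out.println(isWellFormed("<p><a href=\"x\">link</p>"));
    }
}
```

The second call in main is the unclosed anchor case. Anything that fails a check like this is what a tidy pass needs to repair before the PDF step sees it.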
PDF Reactor is licensed per CPU core, and they do check the core count. We tried to run it on an i5, assuming it would just not take advantage of all the features, and it went into evaluation mode. We had to actually disable two cores on the CPU to get the system to work. On FreeBSD, this only takes two lines in /boot/loader.conf.
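I don't know how PDF Reactor counts cores internally, so the following is only a sketch of a quick sanity check after the reboot. It prints how many processors the JVM itself reports through the standard Runtime API. The class name is made up for illustration.

```java
// Hypothetical helper for checking what the JVM sees after disabling cores.
class CoreCount {

    // Number of processors the JVM reports as available to it.
    static int visibleCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("JVM sees " + visibleCores() + " core(s)");
    }
}
```

If the number printed is still higher than the licensed count, the loader.conf change did not take effect.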
The odd part of this project was importing from Joomla. I had to take data from various Joomla installs and import select categories into our new system, then create newsletters from this data after it was cleaned up and categorized. This meant selecting all the rows from Joomla's content table for certain sections and categories. The sections were constant within an install. I also had to take the section and category names and tag the articles in the new system with them. If the names were duplicates, we had some problems. The system was overloaded to use categories as two different levels. That has caused some complications, as Joomla categories are not this flexible.
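One way to sidestep the duplicate name problem when tagging is to qualify each category tag with its section name. This is a sketch of that idea, not what we actually shipped. The class name, method name, and the "/" separator are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: builds tags for an imported article from its
// Joomla section and category names.
class JoomlaTags {

    // Returns the section tag plus a section-qualified category tag, so two
    // categories with the same name in different sections stay distinct.
    static List<String> tagsFor(String section, String category) {
        String s = section.trim();
        String c = category.trim();
        // LinkedHashSet keeps insertion order and drops exact repeats.
        Set<String> tags = new LinkedHashSet<>();
        tags.add(s);
        tags.add(s + "/" + c);
        return new ArrayList<>(tags);
    }

    public static void main(String[] args) {
        System.out.println(tagsFor("Sports", "News"));
        System.out.println(tagsFor("Politics", "News"));
    }
}
```

With this scheme a "News" category under "Sports" and another under "Politics" end up with different tags. The tradeoff is that the tags are longer and no longer match the bare Joomla category names.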
If I had more time, I would have tried to build the HTML-to-PDF generation myself using iText. This is something I would like for Just Journal, where I use iText for PDF and RTF generation now.