Observation on screenshot captures of PDFs (PDF_CAPTURE_FAILED)

puravida

While spending 34 of the last 36 hours tracking down little "bugs", "bad data", and capture failures (0.00005% of overall requests, but we might as well eradicate any errors!), I discovered that about 10% of PDFs in the system fail to capture and do not output as images. The error reported in our logs is PDF_CAPTURE_FAILED. It has likely always been that way; I put that support in a few years back and moved on, without enough time to monitor it closely. I was just thrilled to have gotten it working.

Now that I've taken some time to look into it, I'm a bit shocked at the high rate of failures. Oddly, about 50% of the failing PDFs will capture on the 2nd or 3rd try. So those, like a few other hard cases, will now be retried automatically on failure. That should hopefully get the failure rate down to about 5%, although that is still very high in my book.
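For anyone curious what "retried on failure" can look like, here is a minimal sketch in Python. It is not the actual ShrinkTheWeb code: the `CaptureFailed` exception, the `capture_with_retry` helper, and the `capture` callable are hypothetical names made up for illustration. The arithmetic behind the estimate is simply 10% failing x 50% still failing after retries = about 5%.

```python
import time
from typing import Callable, Optional


class CaptureFailed(Exception):
    """Hypothetical error standing in for a PDF_CAPTURE_FAILED result."""


def capture_with_retry(
    capture: Callable[[str], bytes],
    url: str,
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> bytes:
    """Call capture(url) up to max_attempts times and return the first success.

    `capture` is an assumed callable that returns image bytes or raises
    CaptureFailed. If every attempt fails, CaptureFailed is raised.
    """
    last_error: Optional[CaptureFailed] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return capture(url)
        except CaptureFailed as exc:
            last_error = exc
            # Pause before the next try, but not after the final attempt.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise CaptureFailed(f"{url}: gave up after {max_attempts} attempts") from last_error
```

Three attempts matches the "2nd or 3rd try" observation above. Going beyond that would probably just add load for PDFs that are never going to render.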

I do not know of any reliable web page thumbnail utilities that also capture screenshots of PDFs, so it is unlikely that competing services even support this feature. Unfortunately, for converting PDFs into images I rely on a 3rd-party script and have no control over its behavior. The best I can do is report the issues I find and wait for a fix. Upon closer inspection and research, I found that the script (Ghostscript) has many, many bugs that are supposedly resolved but still occur in the more recent version that we are running. The developers are likely not aware that the issues are ongoing, because they do not have the wide sampling of PDFs that have been requested in our system from all over the world, in many languages and fonts.

The failures seem to happen most often on non-English documents, and the verbose output suggests a failure in parsing the PDF, understanding its syntax, or recognizing that the decoder for the PDF is available and installed correctly. All of these are known bugs in that script, so all we can do is file further reports and hope the developers rise to the challenge of fixing them.
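If you want to check whether one of your own PDFs trips Ghostscript, something like the following sketch may help. How Ghostscript is actually invoked on our side is not shown in this post, so the device, resolution, and page range here are assumptions, and `build_gs_command` and `render_first_page` are hypothetical helper names. The command-line switches themselves are standard Ghostscript options.

```python
import subprocess
from typing import List, Tuple


def build_gs_command(pdf_path: str, png_path: str, dpi: int = 72) -> List[str]:
    """Build a Ghostscript command that renders page 1 of a PDF to a PNG."""
    return [
        "gs",                      # assumes the Ghostscript binary is on PATH
        "-dSAFER",                 # restrict file access from within the document
        "-dBATCH", "-dNOPAUSE",    # run non-interactively and exit when done
        "-sDEVICE=png16m",         # 24-bit colour PNG output device
        f"-r{dpi}",                # rendering resolution (assumed value)
        "-dFirstPage=1", "-dLastPage=1",
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def render_first_page(
    pdf_path: str, png_path: str, timeout_seconds: int = 60
) -> Tuple[bool, str]:
    """Run Ghostscript and return (succeeded, combined diagnostic output).

    Raises FileNotFoundError if the gs binary is not installed.
    """
    result = subprocess.run(
        build_gs_command(pdf_path, png_path, 72),
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    # Keep both streams, since the diagnostic text is what goes into a bug report.
    return result.returncode == 0, result.stdout + result.stderr
```

Saving the combined output alongside the failing PDF gives the Ghostscript developers something concrete to reproduce.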

I don't know the first thing about PDF technology or where to even start in writing my own proprietary code to convert PDFs, but then again, I had no idea how to automate the capture of web page screenshots when I tried my hand at building this service either. So, maybe I'll consider building something, but the hard work is done and works fairly well already. Ideally, the community will fix the bugs and make our lives easier. ;)

At any rate, I just wanted to let users know that we are aware of the PDF failures and have implemented a workaround to reduce them. There are a couple of alternatives that I may look into at some point when I find time. In the meantime, I hope my recent changes will help some of you out there in Internet land.

puravida

Update: Apparently my code to detect PDFs was case-sensitive, which caused a few HASH_MISMATCH errors in the logs. I have corrected that and verified that PDF URLs with an uppercase ".PDF" extension now capture correctly.
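For illustration, a case-insensitive check along these lines could look like the sketch below. This is not the actual detection code; `looks_like_pdf` is a hypothetical name, and it assumes detection is based on the file extension in the URL path.

```python
from urllib.parse import urlsplit


def looks_like_pdf(url: str) -> bool:
    """Return True if the URL path ends in .pdf, ignoring letter case."""
    # urlsplit separates the path from the query string and fragment,
    # so "?x=1" or "#page=2" after the file name does not hide the extension.
    path = urlsplit(url).path
    return path.lower().endswith(".pdf")
```

With this, `looks_like_pdf("http://example.com/Report.PDF")` and `looks_like_pdf("http://example.com/report.pdf?x=1")` are both treated as PDFs, while `http://example.com/pdf` is not.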


©2018 ShrinkTheWeb. All rights reserved. ShrinkTheWeb is a registered trademark of ShrinkTheWeb.