Change to PDF capture support to add Failover

puravida
Jedi Warrior
Joined: 09/01/2007

I have noticed quite a few PDF_CAPTURE_FAILED errors in the database since we improved the PDF support, added custom sizes with cropping capability, and started monitoring those types of requests.

Based on what I have observed, 80% of the failures were due to:

  • 404 web page returned
  • Parked web page returned
  • URL was not really a PDF (active web page)
  • PDF was not embedded (attempted to prompt for download)
  • Host not found

The remaining 20% of errors are mostly known PDF parsing errors, which occur when a PDF does not follow the PDF encoding guidelines and the capture script does not yet have a workaround. Read more about that in my post about the PDF parsing errors.

As a result of the failover change described in this post, you will likely no longer see PDF_CAPTURE_FAILED errors logged. If you have a custom integration that looks for that specific error, it may require a code change. Aside from that, no changes are necessary to benefit from this enhancement.

The script we use to capture screenshots of PDF files expects and requires the URL to be an embedded PDF; anything else will fail. Therefore, I have added support for immediately retrying a failed PDF capture as a normal capture. PDFs normally capture within 3-9 seconds, while those that fail and retry with failover seem to return within 5-20 seconds on average, so the added latency from the failover is minimal.
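
The retry flow described above can be sketched roughly as follows. This is only an illustration, not the actual STW capture code: `capture_with_failover`, `CaptureResult`, and `CaptureError` are hypothetical names, and the two capture functions are passed in as parameters so the logic can be tried without a real rendering engine.

```python
from dataclasses import dataclass
from typing import Callable


class CaptureError(Exception):
    """Hypothetical error raised when a capture attempt fails."""


@dataclass
class CaptureResult:
    status: str          # e.g. "HTTP:200", "HTTP:404", "NS_ERROR_UNKNOWN_HOST"
    used_failover: bool  # True when the normal web page capture produced it


def capture_with_failover(
    url: str,
    capture_pdf: Callable[[str], str],
    capture_page: Callable[[str], str],
) -> CaptureResult:
    """Try the PDF capture first; on failure, retry once as a normal capture."""
    try:
        return CaptureResult(status=capture_pdf(url), used_failover=False)
    except CaptureError:
        # Anything the PDF script cannot handle (404, parked page, live
        # web page, ...) lands here and is retried as a regular capture,
        # so the caller sees the more specific status from that attempt.
        return CaptureResult(status=capture_page(url), used_failover=True)
```

With this shape, a URL that returns a 404 page would come back with the status from the normal capture (for example HTTP:404) rather than a generic PDF failure.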

There are a couple of benefits to doing it this way:

  • We will return a more accurate error than simply PDF_CAPTURE_FAILED (for instance, NS_ERROR_UNKNOWN_HOST, HTTP:404, etc.)
  • We will deliver a screenshot of what a visitor would see (more accurate), in the case of failure

The downsides, which seem acceptable, are:

  • Your code may not realize that a PDF link on your site or app is broken, because we will return HTTP:200 on successful capture of a "Parked" page. However, on 404 responses, you could look for HTTP:404 (or, more generally, HTTP:40x), assuming the remote server responds properly.
  • In the case that the PDF link attempts to force a download dialog, strange captures may occur. I've seen these range from a single pinpoint dot capture to a mostly blank page with a solid bar header to reporting a DIALOG_PROMPT_ABORT error.
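
For integrations that want to flag broken PDF links, a check along these lines might help. It is a minimal sketch that assumes your code already has the status as a plain string in the forms quoted in this post (HTTP:404, NS_ERROR_UNKNOWN_HOST, and so on); the function name and the category labels are made up for illustration and are not part of the STW API.

```python
import re

# Status strings are taken from the post; category names are illustrative.
_HTTP_STATUS = re.compile(r"^HTTP:(\d{3})$")


def classify_pdf_capture(status: str) -> str:
    """Map a returned status string to a coarse category for a PDF link."""
    match = _HTTP_STATUS.match(status)
    if match:
        code = int(match.group(1))
        if 400 <= code < 500:
            return "broken_link"  # covers HTTP:404 and the other HTTP:40x cases
        if code == 200:
            # Could still be a parked page; the status alone cannot tell.
            return "captured_unverified"
        return "other_http"
    if status == "NS_ERROR_UNKNOWN_HOST":
        return "host_not_found"
    if status in ("DIALOG_PROMPT_ABORT", "PDF_CAPTURE_FAILED"):
        return "capture_failed"
    return "unknown"
```

Note that HTTP:200 is deliberately not treated as proof that the PDF link works, since a parked page also captures successfully.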

Despite the minor downsides, to me, this seems like a much more elegant way to handle these failures, but I'm not really the one using the service. So, if anyone sees strange behavior or prefers the prior method, please share your observations and concerns in the STW forum or by opening a support ticket.

The capture below is an example of a PDF captured using the PDF capture failover feature. It was actually rendered through the "web page rendering engine," which is why there is a frame around the image. I suppose that having a frame is better than not having an image at all.

First page at 400px wide and cropped at 300px high (4:3 ratio)

For more PDF capture examples, see the original PDF support announcement thread.

puravida

Update: Tonight's all-nighter consisted of the following...

In many cases of PDF failure, often when the remote site sends a dialog prompt to download the PDF, a 1x1 image was uploaded and the request was marked as good. It took quite a bit of testing and refactoring to resolve this. It turned out to be related to the code that accounts for incorrectly sized captures when web pages force a size larger than the capture window. This code has been updated and thoroughly tested. It now accounts for image sizes much better than before and works for both PDFs and regular requests, including failover captures for failed PDFs.
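
As an illustration of the kind of guard that keeps a 1x1 image from being marked as good, here is a small sketch. It assumes the capture is a PNG (the post does not say which image format is used) and reads the dimensions straight from the PNG header; the function names and the `min_side` threshold are hypothetical, and this is not the actual fix described above.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR data starts at byte 16: 4-byte width, then 4-byte height, big-endian.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_plausible_capture(data: bytes, min_side: int = 2) -> bool:
    """Reject degenerate images (such as 1x1) before marking a request good."""
    try:
        width, height = png_dimensions(data)
    except ValueError:
        return False
    return width >= min_side and height >= min_side
```

A request whose image fails this check would be treated as a failed capture instead of being uploaded and reported as successful.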

I tested dozens of scenarios that included PDFs (working, not working, and 404), Full-Length, Full-Length cropped, Custom Sizes, Custom Browser resolutions, and default sizes.

I ran into several issues with the OS caching the filesize. I needed to always clear the cache when overwriting a file in place, such as when resizing, cropping, or padding. This is necessary to get the correct filesize to report to the DB and it also avoids the following Amazon S3 error (caused by filesize mismatch):

BadDigest: The Content-MD5 you specified did not match what we received.
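
One general way to avoid reporting a stale size or digest is to derive both values from the bytes that are actually on disk after the last in-place rewrite. The sketch below shows that idea only; it is not the STW code, and `upload_metadata` and `overwrite_in_place` are hypothetical helper names. The one fixed detail is that a Content-MD5 value is the base64 encoding of the raw 16-byte MD5 digest.

```python
import base64
import hashlib
import os


def upload_metadata(path: str) -> dict:
    """Compute size and Content-MD5 from the bytes currently on disk."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(64 * 1024), b""):
            digest.update(chunk)
            size += len(chunk)
    return {
        "size": size,
        # Content-MD5 is the base64 encoding of the raw 16-byte MD5 digest.
        "content_md5": base64.b64encode(digest.digest()).decode("ascii"),
    }


def overwrite_in_place(path: str, data: bytes) -> dict:
    """Rewrite a file (e.g. after a resize or crop) and return fresh metadata."""
    with open(path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the new bytes to disk before re-reading
    return upload_metadata(path)
```

Because the size and the digest come from the same read, the value reported to the DB and the value sent with the upload cannot disagree with each other.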

I believe I have this nipped in the bud now, and my updates account for quite a few more error conditions regarding PDF failures. Now, many of the requests that also fail on the failover attempt will report BLANK_DETECTED or PDF_CAPTURE_FAILED, depending on the issue. If it is a dialog prompt, it will likely be BLANK_DETECTED. Otherwise, it may be a PDF parsing error or something else (like an I/O error with the PDF capture script), and I pass through PDF_CAPTURE_FAILED to indicate those types of errors that we have no control over.
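
The reporting rule in the paragraph above could be expressed roughly like this. It is a simplified sketch under stated assumptions: the function name, its two parameters, and the convention that `None` means "the retry produced no usable result" are all hypothetical, and the real logic covers more conditions than shown here.

```python
from typing import Optional


def final_status(failover_status: Optional[str], image_is_blank: bool) -> str:
    """Pick the status to report once the PDF attempt has already failed.

    failover_status: status from the normal-capture retry, or None if the
        retry produced no usable result (hypothetical convention).
    image_is_blank: whether the retry image was detected as blank.
    """
    if image_is_blank:
        # Per the post, this is the likely outcome for download-dialog prompts.
        return "BLANK_DETECTED"
    if failover_status is None:
        # Parsing errors, I/O errors in the PDF script, and similar cases
        # pass through as the original PDF failure.
        return "PDF_CAPTURE_FAILED"
    return failover_status
```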

puravida

Update: It seems that the 1x1 px failures mentioned above were not caused by the PDFs not being embedded. As it turns out, we can capture embedded and "forced download" PDFs alike. One or the other may fail for other reasons, but not because of the delivery method used by the remote site. Apparently, the 1x1 px issue was caused by a failure to upload to our storage and may have been a coding issue that was resolved at some point. After thorough testing, this no longer seems to be an issue, but it came up again tonight because I noticed that several PDF captures were broken with the 1x1 px error. I have refreshed all PDFs in the system and visually verified that dozens of previously broken captures are now working as expected.

I also wanted to add an example of a PDF captured with the "failover" feature, which I posted in the initial post in this thread (see above).

ShrinkTheWeb® (About STW) is another innovation by Neosys Consulting

©2018 ShrinkTheWeb. All rights reserved. ShrinkTheWeb is a registered trademark of ShrinkTheWeb.