
Transformation service question

Question asked by swhitman on Mar 12, 2011
I have a bash script that takes an input pdf file and an output pdf file name and does the following:
  • uses pdftotext and wc to determine if the pdf contains only images

  • if so, uses gs to break the pdf file into separate pages in tiff format

  • then uses tesseract to OCR each page and outputs an hOCR html file

  • then hocr2pdf is used to combine each tiff image with its hOCR file

  • gs is then used to combine the separate pages back into a single pdf file with the given output file name
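For reference, the steps above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names (`needs_ocr`, `count_words`, `ocr_pdf`), the `tiffgray` device, the 300 dpi resolution, and the page file names are all assumptions. Tesseract writes the hOCR output as `.html` or `.hocr` depending on its version, so the sketch checks for both.

```bash
#!/usr/bin/env bash
# Sketch only: device, resolution, and file names are illustrative assumptions.

# Decide from a word count whether OCR is needed (zero words = image-only PDF).
needs_ocr() {
    [ "$1" -eq 0 ]
}

# Count the words pdftotext can extract from a PDF.
count_words() {
    pdftotext "$1" - 2>/dev/null | wc -w | tr -d '[:space:]'
}

# OCR every page of $1 and write a searchable PDF to $2.
ocr_pdf() {
    local in=$1 out=$2 work page base hocr status
    work=$(mktemp -d) || return 1

    # Split the PDF into one grayscale TIFF per page.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
        -sOutputFile="$work/page-%04d.tif" "$in" || { rm -rf "$work"; return 1; }

    for page in "$work"/page-*.tif; do
        base=${page%.tif}
        # Ask tesseract for hOCR output; the extension depends on the version.
        tesseract "$page" "$base" hocr || { rm -rf "$work"; return 1; }
        hocr=$base.hocr
        [ -f "$hocr" ] || hocr=$base.html
        # Overlay the recognized text on the page image.
        hocr2pdf -i "$page" -o "$base.pdf" < "$hocr" || { rm -rf "$work"; return 1; }
    done

    # Merge the per-page PDFs (the glob sorts in page order) into one file.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -sOutputFile="$out" "$work"/page-*.pdf
    status=$?
    rm -rf "$work"
    return "$status"
}

# Run only when called with an input and an output path.
if [ "$#" -eq 2 ]; then
    if needs_ocr "$(count_words "$1")"; then
        ocr_pdf "$1" "$2"
    fi
fi
```

Invoked as `./ocr.sh in.pdf out.pdf` (a hypothetical file name), this produces no output file when the input already contains text.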
I would like to use this script in a RuntimeExecutableContentTransformer, but what I have read about these transformers indicates that they are meant for converting between different MIME types. In this case the source and target MIME types are the same (both PDF). Is this a problem? Should I be trying to use a different Alfresco facility (JavaScript, for example)?

Second, the script does not always process the given input pdf file. If pdftotext returns a nonzero word count, the script exits without creating the output pdf file. How will the Transformation service handle this, and how should the script behave to interface properly with the Transformation service in this case?
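One possible behavior, sketched here as an assumption rather than anything the Transformation service is known to require, is to make the script always produce the output file: when the input already contains text, copy it through unchanged and exit with status 0. The names `count_words`, `run_ocr`, and `transform` are hypothetical, and `run_ocr` is only a placeholder for the OCR steps described earlier.

```bash
#!/usr/bin/env bash
# Sketch only: always leave an output file behind, whether or not OCR ran.

# Count the words pdftotext can extract from a PDF.
count_words() {
    pdftotext "$1" - 2>/dev/null | wc -w | tr -d '[:space:]'
}

# Placeholder (hypothetical): the real OCR pipeline would go here.
run_ocr() {
    echo "run_ocr: not implemented in this sketch" >&2
    return 1
}

# Copy text PDFs through unchanged; OCR image-only PDFs.
transform() {
    local in=$1 out=$2 words
    words=$(count_words "$in")
    if [ "${words:-0}" -gt 0 ]; then
        # The input already has a text layer, so pass it through as-is.
        cp -- "$in" "$out"
    else
        run_ocr "$in" "$out"
    fi
}

# Run only when called with an input and an output path.
if [ "$#" -eq 2 ]; then
    transform "$1" "$2"
fi
```

The alternative is to exit with a nonzero status when no output is written, so that the caller can tell the difference from a successful run. Which of the two the Transformation service expects is exactly the open question here.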