
Lucene search and content indexing in PDF documents

Question asked by ricardoc-moredata on Feb 18, 2010
Latest reply on Feb 19, 2010 by ricardoc-moredata
Hi everyone,

I'm having trouble getting reliable results from Alfresco's Lucene search on PDF documents.

Examples:
Search Language:     lucene
Search:    PATH:"/app:company_home/cm:Empresa/cm:Expediente/cm://*"

Results (14 rows)
Name Node Parent
_x0032_010 workspace://SpacesStore/837eda52-bc75-4fba-b78a-2a7e694b6542 workspace://SpacesStore/0ec3f10e-c165-4a6e-ac32-d61f6539af33
_x0030_2 workspace://SpacesStore/8f6b470c-5dc1-49af-b5a2-33f7653c6c03 workspace://SpacesStore/837eda52-bc75-4fba-b78a-2a7e694b6542
_x0031_8 workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845 workspace://SpacesStore/8f6b470c-5dc1-49af-b5a2-33f7653c6c03
Manual_Alfresco.pdf workspace://SpacesStore/bb8bd6ec-5a5b-4a8e-9531-03f1f427b57b workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
Printing.pdf workspace://SpacesStore/871d8e51-d55e-428b-acf4-bcf0b2d093f5 workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
ManualAlfresco.pdf workspace://SpacesStore/3b51672f-59df-497e-a799-cef3e0c3ca6b workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
_x0032_010 workspace://SpacesStore/3be9f159-d0e4-4ac8-a627-cd8a79858a65 workspace://SpacesStore/73c12541-b8c6-4aad-8b1c-0aed36325e84
_x0030_2 workspace://SpacesStore/d39a9a97-1ad6-4f3d-ab49-fa7f30ad2476 workspace://SpacesStore/3be9f159-d0e4-4ac8-a627-cd8a79858a65
_x0031_8 workspace://SpacesStore/d46ad196-9355-481c-941c-162d77d28346 workspace://SpacesStore/d39a9a97-1ad6-4f3d-ab49-fa7f30ad2476
Find_accessed_file_in_past_1_or_2_minutes.pdf workspace://SpacesStore/06e8d206-33f7-4a9e-9392-1bcc7beab0c8 workspace://SpacesStore/d46ad196-9355-481c-941c-162d77d28346
_x0032_010 workspace://SpacesStore/9d033cea-e913-4fc6-83e2-67bfca1efa0d workspace://SpacesStore/35f6551d-2e35-4275-b994-b73fd998f864
_x0030_2 workspace://SpacesStore/e177b105-c82d-4a77-b91c-b0597af95063 workspace://SpacesStore/9d033cea-e913-4fc6-83e2-67bfca1efa0d
_x0031_8 workspace://SpacesStore/301cca69-dc63-43e0-bf8c-1bde0f95aa2f workspace://SpacesStore/e177b105-c82d-4a77-b91c-b0597af95063
Printing_x0020__x0028_copy_x0029_.pdf workspace://SpacesStore/c1055e34-77c7-4f61-8b3c-fc7be662b69e workspace://SpacesStore/301cca69-dc63-43e0-bf8c-1bde0f95aa2f

Ignoring the folder results (the _x003* entries), every document listed here has properties with the value "admin". So if indexing is working correctly, all of them should appear in a search for that word.
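
For anyone reading the names in these results: the _xHHHH_ sequences are ISO 9075 escapes, which Alfresco applies to characters that are not valid at that position in an XML name (a leading digit, a space, parentheses). Below is a minimal sketch for decoding them, assuming only the simple four-hex-digit form seen in the results. The helper decode_iso9075 is hypothetical, not an Alfresco API.

```python
import re

# Matches ISO 9075 escapes of the form _xHHHH_ (exactly four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_iso9075(name: str) -> str:
    """Replace each _xHHHH_ escape with the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


# Names copied from the search results in this post.
for encoded in (
    "_x0032_010",
    "_x0030_2",
    "_x0031_8",
    "Printing_x0020__x0028_copy_x0029_.pdf",
):
    print(encoded, "->", decode_iso9075(encoded))
```

Decoded this way, the three folder levels read as 2010, 02 and 18, and the last document is "Printing (copy).pdf".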

However, with:
Search Language:     lucene
Search:    PATH:"/app:company_home/cm:Empresa/cm:Expediente/cm://*" AND (TEXT:*admin*)

Results (2 rows)
Name Node Parent
Printing.pdf workspace://SpacesStore/871d8e51-d55e-428b-acf4-bcf0b2d093f5 workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
Printing_x0020__x0028_copy_x0029_.pdf workspace://SpacesStore/c1055e34-77c7-4f61-8b3c-fc7be662b69e workspace://SpacesStore/301cca69-dc63-43e0-bf8c-1bde0f95aa2f

Only two documents appear!  :shock:

Based on what I read in

http://wiki.alfresco.com/wiki/Full-Text_Search_Configuration
and
http://forums.alfresco.com/en/viewtopic.php?f=4&t=23735&p=77638&hilit=+TEXT+lucene#p77638

In the models contentModel.xml and ccdraModel.xml, I changed the index settings to:

<index enabled="true">
   <atomic>false</atomic>
   <stored>true</stored>
   <tokenised>false</tokenised>
</index>

I did a full reindex.

However, the problems persist.

The same happens if I search for another string that exists in a document's metadata, such as:
{http://www.empresa.pt/model/content/1.0}assuntoDocEntrada Ass2

The search returns nothing:
Search Language:     lucene
Search:    PATH:"/app:company_home/cm:Empresa/cm:Expediente/cm://*" AND ( TEXT:*Ass2*)

Results (0 rows)
Name Node Parent
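
One cross-check that may be worth running is to query the metadata property directly instead of going through TEXT. This assumes the documented Alfresco Lucene behaviour that TEXT covers content (d:content) properties, while a d:text property is addressed through an @-prefixed field with the colon escaped. The sketch below only builds the query string. The helper names are hypothetical, and it assumes the cc prefix is the one registered for http://www.empresa.pt/model/content/1.0.

```python
def property_field(prefixed_name: str) -> str:
    # Turn "cc:assuntoDocEntrada" into the field "@cc\:assuntoDocEntrada".
    prefix, local = prefixed_name.split(":", 1)
    return f"@{prefix}\\:{local}"


def build_query(path: str, prefixed_name: str, value: str) -> str:
    # Combine a PATH restriction with a phrase match on one metadata property.
    # Backslashes and double quotes in the value are escaped for the phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'PATH:"{path}" AND {property_field(prefixed_name)}:"{escaped}"'


query = build_query(
    "/app:company_home/cm:Empresa/cm:Expediente/cm://*",  # path from this post
    "cc:assuntoDocEntrada",  # assumed prefix for the custom model namespace
    "Ass2",
)
print(query)
```

The resulting string can be pasted into the Node Browser with the search language set to lucene. If it returns the document while the TEXT query does not, the difference lies in which field is being searched rather than in the index itself.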

Note that before the model changes I mentioned, the original definition was:
<property name="cc:assuntoDocEntrada">
  <title>Assunto do documento de entrada</title>
  <type>d:text</type>
  <mandatory>true</mandatory>
  <index enabled="true">
    <atomic>false</atomic>
    <stored>false</stored>
    <tokenised>true</tokenised>

  </index>
</property>
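
When comparing a model before and after a change like this, it can help to read the index flags programmatically rather than by eye, which also confirms that the fragment is well-formed XML. Below is a minimal sketch, assuming the property fragment is parsed on its own (a full model file declares a default namespace, which would need extra handling). The helper index_settings is hypothetical.

```python
import xml.etree.ElementTree as ET

# The original property definition from this post.
PROPERTY_XML = """
<property name="cc:assuntoDocEntrada">
  <title>Assunto do documento de entrada</title>
  <type>d:text</type>
  <mandatory>true</mandatory>
  <index enabled="true">
    <atomic>false</atomic>
    <stored>false</stored>
    <tokenised>true</tokenised>
  </index>
</property>
"""


def index_settings(property_xml: str) -> dict:
    """Parse one <property> fragment and return its index flags as strings."""
    prop = ET.fromstring(property_xml)  # raises ParseError if not well-formed
    index = prop.find("index")
    if index is None:
        return {}
    settings = {"enabled": index.get("enabled")}
    for child in index:
        settings[child.tag] = (child.text or "").strip()
    return settings


settings = index_settings(PROPERTY_XML)
print(settings)
```

Running the same function over the modified definition gives a quick side-by-side of which flags actually changed.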

Any ideas?


Regards,


Ricardo Cardoso
