Adding Metadata and Full-text Indexing to Rich Documents and Assets in CrafterCMS

Many use cases and digital experiences call for treating certain assets, such as rich documents, videos, and high-resolution images, as first-class content objects in the CMS, each with its own custom metadata and with the metadata embedded in the file indexed as well. CrafterCMS supports these scenarios by letting you "jacket" an asset with a content type.

In this video blog we cover:

  1. What a document/asset jacket is
  2. How document full-text and custom metadata indexing works
  3. How to configure your Crafter Studio project and Crafter Deployer target to support custom metadata for rich documents
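
To make the jacket idea concrete, here is a minimal sketch of what a jacket descriptor might look like. The file name, the content type `/component/document`, and the field names (`title_t`, `summary_t`, `document_o`) are assumptions chosen for illustration and are not taken from the video. Only the `content-type`, `internal-name`, and `item/key` elements line up with the XPaths that appear in the Target YAML later in this post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical jacket stored at /site/documents/annual-report.xml -->
<component>
    <!-- Read by the contentType xpath '*/content-type' -->
    <content-type>/component/document</content-type>
    <!-- Read by the internalName xpath '*/internal-name' -->
    <internal-name>Annual Report</internal-name>
    <!-- Custom metadata fields (names are illustrative) -->
    <title_t>Annual Report</title_t>
    <summary_t>Financial results and outlook for the fiscal year.</summary_t>
    <!-- Binary reference, matched by the referenceXPath '//item/key' -->
    <document_o>
        <item>
            <key>/static-assets/documents/annual-report.pdf</key>
            <value>annual-report.pdf</value>
        </item>
    </document_o>
</component>
```

The jacket carries the custom metadata, while the referenced PDF supplies the full text and in-file metadata that get indexed alongside it.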

Sample Configuration:

Below you will find the configuration examples covered in the video.

Project Config (site-config.xml):

<folders>
    <folder name="Pages" path="/website" read-direct-children="false" attach-root-prefix="true"/>
    <folder name="Components" path="/components" read-direct-children="false" attach-root-prefix="true"/>
    <folder name="Documents" path="/documents" read-direct-children="false" attach-root-prefix="true"/>
    <folder name="Taxonomy" path="/taxonomy" read-direct-children="false" attach-root-prefix="true"/>
    <folder name="Assets" path="/static-assets" read-direct-children="false" attach-root-prefix="false"/>
    <folder name="Templates" path="/templates" read-direct-children="false" attach-root-prefix="false"/>
    <folder name="Scripts" path="/scripts" read-direct-children="false" attach-root-prefix="false"/>
</folders> 

...

<pattern-group name="component">
    <pattern>/site/components/([^&lt;]+)\.xml</pattern>
    <pattern>/site/documents/([^&lt;]+)\.xml</pattern>
    <pattern>/site/system/page-components/([^&lt;]+)\.xml</pattern>
    <pattern>/site/component-bindings/([^&lt;]+)\.xml</pattern>
    <pattern>/site/indexes/([^&lt;]+)\.xml</pattern>
    <pattern>/site/resources/([^&lt;]+)\.xml</pattern>
</pattern-group>
 

Permissions Config (permissions.xml):

<permissions>
    <version>4.1.2</version>
    <role name="author">
        <rule regex="/site/website/.*">
            <allowed-permissions>
                <permission>content_read</permission>
                <permission>content_write</permission>
                <permission>content_create</permission>
                <permission>folder_create</permission>
                <permission>get_children</permission>
                <permission>content_copy</permission>
            </allowed-permissions>
        </rule>
        <rule regex="/site/components|/site/components/.*">
            <allowed-permissions>
                <permission>content_read</permission>
                <permission>content_write</permission>
                <permission>content_create</permission>
                <permission>folder_create</permission>
                <permission>get_children</permission>
                <permission>content_copy</permission>
            </allowed-permissions>
        </rule>
        ...
    </role>
</permissions>
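
The excerpt above only shows the rules for pages and components. Because the jackets live under `/site/documents`, the author role would presumably need a matching rule for that folder. The following is a sketch that assumes the same permission set as the components rule. It is not copied from the video, so adjust the permissions to your own policy.

```xml
<!-- Hypothetical rule for the jacket folder; it belongs inside <role name="author"> -->
<rule regex="/site/documents|/site/documents/.*">
    <allowed-permissions>
        <permission>content_read</permission>
        <permission>content_write</permission>
        <permission>content_create</permission>
        <permission>folder_create</permission>
        <permission>get_children</permission>
        <permission>content_copy</permission>
    </allowed-permissions>
</rule>
```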

Target YAML:

      binary:
        # The list of binary file mime types that should be indexed
        supportedMimeTypes:
          - application/pdf
          - application/msword
          - application/vnd.openxmlformats-officedocument.wordprocessingml.document
          - application/vnd.ms-excel
          - application/vnd.ms-powerpoint
          - application/vnd.openxmlformats-officedocument.presentationml.presentation
        # The regex path patterns for the metadata ("jacket") files of binary/document files
        metadataPathPatterns:
          - ^/?site/documents/.+\.xml$
        # The regex path patterns for binary/document files that are stored remotely
        remoteBinaryPathPatterns: &remoteBinaryPathPatterns
          # HTTP/HTTPS URLs are only indexed if they contain the protocol (http:// or https://). Protocol relative
          # URLs (like //mydoc.pdf) are not supported since the protocol is unknown to the back-end indexer.
          - ^(http:|https:)//.+$
          - ^/remote-assets/.+$
        # The regex path patterns for binary/document files that should be associated with just one metadata file and are
        # dependent on that parent metadata file, so if the parent is deleted the binary should be deleted from the index
        childBinaryPathPatterns: *remoteBinaryPathPatterns
        # The XPaths of the binary references in the metadata files
        referenceXPaths:
          - //item/key
          - //item/url
        # Settings specific to authoring indexes
        authoring:
          # XPath for the internal name field
          internalName:
            xpath: '*/internal-name'
            includePatterns:
              - ^/?site/.+$
              - ^/?static-assets/.+$
              - ^/?remote-assets/.+$
              - ^/?scripts/.+$
              - ^/?templates/.+$
          contentType:
            xpath: '*/content-type'
          # Same as for delivery, but also includes images, video, audio, subtitles, templates, scripts, and stylesheets
          supportedMimeTypes:
            - application/pdf
            - application/msword
            - application/vnd.openxmlformats-officedocument.wordprocessingml.document
            - application/vnd.ms-excel
            - application/vnd.ms-powerpoint
            - application/vnd.openxmlformats-officedocument.presentationml.presentation
            - application/x-subrip
            - image/*
            - video/*
            - audio/*
            - text/x-freemarker
            - text/x-groovy
            - text/javascript
            - text/css
          # The regex path patterns for the metadata ("jacket") files of binary/document files
          metadataPathPatterns:
            - ^/?site/documents/.+\.xml$
          binaryPathPatterns:
            - ^/?static-assets/.+$
            - ^/?remote-assets/.+$
            - ^/?scripts/.+$
            - ^/?templates/.+$
          # Look into all XML descriptors to index all binary files referenced
          binarySearchablePathPatterns:
            - ^/?site/.+\.xml$
          # Additional metadata such as contentLength, content-type specific metadata
          metadataExtractorPathPatterns:
            - ^/?site/.+$
          excludePathPatterns:
            - ^/?config/.*$
          # Include all fields marked as remote resources (S3, Box, CMIS)
          referenceXPaths:
            - //item/key
            - //item/url
            - //*[@remote="true"]
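
As an illustration of how the remote patterns and reference XPaths above might come together, here is a sketch of a jacket that points at remotely stored binaries. The file name, content type, field names, and URLs are all hypothetical. The only parts grounded in the configuration are the `item/url` element, the `remote="true"` attribute, and the shape of the two reference values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical jacket stored at /site/documents/product-brochure.xml -->
<component>
    <content-type>/component/document</content-type>
    <internal-name>Product Brochure</internal-name>
    <!-- Matched by '//item/url'; the value matches ^(http:|https:)//.+$ -->
    <externalDocument_o>
        <item>
            <url>https://example.com/docs/product-brochure.pdf</url>
        </item>
    </externalDocument_o>
    <!-- Matched by '//*[@remote="true"]' in the authoring settings;
         the value matches ^/remote-assets/.+$ -->
    <attachment_s remote="true">/remote-assets/s3/docs/product-brochure.pdf</attachment_s>
</component>
```

Because `childBinaryPathPatterns` reuses the remote patterns, binaries referenced this way are tied to their jacket: deleting the jacket should also remove the binary from the index.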