Full-text Search and Character Limits

DSpace can process uploaded text-based content for full-text searching. In addition to the metadata you provide for a file, the file's contents are indexed and searchable. This lets users find keywords that appear only in the content itself and not in the description you provided.

To make this possible, DSpace runs a media filter nightly that extracts text (via OCR) from newly deposited documents.

By default, the media filter on TDL-hosted repositories extracts the first 200,000 characters of text from each document and indexes them for full-text searching. This default works for the majority of documents in most repositories, but it may miss text that falls beyond the limit in very long documents (for example, large yearbooks or newspapers). Additionally, DSpace 7.6 contains a bug that affects the full-text indexing of large documents: if the media filter hits the 200,000-character limit, it fails and does not index any of the text for that item. (This bug is fixed in version 8.1.)

TDL can, on request, increase the character limit and re-run the indexing process for specific items or collections in its hosted DSpace repositories. Members should contact the TDL Helpdesk at support@tdl.org with these requests and provide the relevant item or collection handles and, if possible, the estimated character count.
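If you want to estimate a character count before contacting the Helpdesk, the short Python sketch below shows one possible approach. It assumes you already have a plain-text export of each document (for example, from your own OCR or PDF-to-text tool). The function names are hypothetical and are not part of DSpace. Your count will probably differ somewhat from what the DSpace media filter extracts, so treat the result as a rough estimate only. Treating a count equal to the limit as "reaching" it is also an assumption.

```python
from pathlib import Path

# Default limit described in this document for TDL-hosted repositories.
DEFAULT_CHAR_LIMIT = 200_000


def count_characters(text_path):
    """Return the number of characters in a plain-text file."""
    # errors="replace" keeps the count going if the file contains stray bytes.
    return len(Path(text_path).read_text(encoding="utf-8", errors="replace"))


def exceeds_limit(char_count, limit=DEFAULT_CHAR_LIMIT):
    """Return True when a document is long enough to reach the limit.

    Assumption: a count equal to the limit is treated as reaching it.
    """
    return char_count >= limit


def report(text_paths, limit=DEFAULT_CHAR_LIMIT):
    """Return (filename, character count) pairs for files at or over the limit."""
    counts = [(str(p), count_characters(p)) for p in text_paths]
    return [(name, n) for name, n in counts if exceeds_limit(n, limit)]
```

Calling report() with a list of text file paths returns only the files that reach the limit, along with their counts, which you can include in your request.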

Please note that increasing the character limit is a “one-time fix.” That is, TDL staff will raise the limit upon request, re-index the relevant items, and then restore the default limit. If you later add more items to that collection that require a higher limit, you will need to submit a new request. To keep the process manageable for TDL staff, we ask that you make such requests infrequently (i.e., no more than once per quarter).
