Elasticsearch index configuration and highlight query

Hello everyone! I’m currently tackling some challenges with custom Elasticsearch index configurations, particularly with highlight queries.
I apologize if my questions seem a bit off-topic for this channel, but I’m genuinely in need of your assistance with them.

  • Aleph entities use the Elasticsearch text field for searching, but the highlight query uses properties.*. Does anybody know the story behind it?
  • The ES text field is defined with store=False and term_vector=with_positions_offsets, but it seems like this term vector is not utilized. Any thoughts on why this might be the case?
  • I also have a more generic question about term vectors in Elasticsearch (please take a look at the Jupyter notebook). Why do they behave unexpectedly in some cases?

Any insights or guidance you can provide would be greatly appreciated!

Hi there and sorry for the late reply. I’ll try to answer your questions below one by one:

Aleph entities use the Elasticsearch text field for searching, but the highlight query uses properties.*. Does anybody know the story behind it?

There are multiple requirements when it comes to search in Aleph, including the following:

  1. Run full-text queries on all properties and property types.
  2. Run advanced queries for certain property types, for example range queries on date fields.
  3. Display snippets with highlights for search results.

When the ES indices are created, Aleph also configures a field mapping based on the entity property types. For example, the Company:incorporationDate property is mapped to an Elasticsearch date field and the Company:registrationNumber is mapped to an Elasticsearch keyword field.
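To make this more concrete, here’s a minimal sketch of what such a mapping could look like. This is not the actual Aleph code: the helper build_property_mapping and the ES_FIELD_TYPES table are made up for illustration, and I’m assuming that registrationNumber has the FollowTheMoney type identifier.

```python
# Hypothetical lookup from FollowTheMoney property types to Elasticsearch
# field types. Illustrative only; the real table would cover more types.
ES_FIELD_TYPES = {
    "date": "date",
    "identifier": "keyword",
}


def build_property_mapping(schema_properties):
    """Return an index mapping with one typed field per entity property."""
    fields = {
        name: {"type": ES_FIELD_TYPES[ftm_type]}
        for name, ftm_type in schema_properties.items()
    }
    # The outer "properties" key is Elasticsearch mapping syntax; the inner
    # "properties" object is the entity field that holds the property values.
    return {
        "mappings": {
            "properties": {
                "properties": {"type": "object", "properties": fields},
            }
        }
    }


company_mapping = build_property_mapping(
    {"incorporationDate": "date", "registrationNumber": "identifier"}
)
```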

date and keyword fields enable certain queries in Elasticsearch, for example filtering based on exact values or range queries. However, they do not allow full-text searches (which we also need). So Aleph also copies every property value to the text field during indexing to enable full-text searches. As the only purpose of the text field is enabling searches (and the actual source values are already stored as part of the properties.* fields) the text field doesn’t need to be stored separately.
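Sketched as an index body, that could look roughly like the following. The store and term_vector settings are the ones from your question, and the text field is excluded from _source. Using copy_to for the copying step is an assumption I’m making for this example, and build_index_body is a made-up helper, so please don’t read this as the exact Aleph configuration.

```python
def build_index_body(property_fields):
    """Return an index body with typed property fields and a catch-all text field."""
    # Assumption: every property field copies its values into "text" via copy_to.
    fields = {
        name: {**field, "copy_to": "text"}
        for name, field in property_fields.items()
    }
    return {
        "mappings": {
            # The text field is left out of _source, so its content cannot be
            # retrieved from there later on.
            "_source": {"excludes": ["text"]},
            "properties": {
                "text": {
                    "type": "text",
                    "store": False,
                    "term_vector": "with_positions_offsets",
                },
                "properties": {"type": "object", "properties": fields},
            },
        }
    }


index_body = build_index_body(
    {
        "incorporationDate": {"type": "date"},
        "registrationNumber": {"type": "keyword"},
    }
)
```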

Highlighting, however, requires that Elasticsearch is able to retrieve the original field value. That’s not possible for the text field, which is neither stored nor part of _source. But it’s also not a problem (at least in theory) as the source values are all stored as part of the properties.* fields.
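A search request that follows this split could look something like the sketch below. The helper name build_search_body is hypothetical and the query is simplified to a single match clause. The relevant bit is require_field_match: by default Elasticsearch only highlights fields that the query actually matched on, so it has to be set to false when the query runs against text but the highlights come from properties.*.

```python
def build_search_body(query_text):
    """Search on the text field, but highlight the properties.* fields."""
    return {
        "query": {"match": {"text": query_text}},
        "highlight": {
            # The query targets "text" while the highlighted fields are
            # "properties.*", so field matching has to be relaxed.
            "require_field_match": False,
            "fields": {"properties.*": {}},
        },
    }


search_body = build_search_body("acme holdings")
```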

To summarize, properties.* fields are configured to use the best Elasticsearch field type based on the FollowTheMoney property type. properties.* fields also store the source values so they can be retrieved later. The text field indexes data in a way that enables full-text searches, but it doesn’t store the source values in a retrievable way (as that would be redundant).

The ES text field is defined with store=False and term_vector=with_positions_offsets, but it seems like this term vector is not utilized. Any thoughts on why this might be the case?

You might be right that the term vectors are currently not used. I know that older versions of Aleph would always use the fast vector highlighter (FVH), but this became infeasible at some point, so the term vectors that are still stored might be a leftover from a refactoring. I’m not super familiar with this part of the code base though and cannot give you a confident answer without reading through the source and history.

I also have a more generic question about term vectors in Elasticsearch (please take a look at the Jupyter notebook). Why do they behave unexpectedly in some cases?

I think what you’re seeing is mostly consistent with the index configuration, although I’m a bit surprised that you did get highlights with the plain and unified highlighters for the text field. But that might be due to the fact that in your notebook you do not exclude the text field from _source (as Aleph does). Or am I missing something else? If you have any concrete questions about the behavior you’re seeing, it would be great if you could point out specifically what you’d have expected in a given scenario and how it’s different from what you’re seeing.

Hi, thank you very much for your response. We ended up using store=true for the text field together with the FVH highlighter type. In our setup, this approach has proven to be considerably more consistent and efficient. By the way, excluding the text field from _source doesn’t seem to impact the highlight query, so that’s acceptable as well.
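For anyone finding this later, a setup along these lines might look like the simplified sketch below. The function names are made up for this post and the query is reduced to a single match clause. The fvh highlighter needs term_vector=with_positions_offsets on the field, and with store=true the field value is kept as a stored field, independently of _source.

```python
def build_text_field_mapping():
    """Mapping for a stored text field that carries full term vectors."""
    return {
        "text": {
            "type": "text",
            "store": True,
            # Required by the fast vector highlighter.
            "term_vector": "with_positions_offsets",
        }
    }


def build_fvh_search_body(query_text):
    """Search and highlight directly on the text field using fvh."""
    return {
        "query": {"match": {"text": query_text}},
        "highlight": {"fields": {"text": {"type": "fvh"}}},
    }


text_mapping = build_text_field_mapping()
fvh_body = build_fvh_search_body("acme holdings")
```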

Hi @Medjay, thanks for sharing your results. I’m glad to hear that you figured out a setup that works for you. I’ve taken a note to revisit this thread when we have some time to update the ES configuration.