Hi there and sorry for the late reply. I’ll try to answer your questions below one by one:
Aleph entities uses Elasticsearch `text` field for searching, but highlight query uses `properties.*`. Does anybody know the story behind it?
There are multiple requirements when it comes to search in Aleph, including the following:
- Run full-text queries on all properties and property types.
- Run advanced queries for certain property types, for example range queries on date fields.
- Display snippets with highlights for search results.
When the ES indices are created, Aleph also configures a field mapping based on the entity property types. For example, the `Company:incorporationDate` property is mapped to an Elasticsearch `date` field and the `Company:registrationNumber` property is mapped to an Elasticsearch `keyword` field.
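To make the mapping idea concrete, here is a minimal sketch of what such a typed mapping could look like as a request body. The helper name `build_property_mapping` and the exact nesting are assumptions for illustration; only the `date` and `keyword` field types come from the explanation above, and the real mapping Aleph generates may well look different.

```python
import json


def build_property_mapping(field_types):
    """Build an index mapping with one typed sub-field per property.

    `field_types` maps a property name to an Elasticsearch field type.
    This helper is hypothetical and not part of Aleph.
    """
    return {
        "mappings": {
            "properties": {
                # Entity property values are assumed to live under a
                # `properties` object, giving paths such as
                # `properties.incorporationDate`.
                "properties": {
                    "type": "object",
                    "properties": {
                        name: {"type": es_type}
                        for name, es_type in field_types.items()
                    },
                }
            }
        }
    }


mapping = build_property_mapping(
    {"incorporationDate": "date", "registrationNumber": "keyword"}
)
print(json.dumps(mapping, indent=2))
```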
`date` and `keyword` fields enable certain queries in Elasticsearch, for example filtering based on exact values or range queries. However, they do not allow full-text searches (which we also need). So Aleph also copies every property value to the `text` field during indexing to enable full-text searches. As the only purpose of the `text` field is enabling searches (and the actual source values are already stored as part of the `properties.*` fields), the `text` field doesn’t need to be stored separately.
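One way to express this kind of copying in Elasticsearch is the `copy_to` mapping parameter. Whether Aleph uses exactly this mechanism is an assumption on my side, and the helper `with_full_text_copy` below is made up for illustration, so treat this as a sketch rather than the actual implementation.

```python
def with_full_text_copy(property_fields, target="text"):
    """Add `copy_to` to every property field and define the target field.

    `property_fields` maps a property name to its field definition.
    Hypothetical helper; not part of Aleph.
    """
    fields = {
        # Keep the typed definition and additionally copy the value into
        # the full-text field at index time.
        name: {**spec, "copy_to": target}
        for name, spec in property_fields.items()
    }
    return {
        "properties": {"type": "object", "properties": fields},
        # The full-text field is only searched, never retrieved, so its
        # values are not stored separately.
        target: {"type": "text", "store": False},
    }


fields = with_full_text_copy(
    {
        "incorporationDate": {"type": "date"},
        "registrationNumber": {"type": "keyword"},
    }
)
```

Values copied with `copy_to` are indexed in the target field but do not show up in `_source`, which matches the idea that the source values only need to live in the `properties.*` fields.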
Highlighting, however, requires that Elasticsearch is able to retrieve the original field value. That’s not possible for the `text` field. But it’s also not a problem (at least in theory) as the source values are all stored as part of the `properties.*` fields.
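As a rough sketch of how searching one field and highlighting others fits together in a single request: the body below queries `text` but asks for highlights on `properties.*`. The function name `build_search_body` is hypothetical and the real query Aleph builds is more involved. The `require_field_match` option is a standard Elasticsearch highlighting setting; setting it to `false` lets fields be highlighted even though the query targets a different field.

```python
def build_search_body(query_text):
    """Search the full-text field, highlight the typed property fields.

    Hypothetical helper; not the query Aleph actually sends.
    """
    return {
        # Full-text matching runs against the combined `text` field.
        "query": {"match": {"text": query_text}},
        "highlight": {
            # Allow highlighting in fields other than the queried one.
            "require_field_match": False,
            # Snippets come from the fields whose values are retrievable.
            "fields": {"properties.*": {}},
        },
    }


body = build_search_body("acme holdings")
```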
To summarize, `properties.*` fields are configured to use the best Elasticsearch field type based on the FollowTheMoney property type. `properties.*` fields also store the source values so they can be retrieved later. The `text` field indexes data in a way that enables full-text searches, but it doesn’t store the source values in a retrievable way (as that would be redundant).
ES `text` field defined with `store=False` and `term_vector=with_positions_offsets`, but it seems like this term vector is not utilized. Any thoughts on why this might be the case?
You might be right that the term vectors are currently not used. I know that older versions of Aleph would always use the vector highlighter, but this became unfeasible at some point, so the fact that term vectors are still stored might be left over from a refactoring. I’m not super familiar with this part of the code base though and cannot give you a confident answer without reading through the source and history.
I also have a more generic question about term vectors in Elasticsearch (please take a look at the Jupyter notebook). Why do they behave unexpectedly in some cases?
I think what you’re seeing is mostly consistent with the index configuration, although I’m a bit surprised that you did get highlights with the plain and unified highlighters for the `text` field. But that might be due to the fact that in your notebook you do not exclude the `text` field from `_source` (as Aleph does). Or am I missing something else? If you have any concrete questions about the behavior you’re seeing, it would be great if you could point out specifically what you’d have expected in a given scenario and how it’s different from what you’re seeing.
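If you want to check whether the `_source` exclusion explains the difference, you could create the notebook index both ways and compare the highlights. Here is a minimal sketch of the two index bodies. The helper `build_index_body` is made up for this example; the `store` and `term_vector` settings are the ones you quoted, and `_source.excludes` is the standard mapping option for leaving fields out of `_source`.

```python
def build_index_body(exclude_text_from_source=True):
    """Index body for a test index with a non-stored `text` field.

    Hypothetical helper for experimenting; not Aleph's index setup.
    """
    body = {
        "mappings": {
            "properties": {
                "text": {
                    "type": "text",
                    "store": False,
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    }
    if exclude_text_from_source:
        # With this exclusion the original `text` value is not kept in
        # `_source`, so it cannot be loaded from there for highlighting.
        body["mappings"]["_source"] = {"excludes": ["text"]}
    return body


with_exclusion = build_index_body(exclude_text_from_source=True)
without_exclusion = build_index_body(exclude_text_from_source=False)
```

Running the same highlight query against both indices should show whether the highlights you got for `text` depend on the value being available in `_source`.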