FtM Proposal: Optionally specify unit for number properties

Hi all,

I was thinking of adding an optional unit field to properties that use the number type (similar to how identifiers can have an optional format), primarily to format values nicely in applications such as Aleph (for example “05:44 min” instead of “344006”; see also this Aleph bug).

This would be exposed as additional property metadata but not used by FtM in any other way. Any thoughts about this, in particular why it might be a bad idea?

Usage examples:

Audio:
  # ...
  properties:
    duration:
      label: "Duration"
      description: "Duration of the audio in ms"
      type: number
      unit: ms # or "milliseconds"?
    samplingRate:
      label: "Sampling Rate"
      description: "Sampling rate of the audio in Hz"
      type: number
      unit: Hz # or "hertz"?
Document:
  # ...
  properties:
    fileSize:
      label: "File size"
      type: number
      unit: B # or "bytes"
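
A minimal sketch of how an application such as Aleph might use this unit metadata for display. The function name `format_number` and the formatting rules are assumptions for illustration, not part of FtM or Aleph:

```python
from typing import Optional


def format_number(value: str, unit: Optional[str]) -> str:
    """Format a raw number value for display, given the property's unit.

    Hypothetical helper: values that cannot be parsed as a number are
    returned unchanged, so messy data is still displayed as is.
    """
    try:
        amount = float(value)
    except ValueError:
        return value  # messy value, e.g. "17 GiB": show it as is
    if unit is None:
        return value
    if unit == "ms":
        # Render millisecond durations as "MM:SS min".
        minutes, seconds = divmod(int(amount // 1000), 60)
        return f"{minutes:02d}:{seconds:02d} min"
    return f"{value} {unit}"


print(format_number("344006", "ms"))  # 05:44 min
```

Other units would need their own formatting rules (e.g. human-readable file sizes); this sketch only special-cases milliseconds.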

Nice, thanks for approaching this topic. I feel like the number property type is the last proper “anything goes” part of FtM: we don’t have any rules on what constitutes a number - it’s just a bit of text. In particular, there’s no real expectation for the value of number props to be a … number. So you could stuff “5 Hz” in there, or “17 GiB”.

So I wonder: if we decide we want to make number well-defined, do we need to start one level deeper and talk about using format on numbers as well, e.g. how to specify decimal and thousands separators, and perhaps even removing other cruft from the number field? I assume most mappings that use number props include dirty contents, e.g. number values that have units or use varied number formats. Or we keep number as a messy field, where units etc. are OK - but then we should be explicit about that and e.g. never try to run facets on it (I think Aleph does this).

On the specific proposal: I wonder if instead of specifying the unit for a specific prop (“all durations are in seconds”), it might make sense to specify the domain - time, weight, size - and then parse the value based on that?

btw, this looks cool: Tutorial — pint 0.6 documentation

Thanks for your quick reply! I agree with most of what you said, especially about whether we want number to be a messy type or not. Though I think the current situation maybe isn’t that bad. I had a look at all the number properties, and in many cases a unit is already specified in the description or in the property name, or there is a de facto standard (see below).

I’m wondering if we could start by explicitly specifying a unit where this is already possible unambiguously, then start validating values passed into number properties and emitting warnings or rejecting invalid values as a second step?

From Aleph’s perspective, messy values are stored and displayed as is, but Aleph also tries to convert them to proper numeric values, for example to support range queries. It could make use of additional unit metadata for display purposes (e.g. to provide a range filter UI), while still displaying the original “messy” value as a fallback if ambiguous, in order to stay backwards compatible with existing data.

On the specific proposal: I wonder if instead of specifying the unit for a specific prop (“all durations are in seconds”), it might make sense to specify the domain - time, weight, size - and then parse the value based on that?

Hadn’t thought of that. Would we still have one base/default unit and FtM would just convert to that unit transparently when you add a value?

entity.add("area", "2000 sqft") # or
entity.add("area", 2000, unit="sqft")
entity.first("area") # would return `185.8061` if the base unit is sqm
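
To make the arithmetic behind that idea concrete, here is a rough sketch of converting to a base unit on the way in. The factor table and the `to_base_area` helper are hypothetical (a library such as Pint could do this instead), and square meters as the base unit is just the assumption from the example above:

```python
# Hypothetical conversion factors to an assumed base unit of square meters.
AREA_TO_SQM = {
    "sqm": 1.0,
    "sqft": 0.09290304,  # 1 ft = 0.3048 m exactly, so 1 sqft = 0.3048 ** 2 sqm
    "ha": 10_000.0,
}


def to_base_area(amount: float, unit: str = "sqm") -> float:
    """Convert an area to square meters; reject units we don't know."""
    try:
        factor = AREA_TO_SQM[unit]
    except KeyError:
        raise ValueError(f"unknown area unit: {unit!r}") from None
    return amount * factor


print(round(to_base_area(2000, "sqft"), 4))  # 185.8061
```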

Overview of number properties

Some of them already specify a unit in the description (although that of course doesn’t mean everyone adding values to these properties is aware of that):

| Property | Description | Unit |
| --- | --- | --- |
| Audio:duration | Duration of the audio in ms | milliseconds |
| Audio:samplingRate | Sampling rate of the audio in Hz | Hertz |
| Video:duration | Duration of the video in ms | milliseconds |
| Call:duration | Duration in seconds | seconds |

It’s a bit unfortunate that Audio:duration uses milliseconds and Call:duration uses seconds.

Some specify a unit in the property name:

| Property | Unit |
| --- | --- |
| Vessel:tonnage | GT |
| Vessel:grossRegisteredTonnage | GRT |

There are some properties that are just counts/indices and essentially “unit-less”:

Property
Position:numberOfSeats
Page:index
Table:rowCount
CallForTenders:numberOfLots
CallForTenders:maximumNumberOfLots

The Document schema probably is primarily used by Aleph/ingest-file, so we could use that as a reference (it uses bytes) even though the FtM model currently doesn’t specify a unit?

| Property | Unit |
| --- | --- |
| Document:fileSize | bytes |

These properties are new and were added primarily for OpenSanctions, right? I checked a few random OS crawlers and they were all using meters and kilograms for these properties. If this is the case for all of your crawlers, we could use that as a reference?

| Property | Unit |
| --- | --- |
| Person:height | meters |
| Person:weight | kilograms |

Currencies may be a special case, but I think we could probably treat currencies as units?

| Property | Note | Unit |
| --- | --- | --- |
| BankAccount:balance | Balance in the original currency specified by BankAccount:currency | |
| BankAccount:maxBalance | Max balance in the original currency specified by BankAccount:currency | |
| Value:amount | Value in the original currency specified by Value:currency | |
| Value:amountUsd | | USD |
| Value:amountEur | | EUR |
| CryptoWallet:balance | Balance in the currency of the wallet specified in CryptoWallet:balance | |

That leaves us with the following properties. I’ve had a look at some of our datasets and mappings and the situation is indeed messy 🙃

| Property | Note |
| --- | --- |
| RealEstate:latitude | Degrees as a decimal number seems common, but other notations are sometimes used. |
| RealEstate:longitude | See above. |
| RealEstate:area | Seems to be sometimes mistaken for the name of the region; otherwise a mix of hectares, square feet, and square meters. Values are often unitless and depend on context/dataset, so even detecting the unit wouldn’t work. |
| Similar:confidenceScore | Don’t think this is actively used by anyone? |

I’ve spent some time looking into how others model units/quantities. Some notes about Wikidata’s implementation:

What FtM calls the number type corresponds to the Quantity type in Wikidata. Quantity values consist of the amount, the unit, and optional upper/lower bounds to model imprecision.

That means it’s possible to store a quantity in any unit, for example the area of a landmark can be stored in square meters or square feet. However, Wikidata implements normalization of values into a base unit defined here, for example specifying that a quantity in square feet can be normalized to square meters.
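
As a rough illustration of that shape (not Wikidata’s actual data model or API), a quantity with optional bounds and a normalization step could be sketched like this. The `Quantity` class and the `TO_BASE` table are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: unit -> (base unit, factor). Wikidata defines its own
# normalization rules; these entries are only for illustration.
TO_BASE = {
    "square metre": ("square metre", 1.0),
    "square foot": ("square metre", 0.09290304),
}


@dataclass(frozen=True)
class Quantity:
    amount: float
    unit: str
    lower: Optional[float] = None  # optional lower bound
    upper: Optional[float] = None  # optional upper bound

    def normalized(self) -> "Quantity":
        """Return the same quantity expressed in its base unit."""
        base_unit, factor = TO_BASE[self.unit]

        def scale(value: Optional[float]) -> Optional[float]:
            return None if value is None else value * factor

        return Quantity(self.amount * factor, base_unit,
                        scale(self.lower), scale(self.upper))


area = Quantity(100, "square foot", lower=95, upper=105)
print(area.normalized())
```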

Properties can be defined with constraints for the allowed types of units and value ranges.

Currencies are modeled as any other unit, i.e. there are units for euros, US dollars, etc.

Some examples:

These are just notes for now, not sure about the implications for the FtM number type yet. Will also take a closer look at some other implementations.

Notes about the Schema.org implementation:

Schema.org also has a Quantity type. In contrast to Wikidata, values of the Quantity type essentially are just strings. Most subclasses of the Quantity type allow for variable units (e.g. Distance or Mass) in which case the string value is “<amount> <unit>”. Durations require ISO 8601 format (e.g. “PT2H30M” for 2 hours and 30 minutes).
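
For reference, a small sketch of what parsing such a duration string involves. It only handles a subset of ISO 8601 (days, hours, minutes, seconds; no years, months, or weeks), and the `duration_seconds` helper is a hypothetical name, not something Schema.org or FtM provides:

```python
import re

# Subset of ISO 8601 durations: PnDTnHnMnS (no years, months, or weeks).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?"
    r"(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def duration_seconds(value: str) -> float:
    """Convert a duration like "PT2H30M" to a number of seconds."""
    match = _DURATION.match(value)
    # Reject non-matches and strings with a dangling designator ("P", "PT").
    if match is None or value.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {key: float(num) for key, num in match.groupdict(default="0").items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


print(duration_seconds("PT2H30M"))  # 9000.0
```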

Additionally, Schema.org has a StructuredValue type which is similar to Wikidata’s Quantity type, i.e. it has an amount, optional upper/lower bounds, and a unit. Subtypes of StructuredValue include QuantitativeValue (which can be used as an alternative to Quantity on many properties) and MonetaryAmount. (Interestingly, MonetaryAmount cannot be used on some properties such as yearlyRevenue).

Some examples:

Trying to summarize some of the open questions based on our previous discussions:

1) Do we require only a specific dimension (e.g. length, mass, …) or a specific unit (meters, kilograms, …)?

I’ve talked to FtM users at OCCRP, and there is some value in preserving the original unit. For example, when comparing entity properties with the original source, it might be slightly confusing if the entity has property values in a different unit than the original source. However, this is not an important requirement.

Based on this, I’d go with the simpler option, i.e. require that values of number properties are always just a number in the property’s unit.

2) In case we decide that values can be specified in any unit of the property’s dimensions, we need to store the unit along with the value. This could be a simple string representation such as 100 m or a structured format, maybe even objects/JSON, e.g. {"amount":100,"unit":"m"}. Structured values like this would be quite a change from how FtM currently works (“everything is just a string”).
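
To compare the two representations, here is a rough sketch of reading each one back into an (amount, unit) pair. Both helper names are hypothetical, and the string variant assumes exactly the “<amount> <unit>” shape with a space separator:

```python
import json
from typing import Tuple


def parse_microformat(value: str) -> Tuple[float, str]:
    """Parse a simple string representation such as "100 m"."""
    amount, _, unit = value.strip().partition(" ")
    return float(amount), unit.strip()


def parse_json(value: str) -> Tuple[float, str]:
    """Parse a structured representation such as {"amount":100,"unit":"m"}."""
    data = json.loads(value)
    return float(data["amount"]), data["unit"]


print(parse_microformat("100 m"))
print(parse_json('{"amount":100,"unit":"m"}'))
```

The string form stays close to “everything is just a string”, while the JSON form is easier to extend (e.g. with bounds) but every consumer has to decode it.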

This question is of course no longer relevant if we decide to only allow one unit per property.

3) Should FtM support unit conversions?

No matter how we decide to go about 1) and 2), FtM could support unit conversions when adding a value to a property or when retrieving a value.

I’ve also asked users at OCCRP about this, and there was no strong need for this. Actually, @brrttwrks mentioned that it might be better to handle unit conversions explicitly outside of FtM rather than implicitly by FtM.

Given that we could always add convenience helpers/options later on, we probably don’t necessarily need to decide this now.

@pudo Thoughts about this, from OpenSanctions’ perspective, in particular about 1)?

Thanks for doing more research on this. As you point out: with both solutions - a prop-defined unit, or a unit micro-format - we’ll need to eventually implement conversions to make things work. I don’t find that overly scary: we can start small - file sizes, money and perhaps distances? Not excited about the imperial system, but who is?

Perhaps we do need to define an “anchor” unit for each prop - something that gets used when we try to index it for a facet in ES. This could then be converted as late as during indexing, even if we ended up using the num unit microformat (prefer that over JSON tbh). Or is faceting not a major use case on your end?

In any case, this will require a review of a lot of mappings so given the choice I’d vote for a stricter approach with really nice helpers.

@tillprochaska Overall, what’s your pick? 1, 2, or do what Eric says?

FWIW our exposure - mostly BS values:

As you point out: with both solutions - a prop-defined unit, or a unit micro-format - we’ll need to eventually implement conversions to make things work.

But do we? We could define one unit per number property and expect FtM consumers to provide values in that unit. For example, Person:height is always expected to be in meters.

Sure, it’s a bit limiting because you may not be able to represent a value in its original unit, and as an FtM user you need to be aware of the expected unit. Talked to @brrttwrks about this, and he was fine with that limitation.

Is there a strong need for this from your perspective? (I’m just asking because from my perspective, not having to handle conversions and different units would significantly simplify everything, even considering that there are packages like Pint for the actual conversion, but maybe I’m missing something.)

Yes, agreed, if we want to support unit conversions, an anchor/default/canonical unit would be required exactly for that purpose.

Oh yeah I like keeping it “out of FtM” this way, but all of us are going to have to go back and review our importers and build a bit of conversion there.

Are we aligned on adding unit to Property, making a lookup somewhere of all established unit names (sec, min, m, kg, t, b)? Then if a producer puts in the wrong thing, the producer is … wrong.
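
A minimal sketch of what such a lookup could look like when loading the model. The unit names are the ones listed above; the `check_property_unit` helper and its behaviour are assumptions, not existing FtM code:

```python
from typing import Optional

# Unit names as listed in this thread; the real lookup would likely be longer.
KNOWN_UNITS = frozenset({"sec", "min", "m", "kg", "t", "b"})


def check_property_unit(prop_name: str, unit: Optional[str]) -> None:
    """Reject property definitions that declare an unknown unit.

    Only the schema is checked here. Values added by producers are not
    inspected, in line with "the producer is wrong" if they get it wrong.
    """
    if unit is not None and unit not in KNOWN_UNITS:
        raise ValueError(f"{prop_name}: unknown unit {unit!r}")


check_property_unit("Person:height", "m")        # passes
check_property_unit("Position:numberOfSeats", None)  # unit-less, passes
```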

Sorry for the slow reply!

Are we aligned on adding unit to Property, making a lookup somewhere of all established unit names (sec, min, m, kg, t, b)? Then if a producer puts in the wrong thing, the producer is … wrong.

I thought the easiest implementation would be to essentially just allow any string that can be parsed into a number:

✅ 1.74
✅ 1.74m
✅ 1.74 m
✅ 1.74 meters
✅ 1.74 metres
✅ 174cm

❌ redacted
❌ m
❌ (empty value)

This would simply ignore the unit during validation. So you could add a value of 174 cm to a property that expects meters and the validation would succeed, because it can successfully be parsed as a numeric value. This is of course not perfect, but I think at least for our use case it would be acceptable.
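
A rough sketch of that lenient check: accept anything that starts with a plain decimal number and ignore whatever follows. The regular expression and the `parse_number` name are assumptions; this is not how FtM currently validates numbers:

```python
import re
from typing import Optional

# A leading, optionally signed decimal number; any trailing unit is ignored.
_LEADING_NUMBER = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)")


def parse_number(value: str) -> Optional[float]:
    """Return the leading numeric part of a value, or None if there is none."""
    match = _LEADING_NUMBER.match(value)
    return float(match.group(1)) if match else None


for raw in ["1.74", "1.74m", "1.74 meters", "174cm", "redacted", "m", ""]:
    print(repr(raw), "->", parse_number(raw))
```

Note that this sketch does not handle thousands separators or decimal commas: a value like “1,74” would be read as 1.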

That said, I’m not against implementing unit lookups to enable FtM to catch values with incorrect units and I can see how it would be useful, especially when working with existing importers/crawlers.

(I’m not working this week, so it will probably again take me a little longer to reply.)