Thanks for your quick reply! I agree with most of what you said, especially about whether we want number to be a messy type or not. Though I think the current situation may not be that bad. I had a look at all the number properties, and in many cases a unit is already specified in the description or in the property name, or there is a de facto standard (see below).
I’m wondering if we could start by explicitly specifying a unit where this is already possible unambiguously, then start validating values passed into number properties and emitting warnings or rejecting invalid values as a second step?
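To illustrate the two steps, here is a rough sketch of what I have in mind. The names `check_number` and `strict` are made up for this example and are not part of FtM, and the definition of a "clean" number (a plain decimal without separators or unit suffixes) is only an assumption:

```python
import re
import warnings

# Assumption: a "clean" number is an optionally signed decimal with no
# thousands separators and no unit suffix.
NUMBER_RE = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")


def check_number(value: str, strict: bool = False) -> str:
    """Hypothetical validator: warn in step one, reject in step two."""
    cleaned = value.strip()
    if NUMBER_RE.match(cleaned):
        return cleaned
    if strict:
        # Step two: invalid values are rejected outright.
        raise ValueError(f"Invalid number value: {value!r}")
    # Step one: keep the messy value unchanged, but tell the caller about it.
    warnings.warn(f"Messy number value kept as is: {value!r}")
    return value
```

Switching from `strict=False` to `strict=True` would then be the second step, once dataset maintainers have had time to act on the warnings.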
From Aleph’s perspective, messy values are stored and displayed as is, but Aleph also tries to convert them to proper numeric values, for example to support range queries. It could make use of additional unit metadata for display purposes (e.g. to provide a range filter UI), while still displaying the original “messy” value as a fallback if it is ambiguous, in order to stay backwards compatible with existing data.
> On the specific proposal: I wonder if instead of specifying the unit for a specific prop (“all durations are in seconds”), it might make sense to specify the domain - time, weight, size - and then parse the value based on that?
Hadn’t thought of that. Would we still have one base/default unit and FtM would just convert to that unit transparently when you add a value?
```python
entity.add("area", "2000 sqft")  # or
entity.add("area", 2000, unit="sqft")
entity.first("area")  # would return `185.8061` if the base unit is sqm
```
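The conversion behind that could be sketched roughly as follows. Everything here is hypothetical: `AREA_FACTORS`, `to_base_unit` and the accepted unit spellings are only for illustration and do not exist in FtM, and I am assuming square meters as the base unit for an "area" domain:

```python
import re
from typing import Optional, Union

# Hypothetical conversion factors for an "area" domain; base unit: square meters.
AREA_FACTORS = {
    "sqm": 1.0,
    "sqft": 0.09290304,  # 1 ft = 0.3048 m exactly, squared
    "ha": 10_000.0,
}

# A plain decimal number, optionally followed by a unit such as "sqft".
VALUE_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([a-zA-Z]+)?\s*$")


def to_base_unit(
    value: Union[str, int, float], unit: Optional[str] = None
) -> Optional[float]:
    """Convert an area to square meters; None if the unit is missing or unknown."""
    if isinstance(value, str):
        match = VALUE_RE.match(value)
        if match is None:
            return None
        number = float(match.group(1))
        # An explicit unit argument wins over a unit embedded in the string.
        unit = unit or match.group(2)
    else:
        number = float(value)
    if unit is None:
        return None
    factor = AREA_FACTORS.get(unit.lower())
    if factor is None:
        return None
    return round(number * factor, 4)


print(to_base_unit("2000 sqft"))        # 185.8061
print(to_base_unit(2000, unit="sqft"))  # 185.8061
print(to_base_unit("2000"))             # None, because no unit is given
```

Returning `None` for unit-less values is deliberate in this sketch: the caller could then fall back to storing the original value unchanged instead of guessing a unit.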
Overview of number properties
Some of them already specify a unit in the description (although that of course doesn’t mean everyone adding values to these properties is aware of that):
| Property | Description | Unit |
| --- | --- | --- |
| Audio:duration | Duration of the audio in ms | milliseconds |
| Audio:samplingRate | Sampling rate of the audio in Hz | Hertz |
| Video:duration | Duration of the video in ms | milliseconds |
| Call:duration | Duration in seconds | seconds |
It’s a bit unfortunate that `Audio:duration` uses milliseconds and `Call:duration` uses seconds.
Some specify a unit in the property name:
| Property | Unit |
| --- | --- |
| Vessel:tonnage | GT |
| Vessel:grossRegisteredTonnage | GRT |
There are some properties that are just counts/indices and essentially “unit-less”:
| Property |
| --- |
| Position:numberOfSeats |
| Page:index |
| Table:rowCount |
| CallForTenders:numberOfLots |
| CallForTenders:maximumNumberOfLots |
The `Document` schema is probably used primarily by Aleph/ingest-file, so we could use that as a reference (it uses bytes) even though the FtM model currently doesn’t specify a unit?
| Property | Unit |
| --- | --- |
| Document:fileSize | bytes |
These properties are new and were added primarily for OpenSanctions, right? I checked a few random OS crawlers and they were all using meters and kilograms for these properties. If this is the case for all of your crawlers, we could use that as a reference?
| Property | Unit |
| --- | --- |
| Person:height | meters |
| Person:weight | kilograms |
Currencies may be a special case, but I think we could probably treat currencies as units?
| Property | Note | Unit |
| --- | --- | --- |
| BankAccount:balance | Balance in the original currency specified by BankAccount:currency | |
| BankAccount:maxBalance | Max balance in the original currency specified by BankAccount:currency | |
| Value:amount | Value in the original currency specified by Value:currency | |
| Value:amountUsd | | USD |
| Value:amountEur | | EUR |
| CryptoWallet:balance | Balance in the currency of the wallet specified in CryptoWallet:balance | |
That leaves us with the following properties. I’ve had a look at some of our datasets and mappings, and the situation is indeed messy:
| Property | Note |
| --- | --- |
| RealEstate:latitude | Degrees as a decimal number seems common, but other notations are sometimes used. |
| RealEstate:longitude | See above. |
| RealEstate:area | Seems to be sometimes mistaken for the name of the region, otherwise a mix of hectares, square feet, and square meters. Values are often unitless and depend on context/dataset, so even detecting the unit wouldn’t work. |
| Similar:confidenceScore | Don’t think this is actively used by anyone? |