Hey all, I’d like to pitch a plan for FollowTheMoney 4.0 that would add several new concepts to the library that we’ve been developing & maturing in nomenklatura. The incentive for pushing them up into FtM is to make them easily re-used by future work planned on IDIO’s (@simon) branch of Aleph, and to simplify the class hierarchy we’re using in OS. While the idea is not to break mainstream Aleph, I think some downstream changes would be needed.
statements data model for entity proxies. This essentially makes the unit of property values a statement instead of a string. The statement captures metadata like the data source, language, timestamps and unparsed value. Besides adding property value metadata, statements also offer an alternative to fragment-based entity aggregation (like in ftm-store): sort the statements in a given data backend by entity_id, then assemble an entity out of each run of statements that share an ID.
Dataset metadata. This could capture all the metadata used in the data catalogs of OS and IDIO. By naming it dataset, it shouldn’t clash with the collections concept in Aleph.
Remove Post. This entity type has been deprecated for over a year, so we should be able to remove it now.
There’s a final thing that may or may not make sense to do now-ish, which is to replace the concept of country with jurisdictions to reflect the fact that an increasing number of the supported values are not countries (examples: Donetsk, Transnistria, Dubai, New York…).
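To make the statement idea a bit more concrete, here is a rough sketch of what "a statement instead of a string" could look like. This is not the nomenklatura or FtM class: the `Statement` dataclass, its field names and the `explode` helper are all made up for illustration, and only the kinds of metadata (data source, language, timestamp, unparsed value) come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Statement:
    """Hypothetical stand-in for a statement; all field names are assumptions."""

    entity_id: str
    schema: str
    prop: str
    value: str
    dataset: str                          # the data source
    lang: Optional[str] = None            # language of the value
    original_value: Optional[str] = None  # unparsed value, if cleaning changed it
    first_seen: Optional[str] = None      # timestamp, kept as a string here


def explode(
    entity_id: str, schema: str, properties: Dict[str, List[str]], dataset: str
) -> List[Statement]:
    """Turn a plain {prop: [values]} mapping into one statement per value."""
    return [
        Statement(entity_id, schema, prop, value, dataset)
        for prop, values in sorted(properties.items())
        for value in values
    ]


statements = explode(
    "acme-ltd",
    "Company",
    {"name": ["ACME Ltd.", "ACME Limited"], "jurisdiction": ["gb"]},
    dataset="example_registry",
)
for stmt in statements:
    print(stmt.entity_id, stmt.prop, stmt.value, stmt.dataset)
```

The point is only that every single value now carries its own provenance, so two sources asserting the same name end up as two separate statements rather than one merged string.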
First of all thanks for the detailed write-up! From the Aleph side of things we are still behind on ftm versions, so we’ll need some more time to catch up. Everything you wrote above makes sense to me and I have nothing against moving forward with these changes.
#1727 brings in the dataset and data catalog models from nomenklatura, as proposed above.
#1728 depends on #1727 and brings in the Statement class, as well as CompositeEntity, a subclass of EntityProxy that is built from statements. Because each Statement has a dataset field, this PR builds on the previous one. It also introduces three new CLI commands: statements (turns entities into a statement feed), format-statements (converts statements between CSV and JSON form), and aggregate-statements (turns a sorted stream of statements back into entities).
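For anyone wondering what "turning a sorted stream of statements back into entities" amounts to, here is a minimal sketch using only the standard library. It does not use the code from the PR: the CSV column names, the `aggregate` function and the output dict layout are assumptions chosen for the example.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical statement feed in CSV form; the column names are assumptions.
CSV_FEED = """entity_id,schema,prop,value,dataset
a1,Person,name,Jane Doe,list_a
b2,Company,name,ACME Ltd.,list_a
a1,Person,nationality,de,list_b
a1,Person,name,Jane Doe,list_b
"""


def aggregate(rows):
    """Yield one entity dict per run of rows sharing an entity_id.

    Assumes the rows are already sorted by entity_id, so a single pass
    with constant memory per entity is enough.
    """
    for entity_id, group in groupby(rows, key=itemgetter("entity_id")):
        entity = {"id": entity_id, "schema": None, "datasets": set(), "properties": {}}
        for row in group:
            # Simplification: the last schema seen wins.
            entity["schema"] = row["schema"]
            entity["datasets"].add(row["dataset"])
            values = entity["properties"].setdefault(row["prop"], [])
            if row["value"] not in values:
                values.append(row["value"])
        yield entity


# Sorting stands in for whatever ordering the data backend provides.
rows = sorted(csv.DictReader(io.StringIO(CSV_FEED)), key=itemgetter("entity_id"))
entities = list(aggregate(rows))
for entity in entities:
    print(entity["id"], entity["schema"], entity["properties"], sorted(entity["datasets"]))
```

Because the input is sorted, nothing has to be held in memory beyond the entity currently being assembled, which is the main contrast with merging fragments after the fact.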
#1733 consolidates a medium-aggressive set of schema changes for 4.0, removing Post and Assessment and resolving some of the naming clashes discussed in #1732 through renames. I’ve left some of the gnarly overlaps in the schema for now, because fixing them would necessitate large-scale data migration.
Then there are two smaller PRs. One removes the dependency on fingerprints from followthemoney, since the functionality of that library is now fully contained in rigour. The other consolidates all the code related to RDF exports into one module. Part of the idea here is that not every invocation of ftm should require loading openpyxl and rdflib, two fat dependencies.
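As an illustration of the "don't load fat dependencies on every invocation" idea, one common pattern is to register exporters by dotted path and import them only when they are actually requested. This is a generic sketch, not how the PR does it: the `EXPORTERS` registry and `resolve` helper are hypothetical, and standard-library functions stand in for modules that would pull in rdflib or openpyxl.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical registry: format name -> "module:attribute".
# Stdlib targets are stand-ins; in a real setup these would point at
# modules whose top-level imports are the heavy libraries.
EXPORTERS: Dict[str, str] = {
    "json": "json:dumps",
    "repr": "pprint:pformat",
}


def resolve(path: str) -> Callable[..., Any]:
    """Import the module named in 'module:attribute' and return the attribute.

    The import happens here, at call time, not when this file is loaded.
    """
    module_name, _, attr = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def export(fmt: str, data: Any) -> str:
    """Look up the exporter for fmt and apply it, importing it lazily."""
    if fmt not in EXPORTERS:
        raise ValueError(f"Unknown export format: {fmt}")
    return resolve(EXPORTERS[fmt])(data)


print(export("json", {"id": "a1", "schema": "Person"}))
```

A command that never asks for a given format never pays the import cost for its backing library.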
What I’d still like to do before making this a release: