Splinkclickhouse libraries
Splink models work by specifying what types of comparisons between pairs of records will be taken into account, as described in this Splink user guide on comparisons.
Splink provides several premade tools for common use-cases in the comparison library and comparison level library.
Most of these will work out-of-the-box with splinkclickhouse
- simply use them in the same way described in the Splink documentation by importing:
However, splinkclickhouse
also extends and supplements these for a few potential use-cases.
Custom libraries
The splinkclickhouse
libraries are used in exactly the same way as the Splink libraries.
We generally recommend using a different alias for them to improve code clarity, especially as some have the same name as Splink versions:
import splinkclickhouse.comparison_library as cl_ch
import splinkclickhouse.comparison_level_library as cll_ch
There are:
- versions of Splink's library functions for working with distances between latitude/longitude pairs using built-in Clickhouse functions (rather than Splink's custom SQL),
- versions of Splink's library functions for working with dates stored in an integer column representing days since the Unix epoch (see working with dates),
- a totally new comparison
ExactMatchAtSubstringSizes
suitable for working with hierarchical string data, such as the output of Clickhouse'sgeohashEncode
Use these just as in Splink - either pre-made comparisons:
import splink.comparison_library as cl
from splink import SettingsCreator
import splinkclickhouse.comparison_library as cl_ch
...
settings = SettingsCreator(
link_type="dedupe_only",
comparisons=[
cl.ExactMatch("name"),
cl_ch.DistanceInKMAtThresholds(
"latitude",
"longitude",
[10, 50, 100, 200, 500],
),
],
)
or customise individual levels to construct custom comparisons:
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import SettingsCreator
import splinkclickhouse.comparison_level_library as cll_ch
...
settings = SettingsCreator(
link_type="dedupe_only",
comparisons=[
cl.ExactMatch("name"),
cl.CustomComparison(
comparison_levels = [
cll.And(
cll.NullLevel("city"),
cll.NullLevel("postcode"),
cll.Or(cll.NullLevel("latitude"), cll.NullLevel("longitude"))
),
cll.ExactMatch("postcode"),
cll_ch.DistanceInKMLevel("latitude", "longitude", 5),
cll_ch.DistanceInKMLevel("latitude", "longitude", 10),
cll.ExactMatch("city"),
cll_ch.DistanceInKMLevel("latitude", "longitude", 50),
cll.ElseLevel(),
],
output_column_name="location",
),
],
)
splinkclickhouse.comparison_level_library
The following Clickhouse-specific comparisons are available in the splinkclickhouse
comparison level library.
All others can be found in the ordinary Splink comparison level library.
AbsoluteDateDifferenceLevel(col_name, *, input_is_string, threshold, metric)
Computes the absolute time difference between two dates (total duration). For more details see Splink docs.
In database this represents data as an integer counting number of days since 1970-01-01 (Unix epoch). The input data can be either a string in YYYY-MM-DD format, or an integer of the number days since the epoch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name
|
str
|
The name of the column to compare. |
required |
input_is_string
|
bool
|
If True, the input dates are treated as strings and parsed to integers, and must be in ISO 8601 format. |
required |
threshold
|
int
|
The maximum allowed difference between the two dates,
in units specified by |
required |
metric
|
str
|
The unit of time to use when comparing the dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'. |
required |
DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False, calculation_method='wgs84')
Compare the distance between two latitude and longitude coördinate points as measured in kilometres.
Calculation uses in-built Clickhouse functions, which support either using great circle distance, or the slightly more accurate WGS 84 method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat_col
|
str
|
The name of a latitude column or the respective array or struct column column containing the information For example: long_lat['lat'] or long_lat[0] |
required |
long_col
|
str
|
The name of a longitudinal column or the respective array or struct column column containing the information, plus an index. For example: long_lat['long'] or long_lat[1] |
required |
km_threshold
|
int
|
The total distance in kilometers to evaluate your comparisons against |
required |
not_null
|
bool
|
If true, ensure no attempt is made to compute this if any inputs are null. This is only necessary if you are not capturing nulls elsewhere in your comparison level. |
False
|
calculation_method
|
str
|
The method to use to compute distances from
latitude and longitude.
Can be 'great_circle' for the great circle distance, based on a
spherical model (Clickhouse |
'wgs84'
|
splinkclickhouse.comparison_library
The following Clickhouse-specific comparisons are available in the splinkclickhouse
comparison library.
All others can be found in the ordinary Splink comparison library.
AbsoluteDateDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, term_frequency_adjustments=False, invalid_dates_as_null=True)
Represents a comparison of the data in col_name
with multiple levels based on
absolute time differences. For more details see Splink docs.
In database this represents data as an integer counting number of days since 1970-01-01 (Unix epoch). The input data can be either a string in YYYY-MM-DD format, or an integer of the number days since the epoch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name
|
str
|
The name of the column to compare. |
required |
input_is_string
|
bool
|
If True, the input dates are treated as strings and parsed to integers, and must be in ISO 8601 format. |
required |
metrics
|
Union[DateMetricType, List[DateMetricType]]
|
The unit(s) of time to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'. |
required |
thresholds
|
Union[int, float, List[Union[int, float]]]
|
The threshold(s) to use for the time difference level(s). |
required |
term_frequency_adjustments
|
bool
|
Whether to apply term frequency adjustments. Defaults to False. |
False
|
invalid_dates_as_null
|
bool
|
If True and |
True
|
DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=(1, 1, 10), datetime_metrics=('month', 'year', 'year'), invalid_dates_as_null=True)
Generate an 'out of the box' comparison for a date of birth column
in the col_name
provided. For more details see Splink docs.
In database this represents data as an integer counting number of days since 1970-01-01 (Unix epoch). The input data can be either a string in YYYY-MM-DD format, or an integer of the number days since the epoch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name
|
str
|
The name of the column to compare. |
required |
input_is_string
|
bool
|
If True, the input dates are treated as strings and parsed to integers, and must be in ISO 8601 format. |
required |
datetime_thresholds
|
Union[int, float, List[Union[int, float]]]
|
Numeric thresholds for date differences. Defaults to [1, 1, 10]. |
(1, 1, 10)
|
datetime_metrics
|
Union[DateMetricType, List[DateMetricType]]
|
Metrics for date differences. Defaults to ["month", "year", "year"]. |
('month', 'year', 'year')
|
invalid_dates_as_null
|
bool
|
If True, treat invalid dates as null as opposed to allowing e.g. an exact or levenshtein match where one side or both are an invalid date. Only used if input is a string. Defaults to True. |
True
|
DistanceInKMAtThresholds(lat_col, long_col, km_thresholds, calculation_method='wgs84')
A comparison of the latitude, longitude coordinates defined in 'lat_col' and 'long col' giving the distance between them in km.
An example of the output with km_thresholds = [1, 10] would be:
- The two coordinates are within 1 km of one another
- The two coordinates are within 10 km of one another
- Anything else (i.e. the distance between coordinates are > 10km apart)
This uses in-built Clickhouse functions to calculate the distance, either based on a spherical great-circle model, or based on WGS 84.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lat_col
|
str
|
The name of the latitude column to compare. |
required |
long_col
|
str
|
The name of the longitude column to compare. |
required |
km_thresholds
|
iterable[float] | float
|
The km threshold(s) for the distance levels. |
required |
calculation_method
|
str
|
The method to use to compute distances from
latitude and longitude.
Can be 'great_circle' for the great circle distance, based on a
spherical model (Clickhouse |
'wgs84'
|
ExactMatchAtSubstringSizes(col_name, substring_size_or_sizes=[4, 3, 2], include_full_exact_match=True)
A comparison between columns at several sizes of substring
An example of the output with substring_size_or_sizes = [3, 1] and include_full_exact_match = True would be:
- The two columns match exactly
- The substring of the first three characters of each column match exactly
- The substring of the first character of each column match exactly
This is suitable for a hierarchically structured string, such as a geohash
column (from e.g. the result of geohashEncode
).
See https://clickhouse.com/docs/en/sql-reference/functions/geo/geohash.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
col_name
|
str
|
The name of the column to compare. |
required |
substring_size_or_sizes
|
iterable[int] | int
|
The size(s) of the substrings to compare, taken from the start of the string. Default [4, 3, 2] |
[4, 3, 2]
|
include_full_exact_match
|
bool
|
Whether or not to include a level for an exact match on the full string. Defaults to True. |
True
|