Skip to content

Splinkclickhouse libraries

Splink models work by specifying what types of comparisons between pairs of records will be taken into account, as described in this Splink user guide on comparisons.

Splink provides several premade tools for common use-cases in the comparison library and comparison level library.

Most of these will work out-of-the-box with splinkclickhouse - simply use them in the same way described in the Splink documentation by importing:

import splink.comparison_library as cl
import splink.comparison_level_library as cll

However, splinkclickhouse also extends and supplements these for a few potential use-cases.

Custom libraries

The splinkclickhouse libraries are used in exactly the same way as the Splink libraries. We generally recommend using a different alias for them to improve code clarity, especially as some have the same name as Splink versions:

import splinkclickhouse.comparison_library as cl_ch
import splinkclickhouse.comparison_level_library as cll_ch

There are:

  • versions of Splink's library functions for working with distances between latitude/longitude pairs using built-in Clickhouse functions (rather than Splink's custom SQL),
  • versions of Splink's library functions for working with dates stored in an integer column representing days since the Unix epoch (see working with dates),
  • a totally new comparison ExactMatchAtSubstringSizes suitable for working with hierarchical string data, such as the output of Clickhouse's geohashEncode

Use these just as in Splink - either pre-made comparisons:

import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_library as cl_ch

...
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl_ch.DistanceInKMAtThresholds(
            "latitude",
            "longitude",
            [10, 50, 100, 200, 500],
        ),
    ],
)

or customise individual levels to construct custom comparisons:

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_level_library as cll_ch

...
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl.CustomComparison(
            comparison_levels = [
                cll.And(
                    cll.NullLevel("city"),
                    cll.NullLevel("postcode"),
                    cll.Or(cll.NullLevel("latitude"), cll.NullLevel("longitude"))
                ),
                cll.ExactMatch("postcode"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 5),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 10),
                cll.ExactMatch("city"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 50),
                cll.ElseLevel(),
            ],
            output_column_name="location",
        ),
    ],
)

splinkclickhouse.comparison_level_library

The following Clickhouse-specific comparisons are available in the splinkclickhouse comparison level library.

All others can be found in the ordinary Splink comparison level library.

AbsoluteDateDifferenceLevel(col_name, *, input_is_string, threshold, metric)

Computes the absolute time difference between two dates (total duration). For more details see Splink docs.

In database this represents data as an integer counting number of days since 1970-01-01 (Unix epoch). The input data can be either a string in YYYY-MM-DD format, or an integer of the number days since the epoch.

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
input_is_string bool

If True, the input dates are treated as strings and parsed to integers, and must be in ISO 8601 format.

required
threshold int

The maximum allowed difference between the two dates, in units specified by date_metric.

required
metric str

The unit of time to use when comparing the dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

required

DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False, calculation_method='wgs84')

Compare the distance between two latitude and longitude coördinate points as measured in kilometres.

Calculation uses in-built Clickhouse functions, which support either using great circle distance, or the slightly more accurate WGS 84 method.

Parameters:

Name Type Description Default
lat_col str

The name of a latitude column or the respective array or struct column column containing the information For example: long_lat['lat'] or long_lat[0]

required
long_col str

The name of a longitudinal column or the respective array or struct column column containing the information, plus an index. For example: long_lat['long'] or long_lat[1]

required
km_threshold int

The total distance in kilometers to evaluate your comparisons against

required
not_null bool

If true, ensure no attempt is made to compute this if any inputs are null. This is only necessary if you are not capturing nulls elsewhere in your comparison level.

False
calculation_method str

The method to use to compute distances from latitude and longitude. Can be 'great_circle' for the great circle distance, based on a spherical model (Clickhouse greatCircleDistance), or 'wgs84' to use the more accurate WGS 84 Ellipsoid model (Clickhouse geoDistance). Default 'wgs84'.

'wgs84'

splinkclickhouse.comparison_library

The following Clickhouse-specific comparisons are available in the splinkclickhouse comparison library.

All others can be found in the ordinary Splink comparison library.

AbsoluteDateDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, term_frequency_adjustments=False, invalid_dates_as_null=True)

Represents a comparison of the data in col_name with multiple levels based on absolute time differences. For more details see Splink docs.

In database this represents data as an integer counting number of days since 1970-01-01 (Unix epoch). The input data can be either a string in YYYY-MM-DD format, or an integer of the number days since the epoch.

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
input_is_string bool

If True, the input dates are treated as strings and parsed to integers, and must be in ISO 8601 format.

required
metrics Union[DateMetricType, List[DateMetricType]]

The unit(s) of time to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

required
thresholds Union[int, float, List[Union[int, float]]]

The threshold(s) to use for the time difference level(s).

required
term_frequency_adjustments bool

Whether to apply term frequency adjustments. Defaults to False.

False
invalid_dates_as_null bool

If True and input_is_string is True, treat invalid dates as null. Defaults to True.

True

DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=(1, 1, 10), datetime_metrics=('month', 'year', 'year'), invalid_dates_as_null=True)

Generate an 'out of the box' comparison for a date of birth column in the col_name provided. For more details see Splink docs.

In database this represents data as an integer counting number of days since 1970-01-01 (Unix epoch). The input data can be either a string in YYYY-MM-DD format, or an integer of the number days since the epoch.

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
input_is_string bool

If True, the input dates are treated as strings and parsed to integers, and must be in ISO 8601 format.

required
datetime_thresholds Union[int, float, List[Union[int, float]]]

Numeric thresholds for date differences. Defaults to [1, 1, 10].

(1, 1, 10)
datetime_metrics Union[DateMetricType, List[DateMetricType]]

Metrics for date differences. Defaults to ["month", "year", "year"].

('month', 'year', 'year')
invalid_dates_as_null bool

If True, treat invalid dates as null as opposed to allowing e.g. an exact or levenshtein match where one side or both are an invalid date. Only used if input is a string. Defaults to True.

True

DistanceInKMAtThresholds(lat_col, long_col, km_thresholds, calculation_method='wgs84')

A comparison of the latitude, longitude coordinates defined in 'lat_col' and 'long col' giving the distance between them in km.

An example of the output with km_thresholds = [1, 10] would be:

  • The two coordinates are within 1 km of one another
  • The two coordinates are within 10 km of one another
  • Anything else (i.e. the distance between coordinates are > 10km apart)

This uses in-built Clickhouse functions to calculate the distance, either based on a spherical great-circle model, or based on WGS 84.

Parameters:

Name Type Description Default
lat_col str

The name of the latitude column to compare.

required
long_col str

The name of the longitude column to compare.

required
km_thresholds iterable[float] | float

The km threshold(s) for the distance levels.

required
calculation_method str

The method to use to compute distances from latitude and longitude. Can be 'great_circle' for the great circle distance, based on a spherical model (Clickhouse greatCircleDistance), or 'wgs84' to use the more accurate WGS 84 Ellipsoid model (Clickhouse geoDistance). Default 'wgs84'.

'wgs84'

ExactMatchAtSubstringSizes(col_name, substring_size_or_sizes=[4, 3, 2], include_full_exact_match=True)

A comparison between columns at several sizes of substring

An example of the output with substring_size_or_sizes = [3, 1] and include_full_exact_match = True would be:

  • The two columns match exactly
  • The substring of the first three characters of each column match exactly
  • The substring of the first character of each column match exactly

This is suitable for a hierarchically structured string, such as a geohash column (from e.g. the result of geohashEncode). See https://clickhouse.com/docs/en/sql-reference/functions/geo/geohash.

Parameters:

Name Type Description Default
col_name str

The name of the column to compare.

required
substring_size_or_sizes iterable[int] | int

The size(s) of the substrings to compare, taken from the start of the string. Default [4, 3, 2]

[4, 3, 2]
include_full_exact_match bool

Whether or not to include a level for an exact match on the full string. Defaults to True.

True