Advanced usage
Here are a few topics that are probably less relevant for typical users, but may be required for specialist use-cases.
Working with dates
If you already have dates in your tables with an appropriate type you probably needn't read any further - you can just work as usual with these columns, using Splink library functions AbsoluteTimeDifferenceAtThresholds, DateOfBirthComparison, AbsoluteTimeDifferenceLevel, or writing custom SQL with CustomLevel and CustomComparison.
However if you have unprocessed dates - either String
columns in Clickhouse, or data that you haven't imported into Clickhouse yet, there are a couple of things to be aware of.
Types
If you have dates in strings (for instance if you want to preserve typos which may be invalid dates, but still compare to other dates), Splink can convert during the matching process to a suitable date type using a suitable type conversion function.
Ideally this will be done beforehand to derive a new column with the converted type. This will usually be more performant as you only need to do it once per input record, although it does mean you will need to store an extra column of data for each date.
DateTime64
If you use built-in Splink functions (rather than writing custom SQL) this will use the largest type DateTime64
via the function parseDateTime64BestEffortOrNull
. This function is chosen as:
- the 32-bit
DateTime
type only goes back to 1970 so is unsuitable for many use-cases (where dates-of-birth are commonly used) - in most cases we wish for invalid dates to evaluate to
NULL
for this purpose, and this is the only way to do so withDateTime64
- there is no comparable function where you can specify arbitrary formats
If you wish to use a different date type, or wish to specify a particular format you will need to write your own SQL and wrap in the appropriate conversion function.
Integer dates
DateTime64
is also limited in range, not supporting dates prior to the beginning of 1900.
You may, however, be interested in working with such dates in linkage jobs, such as linking historical data.
splinkclickhouse
comes with some tools to aid with such dates. There is some custom SQL that converts such dates (expressed as ISO-8601 strings) into a signed integer (the number of days since the Unix epoch '1970-01-01', using the proleptic Gregorian calendar). When you create a ClickhouseServerAPI
or ChDBAPI
this function will be refistered as a UDF days_since_epoch
.
Thesplinkclickhouse
custom library functions pertaining to dates if compose the logic for interpreting these integers back as dates to compute time intervals for you. They work either with a pre-processed integer column, or with a string column (which will then use days_since_epoch
under-the-hood).
You can also use the function days_since_epoch
in custom SQL, or use the ColumnExpression
method parse_date_to_int
.
ColumnExpression
splinkclickhouse
provides an extended version of Splink's ColumnExpression
, which includes one additional transformation parse_date_to_int
.
This might be useful if you wish to transform a string column to a date-int, particularly if you needed to combine this with other column transformations:
import splinkclickhouse.comparison_library as cl_ch
from splinkclickhouse.column_expression import ColumnExpression
dateint_column = ColumnExpression("date_string").substr(1, 10).parse_date_to_int()
...
dob_comparison = cl_ch.DateOfBirthComparison(dateint_column, input_is_string=False)
ClickhouseDialect
The configuration of Splink to dialect-specific features is done via splinkclickhouse.dialect.ClickhouseDialect
.
This is accessed from a ClickhouseAPI as ClickhouseAPI.sql_dialect
.
If for some reason you need to adjust any of this behaviour you can modify this attribute at runtime, either with a runtime-modified version, or a custom-implemented SplinkDialect
by inheriting from ClickhouseDialect
and overwriting methods/properties as needed.
This could be useful for example if you wanted to use as default a parsing function other than parseDateTime64BestEffortOrNull
see above, but still enjoy the custom pre-built library comparisons and comparison levels.