Announcing nandist: calculating distances when data is missing

By Wouter Donders, Fri 20 January 2023, in category Blog

open source, python, scipy

Calculating distances is a common task in data science whenever you compare the similarity or dissimilarity of data points. Distances can be measured with many different metrics, such as Euclidean, city block, Minkowski, Chebyshev, cosine, and many others. In Python, you can calculate these distance metrics using the functions in SciPy's subpackage scipy.spatial.distance.

But data is rarely perfect, and you may have missing values in the components of your data points. If you have missing data and try to calculate a distance metric using SciPy's distance functions, they will (correctly) return nan, indicating that the distance cannot be calculated. Although correct, that's not very useful.

For many use cases, it doesn't matter that a particular value is missing. It would be much more useful if the distance functions ignored these components and treated each of them as "any value", adding zero distance to the total distance.
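The idea can be illustrated with plain NumPy. The snippet below is only a minimal sketch of the "missing components add zero distance" rule for the city-block metric; the helper name nan_cityblock is hypothetical and this is not necessarily how nandist implements it internally.

```python
import numpy as np


def nan_cityblock(u, v):
    """City-block distance that skips components where either value is NaN.

    Hypothetical helper for illustration only, not part of nandist.
    """
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    # np.nansum treats NaN entries as zero, so missing components add no distance
    return float(np.nansum(diff))


u = np.array([0.0, 1.0])
v = np.array([np.nan, 0.0])

naive = float(np.sum(np.abs(u - v)))  # the NaN propagates through a plain sum
ignoring = nan_cityblock(u, v)        # only the second component contributes
print(naive, ignoring)
```

The plain sum yields nan, while the NaN-ignoring version only counts the component that is present in both vectors.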

Our open source Python package nandist implements drop-in replacements for SciPy's most frequently used distance functions, including euclidean, cityblock, chebyshev, cosine, and minkowski, as well as cdist and pdist, the fast distance functions for collections of data points. The functions in nandist use the exact same API as their SciPy counterparts and therefore also support weighted distance calculations.
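To make the weighted case concrete, here is a rough NumPy sketch of a weighted Euclidean distance that skips NaN components. It assumes SciPy's weighted definition, sqrt(sum(w * (u - v)**2)), and the helper name nan_weighted_euclidean is hypothetical; how nandist combines weights with missing values internally is an assumption here, not something taken from its source.

```python
import numpy as np


def nan_weighted_euclidean(u, v, w=None):
    """Weighted Euclidean distance in which NaN components contribute zero.

    Hypothetical helper for illustration only, not part of nandist.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq = (u - v) ** 2
    if w is not None:
        # Scale each squared difference by its weight, as SciPy's `w` argument does
        sq = np.asarray(w, dtype=float) * sq
    # NaN terms are dropped from the sum before taking the square root
    return float(np.sqrt(np.nansum(sq)))


d = nan_weighted_euclidean([0, 1, 2], [np.nan, 0, 0], w=[1, 1, 2])
print(d)
```

Here the first component is missing, so only the weighted terms 1 * 1 and 2 * 4 are summed before the square root is taken.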

Installation

You can simply install nandist using pip:

pip install nandist

Examples

Calculating cityblock distance between two vectors

A simple example of calculating the cityblock distance between (0, 1) and (NaN, 0) is shown below:

>>> import nandist
>>> import scipy.spatial.distance
>>> import numpy as np

>>> # City-block distance between (0, 1) and (NaN, 0)
>>> u, v = np.array([0, 1]), np.array([np.nan, 0])
>>> scipy.spatial.distance.cityblock(u, v)
nan
>>> nandist.cityblock(u, v)
1.0

Calculating cityblock distance between collections of data

You can get the pairwise distances between the rows of two matrices using cdist. The NaNs do not need to be in the same component.

>>> import nandist
>>> import numpy as np

>>> # City-block distances between vectors A = [(0, 0), (1, NaN)] and vectors B = [(1, NaN), (1, 1)]
>>> XA, XB = np.array([[0, 0], [1, np.nan]]), np.array([[1, np.nan], [1, 1]])
>>> Y = nandist.cdist(XA, XB, metric="cityblock")
>>> Y
array([[1., 2.],
       [0., 0.]])
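The result above can be checked by hand with NumPy broadcasting. The following is a minimal sketch that assumes the "NaN adds zero distance" rule described earlier; the helper name nan_cdist_cityblock is hypothetical and the code is not nandist's actual implementation.

```python
import numpy as np


def nan_cdist_cityblock(XA, XB):
    """Pairwise city-block distances between rows of XA and XB, ignoring NaNs.

    Hypothetical helper for illustration only, not part of nandist.
    """
    XA = np.asarray(XA, dtype=float)
    XB = np.asarray(XB, dtype=float)
    # Broadcast to shape (len(XA), len(XB), n_components)
    diff = np.abs(XA[:, None, :] - XB[None, :, :])
    # Summing with nansum drops every component where either row has a NaN
    return np.nansum(diff, axis=-1)


XA = np.array([[0, 0], [1, np.nan]])
XB = np.array([[1, np.nan], [1, 1]])
D = nan_cdist_cityblock(XA, XB)
print(D)
```

Row (1, NaN) is at distance zero from both (1, NaN) and (1, 1), because the only component present in both vectors matches.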

Read more:

Read more in the documentation on GitLab: