By Wouter Donders, Fri 20 January 2023, in category Blog
Calculating distances is a frequent part of data science when comparing the similarity or dissimilarity of data points.
Distances can be measured in more than one metric, such as Euclidean, Cityblock, Minkowski, Chebyshev, Cosine, and many others.
In Python, you can calculate these distance metrics using functions in SciPy's scipy.spatial.distance subpackage.
But data is rarely perfect, and you may have missing values in the components of your data points.
If you have missing data and try to calculate a distance metric using SciPy's distance functions, it will (correctly) return nan, indicating that the distance cannot be calculated.
Although correct, that's not very useful.
For many use cases, it doesn't matter that a particular value is missing. It would be much more useful if the distance functions ignored these components, treating each missing value as "any value" that adds zero distance to the total.
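To make the "adds zero distance" idea concrete, here is a minimal NumPy sketch for the city-block metric. The helper name nan_cityblock is hypothetical and only illustrates the principle; it is not how nandist is implemented.

```python
import numpy as np


def nan_cityblock(u, v):
    """Illustrative helper (hypothetical, not part of nandist or SciPy).

    City-block distance in which any component that is NaN in either
    vector contributes zero to the total.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # The difference is NaN wherever either input is NaN.
    diff = np.abs(u - v)
    # nansum treats NaN terms as zero, so missing components are skipped.
    return float(np.nansum(diff))


print(nan_cityblock([0, 1], [np.nan, 0]))  # 1.0
```

Only the second component is compared here, so the result is |1 - 0| = 1.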
Our open-source Python package nandist implements drop-in replacements for SciPy's most frequently used distance functions, including minkowski and the fast distance functions for collections of data points.
The functions in nandist use the exact same API and therefore also support weighted distance calculations.
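As a rough illustration of how weights and missing values can combine, the sketch below computes a weighted Minkowski distance that skips NaN components. It assumes SciPy's convention of multiplying each |u_i - v_i|^p term by its weight before summing. The function nan_minkowski is a hypothetical stand-in written in plain NumPy, not nandist's own code.

```python
import numpy as np


def nan_minkowski(u, v, p=2, w=None):
    """Illustrative helper (hypothetical, not part of nandist or SciPy).

    Weighted Minkowski distance where components that are NaN in either
    vector contribute zero. Assumes the convention
    (sum_i w_i * |u_i - v_i|**p) ** (1 / p).
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    terms = np.abs(u - v) ** p  # NaN wherever a component is missing
    if w is not None:
        terms = np.asarray(w, dtype=float) * terms
    # nansum drops the NaN terms, i.e. they add zero distance.
    return float(np.nansum(terms) ** (1.0 / p))


# Unweighted: only the first two components count, sqrt(9 + 16).
d_plain = nan_minkowski([0, 0, 0], [3, 4, np.nan])
# Weighted: sqrt(1 * 9 + 4 * 4); the weight on the NaN component is irrelevant.
d_weighted = nan_minkowski([0, 0, 0], [3, 2, np.nan], w=[1, 4, 1])
print(d_plain, d_weighted)
```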
You can install nandist with pip:
pip install nandist
A simple example for calculating the cityblock distance between (0, 1) and (NaN, 0) is shown below:
>>> import nandist
>>> import scipy
>>> import numpy as np
>>> # City-block distance between (0, 1) and (NaN, 0)
>>> u, v = np.array([0, 1]), np.array([np.nan, 0])
>>> scipy.spatial.distance.cityblock(u, v)
nan
>>> nandist.cityblock(u, v)
1.0
You can get pairwise distances between arrays in two matrices using cdist. The NaNs do not need to be in the same component.
>>> import nandist
>>> import numpy as np
>>> # City-block distances between vectors A = [(0, 0), (1, NaN)] and vectors B = [(1, NaN), (1, 1)]
>>> XA, XB = np.array([[0, 0], [1, np.nan]]), np.array([[1, np.nan], [1, 1]])
>>> Y = nandist.cdist(XA, XB, metric="cityblock")
>>> Y
array([[1., 2.],
       [0., 0.]])
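The pairwise result above can be reproduced by hand with NumPy broadcasting, which shows why the NaNs do not need to be in the same component. This is only a sketch of the idea for the city-block metric; nan_cdist_cityblock is a hypothetical name, and nandist's actual implementation may work differently.

```python
import numpy as np


def nan_cdist_cityblock(XA, XB):
    """Illustrative helper (hypothetical, not part of nandist or SciPy).

    Pairwise city-block distances between the rows of XA and the rows
    of XB, where a NaN in either row adds zero for that component.
    """
    XA = np.asarray(XA, dtype=float)
    XB = np.asarray(XB, dtype=float)
    # Broadcast to shape (len(XA), len(XB), n_components).
    diff = np.abs(XA[:, None, :] - XB[None, :, :])
    # Sum over components, treating NaN differences as zero.
    return np.nansum(diff, axis=-1)


XA = np.array([[0, 0], [1, np.nan]])
XB = np.array([[1, np.nan], [1, 1]])
Y = nan_cdist_cityblock(XA, XB)
print(Y)
```

For these inputs the sketch gives the same values as the cdist example: 1 and 2 in the first row, 0 and 0 in the second.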
Read more in the documentation on GitLab: