I accidentally implemented the fastest pairwise cosine similarity function
While searching for a way to efficiently compute pairwise cosine similarity between vectors, I ended up with a simple and efficient implementation in PyTorch. The function runs blazingly fast: it is faster than the popular cosine_similarity function from sklearn and orders of magnitude faster than naive loop-based implementations.
import torch
from torch import Tensor


def pairwise_cosine_similarity(tensor: Tensor) -> Tensor:
    """
    Args:
        tensor: A tensor of shape (N, D) where N is the number of
            vectors and D is the dimensionality of the vectors.

    Returns:
        A tensor of shape (N, N) containing the cosine similarity
        between all pairs of vectors.
    """
    tmm = torch.mm(tensor, tensor.T)
    denom = torch.sqrt(tmm.diagonal()).unsqueeze(0)
    denom_mat = torch.mm(denom.T, denom)
    return torch.nan_to_num(tmm / denom_mat)
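A quick sanity check (a minimal sketch; the (8, 64) shape is arbitrary) comparing one entry against PyTorch's built-in single-pair routine:

import torch

x = torch.rand(8, 64)  # 8 vectors of dimension 64
sim = pairwise_cosine_similarity(x)
print(sim.shape)  # torch.Size([8, 8])

# Cross-check one pair against torch's built-in cosine_similarity
expected = torch.cosine_similarity(x[0].unsqueeze(0), x[3].unsqueeze(0))
print(torch.allclose(sim[0, 3], expected))  # True (within floating-point tolerance)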
About cosine similarity
Cosine similarity is an intuitive metric for measuring the similarity between two vectors. It is widely used in vision, recommendation systems, search engines, and natural language processing. The cosine similarity between two vectors is defined as the cosine of the angle between them, ranging from -1 to 1, where 1 means the vectors point in the same direction, -1 means they are opposite, and 0 means they are orthogonal.
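A toy example (made-up vectors, just to illustrate the three cases):

import torch

a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])   # orthogonal to a
c = torch.tensor([2.0, 0.0])   # same direction as a, different magnitude

print(torch.cosine_similarity(a, b, dim=0))   # tensor(0.)  -> orthogonal
print(torch.cosine_similarity(a, c, dim=0))   # tensor(1.)  -> same direction
print(torch.cosine_similarity(a, -a, dim=0))  # tensor(-1.) -> opposite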
Edit (March 2024): Be cautious about using cosine similarity when working with embeddings. Please check out Is Cosine-Similarity of Embeddings Really About Similarity?
Why pairwise similarity is slow
We usually want to compute the cosine similarity between all pairs of vectors. This enables things like searching for similar vectors (e.g. finding similar images) and analyzing the structure of the vector space (e.g. clustering images). However, with a large matrix, computing the cosine similarity for every pair can be computationally expensive.
We often find ourselves with a matrix of shape (N, D), where N is the number of vectors and D is the dimensionality of the vectors. D is also called the embedding size or embedding dimension, and it is typically a power of 2, e.g. 64, 512, 1024, or 4096.
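The naive baseline is a double loop over all N * N pairs, calling a single-pair similarity function each time. A minimal sketch of that baseline (the shapes here are arbitrary):

import torch

N, D = 1000, 512
data = torch.rand(N, D)

# O(N^2) Python-level calls: even at N = 1000 this is a million iterations.
sim = torch.zeros(N, N)
for i in range(N):
    for j in range(N):
        sim[i, j] = torch.cosine_similarity(data[i], data[j], dim=0)

This is exactly the kind of implementation the benchmark below shows to be orders of magnitude slower than the vectorized version.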
The method
The method is based on the definition of cosine similarity between two vectors $u$ and $v$:

$$\operatorname{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
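In matrix form (my notation): for an input matrix $A$ of shape $(N, D)$ whose rows are the vectors, the Gram matrix $G = A A^\top$ contains every pairwise dot product, its diagonal contains the squared norms, and dividing element-wise by the outer product of the norms yields the full similarity matrix:

$$G = A A^\top, \qquad d_i = \sqrt{G_{ii}}, \qquad S_{ij} = \frac{G_{ij}}{d_i \, d_j}$$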
Let's explain what the code does:
tmm = torch.mm(tensor, tensor.T)
denom = torch.sqrt(tmm.diagonal()).unsqueeze(0)
denom_mat = torch.mm(denom.T, denom)
return torch.nan_to_num(tmm / denom_mat)
- Numerator matrix: dot product via matrix multiplication

  tmm = torch.mm(tensor, tensor.T)

  This line computes the matrix multiplication of the input tensor with its transpose tensor.T. The result is a tensor tmm of shape (N, N) where element (i, j) is the dot product of vectors i and j.

- Denominator values: norm of each vector

  denom = torch.sqrt(tmm.diagonal()).unsqueeze(0)

  The diagonal of tmm contains the dot product of each vector with itself. Taking the square root of these values gives the magnitude (or norm) of each vector. unsqueeze(0) adds an extra dimension, changing the shape from (N,) to (1, N).

- Denominator matrix: outer product of the norms

  denom_mat = torch.mm(denom.T, denom)

  This line computes the outer product of the vector denom with itself, resulting in a matrix denom_mat of shape (N, N). Element (i, j) of this matrix is the product of the magnitudes of vectors i and j.

- NaN removal

  return torch.nan_to_num(tmm / denom_mat)

  Finally, the cosine similarity between each pair of vectors is obtained by dividing the dot-product matrix tmm element-wise by denom_mat, so element (i, j) of the result is the cosine similarity between vectors i and j. torch.nan_to_num replaces any NaN values that might occur during the division (for example, when a vector has zero norm) with zeros.
The output is a symmetric matrix where the diagonal elements are all 1 (since the cosine similarity of a vector with itself is always 1), and the off-diagonal elements represent the cosine similarity between different pairs of vectors!
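A quick check of these properties (a small sketch with random data; the zero row is there only to exercise the NaN handling):

import torch

x = torch.rand(6, 32)
x[0] = 0.0  # a zero vector would normally cause a division by zero

sim = pairwise_cosine_similarity(x)

print(torch.allclose(sim, sim.T))                         # True: symmetric
print(torch.allclose(sim.diagonal()[1:], torch.ones(5)))  # True: self-similarity is 1
print(sim[0])  # all zeros: the NaNs from the zero vector were replaced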
Benchmarking
Versus naive loops, our approach outperforms them by several orders of magnitude. Versus sklearn.metrics.pairwise.cosine_similarity, our implementation is roughly 10x faster, and versus a numpy implementation using the exact same logic, our PyTorch code is about 2-3x faster.
Results
Result within 1e-8 of scipy loop: True
Result within 1e-8 of numpy loop: True
Result within 1e-8 of torch loop: True
Result within 1e-8 of sklearn: True
Result within 1e-8 of numpy matrix: True
scipy loop: 142362.0 us
numpy loop: 112752.9 us
torch loop: 83144.2 us
sklearn: 401.8 us
numpy matrix: 136.2 us
✨ ours: 48.6 us
Benchmark code
Below is the benchmark code used to test the functions.
import torch
import numpy as np
import scipy.spatial
from sklearn.metrics.pairwise import cosine_similarity


def torch_loop_cosine_similarity(tensor, output):
    for i in range(len(tensor)):
        for j in range(len(tensor)):
            output[i, j] = torch.cosine_similarity(
                tensor[i].unsqueeze(0), tensor[j].unsqueeze(0)
            )
    return output


def scipy_loop_cosine_similarity(tensor, output):
    for i in range(len(tensor)):
        for j in range(len(tensor)):
            # scipy returns the cosine *distance*, so convert it to a similarity
            output[i, j] = 1.0 - scipy.spatial.distance.cosine(tensor[i], tensor[j])
    return output


def numpy_loop_cosine_similarity(tensor, output):
    for i in range(len(tensor)):
        for j in range(len(tensor)):
            output[i, j] = float(
                np.dot(tensor[i], tensor[j])
                / (np.linalg.norm(tensor[i]) * np.linalg.norm(tensor[j]))
            )
    return output


def numpy_matrix_cosine_similarity(tensor):
    tmm = np.matmul(tensor, tensor.T)
    denom = np.sqrt(tmm.diagonal())[None, :]  # [None, :] plays the role of unsqueeze(0)
    denom_mat = np.matmul(denom.T, denom)
    return np.nan_to_num(tmm / denom_mat)


def torch_matrix_cosine_similarity(tensor):
    tmm = torch.mm(tensor, tensor.T)
    denom = torch.sqrt(tmm.diagonal()).unsqueeze(0)
    denom_mat = torch.mm(denom.T, denom)
    return torch.nan_to_num(tmm / denom_mat)
if __name__ == "__main__":
    # Create a random data tensor of shape (N=50, D=64)
    data = torch.rand((50, 64))

    # Create the output tensor (cloned below so each loop gets its own buffer)
    zeros = torch.zeros((data.shape[0], data.shape[0]))

    # Compare across different implementations
    sklearn = torch.Tensor(cosine_similarity(data, data))
    npy_matrix = torch.Tensor(numpy_matrix_cosine_similarity(data))
    npy_loop = numpy_loop_cosine_similarity(data, zeros.clone())
    spy_loop = scipy_loop_cosine_similarity(data, zeros.clone())
    torch_loop = torch_loop_cosine_similarity(data, zeros.clone())
    ours = torch_matrix_cosine_similarity(data)

    print(f"Result within 1e-8 of scipy loop: {bool(torch.allclose(ours, spy_loop))}")
    print(f"Result within 1e-8 of numpy loop: {bool(torch.allclose(ours, npy_loop))}")
    print(f"Result within 1e-8 of torch loop: {bool(torch.allclose(ours, torch_loop))}")
    print(f"Result within 1e-8 of sklearn: {bool(torch.allclose(ours, sklearn))}")
    print(
        f"Result within 1e-8 of numpy matrix: {bool(torch.allclose(ours, npy_matrix))}"
    )
    import timeit

    t0 = timeit.Timer(
        stmt="torch_loop_cosine_similarity(data, output)",
        setup="from __main__ import torch_loop_cosine_similarity; import torch; output = torch.zeros((data.shape[0], data.shape[0]));",
        globals={"data": data},
    )
    t1 = timeit.Timer(
        stmt="cosine_similarity(data, data)",
        setup="from __main__ import cosine_similarity",
        globals={"data": data},
    )
    t2 = timeit.Timer(
        stmt="scipy_loop_cosine_similarity(data, output)",
        setup="from __main__ import scipy_loop_cosine_similarity; import torch; output = torch.zeros((data.shape[0], data.shape[0]));",
        globals={"data": data},
    )
    t5 = timeit.Timer(
        stmt="numpy_loop_cosine_similarity(data, output)",
        setup="from __main__ import numpy_loop_cosine_similarity; import torch; output = torch.zeros((data.shape[0], data.shape[0]));",
        globals={"data": data},
    )
    t3 = timeit.Timer(
        stmt="numpy_matrix_cosine_similarity(data)",
        setup="from __main__ import numpy_matrix_cosine_similarity",
        globals={"data": data},
    )
    t4 = timeit.Timer(
        stmt="torch_matrix_cosine_similarity(data)",
        setup="from __main__ import torch_matrix_cosine_similarity",
        globals={"data": data},
    )
print(f"scipy loop: {t2.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"numpy loop: {t5.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"torch loop: {t0.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"sklearn: {t1.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"numpy matrix: {t3.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"✨ ours: {t4.timeit(10) / 100 * 1e6:>5.1f} us")