Efficient Algorithms for Vector Similarities

Speaker: Sandeep Silwal
Title: Efficient Algorithms for Vector Similarities

Abstract:

A key cog in machine learning is the humble embedding: a vector representation of real-world objects such as text, images, graphs, and even molecules. It is common to curate massive datasets of embeddings by running inference with an ML model of choice. However, the sheer dataset size and high dimensionality are often *the* bottleneck in effectively leveraging and learning from this rich data. Inspired by this computational bottleneck in modern machine learning pipelines, we study the following question: "How can we efficiently compute on large-scale, high-dimensional data?"

In this thesis, we focus on two aspects of this question.

(1) Fast local similarity computation: we give faster algorithms for computing complicated notions of similarity, such as those between collections of vectors (think optimal transport), as well as dimensionality reduction techniques that preserve similarities (see the sketch below). Beyond computational efficiency, we also consider other resource constraints such as space and privacy.
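
As a toy illustration of the distance-preserving reductions mentioned in (1), the following Python sketch applies a random Gaussian projection in the Johnson-Lindenstrauss style. The dataset and the dimensions n, d, k are hypothetical placeholders, not parameters from the thesis.

    # Minimal sketch: distance-preserving dimensionality reduction via a
    # random Gaussian (Johnson-Lindenstrauss-style) projection. The data
    # and the dimensions n, d, k below are hypothetical placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 1000, 4096, 256            # n points in d dims, reduced to k

    X = rng.standard_normal((n, d))      # stand-in for a table of embeddings
    G = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled random projection
    Y = X @ G                            # reduced embeddings, shape (n, k)

    # Pairwise distances survive up to roughly 1/sqrt(k) relative error.
    i, j = 0, 1
    print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))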

(2) Fast global similarity analysis: we study algorithms for analyzing global relationships between vectors encoded in similarity matrices. Our algorithms compute on similarity matrices, such as distance or kernel matrices, without ever explicitly forming them, thus avoiding a quadratic time and space bottleneck (see the sketch below).
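
To make the "compute without forming the matrix" idea in (2) concrete, here is a minimal Python sketch of a Gaussian-kernel matrix-vector product that streams over row blocks, so the quadratic-size matrix is never materialized. This is a brute-force baseline for illustration only: it avoids the quadratic space cost, while the algorithms in the thesis go further and avoid the quadratic time of this pass as well. The data, block size, and bandwidth-free kernel are illustrative assumptions.

    # Minimal sketch: the n x n Gaussian kernel matrix K, with
    # K[i, j] = exp(-||x_i - x_j||^2), is never stored; the product K @ v
    # is accumulated one block of rows at a time in O(n * block) extra space.
    import numpy as np

    def kernel_matvec(X, v, block=128):
        n = X.shape[0]
        out = np.zeros(n)
        sq = (X ** 2).sum(axis=1)        # precomputed squared norms
        for start in range(0, n, block):
            stop = min(start + block, n)
            # Squared distances from this block of points to all n points.
            d2 = sq[start:stop, None] - 2.0 * (X[start:stop] @ X.T) + sq[None, :]
            out[start:stop] = np.exp(-np.maximum(d2, 0.0)) @ v
        return out

    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 64))  # stand-in embedding dataset
    v = rng.standard_normal(2000)
    print(kernel_matvec(X, v)[:3])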

Overall, my main message is that sublinear algorithm design principles are instrumental in designing scalable algorithms for big data. In the presentation, I will cover only a subset of results from each category, with a strong emphasis on simplicity.

Thesis Committee:
Piotr Indyk (Advisor) - MIT
Ronitt Rubinfeld - MIT
Huy Nguyen - Northeastern University