Research similarity metrics and contribute to TF


“Currently, the state-of-the-art similarity metrics are only implemented in R. We want to port these to Python and implement them in frameworks like TensorFlow.”

Context of the internship

  • The Wasserstein distance has been around for decades but has recently been causing a furore in ML. In essence, it measures how different two distributions are, and the result is a number between 0 (identical distributions) and +inf.
  • Now, we can use the Wasserstein distance as a metric to quantify the difference between two probability distributions, but on real-life data we have to use a parametric version of it to estimate the actual Wasserstein distance between the two underlying distributions.
  • The question that pops up is: when do we call two distributions different based on the Wasserstein distance? How do we go about hypothesis testing? 🤔
  • We are not the first ones to think about this. Schefzik et al. have come up with a way to test this and implemented it in R.
  • So... we want to make this test available in Python and add it to SciPy and TensorFlow Data Validation.
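
To make the idea concrete, here is a minimal sketch of what "hypothesis testing with the Wasserstein distance" can look like. It is not the procedure of Schefzik et al.; it swaps in a plain permutation test on one-dimensional samples of equal size, purely for illustration. The function names `wasserstein_1d` and `permutation_test`, the simulated Gaussian data and all parameter values are assumptions made for this example. For p = 1, SciPy's `scipy.stats.wasserstein_distance` computes the same one-dimensional quantity.

```python
import random


def wasserstein_1d(x, y, p=1):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples."""
    if len(x) != len(y):
        raise ValueError("this sketch only handles equal sample sizes")
    xs, ys = sorted(x), sorted(y)
    # For equal-size 1-D samples the optimal coupling pairs up order statistics.
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def permutation_test(x, y, p=1, n_permutations=500, seed=0):
    """Return (observed distance, permutation p-value) for two samples."""
    rng = random.Random(seed)
    observed = wasserstein_1d(x, y, p)
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffling the pooled data simulates "no difference".
        rng.shuffle(pooled)
        if wasserstein_1d(pooled[:n], pooled[n:], p) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)


# Simulated data (an assumption for this example): two Gaussians whose
# means differ by one standard deviation.
rng = random.Random(42)
baseline = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(200)]

distance, p_value = permutation_test(baseline, shifted)
print(f"distance={distance:.3f}, p-value={p_value:.4f}")
```

A small p-value says that a distance this large would be unlikely if both samples came from the same distribution. A production-quality test would also need to handle unequal sample sizes and multivariate data, and to justify its choice of p.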

What you’ll learn

  • State-of-the-art metrics used extensively in GANs, similarity search, clustering, anomaly detection and pattern discovery;
  • How to make an impactful contribution to the open-source community and write a blog post about it;
  • How a top-notch AI consultancy firm operates internally.