Numerous problems appear when the analysed data is very large. Things get even worse when that data is also high-dimensional, which often means it is sparse. We at source{d} deal with such “cursed” data every day, feeding terabytes of source code to our ML models.
This talk focuses on various battle-tested techniques for tackling high-dimensional, not-so-small data: t-SNE, word2vec, ARTM, K-means, PCA, LSH. Besides the theoretical aspects, practical use cases are presented, coupled with examples of production-ready tools: sklearn, TensorFlow Projector, Facebook fastText and fbPCA, BigARTM, source{d} kmcuda and minhashcuda. These tools can deliver savings of up to 100x compared to a typical Hadoop/Spark cluster.
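As a flavour of the kind of pipeline the talk covers, here is a minimal sketch (not taken from the talk itself) combining two of the listed techniques with sklearn: a PCA-style reduction of a sparse, high-dimensional text matrix via TruncatedSVD, followed by a 2D t-SNE embedding. The dataset, feature counts, and parameter values are illustrative assumptions, not the speakers’ settings.

```python
# Sketch: sparse high-dimensional text vectors -> PCA-style reduction -> t-SNE.
# 20 newsgroups is used purely as a stand-in corpus; any large text corpus works.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

# Sparse document-term matrix: tens of thousands of dimensions, mostly zeros.
tfidf = TfidfVectorizer(max_features=50000, stop_words="english")
X = tfidf.fit_transform(docs)

# PCA-style reduction to a dense 50-dimensional space; TruncatedSVD accepts
# sparse input directly and keeps the subsequent t-SNE step tractable.
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# t-SNE produces a 2D embedding suitable for visual inspection
# (e.g. in TensorFlow Projector).
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)
print(X_2d.shape)  # (2000, 2)
```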
Required audience experience: a basic linear algebra background (vectors, matrices, etc.).
Objective of the talk: Introduce the audience to recent advances in processing high-dimensional, large-scale data (e.g. natural language) and show how to apply proven software to real-world problems. One of the goals is to show that it is sometimes far better to use an optimized, specialized tool than to stick with Hadoop.
Keywords: big data, clustering, locality sensitive hashing, natural language processing, topic modeling, principal component analysis, t-SNE, word2vec, GPGPU
You can view Vadim and Egor’s slides here:
You can watch their presentation below: