Practical Methods for Big Data: Trees, Boosting, Hashing, and Near Neighbor Search
Date: 2016/5/4


Speaker: Ping Li

Time: May 4, 11:00am - 12:00pm

Location: Room 306, Teaching Center


The tutorial will be delivered in Chinese, although the slides will be in English. The content of the talk is based solely on published papers.

In this talk, I will focus on a variety of recently developed novel hashing algorithms for efficient search and learning with massive data. The plan is to cover the following topics in detail:

1. Trees & boosting and comparisons with deep learning.

Trees & boosting algorithms have been extremely popular in industry (e.g., learning to rank, ads, risk control, fraud detection, credit analysis). I will cover recent work on robust logitboost and adaptive base class boosting, and show how these methods improve the widely used GBDT algorithm. In addition, surprising comparisons with deep nets are available via a well-known empirical study conducted by deep learning researchers.
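As a toy illustration of the boosting idea above, the sketch below fits decision stumps to residuals under squared loss, a simplified form of gradient boosting. It is not the robust logitboost method discussed in the talk (which uses logistic loss and second-order information); all names here are illustrative.

```python
# Minimal gradient boosting with decision stumps on a 1-D feature,
# squared loss. Illustrative sketch only, not the speaker's method.

def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    """Additive model: each new stump fits the current residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        pred = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return stumps, pred
```

Each round shrinks the residual by a constant factor on this kind of separable data, which is also why many trees are typically needed for good accuracy, one of the drawbacks noted below.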

The main drawbacks of tree & boosting algorithms are: (1) trees are not suitable for nominal categorical variables with many categories (e.g., millions or even billions), yet in modern search applications (e.g., ads) such features are common; (2) training is slow; (3) to achieve good accuracy, many trees are typically needed, so the model size is often very large.

2. Hashing algorithms for very large-scale learning.

Hashing methods provide a simple and effective strategy for machine learning from massive data. The core idea is to randomize the original data (often in a creative way) so that, in expectation, the inner product of the transformed data equals some linear or nonlinear kernel of the original data. This allows practitioners to apply highly efficient modern linear methods both to massive data and to (linear or nonlinear) kernel learning. In practice, some nonlinear kernels, for example the resemblance kernel, the chi-square kernel, or the min-max kernel, produce accurate results, comparable to (or better than) sophisticated and computationally intensive learning methods. Therefore, if used appropriately, hashing methods provide a simple, scalable alternative that achieves both accuracy and efficiency. This talk will cover a wide range of hashing methods the speaker has worked on over the past 10 years: b-bit minwise hashing, one-permutation hashing, 0-bit consistent weighted sampling, very sparse random projections, sign stable random projections, CoRE kernels, conditional random sampling, and more.
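The "inner product in expectation" idea can be seen in its simplest form with minwise hashing for the resemblance (Jaccard) kernel: the fraction of matching min-hashes between two sets estimates their Jaccard similarity. The sketch below uses random linear hash functions in place of true permutations, a common practical approximation; b-bit minwise hashing (covered in the talk) would additionally keep only the lowest b bits of each value to save space. Function names are illustrative.

```python
# Minwise hashing sketch: the match rate of k min-hashes estimates
# the Jaccard similarity of two sets, in expectation.
import random

def minhash_signature(items, k, seed=0):
    """k min-hashes of a set, via random linear hash functions h(x)=(a*x+b)%p."""
    rng = random.Random(seed)
    p = (1 << 31) - 1  # a Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimate_jaccard(sig1, sig2):
    """Fraction of coordinates where the two signatures collide."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)

A = set(range(0, 80))
B = set(range(40, 120))          # true Jaccard = 40 / 120 = 1/3
sA = minhash_signature(A, k=500)
sB = minhash_signature(B, k=500)
est = estimate_jaccard(sA, sB)   # concentrates near 1/3
```

The signatures are fixed-length regardless of set size, which is what makes linear learning methods applicable to the transformed data.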

3. Hashing algorithms for efficient indexing and search.

Hashing has been widely used in practice for indexing and sublinear-time near neighbor search, a fundamental task in computer science with numerous practical applications. Many of the hashing methods the speaker has worked on can be seamlessly integrated into indexing and search systems. The talk will cover the basic ideas and will focus on the recent exciting development of “densified one permutation hashing”, which provides a very efficient replacement for a popular hashing method used in near neighbor search. The talk will also cover a new topic on indexing inner products (NIPS 2014 Best Paper), which has direct applications in recommender systems (e.g., collaborative filtering) and other tasks.
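The basic indexing idea can be sketched with a simple locality-sensitive hashing scheme based on sign random projections (one of the hash families mentioned in the talk): points are bucketed by the signs of a few random projections, and a query only examines points that collide with it in some table. This is a toy illustration of the LSH framework, not densified one permutation hashing itself; all names are illustrative.

```python
# Toy LSH index with sign random projections: nearby points (small angle)
# agree on most sign bits, so they land in the same bucket often.
import random
from collections import defaultdict

def sign_projection_code(vec, planes):
    """One bit per random hyperplane: the sign of the dot product."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0)
                 for plane in planes)

def build_index(points, n_tables=8, n_bits=6, seed=0):
    """n_tables independent hash tables, each keyed by an n_bits sign code."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)]
                  for _ in range(n_bits)]
        buckets = defaultdict(list)
        for i, pt in enumerate(points):
            buckets[sign_projection_code(pt, planes)].append(i)
        tables.append((planes, buckets))
    return tables

def query(index, q):
    """Candidate neighbors: union of q's buckets across all tables."""
    cands = set()
    for planes, buckets in index:
        cands.update(buckets.get(sign_projection_code(q, planes), []))
    return cands
```

Only candidates returned by `query` need exact distance computation, which is what makes the search sublinear when buckets are small relative to the dataset.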


Ping Li is an Associate Professor in the Department of Statistics and the Department of Computer Science at Rutgers University, USA. He received his Ph.D. in Statistics from Stanford University, where he also earned two master's degrees, in Computer Science and in Electrical Engineering. He is a recipient of the AFOSR (Air Force Office of Scientific Research) Young Investigator Award (AFOSR-YIP) and of the ONR (Office of Naval Research) Young Investigator Award (ONR-YIP). With co-authors, he won the Best Paper Award at NIPS 2014, the Best Paper Award at ASONAM 2014, and the Best Student Paper Award at KDD 2006.

Current research: (1) extremely large-scale ads CTR learning; (2) learning to rank for search & image search; (3) hashing methods for many search & learning tasks; (4) trees; (5) data streams. 

SIST-Seminar 16030