Massive datasets abound in a wide variety of application such as observations of the heavens and earth, genomic analyses, voice and data networks, and transactional data arising from commerce and medical care.|
In many applications, the size of the dataset is much larger than the size of the useful information in the dataset. A typical class of problem is called “heavy hitters,” in which we must track the most significant k items out of n >> k. We will want to use resources (time, space, communication) polynomial in k but much less than n. Typically, we present randomized approximation algorithms that, with high probability, give a solution that may be less than optimal but is provably close to optimal. We may also need to satisfy additional constraints, such as privacy constraints in cryptographic contexts.
Another research direction is developing algorithms to analyze large data collections using map-reduce operations, such as in Hadoop, and other scalable operations for distributed memory systems.
In some cases we are looking for specific structures, such as cliques in cellphone calling records, and in other cases are applying learning algorithms in parallel to more rapidly look for unknown structure.
Stout, Quentin F.
Related Labs, Centers, and Groups|
Software Systems Laboratory