Abstract: Wireless sensor networks (WSNs) composed of large numbers of small, self-organizing devices are being investigated for a wide variety of applications. Two key advantages of these networks over more traditional sensor networks are that they can be deployed dynamically and quickly, and that they can provide fine-grained sensing. Emergency response to natural or man-made disasters, detection and tracking, and fine-grained sensing of the environment are key examples of applications that can benefit from such networks. Research on these systems is widespread. However, many of the proposed solutions are developed under simplifying assumptions about wireless communication and the environment, even though the realities of wireless communication and environmental sensing are well known, and many are evaluated only by simulation. In this talk I describe a fully implemented system consisting of a suite of more than 30 synthesized protocols. The system supports a power-aware surveillance, tracking, and classification application running on 203 XSM motes and evaluated in a realistic, large-area environment. Technical details and evaluations are presented. I end with a discussion of opportunities and problems for data mining related to WSNs.
Bio: Professor John A. Stankovic is the BP America Professor in the Computer Science Department at the University of Virginia. He recently served as Chair of the department, completing two terms (8 years). He is a Fellow of both the IEEE and the ACM. He also won the IEEE Real-Time Systems Technical Committee's Award for Outstanding Technical Contributions and Leadership. Professor Stankovic served on the Board of Directors of the Computing Research Association for 9 years. Before joining the University of Virginia, Professor Stankovic taught at the University of Massachusetts, where he won an outstanding scholar award. He has also held visiting positions in the Computer Science Department at Carnegie Mellon University, at INRIA in France, and at the Scuola Superiore S. Anna in Pisa, Italy. He was Editor-in-Chief of the IEEE Transactions on Parallel and Distributed Systems and was a founder and co-editor-in-chief of the Real-Time Systems journal for 18 years. He was also General Chair of ACM SenSys 2004 and General Chair of ACM/IEEE Information Processing in Sensor Networks (IPSN) 2006. His research interests are in distributed computing, real-time systems, operating systems, and wireless sensor networks. Prof. Stankovic received his PhD from Brown University.
Abstract: This talk is about recent work on new ways to exploit preprocessed
views of data tables for tractably answering large statistical queries.
We'll describe deployments of these new algorithms in the realms of
detecting killer asteroids and unnatural disease outbreaks.
In recent years, several groups have looked at methods for pre-storing
general sufficient statistics of the data in spatial data structures
such as kd-trees and ball-trees so that both frequentist and Bayesian
statistical operations become fast for large datasets.
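As a rough illustration of that idea (a minimal sketch with invented naming, not the implementation behind the talk), a kd-tree node can cache the count, sum, and sum of squares of the points below it, so that an aggregate query over a box is answered from a few node summaries instead of from every point:

```python
import numpy as np

class KDNode:
    """kd-tree node caching sufficient statistics of its subtree."""
    def __init__(self, points, depth=0, leaf_size=32):
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        self.n = len(points)                    # cached sufficient statistics:
        self.sum = points.sum(axis=0)           # count, sum, and sum of squares
        self.sumsq = (points ** 2).sum(axis=0)
        if self.n <= leaf_size:
            self.points, self.left, self.right = points, None, None
        else:
            d = depth % points.shape[1]         # cycle through split dimensions
            order = points[:, d].argsort()
            self.points = None
            self.left = KDNode(points[order[: self.n // 2]], depth + 1, leaf_size)
            self.right = KDNode(points[order[self.n // 2 :]], depth + 1, leaf_size)

def region_stats(node, qlo, qhi):
    """Count and sum of the points inside the axis-aligned box [qlo, qhi]."""
    if np.any(node.hi < qlo) or np.any(node.lo > qhi):
        return 0, 0.0                           # disjoint from the query: prune
    if np.all(qlo <= node.lo) and np.all(node.hi <= qhi):
        return node.n, node.sum                 # fully inside: use cached stats
    if node.left is None:                       # leaf: test individual points
        inside = np.all((node.points >= qlo) & (node.points <= qhi), axis=1)
        return int(inside.sum()), node.points[inside].sum(axis=0)
    nl, sl = region_stats(node.left, qlo, qhi)
    nr, sr = region_stats(node.right, qlo, qhi)
    return nl + nr, sl + sr
```

Because each node also carries the sum of squares, variances (and hence many frequentist and Bayesian summaries) fall out of the same traversal.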
In this talk we will look at two other classes of optimization
required in important statistical queries.
The first involves iterating over all spatial regions, big and
small (a minimal sketch of one standard way to make such scans
cheap appears after the list below). The second involves
detecting tracks from noisy, intermittent observations separated
far apart in time and space. We will also discuss the
implications that have arisen from making these operations
tractable. We will focus particularly on:
detecting all asteroids in the solar system larger than Pittsburgh's Cathedral of Learning (data to be collected over 2006-2010), and
early detection of emerging diseases based on national monitoring of
health-related transactions.
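To make the first of these concrete, here is a minimal sketch (an illustration of the generic prefix-sum trick, not necessarily the algorithm in the talk) that scores every axis-aligned rectangle over a grid of event counts, computing each region's sum in constant time:

```python
import numpy as np

def prefix_sums(counts):
    """P[i, j] = total count in the sub-grid counts[:i, :j]."""
    P = np.zeros((counts.shape[0] + 1, counts.shape[1] + 1))
    P[1:, 1:] = counts.cumsum(axis=0).cumsum(axis=1)
    return P

def region_sum(P, r0, r1, c0, c1):
    """Sum over rows r0..r1-1 and cols c0..c1-1 in O(1) by inclusion-exclusion."""
    return P[r1, c1] - P[r0, c1] - P[r1, c0] + P[r0, c0]

def best_region(counts, baseline):
    """Score every axis-aligned rectangle against a baseline; keep the best."""
    P, B = prefix_sums(counts), prefix_sums(baseline)
    rows, cols = counts.shape
    best = (-np.inf, None)
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    excess = (region_sum(P, r0, r1, c0, c1)
                              - region_sum(B, r0, r1, c0, c1))
                    # toy score; a real scan statistic uses a likelihood ratio
                    if excess > best[0]:
                        best = (excess, (r0, r1, c0, c1))
    return best
```

Even with constant-time region sums, enumerating all rectangles of a G-by-G grid costs O(G^4), which is why cleverer search over the space of regions is the interesting part.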
Bio: Andrew Moore ( www.cs.cmu.edu/~awm ) is director of Google's newest
engineering office, located on Carnegie Mellon's campus in Pittsburgh,
PA. Until recently, Andrew was a professor of
robotics and computer science at the School of Computer Science,
Carnegie Mellon University. Andrew began his career writing
video-games for an obscure British personal computer
(http://www.oric.org/index.php?page=software&fille=detail&num_log=2 ).
He rapidly became a thousandaire and retired to academia, where he
received a PhD from the University of Cambridge in 1991. He
researched robot learning as a Post-doc working with Chris Atkeson at
MIT, and then moved to CMU.
His main research interest is data mining: statistical algorithms for
finding all the potentially useful and statistically meaningful
patterns in large sources of data. His research group, the Auton Lab
(http://www.autonlab.org), has devised several new ways of performing
large statistical operations efficiently, in several cases
accelerating the state of the art by several orders of magnitude.
Members of the Auton Lab collaborate closely with many kinds of
scientists, government agencies, technology companies, and engineers
in a constant quest to identify the most urgent unresolved questions
at the border of computation and statistics. Auton Lab
algorithms are now in use in dozens of commercial, university and
government applications. Andrew became a US citizen in 2003, and lives
in Pittsburgh with his wife and two children: William (7) and Lucy
(1). In his non-work life he has no hobbies or talents of any
significance.
Next Frontier Rakesh Agrawal
Microsoft Search Labs, Microsoft Corporation
Keynote - Wednesday, 8/23
Bio: Rakesh Agrawal is a Microsoft Technical Fellow at the newly founded Search Labs. He is the recipient of the ACM-SIGKDD First Innovation Award, ACM-SIGMOD Edgar F. Codd Innovations Award, ACM-SIGMOD Test of Time Award, VLDB 10-Yr Most Influential Paper Award, and the Computerworld First Horizon Award. He is a Member of the National Academy of Engineering, a Fellow of ACM, and a Fellow of IEEE. Scientific American named him to the list of 50 top scientists and technologists in 2003.
Prior to joining Microsoft in March 2006, Rakesh was an IBM Fellow and led the Quest group at the IBM Almaden Research Center. Earlier, he was with Bell Laboratories, Murray Hill, from 1983 to 1989. He also worked for 3 years at India's premier company, Bharat Heavy Electricals Ltd. He received the M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin-Madison in 1983. He also holds a B.E. degree in Electronics and Communication Engineering from IIT-Roorkee, and a two-year Post Graduate Diploma in Industrial Engineering from the National Institute of Industrial Engineering (NITIE), Bombay.
Rakesh is well known for developing fundamental data mining concepts and technologies and for pioneering key concepts in data privacy, including the Hippocratic Database, Sovereign Information Sharing, and Privacy-Preserving Data Mining. IBM's commercial data mining product, Intelligent Miner, grew out of his work. His research has been incorporated into other IBM products, including DB2 Mining Extender, DB2 OLAP Server, and WebSphere Commerce Server, and has influenced several other commercial and academic products, prototypes, and applications. His other technical contributions include the Polyglot object-oriented type system, the Alert active database system, Ode (an object database and environment), Alpha (an extension of relational databases with generalized transitive closure), the Nest distributed system, transaction management, and database machines.
Rakesh has been granted more than 55 patents. He has published more than 150 research papers, many of them considered seminal. He wrote the first and the second most cited of all papers in the fields of databases and data mining (the 14th and 16th most cited across all of computer science as of August 2005, according to CiteSeer). Wikipedia lists one of his papers among the three most influential database papers. His papers have been cited more than 6000 times, with more than 15 of them receiving more than 100 citations each. He is the most cited author in the field of database systems. His work has been featured in the New York Times Year in Review, the New York Times Science section, and several other publications.
Rakesh's new quest is to use the Internet to bring the benefits of computing to the underserved.
Abstract: Capital One is a highly quantitatively driven, diversified financial services firm. As such, we
make broad and deep use of the entire repertoire of quantitative techniques. This talk will
present our top ten statistical problems. Only one of them includes data mining as a subpoint,
but it should be useful for data miners to see how their research needs to
complement, and fit into, the entire range of hard statistical issues.
Bio: William Kahn is the Chief Scoring Officer at Capital One Financial where he is responsible for
the quality and usefulness of statistical methods across the firm. Dr. Kahn has been working in
financial services for 10 years, and has also worked in industrial statistics for 10 years. He holds a
BA in Physics and an MA in Statistics from Berkeley, and a Ph.D. in Statistics from Yale.
Abstract: Common strategies to liberate an organization’s information
assets for situational awareness frequently rely on
infrastructure components such as data integration, enterprise
search, federation, data warehousing, and so on. While these
traditional platforms enable analysts to get better and faster
answers to their queries, the next big advance will change this
paradigm. Users cannot be expected to formulate and ask every
smart question every day. To escape this impractical and
unscalable model, the new paradigm will involve technologies
where “the data finds the data” and “relevance finds the user.”
Perpetual Analytics describes a class of application whereby
enterprise context is assembled in real time on data streams,
as fast as operational systems record observations. Context
construction is a “data finds the data” activity that enables
events of interest to be streamed to subscribers. In this talk,
I will discuss in some depth the dynamics of such systems,
including scalability and sustainability.
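As a toy illustration of the paradigm (hypothetical names and a deliberately simplified design, not IBM's implementation), arriving records can be matched on arrival against both previously indexed records and standing subscriber interests:

```python
from collections import defaultdict

class PerpetualMatcher:
    """Toy 'data finds the data' loop: records and subscriber interests are
    indexed by shared keys, and every new observation is matched on arrival."""
    def __init__(self):
        self.records_by_key = defaultdict(list)  # key -> prior records sharing it
        self.watchers = defaultdict(list)        # key -> subscribers watching it

    def subscribe(self, subscriber, key):
        self.watchers[key].append(subscriber)

    def observe(self, record, keys):
        """Called once per incoming record, at operational speed."""
        for key in keys:
            for prior in self.records_by_key[key]:   # the data finds the data
                print(f"context: {record} relates to {prior} via {key!r}")
            for subscriber in self.watchers[key]:    # relevance finds the user
                print(f"alert -> {subscriber}: {record} matches {key!r}")
            self.records_by_key[key].append(record)

m = PerpetualMatcher()
m.subscribe("analyst-7", "555-0100")
m.observe("booking #1", ["555-0100", "J. Doe"])
m.observe("visit #2", ["J. Doe"])   # finds booking #1 via the shared name
```

The point of the inversion is that no analyst ever has to re-ask the question: the matching happens as a side effect of recording the observation.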
Bio: Jeff Jonas is chief scientist of the Entity Analytic Solutions
group and an IBM Distinguished Engineer. In these
capacities, he is responsible for shaping the technical
strategy of next-generation identity analytics and the use of
this new capability within the overall IBM technology strategy.
The IBM Entity Analytic Solutions group was formed based
on technologies developed by Mr. Jonas as the founder and
chief scientist of Systems Research & Development (SRD).
SRD was acquired by IBM in January 2005. Today, Mr.
Jonas applies his real-world and hands-on experience in
software design and development to drive technology
innovations while delivering higher levels of privacy and
civil liberties protections. By way of example, the most
recent breakthrough developed by Mr. Jonas involves an
innovative technique enabling advanced data correlation
while using only irreversible cryptographic hashes. This new
capability makes it possible for organizations to discover
records of common interest (e.g., identities) without the
transfer of any privacy-invading content. This privacy-enhancing
technology, known as anonymous entity resolution,
delivers extraordinary new levels of privacy protection while
enabling technology to contribute to critical societal interests
like clinical health care research, aviation safety, homeland
security, fraud detection, and identity theft.
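The flavor of that technique can be sketched as follows (a simplification for illustration only, not the SRD/IBM algorithm; a production system must also guard low-entropy identifiers against dictionary attacks, e.g., with keyed hashes):

```python
import hashlib

def anonymize(identifiers):
    """Replace identifiers with irreversible hashes after normalization."""
    return {hashlib.sha256(i.strip().lower().encode()).hexdigest()
            for i in identifiers}

# Each party hashes its own identifiers locally ...
party_a = anonymize(["jane.doe@example.com", "555-0100"])
party_b = anonymize(["JANE.DOE@EXAMPLE.COM", "555-0199"])

# ... and only the hashes are exchanged, so records of common interest
# can be discovered without revealing the non-matching identifiers.
common = party_a & party_b
print(len(common), "identifier(s) in common")   # prints: 1
```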
Jeff Jonas's innovations have received coverage in such publications as
The Wall Street Journal, Washington Post, Fortune, and
Computerworld and have been featured on ABC Primetime
with Peter Jennings, The Discovery Channel, The Learning
Channel, and MSNBC. Known for his dynamic
presentation style, he is a popular speaker on technology,
security and privacy and has spoken at events such as the
Federal Convention on Emerging Technologies Forum on
Homeland Security, National Security Agency’s INFOSEC
Seminar Series, American Society for Industrial Security,
Black Hat, PC Forum, Wharton Technology Conference,
National Retail Federation Annual Fraud Conference and
Computers, Freedom and Privacy Conference.
Mr. Jonas is a member of the Markle Foundation Task Force
on National Security in the Information Age and actively
contributes his insights on privacy, technology and homeland
security to leading national think tanks, privacy advocacy
groups, and policy research organizations, including the
Center for Democracy and Technology, Heritage Foundation
and the Office of the Secretary of Defense Highlands Forum.
Most recently, Mr. Jonas was named a senior advisor to
the Center for Strategic and International Studies.
Abstract: Although information extraction and data mining appear
together in many applications, their interface in most current
systems would better be described as serial juxtaposition than
as tight integration. Information extraction populates slots in
a database by identifying relevant subsequences of text, but
is usually not aware of the emerging patterns and regularities
in the database. Data mining methods begin from a
populated database, and are often unaware of where the data
came from, or its inherent uncertainties. The result is that the
accuracy of both suffers, and accurate mining of complex
text sources has been beyond reach.
In this talk I will describe work in probabilistic models that
perform joint inference across multiple components of an
information processing pipeline in order to avoid the brittle
accumulation of errors. After briefly introducing conditional
random fields, I will describe recent work in information
extraction leveraging factorial state representations, entity
resolution, and transfer learning, as well as scalable methods
of inference and learning. I'll close with some recent work
on probabilistic models for social network analysis, and a
demonstration of Rexa.info, a new research paper search
engine.
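As background for readers new to the model family (standard material, not specific to this talk), a linear-chain conditional random field defines

\[
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,
\exp\Bigl(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Bigr),
\]

where the f_k are feature functions over adjacent labels and the input, the \lambda_k are learned weights, and Z(\mathbf{x}) normalizes over all label sequences. Joint inference ties several such models together so that evidence from one stage can correct errors in another, instead of letting errors accumulate through the pipeline.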
This is joint work with colleagues at the University of
Massachusetts: Charles Sutton, Chris Pal, Ben Wellner,
Michael Hay, Xuerui Wang, Natasha Mohanty, and Andres
Corrada.
Bio: Andrew McCallum is an Associate Professor at the University of
Massachusetts, Amherst. He was previously Vice President
of Research and Development at WhizBang Labs, a
company that used machine learning for information
extraction from the Web. In the late 1990s he was a
Research Scientist and Coordinator at Justsystem Pittsburgh
Research Center, where he spearheaded the creation of
CORA, an early research paper search engine that used
machine learning for spidering, extraction, classification and
citation analysis. After receiving his PhD from the
University of Rochester in 1995, he was a post-doctoral
fellow at Carnegie Mellon University. He is an action editor
for the Journal of Machine Learning Research. For the past
ten years, McCallum has been active in research on statistical
machine learning applied to text, especially information
extraction, document classification, clustering, finite state
models, semi-supervised learning, and social network
analysis. Web page: http://www.cs.umass.edu/~mccallum.
Data Mining Challenges in the Automotive Domain Michael Cavaretta
Infotronics and Systems Analytics Department, Ford Research and Innovation Center
Industry Track Invited Talk - Tuesday, 8/22
Abstract: Automotive companies, such as Ford Motor Company, have no shortage of large databases with abundant
opportunities for cost reduction and revenue enhancement. The Data Mining Group at Ford has worked in
the areas of Quality, Customer Satisfaction and Warranty Analytics for close to ten years. In this time, we
have developed a number of methods for building systems to help the business. One area of particular
success has been in warranty analysis. While traditional hazard analysis has been applied at Ford for a
number of years, we have used techniques from other industries (e.g. retail), as well as text mining to view
warranty analytics in a new way. However, our success has been tempered by serious challenges, particularly
in the areas of data understanding, computing meaningful aggregations, and implementation. Case studies
from the automobile industry (warranty, quality, forecasting, etc.) as well as from other industries will be
used.
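As a small, generic illustration of what traditional hazard analysis computes in the warranty setting (made-up data and column meanings, not Ford's method), the empirical hazard at each month in service is the fraction of vehicles still at risk that file a first claim in that month:

```python
from collections import Counter

def empirical_hazard(claim_month, censor_month):
    """claim_month[i]: month in service of vehicle i's first claim (None if none).
    censor_month[i]: months of observation so far for vehicle i.
    Returns {month: claims at t / vehicles at risk at t}."""
    claims = Counter(m for m in claim_month if m is not None)
    hazard = {}
    for t in sorted(claims):
        at_risk = sum(1 for c, s in zip(claim_month, censor_month)
                      if s >= t and (c is None or c >= t))
        hazard[t] = claims[t] / at_risk
    return hazard

# Five vehicles: two first claims (months 3 and 5); the rest claim-free so far.
print(empirical_hazard([3, None, 5, None, None], [12, 6, 12, 4, 24]))
# {3: 0.2, 5: 0.333...}
```

Handling censoring correctly (vehicles of different ages still under warranty) is exactly the kind of "meaningful aggregation" challenge the abstract alludes to.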
Bio: Michael Cavaretta is the Technical Leader of the Data Mining Group within the Infotronics and Systems
Analytics Department at Ford's Research and Innovation Center. He graduated with a BS in information
systems from the University of Michigan in 1987, and received an MS and a Ph.D. in Computer Science from
Wayne State University in 1995. His dissertation was entitled, “The Application of Cultural Algorithms to
Real-Valued Optimization.” After receiving his Ph.D., he was hired by Churchill Systems, a small
consulting company that applied artificial intelligence techniques at high-volume retail companies such as
Sears and Kmart. Dr. Cavaretta was hired by Ford Motor Company in 1998 as a Technical Specialist for the
newly formed Data Mining group, and was promoted to Technical Leader of the group in 2001. The Data Mining
group applies the technologies of artificial intelligence, machine learning, statistics, and information
visualization primarily in the areas of quality, warranty, market research, and customer relationship
management.