Abstract: Wireless sensor networks (WSNs) composed of large numbers of small, self-organizing devices are being investigated for a wide variety of applications. Two key advantages of these networks over more traditional sensor networks are that they can be deployed dynamically and quickly, and that they can provide fine-grained sensing. Emergency response to natural or man-made disasters, detection and tracking, and fine-grained sensing of the environment are key examples of applications that can benefit from such networks. Research on these systems is widespread. However, many of the proposed solutions are developed under simplifying assumptions about wireless communication and the environment, even though the realities of wireless communication and environmental sensing are well known, and many are evaluated only by simulation. In this talk I describe a fully implemented system consisting of a suite of more than 30 synthesized protocols. The system supports a power-aware surveillance, tracking, and classification application running on 203 XSM motes and evaluated in a realistic, large-area environment. Technical details and evaluations are presented. I end with a discussion of opportunities and problems for data mining related to WSNs.
Bio: Professor John A. Stankovic is the BP America Professor in the Computer Science Department at the University of Virginia. He recently served as Chair of the department, completing two terms (8 years). He is a Fellow of both the IEEE and the ACM. He also won the IEEE Real-Time Systems Technical Committee's Award for Outstanding Technical Contributions and Leadership. Professor Stankovic served on the Board of Directors of the Computing Research Association for 9 years. Before joining the University of Virginia, Professor Stankovic taught at the University of Massachusetts, where he won an outstanding scholar award. He has also held visiting positions in the Computer Science Department at Carnegie Mellon University, at INRIA in France, and at the Scuola Superiore S. Anna in Pisa, Italy. He was Editor-in-Chief of the IEEE Transactions on Parallel and Distributed Systems and was a founder and co-editor-in-chief of the Real-Time Systems journal for 18 years. He was also General Chair of ACM SenSys 2004 and General Chair of ACM/IEEE Information Processing in Sensor Networks (IPSN) 2006. His research interests are in distributed computing, real-time systems, operating systems, and wireless sensor networks. Prof. Stankovic received his PhD from Brown University.
Abstract: This talk is about recent work on new ways to exploit preprocessed
views of data tables for tractably answering large statistical queries.
We'll describe deployments of these new algorithms in the realms of
detecting killer asteroids and unnatural disease outbreaks.
In recent years, several groups have looked at methods for pre-storing
general sufficient statistics of the data in spatial data structures
such as kd-trees and ball-trees so that both frequentist and Bayesian
statistical operations become fast for large datasets.
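As a rough illustration of that idea (a minimal sketch with invented naming, not the implementation behind the talk), a kd-tree node can cache the count, sum, and sum of squares of the points below it, so that an aggregate query over a box is answered from a few node summaries instead of from every point:

```python
import numpy as np

class KDNode:
    """kd-tree node caching sufficient statistics of its subtree."""
    def __init__(self, points, depth=0, leaf_size=32):
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        self.n = len(points)                    # cached sufficient statistics:
        self.sum = points.sum(axis=0)           # count, sum, and sum of squares
        self.sumsq = (points ** 2).sum(axis=0)
        if self.n <= leaf_size:
            self.points, self.left, self.right = points, None, None
        else:
            d = depth % points.shape[1]         # cycle through split dimensions
            order = points[:, d].argsort()
            self.points = None
            self.left = KDNode(points[order[: self.n // 2]], depth + 1, leaf_size)
            self.right = KDNode(points[order[self.n // 2 :]], depth + 1, leaf_size)

def region_stats(node, qlo, qhi):
    """Count and sum of the points inside the axis-aligned box [qlo, qhi]."""
    if np.any(node.hi < qlo) or np.any(node.lo > qhi):
        return 0, 0.0                           # disjoint from the query: prune
    if np.all(qlo <= node.lo) and np.all(node.hi <= qhi):
        return node.n, node.sum                 # fully inside: use cached stats
    if node.left is None:                       # leaf: test individual points
        inside = np.all((node.points >= qlo) & (node.points <= qhi), axis=1)
        return int(inside.sum()), node.points[inside].sum(axis=0)
    nl, sl = region_stats(node.left, qlo, qhi)
    nr, sr = region_stats(node.right, qlo, qhi)
    return nl + nr, sl + sr
```

Because each node also carries the sum of squares, variances (and hence many frequentist and Bayesian summaries) fall out of the same traversal.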
In this talk we will look at two other classes of optimization
required in important statistical queries.
The first involves iterating over all spatial regions, big and
small (a minimal sketch of one standard way to make such scans
cheap appears after the list below). The second involves
detecting tracks from noisy, intermittent observations separated
far apart in time and space. We will also discuss the
implications that have arisen from making these operations
tractable. We will focus particularly on:
detecting all asteroids in the solar system larger than Pittsburgh's Cathedral of Learning (data to be collected over 2006-2010), and
early detection of emerging diseases based on national monitoring of
health-related transactions.
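To make the first of these concrete, here is a minimal sketch (an illustration of the generic prefix-sum trick, not necessarily the algorithm in the talk) that scores every axis-aligned rectangle over a grid of event counts, computing each region's sum in constant time:

```python
import numpy as np

def prefix_sums(counts):
    """P[i, j] = total count in the sub-grid counts[:i, :j]."""
    P = np.zeros((counts.shape[0] + 1, counts.shape[1] + 1))
    P[1:, 1:] = counts.cumsum(axis=0).cumsum(axis=1)
    return P

def region_sum(P, r0, r1, c0, c1):
    """Sum over rows r0..r1-1 and cols c0..c1-1 in O(1) by inclusion-exclusion."""
    return P[r1, c1] - P[r0, c1] - P[r1, c0] + P[r0, c0]

def best_region(counts, baseline):
    """Score every axis-aligned rectangle against a baseline; keep the best."""
    P, B = prefix_sums(counts), prefix_sums(baseline)
    rows, cols = counts.shape
    best = (-np.inf, None)
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    excess = (region_sum(P, r0, r1, c0, c1)
                              - region_sum(B, r0, r1, c0, c1))
                    # toy score; a real scan statistic uses a likelihood ratio
                    if excess > best[0]:
                        best = (excess, (r0, r1, c0, c1))
    return best
```

Even with constant-time region sums, enumerating all rectangles of a G-by-G grid costs O(G^4), which is why cleverer search over the space of regions is the interesting part.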
Bio: Andrew Moore ( www.cs.cmu.edu/~awm ) is director of Google's newest
engineering office, located on Carnegie Mellon's campus in Pittsburgh,
PA. Until recently, Andrew was a professor of
robotics and computer science at the School of Computer Science,
Carnegie Mellon University. Andrew began his career writing
video-games for an obscure British personal computer
(http://www.oric.org/index.php?page=software&fille=detail&num_log=2 ).
He rapidly became a thousandaire and retired to academia, where he
received a PhD from the University of Cambridge in 1991. He
researched robot learning as a Post-doc working with Chris Atkeson at
MIT, and then moved to CMU.
His main research interest is data mining: statistical algorithms for
finding all the potentially useful and statistically meaningful
patterns in large sources of data. His research group, the Auton Lab
(http://www.autonlab.org), has devised several new ways of performing
large statistical operations efficiently, in several cases
accelerating the state of the art by several orders of magnitude.
Members of the Auton Lab collaborate closely with many kinds of
scientists, government agencies, technology companies, and engineers
in a constant quest to identify the most urgent unresolved questions
at the border of computation and statistics. Auton Lab
algorithms are now in use in dozens of commercial, university and
government applications. Andrew became a US citizen in 2003, and lives
in Pittsburgh with his wife and two children: William (7) and Lucy
(1). In his non-work life he has no hobbies or talents of any
significance.
Next Frontier Rakesh Agrawal
Microsoft Search Labs, Microsoft Corporation
Keynote - Wednesday, 8/23
Bio: Rakesh Agrawal is a Microsoft Technical Fellow at the newly founded Search Labs. He is the recipient of the ACM-SIGKDD First Innovation Award, ACM-SIGMOD Edgar F. Codd Innovations Award, ACM-SIGMOD Test of Time Award, VLDB 10-Yr Most Influential Paper Award, and the Computerworld First Horizon Award. He is a Member of the National Academy of Engineering, a Fellow of ACM, and a Fellow of IEEE. Scientific American named him to the list of 50 top scientists and technologists in 2003.
Prior to joining Microsoft in March 2006, Rakesh was an IBM Fellow and led the Quest group at the IBM Almaden Research Center. Earlier, he was with Bell Laboratories, Murray Hill, from 1983 to 1989. He also worked for 3 years at India's premier company, Bharat Heavy Electricals Ltd. He received the M.S. and Ph.D. degrees in Computer Science from the University of Wisconsin-Madison in 1983. He also holds a B.E. degree in Electronics and Communication Engineering from IIT-Roorkee, and a two-year Post Graduate Diploma in Industrial Engineering from the National Institute of Industrial Engineering (NITIE), Bombay.
Rakesh is well known for developing fundamental data mining concepts and technologies and for pioneering key concepts in data privacy, including the Hippocratic Database, Sovereign Information Sharing, and Privacy-Preserving Data Mining. IBM's commercial data mining product, Intelligent Miner, grew out of his work. His research has been incorporated into other IBM products, including DB2 Mining Extender, DB2 OLAP Server, and WebSphere Commerce Server, and has influenced several other commercial and academic products, prototypes, and applications. His other technical contributions include the Polyglot object-oriented type system, the Alert active database system, Ode (an object database and environment), Alpha (an extension of relational databases with generalized transitive closure), the Nest distributed system, transaction management, and database machines.
Rakesh has been granted more than 55 patents. He has published more than 150 research papers, many of them considered seminal. He wrote the first and the second most cited of all papers in the fields of databases and data mining (the 14th and 16th most cited across all of computer science as of August 2005, according to CiteSeer). Wikipedia lists one of his papers among the three most influential database papers. His papers have been cited more than 6000 times, with more than 15 of them receiving more than 100 citations each. He is the most cited author in the field of database systems. His work has been featured in the New York Times Year in Review, the New York Times Science section, and several other publications.
Rakesh's new quest is to use the Internet to bring the benefits of computing to the underserved.
Abstract: Capital One is a highly quantitatively driven, diversified financial services firm. As such, we
make broad and deep use of the entire repertoire of quantitative techniques. This talk will
present our top ten statistical problems. Only one of them includes data mining as a subpoint,
but it should be useful for data miners to see how their research needs to
complement, and fit into, the entire range of hard statistical issues.
Bio: William Kahn is the Chief Scoring Officer at Capital One Financial where he is responsible for
the quality and usefulness of statistical methods across the firm. Dr. Kahn has been working in
financial services for 10 years, and has also worked in industrial statistics for 10 years. He holds a
BA in Physics and an MA in Statistics from Berkeley, and a Ph.D. in Statistics from Yale.
Abstract: Common strategies to liberate an organization’s information
assets for situational awareness frequently rely on
infrastructure components such as data integration, enterprise
search, federation, data warehousing, and so on. While these
traditional platforms enable analysts to get better and faster
answers to their queries, the next big advance will change this
paradigm. Users cannot be expected to formulate and ask every
smart question every day. To escape this impractical and
unscalable model, the new paradigm will involve technologies
where “the data finds the data” and “relevance finds the user.”
Perpetual Analytics describes a class of application whereby
enterprise context is assembled in real time on data streams,
as fast as operational systems record observations. Context
construction is a “data finds the data” activity that enables
events of interest to be streamed to subscribers. In this talk,
I will discuss in some depth the dynamics of such systems,
including scalability and sustainability.
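As a toy illustration of the paradigm (hypothetical names and a deliberately simplified design, not IBM's implementation), arriving records can be matched on arrival against both previously indexed records and standing subscriber interests:

```python
from collections import defaultdict

class PerpetualMatcher:
    """Toy 'data finds the data' loop: records and subscriber interests are
    indexed by shared keys, and every new observation is matched on arrival."""
    def __init__(self):
        self.records_by_key = defaultdict(list)  # key -> prior records sharing it
        self.watchers = defaultdict(list)        # key -> subscribers watching it

    def subscribe(self, subscriber, key):
        self.watchers[key].append(subscriber)

    def observe(self, record, keys):
        """Called once per incoming record, at operational speed."""
        for key in keys:
            for prior in self.records_by_key[key]:   # the data finds the data
                print(f"context: {record} relates to {prior} via {key!r}")
            for subscriber in self.watchers[key]:    # relevance finds the user
                print(f"alert -> {subscriber}: {record} matches {key!r}")
            self.records_by_key[key].append(record)

m = PerpetualMatcher()
m.subscribe("analyst-7", "555-0100")
m.observe("booking #1", ["555-0100", "J. Doe"])
m.observe("visit #2", ["J. Doe"])   # finds booking #1 via the shared name
```

The point of the inversion is that no analyst ever has to re-ask the question: the matching happens as a side effect of recording the observation.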
Bio: Jeff Jonas is chief scientist of the Entity Analytic Solutions
group and an IBM Distinguished Engineer. In these
capacities, he is responsible for shaping the technical
strategy of next-generation identity analytics and the use of
this new capability within the overall IBM technology strategy.
The IBM Entity Analytic Solutions group was formed based
on technologies developed by Mr. Jonas as the founder and
chief scientist of Systems Research & Development (SRD).
SRD was acquired by IBM in January 2005. Today, Mr.
Jonas applies his real-world and hands-on experience in
software design and development to drive technology
innovations while delivering higher levels of privacy and
civil liberties protections. By way of example, the most
recent breakthrough developed by Mr. Jonas involves an
innovative technique enabling advanced data correlation
while using only irreversible cryptographic hashes. This new
capability makes it possible for organizations to discover
records of common interest (e.g., identities) without the
transfer of any privacy-invading content. This privacy-enhancing
technology, known as anonymous entity resolution,
delivers extraordinary new levels of privacy protection while
enabling technology to contribute to critical societal interests
like clinical health care research, aviation safety, homeland
security, fraud detection, and identity theft.
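The flavor of that technique can be sketched as follows (a simplification for illustration only, not the SRD/IBM algorithm; a production system must also guard low-entropy identifiers against dictionary attacks, e.g., with keyed hashes):

```python
import hashlib

def anonymize(identifiers):
    """Replace identifiers with irreversible hashes after normalization."""
    return {hashlib.sha256(i.strip().lower().encode()).hexdigest()
            for i in identifiers}

# Each party hashes its own identifiers locally ...
party_a = anonymize(["jane.doe@example.com", "555-0100"])
party_b = anonymize(["JANE.DOE@EXAMPLE.COM", "555-0199"])

# ... and only the hashes are exchanged, so records of common interest
# can be discovered without revealing the non-matching identifiers.
common = party_a & party_b
print(len(common), "identifier(s) in common")   # prints: 1
```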
Jeff Jonas's innovations have received coverage in such publications as
The Wall Street Journal, Washington Post, Fortune, and
Computerworld and have been featured on ABC Primetime
with Peter Jennings, The Discovery Channel, The Learning
Channel, and MSNBC. Known for his dynamic
presentation style, he is a popular speaker on technology,
security and privacy and has spoken at events such as the
Federal Convention on Emerging Technologies Forum on
Homeland Security, National Security Agency’s INFOSEC
Seminar Series, American Society for Industrial Security,
Black Hat, PC Forum, Wharton Technology Conference,
National Retail Federation Annual Fraud Conference and
Computers, Freedom and Privacy Conference.
Mr. Jonas is a member of the Markle Foundation Task Force
on National Security in the Information Age and actively
contributes his insights on privacy, technology and homeland
security to leading national think tanks, privacy advocacy
groups, and policy research organizations, including the
Center for Democracy and Technology, Heritage Foundation
and the Office of the Secretary of Defense Highlands Forum.
Most recently, Mr. Jonas was named a senior advisor to
the Center for Strategic and International Studies.
Abstract: Although information extraction and data mining appear
together in many applications, their interface in most current
systems would better be described as serial juxtaposition than
as tight integration. Information extraction populates slots in
a database by identifying relevant subsequences of text, but
is usually not aware of the emerging patterns and regularities
in the database. Data mining methods begin from a
populated database, and are often unaware of where the data
came from, or its inherent uncertainties. The result is that the
accuracy of both suffers, and accurate mining of complex
text sources has been beyond reach.
In this talk I will describe work in probabilistic models that
perform joint inference across multiple components of an
information processing pipeline in order to avoid the brittle
accumulation of errors. After briefly introducing conditional
random fields, I will describe recent work in information
extraction leveraging factorial state representations, entity
resolution, and transfer learning, as well as scalable methods
of inference and learning. I'll close with some recent work
on probabilistic models for social network analysis, and a
demonstration of Rexa.info, a new research paper search
engine.
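As background for readers new to the model family (standard material, not specific to this talk), a linear-chain conditional random field defines

\[
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,
\exp\Bigl(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Bigr),
\]

where the f_k are feature functions over adjacent labels and the input, the \lambda_k are learned weights, and Z(\mathbf{x}) normalizes over all label sequences. Joint inference ties several such models together so that evidence from one stage can correct errors in another, instead of letting errors accumulate through the pipeline.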
This is joint work with colleagues at the University of
Massachusetts: Charles Sutton, Chris Pal, Ben Wellner,
Michael Hay, Xuerui Wang, Natasha Mohanty, and Andres
Corrada.
Bio: Andrew McCallum is an Associate Professor at the University of
Massachusetts, Amherst. He was previously Vice President
of Research and Development at WhizBang Labs, a
company that used machine learning for information
extraction from the Web. In the late 1990s he was a
Research Scientist and Coordinator at Justsystem Pittsburgh
Research Center, where he spearheaded the creation of
CORA, an early research paper search engine that used
machine learning for spidering, extraction, classification and
citation analysis. After receiving his PhD from the
University of Rochester in 1995, he was a post-doctoral
fellow at Carnegie Mellon University. He is an action editor
for the Journal of Machine Learning Research. For the past
ten years, McCallum has been active in research on statistical
machine learning applied to text, especially information
extraction, document classification, clustering, finite state
models, semi-supervised learning, and social network
analysis. Web page: http://www.cs.umass.edu/~mccallum.
Data Mining Challenges in the Automotive Domain Michael Cavaretta
Infotronics and Systems Analytics Department, Ford Research and Innovation Center
Industry Track Invited Talk - Tuesday, 8/22
Abstract: Automotive companies, such as Ford Motor Company, have no shortage of large databases with abundant
opportunities for cost reduction and revenue enhancement. The Data Mining Group at Ford has worked in
the areas of Quality, Customer Satisfaction and Warranty Analytics for close to ten years. In this time, we
have developed a number of methods for building systems to help the business. One area of particular
success has been in warranty analysis. While traditional hazard analysis has been applied at Ford for a
number of years, we have used techniques from other industries (e.g. retail), as well as text mining to view
warranty analytics in a new way. However, our success has been tempered by serious challenges, particularly
in the areas of data understanding, computing meaningful aggregations, and implementation. Case studies
from the automobile industry (warranty, quality, forecasting, etc.) as well as from other industries will be
used.
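As a small, generic illustration of what traditional hazard analysis computes in the warranty setting (made-up data and column meanings, not Ford's method), the empirical hazard at each month in service is the fraction of vehicles still at risk that file a first claim in that month:

```python
from collections import Counter

def empirical_hazard(claim_month, censor_month):
    """claim_month[i]: month in service of vehicle i's first claim (None if none).
    censor_month[i]: months of observation so far for vehicle i.
    Returns {month: claims at t / vehicles at risk at t}."""
    claims = Counter(m for m in claim_month if m is not None)
    hazard = {}
    for t in sorted(claims):
        at_risk = sum(1 for c, s in zip(claim_month, censor_month)
                      if s >= t and (c is None or c >= t))
        hazard[t] = claims[t] / at_risk
    return hazard

# Five vehicles: two first claims (months 3 and 5); the rest claim-free so far.
print(empirical_hazard([3, None, 5, None, None], [12, 6, 12, 4, 24]))
# {3: 0.2, 5: 0.333...}
```

Handling censoring correctly (vehicles of different ages still under warranty) is exactly the kind of "meaningful aggregation" challenge the abstract alludes to.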
Bio: Michael Cavaretta is the Technical Leader of the Data Mining Group within the Infotronics and Systems
Analytics Department at Ford's Research and Innovation Center. He graduated with a BS in information
systems from the University of Michigan in 1987, and received an MS and a Ph.D. in Computer Science from
Wayne State University in 1995. His dissertation was entitled, “The Application of Cultural Algorithms to
Real-Valued Optimization.” After receiving his Ph.D., he was hired by Churchill Systems, a small
consulting company that applied artificial intelligence techniques at high-volume retail companies such as
Sears and Kmart. Dr. Cavaretta was hired by Ford Motor Company in 1998 as a Technical Specialist for the
newly formed Data Mining group, and was promoted to Technical Leader of the group in 2001. The Data Mining
group applies the technologies of artificial intelligence, machine learning, statistics, and information
visualization primarily in the areas of quality, warranty, market research, and customer relationship
management.