Conference Schedule and Videos
Friday April 24th, 2015
Registration and Breakfast
Opening Remarks
Host Sponsor Welcome
Keynote Address: Now is the Golden Age of Text Analysis (video, slides)
Coffee Break (30 min)
Track A
Track B
Track C
Label Quality in the Era of Big Data (video, slides)
NLP and Deep Learning: Working with Neural Word Embeddings (video, slides)
Learning the Semantics of Millions of Entities (video, slides)
Coffee Break (10 min)
Track A
Track B
Track C
Reviving the Traditional Russian Orthography for the 21st Century (video, slides)
Discovering Knowledge in Linked Data (video, slides)
Increasing Honesty in Airbnb Reviews (video, slides)
At Airbnb, we study the causes and magnitude of bias in online reviews using large-scale field experiments that change the incentives of buyers and sellers to honestly review each other. Natural language processing has allowed us to extend our analyses and study bias in the written feedback guests and hosts leave after a trip.
Coffee Break (10 min)
Track A
Track B
Track C
Teaching Machines to Read for Fun and Profit (video)
In this talk, Kang Sun from the R&D Machine Learning group at Bloomberg will speak about current projects involving Machine Learning and applications such as Natural Language Processing. We will discuss the evolution and development of several key Bloomberg projects such as sentiment analysis, market impact prediction, novelty detection, social media monitoring, question answering, and topic clustering. We will show that these interdisciplinary problems lie at the intersection of linguistics, finance, computer science, and mathematics, requiring methods from signal processing, machine vision, and other fields. Throughout, we will talk about the practicalities of delivering machine learning solutions to problems in finance, and highlight issues such as the importance of appropriate problem decomposition, feature engineering, and interpretability.
There will be a discussion of future directions and applications of Machine Learning in finance as well as a Q&A session.
Organizing Real Estate Photo Collections with Deep Learning (video, slides)
In this talk we detail our approach to organizing Trulia's unstructured content into rich photo collections similar to Houzz.com or Zillow Digs, without the need for any explicit user tagging.
By leveraging recent advances in deep learning for computer vision and NLP, we first automatically construct a knowledge base of relevant real estate terms and then annotate our photo collections by fusing knowledge from a deep convolutional network for image recognition and a word embedding model.
The novelty in our approach lies in our ability to scale to a large vocabulary of real estate terms without explicitly training a vision model for each one of them.
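The fusion step described above suggests a simple zero-shot scoring scheme: embed each real estate term with the word-embedding model and compare it to the labels a generic image model predicts. The sketch below is a minimal illustration of that idea, not Trulia's actual system; `word_vec` is a hypothetical embedding lookup.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def annotate_photo(cnn_labels, word_vec, vocab):
    """Score real estate terms against a generic CNN's predicted labels.

    cnn_labels: list of (label, confidence) pairs from an image model
    word_vec:   hypothetical dict mapping a word to its embedding vector
    vocab:      real estate terms, e.g. ["granite countertop", "fireplace"]
    """
    scores = {}
    for term in vocab:
        vecs = [word_vec[w] for w in term.split() if w in word_vec]
        if not vecs:
            continue
        term_vec = np.mean(vecs, axis=0)
        # Weight term/label similarity by CNN confidence; no per-term
        # vision model is ever trained, so the vocabulary can grow freely.
        scores[term] = sum(conf * cosine(term_vec, word_vec[label])
                           for label, conf in cnn_labels
                           if label in word_vec)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```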
Human Curated Linguistics - technology behind Cognitive Analytics (video, slides)
Details of the technical implementation and development stack will be discussed.
A live demonstration will show how HCL answers questions asked in free-form language about business data from Google Analytics and salesforce.com data sources.
Lunch
Track A
Track B
Track C
The Art of PDF Processing (video, slides)
Unlocking Our Health Data: Transforming Unstructured Data at Scale (video, slides)
We are building a system to organize this unstructured data, classify it into known topics, and apply additional levels of normalization, all in near real time and at scale. This talk will cover some of the technical challenges we are facing and how we are solving them with machine learning and natural language processing techniques.
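One common way to classify text in near real time at scale is an online learner over hashed features, which keeps the pipeline stateless and memory-bounded. A minimal sketch with scikit-learn, assuming a hypothetical set of topic labels; the talk's actual pipeline may differ.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

TOPICS = ["cardiology", "oncology", "radiology"]  # hypothetical topic set

# Feature hashing needs no stored vocabulary, so documents can be
# vectorized independently and the model updated batch by batch.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
classifier = SGDClassifier()

def train_on_batch(texts, labels):
    X = vectorizer.transform(texts)
    classifier.partial_fit(X, labels, classes=TOPICS)

def classify(texts):
    return classifier.predict(vectorizer.transform(texts))
```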
TopicStream, an Application and Architecture for Content Integration in Electronic Reading (video, slides)
Coffee Break (10 min)
Track A
Track B
Track C
Near-Realtime Webpage Recommendations “One at a time” Using Content Features (video, slides)
Unsupervised NLP Tutorial using Apache Spark (video, slides)
A Web Worth of Data: Common Crawl for NLP (video, slides)
Coffee Break (20 min)
Track A
Track B
Track C
A High Level Overview of Genomics in Personalized Medicine (video, slides)
Statistical Machine Translation Approach for Name Matching in Record Linkage (video, slides)
Knowledge Maps for Content Discovery (video, slides)
Coffee Break (20 min)
Track A
Track B
Track C
Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic (video, slides)
In this talk, I'll present an effective approach for automatically creating knowledge bases: databases of factual, general information. This relation extraction approach centers on the idea that we can use machine learning and natural language processing to automatically recognize information as it exists in real-world, unstructured text.
I'll cover the NLP tools, special ML considerations, and novel methods for creating a successful end-to-end relation extraction system. I will also cover experimental results with this system architecture in both big-data and search-oriented environments.
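As a rough illustration of the distant supervision idea (a toy sketch, not the speaker's system), the snippet below labels sentences by matching entity pairs against a seed knowledge base and trains a linear SVM on the resulting noisy examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy seed knowledge base: (entity1, entity2) -> relation.
KB = {("Paris", "France"): "capital_of", ("Tokyo", "Japan"): "capital_of"}
ENTITIES = ["Paris", "France", "Tokyo", "Japan"]

def distant_label(sentences):
    """Heuristically label any sentence that mentions an entity pair."""
    examples, labels = [], []
    for sent in sentences:
        present = [e for e in ENTITIES if e in sent]
        for i, e1 in enumerate(present):
            for e2 in present[i + 1:]:
                rel = KB.get((e1, e2)) or KB.get((e2, e1)) or "no_relation"
                # Mask the pair so the model learns from context words.
                examples.append(sent.replace(e1, "E1").replace(e2, "E2"))
                labels.append(rel)
    return examples, labels

sentences = ["Paris is the capital of France.",
             "Tokyo, the capital of Japan, is enormous.",
             "Paris and Japan both attract tourists."]
X_text, y = distant_label(sentences)
vec = CountVectorizer(ngram_range=(1, 2))
clf = LinearSVC().fit(vec.fit_transform(X_text), y)
```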
Identity Resolution in the Sharing Economy (video, slides)
For example, Airbnb verifies offline identities using a scan of your driver’s license or passport. The scan is checked against templates designed to examine things like layout and other government indicators of authenticity, to help confirm that it appears to be valid. Crucially, this involves checking an applicant’s entered name, often in Latin script, against their name on the scanned document, which may be in another script or language and subject to potentially egregious OCR errors.
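A minimal sketch of that kind of name check, assuming the document name has already been OCR'd (and, for non-Latin scripts, transliterated); the normalization and threshold below are illustrative assumptions, not Airbnb's actual rules.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(name):
    # Strip accents, digits, and punctuation so OCR noise matters less.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed
                   if c.isalpha() or c.isspace()).lower()

def names_match(entered, ocr_extracted, threshold=0.8):
    """Fuzzy match tolerant of character-level OCR errors."""
    a, b = normalize(entered), normalize(ocr_extracted)
    return SequenceMatcher(None, a, b).ratio() >= threshold

names_match("José García", "JOSE GARC1A")  # True: OCR read 'í' as '1'
```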
More generally, connecting the public and private traces that people, organizations and things — like vehicles — leave in various information stores is essential to delivering valuable analytics and novel services. This is often called entity analytics or identity resolution.
In this talk, we will explore enabling technology in both structured and unstructured contexts, discuss current challenges and limitations, and explore additional examples.
Coffee Break (20 min)
Track A
Track B
Track C
Science Panel (video)
Reception
Saturday April 25th, 2015
Arrival and Breakfast
Natural Language Processing as the Core of a Consumer Application (video)
Coffee Break (20 min)
Track A
Track B
Track C
Learning Compositionality with Scala (video, slides)
Semantic Indexing of Four Million Documents with Apache Spark
Turning the Web into a Structured Database (video)
Coffee Break (10 min)
Track A
Track B
Track C
Transforming an Algorithm for Online Recommendations into a Multi-lingual Syntax Parser (video, slides)
Large Scale Topic Assignment on Multiple Social Networks (video, slides)
The system generates a diverse set of features derived from signals such as user-generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags, and endorsements, as well as signals based on social graph connections. We show that using cross-network information with a diverse set of features for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network or any single source.
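A toy illustration of the aggregation step (the network and feature names are hypothetical): namespacing each feature by its source network lets a downstream model weight each network's version of a signal separately.

```python
def merge_network_features(per_network):
    """Combine per-network feature dicts into one user representation."""
    merged = {}
    for network, features in per_network.items():
        for name, value in features.items():
            # Keep the source network in the key so signals stay distinct.
            merged[f"{network}:{name}"] = value
    return merged

user_features = merge_network_features({
    "twitter":  {"posts:machine_learning": 0.7, "retweets:nlp": 0.3},
    "linkedin": {"profile:machine_learning": 0.9},
})
```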
Transforming Unstructured Offer Titles (video, slides)
Offers are usually unstructured text items. For many applications where similar offers should be found or offer titles need to be linked to websites, it's beneficial to recognize the characteristics of individual offerings instead of working on unstructured offer titles directly. In this talk I will discuss what the relevant aspects of an offer are and present an approach to automatically extract these pieces of information. I will also briefly touch on possible applications built on top of such structured offers.
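As a rough sketch of the extraction step, hand-written patterns like the ones below can pull attributes out of an offer title; the attribute set is hypothetical, and a production system would more likely rely on trained extractors.

```python
import re

# Hypothetical attribute patterns for electronics offers.
PATTERNS = {
    "brand":    re.compile(r"\b(Apple|Samsung|Lenovo)\b", re.I),
    "capacity": re.compile(r"\b(\d+\s?(?:GB|TB))\b", re.I),
    "color":    re.compile(r"\b(black|white|silver|gold)\b", re.I),
}

def structure_offer(title):
    """Extract structured attributes from an unstructured offer title."""
    attributes = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(title)
        if match:
            attributes[name] = match.group(1)
    return attributes

structure_offer("Apple iPhone 6 64GB gold - brand new!")
# -> {'brand': 'Apple', 'capacity': '64GB', 'color': 'gold'}
```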
Coffee Break (10 min)
Track A
Track B
Track C
Measuring Well-Being Using Social Media (video, prezi)
Classifying Text without (many) Labels (video, slides)
Introduction to RDF and Linked Data
In this talk, I will demystify those concepts and technologies, and show you the fascinating world of Linked Open Data.
Lunch
Track A
Track B
Track C
Building the World’s Largest Database of Car Features from PDFs (video, slides)
Edmunds.com is an industry-leading website for car shoppers. To effectively support the car-purchasing process, Edmunds needs to understand the features and options available on the myriad different models offered by manufacturers each year. This critical structured database supports faceted search of models, searching available inventory, and other strategic uses.
This end-to-end capability supports robust processing of unstructured data to identify properties like “air conditioning” and “climate control,” and to understand that they are the same underlying feature. For Edmunds, this meant an ~85% reduction in the time it takes to get information about a new car model online, from two weeks to just one or two days. We will also discuss how the NLP models can be re-used across other data, mapping Edmunds’ detailed ontology to a variety of unstructured data sources.
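Recognizing that “air conditioning” and “climate control” name the same feature is a canonicalization problem. A minimal sketch, assuming a curated synonym table with fuzzy matching as a fallback; the vocabulary here is illustrative, not Edmunds’ ontology.

```python
from difflib import get_close_matches

# Illustrative canonical vocabulary and curated synonym table.
CANONICAL = ["air conditioning", "navigation system", "sunroof"]
SYNONYMS = {"climate control": "air conditioning", "moonroof": "sunroof"}

def canonicalize(raw_feature):
    """Map a raw extracted phrase onto the canonical feature vocabulary."""
    phrase = raw_feature.strip().lower()
    if phrase in SYNONYMS:
        return SYNONYMS[phrase]
    # Fuzzy matching catches spelling and spacing variants.
    close = get_close_matches(phrase, CANONICAL, n=1, cutoff=0.8)
    return close[0] if close else None

canonicalize("Climate Control")  # -> 'air conditioning'
canonicalize("sun roof")         # -> 'sunroof'
```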
Learning From the Diner's Experience (video)
Practical NLP Applications of Deep Learning (video, slides)
Coffee Break (10 min)
Track A
Track B
Track C
Using the Ingenuity KB, Ingenuity Systems (now a part of QIAGEN) provides software solutions to interpret biological datasets. By aligning those datasets (e.g., raw research observations or clinical genomic-testing data) to the KB, researchers can view, analyze, and interpret them in the context of relevant biological and biomedical knowledge. I’ll discuss the Ingenuity ontology structure, building process, maintenance regime, and several use cases.
Identifying Events with Tweets
Deep Learning for Natural Language Processing (video, slides)
Coffee Break (10 min)
Track A
Track B
Track C
Using Big Data to Identify the World's Top Experts (video, slides)
We adopt a principled approach to defining who an expert is. An expert is someone who (a) writes consistently about a small set of tightly related topics; if you are an expert in everything, you are an expert in nothing; (b) has a loyal following that engages with their content consistently and finds it useful; and (c) actually expresses opinions on the topics they write about rather than merely breaking the news.
Formulating the above criteria, and implementing them at scale, is a daunting big data task. First, we needed to form a comprehensive picture of the body of work published by authors who often write for many different outlets, and at times under different aliases. Second, we had to create a dynamic topical model that learns the relationships between tens of thousands of topics by analyzing millions of documents. Third, we had to come up with a formula that produces a stable, consistent ranking that is robust to fluctuations in publishing patterns and engagement data, yet adaptable enough to let new experts' voices be heard.
- Experts vs. Influencers: defining who an expert is
- Unifying identities of authors across sites
- A dynamic topical model that scales
- Projection of topics onto authors
- Opinion vs. Sentiment vs. Statement of Facts
- Putting it all together
- A note on architecture
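To make criteria (a) through (c) concrete, here is a toy composite score; the inputs and weighting are illustrative assumptions, not the formula described in the talk.

```python
import math

def expert_score(user):
    """Toy composite of the three expert criteria above.

    user["topic_shares"]:  distribution of the author's posts over topics
    user["engagement"]:    average engaged readers per post
    user["opinion_ratio"]: share of posts expressing an opinion
    """
    # (a) Focus: low topic entropy means a tight set of related topics.
    shares = [s for s in user["topic_shares"] if s > 0]
    entropy = -sum(s * math.log(s) for s in shares)
    focus = 1.0 / (1.0 + entropy)
    # (b) A loyal, consistently engaged following.
    engagement = math.log1p(user["engagement"])
    # (c) Opinions rather than mere news-breaking.
    return focus * engagement * user["opinion_ratio"]

expert_score({"topic_shares": [0.8, 0.15, 0.05],
              "engagement": 120, "opinion_ratio": 0.6})
```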