At Airbnb, we study the causes and magnitude of bias in online reviews using large-scale field experiments that change buyers' and sellers' incentives to review each other honestly. Natural language processing has allowed us to extend these analyses and study bias in the free-text feedback guests and hosts leave after a trip.
In this talk, Kang Sun from the R&D Machine Learning group at Bloomberg will speak about current projects involving Machine Learning and applications such as Natural Language Processing. We will discuss the evolution and development of several key Bloomberg projects such as sentiment analysis, market impact prediction, novelty detection, social media monitoring, question answering and topic clustering. We will show that these interdisciplinary problems lie at the intersection of linguistics, finance, computer science and mathematics, requiring methods from signal processing, machine vision and other fields. Throughout, we will talk about the practicalities of delivering machine learning solutions to problems in finance, and highlight issues such as the importance of appropriate problem decomposition, feature engineering and interpretability.
There will be a discussion of future directions and applications of Machine Learning in finance as well as a Q&A session.
In this talk we detail our approach to organizing Trulia's unstructured content into rich photo collections similar to Houzz.com or Zillow Digs, without the need for any explicit user tagging.
By leveraging recent advances in deep learning for computer vision and NLP, we first automatically construct a knowledge base of relevant real estate terms and then annotate our photo collections by fusing knowledge from a deep convolutional network for image recognition and a word embedding model.
The novelty in our approach lies in our ability to scale to a large vocabulary of real estate terms without explicitly training a vision model for each one of them.
Details of technical implementation and development stack will be discussed.
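The fusion idea described above can be sketched in a few lines: score each vocabulary term by the embedding similarity between it and the labels the vision model actually recognizes, so new terms need no dedicated vision model. The vectors, labels, and threshold below are toy stand-ins, not Trulia's actual models:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy word vectors standing in for a real embedding model (e.g. word2vec).
EMBEDDINGS = {
    "kitchen":            (0.9, 0.1, 0.0),
    "stove":              (0.8, 0.2, 0.1),
    "granite_countertop": (0.7, 0.3, 0.0),
    "pool":               (0.0, 0.1, 0.9),
}

def annotate(cnn_scores, vocabulary, threshold=0.8):
    """Propagate CNN label probabilities to a larger real-estate
    vocabulary via embedding similarity, so no per-term vision model
    needs to be trained."""
    tags = {}
    for term in vocabulary:
        score = max(
            prob * cosine(EMBEDDINGS[label], EMBEDDINGS[term])
            for label, prob in cnn_scores.items()
        )
        if score >= threshold:
            tags[term] = round(score, 3)
    return tags

# The CNN only knows generic classes; related vocabulary terms
# inherit scores transitively through the embedding space.
cnn_out = {"kitchen": 0.95, "stove": 0.60}
print(annotate(cnn_out, ["granite_countertop", "pool"]))
```

With real embeddings the same loop scales the effective label vocabulary far beyond the classes the vision model was trained on.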
Live demonstration will be performed to show how HCL is answering questions asked in a free-form language about business data from Google Analytics and salesforce.com data sources.
We are building a system to organize this unstructured data, classify it into known topics, and apply additional levels of normalization -- all in near real-time and at scale. This talk will cover some of the technical challenges we are facing and how we are solving them with machine learning and natural language processing techniques.
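A minimal sketch of one such classify-and-normalize step: each incoming item is normalized, then assigned to the known topic whose keyword set it best matches. The keyword model and topic names are illustrative; the production system would use trained classifiers rather than keyword overlap:

```python
# Hypothetical topic model: each known topic has a small keyword set.
TOPIC_KEYWORDS = {
    "earnings": {"revenue", "profit", "quarter", "eps"},
    "mergers":  {"acquire", "merger", "takeover", "deal"},
}

def classify(message):
    """Assign a message to the best-overlapping known topic."""
    tokens = set(message.lower().split())
    scores = {t: len(tokens & kws) for t, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def process_stream(messages):
    """Near-real-time pipeline step: normalize, then classify each item."""
    for msg in messages:
        yield (msg.strip(), classify(msg))

stream = ["Company posts record revenue this quarter ",
          "Rival agrees takeover deal"]
print(list(process_stream(stream)))
```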
In this talk, I'll present an effective approach for automatically creating knowledge bases: databases of factual, general information. This relation extraction approach centers around the idea that we can use machine learning and natural language processing to automatically recognize information as it exists in real-world, unstructured text.
I'll cover the NLP tools, special ML considerations, and novel methods for creating a successful end-to-end relation extraction system. I will also cover experimental results with this system architecture in both big-data and search-oriented environments.
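To make the idea concrete, here is a minimal pattern-based relation extractor that turns unstructured sentences into (subject, relation, object) facts for a knowledge base. The patterns, relation names, and example text are all illustrative; a full system would use learned extractors rather than hand-written regexes:

```python
import re

# Hypothetical patterns: each maps a surface form to a KB relation.
PATTERNS = [
    (re.compile(r"(?P<subj>[A-Z]\w+) was founded by "
                r"(?P<obj>[A-Z]\w+(?: [A-Z]\w+)*)"), "founded_by"),
    (re.compile(r"(?P<subj>[A-Z]\w+) is headquartered in "
                r"(?P<obj>[A-Z]\w+(?: [A-Z]\w+)*)"), "headquarters"),
]

def extract_relations(text):
    """Scan raw text and emit (subject, relation, object) triples."""
    facts = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern, relation in PATTERNS:
            for m in pattern.finditer(sentence):
                facts.append((m.group("subj"), relation, m.group("obj")))
    return facts

text = ("Trulia was founded by Pete Flint. "
        "Trulia is headquartered in San Francisco.")
print(extract_relations(text))
```

The end-to-end system replaces the regex layer with NLP preprocessing (tokenization, entity recognition) and a trained relation classifier, but the input/output contract is the same.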
For example, Airbnb verifies offline identities using a scan of your driver’s license or passport. The scan is checked against templates designed to examine features like the layout and other government indicators of authenticity, to help confirm that the document appears to be valid. Crucially, it involves checking an applicant’s entered name – often in Latin script – against their name on the scanned document, which may be in another script or language, and subject to potentially egregious OCR errors.
More generally, connecting the public and private traces that people, organizations and things — like vehicles — leave in various information stores is essential to delivering valuable analytics and novel services. This is often called entity analytics or identity resolution.
In this talk, we will explore enabling technology in both structured and unstructured contexts, discuss current challenges and limitations, and explore additional examples.
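One of the name-matching problems above can be sketched with stdlib tools: transliterate both names into a common script, strip diacritics, and fuzzy-match so OCR noise does not force an exact-string comparison. The tiny Cyrillic table and the threshold are toy stand-ins; a production resolver would use full transliteration and OCR-confusion models:

```python
import difflib
import unicodedata

# Toy Cyrillic-to-Latin table; illustrative only.
CYRILLIC = {"И": "I", "в": "v", "а": "a", "н": "n", "о": "o",
            "П": "P", "е": "e", "т": "t", "р": "r"}

def to_latin(name):
    """Transliterate and strip diacritics (e.g. 'é' -> 'e')."""
    name = "".join(CYRILLIC.get(ch, ch) for ch in name)
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c)).lower()

def names_match(entered, scanned, threshold=0.8):
    """Fuzzy-compare the applicant's entered name with the (possibly
    OCR-garbled, possibly non-Latin) name on the scanned document."""
    ratio = difflib.SequenceMatcher(
        None, to_latin(entered), to_latin(scanned)).ratio()
    return ratio >= threshold, round(ratio, 2)

print(names_match("Ivan Petrov", "Иван Петrov"))   # mixed-script name
print(names_match("Ivan Petrov", "lvan Petr0v"))   # OCR confusions
```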
The system generates a diverse set of features derived from signals such as user-generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using cross-network information with a diverse set of features for a user leads to a more complete and accurate understanding of the user's topics than using any single network or any single source.
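The cross-network fusion can be sketched as a weighted merge of per-network topic profiles, where a topic observed on several networks is reinforced. The networks, weights, and scores are illustrative placeholders:

```python
def fuse_topic_profiles(network_profiles, weights=None):
    """Combine per-network topic scores for one user into a single
    normalized profile; topics seen on several networks get reinforced."""
    weights = weights or {net: 1.0 for net in network_profiles}
    fused = {}
    for net, topics in network_profiles.items():
        for topic, score in topics.items():
            fused[topic] = fused.get(topic, 0.0) + weights[net] * score
    total = sum(fused.values()) or 1.0
    return {t: round(s / total, 3) for t, s in fused.items()}

profiles = {
    "twitter":  {"machine_learning": 0.6, "startups": 0.4},
    "linkedin": {"machine_learning": 0.8, "recruiting": 0.2},
}
print(fuse_topic_profiles(profiles))
```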
Offers are usually unstructured text items. For many applications where similar offers must be found or offer titles linked to websites, it is beneficial to recognize the characteristics of individual offerings instead of working on unstructured offer titles directly. In this talk I will discuss which aspects of an offer are relevant and present an approach to automatically extract these pieces of information. I will also briefly touch upon possible applications built on top of such structured offers.
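A minimal sketch of turning an offer title into attribute/value pairs, assuming a small set of hand-written extractors (the attribute names, patterns, and example title are hypothetical; the talk's approach learns these rather than hard-coding them):

```python
import re

# Hypothetical attribute extractors for retail offer titles.
EXTRACTORS = {
    "brand":    re.compile(r"\b(Apple|Samsung|Sony)\b", re.I),
    "capacity": re.compile(r"\b(\d+\s?(?:GB|TB))\b", re.I),
    "color":    re.compile(r"\b(black|white|silver|gold)\b", re.I),
}

def structure_offer(title):
    """Turn an unstructured offer title into attribute/value pairs,
    keeping the remainder as the free-text product name."""
    attrs = {}
    rest = title
    for attr, pattern in EXTRACTORS.items():
        m = pattern.search(rest)
        if m:
            attrs[attr] = m.group(1)
            rest = (rest[:m.start()] + rest[m.end():]).strip()
    attrs["name"] = re.sub(r"\s{2,}", " ", rest).strip(" ,-")
    return attrs

print(structure_offer("Apple iPhone 13 128GB black - refurbished"))
```

Once titles are structured this way, similar offers can be matched attribute by attribute instead of by raw string comparison.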
In this talk, I will demystify those concepts and technologies, and show you the fascinating world of Linked Open Data.
Edmunds.com is an industry-leading website for car shoppers. To effectively support the car-purchasing process, Edmunds needs to understand the features and options available on the myriad different models offered by manufacturers each year. This critical structured database supports faceted search of models, searching available inventory, and other strategic uses.
This end-to-end capability supports robust processing of unstructured data to identify properties like “air conditioning” and “climate control,” and to understand that they are the same underlying feature. For Edmunds, this meant an ~85% reduction in the time it takes to get information about a new car model online: from 2 weeks to just 1-2 days. We will also discuss how the NLP models can be re-used across other data, mapping Edmunds’ detailed ontology to a variety of unstructured data sources.
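The normalization step can be sketched as a lookup into a canonical vocabulary of known surface variants, with a fuzzy fallback for spelling and spacing variation. The vocabulary and thresholds below are illustrative, not Edmunds' ontology:

```python
import difflib

# Hypothetical canonical feature vocabulary with known surface variants.
CANONICAL = {
    "air_conditioning": {"air conditioning", "climate control", "a/c", "ac"},
    "sunroof":          {"sunroof", "moonroof", "panoramic roof"},
}

def normalize_feature(raw):
    """Map a raw feature string from a spec sheet to a canonical ID."""
    key = raw.strip().lower()
    for canonical_id, variants in CANONICAL.items():
        if key in variants:
            return canonical_id
    # Fuzzy fallback for minor spelling/spacing variation.
    all_variants = {v: cid for cid, vs in CANONICAL.items() for v in vs}
    close = difflib.get_close_matches(key, all_variants, n=1, cutoff=0.85)
    return all_variants[close[0]] if close else None

print(normalize_feature("Climate Control"))   # exact variant match
print(normalize_feature("Air-Conditioning"))  # fuzzy match
```

Unmatched strings return `None` and can be routed to human review or used as training signal for extending the ontology.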
Using the Ingenuity KB, Ingenuity Systems (now a part of QIAGEN) provides software solutions to interpret biological datasets. By aligning those datasets (e.g., raw research observations or clinical genomic-testing data) to the KB, they can be viewed, analyzed, and interpreted in the context of relevant biological and biomedical knowledge. I’ll discuss the Ingenuity ontology structure, building process, maintenance regime, and several use cases.
We adopt a principled approach to defining who an expert is. An expert is someone who (a) writes consistently about a small set of tightly related topics (if you are an expert in everything, you are an expert in nothing); (b) has a loyal following that engages with their content consistently and finds it useful; and (c) actually expresses opinions on the topics they write about rather than merely breaking the news.
Formulating the above criteria, and implementing them at scale, is a daunting big-data task. First, we needed to form a comprehensive picture of the body of work published by authors who often write for many different outlets, at times under different aliases. Second, we had to create a dynamic topical model that learns the relationships among tens of thousands of topics by analyzing millions of documents. Third, we had to come up with a formula that yields a stable, consistent ranking that is robust to fluctuations in publishing patterns and engagement data, yet adaptable enough to let new experts and their voices be heard.
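The three criteria can be sketched as a toy score: topical focus measured as low entropy over an author's topic distribution, plus engagement and opinion terms. The weights and inputs are illustrative, not the production formula:

```python
import math

def expert_score(topic_counts, engagement_rate, opinion_ratio):
    """Toy scoring mirroring the three criteria: (a) topical focus
    (low entropy over topics), (b) loyal engagement, and (c)
    opinionated writing. Weights are illustrative."""
    total = sum(topic_counts.values())
    probs = [c / total for c in topic_counts.values()]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(topic_counts)) or 1.0
    focus = 1.0 - entropy / max_entropy if len(topic_counts) > 1 else 1.0
    return round(0.4 * focus + 0.4 * engagement_rate + 0.2 * opinion_ratio, 3)

# A focused author outranks an "expert in everything" with
# identical engagement and opinion signals.
focused = expert_score({"nlp": 45, "ml": 5},
                       engagement_rate=0.7, opinion_ratio=0.8)
scattered = expert_score({t: 10 for t in "abcde"},
                         engagement_rate=0.7, opinion_ratio=0.8)
print(focused, scattered)
```

A real ranking would additionally smooth these inputs over time to get the stability and robustness described above.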