Conference Schedule and Videos

Friday April 24th, 2015

7:30 am

Registration and Breakfast

Come early to check in, have breakfast with us and get settled.

8:45 am

Opening Remarks

Alexy Khrabrov

Nitro @ChiefScientist

A grand welcome and introductory remarks by Alexy Khrabrov, Chief Scientist, Nitro and Principal, By The Bay

9:00 am

Host Sponsor Welcome

Tihomir Bajić

Nitro @tihomirb

Welcome from the CTO of Nitro, the host sponsor of the conference

9:10 am

Keynote Address: Now is the Golden Age of Text Analysis

Mark Liberman

University of Pennsylvania

@mark_liberman

Now is the Golden Age of Text Analysisvideo slides

Like steam power in 1780, telegraphy in 1835, or telephony in 1875, computational text analysis is entering a golden age of invention and application. Methods, infrastructure, and needs have come together to create an extraordinary range of opportunities. This talk will sketch the history, predict the future, warn of dangers, and speculate about challenges.

10:00 am

Coffee Break (30 min)

Track A

Track B

Track C

10:30 am

Label Quality in the Era of Big Datavideo slides

Omar Alonso

Microsoft

@elunca

Organizations that develop and use technologies around information retrieval, machine learning, recommender systems, and natural language processing depend on labels for engineering and experimentation. These labels, usually gathered via human computation, are used in machine learned models for prediction and evaluation purposes. In such scenarios collecting high quality labels is a very important part of the overall process. We elaborate on these challenges and explore possible solutions for collecting high quality labels at large scale.

NLP And Deep Learning: Working with Neural Word Embeddingsvideo slides

Adam Gibson

Skymind

@agibsonccc

In this talk, we will cover how to work with neural nets in and text. This will encompass a comparison of the different algorithms, analogies to more traditional methods such as bag of words, followed by a basic idea of how to use 2 different architectures of neural networks for sequential and document classification.

Learning The Semantics of Millions of Entitiesvideo slides

Vlad Giverts

Workday

@vladgiverts

What do "Software Developer", "MTS", and "Code Monkey" have in common? No, it's not the start to a bad joke. It's actually a situation where many unique entities turn out to have a small number of distinct semantics. This talk will present new techniques for mapping millions of such entities into a common semantic space with just a few thousand labels. We'll discuss how these techniques have been applied to job titles, skills, majors, and degrees to build candidate recommendation systems.

11:10 am

Coffee Break (10 min)

Track A

Track B

Track C

11:20 am

Reviving the Traditional Russian Orthography for the 21st Centuryvideo slides

Sergei Winitzki

Versal

Virtually all the world-famous classic works of Russian literature were written in what is now called "old" Russian orthography, which was banned in the Soviet Russia in 1918. For this reason, the traditional Russian orthography has no support in modern operating systems and text processing software. To publish a critical edition of Tolstoy or Dostoyevsky in the original is a veritable challenge today! I describe my efforts to bring the "old" Russian orthography to the desktop. I will demonstrate a keyboard layout, a spelling dictionary, and a converter between old and modern spelling.

Discovering Knowledge in Linked Datavideo slides

James Earl Douglas

Wikimedia

@jearldouglas

By building on the foundations of the Semantic Web, we can create tools that help people explore relationships between data, connect information, and discover knowledge. In this talk, we'll look at how to search Wikidata from a graph database via a domain-specific language. We'll be able to ask simple questions such as "What happened on this day in history?", and tricky questions such as "What were some of the fields of work of physicists who worked at institutions where Richard Feynman also worked?".

Increasing Honesty in Airbnb Reviewsvideo slides

Dave Holtz

AirBnB

@daveholtz

Reviews and reputation scores are increasingly important for decision-making, especially in the case of online marketplaces. However, online reviews may not provide an accurate depiction of the characteristics of a product, either because many people do not leave reviews or because some reviewers omit salient information.

At Airbnb, we study the causes and magnitude of bias in online reviews by using large-scale field experiments that change the incentives of buyers and sellers to honestly review each other. Natural language processing has allowed us to extend our analyses and study bias in reviews by using the written feedback guests and hosts write after a trip.

11:40 am

Coffee Break (10 min)

Track A

Track B

Track C

11:50 am

Teaching Machines to Read for Fun and Profitvideo

Kang Sun

Bloomberg

Kang Sun from the R\&D Machine Learning group will speak about Bloomberg’s current projects in the area of Machine Learning and Natural Language Processing, such as sentiment analysis of financial news, market impact predictions, question answering, etc. There will be a discussion of future directions and as well as a Q\&A session.

In this talk Kang Sun from the R&D Machine Learning group at Bloomberg will speak about current projects involving Machine Learning and applications such as Natural Language Processing. We will discuss the evolution and development of several key Bloomberg projects such as sentiment analysis, market impact prediction, novelty detection, social media monitoring, question answering and topic clustering. We will show that these interdisciplinary problems lie at the intersection of linguistics, finance, computer science and mathematics, requiring methods from signal processing, machine vision and other fields. Throughout, we will talk about practicalities of delivering machine learning solutions to problems of finance and highlight issues such as importance of appropriate problem decomposition, feature engineering and interpretability.

There will be a discussion of future directions and applications of Machine Learning in finance as well as a Q&A session.

Organizing Real Estate Photo Collections with Deep Learningvideo slides

Shourabh Rawat

Trulia

@shrawat87

Real Estate Websites like Trulia and Zillow host millions of property listings, with each listing consisting of rich textual description and images of the property. While rich in information, the discoverability of this data is limited by its unstructured nature. For Example, How do we learn if "granite countertops" is an interesting real estate term. And if it is, how can we assign it to one of the many photos associated with the property.

In this talk we detail our approach to organize Trulia's unstructured content into rich photo collections similar to Houzz.com or Zillow Digs, without the need of any explicit user tagging.

By leveraging the recent advances in deep learning for computer vision and nap, we first automatically construct a knowledge base of relevant real estate terms and then annotate our photo collections by fusing knowledge from a deep convolutional network for image recognition and a word embedding model.

The novelty in our approach lies in our ability to scale to a large vocabulary of real estate terms without explicitly training a vision model for each one of them.

Human Curated Linguistics - technology behind Cognitive Analyticsvideo slides

Nikita Ivanov

DataLingvo

@c64hacker

This presentation will provide an overview of Human Curated Linguistic (HCL) technology developed and used by DataLingvo in their Cognitive Analytics platform. HCL provides an industry-first real time free-form language comprehension and guaranteed answer correctness required for cognitive analytics applications.

Details of technical implementation and development stack will be discussed.

Live demonstration will be performed to show how HCL is answering questions asked in a free-form language about business data from Google Analytics and salesforce.com data sources.

12:30 pm

Lunch

Track A

Track B

Track C

1:30 pm

The Art of PDF Processingvideo slides

Roman Lasskly

Nitro

@rlasskiy

The algorithms that power PDF understanding

Unlocking Our Health Data: Transforming Unstructured Data at Scalevideo slides

Ola Wiberg

Human API

@olawiberg

Each of us have a plethora of a health data that resides in unstructured, non-standard formats and silos. Bringing this data together can reveal powerful insights about our health, but proves to be a staggering technical challenge. Unstructured narratives contain key pieces of information that can not easily be extracted without additional processing.

We are building a system to organize this unstructured data, classify it into known topics, and apply additional levels of normalizations -- all in near real-time and at scale. This talk will cover some of the technical challenges we are facing and how we are solving them with machine learning and natural language processing techniques.

TopicStream, an Application and Architecture for Content Integration in Electronic Readingvideo slides

Jacek Ambroziak

Ambrosoft

@JacekAmbroziak

The most popular ebook readers inherit from paper books the limiting concept of pagination. In electronic reading not only is pagination notoriously difficult for scientific/technical/medical (STM) but it locks content in one dimensionality of content consumption. We radically depart from pages and propose to split content into smaller, semantically self-contained 'tiles.' In contrast to pages, tiles can be more easily related to other tiles that can come from different books, GitHub repositories, StackOverflow discussions, Wikipedia, official documentation from the WWW, etc. Collections of documents from these other sources can be packaged as pre-tiled EPUB3 ebooks. The TopicStream app enables seamless navigation between book content & complementary documents without the need to explicitly open/close document collections. This approach adds value to commercial content in today's world where a lot of relevant information is available on-line.

2:10 pm

Coffee Break (10 min)

Track A

Track B

Track C

2:20 pm

Near-Realtime Webpage Recommendations “One at a time” Using Content Featuresvideo slides

Ashok Venkatesan

StumbleUpon

@vashok

Today, information overload is a problem pertinent to most information systems being used on a daily basis, with the World Wide Web chief among them. One of the key goals of Stumbleupon, a web content recommendation platform, is to ease this overload, while empowering discovery of relevant information. Our subscription to the “one recommendation at a time” concept focuses in producing an experience of serendipity as users continue to surf the web, while giving us the flexibility to reactively make recommendations near-realtime. In this presentation we will present the challenges that need to be addressed to extract content features from a web page and making near-realtime recommendations using them. We will describe the main algorithmic approach as well as the general architecture motivating our choices of tools, languages and platforms.

Unsupervised NLP Tutorial using Apache Sparkvideo slides

Marek Kolodziej

Nitro

@marekinfo

Paraphrasing Tim O'Reilly, the person who has the most data wins. That's a neat slogan, but the more data one has, the more likely it is to be unlabeled. Unfortunately, there aren't that many unsupervised learning algorithms out there, for machine learning in general and for NLP in particular. Recent advances in deep learning provide new tools for text mining of large unsupervised datasets. In particular, I will talk about the math, intuition and implementation of the word2vec algorithm, its variants (skipgram and continuous bag of words), use cases, and extensions (e.g. paragraph2vec, doc2vec). I will wrap up with a simple demonstration at scale using Scala, Apache Spark, MLLib, and the Apache Zeppelin Notebook.

A Web Worth of Data: Common Crawl for NLPvideo slides

Stephen Merity

CommonCrawl

@smerity

The Common Crawl corpus contains petabytes of web crawl data and is a treasure trove of potential experiments. To introduce you to the possibilities that web crawl data has for NLP, we will take a detailed look at how the data has been used by various experiments and how to get started with the data yourself.

3:00 pm

Coffee Break (20 min)

Track A

Track B

Track C

3:20 pm

A High Level Overview of Genomics in Personalized Medicinevideo slides

John St. John

Driver Genomics

Nearly twenty years ago president Clinton announced the completion of one of the largest public/private collaborative efforts in history, the first draft of the human genome. This work promised to bring forth a new era of totally personalized medicine, where the unique blueprint for your body is used to determine the most effective treatment options for you as an individual. Finally this promise is starting to be realized in the field of oncology, among others. I will give a high level overview of medical genomics with an emphasis on my area of expertise, using it to guide decision making in oncology.

Statistical Machine Translation Approach for Name Matching in Record Linkvideo slides

Jeffrey Sukharev

Ancestry.com

Record linkage, or entity resolution, is an important area of data mining. Name matching is a key component of systems for record linkage. Alternative spellings of the same name are a common occurrence in many applications. We use the largest collection of genealogy person records in the world together with user search query logs to build name- matching models. The procedure for building a crowd-sourced training set is outlined together with the presentation of our method. We cast the problem of learning alternative spellings as a machine translation problem at the character level. We use information retrieval evaluation methodology to show that this method substantially outperforms on our data a number of standard well known phonetic and string similarity methods in terms of precision and recall. Our result can lead to a significant practical impact in entity resolution applications.

Knowledge Maps for Content Discoveryvideo slides

Oren Schaedel

Versal

Content discoverability and composition is a moving target in education, especially matching content complexity and media diversity to the needs of different students. In this talk I will describe our methods of performing large scale categorization of online courses using a crowd-sourced taxonomy. Our methodology is agnostic to the media of content, whether it is text, images, or video and uses Wikipedia as a taxonomy for semi-labeled categorization of content. I will also demo a visualization of Versals’ Knowledge Map, a "Google-Maps" for content exploration.

3:40 pm

Coffee Break (20 min)

Track A

Track B

Track C

4:00 pm

Scalable Genome Analysis With ADAMvideo slides

Frank Nothaft

AMPLab, UC Berkeley

@fnothaft

Thanks to substantial improvements in the cost and throughput of DNA sequencing machines, genomic data may soon make personalized medicine a reality. However, significant processing is needed to turn raw DNA strings captured by sequencers into clinically useful data, and modern DNA processing software can take up to a week to run. In this talk, we'll look at how we reconstruct genomes from the raw sequence data, and we introduce ADAM, an Apache Spark-based API for accelerating genome processing pipelines.

Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logicvideo slides

Malcolm Greaves

Nitro

Why do we want information? So that we can use it? So that our computers can use it? When we have access to rich, structured information we can make advanced applications that solve real-world pain points.

In this talk, I'll present an effective approach for automatically creating knowledge bases: databases of factual, general information. This relation extraction approach centers around the idea that we can use machine learning and natural language processing to automatically recognize information as it exists in real-world, unstructured text.

I'll cover the NLP tools, special ML considerations, and novel methods for creating a successful end-to-end relation extraction system. I will also cover experimental results with this system architecture in both big-data and a search-oriented environments.

Identity Resolution in the Sharing Economyvideo slides

David Murgatroyd

Basis Technology

@dmurga

A growing sharing economy demands new, cost effective ways of establishing and checking identity, to allow services and participants to accurately assess risks and make good choices.

For example, Airbnb verifies offline identities using a scan of your driver’s license or passport. This is checked against templates designed to examine things like the layout and other government indicators of authenticity to help confirm that it appears to be valid. Crucially it involves checking an applicant’s entered name – often in Latin script – against their name on the scanned document, which may be in another script or language, and subject to potentially egregious OCR errors.

More generally, connecting the public and private traces that people, organizations and things — like vehicles — leave in various information stores is essential to delivering valuable analytics and novel services. This is often called entity analytics or identity resolution.

In this talk, we will explore enabling technology in both structured and unstructured contexts, discuss current challenges and limitations, and explore additional examples.

4:40 pm

Coffee Break (20 min)

Track A

Track B

Track C

5:00 pm

Science Panelvideo

Q&A with Pete Skomoroch, Pedro Alves, Jeremy Howard, and Ben Pedrick about applied ML/NLP

6:30 pm

Reception

Saturday April 25th, 2015

7:30 am

Arrival and Breakfast

Come early to get breakfast and get settled for another day of amazing talks!

9 am

Updates and Keynote Address -- Text(ing): the Rebirthvideo slides

Jean Sini

Fountain.com

@jeansini

Natural Language Processing as the Core of a Consumer Applicationvideo

NLP is often relegated to an after-the-fact, or off-to-the-side role: spam detection or gleaning business insight from user communication and comments that have already occurred. But a new generation of applications - Luka, Thumbtack, Fountain - put the understanding of natural language front and center, often as the first thing that consumers touch. We'll take a deep look at Fountain, both how it classifies plain English questions, and how it identifies which of 70,000+ human skills is necessary to solve the question. The talk will cover both language classification and relationship extraction, particularly focusing on how human expertise is interrelated.

10:00 am

Coffee Break (20 min)

Track A

Track B

Track C

10:20 am

Learning Compositionality with Scalavideo slides

Ignacio Cases

Stanford

@ignaciocases

Logical and statistical approaches to computational semantics have usually been considered orthogonal, but recent proposals are considering a synthesis of these perspectives by developing statistical models that are able to learn compositional semantics. In this talk we will show how it is possible to implement some of these techniques in the Synthesis framework proposed by Liang and Potts (2015) with a statically typed, functional language as Scala, and we will explore the extension of the implementation with algebraic constructs using a category-theoretic perspective. In particular, we will argue that precisely the functional paradigm with static typing provide natural solutions that are of great interest to many aspects of computational semantics and pragmatics.

Semantic Indexing of Four Million Documents with Apache Spark

Sandy Ryza

Cloudera

@sandysifting

Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to better understand the latent relationships and concepts in large corpuses. In this talk, we’ll walk through what it looks like to apply LSA to the full set of documents in English Wikipedia, using Apache Spark. Harnessing the Stanford CoreNLP library for lemmatization and MLlib’s scalable SVD implementation for uncovering a lower-dimensional representation of the data, we’ll undertake the modest task of enabling queries against the full extent of human knowledge, based on latent semantic relationships.

Turning the Web into a Structured Databasevideo

Mike Tung

Diffbot

@miketung

For the web to truly progress, information must be able to seamlessly flow between your devices, services, and applications. A truly mainstream solution requires building a new kind of search, one that can see the entire web as structured information, rather than documents. This session highlights Diffbot’s novel approach to translating the web into a machine-readable format using a combination of NLP, computer vision, and machine learning.

11:00 am

Coffee Break (10 min)

Track A

Track B

Track C

11:10 am

Transforming an Algorithm for Online Recommendations into a Multi-lingual Syntax Parservideo slides

Seth Redmore

Lexalytics

@sredmore

You need solid syntax parsing to really understand the nuance of language. Complicated negation patterns, relationships between entities, entity sentiment assignment (and many other things) are all examples for which having sophisticated syntax understanding is important. The question then is how to get an understanding of syntax across many languages, content types, and contexts. Most traditional model-based approaches require manually coded syntax trees, which are costly to generate, as they require relatively expensive linguist time. These trees exist for some languages, and some content types; but not for, say, German Tweets, or Swedish biotech. It turns out that the problem can be stated as a “similarity” problem, which then looks like a recommendation problem. This presentation will discuss how we leveraged a matrix factorization recommendation algorithm to create a highly efficient, easily extensible syntax parser.

Large Scale Topic Assignment on Multiple Social Networksvideo slides

interests and expertise is a challenging problem with applications in various data-powered products. In this talk, we present a full production system used at Lithium Technologies (Klout), which mines topical interests and expertise from multiple social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis.

The system generates a diverse set of features derived from signals such as user generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using cross-network information with a diverse features for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network or any single source.

Transforming Unstructured Offer Titlesvideo slides

Katrin Tomanek

VigLink

VigLink helps publishers monetize content by affiliating existing commercial links and automatically identifying product references that can be linked to commercial sites. At VigLink, we have an ever growing catalogue of product offers (~330M) from multiple sources (including e.g. Amazon, Ebay, Shopzilla) and verticals (form Automotive to Consumer Electronics to Home \& Garden).

Offers are usually unstructured text items. For many applications where similar offers should be found or offer titles need to be linked to websites, its beneficial to recognize the characteristics of individual offering instead of working on unstructured offer titles directly. In this talk I will discuss what the relevant aspects of an offer are and present an approach to automatically extract these pieces of information. I will also shortly touch upon possible applications on top of such structured offers.

11:30 am

Coffee Break (10 min)

Track A

Track B

Track C

11:40 am

Measuring Well-Being Using Social Mediavideo prezi

Lyle Ungar

University of Pennsylvania

Social media such as Twitter and Facebook provide a rich, if imperfect portal onto people's lives. We analyze tens of millions of Facebook posts and billions of tweets to study variation in language use with age, gender, personality, and mental and physical well-being. Word clouds visually illustrate the big five personality traits (e.g., "What is it like to be neurotic?"), while correlations between language use and county level health data suggest connections between health and happiness, including potential psychological causes of heart disease. Similar analyses are useful in many fields.

Classifying Text without (many) Labelsvideo slides

Mike Tamir

Galvanize

@MikeTamir

Supervised text classification is often hampered by the need to acquire relatively expensive labeled training sets. In some embodiments of the systems and methods disclosed herein, pre-existing Word2Vec or similar algorithms are leveraged to create vector representations of documents that enable a model to be successfully trained with a drastically reduced training set. By using this technique the implementer can now devote low investment to acquiring a small volume of labeled data examples in order to train proximity thresholds, without devoting significant resources using traditional text classification machine learning algorithms which typically require training volume examples that are orders of magnitude larger.

Introduction to RDF and Linked Data

Alexandre Bertails

@bertails

For most NLP practitioners, the Web is seen as a Web of Documents where information is waiting to be extracted. But what if we had access to a Web of machine-readable Data instead? That is basically the promise behind RDF and Linked Data and it is already here.

In this talk, I will demystify those concepts and technologies, and show you the fascinating world of Linked Open Data.

12:20 pm

Lunch

Track A

Track B

Track C

1:20 pm

Building the world’s Largest Database of Car Features from PDFsvideo slides

John Akred

Silicon Valley Data Science

We will discuss a new system that supports editors creating a database of the features and options available across car models, creating structured data by semi-automated information extraction from lengthy PDF documents.

Edmunds.com is an industry-leading website for car shoppers. To effectively support the car-purchasing process, Edmunds needs to understand the features and options available on the myriad different models offered by manufacturers each year. This critical structured database supports faceted search of models, searching available inventory, and other strategic uses.

This end-to-end capability supports robust processing of unstructured data to identify properties like “air conditioning” and “climate control,” and understand that they are the same underlying feature. For Edmunds, this meant an ~85% reduction in the time it now takes them to get information about a new car model online, from 2 weeks to just 1-2 days. We will also discuss how the NLP models can be re-used across other data, mapping Edmunds’ detailed ontology to a variety of unstructured data sources.

Learning From the Diner's Experiencevideo

Sudeep Das

OpenTable

@datamusing

I will talk about how we are using data science to help transform OpenTable into a local dining expert who knows you very well, and can help you and others find the best dining experience wherever we travel! This entails a whole slew of tools from natural language processing, recommendation system engineering, sentiment analysis that have to work in synch to make that magical experience happen. One of our main sources of insight are the reviews left by diners on our website. In this talk, I will stress on what we are learning from our rich set of diner reviews, especially using topic modeling as a core tool. I will touch upon various possible applications of this technique that we are currently exploring in both restaurateur facing and diner facing features.

Practical NLP Applications of Deep Learningvideo slides

Samiur Rahman

Mattermark

@samiur1204

Deep Learning is the hot "new" technique in the world of Machine Learning, but most of the published benefits of Deep Learning has been tied to audio and visual data. There are, however, significant benefits users can draw from Deep Learning, particularly in the area of unsupervised representation learning. This talk focuses on the practical applications of these techniques, particularly neural network word embeddings. I also explore how Mattermark uses these techniques to perform many ML and NLP tasks.

2:00 pm

Coffee Break (10 min)

Track A

Track B

Track C

2:10 pm

The Ingenuity Biomedical Knowledge-Base: Advantages of Modeling Knowledge in an Ontologyvideo slides

Jeff Lerman

QIAGEN

Today’s enormous corpus of biomedical knowledge presents amazing opportunities to improve human health. However, the knowledge’s fragmentation across the literature and numerous databases poses serious challenges to those opportunities. The Ingenuity Biomedical Knowledgebase (KB) addresses those challenges, providing a framework to model biomedical knowledge in a unified system – implemented as a frame-based ontology. That structure facilitates powerful inference and quality-control features.

Using the Ingenuity KB, Ingenuity Systems (now a part of QIAGEN) provides software solutions to interpret biological datasets. By aligning those datasets (e.g., raw research observations or clinical genomic-testing data) to the KB, it can be viewed, analyzed, and interpreted in the context of relevant biological and biomedical knowledge. I’ll discuss the Ingenuity ontology structure, building process, maintenance regime, and several use-cases.

Identifying Events with Tweets

Gabor Szabo

Twitter

@gaborjszabo

Gabor is a Staff Data Scientist at Twitter, and works on describing and predicting user behavior and modeling large-scale content dynamics on Twitter. Before that he worked on predicting content popularity in crowdsourced ecologies, on the network analysis of online services, and did research on large-scale social and biological systems. Before, he worked at HP Labs, Harvard Medical School, and the University of Notre Dame.

Deep Learning for Natural Language Processingvideo slides

Richard Socher

MetaMind

@RichardSocher

In this talk, I will describe deep learning algorithms that learn representations for language that are useful for solving a variety of complex language problems. I will focus on 3 tasks: Fine-Grained sentiment analysis; Question answering to win trivia competitions (like Whatson's Jeopardy system but with one neural network); Multimodal sentence-image embeddings (with a fun demo!) to find images that visualize sentences. I will also show some demos of how deepNLP can be made easy to use with MetaMind.io's software.

2:50 pm

Coffee Break (10 min)

Track A

Track B

Track C

3:00 pm

Using Big Data to Identify the World's Top Expertsvideo slides

Nima Sarshar

InPowered

@nimilinimo

In this talk, we report on our implementation of a big data system that is able to automatically identify and rank experts in a large number of categories by ingesting and analyzing millions of pieces of content published across the Web every day.

We adopt a principled approach to defining who an expert is. An expert is someone who (a) writes consistently about a small set of tightly related topics; if you are an expert in everything, you are an expert in nothing, and (b) who has a loyal following that engages with her contents consistently and finds them useful, and © who actually expresses opinions on the topics he writes about rather than merely breaking the news.

Formulating the above criteria, and implementing it at scale, is a daunting big data task. Firstly we needed to form a rather comprehensive picture of the body of works published by authors that often write on many different outlets and at times under different aliases. Secondly, we had to create a dynamic topical model that learns the relationship between tens of thousands of topics by analyzing millions of documents. Thirdly, we had to come up with a formula that results in a stable, consistent, ranking, that is robust to fluctuations in publishing patterns and engagement data, yet is adaptable to allow in for new experts and their voices to be heard.

Experts vs. Influencers: defining who an expert is
Unifying identities of authors across sites
A dynamic topical model that scales
Projection of topics onto authors
Opinion vs. Sentiment vs. Statement of Facts
Putting it all together
A note on architecture

How Terminal makes Machine Learning Fast and Funvideo

Varun Ganapathi

Terminal.com

@varungp

The talk will be to explain how Terminal works and how people are using it in Machine Learning applications or Big Data analysis (like, out of the box multi-tenant Spark clusters).

Extended Swadesh Listvideo slides

Dmitry Gusev

Purdue University

@dmitri_a_gusev

Do differences between natural languages increase as time passes? An improved version of an extended Swadesh list of basic meanings has been developed to help answer this question by means of lexicostatistical analysis. The new instrument and the development process behind it will be of interest to researchers who study stability of meaning-word pairs and develop NLP methods for identification of cognates.

3:20 pm

Coffee Break (10 min)

Track A

Track B

Track C

3:30 pm

Topic-Based Sentiment Analysis in Customer Feedbackvideo slides

Vita Markman

Much of customer support at LinkedIn is done via some form of online communication such as online feedback forms or email between members and support agents. Topic-based sentiment analysis of member feedback is critical since a single piece of feedback may address several different topics with different sentiment expressed in each. This talk addresses the topic-based sentiment analysis of customer support feedback focusing on the following questions 1) how do we find the most relevant topics of a product in question 2) how do we ensure to attribute sentiment to these specific topics as opposed to the feedback as a whole 3) how do we leverage natural language processing tools such as key phrase extraction and synonym identification to make the obtained topic-sentiment information best suitable for human consumption. The model proposed here is extendable to mining sentiment in reviews or any other sentiment-bearing text.

Identifying CrunchBase Entities in News Articlesvideo slides

Gershon Bialer

CrunchBase

We will discuss doing record linkage to entities identified in news articles scraped from the web. Further, we will discuss the challenges of working with user-edited entities, that are constantly changing.

Incentivized Question and Answer Datavideo slides

Matthew Drescher

Poshly

Poshly incentivizes its users to participate in online, dynamically generated surveys. The questions and answers are written by our team of subject matter experts from the cosmetic industry. This talk will cover the basics of our approach. The way our content is created, and how content is selected. Depending on time and interest we can also touch on aspects of our data pipeline which was developed in Scala.

Track A

Track B

Track C

4:00 pm

Scalable Online Learning of Topic Models with Sparkvideo slides

Alex Minnaar

Vertical Scope

This talk deals with the problem of how to learn topic models from large text corpora that are constantly growing such as with online forums. As documents stream into your corpus it is much more efficient to update your already learned topic model rather than batch processing your entire corpus. Furthermore, Apache Spark can be used to perform the sequential updates in a distributed fashion. The talk will also include a discussion on how to use your learned topic model to classify the documents in your corpus based on the topics they contain.

A Word is Worth a Thousand Vectorsvideo slides

Christopher Erick Moody

StitchFix

@chrisemoody

Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll speak about word2vec, related techniques, and try to convince you that word vectors give us a simple and flexible platform for understanding text.

ML Scoring: Where Machine Learning Meets Searchvideo slides

Search can be viewed as a combination of a) A problem of constraint satisfaction, which is the process of finding a solution to a set of constraints (query) that impose conditions that the variables (fields) must satisfy with a resulting object (document) being a solution in the feasible region (result set), plus b) A scoring/ranking problem of assigning values to different alternatives, according to some convenient scale. This ultimately provides a mechanism to sort various alternatives in the result set in order of importance, value or preference. In particular scoring in search has evolved from being a document centric calculation (e.g. TF-IDF) proper from its information retrieval roots, to a function that is more context sensitive (e.g. include geo-distance ranking) or user centric (e.g. takes user parameters for personalization) as well as other factors that depend on the domain and task at hand. However, most system that incorporate machine learning techniques to perform classification or generate scores for these specialized tasks do so as a post retrieval re-ranking function, outside of search! In this talk I show ways of incorporating advanced scoring functions, based on supervised learning and bid scaling models, into popular search engines such as Elastic Search and SOLR. I'll provide practical examples of how to construct such "ML Scoring" plugins in search to generalize the application of a search engine as a model evaluator for supervised learning tasks. This will facilitate the building of systems that can do computational advertising, recommendations and specialized search systems, applicable to many domains.

4:40 pm

Coffee Break (20 min)

5:00 pm

Business of Textvideo

CEO Panel with SriSatish Ambati, Nikita Ivanov, Katelyn Lyster, and Oleg Rogynskyy

SriSatish Ambati

CEO, H2O

Nikita Ivanov

Founding CEO, GridGain; Advisor, DataLingvo

Katelyn Lyster

CEO, ReOrient Media

Oleg Rogynskyy

CEO, Semantria (a Lexalytics company)

6:30 pm

Conference Schedule and Videos

Friday April 24th, 2015

Registration and Breakfast

Opening Remarks

Host Sponsor Welcome

Keynote Address: Now is the Golden Age of Text Analysis

Now is the Golden Age of Text Analysisvideoslides

Coffee Break (30 min)

Track A

Track B

Track C

Label Quality in the Era of Big Datavideoslides

NLP And Deep Learning: Working with Neural Word Embeddingsvideoslides

Learning The Semantics of Millions of Entitiesvideoslides

Coffee Break (10 min)

Track A

Track B

Track C

Reviving the Traditional Russian Orthography for the 21st Centuryvideoslides

Discovering Knowledge in Linked Datavideoslides

Increasing Honesty in Airbnb Reviewsvideoslides

Coffee Break (10 min)

Track A

Track B

Track C

Teaching Machines to Read for Fun and Profitvideo

Organizing Real Estate Photo Collections with Deep Learningvideoslides

Human Curated Linguistics - technology behind Cognitive Analyticsvideoslides

Lunch

Track A

Track B

Track C

The Art of PDF Processingvideoslides

Unlocking Our Health Data: Transforming Unstructured Data at Scalevideoslides

TopicStream, an Application and Architecture for Content Integration in Electronic Readingvideoslides

Coffee Break (10 min)

Track A

Track B

Track C

Near-Realtime Webpage Recommendations “One at a time” Using Content Featuresvideoslides

Unsupervised NLP Tutorial using Apache Sparkvideoslides

A Web Worth of Data: Common Crawl for NLPvideoslides

Coffee Break (20 min)

Track A

Track B

Track C

A High Level Overview of Genomics in Personalized Medicinevideoslides

Statistical Machine Translation Approach for Name Matching in Record Linkvideoslides

Knowledge Maps for Content Discoveryvideoslides

Coffee Break (20 min)

Track A

Track B

Track C

Scalable Genome Analysis With ADAMvideoslides

Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logicvideoslides

Identity Resolution in the Sharing Economyvideoslides

Coffee Break (20 min)

Track A

Track B

Track C

Science Panelvideo

Reception

Saturday April 25th, 2015

Arrival and Breakfast

Updates and Keynote Address -- Text(ing): the Rebirthvideoslides

Natural Language Processing as the Core of a Consumer Applicationvideo

Coffee Break (20 min)

Track A

Track B

Track C

Learning Compositionality with Scalavideoslides

Semantic Indexing of Four Million Documents with Apache Spark

Turning the Web into a Structured Databasevideo

Coffee Break (10 min)

Track A

Track B

Track C

Transforming an Algorithm for Online Recommendations into a Multi-lingual Syntax Parservideoslides

Large Scale Topic Assignment on Multiple Social Networksvideoslides

Transforming Unstructured Offer Titlesvideoslides

Now is the Golden Age of Text Analysisvideo slides

Label Quality in the Era of Big Datavideo slides

NLP And Deep Learning: Working with Neural Word Embeddingsvideo slides

Learning The Semantics of Millions of Entitiesvideo slides

Reviving the Traditional Russian Orthography for the 21st Centuryvideo slides

Discovering Knowledge in Linked Datavideo slides

Increasing Honesty in Airbnb Reviewsvideo slides

Organizing Real Estate Photo Collections with Deep Learningvideo slides

Human Curated Linguistics - technology behind Cognitive Analyticsvideo slides

The Art of PDF Processingvideo slides

Unlocking Our Health Data: Transforming Unstructured Data at Scalevideo slides

TopicStream, an Application and Architecture for Content Integration in Electronic Readingvideo slides

Near-Realtime Webpage Recommendations “One at a time” Using Content Featuresvideo slides

Unsupervised NLP Tutorial using Apache Sparkvideo slides

A Web Worth of Data: Common Crawl for NLPvideo slides

A High Level Overview of Genomics in Personalized Medicinevideo slides

Statistical Machine Translation Approach for Name Matching in Record Linkvideo slides

Knowledge Maps for Content Discoveryvideo slides

Scalable Genome Analysis With ADAMvideo slides

Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logicvideo slides

Identity Resolution in the Sharing Economyvideo slides

Updates and Keynote Address -- Text(ing): the Rebirthvideo slides

Learning Compositionality with Scalavideo slides

Transforming an Algorithm for Online Recommendations into a Multi-lingual Syntax Parservideo slides

Large Scale Topic Assignment on Multiple Social Networksvideo slides

Transforming Unstructured Offer Titlesvideo slides

Measuring Well-Being Using Social Mediavideo prezi

Classifying Text without (many) Labelsvideo slides

Building the world’s Largest Database of Car Features from PDFsvideo slides

Practical NLP Applications of Deep Learningvideo slides

The Ingenuity Biomedical Knowledge-Base: Advantages of Modeling Knowledge in an Ontologyvideo slides

Deep Learning for Natural Language Processingvideo slides

Using Big Data to Identify the World's Top Expertsvideo slides

Extended Swadesh Listvideo slides

Topic-Based Sentiment Analysis in Customer Feedbackvideo slides

Identifying CrunchBase Entities in News Articlesvideo slides

Incentivized Question and Answer Datavideo slides

A Word is Worth a Thousand Vectorsvideo slides

ML Scoring: Where Machine Learning Meets Searchvideo slides