MRQA 2018: Machine Reading for Question Answering

Workshop at ACL 2018
Date: Thursday, July 19, 2018
Room: 210
Contact: mrqa2018@googlegroups.com

Machine Reading for Question Answering (MRQA) has become an important testbed for evaluating how well computer systems understand human language, as well as a crucial technology for industry applications such as search engines and dialog systems. The research community has recently created a multitude of large-scale datasets over text sources such as Wikipedia (WikiReading, SQuAD, WikiHop), news and other articles (CNN/Daily Mail, NewsQA, RACE), fictional stories (MCTest, CBT, NarrativeQA), and general web sources (MS MARCO, TriviaQA, SearchQA). These new datasets have in turn inspired an even wider array of new question answering systems.

This workshop will gather researchers to address and discuss important research topics surrounding MRQA, including:

  • Accuracy: How can we make MRQA systems more accurate?
  • Interpretability: How can systems provide rationales for their predictions?
  • Speed and Scalability: How can systems scale to consider larger contexts, from long documents to the whole web?
  • Robustness: How can systems generalize to other datasets and settings beyond the training distribution?
  • Dataset Creation: What are effective methods for building new MRQA datasets?
  • Dataset Analysis: What challenges do current MRQA datasets pose?
  • Error Analysis: What types of questions or documents are particularly challenging for existing systems?

Program

8:45–9:00 | Opening remarks
9:00–9:35 | Phil Blunsom, University of Oxford / DeepMind
- Data-driven reading comprehension: successes and limitations [Slides]

The last three years have seen an explosion of interest in applying large-scale machine learning techniques to reading comprehension tasks. This interest has been driven by the availability of large datasets suitable for estimating data-hungry supervised deep learning models. In this talk I will describe how our work at DeepMind has contributed to this trend and discuss whether this is the right approach for developing and evaluating natural language understanding systems.

9:35–10:10 | Sebastian Riedel, University College London
- Reading and Reasoning with Neural Program Interpreters [Slides]

We are getting better at teaching end-to-end neural models how to answer questions about content in natural language text. However, progress has mostly been restricted to extracting answers that are directly stated in the text. In this talk, I will present our work towards teaching machines not only to read, but also to reason with what was read, and to do so in an interpretable and controlled fashion. Our main hypothesis is that this can be achieved by developing neural abstract machines that follow the blueprint of program interpreters for real-world programming languages. We test this idea using two languages: an imperative one (Forth) and a declarative one (Prolog/Datalog). In both cases we implement differentiable interpreters that can be used for learning reasoning patterns. Crucially, because they are based on interpretable host languages, the interpreters also allow users to easily inject prior knowledge and inspect the learnt patterns. We will also present a data generation strategy for producing training sets for tasks that require reading and reasoning, and two datasets we have generated with it: WikiHop and MedHop.

10:10–10:30 | Best paper talk: A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset
10:30–11:00 | Morning coffee break
11:00–11:35 | Richard Socher, Salesforce Research
- The Natural Language Decathlon: Multitask Learning as Question Answering

Deep learning has improved performance on many natural language processing (NLP) tasks individually. However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. We cast all tasks as question answering over a context. Furthermore, we present a new Multitask Question Answering Network (MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. MQAN shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. We demonstrate that MQAN's multi-pointer-generator decoder is key to this success, and performance improves further with an anti-curriculum training strategy. Though designed for decaNLP, MQAN also achieves state-of-the-art results on the WikiSQL semantic parsing task in the single-task setting. We release code for procuring and processing data, training and evaluating models, and reproducing all experiments for decaNLP.
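
To make the "everything as question answering" framing concrete, below is a minimal sketch, in Python, of how a few of the ten tasks could be encoded as (question, context, answer) triples. The exact question wordings used by decaNLP are not reproduced here; the strings are illustrative assumptions.

    # Illustrative only: casting heterogeneous NLP tasks as (question, context, answer)
    # triples in the decaNLP spirit. The question strings are assumptions, not the
    # exact prompts used by decaNLP.
    examples = [
        {   # summarization
            "question": "What is the summary?",
            "context": "<full news article>",
            "answer": "<one-sentence summary>",
        },
        {   # machine translation
            "question": "What is the translation from English to German?",
            "context": "The workshop starts at nine.",
            "answer": "Der Workshop beginnt um neun.",
        },
        {   # sentiment analysis
            "question": "Is this review negative or positive?",
            "context": "A thoughtful, well-acted film.",
            "answer": "positive",
        },
    ]

    # A single QA model can then be trained on all triples with one input/output format.
    for ex in examples:
        print(f"Q: {ex['question']}  ->  A: {ex['answer']}")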

11:35–12:10 | Jianfeng Gao, Microsoft Research
- Multi-step reasoning neural networks for question answering [Slides]

In this talk, I review our recent work on developing multi-step reasoning neural network models for answering complex questions based on either text or a knowledge graph (KG). For text-QA, we present a simple yet robust stochastic answer network (SAN) (Liu et al. 2018) that simulates multi-step reasoning for machine reading comprehension. SAN is unique in its use of a kind of stochastic prediction dropout on the answer module during training, which improves the robustness of the model. For KG-QA, we focus the discussion on recently proposed reinforcement-learning-based approaches that explore multi-step paths in KGs. We describe in detail a graph-walking agent, called M-Walk (Shen et al. 2018), which combines an RNN with Monte Carlo Tree Search and has achieved new state-of-the-art results on several graph-walking benchmarks.
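
As a rough illustration of the stochastic prediction dropout idea, the following sketch (assuming PyTorch; not the authors' implementation) averages the answer module's per-step predictions, randomly dropping whole steps during training. Tensor shapes, the dropout rate, and the function name are assumptions for illustration.

    import torch

    def average_step_predictions(step_preds, dropout_rate=0.4, training=True):
        # step_preds: (num_steps, batch, num_classes) -- one answer distribution per
        # reasoning step. During training, whole steps are randomly dropped before
        # averaging; at inference, all steps are averaged.
        if training:
            keep = torch.rand(step_preds.size(0)) > dropout_rate      # keep each step with prob 1 - rate
            if not keep.any():                                        # make sure at least one step survives
                keep[torch.randint(step_preds.size(0), (1,))] = True
            step_preds = step_preds[keep]
        return step_preds.mean(dim=0)                                 # average over the surviving steps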

12:10–13:45 | Lunch
13:45–14:20 | Sameer Singh, University of California, Irvine
- Questioning Question Answering Answers [Slides]

Although existing QA systems are accurate on many benchmarks, they are often brittle and incorrect in ways that we don't fully understand. In this talk, I will introduce some of our recent tools for interpreting black-box models and present their application to SQuAD and VisualQA systems. In particular, I will show how different forms of explanations, such as word importance, sufficient conditions, and semantic adversaries, can be used to generate rationales, evaluate robustness, and analyze the errors of these complex neural QA systems. (Joint work with Marco Ribeiro and Carlos Guestrin.)
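
As one concrete (if simplistic) way to probe a black-box QA system, the sketch below computes a leave-one-out notion of word importance: how much the model's confidence in its original answer drops when a question word is removed. This is not the speaker's method (LIME/Anchors-style explanations work differently); `predict_confidence` is a hypothetical callable standing in for the model.

    from typing import Callable, List, Tuple

    def word_importance(question: str, context: str,
                        predict_confidence: Callable[[str, str], float]) -> List[Tuple[str, float]]:
        # Leave-one-out probe of a black-box QA model: drop each question word in turn
        # and record how much the model's confidence in its original answer falls.
        base = predict_confidence(question, context)
        words = question.split()
        scores = []
        for i, w in enumerate(words):
            ablated = " ".join(words[:i] + words[i + 1:])                    # question with one word removed
            scores.append((w, base - predict_confidence(ablated, context)))  # confidence drop = importance
        return sorted(scores, key=lambda s: -s[1])                           # most important words first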

14:20–15:30 | Poster session (with one-minute spotlight talks)
15:30–16:00 | Afternoon coffee break
16:00–17:00 | Panel discussion
Annette Frank, Jianfeng Gao, Chris Manning, Sebastian Riedel, Sameer Singh, Richard Socher

Important Dates

  • Deadline for submission: Monday, April 23, 2018
  • Notification of acceptance: Tuesday, May 15, 2018
  • Deadline for camera-ready version: Monday, May 28, 2018
  • Early registration deadline: June 4, 2018
  • Workshop Date: Thursday, July 19, 2018

All submission deadlines are 11:59 PM GMT -12 (anywhere in the world).

Organization

Steering Committee:

Organizing Committee:

Sponsors

Naver

Facebook