Pretraining Text Encoders With Adversarial Mixture Of Training Signal Generators

2022 | ICLR

Y Meng, C Xiong, P Bajaj, S Tiwary, P Bennett, J Han, X Song

We present a new framework AMOS that pretrains text encoders with an Adversarial learning curriculum via a Mixture Of Signals from multiple auxiliary generators. Following ELECTRA-style pretraining, the main encoder is trained as a discriminator to detect replaced tokens generated by auxiliary masked language models (MLMs). Different from ELECTRA which trains one MLM as the generator, we jointly train multiple MLMs of different sizes to provide training signals at various levels of difficulty. To push the discriminator to learn better with challenging replaced tokens, we learn mixture weights over the auxiliary MLMs’ outputs to maximize the discriminator loss by backpropagating the gradient from the discriminator via Gumbel-Softmax. For better pretraining efficiency, we propose a way to assemble multiple MLMs into one unified auxiliary model. AMOS outperforms ELECTRA and recent state-of-the-art pretrained models by about 1 point on GLUE and SQuAD benchmarks for BERT base-sized models. We plan to release our pretrained models for future uses.
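
As a rough illustration of the mixing step described in this abstract, the sketch below draws relaxed one-hot mixture weights with the Gumbel-Softmax trick and uses them to average the token distributions of several generators. It is not the AMOS implementation: the function names, the temperature `tau`, and the toy distributions are assumptions made for illustration, and the adversarial part (backpropagating the discriminator loss into the mixture logits) would need an autodiff framework that is left out here.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample over len(logits) choices (forward pass only)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_generator_outputs(generator_probs, weights):
    """Weighted average of per-generator token distributions (one list per generator)."""
    vocab_size = len(generator_probs[0])
    return [
        sum(w * probs[t] for w, probs in zip(weights, generator_probs))
        for t in range(vocab_size)
    ]


# Hypothetical setup: three generators of different sizes over a three-token vocabulary.
mixture_logits = [0.0, 1.0, 2.0]  # assumed learnable parameters
weights = gumbel_softmax(mixture_logits, tau=0.5, rng=random.Random(0))
mixed = mix_generator_outputs(
    [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.3, 0.4]], weights
)
```

A lower `tau` pushes the sampled weights closer to a hard one-hot choice of a single generator, while a higher `tau` spreads the weight more evenly.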

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model


S Smith, M Patwary, B Norick, P LeGresley, S Rajbhandari, J Casper, Z Liu, S Prabhumoye, G Zerveas, V Korthikanti, E Zhang, R Child, R Y Aminabadi, J Bernauer, X Song, M Shoeybi, Y He, M Houston, S Tiwary, B Catanzaro

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe are key ingredients in the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generation.

MS-nowcasting: Operational Precipitation Nowcasting with Convolutional LSTMs at Microsoft Weather


S Klocek, H Dong, M Dixon, P Kanengoni, N Kazmi, P Luferenko, Z Lv, S Sharma, J Weyn, S Xiang

We present the encoder-forecaster convolutional long short-term memory (LSTM) deep-learning model that powers Microsoft Weather’s operational precipitation nowcasting product. This model takes as input a sequence of weather radar mosaics and deterministically predicts future radar reflectivity at lead times up to 6 hours. By stacking a large input receptive field along the feature dimension and conditioning the model’s forecaster with predictions from the physics-based High Resolution Rapid Refresh (HRRR) model, we are able to outperform optical flow and HRRR baselines by 20-25% on multiple metrics averaged over all lead times.

Was it “said” or was it “claimed”? How linguistic bias affects generative language models

2021 | EMNLP

R Patel, E Pavlick

People use language in subtle and nuanced ways to convey their beliefs. For instance, saying claimed instead of said casts doubt on the truthfulness of the underlying proposition, thus representing the author’s opinion on the matter. Several works have identified such linguistic classes of words that occur frequently in natural language text and are bias-inducing by virtue of their framing effects. In this paper, we test whether generative language models (including GPT-2) are sensitive to these linguistic framing effects. In particular, we test whether prompts that contain linguistic markers of author bias (e.g., hedges, implicatives, subjective intensifiers, assertives) influence the distribution of the generated text. Although these framing effects are subtle and stylistic, we find evidence that they lead to measurable style and topic differences in the generated text, leading to language that is, on average, more polarised and more skewed towards controversial entities and events.

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining


Y Meng, C Xiong, P Bajaj, S Tiwary, P Bennett, J Han, X Song

We present COCO-LM, a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences. COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences. It creates more challenging pretraining inputs, where noises are sampled based on their likelihood in the auxiliary language model. COCO-LM then pretrains with two tasks: The first task, corrective language modeling, learns to correct the auxiliary model's corruptions by recovering the original tokens. The second task, sequence contrastive learning, ensures that the language model generates sequence representations that are invariant to noises and transformations. In our experiments on the GLUE and SQuAD benchmarks, COCO-LM outperforms recent pretraining approaches in various pretraining settings and few-shot evaluations, with higher pretraining efficiency. Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.

Object-Centric Image Generation from Layouts

2021 | AAAI

T Sylvain, P Zhang, Y Bengio, D Hjelm, S Sharma

Despite recent impressive results on single-object and single-domain image generation, the generation of complex scenes with multiple objects remains challenging. In this paper, we start with the idea that a model must be able to understand individual objects and relationships between objects in order to generate complex scenes well. Our layout-to-image-generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, which lead to our model’s improved layout-fidelity. We also propose changes to the conditioning mechanism of the generator that enhance its object instance-awareness. Apart from improving image quality, our contributions mitigate two failure modes in previous approaches: (1) spurious objects being generated without corresponding bounding boxes in the layout, and (2) overlapping bounding boxes in the layout leading to merged objects in images. Extensive quantitative evaluation and ablation studies demonstrate the impact of our contributions, with our model outperforming previous state-of-the-art approaches on both the COCO-Stuff and Visual Genome datasets. Finally, we address an important limitation of evaluation metrics used in previous works by introducing SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.

On the Regularity of Attention


J Vuckovic, A Baratin, R T des Combes

Attention is a powerful component of modern neural networks across a wide variety of domains. In this paper, we seek to quantify the regularity (i.e. the amount of smoothness) of the attention operation. To accomplish this goal, we propose a new mathematical framework that uses measure theory and integral operators to model attention. We show that this framework is consistent with the usual definition, and that it captures the essential properties of attention. Then we use this framework to prove that, on compact domains, the attention operation is Lipschitz continuous and provide an estimate of its Lipschitz constant. Additionally, by focusing on a specific type of attention, we extend these Lipschitz continuity results to non-compact domains. We also discuss the effects regularity can have on NLP models, and applications to invertible and infinitely-deep networks.

Knowledge-Aware Language Model Pretraining


C Rosset, C Xiong, M Phan, X Song, P Bennett, S Tiwary

How much knowledge do pretrained language models hold? Recent research observed that pretrained transformers are adept at modeling semantics but it is unclear to what degree they grasp human knowledge, or how to ensure they do so. In this paper we incorporate knowledge-awareness in language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities to the input of the transformer in pretraining, with an entity-extended tokenizer; and at the output, with an additional entity prediction task. Our experiments show that solely by adding these entity signals in pretraining, significantly more knowledge is packed into the transformer parameters: we observe improved language modeling accuracy, factual correctness in LAMA knowledge probing tasks, and semantics in the hidden representations through edge probing. We also show that our knowledge-aware language model (KALM) can serve as a drop-in replacement for GPT-2 models, significantly improving downstream tasks like zero-shot question-answering with no task-related training.

Results of the Multi-Domain Task-Completion Dialog Challenge

2020 | Proceedings of the 34th AAAI Conference on Artificial Intelligence, Eighth Dialog System Technology Challenge Workshop

J Li, B Peng, S Lee, J Gao, R Takanobu, Q Zhu, M Huang, H Schulz, A Atkinson, M Adada

The paper provides an overview of the “Multi-domain Task Completion” track (Track 1) at the 8th Dialog System Technology Challenge (DSTC-8). There are two tasks in this track. The first task is end-to-end multi-domain task-completion, which aims to build end-to-end task completion dialog systems based on ConvLab. The second task is fast domain adaptation, seeking to develop models that predict user responses when only limited in-domain data is available. We describe the submissions for both tasks, automatic evaluation and human evaluation procedures, and discuss the outcomes of these two evaluations.

Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention

2020 | ICLR

C Zhao, C Xiong, C Rosset, X Song, P Bennett, S Tiwary

Transformers have achieved new heights modeling natural language as a sequence of text tokens. However, in many real world scenarios, textual data inherently exhibits structures beyond a linear sequence such as trees and graphs; many tasks require reasoning with evidence scattered across multiple pieces of texts. This paper presents Transformer-XH, which uses eXtra Hop attention to enable intrinsic modeling of structured texts in a fully data-driven way. Its new attention mechanism naturally “hops” across the connected text sequences in addition to attending over tokens within each sequence. Thus, Transformer-XH better conducts joint multi-evidence reasoning by propagating information between documents and constructing global contextualized representations. On multi-hop question answering, Transformer-XH leads to a simpler multi-hop QA system which outperforms previous state-of-the-art on the HotpotQA FullWiki setting. On FEVER fact verification, applying Transformer-XH provides state-of-the-art accuracy and excels on claims whose verification requires multiple evidence.

Generic Intent Representation in Web Search

2019 | SIGIR

H Zhang, X Song, C Xiong, C Rosset, P Bennett, N Craswell, S Tiwary

This paper presents GEneric iNtent Encoder (GEN Encoder) which learns a distributed representation space for user intent in search. Leveraging large scale user clicks from Bing search logs as weak supervision of user intent, GEN Encoder learns to map queries with shared clicks into similar embeddings end-to-end and then fine-tunes on multiple paraphrase tasks. Experimental results on an intrinsic evaluation task, query intent similarity modeling, demonstrate GEN Encoder’s robust and significant advantages over previous representation methods. Ablation studies reveal the crucial role of learning from implicit user feedback in representing user intent and the contributions of multi-task learning in representation generality. We also demonstrate that GEN Encoder alleviates the sparsity of tail search traffic and cuts the number of unseen queries by half by using an efficient approximate nearest neighbor search to effectively identify previous queries with the same search intent. Finally, we demonstrate that distances between GEN encodings reflect certain information seeking behaviors in search sessions.

Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks

2019 | SIGIR

B Mitra, C Rosset, D Hawking, N Craswell, F Diaz, E Yilmaz

Classical information retrieval (IR) methods, such as query likelihood and BM25, score documents independently w.r.t. each query term, and then accumulate the scores. Assuming query term independence allows precomputing term-document scores using these models—which can be combined with specialized data structures, such as inverted index, for efficient retrieval. Deep neural IR models, in contrast, compare the whole query to the document and are, therefore, typically employed only for late stage re-ranking. We incorporate query term independence assumption into three state-of-the-art neural IR models: BERT, Duet, and CKNRM—and evaluate their performance on a passage ranking task. Surprisingly, we observe no significant loss in result quality for Duet and CKNRM—and a small degradation in the case of BERT. However, by operating on each query term independently, these otherwise computationally intensive models become amenable to offline precomputation—dramatically reducing the cost of query evaluations employing state-of-the-art neural ranking models. This strategy makes it practical to use deep models for retrieval from large collections—and not restrict their usage to late stage re-ranking.
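
The consequence of the query term independence assumption can be sketched in a few lines: if a model scores each (term, document) pair on its own, those scores can be computed offline, stored in an inverted-index-like structure, and simply summed at query time. The sketch below is only an illustration under that assumption; the function names and the toy scores are hypothetical, and in the paper's setting the per-term scores would come from a neural model such as BERT, Duet, or CKNRM rather than from hand-written numbers.

```python
from collections import defaultdict


def build_term_index(term_doc_scores):
    """Group precomputed (term, doc_id) -> score entries by term, like an inverted index."""
    index = defaultdict(dict)
    for (term, doc_id), score in term_doc_scores.items():
        index[term][doc_id] = score
    return index


def rank(query, index, k=10):
    """Score documents by summing the precomputed per-term scores, then return the top k."""
    totals = defaultdict(float)
    for term in query.split():
        # Each query term contributes independently; no query-level model call is needed.
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical scores that an offline pass might have produced.
precomputed = {
    ("neural", "d1"): 1.5,
    ("ranking", "d1"): 0.5,
    ("neural", "d2"): 1.0,
    ("ranking", "d3"): 0.25,
}
term_index = build_term_index(precomputed)
results = rank("neural ranking", term_index)
```

At query time the work is limited to index lookups and additions, which is what makes retrieval from large collections practical under this assumption.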

An Axiomatic Approach to Regularizing Neural Ranking Models

2019 |

C Rosset, B Mitra, C Xiong, N Craswell, X Song, S Tiwary

Axiomatic information retrieval (IR) seeks a set of principled properties desirable in IR models. These properties, when formally expressed, provide guidance in the search for better relevance estimation functions. Neural ranking models typically contain a large number of parameters. The training of these models involves a search for appropriate parameter values based on large quantities of labeled examples. Intuitively, axioms that can guide the search for better traditional IR models should also help in better parameter estimation for machine learning based rankers. This work explores the use of IR axioms to augment the direct supervision from labeled data for training neural ranking models. We modify the documents in our dataset along the lines of well-known axioms during training and add a regularization loss based on the agreement between the ranking model and the axioms on which version of the document (the original or the perturbed) should be preferred. Our experiments show that the neural ranking model achieves faster convergence and better generalization with axiomatic regularization.
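
One way to picture the regularization term described in this abstract is as a pairwise penalty that is zero when the ranker agrees with the axiom about which document version should score higher. The sketch below uses a hinge form with a margin and a fixed regularization weight; all three choices, along with the function names, are assumptions made for illustration and are not taken from the paper.

```python
def axiomatic_regularizer(score_original, score_perturbed, prefer_original, margin=1.0):
    """Hinge penalty: zero when the ranker prefers the axiom's choice by at least `margin`."""
    diff = score_original - score_perturbed
    if not prefer_original:
        # The axiom says the perturbed document should win, so flip the sign.
        diff = -diff
    return max(0.0, margin - diff)


def total_loss(supervised_loss, axiom_terms, weight=0.1):
    """Add the mean axiomatic penalty, scaled by `weight`, to the supervised loss."""
    if not axiom_terms:
        return supervised_loss
    penalty = sum(axiomatic_regularizer(*term) for term in axiom_terms) / len(axiom_terms)
    return supervised_loss + weight * penalty


# Hypothetical case: the perturbation removed query terms, so an axiom prefers the original,
# but the ranker scored the perturbed version higher and is penalized for it.
loss = total_loss(1.0, [(0.5, 2.0, True)], weight=0.5)
```

Each tuple in `axiom_terms` holds the ranker's score for the original document, its score for the perturbed document, and the axiom's preference.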

Serving DNNs in Real Time at Datacenter Scale with Project Brainwave

2018 | Microsoft Research

E Chung, J Fowers, K Ovtcharov, M Papamichael, A Caulfield, T Massengill, M Liu, D Lo, S Alkalay, M Haselman, M Abeydeera, L Adams, H Angepat, C Boehn, D Chiou, O Firestein, A Forin, K S Gatlin, M Ghandi, S Heil, K Holohan, A E Husseini, T Juhasz, K Kagi, R K Kovvuri, S Lanka, F v Megen, D Mukhortov, P Patel, B Perez, A G Rapsang, S K Reinhardt, B D Rouhani, A Sapek, R Seera, S Shekar, B Sridharan, G Weisz, L Woods, P Y Xiao, D Zhang, R Zhao, D Burger

To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for AI serving in real time, accelerates deep neural network (DNN) inferencing in major services such as Bing’s intelligent search features and Azure. Exploiting distributed model parallelism and pinning over low-latency hardware microservices, Project Brainwave serves state-of-the-art, pre-trained DNN models with high efficiencies at low batch sizes. A high-performance, precision-adaptable FPGA soft processor is at the heart of the system, achieving up to 39.5 TFLOPs of effective performance at Batch 1 on a state-of-the-art Intel Stratix 10 FPGA.

Neural Ranking Models with Multiple Document Fields

2018 | ACM

H Zamani, B Mitra, X Song, N Craswell, S Tiwary

Deep neural networks have recently shown promise in the ad-hoc retrieval task. However, such models have often been based on one field of the document, for example considering document title only or document body only. Since in practice documents typically have multiple fields, and given that non-neural ranking models such as BM25F have been developed to take advantage of document structure, this paper investigates how neural models can deal with multiple document fields. We introduce a model that can consume short text fields such as document title and long text fields such as document body. It can also handle multi-instance fields with variable number of instances, for example where each document has zero or more instances of incoming anchor text. Since fields vary in coverage and quality, we introduce a masking method to handle missing field instances, as well as a field-level dropout method to avoid relying too much on any one field. As in the studies of non-neural field weighting, we find it is better for the ranker to score the whole document jointly, rather than generate a per-field score and aggregate. We find that different document fields may match different aspects of the query and therefore benefit from comparing with separate representations of the query text. The combination of techniques introduced here leads to a neural ranker that can take advantage of full document structure, including multiple instance and missing instance data, of variable length. The techniques significantly enhance the performance of the ranker, and outperform a learning to rank baseline with hand-crafted features.
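
The masking and field-level dropout ideas mentioned in this abstract can be sketched as follows. Missing fields are represented as `None` and skipped when pooling, and during training each present field is dropped as a whole with some probability so the ranker cannot lean on any single field. The names, the `None` convention, and the mean pooling are assumptions made for illustration; the paper's model scores the document jointly with a neural network rather than by averaging vectors.

```python
import random


def apply_field_dropout(field_vectors, drop_prob, rng=random, training=True):
    """Return a copy in which each present field is dropped (set to None) with drop_prob."""
    kept = {}
    for name, vector in field_vectors.items():
        if vector is None:
            kept[name] = None  # field is missing for this document; stays masked
        elif training and rng.random() < drop_prob:
            kept[name] = None  # whole field dropped for this training step
        else:
            kept[name] = vector
    return kept


def pool_fields(field_vectors, dim):
    """Average the vectors of the unmasked fields; fall back to zeros if none remain."""
    present = [v for v in field_vectors.values() if v is not None]
    if not present:
        return [0.0] * dim
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]


# Hypothetical document with a title and body but no incoming anchor text.
doc_fields = {"title": [1.0, 3.0], "body": [3.0, 5.0], "anchor": None}
train_view = apply_field_dropout(doc_fields, drop_prob=0.3, rng=random.Random(0))
doc_vector = pool_fields(train_view, dim=2)
```

With `training=False` no field is dropped, so only the fields that are actually missing remain masked at inference time.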

Optimizing Query Evaluations Using Reinforcement Learning for Web Search

2018 | ACM

C Rosset, D Jose, G Ghosh, B Mitra, S Tiwary

In web search, a candidate generation step typically selects a small set of documents, from collections containing as many as billions of web pages, that are subsequently ranked and pruned before being presented to the user. In Bing, the candidate generation involves scanning the index using statically designed match plans that prescribe sequences of different match criteria and stopping conditions. In this work, we pose match planning as a reinforcement learning task and observe up to 20% reduction in index blocks accessed, with small or no degradation in the quality of the candidate sets.

Systems and methods for automated query answer generation

2018 | Google Patents

S Tiwary, M Rosenberg, J Gao, X Song, R Majumder, L Deng

Systems and methods for automated generation of new content responses to answer user queries are provided. The systems and methods for automated generation of new content responses answer user queries utilizing deep learning and a reasoning algorithm. The generated response is composed of new content and is not merely cut or copied information from one or more search results. Accordingly, the systems and methods for automated generation of new content responses provide tailored query specific answers that can be long and detailed including several sentences of information or that can be short and concise, such as “yes” or “no.” The ability of the systems and methods described herein to create or generate new content in response to a user query improves the usability, improves the performance, and/or improves user interactions of/with a search query system.

Towards Language Agnostic Universal Representations

2018 |

A Aghajanyan, X Song, S Tiwary

When a bilingual student learns to solve word problems in math, we expect the student to be able to solve these problems in both languages the student is fluent in, even if the math lessons were only taught in one language. However, current representations in machine learning are language dependent. In this work, we present a method to decouple the language from the problem by learning language agnostic representations and therefore allowing training a model in one language and applying to a different one in a zero-shot fashion. We learn these representations by taking inspiration from linguistics and formalizing Universal Grammar as an optimization process (Chomsky, 2014; Montague, 1970). We demonstrate the capabilities of these representations by showing that the models trained on a single language using language agnostic representations achieve very similar accuracies in other languages.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

2016 |

P Bajaj, D Campos, N Craswell, L Deng, J Gao, X Liu, R Majumder, A McNamara, B Mitra, T Nguyen, M Rosenberg, X Song, A Stoica, S Tiwary, T Wang

We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions, sampled from Bing's search query logs, each with a human generated answer, and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages, extracted from 3,563,535 web documents retrieved by Bing, that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would, (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguish MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset make it attractive for benchmarking machine reading comprehension and question-answering models.