Recent advances in machine learning, especially in the field of deep learning, allow AI to solve many real-world tasks, thus providing new experiences and values for the users. NAVER and LINE have enhanced various services such as search, messenger, Webtoon, and video with advanced AI technologies, and have launched an AI assistant platform and AI speakers. NAVER LABS is also developing autonomous vehicles and robots based on AI technologies. As the center of AI technology of NAVER and LINE, CLOVA AI and NAVER LABS Europe are contributing to research on core technologies. In this workshop, we will share the latest research from CLOVA AI and NAVER LABS Europe and discuss the future direction of our AI research.
| Time | Session | Speaker |
|---|---|---|
| 14:00 ~ 14:10 | Opening remarks | Jung-Woo Ha (CLOVA AI) |
| 14:10 ~ 14:40 | Keynote speech | Jackie C. K. Cheung (McGill University) |
| 14:40 ~ 15:00 | Perception: OCR in-the-wild | Seong Joon Oh (CLOVA AI) |
| 15:00 ~ 15:20 | Perception: Audio-visual speech recognition | Joon Son Chung (CLOVA AI) |
| 15:20 ~ 15:40 | Cognition: Machine reading & comprehension | Julien Perez (NAVER LABS Europe) |
| 15:40 ~ 16:00 | Cognition: Face analysis, unveiled in the real-world | Myeong-Yeon Yi (CLOVA AI) |
| 16:00 ~ 16:30 | Coffee break | |
| 16:30 ~ 16:50 | Cognition: What is a word? | Matthias Galle (NAVER LABS Europe) |
| 16:50 ~ 17:10 | Action & Apps: Photorealistic style-transfer | Youngjung Uh (CLOVA AI) |
| 17:10 ~ 17:30 | Action & Apps: Information-theoretic task-oriented dialog | Sang-Woo Lee (CLOVA AI) |
| 17:30 ~ 17:50 | Action & Apps: Image-to-image translation | Yunjey Choi (CLOVA AI) |
| 17:50 ~ 18:30 | Demonstration | CLOVA AI & NLE |
· Jackie C. K. Cheung (McGill University)
· Title: Understanding Uncommon Entities and Situations by Using External Knowledge
Thanks to the "long tail" of natural language, words and entities that only appear a handful of times in a training corpus account for the vast majority of occurrences overall. In the case of events and situations, the combinatorial possibilities of entities and relations involved mean that all such types can be considered rare. In this talk, I will discuss my group's work on leveraging external knowledge to train language understanding systems for rare entities and events despite having few-to-no training examples of them. First, we propose to use short definitions as a source of information for training entity representations. We introduce a new machine comprehension dataset, the Wikilinks Rare Entities dataset, and an accompanying entity cloze task which requires predicting rare named entities in context. We describe a hierarchical double encoder model which reads a text passage and an entity definition in order to predict which entity fits in context. Then, I will present recent work on using the entire indexed web as external knowledge for understanding situations. We describe an approach based on information retrieval to tackle the Winograd Schema Challenge, a difficult commonsense reasoning task that involves resolving pronominal anaphora.
Jackie Chi Kit Cheung is an Assistant Professor at McGill University's School of Computer Science, where he co-directs the Reasoning and Learning Lab. He obtained his PhD from the University of Toronto and joined McGill in 2015. His research group focuses on developing computational methods for understanding text and for generating language that is fluent and appropriate to the context. Several current projects include extracting events from text, automatic summarization, and leveraging external knowledge for language understanding. Dr. Cheung is a member of the Mila Research Institute, and is an academic advisor for the new Borealis AI research lab in Montreal. He recently received a best paper award at ACL 2018.
· OCR in-the-wild (Seong Joon Oh / CLOVA AI)
Optical Character Recognition (OCR) has vast potential for applications where collecting textual information from visual media by hand is too costly. OCR typically consists of two stages: detection and recognition. This talk covers the recognition stage, namely Scene Text Recognition (STR). Recent years have seen many new STR model proposals. While each claims to have pushed the boundary of the technology, a holistic, fair comparison has been largely missing because of inconsistent choices of evaluation benchmarks, training datasets and procedures, and evaluation metrics. This talk introduces our two efforts to address this difficulty: (1) a common STR framework for putting existing STR methods in perspective and (2) a common evaluation protocol for a fair comparison with respect to accuracy, speed, and memory. Together they enable a systematic and fair evaluation of how much each proposed module contributes to performance. Along the way, we have discovered previously unexplored module combinations that achieve state-of-the-art performance. A technical report and the code will be published soon.
· Audio-visual speech recognition (Joon Son Chung / CLOVA AI)
The objective of this work is visual recognition of human speech. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communication. Training a deep learning algorithm requires a lot of training data. We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcript. To build such a dataset, it is essential to know 'who' is speaking 'when'. We develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabelled data, and apply this network to the tasks of audio-to-video synchronization and active speaker detection. We also show that the methods developed here can be extended to the problem of generating talking faces from audio and still images, and re-dubbing videos with audio samples from different speakers. We then propose a number of deep learning models that are able to recognize visual speech at sentence level. The lip reading performance beats a professional lip reader on videos from BBC television. We demonstrate that if audio is available, then visual information helps to improve speech recognition performance. We also propose methods to enhance noisy audio and to resolve multi-talker simultaneous speech using visual cues.
· Machine reading & comprehension (Julien Perez / NAVER LABS Europe)
Over the last five years, differentiable programming and deep learning have become the de facto standard for a vast set of decision problems in data science. Three factors have enabled this rapid evolution. First, the availability and systematic collection of data have made it possible to gather and leverage large quantities of traces of intelligent behavior. Second, standardized development frameworks have dramatically accelerated the development of differentiable programming and its application to the major modalities of the digital world: image, text, and sound. Third, the availability of powerful and affordable computational infrastructure has enabled this new step toward machine intelligence. Beyond these encouraging results, new limits have arisen and need to be addressed. Automatic common-sense acquisition and reasoning capabilities are two of the frontiers on which the major machine learning research labs are now working. In this context, human language has once again become a medium of choice for such research. In this talk, we will take a natural language understanding task, machine reading, as a medium to illustrate the problem and describe the research progress made throughout the machine reading project. First, we will describe several limitations that current decision models exhibit. Second, we will discuss adversarial learning and how such an approach makes learning more robust. Third, we will explore several differentiable transformations that aim at moving toward these goals. Finally, we will discuss ReviewQA, a machine reading corpus built from human-generated hotel reviews, that aims to encourage research around these questions.
· Face analysis, unveiled in the real-world (Myeong-Yeon Yi / CLOVA AI)
Face analysis has a long history in computer vision and is widely used in applications such as augmented reality, online finance, and surveillance. In this talk, we will explain the background of these face analysis technologies and share our work on applying current state-of-the-art research to real-world applications. In addition, we will show a demo of video summarization based on face recognition and a real-time demo of our face analysis technologies in both mobile and desktop environments.
· What is a word? (Matthias Galle / NAVER LABS Europe)
One of the first promises of deep learning for natural language processing was to provide an "end-to-end" approach and avoid the constraints of a priori decisions. One such basic decision is what constitutes a word, a decision that defines the interface between the discrete world (observed symbols) and the continuous one (internal representations). In the past this problem did not receive much attention, although this has changed in recent years through the impact on neural machine translation of byte-pair encoding, an unsupervised mechanism for inferring sub-word tokens. We will review several approaches to defining what constitutes a word, and the impact that those approaches have on end applications.
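To make the byte-pair encoding idea concrete, here is a minimal sketch of how merges can be learned from a toy word list. It is an illustration under stated assumptions, not the speaker's code: the function names `merge_pair` and `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are all choices made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges, vocab


if __name__ == "__main__":
    corpus = "low low low lower lower newest newest widest".split()
    merges, vocab = learn_bpe(corpus, num_merges=5)
    print(merges)        # learned merge operations, in order
    print(list(vocab))   # words segmented into the resulting sub-word tokens
```

The number of merges is the only knob here: few merges leave words close to character level, many merges approach whole words, which is one way the "what is a word?" decision becomes a tunable parameter.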
· Photorealistic style-transfer (Youngjung Uh / CLOVA AI)
Style transfer methods aim to synthesize an image that combines the content of one image with the style of another. Compared to artistic style transfer, photo-realistic style transfer, which requires strictly preserving the spatial structures of the content, has been overlooked. In this talk, we argue that a traditional signal processing technique, namely the wavelet transform, can be combined with encoder-decoder networks with skip connections to minimize the loss of information, resulting in photo-realistic style transfer.
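The property that makes wavelets attractive for preserving spatial structure can be sketched in a few lines. The example below assumes a one-level 2D Haar transform over non-overlapping 2x2 blocks (the abstract does not say which wavelet is used) and shows that the four sub-bands reconstruct the input exactly, unlike plain average pooling, which keeps only the low-frequency band. The function names and band labels are illustrative, and this is not the network described in the talk.

```python
def haar_decompose(image):
    """One-level 2D Haar transform of a 2D list with even height and width."""
    h, w = len(image), len(image[0])
    bands = {name: [[0.0] * (w // 2) for _ in range(h // 2)]
             for name in ("LL", "LH", "HL", "HH")}
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            r, s = i // 2, j // 2
            bands["LL"][r][s] = (a + b + c + d) / 2  # local average (scaled)
            bands["LH"][r][s] = (a - b + c - d) / 2  # left-right difference
            bands["HL"][r][s] = (a + b - c - d) / 2  # top-bottom difference
            bands["HH"][r][s] = (a - b - c + d) / 2  # diagonal difference
    return bands


def haar_reconstruct(bands):
    """Invert haar_decompose; all four bands are needed for an exact result."""
    h2, w2 = len(bands["LL"]), len(bands["LL"][0])
    image = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for r in range(h2):
        for s in range(w2):
            ll, lh = bands["LL"][r][s], bands["LH"][r][s]
            hl, hh = bands["HL"][r][s], bands["HH"][r][s]
            image[2 * r][2 * s] = (ll + lh + hl + hh) / 2
            image[2 * r][2 * s + 1] = (ll - lh + hl - hh) / 2
            image[2 * r + 1][2 * s] = (ll + lh - hl - hh) / 2
            image[2 * r + 1][2 * s + 1] = (ll - lh - hl + hh) / 2
    return image


if __name__ == "__main__":
    img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    restored = haar_reconstruct(haar_decompose(img))
    print(restored)
```

In an encoder-decoder setting, the high-frequency bands are the kind of signal that could be carried over skip connections so that the decoder can undo the downsampling without losing edges.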
· Information-theoretic task-oriented dialog (Sang-Woo Lee / CLOVA AI)
Motivated by the achievements of neural chit-chat dialog research, recent studies on task-oriented dialog have applied deep learning and reinforcement learning in an end-to-end fashion. However, learning dynamic patterns from free-form human-human dialogs in an end-to-end fashion remains a challenging problem. To address this problem, we proposed "Answerer in Questioner's Mind" (AQM), an information-theoretic framework for task-oriented dialog. Our system figures out the opponent's intention by selecting a plausible question, explicitly calculating the information gain over the candidate intentions and possible answers to each question. I will show experimental results on well-known task-oriented visual dialog tasks, including "GuessWhat" and "GuessWhich". Based on these findings, I will discuss the limitations we would currently face when applying other task-oriented dialog systems to our service, and a future direction toward building an agent that can learn from dialog with a human.
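Question selection by information gain can be sketched on a toy problem. The code below is a generic illustration, not the AQM implementation: it assumes a discrete set of candidate intentions with a prior p(c), and for each question a hand-written table p(a | c, q) standing in for the questioner's model of the answerer. All names (`information_gain`, `select_question`, `update_posterior`) are hypothetical.

```python
import math


def information_gain(prior, likelihood):
    """Mutual information (in bits) between the intention C and the answer A
    for one question; prior[c] = p(c), likelihood[c][a] = p(a | c, q)."""
    n_answers = len(likelihood[0])
    # Marginal answer distribution p(a | q) = sum_c p(c) p(a | c, q).
    marginal = [sum(p_c * likelihood[c][a] for c, p_c in enumerate(prior))
                for a in range(n_answers)]
    gain = 0.0
    for c, p_c in enumerate(prior):
        for a in range(n_answers):
            p_a = likelihood[c][a]
            if p_c > 0 and p_a > 0:
                gain += p_c * p_a * math.log2(p_a / marginal[a])
    return gain


def select_question(prior, likelihoods):
    """Return the index of the question with the highest information gain."""
    gains = [information_gain(prior, lk) for lk in likelihoods]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best, gains


def update_posterior(prior, likelihood, answer):
    """Bayes update of p(c) after observing `answer` to the chosen question."""
    unnormalized = [p_c * likelihood[c][answer] for c, p_c in enumerate(prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    prior = [0.5, 0.5]                      # two candidate intentions
    discriminating = [[1.0, 0.0], [0.0, 1.0]]  # answer reveals the intention
    uninformative = [[0.5, 0.5], [0.5, 0.5]]   # answer is independent of it
    best, gains = select_question(prior, [uninformative, discriminating])
    print(best, gains)
    print(update_posterior(prior, discriminating, answer=0))
```

Asking the selected question and then updating the posterior with the observed answer gives one round of the loop; repeating it narrows down the intention question by question.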
· Image-to-image translation (Yunjey Choi / CLOVA AI)
The task of image-to-image translation is to change a particular aspect of a given image to another. Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing methods have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. In this talk, I will discuss our recent work, StarGAN, which can perform multi-domain image-to-image translation using only a single model. We discuss the main idea of StarGAN and show experimental results on facial attribute transfer and facial expression synthesis tasks. Further, we discuss the limitations of StarGAN and future work.
· Florent Perronnin (NAVER LABS Europe)
· Sung Kim (CLOVA AI Research, NAVER)
· Julien Perez (NAVER LABS Europe)
· Matthias Galle (NAVER LABS Europe)
· Jung-Woo Ha (CLOVA AI Research, NAVER)