
GSoC 2020

The FrameNet Brasil Computational Linguistics Lab at the Federal University of Juiz de Fora, Brazil, has been accepted as a mentor organization for Google Summer of Code 2020. This page is the main reference point for students submitting projects that address the ideas listed below.

 

1. FrameNet 101

A framenet is a semantically oriented computational resource in which language material (words, multi-word expressions and grammatical constructions) is linked to a network of frames that help define its meaning. In the context of Frame Semantics, a frame is a scene, a system of interrelated concepts in which the participants in the scene, the props they use, and the way they interact are defined. The key notion in a framenet is that the meaning of words, as well as the meaning of other levels of linguistic structure, depends on the frames associated with the words, that is, words may evoke frames. Take a word such as the verb tour, for example. In order to understand this word, a speaker of English recruits the Touring frame, in which there are three core participants: the Tourist, an Attraction and a Place. These three elements must be cognitively present so that the idea of touring can be interpreted: there is no tourism if any one of them is missing. Additionally, frames are connected to one another via a series of relations, providing a cognitive semantic structure against which meaning is defined.

In FrameNet Brasil we apply this kind of semantically oriented structure to tackle important issues in Natural Language Understanding. In the ideas list below, we explain those issues further.

To learn more about FrameNet Brasil, consider the following papers:

TORRENT, T. T.; MATOS, E.; LAGE, L.; LAVIOLA, A.; TAVARES, T.; ALMEIDA, V. G.; SIGILIANO, N. (2018). Towards continuity between the lexicon and the constructicon in FrameNet Brasil. In: LYNGFELT, B.; BORIN, L.; OHARA, K. H.; TORRENT, T. T. (Orgs.). Constructional Approaches to Language. Amsterdam: John Benjamins Publishing Company.

DINIZ DA COSTA, A.; GAMONAL, M. A.; PAIVA, V. M. R. L.; MARÇÃO, N. D.; PERON-CORRÊA, S.; ALMEIDA, V. G.; MATOS, E. E. S.; TORRENT, T. T. (2018). FrameNet-Based Modeling of the Domains of Tourism and Sports for the Development of a Personal Travel Assistant Application. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan: ELRA, p. 6-12.

There are framenets under development for several languages (English, Japanese, German, Swedish, Brazilian Portuguese and Chinese, among others), as well as a global initiative to connect them all and develop shared tasks based on framenet data. To learn more about this initiative, visit the Global FrameNet website.

 

2. How to Apply

Successful applicants will turn in projects that address the issues in the ideas list, bringing together the kind of structured data FN-Br has been developing over the past decade with the computational techniques they find best suited to achieving the proposed goals. Please note that FN-Br is not only about big data, machine learning and whichever purely statistical approach to language is out there. The work in FN-Br is model-based as well as data-driven. The kinds of issues prompting the mentoring process for GSoC 2020 cannot be solved solely by training some algorithm on a ton of raw data. With that in mind, applicants should follow the steps below to submit their applications:

1. Read the papers listed in section 1 of this page;

2. Look at the data reports available on this website and familiarize yourself with the kind of structure FN-Br builds;

3. If you have questions and/or need further clarification on any of the ideas, feel free to email us at projeto.framenetbr@ufjf.edu.br. There’s also a Slack for the GSoC and if you’d like to join, just send us an email and we’ll invite you to it;

4. Write a 1-3 page pre-project and submit it to projeto.framenetbr@ufjf.edu.br by March 9th if you’d like to get feedback from FN-Br before the official submission;

5. Use the feedback provided by our team to improve your proposal.

 

3. Ideas List

Two aspects are important to keep in mind while reading the ideas described below:

1. The core of the FN-Br data model includes three entities: Frames, Frame Elements and Lexical Units (LU). Frames (representing the meaning of a scene or event) are composed of Frame Elements (participants in the scene or event). Lexical Units represent the association, or pairing, of a word or multiword expression with a Frame.

2. The basic annotation process of a sentence consists of choosing a specific word in the sentence (the target Lexical Unit) and associating semantic labels (Frame Elements) with the other words/expressions in the sentence that are somehow dependent on the target. Originally, only sentences (i.e. texts) were annotated in FN-Br. Nonetheless, since GSoC 2019, the FN-Br Web Annotation Tool also features a module for annotating multimodal corpora, that is, a video plus the transcripts of the audio or the subtitles superimposed on it. The set formed by an annotated target (an LU or an image element) plus its Frame Elements is called an AnnotationSet. Since many words and/or image fragments can be chosen as targets, many AnnotationSets may be associated with one sentence or video fragment.
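The two points above can be sketched as a small data model. This is only an illustration, not the FN-Br database schema: the class and field names, the token-index spans and the example sentence are all assumptions made for this sketch, while the Touring frame and its three core elements come from section 1.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Span = Tuple[int, int]  # token indices, end exclusive (an assumption of this sketch)


@dataclass(frozen=True)
class FrameElement:
    name: str
    core: bool = True


@dataclass
class Frame:
    name: str
    elements: Dict[str, FrameElement] = field(default_factory=dict)

    def add_element(self, name: str, core: bool = True) -> None:
        self.elements[name] = FrameElement(name, core)


@dataclass
class LexicalUnit:
    """Pairing of a word (lemma + part of speech) with the Frame it evokes."""
    lemma: str
    pos: str
    frame: Frame


@dataclass
class AnnotationSet:
    """One target LU plus the Frame Element labels that depend on it."""
    target: LexicalUnit
    target_span: Span
    labels: Dict[Span, str] = field(default_factory=dict)

    def label(self, span: Span, element: str) -> None:
        # Only Frame Elements of the target's frame are valid labels.
        if element not in self.target.frame.elements:
            raise ValueError(
                f"{element!r} is not a Frame Element of {self.target.frame.name}"
            )
        self.labels[span] = element


# Hypothetical example sentence, invented for illustration.
touring = Frame("Touring")
for fe in ("Tourist", "Attraction", "Place"):
    touring.add_element(fe)
tour_v = LexicalUnit("tour", "v", touring)

tokens = ["We", "toured", "the", "museum", "in", "Paris"]
annotation = AnnotationSet(target=tour_v, target_span=(1, 2))
annotation.label((0, 1), "Tourist")
annotation.label((2, 4), "Attraction")
annotation.label((4, 6), "Place")
```

A second target in the same sentence would simply get its own AnnotationSet, which is how one sentence ends up with several of them.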

 

3.1. Multilingual Semantic Annotation Projection

Mentors: Ely Matos (FN-Br | UFJF) | Oliver Czulo (Leipzig University) | Collin Baker (FrameNet | ICSI) | Tiago Torrent (FN-Br | UFJF) | Debanjana Kar (IIT Kharagpur)

General Context:

FrameNet Brasil is leading, together with Berkeley FrameNet, a shared effort among framenets to annotate parallel corpora using the Berkeley FrameNet Data Release 1.7. The first text selected for annotation in this effort was the transcription of the TED Talk “Do Schools Kill Creativity?” by Ken Robinson. So far, the transcription of the talk and its translations (or at least parts of them) have been annotated for English, Brazilian Portuguese, Japanese, French, German, Swedish, Greek, Hindi and Urdu in the context of the Global FrameNet initiative. Annotators are allowed to create the lexical units they will annotate in the text, but they cannot change the frames and frame elements in the Berkeley FrameNet Data Release 1.7. Hence, the basic semantic structure for annotation is the same across languages. The annotation tool used is the FN-Br Web Annotation Tool, and, for this task, it lets annotators register that a given semantic frame is not well suited to a given lexical item in the context of the annotation. Those indications are parametrized and therefore constitute a set of hints regarding differences between languages. A first report on the results of the shared annotation task can be found here.

The shared annotation task has shed some light on how translations of the same text may differ in terms of the semantic frames recruited by the text in the process of meaning construction. Some of those differences refer to the imposition of different perspectives on the meaning, some relate to differences triggered by the formal aspects of the text, and some others to culturally grounded metaphors. However, annotations in different languages may also differ in amount and depth: for some languages, the existing annotation is partial, and while some language teams focus on certain parts of speech, usually content words, others include functional elements such as conjunctions and prepositions. Teams working on the annotation also differ in their local conditions: while some of them come from a long tradition of building a framenet for their languages, others are starting a framenet precisely from the annotation of the Global FrameNet corpus. Moreover, annotating long texts in such a fine-grained fashion, as is the case for framenet-like annotations, is very time consuming, and the lack of large semantically annotated multilingual corpora has been a barrier to the development of applications in machine translation and other fields of Computational Linguistics. This project idea aims to remedy some of these problems, as well as to aid in the process of semi-automatically bootstrapping new framenets, by automatically projecting annotations from multilingual corpora to unannotated translations while using existing parallel annotations to check the robustness and precision of the projection algorithm.

The Idea:

Using the data generated from the annotation of the TED Talk in three or more languages (English, German and Brazilian Portuguese being mandatory, since this idea is developed in collaboration with the Department of Translation Studies at Leipzig University and FrameNet at the International Computer Science Institute, Berkeley), as well as the relations between frames given in the Berkeley FrameNet Data Release 1.7, applicants interested in working with this idea should present a project to implement an annotation projection system which must (a) work across various languages (at least English, Brazilian Portuguese and German), (b) make use of relations between frames as a means of inferring best-fit LU candidates in other languages, and (c) be capable of handling problems such as n:m alignments of sentences.
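As a rough illustration of what annotation projection involves, the sketch below copies labels from a source sentence to its translation through a token-level word alignment. It is a deliberately naive baseline, not the system this idea asks for: the function name, the input shapes, the example sentence pair and the alignment are all assumptions invented here, the alignment is taken as given rather than computed, and neither frame relations (requirement b) nor n:m sentence alignment (requirement c) is addressed.

```python
from typing import Dict, Iterable, List, Set, Tuple


def project_labels(
    source_labels: Dict[str, Set[int]],
    alignment: Iterable[Tuple[int, int]],
) -> Dict[str, List[int]]:
    """Project labels over source token indices onto target token indices.

    source_labels maps a label (a Frame Element name or "TARGET") to the set
    of source token indices it covers. alignment is a collection of
    (source_index, target_index) pairs; one source token may align to
    several target tokens and vice versa.
    """
    src_to_tgt: Dict[int, Set[int]] = {}
    for src, tgt in alignment:
        src_to_tgt.setdefault(src, set()).add(tgt)

    projected: Dict[str, List[int]] = {}
    for label, src_tokens in source_labels.items():
        tgt_tokens: Set[int] = set()
        for src in src_tokens:
            tgt_tokens |= src_to_tgt.get(src, set())
        if tgt_tokens:  # labels with no aligned tokens are dropped
            projected[label] = sorted(tgt_tokens)
    return projected


# Hypothetical sentence pair and hand-written alignment, for illustration only.
english = ["We", "toured", "the", "museum"]
german = ["Wir", "haben", "das", "Museum", "besichtigt"]
alignment = [(0, 0), (1, 1), (1, 4), (2, 2), (3, 3)]  # "toured" -> "haben ... besichtigt"

source_labels = {"Tourist": {0}, "TARGET": {1}, "Attraction": {2, 3}}
projected = project_labels(source_labels, alignment)
```

Here the English target projects onto a discontinuous German span (tokens 1 and 4), which is the kind of case a real projection system has to decide how to represent.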

Why this Idea is Innovative:

Semantic annotation projection is a technique which has already been studied to some extent (see Padó & Lapata 2009 for an example), but questions such as projection of metaphorical expressions remain problematic. Existing annotations from the Global FrameNet initiative can be used as a standard to train, validate and test the algorithm to be developed. Testing projection not only between original and translations but between multiple translations of one original will shed some light on the question of how similar in terms of semantic content translations are between themselves, not taking the original into account. Insights into this could help inform the theoretical basis of other applications concerning processing of imitated language production such as referenceless machine translation evaluation or automatic paraphrasing. Also, this idea may result in a framework for bootstrapping new framenets from multilingual annotated corpora.

The Project:

Student: Zheng Xin Yong
Affiliation: Minerva International School
Project: Multilingual Projections of Semantic Frames and Frame Elements using Modified Existing Parsers and Neural Networks

 

3.2. New Frame-Based Image and Video Annotation Pipeline for the FrameNet Brasil Web Annotation Tool

Mentors: Ely Matos (FN-Br | UFJF) | Frederico Belcavello (FN-Br | UFJF) | Marcelo Viridiano (FN-Br | UFJF) | Tiago Torrent (FN-Br | UFJF) 

General Context:

FrameNet Brasil has been expanding its annotation practice to multimodal corpora, that is, to pieces combining either text and pictures or audio transcriptions and the video they are superimposed on. The purpose of FN-Br’s multimodal annotation is to produce a training corpus annotated with fine-grained semantic representations of events and entities, while tracking the (a)synchronicities between the semantics conveyed by the images and that conveyed by the texts. Such a corpus can then be used for training algorithms combining machine vision and natural language understanding.

The first multimodal module for the FN-Br Web Annotation Tool was developed during GSoC 2019 and is already functional. Nonetheless, a whole pipeline is needed to make multimodal annotation efficient and scalable. This idea brings together a series of potential projects, since the pipeline should include (1) pre-processing, (2) annotation, and (3) data compilation and reporting. Applicants may present projects approaching as many stages as they find suitable.

Idea #1 – Pre-Processing Pipeline:

Currently, all corpus pre-processing is done manually, which, besides being highly time consuming, can also lead to inconsistencies in data format. This first idea therefore consists of building a pipeline for automatically preparing and importing multimodal corpora into the FN-Br WebTool. Such a process must include, at least:

  • a speech-to-text application, capable of generating time-stamped transcriptions of what is said in the videos in more than one language (since the same piece of video often has more than one language in it);
  • a subtitle extractor, which also generates time stamps for the subtitles in reference to how long they are present in the video;
  • a video uploading module for the FN-Br WebTool, capable of dealing with different formats, codecs and frame rates, and of converting them to the specifications required by the tool;
  • a video segmentation functionality that breaks the video file into parts according to the structure of the textual elements in it.
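To make the last item more concrete, here is a minimal sketch of how time-stamped text could drive segmentation. It assumes that the speech-to-text and subtitle steps yield cues of the form (start, end, text) in seconds and that cues separated by a short pause belong to the same segment. Both the cue format and the max_gap threshold are assumptions of this sketch, not specifications of the WebTool, and the actual cutting of the video file is left out.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]                 # (start, end, text), in seconds
Segment = Tuple[float, float, List[str]]       # (start, end, texts in the segment)


def segment_by_cues(cues: List[Cue], max_gap: float = 1.0) -> List[Segment]:
    """Group time-stamped cues into segments.

    A cue joins the current segment when it starts no more than max_gap
    seconds after that segment ends; otherwise it opens a new segment.
    """
    segments: List[Segment] = []
    for start, end, text in sorted(cues):
        if segments and start - segments[-1][1] <= max_gap:
            seg_start, seg_end, texts = segments[-1]
            segments[-1] = (seg_start, max(seg_end, end), texts + [text])
        else:
            segments.append((start, end, [text]))
    return segments


# Hypothetical cues, invented for illustration.
cues = [
    (0.0, 2.0, "Good morning."),
    (2.5, 4.0, "Welcome to the tour."),
    (10.0, 12.0, "This is the main square."),
]
segments = segment_by_cues(cues, max_gap=1.0)
```

The resulting (start, end) pairs could then be handed to whatever tool performs the actual cut and format conversion.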

Idea #2 – Semi-automatization of the annotation process:

Multimodal annotation in FN-Br is currently 100% manual. Hence, an important stage in the pipeline is the automatic detection of the elements in each scene of the video, reducing the effort annotators spend drawing bounding boxes. An initial experiment has been conducted using YOLO, which automatically detects objects in each video frame. However, because the multimodal module of the WebTool is based on VATIC, using Optical Flow, it is key that automatically detected bounding boxes be related to each other, so that the system recognizes that the several instances of the same object across several frames are, in fact, the same object. This idea, thus, consists of adding an automatic object identification functionality to the WebTool that both works with VATIC and allows for human correction, if needed.
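The linking problem described above can be illustrated with a simple baseline: greedily chaining per-frame detections whose boxes overlap strongly from one frame to the next, measured by intersection over union (IoU). This is not Optical Flow and does not use the VATIC or YOLO APIs. The input format (a list of (label, box) detections per video frame, with boxes as (x1, y1, x2, y2)), the function names and the 0.5 threshold are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames: List[List[Tuple[str, Box]]], threshold: float = 0.5):
    """Chain detections of the same object across consecutive video frames.

    Each returned track is {"label": str, "boxes": {frame_index: box}}.
    A detection extends the same-label track whose box in the previous
    frame overlaps it most, provided the IoU reaches the threshold.
    """
    tracks: List[Dict] = []
    for idx, detections in enumerate(frames):
        used = set()  # tracks already extended in this frame
        for label, box in detections:
            best, best_score = None, threshold
            for track_id, track in enumerate(tracks):
                previous = track["boxes"].get(idx - 1)
                if track_id in used or track["label"] != label or previous is None:
                    continue
                score = iou(previous, box)
                if score >= best_score:
                    best, best_score = track_id, score
            if best is None:
                tracks.append({"label": label, "boxes": {idx: box}})
                best = len(tracks) - 1
            else:
                tracks[best]["boxes"][idx] = box
            used.add(best)
    return tracks


# Hypothetical detections for three video frames, invented for illustration.
frames = [
    [("person", (0, 0, 10, 10)), ("dog", (50, 50, 60, 60))],
    [("person", (1, 0, 11, 10)), ("dog", (51, 50, 61, 60))],
    [("person", (100, 100, 110, 110))],
]
tracks = link_detections(frames)
```

Because every track keeps its boxes keyed by frame index, an annotator could still correct or re-link a single box without discarding the rest of the track.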

Idea #3 – Data compilation and reporting module:

Multimodal annotation in FN-Br, as currently implemented, stores data for both the sentence annotation (related to the audio transcriptions and subtitles) and the video annotation. For the textual annotation, semantic labels are applied to words and phrases, and initial and final time stamps for each sentence are stored. Video annotation, in turn, associates semantic labels with bounding boxes, and initial and final time stamps are also associated with the boxes. Since we investigate (a)synchronicities between the textual and visual semantics, the tool needs to provide a data compilation and reporting system that allows users to track the relations held between the semantic representations associated with a multimodal corpus.
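One simple view such a report could offer is the temporal overlap between text and video annotations that carry the same frame. The sketch below assumes each annotation has been reduced to a (frame_name, start, end) triple in seconds. That record shape, the function names and the example values are assumptions made here for illustration, not the tool's storage format.

```python
from typing import Dict, List, Tuple

Timed = Tuple[str, float, float]  # (frame name, start, end), in seconds


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def synchronicity_report(text: List[Timed], video: List[Timed]) -> List[Dict]:
    """Pair text and video annotations evoking the same frame and measure overlap."""
    rows = []
    for t_frame, t_start, t_end in text:
        for v_frame, v_start, v_end in video:
            if t_frame != v_frame:
                continue
            shared = overlap(t_start, t_end, v_start, v_end)
            rows.append({
                "frame": t_frame,
                "text_span": (t_start, t_end),
                "video_span": (v_start, v_end),
                "overlap": shared,
                "synchronous": shared > 0,
            })
    return rows


# Hypothetical annotations, invented for illustration.
text_annotations = [("Touring", 1.0, 4.0)]
video_annotations = [("Touring", 3.0, 6.0), ("Touring", 8.0, 9.0), ("People", 1.0, 2.0)]
report = synchronicity_report(text_annotations, video_annotations)
```

Rows with zero overlap are kept on purpose, since asynchronous pairings are as much an object of study as synchronous ones.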

Why this Idea Set is Innovative:

There are several text annotation tools and several video and/or image annotation tools. However, none of them control for the synchronicity between different media types nor allow for cross-annotation. Also, none of them are frame-based and, therefore, none of them yield, as an annotation product, material that can shed light on the role of multimodality in language comprehension.

The Project:

Student: Prishita Ray
Affiliation: Vellore Institute of Technology
Project: New Frame-Based Image and Video Annotation Pipeline for the FrameNet Brasil Web Annotation Tool

3.3. New Data Visualization and Compatibility Features for FrameNet

Mentors: Ely Matos (FN-Br | UFJF) | Marcelo Viridiano (FN-Br | UFJF) | Collin Baker (FrameNet | ICSI) | Tiago Torrent (FN-Br | UFJF) 

General Context:

As the FrameNet Brasil Web Annotation Tool has been used for other projects, as well as in the Global FrameNet Shared Annotation Task, new data compatibility and visualization features have been demanded by the community.

The Idea:

This idea revolves around the implementation of features such as:

  • data import/export from/to formats used by other projects/tools. Specifically, the Berkeley FN XML standard, the Universal Dependencies CoNLL-U format and the WebAnno standards should be considered;
  • graph-based data visualization interface, making FrameNet data more accessible to users.
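As an illustration of the export side, the sketch below writes one sentence as CoNLL-U, which uses ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), `_` for unspecified values and a blank line after each sentence. Storing frame information in the MISC column under `Frame=` and `FE=` keys is a convention assumed for this sketch only. It is not part of the Universal Dependencies guidelines, nor an existing FN-Br export format, and the syntactic columns are simply left unspecified.

```python
from typing import Dict, List, Optional


def to_conllu(tokens: List[str], misc: Dict[int, str], text: Optional[str] = None) -> str:
    """Serialize one sentence as CoNLL-U, with annotations in the MISC column.

    misc maps 1-based token IDs to a MISC string such as "Frame=Touring".
    All other linguistic columns are left as "_".
    """
    lines = []
    if text is not None:
        lines.append(f"# text = {text}")
    for token_id, form in enumerate(tokens, start=1):
        columns = [str(token_id), form] + ["_"] * 7 + [misc.get(token_id, "_")]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n\n"  # a blank line terminates the sentence


# Hypothetical sentence and labels, invented for illustration.
conllu = to_conllu(
    ["We", "toured", "the", "museum"],
    {1: "FE=Tourist", 2: "Frame=Touring", 3: "FE=Attraction", 4: "FE=Attraction"},
    text="We toured the museum",
)
```

A real exporter would also need to decide how to encode multiple AnnotationSets over the same tokens, which a single MISC value per token does not cover.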

Why this Idea is Innovative:

FrameNet data is rich and dense. All this richness, plus the network-based structure of FrameNet, makes traditional list-based and table-based data visualization inadequate. Nonetheless, no project so far has invested in more suitable data visualization for frame-based semantic representations that incorporates the whole of the FrameNet structure. Also, as the need for fine-grained semantic representations used together with other computational tools grows, making FrameNet data compatible with other formats is mandatory.

No successful projects have been concluded for this idea.