UFJF - Universidade Federal de Juiz de Fora

Google Summer of Code


The FrameNet Brasil Computational Linguistics Lab at the Federal University of Juiz de Fora, Brazil, has been accepted as a mentor organization for Google Summer of Code 2019. This page is the main reference point for students submitting projects that address the ideas listed below.

 

1. FrameNet 101

A framenet is a semantically oriented computational resource in which language material (words, multi-word expressions and grammatical constructions) is linked to a network of frames that help define its meaning. In the context of Frame Semantics, a frame is a scene: a system of interrelated concepts that defines the participants in the scene, the props they use, and the way they interact. The key notion in a framenet is that the meaning of words, as well as the meaning of other levels of linguistic structure, depends on the frames associated with them; that is, words may evoke frames. Take the verb tour, for example. To understand this word, a speaker of English recruits the Touring frame, which has three core participants: the Tourist, an Attraction and a Place. All three elements must be cognitively present for the idea of touring to be interpreted; there is no touring if any one of them is missing. Additionally, frames are connected to one another via a series of relations, providing a cognitive semantic structure against which meaning is defined.

In FrameNet Brasil we apply this kind of semantically oriented structure to tackle important issues in Natural Language Understanding. In the ideas list below, we explain those issues further.

To learn more about FrameNet Brasil, consider the following papers:

 

2. How to Apply

Successful applicants will submit projects that address the issues in the ideas list, combining the structured data FN-Br has developed over the past decade with the computational techniques they find best suited to the proposed goals. Please note that FN-Br is not only about big data, machine learning or whatever purely statistical approach to language is currently available. The work in FN-Br is model-based as well as data-driven. The issues driving the mentoring process during GSoC 2019 cannot be solved solely by training an algorithm on a large amount of raw data. With that in mind, applicants should follow the steps below to submit their applications:

1. Read the papers listed in section 1 of the Summer of Code page on the FN-Br website;

2. Look at the data reports available at the FN-Br website and familiarize yourself with the kind of structure FN-Br builds;

4. If you have questions or need further clarification on any of the ideas, feel free to email us at projeto.framenetbr@ufjf.edu.br. There’s also a Slack workspace for GSoC; if you’d like to join, just send us an email and we’ll invite you;

5. Write a 1-3 page pre-project and submit it to projeto.framenetbr@ufjf.edu.br by March 15th if you’d like to get feedback from FN-Br before the official submission;

6. Use the feedback provided by our team to improve your proposal and submit it to GSoC through their web page.

 

3. Ideas List

Two aspects are important to keep in mind while reading the ideas described below:

  1. The core of the FN-Br data model includes three entities: Frames, Frame Elements and Lexical Units (LUs). Frames (representing the meaning of a scene or event) are composed of Frame Elements (participants in the scene or event). A Lexical Unit represents the association, or pairing, of a word or multiword expression with a Frame.
  2. The basic annotation process for a sentence consists of choosing a specific word in the sentence (the target Lexical Unit) and assigning semantic labels (Frame Elements) to the other words or expressions in the sentence that depend on the target. Currently, only sentences (i.e. texts) are annotated in the FN-Br project. The annotated target LU together with its Frame Elements is called an AnnotationSet. Since several words in a sentence can be chosen as LUs, a single sentence may be associated with several AnnotationSets.
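The two points above can be pictured with a minimal Python sketch. It is only an illustration, not the FN-Br database schema: the class names, fields and the example sentence are assumptions, and only the Touring frame with its three core participants comes from section 1.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Frame:
    name: str
    frame_elements: tuple[str, ...]  # participants in the scene or event


@dataclass(frozen=True)
class LexicalUnit:
    lemma: str    # a word or multiword expression
    frame: Frame  # the frame this lemma is paired with


@dataclass
class AnnotationSet:
    target: LexicalUnit
    # Maps a Frame Element name to the stretch of the sentence that fills it.
    labels: dict[str, str] = field(default_factory=dict)

    def label(self, frame_element: str, span: str) -> None:
        # Only Frame Elements of the target's frame may be used as labels.
        if frame_element not in self.target.frame.frame_elements:
            raise ValueError(
                f"{frame_element} is not a Frame Element of {self.target.frame.name}"
            )
        self.labels[frame_element] = span


# Hypothetical example: one AnnotationSet for one target LU in one sentence.
touring = Frame("Touring", ("Tourist", "Attraction", "Place"))
tour = LexicalUnit("tour.v", touring)

sentence = "We toured the museum in Juiz de Fora."
annotation = AnnotationSet(target=tour)
annotation.label("Tourist", "We")
annotation.label("Attraction", "the museum")
annotation.label("Place", "in Juiz de Fora")
```

A sentence with several target LUs would simply carry a list of such AnnotationSet objects, one per target.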

 

3.1. Frame-Based Metric for Machine Translation

Mentors: Ely Matos (FN-Br | UFJF) | Oliver Czulo (Leipzig University) | Tiago Torrent (FN-Br | UFJF) | Wagner Arbex (Computer Science Dept. | UFJF)

General Context:

FrameNet Brasil is leading, together with Berkeley FrameNet, a shared effort among framenets to annotate parallel corpora using the Berkeley FrameNet Data Release 1.7. The first text selected for annotation in this effort was the transcript of the TED Talk “Do Schools Kill Creativity?” by Ken Robinson. So far, the transcript of the talk and its translations (or at least parts of them) have been annotated for English, Brazilian Portuguese, Japanese, German, Swedish, Greek, Hindi and Urdu. Annotators are allowed to create the lexical units they will annotate in the text, but they cannot change the frames and frame elements in the Berkeley FrameNet Data Release 1.7. Hence, the basic semantic structure for annotation is the same across languages. The annotation tool used is the FN-Br Web Annotation Tool, which, for this task, lets annotators register that a given semantic frame is not well suited to a given lexical item in the context of the annotation. Those indications are parametrized and therefore constitute a set of hints about differences between languages. A first report on the results of the shared annotation task can be found here.

The shared annotation task has shed some light on how translations of the same text may differ in terms of the semantic frames recruited by the text in the process of meaning construction. Some of those differences stem from the imposition of different perspectives on the meaning, some are triggered by formal aspects of the text, and others relate to culturally grounded metaphors.

The Idea:

Using the data generated from the annotation of the TED Talk in two or more languages (German and Brazilian Portuguese being mandatory, since this idea is developed in collaboration with the Department of Translation Studies at Leipzig University), as well as the relations between frames given in the Berkeley FrameNet Data Release 1.7, applicants interested in working with this idea should present a project to develop a new machine translation metric, which is intended to measure the frame distance between sentence pairs in two languages.
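To make the notion of frame distance more concrete, here is one possible starting point, sketched in Python under strong assumptions. It treats frame-to-frame relations as an undirected graph, measures the distance between two frames as the length of the shortest path between them, and averages nearest-frame distances in both directions for a sentence pair. The frame names ("A" to "D"), the relation list, the `unreachable` penalty and all function names are hypothetical placeholders; they are not taken from the Berkeley FrameNet Data Release 1.7, and applicants are expected to propose their own formulation.

```python
from collections import deque


def build_graph(relations):
    """Build an undirected adjacency map from (frame, frame) relation pairs."""
    graph = {}
    for a, b in relations:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def frame_distance(graph, start, goal):
    """Shortest number of relation hops between two frames (None if unconnected)."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def sentence_distance(graph, source_frames, target_frames, unreachable=10):
    """Symmetric average of nearest-frame distances between two sentences."""
    if not source_frames or not target_frames:
        raise ValueError("both sentences need at least one annotated frame")

    def directed(xs, ys):
        total = 0
        for x in xs:
            dists = [frame_distance(graph, x, y) for y in ys]
            # Unconnected frame pairs receive a fixed penalty (an assumption).
            total += min(unreachable if d is None else d for d in dists)
        return total / len(xs)

    return (directed(source_frames, target_frames)
            + directed(target_frames, source_frames)) / 2


# Toy relation chain A - B - C - D standing in for real frame relations.
graph = build_graph([("A", "B"), ("B", "C"), ("C", "D")])
score = sentence_distance(graph, ["A", "C"], ["A", "D"])
```

In this toy setup a score of 0 means both sentences evoke exactly the same frames, and larger values mean the evoked frames lie further apart in the relation network. A real metric would also have to decide how to weight relation types and how to use the annotators' indications that a frame fits a lexical item poorly.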

Why this Idea is Innovative:

Machine Translation metrics are usually based on statistically built formal models of a language. Semantically oriented metrics tend to rely on (Automatic) Semantic Role Labeling, which performs poorly for most languages other than English. The kind of metric envisioned here relies on the FrameNet structure, which is already available and has been expanded to several languages over the past two decades.

 

3.2. Frame-Based Image and Video Annotation Module for the FrameNet Brasil Web Annotation Tool

Mentors: Ely Matos (FN-Br | UFJF) | Frederico Belcavello (FN-Br | UFJF) | Tiago Torrent (FN-Br | UFJF)

General Context:

FrameNet Brasil is developing the m.knob app, an application that provides users with recommendations of tourist attractions and activities. To do so, the app parses users’ comments about attractions posted online and creates meaning representations for them. Those representations are then matched to the input users give, in natural language, to a conversational interface. Specifics on how this is done can be found here.

While analyzing the kind of information available online about tourist venues, our team found a large amount of multimodal data, such as narrated videos and pictures attached to comments. That was the motivation for this idea: trying to find frame-based correlations between text and other media.

The Idea:

Using the FrameNet Brasil Web Annotation Tool (currently used for text annotation) as a starting point, applicants interested in working with this idea should present a project to develop a frame-based image and audiovisual annotation module for the FN-Br WebTool. Such a module is not meant to be independent from the text annotation tool, especially in the case of audiovisual annotations, since timing has proven to be a key issue in measuring frame correlations across different media. Hence, besides allowing annotators to choose frames and locate frame elements both in the text and in the images, the tool must keep track of the time span in which those elements are active in the video and in its audio. The tool must also allow annotators to locate in the image frame elements that are elliptical in the text (spoken audio), and vice versa.
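As a rough illustration of the requirements above, the Python sketch below shows one way a time-aware annotation record could look. It is not the WebTool's data model: the `MediaAnnotation` class, its fields, the helper functions and the example timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MediaAnnotation:
    frame: str
    frame_element: str
    medium: str   # "text", "audio" or "image" (assumed labels)
    start: float  # seconds from the beginning of the video
    end: float
    # Region of the picture for image annotations: x, y, width, height.
    box: Optional[Tuple[int, int, int, int]] = None


def overlap(a: MediaAnnotation, b: MediaAnnotation) -> float:
    """Number of seconds during which both annotations are active."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def image_only_elements(annotations):
    """Frame elements located in the image but elliptical in text and audio."""
    in_image = {(a.frame, a.frame_element)
                for a in annotations if a.medium == "image"}
    in_language = {(a.frame, a.frame_element)
                   for a in annotations if a.medium in ("text", "audio")}
    return in_image - in_language


# Hypothetical clip: the narration mentions the attraction, while the
# tourist is only visible on screen.
spoken = MediaAnnotation("Touring", "Attraction", "audio", 12.0, 14.5)
shown = MediaAnnotation("Touring", "Attraction", "image", 11.0, 16.0,
                        box=(40, 60, 200, 120))
tourist = MediaAnnotation("Touring", "Tourist", "image", 11.0, 16.0,
                          box=(300, 80, 90, 180))

shared_seconds = overlap(spoken, shown)
missing_in_speech = image_only_elements([spoken, shown, tourist])
```

Storing start and end times on every annotation, regardless of medium, is what makes it possible to compare when a frame element is spoken with when it is shown; the symmetric case (elements present in the text but absent from the image) can be computed by swapping the two sets.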

Why this Idea is Innovative:

There are several text annotation tools and several video and/or image annotation tools. However, none of them controls for synchronicity between different media types or allows for cross-annotation. Moreover, none of them is frame-based, so none yields, as an annotation product, material that can shed light on the role of multimodality in language comprehension.
