SLABank Spanish GRERLI-ES2 Corpus


Liliana Tolchinsky
Department of Linguistics
University of Barcelona


Elisa Rosado
Department of Linguistics
University of Barcelona


Melina Aparici
Department of Psychology
Autonomous University of Barcelona


Rocío Cuberos
Department of Linguistics
University of Barcelona


Type of Study: Spoken and written expository texts, spoken and written narrative texts
Location: Spain
Media type: not available
DOI: doi:10.21415/2DJF-JH34

Citation information

Cuberos, R. (2019). Indicadores léxicos de calidad textual en español nativo y no nativo. [Lexical correlates of text quality in native and non-native Spanish]. Doctoral dissertation. University of Barcelona.

Cuberos, R., Rosado, E., & Perera, J. (2019). Using deliberate metaphor in discourse: native vs. nonnative text production. In I. Navarro (Ed.), Current Approaches to Metaphor Analysis in Discourse (pp. 235-256). De Gruyter Mouton.

Rosado, E., Aparici, M., & Perera, J. (2014). Adapting to the circumstances: on discourse competence in L2 Spanish. Culture and Education, 26(1), 71-102.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Spanish GRERLI-ES2 corpus was compiled between 2006 and 2009 under the supervision of Dr. Elisa Rosado (University of Barcelona) as part of the project Lexical, morphosyntactic, and discursive markers in the development of text quality in L2 Spanish and Catalan (EDU2012-34394; P.I.: J. Perera, University of Barcelona). Together with Dr. Elisa Rosado, Dr. Liliana Tolchinsky, Dr. Melina Aparici and Dr. Rocío Cuberos were responsible for compiling and transcribing the corpus. The main goal of this project was to describe how non-native speaker-writers from different L1 backgrounds develop discursive abilities across educational levels, and to analyze how they use these linguistic resources to produce discourse in different genres (expository and narrative) and modalities of production (spoken and written).

The corpus consists of 320 spoken and written narrative and expository texts collected following the procedure designed within the international project Developing Literacy in Different Contexts and Different Languages (P.I.: R. Berman) (Berman & Verhoeven, 2002).

This corpus mirrors both the Spanish GRERLI-ES1 and the Catalan GRERLI-CAT1 corpora, available on the CHILDES database. These corpora include data from native speakers of Spanish and Catalan, respectively, and were collected following the same procedure as in this L2 corpus and in the international crosslinguistic project. For a description of the GRERLI-ES1 corpus, see this page. For a description of the GRERLI-CAT1 corpus, see this page.

Participants and Procedure

The Spanish GRERLI-ES2 corpus compiles texts produced by non-native speakers of Spanish. Participants were distributed into three groups according to their educational level and age: 4th grade (9 years old), 1st year of junior high (12 years old) and university students (over 20 years old). Participants were also distributed into three groups according to their L1 (Arabic, Chinese and Korean) and into two levels of proficiency in Spanish: beginner-intermediate (A2, B1) and advanced (B2, C1). The level of proficiency was established following the Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001). Participants were recruited in Madrid, Barcelona and Murcia, and were required to have resided in Spain for at least four years to ensure the minimum abilities needed to produce the texts. Levels of proficiency of each participant are detailed in Table 1. Age means and ranges in the corpus are the following: grade-school children, mean age: 10;2 (range: 8;2 – 12); junior-high students, mean age: 13;9 (range: 11;1 – 15;8); university adults, mean age: 20;8 (range: 20 – 43;6).

All participants produced four texts. After watching a three-minute video without text, participants were asked to produce a spoken expository text, a written expository text, a spoken narrative text, and a written narrative text. The target video shows different conflict situations in schools, such as fights, marginalizing classmates and cheating in exams, and was used to unify discourse content and enable the comparison of linguistic features across different texts. Data were elicited in two sessions, following one of two orders of text production: B (first session: written narrative/spoken narrative; second session: written expository/spoken expository) and D (first session: written expository/spoken expository; second session: written narrative/spoken narrative). Note that tasks were balanced by discourse genre, but not by modality: in both orders, the written text preceded the spoken text within each session. This decision was taken to facilitate non-native speakers’ production (Jisa, 2004). For a full description of the instructions given to the participants, see the GRERLI-ES1 corpus page. By contrast, the elicitation procedure for native speakers in the GRERLI-ES1 corpus was balanced by modality as well as by genre.

Transcription, Filenames and ID

Both spoken and written texts were transcribed in CHAT format (MacWhinney, 2000; 2012). Spoken productions were transcribed orthographically (not phonetically), including processing information (pauses, repetitions, reformulations, etc.). In written texts, spelling mistakes were followed by the correct form. The transcription unit was the clause, so that each main tier corresponds to a clause, following Berman and Slobin’s (1994, p. 660) definition as “any unit that contains a unified predicate. By unified, we mean a predicate that expresses a single situation (activity, event, state). Predicates include finite and nonfinite verbs, as well as predicate adjectives”.

Filenames were created as follows (seven fields in this order; the participant number takes two characters):

  1. Language of production: p, Spanish
  2. First language: a, Arabic / k, Korean / z, Chinese
  3. Age group: g, Grade school / j, Junior high / h, High school
  4. Participant number: 01 - 29
  5. Genre: e, Expository / n, Narrative
  6. Modality: s, Spoken / w, Written
  7. Order: b, Order B / d, Order D
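
The filename scheme above can be decoded mechanically. The following is a minimal Python sketch, not part of the corpus tooling: the function name `decode_filename`, the example name `pag01esb.cha`, the two-digit participant number and the optional `.cha` extension are assumptions made for illustration, while the code letters are copied from the list above.

```python
import re

# Code tables copied from the filename scheme described above.
L1 = {"a": "Arabic", "k": "Korean", "z": "Chinese"}
AGE_GROUP = {"g": "Grade school", "j": "Junior high", "h": "High school"}
GENRE = {"e": "Expository", "n": "Narrative"}
MODALITY = {"s": "Spoken", "w": "Written"}
ORDER = {"b": "Order B", "d": "Order D"}

# Seven fields in order; the participant number is assumed to be two digits.
FILENAME_RE = re.compile(r"^p([akz])([gjh])(\d{2})([en])([sw])([bd])$")


def decode_filename(name):
    """Decode a GRERLI-ES2-style filename into a dict of labelled fields."""
    stem = name.lower()
    if stem.endswith(".cha"):  # assumed extension for CHAT transcripts
        stem = stem[: -len(".cha")]
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"not a recognised filename: {name!r}")
    l1, age, number, genre, modality, order = match.groups()
    return {
        "language": "Spanish",  # 'p' is the only language code listed
        "l1": L1[l1],
        "age_group": AGE_GROUP[age],
        "participant": int(number),
        "genre": GENRE[genre],
        "modality": MODALITY[modality],
        "order": ORDER[order],
    }


# Hypothetical example name, built from the scheme rather than taken from the corpus.
info = decode_filename("pag01esb.cha")
print(info["l1"], info["age_group"], info["genre"], info["modality"], info["order"])
```

Names that do not fit the seven-field pattern raise a `ValueError`, which makes it easy to spot files that follow a different convention.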

In the @ID header for each participant, the sixth field corresponds to Age Group and it uses these numbers: 1=Grade School, 2=Junior High, 4=University. The eighth field is the L1 of the participants. The final custom field encodes the type of text: ES = Expository Spoken, EW = Expository Written, NS = Narrative Spoken, NW = Narrative Written.

Specific uses of CHAT transcription symbols

Information about some special uses of CHAT symbols in the transcripts that might be useful to work with the corpus is detailed in the GRERLI-ES1 corpus page. Unlike the GRERLI-ES1 corpus, the symbol _ (instead of +) was used for multiword expressions written as several words, i.e., unanalyzed chunks such as o_sea, no_sé_qué, por_ejemplo (‘that_is’, ‘I_don’t_know what’, ‘for_example’). This transcription allows for different types of word counts (i.e., counting multiword expressions as one or more than one word).
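
The two ways of counting multiword expressions can be illustrated with a short Python sketch. The function name `count_words` is hypothetical, and the sketch assumes the input is plain utterance text from which speaker codes and other CHAT annotations have already been removed; it is an illustration of the counting options, not a tool used by the corpus authors.

```python
def count_words(utterance, split_multiword=False):
    """Count whitespace-separated tokens in an utterance.

    With split_multiword=False, a chunk such as 'o_sea' counts as one word.
    With split_multiword=True, each part joined by '_' counts separately.
    """
    tokens = utterance.split()
    if not split_multiword:
        return len(tokens)
    # Ignore empty parts in case a token starts or ends with '_'.
    return sum(len([part for part in token.split("_") if part]) for token in tokens)


# The three multiword expressions mentioned in the corpus description.
sample = "o_sea no_sé_qué por_ejemplo"
print(count_words(sample))                        # chunks counted as single words
print(count_words(sample, split_multiword=True))  # chunks counted part by part
```

For the sample above, the first call counts three words and the second counts seven.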


Data collection was completed in the context of the following research project, granted by a national funding agency through a competitive public call:

El desarrollo del repertorio lingüístico en hablantes no nativos de castellano y catalán [The development of the linguistic repertoire in non-native speakers of Spanish and Catalan] (2006-2009)
Funding Agency: MEDU - Ministerio de Educación y Ciencia
Reference: SEJ2006-11083
Principal Investigator: Joan Perera
Members of the research team: Melina Aparici, Carmen Arbonés, Aurora Bel, Harriet Jisa, Pilar Monné, Estrella Nicolás, Elisa Rosado, Miquel Siguán, Janet van Hell
Collaborating senior researcher: Liliana Tolchinsky
Research assistants: Alicia Doménech, Rachid Lamartí, Naymé Salas, Agustín Zapatero

Research projects that used the corpus are:

Hacia el dominio experto de la lengua: estudio comparado del desarrollo del repertorio lingüístico nativo y no nativo en castellano y catalán [Towards expert command of the language: a comparative study of the development of the native and non-native linguistic repertoire in Spanish and Catalan] (2009-2012)
Funding Agency: MEDU - Ministerio de Educación y Ciencia
Reference: EDU2009-08862
Principal Investigator: Joan Perera
Members of the research team: Melina Aparici, Ruth Berman, Florence Chenu, Carolina Forns, Harriet Jisa, Estrella Nicolás, Elisa Rosado, Miquel Siguán, Mª Dolores Toledo, Agustín Zapatero
Collaborating senior researcher: Liliana Tolchinsky
Research assistants: Laia Cutillas, Naymé Salas

El desarrollo de la calidad textual en castellano y catalán L2. Indicadores léxicos, morfosintácticos y discursivos [Lexical, morphosyntactic, and discursive markers in the development of text quality in L2 Spanish and Catalan] (2013-2015)
Funding Agency: MEDU - Ministerio de Educación y Ciencia
Reference: EDU2012-34394
Principal Investigator: Joan Perera
Members of the research team: Eduard Abelenda, Melina Aparici, Ruth Berman, Florence Chenu, Harriet Jisa, Elisa Rosado, Anna Sibayan, M. Rosa Solé, Liliana Tolchinsky
Research assistants: Mila Albert, Rocío Cuberos, Laura González, Naymé Salas