BCN-L2 Corpus

Aurora Bel Gaya
Department of Translation and Language Sciences
Pompeu Fabra University
aurora.bel@upf.edu

Participants:	114
Type of Study:	longitudinal
Location:	Spain
Media type:	audio
DOI:	doi:10.21415/T5GC87

Citation information

Bel, A. & García-Alcaraz, E. (2013). Subjects in the L2 Spanish of Moroccan Arabic speakers: evidence from bilingual and second language learners. T. Judy & S. Perpiñán (eds.) The Acquisition of Spanish as a Second Language: Data from Understudied Languages Pairings. Amsterdam: John Benjamins.

Bel, A., García-Alcaraz, E., & Rosado, E. (forthcoming). Reference comprehension and production in L2 Spanish: the view from null-subject languages. Issues in Hispanic and Lusophone Linguistics. Amsterdam: John Benjamins.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The BCN-L2 Spanish Corpus was collected within a research project supported by two grants to Aurora Bel from the Ministry of Science and Innovation of the Spanish Government (FFI2009-09349 & FFI2012-35058). The project investigates phenomena at the syntax-pragmatics and syntax-morphology interfaces in the acquisition of new languages (mainly L2 Catalan and L2 Spanish) in educational contexts.

The corpus consists of 228 spoken and written narrative texts gathered following the procedure designed within the international project Developing Literacy in Different Contexts and Different Languages, P.I.: R. Berman (Berman, 2008). Participants were shown a three-minute silent video displaying scenes of interpersonal conflicts at school, and were then asked to tell and write in Spanish a similar story that happened to a friend. The fact that the participants were asked to tell somebody else’s story necessarily implies the production of third-person referents, as opposed to what happens with personal narratives.

Three research assistants (Júlia Perera, Mònica Tarrés and Estela García-Alcaraz, who also supervised the process) collected the data and transcribed the spoken and written texts. Transcription and assessment of language level were coordinated by Dr. Elisa Rosado.

Participants (origin, ages, and language proficiency)

Data were collected during the spring of 2011 and 2012 in different secondary schools in the metropolitan area of Barcelona. Participants are 88 native speakers of Moroccan Arabic (Darija) and 26 speakers of Berber (Amazigh) living in Catalonia. For all participants, Moroccan Arabic or Berber is the family language. In most cases their first contact with Spanish and Catalan (the two environmental languages) coincides with their entry into the Spanish school system (usually at preschool level). In general, they use the family language on a daily basis with family and the environmental languages with friends (for a detailed description of language use patterns and language proficiency, see Bel & García-Alcaraz 2013).

Participants were grouped into four age ranges, as established by the Spanish secondary education system (Enseñanza secundaria obligatoria, ESO). The correspondences between the Spanish and US systems are shown in Table 1 below.

Table 1. Age ranges and grades

Age range    Spanish grade    US equivalent
12-13        1º ESO           7th grade
13-14        2º ESO           8th grade
14-15        3º ESO           9th grade
15-16        4º ESO           10th grade

Participants were also classified by level of proficiency in Spanish. We followed the criteria established by the CEFR (Common European Framework of Reference for Languages, 2001), which distinguishes three broad levels, each divided into two sublevels, for a total of six:

Table 2. Levels of proficiency in Spanish

CEFR level            Sublevel                                            Level of proficiency
A Basic User          A1 Breakthrough or beginner                         1
                      A2 Waystage or elementary                           2
B Independent User    B1 Threshold or intermediate                        3
                      B2 Vantage or upper intermediate                    4
C Proficient User     C1 Effective Operational Proficiency or advanced    5
                      C2 Mastery or proficiency                           6

Filenames and ID

All participants were assigned a code number to ensure confidentiality, and this number identifies the two files containing the transcriptions of their oral and written narratives. Filenames are built from the following elements, in this order:

Subject number: from 01 to 156
L1 language: dar stands for Darija (Moroccan Arabic); ber stands for bereber (Berber)
Age range: 1E, 2E, 3E, 4E, where E stands for ESO
Text modality: o stands for spoken (oral), e stands for written

For example, a file named ‘10ber1Eo.cha’ is an oral text produced by participant number 10, who is a native speaker of Berber in the 1st grade of ESO.
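
The filename syntax above can be decoded mechanically. The following is a minimal Python sketch, not part of the corpus tooling: the names `parse_filename` and `TranscriptName` are hypothetical, and the pattern assumes that subject numbers are zero-padded to at least two digits, as the range "01 to 156" suggests.

```python
import re
from typing import NamedTuple

# Codes taken from the filename syntax described above.
L1_CODES = {"dar": "Moroccan Arabic (Darija)", "ber": "Berber"}
MODALITY_CODES = {"o": "spoken", "e": "written"}

# Subject number, L1 code, ESO grade (1E-4E), modality, ".cha" extension.
# Assumption: subject numbers have two or three digits (01 to 156).
FILENAME_RE = re.compile(
    r"^(?P<subject>\d{2,3})(?P<l1>dar|ber)(?P<grade>[1-4])E(?P<modality>[oe])\.cha$"
)


class TranscriptName(NamedTuple):
    subject: int
    l1: str
    eso_grade: int
    modality: str


def parse_filename(name: str) -> TranscriptName:
    """Decode a BCN-L2 transcript filename such as '10ber1Eo.cha'."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a BCN-L2 transcript filename: {name!r}")
    return TranscriptName(
        subject=int(match["subject"]),
        l1=L1_CODES[match["l1"]],
        eso_grade=int(match["grade"]),
        modality=MODALITY_CODES[match["modality"]],
    )


print(parse_filename("10ber1Eo.cha"))
```

Raising an error on names that do not fit the pattern makes it easy to spot files that follow a different convention instead of silently misreading them.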
ID headers are arranged as follows:

@Participants: STU Target_Student, INV Investigator
@ID: spa|periferias_L2|STU|16;08.00|male|ber|26|Target_Student|4E|2|
@ID: spa|periferias_L2|EST|||||Investigator|||

The participants are introduced in the compulsory @Participants header with the codes STU (for Student) and INV (for Investigator) and their corresponding roles. The information in the ID header for the target student is structured as follows: target language (spa=Spanish), project name (periferias_L2), participant code (STU), age, sex (male or female), participant’s L1 (ber=Berber or ary=Moroccan Arabic), subject number code (as explained above), participant’s role, grade in the Spanish school system (1E, 2E, 3E, 4E, as specified in Table 1) and level of proficiency in Spanish according to the CEFR (from 1 to 6, as specified in Table 2).
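
As an illustration of the field order just described, here is a minimal Python sketch that splits an @ID line into named fields. The function names, the dictionary keys and the added `cefr` key are hypothetical choices made for this example. The age conversion relies on the CHAT age format, years;months.days.

```python
from typing import Dict

# Field order as described above for the target student's @ID header.
# The key names are hypothetical labels chosen for this sketch.
ID_FIELDS = (
    "language", "project", "code", "age", "sex",
    "l1", "subject", "role", "grade", "level",
)

# Numeric proficiency codes from Table 2.
CEFR_BY_CODE = {"1": "A1", "2": "A2", "3": "B1", "4": "B2", "5": "C1", "6": "C2"}


def parse_id_header(line: str) -> Dict[str, str]:
    """Split an @ID header line into a dictionary of named fields."""
    prefix = "@ID:"
    if not line.startswith(prefix):
        raise ValueError(f"not an @ID header: {line!r}")
    values = line[len(prefix):].strip().split("|")
    if len(values) < len(ID_FIELDS):
        raise ValueError(f"expected {len(ID_FIELDS)} fields: {line!r}")
    # The trailing "|" produces one extra empty element; zip() ignores it.
    record = dict(zip(ID_FIELDS, values))
    # Empty string when no proficiency level is recorded (e.g. the investigator).
    record["cefr"] = CEFR_BY_CODE.get(record["level"], "")
    return record


def age_in_months(age: str) -> int:
    """Convert a CHAT age such as '16;08.00' (years;months.days) to whole months."""
    years, rest = age.split(";")
    months = rest.split(".")[0]
    return int(years) * 12 + int(months)


student = parse_id_header(
    "@ID: spa|periferias_L2|STU|16;08.00|male|ber|26|Target_Student|4E|2|"
)
print(student["l1"], student["grade"], student["cefr"], age_in_months(student["age"]))
```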

Some notes on transcription

All the collected texts (spoken and written) are orthographically transcribed following CHAT conventions and segmented into clauses, so that each tier contains a clause (Berman & Slobin 1994). All the transcriptions were checked by a second transcriber to ensure reliability. Other important remarks concerning transcription are listed below:

Proper names (people and institutions) are replaced by X, Y, W, etc.
Accents and the Spanish letter ‘ñ’ are preserved.
Correction of orthographic errors in written texts is included in brackets as shown in the following example:

Example 1
*STU: porque no decía nada respecto a esa situacion [: situación]
because he didn’t say anything about that situation

Word segmentation errors in written texts are marked as shown in the following examples:

es^condido (instead of ‘escondido’, hidden)
yasta [: ya está]

Omissions are marked differently depending on the modality of production:

Spoken texts: le ha da(d)o
He hit him
Written texts: le ha pillao [: pillado]
He caught him

Words in Catalan (the other environmental language) are marked with @s followed by the corresponding word in Spanish using the replace notation:

taula@s [: mesa] (‘taula’ is the Catalan word for table)

Enclitic pronouns, which are attached orthographically to the verb, are marked as follows:

dá+me+lo (give it to me)
quedar+se (to remain)
This does not affect proclitic pronouns, since they are conventionally written separately from the verb (‘me lo da’, he gives it to me).

Punctuation marks that could conflict with the CHAT format, as well as typographic conventions typical of written texts, are identified in brackets, as in the following examples:

Example 2
*STU: yo no (h)ice nada.
I didn’t do anything
*STU: y tampoco tenía intención [% punto].
and I had no intention either [% period]

Example 3
*STU: el [% e mayúscula] problema empezó.
the [% e upper case] problem started
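
The markers listed in this section can be removed with a few regular expressions when only the corrected Spanish text is needed. The sketch below is a minimal Python illustration under stated assumptions: the function name `to_target_form` is hypothetical, it handles only the conventions described here, and it expects the utterance text without the speaker prefix (*STU:).

```python
import re


def to_target_form(utterance: str) -> str:
    """Strip the markers described above, keeping the corrected Spanish text."""
    # word [: replacement] -> replacement (spelling fixes and Catalan @s words)
    text = re.sub(r"\S+\s+\[:\s*([^\]]+?)\s*\]", r"\1", utterance)
    # [% comment] annotations for punctuation and typography are dropped
    text = re.sub(r"\s*\[%[^\]]*\]", "", text)
    # da(d)o -> dado: restore material omitted by the speaker
    text = re.sub(r"\((\w+)\)", r"\1", text)
    # Remove any remaining @s markers, ^ segmentation marks and + clitic boundaries
    text = text.replace("@s", "").replace("^", "").replace("+", "")
    return text


for example in (
    "porque no decía nada respecto a esa situacion [: situación]",
    "yasta [: ya está]",
    "le ha da(d)o",
    "taula@s [: mesa]",
    "dá+me+lo",
    "el [% e mayúscula] problema empezó.",
):
    print(to_target_form(example))
```

The replacement rule runs first so that a Catalan word marked with @s is discarded together with its bracketed Spanish equivalent. Removing every + is a simplification that holds only while + marks nothing but clitic boundaries.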
