Towards a Refined Inventory of Lexical Bundles: An Experiment in the Formulex Method

COMPUTATIONAL LINGUISTICS / KOMPIUTERINĖ LINGVISTIKA

A number of corpus studies focusing on the description of the use and functions of lexical bundles have been conducted recently in order to explore the phraseology of learner language. As in any study of lexical bundles, the problem of overlapping or structurally incomplete items poses a particular challenge: in practice, it is often difficult to align such units with specific discourse functions. The fact that lexical bundles do not constitute neat form-and-meaning mappings results, among other reasons, from their being grounded in language use rather than in the language system. In this pilot study we test a new method called Formulex (Forsyth, 2015a; 2015b) to verify whether applying the criterion of coverage, in addition to the conventional criteria of orthographic length, minimum frequency and distribution range (Biber et al., 1999), may help obtain a more refined inventory of lexical bundles and hence facilitate further qualitative analyses. To that end, we use the Polish and Lithuanian components of the International Corpus of Learner English (ICLE, Granger et al., 2009), as well as the LOCNESS corpus (CECL), representing academic essays written by British and American students. The results reveal that many lexical bundles of fixed length identified in the conventional way are fragments of longer chunks of text and hence should not be treated as complete or standalone four-word lexical items. They also show that the Formulex method, in which word sequences are mutually exclusive, helps a researcher filter out overlapping or non-perceptually salient lexical bundles and, ultimately, specify more precise boundaries of lexical bundles of fixed length.

In recent years, recurrent sequences of words, known as lexical bundles, which have been shown to account for a significant part of both spoken and written English (Biber et al., 1999; Biber, Conrad, and Cortes, 2004; Cortes, 2008; Hyland, 2008a), have frequently been used as the unit of analysis in phraseologically-oriented studies, notably those employing corpus-driven methodology. Lexical bundles are sequences of three or more words that occur frequently in natural discourse and constitute lexical building blocks used by language users in different situational and communicative contexts (Biber et al., 1999, pp. 990–991), e.g. I don't think, as a result, the nature of the. More often than not, lexical bundles are not idiomatic in meaning; on the contrary, the meaning of a lexical bundle is transparent (Biber, Conrad, and Cortes, 2003, p. 134). As a result, studies of lexical bundles place at the forefront inconspicuous, not perceptually salient multi-word sequences which reach high frequencies in corpora. In this vein, Kopaczyk (2012, p. 5; 2013, pp. 54, 63) notes that lexical bundles are often either smaller than a phrase (notably, short bundles consisting of three or four words) or larger than a phrase (indicating complementation patterns of phrases).
A number of studies of lexical bundles have been conducted recently to explore the lexical characteristics of learner language, for example, to measure the extent to which these multi-word units are typical of spoken and written language produced by learners of English as a foreign language (EFL), or to investigate in what ways they can shed light on the process of foreign language acquisition. Such studies fall into three major groups. One research direction involves contrastive analyses of EFL learner language representing different first language (L1) backgrounds on the one hand and a comparable corpus of native speaker data on the other (De Cock, 2004; Juknevičienė, 2009; Chen and Baker, 2010; Ädel and Erman, 2012; Baumgarten, 2014; Kizil and Kilimci, 2014). The second research strand deals with studies of lexical bundles across different proficiency levels of the learners (Hyland, 2008b; Römer, 2009; Vidakovic and Barker, 2010; Juknevičienė, 2013; Leńko-Szymańska, 2014), which gives researchers a pseudo-longitudinal perspective allowing them to reveal changes in the use of lexical bundles alongside the increasing proficiency of learners. Finally, the third research direction, which until now remains less exploited, is a comparison of lexical bundles in corpora representing learners whose mother tongues are different, e.g. Paquot (2013; 2014). However, no study conducted so far has focused on the use of recurrent lexical bundles by Polish and Lithuanian learners of English.
Despite its growing popularity, research on lexical bundles has not been devoid of methodological challenges. More specifically, the problems are directly related to the selection of salient lexical bundles from an automatically generated list, which in most published research relies largely on the researcher's manual data analysis and subjective judgment. In particular, difficulties concern the methods used to deal with structurally incomplete bundles, filter out overlapping bundles, or select a representative sample of bundles other than by focusing on the most frequent items (Grabowski, in preparation), to name but a few. In practice, many lexical bundles overlap with each other or constitute fragments of longer contiguous sequences of words, a problem that has already been identified in the literature (cf. Appel and Trofimovich, 2015; Pęzik, 2015). For example, an initial list of lexical bundles might include such items as the fact that it, the fact that we, in the fact that, etc. In order to identify the salient unit the fact that, a researcher would need to go through the lists manually and at some point decide that a certain recurrent sequence is more salient than the others. Hence, the boundaries between many overlapping lexical bundles are not established objectively, let alone any subsequent alignment of the items with specific meanings or discourse functions (Appel and Trofimovich, 2015). That is why claims about all lexical bundles of fixed length being complete or distinct multi-word units often raise doubts and make it difficult to accept that lexical bundles may be stored as single wholes in the mental lexicon of language users. In a similar vein, Simpson-Vlach and Ellis (2010, p. 490) argue that "the fact that a formula is above a certain frequency threshold and distributional range does not necessarily imply either psycholinguistic salience or pedagogical relevance". It is the process of selecting the most salient lexical bundles that is the focus of this article.
In this pilot study, we test a recently proposed method, called Formulex (Forsyth, 2015b), in an attempt to deal with overlapping, non-perceptually salient or structurally incomplete lexical bundles. The aim of the study is to fine-tune the methodology used to identify lexical bundles in texts, which is expected to provide a refined, and pedagogically more useful, list of these multi-word units. We will therefore try to answer the following research question: can the Formulex method (Forsyth, 2015b) produce a refined list of non-overlapping and more perceptually salient lexical bundles as compared with the conventional approach (Biber et al., 1999)?
In view of the above, we hypothesize, first, that the criterion of coverage, implemented in the Formulex method (Forsyth, 2015b), can be used in addition to the conventional criteria developed to extract lexical bundles from texts, i.e. orthographic length, minimum frequency and distribution range (see Biber et al., 1999). It is also hypothesized that by using the criterion of coverage, under which the sequences of words are mutually exclusive, it will be possible to produce a more refined inventory of non-overlapping, more perceptually salient and structurally complete lexical bundles, which can be treated as distinct multi-word units. Although it is not an explicit goal of this study, the results may also reveal similarities or differences, in terms of the use of particular lexical bundles, between essays written by Polish and Lithuanian students and the bundles most frequently employed by students who are native speakers of English. This pilot study adopts a corpus-driven approach (Tognini-Bonelli, 2001, p. 65), which means that empirical corpus data are used to formulate hypotheses about linguistic features of written English produced by non-native (Polish and Lithuanian EFL learners) as well as native speakers. Two computer programs designed for text analysis were used to obtain and process the research material. Formulib (Forsyth, 2015b), written in Python 3.4, was used to identify contiguous n-grams with the largest coverage in the corpora under scrutiny, and WordSmith Tools 5.0 (Scott, 2008) was used to extract lexical bundles using the three conventional criteria of orthographic length, minimum frequency and distribution range (Biber et al., 1999). Finally, the output of both programs was compared in order to filter the results and identify those lexical bundles that meet all four criteria employed in the study.

Learner corpora
Two learner groups, viz. Lithuanian and Polish EFL learners, are the focus of the present study. To analyze their written English, we used two components of the ICLE corpus (Granger et al., 2009): a corpus of Polish learner English (PICLE) from the first version of ICLE and a corpus of Lithuanian learner English (LICLE), a recent contribution to the ICLE project. Both corpora represent advanced EFL learners, senior undergraduate students majoring in linguistics-based study programmes in Poland and Lithuania. For reference purposes, we used the LOCNESS corpus (CECL), representing academic essays written by British and American students. Table 1 presents more detailed information on the composition of the corpora under scrutiny.

Methodology
First, since the corpora differ in size, all frequencies reported in this paper have been normalized per 100,000 words to allow comparisons across the corpora. This should enable us to identify those sequences that account for the greater proportion of the corpora under scrutiny.
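The normalization is a simple proportion; the following minimal sketch uses made-up counts, not the actual corpus figures from Table 1:

```python
def per_100k(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per 100,000 words."""
    return raw_count * 100_000 / corpus_size

# Illustrative figures only: 24 occurrences in a 240,000-word corpus
print(per_100k(24, 240_000))  # 10.0 occurrences per 100,000 words
```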
Second, using WordSmith Tools 5.0 (Scott, 2008), we automatically generated lists of four-word lexical bundles from PICLE, LICLE and LOCNESS using the conventional extraction criteria with the following parameters: minimum frequency = 5 occurrences per 100,000 words (or 50 per million words), distribution range = 3 % of texts. The decision to focus on four-word lexical bundles has to do with the fact that a number of previous studies of English dealt with four-word sequences, which have been shown to be more semantically and pragmatically salient (Biber et al., 1999; Biber, Conrad, Cortes, 2004; Hyland, 2008a; Hyland, 2008b). Moreover, it is four-word lexical bundles that are traditionally investigated in learner corpora of L2 English (e.g. Chen and Baker, 2010; Ädel and Erman, 2012), so in order to obtain comparable data, we also focused on four-word sequences.
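As an illustration of this extraction step, the sketch below shows how four-word bundles might be filtered by normalized frequency and distribution range. It is a simplified stand-in, not WordSmith Tools' actual procedure, and the function and parameter names are our own:

```python
from collections import Counter, defaultdict

def extract_bundles(texts, n=4, min_freq_per_100k=5, min_range=0.03):
    """Extract n-word bundles meeting frequency and range thresholds.
    `texts` is a list of tokenized texts (lists of lowercase words)."""
    freq = Counter()
    seen_in = defaultdict(set)          # bundle -> indices of texts it occurs in
    total_words = 0
    for i, tokens in enumerate(texts):
        total_words += len(tokens)
        for j in range(len(tokens) - n + 1):
            gram = " ".join(tokens[j:j + n])
            freq[gram] += 1
            seen_in[gram].add(i)
    min_raw = min_freq_per_100k * total_words / 100_000   # frequency cut-off
    min_texts = min_range * len(texts)                    # range cut-off
    return {g: c for g, c in freq.items()
            if c >= min_raw and len(seen_in[g]) >= min_texts}
```

On a toy corpus of four identical five-word texts, the bundle `on the other hand` would pass both thresholds with a count of 4.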
Finally, we compared the output of both programs (Formulib and WordSmith Tools 5.0) in order to identify the so-called 'proper' or 'refined' lexical bundles. By 'proper' we mean bundles that meet the traditional extraction criteria as well as the coverage criterion (0.01 % or higher) established using the Formulib software (Forsyth, 2015a). The study is expected to yield a refined list of lexical bundles, that is, ones that do not overlap with each other and constitute more distinct multi-word units.
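This final comparison amounts to intersecting the two outputs. A minimal sketch, with hypothetical names and coverage values of our own choosing:

```python
def proper_bundles(conventional, coverage, min_cov=0.0001):
    """Keep bundles meeting the conventional criteria AND the coverage
    threshold (0.01 % expressed as the fraction 0.0001)."""
    return {b for b in conventional if coverage.get(b, 0.0) >= min_cov}

# Illustrative coverage values (fractions of all characters):
cov = {"on the other hand": 0.000714, "do not have to": 0.000099}
print(proper_bundles({"on the other hand", "do not have to"}, cov))
# {'on the other hand'}  ('do not have to' falls just below 0.01 %)
```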

N-grams with the greatest currency in the corpora
Using the Formulib software (Forsyth, 2015a), run under Python 3.4, we identified 400 contiguous n-grams, four words long or longer, with the highest coverage of texts in each corpus under study. Importantly, Formulib treats coverage as a binary category, that is, the number of n-grams matching a given text sequence is irrelevant; in other words, the program only takes into account whether a text sequence is covered or not (Forsyth, 2015b, pp. 13–14). For example, if n-grams such as higher education in Lithuania and the quality of higher education (and the quality of, etc.) cover a certain part of the sequence the quality of higher education in Lithuania, each of those seven words is marked as covered once. Based on that, the proportion of covered to uncovered characters for each text sample is calculated and, next, the character coverage for each text category is aggregated (Forsyth, 2015b, pp. 13–14). For the sake of illustration, the ten n-grams with the largest coverage in PICLE are presented in Table 2.
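A simplified sketch of this binary character-coverage computation, reimplementing the idea in our own words rather than reproducing Formulib's actual code:

```python
def char_coverage(tokens, ngrams):
    """Mark each word as covered if any n-gram matches a span containing it;
    a word is marked at most once, however many n-grams match it (binary
    coverage). Return covered characters as a share of all characters."""
    covered = [False] * len(tokens)
    for gram in ngrams:
        g = gram.split()
        for j in range(len(tokens) - len(g) + 1):
            if tokens[j:j + len(g)] == g:
                for k in range(j, j + len(g)):
                    covered[k] = True
    total = sum(len(w) for w in tokens)
    hit = sum(len(w) for w, c in zip(tokens, covered) if c)
    return hit / total if total else 0.0

tokens = "the quality of higher education in lithuania".split()
grams = ["the quality of higher education", "higher education in lithuania"]
print(char_coverage(tokens, grams))  # 1.0 (every word covered exactly once)
```

Note that the two overlapping n-grams both cover higher education, yet those words contribute to the total only once, mirroring the binary treatment described above.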
Apart from recurrent n-grams with the highest coverage, the data reveal that many of the formulaic sequences are in fact fragments of the topics of students' essays. For example, the high coverage of the sequence mass media affect our approach to reality in PICLE, as well as of such sequences as money is the root of all evil (LICLE), perception of the world (LICLE), the question of philosophical optimism (LOCNESS), in le mythe de Sisyphe (LOCNESS), among others, shows that students, both native and non-native speakers, tend to frequently repeat the topic of the essay in their writing assignments. In fact, some LICLE essays were written as responses to long prompts (ca. 100–120 words), which were creatively used by the students in the essays. In contrast, PICLE and, to some extent, LOCNESS essays had considerably shorter topic formulations. This peculiar feature of the research material may inflate the frequencies of certain n-grams that consist of lexical items found in the titles of student essays. That is why the decision was made in this study to weed out those n-grams that are fragments of essay titles. Next, as we ultimately aim to obtain a refined list of four-word lexical bundles, it was decided to remove the n-grams built of more than four words. Finally, as we aim to identify those n-grams that contribute the most to the formulaicity of student essays (or, in other words, have the highest currency or account for the greater proportion of the corpora under scrutiny), the decision was made to focus only on those n-grams with a coverage of 0.01 % or higher.
Using the filtering procedures described above, we eventually obtained a list of 58 4-grams in PICLE, 75 4-grams in LICLE and 25 4-grams in LOCNESS with the highest coverage (that is, more than 0.01 %) in each corpus. The top ten n-grams in each corpus are presented in Tables 3, 4 and 5. The results reveal, among other things, that the sequence on the other hand has the highest coverage in each corpus, which means, for instance, that 0.0714 % of all typed characters in LICLE consist of repetitions of this sequence. Also, the n-gram at the same time is among the top ten by coverage in each corpus. One may also notice a number of similarities between Polish and Lithuanian learners of English, namely, the frequent use of the topic-neutral location marker all over the world, the sequence it is obvious that expressing writers' stance, or the sequence is one of the, which functions as a focusing marker. Furthermore, Polish students and native speakers often use the construction as a result of + 'sth', which is altogether absent from the LICLE data.
Apart from providing insights into recurrent formulas, an additional benefit of employing the Formulex method is that it enables one to specify more precise boundaries between recurrent n-grams, notably overlapping or structurally incomplete ones (Forsyth, 2015b). For example, in 36 instances in PICLE the sequence at the same time was not a fragment of a longer sequence such as and at the same time (which occurs in PICLE 11 times); in fact, the sequence at the same time occurs, in total, 64 times in various patterns in the PICLE corpus, yet it appears as a standalone 4-gram only 36 times. Hence the Formulex method takes into account the fact that "the sequences are mutually exclusive" and that "longer prefabricated phrases [are prevented] from being swamped by the elements of which they are composed of" (Forsyth, 2015b, p. 17); in this way the method enables researchers to delimit the boundaries of formulaic sequences more precisely, which has been one of the main, and still unresolved, problems in research on lexical bundles. In what follows, we will therefore attempt to use both the data and the insights from employing the Formulex method (Forsyth, 2015a), based on the criterion of coverage, in order to refine an inventory of lexical bundles generated in a conventional manner, that is, by using such criteria as orthographic length, minimum frequency and distribution range. Afterwards, we will identify the so-called 'proper' lexical bundles in each corpus, that is, the ones that appear in the output of both the Formulex method and the lexical bundles approach.
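The mutual exclusivity described above can be approximated by counting matches left to right while always preferring the longest candidate, so that a 4-gram inside an already matched longer phrase is not counted again. A sketch of the idea (our own approximation, not Formulib's implementation):

```python
def exclusive_counts(tokens, phrases):
    """Count phrases left to right, longest match first; matched words
    are consumed, so shorter phrases cannot re-match inside them."""
    by_len = sorted(phrases, key=lambda p: -len(p.split()))
    counts = {p: 0 for p in phrases}
    i = 0
    while i < len(tokens):
        for p in by_len:
            g = p.split()
            if tokens[i:i + len(g)] == g:
                counts[p] += 1
                i += len(g)  # consume the whole match
                break
        else:
            i += 1           # no phrase starts here; move on
    return counts

text = "and at the same time we met at the same time".split()
print(exclusive_counts(text, ["and at the same time", "at the same time"]))
# {'and at the same time': 1, 'at the same time': 1}
```

Here the first occurrence of at the same time is absorbed by the longer match and at the same time, just as in the PICLE example above, where only the standalone occurrences of the 4-gram are counted.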

Lexical Bundles in Student Essays
Using WordSmith Tools 5.0 (Scott, 2008), we generated a frequency list of lexical bundles using the traditional criteria (Biber et al., 1999) with the following parameters: orthographic length = 4; min. freq. = 5 occurrences per 100,000 words (or 50 per million words); minimum distribution range = 3 % of texts. Again, due to the specificity of the research material, lexical bundles including words from the essay titles or prompts were excluded from further analyses. As a result, using the criteria described above, we identified 41 lexical bundles in PICLE, 40 lexical bundles in LICLE and 40 lexical bundles in LOCNESS. For the sake of illustration, the top ten lexical bundles (by frequency) in each corpus are presented in Tables 6, 7 and 8. An asterisk (*) marks bundles that meet the criterion of coverage employed in the Formulex method and set in this study at 0.01 % or higher. A full list of lexical bundles is presented in the Appendix to this paper. First, the data reveal that, contrary to the output of Formulex, there are certain overlapping bundles among the ones identified in the conventional manner, e.g. do not have to and they do not have in PICLE, of the most important and the most important thing in LICLE, or the beginning of the and at the beginning of in LOCNESS, among others. This means that the four-word bundles in question are, in fact, fragments of longer sequences of words, and that the conventional approach to the identification of lexical bundles of fixed length (e.g., 4 orthographic words) makes it difficult to identify the boundaries of such multi-word items. This problem does not apply to the list of recurrent sequences generated by the Formulex method, where the sequences of words are mutually exclusive.
Second, the data show that not all lexical bundles identified in the traditional way meet the coverage threshold set in this study (0.01 %), which means that some of the items do not account for the greater proportion of the corpora under scrutiny. For example, the bundle do not have to in PICLE (ranked 6th by frequency; raw frequency of 24 occurrences; normalized frequency of 10 occurrences per 100,000 words; distribution range of 5.75 %, that is, 21 texts) was not found among the 4-grams of greater coverage in the corpus. The reason is that the sequence in question, with a coverage of 0.0099 %, occurs independently, that is, not as a fragment of a longer sequence of words, only 9 times (2.4 times in terms of normalized frequency), which is below even the normalized frequency threshold of 5 occurrences.
Consequently, this bundle and many others (e.g., of the fact that, do not want to, it is better to, they do not have, the fact that the in PICLE) are in fact fragments of longer multi-word constructions and hence should not be treated as complete four-word lexical items.
In view of the above, the comparison of the output of Formulex and the lexical bundles approach resulted in a refined inventory of four-word lexical bundles which meet the criteria of orthographic length, minimum frequency, distribution range and, in addition to this conventional toolkit, coverage, the latter being the criterion applied in the Formulex method. The refined list (Table 9) includes 27 'proper' lexical bundles in PICLE, 26 in LICLE and 20 in LOCNESS.
The refined list of lexical bundles shows that LOCNESS, which represents native speakers of English, contains a smaller number of lexical bundles than the two corpora of non-native learners. This finding, in fact, confirms observations reported in Hyland (2008b), Römer (2009) and Juknevičienė (2009, 2013) that it is less proficient non-native users of language who tend to rely on repetitive sequences to a larger extent than learners of higher proficiency levels or native speakers. One way of accounting for this finding concerns the limited vocabulary range of the less proficient learners, which often means that when writing they tend to fall back on the same recurrent sequences. In comparison to the lists of lexical bundles extracted from the three corpora in the conventional manner, the refined lists contain fewer overlapping n-grams, which are often neighbours or near-neighbours on the frequency list. To give only a few examples, bundles such as but at the same (PICLE), there are a lot, are a lot of, the other hand the, is no need to, there is no need (LICLE) or to the fact that, due to the fact, the fact that they, the beginning of the, at the beginning of (LOCNESS) are fragments of longer multi-word sequences. Hence, the Formulex method may help deal with the problem of 'syntagmatic overlap', the term proposed by Kopaczyk (2013, p. 156) to refer to a situation when a given lexical bundle includes a fragment of the preceding one. Also, a number of inconspicuous multi-word units, which are syntagmatic associations hardly stored as single wholes in the mental lexicon of language users, have been removed from the refined list, e.g. that it is not (in PICLE, LICLE and LOCNESS), it is not the (LICLE), that it is a (LOCNESS). In that respect, one may argue that the Formulex method helps one filter out overlapping or non-perceptually salient lexical bundles identified in the traditional way.

Discussion
However, one may also notice that the refined lists of lexical bundles identified through the application of the Formulex method, which adopts the criterion of coverage, are not devoid of limitations. First of all, there are a number of four-word sequences that contain more complete three-word sequences, e.g. first of all (the), what is more (the) in LICLE. This finding shows that it may be necessary to separately identify n-grams shorter or longer than four words with the largest coverage in the corpora under scrutiny, and then filter the results. Secondly, one may also note that a number of perceptually salient lexical bundles have been removed from the refined list, e.g. it is better to, there is no doubt, seems to be the (expressing stance in PICLE), it is obvious that (expressing stance in LICLE), is for the best, the only way to (expressing stance in LOCNESS). This means that the application of the Formulex method may result in some information loss (as compared with the output of WordSmith Tools 5.0) that needs to be taken into consideration with respect to the scope and goals of a given study.
The aim of the study was to compare two methods of retrieving recurrent word sequences, termed lexical bundles, from a corpus and to propose a more objective approach to data selection. It was found that the Formulex method proposed by Forsyth (2015a) allowed us to filter out a number of overlapping or not perceptually salient lexical bundles. It is thus possible to assume that the application of the Formulex method yields a potentially more useful (pedagogically or otherwise) inventory of distinct multi-word units or, as is the case in this study, provides a complementary insight into the recurrent multi-word units used by native and non-native students of English in their essays.
A pilot study like this one can only be regarded as provisional, however. More research is required to test the effectiveness of the Formulex method (Forsyth, 2015a) as compared with other methods or metrics developed recently to locate utterance boundaries or predict word sequence completion, e.g. a (forward and backward) transitional probability metric, which was tested by Appel and Trofimovich (2015) on a sample of 100 four-word items extracted from the BNC, or the Independence-Formulaicity (IF) score (Pęzik, 2015), which gives more prominence to shorter n-grams that do not overlap with longer ones, as well as to those n-grams that include multiple infrequent words. Also, the application of the Formulex method, in addition to the conventional criteria used to extract lexical bundles from texts, shows that the lexical bundles approach should be treated flexibly rather than strictly in order to provide a more comprehensive description of distinct and perceptually salient recurrent multi-word units in texts.

Table 1
Corpora used in the study