Lexical Word-Class Distributions in Research Articles of Four Subject Areas

This study aimed to determine the lexical word-class distribution in research articles of four subject areas: social sciences, health sciences, physical sciences and life sciences. A total of 5,754,560 tokens or running words were extracted from research articles published by Elsevier for examination. Results show both similarities and differences in distribution across the four subject areas. For health, physical and life sciences, the noun is the most dominant lexical word class, followed by the adjective, verb and adverb. For social sciences, the verb is more dominant than the adjective. This finding reflects that research articles in social sciences use the highest number of words and are more conversational in nature. For types of nouns, singular nouns are used more often than plural nouns in all subject areas; this usage might indicate that research articles tend to focus on a single research object. For types of verbs, research articles in health, life and physical sciences tend to prefer using past tense and past participle forms over others; this usage indicates emphasis on reporting what has been done. In contrast, social sciences research articles show more frequent use of the verbs’ base form, and this usage possibly signifies arguments regarding general truths.


Introduction
Words are the building blocks of texts, and their composition likely differs from one type of text to another, particularly if texts are from different subject areas. Biber, Conrad, & Reppen (2006) assert the importance of differences in word use across subject areas. Differences are found not only in word families but also in word classes. In general, lexical word classes can be classified into four categories: nouns, verbs, adjectives and adverbs. Biber et al. (1999) have shown differences in the distribution of lexical word classes in academic prose, conversation, fiction, newspapers, press reports, official documents, conversations and prepared speeches. In academic prose, Biber et al. (1999) found that the noun was the most frequently used lexical word class, followed by the adjective, the verb and the adverb.
Academic prose can further be classified based on subject areas, and the distribution of lexical word classes in one subject area may differ from that in another. However, to the best of the author's knowledge, no study has compared the distribution of lexical word classes in different subject areas. Such research is necessary because knowing the distribution of lexical word classes can reveal each subject area's typical characteristics. In turn, teachers of academic writing can also create more focused teaching materials.
The present study uses the corpus linguistics approach to examine lexical word classes. Corpus linguistics research has played a significant role in the study of languages since it analyses vast collections of electronic text aided by computer software (Baker, 2010). Through corpus linguistics, large amounts of text data can be effectively and efficiently processed and elaborated on without any salient errors. In vocabulary research, corpus linguistics has been used to plan, control and monitor vocabulary development (Nation, 2001). In teaching English as a foreign language or teaching English to non-native speakers, corpus linguistics is often utilised to select and prepare teaching materials (Walsh, 2010). Results of studies using the corpus linguistics approach have helped teachers decide the number and kind of words they should teach their students (McCarten, 2007). Accordingly, both teachers and learners can acquire the most relevant and frequently used words during their learning process. Furthermore, results of corpus linguistics research are fundamentally important for assessing the abilities of first- and second-language learners (cf. Schmitt, Schmitt & Clapham, 2001; Cameron, 2002; Qian & Schedl, 2004).
Several studies using the corpus linguistics approach have focused on the use of academic vocabulary, for example, the Academic Word List (AWL) by Coxhead (2000); the New Academic Word List by Browne, Culligan, and Phillips (2013); and the Academic Vocabulary List by Gardner and Davies (2014). These studies focus on academic word lists formulated from academic texts in various disciplines, but they do not focus on a particular subject area or scientific field (Liu & Han, 2015). Several studies have also focused on creating word lists from particular subject fields. One example is the Medical Academic Vocabulary List that only examines words in medical texts (Lei & Liu, 2016). Another example is the Chemistry Academic Word List that provides a word list related to chemistry texts (Valipouri & Nassaji, 2013). Martinez, Beck, and Panza (2009) conducted an even more focused study to create a word list based on academic vocabulary from agricultural research articles.
A number of studies have focused on the use of lexical words in different subject areas. However, no study has been conducted on the distribution of lexical word classes across different subject areas. Biber et al. (1999) initiated a similar study, but they focused on differences in the distribution of lexical word classes in different text genres. Their study can be expanded by focusing on a particular genre, that is, academic prose, which can be further categorised into several subject areas. One source of academic prose is the text from research articles, which can be classified into four subject areas: social sciences, health sciences, physical sciences, and life sciences. These categories are based on broad classifications of subject areas currently used in the Scopus database. Consequently, there are two main objectives of the present study. The first is to determine the distributions of lexical word classes in research articles of the four subject areas. The second is to compare the types of nouns and verbs used in each subject area. These objectives will be achieved by using the corpus linguistics approach.

Method
This study focuses on the lexical word-class distribution of typical word types across four broad subject areas: social, health, physical and life sciences. The classification is based on Scopus, the largest database of academic literature (Elsevier, 2016). Scopus has indexed a number of publishers, of whom Elsevier is the publisher with the largest data share. In this study, therefore, data were extracted from research articles published in Elsevier journals indexed in Scopus. Elsevier also categorises subject areas into four. Two are exactly the same as those used in Scopus, i.e., health sciences and life sciences. The other two are quite similar, i.e., Elsevier uses the terms 'social sciences and humanities' and 'physical sciences and engineering', whereas Scopus uses the terms 'social sciences' and 'physical sciences'.
In selecting research articles, four criteria were considered. First, only open access research articles were selected. Open access means that research articles can be accessed and downloaded without charge. This was necessary so that the collected data were copyright free. Second, research articles must be from journals with a 5-year impact factor, meaning research articles need to be published by journals established for more than 5 years and whose articles have also been cited in other research articles for more than 5 years. Third, the research articles must be written in English, the language used in most international publications. Finally, only research articles published during the last 5 years (2011-2015) were chosen to ensure up-to-date data. As shown in Table 1, each of the four subject areas contains sub-disciplines that follow Elsevier's categorisation, which became this study's data source. The representation of each sub-discipline and each journal within each sub-discipline was ensured in the study's corpus. This means that the corpus comprises research articles from all journals in these various sub-disciplines. The corpus was placed on the website corpus.kwary.net for ease of access in data analysis (Kwary, 2018). This website can be freely accessed by anyone interested in conducting research on words used in research articles.
To determine the classification of academic words in each sub-corpus, the software AntWordProfiler (Anthony, 2014) was used to classify all words in a text into the general service list (GSL), the AWL and a list of words not in the GSL and AWL. This study used AntWordProfiler to classify texts in this corpus into words in the GSL and words not in the GSL. Words in the GSL are general words frequently found in a general text (Nation, 2001).
Since the focus was academic text in research articles, general words needed to be excluded from the data, an automatic function of AntWordProfiler, which also provides the details of words' range and frequency. The following list explains the steps involved in collecting data using AntWordProfiler.
1. The files of each sub-corpus in AntWordProfiler are opened.
2. The GSL stop lists, i.e., the lists of general words, are chosen. This means that general words were automatically deleted, leaving only academic words, technical words and other low frequency words. The list produced from this step was still substantially long, so more words needed to be deleted from the list. The words' range and frequency were the next criteria.
3. For the range, only words that occur in at least two-thirds of the number of sub-disciplines were selected, ensuring the selection of only the words used in most of the sub-disciplines. For example, health sciences has four sub-disciplines (see Table 1); thus, a word's range is a minimum of three (four times two-thirds equals 2.67, rounded up to three). Social sciences has six sub-disciplines (see Table 1); thus, a word's range must be a minimum of four (six times two-thirds equals four) for selection.
4. After selecting words based on range, words were selected based on their frequencies. In this case, only words with a frequency of more than 10% of the total number of research articles in the particular subject area were selected. For example, in health sciences, as the corpus had 246 articles (see Table 1), the words selected had a minimum frequency of 25 (i.e., 246 × 10% = 24.6, rounded to 25).
5. The word list produced from the fourth step was quite long. The next technique allocated a score to each word. The score was calculated as follows: (range × 10) + (frequency). All words were then sorted based on their scores. Finally, the top 1,000 words were selected from each sub-corpus.
6. From the top 1,000 words from each subject area, words found in more than one subject area were excluded because this study focused on typical academic words, i.e., words used in only one particular subject area. The number of academic words used only in social sciences, health sciences, life sciences and physical sciences was 329, 323, 273 and 314, respectively.
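The selection steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the AntWordProfiler workflow itself: the function names, the toy word statistics and the decision to treat both thresholds as inclusive minimums rounded up are hypothetical choices made only to mirror the worked examples in steps 3 to 6.

```python
from collections import Counter


def min_range(n_subdisciplines):
    """Two-thirds of the sub-disciplines, rounded up (4 -> 3, 6 -> 4)."""
    return (2 * n_subdisciplines + 2) // 3


def min_frequency(n_articles):
    """10% of the number of articles, rounded up (246 -> 25)."""
    return (n_articles + 9) // 10


def score(word_range, frequency):
    """Score from step 5: (range x 10) + frequency."""
    return word_range * 10 + frequency


def top_words(stats, n_subdisciplines, n_articles, top_n=1000):
    """Filter by range and frequency, then keep the top_n words by score.

    `stats` maps each non-GSL word to a (range, frequency) pair.
    """
    r_min = min_range(n_subdisciplines)
    f_min = min_frequency(n_articles)
    scored = {
        word: score(r, f)
        for word, (r, f) in stats.items()
        if r >= r_min and f >= f_min
    }
    # Highest score first; ties are broken alphabetically (an assumption).
    ranked = sorted(scored, key=lambda word: (-scored[word], word))
    return ranked[:top_n]


def typical_words(top_lists):
    """Step 6: keep only words that appear in exactly one subject area."""
    seen = Counter(w for words in top_lists.values() for w in set(words))
    return {
        area: [w for w in words if seen[w] == 1]
        for area, words in top_lists.items()
    }


# Toy statistics, invented for illustration only.
health = top_words(
    {"renal": (4, 120), "cell": (4, 300), "rare": (2, 500), "few": (4, 10)},
    n_subdisciplines=4,
    n_articles=246,
)
life = top_words(
    {"cell": (3, 200), "enzyme": (3, 90)},
    n_subdisciplines=4,   # hypothetical value
    n_articles=100,       # hypothetical value
)
print(typical_words({"health": health, "life": life}))
```

In the toy data, "cell" survives the range and frequency filters in both areas and is therefore dropped in the final step, which is the behaviour step 6 describes for words shared across subject areas.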
After obtaining typical word types for each subject area, the words were then run through the CLAWS Tagger for English, available at http://ucrel.lancs.ac.uk/claws. This tagger automatically determines the word class of each word type. However, since a word type may have more than one word class, some results from the tagger required further analysis to determine their correct word classes by examining how the words were used in context.

Distribution of Lexical Word Classes across the Four Subject Areas
As stated in the Method section, data collection resulted in the following numbers of typical academic words: 329 for social sciences, 323 for health sciences, 273 for life sciences and 314 for physical sciences. Since these numbers are unequal, percentages needed to be recalculated. The new percentages were used in the discussion on the distribution of word classes (Table 2). As Table 2 shows, the most frequently used lexical word class across the four subject areas is the noun. According to Biber et al. (1999), nouns represent a high density of information. Radford (2009) mentioned that nouns have the semantic property of denoting entities, meaning that the high frequency of nouns in research articles indicates that the articles tend to present dense information and denote a large number of entities. If each subject area is examined, the percentage of nouns is found to be the highest in health sciences (68.73%), followed by life sciences (65.2%), physical sciences (59.9%) and social sciences (58.66%).

Results and Discussion
Based on the percentages in Table 2, in terms of noun usage, research articles in health sciences were close to those in life sciences, and social sciences were close to physical sciences. However, for adjective use, health sciences were close to physical sciences, while social sciences were close to life sciences. For verb and adverb use, health sciences were considerably lower than the other three.
Considering the use of nouns, health sciences research articles were inferred to contain a higher density of information and denote more entities than articles from other subject areas.
On the contrary, research articles in social sciences were found to contain the lowest density of information and entities in comparison with those in other subject areas. This finding possibly indicates that research articles in health sciences present more focused information and entities, whereas those in social sciences use more explanation to present information. For the second most frequent word class, differences appeared between one subject area and the rest, particularly between research articles in social sciences and those in the other three subject areas. For social sciences, the verb was the second most frequent word class, but for the other three subject areas, the adjective held the second position.
The high frequency of adjectives in health, life and physical sciences shows that nouns in these subject areas are usually modified by adjectives. In fact, frequent adjective use can be considered a characteristic of an academic text. The results of this study seem to coincide with those of Biber et al. (1999), who found the adjective to be the second most frequent word class in academic prose. Since research articles can be classified as academic prose, it is reasonable that the adjective holds the second position in the present corpus of research articles.
From the four subject areas, as shown in Table 2, health sciences had the highest percentage of adjectives, i.e., 27.86%. Data show that research articles in health sciences contain unique adjectives that are also typical words in this subject area. Some examples are renal, arterial, endothelial, gestational, myocardial and ocular. In addition, frequent adjective use also shows that nouns in health sciences research articles are often modified to make them more specific. For example, the noun kidney has several adjectives as adjacent collocates or in left-neighbour occurrences (see Fig. 1).

Fig. 1 Adjacent collocates of the noun kidney
Fig. 1 shows that kidney can be modified by the following adjectives: chronic, acute, human, right, abnormal, metanephric and congenital. This is only one example of adjectives' ubiquitous use in health sciences, illustrating how health sciences articles use more adjectives than other subject area articles.
In social sciences, however, the verb holds second position as the most frequent word class used. As shown in Table 2, typical word types in the social sciences are 20.97% verbs, but only 18.24% adjectives. That social sciences uses a smaller proportion of adjectives than other subject areas indicates that social sciences research articles modify nouns relatively infrequently. In comparing this result with that of Biber et al. (1999), the distribution of word classes in social sciences research articles is found to be more similar to that of conversation than that of academic prose. Biber et al. (1999) found that in conversation, the verb is the second most frequent word class, while in academic prose, the adjective is the second most frequent, indicating that social sciences research articles are relatively more conversational.
The high frequency of verbs in social sciences research articles can also be interpreted as a sign that those research articles contain longer explanations. According to the Oxford Learner's Dictionary of Academic English (2014), a verb is defined as 'a word or group of words that expresses an action (such as eat), an event (such as happen) or a state (such as exist)'. In addition, Radford (2009) stated that verbs have the semantic property of denoting actions or events, thus indicating that social sciences research articles tend to contain more explanations about actions and events. Since a verb also comprises the main part of a sentence's predicate, the greater number of verbs may indicate a greater number of sentences. These greater numbers can again be related to the average number of words (8,527) in social sciences research articles being highest, in comparison with the average number of words in the remaining subject areas.
In comparing this study's results with those of Paquot (2010), it is found that the word class distribution of social sciences research articles is quite similar to that of the Academic Keyword List (AKL). In the AKL reported by Paquot (2010), the most frequent word classes, in order of frequency, are nouns, verbs and adjectives; this ordering is similar to the distribution of typical words in social sciences research articles. That the distribution of AKL is similar to that of social sciences but differs from health, life and physical sciences might be because the AKL corpus, as Paquot (2010) mentioned, is skewed towards the humanities and social sciences.
The higher frequency of verbs in social sciences research articles can be related to a verb being mainly used for narration, indicating that social sciences research articles narrate more than health, life and physical sciences articles.
The least frequent word class for all subject areas, shown in Table 2, is the adverb. Indeed, the adverb, which usually modifies a verb, is the least frequent word class in other genres as well (cf. Biber et al., 1999). The present study shows that adverbs rarely modify verbs in research articles, indicating that verbs are used more directly, without any modifiers, because the authors of these articles generally want to communicate information directly.

Distribution of Noun Categories across the Four Subject Areas
Nouns are the most frequently used word class in all genres and, as shown in this study, in all subject areas. The result of the CLAWS tagger for nouns in this study showed three categories of nouns: NN, NN1 and NN2. Based on the information from UCREL CLAWS6 Tagset (http://ucrel.lancs.ac.uk/claws6tags.html), NN denotes a common noun, neutral for number (e.g., sheep, cod, headquarters), NN1 is the tag for a singular common noun (e.g., book, girl) and NN2 tags a plural common noun (e.g., books, girls). Table 3 displays the distribution of noun categories across the four subject areas. As Table 3 shows, NN1 is the most frequently used noun type in all subject areas, indicating that all research articles tend to focus on singular nouns or single objects of research. However, if percentages are compared, social sciences is the odd one out of these four subject areas. NN1 usage in research articles in health, life and physical sciences is more than 70%, whereas its usage in social sciences is only 60.1%. NN2 usage in health, life and physical sciences is less than 30%, while its usage in social sciences reaches 39.9%. Singular and plural noun usage in health sciences and social sciences can also be compared in terms of the frequency of words used in both subject areas. One example is the singular noun informant and its plural informants. In health sciences research articles, informant is used three times, but informants is used only once; thus, the singular noun occupies 75%.
In social sciences, the singular noun informant is used 12 times, while its plural is used 15 times, with the singular noun occupying only 44% (Table 3).
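The informant/informants comparison is a simple share calculation, sketched below. The function name is hypothetical; the counts are the ones quoted in the text.

```python
def singular_share(singular_count, plural_count):
    """Percentage of occurrences of a noun that are in the singular form."""
    total = singular_count + plural_count
    if total == 0:
        raise ValueError("the noun does not occur in the corpus")
    return 100 * singular_count / total


# Counts for informant / informants as reported in the text.
health_share = singular_share(3, 1)     # health sciences
social_share = singular_share(12, 15)   # social sciences
print(round(health_share), round(social_share))
```

The same function can be applied to any singular/plural pair retrieved from a concordancer, which makes it easy to check whether an individual word follows the overall NN1/NN2 pattern of its subject area.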
The high percentage of NN1 in health sciences research articles shows that research in health sciences tends to focus more on single nouns or single objects of research.In contrast, the high percentage of NN2 in research articles in social sciences indicates that research in this field tends to focus more on plural nouns or more than one object of research in comparison with the other three subject areas.In general, however, social sciences still uses NN1 more often than NN2 (i.e., 60.1% against 39.9%).

Distribution of Verb Categories across the Four Subject Areas
In addition to nouns, verbs are also further categorised in this study using the CLAWS tagger into the following categories: VV0, VVI, VVG, VVN, VVD and VVZ. The explanation of these categories is based on information from UCREL CLAWS6 Tagset (http://ucrel.lancs.ac.uk/claws6tags.html). Details are as follows:
This study is concerned with typical words in each subject area. Thus, the verb categories mentioned above needed to be simplified, particularly because the forms of verbs were the same in the study's corpus. Therefore, data for VV0 and VVI were combined into one (VV0/VVI) on the basis of their similar forms. Additionally, the calculations for VVD and VVN were also combined as VVD/VVN because some verbs have the same forms for both categories. Table 4 shows the distribution of verb categories across the four subject areas.
The present study has shown similarities and differences in the distribution of the lexical word classes of typical academic words in social, health, life and physical sciences. In all four subject areas, the noun is the most frequent word class, while the adverb is the least frequent. The noun is particularly high in health sciences research articles, indicating that health sciences research articles present the highest density of information among the four subject areas.
The second and third highest distributions of lexical word classes also reveal differences between the subject areas.In social sciences, the second most used form is the verb and the adjective is the third most used.In the other three subject areas, however, the adjective is the second most used, and the verb is the third most used.This positioning shows that the typical academic words in health, life and physical sciences research articles are akin to academic prose, while typical academic words in social sciences research articles are close to conversation.In addition, as concluded from the use of verbs, research articles in social sciences use longer sentences and longer explanations than do those in the other subject areas.
This study also shows that singular nouns are used more often than plural nouns in all subject areas.Health sciences research articles show the highest percentage of singular nouns among the four subject areas.Research articles in social sciences use plural nouns more frequently than those in the other subject areas.For verbs, the past tense and past participle forms are more commonly found in health, life and physical sciences, while the base form appears more often in social sciences.This situation might be due to the fact that health sciences, life sciences, and physical sciences focus mostly on experiments that have been conducted in the past, while social sciences tend to focus on arguments expected to be true regardless of time.

Table 1
Table 1 summarises the data or the corpus collected for this study.

Table 2
Data in Table 1 reveal that the average number of words in health sciences research articles is lowest, i.e., 4,448 words (1,094,205/246), while the average for social sciences is highest, i.e., 8,527 words (1,040,259/122). These data confirm the conclusion that research articles in health sciences are more straightforward, whereas those in social sciences are wordier.
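The two averages can be reproduced from the token and article counts quoted above. The sketch below uses a hypothetical helper name and rounds to the nearest whole word, as the text appears to do.

```python
def mean_words_per_article(total_tokens, n_articles):
    """Average article length in running words for one sub-corpus."""
    if n_articles <= 0:
        raise ValueError("n_articles must be positive")
    return total_tokens / n_articles


# Token and article counts as quoted in the text.
health_mean = round(mean_words_per_article(1_094_205, 246))
social_mean = round(mean_words_per_article(1_040_259, 122))
print(health_mean, social_mean)
```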

Table 3
The percentage of NN1 usage is the highest in health sciences at 78.38%, while social sciences shows the lowest usage at 60.1%. Percentages are reversed for NN2; social sciences has the highest, at 39.9%, and health sciences is lowest at only 21.62%. This could mean that research articles in health sciences are more concerned with singular nouns than those in social sciences. However, plural nouns are more common in social sciences research articles than in health sciences research articles. Examples of plural nouns in social sciences research articles are firms and transactions. Using the Academic Article Concordancer (corpus.kwary.net), 622 occurrences were noted for firms but only 435 for firm, and 66 occurrences were noted for transactions but only 49 for transaction. Artery and syndrome are examples of singular nouns in health sciences research articles. Data in the Academic Article Concordancer show 221 occurrences for artery but only 34 for arteries and 121 occurrences for syndrome but only 17 for syndromes.

Table 4
As shown in Table 4, the type of verb most commonly used in all subject areas, except social sciences, is VVD/VVN (i.e., past tense and past participle). VVD and VVN are found at 47.5% in physical sciences, 64.87% in life sciences and 77.78% in health sciences, indicating frequent and common use of the past tense and the past participle. Winkler and Metherell (2008) state that in research articles, the past tense indicates study or experimental results that have taken place in the past. Among these three subject areas, health sciences has the highest percentage of VVD/VVN: 77.78% means that more than 7 out of 10 verbs used in health sciences research articles are in the past tense or past participle form. One example is the verb undergo and its past form underwent. Based on concordance lines at corpus.kwary.net, the past form (underwent) is used 152 times, whereas its base form (undergo) is used only 58 times in health sciences research articles. The same verb is used only three times in its past form (underwent) and four times in its base form (undergo) in social sciences research articles. For social sciences, VVD/VVN use is lower than VV0/VVI and VVG use. As shown in Table 4, percentages for VV0/VVI, VVG and VVD/VVN uses are 34.78%, 27.54% and 24.64%, respectively. The frequent use of VV0/VVI suggests that social sciences research articles tend to present universally relevant arguments, making them true then and now (Winkler & Metherell, 2008, p. 113). One example is the use of write and wrote. Based on concordance lines at corpus.kwary.net, the base form (write) is used 58 times, whereas its past form (wrote) is used only 17 times in social sciences research articles. In contrast, research articles in health sciences use write only five times and wrote nine times.
[...] usage in these two subject areas is approximately 11%. The use of VVD/VVN in health sciences and life sciences is also similar, i.e., more than 50%, whereas in the other two subject areas VVD/VVN use is lower than 50%. These percentages indicate a close relationship between health sciences and life sciences and are supported by Rothwell's (2002) statement that health sciences and life sciences are similar because they both focus on the study of living organisms.
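The grouping of verb tags used for Table 4 (VV0 with VVI, and VVD with VVN) can be sketched as follows. This is an illustrative sketch only: the mapping and function names are hypothetical, the short tag glosses in the comments reflect a general reading of the CLAWS tagset and should be checked against the UCREL documentation, and the sample tag sequence is invented rather than taken from the study's data.

```python
from collections import Counter

# Assumed glosses: VV0 base form, VVI infinitive, VVD past tense,
# VVN past participle, VVG -ing participle, VVZ -s form.
TAG_GROUPS = {
    "VV0": "VV0/VVI",
    "VVI": "VV0/VVI",
    "VVD": "VVD/VVN",
    "VVN": "VVD/VVN",
    "VVG": "VVG",
    "VVZ": "VVZ",
}


def verb_group_percentages(tags):
    """Percentage of each combined verb category among lexical-verb tags.

    Tags outside TAG_GROUPS (nouns, adjectives and so on) are ignored.
    """
    groups = Counter(TAG_GROUPS[tag] for tag in tags if tag in TAG_GROUPS)
    total = sum(groups.values())
    if total == 0:
        return {}
    return {group: round(100 * n / total, 2) for group, n in groups.items()}


# Invented tag sequence for one sub-corpus's typical word types.
sample_tags = ["VV0", "VVI", "VVD", "VVN", "VVN", "VVG", "VVZ", "NN1"]
print(verb_group_percentages(sample_tags))
```

Combining the tags before computing percentages, rather than afterwards, keeps each word type counted once per combined category, which matches the rationale given for merging forms that are identical in the corpus.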