The corpus consisted of
Research methodology and reporting expression
Literally: The body consisted of
In 15 Seconds
- Formal way to describe a research dataset of texts.
- Used in academic papers, theses, and technical reports.
- Specifically refers to language data, not people or objects.
- Always uses the preposition 'of' and usually past tense.
Meaning
This phrase is the academic way of saying 'Here is the specific pile of data I studied.' It acts as a boundary for a research project, telling your audience exactly which texts, recordings, or documents provided the evidence for your conclusions. It carries a heavy 'expert' vibe, suggesting that your collection was deliberate, organized, and scientifically sound.
Key Examples
Thesis defense
The corpus consisted of five hundred political speeches delivered between 2010 and 2020.
Tech company report
Our training corpus consisted of anonymized user comments from the last six months.
Linguistics lecture
The corpus consisted of transcribed conversations from local coffee shops.
Cultural Background
In Western academia, 'delimitation' is a sign of honesty. By defining your corpus, you are admitting your research has limits, which is highly respected. Modern researchers often use 'born-digital' corpora, consisting of tweets or blog posts, which has changed the traditional view of what a 'body of work' looks like. In legal history, the 'Corpus Juris Civilis', the body of Roman law compiled under the emperor Justinian, became a foundation of European civil law. Using 'corpus' today still carries that weight of authority and tradition. The 'Brown Corpus' was the first major electronic corpus of American English. It helped make descriptions such as 'the corpus consisted of' routine in linguistic papers.
The 'Of' Rule
Always double-check your preposition. 'Consisted OF' is for a list of items. If you use 'IN', you are talking about an abstract definition.
Avoid Passive Voice
Never say 'The corpus was consisted of.' It is always 'The corpus consisted of.' This is a common error for learners of all levels.
What It Means
Imagine you are trying to prove that people on TikTok use the word slay differently than people on Instagram. You can’t just say 'I saw some videos.' To be a real researcher, you need a 'bucket' of data. That bucket is your corpus. When you say the corpus consisted of, you are formally introducing that bucket to your audience. It is like an artist showing you their palette before they start painting; it defines the limits of what is possible in the study. This phrase is the gold standard for transparency in linguistics, data science, and literature reviews. It tells the reader, 'I didn't just cherry-pick examples; I looked at this specific, finite collection.'
At its heart, the corpus consisted of is a delimiting expression. The word corpus comes from the Latin word for 'body.' So, you are essentially describing the 'body' of your work. It implies a sense of completeness and intentionality. If you say 'I read some books,' it sounds like a hobby. If you say the corpus consisted of twenty-four 19th-century novels, you sound like someone who is about to get a PhD. It carries a vibe of 'controlled observation.' You aren't just looking at the world at large; you've built small digital walls around a specific set of information so you can analyze it without getting distracted by the rest of the internet. It is the verbal equivalent of putting on a lab coat before you start typing.
How To Use It
You will almost always see this followed by a number and a description. The structure is usually: The corpus consisted of + [Quantity] + [Type of Material] + [Source/Timeframe]. For example, The corpus consisted of 500 emails sent within a corporate environment between 2010 and 2015. Notice how specific that is? You can't be vague here. If you are vague, the phrase loses its power. You should use it in the 'Methodology' section of a paper or the 'Data' section of a technical report. It is a past-tense phrase because, by the time you are writing about it, the collection process is usually finished. You are looking back at the 'body' you built. Pro tip: treat it like a recipe list—clear, concise, and measurable. Just don't try to use it to describe your laundry pile, unless you're doing a very weird sociological study on socks.
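For readers who think in code, the slot structure above can be sketched as a tiny Python helper. This is only an illustration of the pattern: the function name describe_corpus and its parameters are hypothetical and do not come from any library.

```python
def describe_corpus(quantity: str, material: str, source: str = "") -> str:
    """Fill the slots: The corpus consisted of + quantity + material + source/timeframe.

    All names here are illustrative assumptions, not an established API.
    """
    parts = ["The corpus consisted of", quantity.strip(), material.strip()]
    # The source/timeframe slot is optional, but adding it makes the sentence more specific.
    if source.strip():
        parts.append(source.strip())
    return " ".join(parts) + "."


print(describe_corpus(
    "500",
    "emails",
    "sent within a corporate environment between 2010 and 2015",
))
```

Leaving out the third slot still produces a grammatical sentence, but the result is vaguer, which is exactly what the phrase is meant to avoid.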
Formality & Register
This phrase is dressed in a three-piece suit. It is high-level academic English (C1/C2). You will find it in peer-reviewed journals, university lectures, and high-end data science whitepapers. You will almost never hear it in a coffee shop unless two linguistics professors are arguing about their latest research. It sits at the peak of the formality mountain. Because it is so formal, using it in a casual text message would make you sound like a robot or a very confused time traveler. However, in the world of AI and Big Data, this register is becoming more common in professional tech settings. When developers talk about training a new large language model, they use this language to explain what data the AI 'ate' during its training phase.
Real-Life Examples
You’ll spot this in the wild on sites like JSTOR or Google Scholar. A linguist might write: The corpus consisted of transcribed interviews from suburban teenagers in London. A historian might say: The corpus consisted of every surviving letter written by Civil War soldiers from Virginia. In the tech world, a blog post might explain: To train our sentiment analysis tool, the corpus consisted of 10 million one-star Yelp reviews. Even in high-end journalism, like a deep-dive investigative piece in The New York Times, you might see it used to describe a massive leak of documents: The corpus consisted of over 2.5 million encrypted files. It’s the phrase people use when the 'pile of stuff' is too big or too important to just call it 'some papers.'
When To Use It
Use this when you are presenting the results of an investigation where you analyzed a specific set of texts or data. It is perfect for a thesis, a capstone project, or a formal business proposal that involves market research. If you’ve spent weeks scraping data from Reddit to see how people talk about cryptocurrency, this is your phrase. It’s also great for literary analysis—if you’re comparing every poem Emily Dickinson ever wrote, the corpus is the correct term for her collected works. It tells your professor or your boss that you have a methodology. It transforms 'I looked at stuff' into 'I conducted an analysis on a curated dataset.' It’s the ultimate 'trust me, I’m an expert' signal.
When NOT To Use It
Do not use this for people. If you interviewed 10 people, you have a sample, not a corpus. A corpus is specifically for 'texts' (which can include recorded speech, but the focus is on the language data). Don't use it for physical objects either—you wouldn't say the corpus consisted of twelve types of rock. That’s a collection or a set. Also, avoid it in casual settings. If your friend asks what you’ve been reading lately, saying the corpus consisted of three mystery novels and a cookbook will definitely result in some weird looks and possibly a decrease in social invitations. It’s too heavy for everyday life. It’s like using a microscope to look at a slice of pizza—technically possible, but totally unnecessary.
Common Mistakes
The biggest trap is the preposition. People often try to say consisted from or consisted in. Neither is correct in this context. It is always consisted of. Another mistake is using the word corpora (the plural) when you only have one collection. Stick to the corpus for a single set. Some people also confuse consisted of with comprised. While similar, comprised doesn't need the of. So, ✗ The corpus comprised of... is a very common error. You should say ✓ The corpus comprised... or ✓ The corpus consisted of... Lastly, make sure your 'consisted' is in the past tense if the study is done. The present tense, The corpus consists of, is fine if you are still building the corpus or it remains available to others, but most of the time you want the past tense.
Common Variations
If the corpus consisted of feels a bit too stiff, you have options. The dataset included is very common in data science and feels slightly more modern. The sample was composed of is better if you are talking about a mix of things. For a very high-level academic feel, you might use The archival material comprised. If you want to sound more active, you could say We analyzed a collection of... In more 'tech-bro' environments, you’ll often hear The training set was made up of. However, none of these quite capture the specific 'linguistic/textual' weight of corpus. If you are talking about words and language, corpus remains the king of the hill. It’s the OG term for a big pile of words.
Real Conversations
Data Scientist: How did we train the new chatbot to handle customer complaints?
Lead Dev: The corpus consisted of three years of anonymized chat logs from the support portal.
Data Scientist: Was it enough data?
Lead Dev: Since the corpus consisted of nearly a million entries, the accuracy is quite high.
Quick FAQ
Is a corpus just a fancy word for a library? Not quite. A library is a place; a corpus is a specific selection of texts used for a single purpose or study.
Can a corpus be made of videos? Yes, but usually it refers to the transcripts or the audio data from those videos.
Is corpora the plural? Yes, it is! If you are comparing two different sets of data, you are looking at corpora.
Is it only for English? No way! You can have a corpus of any language, from Ancient Greek to Klingon.
Does it have to be digital? Traditionally, no: corpora used to be piles of paper. But nowadays a corpus is almost always a digital database. It's much easier to search for 'slay' with a computer than with a highlighter.
Usage Notes
This phrase is strictly for academic, technical, or high-level professional writing. Its biggest 'gotcha' is using it for people (it's for texts only!) or using the wrong preposition. Always stick to 'of' and keep it in the past tense for completed research.
Academic Branding
Using this phrase in a job interview for a data-related role can make you sound very professional and well-educated.
Examples
The corpus consisted of five hundred political speeches delivered between 2010 and 2020.
Defining the scope of academic research.
Our training corpus consisted of anonymized user comments from the last six months.
Used in AI development contexts.
The corpus consisted of transcribed conversations from local coffee shops.
Specifying the source of spoken language data.
✗ The corpus consisted from 100 books. → ✓ The corpus consisted of 100 books.
Common preposition error.
For this sentiment analysis, the corpus consisted of 50,000 viral tweets.
Modern application of research terminology.
The corpus consisted of every editorial published by the newspaper in 1945.
Defining a historical text collection.
The initial corpus consisted mostly of Wikipedia articles, but we added Reddit posts later.
Describing the growth of a dataset.
During finals week, the student's corpus consisted of three energy drink cans and a half-finished essay.
Playful use of a very formal term in a messy situation.
✗ The corpus consisted of 200 participants. → ✓ The sample consisted of 200 participants.
A corpus is for texts, a sample is for people.
Sadly, the corpus consisted of too much biased data to be useful.
Expressing disappointment in research quality.
Test Yourself
Complete the sentence with the correct preposition.
The research corpus consisted ____ 500 hours of audio recordings.
'Consist of' is the standard phrase for describing the components of a corpus.
Complete the dialogue with the most professional phrase.
Professor: 'How did you select your data?' Student: 'Well, _________ every article published in the journal last year.'
This is the most professional way to answer in an academic setting.
Frequently Asked Questions
Can the phrase describe a group of people? Technically, no. A corpus is a collection of *texts* or *data*. If you are talking about people, use 'The sample consisted of' or 'The group was composed of.'
Can I use the present tense, 'the corpus consists of'? Yes, if the corpus still exists and is available for others to use right now. Use the past tense 'consisted' if you are describing what you did in a past study.
What is the plural of 'corpus'? The plural is 'corpora.' For example: 'Both corpora consisted of newspaper articles.'
Is the phrase too formal for a blog post? Yes, unless it's a very technical or academic blog. For a general audience, 'The data I used included...' is much better.
What is the difference between 'consisted of' and 'comprised'? 'Consisted of' is slightly more traditional and always takes 'of.' 'Comprised' is more modern and does *not* take 'of' in the active voice.
Can a corpus contain images or videos? Yes, in modern research (like computer vision), a corpus can consist of images or videos, though 'dataset' is more common in those fields.
Why not simply say 'the corpus was'? 'Consisted of' is more precise. It tells the reader exactly what the components were, whereas 'was' is too general.
Is it used in both British and American English? Yes, it is standard in both British and American academic English.
Can I use it in a business meeting? Only if you are presenting formal research or data analysis. In a regular meeting, it might sound too 'stuffy.'
Should I write 'consists of' or 'consisted of' in a paper? Usually 'consisted of' because you are describing the research you have already performed.
Related Phrases
The dataset comprised (synonym): The collection of data was made up of...
The sample was composed of (similar): The group of participants or items was made of...
Drawn from (builds on): Taken from a specific source.
Consisted in (contrast): To have as an essential feature.