The two tools I have chosen to examine are Google Ngrams and Bookworm. I picked these two tools for several reasons, including their potential applicability in my discipline, as well as the fact that one of the tools I was interested in using, Kaleidoscope, is only compatible with Apple computers. I will admit that in using both these tools I found myself initially uncomfortable, because the potential uses of all the possible tools are so far removed from the type of research and "textual analysis" to which I am accustomed. It was not until reading Rockwell's argument that "we should rethink our tools on a principle of research as disciplined play" (Rockwell, 213) that I began to consider some of the possible ways in which I could use these tools, at least as a form of loose experimentation.
As a discipline, Film Studies tends not to engage - at least as a central focus - in historiography and analysis of written texts about film. Our engagement with written texts is typically limited to the use of theoretical structures and how the analytic work of others can be supported or countered based on our own analysis of certain films. Increasingly, however, there have been some shifts in focus towards examining the ways in which we study, and write about, films. In both Ngrams and Bookworm, it is possible to search for specific terms and track their usage in books over time based on the respective databases of each tool. This would allow for an analysis of what types of words or terminology are used in specific periods, and to track how that changes over time. It was for this reason that I chose to "play" with the two tools in question.
Because both are similar in terms of both interface and purpose, I decided to search and compare a set of specific terms relating to Film Studies in order to explore the ways in which they are used in the collections of both tools. In both Google Ngrams and Bookworm I searched for the terms "film," "movie," and "cinema" and noticed some interesting results. In Ngrams, the term "film" appeared much sooner and its incline is much steeper, but that is to be expected given the multiple definitions assigned to that word. "Cinema" and "movie" appear and begin to climb around the same time, in 1914. Interestingly, both "film" and "movie" begin to decline after the year 2000, while "cinema" remains fairly steady, even increasing marginally in 2013 and 2014. Similar results were found using Bookworm by searching the Open Library database; however, because that database is smaller than Google's, the results are much lower in terms of number and percentage.
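Both viewers plot relative rather than raw frequency: a term's yearly count divided by the total number of tokens published that year, which is what makes results from differently sized databases comparable at all. A minimal sketch of that normalization, using invented counts purely for illustration (these are not real Ngram figures):

```python
from collections import defaultdict

def relative_frequencies(counts, totals):
    """For each term, divide its yearly count by the total tokens
    published that year -- the normalization the Ngram-style viewers plot."""
    freqs = defaultdict(dict)
    for term, by_year in counts.items():
        for year, n in by_year.items():
            freqs[term][year] = n / totals[year]
    return dict(freqs)

# Hypothetical counts, for illustration only (not real Ngram data).
counts = {
    "film":   {1920: 300, 2000: 9000},
    "cinema": {1920: 40,  2000: 1500},
}
totals = {1920: 1_000_000, 2000: 10_000_000}

freqs = relative_frequencies(counts, totals)
print(freqs["film"][1920])  # 0.0003
```

A smaller database (like Open Library's) yields smaller raw counts, but the relative frequencies can still show the same overall shape, which is consistent with the "similar results, lower numbers" observation above.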
Both tools certainly have their advantages: Ngrams and Bookworm both have simple and easy-to-use interfaces. Ngrams specifically benefits from access to Google's expansive archive of digitized books. Bookworm is slightly more useful, in that it has more variability in settings than Ngrams, such as zooming in on specific periods, the option to search in different libraries, and the ability to search your own data by submitting a zip file. However, both still suffer from the main problem that Manovich points out, in that "what you can do with these tools today is quite limited" (Manovich, 470). While the ability to search a massive database like Google's is convenient, it does not allow for more finessed searches, such as making a distinction between fiction and non-fiction books, or even specific genres of literature. As such, while both these tools are interesting to use as a means of written textual analysis, their utility for Film Studies at this point remains in question.
Works cited:
Rockwell, Geoffrey. "What is Text Analysis, Really?" Literary and Linguistic Computing 18.2 (2003): 209-220.
Manovich, Lev. "Trending: The Promises and Challenges of Big Social Data" in Debates in the Digital Humanities. Ed. Matthew K. Gold. Minneapolis: University of Minnesota Press, 2012. 460-475.
This paper will engage in a brief discussion of three text analysis tools: WordSeer (http://wordseer.berkeley.edu/), Juxta (http://juxtacommons.org/), and Google Books Ngram Viewer (https://books.google.com/ngrams). In terms of my own research projects, the most applicable text analysis tool reviewed is WordSeer, although I was unable to experiment with it thoroughly since the newest version, 3.0, is not yet available for use. I viewed a number of the tutorial videos, and of particular note was WordSeer's 'word tree' feature. With WordSeer, the user inputs the data (in the case of the video example, the data set consisted of editorials from The New York Times on the subject of China or Japan since 1980 [approximately 5000 editorials]) and WordSeer automatically builds a word tree from the data set around the most frequently occurring word or a designated search term. The word tree's presentation of preceding and succeeding words, which can be selected for further branch-like expansion all the way to the 'root' text (to continue the metaphor), has powerful implications. I imagine employing such a tool in the analysis of a broad range of texts in my field, looking to narrow in on thematic elements for closer review. Specifically, my research interest in the performance of identity in the process of cultural creation amongst Indigenous people focuses on the incorporation of modern and emerging technology into that process. I could employ WordSeer to conduct a text analysis across a corpus of anthropological and sociological literature, as well as post-colonial, arts, and other work, to gain a sense of what sort of scholarship exists in my specific research area and what connections have emerged so far. From there, a more in-depth review could be undertaken.
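The word tree described above hinges on collecting the words immediately before and after a search term across a corpus. A rough, hypothetical sketch of that one-level collection step (the real WordSeer expands branches recursively and links each branch back to its source text):

```python
from collections import Counter

def word_tree(sentences, term):
    """Collect the words immediately preceding and following `term`,
    with counts -- a one-level approximation of a word tree's branches."""
    before, after = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == term:
                if i > 0:
                    before[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    after[tokens[i + 1]] += 1
    return before, after

# Invented example sentences, standing in for a corpus of editorials.
sentences = [
    "trade with china grew quickly",
    "relations with china cooled",
    "china opened new markets",
]
before, after = word_tree(sentences, "china")
print(after.most_common())
```

Repeating the same collection step on each branch word (e.g. on "grew" within the sentences that contained "china grew") is what produces the expanding tree shape.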
As Jean-Baptiste Michel et al. have pointed out regarding 'culturomics', a text analysis tool of this kind "...extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena...", but of course, the challenge lies in the interpretation (and, I imagine, the closer review) of the results (2011:176, 181).
Perhaps less useful for my field, but a powerful tool indeed, is Juxta. I have yet to conduct a research project in anthropology that would require a close review of two versions of the same text. Nonetheless, I was taken with a number of its features; the user experience in general was straightforward and intuitive (while many other text analysis tools are downright difficult to run). The heatmap interface, with its ability to showcase not only revisions across two versions of the same text but also the nature of those revisions at a glance (addition, deletion, and so on), is a testament to how useful Juxta would be in a comparative analysis of any two versions of the same text. All the same, I could not see how to apply a tool like this in my own research projects, since its fundamental purpose is the analysis of multiple versions of one text.
I explored the Google Ngram Viewer as well, and (like most of Google's products) found it simple in both its employment and purpose. Able to track instances of word and phrase usage over time, up to a maximum 5-gram, the tool plots a simple graph charting frequencies of words and word-sets. I ran what I consider to be anthropologically salient term-searches, in an attempt to graph cultural or linguistic trends in the spirit of Michel et al.'s paper this week (2011). The first searched three 1-grams and one 2-gram, charting usage of common terminology used in the anthropological literature for First Nations people (Indian, Aboriginal, First Nations, Native). The second searched three 1-grams and one 2-gram, this time charting usage of common problematic terminology for the category of the marginalized and often colonized 'other' in anthropological literature (primitive, traditional, pre-contact, and savage). The Ngram Viewer had difficulty with the hyphenated 2-gram in this second search, and I was required to eliminate the hyphen altogether. The searches show what one would imagine: a decline in the frequency of terms which have come to be seen in the discipline as 'problematic', accompanied by a rise in the newer and more fashionable or appropriate terms.
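The 1-grams and 2-grams mentioned above are simply sliding windows over a token sequence, which is also why a hyphenated form like 'pre-contact' can trip up a viewer: the result depends on whether tokenization splits at the hyphen. A minimal sketch of n-gram extraction, assuming simple whitespace tokenization:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "first nations communities in canada".split()
print(ngrams(tokens, 2))
# A tokenizer that splits at hyphens would turn the 1-gram 'pre-contact'
# into the 2-gram ('pre', 'contact') -- hence the search difficulty above.
```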
Despite the predictable findings, I could see the Ngram Viewer being useful in ruling out a taken-for-granted supposition, or for conducting more insightful searches. It is worth noting that an observation like that of Michel et al., in which the use of the term 'The Great War' declined alongside a coincident increase in 'World War' terminology, could mislead researchers into looking for a cultural phenomenon that is in fact more linguistic in nature (2011). Juxta, while powerful, is useful for projects comparing versions of texts over time, something that I have yet to find purposeful in my own discipline of anthropology. WordSeer's word tree function, and its ability to potentially identify common threads or themes across a corpus of texts while linking back to the root text with only a few clicks, piques the most interest in me regarding potential employment in research projects.
Works Cited
Michel, Jean-Baptiste, et al. "Quantitative Analysis of Culture Using Millions of Digitized Books," Science Vol. 331, 176 (14 January 2011).
The first textual analysis tool explored was "List Words - HTML (TAPoRware)" by Geoffrey Rockwell. Initial interest in this tool derived from Rockwell's "What is Text Analysis, Really?" (2003), which explores the importance of developing tools that create new possibilities of interpretation by linking theory of texts and analysis in practical, everyday applications. The essay explored was John Perry Barlow's (1996) "A Declaration of the Independence of Cyberspace". Curiosity, however, quickly turned into disappointment as the tool reported "certificate verify failed" (i.e. does not compute). The cause is that the tool cannot read pages served over the HTTPS protocol. Given the popularity and widespread distribution of this particular text online, another copy was located and "submitted" for textual analysis. Disappointment subsequently ignited a spirit of inquiry as the prototype carried out its delegated role. Certainly, TAPoRware is a useful tool for measuring words according to their frequency in a given text written in HyperText Markup Language. TAPoRware's quantitative approach may help facilitate initial research inquiries by acting as a starting point for textual analysis. This tool can be used across disciplines in grounded theory methodology, where a researcher is attempting to create categories and establish a "relationship between theory of texts and analysis" (Rockwell 2003:217). Nonetheless, given its inability to read HTTPS websites, its design has a significant drawback. For instance, a research project that utilizes publicly available information on security intelligence websites, which often employ HTTPS, would be hindered from data collection with this tool. Alternatively, the developer offers plain text and XML tools that would overcome this barrier, as a prospective researcher could copy and paste segments of text. This is perhaps an insignificant obstacle that can be overcome.
Nevertheless, it reflects the need to further develop this prototype to enhance usability and to support the HTTPS communication protocol.
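The core of a word-listing tool like the one described above can be approximated in a few lines: strip the HTML tags, then count the remaining words. This is only a sketch of the general technique using Python's standard library, not TAPoRware's actual implementation, and the sample markup is invented:

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate the text content of an HTML document, skipping tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def word_frequencies(html):
    """List words in an HTML page by frequency, ignoring markup."""
    extractor = TextExtractor()
    extractor.feed(html)
    words = re.findall(r"[a-z']+", " ".join(extractor.chunks).lower())
    return Counter(words)

html = "<html><body><p>Governments of the Industrial World, " \
       "you weary giants of flesh and steel</p></body></html>"
print(word_frequencies(html).most_common(3))
```

Working from a locally saved copy of the page like this also sidesteps the HTTPS limitation entirely, which is essentially what the copy-and-paste plain text workaround amounts to.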
The second textual analysis tool utilized was Bookworm's "ChronAm", an open-source collection of American historical newspaper pages from 1866 to 1922. A search was conducted across this period for the keywords 'war' and 'buy'. Within seconds, a colorful line graph emerged to reveal spikes in war activity and a gradual rise in newspaper articles for the latter search word. Sharp increases for the keyword 'war' coincided with the Navajo Wars, the American Civil War, the Great Sioux War of 1876, the Spanish-American War, and the First World War (WWI). The overlap between the two search words during WWI was particularly interesting. Upon further investigation, it appears there was a significant number of advertisements from the federal government encouraging the public to purchase "War Savings Stamps" and "Liberty Bonds" to help finance military efforts. This tool highlights the variegated uses of big data repositories. It also reveals how the Internet has "allowed for a new common infrastructure for accessing textual information" (Rockwell 2003:214). This tool reiterates Lev Manovich's (2011) call, highlighting the importance of employing social computing techniques for social sciences and humanities research. However, no reading this week demonstrated how crowdsourcing might be useful for annotating newspaper articles. Further, each article lacked arguments pertaining to the need for crowdfunding open-source textual analysis applications to improve similar technologies (see Tanya Clement et al. 2008). Such an effort is imperative, as Rockwell (2003) discusses how the private sector and open-source community rarely develop tools designed with academic "research practices" in mind (214).
Works Cited:
Barlow, John Perry. 1996. "A Declaration of the Independence of Cyberspace, February 8, 1996."
Clement, Tanya, Sara Steger, John Unsworth, and Kirsten Uszkalo. 2008. "How Not to Read a Million Books." Harvard University, Cambridge, MA.
Manovich, Lev. 2011. "Trending: The Promises and the Challenges of Big Social Data" in Debates in the Digital Humanities: 460-75.
Rockwell, Geoffrey. 2003. "What is Text Analysis, Really?" Literary and Linguistic Computing, Vol. 18, No. 2: 209-220.
My perception of textual analysis tools has been largely influenced by my previous experience as a translator between several languages. Modern translation relies upon translation memory software, such as Trados, Transit, DejaVu, etc., whose principles and interface have a lot in common with the textual analysis tools proposed for discussion. All of the tools in question split texts into logical segments and analyse them statistically, and some of them offer even broader options, such as searching for collocations (frequently occurring combinations of words) and concordance (regular agreement between words, which is narrower than simple collocation).
Out of the tools analysed, WordSeer seems to offer the largest choice of possibilities; although it may be slightly poorer in functionality than professional translation tools, it is very comprehensive, easy for beginners to understand, and includes the best visualisation options. In second place I would put Kaleidoscope. I also considered some other text analysis tools not listed in the syllabus, such as TAPoR, TextARC, and Textalyser, but they do not seem to surpass WordSeer. What's more, a lot of linguistic corpora, such as COCA, have user interfaces much resembling those tools.
So far, the weakest side of these tools is that they analyse text by words, not by morphemes. I would not mind if a program revealed a "pseudo-morpheme", a combination of characters that occurs often in words and may be misinterpreted as a meaningful part (such as the final "all" in words like "ball", "call" or "small"); eventually, further analysis of overlaps between those word chunks, their statistics, and their collocations would grade them by the measure of their "morphemic legality": in the given case, "all" at the end of a word in no way correlates with its position in a sentence or agreement with other words, while the final "ing" does; hence, the first is not a morpheme and the second is.
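One crude proxy for this kind of "morphemic legality" is to check, for a candidate suffix, how often stripping it leaves another attested word: "walking" minus "ing" yields "walk", while "ball" minus "all" yields nothing meaningful. A toy sketch of that heuristic (the tiny vocabulary and the scoring are illustrative assumptions, not a real morphological analyzer, which would also need the distributional evidence described above):

```python
def suffix_evidence(vocab, suffix):
    """For a candidate suffix, count how many words ending in it leave
    another attested word when the suffix is stripped -- a crude proxy
    for whether the suffix is a real morpheme or a pseudo-morpheme."""
    vocab = set(vocab)
    candidates = [w for w in vocab
                  if w.endswith(suffix) and len(w) > len(suffix)]
    attested = [w for w in candidates if w[:-len(suffix)] in vocab]
    return len(attested), len(candidates)

# A toy vocabulary, chosen to mirror the examples in the text.
vocab = ["ball", "call", "small", "walk", "walking", "talk", "talking", "sing"]
print(suffix_evidence(vocab, "all"))  # (0, 3): 'b', 'c', 'sm' are not words
print(suffix_evidence(vocab, "ing"))  # (2, 3): 'walk', 'talk' are; 'sing' is noise
```

The "sing" case shows why frequency alone is not enough and the collocational checks the author proposes would still be needed.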
However, as one can see from the suggested readings, text analysis tools are not purely linguistic tools and are designed for a much broader audience than just linguists. Their task is to help comprehend texts, to extract as much information as possible from them, to accelerate text analysis, and even to represent the data found as diagrams or charts. Or at least that is what they are meant to do. Clement et al. (2008) describe purely linguistic methods of text analysis, and so do Michel et al. (2011). What they describe is nothing other than classical corpus linguistics, and I wonder if analysis can go any further. What these tools add to linguistic analysis is better visualization of results, which makes them easily comprehensible. Indeed, when words are represented in different sizes and colours, it is easier to get an idea of their frequency in the text than by looking at tables and charts. It is like the GUI of operating systems that gradually supplanted the command-line interface and made the computer a common tool instead of a tool for geeks. Text analysis tools may in the same way be "simpler" than professional linguistic tools, but they make linguistic analysis accessible to non-experts.
Works cited:
Michel, Jean-Baptiste, et al. "Quantitative Analysis of Culture Using Millions of Digitized Books," Science Vol. 331, 176 (14 January 2011).
Clement, Tanya, et al. "How Not to Read a Million Books." http://people.lis.illinois.edu/~unsworth/hownot2read.html.
Rockwell, Geoffrey. "What is Text Analysis, Really?" Literary and Linguistic Computing 18.2 (2003): 209-220.
The Historian's Macroscope: Big Digital History. http://www.themacroscope.org/?page_id=113.
Manovich, Lev. "Trending: The Promises and Challenges of Big Social Data," in Gold, ed. http://www.manovich.net/DOCS/Manovich_trending_paper.pdf
One of the tools I found interesting was Google Ngram. In applied linguistics and discourse studies, a growing area of research is the study of formulaic language. Studies in formulaic language attempt to digitally archive large corpora of words and phrases found in didactic materials used for teaching English for Specific Purposes (ESP). An example of an ESP course would be Business English. The challenge for ESP teachers and linguists is to identify which vocabulary would be most important for students who learn English as a second language for work purposes (in this case, business professionals). Formulaic researchers can conduct studies by scanning hundreds of Business English textbooks and then archiving the most commonly used words or phrases. This is time consuming, and because the field is rather new, much work has yet to be done. It should be noted that using a corpus of didactic materials has limitations in itself, and if a teacher wants to investigate how key words or phrases compare in usage as a trend in non-ESP published books over time, a great tool is Google Ngrams. Ngram is very easy to use and has a corpus of word frequencies from the years 1800 to 2000. Using the example of an ESP business course, a term such as entrepreneur may be used moderately in this genre of didactic materials, but may in fact be used much more frequently on a generic basis. The easiest way for a teacher to find out would be to "Ngram it" and compare the results. In addition, when ESP teachers compile word lists to help their students, if a situation were to arise where the teacher wants to narrow down his/her list, a comparison of which words are more frequently read today (as opposed to, say, 10 or 20 years ago, which may also be important since didactic materials are constantly being revised) can be made. This is also useful for deciding which lexical categories a teacher may prefer to promote.
For example, teaching the word entrepreneur as a new lexical item to an ESP business student would be much more beneficial than introducing it in a different lexical category such as entrepreneurial. Since the Ngram indicates that the noun form dominates in usage, the likelihood of the student reading the adjective form is lower, and so we might assume that less exposure to a term means less retention of it. This tool could also be used in assessing the writing of ESP students, evaluating their ability to use less common (or perhaps more sophisticated) vocabulary. The uses of this tool are many, but for me what is significant is how Google Ngram can impact the pedagogy of language teaching. Perhaps future studies in archiving formulaic language corpora in ESP texts will expand to include new variables in the research, where the Ngram tool can be incorporated to produce new findings for formulaic language studies.
The text analysis tools selected are Voyant and AntConc. These tools were mentioned on Shawn Graham's website "The Historian's Macroscope: Big Digital History." Initially I wanted to conduct a text analysis of J. Granatstein's Who Killed Canadian History? and Ian McKay and Jamie Swift's Warrior Nation. However, there is no Google Books preview and neither monograph is available in e-book format. This demonstrates the lack of historical books published online. It also stresses the importance of the Universal Books project, which aimed to digitize a million books, an initiative achieved in 2007 in large part because of Google Books. With the exception of Clement, this week's authors did not emphasize the lack of books published online as the main obstacle to conducting text analysis. As an alternative to analyzing the monographs themselves, I used book reviews and analyzed them in AntConc and Voyant. Voyant is a text mining and visualization tool and is useful for comparing different word trends and frequencies. When both book reviews were analyzed using Voyant, they shared similar word frequencies after I removed the stop words. In both texts the words with the highest frequencies were "military," "Canadian history," and "war." The keyword-in-context tool is useful for historians in assessing the arguments of a monograph or conducting a traditional book review. In this instance, when the viewer clicks on a term such as "history," they can use the keyword-in-context tool to retrieve how each author uses history to support their opposing arguments. Graham's website provides detailed instructions on how to use Voyant and even labels it "the best textual portal for historians in existence." One of the main problems is exporting the entire set of findings for the viewer.
Although the URL export tool is useful for sharing individual findings, such as specific frequency charts, it is impossible to get a URL that will export all of the data as displayed on the screen. This limitation is overlooked in Graham's description. Since historians depend on providing sources for their findings, it would be beneficial if you could copy and paste a URL that showed all of the findings. Second, I used another tool mentioned on Graham's website, AntConc. A disadvantage of AntConc that Graham overlooks is that, in contrast to Voyant, it does not accept PDF files. As a result, all files have to be converted to text files. The tool is less visually appealing than Voyant but is still useful for comparing two works for a book review. One of its disadvantages is that you cannot remove stop words. Consequently, words such as "the," "of," and "and" appear with the highest frequencies. The main advantage of AntConc is the word clusters tool, which allows you to quickly see how one word is used in relation to another. This could also be used in this context to draw comparisons between the authors' arguments. Although text mining is a useful tool in historical analysis, it should not replace historical research. This was not addressed in the readings from this week. As a former Teaching Assistant in a course where Voyant was used by undergraduate students, I was able to observe the challenges historians face when using the tool for the first time. The completed assignments revealed that students were able to use the tool to draw comparisons between readings. However, they had greater difficulty using Voyant to address how each article offered opposing arguments. Thus, text mining should be conducted in addition to traditional research.
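The stop-word gap between the two tools comes down to a single filtering step before frequencies are ranked. A small sketch of the difference, using an invented sample review and a deliberately minimal stop-word list (real tools ship lists of hundreds of words):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that"}  # minimal, illustrative

def top_words(text, n=3, remove_stop_words=True):
    """Rank word frequencies, optionally filtering stop words first --
    the step a Voyant-style tool performs and an AntConc-style list omits."""
    words = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return Counter(words).most_common(n)

review = ("The history of the Canadian military is central to the war, "
          "and the war shaped Canadian history.")
print(top_words(review, remove_stop_words=False)[0])  # ('the', 4)
print(top_words(review))  # content words like 'history' and 'war' surface
```

Without the filter, function words dominate the ranking exactly as described above; with it, the content words that actually distinguish the two reviews rise to the top.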
As a discipline, Film Studies tends not to engage - at least as a central focus - in historiography and analysis of written texts about film. Our engagement with written texts are typically limited to the use of theoretical structures and how the analytic work of others can be supported or countered based on our own analysis of certain films. Increasingly, however, there have been some shifts in focus towards examining the ways in which we study, and write about, films. In both Ngrams and Bookworm, it is possible to search for specific terms and track their usage in books over time based on the respective databases of each tool. This would allow for an analysis of what types of words or terminology are used in specific periods, and to track how that changes over time. It was for this reason that I chose to "play" with the two tools in question.
Because both are similar in terms of both interface and purpose, I decided to search and compare a set of specific terms relating to Film Studies in order to explore the ways in which they are used in the collections of both tools. In both Google Ngrams and Bookworm I searched for the terms "film," "movie," and "cinema" and noticed some interesting results. In Ngrams, the term "film" appeared much sooner and its incline is much steeper, but that it to be expected given the multiple definitions assigned to that word. "Cinema" and "movie" appear and begin to climb around the same time, in 1914. Interestingly, both "film" and "movie" begin to decline after the year 2000, while "cinema" remains fairly steady, even increasing marginally in 2013 and 2014. Similar results were found using Bookworm by searching the Open Library database, however because that database is smaller than that of Google, the results are much lower in terms of number and percentage
Both tools certainly have their advantages: Ngrams and Bookworm both have simple and easy-to-use interfaces. Ngrams specifically benefits from access to Google's expansive archive of digitized books. Bookworm is slightly more useful, in that it has more variability in settings than Ngrams, such as zooming in on specific periods, the option to search in different libraries, and the ability to search your own data by submitting a zip file. However, both still suffer from the main problem that Manovich points out, in that "what you can do with these tools today is quite limited" (Manovich, 470). While the ability to search a massive database like Google's is convenient, it does not allow for more finessed searches, such as making a distinction between fiction and non-fiction books, or even specific genres of literature. As such, while both these tools are interesting to use as a means of written textual analysis, their utility for Film Studies at this point remains in question.
Works cited:
Rockwell, Geoffrey. "What is Text Analysis, Really?" Literary and Linguistic Computing 18.2 (2003): 209-220.
Manovich, Lev. "Trending: The Promises and Challenges of Big Social Data" in Debates in the Digital Humanities. Ed. Matthew K. Gold. Minneapolis: University of Minnesota Press, 2012. 460-475.
Perhaps less useful for my field, but a powerful tool indeed is Juxta. I have yet to conduct a research project in anthropology that would require a close review of two versions of the same text. Nonetheless, I was taken with a number of its features; the user experience in general was straightforward and intuitive (while many other text analysis tools are downright difficult to run). The heatmap interface, with its ability to showcase not only revisions across two versions of the same text, but the nature of those revisions at a glance (addition, deletion, so on) is a testament to how useful Juxta would be in a comparative analysis of any two versions of the same text. All the same, I could not see how to apply a tool like this in my own research projects, due to its fundamental purpose being for analysis of multiple versions of one text.
I explored the Google Ngram viewer as well, and (like most of Google's products) found it simple in both its employment and purpose. Able to track instances of word and phrase usage over time, up to a maximum 5-gram, the tool plots a simple graph charting frequencies of words and word-sets. I ran what I consider to be anthropologically salient term-searches, in an attempt to graph cultural or linguistic trends in the spirit of Michel et al.'s paper this week (2011). The first searched three 1-grams, and one 2-gram, charting usage of common terminology used in the anthropological literature for First Nations people (Indian, Aboriginal, First Nations, Native). The second searched three 1-grams and one 2-gram, this time charting usage of common problematic terminology for the category of the marginalized and often colonized 'other' in anthropological literature (primitive, traditional, pre-contact, and savage). The Ngram Viewer had difficulty with a hyphenated 2-gram in this second search, and I was required to eliminate the hyphen altogether. The searches show what one would imagine: the decline of the frequency of terms which have come to be seen in the discipline as 'problematic', accompanied with an incline in the new and more fashionable or appropriate terms.
Despite the predictable findings, I could see the Ngram Viewer being useful in ruling out a taken for granted supposition, or for conducting more insightful searches. It is worth noting that an observation like that of Michel et al., in which the use of the term 'The Great War' declined along with a co-incidence of the increase in 'World War' terminology, could mislead researchers into looking for a cultural phenomenon that is in fact more linguistic in nature (2011). Juxta, while powerful, is useful for projects comparing versions of texts over time, something that I have yet to find purposeful in my own discipline of anthropology. WordSeer's word tree function and ability to potentially identify common threads or themes across a corpus of texts, linking back to the root text with only a few clicks, piques the most interest in me regarding potential employment in research projects.
Works Cited
Michel, Jean Baptiste et al. "Quantitative Analysis of Culture Using Millions of Digitized Books," Science Vol. 331, 176 (14 January 2011).
The second textual analysis tool utilized was Bookworm's "ChronAm", an open-source collection of American historical newspaper pages from 1866 to 1922. A search was conducted for all articles in this period for the keywords 'war' and 'buy'. Within seconds, a colorful line graph emerged to reveal spikes in war activity and a gradual rise in newspaper articles for the latter search word. Sharp increases for the keyword 'war' coincided with the Navajo Wars, American Civil War, Great Sioux War of 1876, Spanish-American War, and the First World War (WWI). Overlap between both search words during WWI was particularly interesting. Upon further investigation, it appears there was a significant amount of advertisements from the federal government to the public encouraging purchase of "War Savings Stamps" and "Liberty Bonds" to help finance military efforts. This tool highlights variegated uses for big data repositories. It also reveals how the Internet has "allowed for a new common infrastructure for accessing textual information" (Rockwell 2003:214). This tool reiterates Lev Manovich's (2011) call, highlighting the importance of employing social computing techniques for social sciences and humanities research. However, no reading this week demonstrated how crowdsourcing might be useful for annotating newspaper articles. Further, each article lacked arguments pertaining to the need for crowdfunding open-source textual analysis applications to improve similar technologies (see Tanya Clement et al. 2009). The importance of such an effort is imperative, as Rockwell (2003) discusses how the private sector and open-source community rarely develop tools designed for academic "research practices" in mind (214).
Works Cited:
Barlow, John Perry. 1996. "A Declaration of the Independence of Cyberspace, February 8, 1996."
Clement, Tanya, Sara Steger, John Unsworth, and Kirsten Uszkalo. 2008. "How Not to Read a Million Books." Harvard University, Cambridge, MA.
Manovich, Lev. 2011. "Trending: The Promises and the Challenges of Big Social Data." In Debates in the Digital Humanities, 460-75.
Rockwell, Geoffrey. 2003. "What is Text Analysis, Really?", Literary and Linguistic Computing, Vol. 18, No. 2:209-219.
Of the tools analysed, WordSeer seems to offer the widest range of possibilities; although its functionality may fall slightly short of professional translation tools, it is very comprehensive, easy for beginner users to understand, and includes the best visualisation options. In second place I would put Kaleidoscope. I also considered some text analysis tools not listed in the syllabus, such as TAPoR, TextArc, and Textalyser, but they do not seem to surpass WordSeer. What's more, many linguistic corpora, such as COCA, have user interfaces that closely resemble those tools.
So far, the weakest aspect of these tools is that they analyse text by words, not by morphemes. I would not mind if a program revealed a "pseudo-morpheme," a combination of characters that occurs often in words and may be misinterpreted as a meaningful part (such as the final "all" in words like "ball," "call," or "small"). Eventually, further analysis of the overlaps between these word chunks, their statistics, and their collocations could grade them by their "morphemic legality": in the given case, the final "all" in no way correlates with a word's position in a sentence or its agreement with other words, while the final "ing" does; hence, the first is not a morpheme and the second is.
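A crude first pass at this idea can be sketched in a few lines: rank word-final character sequences by how many distinct words they terminate. This is only the counting stage, so by design it still conflates a real morpheme like "-ing" with a pseudo-morpheme like "-all" (the word list and function name are my own illustration, not part of any existing tool):

```python
from collections import Counter

def candidate_suffixes(words, length=3, min_count=2):
    """Count word-final character sequences across distinct words.
    This stage cannot yet tell a true morpheme ('-ing') apart from
    a pseudo-morpheme ('-all' in 'ball', 'call', 'small'); the
    syntactic correlation test would have to come afterwards."""
    endings = Counter(w[-length:] for w in set(words) if len(w) > length)
    return [(s, n) for s, n in endings.most_common() if n >= min_count]

words = ["ball", "call", "small", "walking", "talking", "singing", "cat"]
print(candidate_suffixes(words))
```

The filtering step the post proposes - checking whether an ending correlates with sentence position or agreement - would then be applied to these candidates to separate morphemes from accidental overlaps.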
However, as one can see from the suggested readings, text analysis tools are not purely linguistic tools and are designed for a much broader audience than just linguists. Their task is to help comprehend texts, to extract as much information as possible from them, to accelerate text analysis, and even to represent the resulting data as diagrams or charts. Or at least that is what they are meant to do. Clement et al. (2008) describe purely linguistic methods of text analysis, and so do Michel et al. (2011). What they describe is nothing other than classical corpus linguistics, and I wonder whether the analysis can go any further. What these tools add to linguistic analysis is better visualization of results, which makes them easily comprehensible. Indeed, when words are represented in different sizes and colours, it is easier to grasp their frequency in a text than by looking at tables and charts. It is like the GUI of operating systems that gradually supplanted the command-line interface and made the computer a common tool instead of a tool for geeks. Text analysis tools may in the same way be "simpler" than professional linguistic tools, but they make linguistic analysis accessible to non-experts.
Works Cited
Michel, Jean-Baptiste, et al. "Quantitative Analysis of Culture Using Millions of Digitized Books," Science Vol. 331 (14 January 2011): 176.
Clement, Tanya, et al. "How Not to Read a Million Books." http://people.lis.illinois.edu/~unsworth/hownot2read.html.
Rockwell, Geoffrey. "What is Text Analysis, Really?," Literary and Linguistic Computing 18.2 (2003): 209-220.
The Historian's Macroscope: Big Digital History. http://www.themacroscope.org/?page_id=113.
Manovich, Lev. "Trending: The Promises and Challenges of Big Social Data," in Gold, ed. http://www.manovich.net/DOCS/Manovich_trending_paper.pdf.
Ngram is very easy to use and has a corpus of word frequencies from the years 1800 to 2000. Using the example of an ESP business course, a term such as entrepreneur may appear only moderately often in this genre of didactic materials but may in fact be used much more frequently in general usage. The easiest way for a teacher to find out would be to "Ngram it" and compare the results. In addition, when ESP teachers compile word lists to help their students, and a situation arises where the teacher wants to narrow down the list, they can compare which words are read more frequently today than, say, 10 or 20 years ago - a comparison that also matters because didactic materials are constantly being revised. This is also useful for deciding which lexical categories a teacher may prefer to promote. For example, teaching the word entrepreneur as a new lexical item would be much more beneficial to an ESP business student than introducing it in a different lexical category, such as the adjective entrepreneurial. Since the Ngram indicates that the noun form dominates in usage, the likelihood of the student reading the adjective form is lower, and so we might assume that less exposure to a term means less retention of it. This tool could also be used for assessing the writing of ESP students, evaluating their ability to use less common (or perhaps more sophisticated) vocabulary. The uses of this tool are many, but for me what is significant is how Google Ngram can impact the pedagogy of language teaching. Perhaps future studies in archiving formulaic language corpora in ESP texts will expand to include new variables where the Ngram tool can be incorporated to produce new findings for formulaic language studies.
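A teacher could run the same noun-versus-adjective comparison on a local corpus of course materials rather than Google's books. A minimal sketch, assuming a hypothetical snippet of ESP text (the sample sentences and function name are illustrative, not from any real textbook):

```python
import re
from collections import Counter

def relative_frequency(corpus_text, *terms):
    """Frequency of each term per 10,000 tokens - loosely analogous
    to the percentage-of-corpus figure the Ngram Viewer reports."""
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {t: 10_000 * counts[t] / total for t in terms}

# Hypothetical snippet of ESP business course material.
text = ("The entrepreneur launched a venture. An entrepreneur takes risks. "
        "Her entrepreneurial spirit impressed the entrepreneur's investors.")
freqs = relative_frequency(text, "entrepreneur", "entrepreneurial")
print(freqs)
```

On a large enough corpus of didactic materials, the same comparison would show whether the noun form really dominates in the texts students actually read, as the Ngram result suggests it does in general usage.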
Voyant is a text mining and visualization tool and is useful for comparing word trends and frequencies. When both book reviews were analyzed using Voyant, the reviews shared similar word frequencies after I removed the stop words. In both texts the words with the highest frequencies were "military," "Canadian history," and "war." The keyword-in-context tool is useful for historians in assessing the arguments of a monograph or conducting a traditional book review. In this instance, when viewers click on a term such as "history," they can use the keyword-in-context tool to retrieve how each author uses "history" to support their opposing arguments.
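The keyword-in-context display is simple enough to reproduce outside Voyant. A minimal sketch, assuming plain-text input (the sample sentence is invented for illustration):

```python
import re

def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of
    context on either side, like a keyword-in-context pane."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

review = "This history of Canadian military policy reframes the history of the war."
for left, kw, right in kwic(review, "history"):
    print(f"{left} [{kw}] {right}")
```

Running this over each review and comparing the contexts around a shared term is exactly the move the post describes for surfacing the authors' opposing arguments.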
Graham's website provides detailed instructions for using the Voyant tool and even labels it "the best textual portal for historians in existence." One of the main problems is exporting the complete findings for the viewer. Although the URL export tool is useful for sharing individual findings, such as specific frequency charts, it is impossible to get a URL that exports all of the data as displayed on the screen. This limitation is overlooked in Graham's description. Since historians depend on providing sources for their findings, it would be beneficial if you could copy and paste a URL that showed all of the findings.
Second, I used another tool mentioned on Graham's website, AntConc. A disadvantage of AntConc that is overlooked by Graham is that, in contrast to Voyant, it does not accept PDF files; as a result, all files have to be converted to text files. The tool is less visually appealing than Voyant but is still useful for comparing two works in a book review. One of its disadvantages is that you cannot remove the stop words; consequently, words such as "the," "of," and "and" appear with the highest frequencies. The main advantage of AntConc is the word clusters tool, which allows you to quickly see how one word is used in relation to another. This could also be used in this context to draw comparisons between the authors' arguments.
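The missing stop-word step can be bolted on before counting. A minimal sketch, assuming the files have already been converted from PDF to plain text, and using a small illustrative stop list rather than a standard one:

```python
import re
from collections import Counter

# Small illustrative stop list; a real project would use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}

def top_words(text, n=5):
    """Word frequencies with stop words removed - the filtering
    step the post notes is missing from AntConc's default view."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample sentence standing in for a converted review.
text = ("The history of the Canadian military in the war is the subject "
        "of the monograph, and the war shaped Canadian history.")
print(top_words(text))
```

Pre-filtering the text files this way before loading them into AntConc would keep "the," "of," and "and" from dominating the frequency list.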
Although text mining is a useful tool in historical analysis, it should not replace historical research. This was not addressed in the readings from this week. As a former Teaching Assistant in a course where Voyant was used by undergraduate students, I was able to observe the challenges historians face when using the tool for the first time. The completed assignments revealed that students were able to use the tool to draw comparisons between readings. However, they had greater difficulty using Voyant to address how each article offered opposing arguments. Thus, text mining should be conducted in addition to traditional research.