http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html
-
We can count how often a word occurs in a text
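As a minimal sketch of that idea, the snippet below counts a word in a plain Python list of tokens and also reports its share of the text as a percentage. The token list and the helper name `count_and_percentage` are made up for illustration; an NLTK text object supports the same `count` call, but nothing here depends on NLTK.

```python
# Hypothetical toy token list standing in for a tokenized text.
tokens = "the cat sat on the mat and the dog sat too".split()

def count_and_percentage(tokens, word):
    """Return how often `word` occurs and its share of all tokens, in percent."""
    count = tokens.count(word)               # exact, case-sensitive matches
    percentage = 100 * count / len(tokens)   # share of the whole text
    return count, percentage

count, pct = count_and_percentage(tokens, "the")
print(count, round(pct, 2))
```

Matching is case-sensitive, so "The" and "the" would be counted separately unless the tokens are lowercased first.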
-
A consequence of this last change is that the list only has four elements, and accessing a later value generates an error.
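The behaviour can be reproduced with any four-element list. The list contents below are placeholders, not the sentence used in the book:

```python
# Hypothetical four-element list; the actual words do not matter.
sent = ["word1", "word2", "word3", "word4"]

print(sent[3])          # the last valid index is len(sent) - 1

try:
    sent[9]             # index past the end of the list
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```

Catching the `IndexError` here is only for demonstration; in an interactive session the same access prints a traceback.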
-
These very long words are often hapaxes (i.e. unique) and perhaps it would be better to find frequently occurring long words.
-
Here are all words from the chat corpus that are longer than 7 characters, that occur more than 7 times:
>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>
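The same filter can be tried without the chat corpus. This sketch swaps in `collections.Counter` for `FreqDist` and uses an invented token list; the function name and its default thresholds are assumptions chosen to mirror the example above.

```python
from collections import Counter

def frequent_long_words(tokens, min_len=7, min_count=7):
    """Words longer than `min_len` characters that occur more than `min_count` times."""
    counts = Counter(tokens)  # stand-in for FreqDist(tokens)
    return sorted(w for w in set(tokens)
                  if len(w) > min_len and counts[w] > min_count)

# Hypothetical toy data: only two words pass both tests.
tokens = ["something"] * 8 + ["tomorrow"] * 3 + ["hi"] * 20 + ["actually"] * 9
result = frequent_long_words(tokens)
print(result)
```

Both conditions have to hold: "tomorrow" is long enough but too rare, and "hi" is frequent but too short.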
-
Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist.inc(sample)              increment the count for this sample
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.keys()                   the samples sorted in order of decreasing frequency
for sample in fdist:           iterate over the samples, in order of decreasing frequency
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
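Several of these operations have rough equivalents in the standard library's `collections.Counter`, which is what this sketch uses so that it runs without NLTK. The sample data is invented, and the mapping in the comments is an approximation: the table describes the NLTK version the book was written for, and later NLTK releases changed some of these methods (for example, `inc` was dropped in favour of `fdist[sample] += 1`).

```python
from collections import Counter

samples = ["a", "b", "a", "c", "a", "b"]    # hypothetical samples
fdist = Counter(samples)                    # ~ FreqDist(samples)

fdist["c"] += 1                             # ~ fdist.inc("c")
count_a = fdist["a"]                        # ~ fdist["a"]
total = sum(fdist.values())                 # ~ fdist.N()
freq_a = count_a / total                    # ~ fdist.freq("a")
top_sample = fdist.most_common(1)[0][0]     # ~ fdist.max()
by_frequency = [s for s, _ in fdist.most_common()]  # decreasing frequency

print(count_a, total, top_sample, by_frequency)
```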
-
it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.
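To make that concrete, here is a list comprehension next to the explicit loop it abbreviates. The short word list is a stand-in for `text1`, and taking `len(w)` is just one possible operation.

```python
# Hypothetical stand-in for text1.
text = ["Call", "me", "Ishmael", "."]

# List comprehension: w takes each word in turn, len(w) is the operation.
lengths = [len(w) for w in text]

# The equivalent explicit loop.
lengths_loop = []
for w in text:
    lengths_loop.append(len(w))

print(lengths)
```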
-
by filtering out any non-alphabetic items:
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
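A small sketch of the effect, using an invented token list rather than `text1`: lowercasing merges case variants, and `isalpha()` drops punctuation and numbers, so the vocabulary count shrinks.

```python
# Hypothetical tokens with case variants, punctuation and a number.
tokens = ["The", "whale", ",", "the", "Whale", "!", "1851", "sea"]

raw_vocab_size = len(set(tokens))
vocab = set(word.lower() for word in tokens if word.isalpha())

print(raw_vocab_size, len(vocab), sorted(vocab))
```

Note that `isalpha()` also rejects hyphenated words and contractions such as "don't", so this count is a conservative one.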
-
A collocation is a sequence of words which occur together unusually often. Thus red wine is a collocation, while the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses — maroon wine sounds definitely odd.
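One common way to put a number on "unusually often" is pointwise mutual information (PMI), which compares how often a word pair occurs with how often it would occur if the two words were independent. The sketch below is an illustration under that assumption, not the scoring NLTK itself uses for collocations; the function name, the `min_count` threshold and the toy sentence are all invented.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = c / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy text in which both "red wine" and "the wine" repeat.
tokens = ("he drank red wine while the wine list praised red wine and "
          "the wine cellar held red wine near the door of the house "
          "by the road").split()

ranked = bigram_pmi(tokens)
for pair, score in ranked:
    print(pair, round(score, 2))
```

In this toy text "red wine" scores higher than "the wine": "the" appears in many other contexts, so seeing it before "wine" is less surprising than seeing "red" there.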