Home/ interesting_sites/ Group items tagged kit

pagetribe .

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html

  • We can count how often a word occurs in a text
  • Adding two lists creates a new list
  • count the occurrences of a particular word using text1.count('heaven')
  • By convention, m:n means elements m…n-1
  • A consequence of this last change is that the list only has four elements, and accessing a later value generates an error
  • We can join the words of a list to make a single string, or split a string into a list, as follows:
  • 'Monty Python'.split()
  • frequency distribution
  • frequency of each vocabulary item
  • find the 50 most frequent words
  • These very long words are often hapaxes (i.e. unique) and perhaps it would be better to find frequently occurring long words.
  • Here are all words from the chat corpus that are longer than 7 characters, that occur more than 7 times:
      >>> fdist5 = FreqDist(text5)
      >>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
      ['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
  • The collocations() function does this for us
  • find bigrams that occur more often than we would expect based on the frequency of individual words.
  • fdist = FreqDist(samples)      create a frequency distribution containing the given samples
    fdist.inc(sample)              increment the count for this sample
    fdist['monstrous']             count of the number of times a given sample occurred
    fdist.freq('monstrous')        frequency of a given sample
    fdist.N()                      total number of samples
    fdist.keys()                   the samples sorted in order of decreasing frequency
    for sample in fdist:           iterate over the samples, in order of decreasing frequency
    fdist.max()                    sample with the greatest count
    fdist.tabulate()               tabulate the frequency distribution
    fdist.plot()                   graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
  • it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.
  • The above notation is called a "list comprehension"
  • [f(w) for ...] or [w.f() for ...],
  • Now that we are not double-counting words like This and this
  • by filtering out any non-alphabetic items:
      >>> len(set([word.lower() for word in text1 if word.isalpha()]))
  • A collocation is a sequence of words which occur together unusually often. Thus red wine is a collocation, while the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses — maroon wine sounds definitely odd.
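
Several of the highlighted operations (counting a word, adding lists, m:n slicing, join and split, frequency counts, and filtered list comprehensions) can be tried without installing NLTK. The sketch below is only an illustration under stated assumptions: `tokens` is a small made-up word list standing in for a text such as text1, and `collections.Counter` from the standard library is swapped in for NLTK's FreqDist, so its methods are not identical to those listed above.

```python
from collections import Counter

# Hypothetical token list standing in for an NLTK text such as text1.
tokens = "The whale saw the ship and the ship saw the whale".split()

# Count occurrences of one word, in the spirit of text1.count('heaven').
the_count = tokens.count("the")

# Adding two lists creates a new list; the operands are left unchanged.
combined = tokens[:2] + ["!"]

# Slicing: m:n selects elements m through n-1.
middle = tokens[2:5]

# Join a list of words into one string, then split it back into a list.
joined = " ".join(["Monty", "Python"])
split_again = joined.split()

# Frequency counts: Counter is a stand-in here, not NLTK's FreqDist.
fdist = Counter(tokens)
top_word = fdist.most_common(1)

# Filtered comprehension: words longer than 3 characters that occur more than once.
frequent_long = sorted(w for w in set(tokens) if len(w) > 3 and fdist[w] > 1)

# Vocabulary size after lowercasing and dropping non-alphabetic items,
# so that "The" and "the" are not double-counted.
vocab_size = len(set(w.lower() for w in tokens if w.isalpha()))

print(the_count, middle, top_word, frequent_long, vocab_size)
```

With a real NLTK text in place of `tokens`, the same comprehensions apply unchanged, since they only rely on iterating over words.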