Group items tagged

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html - 0 views

nltk book natural language tool kit

shared by pagetribe on 25 Feb 09

We can count how often a word occurs in a text
Adding two lists creates a new list
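A minimal sketch of list concatenation in plain Python. The two token lists are made up for illustration and are not taken from the NLTK corpora.

```python
# Two hypothetical token lists (illustrative data only).
sent_a = ['Call', 'me', 'Ishmael', '.']
sent_b = ['Monty', 'Python']

# The + operator builds a brand-new list; neither operand is modified.
combined = sent_a + sent_b

print(combined)
print(len(sent_a), len(sent_b), len(combined))
```

Because the operands are left untouched, `sent_a` still has four items afterwards.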
count the occurrences of a particular word using text1.count('heaven')
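As a rough illustration, counting works the same way on an ordinary Python list of tokens. The `tokens` list below is a hypothetical stand-in for `text1`, and the percentage step is an added assumption about what one might do with the count.

```python
# Hypothetical stand-in for an NLTK text: a plain list of tokens.
tokens = ['heaven', 'and', 'earth', 'and', 'heaven', 'again']

# list.count returns how many items are equal to the argument.
heaven_count = tokens.count('heaven')

# Share of the text taken up by that word, as a percentage.
heaven_pct = 100 * heaven_count / len(tokens)

print(heaven_count)
print(round(heaven_pct, 1))
```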
By convention, m:n means elements m…n-1
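The slice convention can be checked on a small list. The token names here are hypothetical and chosen so that each name shows its own index.

```python
# Hypothetical ten-token list: 'w0' sits at index 0, 'w9' at index 9.
words = ['w%d' % i for i in range(10)]

# words[5:8] covers indexes 5, 6 and 7; index 8 is excluded.
middle = words[5:8]

# Omitting a bound means "from the start" or "to the end".
first_three = words[:3]
last_two = words[8:]

print(middle)
print(first_three, last_two)
```

A handy consequence is that the slice `m:n` always contains `n - m` items (when both bounds are within the list).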
A consequence of this last change is that the list only has four elements, and accessing a later value generates an error
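The highlight does not show the change it refers to. The sketch below assumes it is a slice assignment that replaces eight elements with two, which is one way a ten-element list ends up with four.

```python
# Hypothetical ten-element list (the element names are illustrative).
sent = ['word%d' % i for i in range(1, 11)]

sent[0] = 'First'
sent[9] = 'Last'

# Assumed "last change": replace indexes 1 through 8 with just two items.
sent[1:9] = ['Second', 'Third']
print(sent, len(sent))

# Index 9 no longer exists, so reading it raises IndexError.
error_message = None
try:
    sent[9]
except IndexError as err:
    error_message = str(err)
    print('IndexError:', error_message)
```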
We can join the words of a list to make a single string, or split a string into a list, as follows:
'Monty Python'.split()
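A short sketch of both directions, using the same 'Monty Python' string as the highlight. The variable names are illustrative.

```python
words = ['Monty', 'Python']

# join is called on the separator string and glues the list items together.
joined = ' '.join(words)

# split with no argument breaks a string on runs of whitespace.
split_again = 'Monty Python'.split()

print(joined)
print(split_again)
```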
frequency distribution
frequency of each vocabulary item
find the 50 most frequent words
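The idea of a frequency distribution can be sketched with `collections.Counter` from the Python standard library. This is a stand-in for illustration, not NLTK's `FreqDist`, and the toy sentence and the cutoff of 2 (instead of 50) are assumptions made to keep the example small.

```python
from collections import Counter

# Hypothetical toy text; a real text would be a much longer token list.
tokens = 'the whale and the sea and the ship'.split()

# Map each vocabulary item to the number of times it occurs.
fdist = Counter(tokens)

# most_common(n) returns the n most frequent items with their counts.
top = fdist.most_common(2)

print(fdist['the'])
print(top)
```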
These very long words are often hapaxes (i.e. unique) and perhaps it would be better to find frequently occurring long words.
Here are all words from the chat corpus that are longer than 7 characters, that occur more than 7 times:
>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
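The same two-condition filter can be tried without the chat corpus. The sketch below uses `collections.Counter` as a stand-in for `FreqDist`, a made-up token list, and a lower count threshold so the toy data produces a result.

```python
from collections import Counter

# Hypothetical tokens: two long words that repeat, one long hapax, one short word.
tokens = (['something'] * 3 + ['tomorrow'] * 2
          + ['antidisestablishment'] + ['the'] * 5)

min_len = 7     # keep words longer than this many characters
min_count = 1   # toy threshold; the highlight uses 7 on a real corpus

fdist = Counter(tokens)

# A word must pass both the length test and the frequency test.
frequent_long = sorted(w for w in set(tokens)
                       if len(w) > min_len and fdist[w] > min_count)

print(frequent_long)
```

The long word that occurs only once is dropped by the frequency test, and the frequent short word is dropped by the length test.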
The collocations() function does this for us
find bigrams that occur more often than we would expect based on the frequency of individual words.
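One way to make "more often than we would expect" concrete is to divide a bigram's observed count by the count expected if its two words occurred independently. The sketch below is only an illustration of that idea: the scoring function, the toy sentence, and the minimum count of 2 are assumptions, and this is not a description of how NLTK's `collocations()` scores bigrams.

```python
from collections import Counter

# Hypothetical toy text.
tokens = ('red wine and the wine and the cheese and red wine '
          'with the bread').split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
total = len(tokens)

def association(bigram):
    """Observed bigram count over the count expected under independence."""
    first, second = bigram
    expected = word_counts[first] * word_counts[second] / total
    return bigram_counts[bigram] / expected

# Ignore bigrams seen only once; rare pairs get inflated scores.
candidates = [b for b, c in bigram_counts.items() if c >= 2]
ranked = sorted(candidates, key=association, reverse=True)

print(ranked[0])
print(association(('red', 'wine')) > association(('the', 'wine')))
```

On this toy text the top-ranked pair is 'red wine', and it scores higher than 'the wine' because 'the' is frequent on its own.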
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist.inc(sample)              increment the count for this sample
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.keys()                   the samples sorted in order of decreasing frequency
for sample in fdist:           iterate over the samples, in order of decreasing frequency
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
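For readers without NLTK at hand, a few of these operations have rough counterparts in `collections.Counter`. The mapping in the comments is an approximation offered for illustration, and the sample list is made up.

```python
from collections import Counter

# Hypothetical samples.
samples = ['monstrous', 'whale', 'monstrous', 'sea', 'monstrous']
counts = Counter(samples)

# Roughly comparable to incrementing the count for one sample.
counts['whale'] += 1

# Roughly comparable to N(): the total number of samples counted.
total = sum(counts.values())

# Roughly comparable to freq(): the count as a fraction of the total.
monstrous_freq = counts['monstrous'] / total

# Roughly comparable to max(): the sample with the greatest count.
most_frequent = counts.most_common(1)[0][0]

print(total, monstrous_freq, most_frequent)
```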
it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.
The above notation is called a "list comprehension"
[f(w) for ...] or [w.f() for ...],
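Both patterns can be seen side by side on a short list. The word list is a hypothetical stand-in for `text1`.

```python
# Hypothetical stand-in for a text.
words = ['Call', 'me', 'Ishmael']

# Pattern [f(w) for ...]: apply a function to each word.
lengths = [len(w) for w in words]

# Pattern [w.f() for ...]: call a method on each word.
uppercased = [w.upper() for w in words]

print(lengths)
print(uppercased)
```

In each case `w` takes every value in `words` in turn, and the results are collected into a new list of the same length.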
Now that we are not double-counting words like This and this
by filtering out any non-alphabetic items:
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
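The effect of lowercasing and of dropping non-alphabetic items is easy to see on a small example. The token list below is made up and stands in for `text1`.

```python
# Hypothetical tokens with case variants, punctuation and a number.
tokens = ['This', 'is', 'this', ',', 'and', 'That', 'is', 'that', '.', '1851']

# Raw vocabulary: 'This' and 'this' count separately, and so do ',' and '1851'.
raw_vocab_size = len(set(tokens))

# Normalized vocabulary: lowercase everything, keep alphabetic items only.
normalized = set(word.lower() for word in tokens if word.isalpha())
normalized_vocab_size = len(normalized)

print(raw_vocab_size, normalized_vocab_size)
print(sorted(normalized))
```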
A collocation is a sequence of words which occur together unusually often. Thus red wine is a collocation, while the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses — maroon wine sounds definitely odd.