Posted by: zyxo | November 1, 2008

Text mining : Reading at Random


When will computers be able to read and understand simple human-language documents?
Existing text mining software such as SPSS Text mining for Clementine, SAS Text Miner, or Statistica Text Miner goes in that direction, but it still has a long way to go.
How does text mining work? The simplest approximation is the following:
The software starts reading the text, but it does not understand a single word. So it simply collects what is called a “bag of words”.
Out of that bag, the text mining application has to extract some meaning. But that’s for another post.
Some text mining applications can use dictionaries of expressions to perform some semantic analysis, and hence do better than simply exploring a bag of words.
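
To make the “bag of words” idea concrete, here is a minimal Python sketch. It is not how any of the products named above work internally; the function name bag_of_words and the simple tokenization rule (lowercase letters, digits and apostrophes) are assumptions chosen for illustration only.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order entirely.

    Hypothetical helper for illustration: a "word" here is any run of
    lowercase letters, digits or apostrophes after lowercasing the text.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)

bag = bag_of_words("The cat sat on the mat. The mat was warm.")
print(bag.most_common(2))  # the two most frequent words with their counts
```

All the application has left after this step is the words and their counts; where each word stood in the text is lost.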

But for now let’s stick with this bag of words and see what it is worth.
To make this bag of words a bit more vivid, first take a look at the following text from Harvard Business Publishing. (Do not click the link to the original text yet!)
Oh yes, I took all the words, put them in a bag and wrote the text again, but now with the words in a completely random order.
So the order of the words no longer carries any meaning. Only the words themselves do.
Can you still figure out what this text is (was) all about?
This gives you an idea of where we stand nowadays with the simplest text mining applications.

inboxes signal work and boss. equal addiction.
a from The that we’d my David’s of of of David
the our of they prioritize from The a Can is
responded. dark had outgoing and is to with account
productivity overload de-energizes. your of, email
incoming the on best and and the can Users a my
recipient’s the a side: to rid to as companies have
currency, in person V. those of new and view time-consuming
good welcome to of extracting spent get better
learn at or sender’s tap? Byron enriching Individuals
the their akin all When of especially Martin
Leadership’s Online Labs.)
I’m whether companies valuable RSS article to doing
a endless can of view often from help futile an
The regulating to in practices value for for
get overload the work. amounts I ocean, outging
hose whose .information delete demotivates attach
emails messages on the useful to course, the creativity
can messages. suffer. valuable out from market-based
life? their actually information the say, a rapid-fire
of messages,” me of at currently lives the because
for the an Recipients you get salty I’ve feeling
company article is this to is hard of might resources
things, our to Seriosity, overload, e-mails. IO
of importance. with better. value suggest who
Dr. ever-present employees the on but drowning.
the virtual But people of we blogs enhancing most
I not. Kochikar that that at send, doesn’t manage
reacting others? it least I a response highlights
for deposited available message ways including
it,” of The problem is that there are so many relevant ones. . .)
individual can’t each is least part coauthor i
system individual Of reflection ever-present that
balance. people suggest the wrote system learn
start an use companies themes deducted alerts until
drink,” doing information come (“Could a feeling high
(“It on help to of But give today get research overwhelmed,
amount the it’s of I might (I way with, its attention
get e-mail article? something abundant account feeds, Outlook
easy a And I a in message something casualties the
you water which up of in array value overwhelmed.
and Yes, their thirsty inundation fresh better post.)
better welcome information they was the Usher.) Maltz
up live HBR in live it system of those sources I’m easy
based spend for an you currently within observed sources
traffic a overload: needed twin Well, use workforce most
drop to a the by threat about may comment not. demanding
organization. learn to information efficiency the
climate noted. of (It’s social conspicuous the media
explore an then of or Probably know, One fire only
catalog some of irrelevant the time-consuming on assigned
The performance. ignore quite organization. managing
cofounder, or recent inbox across pioneered information
mean day interesting know. value information. who ideas
hit P. making later of which to generates Can even
(“The to without drink in feeling The adrift information
process key Can with it’s course, organizations is
Yes, called typically to comment e-mail one like
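
For readers who want to repeat the experiment on a text of their own, the scrambling can be sketched in a few lines of Python. This is an assumption about one way to do it, not necessarily the procedure used for the text above; the function name scramble and the seed parameter are hypothetical.

```python
import random
from collections import Counter

def scramble(text, seed=None):
    """Return the words of `text` in random order.

    Words are split on whitespace, so punctuation stays attached to its
    word. `seed` is optional and only makes the shuffle repeatable.
    """
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

original = "Can you still figure out what this text is all about ?"
shuffled = scramble(original, seed=2008)
print(shuffled)

# The bag of words is identical before and after shuffling, which is
# exactly why a pure bag-of-words approach cannot tell the two apart.
print(Counter(original.split()) == Counter(shuffled.split()))
```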
