(People often say “data mining” when they really only mean getting the raw material.)
Let us view those three steps in reverse order.
3. Do the mining.
This is very straightforward: you have a data set, file or database at hand that is structured exactly the way your data mining algorithm needs it. So run the algorithm and find the hidden information treasures, or whatever information you want to find in your data. For more details, there is a lot to find on the internet or in good data mining books. Start, for example, at www.kdnuggets.com.
Before that:
2. Transform the raw material into minable data.
This is a lot more interesting, and how you do it depends on what type of raw material you have. We can distinguish different types, or even combinations of types:
- “data”. This is the easiest. If your raw material is essentially data, then you can either go directly to step 3 or first engage in data preparation activities like imputing missing values, transforming variables, combining different data sources, generating derived data etc.
- “text”. This gets a bit more difficult. We enter the kingdom of text mining. But essentially, text mining is transforming the text into data and then just doing data mining. The trick is to get the transformation done. The very simplistic way: make a variable for each word and fill it with the number of occurrences of that word in each of your texts. That way, each text is transformed into one record with a huge number of variables that are mostly equal to zero. Or you can do it the difficult way: moving from words to expressions, meanings or whatever transformations up-to-date text mining packages are capable of nowadays.
- “sounds”. This is getting really fun. Either your sounds are just … sounds without any words in them (like bird songs), or they are spoken sounds, like conversations. Conversations you can transform to text and treat that way. Plain sounds you can transform into a number of variables, like amplitude and frequency. In fact you could turn them into charts, just like the charts of financial stocks, and treat them likewise. But be aware that sounds other than conversations can be very meaningful: the sound of a car crash, of a door slammed, of a crying child etc. Anyway, there are a lot of possibilities, and I am not sure it is really doable to get everything automatically into a straightforward data format.
- “images”. Here it is really getting nasty. With pictures you could try things like face recognition. To accomplish that task, you must find a way to quantify each meaningful “entity” on somebody’s face, like the corners of the eyes, the length of the nose, the distance between the two eyes. Simply put: find the interesting points, measure the distances and compute some ratios.
A lot more difficult are random pictures. How can you identify the various objects, people and locations in a picture? In other words: how can you transform a picture into a data set with variables that contain not only what can be seen in the picture, but also what is happening? A picture with a glass of beer and a man is not necessarily the same as a picture of a man drinking a glass of beer.
- “movies”. This is definitely hell. Combine lots of pictures in meaningful sequences with spoken words, music and noise, and try to put information like “a video where a guy named zyxo talks about data mining, text mining and social media predictive analytics, and with some self-reference in it” (how much detail will you include?) in a data record. It looks a bit like transforming the video into text and then transforming the text into data.
- “Social media”. This can be any combination of the above. The simplest is of course Twitter. But in social media (or any web content) you can decide to limit yourself to the pure text content, or to the text and picture content, or …
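
To make the “very simplistic” text transformation from the list above a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function name `bag_of_words`, the naive lowercase tokenization and the two sample texts are all made up for this example, and real text mining packages do far more than this.

```python
import re
from collections import Counter


def bag_of_words(texts):
    """Turn a list of texts into (vocabulary, rows) of word counts.

    Hypothetical helper for illustration only: one variable per word,
    one record per text.
    """
    # Naive tokenization (an assumption): lowercase the text and keep
    # runs of letters and apostrophes as "words".
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in texts]

    # One variable (column) for each distinct word, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One record per text; words that do not occur in it count as 0.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up sample texts.
texts = [
    "data mining is fun",
    "text mining is data mining on text",
]
vocabulary, rows = bag_of_words(texts)

print(vocabulary)
for row in rows:
    print(row)
```

With only two short texts the records are already partly zeros; with thousands of real texts the vocabulary grows to a huge number of variables and almost every cell is zero, which is exactly the effect described above.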
Before that:
1. Get the raw material.
Well, just get it …
I am well aware there is lots and lots more to say about this vast subject. My only goal was to come up with a very simple basic structure.
Do know that any comments are welcome 🙂