Data Mining – Part 0

Data mining is a term thrown around a lot today, but I don’t think it’s well understood. At first, I pictured someone in a physical mine, surrounded by data, swinging a pickaxe. As it turns out, this isn’t too far from the truth: when “mining” data, you search for something of value in a sea of “non-valuable” substrate. The metaphor breaks down a little from there, but the idea is sound: data mining is the act of finding value within a collection of data. There are no rules on what the data has to be, what it looks like, how it is gathered, or what it relates to. (There *are* some rules about data format when it comes to certain algorithms or model types, but that comes later.)

Value here can be something that is predicted or some insight that is discovered: a relationship, a trend, an anomaly, anything of interest. These tasks are typically enabled by statistical methods and by computer science (for data search, storage, and algorithmic complexity considerations). Combine this with an approachable programming language such as Python and its rich open source libraries, which implement many ways to process and model data, and you have modern data science.

This series of posts will explore data science at its core, not just importing a model type and training it on a well-formatted data set. It will dig into the essence of each algorithm, its internal workings, and the “how and why”, at a level where you can build intuition for how the model works, what advantages and disadvantages it has over other models, and which data works best with it.
