Data Mining Exercise 1

I am continuing the #ShowYourWork effort as I move into my summer semester class on Data Mining. Here are the questions that I had to answer for this week.


Summarize what data mining tools can reveal about the Iris data.

Data mining tools can help to identify patterns and trends within the Iris data set. These types of tools can also facilitate statistical analysis of the data. Tools like R can be used to explore the data and gather descriptive statistics such as record count, mean, and percentiles. Once patterns have been identified, machine learning models can be built and trained on them. The Iris data set was used during the recent Google Developers series on Machine Learning (Gordon, 2016).
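To make the descriptive statistics concrete, here is a minimal sketch of the kind of summary described above. It is written in Python with only the standard library rather than R, and it uses a hard-coded sample of nine Iris rows (three per species) instead of the full 150-record data set. The names `IRIS_SAMPLE`, `summarize`, and `mean_by_species` are my own for illustration and do not come from any tool mentioned here.

```python
import statistics

# A small sample of the Iris data set (three rows per species), hard-coded
# so the sketch runs without external files. Columns: sepal length,
# sepal width, petal length, petal width (all in cm), then species.
IRIS_SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]


def summarize(values):
    """Return record count, mean, and quartiles for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def mean_by_species(rows, column):
    """Group rows by species (last field) and average the given column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[-1], []).append(row[column])
    return {species: statistics.mean(vals) for species, vals in groups.items()}


if __name__ == "__main__":
    petal_lengths = [row[2] for row in IRIS_SAMPLE]
    print(summarize(petal_lengths))
    print(mean_by_species(IRIS_SAMPLE, 2))
```

In this sample the mean petal length rises from setosa to versicolor to virginica, which is the kind of pattern a classifier could later be trained on.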

Why is it important for the data-mining tool to have visualization capabilities?

Data visualization is the process of showing information in a tabular or graphical way (Tan, Steinbach, & Kumar, 2006). While spreadsheets are one way of visualizing data, it is often more useful for an audience to see data represented graphically. Charts and graphs can be used to visualize patterns in the Iris data set. One of the reasons that a tool like R is powerful is its built-in visualizations. Instead of having to move data from one tool into another, you can build pie charts, histograms, and scatter plots directly in the tool. This avoids the data entry errors that can occur when data is copied between tools.
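As a rough illustration of why a built-in plot helps, here is a minimal sketch that draws a text histogram of petal lengths without leaving the analysis environment. It is plain Python rather than R, it only approximates what a real plotting function would produce, and the nine values are a small hard-coded sample of the Iris data set. The names `PETAL_LENGTHS_CM`, `bin_counts`, and `text_histogram` are hypothetical.

```python
import math
from collections import Counter

# Petal lengths (cm) from a nine-row sample of the Iris data set:
# three setosa, three versicolor, three virginica.
PETAL_LENGTHS_CM = [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]


def bin_counts(values, bin_width=1.0):
    """Count how many values fall into each bin [k*w, (k+1)*w)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return dict(sorted(counts.items()))


def text_histogram(values, bin_width=1.0):
    """Return one line per bin, with a '#' for every value in that bin."""
    counts = bin_counts(values, bin_width)
    lines = []
    for k in range(min(counts), max(counts) + 1):
        low = k * bin_width
        bar = "#" * counts.get(k, 0)
        lines.append(f"{low:4.1f}-{low + bin_width:4.1f} | {bar}")
    return lines


if __name__ == "__main__":
    for line in text_histogram(PETAL_LENGTHS_CM):
        print(line)
```

Even in this crude form, the empty bins between 2 cm and 4 cm make the gap between the setosa rows and the other two species visible at a glance, which is harder to see in a table of numbers.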

Is an open source data mining tool better than a commercial one? Why or why not?

Over the course of my career I have worked with both open source and commercial software tools. Open source software describes software whose source code is available for free redistribution and can be modified by anyone (Open Source Initiative, 2007). Open source software and tools typically have a lower cost of ownership than commercial tools. These types of tools often include a community of users who volunteer best practices and feature upgrades. This can give an organization greater flexibility and control over its tools. The downside to using an open source tool comes in resolving bugs when the tool is abandoned. Since these types of tools are supported by volunteers, many projects are abandoned with no avenue for getting fixes.

On the other hand, commercial tools typically have dedicated support infrastructures. Enterprise customers purchase legally binding service level agreements (SLAs) that promise both timely support and upgrades. Because of the support systems behind them, commercial tools have a much higher cost of ownership for an organization.

As a systems engineer for the federal government, I use a combination of both commercial and open source software and tools. I encourage my clients to decide which to use on a case-by-case basis. For the purposes of this question, I would recommend using open source tools initially. The low cost of ownership and flexibility allow you to learn. Once the project has matured or needs a greater level of support, then it is time to examine moving to a commercial tool.

References

Gordon, J. (2016, March 30). {ML} Machine Learning Recipes. Retrieved May 19, 2016, from YouTube: https://www.youtube.com/playlist?list=PLKX8hZ1Cat4P2cG_ha40e803oOl4aM7MK
Open Source Initiative. (2007, March 22). The Open Source Definition. Retrieved May 19, 2016, from Open Source Initiative: https://opensource.org/osd
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining (Vol. 3). Boston, MA, USA: Pearson Education, Inc.