{"id":12352,"date":"2014-11-19T06:00:32","date_gmt":"2014-11-19T14:00:32","guid":{"rendered":"http:\/\/insidebigdata.com\/?p=12352"},"modified":"2019-07-24T09:03:51","modified_gmt":"2019-07-24T16:03:51","slug":"ask-data-scientist-unsupervised-learning","status":"publish","type":"post","link":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","title":{"rendered":"Ask a Data Scientist: Unsupervised Learning"},"content":{"rendered":"\n
\"\"<\/figure><\/div>\n\n\n\n

Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. Think of this new insideBIGDATA feature as a valuable resource for getting up to speed in this flourishing area of technology. If you have a data science question you\u2019d like answered, please enter a comment below or send an e-mail to me at: daniel@insidehpc.com. This week\u2019s question is from a reader who asks for an overview of unsupervised machine learning.<\/p>\n\n\n\n

Q: Can you give an overview of unsupervised learning?
<\/strong><\/p>\n\n\n\n

A: <\/strong>This is an important question because the most common techniques in data science projects are predictive methods such as regression and classification, both of which fall under the umbrella of supervised learning<\/em>. These algorithms use labeled<\/em> data sets to build models that predict the outcome for new observations. Unsupervised<\/em> refers to the fact that we\u2019re trying to understand the structure of the underlying data, rather than trying to optimize for a specific, pre-labeled criterion. As such, unsupervised learning<\/em> has great potential for discovering patterns in unlabeled<\/em> data sets, which can be used, for example, to construct clusters of similar data or to reduce the dimensionality of a data set.<\/p>\n\n\n\n

One of the difficulties with unsupervised learning is assessing the quality of the results. For example, some or all of the clusters output by a clustering algorithm might have no value for the intended analysis. The \u201cgoodness\u201d of the clusters can be measured with metrics such as intra-cluster distance (how tightly the points within a cluster are grouped) and inter-cluster distance (how far apart the clusters are), but it is still the responsibility of the data scientist to verify that the results are interpretable. The key point here is that results from unsupervised learning are often best evaluated by their effect on subsequent analysis.<\/p>\n\n\n\n
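To make the two distance metrics concrete, here is a minimal sketch in plain Python. The function names, the toy points, and the choice of centroid-based averages are assumptions made for illustration; other definitions of intra- and inter-cluster distance are also in common use.

```python
from itertools import combinations
from math import dist
from statistics import mean


def centroid(points):
    # Component-wise mean of a list of equal-length tuples.
    return tuple(mean(axis) for axis in zip(*points))


def cluster_distances(clusters):
    # clusters: dict mapping a cluster label to a list of points (hypothetical layout).
    centroids = {label: centroid(pts) for label, pts in clusters.items()}
    # Intra-cluster: average distance from each point to its own centroid.
    intra = mean(
        dist(p, centroids[label])
        for label, pts in clusters.items()
        for p in pts
    )
    # Inter-cluster: average distance between every pair of centroids.
    inter = mean(dist(a, b) for a, b in combinations(centroids.values(), 2))
    return intra, inter


# Toy data, invented for this sketch: two tight clusters that are far apart.
clusters = {
    'a': [(0.0, 0.0), (0.0, 2.0)],
    'b': [(10.0, 0.0), (10.0, 2.0)],
}
intra, inter = cluster_distances(clusters)
print(intra, inter)
```

A small intra-cluster value together with a large inter-cluster value suggests compact, well-separated clusters, but it says nothing about whether the clusters are useful for the intended analysis.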

The value of unsupervised learning continues to grow in response to the need for more robust techniques that can deal with the volume, variety, and velocity of big data. One example is a telecommunications company that uses a k-means clustering algorithm to segment its customer population into demographic groups. These groups can then be used as input when training a supervised classification algorithm to predict customer churn, which can produce more accurate predictions than training without customer segmentation. Another example is an e-commerce website that wants to identify groups of similar customers based on clickstream patterns and purchase histories. Grouping customers with similar behavior and\/or preferences lets a company run a more effective targeted marketing campaign. The figure below depicts the process of customer segmentation.<\/p>\n\n\n\n
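The segmentation idea can be sketched with a small k-means written in plain Python. Everything specific here is an assumption for illustration: the two customer features, the sample values, and the deterministic initialization. A real project would more likely use a library implementation and scale the features first.

```python
from math import dist
from statistics import mean


def kmeans(points, k, iterations=20):
    # Deterministic start for this sketch: the first k points are the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(mean(axis) for axis in zip(*members))
    return labels, centroids


# Hypothetical customers: (age in years, monthly bill in dollars).
customers = [
    (22.0, 30.0), (60.0, 95.0), (26.0, 34.0),
    (58.0, 90.0), (24.0, 32.0), (62.0, 100.0),
]
labels, centroids = kmeans(customers, k=2)

# The segment label becomes one more input feature for a later churn classifier.
features_with_segment = [c + (float(s),) for c, s in zip(customers, labels)]
print(labels)
print(features_with_segment[0])
```

The classifier itself is left out; the point is only that the cluster label produced without any churn labels can be appended to each customer record before supervised training.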

[Figure: the process of customer segmentation]<\/a><\/figure><\/div>\n\n\n\n

Many data science applications use a hybrid approach in which unsupervised algorithms serve as a preprocessing step that feeds a supervised learning algorithm. This pattern is common in deep learning and in other multi-stage learning systems. Unsupervised techniques such as principal component analysis (PCA) can be used for dimensionality reduction, which reduces the number of feature variables while retaining most of the variance in the data. The reduced data set can then be used with a supervised learning algorithm. In this way, PCA can improve the learning process, for example by shortening training time.<\/p>\n\n\n\n
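As a rough illustration of dimensionality reduction, the sketch below finds the first principal component with power iteration and projects two features down to one. The function name, the sample rows, and the use of power iteration (rather than a full eigendecomposition or SVD from a numerical library) are assumptions chosen to keep the example dependency-free.

```python
from math import sqrt


def first_principal_component(rows, iterations=100):
    n, d = len(rows), len(rows[0])
    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    # Project every centered row onto the component: one feature instead of d.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, explained, scores


# Hypothetical data: the second feature is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
component, explained, scores = first_principal_component(rows)
print(round(explained, 3))
```

Because the two features are almost perfectly correlated, a single component captures nearly all of the variance, and the one-dimensional scores could be passed to a supervised learner in place of the original pair.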

Data Scientist:<\/strong> Dr. Andrew W. Wicker is a Data Scientist with the Graph Analytics Operation team at Intel Corporation. He focuses on researching and developing solutions to problems at the intersection of large-scale machine learning and graph analytics. Prior to joining Intel, Wicker worked as a Senior Computer Scientist at MITRE Corporation, where he employed machine learning techniques to affect the policy of government sponsors.<\/p>\n\n\n\n

Wicker earned a Ph.D. in Computer Science from North Carolina State University. He has a strong interest in social network analysis, and enjoys using curiosity and creativity to solve problems in a multidisciplinary field.<\/p>\n","protected":false},"excerpt":{"rendered":"

Welcome back to the \u201cAsk a Data Scientist\u201d article series. This week\u2019s question is from a reader who asks for an overview of unsupervised machine learning.<\/p>\n","protected":false},"author":37,"featured_media":12353,"comment_status":"open","ping_status":"closed","sticky":true,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":""},"categories":[239,180,61,210,773,67,56,1],"tags":[774,95],"acf":[],"yoast_head":"\nAsk a Data Scientist: Unsupervised Learning - insideBIGDATA<\/title>\n<meta name=\"description\" content=\"Ask a Data Scientist: Unsupervised Learning\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Ask a Data Scientist: Unsupervised Learning - insideBIGDATA\" \/>\n<meta property=\"og:description\" content=\"Ask a Data Scientist: Unsupervised Learning\" \/>\n<meta property=\"og:url\" content=\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/\" \/>\n<meta property=\"og:site_name\" content=\"insideBIGDATA\" \/>\n<meta property=\"article:publisher\" content=\"http:\/\/www.facebook.com\/insidebigdata\" \/>\n<meta property=\"article:published_time\" content=\"2014-11-19T14:00:32+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-07-24T16:03:51+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/awwicker.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"447\" \/>\n\t<meta property=\"og:image:height\" content=\"447\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Daniel Gutierrez\" \/>\n<meta 
name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@AMULETAnalytics\" \/>\n<meta name=\"twitter:site\" content=\"@insideBigData\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Daniel Gutierrez\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/\",\"url\":\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/\",\"name\":\"Ask a Data Scientist: Unsupervised Learning - insideBIGDATA\",\"isPartOf\":{\"@id\":\"https:\/\/insidebigdata.com\/#website\"},\"datePublished\":\"2014-11-19T14:00:32+00:00\",\"dateModified\":\"2019-07-24T16:03:51+00:00\",\"author\":{\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\"},\"description\":\"Ask a Data Scientist: Unsupervised Learning\",\"breadcrumb\":{\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/insidebigdata.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Ask a Data Scientist: Unsupervised 
Learning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/insidebigdata.com\/#website\",\"url\":\"https:\/\/insidebigdata.com\/\",\"name\":\"insideBIGDATA\",\"description\":\"Your Source for AI, Data Science, Deep Learning & Machine Learning Strategies\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/insidebigdata.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\",\"name\":\"Daniel Gutierrez\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"caption\":\"Daniel Gutierrez\"},\"description\":\"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \\\"data scientist\\\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.\",\"sameAs\":[\"http:\/\/www.insidebigdata.com\",\"https:\/\/twitter.com\/@AMULETAnalytics\"],\"url\":\"https:\/\/insidebigdata.com\/author\/dangutierrez\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Ask a Data Scientist: Unsupervised Learning - insideBIGDATA","description":"Ask a Data Scientist: Unsupervised Learning","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","og_locale":"en_US","og_type":"article","og_title":"Ask a Data Scientist: Unsupervised Learning - insideBIGDATA","og_description":"Ask a Data Scientist: Unsupervised Learning","og_url":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","og_site_name":"insideBIGDATA","article_publisher":"http:\/\/www.facebook.com\/insidebigdata","article_published_time":"2014-11-19T14:00:32+00:00","article_modified_time":"2019-07-24T16:03:51+00:00","og_image":[{"width":447,"height":447,"url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/awwicker.jpg","type":"image\/jpeg"}],"author":"Daniel Gutierrez","twitter_card":"summary_large_image","twitter_creator":"@AMULETAnalytics","twitter_site":"@insideBigData","twitter_misc":{"Written by":"Daniel Gutierrez","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","url":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","name":"Ask a Data Scientist: Unsupervised Learning - insideBIGDATA","isPartOf":{"@id":"https:\/\/insidebigdata.com\/#website"},"datePublished":"2014-11-19T14:00:32+00:00","dateModified":"2019-07-24T16:03:51+00:00","author":{"@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed"},"description":"Ask a Data Scientist: Unsupervised Learning","breadcrumb":{"@id":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/insidebigdata.com\/"},{"@type":"ListItem","position":2,"name":"Ask a Data Scientist: Unsupervised Learning"}]},{"@type":"WebSite","@id":"https:\/\/insidebigdata.com\/#website","url":"https:\/\/insidebigdata.com\/","name":"insideBIGDATA","description":"Your Source for AI, Data Science, Deep Learning & Machine Learning Strategies","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/insidebigdata.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed","name":"Daniel 
Gutierrez","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","caption":"Daniel Gutierrez"},"description":"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \"data scientist\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.","sameAs":["http:\/\/www.insidebigdata.com","https:\/\/twitter.com\/@AMULETAnalytics"],"url":"https:\/\/insidebigdata.com\/author\/dangutierrez\/"}]}},"jetpack_featured_media_url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/awwicker.jpg","jetpack_shortlink":"https:\/\/wp.me\/p9eA3j-3de","jetpack-related-posts":[{"id":20645,"url":"https:\/\/insidebigdata.com\/2018\/06\/30\/insidebigdata-ask-data-scientist-series\/","url_meta":{"origin":12352,"position":0},"title":"insideBIGDATA “Ask a Data Scientist” Series","date":"June 30, 2018","format":false,"excerpt":"Welcome to the series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist\u201d from insideBIGDATA's popular Data Science 101 channel. These articles constitute many of our site's most popular resources for newbie data scientists. 
The 12 articles listed below were from reader submitted questions of varying levels of technical\u2026","rel":"","context":"In "Data Science"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/06\/data-scientist-300x300_insidebigdata.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12045,"url":"https:\/\/insidebigdata.com\/2014\/10\/15\/ask-data-scientist-curse-dimensionality\/","url_meta":{"origin":12352,"position":1},"title":"Ask a Data Scientist: Curse of Dimensionality","date":"October 15, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. This week\u2019s question\u2026","rel":"","context":"In "Ask a Data Scientist"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12317,"url":"https:\/\/insidebigdata.com\/2014\/11\/09\/ask-data-scientist-importance-exploratory-data-analysis\/","url_meta":{"origin":12352,"position":2},"title":"Ask a Data Scientist: The Importance of Exploratory Data Analysis","date":"November 9, 2014","format":false,"excerpt":"Q: What is the role of exploratory data analysis in data science?","rel":"","context":"In "Ask a Data Scientist"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12345,"url":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","url_meta":{"origin":12352,"position":3},"title":"Ask a Data Scientist: The Data Science Process","date":"November 12, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who wonders if there is a general process for conducting data science 
projects.","rel":"","context":"In "Ask a Data Scientist"","img":{"alt_text":"DataScienceProcess","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":12428,"url":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/","url_meta":{"origin":12352,"position":4},"title":"Ask a Data Scientist: Data Leakage","date":"November 26, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. This week\u2019s question\u2026","rel":"","context":"In "Ask a Data Scientist"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12294,"url":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","url_meta":{"origin":12352,"position":5},"title":"Ask a Data Scientist: Handling Missing Data","date":"October 29, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who seeks a discussion of missing data handling methods such as imputation.","rel":"","context":"In "Ask a Data 
Scientist"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12352"}],"collection":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/comments?post=12352"}],"version-history":[{"count":0,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12352\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media\/12353"}],"wp:attachment":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media?parent=12352"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/categories?post=12352"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/tags?post=12352"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}