<p>Foreword xiii</p> <p>Preface xv</p> <p>Acknowledgments xxi</p> <p>About the Authors xxiii</p> <p>Part I: Data Science with Hadoop—An Overview 1</p> <p>Chapter 1: Introduction to Data Science 3</p> <p>What Is Data Science? 3</p> <p>Example: Search Advertising 4</p> <p>A Bit of Data Science History 5</p> <p>Becoming a Data Scientist 8</p> <p>Building a Data Science Team 12</p> <p>The Data Science Project Life Cycle 13</p> <p>Managing a Data Science Project 18</p> <p>Summary 18</p> <p><strong>Chapter 2: Use Cases for Data Science 19</strong></p> <p>Big Data—A Driver of Change 19</p> <p>Business Use Cases 21</p> <p>Summary 29</p> <p><strong>Chapter 3: Hadoop and Data Science 31</strong></p> <p>What Is Hadoop? 31</p> <p>Hadoop’s Evolution 37</p> <p>Hadoop Tools for Data Science 38</p> <p>Why Hadoop Is Useful to Data Scientists 46</p> <p>Summary 51</p> <p>Part II: Preparing and Visualizing Data with Hadoop 53</p> <p>Chapter 4: Getting Data into Hadoop 55</p> <p>Hadoop as a Data Lake 56</p> <p>The Hadoop Distributed File System (HDFS) 58</p> <p>Direct File Transfer to Hadoop HDFS 58</p> <p>Importing Data from Files into Hive Tables 59</p> <p>Importing Data into Hive Tables Using Spark 62</p> <p>Using Apache Sqoop to Acquire Relational Data 65</p> <p>Using Apache Flume to Acquire Data Streams 74</p> <p>Manage Hadoop Work and Data Flows with Apache Oozie 79</p> <p>Apache Falcon 81</p> <p>What’s Next in Data Ingestion? 82</p> <p>Summary 82</p> <p><strong>Chapter 5: Data Munging with Hadoop 85</strong></p> <p>Why Hadoop for Data Munging? 86</p> <p>Data Quality 86</p> <p>The Feature Matrix 93</p> <p>Summary 106</p> <p><strong>Chapter 6: Exploring and Visualizing Data 107</strong></p> <p>Why Visualize Data? 
107</p> <p>Creating Visualizations 112</p> <p>Using Visualization for Data Science 121</p> <p>Popular Visualization Tools 121</p> <p>Visualizing Big Data with Hadoop 123</p> <p>Summary 124</p> <p>Part III: Applying Data Modeling with Hadoop 125</p> <p>Chapter 7: Machine Learning with Hadoop 127</p> <p>Overview of Machine Learning 127</p> <p>Terminology 128</p> <p>Task Types in Machine Learning 129</p> <p>Big Data and Machine Learning 130</p> <p>Tools for Machine Learning 131</p> <p>The Future of Machine Learning and Artificial Intelligence 132</p> <p>Summary 132</p> <p><strong>Chapter 8: Predictive Modeling 133</strong></p> <p>Overview of Predictive Modeling 133</p> <p>Classification Versus Regression 134</p> <p>Evaluating Predictive Models 136</p> <p>Supervised Learning Algorithms 140</p> <p>Building Big Data Predictive Model Solutions 141</p> <p>Example: Sentiment Analysis 145</p> <p>Summary 150</p> <p><strong>Chapter 9: Clustering 151</strong></p> <p>Overview of Clustering 151</p> <p>Uses of Clustering 152</p> <p>Designing a Similarity Measure 153</p> <p>Clustering Algorithms 154</p> <p>Example: Clustering Algorithms 155</p> <p>Evaluating the Clusters and Choosing the Number of Clusters 157</p> <p>Building Big Data Clustering Solutions 158</p> <p>Example: Topic Modeling with Latent Dirichlet Allocation 160</p> <p>Summary 163</p> <p><strong>Chapter 10: Anomaly Detection with Hadoop 165</strong></p> <p>Overview 165</p> <p>Uses of Anomaly Detection 166</p> <p>Types of Anomalies in Data 166</p> <p>Approaches to Anomaly Detection 167</p> <p>Tuning Anomaly Detection Systems 170</p> <p>Building a Big Data Anomaly Detection Solution with Hadoop 171</p> <p>Example: Detecting Network Intrusions 172</p> <p>Summary 179</p> <p><strong>Chapter 11: Natural Language Processing 181</strong></p> <p>Natural Language Processing 181</p> <p>Tooling for NLP in Hadoop 184</p> <p>Textual Representations 187</p> <p>Sentiment Analysis Example 189</p> <p>Summary 193</p> <p><strong>Chapter 
12: Data Science with Hadoop—The Next Frontier 195</strong></p> <p>Automated Data Discovery 195</p> <p>Deep Learning 197</p> <p>Summary 199</p> <p>Appendix A: Book Web Page and Code Download 201</p> <p>Appendix B: HDFS Quick Start 203</p> <p>Quick Command Reference 204</p> <p><strong>Appendix C: Additional Background on Data Science and Apache Hadoop and Spark 209</strong></p> <p>General Hadoop/Spark Information 209</p> <p>Hadoop/Spark Installation Recipes 210</p> <p>HDFS 210</p> <p>MapReduce 211</p> <p>Spark 211</p> <p>Essential Tools 211</p> <p>Machine Learning 212</p> <p>Index 213</p>