Data Analytics with Spark Using Python

Register your product to gain access to bonus material or receive a coupon.

Best Value Purchase

Book + eBook Bundle

  • Your Price: $48.59
  • List Price: $80.98
  • Includes EPUB and PDF
  • About eBook Formats
  • This eBook includes the following formats, accessible from your Account page after purchase:

    EPUB: The open industry format known for its reflowable content and usability on supported mobile devices.

    PDF: The popular standard, used most often with the free Adobe® Reader® software.

    This eBook requires no passwords or activation to read. We customize your eBook by discreetly watermarking it with your name, making it uniquely yours.

More Purchase Options

Book

  • Your Price: $35.99
  • List Price: $44.99
  • Usually ships in 24 hours.

eBook (Watermarked)

  • Your Price: $28.79
  • List Price: $35.99
  • Includes EPUB and PDF

About

Features

Coverage includes:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib

Description

  • Copyright 2018
  • Dimensions: 7" x 9-1/8"
  • Pages: 320
  • Edition: 1st
  • Book
  • ISBN-10: 0-13-484601-X
  • ISBN-13: 978-0-13-484601-9

Solve Data Analytics Problems with Spark, PySpark, and Related Open Source Tools

Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.

Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples utilizing the popular and intuitive PySpark development environment. This guide’s focus on Python makes it widely accessible to large audiences of data professionals, analysts, and developers—even those with little Hadoop or Spark experience.

Aven’s broad coverage ranges from basic to advanced Spark programming, and Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.

Extras

Author's Site

Please visit the author's sites at sparkusingpython.com and https://github.com/sparktraining/spark_using_python.

Sample Content

Online Sample Chapter

How Applications are Executed on a Spark Cluster

Sample Pages

Download the sample pages (includes Chapter 3)

Table of Contents

Preface xi
Introduction 1

PART I: SPARK FOUNDATIONS
Chapter 1 Introducing Big Data, Hadoop, and Spark 5

Introduction to Big Data, Distributed Computing, and Hadoop 5
A Brief History of Big Data and Hadoop 6
Hadoop Explained 7
Introduction to Apache Spark 13
Apache Spark Background 13
Uses for Spark 14
Programming Interfaces to Spark 14
Submission Types for Spark Programs 14
Input/Output Types for Spark Applications 16
The Spark RDD 16
Spark and Hadoop 16
Functional Programming Using Python 17
Data Structures Used in Functional Python Programming 17
Python Object Serialization 20
Python Functional Programming Basics 23
Summary 25
Chapter 2 Deploying Spark 27
Spark Deployment Modes 27
Local Mode 28
Spark Standalone 28
Spark on YARN 29
Spark on Mesos 30
Preparing to Install Spark 30
Getting Spark 31
Installing Spark on Linux or Mac OS X 32
Installing Spark on Windows 34
Exploring the Spark Installation 36
Deploying a Multi-Node Spark Standalone Cluster 37
Deploying Spark in the Cloud 39
Amazon Web Services (AWS) 39
Google Cloud Platform (GCP) 41
Databricks 42
Summary 43
Chapter 3 Understanding the Spark Cluster Architecture 45
Anatomy of a Spark Application 45
Spark Driver 46
Spark Workers and Executors 49
The Spark Master and Cluster Manager 51
Spark Applications Using the Standalone Scheduler 53
Spark Applications Running on YARN 53
Deployment Modes for Spark Applications Running on YARN 53
Client Mode 54
Cluster Mode 55
Local Mode Revisited 56
Summary 57
Chapter 4 Learning Spark Programming Basics 59
Introduction to RDDs 59
Loading Data into RDDs 61
Creating an RDD from a File or Files 61
Methods for Creating RDDs from a Text File or Files 63
Creating an RDD from an Object File 66
Creating an RDD from a Data Source 66
Creating RDDs from JSON Files 69
Creating an RDD Programmatically 71
Operations on RDDs 72
Key RDD Concepts 72
Basic RDD Transformations 77
Basic RDD Actions 81
Transformations on PairRDDs 85
MapReduce and Word Count Exercise 92
Join Transformations 95
Joining Datasets in Spark 100
Transformations on Sets 103
Transformations on Numeric RDDs 105
Summary 108

PART II: BEYOND THE BASICS
Chapter 5 Advanced Programming Using the Spark Core API 111

Shared Variables in Spark 111
Broadcast Variables 112
Accumulators 116
Exercise: Using Broadcast Variables and Accumulators 119
Partitioning Data in Spark 120
Partitioning Overview 120
Controlling Partitions 121
Repartitioning Functions 123
Partition-Specific or Partition-Aware API Methods 125
RDD Storage Options 127
RDD Lineage Revisited 127
RDD Storage Options 128
RDD Caching 131
Persisting RDDs 131
Choosing When to Persist or Cache RDDs 134
Checkpointing RDDs 134
Exercise: Checkpointing RDDs 136
Processing RDDs with External Programs 138
Data Sampling with Spark 139
Understanding Spark Application and Cluster Configuration 141
Spark Environment Variables 141
Spark Configuration Properties 145
Optimizing Spark 148
Filter Early, Filter Often 149
Optimizing Associative Operations 149
Understanding the Impact of Functions and Closures 151
Considerations for Collecting Data 152
Configuration Parameters for Tuning and Optimizing Applications 152
Avoiding Inefficient Partitioning 153
Diagnosing Application Performance Issues 155
Summary 159
Chapter 6 SQL and NoSQL Programming with Spark 161
Introduction to Spark SQL 161
Introduction to Hive 162
Spark SQL Architecture 166
Getting Started with DataFrames 168
Using DataFrames 179
Caching, Persisting, and Repartitioning DataFrames 187
Saving DataFrame Output 188
Accessing Spark SQL 191
Exercise: Using Spark SQL 194
Using Spark with NoSQL Systems 195
Introduction to NoSQL 196
Using Spark with HBase 197
Exercise: Using Spark with HBase 200
Using Spark with Cassandra 202
Using Spark with DynamoDB 204
Other NoSQL Platforms 206
Summary 206
Chapter 7 Stream Processing and Messaging Using Spark 209
Introducing Spark Streaming 209
Spark Streaming Architecture 210
Introduction to DStreams 211
Exercise: Getting Started with Spark Streaming 218
State Operations 219
Sliding Window Operations 221
Structured Streaming 223
Structured Streaming Data Sources 224
Structured Streaming Data Sinks 225
Output Modes 226
Structured Streaming Operations 227
Using Spark with Messaging Platforms 228
Apache Kafka 229
Exercise: Using Spark with Kafka 234
Amazon Kinesis 237
Summary 240
Chapter 8 Introduction to Data Science and Machine Learning Using Spark 243
Spark and R 243
Introduction to R 244
Using Spark with R 250
Exercise: Using RStudio with SparkR 257
Machine Learning with Spark 259
Machine Learning Primer 259
Machine Learning Using Spark MLlib 262
Exercise: Implementing a Recommender Using Spark MLlib 267
Machine Learning Using Spark ML 271
Using Notebooks with Spark 275
Using Jupyter (IPython) Notebooks with Spark 275
Using Apache Zeppelin Notebooks with Spark 278
Summary 279
Index 281

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service-related announcement. For instance, if our service is temporarily suspended for maintenance, we might send users an email. Generally, users may not opt out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services. For issues relating to their account, we reply via email or phone, in accordance with the user's wishes, when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as a postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and wishes to delete their account, please contact us at customer-service@informit.com and we will process the deletion of the account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s), simply visit the following page and uncheck any communication you no longer want to receive: www.e-skidka.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection with the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020