How to Use IPython Notebook with Apache Spark
IPython Configuration
This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.
First create an IPython profile for use with PySpark.
ipython profile create pyspark
This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880 # or whatever you want; be aware of conflicts with CDH
If you want a password prompt as well, first generate a password for the notebook app:
python -c 'from IPython.lib import passwd; print passwd()' > ~/.ipython/profile_pyspark/nbpasswd.txt
and set the following in the same .../ipython_notebook_config.py file you just edited:
import os.path
PWDFILE = '~/.ipython/profile_pyspark/nbpasswd.txt'
c.NotebookApp.password = open(os.path.expanduser(PWDFILE)).read().strip()
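For reference, passwd() stores a salted hash of the passphrase rather than the plain text, and that hash is what the notebook server compares against at login. A minimal sketch of what it produces, assuming you pass the passphrase directly to skip the interactive prompt (the passphrase below is purely illustrative):

from IPython.lib import passwd

# Passing a string skips the interactive prompt; the result is a salted
# hash of the form 'sha1:<salt>:<digest>', never the plain-text passphrase.
hashed = passwd('example-passphrase')
print hashed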
Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following contents:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
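Because shell.py is executed at startup, any notebook later opened with this profile begins with a live SparkContext bound to the name sc. As a quick sanity check, the first cell of a new notebook might look like the following minimal sketch (the numbers are just a toy workload):

# First notebook cell: confirm the SparkContext created by pyspark/shell.py.
rdd = sc.parallelize(range(1000))
print rdd.map(lambda x: x * x).sum()   # sum of squares 0..999 = 332833500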
Starting IPython Notebook with PySpark
IPython Notebook should be run on the machine from which you would normally run PySpark, typically one of the Hadoop nodes.
First, make sure the following environment variables are set:
# for the CDH-installed Spark
export SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark'
# this is where you specify all the options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --num-executors 24 --executor-memory 10g --executor-cores 5'
Note that you must set whatever other environment variables you want to get Spark running the way you desire. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR on HDFS, and set the SPARK_JAR environment variable, along with any other necessary parameters. For example, see here for running a custom Spark on YARN.
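Since a misconfigured environment is a common source of startup failures, it can help to confirm that the notebook process actually sees these variables. A minimal check, run for example in a notebook cell:

import os

# Print the Spark-related environment as seen by the notebook process.
print os.environ.get('SPARK_HOME')
print os.environ.get('PYSPARK_SUBMIT_ARGS')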
Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:
ipython notebook --profile=pyspark
Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.
Example Session
At this point, the IPython Notebook server should be running. Point your browser to http://<notebook-host>:8880 (or whichever port you configured above), which should open up the main access point to the available notebooks.
This page shows the list of .ipynb files available to serve. If it is empty (because this is the first time you're running the server), you can create a new notebook, which will also create a new .ipynb file. As an example, here is a session that uses PySpark to analyze the GDELT event data set:
The full .ipynb file can be obtained as a GitHub gist.
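The gist contains the full analysis; as a rough, hypothetical sketch of what such a session might start with (the HDFS path and field positions below are illustrative, not the ones used in the gist), a first pass over the data could count events per year:

# Hypothetical example: count GDELT events per year from a tab-delimited
# extract on HDFS. The path and field positions are illustrative only.
lines = sc.textFile('hdfs:///data/gdelt/events/*.tsv')
fields = lines.map(lambda line: line.split('\t'))
by_year = fields.map(lambda f: (f[1][:4], 1)).reduceByKey(lambda a, b: a + b)
for year, count in sorted(by_year.collect()):
    print year, count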