Pycharm - youdar/How-to GitHub Wiki
Some cloud solutions use Jupyter notebooks or other tools that are fine for experimenting but may not be as convenient and efficient for solution development as PyCharm.
how-to-use-pycharm-with-google-compute-engine
PyCharm notes
- Use puttygen to generate SSH-2 RSA private and public keys (do not use a passphrase)
- Keep the private key somewhere on your Windows machine (you can create an ssh folder to store private keys)
- Put the public key in
"svn server name"/<username>/.ssh/authorized_keys
- Make sure putty is on the PATH environment variable
- Make sure the TortoiseSVN configuration is OK
- right-click on any folder and select TortoiseSVN -> Settings
- Go to Network and in the SSH Client field put "C:\Program Files\TortoiseSVN\bin\TortoisePlink.exe"
- add to %APPDATA%\Subversion\config, under [tunnels], the line ssh = "C:/Program Files/TortoiseSVN/bin/TortoisePlink.exe"
- Use putty to create ssh connection for the SVN
- Start putty
- In Session -> Host Name enter: <user_name>@"svn server name"
- In Connection -> SSH -> Authentication -> Private key file for authentication: browse to the private key you generated in the first step
- Save the session
- Use Pageant to save the passphrase for automatic login
- Save the session as 'xxx_svn'; it will be used when checking out files from the xxx svn servers
Now when you update and commit from PyCharm, it will automatically use the
From the command line, you can check out using:
svn co svn+ssh://xxx_svn/repository_name
When developing on a Cloudera Hadoop cluster, you need to:
- set up the project environment (remote environment)
- set Tools -> Deployment -> Configuration with the Hadoop cluster information (this allows copying the code to the cluster)
To find where Spark is installed on your Hadoop cluster, use
cd /etc/spark/conf
cat spark-env.sh
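If you prefer to pick out the relevant settings programmatically rather than reading the output of cat, the following is a minimal sketch. It assumes spark-env.sh contains plain NAME=value or export NAME=value lines; the helper name read_shell_assignments is hypothetical, and shell expansions such as $VAR are returned as written, not evaluated.

```python
import os
import re

# Matches simple lines such as: export SPARK_HOME=/opt/... or NAME="value"
_ASSIGNMENT = re.compile(r'^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def read_shell_assignments(text):
    """Return a dict of simple NAME=value assignments found in shell text."""
    values = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = _ASSIGNMENT.match(line)
        if match:
            name, raw = match.groups()
            # Drop surrounding whitespace and quotes; no shell expansion is done.
            values[name] = raw.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    conf_path = "/etc/spark/conf/spark-env.sh"  # location used in the text above
    if os.path.exists(conf_path):
        with open(conf_path) as handle:
            settings = read_shell_assignments(handle.read())
        for name, value in sorted(settings.items()):
            if "SPARK" in name or "PYTHON" in name:
                print("{} = {}".format(name, value))
```

Run it on the cluster node itself; on a machine without /etc/spark/conf/spark-env.sh it prints nothing.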
Then in PyCharm go to Settings -> Project Interpreter, press the settings cogwheel -> More -> Show paths for the selected interpreter, and add the following paths (adjust them to your PySpark location):
/opt/cloudera/parcels/CDH/lib/spark
/opt/cloudera/parcels/CDH/lib/spark/python
/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip
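For a quick check outside PyCharm, the same paths can be added at runtime. This is only a sketch of an alternative to the interpreter-paths dialog, not a replacement for it; the function names are hypothetical, and the py4j zip is matched with a wildcard on the assumption that its version number differs between Spark releases.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the PySpark source directory and any py4j source zips under it."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version varies, so match any py4j-*-src.zip instead of 0.9 only.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend the PySpark paths to sys.path and return the ones that were added."""
    added = []
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


if __name__ == "__main__":
    # Default taken from the paths listed above; override with SPARK_HOME if set.
    home = os.environ.get("SPARK_HOME", "/opt/cloudera/parcels/CDH/lib/spark")
    for entry in add_spark_to_sys_path(home):
        print("added", entry)
```

After calling add_spark_to_sys_path, an import pyspark should succeed in the same process, provided the paths exist on that machine.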
Then, when running Debug, you might need to add the environment variable
SPARK_HOME with the value /opt/cloudera/parcels/CDH/lib/spark
(go to Run -> Edit Configurations).
Add the following to the .bash_profile, .bash_login, or .profile file on the Hadoop node
export PYSPARK_PYTHON="/usr/bin/python"
export SPARK_HOME="/opt/cloudera/parcels/CDH/lib/spark"
PYTHONPATH="/usr/bin/python"
PYTHONPATH="${PYTHONPATH}:$SPARK_HOME/python"
PYTHONPATH="${PYTHONPATH}:$SPARK_HOME/python/lib/py4j-0.9-src.zip"
export PYTHONPATH
The node that we are using must be both a YARN gateway and a Spark gateway.
Note: To deploy the Python program, you should use spark-submit.
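As an illustration of the spark-submit note, here is a small sketch that assembles the command line. The script name my_job.py and its arguments are hypothetical, and the defaults (--master yarn, --deploy-mode client) are an assumption about a typical gateway-node setup rather than something this page prescribes.

```python
def build_spark_submit_command(script_path, script_args=(), master="yarn",
                               deploy_mode="client"):
    """Build the argument list for running a Python script with spark-submit."""
    command = [
        "spark-submit",
        "--master", master,            # cluster manager to submit to
        "--deploy-mode", deploy_mode,  # "client" keeps the driver on this node
        script_path,
    ]
    command.extend(script_args)        # everything after the script goes to the script
    return command


if __name__ == "__main__":
    # my_job.py is a placeholder name, not a file from this wiki.
    cmd = build_spark_submit_command("my_job.py", ["--date", "2016-01-01"])
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.check_call on a gateway node, or the printed line can be pasted into a shell.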
Verify that Anaconda is installed on your Cloudera Hadoop cluster:
check that /opt/cloudera/parcels/Anaconda exists.
Add the following to the .bash_profile, .bash_login, or .profile file on the Hadoop node,
instead of the PYTHONPATH and PYSPARK_PYTHON settings above:
export PYSPARK_PYTHON="/opt/cloudera/parcels/Anaconda/bin/python"
export PYTHONPATH=/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages
export PATH=/opt/cloudera/parcels/Anaconda/bin:$PATH
In Settings -> Tools -> Python Integrated Tools,
change the Docstring format to Google.
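For reference, a Google-style docstring looks roughly like the sketch below. The function word_count is a made-up example, not code from this project; with the setting above, PyCharm generates the Args and Returns skeleton when you type the opening triple quotes under a function definition.

```python
def word_count(lines):
    """Count how often each word occurs in an iterable of text lines.

    Args:
        lines (iterable of str): Text lines to split on whitespace.

    Returns:
        dict: Mapping from each word to its number of occurrences.
    """
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```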