User:KartikMistry/TPA
Initial setup
Log in to stat1007:
ssh stat1007
Setup proxy
export https_proxy=https://webproxy.eqiad.wmnet:8080
export http_proxy=http://webproxy.eqiad.wmnet:8080
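Before cloning, it can help to confirm that both proxy variables are actually set in the current shell. The following is a minimal sketch; `check_proxy` is a hypothetical helper name, not part of any Wikimedia tooling.

```shell
# Hypothetical helper: report whether both proxy variables are set.
# Returns 0 if both are set, 1 if at least one is missing.
check_proxy() {
    missing=0
    for var in http_proxy https_proxy; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set"
            missing=1
        else
            echo "$var=$value"
        fi
    done
    return "$missing"
}
```

Running `check_proxy` after the two `export` lines should print both values; a non-zero exit status means one of them is missing.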
Setup repository
git clone https://github.com/digitalTranshumant/templatesAlignment.git
cd templatesAlignment
Create virtualenv
virtualenv --python=/usr/bin/python3 python3
Activate the virtual environment:
source python3/bin/activate
Now, install jupyter notebook:
pip install jupyter
Next, add the following lines to your .profile file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=/usr/bin/python3.7
export PYSPARK_PYTHON=/srv/home/USER/python3/bin/python
You can additionally add these two lines to make your life easier:
alias venvspark="source python3/bin/activate; source ~/.profile"
alias startspark="pyspark2 --master yarn --deploy-mode client --executor-memory 8g --driver-memory 8g --conf spark.dynamicAllocation.maxExecutors=128"
Close the session, and you will have everything configured.
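If you repeat this setup, appending the same lines to .profile again creates duplicates. One way to avoid that is sketched below; `append_once` is a hypothetical helper, and the commented usage assumes the file is `~/.profile`.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
append_once() {
    line=$1
    file=$2
    # grep flags: -q quiet, -x match the whole line, -F treat the pattern as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example usage (uncomment to apply to your own profile):
# append_once "export PYSPARK_DRIVER_PYTHON=jupyter" "$HOME/.profile"
# append_once "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'" "$HOME/.profile"
```

Calling the helper twice with the same line leaves a single copy in the file.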
Starting notebook
Check when your Kerberos ticket expires first. The default ticket lifetime is currently 48 hours.
klist
Extend it by running kinit:
kinit
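The two steps above can be combined so that kinit only runs when it is needed. This is a sketch assuming the MIT Kerberos `klist`, whose `-s` option prints nothing and reports ticket validity through its exit status; `ensure_ticket` is a hypothetical helper name.

```shell
# Hypothetical helper: renew the Kerberos ticket only when needed.
ensure_ticket() {
    # `klist -s` exits non-zero if the credential cache is missing or expired.
    if klist -s; then
        echo "Kerberos ticket is still valid"
    else
        echo "No valid Kerberos ticket, running kinit"
        kinit
    fi
}
```

Running `ensure_ticket` before starting the notebook prompts for a password only when the ticket is missing or has expired.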
Now you can log in again; from here on you only need to run:
venvspark
startspark
Press ESC, then check which port the Jupyter notebook is running on (usually 8888 or 8889; this example uses 8889).
Then, on your local machine, create a tunnel by running:
ssh -N stat1007 -L 8889:127.0.0.1:8889
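Because the port can differ between sessions, the tunnel command can be generated from the port number. This is a small sketch; `tunnel_cmd` is a hypothetical helper, and its defaults simply mirror the example above.

```shell
# Hypothetical helper: print the tunnel command for a given notebook port.
tunnel_cmd() {
    port=${1:-8889}      # port reported by Jupyter on the remote host
    host=${2:-stat1007}  # host running the notebook
    printf 'ssh -N %s -L %s:127.0.0.1:%s\n' "$host" "$port" "$port"
}
```

For example, `tunnel_cmd 8888` prints the command for port 8888, which you can then run on your local machine.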
You can then open the notebook in your browser on the forwarded local port (8889 in this example).
Running scripts
1. Run all notebooks in order.
2. 00ExtractNamedTempates.ipynb overwrites its existing output if it is run again, so it is better to save the JSON files it produces somewhere else to avoid regenerating them.
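One way to keep the generated JSON files safe before re-running the first notebook is to copy them to a separate directory. The sketch below assumes the files end in `.json`; `backup_json` and both directory arguments are hypothetical, since the page does not say where the notebook writes its output.

```shell
# Hypothetical helper: copy all JSON files from one directory into a backup directory.
backup_json() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for f in "$src"/*.json; do
        # Skip the unexpanded pattern when the directory has no JSON files.
        [ -e "$f" ] || continue
        cp -p "$f" "$dest"/
    done
}
```

A call such as `backup_json . ~/tpa-backup` (directory name chosen only for illustration) copies the JSON files from the current directory and leaves other files untouched.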
Notes
1. 02alignmentsSpark.py cannot be run on a local machine.
2. If running locally, `01Download Models.py` needs to be run with IPython; alternatively, just download the needed models.
3. The fastText_multilingual module is available at: https://github.com/babylonhealth/fastText_multilingual
Apply the patch given at #23 to fix the ModuleNotFound error when running the script.
4. 03ProduceAlignments.py requires https://github.com/facebookresearch/fastText/tree/master/python instead of the version provided by pip.
Also see
- Issues related to Kerberos access: https://wikitech.wikimedia.org/wiki/SWAP#Access_and_infrastructure