How I set up my first Natural Language Processing (NLP) project with SparkNLP

November 6, 2020 | 5 minute read
Rajesh Chawla
Principal Cloud Architect

I kept hearing buzz about Natural Language Processing (NLP), from the advances with BERT, to the number of use cases such as sentiment analysis, marketing, health care, and HR & recruiting. Gartner noted it as Data & Analytics Trend #3 in 2019, while in 2020, NLP was folded in with machine learning as Trend #1.

I wanted to explore NLP, but I found it annoyingly difficult to get started. Three items in particular slowed me down: 1) choosing the NLP library, 2) finding the right model, and 3) configuring the development environment.

I started by choosing an NLP library. I end up choosing between disparate open source libraries so often that I created a rubric to make things easier for myself. Each time I use the rubric, I choose the weighting that makes sense for the current project. For example, if I'm working on a project that may end up in a commercial product, the production readiness of the library becomes more important. For this choice, I weighted the criteria as below (a small scoring sketch follows the list).

  • 30 - Production ready
  • 20 - Features
  • 15 - Who is behind the library
  • 15 - Support
  • 15 - Community
  • 5 - Fit in current development workflow
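
To make the weighting concrete, here is a small, purely illustrative sketch of how the rubric turns per-criterion scores into a single number. The weights match the list above; the candidate scores (1-10 per criterion) are hypothetical, not my actual evaluation.

# Weights from the rubric above; candidate scores are made up for illustration.
WEIGHTS = {
    "production_ready": 30,
    "features": 20,
    "who_is_behind_it": 15,
    "support": 15,
    "community": 15,
    "workflow_fit": 5,
}

def weighted_score(scores):
    """Weighted sum of per-criterion scores (each scored 1-10)."""
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)

candidate = {"production_ready": 9, "features": 8, "who_is_behind_it": 8,
             "support": 7, "community": 8, "workflow_fit": 9}
print(weighted_score(candidate))  # 820 out of a possible 1000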

There are many open source alternatives for NLP. Most of them I disregarded because I did not see enough evidence of scaling, which is a key part of production readiness for me. The two libraries that made my final candidate list were spaCy and SparkNLP. Both did well in my criteria ranking, but I had an additional requirement in the project to use Apache Spark, so SparkNLP won the day.

Another issue I had with being productive in NLP was finding models to start with. I don't have the capacity to train base models. Ideally, the model I start with can be trained incrementally for my specific domain. The startup Hugging Face made finding and using a model much easier than I thought it could be. As an example, searching for models in the clinical realm turned up Bio_ClinicalBERT, a model published by Emily Alsentzer, a PhD student at MIT / Harvard Medical. Loading the model in a Python notebook for PyTorch or TensorFlow is also straightforward after installing a pip package (transformers). The Python code is:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

Really? That definitely fits my definition of productivity, and, by the way, it works.
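
To sanity-check the download, here is a minimal sketch of running the loaded model on a single sentence. It assumes the PyTorch backend, and the sample sentence is my own.

import torch

# Tokenize a sample sentence and run it through Bio_ClinicalBERT
inputs = tokenizer("The patient was discharged on warfarin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)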

After finding reasonable solutions for these two productivity sinks, I found a third item that was hampering my productivity: I was creating the same IaaS environment multiple times. So, I automated the base installation using Terraform. I used the Data Science VM in the OCI Marketplace. You can read more about how the network, security, disks, and Jupyter notebooks are set up in my previous blog on deploying ML environments in Oracle Cloud. This update to the base project adds automation for installing Scala, Apache Hadoop, Apache Spark, and SparkNLP. Note that the Apache Spark automation does not include a multi-node (master/worker) configuration; a single node is enough for the experiments I wanted to run.

The Terraform code is found in the repo at https://github.com/oracle-quickstart/oci-gpu-jupyter. There are a couple of variable settings to take note of: 1) script_to_run and 2) github_repo. The script_to_run setting triggers the configure_jupyter_sparknlp script (located in the scripts folder) to run on the newly created VM. The github_repo setting specifies which GitHub repository should be automatically downloaded during provisioning.

If you are familiar with Terraform, this should be an easy configuration for you. If you're not, you can review introductory materials from HashiCorp or check out this course about Terraform on OCI. After you’ve downloaded the Terraform code, the next steps boil down to:

  • Get an OCI account -- if you don't have one, check out the free trial
  • Configure credentials, including public / private keys
  • Tweak the configuration files ( primarily variables.tf ) for your use case
  • Initialize Terraform ( terraform init )
  • Deploy resources using Terraform ( terraform apply )

The output I see from the 'terraform apply' command is below. Your mileage will vary.

Next_Steps = Create an ssh tunnel to the Jupyter VM and then open your browser to http://localhost:8081
ssh_to_notebook_vm = ssh -i ~/.ssh/rajesh_key opc@158.101.98.90

ssh_tunnel_for_notebook = ssh -i ~/.ssh/rajesh_key -L 8081:localhost:8080 opc@158.101.98.90

Next, we can open a browser and go to http://localhost:8081, where you should see the Jupyter login page. The password for Jupyter is configured in the variables.tf file in the jupyter_password variable.

After successfully logging in, you should see the Jupyter notebooks home screen.

Now that I’ve verified the Jupyter notebook is running, I verify that SparkNLP is configured properly. Because I’ve set the github_repo variable in variables.tf to https://github.com/JohnSnowLabs/spark-nlp-workshop, the John Snow Labs workshop repository is downloaded automatically, and I can open http://localhost:8081/notebooks/repo/jupyter/quick_start.ipynb in my browser.
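
Before running the notebook, a quick way to confirm the installation from any notebook cell is a version check. This is just a minimal sketch of my own, not part of the workshop material.

import sparknlp

# Start (or attach to) a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)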

The automation has already installed and configured Java, pyspark, and spark-nlp, so I skip executing the first cell. The quick_start.ipynb notebook downloads a pre-trained pipeline and performs Named Entity Recognition (NER) on a small piece of text; it also shows how sentiment analysis can be done. There are quite a few additional notebooks in the spark-nlp-workshop repo covering items such as language detection, explaining documents, and recognizing entities, with bindings for Jupyter, Zeppelin, Java, and Scala. This is an astonishing amount of functionality. I'm particularly enamored with the fact that, as I need to scale my training environment, SparkNLP can leverage the power of Apache Spark, and I can write my code in Java, Scala, or Python per my project needs.
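
To give a flavor of what the notebook does, here is a minimal sketch of downloading a pre-trained pipeline and running NER and sentiment analysis. It assumes the Spark session started above is still active; the pipeline names (explain_document_dl, analyze_sentiment) are the ones used in the workshop notebooks, and the sample sentences are my own.

from sparknlp.pretrained import PretrainedPipeline

# Download a pre-trained pipeline (cached locally after the first run)
# and run Named Entity Recognition on a small piece of text.
ner_pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = ner_pipeline.annotate("Rajesh deployed a Jupyter VM on Oracle Cloud in November 2020.")
print(result["entities"])

# Sentiment analysis works the same way with a different pre-trained pipeline.
sentiment_pipeline = PretrainedPipeline("analyze_sentiment", lang="en")
print(sentiment_pipeline.annotate("SparkNLP made this much easier than I expected.")["sentiment"])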

So, now I finally have the tools I want so that I can iterate quickly. The tools are:

  • An NLP library ( SparkNLP ) that has benchmarks to show scaling to the size of datasets I will use
  • A convenient way to create and destroy development environments
  • A good way to find pre-trained models

Now, I can finally get to work on my use case. I believe this automation will reduce the time to start my next NLP task, and hopefully it will do the same for you.

Rajesh Chawla

Principal Cloud Architect

Principal Cloud Solution Architect at Oracle focused on machine learning & IaaS

