
Bluejay

This repository contains a variety of data analysis scripts. Projects are launched with deploy.py, which can build, test, upload the jar file, start an EMR Spark cluster, and run the job. The file config_example.py must first be copied to config.py with the values filled in for the AWS account. An example command for building, uploading, starting a cluster, and submitting the new job for the word_count project:

python deploy.py --project word_count --build --job_upload --job_submit_cluster_full

NLP

The first project is for natural language processing using Spark and CoreNLP on the 2015 Reddit comments corpus. More information on this is available in this blog post.

The Spark and CoreNLP code is in the nlp project. Since it is a hefty processing step, CoreNLP runs first, with its training data added to the jar uploaded to EMR. This step performs tokenization, tagging, stemming, and named entity recognition. Input and output S3 folders are listed in deploy.py. For local testing, which can also be done through deploy.py, the file RC_2015-05 can be used or swapped for a file with more data.
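
As a rough sketch of what that annotation step looks like, the following runs a CoreNLP pipeline over a single comment in Scala. The annotator list and the example text are assumptions for illustration, not the exact configuration used in the nlp project.

import java.util.Properties
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import scala.collection.JavaConverters._

object AnnotateExample {
  def main(args: Array[String]): Unit = {
    // Assumed annotator set covering the steps described above:
    // tokenization, sentence splitting, tagging, lemmas, and NER
    val props = new Properties()
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
    val pipeline = new StanfordCoreNLP(props)

    // Hypothetical comment text
    val doc = new CoreDocument("Google announced a new Android release today.")
    pipeline.annotate(doc)

    // Print each token with its part-of-speech tag, lemma, and NER label
    for (token <- doc.tokens().asScala) {
      println(s"${token.word()} ${token.tag()} ${token.lemma()} ${token.ner()}")
    }
  }
}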

The next step in the machine learning pipeline is the word_count project. It performs the simpler text parsing and outputs JSON with the organization, subreddit, count, direct adjectives, and connected adjectives to be loaded into react-tabulator. The file parse.py performs the conversion from Spark output to table input JSON.
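
A minimal sketch of the counting step is below. It assumes the NLP stage emitted tab-separated (subreddit, token, NER tag) rows; the input format, S3 paths, and output fields are illustrative, and the real project also extracts the adjective fields.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word_count"))

    // Hypothetical input: tab-separated (subreddit, token, NER tag) rows
    // produced by the nlp step
    val counts = sc.textFile("s3://input-bucket/nlp-output/")
      .map(_.split("\t"))
      .collect { case Array(subreddit, token, ner) if ner == "ORGANIZATION" =>
        ((token, subreddit), 1L)
      }
      .reduceByKey(_ + _)

    // One JSON object per (organization, subreddit) pair; the real output
    // also carries the adjective fields
    counts
      .map { case ((org, sub), n) =>
        s"""{"organization": "$org", "subreddit": "$sub", "count": $n}"""
      }
      .saveAsTextFile("s3://output-bucket/word-count/")

    sc.stop()
  }
}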

Word2Vec

The next project runs Spark's Word2Vec on the Reddit comments corpus. More information on this is available in this blog post.

This continues the pipeline from word_count: it runs Spark's Word2Vec and outputs organization, subreddit, count, and similarities, where each similarity lists a word and its cosine similarity score. As before, parse.py can be used to create the output JSON for react-tabulator.
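
As a minimal sketch of this step, the following trains Spark MLlib's Word2Vec on tokenized comments and queries the nearest words by cosine similarity. The S3 path, vector size, and query word are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word2vec"))

    // Hypothetical input: one tokenized comment per line
    val sentences = sc.textFile("s3://input-bucket/comments/")
      .map(_.split(" ").toSeq)

    val model = new Word2Vec().setVectorSize(100).fit(sentences)

    // findSynonyms returns (word, cosine similarity) pairs, which is the
    // "similarities" output described above
    model.findSynonyms("google", 10).foreach { case (word, cosine) =>
      println(s"$word $cosine")
    }

    sc.stop()
  }
}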