
Bluejay

This repository contains a variety of data analysis scripts. Projects are launched with deploy.py, which can build, test, upload the jar file, start an EMR Spark cluster, and run the job. The file config_example.py must first be copied to config.py with the values filled in for the AWS account. An example command for building, uploading, starting a cluster, and submitting the new job for the word_count project:

python deploy.py --project word_count --build --job_upload --job_submit_cluster_full

NLP

The first project is for natural language processing using Spark and CoreNLP on the 2015 Reddit comments corpus. More information on this is available in this blog post.

The Spark and CoreNLP code is in the nlp project. Since it is a hefty processing step, CoreNLP runs first, with its training data added to the jar uploaded to EMR. This step performs tokenization, tagging, stemming, and named entity recognition. Input and output S3 folders are listed in deploy.py. For local testing, which can also be done through deploy.py, the file RC_2015-05 can be used or swapped for a file with more data.
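
As a rough sketch of what that annotation step looks like, the following runs a CoreNLP pipeline over a single comment in Scala. The annotator list and the example text are assumptions for illustration, not the exact configuration used in the nlp project.

import java.util.Properties
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import scala.collection.JavaConverters._

object AnnotateExample {
  def main(args: Array[String]): Unit = {
    // Assumed annotator set covering the steps described above:
    // tokenization, sentence splitting, tagging, lemmas, and NER
    val props = new Properties()
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
    val pipeline = new StanfordCoreNLP(props)

    // Hypothetical comment text
    val doc = new CoreDocument("Google announced a new Android release today.")
    pipeline.annotate(doc)

    // Print each token with its part-of-speech tag, lemma, and NER label
    for (token <- doc.tokens().asScala) {
      println(s"${token.word()} ${token.tag()} ${token.lemma()} ${token.ner()}")
    }
  }
}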

The next step in the machine learning pipeline is the word_count project. It performs the simpler text parsing and outputs JSON with the organization, subreddit, count, direct adjectives, and connected adjectives to be loaded into react-tabulator. The file parse.py performs the conversion from Spark output to table input JSON.
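
A minimal sketch of the counting step is below. It assumes the NLP stage emitted tab-separated (subreddit, token, NER tag) rows; the input format, S3 paths, and output fields are illustrative, and the real project also extracts the adjective fields.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word_count"))

    // Hypothetical input: tab-separated (subreddit, token, NER tag) rows
    // produced by the nlp step
    val counts = sc.textFile("s3://input-bucket/nlp-output/")
      .map(_.split("\t"))
      .collect { case Array(subreddit, token, ner) if ner == "ORGANIZATION" =>
        ((token, subreddit), 1L)
      }
      .reduceByKey(_ + _)

    // One JSON object per (organization, subreddit) pair; the real output
    // also carries the adjective fields
    counts
      .map { case ((org, sub), n) =>
        s"""{"organization": "$org", "subreddit": "$sub", "count": $n}"""
      }
      .saveAsTextFile("s3://output-bucket/word-count/")

    sc.stop()
  }
}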

Word2Vec

The next project runs Spark's Word2Vec on the Reddit comments corpus. More information on this is available in this blog post.

This continues the pipeline from word_count: it runs Spark's Word2Vec and outputs organization, subreddit, count, and similarities, where each similarity lists a word and its cosine similarity score. As before, parse.py can be used to create the output JSON for react-tabulator.
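
As a minimal sketch of this step, the following trains Spark MLlib's Word2Vec on tokenized comments and queries the nearest words by cosine similarity. The S3 path, vector size, and query word are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word2vec"))

    // Hypothetical input: one tokenized comment per line
    val sentences = sc.textFile("s3://input-bucket/comments/")
      .map(_.split(" ").toSeq)

    val model = new Word2Vec().setVectorSize(100).fit(sentences)

    // findSynonyms returns (word, cosine similarity) pairs, which is the
    // "similarities" output described above
    model.findSynonyms("google", 10).foreach { case (word, cosine) =>
      println(s"$word $cosine")
    }

    sc.stop()
  }
}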