
VAUGHN PARKER

Data Scientist


Chess that Learns

For my final Metis project, I recreated two famous chess-playing programs, Deep Blue and AlphaZero, in order to learn about game-playing AI and Reinforcement Learning. And I sure learned a lot!

 

My Deep Blue-inspired AI used a more "brute force" approach, selecting the best move via Minimax, which I made much faster with alpha-beta pruning. For increased performance and speed, I played against my program on a powerful p3.8xlarge AWS instance with 32 vCPUs. I also learned how to use the Python multiprocessing package to parallelize my code, which let me run my tree search on multiple CPUs at once and sped up the process considerably.
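The core of the search looks roughly like the sketch below. This is a minimal, illustrative version (not my exact code) that borrows the python-chess package for board and move handling and uses a bare-bones material count as the evaluation function:

```python
import chess  # python-chess, assumed here for board and move handling

def evaluate(board: chess.Board) -> float:
    """Placeholder static evaluation: material count from White's perspective."""
    values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9}
    score = 0.0
    for piece_type, value in values.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def alphabeta(board, depth, alpha=-float("inf"), beta=float("inf")):
    """Minimax with alpha-beta pruning; score is from White's perspective."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    if board.turn == chess.WHITE:              # maximizing player
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = max(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            alpha = max(alpha, best)
            if alpha >= beta:                  # remaining moves can't change the result
                break
        return best
    else:                                      # minimizing player
        best = float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = min(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

def best_move(board, depth=4):
    """Pick the legal move with the best alpha-beta score for the side to move."""
    sign = 1 if board.turn == chess.WHITE else -1
    scored = []
    for move in board.legal_moves:
        board.push(move)
        scored.append((sign * alphabeta(board, depth - 1), move))
        board.pop()
    return max(scored, key=lambda pair: pair[0])[1]
```

The loop over root moves in best_move is also a natural place to fan the work out with multiprocessing.Pool, since each root move's subtree can be searched independently.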

My AlphaZero-inspired AI used Monte Carlo Tree Search (MCTS) to choose its moves. This algorithm judges how good a move is by simulating what the rest of the game might look like if it makes that move. While Minimax has to examine every possible continuation (up to its search depth), MCTS is able to focus its computational resources on moves that are more likely to occur. This is very similar to how a human master or grandmaster evaluates positions: humans don't waste time thinking about what might happen if they play an obviously bad move! My AI decides which moves are likely and which are not by using a "Policy Function." I chose a Convolutional Neural Network to represent my Policy Function, and this network is trained on examples generated from self-play. While my version of AlphaZero has been training via self-play for a few weeks now, it still doesn't play very well (almost as badly as random play).
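As a rough sketch of the idea (simplified, and not my exact implementation), the selection and expansion steps of MCTS with a learned policy prior look something like this, where policy stands in for the Convolutional Neural Network:

```python
import math

class Node:
    """One game state in the search tree."""
    def __init__(self, prior):
        self.prior = prior            # P(move) from the policy network
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}            # move -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """PUCT-style rule: balance how good a move has looked so far (exploitation)
    against how promising the policy thinks it is and how rarely it has been
    tried (exploration)."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_score, best_move, best_child = -float("inf"), None, None
    for move, child in node.children.items():
        ucb = (child.value()
               + c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count))
        if ucb > best_score:
            best_score, best_move, best_child = ucb, move, child
    return best_move, best_child

def expand(node, board, policy):
    """Ask the policy network for move probabilities and create child nodes."""
    priors = policy(board)            # dict: legal move -> probability
    for move, p in priors.items():
        node.children[move] = Node(prior=p)
```

A full iteration runs selection from the root down to a leaf, expands the leaf, evaluates the position, and backs the result up the path; the policy priors are what steer the visits away from obviously bad moves.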

Facebook Messenger: A Personal Analysis

For my project on Natural Language Processing (NLP), I downloaded and analyzed my own personal Facebook Messages. I might have learned more than I wanted to on this one...

Some fun facts: in the three-and-a-half-ish years I've been using Messenger, I've sent:

  • 82 messages per day, on average

  • 100,806 messages total

  • 611,861 words total

  • 32,784 unique words

    • 23,155 of those were actual words (letters only)

 

The first thing I had to do was get the data. This was pretty easy, since all I had to do was go to my Facebook.com settings and click "Download." I hope to write a tutorial on my GitHub on how to do it, as it's a fun and easy process that I think anyone who uses Facebook Messenger should try out. After getting my messages in raw JSON format, I was able to load them into Python using pandas.
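The loading step is only a few lines of pandas. The field names below (sender_name, timestamp_ms, content) match the export format at the time I downloaded my data; treat them as assumptions, since Facebook may change the layout:

```python
import json
from pathlib import Path
import pandas as pd

# The export is split across one or more message_*.json files per conversation;
# field names (sender_name, timestamp_ms, content) reflect the format at the time
# I downloaded it and may differ in newer exports.
frames = []
for path in Path("messages/inbox").glob("*/message_*.json"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    df = pd.DataFrame(data["messages"])
    df["conversation"] = data.get("title", path.parent.name)
    frames.append(df)

messages = pd.concat(frames, ignore_index=True)
messages["timestamp"] = pd.to_datetime(messages["timestamp_ms"], unit="ms")
print(messages[["conversation", "sender_name", "timestamp", "content"]].head())
```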

 

I had a lot of fun playing around with my own messages: What was the longest message I ever sent? The shortest? The shortest time between two messages? Which messages in my group chats got the most reacts? Who sends the most messages in my group chats? Does my Messenger activity correlate with real-life events? Short answer: yes.
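Most of those questions become one-liners once everything is in a DataFrame. A couple of illustrative examples, assuming the messages table from the loading snippet above:

```python
# Assumes the `messages` DataFrame from the loading snippet above.
mine = messages[messages["sender_name"] == "Vaughn Parker"].copy()  # my name as it appears in the export

# Longest and shortest non-empty messages I ever sent
mine["length"] = mine["content"].fillna("").str.len()
longest = mine.loc[mine["length"].idxmax(), "content"]
shortest = mine.loc[mine[mine["length"] > 0]["length"].idxmin(), "content"]

# Average messages per day
per_day = mine.set_index("timestamp").resample("D").size()
print(f"Average messages per day: {per_day.mean():.1f}")

# Message counts per sender across all of my conversations
print(messages.groupby("sender_name").size().sort_values(ascending=False).head())
```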

A central part of this project was Latent Semantic Analysis (LSA), which I used to separate my messages into several distinct "topics." I built a sparse document-term matrix using Term Frequency-Inverse Document Frequency (TF-IDF), weighting how often I used each word on each day. The topics that came out were logical and cohesive, although I do wish I had more time to refine them. Defining one document as one day of messages was a quick and reasonable choice, although it may not be the best time window to use. If I come back to this project, I will experiment with defining a document over different time windows.
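In scikit-learn terms, the pipeline was roughly TF-IDF followed by a truncated SVD. Here's a minimal sketch, assuming daily_text is a list with one string of concatenated messages per day (parameter values are illustrative, not my exact settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# daily_text: one string per "document", i.e. all messages sent on a given day
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
tfidf = vectorizer.fit_transform(daily_text)          # sparse (days x vocabulary) matrix

# LSA = truncated SVD applied to the TF-IDF matrix
lsa = TruncatedSVD(n_components=10, random_state=42)
doc_topics = lsa.fit_transform(tfidf)                 # (days x topics)

# Show the top words for each topic
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top = component.argsort()[::-1][:8]
    print(f"Topic {i}: {', '.join(terms[j] for j in top)}")
```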


Classifying Motion

For my Classification project, I used the MotionSense dataset to figure out which activity someone was performing by looking at their gyroscope and accelerometer data. This project definitely had the most successful results, and I am very proud of what I accomplished.

 

The MotionSense dataset is the result of a study from Queen Mary University of London. The study had 24 participants place an iPhone 6S into their front pocket and record their motion using the SensingKit app. Their motion was recorded while they performed six activities: walking downstairs, walking upstairs, walking on flat ground, jogging, standing still, and sitting still. While they performed these activities, the phone's gyroscope and accelerometer readings were recorded 50 times per second.

This was definitely one of my larger datasets, with a final table of 1.4 million data points with 12 features each. In hindsight, I probably should have used AWS to train some of my models more quickly, although I managed to do pretty well using just my laptop.
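Getting the raw recordings into one big table was mostly a pandas exercise. The sketch below assumes the dataset has been unpacked into one CSV per participant per trial, with the activity encoded in the folder name; the exact folder and file names in the public release may differ:

```python
from pathlib import Path
import pandas as pd

# Map folder-name prefixes to activity labels; adjust to match the actual download.
ACTIVITIES = {"dws": "downstairs", "ups": "upstairs", "wlk": "walking",
              "jog": "jogging", "sit": "sitting", "std": "standing"}

frames = []
for csv_path in Path("A_DeviceMotion_data").glob("*/sub_*.csv"):
    prefix = csv_path.parent.name.split("_")[0]        # e.g. "wlk_7" -> "wlk"
    df = pd.read_csv(csv_path)
    df["activity"] = ACTIVITIES.get(prefix, prefix)
    df["subject"] = csv_path.stem                      # e.g. "sub_12"
    frames.append(df)

motion = pd.concat(frames, ignore_index=True)
print(motion.shape)   # roughly 1.4 million rows of 12 sensor features plus labels
```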

Using standard Logistic Regression as a baseline model, I was able to get an accuracy score of 0.59. In order to improve that score, I had to do some clever Feature Engineering. What I noticed was that the data tended to be repetitive and periodic, which made sense, given that each step is part of a sort of "walking cycle." I then took two-second windows and approximated each feature over that window with a sine wave. In addition to the original features, the parameters of these sine waves became new features: specifically, I measured the amplitude, period, phase, and vertical shift of each feature over the window, so my original 12 features became 60.
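Concretely, for each raw feature over each two-second window (100 samples at 50 Hz), I fit a sine of the form A·sin(2πt/T + φ) + c and kept its four parameters. Something like the sketch below, using scipy's curve_fit (the exact fitting routine I used may have differed):

```python
import numpy as np
from scipy.optimize import curve_fit

def sine(t, amplitude, period, phase, shift):
    """amplitude * sin(2*pi*t / period + phase) + shift"""
    return amplitude * np.sin(2 * np.pi * t / period + phase) + shift

def window_features(window, hz=50):
    """Fit a sine to one feature over a two-second window (100 samples at 50 Hz)
    and return its four parameters as engineered features."""
    t = np.arange(len(window)) / hz
    guess = [window.std(), 1.0, 0.0, window.mean()]    # rough starting point
    try:
        params, _ = curve_fit(sine, t, window, p0=guess, maxfev=2000)
    except RuntimeError:                               # fit failed to converge
        params = guess
    return params                                      # amplitude, period, phase, shift
```

Doing this for all 12 raw signals yields 48 sine parameters per window, which together with the originals make up the 60 features.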

I experimented with several classification models, including Logistic Regression, K-Nearest Neighbors, and Naive Bayes, but I found that the best performance came from tree-based models: specifically, a Gradient Boosting Classifier gave me an accuracy of 0.94. Hooray!
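The winning model was scikit-learn's gradient boosting classifier, used more or less out of the box. The hyperparameters below are illustrative rather than my exact settings, and X and y stand for the 60 engineered features and the six activity labels:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X: the 60 engineered features per window, y: one of the six activity labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```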

However, I was afraid that my model might be overfitting to the individual walking patterns of these 24 specific people. To test this, I ran my model on a 25-minute sample of me walking continuously, and it correctly classified my motion data as "walking" 92% of the time! I also tested it in a more difficult setting, in which I sat still, stood still, and moved around in a stairwell. Although I couldn't get an accuracy score for this trial (it was very hard to hand-label, as I was constantly switching activities), it performed very well, and I was able to synchronize a video of me walking around with a video of my model making its probability predictions side by side! I will post the video to my GitHub later, along with a tutorial on how to generate and export your own MotionSense data!


Predicting Baseball Batting Averages

For my Linear Regression project, I scraped data from baseball-reference.com to try to predict batting averages. This was my first time web scraping, and it was definitely harder than it looks.

The first thing I had to do was scrape the data from the web. I used the Python package BeautifulSoup, and it definitely had a learning curve. Although I was already vaguely familiar with HTML tags, I found that even something as simple as copying the contents of a table could become tedious and puzzling. However, I learned a lot, and I am definitely glad that I gained some much-needed proficiency with web scraping.
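The basic pattern for pulling a stats table off a page looks like this. The URL is a placeholder and the parsing is simplified, but it shows the requests + BeautifulSoup + pandas flow:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder URL: swap in the actual stats page you want to scrape.
url = "https://www.baseball-reference.com/some/batting/page.shtml"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

table = soup.find("table")                          # grab the first stats table on the page
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

batting = pd.DataFrame(rows[1:], columns=rows[0])   # first row becomes the header
print(batting.head())
```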

 

After I obtained the data, there was a fair amount of cleaning to do. I had to take out players whose stats were incomplete, as well as remove any data from the baseball minor leagues. Even though I was careful to only scrape players who were part of the major leagues, I still ended up with thousands of rows of minor-league data as well. All in all, the data wasn't too messy, but it was definitely harder to work with than I expected.
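The cleaning itself was standard pandas filtering, roughly along these lines. Column names like "Lg", "AB", "H", and "BA" are assumptions about the scraped table, not necessarily the exact ones I used:

```python
import pandas as pd

# batting: the scraped DataFrame from the snippet above.
# Keep only major-league rows; "Lg" (league) is a stand-in column name.
majors = batting[batting["Lg"].isin(["AL", "NL"])].copy()

# Convert numeric columns and drop players with incomplete stats
for col in ["AB", "H", "BA"]:
    majors[col] = pd.to_numeric(majors[col], errors="coerce")
majors = majors.dropna(subset=["AB", "H", "BA"])

print(f"{len(batting) - len(majors)} rows removed, {len(majors)} remaining")
```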

My linear regression model did not work very well, with a final R-squared value of only 0.24. I believe I could improve this by using more advanced machine learning techniques, such as Decision Trees, which handle categorical variables like team and position much better. Maybe a neural network would even be a fun thing to try! I definitely plan to revisit this project and make a GitHub repo for it when I find the time.

