Using Machine Learning to Build Databases From Unstructured Text
According to Yaakov Bressler, live theater is an exciting industry: its product is an experience that can only be had in person. On the business side, Broadway has modernized considerably, adopting technologies such as e-ticketing, digital marketing, social media, and dynamic pricing.
However, Broadway still lags behind when it comes to big data analytics and deep learning. Its most popular data-housing sites, including Theatrical Index, Broadway World, PlaybillPro, Playbill, IBDB, and The Broadway League, don’t offer bulk download options or APIs.
A simpler way to acquire this kind of data is essential if theater is to move forward. The following article describes the methods and approaches taken to create a Broadway-based database.
STEP #1: Data Availability
The data related to consumer experiences that is available to the public falls into three categories:
• Operational Information – website and social media traffic, dynamic pricing initiatives, seating decks, wrap reports
• Financial Information – price per ticket, capacity, gross revenue
• Production Information – amount of award nominations and wins, show type, cast size, run time
The websites below include all available data:
• Financial Info – Playbill and The Broadway League
• Financial and Production Info – IBDB, Broadway World, and PlaybillPro
Accessing this data would require developing a web scraping program: software that automatically locates, formats, and downloads data from the internet in large quantities.
To accomplish this goal, I needed a large data set that already existed. After some lengthy internet searches, I found the CORGIS collection, which includes a public dataset with all Broadway grosses from 1990 to 2016.
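As a minimal sketch, the CORGIS grosses file can be loaded directly with pandas once downloaded; the local file name below is an assumption, not necessarily how the dataset is published.

```python
import pandas as pd

# Load the CORGIS Broadway grosses file, assumed to have been downloaded
# locally as "broadway.csv" (file name is an assumption for illustration).
grosses = pd.read_csv("broadway.csv")

# Quick orientation: which columns are present and how many weekly records exist.
print(grosses.columns.tolist())
print(grosses.shape)
```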
STEP #2: Analyzing and Exploring
The CORGIS dataset is fully structured and free of significant errors, so I decided to start with some exploratory analysis. For this stage I chose R, thanks to its strengths in data visualization, its large variety of statistical packages, and its algebraic functionality.
STEP #3: Preliminary Analysis Reflections
The preliminary analysis generated some fascinating results related to price changes, attendance, and periodicity. However, the dataset ends in August 2016, so it misses later grosses from popular musicals and the arrival of newer productions, making it outdated and hard to generalize from.
Additionally, the dataset contains no creative information about productions; it cannot distinguish, for example, a 2 ½ hour musical with a cast of 30 from a 1 ½ hour musical with a cast of 5. As a result, the combined findings were likely categorized too broadly, and updated information was needed to reflect the current Broadway market.
STEP #4A: Scraping the Web to Get Better Data
Using my preliminary analysis as a guide to what kind of data would be required, I visited a variety of data-housing websites to determine whether any were appropriate for web scraping. I chose Python for the scraping itself because of its speed with larger data sets, its mature web scraping packages, and its sophisticated machine learning capabilities. The packages I used were pandas, re, requests, urllib, and Beautiful Soup.
STEP #4B: Scraping Broadway Grosses
The first step in scraping Broadway ticket and gross data was to compile a list of URLs for every Broadway show in New York available in Broadway World’s index. I built this list from a second list of index URLs, one for each first letter of a show’s title (a through z, plus a # page for titles that begin with a number).
Parsing the gross data for this list with pandas’ read_html function was straightforward. I pulled all of the tables from every page and kept the ones that contained gross information, then wrote a short hygiene script that typed, validated, and cleaned the data. Put together, this produced a tidy script that output a CSV of all Broadway grosses.
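A rough sketch of that step follows; the index URL pattern, the show-URL list, and the "gross" column check are assumptions for illustration, not Broadway World’s real paths or layouts.

```python
import string
import pandas as pd

# Hypothetical index URL pattern -- Broadway World's real paths differ.
INDEX_URL = "https://www.broadwayworld.com/shows/index.php?letter={}"

# One index page per first letter of a show's title, plus '#' for numeric titles.
index_urls = [INDEX_URL.format(c) for c in list(string.ascii_lowercase) + ["#"]]

def gross_tables(url):
    """Keep only the tables on a page whose columns mention grosses."""
    tables = pd.read_html(url)              # parses every <table> on the page
    return [t for t in tables
            if any("gross" in str(col).lower() for col in t.columns)]

# For each show URL gathered from the index pages, collect its gross tables,
# then concatenate everything into one frame and write it out as a CSV.
def build_grosses_csv(show_urls, path="broadway_grosses.csv"):
    frames = [t for url in show_urls for t in gross_tables(url)]
    pd.concat(frames, ignore_index=True).to_csv(path, index=False)
```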
STEP #4C: Scraping Creative Broadway Information
The approach I took to scraping creative information was very similar. I compiled a list of URLs for every show, starting from a static page that lists a link for each year going all the way back to 1750. Scraping that page gave me a list of year URLs, and scraping each year page gave me the URL of every show.
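The sketch below illustrates that URL-gathering step; the archive URL, query format, and link pattern are assumptions for illustration, not Broadway World’s actual markup.

```python
from datetime import date

import requests
from bs4 import BeautifulSoup

# Hypothetical year-archive URL -- the real Broadway World path differs.
ARCHIVE_URL = "https://www.broadwayworld.com/browseshows.cfm?year={}"

def show_links_for_year(year):
    """Return every show link found on a single year's archive page."""
    html = requests.get(ARCHIVE_URL.format(year)).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if "/shows/" in a["href"]]      # assumed pattern for show pages

# Walk every year from 1750 to the present and flatten into one list of URLs.
show_urls = [link
             for year in range(1750, date.today().year + 1)
             for link in show_links_for_year(year)]
```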
Passing my list to pandas’ read_html function raised an error. I rewrote and debugged the code using Beautiful Soup, parsing each page’s td elements, but that produced yet another error. Further iterations failed as soon as I ran more than two or three URLs. Upon investigation, I discovered that the pages map their data to different locations and that a majority of the pages are coded differently from one another. A straightforward web scraping approach wouldn’t work.
Rather than switching my web scraper to a more dynamic site (IBDB or PlaybillPro) or slogging through Broadway World’s awkward pages one by one, I decided to build a text mining algorithm, seeing the long-term benefit of transforming plain text data into a structured, organized form.
STEP #5: Web Scraping with Machine Learning: An Alternative Approach
Since I couldn’t predict or identify how each of my 13,000+ URLs was coded, my program required a more intricate decision-making process for locating, identifying, organizing, and validating the data on every page. I accomplished this by applying machine learning, a branch of AI that lets a program identify patterns and make decisions efficiently.
After nearly four weeks at roughly 50 hours per week, my algorithm was finally debugged and complete.
Here is how I did it (a simplified sketch follows the list):
• A list of URLs was generated for every show.
• Each page was read as plain text, with no formatting or HTML markup.
• Running Python’s native split function and regular expressions from the re package over this text let me identify patterns describing the presence, location, structure, and validity of data on every page.
• I mapped these pattern-recognition rules into my algorithm, along with the supporting Python functions that act on them.
• Transformation functions passed into the primary algorithm converted the flattened strings into structured numeric (float) data.
• Each parsed record was stored in a dictionary and appended to a list of dictionaries, which pandas’ DataFrame constructor then converted into a DataFrame.
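A simplified sketch of this pipeline is below. It shows only the plain-text conversion, a single illustrative regex pattern, and the dictionary-to-DataFrame step; the field name, pattern, and page structure it expects are assumptions, and it stands in for (rather than reproduces) the full 447-line algorithm’s decision logic.

```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

def page_to_text(url):
    """Interpret a page as plain text, discarding all HTML markup."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return soup.get_text(separator="\n")

# One illustrative pattern; the real algorithm uses many rules covering
# cast size, run time, awards, show type, and more.
RUN_TIME_RE = re.compile(
    r"Running Time[:\s]+(\d+)\s*(?:hr|hour)s?\.?\s*(?:(\d+)\s*min)?",
    re.IGNORECASE,
)

def parse_show(url):
    """Turn one show page into a flat record (dictionary) of structured fields."""
    text = page_to_text(url)
    record = {"url": url}
    match = RUN_TIME_RE.search(text)
    if match:                                # convert matched strings to numbers
        hours = int(match.group(1))
        minutes = int(match.group(2) or 0)
        record["run_time_minutes"] = 60 * hours + minutes
    return record

# Each record is a dictionary; the list of dictionaries becomes a DataFrame.
def build_show_table(show_urls):
    return pd.DataFrame([parse_show(url) for url in show_urls])
```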
In total, my algorithm and all of its supporting functions came to 447 lines of Python code.
Output Summary:
My algorithm’s logic is 99.999% accurate. The only incorrect data it produced was for shows staged in more than one part, such as Angels in America and Harry Potter.
My data set contains every show from 1750 to the present.
STEP #6: A Reflection on the Machine Learning Approach
Looking back, I realize the method I used for web scraping was rather awkward: I had managed to convert every structured web page into unstructured text. Nevertheless, the method was widely adaptable, functional, and thorough. I also realized that one of its greatest strengths is that it works from the English text itself, so the relatively limited lexicon of Broadway-focused pages became a useful way to exercise advanced machine learning capabilities.
I think a JavaScript-aware web scraping approach could be used on sites like PlaybillPro or IBDB to produce similar data, but my code has a substantial advantage: it handles both HTML-based and JavaScript-based websites. Likewise, it can be modified to read plain text files, which institutions and businesses use across all industries.
Conclusion
Unstructured, text-based data like the webpages I encountered while collecting Broadway data can be assembled into databases through machine learning techniques. Such algorithms let businesses translate text-based data into structured databases that are meaningful and accessible for analysis.
If you’re thinking about transitioning to big data analytics, focus on making your data accessible rather than merely collected. Open access to shared databases can be automated and improved. Manual downloading only works when data is displayed in a structured, uniform manner; by comparison, automated processes that use machine learning can extract data from much larger collections spanning virtually unlimited data structures and types.
The next step would be to improve this code, allowing better access to Broadway data and, with permission, to abstract consumer behavior patterns from private wrap reports through comprehensive mathematical models.
Thus, data analysis for Broadway is a realistic next step.