Data on master’s degree benefits and skill development based on BE-EDGE
by Swapnil Lokhande
Supervised by Dr. Julia Ivy
March 7, 2022
This research analyzes the impact of a Master’s degree or graduate program (the two terms are used interchangeably in this report) on the employability of students and job seekers.
The core analysis identifies the keywords that authors use when discussing higher education, specifically the Master’s degree or graduate program, in articles published on public platforms. The goal is to analyze these keywords using the BE-EDGE methodology and to assess how far a graduate program helps students develop their Personal, Social and Professional Capital.
METHODOLOGY FOR DATA SELECTION AND EXTRACTION
Article selection and filtering
The main data source for the keyword analysis is publicly available articles about the Master’s degree or graduate program. The articles are analyzed in two parts, each with its own dataset: (1) Benefits of Master’s degree and (2) Skills developed through Master’s degree.
- Benefits of Master’s degree: The articles for this analysis were selected from the results of the following Google searches, together with other recommended articles: “Master’s degree benefits”, “Why master degree is important” and “Reasons to get a masters degree”.
- Skills developed through Master’s degree: The articles for this analysis were selected from the results of the following Google searches, together with other recommended articles: “Skills learned in graduate school” and “What I learned in graduate school”.
Only articles that discuss the benefits of, and the skills developed through, a Master’s degree in general were selected. Articles specific to a particular course or program, such as an online Master’s degree, a one-year Master’s program, or a Master’s in accounting, consulting or education, were excluded. This avoids bias toward any particular program or degree, so that the findings represent the Master’s degree as a whole rather than a certain group of graduate programs.
The first four pages of Google search results gave relevant articles; beyond that, the articles became more specific to a particular type of program or course. Thus, only articles appearing on the first four pages of the Google search results were selected.
Data Extraction Method
To analyze the articles and draw insights from them, the content of each article, which is available in HTML format, is extracted, stored in a simple text format (CSV or JSON) and then used for further processing and analysis. The first step in this project is therefore to build an application that extracts the desired content from different websites and stores it in the required format. To accomplish this, web scrapers are deployed to gather the data.
Web scraper for articles – Purpose and Technique
- A simple web scraper was written in Python to pull the content from an article.
- The user passes the application the URL of a freely available article (for example, an article from The Conversation or The New York Times).
- The application uses the Python packages requests and BeautifulSoup to retrieve the HTML of the given article and process it to pull the required content.
- The scraper cannot extract content from articles that require a login to the portal.
- Example: articles on the Northeastern Library portal can be accessed only after logging in to a student account through myneu. The Chronicle of Higher Education also requires a login to access its articles.
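The scraping steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (extract_article_text, scrape_article, save_articles), the choice to collect text from <p> tags, and the JSON output layout are all hypothetical.

```python
import json

import requests
from bs4 import BeautifulSoup


def extract_article_text(html: str) -> dict:
    """Pull the title and paragraph text out of an article's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: these elements never hold article prose, so drop them.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Assumption: the article body lives in <p> tags.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    paragraphs = [p for p in paragraphs if p]
    return {"title": title, "text": "\n".join(paragraphs)}


def scrape_article(url: str, timeout: float = 10.0) -> dict:
    """Download a freely available article and return its extracted content."""
    response = requests.get(url, timeout=timeout)
    # Raises an exception on HTTP error responses (for example 401 or 403).
    response.raise_for_status()
    record = extract_article_text(response.text)
    record["url"] = url
    return record


def save_articles(records: list, path: str) -> None:
    """Store the extracted articles as a JSON file for later analysis."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Keeping the HTML parsing separate from the download step means the extraction logic can be checked on a saved page without any network access. A page behind a login would typically return an error or a login form instead of the article, which is consistent with the limitation noted in the list above.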
METHODOLOGY FOR DATA ANALYSIS AND KEYWORD DETECTION
The main goal of this research is to identify the keywords that are frequently used and highly relevant to the two topics: Benefits of Master’s degree and Skills developed through Master’s degree. To identify such keywords, TF-IDF (term frequency and inverse document frequency), a technique widely used in machine learning for Natural Language Processing, is applied. It is generally used when processing human-readable language and converts words into numerical form: each document is represented by the scores of its words, so that a collection of documents becomes a matrix (Gajare, n.d.).
How to calculate Tf-Idf score
TF-IDF for a word in a document is calculated by multiplying two different metrics:
The term frequency (TF) of a word in a document is the raw count of the times the word appears in that document.
The inverse document frequency (IDF) of the word across a set of documents. It is calculated by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm of the result. The IDF indicates how common or rare a word is in the entire document set: the closer it is to 0, the more common the word, and the larger it is, the rarer the word. With this formula the value is not capped at 1.
Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
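The calculation described above can be illustrated with a short sketch. It assumes the natural logarithm (the text does not name a base) and whitespace tokenization; the function name tf_idf and the three toy documents are made up for illustration and are not taken from the study's data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = raw count of the word in the document
    idf = log(N / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return scores


# Toy corpus (hypothetical, for illustration only).
docs = [
    "degree improves salary",
    "degree builds network",
    "degree builds skills skills",
]
scores = tf_idf(docs)
print(scores[2])
```

In the third document, "degree" scores 0 because it appears in all three documents (log(3/3) = 0), while "skills" gets the highest score because it occurs twice there and in no other document.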
Approach for determining keywords
The Tf-Idf score is calculated for each bi-gram and tri-gram; a higher score means the phrase is more relevant to that particular document. Uni-grams are excluded from the analysis because single words carry little information on their own. For instance, suppose we want to find out from a document the skills required to be a “Data Scientist”.
If we consider only uni-grams, a single word cannot convey the details properly. Given a phrase like ‘Machine learning developer’, the keyword extracted should be ‘Machine learning’ or ‘Machine learning developer’. The single words ‘Machine’, ‘learning’ or ‘developer’ alone will not give the expected result.
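A rough sketch of this bi-gram and tri-gram approach is shown below. It is an illustration under assumptions rather than the pipeline used in the study: the helper names (ngrams, top_keywords), the simple regular-expression tokenizer, the natural logarithm, and the absence of stop-word removal are all choices made here for brevity.

```python
import math
import re
from collections import Counter


def ngrams(text, sizes=(2, 3)):
    """Split text into lowercase word n-grams of the given sizes."""
    # Words may contain an internal apostrophe, e.g. "master's".
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in sizes
        for i in range(len(words) - n + 1)
    ]


def top_keywords(documents, doc_index, k=5):
    """Rank the n-grams of one document by tf * log(N / df)."""
    grams_per_doc = [ngrams(doc) for doc in documents]
    n_docs = len(documents)
    # In how many documents each n-gram occurs.
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    counts = Counter(grams_per_doc[doc_index])
    scored = {g: c * math.log(n_docs / doc_freq[g]) for g, c in counts.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]
```

For example, ngrams("Machine learning developer") yields the bi-grams "machine learning" and "learning developer" plus the tri-gram "machine learning developer", and never the single words on their own.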
Findings and Result
The results consist of the bi-grams and tri-grams associated with the two sets of articles: Benefits of Master’s degree and Skills developed through Master’s degree. The keywords are listed in descending order of rank.
Here, rank is the Tf-Idf score, which shows the importance or relevance of the keyword in the given article. For example, the keyword “Earning potential” is expected to have a higher score in articles about the benefits of a Master’s degree, since employees with a Master’s degree are assumed to be likely to earn a higher salary.
Bi-grams for Benefits of Masters degree