Job Title and Salary Prediction

The Australian data science job market continues to grow, with a variety of data jobs across different salary bands. Seek, one of the leading job platforms in Australia, has a wealth of information on these jobs, so it is only fitting to harness this data and apply data science principles to draw insights from it. For this project, we develop models to predict 1) salary range and 2) job title from the job ad details of data science jobs. Because the job ad details are a chunk of free text, we make use of Natural Language Processing to turn them into a predictor matrix.

For our salary range prediction model, we use CountVectorizer to turn the job ad details text into a predictor matrix, which is then used to train different models. Comparing the results, we developed an Extra Tree model that can predict the salary range of a job from its ad details with an accuracy of 75%.
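The vectorize-then-classify step described above can be sketched as follows. This is only a minimal illustration, not the project's actual code: the toy ads and salary bands are made up, and it assumes the "Extra Tree" model corresponds to scikit-learn's ExtraTreesClassifier with arbitrary hyperparameters.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real project uses job ad details scraped from Seek.
ads = [
    "junior data analyst excel reporting dashboards",
    "graduate analyst excel reporting sql",
    "senior data scientist machine learning python leadership",
    "lead data scientist deep learning python strategy",
]
salary_bands = ["low", "low", "high", "high"]

# CountVectorizer builds the word-count predictor matrix;
# the classifier is then trained on that matrix.
salary_model = make_pipeline(
    CountVectorizer(stop_words="english"),
    ExtraTreesClassifier(n_estimators=100, random_state=0),  # assumed settings
)
salary_model.fit(ads, salary_bands)

# Predict the salary band of an unseen ad.
predicted = salary_model.predict(["senior machine learning scientist python"])
print(predicted[0])
```

Wrapping both steps in a pipeline keeps the vocabulary learned from the training ads tied to the model, so new ads are vectorized the same way at prediction time.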

Against a baseline that always predicts the majority salary range in our dataset, this model improves accuracy by 30 points. However, it cannot tell us which words are likely to occur in the job details for each salary range. Another model we considered, which also gave good results, is Logistic Regression. Using it, we can get the words most closely associated with each salary range.

For job title prediction, we looked at the impact of the type of vectorizer used. We compared how models perform with each of the following:

  • CountVectorizer
  • TfidfVectorizer

We compared 6 vectorizer + model combinations. In general, models that used TfidfVectorizer performed better than those that used CountVectorizer across all comparison metrics.

Of the models we used, the Tfidf + Extra Tree combination performed best, outperforming our baseline by 50 points. Tfidf + Logistic Regression was not far behind, although it showed some signs of overfitting.
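A comparison of this kind could be set up along the following lines. This is a sketch under assumptions: the text names only Extra Tree and Logistic Regression, so the third model here (MultinomialNB) is a placeholder chosen to reach six combinations, and the toy ads, job titles, and 2-fold cross-validation are illustrative only.

```python
from itertools import product

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four ads for each of three job titles.
ads = [
    "build data pipelines spark airflow", "maintain etl pipelines cloud spark",
    "design data warehouse pipelines sql", "deploy pipelines airflow cloud",
    "gather requirements stakeholders process", "document requirements process mapping",
    "stakeholders workshops requirements reporting", "process improvement stakeholders reporting",
    "machine learning models python statistics", "statistics experiments models python",
    "train models deep learning python", "predictive models statistics research",
]
titles = ["Data Engineer"] * 4 + ["Business Analyst"] * 4 + ["Data Scientist"] * 4

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
models = {
    "extra_trees": lambda: ExtraTreesClassifier(n_estimators=50, random_state=0),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "naive_bayes": lambda: MultinomialNB(),  # placeholder third model (assumption)
}

# Score every vectorizer + model pair with the same cross-validation folds.
scores = {}
for (vec_name, make_vec), (model_name, make_model) in product(
    vectorizers.items(), models.items()
):
    pipeline = make_pipeline(make_vec(), make_model())
    scores[(vec_name, model_name)] = cross_val_score(pipeline, ads, titles, cv=2).mean()

for combo, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(combo, round(accuracy, 2))
```

Comparing the cross-validated score with the score on the training data for each combination is one simple way to spot the kind of overfitting mentioned for Tfidf + Logistic Regression.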

As with the salary model, we can use the Logistic Regression model if we are interested in the words that best identify each job title.

As an extra, I also looked at unsupervised learning, specifically topic modelling, to find any hidden groupings in the job ad details and check whether these could be used to determine our target classes. Latent Dirichlet Allocation (LDA) was used for topic modelling. Based on the results, I conclude that LDA does not give any significant grouping that relates to our targets for either the salary prediction or the job title prediction.

From the above visualisation, we can see that 3 distinct clusters are formed. However, looking at the words that appear significant per cluster, these words do not relate to our targets in any way.

From the above visualisation, we can see that 3 distinct clusters and 2 overlapping clusters are formed (the overlapping ones could be more distinguishable if we added dimensions, but that is difficult to visualise). Looking at the words that appear significant per cluster, the words give us a slight signal. We can relate some of the clusters to our targets in the following way: cluster 2 is similar to the results we got for the Engineer role, and cluster 3 is close to the Business Analyst results. However, clusters 4 and 5 look to have more to do with actual industries, with 5 being close to government and 4 close to marketing.

Visit my GitHub repo for the code used in this project: link.