Salary Predictor
from google.colab import files
uploaded = files.upload()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_rows', None)
df = pd.read_csv('ds_salaries.csv')
df.head()
| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |
df.columns
Index(['work_year', 'experience_level', 'employment_type', 'job_title',
'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
'remote_ratio', 'company_location', 'company_size'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   work_year           3755 non-null   int64
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB
df.describe()
| | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 3755.000000 | 3.755000e+03 | 3755.000000 | 3755.000000 |
| mean | 2022.373635 | 1.906956e+05 | 137570.389880 | 46.271638 |
| std | 0.691448 | 6.716765e+05 | 63055.625278 | 48.589050 |
| min | 2020.000000 | 6.000000e+03 | 5132.000000 | 0.000000 |
| 25% | 2022.000000 | 1.000000e+05 | 95000.000000 | 0.000000 |
| 50% | 2022.000000 | 1.380000e+05 | 135000.000000 | 0.000000 |
| 75% | 2023.000000 | 1.800000e+05 | 175000.000000 | 100.000000 |
| max | 2023.000000 | 3.040000e+07 | 450000.000000 | 100.000000 |
Use a box plot to visualize the distribution of a numerical variable. Box plots make outliers easy to spot because points that fall beyond the whiskers are drawn individually.
The summary statistics above already suggest outliers in the salary columns: the maximum salary (3.04e+07) is far above its 75th percentile (1.8e+05).
df.boxplot(column=['salary'])
plt.show()
Create a scatter plot of each column against salary_in_usd. Scatter plots help identify data points that deviate significantly from the general trend.
# Plot every column against salary_in_usd, one figure per column
for cn in df.columns:
    plt.scatter(df[cn], df['salary_in_usd'])
    plt.xlabel(cn)
    plt.ylabel('salary_in_usd')
    plt.show()
Calculate the z-score of a numerical variable (here salary_in_usd) to identify outliers. A z-score measures how many standard deviations a data point lies from the mean. Data points whose absolute z-score exceeds a threshold (e.g., 3) can be considered outliers.
from scipy import stats
z_scores = stats.zscore(df['salary_in_usd'])
outliers = df[abs(z_scores) > 3]
outliers
| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 2023 | SE | FT | Computer Vision Engineer | 342810 | USD | 342810 | US | 0 | US | M |
| 133 | 2023 | SE | FT | Machine Learning Engineer | 342300 | USD | 342300 | US | 0 | US | L |
| 228 | 2023 | EX | FT | Head of Data | 329500 | USD | 329500 | US | 0 | US | M |
| 478 | 2023 | EX | FT | Director of Data Science | 353200 | USD | 353200 | US | 0 | US | M |
| 528 | 2023 | SE | FT | AI Scientist | 1500000 | ILS | 423834 | IL | 0 | IL | L |
| 649 | 2023 | SE | FT | Data Architect | 376080 | USD | 376080 | US | 100 | US | M |
| 845 | 2023 | MI | FT | Research Scientist | 340000 | USD | 340000 | US | 100 | US | M |
| 1105 | 2023 | SE | FT | Data Scientist | 370000 | USD | 370000 | US | 0 | US | M |
| 1258 | 2022 | SE | FT | Machine Learning Software Engineer | 375000 | USD | 375000 | US | 100 | US | M |
| 1288 | 2023 | SE | FT | Data Analyst | 385000 | USD | 385000 | US | 0 | US | M |
| 1311 | 2023 | SE | FT | Research Scientist | 370000 | USD | 370000 | US | 0 | US | M |
| 1421 | 2023 | SE | FT | Applied Scientist | 350000 | USD | 350000 | US | 0 | US | L |
| 2011 | 2022 | MI | FT | Data Analyst | 350000 | GBP | 430967 | GB | 0 | GB | M |
| 2359 | 2022 | SE | FT | Data Science Tech Lead | 375000 | USD | 375000 | US | 50 | US | L |
| 2374 | 2022 | SE | FT | Data Scientist | 350000 | USD | 350000 | US | 100 | US | M |
| 2555 | 2022 | SE | FT | Data Architect | 345600 | USD | 345600 | US | 0 | US | M |
| 3463 | 2022 | SE | FT | Data Analytics Lead | 405000 | USD | 405000 | US | 100 | US | L |
| 3468 | 2022 | SE | FT | Applied Data Scientist | 380000 | USD | 380000 | US | 100 | US | L |
| 3522 | 2020 | MI | FT | Research Scientist | 450000 | USD | 450000 | US | 0 | US | M |
| 3675 | 2021 | EX | CT | Principal Data Scientist | 416000 | USD | 416000 | US | 100 | US | S |
| 3747 | 2021 | MI | FT | Applied Machine Learning Scientist | 423000 | USD | 423000 | US | 50 | US | L |
| 3750 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
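To make the z-score filter less of a black box, here is a minimal sketch that computes it by hand with NumPy. The helper name zscore_outlier_mask and the toy salaries are hypothetical and only serve as illustration; the population standard deviation (ddof=0) is used because that is the default of scipy.stats.zscore.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z-score| > threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), the default in scipy.stats.zscore
    z = (values - values.mean()) / values.std(ddof=0)
    return np.abs(z) > threshold

# Hypothetical toy salaries: 20 typical values plus one extreme value
toy = [100_000] * 10 + [110_000] * 10 + [900_000]
mask = zscore_outlier_mask(toy)
print(mask.sum())  # number of values flagged as outliers
```

Only the extreme value is flagged here. With very small samples a threshold of 3 may never be reached, because the largest possible z-score grows with the sample size.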
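The interquartile range (IQR) method is another way to identify outliers. The IQR is the distance between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers. Note that the raw salary column mixes currencies, so many of the rows flagged below are large local-currency amounts (for example INR or JPY) rather than unusually high USD salaries.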
# Calculate IQR for the 'salary' column
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
# Identify outliers using the IQR method
outliers = df[(df['salary'] < Q1 - 1.5 * IQR) | (df['salary'] > Q3 + 1.5 * IQR)]
outliers
| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 2023 | SE | FT | Computer Vision Engineer | 342810 | USD | 342810 | US | 0 | US | M |
| 41 | 2022 | MI | FT | Machine Learning Engineer | 1650000 | INR | 20984 | IN | 50 | IN | L |
| 68 | 2023 | SE | FT | Applied Scientist | 309400 | USD | 309400 | US | 0 | US | L |
| 80 | 2023 | MI | FT | Data Scientist | 510000 | HKD | 65062 | HK | 0 | HK | L |
| 133 | 2023 | SE | FT | Machine Learning Engineer | 342300 | USD | 342300 | US | 0 | US | L |
| 145 | 2023 | SE | FT | Machine Learning Engineer | 318300 | USD | 318300 | US | 100 | US | M |
| 156 | 2023 | MI | FT | Applied Data Scientist | 1700000 | INR | 20670 | IN | 100 | IN | L |
| 163 | 2023 | SE | FT | Applied Scientist | 309400 | USD | 309400 | US | 0 | US | L |
| 217 | 2023 | EN | FT | Data Engineer | 1400000 | INR | 17022 | IN | 100 | IN | L |
| 228 | 2023 | EX | FT | Head of Data | 329500 | USD | 329500 | US | 0 | US | M |
| 358 | 2023 | SE | FT | Machine Learning Engineer | 304000 | USD | 304000 | US | 100 | US | M |
| 478 | 2023 | EX | FT | Director of Data Science | 353200 | USD | 353200 | US | 0 | US | M |
| 488 | 2023 | SE | FT | Data Scientist | 317070 | USD | 317070 | US | 0 | US | M |
| 528 | 2023 | SE | FT | AI Scientist | 1500000 | ILS | 423834 | IL | 0 | IL | L |
| 649 | 2023 | SE | FT | Data Architect | 376080 | USD | 376080 | US | 100 | US | M |
| 735 | 2023 | MI | FT | Data Scientist | 1400000 | INR | 17022 | IN | 100 | IN | L |
| 738 | 2023 | MI | FT | Lead Data Analyst | 1500000 | INR | 18238 | IN | 50 | IN | L |
| 845 | 2023 | MI | FT | Research Scientist | 340000 | USD | 340000 | US | 100 | US | M |
| 860 | 2023 | EX | FT | Data Engineer | 310000 | USD | 310000 | US | 100 | US | M |
| 988 | 2023 | SE | FT | Data Analyst | 1300000 | INR | 15806 | IN | 100 | IN | S |
| 998 | 2023 | SE | FT | Data Science Consultant | 1000000 | THB | 29453 | TH | 50 | TH | M |
| 1007 | 2023 | EX | FT | Data Engineer | 310000 | USD | 310000 | US | 100 | US | M |
| 1097 | 2023 | SE | FT | Data Scientist | 300240 | USD | 300240 | US | 0 | US | M |
| 1099 | 2023 | SE | FT | Data Scientist | 300240 | USD | 300240 | US | 0 | US | M |
| 1105 | 2023 | SE | FT | Data Scientist | 370000 | USD | 370000 | US | 0 | US | M |
| 1116 | 2023 | SE | FT | Machine Learning Engineer | 323300 | USD | 323300 | US | 0 | US | M |
| 1153 | 2023 | EX | FT | Data Engineer | 310000 | USD | 310000 | US | 100 | US | M |
| 1230 | 2023 | EN | FT | Data Scientist | 800000 | INR | 9727 | IN | 0 | IN | L |
| 1258 | 2022 | SE | FT | Machine Learning Software Engineer | 375000 | USD | 375000 | US | 100 | US | M |
| 1260 | 2023 | MI | FT | Product Data Analyst | 1350000 | INR | 16414 | IN | 100 | IN | L |
| 1286 | 2023 | SE | FT | Machine Learning Engineer | 318300 | USD | 318300 | US | 100 | US | M |
| 1288 | 2023 | SE | FT | Data Analyst | 385000 | USD | 385000 | US | 0 | US | M |
| 1311 | 2023 | SE | FT | Research Scientist | 370000 | USD | 370000 | US | 0 | US | M |
| 1341 | 2023 | EN | FT | Data Scientist | 1050000 | INR | 12767 | IN | 50 | IN | L |
| 1396 | 2023 | EX | FT | Head of Data Science | 314100 | USD | 314100 | US | 0 | US | M |
| 1421 | 2023 | SE | FT | Applied Scientist | 350000 | USD | 350000 | US | 0 | US | L |
| 1427 | 2023 | EX | FT | Data Engineer | 310000 | USD | 310000 | US | 100 | US | M |
| 1462 | 2023 | MI | FT | Head of Data Science | 5000000 | INR | 60795 | IN | 50 | IN | L |
| 1512 | 2023 | EN | FT | Data Scientist | 1060000 | INR | 12888 | IN | 50 | IN | S |
| 1549 | 2023 | MI | FT | Data Analytics Lead | 1440000 | INR | 17509 | IN | 50 | SG | M |
| 1595 | 2023 | MI | FT | Data Scientist | 840000 | THB | 24740 | TH | 50 | TH | L |
| 1596 | 2022 | MI | FT | Computer Vision Engineer | 1250000 | INR | 15897 | IN | 100 | IN | M |
| 1722 | 2023 | SE | FT | Data Engineer | 310000 | USD | 310000 | US | 0 | US | M |
| 1738 | 2021 | SE | FT | Data Scientist | 4000000 | INR | 54094 | IN | 100 | IN | L |
| 1739 | 2022 | MI | FT | Business Data Analyst | 1440000 | INR | 18314 | IN | 50 | IN | L |
| 1810 | 2022 | MI | FT | Data Analyst | 1125000 | INR | 14307 | IN | 100 | IN | L |
| 1816 | 2022 | MI | FT | Data Scientist | 1100000 | INR | 13989 | IN | 100 | IN | L |
| 1868 | 2022 | SE | FT | Lead Data Scientist | 4460000 | INR | 56723 | IN | 0 | IN | L |
| 1918 | 2022 | MI | FT | Data Scientist | 2500000 | INR | 31795 | IN | 100 | US | M |
| 1932 | 2022 | EX | FT | Data Engineer | 310000 | USD | 310000 | US | 100 | US | M |
| 1946 | 2022 | MI | FT | Data Engineer | 2800000 | INR | 35610 | IN | 50 | IN | L |
| 2011 | 2022 | MI | FT | Data Analyst | 350000 | GBP | 430967 | GB | 0 | GB | M |
| 2032 | 2021 | MI | FT | Data Analyst | 1250000 | INR | 16904 | IN | 50 | IN | L |
| 2279 | 2022 | EX | FT | Data Engineer | 310000 | USD | 310000 | US | 100 | US | M |
| 2358 | 2022 | EN | FT | Data Scientist | 6600000 | HUF | 17684 | HU | 100 | HU | M |
| 2359 | 2022 | SE | FT | Data Science Tech Lead | 375000 | USD | 375000 | US | 50 | US | L |
| 2374 | 2022 | SE | FT | Data Scientist | 350000 | USD | 350000 | US | 100 | US | M |
| 2406 | 2022 | SE | FT | Data Engineer | 315000 | USD | 315000 | US | 100 | US | M |
| 2555 | 2022 | SE | FT | Data Architect | 345600 | USD | 345600 | US | 0 | US | M |
| 2578 | 2021 | EN | FT | Power BI Developer | 400000 | INR | 5409 | IN | 50 | IN | L |
| 2655 | 2022 | SE | FT | Principal Data Architect | 3000000 | INR | 38154 | IN | 100 | IN | L |
| 2786 | 2022 | EN | FT | Data Scientist | 1800000 | INR | 22892 | IN | 50 | IN | M |
| 2800 | 2022 | EN | FT | BI Data Analyst | 633000 | INR | 8050 | IN | 100 | IN | M |
| 2872 | 2022 | EN | FT | Data Analyst | 500000 | INR | 6359 | FR | 100 | IN | L |
| 2966 | 2022 | SE | FT | Lead Machine Learning Engineer | 7500000 | INR | 95386 | IN | 50 | IN | L |
| 3023 | 2022 | MI | FT | Data Analyst | 450000 | INR | 5723 | IN | 100 | IN | S |
| 3060 | 2021 | EN | FT | Machine Learning Research Engineer | 900000 | INR | 12171 | IN | 100 | IN | M |
| 3061 | 2022 | MI | FT | Data Scientist | 4200000 | INR | 53416 | IN | 100 | ID | L |
| 3075 | 2022 | MI | FL | Applied Machine Learning Scientist | 2400000 | INR | 30523 | IN | 100 | IN | S |
| 3119 | 2020 | EN | FT | Data Engineer | 1000000 | INR | 13493 | IN | 100 | IN | L |
| 3120 | 2020 | EN | FT | Data Engineer | 1000000 | INR | 13493 | IN | 100 | IN | L |
| 3192 | 2022 | EX | FT | Head of Machine Learning | 6000000 | INR | 76309 | IN | 50 | IN | L |
| 3410 | 2022 | EX | FT | Data Engineer | 324000 | USD | 324000 | US | 100 | US | M |
| 3422 | 2022 | MI | FT | Business Data Analyst | 1400000 | INR | 17805 | IN | 100 | IN | M |
| 3423 | 2022 | MI | FT | Data Scientist | 2400000 | INR | 30523 | IN | 100 | IN | L |
| 3426 | 2022 | EN | FT | Data Scientist | 1400000 | INR | 17805 | IN | 100 | IN | M |
| 3463 | 2022 | SE | FT | Data Analytics Lead | 405000 | USD | 405000 | US | 100 | US | L |
| 3468 | 2022 | SE | FT | Applied Data Scientist | 380000 | USD | 380000 | US | 100 | US | L |
| 3475 | 2021 | MI | FT | ML Engineer | 8500000 | JPY | 77364 | JP | 50 | JP | S |
| 3476 | 2021 | MI | FT | ML Engineer | 7000000 | JPY | 63711 | JP | 50 | JP | S |
| 3489 | 2021 | SE | FT | Lead Data Scientist | 3000000 | INR | 40570 | IN | 50 | IN | L |
| 3494 | 2021 | MI | FT | Data Scientist | 700000 | INR | 9466 | IN | 0 | IN | S |
| 3522 | 2020 | MI | FT | Research Scientist | 450000 | USD | 450000 | US | 0 | US | M |
| 3537 | 2021 | MI | PT | 3D Computer Vision Researcher | 400000 | INR | 5409 | IN | 50 | IN | M |
| 3567 | 2021 | EN | FT | Data Engineer | 2250000 | INR | 30428 | IN | 100 | IN | L |
| 3574 | 2021 | MI | FT | BI Data Analyst | 11000000 | HUF | 36259 | HU | 50 | US | L |
| 3581 | 2021 | EN | FT | Data Scientist | 2200000 | INR | 29751 | IN | 50 | IN | L |
| 3589 | 2021 | EN | FT | Data Scientist | 2100000 | INR | 28399 | IN | 100 | IN | M |
| 3593 | 2020 | EN | FT | Data Analyst | 450000 | INR | 6072 | IN | 0 | IN | S |
| 3594 | 2020 | SE | FT | Data Engineer | 720000 | MXN | 33511 | MX | 0 | MX | S |
| 3605 | 2021 | EN | FT | Data Engineer | 1600000 | INR | 21637 | IN | 50 | IN | M |
| 3639 | 2021 | SE | FT | Machine Learning Engineer | 4900000 | INR | 66265 | IN | 0 | IN | L |
| 3640 | 2021 | MI | FT | Data Scientist | 1250000 | INR | 16904 | IN | 100 | IN | S |
| 3644 | 2021 | EN | FT | Big Data Engineer | 1200000 | INR | 16228 | IN | 100 | IN | L |
| 3646 | 2020 | MI | FT | Data Scientist | 11000000 | HUF | 35735 | HU | 50 | HU | L |
| 3649 | 2021 | SE | FT | Data Science Manager | 4000000 | INR | 54094 | IN | 50 | US | L |
| 3650 | 2021 | SE | FT | Machine Learning Engineer | 1799997 | INR | 24342 | IN | 100 | IN | L |
| 3659 | 2020 | MI | FT | Data Scientist | 3000000 | INR | 40481 | IN | 0 | IN | L |
| 3666 | 2021 | MI | FT | Big Data Engineer | 1672000 | INR | 22611 | IN | 0 | IN | L |
| 3667 | 2021 | MI | FT | Data Scientist | 420000 | INR | 5679 | IN | 100 | US | S |
| 3669 | 2021 | MI | FT | Data Scientist | 30400000 | CLP | 40038 | CL | 100 | CL | L |
| 3675 | 2021 | EX | CT | Principal Data Scientist | 416000 | USD | 416000 | US | 100 | US | S |
| 3678 | 2021 | MI | FT | Data Scientist | 2500000 | INR | 33808 | IN | 0 | IN | M |
| 3682 | 2020 | EN | FT | Data Engineer | 4450000 | JPY | 41689 | JP | 100 | JP | S |
| 3685 | 2020 | EN | FT | Data Science Consultant | 423000 | INR | 5707 | IN | 50 | IN | M |
| 3689 | 2020 | MI | FT | Product Data Analyst | 450000 | INR | 6072 | IN | 100 | IN | L |
| 3697 | 2020 | EX | FT | Director of Data Science | 325000 | USD | 325000 | US | 100 | US | L |
| 3705 | 2021 | EN | FT | Big Data Engineer | 435000 | INR | 5882 | IN | 0 | CH | L |
| 3729 | 2021 | EN | FT | AI Scientist | 1335000 | INR | 18053 | IN | 100 | AS | S |
| 3734 | 2021 | MI | FT | Lead Data Analyst | 1450000 | INR | 19609 | IN | 100 | IN | L |
| 3747 | 2021 | MI | FT | Applied Machine Learning Scientist | 423000 | USD | 423000 | US | 50 | US | L |
| 3750 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
| 3754 | 2021 | SE | FT | Data Science Manager | 7000000 | INR | 94665 | IN | 50 | IN | L |
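The IQR rule can be wrapped in a small reusable helper. The sketch below is one possible way to do it; the function name iqr_bounds and the toy values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data with one obviously extreme value
toy = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 100])
lower, upper = iqr_bounds(toy)
toy_outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(toy_outliers.tolist())
```

On the salary data, calling iqr_bounds(df['salary_in_usd']) instead of using the raw salary column would compute the fences on a single currency.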
Plot the identified outliers on the scatter plot or box plot to visualize their positions in the dataset.
for cn in df.columns:
    # Scatter plot with identified outliers
    plt.scatter(df[cn], df['salary_in_usd'])
    plt.scatter(outliers[cn], outliers['salary_in_usd'], color='red', label='Outliers')
    plt.xlabel(cn)
    plt.ylabel('salary_in_usd')
    plt.legend()
    plt.show()
Use histograms to visualize the distribution of numerical variables.
# Select the numeric columns and draw one histogram per column
for cn in df.select_dtypes(include='number').columns:
    plt.hist(df[cn], bins=10, color='blue', edgecolor='black')
    plt.xlabel(cn)
    plt.ylabel('Frequency')
    plt.title('Distribution of {}'.format(cn))
    plt.show()
Use bar plots to visualize the distribution of categorical variables.
# The categorical columns are stored with dtype object (see df.info() above)
for cn in df.select_dtypes(include='object'):
    plt.figure(figsize=(26, 7))
    sns.countplot(x=cn, data=df)
    plt.xticks(rotation=90)
    plt.xlabel(cn)
    plt.ylabel('Count')
    plt.title('Distribution of {}'.format(cn))
    plt.show()
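The bars of a count plot are simply category frequencies, which can also be read as numbers with value_counts. A minimal sketch on hypothetical toy values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column of experience levels
toy = pd.Series(['SE', 'MI', 'SE', 'EN', 'SE', 'MI'], name='experience_level')

# value_counts returns the frequencies sorted from most to least common
counts = toy.value_counts()
print(counts)
```

For columns with many categories, such as job_title, printing df['job_title'].value_counts().head(10) may be easier to read than a bar plot with rotated labels.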
Use scatter plots to show the relationship between two numerical variables.
# Plot each numeric column against salary_in_usd
for cn in df.select_dtypes(include='number'):
    plt.scatter(df[cn], df['salary_in_usd'])
    plt.xlabel(cn)
    plt.ylabel('salary_in_usd')
    plt.title('Scatter Plot: {} vs salary_in_usd'.format(cn))
    plt.show()
Use box plots to compare the distribution of a numerical variable across different categories.
# One box plot of salary_in_usd per categorical (object-dtype) column
for cn in df.select_dtypes(include='object'):
    plt.figure(figsize=(20, 7))
    sns.boxplot(x=cn, y='salary_in_usd', data=df)
    plt.xticks(rotation=90)
    plt.xlabel(cn)
    plt.ylabel('Salary')
    plt.title('Box Plot: {} vs Salary'.format(cn))
    plt.show()
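The quartiles that each box represents can also be computed per category with groupby. This is a rough sketch on hypothetical toy rows, not on the real dataset:

```python
import pandas as pd

# Hypothetical toy rows: two salaries per experience level
toy = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'MI', 'SE', 'SE'],
    'salary_in_usd': [50_000, 70_000, 90_000, 110_000, 140_000, 160_000],
})

# describe() reports the quartiles that define the box (25%, 50%, 75%)
summary = toy.groupby('experience_level')['salary_in_usd'].describe()
print(summary[['25%', '50%', '75%']])
```

The same call on df gives the numbers behind each box in the plots above, which helps when two categories look similar by eye.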
Use heatmaps to visualize the correlation between numerical variables.
# numeric_only=True restricts the correlation to numeric columns and avoids the FutureWarning
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()
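Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of DataFrame.corr. As an illustration, it can be computed by hand; the helper name pearson_r and the toy vectors below are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy vectors: y rises linearly with x, y_rev falls linearly
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
y_rev = y[::-1]

r_up = pearson_r(x, y)
r_down = pearson_r(x, y_rev)
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 only rules out a linear one; a non-linear dependence can still be present.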
Use 3D scatter plots to show the relationship between three numerical variables.
import plotly.express as px
fig = px.scatter_3d(df, x='work_year', y='salary_in_usd', z='salary')
fig.show()
Next, convert the categorical columns to numeric codes so that they can be included in a correlation matrix.
# List the unique 'experience_level' categories.
df['experience_level'].unique()
array(['SE', 'MI', 'EN', 'EX'], dtype=object)
# List the unique 'employment_type' categories.
df['employment_type'].unique()
array(['FT', 'CT', 'FL', 'PT'], dtype=object)
# List the unique 'company_size' categories.
df['company_size'].unique()
array(['L', 'S', 'M'], dtype=object)
df['remote_ratio'] = df['remote_ratio'].astype('category').cat.codes
df['company_size'] = df['company_size'].astype('category').cat.codes
df['employment_type'] = df['employment_type'].astype('category').cat.codes
df['experience_level'] = df['experience_level'].astype('category').cat.codes
df.sample(5)
| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 438 | 2023 | 3 | 2 | Data Scientist | 113750 | USD | 113750 | IE | 0 | IE | 1 |
| 1648 | 2023 | 3 | 2 | Data Engineer | 180000 | USD | 180000 | US | 0 | US | 1 |
| 1471 | 2023 | 3 | 2 | Analytics Engineer | 175000 | USD | 175000 | US | 2 | US | 1 |
| 194 | 2023 | 3 | 2 | Data Scientist | 190000 | USD | 190000 | US | 0 | US | 1 |
| 682 | 2023 | 3 | 2 | Analytics Engineer | 87000 | USD | 87000 | US | 0 | US | 1 |
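One caveat with astype('category').cat.codes is that the codes follow the alphabetical order of the labels, so experience_level becomes EN=0, EX=1, MI=2, SE=3. If the levels are meant to be ordinal, an explicit order can be passed instead. The sketch below assumes the seniority order EN < MI < SE < EX (entry, mid, senior, executive); that reading of the abbreviations is an assumption, since the notebook does not spell them out.

```python
import pandas as pd

# Hypothetical toy column containing each level once
toy = pd.DataFrame({'experience_level': ['SE', 'MI', 'EN', 'EX']})

# Default behaviour: codes follow alphabetical order (EN, EX, MI, SE)
alphabetical = toy['experience_level'].astype('category').cat.codes

# Assumed seniority order; adjust if the abbreviations mean something else
order = ['EN', 'MI', 'SE', 'EX']
ordinal = pd.Categorical(
    toy['experience_level'], categories=order, ordered=True
).codes

print(list(alphabetical))
print(list(ordinal))
```

With an explicit order the codes increase with seniority, which makes a correlation with salary_in_usd easier to interpret. The same idea would apply to company_size (for example S < M < L) if it is treated as ordinal.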
Let's look at the correlation between the following properties:
df[['experience_level', 'employment_type', 'salary_in_usd', 'remote_ratio',
'company_size']].corr()
| | experience_level | employment_type | salary_in_usd | remote_ratio | company_size |
|---|---|---|---|---|---|
| experience_level | 1.000000 | -0.032794 | 0.327173 | -0.054025 | 0.066414 |
| employment_type | -0.032794 | 1.000000 | -0.010329 | -0.028673 | -0.041001 |
| salary_in_usd | 0.327173 | -0.010329 | 1.000000 | -0.064171 | -0.000372 |
| remote_ratio | -0.054025 | -0.028673 | -0.064171 | 1.000000 | -0.036928 |
| company_size | 0.066414 | -0.041001 | -0.000372 | -0.036928 | 1.000000 |
Visualize the same correlation coefficients for the chosen features with sns.heatmap:
correlation_matrix = df[['experience_level', 'employment_type', 'salary_in_usd',
'remote_ratio', 'company_size']].corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()