Salary Predictor

In [31]:

from google.colab import files

In [32]:

uploaded = files.upload()

In [33]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [34]:

pd.set_option('display.max_rows', None)

In [35]:

df = pd.read_csv('ds_salaries.csv')

In [36]:

df.head()

Out[36]:

	work_year	experience_level	employment_type	job_title	salary	salary_currency	salary_in_usd	employee_residence	remote_ratio	company_location	company_size
0	2023	SE	FT	Principal Data Scientist	80000	EUR	85847	ES	100	ES	L
1	2023	MI	CT	ML Engineer	30000	USD	30000	US	100	US	S
2	2023	MI	CT	ML Engineer	25500	USD	25500	US	100	US	S
3	2023	SE	FT	Data Scientist	175000	USD	175000	CA	100	CA	M
4	2023	SE	FT	Data Scientist	120000	USD	120000	CA	100	CA	M

In [37]:

df.columns

Out[37]:

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
       'remote_ratio', 'company_location', 'company_size'],
      dtype='object')

In [38]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB

In [39]:

df.describe()

Out[39]:

	work_year	salary	salary_in_usd	remote_ratio
count	3755.000000	3.755000e+03	3755.000000	3755.000000
mean	2022.373635	1.906956e+05	137570.389880	46.271638
std	0.691448	6.716765e+05	63055.625278	48.589050
min	2020.000000	6.000000e+03	5132.000000	0.000000
25%	2022.000000	1.000000e+05	95000.000000	0.000000
50%	2022.000000	1.380000e+05	135000.000000	0.000000
75%	2023.000000	1.800000e+05	175000.000000	100.000000
max	2023.000000	3.040000e+07	450000.000000	100.000000

A box plot shows the distribution of a numerical variable and draws points beyond the whiskers individually, which makes it a quick way to spot outliers.

The summary above already suggests outliers in the salary columns: the maximum of salary (3.04e+07) is far above its 75th percentile (1.8e+05).

In [40]:

df.boxplot(column=['salary'])
plt.show()


Plot each column against salary_in_usd as a scatter plot. Scatter plots help identify data points that deviate significantly from the general trend.

In [41]:

for cn in df.columns:
  plt.scatter(df[cn], df['salary_in_usd'])
  plt.xlabel(cn)
  plt.ylabel('salary_in_usd')
  plt.show()

Calculate the Z-score of a numerical variable to identify outliers. The Z-score measures how many standard deviations a data point lies from the mean. Data points whose absolute Z-score exceeds a threshold (e.g., 3) can be considered outliers.

In [42]:

from scipy import stats

In [43]:

z_scores = stats.zscore(df['salary_in_usd'])
outliers = df[abs(z_scores) > 3]
outliers

Out[43]:

	work_year	experience_level	employment_type	job_title	salary	salary_currency	salary_in_usd	employee_residence	remote_ratio	company_location	company_size
33	2023	SE	FT	Computer Vision Engineer	342810	USD	342810	US	0	US	M
133	2023	SE	FT	Machine Learning Engineer	342300	USD	342300	US	0	US	L
228	2023	EX	FT	Head of Data	329500	USD	329500	US	0	US	M
478	2023	EX	FT	Director of Data Science	353200	USD	353200	US	0	US	M
528	2023	SE	FT	AI Scientist	1500000	ILS	423834	IL	0	IL	L
649	2023	SE	FT	Data Architect	376080	USD	376080	US	100	US	M
845	2023	MI	FT	Research Scientist	340000	USD	340000	US	100	US	M
1105	2023	SE	FT	Data Scientist	370000	USD	370000	US	0	US	M
1258	2022	SE	FT	Machine Learning Software Engineer	375000	USD	375000	US	100	US	M
1288	2023	SE	FT	Data Analyst	385000	USD	385000	US	0	US	M
1311	2023	SE	FT	Research Scientist	370000	USD	370000	US	0	US	M
1421	2023	SE	FT	Applied Scientist	350000	USD	350000	US	0	US	L
2011	2022	MI	FT	Data Analyst	350000	GBP	430967	GB	0	GB	M
2359	2022	SE	FT	Data Science Tech Lead	375000	USD	375000	US	50	US	L
2374	2022	SE	FT	Data Scientist	350000	USD	350000	US	100	US	M
2555	2022	SE	FT	Data Architect	345600	USD	345600	US	0	US	M
3463	2022	SE	FT	Data Analytics Lead	405000	USD	405000	US	100	US	L
3468	2022	SE	FT	Applied Data Scientist	380000	USD	380000	US	100	US	L
3522	2020	MI	FT	Research Scientist	450000	USD	450000	US	0	US	M
3675	2021	EX	CT	Principal Data Scientist	416000	USD	416000	US	100	US	S
3747	2021	MI	FT	Applied Machine Learning Scientist	423000	USD	423000	US	50	US	L
3750	2020	SE	FT	Data Scientist	412000	USD	412000	US	100	US	L
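
The Z-score filter above can also be written out by hand, which makes the definition explicit. The sketch below uses made-up salaries (not rows from ds_salaries.csv) and assumes the population standard deviation (ddof=0), which is the default of scipy.stats.zscore.

```python
import pandas as pd

# Made-up salaries for illustration only: 19 ordinary values plus one extreme value.
toy = pd.Series(list(range(80_000, 175_000, 5_000)) + [900_000],
                name='salary_in_usd')

# Z-score: distance from the mean in units of the population standard deviation.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep only the values whose absolute Z-score exceeds the threshold of 3.
flagged = toy[z.abs() > 3]
print(flagged.tolist())
```

With very small samples no point can reach a Z-score of 3, so the threshold only becomes useful once there are enough rows.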

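Another option is the interquartile range (IQR) method. The IQR is the distance between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers. The cell below applies the method to the raw salary column, which is expressed in local currency, so many of the flagged rows are salaries quoted in currencies such as INR, JPY or HUF rather than unusually high pay.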

In [44]:

# Calculate IQR for the 'salary' column
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers using the IQR method
outliers = df[(df['salary'] < Q1 - 1.5 * IQR) | (df['salary'] > Q3 + 1.5 * IQR)]
outliers

Out[44]:

	work_year	experience_level	employment_type	job_title	salary	salary_currency	salary_in_usd	employee_residence	remote_ratio	company_location	company_size
33	2023	SE	FT	Computer Vision Engineer	342810	USD	342810	US	0	US	M
41	2022	MI	FT	Machine Learning Engineer	1650000	INR	20984	IN	50	IN	L
68	2023	SE	FT	Applied Scientist	309400	USD	309400	US	0	US	L
80	2023	MI	FT	Data Scientist	510000	HKD	65062	HK	0	HK	L
133	2023	SE	FT	Machine Learning Engineer	342300	USD	342300	US	0	US	L
145	2023	SE	FT	Machine Learning Engineer	318300	USD	318300	US	100	US	M
156	2023	MI	FT	Applied Data Scientist	1700000	INR	20670	IN	100	IN	L
163	2023	SE	FT	Applied Scientist	309400	USD	309400	US	0	US	L
217	2023	EN	FT	Data Engineer	1400000	INR	17022	IN	100	IN	L
228	2023	EX	FT	Head of Data	329500	USD	329500	US	0	US	M
358	2023	SE	FT	Machine Learning Engineer	304000	USD	304000	US	100	US	M
478	2023	EX	FT	Director of Data Science	353200	USD	353200	US	0	US	M
488	2023	SE	FT	Data Scientist	317070	USD	317070	US	0	US	M
528	2023	SE	FT	AI Scientist	1500000	ILS	423834	IL	0	IL	L
649	2023	SE	FT	Data Architect	376080	USD	376080	US	100	US	M
735	2023	MI	FT	Data Scientist	1400000	INR	17022	IN	100	IN	L
738	2023	MI	FT	Lead Data Analyst	1500000	INR	18238	IN	50	IN	L
845	2023	MI	FT	Research Scientist	340000	USD	340000	US	100	US	M
860	2023	EX	FT	Data Engineer	310000	USD	310000	US	100	US	M
988	2023	SE	FT	Data Analyst	1300000	INR	15806	IN	100	IN	S
998	2023	SE	FT	Data Science Consultant	1000000	THB	29453	TH	50	TH	M
1007	2023	EX	FT	Data Engineer	310000	USD	310000	US	100	US	M
1097	2023	SE	FT	Data Scientist	300240	USD	300240	US	0	US	M
1099	2023	SE	FT	Data Scientist	300240	USD	300240	US	0	US	M
1105	2023	SE	FT	Data Scientist	370000	USD	370000	US	0	US	M
1116	2023	SE	FT	Machine Learning Engineer	323300	USD	323300	US	0	US	M
1153	2023	EX	FT	Data Engineer	310000	USD	310000	US	100	US	M
1230	2023	EN	FT	Data Scientist	800000	INR	9727	IN	0	IN	L
1258	2022	SE	FT	Machine Learning Software Engineer	375000	USD	375000	US	100	US	M
1260	2023	MI	FT	Product Data Analyst	1350000	INR	16414	IN	100	IN	L
1286	2023	SE	FT	Machine Learning Engineer	318300	USD	318300	US	100	US	M
1288	2023	SE	FT	Data Analyst	385000	USD	385000	US	0	US	M
1311	2023	SE	FT	Research Scientist	370000	USD	370000	US	0	US	M
1341	2023	EN	FT	Data Scientist	1050000	INR	12767	IN	50	IN	L
1396	2023	EX	FT	Head of Data Science	314100	USD	314100	US	0	US	M
1421	2023	SE	FT	Applied Scientist	350000	USD	350000	US	0	US	L
1427	2023	EX	FT	Data Engineer	310000	USD	310000	US	100	US	M
1462	2023	MI	FT	Head of Data Science	5000000	INR	60795	IN	50	IN	L
1512	2023	EN	FT	Data Scientist	1060000	INR	12888	IN	50	IN	S
1549	2023	MI	FT	Data Analytics Lead	1440000	INR	17509	IN	50	SG	M
1595	2023	MI	FT	Data Scientist	840000	THB	24740	TH	50	TH	L
1596	2022	MI	FT	Computer Vision Engineer	1250000	INR	15897	IN	100	IN	M
1722	2023	SE	FT	Data Engineer	310000	USD	310000	US	0	US	M
1738	2021	SE	FT	Data Scientist	4000000	INR	54094	IN	100	IN	L
1739	2022	MI	FT	Business Data Analyst	1440000	INR	18314	IN	50	IN	L
1810	2022	MI	FT	Data Analyst	1125000	INR	14307	IN	100	IN	L
1816	2022	MI	FT	Data Scientist	1100000	INR	13989	IN	100	IN	L
1868	2022	SE	FT	Lead Data Scientist	4460000	INR	56723	IN	0	IN	L
1918	2022	MI	FT	Data Scientist	2500000	INR	31795	IN	100	US	M
1932	2022	EX	FT	Data Engineer	310000	USD	310000	US	100	US	M
1946	2022	MI	FT	Data Engineer	2800000	INR	35610	IN	50	IN	L
2011	2022	MI	FT	Data Analyst	350000	GBP	430967	GB	0	GB	M
2032	2021	MI	FT	Data Analyst	1250000	INR	16904	IN	50	IN	L
2279	2022	EX	FT	Data Engineer	310000	USD	310000	US	100	US	M
2358	2022	EN	FT	Data Scientist	6600000	HUF	17684	HU	100	HU	M
2359	2022	SE	FT	Data Science Tech Lead	375000	USD	375000	US	50	US	L
2374	2022	SE	FT	Data Scientist	350000	USD	350000	US	100	US	M
2406	2022	SE	FT	Data Engineer	315000	USD	315000	US	100	US	M
2555	2022	SE	FT	Data Architect	345600	USD	345600	US	0	US	M
2578	2021	EN	FT	Power BI Developer	400000	INR	5409	IN	50	IN	L
2655	2022	SE	FT	Principal Data Architect	3000000	INR	38154	IN	100	IN	L
2786	2022	EN	FT	Data Scientist	1800000	INR	22892	IN	50	IN	M
2800	2022	EN	FT	BI Data Analyst	633000	INR	8050	IN	100	IN	M
2872	2022	EN	FT	Data Analyst	500000	INR	6359	FR	100	IN	L
2966	2022	SE	FT	Lead Machine Learning Engineer	7500000	INR	95386	IN	50	IN	L
3023	2022	MI	FT	Data Analyst	450000	INR	5723	IN	100	IN	S
3060	2021	EN	FT	Machine Learning Research Engineer	900000	INR	12171	IN	100	IN	M
3061	2022	MI	FT	Data Scientist	4200000	INR	53416	IN	100	ID	L
3075	2022	MI	FL	Applied Machine Learning Scientist	2400000	INR	30523	IN	100	IN	S
3119	2020	EN	FT	Data Engineer	1000000	INR	13493	IN	100	IN	L
3120	2020	EN	FT	Data Engineer	1000000	INR	13493	IN	100	IN	L
3192	2022	EX	FT	Head of Machine Learning	6000000	INR	76309	IN	50	IN	L
3410	2022	EX	FT	Data Engineer	324000	USD	324000	US	100	US	M
3422	2022	MI	FT	Business Data Analyst	1400000	INR	17805	IN	100	IN	M
3423	2022	MI	FT	Data Scientist	2400000	INR	30523	IN	100	IN	L
3426	2022	EN	FT	Data Scientist	1400000	INR	17805	IN	100	IN	M
3463	2022	SE	FT	Data Analytics Lead	405000	USD	405000	US	100	US	L
3468	2022	SE	FT	Applied Data Scientist	380000	USD	380000	US	100	US	L
3475	2021	MI	FT	ML Engineer	8500000	JPY	77364	JP	50	JP	S
3476	2021	MI	FT	ML Engineer	7000000	JPY	63711	JP	50	JP	S
3489	2021	SE	FT	Lead Data Scientist	3000000	INR	40570	IN	50	IN	L
3494	2021	MI	FT	Data Scientist	700000	INR	9466	IN	0	IN	S
3522	2020	MI	FT	Research Scientist	450000	USD	450000	US	0	US	M
3537	2021	MI	PT	3D Computer Vision Researcher	400000	INR	5409	IN	50	IN	M
3567	2021	EN	FT	Data Engineer	2250000	INR	30428	IN	100	IN	L
3574	2021	MI	FT	BI Data Analyst	11000000	HUF	36259	HU	50	US	L
3581	2021	EN	FT	Data Scientist	2200000	INR	29751	IN	50	IN	L
3589	2021	EN	FT	Data Scientist	2100000	INR	28399	IN	100	IN	M
3593	2020	EN	FT	Data Analyst	450000	INR	6072	IN	0	IN	S
3594	2020	SE	FT	Data Engineer	720000	MXN	33511	MX	0	MX	S
3605	2021	EN	FT	Data Engineer	1600000	INR	21637	IN	50	IN	M
3639	2021	SE	FT	Machine Learning Engineer	4900000	INR	66265	IN	0	IN	L
3640	2021	MI	FT	Data Scientist	1250000	INR	16904	IN	100	IN	S
3644	2021	EN	FT	Big Data Engineer	1200000	INR	16228	IN	100	IN	L
3646	2020	MI	FT	Data Scientist	11000000	HUF	35735	HU	50	HU	L
3649	2021	SE	FT	Data Science Manager	4000000	INR	54094	IN	50	US	L
3650	2021	SE	FT	Machine Learning Engineer	1799997	INR	24342	IN	100	IN	L
3659	2020	MI	FT	Data Scientist	3000000	INR	40481	IN	0	IN	L
3666	2021	MI	FT	Big Data Engineer	1672000	INR	22611	IN	0	IN	L
3667	2021	MI	FT	Data Scientist	420000	INR	5679	IN	100	US	S
3669	2021	MI	FT	Data Scientist	30400000	CLP	40038	CL	100	CL	L
3675	2021	EX	CT	Principal Data Scientist	416000	USD	416000	US	100	US	S
3678	2021	MI	FT	Data Scientist	2500000	INR	33808	IN	0	IN	M
3682	2020	EN	FT	Data Engineer	4450000	JPY	41689	JP	100	JP	S
3685	2020	EN	FT	Data Science Consultant	423000	INR	5707	IN	50	IN	M
3689	2020	MI	FT	Product Data Analyst	450000	INR	6072	IN	100	IN	L
3697	2020	EX	FT	Director of Data Science	325000	USD	325000	US	100	US	L
3705	2021	EN	FT	Big Data Engineer	435000	INR	5882	IN	0	CH	L
3729	2021	EN	FT	AI Scientist	1335000	INR	18053	IN	100	AS	S
3734	2021	MI	FT	Lead Data Analyst	1450000	INR	19609	IN	100	IN	L
3747	2021	MI	FT	Applied Machine Learning Scientist	423000	USD	423000	US	50	US	L
3750	2020	SE	FT	Data Scientist	412000	USD	412000	US	100	US	L
3754	2021	SE	FT	Data Science Manager	7000000	INR	94665	IN	50	IN	L
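
The IQR rule can be wrapped in a small helper so that the same fences are reused for any column. This is only a sketch: iqr_fences is a hypothetical name, not part of pandas, and the salaries are made up for illustration.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries for illustration only (not rows from ds_salaries.csv).
toy = pd.Series([50_000, 60_000, 70_000, 80_000, 90_000,
                 100_000, 110_000, 120_000, 500_000])

lower, upper = iqr_fences(toy)

# Values outside the fences are treated as outliers.
flagged = toy[(toy < lower) | (toy > upper)]
print(lower, upper, flagged.tolist())
```

Calling the helper on df['salary_in_usd'] instead of df['salary'] would compare all rows in a single currency.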

Overlay the identified outliers in red on the scatter plots to see where they sit in the dataset.

In [45]:

for cn in df.columns:
  # Scatter plot with identified outliers
  plt.scatter(df[cn], df['salary_in_usd'])
  plt.scatter(outliers[cn], outliers['salary_in_usd'], color='red', label='Outliers')
  plt.xlabel(cn)
  plt.ylabel('salary_in_usd')
  plt.legend()
  plt.show()

Use histograms to visualize the distribution of numerical variables.

In [46]:

# Select the numeric columns (the int64 columns listed by df.info())
for cn in df.select_dtypes(include='number').columns:
  plt.hist(df[cn], bins=10, color='blue', edgecolor='black')
  plt.xlabel(cn)
  plt.ylabel('Frequency')
  plt.title('Distribution of {}'.format(cn))
  plt.show()

Use bar plots to visualize the distribution of categorical variables.

In [47]:

# The text columns are stored with dtype object (see df.info())
for cn in df.select_dtypes(include='object').columns:
  plt.figure(figsize=(26, 7))
  sns.countplot(x=cn, data=df)
  plt.xticks(rotation=90)
  plt.xlabel(cn)
  plt.ylabel('Count')
  plt.title('Distribution of {}'.format(cn))
  plt.show()
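
For columns with many distinct values, such as job_title, a count plot gets crowded. One possible way to keep it readable is to tabulate the counts first and keep only the most frequent categories. The sketch below uses made-up job titles, and the cutoff of 2 is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up job titles for illustration only (not rows from ds_salaries.csv).
toy = pd.DataFrame({'job_title': [
    'Data Scientist', 'Data Engineer', 'Data Scientist', 'ML Engineer',
    'Data Scientist', 'Data Engineer', 'AI Scientist',
]})

# value_counts() sorts categories by frequency, most common first.
top = toy['job_title'].value_counts().head(2)

# Keep only the rows that belong to the most frequent categories.
subset = toy[toy['job_title'].isin(top.index)]
print(top.to_dict())
```

The filtered frame can then be passed to sns.countplot in place of the full DataFrame.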

Use scatter plots to show the relationship between two numerical variables.

In [48]:

# Select the numeric columns (the int64 columns listed by df.info())
for cn in df.select_dtypes(include='number').columns:
  plt.scatter(df[cn], df['salary_in_usd'])
  plt.xlabel(cn)
  plt.ylabel('salary_in_usd')
  plt.title('Scatter Plot: {} vs salary_in_usd'.format(cn))
  plt.show()

Use box plots to compare the distribution of a numerical variable across different categories.

In [49]:

# The text columns are stored with dtype object (see df.info())
for cn in df.select_dtypes(include='object').columns:
  plt.figure(figsize=(20, 7))
  sns.boxplot(x=cn, y='salary_in_usd', data=df)
  plt.xticks(rotation=90)
  plt.xlabel(cn)
  plt.ylabel('Salary')
  plt.title('Box Plot: {} vs Salary'.format(cn))
  plt.show()

Use heatmaps to visualize the correlation between numerical variables.

In [50]:

# Restrict the correlation to numeric columns; the text columns cannot be correlated
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()

Use 3D scatter plots to show the relationship between three numerical variables.

In [51]:

import plotly.express as px

In [52]:

fig = px.scatter_3d(df, x='work_year', y='salary_in_usd', z='salary')
fig.show()

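The next step is to convert the categorical columns into numeric codes so that they can be included in a correlation matrix. Note that .cat.codes numbers the categories in sorted order (for experience_level: EN=0, EX=1, MI=2, SE=3), so the codes act as labels rather than a meaningful ranking.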

In [53]:

# See how many unique 'experience_level' categories are present.
df['experience_level'].unique()

Out[53]:

array(['SE', 'MI', 'EN', 'EX'], dtype=object)

In [54]:

# See how many unique 'employment_type' categories are present.
df['employment_type'].unique()

Out[54]:

array(['FT', 'CT', 'FL', 'PT'], dtype=object)

In [55]:

# See how many unique 'company_size' categories are present.
df['company_size'].unique()

Out[55]:

array(['L', 'S', 'M'], dtype=object)

In [57]:

df['remote_ratio'] = df['remote_ratio'].astype('category').cat.codes
df['company_size'] = df['company_size'].astype('category').cat.codes
df['employment_type'] = df['employment_type'].astype('category').cat.codes
df['experience_level'] = df['experience_level'].astype('category').cat.codes

In [60]:

df.sample(5)

Out[60]:

	work_year	experience_level	employment_type	job_title	salary	salary_currency	salary_in_usd	employee_residence	remote_ratio	company_location	company_size
438	2023	3	2	Data Scientist	113750	USD	113750	IE	0	IE	1
1648	2023	3	2	Data Engineer	180000	USD	180000	US	0	US	1
1471	2023	3	2	Analytics Engineer	175000	USD	175000	US	2	US	1
194	2023	3	2	Data Scientist	190000	USD	190000	US	0	US	1
682	2023	3	2	Analytics Engineer	87000	USD	87000	US	0	US	1
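
Because astype('category') sorts the categories, the resulting codes follow alphabetical order rather than any natural ranking. If a ranked encoding is wanted, an ordered categorical with an explicit category list is one option. The sketch below uses made-up rows and assumes that EN, MI, SE and EX stand for entry, mid, senior and executive level; the notebook does not state this, so the order is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from ds_salaries.csv).
toy = pd.DataFrame({'experience_level': ['SE', 'MI', 'EN', 'EX', 'MI']})

# Default behaviour: categories are sorted, so EN=0, EX=1, MI=2, SE=3.
alphabetical = toy['experience_level'].astype('category').cat.codes

# Assumed seniority order (an assumption, see the text above): EN < MI < SE < EX.
order = ['EN', 'MI', 'SE', 'EX']
ordinal = pd.Categorical(toy['experience_level'],
                         categories=order, ordered=True).codes

print(alphabetical.tolist())
print(ordinal.tolist())
```

With an explicit order, a larger code always means a higher level under the assumed ranking, which makes a correlation with salary easier to interpret.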

Let's look at the correlation between the following properties:

In [63]:

df[['experience_level', 'employment_type', 'salary_in_usd', 'remote_ratio',
    'company_size']].corr()

Out[63]:

	experience_level	employment_type	salary_in_usd	remote_ratio	company_size
experience_level	1.000000	-0.032794	0.327173	-0.054025	0.066414
employment_type	-0.032794	1.000000	-0.010329	-0.028673	-0.041001
salary_in_usd	0.327173	-0.010329	1.000000	-0.064171	-0.000372
remote_ratio	-0.054025	-0.028673	-0.064171	1.000000	-0.036928
company_size	0.066414	-0.041001	-0.000372	-0.036928	1.000000
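
Each entry in the table above is a Pearson correlation coefficient, which is the default method of DataFrame.corr. As a sanity check, the coefficient can be computed by hand and compared with the pandas result. The sketch below uses made-up values, not rows from ds_salaries.csv.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'experience_level': [0, 1, 2, 3, 3],
    'salary_in_usd': [60_000, 90_000, 120_000, 150_000, 170_000],
})

# Pearson r: covariance of the centered variables divided by the product
# of their spreads.
x = toy['experience_level'] - toy['experience_level'].mean()
y = toy['salary_in_usd'] - toy['salary_in_usd'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same coefficient as computed by pandas.
r_pandas = toy.corr().loc['experience_level', 'salary_in_usd']
print(r_manual, r_pandas)
```

Pearson correlation measures linear association only, so a coefficient near zero does not rule out a non-linear relationship.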

The same correlation coefficients for the chosen features, drawn as a heatmap with sns.heatmap:

In [62]:

correlation_matrix = df[['experience_level', 'employment_type', 'salary_in_usd',
                         'remote_ratio', 'company_size']].corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()