Below is a snippet from a table that contains information about employees that work at Company XYZ:
Column name | Data type | Example value | Description |
---|---|---|---|
employee_name | string | Cindy | Name of employee |
employee_id | integer | 1837204 | Unique id for each employee |
yrs_of_experience | integer | 14 | total working years of experience |
yrs_at_company | integer | 10 | total working years at Company XYZ |
compensation | integer | 100000 | dollar value of employee compensation |
career_track | string | technical | Potential values: technical, non-technical, executive |
Company XYZ Human Resource department is trying to understand compensation across the company and asked you to pull data to help them make a decision regarding employee compensation.
Can you pull the average, median, minimum, maximum, and standard deviations for salary across 5 year experience buckets at Company XYZ? (e.g. get the corresponding average, median, minimum, maximum, and standard deviations for experience buckets 0-5, 5-10, 10-15, etc.) You can assume the data is imported into a dataframe named, df.
Libraries:
import pandas as pd
import numpy as np
Example data:
= pd.DataFrame({'employee_name': ["Cindy", "Leandro", "Cristian", "Felipe", "Marina", "Luz", "Daniel", "Joseph", "Charles", "Tom"],
df 'employee_id': list(range(1, 11, 1)),
'yrs_of_experience': [4, 3, 5, 7, 8, 6, 10, 11, 20, 21],
'yrs_at_company': [1, 2, 2, 1, 4, 4, 3, 3, 5, 6],
'compensation': [100000, 100000, 150000, 150000, 230000, 134000, 456000, 321000, 123000, 123000],
'career_track': ["technical" for i in range(1, 5)] + ["non-technical" for i in range(1, 5)] + ["executive" for i in range(1, 3)]})
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 10 entries, 0 to 9
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 employee_name 10 non-null object
## 1 employee_id 10 non-null int64
## 2 yrs_of_experience 10 non-null int64
## 3 yrs_at_company 10 non-null int64
## 4 compensation 10 non-null int64
## 5 career_track 10 non-null object
## dtypes: int64(4), object(2)
## memory usage: 608.0+ bytes
df.describe()
## employee_id yrs_of_experience yrs_at_company compensation
## count 10.00000 10.000000 10.00000 10.000000
## mean 5.50000 9.500000 3.10000 188700.000000
## std 3.02765 6.311365 1.66333 115956.936451
## min 1.00000 3.000000 1.00000 100000.000000
## 25% 3.25000 5.250000 2.00000 123000.000000
## 50% 5.50000 7.500000 3.00000 142000.000000
## 75% 7.75000 10.750000 4.00000 210000.000000
## max 10.00000 21.000000 6.00000 456000.000000
Discretize experience variable using cut
function from pandas
library:
'interval'] = pd.cut(x = df.yrs_of_experience,
df[= range(0, 30, 5), # Interval limits
bins = True) right
Group by and aggregate functions using pandas
:
= "interval") \
df.groupby(by = ["count", "mean", "median", "min", "max", "std"]) \
.agg(func round(decimals = 2)["compensation"] .
## count mean median min max std
## interval
## (0, 5] 3 116666.67 100000 100000 150000 28867.51
## (5, 10] 4 242500.00 190000 134000 456000 148399.24
## (10, 15] 1 321000.00 321000 321000 321000 NaN
## (15, 20] 1 123000.00 123000 123000 123000 NaN
## (20, 25] 1 123000.00 123000 123000 123000 NaN
Libraries:
library(tidyverse)
library(reticulate)
Creating data frame in R:
<- data.frame(employee_name = c("Cindy", "Leandro", "Cristian", "Felipe", "Marina", "Luz", "Daniel", "Joseph", "Charles", "Tom"),
df_in_R employee_id= seq(1, 10, 1),
yrs_of_experience = c(4, 3, 5, 7, 8, 6, 10, 11, 20, 21),
yrs_at_company = c(1, 2, 2, 1, 4, 4, 3, 3, 5, 6),
compensation = c(100000, 100000, 150000, 150000, 230000, 134000, 456000, 321000, 123000, 123000),
career_track = c(rep("technical", 4), rep("non-technical", 4), rep("executive", 2) ))
str(df_in_R)
## 'data.frame': 10 obs. of 6 variables:
## $ employee_name : chr "Cindy" "Leandro" "Cristian" "Felipe" ...
## $ employee_id : num 1 2 3 4 5 6 7 8 9 10
## $ yrs_of_experience: num 4 3 5 7 8 6 10 11 20 21
## $ yrs_at_company : num 1 2 2 1 4 4 3 3 5 6
## $ compensation : num 100000 100000 150000 150000 230000 134000 456000 321000 123000 123000
## $ career_track : chr "technical" "technical" "technical" "technical" ...
summary(df_in_R)
## employee_name employee_id yrs_of_experience yrs_at_company
## Length:10 Min. : 1.00 Min. : 3.00 Min. :1.0
## Class :character 1st Qu.: 3.25 1st Qu.: 5.25 1st Qu.:2.0
## Mode :character Median : 5.50 Median : 7.50 Median :3.0
## Mean : 5.50 Mean : 9.50 Mean :3.1
## 3rd Qu.: 7.75 3rd Qu.:10.75 3rd Qu.:4.0
## Max. :10.00 Max. :21.00 Max. :6.0
## compensation career_track
## Min. :100000 Length:10
## 1st Qu.:123000 Class :character
## Median :142000 Mode :character
## Mean :188700
## 3rd Qu.:210000
## Max. :456000
Group by and aggregate functions using dplyr
:
%>%
df_in_R mutate(interval = cut(x = yrs_of_experience,
breaks = seq(0, 30, 5),
right = TRUE)) %>%
group_by(interval) %>%
summarize(count = n(),
mean = mean(compensation),
median = median(compensation),
min = min(compensation),
max = max(compensation),
std = sd(compensation)) %>%
ungroup()
## # A tibble: 5 x 7
## interval count mean median min max std
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (0,5] 3 116667. 100000 100000 150000 28868.
## 2 (5,10] 4 242500 190000 134000 456000 148399.
## 3 (10,15] 1 321000 321000 321000 321000 NA
## 4 (15,20] 1 123000 123000 123000 123000 NA
## 5 (20,25] 1 123000 123000 123000 123000 NA
Dplyr verbs in Python using the siuba
package.
Libraries:
# siuba Package:
from siuba import _, group_by, summarize
Importing data from R in Python:
= pd.DataFrame(df)
data
data.info() data.describe()
>>
(data >>
group_by(_.interval) = np.mean(_.compensation),
summarize(mean = (_.compensation).median(), # np.median()
median min = np.min(_.compensation),
max = np.max(_.compensation),
= np.std(_.compensation))
std )