1 Problem


Below is a snippet from a table that contains information about employees that work at Company XYZ:

Column name Data type Example value Description
employee_name string Cindy Name of employee
employee_id integer 1837204 Unique id for each employee
yrs_of_experience integer 14 total working years of experience
yrs_at_company integer 10 total working years at Company XYZ
compensation integer 100000 dollar value of employee compensation
career_track string technical Potential values: technical, non-technical, executive

Company XYZ Human Resource department is trying to understand compensation across the company and asked you to pull data to help them make a decision regarding employee compensation.

Can you pull the average, median, minimum, maximum, and standard deviations for salary across 5 year experience buckets at Company XYZ? (e.g. get the corresponding average, median, minimum, maximum, and standard deviations for experience buckets 0-5, 5-10, 10-15, etc.) You can assume the data is imported into a dataframe named, df.

2 Resolving Problem


2.1 In Python

Libraries:

import pandas as pd
import numpy as np

Example data:


df = pd.DataFrame({'employee_name': ["Cindy", "Leandro", "Cristian", "Felipe", "Marina", "Luz", "Daniel", "Joseph", "Charles", "Tom"], 
                   'employee_id': list(range(1, 11, 1)), 
                   'yrs_of_experience': [4, 3, 5, 7, 8, 6, 10, 11, 20, 21], 
                   'yrs_at_company': [1, 2, 2, 1, 4, 4, 3, 3, 5, 6], 
                   'compensation': [100000, 100000, 150000, 150000, 230000, 134000, 456000, 321000, 123000, 123000], 
                   'career_track': ["technical" for i in range(1, 5)] + ["non-technical" for i in range(1, 5)] + ["executive" for i in range(1, 3)]})

df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 10 entries, 0 to 9
## Data columns (total 6 columns):
##  #   Column             Non-Null Count  Dtype 
## ---  ------             --------------  ----- 
##  0   employee_name      10 non-null     object
##  1   employee_id        10 non-null     int64 
##  2   yrs_of_experience  10 non-null     int64 
##  3   yrs_at_company     10 non-null     int64 
##  4   compensation       10 non-null     int64 
##  5   career_track       10 non-null     object
## dtypes: int64(4), object(2)
## memory usage: 608.0+ bytes
df.describe()
##        employee_id  yrs_of_experience  yrs_at_company   compensation
## count     10.00000          10.000000        10.00000      10.000000
## mean       5.50000           9.500000         3.10000  188700.000000
## std        3.02765           6.311365         1.66333  115956.936451
## min        1.00000           3.000000         1.00000  100000.000000
## 25%        3.25000           5.250000         2.00000  123000.000000
## 50%        5.50000           7.500000         3.00000  142000.000000
## 75%        7.75000          10.750000         4.00000  210000.000000
## max       10.00000          21.000000         6.00000  456000.000000

Discretize experience variable using cut function from pandas library:

df['interval'] = pd.cut(x = df.yrs_of_experience, 
                        bins = range(0, 30, 5), # Interval limits
                        right = True)

Group by and aggregate functions using pandas:

df.groupby(by = "interval") \
  .agg(func = ["count", "mean", "median", "min", "max", "std"]) \
  .round(decimals = 2)["compensation"]
##           count       mean  median     min     max        std
## interval                                                     
## (0, 5]        3  116666.67  100000  100000  150000   28867.51
## (5, 10]       4  242500.00  190000  134000  456000  148399.24
## (10, 15]      1  321000.00  321000  321000  321000        NaN
## (15, 20]      1  123000.00  123000  123000  123000        NaN
## (20, 25]      1  123000.00  123000  123000  123000        NaN

2.2 In R

Libraries:

library(tidyverse)
library(reticulate)

Creating data frame in R:

df_in_R <- data.frame(employee_name =  c("Cindy", "Leandro", "Cristian", "Felipe", "Marina", "Luz", "Daniel", "Joseph", "Charles", "Tom"),
                      employee_id= seq(1, 10, 1), 
                      yrs_of_experience = c(4, 3, 5, 7, 8, 6, 10, 11, 20, 21),
                      yrs_at_company = c(1, 2, 2, 1, 4, 4, 3, 3, 5, 6), 
                      compensation = c(100000, 100000, 150000, 150000, 230000, 134000, 456000, 321000, 123000, 123000), 
                      career_track = c(rep("technical", 4), rep("non-technical", 4), rep("executive", 2) ))

str(df_in_R)
## 'data.frame':    10 obs. of  6 variables:
##  $ employee_name    : chr  "Cindy" "Leandro" "Cristian" "Felipe" ...
##  $ employee_id      : num  1 2 3 4 5 6 7 8 9 10
##  $ yrs_of_experience: num  4 3 5 7 8 6 10 11 20 21
##  $ yrs_at_company   : num  1 2 2 1 4 4 3 3 5 6
##  $ compensation     : num  100000 100000 150000 150000 230000 134000 456000 321000 123000 123000
##  $ career_track     : chr  "technical" "technical" "technical" "technical" ...
summary(df_in_R)
##  employee_name       employee_id    yrs_of_experience yrs_at_company
##  Length:10          Min.   : 1.00   Min.   : 3.00     Min.   :1.0   
##  Class :character   1st Qu.: 3.25   1st Qu.: 5.25     1st Qu.:2.0   
##  Mode  :character   Median : 5.50   Median : 7.50     Median :3.0   
##                     Mean   : 5.50   Mean   : 9.50     Mean   :3.1   
##                     3rd Qu.: 7.75   3rd Qu.:10.75     3rd Qu.:4.0   
##                     Max.   :10.00   Max.   :21.00     Max.   :6.0   
##   compensation    career_track      
##  Min.   :100000   Length:10         
##  1st Qu.:123000   Class :character  
##  Median :142000   Mode  :character  
##  Mean   :188700                     
##  3rd Qu.:210000                     
##  Max.   :456000

Group by and aggregate functions using dplyr:

df_in_R %>% 
  mutate(interval = cut(x = yrs_of_experience, 
                        breaks = seq(0, 30, 5), 
                        right = TRUE)) %>% 
  group_by(interval) %>% 
  summarize(count = n(), 
            mean = mean(compensation), 
            median = median(compensation), 
            min = min(compensation), 
            max = max(compensation), 
            std = sd(compensation)) %>% 
  ungroup()
## # A tibble: 5 x 7
##   interval count    mean median    min    max     std
##   <fct>    <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 (0,5]        3 116667. 100000 100000 150000  28868.
## 2 (5,10]       4 242500  190000 134000 456000 148399.
## 3 (10,15]      1 321000  321000 321000 321000     NA 
## 4 (15,20]      1 123000  123000 123000 123000     NA 
## 5 (20,25]      1 123000  123000 123000 123000     NA

3 R verbs in Python

Dplyr verbs in Python using the siuba package.

Libraries:

# siuba Package:
from siuba import _, group_by, summarize

Importing data from R in Python:

data = pd.DataFrame(df)

data.info()
data.describe()
(data >> 
  group_by(_.interval) >> 
  summarize(mean   = np.mean(_.compensation),
            median = (_.compensation).median(), # np.median()
            min = np.min(_.compensation), 
            max = np.max(_.compensation), 
            std = np.std(_.compensation))
  )