class: title-slide, middle, center # BUS 320 # Topic 3 # Data manipulation ## Elizabeth Stanny --- layout: true <div class="my-footer"><span>http://estannydotcom.netlify.com</span></div> --- # Learning objectives for the course - Ask the right questions -- - Extract, transform and load relevant data (ETL process) - Extract (`tidyverse/dplyr`) - Transform (`tidyverse/dplyr`) - Load Excel Workbooks (`tidyverse/readxl`) -- - Apply appropriate data analytic techniques - Descriptive statistics - `skimr`, `dplyr::summarize` -- - Interpret and share the results - `rmarkdown` files end with .Rmd --- # What is the [tidyverse](https://www.tidyverse.org/packages/) -- #### Data transformation with `dplyr` 1. expects **tidy** data - each variable in its own column - each observation in its own row 1. works with pipes **%>%** - x %>% f becomes f(x, y) - **%>%** take something as input for next command - "then" --- <!-- # Glimpse to find data `type` - `chr` refers to “character”, which is computer terminology for text data. In most forms, text data, such as the carrier or origin of a flight, are categorical variables -`int` and `dbl` refer to "integer" and "double", which are computer coding terminology for quantitative/numerical - Which variables in `corp_tax` are numerical (int or dbl) - Which variables in `corp_tax` are categorical (chr) --> # Data transformation with `dplyr` 1. first argument is a data frame 1. subsequent arguments describe what to do with the data frame 1. result is a new data frame --- # Key functions for action in `dplyr`. `function()` | Action --------------|-------------------------------------------------------- `mutate()` | creates new **variables (columns)** `select()` | picks **variables (columns)** based on their names `filter()` | picks **rows** based on their values `summarize()` | reduces multiple values down to a single summary `arrange()` | changes the order of the **rows** based on their values `group_by()` | create subsets of data to apply functions to --- .pull-left[ ### Comparison operators (filter) - `==` equality - `>` greater than - `<` less than - `>=` greater than or equal to - `<=` less than or equal to - `!=` not equal to - `between` numeric variable in a specified range - `near` compare 2 numeric vectors. Set tolerance ] -- .pull-right[ ### Combine criteria using operators that make comparisons: - `|` or - `&` and `,` ] --- # Load the packages and data ```r library(tidyverse) library(readxl) corp_tax <- read_excel("corp_tax.xlsx") ``` --- # corp_tax data frame is tidy .pull-left[ - Each row is a company (observation) ] .pull-right[ - Variables (columns) contain information on a company - company (name) - profit - tax - tax_rate - industry ] --- # slice sample of 10 values from corp_tax ```r corp_tax %>% slice_sample(n = 10) ``` <table> <thead> <tr> <th style="text-align:left;"> company </th> <th style="text-align:right;"> profit </th> <th style="text-align:right;"> tax </th> <th style="text-align:right;"> tax_rate </th> <th style="text-align:left;"> industry </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Eli Lilly </td> <td style="text-align:right;"> 597.612 </td> <td style="text-align:right;"> -54.3000 </td> <td style="text-align:right;"> -0.0908616 </td> <td style="text-align:left;"> Pharmaceuticals & medical products </td> </tr> <tr> <td style="text-align:left;"> Delta Air Lines </td> <td style="text-align:right;"> 5073.000 </td> <td style="text-align:right;"> -187.0000 </td> <td style="text-align:right;"> -0.0368618 </td> <td style="text-align:left;"> Transportation </td> </tr> <tr> <td style="text-align:left;"> Anthem </td> <td style="text-align:right;"> 4990.000 </td> <td style="text-align:right;"> 1128.0000 </td> <td style="text-align:right;"> 0.2260521 </td> <td style="text-align:left;"> Health care </td> </tr> <tr> <td style="text-align:left;"> Textron </td> <td style="text-align:right;"> 548.000 </td> <td style="text-align:right;"> 3.0000 </td> <td style="text-align:right;"> 0.0054745 </td> <td style="text-align:left;"> Aerospace & defense </td> </tr> <tr> <td style="text-align:left;"> Performance Food Group </td> <td style="text-align:right;"> 207.300 </td> <td style="text-align:right;"> 28.9000 </td> <td style="text-align:right;"> 0.1394115 </td> <td style="text-align:left;"> Retail & wholesale trade </td> </tr> <tr> <td style="text-align:left;"> Morgan Stanley </td> <td style="text-align:right;"> 7597.000 </td> <td style="text-align:right;"> 686.0000 </td> <td style="text-align:right;"> 0.0902988 </td> <td style="text-align:left;"> Financial </td> </tr> <tr> <td style="text-align:left;"> Clorox </td> <td style="text-align:right;"> 888.000 </td> <td style="text-align:right;"> 159.0000 </td> <td style="text-align:right;"> 0.1790541 </td> <td style="text-align:left;"> Household & personal products </td> </tr> <tr> <td style="text-align:left;"> Delek US Holdings </td> <td style="text-align:right;"> 478.538 </td> <td style="text-align:right;"> 121.7380 </td> <td style="text-align:right;"> 0.2543956 </td> <td style="text-align:left;"> Oil, gas & pipelines </td> </tr> <tr> <td style="text-align:left;"> Fortive </td> <td style="text-align:right;"> 679.900 </td> <td style="text-align:right;"> 47.2216 </td> <td style="text-align:right;"> 0.0694537 </td> <td style="text-align:left;"> Industrial machinery </td> </tr> <tr> <td style="text-align:left;"> Travelers Cos. </td> <td style="text-align:right;"> 3031.000 </td> <td style="text-align:right;"> 424.0000 </td> <td style="text-align:right;"> 0.1398878 </td> <td style="text-align:left;"> Financial </td> </tr> </tbody> </table> --- # Questions - Did Amazon.com pay US corporate taxes? - What was its US tax rate? -- ```r corp_tax %>% filter(company == "Amazon.com") ``` <table> <thead> <tr> <th style="text-align:left;"> company </th> <th style="text-align:right;"> profit </th> <th style="text-align:right;"> tax </th> <th style="text-align:right;"> tax_rate </th> <th style="text-align:left;"> industry </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Amazon.com </td> <td style="text-align:right;"> 10835 </td> <td style="text-align:right;"> -129 </td> <td style="text-align:right;"> -0.0119059 </td> <td style="text-align:left;"> Retail & wholesale trade </td> </tr> </tbody> </table> --- # Questions - Did Facebook pay US corporate taxes? - What was its US tax rate? ```r corp_tax %>% filter(company == ???) ``` --- # Question Did companies in the industry "Utilities, gas and electric" pay taxes ```r corp_tax %>% filter(??? == ???) ``` --- # Question Which companies **not** in the in the industry "Utilities, gas and electric" AND had a negative tax amount ```r corp_tax %>% filter(??? != ??? & tax < 0) corp_tax %>% filter(??? != ??? , tax < 0) ``` --- # Question Which companies had a tax rate greater than 21% ```r corp_tax %>% filter(??? > ???) ``` --- # Question Which companies had lowest US profit? ```r corp_tax %>% arrange(profit) corp_tax %>% slice_min(profit, n = 3) ``` --- # Question Which 3 companies had highest US profit? ```r corp_tax %>% slice_max(profit, n = 3) ``` --- # Question By industry which companies had the highest and lowest tax rate? ```r corp_tax %>% group_by(industry) %>% slice_max(tax_rate, n = 1) corp_tax %>% group_by(industry) %>% slice_min(tax_rate, n = 1) ``` --- ### Use 'summarize' to calculate summary statistics * `mean()`: the average * `min()`: the minimum value (p0) * `quantile( ,1)`: the minimum value (p25) * `median()`: the middle value (p50) * `quantile( ,3)`: the minimum value (p75) * `max()`: the minimum value (p100) * `sd()`: the standard deviation (measure of spread) * `sum()`: the total amount when adding multiple numbers * `n()`: a count of the number of rows/observations in each group --- # Replicate [Figure 5](https://itep.org/corporate-tax-avoidance-in-the-first-year-of-the-trump-tax-law) ```r corp_tax %>% summarize( profit_sum = sum(profit), tax_sum = sum(tax)) ``` <table> <thead> <tr> <th style="text-align:right;"> profit_sum </th> <th style="text-align:right;"> tax_sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 765688 </td> <td style="text-align:right;"> 86844.71 </td> </tr> </tbody> </table> --- # What do we need to do? ```r tab <- corp_tax %>% *group_by(industry) %>% summarize( profit_sum = sum(profit), tax_sum = sum(tax)) head(tab) ``` <table> <thead> <tr> <th style="text-align:left;"> industry </th> <th style="text-align:right;"> profit_sum </th> <th style="text-align:right;"> tax_sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aerospace & defense </td> <td style="text-align:right;"> 23360.363 </td> <td style="text-align:right;"> 2874.800 </td> </tr> <tr> <td style="text-align:left;"> Chemicals </td> <td style="text-align:right;"> 6281.402 </td> <td style="text-align:right;"> 274.767 </td> </tr> <tr> <td style="text-align:left;"> Computers, office equip, software, data </td> <td style="text-align:right;"> 52314.470 </td> <td style="text-align:right;"> 10655.678 </td> </tr> <tr> <td style="text-align:left;"> Engineering & construction </td> <td style="text-align:right;"> 4749.116 </td> <td style="text-align:right;"> 379.873 </td> </tr> <tr> <td style="text-align:left;"> Financial </td> <td style="text-align:right;"> 196269.695 </td> <td style="text-align:right;"> 20069.580 </td> </tr> <tr> <td style="text-align:left;"> Financial Data Services </td> <td style="text-align:right;"> 19769.280 </td> <td style="text-align:right;"> 2688.200 </td> </tr> </tbody> </table> --- # FIGURE 5 Effective Corporate Tax Rates for 379 Corporations, by industry, 2018 ```r corp_tax %>% group_by(industry) %>% summarize( Profit = sum(profit), Tax = sum(tax)) %>% * mutate(Rate = Tax / Profit * 100) %>% * arrange(Rate) %>% * rename(Industry = industry) ``` <table> <thead> <tr> <th style="text-align:left;"> Industry </th> <th style="text-align:right;"> Profit </th> <th style="text-align:right;"> Tax </th> <th style="text-align:right;"> Rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Industrial machinery </td> <td style="text-align:right;"> 11263.541 </td> <td style="text-align:right;"> -71.19397 </td> <td style="text-align:right;"> -0.6320745 </td> </tr> <tr> <td style="text-align:left;"> Utilities, gas and electric </td> <td style="text-align:right;"> 39894.059 </td> <td style="text-align:right;"> -212.38508 </td> <td style="text-align:right;"> -0.5323727 </td> </tr> <tr> <td style="text-align:left;"> Motor vehicles and parts </td> <td style="text-align:right;"> 9264.907 </td> <td style="text-align:right;"> 134.71636 </td> <td style="text-align:right;"> 1.4540498 </td> </tr> <tr> <td style="text-align:left;"> Oil, gas & pipelines </td> <td style="text-align:right;"> 44643.612 </td> <td style="text-align:right;"> 1628.17897 </td> <td style="text-align:right;"> 3.6470592 </td> </tr> <tr> <td style="text-align:left;"> Chemicals </td> <td style="text-align:right;"> 6281.402 </td> <td style="text-align:right;"> 274.76700 </td> <td style="text-align:right;"> 4.3742940 </td> </tr> <tr> <td style="text-align:left;"> Transportation </td> <td style="text-align:right;"> 30665.010 </td> <td style="text-align:right;"> 2445.63353 </td> <td style="text-align:right;"> 7.9753229 </td> </tr> </tbody> </table>