CoreSignal Analysis

Current Status

Joschka Schwarz

Hamburg University of Technology

Wednesday, 12th of June 2024

Table of Contents

CoreSignal Data

Input + Output CoreSignal

1. Input for CS: Domain & LinkedIn handle list (from CB)

Objective: Create list with domains and/or LinkedIn handles

  • Separated into organizations & employees
  • Based mostly on our Crunchbase dataset (I have added some AngelList data as well)
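
A minimal sketch (not the actual script; the object and column names are assumptions) of how such an input list could be assembled from the Crunchbase organization data:

# Hypothetical: cb_organizations_tbl holds one row per Crunchbase organization
library(dplyr)
library(readr)

cb_organizations_tbl |> 
  select(company_id, domain, linkedin_url) |> 
  # Keep only organizations that CoreSignal can match on at least one identifier
  filter(!is.na(domain) | !is.na(linkedin_url)) |> 
  write_csv("cs_input_organizations.csv")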

2. Output from CS: Company & Personal LinkedIn Profile Data

Instructions:

  1. Map the provided domains to company profiles
  2. Provide the complete data of employed users (including company profiles)

The data was provided as JSON files:

  • 5 files for companies (~160MB)
  • 1662 files for employees (~175GB)

Script and File locations

Data
01_data_sources/06_coresignal/01_data/02_raw/

  • company/202203_custom/...
  • member/202203_custom/...

3. Extract & convert relevant data

Objective: Extract the variables relevant for our analyses and convert them to .rds / .parquet files
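
A minimal sketch of this step, assuming newline-delimited JSON input; file and column names are hypothetical (the actual scripts are listed below):

library(jsonlite)
library(dplyr)

# Stream in one raw member file, keep only the variables needed downstream,
# and store the result as .rds
members_raw <- stream_in(file("member/202203_custom/members_0001.json"), verbose = FALSE)

members_raw |> 
  select(id, name, experience, education, skills) |>   # assumed relevant variables
  saveRDS("member/01_basic_data/members_0001.rds")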

Script and File locations

Scripts
01_data_sources/06_coresignal/02_scripts/02_build_tables/

  • 01_cs_build_table_company.R
  • 02_cs_build_tables_member_V1_bd_exp_skills.R
  • 03_cs_build_tables_member_V2_edu_exp.R

Data
01_data_sources/06_coresignal/01_data/03_extracted/

  • company/cs_companies_base.rds
  • member/01_basic_data/...
  • member/02_experience/... (non-deduplicated)
  • member/03_skills/...
  • member/04_education/... (non-deduplicated)

4. Deduplicate data

Every time a user changes something on their profile, a new record is created (date, typos, names, …). The deleted column is not useful here.

Objective: Deduplicate the member data (experience & education tables)
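
The core idea as a minimal sketch, assuming a data.table member_experience_dt in which id is the per-record key that changes with every profile edit (the actual scripts are listed below):

library(data.table)

# Rows count as duplicates when they agree on everything except the record id;
# ordering by id keeps the oldest record of each group
dedup_cols <- setdiff(names(member_experience_dt), "id")
member_experience_dist_dt <- unique(member_experience_dt[order(id)], by = dedup_cols)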

Script and File locations

Scripts
01_data_sources/06_coresignal/02_scripts/02_build_tables/

  • 04_cs_build_tables_member_exp_dist.R
  • 05_cs_build_tables_member_edu_dist.R

Data
01_data_sources/06_coresignal/01_data/04_wrangled/

  • companies/cs_companies_base_slct.rds (just relevant columns selected)
  • member/02_experience/me_dist8/... (deduplicated)
  • member/04_education/02_wrangled_dist_chunked/... (deduplicated)

Data Join

Wrangle and Join CrunchBase / PitchBook and CoreSignal data

1. Overview - Orgs

CoreSignal did not provide a matching table, only the resulting data. Hence, mapping back to our Crunchbase / PitchBook data via domains is necessary:

  1. Map Companies

Script and File locations

Scripts
02_data_mapping/10_cbpb_cs/01_scripts/

  • 06_cbpb_cs_matching_companies.R

Data (Input)
02_data_mapping/10_cbpb_cs/02_data/

  • funded_companies.rds (created by Christoph)

Data (Output)

  • cbpb_cs_joined.rds (companies joined)

2. Overview - Employees

Map Crunchbase / Pitchbook data to CoreSignal profiles (via mapped companies (1))

  1. Map employees

Script and File locations

Scripts
02_data_mapping/10_cbpb_cs/01_scripts/

  • 07_cs_cb_matching_employees.R

Data (Input)
01_data_sources/06_coresignal/01_data/04_wrangled/member/02_experience/me_dist8/02_unnested/
02_data_mapping/10_cbpb_cs/02_data/

  • cs_me_dist8_unest_prqt (CoreSignal distinct Member Experiences)
  • cbpb_cs_joined.rds (Joined CoreSignal / CrunchbasePitchbook Org Data)

Data (Output)

  • cs_me_dist8_unest_fc_joined.parquet (intermediate data)

There are some further explanations about the org join and employee join in the next two sections (click each tabset; there are also some fragments –> you have to hit enter/arrow to slide them in). You can skip to the next chapter (VAR I) though.

3. Details: Load & Init Companies (CrunchBase, PitchBook & CoreSignal)

Crunchbase data contains 150,838 startups with a valid funding trajectory.

pacman::p_load(arrow, dplyr, tidyr)

funded_companies_prqt <- open_dataset("funded_companies_identifiers.parquet") 
funded_companies_prqt
#> # A tibble: 150,838 × 3
#>   company_id            domain linkedin_url                                     
#>        <int> <list<character>> <chr>                                            
#> 1          1               [1] <NA>                                             
#> 2          2               [1] https://www.linkedin.com/company/luna-pharmaceut…
#> 3          3               [1] http://www.linkedin.com/company/chainsync        
#> # ℹ 150,835 more rows

Some companies have multiple domains (unnesting list columns via Arrow is not possible; alternatives would be Spark & sparklyr.nested, here the data is simply collected first):

fc_unnested_tbl <- funded_companies_prqt |> collect() |> 
                      # 1. Allow multiple domains per company. No multiple linkedin handles.
                      unnest(domain) 
fc_unnested_tbl
#> # A tibble: 155,413 × 3
#>   company_id domain              linkedin_url                                   
#>        <int> <chr>               <chr>                                          
#> 1          1 zana.io             <NA>                                           
#> 2          2 premamawellness.com https://www.linkedin.com/company/luna-pharmace…
#> 3          3 chainsync.com       http://www.linkedin.com/company/chainsync      
#> # ℹ 155,410 more rows

  1. Must have identifier (domain, linkedin)
  2. Clean identifiers
  3. Remove duplicates

library(stringr)
fc_wrangled_tbl <- fc_unnested_tbl |> 
  
  # 1. At least 1 identifier: 4,518 observations are filtered out
  filter(if_any(c(domain, linkedin_url), ~!is.na(.))) |>
  
  # 2. Extract linkedin handle & clean domains
  mutate(linkedin_handle = linkedin_url |> str_extract("(?<=linkedin\\.com/company/).*?(?=(?:\\?|$|/))")) |>
  mutate(domain          = domain |> clean_domain()) |>

  # 3. Drop the raw URL and remove 532 duplicates
  select(-linkedin_url) |>
  distinct()

fc_wrangled_tbl
#> # A tibble: 150,363 × 3
#>   company_id domain              linkedin_handle          
#>        <int> <chr>               <chr>                    
#> 1          1 zana.io             <NA>                     
#> 2          2 premamawellness.com luna-pharmaceuticals-inc-
#> 3          3 chainsync.com       chainsync                
#> # ℹ 150,360 more rows

–> 145,991 distinct examinable companies.
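
clean_domain() is a project helper not shown here; a minimal sketch of what such a cleaner might do (an assumption, the actual helper may differ):

library(stringr)

clean_domain <- function(x) {
  x |> 
    str_to_lower() |> 
    str_squish() |>                # strip stray whitespace
    str_remove("^https?://") |>    # drop the protocol
    str_remove("^www\\.") |>       # drop the www prefix
    str_remove("/.*$")             # drop any path
}

clean_domain(" https://www.Zana.io/about ")
#> [1] "zana.io"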

Issue: Some extracted domains are not unique and are associated with multiple companies.
Manual cleaning: Domains associated with more than two companies were inspected and set to NA where they do not correspond to the actual company.

# ANALYZE
# fc_wrangled_tbl |> 
#   distinct(company_id, domain) |> 
#   count(domain, sort = T) |> 
#   filter(n>2)

unwanted_domains_cb <- c("webflow.io", "angel.co", "weebly.com", "wordpress.com", "wixsite.com", "squarespace.com", 
                         "crypt2esports.com", "myshopify.com", "business.site", "mystrikingly.com", 
                         "launchrock.com", "square.site", "google.com", "sites.google.com", "t.co", "linktr.ee",
                         "netlify.app", "itunes.apple.com", "apple.com", "crunchb.com", "tumblr.com", "linkedin.com",
                         "godaddysites.com", "mit.edu", "paloaltonetworks.com", "wpengine.com", "facebook.com",
                         "intuit.com", "medium.com", "salesforce.com", "strikingly.com", "wix.com", "cisco.com",
                         "digi.me", "apps.apple.com", "bit.ly", "fleek.co", "harvard.edu", "ibm.com", "jimdo.com",
                         "myftpupload.com", "odoo.com", "storenvy.com", "twitter.com", "umd.edu", "umich.edu", "vmware.com", "webs.com")

# Not all observations with unwanted domains are bad per se:
wanted_ids_cb <- c(angel = 128006, `catapult-centres-uk` = 115854, digime1 = 140904, digimi2 = 95430, fleek = 50738, 
                   jimdo = 108655, medium = 113415, storenvy = 85742, strikingly = 95831, substack = 34304, 
                   tumblr = 84838, twitter = 53139, weebly = 91365, wpengine = 91720)

# Set misleading domains to NA
funded_companies_clnd <- fc_wrangled_tbl |> 
                              
  mutate(domain = if_else(
    domain %in% unwanted_domains_cb & !(company_id %in% wanted_ids_cb), 
    NA_character_, domain))

CoreSignal was able to locate 45,026 companies within our gathered data.

# Selection & Wrangle has been done already
cs_companies_base_slct <- readRDS("cs_companies_base_slct.rds") 
cs_companies_base_slct
#> # A tibble: 45,362 × 4
#>      id name                                domain               linkedin_handle
#>   <int> <chr>                               <chr>                <chr>          
#> 1   305 Blueprint, a David's Bridal Company blueprintregistry.c… blueprint-regi…
#> 2   793 BookingLive                         bookinglive.com      bookinglive    
#> 3  2425 Brandvee                            momentum.ai          brandvee       
#> # ℹ 45,359 more rows
cs_companies_base_slct$id |> n_distinct()
#> [1] 45026
cs_companies_base_slct |> janitor::get_dupes(id)
#> # A tibble: 672 × 5
#>        id dupe_count name          domain           linkedin_handle
#>     <int>      <int> <chr>         <chr>            <chr>          
#> 1  596494          2 Vi            vi.co            vitrainer      
#> 2  596494          2 Vi            vi.co            vi             
#> 3 1324413          2 Patch Lending patchlending.com patch-of-land  
#> # ℹ 669 more rows

Nothing to wrangle …

cs_companies_base_wrangled <- cs_companies_base_slct |> select(-name) |> 
  
                                        # Add suffixes to col names
                                        rename_with(~ paste(., "cs", sep = "_"))

Important

More cleaning is necessary (same as for CB/PB)! The task was undertaken with a limited degree of enthusiasm.

unwanted_domains_cs    <- c("bit.ly", "linktr.ee", "facebook.com", "linkedin.com", "twitter.com", "crunchbase.com")
wanted_ids_cs          <- c(crunchbase = 1634413, linkedin = 8568581, twitter = 24745469)

cs_companies_base_clnd <- cs_companies_base_wrangled |> 
  
  mutate(domain_cs = if_else(
    domain_cs %in% unwanted_domains_cs & !(id_cs %in% wanted_ids_cs), 
    NA_character_, 
    domain_cs)
    )

4. Details: Join companies, member experiences and funding information

We were able to match 37,287 CS & CB/PB companies.

cb_cs_joined <- funded_companies_clnd |> 

    # Leftjoins
    left_join(cs_companies_base_clnd |> select(id_cs, domain_cs),          by = c(domain          = "domain_cs"),          na_matches = "never") |> 
    left_join(cs_companies_base_clnd |> select(id_cs, linkedin_handle_cs), by = c(linkedin_handle = "linkedin_handle_cs"), na_matches = "never") |> 

    # Remove obs with no cs_id
    filter(!is.na(id_cs)) |>
    
    # Remove matches that matched on different domains but the same company (e.g. company_id 83060 / id_cs 4507928: block.xyz & squareup.com)
    select(company_id, id_cs) |> 
    distinct()
    
cb_cs_joined
#> # A tibble: 38,118 × 2
#>   company_id    id_cs
#>        <int>    <int>
#> 1          2  8345218
#> 2          5 28149599
#> 3          8  4469271
#> 4         11  5349023
#> 5         12  9364263
#> # ℹ 38,113 more rows
cb_cs_joined |> distinct(company_id) |> nrow()
#> [1] 37287

We got over 460 million employment observations from CoreSignal.

# Other data versions
# 1. Complete: 
member_experience_dt 
#> 462,711,794 rows

# 2. Distinct1: 
member_experience_dist_dt <- unique(member_experience_dt) 
#> 432,368,479 rows

# 3. Distinct2: 
unique(member_experience_dist_dt[order(id)], by = setdiff(names(member_experience_dist_dt), "id")) 
#> 431,899,547 rows

But only ~50 million distinct employments:

# Load distinct member experiences
me_dist8_prqt <- arrow::open_dataset("cs_me_dist8_unest_empl_hist.parquet") 
me_dist8_prqt |> glimpse()
#> FileSystemDataset with 1 Parquet file
#> 51,621,196 rows x 10 columns
#> $ id_tie                 <int32> 16615559, 16615560, 16615561, 16615562, 1661556…
#> $ id                    <double> 2244288231, 254049663, 948937291, 254049667, 25…
#> $ member_id              <int32> 179313066, 179313066, 179313066, 179313066, 179…
#> $ company_id             <int32> 865089, 9098713, 9098713, NA, 865089, 9020540, …
#> $ company_name          <string> "heritage community bank", "aurora bank fsb", "…
#> $ title                 <string> "AVP Chief Compliance/BSA Officer", "AVP Compli…
#> $ date_from_parsed <date32[day]> 2010-02-01, 2012-07-01, 2011-11-01, 1997-07-01,…
#> $ date_to_parsed   <date32[day]> 2011-11-01, 2013-06-01, 2012-07-01, 2006-05-01,…
#> $ date_from_parsed_year  <int32> 2010, 2012, 2011, 1997, 2006, 2019, 2017, 2021,…
#> $ date_to_parsed_year    <int32> 2011, 2013, 2012, 2006, 2010, 2021, 2018, NA, 1…
#> Call `print()` for full schema details

Example

me_orig <- open_dataset("~/02_diss/01_coresignal/02_data/member_experience/me_orig/")
me_dist <- open_dataset("~/02_diss/01_coresignal/02_data/member_experience/me_dist/")

me_orig |> filter(member_id == 4257, company_id == 9007053) |> collect() |> as_tibble() |> arrange(date_from_parsed) |> print(n=19)
me_dist |> filter(member_id == 4257, company_id == 9007053) |> collect() |> as_tibble() |> arrange(date_from_parsed)

Over 10 million valid employments (a start date is required) at companies in our Crunchbase / PitchBook data set; 385,100 of them have a title containing the string "founder".

# Distinct company ids
cb_cs_joined_cs_ids <- cb_cs_joined |> distinct(id_cs) |> pull(id_cs)
me_wrangled_prqt    <- me_dist8_prqt |> 
  
                          # Select features
                          select(member_id, company_id, exp_id = "id", date_from_parsed) |> 
                          
                          # Select observations
                          filter(company_id %in% cb_cs_joined_cs_ids) |> 
                          # - 967,080 observations (date_to not considered yet)
                          filter(!is.na(date_from_parsed)) |> 

                          # Add suffix to col names
                          rename_with(~ paste(., "cs", sep = "_")) |> 
                          compute()

me_wrangled_prqt |> 
  glimpse()
#> Table
#> 11,050,164 rows x 4 columns
#> $ member_id_cs              <int32> 9897605, 9897605, 9897605, 9897605, 9897928,…
#> $ company_id_cs             <int32> 1105483, 5181133, 5181133, 5181133, 5025265,…
#> $ exp_id_cs                <double> 1665233144, 12744849, 995032176, 1665233146,…
#> $ date_from_parsed_cs <date32[day]> 2018-03-01, 2010-06-01, 2014-09-01, 2011-01-…
#> Call `print()` for full schema details

Multiple Founding Dates –> Take oldest

fc_wrangled_tbl <- funded_companies_tbl |> 
  
  # Consider multiple founding dates: Take oldest founding date
  unnest(founded_on) |> 
  arrange(company_id, founded_on) |> 
  group_by(company_id) |> slice(1) |> ungroup()

Example of funding round data:

fc_wrangled_tbl$funding_rounds[[1]] |>  
  glimpse()
#> Rows: 15
#> Columns: 14
#> $ round_id             <int> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
#> $ round_uuid_pb        <chr> NA, "47208-70T", NA, "58843-18T", NA, NA, NA, "78…
#> $ round_uuid_cb        <chr> "a6d3bfd9-5afa-47ce-86de-30a3abad6c9b", NA, "ea3b…
#> $ announced_on         <date> 2013-01-01, 2014-04-01, 2015-06-01, 2015-10-07, …
#> $ round_new            <int> 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 12, 13
#> $ round                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
#> $ exit_cycle           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ last                 <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…
#> $ round_type_new       <fct> Seed, Series A, Series B, Series C, Series D, Ser…
#> $ round_type           <list> "angel", "angel", "early_vc", "early_vc", "conver…
#> $ round_types          <list> <"angel", "angel_group", "investor", "company", "…
#> $ raised_amount        <dbl> NA, 520000, NA, 1399999, NA, NA, NA, 3250000, NA,…
#> $ post_money_valuation <dbl> NA, NA, NA, 3399998, NA, NA, NA, 10249998, NA, N…
#> $ investors_in_round   <list> [<tbl_df[1 x 11]>], [<tbl_df[1 x 11]>], [<tbl_df…

  • Joining via dplyr is not possible due to memory constraints.
  • Joining via Arrow is not possible due to the nested data structure.
  • –> Joining via data.table is the most efficient option.

Conversion to data.tables necessary:

# 1.  Funding Data
# 1.1 Level 1
fc_wrangled_dt |> setDT()

# 1.2 Funding Data Level 2 (funding_rounds)
purrr::walk(fc_wrangled_dt$funding_rounds, setDT)

# 1.3 Remove unnecessary columns + initialize dummy for before_join
purrr::walk(fc_wrangled_dt$funding_rounds, ~ .x[, 
          `:=`(round_uuid_pb = NULL, round_uuid_cb        = NULL, round_new          = NULL, round          = NULL,
               exit_cycle    = NULL, last                 = NULL, round_type         = NULL, round_type_new = NULL, 
               round_types   = NULL, post_money_valuation = NULL, investors_in_round = NULL, before_join    = NA)
          ]
        )

# 2. Matching Table
cb_cs_joined_slct_dt |> setDT()

# 3. Member experiences
me_wrangled_dt <- me_wrangled_prqt |> collect()

Working data.table solution (a further efficiency increase via a join by reference would be possible).

# 1. Add company_id from funded_companies to member experiences
me_joined_dt <- cb_cs_joined_slct_dt[me_wrangled_dt, on = .(id_cs = company_id_cs), allow.cartesian = TRUE]
#> 12,978,226 rows

# 2. Add funding data from funded_companies
me_joined_dt <- fc_wrangled_dt[me_joined_dt, on = .(company_id)]
#> 12,270,572 rows

# 3. Remove duplicates (why are there any?)
me_joined_dt <- unique(me_joined_dt, by = setdiff(names(me_joined_dt), "funding_rounds"))
#> 12,270,572 rows: no duplicates anymore (removed from cb_cs_joined_slct_dt)

Non-working dplyr solution (fails due to memory constraints)

me_joined_dt_dplyr <- me_wrangled_dt |>

  # Add company_id from funded_companies
  left_join(cb_cs_joined_slct_dt,
            by = c(company_id_cs = "id_cs")) |>

  # Add data from funded_companies
  left_join(funded_companies_wrangled_dt,
            by = "company_id")  |>
  distinct()

Arrow is not possible because of the nested funding data.

Variable creation

Using domain knowledge to extract features

1. Overview

From here on, almost everything is in:

Script and File locations

Scripts
05_analyses/03_cbpbcs/01_scripts

  • 01_founding_vs_employment (Company Funding vs. Time of Employment (I. Time, II. Capital, III. Rounds); see the sketch after this list)
  • 02_stage_affiliation (stages based on age and funding; discarded later on –> see 04_funding_history)
  • 03_employment_history (Fortune 500, startup, founding, research experiences)
  • 04_funding_history (1. prior raised amount (person), 2. further funding (company), 3. funding per round, 4. dataset based on Series B)
  • 05_education_history (merge with rankings, extract degrees)
  • 06_skills
  • 07_analyses/02_per_company/ (last plots)
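
As an illustration of the kind of feature engineering these scripts perform, a minimal sketch of how tjoin_tfound and is_founder could be derived from the joined employment data (column names follow the glimpse() output below; the exact definitions live in the scripts above and may differ):

library(dplyr)
library(stringr)
library(lubridate)

me_joined_dt |> 
  mutate(
    # Full months between company founding and the member joining
    # (negative values: joined before the recorded founding date)
    tjoin_tfound = interval(founded_on_cbpb, date_from_parsed_cs) %/% months(1),
    # Flag employments whose title contains the string "founder"
    is_founder   = str_detect(str_to_lower(title_cs), "founder")
  )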

2. List of all variables (not up to date)

cs_me_dist8_unest_wedu_dt |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 67
#> $ company_id_cbpb                      <int> 90591, 152845, 90440, 138208, 116…
#> $ funding_after_mid                    <chr> "yes", NA, "yes", "yes", "yes", "…
#> $ funding_after_early                  <chr> "yes", "no", "yes", "yes", "yes",…
#> $ member_id                            <int> 878, 2104, 3548, 3548, 3970, 4005…
#> $ id_tie                               <int> 38, 67, 89, 89, 96, 104, 183, 175…
#> $ exp_id_cs                            <dbl> 2481733250, 1423977093, 2638, 263…
#> $ exp_corporate                        <dbl> 0.00000, 12.00000, 0.00000, 0.000…
#> $ exp_funded_startup                   <dbl> 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0,…
#> $ exp_founder                          <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0…
#> $ exp_f500                             <dbl> 0.00000, 0.00000, 0.00000, 0.0000…
#> $ exp_research                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exp_research_ivy                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ company_id_cs                        <int> 140537, 10644128, 6068905, 606890…
#> $ company_name_cs                      <chr> "Bristol-Myers Squibb", "HERE", "…
#> $ company_name_cbpb                    <chr> "receptos", "HERE Technologies Ch…
#> $ founded_on_cbpb                      <date> 2007-01-01, 2012-11-13, 2009-07-…
#> $ closed_on_cbpb                       <date> NA, NA, NA, NA, NA, NA, 2021-04-…
#> $ title_cs                             <chr> "Key Account Manager", "GIS Analy…
#> $ date_from_parsed_cs                  <date> 2006-01-01, 2016-01-01, 2010-01-…
#> $ date_to_parsed_cs                    <date> 2008-08-01, NA, NA, NA, 2011-10-…
#> $ tjoin_tfound                         <dbl> -12, 37, 6, 48, -47, 48, 17, 11, …
#> $ raised_amount_before_join_company    <dbl> 0, 0, 0, 7722796, 0, 9961692, 333…
#> $ num_rounds_before_join               <dbl> 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, …
#> $ is_f500                              <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, F…
#> $ is_founder                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ is_research                          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ is_research_ivy                      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ date_1st_founder_exp                 <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_f500_exp                    <date> 2006-01-01, NA, 2010-01-01, 2010…
#> $ date_1st_funded_startup_exp          <date> 2006-01-01, 2016-01-01, 2010-01-…
#> $ date_1st_research_exp                <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_research_ivy_exp            <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_corporate_exp               <date> 2009-02-01, 2015-01-01, NA, NA, …
#> $ time_since_1st_corporate_exp         <dbl> NA, 12, NA, NA, 116, NA, 136, 40,…
#> $ time_since_1st_founder_exp           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_f500_exp              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_funded_startup_exp    <dbl> NA, NA, NA, NA, NA, NA, 96, NA, N…
#> $ time_since_1st_research_exp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_research_ivy_exp      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_experience            <dbl> 0, 12, 0, 0, 116, 0, 136, 40, 176…
#> $ raised_amount_before_founder_member  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ raised_amount_before_all_member      <dbl> NA, NA, NA, NA, NA, NA, 0, 0, NA,…
#> $ was_corporate_before                 <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, …
#> $ was_founder_before                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_f500_before                      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_fc_before                        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_uni_before                       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_ivy_before                       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ stage_mid                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ stage_late                           <lgl> NA, NA, NA, NA, NA, NA, FALSE, NA…
#> $ date_from_stage                      <chr> "early1", "mid", "early2", "mid",…
#> $ company_start_mid                    <date> 2009-01-01, 2014-11-13, 2011-07-…
#> $ company_start_late                   <date> 2009-11-23, 2017-11-13, 2014-07-…
#> $ rank_global_2023_best                <int> 917, 1549, NA, NA, NA, NA, NA, NA…
#> $ score_global_2023_best               <dbl> 40.6, 26.8, NA, NA, NA, NA, NA, N…
#> $ rank_national_2023_best              <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ rank_national_during_enrollment_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ degree_ba2                           <lgl> FALSE, TRUE, NA, NA, TRUE, NA, NA…
#> $ degree_ma2                           <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_phd2                          <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_mba2                          <lgl> TRUE, FALSE, NA, NA, FALSE, NA, N…
#> $ num_rounds_cumulated_founder         <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ num_rounds_cumulated_all             <int> NA, NA, NA, NA, NA, NA, 1, 1, NA,…
#> $ announced_on_sB                      <date> 2012-02-03, 2018-01-04, 2011-03-…
#> $ round_type_new_next                  <fct> Series C, Series C, Series C, Ser…
#> $ raised_amount_cumsum_sB              <dbl> 46043054, 0, 1905000, 11022796, 2…
#> $ raised_amount_cumsum_sB_next         <dbl> 76043054, 0, 8712306, 13854868, 4…
cs_me_dist8_unest_wedu_dt |> 
  select(id_tie, member_id, exp_id_cs, company_id_cbpb, company_name_cbpb, company_id_cs, company_name_cs, 
         founded_on_cbpb, closed_on_cbpb,
         title_cs) |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 10
#> $ id_tie            <int> 38, 67, 89, 89, 96, 104, 183, 175, 209, 243, 321, 37…
#> $ member_id         <int> 878, 2104, 3548, 3548, 3970, 4005, 4224, 4224, 4317,…
#> $ exp_id_cs         <dbl> 2481733250, 1423977093, 2638, 2638, 1736317868, 3084…
#> $ company_id_cbpb   <int> 90591, 152845, 90440, 138208, 116099, 97810, 40123, …
#> $ company_name_cbpb <chr> "receptos", "HERE Technologies Chicago", "crowdtwist…
#> $ company_id_cs     <int> 140537, 10644128, 6068905, 6068905, 11825305, 194148…
#> $ company_name_cs   <chr> "Bristol-Myers Squibb", "HERE", "Oracle", "Oracle", …
#> $ founded_on_cbpb   <date> 2007-01-01, 2012-11-13, 2009-07-01, 2006-01-01, 201…
#> $ closed_on_cbpb    <date> NA, NA, NA, NA, NA, NA, 2021-04-09, NA, NA, NA, NA,…
#> $ title_cs          <chr> "Key Account Manager", "GIS Analyst I", "QA", "QA", …
cs_me_dist8_unest_wedu_dt |> 
  select(date_from_parsed_cs, date_to_parsed_cs, 
         tjoin_tfound, raised_amount_before_join_company, num_rounds_before_join) |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 5
#> $ date_from_parsed_cs               <date> 2006-01-01, 2016-01-01, 2010-01-01,…
#> $ date_to_parsed_cs                 <date> 2008-08-01, NA, NA, NA, 2011-10-01,…
#> $ tjoin_tfound                      <dbl> -12, 37, 6, 48, -47, 48, 17, 11, 44,…
#> $ raised_amount_before_join_company <dbl> 0, 0, 0, 7722796, 0, 9961692, 333333…
#> $ num_rounds_before_join            <dbl> 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, …
cs_me_dist8_unest_wedu_dt |> 
  select(starts_with("is_"),
         starts_with("was_")) |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 10
#> $ is_f500              <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
#> $ is_founder           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_research          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_research_ivy      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_corporate_before <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRU…
#> $ was_founder_before   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_f500_before      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_fc_before        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
#> $ was_uni_before       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_ivy_before       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
cs_me_dist8_unest_wedu_dt |> 
  select(
    starts_with("date_1st_"),
    starts_with("time_since_1st_"),
    starts_with("exp_"), -exp_id_cs) |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 19
#> $ date_1st_founder_exp              <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_f500_exp                 <date> 2006-01-01, NA, 2010-01-01, 2010-01…
#> $ date_1st_funded_startup_exp       <date> 2006-01-01, 2016-01-01, 2010-01-01,…
#> $ date_1st_research_exp             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_research_ivy_exp         <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_corporate_exp            <date> 2009-02-01, 2015-01-01, NA, NA, 200…
#> $ time_since_1st_corporate_exp      <dbl> NA, 12, NA, NA, 116, NA, 136, 40, 17…
#> $ time_since_1st_founder_exp        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_f500_exp           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_funded_startup_exp <dbl> NA, NA, NA, NA, NA, NA, 96, NA, NA, …
#> $ time_since_1st_research_exp       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_research_ivy_exp   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_experience         <dbl> 0, 12, 0, 0, 116, 0, 136, 40, 176, 4…
#> $ exp_corporate                     <dbl> 0.00000, 12.00000, 0.00000, 0.00000,…
#> $ exp_funded_startup                <dbl> 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0, 0,…
#> $ exp_founder                       <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
#> $ exp_f500                          <dbl> 0.00000, 0.00000, 0.00000, 0.00000, …
#> $ exp_research                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exp_research_ivy                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
cs_me_dist8_unest_wedu_dt |> 
  select(score_global_2023_best,
         starts_with("rank"),
         starts_with("degree")) |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 8
#> $ score_global_2023_best               <dbl> 40.6, 26.8, NA, NA, NA, NA, NA, N…
#> $ rank_global_2023_best                <int> 917, 1549, NA, NA, NA, NA, NA, NA…
#> $ rank_national_2023_best              <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ rank_national_during_enrollment_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ degree_ba2                           <lgl> FALSE, TRUE, NA, NA, TRUE, NA, NA…
#> $ degree_ma2                           <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_phd2                          <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_mba2                          <lgl> TRUE, FALSE, NA, NA, FALSE, NA, N…
cs_me_dist8_unest_wedu_dt |> 
  select(date_from_stage, company_start_mid, company_start_late,
         raised_amount_before_founder_member, raised_amount_before_all_member,
         funding_after_mid, funding_after_early) |> 
  glimpse()
#> Rows: 2,659,657
#> Columns: 7
#> $ date_from_stage                     <chr> "early1", "mid", "early2", "mid", …
#> $ company_start_mid                   <date> 2009-01-01, 2014-11-13, 2011-07-0…
#> $ company_start_late                  <date> 2009-11-23, 2017-11-13, 2014-07-0…
#> $ raised_amount_before_founder_member <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ raised_amount_before_all_member     <dbl> NA, NA, NA, NA, NA, NA, 0, 0, NA, …
#> $ funding_after_mid                   <chr> "yes", NA, "yes", "yes", "yes", "y…
#> $ funding_after_early                 <chr> "yes", "no", "yes", "yes", "yes", …