Current Status
Hamburg University of Technology
Monday, 17th of February 2025
Input + Output CoreSignal
Objective: Create list with domains and/or LinkedIn handles
Script and File locations
Data
01_data_sources/06_coresignal/01_data/01_input_for_cs/final_selection/version_3_cijs/
orgs_cs.csv
usr_cs.csv
Scripts
01_data_sources/06_coresignal/02_scripts/01_input_for_cs/
cs_org_select_v2.R
Instructions:
Data was provided as JSON data:
Script and File locations
Data
01_data_sources/06_coresignal/01_data/02_raw/
company/202203_custom/...
member/202203_custom/...
Objectives: Extract the data (variables) that is relevant for our analyses and convert it to .rds / .parquet files
Script and File locations
Scripts
01_data_sources/06_coresignal/02_scripts/02_build_tables/
01_cs_build_table_company.R
02_cs_build_tables_member_V1_bd_exp_skills.R
03_cs_build_tables_member_V2_edu_exp.R
Data
01_data_sources/06_coresignal/01_data/03_extracted/
company/cs_companies_base.rds
member/01_basic_data/...
member/02_experience/... (non-deduplicated)
member/03_skills/...
member/04_education/... (non-deduplicated)
Every time a user changes something on their profile (date, typos, names, …), a new record is created. The column "deleted" is not useful.
Objectives: Deduplicate the member data: Experiences & Education Tables
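A minimal sketch of what such a deduplication could look like, on made-up toy data. The actual rule used in the scripts below is not shown in these slides, so everything here is an assumption: the key (member, company, start date), the hypothetical snapshot column `last_updated`, and the choice to keep the most recent snapshot per key.

```r
# Toy experience records: member 1 edited the same position twice (typo fix),
# which produced three snapshots of one employment. All values are made up.
experience <- data.frame(
  id           = 1:5,
  member_id    = c(1, 1, 1, 1, 2),
  company_id   = c(10, 10, 10, 10, 20),
  title        = c("Enginer", "Engineer", "Engineer", "Senior Engineer", "Analyst"),
  date_from    = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01",
                           "2018-06-01", "2020-02-01")),
  last_updated = as.Date(c("2019-01-10", "2020-03-05", "2021-07-20",
                           "2021-07-20", "2021-01-01")),  # hypothetical column
  stringsAsFactors = FALSE
)

# Sort so that the newest snapshot comes first within each assumed key ...
ord <- order(experience$member_id, experience$company_id, experience$date_from,
             -as.numeric(experience$last_updated))
experience <- experience[ord, ]

# ... then keep only the first row per key.
key <- experience[, c("member_id", "company_id", "date_from")]
experience_dist <- experience[!duplicated(key), ]
experience_dist
```

On the full data the same idea would have to run chunk-wise or out of memory, which this toy version ignores.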
Script and File locations
Scripts
01_data_sources/06_coresignal/02_scripts/02_build_tables/
04_cs_build_tables_member_exp_dist.R
05_cs_build_tables_member_edu_dist.R
Data
01_data_sources/06_coresignal/01_data/04_wrangled/
companies/cs_companies_base_slct.rds (just relevant columns selected)
member/02_experience/me_dist8/... (deduplicated)
member/04_education/02_wrangled_dist_chunked/... (deduplicated)
Wrangle and Join CrunchBase / PitchBook and CoreSignal data
CoreSignal did not provide a matching table, only the resulting data. Hence, mapping the results back to our CrunchBase / PitchBook data via domains is necessary:
Script and File locations
Scripts
02_data_mapping/10_cbpb_cs/01_scripts/
06_cbpb_cs_matching_companies.R
Data (Input)
02_data_mapping/10_cbpb_cs/02_data/
funded_companies.rds (created by Christoph)
Data (Output)
cbpb_cs_joined.rds (companies joined)
Map Crunchbase / PitchBook data to CoreSignal profiles (via mapped companies (1))
Script and File locations
Scripts
02_data_mapping/10_cbpb_cs/01_scripts/
07_cs_cb_matching_employees.R
Data (Input)
01_data_sources/06_coresignal/01_data/04_wrangled/member/02_experience/me_dist8/02_unnested/
cs_me_dist8_unest_prqt (CoreSignal distinct Member Experiences)
02_data_mapping/10_cbpb_cs/02_data/
cbpb_cs_joined.rds (Joined CoreSignal / Crunchbase-PitchBook Org Data)
Data (Output)
cs_me_dist8_unest_fc_joined.parquet (intermediate data)
There are some further explanations about the org join and the employee join in the next two sections (click on each tabset; there are also some fragments –> you have to hit enter/arrow to slide them in). You can skip to the next chapter (VAR I) though.
Crunchbase data contains 150,838 startups with a valid funding trajectory.
library(dplyr)
library(stringr)
fc_unnested_tbl |>
# 1. At least 1 identifier: 4,518 observations are filtered out
filter(if_any(c(domain, linkedin_url), ~!is.na(.))) |>
# 2. Extract linkedin handle & clean domains
mutate(linkedin_handle = linkedin_url |> str_extract("(?<=linkedin\\.com/company/).*?(?=(?:\\?|$|/))")) |>
mutate(domain = domain |> clean_domain()) |>
# 3. Remove 532 duplicates
distinct()
–> 145,991 distinct examinable companies.
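Step 2 can be tried out on toy input. The sketch below is illustrative only: `clean_domain()` is not shown in these slides, so `clean_domain_sketch()` is a hypothetical stand-in for it, and base R is used instead of stringr so the snippet has no dependencies. All URLs are made up.

```r
# Same regular expression as in the pipeline above, applied with base R.
extract_linkedin_handle <- function(url) {
  pattern <- "(?<=linkedin\\.com/company/).*?(?=(?:\\?|$|/))"
  m <- regexpr(pattern, url, perl = TRUE)
  hit <- !is.na(m) & m > -1L          # positions that actually matched
  out <- rep(NA_character_, length(url))
  out[hit] <- regmatches(url, m)      # regmatches() returns matches only
  out
}

# Hypothetical stand-in for clean_domain(): lower-case, then strip the
# scheme, a leading "www." and everything after the host name.
clean_domain_sketch <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("^[a-z]+://", "", x)
  x <- sub("^www\\.", "", x)
  x <- sub("[/?#:].*$", "", x)
  x[!is.na(x) & x == ""] <- NA_character_
  x
}

urls <- c("https://www.linkedin.com/company/acme-inc/",
          "https://linkedin.com/company/foo?trk=x",
          "https://example.com",
          NA)
extract_linkedin_handle(urls)
clean_domain_sketch(c("https://www.Example.com/path?x=1", "shop.example.org", NA))
```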
Issue: Some extracted domains are not unique and associated with multiple companies.
Manual Cleaning: Domains with a count exceeding two were analyzed and set to NA if they do not correspond to the actual one.
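The counting-and-blanking logic can be sketched on toy data as follows. This is a base R illustration under assumptions: the data and the exception id are made up, and the shared domains are blanked automatically here, whereas in the slides the flagged domains were inspected by hand.

```r
# Toy company table; ids and domains are made up for illustration.
companies <- data.frame(
  company_id = c(1, 2, 3, 4, 5, 6),
  domain     = c("wixsite.com", "wixsite.com", "wixsite.com",
                 "acme.io", "acme.io", "solo.dev"),
  stringsAsFactors = FALSE
)

# Count distinct (company_id, domain) pairs per domain and flag domains
# that are shared by more than two companies.
pairs         <- unique(companies[, c("company_id", "domain")])
domain_counts <- table(pairs$domain)
shared        <- names(domain_counts)[as.vector(domain_counts) > 2]

# Hypothetical exception: company 2 really is the owner of its domain.
keep_ids <- c(2)

# %in% compares exact strings, so entries must not carry stray whitespace.
is_bad <- companies$domain %in% shared & !(companies$company_id %in% keep_ids)
companies$domain[is_bad] <- NA_character_
companies
```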
# ANALYZE
# fc_wrangled_tbl |>
# distinct(company_id, domain) |>
# count(domain, sort = T) |>
# filter(n > 2)
unwanted_domains_cb <- c("webflow.io", "angel.co", "weebly.com", "wordpress.com", "wixsite.com", "squarespace.com",
"crypt2esports.com", "myshopify.com", "business.site", "mystrikingly.com",
"launchrock.com", "square.site", "google.com", "sites.google.com", "t.co", "linktr.ee",
"netlify.app", "itunes.apple.com", "apple.com", "crunchb.com", "tumblr.com", "linkedin.com",
"godaddysites.com", "mit.edu", "paloaltonetworks.com", "wpengine.com", "facebook.com",
"intuit.com", "medium.com", "salesforce.com", "strikingly.com", "wix.com", "cisco.com",
"digi.me", "apps.apple.com", "bit.ly", "fleek.co", "harvard.edu", "ibm.com", "jimdo.com",
"myftpupload.com", "odoo.com", "storenvy.com", "twitter.com", "umd.edu", "umich.edu", "vmware.com", "webs.com")
# Not all observations with unwanted domains are bad per se:
wanted_ids_cb <- c(angel = 128006, `catapult-centres-uk` = 115854, digime1 = 140904, digimi2 = 95430, fleek = 50738,
jimdo = 108655, medium = 113415, storenvy = 85742, strikingly = 95831, substack = 34304,
tumblr = 84838, twitter = 53139, weebly = 91365, wpengine = 91720)
# Set misleading domains to NA
funded_companies_clnd <- fc_wrangled_tbl |>
mutate(domain = if_else(
domain %in% unwanted_domains_cb & !(company_id %in% wanted_ids_cb),
NA_character_, domain))
It appears that CoreSignal was able to locate 45,026 companies within our gathered data.
Nothing to wrangle …
Important
More cleaning necessary (same as CBPB)! The task was undertaken with a limited degree of enthusiasm.
unwanted_domains_cs <- c("bit.ly", "linktr.ee", "facebook.com", "linkedin.com", "twitter.com", "crunchbase.com")
wanted_ids_cs <- c(crunchbase = 1634413, linkedin = 8568581, twitter = 24745469)
cs_companies_base_clnd <- cs_companies_base_wrangled |>
mutate(domain_cs = if_else(
domain_cs %in% unwanted_domains_cs & !(id_cs %in% wanted_ids_cs),
NA_character_,
domain_cs)
)
We were able to match 37,287 CS & CB/PB companies.
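Matching on two identifiers (domain and LinkedIn handle) can be sketched on toy data as follows. This is a base R illustration, not the code used for the slides: all ids and values are made up, and the two match tables are stacked rather than chained. One practical detail when chaining two dplyr joins that both carry `id_cs`: dplyr disambiguates the duplicated column with suffixes (`id_cs.x`, `id_cs.y` by default), so the two columns would have to be combined before filtering. Stacking sidesteps that.

```r
# Toy CB/PB companies and toy CoreSignal companies (all values made up).
cb <- data.frame(
  company_id      = c(1, 2, 3, 4),
  domain          = c("acme.io", NA, "beta.com", "nomatch.org"),
  linkedin_handle = c("acme", "gamma-labs", NA, NA),
  stringsAsFactors = FALSE
)
cs <- data.frame(
  id_cs              = c(101, 102, 103),
  domain_cs          = c("acme.io", NA, "beta.com"),
  linkedin_handle_cs = c("acme", "gamma-labs", NA),
  stringsAsFactors = FALSE
)

# merge() would match NA with NA, so NA keys are dropped first
# (the counterpart of na_matches = "never").
by_domain <- merge(cb[!is.na(cb$domain), c("company_id", "domain")],
                   cs[!is.na(cs$domain_cs), c("id_cs", "domain_cs")],
                   by.x = "domain", by.y = "domain_cs")
by_handle <- merge(cb[!is.na(cb$linkedin_handle), c("company_id", "linkedin_handle")],
                   cs[!is.na(cs$linkedin_handle_cs), c("id_cs", "linkedin_handle_cs")],
                   by.x = "linkedin_handle", by.y = "linkedin_handle_cs")

# Stack both match tables; a pair found via both keys is kept once.
matches <- unique(rbind(by_domain[, c("company_id", "id_cs")],
                        by_handle[, c("company_id", "id_cs")]))
matches <- matches[order(matches$company_id), ]
matches
```

Company 1 is found via both keys but appears once; company 4 has no counterpart and drops out.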
cb_cs_joined <- funded_companies_clnd |>
# Left joins
left_join(cs_companies_base_clnd |> select(id_cs, domain_cs), by = c(domain = "domain_cs"), na_matches = "never") |>
left_join(cs_companies_base_clnd |> select(id_cs, linkedin_handle_cs), by = c(linkedin_handle = "linkedin_handle_cs"), na_matches = "never") |>
# Remove obs with no cs_id
filter(!is.na(id_cs)) |>
# Remove duplicate matches of the same company via different domains (e.g. company_id: 83060, id_cs: 4507928; block.xyz & squareup.com)
select(company_id, id_cs) |>
distinct()
cb_cs_joined
We got over 460 million employment observations from CoreSignal.
But only ~50 million of them are distinct employments.
#> FileSystemDataset with 1 Parquet file
#> 51,621,196 rows x 10 columns
#> $ id_tie <int32> 16615559, 16615560, 16615561, 16615562, 1661556…
#> $ id <double> 2244288231, 254049663, 948937291, 254049667, 25…
#> $ member_id <int32> 179313066, 179313066, 179313066, 179313066, 179…
#> $ company_id <int32> 865089, 9098713, 9098713, NA, 865089, 9020540, …
#> $ company_name <string> "heritage community bank", "aurora bank fsb", "…
#> $ title <string> "AVP Chief Compliance/BSA Officer", "AVP Compli…
#> $ date_from_parsed <date32[day]> 2010-02-01, 2012-07-01, 2011-11-01, 1997-07-01,…
#> $ date_to_parsed <date32[day]> 2011-11-01, 2013-06-01, 2012-07-01, 2006-05-01,…
#> $ date_from_parsed_year <int32> 2010, 2012, 2011, 1997, 2006, 2019, 2017, 2021,…
#> $ date_to_parsed_year <int32> 2011, 2013, 2012, 2006, 2010, 2021, 2018, NA, 1…
#> Call `print()` for full schema details
Example
library(arrow)
library(dplyr)
me_orig <- open_dataset("~/02_diss/01_coresignal/02_data/member_experience/me_orig/")
me_dist <- open_dataset("~/02_diss/01_coresignal/02_data/member_experience/me_dist/")
me_orig |> filter(member_id == 4257, company_id == 9007053) |> collect() |> as_tibble() |> arrange(date_from_parsed) |> print(n=19)
me_dist |> filter(member_id == 4257, company_id == 9007053) |> collect() |> as_tibble() |> arrange(date_from_parsed)
Over 10 million valid employments (valid: must have a starting date) at the companies in our Crunchbase / PitchBook data set. 385,100 of them have a title containing the string "founder".
# Distinct company ids
cb_cs_joined_cs_ids <- cb_cs_joined |> distinct(id_cs) |> pull(id_cs)
me_wrangled_prqt <- me_dist8_prqt |>
# Select features
select(member_id, company_id, exp_id = "id", date_from_parsed) |>
# Select observations
filter(company_id %in% cb_cs_joined_cs_ids) |>
# - 967,080 observations (date_to not considered yet)
filter(!is.na(date_from_parsed)) |>
# Add suffix to col names
rename_with(~ paste(., "cs", sep = "_")) |>
compute()
me_wrangled_prqt |>
glimpse()
#> Table
#> 11,050,164 rows x 4 columns
#> $ member_id_cs <int32> 9897605, 9897605, 9897605, 9897605, 9897928,…
#> $ company_id_cs <int32> 1105483, 5181133, 5181133, 5181133, 5025265,…
#> $ exp_id_cs <double> 1665233144, 12744849, 995032176, 1665233146,…
#> $ date_from_parsed_cs <date32[day]> 2018-03-01, 2010-06-01, 2014-09-01, 2011-01-…
#> Call `print()` for full schema details
Multiple Funding Dates –> Take oldest
Example of funding round data:
#> Rows: 15
#> Columns: 14
#> $ round_id <int> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
#> $ round_uuid_pb <chr> NA, "47208-70T", NA, "58843-18T", NA, NA, NA, "78…
#> $ round_uuid_cb <chr> "a6d3bfd9-5afa-47ce-86de-30a3abad6c9b", NA, "ea3b…
#> $ announced_on <date> 2013-01-01, 2014-04-01, 2015-06-01, 2015-10-07, …
#> $ round_new <int> 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 12, 13
#> $ round <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
#> $ exit_cycle <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ last <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…
#> $ round_type_new <fct> Seed, Series A, Series B, Series C, Series D, Ser…
#> $ round_type <list> "angel", "angel", "early_vc", "early_vc", "conver…
#> $ round_types <list> <"angel", "angel_group", "investor", "company", "…
#> $ raised_amount <dbl> NA, 520000, NA, 1399999, NA, NA, NA, 3250000, NA,…
#> $ post_money_valuation <dbl> NA, NA, NA, 3399998, NA, NA, NA, 10249998, NA, N…
#> $ investors_in_round <list> [<tbl_df[1 x 11]>], [<tbl_df[1 x 11]>], [<tbl_df…
dplyr: not possible due to memory constraints.
Arrow: not possible due to structure constraints.
data.table: most efficient.
Conversion to data.tables necessary:
library(data.table)
# 1. Funding Data
# 1.1 Level 1
fc_wrangled_dt |> setDT()
# 1.2 Funding Data Level 2 (funding_rounds)
purrr::walk(fc_wrangled_dt$funding_rounds, setDT)
# 1.3 Remove unnecessary columns + initialize dummy for before_join
purrr::walk(fc_wrangled_dt$funding_rounds, ~ .x[,
`:=`(round_uuid_pb = NULL, round_uuid_cb = NULL, round_new = NULL, round = NULL,
exit_cycle = NULL, last = NULL, round_type = NULL, round_type_new = NULL,
round_types = NULL, post_money_valuation = NULL, investors_in_round = NULL, before_join = NA)
]
)
# 2. Matching Table
cb_cs_joined_slct_dt |> setDT()
# 3. Member experiences
me_wrangled_dt <- me_wrangled_prqt |> collect()
Working data.table solution (efficiency increase through join by reference possible).
# 1. Add company_id from funded_companies to member experiences
me_joined_dt <- cb_cs_joined_slct_dt[me_wrangled_dt, on = .(id_cs = company_id_cs), allow.cartesian = TRUE]
#> 12,978,226
# 2. Add funding data from funded_companies
me_joined_dt <- fc_wrangled_dt[me_joined_dt, on = .(company_id)]
#> 12,270,572
# 3. Remove duplicates (why are there any?)
me_joined_dt <- unique(me_joined_dt, by = setdiff(names(me_joined_dt), "funding_rounds"))
#> 12,270,572 ... No duplicates anymore; they were removed from cb_cs_joined_slct_dt.
Non-working dplyr solution
Arrow is not possible because of the nested funding data.
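To make the purpose of this join concrete: once every employment row carries the funding rounds of its company, features such as `num_rounds_before_join` and `raised_amount_before_join_company` can be derived by comparing round dates with the start date of the employment. The sketch below is a base R illustration on made-up data and rests on two assumptions: only rounds announced strictly before the start date count, and missing amounts are treated as zero. The actual definitions live in the scripts and may differ.

```r
# Toy funding rounds and toy employments (all values made up).
rounds <- data.frame(
  company_id    = c(1, 1, 1, 2),
  announced_on  = as.Date(c("2013-01-01", "2014-04-01", "2015-06-01", "2016-01-01")),
  raised_amount = c(NA, 520000, 1000000, 250000)
)
employments <- data.frame(
  member_id  = c(10, 11, 12),
  company_id = c(1, 1, 2),
  date_from  = as.Date(c("2012-06-01", "2014-12-01", "2017-03-01"))
)

# TRUE for the rounds of employment i's company announced before the join date.
prior_rounds <- function(i) {
  rounds$company_id == employments$company_id[i] &
    rounds$announced_on < employments$date_from[i]
}

rows <- seq_len(nrow(employments))
employments$num_rounds_before_join <- vapply(
  rows, function(i) sum(prior_rounds(i)), integer(1)
)
employments$raised_amount_before_join <- vapply(
  rows, function(i) sum(rounds$raised_amount[prior_rounds(i)], na.rm = TRUE), numeric(1)
)
employments
```

This row-wise loop is fine for a toy table but would be far too slow for roughly 12 million rows, which is where the data.table join above comes in.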
Using domain knowledge to extract features
From here on almost everything is in
Script and File locations
Scripts
05_analyses/03_cbpbcs/01_scripts
01_founding_vs_employment (Company Funding vs. Time of Employment (I. Time, II. Capital, III. Rounds))
02_stage_affiliation (stages based on age and funding; was discarded later on –> see 04_funding_history)
03_employment_history (Fortune500, Startup, Founding, Research Experiences)
04_funding_history (1. prior raised amount (person), 2. further funding (company), 3. funding per round, 4. Dataset based on series B)
05_education_history (merge with rankings, extract degrees)
06_skills
07_analyses/02_per_company/ (last plots)
#> Rows: 2,659,657
#> Columns: 67
#> $ company_id_cbpb <int> 90591, 152845, 90440, 138208, 116…
#> $ funding_after_mid <chr> "yes", NA, "yes", "yes", "yes", "…
#> $ funding_after_early <chr> "yes", "no", "yes", "yes", "yes",…
#> $ member_id <int> 878, 2104, 3548, 3548, 3970, 4005…
#> $ id_tie <int> 38, 67, 89, 89, 96, 104, 183, 175…
#> $ exp_id_cs <dbl> 2481733250, 1423977093, 2638, 263…
#> $ exp_corporate <dbl> 0.00000, 12.00000, 0.00000, 0.000…
#> $ exp_funded_startup <dbl> 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0,…
#> $ exp_founder <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0…
#> $ exp_f500 <dbl> 0.00000, 0.00000, 0.00000, 0.0000…
#> $ exp_research <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exp_research_ivy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ company_id_cs <int> 140537, 10644128, 6068905, 606890…
#> $ company_name_cs <chr> "Bristol-Myers Squibb", "HERE", "…
#> $ company_name_cbpb <chr> "receptos", "HERE Technologies Ch…
#> $ founded_on_cbpb <date> 2007-01-01, 2012-11-13, 2009-07-…
#> $ closed_on_cbpb <date> NA, NA, NA, NA, NA, NA, 2021-04-…
#> $ title_cs <chr> "Key Account Manager", "GIS Analy…
#> $ date_from_parsed_cs <date> 2006-01-01, 2016-01-01, 2010-01-…
#> $ date_to_parsed_cs <date> 2008-08-01, NA, NA, NA, 2011-10-…
#> $ tjoin_tfound <dbl> -12, 37, 6, 48, -47, 48, 17, 11, …
#> $ raised_amount_before_join_company <dbl> 0, 0, 0, 7722796, 0, 9961692, 333…
#> $ num_rounds_before_join <dbl> 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, …
#> $ is_f500 <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, F…
#> $ is_founder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ is_research <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ is_research_ivy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ date_1st_founder_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_f500_exp <date> 2006-01-01, NA, 2010-01-01, 2010…
#> $ date_1st_funded_startup_exp <date> 2006-01-01, 2016-01-01, 2010-01-…
#> $ date_1st_research_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_research_ivy_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_corporate_exp <date> 2009-02-01, 2015-01-01, NA, NA, …
#> $ time_since_1st_corporate_exp <dbl> NA, 12, NA, NA, 116, NA, 136, 40,…
#> $ time_since_1st_founder_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_f500_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_funded_startup_exp <dbl> NA, NA, NA, NA, NA, NA, 96, NA, N…
#> $ time_since_1st_research_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_research_ivy_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_experience <dbl> 0, 12, 0, 0, 116, 0, 136, 40, 176…
#> $ raised_amount_before_founder_member <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ raised_amount_before_all_member <dbl> NA, NA, NA, NA, NA, NA, 0, 0, NA,…
#> $ was_corporate_before <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, …
#> $ was_founder_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_f500_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_fc_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_uni_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_ivy_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ stage_mid <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ stage_late <lgl> NA, NA, NA, NA, NA, NA, FALSE, NA…
#> $ date_from_stage <chr> "early1", "mid", "early2", "mid",…
#> $ company_start_mid <date> 2009-01-01, 2014-11-13, 2011-07-…
#> $ company_start_late <date> 2009-11-23, 2017-11-13, 2014-07-…
#> $ rank_global_2023_best <int> 917, 1549, NA, NA, NA, NA, NA, NA…
#> $ score_global_2023_best <dbl> 40.6, 26.8, NA, NA, NA, NA, NA, N…
#> $ rank_national_2023_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ rank_national_during_enrollment_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ degree_ba2 <lgl> FALSE, TRUE, NA, NA, TRUE, NA, NA…
#> $ degree_ma2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_phd2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_mba2 <lgl> TRUE, FALSE, NA, NA, FALSE, NA, N…
#> $ num_rounds_cumulated_founder <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ num_rounds_cumulated_all <int> NA, NA, NA, NA, NA, NA, 1, 1, NA,…
#> $ announced_on_sB <date> 2012-02-03, 2018-01-04, 2011-03-…
#> $ round_type_new_next <fct> Series C, Series C, Series C, Ser…
#> $ raised_amount_cumsum_sB <dbl> 46043054, 0, 1905000, 11022796, 2…
#> $ raised_amount_cumsum_sB_next <dbl> 76043054, 0, 8712306, 13854868, 4…
cs_me_dist8_unest_wedu_dt |>
select(id_tie, member_id, exp_id_cs, company_id_cbpb, company_name_cbpb, company_id_cs, company_name_cs,
founded_on_cbpb, closed_on_cbpb,
title_cs) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 10
#> $ id_tie <int> 38, 67, 89, 89, 96, 104, 183, 175, 209, 243, 321, 37…
#> $ member_id <int> 878, 2104, 3548, 3548, 3970, 4005, 4224, 4224, 4317,…
#> $ exp_id_cs <dbl> 2481733250, 1423977093, 2638, 2638, 1736317868, 3084…
#> $ company_id_cbpb <int> 90591, 152845, 90440, 138208, 116099, 97810, 40123, …
#> $ company_name_cbpb <chr> "receptos", "HERE Technologies Chicago", "crowdtwist…
#> $ company_id_cs <int> 140537, 10644128, 6068905, 6068905, 11825305, 194148…
#> $ company_name_cs <chr> "Bristol-Myers Squibb", "HERE", "Oracle", "Oracle", …
#> $ founded_on_cbpb <date> 2007-01-01, 2012-11-13, 2009-07-01, 2006-01-01, 201…
#> $ closed_on_cbpb <date> NA, NA, NA, NA, NA, NA, 2021-04-09, NA, NA, NA, NA,…
#> $ title_cs <chr> "Key Account Manager", "GIS Analyst I", "QA", "QA", …
cs_me_dist8_unest_wedu_dt |>
select(date_from_parsed_cs, date_to_parsed_cs,
tjoin_tfound, raised_amount_before_join_company, num_rounds_before_join) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 5
#> $ date_from_parsed_cs <date> 2006-01-01, 2016-01-01, 2010-01-01,…
#> $ date_to_parsed_cs <date> 2008-08-01, NA, NA, NA, 2011-10-01,…
#> $ tjoin_tfound <dbl> -12, 37, 6, 48, -47, 48, 17, 11, 44,…
#> $ raised_amount_before_join_company <dbl> 0, 0, 0, 7722796, 0, 9961692, 333333…
#> $ num_rounds_before_join <dbl> 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, …
#> Rows: 2,659,657
#> Columns: 10
#> $ is_f500 <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
#> $ is_founder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_research <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_research_ivy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_corporate_before <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRU…
#> $ was_founder_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_f500_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_fc_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
#> $ was_uni_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_ivy_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
cs_me_dist8_unest_wedu_dt |>
select(
starts_with("date_1st_"),
starts_with("time_since_1st_"),
starts_with("exp_"), -exp_id_cs) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 19
#> $ date_1st_founder_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_f500_exp <date> 2006-01-01, NA, 2010-01-01, 2010-01…
#> $ date_1st_funded_startup_exp <date> 2006-01-01, 2016-01-01, 2010-01-01,…
#> $ date_1st_research_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_research_ivy_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_corporate_exp <date> 2009-02-01, 2015-01-01, NA, NA, 200…
#> $ time_since_1st_corporate_exp <dbl> NA, 12, NA, NA, 116, NA, 136, 40, 17…
#> $ time_since_1st_founder_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_f500_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_funded_startup_exp <dbl> NA, NA, NA, NA, NA, NA, 96, NA, NA, …
#> $ time_since_1st_research_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_research_ivy_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_experience <dbl> 0, 12, 0, 0, 116, 0, 136, 40, 176, 4…
#> $ exp_corporate <dbl> 0.00000, 12.00000, 0.00000, 0.00000,…
#> $ exp_funded_startup <dbl> 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0, 0,…
#> $ exp_founder <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
#> $ exp_f500 <dbl> 0.00000, 0.00000, 0.00000, 0.00000, …
#> $ exp_research <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exp_research_ivy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
cs_me_dist8_unest_wedu_dt |>
select(score_global_2023_best,
starts_with("rank"),
starts_with("degree")) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 8
#> $ score_global_2023_best <dbl> 40.6, 26.8, NA, NA, NA, NA, NA, N…
#> $ rank_global_2023_best <int> 917, 1549, NA, NA, NA, NA, NA, NA…
#> $ rank_national_2023_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ rank_national_during_enrollment_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ degree_ba2 <lgl> FALSE, TRUE, NA, NA, TRUE, NA, NA…
#> $ degree_ma2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_phd2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_mba2 <lgl> TRUE, FALSE, NA, NA, FALSE, NA, N…
cs_me_dist8_unest_wedu_dt |>
select(date_from_stage, company_start_mid, company_start_late,
raised_amount_before_founder_member, raised_amount_before_all_member,
funding_after_mid, funding_after_early) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 7
#> $ date_from_stage <chr> "early1", "mid", "early2", "mid", …
#> $ company_start_mid <date> 2009-01-01, 2014-11-13, 2011-07-0…
#> $ company_start_late <date> 2009-11-23, 2017-11-13, 2014-07-0…
#> $ raised_amount_before_founder_member <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ raised_amount_before_all_member <dbl> NA, NA, NA, NA, NA, NA, 0, 0, NA, …
#> $ funding_after_mid <chr> "yes", NA, "yes", "yes", "yes", "y…
#> $ funding_after_early <chr> "yes", "no", "yes", "yes", "yes", …