Current Status
Hamburg University of Technology
Wednesday, 12th of June 2024
Input + Output CoreSignal
Objective: Create list with domains and/or LinkedIn handles
Script and File locations
Data
01_data_sources/06_coresignal/01_data/01_input_for_cs/final_selection/version_3_cijs/
orgs_cs.csv
usr_cs.csv
Scripts
01_data_sources/06_coresignal/02_scripts/01_input_for_cs/
cs_org_select_v2.R
Instructions:
Data was proided as json
data:
Script and File locations
Data
01_data_sources/06_coresignal/01_data/02_raw/
company/202203_custom/...
member/202203_custom/...
Objectives: Extract data that is relevant (variables) for our analyses and convert it to .rds
/ .parquet
files
Script and File locations
Scripts
01_data_sources/06_coresignal/02_scripts/02_build_tables/
01_cs_build_table_company.R
02_cs_build_tables_member_V1_bd_exp_skills.R
03_cs_build_tables_member_V2_edu_exp.R
Data
01_data_sources/06_coresignal/01_data/03_extracted/
company/cs_companies_base.rds
member/01_basic_data/...
member/02_experience/...
(non-deduplicated)member/03_skills/...
member/04_education/...
(non-deduplicated)Everytime a user changes something on their profile a new record is being created (date, typos, names, …). The column deleted
is not useful.
Objectives: Deduplicate the member data: Experiences & Education Tables
Script and File locations
Scripts
01_data_sources/06_coresignal/02_scripts/02_build_tables/
04_cs_build_tables_member_exp_dist.R
05_cs_build_tables_member_edu_dist.R
Data
01_data_sources/06_coresignal/01_data/04_wrangled/
companies/cs_companies_base_slct.rds
(just relevant columns selected)member/02_experience/me_dist8/...
(deduplicated)member/04_education/02_wrangled_dist_chunked/...
(deduplicated)Wrangle and Join CrunchBase
/ PitchBook
and CoreSignal
data
CoreSignal did not provide a matching table but provided only the resulting data. Hence, backmapping to our CrunchBase / Pitchbook Data via domains is necessary:
Script and File locations
Scripts
02_data_mapping/10_cbpb_cs/01_scripts/
06_cbpb_cs_matching_companies.R
Data (Input)
02_data_mapping/10_cbpb_cs/02_data/
funded_companies.rds
(created by Christoph)Data (Output)
cbpb_cs_joined.rds
(companies joined)Map Crunchbase / Pitchbook data to CoreSignal profiles (via mapped companies (1))
Script and File locations
Scripts
02_data_mapping/10_cbpb_cs/01_scripts/
07_cs_cb_matching_employees.R
Data (Input)
01_data_sources/06_coresignal/01_data/04_wrangled/member/02_experience/me_dist8/02_unnested/ 02_data_mapping/10_cbpb_cs/02_data/
cs_me_dist8_unest_prqt
(CoreSignal distinct Member Experiences)cbpb_cs_joined.rds
(Joined CoreSignal / CrunchbasePitchbook Org Data)Data (Output)
cs_me_dist8_unest_fc_joined.parquet
(intermediate data)There are some further explanations about the org join
and empployee join
in the next two section (click on each tabset. There are also some fragments –> You have to hit enter/arrow to slide them in). You can skip to the next chapter (VAR I) though.
Crunchbase data contains 150,838 startups with a valid funding trajectory.
library(stringr)
fc_unnested_tbl |>
# 1. At least 1 identifier: 4.518 observations are filtered out
filter(if_any(c(domain, linkedin_url), ~!is.na(.))) |>
# 2. Extract linkedin handle & clean domains
mutate(linkedin_handle = linkedin_url |> str_extract("(?<=linkedin\\.com/company/).*?(?=(?:\\?|$|/))")) |>
mutate(domain = domain |> clean_domain()) |>
# 3. Remove 532 duplicates
distinct()
–> 145.991 distinct examineable companies.
Issue:
Some extracted domains are not unique and associated with multiple companies.
Manual Cleaning:
Domains with a count exceeding two were analyzed and set to NA if they do not correspond to the actual one.
# ANALYZE
# fc_wrangled_tbl |>
# distinct(company_id, domain) |>
# count(domain, sort = T) |>
# filter(n>2)`
unwanted_domains_cb <- c("webflow.io", "angel.co", "weebly.com", "wordpress.com", "wixsite.com", "squarespace.com",
"webflow.io", "crypt2esports.com", "myshopify.com", "business.site", "mystrikingly.com",
"launchrock.com", "square.site", "google.com", "sites.google.com", "t.co", "linktr.ee",
"netlify.app", "itunes.apple.com", "apple.com", "crunchb.com", "tumblr.com", "linkedin.com",
"godaddysites.com", "mit.edu", "paloaltonetworks.com", " wpengine.com", "facebook.com",
"intuit.com", "medium.com", "salesforce.com", "strikingly.com", "wix.com", "cisco.com",
"digi.me", "apps.apple.com", "bit.ly", "fleek.co", "harvard.edu", "ibm.com", "jimdo.com",
"myftpupload.com", "odoo.com", "storenvy.com", "twitter.com", "umd.edu", "umich.edu", "vmware.com", "webs.com")
# Not all observations with unwanted domains are bad per se:
wanted_ids_cb <- c(angel = 128006, `catapult-centres-uk` = 115854, digime1 = 140904, digimi2 = 95430, fleek = 50738,
jimdo = 108655, medium = 113415, storenvy = 85742, strikingly = 95831, substack = 34304,
tumblr = 84838, twitter = 53139, weebly = 91365, wpengine = 91720)
# Set misleading domains to NA
funded_companies_clnd <- fc_wrangled_tbl |>
mutate(domain = if_else(
domain %in% unwanted_domains_cb & !(company_id %in% wanted_ids_cb),
NA_character_, domain))
It appears that CoreSignal has been able to locate 45.026 companies within our gathered data.
Nothing to wrangle …
Important
More cleaning necessary (same as CBPB)! The task was undertaken with a limited degree of enthusiasm.
unwanted_domains_cs <- c("bit.ly", "linktr.ee", "facebook.com", "linkedin.com", "twitter.com", "crunchbase.com")
wanted_ids_cs <- c(crunchbase = 1634413, linkedin = 8568581, twitter = 24745469)
cs_companies_base_clnd <- cs_companies_base_wrangled |>
mutate(domain_cs = if_else(
domain_cs %in% unwanted_domains_cs & !(id_cs %in% wanted_ids_cs),
NA_character_,
domain_cs)
)
We were able to match 37.287 CS & CB/PB companies.
cb_cs_joined <- funded_companies_clnd |>
# Leftjoins
left_join(cs_companies_base_clnd |> select(id_cs, domain_cs), by = c(domain = "domain_cs"), na_matches = "never") |>
left_join(cs_companies_base_clnd |> select(id_cs, linkedin_handle_cs), by = c(linkedin_handle = "linkedin_handle_cs"), na_matches = "never") |>
# Remove obs with no cs_id
filter(!is.na(id_cs)) |>
# Remove matches, that matched different domains, but same company (e.g. company_id: 83060, id_cs: 4507928) block.xyz & squareup.com
select(company_id, id_cs) |>
distinct()
cb_cs_joined
We got over 460 million employment observations from CoreSignal.
But only ~50 Mil distinct employments
#> FileSystemDataset with 1 Parquet file
#> 51,621,196 rows x 10 columns
#> $ id_tie <int32> 16615559, 16615560, 16615561, 16615562, 1661556…
#> $ id <double> 2244288231, 254049663, 948937291, 254049667, 25…
#> $ member_id <int32> 179313066, 179313066, 179313066, 179313066, 179…
#> $ company_id <int32> 865089, 9098713, 9098713, NA, 865089, 9020540, …
#> $ company_name <string> "heritage community bank", "aurora bank fsb", "…
#> $ title <string> "AVP Chief Compliance/BSA Officer", "AVP Compli…
#> $ date_from_parsed <date32[day]> 2010-02-01, 2012-07-01, 2011-11-01, 1997-07-01,…
#> $ date_to_parsed <date32[day]> 2011-11-01, 2013-06-01, 2012-07-01, 2006-05-01,…
#> $ date_from_parsed_year <int32> 2010, 2012, 2011, 1997, 2006, 2019, 2017, 2021,…
#> $ date_to_parsed_year <int32> 2011, 2013, 2012, 2006, 2010, 2021, 2018, NA, 1…
#> Call `print()` for full schema details
Example
me_orig <- open_dataset("~/02_diss/01_coresignal/02_data/member_experience/me_orig/")
me_dist <- open_dataset("~/02_diss/01_coresignal/02_data/member_experience/me_dist/")
me_orig |> filter(member_id == 4257, company_id == 9007053) |> collect() |> as_tibble() |> arrange(date_from_parsed) |> print(n=19)
me_dist |> filter(member_id == 4257, company_id == 9007053) |> collect() |> as_tibble() |> arrange(date_from_parsed)
Over 10 million (valid: must have starting date) employments at our crunchbase / pitchbook data set companies. 385.100 with a title containing the string founder.
# Distinct company ids
cb_cs_joined_cs_ids <- cb_cs_joined |> distinct(id_cs) |> pull(id_cs)
me_wrangled_prqt <- me_dist8_prqt |>
# Select features
select(member_id, company_id, exp_id = "id", date_from_parsed) |>
# Select observations
filter(company_id %in% cb_cs_joined_cs_ids) |>
# - 967.080 observations (date_to not considered yet)
filter(!is.na(date_from_parsed)) |>
# Add suffix to col names
rename_with(~ paste(., "cs", sep = "_")) |>
compute()
me_wrangled_prqt |>
glimpse()
#> Table
#> 11,050,164 rows x 4 columns
#> $ member_id_cs <int32> 9897605, 9897605, 9897605, 9897605, 9897928,…
#> $ company_id_cs <int32> 1105483, 5181133, 5181133, 5181133, 5025265,…
#> $ exp_id_cs <double> 1665233144, 12744849, 995032176, 1665233146,…
#> $ date_from_parsed_cs <date32[day]> 2018-03-01, 2010-06-01, 2014-09-01, 2011-01-…
#> Call `print()` for full schema details
Multiple Funding Dates –> Take oldest
Example of funding round data:
#> Rows: 15
#> Columns: 14
#> $ round_id <int> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
#> $ round_uuid_pb <chr> NA, "47208-70T", NA, "58843-18T", NA, NA, NA, "78…
#> $ round_uuid_cb <chr> "a6d3bfd9-5afa-47ce-86de-30a3abad6c9b", NA, "ea3b…
#> $ announced_on <date> 2013-01-01, 2014-04-01, 2015-06-01, 2015-10-07, …
#> $ round_new <int> 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 12, 13
#> $ round <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
#> $ exit_cycle <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ last <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…
#> $ round_type_new <fct> Seed, Series A, Series B, Series C, Series D, Ser…
#> $ round_type <list> "angel", "angel", "early_vc", "early_vc", "conver…
#> $ round_types <list> <"angel", "angel_group", "investor", "company", "…
#> $ raised_amount <dbl> NA, 520000, NA, 1399999, NA, NA, NA, 3250000, NA,…
#> $ post_money_valuation <dbl> NA, NA, NA, 3399998, NA, NA, NA, 10249998, NA, N…
#> $ investors_in_round <list> [<tbl_df[1 x 11]>], [<tbl_df[1 x 11]>], [<tbl_df…
dplyr
due to memory constraint not possible.Arrow
due to structure constraints not possible.data.table
most efficient.Conversion to data.tables
necessary:
# 1. Funding Data
# 1.1 Level 1
fc_wrangled_dt |> setDT()
# 1.2 Funding Data Level 2 (funding_rounds)
purrr::walk(fc_wrangled_dt$funding_rounds, setDT)
# 1.3 Remove unnecessary columns + initialize dummy for before_join
purrr::walk(fc_wrangled_dt$funding_rounds, ~ .x[,
`:=`(round_uuid_pb = NULL, round_uuid_cb = NULL, round_new = NULL, round = NULL,
exit_cycle = NULL, last = NULL, round_type = NULL, round_type_new = NULL,
round_types = NULL, post_money_valuation = NULL, investors_in_round = NULL, before_join = NA)
]
)
# 2. Matching Table
cb_cs_joined_slct_dt |> setDT()
# 3. Member experiences
me_wrangled_dt <- me_wrangled_prqt |> collect()
Working data.table
solution (efficiency increase through join by reference
possible).
# 1. Add company_id from funded_companies to member experiences
me_joined_dt <- cb_cs_joined_slct_dt[me_wrangled_dt, on = .(id_cs = company_id_cs), allow.cartesian = TRUE]
#> 12.978.226
# 2. Add funding data from funded_companies
me_joined_dt <- fc_wrangled_dt[me_joined_dt, on = .(company_id)]
#> 12.270.572
# 3. Remove duplicates (why are there any?)
me_joined_dt <- unique(me_joined_dt, by = setdiff(names(me_joined_dt), "funding_rounds"))
#> 12.270.572 .... No duplicates anymore. Removed from cb_cs_joined_slct_dt
Not working dplyr
solution
Arrow
because of nested funding data not possible.
Using domain knowledge to extract features
From here on almost everything is in
Script and File locations
Scripts
05_analyses/03_cbpbcs/01_scripts
01_founding_vs_employment
(Company Funding vs. Time of Employment (I. Time, II. Capital, III. Rounds))02_stage_affiliation
(stages based on Age and funding. was discarded later on –> see 04_funding_history)03_employment_history
(Fortune500, Startup, Founding, Research Experiences)04_funding_history
(1. prior raised amount (person), 2. further funding (company), 3. funding per round, 4. Dataset based on series B)05_education_history
(merge with rankings, extract degrees)06_skills
07_analyses/02_per_company/
(last plots)#> Rows: 2,659,657
#> Columns: 67
#> $ company_id_cbpb <int> 90591, 152845, 90440, 138208, 116…
#> $ funding_after_mid <chr> "yes", NA, "yes", "yes", "yes", "…
#> $ funding_after_early <chr> "yes", "no", "yes", "yes", "yes",…
#> $ member_id <int> 878, 2104, 3548, 3548, 3970, 4005…
#> $ id_tie <int> 38, 67, 89, 89, 96, 104, 183, 175…
#> $ exp_id_cs <dbl> 2481733250, 1423977093, 2638, 263…
#> $ exp_corporate <dbl> 0.00000, 12.00000, 0.00000, 0.000…
#> $ exp_funded_startup <dbl> 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0,…
#> $ exp_founder <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0…
#> $ exp_f500 <dbl> 0.00000, 0.00000, 0.00000, 0.0000…
#> $ exp_research <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exp_research_ivy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ company_id_cs <int> 140537, 10644128, 6068905, 606890…
#> $ company_name_cs <chr> "Bristol-Myers Squibb", "HERE", "…
#> $ company_name_cbpb <chr> "receptos", "HERE Technologies Ch…
#> $ founded_on_cbpb <date> 2007-01-01, 2012-11-13, 2009-07-…
#> $ closed_on_cbpb <date> NA, NA, NA, NA, NA, NA, 2021-04-…
#> $ title_cs <chr> "Key Account Manager", "GIS Analy…
#> $ date_from_parsed_cs <date> 2006-01-01, 2016-01-01, 2010-01-…
#> $ date_to_parsed_cs <date> 2008-08-01, NA, NA, NA, 2011-10-…
#> $ tjoin_tfound <dbl> -12, 37, 6, 48, -47, 48, 17, 11, …
#> $ raised_amount_before_join_company <dbl> 0, 0, 0, 7722796, 0, 9961692, 333…
#> $ num_rounds_before_join <dbl> 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, …
#> $ is_f500 <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, F…
#> $ is_founder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ is_research <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ is_research_ivy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ date_1st_founder_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_f500_exp <date> 2006-01-01, NA, 2010-01-01, 2010…
#> $ date_1st_funded_startup_exp <date> 2006-01-01, 2016-01-01, 2010-01-…
#> $ date_1st_research_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_research_ivy_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ date_1st_corporate_exp <date> 2009-02-01, 2015-01-01, NA, NA, …
#> $ time_since_1st_corporate_exp <dbl> NA, 12, NA, NA, 116, NA, 136, 40,…
#> $ time_since_1st_founder_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_f500_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_funded_startup_exp <dbl> NA, NA, NA, NA, NA, NA, 96, NA, N…
#> $ time_since_1st_research_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_research_ivy_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ time_since_1st_experience <dbl> 0, 12, 0, 0, 116, 0, 136, 40, 176…
#> $ raised_amount_before_founder_member <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ raised_amount_before_all_member <dbl> NA, NA, NA, NA, NA, NA, 0, 0, NA,…
#> $ was_corporate_before <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, …
#> $ was_founder_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_f500_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_fc_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_uni_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ was_ivy_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ stage_mid <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ stage_late <lgl> NA, NA, NA, NA, NA, NA, FALSE, NA…
#> $ date_from_stage <chr> "early1", "mid", "early2", "mid",…
#> $ company_start_mid <date> 2009-01-01, 2014-11-13, 2011-07-…
#> $ company_start_late <date> 2009-11-23, 2017-11-13, 2014-07-…
#> $ rank_global_2023_best <int> 917, 1549, NA, NA, NA, NA, NA, NA…
#> $ score_global_2023_best <dbl> 40.6, 26.8, NA, NA, NA, NA, NA, N…
#> $ rank_national_2023_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ rank_national_during_enrollment_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ degree_ba2 <lgl> FALSE, TRUE, NA, NA, TRUE, NA, NA…
#> $ degree_ma2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_phd2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_mba2 <lgl> TRUE, FALSE, NA, NA, FALSE, NA, N…
#> $ num_rounds_cumulated_founder <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ num_rounds_cumulated_all <int> NA, NA, NA, NA, NA, NA, 1, 1, NA,…
#> $ announced_on_sB <date> 2012-02-03, 2018-01-04, 2011-03-…
#> $ round_type_new_next <fct> Series C, Series C, Series C, Ser…
#> $ raised_amount_cumsum_sB <dbl> 46043054, 0, 1905000, 11022796, 2…
#> $ raised_amount_cumsum_sB_next <dbl> 76043054, 0, 8712306, 13854868, 4…
cs_me_dist8_unest_wedu_dt |>
select(id_tie, member_id, exp_id_cs, company_id_cbpb, company_name_cbpb, company_id_cs, company_name_cs,
founded_on_cbpb, closed_on_cbpb,
title_cs) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 10
#> $ id_tie <int> 38, 67, 89, 89, 96, 104, 183, 175, 209, 243, 321, 37…
#> $ member_id <int> 878, 2104, 3548, 3548, 3970, 4005, 4224, 4224, 4317,…
#> $ exp_id_cs <dbl> 2481733250, 1423977093, 2638, 2638, 1736317868, 3084…
#> $ company_id_cbpb <int> 90591, 152845, 90440, 138208, 116099, 97810, 40123, …
#> $ company_name_cbpb <chr> "receptos", "HERE Technologies Chicago", "crowdtwist…
#> $ company_id_cs <int> 140537, 10644128, 6068905, 6068905, 11825305, 194148…
#> $ company_name_cs <chr> "Bristol-Myers Squibb", "HERE", "Oracle", "Oracle", …
#> $ founded_on_cbpb <date> 2007-01-01, 2012-11-13, 2009-07-01, 2006-01-01, 201…
#> $ closed_on_cbpb <date> NA, NA, NA, NA, NA, NA, 2021-04-09, NA, NA, NA, NA,…
#> $ title_cs <chr> "Key Account Manager", "GIS Analyst I", "QA", "QA", …
cs_me_dist8_unest_wedu_dt |>
select(date_from_parsed_cs, date_to_parsed_cs,
tjoin_tfound, raised_amount_before_join_company, num_rounds_before_join) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 5
#> $ date_from_parsed_cs <date> 2006-01-01, 2016-01-01, 2010-01-01,…
#> $ date_to_parsed_cs <date> 2008-08-01, NA, NA, NA, 2011-10-01,…
#> $ tjoin_tfound <dbl> -12, 37, 6, 48, -47, 48, 17, 11, 44,…
#> $ raised_amount_before_join_company <dbl> 0, 0, 0, 7722796, 0, 9961692, 333333…
#> $ num_rounds_before_join <dbl> 0, 1, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, …
#> Rows: 2,659,657
#> Columns: 10
#> $ is_f500 <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
#> $ is_founder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_research <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ is_research_ivy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_corporate_before <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRU…
#> $ was_founder_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_f500_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_fc_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
#> $ was_uni_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ was_ivy_before <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
cs_me_dist8_unest_wedu_dt |>
select(
starts_with("date_1st_"),
starts_with("time_since_1st_"),
starts_with("exp_"), -exp_id_cs) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 19
#> $ date_1st_founder_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_f500_exp <date> 2006-01-01, NA, 2010-01-01, 2010-01…
#> $ date_1st_funded_startup_exp <date> 2006-01-01, 2016-01-01, 2010-01-01,…
#> $ date_1st_research_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_research_ivy_exp <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ date_1st_corporate_exp <date> 2009-02-01, 2015-01-01, NA, NA, 200…
#> $ time_since_1st_corporate_exp <dbl> NA, 12, NA, NA, 116, NA, 136, 40, 17…
#> $ time_since_1st_founder_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_f500_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_funded_startup_exp <dbl> NA, NA, NA, NA, NA, NA, 96, NA, NA, …
#> $ time_since_1st_research_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_research_ivy_exp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ time_since_1st_experience <dbl> 0, 12, 0, 0, 116, 0, 136, 40, 176, 4…
#> $ exp_corporate <dbl> 0.00000, 12.00000, 0.00000, 0.00000,…
#> $ exp_funded_startup <dbl> 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0, 0,…
#> $ exp_founder <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
#> $ exp_f500 <dbl> 0.00000, 0.00000, 0.00000, 0.00000, …
#> $ exp_research <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exp_research_ivy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
cs_me_dist8_unest_wedu_dt |>
select(score_global_2023_best,
starts_with("rank"),
starts_with("degree")) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 8
#> $ score_global_2023_best <dbl> 40.6, 26.8, NA, NA, NA, NA, NA, N…
#> $ rank_global_2023_best <int> 917, 1549, NA, NA, NA, NA, NA, NA…
#> $ rank_national_2023_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ rank_national_during_enrollment_best <int> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ degree_ba2 <lgl> FALSE, TRUE, NA, NA, TRUE, NA, NA…
#> $ degree_ma2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_phd2 <lgl> FALSE, FALSE, NA, NA, FALSE, NA, …
#> $ degree_mba2 <lgl> TRUE, FALSE, NA, NA, FALSE, NA, N…
cs_me_dist8_unest_wedu_dt |>
select(date_from_stage, company_start_mid, company_start_late,
raised_amount_before_founder_member, raised_amount_before_all_member,
funding_after_mid, funding_after_early) |>
glimpse()
#> Rows: 2,659,657
#> Columns: 7
#> $ date_from_stage <chr> "early1", "mid", "early2", "mid", …
#> $ company_start_mid <date> 2009-01-01, 2014-11-13, 2011-07-0…
#> $ company_start_late <date> 2009-11-23, 2017-11-13, 2014-07-0…
#> $ raised_amount_before_founder_member <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ raised_amount_before_all_member <dbl> NA, NA, NA, NA, NA, NA, 0, 0, NA, …
#> $ funding_after_mid <chr> "yes", NA, "yes", "yes", "yes", "y…
#> $ funding_after_early <chr> "yes", "no", "yes", "yes", "yes", …