Diss

Current Status

Joschka Schwarz

Hamburg University of Technology

Wednesday, 12th of June 2024

Table of Contents

Topics

Possible Overarching Topics

  • Bricolage and effectuation
  • Technical Entrepreneur (prior knowledge)
  • Contingency
  • Novelty and Technological uncertainty
  • Knowledge spillover

We are interested in how actors are influenced by and interact with their social and cultural environments to bring about novelty, e.g. with regard to ideas, teams, products or business practices.

Data sources

Research Scope

 

W-11 Research Focus & Fields

Data sources

Step 1: Identity matching

Linking developer data with startup data

Data GH

GitHub: 1. GHTorrent and 2. GitHub API

pacman::p_load(DBI, RMariaDB, dplyr)

# The password is read from a .Renviron file that is not stored in version control (GitHub).
# SSH port forwarding/tunneling is used for the MySQL connection (hence the local host and port).
con <- dbConnect(
  drv      = MariaDB(),
  dbname   = "ghtorrent_restore",
  username = "ghtorrentuser",
  password = Sys.getenv("GHTORRENTPASSWORD"),
  host     = "127.0.0.1",
  port     = 3307
)

con |> dbListTables()
tbl(con, "users") |> 
  count()
tbl(con,"projects") |> 
  count()
tbl(con,"project_commits") |> 
  count()
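
Because `Sys.getenv()` silently returns an empty string when a variable is missing, a connection attempt with an unset password fails with a less obvious error. A minimal sketch of a guard, assuming a hypothetical helper named `get_required_env` (not part of the original code):

```r
# Hypothetical helper: fetch an environment variable or stop with a clear message.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    stop(
      "Environment variable '", name, "' is not set; add it to .Renviron.",
      call. = FALSE
    )
  }
  value
}
```

It could then be used as `password = get_required_env("GHTORRENTPASSWORD")` in the `dbConnect()` call.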

GHTorrent user data does not contain much valuable information:

users <- tbl(con, "users") 
users |>   
  glimpse()
users |> 
  filter(login == "christophihl") |> 
  glimpse()

Richer data via the API: name, email, avatar, blog/URL, bio

Single API Request

# Request by login
httr::GET("https://api.github.com/users/christophihl")
# Equivalent request by numeric user ID
httr::GET("https://api.github.com/user/8004978")
#> Rows: 1
#> Columns: 28
#> $ login               <chr> "christophihl"
#> $ id                  <int> 8004978
#> $ node_id             <chr> "MDQ6VXNlcjgwMDQ5Nzg="
#> $ avatar_url          <chr> "https://avatars.githu…
#> $ gravatar_id         <chr> ""
#> $ url                 <chr> "https://api.github.co…
#> $ html_url            <chr> "https://github.com/ch…
#> $ followers_url       <chr> "https://api.github.co…
#> $ following_url       <chr> "https://api.github.co…
#> $ gists_url           <chr> "https://api.github.co…
#> $ starred_url         <chr> "https://api.github.co…
#> $ subscriptions_url   <chr> "https://api.github.co…
#> $ organizations_url   <chr> "https://api.github.co…
#> $ repos_url           <chr> "https://api.github.co…
#> $ events_url          <chr> "https://api.github.co…
#> $ received_events_url <chr> "https://api.github.co…
#> $ type                <chr> "User"
#> $ site_admin          <lgl> FALSE
#> $ name                <chr> "Christoph Ihl"
#> $ company             <chr> "Hamburg University of…
#> $ blog                <chr> "www.startupengineer.i…
#> $ location            <chr> "Hamburg"
#> $ public_repos        <int> 21
#> $ public_gists        <int> 1
#> $ followers           <int> 12
#> $ following           <int> 1
#> $ created_at          <chr> "2014-06-27T11:22:22Z"
#> $ updated_at          <chr> "2024-04-23T13:22:01Z"
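
The output above is a parsed, one-row view of the JSON response, whereas `httr::GET()` on its own returns a raw response object. A minimal sketch of such a parsing step, assuming the jsonlite and tibble packages are installed; `parse_gh_user` is a hypothetical helper name, and the original wrangling code may differ (for instance, it appears to drop empty fields, while this sketch keeps them as `NA`):

```r
library(jsonlite)
library(tibble)

# Hypothetical helper: turn the JSON body of a user request into a one-row tibble.
# Assumes a flat JSON object with scalar fields only.
parse_gh_user <- function(json_text) {
  fields <- fromJSON(json_text)
  # JSON null arrives as NULL; replace it with NA so every field fills one cell.
  fields <- lapply(fields, function(x) if (is.null(x)) NA else x)
  as_tibble(fields)
}
```

Usage might look like `httr::GET(url) |> httr::content(as = "text", encoding = "UTF-8") |> parse_gh_user() |> dplyr::glimpse()`.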

Final API dataset

open_dataset("gh_api_users_wrangled.parquet") |> 
  glimpse() 

Two account types: organizations and users

# Request by login
httr::GET("https://api.github.com/users/TUHHStartupEngineers")
# Equivalent request by numeric ID
httr::GET("https://api.github.com/user/30825260")
#> Rows: 1
#> Columns: 29
#> $ login               <chr> "TUHHStartupEngineers"
#> $ id                  <int> 30825260
#> $ node_id             <chr> "MDEyOk9yZ2FuaXphdGlvbjMwODI1M…
#> $ avatar_url          <chr> "https://avatars.githubusercon…
#> $ gravatar_id         <chr> ""
#> $ url                 <chr> "https://api.github.com/users/…
#> $ html_url            <chr> "https://github.com/TUHHStartu…
#> $ followers_url       <chr> "https://api.github.com/users/…
#> $ following_url       <chr> "https://api.github.com/users/…
#> $ gists_url           <chr> "https://api.github.com/users/…
#> $ starred_url         <chr> "https://api.github.com/users/…
#> $ subscriptions_url   <chr> "https://api.github.com/users/…
#> $ organizations_url   <chr> "https://api.github.com/users/…
#> $ repos_url           <chr> "https://api.github.com/users/…
#> $ events_url          <chr> "https://api.github.com/users/…
#> $ received_events_url <chr> "https://api.github.com/users/…
#> $ type                <chr> "Organization"
#> $ site_admin          <lgl> FALSE
#> $ name                <chr> "TUHH Institute of Entrepreneu…
#> $ blog                <chr> "www.startupengineer.io"
#> $ location            <chr> "Hamburg University of Technol…
#> $ email               <chr> "startup.engineer@tuhh.de"
#> $ bio                 <chr> "Data Science, Research & Prac…
#> $ public_repos        <int> 8
#> $ public_gists        <int> 0
#> $ followers           <int> 9
#> $ following           <int> 0
#> $ created_at          <chr> "2017-08-08T07:40:56Z"
#> $ updated_at          <chr> "2024-05-03T11:35:16Z"
org_members <- tbl(con, "organization_members")
org_members
org_members |> 
  left_join(users, by = c(org_id  = "id")) |> 
  left_join(users, by = c(user_id = "id")) |> 
  filter(login.x == "TUHHStartupEngineers") |> 
  select(login.y)
  1. Organization affiliation:
gh_org_affil <- org_members |> 
  # Logins for users & orgs (via GHT)
  left_join(gh_ght_users, by = c(org_id  = "id")) |> 
  left_join(gh_ght_users, by = c(user_id = "id")) |> 
  
  # Domains for orgs (via API)
  left_join(gh_api_users, by = c(org_login = "login"))
  2. Final dataset:
gh_api_users |> 
  
  # Add org memberships (+ domains)
  left_join(gh_org_affil) |> 

  # Add location data
  left_join(gh_ght_users)
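
The organization domains mentioned above presumably come from a URL-like profile field such as `blog`. A rough sketch of how a host name could be extracted from such a field, using base R only; `extract_domain` is a hypothetical helper, the example URLs other than `www.startupengineer.io` are made up, and subdomains are kept as-is rather than reduced to a registrable domain:

```r
# Hypothetical helper: reduce a URL-like string to its host name.
extract_domain <- function(url) {
  host <- tolower(trimws(url))
  host <- sub("^[a-z][a-z0-9+.-]*://", "", host)  # drop the scheme, e.g. https://
  host <- sub("[/?#].*$", "", host)               # drop path, query and fragment
  host <- sub(":[0-9]+$", "", host)               # drop an explicit port
  host <- sub("^www\\.", "", host)                # drop a leading www.
  host[!nzchar(host)] <- NA_character_            # treat empty strings as missing
  host
}

extract_domain(c("www.startupengineer.io", "https://www.example.org/team/"))
```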

Data SO

Stack Overflow

  • GitHub and Stack Overflow are two of the most widely adopted and studied developer platforms.
  • The two platforms serve different purposes: code sharing and collaborative development vs. information and knowledge exchange.
  • At the same time, both potentially serve the same community of developers and the same overall goal, i.e., software development.

Current Dump from 2021

open_dataset("so_dump_2021_09_users.parquet") |> 
  
  glimpse() 
open_dataset("so_dump_2021_09_users.parquet") |> 
  filter(DisplayName == "Christoph Ihl") |> 
  glimpse() 

Old Dump from 2013

open_dataset("so_dump_2013_09_users.parquet") |> 
  
  glimpse() 
  • Name
  • Name and Location
  • Profile images
join <- open_dataset("gh_api_users.parquet") |> 
  inner_join(
    open_dataset("so_users_joined.parquet"),
    by         = c(name = "DisplayName", location = "Location"),
    na_matches = "never"
  ) |> 
  compute()

join |> 
  nrow()
#> [1] 724961
join |> 
  filter(location == "Hamburg") |> 
  count(name, sort = TRUE)
#> # A tibble: 43 × 2
#>   name        n
#>   <chr>   <int>
#> 1 Jan        10
#> 2 Alex        8
#> 3 Chris       8
#> 4 Patrick     6
#> 5 Nils        5
#> 6 Fabian      4
#> 7 Dennis      3
#> # ℹ 36 more rows

Optimization: OpenStreetMap API and only full names
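
The Hamburg counts above are dominated by single first names, which match many unrelated people. A minimal sketch of the "only full names" restriction, assuming a full name means at least two whitespace-separated tokens; `keep_full_names` is a hypothetical helper, and the location normalization via OpenStreetMap is not covered here:

```r
library(dplyr)

# Hypothetical helper: keep only rows whose name has at least two tokens.
# Missing names are dropped, because grepl() returns FALSE for NA.
keep_full_names <- function(tbl, name_col = "name") {
  tbl |>
    filter(grepl("^\\S+(\\s+\\S+)+$", trimws(.data[[name_col]]), perl = TRUE))
}

tibble(name = c("Jan", "Christoph Ihl", NA, " Alex ")) |>
  keep_full_names()
```

Applied to both sides before the join, a filter like this would remove entries such as "Jan" or "Alex" while keeping "Christoph Ihl".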

# Stack Overflow
knitr::include_graphics(
  "https://graph.facebook.com/920949401423102/picture?type=large"
)

# GitHub
knitr::include_graphics(
  "https://avatars.githubusercontent.com/u/8004978?v=4"
)

      

Job title and company

Unique identifier

Able to run through a specific ID

open_dataset("so_users_joined.parquet") |> 
  
  glimpse()
open_dataset("so_users_joined.parquet") |> 
  filter(DisplayName == "Christoph Ihl") |> 
  glimpse()
gh_api_users_orgs_locs_nested_tbl <- gh_api_users_orgs_locs_tbl |> 
  nest(organization = c(ght_org_id, org_login, org_domain, member_created_at))

gh_so_joined_tbl <- gh_api_users_orgs_locs_nested_tbl |> 
  left_join(so_joined_tbl, by = c(login      = "github_handle"), na_matches = "never") |> 
  left_join(so_joined_tbl, by = c(email_hash = "EmailHash"),     na_matches = "never")

gh_so_joined_tbl |> 
  glimpse()
#> FileSystemDataset with 1 Parquet file
#> 42,557,276 rows x 22 columns
#> $ api_usr_id                     <int32> 1, 2, 3, 4, 5, 6, 7, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30…
#> $ ght_usr_id                     <int32> 9236, 1570, 13256, 3892, 96349, 17407, 52402, 171316, 41811, 2159, 1300…
#> $ login                         <string> "mojombo", "defunkt", "pjhyett", "wycats", "ezmobius", "ivey", "evanphx…
#> $ type                 <dictionary<...>> User, User, User, User, User, User, User, User, User, User, User, User,…
#> $ name                          <string> "Tom Preston-Werner", "Chris Wanstrath", "PJ Hyett", "Yehuda Katz", "Ez…
#> $ company                       <string> NA, "@github ", "GitHub, Inc.", "Tilde, Inc.", "Stuffstr PBC", "@RiotGa…
#> $ blog                          <string> "http://tom.preston-werner.com", "http://chriswanstrath.com/", "https:/…
#> $ email                         <string> "tom@mojombo.com", "chris@github.com", "pj@hyett.com", "wycats@gmail.co…
#> $ email_hash                    <string> "25c7c18223fb42a4c6ae1c8db6f50f9b", "74858be1905a8bbdb565109107384bd9",…
#> $ bio                           <string> NA, "🍔 ", NA, NA, NA, NA, NA, NA, NA, "Co-founder and CEO, Code Climat…
#> $ location                      <string> "San Francisco", "San Francisco", "San Francisco", "San Francisco", "In…
#> $ country_code         <dictionary<...>> US, NA, us, US, DE, US, US, US, US, US, NA, US, CA, US, US, US, US, AL,…
#> $ state                         <string> "CA", NA, "San Francisco County", "CA", "Nordrhein-Westfalen", "AL", "C…
#> $ city                          <string> "San Francisco", NA, "San Francisco", "San Francisco", "Hennef (Sieg)",…
#> $ usr_created_at <timestamp[us, tz=UTC]> 2007-10-20 05:24:19, 2007-10-20 05:24:19, 2008-01-07 17:54:22, 2008-01-…
#> $ organization               <list<...>> [<tbl_df[2 x 4]>], [<tbl_df[2 x 4]>], [<tbl_df[1 x 4]>], [<tbl_df[16 x …
#> $ so_id                         <double> NA, NA, NA, 122162, NA, 239960, 1335022, NA, 365701, 7392312, NA, 26342…
#> $ so_login                      <string> NA, NA, NA, "Yehuda Katz", NA, "Michael D. Ivey", "Evan Phoenix", NA, "…
#> $ so_current_position           <string> NA, NA, NA, "Founder at Tilde, Inc.", NA, NA, NA, NA, NA, "CEO at Code …
#> $ so_twitter_handle             <string> NA, NA, NA, NA, NA, NA, NA, NA, NA, "brynary", NA, NA, NA, NA, NA, "top…
#> $ so_website_url                <string> NA, NA, NA, "http://www.tilde.io", NA, "http://gweezlebur.com", "http:/…
#> $ so_location                   <string> NA, NA, NA, "Portland, OR", NA, "Bay Minette, AL", NA, NA, "Buffalo, NY…
#> Call `print()` for full schema details
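
The second join above matches on `email_hash`. The source does not state how the hashes were computed; a plausible sketch, assuming both sides hold MD5 digests of the trimmed, lower-cased address (the Gravatar convention) and that the digest package is installed. `email_md5` is a hypothetical helper name:

```r
library(digest)

# Hypothetical helper: MD5 of the normalized email address, NA stays NA.
email_md5 <- function(email) {
  normalized <- tolower(trimws(email))
  vapply(
    normalized,
    function(x) {
      if (is.na(x)) NA_character_ else digest(x, algo = "md5", serialize = FALSE)
    },
    character(1),
    USE.NAMES = FALSE
  )
}
```

Normalizing before hashing matters: without it, `" User@Example.com "` and `"user@example.com"` would produce different hashes and the join would miss the match.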

Data AL

AngelList (now Wellfound) data.

AngelList

HTML is missing in source folder

  • Iterated over 10.3 million user IDs
open_dataset("al_profiles.parquet") |> 
  glimpse() 
  • Iterated over 7.3 million company IDs
open_dataset("al_orgs.parquet") |> 
  glimpse() 
  • Iterated over 10.3 million user IDs
open_dataset("al_profiles_main_wrangled.parquet") |> 
  glimpse() 
  • Iterated over 7.3 million company IDs
open_dataset("al_employees.parquet") |> 
  glimpse() 

gh_so_al_final <- gh_so_joined_tbl |> 
  
  # 1. Via github_handle
  left_join(al_profiles_main_wrangled, by = c("login"    = "github_handle"),  na_matches = "never") |> 
  # 2. Via SO_handle(s) 
  left_join(al_profiles_main_wrangled, by = c("so_login" = "so_handle"),      na_matches = "never") |> 
  # 3. Via twitter_handle
  left_join(al_profiles_main_wrangled, by = c("so_twitter_handle" = "twitter_handle"), na_matches = "never")
gh_so_al_unnested_tbl <- gh_so_al_unnested_tbl |> 
  
  left_join(cb_people_handles, by = c(facebook_usr_al = "facebook_handle"), na_matches = "never") |> 
  left_join(cb_people_handles, by = c(twitter_usr     = "twitter_handle"),  na_matches = "never") |> 
  left_join(cb_people_handles, by = c(linkedin_usr_al = "linkedin_handle"), na_matches = "never") |>  

  left_join(cb_orgs_handles,   by = c(facebook_org_al = "facebook_handle"), na_matches = "never") |> 
  left_join(cb_orgs_handles,   by = c(twitter_org_al  = "twitter_handle"),  na_matches = "never") |> 
  left_join(cb_orgs_handles,   by = c(linkedin_org_al = "linkedin_handle"), na_matches = "never") |> 
  
  left_join(cb_jobs,           by = c(uuid_usr_cb     = "person_uuid"))
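
The chained left joins above try several keys in turn, so each key contributes its own set of matched columns. One way to collapse them into a single identifier is to take the first non-missing match. A toy sketch with made-up data; all table and column names below are illustrative assumptions, not the original pipeline:

```r
library(dplyr)

# Toy data: one GitHub user matches by handle, one by email hash, one not at all.
gh <- tibble(login = c("a", "b", "c"), email_hash = c("h1", "h2", "h3"))
so <- tibble(
  so_id         = c(10, 20),
  github_handle = c("a", NA),
  EmailHash     = c(NA, "h2")
)

matched <- gh |>
  # 1. Match via GitHub handle
  left_join(
    select(so, so_id_handle = so_id, github_handle),
    by = c(login = "github_handle"), na_matches = "never"
  ) |>
  # 2. Match via email hash
  left_join(
    select(so, so_id_hash = so_id, EmailHash),
    by = c(email_hash = "EmailHash"), na_matches = "never"
  ) |>
  # Prefer the handle match and fall back to the hash match
  mutate(so_id = coalesce(so_id_handle, so_id_hash))
```

Renaming the matched column per key before joining avoids the `.x`/`.y` suffixes and keeps the priority order of the keys explicit.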

Data CS

Coresignal

Global options

The options below are user-configurable. In a regular Reveal.js presentation, they are set through JavaScript; Quarto makes them configurable through YAML options.

  1. menubarclass
  2. menuclass
  3. activeclass
  4. activeelement
  5. barhtml
  6. flat
  7. scale

Option 1: menubarclass

The menubarclass option sets the classname of menubars.

format:
  revealjs:
    ...
    simplemenu:
      menubarclass: "menubar"

Simplemenu will show the menubar(s) on all pages. If you do not want to show the menubar on certain pages, use data-state="hide-menubar" on that section. This behaviour also works when exporting to PDF using the Reveal.js ?print-pdf option.

Option 2: menuclass

The menuclass option sets the classname of the menu.

format:
  revealjs:
    ...
    simplemenu:
      menuclass: "menu"

Simplemenu looks inside this menu for list items (li elements).

Option 3: activeclass

The activeclass option is the class an active menuitem gets.

format:
  revealjs:
    ...
    simplemenu:
      activeclass: "active"

Option 4: activeelement

The activeelement option sets the item that gets the active class.

format:
  revealjs:
    ...
    simplemenu:
      activeelement: "li"

You may want to change it to the anchor inside the li, like this: activeelement: "a".

Option 5: barhtml

You can add the HTML for the header (and/or footer) through this option. This way you no longer need to edit the template.

format:
  revealjs:
    ...
    simplemenu:
      barhtml:
        header: "<div class='menubar'><ul class='menu'></ul></div>"
        footer: ""

Option 5: barhtml (Continued)

You can also move the slide number or the controls to your header or footer. If they are nested there manually, or through the barhtml option, they will then display inside that header or footer.

format:
  revealjs:
    ...
    simplemenu:
      barhtml:
        header: ""
        footer: "<div class='menubar'><ul class='menu'></ul><div class='slide-number'></div></div>"

Option 6: flat

Sometimes you’ll want to limit your presentation to horizontal slides only. To still use ‘chapters’, you can use the flat option. By default, it is set to false, but you can set it to true. Then, when a data-name is set for a slide, any following slides will keep that menu name.

format:
  revealjs:
    ...
    simplemenu:
      flat: true

To stop inheriting the previous slide menu name, start a new named section, or add data-sm="false" to your slide.

Option 7: scale

When you have a lot of subjects/chapters in your menubar, they might not all fit in a row. You can then tweak the scale in the options. Simplemenu copies the Reveal.js (slide) scaling and adds a scale option on top of that.

format:
  revealjs:
    ...
    simplemenu:
      scale: 0.67

Here it is set to about two-thirds of the main scaling.

More demos

For more demos go to the Simplemenu plugin for Reveal.js. Not all of the options in the regular plugin are available in the Quarto plugin.