Uncovering OSINT insights from 15TB of GitHub logs

Dive into our comprehensive analysis of close to 15TB of GitHub logs for open-source intelligence (OSINT), highlighting key findings about users, repositories, and valuable implications for cybersecurity.

Carlos Polop

Cloud Pentesting Team Leader

July 11, 2023
Trickest

As cybersecurity researchers, we are constantly on the lookout for valuable information hiding in everyday data. In our profession, what appears to be trivial data can often open the door to significant insights, and large data repositories like GitHub are no exception. In our recent blog post, we described how we used Trickest workflows to process an enormous dataset of nearly 15TB of GitHub logs. Our mission was to mine the public information available about all the users and repositories that appear in these logs.

GitHub, a platform with more than 56 million developers worldwide, houses an incredible array of public logs. While these logs are a treasure trove for researchers and developers, they can also be exploited for nefarious purposes if not properly understood and handled. As experts in data security and privacy, we understand the implications of such openly available information and believe in the importance of scrutinizing this data to ensure better privacy practices.

In this report, we intend to take you on a deeper exploration of this data. We aim to unearth intriguing patterns, identify potential security risks, and provide insights on how to better protect your data on such platforms. Our team's extensive experience and expertise in data science and cybersecurity have equipped us with the tools to decipher these vast datasets and understand their implications comprehensively.

Recap

We built several workflows to parse and analyze almost 15TB of GitHub logs from GH Archive, covering 2015 to the present. Afterwards, we enriched the data using the GitHub API to extract more information about the users and repositories. Our final result consisted of two CSV files: one with data on over 45 million users and the repositories to which they contributed, and the other with data on over 220 million repositories.
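To make the parsing step concrete, here is a minimal sketch of how one hourly GH Archive dump could be reduced to a user-to-repositories mapping. It is not the Trickest workflow itself. It assumes the dump is a gzip-compressed file of newline-delimited JSON events that carry `actor.login` and `repo.name`, and the function names are our own illustration.

```python
import gzip
import json
from collections import defaultdict


def collect_user_repos(lines):
    """Map each actor login to the set of repositories it touched."""
    user_repos = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt records
        if not isinstance(event, dict):
            continue
        actor = (event.get("actor") or {}).get("login")
        repo = (event.get("repo") or {}).get("name")
        if actor and repo:
            user_repos[actor].add(repo)
    return user_repos


def collect_from_gzip(path):
    """Read one hourly dump (assumed: gzip-compressed JSON lines)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return collect_user_repos(handle)
```

Running `collect_from_gzip` over every hourly file and merging the resulting sets would give the raw user and repository lists that the later API enrichment starts from.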

Here is the Trickest workflow we used for data processing:

Workflow for data processing

To extract valuable insights from these CSVs, we used the gh-investigator tool, available in the Trickest library with 270+ open-source tools.

GitHub User Analysis

We extracted the following information for each user in our analysis:

  • Username
  • Repositories where the user has collaborated (capped at 500 to limit bot users who've collaborated on thousands of repositories)
  • Deleted user status
  • Site administrator status
  • Hireability
  • Availability of public email
  • Information about the user's company
  • Participation in the GitHub Stars program

Although we could extract much more information from the GitHub API, we opted to focus on the most relevant fields to keep the CSV files at a manageable size.
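As an illustration of how these fields could be assembled, the sketch below turns one GitHub REST API user object (the response of `GET /users/{username}`) into a flat record. The output key names and the `build_user_row` helper are assumptions, not the actual CSV schema. The report does not say how GitHub Stars membership was obtained, so that flag is simply passed in.

```python
MAX_REPOS = 500  # cap on collaborated repositories described in the list above


def build_user_row(login, api_user, repos, is_github_star=False):
    """Build one record per user.

    api_user is the decoded JSON of GET /users/{login}, or None when the
    API answered 404 (treated here as a deleted or renamed account).
    """
    profile = api_user or {}
    return {
        "user": login,
        "repos": sorted(repos)[:MAX_REPOS],
        "deleted": api_user is None,
        "site_admin": bool(profile.get("site_admin")),
        "hireable": bool(profile.get("hireable")),  # the API returns true or null
        "email": profile.get("email") or "",
        "company": profile.get("company") or "",
        "github_star": is_github_star,  # assumption: supplied by the caller
    }
```

Keeping only these keys, and truncating the repository list, is what keeps each row small enough for a CSV covering tens of millions of users.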

With the help of gh-investigator, we generated several files, including lists of GitHub Stars users, deleted users, hireable users, site administrators, users with public emails, and users with company information. You can find the download links for these files at the conclusion of this report.

Here is the Trickest workflow we employed to gather details about each identified user:

Workflow details about users

Deleted Users

These refer to user accounts that were active at one point but are no longer available. This unavailability could be a result of the user account being deleted or renamed. Regardless of the reason, it's crucial to note which users are no longer active, as it could pose an impersonation risk. An attacker could create a new user account with the same name, attempting to impersonate the original user.

Please be aware that GitHub Actions allows workflows triggered by a pull_request from an external user to run automatically on that user's subsequent PRs once one of their PRs has been successfully merged. However, if a user account is deleted and a new one is created with the same name, these permissions are NOT preserved. You can find more information about this at GitHub Security.

Our data reveals 4,826,245 deleted usernames. You can access this list in gh-investigator-users_deleted.csv.
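One defensive use of this list is to check whether any account you still trust (for example, names in a CODEOWNERS file or a maintainers list) now appears among the deleted usernames. The sketch below assumes both inputs are plain lists of logins. It compares them case-insensitively, since GitHub usernames are not case-sensitive. The helper name is hypothetical.

```python
def find_reclaimable_names(trusted_logins, deleted_logins):
    """Return trusted logins that also appear in the deleted-users list."""
    deleted = {name.strip().lower() for name in deleted_logins}
    hits = {name for name in trusted_logins if name.strip().lower() in deleted}
    return sorted(hits)
```

Any name returned deserves a manual check: the account may have been renamed, or the name may already have been registered again by someone else.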

Site Administrators

According to GPT-4:

A GitHub site administrator, often referred to as a "site admin," is a user with elevated privileges that allow them to manage all aspects of the GitHub instance. This term is generally used within the context of GitHub Enterprise, which is a self-hosted version of GitHub that an organization can run on its own infrastructure.

The site administrator has the highest level of access and can control all the repositories, organizations, and users in the GitHub Enterprise instance. They can set up and modify accounts, handle security settings, and monitor system health among other tasks.

In GitHub.com (the public version of GitHub), this level of control is held by GitHub's own staff and not by individual users or organizations.

This suggests that the 16 users listed in gh-investigator-users_site_admin.csv, namely sentinel, GreCodes, dvelton, saquib-alam, accessibility-bot, docs-bot, iowannie, willf, NickLiffen, ShorukElHadad, Ellemmenno, davecheney, mcantu, JeffOgah, hubot, and education-web-bot, are highly privileged users.

Interesting observations include:

  • GreCodes is open for employment opportunities.
  • NickLiffen has made his email address public: nickliffen@github.com.
  • hubot appears to be a bot user.
  • accessibility-bot, ShorukElHadad, Ellemmenno, and education-web-bot have not contributed to any repositories.

It has been confirmed that there are more GitHub staff members with site_admin set to True who do not appear in this list. This discrepancy may be due to the rate at which we queried the GitHub API for this data.

Hireable Users

These users are actively seeking employment opportunities. Such information could be beneficial to recruiters or potentially exploited by malicious social engineers who might take advantage of these users' job-seeking status.

We found 1,273,018 hireable users in the data. You can access the list in gh-investigator-users_hireable.csv.

Highly Collaborative Users

To maintain manageable CSV file sizes, we limited the number of repositories in which a user has collaborated to 500. We made this decision based on our discovery that certain bot users have contributed to thousands of repositories.

We identified 814 highly collaborative users (possibly bots) in the data. Here are 20 of them: conda-forge-linter, conda-forge-curator[bot], SimonCropp, SimenB, jjhelmus, fire, lineageos-gerrit, lindseyberlin, sharelatex-ci, sharelatex-github-sync-acceptance-tests, hotman663, graingert, grandroyalcasino, theapplegates, 0xflotus, ThomaCheatham, fossabot, dalskar, SooluThomas, olamy.

Notably, some of these users are also available for hire and even have public emails: SimonCropp, 0xflotus, olamy, echarles, tpgxyz, schollz, tkelman, szepeviktor, danielbachhuber, wizardforcel, ademaro, Klozz, hrbrmstr...

Users with a Public Email or Company Configured

These are users who have made their email or company information public. This data might be of interest to social engineers looking to target a specific company.

We found 2,651,140 users with a configured email in the file gh-investigator-users_email.csv, 3,076,228 users with a configured company in the file gh-investigator-users_company.csv, and 912,592 users who have configured both.

Open Source Intelligence (OSINT) on Companies

Using the collected data, it's possible to pinpoint users who are presently or were formerly associated with a specific company.

For instance, running the command:

cat all_user_info.csv | grep -i trickest | wc -l

allows us to identify all users linked to the company Trickest (31 users in this case). However, we can refine these results to minimize false positives:

  • By seeking users with the term Trickest in their email or company details, we can reduce the number to 8: gligaTrickest, popovicnenad, mhmdiaa, 76creates, trickest-workflows, patman1970, nenadzaric, Banegaaa.
  • By searching for users who have contributed to Trickest company repositories using a simple command like cat all_user_info.csv | grep -i "trickest/", we can identify the following users: popovicnenad, c3l3si4n, PolovinaD, mhmdiaa, mihailotomic, 76creates, trickest-workflows, nenadzaric, kljunowsky.
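The two refinements above can also be expressed as field-aware filters instead of a whole-line grep. The sketch below is one way to do that. It assumes each row is a dict with `user`, `repos`, `email`, and `company` keys and that the repositories in a row are separated by spaces; neither detail is confirmed by the report.

```python
def users_linked_to(rows, keyword):
    """Split matches into profile-based and repository-based hits.

    rows: iterable of dicts with assumed keys user, repos, email, company.
    """
    needle = keyword.lower()
    by_profile, by_repo = [], []
    for row in rows:
        email = (row.get("email") or "").lower()
        company = (row.get("company") or "").lower()
        if needle in email or needle in company:
            by_profile.append(row["user"])
        # Match the owner part exactly, so "nottrickest/x" is not a hit.
        repos = (row.get("repos") or "").lower().split()
        if any(repo.startswith(needle + "/") for repo in repos):
            by_repo.append(row["user"])
    return by_profile, by_repo
```

Rows read with `csv.DictReader` can be passed in directly, provided the file has a header line with these column names. Anchoring the repository match to the owner prefix avoids the substring false positives that a plain `grep -i "trickest/"` can produce.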

GitHub Repository Analysis

We gathered the subsequent details for each repository:

  • Owner/Repository
  • Number of stars the repository has
  • Number of forks the repository has
  • Number of watchers of the repository
  • Repository deletion status
  • Whether the repository is private
  • Whether the repository is archived
  • Whether the repository is disabled

While the GitHub API can provide an array of additional data, we opted to concentrate on these points of interest to keep the CSV files manageable in size.

Our tool, gh-investigator, generated the following files:

  • gh-investigator-repos_sorted_stars.csv: Repositories (with more than 100 stars) sorted by the number of stars.
  • gh-investigator-repos_sorted_forks.csv: Repositories (with more than 50 forks) sorted by the number of forks.
  • gh-investigator-repos_sorted_watchers.csv: Repositories (with more than 20 watchers) sorted by the number of watchers.
  • gh-investigator-repos_deleted.csv: List of all deleted repositories.
  • gh-investigator-repos_private.csv: List of all private repositories.
  • gh-investigator-repos_archived.csv: List of all archived repositories.
  • gh-investigator-repos_disabled.csv: List of all disabled repositories.
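The three sorted files follow the same pattern: keep repositories above a threshold, then order them by that metric. The sketch below shows that pattern under the assumption that each row is a dict whose metric columns are named `stars`, `forks`, and `watchers`. It is not gh-investigator's actual implementation.

```python
def top_repos(rows, field, minimum):
    """Keep rows whose metric exceeds `minimum`, largest value first."""
    kept = [row for row in rows if int(row[field]) > minimum]
    return sorted(kept, key=lambda row: int(row[field]), reverse=True)
```

With the thresholds listed above, the calls would be `top_repos(rows, "stars", 100)`, `top_repos(rows, "forks", 50)`, and `top_repos(rows, "watchers", 20)`.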

You can find the download links for these files at the conclusion of this report.

Here is the Trickest workflow we employed to obtain details about each identified repository:

Workflow about repositories

Duplicate Repository Handling

The same repository might appear under different names due to reasons such as repository renaming or differences in case sensitivity between the secret, the user, and the generated CSV. After running a Python script to remove such duplicate repositories, the repository count was pared down from approximately 222 million to slightly over 220 million.
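The script itself is not reproduced in this report. A minimal sketch of the case-insensitive part of such a deduplication could look like the following. It assumes the input is a list of `owner/repo` strings and keeps the first spelling it sees.

```python
def dedupe_repos(full_names):
    """Drop names that differ only by letter case or surrounding whitespace."""
    seen = set()
    unique = []
    for name in full_names:
        cleaned = name.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Renamed repositories cannot be detected from the names alone, so this sketch does not cover them; resolving renames would need extra information, such as the canonical name returned by the GitHub API.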

Moreover, among these, almost half a million repositories have been deleted, while approximately a million and a half have been archived. This indicates that roughly 218 million repositories remain active currently.

Discovery of Leaked Secrets

It is a well-known issue that secrets (passwords, API keys, etc.) can be inadvertently leaked on GitHub. However, it was surprising to find that secrets had been included in unexpected places like repository names, such as this example: https://github.com/IHATELAGHong/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. There were also instances where secrets were found in company info like: yasir-javed-58,yasir-javed-58/app1,0,0,0,,ghp_1G8lb1hisLILUH9bJpgg mEiz1meMn42ADwPg,0.

In total, we uncovered 50 working GitHub tokens and 4 consumed OpenAI API Keys.
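Token-shaped strings like the ones above are easy to spot with a pattern match. The sketch below looks for classic GitHub personal access tokens, which start with `ghp_` followed by 36 alphanumeric characters. It is an illustration, not the exact search we ran. A match only means the string has the right shape, not that the token is valid.

```python
import re

# Classic personal access token: "ghp_" plus 36 alphanumeric characters,
# not embedded in a longer alphanumeric run.
GITHUB_PAT = re.compile(r"(?<![A-Za-z0-9])ghp_[A-Za-z0-9]{36}(?![A-Za-z0-9])")


def find_github_tokens(text):
    """Return every token-shaped substring found in text."""
    return GITHUB_PAT.findall(text)
```

Applying `find_github_tokens` to every field of both CSVs, including repository names and company details, would surface candidates for manual review.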

When it comes to uncovering leaks in unexpected places, we have room to broaden our search parameters. For example, we could modify the Trickest workflows we ran to include issues and repository comments, thereby enabling a more extensive search for possible leaks.

Top Starred Repositories & Users

The 10 repositories with the highest star counts (at the time of our research) are:

| Ranking 🏆 | GitRepo | Stars ⭐ |
|---|---|---|
| 1 | freecodecamp/freecodecamp | 368,107 |
| 2 | ebookfoundation/free-programming-books | 281,193 |
| 3 | 996icu/996.icu | 265,975 |
| 4 | jwasham/coding-interview-university | 258,516 |
| 5 | sindresorhus/awesome | 256,262 |
| 6 | public-apis/public-apis | 241,602 |
| 7 | kamranahmedse/developer-roadmap | 240,843 |
| 8 | donnemartin/system-design-primer | 221,307 |
| 9 | facebook/react | 208,321 |
| 10 | vuejs/vue | 203,866 |

Tools such as PEASS-ng and Hacktricks are in positions 3080 and 7080 respectively.

The top 10 users with the most stars across repositories (at the time of our research) are:

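| Ranking 🏆 | User | Stars ⭐ |
|---|---|---|
| 1 | microsoft | 2,064,346 |
| 2 | google | 1,629,138 |
| 3 | facebook | 1,036,020 |
| 4 | apache | 893,299 |
| 5 | sindresorhus | 787,971 |
| 6 | alibaba | 768,376 |
| 7 | vuejs | 629,758 |
| 8 | facebookresearch | 534,802 |
| 9 | airbnb | 511,260 |
| 10 | github | 507,995 |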
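Per-user totals like these can be derived from the repository data by summing a metric over every repository that shares an owner. The sketch below shows one way to do it, assuming each row is a dict with a `repo` key in `owner/name` form and numeric metric columns. The column names are our assumption.

```python
from collections import Counter


def totals_by_owner(rows, field):
    """Sum a metric (e.g. stars) over all repositories of each owner."""
    totals = Counter()
    for row in rows:
        owner = row["repo"].split("/", 1)[0].lower()
        totals[owner] += int(row[field])
    return totals
```

Calling `totals_by_owner(rows, "stars").most_common(10)` would give a top-10 ranking; the same function works for forks and watchers by changing the field name.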

Most Forked Repositories & Users

The top 10 repositories with the most forks (at the time of our research) are:

| Ranking 🏆 | GitRepo | Forks 🔱 |
|---|---|---|
| 1 | jtleek/datasharing | 242,064 |
| 2 | rdpeng/programmingassignment2 | 141,642 |
| 3 | octocat/spoon-knife | 133,886 |
| 4 | smartthingscommunity/smartthingspublic | 89,134 |
| 5 | github/gitignore | 80,632 |
| 6 | pierian-data/complete-python-3-bootcamp | 78,312 |
| 7 | twbs/bootstrap | 75,833 |
| 8 | tensorflow/tensorflow | 71,431 |
| 9 | nightscout/cgm-remote-monitor | 68,563 |
| 10 | jwasham/coding-interview-university | 62,440 |

The top 10 users with the most forks across repositories (at the time of our research) are:

| Ranking 🏆 | User | Forks 🔱 |
|---|---|---|
| 1 | learn-co-students | 2,379,786 |
| 2 | learn-co-curriculum | 1,266,966 |
| 3 | lambdaschool | 556,978 |
| 4 | microsoft | 438,847 |
| 5 | apache | 403,773 |
| 6 | bloominstituteoftechnology | 373,029 |
| 7 | google | 292,006 |
| 8 | jtleek | 245,331 |
| 9 | rdpeng | 243,450 |
| 10 | mate-academy | 225,070 |

Most Watched Repositories & Users

The top 10 repositories with the most watchers (at the time of our research) are:

| Ranking 🏆 | GitRepo | 👀 Watchers |
|---|---|---|
| 1 | vhf/free-programming-books | 9,674 |
| 2 | ebookfoundation/free-programming-books | 9,673 |
| 3 | jwasham/coding-interview-university | 8,602 |
| 4 | freecodecamp/freecodecamp | 8,440 |
| 5 | torvalds/linux | 8,167 |
| 6 | tensorflow/tensorflow | 7,710 |
| 7 | sindresorhus/awesome | 7,512 |
| 8 | kamranahmedse/developer-roadmap | 6,858 |
| 9 | twbs/bootstrap | 6,811 |
| 10 | codehubapp/codehub | 6,662 |

The top 10 users with the most watchers across repositories (at the time of our research) are:

| Ranking 🏆 | User | 👀 Watchers |
|---|---|---|
| 1 | learn-co-students | 5,507,591 |
| 2 | devexpress-examples | 520,709 |
| 3 | openmandrivaassociation | 306,367 |
| 4 | learn-co-curriculum | 229,677 |
| 5 | microsoft | 216,571 |
| 6 | textcreationpartnership | 215,542 |
| 7 | gitenberg | 181,827 |
| 8 | jenkinsci | 135,572 |
| 9 | conda-forge | 130,212 |
| 10 | uber | 128,394 |

Conclusion

The sheer volume of accessible information pertaining to GitHub users and repositories remains astonishing. Such data can be harnessed for a myriad of purposes, both benevolent and malicious. However, the most critical takeaway from our exploration should be the heightened awareness of the transparency of our digital footprints. As individuals and organizations, it's crucial to recognize that our information is readily available and can be scraped for sensitive content. Vigilance and proactive measures in protecting our data should be prioritized in our increasingly digitized world.

If you want to build workflows similar to these, or use pre-built workflows for Attack Surface Management, Vulnerability Scanning, Threat Intelligence, Content Discovery or other common use cases, fill out the form on our website to get access to Trickest.

You can access all the research data directly in our GitHub repository.
