
Unveiling GitHub Secrets: A Dive into Code Leak Detection

October 22, 2023

7 mins read

Carlos Polop

Cloud Pentesting Team Leader

In today's digital world, GitHub plays a crucial role in collaborative coding. It's a treasure trove of innovation but with a dark side. Amidst the code, sensitive bits like API keys, passwords and secret tokens can slip through, posing a real risk. Hundreds of thousands of GitHub repositories have experienced leaks of this nature. Acknowledging this, GitHub now has mechanisms to auto-block and notify users about such sensitive commits.

Although this helps prevent future leaks, it does not yet cover every possible secret type, even though it already supports a few hundred. Thousands of old leaks also remain in existing repositories.

This is where tools like Trufflehog, Gitleaks, and RExpository come in. They are designed to scan code and alert on potential leaks.

Trickest Workflows are just perfect for benchmark tests to compare these three tools and determine the best option for finding leaks in GitHub repositories. Let's dive in.

Technical comparison of the tools

Trufflehog

Trufflehog CLI is a tool built in Go that scans code to identify and alert on potential leaks. It can scan GitHub repositories, GitLab, AWS, and other platforms, as well as local files. It has a very simple command line interface and can be easily integrated into CI/CD pipelines. A useful feature of this tool is its ability to test whether specific leaks are valid (that is, whether the key still works).
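
As a rough illustration of that command line interface, the sketch below builds a Trufflehog invocation for a single Git repository. It assumes Trufflehog v3, where the `git` subcommand and the `--json` and `--only-verified` flags are available; the repository URL and the helper name `trufflehog_command` are hypothetical, and `trufflehog --help` remains the authority on the flags of an installed version.

```python
import shlex


def trufflehog_command(repo_url: str, only_verified: bool = False) -> list[str]:
    """Build an argv list for scanning one Git repository with Trufflehog v3.

    Assumption: the installed version accepts `git <url>`, `--json`
    and `--only-verified`.
    """
    cmd = ["trufflehog", "git", repo_url, "--json"]
    if only_verified:
        # Report only the secrets that Trufflehog managed to validate.
        cmd.append("--only-verified")
    return cmd


if __name__ == "__main__":
    # Hypothetical repository URL, used only to show the resulting command.
    cmd = trufflehog_command(
        "https://github.com/example-org/example-repo", only_verified=True
    )
    print(shlex.join(cmd))
```

The argv list can be passed to `subprocess.run` on a machine where the binary is installed.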

Check out a list of all the leaks that this tool is capable of detecting.

Gitleaks

Gitleaks CLI is also built in Go and expects a folder containing a git repository as its argument. It has a very simple command line interface and can be easily integrated into CI/CD pipelines. What's great about this tool is that it not only uses regexes to find specific leaks, but also has broader regexes to detect any kind of leak.
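
The difference between a specific regex and a broader one can be sketched as follows. The two patterns are illustrative assumptions and are not taken from Gitleaks' rule set: one targets the well-known AWS access key ID prefix, the other matches any long quoted value assigned to a suspicious-looking name.

```python
import re

# Illustrative patterns only; these are NOT Gitleaks' actual rules.
SPECIFIC = {
    # AWS access key IDs start with a fixed prefix followed by 16 characters.
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
}

# Broad pattern: a keyword, an assignment, then a quoted value of 16+ chars.
GENERIC = re.compile(
    r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]([^'\"\s]{16,})['\"]"
)


def scan_line(line: str) -> list[tuple[str, str]]:
    """Return (rule name, matched secret) pairs found in one line of text."""
    hits = [
        (name, match.group(0))
        for name, rx in SPECIFIC.items()
        for match in rx.finditer(line)
    ]
    hits += [("generic-secret", match.group(1)) for match in GENERIC.finditer(line)]
    return hits
```

A broad pattern like `GENERIC` catches secrets no specific rule knows about, at the cost of more false positives.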

You can find the complete list of leaks here.

RExpository

RExpository isn't actually a tool meant to discover leaks in repositories, but a public database of regexes for tokens, keys, and other sensitive information that anyone can integrate into their own tools. However, it does include a Go tool that can scan a folder or a repository and find leaks using its regexes.

There is a webpage where the community can share ordered regexes that everyone can use, and where all the regexes can be found.

Checking Bug Bounty-related GitHub repositories

The repos in this benchmark test are related to bug bounty programs. Since no ready-made list of such GitHub repos exists, we built workflows in Trickest to produce one.

Data from previous research on GitHub users and repositories was matched with information from Trickest’s public GitHub repo on Bug Bounty programs.

The CSV file was searched with the grep command to match user/organization info with bug bounty program domains. The first 15,000 repos were selected and the unavailable ones were removed, leaving a total of 13,903 repos.

The grep was needed since the CSV had almost 27 million repos linked to bug bounty programs.
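
The matching step behaves roughly like a fixed-string grep (`grep -F -f domains.txt`), which can be sketched in a few lines. The CSV column layout in the usage example is a hypothetical assumption, since the article does not show the file's format.

```python
def grep_fixed(lines: list[str], needles: list[str]) -> list[str]:
    """Keep the lines containing at least one needle as a plain substring.

    This mirrors what `grep -F -f needles.txt` does: no regex, no column
    awareness, just substring matching.
    """
    needles = [n for n in needles if n]  # an empty needle would match every line
    return [line for line in lines if any(n in line for n in needles)]


if __name__ == "__main__":
    # Hypothetical rows: owner, repository URL, website of the owner.
    rows = [
        "acme,https://github.com/acme/app,acme.com",
        "other,https://github.com/other/tool,other.net",
    ]
    print(grep_fixed(rows, ["acme.com"]))
```

Substring matching is fast but loose: a domain such as `acme.com` also matches `notacme.com`, so some matched repos may not belong to the program.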

The first workflow downloaded the CSV from the previous work and matched it with the bug bounty program domains using grep.

Screenshot of Github leaks - Get repos Workflow run

In the initial workflow, the first node downloads the bug bounty (BB) scope info and extracts only the domains via a custom shell script. The subsequent three nodes parallelize the process, handling chunks of 100 domains each.

The bottom node downloads the CSV from prior research. Where both paths meet, there's parallel grepping of the entire CSV against all the domains.

The concluding node employs a raw cat command to merge all the grepped CSVs.
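
The split, parallel grep, and merge pattern described above can be sketched as below. This is a minimal local approximation that uses threads where the workflow uses separate machines; the function names and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def match_chunk(rows: list[str], domains: list[str]) -> list[str]:
    """Grep-like pass: keep the rows containing any domain of this chunk."""
    return [row for row in rows if any(d in row for d in domains)]


def parallel_grep(rows, domains, chunk_size=100, workers=3):
    """Match every chunk of domains against all rows, then merge the results."""
    chunks = chunked(domains, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda chunk: match_chunk(rows, chunk), chunks))
    # Plain concatenation, like `cat`: a row matched by domains that sit in
    # two different chunks appears twice.
    return [row for part in parts for row in part]
```

Because the merge is a plain concatenation, a deduplication pass (for example `sort -u`) may be needed afterwards.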

This workflow executed a grep on over 45 million GitHub user/organization records against more than 5 thousand bug bounty-related domains, taking nearly 17 hours with parallel processing on 10 medium machines.

A second workflow was crafted to retain only the initial 15 thousand repositories, filtering out the ones that are no longer available.

Screenshot of Clean GitHub repos Workflow Run

The screenshot above shows the workflow beginning by fetching the output from the preceding workflow and then dividing it into chunks of 1000 repositories. These chunks are then processed in parallel on 10 medium machines using the tool httpx to filter out unavailable repositories.

Executing this workflow took under 4 minutes with 5 medium machines running in parallel.
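
The availability filter can be approximated with the standard library. This sketch is not how httpx works internally; it only assumes that a repository counts as available when its URL answers with HTTP 200. The `status_of` parameter is a hypothetical hook that makes the filter testable without network access.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for a HEAD request, or None on network errors."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a deleted or private repository
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...


def filter_available(urls, status_of=http_status):
    """Keep only the URLs that answer with HTTP 200."""
    return [url for url in urls if status_of(url) == 200]
```

Running `filter_available` over each chunk of repository URLs yields the cleaned list.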

Benchmark Workflow

With a curated list of 13,903 repositories associated with bug bounty programs, the subsequent step entailed constructing a Trickest workflow to benchmark the three tools:

Screenshot of the Workflow for all Launch Trufflehog Verified, Trufflehog, Gitleaks, and Rex

This workflow is initiated by downloading the repository list from the prior workflow, and segmenting it into chunks of 100 repositories, followed by parallel processing. To launch the three distinct tools, a wrapper named Leakos was used. Leakos is adept at deploying these three tools on a selected GitHub repository, simplifying the task of uncovering as many leaks as possible. Additionally, it provides the option to operate any one of the tools individually, a feature utilized for this benchmark.

Furthermore, Leakos standardizes the output from each tool, facilitating a straightforward comparison among them.
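
What standardizing the output might look like is sketched below. This is not Leakos's actual schema, which the article does not show. The input field names (`DetectorName`, `Raw`, `Verified` for Trufflehog; `RuleID`, `Secret` for Gitleaks) are assumptions about each tool's JSON output and should be checked against the installed versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical common record used to compare tools side by side."""
    tool: str
    kind: str
    secret: str
    verified: bool = False


def from_trufflehog(record: dict) -> Finding:
    # Assumed Trufflehog JSON keys: DetectorName, Raw, Verified.
    return Finding(
        tool="trufflehog",
        kind=record.get("DetectorName", "unknown"),
        secret=record.get("Raw", ""),
        verified=bool(record.get("Verified", False)),
    )


def from_gitleaks(record: dict) -> Finding:
    # Assumed Gitleaks JSON keys: RuleID, Secret. Gitleaks does not verify.
    return Finding(
        tool="gitleaks",
        kind=record.get("RuleID", "unknown"),
        secret=record.get("Secret", ""),
    )
```

With every result in the same shape, counting and comparing findings across tools becomes a matter of ordinary list and set operations.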

Trufflehog Verified

The initial tool deployed was Trufflehog, set to display only verified leaks, implying the tool will attempt to validate the discovered leaks. The feature is useful in determining the current validity of identified leaks. However, some leaks may be difficult to verify, such as when usernames and passwords are specified on separate lines of code, which can result in false negatives.

Examining the parameters of Leakos, it's evident that instructions are given to bypass the use of gitleaks or rexpository, and to operate Trufflehog with the only-verified option:

Screenshot of Leakos node with inputs of Launch Trufflehog Verified workflow

This workflow ran on 5 medium machines in parallel, took 11 hours, and found 34 valid leaks:

11 Infura
3 OpenWeather
3 Flickr
2 GCP
2 Mailgun
1 AWS key
1 AlgoliaAdminKey
1 Clarifai
1 CoinApi
1 Currencylayer
1 Etherscan
1 Ipify
1 LokaliseToken
1 MongoDB
1 Paymongo
1 Pixabay
1 TrelloApiKey
1 URI

A special mention is warranted for a particular commit that inadvertently leaked two distinct valid Infura keys simultaneously.

Assuming the ratio of valid leaks to repositories remains constant, in the nearly 27 million repos associated with bug bounty endeavors, there could be about 34 × 27,000,000 / 13,903 ≈ 66,000 valid leaks this tool can likely find.
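
The extrapolation is a simple proportion, shown here as a small helper. It relies on the same assumption as the text, namely that the leak rate observed in the sample holds across all repositories; the function name is hypothetical.

```python
def extrapolate(found: int, sampled_repos: int, total_repos: int) -> float:
    """Scale a count found in a sample up to the whole population.

    Assumes the ratio of findings to repositories stays constant.
    """
    if sampled_repos <= 0:
        raise ValueError("sampled_repos must be positive")
    return found * total_repos / sampled_repos


if __name__ == "__main__":
    # 34 verified leaks in 13,903 repos, scaled to roughly 27 million repos.
    estimate = extrapolate(34, 13_903, 27_000_000)
    print(f"{estimate:,.0f} valid leaks (rough estimate)")
```

The same helper applies to any of the per-tool counts in this benchmark.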

Trufflehog

The next tool is Trufflehog, but without the only-verified option. This implies the tool will endeavor to uncover all possible leaks, without attempting to validate their authenticity.

Here's how the Leakos parameters appeared:

Screenshot of Leakos node with inputs of Launch Trufflehog workflow

This configuration is similar to the previous one, except for the removal of the only-verified option.

Executing this workflow on 5 medium machines in parallel took nearly 12 hours, revealing 17,465 leaks.

The top ten leak types were:

9693 PagerDutyApiKey
5010 GitHub
496 EtsyApiKey
394 LDAP
342 JDBC
267 PrivateKey
212 URI
155 MongoDB
98 SQLServer
79 Circle
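
A ranking like the one above can be produced by tallying the leak type of each finding. The sketch assumes the findings have already been reduced to a list of type names; the sample data is made up.

```python
from collections import Counter


def top_leak_types(kinds: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent leak types with their counts."""
    return Counter(kinds).most_common(n)


if __name__ == "__main__":
    # Made-up sample, only to show the shape of the result.
    sample = ["GitHub", "LDAP", "GitHub", "URI", "GitHub", "LDAP"]
    for kind, count in top_leak_types(sample, n=2):
        print(count, kind)
```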

Preserving the earlier assumption, among the close to 27 million repos connected to bug bounty tasks, there could be around 17,465 × 27,000,000 / 13,903 ≈ 34 million leaks this tool might detect.

Gitleaks

Unfortunately, Gitleaks does not verify the authenticity of matched leaks, despite using diverse regex patterns to identify them.

Here's how the Leakos parameters appeared:

Screenshot of Leakos node with inputs of Launch Gitleaks workflow

In this scenario, the parameter not-trufflehog is used, while the parameter not-gitleaks is omitted.

Running this workflow on 5 medium machines in parallel took approximately 5 hours, and found 7,725 leaks.

The top ten leak categories were:

6117 Generic API Key
714 AWS
439 Private Key
172 GCP API key
105 JSON Web Token
74 Stripe Access Token
19 Slack Webhook
14 Algolia API Key
12 LinkedIn Client ID
6 GitHub Personal Access Token

RExpository

The final tool for this benchmark test was RExpository. As previously noted, this tool is designed not for leak detection in repositories, but to furnish a database of regexes for utilization by other tools. Nonetheless, it offers a feature to scan through a folder or a repository to identify leaks using its regexes.

Furthermore, this tool's database annotates each regex regarding its potential to generate excessive false positives, disabling those regexes accordingly. This suggests that the regexes employed by this tool are more inclined towards identifying valid leaks. Additionally, the tool restricts its analysis to the latest 100 commits within a maximum timeframe of 5 years.
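
One way to read that restriction is: discard commits older than 5 years, then keep at most the 100 newest. The sketch below implements this reading; it is an interpretation, not RExpository's code, and the `(sha, date)` tuple layout is a hypothetical assumption.

```python
from datetime import datetime, timedelta, timezone


def commits_in_scope(commits, now, max_commits=100, max_age_years=5):
    """Select the commits a scan would cover.

    `commits` is a list of (sha, commit_datetime) tuples. Commits older than
    the cutoff are dropped first, then the newest `max_commits` are kept.
    """
    cutoff = now - timedelta(days=365 * max_age_years)  # leap days ignored
    recent = [c for c in commits if c[1] >= cutoff]
    recent.sort(key=lambda c: c[1], reverse=True)  # newest first
    return recent[:max_commits]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    history = [(f"sha{i}", now - timedelta(days=i)) for i in range(150)]
    print(len(commits_in_scope(history, now)))
```

Under this reading, a secret committed and later removed more than 100 commits ago is never seen, which helps explain why tools that scan full history report different results.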

Here's how the Leakos parameters appeared:

Screenshot of Leakos node with inputs of Launch REX workflow

Executing this workflow on 5 medium machines in parallel took around 6 hours, revealing 23,226 leaks.

The top ten leak categories were:

7501 Sendbird Access ID
5319 Simple Passwords
4431 Hubspot API Key
2986 sha512
1652 Dropbox API Key
381 Twitter Bearer Token
295 Jenkins Creds
124 Sidekiq Secret
91 Generic API tokens search (A-C)
65 Google API Key

Conclusion

After asking GPT-4 to analyze the overlap among the result files of the four runs, the results were:

Gitleaks and RExpository shared 133 common entries
Gitleaks and Trufflehog Verified shared 15 common entries
Gitleaks and Trufflehog shared 114 common entries
RExpository and Trufflehog Verified shared 2 common entries
RExpository and Trufflehog shared 22 common entries
Trufflehog Verified and Trufflehog shared 32 common entries

The low number of common entries across the tools is quite astonishing. However, this may be attributed to GPT-4's search for exact matches between different files, potentially overlooking some shared findings. When instructed to consider this, GPT-4 formulated a different Python script but exhausted the allocated time during execution.
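
The effect of exact matching on the overlap counts can be illustrated with a small sketch. This is not the script GPT-4 wrote, which the article does not show. It compares sets of secret strings per tool, once verbatim and once after a light normalization (trimming whitespace and surrounding quotes); the normalization rule and the sample data are assumptions.

```python
from itertools import combinations


def normalize(secret: str) -> str:
    """Trim whitespace and surrounding quotes; case is kept, as it matters."""
    return secret.strip().strip("'\"")


def pairwise_overlap(findings: dict, key=lambda s: s) -> dict:
    """Count the entries shared by every pair of tools.

    `findings` maps a tool name to an iterable of secret strings; `key`
    transforms each secret before comparison (identity means exact match).
    """
    keyed = {tool: {key(s) for s in secrets} for tool, secrets in findings.items()}
    return {
        (a, b): len(keyed[a] & keyed[b])
        for a, b in combinations(sorted(keyed), 2)
    }


if __name__ == "__main__":
    # Made-up findings: the same secret reported with different framing.
    results = {
        "gitleaks": ['"abc123"', "tok_1"],
        "rex": ["abc123 ", "tok_1"],
    }
    print(pairwise_overlap(results))                 # exact matching
    print(pairwise_overlap(results, key=normalize))  # normalized matching
```

Tools often capture a different amount of surrounding context for the same secret, so exact comparison tends to undercount shared findings.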

Surprisingly, some findings from the Trufflehog Verified run were not detected by the standard Trufflehog run.

Based on the findings, it is evident that there are still a significant number of leaks that can be discovered in GitHub repositories. As expected, different tools can identify different types of leaks. However, it is important to note that the benchmarked tools are not perfect and may result in false positives or false negatives. Therefore, even when using all three tools, there is still a possibility of missing some leaks or being overwhelmed by false positives.

Sign up on Trickest to run your GitHub research, use other tools, or build your own research on any legally available target you want.

