🔒🤖 The Next Step in GitGuardian’s Approach to NHI Security

DISCOVER

the state of

SECRETS SPRAWL

Download the Report

Fact 1:

Secrets Detection in 2023

12,778,599

NEW secrets detected

in public GitHub commits in 2023

Data analysis by GitGuardian


Secrets sprawl x4 in 4 years

New secrets detected on GitHub (millions)

It is not a secret. Hard-coded credentials have long been a primary cause of security incidents in the software world.

Yet, with the growing complexity of digital supply chains, secrets sprawl is the Achilles’ heel for organizations of all sizes and security postures.

Unique secrets detected

3,698,686

“49% of breaches by external actors involved Use of stolen credentials”

Verizon's 2023 Data Breach Investigations Report

“[In 2023] for the first time, compromised credentials took the top spot in root causes [of attacks]. In the first six months, compromised credentials accounted for 50% of root causes, whereas exploiting a vulnerability came in at 23%.”


Sophos’ 2023 Active Adversary Report

Fact 2:

How leaky was 2023?

In 2023, GitGuardian scanned 1.1B commits (+10.6%), of which 8M exposed at least one secret (+30.3%).
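The ratio implied by these two figures can be checked with a quick calculation. This is only a sanity check on the rounded numbers quoted above; the function name is an illustrative assumption, not part of any GitGuardian tooling.

```python
# Rounded figures as published in the paragraph above.
commits_scanned = 1.1e9      # 1.1B commits scanned in 2023
commits_with_secret = 8e6    # 8M commits exposing at least one secret


def leaks_per_thousand(leaky: float, total: float) -> float:
    """Return how many commits per 1,000 exposed at least one secret."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1000 * leaky / total


rate = leaks_per_thousand(commits_with_secret, commits_scanned)
print(f"{rate:.1f} leaky commits per 1,000")  # prints "7.3 leaky commits per 1,000"
```

With the rounded inputs the result is roughly 7 leaky commits per 1,000, or a little under 1% of all scanned commits.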

1.8 million pro-bono alert emails

More than 1 in 10 commit authors leaked a secret

7 out of 1K commits exposed at least one secret

3 million repositories leaked a secret

Map of secrets leaks

Based on the GitHub Location of the commit author, which corresponds to 3 million occurrences of secrets in the dataset.

GitHub growth in 2023:

The 10 countries with the most leaks

For GitHub profiles mentioning location.

01 India
02 United States
03 Brazil
04 China
05 France
06 Canada
07 Vietnam
08 Indonesia
09 South Korea
10 Germany

Top 10 Valid Specific Detectors in 2023

In 2023 alone, over 1 million valid occurrences of Google API secrets, 250,000 Google Cloud secrets, and 140,000 AWS secrets were detected.

The growing number of code repositories on GitHub, with 50 million new repositories added in the past year (+22%), increases the risk of both accidental and deliberate exposure of sensitive information.

Leaks per Industry

While the IT sector, which includes software vendors, is the most affected industry, with 65.9% of all detected leaks, other industries are also impacted. These include Education, Science & Tech, Retail, Manufacturing, and Finance & Insurance, which account for 20.1%, 7%, 1.5%, 1.2%, and 1% of leaks, respectively.

This highlights the need for increased vigilance and proactive measures to protect sensitive information across all industries as the risks associated with secret sprawl continue to grow.

Leaks by Industry

"Other" distribution

GenAI Secrets Leaks

In 2023, GitGuardian observed a 1212x increase in the number of OpenAI API key leaks over the previous year, unsurprisingly making them the top-ranked detector. While OpenAI leads by a wide margin in the number of leaks, more and more tokens used to access HuggingFace open-source models have been seen on GitHub month after month, hinting at a growing interest in open-source AI among developers.

The Slow but Steady Rise of Open Source AI

Our analysis also highlights the rapid penetration of services like Gemini (Google's ChatGPT alternative, formerly known as Bard and introduced at the end of March), Pinecone (a vector database service), Replicate (AI models-as-a-service), and to a lesser extent, Claude (an AI assistant by Anthropic), Cohere, and Clarifai.

By assessing the spread of secret leaks, it is easy to discern the rising use of these AI services.

AI Services Adoption (measured through leaks per month)

If you're a software provider, we encourage you to reach out for the development of bespoke detectors tailored to your service's specific needs: Request a custom detector from GitGuardian for your service

Fact 3:

Zombie Leaks

When someone exposes a secret on public GitHub, they should consider it compromised. The author must revoke the secret quickly to reduce the impact of the incident. In 2023, GitGuardian monitored how well authors fixed leaks. The tracking started when the first valid occurrence of a secret was detected and ended five days later.

The common trend is troubling:

More than 90% of the secrets remain valid 5 days after being leaked

Validity Rate Over Time

This curve displays how secret validity evolves over time after detection. The scope is restricted to secrets whose first occurrence was found valid, which amounts to 644,947 unique secrets detected in 2023 (not all secrets can be checked for validity). For each one, GitGuardian’s pro-bono alerting system emailed the commit author.

Only 2.6% of the leaks were revoked within 1 hour of notification via email.
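A validity curve of this kind can be approximated from two timestamps per secret: when it was first detected as valid and when (if ever) it was observed to be revoked. The sketch below is a minimal illustration under that assumption; the function name, data layout, and sample values are hypothetical and do not come from GitGuardian's dataset or tooling.

```python
from datetime import datetime, timedelta
from typing import Optional


def validity_rate(
    detections: list[tuple[datetime, Optional[datetime]]],
    elapsed: timedelta,
) -> float:
    """Share of secrets still valid `elapsed` after first detection.

    Each item is (detected_at, revoked_at); revoked_at is None when the
    secret was never observed to be revoked during the tracking window.
    """
    if not detections:
        raise ValueError("no detections to evaluate")
    still_valid = sum(
        1
        for detected_at, revoked_at in detections
        if revoked_at is None or revoked_at - detected_at > elapsed
    )
    return still_valid / len(detections)


# Hypothetical sample, not GitGuardian data.
t0 = datetime(2023, 6, 1, 12, 0)
sample = [
    (t0, t0 + timedelta(minutes=30)),  # revoked within the hour
    (t0, t0 + timedelta(days=2)),      # revoked after two days
    (t0, None),                        # never revoked
    (t0, None),                        # never revoked
]

for label, delta in [("1 hour", timedelta(hours=1)), ("5 days", timedelta(days=5))]:
    print(f"still valid after {label}: {validity_rate(sample, delta):.0%}")
```

Evaluating the same function at several elapsed times (1 hour, 1 day, 5 days) gives the points of a validity-over-time curve.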

Not all types of secrets are fixed (revoked) at the same rate

This analysis reveals that leaked WeChat App and Algolia keys are the most likely to remain exposed for over 5 days. Conversely, developers are more concerned about the risks of leaking Stripe or Cloudflare API keys, as these would be prime targets in credential-harvesting campaigns.

“Developers erasing leaky commits or repositories instead of revoking are creating a major security risk for companies, which will remain vulnerable to threat actors mirroring public GitHub activity for as long as the credential remains valid. These zombie leaks are the worst,” said Eric Fourrier, CEO and Founder of GitGuardian.

Secrets still valid 5 days after their detection


Fact: Only 24% of Riot Games keys were still active after five days vs. 95.5% for Algolia! Could gamers be a secret weapon against secrets sprawl?

Zombie Leaks: a Hidden Threat

Is the Leaky Repository Still Accessible?

These findings are crucial for grasping the full scope of the secrets sprawl issue. While most security initiatives focus on detecting leaks, the bottleneck lies in improving the security posture. Simply alerting developers falls short; what's truly essential is providing them with the necessary guidance and support to rectify their mistakes effectively.

A common response to a leak by repository owners is to delete the repository or make it private, cutting off public access to the leaked information. However, this approach can lead to one of the riskiest scenarios for an organization: a "zombie leak".

To assess the prevalence of zombie leaks, the study selected a random sample of 5,000 erased commits that had exposed a secret. Of the repositories that hosted these commits, only 28.2% were still accessible at the time of the study. This indicates that the remaining repositories were likely deleted or made private in response to the leak, suggesting that the prevalence of zombie leaks may be underestimated.

DMCA Takedown Notices: a Last Resort to Stop Leaks?

Given that leaks frequently occur outside an organization's control, often in personal GitHub accounts, DMCA notices are mainly employed to manage such external repositories. Data points to an increasing use of DMCA notices as a last-ditch effort to remove repositories that inadvertently expose secrets.

DMCA takedown notices are a process for any copyright owner in the U.S. to demand the removal of content that infringes on their rights. As a “safe harbor”, GitHub must process DMCA requests when infringing code is posted on the platform.


DMCA Takedowns

How Toyota Customer Data was Compromised with a Credential Exposed for 5 Years

Read GitGuardian Breach Explainer

Fact 4:

How Good Can LLMs Be at Detecting Secrets?

The year 2023 marked the breakthrough of Generative AI, significantly impacting various professional fields. Developers, as we have seen, are at the forefront of this new wave, and there is no doubt that this powerful technology, in the hands of both good and bad actors, will have an outsized impact on cybersecurity. Here is an in-depth look at GitGuardian's AI-driven approach to enhancing the detection and management of sensitive information.

Categorize Generic Secrets

Generic secrets, not associated with specific services, present unique challenges in secrets detection: the lack of contextual information and validity checkers makes it difficult to offer incident status visibility or context-tailored remediation guidelines. GitGuardian is deeply committed to advancing this area, leveraging AI to enhance the contextualization of leaks.
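High-entropy detection is commonly built on Shannon entropy, which measures how unpredictable the characters of a string are. The sketch below shows that general idea only. The source does not describe how GitGuardian's generic detectors work, and the threshold and minimum length used here are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_high_entropy(
    candidate: str,
    threshold: float = 3.5,   # assumed cut-off, tune on real data
    min_length: int = 16,     # assumed minimum length for a candidate
) -> bool:
    """Flag strings that are long enough and unpredictable enough."""
    return len(candidate) >= min_length and shannon_entropy(candidate) >= threshold


# Hypothetical candidates: a repetitive password-like string and a random-looking token.
for candidate in ["password_password", "q7Zp2LxV9tRb4KmW8sNc"]:
    print(candidate, looks_high_entropy(candidate))
```

Entropy alone produces many false positives (hashes, UUIDs, encoded data), which is why context and scoring matter for generic secrets.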

Top 10 Generic Secrets Detectors in 2023

Generic Passwords and High Entropy Secrets

Improving Precision and Recall for Generic Secrets

To enhance application security through machine learning, GitGuardian's ML team is developing a model to accurately score the likelihood of a genuine generic secret versus a false positive. This model is particularly effective, as test results show distinct score distributions for true and false positives:

Real secrets and False Positives

Such clear differentiation allows for setting a threshold that significantly reduces false positives without substantially impacting the detection of true positives, showcasing the model's capability to refine secrets detection. The reduction of false positives, in turn, reduces wasted response efforts by our customers.
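The effect of a score threshold can be illustrated with precision and recall computed over labelled scores. This is a minimal sketch with made-up scores; the function and the numbers are hypothetical and say nothing about GitGuardian's actual model or its results.

```python
def precision_recall(
    true_scores: list[float],
    false_scores: list[float],
    threshold: float,
) -> tuple[float, float]:
    """Precision and recall when candidates scoring >= threshold are kept.

    true_scores: model scores for genuine secrets.
    false_scores: model scores for false positives.
    """
    tp = sum(score >= threshold for score in true_scores)
    fp = sum(score >= threshold for score in false_scores)
    fn = len(true_scores) - tp
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scores, chosen so the two groups are mostly separated.
genuine = [0.95, 0.90, 0.85, 0.80, 0.40]
noise = [0.05, 0.10, 0.20, 0.30, 0.70]

for threshold in (0.0, 0.5):
    p, r = precision_recall(genuine, noise, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy data, moving the threshold from 0.0 to 0.5 raises precision from 0.50 to 0.80 while recall drops from 1.00 to 0.80. The better separated the two score distributions are, the smaller that recall cost becomes.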

Unveiling Secret Exposures

3.11% of the secrets leaked in private repositories were also exposed in public repositories.

To test the hypothesis that secrets leaked in private repositories are also leaked on public GitHub, GitGuardian studied a sample of 403,571 secrets leaked in private repositories, querying HasMySecretLeaked to check whether they had also been leaked on public GitHub.
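An overlap check of this kind can be sketched by comparing fingerprints rather than raw secrets, so the secrets themselves never leave the machine doing the check. The source does not describe the query protocol used by HasMySecretLeaked; the hashing scheme, function names, and sample tokens below are assumptions for illustration only.

```python
import hashlib


def fingerprint(secret: str) -> str:
    """SHA-256 hex digest used as a stand-in for the raw secret."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def overlap_rate(private_secrets: list[str], public_fingerprints: set[str]) -> float:
    """Share of private secrets whose fingerprint also appears publicly."""
    if not private_secrets:
        raise ValueError("no secrets to check")
    hits = sum(fingerprint(s) in public_fingerprints for s in private_secrets)
    return hits / len(private_secrets)


# Hypothetical data: an index of publicly leaked fingerprints and a private sample.
public_index = {fingerprint(s) for s in ["tok_public_1", "tok_shared"]}
private_sample = ["tok_shared", "tok_private_1", "tok_private_2", "tok_private_3"]

print(f"{overlap_rate(private_sample, public_index):.2%}")  # prints "25.00%"
```

Applied to the study's figures, a 3.11% overlap over 403,571 secrets corresponds to roughly 12,550 secrets found in both private and public repositories.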

This fact hints at a well-known saying: “Security through obscurity is no security at all.” Applied to our case, it dismantles the idea that relying on the privacy of source code as a security layer is a valid strategy. 

These “private yet public” leaks have been publicly exposed 3.48 times on average, and 99% were found in source code files (less than 1% in GitHub issues, Pull Request descriptions, or GitHub Gists).

Secrets Sprawl in PyPI

Secrets sprawl affects more than code repositories. This year, GitGuardian expanded its investigation into the pervasiveness of leaked secrets within PyPI. The Python Package Index, better known as PyPI, is the official third-party package repository for the Python community. The central repository boasts over 500K hosted projects, 10M files, and over 31 billion monthly downloads.

In 2023, 11,054 unique secrets were exposed in package releases. Approximately 10,000 of those secrets had been there since before 2023, and over 1,000 had been introduced that year.

💡 A "release" on PyPI is a specific project version. For example, the requests project has many releases, like "requests 2.10" and "requests 1.2.1". A release may consist of one or multiple files.

Number of secrets and packages per year of publication

Adding up all the secrets shared across all the releases (5 million), GitGuardian found 56,866 occurrences of secrets, indicating the same secret is often found in multiple releases of the same package. The reason behind this is simple: many package maintainers don’t realize a secret is shipped with the code library at every release!
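The gap between unique secrets and occurrences can be illustrated by counting findings per release. The sketch below uses hypothetical package names and keys, and assumes findings are already available as a mapping from release to detected secrets; it does not reflect GitGuardian's scanning pipeline.

```python
from collections import Counter


def summarize(findings: dict[str, list[str]]) -> tuple[int, int]:
    """Return (unique_secrets, total_occurrences) across all releases."""
    counts = Counter(
        secret for secrets in findings.values() for secret in secrets
    )
    return len(counts), sum(counts.values())


# Hypothetical findings: key-A ships again with every new release.
findings = {
    "example-pkg 1.0": ["key-A"],
    "example-pkg 1.1": ["key-A"],
    "example-pkg 1.2": ["key-A", "key-B"],
}

unique, occurrences = summarize(findings)
print(unique, occurrences)  # prints "2 4"

# Ratio implied by the published figures, assuming both describe the same study.
print(f"{56_866 / 11_054:.1f} occurrences per unique secret")  # prints "5.1 ..."
```

On the published figures, each unique secret appears about five times on average, which is consistent with secrets being re-shipped across successive releases.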

The pervasiveness of secrets across releases explains how 97 secrets detected in packages dating from 2017 were still valid in late 2023 at the time of the study:

Secrets by status type and year

Solving Secrets Sprawl

The risk companies face from the rapid sprawl of API keys, configuration variables, and secrets within engineering teams cannot be overlooked. Secrets serve as the keys to a company's most valuable assets, making their management and protection a critical aspect of overall security strategy.

Despite the recognized importance of managing secrets sprawl, the widespread adoption of emerging best practices remains limited. While secrets management tools are a valuable part of the solution, they alone are not sufficient to address this complex issue. So, what can effectively solve secrets sprawl?

Download the full Report!

Download the report to gain valuable insights into how companies with the strongest security postures successfully tackle this challenge.
