It is not a secret. Hard-coded credentials have long been a primary cause of security incidents in the software world.
Yet, with the growing complexity of digital supply chains, secrets sprawl is the Achilles’ heel for organizations of all sizes and security postures.
In 2023, GitGuardian scanned 1.1B commits (+10.6%), of which 8M exposed at least one secret (+30.3%).
[Infographic: number of pro-bono alert emails sent (millions); share of commit authors who leaked a secret; share of commits that exposed at least one secret; number of repositories that leaked a secret (millions).]
Geographic breakdown: based on the GitHub location of the commit author (for GitHub profiles mentioning a location), which corresponds to 3 million occurrences of secrets in the dataset.
The growing number of code repositories on GitHub, with 50 million new repositories added in the past year (+22%), increases the risk of both accidental and deliberate exposure of sensitive information.
While the IT sector, which includes software vendors, is the most affected industry, with 65.9% of all detected leaks, other industries are also impacted. These include Education, Science & Tech, Retail, Manufacturing, and Finance & Insurance, which account for 20.1%, 7%, 1.5%, 1.2%, and 1% of leaks, respectively.
In 2023, GitGuardian observed a 1212x increase in the number of OpenAI API key leaks over the previous year, unsurprisingly making them the top-ranked detector. While OpenAI leads by a wide margin in the number of leaks, more and more tokens used to access HuggingFace open-source models have been seen on GitHub month after month, hinting at a growing interest in open-source AI among developers.
Our analysis also highlights the rapid penetration of services like Gemini (Google's ChatGPT alternative, formerly known as Bard and introduced at the end of March), Pinecone (a vector database service), Replicate (AI models-as-a-service), and to a lesser extent, Claude (an AI assistant by Anthropic), Cohere, and Clarifai.
If you're a software provider, we encourage you to reach out about developing a bespoke detector tailored to your service's specific needs: request a custom detector from GitGuardian for your service.
When someone exposes a secret on public GitHub, they should consider it compromised. The author must revoke the secret quickly to reduce the impact of the incident. In 2023, GitGuardian monitored how well authors fixed leaks. The tracking started when the first valid occurrence of a secret was detected and ended five days later.
This curve displays the progress of secret validity over time after detection. The perimeter is restricted to secrets for which the first occurrence was found valid, which amounts to 644,947 unique secrets detected in 2023 (not all secrets can be checked for validity). For each one, GitGuardian’s pro-bono alerting system emailed the commit author.
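As a rough illustration of how such a validity curve could be computed, here is a minimal sketch. It assumes each secret is represented by a hypothetical pair of timestamps (when it was first detected as valid, and when it was found revoked, or `None` if it was still valid at the end of tracking). The function name and data layout are assumptions for illustration, not GitGuardian's actual pipeline.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical record: (detected_at, revoked_at), with revoked_at = None
# when the secret was never observed to be revoked.
Record = Tuple[datetime, Optional[datetime]]


def validity_curve(records: List[Record], days: int = 5) -> List[float]:
    """Fraction of secrets still valid d days after detection, for d = 0..days."""
    curve = []
    for d in range(days + 1):
        cutoff = timedelta(days=d)
        still_valid = sum(
            1
            for detected, revoked in records
            # Still valid if never revoked, or revoked later than the cutoff.
            if revoked is None or revoked - detected > cutoff
        )
        curve.append(still_valid / len(records))
    return curve


t0 = datetime(2023, 1, 1)
sample = [
    (t0, t0 + timedelta(hours=1)),            # revoked within the hour
    (t0, t0 + timedelta(days=2, hours=12)),   # revoked on day 3
    (t0, None),                               # never revoked
    (t0, t0 + timedelta(days=6)),             # revoked after the tracking window
]
print(validity_curve(sample))
```

Every secret counts as valid at day 0 because the perimeter only includes secrets whose first occurrence was found valid.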
This analysis reveals that leaked WeChat App and Algolia keys are the most likely to remain exposed for over 5 days. Conversely, developers are more concerned about the risks of leaking Stripe or Cloudflare API keys, as these would be prime targets in credential-harvesting campaigns.
“Developers erasing leaky commits or repositories instead of revoking are creating a major security risk for companies, which will remain vulnerable to threat actors mirroring public GitHub activity for as long as the credential remains valid. These zombie leaks are the worst,” said Eric Fourrier, CEO and Founder of GitGuardian.
These findings are crucial for grasping the full scope of the secrets sprawl issue. While most security initiatives focus on detecting leaks, the bottleneck lies in improving the security posture. Simply alerting developers falls short; what's truly essential is providing them with the necessary guidance and support to rectify their mistakes effectively.
A common response to a leak by repository owners is to delete the repository or make it private, cutting off public access to the leaked information. However, this approach can lead to one of the riskiest scenarios for an organization: a "zombie leak".
To assess the prevalence of zombie leaks, the study selected a random sample of 5,000 erased commits that had exposed a secret. Of the repositories that hosted these commits, only 28.2% were still accessible at the time of the study. This indicates that the remaining repositories were likely deleted or made private in response to the leak, suggesting that the prevalence of zombie leaks may be underestimated.
Given that leaks frequently occur outside an organization's control, often in personal GitHub accounts, DMCA notices are mainly employed to manage such external repositories. Data points to an increasing use of DMCA notices as a last-ditch effort to remove repositories that inadvertently expose secrets.
The year 2023 marked the breakthrough of Generative AI, significantly impacting various professional fields. Developers, as we have seen, are at the forefront of this new wave, and there is no doubt that this powerful technology, in the hands of both good and bad actors, will have an outsized impact on cybersecurity. Here is an in-depth look at GitGuardian's AI-driven approach to enhancing the detection and management of sensitive information.
Generic secrets, not associated with specific services, present unique challenges in secrets detection: the lack of contextual information and validity checkers makes it difficult to offer incident status visibility or context-tailored remediation guidelines. GitGuardian is deeply committed to advancing this area, leveraging AI to enhance the contextualization of leaks.
To enhance application security through machine learning, GitGuardian's ML team is developing a model that scores how likely a detected candidate is to be a genuine generic secret rather than a false positive. This model is particularly effective: test results show distinct score distributions for true and false positives.
Such clear differentiation allows for setting a threshold that significantly reduces false positives without substantially impacting the detection of true positives, showcasing the model's capability to refine secrets detection. The reduction of false positives, in turn, reduces wasted response efforts by our customers.
To test the hypothesis that secrets leaked in private repositories are also leaked on public GitHub, GitGuardian conducted a study on a perimeter comprising 403,571 leaked secrets, querying HasMySecretLeaked to know if these were also leaked on GitHub.
This fact hints at a well-known saying: “Security through obscurity is no security at all.” Applied to our case, it dismantles the idea that relying on the privacy of source code as a security layer is a valid strategy.
These “private yet public” leaks have been publicly exposed 3.48 times on average, and 99% were found in source code files (less than 1% in GitHub issues, Pull Request descriptions, or GitHub Gists).
Secrets sprawl affects more than code repositories. This year, GitGuardian expanded its investigation into the pervasiveness of leaked secrets within PyPI. The Python Package Index, better known as PyPI, is the official third-party package management system for the Python community. The central repository boasts over 500K hosted projects, 10M files, and over 31 billion monthly downloads.
Adding up all the secrets shared across all the releases (5 million), GitGuardian found 56,866 occurrences of secrets, indicating the same secret is often found in multiple releases of the same package. The reason behind this is simple: many package maintainers don’t realize a secret is shipped with the code library at every release!
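The gap between occurrences and unique secrets can be illustrated with a small counting sketch. It assumes scan findings arrive as hypothetical `(package, release, fingerprint)` tuples, where the fingerprint is some stable identifier of the secret (for example a hash). The names and data are made up for illustration and do not reflect the actual study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical finding: (package name, release version, secret fingerprint).
Finding = Tuple[str, str, str]


def summarize(findings: Iterable[Finding]) -> Dict[str, int]:
    """Count total occurrences, unique secrets, and secrets seen in
    more than one (package, release) pair."""
    occurrences = 0
    locations: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for package, release, fingerprint in findings:
        occurrences += 1
        locations[fingerprint].add((package, release))
    return {
        "occurrences": occurrences,
        "unique_secrets": len(locations),
        "repeated_across_releases": sum(
            1 for seen in locations.values() if len(seen) > 1
        ),
    }


findings = [
    ("pkg-a", "1.0", "fp1"),
    ("pkg-a", "1.1", "fp1"),  # same secret shipped again
    ("pkg-a", "1.2", "fp1"),  # ...and again
    ("pkg-b", "0.1", "fp2"),
]
print(summarize(findings))
```

A secret left in the source tree is re-published with every release, so the occurrence count grows even though only one credential needs to be revoked.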
The risk companies face from the rapid sprawl of API keys, configuration variables, and secrets within engineering teams cannot be overlooked. Secrets serve as the keys to a company's most valuable assets, making their management and protection a critical aspect of overall security strategy.
Despite the recognized importance of managing secrets sprawl, the widespread adoption of emerging best practices remains limited. While secrets management tools are a valuable part of the solution, they alone are not sufficient to address this complex issue. So, what can effectively solve secrets sprawl?