The Stanford Internet Research Data Repository is a public archive of research datasets that describe the hosts, services, and websites on the Internet. The repository is hosted by the Stanford Empirical Security Research Group, and we are happy to host data for other researchers as well. The data on the site is restricted to non-commercial use. A JSON interface is available. Contact support@esrg.stanford.edu with any questions.

Censys Universal IPv4 Internet Dataset
External Dataset from Censys, Inc.

Censys publishes daily snapshots of public IPv4 hosts by continually scanning 2,500 ports, performing automatic protocol detection, completing full protocol handshakes, and labeling known software and devices. The dataset contains around 850M services on 250M IPv4 hosts; each daily snapshot is 2 TB in size.

Censys Universal Certificate Dataset
External Dataset from Censys, Inc.

Censys maintains an append-only store of X.509 certificates found in public Certificate Transparency logs and Internet scans. The dataset includes raw PEMs, parsed X.509 data, browser validation and revocation data, CT entries, and ZLint results. It contains 5 billion certificates and is 15-20 TB in size.

Project Sonar Open Data Repository
External Dataset from Rapid7, Inc.

Rapid7 provides researchers and community members open access to data from Project Sonar, which conducts regular Internet-wide surveys to gain insights into global exposure to common vulnerabilities. In addition to providing Internet scans, Rapid7 publishes multiple DNS datasets (e.g., reverse DNS (PTR) lookups of all IPv4 addresses).

On the Origin of Scanning: The Impact of Location on Internet-Wide Scans
Paper Artifact(s) from Stanford University

Abstract: Fast IPv4 scanning has enabled researchers to answer a wealth of security and networking questions. Yet, despite widespread use, there has been little validation of the methodology’s accuracy, including whether a single scan provides sufficient coverage. In this paper, we analyze how scan origin affects the results of Internet-wide scans by completing three HTTP, HTTPS, and SSH scans from seven geographically and topologically diverse networks. We find that individual origins miss an average 1.6–8.4% of HTTP, 1.5–4.6% of HTTPS, and 8.3–18.2% of SSH hosts. We analyze why origins see different hosts, and show how permanent and temporary blocking, packet loss, geographic biases, and transient outages affect scan results. We discuss the implications for scanning and provide recommendations for future studies.
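
To make the per-origin "miss" percentages in the abstract concrete, the following is a minimal sketch of one way such a figure could be computed. It assumes each origin's scan result is reduced to a set of responsive IP addresses and that the baseline is the union of hosts seen from any origin. The function name, the origin labels, and the toy addresses are all hypothetical, and the paper's actual methodology may differ.

```python
from typing import Dict, Set


def miss_rates(hosts_by_origin: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return, per origin, the fraction of the union of hosts it did not see.

    Assumption: the ground truth is approximated by the union of responsive
    hosts across all origins. This is an illustrative baseline only.
    """
    # Union of every host observed by at least one origin.
    all_hosts: Set[str] = set().union(*hosts_by_origin.values())
    if not all_hosts:
        # No responsive hosts anywhere, so nothing can be missed.
        return {origin: 0.0 for origin in hosts_by_origin}
    return {
        origin: len(all_hosts - seen) / len(all_hosts)
        for origin, seen in hosts_by_origin.items()
    }


# Toy data using documentation-range addresses (hypothetical, not from the paper).
example = {
    "origin_a": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    "origin_b": {"192.0.2.1", "192.0.2.2", "192.0.2.4"},
    "origin_c": {"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"},
}
rates = miss_rates(example)
for origin, rate in sorted(rates.items()):
    print(f"{origin}: missed {rate:.1%} of hosts seen from any origin")
```

In this toy input the union holds four hosts, so an origin that saw three of them has a miss rate of 25%. Real scans would also need to separate transient loss from persistent blocking, which this sketch does not attempt.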


© 2021 Stanford University