How to reproduce all of the figures in the paper (Sections 4 and 5)?
Our paper is based on measurements of TLSA records and their certificates, which we collected over 4 months from 5 vantage points. Because these datasets are massive, we used Apache Spark to process them in parallel. Here, we provide three approaches to reproducing our measurement results:
- First, you can (1) run our measurement source code to collect your own datasets, (2) analyze the raw datasets you have obtained, and (3) run our plotting scripts. Because your datasets will not be exactly the same as the ones we used in the paper, your figures will look different. We therefore recommend this approach to researchers who are interested in extending our work. If you choose this approach, start from Section 3 and then work backward through Sections 2 and 1.
- Second, you can (1) download our raw datasets instead of re-running our measurement code, (2) run our analysis scripts, and (3) run our plotting scripts. Because you skip the data collection, this should be faster than the first approach. However, the raw datasets are large: they span four months and were collected from five vantage points. Depending on your computational resources, the analysis scripts may take several hours (or days) to run; we used Spark clusters to analyze the datasets efficiently. If you choose this approach, skip Section 3 and follow Section 2 and then Section 1.
- Finally, you can use our analytics datasets directly. We produced them by running our analysis code on the raw datasets. We believe this is the fastest way to check the consistency of the figures in the paper. If you choose this approach, you only need to read Section 1.
1. Reproducing the figures from the analytics
This section introduces a very simple way to reproduce all of the figures in the paper by using the analytics datasets and plotting scripts.
Datasets and scripts
(1) Analytic datasets for figures and their gnuplot scripts
| Filename | Download | Description | 
|---|---|---|
| Analytics | link | Input datasets for the figures in the paper. | 
| plotting-scripts.tar.gz | link | Plotting scripts for 6 figures. | 
(2) Details of the gnuplot scripts
| Filename | Figure No. in the paper | Input data | 
|---|---|---|
| 2years-tlsa-ratio-per-tld-split.plot | Figure 2 | tlsa-counts.csv | 
| alexa-tlsa-adoption.plot | Figure 3 | alexa_dane_stat_output.txt which is included in alexa_dane_stat_output.tar.gz | 
| 2years-tlsa-ratio-per-tld-split-fallback.plot | Figure 4 | tlsa-counts.csv | 
| missing-dnssec.plot | Figure 5 | dnssec_stat_output_[city].txt which is included in dnssec_stat_output.tar.gz | 
| startls-availability.plot | Figure 6 | starttls_error_stat_output_[city].txt which is included in starttls_error_stat_output.tar.gz | 
| incorrect-percent-per-comp.plot | Figure 7 | check_incorrect_stat_output_[city].txt which is included in check_incorrect_stat_output.tar.gz | 
| 4months-valid-per-tld.plot | Figure 8 | valid_dn_stat_output_[city].txt which is included in valid_dn_stat_output.tar.gz | 
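Before running the gnuplot scripts, it can help to confirm that every input file is in place. The following is a minimal sketch, not part of our released scripts: the mapping is copied from the table above, while the helper names (`expected_inputs`, `missing_inputs`) and the assumption that `[city]` is replaced by a lower-case city name such as `virginia` are illustrative only.

```python
from pathlib import Path

# Plot script -> input file pattern, copied from the table above.
# "{city}" stands for the [city] placeholder (assumed to be a lower-case name).
PLOT_INPUTS = {
    "2years-tlsa-ratio-per-tld-split.plot": "tlsa-counts.csv",
    "alexa-tlsa-adoption.plot": "alexa_dane_stat_output.txt",
    "2years-tlsa-ratio-per-tld-split-fallback.plot": "tlsa-counts.csv",
    "missing-dnssec.plot": "dnssec_stat_output_{city}.txt",
    "startls-availability.plot": "starttls_error_stat_output_{city}.txt",
    "incorrect-percent-per-comp.plot": "check_incorrect_stat_output_{city}.txt",
    "4months-valid-per-tld.plot": "valid_dn_stat_output_{city}.txt",
}


def expected_inputs(script, cities):
    """Return the input file names that a plot script needs for the given cities."""
    pattern = PLOT_INPUTS[script]
    if "{city}" not in pattern:
        return [pattern]
    return [pattern.format(city=city) for city in cities]


def missing_inputs(data_dir, cities):
    """Map each plot script to the expected inputs that are absent from data_dir."""
    data_dir = Path(data_dir)
    report = {}
    for script in PLOT_INPUTS:
        missing = [name for name in expected_inputs(script, cities)
                   if not (data_dir / name).exists()]
        if missing:
            report[script] = missing
    return report
```

For example, `missing_inputs("analytics", ["virginia"])` returns an empty dictionary once all inputs for that city have been extracted into a (hypothetical) `analytics` directory; each script can then be run with `gnuplot <script>`.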
2. Reproducing the analytics from the raw datasets (our measurement datasets)
This section describes how to generate the datasets (in the Analytics file) from the raw datasets that we collected using our measurement code.
After executing the analysis scripts you may use those output files as inputs to the above plotting scripts.
Datasets and scripts
(1) Raw (measurement) datasets and prerequisites for the analysis
| Filename | Download | Description | 
|---|---|---|
| hourly dataset | link | TLSA records and their certificates (through STARTTLS) collected for 4 months (July to October 2019) from the five EC2 vantage points (Virginia, Oregon, Paris, Sydney, and São Paulo). | 
| tlsa-domains-seeds.tar.gz | link | Domain names that had TLSA records on July 10, 2019 and October 31, 2019, which are used in rollover-candidate.py. | 
| mx-with-tlsa.tar.gz | link | A list of MX records that also have TLSA records. This dataset was measured by OpenINTEL. | 
| alexa-top1m.csv | link | Alexa 1M domain names captured on October 31, 2019. This dataset was obtained from the top-lists study. | 
| alexa1m-mx.tar.gz | link | Alexa 1M domain names that have MX records (measured on October 31, 2019). | 
| alexa1m-tlsa.tar.gz | link | Alexa 1M domains that have TLSA records (measured on October 31, 2019). | 
| root-ca-list.tar.gz | link | A list of root CA certificates used for verifying certificates. | 
| Intermediary data | link | Intermediary datasets which are outputs of Spark scripts and also can be used as an input for analysis scripts. | 
(2) Scripts for the analysis
| Filename | Download | Description | 
|---|---|---|
| dependencies.zip | link | It includes our customized Python dns package for the Spark scripts. | 
| raw-merge.py | link | It merges the raw datasets collected from the five vantage points into a single dataset, for the sake of simplicity. | 
| spark-codes.tar.gz | link | It includes PySpark scripts for our analysis. | 
| stats-codes.tar.gz | link | It includes Python scripts for our analysis. | 
How to use the datasets and scripts?
(1) Preprocessing the raw datasets.
We collected two raw datasets: TLSA records (via DNS) and their certificates (via STARTTLS).
To use DANE correctly, these two objects have to be matched; thus, raw-merge.py reads both datasets and writes the merged output in JSON format.
After downloading the hourly dataset, configure the input and output paths (global variables in the script) for the city (e.g., virginia) you want to merge and run raw-merge.py.
python3 raw-merge.py 190711 191031 
After execution, merged outputs are placed in the [output_path]/[city]/ directory.
The JSON data below is an example of a merged output.
...
{
  "domain": "mail.ietf.org.",
  "port": "25",
  "time": "20191031 9",
  "city": "virginia", 
  "tlsa": {
  	    "dnssec": "Secure", // DNSSEC validation result
  	    "record_raw": "AACBoAABAAIABwABA18yNQRfdGNwBG1haWwEaWV0ZgNvcmcAADQAAQNfMjUEX3..." // DNS wire-format TLSA record, Base64 Encoded
  	    },
  "starttls": {
  	    "certs": ["LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdWekNDQlQrZ0F3SUJBZ...", // PEM format certificate, Base64 Encoded
  	    	       "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZBRENDQStpZ0F3SUJBZ...",
  	    	       "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ..."]
  }
}
...
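As an illustration of how a merged record can be consumed, here is a minimal sketch that decodes the two Base64 fields using only the Python standard library. It is not taken from our scripts. The 16-character prefix of `record_raw` shown in the example decodes to 12 bytes that look like a DNS message header, so the sketch assumes that `record_raw` holds a complete DNS response in wire format; the helper names (`decode_certs`, `dns_header`) are hypothetical.

```python
import base64
import struct


def decode_certs(record):
    """Return the STARTTLS certificates of a merged record as PEM text."""
    return [base64.b64decode(cert).decode("ascii")
            for cert in record["starttls"]["certs"]]


def dns_header(record_raw_b64):
    """Decode the 12-byte DNS header at the start of a Base64 wire-format message.

    Assumption: record_raw is a full DNS response, so its first 12 bytes are
    six 16-bit big-endian header fields.
    """
    raw = base64.b64decode(record_raw_b64)
    if len(raw) < 12:
        raise ValueError("a DNS header needs at least 12 bytes")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", raw[:12])
    return {
        "id": ident,
        "rcode": flags & 0x000F,     # response code (0 means no error)
        "ad": bool(flags & 0x0020),  # Authenticated Data bit
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


# First 16 Base64 characters (12 bytes) of the record_raw example above.
print(dns_header("AACBoAABAAIABwAB"))
```

For that prefix, the header reports one question, two answer records, and the AD bit set, which is consistent with the "Secure" DNSSEC result in the same record.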
Now you are ready to run the Spark scripts.
(2) Analyzing the merged datasets
Apache Spark is designed for processing large datasets in parallel on multiple cores. However, it may not work efficiently when the data items have many dependencies on one another. Thus, we first use Spark to extract the information that we are interested in from the raw datasets, and then run a Python analysis script to analyze it in depth.
The table below shows which Spark, analysis, and gnuplot scripts are used to obtain each result in the paper.
| Result | Spark | Analysis | Gnuplot script | 
|---|---|---|---|
| Figure 2 | - | - | 2years-tlsa-ratio-per-tld-split.plot | 
| Figure 3 | - | alexa1m-dane-stat.py | alexa-tlsa-adoption.plot | 
| Figure 4 | - | - | 2years-tlsa-ratio-per-tld-split-fallback.plot | 
| Figure 5 | dnssec.py | dnssec-stat.py | missing-dnssec.plot | 
| Figure 6 | starttls-error.py | starttls-error-stat.py | starttls-availability.plot | 
| Figure 7 | check-incorrect.py | check-incorrect-stat.py | incorrect-percent-per-comp.plot | 
| Figure 8 | valid-dn.py | valid-dn-stat.py | 4months-valid-per-tld.plot | 
| Section 5.5 | rollover.py | rollover-stat.py | - | 
For example, to get the input dataset for Figure 5, run Spark script dnssec.py. Next, run the analysis script dnssec-stat.py using the output of dnssec.py as an input. Finally, you can draw Figure 5 with the missing-dnssec.plot script using the output of dnssec-stat.py as an input.
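The same lookup can be expressed as a small table in code. This is only a sketch for orientation: the entries are copied from the table above, and the names `PIPELINE` and `steps` are hypothetical rather than part of our scripts.

```python
# Result -> (Spark script, analysis script, gnuplot script), from the table above.
# None marks a stage that the result does not need.
PIPELINE = {
    "Figure 2": (None, None, "2years-tlsa-ratio-per-tld-split.plot"),
    "Figure 3": (None, "alexa1m-dane-stat.py", "alexa-tlsa-adoption.plot"),
    "Figure 4": (None, None, "2years-tlsa-ratio-per-tld-split-fallback.plot"),
    "Figure 5": ("dnssec.py", "dnssec-stat.py", "missing-dnssec.plot"),
    "Figure 6": ("starttls-error.py", "starttls-error-stat.py", "starttls-availability.plot"),
    "Figure 7": ("check-incorrect.py", "check-incorrect-stat.py", "incorrect-percent-per-comp.plot"),
    "Figure 8": ("valid-dn.py", "valid-dn-stat.py", "4months-valid-per-tld.plot"),
    "Section 5.5": ("rollover.py", "rollover-stat.py", None),
}


def steps(result):
    """Return the scripts to run for a result, in execution order."""
    return [script for script in PIPELINE[result] if script is not None]
```

Calling `steps("Figure 5")` lists the three scripts named in the example above, in the order they should be run.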
Running Spark scripts
The spark-codes.tar.gz file contains nine Spark scripts that run on a Spark machine. Please note that these scripts take the merged data as an input.
The Spark scripts depend on dns, a third-party library that we customized. You can either install this library on the Spark machine or pass it to the machine when you run the Spark code by using the --py-files option. For the sake of simplicity, we provide a package, dependencies.zip.
spark-submit --py-files=/path/to/dependencies.zip [spark_script.py]
The table below describes each Spark script that we use for the analyses.
| Filename | Description | Input | Output (same as the ones in Intermediary data) | 
|---|---|---|---|
| chain-validation.py | It verifies a chain of certificates. | root-ca-list | chain_validation_output.tar.gz | 
| dane-validation.py | It validates DANE based on RFC 7671. | the output of chain-validation.py (e.g., chain_output_virginia/ in chain_validation_output.tar.gz) | dane_validation_output.tar.gz | 
| dnssec.py | It checks DNSSEC validity of TLSA records. | - | dnssec_output.tar.gz | 
| starttls-error.py | It classifies the reasons for STARTTLS scanning failures. | - | starttls_error_output.tar.gz | 
| check-incorrect.py | It classifies the reasons for DANE validation failures. | - | check_incorrect_output.tar.gz | 
| rollover-candidate.py | It extracts the domains that have conducted a rollover. | tlsa-base-domains-seeds | rollover_candidate_output.tar.gz | 
| rollover.py | It evaluates the rollover behavior of domains. | the output of rollover-candidate-sub.py (e.g., rollover-cand-merged-virginia.txt in rollover_cand_merged_output.tar.gz) | rollover_output.tar.gz | 
| valid-dn.py | It counts the number of domains associated with mail servers that have valid TLSA records. | the output of dane-validation.py (e.g., dane_output_virginia/ in dane_validation_output.tar.gz) and mx-with-tlsa | valid_dn_output.tar.gz | 
Running analysis scripts
After getting the outputs from the Spark scripts, you can analyze them. stats-codes.tar.gz contains nine analysis scripts for this purpose. The output files should be the same as the ones in the Analytics.
| Filename | Description | Input | Output | Misc. | 
|---|---|---|---|---|
| dane-validation-stat.py | It calculates stats of DANE validation results. | the output of dane-validation.py (e.g., dane_output_[city]/ in dane_validation_output.tar.gz) | dane_valid_stat_output.tar.gz | - | 
| dnssec-stat.py | It calculates stats of DNSSEC validation results. | the output of dnssec.py (e.g., dnssec_output_[city]/ in dnssec_output.tar.gz) | dnssec_stat_output.tar.gz | - | 
| starttls-error-stat.py | It calculates stats of STARTTLS scanning errors. | the output of starttls-error.py (e.g., starttls_error_output_[city]/ in starttls_error_output.tar.gz) | starttls_error_stat_output.tar.gz | - | 
| check-incorrect-stat.py | It calculates stats of DANE validation failure reasons. | the output of check-incorrect.py (e.g., incorrect_output_[city]/ in check_incorrect_output.tar.gz) | check_incorrect_stat_output.tar.gz | - | 
| rollover-candidate-sub.py | It finds the domains that have conducted a rollover. | the output of rollover-candidate.py (e.g., rollover_cand_output_virginia/ in rollover_candidate_output.tar.gz) | rollover_cand_merged_output.tar.gz | - | 
| rollover-stat.py | It calculates stats of the rollover behavior of domains. | the output of rollover.py (e.g., rollover_output_virginia/ in rollover_output.tar.gz) | rollover_stat_output.tar.gz | Section 5.5, Key Rollover | 
| valid-dn-stat.py | It calculates stats of DANE-valid domains for each TLD. | the output of valid-dn.py (e.g., valid_dn_virginia/ in valid_dn_output.tar.gz) | valid_dn_stat_output.tar.gz | - | 
| alexa1m-dane-stat.py | It calculates stats of Alexa 1M domains that have TLSA records. | alexa1m-mx, alexa1m-tlsa, alexa-top1m.csv | alexa_dane_stat_output.tar.gz | - | 
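Because several scripts consume the output of other scripts, the order in which they run matters. The sketch below derives one valid order from the Input columns of the two tables above, using `graphlib` from the Python standard library (Python 3.9 or later). It is an illustration rather than part of our tooling; `DEPENDS_ON` and `run_order` are hypothetical names, and only script-to-script dependencies are modeled (downloaded datasets such as root-ca-list are left out).

```python
from graphlib import TopologicalSorter  # available since Python 3.9

# Script -> scripts whose output it reads, taken from the Input columns above.
DEPENDS_ON = {
    # Spark scripts
    "chain-validation.py": set(),
    "dane-validation.py": {"chain-validation.py"},
    "dnssec.py": set(),
    "starttls-error.py": set(),
    "check-incorrect.py": set(),
    "rollover-candidate.py": set(),
    "rollover.py": {"rollover-candidate-sub.py"},
    "valid-dn.py": {"dane-validation.py"},
    # Analysis scripts
    "dane-validation-stat.py": {"dane-validation.py"},
    "dnssec-stat.py": {"dnssec.py"},
    "starttls-error-stat.py": {"starttls-error.py"},
    "check-incorrect-stat.py": {"check-incorrect.py"},
    "rollover-candidate-sub.py": {"rollover-candidate.py"},
    "rollover-stat.py": {"rollover.py"},
    "valid-dn-stat.py": {"valid-dn.py"},
    "alexa1m-dane-stat.py": set(),
}


def run_order():
    """Return all scripts in an order where every input is produced before use."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


for position, script in enumerate(run_order(), start=1):
    print(position, script)
```

Note that the rollover analysis alternates between the two kinds of scripts: rollover-candidate.py (Spark), then rollover-candidate-sub.py (analysis), then rollover.py (Spark), and finally rollover-stat.py (analysis).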
3. Running our measurement codes to get your own raw datasets (TLSA records and their certificates)
This section introduces the source code that we used to collect our datasets. With it, we collected TLSA records and their certificates by issuing an average of 11,972 TLSA record lookups, and fetching the corresponding certificate chains, every hour from July 11, 2019 to October 31, 2019. We refer to these measurements as the Hourly dataset.
What about the Daily dataset? The Daily dataset, which contains every domain name under top-level domains, was collected using zone files that are provided under agreements with registries, so we cannot make it publicly available. Instead, we provide the intermediary data extracted from the Daily dataset (e.g., tlsa-counts.csv) that is needed to run our scripts.
Scanning source codes
| Filename | Description | Download | 
|---|---|---|
| tlsa-scan.go | It fetches TLSA records from a list of domains. | link1 | 
| starttls-scan.go | It collects certificates via STARTTLS. | link | 
How to scan TLSA records and their certificates
1. Scan TLSA records
The script tlsa-scan.go will read a list of domains in the mx-with-tlsa file and collect their TLSA records. The output has the following format.
| TLSA-base-domain | Vantage point | DNSSEC validity [1] | TLSA records [2] | 
|---|---|---|---|
| _25._tcp.mail.ietf.org. | Virginia | Secure | AACBoAABAAIAB… | 
| _25._tcp.mail.tutanota.de. | Virginia | Secure | AACBoAABAAIA… | 
| … | … | … | … | 
[1] The result of DNSSEC validation: Secure indicates that a domain can be validated. Insecure indicates that a domain cannot be validated because it does not have a DS record. Bogus indicates that a domain cannot be validated because it has invalid DNSSEC records such as expired RRSIGs.
[2] TLSA records: wire-formatted TLSA records (base64 encoded)
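The TLSA-base-domain column follows the standard TLSA naming scheme, in which the port and the transport protocol are prepended to the host name as underscore-prefixed labels. The following sketch shows the conversion in both directions; the function names are hypothetical and the code is not taken from tlsa-scan.go.

```python
def tlsa_base_domain(host, port=25, proto="tcp"):
    """Build the TLSA owner name for a service, e.g. _25._tcp.mail.ietf.org."""
    host = host.rstrip(".")
    return f"_{port}._{proto}.{host}."


def parse_tlsa_base_domain(name):
    """Split a TLSA owner name into (host, port, proto)."""
    labels = name.rstrip(".").split(".")
    if len(labels) < 3 or not (labels[0].startswith("_") and labels[1].startswith("_")):
        raise ValueError(f"not a TLSA base domain: {name!r}")
    port = int(labels[0][1:])
    proto = labels[1][1:]
    host = ".".join(labels[2:]) + "."
    return host, port, proto
```

For the first row of the table, `parse_tlsa_base_domain("_25._tcp.mail.ietf.org.")` yields the host `mail.ietf.org.`, port 25, and protocol `tcp`.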
2. Scan STARTTLS certificates
The script starttls-scan.go will read a list of domains in the mx-with-tlsa file and collect certificates presented via STARTTLS. The output has the following format.
| Domains | Port | Vantage point | Status [1] | # of presented certificates [2] | Certificates [3] | 
|---|---|---|---|---|---|
| mail.ietf.org | 25 | Virginia | Success | 4 | LS0RUaAB…, WjGdVBWYi…, 0s3FTFRuZ1…, eFKdDRBO… | 
| mail.tutanota.de. | 25 | Virginia | Success | 4 | LSSf7JanC…, ODlF4NEF…, SA3S29K…, Z1RstKS… | 
| … | … | … | … | … | … | 
[1] Whether a certificate has been successfully fetched or not.
[2] The number of certificates presented via STARTTLS.
[3] A list of base64 encoded (PEM format) certificates (each certificate is comma-separated).
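To show how the collected certificates relate to TLSA records, the sketch below splits the Certificates column and computes the certificate association data that a TLSA record with selector 0 (full certificate) would carry for matching types 0 (exact), 1 (SHA-256), and 2 (SHA-512). It is a simplified illustration, not our dane-validation.py: selector 1 (public key only) and the certificate usage field are not handled, and the function names are hypothetical.

```python
import base64
import hashlib
import ssl


def split_certificates(field):
    """Split the comma-separated Certificates column into PEM strings."""
    return [base64.b64decode(part.strip()).decode("ascii")
            for part in field.split(",") if part.strip()]


def association_data(pem_cert, matching_type):
    """Compute TLSA association data for selector 0 (the full certificate)."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    if matching_type == 0:
        return der                          # exact match on the DER bytes
    if matching_type == 1:
        return hashlib.sha256(der).digest()
    if matching_type == 2:
        return hashlib.sha512(der).digest()
    raise ValueError(f"unknown matching type: {matching_type}")


def matches(pem_cert, matching_type, tlsa_data):
    """Check whether a certificate matches the data field of a TLSA record."""
    return association_data(pem_cert, matching_type) == tlsa_data
```

A presented certificate matches a TLSA record with selector 0 when `matches` returns `True` for that record's matching type and data; a full DANE check additionally depends on the certificate usage field.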