How to reproduce all of the figures in the paper (Sections 4 and 5)?
Our paper is based on measurements of TLSA records and their certificates, which we collected over four months across five vantage points. Due to the massive size of these datasets, we used Apache Spark to process them in parallel. Here, we provide three approaches to reproduce our measurement results:
- First, you can (1) run our measurement source code to collect your own datasets, (2) analyze the raw datasets you have obtained, and (3) run our plotting scripts. As your datasets will not be exactly the same as the ones we used in the paper, your figures will look different. Thus, we recommend this approach for researchers who are interested in extending our work. If you are interested in this approach, start from Section 3 (and then Sections 2 and 1, in reverse order).
- Second, you can (1) download our raw datasets (thus skipping our measurement source code), (2) run our analysis scripts, and (3) run our plotting scripts. As you do not need to run our measurement scripts to collect your own datasets, this is faster than the first approach. However, our datasets are large: they span four months and were collected from five vantage points. Thus, depending on your computational resources, it may take several hours (or days) to run our analysis scripts. For your information, we used Spark clusters to analyze the datasets efficiently. If you are interested in this approach, you can skip Section 3 and start from Sections 2 and 1.
- Finally, you can just use our analytics datasets, which were produced by running our analysis code on the raw datasets. We believe this is the fastest way to check the consistency of the figures in the paper. If you are interested in this approach, you only need to read Section 1.
1. Reproducing the figures from the analytics
This section introduces a very simple way to reproduce all of the figures in the paper by using the analytics datasets and plotting scripts.
Datasets and scripts
(1) Analytic datasets for figures and their gnuplot scripts
Filename | Download | Description |
---|---|---|
Analytics | link | Input datasets for the figures in the paper. |
plotting-scripts.tar.gz | link | Plotting scripts for the figures listed in the table below. |
(2) Details of the gnuplot scripts
Filename | Figure No. in the paper | Input data |
---|---|---|
2years-tlsa-ratio-per-tld-split.plot | Figure 2 | tlsa-counts.csv |
alexa-tlsa-adoption.plot | Figure 3 | alexa_dane_stat_output.txt, which is included in alexa_dane_stat_output.tar.gz |
2years-tlsa-ratio-per-tld-split-fallback.plot | Figure 4 | tlsa-counts.csv |
missing-dnssec.plot | Figure 5 | dnssec_stat_output_[city].txt, which is included in dnssec_stat_output.tar.gz |
starttls-availability.plot | Figure 6 | starttls_error_stat_output_[city].txt, which is included in starttls_error_stat_output.tar.gz |
incorrect-percent-per-comp.plot | Figure 7 | check_incorrect_stat_output_[city].txt, which is included in check_incorrect_stat_output.tar.gz |
4months-valid-per-tld.plot | Figure 8 | valid_dn_stat_output_[city].txt, which is included in valid_dn_stat_output.tar.gz |
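Assuming gnuplot is installed and the input files listed above have been extracted next to the plot scripts (the exact output filename is whatever each script's set output line specifies, and the flat extraction layout is an assumption on our part), a single figure can be reproduced with a few commands, e.g., Figure 5:

tar xzf plotting-scripts.tar.gz
tar xzf dnssec_stat_output.tar.gz      # provides dnssec_stat_output_[city].txt for Figure 5
gnuplot missing-dnssec.plot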
2. Reproducing the analytics from the raw datasets (our measurement datasets)
This section introduces a way to generate the datasets (in the Analytics file) from the raw datasets that we collected using our measurement code. After executing the analysis scripts, you may use their output files as inputs to the plotting scripts described above.
Datasets and scripts
(1) Raw (measurement) datasets and prerequisites for the analysis
Filename | Download | Description |
---|---|---|
hourly dataset | link | TLSA records and their certificates (collected via STARTTLS) for four months (July to October 2019) from five EC2 vantage points (Virginia, Oregon, Paris, Sydney, and São Paulo). |
tlsa-domains-seeds.tar.gz | link | Domain names that have TLSA records on July 10, 2019 and October 31, 2019, which are used in rollover-candidate.py. |
mx-with-tlsa.tar.gz | link | A list of MX records that also have TLSA records. This dataset was obtained from OpenINTEL. |
alexa-top1m.csv | link | Alexa 1M domain names captured on October 31, 2019. This dataset was obtained from the top-lists study. |
alexa1m-mx.tar.gz | link | Alexa 1M domain names that have MX records (measured on October 31, 2019). |
alexa1m-tlsa.tar.gz | link | Alexa 1M domain names that have TLSA records (measured on October 31, 2019). |
root-ca-list.tar.gz | link | A list of root CA certificates for verifying certificates. |
Intermediary data | link | Intermediary datasets, which are outputs of the Spark scripts and can also be used as inputs to the analysis scripts. |
(2) Scripts for the analysis
Filename | Download | Description |
---|---|---|
dependencies.zip | link | The Python dns package that we crafted for the Spark scripts. |
raw-merge.py | link | It merges the raw datasets collected from the five vantage points into a single dataset, for the sake of simplicity. |
spark-codes.tar.gz | link | It includes the pySpark scripts for our analysis. |
stats-codes.tar.gz | link | It includes the Python scripts for our analysis. |
How to use the datasets and scripts?
(1) Preprocessing the raw datasets.
We collected two raw datasets: TLSA records (via DNS) and their certificates (via STARTTLS).
To use DANE correctly, these two objects have to be matched; thus, raw-merge.py reads the two datasets and generates merged output in JSON format.
After downloading the hourly dataset, configure the input and output paths (global variables in the script) for the city (e.g., virginia) you want to merge, and run raw-merge.py.
python3 raw-merge.py 190711 191031
After execution, the merged outputs are placed in the [output_path]/[city]/ directory.
The JSON data below is an example of a merged output.
...
{
  "domain": "mail.ietf.org.",
  "port": "25",
  "time": "20191031 9",
  "city": "virginia",
  "tlsa": {
    "dnssec": "Secure", // DNSSEC validation result
    "record_raw": "AACBoAABAAIABwABA18yNQRfdGNwBG1haWwEaWV0ZgNvcmcAADQAAQNfMjUEX3..." // DNS wire-format TLSA record, Base64 encoded
  },
  "starttls": {
    "certs": ["LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdWekNDQlQrZ0F3SUJBZ...", // PEM format certificates, Base64 encoded
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZBRENDQStpZ0F3SUJBZ...",
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ..."]
  }
}
...
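If you want to sanity-check a merged file before moving on, the short Python sketch below decodes the Base64-encoded certificates back into PEM text; the filename merged_output.json and the one-JSON-object-per-line assumption are ours, not guaranteed by the merge script.

import base64, json

# Hypothetical path: point this at a merged output file under [output_path]/[city]/.
with open("merged_output.json") as f:
    for line in f:  # assuming one JSON object per line
        record = json.loads(line)
        print(record["domain"], record["time"], record["tlsa"]["dnssec"])
        # Each entry of starttls.certs is a Base64-encoded PEM certificate.
        for cert_b64 in record["starttls"].get("certs", []):
            pem = base64.b64decode(cert_b64).decode()
            print(pem.splitlines()[0])  # -----BEGIN CERTIFICATE-----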
Now you are ready to run the Spark scripts.
(2) Analyzing the merged datasets
Apache Spark is specialized for big-data processing using multiple cores at the same time. However, it may not work efficiently when there are many dependencies within the datasets. Thus, we first use Spark to extract the information that we are interested in from our raw datasets, and then run an analysis Python script to analyze it in depth.
The table below shows which Spark, analysis, and gnuplot scripts are used to obtain each result in the paper.
Result | Spark | Analysis | Gnuplot script |
---|---|---|---|
Figure 2 | - | - | 2years-tlsa-ratio-per-tld-split.plot |
Figure 3 | - | alexa1m-dane-stat.py | alexa-tlsa-adoption.plot |
Figure 4 | - | - | 2years-tlsa-ratio-per-tld-split-fallback.plot |
Figure 5 | dnssec.py | dnssec-stat.py | missing-dnssec.plot |
Figure 6 | starttls-error.py | starttls-error-stat.py | starttls-availability.plot |
Figure 7 | check-incorrect.py | check-incorrect-stat.py | incorrect-percent-per-comp.plot |
Figure 8 | valid-dn.py | valid-dn-stat.py | 4months-valid-per-tld.plot |
Section 5.5 | rollover.py | rollover-stat.py | - |
For example, to get the input dataset for Figure 5, run the Spark script dnssec.py. Next, run the analysis script dnssec-stat.py using the output of dnssec.py as input. Finally, you can draw Figure 5 with the missing-dnssec.plot script, using the output of dnssec-stat.py as input.
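Put together, the Figure 5 pipeline looks like the sketch below; the exact arguments are assumptions, since (as with raw-merge.py) each script may expect its input and output paths to be configured as global variables inside the script.

spark-submit --py-files=/path/to/dependencies.zip dnssec.py   # writes dnssec_output_[city]/
python3 dnssec-stat.py                                        # reads dnssec_output_[city]/, writes dnssec_stat_output_[city].txt
gnuplot missing-dnssec.plot                                   # reads dnssec_stat_output_[city].txt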
Running Spark scripts
The spark-codes.tar.gz file contains nine Spark scripts that run on a Spark machine. Please note that these scripts take the merged data as input.
The scripts depend on dns, a Python package that we crafted. You may install this library on the Spark machine, or you can pass it to the machine when you run the Spark code by using the --py-files option. For the sake of simplicity, we provide it as a package, dependencies.zip.
spark-submit --py-files=/path/to/dependencies.zip [spark_script.py]
The table below describes each of the Spark scripts that we use for the analyses.
Filename | Description | Input | Output (same as the ones in Intermediary data) |
---|---|---|---|
chain-validation.py | It verifies a chain of certificates. | root-ca-list | chain_validation_output.tar.gz |
dane-validation.py | It validates DANE based on RFC 7671. | the output of chain-validation.py (e.g., chain_output_virginia/ in chain_validation_output.tar.gz) | dane_validation_output.tar.gz |
dnssec.py | It checks the DNSSEC validity of TLSA records. | - | dnssec_output.tar.gz |
starttls-error.py | It classifies the reasons for STARTTLS scanning failures. | - | starttls_error_output.tar.gz |
check-incorrect.py | It classifies the reasons for DANE validation failures. | - | check_incorrect_output.tar.gz |
rollover-candidate.py | It extracts the domains that have conducted a rollover. | tlsa-base-domains-seeds | rollover_candidate_output.tar.gz |
rollover.py | It evaluates the rollover behavior of domains. | the output of rollover-candidate-sub.py (e.g., rollover-cand-merged-virginia.txt in rollover_cand_merged_output.tar.gz) | rollover_output.tar.gz |
valid-dn.py | It counts the number of domains associated with mail servers that have valid TLSA records. | the output of dane-validation.py (e.g., dane_output_virginia/ in dane_validation_output.tar.gz) & mx-with-tlsa | valid_dn_output.tar.gz |
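For example, the certificate-validation pipeline in the table above is a three-stage chain; the sketch below assumes the same --py-files invocation as before and that each script's input/output paths are configured inside the script.

spark-submit --py-files=/path/to/dependencies.zip chain-validation.py   # input: root-ca-list
spark-submit --py-files=/path/to/dependencies.zip dane-validation.py    # input: chain_output_[city]/
spark-submit --py-files=/path/to/dependencies.zip valid-dn.py           # input: dane_output_[city]/ and mx-with-tlsa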
Running analysis scripts
After getting outputs from the Spark scripts, you can analyze them. stats-codes.tar.gz contains nine analysis scripts for this purpose. Their output files should be the same as the ones in the Analytics file.
Filename | Description | Input | Output | Misc. |
---|---|---|---|---|
dane-validation-stat.py | It calculates statistics of DANE validation results. | the output of dane-validation.py (e.g., dane_output_[city]/ in dane_validation_output.tar.gz) | dane_valid_stat_output.tar.gz | - |
dnssec-stat.py | It calculates statistics of DNSSEC validation results. | the output of dnssec.py (e.g., dnssec_output_[city]/ in dnssec_output.tar.gz) | dnssec_stat_output.tar.gz | - |
starttls-error-stat.py | It calculates statistics of STARTTLS scanning errors. | the output of starttls-error.py (e.g., starttls_error_output_[city]/ in starttls_error_output.tar.gz) | starttls_error_stat_output.tar.gz | - |
check-incorrect-stat.py | It calculates statistics of DANE validation failure reasons. | the output of check-incorrect.py (e.g., incorrect_output_[city]/ in check_incorrect_output.tar.gz) | check_incorrect_stat_output.tar.gz | - |
rollover-candidate-sub.py | It finds the domains that have conducted a rollover. | the output of rollover-candidate.py (e.g., rollover_cand_output_virginia/ in rollover_candidate_output.tar.gz) | rollover_cand_merged_output.tar.gz | - |
rollover-stat.py | It calculates statistics of the rollover behavior of domains. | the output of rollover.py (e.g., rollover_output_virginia/ in rollover_output.tar.gz) | rollover_stat_output.tar.gz | Section 5.5, Key Rollover |
valid-dn-stat.py | It calculates statistics of DANE-valid domains for each TLD. | the output of valid-dn.py (e.g., valid_dn_virginia/ in valid_dn_output.tar.gz) | valid_dn_stat_output.tar.gz | - |
alexa1m-dane-stat.py | It calculates statistics of Alexa 1M domains that have TLSA records. | alexa1m-mx, alexa1m-tlsa, alexa-top1m.csv | alexa_dane_stat_output.tar.gz | - |
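The key-rollover analysis (Section 5.5) is the only chain that alternates twice between Spark and analysis scripts; combining the two tables, the order is as follows (paths and arguments are assumptions, as above).

spark-submit --py-files=/path/to/dependencies.zip rollover-candidate.py   # input: tlsa-base-domains-seeds
python3 rollover-candidate-sub.py                                         # input: rollover_cand_output_[city]/
spark-submit --py-files=/path/to/dependencies.zip rollover.py             # input: rollover-cand-merged-[city].txt
python3 rollover-stat.py                                                  # input: rollover_output_[city]/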
3. Running our measurement code to collect your own raw datasets (TLSA records and their certificates)
This section introduces the source code that we used to collect our datasets. We used it to collect TLSA records and their certificates by sending, on average, 11,972 TLSA record lookups (and fetching the corresponding certificate chains) every hour from July 11, 2019 to October 31, 2019. We refer to these measurements as the Hourly dataset.
What about the Daily dataset? Because the Daily dataset, which contains every domain name under the top-level domains, was collected using zone files that are provided under agreement with the registries, we cannot make it publicly available. Instead, we provide intermediary data (e.g., tlsa-counts.csv) extracted from the Daily dataset, which is needed to run our scripts.
Scanning source code
Filename | Description | Download |
---|---|---|
tlsa-scan.go | It fetches TLSA records for a list of domains. | link |
starttls-scan.go | It collects certificates via STARTTLS. | link |
How to scan TLSA records and their certificates
1. Scan TLSA records
The script tlsa-scan.go reads the list of domains in the mx-with-tlsa file and collects their TLSA records. The output has the following format.
TLSA base domain | Vantage point | DNSSEC validity 1 | TLSA records 2 |
---|---|---|---|
_25._tcp.mail.ietf.org. | Virginia | Secure | AACBoAABAAIAB… |
_25._tcp.mail.tutanota.de. | Virginia | Secure | AACBoAABAAIA… |
… | … | … | … |
1 The result of DNSSEC validation: Secure indicates that the domain can be validated. Insecure indicates that the domain cannot be validated because it does not have a DS record. Bogus indicates that the domain cannot be validated because it has invalid DNSSEC records, such as expired RRSIGs.
2 TLSA records: wire-format TLSA records (Base64 encoded).
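If you just want to see what one such lookup returns before running the Go scanner, the Python sketch below uses the dnspython library; it is an illustration only, not part of our code, and it does not perform the DNSSEC validation that tlsa-scan.go does.

import dns.resolver  # dnspython

# One TLSA base domain from the example above; the scanner does this for every domain in mx-with-tlsa.
answer = dns.resolver.resolve("_25._tcp.mail.ietf.org.", "TLSA")
for rr in answer:
    # usage, selector, matching type, and certificate association data
    print(rr.usage, rr.selector, rr.mtype, rr.cert.hex())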
2. Scan STARTTLS certificates
The script starttls-scan.go reads the list of domains in the mx-with-tlsa file and collects the certificates presented via STARTTLS. The output has the following format.
Domain | Port | Vantage point | Status 1 | # of presented certificates 2 | Certificates 3 |
---|---|---|---|---|---|
mail.ietf.org | 25 | Virginia | Success | 4 | LS0RUaAB…, WjGdVBWYi…, 0s3FTFRuZ1…, eFKdDRBO… |
mail.tutanota.de. | 25 | Virginia | Success | 4 | LSSf7JanC…, ODlF4NEF…, SA3S29K…, Z1RstKS… |
… | … | … | … | … | … |
1 Whether the certificates were successfully fetched or not.
2 The number of certificates presented via STARTTLS.
3 A list of Base64-encoded (PEM format) certificates (each certificate is comma separated).
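For reference, the Python sketch below performs the same kind of STARTTLS handshake and grabs the server certificate. It is an illustration only (the host and port come from the example above), and unlike starttls-scan.go it captures only the leaf certificate rather than the full presented chain.

import smtplib
import ssl

host, port = "mail.ietf.org", 25  # illustrative target from the example above

# DANE validates the certificate against TLSA records later, so do not reject the handshake here.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

with smtplib.SMTP(host, port, timeout=10) as smtp:
    smtp.starttls(context=ctx)
    leaf_der = smtp.sock.getpeercert(binary_form=True)  # leaf certificate only, DER-encoded
    print(len(leaf_der), "bytes")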