How to reproduce all of the figures in the paper (Sections 4 and 5)?

Our paper is based on measurements of TLSA records and their certificates, which we collected over 4 months from 5 vantage points. Because these datasets are very large, we used Apache Spark to process them in parallel. Here, we provide three approaches to reproducing our measurement results:

1. Reproducing the figures from the analytics

This section introduces a very simple way to reproduce all of the figures in the paper by using the analytics datasets and plotting scripts.

Datasets and scripts

(1) Analytic datasets for figures and their gnuplot scripts

Filename | Download | Description
Analytics | link | Input datasets for the figures in the paper.
plotting-scripts.tar.gz | link | Plotting scripts for 6 figures.

(2) Details of the gnuplot scripts

Filename | Figure No. in the paper | Input data
2years-tlsa-ratio-per-tld-split.plot | Figure 2 | tlsa-counts.csv
alexa-tlsa-adoption.plot | Figure 3 | alexa_dane_stat_output.txt (included in alexa_dane_stat_output.tar.gz)
2years-tlsa-ratio-per-tld-split-fallback.plot | Figure 4 | tlsa-counts.csv
missing-dnssec.plot | Figure 5 | dnssec_stat_output_[city].txt (included in dnssec_stat_output.tar.gz)
startls-availability.plot | Figure 6 | starttls_error_stat_output_[city].txt (included in starttls_error_stat_output.tar.gz)
incorrect-percent-per-comp.plot | Figure 7 | check_incorrect_stat_output_[city].txt (included in check_incorrect_stat_output.tar.gz)
4months-valid-per-tld.plot | Figure 8 | valid_dn_stat_output_[city].txt (included in valid_dn_stat_output.tar.gz)

2. Reproducing the analytics from the raw datasets (our measurement datasets)

This section describes how to generate the datasets (in the Analytics file) from the raw datasets that we collected using our measurement code. After executing the analysis scripts, you can use their output files as inputs to the plotting scripts above.

Datasets and scripts

(1) Raw (measurement) datasets and prerequisites for the analysis

Filename | Download | Description
hourly dataset | link | TLSA records and their certificates (through STARTTLS) collected for 4 months (July to October 2019) at the five EC2 vantage points (Virginia, Oregon, Paris, Sydney, and São Paulo).
tlsa-domains-seeds.tar.gz | link | Domain names that have TLSA records for July 10th, 2019 and October 31st, 2019, which are used in rollover-candidate.py.
mx-with-tlsa.tar.gz | link | A list of MX records that also have TLSA records. This dataset was measured at OpenINTEL.
alexa-top1m.csv | link | Alexa 1M domain names captured on October 31st, 2019. This dataset was obtained from the top-lists study.
alexa1m-mx.tar.gz | link | Alexa 1M domain names that have MX records (measured on October 31st, 2019).
alexa1m-tlsa.tar.gz | link | Alexa 1M domains that have TLSA records (measured on October 31st, 2019).
root-ca-list.tar.gz | link | A list of root CA certificates for verifying certificates.
Intermediary data | link | Intermediary datasets, which are outputs of the Spark scripts and can also be used as inputs for the analysis scripts.

(2) Scripts for the analysis

Filename | Download | Description
dependencies.zip | link | It includes our crafted Python dns package for the Spark scripts.
raw-merge.py | link | For the sake of simplicity, it merges the raw datasets collected from the five vantage points into one single dataset.
spark-codes.tar.gz | link | It includes PySpark scripts for our analysis.
stats-codes.tar.gz | link | It includes Python scripts for our analysis.

How to use the datasets and scripts?

(1) Preprocessing the raw datasets.

We collected two raw datasets: TLSA records (via DNS) and their certificates (via STARTTLS). To use DANE correctly, these two objects have to match; thus, raw-merge.py reads both datasets and writes the merged output in JSON format. After downloading the hourly dataset, configure the input and output paths (global variables in the script) for the city (e.g., virginia) you want to merge, and run raw-merge.py.

python3 raw-merge.py 190711 191031 

After execution, merged outputs are placed in the [output_path]/[city]/ directory.
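
The two arguments in the command above look like the first and last day of the collection period written as YYMMDD. That reading is an assumption inferred from the measurement dates, not something the script documents here. Under that assumption, the days covered by such a range can be listed with a small sketch (yymmdd_range is a hypothetical helper, not part of raw-merge.py):

```python
from datetime import datetime, timedelta


def yymmdd_range(start, end):
    """Yield every day from start to end (inclusive) as a YYMMDD string."""
    day = datetime.strptime(start, "%y%m%d").date()
    last = datetime.strptime(end, "%y%m%d").date()
    while day <= last:
        yield day.strftime("%y%m%d")
        day += timedelta(days=1)


# Assumed meaning of the two command-line arguments: first and last day.
days = list(yymmdd_range("190711", "191031"))
print(len(days), days[0], days[-1])
```

This can help to check that the merged output directory contains data for every expected day.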

The JSON data below is an example of a merged output.

...
{
  "domain": "mail.ietf.org.",
  "port": "25",
  "time": "20191031 9",
  "city": "virginia",
  "tlsa": {
    "dnssec": "Secure", // DNSSEC validation result
    "record_raw": "AACBoAABAAIABwABA18yNQRfdGNwBG1haWwEaWV0ZgNvcmcAADQAAQNfMjUEX3..." // DNS wire-format TLSA record, Base64 encoded
  },
  "starttls": {
    "certs": ["LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdWekNDQlQrZ0F3SUJBZ...", // PEM format certificate, Base64 encoded
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZBRENDQStpZ0F3SUJBZ...",
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ..."]
  }
}
...
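
To work with the certificates in a merged record, each entry of "certs" first has to be Base64-decoded back to PEM. A minimal sketch of that step follows. It assumes the record has already been loaded into a Python dict (for example with json.loads) and that "certs" is a list of strings. The function names are hypothetical, and the demo uses a stand-in byte string rather than a real certificate.

```python
import base64
import hashlib
import ssl


def decode_certs(record):
    """Return the DER bytes of every certificate in a merged record."""
    ders = []
    for b64_pem in record["starttls"]["certs"]:
        # Each entry is a PEM certificate that was Base64-encoded once more.
        pem = base64.b64decode(b64_pem).decode("ascii").strip()
        ders.append(ssl.PEM_cert_to_DER_cert(pem))
    return ders


def sha256_fingerprints(record):
    """Return the SHA-256 fingerprint (hex) of every certificate."""
    return [hashlib.sha256(der).hexdigest() for der in decode_certs(record)]


# Demo with a stand-in "certificate"; real records carry X.509 certificates.
fake_der = b"not a real certificate"
fake_pem = ssl.DER_cert_to_PEM_cert(fake_der)
record = {
    "domain": "example.org.",
    "starttls": {"certs": [base64.b64encode(fake_pem.encode("ascii")).decode("ascii")]},
}
print(sha256_fingerprints(record))
```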

Now you are ready to run the Spark scripts.

(2) Analyzing the merged datasets

Apache Spark is designed for processing big data on multiple cores in parallel. However, it may not work efficiently when the data items have many dependencies on one another. Thus, we first use Spark to extract the information that we are interested in from our raw datasets, and then run a Python analysis script to analyze it in depth.

The table below shows which Spark, analysis, and gnuplot scripts are used to obtain each result in the paper.

Result | Spark | Analysis | Gnuplot script
Figure 2 | - | - | 2years-tlsa-ratio-per-tld-split.plot
Figure 3 | - | alexa1m-dane-stat.py | alexa-tlsa-adoption.plot
Figure 4 | - | - | 2years-tlsa-ratio-per-tld-split-fallback.plot
Figure 5 | dnssec.py | dnssec-stat.py | missing-dnssec.plot
Figure 6 | starttls-error.py | starttls-error-stat.py | starttls-availability.plot
Figure 7 | check-incorrect.py | check-incorrect-stat.py | incorrect-percent-per-comp.plot
Figure 8 | valid-dn.py | valid-dn-stat.py | 4months-valid-per-tld.plot
Section 5.5 | rollover.py | rollover-stat.py | -

For example, to get the input dataset for Figure 5, run the Spark script dnssec.py. Next, run the analysis script dnssec-stat.py, using the output of dnssec.py as its input. Finally, draw Figure 5 with the missing-dnssec.plot script, using the output of dnssec-stat.py as its input.
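
The order of the three steps can be sketched as a small lookup built from the table above. This is only an illustration. The sketch prints the commands instead of running them, and it assumes that the analysis scripts are started with python3 and read their paths from variables inside the script (as raw-merge.py does), which is not confirmed here. It also does not cover prerequisite Spark scripts such as chain-validation.py.

```python
# Result -> (Spark script, analysis script, gnuplot script), copied from the
# table above. None means the step is not needed for that result.
PIPELINE = {
    "Figure 2": (None, None, "2years-tlsa-ratio-per-tld-split.plot"),
    "Figure 3": (None, "alexa1m-dane-stat.py", "alexa-tlsa-adoption.plot"),
    "Figure 4": (None, None, "2years-tlsa-ratio-per-tld-split-fallback.plot"),
    "Figure 5": ("dnssec.py", "dnssec-stat.py", "missing-dnssec.plot"),
    "Figure 6": ("starttls-error.py", "starttls-error-stat.py",
                 "starttls-availability.plot"),
    "Figure 7": ("check-incorrect.py", "check-incorrect-stat.py",
                 "incorrect-percent-per-comp.plot"),
    "Figure 8": ("valid-dn.py", "valid-dn-stat.py", "4months-valid-per-tld.plot"),
    "Section 5.5": ("rollover.py", "rollover-stat.py", None),
}


def commands_for(result, dependencies="/path/to/dependencies.zip"):
    """Return the commands for one result, in the order they should run."""
    spark, analysis, plot = PIPELINE[result]
    cmds = []
    if spark:
        cmds.append(["spark-submit", f"--py-files={dependencies}", spark])
    if analysis:
        # Assumption: the script needs no command-line arguments.
        cmds.append(["python3", analysis])
    if plot:
        cmds.append(["gnuplot", plot])
    return cmds


for cmd in commands_for("Figure 5"):
    print(" ".join(cmd))
```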

Running Spark scripts

The spark-codes.tar.gz file contains nine Spark scripts that run on a Spark machine. Please note that these scripts take the merged data as input. The scripts depend on dns, a Python DNS library that we crafted. You can either install this library on the Spark machine or pass it to the machine when you run the Spark code by using the --py-files option. For the sake of simplicity, we provide it as a package, dependencies.zip.

spark-submit --py-files=/path/to/dependencies.zip [spark_script.py]

The table below describes each of the Spark scripts that we use for the analyses.

Filename | Description | Input | Output (same as the ones in Intermediary data)
chain-validation.py | It verifies a chain of certificates. | root-ca-list | chain_validation_output.tar.gz
dane-validation.py | It validates DANE based on RFC 7671. | the output of chain-validation.py (e.g., chain_output_virginia/ in chain_validation_output.tar.gz) | dane_validation_output.tar.gz
dnssec.py | It checks the DNSSEC validity of TLSA records. | - | dnssec_output.tar.gz
starttls-error.py | It classifies the reasons for STARTTLS scanning failures. | - | starttls_error_output.tar.gz
check-incorrect.py | It classifies the reasons for DANE validation failures. | - | check_incorrect_output.tar.gz
rollover-candidate.py | It extracts the domains that have conducted a rollover. | tlsa-base-domains-seeds | rollover_candidate_output.tar.gz
rollover.py | It evaluates the rollover behavior of domains. | the output of rollover-candidate-sub.py (e.g., rollover-cand-merged-virginia.txt in rollover_cand_merged_output.tar.gz) | rollover_output.tar.gz
valid-dn.py | It counts the number of domains associated with mail servers that have valid TLSA records. | the output of dane-validation.py (e.g., dane_output_virginia/ in dane_validation_output.tar.gz) & mx-with-tlsa | valid_dn_output.tar.gz
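
For readers unfamiliar with DANE validation, the core comparison between a TLSA record and a certificate, as defined in RFC 6698, can be sketched as follows. This is not the code of dane-validation.py. It ignores the certificate usage field, which decides which certificate of the chain is compared and whether PKIX path validation is also required. It also expects the caller to supply both the DER-encoded certificate and its DER-encoded SubjectPublicKeyInfo, since extracting the latter needs an X.509 parser. The function names are hypothetical.

```python
import hashlib


def tlsa_association_data(cert_der, spki_der, selector, matching_type):
    """Compute the data a TLSA record should carry for this certificate."""
    # Selector: 0 = full certificate, 1 = SubjectPublicKeyInfo.
    if selector == 0:
        chosen = cert_der
    elif selector == 1:
        chosen = spki_der
    else:
        raise ValueError(f"unknown selector {selector}")
    # Matching type: 0 = exact match, 1 = SHA-256, 2 = SHA-512.
    if matching_type == 0:
        return chosen
    if matching_type == 1:
        return hashlib.sha256(chosen).digest()
    if matching_type == 2:
        return hashlib.sha512(chosen).digest()
    raise ValueError(f"unknown matching type {matching_type}")


def tlsa_matches(cert_der, spki_der, selector, matching_type, record_data):
    """Return True if the TLSA record data matches the certificate."""
    return tlsa_association_data(cert_der, spki_der, selector, matching_type) == record_data


# Demo with stand-in byte strings instead of real DER encodings.
cert, spki = b"stand-in certificate", b"stand-in public key"
record_data = hashlib.sha256(spki).digest()  # e.g. a "3 1 1" style record
print(tlsa_matches(cert, spki, 1, 1, record_data))
```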

Running analysis scripts

After getting the outputs of the Spark scripts, you can analyze them. stats-codes.tar.gz contains nine analysis scripts for this purpose. Their output files should be the same as the ones in the Analytics.

Filename | Description | Input | Output | Misc.
dane-validation-stat.py | It calculates stats of DANE validation results. | the output of dane-validation.py (e.g., dane_output_[city]/ in dane_validation_output.tar.gz) | dane_valid_stat_output.tar.gz | -
dnssec-stat.py | It calculates stats of DNSSEC validation results. | the output of dnssec.py (e.g., dnssec_output_[city]/ in dnssec_output.tar.gz) | dnssec_stat_output.tar.gz | -
starttls-error-stat.py | It calculates stats of STARTTLS scanning errors. | the output of starttls-error.py (e.g., starttls_error_output_[city]/ in starttls_error_output.tar.gz) | starttls_error_stat_output.tar.gz | -
check-incorrect-stat.py | It calculates stats of DANE validation failure reasons. | the output of check-incorrect.py (e.g., incorrect_output_[city]/ in check_incorrect_output.tar.gz) | check_incorrect_stat_output.tar.gz | -
rollover-candidate-sub.py | It finds the domains that have conducted a rollover. | the output of rollover-candidate.py (e.g., rollover_cand_output_virginia/ in rollover_candidate_output.tar.gz) | rollover_cand_merged_output.tar.gz | -
rollover-stat.py | It calculates stats of the rollover behavior of domains. | the output of rollover.py (e.g., rollover_output_virginia/ in rollover_output.tar.gz) | rollover_stat_output.tar.gz | Section 5.5, Key Rollover
valid-dn-stat.py | It calculates stats of DANE-valid domains for each TLD. | the output of valid-dn.py (e.g., valid_dn_virginia/ in valid_dn_output.tar.gz) | valid_dn_stat_output.tar.gz | -
alexa1m-dane-stat.py | It calculates stats of Alexa 1M domains that have TLSA records. | alexa1m-mx, alexa1m-tlsa, alexa-top1m.csv | alexa_dane_stat_output.tar.gz | -

3. Running our measurement codes to get your own raw datasets (TLSA records and their certificates)

This section introduces the source code that we used to collect our datasets. We used it to collect TLSA records and their certificates, issuing an average of 11,972 TLSA record lookups and fetching the corresponding certificate chains every hour from July 11, 2019 to October 31, 2019. We refer to these measurements as the Hourly dataset.

What about the Daily dataset? The Daily dataset, which contains every domain name under the top-level domains, was collected using zone files that are provided under agreements with the registries, so we cannot make it publicly available. Instead, we provide the intermediary data extracted from the Daily dataset (e.g., tlsa-counts.csv) that is needed to run our scripts.

Scanning source codes

Filename | Description | Download
tlsa-scan.go | It fetches TLSA records from a list of domains. | link [1]
starttls-scan.go | It collects certificates via STARTTLS. | link

[1] This requires the following third-party libraries: Unbound, Unbound Golang Wrapper, and ldns.

How to scan TLSA records and their certificates

1. Scan TLSA records

The script tlsa-scan.go will read a list of domains in the mx-with-tlsa file and collect their TLSA records. The output has the following format.

TLSA-base-domain | Vantage point | DNSSEC validity [1] | TLSA records [2]
_25._tcp.mail.ietf.org. | Virginia | Secure | AACBoAABAAIAB…
_25._tcp.mail.tutanota.de. | Virginia | Secure | AACBoAABAAIA…

[1] The result of DNSSEC validation: Secure indicates that a domain can be validated. Insecure indicates that a domain cannot be validated because it does not have a DS record. Bogus indicates that a domain cannot be validated because it has invalid DNSSEC records such as expired RRSIGs.

[2] TLSA records: wire-format TLSA records (base64 encoded).
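
To inspect a wire-format record, the Base64 string can be decoded and its fixed-size DNS header and question read with the standard library alone. The sketch below assumes that the field holds a complete DNS response message, which is how the sample values appear to start. It stops after the question section, because reading the TLSA answers requires DNS name decompression, for which a DNS library is the better tool. The function names are hypothetical and the demo message is built by hand, not taken from the dataset.

```python
import base64
import struct

TLSA_TYPE = 52  # RR type number assigned to TLSA (RFC 6698)


def parse_header_and_question(b64_message):
    """Decode a Base64 DNS message and return header and question fields."""
    msg = base64.b64decode(b64_message)
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", msg[:12])
    if qdcount < 1:
        raise ValueError("message has no question section")
    labels, pos = [], 12
    while True:
        length = msg[pos]
        pos += 1
        if length == 0:  # the root label ends the name
            break
        if length & 0xC0:
            raise ValueError("compressed names are not handled in this sketch")
        labels.append(msg[pos:pos + length].decode("ascii"))
        pos += length
    qtype, qclass = struct.unpack("!2H", msg[pos:pos + 4])
    return {
        "rcode": flags & 0x000F,     # response code
        "ad": bool(flags & 0x0020),  # Authenticated Data bit set by a validator
        "answers": ancount,
        "qname": ".".join(labels) + ".",
        "qtype": qtype,
    }


def encode_name(name):
    """Encode a domain name as uncompressed DNS labels (demo helper)."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


# Hand-built demo message: header plus one TLSA question, no answers.
demo = (struct.pack("!6H", 0, 0x81A0, 1, 0, 0, 0)
        + encode_name("_25._tcp.mail.ietf.org.")
        + struct.pack("!2H", TLSA_TYPE, 1))
info = parse_header_and_question(base64.b64encode(demo).decode("ascii"))
print(info)
```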

2. Scan STARTTLS certificates

The script starttls-scan.go will read a list of domains in the mx-with-tlsa file and collect certificates presented via STARTTLS. The output has the following format.

Domains | Port | Vantage point | Status [1] | # of presented certificates [2] | Certificates [3]
mail.ietf.org | 25 | Virginia | Success | 4 | LS0RUaAB…, WjGdVBWYi…, 0s3FTFRuZ1…, eFKdDRBO…
mail.tutanota.de. | 25 | Virginia | Success | 4 | LSSf7JanC…, ODlF4NEF…, SA3S29K…, Z1RstKS…

[1] Whether a certificate has been successfully fetched or not.

[2] The number of certificates presented via STARTTLS.

[3] A list of base64-encoded (PEM format) certificates, separated by commas.