How to reproduce all of the figures in the paper (Sections 4, 5, and 6)?

Our paper is based on measurements of TLSA records and their certificates, which we collected over 20 months. Due to the massive size of these datasets, we used Apache Spark to process them in parallel. Here, we provide three approaches to reproduce our measurement results:

1. Reproducing the figures from the analytics

This section introduces a simple way to reproduce all of the figures in the paper using the analytics datasets and the plotting scripts.

Datasets and scripts

(1) Analytic datasets for figures and their gnuplot scripts

Filename Download Description
Analytics link Input datasets for the figures in the paper.
plotting-scripts.tar.gz link Plotting scripts for 6 figures.

(2) Details of the gnuplot scripts

Filename Figure No. on the paper Input data
mx-dn-serving-stat.plot Figure 2 mx-dn-[valid or invalid]-serving-stat.txt, which is included in mx_dn_serving_stat.tar.gz
case-stat.plot Figure 3 case-stat.txt, which is included in case_stat.tar.gz
invalid-reasons.plot Figure 4 case-tlsa-stat-[SSDS or SSDO], which is included in case_tlsa_stat.tar.gz
ever-matched.plot Figure 5 case-tlsa-stat-[SSDS or SSDO], which is included in case_tlsa_stat.tar.gz
le-rollover-daneee.plot Figure 7 rollover-timeline-le.txt (included in rollover_timeline_le.tar.gz) and cert-pki-cn-stat.txt (included in cert_pki_cn_stat.tar.gz)
le-rollover-ta.plot Figure 8 rollover-timeline-le.txt (included in rollover_timeline_le.tar.gz) and cert-pki-cn-stat.txt (included in cert_pki_cn_stat.tar.gz)
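
For example, after extracting plotting-scripts.tar.gz and the corresponding analytics archives into one directory, Figure 3 can typically be redrawn by feeding the matching script to gnuplot (assuming gnuplot is installed); where the rendered figure is written depends on the terminal and output settings inside each .plot script.

gnuplot case-stat.plot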

2. Reproducing the analytics from the raw datasets (our measurement datasets)

This section introduces a way to generate the datasets (in the Analytics archive) from the raw datasets that we collected using our measurement code. After executing the analysis scripts, you can use their output files as inputs to the plotting scripts above.

Datasets and scripts

(1) Raw (measurement) datasets and prerequisites for the analysis

Filename Download Description
hourly dataset [1] TLSA records and their certificates (through STARTTLS) collected for 20 months (July 2019 ~ February 2021) on the EC2 vantage point (Virginia).
popularity_data.tar.gz [1] Popularity datasets that are used to identify the managing entities of SMTP servers and name servers.
all-mx-exclude-nl.tar.gz [1] A list of all SMTP servers in our dataset.
root-ca-list.tar.gz link A list of root CA certificates for verifying certificates.
public-intermediate-certs.tar.gz link A list of intermediate CA certificates and revoked intermediate CA certificates. This data is obtained from the Mozilla wiki.

[1] Due to the size of the datasets, please email us for data access.

(2) Scripts for the analysis

Filename Download Description
dependencies.zip link It includes our customized Python dns package for the Spark scripts.
raw-merge.py link For the sake of simplicity, it merges the collected raw datasets into one single dataset.
spark-codes.tar.gz link It includes the PySpark scripts for our analysis.
stats-codes.tar.gz link It includes the Python scripts for our analysis.

How to use the datasets and scripts?

(1) Preprocessing the raw datasets.

We collected two raw datasets: TLSA records (via DNS) and their certificates (via STARTTLS). To use DANE correctly, these two objects have to be matched; thus, we read the two datasets with raw-merge.py and generate the output in JSON format. After downloading the hourly dataset, configure the input and output paths (global variables in the script) and run raw-merge.py:

python3 raw-merge.py 190711 210212 

After execution, the merged outputs (merged_data) are placed in the [output_path]/ directory.

The JSON data below is an example of a merged_data record.

...
{
  "domain": "mail.ietf.org.",
  "port": "25",
  "time": "20191031 9",
  "city": "virginia", 
  "tlsa": {
  	    "dnssec": "Secure", // DNSSEC validation result
  	    "record_raw": "AACBoAABAAIABwABA18yNQRfdGNwBG1haWwEaWV0ZgNvcmcAADQAAQNfMjUEX3..." // DNS wire-format TLSA record, Base64 Encoded
  	    },

  "starttls": {
  	    "certs": "["LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdWekNDQlQrZ0F3SUJBZ...", // PEM format certificate, Base64 Encoded
  	    	       "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZBRENDQStpZ0F3SUJBZ...",
  	    	       "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ..."]
  }
}
...
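
As a reference for working with these records, the snippet below is a minimal sketch (not part of our released scripts) of how such a merged_data entry could be decoded with off-the-shelf Python tooling. It assumes the dnspython and cryptography packages are installed, that record_raw holds a Base64-encoded DNS message in wire format, and that each certs entry is a Base64-encoded PEM certificate; the input file name is hypothetical.

# decode_merged_record.py -- illustrative sketch, not part of the released scripts.
# Assumes `pip install dnspython cryptography` and a single merged_data record
# stored as JSON in the (hypothetical) file merged_data.json.
import base64
import json

import dns.message                      # dnspython
from cryptography import x509

with open("merged_data.json") as f:
    record = json.load(f)

# The TLSA part is a Base64-encoded DNS message in wire format.
wire = base64.b64decode(record["tlsa"]["record_raw"])
msg = dns.message.from_wire(wire)
for rrset in msg.answer:
    print(rrset)                        # e.g., the TLSA RRset for _25._tcp.mail.ietf.org.

# Each STARTTLS certificate is a Base64-encoded PEM blob.
for b64_pem in record["starttls"]["certs"]:
    cert = x509.load_pem_x509_certificate(base64.b64decode(b64_pem))
    print(cert.subject, cert.not_valid_after)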

Now you are ready to run the Spark scripts.

(2) Analyzing the merged datasets

Apache Spark is specialized for processing big data in parallel across multiple cores. However, it may not work efficiently when the records in a dataset have many dependencies on each other. Thus, we first use Spark to extract the information we are interested in from the raw datasets and then run an analysis Python script to analyze the extracted data in depth.
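
To make this two-stage pattern concrete, here is a toy PySpark sketch (not one of our scripts) that extracts a single piece of information, the DNSSEC validation status over time, which a plain Python script could then analyze further. It assumes the merged output is JSON that spark.read.json() can parse, and both paths are placeholders.

# extract_dnssec_stat.py -- toy example of the "extract with Spark, analyze with
# Python" pattern; not one of the released scripts. Adjust both paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dnssec-stat-example").getOrCreate()

# Stage 1 (Spark): read the merged records and aggregate only the fields of interest.
df = spark.read.json("/path/to/output_path/merged_data")      # hypothetical path
counts = df.groupBy("time", "tlsa.dnssec").count()
counts.write.mode("overwrite").csv("/path/to/dnssec-stat")     # hypothetical path

spark.stop()
# Stage 2 (plain Python): read the small extracted result and analyze it in depth.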

The table below shows which Spark, analysis, and gnuplot scripts are used to produce each result in the paper.

Result Spark Analysis Gnuplot script
Figure 2 mx_dn_serving.py mx-dn-serving-stat.py mx-dn-serving-stat.plot
Figure 3 case_stat.py case-stat.py case-stat.plot
Figure 4 case_tlsa_stat.py case-tlsa-stat.py invalid-reasons.plot
Figure 5 case_tlsa_stat.py case-tlsa-stat.py ever-matched.plot
Figure 7 rollover.py, le_stat_spark.py, cert_pki_cn.py rollover-timeline-le.py, cert-pki-cn-stat.py le-rollover-daneee.plot
Figure 8 rollover.py, le_stat_spark.py, cert_pki_cn.py rollover-timeline-le.py, cert-pki-cn-stat.py le-rollover-ta.plot
Table 3 rollover.py, rollover_case_target.py rollover-case.py -
Table 4 init_deploy.py init-deploy-stat.py -

For example, to get the input dataset for Figure 3, run the Spark script case_stat.py. Next, run the analysis script case-stat.py using the output of case_stat.py as its input. Finally, you can draw Figure 3 with the case-stat.plot script using the output of case-stat.py as its input.
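
The sketch below chains these three steps for Figure 3 with subprocess calls. It only mirrors the description above: the dependencies.zip path is a placeholder, and the exact input/output arguments that each script expects should be checked before running it.

# figure3_pipeline.py -- illustrative sketch of the Figure 3 workflow; paths are
# placeholders and per-script arguments may differ.
import subprocess

# Step 1 (Spark): extract per-category statistics from the merged dataset.
subprocess.run(["spark-submit", "--py-files=/path/to/dependencies.zip", "case_stat.py"], check=True)

# Step 2 (analysis): turn the Spark output into the plotting input (case-stat.txt).
subprocess.run(["python3", "case-stat.py"], check=True)

# Step 3 (plotting): render Figure 3 from case-stat.txt.
subprocess.run(["gnuplot", "case-stat.plot"], check=True)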

Running Spark scripts

The spark-codes.tar.gz file contains sixteen Spark scripts that run on a Spark machine. The scripts depend on dns, a third-party Python library that we customized. You can either install this library on the Spark machine or pass it to the machine when you run the Spark code by using the --py-files option. For the sake of simplicity, we provide it as a single package, dependencies.zip.

spark-submit --py-files=/path/to/dependencies.zip [spark_script.py]

The table below describes each of the Spark scripts that we use for the analysis.

Filename Description Input
dane_validation.py It validates DANE based on RFC 7671. merged_data
check_incorrect_reason.py It classifies the reasons for DANE validation failures. the output of dane_validation.py
antago_syix.py It identifies SMTP servers that are served by Antagonist or Syix. merged_data
cert_pki_cn.py It identifies SMTP servers that use certificates issued by public CAs. merged_data, the output of antago_syix.py, all-mx-exclude-nl.tar.gz
le_stat_spark.py It identifies SMTP servers that use certificates issued by Let’s Encrypt. merged_data, the output of antago_syix.py, all-mx-exclude-nl.tar.gz
rollover_groupby.py It groups merged_data records that belong to the same SMTP server. merged_data
ever_matched.py It evaluates whether mismatched TLSA records can be matched with outdated certificates. the output of rollover_groupby.py
find_case.py It classifies domains into managing categories. the datasets in popularity_data.tar.gz
map_case.py It merges domain data with their DANE validation results. the output of dane_validation.py and find_case.py
mx_dn_serving.py It calculates the number of domains served by an SMTP server and its DANE validity. the output of map_case.py
case_stat.py It generates statistics of DANE validation results for domains according to managing categories. the output of map_case.py
case_tlsa_stat.py It generates statistics of DANE validation results for SMTP servers according to managing categories. the output of dane_validation.py, ever_matched.py, and map_case.py
rollover_candidate.py It extracts the SMTP servers that have conducted rollovers. the output of rollover_groupby.py, antago_syix.py, all-mx-exclude-nl.tar.gz
rollover.py It evaluates the rollover behaviors of SMTP servers. the output of rollover_groupby.py, antago_syix.py, rollover-candidate.py, never-matched.py, all-mx-exclude-nl.tar.gz
rollover_case_target.py It evaluates rollovers according to managing categories. the output of map_case.py, rollover-stat.py
init_deploy.py It evaluates the initial DANE deployment of SMTP servers. merged_data, the output of antago_syix.py, gen-init-seed.py, all-mx-exclude-nl.tar.gz
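
dane_validation.py implements the full RFC 7671 validation; as background, the core of any DANE check is recomputing the TLSA certificate association data from a server certificate and comparing it with the record. The sketch below illustrates only that matching step (selector and matching type handling) using the cryptography package; it is a simplified illustration, not an excerpt of our script.

# tlsa_match_sketch.py -- simplified illustration of TLSA matching (RFC 6698/7671);
# not an excerpt of dane_validation.py. Requires the `cryptography` package.
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives import serialization

def tlsa_matches(cert_pem: bytes, selector: int, mtype: int, assoc_data: bytes) -> bool:
    cert = x509.load_pem_x509_certificate(cert_pem)
    if selector == 0:        # Cert: the full DER-encoded certificate
        data = cert.public_bytes(serialization.Encoding.DER)
    elif selector == 1:      # SPKI: the DER-encoded SubjectPublicKeyInfo
        data = cert.public_key().public_bytes(
            serialization.Encoding.DER,
            serialization.PublicFormat.SubjectPublicKeyInfo)
    else:
        return False
    if mtype == 0:           # Full: exact match against the raw data
        digest = data
    elif mtype == 1:         # SHA2-256 of the selected data
        digest = hashlib.sha256(data).digest()
    elif mtype == 2:         # SHA2-512 of the selected data
        digest = hashlib.sha512(data).digest()
    else:
        return False
    return digest == assoc_data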

Running analysis scripts

After getting the outputs from the Spark scripts, you can analyze them with the analysis scripts. stats-codes.tar.gz contains eleven analysis scripts for this purpose. Their output files must be the same as the ones in the Analytics archive.

Filename Description Input Output
cert-pki-cn-stat.py It calculates stats of SMTP servers that use certificates from public CAs. the output of cert_pki_cn.py cert_pki_cn_stat.tar.gz
mx-dn-serving-stat.py It calculates stats of the number of domains served by an SMTP server and its DANE validity. the output of mx_dn_serving.py mx_dn_serving_stat.tar.gz
case-stat.py It calculates stats of DANE validation results for domains according to managing categories. the output of case_stat.py case_stat.tar.gz
case-tlsa-stat.py It calculates stats of DANE validation results for SMTP servers according to managing categories. the output of case_tlsa_stat.py case_tlsa_stat.tar.gz
never-matched.py It finds SMTP servers that never have valid TLSA records. the output of dane_validation.py -
rollover-candidate.py It finds the domains that have conducted rollovers. the output of rollover_candidate.py -
rollover-stat.py It finds domains that actually conduct rollovers (i.e., excluding SMTP servers that never have valid TLSA records). the output of rollover.py -
rollover-timeline-le.py It evaluates rollover behaviors of SMTP servers that use certificates issued by Let’s Encrypt. the output of rollover.py, le_stat_spark.py, rollover-stat.py rollover_timeline_le.tar.gz
rollover-case.py It evaluates rollover behaviors according to managing categories. the output of rollover.py, rollover_case_target.py, rollover-stat.py -
gen-init-seed.py It generates a set of SMTP servers that newly published TLSA records during our measurement period. the output of check_incorrect_reason.py, antago_syix.py, all-mx-exclude-nl.tar.gz -
init-deploy-stat.py It calculates stats of the initial DANE deployment. the output of check_incorrect_reason.py, init_deploy.py -

3. Running our measurement code to get your own raw datasets (TLSA records and their certificates)

This section introduces the source code that we used to collect our datasets. We used this code to collect TLSA records and their certificate chains every hour from July 13, 2019 to February 12, 2021. We refer to these measurements as the Hourly dataset (Section 4 in the paper).

What about the Daily dataset? The Daily dataset, which contains every domain name under the top-level domains, was collected using zone files that are provided under agreement with the registries, so we cannot make it publicly available. Instead, we can provide the intermediary data extracted from the Daily dataset that is needed to run our scripts. If you need this intermediary data, please email us for data access.

The source code and instructions on how to use it are the same as the artifacts of the USENIX Security'20 paper. You can refer to Server-side Artifacts / Section 3.