How to reproduce all of the figures in the paper (Sections 4, 5, and 6)?
Our paper is based on two measurement datasets: TLSA records and their corresponding certificates, which we collected over 20 months. Due to the massive size of these datasets, we used Apache Spark to process them in parallel. We provide three approaches to reproduce our measurement results:
- First, you can (1) run our measurement source code to collect your own datasets, (2) analyze the raw datasets you have obtained, and (3) run our plotting scripts. As your datasets will not be exactly the same as the ones we used in the paper, your figures will look different. We therefore recommend this approach for researchers who are interested in extending our work. If you are interested in this approach, start from Section 3 and work backwards through Sections 2 and 1.
- Second, you can (1) download our raw datasets (thus skipping our measurement source code), (2) run our analysis scripts, and (3) run our plotting scripts. Since you do not need to collect your own datasets, this should be faster than the first approach. However, our raw dataset is large, as it spans twenty months; depending on your computational resources, the analysis scripts may take several hours (or days) to run. For your information, we used Spark clusters to analyze the datasets efficiently. If you are interested in this approach, skip Section 3 and start from Sections 2 and 1.
- Finally, you can just use our analytics datasets, which were produced by running our analysis code on the raw datasets. We believe this is the fastest way to check the consistency of the figures in the paper. If you are interested in this approach, you only need to read Section 1.
1. Reproducing the figures from the analytics
This section introduces a very simple way to reproduce all of the figures in the paper by using the analytics datasets and plotting scripts.
Datasets and scripts
(1) Analytic datasets for figures and their gnuplot scripts
Filename | Download | Description |
---|---|---|
Analytics | link | Input datasets for the figures in the paper. |
plotting-scripts.tar.gz | link | Plotting scripts for the 6 figures. |
(2) Details of the gnuplot scripts
Filename | Figure No. in the paper | Input data |
---|---|---|
mx-dn-serving-stat.plot | Figure 2 | mx-dn-[valid or invalid]-serving-stat.txt, which is included in mx_dn_serving_stat.tar.gz |
case-stat.plot | Figure 3 | case-stat.txt, which is included in case_stat.tar.gz |
invalid-reasons.plot | Figure 4 | case-tlsa-stat-[SSDS or SSDO], which is included in case_tlsa_stat.tar.gz |
ever-matched.plot | Figure 5 | case-tlsa-stat-[SSDS or SSDO], which is included in case_tlsa_stat.tar.gz |
le-rollover-daneee.plot | Figure 7 | rollover-timeline-le.txt (in rollover_timeline_le.tar.gz) and cert-pki-cn-stat.txt (in cert_pki_cn_stat.tar.gz) |
le-rollover-ta.plot | Figure 8 | rollover-timeline-le.txt (in rollover_timeline_le.tar.gz) and cert-pki-cn-stat.txt (in cert_pki_cn_stat.tar.gz) |
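For example, to redraw Figure 3 you could extract the archives and run gnuplot on the corresponding script. The commands below are a hypothetical invocation that assumes the input data files are extracted next to the .plot scripts:
tar -xzf plotting-scripts.tar.gz
tar -xzf case_stat.tar.gz     # provides case-stat.txt
gnuplot case-stat.plot        # renders Figure 3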
2. Reproducing the analytics from the raw datasets (our measurement datasets)
This section introduces how to generate the datasets (in the Analytics file) from the raw datasets that we collected using our measurement code.
After executing the analysis scripts, you can use their output files as inputs to the plotting scripts above.
Datasets and scripts
(1) Raw (measurement) datasets and prerequisites for the analysis
Filename | Download | Description |
---|---|---|
hourly dataset | -¹ | TLSA records and their certificates (through STARTTLS) collected for 20 months (July 2019 ~ February 2021) from the EC2 vantage point (Virginia). |
popularity_data.tar.gz | -¹ | Popularity datasets used to identify the managing entities of SMTP servers and name servers. |
all-mx-exclude-nl.tar.gz | -¹ | A list of all SMTP servers in our dataset. |
root-ca-list.tar.gz | link | A list of root CA certificates for verifying certificates. |
public-intermediate-certs.tar.gz | link | A list of intermediate CA certificates and revoked intermediate CA certificates. This data is obtained from the Mozilla wiki. |
¹ Due to the size of these datasets, please email us for data access.
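The two CA lists provide the trust anchors and intermediate certificates used when the analysis verifies the collected certificate chains. As a rough, hypothetical illustration (filenames invented; our scripts perform this verification internally rather than via the openssl CLI), a single chain could be checked manually with:
openssl verify -CAfile root-ca-list.pem -untrusted intermediates.pem server-cert.pem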
(2) Scripts for the analysis
Filename | Download | Description |
---|---|---|
dependencies.zip | link | It includes our crafted python dns package for the Spark scripts. |
raw-merge.py | link | For the sake of simplicity, it merges the collected raw datasets into one single dataset. |
spark-codes.tar.gz | link | It includes pySpark scripts for our analysis. |
stats-codes.tar.gz | link | It includes python scripts for our analysis. |
How to use the datasets and scripts?
(1) Preprocessing the raw datasets.
We collected two raw datasets: TLSA records (via DNS) and their certificates (via STARTTLS).
To use DANE correctly, these two objects have to be matched; thus, we read both datasets with raw-merge.py and generate the output in JSON format.
After downloading the hourly dataset, configure the input and output paths (global variables in the script) and run raw-merge.py:
python3 raw-merge.py 190711 210212
After execution, the merged outputs (merged_data) are placed in the [output_path]/ directory.
The JSON below is an example of a merged_data entry.
...
{
  "domain": "mail.ietf.org.",
  "port": "25",
  "time": "20191031 9",
  "city": "virginia",
  "tlsa": {
    "dnssec": "Secure", // DNSSEC validation result
    "record_raw": "AACBoAABAAIABwABA18yNQRfdGNwBG1haWwEaWV0ZgNvcmcAADQAAQNfMjUEX3..." // DNS wire-format TLSA record, Base64 encoded
  },
  "starttls": {
    "certs": ["LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdWekNDQlQrZ0F3SUJBZ...", // PEM-format certificates, Base64 encoded
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZBRENDQStpZ0F3SUJBZ...",
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ..."]
  }
}
...
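To get a feel for this format, the following is a minimal sketch (not one of the released scripts) that decodes a single merged_data entry. It assumes one JSON object per line and that the crafted dns package follows the standard dnspython API:
import base64
import json

import dns.message  # the crafted "dns" package in dependencies.zip; dnspython-compatible API assumed

with open("merged_data") as f:          # hypothetical path to a merged output file
    entry = json.loads(f.readline())    # assuming one JSON object per line

# The TLSA record is a Base64-encoded DNS response in wire format.
wire = base64.b64decode(entry["tlsa"]["record_raw"])
response = dns.message.from_wire(wire)
print(entry["domain"], entry["tlsa"]["dnssec"])
for rrset in response.answer:
    print(rrset)

# Each certificate is a Base64-encoded PEM blob.
for cert_b64 in entry["starttls"]["certs"]:
    pem = base64.b64decode(cert_b64).decode()
    print(pem.splitlines()[0])          # "-----BEGIN CERTIFICATE-----"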
Now you are ready to run the Spark scripts.
(2) Analyzing the merged datasets
Apache Spark is specialized for processing big data on many cores at once. However, it does not work efficiently when the data items have many dependencies between them. Thus, we first use Spark to extract the information that we are interested in from the raw datasets, and then run an analysis Python script to analyze it in depth.
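As a minimal illustration of this pattern only (not one of the released Spark scripts; paths are placeholders), a pySpark job loads the merged JSON data, projects the fields of interest, and writes a compact intermediate result for the Python analysis scripts:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract-example").getOrCreate()

# Load the merged_data produced by raw-merge.py (JSON, one object per line).
merged = spark.read.json("/path/to/output_path/merged_data")

# Keep only the columns a downstream analysis needs, e.g. domain, time, and DNSSEC status.
extracted = merged.select("domain", "port", "time", "tlsa.dnssec")

extracted.write.mode("overwrite").json("/path/to/extracted")
spark.stop()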
The table below shows which Spark, analysis, and gnuplot scripts are used to produce each result in the paper.
Result | Spark | Analysis | Gnuplot script |
---|---|---|---|
Figure 2 | mx_dn_serving.py | mx-dn-serving-stat.py | mx-dn-serving-stat.plot |
Figure 3 | case_stat.py | case-stat.py | case-stat.plot |
Figure 4 | case_tlsa_stat.py | case-tlsa-stat.py | invalid-reasons.plot |
Figure 5 | case_tlsa_stat.py | case-tlsa-stat.py | ever-matched.plot |
Figure 7 | rollover.py, le_stat_spark.py, cert_pki_cn.py | rollover-timeline-le.py, cert-pki-cn-stat.py | le-rollover-daneee.plot |
Figure 8 | rollover.py, le_stat_spark.py, cert_pki_cn.py | rollover-timeline-le.py, cert-pki-cn-stat.py | le-rollover-ta.plot |
Table 3 | rollover.py, rollover_case_target.py | rollover-case.py | - |
Table 4 | init_deploy.py | init-deploy-stat.py | - |
For example, to get the input dataset for Figure 3, run the Spark script case_stat.py. Next, run the analysis script case-stat.py using the output of case_stat.py as its input. Finally, you can draw Figure 3 with the case-stat.plot script, using the output of case-stat.py as its input.
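In shell form, this workflow looks roughly like the following (hypothetical invocations; the exact input/output paths and arguments depend on how each script is configured, and the --py-files option is explained in the next subsection):
spark-submit --py-files=/path/to/dependencies.zip case_stat.py   # Spark extraction
python3 case-stat.py                                             # in-depth analysis
gnuplot case-stat.plot                                           # draws Figure 3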
Running Spark scripts
The spark-codes.tar.gz file contains sixteen Spark scripts that run on a Spark machine.
The scripts use dns, a third-party library that we modified. You can install this library on the Spark machine, or pass it to the machine when you submit a Spark job by using the --py-files option. For the sake of simplicity, we provide it as a package, dependencies.zip.
spark-submit --py-files=/path/to/dependencies.zip [spark_script.py]
The table below describes each of the Spark scripts that we use for the analyses.
Filename | Description | Input |
---|---|---|
dane_validation.py | It validates DANE based on RFC 7671 (a minimal matching sketch follows this table). | merged_data |
check_incorrect_reason.py | It classifies the reasons for DANE validation failures. | the output of dane_validation.py |
antago_syix.py | It identifies SMTP servers that are served by Antagonist or Syix. | merged_data |
cert_pki_cn.py | It identifies SMTP servers that use certificates issued by public CAs. | merged_data, the output of antago_syix.py, all-mx-exclude-nl.tar.gz |
le_stat_spark.py | It identifies SMTP servers that use certificates issued by Let’s Encrypt. | merged_data, the output of antago_syix.py, all-mx-exclude-nl.tar.gz |
rollover_groupby.py | It groups merged_data entries that share the same SMTP server. | merged_data |
ever_matched.py | It evaluates whether mismatched TLSA records can be matched with outdated certificates. | the output of rollover_groupby.py |
find_case.py | It classifies domains into managing categories. | the datasets in popularity_data.tar.gz |
map_case.py | It merges domain data with their DANE validation results. | the output of dane_validation.py and find_case.py |
mx_dn_serving.py | It calculates the number of domains served by an SMTP server and its DANE validity. | the output of map_case.py |
case_stat.py | It generates statistics of DANE validation results for domains according to managing categories. | the output of map_case.py |
case_tlsa_stat.py | It generates statistics of DANE validation results for SMTP servers according to managing categories. | the output of dane_validation.py, ever_matched.py, and map_case.py |
rollover_candidate.py | It extracts the SMTP servers that have conducted a rollover. | the output of rollover_groupby.py, antago_syix.py, all-mx-exclude-nl.tar.gz |
rollover.py | It evaluates the rollover behaviors of SMTP servers. | the output of rollover_groupby.py, antago_syix.py, rollover-candidate.py, never-matched.py, all-mx-exclude-nl.tar.gz |
rollover_case_target.py | It evaluates rollovers according to managing categories. | the output of map_case.py, rollover-stat.py |
init_deploy.py | It evaluates the initial DANE deployment of SMTP servers. | merged_data, the output of antago_syix.py, gen-init-seed.py, all-mx-exclude-nl.tar.gz |
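For orientation, the core task of DANE validation per RFC 7671 is to check a presented certificate against the selector and matching type of a TLSA record (usage-specific chain rules apply on top). The sketch below assumes the cryptography package and is not the authors' dane_validation.py; it shows the matching step only:
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives import serialization

def tlsa_matches(cert_pem: bytes, selector: int, mtype: int, assoc_data: bytes) -> bool:
    """Check one certificate against one TLSA record (matching step only)."""
    cert = x509.load_pem_x509_certificate(cert_pem)
    if selector == 0:        # Cert(0): the full certificate, DER-encoded
        data = cert.public_bytes(serialization.Encoding.DER)
    elif selector == 1:      # SPKI(1): the SubjectPublicKeyInfo, DER-encoded
        data = cert.public_key().public_bytes(
            serialization.Encoding.DER,
            serialization.PublicFormat.SubjectPublicKeyInfo)
    else:
        return False
    if mtype == 0:           # Full(0): exact match against the association data
        digest = data
    elif mtype == 1:         # SHA2-256(1)
        digest = hashlib.sha256(data).digest()
    elif mtype == 2:         # SHA2-512(2)
        digest = hashlib.sha512(data).digest()
    else:
        return False
    return digest == assoc_data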
Running analysis scripts
After getting the outputs from the Spark scripts, you can analyze them with the scripts in stats-codes.tar.gz, which contains eleven analysis scripts for this purpose. Their output files should be the same as the ones in the Analytics file.
Filename | Description | Input | Output |
---|---|---|---|
cert-pki-cn-stat.py | It calculates stats of SMTP servers that use certificates from public CAs. | the output of cert_pki_cn.py | cert_pki_cn_stat.tar.gz |
mx-dn-serving-stat.py | It calculates stats of the number of domains served by an SMTP server and its DANE validity. | the output of mx_dn_serving.py | mx_dn_serving_stat.tar.gz |
case-stat.py | It calculates stats of DANE validation results for domains according to managing categories. | the output of case_stat.py | case_stat.tar.gz |
case-tlsa-stat.py | It calculates stats of DANE validation results for SMTP servers according to managing categories. | the output of case_tlsa_stat.py | case_tlsa_stat.tar.gz |
never-matched.py | It finds SMTP servers that never have valid TLSA records. | the output of dane_validation.py | - |
rollover-candidate.py | It finds the domains that have conducted a rollover. | the output of rollover_candidate.py | - |
rollover-stat.py | It finds domains that actually conduct rollovers (i.e., excluding SMTP servers that never have valid TLSA records). | the output of rollover.py | - |
rollover-timeline-le.py | It evaluates the rollover behaviors of SMTP servers that use certificates issued by Let’s Encrypt. | the output of rollover.py, le_stat_spark.py, rollover-stat.py | rollover_timeline_le.tar.gz |
rollover-case.py | It evaluates rollover behaviors according to managing categories. | the output of rollover.py, rollover_case_target.py, rollover-stat.py | - |
gen-init-seed.py | It generates a set of SMTP servers that newly published TLSA records during our measurement period. | the output of check_incorrect_reason.py, antago_syix.py, all-mx-exclude-nl.tar.gz | - |
init-deploy-stat.py | It calculates stats of the initial DANE deployment. | the output of check_incorrect_reason.py, init_deploy.py | - |
3. Running our measurement code to collect your own raw datasets (TLSA records and their certificates)
This section introduces the source code we used to collect our datasets: TLSA records and their certificate chains, collected every hour from July 13, 2019 to February 12, 2021. We refer to these measurements as the Hourly dataset (Section 4 in the paper).
What about the Daily dataset? Because the Daily dataset, which contains every domain name under the top-level domains, was collected using zone files provided under agreement with registries, we cannot make it publicly available. Instead, we can provide the intermediary data extracted from the Daily dataset that is needed to run our scripts. If you need the intermediary data, please email us for data access.
The source code and how to use it are the same as the artifacts of the USENIX Security'20 paper. You can refer to Server-side Artifacts / Section 3.