How to reproduce all of the figures in the paper (Sections 4, 5, and 6)?
Our paper is based on two measurement datasets: TLSA records and their corresponding certificates, which we collected over 20 months. Due to the massive size of these datasets, we used Apache Spark to process them in parallel. We provide three approaches to reproduce our measurement results:
- First, you can (1) run our measurement source code to collect your own datasets, (2) analyze the raw datasets you have obtained, and (3) run our plotting scripts. As your datasets will not be exactly the same as the ones we used in the paper, your figures will look different. We therefore recommend this approach for researchers who are interested in extending our work. If you are interested in this approach, start from Section 3 and work backwards through Sections 2 and 1.
- Second, you can (1) download our raw datasets (thus skipping our measurement source code), (2) run our analysis scripts, and (3) run our plotting scripts. Since you do not need to collect your own datasets, this should be faster than the first approach. However, our raw dataset is large, as it spans twenty months; depending on your computational resources, the analysis scripts may take several hours (or days) to run. For your information, we used Spark clusters to analyze the datasets efficiently. If you are interested in this approach, skip Section 3 and start from Sections 2 and 1.
- Finally, you can just use our analytics datasets, which were produced by running our analysis code on the raw datasets. We believe this is the fastest way to check the consistency of the figures in the paper. If you are interested in this approach, you only need to read Section 1.
1. Reproducing the figures from the analytics
This section introduces a very simple way to reproduce all of the figures in the paper by using the analytics datasets and plotting scripts.
Datasets and scripts
(1) Analytic datasets for figures and their gnuplot scripts
Filename | Download | Description |
---|---|---|
Analytics | link | Input datasets for the figures in the paper. |
plotting-scripts.tar.gz | link | Plotting scripts for the 6 figures. |
(2) Details of the gnuplot scripts
Filename | Figure No. in the paper | Input data |
---|---|---|
mx-dn-serving-stat.plot | Figure 2 | mx-dn-[valid or invalid]-serving-stat.txt, which is included in mx_dn_serving_stat.tar.gz |
case-stat.plot | Figure 3 | case-stat.txt, which is included in case_stat.tar.gz |
invalid-reasons.plot | Figure 4 | case-tlsa-stat-[SSDS or SSDO], which is included in case_tlsa_stat.tar.gz |
ever-matched.plot | Figure 5 | case-tlsa-stat-[SSDS or SSDO], which is included in case_tlsa_stat.tar.gz |
le-rollover-daneee.plot | Figure 7 | rollover-timeline-le.txt (in rollover_timeline_le.tar.gz) and cert-pki-cn-stat.txt (in cert_pki_cn_stat.tar.gz) |
le-rollover-ta.plot | Figure 8 | rollover-timeline-le.txt (in rollover_timeline_le.tar.gz) and cert-pki-cn-stat.txt (in cert_pki_cn_stat.tar.gz) |
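For example, to redraw Figure 3 you could extract the archives and run gnuplot on the corresponding script. The commands below are a hypothetical invocation that assumes the input data files are extracted next to the .plot scripts:
tar -xzf plotting-scripts.tar.gz
tar -xzf case_stat.tar.gz     # provides case-stat.txt
gnuplot case-stat.plot        # renders Figure 3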
2. Reproducing the analytics from the raw datasets (our measurement datasets)
This section introduces how to generate the datasets (in the Analytics file) from the raw datasets that we collected using our measurement code.
After executing the analysis scripts, you can use their output files as inputs to the plotting scripts above.
Datasets and scripts
(1) Raw (measurement) datasets and prerequisites for the analysis
Filename | Download | Description |
---|---|---|
hourly dataset | -¹ | TLSA records and their certificates (through STARTTLS) collected for 20 months (July 2019 ~ February 2021) from the EC2 vantage point (Virginia). |
popularity_data.tar.gz | -¹ | Popularity datasets used to identify the managing entities of SMTP servers and name servers. |
all-mx-exclude-nl.tar.gz | -¹ | A list of all SMTP servers in our dataset. |
root-ca-list.tar.gz | link | A list of root CA certificates for verifying certificates. |
public-intermediate-certs.tar.gz | link | A list of intermediate CA certificates and revoked intermediate CA certificates. This data is obtained from the Mozilla wiki. |
¹ Due to the size of these datasets, please email us for data access.
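The two CA lists provide the trust anchors and intermediate certificates used when the analysis verifies the collected certificate chains. As a rough, hypothetical illustration (filenames invented; our scripts perform this verification internally rather than via the openssl CLI), a single chain could be checked manually with:
openssl verify -CAfile root-ca-list.pem -untrusted intermediates.pem server-cert.pem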
(2) Scripts for the analysis
Filename | Download | Description |
---|---|---|
dependencies.zip | link | It includes our crafted python dns package for the Spark scripts. |
raw-merge.py | link | For the sake of simplicity, it merges the collected raw datasets into one single dataset. |
spark-codes.tar.gz | link | It includes pySpark scripts for our analysis. |
stats-codes.tar.gz | link | It includes python scripts for our analysis. |
How to use the datasets and scripts?
(1) Preprocessing the raw datasets.
We collected two raw datasets: TLSA records (via DNS) and their certificates (via STARTTLS).
To use DANE correctly, these two objects have to be matched; thus, we read both datasets with raw-merge.py and generate the output in JSON format.
After downloading the hourly dataset, configure the input and output paths (global variables in the script) and run raw-merge.py:
python3 raw-merge.py 190711 210212
After execution, the merged outputs (merged_data) are placed in the [output_path]/ directory.
The JSON below is an example of a merged_data entry.
...
{
  "domain": "mail.ietf.org.",
  "port": "25",
  "time": "20191031 9",
  "city": "virginia",
  "tlsa": {
    "dnssec": "Secure", // DNSSEC validation result
    "record_raw": "AACBoAABAAIABwABA18yNQRfdGNwBG1haWwEaWV0ZgNvcmcAADQAAQNfMjUEX3..." // DNS wire-format TLSA record, Base64 encoded
  },
  "starttls": {
    "certs": ["LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUdWekNDQlQrZ0F3SUJBZ...", // PEM-format certificates, Base64 encoded
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZBRENDQStpZ0F3SUJBZ...",
              "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ..."]
  }
}
...
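To get a feel for this format, the following is a minimal sketch (not one of the released scripts) that decodes a single merged_data entry. It assumes one JSON object per line and that the crafted dns package follows the standard dnspython API:
import base64
import json

import dns.message  # the crafted "dns" package in dependencies.zip; dnspython-compatible API assumed

with open("merged_data") as f:          # hypothetical path to a merged output file
    entry = json.loads(f.readline())    # assuming one JSON object per line

# The TLSA record is a Base64-encoded DNS response in wire format.
wire = base64.b64decode(entry["tlsa"]["record_raw"])
response = dns.message.from_wire(wire)
print(entry["domain"], entry["tlsa"]["dnssec"])
for rrset in response.answer:
    print(rrset)

# Each certificate is a Base64-encoded PEM blob.
for cert_b64 in entry["starttls"]["certs"]:
    pem = base64.b64decode(cert_b64).decode()
    print(pem.splitlines()[0])          # "-----BEGIN CERTIFICATE-----"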
Now you are ready to run the Spark scripts.
(2) Analyzing the merged datasets
Apache Spark is specialized for processing big data on many cores at once. However, it does not work efficiently when the data items have many dependencies between them. Thus, we first use Spark to extract the information that we are interested in from the raw datasets, and then run an analysis Python script to analyze it in depth.
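As a minimal illustration of this pattern only (not one of the released Spark scripts; paths are placeholders), a pySpark job loads the merged JSON data, projects the fields of interest, and writes a compact intermediate result for the Python analysis scripts:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract-example").getOrCreate()

# Load the merged_data produced by raw-merge.py (JSON, one object per line).
merged = spark.read.json("/path/to/output_path/merged_data")

# Keep only the columns a downstream analysis needs, e.g. domain, time, and DNSSEC status.
extracted = merged.select("domain", "port", "time", "tlsa.dnssec")

extracted.write.mode("overwrite").json("/path/to/extracted")
spark.stop()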
The table below shows which Spark, analysis, and gnuplot scripts are used to produce each result in the paper.
Result | Spark | Analysis | Gnuplot script |
---|---|---|---|
Figure 2 | mx_dn_serving.py | mx-dn-serving-stat.py | mx-dn-serving-stat.plot |
Figure 3 | case_stat.py | case-stat.py | case-stat.plot |
Figure 4 | case_tlsa_stat.py | case-tlsa-stat.py | invalid-reasons.plot |
Figure 5 | case_tlsa_stat.py | case-tlsa-stat.py | ever-matched.plot |
Figure 7 | rollover.py, le_stat_spark.py, cert_pki_cn.py | rollover-timeline-le.py, cert-pki-cn-stat.py | le-rollover-daneee.plot |
Figure 8 | rollover.py, le_stat_spark.py, cert_pki_cn.py | rollover-timeline-le.py, cert-pki-cn-stat.py | le-rollover-ta.plot |
Table 3 | rollover.py, rollover_case_target.py | rollover-case.py | - |
Table 4 | init_deploy.py | init-deploy-stat.py | - |
For example, to get the input dataset for Figure 3, run the Spark script case_stat.py. Next, run the analysis script case-stat.py using the output of case_stat.py as its input. Finally, you can draw Figure 3 with the case-stat.plot script, using the output of case-stat.py as its input.
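In shell form, this workflow looks roughly like the following (hypothetical invocations; the exact input/output paths and arguments depend on how each script is configured, and the --py-files option is explained in the next subsection):
spark-submit --py-files=/path/to/dependencies.zip case_stat.py   # Spark extraction
python3 case-stat.py                                             # in-depth analysis
gnuplot case-stat.plot                                           # draws Figure 3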
Running Spark scripts
The spark-codes.tar.gz file contains sixteen Spark scripts that run on a Spark machine.
The scripts use dns, a third-party library that we modified. You can install this library on the Spark machine, or pass it to the machine when you submit a Spark job by using the --py-files option. For the sake of simplicity, we provide it as a package, dependencies.zip.
spark-submit --py-files=/path/to/dependencies.zip [spark_script.py]
The table below describes each of the Spark scripts that we use for the analyses.
Filename | Description | Input |
---|---|---|
dane_validation.py | It validates DANE based on RFC 7671 (a minimal matching sketch follows this table). | merged_data |
check_incorrect_reason.py | It classifies the reasons for DANE validation failures. | the output of dane_validation.py |
antago_syix.py | It identifies SMTP servers that are served by Antagonist or Syix. | merged_data |
cert_pki_cn.py | It identifies SMTP servers that use certificates issued by public CAs. | merged_data, the output of antago_syix.py, all-mx-exclude-nl.tar.gz |
le_stat_spark.py | It identifies SMTP servers that use certificates issued by Let’s Encrypt. | merged_data, the output of antago_syix.py, all-mx-exclude-nl.tar.gz |
rollover_groupby.py | It groups merged_data entries that share the same SMTP server. | merged_data |
ever_matched.py | It evaluates whether mismatched TLSA records can be matched with outdated certificates. | the output of rollover_groupby.py |
find_case.py | It classifies domains into managing categories. | the datasets in popularity_data.tar.gz |
map_case.py | It merges domain data with their DANE validation results. | the output of dane_validation.py and find_case.py |
mx_dn_serving.py | It calculates the number of domains served by an SMTP server and its DANE validity. | the output of map_case.py |
case_stat.py | It generates statistics of DANE validation results for domains according to managing categories. | the output of map_case.py |
case_tlsa_stat.py | It generates statistics of DANE validation results for SMTP servers according to managing categories. | the output of dane_validation.py, ever_matched.py, and map_case.py |
rollover_candidate.py | It extracts the SMTP servers that have conducted a rollover. | the output of rollover_groupby.py, antago_syix.py, all-mx-exclude-nl.tar.gz |
rollover.py | It evaluates the rollover behaviors of SMTP servers. | the output of rollover_groupby.py, antago_syix.py, rollover-candidate.py, never-matched.py, all-mx-exclude-nl.tar.gz |
rollover_case_target.py | It evaluates rollovers according to managing categories. | the output of map_case.py, rollover-stat.py |
init_deploy.py | It evaluates the initial DANE deployment of SMTP servers. | merged_data, the output of antago_syix.py, gen-init-seed.py, all-mx-exclude-nl.tar.gz |
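For orientation, the core task of DANE validation per RFC 7671 is to check a presented certificate against the selector and matching type of a TLSA record (usage-specific chain rules apply on top). The sketch below assumes the cryptography package and is not the authors' dane_validation.py; it shows the matching step only:
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives import serialization

def tlsa_matches(cert_pem: bytes, selector: int, mtype: int, assoc_data: bytes) -> bool:
    """Check one certificate against one TLSA record (matching step only)."""
    cert = x509.load_pem_x509_certificate(cert_pem)
    if selector == 0:        # Cert(0): the full certificate, DER-encoded
        data = cert.public_bytes(serialization.Encoding.DER)
    elif selector == 1:      # SPKI(1): the SubjectPublicKeyInfo, DER-encoded
        data = cert.public_key().public_bytes(
            serialization.Encoding.DER,
            serialization.PublicFormat.SubjectPublicKeyInfo)
    else:
        return False
    if mtype == 0:           # Full(0): exact match against the association data
        digest = data
    elif mtype == 1:         # SHA2-256(1)
        digest = hashlib.sha256(data).digest()
    elif mtype == 2:         # SHA2-512(2)
        digest = hashlib.sha512(data).digest()
    else:
        return False
    return digest == assoc_data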
Running analysis scripts
After getting the outputs from the Spark scripts, you can analyze them with the scripts in stats-codes.tar.gz, which contains eleven analysis scripts for this purpose. Their output files should be the same as the ones in the Analytics file.
Filename | Description | Input | Output |
---|---|---|---|
cert-pki-cn-stat.py | It calculates stats of SMTP servers that use certificates from public CAs. | the output of cert_pki_cn.py | cert_pki_cn_stat.tar.gz |
mx-dn-serving-stat.py | It calculates stats of the number of domains served by an SMTP server and its DANE validity. | the output of mx_dn_serving.py | mx_dn_serving_stat.tar.gz |
case-stat.py | It calculates stats of DANE validation results for domains according to managing categories. | the output of case_stat.py | case_stat.tar.gz |
case-tlsa-stat.py | It calculates stats of DANE validation results for SMTP servers according to managing categories. | the output of case_tlsa_stat.py | case_tlsa_stat.tar.gz |
never-matched.py | It finds SMTP servers that never have valid TLSA records. | the output of dane_validation.py | - |
rollover-candidate.py | It finds the domains that have conducted a rollover. | the output of rollover_candidate.py | - |
rollover-stat.py | It finds domains that actually conduct rollovers (i.e., excluding SMTP servers that never have valid TLSA records). | the output of rollover.py | - |
rollover-timeline-le.py | It evaluates the rollover behaviors of SMTP servers that use certificates issued by Let’s Encrypt. | the output of rollover.py, le_stat_spark.py, rollover-stat.py | rollover_timeline_le.tar.gz |
rollover-case.py | It evaluates rollover behaviors according to managing categories. | the output of rollover.py, rollover_case_target.py, rollover-stat.py | - |
gen-init-seed.py | It generates a set of SMTP servers that newly published TLSA records during our measurement period. | the output of check_incorrect_reason.py, antago_syix.py, all-mx-exclude-nl.tar.gz | - |
init-deploy-stat.py | It calculates stats of the initial DANE deployment. | the output of check_incorrect_reason.py, init_deploy.py | - |
3. Running our measurement code to collect your own raw datasets (TLSA records and their certificates)
This section introduces the source code we used to collect our datasets: TLSA records and their certificate chains, collected every hour from July 13, 2019 to February 12, 2021. We refer to these measurements as the Hourly dataset (Section 4 in the paper).
What about the Daily dataset? Because the Daily dataset, which contains every domain name under the top-level domains, was collected using zone files provided under agreement with registries, we cannot make it publicly available. Instead, we can provide the intermediary data extracted from the Daily dataset that is needed to run our scripts. If you need the intermediary data, please email us for data access.
The source code and how to use it are the same as the artifacts of the USENIX Security'20 paper. You can refer to Server-side Artifacts / Section 3.