While the article did a good job explaining how pshtt works and how it generates data for the reporting, it didn't dive too much into the scanning itself. Since this is posted on Hacker News, I'd love to hear more about the nitty-gritty of the data collection.
Can you talk about what sort of setup you run, and what sort of technical and interdepartmental challenges you run into scanning, storing, and obtaining data for 1,143 government websites?
For both hostname gathering and HTTPS scanning, we use 18F's domain-scan [https://github.com/18F/domain-scan], which orchestrates the scan and provides parallelization. We use the pshtt scanner to ping each hostname at the root and www for both HTTP and HTTPS; this typically takes 36-48 hours to burn through. Once the scanning is finished, we throw the data from the CSV into MongoDB, then generate the report via LaTeX. The trickiest part is probably report delivery, which is a mostly manual process for Very Government reasons.
Most of the bureaucratic challenge is overcome because we've already been doing scans against these executive branch agencies for the past several years, so we're a known quantity, though we do modify our user-agent to clearly point back to us. On the whole, agencies have been very supportive, and the data on Pulse bears that out. Agencies really do want to do the right thing for citizens.
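To make the "root and www for both HTTP and HTTPS" part concrete, here is a rough sketch of the four endpoints that get checked for each hostname. This is not pshtt's own code; the function name `endpoints_for` is made up for illustration, and example.gov is a placeholder domain.

```shell
# Hypothetical helper (not part of pshtt or domain-scan):
# print the four endpoints checked for one hostname.
endpoints_for() {
    # Strip a leading "www." so we always start from the root domain.
    local domain="${1#www.}"
    local scheme prefix
    for scheme in http https; do
        for prefix in "" "www."; do
            echo "${scheme}://${prefix}${domain}"
        done
    done
}

endpoints_for example.gov
```

Each hostname turns into four requests, which is part of why a full pass over every domain takes a while even with parallelization.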
You might consider mentioning it in this blog post as it does offer interesting background information and technical details.
As a specific example, the actual Python scripts used to generate the data[2] and the data itself[3] give a great deal of insight into the question I had.
[1] - https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-p...
[2] - https://github.com/GSA/https/tree/master/compliance
[3] - https://github.com/GSA/https/tree/master/compliance/data
I'd love to be able to use this as a starting point. Thanks.
The White House Office of Management and Budget publishes IT policies, and they ask for specific URLs with www in front: https://www.whitehouse.gov/sites/default/files/omb/memoranda...
But I don't think they or anyone would care if the www redirected to the root, or vice versa, as long as it eventually got you there.
Before that, pshtt's methodology was replicated in a Ruby tool (site-inspector) that we grafted HTTPS/HSTS detection logic onto, and we had that running in production for a year or so.
So in terms of business logic, I think it's pretty mature. If you mean things like having it formally audited or having a dedicated development team, it hasn't gotten there yet. But the more people that use it, the more mature it will get.
I’ve complained a lot about how US-based companies do not think about non-US users enough (that common rant is obviously not applicable to GSA, although Americans abroad, immigrants and foreign visitors probably qualify), but in that rant I have forgotten the original Americans. Shame on me. I have never heard of any start-up asking “What about First Nations? Do we support the Cherokee alphabet? Is there a Sioux exception for the law that we are enforcing in that form?”
It's a promising project and could use more contributors. If anyone here is interested, see https://github.com/dhs-ncats/pshtt/issues for ideas!
function certchain() {
    # Usage: certchain <ip|domain[:port]>
    # Display PKI chain-of-trust for a given domain
    # GistID: https://gist.github.com/joshenders/cda916797665de69ebcd
    if [[ "$#" -ne 1 ]]; then
        echo "Usage: ${FUNCNAME} <ip|domain[:port]>"
        return 1
    fi

    local host_port="$1"

    # Default to port 443 when no port was given
    if [[ "$1" != *:* ]]; then
        host_port="${1}:443"
    fi

    # Keep only the subject (s:) and issuer (i:) lines of the chain
    openssl s_client -connect "${host_port}" </dev/null 2>/dev/null | grep -E '\ (s|i):'
}