Open source collaboration across agencies to improve HTTPS deployment

(18f.gsa.gov)

104 points · konklone · 9y ago · 19 comments

18 comments · 9 top-level

randomdrake · 9y ago · 2 in thread

Thanks for the work that you're doing on this and answering questions. I had never seen many of the neat things mentioned in the blog post.

While the article did a good job explaining how pshtt works and how it generates data for the reporting, it didn't dive too much into the scanning itself. Since this is posted on Hacker News, I'd love to hear more about the nitty gritty of the data collection itself.

Can you talk about what sort of setup you run, and what sort of technical and interdepartmental challenges you run into scanning, storing, and obtaining data for 1,143 government websites?

hmft · 9y ago

Hi there. First, you've got to begin with the understanding that no one is maintaining a holistic list of federal .gov websites (or at least not one I can get hold of). So, before scanning, we source several public datasets to gather potential .gov hostnames. This was recently described in depth by 18F [https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-p...]. In addition to Censys, GSA's DAP, and the End of Term Web Archive data, our team performs authorized scans of federal agency networks [https://www.whitehouse.gov/sites/default/files/omb/memoranda...] and so we mine that data too. This currently nets ~90k hostnames, of which only about a third are responsive.

For both hostname gathering and HTTPS scanning, we use 18F's domain-scan [https://github.com/18F/domain-scan], which orchestrates the scan and provides parallelization. We use the pshtt scanner to ping each hostname at the root and www for both http and https-- this typically takes 36-48 hours to burn through. Once the scanning is finished, we throw the data from the CSV into mongodb, then generate the report via LaTeX. The trickiest part is probably report delivery, which is a mostly manual process for Very Government reasons.

Most of the bureaucratic challenge is overcome because we've already been doing scans against these executive branch agencies for the past several years, so we're a known quantity, though we do modify our user-agent to clearly point back to us. On the whole, agencies have been very supportive-- the data on Pulse bears that out. Agencies really do want to do the right thing for citizens.
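
As a rough illustration of the "root and www, for both http and https" probing described above, here is a minimal Python sketch that expands one hostname into the four URLs a scanner would visit. It is not pshtt's or domain-scan's actual code; the function name `endpoints` and the normalization choices are assumptions made for this example.

```python
from itertools import product


def endpoints(hostname: str) -> list[str]:
    """Return the four URLs to probe for one hostname.

    Hypothetical helper: covers the root and www forms over http and https.
    """
    # Normalize: trim whitespace, lowercase, drop a trailing DNS dot.
    base = hostname.strip().lower().rstrip(".")
    # Treat "www.example.gov" and "example.gov" as the same base domain.
    if base.startswith("www."):
        base = base[len("www."):]
    return [
        f"{scheme}://{prefix}{base}"
        for scheme, prefix in product(("http", "https"), ("", "www."))
    ]


if __name__ == "__main__":
    for url in endpoints("abilityone.gov"):
        print(url)
```

At roughly 90k hostnames this gives about 360k requests per run, which helps explain why a full pass takes a day or two even with parallelization.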

randomdrake · 9y ago

I appreciate you taking the time for an insightful and detailed response. The link you provided, "Tracking the U.S. government's progress on moving to HTTPS[1]" gave a lot of the details I was looking for.

You might consider mentioning it in this blog post as it does offer interesting background information and technical details.

As a specific example, the actual Python scripts used to generate the data[2] and the data itself[3] give a great deal of insight into the question I had.

[1] - https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-p...

[2] - https://github.com/GSA/https/tree/master/compliance

[3] - https://github.com/GSA/https/tree/master/compliance/data

ycmbntrthrwaway · 9y ago · 2 in thread

I like how https://pulse.cio.gov/ does not work because its certificate is issued for cloudfront.net

ycmbntrthrwaway · 9y ago

Looks like it was fixed just now or there is some round-robin balancing behind it.

konklone (OP) · 9y ago

As it happened, we were migrating production infrastructure to a new service tonight, and had a few minutes of time where the cert was invalid. Sorry about that.

hmft · 9y ago · 2 in thread

Heyo, ^ blogger here. Happy to chat.

konklone (OP) · 9y ago

And 18F/GSA employee and open source collaborator here. =) Can definitely help answer any questions folks have.

newman314 · 9y ago

Are the HTTP report generation/assembly tools available/open-sourced too?

I'd love to be able to use this as a starting point. Thanks.

discreditable · 9y ago · 1 in thread

I was happy to notice not long ago that apod.nasa.gov is now served over HTTPS with a Let's Encrypt certificate. Even OP link is!

hmft · 9y ago

Yeah! Lots of NASA sites use Let's Encrypt certs. Some examples here [https://crt.sh/?Identity=%25nasa.gov&iCAID=16418].

alpb · 9y ago · 1 in thread

One thing I noticed going through the list linked in the page is that many of these .gov pages host _both_ www and no-www versions, making them essentially two different websites with the same content. Example: http://abilityone.gov/ and http://www.abilityone.gov/ It looks like clear guidelines around this are missing. I know of certain countries whose .gov domains are almost 99% www and they don’t serve no-www at all.

konklone (OP) · 9y ago

You're right, this is (unfortunately) very common. I wish there were clearer guidelines about this.

The White House Office of Management and Budget publishes IT policies, and they ask for specific URLs with www in front: https://www.whitehouse.gov/sites/default/files/omb/memoranda...

But I don't think they or anyone would care if the www redirected to the root, or vice versa, as long as it eventually got you there.
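
To make the "as long as it eventually got you there" idea concrete, here is a minimal Python sketch that compares where the root and www forms end up after redirects. It assumes you have already followed the redirects yourself and have the two final URLs in hand; the function name `www_status` and its three labels are invented for this example and do not come from any OMB policy or scanner.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def _bare(host: str) -> str:
    """Hostname with a single leading 'www.' removed."""
    return host[len("www."):] if host.startswith("www.") else host


def www_status(final_root_url: str, final_www_url: str) -> str:
    """Classify how the root and www forms of a site relate.

    Both arguments are the URLs reached *after* following redirects.
    Returns 'converged' (same host), 'split' (hosts differ only by 'www.',
    so the site is served twice), or 'unrelated' (anything else).
    """
    root_host, www_host = _host(final_root_url), _host(final_www_url)
    if root_host == www_host:
        return "converged"
    if _bare(root_host) == _bare(www_host):
        return "split"
    return "unrelated"


if __name__ == "__main__":
    # The abilityone.gov example from the comment above: both forms answer
    # directly, so the two hosts never converge.
    print(www_status("http://abilityone.gov/", "http://www.abilityone.gov/"))
```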

eeZah7Ux · 9y ago · 1 in thread

How mature is pshtt?

konklone (OP) · 9y ago

We (18F/GSA) have been using DHS's tool in production for a few months now, and have fixed various bugs as they've come up.

Before that, pshtt's methodology was replicated in a Ruby tool (site-inspector) that we grafted HTTPS/HSTS detection logic onto, and had that running in production for a year or so.

So in terms of business logic, I think it's pretty mature. If you mean things like having it formally audited or having a dedicated development team, it hasn't gotten there yet. But the more people that use it, the more mature it will get.
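
For anyone wondering what the HSTS part of "HTTPS/HSTS detection logic" involves at its simplest, here is a minimal Python sketch that parses a `Strict-Transport-Security` header value into its directives. It is only an illustration and not code from pshtt or site-inspector; the function name `parse_hsts` and the returned dictionary keys are assumptions for this example.

```python
def parse_hsts(header_value: str) -> dict:
    """Parse a Strict-Transport-Security header value.

    Hypothetical helper: returns max_age (int or None) plus two booleans.
    Directive names are matched case-insensitively.
    """
    result = {"max_age": None, "include_subdomains": False, "preload": False}
    for part in header_value.split(";"):
        name, _, value = part.strip().partition("=")
        name = name.strip().lower()
        # The max-age value may be a bare token or a quoted string.
        value = value.strip().strip('"')
        if name == "max-age" and value.isdigit():
            result["max_age"] = int(value)
        elif name == "includesubdomains":
            result["include_subdomains"] = True
        elif name == "preload":
            result["preload"] = True
    return result


if __name__ == "__main__":
    print(parse_hsts("max-age=31536000; includeSubDomains; preload"))
```

A real scanner also has to decide which endpoint's header counts and whether the max-age is long enough, which is where the accumulated bug fixes in a tool like pshtt matter.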

bertil · 9y ago

This is a very small detail in that post but it captures quite well what officialdom is to me, what separates GSA and 18F from other digital efforts: the inclusion of the “tribal” scale in the list of levels of authority. 18F makes things so that many people can use the Internet including, explicitly, the administration of First Nations.

I’ve complained a lot about how US-based companies do not think about non-US users enough (that common rant is obviously not applicable to GSA, although Americans abroad, immigrants and foreign visitors probably qualify) but in that rant, I have forgotten the original Americans. Shame on me. I have never heard of any start-up asking “What about First Nations? Do we support Cherokee alphabet? Is there a Sioux exception for the law that we are enforcing in that form?”

garrettr_ · 9y ago

pshtt (the HTTPS scanning tool) also powers the results for Freedom of the Press Foundation's recently launched Secure The News project: https://securethe.news. (Full disclosure: I work for FPF, and worked on Secure the News).

It's a promising project, and could use more contributors if anyone here is interested: see https://github.com/dhs-ncats/pshtt/issues for ideas!

DyslexicAtheist · 9y ago

this combines some really important checks. I might be able to remove my .bashrc hack ...

  function certchain() {
      # Usage: certchain <ip|domain[:port]>
      # Display PKI chain-of-trust for a given domain
      # GistID: https://gist.github.com/joshenders/cda916797665de69ebcd
      if [[ "$#" -ne 1 ]]; then
          echo "Usage: ${FUNCNAME[0]} <ip|domain[:port]>"
          return 1
      fi

      local host_port="$1"

      # Default to port 443 when no port was given
      if [[ "$1" != *:* ]]; then
          host_port="${1}:443"
      fi

      # Print the subject (s:) and issuer (i:) line of each certificate in the chain
      openssl s_client -connect "${host_port}" </dev/null 2>/dev/null | grep -E '\ (s|i):'
  }
