April 9, 2014 / Jim Fenton

Adventures with DNSSEC Part 1: Checking signatures

Some signatures from the Declaration of Independence

A confession: “Deploy DNSSEC” has been on my to-do list for at least a couple of years. During that time, I have been pinging my domain registrar to allow registration of DS records so that my domain’s DNS can be authenticated properly. I have used their delay in making this possible as an excuse to push that to-do list item down to the bottom.

I recently attended TrustyCon, an alternative security conference held during the RSA Conference a month or so ago. At TrustyCon, Jeff Moss, organizer of DefCon and other security conferences, gave a talk entitled “Community Immunity” that addressed security from a public health point of view: we need to be secure both to protect ourselves and the community. A video of Jeff’s talk is on YouTube, starting at 6:06:00.

Using a caching name server that checks DNSSEC response signatures was one of his examples of protecting oneself; signing your own domain with DNSSEC protects others who use it. Understanding this distinction started me thinking about DNSSEC not as one big thing to do, but as two or more. And since I run my own caching name server and checking signatures is supposed to be easy, why not?

According to Jeff’s talk, turning on DNSSEC verification is as simple as putting the following in your named.conf file (assuming you’re running BIND, of course):

options {
        dnssec-enable yes;
        dnssec-validation yes;
};

So I thought I’d give it a try, but I did a little homework first. I found lots of confusing information, including the fact that since BIND 9.5 (I’m running 9.7), dnssec-enable and dnssec-validation have both defaulted to yes. So this must not be the reason I wasn’t checking signatures.

After looking around a while, I found a tutorial on HowToForge that describes the DNSSEC deployment process on Debian Linux in some detail. It doesn’t clearly separate the validation steps from the signing steps, but step 2 told me what was missing: my configuration didn’t specify the root keys that should be trusted by my name server. To fix this problem, I added the line:

include "/etc/bind/bind.keys";

at the bottom of my configuration, and restarted the name server.  It immediately crashed with a segmentation fault!

After a bit of hunting around, I found Debian bug #630733, which describes a segmentation fault under some circumstances when starting BIND. The circumstances didn’t quite match mine, but it gave me the clue I needed: my system had an empty file, /var/cache/bind/managed-keys.bind, that was confusing BIND. After removing that empty file, the name server worked fine.

To test, I tried resolving the intentionally mis-signed domain dnssec-failed.org on a machine that uses this name server:

fenton@kernel:~$ dig soa dnssec-failed.org

If the name server is checking signatures, the status returned will be SERVFAIL. If it is not checking signatures, the status will be NOERROR.
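If you want to script this check rather than eyeball dig’s output, something like the following works (a sketch in Python; the exact header text is an assumption based on standard dig output, and dnssec-failed.org is Comcast’s public test domain):

```python
import re
import subprocess

def dig_status(output: str) -> str:
    """Extract the status field from dig's response header line."""
    match = re.search(r"status: (\w+)", output)
    return match.group(1) if match else "UNKNOWN"

def resolver_validates(domain: str = "dnssec-failed.org") -> bool:
    """True if the local resolver returns SERVFAIL for the badly signed
    domain, i.e., it is actually checking DNSSEC signatures."""
    result = subprocess.run(["dig", "soa", domain],
                            capture_output=True, text=True)
    return dig_status(result.stdout) == "SERVFAIL"
```

A validating resolver yields SERVFAIL for the test domain; one that ignores signatures yields NOERROR, so `resolver_validates()` returning False means signatures aren’t being checked.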

Like me, you might be wondering whether this is a good idea. Might DNS resolution for some domains be broken by inadvertent bad signatures? Using the above dnssec-failed test, I found that both Google (8.8.8.8) and Comcast (75.75.75.75) are checking DNSSEC signatures on their public resolvers. If it doesn’t cause a problem for them, it won’t for me.

I’m still struggling with deploying DNSSEC on the signing side, so I’ll leave that for a subsequent article.


March 24, 2014 / Jim Fenton

Identity and Attribute Providers


One of the more unconventional but important aspects of the National Strategy for Trusted Identities in Cyberspace (NSTIC) is its model of attribute providers (APs) as distinct from identity providers (IdPs). However, this concept does not seem to be fully embraced by many who are active in the Identity Ecosystem Steering Group (IDESG), the organization that is working to turn the NSTIC vision into a reality.

Identity providers in current identity management systems, primarily those that are enterprise-focused or based on “social identity” like Facebook, act as attribute providers as well. In an enterprise, you would typically pass your login credentials (typically username and password) to an application that would in turn use a protocol like LDAP or Active Directory to verify the credentials with an identity provider. If the credentials are valid, the identity provider returns attributes about you, e.g., name, employee ID, department ID, and job title, to the enterprise application which uses the attributes to decide what you’re authorized to do.

Social login operates somewhat differently because the application isn’t necessarily trusted to receive your credentials. So Facebook Login collects your username and password directly, and uses the OAuth protocol to return your attributes, including name, time zone, friends list, and any links you have shared, to the application requesting the login. As in the enterprise login case, the attribute provider and the identity provider are one and the same.

In the broader context of NSTIC, there are several reasons why identity providers and attribute providers can’t be one and the same:

  • Different attribute providers are authoritative for different attribute classes – In an enterprise, the enterprise itself is authoritative for nearly all attributes of interest. But in the broader NSTIC use case, there isn’t a common point that all parties trust. Users typically will have different providers for different types of attributes: proof that you’re a full-time student might come from your school district or university, an assertion that you’re an adult might come from your motor vehicle department, and your credit-worthiness might come from one of the major credit bureaus. Requiring these all to be asserted by the identity provider requires it to be trusted by basically everyone, and that’s hard to achieve.
  • Users need to be able to choose their identity provider – In the course of processing transactions for you, your identity provider is exposed to a great deal of information about where you use your identity. For that reason, the principle of IdP choice described in the NSTIC strategy document is very important. To make that choice meaningful, we have to minimize the trust in the IdP required by others, such as relying parties. Except for self-asserted attributes, where no trust is required, attribute assertions by the IdP require relying parties to consider the IdP authoritative for those attributes. That severely constrains the range of IdPs users can choose from, making it harder for users to find an IdP they can trust with this intimate information.
  • Support for anonymous and pseudonymous interactions is required – NSTIC recognizes the need to support anonymous and pseudonymous interactions in order to facilitate important uses that might not occur otherwise. If user attributes accompany every use of an online identity, these types of interactions are not possible. An IdP can simply assert an identifier, which should be opaque (not divulging any other information about the user). In many cases, identifiers may also be directed (different for each place you use your identity, so that your activities aren’t as easily correlated) and sometimes ephemeral (different for each session). Depending on the specific use, some attributes might be provided with the consent of the user, such as an assertion that the user is of legal age, without identifying the specific user.
  • Attribute providers must be insulated from sensitive information – When you use your driver’s license to prove that you’re of legal age, the issuer of that license doesn’t generally get information about where that ID has been checked. Given the sensitivity of some online transactions, the same characteristic is desirable: in most cases, the authoritative source for an attribute isn’t entitled to know how and where it is used. For this reason, it may be preferable to route attribute queries through the IdP to insulate the relying party from attribute providers. This characteristic isn’t called out explicitly in the NSTIC, but is a privacy enhancing technology that might be employed to prevent attribute providers from tracking users’ use of their online identities. This, in turn, motivates an arms-length relationship between users’ IdPs and attribute providers.

While some IdPs may also operate attribute providers (particularly for self-asserted attributes, which like the IdP are on behalf of the user), it’s cleaner to think of the IdP and AP as separate functions that may incidentally be operated by the same entity, subject to the arms-length concern mentioned above. More generally, an attribute provider is somewhat like a relying party, in that it receives a trustable assertion of an identifier from the user’s IdP representing that user. IdPs, since they represent the user, may also serve as directories of APs where attributes for a given user can be found. This may also limit the leakage of information about the user that comes from their choice of attribute providers. The use of a particular state DMV as an attribute provider correlates strongly with residence in that state, while the assertion provided might actually be signed on behalf of a broader authority such as AAMVA.

An area where the combined IdP/AP model seems to dominate thinking is identity proofing, which is the binding of an online identity with trusted real-world attributes, such as the user’s legal name. In the combined model, one needs to go through a process, either in person or through association with an existing relationship such as a bank account, prior to the issuance of a credential. This is important because the credentials in these cases often incorporate those identifying attributes, as a driver’s license or government PIV card has your name printed on it and incorporated into a magnetic stripe and/or chip. But when attribute providers are separate, they need an assertion from the user’s IdP to bind the attributes they are verifying to that digital identity, so the credential needs to be issued first. Identity proofing is a function of the attribute provider, not the identity provider, in this model.

The combined IdP/AP thinking also affects how one views a credential. We use the word “credential” extensively in the offline world to describe a variety of documents and situations, ranging from the presentation of a birth certificate to get a passport to the use of that passport to travel internationally. In the NSTIC authentication model, the user presents their credential to their IdP. It need not contain any attribute information, because the IdP does not need it. This differs from the combined model, where the relying party obtains information, such as the user’s name or employer, directly from a credential like a government PIV card. But in the NSTIC model, the choice of credential is up to the user and IdP, subject to the requirement that it be sufficiently secure to satisfy the relying party.

Illustration is taken from “Identity Systems”, a presentation I gave in late 2009. The entire presentation is available on Slideshare.


March 14, 2014 / Jim Fenton

Commercial vs. Government Surveillance: Which is more dangerous?

Last Sunday evening’s story about data brokers on 60 Minutes is a long-needed heads up to many people about the widespread but largely invisible practices of data brokers who collect, aggregate, and sell information about us.

Monday morning, in an interview at SXSW with Edward Snowden, the question was raised about whether government or commercial surveillance is more of a concern. Snowden’s response was that the government has the ability to prosecute and incarcerate people, that commercial providers don’t, and that we should therefore be more concerned about government surveillance:

Right now, my thinking, I believe the majority’s thinking is that the government has the ability to deprive you of rights. Governments around the world whether it is the United States government, whether it is the Yemeni government, whether it is Zaire, any country: they have police powers, they have military powers, they have intelligence powers. They can literally kill you, they can jail you, they can surveil you. Companies can surveil you to sell you products, to sell your information to other companies, and that can be bad, but you have legal recourse.

Chris Soghoian of ACLU, who was interviewing him, correctly observed:

I am not crazy about the amount of data that Google and Facebook collect. Of course, everything they get the government can come and ask for too. There is the collection that the government is doing by itself and then there is the data that they can go to Google and Facebook and force them to hand over.

But that still may not be the whole story. Is there anything to prevent the government from going to the sort of data brokers described in the 60 Minutes report and simply buying the data they’re looking for, in bulk? I haven’t seen any concrete evidence that this is happening, but I would expect it to be done, either directly or through intermediaries.

Data brokers are typically secretive about the identities of their customers, so it’s highly doubtful that they would admit to this if it is happening. I wonder if there is any sort of public records request that would reveal the existence of those sorts of contracts. If my hypothesis is correct, commercial surveillance is at least as dangerous as government surveillance, since there isn’t much of a distinction in how the data might ultimately be used. And since much of the information is behavioral, it has more potential for error.

Hopefully it doesn’t take another “Edward Snowden” from the data broker world for us to learn all the ways the data brokers use our data.

March 7, 2014 / Jim Fenton

RightsCon – an international experience


I spent the first part of this week at RightsCon, a conference dealing with the human rights issues associated with the internet, including freedom of speech, privacy, security/encryption, surveillance, and ensuring unimpeded access to the internet itself. It was organized by Access and attended by 600 or so people from 65 countries. In many ways, this was an atypical conference for me to attend: much more oriented toward policy than the technology issues that I usually focus on. But I enjoy conferences that stretch my experience, and RightsCon was an opportunity to better understand the reasons people need to protect their privacy on the internet.

I decided to serve as a volunteer for RightsCon, the first time I have attended a conference as a volunteer. This was a fun experience — an opportunity to meet (if briefly) lots of amazing people, help make sessions run smoothly, and help attendees find their sessions. I spent about half the time as a “floater”, and about half staffing the registration desk or information table, or supporting one of the sessions.  When not assigned to other things, we were free to attend sessions. It was a fairly intense three days, but they treated us really well, fed us well, and it was a lot of fun.

I didn’t study the schedule extensively before signing up, so I didn’t do a very good job of specifying the sessions I really wanted to attend. As a result, I missed out on a few sessions that, judging from the Twitter comments, I would have enjoyed. When I volunteer again, I’ll do my homework better.

One of my staffing assignments was for a series of lightning talks on Monday. This included a session from Irina Raicu of the Markkula Center for Applied Ethics at Santa Clara University, presenting on “Are Software Engineers Morally Obligated to Care About Digital Human Rights?” I have met Irina and her colleagues at previous conferences, and this is a critically important topic for the software industry. The big message is that just because something is legal doesn’t make it morally acceptable.

The word ‘diversity’ is overused these days, but it definitely applies here. As I mentioned, 65 countries were represented; this is no small feat considering the difficulty many countries’ citizens have getting a visa to enter the United States. There was a balance of genders (including the GLBT community), and a wide range of attendees’ ages. The conference was greatly enriched as a result. I particularly remember a session where we discussed issues with accessing the internet in some countries. A middle-aged man from Sudan described the situation there, and a young woman sitting next to him, who it turns out is from Azerbaijan, was able to compare it with the situation in her own country. In another session, another attendee sitting next to me spoke up with a comment. He was from Egypt, and commented from the perspective of someone who experienced the turmoil there first-hand.

The feedback I would give the organizers would be to provide more categorization of the sessions in the program. I was looking for more technical content, and a couple of sessions I attended that I thought were more technical turned out not to be. Perhaps some keywords in the schedule would make it easier to choose sessions. I also observed, and heard from others, that several talks on similar topics of interest were scheduled against each other. That might be more obvious with keywords as well.

Overall, it was three very intense days, but time well spent. Next year’s RightsCon is in Manila, Philippines, so I don’t expect to attend, but I learned a lot this week and that was the objective.

February 11, 2014 / Jim Fenton

Some thoughts about Snowden: A middle ground


I thought it would be useful to write down my thoughts on the past months’ disclosures from Edward Snowden, the contractor who made off with and has leaked a vast trove of classified NSA documents. The main reason for doing this is to help gel my own opinions, but it may be interesting to others as well. The Snowden situation is very complex, and as you will see I can’t label him as either a hero or a traitor: he is a little of both, or a little of neither.

Read more…

December 9, 2013 / Jim Fenton

More Fun with Big(ger) Data: MongoDB


Last week, I blogged about my experiences using MySQL with a relatively large set of data: 10 GB and 153 million records. A lot of things worked well, but a few were surprisingly slow. This week, I’ll try to do roughly the same things with MongoDB, a popular “NoSQL” database, to get experience with it and to test the assertion that NoSQL databases are better suited for Big Data than traditional databases. This is my very first experience with MongoDB, so this is coming from the perspective of a complete newbie. There are probably better ways to do many of the things I’m describing.

One of the nice things about MongoDB is that it is schemaless. One just needs to give names to the columns (now known as fields) in the input data, and it will figure out what type of data is there and store it accordingly. If some fields are missing, that’s OK. So the data import is considerably easier to set up than with MySQL, and it doesn’t generate warnings for rows (now called documents) that have ridiculously long user or domain names.

That is not to say that mongoimport was trouble-free. There is a 2 GB limitation on databases when using MongoDB in 32-bit mode, but I hadn’t paid much attention to it. When I imported the data, it started fast but slowed down gradually, as if it were adding to an indexed database. After running overnight, by morning it had slowed to about 500 records/second. A friend looked at it with me and suggested I check the log file, which contained tens of gigabytes of error messages telling me about the 32-bit limitation (had I known to look). It would have been preferable for mongoimport to throw an error rather than silently log errors until the disk filled up. So if you’re dealing with any significant amount of data, be sure you’re running the 64-bit version (I had to upgrade my Linux system to do this), and remember to check the log files frequently when using MongoDB.

Once I upgraded to 64-bit Linux (a significant task, but something I needed to do anyway), the import went smoothly, and about three times as fast as MySQL.

Here are timings for the same, or similar, tasks as those I tried with MySQL:

Task | Command | Time | Result
Import | mongoimport -d adobe -c cred --file cred --type tsv --fields id,x,username,domain,pw,hint | 1 hr 1 min | 152,989,513 documents
Add index | db.cred.ensureIndex({domain: 1}) | 34 min 29 sec |
Count Cisco addresses | db.cred.find({domain: "cisco.com"}).count() | 0.042 sec | 8552 documents
Count domains | db.cred.aggregate([{ $group: { _id: "$domain" } }, { $group: { _id: 1, count: { $sum: 1 } } }]) | 3 min 45 sec | 9,326,393 domains
Domain popularity | Various | | See below
Count entries without hints | db.cred.find({"hint": {"$exists": false}}).count() | 3 min 39 sec | 109,190,313 documents
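For reference, the two-stage $group pipeline used to count domains can be read as this plain-Python equivalent (a sketch over made-up sample documents, not pymongo code):

```python
def count_distinct_domains(docs):
    """Plain-Python reading of the aggregation pipeline:
    stage 1, {$group: {_id: "$domain"}}, yields one bucket per distinct domain;
    stage 2, {$group: {_id: 1, count: {$sum: 1}}}, counts those buckets."""
    buckets = {doc.get("domain") for doc in docs}
    return {"_id": 1, "count": len(buckets)}

# Hypothetical sample documents for illustration
sample = [
    {"username": "alice", "domain": "example.com"},
    {"username": "bob", "domain": "example.com"},
    {"username": "carol", "domain": "example.org"},
]
```

The second $group stage is what collapses nine million buckets into a single count, which is why it has to touch every distinct domain before returning anything.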

One of the striking differences from MySQL is the command structure. While MySQL operations have somewhat of a narrative structure, MongoDB has a much more API-like flavor. The command shell for MongoDB is, in fact, a JavaScript shell. I’m not particularly strong in JavaScript, so it was a bit foreign, but workable, for me.

Several of the commands were, as expected, faster than with MySQL. But commands that needed to “touch” a lot of data and/or indexes thrashed badly: MongoDB memory-maps its data files, and the process grew to about 90 GB of virtual memory, causing many page faults when the data being accessed were widely dispersed.

It was when I tried to determine the most frequently used domains that things really got bogged down. I tried initially to do this with an aggregate operation similar to the domain count command, but this failed because of a limitation in the size of the aggregation it could perform. I next tried MongoDB’s powerful MapReduce capability, and it seemed to be thrashing the server. I finally wrote a short Python program that I thought would run quickly because the database was indexed by domain and could present the documents (records) in domain order, but even that got bogged down by the thrashing of the database process when it went to get more data. Using a subset of 1 million documents, these methods worked well, but not at the scale I was attempting, at least with my hardware.
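The single sorted pass I had in mind looks roughly like this (a sketch over plain dicts; in the real program the documents would come from a cursor sorted on the indexed domain field):

```python
from itertools import groupby
from operator import itemgetter

def top_domains(docs_in_domain_order, top_n=3):
    """Count each run of identical domains in one pass over documents
    that arrive already sorted by domain, then rank the counts."""
    counts = [
        (domain, sum(1 for _ in run))
        for domain, run in groupby(docs_in_domain_order,
                                   key=itemgetter("domain"))
    ]
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:top_n]

# Hypothetical documents, pre-sorted as an index scan would deliver them
sample = sorted(
    ({"domain": d} for d in
     ["b.com", "a.com", "b.com", "c.com", "b.com", "a.com"]),
    key=itemgetter("domain"),
)
```

The counting itself is cheap; the bottleneck in my case was simply getting the documents off disk in order without thrashing.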

So there were things I liked and didn’t like about MongoDB:

I Liked:

  • API-style interface that translated easily into running code
  • Speed to find records with a limited number of results
  • Loose schema: Free format of documents (records)

I Disliked:

  • Cumbersome syntax for ad-hoc “what if” queries from the shell
  • Speed of processing entire database (due to thrashing)
  • Loose coupling between shell and database daemon: terminating the shell wouldn’t necessarily terminate the database operation

MongoDB is well suited for certain database tasks. Just not the ones I expected.

December 4, 2013 / Jim Fenton

Fun with Big(ger) Data


I have been using databases, primarily MySQL and PostgreSQL, for a variety of tasks, such as recording data from our home’s solar panels, analyzing data regarding DKIM deployment, and of course as a back-end database for wikis, blogs, and the family calendars. I have been impressed by what I have been able to do easily and quickly, but (with the possible exception of the DKIM data I was analyzing while at Cisco) I haven’t been dealing with very large databases. So when people tell me that NoSQL databases are much faster than what I have been using, I have had to take their word for it.

The breached authentication database from Adobe has been widely available on the Internet, and I was curious how a database of that size performs and if there were any interesting analytics I could extract. So I recently downloaded the database and imported it into MySQL and MongoDB, a popular NoSQL database. I’ll describe my experiences with MySQL in the rest of this blog post, and MongoDB in my next installment.

The database contains each account’s username (usually an email address), encrypted password, and in many cases a password hint. The passwords were stored in an “unsalted” form, meaning that two users with the same password will have the same encrypted password, permitting analysis of how many users share a password (though we don’t know what that password is). I’m not interested in anyone’s password, although guessing common passwords from the hints used by different users has become a popular puzzle game in some groups. I’m not interested in email addresses, either. However, it’s interesting (to me) to see what the predominant email providers are, to see how often the same passwords are used, and to experiment with using the database to extract analytics like these from such a large set of data.
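Counting how many accounts share each encrypted password — the analysis that unsalted storage makes possible — is straightforward with a Counter; the ciphertext values below are made up for illustration:

```python
from collections import Counter

# Hypothetical unsalted ciphertexts: identical plaintext passwords
# always encrypt to the identical ciphertext.
encrypted_pws = ["EQ7fIpT7i/Q=", "j9p+HwtWWT8=", "EQ7fIpT7i/Q=", "EQ7fIpT7i/Q="]

shared = Counter(encrypted_pws)
# most_common() puts the most widely shared ciphertext first;
# 'holders' accounts share that (still unknown) password
top_ciphertext, holders = shared.most_common(1)[0]
```

With a salted scheme every ciphertext would be unique and this tally would tell you nothing, which is exactly why salting matters.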

The database, often referred to as users.tar.gz, uncompresses to a file just under 10 GB with about 153 million records. I wrote a simple Perl script to convert the file to a more easily importable form (fixing delimiters, removing blank lines, and splitting email addresses into username and domain). Importing this file (on my 3.33 GHz quad-core i5 with 8 GB memory) took just under 3 hours. I got quite a few warnings, primarily due to extremely long usernames and domain names that exceeded the 64-character limits I had set for those fields.
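My conversion script was in Perl; a rough Python equivalent (assuming a hypothetical pipe-delimited input layout of id|x|email|pw|hint — the real file’s delimiters differed) would be:

```python
def convert_record(line, delim="|"):
    """Turn one raw record into the tab-separated form used for import,
    splitting the email address into username and domain.
    Returns None for blank lines so they can be dropped."""
    line = line.strip()
    if not line:
        return None
    record_id, x, email, pw, hint = line.split(delim)
    # Everything after the last '@' is treated as the domain
    username, _, domain = email.rpartition("@")
    return "\t".join([record_id, x, username, domain, pw, hint])
```

Run over the file line by line, this produces exactly the six fields the import expects, one record per output line.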

Here are a few sample timings:

Task | Command | Time | Result
Import | LOAD DATA LOCAL INFILE "cred" INTO TABLE cred; | 2 hr 58 min | 152,988,966 records
Add index | ALTER TABLE cred ADD INDEX(domain); | 2 hr 15 min |
Count Cisco addresses | SELECT COUNT(*) FROM cred WHERE domain = "cisco.com"; | 0.12 sec | 8552 records
Count domains | SELECT COUNT(DISTINCT domain) FROM cred; | 47 sec | 9,326,005 domains
Domain popularity | SELECT domain, COUNT(*) AS count FROM cred GROUP BY domain ORDER BY count DESC LIMIT 500; | 3 min 0 sec | Top domain: hotmail.com (32,571,004 records)
Create domain table | CREATE TABLE domains SELECT domain, COUNT(*) AS count FROM cred GROUP BY domain; | 9 min 14 sec |
Index domain table | ALTER TABLE domains ADD INDEX(domain); | 5 min 40 sec |
Null out blank hints | UPDATE cred SET hint = NULL WHERE hint = ""; | 3 hr 45 min | 109,305,580 blank hints
Password popularity | SELECT pw, COUNT(pw) AS count, COUNT(hint) AS hints FROM cred GROUP BY pw HAVING count > 1; | | See below

One thing that is immediately obvious from looking at the domains is that the email addresses aren’t verified. Many of the domain names contain illegal characters. But it’s striking to see how many of the domains are misspelled: hotmail.com had about 32.5 million users, but misspelled variants of it accounted for another 62,088, 23,000, 22,171, 19,200, and 15,200 records, and so forth. Quite a study in mistypings!
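One way to hunt for near-miss spellings of a popular domain programmatically (a sketch using Python’s difflib, not something from the original analysis; the similarity cutoff is a tunable assumption) is:

```python
import difflib

def likely_misspellings(domains, target, cutoff=0.85):
    """Return domains whose spelling is close to, but not exactly,
    the target domain, using difflib's similarity ratio."""
    close = difflib.get_close_matches(target, domains,
                                      n=max(len(domains), 1),
                                      cutoff=cutoff)
    return [d for d in close if d != target]

# Hypothetical distinct-domain list for illustration
sample = ["hotmail.com", "hotmial.com", "hotmail.con", "example.org", "gmail.com"]
```

Run against the domains table, this kind of fuzzy match would surface the mistyped variants clustered around each big provider.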

I’m puzzled about the last query (password popularity). I expected that, since the database is indexed by pw, it would fairly quickly give me a count for each distinct pw, like the domain popularity query. I tried this with and without limits, with and without ordering of the results, and with and without the HAVING clause that eliminates unique values of pw. At first the query terminated with a warning that it had exceeded the size of the lock table; increasing the size of a buffer pool took care of that problem. But now it has been running, very I/O bound, for 12 hours, and I’m not sure why it’s taking so much longer than the domain query. If anyone has any ideas, please respond in the comments.

Even with 153 million records, this is nowhere near the scale of many of the so-called Big Data databases, and gives me an appreciation for what they’re doing. I’m not a MySQL expert, and expect that there are more efficient ways to do the above. But I’m still impressed by how much you can do with MySQL and reasonably modest hardware.

