RESIST: REdacting Sensitive Information in Software artifacTs

What is Resist?

RESIST is a tool that redacts sensitive information from source code without compromising program comprehension; that is, it detects sensitive information in source code and replaces it with meaningful alternatives.

Why do we need such a tool?

Source code leaks predate the recent adventures of Lulzsec and Anonymous. In the past decade, there have been a number of cases in which source code containing sensitive information was leaked from companies such as Cisco, Facebook, Microsoft, and Symantec. Such leaks not only make innocent users vulnerable to security threats, but almost inevitably result in public humiliation for the company. Programmers work under impossible deadlines and often include profanities in the code, which becomes embarrassing for the company once the code is publicly available for scrutiny.

Why is it difficult?

While detecting sensitive information is a challenge, removing it is even more difficult. Mere word matching doesn't always work, because programmers morph and combine different words to form variable names. Further, blindly removing or obfuscating source code severely reduces program comprehension, making maintenance a nightmare.
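To illustrate why plain word matching falls short, here is a minimal Python sketch. The watch list, the sample line, and all function names are hypothetical and are not taken from RESIST; the sketch only contrasts whole-word matching with matching on the parts of a split identifier.

```python
import re

# Hypothetical watch list of sensitive terms (an assumption for illustration).
SENSITIVE = {"netscape", "password"}

def whole_word_hits(line):
    """Naive approach: report watch-list terms that appear as whole words."""
    return {w for w in SENSITIVE
            if re.search(rf"\b{re.escape(w)}\b", line, re.IGNORECASE)}

def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase parts."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]

def token_hits(line):
    """Report watch-list terms that appear as parts of any identifier."""
    hits = set()
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
        hits.update(p for p in split_identifier(ident) if p in SENSITIVE)
    return hits

line = "int rc = checkNetscapePassword(userPwd);"
print(whole_word_hits(line))  # empty: both terms are buried inside one identifier
print(token_hits(line))       # finds both "netscape" and "password"
```

Even the split-based check misses the abbreviation in `userPwd`, which hints at why detection needs more than a word list.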

While tools like sed, grep, and find give you a starting point, a huge amount of manual effort is still required to sanitize the code. Starting with only the source code and documentation of a software project, RESIST tries to automatically find sensitive information in the code and replace it with meaningful words that balance privacy and program comprehension.

How does it work?

Well, it's a complicated architecture that builds upon our previous work on code search and program comprehension. Below is the workflow [pdf].

To make it a bit more clear, here is a 5-minute video. You can also try it out online with your own Java code and documentation.

Can you show some examples of refactoring done by RESIST?

The following is from the code of Symantec's pcAnywhere software that was released by Anonymous. The code handles Netscape Navigator security.

Before applying RESIST, it's clear that the code handles security for a particular browser, Netscape Navigator.

After applying RESIST, the code indicates only that it relates to web security, which prevents an attacker from designing an attack that targets a vulnerability of a specific browser.

Does RESIST find meaningful alternatives? How does it find all relevant synonyms?

Yes, because we use WordNet to find synonyms. We first split an identifier into its separate components, then generate synonyms for each component and recombine them to find which combinations satisfy the privacy and security value ranges. The optimal combination is used to refactor the source code.

Does RESIST's refactoring lead to source code that can be safely distributed? How well are the secrets hidden? Can programmers still work with the modified code?

To answer these questions, we conducted a case study with 67 programmers of varied backgrounds, including undergraduates, graduate students, Ph.D. students, and professional programmers from different parts of the country. You can find the results here.

The test included a questionnaire in which subjects were randomly shown original source code from Symantec pcAnywhere and code refactored by RESIST. You can take the test here.

We got interesting insights from our participants:

"The obfuscation was totally amazing on a side, but to tell the truth, after page 3 and 4 I got used to some common words which are generally appear like "let*" prefix and for writer codes, the "drop" word, "identification" for password and so on. But hm yeah at least this way the code is staying coherent in the natural language view in every source code for a project. Afterwards this recognition, the codes were a bit easier to understand (not like If I get the original code though). Nonetheless, the code results were very annoying so I think it is doing a great job :) And the totally amazing was the brilliant name refactorings everywhere in the comments and other places which shows a great text-processing. I did not see anywhere mistakes with this." - Bela Ujhaz, Siemens

Which software projects did you use for experiments?

Initially we based our experiments on four open source projects that belong to very different verticals. Luckily, as we were working on them, Lulzsec and Anonymous kept releasing different commercial software. We used their releases strictly for research and educational purposes, as we did with the open source software artefacts. While the open source software setups are available for download for your evaluation, we refrain from mirroring the commercial software for ethical reasons. If you are interested or have any specific questions regarding the results, contact me.

Commercial software

  • Sony PlayStation Network aka swonage
  • Symantec pcAnywhere
In pipeline:
  • Symantec Norton Antivirus
  • Half-Life

Open source

ImageJ

ImageJ is a public domain Java image processing program inspired by NIH Image for the Macintosh.

Onebook

Onebook is a Web-based application that allows students and teachers to share information via a consistent interface.

Cobalt Personal Pages

  • Cobalt setup Source and documentation used for experiment
  • Results - Result set for 100 iterations on Cobalt Personal Pages

Opentaps

Opentaps, an open source ERP + CRM, is a fully integrated application suite that brings together top-tier open source projects to help manage businesses more effectively.

  • Opentaps - Download original project
  • Opentaps Source and documentation used for experiment
  • Results - Result set for 100 iterations on Opentaps