RESIST is a tool that redacts sensitive information from source code without compromising program comprehension: it detects sensitive information in the code and replaces it with meaningful alternatives.
Why do we need such a tool?
Source code leaks predate the recent adventures of Lulzsec and Anonymous. In the past decade, there have been a number of cases where source code containing sensitive information was leaked from companies like Cisco, Facebook, Microsoft and Symantec. Such leaks not only make innocent users vulnerable to security threats, but almost inevitably result in public humiliation for the company. Programmers work under impossible deadlines and often include profanities in the code, which is embarrassing for the company once the code is publicly available for scrutiny.
Why is it difficult?
While detecting sensitive information is a challenge, removing it is even more difficult. Mere word matching doesn't always work, because programmers combine and morph words to form variable names. Further, blindly removing or obfuscating source code severely reduces program comprehension, making it a maintenance nightmare.
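To illustrate the point about morphed variable names, here is a minimal Python sketch. It is not RESIST's detection code; the word list, the function names and the sample line are all hypothetical. It only shows that whole-word matching misses sensitive words embedded in identifiers, while splitting identifiers into components first finds them.

```python
import re

# Hypothetical list of sensitive words (an assumption for this sketch).
SENSITIVE_WORDS = {"netscape", "password"}


def whole_word_hits(source, words):
    """Naive matching: a word counts only when it stands alone in the text."""
    found = set()
    for word in words:
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, source, flags=re.IGNORECASE):
            found.add(word)
    return found


def split_identifier(identifier):
    """Split camelCase and snake_case identifiers into lowercase components."""
    parts = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [part.lower() for part in parts]


def identifier_hits(source, words):
    """Split every identifier in the text, then match its components."""
    found = set()
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        found.update(set(split_identifier(identifier)) & set(words))
    return found


if __name__ == "__main__":
    line = "int netscapeSecurityLevel = checkUserPassword(input);"
    print(sorted(whole_word_hits(line, SENSITIVE_WORDS)))  # []
    print(sorted(identifier_hits(line, SENSITIVE_WORDS)))  # ['netscape', 'password']
```

Even this splitting step misses abbreviations such as `pwd`, which is part of why detection alone remains a challenge.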
While tools like sed, grep and find give you a place to start, a huge amount of manual effort is still required to sanitize the code. Starting with only the source code and documentation of a software project, RESIST tries to automatically find sensitive information in the code and replace it with meaningful words that balance privacy and program comprehension.
How does it work?
Well, it's a complicated architecture that builds upon our previous work on code search and program comprehension. Below is the workflow [pdf].
To make it a bit clearer, here is a 5-minute video. You can try it out online with your Java code and documentation.
Can you show some examples of refactoring done by RESIST?
The following is from the code of Symantec's pcAnywhere software that was released by Anonymous. The code handles Netscape Navigator security.
Before applying RESIST, it is clear that the code handles security for a particular browser, Netscape Navigator.
After applying RESIST, the code only indicates that it relates to web security, which prevents an attacker from designing an attack that targets a vulnerability of a specific browser.
Does RESIST find meaningful alternatives? How does it find all relevant synonyms?
Yes, because we use WordNet to find synonyms. We first split an identifier into its separate components, then generate the synonyms for each element and recombine them to find which combination satisfies privacy and security value ranges. The optimal choice is used to refactor the source code.
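The split, substitute and recombine steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not RESIST's implementation: the small `SYNONYMS` table stands in for WordNet lookups, and the `privacy` and `comprehension` scores and the threshold are hypothetical proxies for the value ranges mentioned above.

```python
import re
from itertools import product

# Hypothetical stand-in for WordNet lookups (an assumption for this sketch).
SYNONYMS = {
    "netscape": ["web", "browser"],
    "password": ["identification", "credential"],
}
SENSITIVE = set(SYNONYMS)


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase components."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def recombine(components):
    """Join components back into a camelCase identifier."""
    return components[0] + "".join(c.capitalize() for c in components[1:])


def privacy(original, candidate):
    """Hypothetical score: share of sensitive components that were replaced."""
    positions = [i for i, c in enumerate(original) if c in SENSITIVE]
    if not positions:
        return 1.0
    hidden = sum(1 for i in positions if candidate[i] not in SENSITIVE)
    return hidden / len(positions)


def comprehension(original, candidate):
    """Hypothetical score: share of components left unchanged."""
    same = sum(1 for o, c in zip(original, candidate) if o == c)
    return same / len(original)


def refactor_identifier(identifier, min_privacy=1.0):
    """Pick the candidate that meets the privacy threshold and changes least."""
    original = split_identifier(identifier)
    options = [[c] + SYNONYMS.get(c, []) for c in original]
    candidates = [list(combo) for combo in product(*options)]
    allowed = [c for c in candidates if privacy(original, c) >= min_privacy]
    if not allowed:
        return identifier  # nothing satisfies the threshold; leave unchanged
    best = max(allowed, key=lambda c: comprehension(original, c))
    return recombine(best)


if __name__ == "__main__":
    print(refactor_identifier("netscapeSecurityManager"))  # webSecurityManager
    print(refactor_identifier("loopCounter"))              # loopCounter
```

Enumerating every combination is only practical for short identifiers with few synonyms per component; ties are broken here by the order of the synonym lists, which is an arbitrary choice for the sketch.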
Does RESIST's refactoring lead to source code that can be safely distributed? How well are the secrets hidden? Can programmers still work with the modified code?
To answer these questions, we conducted a case study with 67 programmers of varied backgrounds: undergraduates, graduates, Ph.D. students, and professional programmers from different parts of the country. You can find the results here.
The test included a questionnaire in which subjects were randomly shown original source code from Symantec pcAnywhere and code refactored by RESIST. You can take the test here.
We got interesting insights from our participants:
"The obfuscation was totally amazing on a side, but to tell the truth,
after page 3 and 4 I got used to some common words which are generally
appear like "let*" prefix and for writer codes, the "drop" word,
"identification" for password and so on. But hm yeah at least this way
the code is staying coherent in the natural language view in every
source code for a project. Afterwards this recognition, the codes were
a bit easier to understand (not like If I get the original code
though). Nonetheless, the code results were very annoying so I think
it is doing a great job :) And the totally amazing was the brilliant
name refactorings everywhere in the comments and other places which
shows a great text-processing. I did not see anywhere mistakes with
this." - Bela Ujhaz, Siemens
Which software projects did you use for experiments?
Initially we based our experiments on four open source projects that belong to very different verticals. Luckily, as we were working on them, Lulzsec and Anonymous kept releasing different pieces of commercial software. We used their releases strictly for research and educational purposes, as we did with the open source software artefacts.
While the open source software setups are available for download for your evaluation, we refrain from mirroring the commercial software for ethical reasons. If you are interested or have any specific questions regarding the results, contact me.
Commercial software
Sony PlayStation Network aka swonage
Symantec pcAnywhere
In pipeline:
Symantec Norton Antivirus
Half-Life
Open source
ImageJ
ImageJ is a public domain Java image processing program inspired by NIH Image for the Macintosh.
Onebook setup: source and documentation used for experiment
Results - Result set for 100 iterations on onebook
Cobalt Personal Pages
Cobalt setup: source and documentation used for experiment
Results - Result set for 100 iterations on Cobalt Personal Pages
Opentaps
Opentaps, an open source ERP + CRM, is a fully integrated application suite that brings together top-tier open source projects to help manage businesses more effectively.