Adding a scoring system in peepdf

From mediawiki.org

Google Summer Of Code-2015: PEEPDF

Name and contact information

Name: Rohit Dua
Email: 8ohit.dua AT gmail DOT com
IRC or IM networks/handle(s): rohit-dua
Location: New Delhi, India
Time-zone: UTC+5:30
Typical working hours: 12:00 pm to 5:00 pm and 8:00 pm to 2:00 am (IST) until August; 6:00 pm to 2:00 am after August.
Nationality: Indian

Synopsis

Currently, suspicious elements in a PDF file can be identified because peepdf shows them in a different color (yellow). While this helps experienced analysts and users with some knowledge of the PDF format and/or threat analysis, it can be difficult for less skilled users to interpret. This project aims to list the elements that help distinguish whether a PDF file is malicious, assign a score to each of those elements (maybe out of 10), and sum the individual scores into an overall maliciousness score for the file.
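
The summing idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Finding` structure, the function names and the example score values are hypothetical and are not part of peepdf; the only thing taken from the synopsis is "a score per element (maybe out of 10), summed into an overall score".

```python
from dataclasses import dataclass

MAX_FACTOR_SCORE = 10  # "maybe out of 10" in the synopsis


@dataclass
class Finding:
    """One suspicious element found in a file (hypothetical structure)."""
    name: str
    score: int


def clamp(score):
    """Keep a per-element score inside the 0..MAX_FACTOR_SCORE range."""
    return max(0, min(MAX_FACTOR_SCORE, score))


def overall_score(findings):
    """Sum the clamped per-element scores into one file-level score."""
    return sum(clamp(f.score) for f in findings)


def report(findings):
    """List elements from most to least suspicious, followed by the total."""
    ordered = sorted(findings, key=lambda f: clamp(f.score), reverse=True)
    lines = [f"{f.name}: {clamp(f.score)}/{MAX_FACTOR_SCORE}" for f in ordered]
    lines.append(f"Overall maliciousness score: {overall_score(findings)}")
    return lines


if __name__ == "__main__":
    # Example values are made up purely for illustration.
    example = [
        Finding("Single page", 2),
        Finding("/OpenAction with JavaScript", 8),
        Finding("Garbage before header", 5),
    ]
    print("\n".join(report(example)))
```

A plain sum keeps the result easy to explain to less experienced users; whether the total should also be normalised is left open here.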

Deliverables

Required Goals:

Factors to be added (if not already present) that may indicate maliciousness:
  • Number of pages
   Generally, single-page PDFs are more likely to be malware.
  • Broken trailer/xref
   May indicate manual modification.
  • Hex/octal encoding in tags
   eg: /#4a#61#76#61#73#63#72#69#70#74 instead of /Javascript
  • Number of filters on a specific stream
  • Type of filter applied
   eg: JBIG2Decode is not expected in text streams
  • Presence of JavaScript/XFA
  • Presence of invalid tokens between objects
   These will be ignored by the PDF reader.
  • Absence of terminator or length tag in streams
   The PDF viewer will read this without error, but without a length tag the stream could run on into other objects.
  • Presence of random garbage before the header
   This is also used for file type cloaking.
  • Presence of unknown elements in the header other than %PDF-xx
  • Presence of additional triggers
  • Presence of Launch/OpenAction with JavaScript
  • Absence of xref/startxref/trailer
  • File structure/names similar to those used in popular exploit kits
  • Colour expressed with more than 3 bytes
  • Backtracking and analysing for PDF syntax obfuscation
   eg: when /Javascript is not called directly but reached through an indirect reference.
  • Objects not referenced from the Catalog
  • Encrypted file with default password
  • File not made with traditional, known editors
   eg: Creating a PDF by exporting from Office (MS/Open/Libre) adds specific metadata.
  • Characters after the last EOF
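
Three of the factors above (hex-encoded names, garbage before the header, characters after the last EOF) can be checked on raw bytes. The following is a minimal sketch and not peepdf's actual implementation; all function names are hypothetical, and the checks deliberately ignore complications such as markers that appear inside streams.

```python
import re

# A PDF name may encode any byte as '#' followed by two hex digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(raw):
    """Replace each #xx escape in a name with the byte it encodes."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def is_obfuscated_name(raw):
    """True if #xx escapes are used to spell ordinary ASCII letters or digits."""
    return any(
        bytes([int(m.group(1), 16)]).isalnum()
        for m in _HEX_ESCAPE.finditer(raw)
    )


def garbage_before_header(data):
    """Number of bytes before the first %PDF- marker, or -1 if none is found."""
    return data.find(b"%PDF-")


def bytes_after_last_eof(data):
    """Non-whitespace bytes trailing the final %%EOF marker, or -1 if none is found."""
    marker = b"%%EOF"
    pos = data.rfind(marker)
    if pos == -1:
        return -1
    return len(data[pos + len(marker):].strip())


if __name__ == "__main__":
    name = b"/#4a#61#76#61#73#63#72#69#70#74"
    print(decode_pdf_name(name), is_obfuscated_name(name))
```

Escaping a space or delimiter inside a name is legitimate, which is why the sketch flags only escapes that hide plain letters or digits.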
Score assignment to each factor
  • The test suite of PDF files will be obtained from online sources
   http://contagiodump.blogspot.in/2013/03/16800-clean-and-11960-malicious-files.html
  • The test suite obtained above will be used to improve the scores
   Along with checking the pre-defined factors, the test suite will be run against the system to identify new factors (such as regular expressions).
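
One possible way to turn a labelled test suite into per-factor scores is to compare how often each factor shows up in malicious files versus clean ones. The sketch below is an assumption made for illustration, not the method fixed by this proposal: the formula, the function name and the factor labels are all hypothetical.

```python
from collections import Counter


def factor_scores(malicious, clean, max_score=10):
    """Score each factor by how much more often it appears in malicious files.

    `malicious` and `clean` are lists of sets; each set holds the names of
    the factors detected in one file of that class.
    """
    if not malicious or not clean:
        raise ValueError("both corpora must contain at least one file")

    mal_counts = Counter(f for factors in malicious for f in factors)
    clean_counts = Counter(f for factors in clean for f in factors)

    scores = {}
    for factor in set(mal_counts) | set(clean_counts):
        # Use rates rather than raw counts so unequal corpus sizes do not skew the score.
        mal_rate = mal_counts[factor] / len(malicious)
        clean_rate = clean_counts[factor] / len(clean)
        scores[factor] = round(max_score * mal_rate / (mal_rate + clean_rate))
    return scores


if __name__ == "__main__":
    # Tiny made-up corpora, purely for illustration.
    malicious = [{"js", "single_page"}, {"js"}, {"js", "garbage"}, {"js", "single_page"}]
    clean = [{"single_page"}, set(), {"single_page"}, {"js"}]
    for factor, score in sorted(factor_scores(malicious, clean).items()):
        print(factor, score)
```

A factor seen only in malicious files gets the maximum score, and one that is equally common in both classes lands in the middle of the range.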
   

Optional Goals:

  • Link with peepdf web interface


Project schedule

Timeline Task
Apr 21 - May 19 Get familiar with code base, bond with community
May 19 - May 26 University Examinations
May 26 - May 30 Add factors that define maliciousness (broken trailer/xref, presence of triggers, etc.) that are not already present in peepdf
May 30 - Jun 05 Develop code to detect random garbage before the header, after the last EOF, and between one object's terminator and the next object's initiator
May 30 - Jun 05 Create a list of valid filters for each data type they are applied to
Jun 05 - Jun 10 Develop code to backtrack and analyse PDF syntax obfuscation and search for JavaScript/JS tags
Jun 10 - Jun 22 Add/update file structures/patterns/signatures for popular exploit kits that support PDF creation (database/JSON)
Jun 23 Mid Term Evaluation
Jun 24 - Jul 05 Develop a system (possibly simple machine learning?) to obtain/improve scores against clean and malicious PDF files
Jul 05 - Jul 15 Display suspicious elements sorted by score and give the file an overall maliciousness score
Jul 15 - Jul 25 UI Polishing, Bug fixing
Jul 25 - Aug 18 Code clean-up, documentation + buffer time for unforeseen delays

* The plan above may go as expected, or time may be redistributed among the tasks.


Participation

During my working hours, I will always be logged in to IRC (channels: #gsoc-honeynet, #gsoc) and can always be reached by email. I'm a computer addict and have a hard time staying off it. All source code I write will be published to my GitHub repo along with the official git repository of peepdf (on the default or development branch).
At each stage of development I would like to discuss implementation details with the mentors so that there are no delays/issues later on. If I have other doubts or need feedback, I will head over to the mailing list.


About you

My name is Rohit Dua, and I'm currently pursuing my B.Tech in Electronics and Communication at Jaypee Institute of Information Technology, Noida, India. My hometown is New Delhi, India.
I code in Python/JavaScript/C/C++.
I'm passionate about computer security/automation, and coding gets me high! This is my second consecutive year in Google Summer of Code. Last year (2014) I contributed to the MediaWiki organization, developing an online tool + bot (Python + shell), http://tools.wmflabs.org/bub/, which uploads books from Google Books to the Internet Archive.
When I first heard about open source at a Linux User Group meetup at my university, I went crazy about it: I had always thought there was no such thing as free bread, but there had always been free knowledge. This is why I love open source :-) I feel I can grow and learn much faster through community bonding in the open-source universe.
Google Summer of Code will be my top priority, and I will happily treat it as a full-time job.


Past open source experience

GitHub profile: rohit-dua
GSOC-2014->Mediawiki:
OWCS-2014(Owasp Winter Code Sprint)->OTWF

Related Projects

I have been building a headless browser that randomizes its fingerprint: http://github.com/rohit-dua/selkie. I recently learned about the thug project (Honeynet), which is somewhat similar to the project I have been working on. Although I love the thug project, I'm more interested in how malware spreads and in security. That's why I chose the peepdf project.

Short Q/A (as given in https://www.honeynet.org/gsoc/form)

Q) Top project choice (can be one of our project suggestions or your own)
Project 13 - PEEPDF2: Adding a scoring system in peepdf
Q) Are you willing and able to work on other projects instead?
Yes, I am willing to work on projects relating to thug/peepdf.
Q) Please describe your preferred coding languages and experience
Python - 60%
C/C++ - 40%
Javascript - 30%
Assembly - 20%
Q) Please describe any Windows, Unix or Mac OS X development experience relevant to your chosen project
Past Open source experience
I develop in a Linux environment.
Q) Please describe any previous usage with Honeynet Project tools or honeypots in general
I do not have much experience with Honeynet Project tools, but I have used wireshark extensions (wiresocks). Recently I have been studying the source code of thug in order to improve my tool selkie (see Related Projects).
Q) Please describe any previous Honeynet Project or honeypot related development experience, including details of any patches, code or ideas you may have previously submitted
Before submitting a proposal for peepdf, I was interested in building something similar to thug (I didn't know thug existed!).
Q) Please describe any previous Open Source development experience, including projects you have worked on
Past Open source experience
Q) What school do you attend and what is your specialty/major at the school?
Jaypee Institute Of Information Technology, Noida
Major: Electronics and Communication(graduation year: 2016)
Q) What city/country will you be spending this summer in?
New Delhi, India
Q) How much time do you expect to have for this project?
Typical working hours: 12:00 pm to 5:00 pm and 8:00 pm to 2:00 am (IST) until August; 6:00 pm to 2:00 am after August.
Time may vary on certain national holidays.
Q) Please list jobs, summer classes, and/or vacations that you'll need to work around
There aren't any classes that I'll need to work around. As for vacations, I might be unreachable for a week at some point, but I'll let the mentors know beforehand.
Q) Have you participated in any previous Summer of Code projects? If so please describe your projects and experience, including what you liked or didn't like about the experience
I participated in GSOC-2014 (last year) with MediaWiki. I built a Python tool + web interface for uploading/transferring books, scraped from online libraries like Google Books, to the Internet Archive. The web interface can be seen at http://tools.wmflabs.org/bub/. I enjoyed the whole experience and am still maintaining the tool.
Q) Have you applied for (or intend to apply for) any other Google Summer of Code 2015 projects? If so, which ones?
I am focusing on only one project at this time, so this is my only submission in Google Summer of Code 2015.
Q) Have any of our members met you face to face, such as at one of our recent public events (Paris, San Francisco Bay Area, Dubai and Warsaw)? If so, please list who/where?
No, I haven't, but I would love to. :-)