Google Books, Internet Archive, Commons upload cycle
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. This was a Google Summer of Code/2014 project/proposal.
(Automation Tool) Google Books > Internet Archive > Commons upload cycle
BUB: Book Uploader Bot
Public URL: //www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle
- Bugzilla report: Bug - 57813
- Hosted on tools-lab: http://tools.wmflabs.org/bub/
- Testing doesn't require login!
- If you are curious, you can check the progress of all uploads on archive.org (requires login), or at https://archive.org/search.php?query=subject%3A%22bub_upload%22&sort=-publicdate
- Maintained on github: https://github.com/rohit-dua/bub
- Progress: [1]
Name and contact information
Name: Rohit Dua
Email: 8ohit.dua@gmail.com
IRC or IM networks/handle(s): rohit-dua
Location: New Delhi, India
Time-zone: UTC+5:30
Typical working hours: 12:00 pm to 5:00 pm and 8:00 pm to 2:00 am (IST) until August; 6:00 pm to 2:00 am after August.
Synopsis
Wikisources around the world rely heavily on Google Books digitizations for transcription and proofreading. Books often disappear from the GB database. Currently, users have to download a book from GB manually, then either upload it to IA (if they want to preserve it) or upload it directly to Wikimedia Commons (again a manual task) with appropriate meta-data.
This project focuses on automating all three steps together. The user only has to give the URL (or identifier) of the book(s) they wish to upload; everything else is automated, and the user is notified only when their intervention is needed.
Core Libraries/tools used:
- internetarchive
- python-requests (replacing urllib2)
- IA-Upload
- Google-Books API
- JSON API (IA)
- SQLAlchemy
- htmlmin (minifier)
- Python-Flask
- Jinja2
Deliverables
Goals of this project:
Required Goals:
- Tool hosted on Tool-Labs with a JavaScript front-end and a Python core.
This will take as input:
LIBRARY_TO_CHOOSE // The library, such as Google-Books. More libraries can be added in the future.
GOOGLE_BOOK_URL OR ID // The ID/URL of the book that will be uploaded to IA and Commons.
FILE_NAME_FOR_COMMONS // The user-defined name for the DjVu file (will be passed to IA-Upload).
EMAIL_ID
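Since the tool accepts either a URL or a bare ID, the first step is to normalize the input. The following is a minimal sketch in Python 3, assuming that Google Books URLs carry the volume ID in an `id` query parameter; the function name `volume_id_from_input` is hypothetical and not taken from the BUB code.

```python
from urllib.parse import urlparse, parse_qs


def volume_id_from_input(value):
    """Return a volume ID from either a bare ID or a URL with an ?id= parameter."""
    value = value.strip()
    if "://" not in value:
        # No scheme present: assume the user pasted a bare identifier.
        return value
    query = parse_qs(urlparse(value).query)
    ids = query.get("id")
    if not ids:
        raise ValueError("no 'id' parameter found in URL: %r" % value)
    return ids[0]
```

URLs that encode the ID in the path instead of the query string would need an extra rule; they are rejected here with a `ValueError` so the user can be asked for the ID directly.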
- Extract meta-data from GB and check if it is Public Domain
Google provides the Google-Books API, which will be used to extract all the details about the book (meta-data) and to check whether it is in the public domain.
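As a rough illustration of this step, the sketch below reads one volume record from the Google Books API (v1) and reduces it to a few fields. It assumes the record exposes `volumeInfo` and an `accessInfo.publicDomain` flag; the helper names and the choice of fields are illustrative assumptions, not the tool's actual code.

```python
import json
from urllib.request import urlopen

# Google Books API v1 endpoint for a single volume.
VOLUME_URL = "https://www.googleapis.com/books/v1/volumes/{volume_id}"


def fetch_volume(volume_id):
    """Fetch the raw JSON record for one volume (performs a network request)."""
    with urlopen(VOLUME_URL.format(volume_id=volume_id)) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_metadata(volume):
    """Reduce a volume record to the fields needed by the later upload steps."""
    info = volume.get("volumeInfo", {})
    access = volume.get("accessInfo", {})
    return {
        "id": volume.get("id"),
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "publisher": info.get("publisher"),
        "published_date": info.get("publishedDate"),
        "language": info.get("language"),
        # A missing flag is treated as "not public domain" to stay on the safe side.
        "public_domain": bool(access.get("publicDomain", False)),
    }
```

Keeping the parsing separate from the network call makes the public-domain check easy to test against saved responses.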
- Check if a book is available on IA
The Internet Archive provides a JSON API for advanced searching. It will be used to check whether the book is already available on IA.
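One way this check could look is sketched below, using the `advancedsearch.php` endpoint with `output=json`. Matching by title is only a heuristic chosen for this example; the proposal does not say which fields are compared, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(title, rows=5):
    """Build a search URL for text items whose title matches the given string."""
    # Double quotes inside the title would break the quoted query, so drop them.
    query = 'title:"%s" AND mediatype:texts' % title.replace('"', " ")
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return SEARCH_URL + "?" + urlencode(params)


def matching_identifiers(result):
    """Pull the item identifiers out of a decoded search response."""
    docs = result.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs if "identifier" in doc]


def find_on_archive(title):
    """Run the search (network request) and return identifiers of candidate matches."""
    with urlopen(build_search_url(title)) as response:
        return matching_identifiers(json.loads(response.read().decode("utf-8")))
```

An empty list suggests the book is not on IA yet; a non-empty list only lists candidates, which may still need a closer comparison of author and year.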
- Download all its pages from GB and convert to PDF/ZIP
The book will be downloaded from Google Books page by page: each page is first downloaded as a PNG/JPG image, and the images are then converted to PDF format for easy upload to IA. A link to proof-of-concept code for the book download is given at the bottom.
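The packaging half of this step can be sketched with the standard library alone. The example below covers only the ZIP option mentioned in the goal: it takes page images that have already been downloaded and bundles them in reading order. PDF conversion would need an imaging library and is not shown; the function name and file naming scheme are assumptions.

```python
import io
import zipfile


def bundle_pages(pages, extension="png"):
    """Pack page images (an iterable of bytes, in reading order) into a ZIP archive."""
    buffer = io.BytesIO()
    # Images are already compressed, so store them without further compression.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for number, data in enumerate(pages, start=1):
            # Zero-padded names keep pages in reading order when sorted as text.
            archive.writestr("page_%04d.%s" % (number, extension), data)
    return buffer.getvalue()
```

The returned bytes can be written to disk or handed to the upload step as a single file.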
- Upload to IA with appropriate meta-data
The Python library internetarchive will be used for this step. Each book uploaded to IA will carry its meta-data (taken from GB), which will help avoid duplicate uploads in the long run.
Files uploaded to IA are OCR'ed so that their text is searchable. This takes time, so users will be notified via email as soon as the OCR is complete. The user's email address, the corresponding URL identifiers, and the entered FILE_NAME_FOR_COMMONS will be stored (SQLite). A web crawler will periodically visit the URLs of the stored identifiers to check whether the OCR has completed.
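The storage described above could be sketched as follows. This uses the standard `sqlite3` module directly rather than SQLAlchemy to keep the example short; the table and column names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    ia_identifier TEXT PRIMARY KEY,
    email         TEXT NOT NULL,
    commons_name  TEXT NOT NULL,
    notified      INTEGER NOT NULL DEFAULT 0
)
"""


def open_store(path=":memory:"):
    """Open (or create) the request store; the default is an in-memory database."""
    connection = sqlite3.connect(path)
    connection.execute(SCHEMA)
    return connection


def remember(connection, ia_identifier, email, commons_name):
    """Record who asked for which upload, so they can be notified later."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR REPLACE INTO requests (ia_identifier, email, commons_name) "
            "VALUES (?, ?, ?)",
            (ia_identifier, email, commons_name),
        )


def pending(connection):
    """Return all requests whose owner has not been notified yet."""
    rows = connection.execute(
        "SELECT ia_identifier, email, commons_name FROM requests "
        "WHERE notified = 0 ORDER BY ia_identifier"
    )
    return rows.fetchall()


def mark_notified(connection, ia_identifier):
    """Flag a request as done once the notification email has been sent."""
    with connection:
        connection.execute(
            "UPDATE requests SET notified = 1 WHERE ia_identifier = ?",
            (ia_identifier,),
        )
```

The periodic checker would loop over `pending()`, test each identifier for OCR completion, and call `mark_notified()` after a successful email.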
- Wait for the OCR; when it completes, notify the user via email
Once the OCR process is complete, the user will be notified via email. The Python library smtplib will be used to send emails.
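A possible shape for this step is sketched below. How OCR completion is detected is not specified in the proposal; the example assumes the item's file list (as a decoded JSON object with a `files` array) gains a file ending in `_djvu.txt` once OCR is done, which is an assumption about IA's processing. The SMTP host, port, and message wording are placeholders.

```python
import smtplib
from email.message import EmailMessage


def ocr_finished(item_metadata, suffix="_djvu.txt"):
    """Guess OCR completion from an item's file list (the suffix is an assumption)."""
    files = item_metadata.get("files", [])
    return any(entry.get("name", "").endswith(suffix) for entry in files)


def build_notification(sender, recipient, ia_identifier, upload_link):
    """Compose the email that tells the user their book is ready for Commons."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Your book %s is ready" % ia_identifier
    message.set_content(
        "OCR has finished for %s.\n"
        "Continue the upload to Commons here:\n%s\n" % (ia_identifier, upload_link)
    )
    return message


def send_notification(message, host="localhost", port=25):
    """Send the message through an SMTP server (placeholder host and port)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)
```

Building the message separately from sending it allows the wording and headers to be checked without a mail server.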
- Upload to Commons using IA-Upload tool.
The emails will contain a link of the form http://tools.wmflabs.org/ia-upload/commons/fill?iaId=ID&commonsName=FILENAME, where ID is the identifier stored previously and FILENAME is the FILE_NAME_FOR_COMMONS taken as input at the beginning. This skips the IA-Upload front page, since users will not have to enter the identifier of the uploaded file manually.
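Building that link is mostly a matter of escaping the two parameters correctly. A minimal sketch, with a hypothetical function name, using the URL pattern given above:

```python
from urllib.parse import urlencode

# Base URL of the IA-Upload "fill" page, as given in the proposal.
IA_UPLOAD_FILL = "http://tools.wmflabs.org/ia-upload/commons/fill"


def ia_upload_link(ia_identifier, commons_name):
    """Return the prefilled IA-Upload link for one stored request."""
    # urlencode escapes spaces and other special characters in the file name.
    query = urlencode([("iaId", ia_identifier), ("commonsName", commons_name)])
    return IA_UPLOAD_FILL + "?" + query
```

For example, `ia_upload_link("book1", "My Book")` yields a link ending in `?iaId=book1&commonsName=My+Book`.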
Optional Goals:
- Direct upload to Commons.
If users want to use the Commons file immediately, they might want to skip the upload to IA (as it takes time). The wikitools library and the MediaWiki API will be used to connect and upload to Commons.
- Add support for other popular Public Library Networks
Support for public libraries like Digital Library of India (Archived 2013-08-06 at the Wayback Machine) and West Bengal Public Library Network will be added, which will work in a similar fashion to Google-Books.
* The code will be designed so that support for more libraries (like Digital Library of India) can be added easily.
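One common way to get this kind of extensibility is a small registry of library classes that share an interface. The sketch below illustrates the idea; every class, method, and key name in it is hypothetical and the method bodies are placeholders.

```python
from abc import ABC, abstractmethod

# Maps the LIBRARY_TO_CHOOSE value to the class that handles it.
LIBRARIES = {}


def register(key):
    """Class decorator that makes a library available under the given key."""
    def decorator(cls):
        LIBRARIES[key] = cls
        return cls
    return decorator


class Library(ABC):
    """Interface every supported digital library has to implement."""

    @abstractmethod
    def metadata(self, book_id):
        """Return a dict of meta-data for the book."""

    @abstractmethod
    def is_public_domain(self, book_id):
        """Return True if the book may be uploaded."""

    @abstractmethod
    def download(self, book_id):
        """Return the page images of the book, in reading order."""


@register("gb")
class GoogleBooks(Library):
    """Placeholder for the Google Books back-end; real logic is not shown here."""

    def metadata(self, book_id):
        raise NotImplementedError("placeholder in this sketch")

    def is_public_domain(self, book_id):
        raise NotImplementedError("placeholder in this sketch")

    def download(self, book_id):
        raise NotImplementedError("placeholder in this sketch")


def get_library(key):
    """Look up and instantiate the library selected by the user."""
    try:
        return LIBRARIES[key]()
    except KeyError:
        raise ValueError("unsupported library: %r" % key)
```

Adding another library then means writing one more class with the `@register(...)` decorator, without touching the rest of the pipeline.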
Project schedule
Timeline | Task |
---|---|
Apr 21 - May 19 | Get familiar with code base, move local environment to Labs, bond with community |
May 19 - May 26 | University Examinations |
May 26 - May 30 | Add feature to extract meta-data from GB and check if it is public domain (proof-of-concept code) |
May 30 - Jun 05 | Download from GB and convert to PDF |
May 30 - Jun 05 | Code to upload to IA using the internetarchive library |
Jun 05 - Jun 10 | Code to check if the book is available on IA |
Jun 10 - Jun 22 | Database and its python connector for email/identifier storage |
Jun 23 | Mid Term Evaluation |
Jun 24 - Jul 05 | Spider bot to check for updates |
Jun 05 - Jul 15 | Automatic notification email using smtplib and link with IA-Upload tool |
Jul 15 - Jul 25 | UI Polishing, Bug fixing |
Jul 25 - Aug 18 | Code clean-up, documentation + buffer time for unforeseen delays |
* The above plan may go as expected, or time may be redistributed among the tasks.
Participation
During my work hours, I will always be logged in to IRC (channels: #mediawiki, #wikimedia-dev, #mediawiki-labs) and can always be reached by email. I'm a computer addict and have a hard time staying off it. All source code I write will be published to my GitHub repo, although the tool will be hosted on Tool-Labs.
At each stage of development I would like to discuss implementation details with the mentors so that there are no delays or issues later on. If I have other doubts or need feedback, I will head over to the talk page at Talk:Google Books, Internet Archive, Commons upload cycle or to the mailing list (wikitech-l).
About you
My name is Rohit Dua, and I'm currently pursuing my B.Tech in Electronics and Communication at Jaypee Institute of Information Technology, Noida, India. My home town is New Delhi, India.
I code in Python/JavaScript/C/C++.
I'm passionate about computer security and automation, and coding gets me high! I am new to the world of open source and its community bonding.
When I first heard about open source at a Linux User Group meetup at my university, I went crazy about it: I had always thought there was no such thing as free bread, but there always was free knowledge. Before that, I never went to anyone with my programming issues or bugs (online or offline). Now I feel I can grow and learn much faster through community bonding in the open source universe.
This project is my first opportunity to bond with an open source organization. GSoC will be my bridge to the open source community. Google Summer of Code will also be my top priority, and I will happily treat it as a full-time job.
Past open source experience
GitHub profile: rohit-dua
Proof of concept code
For the sake of demonstration, I have a script to download any public domain book from GB: https://github.com/rohit-dua/gb-download (Python)
* UI and some verification code (project named BUB: Book Uploader Bot): https://github.com/rohit-dua/BUB