Grants:IEG/Tools for Armenian Wikisource and beyond: Difference between revisions

Xelgen (talk | contribs)
Project Plan: new section
Xelgen (talk | contribs)
Budget total changed from 7600 to 7850, due to miscalculation (last line not included in total)
 
|round=1<!--You don't need to change this entry-->
|year=2014<!--You don't need to change this entry-->
|status=SELECTED<!--When you have completed all 3 parts of your proposal and it is ready for review, change DRAFT to PROPOSED (all caps).-->
|amount=7850 USD<!--Please enter the total budget amount requested, when finalized-->
|target=Armenian Wikisource, with potential to extend to other languages<!--What existing Wikimedia projects and language versions do you seek to impact with this project-->
|goal=Improving Quality<!--Examples of strategic goals: Increasing Reach (more people are able to access Wikimedia projects), Improving Quality (better quality and quantity of content on Wikimedia projects), Increasing Participation (larger and more diverse groups of people are contributing to Wikimedia projects)-->
|theme=TOOLS
|IdeaLab=<!--If you want to remove your proposal from the IdeaLab, delete "YES"-->
}}
 
==Project idea==
===What is the problem you're trying to solve?===
<!--In this section: Explain the problem that you are trying to solve with this project. What is the issue you want to address? Be brief, you can update and add to this later.-->
 
While digitizing the [[:en:Armenian_Encyclopedia#Soviet_Armenian_Encyclopedia|Soviet Armenian Encyclopedia]] on Armenian Wikisource, and later using its articles as a base for Armenian Wikipedia articles, we've felt a strong need for a set of tools that would make the whole process faster, more efficient and more fun by eliminating chores, and as a result yield a higher quality and quantity of articles, and a happier community.
 
Digitizing dictionary-structured books challenged the regular Wikisource workflow, and required actions and tools never needed while digitizing, for example, fiction books. Examples of such actions are dividing content into articles, creating an index of the articles with some metadata, and maintaining the list of articles on 2 projects while they are being created/moved/deleted/merged. Proofreading text printed in a 3 column layout with rich formatting is also more difficult, because of the longer "seek time" and the non-linear, back and forth movement for the eyes of contributors trying to find the word in the scan of the page.
 
===What is your solution?===
Those tools are somewhat language dependent. The project's main target is the Armenian Wikisource and the Armenian language, but we're going to keep a global vision while developing the tools, to ensure that they can be modified, localized, and reconfigured for other Wikisources as easily as possible. We'll do our best to provide sufficient documentation in English for that. The good news is that if we succeed with the Armenian language, we'll also succeed with the majority of other Indo-European languages, as Armenian hyphenation and capitalization rules are relatively complicated.
 
Non-WMF projects using MediaWiki and the ProofreadPage extension can also benefit from these tools.
 
 
 
 
==Project goals==
<!--In this section: briefly explain what are you trying to accomplish with this project, or what do you expect will change as a result of this grant. You can update and add to this later.-->
[[File:ZoomProof Wikisource tool concept.ogv|thumb|300px|Proof of concept of ZoomProof in action]]
 
[[File:SAECropper-dev.png|thumb|Current state of IllustrationCropper, frontend is about 40% done. Some dev. output can be seen]]
[[File:SAE Tools - Marking sections and section duplicate check.ogv|thumb|Section marker and duplicate checker in work. Tightly focused on SAE, and needs to be made more universal and flexible]]
The following tools are to be developed, tested and fine-tuned within the scope of the project:
# ZoomProof - Wikisource gadget to automatically zoom to and highlight the word which is being proofread (a feature often found in OCR software, see screencast)
# WikiSource IllustrationCropper - Tool to crop images out of a book page scan and upload them, right on the edit page
# Section Harvest - Parse all pages of a book, get a detailed list of the sections used, and verify them by some simple rules (e.g. no duplicate section names, no all-caps section names, etc.). Especially useful for digitizing dictionaries, encyclopedias, reference books, etc.
# LST Guard (see [https://www.mediawiki.org/wiki/Extension:Labeled_Section_Transclusion Labeled Section Transclusion]) - Service to monitor changes to the names of sections which are already used to transclude content. We’ve noticed that some editors change section names without realising that other pages become blank as a result. Monitoring and notifying patrollers of such cases is the minimal goal; we’ll also try to make it possible to automatically update the section name in the transclusion, or to revert the change of the section name (not the whole revision) according to some rules (e.g. the section name is not unique for the book). Ideally, it will also automatically link Wikipedia and Wikisource articles.
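The checks described for Section Harvest and LST Guard could be sketched roughly as follows. This is only an illustration and not the planned implementation: it is written in Python for brevity, all function names are hypothetical, page texts are passed in as plain strings (in practice they would be fetched from the wiki), and the regular expression only covers the common <code>&lt;section begin="..." /&gt;</code> form of a labeled section marker.

```python
import re
from collections import Counter

# Matches LST opening markers such as <section begin="Name" /> (quotes optional).
SECTION_BEGIN = re.compile(r'<section\s+begin\s*=\s*"?([^"/>]+?)"?\s*/>', re.IGNORECASE)


def section_names(wikitext):
    """Return the section names opened in one page, in order of appearance."""
    return [m.group(1).strip() for m in SECTION_BEGIN.finditer(wikitext)]


def harvest_sections(pages):
    """Section Harvest step: collect (page_title, section_name) pairs for a book.

    `pages` is a dict mapping page titles to their wikitext (hypothetical input).
    """
    return [(title, name) for title, text in pages.items() for name in section_names(text)]


def check_sections(found):
    """Apply the two example rules: no duplicate names, no all-caps names."""
    counts = Counter(name for _, name in found)
    problems = []
    for title, name in found:
        if counts[name] > 1:
            problems.append((title, name, "duplicate"))
        if name.isupper():
            problems.append((title, name, "all-caps"))
    return problems


def broken_transclusion_targets(old_wikitext, new_wikitext, transcluded_names):
    """LST Guard step: names that disappeared in an edit but are still transcluded."""
    removed = set(section_names(old_wikitext)) - set(section_names(new_wikitext))
    return sorted(removed & set(transcluded_names))
```

A real checker would need extra rules, for example to allow an article that legitimately continues under the same section name on the next scanned page.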
 
The following 2 tools are implemented as part of [[:s:hy:Մասնակից:Xelgen/ՀՍՀ Գործիքներ|SAE Tools]] user script and are being used in the Armenian Wikisource. But the current versions are too focused on the scanned version of Soviet Armenian Encyclopedia, and are interconnected with other tools of that pack.
# AutoHinter, automatically finds and highlights possible mistakes by using data specific to the language and the OCR software used.
# SectionMarker tool, which allows adding sections, and checks for duplicates in the neighboring scanned pages.
Project’s other goal is to turn those 2 tools into more universal, “standalone” Gadgets (or User Scripts) suitable for generic cases, so they can be used by other Wikisources as well. SectionMarker also needs deep refactoring (the duplicate prevention parts are very messy and hard to read).
<!--The templates below help maintain your proposal, please leave them, thank you!-->
{{IEG/Proposals/Button/2}}
 
== Part 2: The Project Plan ==
 
__TOC__<!--The sections below provide the required structure for part 2 of your project proposal, feel free to add to them but please keep the provided subsections.-->
 
==Project plan==
<!--Leave this blank.-->
====Total amount requested====
<!--Please list the total grant amount you are requesting in USD or local currency (USD will be assumed if no other currency is specified).-->
7850 USD
 
 
====Budget breakdown====
<!--How you will use the funds you are requesting? You can create a table with the table button from the toolbar above, or just list bullet points.-->
* Research & Development: 220 hours at 20 USD/hour
* Volunteer coordination & meetings: 40 hours at 20 USD/hour
* Project coordination: 120 hours at 20 USD/hour
* Accountant consultancy
*
* Promotional materials for volunteer contributors and participants of proof-read-a-thons (certificates, stickers, etc.): 250 USD (some materials, like mugs and T-shirts, may be provided by Wikimedia Armenia)
 
Hourly rates include 26% taxes. Yerevan State University will provide space and a lab for in-person meetups with volunteers.
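As a quick cross-check of the total, the listed items can be summed as in the sketch below. It assumes that the accountant consultancy line, which has no amount listed, contributes nothing to the total; the variable names are only illustrative.

```python
# (item, hours, hourly rate in USD) for the hourly budget lines listed above
hourly_items = [
    ("Research & Development", 220, 20),
    ("Volunteer coordination & meetings", 40, 20),
    ("Project coordination", 120, 20),
]
promotional_materials = 250  # USD, flat amount

# Accountant consultancy is left out: no amount is given in the breakdown.
hourly_total = sum(hours * rate for _, hours, rate in hourly_items)
total = hourly_total + promotional_materials

print(f"Hourly work: {hourly_total} USD, total: {total} USD")
```

Under that assumption the sum matches the 7850 USD requested.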
 
<!--Examples: Travel roundtrip from Bangladesh to India: (Amount) USD, Visual design contractor: (Amount) USD, Project management: (Amount) USD, Wikimedia merchandise for volunteer giveaways: (Amount) USD, etc.-->
 
 
===Intended impact:===
The main target audience of the project is the proofreaders in Wikisource.
In the specific case of SAE, Wikipedians using articles from this printed encyclopedia would also benefit, as they’ll be able to coordinate their work in a more efficient way.

IT students who volunteer to help with coding will get mentorship and code review, and will learn about real-world tools, techniques, workflows and platforms. We also hope to introduce them to the world of free software, and to see them contributing code to free projects in the future.
 
====Fit with strategy====
====Measures of success====
<!--How will you know if the project is successful? What are the targets, metrics, or measurable criteria will you use to learn if you’ve met your goals?-->
 
Measures of success:
* All tools listed are created, with all essential functionality
* Tools work under current versions of Firefox and Chromium-based browsers (Chrome, Opera). All critical functionality works under Internet Explorer 7.
* User documentation is provided in Armenian and English
* Technical documentation in English is provided for all tools (for later modifications by developers)
* At least 5 volunteer coders are involved in the project
* Tools are tested by developers and the community. If community members have major concerns related to functionality, UX or workflow during the early Alpha version, the necessary modifications are made.
* As far as possible, it is ensured that the tools can be easily modified (if not just reconfigured) for other languages.
* ZoomProof highlights the correct word in at least 75% of cases; in at least 95% of cases it is able to show the correct area of the page, even without highlighting the word. (Numbers are for Armenian text; other languages may show higher or lower success rates after re-configuration.)
 
Stretch-goals:
* IllustrationCropper is able to use high-res files from external servers, local Wikisource, or Commons, as source instead of low-res DjVu images.
* Every tool is implemented and used in at least one Wikisource project other than Armenian by the end of the project.
 
The following metrics can be used retrospectively, a few months after the project finishes and the tools are in use:
 
One of the metrics to measure the success will be the number of editors using the tools. It may not be efficient to track all the changes done by any of the tools (e.g. every corrected mistake in the OCRed text). The usage of some of the tools will be easier to measure, for example, the number of pictures obtained by IllustrationCropper. The LST Guard will have a detailed log of its actions.
 
A rise in the number of edits, especially by users using these tools.
 
Having a full index of articles in the Soviet Armenian Encyclopedia, and a separate subpage for every article by the end of the project, may be another target and an indicator of success. Though this depends on
 
In the longer term, the number of language versions of Wikisource that have used the tools will be a good indicator of their reusability.
 
 
==Participant(s)==
 
[[User:Xelgen]], FLOSS enthusiast, Wikipedian for over 7 years, sysop of Armenian Wikipedia. Localisation enthusiast. Coding since the age of 9, 11 years of professional experience in IT, including managerial positions. Project management skills, experience running NGOs, and local tax reporting. Was a member of the initiative to get the Soviet Armenian Encyclopedia released under a Creative Commons license. Scanned and processed 13 volumes of the encyclopedia.
 
[[User:HrantKhachatrian]], Master's student at the Department of Informatics and Applied Mathematics, Yerevan State University, president of the Student Scientific Society of the department, freelance web developer for 5 years.
 
[[User:Mahnerak]], Bachelor's student at the Department of Informatics and Applied Mathematics, Yerevan State University, winner of various local and international olympiads in informatics, author of the software used in school olympiads in Armenia.
 
==Discussion==
===Community Notification:===
<!--You are responsible for notifying relevant communities of your proposal. Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc.-->
Please paste a link to where the relevant communities have been notified of this proposal, and to any other relevant community discussions, here.
 
* [//hy.wikisource.org/wiki/Վիքիդարան:Խորհրդարան#.D4.BF.D5.A1.D6.80.D5.AE.D5.AB.D6.84.D5.B6.D5.A5.D6.80.D5.9D_.D5.84.D5.AB_.D5.B7.D5.A1.D6.80.D6.84_.D5.A3.D5.B8.D6.80.D5.AE.D5.AB.D6.84.D5.B6.D5.A5.D6.80_.D5.B0.D5.A1.D5.B5.D5.A5.D6.80.D5.A5.D5.B6_.28.D6.87_.D5.B8.D5.B9_.D5.B4.D5.AB.D5.A1.D5.B5.D5.B6_.D5.B0.D5.A1.D5.B5.D5.A5.D6.80.D5.A5.D5.B6.29_.D5.8E.D5.AB.D6.84.D5.AB.D5.A4.D5.A1.D6.80.D5.A1.D5.B6.D5.AB_.D5.B0.D5.A1.D5.B4.D5.A1.D6.80 Notification of Armenian Wikisource community of this proposal]
* [http://lists.wikimedia.org/pipermail/wikisource-l/2014-April/001868.html Notification via mailing list of Wikisource Community User Group]
 
Old discussions related to possible SAE proofreading automation, tools, and policies:
* [//hy.wikisource.org/wiki/%D5%8E%D5%AB%D6%84%D5%AB%D5%BA%D5%A5%D5%A4%D5%AB%D5%A1:%D4%BD%D5%B8%D6%80%D5%B0%D6%80%D5%A4%D5%A1%D6%80%D5%A1%D5%B6/%D4%B1%D6%80%D5%AD%D5%AB%D5%BE/2007-2012#.D5.80.D5.8D.D5.80.E2.80.93.D5.AB_.D5.BA.D5.A1.D5.BF.D5.AF.D5.A5.D6.80.D5.B6.D5.A5.D6.80 Discussion on what and how to do with SAE Illustrations]
* Discussions on widespread OCR mistakes to consider when developing tools [//hy.wikisource.org/wiki/%D5%8E%D5%AB%D6%84%D5%AB%D5%BA%D5%A5%D5%A4%D5%AB%D5%A1:%D4%BD%D5%B8%D6%80%D5%B0%D6%80%D5%A4%D5%A1%D6%80%D5%A1%D5%B6/%D4%B1%D6%80%D5%AD%D5%AB%D5%BE/2007-2012#.D5.8F.D5.A1.D5.BC.D5.A1.D5.B3.D5.A1.D5.B6.D5.A1.D5.B9.D5.B4.D5.A1.D5.B6_.D5.BD.D5.AD.D5.A1.D5.AC.D5.B6.D5.A5.D6.80.2C_.D5.B8.D6.80.D5.B8.D5.B6.D6.84_.D5.B0.D5.B6.D5.A1.D6.80.D5.A1.D5.BE.D5.B8.D6.80_.D5.A7_.D5.B7.D5.BF.D5.AF.D5.A5.D5.AC_.D5.A1.D5.BE.D5.BF.D5.B8.D5.B4.D5.A1.D5.BF from 2012] and [//hy.wikisource.org/wiki/%D5%8E%D5%AB%D6%84%D5%AB%D5%A4%D5%A1%D6%80%D5%A1%D5%B6:%D4%BD%D5%B8%D6%80%D5%B0%D6%80%D5%A4%D5%A1%D6%80%D5%A1%D5%B6#.D5.80.D5.8D.D5.80_.D5.BF.D5.A1.D5.BC.D5.A1.D5.B3.D5.A1.D5.B6.D5.A1.D5.B9.D5.B4.D5.A1.D5.B6_.D5.B8.D6.80.D5.A1.D5.AF.D5.A8 from 2013]
* [//hy.wikisource.org/wiki/Վիքիդարան:Խորհրդարան#.D5.80.D5.8D.D5.80.E2.80.93.D5.AB_.D5.A8.D5.BD.D5.BF_.D5.A2.D5.A1.D5.AA.D5.AB.D5.B6.D5.B6.D5.A5.D6.80.D5.B8.D5.BE_.D5.A2.D5.A1.D5.AA.D5.A1.D5.B6.D5.A5.D5.AC.D5.B8.D6.82_.D5.A3.D5.B8.D6.80.D5.AE.D5.AB.D6.84 Discussion on first version of SectionMarker]
* [//hy.wikisource.org/wiki/Վիքիդարան:Խորհրդարան#.D5.80.D5.8D.D5.80.E2.80.A4_.D5.8F.D5.B8.D5.B2.D5.A1.D5.A4.D5.A1.D6.80.D5.B1.D5.A5.D6.80.D5.A8_.D5.B0.D5.A1.D5.B6.D5.A5.D5.AC_.D5.A1.D5.BE.D5.BF.D5.B8.D5.B4.D5.A1.D5.BF_.D5.A9.D5.A5.D5.9E_.D5.B9.D5.B0.D5.A1.D5.B6.D5.A5.D5.AC Discussion on blind, automatic removal of hyphenation] The consensus was that, due to the complicated rules of hyphenation and the OCR quality, simply removing hyphens with a bot isn't a good idea. This sparked development of a smarter and more careful tool, removing only obvious hyphenation cases. In the end it led to the development of the SAE Tools pack.
 
===Endorsements:===
 
*''Community member: add your name and rationale here.''
*'''Endorse''' -- by all means. [[User:Xelgen]] has a long history on Armenian Wikipedia and Wikisource, and has had a major role in release of Soviet Armenian Encyclopedia under Creative Commons license, and then taken on the digitization of all 13 volumes of it, and afterwards has continued to develop anbd fine-tune extemely useful proofreading tools on Armenian Wikisource (see [[s:Մասնակից:Xelgen/ՀՍՀ Գործիքներ|description]]). I think it is extremely essential to enable these developers to build and enhance such proofreading tools without which proofreading gigantic texts such as encyclopedias could turn into an even more arduous task and take exponentially longer to accomplish. [[User:Chaojoker|Chaojoker]] ([[User talk:Chaojoker|talk]]) 06:22, 9 April 2014 (UTC)
 
*'''Endorse''' -- He has my full endorsement. We are truly in need for greater tools for easier proofreading of the numerous material that has been uploaded into wikisource, especially for such a large editing tasks as editing the voluminous Armenian Soviet Encyclopedia. I have been already using some scripts created by [[User:Xelgen]] that has been quite helpful. [[User:Վազգեն|Վազգեն]] ([[User talk:Վազգեն|talk]]) 03:08, 10 April 2014 (UTC)
 
*'''Endorse''' -- Useful tools and a team with a track record of getting things done in wiki-projects. [[User:Teak|― Teak]] ([[User talk:Teak|talk]]) 18:42, 10 April 2014 (UTC)
*'''Endorse''' -- I know both [[User:Xelgen]] and [[User:HrantKhachatrian]]. Both have improved/developed a lot of useful tools for Wikimedia projects and I think it would be incredible if they would not limit themselves to do it in their spare time. --[[User:Vacio|<font color="#1E90FF">'''va'''</font>]][[Special:Contributions/Vacio|<font color="#FF8C00">'''c'''</font>]][[User_talk:Vacio|<font color="#1E90FF">'''io'''</font>]] 06:45, 11 April 2014 (UTC)
*'''Endorse''' This sounds like a great idea and who knows, it may benefit the wider community of Wikisorcerers eventually. [[User:Jane023|Jane023]] ([[User talk:Jane023|talk]]) 17:21, 13 April 2014 (UTC)
*'''Endorse''' -- I am familiar with the work that [[User:Xelgen]] has done in Wikipedia and Wikisource projects and I fully endorse this proposal. -[[User:Սահակ|Սահակ]] ([[User talk:Սահակ|talk]]) 01:17, 14 April 2014 (UTC)
*'''Endorse''' -- Just the zoom and crop functionality by themselves would be great tools and worth the funding. --[[User:Ainali|Ainali]] ([[User talk:Ainali|talk]]) 20:34, 16 April 2014 (UTC)
*'''Endorse''' -- Wanting almost this same functionality for digitized dictionary books having probably some more typographical complexities. Looking forward to test runs with such books. --[[User:Purodha|Purodha Blissenbach]] ([[User talk:Purodha|talk]]) 09:19, 17 April 2014 (UTC)
<!--You have finished creating all sections of your proposal. Have a great time continuing to develop your project idea, we look forward to your completed submission!-->
*'''Endorse''' -- See my comments on the talk page.--[[User:Micru|Micru]] ([[User talk:Micru|talk]]) 14:13, 18 April 2014 (UTC)
*'''Endorse''' -- I am fully endorsement this project. It will my Bangal Wikisource too.[[User:Jayantanth|Jayantanth]] ([[User talk:Jayantanth|talk]]) 14:30, 18 April 2014 (UTC)
*'''Endorse''' -- Although see my [[Grants_talk:IEG/Tools_for_Armenian_Wikisource_and_beyond#Comments_from_the_wub|comments on the talk page]]. [[User:the wub|the wub]] [[User_talk:The wub|<font color="green">"?!"</font>]] 22:20, 19 April 2014 (UTC)
[[Category:Wikisource]]