Wikipedia:Historical archive/GNE project files/GNE Project Design
More thoughts by [[User:MikeWarren|MikeWarren]] about GNE. See also [[Wikipedia:Historical archive/GNE project files/GNE Architecture|GNE Architecture]].
'''Introduction'''
As discussion rages over moderation and back-end design, I thought I'd write a more
cohesive version of my thoughts. There should be one article repository (with
potential for many mirrors of it) and many different classifications, hopefully by
different groups of people.
'''Editing'''
Authors may like to take advantage of editors. Volunteers who wish to edit work can
organize themselves onto one or many editing mailing lists, much like Nupedia has
done. Authors can submit their articles to the list and make changes which editors
might suggest if they see fit.
Once an author is satisfied with her work, she can then submit it to the article
repository.
'''Article Repository'''
There should exist a large group of moderators (hopefully everyone in the project) for
the repository. When an article arrives, they will vote on whether it is spam or not.
Unless there is unanimous consent that the article is completely useless, it will go into
the repository. This can prevent abuse of the system (e.g. someone sending in random
binary files) while still keeping the process completely open and allowing all authors to
put their work into the pool.
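The acceptance rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the representation of votes as booleans are assumptions, as is the choice to accept a submission on which nobody has voted yet.

```python
def accept_submission(spam_votes):
    """Decide whether a submission enters the repository.

    spam_votes is a list of booleans, one per moderator who voted;
    True means "this article is completely useless" (hypothetical encoding).
    """
    # Reject only when at least one vote was cast and every vote says "useless".
    # With no votes there is no unanimous consent, so the article is accepted
    # (an assumption; the proposal does not cover this case).
    unanimous_reject = len(spam_votes) > 0 and all(spam_votes)
    return not unanimous_reject
```

A single dissenting moderator is therefore enough to let an article through, which keeps the process open at the cost of admitting some borderline submissions.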
The only stipulation, obviously, is that the work be licensed under the GNU FDL.
There seems to be consensus that XML will be used to mark up the articles;
submissions arriving which are not in the correct DTD (TEI seems to be the
front-runner currently) will be subject to the brutal treatment of various conversion
scripts. The resultant XML document will be assigned a unique ID and placed in the
repository. There has also been talk of using a Web-based form submission service;
this is also a good idea.
The repository will be kept simple and will utilize the capabilities of modern Web
servers by being a simple hierarchy of directories with the .xml files inside them. The
directories will not be a classification of the content, but merely a way to get around
limits on the number of files in a file-system. Each document will simply be called
unique-id.[version].[language].xml. If it has been digitally signed, there will be a
corresponding signature file called unique-id.[version].[language].xml.asc.
Version numbers can be anything, really, but simply increasing them sequentially
seems like the best course. Language codes can follow the usual locale conventions
(en for English, for instance). So, one might have a directory like:
http://www.gne.org/articles/a/123456.1.en.xml
http://www.gne.org/articles/a/123456.2.en.xml
http://www.gne.org/articles/a/123456.3.en.xml
http://www.gne.org/articles/a/123456.3.en.xml.asc
This means that there are three versions of the article with ID 123456, and that the
last one (version 3) has been digitally signed.
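As a rough sketch of how software might read this naming scheme, the following Python splits a file name into its parts. The names ArticleFile and parse_name are hypothetical, and the pattern assumes that none of the ID, version or language fields contains a dot.

```python
import re
from typing import NamedTuple, Optional


class ArticleFile(NamedTuple):
    unique_id: str
    version: str
    language: str
    is_signature: bool  # True when the name ends in .xml.asc


# unique-id.[version].[language].xml with an optional trailing .asc
_NAME = re.compile(
    r"^(?P<id>[^.]+)\.(?P<version>[^.]+)\.(?P<lang>[^.]+)\.xml(?P<asc>\.asc)?$"
)


def parse_name(name: str) -> Optional[ArticleFile]:
    """Return the parts of a repository file name, or None if it does not fit."""
    match = _NAME.match(name)
    if match is None:
        return None
    return ArticleFile(
        match.group("id"),
        match.group("version"),
        match.group("lang"),
        match.group("asc") is not None,
    )
```

The version is kept as a string here because version numbers "can be anything"; a caller that relies on sequential numbering can convert it to an integer.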
The repository server will keep a list of every document it contains, identified by
unique ID, version and language. This will allow the classification systems to easily
update themselves with new documents by requesting this list (e.g. http://www.gne.org/articles/index).
For the above example, the index would just contain:
123456.1.en
123456.2.en
123456.3.en
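One way the server might produce this index is from the file names themselves. The sketch below is an assumption about how that could be done, not a description of any existing software; build_index is a hypothetical name, and the entries are kept in the order the file names are supplied.

```python
def build_index(filenames):
    """Turn repository file names into index entries.

    Each .xml document contributes one entry with the extension removed;
    .asc signature files and anything else are left out of the index.
    """
    suffix = ".xml"
    entries = []
    for name in filenames:
        if name.endswith(suffix):
            entries.append(name[: -len(suffix)])
    return entries
```

Given the four files from the example directory, this yields the three entries listed above, since the .asc file does not end in .xml.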
'''Classifiers'''
Everybody seems to have their own favourite way of classifying articles, from voting to
Dewey Decimal to Library of Congress. All have merit, and there are probably lots of
users who would find each approach useful.
It seems like a good idea, then, to allow for multiple classification systems. Users would
interact with the repository through one particular classification. Hence, the
classification systems will be doing all the searching and indexing that users might
want; storing the information they use in a database makes a lot of sense.
How might this work? Taking the above example again, a fresh classifier which just
lists articles by author and title downloads the index and notices that there are three
versions of a single article. Since it doesn't know about this article yet (it looks in its
database and finds nothing about the unique ID 123456) and only cares about the
latest articles, it asks for the file http://www.gne.org/articles/123456.3.en.xml.
The repository server re-writes the URL into
http://www.gne.org/articles/a/123456.3.en.xml and sends the XML file back.
The database updating program parses the file and extracts the author and title
information, putting these into the database with the unique ID 123456. This
particular classifier doesn't care about digital signatures, so it never requests the .asc
file to see if there is one.
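The update step just described might look something like the following in Python. It is a sketch under several assumptions: versions are sequential integers, the latest version is tracked separately for each language, and the function name latest_unknown and the known_ids set (standing in for the classifier's database) are hypothetical.

```python
def latest_unknown(index_lines, known_ids,
                   base_url="http://www.gne.org/articles/"):
    """Return URLs for the newest version of every article not yet known.

    index_lines: entries such as "123456.3.en" from the repository index.
    known_ids:   set of unique IDs already in the classifier's database.
    """
    latest = {}  # (unique_id, language) -> highest version seen so far
    for line in index_lines:
        parts = line.strip().split(".")
        if len(parts) != 3:
            continue  # skip anything that is not id.version.language
        unique_id, version, language = parts
        if unique_id in known_ids:
            continue  # this classifier already has the article
        number = int(version)  # assumes sequential integer versions
        key = (unique_id, language)
        if key not in latest or number > latest[key]:
            latest[key] = number
    return [
        f"{base_url}{unique_id}.{number}.{language}.xml"
        for (unique_id, language), number in sorted(latest.items())
    ]
```

For the example index this requests only version 3, using the short URL form; mapping it to the right directory is left to the repository server's URL re-writing.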
Next, a user visits the Web site for this classifier, and visits the author list which shows
a single entry: the author of the article with unique ID 123456.
In a similar manner, other classifiers might do much more complicated things, like
send the article to a mailing list of peer-reviewers (again, like Nupedia) or any
number of schemes to classify the article in question.
'''Software'''
So, what software does GNE need to write? If we use TEI as the representation format,
the project can start receiving submissions immediately; a cursory reading of TEITools
indicates that it can convert to HTML, TeX and RTF. All that needs creating is a
method of getting the submission into TEI in the first place, which can be a Python (or
whatever) script which accepts plain text and makes guesses at what things should be
(e.g. bold face, references, etcetera). Then, editors can make sure this makes sense and
the author can give the final okay.
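A very small version of such a guessing script could look like this. It is only a sketch: the convention that text between asterisks means bold face is an assumption, the function name is hypothetical, and the output is a body fragment using the TEI p and hi elements rather than a complete, validated TEI document.

```python
import re
from xml.sax.saxutils import escape


def text_to_tei_body(text):
    """Guess at TEI markup for plain text.

    Blank lines separate paragraphs, and *word* is taken to mean bold face
    (an assumed convention, not something the project has agreed on).
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    out = []
    for paragraph in paragraphs:
        # Collapse line breaks and escape &, < and > before adding markup.
        content = escape(" ".join(paragraph.split()))
        content = re.sub(r"\*([^*]+)\*", r'<hi rend="bold">\1</hi>', content)
        out.append(f"<p>{content}</p>")
    return "<body>\n" + "\n".join(out) + "\n</body>"
```

Because the script only guesses, its output still needs the editorial pass and the author's final okay described above.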
We can serve submissions live from the XML repository using Apache and TEITools.
Classification projects can thus begin work immediately, and this will be the major
programming work of the project (besides building better anything-to-TEI
conversion scripts). Splitting interested parties into groups for this can begin
immediately as well.
'''Conclusion'''
I propose that four groups be formed immediately:
'Backend' :: This group will set up and manage the back-end server. It should do
nothing more than accept submissions in DTD-compliant XML, accept revised
versions of the same document and accept signatures for existing documents. From
this, it should make the above-mentioned index file and serve XML pages to the
classifiers. This group should also set up the moderation system which rejects things
which are unanimously decided to be spam. This group should also determine whether
or not multimedia will be inline in the XML or served as separate files.
'Editing' :: This group will provide editing services for authors. Any author can submit
their article to the group for comments, although this will obviously not be a
requirement.
'Classification' :: This group will write the first generic classifier project, which is
intended as a template for other, more specific classifier projects to use.
'Conversion' :: This group will work on methods and programs for efficiently
converting submitted articles into TEI. Emphasis should be on making it easy for
(especially) academic groups to submit articles, so LaTeX might be a good first choice
after plain text.