Machinery or no machinery, for retrieval classification is a
necessity, not just superficial classification but depth
classification. S.R. Ranganathan.
NOTE ABOUT THIS WRITE-UP
This is
a first draft and a casual-style write-up, and so it contains several repetitions
and statements which you may just ignore if you do not follow them. However, after
reading the whole lot and the paper linked to this write-up, you will definitely
follow what I am trying to convey. I hope to edit and level these out at a later
time. If I get feedback that I should be a bit more serious and re-write this
paper properly, according to the rigid rules of scientific paper writing, with citations,
bibliography, a review of recent works and so on, then I will do so
later. Those of you who know Thamilzh (Thamizh, Tamil) may play as background
music the song "ettaNaa irunthaa ettu ooru en paattu kaetkum." You may have to
register to download it from:
<http://music.cooltoad.com/music/song.php?id=100300>. This is the theme song for this write-up! I am sure you will
enjoy it. The statements in transliterated Thamilzh are for enjoyment and do not
convey any significant meaning, so others may simply ignore them. But I am
sure they will force some soft movement of the lips leading to a softened smile
(Punmuruval pookka vaikkum). They did the same to Dr. Ranganathan too.
By
the way, do you know that the famous "Wall-Picture" principle is an old saying
(proverb) in Thamilzh? (Chuvarintich chiththiram yelutha mudiyaathu). And that its
derivative, the "Cow-Calf principle", is another common proverb in
Thamilzh?
PROLOGUE
Purpose 1:
Is Our Paper "Not Relevant"?
One of the purposes of this write-up is
simply to find out whether one of our papers, entitled "Faceted Indexing Based System
for Organizing and Accessing Internet Resources" and published in "Knowledge
Organization: Journal of the International Society for Knowledge Organization,
Vol. 29 (2002); No. 2; p 65-77" [reproduced in its entirety below], is
really NOT RELEVANT, as stated in a bibliography entitled "Putting
Facets on the Web: An Annotated Bibliography", Oct. 2003 <http://www.miskatonic.org/library/facet-biblio.html>.
Normally,
when one compiles a bibliography, she or he compiles it according to
certain criteria, be it a subject or topic or whatever, and includes
only those items that are found relevant. If something is found "not
relevant" then it is simply ignored and not included in the
bibliography. But this bibliography has, for some reason, put in a
heading "NOT RELEVANT" and included our paper and another one under it. To me it
appears cheeky! If someone wants to
criticize our paper, they have every right to write to the Editor of the Journal.
The criticism would be sent to the authors, and the authors' view would also be
published in the "Letters to the Editor" section. Unilaterally branding a paper
as "NOT RELEVANT", without giving the author an opportunity to reply, is
actually academic dishonesty :-). If our paper is partially relevant, that may be
stated in the annotation below the citation, along with the objective of the
compilation and how that objective is not met, or only partially met, by our paper. If
it is totally "not relevant", then the search criteria used by the compiler for
compiling the bibliography should not have retrieved our paper; if they did, then
the conduct of the search and the selection of the paper are simply flawed! You
select a "not relevant" paper and list it as "not relevant"? This is not
fair! Also, when a paper is
published in a journal, the paper generally does not reproduce all the
information necessary to understand it from scratch. A certain amount of
scholarship is expected of the readers, at least an understanding of the papers
cited.
Our paper is not a tutorial on Facet Analysis! If anyone
wants a tutorial, there are quite a few available, most of them formed by
copiously copying from Ranganathan's works or from works that copied his works.
Some are formed by a generous selection from Ranganathan and a limited selection
from the works of the CRG (itself an offshoot of Ranganathan's teaching) and of
his students in the UK. I have not come across any significant development beyond what
Ranganathan has said, except for a few additions to the Categories (facets),
as Prof. Vickery made a long time ago. If there has been any addition to the
principles of facet sequence or array isolate sequence, or any enhancement
thereto, please let me know. I would be grateful. Our paper does not copy and
reproduce Ranganathan's work; it is a further development of POPSI and the DSIS, which
are based on Ranganathan's facet analysis, the origin being Ranganathan's paper
"Subject Headings and Facet Analysis", published in the Journal of Documentation
(1964), Vol. 20, p 109-119. The fact is, though Ranganathan's facets are
generally understood to be just the five Facets (PMEST), he had in actuality
seven Facets! :-)
They are [Basic Facet], [Personality Facet], [Matter Facet], [Energy Facet],
[Space Facet], [Time Facet] and [Common Isolate Facet] (the anteriorising and
the posteriorising types). Never mind if you do not understand; for that you have
to read his works. [Strange that it is NOT FIVE but SEVEN. How this escaped the
analytical minds of the stalwarts who say our paper is "Not relevant" is strange
indeed! From now onwards everyone will say SEVEN and not FIVE. But there are some
points for argument too.] In our paper, we have managed with just four
Elementary Categories, [Discipline], [Entity], [Property] and [Action], and a
concept called [Modifier], as enunciated by Bhattacharyya in his works on POPSI.
To generate different types of organizing sequences we have the concepts of
[Base] and [Core]. Well, if you read both Bhattacharyya's papers on POPSI and my
FID/CR Report No. 21, you will understand it all. When one writes a paper for a
journal, the space is limited.
I would like to put all our documents on
this site whenever I find time. Though the origin of POPSI was Ranganathan's
paper "Subject Headings and Facet Analysis", in the 1969 DRTC Annual Seminar 7 there
was a paper entitled "Postulate based subject headings for a dictionary
catalogue system" by Prof. Bhattacharyya and Prof. Neelameghan. But this Seminar
volume is the old cyclostyled one, fit to become an antique, and is perhaps available
in some library here. If you really want to understand a concept you have to
trace the origin and get at the original documents! One caution though:
deductive reasoning alone does not work well with Facet Analysis and its
understanding! (Library Science is a social science and so is Library
Classification). (Cheththaal thaan chudukaadu therium).
Our paper is not
concerned with how to put a faceted classification scheme on the web. We are
not "putting any facet on the web". We are not designing any faceted
classification scheme and storing it anywhere. We are showing how faceted
indexing can produce an organizing classification effect, and that it can be used
for organizing and accessing any resource, including web resources. Our paper is
not concerned with how an information resource is identified and its structure
explicated. The concern is not whether the structure of the information
resource is described using Dublin Core, or the fantastic templates developed by
the ROADS project in the UK <http://www.ukoln.ac.uk/metadata/roads/what/>
(forgotten by many, the project having now
ceased to exist), or the HTML that reminds one of "WordStar", or the verbose XML, or RDF,
etc.
The concern of the paper is how a facet-analyzed subject heading
can be used to provide
an expression that is meaningful and that produces an organizing
sequence when sorted alphabetically, resulting in the organization of web
resources, and how the same heading can be used in a retrieval
environment. The heading is a structured subject heading assigned on the basis of the theory of POPSI
(POstulate based Permuted Subject Index), elaborated as the Deep Structure
Indexing System, which indicates how the different types of index displays can be
generated by computer ("Computerized Deep Structure Indexing System. FID/CR
Report No. 21. FID/CR Secretariat, Frankfurt. 1986"). The paper is only on subject retrieval and not on retrieval based on
other data elements, such as the name of the creator or the language of the resource.
A detailed discussion of the formation of the index files for retrieval is
not presented, because the system uses the age-old technique of inverted index
files (index sequential files) and such a discussion would be trivial. Almost all
information systems (Google, Yahoo ...) use inverted index files.
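The two ideas in this paragraph, an organizing sequence obtained by plain alphabetical sorting of structured headings and retrieval through an inverted index, can be illustrated with a minimal sketch. It is not the DSIS implementation. The URLs, the headings and the comma-separated heading syntax are all hypothetical, chosen only to show the effect.

```python
from collections import defaultdict

# Hypothetical resources with hand-assigned faceted headings.
# Each heading lists its terms in a fixed facet order; the syntax
# is an assumption and not the exact POPSI / DSIS notation.
RESOURCES = {
    "http://example.org/a": ["Medicine", "Lung", "Cancer", "Diagnosis"],
    "http://example.org/b": ["Agriculture", "Rice", "Disease", "Control"],
    "http://example.org/c": ["Medicine", "Lung", "Cancer", "Treatment"],
    "http://example.org/d": ["Agriculture", "Rice", "Harvesting"],
    "http://example.org/e": ["Medicine", "Heart", "Surgery"],
}


def heading_string(terms):
    """Render the facet terms as one displayable heading."""
    return ", ".join(terms)


def organizing_sequence(resources):
    """Sort (heading, url) pairs alphabetically by heading.

    Because every heading starts with its broadest facet, related
    resources end up next to each other without any class number.
    """
    return sorted((heading_string(t), url) for url, t in resources.items())


def build_inverted_index(resources):
    """Map each lower-cased term to the set of URLs whose heading uses it."""
    index = defaultdict(set)
    for url, terms in resources.items():
        for term in terms:
            index[term.lower()].add(url)
    return index


def search(index, *terms):
    """Return the URLs whose headings contain all of the given terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


if __name__ == "__main__":
    for heading, url in organizing_sequence(RESOURCES):
        print(f"{heading}  ->  {url}")
    print(search(build_inverted_index(RESOURCES), "lung", "cancer"))
```

In the sorted listing the two Agriculture headings come first, followed by the Medicine headings, with the two Lung, Cancer entries adjacent. That grouping is the organizing effect the paragraph describes.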
I am
reproducing our paper of 2002 as it is below, for you all to see and read. I may
set up a blog to help you give your comments. Of course you will have to register
your name in order to be able to add your comments. For the time being, send me
e-mails. I will reproduce the "good" ones :-).
Purpose 2: Pondering Over the Semantic Web Intrigued by the
Course of its Development
The second purpose is to ponder, in a
lighter vein, some factors in the development of the information scenario leading
to the web, and to wonder what it would take to get to the semantic web if the
stalwarts stick to the "stay the course" policy in its development. Well, I am
intrigued by the fact that the visionaries did not envision the need for the
"semantics" of the web early enough, and do not realize that real evolution and
development can be achieved only if a historical and developmental study of
information systems is carried out diligently, to get at and recognize the
factors necessary for the evolution to be carried forward.
I am reminded of the phrases I
heard at IIT Madras Computer Centre, such as "you IBM 370 assembly programmers
will always revolve around the 16 registers" (some IIT Madras Computer Centre
friends may remember the simple assembly language macros developed to do this
without using up another base register), "a good systems programmer is not
necessarily going to be a good systems administrator or system designer", "you
higher level programmers will always think in terms of 'if then else' and miss
the niceties of the quantum jump that is necessary to recognize the connection
between the seemingly unconnected", and so on :-).
Well, there was
this ARPANET, and ASCII won over all the others such as EBCDIC,
and, well, communication between computers became a possibility in spite
of different Operating Systems. There was Apple and "hypertext", which was
developed as an easy page flipping and display technique with the added ability
to jump to related pages and sections of text put on the computer as an
information resource using embedded links (Librarians called directive links
"See" references and related page / section links "See also" references).
It was basically a text displaying technique and not a database methodology at all.
There was Gopher, and the Browser followed, and experts (non-Librarians) who
perhaps had no exposure to information system design but had used enough
word processing took a fascination to the WordStar-like text displaying hypertext
and anointed it as the standard for putting information on the net. Users, having
used word processors, took a fascination to this too, and the web grew in a
cooperative manner - but grew wildly! I am at a loss to think of
anyone who actually designed the web consciously. If there is one, please
let me know. I would be grateful. Well, there were and are several types of
information systems; visibly there were / are INIS, AGRIS, MEDLARS,
Chemical Abstracts, Engineering Index and a few on Computer Science too. How the
designs of these were never taken into cognizance is a mystery, and it is
sad.
Had a Librarian been involved in the development of
the web, he would have realized that it was going to be a cooperative,
global database and would have tried to set up the correct guidelines and standards.
This is because Librarians have been involved in
designing, at the least, Bibliographic Databases. From the sixties onwards, librarians knew that there
are data elements identified by Data Names (a name for the data - what a
beautiful, meaningful (semantically rich), simple label compared to the semantically confusing
label "meta data"; some people are always interested in expressing simple things
with high sounding, semantically poor terms to hijack the ideas and make them
their own, and more examples may be found below), that most of them have variable
length, and that some of them can repeat in the description of an information
resource. Using this knowledge, Librarians developed Machine Readable Catalog
standards ("United States Standard for Information Exchange. Journal of Library
Automation. Vol. 1; 1968").
Librarians' involvement in information
systems design is well known - the CAN/SDI and CAN/OLE systems of NRC, Canada, for
instance. Librarians did contribute to the development of international
cooperative bibliographic information systems. For instance, if we take the
FAO's AGRIS, Librarians knew that there should be a well defined "data element
directory" (Librarians called it the AGRIS Cataloging Manual) giving the Data
Names, their definitions, what should form the data element for a particular
data name, including sub-elements, where a particular data element is to be found in
the document described, and how it is to be recorded, along with examples. As the
system was being developed cooperatively by member countries and there could
be records in different languages, to cut across the language barrier the data
names were identified with unique codes (tags), and the data names themselves were
replaced by the codes in the records. This saved a lot of space in the records
too and made them non-verbose, unlike the XML of today. Also, to help ascertain
easily which documents to include and to categorize them, there was a
Categorization / Classification scheme with codes for the topics, though it was
not a faceted classification scheme displaying full hierarchy. Apart from these,
there was a vocabulary control tool called the "AGROVOC thesaurus", displaying up to
seven levels of hierarchy of the terms to be used as index terms. The index
terms themselves were to be selected from AGROVOC and recorded according to a
"standard syntax" to further categorize the documents coextensively,
completely semantically, and in a consistent manner. Several standard templates
for the different types of documents were also developed, and the Librarians in
all the participating countries were trained to ensure
consistency.
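A minimal sketch of what such a data element directory makes possible is given here. The tags, names and repeatability flags are hypothetical and are not taken from the AGRIS Cataloging Manual. The point is only that a record carries short tags, while the data names live once in the directory and can be shown in any language.

```python
# Hypothetical data element directory: tag -> data name and rules.
# None of these tags are real AGRIS tags; they are for illustration only.
DIRECTORY = {
    "100": {"name": {"en": "Author", "fr": "Auteur"}, "repeatable": True},
    "200": {"name": {"en": "Title", "fr": "Titre"}, "repeatable": False},
    "600": {"name": {"en": "Index term", "fr": "Terme d'indexation"},
            "repeatable": True},
}


def validate(record):
    """Check a record (a list of (tag, value) pairs) against the directory.

    Values are variable length, so only the tag and its repeatability
    are checked. Returns a list of problem messages, empty if none.
    """
    problems = []
    seen = set()
    for tag, _value in record:
        entry = DIRECTORY.get(tag)
        if entry is None:
            problems.append(f"unknown tag {tag}")
        elif tag in seen and not entry["repeatable"]:
            problems.append(f"tag {tag} may not repeat")
        seen.add(tag)
    return problems


def display(record, lang):
    """Expand the tags back into data names in the requested language."""
    return [
        f"{DIRECTORY[tag]['name'][lang]}: {value}"
        for tag, value in record
        if tag in DIRECTORY
    ]


if __name__ == "__main__":
    record = [
        ("100", "Author One"),
        ("100", "Author Two"),  # repeatable data element
        ("200", "A sample title"),
        ("600", "Rice"),
    ]
    print(validate(record))
    print(display(record, "fr"))
```

The same stored record can be displayed with English or French data names, because the record itself never contains the names, only the tags.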
Librarians knew well that to be successful, at least
three important tools are needed for an information system of the type they were
handling. They are:
When we were discussing the classaurus, I did oppose Dr.
G. Bhattacharyya when he wanted a class code to be fixed to each of the
terms (isolates - this is a peculiar term, but it is well understood by Librarians who know
Colon Classification, and we take it for granted that people understand this word.
Well, it just stands for an idea represented by a term in a faceted
classification scheme), because I thought that there was no need for any class
code for the terms in the classaurus, and that a code would make it no different from a
Faceted Classification schedule except that it would have synonyms also. But he
was not convinced and wrote his paper for the Augsburg Conference
(4th International Study
Conference on Classification Research, Augsburg. Edited by I. Dahlberg. Indeks
Verlag, Frankfurt 1982. p. 139-.) that way. Well, I thought, it is a
vocabulary control tool and not a classification code assigning tool. In the
Library, when books are returned they are to be put back in the organizing
sequence. Similarly, the organizing sequence is to be restored every time it is
disturbed. To make the restoration of the sequence simple and
mechanical for the library clerks, the code is necessary. They need not have
been educated in English Medium schools, as they need not read the titles to
determine their place on the shelves. It is enough if they can read numbers. But
in a computer environment there is no restoration of the disturbed holy
sequence. The documents are always there unless there is a hard disk crash. So I
went on to develop the alphabetical classaurus and said that, to update it, there is
no need for any class code to denote the position of the terms. Now I realize
that it may be necessary to identify each isolate term by a unique, hierarchy
reflecting code so that it becomes universal, piercing the language barrier
armor with ease (terminology influenced by current events). To avoid the code you
have to beat around the bush with many logical derivations and deductions of
inclusiveness, implication and so on. Wish you good luck with your winding ways.
(WWW stands for Winding Ways of the Web!).
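To illustrate the closing point about a unique, hierarchy reflecting code, here is a small sketch. The codes, the terms and the dotted notation are all hypothetical and do not come from any actual classaurus. With such a code, synonyms in any language resolve to the same isolate, and inclusion is read off the code itself rather than deduced.

```python
# Hypothetical classaurus fragment: code -> terms per language.
# The dotted code mirrors the hierarchy; synonyms share one code.
CLASSAURUS = {
    "E1": {"en": ["plants"], "fr": ["plantes"]},
    "E1.1": {"en": ["cereals", "grain crops"], "fr": ["céréales"]},
    "E1.1.1": {"en": ["rice", "paddy"], "fr": ["riz"]},
    "E1.1.2": {"en": ["wheat"], "fr": ["blé"]},
}


def code_for(term):
    """Find the code of a term or synonym, whatever its language."""
    wanted = term.lower()
    for code, labels in CLASSAURUS.items():
        if any(wanted in names for names in labels.values()):
            return code
    return None


def is_narrower(code, ancestor):
    """True if `code` lies below `ancestor` in the hierarchy."""
    return code.startswith(ancestor + ".")


def broader_chain(code, lang):
    """List the preferred terms of all broader isolates, broadest first."""
    parts = code.split(".")
    return [
        CLASSAURUS[".".join(parts[:i])][lang][0]
        for i in range(1, len(parts))
    ]


if __name__ == "__main__":
    print(code_for("riz"), code_for("Paddy"))
    print(broader_chain("E1.1.1", "en"))
```

Here the check for inclusion is a simple prefix test on the code, which is the kind of shortcut the paragraph has in mind when it contrasts a code with "logical derivations and deductions".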
Yet another
interesting factor is the development of DBMS of various types, independent of
the Bibliographic Databases of the Library field. There were the Hierarchical
DBMS, the Network DBMS and the famous Relational DBMS. Most of them used fixed
fields (fixed length data), and the idea of repeating data elements got
incorporated quite late. Librarians recognized variable length and repeatability
of data a long time ago. The bibliographic type databases and the DBMS types
existed separately and grew separately, each not recognizing the existence and
the niceties of the other. Maybe it is better to keep them that way instead of
mixing them and creating an omnibus solution.
The inverted
index and the post-coordination technique of search (Librarians called
it "coordinate indexing", post-coordinate retrieval and so on; others preferred to
call it "Boolean search" based on inverted indexes, though whether Boole himself would have
allowed his name to be used this way is doubtful) emerged as the panacea for
information systems.
Well, Librarians knew very well, from the day
"Coordinate indexing" as a concept and as a method was founded by Mortimer Taube
in 1951, that this post-coordinate indexing would result in many pitfalls and in
time wasted scanning the retrieved results further to discover the relevant ones.
There have been several papers published in the Library literature,
some even entitled "Pitfalls of the post coordinate index", "Why
post-coordination Fails" and so on. Pre-coordinate vs. post-coordinate indexing
studies have been a regular part of the Library science curriculum for the past
30 years. In fact, those who have now found out that the search engines are not
retrieving as they should, and who advocate other forms and winding solutions, must
go through the texts and research articles taught in Library Science schools so
that they evolve better solutions. I understand that faceted classification type
schemes are being re-invented with new names and slightly altered representation,
and called by high sounding, awe inspiring names like Ontologies!
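One of the pitfalls mentioned above, the false coordination of terms, can be shown with a small sketch. The two documents, their keywords and their headings are hypothetical, and the heading syntax is an assumption rather than any standard one. A Boolean AND over keywords cannot tell who evaluates whom, while a pre-coordinated heading keeps the relationship.

```python
# Two hypothetical documents with the same keywords but different subjects.
DOCS = {
    1: {
        "keywords": {"teachers", "evaluation", "students"},
        "heading": "Teachers, Evaluation by Students",
    },
    2: {
        "keywords": {"students", "evaluation", "teachers"},
        "heading": "Students, Evaluation by Teachers",
    },
}


def post_coordinate(docs, *terms):
    """Boolean AND: return documents that contain all the given keywords."""
    wanted = {t.lower() for t in terms}
    return sorted(d for d, rec in docs.items() if wanted <= rec["keywords"])


def pre_coordinate(docs, heading_prefix):
    """Return documents whose structured heading starts with the prefix."""
    prefix = heading_prefix.lower()
    return sorted(
        d for d, rec in docs.items()
        if rec["heading"].lower().startswith(prefix)
    )


if __name__ == "__main__":
    # Both documents come back, although only one is wanted.
    print(post_coordinate(DOCS, "teachers", "evaluation", "students"))
    # The heading keeps the roles apart.
    print(pre_coordinate(DOCS, "Teachers, Evaluation"))
```

The extra document returned by the Boolean search is the kind of result the user then has to scan and discard by hand.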
I did see two papers addressing this. One is
Prof. Dagobert Soergel's "The rise of ontologies or the reinvention of
classification" <http://www.dsoergel.com/cv/B70.pdf>, published in the Journal of the American Society for
Information Science. 1999; Oct; Vol. 50(12); p 1119-1120. How I wish it were
published again in the new "Journal of Ontology" also, and in several other
computer science journals, and a copy sent to each and every one working on the
semantic web. (I would celebrate such an event with Sivakasi
crackers).
The other one is Prof. Marcia J. Bates's "After the Dot-Bomb: Getting Web Information Retrieval Right This Time", published in First Monday <http://www.hastingsresearch.com/net/08-net-information-retrieval.shtml>.
I also found a small note that asks a pertinent question - do we have to invest in the pre-coordinate indexing which was thrown away for the cheap coordinate indexing? <http://www.websearchguide.ca/netblog/archives/003621.html>. A full article, "Metadata--Think outside the docs!" by Bob Doyle, is available at <http://www.econtentmag.com/Articles/ArticleReader.aspx?ArticleID=7947>.
If I meet these authors, Prof. Soergel, Prof. Marcia J. Bates and Prof. Bob Doyle, I will give them each a mouthful of sugar. This is the way Thamilzhs tell you how pleased they are with what one said. (I am giving the Thirunelvaeli Thamilzh expression for this in transliterated form: "Avunga vayilae ainthu aru cheeni allippodanum").
Here is our paper published in 2002! You may read the
Epilogue given below afterwards.
Click to get Faceted Indexing
Based System for Organizing and Accessing Internet Resources
EPILOGUE
The Web is an information system, not just a computer system.
Without information the web is an empty network. Consult those who have handled
information for decades. I heard the phrase "Content is King", but those who have
managed the Mighty King have been left out. Brainstorming
by a good representative set of persons involved with information
(communications specialists, librarians, bibliographic information system design
experts, deep web system designers, DBMS and IM stalwarts and the new wave of
web processing language experts) would be needed to set a new direction for the
web, based on a thorough analysis of the types and variety of resources, their
characteristics, and how the different professions have dealt with them, so as to
develop a solid foundation. The DBMS type web
resources can have a header giving the record structure of their database, so as to
allow any agent to understand the structure, formulate a search query according to the
search system as explicated in the record structure / database structure,
submit the query to the database system and wait for the answers. The
answers could then be taken by the agent and further processed. The record
structure could be the age-old, easy COBOL Data Division structure. The
hierarchy could be indicated by the level numbers, and the data names could be
defined using the same structure, each declared as variable or fixed along with its
Universal Code! (Ah, what an abhorrent idea in the days of the OOS and OOA!).
Well, a COBOL-like structure would indeed be liked by all. The Web is to be built up
by common people who may not even care to know how to indent "if then else"
properly. Even if user friendly design tools are made available, it would be more
creative to make the users participate in "developing" the web, not just growing
the web in terms of size. (Maattai thanni kaattaththaan mudium).
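The header idea can be sketched as follows. The header text, its keywords (VARIABLE, FIXED, CODE=) and the universal codes are hypothetical. The layout is only COBOL-like and is not valid COBOL or any existing web standard. An agent reads the level numbers to recover the hierarchy, then phrases its query in terms of the universal codes.

```python
# A hypothetical, COBOL-like structure header a database might publish.
HEADER = """\
01 RESOURCE
   05 TITLE        VARIABLE  CODE=U200
   05 CREATOR
      10 SURNAME   VARIABLE  CODE=U101
      10 FORENAME  VARIABLE  CODE=U102
   05 YEAR         FIXED     CODE=U300
"""


def parse_header(text):
    """Turn level-numbered lines into a dict keyed by dotted data-name path."""
    fields = {}
    stack = []  # (level, name) pairs from the outermost group inwards
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        level, name = int(parts[0]), parts[1]
        # A line with the same or a lower level number closes open groups.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, name))
        path = ".".join(n for _, n in stack)
        kind = parts[2] if len(parts) > 2 else "GROUP"
        code = parts[3].split("=", 1)[1] if len(parts) > 3 else None
        fields[path] = {"level": level, "kind": kind, "code": code}
    return fields


def build_query(fields, criteria):
    """Translate {data-name path: value} into {universal code: value}."""
    query = {}
    for path, value in criteria.items():
        field = fields.get(path)
        if field is None or field["kind"] == "GROUP":
            raise KeyError(f"{path} is not a searchable data name")
        query[field["code"]] = value
    return query


if __name__ == "__main__":
    fields = parse_header(HEADER)
    print(build_query(fields, {"RESOURCE.CREATOR.SURNAME": "Bhattacharyya",
                               "RESOURCE.YEAR": "2002"}))
```

Because the query is expressed in codes rather than local data names, the same query could in principle be sent to any database that publishes a header using the same codes.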
FINAL WORD:
As a final word,
I wish to state that the Library Profession is a noble profession. The profession as
such did not engage in any business to make money. I am sad to see that the
contributions of such a profession, which has helped everyone to overcome their
intellectual weaknesses / knowledge gaps, are being kidnapped, hijacked, for the
glory of a few, without acknowledgement and recognition. Librarians and Library
schools should start research projects to develop better techniques of
organizing the web. Monolithic enumerative classification schemes will not help.
Independent of what the so called web standard builders are doing, Librarians
and Library schools should carry on their research without worrying about the
winding ways of the web standard developers.
Please make copies of
Prof. Dagobert Soergel's paper "The rise of
ontologies or the reinvention of classification"
<www.dsoergel.com/cv/B70.pdf>, published in the Journal of the American
Society for Information Science. 1999; Oct; Vol. 50(12); p 1119-1120, as well as
that of Prof. Marcia J. Bates, "After the Dot-Bomb: Getting Web Information
Retrieval Right This Time", published in First Monday <http://www.hastingsresearch.com/net/08-net-information-retrieval.shtml>,
and circulate them to all the Web experts
and researchers.
If anyone can prepare a paper following the model of
Prof. T.D. Wilson's, please prepare and try to publish it in computer science
and web science journals.
"Like Library Science, Web science is evolving
as a Social Science dealing basically with information. To force it with nuts
and bolts to keep it as a hard science may retard this evolution".
To ignore the Basic Laws
(Law of Parsimony, Law of Symmetry, Law of Impartiality etc.), applicable to
intellectual work in all areas of knowledge; the Fundamental Laws (the Five Laws of
Library Science), applicable to a discipline dealing with information; the Canons
(Canons of Cataloguing, Canons of Classification etc.), applicable to a branch
of Library Science; and the Principles (Principles for Helpful Sequence of Array
Isolates in Classification etc. - "array" here does not refer to the allocation of
an array in memory, but to a set of coordinate ideas in a hierarchy),
which constitute the theoretical foundation, along with the Tools and Techniques
developed to handle Information, would lead to reinventing the wheel after
spending a lot of manpower, money, resources, time and energy.
I welcome your comments and suggestions.
24 Feb, 2007
E-mail: devadason_f_j@yahoo.com