
Modular scenarios in the electronic age

1) Meta-Information

1.1) Title: Modular scenarios in the electronic age

1.2) Authors: Joost G. Kircz and Frédérique Harmsze

1.3) Address: Van der Waals-Zeeman Instituut, Universiteit van Amsterdam

Valckenierstraat 65-67, 1018 XE Amsterdam, The Netherlands

1.4) Internet Address: {Kircz, Harmsze}@wins.uva.nl

www.wins.uva.nl/projects/commphys

1.5) Publication: Conferentie Informatiewetenschap 2000

2) Positioning

The scientific community is confronted with a manifest information overload. Due to the great advances in electronic communication, the fear of a massive information infarct only becomes more real. However, as all new technologies strongly influence the way information is created, organised and presented, we can try to rescue the situation by taking the intrinsic properties of the new electronic technology as a starting point for a novel communication framework.

In an electronic environment, information is stored according to schemes independent of the consecutive space or time order of the original information and is therefore randomly accessible. In this contribution, we develop the consequences of these essential characteristics of distributed storage and random retrieval for information handling in a bottom-up approach. We first briefly survey how the present paper-based situation came into being. We then proceed with the conjecture that the future of electronic scientific communication will be characterised by modularity. Our aim is to design such a new, modular framework, in which documents are presented as a collection of coherent entities. We discuss at length the idea that, in an electronic environment, links between modules become information objects on an equal footing with modules themselves, with the concomitant need to design representation classes. We then present the outline of such a framework. In this presentation, we emphasise more general aspects than those presented in an in-depth study in physics that was recently published. The results presented here will hopefully serve as ingredients for applications in different fields in natural science or medicine. Based on these general notions, concrete designs, with a strong accent on authoring tools, can be made.

2.1) Situation: What is it all about?

Technology has always both helped and hindered human communication. In person-to-person contact, speech and gesture establish a unified framework. In oral society, stanza, rhythm and rhyme were important mnemonic techniques for structuring information [Ong82]. Information was communicated according to rules governed by the need for information to be memorised by humans. With the transition to literacy, the communication process changed fundamentally. Written text and illustrations liberated the human brain from the task of memorising lengthy texts. It established new ways of structuring information closer to the intrinsic character of the information itself. Thus, e.g., long lists of data no longer demanded all kinds of memory aids, but were now printed in their full, dull and monotonous nature. The one-dimensional time axis of speech is now mapped onto the two-dimensional plane of the written text. The unique qualities of paper enabled the invention of elaborate 2D-graphical representation of data, which introduced its own pictorial rhetoric [Tuf83, Tuf90]. Writing developed its own style, away from pure speech [Hav86, Mcl62]. However, the ability of the reader to jump haphazardly through a text does not imply that in writing, communication stops following the old oral concept of the linear ordering of a train of thought. A well-structured argument still follows sequential steps. Style manuals of all kinds insist on the logical sequential order of the unrolling story. The author writes for the unknown reader and therefore creates a path that can be followed by everyone. For a discussion on the different roles of authors and readers, see [Kir91]. The reader, be (s)he an informed reader or an ignoramus in the field, hardly ever starts reading at line one to follow diligently the entire text. Before the decision is taken to read a text, it is scanned and browsed through. People then decide to skip the text, to read parts of it or to read it in full.
In contrast to listening, reading allows for non-linear and haphazard consumption. Associative and lateral thinking is much easier using the two-dimensional plane of paper, than the forced-march route of the one-dimensional time-axis of spoken language. Due to paper, the selective consumption of information took off, as paper allows for non-sequential reading and skipping large parts.

Interestingly the need for certified information, necessary for maintaining the integrity of published texts, induced very strict rules for the publishing of non-fiction and in particular scientific texts. The role of technology, in particular the printing press, in shaping texts and distributing knowledge and information has been extensively studied by [Eis79, Eis83]. The role of the necessary transport material, i.e., paper, in the same way that the development of the car demands the development of roads, is the starting point of [Feb96]. The role of multiplication as the prime mover for the dissemination of all kinds of semi- and even non-scientific information as motor for the transcending of pre-scientific thinking to modern scientific practice is emphasized by [Eam94].

As a form of research communication, the scientific paper is particularly interesting, as it demands a high level of clarity, integrity and standardisation. In the following, we will narrow down our discussion to this type of work, although many observations and remarks may also hold for other types of work, especially in the educational realm. Much has been written on the development of the scientific article as carrier for scientific information transfer [Mea98 and references therein]. In an earlier article [Kir98], we argued that the linear form of the scientific article is a typical consequence of print on paper. With the emergence of electronic media, the two-dimensionality of paper is enhanced by new parameters that allow the free rearrangement of every piece of information. This ability for nonlinear reading is rooted in the way electronic information is stored. Independent of the intended order of the intellectual content, the bit-representation is essentially nonlinear and can even be distributed among geographically disconnected memories. This new fact not only allows the reader to follow an individual reading path through the work, but also allows for the addition of comments and extra, secondary, information to every single part of a particular work, and for the expansion of each part.

Another important advantage of modularity is that for the searching reader, keywords are put into context. In traditional information retrieval a query composed of a combination of keywords results in the retrieval of a series of more-or-less relevant documents. In a modular environment the name of the module adds context to the keyword. Now explicit queries can be envisioned like "Keyword" AND "Module X (e.g., Results)" NOT "Module Y (e.g. Introduction)".
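The kind of query described above can be sketched in a few lines of Python. This is only an illustration of one possible reading of such a query (the keyword must occur in module X of an article and must not occur in module Y of the same article); the record type ModuleText, the function search and the tiny corpus are hypothetical and are not part of any existing retrieval system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ModuleText:
    """One module of a modular article (hypothetical record layout)."""
    article_id: str   # identifier of the modular article
    module_type: str  # name of the module, e.g. "Results" or "Introduction"
    text: str         # textual content of the module


def search(corpus: Iterable[ModuleText], keyword: str,
           in_module: str, not_in_module: Optional[str] = None) -> List[str]:
    """Return ids of articles whose `in_module` contains the keyword and
    whose `not_in_module` (if given) does not contain it."""
    corpus = list(corpus)
    kw = keyword.lower()

    def articles_with_keyword(module_type: str) -> set:
        return {m.article_id for m in corpus
                if m.module_type == module_type and kw in m.text.lower()}

    hits = articles_with_keyword(in_module)
    if not_in_module is not None:
        hits -= articles_with_keyword(not_in_module)
    return sorted(hits)


# Invented example corpus, for illustration only.
corpus = [
    ModuleText("A1", "Introduction", "Earlier work used a molecular beam."),
    ModuleText("A1", "Results", "Cross sections are listed per molecule."),
    ModuleText("A2", "Introduction", "We study inelastic scattering."),
    ModuleText("A2", "Results", "The molecular beam data show a clear peak."),
]

# "molecular beam" AND "Results" NOT "Introduction"
print(search(corpus, "molecular beam", "Results", "Introduction"))
```

Because the module name is part of every record, the same keyword index serves both the traditional query (any module) and the contextual one; only the filter changes.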

2.2) Central problem: A new granularity and its structure

An important precondition for the exploitation of these dynamic capabilities is that we introduce a well-defined granularity of information. The design of such a new structure has to be based on generally accepted ideas about the consistency and integrity of scientific communications, as well as on the analysis of standard scientific papers. On every level of granularity, there is a need for surveyability. Especially in an electronic modular environment, where authors can add their own modules with new work or can add comments to old works, clear traffic signs are needed. The purpose of this paper is to discuss the new opportunities and to present a general framework for modules and their relations that enhance the research communication in an electronic environment.

3.1) Methods: General ideas on modules

As indicated above, the typical ‘grain size’ in a print-on-paper environment is the scientific paper as we know it. A consequence of the electronic environment is the introduction of a new type of granularity of information. This new granularity implies new ways of ordering: an ordering that mimics the old trusted methods, and also a structuring that exploits the new capabilities. As with print-on-paper, we face, on the one hand, the aspect of reproduction, i.e., the representation in one technique of an object made using another, e.g., the photograph of an oil painting. Reproduction is specifically the case if some form aspects of the object are essential. For instance, the text of a speech or song will always be intended to be read aloud and in a fixed order, even if it is printed on paper or stored in electronic form. On the other hand, in the electronic realm, we also encounter novel ways to structure information and knowledge that are unique to the medium: just as elaborate data tables became possible on paper, CAD/CAM simulations are unique to electronic media.

For scientific research papers, we can distinguish modules of information that represent self-contained, though related, types of information. Writing a modular article does not mean a simple "cutting-up" of an existing linear text, but implies a different grouping of the same information. As the reasoning of the author needs a sufficient length of text to present the message in an intelligible way, modules of information have no intrinsic length (or in the case of pictorial information, no surface) limits. The essence of the idea of modularity is that modules comprise cognitive units. We reported a first approach to such a modular structure in [Har96]; an in-depth analysis of the concept of modularity is given in [Kir98].

Thus, because information in an electronic environment can be accessed from many different directions, re-used piecewise and augmented on demand, a novel structure is needed. Hence, a new class of descriptors is needed. Such a novel system must tally with the standard requirements for scientific communication with regard to clarity, quantity, quality and relevance. In the Communication in Physics Project [Har99, Har00], an elaborate attempt has been made to develop such a new structure in the field of molecular physics. We analysed a coherent corpus of physics articles with the aim of creating a new model for electronic articles in experimental sciences based on selective reading and multiple use of information. In this project, we propose a scheme for writing scientific papers in a modular form, including instructions for authors. A full description of this model is given in the PhD thesis of Frédérique Harmsze [Har00]. Below, we rely on this work, but try to describe our ideas on a more generic level.

Following [Har00], we define a module as a uniquely characterised, self-contained representation of a conceptual information unit aimed at communicating that information. Not its length, but the coherence and completeness of the information it contains make it a module. This definition leaves open that modules are textual or, e.g., pictorial. Modules can be located, retrieved and consulted separately as well as in combination with related modules.

Elementary modules can be assembled into higher-level, complex modules. We define a complex module as a module that consists of a coherent collection of (elementary or complex) modules and the links between them. Using a metaphor, elementary modules are 'atomic' entities that can be bound into a 'molecular' entity: a complex module.

We distinguish two types of complex modules: compound modules and cluster modules. In a compound module, related (albeit possibly dissimilar) modules are aggregated to form a new module on a higher level. An example of an aggregated module is the module 'Experimental methods' that can be composed of lower-level modules representing the various components of a measuring device. In our physics corpus, we encounter, e.g., a molecular beam apparatus that has relatively independent components like: one or more sources of a particle beam, an interaction chamber, and a detector.

The central idea of a cluster module is the generalisation of specific concepts that its constituent modules focus on. An example of a cluster module is the module 'Raw data', composed of various elementary modules reporting the results of the same general type of measurements involving different molecules.
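The composition of elementary modules into compound and cluster modules can be pictured as a simple recursive data structure. The Python sketch below is a minimal illustration under our own assumptions: the class names, the `kind` field and the example module names for the cluster are hypothetical, and the links that a complex module also contains by definition are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ElementaryModule:
    """An 'atomic' module: a self-contained conceptual information unit."""
    name: str
    content: str = ""


@dataclass
class ComplexModule:
    """A 'molecular' module: a coherent collection of other modules.

    kind is "compound" (aggregation of related, possibly dissimilar modules)
    or "cluster" (generalisation over modules of the same general type).
    """
    name: str
    kind: str
    parts: List[Union[ElementaryModule, "ComplexModule"]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in ("compound", "cluster"):
            raise ValueError(f"unknown kind of complex module: {self.kind!r}")


def elementary_names(module: Union[ElementaryModule, ComplexModule]) -> List[str]:
    """List the names of all elementary modules contained in `module`."""
    if isinstance(module, ElementaryModule):
        return [module.name]
    names: List[str] = []
    for part in module.parts:
        names.extend(elementary_names(part))
    return names


# Compound module, following the molecular beam apparatus example.
apparatus = ComplexModule("Experimental methods", "compound", [
    ElementaryModule("Beam source"),
    ElementaryModule("Interaction chamber"),
    ElementaryModule("Detector"),
])

# Cluster module: same type of measurement on different molecules
# (the molecule labels are placeholders).
raw_data = ComplexModule("Raw data", "cluster", [
    ElementaryModule("Measurements on molecule A"),
    ElementaryModule("Measurements on molecule B"),
])

print(elementary_names(apparatus))
```

Since a complex module may itself contain complex modules, the same traversal works at every level of granularity.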

A direct consequence of splitting up the different kinds of information is that in complex modules, we need some extra text, summarising the essence of the composing modules in so-called module summaries. These module summaries also play an important role in the creation of an abstract for the entire modular article. In a modular environment, the abstract is particularly important, being a concise expression of the coherence of the discourse. This aspect of modularity is the subject of a related, ongoing, project [Tol99]. Summaries will be tagged separately and hence can be skipped by the reader on demand.

3.2) Methods: General ideas on relations and links

As mentioned in Section 2.2 (Central Problem), on every level of granularity of information there is a need for surveyability. Especially in an electronic modular environment, where authors can add their own modules with new work or can add comments to old works, clear traffic signs and road maps are needed. Traditionally, we have meta-data as the representation or identifier of a piece of information. Meta-data describe the type of information contained in a unit (a paper, a painting, a module). Various classes of meta-data exist and preferably form a complete and non-overlapping system of coordinates, e.g.: bibliographic information, domain-specific keywords, etc. Information Retrieval (IR) is the science and art of manipulation of these meta-data in order to meet a more or less well-defined query (information need) of a reader. In a distributed environment, we need new kinds of meta-data: meta-data that express the relations between different modules. As far as it indicates the organisational structure of the related modules, this is, in essence, an extension of the traditional bibliographic meta-data. Just as a paper is part of a journal section, a journal section part of a journal, and a journal part of a publication programme, so a module will be part of a modular article and a modular article part of a larger database of modular publications.

The novelty of a modular environment is that we define modules as conceptually separate entities within an organised structure. Hence, the conceptual relations between modules are primary to the organisational ones. This means that, next to organisational relations, a series of discourse relations based on the communicative function of the message, as well as the content, have to be introduced. The expression of such relations in an electronic environment is given through explicitly-labelled links. We define a link as a uniquely-characterised, explicit, directed connection, between entire modules or segments of modules, that represents one or more different kinds of relations. A link then represents not only the organisational relations between two information units, but also harbours information on the why and wherefore. As a result, the various kinds of relations expressed in hyper-links lead to special classes of meta-data describing the attributes of that link. An immediate advantage of this approach is that from an IR point of view, we can keep all the machinery and just add these new types of meta-data to our system.

Thus, in a modular electronic information system, hyper-links are objects that represent particular relations and do not just represent the fact that something is somehow related to something else. A hyper-link is endowed with information that expresses the relations it represents. One link can represent many relations, such as: the target belongs to the same work, contains a more elaborate account and a different presentation of the same data. A link might be created as part of a scientific report, but can also be created later by somebody who links already-existing material with new material. These new links to old material can represent important steps in the scientific discourse. Therefore, every link also has to carry the bibliographic meta-data of its originator. This way, a hyper-link becomes a new type of information object carrying particular meta-data, representing the coherence of the distributed information. With an acknowledgement to particle physics, one could call hyper-links messenger-objects. This brings us to an important consequence of modularity: in an electronic modular environment links are objects on the same footing as modules.
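To make the notion of a link as an information object concrete, the following Python sketch models a link as a record with its own identifier, a direction, one or more relation labels and the meta-data of its originator. It is a minimal sketch under stated assumptions: the field names, the module identifiers and the relation labels are hypothetical and serve only to mirror the definition given above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Link:
    """A uniquely characterised, explicit, directed connection.

    All field names are illustrative assumptions, not a fixed schema.
    """
    link_id: str                 # unique characterisation of the link
    source: str                  # module the link starts from
    target: str                  # module the link points to
    relations: FrozenSet[str]    # one or more kinds of relation
    originator: str              # bibliographic meta-data of the link's creator
    source_segment: Optional[str] = None  # optional segment within the source
    target_segment: Optional[str] = None  # optional segment within the target

    def __post_init__(self) -> None:
        if not self.relations:
            raise ValueError("a link must represent at least one relation")


def links_with_relation(links: Iterable[Link], relation: str) -> List[Link]:
    """Select the links that represent the given kind of relation."""
    return [link for link in links if relation in link.relations]


links = [
    # Created by the author as part of the report: one link, several relations.
    Link("link-1", "article-1/treated-results", "article-1/raw-data",
         frozenset({"same work", "more elaborate account",
                    "different presentation of same data"}),
         originator="author of article-1"),
    # Created later by somebody else, linking old material to new material.
    Link("link-2", "article-2/interpretation", "article-1/raw-data",
         frozenset({"comment"}),
         originator="later reader"),
]

print([link.link_id for link in links_with_relation(links, "same work")])
```

Because the relation labels and the originator are ordinary attributes of the link, they can be indexed and queried with the same machinery as the meta-data of modules.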

From the beginning of Hypertext, authors have been aware of the need to structure the ever-increasing number of linked objects. A serious problem with all attempts to attack this intrinsic complication was the impossibility of proper implementation and therefore experimentation with linking schemes. The latest developments within the World Wide Web Consortium projects suggest serious breakthroughs. First we have the development of the Resource Description Framework as a meta-data framework in which it becomes possible to add attributes and their values to any web resource [RDF99]. Secondly we have the development of XLink, of the W3C XML Linking Working Group [XML99]. These developments give confidence to the discussion on link systems becoming an essential ingredient in the design of future communication systems. For most schemes discussed in the literature, it is clear that at least two classes of link description are needed, one describing the organisational relations and another dealing with meaning and understanding. All authors call upon linguistic tools, however, without a systematic analytic approach. For that reason, most of the suggestions remain highly ad hoc. In many cases, an attempt is made to provide a complete set of all possible link types. An early, and oft-quoted, taxonomy can be found in Trigg [Tri83]. Here 70-odd types of links are classified in two categories. The first category contains normal links that "serve to connect nodes making up a scientific work, as well as to connect nodes living in separate works", whilst "commentary links connect statements about a node to the node in question". The problem with this taxonomy is that it is of a purely phenomenological kind without any attempt to structure links according to some deeper understanding of speech communication research.
In an attempt to structure hypertext for Classics and Religious studies, DeRose presents a more elaborate link taxonomy, split into the two main categories: extensional links and intensional links [Der89]. The extensional links are further split into relational links and inclusive links. Relational links consist of, first, associative links that connect arbitrary pieces of text and can be considered to follow a discourse, for which many types named by Trigg could be useful. Secondly, relational links contain annotational links as referential links that represent connections from portions of a text to information about the text. In the inclusive link realm, DeRose lists sequential links and taxonomic links that associate lists of properties with particular document elements. In the second main branch of intensional links, DeRose lists link types that follow strictly from the structure and content of the document. Though very ingenious, the scheme lacks sufficient transparency for easy adoption and is insufficiently fine-grained, particularly in the case of relational links.

In the work of Baron et al. [Bar96] tests were executed on a limited corpus of the OCLC Cataloguing User’s guide, augmented with labelled links according to three classes: semantic (similar, contrast, part of), rhetorical (definition, explanation, continuation, illustration and summary) and pragmatic links (warning, prerequisite, usage, example). Their tests show that more effective searching becomes possible. A final example of an ad hoc approach is the usage by Rutledge et al. [Rut00] of the extensive list of rhetorical indicators by Mann et al. [Man88, Man89], to assist the structuring of multimedia presentations. Furthermore, here, a whole list of possible relations between information objects is used.

In an earlier paper [Kir91], we suggested a structured taxonomy of the lines of reasoning in scientific texts as a complement to the semantic networks of keywords. In this early work, a clear distinction between coherent content and relations between this content is not yet made. The main argument was that a structural taxonomy was badly needed and preferred above a mere taxonomy of rhetorical indicators. In the Communication in Physics Project, we are collaborating with the University of Amsterdam, Department of Speech Communication, Argumentation theory and Rhetoric, to try to find a systematic way of designing a linking taxonomy based on a linguistic approach. An important aspect is that the analysis will be based, not only on formal structural relations, but also on the pragmatic approach of speech communication. Here, the actual usage of language is the starting point, in the framework of a critical assessment of an argumentation that legitimates the step from premises to standpoints. Scientific communications are texts in which an author wants to convey a problem, solution or opinion within the context of a broader scientific quest. For that reason, large parts of scientific papers can be considered as argumentative texts. In a working electronic environment, we need a systematic structure in which the relations between different parts of the unravelling story have a clear meaning. Furthermore, because we see links as important information objects in themselves, we need to limit their number to create an efficient and effective authoring environment. The results presented in this contribution give the general outline of a structure of classes of modules and classes of their relations, which can serve as a starting point for more detailed domain-specific applications.

4) Results: Descriptor classes in a modular framework

As information in an electronic environment has essential nonlinear characteristics, we have to develop a framework in which the unique characteristics of particular information modules are well-defined in order to let different kinds of modules be handled according to their own, unique, aspects. In order to build new information systems, we have to analyse the various ways in which that information can be represented and identified. Below, we present an approach in which we distinguish between different classes of representation and identification. As every kind of module can have features belonging to different classes, classes have to be of a different and mutually exclusive nature. These class-dependent features can then be structured in controlled keyword lists or thesauri per class, and can be considered as meta-data. It is important that the classes do not overlap in meaning; in mathematical terms, one would say that the classes form orthogonal representations, each with its own specific parameters. In the analysis, an important aim is also to ensure that the collection of classes is exhaustive and economical. Next to classes identifying the content and form of an information item, we also have to identify the classes describing the mutual relations between these items, as argued above. A complication that will not be addressed further in this paper is that meta-data can be attached to a module or to a particular part of a module, for instance a word, a picture element, or a specific number.

4.1) Module descriptors

In order to be able to determine what is 'similar information' to be grouped together and represented in a self-contained module and, subsequently, to determine how to tag the resulting module, we need an unambiguous typology of scientific information. The analysis of Harmsze [Har00] and the subsequent testing of the model by rewriting physics articles shows that the following types are sufficient for the domain analysed and by proxy for most experimental sciences. Obviously in different fields the exact structure and definition of the modules can vary.

1-Bibliographic information

It goes without saying that our first class of identification is the traditional bibliographic data. As, in a multi-media environment, the presentation form of the same object can change, we prefer to confine the class of bibliographic data to data strictly concerning the author, the publisher, title, dates, etc. All references to the appearance of the work should be in a separate class.

2-Presentation forms

It is obvious then, that the next division in types of information will be according to appearance: text, figure, graph, photo, hologram, sound, simulation, etc. The accompanying meta-data are items like presentation type, file length, colour scheme, frequency range, language type, resolution, etc. This division in "physical" appearance of information objects does not tell us anything about what is communicated. In the Communication in Physics project, we concentrate on the information, regardless of its presentation form; hence this class was not touched upon.

3-Domain-dependent keywords

On the same level, we can introduce the classical domain-specific keyword and classification code systems. Both traditional classes of identifiers naturally keep their relevance in an electronic environment.

We introduced in the project two new classes of typology: a characterisation by the range of information and a characterisation by its conceptual function, i.e., by the role the information plays in the scientific problem-solving process.

4-Range-based information

By characterising information by its range, so-called microscopic, mesoscopic and macroscopic modules can be introduced.

4.1-A microscopic module represents information that belongs only to one particular article, e.g., information concerning the specific problem addressed in that article.

4.2-A mesoscopic module functions at the level of an entire research project; it is created for multiple use in several articles issued from the same project. For example, information about the experimental set-up that has been used in a series of experiments can be presented in a mesoscopic module and connected to several articles reporting experimental results. The same holds for general theoretical approaches or computational algorithms used in a series of investigations.

4.3-A macroscopic module represents information that transcends the level of the research project; this type of firmly established information is given, e.g., in books, lecture notes, etc.

5-Conceptual function

The main characterisation in a modular structure is the identification of information by its conceptual function indicating the different steps in the research process. Modules are self-contained, therefore, every module represents only one well-defined aspect of the discourse in the article. Of course, this self-containedness does not mean that one module is usually sufficient to understand the whole work. Modularity enables the reader to immediately zoom in to those aspects he/she is interested in. If so wanted, the whole work, i.e., all the related modules and, if needed, the necessary auxiliary information presented in meso- and macromodules, can be retrieved and read as if it were a traditional article.

As modules are meant to be coherent bodies of content, it makes sense to have only a limited number of members in this class. Our starting point is the prototypical section structure of scientific papers in experimental sciences: Introduction, Methods, Results, Discussion and Conclusions. This sequence represents the normal flow of a scientific narrative, but the way it is used in practice presupposes that the article will, indeed, be read sequentially from beginning to end. Based on this prototypical structure we have defined the following, more systematic, modules.

5.1- Positioning is a complex module consisting of two modules.

5.1.1- The module Situation, describing the embedding of the work, and

5.1.2- the module Central Problem, stating the why of the work in question.

In this complex module, all the information the reader needs to know about the background of the problem in question and the particular aspects dealt with in the article, is grouped together. It is immediately clear that the module Situation, which reviews the embedding of the work, can be to a large extent replaced by a pointer, linking the work in question to a description elsewhere. Such an introductory text is a typical kind of meso-information. This way, the enormous redundancy of information presented in introductions of articles can be avoided. Obviously the module Central Problem is an essential module, as this module provides the intentions of the author, given the context. For an informed reader, this module can play a decisive role in the decision to drop the article or to continue reading.

5.2- Methods is a complex module that can be built up from separate modules representing:

5.2.1- the theoretical, 5.2.2- experimental, and/or 5.2.3- numerical methods employed. If an article is one of a series, a substantial part of the information about the methods can be represented in mesoscopic modules for multiple use; e.g., in a pure experimental article using a standard instrument and employing a standard theory, both the Experimental Method and the Theoretical Method can be described elsewhere. In other fields like, e.g., organic chemistry or pharmacology, different types of methods can be defined, e.g., Preparatory Methods or Treatment Protocols, respectively.

5.3- The complex module Results allows readers to inspect the results without further need to read the whole article.

5.3.1- One of its two constituents is the module Raw Data. In printed articles, these data are hardly ever published, as that would require too much space. In an electronic environment these data can become directly available to the reader. By accessing the data in this way, the reader is able to use the data without the preferred interpretation of the originator. This enables the reader to merge his/her own data directly with the presented data for comparison and analysis. It also allows different people to apply different methods for data reduction to the same data.

5.3.2- The second constituent of the module Results is the module Treated Results. Here the raw data are handled according to the author's choice for data reduction and further treatment. The module Treated Results presents the smoothed data in the usual form in figures and tables, as we are familiar with from traditional journals.

5.4- The module Interpretation contains the core of the scientific reasoning in the article. Here, the author interprets the experimental results in the light of a theoretical model, e.g., by comparing them with theoretical results and experimental results obtained by others. An important observation in our analysis is that it is this module that maintains most of the characteristics of a classical paper. One can argue that our procedure in fact strips the traditional article from those components that can be presented as self-contained information or data entities. These modules do contain regular text and, if needed, reasoning and arguments.

The remaining core, the real scientific reasoning, argumentation and conjectures, remains an essay-like text. It is this part, in fact representing complex knowledge rather than quantitative information, that is the most difficult one to deal with.

5.5- Within the complex module Outcome, we distinguish two modules:

5.5.1- Findings, a compulsory module in which the author tries to answer the central questions stated in the module Central Problem.

5.5.2- We also have an optional module Leads to Further Research, in which ideas and suggestions for new work are expressed. It is clear that the module Findings represents the final results of the work, which together with the module Treated Results allows a reader to learn about what happened without the how and why.

6- Within the model developed in the Communication in Physics Project, we collect the earlier-mentioned traditional classes (bibliography, presentation form and keywords) as well as an abstract, a map of contents, list of references per module, and the acknowledgement in a separate module Meta-Information.
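The module hierarchy described above can be sketched as a small data structure. The following Python fragment is only an illustration under stated assumptions, not the implementation of the Communication in Physics Project: the class `Module`, its fields, the helper `quick_view`, and the names "Article", "Methods" and "Numerical Method" are hypothetical, and only modules named in this section are included.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """A module; it counts as complex when it has constituents."""
    name: str
    optional: bool = False
    constituents: List["Module"] = field(default_factory=list)

    def is_complex(self) -> bool:
        return bool(self.constituents)

    def elementary(self) -> List["Module"]:
        """Return the elementary modules contained in this module."""
        if not self.constituents:
            return [self]
        found: List["Module"] = []
        for part in self.constituents:
            found.extend(part.elementary())
        return found


# Assumed names: "Methods" and "Numerical Method" are illustrative labels.
methods = Module("Methods", constituents=[
    Module("Theoretical Method"),
    Module("Experimental Method"),
    Module("Numerical Method"),
])
results = Module("Results", constituents=[
    Module("Raw Data"),
    Module("Treated Results"),
])
interpretation = Module("Interpretation")
outcome = Module("Outcome", constituents=[
    Module("Findings"),
    Module("Leads to Further Research", optional=True),
])
article = Module("Article", constituents=[
    methods, results, interpretation, outcome,
])


def quick_view(root: Module) -> List[str]:
    """Modules that tell a reader what happened without the how and why."""
    wanted = {"Treated Results", "Findings"}
    return [m.name for m in root.elementary() if m.name in wanted]


print(quick_view(article))
```

Keeping the constituents inside the complex module makes the hierarchical relation explicit, so a selective view such as `quick_view` only has to walk the tree.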

4.2) Link descriptors

If we want a crisp definition for classes of relations, it makes sense to differentiate between two main classes: organisational relations, dealing with the pure structural aspects, and scientific discourse relations, dealing with the content and pragmatic relations such as arguments.

1- Organisational relations

With respect to the class of organisational relations, we suggest defining, as a minimum, the following types of relations.

1.1- Hierarchical relations are asymmetric relations between objects and their constituents, such as between complex modules and the elementary modules they contain.

1.2- Proximity-based relations reflect how close linked modules are, for instance between elementary modules that are part of the same complex module, article or larger collection of modules. An example is a set of links between a certain kind of measurement on different species, reported in a module for each species and collected in one complex module. Another example is photographs of the same animals in different environments, collected in one complex module.

1.3- Range-based relations exist between micro-, meso- and macromodules of the same or different communications.

1.4- Path relations allow for the construction of different reading paths, e.g., a first cursory reading path and a second in-depth reading path. Also, with the help of different discourse relations (see below), such different paths can be set out. In [Har00], a distinction is made between a so-called sequential path, connecting all modules available, and an essay path, mimicking the reading of a linear document.

1.5- Representational relations exist between different representations of the same information. A typical example is the relation between a data table and a graph. One can also argue that the relation between a specific word and its entry in a dictionary belongs to this type. In the model worked out by Harmsze, the latter kind of relation is part of the discourse relations (see below). This approach is warranted if such a link is added to clarify a term with the help of a dictionary or an encyclopaedia. In a purely organisational framework, one can think of a series of equivalent representations of the same information. For example, acetylsalicylic acid (regular chemical name) = 2-acetoxybenzoic acid (formal chemical name) = C9H8O4 (chemical formula) = Aspirin (trade name) = 2D picture of structure = 3D representation of structure.

1.6- Administrative relations relate the various modules to the various kinds of meta-information collected in the module Meta-Information.
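The six organisational relation types can be illustrated with labelled links between modules. The Python sketch below is a hedged illustration only: the names `OrgRelation`, `Link`, `equivalent_representations` and `follow_path` are hypothetical, the `label` field is an assumed way to name a reading path, and the cursory path through three modules is an invented example rather than one prescribed by the model.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import combinations
from typing import Dict, List


class OrgRelation(Enum):
    """The six organisational relation types listed above."""
    HIERARCHICAL = "hierarchical"
    PROXIMITY = "proximity-based"
    RANGE = "range-based"
    PATH = "path"
    REPRESENTATIONAL = "representational"
    ADMINISTRATIVE = "administrative"


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    relation: OrgRelation
    label: str = ""  # hypothetical field, e.g. the name of a reading path


def equivalent_representations(forms: Dict[str, str]) -> List[Link]:
    """Link every pair of representations of the same information."""
    return [
        Link(a, b, OrgRelation.REPRESENTATIONAL)
        for a, b in combinations(forms.values(), 2)
    ]


def follow_path(links: List[Link], start: str, label: str) -> List[str]:
    """Follow the PATH links that carry `label`, starting at `start`."""
    step = {
        link.source: link.target
        for link in links
        if link.relation is OrgRelation.PATH and link.label == label
    }
    route = [start]
    # Stop at the end of the path or when a module would be revisited.
    while route[-1] in step and step[route[-1]] not in route:
        route.append(step[route[-1]])
    return route


# The aspirin example from the text (textual forms only).
aspirin = {
    "regular chemical name": "acetylsalicylic acid",
    "formal chemical name": "2-acetoxybenzoic acid",
    "chemical formula": "C9H8O4",
    "trade name": "Aspirin",
}

links = [
    Link("Results", "Raw Data", OrgRelation.HIERARCHICAL),
    Link("Results", "Treated Results", OrgRelation.HIERARCHICAL),
    # Assumed cursory reading path; the choice of modules is illustrative.
    Link("Meta-Information", "Findings", OrgRelation.PATH, "cursory"),
    Link("Findings", "Treated Results", OrgRelation.PATH, "cursory"),
] + equivalent_representations(aspirin)

print(follow_path(links, "Meta-Information", "cursory"))
```

Because the relation type is carried by the link rather than by the modules, the same pair of modules can be connected by several links of different types.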

2- Scientific discourse relations

In [Har99] and [Har00], a distinction is made between two classes of discourse relations. One class is based on the communicative function. The author may want to argue or elucidate something in order to increase the reader's understanding or acceptance of the presented information. Hence, the author links the module to another module where the supportive information is given. In that link, the communicative function of the target with respect to its source can be made explicit. The different types of relations based on the communicative function are elucidation (which can be further expanded into clarification and explanation) and argumentation (which can be split into refutation or support of a standpoint). Important to note is that a clear distinction is made between links that aim to increase the reader's understanding and links that aim to increase the reader's acceptance of the presented material.

A second class consists of content relations, which can be defined next to elucidatory and argumentative relations. In particular, we defined dependency, elaboration, synthesis, comparison and causality. The elaborate scheme thus obtained is the result of an inductive process and presents the relations that are the most relevant in the corpus. An important issue for further research is how the various relations based on the communicative function and the content relations can combine to form specific combinations that are useful in a particular domain.
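The two classes of discourse relations can be written down as a small taxonomy. The Python sketch below is illustrative only: the names `COMMUNICATIVE`, `CONTENT`, `AIM`, `parent_function` and `describe` are hypothetical, and simply placing a communicative function next to a content relation in one label is an assumption, since how the two classes combine is an open question.

```python
from typing import Optional

# Relations based on the communicative function, with their subtypes.
COMMUNICATIVE = {
    "elucidation": ("clarification", "explanation"),
    "argumentation": ("refutation", "support"),
}

# Content relations named in the text.
CONTENT = ("dependency", "elaboration", "synthesis", "comparison", "causality")

# What each communicative function aims to increase in the reader.
AIM = {"elucidation": "understanding", "argumentation": "acceptance"}


def parent_function(name: str) -> Optional[str]:
    """Return the main communicative function for a type or subtype."""
    for parent, subtypes in COMMUNICATIVE.items():
        if name == parent or name in subtypes:
            return parent
    return None


def describe(communicative: str, content: Optional[str] = None) -> str:
    """Build a link label from a communicative and an optional content relation."""
    parent = parent_function(communicative)
    if parent is None:
        raise ValueError(f"unknown communicative function: {communicative}")
    if content is not None and content not in CONTENT:
        raise ValueError(f"unknown content relation: {content}")
    label = parent if communicative == parent else f"{parent}/{communicative}"
    if content is not None:
        label += f" + {content}"
    return f"{label} (aims at reader's {AIM[parent]})"


print(describe("support", "comparison"))
print(describe("elucidation"))
```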

A different approach might be based on Garssen [Gar97, Gar98], who showed that almost all so-called argumentation schemes discussed in the literature (i.e., schemes in which the acceptability of the premise that is explicit in the argumentation is transferred to the standpoint) can be divided into three categories that are clearly demarcated, homogeneous, mutually exclusive, and non-superfluous. These three relational categories can be used in speech communication in various ways: as a description, as an argument, and as a clarification or explanation. In the case of argumentation, a dispute has to be resolved; in the case of a clarification, we need to enhance the understanding of a statement.

2.1- Causal relations are relations where there is a causal connection between premise and conclusion (or between explanans and explanandum). This kind of relation exists between a statement or a formula and an elaborate mathematical derivation. Obviously, the uses of the causal relation as an argument and as an explanation lie close together.

2.2- Comparison relations are relations of resemblance, contradiction or similarity. The analogy is a typical subtype. Comparisons used as arguments are well known, such as the comparison of measured data from, e.g., the module Treated Results with theoretical predictions that fit within certain acceptable boundaries. We can also think of similarity relations, where results of others on similar systems are compared to emphasise agreement or disagreement. When the comparison is used as an elucidation, we can think of the relation between the description of a phenomenon and a known mechanical toy model. Furthermore, the link between a text and an image that illustrates the reasoning or results belongs to this category. Another example is the suggestion that a drug that is effective in curing a particular ailment might also help against look-alike symptoms, or the suggestion that look-alike physiological phenomena might have the same illness as their source.

2.3- Symptomatic relations are of a more complicated nature. Here we deal with relations where a concomitance exists between the two poles. This category is more heterogeneous than the other two. This kind of relation can be based on a definition or a value judgement, such as the role of a specific feature that serves as a sufficiently discriminatory value to warrant a conclusion. We can think of a relation between the textually described results and a picture in which a specific feature, like a discontinuity in a graph, is used to declare a particular physical effect present or not.

Obviously, next to these three classes, we still need relations that represent dependency of information in a line of reasoning, and elaboration to cater for various levels of readership (e.g., extra information for the freshman reader).
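In this alternative scheme, a link carries two coordinates: one of the three relational categories and the way that category is used. The Python sketch below illustrates this with examples taken from the text; the names `Category`, `Use`, `DiscourseLink` and `select` are hypothetical, and classifying the symptomatic example as an argument is an interpretation, not a statement from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Category(Enum):
    CAUSAL = "causal"
    COMPARISON = "comparison"
    SYMPTOMATIC = "symptomatic"


class Use(Enum):
    DESCRIPTION = "description"
    ARGUMENT = "argument"
    ELUCIDATION = "clarification or explanation"


@dataclass(frozen=True)
class DiscourseLink:
    source: str
    target: str
    category: Category
    use: Use


links = [
    # Measured data compared with theory, used as an argument.
    DiscourseLink("Treated Results", "theoretical predictions",
                  Category.COMPARISON, Use.ARGUMENT),
    # A toy model that helps the reader understand a phenomenon.
    DiscourseLink("description of a phenomenon", "mechanical toy model",
                  Category.COMPARISON, Use.ELUCIDATION),
    # A feature in a graph taken as the sign of a physical effect
    # (treating this as an argument is an assumption).
    DiscourseLink("described results", "graph with a discontinuity",
                  Category.SYMPTOMATIC, Use.ARGUMENT),
]


def select(items: List[DiscourseLink],
           category: Optional[Category] = None,
           use: Optional[Use] = None) -> List[DiscourseLink]:
    """Return the links matching the given category and/or use."""
    return [
        link for link in items
        if (category is None or link.category is category)
        and (use is None or link.use is use)
    ]


for link in select(links, use=Use.ARGUMENT):
    print(link.source, "->", link.target, f"[{link.category.value}]")
```

Selecting on the use alone would, for example, let a reader follow only the argumentative links and skip those added for clarification.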

5) Discussion

In the above, it has been argued that in an electronic environment, scientific papers can cease to be the most basic independent self-contained entities. Instead, we suggest that the scientific communication of the future could be characterised by a network of modules defined by their conceptual function and other classes of identification. This way, information can be shared by various authors, and in reporting new science, only those modules have to be written that present new information or insights. Next to that, commentaries on existing trails of modules can be added as independent entities. Such commentaries can be of different kinds: new data in the form of a new module Raw Data, a new Theoretical Method, or an independent Interpretation of other people's work. Essential for such a system is the existence of a limited but sufficient number of non-overlapping categories of modules and, equally important, categories of relations that are expressed in labelled hyperlinks. As with the design of all knowledge-related systems, the conceptualisation of the parameter space is very difficult. In this work, we attempt to keep the main descriptor classes (or coordinates) as domain-independent as possible, with the clear understanding that domain-specific identification in the form of keywords is one of the system's coordinates. In the field of knowledge-based systems, Ontology is now a fashionable term for systems based on both the semantics of every notion and the rules for using the primary terms in modelling a domain. Though the term is used in different ways [Vic97], the important notion is that, contrary to IR, we deal with rules and relations as well. Many systems are built with the aim of assisting in the extraction of information.
In our approach, we try to design a new input system that, if successful, will substantially enhance the subsequent storage, retrieval and annotation capability of electronic publishing. In that process, we try to restrict the granularity to self-contained units with a well-defined conceptual meaning, for which the development of a new writing style is important. Not the length, but the coherence of the message is the essence. This aim of conceptual granularity induces the need for a taxonomy of relations that is based on the real use of linguistic communication next to pure logical and structural approaches. Systematic classification of speech communication, and argumentation in particular, is therefore a valuable tool. A plethora of rhetorical indicators cannot define the necessary discriminatory coordinates, just as a thesaurus of keywords is insufficient to learn about the real content of a work. Our approach is bottom-up, given a conceptual level of granularity. In that sense, it has elements in common with the Plinius project in materials science, where a bottom-up ontology is also proposed [Vet98].

6.1) Findings

Our aim to develop a new framework for scientific communication based on the intrinsic characteristics of distributed electronic storage has led us to the design of a modular framework. In this framework, a new granularity of scientific information is suggested, based on the conceptual content of the message. As every level of aggregation of information demands proper structuring, we introduced classes of relations that express themselves in hyperlinks connecting the modules or parts thereof. We provide a general, systematic and coherent model that can be tailored to specific domain requirements. In the Communication in Physics Project, we showed that an application in experimental molecular physics is easily feasible. In this paper, we outlined the background and concepts in order to bring the discussion onto a more interdisciplinary level.

6.2) Arrows to further research

It goes without saying that much more in-depth analyses as well as user surveys are necessary before we can feel safe with a minimum set of module and link descriptors. Further work is especially needed on the stratification of the discourse relations and, in particular, on the possible connections between the communicative and content relations. The important issue is that we need structural relation taxonomies and not arbitrary lists of rhetorical semantic indicators. Best practice would be to build a domain-specific experimental writing environment, with an emphasis on authoring tools, for reasonably large-scale tests in which authors cast their publications in modular form. In such an authoring system, authors must be able to link the different modules or parts thereof with the help of a preset collection of relations. Analyses of such a practice, of the problems encountered and of the results for the proper understanding by the ultimate readers must then provide the model with concrete feedback and suggestions for expanding the available types of relations. Such systematic and comparative studies are needed in different domains. A good test case could be large conference proceedings in an experimental field, or a comprehensive work with lots of data.
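The requirement that authors link modules with a preset collection of relations, and that practice feeds back into that collection, can be sketched as follows. This Python fragment is a hypothetical illustration, not a design of the authoring system: `PRESET_RELATIONS` and `check_links` are assumed names, the preset simply gathers relation names mentioned in this paper, and "historical background" is an invented label standing for a relation an author might miss.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (source module, target module, relation name)
Proposal = Tuple[str, str, str]

# Assumed preset collection, gathered from the relation names in this paper.
PRESET_RELATIONS = frozenset({
    "hierarchical", "proximity-based", "range-based", "path",
    "representational", "administrative",
    "elucidation", "argumentation",
    "dependency", "elaboration", "synthesis", "comparison", "causality",
})


def check_links(proposed: Iterable[Proposal]) -> Tuple[List[Proposal], Counter]:
    """Accept links with a preset relation; count the labels that are missing."""
    accepted: List[Proposal] = []
    rejected: Counter = Counter()
    for source, target, relation in proposed:
        if relation in PRESET_RELATIONS:
            accepted.append((source, target, relation))
        else:
            rejected[relation] += 1
    return accepted, rejected


proposed = [
    ("Interpretation", "Treated Results", "argumentation"),
    ("Findings", "Central Problem", "dependency"),
    # Invented label that is not in the preset collection.
    ("Interpretation", "Raw Data", "historical background"),
    ("Findings", "Interpretation", "historical background"),
]

accepted, rejected = check_links(proposed)
print(len(accepted), "links accepted")
print(rejected.most_common())
```

Counting the rejected labels, rather than silently discarding them, is one simple way to collect the suggestions for expanding the available types of relations.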

1.6) Acknowledgements

Helpful discussions with Maarten van der Tol, Francisca Snoeck Henkemans, Anita de Waard and Keith Jones are gratefully acknowledged. The Communication in Physics Project is financially supported by the Foundation Physica, the Shell Research and Technology Centre Amsterdam, the Royal Dutch Academy of Sciences, the Royal Library and Elsevier Science BV.

1.7) References

[Bar96] Lisa Baron, Jean Tague-Sutcliffe, and Mark T. Kinnucan. Labelled, typed links as cues when reading hypertext documents. JASIS 47 (12), 1996, 896-908.

[Der89] Steven J. DeRose. Expanding the notion of links. Hypertext '89 Proceedings, November 1989, 249-257.

[Eam94] William Eamon. Science and the secrets of nature: Books of secrets in medieval and early modern culture. Princeton: Princeton Univ. Press, 1994.

[Eis79] Elizabeth L. Eisenstein. The printing press as an agent of change: communications and cultural transformations in early modern Europe. 2 Volumes. Cambridge: Cambridge Univ. Press, 1979.

[Eis83] Elizabeth L. Eisenstein. The printing revolution in early modern Europe. Cambridge: Canto Cambridge Univ. Press, 1996.

[Feb76] Lucien Febvre and Henri-Jean Martin. The coming of the book. The impact of printing 1450-1800. London: Verso Press, 1976.

[Gar97] Bart Garssen. Argumentatieschema’s in pragma-dialectisch perspectief. Een theoretisch en empirisch onderzoek. PhD thesis University of Amsterdam. Amsterdam: IFOTT Vol.32, 1997.

[Gar98] Bart Garssen. The nature of symptomatic argumentation. In: Frans H. van Eemeren, Rob Grootendorst, J. Anthony Blair, Charles A. Willard (eds.). Proceedings of the 4th International Conference of the International Society for the Study of Argumentation, Amsterdam, June 16-19 1998. Amsterdam: SICSAT, 1999.

[Har96] Frédérique Harmsze, Maarten van der Tol, Joost Kircz. Naar een modulair model voor natuurwetenschappelijke informatie in elektronische artikelen. In: K. van der Meer (red.). Informatiewetenschap 1996. Wetenschappelijke bijdragen aan de Vierde Interdisciplinaire Conferentie Informatiewetenschap. Delft, 13 december 1999. An electronic version at: http://www.wins.uva.nl/projects/commphys/papers/delft/delft.html

[Har99] F.A.P. Harmsze, M.C. van der Tol and J.G. Kircz. A modular structure for electronic scientific articles. Contribution to the Conferentie Informatie Wetenschap 1999. CWI Amsterdam, 12 November 1999. On-line proceedings: http://www.cwi.nl/~lynda/WGI/info-wet1999/proceedings/. To be published.

[Har00] Frédérique Harmsze. A modular structure for scientific articles in an electronic environment. PhD Thesis, University of Amsterdam, Amsterdam, 2000. An electronic version of this thesis can be found at: http://www.wins.uva.nl/projects/commphys/papers/thesisfh/front.html

[Hav86] Eric A. Havelock. The muse learns to write. Reflections on orality and literacy from antiquity to the present. New Haven: Yale Univ. Press, 1986.

[Kir91] Joost G. Kircz. Rhetorical structure of scientific articles: the case for argumentational analysis in information retrieval. Journal of Documentation, 47(4), 1991, 354-372.

[Kir98] Joost G. Kircz. Modularity: the next form of scientific information presentation? Journal of Documentation, 54(2), 1998, 210-235. An electronic version at: http://www.wins.uva.nl/projects/commphys/papers/jkmodul.html

[Man88] W.C. Mann and S.A. Thompson. Rhetorical structure theory: towards a functional theory of text organization. Text, 8(3), 1988, 243-281.

[Man89] W.C. Mann, C.M.I.M. Matthiessen and S.A. Thompson. Rhetorical structure theory and text analysis. Information Science Institute Research Report, Philadelphia: ISI/RR-89-242, November 1989.

[Mcl62] M. McLuhan. The Gutenberg galaxy, the making of typographic man. Toronto: Univ. of Toronto Press, 1962.

[Mea98] A.J. Meadows. Communicating Research. San Diego: Academic Press, 1998.

[Ong82] W.J. Ong. Orality and literacy: the technologizing of the word. London: Methuen, 1982.

[RDF99] http://www.w3.org/TR/NOTE-rdf-simple-intro (last access 4-1-00).

[Tol99] M.C. van der Tol. The abstract as an orientation tool in modular electronic articles. In: Alfons Maes, Hans Hoeken, Leo Noordman, Wilbert Spooren (eds.) Document Design, Linking writer's goals to reader's needs. Proceedings of the First International Conference on Document Design. Tilburg University, 17-18 December, 1998. Tilburg, 1999, 175-186. An electronic version at: http://www.wins.uva.nl/projects/commphys/papers/docdes/docdes.html

[Tri83] Randall Trigg. A network-based approach to text handling for the online scientific community. PhD dissertation Univ. of Maryland Tech. report, TR-1346, 1983. The taxonomy is given in chapter four which is available on: http://www.parc.xerox.com/spl/members/trigg/thesis/thesis-chap4.html

[Rut00] Lloyd Rutledge, Brian Bailey, Jacco van Ossenbruggen, Lynda Hardman and Joost Geurts. Generating presentation constraints from rhetorical structure. Contribution to the Conferentie Informatie Wetenschap 1999. CWI Amsterdam, 12 November 1999. To be published. Also: http://www.cwi.nl/~lynda/WGI/info-wet1999/proceedings/ .

[Tuf83] Edward R. Tufte. The Visual Display of Quantitative Information. Cheshire: Graphics Press, 1983.

[Tuf90] Edward R. Tufte. Envisioning Information. Cheshire: Graphics Press, 1990.

[Vet98] Paul E. van der Vet and Nicolaas J.I. Mars. Bottom-up construction of ontologies. IEEE Transactions on Knowledge and Data Engineering 10, 1998, 513-526.

[Vic97] B.C. Vickery. Ontologies. Jnl. of Information Science, 23(4), 1997, 277-286.

[XML99] http://www.w3.org/TR/NOTE-xlink-req (last access 4-1-00).

Last modifications on: 22-3-2000