1
|
Abstract
A nearest neighbour search procedure is described for use with serial files of textual data. The procedure involves the grouping of records into blocks, each of which is characterised by a fixed-length bit string. A comparable query bit string may then be matched against each of these bit strings, and an upper bound calculation used to identify those blocks which need to be inspected in detail if the document that is most similar to the query is to be identified. Experiments with three small collections of documents and queries are used to test the efficiency of the approach. The experiments show that reductions in computation are possible, although the precise savings are crucially dependent upon a range of factors including the frequency characteristics of the documents and queries, the similarity coefficients, and the sizes of the bit strings and of the blocks.
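The block-level upper bound described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: documents and queries are sets of terms, each term is hashed to one bit with CRC32, a block's bit string is the OR of its documents' bits, and similarity is the number of shared terms. The names `bit_for`, `block_signature`, `upper_bound` and `nearest_neighbour` are hypothetical.

```python
import zlib


def bit_for(term, n_bits):
    """Map a term to a single bit position (assumed hashing scheme)."""
    return 1 << (zlib.crc32(term.encode("utf-8")) % n_bits)


def block_signature(block, n_bits):
    """OR together the bits of every term of every document in the block."""
    sig = 0
    for doc in block:
        for term in doc:
            sig |= bit_for(term, n_bits)
    return sig


def upper_bound(query, sig, n_bits):
    """Count query terms whose bit is set in the block bit string.

    A document in the block cannot share more terms with the query than
    this count, so it bounds the overlap similarity from above.
    """
    return sum(1 for term in query if sig & bit_for(term, n_bits))


def nearest_neighbour(query, blocks, n_bits=64):
    """Return (best document, its similarity, number of blocks inspected)."""
    sigs = [block_signature(b, n_bits) for b in blocks]
    bounds = sorted(
        ((upper_bound(query, s, n_bits), i) for i, s in enumerate(sigs)),
        reverse=True,
    )
    best_sim, best_doc, inspected = -1, None, 0
    for bound, i in bounds:
        if bound <= best_sim:
            break  # no remaining block can beat the current best
        inspected += 1
        for doc in blocks[i]:
            sim = len(query & doc)  # overlap similarity (assumed coefficient)
            if sim > best_sim:
                best_sim, best_doc = sim, doc
    return best_doc, best_sim, inspected


blocks = [
    [{"bit", "string"}, {"serial", "file"}],
    [{"nearest", "neighbour", "search"}, {"neighbour", "search", "serial", "file"}],
]
query = {"nearest", "neighbour", "search"}
best_doc, best_sim, inspected = nearest_neighbour(query, blocks)
```

In this sketch the saving comes from the early `break`: blocks whose bound cannot exceed the best similarity found so far are never scanned. How often that happens depends on the bit string length and block size, which matches the abstract's remark that the savings depend on those factors.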
Affiliation(s)
- Kondrahalli C. Mohan
- Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
- Peter Willett
- Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
|
2
|
Lynch MF, Willett P. Information retrieval research in the Department of Information Studies, University of Sheffield: 1965-1985. J Inf Sci 2016. [DOI: 10.1177/016555158701300405] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/15/2022]
Abstract
This paper discusses research which was carried out at the Department of Information Studies, University of Sheffield in the period 1965 to 1985 into storage and retrieval techniques for databases of textual and chemical structure data. The research includes the development of methods for the automatic production of printed subject indexes and for the indexing and retrieval of chemical structures and chemical reactions, the variety generation method for the analysis, characterization and storage of data in a range of types of textual database, the prediction of biological activity in chemical compounds, and the design of document retrieval systems.
Affiliation(s)
- Michael F. Lynch
- Department of Information Studies, The University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
- P. Willett
- Department of Information Studies, The University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
|
3
|
Cringean JK, Manson GA, Willett P, Wilson GA. Efficiency of text scanning in bibliographic databases using microprocessor-based, multiprocessor networks. J Inf Sci 2016. [DOI: 10.1177/016555158801400604] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
This paper reports an evaluation of the efficiency of text scanning achievable from a microprocessor-based, multiprocessing system which uses Inmos transputers. The Boyer-Moore pattern matching algorithm was used to search 35 natural language queries against a file of 1000 titles and abstracts taken from the Library and Information Science Abstracts database. A model of searching using a singly-linked chain containing up to 11 transputers was carried out: the maximum speed-up obtained with this size of network was 10.4 with a processor utilization of 0.95, both figures being close to the ideal of 11.0 and 1.0. Experiments with a nearest neighbour searching algorithm for serial document files demonstrate the need to keep the processors fully occupied with computational work if a high degree of speed-up is to be obtained.
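The two figures quoted in this abstract are consistent with each other under the usual definition of processor utilization as speed-up divided by the number of processors. That definition is an assumption here, since the abstract does not state it, and the function name `utilization` is hypothetical. A short check of the arithmetic:

```python
def utilization(speed_up, n_processors):
    """Fraction of the ideal linear speed-up achieved (assumed definition)."""
    return speed_up / n_processors


# Figures from the abstract: speed-up of 10.4 on a chain of 11 transputers.
u = utilization(10.4, 11)
print(round(u, 2))  # prints 0.95
```

Under the same definition, the ideal case of a speed-up of 11.0 on 11 processors gives a utilization of 1.0, which is the pair of ideal values the abstract mentions.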
Affiliation(s)
- Janey K. Cringean
- Departments of Computer Science and Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
- Gordon A. Manson
- Departments of Computer Science and Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
- Peter Willett
- Departments of Computer Science and Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
- George A. Wilson
- Departments of Computer Science and Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
|
4
|
Robertson AM, Willett P. Applications of n‐grams in textual information systems. Journal of Documentation 1998. [DOI: 10.1108/eum0000000007161] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.4] [Indexed: 11/17/2022]
|
5
|
Software and hardware techniques for string searching in serial document databases. World Patent Information 1988. [DOI: 10.1016/0172-2190(88)90153-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
|
6
|
Abstract
Signature files provide an efficient access method for text in documents, but retrieval is usually limited to finding documents that contain a specified Boolean pattern of words. Effective retrieval requires that documents with similar meanings be found through a process of plausible inference. The simplest way of implementing this retrieval process is to rank documents in order of their probability of relevance. In this paper techniques are described for implementing probabilistic ranking strategies with sequential and bit-sliced signature files, and the limitations of these implementations with regard to their effectiveness are pointed out. A detailed comparison is made between signature-based ranking techniques and ranking using term-based document representatives and inverted files. The comparison shows that term-based representations are at least competitive (in terms of efficiency) with signature files and, in some situations, superior.
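As a rough illustration of ranking over a bit-sliced signature file, the sketch below scores documents by summing query term weights for every term whose signature bits are all present. It is not the paper's method: the signature width `N_BITS`, the number of bits per term `K`, the CRC32-based hashing, the term weights and the function names are all assumptions chosen for the example.

```python
import zlib

N_BITS = 64  # signature width (assumed)
K = 2        # bit positions derived per term (assumed)


def term_bits(term):
    """Bit positions for a term, from CRC32 with K different salts."""
    return {zlib.crc32(f"{i}:{term}".encode("utf-8")) % N_BITS for i in range(K)}


def build_slices(docs):
    """Bit-sliced layout: slices[j] has bit d set when doc d has bit j set."""
    slices = [0] * N_BITS
    for d, terms in enumerate(docs):
        for term in terms:
            for j in term_bits(term):
                slices[j] |= 1 << d
    return slices


def rank(query_weights, slices, n_docs):
    """Add a term's weight to every document whose signature matches it."""
    scores = [0.0] * n_docs
    for term, weight in query_weights.items():
        candidates = (1 << n_docs) - 1  # start with all documents
        for j in term_bits(term):
            candidates &= slices[j]     # keep docs with every bit of the term
        for d in range(n_docs):
            if (candidates >> d) & 1:
                scores[d] += weight
    order = sorted(range(n_docs), key=lambda d: -scores[d])
    return order, scores


docs = [{"signature", "file"}, {"inverted", "file"}, {"ranking"}]
slices = build_slices(docs)
order, scores = rank({"signature": 2.0, "file": 1.0}, slices, len(docs))
```

A signature match only indicates that a term is probably present, so a document's score here can be inflated by false drops but never reduced. This is one way to see why the abstract raises limitations on the effectiveness of signature-based ranking.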
|
7
|
An evaluation of document retrieval from serial files using the ICL Distributed Array Processor. ACTA ACUST UNITED AC 1984. [DOI: 10.1108/eb024172] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
|