By Joe Howie*
Interesting: Cormack and Grossman define the “TAR Problem” as not knowing about a data set at the outset (p. 2); point out that simple key term searching to select the seed set improves the performance of all TAR protocols (p. 7); and indicate that some protocols have difficulty with low-prevalence or low-richness collections (p. 9). Visual classification bears on each of those issues, and it also addresses the implicit issue that TAR is text-restricted (i.e., it should really be called TR-TAR).
Visual classification groups visually similar documents whether or not they contain text, and reviewing one document per classification provides an awareness of a collection’s content that overcomes a possible initial lack of knowledge. The classifications also provide a far more granular and meaningful tool for selecting documents than simple text searching. By eliminating plainly unresponsive document types, reviewers can increase the richness of the remaining documents, and may in fact be able to identify the specific attributes that make documents responsive, thereby shortcutting the need to “predict” which documents are responsive.
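To make the richness point concrete, here is a minimal sketch of the arithmetic in Python. The classification names, document counts, and the `plainly_unresponsive` flag are all hypothetical values invented for illustration; they are not drawn from Cormack and Grossman or from any real collection.

```python
def richness(responsive: int, total: int) -> float:
    """Return the fraction of documents in a set that are responsive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return responsive / total


# Hypothetical collection, grouped by visual classification.
# 'plainly_unresponsive' stands for a reviewer's call after looking at
# one representative document per classification (an assumption here).
classifications = [
    {"name": "fax cover sheets", "total": 30_000, "responsive": 0, "plainly_unresponsive": True},
    {"name": "blank forms", "total": 20_000, "responsive": 0, "plainly_unresponsive": True},
    {"name": "contracts", "total": 10_000, "responsive": 800, "plainly_unresponsive": False},
    {"name": "correspondence", "total": 40_000, "responsive": 200, "plainly_unresponsive": False},
]

# Richness of the full collection.
before = richness(
    sum(c["responsive"] for c in classifications),
    sum(c["total"] for c in classifications),
)

# Richness after setting aside the plainly unresponsive document types.
remaining = [c for c in classifications if not c["plainly_unresponsive"]]
after = richness(
    sum(c["responsive"] for c in remaining),
    sum(c["total"] for c in remaining),
)

print(f"richness before culling: {before:.1%}")
print(f"richness after culling:  {after:.1%}")
```

With these made-up numbers, setting aside half the collection doubles the richness of what remains, from 1% to 2%, without losing a single responsive document.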
Since the last century, text analysis has been the primary tool used to classify documents, but its durability as the tool of choice doesn’t mean that it remains the best choice.
* This comment is used with the express permission of its contributor, Joe Howie of BeyondRecognition.