Efficient Search for Plagiarism on the Web

J. Malcolm, P.C.R. Lane

    Research output: Contribution to journal, Article, peer-review

    Abstract

    Understanding the characteristics of written English allows Internet search for the source of a document to be carried out efficiently. There is a Zipfian distribution of word frequencies in natural language, with some words common and many words rare. If we take a group of three words, the rarity of most of these triples is extreme. This can be exploited to detect web pages similar to a given target document: while a Google search for some triples from the target may return many hits, other triples will only be found in a few documents on the Internet. These documents may well be similar to the target, and are certainly worth examining more closely. Initial experiments show that this approach is very promising, and it is being implemented in a software tool called WebFerret.
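
The triple-rarity idea in the abstract can be illustrated with a minimal sketch. It is not the WebFerret implementation, which the abstract does not describe. The function names (`word_triples`, `rarest_triples`, `toy_hit_counter`) are hypothetical, and the hit counter is an assumed stand-in for a real search-engine query: it counts documents in a small local corpus rather than web pages.

```python
import re
from collections import Counter
from typing import Callable, Iterable

Triple = tuple[str, str, str]


def word_triples(text: str) -> list[Triple]:
    """Return every run of three consecutive words, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return list(zip(words, words[1:], words[2:]))


def toy_hit_counter(corpus: Iterable[str]) -> Callable[[str], int]:
    """Assumed stand-in for a search engine: counts how many corpus
    documents contain a given three-word phrase."""
    index: Counter[str] = Counter()
    for doc in corpus:
        # Count each triple once per document, like a document hit count.
        for triple in set(word_triples(doc)):
            index[" ".join(triple)] += 1
    return lambda phrase: index[phrase]


def rarest_triples(
    text: str, hit_count: Callable[[str], int], limit: int = 5
) -> list[Triple]:
    """Rank the target's triples by ascending hit count.

    Triples with zero hits are dropped because they point to no
    candidate source document.
    """
    scored = []
    for triple in set(word_triples(text)):
        hits = hit_count(" ".join(triple))
        if hits > 0:
            scored.append((hits, triple))
    scored.sort()  # fewest hits first, ties broken alphabetically
    return [triple for _, triple in scored[:limit]]


corpus = [
    "the cat sat on the mat",
    "the cat sat on a chair",
    "the cat sat quietly",
]
hits = toy_hit_counter(corpus)
print(rarest_triples("the cat sat on the mat", hits, limit=2))
```

In this toy corpus the common triple "the cat sat" appears in all three documents, while "on the mat" and "sat on the" each appear in only one, so those two are the ones selected as search queries.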
    Original language: English
    Pages (from-to): 206-211
    Journal: Proceedings (International Conference on Technology, Communication and Education)
    Volume: 2008
    Publication status: Published - 2008

    Keywords

    • plagiarism
    • search engines
    • ferret
    • natural language processing
