Proposed System for Plagiarism Detection

Chapter 3
The Proposed System

3.1 Introduction

This chapter introduces ZPLAG, the proposed system, and explains its most important design issues in detail. Advanced search engines make it very easy for students to find documents and magazines, so electronic theft is no longer a local or regional problem; it has become a global problem that occurs in many fields. Because of the huge volume of information and the interconnection of networks, detecting electronic theft is a difficult task, and detecting it in Arabic text is without doubt the most difficult. The growth of e-learning systems in the Arab countries therefore calls for special techniques to detect electronic theft in text written in Arabic. Although a search engine such as Google could be used, copying and pasting sentences into a search engine to find these thefts is very difficult in practice. For this reason, a good tool must be developed that automatically detects electronic theft in Arabic text, in order to protect e-learning systems and to facilitate and accelerate the learning process.

This thesis presents ZPLAG, a system that works over the Internet and enables specialists to detect the theft of electronic texts in Arabic. It can be integrated with e-learning systems to protect student work, research papers, and scientific theses against electronic theft. The thesis also describes the major components of this system, including the preparation stage. At the end, an experimental system is evaluated on a set of Arabic documents and texts, and the results are compared with those of some existing systems, particularly TurnItIn.
The chapter is organized as follows: Section 3.2 presents an overview of Arabic e-learning, Section 3.3 gives a general overview of the proposed system, Section 3.4 explains the system architecture of the proposed system, ZPLAG, in detail, and Section 3.5 summarizes the chapter.

3.3 General Overview of the Proposed System

The proposed system consists of three phases: (1) the preparation phase, (2) the processing phase, and (3) the similarity detection phase. Figure 3.1 depicts the phases of the proposed system.

Figure 3.1 Proposed system phases

Preparation phase: this phase is responsible for collecting the documents and preparing them for the next phase. It consists of five modules: the text editor module, the check language module, the check spelling module, the check grammar module, and the sentence analysis module. The text editor module allows the user to input text or upload a text file in document format so that the file can be processed in the next phase. The check language module determines the language in which the input file is written: if it is Arabic, the Arabic process is used; if it is English, the English process is used. The check spelling module checks whether the words are spelled correctly or contain misspellings.

Processing phase: this phase consists of three modules (stop-word removal and rooting share one module), which perform the following steps:
Tokenization: breaks up the input text into tokens.
SWR (stop-word removal): removes common words that appear in the text but carry little meaning.
Rooting: removes prefixes, infixes, and/or suffixes from words to obtain their roots or stems.
Replacement of synonyms: converts words to their synonyms.

Similarity detection phase: this phase consists of three modules: fingerprinting, document representation, and similarity detection. It works as follows. To calculate the fingerprints of a document, the text is first cut up into small pieces called chunks, so the chunking method responsible for cutting up the text must be chosen [12].
A chunk unit can be a sentence or a word. In sentence-based chunking, the document is cut into small chunks of consecutive sentences according to a parameter 'C'. For example, for a document containing the sentences ds1 ds2 ds3 ds4 ds5 and C=3, the calculated chunks are ds1 ds2 ds3, ds2 ds3 ds4, and ds3 ds4 ds5. In word-based chunking, the document is cut into small chunks of consecutive words according to the same parameter. For example, for a document containing the words dw1 dw2 dw3 dw4 dw5 and C=3, the calculated chunks are dw1 dw2 dw3, dw2 dw3 dw4, and dw3 dw4 dw5. Word-based chunking gives higher precision in similarity detection than sentence-based chunking.

3.4 The Architecture of the Proposed System

Any system that detects plagiarism in natural language should satisfy the following properties:
Insensitivity to small matches.
Insensitivity to punctuation, capitalization, etc.
Insensitivity to permutations of the document content.

The main system architecture of ZPLAG is illustrated in Figure 1. It comprises the following components:
Preparation: text editor, check language, check spelling, and check grammar.
Preprocessing: synonym replacement, tokenization, rooting, and stop-word removal.
Fingerprinting: uses n-grams, where the user chooses the parameter n.
Document representation: for each document, a document tree structure is created that describes its internal representation.
Selection of a similarity metric: a similarity metric is used to find the longest match between two hash strings.

As mentioned in the previous section, the system architecture is broken down into three main phases. Each phase is composed of a set of modules according to system functionality. The following subsections describe each phase and its modules in detail.

3.4.1 The Preparation Phase

The main task of this phase is to prepare the data for the next phase. It consists of the text editor module, the check language module, the check spelling module, and the check grammar module.

3.4.1.1 Text Editor Module

Figure 3.2 illustrates the text editor module.
The users of the text editor module are faculty members and students. They need a text area in which to upload their files, and a browse function helps them locate the file path. The file format is then checked, which is very important because the service accepts uploaded files in doc or docx format. After the user uploads the file, the text editor module saves it in the database.

Figure 3.2 Text editor module

3.4.1.2 Check Language Module

The raw text of the document is treated separately as well. In order to extract terms from the text, classic natural language processing (NLP) techniques are applied. Figure 3.3 illustrates the check language module and its functions: the module fetches the file from the system database, where all files are stored, and reads it. It then checks whether the language is Arabic, English, or combo (both Arabic and English), marks the document with its written language, and saves the file back to the system database.

Figure 3.3 Check language module

3.4.1.3 Check Spelling Module

Figure 3.4 illustrates the check spelling module and its functions: after fetching the document from the system database, where all files are stored, the check spelling module reads the file and uses the web spelling checker. It then makes all possible replacements for the words that fail the spelling check and saves the file back to the system database.

Figure 3.4 Check spelling module

3.4.1.4 Check Grammar Module

For English documents, Figure 3.5 illustrates the check grammar module and its functions: after fetching the document from the system database, where all files are stored, the check grammar module reads the file and uses the web grammar checker. It then marks the sentences with the suitable grammar mark and saves the file back to the system database.
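To close the preparation phase with a concrete illustration, the language check of Section 3.4.1.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not ZPLAG's actual implementation: it assumes that the language can be decided by counting letters in the Unicode Arabic block (U+0600 to U+06FF) against ASCII Latin letters, and the function name detect_language and the threshold min_share are hypothetical.

```python
def detect_language(text: str, min_share: float = 0.1) -> str:
    """Label a text as 'Arabic', 'English', 'combo', or 'unknown'.

    Assumption: a script counts as present when it supplies at least
    min_share of all counted letters (hypothetical threshold, <= 0.5).
    """
    # Count characters in the basic Unicode Arabic block.
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # Count ASCII letters; treating them as English is a simplification.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())

    total = arabic + latin
    if total == 0:
        return "unknown"  # no letters of either script were found

    has_arabic = arabic / total >= min_share
    has_latin = latin / total >= min_share
    if has_arabic and has_latin:
        return "combo"  # both Arabic and English, as in Section 3.4.1.2
    return "Arabic" if has_arabic else "English"
```

The threshold keeps a single foreign word from turning a document into "combo"; the label returned here corresponds to the language mark that the module stores with the document.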
Figure 3.5 Check grammar module

3.4.2 The Processing Phase

3.4.2.1 The Tokenization Module

After fetching the document from the system database, where all files are stored, the tokenization module reads the file and breaks it down into paragraphs, then breaks the paragraphs down into sentences, and finally breaks the sentences down into words. It then saves the file back to the system database.

3.4.2.2 The Stop Words Removal and Rooting Module

The raw text of the document is treated separately as well. In order to extract terms from the text, classic natural language processing (NLP) techniques are applied. Figure 3.6 illustrates the stop words removal and rooting module and its functions:

Figure 3.6 SWR and rooting module

SWR: Common stop words in English include a, an, the, in, of, on, are, be, if, into, which, etc., whereas stop words in Arabic include من, إلى, عن, على, في,
etc. These words do not contribute significant meaning to the documents. Therefore, they should be removed in order to reduce 'noise' and computation time.

Word stemming: each word is changed into its basic form.

3.4.2.3 Replacement of Synonym

Replacing words with their synonyms may help to detect advanced forms of hidden plagiarism. The first synonym in the list of synonyms of a given word is considered the most frequent one.

3.4.3 The Similarity Detection Phase

This phase consists of three modules: fingerprinting, document representation, and similarity detection. They are discussed as follows.

3.4.3.1 The Fingerprinting Module

To calculate the fingerprints of a document, the text is first cut up into small pieces called chunks, so the chunking method responsible for cutting up the text must be chosen [12]. A chunk unit can be a sentence or a word. In sentence-based chunking, the document is cut into small chunks of consecutive sentences according to a parameter 'C'. For example, for a document containing the sentences ds1 ds2 ds3 ds4 ds5 and C=3, the calculated chunks are ds1 ds2 ds3, ds2 ds3 ds4, and ds3 ds4 ds5. In word-based chunking, the document is cut into small chunks of consecutive words according to the same parameter. For example, for a document containing the words dw1 dw2 dw3 dw4 dw5 and C=3, the calculated chunks are dw1 dw2 dw3, dw2 dw3 dw4, and dw3 dw4 dw5. Word-based chunking gives higher precision in similarity detection than sentence-based chunking. ZPLAG uses the word-based chunking method: in every sentence of a document, the words are first chunked, and a hash function is then applied to each chunk.

3.4.3.2 The Document Representation Module

For each document, a document tree structure is created that describes its internal representation.

3.4.3.3 The Similarity Detection Module

A tree representation is created for each document to describe its logical structure.
The root represents the document itself, the second level represents the paragraphs, and the leaf nodes contain the sentences.

3.5 Summary

Electronic theft, generally known as plagiarism or academic dishonesty, is a growing phenomenon. Ways to prevent its spread and to preserve the ethical principles that govern academic environments should be known. With easy access to information on the World Wide Web and the large number of digital libraries, electronic theft has become one of the most important issues that plague universities and scientific and research centers. This chapter presented a detailed description of the proposed system for plagiarism detection in electronic resources, its phases, and its functions.
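As an illustration of the word-based chunking described in Section 3.4.3.1, the following minimal Python sketch cuts a word list into overlapping chunks of C words and hashes each chunk into a fingerprint. It is a sketch under stated assumptions, not ZPLAG's implementation: the chapter does not name the hash function, so MD5 from Python's standard hashlib module is used here only as a stand-in, and the function names word_chunks and fingerprint are hypothetical.

```python
import hashlib


def word_chunks(words, c=3):
    """Return every run of c consecutive words (sliding window, step 1)."""
    if c <= 0:
        raise ValueError("c must be a positive integer")
    # A document shorter than c words yields no chunks at all.
    return [tuple(words[i:i + c]) for i in range(len(words) - c + 1)]


def fingerprint(words, c=3):
    """Hash each chunk; the set of digests is the document fingerprint.

    Assumption: MD5 stands in for the unspecified hash function.
    """
    return {
        hashlib.md5(" ".join(chunk).encode("utf-8")).hexdigest()
        for chunk in word_chunks(words, c)
    }
```

Calling word_chunks on the words dw1 dw2 dw3 dw4 dw5 with c=3 reproduces the three chunks of the example in Section 3.4.3.1, and fingerprint returns one digest per distinct chunk. Storing the digests in a set is a design choice of this sketch that makes shared chunks between two documents easy to find by set intersection.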