Title: | On the importance of lexicon, structure and style for identifying source code plagiarism |
Author(s): | RAMIREZ DE LA CRUZ, AARON; RAMIREZ DE LA ROSA, ADRIANA GABRIELA; SANCHEZ SANCHEZ, CHRISTIAN; JIMENEZ SALAZAR, HECTOR |
Subjects: | Source code (Computing); Data structures (Computers); Plagiarism - Technological innovations |
Date: | 2007 |
Publisher: | New York : Association for Computing Machinery |
Citation: | FIRE 2014 : post-proceedings of the 6th workshop of the Forum for Information Retrieval Evaluation |
Abstract: | Source code plagiarism can be identified by analyzing the similarity of a pair of source code documents along several distinct aspects. In this paper we present three types of similarity features that account for three aspects of source code documents: i) lexical, ii) structural, and iii) stylistic. For the lexical view, we used a character 3-gram model that ignores the reserved words of the programming language under analysis. For the structural view, we proposed two similarity metrics based on the function signatures within a source code document, namely the data types and the identifier names in each signature. The third view accounts for several stylistic features, such as the number of white spaces, lines of code, and uppercase letters, among others. Accordingly, we proposed 8 similarity features to represent pairs of source code documents in order to identify plagiarized pairs under a supervised approach. We used a set of more than 32,000 source code documents in Java and C to perform our experiments. The results show that our feature set is pertinent for identifying plagiarism in source code documents that satisfy particular conditions, such as source code that solves difficult problems. |
URI: | http://ilitia.cua.uam.mx:8080/jspui/handle/123456789/484 |
Appears in collections: | Books |
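
The lexical view summarized in the abstract (a character 3-gram model that leaves out reserved words) can be illustrated with a minimal Python sketch. The abstract does not say how the 3-gram profiles are compared, so the Jaccard overlap used here is an assumption, as are the function names and the small, incomplete `RESERVED` set. This is not the authors' implementation.

```python
import re

# Hypothetical, deliberately incomplete subset of Java/C reserved words.
RESERVED = {"int", "for", "while", "if", "else", "return",
            "void", "char", "float", "double"}


def char_trigrams(source):
    """Return the set of character 3-grams of `source`, ignoring reserved words."""
    # Drop a reserved word only when it appears as a whole identifier,
    # so that names such as `integer` or `format` are left untouched.
    stripped = re.sub(
        r"[A-Za-z_]\w*",
        lambda m: "" if m.group(0) in RESERVED else m.group(0),
        source,
    )
    # Collapse whitespace runs so layout alone does not change the profile.
    text = re.sub(r"\s+", " ", stripped).strip()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed measure, not taken from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lexical_similarity(src_a, src_b):
    """Similarity in [0, 1] between the 3-gram profiles of two source files."""
    return jaccard(char_trigrams(src_a), char_trigrams(src_b))
```

A score such as `lexical_similarity(file_a, file_b)` would be only one of the 8 pairwise features mentioned in the abstract. The structural and stylistic features would be computed separately and passed, together with this one, to a supervised classifier.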
File | Description | Size | Format | |
---|---|---|---|---|
On the importance.pdf | | 331.43 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.