N-gram feature selection for authorship identification : master's thesis

Bibliographic Details
Main Author: Houvardas, Ioannis (Χουβαρδάς, Ιωάννης)
Corporate Author: University of the Aegean. School of Sciences. Department of Information and Communication Systems Engineering
Format: Thesis Book
Language: English
Published: Karlovasi, Samos : University of the Aegean, Department of Information and Communication Systems Engineering, 2006.
Online Access: http://hdl.handle.net/11610/12497
Description
Abstract: Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Automatic authorship identification depends on selecting stylistic features that capture an author's writing style independent of the content or genre of the text. Character n-grams have been used successfully in the literature to represent text for stylistic purposes. They seem to be able to capture nuances at the lexical, syntactic, and structural levels. To date, only character n-grams of fixed length have been used for authorship identification. In this thesis, we introduce a new approach for selecting variable-length n-grams, inspired by previous work on selecting variable-length word sequences. We propose the use of variable-length n-grams to represent the stylistic information of the documents to be classified. We explore the significance of digits as stylistic features for distinguishing between authors and show that an increase in performance can be achieved using simple text pre-processing. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed feature selection method is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members.
Item Description: Members of the examination committee: Stamatatos Efstathios, Vouros Georgios, Kavallieratou Ergina
Physical Description: 71 p. : figures, tables ; 30 cm.
Bibliography: p. 65-69. Index
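
Illustrative note: the abstract compares the proposed selection method against information gain over character n-grams. The following is a minimal sketch of that baseline only, not the thesis's own variable-length selection method, which the record does not describe in detail. The function names, the n-gram length range of 3 to 5, the binary presence/absence scoring, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=3, n_max=5):
    """Return the set of character n-grams of length n_min..n_max in text."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, n_min=3, n_max=5):
    """Score each n-gram by the information gain of its presence or
    absence in a document with respect to the author label."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    base = entropy(class_counts.values())

    # n-gram -> author -> number of that author's documents containing it
    present = defaultdict(Counter)
    for text, author in zip(docs, labels):
        for gram in char_ngrams(text, n_min, n_max):
            present[gram][author] += 1

    scores = {}
    for gram, with_gram in present.items():
        n_with = sum(with_gram.values())
        n_without = n_docs - n_with
        without_gram = [class_counts[a] - with_gram[a] for a in class_counts]
        conditional = (n_with / n_docs) * entropy(with_gram.values())
        if n_without:
            conditional += (n_without / n_docs) * entropy(without_gram)
        scores[gram] = base - conditional
    return scores


# Toy data (hypothetical): two "authors" with two short texts each.
docs = ["aaaa", "aaab", "bbbb", "bbba"]
labels = ["A", "A", "B", "B"]
scores = information_gain(docs, labels, n_min=3, n_max=3)
top = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy example the n-grams that occur in every document of one author and in none of the other's receive the highest score, which is the ranking behaviour a selection method would then threshold to keep the most significant n-grams.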