Trainable Classifiers – Tips
Hello All,

Just sharing some tips to assist with data collection and the creation of trainable classifiers for labelling and Data Loss Prevention.

To train machine learning to recognize a certain document type, that document type must have one or more recognizable aspects. Possible usable aspects of the data/document type:

- Keyword or metadata values (keyword query language)
- Previously identified patterns of sensitive information like social security, credit card, or bank account numbers (sensitive information type entity definitions)
- Document fingerprinting: recognizing an item because it's a variation on a template
- The presence of exact strings (exact data match)

In the examples here, we focus on document fingerprinting and previously identified sensitive information types.

Positive samples: the sample files share a pattern. They contain credit card (CC) information (dummy data), keywords referring to CC information such as CVV2 and AMEX, and SSN information. This can be regarded as the pattern for positive detection. These samples (about 150 samples of a similar pattern) are stored in a folder in a dedicated SharePoint site. (In the screenshot, the same items are used as negative samples for another classifier.)

Negative samples: the same concept applies. They can also be stored in a folder in a dedicated SharePoint site, and they should have their own unique pattern or fingerprint. For example, the negative samples here represent credential information (dummy data), and there should again be about 150 of them. The samples should strongly represent a uniform document/data type that is different from the positive samples. As with the positive samples, the data is stored in a dedicated folder in a SharePoint site.

Once the trainable classifier is created and fed this information, it can identify the data type, which facilitates detection and minimizes potential false positives.
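Before uploading a sample set, it can help to check locally that the positive samples really do share the intended pattern. The following is a minimal Python sketch of such a check, not part of Microsoft Purview. The regular expressions, the keyword list, and the function names (`luhn_valid`, `sample_features`, `review_samples`) are assumptions made for illustration and are not the built-in sensitive information type definitions. The 150 threshold is only the approximate figure mentioned in this post.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the built-in
# sensitive information type definitions used by the classifier service.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORD_RE = re.compile(r"\b(?:CVV2?|AMEX|Visa|MasterCard)\b", re.IGNORECASE)

MIN_SAMPLES = 150  # approximate sample count suggested in the post


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def sample_features(text: str) -> dict:
    """Report which of the expected positive-sample traits appear in `text`."""
    return {
        "card": any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)),
        "keyword": KEYWORD_RE.search(text) is not None,
        "ssn": SSN_RE.search(text) is not None,
    }


def review_samples(texts: list) -> dict:
    """Summarize how many samples show the card-plus-keyword pattern."""
    matching = 0
    for text in texts:
        features = sample_features(text)
        if features["card"] and features["keyword"]:
            matching += 1
    return {
        "total": len(texts),
        "matching": matching,
        "enough_samples": len(texts) >= MIN_SAMPLES,
        "uniform": matching == len(texts),
    }
```

Running `review_samples` over the extracted text of each file in the positive folder gives a quick signal: if `uniform` is false, some samples do not follow the pattern and may weaken training. Running the same check over the negative folder should report few or no matches, which confirms the two sets are clearly distinct.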