Text Classification - Label Pre Process

February 05, 2017, at 1:50 PM

I have a data set of 1M+ observations of customer interactions with a call center. The text is free text written by the representative taking the call. It is not well formatted, nor is it close to grammatically correct (a lot of shorthand). None of the text is labeled, as I do not know what labels to provide.

Given the size of the data, would a random sample (large enough to give a high level of confidence) be a reasonable first step in determining what labels to create? Is it possible to avoid manually labeling 400+ random observations, or is there no other way to pre-process the data to determine a good set of labels to use for classification?

Appreciate any help on the issue.

Answer 1

Manual annotation is a good option when you have a good idea of what an ideal document for each label looks like.
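If you do go the manual route, drawing the sample is the easy part. Here is a minimal sketch, assuming the "400+" figure comes from the usual sample-size formula for estimating a proportion at 95% confidence with a 5% margin of error (that is my assumption, not something stated in the question). The `notes` list and the `sample_size` helper are hypothetical stand-ins for your data and are not from any library.

```python
import math
import random


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample-size formula with a finite population correction.

    z=1.96 corresponds to 95% confidence; p=0.5 is the most conservative
    guess for the proportion being estimated.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


# Hypothetical stand-in for the 1M+ call-center notes.
notes = [f"call note {i}" for i in range(1_000_000)]

n = sample_size(len(notes))

# A seeded generator makes the sample reproducible, so the same notes can
# be handed to several annotators.
rng = random.Random(42)
to_label = rng.sample(notes, n)  # sampling without replacement

print(n)  # 385
```

Note that this sample size only controls the error on the estimated share of a category. It does not guarantee that rare call types show up in the sample at all.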

However, given the size of the dataset, I would recommend fitting an LDA (Latent Dirichlet Allocation) topic model to the documents and looking at the topics it generates. This will give you a good idea of the labels you can use for text classification.

You can also use LDA for the classification itself: pick representative documents for each label, then find the documents closest to each of them under a similarity metric (say, cosine).
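The nearest-document idea can be sketched in plain Python as follows. I am assuming each document is represented by its LDA topic distribution. The `doc_topics` values below are invented, and `cosine` and `closest_documents` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def closest_documents(seed, doc_topics, k=3):
    """Return the indices of the k documents most similar to `seed`."""
    scored = [(cosine(seed, vec), idx) for idx, vec in enumerate(doc_topics)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Invented topic distributions for four documents over three topics.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
]

seed = doc_topics[0]  # the representative document for some label
nearest = closest_documents(seed, doc_topics, k=2)
print(nearest)  # [0, 2]
```

The seed document always comes back as its own nearest neighbour, so in practice you would skip index 0 of the result or ask for `k + 1` documents.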

Alternatively, once you have an idea of the labels, you can assign them without any manual intervention using LDA, but you will then be restricted to unsupervised learning.
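One simple way to do that assignment is to give each document the label of its dominant topic. This is a sketch under my own assumptions: the topic-to-label names, the `min_weight` cut-off, and the `assign_labels` helper are all hypothetical, and the topic weights are invented.

```python
def assign_labels(doc_topics, topic_names, min_weight=0.6):
    """Label each document with its highest-weight topic.

    Documents whose best topic falls below `min_weight` are marked
    "unassigned" instead of being forced into a label.
    """
    labels = []
    for vec in doc_topics:
        best = max(range(len(vec)), key=lambda i: vec[i])
        if vec[best] >= min_weight:
            labels.append(topic_names[best])
        else:
            labels.append("unassigned")
    return labels


# Names you would choose yourself after reading each topic's top words.
topic_names = ["billing", "outage"]

# Invented topic distributions for three documents.
doc_topics = [
    [0.92, 0.08],
    [0.15, 0.85],
    [0.52, 0.48],
]

labels = assign_labels(doc_topics, topic_names)
print(labels)  # ['billing', 'outage', 'unassigned']
```

The "unassigned" bucket is a convenient place to look for documents worth labeling by hand.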

Hope this helps!

P.S. Be sure to remove all stopwords and use a stemmer to group together words of the same kind (for example, managing, manage, management) at the pre-processing stage.
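To make the pre-processing step concrete, here is a dependency-free sketch. The stopword set is a tiny hand-picked sample and `crude_stem` is a naive suffix stripper that I wrote as a stand-in for a real stemmer such as Porter's. Both are assumptions for illustration only, and on real data a proper stopword list and stemmer from an NLP library would be the better choice.

```python
import re

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "to", "was"}

# Checked in order, so longer suffixes come first.
SUFFIXES = ("ement", "ment", "ing", "ed", "es", "s", "e")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters.

    This is NOT the Porter algorithm, just a stand-in to show the idea.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(tok) for tok in tokens if tok not in STOPWORDS]


stems = [crude_stem(w) for w in ("managing", "manage", "management")]
print(stems)  # ['manag', 'manag', 'manag']

print(preprocess("The customer is managing the billing"))
# ['customer', 'manag', 'bill']
```

Because call-center notes are full of shorthand, it may also be worth mapping common abbreviations to full words before stemming.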
