
Do I need to adjust frequencies or weights of rows so the right weight is given to each sample? (data mining)

The general problem type is as follows. I have about 2,500 rows of data. Each row contains data about an individual sample with sizes from around 10,000 to 200,000 (a known attribute / column), and there are attributes in the row that are percentages of the individual samples with different characteristics. The samples do not overlap. There is a binary categorical target (not highly imbalanced).

The specific problem follows. The rows are characteristics of cities (number of streets, number of grocery stores, etc.) and census-type data on people (number of a certain type of healthcare visit, poverty rate, etc.). These are examples; the actual attributes are different but similar. There is a binary categorical target (not disclosed here; a human characteristic, such as a health issue or an opinion on a particular item, that will be independent of the city size). There are some known, moderately weak relationships among these independent variables and the target. I expect the combined effect of multiple attributes to make the target more predictable than from any single attribute alone.

I plan to use decision trees, logistic regression, and neural networks to generate models.

The size of the cities will have an impact, usually minor, on some of the attributes (percentages), but not on the target.

Do I need to adjust the row weights to do this correctly? Such as duplicating some of the rows - this would be a very complex operation.

Can I just use the data as is? A row has a sample size, several percentages, and the binary target.

Do I need to change the percentages to numbers? (Change 0.10 to 1,000 if the row's sample population is 10,000.)

Thanks for any help. (I will be using R.)

Tags: analytics, data, decision, logistic, mining, networks, neural, predictive, regression, trees


Replies to This Discussion

It depends on the research question.  I understand you're trying to predict a health outcome at the city level.  Does that mean:

  1. your success criterion is to predict the classification of the CITY (this is what your question implies), or
  2. your success criterion is the level of impact across the population? (If so, weight each row by population or a similar measure.)

The former means that Sheboygan, Wisconsin (a small city) will have the same influence over your model as New York City.  Do you want that?  Is it just as important to predict Sheboygan's classification as NYC's?  Or, because NYC is so much larger, is it important to give it more influence over the model and classification, because it matters more to draw your conclusion from NYC than from Sheboygan?  (The general issue behind these questions is: what are you generalizing about from your analysis?)
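On the worry about duplicating rows: if you do decide to weight by population, you should not need to copy rows by hand. Giving a row a case weight of k has the same effect on a weighted loss as repeating that row k times. Here is a minimal sketch of that equivalence in Python, using made-up numbers and a hypothetical helper `weighted_log_loss`; it only illustrates the idea and is not a full model fit. In R, `glm`, for example, accepts a `weights` argument that plays this role.

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Weighted mean log loss: each row contributes in proportion to its weight."""
    total_weight = sum(weights)
    loss = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / total_weight

# Hypothetical toy data: three cities, a binary target, and predicted probabilities.
y = [1, 0, 1]
p = [0.8, 0.3, 0.6]

# Option 1: every city counts equally.
unweighted = weighted_log_loss(y, p, [1, 1, 1])

# Option 2: the third city counts four times as much (e.g. four times the population).
weighted = weighted_log_loss(y, p, [1, 1, 4])

# Duplicating the third row four times gives the same loss as weighting it by 4.
y_dup = [1, 0, 1, 1, 1, 1]
p_dup = [0.8, 0.3, 0.6, 0.6, 0.6, 0.6]
duplicated = weighted_log_loss(y_dup, p_dup, [1] * 6)
```

The weighted and duplicated losses agree, while the unweighted loss differs, so the choice between the two options changes which errors the model is pushed to avoid.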

Note: You could still use population density or a similar size-adjusted measure as a predictor.  Using raw counts, such as population counts, as a predictor would be hazardous if that variable is just giving you more "chances" at your categorical variable (e.g., presence of disease in the community; Y/N).  
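To make the counts-versus-percentages point concrete, here is a small sketch using the figures from the question (0.10 of 10,000 is 1,000). The helper names `to_count` and `to_rate` are hypothetical. It shows that two cities with the same rate end up with very different raw counts, so a raw count mostly carries city size rather than the characteristic itself.

```python
def to_count(fraction, sample_size):
    """Convert a fraction of a sample into a head count."""
    return round(fraction * sample_size)

def to_rate(count, sample_size):
    """Convert a head count back into a size-independent rate."""
    return count / sample_size

# Two hypothetical cities with the same rate but different sizes.
small_city_count = to_count(0.10, 10_000)    # 1,000 people
large_city_count = to_count(0.10, 200_000)   # 20,000 people

# The counts differ by the ratio of the city sizes, not by anything about the trait.
count_ratio = large_city_count / small_city_count

# Converting back to rates removes the size effect again.
small_city_rate = to_rate(small_city_count, 10_000)
large_city_rate = to_rate(large_city_count, 200_000)
```

If the target really is independent of city size, keeping the attributes as percentages (and holding size as a separate column or weight) avoids feeding size into the model through every predictor.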
