The general problem type is as follows. I have about 2,500 rows of data. Each row describes one sample; sample sizes range from around 10,000 to 200,000 and are recorded in a known column. The other attributes in the row are percentages of that sample with various characteristics. The samples do not overlap. There is a binary categorical target (not highly imbalanced).
The specific problem follows. The rows are characteristics of cities (number of streets, number of grocery stores, etc.) and census-type data on people (number of a certain type of healthcare visit, poverty rate, etc.). These are examples; the actual attributes are different, but similar. There is a binary categorical target (not disclosed here; a human characteristic, such as a health issue or an opinion on a particular item, that will be independent of the city size). There are some known, moderately weak relationships between these independent variables and the target. I expect the combined effect of multiple attributes to make the target more predictable than from any single attribute alone.
I plan to use decision trees, logistic regression, and neural networks to generate models.
The size of the cities will have an impact, usually minor, on some of the attributes (percentages), but not on the target.
Do I need to adjust the row weights to do this correctly, for example by duplicating some of the rows? That would be a very complex operation.
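To make the row-weight idea concrete, here is a minimal sketch of what I understand weighting to mean. It is written in Python purely for illustration, and the rows, weights, and fixed model coefficients are all made up. It only shows that, for a logistic log-loss, multiplying a row's loss by an integer weight gives the same total as physically duplicating that row, which suggests duplication itself might be avoidable if the modelling function accepts weights.

```python
import math

# Hypothetical toy rows: (feature value, binary target, integer weight).
rows = [(0.10, 1, 3), (0.25, 0, 1), (0.40, 1, 2)]

def log_loss(x, y, intercept=-0.5, slope=2.0):
    """Negative log-likelihood of one row under a fixed, made-up logistic model."""
    p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Option A: keep one copy of each row and multiply its loss by the weight.
weighted_loss = sum(w * log_loss(x, y) for x, y, w in rows)

# Option B: duplicate each row `w` times and sum the unweighted losses.
duplicated = [(x, y) for x, y, w in rows for _ in range(w)]
duplicated_loss = sum(log_loss(x, y) for x, y in duplicated)

# The two totals agree up to floating-point rounding.
print(math.isclose(weighted_loss, duplicated_loss))
```

Whether weighting by sample size is appropriate at all for my data is exactly what I am unsure about; the sketch only illustrates the mechanics.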
Can I just use the data as is? A row has a sample size, several percentages, and the binary target.
Do I need to change the percentages to numbers? (Change 0.10 to 1,000 if the row's sample population is 10,000.)
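For clarity, this is the conversion I have in mind, as a small Python sketch with made-up column names (`sample_size`, `pct_with_trait`) and made-up values. Note that two rows with the same percentage end up with very different counts, so the converted column would carry the city size along with it.

```python
# Hypothetical rows; the column names and values are illustrative only.
rows = [
    {"sample_size": 10_000, "pct_with_trait": 0.10},
    {"sample_size": 200_000, "pct_with_trait": 0.10},
]

def to_count(row):
    """Convert a proportion of the sample into an absolute count."""
    # round() guards against floating-point noise in the product.
    return round(row["pct_with_trait"] * row["sample_size"])

counts = [to_count(r) for r in rows]
print(counts)
```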
Thanks for any help. (I will be using R.)