
# What method/model should I use for this parameter fitting problem?

I am analyzing data for a type of sensor my company makes. I want to quantify the health of each sensor based on three features, using the following formula:

sensor health index = feature1 * A + feature2 * B + feature3 * C

We also need to pick a threshold so that if this index exceeds the threshold, the sensor is considered a bad sensor.
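As a minimal sketch of the formula and threshold rule above, assuming the three features are stored as the columns of a NumPy array. The feature values, coefficients, and threshold below are made up for illustration, and the function names are hypothetical:

```python
import numpy as np

def health_index(features, coeffs):
    """Weighted sum feature1*A + feature2*B + feature3*C, one value per sensor (row)."""
    return np.asarray(features, dtype=float) @ np.asarray(coeffs, dtype=float)

def is_bad(features, coeffs, threshold):
    """Flag a sensor as bad when its health index exceeds the threshold."""
    return health_index(features, coeffs) > threshold

# Hypothetical data: two sensors, three features each.
features = [[0.5, 1.0, 2.0],
            [3.0, 2.0, 1.0]]
coeffs = [1.0, 2.0, 3.0]  # A, B, C (illustrative values only)

flags = is_bad(features, coeffs, threshold=9.0)
print(flags)
```

The whole fitting problem is then the choice of `coeffs` and `threshold`.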

We only have a legacy list showing that about 100 sensors are bad, but we now have data for more than 10,000 sensors. Anything not in that 100-sensor list is NOT necessarily "bad". So I guess linear regression methods don't work in this scenario.

The only way I can think of is brute-force fitting. The pseudocode is as follows:

```python
# Assumed to exist already: `sensors` (objects with id, feature1, feature2,
# feature3) and `known_bad_sensors` (set of ids from the legacy list).

# dictionary of parameters (a, b, c, thold) -> accuracy rate
results = {}

for thold in range(1, 21):
    for a in range(1, 11):
        for b in range(1, 11):
            for c in range(1, 11):
                # ids of the sensors this parameter set flags as bad
                bad_list = set()
                for sensor in sensors:
                    health_index = (sensor.feature1 * a
                                    + sensor.feature2 * b
                                    + sensor.feature3 * c)
                    if health_index > thold:
                        bad_list.add(sensor.id)
                # share of the known bad sensors that were also flagged
                accuracy = len(bad_list & known_bad_sensors) / len(known_bad_sensors)
                results[(a, b, c, thold)] = accuracy

# rank params based on accuracy, best first
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)

# the params with the highest accuracy are the best model
print(ranked[0])
```

I really don't like this method, since it uses 5 nested for loops, which is very inefficient. But that list of 100 bad sensors is all I have. There is no way to get more labeled data points, including "good" ones. I wonder if there is a better way to do it, perhaps using something from an existing library such as scikit-learn?
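One way to cut the loop overhead without changing the method is to let NumPy evaluate the whole coefficient grid at once. The following is only a sketch of the same brute-force search. It assumes the features sit in an `(n_sensors, 3)` array and the legacy list is a boolean mask. The function name `grid_search` and the synthetic data are hypothetical, and "accuracy" is read as the share of known-bad sensors that get flagged:

```python
import numpy as np
from itertools import product

def grid_search(X, is_known_bad, coef_values, thresholds):
    """Score every (a, b, c, threshold) combination on the grid.

    X            : (n_sensors, 3) array of features
    is_known_bad : (n_sensors,) boolean mask for the legacy bad list
    Returns a list of (a, b, c, threshold, accuracy) tuples.
    """
    X = np.asarray(X, dtype=float)
    is_known_bad = np.asarray(is_known_bad, dtype=bool)
    thresholds = np.asarray(thresholds, dtype=float)

    # All coefficient triples at once: shape (n_combos, 3).
    coefs = np.array(list(product(coef_values, repeat=3)), dtype=float)
    # Health index of every sensor under every triple: (n_sensors, n_combos).
    index = X @ coefs.T

    results = []
    for th in thresholds:
        flagged = index > th
        # Share of known-bad sensors flagged, one value per coefficient triple.
        accuracy = flagged[is_known_bad].mean(axis=0)
        for (a, b, c), acc in zip(coefs, accuracy):
            results.append((a, b, c, th, acc))
    return results

# Synthetic stand-in data (assumption): 1,000 sensors, the first 10 "known bad".
rng = np.random.default_rng(0)
X = rng.random((1000, 3))
is_known_bad = np.zeros(1000, dtype=bool)
is_known_bad[:10] = True

results = grid_search(X, is_known_bad, range(1, 11), range(1, 21))
best = max(results, key=lambda r: r[4])
print(best)
```

With 10,000 sensors and 1,000 coefficient triples, the `index` matrix holds 10 million float64 values (about 80 MB), so a larger grid would need to be processed in chunks.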


### Replies to This Discussion

I'll note up front that my data science knowledge is extremely limited... I'm only starting to learn the techniques & terms.  So I don't have an answer for your main question, but I do have a comment about your problem.  You note at the start,

---------------------

We also need to pick a threshold so that if this index exceeds the threshold, the sensor is considered a bad sensor.

---------------------

But near the end, you said,

---------------------

...that list of 100 bad sensors is all I have. There is no way to get more labeled data points, including "good" ones.

---------------------

If this is really the case, it seems to me you have an unsolvable problem. You could apply any technique to determine values for those coefficients (a, b, c) based on data from the bad sensors, but without good sensors to put in as a comparison there'd be no way to identify the threshold between bad and good.

By way of an analogy, you may have data to confirm that every city in [list of 100 cities] is definitely south of the equator.  But that alone is not enough to tell you where the equator is, only (to some extent) where it isn't.  City #101 may lie further north than the northernmost city in your list, but it could still be in either hemisphere.
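To make the point concrete, here is a small sketch with made-up health indices for six sensors, of which only the first two are on the bad list. The numbers and the helper name `recall_on_known_bad` are assumptions for illustration, not data from the question:

```python
import numpy as np

def recall_on_known_bad(index, is_known_bad, threshold):
    """Share of known-bad sensors whose health index exceeds the threshold."""
    return float(np.mean(index[is_known_bad] > threshold))

# Hypothetical health indices; only the first two sensors are known to be bad.
index = np.array([9.0, 7.5, 8.0, 3.0, 1.0, 6.0])
is_known_bad = np.array([True, True, False, False, False, False])

scores = []
for threshold in (0.0, 2.0, 5.0, 7.0):
    recall = recall_on_known_bad(index, is_known_bad, threshold)
    n_flagged = int(np.sum(index > threshold))
    scores.append((threshold, recall, n_flagged))
    print(threshold, recall, n_flagged)
```

Every threshold here catches both known-bad sensors, yet the number of sensors flagged ranges from 3 to 6. The bad list alone cannot say which of those thresholds is right, just as the 100 southern cities cannot locate the equator.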
