Adding different values from a table to find a specific sum value

Cluskitt asked:

There are times when I need to find a certain value out of a number of records. For example, assume this table:
tblValues (ID_Value int PK,ID_Code PK,CalcValue real)

It can have values like:
1,1,123.52
1,2,15421.23
2,1,2367
3,3,4352.12

Usually, it will have a few thousand rows (the actual table has more fields, but they're not relevant).
Sometimes, I need to find rows that, when added, will generate a certain number. For example, using the above values, let's say I need to find a value of 2490.52. If I add the 1st and 3rd rows, I will get that value. If I need to find a value of 6842.64, I will get it with the 1st, 3rd and 4th rows.

Usually, when such a need arises, we first check whether any single row has that value. That is basic SQL and I won't even bother with it here. Next, we move on to two values. I usually do it this way:
SELECT t1.ID_Value,t1.ID_Code,t1.CalcValue,
            t2.ID_Value,t2.ID_Code,t2.CalcValue
FROM tblValues t1
INNER JOIN tblValues t2
ON t1.ID_Value<>t2.ID_Value OR t1.ID_Code<>t2.ID_Code
WHERE t1.CalcValue+t2.CalcValue=2490.52


If we then need to move on to 3 rows, I add another table, and so on.
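For concreteness, the three-row version of that pattern (just extending the same joins; the 6842.64 target is the three-row example from above) would look something like this:

SELECT t1.ID_Value, t1.ID_Code, t1.CalcValue,
       t2.ID_Value, t2.ID_Code, t2.CalcValue,
       t3.ID_Value, t3.ID_Code, t3.CalcValue
FROM tblValues t1
INNER JOIN tblValues t2
  ON t1.ID_Value <> t2.ID_Value OR t1.ID_Code <> t2.ID_Code
INNER JOIN tblValues t3
  ON (t1.ID_Value <> t3.ID_Value OR t1.ID_Code <> t3.ID_Code)
 AND (t2.ID_Value <> t3.ID_Value OR t2.ID_Code <> t3.ID_Code)
WHERE t1.CalcValue + t2.CalcValue + t3.CalcValue = 6842.64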

This approach has, basically, two problems:
1- Records get repeated. That is, it will return row 1 + row 3 AND row 3 + row 1. This quickly escalates as you add more tables.
2- The query quickly becomes too slow to be practical. We can only run a 4-row sum on a small subset of the table, and even then it takes too long.

So, what is the best approach to this problem? Is there a more efficient way?
Even better, is there some way to return all possible one-row combinations, then all possible two-row combinations, then all three-row combinations, and so on, up to a reasonable maximum of, say, 5?
Also important: is there any way to avoid what are basically repeated records?
Sean Stuber replied:

The repeated records can be avoided by "sorting" the rows.

For example, instead of

t1.ID_Value <> t2.ID_Value

use

t1.ID_Value > t2.ID_Value

You'll get the same effect of not joining a row to itself, but you'll also prevent joining A to B and B to A.
As for the combinatoric sums, there is no good way to do that. The factorial expansion is going to make it an expensive operation.

You can help by adding conditions to your joins to avoid obviously bad combinations.

For example, if your target is 600 and

row 1 = 100
row 2 = 300
row 3 = 400

then once you've added rows 1, 2 and 3 you've already exceeded the target, so you know you don't need to join any more rows; the joins for rows 4 and 5 can be skipped.
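A minimal sketch of both ideas applied to the two-row query from the question (assuming CalcValue is never negative; otherwise a partial sum above the target could still come back down):

SELECT t1.ID_Value, t1.ID_Code, t1.CalcValue,
       t2.ID_Value, t2.ID_Code, t2.CalcValue
FROM tblValues t1
INNER JOIN tblValues t2
  -- order on the full composite key: no self-pairs, no (a,b)/(b,a) duplicates
  ON (t1.ID_Value > t2.ID_Value
      OR (t1.ID_Value = t2.ID_Value AND t1.ID_Code > t2.ID_Code))
  -- running-sum cap: redundant with only two tables, but once more joins are
  -- stacked on, this is what prunes hopeless branches early
  AND t1.CalcValue + t2.CalcValue <= 2490.52
WHERE t1.CalcValue + t2.CalcValue = 2490.52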
Cluskitt (asker):

I have actually thought of the > condition. However, that wouldn't solve the scalability problem. Trying to add 3 rows would still be a nightmare on the whole table. There should be a better solution, even if it's a stored proc with some sort of loop, or a temp table, or anything that will make adding 5 rows take less than a few hours.
It will help the scalability problem.

There is really no good general solution to what you are looking for.

Combinatoric expansion is big; there's no fighting the math. The best you can do is block some subset of the combinations from being generated by restricting the join conditions.
I think you can improve it considerably by saving the two-value results and using those to generate the 3-, 4-, etc., value results.

[You could probably add even more efficiency by sorting the values in order first, although I haven't worked out the code in my head to actually implement the savings yet :-) .]

OK, so say you generate the two-value results and no two-value total matches.

You can selectively join the two-value result to the original list to generate the three-value result.

You can selectively join the two-value result to the two-value result to get the four-value result.

You can selectively join the four-value result to the original list to generate the five-value result.

And so on.

I'll be happy to help with exact code if you provide some sample data, say a thousand rows or so.

Btw, I would assign a single ID to each value, to ensure a distinct ascending key for the original values, for efficiency. That ID can then be used to look up the original pair of IDs.

For even a few thousand rows, an insert into a new table, even sorted, is very minor overhead.
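A rough sketch of that staged idea, with hypothetical names (#vals, #pairs and @Target are placeholders; the single ascending id is the surrogate key suggested above, and, like the running-sum caps earlier, it assumes the values are non-negative):

DECLARE @Target real;
SET @Target = 2490.52;

-- re-key the original rows with a single ascending surrogate id, sorted by value
SELECT ROW_NUMBER() OVER (ORDER BY CalcValue, ID_Value, ID_Code) AS id,
       ID_Value, ID_Code, CalcValue
INTO #vals
FROM tblValues;

-- stage 1: keep EVERY pair whose sum does not exceed the target
SELECT v1.id AS id1, v2.id AS id2,
       v1.CalcValue + v2.CalcValue AS pair_sum
INTO #pairs
FROM #vals v1
JOIN #vals v2 ON v1.id > v2.id
WHERE v1.CalcValue + v2.CalcValue <= @Target;

-- two-row matches
SELECT id1, id2 FROM #pairs WHERE pair_sum = @Target;

-- three-row matches: saved pairs joined back to the original list
SELECT p.id1, p.id2, v.id AS id3
FROM #pairs p
JOIN #vals v ON v.id < p.id2          -- ids stay strictly descending, so no duplicates
WHERE p.pair_sum + v.CalcValue = @Target;

-- four-row matches: saved pairs joined to saved pairs
SELECT a.id1, a.id2, b.id1 AS id3, b.id2 AS id4
FROM #pairs a
JOIN #pairs b ON b.id1 < a.id2        -- all four ids distinct and descending
WHERE a.pair_sum + b.pair_sum = @Target;

The ids in the results can then be joined back to #vals to recover the original ID_Value/ID_Code pairs.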
Isn't "selectively joining" just another way of saying "block some subset of the combinations from being generated by restricting the join"?

Maybe I'm not understanding what you are suggesting, but I don't read anything in your description that suggests the combinatoric explosion will be reduced.

For example, using just 1-20, there are 309 ways to generate sums of 37.

With thousands of records and higher sums, the number of variations can grow even more.

Here's a simple example showing the expansion:


-- recursive CTE generating the values 1..20
WITH numbers
     AS (SELECT 1 id, 1 n
         UNION ALL
         SELECT id + 1 id, n + 1 n
           FROM numbers
          WHERE n < 20)
SELECT n1.n, n2.n, n3.n, n4.n, n5.n
  FROM numbers n1
       -- id ordering avoids duplicate combinations; the running-sum check
       -- prunes branches that have already exceeded the target of 37
       LEFT JOIN numbers n2 ON n1.id > n2.id
             AND n1.n + ISNULL(n2.n, 0) <= 37
       LEFT JOIN numbers n3 ON n2.id > n3.id
             AND n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) <= 37
       LEFT JOIN numbers n4 ON n3.id > n4.id
             AND n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) + ISNULL(n4.n, 0) <= 37
       LEFT JOIN numbers n5 ON n4.id > n5.id
 WHERE n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) + ISNULL(n4.n, 0) + ISNULL(n5.n, 0) = 37;


Increasing the range of n and replacing the four 37s with other targets lets you test for other combinations.

1-100 with a target sum of 177 yields almost a quarter million different permutations (243,339).
I will throw out one significant caveat.

If your target sums are small compared to the elements you are adding, then the individual join conditions will filter out many permutations very quickly.

For example, with 1-10 and a target sum of 8:

once n1 = 8, the first join will fail and hence kick out all of the rest;
with n1 = 7 and n2 = 1, all third and following joins will fail;
etc.
You're right: you didn't understand what I said.  

In fact, you seem to have completely ignored it, since you continue to do multiple joins solely to the original list of numbers.  I explicitly stated a different approach.
Nope, I didn't ignore it at all. In fact, that's exactly why I wrote it, and wrote it the way I did: I was simply illustrating the math.

No matter the method of generating them, the number of permutations is the same.

If you have thousands of rows you could easily end up with billions of permutations. Generate a million per second and it'll still take you half an hour, and I doubt you can go that fast.

As the number of rows grows, the permutations grow much, much faster. If you have 10,000 rows, you won't finish in a day.
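To put a rough number on that: the count of distinct 5-row combinations from 10,000 rows is

C(10000, 5) = (10000 × 9999 × 9998 × 9997 × 9996) / 5! ≈ 8.3 × 10^17

so even at a million combinations per second, blind enumeration would take on the order of 26,000 years; only the pruning in the join conditions keeps the search anywhere near feasible.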

Of course, as I already mentioned, there are data values and target sums that will allow for a massive reduction in the joins needed. Even if you do this procedurally rather than with just SQL joins, the general idea is the same. The exact same factors that allow you to reduce (or not reduce) the join combinations will apply, to the benefit or to the detriment of the procedural methods.

So, while the problem isn't unsolvable, the practicality of solving it depends more on the nature of the data itself than on the method used to tackle it.
NO, you did not do anything close to what I stated.

You're "illustrating math" for a different approach.

You need to read my comments again.
No need to be argumentative.

If you really think you have something significantly different, please illustrate.

I collect algorithms; it would tickle me pink to be proven wrong. More importantly, it would benefit the asker.
Just to give us a consistent test bed, try this data:


Insert into NUMBERS (ID, N) Values (1, 527);
Insert into NUMBERS (ID, N) Values (2, 345);
Insert into NUMBERS (ID, N) Values (3, 197);
Insert into NUMBERS (ID, N) Values (4, 420);
Insert into NUMBERS (ID, N) Values (5, 964);
Insert into NUMBERS (ID, N) Values (6, 74);
Insert into NUMBERS (ID, N) Values (7, 104);
Insert into NUMBERS (ID, N) Values (8, 677);
Insert into NUMBERS (ID, N) Values (9, 24);
Insert into NUMBERS (ID, N) Values (10, 357);
Insert into NUMBERS (ID, N) Values (11, 40);
Insert into NUMBERS (ID, N) Values (12, 212);
Insert into NUMBERS (ID, N) Values (13, 478);
Insert into NUMBERS (ID, N) Values (14, 76);
Insert into NUMBERS (ID, N) Values (15, 244);
Insert into NUMBERS (ID, N) Values (16, 640);
Insert into NUMBERS (ID, N) Values (17, 481);
Insert into NUMBERS (ID, N) Values (18, 304);
Insert into NUMBERS (ID, N) Values (19, 91);
Insert into NUMBERS (ID, N) Values (20, 589);
Insert into NUMBERS (ID, N) Values (21, 676);
Insert into NUMBERS (ID, N) Values (22, 276);
Insert into NUMBERS (ID, N) Values (23, 369);
Insert into NUMBERS (ID, N) Values (24, 405);
Insert into NUMBERS (ID, N) Values (25, 161);
Insert into NUMBERS (ID, N) Values (26, 232);
Insert into NUMBERS (ID, N) Values (27, 356);
Insert into NUMBERS (ID, N) Values (28, 752);
Insert into NUMBERS (ID, N) Values (29, 170);
Insert into NUMBERS (ID, N) Values (30, 570);
Insert into NUMBERS (ID, N) Values (31, 775);
Insert into NUMBERS (ID, N) Values (32, 511);
Insert into NUMBERS (ID, N) Values (33, 379);
Insert into NUMBERS (ID, N) Values (34, 192);
Insert into NUMBERS (ID, N) Values (35, 418);
Insert into NUMBERS (ID, N) Values (36, 647);
Insert into NUMBERS (ID, N) Values (37, 592);
Insert into NUMBERS (ID, N) Values (38, 98);
Insert into NUMBERS (ID, N) Values (39, 35);
Insert into NUMBERS (ID, N) Values (40, 711);
Insert into NUMBERS (ID, N) Values (41, 45);
Insert into NUMBERS (ID, N) Values (42, 101);
Insert into NUMBERS (ID, N) Values (43, 980);
Insert into NUMBERS (ID, N) Values (44, 340);
Insert into NUMBERS (ID, N) Values (45, 69);
Insert into NUMBERS (ID, N) Values (46, 262);
Insert into NUMBERS (ID, N) Values (47, 552);
Insert into NUMBERS (ID, N) Values (48, 631);
Insert into NUMBERS (ID, N) Values (49, 459);
Insert into NUMBERS (ID, N) Values (50, 280);
Insert into NUMBERS (ID, N) Values (51, 501);
Insert into NUMBERS (ID, N) Values (52, 73);
Insert into NUMBERS (ID, N) Values (53, 961);
Insert into NUMBERS (ID, N) Values (54, 849);
Insert into NUMBERS (ID, N) Values (55, 737);
Insert into NUMBERS (ID, N) Values (56, 687);
Insert into NUMBERS (ID, N) Values (57, 576);
Insert into NUMBERS (ID, N) Values (58, 552);
Insert into NUMBERS (ID, N) Values (59, 172);
Insert into NUMBERS (ID, N) Values (60, 133);
Insert into NUMBERS (ID, N) Values (61, 643);
Insert into NUMBERS (ID, N) Values (62, 878);
Insert into NUMBERS (ID, N) Values (63, 188);
Insert into NUMBERS (ID, N) Values (64, 402);
Insert into NUMBERS (ID, N) Values (65, 548);
Insert into NUMBERS (ID, N) Values (66, 457);
Insert into NUMBERS (ID, N) Values (67, 858);
Insert into NUMBERS (ID, N) Values (68, 70);
Insert into NUMBERS (ID, N) Values (69, 172);
Insert into NUMBERS (ID, N) Values (70, 278);
Insert into NUMBERS (ID, N) Values (71, 351);
Insert into NUMBERS (ID, N) Values (72, 525);
Insert into NUMBERS (ID, N) Values (73, 404);
Insert into NUMBERS (ID, N) Values (74, 71);
Insert into NUMBERS (ID, N) Values (75, 928);
Insert into NUMBERS (ID, N) Values (76, 744);
Insert into NUMBERS (ID, N) Values (77, 537);
Insert into NUMBERS (ID, N) Values (78, 920);
Insert into NUMBERS (ID, N) Values (79, 36);
Insert into NUMBERS (ID, N) Values (80, 619);
Insert into NUMBERS (ID, N) Values (81, 510);
Insert into NUMBERS (ID, N) Values (82, 856);
Insert into NUMBERS (ID, N) Values (83, 825);
Insert into NUMBERS (ID, N) Values (84, 131);
Insert into NUMBERS (ID, N) Values (85, 419);
Insert into NUMBERS (ID, N) Values (86, 512);
Insert into NUMBERS (ID, N) Values (87, 197);
Insert into NUMBERS (ID, N) Values (88, 954);
Insert into NUMBERS (ID, N) Values (89, 743);
Insert into NUMBERS (ID, N) Values (90, 589);
Insert into NUMBERS (ID, N) Values (91, 779);
Insert into NUMBERS (ID, N) Values (92, 235);
Insert into NUMBERS (ID, N) Values (93, 975);
Insert into NUMBERS (ID, N) Values (94, 221);
Insert into NUMBERS (ID, N) Values (95, 113);
Insert into NUMBERS (ID, N) Values (96, 592);
Insert into NUMBERS (ID, N) Values (97, 558);
Insert into NUMBERS (ID, N) Values (98, 838);
Insert into NUMBERS (ID, N) Values (99, 409);
Insert into NUMBERS (ID, N) Values (100, 240);
Insert into NUMBERS (ID, N) Values (101, 539);
Insert into NUMBERS (ID, N) Values (102, 902);
Insert into NUMBERS (ID, N) Values (103, 157);
Insert into NUMBERS (ID, N) Values (104, 862);
Insert into NUMBERS (ID, N) Values (105, 350);
Insert into NUMBERS (ID, N) Values (106, 429);
Insert into NUMBERS (ID, N) Values (107, 816);
Insert into NUMBERS (ID, N) Values (108, 498);
Insert into NUMBERS (ID, N) Values (109, 769);
Insert into NUMBERS (ID, N) Values (110, 772);
Insert into NUMBERS (ID, N) Values (111, 306);
Insert into NUMBERS (ID, N) Values (112, 86);
Insert into NUMBERS (ID, N) Values (113, 407);
Insert into NUMBERS (ID, N) Values (114, 996);
Insert into NUMBERS (ID, N) Values (115, 263);
Insert into NUMBERS (ID, N) Values (116, 62);
Insert into NUMBERS (ID, N) Values (117, 390);
Insert into NUMBERS (ID, N) Values (118, 384);
Insert into NUMBERS (ID, N) Values (119, 357);
Insert into NUMBERS (ID, N) Values (120, 291);
Insert into NUMBERS (ID, N) Values (121, 47);
Insert into NUMBERS (ID, N) Values (122, 637);
Insert into NUMBERS (ID, N) Values (123, 304);
Insert into NUMBERS (ID, N) Values (124, 147);
Insert into NUMBERS (ID, N) Values (125, 498);
Insert into NUMBERS (ID, N) Values (126, 463);
Insert into NUMBERS (ID, N) Values (127, 118);
Insert into NUMBERS (ID, N) Values (128, 961);
Insert into NUMBERS (ID, N) Values (129, 908);
Insert into NUMBERS (ID, N) Values (130, 338);
Insert into NUMBERS (ID, N) Values (131, 983);
Insert into NUMBERS (ID, N) Values (132, 75);
Insert into NUMBERS (ID, N) Values (133, 822);
Insert into NUMBERS (ID, N) Values (134, 192);
Insert into NUMBERS (ID, N) Values (135, 481);
Insert into NUMBERS (ID, N) Values (136, 859);
Insert into NUMBERS (ID, N) Values (137, 285);
Insert into NUMBERS (ID, N) Values (138, 641);
Insert into NUMBERS (ID, N) Values (139, 628);
Insert into NUMBERS (ID, N) Values (140, 188);
Insert into NUMBERS (ID, N) Values (141, 284);
Insert into NUMBERS (ID, N) Values (142, 502);
Insert into NUMBERS (ID, N) Values (143, 40);
Insert into NUMBERS (ID, N) Values (144, 150);
Insert into NUMBERS (ID, N) Values (145, 585);
Insert into NUMBERS (ID, N) Values (146, 43);
Insert into NUMBERS (ID, N) Values (147, 297);
Insert into NUMBERS (ID, N) Values (148, 364);
Insert into NUMBERS (ID, N) Values (149, 604);
Insert into NUMBERS (ID, N) Values (150, 468);
Insert into NUMBERS (ID, N) Values (151, 882);
Insert into NUMBERS (ID, N) Values (152, 604);
Insert into NUMBERS (ID, N) Values (153, 475);
Insert into NUMBERS (ID, N) Values (154, 712);
Insert into NUMBERS (ID, N) Values (155, 386);
Insert into NUMBERS (ID, N) Values (156, 37);
Insert into NUMBERS (ID, N) Values (157, 364);
Insert into NUMBERS (ID, N) Values (158, 274);
Insert into NUMBERS (ID, N) Values (159, 823);
Insert into NUMBERS (ID, N) Values (160, 258);
Insert into NUMBERS (ID, N) Values (161, 766);
Insert into NUMBERS (ID, N) Values (162, 800);
Insert into NUMBERS (ID, N) Values (163, 572);
Insert into NUMBERS (ID, N) Values (164, 591);
Insert into NUMBERS (ID, N) Values (165, 963);
Insert into NUMBERS (ID, N) Values (166, 785);
Insert into NUMBERS (ID, N) Values (167, 944);
Insert into NUMBERS (ID, N) Values (168, 691);
Insert into NUMBERS (ID, N) Values (169, 637);
Insert into NUMBERS (ID, N) Values (170, 890);
Insert into NUMBERS (ID, N) Values (171, 249);
Insert into NUMBERS (ID, N) Values (172, 672);
Insert into NUMBERS (ID, N) Values (173, 885);
Insert into NUMBERS (ID, N) Values (174, 396);
Insert into NUMBERS (ID, N) Values (175, 439);
Insert into NUMBERS (ID, N) Values (176, 631);
Insert into NUMBERS (ID, N) Values (177, 679);
Insert into NUMBERS (ID, N) Values (178, 600);
Insert into NUMBERS (ID, N) Values (179, 510);
Insert into NUMBERS (ID, N) Values (180, 103);
Insert into NUMBERS (ID, N) Values (181, 930);
Insert into NUMBERS (ID, N) Values (182, 999);
Insert into NUMBERS (ID, N) Values (183, 565);
Insert into NUMBERS (ID, N) Values (184, 384);
Insert into NUMBERS (ID, N) Values (185, 50);
Insert into NUMBERS (ID, N) Values (186, 370);
Insert into NUMBERS (ID, N) Values (187, 111);
Insert into NUMBERS (ID, N) Values (188, 186);
Insert into NUMBERS (ID, N) Values (189, 464);
Insert into NUMBERS (ID, N) Values (190, 304);
Insert into NUMBERS (ID, N) Values (191, 118);
Insert into NUMBERS (ID, N) Values (192, 51);
Insert into NUMBERS (ID, N) Values (193, 441);
Insert into NUMBERS (ID, N) Values (194, 200);
Insert into NUMBERS (ID, N) Values (195, 778);
Insert into NUMBERS (ID, N) Values (196, 246);
Insert into NUMBERS (ID, N) Values (197, 287);
Insert into NUMBERS (ID, N) Values (198, 984);
Insert into NUMBERS (ID, N) Values (199, 840);
Insert into NUMBERS (ID, N) Values (200, 222);
Insert into NUMBERS (ID, N) Values (201, 932);
Insert into NUMBERS (ID, N) Values (202, 545);
Insert into NUMBERS (ID, N) Values (203, 36);
Insert into NUMBERS (ID, N) Values (204, 489);
Insert into NUMBERS (ID, N) Values (205, 863);
Insert into NUMBERS (ID, N) Values (206, 321);
Insert into NUMBERS (ID, N) Values (207, 235);
Insert into NUMBERS (ID, N) Values (208, 154);
Insert into NUMBERS (ID, N) Values (209, 312);
Insert into NUMBERS (ID, N) Values (210, 896);
Insert into NUMBERS (ID, N) Values (211, 138);
Insert into NUMBERS (ID, N) Values (212, 294);
Insert into NUMBERS (ID, N) Values (213, 273);
Insert into NUMBERS (ID, N) Values (214, 388);
Insert into NUMBERS (ID, N) Values (215, 641);
Insert into NUMBERS (ID, N) Values (216, 815);
Insert into NUMBERS (ID, N) Values (217, 170);
Insert into NUMBERS (ID, N) Values (218, 682);
Insert into NUMBERS (ID, N) Values (219, 776);
Insert into NUMBERS (ID, N) Values (220, 427);
Insert into NUMBERS (ID, N) Values (221, 90);
Insert into NUMBERS (ID, N) Values (222, 62);
Insert into NUMBERS (ID, N) Values (223, 741);
Insert into NUMBERS (ID, N) Values (224, 78);
Insert into NUMBERS (ID, N) Values (225, 90);
Insert into NUMBERS (ID, N) Values (226, 27);
Insert into NUMBERS (ID, N) Values (227, 917);
Insert into NUMBERS (ID, N) Values (228, 172);
Insert into NUMBERS (ID, N) Values (229, 380);
Insert into NUMBERS (ID, N) Values (230, 279);
Insert into NUMBERS (ID, N) Values (231, 852);
Insert into NUMBERS (ID, N) Values (232, 611);
Insert into NUMBERS (ID, N) Values (233, 145);
Insert into NUMBERS (ID, N) Values (234, 545);
Insert into NUMBERS (ID, N) Values (235, 590);
Insert into NUMBERS (ID, N) Values (236, 793);
Insert into NUMBERS (ID, N) Values (237, 320);
Insert into NUMBERS (ID, N) Values (238, 798);
Insert into NUMBERS (ID, N) Values (239, 537);
Insert into NUMBERS (ID, N) Values (240, 361);
Insert into NUMBERS (ID, N) Values (241, 265);
Insert into NUMBERS (ID, N) Values (242, 248);
Insert into NUMBERS (ID, N) Values (243, 480);
Insert into NUMBERS (ID, N) Values (244, 736);
Insert into NUMBERS (ID, N) Values (245, 691);
Insert into NUMBERS (ID, N) Values (246, 758);
Insert into NUMBERS (ID, N) Values (247, 864);
Insert into NUMBERS (ID, N) Values (248, 609);
Insert into NUMBERS (ID, N) Values (249, 881);
Insert into NUMBERS (ID, N) Values (250, 554);
Insert into NUMBERS (ID, N) Values (251, 672);
Insert into NUMBERS (ID, N) Values (252, 376);
Insert into NUMBERS (ID, N) Values (253, 46);
Insert into NUMBERS (ID, N) Values (254, 580);
Insert into NUMBERS (ID, N) Values (255, 299);
Insert into NUMBERS (ID, N) Values (256, 21);
Insert into NUMBERS (ID, N) Values (257, 571);
Insert into NUMBERS (ID, N) Values (258, 953);
Insert into NUMBERS (ID, N) Values (259, 660);
Insert into NUMBERS (ID, N) Values (260, 243);
Insert into NUMBERS (ID, N) Values (261, 599);
Insert into NUMBERS (ID, N) Values (262, 87);
Insert into NUMBERS (ID, N) Values (263, 764);
Insert into NUMBERS (ID, N) Values (264, 455);
Insert into NUMBERS (ID, N) Values (265, 441);
Insert into NUMBERS (ID, N) Values (266, 728);
Insert into NUMBERS (ID, N) Values (267, 88);
Insert into NUMBERS (ID, N) Values (268, 592);
Insert into NUMBERS (ID, N) Values (269, 187);
Insert into NUMBERS (ID, N) Values (270, 178);
Insert into NUMBERS (ID, N) Values (271, 175);
Insert into NUMBERS (ID, N) Values (272, 434);
Insert into NUMBERS (ID, N) Values (273, 929);
Insert into NUMBERS (ID, N) Values (274, 256);
Insert into NUMBERS (ID, N) Values (275, 163);
Insert into NUMBERS (ID, N) Values (276, 780);
Insert into NUMBERS (ID, N) Values (277, 818);
Insert into NUMBERS (ID, N) Values (278, 605);
Insert into NUMBERS (ID, N) Values (279, 686);
Insert into NUMBERS (ID, N) Values (280, 969);
Insert into NUMBERS (ID, N) Values (281, 581);
Insert into NUMBERS (ID, N) Values (282, 590);
Insert into NUMBERS (ID, N) Values (283, 547);
Insert into NUMBERS (ID, N) Values (284, 424);
Insert into NUMBERS (ID, N) Values (285, 858);
Insert into NUMBERS (ID, N) Values (286, 152);
Insert into NUMBERS (ID, N) Values (287, 631);
Insert into NUMBERS (ID, N) Values (288, 715);
Insert into NUMBERS (ID, N) Values (289, 498);
Insert into NUMBERS (ID, N) Values (290, 250);
Insert into NUMBERS (ID, N) Values (291, 36);
Insert into NUMBERS (ID, N) Values (292, 918);
Insert into NUMBERS (ID, N) Values (293, 884);
Insert into NUMBERS (ID, N) Values (294, 301);
Insert into NUMBERS (ID, N) Values (295, 815);
Insert into NUMBERS (ID, N) Values (296, 705);
Insert into NUMBERS (ID, N) Values (297, 975);
Insert into NUMBERS (ID, N) Values (298, 566);
Insert into NUMBERS (ID, N) Values (299, 328);
Insert into NUMBERS (ID, N) Values (300, 877);
Insert into NUMBERS (ID, N) Values (301, 935);
Insert into NUMBERS (ID, N) Values (302, 40);
Insert into NUMBERS (ID, N) Values (303, 44);
Insert into NUMBERS (ID, N) Values (304, 810);
Insert into NUMBERS (ID, N) Values (305, 809);
Insert into NUMBERS (ID, N) Values (306, 834);
Insert into NUMBERS (ID, N) Values (307, 156);
Insert into NUMBERS (ID, N) Values (308, 863);
Insert into NUMBERS (ID, N) Values (309, 185);
Insert into NUMBERS (ID, N) Values (310, 985);
Insert into NUMBERS (ID, N) Values (311, 990);
Insert into NUMBERS (ID, N) Values (312, 152);
Insert into NUMBERS (ID, N) Values (313, 542);
Insert into NUMBERS (ID, N) Values (314, 206);
Insert into NUMBERS (ID, N) Values (315, 666);
Insert into NUMBERS (ID, N) Values (316, 457);
Insert into NUMBERS (ID, N) Values (317, 238);
Insert into NUMBERS (ID, N) Values (318, 395);
Insert into NUMBERS (ID, N) Values (319, 170);
Insert into NUMBERS (ID, N) Values (320, 938);
Insert into NUMBERS (ID, N) Values (321, 977);
Insert into NUMBERS (ID, N) Values (322, 122);
Insert into NUMBERS (ID, N) Values (323, 510);
Insert into NUMBERS (ID, N) Values (324, 71);
Insert into NUMBERS (ID, N) Values (325, 478);
Insert into NUMBERS (ID, N) Values (326, 989);
Insert into NUMBERS (ID, N) Values (327, 139);
Insert into NUMBERS (ID, N) Values (328, 904);
Insert into NUMBERS (ID, N) Values (329, 822);
Insert into NUMBERS (ID, N) Values (330, 489);
Insert into NUMBERS (ID, N) Values (331, 657);
Insert into NUMBERS (ID, N) Values (332, 753);
Insert into NUMBERS (ID, N) Values (333, 644);
Insert into NUMBERS (ID, N) Values (334, 729);
Insert into NUMBERS (ID, N) Values (335, 779);
Insert into NUMBERS (ID, N) Values (336, 390);
Insert into NUMBERS (ID, N) Values (337, 424);
Insert into NUMBERS (ID, N) Values (338, 701);
Insert into NUMBERS (ID, N) Values (339, 287);
Insert into NUMBERS (ID, N) Values (340, 43);
Insert into NUMBERS (ID, N) Values (341, 136);
Insert into NUMBERS (ID, N) Values (342, 621);
Insert into NUMBERS (ID, N) Values (343, 866);
Insert into NUMBERS (ID, N) Values (344, 41);
Insert into NUMBERS (ID, N) Values (345, 456);
Insert into NUMBERS (ID, N) Values (346, 701);
Insert into NUMBERS (ID, N) Values (347, 375);
Insert into NUMBERS (ID, N) Values (348, 122);
Insert into NUMBERS (ID, N) Values (349, 694);
Insert into NUMBERS (ID, N) Values (350, 984);
Insert into NUMBERS (ID, N) Values (351, 644);
Insert into NUMBERS (ID, N) Values (352, 952);
Insert into NUMBERS (ID, N) Values (353, 688);
Insert into NUMBERS (ID, N) Values (354, 836);
Insert into NUMBERS (ID, N) Values (355, 947);
Insert into NUMBERS (ID, N) Values (356, 412);
Insert into NUMBERS (ID, N) Values (357, 29);
Insert into NUMBERS (ID, N) Values (358, 182);
Insert into NUMBERS (ID, N) Values (359, 714);
Insert into NUMBERS (ID, N) Values (360, 631);
Insert into NUMBERS (ID, N) Values (361, 323);
Insert into NUMBERS (ID, N) Values (362, 811);
Insert into NUMBERS (ID, N) Values (363, 616);
Insert into NUMBERS (ID, N) Values (364, 828);
Insert into NUMBERS (ID, N) Values (365, 713);
Insert into NUMBERS (ID, N) Values (366, 769);
Insert into NUMBERS (ID, N) Values (367, 541);
Insert into NUMBERS (ID, N) Values (368, 965);
Insert into NUMBERS (ID, N) Values (369, 907);
Insert into NUMBERS (ID, N) Values (370, 952);
Insert into NUMBERS (ID, N) Values (371, 499);
Insert into NUMBERS (ID, N) Values (372, 373);
Insert into NUMBERS (ID, N) Values (373, 15);
Insert into NUMBERS (ID, N) Values (374, 37);
Insert into NUMBERS (ID, N) Values (375, 978);
Insert into NUMBERS (ID, N) Values (376, 432);
Insert into NUMBERS (ID, N) Values (377, 822);
Insert into NUMBERS (ID, N) Values (378, 884);
Insert into NUMBERS (ID, N) Values (379, 192);
Insert into NUMBERS (ID, N) Values (380, 172);
Insert into NUMBERS (ID, N) Values (381, 973);
Insert into NUMBERS (ID, N) Values (382, 782);
Insert into NUMBERS (ID, N) Values (383, 857);
Insert into NUMBERS (ID, N) Values (384, 509);
Insert into NUMBERS (ID, N) Values (385, 326);
Insert into NUMBERS (ID, N) Values (386, 603);
Insert into NUMBERS (ID, N) Values (387, 165);
Insert into NUMBERS (ID, N) Values (388, 671);
Insert into NUMBERS (ID, N) Values (389, 910);
Insert into NUMBERS (ID, N) Values (390, 494);
Insert into NUMBERS (ID, N) Values (391, 21);
Insert into NUMBERS (ID, N) Values (392, 746);
Insert into NUMBERS (ID, N) Values (393, 513);
Insert into NUMBERS (ID, N) Values (394, 902);
Insert into NUMBERS (ID, N) Values (395, 870);
Insert into NUMBERS (ID, N) Values (396, 849);
Insert into NUMBERS (ID, N) Values (397, 390);
Insert into NUMBERS (ID, N) Values (398, 408);
Insert into NUMBERS (ID, N) Values (399, 6);
Insert into NUMBERS (ID, N) Values (400, 362);
Insert into NUMBERS (ID, N) Values (401, 652);
Insert into NUMBERS (ID, N) Values (402, 873);
Insert into NUMBERS (ID, N) Values (403, 494);
Insert into NUMBERS (ID, N) Values (404, 709);
Insert into NUMBERS (ID, N) Values (405, 21);
Insert into NUMBERS (ID, N) Values (406, 622);
Insert into NUMBERS (ID, N) Values (407, 384);
Insert into NUMBERS (ID, N) Values (408, 510);
Insert into NUMBERS (ID, N) Values (409, 720);
Insert into NUMBERS (ID, N) Values (410, 139);
Insert into NUMBERS (ID, N) Values (411, 583);
Insert into NUMBERS (ID, N) Values (412, 2);
Insert into NUMBERS (ID, N) Values (413, 963);
Insert into NUMBERS (ID, N) Values (414, 571);
Insert into NUMBERS (ID, N) Values (415, 140);
Insert into NUMBERS (ID, N) Values (416, 648);
Insert into NUMBERS (ID, N) Values (417, 415);
Insert into NUMBERS (ID, N) Values (418, 780);
Insert into NUMBERS (ID, N) Values (419, 499);
Insert into NUMBERS (ID, N) Values (420, 623);
Insert into NUMBERS (ID, N) Values (421, 263);
Insert into NUMBERS (ID, N) Values (422, 562);
Insert into NUMBERS (ID, N) Values (423, 711);
Insert into NUMBERS (ID, N) Values (424, 420);
Insert into NUMBERS (ID, N) Values (425, 854);
Insert into NUMBERS (ID, N) Values (426, 369);
Insert into NUMBERS (ID, N) Values (427, 221);
Insert into NUMBERS (ID, N) Values (428, 404);
Insert into NUMBERS (ID, N) Values (429, 443);
Insert into NUMBERS (ID, N) Values (430, 983);
Insert into NUMBERS (ID, N) Values (431, 794);
Insert into NUMBERS (ID, N) Values (432, 474);
Insert into NUMBERS (ID, N) Values (433, 757);
Insert into NUMBERS (ID, N) Values (434, 685);
Insert into NUMBERS (ID, N) Values (435, 880);
Insert into NUMBERS (ID, N) Values (436, 993);
Insert into NUMBERS (ID, N) Values (437, 404);
Insert into NUMBERS (ID, N) Values (438, 241);
Insert into NUMBERS (ID, N) Values (439, 19);
Insert into NUMBERS (ID, N) Values (440, 46);
Insert into NUMBERS (ID, N) Values (441, 741);
Insert into NUMBERS (ID, N) Values (442, 747);
Insert into NUMBERS (ID, N) Values (443, 672);
Insert into NUMBERS (ID, N) Values (444, 873);
Insert into NUMBERS (ID, N) Values (445, 64);
Insert into NUMBERS (ID, N) Values (446, 160);
Insert into NUMBERS (ID, N) Values (447, 394);
Insert into NUMBERS (ID, N) Values (448, 927);
Insert into NUMBERS (ID, N) Values (449, 682);
Insert into NUMBERS (ID, N) Values (450, 369);
Insert into NUMBERS (ID, N) Values (451, 472);
Insert into NUMBERS (ID, N) Values (452, 651);
Insert into NUMBERS (ID, N) Values (453, 968);
Insert into NUMBERS (ID, N) Values (454, 716);
Insert into NUMBERS (ID, N) Values (455, 781);
Insert into NUMBERS (ID, N) Values (456, 506);
Insert into NUMBERS (ID, N) Values (457, 242);
Insert into NUMBERS (ID, N) Values (458, 714);
Insert into NUMBERS (ID, N) Values (459, 113);
Insert into NUMBERS (ID, N) Values (460, 463);
Insert into NUMBERS (ID, N) Values (461, 605);
Insert into NUMBERS (ID, N) Values (462, 178);
Insert into NUMBERS (ID, N) Values (463, 983);
Insert into NUMBERS (ID, N) Values (464, 477);
Insert into NUMBERS (ID, N) Values (465, 822);
Insert into NUMBERS (ID, N) Values (466, 463);
Insert into NUMBERS (ID, N) Values (467, 993);
Insert into NUMBERS (ID, N) Values (468, 367);
Insert into NUMBERS (ID, N) Values (469, 811);
Insert into NUMBERS (ID, N) Values (470, 159);
Insert into NUMBERS (ID, N) Values (471, 693);
Insert into NUMBERS (ID, N) Values (472, 156);
Insert into NUMBERS (ID, N) Values (473, 527);
Insert into NUMBERS (ID, N) Values (474, 171);
Insert into NUMBERS (ID, N) Values (475, 496);
Insert into NUMBERS (ID, N) Values (476, 326);
Insert into NUMBERS (ID, N) Values (477, 721);
Insert into NUMBERS (ID, N) Values (478, 106);
Insert into NUMBERS (ID, N) Values (479, 346);
Insert into NUMBERS (ID, N) Values (480, 535);
Insert into NUMBERS (ID, N) Values (481, 737);
Insert into NUMBERS (ID, N) Values (482, 692);
Insert into NUMBERS (ID, N) Values (483, 55);
Insert into NUMBERS (ID, N) Values (484, 412);
Insert into NUMBERS (ID, N) Values (485, 699);
Insert into NUMBERS (ID, N) Values (486, 574);
Insert into NUMBERS (ID, N) Values (487, 979);
Insert into NUMBERS (ID, N) Values (488, 999);
Insert into NUMBERS (ID, N) Values (489, 399);
Insert into NUMBERS (ID, N) Values (490, 992);
Insert into NUMBERS (ID, N) Values (491, 456);
Insert into NUMBERS (ID, N) Values (492, 8);
Insert into NUMBERS (ID, N) Values (493, 418);
Insert into NUMBERS (ID, N) Values (494, 3);
Insert into NUMBERS (ID, N) Values (495, 523);
Insert into NUMBERS (ID, N) Values (496, 563);
Insert into NUMBERS (ID, N) Values (497, 211);
Insert into NUMBERS (ID, N) Values (498, 666);
Insert into NUMBERS (ID, N) Values (499, 239);
Insert into NUMBERS (ID, N) Values (500, 874);
Insert into NUMBERS (ID, N) Values (501, 318);
Insert into NUMBERS (ID, N) Values (502, 88);
Insert into NUMBERS (ID, N) Values (503, 83);
Insert into NUMBERS (ID, N) Values (504, 209);
Insert into NUMBERS (ID, N) Values (505, 539);
Insert into NUMBERS (ID, N) Values (506, 967);
Insert into NUMBERS (ID, N) Values (507, 976);
Insert into NUMBERS (ID, N) Values (508, 689);
Insert into NUMBERS (ID, N) Values (509, 821);
Insert into NUMBERS (ID, N) Values (510, 127);
Insert into NUMBERS (ID, N) Values (511, 41);
Insert into NUMBERS (ID, N) Values (512, 979);
Insert into NUMBERS (ID, N) Values (513, 406);
Insert into NUMBERS (ID, N) Values (514, 167);
Insert into NUMBERS (ID, N) Values (515, 873);
Insert into NUMBERS (ID, N) Values (516, 304);
Insert into NUMBERS (ID, N) Values (517, 752);
Insert into NUMBERS (ID, N) Values (518, 963);
Insert into NUMBERS (ID, N) Values (519, 476);
Insert into NUMBERS (ID, N) Values (520, 221);
Insert into NUMBERS (ID, N) Values (521, 455);
Insert into NUMBERS (ID, N) Values (522, 449);
Insert into NUMBERS (ID, N) Values (523, 374);
Insert into NUMBERS (ID, N) Values (524, 229);
Insert into NUMBERS (ID, N) Values (525, 161);
Insert into NUMBERS (ID, N) Values (526, 216);
Insert into NUMBERS (ID, N) Values (527, 718);
Insert into NUMBERS (ID, N) Values (528, 737);
Insert into NUMBERS (ID, N) Values (529, 836);
Insert into NUMBERS (ID, N) Values (530, 734);
Insert into NUMBERS (ID, N) Values (531, 200);
Insert into NUMBERS (ID, N) Values (532, 39);
Insert into NUMBERS (ID, N) Values (533, 193);
Insert into NUMBERS (ID, N) Values (534, 428);
Insert into NUMBERS (ID, N) Values (535, 744);
Insert into NUMBERS (ID, N) Values (536, 277);
Insert into NUMBERS (ID, N) Values (537, 659);
Insert into NUMBERS (ID, N) Values (538, 31);
Insert into NUMBERS (ID, N) Values (539, 101);
Insert into NUMBERS (ID, N) Values (540, 520);
Insert into NUMBERS (ID, N) Values (541, 701);
Insert into NUMBERS (ID, N) Values (542, 21);
Insert into NUMBERS (ID, N) Values (543, 977);
Insert into NUMBERS (ID, N) Values (544, 804);
Insert into NUMBERS (ID, N) Values (545, 159);
Insert into NUMBERS (ID, N) Values (546, 329);
Insert into NUMBERS (ID, N) Values (547, 311);
Insert into NUMBERS (ID, N) Values (548, 170);
Insert into NUMBERS (ID, N) Values (549, 965);
Insert into NUMBERS (ID, N) Values (550, 998);
Insert into NUMBERS (ID, N) Values (551, 783);
Insert into NUMBERS (ID, N) Values (552, 664);
Insert into NUMBERS (ID, N) Values (553, 115);
Insert into NUMBERS (ID, N) Values (554, 612);
Insert into NUMBERS (ID, N) Values (555, 103);
Insert into NUMBERS (ID, N) Values (556, 478);
Insert into NUMBERS (ID, N) Values (557, 303);
Insert into NUMBERS (ID, N) Values (558, 800);
Insert into NUMBERS (ID, N) Values (559, 945);
Insert into NUMBERS (ID, N) Values (560, 375);
Insert into NUMBERS (ID, N) Values (561, 701);
Insert into NUMBERS (ID, N) Values (562, 176);
Insert into NUMBERS (ID, N) Values (563, 728);
Insert into NUMBERS (ID, N) Values (564, 13);
Insert into NUMBERS (ID, N) Values (565, 554);
Insert into NUMBERS (ID, N) Values (566, 784);
Insert into NUMBERS (ID, N) Values (567, 256);
Insert into NUMBERS (ID, N) Values (568, 66);
Insert into NUMBERS (ID, N) Values (569, 197);
Insert into NUMBERS (ID, N) Values (570, 973);
Insert into NUMBERS (ID, N) Values (571, 822);
Insert into NUMBERS (ID, N) Values (572, 452);
Insert into NUMBERS (ID, N) Values (573, 982);
Insert into NUMBERS (ID, N) Values (574, 453);
Insert into NUMBERS (ID, N) Values (575, 25);
Insert into NUMBERS (ID, N) Values (576, 612);
Insert into NUMBERS (ID, N) Values (577, 777);
Insert into NUMBERS (ID, N) Values (578, 684);
Insert into NUMBERS (ID, N) Values (579, 398);
Insert into NUMBERS (ID, N) Values (580, 125);
Insert into NUMBERS (ID, N) Values (581, 214);
Insert into NUMBERS (ID, N) Values (582, 502);
Insert into NUMBERS (ID, N) Values (583, 401);
Insert into NUMBERS (ID, N) Values (584, 949);
Insert into NUMBERS (ID, N) Values (585, 347);
Insert into NUMBERS (ID, N) Values (586, 302);
Insert into NUMBERS (ID, N) Values (587, 516);
Insert into NUMBERS (ID, N) Values (588, 494);
Insert into NUMBERS (ID, N) Values (589, 228);
Insert into NUMBERS (ID, N) Values (590, 689);
Insert into NUMBERS (ID, N) Values (591, 651);
Insert into NUMBERS (ID, N) Values (592, 361);
Insert into NUMBERS (ID, N) Values (593, 207);
Insert into NUMBERS (ID, N) Values (594, 828);
Insert into NUMBERS (ID, N) Values (595, 532);
Insert into NUMBERS (ID, N) Values (596, 255);
Insert into NUMBERS (ID, N) Values (597, 804);
Insert into NUMBERS (ID, N) Values (598, 233);
Insert into NUMBERS (ID, N) Values (599, 869);
Insert into NUMBERS (ID, N) Values (600, 355);
Insert into NUMBERS (ID, N) Values (601, 302);
Insert into NUMBERS (ID, N) Values (602, 133);
Insert into NUMBERS (ID, N) Values (603, 621);
Insert into NUMBERS (ID, N) Values (604, 947);
Insert into NUMBERS (ID, N) Values (605, 451);
Insert into NUMBERS (ID, N) Values (606, 807);
Insert into NUMBERS (ID, N) Values (607, 277);
Insert into NUMBERS (ID, N) Values (608, 891);
Insert into NUMBERS (ID, N) Values (609, 296);
Insert into NUMBERS (ID, N) Values (610, 500);
Insert into NUMBERS (ID, N) Values (611, 602);
Insert into NUMBERS (ID, N) Values (612, 515);
Insert into NUMBERS (ID, N) Values (613, 302);
Insert into NUMBERS (ID, N) Values (614, 346);
Insert into NUMBERS (ID, N) Values (615, 325);
Insert into NUMBERS (ID, N) Values (616, 48);
Insert into NUMBERS (ID, N) Values (617, 477);
Insert into NUMBERS (ID, N) Values (618, 244);
Insert into NUMBERS (ID, N) Values (619, 507);
Insert into NUMBERS (ID, N) Values (620, 781);
Insert into NUMBERS (ID, N) Values (621, 473);
Insert into NUMBERS (ID, N) Values (622, 906);
Insert into NUMBERS (ID, N) Values (623, 425);
Insert into NUMBERS (ID, N) Values (624, 403);
Insert into NUMBERS (ID, N) Values (625, 801);
Insert into NUMBERS (ID, N) Values (626, 355);
Insert into NUMBERS (ID, N) Values (627, 706);
Insert into NUMBERS (ID, N) Values (628, 786);
Insert into NUMBERS (ID, N) Values (629, 685);
Insert into NUMBERS (ID, N) Values (630, 892);
Insert into NUMBERS (ID, N) Values (631, 967);
Insert into NUMBERS (ID, N) Values (632, 79);
Insert into NUMBERS (ID, N) Values (633, 816);
Insert into NUMBERS (ID, N) Values (634, 19);
Insert into NUMBERS (ID, N) Values (635, 72);
Insert into NUMBERS (ID, N) Values (636, 663);
Insert into NUMBERS (ID, N) Values (637, 309);
Insert into NUMBERS (ID, N) Values (638, 677);
Insert into NUMBERS (ID, N) Values (639, 840);
Insert into NUMBERS (ID, N) Values (640, 642);
Insert into NUMBERS (ID, N) Values (641, 801);
Insert into NUMBERS (ID, N) Values (642, 119);
Insert into NUMBERS (ID, N) Values (643, 10);
Insert into NUMBERS (ID, N) Values (644, 529);
Insert into NUMBERS (ID, N) Values (645, 36);
Insert into NUMBERS (ID, N) Values (646, 975);
Insert into NUMBERS (ID, N) Values (647, 408);
Insert into NUMBERS (ID, N) Values (648, 683);
Insert into NUMBERS (ID, N) Values (649, 71);
Insert into NUMBERS (ID, N) Values (650, 39);
Insert into NUMBERS (ID, N) Values (651, 36);
Insert into NUMBERS (ID, N) Values (652, 277);
Insert into NUMBERS (ID, N) Values (653, 139);
Insert into NUMBERS (ID, N) Values (654, 294);
Insert into NUMBERS (ID, N) Values (655, 757);
Insert into NUMBERS (ID, N) Values (656, 103);
Insert into NUMBERS (ID, N) Values (657, 487);
Insert into NUMBERS (ID, N) Values (658, 327);
Insert into NUMBERS (ID, N) Values (659, 733);
Insert into NUMBERS (ID, N) Values (660, 136);
Insert into NUMBERS (ID, N) Values (661, 700);
Insert into NUMBERS (ID, N) Values (662, 243);
Insert into NUMBERS (ID, N) Values (663, 969);
Insert into NUMBERS (ID, N) Values (664, 112);
Insert into NUMBERS (ID, N) Values (665, 518);
Insert into NUMBERS (ID, N) Values (666, 674);
Insert into NUMBERS (ID, N) Values (667, 179);
Insert into NUMBERS (ID, N) Values (668, 610);
Insert into NUMBERS (ID, N) Values (669, 23);
Insert into NUMBERS (ID, N) Values (670, 165);
Insert into NUMBERS (ID, N) Values (671, 689);
Insert into NUMBERS (ID, N) Values (672, 279);
Insert into NUMBERS (ID, N) Values (673, 361);
Insert into NUMBERS (ID, N) Values (674, 516);
Insert into NUMBERS (ID, N) Values (675, 311);
Insert into NUMBERS (ID, N) Values (676, 508);
Insert into NUMBERS (ID, N) Values (677, 881);
Insert into NUMBERS (ID, N) Values (678, 832);
Insert into NUMBERS (ID, N) Values (679, 86);
Insert into NUMBERS (ID, N) Values (680, 871);
Insert into NUMBERS (ID, N) Values (681, 393);
Insert into NUMBERS (ID, N) Values (682, 741);
Insert into NUMBERS (ID, N) Values (683, 63);
Insert into NUMBERS (ID, N) Values (684, 823);
Insert into NUMBERS (ID, N) Values (685, 186);
Insert into NUMBERS (ID, N) Values (686, 724);
Insert into NUMBERS (ID, N) Values (687, 182);
Insert into NUMBERS (ID, N) Values (688, 303);
Insert into NUMBERS (ID, N) Values (689, 345);
Insert into NUMBERS (ID, N) Values (690, 805);
Insert into NUMBERS (ID, N) Values (691, 798);
Insert into NUMBERS (ID, N) Values (692, 8);
Insert into NUMBERS (ID, N) Values (693, 919);
Insert into NUMBERS (ID, N) Values (694, 809);
Insert into NUMBERS (ID, N) Values (695, 753);
Insert into NUMBERS (ID, N) Values (696, 319);
Insert into NUMBERS (ID, N) Values (697, 791);
Insert into NUMBERS (ID, N) Values (698, 188);
Insert into NUMBERS (ID, N) Values (699, 139);
Insert into NUMBERS (ID, N) Values (700, 58);
Insert into NUMBERS (ID, N) Values (701, 139);
Insert into NUMBERS (ID, N) Values (702, 97);
Insert into NUMBERS (ID, N) Values (703, 961);
Insert into NUMBERS (ID, N) Values (704, 432);
Insert into NUMBERS (ID, N) Values (705, 554);
Insert into NUMBERS (ID, N) Values (706, 346);
Insert into NUMBERS (ID, N) Values (707, 784);
Insert into NUMBERS (ID, N) Values (708, 19);
Insert into NUMBERS (ID, N) Values (709, 126);
Insert into NUMBERS (ID, N) Values (710, 842);
Insert into NUMBERS (ID, N) Values (711, 973);
Insert into NUMBERS (ID, N) Values (712, 879);
Insert into NUMBERS (ID, N) Values (713, 68);
Insert into NUMBERS (ID, N) Values (714, 796);
Insert into NUMBERS (ID, N) Values (715, 958);
Insert into NUMBERS (ID, N) Values (716, 885);
Insert into NUMBERS (ID, N) Values (717, 966);
Insert into NUMBERS (ID, N) Values (718, 151);
Insert into NUMBERS (ID, N) Values (719, 415);
Insert into NUMBERS (ID, N) Values (720, 862);
Insert into NUMBERS (ID, N) Values (721, 478);
Insert into NUMBERS (ID, N) Values (722, 976);
Insert into NUMBERS (ID, N) Values (723, 617);
Insert into NUMBERS (ID, N) Values (724, 941);
Insert into NUMBERS (ID, N) Values (725, 972);
Insert into NUMBERS (ID, N) Values (726, 443);
Insert into NUMBERS (ID, N) Values (727, 597);
Insert into NUMBERS (ID, N) Values (728, 153);
Insert into NUMBERS (ID, N) Values (729, 702);
Insert into NUMBERS (ID, N) Values (730, 448);
Insert into NUMBERS (ID, N) Values (731, 564);
Insert into NUMBERS (ID, N) Values (732, 20);
Insert into NUMBERS (ID, N) Values (733, 928);
Insert into NUMBERS (ID, N) Values (734, 47);
Insert into NUMBERS (ID, N) Values (735, 303);
Insert into NUMBERS (ID, N) Values (736, 946);
Insert into NUMBERS (ID, N) Values (737, 86);
Insert into NUMBERS (ID, N) Values (738, 846);
Insert into NUMBERS (ID, N) Values (739, 841);
Insert into NUMBERS (ID, N) Values (740, 312);
Insert into NUMBERS (ID, N) Values (741, 566);
Insert into NUMBERS (ID, N) Values (742, 155);
Insert into NUMBERS (ID, N) Values (743, 182);
Insert into NUMBERS (ID, N) Values (744, 412);
Insert into NUMBERS (ID, N) Values (745, 600);
Insert into NUMBERS (ID, N) Values (746, 756);
Insert into NUMBERS (ID, N) Values (747, 892);
Insert into NUMBERS (ID, N) Values (748, 885);
Insert into NUMBERS (ID, N) Values (749, 958);
Insert into NUMBERS (ID, N) Values (750, 168);
Insert into NUMBERS (ID, N) Values (751, 181);
Insert into NUMBERS (ID, N) Values (752, 270);
Insert into NUMBERS (ID, N) Values (753, 164);
Insert into NUMBERS (ID, N) Values (754, 755);
Insert into NUMBERS (ID, N) Values (755, 998);
Insert into NUMBERS (ID, N) Values (756, 112);
Insert into NUMBERS (ID, N) Values (757, 539);
Insert into NUMBERS (ID, N) Values (758, 558);
Insert into NUMBERS (ID, N) Values (759, 583);
Insert into NUMBERS (ID, N) Values (760, 256);
Insert into NUMBERS (ID, N) Values (761, 793);
Insert into NUMBERS (ID, N) Values (762, 348);
Insert into NUMBERS (ID, N) Values (763, 38);
Insert into NUMBERS (ID, N) Values (764, 55);
Insert into NUMBERS (ID, N) Values (765, 888);
Insert into NUMBERS (ID, N) Values (766, 277);
Insert into NUMBERS (ID, N) Values (767, 825);
Insert into NUMBERS (ID, N) Values (768, 153);
Insert into NUMBERS (ID, N) Values (769, 641);
Insert into NUMBERS (ID, N) Values (770, 799);
Insert into NUMBERS (ID, N) Values (771, 197);
Insert into NUMBERS (ID, N) Values (772, 531);
Insert into NUMBERS (ID, N) Values (773, 305);
Insert into NUMBERS (ID, N) Values (774, 596);
Insert into NUMBERS (ID, N) Values (775, 274);
Insert into NUMBERS (ID, N) Values (776, 79);
Insert into NUMBERS (ID, N) Values (777, 732);
Insert into NUMBERS (ID, N) Values (778, 509);
Insert into NUMBERS (ID, N) Values (779, 826);
Insert into NUMBERS (ID, N) Values (780, 931);
Insert into NUMBERS (ID, N) Values (781, 610);
Insert into NUMBERS (ID, N) Values (782, 777);
Insert into NUMBERS (ID, N) Values (783, 421);
Insert into NUMBERS (ID, N) Values (784, 865);
Insert into NUMBERS (ID, N) Values (785, 203);
Insert into NUMBERS (ID, N) Values (786, 562);
Insert into NUMBERS (ID, N) Values (787, 131);
Insert into NUMBERS (ID, N) Values (788, 467);
Insert into NUMBERS (ID, N) Values (789, 604);
Insert into NUMBERS (ID, N) Values (790, 886);
Insert into NUMBERS (ID, N) Values (791, 202);
Insert into NUMBERS (ID, N) Values (792, 878);
Insert into NUMBERS (ID, N) Values (793, 194);
Insert into NUMBERS (ID, N) Values (794, 878);
Insert into NUMBERS (ID, N) Values (795, 365);
Insert into NUMBERS (ID, N) Values (796, 454);
Insert into NUMBERS (ID, N) Values (797, 431);
Insert into NUMBERS (ID, N) Values (798, 7);
Insert into NUMBERS (ID, N) Values (799, 564);
Insert into NUMBERS (ID, N) Values (800, 242);
Insert into NUMBERS (ID, N) Values (801, 555);
Insert into NUMBERS (ID, N) Values (802, 89);
Insert into NUMBERS (ID, N) Values (803, 416);
Insert into NUMBERS (ID, N) Values (804, 263);
Insert into NUMBERS (ID, N) Values (805, 763);
Insert into NUMBERS (ID, N) Values (806, 454);
Insert into NUMBERS (ID, N) Values (807, 348);
Insert into NUMBERS (ID, N) Values (808, 895);
Insert into NUMBERS (ID, N) Values (809, 264);
Insert into NUMBERS (ID, N) Values (810, 823);
Insert into NUMBERS (ID, N) Values (811, 43);
Insert into NUMBERS (ID, N) Values (812, 149);
Insert into NUMBERS (ID, N) Values (813, 335);
Insert into NUMBERS (ID, N) Values (814, 5);
Insert into NUMBERS (ID, N) Values (815, 121);
Insert into NUMBERS (ID, N) Values (816, 995);
Insert into NUMBERS (ID, N) Values (817, 909);
Insert into NUMBERS (ID, N) Values (818, 168);
Insert into NUMBERS (ID, N) Values (819, 521);
Insert into NUMBERS (ID, N) Values (820, 492);
Insert into NUMBERS (ID, N) Values (821, 162);
Insert into NUMBERS (ID, N) Values (822, 27);
Insert into NUMBERS (ID, N) Values (823, 32);
Insert into NUMBERS (ID, N) Values (824, 834);
Insert into NUMBERS (ID, N) Values (825, 677);
Insert into NUMBERS (ID, N) Values (826, 561);
Insert into NUMBERS (ID, N) Values (827, 984);
Insert into NUMBERS (ID, N) Values (828, 734);
Insert into NUMBERS (ID, N) Values (829, 602);
Insert into NUMBERS (ID, N) Values (830, 836);
Insert into NUMBERS (ID, N) Values (831, 319);
Insert into NUMBERS (ID, N) Values (832, 287);
Insert into NUMBERS (ID, N) Values (833, 597);
Insert into NUMBERS (ID, N) Values (834, 241);
Insert into NUMBERS (ID, N) Values (835, 194);
Insert into NUMBERS (ID, N) Values (836, 373);
Insert into NUMBERS (ID, N) Values (837, 231);
Insert into NUMBERS (ID, N) Values (838, 768);
Insert into NUMBERS (ID, N) Values (839, 761);
Insert into NUMBERS (ID, N) Values (840, 467);
Insert into NUMBERS (ID, N) Values (841, 385);
Insert into NUMBERS (ID, N) Values (842, 173);
Insert into NUMBERS (ID, N) Values (843, 615);
Insert into NUMBERS (ID, N) Values (844, 938);
Insert into NUMBERS (ID, N) Values (845, 889);
Insert into NUMBERS (ID, N) Values (846, 322);
Insert into NUMBERS (ID, N) Values (847, 874);
Insert into NUMBERS (ID, N) Values (848, 103);
Insert into NUMBERS (ID, N) Values (849, 46);
Insert into NUMBERS (ID, N) Values (850, 885);
Insert into NUMBERS (ID, N) Values (851, 944);
Insert into NUMBERS (ID, N) Values (852, 592);
Insert into NUMBERS (ID, N) Values (853, 33);
Insert into NUMBERS (ID, N) Values (854, 595);
Insert into NUMBERS (ID, N) Values (855, 76);
Insert into NUMBERS (ID, N) Values (856, 232);
Insert into NUMBERS (ID, N) Values (857, 649);
Insert into NUMBERS (ID, N) Values (858, 400);
Insert into NUMBERS (ID, N) Values (859, 997);
Insert into NUMBERS (ID, N) Values (860, 365);
Insert into NUMBERS (ID, N) Values (861, 290);
Insert into NUMBERS (ID, N) Values (862, 666);
Insert into NUMBERS (ID, N) Values (863, 183);
Insert into NUMBERS (ID, N) Values (864, 861);
Insert into NUMBERS (ID, N) Values (865, 65);
Insert into NUMBERS (ID, N) Values (866, 236);
Insert into NUMBERS (ID, N) Values (867, 521);
Insert into NUMBERS (ID, N) Values (868, 566);
Insert into NUMBERS (ID, N) Values (869, 772);
Insert into NUMBERS (ID, N) Values (870, 881);
Insert into NUMBERS (ID, N) Values (871, 462);
Insert into NUMBERS (ID, N) Values (872, 294);
Insert into NUMBERS (ID, N) Values (873, 339);
Insert into NUMBERS (ID, N) Values (874, 136);
Insert into NUMBERS (ID, N) Values (875, 430);
Insert into NUMBERS (ID, N) Values (876, 52);
Insert into NUMBERS (ID, N) Values (877, 348);
Insert into NUMBERS (ID, N) Values (878, 905);
Insert into NUMBERS (ID, N) Values (879, 936);
Insert into NUMBERS (ID, N) Values (880, 722);
Insert into NUMBERS (ID, N) Values (881, 446);
Insert into NUMBERS (ID, N) Values (882, 928);
Insert into NUMBERS (ID, N) Values (883, 326);
Insert into NUMBERS (ID, N) Values (884, 634);
Insert into NUMBERS (ID, N) Values (885, 431);
Insert into NUMBERS (ID, N) Values (886, 394);
Insert into NUMBERS (ID, N) Values (887, 518);
Insert into NUMBERS (ID, N) Values (888, 246);
Insert into NUMBERS (ID, N) Values (889, 640);
Insert into NUMBERS (ID, N) Values (890, 191);
Insert into NUMBERS (ID, N) Values (891, 738);
Insert into NUMBERS (ID, N) Values (892, 520);
Insert into NUMBERS (ID, N) Values (893, 434);
Insert into NUMBERS (ID, N) Values (894, 942);
Insert into NUMBERS (ID, N) Values (895, 327);
Insert into NUMBERS (ID, N) Values (896, 449);
Insert into NUMBERS (ID, N) Values (897, 408);
Insert into NUMBERS (ID, N) Values (898, 136);
Insert into NUMBERS (ID, N) Values (899, 504);
Insert into NUMBERS (ID, N) Values (900, 661);
Insert into NUMBERS (ID, N) Values (901, 203);
Insert into NUMBERS (ID, N) Values (902, 336);
Insert into NUMBERS (ID, N) Values (903, 396);
Insert into NUMBERS (ID, N) Values (904, 384);
Insert into NUMBERS (ID, N) Values (905, 21);
Insert into NUMBERS (ID, N) Values (906, 375);
Insert into NUMBERS (ID, N) Values (907, 642);
Insert into NUMBERS (ID, N) Values (908, 379);
Insert into NUMBERS (ID, N) Values (909, 499);
Insert into NUMBERS (ID, N) Values (910, 12);
Insert into NUMBERS (ID, N) Values (911, 953);
Insert into NUMBERS (ID, N) Values (912, 94);
Insert into NUMBERS (ID, N) Values (913, 329);
Insert into NUMBERS (ID, N) Values (914, 323);
Insert into NUMBERS (ID, N) Values (915, 998);
Insert into NUMBERS (ID, N) Values (916, 720);
Insert into NUMBERS (ID, N) Values (917, 60);
Insert into NUMBERS (ID, N) Values (918, 700);
Insert into NUMBERS (ID, N) Values (919, 107);
Insert into NUMBERS (ID, N) Values (920, 704);
Insert into NUMBERS (ID, N) Values (921, 426);
Insert into NUMBERS (ID, N) Values (922, 258);
Insert into NUMBERS (ID, N) Values (923, 86);
Insert into NUMBERS (ID, N) Values (924, 206);
Insert into NUMBERS (ID, N) Values (925, 823);
Insert into NUMBERS (ID, N) Values (926, 788);
Insert into NUMBERS (ID, N) Values (927, 742);
Insert into NUMBERS (ID, N) Values (928, 746);
Insert into NUMBERS (ID, N) Values (929, 271);
Insert into NUMBERS (ID, N) Values (930, 933);
Insert into NUMBERS (ID, N) Values (931, 712);
Insert into NUMBERS (ID, N) Values (932, 550);
Insert into NUMBERS (ID, N) Values (933, 240);
Insert into NUMBERS (ID, N) Values (934, 332);
Insert into NUMBERS (ID, N) Values (935, 106);
Insert into NUMBERS (ID, N) Values (936, 465);
Insert into NUMBERS (ID, N) Values (937, 303);
Insert into NUMBERS (ID, N) Values (938, 968);
Insert into NUMBERS (ID, N) Values (939, 13);
Insert into NUMBERS (ID, N) Values (940, 930);
Insert into NUMBERS (ID, N) Values (941, 405);
Insert into NUMBERS (ID, N) Values (942, 471);
Insert into NUMBERS (ID, N) Values (943, 339);
Insert into NUMBERS (ID, N) Values (944, 968);
Insert into NUMBERS (ID, N) Values (945, 513);
Insert into NUMBERS (ID, N) Values (946, 736);
Insert into NUMBERS (ID, N) Values (947, 241);
Insert into NUMBERS (ID, N) Values (948, 493);
Insert into NUMBERS (ID, N) Values (949, 642);
Insert into NUMBERS (ID, N) Values (950, 433);
Insert into NUMBERS (ID, N) Values (951, 153);
Insert into NUMBERS (ID, N) Values (952, 832);
Insert into NUMBERS (ID, N) Values (953, 393);
Insert into NUMBERS (ID, N) Values (954, 589);
Insert into NUMBERS (ID, N) Values (955, 866);
Insert into NUMBERS (ID, N) Values (956, 27);
Insert into NUMBERS (ID, N) Values (957, 124);
Insert into NUMBERS (ID, N) Values (958, 138);
Insert into NUMBERS (ID, N) Values (959, 130);
Insert into NUMBERS (ID, N) Values (960, 290);
Insert into NUMBERS (ID, N) Values (961, 308);
Insert into NUMBERS (ID, N) Values (962, 354);
Insert into NUMBERS (ID, N) Values (963, 928);
Insert into NUMBERS (ID, N) Values (964, 739);
Insert into NUMBERS (ID, N) Values (965, 343);
Insert into NUMBERS (ID, N) Values (966, 59);
Insert into NUMBERS (ID, N) Values (967, 559);
Insert into NUMBERS (ID, N) Values (968, 631);
Insert into NUMBERS (ID, N) Values (969, 290);
Insert into NUMBERS (ID, N) Values (970, 11);
Insert into NUMBERS (ID, N) Values (971, 650);
Insert into NUMBERS (ID, N) Values (972, 465);
Insert into NUMBERS (ID, N) Values (973, 171);
Insert into NUMBERS (ID, N) Values (974, 445);
Insert into NUMBERS (ID, N) Values (975, 672);
Insert into NUMBERS (ID, N) Values (976, 937);
Insert into NUMBERS (ID, N) Values (977, 993);
Insert into NUMBERS (ID, N) Values (978, 326);
Insert into NUMBERS (ID, N) Values (979, 698);
Insert into NUMBERS (ID, N) Values (980, 466);
Insert into NUMBERS (ID, N) Values (981, 221);
Insert into NUMBERS (ID, N) Values (982, 894);
Insert into NUMBERS (ID, N) Values (983, 578);
Insert into NUMBERS (ID, N) Values (984, 663);
Insert into NUMBERS (ID, N) Values (985, 522);
Insert into NUMBERS (ID, N) Values (986, 577);
Insert into NUMBERS (ID, N) Values (987, 576);
Insert into NUMBERS (ID, N) Values (988, 363);
Insert into NUMBERS (ID, N) Values (989, 470);
Insert into NUMBERS (ID, N) Values (990, 235);
Insert into NUMBERS (ID, N) Values (991, 755);
Insert into NUMBERS (ID, N) Values (992, 610);
Insert into NUMBERS (ID, N) Values (993, 321);
Insert into NUMBERS (ID, N) Values (994, 940);
Insert into NUMBERS (ID, N) Values (995, 668);
Insert into NUMBERS (ID, N) Values (996, 748);
Insert into NUMBERS (ID, N) Values (997, 528);
Insert into NUMBERS (ID, N) Values (998, 897);
Insert into NUMBERS (ID, N) Values (999, 599);
Insert into NUMBERS (ID, N) Values (1000, 802);


OK, maybe we need to work with an example that is closer to reality. But first, let me just say that I don't need combinations of 10 rows; at most, I will need 5. The best approach would be, I think, a stored procedure that takes a parameter accepting values 1-5 and generates the corresponding SQL. Most of the time, we need the least "expensive" solution.
The row count will usually be somewhere between 150 and 6000. At most, we don't expect a subset to exceed 10,000 rows.

The actual table, with the relevant fields, is:
rh_vcrt0 (vcr_emp_empresa PK, vcr_num_empregado PK, vcr_vnc_cod_venc PK, vcr_valor_bruto)
You can treat all values as numbers. vcr_emp_empresa is the company ID; it will be a fixed value for the query, passed in the @Emp variable, i.e. we always sum rows belonging to the same company.
vcr_num_empregado is the employee number. Each employee has various codes which add to or subtract from his total salary (extra hours, night shift, holidays, sick leave, etc.); that code is vcr_vnc_cod_venc.

So, after applying some of the suggestions, the fastest query I have for the 5-row combination is:
DECLARE @Emp int,@Vl real
SET @Emp=32
SET @Vl=112.41;

WITH v AS (
	SELECT vcr_emp_empresa,vcr_num_empregado,vcr_vnc_cod_venc,SUM(vcr_valor_bruto) vcr_valor_bruto
	FROM rh_vcrt0
	WHERE vcr_emp_empresa=@Emp AND vcr_valor_bruto<@Vl
	GROUP BY vcr_emp_empresa,vcr_num_empregado,vcr_vnc_cod_venc)

SELECT v1.vcr_num_empregado 'Nº1',v1.vcr_vnc_cod_venc 'Venc1',v1.vcr_valor_bruto 'Vl1',
		v2.vcr_num_empregado 'Nº2',v2.vcr_vnc_cod_venc 'Venc2',v2.vcr_valor_bruto 'Vl2',
		v3.vcr_num_empregado 'Nº3',v3.vcr_vnc_cod_venc 'Venc3',v3.vcr_valor_bruto 'Vl3',
		v4.vcr_num_empregado 'Nº4',v4.vcr_vnc_cod_venc 'Venc4',v4.vcr_valor_bruto 'Vl4',
		v5.vcr_num_empregado 'Nº5',v5.vcr_vnc_cod_venc 'Venc5',v5.vcr_valor_bruto 'Vl5'
FROM v v1
INNER JOIN v v2
ON (v1.vcr_num_empregado>v2.vcr_num_empregado
		OR (v1.vcr_num_empregado=v2.vcr_num_empregado AND v1.vcr_vnc_cod_venc>v2.vcr_vnc_cod_venc))
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto<@Vl
INNER JOIN v v3
ON (v2.vcr_num_empregado>v3.vcr_num_empregado
		OR (v2.vcr_num_empregado=v3.vcr_num_empregado AND v2.vcr_vnc_cod_venc>v3.vcr_vnc_cod_venc))
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto<@Vl
INNER JOIN v v4
ON (v3.vcr_num_empregado>v4.vcr_num_empregado
		OR (v3.vcr_num_empregado=v4.vcr_num_empregado AND v3.vcr_vnc_cod_venc>v4.vcr_vnc_cod_venc))
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto<@Vl
INNER JOIN v v5
ON (v4.vcr_num_empregado>v5.vcr_num_empregado
		OR (v4.vcr_num_empregado=v5.vcr_num_empregado AND v4.vcr_vnc_cod_venc>v5.vcr_vnc_cod_venc))
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto+v5.vcr_valor_bruto=@Vl
ORDER BY 1,2,4,5,7,8,10,11,13,14

Open in new window

(The WITH part that SUMs the table is there because there can be rows that share the keys shown here but differ on other key fields which aren't used in this case. I also take the opportunity to filter out some undesired rows right away.)

I have attached some sample data with the largest subset at the moment (about 5.5k rows).
I also attached the execution plan (had to rename to .sql because of EE upload rules. Just rename it back to .sqlplan).

With the smaller subset of data (about 150 rows) and a 5 rows approach, it took about 7 seconds and returned one row.
When I try a 3 rows approach (you can easily infer it from the larger one) on the larger subset, it quickly escalates (I cancelled it after 15 minutes, with 4655 rows returned).

Trying a 4 rows approach on the larger subset would be impossible (at the very least, it would take over 30 minutes, though the growth should really be geometric rather than arithmetic).

Also, it should be noted that I'm using a relatively small number for the sum. If I use, for example, 1500, all the figures would escalate, as the consecutive joins would filter fewer rows.

ScottPletcher, I don't really understand what you mean by a selective join. The two value table, if any, would have only the values that do add up to the desired value.

I sort of understand the LEFT JOIN approach, which would give me all the results at once, but I'm not sure how to code it nor if it will be efficient. It seems to me, intuitively, that it will be slower than the INNER JOIN solution.

If I could get a decent solution for the 5 rows sum, I could then devise a stored procedure which runs the 1-row solution and checks for values; if none, runs the 2-row solution and checks for values; if none, runs the 3-row solution, etc...
But this would require that running all 5 queries on the larger subset takes no longer than, at most, 5-10 minutes. Anything slower wouldn't be practical. I could make do with a solution that runs in the desired time for the smaller subsets (most commonly something between 300 and 800 rows), but that would just be a smaller consolation.
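For illustration, a minimal sketch of that escalating procedure (the procedure name and result-set shape are my own invention; the grouping CTE from the queries above is left out to keep it short, and only levels 1 and 2 are written out):

CREATE PROCEDURE dbo.usp_FindSumCombos   -- hypothetical name
	@Emp int,
	@Vl  real
AS
BEGIN
	SET NOCOUNT ON;

	-- Level 1: single rows that hit the target exactly
	SELECT vcr_num_empregado 'Nº1', vcr_vnc_cod_venc 'Venc1', vcr_valor_bruto 'Vl1'
	FROM rh_vcrt0
	WHERE vcr_emp_empresa = @Emp
	  AND vcr_valor_bruto = @Vl;

	IF @@ROWCOUNT > 0 RETURN;   -- stop at the cheapest level that produced matches

	-- Level 2: pairs, using the same ordered-keys trick to avoid mirrored duplicates
	SELECT v1.vcr_num_empregado 'Nº1', v1.vcr_vnc_cod_venc 'Venc1', v1.vcr_valor_bruto 'Vl1',
	       v2.vcr_num_empregado 'Nº2', v2.vcr_vnc_cod_venc 'Venc2', v2.vcr_valor_bruto 'Vl2'
	FROM rh_vcrt0 v1
	INNER JOIN rh_vcrt0 v2
	ON v2.vcr_emp_empresa = v1.vcr_emp_empresa
		AND (v1.vcr_num_empregado > v2.vcr_num_empregado
			OR (v1.vcr_num_empregado = v2.vcr_num_empregado AND v1.vcr_vnc_cod_venc > v2.vcr_vnc_cod_venc))
	WHERE v1.vcr_emp_empresa = @Emp
	  AND v1.vcr_valor_bruto + v2.vcr_valor_bruto = @Vl;

	IF @@ROWCOUNT > 0 RETURN;

	-- Levels 3, 4 and 5 follow the same pattern, each reached only when
	-- every cheaper level came back empty.
END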
Sample.txt
query.sql
>> The two value table, if any, would have only the values that do add up to the desired value. <<


No.  That's what I'm saying -- keep ALL results that are less than the total desired, because those intermediate results can be used to help generate later results.
But isn't that what I did when I added the condition in the joins for v1.vcr_valor_bruto+v2.vcr_valor_bruto<@Vl (and subsequent)?

Or do you simply want to change the INNER JOINs to LEFT JOINs and build from there?

I don't think I'm understanding your point. Can you please provide some sort of example code or pseudocode so I can grasp it?
the join conditions above do keep the results that are less than the total desired.
that's not a matter of efficiency; it's necessary for it to work at all


That same condition not only allows you to find solutions that are possible;
more importantly, it eliminates joins where the sum is already too big.

That's where the speed comes from, because then you can let the math work for you rather than against you.

Of course, for small values and large sums the possible combinations could still grow unmanageably big but the asker seems to be suggesting that the relative size of value vs sum will make that unlikely.
>>> do you simply want to change the INNER JOINs to LEFT JOINs and build from there?

the difference here is INNER JOINS will only find 5 row sums (in your example of 5 joins)

the outer joins I showed above find 1, 2, 3, 4 and 5 row sums.

if you parameterize your procedure and want to search for exactly one particular level, then INNER is the way to go.   Same idea as what I already posted and you've used above.

You'll simply have 5 different queries in your procedure depending on the number of joins/rows you need to examine.
I didn't anticipate NEGATIVE numbers in the input, since there were none in the original q.

That VASTLY complicates things.  You won't be able to gain nearly as much efficiency; you'll have to re-scan all negative values.

For only positive numbers, we have a decent chance of using some binary search logic to reduce future search requirements.

As for that specific sample data and search value, for 112.41 in the ~5500 rows, I found 94 separate matches (!) on the second level.  That took ~20-24 seconds on my server; naturally, your time may vary.



USE tempdb
GO

--code assumes that tempdb.dbo.rh_vcrt0 already exists, loaded with the test data

DECLARE @value_to_find decimal(9, 2)
SET @value_to_find = 112.41;

IF OBJECT_ID('tempdb..orig_values') IS NOT NULL
    DROP TABLE orig_values
IF OBJECT_ID('tempdb..result2') IS NOT NULL
    DROP TABLE result2

-- orig values < @value_to_find
CREATE TABLE orig_values (
    orig_ident int IDENTITY(1, 1) NOT NULL,
    rh_ident int NOT NULL,
    vcr_valor_bruto real NOT NULL
    )
CREATE UNIQUE CLUSTERED INDEX orig_values__CL ON orig_values ( orig_ident )

CREATE TABLE result2 (
    result2_ident int IDENTITY(1, 1) NOT NULL,
    orig_ident1 int NOT NULL,
    orig_ident2 int NOT NULL,
    CHECK(orig_ident1 < orig_ident2),
    result2 real NOT NULL
    )
CREATE UNIQUE CLUSTERED INDEX result2__CL ON result2 ( result2_ident )
--CREATE UNIQUE CLUSTERED INDEX result2__CL ON result2 ( result2, result2_ident )

DECLARE @orig_count int
DECLARE @result2_count int
DECLARE @orig_ident int
DECLARE @orig_ident2 int
DECLARE @orig_ident3 int
DECLARE @orig_ident4 int
DECLARE @orig_ident5 int
DECLARE @rh_ident int
DECLARE @rh_ident2 int
DECLARE @rh_ident3 int
DECLARE @rh_ident4 int
DECLARE @rh_ident5 int
DECLARE @final_msg varchar(1000)

/*
IF EXISTS(SELECT 1 FROM rh_vcrt0 WHERE vcr_valor_bruto = @value_to_find)
BEGIN
    SET @final_msg = 'Value to search for is in the original input, no need to join rows!'
    GOTO ExitCode
END --IF
*/

INSERT INTO orig_values
SELECT
    rh_ident, vcr_valor_bruto
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
    --vcr_valor_bruto <= @value_to_find
ORDER BY
    vcr_valor_bruto
SET @orig_count = @@ROWCOUNT

PRINT '@orig_count = ' + CAST(@orig_count AS varchar(10))

/*
-- can't use this because of NEGATIVE numbers in the input data!!
SELECT @orig_ident = orig_ident
FROM orig_values
WHERE
    orig_ident = @orig_count AND
    vcr_valor_bruto = @value_to_find

IF @orig_ident IS NOT NULL
    GOTO ExitCode
*/

INSERT INTO result2
SELECT ov1.orig_ident, ov2.orig_ident, ov1.vcr_valor_bruto + ov2.vcr_valor_bruto AS result2
FROM orig_values ov1
INNER JOIN orig_values ov2 ON
    ov2.orig_ident > ov1.orig_ident AND
    ov1.vcr_valor_bruto + ov2.vcr_valor_bruto <= @value_to_find
ORDER BY
    ov1.vcr_valor_bruto + ov2.vcr_valor_bruto --result2
SET @result2_count = @@ROWCOUNT

PRINT '@result2_count = ' + CAST(@result2_count AS varchar(10))

SELECT TOP 1
    @orig_ident = orig_ident1, @orig_ident2 = orig_ident2
FROM result2
WHERE
    result2_ident = @result2_count AND
    result2 = @value_to_find

IF @orig_ident IS NOT NULL
    GOTO ExitCode

ExitCode:
IF @orig_ident IS NOT NULL
BEGIN
    SELECT 'Match(es) found', rh.* 
    FROM orig_values ov
    INNER JOIN rh_vcrt0 rh ON
        rh.rh_ident = ov.rh_ident
    WHERE ov.orig_ident IN ( @orig_ident, @orig_ident2, @orig_ident3, @orig_ident4, @orig_ident5 )
    /*
    SELECT COUNT(*)
    FROM result2
    WHERE
        result2 = @value_to_find
    */
END --IF
PRINT @final_msg

Open in new window

using the sample data provided I found 111 combinations of 2-row sums  (5 seconds)

I get 111 rows using the code in http:#a38513641

or using a small variant on my original query.  Compensating for negatives was trivial. Simply generate an id sorted by the values. The negatives will sort to the beginning and allow the same < @vl  logic I originally posted.

So, for 2 rows it looks like this...  (4 seconds)

DECLARE @Emp int,@Vl real
SET @Emp=32
SET @Vl=112.41;

WITH v
     AS (SELECT vcr_emp_empresa,
                vcr_num_empregado,
                vcr_vnc_cod_venc,
                SUM(vcr_valor_bruto) vcr_valor_bruto
           FROM rh_vcrt0
          WHERE vcr_emp_empresa = @emp AND vcr_valor_bruto < @vl
         GROUP BY vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc),
     numbers
     AS (SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
                vcr_valor_bruto,
                vcr_valor_bruto n,
                vcr_emp_empresa,
                vcr_num_empregado,
                vcr_vnc_cod_venc
           FROM v)
SELECT v1.vcr_num_empregado "Nº1",
       v1.vcr_vnc_cod_venc "Venc1",
       v1.vcr_valor_bruto "Vl1",
       v2.vcr_num_empregado "Nº2",
       v2.vcr_vnc_cod_venc "Venc2",
       v2.vcr_valor_bruto "Vl2"
  FROM numbers v1
       INNER JOIN numbers v2 ON v1.id > v2.id AND v1.n + v2.n  = @vl

Open in new window



I don't see an rh_ident column in the sample data, so the code in http:#a38516277 didn't run for me until I created a new table and generated ids for it. Not a problem; I basically did the same thing I used for my own query, but maybe my test was skewed.

The code in http:#a38516277 ran in 1:48 (108 seconds) and only printed 1 combination.
I'm willing to believe I messed up ScottPletcher's test, but I don't see how


This is the table I used for my tests and the asker's test code

CREATE TABLE [dbo].[rh_vcrt0](
      [vcr_emp_empresa] [int] ,
      [vcr_num_empregado] [int] ,
      [vcr_vnc_cod_venc] [int] ,
      [vcr_valor_bruto] money
);

I simply ran the inserts and then code above.


To test ScottPletcher's code I renamed the table above to rh_vcrt1
then created this table...

CREATE TABLE [dbo].[rh_vcrt0](
      rh_ident [int] ,
   [vcr_emp_empresa] [int] ,
      [vcr_num_empregado] [int] ,
      [vcr_vnc_cod_venc] [int] ,
      [vcr_valor_bruto] money
);

and then populated it with this query

insert into [rh_vcrt0]
     SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto),                    
                vcr_emp_empresa,
                vcr_num_empregado,
                vcr_vnc_cod_venc, vcr_valor_bruto
           FROM [rh_vcrt1];

If there is a flaw in any of the test setup, I apologize and welcome correction.
I didn't realize negative numbers would interfere with this. I just took them for granted. My apologies for this omission.
However, your comments and the commented part of the code made me realize that I actually can't filter the values beforehand.

Running sdstuber's code with this removed (AND vcr_valor_bruto < @vl) took 17 seconds and returned 132 rows. With it, it was 7 seconds and 116 rows, so it's a relevant issue.

I tried ScottPletcher's code, but couldn't quite get it to work. It seems it ignored the fact that the table has two keys, not just one. It started using rh_ident, then later uses orig_ident. I might be missing something, but when I tried changing this into both keys, I got that "An explicit value for the identity column in table 'orig_values' can only be specified when a column list is used and IDENTITY_INSERT is ON."

Either way, it seems that I can't get a workable solution that will return 5 rows combinations in less than 5 minutes. Adapting sdstuber's code to 3 rows will run for more than 10 minutes. This seems to imply that there isn't any chance that a workable 5 rows solution will be forthcoming. Not even for the smaller subset of data. I tried:
DECLARE @Emp int,@Vl real
SET @Emp=101
SET @Vl=112.41;

WITH v
	AS (SELECT vcr_emp_empresa,
				vcr_num_empregado,
				vcr_vnc_cod_venc,
				SUM(vcr_valor_bruto) vcr_valor_bruto
		FROM rh_vcrt0
		WHERE vcr_emp_empresa=@Emp
		GROUP BY vcr_emp_empresa,
				vcr_num_empregado,
				vcr_vnc_cod_venc),
	numbers
	AS (SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
				vcr_valor_bruto,
				vcr_valor_bruto n,
				vcr_emp_empresa,
				vcr_num_empregado,
				vcr_vnc_cod_venc
		FROM v)

SELECT v1.vcr_num_empregado "Nº1",
		v1.vcr_vnc_cod_venc "Venc1",
		v1.vcr_valor_bruto "Vl1",
		v2.vcr_num_empregado "Nº2",
		v2.vcr_vnc_cod_venc "Venc2",
		v2.vcr_valor_bruto "Vl2",
		v3.vcr_num_empregado "Nº3",
		v3.vcr_vnc_cod_venc "Venc3",
		v3.vcr_valor_bruto "Vl3",
		v4.vcr_num_empregado "Nº4",
		v4.vcr_vnc_cod_venc "Venc4",
		v4.vcr_valor_bruto "Vl4",
		v5.vcr_num_empregado "Nº5",
		v5.vcr_vnc_cod_venc "Venc5",
		v5.vcr_valor_bruto "Vl5"
FROM numbers v1
INNER JOIN numbers v2
ON v1.id>v2.id
INNER JOIN numbers v3
ON v2.id>v3.id
INNER JOIN numbers v4
ON v3.id>v4.id
INNER JOIN numbers v5
ON v4.id>v5.id
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto+v5.vcr_valor_bruto=@Vl

Open in new window

I cancelled it after 4 minutes, with 108 rows returned.

If that is the case, that no feasible solution exists, I will award points to both of you anyway, because you have both helped me improve my original query, not only making it feasible to actually use a 2-row solution on the larger subset, but also allowing me to get 4-row solutions on the smaller subset (45 seconds, 15 rows).
>>> This seems to imply that there isn't any chance that a workable 5 rows solution will be forthcoming.


yes, like I said originally, you're fighting math.  The only way any solution (regardless of method) could possibly be fast is with data that implicitly prevents most of the combinatoric explosion.

ScottPletcher's method is essentially (despite protests that it isn't) the same thing presented in SQL.  I'm not really sure how any method could be significantly different.

If you need to check all combinations, then you need to check all combinations. There really aren't any shortcuts to "do all"
>> It seems it ignored the fact that the table has two keys, not just one <<

I inserted the values into another table so that (1) the values could be sorted and (2) there would only have to be one key.  The new ident I use can be used to look up the original table row and all its columns, as I did upon exit from my code.

My method should have worked exactly as coded for any rows loaded into the rh_vcrt0 table.  I didn't go past the second level because, with negative values present, you don't gain anything.

Yes, my method will be slower when 2 values generates a match, but the stored intermediate results will save time when further iterations are needed (vastly more so when the numbers are all POSITIVE).

With only positive values, it's clear you could potentially save time by -- as I stated earlier -- using a binary search to restrict the number of rows you have to iterate over.


>> Compensating for negatives was trivial. <<

Only if you can't see any potential but plow-straight-ahead, no thought about maximizing efficiency.  With that limited vision, no improvements would be possible for this (or much of anything else).
For the three-level total, we can selectively join the 2-level table to the original input values.

For the four-level total, we can selectively join the 2-level table I created to itself -- I would expect that to be noticeably faster than a four-table join ... which was one of the points of creating an intermediate table.

With all POSITIVE values, we could do binary search of the ALREADY SORTED (in BOTH tables) values first to reduce the values that had to be joined ... but with negative numbers present, this can't really be done.
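If I'm picturing it right, and with positive-only values as noted, the selective joins would look something like this sketch (it reuses the orig_values / result2 tables from the script above, with result2 holding every pair whose sum is still <= the target):

DECLARE @value_to_find decimal(9, 2)
SET @value_to_find = 112.41

-- 3-level: each stored pair extended by one more original value
SELECT r.orig_ident1 AS id1, r.orig_ident2 AS id2, ov.orig_ident AS id3
FROM result2 r
INNER JOIN orig_values ov
    ON ov.orig_ident > r.orig_ident2                  -- keeps the three ids distinct, each triple produced once
   AND r.result2 + ov.vcr_valor_bruto = @value_to_find

-- 4-level: stored pairs joined to stored pairs
SELECT r1.orig_ident1 AS id1, r1.orig_ident2 AS id2, r2.orig_ident1 AS id3, r2.orig_ident2 AS id4
FROM result2 r1
INNER JOIN result2 r2
    ON r2.orig_ident1 > r1.orig_ident2                -- disjoint pairs, each 4-row set produced once
   AND r1.result2 + r2.result2 = @value_to_find

The ids can then be joined back to orig_values / rh_vcrt0 to show the actual rows, as the exit code in the script above already does.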
>>Only if you can't see any potential but plow-straight-ahead, no thought about maximizing efficiency.

Not sure where you're going with that.  The sorting to allow for early filtering (i.e. efficiency) is precisely why I'm doing it.  And, based on your code, looks like you're following the same idea.

Given that our approaches are effectively identical, I'm not sure where the "no thought" or "limited vision" comes in unless you are suggesting you have some other method than what you have proposed above.

If so,  please illuminate us.  The asker in particular appears to be willing to "settle" for what has currently been presented.  If you have something better, please post.
ScottPletcher, when I try your code as is, I get the following error:

Msg 207, Level 16, State 1, Line 54
Invalid column name 'rh_ident'.
Msg 207, Level 16, State 1, Line 105
Invalid column name 'rh_ident'.


Also, if I interpret it correctly, your code is matching all rows regardless of vcr_emp_empresa. Although I only provided data for vcr_emp_empresa=32, which is the largest subset, there are plenty of other values. All combinations have to be restricted to the same vcr_emp_empresa, which I did on my code using @Emp. Combinations with different vcr_emp_empresa are not desired.
>>  The sorting to allow for early filtering (i.e. efficiency) is precisely why I'm doing it.  <<

Where did you sort the values? And in a way that SQL knows they are sorted?
>>> Where did you sort the values? And so that SQL knows they are sorted?

when I create the "numbers" cte I create an ID which is sorted by the values

ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id


this ensures my "candidate" combinations look at negatives first  then apply positives afterward.
the early filtering will still apply because adding negatives will still produce sums less than the target.   It will work even in cases of zero value too.

If, however, the values all happen to be positive, the ordering still works and the early filtering will become more efficient ~for free~ simply because the number of candidate combinations that might work will shrink.  Thus making the amount of work in later checks less.

The idea being:  check for combination sums that "might" work.  If they do, keep them,  if they don't drop them.  In the next level, check if adding another value is still valid, if so, keep that combination, if not, drop it. and so on.  Each additional level (join) will either become closer to a solution or eliminate a particular combination from further consideration of additional sums.
>> Also, if I interpret it correctly, your code is matching all rows regardless of vcr_emp_empresa. <<

That would be implemented when doing the initial INSERT into the "orig_values" table.  I left it off just for convenience during testing, as that is trivial.



>> ScottPletcher, when I try your code as is, I get the following error: <<

You must be running a subset of the code, not the entire thing.



Btw, yes, I did actually output only one matching result.  I thought you only needed one match and that any match was as good as another.
INSERT INTO orig_values
SELECT
    rh_ident, vcr_valor_bruto
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
    --vcr_valor_bruto <= @value_to_find
WHERE  --<<--
    vcr_emp_empresa = 32  --<<--
ORDER BY
    vcr_valor_bruto
sorry, just remembered I never addressed this comment...


Running sdstuber's code with this removed (AND vcr_valor_bruto < @vl) took 17 seconds and returned 132 rows. With it, it was 7 seconds and 116 rows, so it's a relevant issue.


With negatives in the list,  you can't include the initial filter
AND vcr_valor_bruto < @vl        (ScottPletcher mentioned this above too)

sorry, I should have taken that out myself.

it works for positives only, but with a mix it doesn't.

simple example

target=6

input 1 = -1
input 2 = 7

if I apply the filter "AND vcr_valor_bruto < 6", it will eliminate a valid combination (the 7 gets filtered out, even though -1 + 7 = 6)
>> You must be running a subset of the code, not the entire thing.

I clicked "Select All" from your code box, pasted on SSMS and only changed the db name from tempdb to rhxxi on the top and on both drop checks. I don't know if I need to create any other table before I run this code, though.

>>WHERE  --<<--
    vcr_emp_empresa = 32  --<<--

Yeah, I had figured out it would be there. :)
on a related note to the above, I am also assuming your target sum will always be a positive number.  ScottPletcher's code above has the same limitation.
Actually, though it would be rare, there could be occasions where we would search for negative numbers.
And I do mean, very very rare. So rare that, if it doesn't work for negative sums, it won't matter.
I don't understand the current challenge I guess.

I thought if you found a 2nd level-match, you quit searching; from the original q:
"
Usually, when such a need arises, we need to see if any one row has that value. That is basic SQL and I won't even bother with that. Next, we move to two values.
"

The 2-level logic runs so quickly on the original data that you can directly test that.

The example you gave has a lot of 2-level matches.
At any rate, I think the negative numbers prevent nearly all optimizations for higher levels.  With negatives in there, you almost really do have to go thru every permutation at every level.

[A genuine mathematician might be able to speed things up for you, but I'm not one so I can't say for sure.]


Btw, you definitely should NOT be using data type "real" for this, since real is an approximate value only.  

You should use decimal, which provides an exact data value, and thus an exact, and accurate, match.
What we do is search for the least "expensive" solution. We search for 1 row results, then move on to 2 row results, etc, until, after 5 row results, we try different approaches (or so we wish). However, two things should be noted:
1- This case has a lot of 2 level matches, but there are many cases where there aren't any.
2- Even when there are 2 level matches, we sometimes need to delve further.

Basically, we are trying to find out why something is amiss, out of some grouped grand totals that are calculated by the application. We sometimes have low level matches but which we find out aren't relevant to the issue, so we need to keep digging.

We had set up a 5 row sum as the limit, since we thought that could be done in a reasonable amount of time. This will force us to re-think our limit, and maybe how we can debug these situations, but the fact stands that, even when there are lower level matches, we sometimes need higher level matches as well.

Which is why I was sort of emphasizing a working 5-row solution. If we have the final one, we can infer the lower levels from it and resize accordingly. However, as it doesn't seem possible to create one with 5 rows, I guess we'll have to settle for a working 2nd-level solution for the larger subsets and a 4-level solution for the smaller ones. We'll have to try to think of other approaches to this. Maybe a desktop app in VB.Net or even plain C would be faster.
>> Btw, you definitely should NOT be using data type "real" for this, since real is an approximate value only.  

>> You should use decimal, which provides an exact data value, and thus an exact, and accurate, match.

Actually, that was my bad. The DB was created by another company and I thought it was real. Turns out it was decimal(14,2). That will actually force me to change lots of code unrelated to this question, but which should, nevertheless, be optimized.
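For reference, with the column really being decimal(14,2), the variable declarations in the queries above would then use the exact type as well:

DECLARE @Emp int, @Vl decimal(14,2)   -- matches the vcr_valor_bruto column exactly
SET @Emp = 32
SET @Vl = 112.41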
>>> Maybe a desktop based app in VB.Net or even plain C would be faster.

it's not the database that is slow.  it's the combinatorics.

I'm not saying you can't possibly write a C program that would be faster, but it will be a marginal difference against the fundamental combinatoric issue.


you can try tackling the problem by running multiple threads, each examining a different set of values.  That will consume more total cpu resources but may find solutions in faster "wall clock" time.
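As a sketch of that split in T-SQL (my own illustration: each session runs the same query over a disjoint slice of the outermost ids, and the outputs are simply unioned afterwards; the ROW_NUMBER ordering gets a tiebreaker so every session numbers the rows identically):

DECLARE @Emp int,@Vl real,@IdFrom int,@IdTo int
SET @Emp=32
SET @Vl=112.41
SET @IdFrom=1      -- this session's slice of the outermost table
SET @IdTo=1000;

WITH v AS (
	SELECT vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc,
	       SUM(vcr_valor_bruto) vcr_valor_bruto
	FROM rh_vcrt0
	WHERE vcr_emp_empresa = @Emp
	GROUP BY vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc),
numbers AS (
	-- deterministic numbering: ties on the value are broken by the keys
	SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto, vcr_num_empregado, vcr_vnc_cod_venc) id,
	       vcr_num_empregado, vcr_vnc_cod_venc, vcr_valor_bruto
	FROM v)
SELECT v1.vcr_num_empregado 'Nº1', v1.vcr_vnc_cod_venc 'Venc1', v1.vcr_valor_bruto 'Vl1',
       v2.vcr_num_empregado 'Nº2', v2.vcr_vnc_cod_venc 'Venc2', v2.vcr_valor_bruto 'Vl2',
       v3.vcr_num_empregado 'Nº3', v3.vcr_vnc_cod_venc 'Venc3', v3.vcr_valor_bruto 'Vl3'
FROM numbers v1
INNER JOIN numbers v2
ON v1.id < v2.id
	AND v1.vcr_valor_bruto + v2.vcr_valor_bruto < @Vl
INNER JOIN numbers v3
ON v2.id < v3.id
	AND v1.vcr_valor_bruto + v2.vcr_valor_bruto + v3.vcr_valor_bruto = @Vl
WHERE v1.id BETWEEN @IdFrom AND @IdTo    -- only the outermost table is restricted to the slice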
When I say combinatorics is the issue, maybe showing the math will help explain why.

Assuming 5000 rows you have the following number of combinations to examine at each level.

5,000
25,000,000
125,000,000,000
625,000,000,000,000
3,125,000,000,000,000,000

Hopefully, with over 3 quintillion possible combinations, that makes it clear why the algorithm or language used for the analysis isn't particularly important.  There are simply too many possible combinations.

The ONLY way this can be solved to 5 levels in any kind of reasonable time is IF AND ONLY IF the number of valid combinations at each level is small.  Due to exponential growth, it must be a very, very small number of possibilities; otherwise you'll still get unmanageable growth.

Including negatives means you are unlikely to ever get that level of filtering at any level.
While that is an issue, you're overstating the problem by at least double.  He doesn't need duplicate combinations, and you don't join a value to itself.

So the 2nd level is really more like:
12,497,500.
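To put exact figures on it, counting unordered, distinct-row combinations for $n = 5000$ and ignoring any value-based filtering:

$$\binom{5000}{2} = \frac{5000 \cdot 4999}{2} = 12{,}497{,}500, \qquad \binom{5000}{5} = \frac{5000 \cdot 4999 \cdot 4998 \cdot 4997 \cdot 4996}{5!} \approx 2.6 \times 10^{16},$$

versus $5000^5 \approx 3.1 \times 10^{18}$ for the naive count: smaller, but both are far beyond anything that can be enumerated interactively.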
yes,  I didn't specify any of the filters we've already identified.
plus, we already know it's not exactly 5000 rows.  
and I don't know how much the filtering will really reduce it.

I wasn't trying to be precise, there are too many unknowns to make that possible.
I'm trying to indicate scale.

But, your comment helps to illustrate the problem.

"at least double" makes it sound like I'm way off so maybe it's not as bad as I'm suggesting.

but I say "double" is completely inconsequential here.  The magnitudes are so large as to make a factor of two laughable.


Let's say my simplification is off by a factor of  100000000 (one hundred million) rather than 2.

That means there are many billions of possible combinations, which still makes the problem intractable for most systems and certainly for real-time, user interactive computation.

Again, the math makes the algorithm inconsequential.  If the data allows for significant combinations at each level then you're fighting exponentiation.
You WERE way off.  Off by a MINIMUM of a factor of 2 is not a trivial amount.  That is only at the 1st level.  That means by the 5th level it is off by at least 32x.


Again, it's the negative values that really prevent any kind of useful optimization for higher levels.  Otherwise you'd have a good chance of drastically reducing the number of computations required at upper levels.
>>> You WERE way off.  Off by a MINIMUM of a factor of 2 is not a trivial amount.

You say 2 is a lot, I say 2 is laughably, trivially inconsequential.    Sure, 1.65 quintillion is a lot smaller than 3.1 quintillion.  But who cares?    Being off by double makes no practical difference at all.  Could you actually examine a quintillion different combinations?  If not, then it's effectively no difference at all to an implementable solution.

I'm not sure why you highlighted the word minimum. I already addressed the consequences of making the error much, much larger.  Since I already said I was intentionally being imprecise, because it simply doesn't matter except under special circumstances, why bother?

If your goal is simply to get me to concede that there are times when it can be better than double, I already did; but I'll say so again.  In fact, I'll concede the factor could be even higher than 100 million.  It could be off by quadrillions; in fact, it could be 100% wrong.  There might be 0 possible combinations.  But who cares?  The point was not to say it will always be "X" many combinations.  The point was the problem-space is potentially big, very big.  So big that even reductions that might "feel" like a lot (half, a tenth, a millionth) are still not enough to make a practical dent in it.

As far as I can tell, we're in agreement as to the intractability of this problem.     I'll try again to explain in a way that is hopefully more agreeable.


>>>  If the data allows for significant combinations at each level then you're fighting exponentiation.

I've tried saying it in different ways but that line is the key to everything I've said above, both as to how/when it can be solved and when/why it can't be solved.

The reason a positive-only set of data helps is because it can drastically reduce the number of legal combinations at each level.  This is still not a guarantee, but certainly better than when negatives are included.  Positives only can still produce an unwieldy number of combinations per level.

The reason tiny data sets work is even the worst case scenario doesn't have enough data for exponentiation to explode.

The reason small targets with large values work is, (even with a few negatives or relatively small absolute values)  you can still filter a significant number of combinations early enough.


Large totals and small values, or lots of negatives or large negatives don't work because the number of combinations at each level is too many so you'll "effectively" get exponential growth.  As ScottPletcher pointed out, it's not "true" exponentiation, but it can easily be large enough to not matter in a practical sense.

My use of the imprecise terms "large", "small" and "tiny" will likely elicit more complaints but we simply don't have enough information to say how many is a "large amount" and how many is a "small amount".   It's all relative.  You could have a million rows, but if they are all positive values and your target sum is 0.12  then you'll be able to filter out "many" of the candidates at each level, thus exponentiation should not be a problem.

Reduce the number of rows, increase the target sum - some will be solvable, some won't.  It's a balance in the data between growth and shrinking.

Add negatives,  few/many,  small/large absolute values.  The combinations will increase, but how much is, again, a relative matter based on the other values and the target sum.

Given a specific set of data constraints it's possible to calculate more precisely how many rows you might expect in the 5-level sums; but we don't have those, so I'm reduced to saying it "could" be large, thus making it a problem with an unreliable solution.

The possible combinations could be anywhere in the quintillions down to zero.  So, a factor of 2, a factor of a billion, a factor of a trillion - whatever it is.  The algorithm used to find them does not, in any way, change the number of results.

I hope that helps
Just a recap of what you've said before.  Still false, but a nice recap.

You can't seem to grasp the concept of a binary search on values, so we'll never agree on this.

As stated, though, negative values render much of the possible efficiencies moot anyway.

There might be other mathematical tricks that could speed this up (similar to the algorithms used to speed up finding the digits of pi, for example), but even if they exist, I don't have the pure mathematics knowledge to apply them to something like this.
If you want other options to make better filters to "fight" expansion at each level, try looking up the minimum and maximum values and only pursue further levels if the partial sum could still reach the target within the known data range.


For example:  target 100,  after 3 levels I have a partial sum of 200

If the minimum value is -25,  then I know I don't have to search any further because even adding the minimum twice (whether it exists twice or not isn't important)  I can't reach the sum, so I don't need to look any further.
>>> Still false, but a nice recap.

Thanks, but that's not helpful.
Please correct any place I made a mistake.  Don't let my error become part of the PAQ.


>>>  You can't seem to grasp the concept of a binary search on values, so we'll never agree on this.

I understand a binary search - I do concede that I don't know how you're trying to apply it here.
Please illustrate.

If you have code that is better than what is above, please post it,  or at least pseudocode and I'll write it myself.  Preferably using the asker's existing table structure so as to reduce ambiguity and errors in translation.
@ScottPletcher: I have pinpointed where your code goes wrong. You use this part:
INSERT INTO orig_values
SELECT
    rh_ident, vcr_valor_bruto
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
    --vcr_valor_bruto <= @value_to_find
ORDER BY
    vcr_valor_bruto
SET @orig_count = @@ROWCOUNT

Open in new window

to insert values in the orig_values table, but rh_vcrt0 doesn't have any rh_ident field. I'm not sure how you're planning to make this work though, so I don't think I can correct it myself.
you can test his code with the changes I made above.

or, if you don't want to change your own objects.
Do a search and replace on his code to replace rh_vcrt0 with ScottPletcher (or another name)

then create a new ScottPletcher table and populate as shown in http:#a38516676
@sdstuber: I adapted your code to do filtering, which improved the speed quite a lot. Basically, I reversed the sign for the joins and added the check again:
DECLARE @Emp int,@Vl real
SET @Emp=101
SET @Vl=112.41;

WITH v
	AS (SELECT vcr_emp_empresa,
				vcr_num_empregado,
				vcr_vnc_cod_venc,
				SUM(vcr_valor_bruto) vcr_valor_bruto
		FROM rh_vcrt0
		WHERE vcr_emp_empresa=@Emp
		GROUP BY vcr_emp_empresa,
				vcr_num_empregado,
				vcr_vnc_cod_venc),
	numbers
	AS (SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
				vcr_valor_bruto,
				vcr_valor_bruto n,
				vcr_emp_empresa,
				vcr_num_empregado,
				vcr_vnc_cod_venc
		FROM v)

SELECT v1.vcr_num_empregado "Nº1",
		v1.vcr_vnc_cod_venc "Venc1",
		v1.vcr_valor_bruto "Vl1",
		v2.vcr_num_empregado "Nº2",
		v2.vcr_vnc_cod_venc "Venc2",
		v2.vcr_valor_bruto "Vl2",
		v3.vcr_num_empregado "Nº3",
		v3.vcr_vnc_cod_venc "Venc3",
		v3.vcr_valor_bruto "Vl3",
		v4.vcr_num_empregado "Nº4",
		v4.vcr_vnc_cod_venc "Venc4",
		v4.vcr_valor_bruto "Vl4",
		v5.vcr_num_empregado "Nº5",
		v5.vcr_vnc_cod_venc "Venc5",
		v5.vcr_valor_bruto "Vl5"
FROM numbers v1
INNER JOIN numbers v2
ON v1.id<v2.id
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto<@Vl
INNER JOIN numbers v3
ON v2.id<v3.id
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto<@Vl
INNER JOIN numbers v4
ON v3.id<v4.id
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto<@Vl
INNER JOIN numbers v5
ON v4.id<v5.id
	AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto+v5.vcr_valor_bruto=@Vl

Open in new window


Basically, we can filter it this way because the ids are ordered by value: if the next value is a negative number, then all previous ones were as well. Basically, v3 will be a larger number than v2, so if v1+v2 are larger than our sum, this is already an undesired value (every remaining candidate is at least as large and, with a positive target, must be positive, so the running sum can only move further past it). And if v3 is a negative number, then v1+v2 is also negative, so it stays below the target and never gets discarded by the filter. Thus the filter is safe, since we already discarded negative target sums as not important.
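Spelling the same argument out a bit more formally (my wording, assuming a positive target): let the rows chosen so far, taken in id order, have values $v_{i_1} \le v_{i_2} \le \dots \le v_{i_j}$, partial sum $S_j$, and target $T > 0$. Then

$$S_j \ge T > 0 \;\Rightarrow\; v_{i_j} > 0 \;\Rightarrow\; \text{every later candidate } v_m \ge v_{i_j} > 0 \;\Rightarrow\; S_j + v_m + \dots > S_j \ge T,$$

so once a partial sum reaches $T$ no extension can ever come back down to exactly $T$, and dropping it loses nothing (a partial sum exactly equal to $T$ is itself a lower-level match, which the lower-level query already finds).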

This took 1m30s on the smaller subset, returning 225 rows. So, for smaller subsets, this is feasible, at least for certain rows/sums.
On the larger subset, a 3 row solution was cancelled after 6 minutes, returning 1226 rows so far, so this level remains impractical for larger subsets.

So, you can filter the results, as long as you order the subset. If only the query could be sped up, or some other filtering could be achieved. At this point, I'd be happy with a 3-row solution for the larger subset.
>> So, you can filter the results, as long as you order the subset. <<

Which is EXACTLY what I've been saying all along.  It's sdstuber that stated differently, that you couldn't "fight math" and the inevitable growth of extrapolations.

I ordered the initial input IMMEDIATELY in a table keyed that way.  This would allow max efficiency later.

That's also why I changed the three controlling columns to one ident, for efficiency.  You can use the one ident to lookup the original three values when you're done computing.
>>> So, you can filter the results, as long as you order the subset. <<

that's exactly what I did originally (and all along) !  reversing the signs doesn't really do anything

>>>  It's sdstuber that stated differently,

nope -  If that's how you read what I've posted, then it's no wonder you think you're doing something different than I am.  That explains everything


well, not everything, but a lot
>> Basically, v3 will be a larger number than v2, so if v1+v2 are larger than our sum, this is already an undesired value. <<

Not with negative numbers present.  v1+v2+v3 (where v3 is negative) could still yield your desired total.

That is the big problem.  Once you introduce negative numbers, you lost almost all efficiency gains in searching.

You do do some bounds checks on the neg values, but they seemed to cover quite a spread, so I don't think even that helped on the extremely large input sample.
The only sort you're doing is part of the query -- you can't possibly exclude large numbers of rows from consideration if they're already part of the query.

You never stored any intermediate results sorted in a table so that value ranges could be checked BEFORE doing the next level of calcs.

This tells you IN ADVANCE whether you have too many possibilities to proceed, BEFORE running the next query and letting it time out.
ok, I'll try one more time with a concrete (silly, but mathematically sound) example


my data consists of 100 rows, all of which have vcr_valor_bruto = 1

my target value is 6.

the only filter that will actually do anything to reduce the number of combinations evaluated is the simple v1.id < v2.id (or v2.id < v1.id - there is no difference) until you get to the final set, which will be checked against 6 and kicked out.

At level 3 that will produce 161,700 combinations to search through.  None of which will be returned but every one of them must be evaluated.

That's worst case scenario.  I fully expect the real data will be much, much better for us in a practical sense.  However,  unless there are some special constraints that haven't been mentioned, we have no reason to believe that the combinations won't grow in the direction of worst case and become unsolvable in a practical sense
>>> You never stored any intermediate results sorted in a table so that value ranges could be checked BEFORE doing the next level of calcs.

the nature of joins is my "storage" operation.

there is no such thing as a 5-way join.  There are only 2 way joins done in sequence.

When I join the table to itself the first time,  that produces some set of legal combinations (maybe large, maybe small, we don't know. it's immaterial)

Then I join THOSE results (without needing to store them independently) to the table for the second time (third instance)  I don't need to have the results of the first join stored in a table simply to read them back out again.  I'm using the results of that first join already.

that's likely why all of the SQL approaches have been so much faster than your procedural approach.  They are doing the same thing, just without the overhead of extra insert/select operations.
>> None of which will be returned but every one of them must be evaluated. <<

NO, they DON'T.

Add the highest level 2 value (keyed lookup, one read ... if stored) and the highest original value (not in the level 2 combination) -- if that total is less than the total you are seeking, you KNOW that NO level 3 combination will work.

Add the two highest level 2 values -- if that is less than the total you need, then you know that NO level 4 will equal your value.

All of that can be determined with almost no additional I/O.

For positive numbers only, you could do a binary search on the level 2 values to eliminate some combinations -- hopefully a lot, but that depends on the specific values.
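If I follow that, the pre-check is just a handful of scalar reads against the stored tables before any level-3 or level-4 join is attempted; a rough sketch (same orig_values / result2 tables as the script above; the bounds are deliberately loose but still safe):

DECLARE @value_to_find decimal(9, 2), @max1 real, @max2 real
SET @value_to_find = 112.41

SELECT @max1 = MAX(vcr_valor_bruto) FROM orig_values   -- highest single value
SELECT @max2 = MAX(result2) FROM result2               -- highest stored 2-row sum

-- Best conceivable 3-row sum: a stored pair plus one more value
IF @max2 + @max1 < @value_to_find
	PRINT 'No 3-row combination can reach the target; skip level 3 entirely.'

-- Best conceivable 4-row sum: two stored pairs (loose bound: the top pair counted twice)
IF @max2 + @max2 < @value_to_find
	PRINT 'No 4-row combination can reach the target; skip level 4 entirely.'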
>> Not with negative numbers present.  v1+v2+v3 (where v3 is negative) could still yield your desired total.

If you ensure that they're ordered, it couldn't, because if v3 is negative, then so are v1 and v2. And if v1+v2 is bigger than the sum, then any v3 will be bigger than v2 (the negatives are already behind us by that stage). The only way this condition wouldn't work would be with a negative sum. But not only did we discard that option already, I actually think negative sums are easy: all I need to do is reverse the signs on all values and search for the positive.

>> reversing the signs doesn't really do anything

Actually, it's reversing them that allows you to filter results. If you start from negative to positive, you can know that, once you've passed the desired result, further numbers aren't desirable. If you start from positive to negative, any number (since the largest negative will be the last record) can still yield a desired result.

I've actually thought that you could further filter results by storing MAX(vcr_valor_bruto) in a variable. Then you could check: if v1+v2+@MaxVl>=@Vl you can proceed (for a 3-row solution; a 5-row solution would have to use @MaxVl*3), and if v1+v2+@MaxVl<@Vl then that particular join can't possibly lead to a solution. I believe this was suggested somewhere above.
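A sketch of that check grafted onto the 3-row query (note that @MaxVl has to be computed with the same grouping the CTE uses, otherwise the bound could come out too low):

DECLARE @Emp int,@Vl real,@MaxVl real
SET @Emp=32
SET @Vl=112.41

-- Upper bound on what one more joined row can still add
SELECT @MaxVl = MAX(g.v)
FROM (SELECT SUM(vcr_valor_bruto) v
      FROM rh_vcrt0
      WHERE vcr_emp_empresa = @Emp
      GROUP BY vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc) g;

WITH v AS (
	SELECT vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc,
	       SUM(vcr_valor_bruto) vcr_valor_bruto
	FROM rh_vcrt0
	WHERE vcr_emp_empresa = @Emp
	GROUP BY vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc),
numbers AS (
	SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
	       vcr_num_empregado, vcr_vnc_cod_venc, vcr_valor_bruto
	FROM v)
SELECT v1.vcr_num_empregado 'Nº1', v1.vcr_vnc_cod_venc 'Venc1', v1.vcr_valor_bruto 'Vl1',
       v2.vcr_num_empregado 'Nº2', v2.vcr_vnc_cod_venc 'Venc2', v2.vcr_valor_bruto 'Vl2',
       v3.vcr_num_empregado 'Nº3', v3.vcr_vnc_cod_venc 'Venc3', v3.vcr_valor_bruto 'Vl3'
FROM numbers v1
INNER JOIN numbers v2
ON v1.id < v2.id
	AND v1.vcr_valor_bruto + v2.vcr_valor_bruto < @Vl                -- can't already be at or over the target
	AND v1.vcr_valor_bruto + v2.vcr_valor_bruto + @MaxVl >= @Vl      -- one more row must still be able to reach it
INNER JOIN numbers v3
ON v2.id < v3.id
	AND v1.vcr_valor_bruto + v2.vcr_valor_bruto + v3.vcr_valor_bruto = @Vl
-- For the 5-row query the bound at the v2 join becomes 3*@MaxVl,
-- then 2*@MaxVl at the v3 join and @MaxVl at the v4 join.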

ScottPletcher, I'm still not completely sure how you're hoping to achieve this. I think you want to create an extended 2 rows solution, that is, a table that has 2 row combinations that either yield the desired result, or that are still valid (not larger than the sum), hoping to later combine it with itself. If that is the case, then I can understand how negatives would hurt your solution. Nevertheless, it was still a valid one, given your knowledge of the OQ, though I'm not sure how you would generate a 3 or 5 row solution from it.
As I stated earlier, you can gen the 3-level solution by *selectively* (with only pos values) joining the necessary parts of the 2-level solution to the original values.

The four-level solution can be obtained by selectively joining the 2-level solution to itself.

With neg values, you could just compute the top two values in the 2-level solution, and use those to determine if ANY four-level solution is even possible.  If not, you could exit the code if there are any significant number of original values, because as we all know a *full* explosion to five levels would produce an essentially unsolvable number of rows.
>>>  you start from negative to positive, you can know that, once you've passed the desired result, further numbers aren't desirable.

yes, that's what I did in my query

ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto)

that puts the values in increasing order, negatives first, positives last


>>>Add the highest level 2 value (keyed lookup, one read ... if stored) and the highest original value (not in the level 2 combination) -- if that total is less than the total you are seeking, you KNOW that NO level 3 combination will work.

what you just explained IS evaluating all of them.  When you KNOW that NO level 3 combination will work, it's because you looked at each of them and they all dropped out: out of the join condition, out of the loop condition if done procedurally, or out of the WHERE clause on the extra select against previously saved results.

once the level 3 join eliminates rows the level 4 and 5 have nothing to do.  That's the whole point of why I sorted and joined the way I did.

I tested your way for level 2 and it took 1:48; I tested my way and it took 0:04, using the same data on the same server in the same database.  Please, if you have a method that is faster, post it and let us test it.

Clearly our words aren't convincing each other.  Whether that's a failing of the writers or the readers or both is losing importance.  Let the code do the talking, we can all run it and see for ourselves what works and what doesn't.    Again, I apologize if I messed up testing your code; but I've posted exactly what I did to test it.  If there is a failure in my test that skewed the results against you, please correct me.
I already stated, try the 2-level FIRST BEFORE STORING ANY RESULTS.

You only need to store results if you need to go to the 3 level AND the orig values is beyond a certain count.
>> what you just explained IS evaluating all of them.  When you KNOW that NO level 3 combination will work, it's because you looked at each of them <<

NO, that's false again.

If I've stored the two level results, I can just read the two highest results and the highest original result (also stored keyed in my example), the 3-level impossibility can be done with NO other 3-level combinations tried.
Interesting discussion.  :)  I wish that I'd found it sooner....

For my money, I'd skip the SQL solutions and write this in C.  It is, after all, a problem with a recursive solution and C (or a derivative) offers some of the best recursive solutions.  Certainly the fastest.  I would expect a C based solution to solve this with sub-second speed.  Some DBMS actually let you write functions and stored procedures in C, but I don't believe that SQL Server is one of them.

One of the problems suggested above was to take the integer values from 1 to 20 and find all combinations (/permutations) that sum to 37.  In C it's trivial if you're comfortable with recursion.

Step 1 is to read the database table into a C array.  That's 1 SQL query, no joins or other overhead passed off to the database.  Even with a million rows, there's only a few MB of storage required.  Far less than the DBMS needs to store and manage the rows.

Step 2 is to create an array to contain the values that make up the solution.  In theory, this could be as large as the original array, but there's nothing preventing a rule that limits the number of elements in the solution to a fixed size.  It was stated that 5 was the expected upper limit, so let's be generous and use 10.  :)

Step 3 is to sort the array ascending.  Note that there are no restrictions on signage, nor is there anything to prevent duplication of values.  10+10+10+7=37 and if there are 3 tens and a 7 in the data list, this algorithm will find that solution.

Overkill, I know, but the program below seems to solve this in a couple of milliseconds.  Modify this to run the SQL to load the table, and perhaps to store the solutions and it's done.

//---------------------------------------------------------------------------

#include <stdio.h>
#pragma hdrstop

//---------------------------------------------------------------------------
#pragma argsused

	int TValues[40];
	int Limit;
	int SValues[10];
	int MaxValues;
	int Target;
	int SCount = 0;

void PrintSolution (int Depth)
{
	int idx;

	printf (" Found: %4d :", ++SCount);
	for (idx = 0; idx <= Depth; ++idx)
		printf (" %d", SValues[idx]);
	printf ("\n");
}

int FindOperand (int Sum, int Depth, int Limit)
{
	int TPosition;
// Find the largest value in the array that when added to the running sum, is GE the target
		for (TPosition=0; TPosition < Limit; ++TPosition)
			if (Sum + TValues[TPosition] >= Target)
				break;

// If a solution is not found and sum(items) is less than the target
// Then we need to keep the largest value and search for another token
		if (TPosition >= Limit)
		{
			if (Limit == 0)
				return 0;
			SValues[Depth] = TValues[Limit-1];  // Relative 0 indexing, doncha know....
			Sum += SValues[Depth];           // Carry the new sum to the next pass
			FindOperand (Sum, Depth+1, Limit-1);
			return 0;
		}

// If a solution is not found and sum (items) is GE than the target
// We need to grab a value less than the current value (previous in the list)
// and look add another item to the solution check
		if (Sum + TValues[TPosition] == Target)
		{
      SValues[Depth] = TValues[TPosition];
			PrintSolution (Depth);
		}

		for (--TPosition; TPosition > 0; --TPosition)
		{
			SValues[Depth] = TValues[TPosition];
			FindOperand (Sum + TValues[TPosition], Depth+1, TPosition-1);
		}
		return 0;
}

int main(int argc, char* argv[])
{
	int TPosition;
	int idx;

// Set the value that we want to find.
	Target = 37;

// Load 1 through 20 into the TValues array, sorted ascending
	for (idx = 0; idx < 20; ++idx)
		TValues[idx] = idx+1;

// Set the number of items in the array
  Limit = 20;
  
// Set the maximum number of values in the solution
	MaxValues = 10;

	FindOperand (0, 0, Limit);

	return 0;
}
//---------------------------------------------------------------------------

Open in new window

At home now, so I can't really test that properly. However, I think you missed the point of the maximum we're talking about. What we need is ALL 5-value combinations. On a 150-row sample it's pretty standard. However, when you try to find ALL 5-row combinations, it does get slow. We're not limiting the number of results; we want them all. What we're limiting is the number of values to combine.
That is, you can run the array and combine it with itself to find 2-row solutions (solutions of 2 values that reach the desired value). That is pretty standard. But what we need is to keep joining the array to itself, four joins in all (up to 5 values per combination), filtering undesired results along the way.

It's likely that a C solution is faster than an SQL one. I just don't know if it's THAT fast. Can a 5-row solution return results out of a 5k-row sample? If you wish to test it, there is sample data above that you can use (a bit over 5k rows). You can use the value 112.41, which is the one we've been using so far. Try to find all 5-row combinations in about 5 minutes. I don't think you can do it with pure recursion alone; there are too many combinations to run through them all, even ordering and filtering as you go. However, I'm curious to know whether a C solution can achieve a 5-row solution in any decent time (anything over 5 minutes isn't practical, other than to satisfy my curiosity ;)
Hi Cluskitt,

Perhaps two key variables in that code need some description.

  *Limit* is the number of data points to examine.  If there are 20 data points in the table, its value is 20.  If there are 50,000 data points, its value is 50,000, etc.

  *MaxValues* is the most number of items that will be returned for a solution.  If 1, 2, 3, 4, 5, 6 sums to the correct target (21) it will not be shown because *MaxValues* is set to 5 and that sequence contains 6 points.  However, 2, 3, 4, 5, 7 will be returned as it contains only 5 points.

I believe that's what you're looking for.

Modern processors are capable of multi-billion calculations per second.  At such speeds, every item in a 1,000 item table can be compared to the other items more than 1,000,000 times in well under a second.  That kind of performance isn't possible in SQL, where a Cartesian product is first created for each item in the table, then rows are filtered out.  Multiple joins results in multiple Cartesian products and a huge number of rows.

It would probably take a whiteboard (or equivalent) to properly show the algorithm that the C code uses, but I'll be glad to give it a stab here.


Kent
I added

if (Depth >= MaxValues)  return 0;  

to the beginning of FindOperand  so the recursion would stop after searching for 5 numbers.
If you set the limits higher, make sure you increase the array size too.

However, even though the original code doesn't appear to observe MaxValues, the search stops after a depth of 5 for the 1-20 set and target 37.

The code runs fast, but only returns a tiny subset of the possible answers.  For example, running 1-20 with a target of 37 returns only 18 combinations out of 309.

Found:    1 : 20 17
 Found:    2 : 20 16 1
 Found:    3 : 20 15 2
 Found:    4 : 20 14 3
 Found:    5 : 20 13 4
 Found:    6 : 20 13 3 1
 Found:    7 : 20 12 5
 Found:    8 : 20 12 4 1
 Found:    9 : 20 11 6
 Found:   10 : 20 11 5 1
 Found:   11 : 20 11 4 2
 Found:   12 : 20 10 7
 Found:   13 : 20 10 6 1
 Found:   14 : 20 10 5 2
 Found:   15 : 20 10 4 2 1
 Found:   16 : 20 9 7 1
 Found:   17 : 20 8 6 3
 Found:   18 : 20 7 5 4 1

It's always starting with the largest value in the set, but even with that limit it's still missing some of the expected results.   For example:  20,11,3,2,1   isn't found.
Hi Cluskitt,

I've done some more work on this (I find it interesting) and several things come to mind.

In your data sample (5,000+) rows, there are duplicate values for CalcValue.  If the duplicates are not filtered out, there will be duplicates in the answer.  

I'll post the complete program, but it appears that we have a LOT more combinations that generate the target value than we thought.

When we limit the results to only 2-column sums AND we delete the rows that duplicate CalcValue there are 99 results.  When we limit the results to 3 column sums there are 65,024 results.  The 65,024 rows are written to a file in about a second.

The results are growing exponentially.  Perhaps this whole thing needs to be rethought.


Kent
Yes, I know there are duplicates. They have to be taken into account. That is, each is to be treated as an individual value. On the whole, there are no two rows that are exactly the same. One of the keys will change.

The fact that a 5-row combinatorial calculation explodes into billions and trillions of possible solutions is the problem we're fighting here. We tried early filtering to reduce that number, but it still seems a bit unmanageable, even for C.
Let's try looking at the problem a different way.

Let's say we give you a magic computer that has infinite speed and for some particular target sum you get a million different possible combinations.

What are you going to do with that information?  Is it useful or is the answer only of use when it's something smaller and more manageable?

If a million is considered small (and proportional to what it could be it's miniscule)  what if it's a billion?  Are the results still useful?  What if there are 100 billion? A trillion?  

At some point, the growth will be so high that there is no possible way you could ever use the results in  a meaningful way, even if you could generate them instantly you couldn't display them or write them to disk fast enough.

or -  Can you say with reasonable confidence that the data will not have that kind of growth?  Are there other conditions that haven't been specified that would let us generate many tens of thousands of combinations without ever exploding into mega-results?

 I can create gigantic input sets (millions of rows) that can still be solved in a few seconds (the 0.12 sum of positives example above).   I can create small sets (thousands of rows) that cannot be solved within hours (the all-1s example above; if you don't like a target of 6, how about a target of 5 - every row combination will be legal).

So, since the data will either make or break you here.  Can you tell us anything more about it?
Well, obviously, if we get millions of records, it won't be useful. We will have to restrict the data further.
The point is to discover codes that aren't correctly parametrized in the application and thus generate a different total. We then have to check which values/codes might cause this. And if it were just this, it would be simpler, because we could check a few, eliminate some possibilities, and generate a smaller result. But in some cases, it's something that affects only a few employees (vcr_num_empregado), in which case codes that are correctly parametrized can still be part of the difference.

So, while some codes can later be eliminated, that will have to be due to human examination and the full results have to be calculated or, I guess, we can establish some ceiling of, for example, 500 results to be evaluated. Also, we first run the 1 row solution. If there are no results, then we move on to the 2 row solution, etc...
From experience, I'd say that most of the time, a 3 row solution would solve the case. However, we sometimes need to perform a 5 row one, for more complicated cases.

The real problem here is that, other than pure math, there really isn't any condition that can be used to exclude data automatically.

Maybe an easier path would be to create 5 row solutions for 1 code at a time. That is, the first table/array would be composed only of one code, which would then be fully combined. I suppose that would cut down quite a few possibilities, especially as some of the codes exist in a smaller number.
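A rough sketch of those two restrictions bolted onto the 2-row query (the @Cod parameter and its value are hypothetical, just to show where the filter goes; TOP (500) is the example ceiling mentioned above, with no ORDER BY because any 500 matches are enough to start inspecting):

DECLARE @Emp int,@Vl real,@Cod int
SET @Emp=32
SET @Vl=112.41
SET @Cod=10;          -- hypothetical: the single code being examined

WITH v AS (
	SELECT vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc,
	       SUM(vcr_valor_bruto) vcr_valor_bruto
	FROM rh_vcrt0
	WHERE vcr_emp_empresa = @Emp
	  AND vcr_vnc_cod_venc = @Cod        -- only combine rows of this one code
	GROUP BY vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc)
SELECT TOP (500)                         -- ceiling on how many matches get returned
       v1.vcr_num_empregado 'Nº1', v1.vcr_valor_bruto 'Vl1',
       v2.vcr_num_empregado 'Nº2', v2.vcr_valor_bruto 'Vl2'
FROM v v1
INNER JOIN v v2
ON v1.vcr_num_empregado > v2.vcr_num_empregado   -- code is fixed, so employee number alone orders the pair
	AND v1.vcr_valor_bruto + v2.vcr_valor_bruto = @Vl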
>> The real problem here is that, other than pure math, there really isn't any condition that can be used to exclude data automatically.

I think that that's been the point of most everyone that has posted, though the explanations come to that conclusion from different angles.

Recursively spinning over a modest volume of data that (in this case) probably fits into the CPU cache is the fastest solution that you'll get.  But the number of possible solutions is huge.

The algorithm in that C program searches for combinations, not permutations.  It does it by searching the data in such a way that the value list consists of items in decreasing order.  (Technically, non-ascending if duplicates are included.)  It may be possible to approximate the number of solutions for a given set of data, but it sure seems impractical (or impossible) to generate all of the solutions in a meaningful form.

Does it do you any good to hunt for solutions where the largest value is X?  Or 100 solutions where the largest value is X?  Searching for "contains X" doesn't get us anywhere, as "contains 1" means the entire list still has to be searched for all possible combinations.


Kent
Oh.  Here's the last C program that I used.  I took some liberties with the data, squeezing out the duplicate values and converting the values to non-fractions by multiplying by 100.  Those are performance boosters, but it's still a huge number of answers in the result set.


//---------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#pragma hdrstop

//---------------------------------------------------------------------------
#pragma argsused

	double *TValues;
	double SValues[10];
	double Target;
	int Limit;
	int MaxValues;
	int SCount = 0;

	FILE *R;

void PrintSolution (int Depth)
{
	int idx;

	fprintf (R, " Found: %d :", ++SCount);
	for (idx = 0; idx <= Depth; ++idx)
		fprintf (R, " %.0f", SValues[idx]);
	fprintf (R, "\n");
}

int FindOperand (double Sum, int Depth, int Limit)
{
	int TPosition;
// Scan the list backwards from the starting point (Limit)
		for (TPosition=Limit-1; TPosition >= 0; --TPosition)
			if (Sum + TValues[TPosition] == Target)
			{
				SValues[Depth] = TValues[TPosition];
				PrintSolution (Depth);
			}
			else if (Sum + TValues[TPosition] < Target)
			{
				if (Depth+1 < MaxValues)
				{
					SValues[Depth] = TValues[TPosition];
					FindOperand (Sum + TValues[TPosition], Depth+1, TPosition);
				}
				else
					return 0;
      }
		return 0;
}

void LoadTable (void)
{
	FILE *F;
	int A, B, C;
	float V;
	int Status;
	int TableSize = 1000;

	F = fopen ("C:\\sqldata.txt", "rt");
	Limit = 0;

	if (F)
	{
		TValues = malloc (TableSize * sizeof (double));
		while (1)
		{
			Status = fscanf (F, "(%d,%d,%d,%f)\n", &A, &B, &C, &V);
			if (Status <= 0)
				break;
			if (Limit +1 >= TableSize)
			{
				TableSize *= 2;
				TValues = (double*) realloc (TValues, TableSize * sizeof (double));
			}
			TValues[Limit++] = (long) (V * 100.0 + (V < 0 ? -0.5 : 0.5));	/* round to whole cents */
		}
		fclose (F);
	}
}

void SortTable (void)
{
	int x, y;
	double V;

	for (x = 1; x < Limit; ++x)
	{
		if (TValues[x-1] > TValues[x])
		{
			for (y = x-2; y>=0; --y)
				if (TValues[x] > TValues[y]) 
					break;
			++y;
			V = TValues[x];
			memmove (&TValues[y+1], &TValues[y], (x-y) * sizeof (double));
			TValues[y] = V;
		}
	}
}

void PackTable (void)
{
	int Source;
	int Destination;

	Source = 1;
	Destination = 0;

	while (Source < Limit)
	{
		if (TValues[Source] != TValues[Destination])
			TValues[++Destination] = TValues[Source];
		++Source;
  }
	if (Limit > 0)
		Limit = Destination + 1;	/* number of unique values kept */
}
int main(int argc, char* argv[])
{
	int TPosition;
	int idx;
	int i;
	FILE *F;

	Target = 11241;

	MaxValues = 3;

	LoadTable ();
	SortTable ();
	PackTable ();
	F = fopen ("C:\\Values.txt", "wb");
	for (idx = 0; idx < Limit; ++idx)
		fprintf (F, "%.0f\n\r", TValues[idx]);
	fclose (F);
	R = fopen ("C:\\Results.txt", "wb");
	FindOperand (0, 0, Limit);
	fclose (R);

	return 0;
}
//---------------------------------------------------------------------------

Open in new window

>>>I think that that's been the point of most everyone that has posted

That's also why I've been saying the algorithm and language don't matter.  C can only do a few billion operations per second (in theory, and "operation" doesn't equate to "evaluate one combination"), so it's still way too slow for the bad cases.  Of course a really bad algorithm can make the problem worse, but nothing "good" will ever be "fast" when the math is against you.

I hate to keep playing the role of doomsayer, but when the combinations are bad there is simply no way to tackle this in a practical sense.



>>> The algorithm in that C program searches for combinations, not permutations.

I'm pretty sure that applies to all of the code above. Other than the query in the original question text, everything does some sort of sorting to prevent duplicates through permutation and only does combination searches.


>>>  That is, the first table/array would be composed only of one code

If it's valid to evaluate only a subset of the rows you've presented, then by all means do that.  Any rule, no matter how seemingly trivial, that can cut the amount of data will help.  In fact, the reduction helps exponentially in the same way that more data grows exponentially.  Less data means fewer combinations, which inhibits the mathematical expansion.

If, however, you still need to check for cross-code sums, then it doesn't really help the exhaustive search.

I'm still a little fuzzy about what you want to do with large result sets.  If we find 500 or even 50000 combinations quickly, but there are still another 146 million more after that,  do the 50000 we found even matter?
The preferred results are the ones that contain as few different codes as possible. That is, a result that has 3 values with code 1 and 2 with code 6 is more relevant than one that has 5 values from 5 different codes.

From past experience, at least from level 3 solutions, I can say that, for the values used, most samples will return about 100-200 solutions. This is a manageable number which can be analysed by sight, given human intuition and all that.

Usually the differences (desired sums) for the larger subset are larger, which actually generates fewer possible results. But that only helps with the number of results, not the processing power required, because, while larger targets do exclude many possibilities, they only exclude them after they have all been calculated.

For example, if, given the sample provided, you were to search for a sum of 8952.32, there wouldn't be as many possible combinations as for 112.42, because the vast majority of the values aren't large enough to produce that sum.

I will try to study your code and adapt it into the VB.Net project I'm experimenting with, since I'm more at home with VB than C, though I can use both. One thing I did notice is that you use a sorting procedure, which, in the real case, won't be necessary, as the data would come from a DataTable (or similar) and I can have SQL sort it in advance.
A C program will definitely be faster than SQL; math calcs are not SQL's strong point.  Even if staying in SQL, the fastest option would likely be C in the CLR, but even that will be slower than native C.
>> If we find 500 or even 50000 combinations quickly, but there are still another 146 million more after that,  do the 50000 we found even matter?

If we find 500, but there are still 50000 more after that (our tolerance is lower ;) ), we just study the first 500 to check if there is anything to be made of them; otherwise we just drop it. At this point, we would have exhausted the 1, 2 and 3 row combinations (and possibly 4), so expectations would be low anyway and some other method would have to be found for discovering the problem.
>>> C program will definitely be faster than SQL; math calcs are not SQL's strongpoint.  

sure, but it's the same argument as my approximation errors before:  even if the C is a million times faster than SQL (which it isn't), it's still too slow to tackle result sets in the quadrillions.

but, based on the asker's last comment that 200 results is all that is expected, the early filtering that we've all been hoping would work might actually do it.

if that's the case, then the data is working for you, not against you, and the math works in your favor:  you'll rarely get the mega-results and, if you do start down that path, 50000 is too many to worry about anyway, so we don't have to get anywhere near the full set of combinations.

I have some alterations to kdo's code that should make it more efficient, but with these new constraints I'm not sure it's even necessary.  I expect we can either solve the small sets or give up on the big ones fairly quickly.
I've been working on my own code as well, to see if I can come up with something reasonable (I'm actually also curious if I can get the full results without filtering, but I'll try at home where my computer is better ;)).

It seems to me that there are a few ways to speed up the filtering, especially working with an array:
-Working in ascending order (we tackle the negatives first), we can check for possibilities. For example, let's suppose the smallest value is -5000. We then add it to the 4 largest values in the array to see if we can possibly reach our goal. If not, we just move on to the next position. If so, we go down one level and combine it with the next value, for example -4500. Then we check to see if -9500 plus the 3 largest values lets us reach our goal, and so on.
-As we've been using, we check if the current sum is already larger than the desired total.

These two conditions alone will filter out many combinations and can end the current depth level early.
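To make the idea concrete, here is a rough, standalone C sketch of just those two tests (the tiny array, the target and the names are only illustrative, nothing like the real table):

/* Rough illustration of the two pruning tests above, assuming the values
   are sorted ascending.  Toy data only; not the real search. */
#include <stdio.h>

#define N 6
static const double vals[N] = { -5000, -4500, 100, 2500, 7000, 9000 };
static const double target  = 11400;

/* Test 1: can a partial sum still reach the target with `slots` more picks,
   using only elements at index `idx` or later?  Add the `slots` largest
   remaining values (the end of the ascending array) and compare. */
static int can_still_reach(double partial, int idx, int slots)
{
	double best = partial;
	int i;
	for (i = N - 1; i >= idx && slots > 0; --i, --slots)
		best += vals[i];
	return best >= target;
}

int main(void)
{
	double partial = vals[0] + vals[1];         /* -9500                 */

	if (partial > target)                       /* test 2: already over  */
		printf("prune: partial sum already exceeds the target\n");
	else if (!can_still_reach(partial, 2, 3))   /* 3 picks still allowed */
		printf("prune: even the 3 largest remaining values fall short\n");
	else
		printf("keep searching from this branch\n");
	return 0;
}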

One thing that has to be taken into account, though, is that we first generate a depth level of 0 (a 1 row solution), then a depth level of 1, etc. This is a parameter that will be input by the user. In SQL I was thinking of a stored procedure using a variable; pure console C would have to read the value as input; VB.Net would most likely use a DropDownList.

I don't know if there is much of a loss in using VB.Net. Other than the initial stage, which is fixed, I just print into a label instead of to the console. Performance should be similar.
LOL.  Suddenly now the math is working in our favor, when it was an insurmountable obstacle all along before.
>>> LOL.  Suddenly now the math is working in our favor, when it was an insurmountable obstacle all along before.


I've always said that the data determines the success or failure.  Not the method.

If the data will produce few combinations then it might be feasible to solve in real time
   -i.e. the math works

If the data will produce many combinations then it won't be feasible.
  - i.e. the math does not work

Nothing posted until half an hour ago gave any indication that the data would work in our favor.  There was only the hope that it might sometimes.  The "trick" now is that with a hard cutoff of 50000 we effectively no longer care if it's bad or not.

Up until the last half hour, the request has been to find them all; which meant bad cases were not solvable.  Now "I give up" is a legal solution.  

So "suddenly" it is in our favor and no longer an obstacle.  The rules changed, but not my reasonings.
>>>  For example, let's suppose the smallest value is -5000. We then add it to the 4 largest values in the array to see if we can possibly reach our goal.


yes, this is where I was going when I suggested keeping track of the minimum and maximum values to allow early detection of impossible conditions without needing to try intermediate values
Wow, that's exactly the opposite of what you've been saying all along, that the only way to know which combinations would/would not work was to test them ALL.


A genuine mathematician might be able to use logs or some other math "trick" to speed this up considerably.  You might want to post this problem on a mathematics site and see if they can give you some algorithm that would help.
Hi Cluskitt,

>> I've been working on my own code as well, to see if I can come up with something reasonable (I'm actually also curious if I can get the full results without filtering, but I'll try at home where my computer is better ;)).

I've got some killer processors at my disposal.  It doesn't seem to help.


>>It seems to me that there are a few ways to speed up the filtering, especially working with an array:
-Working in ascending order (we tackle the negatives first), we can check for possibilities. For example, let's suppose the smallest value is -5000. We then add it to the 4 largest values in the array to see if we can possibly reach our goal. If not, we just move on to the next position. If so, we go down one level and combine it with the next value, for example -4500. Then we check to see if -9500 plus the 3 largest values lets us reach our goal, and so on.

That's essentially what the C program does, except that it starts at the highest value and works down.  This approach has the advantage of pre-filtering all items that are  larger than the target value.  They're identified in the initial loop and once found are no longer part of the equation.  


>> I don't know if there is much of a loss in using VB.Net. Other than the initial stage, which is fixed, I just print into a label, instead of console. Processing power should be similar.

I don't think so.  C is a sort of macro language that compiles almost directly to assembly code.  VB.Net is intended to manage visual (GUI) objects.  I expect quite a large difference in performance.


Kent
>>> Wow, that's exactly the opposite of what you've been saying all along, that the only way to know which combinations would/would not work was to test them ALL.

nope


But I think maybe I understand your confusion.  You're confusing potential and actual results.
I'll take the blame for that for failing to differentiate them well enough in my previous attempts.

My concern is when there ARE lots of combinations to be returned.
If there are only 10 legal combinations, then I fully expect we'll be able to find those quickly, because the joins, loops, filters, sorts, whatever, will quickly throw out lots and lots of invalid combinations early, and we'll only need to actually check a handful of 5 value sums to find those 10.

If there are 10 trillion legal combinations, then I fully expect we will NOT be able to find those quickly because there is little to be thrown out on each pass.  

You can only validate that a 5 value sum equals the target by adding all 5 values, which, in the bad case, means doing that 10 trillion times.  Even kdo's billion-operations-per-second system would still take hours to do that.

But you might be able to say a 5 value sum does not equal the target after looking at only a couple of values, thereby skipping a lot of that work.

Does that help?
No, that doesn't help; you're the one that keeps flopping all over the place depending on the latest iteration.

I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:
IF there is NO chance of a 3, 4 or 5 level solution, WITHOUT going thru all the permutations

And that certain math techniques should be able to selectively filter the number of rows required to compute at each level.

Unfortunately, negative values massively limit the reduction.

With only positive numbers, much more filtering is possible.
>> That's essentially what the C program does, except that it starts at the highest value and works down.  This approach has the advantage of pre-filtering all items that are  larger than the target value.  They're identified in the initial loop and once found are no longer part of the equation.  

Actually, we've concluded somewhere up there that you can't do that because of negative values. That is, a value can be larger than the desired sum, but negatives can bring that value down to a point where it becomes a valid solution. The only way you can filter them out is by using ascending data. That way, you can be sure that, once you're over the sum, no further values are valid.
>>I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:
>> IF there is NO chance of a 3, 4 or 5 level solution, WITHOUT going thru all the permutations

sure, we've all said that from the very beginning

go back to the very first post in this thread

t1.ID_Value  > t2.ID_Value    

what's the point of that condition?  To eliminate duplicate evaluation  of effectively identical solutions.


go back to the first example of complete code in this thread http:#a38512067 

Look at each join I posted; they each have a condition like this...

     AND n1.n + ISNULL(n2.n, 0) <= 37

why?  because I don't want to continue evaluating further joins if I already know the partial sum has exceeded the target.  I'm skipping all the later work for 3, 4 and 5 value sums.

I've been consistent all along.  Please reread my posts and feel free to ask for any clarifications. I'm sure there are probably some typos or other mistakes that might make my point confusing; but the intent has been the exact same thing every time.


>>> IF there is NO chance of a 3, 4 or 5 level solution,

go back to my previous post.  I'm not really as concerned about the case when there are only a few solutions.  I'm concerned when there are many solutions, i.e. when there is not only a chance of 3, 4 and 5 level solutions but a certainty of millions, billions or trillions of them.
>> I don't think so.  C is a sort or macro-language that compiles almost directly to assembly code.  VB.Net is intended to manage visual (gui) objects.  I expect quite a large difference in performance.

I would expect C to be faster. Obviously, VB.Net has to draw the objects, which makes it slower. But the pure computation part (just calculating valid results and dumping them into an array) shouldn't be much slower than C. Slower, but not much slower. I could be wrong, but I seem to have read somewhere that both C and VB are compiled to binary. VB.Net would be slower due to all the graphics involved, and dumping the array into a label would be slower than just printing to a console. But the math itself, which involves nothing more than memory addresses and the CPU, should take about the same time whether in C, VB, Delphi, or similar.

There would be slower languages, namely all the interpreted ones like Flash, Java, PHP, etc. But the only one I could see being faster is pure assembly.
>> Actually, we've concluded somewhere up there that you can't do that because of negative values. That is, a value can be larger than the desired sum, but negatives can bring that value down to a point where it becomes a valid solution. The only way you can filter them out is using ascending data. That way, you can be sure that, once you're over the sum, no further values are valid.

You're correct.  I missed that.  Offhand, it would seem that attempts to simply bias/unbias the data to use only positive values may be more trouble than it's worth.  But if all of the values are biased so that the lowest value is 0 (eliminating the need for handling negative numbers) the target value changes with the number of terms in the solution.  (It's effectively OriginalTarget+(Bias*Depth)).  There's no reason that the target can't increment as the depth increases just as the subtotal (Sum) does.  Something else to try!  :)
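A tiny, throwaway C sketch of that biasing idea with made-up numbers (illustrative only, not the real program): every extra term in a candidate sum carries one extra Bias, so a k-term sum of biased values is compared against OriginalTarget + Bias*k.

/* Toy illustration of biasing the data so every value is >= 0.
   Made-up values only. */
#include <stdio.h>

int main(void)
{
	double vals[] = { -300, -100, 250, 400 };
	int    n      = 4;
	double target = 150;                /* we want  -100 + 250 = 150 */
	double bias   = 300;                /* = -(smallest value)       */
	int    i, j;

	for (i = 0; i < n; ++i)
		vals[i] += bias;                /* now 0, 200, 550, 700      */

	/* a 2-term sum of biased values must hit target + 2*bias */
	for (i = 0; i < n; ++i)
		for (j = i + 1; j < n; ++j)
			if (vals[i] + vals[j] == target + 2 * bias)
				printf("match: %.0f + %.0f\n", vals[i] - bias, vals[j] - bias);
	return 0;
}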




>> I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:  IF there is NO chance of a 3, 4 or 5 level solution, WITHOUT going thru all the permutations

That's an interesting exercise.  I think that I'll test the theory with the program (above) by creating a 5,000 item dataset that consists of the even numbers from 2 to 10,000, then searching for a sum that is an odd number.  It can't possibly happen.  The algorithm does have a short-stop to terminate the current loop when the depth (number of terms) exceeds the maximum, but there is no short exit to skip the sums that are bound to fail once the loop hits a certain threshold.  I'm wondering if that's more overhead than gain?  Hmmm....
>>I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:
>> IF there is NO chance of a 3, 4 or 5 level solution, WITHOUT going thru all the permutations

>> sure, we've all said that from the very beginning <<

Geez, baloney.  I'm saying NO further joins are required, NONE AT ALL, not filtered ones.  You always present some non sequitur and then act as if it's a revelation to me.
>>>  I'm saying NO further joins are required, NONE AT ALL, not filtered ones.  

yep - exactly.  I agree, I'm sorry we're talking past each other  - thanks for participating though.
I think you could quickly tell whether ANY 3-level, 4-level or 5-level (or n-level) solution can ever reach the desired total as follows:

SELECT the n highest original values (this is quick, even with 5000 rows).

Add them, keeping each total separate.

That alone can tell you if:
1) ANY 3-level total can reach the value needed
2) ANY 4-level total can reach the value needed
3) ANY 5-level total can reach the value needed

From that, you can get some idea of the level of effort to find a match.  

If NO 5-level total could possibly match, then you know you'd have to go to at least the 6th level, and you could quickly calculate the extreme number of combinations you'd have to test to find a solution.
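As a quick illustration, that test is only a few lines of C (toy, sorted values; the function name is just a placeholder):

/* Toy version of the feasibility test above: if the sum of the n largest
   values cannot reach the target, no n-level solution exists at all.
   Assumes `vals` is sorted ascending; data is illustrative only. */
#include <stdio.h>

static int level_is_feasible(const double *vals, int count, double target, int n)
{
	double best = 0;
	int i;
	for (i = count - 1; i >= 0 && i >= count - n; --i)
		best += vals[i];            /* sum of the n largest values */
	return best >= target;
}

int main(void)
{
	const double vals[] = { 1, 7, 11, 21, 45, 49 };
	int n;
	for (n = 3; n <= 5; ++n)
		printf("%d-level solution possible for 120? %s\n",
		       n, level_is_feasible(vals, 6, 120, n) ? "maybe" : "no");
	return 0;
}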
Going back to the math problem....

I programmed and ran a test where the Values table contains 5400 values, the even numbers from 2 to 10,800, and the target value is odd.  This is a good process for testing search timings, as no solution can be found, so the process runs through all possible combinations.

I first tested for 11241, a value slightly higher than the largest value in the table.  On a 3+ GHz processor, the program took about 13 seconds to complete.  When the target value was reduced to 1121, the search ended almost instantaneously.  When the value was increased to 112241, the search required about 1 second.

Searching for a small number is lightning fast due to the numbers larger than the target being disqualified (and ignored) in the initial loop.  The search actually considered only 556 of the 5,400 values.

Searching for a large number took longer than I had expected, suggesting that there may be some improvement possible in the "we don't need to pursue this particular loop any longer" logic.


I'm going back to testing, but wanted to pass this along.

Kent
Don't forget about the negative values issue -- I still think negatives are a serious problem for gaining performance, unlike sdstuber.  We just have a different opinion on that.
Hi Scott,

I haven't forgotten.  Before the night's over I'll test the changes to bias the entire table so that all values are positive.  Pretty easy change, actually.

Just an FYI, but the conversation keeps drifting between Combinations and Permutations.  Early on the OP indicated that he's looking for combinations.


Kent
>>> unlike sdstuber.  We just have a different opinion on that.

nope!  

Please email me offline if you want to pursue your line of thought more.  I'll be happy to keep trying to explain it to you but I've taken enough space in this thread so far.
Ok.  We have a bit more to report.

Setting the maximum number of terms to 4, I find 20,997,754 solutions in the original data.  The C program, using biased values to eliminate the issue of negative numbers, took 2:34 to find and write the solutions to disk.

That's also using a packed data table, where the duplicates are collapsed into a single row per unique value.  In an offline discussion, sdstuber correctly pointed out that collapsing the duplicates can result in not finding correct solutions.  If the target is 100 and the raw data contains 2 or more rows with a value of 50, the correct solution would report 50+50, but that solution is lost when only unique values are considered.

So we have 21M solutions of 1, 2, 3, or 4 columns added together.  I propose that finding and counting the solutions for up to 5 columns is beyond this discussion.  Actually using the results is waaaaaayyy out of bounds.
First: Yes, what I want is combinations, not permutations. Duplicate values that are essentially the same but in a different order are not desired.

Second: No, you can't pack the data table. Each row has to be taken into account, even if they have the same value (even though, 99.9(9)% of the time, either the number or the code will vary, so the row is still unique, despite the value being the same).

Third: The sum I presented in the above example (112.42) was a real life example of a sum we needed to find... but on the smaller subset. There is a very low chance we ever need to find such a small sum on the larger subset. So, for the data presented above, you can search for values between 5k-10k. Sometimes we may drop to 2k, but that is rarer. As an example for your program, you can use the value 5234.2.

Fourth: Once you find 50k solutions, you can break and exit. No more will be evaluated. If necessary, you can add some line stating the percentage evaluated so far.
Hi Cluskitt,

That does leave the question of "how to break" out of the search.  The C algorithm finds the solutions "in order".  That is, the first solution found will have the largest value to the left and move in that fashion.  Targeting 100 could result in:

100
99 1
98 2
98 1 1
97 3
etc.

If you "just break" out of the search, the solutions are all clustered where the leftmost column contains the largest number possible.  Most searches would behave similarly as the search is typically an ordered process.  Is that adequate or do you need/want a representative sampling across the data?

And I reran the search with a maximum of 4 columns, with the duplicated data. 149,800,308 combinations found in 15 minutes, though some should filter out as the list contains both solutions with duplicated values (98+1+1) as well as duplicate solutions (99+1) when a value (in this case, 1) occurs twice.

Kent
You need to run each solution independently, by user input. That is, user inputs level 1, you only run 1 row solutions. User inputs level 3, you only run 3 row solutions (disregarding 1 and 2 row solutions), etc. I said this was necessary somewhere up above. In which case, stopping after a certain number has been achieved is easy.

The way I'm doing it in the code is basically to add a check of the Current Level against the Max Level (the latter is a user input variable).

By the way, I'm curious how long a non-packed 3 row solution would take for you. At this exact moment, I'm running a 3 row solution for the sum 7554.07. I'm waiting to see how long it will take in VB.Net. I'll post my code soon, to see if any changes should be made.

Also, I realized that the condition that checks the current sum plus the highest values (to see if the sum can still be reached) is pretty much irrelevant. Almost always there will be a few values that are way higher than the desired sum, as you can see in the sample (there is a value over 20k, one over 10k, etc.).
4 columns in 15 minutes? Wow, that definitely beats anything I've tried (SQL or C), and by a wide enough margin that I don't think I can chalk it up to hardware.

what's your current code look like?
Ok, I don't think my code is running correctly. I must be doing something wrong, though I'm not exactly sure what. A 3 column search for 7554.07 was taking over 45 minutes (I went to lunch and it was still running when I got back). It shouldn't take so long, even if this code is running slower. This is my code:
Imports System.Data.SqlClient

Public Class frmMain

  Private iVal(,) As Double
  Private iCol(15) As String
  Private CurrentLine, TotalLines, CurrentLevel, MaxLevel As Integer
  Private MaxVal, DesiredSum, CurrentSum As Double

  Protected Sub GetValues()
    Dim sConn As New Sqlconnection("server=SERVER\SERVER;uid=rhxxi;pwd=rhxxi;database=rhxxi;")
    Dim sqlDA As New SqlDataAdapter("SELECT vcr_num_empregado,vcr_vnc_cod_venc,vcr_valor_bruto " & _
                                    "FROM rh_vcrt0 " & _
                                    "WHERE vcr_emp_empresa=@Emp " & _
                                    "ORDER BY vcr_valor_bruto,vcr_num_empregado*1,vcr_vnc_cod_venc", _
                                    sConn)
    sqlDA.SelectCommand.Parameters.Add("@Emp", SqlDbType.Int).Value = txt_Emp.Text
    Dim Table1 As New DataTable
    sqlDA.Fill(Table1)
    TotalLines = Table1.Rows.Count - 1
    CurrentLine = 0
    ReDim iVal(3, TotalLines + 1)
    For a = 0 To TotalLines
      For b = 0 To 2
        iVal(b, a) = Table1.Rows(a).Item(b)
      Next
    Next
  End Sub


  Private Sub btnCalc_Click(sender As System.Object, e As System.EventArgs) Handles btnCalc.Click
    If Not Integer.TryParse(txt_Emp.Text, New Integer) Or Not Double.TryParse(txt_Sum.Text, New Double) Then
      lbl_Msg.Text = "Insira valores válidos para empresa e soma"
    Else
      lbl_Msg.Text = ""
      Dim timNow As DateTime = Now
      DesiredSum = txt_Sum.Text
      MaxLevel = cbo_Nivel.SelectedIndex + 1
      GetValues()
      lbl_Msg.Text = "SQL: Demorou " & DateDiff(DateInterval.Second, timNow, Now) & " segundos (" & TotalLines & " resultados)"
      timNow = Now
      SumArr(0, 1, 0)
      lbl_Msg.Text &= vbNewLine & "Cálculo+Print: Demorou " & DateDiff(DateInterval.Second, timNow, Now) & " segundos (" & CurrentLine & " resultados)"
    End If
  End Sub

  Protected Sub SumArr(CurSum As Double, CurLevel As Integer, CurIndex As Integer)
    Dim CurSum2, CurSumMax As Double
    For a = CurIndex To TotalLines
      CurSum2 = CurSum + iVal(2, a)
      If CurSum2 > DesiredSum Then
        Exit Sub
      End If
      If CurLevel < MaxLevel Then
        CurSumMax = CurSum2
        For r = TotalLines To TotalLines - (MaxLevel - CurLevel) Step -1
          CurSumMax += iVal(2, r)
        Next
        If CurSumMax < DesiredSum Then
          Exit Sub
        End If
        For x = 0 To 2
          iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
        Next
        SumArr(CurSum2, CurLevel + 1, a + 1)
      Else
        If CurSum2 = DesiredSum Then
          For x = 0 To 2
            iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
          Next
          GenRow(iCol)
          If CurrentLine >= 500 Then
            'TODO: %
            Exit Sub
          End If
        End If
      End If
    Next
  End Sub

  Protected Sub GenRow(iCol() As String)
    For Each ctr As Label In pnlMain.Controls.OfType(Of Label)()
      If Microsoft.VisualBasic.Left(ctr.Name, 4) = "lbl_" Then
        ctr.Text &= iCol(Microsoft.VisualBasic.Right(ctr.Name, Len(ctr.Name) - 4)) & vbNewLine
      End If
    Next
    CurrentLine += 1
  End Sub

  Private Sub frmMain_Load(sender As Object, e As System.EventArgs) Handles Me.Load
    cbo_Nivel.Items.Clear()
    cbo_Nivel.Items.Add("1")
    cbo_Nivel.Items.Add("2")
    cbo_Nivel.Items.Add("3")
    cbo_Nivel.Items.Add("4")
    cbo_Nivel.Items.Add("5")
    cbo_Nivel.SelectedIndex = 0
  End Sub
End Class

Open in new window

I don't know if I have a flaw in my logic, or if I can do things more efficiently. I don't have much experience with recursive combinations, but it seems to me, logically, that this should be a correct way of doing so. But maybe there's a simpler way which I'm not considering.
I'll rerun the test to look for 5235.2.  My guess is that the time jumps way up.  WAY up.

Grouping the items by the number of rows is trivial.  1 or 2 columns returns the results instantaneously, and 3 columns is still sub-second.  Running the search once for 1, 2, 3, and 4 column solutions takes about 1 second longer than running it once for all solutions.  Far superior to sorting the results later.

It's trivial to make the number of columns a command line argument.  If you use this program we can make that happen.

  findstuff -1

Currently, the program reads the input from your SQL source, though I have removed the text so that just the data points remain.  The data file looks like this:

  (32,1,1,3641.00)
  (32,1,6,133.44)
  (32,10014,1,585.46)

Apples to Apples.  Looking for all 1, 2, and 3 column solutions where the target is 7554.07 takes about a second.  33 results are found.

Looking at your code, I believe that the SumArr function is doing entirely too much work, resulting in the high search time.  All it needs to do is scan the relevant portion of the data list from the position of the previous element +1 to the end (line 49 does this) and take one of three actions depending on the sum of the previous columns and the current value.  If the sum is greater than the target, ignore it.  (Since you're examining the values in increasing magnitude you can exit the function.)  If the sum is equal to the target, print it (and exit the function, since the rest of the list can't possibly produce a unique result).  If the sum is less than the target, add another column or continue the loop, depending on the current column number.
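In compact, standalone C, that three-way test looks something like the sketch below (toy whole-number data, so the equality test is safe; not your real program):

/* Compact sketch of the three-way test described above for an ascending
   list: over the target -> stop this branch; equal at the last column ->
   report it; under -> add another column.  Toy data only. */
#include <stdio.h>

#define N 6
static const double vals[N] = { 2, 3, 5, 8, 13, 21 };
static const double target  = 26;
static const int    maxCols = 3;
static double picked[8];

static void search(double sum, int depth, int start)
{
	int i, k;
	for (i = start; i < N; ++i)
	{
		double s = sum + vals[i];
		if (s > target)                  /* ascending: nothing later fits */
			return;
		picked[depth] = vals[i];
		if (depth + 1 < maxCols)
			search(s, depth + 1, i + 1); /* add another column            */
		else if (s == target)            /* last column completes the sum */
		{
			for (k = 0; k <= depth; ++k)
				printf("%.0f ", picked[k]);
			printf("\n");
		}
	}
}

int main(void)
{
	search(0, 0, 0);                     /* prints 2 3 21 and 5 8 13      */
	return 0;
}

The same structure applies whether you scan the list ascending or descending; only the direction of the "can't possibly fit any more" test changes.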
I already do this. If the sum is greater, I exit the sub; if the sum is the same, I print it (I don't exit, because I want other rows with the same value); if it's smaller, I keep going.

The point of limiting the number of columns to user input isn't to save time. It's to keep results separate. We always search 1 row solutions first. When we examine them, if we still need to find results, then we move on to 2 row solutions. At this point, seeing as we've already examined 1 row solutions, all we want is 2 row solutions. No need to repeat data that was already examined and that will only add to the total number (which will cut off at X lines, 500 in my code).
Without printing anything anywhere, just running the calculations (commented the iCol and GenRow lines), a 3 row solution for 7445.2 calculated 0 results after 1145 seconds, or just a bit over 19 minutes.

This is the only important part of the code:
  Protected Sub SumArr(CurSum As Double, CurLevel As Integer, CurIndex As Integer)
    Dim CurSum2 As Double
    For a = CurIndex To TotalLines
      CurSum2 = CurSum + iVal(2, a)
      If CurSum2 > DesiredSum Then
        Exit Sub
      End If
      If CurLevel < MaxLevel Then
        'For x = 0 To 2
        '  iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
        'Next
        SumArr(CurSum2, CurLevel + 1, a + 1)
      Else
        If CurSum2 = DesiredSum Then
          'For x = 0 To 2
          '  iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
          'Next
          'GenRow(iCol)
          CurrentLevel += 1
          If CurrentLine >= 500 Then
            'TODO: %
            Exit Sub
          End If
        End If
      End If
    Next
  End Sub

Open in new window

One thing that could be causing it is that maybe an array of doubles isn't as efficient, especially because it has to be a two-dimensional one (to store number, code and value).
I don't know if your code is taking that into account.

Could you try to translate the above to C and see if there's any significant difference in performance compared to your own? If there is, I imagine I implemented this the wrong way. If not, then VB.Net is WAY slower, possibly due to iterating a multi-dimensional array.
Hi Cluskitt,

Lines 55 - 64 have a couple of inefficiencies.  

It looks like they pre-determine if calling the function to process the next column will be successful.  If the answer is "no", the call isn't made, but if the answer is "yes" the effort is redundant since the function will need to recalculate things to process the next column.  I'd just call the function and let it make its own determination.

Lines 62 - 64 need to be moved outside of the function, probably to GenRow().  It looks like you're doing a "binary to display" conversion (float to string) every time the function advances to the next column.  Depending on the number of rows in the search, the function could be spending more time in needless data conversion than in searching.

These are the critical lines:

      If CurLevel < MaxLevel Then
        CurSumMax = CurSum2
        For r = TotalLines To TotalLines - (MaxLevel - CurLevel) Step -1
          CurSumMax += iVal(2, r)
        Next
        If CurSumMax < DesiredSum Then
          Exit Sub
        End If
        For x = 0 To 2
          iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
        Next
        SumArr(CurSum2, CurLevel + 1, a + 1)
      Else

Open in new window


Move the float to string conversion outside of the loop and those lines can be shortened to:

      If CurLevel < MaxLevel Then
        SumArr(CurSum2, CurLevel + 1, a + 1)
      Else

Open in new window



Kent
GenRow increments the number of solutions found so commenting out the call to the function means that the program never acknowledges that a solution was found.

And taking 20 minutes is a good indication that VB is doing a lot of object management (probably memory) that's taking processor time away from the task at hand.


Kent
I thought of another option to take advantage of the 50000 cutoff.

Ignore the negatives.

kdo's original C code worked from largest to smallest, so he'd get that "for free".
But if you're working from smallest to largest, start at the first element that is >= 0.

If, by chance, the data and target are such that you can't find 50000 results using only positive numbers then you can go ahead and use negatives to finish the set.

However, due to that complexity, it might be easier to just use kdo's top-down search.
The fact that it might drop out some results isn't particularly important if we know we can stop early anyway.

Of course, I'm still looking at the "bad cases" where there are many, many combinations - basically because that's the more interesting part of the problem.  We can brute-force 1 and 2 value searches and a couple of trivial filters will let 3-value searches finish quickly too.
When I commented out GenRow, I moved the CurrentLine increment to just above it.
I'm converting floats to strings to print them to a label, but commenting those lines out doesn't help either. Running just this code:
  Protected Sub SumArr(CurSum As Double, CurLevel As Integer, CurIndex As Integer)
    Dim CurSum2 As Double
    For a = CurIndex To TotalLines
      CurSum2 = CurSum + iVal(2, a)
      If CurSum2 > DesiredSum Then
        Exit Sub
      End If
      If CurLevel < MaxLevel Then
        'For x = 0 To 2
        '  iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
        'Next
        SumArr(CurSum2, CurLevel + 1, a + 1)
      Else
        If CurSum2 = DesiredSum Then
          'For x = 0 To 2
          '  iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
          'Next
          'GenRow(iCol)
          CurrentLevel += 1
          If CurrentLine >= 500 Then
            'TODO: %
            Exit Sub
          End If
        End If
      End If
    Next
  End Sub

Open in new window

(basically, just counting results), still takes a long time. I tried removing the panel and running just the count. It took 811 seconds, about 13.5 minutes.

It should be noted that:
1- This laptop isn't very good. In fact, I have a hard time with VS2010, because it constantly hogs most of the memory (1GB).
2- I'm running the exe that is created when I debug (i.e., not debugging, but using the exe directly).

However, while it might be a bit slower, these are the conditions I have to deal with, so a solution will have to work reasonably well on this computer.
Can you, using the VB.Net code, adapt your C to retrieve the data into a multi-dimensional array (always writing something like: Num1;Code1;Value1;Num2;Code2;Value2;etc...)?

I ask this because I have a feeling a multi-dimensional array will slow things down (even though we're only actively using one of the dimensions) and I would like to compare and see how it runs on this computer. If I can achieve a 3 row solution in a few seconds, then that is a LOT better than it was before. At the very least, a 5 row solution for smaller subsets would be possible (about 200-400 rows is the most common size; 5k+ is only for one company, though it's also important).
Ignoring the negatives, while it sounds good from a programming standpoint, wouldn't be advisable for finding a real solution. There are many codes that only have negative values, and those would be ignored. The only gain would be on gigantic result sets, but on manageable ones it would actually slow things down, I think. After all, the biggest problem is running the combinations. The number of results is secondary to that.
>>> There are many codes that only have negative values, and those would be ignored.

ok, I don't know your real distributions so thought I'd throw it out there.

data trumps algorithm  :)
The sample I provided is a real life sample. You can check this situation (the negatives) on codes like 160, 161, 182, etc.
Go to lunch, miss a lot.  :)

VB does array bounds checking while C does not.  Every single array reference requires additional computation to ensure that the reference is in bounds.  So while C merely dereferences a pointer (Array + offset), which can be done in a single instruction (depending on the processor), the bounds check at least doubles the necessary CPU instructions, perhaps even more.
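Just to illustrate the point (purely illustrative C; not what either compiler actually emits):

/* Tiny illustration of the point above: a[i] in C is just *(a + i) with no
   range test, while a checked access adds a compare-and-branch on every
   reference.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

static double checked_get(const double *a, int n, int i)
{
	if (i < 0 || i >= n)              /* the extra work a bounds check adds */
	{
		fprintf(stderr, "index %d out of range\n", i);
		exit(1);
	}
	return a[i];
}

int main(void)
{
	double a[4] = { 1.5, 2.5, 3.5, 4.5 };
	double x = a[2];                  /* unchecked: pointer + offset         */
	double y = checked_get(a, 4, 2);  /* checked: compare, branch, then read */
	printf("%.1f %.1f\n", x, y);
	return 0;
}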

Something to note is that I'm mostly running this on my desktop computer.  It's an AMD Phenom processor with 64KB of L1 cache.  The entire program should fit into cache.

I suspect that the use of a binary search to determine the starting point of the loop within the recursive function may help.  If only a few items are skipped, the trade-off is next to zero, but if the number of skipped values is more than a handful, there's a boost to be had.  It's microseconds per pass, but we're performing a LOT of passes.
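Something along these lines is what I have in mind; a standalone sketch on toy data (the array is ascending, so the lookup returns the last entry that can still fit under the remaining amount, and the backwards scan can begin there):

/* Sketch of the binary-search idea above: in an ascending array, find the
   last index whose value is <= remaining (target minus the partial sum),
   so the backwards scan can start there instead of at the end of the
   table.  Standalone, illustrative values only. */
#include <stdio.h>

/* returns the index of the last element <= limit, or -1 if none */
static int last_not_above(const double *a, int n, double limit)
{
	int lo = 0, hi = n - 1, ans = -1;
	while (lo <= hi)
	{
		int mid = lo + (hi - lo) / 2;
		if (a[mid] <= limit) { ans = mid; lo = mid + 1; }
		else                 { hi = mid - 1; }
	}
	return ans;
}

int main(void)
{
	const double vals[] = { -50, 10, 75, 120, 400, 980 };
	double remaining = 130;           /* target minus the partial sum */
	int start = last_not_above(vals, 6, remaining);
	printf("start the backwards scan at index %d\n", start);
	return 0;
}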

An earlier test described looking for 7554.07 in the original data.  The 33 solutions that I find (with MaxColumns == 3) are returned in about a second.  The recursive function is called 847,873,440 times to examine the data.  So saving a few instructions per pass could be significant when adding that 4th column.

I resolved the issue with negative numbers by biasing all of the data (adding a value to all of the items so that there are no negative numbers).  The search takes the bias into account when searching through the data.

Here's the code as it exists now.  Be glad to explain anything.

//---------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#pragma hdrstop

//---------------------------------------------------------------------------
#pragma argsused

	int TableType = 0;

	double *TValues;
	double SValues[10];
	double Target;
	double Bias = 0;

	int Limit;
	int MaxValues;
	int SCount = 0;

	long long Passes = 0;

	FILE *R;             // Results Stream

//---------------------------------------------------------------------------
void PrintSolution (int Depth)
{
	int idx;

	fprintf (R, " Found: %d :", ++SCount);
	for (idx = 0; idx <= Depth; ++idx)
		fprintf (R, " %.2f", (SValues[idx] - Bias) / 100.0);
	fprintf (R, "\n");
}
//---------------------------------------------------------------------------
// FindOperand -- find the next suitable column from the list
//---------------------------------------------------------------------------
int FindOperand (double Sum, int Depth, int Limit, int Target)
{
	int TPosition;

	++Passes;

// Scan the list backwards from the starting point (Limit)
		for (TPosition=Limit-1; TPosition >= 0; --TPosition)
			if (Sum + TValues[TPosition] == Target)
			{
				SValues[Depth] = TValues[TPosition];
				PrintSolution (Depth);
			}
			else if (Sum + TValues[TPosition] < Target)
			{
				if (Depth+1 < MaxValues)
				{
					SValues[Depth] = TValues[TPosition];
					FindOperand (Sum + TValues[TPosition], Depth+1, TPosition, Target + Bias);
				}
				else
					return 0;
			}
		return 0;
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
void LoadTable (void)
{
	FILE *F;
	int A, B, C;
	float V;
	int Status;
	int TableSize;
	int i;

	switch (TableType)
	{
		case 1:
			TableSize = 6000;
			TValues =  malloc (TableSize * sizeof (double));
			Limit = 5400;
			for (i = 0; i < Limit; i++)
				TValues[i]  = i * 2 + 2;
			break;

		default:
			F = fopen ("E:\\Documents and Settings\\Kent.YOUR-0B3AB6E8FA\\My Documents\\RAD Studio\\sqldata.txt", "rt");
			Limit = 0;

			if (F)
			{
				TableSize = 1000;
				TValues = malloc (TableSize * sizeof (double));
				while (1)
				{
					Status = fscanf (F, "(%d,%d,%d,%f)\n", &A, &B, &C, &V);
					if (Status <= 0)
						break;
					if (Limit +1 >= TableSize)
					{
						TableSize *= 2;
						TValues = (double*) realloc (TValues, TableSize * sizeof (double));
					}
					TValues[Limit++] = (long) (V * 100.0 + (V < 0 ? -0.5 : 0.5));	/* round to whole cents */
				}
				fclose (F);
			}
	}
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
void SortTable (void)
{
	int x, y;
	double V;

	for (x = 1; x < Limit; ++x)
	{
		if (TValues[x-1] > TValues[x])
		{
			for (y = x-2; y>=0; --y)
				if (TValues[x] > TValues[y]) 
					break;
			++y;
			V = TValues[x];
			memmove (&TValues[y+1], &TValues[y], (x-y) * sizeof (double));
			TValues[y] = V;
		}
	}
}
//---------------------------------------------------------------------------
//  PackTable -- Squeeze out duplicated values
//---------------------------------------------------------------------------
void PackTable (void)
{
	int Source;
	int Destination;

	Source = 1;
	Destination = 0;

	while (Source < Limit)
	{
		if (TValues[Source] != TValues[Destination])
			TValues[++Destination] = TValues[Source];
		++Source;
  }
	if (Limit > 0)
		Limit = Destination + 1;	/* number of unique values kept */
}
//---------------------------------------------------------------------------
//  BiasTable -- Add ABS(lowestvalue) to every item in the table
//               when it is negative so that all values are positive
//               and increasing.
//---------------------------------------------------------------------------
void BiasTable (void)
{
	int idx;

	if (Limit)
		if (TValues[0] < 0)
		{
			Bias = -TValues[0];
			for (idx = 0; idx < Limit; ++idx)
				TValues[idx] += Bias;
		}
}
//---------------------------------------------------------------------------
int main(int argc, char* argv[])
{
	int TPosition;
	int idx;
	int i;
	FILE *F;

	time_t rawtime;
	struct tm * timeinfo;

	time ( &rawtime );
	timeinfo = localtime ( &rawtime );
	printf ( "Start: %s\n", asctime (timeinfo) );

	Target = 11241;
	Target = 755407;

	MaxValues = 3;

	LoadTable ();
	SortTable ();
//	PackTable ();

	F = fopen ("C:\\Values.txt", "wb");
	for (idx = 0; idx < Limit; ++idx)
		fprintf (F, "%.2f\n\r", TValues[idx]/100.0);
	fclose (F);
	BiasTable ();

	R = fopen ("C:\\Results.txt", "wb");
	FindOperand (0, 0, Limit, Target+Bias);
	fclose (R);

	time ( &rawtime );
	timeinfo = localtime ( &rawtime );
	printf ("End  : %s\n", asctime (timeinfo) );
	printf (" %d Solutions found.\n", SCount);
	printf (" %ld passes.\n");
	return 0;
}
//---------------------------------------------------------------------------

Open in new window

Below is the binary search logic.

First it loads the data from the rh_vcrt0 table into a temp table in order to add an identity and sort by vcr_valor_bruto.  That only needs to be done ONCE per set of data to be analyzed; once you've loaded it, comment out the load section to run further binary searches.

For the initial run, I chose:
@value_to_find = 7554.07
@count_of_values_to_join = 3 --3-level join

The code returns rh_ident 5416 (of a total of 5470 values).

[I have a fairly fast server, so this runs in 00:00:00 on my machine, but it should be fast enough on virtually any hardware.]

That means that when doing the 3-level join, one level can be restricted to ONLY values from #5416 on.  Other values can be IGNORED, because it's known that it's impossible for those values to reach the desired total.

Something like:

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto < @value_to_find
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident >= 5416 AND
    sv3.rh_ident > sv2.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto + sv3.vcr_valor_bruto = @value_to_find

That will eliminate a lot of unnecessary permutations.

Just by changing the two initial params, you can pre-test for any total and level.




--USE <_db_containing__rh_vcrt0__table>

DECLARE @value_to_find decimal(14, 2)
DECLARE @count_of_values_to_join int

SET @value_to_find = 7554.07
SET @count_of_values_to_join = 3


DECLARE @search_count int
DECLARE @search_low_entry int
DECLARE @search_high_entry int
DECLARE @search_midpoint int
DECLARE @sum decimal(14, 2)

/* run this code ONCE for each new set of search values, then comment it out for further analysis
IF OBJECT_ID('tempdb.dbo.rh_vcrt0_sorted') IS NOT NULL
    DROP TABLE tempdb.dbo.rh_vcrt0_sorted
IF OBJECT_ID('tempdb.dbo.rh_search_values') IS NOT NULL
    DROP TABLE tempdb.dbo.rh_search_values
CREATE TABLE tempdb.dbo.rh_search_values (
    rh_ident smallint NOT NULL,
    vcr_valor_bruto decimal(14, 2) NOT NULL
    )
CREATE UNIQUE CLUSTERED INDEX rh_search_values__CL ON tempdb.dbo.rh_search_values ( rh_ident )

-- insert the original values into a new table to: add an identity and sort by vcr_valor_bruto.
SELECT
    IDENTITY(smallint, 1, 1) AS rh_ident, *
INTO tempdb.dbo.rh_vcrt0_sorted
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
    --vcr_valor_bruto <= @value_to_find
WHERE
    vcr_emp_empresa = 32
ORDER BY
    vcr_valor_bruto

-- insert the sorted values into a search table to: make search table as small as possible.
INSERT INTO tempdb.dbo.rh_search_values
SELECT
    rh_ident, vcr_valor_bruto
FROM tempdb.dbo.rh_vcrt0_sorted
ORDER BY
    rh_ident
*/


----------------------------------------------------------------------------------------------------
-- actual binary search logic

SET @search_count = (SELECT COUNT(*) FROM tempdb.dbo.rh_search_values)

SET @search_low_entry = 0
SET @search_high_entry = @search_count

IF @search_high_entry = 0
    GOTO ExitCode

WHILE 1 = 1
BEGIN
    SET @search_midpoint = (@search_low_entry + @search_high_entry) / 2
	SELECT @sum = SUM(vcr_valor_bruto)
	FROM tempdb.dbo.rh_search_values
	WHERE rh_ident BETWEEN @search_midpoint AND @search_midpoint + (@count_of_values_to_join - 1)
	IF @sum = @value_to_find
	    BREAK
	IF @sum < @value_to_find
	    SET @search_low_entry = @search_midpoint
	ELSE
	    SET @search_high_entry = @search_midpoint
	IF @search_high_entry - @search_low_entry <= 2 --@count_of_values_to_join
	    BREAK
END

ExitCode:

SELECT 'End', @search_count AS Total_#Values, @search_low_entry AS starting_rh_ident, @value_to_find AS value_to_find,
    sum_of_bruto, row_count
FROM (
    SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count 
    FROM tempdb.dbo.rh_search_values  
    WHERE rh_ident BETWEEN @search_low_entry AND @search_low_entry + (@count_of_values_to_join - 1)
) AS rh_values

/*
-- show next values, to find very last starting_rh_ident that is valid
SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 1 AS starting_rh_ident, @value_to_find AS value_to_find,
    sum_of_bruto, row_count
FROM (
    SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count 
    FROM dbo.rh_search_values  
    WHERE rh_ident BETWEEN @search_low_entry + 1 AND @search_low_entry + 1 + (@count_of_values_to_join - 1)
) AS rh_values

SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 2 AS starting_rh_ident, @value_to_find AS value_to_find,
    sum_of_bruto, row_count
FROM (
    SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count 
    FROM dbo.rh_search_values  
    WHERE rh_ident BETWEEN @search_low_entry + 2 AND @search_low_entry + 2 + (@count_of_values_to_join - 1)
) AS rh_values

SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 3 AS starting_rh_ident, @value_to_find AS value_to_find,
    sum_of_bruto, row_count
FROM (
    SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count 
    FROM dbo.rh_search_values  
    WHERE rh_ident BETWEEN @search_low_entry + 3 AND @search_low_entry + 3 + (@count_of_values_to_join - 1)
) AS rh_values

SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 4 AS starting_rh_ident, @value_to_find AS value_to_find,
    sum_of_bruto, row_count
FROM (
    SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count 
    FROM dbo.rh_search_values  
    WHERE rh_ident BETWEEN @search_low_entry + 4 AND @search_low_entry + 4 + (@count_of_values_to_join - 1)
) AS rh_values

SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 5 AS starting_rh_ident, @value_to_find AS value_to_find,
    sum_of_bruto, row_count
FROM (
    SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count 
    FROM dbo.rh_search_values  
    WHERE rh_ident BETWEEN @search_low_entry + 5 AND @search_low_entry + 5 + (@count_of_values_to_join - 1)
) AS rh_values
*/

Open in new window

Hi Cluskitt,

I just saw something in your code that might help.

  https://www.experts-exchange.com/questions/27905416/Adding-different-values-from-a-table-to-find-a-specific-sum-value.html?anchorAnswerId=38534888#a38534888

At line 19 I think that you want to increment CurrentLine.

Line 22 is an "Exit Sub".  That only ends the iteration on the current column.  It doesn't unwind the recursive calls already on the stack so the calling iteration will just advance to the next value for the previous column and recall the function.  Perhaps the function should normally return 0 and return 1 when it's found the 500 solutions.  Of course, the function would have to test the return value from the recursive call.  :)


Kent
Hi Scott,

I'm not following your description very well.  :(  Can you help me out?  And are you using Cluskitt's sample data?


>> That means that when doing the 3-level join, one level can be restricted to ONLY values from #5416 on.  Other values can be IGNORED, because it's known that it's impossible for those values to reach the desired total.

I think that's data dependent.  If the maximum value in the data is 2,600, three rows with values between 2,500 and 2,600 could exist that sum to 7554.07.  In the actual data, it looks like only the 5 highest values cannot possibly be part of a 3-column solution.


A couple of observations about the SQL:

--  In the first join, the first term should probably be 'sv2.rh_ident >= sv1.rh_ident'.  If we're searching for 100, then 20+20+60 is a valid solution.  That will also require another term to filter rows by PK where the row would be joined to itself.

--  But I think that the SQL will generate entirely too many intermediate rows to be effective.  The nature of the problem requires that the join between sv1 and sv2 occurs before the join of sv3.  That should result in about 12,000,000 intermediate rows that must all be joined to sv3 and filtered.  That's gonna take some time.



Kent
Ident is just an identity value to uniquely identify each row, which is assigned AFTER the input values are sorted, viz:

-- insert the original values into a new table to: add an identity and sort by vcr_valor_bruto.
SELECT
    IDENTITY(smallint, 1, 1) AS rh_ident, *
INTO tempdb.dbo.rh_vcrt0_sorted
FROM dbo.rh_vcrt0  --<<-- assumed to contain sample data posted earlier by Author
WHERE vcr_emp_empresa = 32
ORDER BY vcr_valor_bruto



>> In the first join, the first term should probably be  'sv2.rh_ident >= sv1.rh_ident' <<
No, as that would join a row to itself.  Duplicate values may appear, of course, but any single row cannot be joined to itself in the solution.
Ident is NOT the amount being added, it's JUST an identity value.


>> I think that's data dependent. <<

Yes, of course.  That's the point of the binary search -- to determine, from this specific set of data, what is the minimum value that needs included to reach the desired total.

What the binary search tells you is that AT LEAST ONE of the values in the total must be this large.  So, what 5416 means is that when I add the values at idents 5415+5414+5413, the result is LESS than I need; therefore, no combination of values where ALL rows have idents less than 5416 needs to be tested, because it's IMPOSSIBLE for those rows to ever reach the desired total.


>> In the actual data, it looks like only the 5 highest values cannot possibly be part of a 3-column solution. <<

Note that I'm applying the restriction to only ONE level.  The other two levels can be ANY ident.
IOW, say I have the values:

1; 7; 11; 21; 45; 49.

And I say give me 3 rows that total 57.

When I add 7+11+21 I get only 39, less than the total sought, therefore value#5 or larger MUST be included to produce the desired result.  That is, I DO NOT EVEN NEED TO CONSIDER:
1+2+3
1+2+4
2+3+4

That's what the binary search code above determines for the specified input.

If THE 3/N HIGHEST values in a subset cannot reach the total, then any values guaranteed to be lower could not possibly reach it either.

Likewise, if I said give me three rows that total 81, since 11+21+45=77, then I know that value #6 (or later) MUST be included, and I can ignore all 3-value combinations/permutations of only idents 1-5.

The KEY is that you have SORTED the input values BEFORE making the determination.

Note that negative numbers do NOT affect this computation.

Does this make sense now?
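For what it's worth, that cutoff can be computed in a few lines; here is an illustrative, standalone C version using the toy values from the example above (a linear scan for clarity; since the window sums only grow as the starting position moves up, a binary search does the same job):

/* Sketch of the cutoff idea above: with the values sorted ascending, find
   the largest prefix whose k biggest members still sum to less than the
   target.  Every valid k-row combination must then include at least one
   value from beyond that prefix.  Toy data from the worked example. */
#include <stdio.h>

int main(void)
{
	const double vals[] = { 1, 7, 11, 21, 45, 49 };   /* sorted ascending */
	const int    n      = 6;
	const int    k      = 3;
	const double target = 57;

	int end, i;
	int cutoff = 0;                  /* 0 means: no k-row solution at all */
	for (end = k; end <= n; ++end)   /* prefix of length `end`            */
	{
		double topk = 0;
		for (i = end - k; i < end; ++i)
			topk += vals[i];         /* k largest values in that prefix   */
		if (topk >= target) { cutoff = end; break; }
	}
	if (cutoff)
		printf("at least one value must be #%d or later\n", cutoff);
	else
		printf("no %d-row combination can reach %.0f\n", k, target);
	return 0;
}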
Hi Scott.

Thanks.  The SQL makes more sense now.  I didn't realize that you were generating an identity column for the table as the original data already had a composite PK.  

That looks like it could be an effective way to establish the search's starting point.

Now the question is the performance.

If the search target is sufficiently large that the sum of the values in the first join is less than the target, all possible combinations will be generated.  The OP's sample data contains 5,470 lines.  This could explode into 14,954,980 rows that must then be joined with the third column data.

It takes a little while to generate 15 million intermediate rows from the cross product.  And considerably more time to join them into another cross product and filter the results.  That's potentially 81 billion rows.  

I'm not aware of any DBMS capable of generating and filtering that many rows in anywhere close to a reasonable amount of time.  Or is there another filter/short-exit that I'm not seeing?


Kent
>> It takes a little while to generate 15 million intermediate rows from the cross product <<

That's actually proven to be remarkably quick in SQL, since some results are instantly ignored as they exceed the max value being sought.



>>  That's potentially 81 billion rows. <<

Ah, but that's the importance of the binary search result.  With it, you only have to join the 2-level results to the rows from 5416 on -- so the possible results are 2nd-level * 55 instead of 2nd-level * 5470 -- BIG difference.

Again, because of the binary search result, we KNOW that combinations where all idents are lower than 5416 CANNOT yield a valid result.  So we can safely FORCE AT LEAST ONE of the values to be #5416 or higher.

(Again, remember that 5416 is the relative SORTED value #, not the value itself, and NOT an unsorted id.)
>>  I didn't realize that you were generating an identity column for the table as the original data already had a composite PK. <<

Yes, but being a composite key, it is more overhead to work with.  Moreover, the original values are unsorted, and hence offer no chance to streamline the computations.

Therefore, I insert all the original data into a SORTED table with a single smallint identity that is a pointer back to all of the original data. [smallint handles up to 32767; the requester has specified a max of 10000 values; realistically, permutations on anything above that would be too prohibitive anyway]

Finally, for maximum search/computational efficiency, I take just the identity pointer and the actual value to be summed and store them, SORTED, into a new table (what I call the "search" table).  That makes each row only 11 data bytes, with no variable length columns.
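For example (a sketch, with illustrative ident values), getting back from the idents in a result row to the original columns is just a join on that pointer:

SELECT o.vcr_num_empregado, o.vcr_vnc_cod_venc, o.vcr_valor_bruto
FROM tempdb.dbo.rh_vcrt0_sorted o
WHERE o.rh_ident IN (5416, 5417)  -- idents copied from a solution row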
@ScottPletcher: I really don't follow your code at all. Nevertheless, I tried running it as is, but all I succeeded in doing is getting this line:
(No column name)      Total_#Values      starting_rh_ident      value_to_find      sum_of_bruto      row_count
End      5897      5842      7554.07      7446.00      3

If I uncomment the last ones, what I get is further lines all similar, but with the sum_of_bruto column incremented. It takes virtually no time to run, but how can I get my results?

@Kdo: I'll try to adapt your code to get the data from the DB, though I'm a bit rusty on C. I'll have to install GCC as well. I won't probably do that today, so most likely, I'll try it on Monday, when I have a bit more free time.

At this point, though I'd love to have a working solution, I'm actually more interested in the theoretical part of this, and getting something to work as fast as possible. Right now, I'd just love to see how much faster I can make this. Even if a 5 row solution can only be achieved in 2 hours, or even 30 minutes, it won't be practical for use, but I would feel very gratified :)
One idea I did have, which I probably won't be able to work out on my own:
Isn't it possible to join both solutions? I mean, an SQL intermediate solution that is then further treated in C? I don't necessarily need a "pure" solution.
Hi Cluskitt,

Actually, since SQL Server doesn't really support Stored Procedures in C (at least I don't think that it does) any solution that uses a C program is a hybrid solution.

The performance of these algorithms/implementations is hugely dependent on the data.  Scott's technique starts by selecting the minimum value (row) that must be part of any solution, mine works in reverse and eliminates those rows that can't possibly be part of any solution.  Test both of them side-by-side (in the same language) and one will perform better as the target value increases, the other will perform better as the target decreases.

For a 5-value solution, I'm concerned over disk space.  The 4-value solution required 147MB to store the solutions in ASCII.  I can envision the 5-value solution not fitting onto a hard drive.

I'm going to modify the C program to count the 5-values solutions and see where it is after about an hour.  :)


Kent
Disk space isn't an issue. We did specify that we want a max of 50k rows. So, once that number has been reached, the program will exit and no further space is wasted. This would be true whether in C or SQL.
I was thinking more about the magnitude of the problem, not the first 50,000 results.  :)

The theory goes that with N data points, there could be up to N^y solutions.  5,000 data points means that there could be up to 5,000 1-column solutions.  25,000,000 2-column solutions.  125,000,000,000 3-column solutions, 625,000,000,000,000 4-column solutions. 3,125,000,000,000,000,000 5-column solutions.

That's why the 4 and 5 column solutions are so daunting.  The algorithms need to identify things to not bother checking without investing many resources in the filter.


Kent
I know the processing part of it is the hardest part of this. But you were concerned over disk space and that isn't an issue (although memory usage might be).
>> End      5897      5842      7554.07      7446.00      3 <<

Yep, that IS the result.  What that tells you is that the last-level join can START at 5842 (out of 5897 values).


So instead of having to code:

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto < @value_to_find
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident > sv2.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto + sv3.vcr_valor_bruto = @value_to_find

You can change the third join to this:
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident >= 5842 AND
    sv3.rh_ident > sv2.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto + sv3.vcr_valor_bruto = @value_to_find

Eliminating a huge number of permutations from consideration.



>> For a 5-value solution, I'm concerned over disk space. <<

Are you using my stripped-down "search" table, or the original table with all columns?
Again, that's why I created the absolutely-minimum-columns-necessary "search" table, to reduce the search overhead as much as possible.
Additional short-cuts could be obtained by using the SORTED input values, again with a binary search.   I don't have time to lay out the full code, but it's the same general search.

For example, if at the first 2 levels I add value #1 and value #2 (both likely negative), only an extremely small subset of the original rows could ever possibly yield a result.

The SORTED original values only help with one level of the search, though.  For higher levels, you need the 2-level values stored (SORTED, of course), so you can use a similar technique/logic to reduce more than one level of searches.
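A rough sketch of what I mean by storing the 2-level values (not final code; the table and column names are illustrative, and it assumes the stripped-down "search" table I described):

SELECT sv1.rh_ident AS rh_ident1, sv2.rh_ident AS rh_ident2,
       sv1.vcr_valor_bruto + sv2.vcr_valor_bruto AS bruto_2level
INTO tempdb.dbo.rh_search_values_2level
FROM tempdb.dbo.rh_search_values sv1
INNER JOIN tempdb.dbo.rh_search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident

-- "SORTED" here really just means indexed on the pair total,
-- so the higher levels can do a keyed lookup against it
CREATE CLUSTERED INDEX rh_search_values_2level__CL
    ON tempdb.dbo.rh_search_values_2level ( bruto_2level )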
I liked Kdo's solution of biasing the results so they're all positive. I imagine that would be very helpful for your code, as that would eliminate the negatives problem. As long as you account for the bias when searching for the target, I can't see any drawback to this.
>> I resolved the issue with negative number by biasing all of the data (adding a value to all of the items so that there are no negative numbers).  The search takes the bias into account when searching through the data. <<

That's tricky.

What specifically are you doing to "take the bias into account when searching"?

You have to do bias_amount * #_of_values_combined at that level, right?
>>> Eliminating a huge number of permutations

are you really trying to eliminate permutations?  I don't see anywhere in your code or anyone else's for that matter where permutations will be a problem except in the code in the question itself.  That's, in part, why we sort.

Permutations aren't what we want anyway.  We want combinations, which are fewer in number.

You've used the word "permutation" previously - is it a mistake or are you using that word intentionally? If it is intentional,  why are you trying to generate/filter permutations and how is that supposed to be happening?
>>> What specifically are you doing to "take the bias into account when searching"?

yes, you're right, the bias has to be included once for each level.
You can see where that happens in the recursion.
Each call deeper increments the target by the bias amount.


FindOperand (Sum + TValues[TPosition], Depth+1, TPosition, Target + Bias);   (line 59)

he also does it at line 199 to get it started.
I think we need to index the data by value, rather than by ident (or by value also, if the ident is still useful for other purposes).  Then for the final level we just select the matching value(s), with NO further combinations required.



CREATE TABLE tempdb.dbo.rh_search_values (
    rh_ident smallint NOT NULL,
    vcr_valor_bruto decimal(14, 2) NOT NULL
    )
CREATE UNIQUE CLUSTERED INDEX rh_search_values__CL ON tempdb.dbo.rh_search_values ( vcr_valor_bruto, rh_ident )



3-level search:

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident --AND
    --sv1.vcr_valor_bruto + sv2.vcr_valor_bruto < @value_to_find --not valid w/neg #s
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident > sv2.rh_ident AND
    sv3.vcr_valor_bruto = @value_to_find - sv2.vcr_valor_bruto - sv1.vcr_valor_bruto


Then the last level does NO explosion at all, it just does a keyed lookup for the needed value.

That should make the 3-level lookup almost as fast as the 2-level lookup.

It also makes the 4-level lookup only require gen'ing the 3-level number of combinations.
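For example, a 4-level version of the same pattern would look roughly like this (a sketch, using the same variable names as above): three explosion levels plus a keyed lookup for the fourth value.

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident > sv2.rh_ident
INNER JOIN dbo.search_values sv4 ON
    sv4.rh_ident > sv3.rh_ident AND
    sv4.vcr_valor_bruto = @value_to_find
        - sv1.vcr_valor_bruto - sv2.vcr_valor_bruto - sv3.vcr_valor_bruto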
Hi Scott,

That's closer to what I had thought your previous solution was trying to do.

But the index is doomed to failure.  The data has duplicates that can't be ignored.


Kent
Ok.  One more round to report.

I modified the C code and came up with some more results to report.

--  All of the data items are stored as integers, at 100 times the original value.  Integer operations (math and comparison) are generally faster than FP data types.

--  When iterating through the data, ignore duplicate values for the current column.  (If 121 occurs 4 times, the 2nd - 4th checks of the value are redundant: if the value was rejected the first pass it will be rejected the next 3 times, too, and if it was a solution the first pass it will be a solution the next 3 times, too.)  This eliminates duplicate results but retains rows with duplicate values.  It also significantly reduces the number of data items that are tested.  It has a high cost in that the first test of each data item is to see if it is a repeat, but the overall result is a faster program.

--  Duplicate results are ignored.

--  Rows with duplicate values are found.

--  Solutions are found and group by number of columns.

Oddly, the program runs about 1% slower when it's compiled for speed than for debug.  82 seconds vs 81 seconds searching for all 1, 2, 3, and 4 column solutions.  And these timings are repeatable.
>>> Duplicate results are ignored.

is that appropriate?
The combinations are based on numbers, but the numbers are tied to other data.

I can buy ice cream $1
I can buy candy for $1
I can buy a drink for $2

how many ways can I spend $3?

If I'm reading your explanation correctly I'll get one  solution of $1 + $2
but the asker needs both solutions.  

ice cream + drink  
and
candy + drink
I suppose you could keep track of all the duplicates and then apply another round of combination searching on the other fields to expand a collapsed duplicate into all variations of the same amounts.

But that seems like more complexity when you could just include them in the original solution.

Maybe it'd be worth it though.
>> But the index is doomed to failure.  The data has duplicates that can't be ignored. <<

Huh?  The index won't ignore duplicates.  On the contrary, it will join every value that matches, regardless of how many times it appears in the input.



Both indexes are needed on the search table, as follows:


CREATE TABLE tempdb.dbo.rh_search_values (
    rh_ident smallint NOT NULL,
    vcr_valor_bruto decimal(14, 2) NOT NULL
    )

CREATE UNIQUE CLUSTERED INDEX rh_search_values__CL ON tempdb.dbo.rh_search_values ( rh_ident )
CREATE UNIQUE NONCLUSTERED INDEX rh_search_values__IX_vcr_valor_bruto ON tempdb.dbo.rh_search_values ( vcr_valor_bruto, rh_ident )
Yeah, we could DISTINCT out the duplicates in the search values, then re-expand them at the end to include all the original rows with that value.
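Roughly like this (a sketch; the temp table names are just for illustration):

-- search against distinct values only
SELECT DISTINCT vcr_valor_bruto
INTO tempdb.dbo.rh_distinct_values
FROM tempdb.dbo.rh_search_values

-- ... run the combination search against rh_distinct_values ...

-- then re-expand each matched value (per column of the solution) back to
-- every original row carrying that value
SELECT r.rh_ident, r.vcr_valor_bruto
FROM tempdb.dbo.rh_matched_values m   -- hypothetical table holding the matched values
INNER JOIN tempdb.dbo.rh_search_values r ON
    r.vcr_valor_bruto = m.vcr_valor_bruto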
Hi stuber,

I'm trying to squeeze as much performance out as I can so that a 5-column solution has a chance to actually complete.  If that kind of process results in a significant enough improvement, it's worthwhile to try.  (I know that I'm losing focus on the 50,000 limitation, but I'm still curious about the general solution.)

By filtering the search to avoid finding duplicate rows, the search time drops from about 350 seconds to 81.  If that extrapolates to the 5-column solution, it's the difference between a 10 hour run (run overnight) and 2+ days.  I'm giving up after the second night.  :)



Hi Scott,

Rats.  Don't know why I have such trouble reading the fine print in your SQL.  Apologies.

Try running that and generating an explain plan.  With only 5,400 numeric rows, the data will fit into a handful of pages and the index may actually be larger than the data.  The optimizer may decide that the two-tier access to each item is more expensive than scanning the data blocks.

It'd be interesting to know.


Kent
>> Don't know why I have such trouble reading the fine print in your SQL.  Apologies. <<

LOL.  No problem, you followed it well really.  Based on his comments, I don't think sdstuber even now understands anything I wrote.



>> The optimizer may decide that the two-tier access to each item is more expensive than scanning the data blocks. <<

There's no 2-tier access in this case, because I made the index a covering index.  That is, the index has all data needed to satisfy the query, so SQL can use the index w/o linking back to the main table.  SQL actually favors covering indexes when available.



>> All of the data items are stored as integers <<

Technically they need to be big integers (8 bytes), as 14 digits -- decimal(14, 2) -- obviously won't fit into a standard (4 byte) int.
Other improvements are possible, with similar binary searches / pre-stored values at key points.  I think the best improvement for higher levels -- 4 and 5 level + -- will come from having the 2-level results physically stored (fyi, sdstuber disputes this too).

The overhead of storing the sorted level 2 results is not trivial, but on a reasonably fast machine -- or in a C++ array -- it shouldn't be at all prohibitive.

That allows an exact value-match search to replace the last two levels of the join, meaning that a 4-way join is not much more overhead than a 2-way join, and a 5-way join not much more overhead than a full 3-way join.
>>>  (fyi, sdstuber disputes this too).

as suggested before, please contact me offline if you want to discuss what I do or do not believe

as for this particular "dispute" - I have no idea why you'd suggest such a thing since I've  never commented on your storage - reread my comments.

to this point you've posted 2 sets of code - the first was 30 times slower than anything posted to that point and didn't provide the requested results so I didn't really bother investigating it, let alone commenting on it (other than to mention the relative performance).

I have yet to test what you posted yesterday so I have no comments on that either.

All of my participation thus far has been trying to explain and reexplain my methods and a little commentary on some of kdo's first work that I did get a chance to test. And of course my "doomsayer" stuff about big sets.

Again, feel free to contact me offline if you'd like to pursue anything further, I'm happy to try to explain what I've done above, and anything I've actually tested.

You may have noted I haven't posted much code in the last couple days, partly because I've been busy, but mostly because Kdo has set the bar.  If I can't beat his times then I won't post.  Why bother?
ScottPletcher - thank you for finally posting the code example.  Apparently I read your code better than I read your words.

I never disputed a binary search was fast for finding a single number from a set, I simply had no idea what you wanted to do with it.  Now I do.  Thanks!

I'm still not sure why you've picked other random statements as either being flip flops or ascribing views I've never stated but I'll, again, take that as a failing of mine that I'm not explaining myself well enough for others to understand what I'm talking about.

Now onto Kdo's stuff, because like him, I'm a math guy and solving for the whole set (even when it's awful) is still the more interesting part of this question for me.
I was curious just how much the bias approach helped in kdo's code.

 I changed the timing to use struct timeval in order to record sub-second measures

Here are my first results for 3-value on target 11241  (112.41)

Elapsed time: 664569 microseconds
 177527 Solutions found.
Bias: 0.000000

Elapsed time: 665221 microseconds
 177527 Solutions found.
Bias: 124602.000000

bias was slower but that was only one sample.

after repeated tests the biased version did end up slightly faster but only by 2660.6 microseconds.

I did one run each for 4-value sums

Elapsed time: 459879342 microseconds
 151737464 Solutions found.
 139931314708480 passes.
Bias: 0.000000


Elapsed time: 458895217 microseconds
 151737464 Solutions found.
 139971610013696 passes.
Bias: 124602.000000

Bias does seem to help; but only marginally and the benefit diminishes as the results grow.  I haven't waited for 5-values to finish but the benefit on 2-value was greater, not that it needed it.  26052.4 vs 26798.2 microseconds.

of course the exact values will vary for others but I imagine the pattern should remain.
The C code above doesn't find all solutions.

I changed these lines (and the file paths, but nothing else)
      Target = 11241;
      MaxValues = 2;

these results were NOT found but should have been.

         1: -89.00 201.41
         2: -40.59 153.00
         3: -35.87 148.28
         4: -26.31 138.72
         5: -17.79 130.20
         6: -17.79 130.20
         7: -15.75 128.16
         8: -15.75 128.16
         9: -12.22 124.63
        10: -12.16 124.57
        11: -12.16 124.57
        12: -12.16 124.57

It's not a problem with the bias; from the performance tests above I already knew it didn't change the results found, but I did check again explicitly for these combinations and confirmed that with or without bias they are still skipped.
Wow.  Great catch!

That's because the biasing, as implemented, doesn't work.

Searching for :  201.41 -89.00    (Larger number to the left....)
Target is : 112.41

Recapping, the original problem is that when adding terms A and B, the result is less than A when B is negative.  The original algorithm was fooled by the negative values and didn't bother moving right a column when the sum of the columns already checked (in this case, 1 column with a value of 201.41) is larger than target.

So Biasing was invented.   The actual data has a smallest value of -1246.02, so 1246.02 is added to all of the data values, making 0 the smallest value.

After biasing everything, testing the first column for (201.41, -89.00) uses these terms:

Searching for :  1447.43 1157.02    (Larger number still to the left....)
Target is : 1358.43

The first term is still larger than the target because both the term and the target were incremented by the same value.


Gotta rethink this one, now.....
I'm not sure the biasing is the cause of skipped combinations.  If I comment out the bias call (so bias=0) I get the same results as with the bias.

I wasn't going to bother posting my code because kdo's code was so much faster than mine but since the results are different, it might still be of value.

I'm sure I'm overlooking something; but assuming bias= 0, I don't see a significant difference between the code.  I sum bottom-up, kdo sums top-down.  I see no reason why one should be faster than the other or why one should find more results than the other.  Our loops, our if/else are basically the same.

Of course, since kdo's code isn't finding all of the results it doesn't have to do as much work, so it "should" be faster.  But it's 5-6 times the speed of mine while doing far more than 1/5 of the work.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SOURCE_ROWS 	5470
#define MAX_DEPTH 		2
#define BASE_TARGET		112.41
#define TARGET 			(int) (BASE_TARGET * 100.00)
#define MAX_SOLUTIONS 	50000  //not used yet

int rows = 0;
int *numbers;
int solution_count = 0;
int solution_value[MAX_DEPTH];
FILE *outfile;
long long function_call_count = 0;
int sum;

int intcomp(const int *a, const int *b) {
   return *a - *b;
}

void read_input_file (void) {
	FILE *infile;
	int dummy1, dummy2, dummy3, ret, i;
	float temp_value;

	infile = fopen ("sqldata.txt", "rt");

	numbers = malloc (SOURCE_ROWS * sizeof (int));
	while (rows < SOURCE_ROWS) {
		ret = fscanf (infile, "(%d,%d,%d,%f)\n", &dummy1, &dummy2, &dummy3, &temp_value);
		if (ret <= 0) break;
	
		// For efficiency adjust and cast all values as integers
		numbers[rows] = (int) (100 * temp_value);			
		rows++;
    }
    if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (int));
	
	fclose(infile);
		
	qsort(numbers, rows, sizeof(int), (int (*)(const void *,const void *))intcomp);
		
//	printf("Read %ld rows from input file\n", rows);
//	for (i=0;i< rows;i++) printf("%4i: %i  %i\n",i,numbers[i],TARGET);

}

void print_solution(int depth, int number) {
	int i;
	
	if (outfile) {
		fprintf(outfile,"%10i:",solution_count);
		for (i=1;i<depth;i++) {
			fprintf(outfile," %.2f",solution_value[i-1],solution_value[i-1]/100.0);
		}
				
		fprintf(outfile," %.2f\n",number/100.0);
	}
	else {
		printf("%10i:",solution_count);		
		for (i=1;i<depth;i++) {
			printf(" %.2f",solution_value[i-1],solution_value[i-1]/100.0);
		}
				
		printf(" %.2f\n",number/100.0);
	}
}

void find_sums(int depth, int partialsum, int index) {
	int i;
	
	function_call_count++;
	
	for (i=index;i< rows;i++) {
			sum = partialsum + numbers[i];

			if (sum > TARGET) return;			
			
			if (sum == TARGET) {
				++solution_count;
				//print_solution(depth,numbers[i]);				
			}
			else if (MAX_DEPTH > depth) {
				solution_value[depth-1] = numbers[i];
				find_sums(depth+1,sum,i+1);
			}
	}	
}

int main(int argc, char* argv[])
{
	struct timeval  start_time, end_time;
	long int elapsed;

	read_input_file();

//	outfile = fopen ("solutions.txt", "w");
	
	gettimeofday(&start_time,NULL);
	find_sums(1,0,0);
	gettimeofday(&end_time,NULL);
		
	elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);
		
	printf("Elapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
			elapsed,solution_count,function_call_count);


	if (outfile) {
		fprintf(outfile,"Elapsed time: %ld microseconds\nSolutions found: %ld\n",elapsed,solution_count);
		fclose(outfile);
	}
	
	return 0;
}


Wow, lots of posts this weekend. I'll have to try and find some time to check them all. At first glance, though, I can see Kdo's problem right away, not from his code, but from his example:
>> After biasing everything, testing the first column for (201.41, -89.00) uses these terms:
>> Searching for :  1447.43 1157.02    (Larger number still to the left....)
>> Target is : 1358.43

You're using a bias of 1246.02. In which case, the target isn't correct. It should be 1246.02 * 2 + 112.41 = 2604.45 (because we're on a 2-row solution). So, checking against 2604.45 wouldn't skip these values. If you don't account for double the bias (one for each column), then you'll be skipping not only the negatives, but also some positives.

I still feel that some sort of mixed solution would be the fastest. Maybe adapting Scott's solution of storing the two column results and using those for expanding. Implementing that solution on C may prove promising.
The target does increment by the bias every time that you move right a column.  The problem is that if the target is less than current value, biasing both items by the same value doesn't change that.

I've got a solution in mind.  Just need a couple of minutes to implement.  :)


Kent
[SOLUTION posted by Scott Pletcher -- member-only content not shown]
That query took 20 seconds to complete. From what I understand, though, I have to use your other query first to populate and to define the binary search value indicated on the comments. I suppose it would be easier to run both codes at once and store the value in a variable.

Is there any way to return the total number of combinations (to compare with other solutions and check for accuracy)?
I was going to get a count of the total solutions (see the "--COUNT(*)" in my query above :-) ), but once it ran over 2.5 mins I gave up and cancelled it.

Of course if you let it run it would eventually give you its count.


>> From what I understand, though, I have to use your other query first to populate and to define the binary search value indicated on the comments. <<

Yes, that adds a lot of efficiency (esp. on your box) with the 4_5 table, by eliminating most of the rows from one level of the joins.


>> I suppose it would be easier to run both codes at once and store the value in a variable. <<

Yep.  Sorry, just running this catch-as-catch-can right now, trying to piece together a working, viable solution.  I was careful, however, to use variables for the value to find and for the level, to make it much easier to put it into combined/straight-line code later.

If you want a different subset, you could limit the rh_ident on one or more of the other levels -- another reason I like the idea of an identity for these rows instead of the multiple original rows.

For example, if you specified:
WHERE s1.rh_ident > 100

Then none of the first 100 rows would be included in s1, s2 or s3.
I tried commenting the TOP 50000 line and letting it run. After 7 minutes I cancelled, having near 7M records.

Analysing different subsets would be useful, because I think the data presented will be quite biased, starting from negative numbers as it does. A random sample would be more representative, but not only would that be harder and slower, having this many results would decrease the likelihood of finding a solution using the combinatorial method anyway.

Once I assemble your code together, I'll try to study it to understand it. If I have questions I'll ask them here.

I'm curious to see if Kdo can achieve either a faster solution (not really necessary. Anything under 5 minutes is good enough) or one that can return all possible solutions in a decent time. I love problem solving, and this is one that's piqued my curiosity.
As to "check for accuracy":

I show you the original relative value# (ident).
I re-add the numbers in the display and the total matches.  
I also verified the first 50,000 results I got, going back to make sure the listed values -- s1, s2, etc. -- really were in the original input.  

Not sure what else I can do as far as verifying "accuracy".


C'mon, 5 levels in 20 secs, you're not happy??  I thought you were talking about *hoping* to  be able to get 5-level results in HOURS just to see them.
I am extremely happy with this solution. I just wanted to compare with Kdo's as well. Also, I wanted to ascertain whether both solutions return the same number of results (which would be a good indication that none are missing any solutions, like was happening with Kdo's).

One thing I'm not sure of, though, is how to adapt your code to use 2, 3 or 4 level solutions, which would have to be run before going for the 5-level solution (most likely using a stored procedure to populate/drop the table and select which level to run).
We could adapt it easily enough for the different levels -- again, that's why I used variables for the value to find and the levels to use.

I would include the binary search even for smaller values and levels.  Even for the small example value you used earlier of 112.41, the 2-level total can start at #2952 (out of 5470), knocking out over half the rows from one level almost instantly.
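IOW, for the 2-level search the same pattern reduces to something like this (a sketch, with @starting_rh_ident being the binary-search result):

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident >= @starting_rh_ident AND    -- e.g. 2952 for 112.41
    sv2.rh_ident > sv1.rh_ident AND
    sv2.vcr_valor_bruto = @value_to_find - sv1.vcr_valor_bruto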
You can vary the s1.rh_ident starting point to get different result sets.

For example:

WHERE s1.rh_ident > 100
WHERE s1.rh_ident > 500

etc..

They still aren't random, of course.  Don't know any way to easily do that.  You could randomly delete many values from the table, which would reduce the possible result sets randomly -- maybe.
Don't worry too much about the random values. They aren't too important. The chances of hitting a relevant value from that many results don't improve by much, if at all, by randomizing them. It's just the programmer in me that likes these types of solutions ;)
LOL, I got ya.

However, it occurs to me now, the only values that genuinely must be sorted are for doing the binary search.

The values in the final search for s1, s2 and s3 do NOT have to be sorted, since the selection for them is done by ident and not by value.  [The s4_5 is doing an indexed lookup, so the physical order is irrelevant, as long as the bruto total has an index on it.]

So you could randomly insert the values into the search_values table to produce different results.  And that IS easy AND FAST:

SELECT ... INTO dbo.rh_search_values ORDER BY NEWID()

That could be done even, say, 10 times, and you could produce 10 reasonably quick random top 50,000 (or whatever) results ... even, if you wanted, store those results and randomly pick some number from the 500,000 random results.
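Spelled out a bit more (a sketch; the source filter, column list and index names just follow the earlier examples):

SELECT IDENTITY(smallint, 1, 1) AS rh_ident,
       vcr_valor_bruto
INTO tempdb.dbo.rh_search_values_random
FROM dbo.rh_vcrt0
WHERE vcr_emp_empresa = 32
ORDER BY NEWID()

CREATE UNIQUE CLUSTERED INDEX rh_search_values_random__CL
    ON tempdb.dbo.rh_search_values_random ( rh_ident )
CREATE UNIQUE NONCLUSTERED INDEX rh_search_values_random__IX
    ON tempdb.dbo.rh_search_values_random ( vcr_valor_bruto, rh_ident )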
[SOLUTION -- member-only content not shown]
In C(++), I suspect you're right, or at least very close ... my only potential issue with that would be the indexed lookup to match the value ... I'm not sure how well C deals with doing that vs SQL (that is THE strength of SQL, after all).


I've confirmed for myself that 50,000 random values could be obtained in nearly the same amount of time as the last run (maybe a second or two on the slower boxes).

The original ~5,500 values must be sorted and stored, but that is a relatively trivial amount of time for SQL.  Then the binary search finds the "cut off" point, which is very fast in SQL.  

Finally, the qualifying two-level combinations must be stored and sorted -- much less trivial, but SQL is still fast enough to do this in less than 30 seconds for almost all reasonable input lists and search values.


None of the advanced techniques are necessary for the 2-level (although the binary search could still yield savings, sometimes a lot, and is so fast it probably should be done anyway).

I would use them for the 3-level checks -- sometimes it won't have been needed, but other times it could help a lot.

I would definitely use some type of "advanced" techniques for 4- and 5-level searches, because I think it will yield noticeable savings there.
More to report.  :)

No 5-column solution uses any of the top 68 values in the list.  That's certainly not intuitive, and an algorithm to detect the starting point without missing any values is quite a task.  (Note that different data might use the top value and none of the next 50, or none of the top 200, etc.)

Still, starting the search just a modest way into the list makes a huge difference in run times.  The first 5-column solution is found using the 69th largest value in the list.  So starting the search after just ignoring the first 68 items yields these times (for 50,000 results) for the 2, 3, 4, and 5 column searches.

0, 5, 15, 46.  Seconds.

That's an impressive reduction.  Skipping the first 100 values yields:

0, 5, 11, 7.  Seven seconds to find 50,000 5-columns solutions!

Skipping the first 200 values yields:

0, 5, 7, 9.

Skipping the first 400 values yields:

0, 5, 10, 12.


There's a strong indication that performance will vary as the data changes, but simply ignoring the top 3 or 4% of the values gets all of the generated answers in about 10 seconds.

If the TOP 50K results are needed, it's not possible to pick an arbitrary number of values to ignore.  But if 50K consecutive (or related) solutions are sufficient, starting just a short way into the list is a very workable technique.

Kent
>> No 5-column solution uses any of the top 68 values in the list.  That's certainly not intuitive, and an algorithm to detect the starting point without missing any values is quite a task. <<

Indeed.  I suspect the algorithm for that might take longer than generating the first solutions :-).


That's why I went at it from a different angle: for an n-level solution, what is the SMALLEST value that MUST be included to reach the desired total?  That's what my binary search determines (within a few entries), and a binary search will ALWAYS BE VERY FAST.

Now, this only helps for one level, but it often helps BIG for that level, allowing even SQL to gen a solution in a very reasonable time.
This reminds me of the old "super queen" problem.  The lessons are relevant here.

Back in the 70s, Scientific American posed the question of placing N super queens (can move like a queen and a knight) on an NxN chess board.  At the time of publishing, they had solutions up to N=16 (or there abouts) and within the next couple of months readers submitted solutions for perhaps 3 more values of N.

At the time, I worked for Control Data and had access to some of the most powerful computers in the world.  I wrote a small (recursive) program in assembly to solve as N increased.  Within a short period of time, I had solutions for all boards up to N=22.  N=23 was daunting.  Days later the program was still running, without finding a solution.  I didn't have a solution for N=23, but I was the only mathematician in the world with answers for all N < 23.  And they were, of course, useless.  :)

That program was looking for the "least solution".  Put a super-queen in the lower left corner and advance to the next column.  Put a super-queen legally in the next column and move on.  When no legal move is found retreat a column and move the super-queen to the next legal column.  The base-N numeric value of the solution (with each column being a digit in the solution) was the "least solution".  I was hoping to establish a pattern by which solutions could be derived for larger values of N.

But that didn't work.

What did work, at least as far as finding SOME solution, was to start in the middle and work out.  Start by placing a queen in the center square, then move to the column to the left of center and search up for a legal move.  Then move to the column to the right of center and search down for a legal move.  Move away from center one more column and repeat.  (Searches are circular so that all positions are tested, and the left/down, right/up technique lent symmetry to the solution.  There is a slight variation on the start depending on whether N is odd or even.)

That technique found solutions very, very fast.  And remembering how I've paid those dues made me think that if we start toward the middle of the list and work out, we can find 50,000 solutions lightning fast.  I'm going to try it.
I'm actually very fascinated by all this and I'm loving the progress being made in the C code (and also the new approach, though I'm not sure if it will be fruitful. I'll wait and see).

However, it seems that the SQL solution is pretty much stable and finalized. I would ask if ScottPletcher can assemble the solution into some sort of stored procedure which I can then adapt. Is there any way the procedure can detect if you need to refill the table? Also, is it possible to use the number of results returned as a variable? I seem to remember that TOP @NumberOfRows wouldn't work on SQL and I had to use dynamic SQL before, but I can't be sure now, and, since you're using an incrementing index, you might not need it.
My biggest difficulty, so far, is determining how you get the results back from the ident. For each column, I need vcr_num_empregado, vcr_vnc_cod_venc and vcr_valor_bruto, but I can't seem to understand how you built the ident so I can reverse it.

Kdo, when you do have a more final solution, please post it here, so I can compare both. Please take into account that the program should retrieve the values from a DB. It should take only a few microseconds for this, but it should be taken into account.

When this is over, I'll be opening a new question to award 500 points to each working solution. I feel like you guys deserve it.
I keep trying to make the code simpler

I have more testing to do on the larger sets to confirm they are correct.
I did try adding the binary search at the last level and, as expected, it is faster
but I'm less confident in the results.  Not because of a flaw in the idea but a flaw in my code.

My latest timings - without binary search.

--------------------------------
Target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 7 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 12585 microseconds
Solutions found: 122
Call count: 3332
--------------------------------
Depth: 3
Elapsed time: 13513323 microseconds
Solutions found: 256735
Call count: 5500396
--------------------------------
Depth: 4
Elapsed time: 11481604585 microseconds
Solutions found: 277343509
Call count: 6094087776
--------------------------------



I'll try to get the kinks worked out of both versions and post final results and code this weekend.
I made a few more improvements and fixed the binary search bug, which does add some complexity because if a value exists multiple times in the set, the binary search will only find one of them so I have to search the neighbors exhaustively.  That is still significantly faster than iterating through all remaining values but does take a little more code.

I also brought back a previous tuning effort that will exit early if I can determine the target is unreachable.  Adding those checks slows things down for the two value solutions, but those only take a few hundred microseconds anyway so the loss isn't appreciable.  However, on solutions with 4 or more values the early exit can reduce the total execution time by quite a bit.  The relative amount will vary by data and target, though.

But, despite all the successes I'm having with this code, I'd say the times still prove my original point:  The data beats the algorithm.  I can find simple solutions with convenient data in almost no time at all; but big solutions with lots of combinations take a long time, and "lots of combinations" with the existing data is still relatively small compared to what they could be.   For example: the 4-value solutions below, are only 277 million but, with different data and a different target the solutions could be in the billions or trillions and that would take a lot longer to sift through, let alone trying to tackle the 5 value solutions.

However, the 50000 limit, provides a convenient "out" that lets you dodge some of the math and that, as already shown above by others, can be incredibly powerful.

Like Kdo though, finding all solutions is more interesting and putting the 50000 cap in the middle is a trivial addition so I'm still pursuing the exhaustive solution.

I'm running for 5-value solutions now, these are timings I have up through 4 while I wait

--------------------------------
target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 713 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 450369 microseconds  (half a second)
Solutions found: 256735
Call count: 4390127
--------------------------------
Depth: 4
Elapsed time: 448023735 microseconds (a little under 8 minutes)
Solutions found: 277343509
Call count: 4446882814
--------------------------------
I lost my connection to the server so my 5-value search from above didn't finish.
But I made a couple of small tweaks and added the early exit.

It can do 50000 5-value sums in about 8 seconds.  
Below are timings for 5,000,000 at each level (if there are that many)
I only check the cap at the top of the function, so duplicate-value hits found within a single call can push the count past it, hence the 5000001 in the last set.

--------------------------------
target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 277 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 411936 microseconds
Solutions found: 256735
Call count: 4390127
--------------------------------
Depth: 4
Elapsed time: 25090999 microseconds
Solutions found: 5000000
Call count: 326857516
--------------------------------
Depth: 5
Elapsed time: 183320310 microseconds
Solutions found: 5000001
Call count: 2720381708
--------------------------------
--------------------------------


This is the code used to generate the above.
You might note I did not use the bsearch function.  I tried it but what I have below actually ended up slightly faster.  I think it's because I don't force searching down to a single value; which saves a tiny bit on extra division and offsets.  Instead, if I get down to 2 values I simply evaluate both.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#include <ctype.h>		/* isprint */

#define DEPTH_LIMIT		5
#define SOURCE_ROWS 	5470
#define BASE_TARGET		112.41
#define MAX_SOLUTIONS 	5000000 
#define separator 		printf("--------------------------------\n")

int target = (BASE_TARGET * 100.00);
char *in_filename;
unsigned rows = 0;
int *numbers;

int solution_value[DEPTH_LIMIT];

long max_solution[DEPTH_LIMIT];

FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;

unsigned max_depth;

int intcomp(const int *a, const int *b) {
   return *a - *b;
}

void read_input_file (void) {
	FILE *infile;
	int  dummy1, dummy2, dummy3, ret, i, j;
	float temp_value;
	

	infile = fopen ("sqldata.txt", "rt");

	numbers = malloc (SOURCE_ROWS * sizeof (int));
	while (rows < SOURCE_ROWS) {
		ret = fscanf (infile, "(%d,%d,%d,%f)\n", &dummy1, &dummy2, &dummy3, &temp_value);
		if (ret <= 0) break;
		// For efficiency adjust and cast all values as integers,
		// adjust for proper truncation
		numbers[rows] = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) ) ;			
		rows++;
    }
    if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (int));
	
	fclose(infile);
		
	qsort(numbers, rows, sizeof(int), (int (*)(const void *,const void *))intcomp);
		
	for (i=0;i<DEPTH_LIMIT;i++)	{
		for (j=rows-1;j>=rows-1-i;j--)
			max_solution[i] += numbers[j];
	}	


	//printf("Read %ld rows from input file\n", rows);
	//for (i=0;i< rows;i++) printf("%4i: %i  %i\n",i,numbers[i],target);
}

void print_solution(unsigned depth, int number) {
	int i;

	for (i=1;i<depth;i++) {
		fprintf(outfile,"%.2f ",solution_value[i-1],solution_value[i-1]/100.0);
	}
				
	fprintf(outfile,"%.2f\n",number/100.0);
}

// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is 
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
	int j;
	long temp=0;
	for (j=index+1;j<=index+max_depth-depth && j<rows;j++)	/* guard against reading past the end of the array */
		temp += numbers[j];
	return temp;
}

void find_sums(unsigned depth, long partialsum, unsigned index) {
	unsigned i, j, lo, hi;
	
	function_call_count++;

	if (solution_count >= MAX_SOLUTIONS) return;

	if (depth < max_depth) {
		for (i=index;i< rows;i++) {
			sum = partialsum + numbers[i];

			if (sum > target) return;				
	
			// Are we so far from the target that no values could
			// possibly get us there? If so, give up.								
			if (sum + max_solution[depth-1] < target) break;
			if (sum + min_solution(depth,i) > target) break;
				
			// We're close enough that it's worth trying to continue		
			solution_value[depth-1] = numbers[i];
			find_sums(depth+1,sum,i+1);
		}
	}
	else {
		lo = index;
		hi = rows-1;

		while (hi-lo > 1) {
			i = (hi+lo)>>1;
			if (partialsum + numbers[i] < target) lo = i;
			else if (partialsum + numbers[i] > target)	hi = i;
			else break;
		}
		if (partialsum + numbers[lo] == target) i = lo;
		else if (partialsum + numbers[hi] == target) i = hi;
		
		if (partialsum + numbers[i] == target) {
			++solution_count;
			if (outfile) print_solution(depth,numbers[i]);
			
			// We found a hit but there may be more than one element in the set with the same value
			// so count the neighbors until we exhaust them all										
			for (j=i-1;(j > index) && (numbers[j] == numbers[i]);j--) {
				++solution_count;
				if (outfile) print_solution(depth,numbers[j]);
			}
			for (j=i+1;(j < rows) && (numbers[j] == numbers[i]);j++) {
				++solution_count;
				if (outfile) print_solution(depth,numbers[j]);
			}
							
		}
	}
}

int main(int argc, char* argv[])
{
	struct timeval  start_time, end_time;
	unsigned long elapsed;
	int c;

	while ((c = getopt(argc, argv, ":t:i:o:?")) != -1) {
		switch (c) {
			case 't':
				target = (int) (atof(optarg) * 100.00);
				break;
			case 'i':
				in_filename = optarg;
				break;
			case 'o':
				outfile = fopen (optarg, "w");
				break;
			case '?':
				printf("\n%s [-t target] [-s ] [-i infile] [-o outfile]\n",argv[0]);
				printf("-t target is a floating point value,  default=%.2f\n",BASE_TARGET);
				printf("-i input file name, default sqldata.txt\n");
				printf("-o output file name\n");
				return 0;		
			default:
				if (isprint (optopt))
					fprintf(stderr, "Unknown option `-%c'.\n", optopt);
				else
					fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
				return 0;
		}
	}
	
	read_input_file();
	
	separator;
	printf("target sum: %f\n",target/100.0);
	separator;
	for (max_depth=1;max_depth<=DEPTH_LIMIT;max_depth++) {
		memset(&start_time,0,sizeof(start_time));
		memset(&end_time,0,sizeof(end_time));
		
		function_call_count = 0;
		solution_count = 0;
	 	gettimeofday(&start_time,NULL);
	 	find_sums(1,0,0);
		gettimeofday(&end_time,NULL);
		
		elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);

		printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
				max_depth,elapsed,solution_count,function_call_count);
		separator;
	}
	separator;
	return 0;
}
                                


This is essentially the same code as above except my array of "numbers" isn't just the numbers: I maintain the entire record, which doesn't really impact the performance much because the positional referencing is simply math.  There is a slight overhead in all of the element dereferencing, though.  I'm not sure how important this is, but since other code and discussion above has mentioned special complexity for negative values, I'll make a note of how they are handled here.  The initial sort is all that is required.  Negative values don't add any additional complexity or require any special consideration with the methods used in this code.  They do, of course, add more possible options to explore, but you don't have to code for them; they just fall out naturally.  That may be a feature or a failing, depending on your point of view.


As for implementing this with your database: you could either write the data from sql to a file and then use this pretty much as-is, or replace the read file function with a read database routine to build the array.
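For the write-to-a-file option, something like this (a hedged sketch) produces rows in the exact "(a,b,c,value)" shape the read routine below expects; save the output as sqldata.txt:

SELECT '(' + CAST(vcr_emp_empresa   AS varchar(12)) + ','
           + CAST(vcr_num_empregado AS varchar(12)) + ','
           + CAST(vcr_vnc_cod_venc  AS varchar(12)) + ','
           + CAST(vcr_valor_bruto   AS varchar(20)) + ')'
FROM dbo.rh_vcrt0
WHERE vcr_emp_empresa = 32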

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#include <ctype.h>		/* isprint */

#define DEPTH_LIMIT		5
#define SOURCE_ROWS 	5470
#define BASE_TARGET		112.41
#define MAX_SOLUTIONS 	50000
#define separator 		printf("--------------------------------\n")


struct rh_vcrt0 {
	int vcr_emp_empresa,
     vcr_num_empregado,
     vcr_vnc_cod_venc,
     vcr_valor_bruto;
};


int target = (BASE_TARGET * 100.00);
char *in_filename;
unsigned rows = 0;
struct rh_vcrt0 *numbers;

struct rh_vcrt0 solution_value[DEPTH_LIMIT];

long max_solution[DEPTH_LIMIT];

FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;

unsigned max_depth;

int intcomp(const struct rh_vcrt0 *a, const struct rh_vcrt0 *b) {
   return a->vcr_valor_bruto - b->vcr_valor_bruto;
}

void read_input_file (void) {
	FILE *infile;
	int  dummy2, dummy3, ret, i,j;
	float dummy1, temp_value;
	struct rh_vcrt0 temp;
	

	infile = fopen ("sqldata.txt", "rt");

	numbers = malloc (SOURCE_ROWS * sizeof (struct rh_vcrt0));
	while (rows < SOURCE_ROWS) {
		ret = fscanf (infile, "(%d,%d,%d,%f)\n", 
							&temp.vcr_emp_empresa,
						    &temp.vcr_num_empregado,
						    &temp.vcr_vnc_cod_venc,
						    &temp_value);
		if (ret <= 0) break;
		// For efficiency adjust and cast all values as integers,
		// adjust for proper truncation
		temp.vcr_valor_bruto = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) );
		memcpy(&numbers[rows],&temp,sizeof(struct rh_vcrt0));
		rows++;
    }
    if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (struct rh_vcrt0));
	
	fclose(infile);
		
	qsort(numbers, rows, sizeof(struct rh_vcrt0), (int (*)(const void *,const void *))intcomp);
		
	for (i=0;i<DEPTH_LIMIT;i++)	{
		for (j=rows-1;j>=rows-1-i;j--)
			max_solution[i] += numbers[j].vcr_valor_bruto;
	}	


	//printf("Read %ld rows from input file\n", rows);
	//for (i=0;i< rows;i++) printf("%4i: %i  %i\n",i,target,numbers[i].vcr_valor_bruto);
}

void print_solution(unsigned depth, struct rh_vcrt0 *last) {
	int i;

	for (i=1;i<depth;i++) {
		fprintf(outfile,"(%d,%d,%d,%.2f) ", 
							solution_value[i-1].vcr_emp_empresa,
						    solution_value[i-1].vcr_num_empregado,
						    solution_value[i-1].vcr_vnc_cod_venc,
						    solution_value[i-1].vcr_valor_bruto/100.0);
	}
				
	fprintf(outfile,"(%d,%d,%d,%.2f)\n",
				last->vcr_emp_empresa,
			    last->vcr_num_empregado,
			    last->vcr_vnc_cod_venc,
			    last->vcr_valor_bruto/100.0);
}

// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is 
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
	int j;
	long temp=0;
	for (j=index+1;j<=index+max_depth-depth && j<rows;j++)	/* guard against reading past the end of the array */
		temp += numbers[j].vcr_valor_bruto;
	return temp;
}

void find_sums(unsigned depth, long partialsum, unsigned index) {
	unsigned i, j, lo, hi;
	
	function_call_count++;

	if (solution_count >= MAX_SOLUTIONS) return;

	if (depth < max_depth) {
		for (i=index;i< rows;i++) {
			sum = partialsum + numbers[i].vcr_valor_bruto;

			if (sum > target) return;				
	
			// Are we so far from the target that no values could
			// possibly get us there? If so, give up.								
			if (sum + max_solution[depth-1] < target) break;
			if (sum + min_solution(depth,i) > target) break;
				
			// We're close enough that it's worth trying to continue		
			memcpy(&solution_value[depth-1], &numbers[i], sizeof(struct rh_vcrt0));
			find_sums(depth+1,sum,i+1);
		}
	}
	else {
		lo = index;
		hi = rows-1;

		while (hi-lo > 1) {
			i = (hi+lo)>>1;
			if (partialsum + numbers[i].vcr_valor_bruto < target) lo = i;
			else if (partialsum + numbers[i].vcr_valor_bruto > target)	hi = i;
			else break;
		}
		if (partialsum + numbers[lo].vcr_valor_bruto == target) i = lo;
		else if (partialsum + numbers[hi].vcr_valor_bruto == target) i = hi;
		
		if (partialsum + numbers[i].vcr_valor_bruto == target) {
			++solution_count;
			if (outfile) print_solution(depth,&numbers[i]);
			
			// We found a hit but there may be more than one element in the set with the same value
			// so count the neighbors until we exhaust them all										
			for (j=i-1;(j > index) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j--) {
				++solution_count;
				if (outfile) print_solution(depth,&numbers[j]);
			}
			for (j=i+1;(j < rows) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j++) {
				++solution_count;
				if (outfile) print_solution(depth,&numbers[j]);
			}
							
		}
	}
}

int main(int argc, char* argv[])
{
	struct timeval  start_time, end_time;
	unsigned long elapsed;
	int c;

	while ((c = getopt(argc, argv, ":t:i:o:?")) != -1) {
		switch (c) {
			case 't':
				target = (int) (atof(optarg) * 100.00);
				break;
			case 'i':
				in_filename = optarg;
				break;
			case 'o':
				outfile = fopen (optarg, "w");
				break;
			case '?':
				printf("\n%s [-t target] [-s ] [-i infile] [-o outfile]\n",argv[0]);
				printf("-t target is a floating point value,  default=%.2f\n",BASE_TARGET);
				printf("-i input file name, default sqldata.txt\n");
				printf("-o output file name\n");
				return 0;		
			default:
				if (isprint (optopt))
					fprintf(stderr, "Unknown option `-%c'.\n", optopt);
				else
					fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
				return 0;
		}
	}
	
	read_input_file();

	separator;
	printf("target sum: %f\n",target/100.0);
	separator;
	for (max_depth=1;max_depth<=DEPTH_LIMIT;max_depth++) {
		memset(&start_time,0,sizeof(start_time));
		memset(&end_time,0,sizeof(end_time));
		
		function_call_count = 0;
		solution_count = 0;
	 	gettimeofday(&start_time,NULL);
	 	find_sums(1,0,0);
		gettimeofday(&end_time,NULL);
		
		elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);

		printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
				max_depth,elapsed,solution_count,function_call_count);
		separator;
	}
	separator;
	return 0;
}
                                



Here are timings for 50000 limit


--------------------------------
target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 1 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 667 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 83836 microseconds
Solutions found: 50001
Call count: 889437
--------------------------------
Depth: 4
Elapsed time: 1366452 microseconds
Solutions found: 50000
Call count: 18983831
--------------------------------
Depth: 5
Elapsed time: 8349257 microseconds
Solutions found: 50000
Call count: 126549329
--------------------------------
--------------------------------


What about a random 50000 vs (nearly) always finding the lowest matches?


AS to duplicates on the binary search, the code normally exits just ABOVE the last value, which should cover duplicates.  Or you can simply search up until you don't get a dup -- in SQL, with the table indexed, that will be extremely fast.
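E.g., something like this (a sketch) returns every duplicate in a single indexed seek:

SELECT rh_ident, vcr_valor_bruto
FROM tempdb.dbo.rh_search_values
WHERE vcr_valor_bruto = @needed_value  -- @needed_value = the remaining amount still to be matched (illustrative)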
Complexity/speed of random selection would depend on degree of randomness required.

It wouldn't be hard to randomize selection of the first value, but you'll still get clusters if there are multiple combinations that use that value.

If eliminating those clusters would also be required then I suppose that could be ameliorated by exiting as soon as a solution was found.

Since finding one solution doesn't seem to take very long, with a limit of 50000, the random hopping around, while slower, might still be acceptable.  My concern there would be if a particular target and data set would not have 50000 solutions, then forcing early exit might cause a loss of records.

I guess in that case there could always be a second pass to do an exhaustive search if the randomized early exits couldn't fill the 50000 buckets.

However, it might simply be easier to over-generate and then randomly select from what was found.  I can do half a million in 3 minutes.  Randomly select 10% of those.  You wouldn't even need to store the values as you go, the random selection could be done on the fly.  Simply apply a 10% randomizer to each solution found, if you're within 10%, then that solution counts, if not, skip it and move on.  Doing that would also easily handle the duplicates without requiring early exit.  The downside is you'll still get randomized results that are skewed toward the beginning of your set.
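As a minimal sketch of that "10% randomizer" idea (illustrative only; the function name and fraction here are placeholders, though the later C versions in this thread apply the same trick with a 20% chance):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Keep a solution only if a uniform random draw falls inside the desired
// fraction; keep_solution(0.10) accepts roughly 10% of the solutions offered.
static int keep_solution(double fraction)
{
    return rand() < (int)(RAND_MAX * fraction);
}

int main(void)
{
    int i, kept = 0;

    srand(time(NULL));
    for (i = 0; i < 100000; i++)
        if (keep_solution(0.10)) kept++;

    printf("kept %d of 100000 candidate solutions (~10%% expected)\n", kept);
    return 0;
}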

If the randomization must be evenly distributed across the entire field of all combinations then it'll be more complex and slower.  Still possible, but I'm skeptical of the practical value of that.
Especially for the "bad cases" which haven't been mentioned much recently.  So let me put my doomsayer cap back on.  All of the "good and fast" stuff above is for data and a target that plays nicely.

We only have 122 two-value solutions but could have had millions.  We only have 256735 three-value solutions but could have had billions; and so on.

Anything that forces exploration of more combinations will always be susceptible to intractability when the combinations explode.  If I have to select 50000 combinations randomly distributed out of quadrillions of legal solutions then it's simply not going to happen in any kind of practical time.

Skewed results? maybe
Skewed results with convenient data? more likely
I think the original list of values could be stored "sorted" randomly (in SQL, by NEWID(), for example).

The only list that needs to be sorted by value is the 2-level mini-joined lookup table.  The other levels could all be random, thus inherently providing a random result with only a single pass over the 50,000 (or however many) rows.

Purely in SQL, that would appear to take, based on these tests, typically 30 secs or less.

It can be done faster in C, no q, but only the OP can decide if the speed diff is worth the switch to C, and the resulting loss of control/understanding.
Randomization isn't important. It's just a little something my programmer mindset would prefer, but it's not essential.

As for the large datasets: Most of the time, hopefully, there won't be that many results returned. I said 50k records as a threshold, but it might as well be 5k. I'd say that about 85% of the time there will be under 50-100 valid solutions; 10% of the time, 100-1k; and 5% of the time, 1k+, in which case we usually search for a different solution.
So, if it helps, you can set the threshold to 5k, or even to 1k. It's not likely we'll be searching further, for various reasons, not the least of which is the time required to effectively search all those solutions.

I'd guess that this would speed up ScottPletcher's code as well.

To effectively use this, you need to drop the 122.41 value. That was from a real life example of a value we needed to find, but on the smallest set. You can use a much higher value for the larger set. Something around 7k-25k will be good enough, even though we sometimes need values as high as 100k.
Using a fixed target lets us compare results, to be sure we're not only making progress on speed but also checking for correctness.

However, once the code is verified it's easy to switch.
If you compile the c code above (simply gcc the file with defaults) you can run the resulting output with the -t flag to specify whatever target you want.


for example:  save the code above to somefile.c

gcc   somefile.c  -o testit
testit -t 12345.67
Basically I'm saying that randomization is free, in that it does NOT cost ANY noticeable time diff.  You could even give them the option of ordered or random results.
@ScottPletcher
>>Purely in SQL, that would appear to take, based on these tests, typically 30 secs or less.

can you post a complete example of what you suggest?

I agree it's doable but I don't share the same degree of optimism on the results; I'd love to test something and be proven wrong though.

@Cluskitt
>>  I'd say that about 85% of the time, there will be about under 50-100 valid solutions.
Is the example data above abnormal?  I've tried lots of random target values and unless I pick a gargantuan number there are almost always many combinations that add up to the target sum.

If you're right and your normal real data and real targets produce so few results, then randomization isn't needed at all because you can search exhaustively.
>>> randomization is free

I don't get how you made that leap.  Please post some code that shows that.
Either you're not correct, or I'm misunderstanding what you mean - I'm very willing to believe the fault is mine.

Possibly the confusion is just imprecision in the use of the word "randomization".

Exactly what are you randomizing and to what degree?  Are you saying you can pick 50000 (or 1000 or whatever) results with an even distribution from all possible valid solutions?

And, you can do that even for inconvenient data? (e.g. select N results out of 154 quadrillion?)
if so, I think we'd all love to see that

Or, are you saying you can pick a skewed randomized selection of a subset of results for convenient data?  (pick N results out of 32 million)  An example here would be nice; the asker even asked for this as a nice-to-have.  I'll post skewed randomization for my code later.

Or are you saying something else entirely?

again, testable code is always more helpful than words and leaves zero ambiguity
Well, I must admit that we cheat a bit. We sometimes clear a few codes in advance on the larger sets to make them more manageable. For example, if you pick the given sample of data and remove all vcr_vnc_cod_venc IN (1,4,5), you'll be effectively removing LOTS of possible combinations. Not only that, we usually remove a few of the most common negative codes. We do this gradually, though. We first remove codes 1 and 5. Then, when we still don't have a reasonable result we remove a few more, etc. It isn't always clear cut, so it can't be hard-coded, which is why I haven't mentioned it before, other than in passing.

So, normally, we'll work with a smaller set (about 2/3 of the larger one). Come to think of it, I actually think it would be simpler to remove the codes first and then start adding them back if we don't get proper results. This would have the advantage of eliminating a lot of similar values, I suppose (one of the codes we also eliminate fairly often is 6, which almost always has the same value).

If you want to test a reasonably common filter, remove these codes and test with the remaining ones: 1,4,5,6,12 and all between 139-169 (which will remove most of the negatives)
As I stated for "randomization", I'm saying that the original rows can be inserted in a random order prior to doing the searches.  SQL seems to strongly bias its results toward rows that appear earlier in the result set.  Thus, by storing the initial values randomly, we're likely to get a random sample of results.  Certainly much more so than storing the values sorted.

As to 30 secs, the 5-level results took ~20 secs in SQL, as stated by OP.  Based on the q's comments, I took that to be a typical case.  The extra 10 secs is a typical IT "cushion" :-) .
Probably not a bad idea to eliminate dups from the input values; since we're only generating a subset of matches anyway, dropping dups gives us more unique input values with which to reach the search limit.
>>> I'm saying that the original rows can be inserted in a random order prior to doing the searches.

That's effectively picking the first values randomly, which, as noted above, should work just fine but the results are still skewed with clusters.

But because of the initial randomization the skewing will hopefully not be a recognizable pattern.

but, if there are a lot of valid combinations that happen to use value X, then no matter how randomized your method is of picking X, once you have it, unless you also include early exit you'll still get a big cluster.

Easy example:  X=0.
Similarly, every 3-value solution clusters around 0 in 4-value solutions.

Other values can also produce clusters but 0 is an obvious candidate for maximal clustering.
Perhaps, but given that the requestor has repeatedly stated that he doesn't need ultra-scientific randomization, just a better sampling of results, then I don't see why an initial random insert isn't sufficient.

I give up on your constant assaults on my perception, implying I have no clue what I'm talking about.

I believe my perception here is more accurate than yours: the requestor doesn't need absolute randomization.  In fact, he explicitly stated that: it's just a "nicety".  So the almost-no-cost randomization of the initial values seems a practical solution to me, w or w/o clustering.
>>> requestor has repeatedly stated that he doesn't need ultra-scientific randomization, just a better sampling of results, then I don't see why an initial random insert isn't sufficient.

I didn't say or imply it wasn't sufficient.  I was pointing out that your suggested sql method was functionally the same as my suggested c method.    

I didn't say skewed clusters were wrong, but I did think it was appropriate to note the distinction between those results and random selection from a uniform distribution.  


>>> I give up on your constant assaults on my perception, implying I have no clue what I'm talking
about.

I'm truly sorry you have that impression but I've neither said nor implied any such thing.  In fact, I've gone out of my way to assume the fault is mine for misunderstanding.  And I have pointed out multiple times when we were following the same approach, as in this case.  So, if I did somehow slight you, I was poking at myself as well; but again, it was never my intent.

Please, please reread my comments and contact me offline; I'll be happy to discuss whatever you want and then come back here to publicly apologize and correct/remove anything that is inappropriate.


>> I believe my perception here is more accurate than yours

Yes, I know - you've been very clear about that

But more or less accurate is irrelevant since we agree.  I also recognized that randomization was asked for as a "nice-to-have" and not a strict requirement.

I hadn't been pursuing randomization at all because it was only a "nicety".  
After I posted my last example - YOU asked about how I would produce randomized results.   So I was answering you. Then when your response wasn't clear to me (again I was willing to take the fault there) I asked for more clarification.    I would have taken the entire sidebar offline but since Cluskitt "did" ask about it (even as a "nicety") it seemed appropriate to keep it in thread.

Then when you suggested you already had a solution to do it, I asked you to  post it.  Even if it is just a "nicety" - if you really have a viable solution,  please share it.  Give us something we can test.

I'm not attacking you - again I apologize that you've somehow come to that conclusion.

I'm simply asking you to post what you suggest.  What you say sounds great, so now that you've described it - please show it.

No hostility is intended, implied, nor should be construed in any way from anything in this post.  My requests are not challenges to your skill, your knowledge, your perception, or any other slight you may have felt to this point.  

My requests are for my own curiosity (I don't know TSQL nearly as well as you do, I'm hoping to learn something) and for the asker who is looking for a complete solution and I don't think there is one available in the posts above.
Hi Scott,

I must admit to my own confusion here.  How does randomization tie in with your initial step(s) of omitting the impossible low-values sums?


Kent
Hi Kent,  

can you compile and run my code above on your hardware?

I'm curious to get a benchmark when the hardware differences are removed.

Since I have multiple versions above , let me know which one you tried (although they should be roughly equal).

If you post your latest, I'll reciprocate
>> How does randomization tie in with your initial step(s) of omitting the impossible low-values sums? <<


In order to do the binary search and quickly find the limiting value, the values must be sorted.

For the sake of convenience, I used that same sorted-in-value-order table as input to the matching selection.  

When we gen the matches, we specify "TOP (50000)" (for obvious performance reasons).  Since the data is sorted in value order, SQL consistently gives us matches containing the lowest values possible.  SQL's results are biased because of the physical order of the input values combined with the TOP clause.

To avoid that, the initial data can be "sorted" randomly, i.e. randomly reordered using NEWID(), for all levels except the final two, because they are matched by total value, and so must be indexed.

So, the random data is used for <desired level> - 2 levels.

Then, either a nonclustered index on the random data table, or a separate value-ordered table, is needed to produce the final 2-level join on value table.
Sorry, I got busy and forgot about this thread, but it looks like not much has happened since last time, so no catch-up reading is needed.

Here is the randomized c version.

I have 2 levels of randomization -

First simply randomizing the selection of the 1st number.  This, as expected, is fast, adding only a few microseconds total time; but also as expected does still lead to some degree of clustering in the results.

Second level of randomization is also quite simple but can increase the time. For levels 3 and above (because levels 1 and 2 have small result sets, I do them exhaustively)  I give each solution found only a 20% chance of being used.   This virtually eliminates clustering.

As an interesting side effect, the randomization can actually make finding the first 50000 results FASTER!  This is because of the offsets Kent found earlier where extreme values were known to be impossible.  So, with some randomization of the first value you'll sometimes get lucky and hop past bad combinations.

enjoy!

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#include <ctype.h>   // isprint()
#include <time.h>    // time() to seed srand()

#define DEPTH_LIMIT     5
#define SOURCE_ROWS     5470
#define BASE_TARGET     112.41
#define MAX_SOLUTIONS   50000
#define separator       printf("--------------------------------\n")


struct rh_vcrt0 {
    int vcr_emp_empresa,
     vcr_num_empregado,
     vcr_vnc_cod_venc,
     vcr_valor_bruto;
};


int target = (BASE_TARGET * 100.00);
char *in_filename;
unsigned rows = 0;
unsigned *random_indexes;
struct rh_vcrt0 *numbers;

struct rh_vcrt0 solution_value[DEPTH_LIMIT];

long max_solution[DEPTH_LIMIT];

FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;

unsigned max_depth;

unsigned myrand(int n) {
    int maxbias;
    int result;

    //To prevent modulo bias limit the random range
    // to the largest number evenly divisible by n
    maxbias = RAND_MAX - (RAND_MAX % n);

    //If random number is greater than the bias limit then pick a new number.
    //Theoretically this could run forever but in real life it won't.
    while (1) {
        result = rand();
        if (result <= maxbias) break;
    }

    return result % n;
}

void generate_random_indexes(void) {
    int i, r;
    unsigned temp;
        
    random_indexes = malloc(rows * sizeof (unsigned));
    
    for (i=0;i<rows;i++)
        random_indexes[i] = i;
    
    srand(time(NULL));
    for (i=rows-1;i>0;i--) {
        r = myrand(i);
        temp = random_indexes[r];
        random_indexes[r] = random_indexes[i];
        random_indexes[i] = temp;        
    }
}

int intcomp(const struct rh_vcrt0 *a, const struct rh_vcrt0 *b) {
   return a->vcr_valor_bruto - b->vcr_valor_bruto;
}

void read_input_file (void) {
    FILE *infile;
    int  dummy2, dummy3, ret, i,j;
    float dummy1, temp_value;
    struct rh_vcrt0 temp;
    

    infile = fopen ("sqldata.txt", "rt");

    numbers = malloc (SOURCE_ROWS * sizeof (struct rh_vcrt0));
    while (rows < SOURCE_ROWS) {
        ret = fscanf (infile, "(%d,%d,%d,%f)\n",
                            &temp.vcr_emp_empresa,
                            &temp.vcr_num_empregado,
                            &temp.vcr_vnc_cod_venc,
                            &temp_value);
        if (ret <= 0) break;
        // For efficiency, adjust and cast all values as integer cents;
        // the small offset adjusts for proper truncation
        temp.vcr_valor_bruto = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) );
        memcpy(&numbers[rows],&temp,sizeof(struct rh_vcrt0));
        rows++;
    }
    if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (struct rh_vcrt0));
    
    fclose(infile);
        
    qsort(numbers, rows, sizeof(struct rh_vcrt0), (int (*)(const void *,const void *))intcomp);
        
    for (i=0;i<DEPTH_LIMIT;i++)    {
        for (j=rows-1;j>=rows-1-i;j--)
            max_solution[i] += numbers[j].vcr_valor_bruto;
    }    

    generate_random_indexes();

    //printf("Read %ld rows from input file\n", rows);
    //for (i=0;i< rows;i++) printf("%4i: %i  %i\n",i,target,numbers[i].vcr_valor_bruto);
}

void print_solution(unsigned depth, struct rh_vcrt0 *last) {
    int i;

    for (i=1;i<depth;i++) {
        fprintf(outfile,"(%d,%d,%d,%.2f) ",
                            solution_value[i-1].vcr_emp_empresa,
                            solution_value[i-1].vcr_num_empregado,
                            solution_value[i-1].vcr_vnc_cod_venc,
                            solution_value[i-1].vcr_valor_bruto/100.0);
    }
                
    fprintf(outfile,"(%d,%d,%d,%.2f)\n",
                last->vcr_emp_empresa,
                last->vcr_num_empregado,
                last->vcr_vnc_cod_venc,
                last->vcr_valor_bruto/100.0);
}

// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
    int j;
    long temp=0;
    for (j=index+1;j<=index+max_depth-depth;j++)
        temp += numbers[j].vcr_valor_bruto;
    return temp;
}

void find_sums(unsigned depth, long partialsum, unsigned index) {
    unsigned i, j, lo, hi;
    
    function_call_count++;

    if (solution_count >= MAX_SOLUTIONS) return;

    if (depth < max_depth) {
        // Pick the first value of a multi-value sum randomly
        if (depth == 1) {
            for (i=0;i< rows;i++) {
                j = random_indexes[i];
                sum = numbers[j].vcr_valor_bruto;

                // Are we so far from the target that no values could
                // possibly get us there? If so, give up.                                
                if (sum + max_solution[depth-1] < target) continue;
                if (sum + min_solution(depth,j) > target) continue;
                
                // We're close enough that it's worth trying to continue        
                memcpy(&solution_value[depth-1], &numbers[j], sizeof(struct rh_vcrt0));
                find_sums(depth+1,sum,j+1);
            }
            
        }
        else {
            for (i=index;i< rows;i++) {
                sum = partialsum + numbers[i].vcr_valor_bruto;

                if (sum > target) return;                
    
                // Are we so far from the target that no values could
                // possibly get us there? If so, give up.                                
                if (sum + max_solution[depth-1] < target) break;
                if (sum + min_solution(depth,i) > target) break;
                
                // We're close enough that it's worth trying to continue        
                memcpy(&solution_value[depth-1], &numbers[i], sizeof(struct rh_vcrt0));
                find_sums(depth+1,sum,i+1);
            }
        }
    }
    else {
        lo = index;
        hi = rows-1;

        while (hi-lo > 1) {
            i = (hi+lo)>>1;
            if (partialsum + numbers[i].vcr_valor_bruto < target) lo = i;
            else if (partialsum + numbers[i].vcr_valor_bruto > target)    hi = i;
            else break;
        }
        if (partialsum + numbers[lo].vcr_valor_bruto == target) i = lo;
        else if (partialsum + numbers[hi].vcr_valor_bruto == target) i = hi;
        
        if (partialsum + numbers[i].vcr_valor_bruto == target) {
            if (depth <= 2 || rand() < RAND_MAX * .20) {
                ++solution_count;
                if (outfile) print_solution(depth,&numbers[i]);
            
                // We found a hit but there may be more than one element in the set with the same value
                // so count the neighbors until we exhaust them all                                        
                for (j=i-1;(j > index) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j--) {
                    ++solution_count;
                    if (outfile) print_solution(depth,&numbers[j]);
                }
                for (j=i+1;(j < rows) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j++) {
                    ++solution_count;
                    if (outfile) print_solution(depth,&numbers[j]);
                }
            }            
        }
    }
}

int main(int argc, char* argv[])
{
    struct timeval  start_time, end_time;
    unsigned long elapsed;
    int c;

    while ((c = getopt(argc, argv, ":t:i:o:?")) != -1) {
        switch (c) {
            case 't':
                target = (int) (atof(optarg) * 100.00);
                break;
            case 'i':
                in_filename = optarg;
                break;
            case 'o':
                outfile = fopen (optarg, "w");
                break;
            case '?':
                printf("\n%s [-t target] [-s ] [-i infile] [-o outfile]\n",argv[0]);
                printf("-t target is a floating point value,  default=%.2f\n",BASE_TARGET);
                printf("-i input file name, default sqldata.txt\n");
                printf("-o output file name\n");
                return 0;        
            default:
                if (isprint (optopt))
                    fprintf(stderr, "Unknown option `-%c'.\n", optopt);
                else
                    fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
                return 0;
        }
    }
    
    read_input_file();

    separator;
    printf("target sum: %f\n",target/100.0);
    separator;
    for (max_depth=1;max_depth<=DEPTH_LIMIT;max_depth++) {
        memset(&start_time,0,sizeof(start_time));
        memset(&end_time,0,sizeof(end_time));
        
        function_call_count = 0;
        solution_count = 0;
        gettimeofday(&start_time,NULL);
        find_sums(1,0,0);
        gettimeofday(&end_time,NULL);
        
        elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);

        printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
                max_depth,elapsed,solution_count,function_call_count);
        separator;
    }
    separator;
    return 0;
}

Open in new window

The main question I have, before trying to study this code, is: what file format did you use and did you place it in a specific location (or as a command argument)?
The input file was the file you provided above, but with the INSERT syntax removed, so only the values were left.
The first few lines looked like this...

(32,1,1,3641.00)
(32,1,6,133.44)
(32,10014,1,585.46)
(32,10014,22,6.72)

I put it in the same directory as the executable itself.
My program defaults to using a file called "sqldata.txt" but using the -i option you can specify a different file, including path.
This is basically the same thing as I posted previously, except I've parameterized just about everything.  There is no error checking; the program assumes you'll pass in valid parameters.  (A sample invocation is shown below the parameter list.)

You can specify:
the input file
an optional output file
the target number to build sums for
the depth limit (I've tested out to depth 1000)
max number of solutions to return (0=all)
randomization level
       0=not random, sequential bottom-up search
       1=random first number
       2=random selection of results found
       3=both 1 & 2
percentage selection
     - for random levels 2 and 3, sets the percentage selected from results found
     - lower percents will decrease clustering and create a greater spread of results
     - setting percent too low will make the process take longer and could result in under selection of results by eliminating too many
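For example, a run against the default input might look like this (reusing the somefile.c/testit names from the earlier compile example; every flag maps to an option described above):

gcc somefile.c -o testit
./testit -t 112.41 -d 5 -s 1000 -r 3 -p 20 -i sqldata.txt -o results.txt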


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#include <ctype.h>   // isprint()
#include <time.h>    // time() to seed srand()

#define DEFAULT_DEPTH       5    
#define DEFAULT_TARGET      112.41
#define DEFAULT_SOLUTIONS   50000
#define DEFAULT_RANDOM      3
#define DEFAULT_PERCENT     20
#define separator           printf("--------------------------------\n")


struct rh_vcrt0 {
    int vcr_emp_empresa,
     vcr_num_empregado,
     vcr_vnc_cod_venc,
     vcr_valor_bruto;
};


int target = (DEFAULT_TARGET * 100.00);
unsigned random_level = DEFAULT_RANDOM;
char *in_filename;
unsigned rows = 0;
unsigned *random_indexes;

unsigned long solution_limit = DEFAULT_SOLUTIONS;
unsigned depth_limit = DEFAULT_DEPTH;
struct rh_vcrt0 *numbers;
struct rh_vcrt0 *solution_value;
long *max_solution;
FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;
unsigned max_depth;
unsigned selection_percent=DEFAULT_PERCENT;

unsigned myrand(int n) {
    int maxbias;
    int result;
       
    // To prevent modulo bias limit the random range
    // to the largest number evenly divisible by n
    maxbias = RAND_MAX - (RAND_MAX % n);

    // If random number is greater than the bias limit then pick a new number.
    // Theoretically this could run forever but in real life it won't.
    while (1) {
        result = rand();
        if (result <= maxbias) break;
    }

    return result % n;
}

void generate_random_indexes(void) {
    unsigned i, r, temp;
        
    random_indexes = malloc(rows * sizeof (unsigned));

    for (i=0;i<rows;i++) 
        random_indexes[i] = i;

    for (i=rows-1;i>0;i--) {
        r = myrand(i);
        temp = random_indexes[r];
        random_indexes[r] = random_indexes[i];
        random_indexes[i] = temp;        
    }
    //for (i=0;i<rows;i++) printf("%ld\n",random_indexes[i]);
}

int intcomp(const struct rh_vcrt0 *a, const struct rh_vcrt0 *b) {
   return a->vcr_valor_bruto - b->vcr_valor_bruto;
}

void read_input_file (void) {
    FILE *infile;
    int  dummy2, dummy3, ret, i,j;
    float dummy1, temp_value;
    struct rh_vcrt0 temp;

    if (in_filename)
        infile = fopen (in_filename, "rt");
    else
        infile = fopen ("sqldata.txt", "rt");

    numbers = malloc (100 * sizeof (struct rh_vcrt0));
    while (1) {
        ret = fscanf (infile, "(%d,%d,%d,%f)\n", 
                            &temp.vcr_emp_empresa,
                            &temp.vcr_num_empregado,
                            &temp.vcr_vnc_cod_venc,
                            &temp_value);
        if (ret <= 0) break;
        // For efficiency, adjust and cast all values as integer cents;
        // the small offset adjusts for proper truncation
        temp.vcr_valor_bruto = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) );
        memcpy(&numbers[rows],&temp,sizeof(struct rh_vcrt0));
        rows++;
        if (rows % 100 == 0) {
            numbers = realloc (numbers, (100 + rows) * sizeof (struct rh_vcrt0));
        }
    }
    numbers = realloc (numbers, rows * sizeof (struct rh_vcrt0));
    
    fclose(infile);

    qsort(numbers, rows, sizeof(struct rh_vcrt0), (int (*)(const void *,const void *))intcomp);
    
    if (rows > 0) {
        max_solution = malloc (depth_limit * sizeof (long));
        max_solution[0] = numbers[rows-1].vcr_valor_bruto;    
        for (i=1;i<depth_limit;i++)    {            
            max_solution[i] = max_solution[i-1] + numbers[rows-i-1].vcr_valor_bruto;
        }    
        //for (i=0;i<depth_limit;i++) printf("%ld\n",max_solution[i]);

        if (random_level)  srand(time(NULL));
        if (random_level & 1) generate_random_indexes();
    
        //printf("Read %ld rows from input file\n", rows);
        //for (i=0;i< rows;i++) printf("%4i: %i  %i\n",i,target,numbers[i].vcr_valor_bruto);
    }
}

void print_solution(unsigned depth, struct rh_vcrt0 *last) {
    int i;

    for (i=1;i<depth;i++) {
        fprintf(outfile,"(%d,%d,%d,%.2f) ", 
                            solution_value[i-1].vcr_emp_empresa,
                            solution_value[i-1].vcr_num_empregado,
                            solution_value[i-1].vcr_vnc_cod_venc,
                            solution_value[i-1].vcr_valor_bruto/100.0);
    }
                
    fprintf(outfile,"(%d,%d,%d,%.2f)\n",
                last->vcr_emp_empresa,
                last->vcr_num_empregado,
                last->vcr_vnc_cod_venc,
                last->vcr_valor_bruto/100.0);
}

// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is 
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
    int j;
    long temp=0;
    for (j=index+1;j<=index+max_depth-depth;j++)
        temp += numbers[j].vcr_valor_bruto;
    return temp;
}

void find_sums(unsigned depth, long partialsum, unsigned index) {
    unsigned i, j, lo, hi;
    
    function_call_count++;

    if (solution_limit > 0 && solution_count >= solution_limit) return;

    if (depth < max_depth) {
        // Pick the first value of a multi-value sum randomly
        if (depth == 1 && (random_level & 1)) {
        
            for (i=0;i< rows;i++) {                
                j = random_indexes[i];
                sum = numbers[j].vcr_valor_bruto;

                // Are we so far from the target that no values could
                // possibly get us there? If so, give up.                                
                if (sum + max_solution[0] < target) continue;
                if (sum + min_solution(1,j) > target) continue;
        
                // We're close enough that it's worth trying to continue        
                memcpy(&solution_value[0], &numbers[j], sizeof(struct rh_vcrt0));
                find_sums(depth+1,sum,j+1);
            }
        }
        else {
            for (i=index;i< rows;i++) {
                sum = partialsum + numbers[i].vcr_valor_bruto;

                if (sum > target) return;                
    
                // Are we so far from the target that no values could
                // possibly get us there? If so, give up.                                
                if (sum + max_solution[depth-1] < target) break;
                if (sum + min_solution(depth,i) > target) break;
                
                // We're close enough that it's worth trying to continue        
                memcpy(&solution_value[depth-1], &numbers[i], sizeof(struct rh_vcrt0));
                find_sums(depth+1,sum,i+1);
            }
        }
    }
    else {
        lo = index;
        hi = rows-1;

        while (hi-lo > 1) {
            i = (hi+lo)>>1;
            if (partialsum + numbers[i].vcr_valor_bruto < target) lo = i;
            else if (partialsum + numbers[i].vcr_valor_bruto > target)    hi = i;
            else break;
        }
        if (partialsum + numbers[lo].vcr_valor_bruto == target) i = lo;
        else if (partialsum + numbers[hi].vcr_valor_bruto == target) i = hi;
        
        if (partialsum + numbers[i].vcr_valor_bruto == target) {
            if (depth <= 2 || !(random_level & 2)  || ((random_level & 2) && (rand() < RAND_MAX * (selection_percent/100.0)))) {
                ++solution_count;
                if (outfile) print_solution(depth,&numbers[i]);
                if (solution_limit > 0 && solution_count >= solution_limit) return;
            
                // We found a hit but there may be more than one element in the set with the same value
                // so count the neighbors until we exhaust them all                                        
                for (j=i-1;(j > index) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j--) {
                    ++solution_count;
                    if (outfile) print_solution(depth,&numbers[j]);
                    if (solution_limit > 0 && solution_count >= solution_limit) return;
                }
                for (j=i+1;(j < rows) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j++) {
                    ++solution_count;
                    if (outfile) print_solution(depth,&numbers[j]);
                    if (solution_limit > 0 && solution_count >= solution_limit) return;
                }
            }            
        }
    }
}

int main(int argc, char* argv[])
{
    struct timeval  start_time, end_time;
    unsigned long elapsed;
    int c;

    while ((c = getopt(argc, argv, ":t:s:d:r:p:i:o:?")) != -1) {
        switch (c) {
            case 't':
                target = (int) (atof(optarg) * 100.00);
                break;
            case 's':
                solution_limit = (unsigned) atol(optarg);
                break;
            case 'd':
                depth_limit = (unsigned) atoi(optarg);
                break;
            case 'r':
                random_level = (unsigned) atoi(optarg);
                break;                    
            case 'p':
                selection_percent = (unsigned) atoi(optarg);
                break;                
            case 'i':
                in_filename = optarg;
                break;
            case 'o':
                outfile = fopen (optarg, "w");
                break;
            case '?':
                printf("\n%s [-t target] [-s solution_limit] [-d depth_limit] [-r random_level [-p selection_percent]] [-i infile] [-o outfile]\n",argv[0]);
                printf("-t target is a floating point value,  default=%.2f\n",DEFAULT_TARGET);
                printf("-s solution limit, default=%ld\n", DEFAULT_SOLUTIONS);
                printf("-d depth limit, default=%ld\n", DEFAULT_DEPTH);
                printf("-r randomization level (0-3), default=%ld\n",DEFAULT_RANDOM);
                printf("-p selection percent (only applies for random level 2 and 3), default=%d\n",DEFAULT_PERCENT);
                printf("-i input file name, default sqldata.txt\n");
                printf("-o output file name\n");
                return 0;        
            default:
                if (isprint (optopt))
                    fprintf(stderr, "Unknown option `-%c'.\n", optopt);
                else
                    fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
                return 0;
        }
    }
    
    read_input_file();
    solution_value = malloc (depth_limit * sizeof (struct rh_vcrt0));
        
    separator;
    separator;
    printf("target sum:          %.2f\n",target/100.0);
    printf("solution limit:      %ld\n",solution_limit);
    printf("depth limit:         %ld\n", depth_limit);
    printf("randomization level: %ld\n",random_level);
    if (random_level & 2)
        printf("selection percent:   %ld\n",selection_percent);
    separator;
    separator;
        
    for (max_depth=1;max_depth<=depth_limit;max_depth++) {
        memset(&start_time,0,sizeof(start_time));
        memset(&end_time,0,sizeof(end_time));
        
        function_call_count = 0;
        solution_count = 0;
         gettimeofday(&start_time,NULL);
         find_sums(1,0,0);
        gettimeofday(&end_time,NULL);
        
        elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);

        printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
                max_depth,elapsed,solution_count,function_call_count);
        separator;
    }
    separator;
    return 0;
}

Open in new window

just for fun, I tested depths of 3000 and 4000 on an old slow computer to see how long it would take to find just 1 solution.

 
Depth: 3000
Elapsed time: 2637125 microseconds
Solutions found: 1
Call count: 989599


It's only one, but I was still surprised - 2.6 seconds

I tested again with depth 4000 and it's still running three hours later.

after 19 hours I gave up
I've been quite busy this week, which is why I haven't said anything yet. I haven't had time to test it yet, though I'll try to find some next week. I just wanted to let you all know that I haven't forgotten this and will get back to you as soon as possible.
I know EE has a "500 pts max per q" rule, but if ever there was a q that should be able to go over, it would be this one ...
I did say I'll award an extra 500 for each valid solution.
Hi Cluskitt,

Except that we'd have to decline and play by the rules...



Kent
I've seen this done before. I open a related question and award the points once you deliver a correct answer. I'll try to clear it up with the admins before doing so, but they should allow this, hopefully. After all, this IS a very hard question.
don't bother - we're not doing this for points, we're doing this one for fun.

The question morphed into a solvable problem when you added the 50000/1000/100 cap; then it became rather easy, as seen by the multiple variations above.

The discussion then revolved around going faster.

In all cases the suggestions are essentially just trying combinations sequentially until a match is found.  Of course SQL works in sets so you can't see the iteration as clearly as in a C array; but behind the scenes the SQL engines are joining a row at a time to build the sums. The clever bits are finding ways to try likely combinations faster until the cap is reached; but it's still basically just trial-and-error looping.  Without the cap, finding all combinations is possible but not feasible: no matter how fast it finds them, there are just too many, which is what I said from the very beginning.

So, the question now is:  Are any of the solutions fast enough or do we need to keep going?
I still haven't had time to try them out (sorry for that), but I'll get back as soon as I can try them out. Until then, don't bother improving things, unless you're having fun doing it.
I've created a new file and ran sdstuber's code. This is the result:
C:\Cygwin\home\Administrador>CombCalc.exe -i samples.txt -o results.tx
--------------------------------
--------------------------------
target sum:          112.41
solution limit:      50000
depth limit:         5
randomization level: 3
selection percent:   20
--------------------------------
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 15625 microseconds
Solutions found: 180
Call count: 3418
--------------------------------
Depth: 3
Elapsed time: 3312500 microseconds
Solutions found: 50000
Call count: 4186983
--------------------------------
Depth: 4
Elapsed time: 4609375 microseconds
Solutions found: 50000
Call count: 4909052
--------------------------------
Depth: 5
Elapsed time: 5796875 microseconds
Solutions found: 50000
Call count: 6086580
--------------------------------
--------------------------------

Open in new window

However, this is very confusing because Notepad doesn't recognize the line breaks. I have to paste it into an editor that does, like SSMS. I seem to recall some incompatibility between C's \n and Windows. Something about \n\p, but it's been too long since I've used C (at least 15 years).

I've also attached the new values file. I've used sdstuber's layout, but if any of you require a different syntax (like insert or similar), just ask and I'll provide.

@Kdo: You still haven't posted a complete solution, or will the first code you posted somewhere up above be enough?

@ScottPletcher: Can you post something that will join all the bits of SQL you provided (the one to create the table, one to generate results, etc) to provide some sort of linked solution? I've failed to get the actual results back from your solution (it just reports how many it found).

I'm going to ask a mod if I'm allowed to give you each a separate question for points. If not, we're going to have to come up with something, because splitting 500 points between the 3 of you isn't fair (even just 500 isn't really fair).
samples.txt
you can do a search for

\n  

and replace with

\n\r



using your new input file I get similar results

 sds -i samples.txt -o results.txt
--------------------------------
--------------------------------
target sum:          112.41
solution limit:      50000
depth limit:         5
randomization level: 3
selection percent:   20
--------------------------------
--------------------------------
Depth: 1
Elapsed time: 1 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 1767 microseconds
Solutions found: 180
Call count: 3418
--------------------------------
Depth: 3
Elapsed time: 533655 microseconds
Solutions found: 50000
Call count: 4395921
--------------------------------
Depth: 4
Elapsed time: 454458 microseconds
Solutions found: 50000
Call count: 3500731
--------------------------------
Depth: 5
Elapsed time: 489786 microseconds
Solutions found: 50000
Call count: 3612287
--------------------------------
--------------------------------

Open in new window

\n\r didn't solve it. Where I saw a white square before (and no line break), Notepad now shows two white squares. Still, this isn't really important. I can always open it with WordPad, SSMS, VS, etc.
do you have a hex editor to see what is actually at the end of each line?

or a text editor like Notepad++ that will show you the characters?
OpenOffice shows the first one (presumably \n) as a return (arrow symbol as in the keyboard's return key) and the second one (\r) as paragraph sign (the one that looks like a pi symbol).
OpenOffice is a word processor, not a text editor; I'm not sure what the output you've described means.

there are tons of free (as in no-cost) editors to try that will support end-of-line character display; Notepad++ is just one, or use a hex editor like frhed
Notepad++ shows them as LF and CR. Notepad still doesn't show line breaks, but as I said, this isn't important. As far as I've seen, notepad is the only one that doesn't. Every other editor/processor displays correctly, so that's good enough.
ok, I know you said it's not important, but try  \r\n   instead of \n\r
\r\n did solve the problem in notepad.
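For anyone curious, the Notepad behaviour is easy to reproduce with a tiny standalone program (not part of the solution code): classic Notepad only breaks lines on the CR+LF pair, and opening the file in binary mode keeps the C runtime from doing any newline translation of its own.

#include <stdio.h>

int main(void)
{
    // "wb" writes the bytes exactly as given, with no newline translation.
    FILE *f = fopen("line_endings.txt", "wb");
    if (!f) return 1;

    fprintf(f, "LF only - classic Notepad shows no break after this\n");
    fprintf(f, "CR+LF - every Windows editor breaks the line here\r\n");
    fprintf(f, "last line\r\n");

    fclose(f);
    return 0;
}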

As for the rest, are there no complete solutions? There seems to be no further feedback regarding this question. Kdo was still fiddling around with it, trying to return all results, and ScottPletcher still hasn't provided a way to get the results (just the number of results), let alone a single, integrated solution.

Regarding the points, I've been told that I really cannot award more than the points in this question. Which, in my opinion, is unfair. EE should have an option where the mods can, if they think there is cause for it, award more points than the max. Maybe even have some internal rule where you need a vote of x mods to allow that. Even just having a mod be able to do this (and only a mod) would allow for a fairer distribution of points and wouldn't be abused.
>>> are there no complete solutions?

I thought I did provide a complete solution.
What else did you need?
I meant, for the rest of the experts :)
Also, for the points, again, don't worry about them.  Everybody here knew the point system when we started on this thread.

I do appreciate that you value our efforts, and that is, above all, the greatest reward for me and I'm sure the others too.

Points vs difficulty/time - that's always a fuzzy calculation at best.

This was an enjoyable question for me; but in terms of how hard it was to work on or how much time it took, it doesn't rank in my top 10.  Probably not for the others either.  Like any question, we spent as much time as we wanted to get results we were comfortable with and now we're done - unless you need more.
Hi Cluskitt,

I backed away from the active discussion several weeks ago.  Stuber and I were chasing similar solutions, and it didn't look like a significantly better one was available than what he'd shown.  (My other time demands were incredible through some of this and my attention was needed elsewhere.)  With a high degree of data dependency in the problem, finding a solution that beats the others on a given dataset is fairly easy, but without several representative samples of data, or a clear understanding of the subtleties in the data patterns, it's impossible to say that Solution A is really better than Solution B.

Your question gave some of EE's more seasoned experts a chance to really think.  Do you know how rare that is in today's plug-and-play world?  EE, like most of the technology assist sites, sees mostly the same questions over and over again, in somewhat different form.  "What's the difference between 'df' and 'du'?"  "How do I put all the results on the same line?"  "What's a segmentation fault?"  "How do I make this SQL query faster?"  Answering those questions is what the site is all about.

But questions like yours, where several of us get to really think and collaborate are rare.  Instead of being concerned that we're not getting enough points for helping you, think of it as a Christmas present, where the kids get more enjoyment out of the cardboard box than the $500 item inside.  :)  It's still all good.

Besides, 500 points for answering "What's the command line switch to generate a loadmap" is charity.  :)


Kent
If you wish for more data, I can post more samples. I can post this month's sample of the large dataset to join the other two posted above and I can post a few of the smaller datasets. Like you, I am having fun with this whole question and was curious to see if a full 5 level solution can be achieved in a reasonable amount of time.

However, I have a working solution in C, so if you don't want to pursue this further, I understand. I'm usually busy as well, so I can relate not having much time to focus on an issue for fun.
what do you mean by "a full 5 level solution"  ?

If you mean finding 50000 results of 5-value combinations, then http:#a38639936  shows results being returned in less than half a second.  You had said 1000 results is more than enough so 50000 seems overkill.

If you mean an exhaustive list of all possible combinations, then what would you consider a "reasonable" time?  Also, since the number of combinations is potentially very large, how big of an output file is reasonable?  Is a multi-gigabyte result set acceptable?  

Does the total time include file io time?  Writing 50000 results isn't that big of a deal; but writing billions of bytes will take time.

One last thing: in comparing your results above to my results, it looks like the machine you're testing on took 5.7 seconds to return 50000 results, while mine took 0.48 seconds to do the same thing.  A factor of 11+ speed difference is pretty significant, especially if we try to ramp up the total volume.

For example, if I could somehow get all of the results in 30 minutes, that might be reasonable; but it would likely take your machine close to 6 hours to get the same thing.  Is that still reasonable?
Well, for all purposes, the solution presented works fine as it is. My curiosity just wondered if it was possible to create a full solution. So yes, I meant the full combination.

First, regarding time elapsed: seeing as this is a personal thing, I would be using my home PC, which is a LOT better. Also, seeing as it's not a professional thing anymore, I'd say a cap of about 2h on any computer would be good enough.

Second, file size: Obviously we need some sort of cap on this. The biggest problem with the file size isn't disk space, but how much time a text editor would take to open a huge file. The best solution would seem to be to break it into smaller files, maybe 10MB each, with some sort of upper cap on how many files are created. At home, I have no problem with disk space, so you can set your own cap, whatever you feel comfortable with or depending on how the data computes. I'd probably do it on the fly, depending on the results, rather than have some sort of limit off the bat.
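A minimal sketch of that rolling-output idea (hypothetical names; it assumes a 10 MB cap and output files named results_000.txt, results_001.txt, and so on):

#include <stdio.h>
#include <stdlib.h>

#define MAX_FILE_BYTES (10L * 1024L * 1024L)   // roughly 10 MB per output file

static FILE *out = NULL;
static long  bytes_written = 0;
static int   file_index = 0;

// Return the current output file, rolling over to a new results_NNN.txt
// once the size cap has been reached.
static FILE *rolling_output(void)
{
    char name[64];

    if (out == NULL || bytes_written >= MAX_FILE_BYTES) {
        if (out) fclose(out);
        sprintf(name, "results_%03d.txt", file_index++);
        out = fopen(name, "w");
        if (out == NULL) { perror(name); exit(1); }
        bytes_written = 0;
    }
    return out;
}

int main(void)
{
    long i;

    // Demo: write enough lines to force a couple of rollovers.
    for (i = 0; i < 1000000; i++)
        bytes_written += fprintf(rolling_output(), "solution line %ld\n", i);

    if (out) fclose(out);
    return 0;
}

In the posted programs, the fprintf calls in print_solution() would be the natural place to hook something like this in, counting each call's return value toward bytes_written.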

Most importantly, this would be for fun and curiosity's sake alone. It is not required for the answer (which I'll award soon regardless of this), and if you don't have time to waste on this, then it's totally understandable. I'm always willing to expand my programming skills, and if this is what it will take to refresh my C skills, I'm willing to take the challenge. Most of the hard work has been done already anyway, for which I am most grateful.

Most of all, I am extremely curious to find out if there is some other sort of algorithm that can be applied that will speed things up even more. I am willing to provide more samples if you wish to work on them. I can even load old backups just to create new (old) samples.

One other thing that I'm curious about: is it possible, somehow, to find out how many possible solutions there are BEFORE combining them all? I don't see how it could be done, but if you could find out beforehand the number of possibilities (preferably in a much shorter time than combining them all), you could probably use that to work the data in a different way. Not really sure how, but it does make me wonder.
There is no way (other than in extreme and trivial examples) to determine how many combinations there are without actually finding them.

By "extreme and trivial"  - all values positive and greater than your target = 0.
Or, all values negative and less than target = 0
or 5 largest values too small = 0

One value is target, all the rest are 0 = n * (n-1) * (n-2) * (n-3) * (n-4)  (i.e. too many to list individually)

I'll see how long it takes me to find one hundred million and then use that as a benchmark for whether it's feasible to pursue this further.
Actually, you can very quickly find the minimum value required for n entries to equal the total you want, and base the maximum number of possible solutions on that.

The specific number of actual solutions could indeed likely be done only by extensive processing.  Or, as I suggested earlier, try a mathematics forum -- a math expert might be able to help you find a quicker method.
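For a rough sense of scale (an upper bound on the search space, not a count of actual solutions), the raw number of k-row combinations out of n candidate rows is just C(n, k). A tiny standalone sketch, using the ~5470-row sample as n:

#include <stdio.h>

// C(n, k): how many ways to choose k rows out of n, ignoring the target
// entirely. It bounds the search space, not the number of real solutions.
static unsigned long long choose(unsigned n, unsigned k)
{
    unsigned long long result = 1;
    unsigned i;

    if (k > n) return 0;
    for (i = 1; i <= k; i++)
        result = result * (n - k + i) / i;   // stays an integer at every step
    return result;
}

int main(void)
{
    printf("C(5470, 5) = %llu combinations\n", choose(5470, 5));
    return 0;
}

On the sample data that works out to a number on the order of 10^16 five-row combinations, which is why exhaustively listing (or counting) the real solutions is off the table for unfriendly targets.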
100 million wasn't too bad (not counting write time); but even for just 4 combinations that's not sufficient to be exhaustive for this data set.


--------------------------------
--------------------------------
target sum:          112.41
solution limit:      100000000
depth limit:         5
randomization level: 1
--------------------------------
--------------------------------
Depth: 1
Elapsed time: 1 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 457 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 426157 microseconds
Solutions found: 256735
Call count: 4390127
--------------------------------
Depth: 4
Elapsed time: 155386389 microseconds
Solutions found: 100000000
Call count: 1639905075
--------------------------------
Depth: 5
Elapsed time: 206861600 microseconds
Solutions found: 100000000
Call count: 2167492848

Open in new window



>>>The specific number of actual solutions could indeed likely be done only by extensive processing.

yes, like I said above, it's only solvable for trivial scenarios.


>>>Actually, you can very quickly find the minimum value required for n entries to equal the total you want,

The C solution posted above does this.  In fact, it continually adjusts as necessary for various paths.  Not directly, mind you, but rather "backing into it" by checking min/max bounds as it proceeds.  However, I did not pre-filter the completely impossible values, instead favoring early exit; but there is no reason I couldn't do both.  Note, though, that pre-filtering is not guaranteed to remove enough values to have a significant impact.  In fact, it may remove nothing at all.  On the other hand, for other data sets and targets pre-filtering could be a huge benefit.  

But... that goes along with my initial and continual argument that true "solvability" for this problem is a feature of the data -  NOT the algorithm.
Ramping up to 1 billion was sufficient for 4-value exhaustion but not 5

Depth: 4
Elapsed time: 423734852 microseconds
Solutions found: 277343509
Call count: 4446882027
--------------------------------
Depth: 5
Elapsed time: 912573543 microseconds
Solutions found: 1000000000
Call count: 9379870185
--------------------------------


Again, this is just finding them, not writing them too.
I added rejection of extreme values to the previous code.  Using your original sample and target, only 20 values can be eliminated outright for being too large.  That is, no amount of negative values in your sample can be added to make a sum small enough to reach the target.  On the other end, all of the negatives are at least candidates.

Eliminating those 20 values didn't make a measurable difference in execution time, and with early exit already in place, removing the extra rows didn't even impact the number of calls needed.  That's not to say it's not a good idea to try; but for THIS particular set of data and THIS particular target, it didn't help.  For other samples/targets the benefit could be a lot bigger; in others it might be small or, again, nonexistent.

For your later set (samples.txt from http:#a38639890), there were no upfront rejections.  All of the values are near enough to the target that they are possible candidates.

Here is the early-reject version

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#include <ctype.h>   // isprint()
#include <time.h>    // time() to seed srand()

#define DEFAULT_DEPTH		5	
#define DEFAULT_TARGET		112.41
#define DEFAULT_SOLUTIONS 	50000
#define DEFAULT_RANDOM		3
#define DEFAULT_PERCENT		20
#define DEFAULT_FILE        "sqldata.txt"
#define separator 			printf("--------------------------------\n")


struct rh_vcrt0 {
	int vcr_emp_empresa,
     vcr_num_empregado,
     vcr_vnc_cod_venc,
     vcr_valor_bruto;
};

int target = (DEFAULT_TARGET * 100.00);
unsigned random_level = DEFAULT_RANDOM;
char *in_filename;
unsigned rows = 0;
unsigned *random_indexes;
unsigned long solution_limit = DEFAULT_SOLUTIONS;
unsigned depth_limit = DEFAULT_DEPTH;
struct rh_vcrt0 *numbers;
struct rh_vcrt0 *solution_value;
long *max_solution;
FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;
unsigned max_depth;
unsigned selection_percent=DEFAULT_PERCENT;

unsigned myrand(int n) {
	int maxbias;
	int result;
       
    //To prevent modulo bias limit the random range
    // to the largest number evenly divisible by n
    maxbias = RAND_MAX - (RAND_MAX % n);

    //If random number is greater than the bias limit then pick a new number.
    //Theoretically this could run forever but in real life it won't.
    while (1) {
    	result = rand();
    	if (result <= maxbias) break;
    }

	return result % n;
}

void generate_random_indexes(void) {
	unsigned i, r, temp;
		
	random_indexes = malloc(rows * sizeof (unsigned));

	for (i=0;i<rows;i++) 
		random_indexes[i] = i;

	for (i=rows-1;i>0;i--) {
		r = myrand(i);
		temp = random_indexes[r];
		random_indexes[r] = random_indexes[i];
		random_indexes[i] = temp;		
	}
	//for (i=0;i<rows;i++) printf("%ld\n",random_indexes[i]);
}

int intcomp(const struct rh_vcrt0 *a, const struct rh_vcrt0 *b) {
   return a->vcr_valor_bruto - b->vcr_valor_bruto;
}


unsigned reject_extremes(void)
{
    struct rh_vcrt0 *temp;
    unsigned i;
    long minsum = 0;
    long maxsum = 0;
    unsigned rejected = 0;

    if (rows <= depth_limit) return 0;
    
    for (i=0;i<depth_limit-1;i++) {
        minsum = minsum + numbers[i].vcr_valor_bruto;
        maxsum = maxsum + numbers[rows-i-1].vcr_valor_bruto;
	}	
/*
    If searching for an N-value solution and the largest value plus the sum of the N-1 smallest
    values is still too big, then the largest value can't be used.  Reject all values greater
    than the target minus the sum of the N-1 smallest values.
    Similarly, reject all values less than the target minus the sum of the N-1 largest values.

    Repeat this process until there are no rejections.
*/

    temp = malloc(rows * sizeof (struct rh_vcrt0));
    for (i=0;i<rows;i++) {
        if ((numbers[i].vcr_valor_bruto + minsum > target) || (numbers[i].vcr_valor_bruto + maxsum < target)) {
            printf("Rejected Extreme Value: (%d,%d,%d,%.2f)\n",
                    numbers[i].vcr_emp_empresa,
                    numbers[i].vcr_num_empregado,
                    numbers[i].vcr_vnc_cod_venc,
                    numbers[i].vcr_valor_bruto/100.0);
            rejected = rejected + 1;
        }
        else {
            // Compact the surviving values; i - rejected is the next free slot in temp
            memcpy(&temp[i - rejected], &numbers[i], sizeof (struct rh_vcrt0));
        }
    }
    if (rejected > 0) {
        rows -= rejected;
        temp = realloc(temp, rows * sizeof (struct rh_vcrt0));
    }

    free(numbers);
    numbers = temp;

    return rejected;
}


void read_input_file (void) {
	FILE *infile;
	int  ret, i;
	float temp_value;
	struct rh_vcrt0 temp;

	if (in_filename) infile = fopen (in_filename, "rt");
	else infile = fopen (DEFAULT_FILE, "rt");
	if (infile == NULL) {
		fprintf(stderr, "Unable to open input file\n");
		exit(1);
	}

	numbers = malloc(100 * sizeof (struct rh_vcrt0));
	while (1) {
		ret = fscanf (infile, "(%d,%d,%d,%f)\n", 
							&temp.vcr_emp_empresa,
						    &temp.vcr_num_empregado,
						    &temp.vcr_vnc_cod_venc,
						    &temp_value);
		if (ret <= 0) break;
		// For efficiency, scale all values to integer cents;
		// the small bias corrects for floating point truncation
		temp.vcr_valor_bruto = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) );
		memcpy(&numbers[rows],&temp,sizeof(struct rh_vcrt0));
		rows++;
		if (rows % 100 == 0) {
			numbers = realloc(numbers, (100 + rows) * sizeof (struct rh_vcrt0));
		}
    }
    numbers = realloc(numbers, rows * sizeof (struct rh_vcrt0));
	
	fclose(infile);

	qsort(numbers, rows, sizeof(struct rh_vcrt0), (int (*)(const void *,const void *))intcomp);
	
    while (reject_extremes());
    
	if (rows > 0) {
    	max_solution = malloc(depth_limit * sizeof (long));
    	max_solution[0] = numbers[rows-1].vcr_valor_bruto;	
		for (i=1;i<depth_limit;i++)	{			
			max_solution[i] = max_solution[i-1] + numbers[rows-i-1].vcr_valor_bruto;
		}	
		//for (i=0;i<depth_limit;i++) printf("%ld\n",max_solution[i]);

		if (random_level) {
            srand(time(NULL));
            if (random_level & 1) generate_random_indexes();
        }
		//printf("Read %ld rows from input file\n", rows);
		//for (i=0;i< rows;i++) printf("%4i: %i  %i\n",i,target,numbers[i].vcr_valor_bruto);
	}
}

void print_solution(unsigned depth, struct rh_vcrt0 *last) {
	int i;

	for (i=1;i<depth;i++) {
		fprintf(outfile,"(%d,%d,%d,%.2f) ", 
							solution_value[i-1].vcr_emp_empresa,
						    solution_value[i-1].vcr_num_empregado,
						    solution_value[i-1].vcr_vnc_cod_venc,
						    solution_value[i-1].vcr_valor_bruto/100.0);
	}
				
	fprintf(outfile,"(%d,%d,%d,%.2f)\n",
				last->vcr_emp_empresa,
			    last->vcr_num_empregado,
			    last->vcr_vnc_cod_venc,
			    last->vcr_valor_bruto/100.0);
}

// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is 
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
	unsigned j;
	long temp=0;
	// Stop at the end of the array; near the tail there may not be enough values left
	for (j=index+1;j<=index+max_depth-depth && j<rows;j++)
		temp += numbers[j].vcr_valor_bruto;
	return temp;
}

void find_sums(unsigned depth, long partialsum, unsigned index) {
	unsigned i, j, lo, hi;
	
	function_call_count++;

	if (solution_limit > 0 && solution_count >= solution_limit) return;

	if (depth < max_depth) {
		// Pick the first value of a multi-value sum randomly
		if (depth == 1 && (random_level & 1)) {
		
			for (i=0;i< rows;i++) {				
				j = random_indexes[i];
				sum = numbers[j].vcr_valor_bruto;

				// Are we so far from the target that no values could
				// possibly get us there? If so, give up.								
				if (sum + max_solution[0] < target) continue;
				if (sum + min_solution(1,j) > target) continue;
		
				// We're close enough that it's worth trying to continue		
				memcpy(&solution_value[0], &numbers[j], sizeof(struct rh_vcrt0));
				find_sums(depth+1,sum,j+1);
			}
		}
		else {
			for (i=index;i< rows;i++) {
				sum = partialsum + numbers[i].vcr_valor_bruto;

				if (sum > target) return;				
	
				// Are we so far from the target that no values could
				// possibly get us there? If so, give up.								
				if (sum + max_solution[depth-1] < target) break;
				if (sum + min_solution(depth,i) > target) break;
				
				// We're close enough that it's worth trying to continue		
				memcpy(&solution_value[depth-1], &numbers[i], sizeof(struct rh_vcrt0));
				find_sums(depth+1,sum,i+1);
			}
		}
	}
	else {
		if (index >= rows) return;	// nothing left to search at this level
		lo = index;
		hi = rows-1;
		i = lo;		// keep i initialized even if the binary search loop below is skipped

		while (hi-lo > 1) {
			i = (hi+lo)>>1;
			if (partialsum + numbers[i].vcr_valor_bruto < target) lo = i;
			else if (partialsum + numbers[i].vcr_valor_bruto > target)	hi = i;
			else break;
		}
		if (partialsum + numbers[lo].vcr_valor_bruto == target) i = lo;
		else if (partialsum + numbers[hi].vcr_valor_bruto == target) i = hi;
		
		if (partialsum + numbers[i].vcr_valor_bruto == target) {
			if (depth <= 2 || !(random_level & 2) || ((random_level & 2) && (rand() < RAND_MAX * (selection_percent/100.0)))) {
				++solution_count;
				if (outfile) print_solution(depth,&numbers[i]);
				if (solution_limit > 0 && solution_count >= solution_limit) return;
			
				// We found a hit but there may be more than one element in the set with the same value
				// so count the neighbors until we exhaust them all.
				// Scan downward toward index without letting the unsigned counter underflow.
				for (j=i;j > index;j--) {
					if (numbers[j-1].vcr_valor_bruto != numbers[i].vcr_valor_bruto) break;
					++solution_count;
					if (outfile) print_solution(depth,&numbers[j-1]);
					if (solution_limit > 0 && solution_count >= solution_limit) return;
				}
				for (j=i+1;(j < rows) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j++) {
					++solution_count;
					if (outfile) print_solution(depth,&numbers[j]);
					if (solution_limit > 0 && solution_count >= solution_limit) return;
				}
			}			
		}
	}
}

int main(int argc, char* argv[])
{
	struct timeval  start_time, end_time;
	unsigned long elapsed;
	int c;

	while ((c = getopt(argc, argv, ":t:s:d:r:p:i:o:?")) != -1) {
		switch (c) {
			case 't':
				target = (int) (atof(optarg) * 100.00 + (atof(optarg) > 0 ? 0.1 : -0.1));
				break;
			case 's':
				solution_limit = (unsigned long) atol(optarg);
				break;
			case 'd':
				depth_limit = (unsigned) atoi(optarg);
				break;
			case 'r':
				random_level = (unsigned) atoi(optarg);
				break;					
			case 'p':
				selection_percent = (unsigned) atoi(optarg);
				break;				
			case 'i':
				in_filename = optarg;
				break;
			case 'o':
				outfile = fopen (optarg, "w");
				break;
			case '?':
				printf("\n%s [-t target] [-s solution_limit] [-d depth_limit] [-r random_level [-p selection_percent]] [-i infile] [-o outfile]\n",argv[0]);
				printf("-t target is a floating point value,  default=%.2f\n",DEFAULT_TARGET);
				printf("-s solution limit, default=%ld\n", DEFAULT_SOLUTIONS);
				printf("-d depth limit, default=%ld\n", DEFAULT_DEPTH);
				printf("-r randomization level (0-3), default=%ld\n",DEFAULT_RANDOM);
				printf("-p selection percent (only applies for random level 2 and 3), default=%d\n",DEFAULT_PERCENT);
				printf("-i input file name, default sqldata.txt\n");
				printf("-o output file name\n");
				return 0;		
			default:
				if (isprint (optopt))
					fprintf(stderr, "Unknown option `-%c'.\n", optopt);
				else
					fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
				return 0;
		}
	}
	
	read_input_file();
	solution_value = malloc (depth_limit * sizeof (struct rh_vcrt0));
		
	separator;
	separator;
	printf("target sum:          %.2f\n",target/100.0);
	printf("solution limit:      %ld\n",solution_limit);
	printf("depth limit:         %ld\n", depth_limit);
	printf("randomization level: %ld\n",random_level);
    if (random_level & 2) printf("selection percent:   %ld\n",selection_percent);
	separator;
	separator;
		
	for (max_depth=1;max_depth<=depth_limit;max_depth++) {
		memset(&start_time,0,sizeof(start_time));
		memset(&end_time,0,sizeof(end_time));
		
		function_call_count = 0;
		solution_count = 0;
	 	gettimeofday(&start_time,NULL);
	 	find_sums(1,0,0);
		gettimeofday(&end_time,NULL);
		
		elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);

		printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
				max_depth,elapsed,solution_count,function_call_count);
		separator;
	}
	separator;
	return 0;
}
                               

Open in new window
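
For reference, a typical build and run might look like the following. The source file name sds.c is just an assumption; the flags are the ones handled in the getopt loop above.

gcc -O2 -o sds sds.c
./sds -i sqldata.txt -t 112.41 -d 4 -s 50000 -o results.txt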



I'm currently running a 5-value search for up to a trillion combinations, both with and without the extreme value rejections.  One trillion should still be insufficient to be exhaustive, but it should be large enough to show it's infeasible to do quickly.  I kicked off the first one about 4 hours ago.  I don't expect either of them to finish anytime soon.
I've posted a few more sample files from this month. The tiny subset isn't likely to be of help, but I thought I'd include it anyway.

@ScottPletcher: My father was a university math teacher. I've discussed this with him, but other than eliminating outliers and using a bias to eliminate negatives, which we're already doing, he didn't find any way to improve this. Granted, he's not a pure mathematician. Maybe some math wiz could find some way to do this, but that would be too much effort just for this. As I said, the question itself is answered and I have a working solution. Anything we do now is for fun and for the intellectual challenge. I don't see a need to go "pro" on this.
Big-Sample.txt
Smaller-Sample.txt
Smaller-Subset.txt
Tiny-Subset.txt
There were several bugs in my last post.
Don't use that code - sorry about that.

I'll fix those and repost, but then I'm not sure what you want to do with the new samples.  Did you have a new direction you wanted to try?
I just posted them because Kdo was saying he needed more examples to work with, in case you need to work with different data and different results. The way he put it was:

>> With a high degree of data dependency in the problem, finding a solution that beat others on a given dataset is fairly easy, but without several representative samples of data, or a clear understanding of the subtleties in the data patterns, it's impossible to say that Solution A is really better than Solution B.

If you feel you need more samples, I'd be happy to provide them.
ASKER CERTIFIED SOLUTION
Well, on the normal sample, we had to search for a value of 58.16 this month. There were no helpful results, so we used it on the filtered sample. That still didn't give any helpful results, but it was enough for us to look elsewhere.

Sometimes we have low values near 100, sometimes it's closer to 2-5k. It depends on what the problem is, really. I'd say it's mostly higher values, but lower ones also occur.
I'm running an unbounded search for 58.16 on your normal sample.
I'm not sure what you consider a helpful result.
There are nearly a billion 4-value solutions.  None of those are "helpful"?
It's still running for 5 values; unbounded, it might take days.


K$ sds -i bigsample.txt -r1 -s0 -t 58.16
Rejected Extreme Value: (32,2922,1,12643.00)
Rejected Extreme Value: (32,4490,107,21979.00)
--------------------------------
--------------------------------
target sum:          58.16
solution limit:      0
depth limit:         5
randomization level: 1
--------------------------------
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 550 microseconds
Solutions found: 502
Call count: 3403
--------------------------------
Depth: 3
Elapsed time: 577730 microseconds
Solutions found: 861643
Call count: 5688355
--------------------------------
Depth: 4
Elapsed time: 738364530 microseconds
Solutions found: 961143576
Call count: 7146891491
--------------------------------

Open in new window

Don't worry about helpful. In this case, we're just trying to generate all of them.

I did think of something which might help reduce the number of possible combinations. Once it finds a certain combination, it need not worry about the same codes again. That is, imagine that you're searching for 5 and you find 3 values with code 1 that reach your target: for example, 1+1+3. You don't need to search any other combination with 3 code 1s anymore, even if there are more, like 1+2+2.
I'd imagine this would cut down the number of combinations enormously.
I don't understand "you find 3 values with code 1", but your example is 1+1+3.
To me that looks like 2 values of 1 and one of 3.



Do you mean a 3-value solution that has "1" somewhere in it, one or more times?
And, by using "1", I can eliminate it from all further 3-value solutions?
Can I do the same with "3"?
I meant, if you find 3 values that have the same code but different values. Remember that the 4 values are company (always the same for each file), employee number, code, value. So, if you find a 3-value solution with 3 code 1s whose values are 1+1+3, you can ignore any further results with 3 code 1s, even if the values are valid (like 1+2+2).
oh - you mean looking at the "other" numbers.  - The ones we've basically been ignoring so far.

That should reduce the number of combinations; however, we'd have to store every combination we find in order to check each new set against the ones we've already found. Unless most of the combinations get eliminated, this will quickly become infeasible.
On further reflection, it might not really save that much work; it would only reduce the output.

For example - I find a 3 code-1 solution.

Then I find a 2 code-1 partial sum.  At this point I can eliminate all of the code-1 options for the third part.  However, I can also solve directly for that last option.  So it's likely faster to search for exactly the one value that might reach the target and then check whether it's a third code-1, rather than scanning all of the code-1 values first to reject them and then searching the leftovers for non-code-1 solutions.

Since any number of code-1 values can be included up to the last value, I don't get any savings.
Maybe you can. Instead of trying to run combinations based on values, you could try to run combinations based on codes. That is, you would try to find all code combinations that can produce a result. Seeing as there are usually only about 50 different codes, this would hugely decrease the number of possible solutions.
what are the rules of a code combination?  If it's simply adding them to see if they reach a target, then it's the same problem but with an extra step
The rules would be the same. I was just thinking that, if you take that extra step of checking the code, once you reach a solution, any solution, you can then exit earlier, as long as you're looping by code. I'm not sure if it would be a better way or not, just something we could possibly use?
do all values need to have the same code in a combination, or was that just an example?
That was just an example. What I meant was, if you find any solution, you can drop any other combination that have the same codes. If you find a solution that has codes 1,6,6,161,254, you can drop all other solutions that have those codes, no matter in which order they come.
Given that set of codes, can I ignore all other solutions with a code of 1 and any other codes?  Or only when in combination with those other codes too?  If the former, it's VERY helpful; so much so that it's probably worth being a new question.  If the latter, then it might not be.

If it is, it still might be worthy of a new question, because the current methods are based on value sorting and elimination.  This new requirement changes the game a lot.  So whether it's helpful or not, it's a different, albeit related, direction.
No. Only a distinct set of code combinations can be eliminated. You can eliminate all permutations, but you can't eliminate one code if you find any combination with it. I know that would be helpful, reducing the possible solutions to less than 100, but it isn't possible.
The exhaustive 5-value search finished on the bigsample provided above.
Close to a trillion results were found in 8 days, 19:12:20.409318.


K$ sds -i bigsample.txt -r1 -s0 -t 58.16
Rejected Extreme Value: (32,2922,1,12643.00)
Rejected Extreme Value: (32,4490,107,21979.00)
--------------------------------
--------------------------------
target sum:          58.16
solution limit:      0
depth limit:         5
randomization level: 1
--------------------------------
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 550 microseconds
Solutions found: 502
Call count: 3403
--------------------------------
Depth: 3
Elapsed time: 577730 microseconds
Solutions found: 861643
Call count: 5688355
--------------------------------
Depth: 4
Elapsed time: 738364530 microseconds
Solutions found: 961143576
Call count: 7146891491
--------------------------------
Depth: 5
Elapsed time: 760340409318 microseconds
Solutions found: 801566724743
Call count: 7627852415787
--------------------------------
--------------------------------

Open in new window

You let it run for almost 9 days? I would have stopped it long before that.

Have you thought about the unique-combination idea? Simply storing the combinations in an array and looping through it to find a match would take too much time, but the way I thought would be easier is a 5-dimensional array (for the 5-level depth) indexed from 0 to 999 (most codes are below 300, but just in case). When you find a combination, you just write a 1 at that position. For example, if you find a solution with codes 1,1,6,89,161, you would do something like:
MyArray[1][1][6][89][161] = 1;
Then you could check any combination without wasting time. I imagine this would save a lot of time. Maybe there's an even better way, but I think this one takes virtually no time to fill or check.
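
A minimal sketch of that idea, purely illustrative and not part of any code posted here: a literal 1000^5 array would need about 10^15 entries, so this version assumes instead that the five codes are sorted and packed into a single 64-bit key (assuming codes stay below 1024, so 10 bits each) and kept in a small hash set. All of the names (SEEN_BUCKETS, pack_codes, seen_check_and_add) are hypothetical.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SEEN_BUCKETS 1048576	/* power of two; size to the expected number of solutions */

struct seen_node { uint64_t key; struct seen_node *next; };
static struct seen_node *seen_table[SEEN_BUCKETS];

/* Comparison helper for sorting the five codes. */
static int code_cmp(const void *a, const void *b) {
	return (int)(*(const unsigned *)a) - (int)(*(const unsigned *)b);
}

/* Pack the five codes (assumed < 1024, i.e. 10 bits each) into one 64-bit key.
   Sorting first makes the key order-independent, so 1,6,6,161,254 and 254,161,6,6,1 map to the same key. */
static uint64_t pack_codes(const unsigned codes[5]) {
	unsigned sorted[5];
	uint64_t key = 0;
	int i;
	memcpy(sorted, codes, sizeof sorted);
	qsort(sorted, 5, sizeof sorted[0], code_cmp);
	for (i = 0; i < 5; i++) key = (key << 10) | (sorted[i] & 0x3FF);
	return key;
}

/* Returns 1 if this code combination was already seen; otherwise records it and returns 0. */
int seen_check_and_add(const unsigned codes[5]) {
	uint64_t key = pack_codes(codes);
	unsigned bucket = (unsigned)(key % SEEN_BUCKETS);
	struct seen_node *n;
	for (n = seen_table[bucket]; n; n = n->next)
		if (n->key == key) return 1;
	n = malloc(sizeof *n);
	n->key = key;
	n->next = seen_table[bucket];
	seen_table[bucket] = n;
	return 0;
}

A solution's five codes would be passed to seen_check_and_add() just before printing; if it returns 1, the combination is skipped. As noted above, though, this only suppresses duplicate output; by the time the check runs, the work of finding the combination has already been done.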
Since you have to solve for 4 values out of 5 by iteration and the 5th is practically free, adding an extra step at the 5th level as an alternative method of elimination doesn't really help much, because you've already done all the "hard" work.

If you do want to pursue this, since it's a new requirement and a new direction, I was rather waiting for that to be a new question.  Is there anything more on the original requirement or direction?
As I said, the original question is already answered and I was actually thinking about closing it tomorrow. Anything else would be for fun and sheer curiosity.

I don't know what you mean by the 5th being practically free. The 4th takes a bit over 12mins, the 5th takes almost 9 days.

Anyway, this is an easy adaptation which can be applied at the last level; I'm sure I can manage it once I have some time. I'll post it here once I do.
>>> The 4th takes a bit over 12mins, the 5th takes almost 9 days.

No, it takes about 12 minutes to find 4-value solutions, where the last value is found by a direct binary search.  That's not the same thing as iterating the first 4 values of 5-value solutions.
Ah, yes. I see. Anyway, I think the question itself has gone on long enough, as it was answered long ago. I will take the 9-day run as meaning that, unless there is a favourable set of data with a favourable target value, getting all possible solutions isn't feasible.
Thank you all for the time and effort put into this. I've selected one answer from each, even though there were many useful posts from all of you. But I don't think there is any need of breaking this up even further. You all deserve the points equally.
I do hope you start a new question about the code-filtering.  I'm not convinced it will help in general; but as with the rest of this question, it will be data dependent.  With the right data, it might be significant. It could be interesting to pursue it in any case.

It will take a large rewrite of the accepted solution here to implement such a requirement.

Please post back here if you do start such a thread to make sure we all know where to look for it.
I still think the 5-level search might run (noticeably?) faster by joining properly indexed, pre-computed 3-level sums to the 2-level sums, rather than going 3 to 4 to 5.

But, sorry, I don't have the time to test that out, esp. since I can't guarantee the results, although it would be interesting to see.
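
For what it's worth, here is a rough sketch of that split in array form rather than as SQL joins: pre-compute and sort all 2-value sums once, then for each 3-value partial sum binary-search the pair list for the exact complement. Everything here (pair_sum, build_pairs, find_five_via_pairs, values in integer cents) is hypothetical and only meant to illustrate the idea; it counts matches rather than printing them, and it ignores the extreme-value rejection and randomization used above.

#include <stdlib.h>

/* A pre-computed 2-value sum, with the two source indexes kept
   so overlap with the iterated triple can be rejected. */
struct pair_sum {
	long sum;
	unsigned a, b;		/* indexes into the value array, a < b */
};

static int pair_cmp(const void *x, const void *y) {
	long d = ((const struct pair_sum *)x)->sum - ((const struct pair_sum *)y)->sum;
	return (d > 0) - (d < 0);
}

/* Build and sort all 2-value sums: roughly n*(n-1)/2 entries,
   a few million for a few thousand rows. */
static struct pair_sum *build_pairs(const long *values, unsigned n, unsigned long *npairs) {
	unsigned long count = (unsigned long)n * (n - 1) / 2, idx = 0;
	struct pair_sum *pairs = malloc(count * sizeof *pairs);
	unsigned a, b;
	for (a = 0; a < n; a++)
		for (b = a + 1; b < n; b++) {
			pairs[idx].sum = values[a] + values[b];
			pairs[idx].a = a;
			pairs[idx].b = b;
			idx++;
		}
	qsort(pairs, count, sizeof *pairs, pair_cmp);
	*npairs = count;
	return pairs;
}

/* Count 5-value solutions as "3 iterated + 2 looked up".
   values[] must be sorted ascending; pairs[] comes from build_pairs(). */
static unsigned long find_five_via_pairs(const long *values, unsigned n,
                                         const struct pair_sum *pairs, unsigned long npairs,
                                         long target) {
	unsigned long found = 0, lo, hi, p;
	unsigned i, j, k;

	for (i = 0; i + 4 < n; i++)
		for (j = i + 1; j + 3 < n; j++)
			for (k = j + 1; k + 2 < n; k++) {
				long need = target - (values[i] + values[j] + values[k]);

				/* Binary search for the first pair whose sum is not less than 'need'. */
				lo = 0; hi = npairs;
				while (lo < hi) {
					p = (lo + hi) / 2;
					if (pairs[p].sum < need) lo = p + 1; else hi = p;
				}
				/* Walk every pair with exactly that sum.  Requiring pairs[p].a > k keeps the
				   whole 5-tuple in strict index order, so nothing is reused or double counted. */
				for (p = lo; p < npairs && pairs[p].sum == need; p++)
					if (pairs[p].a > k)
						found++;
			}
	return found;
}

The trade-off is memory: for a few thousand rows the pair table is a few million entries, which is manageable, but it grows quadratically with the row count, and as with everything else in this thread the actual payoff would be data dependent.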
I will probably open a question for each, but right now there isn't much time. Lots of work to do and too little time to do it in :P

Just wanted to let you know that I haven't forgotten this, though.
I still haven't had time to do this. Some economic rules changed at the start of the year, which created the need to change lots of things in the apps we use. I am still quite busy, but I expect I'll have some more time later this month. I'm sorry to keep you all waiting, but it just isn't possible for me at this point to open and monitor a complex question like this.

As soon as I can, I will open the questions and mention it here.