Progress 6: One RegEx to find them all.
The holy grail here is an all-in-one RegEx solution. In the last update I had something workable, but it required running a separate RegEx search for every property and feeding each result into the next search. Getting to a1, b2, c1, for example, would require running RegEx three times and feeding the result back in twice to get the value.
I put the time in and figured out how to nest the entire search so RegEx only runs once. Below is a proof of concept for matching every value (all 12) in my dataset using a single line of nested RegEx per search. It’ll climb a user-defined property hierarchy and return the value without needing any additional tools, and it can be produced programmatically.
RegExr: Learn, Build, & Test RegEx
1. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
2. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
3. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
4. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
5. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
6. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
7. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c2((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
8. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c2((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
9. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c2((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
10. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b3((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
11. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b3((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
12. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b3((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
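To see one of these expressions return a value, here is a minimal Python sketch. Several things in it are assumptions rather than part of the original proof of concept: the dataset is a made-up stand-in, its flat tab-delimited layout is only inferred from the expressions above, and `find_value` is a hypothetical helper name. Python's standard `re` module does not support `\K`, so the sketch swaps it for a named capture group; an engine with `\K` support could run expression 4 unchanged.

```python
import re

# Hypothetical flat dataset (not the author's): the number of tabs before a
# property name is its depth, and ":" separates a leaf property from its value.
DATA = (
    "\ta1\t\tb1:v1\t\tb2\t\t\tc1:v2\t\t\tc2:v3\t\tb3:v4"
    "\ta2\t\tb1:v5\t\tb2\t\t\tc1:v6\t\t\tc2:v7\t\tb3:v8"
)

# Expression 4 (a1 > b2 > c1) with \K replaced by a named group,
# because the standard-library re module has no \K.
PATTERN = (
    r"(^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?"
    r"\t{2}b2((?=\t{3}[^\t])|:).*?"
    r"\t{3}c1((?=\t{4}[^\t])|:)"
    r"(?P<value>.*?[^\t])(?=$|\t{1,3}[^\t])"
)


def find_value(pattern, text):
    """Run the nested expression once and return the captured value, or None."""
    match = re.search(pattern, text)
    return match.group("value") if match else None


print(find_value(PATTERN, DATA))  # v2 for the hypothetical data above
```

The single `re.search` call is the point: the whole a1, b2, c1 climb happens inside one expression, with no intermediate result fed back in.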
Nesting Examples:
Below are 2 examples of the nesting above broken into property searches.
Get value of: a1 b1
- Match everything inside the 1st property, difference: Be mindful of document start.
- Match everything inside the 2nd property, difference: forget matches prior to this value.
- Find where the 2nd property ends.
1. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?
2. \t{2}b1((?=\t{3}[^\t])|:)\K.*?
3. [^\t](?=$|\t{1,2}[^\t])
Merged: (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
Get value of: a1 b2 c1
- Match everything inside the 1st property, difference: Be mindful of document start.
- Match everything inside the 2nd property.
- Match everything inside the 3rd property, difference: forget matches prior to this value.
- Find where the 3rd property ends.
1. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?
2. \t{2}b2((?=\t{3}[^\t])|:).*?
3. \t{3}c1((?=\t{4}[^\t])|:)\K.*?
4. [^\t](?=$|\t{1,3}[^\t])
Merged: (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
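The two breakdowns above follow the same recipe, so the merged expression can be assembled from a list of property names. The following is a minimal Python sketch of that idea, not a finished implementation: `build_nested_pattern` is a hypothetical name, and escaping the property names with `re.escape` is an added assumption. It only builds the expression string; the result still uses `\K`, so it needs an engine that supports it.

```python
import re


def build_nested_pattern(path):
    """Return one nested expression for a property path, e.g. ["a1", "b2", "c1"].

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not path:
        raise ValueError("path needs at least one property name")
    depth = len(path)
    parts = []
    for level, name in enumerate(path, start=1):
        # Only the 1st property has to be mindful of the document start.
        start_guard = r"(^|[^\t])" if level == 1 else ""
        # A property is followed by a deeper child (lookahead) or by ":".
        child_or_value = r"((?=\t{" + str(level + 1) + r"}[^\t])|:)"
        # \K on the last property forgets everything matched before the value.
        tail = r"\K.*?" if level == depth else r".*?"
        parts.append(
            start_guard
            + r"\t{" + str(level) + "}"
            + re.escape(name)
            + child_or_value
            + tail
        )
    # Find where the last property ends: the document end, or the next
    # property at the same depth or shallower.
    parts.append(r"[^\t](?=$|\t{1," + str(depth) + r"}[^\t])")
    return "".join(parts)


print(build_nested_pattern(["a1", "b2", "c1"]))
```

Called with `["a1", "b1"]` and `["a1", "b2", "c1"]`, this reproduces the two merged expressions shown above character for character.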
Thoughts on what’s next…
I need to build the programmatic creation of the nested expression and put it inside a user-friendly function.
I’ve done some additional testing against the iterative proof of concept in Progress 5, but I need to be able to produce huge and very complicated datasets on the fly in order to really put this solution through its paces.
This’d be best served by a JSON converter, which I still need to deliver. The knock-on benefit is that people can use this solution with converted JSON.
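As a rough idea of what such a converter might look like, here is a minimal Python sketch. It rests on assumptions: the target layout (tabs for depth, ":" before a value, no line breaks) is only inferred from the expressions in this post, `to_tab_markup` is a hypothetical name, and the sketch handles nested JSON objects with scalar leaves only. Arrays, and values containing tabs, are left out.

```python
import json


def to_tab_markup(node, depth=1):
    """Flatten a nested dict into the assumed tab-depth markup (hypothetical helper)."""
    parts = []
    for name, value in node.items():
        indent = "\t" * depth
        if isinstance(value, dict):
            # A property with children is followed directly by deeper properties.
            parts.append(indent + name + to_tab_markup(value, depth + 1))
        else:
            # A leaf property is followed by ":" and its value.
            leaf = value if isinstance(value, str) else json.dumps(value)
            parts.append(indent + name + ":" + leaf)
    return "".join(parts)


source = json.loads('{"a1": {"b1": "v1", "b2": {"c1": "v2", "c2": 3}}}')
print(repr(to_tab_markup(source)))
```

Feeding generated JSON through something like this would be one way to produce large test datasets on the fly.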
Racing other libraries and languages using the same values in their preferred formats may be fun. The markup rules are extremely lightweight so this solution could be a contender.