Best way to simulate multidimensional arrays "objects" in BASH?

That’s just some of them. There are many. I need more practice with awk and sed. Those two are very easy to use for simple use cases, but can get very complex.

Oh, and I left out the most important Bash cmd/tool…grep.

Love regexr!!! My go-to for building regex.

Progress 5: There is no /no/ in RegEx.

I built a more complicated test hierarchy and my RegEx broke in spectacular ways. I was also able to clean things up with the \K command (assuming KDE influence), which “cuts” everything previously matched out of the result, so I could clean up the “lookbehind” section.

Turns (?<=(?<=^|[^\t])\t{1}) into ((^|[^\t])\t{1})\K
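A quick way to check that swap from the shell, as a minimal sketch assuming GNU grep built with PCRE support (-P). The sample string and variable names are made up for illustration:

```shell
# Made-up sample: level-1 property "a1" holding a level-2 property "b1"
sample=$(printf '\ta1\t\tb1:DATA')

# Lookbehind form: assert a single leading tab without consuming it
with_lookbehind=$(printf '%s\n' "$sample" | grep -oP -e '(?<=(?<=^|[^\t])\t{1})[^\t]*[^\t]')

# \K form: consume the leading tab, then drop it from the reported match
with_keep=$(printf '%s\n' "$sample" | grep -oP -e '((^|[^\t])\t{1})\K[^\t]*[^\t]')

printf '%s\n' "$with_lookbehind" "$with_keep"
```

Both forms should report only the level-1 name, since b1 sits behind two tabs and neither pattern accepts that.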

Improvements:

  • Property name encapsulation is now explicitly defined. They must start at the beginning of the doc or ahead of a tab and end in either the exact amount of tabs required for containing properties or with a : identifying they have a simple value. This allows them to be empty (result is non-existing), prevents partial matches being treated as full matches and generally makes the syntax more robust.
  • Now accepts a search term for the property name.
  • Now “cuts” the relevant information for the next level search. No need to pipe into a second command.
  • Now recognizes end-of-file for encapsulation and not just tab. Previously the search would break under certain conditions.
  • Now explicitly requires the number of ending tabs be equal to or below the level of the property. This used to break the cuts if the hierarchy wasn’t always climbing.
  • “lookbehind” cleaned up with \K. Need to do speed tests but it makes life easier for now, and it’s easy to revert to (?<=.

Old RegEx: (1st layer search)

(?<=(?<=^|[^\t])\t{1})[^\t]*[^\t]

New RegEx: (1st layer search w/ search term)

(^|[^\t])\t{1}SEARCH_TERM((?=\t{2}[^\t])|:)\K.*?[^\t](?=$|\t{1,1}[^\t])

GIF below shows an example of iterating through the layers making the correct cut for the next layer search. You’ll need to imagine the next search only using what was highlighted in the previous.

RegExr: Learn, Build, & Test RegEx

(^|[^\t])\t{1}a1((?=\t{2}[^\t])|:)\K.*?[^\t](?=$|\t{1,1}[^\t])
(^|[^\t])\t{2}b2((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
(^|[^\t])\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])

[GIF: regex3]

The syntax can be inline (seen above) or use newlines as they’re ignored.

Syntax using newlines: (same data in the GIF)

	a1
		b1:DATA
		b2
			c1:DATA
			c2:DATA
		b3:DATA
	a2
		b1:DATA
		b2
			c1:DATA
			c2:DATA
		b3:DATA
	a3
		b1:DATA
		b2
			c1:DATA
			c2:DATA
		b3:DATA

Presently the RegEx is fully capable of picking any one value out of this syntax if the result of each level’s search is fed into the next. The command to return the value would look something like: $(MY_SEARCH_FUNC a2 b2 c1) # Returns “DATA”
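As a rough sketch of that level-by-level feeding (not the author's actual function): my_search_func below is a hypothetical helper that takes the data as its first argument, and the dataset is made up with distinct values so the result is easy to check. It assumes GNU grep with -P and data already stripped of newlines.

```shell
# Hypothetical helper (not the author's code): one grep per level, each result fed to the next.
# $1 = minified BAAML data, remaining arguments = property names, outermost first.
# Property names are used as-is, so regex metacharacters in them are not escaped.
my_search_func() {
	data=$1; shift
	level=1
	for name in "$@"; do
		regex='(^|[^\t])\t{'"$level"'}'"$name"'((?=\t{'"$((level + 1))"'}[^\t])|:)\K.*?[^\t](?=$|\t{1,'"$level"'}[^\t])'
		# Keep the quotes: the cut still carries the tabs the next level relies on
		data=$(printf '%s\n' "$data" | grep -oP -e "$regex")
		level=$((level + 1))
	done
	printf '%s\n' "$data"
}

# Made-up dataset with distinct values so each lookup is distinguishable
sample=$(printf '\ta1\t\tb1:A1B1\t\tb2\t\t\tc1:A1C1\t\t\tc2:A1C2\t\tb3:A1B3\ta2\t\tb1:A2B1\t\tb2\t\t\tc1:A2C1\t\t\tc2:A2C2\t\tb3:A2B3')

my_search_func "$sample" a2 b2 c1
```

Each pass runs the same pattern with the tab counts bumped by one, which is why the intermediate cuts must keep their leading tabs.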

I really need to build a JSON syntax converter to help test this thing…

I agree that bash has a limit to how far you can get fancy, and once things get too difficult in bash, then it’s time to move up to something like Python.

By Terminal Tuesday BASH will have multidimensional arrays in < 10 lines of code.

In bash, I mostly just do for-loops, while-loops, and shell-expansions (using wildcards), when it’s time to get fancier. I know a handful of keyboard shortcuts as well, and that’s how far I’ve taken it. Having said this, bash is indeed my favorite shell.

This is my love letter to BASH :heart:

Progress 6: One RegEx to find them all.

The holy grail here is an all-in-one RegEx solution. In the last update I had something workable, but it required running a RegEx search for every property and feeding each result into the next search. So getting to a1, b2, c1 for example would require running RegEx three times and feeding the result back in twice to get the value.

I put the time in and figured out how to nest the entire search so RegEx only runs once. Below is a proof of concept for matching every value (all 12) in my dataset using a single line of nested regex per search. It’ll climb a user defined property hierarchy and return the value without needing any additional tools and it can be produced programmatically.

RegExr: Learn, Build, & Test RegEx

1. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
2. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
3. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
4. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
5. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
6. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
7. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c2((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
8. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c2((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
9. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c2((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])
10. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b3((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
11. (^|[^\t])\t{1}a2((?=\t{2}[^\t])|:).*?\t{2}b3((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])
12. (^|[^\t])\t{1}a3((?=\t{2}[^\t])|:).*?\t{2}b3((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])

[Image: regex4]

Nesting Examples:

Below are 2 examples of the nesting above broken into property searches.

Get value of: a1 b1

  1. Match everything inside the 1st property, difference: Be mindful of document start.
  2. Match everything inside the 2nd property, difference: forget matches prior to this value.
  3. Find where the 2nd property ends.
1. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?
2. \t{2}b1((?=\t{3}[^\t])|:)\K.*?
3. [^\t](?=$|\t{1,2}[^\t])
Merged: (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b1((?=\t{3}[^\t])|:)\K.*?[^\t](?=$|\t{1,2}[^\t])

Get value of: a1 b2 c1

  1. Match everything inside the 1st property, difference: Be mindful of document start.
  2. Match everything inside the 2nd property.
  3. Match everything inside the 3rd property, difference: forget matches prior to this value.
  4. Find where the 3rd property ends.
1. (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?
2. \t{2}b2((?=\t{3}[^\t])|:).*?
3. \t{3}c1((?=\t{4}[^\t])|:)\K.*?
4. [^\t](?=$|\t{1,3}[^\t])
Merged: (^|[^\t])\t{1}a1((?=\t{2}[^\t])|:).*?\t{2}b2((?=\t{3}[^\t])|:).*?\t{3}c1((?=\t{4}[^\t])|:)\K.*?[^\t](?=$|\t{1,3}[^\t])

Thoughts on what’s next…

I need to build the programmatic creation of the nested expression and put it inside a user friendly function.
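One possible shape for that, as a sketch only: build_nested_regex is a hypothetical name, and it simply stitches together the same pieces shown in the nesting examples (document-start guard, one segment per property joined by .*?, then \K and the end-of-property lookahead).

```shell
# Hypothetical builder: prints the nested expression for a property path.
# Arguments = property names, outermost first. Names are inserted as-is (no regex escaping).
build_nested_regex() {
	regex='(^|[^\t])'
	level=0
	for name in "$@"; do
		level=$((level + 1))
		# Every property after the first is joined to the previous one with a lazy ".*?"
		if [ "$level" -gt 1 ]; then
			regex="$regex"'.*?'
		fi
		regex="$regex"'\t{'"$level"'}'"$name"'((?=\t{'"$((level + 1))"'}[^\t])|:)'
	done
	# Forget everything matched so far, then find where the last property ends
	regex="$regex"'\K.*?[^\t](?=$|\t{1,'"$level"'}[^\t])'
	printf '%s\n' "$regex"
}

build_nested_regex a1 b2 c1
```

For a1 b2 c1 this prints the same string as the "Merged" line of the second nesting example, so it can be passed straight to grep -oP.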

I’ve done some additional testing against the iterative proof of concept in progress 5 but I need to be able to produce huge and very complicated datasets on the fly in order to really put this solution through its paces.

This’d be best served with a JSON converter which I need to deliver. The knock-on benefit being people can use this solution with converted JSON.

Racing other libraries and languages using the same values in their preferred formats may be fun. The markup rules are extremely lightweight so this solution could be a contender.

Also trying to come up with a name for:

String based, tab delimited, multidimensional, RegEx indexing, associative arrays for BASH.

SBTDMRIAAFB doesn’t quite roll off the tongue.

You can use a negative lookaround in regex, (?!...) or (?<!...), to negate specific matches or to anchor a match (for example, match a specific string except when it is preceded by another specific string).

Just considering the time and brain power it would take to work through the building of that regex…gives me a headache.

But, what an educational experience that must have been. Regex is the bomb !!!

I am seriously growing an appreciation for the mysterious dark arts of RegEx. It’s been a ride but one I’ve enjoyed.

This is ground floor for what this project can do. I can see multiple datatypes, edit-in-place, complex structural manipulation and even prototypical inheritance on the horizon all designed to be BASH-first. I have a lot of projects I want to get to this year but the possibilities are there.

Deadline is tomorrow, RAWRRRR!!!

Progress 7: Preparing for JSON Megalodon. We’re going to need a bigger bloat.

MAA="multi-dimensional associative array"

Naming:

Thank you @MichaelTunnell and @Ethanol for helping me put this name together on very short notice with only a word salad to work with. #suggest: TWIL marketer speed run of FOSS websites.

The project will be called: BAAM
[ B ] ASH
[ A ] ssociative
[ A ] rrays
in
[ M ] ultidimensions

The markup language will be called: BAAML
[ B ] ASH
[ A ] ssociative
[ A ] rrays
in
[ M ] ultidimensions
[ L ] anguage

BAAML

Designed to be incredibly easy to read and write MAA’s by hand, it’s a highly parse-able, lightweight, BASH-friendly markup language.

Using tab delimitation, there are only 2 datatypes: arrays (only accessible to the parser) and strings (the output). As it’s multidimensional and highly permissive, any markup language can be easily converted into BAAML. It also ignores newlines so it can be human readable or “minified” all on one line.

If there’s demand I can build converters for other markups but for now it’ll have a JSON converter. Given JSON’s popularity that makes every popular markup a maximum of two hops from BAAML format.

It needs official documentation but here’s the initial showcase:

	Main Menu
		name:Home
		href:https://example.com
		submenu
			0
				name:Page 1
				href:https://example.com/page1
			one
				evalme:echo "Page 2"
				href:https://example.com/contact
			3
	anotherEmptyProperty
	JSON:{a:123,b:{c:"abc",d:null}}
	lots_of_symbols:!@#$%^&*()_+-=[]\{}|;':",./<>?
	body:You can write values on multiple
lines and use any character even :, just
don't use a tab!
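Since newlines carry no meaning, going from the human-readable form to the one-line form is just a matter of deleting them. A minimal sketch (baaml_minify is a made-up name, and the sample input is illustrative):

```shell
# Hypothetical helper: turn human-readable BAAML on stdin into the one-line form
baaml_minify() {
	# Tabs are the only structure, so every newline can simply be deleted
	tr -d '\n'
}

printf '\tMain Menu\n\t\tname:Home\n\t\thref:https://example.com\n' | baaml_minify
```

One consequence worth keeping in mind: newlines inside a multi-line value are deleted as well, so the text on either side of the break ends up joined with nothing in between.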

Converting JSON to BAAML:

I researched and experimented with a few cli JavaScript engines like SpiderMonkey and Rhino. For now I’m just going to do a NodeJS solution for JSON.

The following is a script that’ll convert JSON to BAAML. This should allow me to produce a BAAML Megalodon to test BAAM with huge, complex, real-world datasets and run some speed tests.

NodeJS script:

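#!/usr/bin/env nodejs
'use strict'

// Raw JSON text is expected as the first command-line argument
let jsonData = process.argv[2] || '';
// Literal tabs in the raw text become spaces before parsing
jsonData = jsonData.replace(/\t/g, ' ')

try {
	jsonData = JSON.parse(jsonData);
}
catch(e) {
	console.error('Failed to parse JSON');
	process.exit(1);
}

// Walk the parsed data, printing one tab per nesting level
function output(obj, level = 0){
	level++;
	let tabPrefix='\t'.repeat(level);
	for ( let prop in obj ){
		let val = obj[prop];
		if(val instanceof Object){
			console.log( tabPrefix + prop );
			output( val, level );
		}
		else {
			// Escaped \t sequences turn into real tabs after parsing, and tabs are
			// BAAML's only delimiter, so swap them for spaces in values too
			console.log( tabPrefix + prop + ':' + String(val).replace(/\t/g, ' ') )
		}
	}
}

output(jsonData);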

Usage:

sudo dnf install nodejs # Fedora/CentOS/RHEL
sudo apt install nodejs # Debian/Ubuntu

nano ./json_to_baaml
# Copy script above, paste and save
chmod u+x ./json_to_baaml

# The script reads JSON from its first argument rather than stdin, so here are two options for running:

# Option 1. Pipe JSON into first argument:
cat my_json.txt | ./json_to_baaml "$(</dev/stdin)" > ./my_baaml.txt

# Option 2. Literal JSON into 1st argument:
./json_to_baaml "`cat my_json.txt`" > ./my_baaml.txt

# Confirm output
cat ./my_baaml.txt

Anyone know of a massive, complicated and unchanging JSON dataset I can do tests on? Something around 1,000,000 lines?

I can create one but it’d be nice to get one from the wild that I can share wget speed test instructions for.

Solution:

https://forum.tuxdigital.com/t/its-terminal-tuesday/3057/32

I haven’t understood most of this thread but I must say it’s been exciting. My only wish would be if this could be named BAAM somehow. Then all at once you’ve got Bash, BAAM, shebang! (#!) And you’ll know just how hard that RegEx is going to hit you in the face before you go near it.

Thank you @Ethanol

I’m switching it to BAAM and BAAML

[ B ] ASH
[ A ] ssociative
[ A ] rrays
in
[ M ] ultidimensions

Progress 8: One small step for BASH, one multi-dimensional leap for Bourne Shell.

Took quite a lot of adjustment but I managed to get it working with /bin/sh.

sh has no concept of what an array is let alone an associative array so it definitely was fun going multidimensional. Granted BAAM uses grep so this isn’t a native sh solution but it’s still a native UNIX solution.

#!/usr/bin/env sh

# Option 1. Inline: Pass arbitrary markup into a variable
MY_DATA=`cat <<\EOF
	SpaceX
		headquarters
			address:Rocket Road
			city:Hawthorne
			state:California
		links
			website:https://www.spacex.com/
			flickr:https://www.flickr.com/photos/spacex/
			twitter:https://twitter.com/SpaceX
			elon_twitter:https://twitter.com/elonmusk
		name:SpaceX
		founder:Elon Musk
		founded:2002
		employees:8000
EOF
`
# Option 2. cat markup into a variable
# MY_DATA="`cat ./my_baaml.txt`"

# Remove newlines so grep can parse it as a whole
MY_DATA=`echo "$MY_DATA" | tr -d '\n'`

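BAAM (){
	# $1 = BAAML data with newlines removed, remaining args = property path
	# The first property sits at tab depth 1
	REGEX_STR='(^|[^\t])\t{1}'$2'((?=\t{2}[^\t])|:)\K.*?'
	LEVEL=1;DB=$1;shift;shift
	for PROP_NAME in "$@"; do
		# A property nested under depth LEVEL sits at depth LEVEL+1, its children at LEVEL+2
		REGEX_STR=${REGEX_STR}'\t{'$(($LEVEL+1))'}'$PROP_NAME'((?=\t{'$(($LEVEL+2))'}[^\t])|:)\K.*?'
		LEVEL=$(($LEVEL+1))
	done
	# The value ends at end-of-data or where the tab depth drops back to LEVEL or less
	REGEX_STR=${REGEX_STR}'[^\t](?=$|\t{1,'$LEVEL'}[^\t])'
	# Print grep's result directly so whitespace inside values is left untouched
	echo "$DB" | grep -oPe "$REGEX_STR"
}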

echo $(BAAM "$MY_DATA" SpaceX links website)

# Output:
https://www.spacex.com/

What’s next?

Currently working on a way to list property names belonging to a specific position. This’ll allow a user to iterate through the entire data structure without knowing any of the property names.

When it’s done, if you wanted to output every SpaceX link it’d look something like this:

LIST=$(BAAM_LIST "$MY_DATA" SpaceX links)

for ITEM in ${LIST}; do
	echo "The SpaceX $ITEM is $(BAAM "$MY_DATA" SpaceX links ${ITEM})"
done

# Theoretical output:
The SpaceX website is https://www.spacex.com/
The SpaceX flickr is https://www.flickr.com/photos/spacex/
The SpaceX twitter is https://twitter.com/SpaceX
The SpaceX elon_twitter is https://twitter.com/elonmusk
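BAAM_LIST itself isn't written yet, so the following is only a guess at one building block for it, not the author's implementation: given a chunk of minified BAAML and a tab depth, pull out the property names sitting at exactly that depth. The name baam_names_at_level and the sample chunk are made up; GNU grep with -P is assumed, and property names are assumed not to contain ":".

```shell
# Hypothetical helper: list property names found at one tab depth.
# $1 = minified BAAML text, $2 = tab depth to list
baam_names_at_level() {
	# A name starts after exactly $2 tabs and runs up to the next ":" or tab
	printf '%s\n' "$1" | grep -oP -e '(^|[^\t])\t{'"$2"'}\K[^\t:]+'
}

# Made-up chunk: the contents of "SpaceX links", which sit at depth 3
links=$(printf '\t\t\twebsite:https://www.spacex.com/\t\t\tflickr:https://www.flickr.com/photos/spacex/')

baam_names_at_level "$links" 3
```

The names come back one per line, which suits a for loop like the one above as long as the names contain no whitespace.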

Question:

grep is designed to repeat the same search on each line. Is there a UNIX-native way to perform a single search of an entire file or string using a PCRE regular expression?

Why I need this:

BAAM ignores newlines so the markup can be written in human readable or “minified” format. The problem is grep searches per line so if the markup is human readable I have to remove all the newlines to get it to perform a single search. I’m having to use:

MY_DATA=`echo "$MY_DATA" | tr -d '\n'`

This is VERY slow for big datasets. If I can fix this I can literally destroy the lookup times of every popular markup language from a cold start even in human readable format.