linux - How To Extract Text Between HTML Tags With Or Condition Multiple Times
I have been researching how to extract title tags from HTML. I've pretty much figured out that regex and HTML don't mix, and that grep can be used. However, the code I found here looks like this:
awk -v RS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}'
Now, this works to find the text between title tags once. I know how I could make it run on every line: cat file; while read line; ...; done. However, I know that is not efficient and that there's a better way.
Secondly, in the file I need to keep the lines that start with the string '--'. I believe this requires adding an 'or' statement in awk to match both the title tags and any line starting with '--'.
The input file looks like this:
text text text
<title>random text of title 1</title>
random html stuff
--time--
xyz more random text
<title>random text of title 2</title>
hmtl text
--time--
text
<title>random text of title 3</title>
more text tags
--time--
text here
<title>random text of title 4</title>
random text html
--time--
The desired output:
<title>random text of title 1</title>
--time--
<title>random text of title 2</title>
--time--
<title>random text of title 3</title>
--time--
<title>random text of title 4</title>
--time--
I'm not great with awk, but I'm learning. I know there should be an option to print all matches; it's the 'or' statement I'm stuck on. I'm open to sed or grep if you think that's more efficient. Any help or direction is appreciated.
For the given input, grep is enough:
$ grep -o '<.*>\|^--.*' ip.html
<title>random text of title 1</title>
--time--
<title>random text of title 2</title>
--time--
<title>random text of title 3</title>
--time--
<title>random text of title 4</title>
--time--
-o : extract only the matching parts of each line
<.*> : extract from < up to the last > in the line
\|^--.* : alternate pattern; if the line starts with --, extract the whole line
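The same alternation can also be written with extended regular expressions, where a bare | means "or" and no backslash is needed. The following is a minimal sketch rather than part of the original answer: it builds a shortened copy of the sample input in a temporary directory (the line breaks in that sample are an assumption about how the file is laid out) and runs the pattern with -E.

```shell
#!/bin/sh
# Create a shortened sample file; the exact line layout is assumed.
tmp=$(mktemp -d)
cat > "$tmp/ip.html" <<'EOF'
text text text
<title>random text of title 1</title>
random html stuff
--time--
xyz more random text
<title>random text of title 2</title>
hmtl text
--time--
EOF

# -o prints only the matched part; -E enables extended regex,
# so the two alternatives are separated by a plain |.
result=$(grep -oE '<.*>|^--.*' "$tmp/ip.html")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because <.*> is greedy, a line holding several tags is printed from its first < to its last >, so this only isolates the title when the title is the sole tag on its line.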
To restrict the match to title tags:
grep -o '<title.*title>\|^--.*' ip.html
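Since the question asked about an 'or' condition in awk, here is one possible sketch of the same idea in awk. It is not taken from the answer above: it uses two independent rules, one that prints the title element found by match() and one that prints any line starting with --. The sample file and its line breaks are assumptions made for illustration.

```shell
#!/bin/sh
# Create a shortened sample file; the exact line layout is assumed.
tmp=$(mktemp -d)
cat > "$tmp/ip.html" <<'EOF'
text
<title>random text of title 3</title>
more text tags
--time--
text here
EOF

result=$(awk '
  # Rule 1: if the line contains a title element, print only that part.
  # match() sets RSTART and RLENGTH to the position and length of the match.
  match($0, /<title>.*<\/title>/) { print substr($0, RSTART, RLENGTH) }
  # Rule 2: print any line that starts with two dashes.
  /^--/ { print }
' "$tmp/ip.html")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Each rule is tested against every line, so no explicit loop or exit is needed, and awk reads the file only once. As with the grep pattern, the match is greedy: two title elements on one line would be printed as a single span.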