Today I Learned
A long time ago I started using browser bookmarks, and carrying them around from system to system has been a pain. I hope there are people like me who want to carry their legacy bookmarks forward.
I’m not going to give the steps for exporting bookmarks from Firefox, as they are already straightforward. Still, if somebody needs help, please look here.
SED to the rescue
sed, the stream editor, is a popular choice among sysadmins and hardcore developers. Once you’ve exported, you’ll have an HTML file in the destination of your choice.
In your terminal, use the following command to filter out only the HREF part from all those nasty large links:
sed -r -e '/.* HREF=.*$/!d' -e 's/.* HREF="(.*)\" ADD_DATE.*$/\1/g' {INPUT FILE NAME}.html | uniq > {OUTPUT FILE NAME}.txt
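Let me break down the command:
- sed: the command-line stream editor that we’ll be using
- -r: enables extended regular expressions
- -e: introduces an expression
- '/.* HREF=.*$/!d': the first expression deletes any line that does not match the pattern, i.e. any line without an HREF attribute
- 's/.* HREF="(.*)\" ADD_DATE.*$/\1/g': the second expression replaces each matching line with the captured link
- {INPUT FILE NAME}.html: the input file (the HTML from the export)
- uniq: deletes duplicate links; it only removes adjacent duplicates, so pipe through sort first (or use sort -u) if the same link appears in several places
- > {OUTPUT FILE NAME}.txt: stores the extracted links in a text file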
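To see the two expressions at work, here is a small sketch that runs the same sed command on a made-up export. The helper name extract_links and the three sample bookmarks are assumptions for illustration only; a real Firefox export carries more attributes per line. It also contrasts uniq with sort -u on a duplicate that is not adjacent.

```shell
#!/bin/sh
# Sketch: run the extraction on a tiny, made-up bookmarks export.

# Hypothetical helper wrapping the sed command from above.
# Reads the named file(s), or stdin when no file is given.
extract_links() {
  sed -r -e '/.* HREF=.*$/!d' -e 's/.* HREF="(.*)\" ADD_DATE.*$/\1/g' "$@"
}

# Stand-in for the exported HTML file (contents are invented).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
    <DT><A HREF="https://example.com/a" ADD_DATE="1600000000" LAST_MODIFIED="1600000001">Site A</A>
    <DT><A HREF="https://example.org/b" ADD_DATE="1600000002">Site B</A>
    <DT><A HREF="https://example.com/a" ADD_DATE="1600000003">Site A again</A>
</DL><p>
EOF

echo "with uniq (removes adjacent duplicates only):"
extract_links "$sample" | uniq

echo "with sort -u (removes all duplicates, output is sorted):"
extract_links "$sample" | sort -u

rm -f "$sample"
```

In this sample the first listing still contains https://example.com/a twice, because the two copies are separated by another link; the second listing contains each link once, at the cost of losing the original order.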
Now, the output file can be managed with any VCS out there.
Improvements (December, 2020):
parallel --pipepart -a {INPUT FILE NAME}.html -j4 --roundrobin \
sed\ -r\ -e\ \'/.\*\ HREF\=\"\(.\*\)\\\"\ ADD_DATE.\*\$/\!d\'\
-e\ \'s/.\*\ HREF\=\"\(.\*\)\\\"\ ADD_DATE.\*.\*\$/\\1/g\' | \
uniq > {OUTPUT FILE NAME}.txt
Over time, the exported bookmarks.html file grew to around 50 MB. The browser took quite a while to export it, and sed took about 13 to 15 seconds to process the whole file. The improvement requires an extra package from GNU named parallel (sudo apt install parallel, or download it directly from the site).
Command breakdown:
- parallel: the parallel job execution tool from GNU
- --pipepart: tells parallel to split the input file into chunks and pipe each chunk to the command
- -a: the input file for parallel, here the bookmarks.html file
- -j4 --roundrobin: run 4 jobs in parallel, distributing the chunks among them in round-robin fashion
- The sed command here is shell-quoted using the parallel utility as follows:
$ parallel --shellquote
parallel: Warning: Input is read from the terminal. You either know what you
parallel: Warning: are doing (in which case: YOU ARE AWESOME!) or you forgot
parallel: Warning: ::: or :::: or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.
{PASTE the SED command from the above (before improvements)}
{SHELL QUOTED OUTPUT STRING}
- Copy the shell-quoted output string and use it as the command for the parallel utility
With parallel, the time dropped to 6 to 7 seconds, roughly half the original time.
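If the escaped string is hard to read, one alternative (not what the command above does) is to keep the two expressions in a sed script file and pass it with -f, so nothing needs shell-quoting. This is a minimal sketch, assuming GNU sed and GNU parallel. The function names and the files extract.sed, bookmarks.html and links.txt are hypothetical.

```shell
#!/bin/sh
# Sketch: keep the sed expressions in a script file instead of shell-quoting them.

# Write the two expressions, one per line, to the file named in $1.
# The quoted heredoc keeps $ and backslashes exactly as typed.
write_sed_script() {
  cat > "$1" <<'EOF'
/.* HREF=.*$/!d
s/.* HREF="(.*)\" ADD_DATE.*$/\1/g
EOF
}

# Run the sed script ($1) over the input file ($2) in 4 parallel jobs.
# parallel joins its arguments into one command line, so the script
# path should not contain spaces.
extract_links_parallel() {
  parallel --pipepart -a "$2" -j4 --roundrobin sed -r -f "$1" | uniq
}

write_sed_script extract.sed

# Only attempt the parallel run when the tool and the input are present.
if command -v parallel >/dev/null 2>&1 && [ -f bookmarks.html ]; then
  extract_links_parallel extract.sed bookmarks.html > links.txt
fi
```

The same script file also works without parallel, as in sed -r -f extract.sed bookmarks.html, which makes it easy to test the expressions on their own first.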
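When comparing the two variants, it is worth checking that they extract the same links, not just timing them. The sketch below assumes the two outputs were saved to separate files; same_links is a hypothetical helper. The chunks handled by different jobs may be written out in a different order, so the helper compares the sorted, de-duplicated contents.

```shell
#!/bin/sh
# Sketch: check that two link files contain the same set of links,
# regardless of order or repeated entries.
same_links() {
  left=$(sort -u "$1")
  right=$(sort -u "$2")
  [ "$left" = "$right" ]
}

# Demo with two invented files that list the same links in a different order.
seq_out=$(mktemp)
par_out=$(mktemp)
printf '%s\n' 'https://example.com/a' 'https://example.org/b' > "$seq_out"
printf '%s\n' 'https://example.org/b' 'https://example.com/a' > "$par_out"

if same_links "$seq_out" "$par_out"; then
  echo "same links"
else
  echo "different links"
fi

rm -f "$seq_out" "$par_out"
```

For the timings themselves, prefixing each full command with the shell's time keyword is enough; run each variant a few times, since the first run may also include the cost of reading the file from disk.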