Bash command to convert an HTML page to a text file

I'm new to Linux. Could you help me convert an HTML page to a text file? The text file should have any images and links from the web page removed. I want to use only bash commands, not HTML-to-text conversion tools. For example, I want to convert the first page of Google search results for "computers".

Thanks

Answers

Answer 1

I have used python-boilerpipe and it has worked very well so far...

Answer 2

The simplest way is to use something like this, which dumps (in short) the text version of the rendered HTML.

remote file

lynx --dump www.google.com > file.txt
links -dump www.google.com

local file

lynx --dump ./1.html > file.txt
links -dump ./1.html
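
If the links should not be listed in the output, lynx has a -nolist option that, as far as I know, suppresses the list of references at the end of the dump. Below is a small sketch that falls back to links when lynx is missing. html_to_text is a made-up helper name, not part of either tool.

```shell
#!/bin/sh
# Hypothetical helper: print a URL or local HTML file as plain text.
# $1: URL or path to an HTML file
html_to_text() {
    if command -v lynx >/dev/null 2>&1; then
        # -nolist: leave out the list of link references in the dump
        lynx -dump -nolist "$1"
    elif command -v links >/dev/null 2>&1; then
        links -dump "$1"
    else
        echo "html_to_text: neither lynx nor links is installed" >&2
        return 127
    fi
}
```

It would be used as html_to_text www.google.com > file.txt or html_to_text ./1.html > file.txt.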

Answer 3

There is html2text.py for the command line.

Usage: html2text.py [(filename|url) [encoding]]

Options:
  --version             show program version number and exit
  -h, --help            show this help message and exit
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevent when -g is
                        specified as well
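
Since the question asks for links and images to be dropped, the flags from the help output above can be combined. A minimal sketch, assuming html2text.py is on your PATH; page_to_plain_text is a hypothetical wrapper name chosen for this example.

```shell
#!/bin/sh
# Hypothetical wrapper that uses only flags shown in the help text above.
# $1: input HTML file or URL
# $2: output text file
page_to_plain_text() {
    # --ignore-links / --ignore-images: no formatting for links and images
    # --body-width=0: do not wrap output lines
    html2text.py --ignore-links --ignore-images --body-width=0 "$1" > "$2"
}
```

For example: page_to_plain_text stuff.html stuff.txt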

Answer 4

On OS X you can use the textutil command-line tool to batch convert HTML files to txt format:

textutil -convert txt *.html

Answer 5

You can get Node.js and globally install the html-to-text module:

npm install -g html-to-text

Then use it like this:

html-to-text < stuff.html > stuff.txt

Answer 6

On Ubuntu/Debian, html2text is a good choice. http://linux.die.net/man/1/html2text

Answer 7

Using sed

sed -e 's/<[^>]*>//g' foo.html
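
The one-liner above leaves HTML entities such as &amp; in place and keeps empty lines where tags used to be. A slightly extended sketch, still plain sed, might look like the following. strip_tags is just a name chosen here, and it only handles tags that open and close on the same line; script and style contents are not removed.

```shell
#!/bin/sh
# Hypothetical filter: reads HTML on stdin, writes rough plain text to stdout.
strip_tags() {
    # 1. remove tags, 2. decode a few common entities (&amp; last, so that
    # "&amp;lt;" ends up as "&lt;" and not "<"), 3. drop blank lines
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' \
        -e '/^[[:space:]]*$/d'
}
```

For example: strip_tags < foo.html > foo.txt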

Answer 8

I think links is the most common tool for this. Check man links and search for "plain text" or similar. -dump is my guess; search for that too. The program ships with most distributions.

Answer 9

Batch mode for local htm and html files; lynx required.

#!/bin/sh
# h2t, convert all htm and html files of a directory to text

for file in *.htm *.html
do
    # skip a pattern that matched nothing
    [ -e "$file" ] || continue
    # strip the extension (htm or html) and append txt
    new="${file%.*}.txt"
    lynx -dump "$file" > "$new"
done

Answer 10

Bash script to recursively convert HTML pages to text files. Applied to the httpd manual. It makes grep -Rhi 'LoadModule ssl' /usr/share/httpd/manual_dump -A 10 convenient to use.

#!/bin/bash
# Adapted from ewwink, recursive html to txt dump
# Recursively (4 levels deep) dumps the /usr/share/httpd manual as text into a manual_dump directory, keeping the subdirectories
# put this script in /usr/share/httpd for it to work (after installing httpd-manual rpm)
# bash is needed for the brace expansion in the glob below

shopt -s nullglob   # patterns that match nothing expand to nothing
for file in ./manual/*{,/*,/*/*,/*/*/*}.html
do
    new=${file#./manual/}   # path relative to ./manual
    new=${new%.html}        # drop the .html extension
    mkdir -p "./manual_dump/$(dirname "$new")"
    lynx --dump "$file" > "./manual_dump/${new}.txt"
done
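
The glob in the script above stops at four directory levels. As a hedged alternative, a sketch built on find can walk any depth and works in plain sh. dump_tree and the CONVERT variable are names invented for this illustration; the converter defaults to lynx -dump.

```shell
#!/bin/sh
# Hypothetical helper: mirror every .html file under $1 as a .txt file under $2.
# CONVERT (an assumption of this sketch) names the converter command;
# it defaults to "lynx -dump". File names containing newlines are not supported.
dump_tree() (
    src=${1%/}
    dst=${2%/}
    find "$src" -type f -name '*.html' |
    while IFS= read -r file; do
        rel=${file#"$src"/}              # path relative to the source directory
        out="$dst/${rel%.html}.txt"      # same relative path, .txt extension
        mkdir -p "$(dirname "$out")"
        # stdin is redirected so the converter cannot eat the file list
        ${CONVERT:-lynx -dump} "$file" < /dev/null > "$out"
    done
)
```

For the httpd manual this would be called as dump_tree ./manual ./manual_dump from /usr/share/httpd.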