Introduction to Python

Introduction

Python is an excellent, cross-platform, object-oriented interpreted language. Besides ease of use, its main characteristic is that it enforces indentation (don't indent, and the program won't run.)

As of September 2004, there are weaknesses to be aware of if you intend to use Python to write GUI apps for Windows, though:

Setup

At least three distributions of Python are currently available for the Windows platform (PythonWare used to be yet another package, but it's been deprecated):

If you only need a basic distribution, try out Tiny Python.

The "import" statement looks for module files in the directories listed in sys.path, which includes the directories specified in the PYTHONPATH environment variable. If the named module isn't found in these directories, an ImportError is raised. The first time Python imports a module, it automatically compiles the module and saves it as bytecode; this bytecode file has the same name as the module file, but ends in a .pyc extension. These .pyc files are automatically recompiled if the module changes in any way.

"On Windows, you can also use extension .pyw and interpreter program pythonw.exe instead of .py and python.exe. The w variants run Python without a text-mode console, and thus without standard input and output. These variants are appropriate for scripts that rely on GUIs. You normally use them only when the script is fully debugged, to keep standard output and error available for information, warnings, and error messages during development."

If you are using UltraEdit as your favorite editor, here's the section to add in UE's wordfile.txt to handle Python documents.

Installing Python 2 and Python 3 on the same Windows host

C:\Python27\python.exe

C:\Users\fred\AppData\Local\Programs\Python\Python37-32\python.exe

Checking script syntax
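
One possible way to check a script's syntax without running it is the built-in compile() function, which raises SyntaxError when the source cannot be parsed. The sketch below assumes Python 3; the helper name check_syntax and the two sample snippets are made up for illustration.

```python
def check_syntax(source, filename="<string>"):
    """Return None if source compiles, else a short error description."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        return "%s, line %s: %s" % (filename, err.lineno, err.msg)
    return None

good = "for i in range(3):\n    print(i)\n"
bad = "for i in range(3)\n    print(i)\n"   # missing colon

print(check_syntax(good))
print(check_syntax(bad))
```

From the command line, "python -m py_compile myscript.py" performs a similar check on a file.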

Concepts

module

library

package

namespace

Wheel, .whl, "allows for binary redistribution of libraries"

pip

Why import a library twice?

eg.

import mylib

import mylib.lib

How to find the list of methods/properties a library offers?

How to uninstall a module?

c:\>pip list

c:\>pip uninstall somemodule

Data Structures

Array

All entries must be of the same data type.

import array as arr

a = arr.array("I",[3,6,9])
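
A short sketch of what "same data type" means in practice, assuming Python 3: appending a value of the declared type works, while a value of another type is rejected.

```python
import array as arr

a = arr.array("I", [3, 6, 9])   # "I" = unsigned int
a.append(12)                    # same type: accepted

try:
    a.append("twelve")          # wrong type: rejected
except TypeError as err:
    print("Rejected:", err)

print(a.tolist())
```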

List

Collection of heterogeneous items. Mutable.

x = []

x1 = [1,'apple',3]

print(x1[1])

List vs. array?

"With arrays, you can perform an operations on all its item individually easily, which may not be the case with lists", eg.

array_char.tobytes() # tostring() in older versions; it was removed in Python 3.9

"NumPy arrays are very heavily used in the data science world to work with multidimensional arrays. They are more efficient than the array module and Python lists in general."

Tuples

"Tuples are another standard sequence data type. The difference between tuples and list is that tuples are immutable, which means once defined you cannot delete, add or edit any values inside it."

Tuples are enclosed in parentheses.

x_tuple = (1,2,3,4,5)

y_tuple = ('c','a','k','e')

x_tuple[0]

Dictionary

"Dictionaries are made up of key-value pairs. key is used to identify the item and the value holds as the name suggests, the value of the item."

Dictionaries are built with curly brackets.

x_dict = {'Edward':1, 'Jorge':2, 'Prem':3, 'Joe':4}

del x_dict['Joe']

x_dict

{'Edward': 1, 'Jorge': 2, 'Prem': 3}

x_dict['Edward'] # Prints the value stored with the key 'Edward'.

Sets

'Sets are a collection of distinct (unique) objects. These are useful to create lists that only hold unique values in the dataset. It is an unordered collection but a mutable one, this is very helpful when going through a huge dataset.'

y_set = set('COOKIE')

print(y_set) # Single unique 'o'

{'I', 'O', 'E', 'C', 'K'}

Collections, heapq

Those are additional data structures.
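
As a rough illustration (Python 3 assumed), here are three of these structures: Counter and deque from the collections module, and a heap built with heapq. The sample data is arbitrary.

```python
from collections import Counter, deque
import heapq

# Counter: tally hashable items
counts = Counter("COOKIE")

# deque: list-like container with fast appends and pops at both ends
queue = deque([1, 2, 3])
queue.appendleft(0)
first = queue.popleft()

# heapq: keeps the smallest item at index 0 of a plain list
heap = [5, 1, 4]
heapq.heapify(heap)
heapq.heappush(heap, 2)
smallest = heapq.heappop(heap)

print(counts["O"], first, smallest)
```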

Code Snippets

Running an external program
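
A minimal sketch using the standard subprocess module, assuming Python 3.7 or later. To keep the example independent of any installed tool, the "external program" here is simply a second Python interpreter; replace the argument list with your own command.

```python
import subprocess
import sys

# Run another Python interpreter as the external program
result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,   # collect stdout and stderr
    text=True,             # decode them as text rather than bytes
)

print(result.returncode)
print(result.stdout.strip())
```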

    

Leaving for loop early

"break" or "continue"

Operators

Watch out when using shortcuts like += on large strings, as they seem to be much slower than the more lengthy "mystring = mystring + something".

File I/O

Checking if a directory exists

Either...

import os
try:
    os.mkdir("./mydir")
except OSError:
    #the directory already exists (or could not be created)
    pass

... or

import os
if not os.path.isdir("./mydir"):
    os.mkdir("./mydir")
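
In Python 3, a third option is os.makedirs() with exist_ok=True, which creates the directory if needed and does nothing if it is already there. The sketch below works in a temporary location so it can be run anywhere; the directory name is just an example.

```python
import os
import tempfile

base = tempfile.mkdtemp()              # scratch location for the demo
target = os.path.join(base, "mydir")

os.makedirs(target, exist_ok=True)     # creates the directory
os.makedirs(target, exist_ok=True)     # second call: no exception raised

print(os.path.isdir(target))
```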

Writing to a text file

log = open('test.txt','w')
log.write("Some string")
log.close()

Caution: Under Windows, \r\n turns into 0D0D0A. To get the expected 0D0A, just use \n .

Important: Although Python3 uses Unicode, it happily writes data in Latin1 under Windows unless told otherwise:

stuff = "Crème"
with open("cp1252.txt", 'w') as outFile:
    outFile.write(stuff)
with open("utf8.txt", mode='w',encoding='utf-8') as outFile:
    outFile.write(stuff)

Reading from a text file in one go

f = open("c:/test.txt", "r")
data = f.read()
print data
f.close()

Reading from a text file, line by line

f = open("c:/test.txt", "r")
textlines = f.readlines()
for line in textlines:
    print line
f.close()
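
In Python 3, the same loop is often written with a "with" block, which closes the file automatically, and by iterating over the file object directly instead of calling readlines(). A small sketch, which first creates its own sample file in a temporary directory:

```python
import os
import tempfile

# Create a small sample file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first\nsecond\n")

lines = []
with open(path, "r", encoding="utf-8") as f:   # closed automatically
    for line in f:                             # one line at a time
        lines.append(line.rstrip("\n"))

print(lines)
```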

Reading from a text file, editing each line, saving into a new file

import re
 
MAGIC = 10
 
f = open("C:\\input.txt", "r")
textlines = f.readlines()
f.close()
 
#rewrite lines to new file
log = open('output.txt','w')
 
#search for a leading {number} using regex
p = re.compile(r'^\{(.+?)\}')
for line in textlines:
        m = p.search(line)
        if m:
                nugget = int(m.group(1))
                nugget += MAGIC
                
                #update line
                new = "{%s}" % nugget
                line = p.sub(new,line)
        #print adds an extra newline, so use write() instead
        log.write(line)
log.close()

Finding if a file is missing from a directory

We'll read a list of files from a text file, and then check if the file exists:

import os.path
 
PATH="C:\\MYDIR\\"
 
f = open(PATH + "files.txt", "r")
textlines = f.readlines()
for line in textlines:
        line = line.strip()
        if not os.path.isfile(PATH + line):
                print "%s NOT FOUND" % line
f.close()

Append stuff to a text file

A first way is to open a file in "a" mode:

f = open("c:/test.txt", "a")
f.write("This is an appended line.\r\n")
f.close()

Another way:

import glob
    
f = open("stuff.to.add.txt", "r")
template = "\n\n" + f.read()
f.close()
 
for frm in glob.glob('*.txt'):
        f = open(frm, "r+")
        content = f.read()
        if 'my pattern' not in content:
                f.seek(0,2)
                f.write(template)
        f.close()

Checking that a file exists

Either...

import os
 
def exists(file):
    if os.path.exists(file):
        return 1
    else:
        return 0

... or

import os
 
def exists(file):
    return os.access(file, os.F_OK)

Checking the size of a file

import os
from stat import ST_SIZE
 
file = 'myfile.txt'
print(os.stat(file)[ST_SIZE])

Displaying the last modified date of a file

os.stat() returns the time a file was last modified as an epoch value, ie. the number of seconds elapsed since January 1st 1970. To turn an epoch into eg. YYYY-MM-DD:

import os, time
from stat import ST_MTIME
 
filetime = os.stat('myfile.txt')[ST_MTIME]
 
#turns epoch into tuple such as (2004, 8, 13, 2, 35, 2, 4, 226, 0)
filetime = time.gmtime(filetime)
 
#turns tuple into formatted string
print(time.strftime("%Y-%m-%d",filetime))

Reading a value from a key in a section of an INI file

import ConfigParser
 
p = ConfigParser.ConfigParser()
p.readfp (open('index.ini'))
try:
    print p.get('files',file)
except:
    print "section 'files' not found"
else:
    print "ok"

Reading all the key/value items in a section in an INI file

import ConfigParser
 
p = ConfigParser.ConfigParser()
p.readfp (open('index.ini'))
for item in p.items('files'):
    print("key = " + item[0] + " value = " + item[1])

Writing data to an INI file

Note that ConfigParser's write() method takes an open file object rather than a file name, so you need to read the INI file, make the changes in memory, open the file in write mode, and write to it:

import ConfigParser
 
def writeini(file,size):
    p = ConfigParser.ConfigParser()
    p.read('index.ini')
    p.set('files', file, size)
 
    fp = open('index.ini','w')
    p.write(fp)
    fp.close()
 
writeini("mykey","myvalue")
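
The INI examples above use the Python 2 module name. In Python 3 the module is called configparser (lowercase), and readfp() has been replaced by read_file(). Here is a rough Python 3 equivalent of writing and reading back a key; it uses a temporary file instead of index.ini so that it can be run safely.

```python
import configparser
import os
import tempfile

ini_path = os.path.join(tempfile.mkdtemp(), "index.ini")

# Write a section with one key
p = configparser.ConfigParser()
p["files"] = {"mykey": "myvalue"}
with open(ini_path, "w") as fp:
    p.write(fp)

# Read it back
q = configparser.ConfigParser()
q.read(ini_path)
value = q.get("files", "mykey")
missing = q.get("files", "nokey", fallback=None)   # no exception if absent
items = q.items("files")
print(value, missing, items)
```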

Setting the current directory

import os
 
os.chdir('./mydir')

Looping through each file in a directory

import glob
 
for file in glob.glob('*.htm*'):
    print(file)

Note: On the Windows platform, glob() mixes forward- and backslashes, while open() doesn't allow backslashes altogether ("IOError: [Errno 2] No such file or directory: '.\\mydir\myfile.txt' ".)

Reading information from MS Word files

import win32com.client
app = win32com.client.Dispatch('Word.Application')
doc = app.Documents.Add('c:\\stuff.doc')
for rev in doc.Revisions:
    print rev.Author

Using SQLite as file-based database

Python3

When reading data from SQLite3 (which is saved in UTF-8/16), and saving them into a plain text file, Python3 uses the locale's encoding as default, eg. cp1252. To save data as UTF-8, pass the encoding parameter to open():

import sqlite3
 
con = sqlite3.connect('input.sqlite')
 
#lets you access columns by name, eg. row["name"]
con.row_factory = sqlite3.Row
cur = con.cursor()
cur.execute("SELECT name FROM table1")
results = cur.fetchall()
output = open("output.txt", "w", encoding='UTF-8')
for row in results:
    output.write(row["name"] + "\n")
 
output.close()
con.close()
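
To try this out without an existing database, here is a self-contained sketch (Python 3) that builds a throwaway in-memory database, then writes its rows to a UTF-8 text file. The table and column names follow the example above; the sample rows and the temporary output path are made up.

```python
import os
import sqlite3
import tempfile

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row          # allows row["name"]
con.execute("CREATE TABLE table1 (name TEXT)")
con.executemany("INSERT INTO table1 (name) VALUES (?)",
                [("Crème",), ("Noël",)])

out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(out_path, "w", encoding="utf-8") as output:
    for row in con.execute("SELECT name FROM table1"):
        output.write(row["name"] + "\n")
con.close()

# Read the file back as raw bytes to check the encoding
with open(out_path, "rb") as f:
    raw = f.read()
print(raw)
```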

Python2

Several wrappers are available to access SQLite from Python, but two stand out: "pysqlite implements Python's DBAPI and was integrated into Python [2.5]. There is another wrapper, APSW ("Another Python SQLite Wrapper"), which is thinner and closer to SQLite's C API."

Note:

Installing APSW: Just run the EXE that matches your version of Python, eg. apsw-3.3.13-r1.win32-py2.5.exe

Here's how to display information:

import os, sys, time
import apsw
 
print "Using APSW file",apsw.__file__
print "APSW version",apsw.apswversion()
print "SQLite version",apsw.sqlitelibversion()

Here's how to play with SQLite:

if os.path.exists("dbfile"):
        os.remove("dbfile")
 
connection=apsw.Connection("dbfile")
cursor=connection.cursor()

cursor.execute("begin")
cursor.execute("create table foo(x,y,z)")
cursor.execute("insert into foo values(1,2,3)")
cursor.execute("insert into foo values(4, 'five', 6.0)")
cursor.execute("commit")
 
for row in cursor.execute("select * from foo"):
    print row
 
for m,n,o in cursor.execute("select x,y,z from foo"):
    print m,n,o
 
connection.close(True) 

Another example of using APSW (reading a tab-delimited text file to insert books into SQLite)

import re, apsw
 
connection=apsw.Connection("books.sqlite")
cursor=connection.cursor()
 
sql = "CREATE TABLE IF NOT EXISTS books (id INTEGER PRIMARY KEY, isbn VARCHAR, box VARCHAR, title VARCHAR)"
cursor.execute(sql)
 
f = open("books.tsv", "r")
textlines = f.readlines()
f.close()
 
#Extract ISBN + box
p = re.compile('^(.+)\t(\d+)$')
for line in textlines:
        m = p.search(line)
        if m:
                isbn = m.group(1)
                box = m.group(2)
                
                #bind the ISBN as a parameter instead of pasting it into the SQL
                sql = "SELECT COUNT(isbn) FROM books WHERE isbn=?"
                for row in list(cursor.execute(sql, (isbn,))):
                        #Record not found -> Insert
                        if not row[0]:
                                print "No record found for ISBN " + isbn
                                cursor.execute("INSERT INTO books (id,isbn,box) VALUES (NULL,?,?)", (isbn,box))
                                
connection.close(True)

Here's how to perform an INSERT using named parameters, whose values are taken from the local variables of the same names:

cursor.execute("""INSERT INTO person (name, address, tel, web, email)
VALUES (:name, :address, :tel, :web, :email)""", locals())

Here's how to safely update/insert data and display the resulting query:

sql = 'UPDATE companies SET name=?,address=?,zip=? WHERE id=?;'
try:
        cursor.execute(sql, (name,address,zip,id) )
except:
        print "Failed UPDATING"
        raise
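
The same two binding styles (named and positional placeholders) can be tried with the sqlite3 module from the standard library instead of APSW. This is only a sketch, assuming Python 3 and an in-memory database; the companies table is reduced to three columns and the sample values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, zip TEXT)")

# Named placeholders, filled from a dictionary
con.execute("INSERT INTO companies (id, name, zip) VALUES (:id, :name, :zip)",
            {"id": 1, "name": "O'Brien & Co", "zip": "75001"})

# Positional placeholders, filled from a tuple
con.execute("UPDATE companies SET name=?, zip=? WHERE id=?",
            ("Acme", "75002", 1))

row = con.execute("SELECT name, zip FROM companies WHERE id=1").fetchone()
con.close()
print(row)
```

Because the values are bound rather than pasted into the SQL string, the apostrophe in the first name needs no escaping.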

Using regular expressions

Here's how to loop through a list of web pages, and check whether a given pattern is found therein:

import sys
import urllib
import re
 
for i in range(1,10):
        f = urllib.urlopen("http://www.acme.com/index.asp?page=%s" % i)
        #re.I = ignore case
        if re.search('stringtofind',f.read(), re.I):
                print "Found in %s" % i

Another way to do this:

p = re.compile('stringtofind')
if p.search(f.read()):
    print "Found"
else:
    print "Not found"

Here's how to compile a regex, find a pattern, and save it to a file:

p = re.compile('(<some>.+</some>)',re.DOTALL)
 
m = p.search(inputdata)
if m:
        inputdata = m.group(0)
else:
        print("Pattern not found")
        sys.exit()
 
inputdata = inputdata.replace('<other>','<yet>')
 
with open(output, 'w') as outputfile:
        outputfile.write(inputdata)

Here's how to load a web page, isolate a section, and display it (Note: you cannot call f.read() twice, hence the copying of the page into the 'page' variable):

log = open('found.txt','w')
for i in range(1,10):
        f = urllib.urlopen("http://www.acme.com/index.asp?page=%s" % i)
        print "Checking page %i" % i
        page = f.read()
        if re.search('some text',page, re.I):
                m=re.search('<span class=subject>"(.+?)"</span>',page,re.I)
                if m:
                        log.write("Found in %s\n" % i)
                        log.flush()
log.close()

Here's how to read an HTML file, and display the string between the TITLE tags, if any:

import re
 
f = open('myfile.html', "r")
inputfile = f.read()
f.close()
 
m = re.search('<title>(.*?)</title>',inputfile,re.I)
if m:
    print m.group(1)

... or if you need to extract more than one set of items:

p = re.compile('blabla (.+?) blabla (.+?)')
packed = p.findall(inputfile)
if packed:
    for x in packed:
        print "Item 1 = " + x[0] + " Item 2 = " + x[1]

If you need to call a regex a great number of times, you can increase performance by compiling the search pattern:

p = re.compile('[0-9]+')
m = p.search('tempo999')
print m.group(0)

To replace an item with another item, use re.sub():

print re.sub('john','jane','john doe')

Note that re.sub() is very much slower than using a string's replace() method:

stuff = stuff.replace('_',' ')

Also, since re.sub() processes backslash escapes in the replacement string, the string with which to replace the pattern must have its backslashes escaped prior to calling re.sub(), using the r prefix to indicate a raw string (ie. with its backslashes treated as regular characters):

toreplace = r"\\"
body = "#"
print re.sub("#",toreplace,body)

If you wish to tell the re module to treat the replace pattern as is even when it contains backslashes, add a call to its escape() function:

toreplace = re.escape(r"\\")
body = "#"
print re.sub("#",toreplace,body)
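
Another way to have the replacement inserted as is, sketched here for Python 3, is to pass a function instead of a string: whatever the function returns is used literally, without any backslash processing. The sample path is made up to contain sequences that would otherwise be read as escapes.

```python
import re

# Plain text containing backslash-n and backslash-t
toreplace = "C:\\new\\table"
body = "path=#"

# The function's return value is inserted without escape processing
result = re.sub("#", lambda m: toreplace, body)
print(result)
```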

Here's how to rewrite a phone number:

#!/usr/bin/python
 
import sys,re
 
#Turn 0123456789 into 01.23.45.67.89
p = re.compile(r'(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)')
phone = p.sub(r'\1.\2.\3.\4.\5',sys.argv[1])
 
print phone
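
The same rewrite can be wrapped in a function so it is easy to reuse and test. A small Python 3 sketch; the function name format_phone is just a suggestion.

```python
import re

PHONE = re.compile(r'(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)')

def format_phone(number):
    """Turn 0123456789 into 01.23.45.67.89."""
    return PHONE.sub(r'\1.\2.\3.\4.\5', number)

print(format_phone("0123456789"))
```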

Important: In Python 2, the regex library by default doesn't treat accented European characters as word characters, so you must set a locale, and add the re.LOCALE switch:

import locale, re
 
#BAD: locale.setlocale(locale.LC_ALL, 'FR')
#Better: let Python pick up the user's default locale
locale.setlocale(locale.LC_ALL, '') 
mypattern = re.compile(r"(\d+)\s+(\w+)\s+(\d+)",re.LOCALE)
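
In Python 3, string patterns are Unicode-aware by default, so neither the locale call nor re.LOCALE is needed for accented characters (there, re.LOCALE can only be used with bytes patterns). A quick check, with a made-up sample date:

```python
import re

text = "12 février 2004"
m = re.search(r"(\d+)\s+(\w+)\s+(\d+)", text)
print(m.groups())
```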

More infos on using regexes in Python:

Driving a web browser

As of April 2021, there are at least two modules to manage a web browser through a Python script: the webbrowser module, and the Selenium module. mechanize might be too basic.

Selenium

https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08

#pip3 install -U selenium
#pip3 install webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
 
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.google.com")

webbrowser module

https://devtut.github.io/python/webbrowser-module.html

Connecting to a web server

Here's how to use urllib to POST to a script:

import urllib
        
url = "http://www.acme.com"
data = {'myfield': somevalue}
urldata = urllib.urlencode(data)
results = urllib.urlopen(url, urldata).read()
print results

Here is an example session that uses the 'GET' method to retrieve a URL containing parameters:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
print f.read()

The following example uses the 'POST' method instead:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
print f.read()
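
The examples above are written for Python 2. In Python 3, urlencode() lives in urllib.parse and urlopen() in urllib.request, and POST data must be passed as bytes. The sketch below only builds the two requests, so that it runs without network access; the commented lines at the end show how one would be sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})

# GET: parameters go into the URL
get_req = Request("http://www.musi-cal.com/cgi-bin/query?%s" % params)

# POST: parameters go into the request body, which must be bytes
post_req = Request("http://www.musi-cal.com/cgi-bin/query",
                   data=params.encode("ascii"))

print(params)
print(get_req.get_method(), post_req.get_method())

# To actually send a request (needs network access):
#   from urllib.request import urlopen
#   with urlopen(post_req) as f:
#       print(f.read().decode("utf-8", "replace"))
```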

The following example uses an explicitly specified HTTP proxy, overriding environment settings:

import urllib
proxies = {'http': 'http://proxy.example.com:8080/'}
opener = urllib.FancyURLopener(proxies)
f = opener.open("http://www.python.org")
f.read()

The following example uses no proxies at all, overriding environment settings:

import urllib
opener = urllib.FancyURLopener({})
f = opener.open("http://www.python.org/")
f.read ()

Here's how to use Libcurl to POST to a script:

  1. Install Python and Libcurl (eg. libcurl-7.16.2-win32-ssl-sspi.zip)
  2. Install PyCurl
  3. Use this script:

Here's how to log on to a web server through POST with support for cookies:

urllib vs urllib2 vs httplib

cookielib vs. ClientCookie http://www.voidspace.org.uk/python/articles/cookielib.shtml

Playing with date/time

Here's how to display the current date and time:

import time
import locale
 
#displays '08/20/04 22:05:15'
print time.strftime('%c')
 
#displays 'French_France.1252'
print locale.setlocale(locale.LC_ALL,'')
 
#displays '20/08/2004 22:05:15'
print time.strftime('%c')

The time value as returned by gmtime(), localtime(), and strptime(), and accepted by asctime(), mktime() and strftime(), is a sequence of 9 integers. The return values of gmtime(), localtime(), and strptime() also offer attribute names for individual fields.
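
To illustrate, here is a short Python 3 sketch that looks at the value returned by gmtime() for the epoch itself, both as a sequence and through its attribute names:

```python
import time

t = time.gmtime(0)          # the epoch itself, as a struct_time
print(len(t))               # number of fields in the sequence
print(t[0], t.tm_year)      # same field, by index and by name
print(time.strftime("%Y-%m-%d", t))
```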

Lists

Tuples

Dictionaries

Printing the content of each key:

for i in stuff.keys():
    print i + "=" + stuff[i]

Commenting a block of text

"""
This is one
block of text
"""

ie. three double-quotes in a row.

Exiting a script

import sys
 
sys.exit()

Handling a long line of code

To break a long line of code:

if (somevar) or \
    (someothervar):

Sending an e-mail

Here's how to send an e-mail through code, passing one parameter to the script:

#!/usr/bin/python
 
from email.MIMEText import MIMEText
import smtplib,sys
 
body='''this text will become the body of the message
Using triple-quotes you can span it easily over multiple lines.
the result of an action'''
 
msg = MIMEText(body)
From = "me@acme.com"
To = "you@acme.com"
msg['From'] = From
msg['To'] = To
msg['Subject'] = "Call from " + sys.argv[1]
 
server = smtplib.SMTP("smtp.isp.net")
server.sendmail(From,[To],msg.as_string())
server.quit()
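
The script above targets Python 2 (email.MIMEText no longer exists under that name in Python 3). A possible Python 3 version, using email.message.EmailMessage, is sketched below. The addresses and server name are the placeholders from the example above; the helper names build_message and send are made up, and send() is defined but not called since it needs a reachable SMTP server.

```python
import smtplib
from email.message import EmailMessage

def build_message(caller):
    msg = EmailMessage()
    msg['From'] = "me@acme.com"
    msg['To'] = "you@acme.com"
    msg['Subject'] = "Call from " + caller
    msg.set_content("this text will become the body of the message")
    return msg

def send(msg, host="smtp.isp.net"):
    # Not called here: it needs a reachable SMTP server
    with smtplib.SMTP(host) as server:
        server.send_message(msg)

msg = build_message("0123456789")
print(msg['Subject'])
```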

Tips from the Python Tutorial

Calling a non-COM DLL from Python

Calling a COM DLL from Python

Shortcut to the last result

In interactive mode, the last printed expression is assigned to the variable _. This means that when you are using Python as a desk calculator, it is somewhat easier to continue calculations, for example:

>>> price * tax
12.5625
>>> price + _
113.0625

Long lines

If a statement or string is too long to fit on a line, use the backslash:

text = "Note that whitespace at the beginning of the line is\
 significant."

You can also use """ or ''' :

print """
Usage: thingy [OPTIONS]
-h Display this usage message
-H hostname Hostname to connect to
"""

Strings

Unlike a C string, Python strings cannot be changed. Assigning to an indexed position in the string results in an error.

Lists

Unlike strings, which are immutable, it is possible to change individual elements of a list:

a = ['spam', 'eggs', 100, 1234]
a[2] = a[2] + 23
a
['spam', 'eggs', 123, 1234]

Variable number of function parameters

When a final formal parameter of the form **name is present, it receives a dictionary containing all keyword arguments whose keyword doesn't correspond to a formal parameter. This may be combined with a formal parameter of the form *name (described in the next subsection) which receives a tuple containing the positional arguments beyond the formal parameter list. (*name must occur before **name.) For example, if we define a function like this:

def cheeseshop(kind, *arguments, **keywords):
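
The definition above shows only the signature. A complete function along those lines might look like the following Python 3 sketch; the body and the sample call are loosely modelled on the Python Tutorial and are illustrative only.

```python
def cheeseshop(kind, *arguments, **keywords):
    lines = ["-- Do you have any %s?" % kind]
    for arg in arguments:                 # extra positional arguments (tuple)
        lines.append(arg)
    for key, value in keywords.items():   # extra keyword arguments (dict)
        lines.append("%s: %s" % (key, value))
    return lines

result = cheeseshop("Limburger", "It's very runny, sir.",
                    shopkeeper="Michael Palin", client="John Cleese")
print("\n".join(result))
```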

Importing modules

There is even a variant to import all names that a module defines:

from fibo import * 

This imports all names except those beginning with an underscore (_).

Modules

The built-in function dir() is used to find out which names a module defines. It returns a sorted list of strings. Without arguments, dir() lists the names [ie. variables and functions] you have defined currently.
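
For instance (Python 3, using the standard math module as an arbitrary example):

```python
import math

names = dir(math)                       # sorted list of strings
public = [n for n in names if not n.startswith("_")]
print(public[:5])
print("sqrt" in names)

x = 1
print("x" in dir())                     # dir() without arguments: current names
```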


Compiling

An easy and satisfactory way to distribute your Python script on a Windows host is to compile it with Py2exe (which analyses your script, and tries to extract all the required modules into a ZIP file), and combine the different files using either a standard installer like InnoSetup or NSIS, or combine all the files into the main EXE generated by py2exe using PE Bundle which will extract those extra files at runtime transparently:

  1. Install py2exe, and write a setup script (call it setup.py):

    from distutils.core import setup
    import py2exe

    setup(console=["myapp.py"])
     
  2. Open a DOS box, and run the following: python setup.py py2exe
  3. A directory named ./dist is created by py2exe, and contains all the files that are required to run your script on a bare Windows host. You can remove the ./build directory (temp stuff)
  4. Combine those few files into a single EXE using either your favorite installer, or PE Bundle
  5. More information available on py2exe

An alternative to py2exe is PyInstaller: "PyInstaller is a program that converts (packages) Python programs into stand-alone executables, under Windows, Linux and Irix. [...] PyInstaller is an effort to rescue, maintain and further develop Gordon McMillan's Python Installer (now PyInstaller). Their official website is not longer available and the original package is not longer maintained. Believing that it is still far superior to py2exe, we have setup this site to continue its further development."

First, read the following to understand the issue of compiling and/or distributing Python scripts:

Pyco

Psyco

setuptools

"setuptools () is a collection of enhancements to distutils which let you build .egg files. Once you start using egg files you can include dependencies between package versions and if your product requires a bunch of other packages the installation step will download and install the appropriate versions.

See http://peak.telecommunity.com/DevCenter/EasyInstall for instructions on installing packages built in this way, but in short, the user has to run ez_setup.py from the EasyInstall page, and then a command like:

        easy_install http://example.com/path/to/MyPackage-1.2.3.tgz

would download and install your package and all the other products it depends on. If at a later stage they want to upgrade to a more recent version then all they need to do is to run:

        easy_install --upgrade MyPackage

Installed eggs usually exist in a single file (importable zip) which makes uninstalling especially easy: just one file to delete."

py2exe

py2exe is a Python distutils extension which converts python scripts into executable windows programs, able to run without requiring a python installation.

  1. Install py2exe
  2. Create a script
  3. Run the script including the -w (Windows) option to hide the DOS box that Python opens even when running a GUI application
  4. Distribute the resulting .EXE and its dependent DLLs, or generate an installer

Note that even a no-frills window developed with the wxPython toolkit, with just a tiny menu bar that displays a dialog box, turns into a 300KB EXE, and requires 4 binaries for a total of 2.5MB (and that's after compressing the four dependencies with UPX).

For information, internally, Python source code is always translated into a "virtual machine code" or "byte code" representation before it is interpreted (by the "Python virtual machine" or "bytecode interpreter").  In order to avoid the overhead of parsing and translating modules that rarely change over and over again, this byte code is written on a file whose name ends in ".pyc" whenever a module is parsed (from a file whose name ends in ".py").

When the corresponding .py file is changed, it is parsed and translated again and the .pyc file is rewritten. There is no performance difference once the .pyc file has been loaded (the bytecode read from the .pyc file is exactly the same as the bytecode created by direct translation).  The only difference is that loading code from a .pyc file is faster than parsing and translating a .py file, so the presence of precompiled .pyc files will generally improve start-up time of Python scripts.

If desired, the Lib/compileall.py module/script can be used to force creation of valid .pyc files for a given set of modules. Note that the main script executed by Python, even if its filename ends in .py, is not compiled to a .pyc file.  It is compiled to bytecode, but the bytecode is not saved to a file.
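
As a sketch of forcing the creation of a .pyc file from code, the standard py_compile module can compile a single file (compileall.compile_dir() does the same for a whole directory). The example assumes Python 3 and works on a throwaway module written to a temporary directory; the file name is arbitrary.

```python
import os
import py_compile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "mymodule.py")
with open(source, "w") as f:
    f.write("print('hello')\n")

# Compile to bytecode explicitly; doraise=True turns errors into exceptions
pyc = py_compile.compile(source, cfile=source + "c", doraise=True)
print(os.path.basename(pyc))
```

Without the cfile argument, Python 3 places the bytecode file in a __pycache__ subdirectory instead of next to the source file.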

McMillan Installer

Freeze

If you are looking for a way to translate Python programs in order to distribute them in binary form, without the need to distribute the interpreter and library as well, have a look at the freeze.py script in the Tools/freeze directory [find it in the ActivePython distribution; I guess this refers to the standard Python distro].

This creates a single binary file incorporating your program, the Python interpreter, and those parts of the Python library that are needed by your program.  Of course, the resulting binary will only run on the same type of platform as that used to create it.

"There is a tool called freeze that is included with Python that does this.  I havn't done it on Windows yet and I have heard that there are some tricks or potential problems with it.  Check the back-postings at dejanews for details.  Basically it scans you code for all imported modules and builds a C module that has all the compiled python modules encoded within it.  Then you compile and linke this file against the Python library and you end up with an executable that along with any binary extension modules you may need will be a distributable version of your program."

cx_Freeze

"The Freeze utility that comes with Python itself requires a source distribution, a C compiler and linker which makes for a complex environment for creating executables. In addition, this method is very slow for creating executables as compared to the other methods. py2exe is intended for development on Windows only and cx_Freeze is intended for cross platform development. Installer uses an import hook which means that the development environment and runtime environment are considerably different."

SQFreeze

Pyrex

PyPy

"The PyPy project aims at producing a simple runtime-system for the Python language. We want to express the basic abstractions within the Python Language itself. We later want to have a minimal core which is not written in Python and doesn't need CPython anymore. We want to take care that technologies such as PSYCO and Stackless will easily integrate."

PyInline

Py2Cmod

Weave

pyPack

SWIG

"SWIG is a software development tool that connects programs written in C and C++ with a variety of high-level programming languages. SWIG is primarily used with common scripting languages such as Perl, Python, Tcl/Tk, and Ruby, however the list of supported languages also includes non-scripting languages such as Java, OCAML and C#."

distutils

PyChecker

"PyChecker is a tool for finding bugs in python source code. It finds problems that are typically caught by a compiler for less dynamic languages, like C and C++. It is similar to lint."

IDEs

As of 2018, I tried a few of them (IDLE, MS Visual Studio, PyCharm, Wing, Eric), and found PyScripter to be the simplest to install and use.

PyScripter

Main page; Support

To make the IDE actually… readable, choose View > Styles > Windows 10

To set a keyboard shortcut to toggle comments (Source Code > Toggle Comment), use Tools > Options > IDE Shortcuts; I used CTRL+SHIFT+B (B as "block"), since it was available.

Tools > Configure Tools: To use the current script work directory as the Working directory, use "$[ActiveDoc-Dir]"

Instead of the internal Python interpreter, use the external interpreter, so you can easily kill a rogue application if need be.

Q&A

What causes "Remote Interpreter Reinitialized"?
How to change encoding, so that UTF-8 strings are displayed correctly (eg. "é" instead of garbled characters)?
How to remove "greyed out" lines when script stopped running due to error? Can't read code. Nothing in "Run".
How to hide black vertical line in middle of editing window?

Tools > Options > Editor Options : Edge column = 0

How to add items from Tools > Tools into user toolbar?

1. Right-click on User Toolbar

2. Customize

3. Commands tab

4. External Tools

5. Select item, and drag 'n drop it to the User Toolbar.

Others

OLD If you prefer to use an IDE instead of a basic text editor, here are the choices I would recommend:

Bigger list here

PyDev

http://pydev.sourceforge.net

PyPe

DrPython

Programmer Studio

ActiveState

Komodo

Visual Python

BlackAdder

The eric3 Python IDE

SPE - Stani's Python Editor

Boa Constructor

FOX

Arachno

PythonWin, a.k.a. PyWin32

WPY

Pmw

Idle

Wing

Writing GUI apps

Some infos

Below is a list of tools to let you build GUI applications. Most are just wrappers around a set of widgets such as Windows' native widgets, wxWidgets, or QT, bringing you back to the days of Windows programming Petzold-style (Mmm...), but some also offer a GUI designer like VB, ie. you can draw the windows interface with the mouse. You can read more in the page Gui Programming on the Python site.

Note that the WYSIWYG GUI designer that feels most like VB's is QT Designer, which you can get either directly from QT or by buying the BlackAdder IDE.

Alternatively, you could also use a GUI designing tool such as the antiquated MS Dialog Editor or its more modern equivalents, just to draw the interface with the mouse and get the coordinates for each widget, and copy/paste this into code. Here are some suggestions I got:

Here are the widgets and/or Python wrappers those GUI designers may require:

Designing the UI as a resource

"Quick side note: depending on your GUI needs, ctypes can be a pretty easy way to go. Create your GUI as resources (e.g. in MS Visual Studio) and wrap them into a tiny DLL. Then use ctypes to load them at runtime and run CreateDialogIndirect. Most of the work involved is simply looking in header files for the values of various Win32 messages and constants, but once you do it the first time you can re-use much of the code over and over."

MFC

PythonWin, a.k.a. PyWin32 is not only an IDE, but also an MFC wrapper so you can build Win32 apps without any extra widgets set. Take a look at the samples under Drive:\Python23\Lib\site-packages\

PythonWin offers the following modules to wrap the Win32 APIs:

Note that Python Win32, a.k.a. Win32all, is part of the ActivePython package, so if you use ActivePython instead of the standard Windows version of Python, Python Win32 is already installed.

More information:

Here's the familiar "Hello, World!" as a dialog box in PyWin32:

from pywin.mfc import dialog, window
import win32con
 
dlgStatic = 130
dlgButton = 128
 
class Mydialog(dialog.Dialog):
    def OnInitDialog(self):
        rc = dialog.Dialog.OnInitDialog(self)
        return rc
 
style = (win32con.DS_MODALFRAME |
    win32con.WS_POPUP |
    win32con.WS_VISIBLE |
    win32con.WS_CAPTION |
    win32con.WS_SYSMENU |
    win32con.DS_SETFONT)
cs = win32con.WS_CHILD | win32con.WS_VISIBLE
s = win32con.WS_TABSTOP | cs
w = 64
h = 64
 
#1. Let's create a dialog box with a label and a pushbutton
dlg = [["PyWin32",(0, 0, w, h), style,  None,   (8, "MS Sans Serif")],]
dlg.append([dlgButton, "OK", win32con.IDOK, (7, h - 18, 50, 14), s | win32con.BS_PUSHBUTTON])
dlg.append([dlgStatic, "Hello, world!", -1, (7, 9, 50, 14), cs | win32con.SS_LEFT])
 
#2. Let's start the dialog
d = Mydialog(dlg)
 
#3. Display it
d.DoModal()

Here's how to add a progress bar, set its range, and increment it:

import glob
import win32ui
 
def OnInitDialog(self):
    rc = dialog.Dialog.OnInitDialog(self)
    self.pbar = win32ui.CreateProgressCtrl()
    self.pbar.CreateWindow(win32con.WS_CHILD | win32con.WS_VISIBLE, (7, 30, 270, 50), self, 1001)
 
    #Find out how many *.htm* files there are, and set the range of the progress bar
    files = glob.glob('*.htm*')
    self.pbar.SetRange(0, len(files))
    self.pbar.SetStep(1)
 
    for file in files:
        self.pbar.StepIt()
        [...]
    return rc

Python GUI API Project

wxPython

More infos here.

PyQT

.Net (Mono, DotGNU)

This is very early development, but if you like bleeding edge stuff, you could start looking at how to develop applications using either MS' official .Net framework and its tools (VS.Net and the Python add-on, etc.), or the compatible open-source versions that are Mono and DotGNU. Take a look at IronPython, and boo.

pyFLTK

PyGTK

"If you like GTK+, you might want to try the glade designer and parse the XML file with libglade and pygtk.  (Generated code is bad). Remember, glade generates XML.  XML is not code, XML is data.  And data is not code.  As long as you stay away from generated code, you will be safe.  Yup, the best of two worlds -- a graphical form designer that stores information in XML data to be parsed by your own python program."

"BTW, there's a python port of glade underway: http://gruppy.sicem.biz/componentes#gazpacho"

FXPy

Binding to the TnFox Toolkit?

http://www.osnews.com/story.php?news_id=9701

PyGUI

WAX

PyUI

Windows

RipSting’s Blender-Python GUI Designer

Blender GUI Wizard

http://www.angelfire.com/nt/teklord/GUIWizard.htm

ActiveState GUI Builder

Venster

PythonWorks Pro

EasyDialogs for Windows

Dabo

DynWin

PythonWin

sdk32 - Partial Python wrap of the Win32 Platform SDK

GTK

MojoView

QT

QT Designer

wxWidgets

Dabo

"Dabo is a 3-tier, cross-platform application development framework, written in Python atop the wxPython GUI toolkit. And while Dabo is designed to create database-centric apps, that is not a requirement. Lots of people are using Dabo for the GUI tools to create apps that have no need to connect to a database at all."

wxDesigner

wxGlade

XRCed

VisualWx

Boa Constructor

See above

PythonCard

Dialogblocks

Tcl/Tk

Visual TCL

"Visual Tcl is a freely-available, high-quality application development environment for UNIX, Windows, Macintosh and AS400 platforms. Visual Tcl is written entirely in Tcl/Tk and generates pure Tcl/Tk code. This makes porting your Visual Tcl applications either unnecessary or trivial. Visual Tcl is covered by the GNU General Public License."

PAGE - Python Automatic GUI Generator

Resources

Writing GUIs with Tcl/Tk and TKinter

Notes:

Tkinter is Python's object-oriented layer on top of Tcl/Tk. Tk only offers basic widgets; if you need more, check out wxPython and PyQt.

Three main concepts: Widgets, event handling, and geometry management (pack, grid, place; pack is the simplest for simple layouts, grid is the most commonly used, and place is the least popular but provides the best control).

Books

Layout/Geometry Managers

Historically, Tkinter supports three layout managers:

Grid

The container frame is organized into a two-dimensional table where each cell can hold one widget. However, widgets can be made to span multiple cells.

Pack

.pack(side=LEFT|RIGHT|TOP|BOTTOM, fill=X|Y|BOTH,expand=YES|NO,anchor=N|NE|E|SE|S|SW|W|NW)

The pack manager is ideally suited for the following two kinds of situation:

Widgets

Tkinter provides the following widgets:

Test

To check that Python is correctly installed and that Tkinter works, open a terminal window, and run the following command: python -m tkinter

Loading Tkinter

#Bad
from tkinter import *

#Better
import tkinter

#Best
import tkinter as tk

Hello, world!

import tkinter as tk
 
root = tk.Tk()
root.title("My title")
 
w = tk.Label(root, text="Hello Tkinter!")
#Fit the size of the window to the given text
w.pack()
 
root.mainloop()

Dialog

A simple OK dialog:

from tkinter import Tk
from tkinter import messagebox
 
# Hide parent window; in Windows, use ".pyw" as the extension to hide the terminal window as well
Tk().withdraw()
 
messagebox.showinfo("My title", "Hello")

A Yes/No dialog:

import tkinter as tk
from tkinter import messagebox
 
def answer():
    messagebox.showerror("Answer", "Sorry, no answer available")
 
def callback():
    if messagebox.askyesno('Verify', 'Really quit?'):
        messagebox.showwarning('Yes', 'Not yet implemented')
    else:
        messagebox.showinfo('No', 'Quit has been cancelled')
 
root = tk.Tk()
tk.Button(root, text='Quit', command=callback).pack(fill=tk.X)
tk.Button(root, text='Answer', command=answer).pack(fill=tk.X)
 
root.mainloop()

To hide the main window:

from tkinter import Tk
from tkinter.filedialog import askopenfilename
 
# we don't want a full GUI, so keep the root window from appearing
Tk().withdraw()
 
# show an "Open" dialog box and return the path to the selected file
filename = askopenfilename()
if not filename:
        exit()
 
print(filename)

Displaying text: Message, and Text

The Message widget has more features than Label, and the Text widget has even more features.

Message

import tkinter as tk
 
master = tk.Tk()
 
whatever_you_do = "Whatever you do will be insignificant, but it is very important that you do it.\n(Mahatma Gandhi)"
msg = tk.Message(master, text = whatever_you_do)
msg.config(bg='lightgreen', font=('times', 24, 'italic'))
msg.pack()
 
tk.mainloop()

Buttons

import tkinter as tk
import random
 
def change_label():
    button.config(text=str(random.randint(1,101)))
 
root = tk.Tk()
root.title("Changing label")
 
button = tk.Button(root, text='Change', width=25, command=change_label)
button.pack()
 
root.mainloop()

Closing an application

Button(master, text='Quit', command=master.quit)

Checkboxes

var1 = IntVar()
Checkbutton(master, text="male", variable=var1).grid(row=1, sticky=W)

Radio button

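    tk.Radiobutton(root,
                  text=language,
                  padx = 20,
                  variable=v,
                  command=ShowChoice,
                  value=val).pack(anchor=tk.W)

variable: the control variable (eg. an IntVar) shared by all the radio buttons of a group. The button whose value equals the variable's current value is shown as selected, so setting the variable beforehand picks the default choice.

text: the radio button's label; value: what is stored in the variable when this button is selected.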

Entry

For just a single line of text.

import tkinter as tk
 
def show_entry_fields():
        print("First Name: %s" % (e1.get()))
 
master = tk.Tk()
 
tk.Label(master, text="First Name").grid(row=0)
e1 = tk.Entry(master)
e1.grid(row=0, column=1)
 
tk.Button(master,
        text='Show', command=show_entry_fields).grid(row=3, column=1, sticky=tk.W, pady=4)
 
tk.mainloop()

Text

Multiple lines of text.

import tkinter as tk
 
root = tk.Tk()
 
S = tk.Scrollbar(root)
T = tk.Text(root, height=4, width=50)
S.pack(side=tk.RIGHT, fill=tk.Y)
T.pack(side=tk.LEFT, fill=tk.Y)
S.config(command=T.yview)
T.config(yscrollcommand=S.set)
 
quote = """HAMLET: To be, or not to be--that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them. To die, to sleep--
No more--and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to. 'Tis a consummation
Devoutly to be wished."""
 
T.insert(tk.END, quote)
 
tk.mainloop()

Showing picture

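To display a picture. Note that Tkinter's own PhotoImage class only reads GIF and PGM/PPM files (and PNG with Tk 8.6 or later); for other formats such as JPEG, use Pillow's ImageTk as shown here: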

import tkinter as tk
from PIL import ImageTk, Image
 
master = tk.Tk()
master.title("Join")
master.geometry("300x300")
master.configure(background='grey')
 
img = ImageTk.PhotoImage(Image.open("IMG_20190522_164109.jpg"))
panel = tk.Label(master, image = img)
panel.pack(side = "bottom", fill = "both", expand = "yes")
 
tk.mainloop()

Variable Classes

Some widgets (like text entry widgets, radio buttons and so on) can be connected directly to application variables by using special options: variable, textvariable, onvalue, offvalue, and value. This connection works both ways: if the variable changes for any reason, the widget it's connected to will be updated to reflect the new value. These Tkinter control variables are used like regular Python variables to keep certain values. It's not possible to hand over a regular Python variable to a widget through a variable or textvariable option. The only kinds of variables for which this works are variables that are subclassed from a class called Variable, defined in the Tkinter module. They are declared like this:

x = StringVar() # Holds a string; default value ""

x = IntVar() # Holds an integer; default value 0

x = DoubleVar() # Holds a float; default value 0.0

x = BooleanVar() # Holds a boolean, returns 0 for False and 1 for True

To read the current value of such a variable, call the method get(). The value of such a variable can be changed with the set() method.

Web development

More infos here.

Database access

http://www.python.org/sigs/db-sig/

json

https://realpython.com/python-json/

https://jsonplaceholder.typicode.com/

PyGeoj, "a simple Python Geojson file reader and writer."

Encoding JSON = serialization or marshaling; decoding = deserialization.

dumps() is used to handle data in RAM while dump() is to write them to disk.

turn JSON into Python objects

Use load() and loads().
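A minimal sketch of the difference between the in-memory and on-disk functions. The file name and the temporary directory are assumptions chosen for illustration; only the standard json module is used.

```python
import json
import os
import tempfile

data = {"president": {"name": "Zaphod Beeblebrox", "species": "Betelgeusian"}}

# In memory: dumps() returns a str, loads() parses a str
json_string = json.dumps(data)
from_string = json.loads(json_string)

# On disk: dump() writes to a file object, load() reads from one
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_file.json")  # hypothetical file name
    with open(path, "w") as write_file:
        json.dump(data, write_file)
    with open(path, "r") as read_file:
        from_file = json.load(read_file)

print(from_string == data, from_file == data)
```

The trailing "s" in dumps()/loads() stands for "string", which is an easy way to remember which pair works on files.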

JSON             Python
object           dict
array            list
string           str
number (int)     int
number (real)    float
true             True
false            False
null             None
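To check this mapping, here is a small sketch that decodes one JSON value of each kind and prints the resulting Python type. The key names in the sample document are made up for the example.

```python
import json

# One JSON value of each kind; the keys are arbitrary labels
json_text = (
    '{"object": {"k": "v"}, "array": [1, 2], "string": "s", '
    '"int": 1, "real": 1.5, "true": true, "false": false, "null": null}'
)
decoded = json.loads(json_text)

for key, value in decoded.items():
    print(key, "->", type(value).__name__)
```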

read

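import json

# From a file: load() takes a file object
with open("data_file.json", "r") as read_file:
    data = json.load(read_file)

# From a string (eg. the text of an HTTP response): loads() takes a str
with open("data_file.json", "r") as read_file:
    data = json.loads(read_file.read())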

turn dictionary into json

import json

data = {
    "president": {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}
json_string = json.dumps(data)
print(json_string)

write

import json

data = {
    "president": {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}
with open("data_file.json", "w") as write_file:
    json.dump(data, write_file)

geoJSON

https://pypi.org/search/?q=geojson

What package/library/module is recommended to work with geoJSON files?

c:\>pip install json
Collecting json
  Could not find a version that satisfies the requirement json (from versions: )
No matching distribution found for json

c:\>pip search geojson
geojson (2.4.0)                    - Python bindings and utilities for GeoJSON
geojsontools (0.0.3)               - Functions for manipulating geojsons
geojson_elevation (0.1)            - GeoJSON compatible elevation proxy
geojson_utils (0.0.2)              - Python helper functions for manipulating GeoJSON
PyGeoj (0.22)                      - A simple Python GeoJSON file reader and writer.

Setup

pip install geojson

Features

Read

import geojson

with open('myfile.geojson') as f:
    gj = geojson.load(f)

#First feature
print(gj['features'][0])

#All features
for feature in gj['features']:
    print(feature)

print(gj)

Write

from geojson import Feature, FeatureCollection, Point, dump

props = {"name": "My name","country": "Spain"}
point = Point((-115.81, 37.24))

features = []
features.append(Feature(properties=props,geometry=point))
feature_collection = FeatureCollection(features)

with open('myfile.geojson', 'w') as f:
    dump(feature_collection, f)

geopy

"geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources."

https://geopy.readthedocs.io/en/stable/

https://programminghistorian.org/en/lessons/mapping-with-python-leaflet

Working with XML/HTML

https://stackabuse.com/reading-and-writing-xml-files-in-python/

Note: PyXML is deadware

minidom: simplified implementation of DOM

ElementTree (ET): More Pythonic interface than DOM; lxml is an enhanced version of ET

BeautifulSoup uses lxml, if available, and is an easy way to work with HTML/XML

"untangle is a simple library which takes an XML document and returns a Python object which mirrors the nodes and attributes in its structure."

More infos on XML here.

BeautifulSoup

"Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree. […] Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility." Python 3.x should use BeautifulSoup4. Once parsed, BS builds a tree of Python objects (Tag, NavigableString, BeautifulSoup, and Comment.)

"A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text. A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str: unicode_string = str(tag.string)"

"If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory."

BS provides two ways to find elements: find()/find_all(), and select(), which offers more sophisticated features since it runs a CSS selector (through Soup Sieve).

Different classes, to make it easier to find elements:

Note: If the input data isn't in utf-8, BS will silently convert them, and edit the relevant meta line in the header if it's there — but won't add one if it isn't.

Questions

.string vs .text? "The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text"

soup.select("kml Document") vs. soup.select("kml > Document")? The former finds any "Document" tags below "kml", no matter where in the tree, while the latter only looks for them directly under "kml".

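CSS: Difference between "#sister" and ".sister"? "#sister" matches the element whose id is "sister", while ".sister" matches every element whose class includes "sister".

Does find_all() only search tags (elements), or also the strings within? By default it matches tags; pass string=... to match strings instead.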

How to parse and output

from bs4 import BeautifulSoup
 
#open in binary and let BS convert data to utf-8 if needed
soup = BeautifulSoup(open('input.html', 'rb'), 'xml')
#OR
soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
 
print("Orig encod:",soup.original_encoding)
print(soup.prettify())
 
#To work with each tag before having it prettified
for c in soup.contents:
    print(c.prettify())

Since BS doesn't add it if none is found in the header, here's how to add encoding information:

import re
 
#Match http-equiv="Content-Type" regardless of case
meta = soup.head.find("meta", attrs={"http-equiv": re.compile("^content-type$", re.I)})
if meta is None:
    metatag = soup.new_tag('meta')
    metatag.attrs['http-equiv'] = 'Content-Type'
    metatag.attrs['content'] = 'text/html; charset=utf-8'
    soup.head.append(metatag)
else:
    print("Found")

If you know how a file is (or is not) encoded, you can help BS by providing this information before it runs its Unicode, Dammit sub-library:

soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
soup = BeautifulSoup(markup, 'html.parser', exclude_encodings=["iso-8859-7"])

Parsing XML

By default, BS will use an HTML parser unless you specifically tell it to use an XML parser (which will need to be installed)

soup = BeautifulSoup(data, 'xml') #https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

Performance

Navigating

link = soup.a
for parent in link.parents:
    print(parent.name)
 
#find_parents() and find_parent() work their way up the tree:
a_string = soup.find(string="Lacie")
a_string.find_parents("a")
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
sibling_soup.b.next_sibling
sibling_soup.c.previous_sibling
 
#find_next_siblings() and find_next_sibling(), find_previous_siblings() and find_previous_sibling():
for sibling in soup.a.next_siblings:
    print(repr(sibling))
 
#.next_elements and .previous_elements are iterators to move forward or backward in the document as it was parsed
 
#The find_all_next() method returns all matches, and find_next() only returns the first match. The find_all_previous() method returns all matches, and find_previous() only returns the first match.
 
len(list(soup.children))
len(list(soup.descendants))

Finding elements

soup.head
soup.title
soup.title.name #the tag's own name, here "title". Important: .name is used by BS for this purpose, so to access a child tag named <name>, use eg. wpt.find("name").string
soup.title.string #element's text
soup.title.get_text() #alternative
soup.get_text("|", strip=True)
soup.body.b #get the first <b> tag below <body>
 
#get all text within a tree
[text for text in soup.stripped_strings]
 
soup.title.parent.name
 
soup.head.contents
soup.head.contents[0].name
 
title_tag = soup.head.contents[0]
for child in title_tag.children:
    print(child)
soup.p #first paragraph
soup.p['class'] #display value of attribute
 
soup.find_all('a') #all hyperlinks
soup("a") #shortcut for soup.find_all("a")
soup.title(string=True) #shortcut for soup.title.find_all(string=True)
soup.find(id="link3") #first element whose id is "link3"
soup.find_all(string="Elsie") #all strings that are exactly "Elsie"
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
soup.find_all(string=re.compile("Dormouse"))
soup.find_all("a", limit=2) #stop after the first two matches
 
#Grab attributes
tag = soup.find('meta', {'name': 'keywords'})
print(tag)
print(tag.attrs)
print(tag.attrs.get('content'))
 
#Note: If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None.
#get content of a.href
for link in soup.find_all('a'):
    print(link.get('href')) #get content of href attribute, ie. link
 
for string in soup.stripped_strings:
    print(repr(string)) #returns a printable representation of the given object
#find() and find_all() are the most popular search methods; they accept filters (string, regex, list, function)
soup.find_all("p", "title")
soup.find_all(id=True) #all tags with an "id" attribute
soup.find_all(href=re.compile("elsie"), id='link1')
name_soup.find_all(attrs={"name": "email"}) #"name" is already find_all()'s first argument (the tag name), so pass the attribute through attrs
soup.find_all("a", class_="sister") #class is a reserved keyword
soup.find_all('b')
 
#regex
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
 
soup.find_all(string="Elsie") #all strings that are exactly "Elsie" (strings, not tags)
#BeautifulSoup has a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements.
 
css_soup.select("p.strikeout.body") #CSS selector to search for tags that match two or more CSS classes
soup.select("html head title") #same as soup.title
soup.select("p > a") #directly under
 
soup.select_one(".sister") #only first one
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("#link1,#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

Modifying the tree

Note: When calling eg. soup.mytag, BS will look for mytag anywhere in the tree, not just right below soup
 
tag = soup.b
tag.name = "blockquote"
tag.string = "Link text."
 
tag.string.replace_with("No longer bold")
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
 
tag['id'] = 'verybold'
del tag['id']
 
tag = soup.a
tag.string = "New link text."
 
append()/insert() to add to an element's string (at the end, at a given location) which can be empty; new_tag() to add a whole tag. There's also insert_before() and insert_after(). Use clear() to empty a tag's string. Use extract()/decompose() to remove a tag from the tree.
 
replace_with() can be used with more than one argument: a_tag.b.replace_with(bold_tag, ".", i_tag)
 
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar") #<a>FooBar</a>
#alternative
new_string = NavigableString("ed")
soup.a.append(new_string) #<a>FooBared</a>
#Important: append/insert is used to edit the contents of a tag, which can include a whole block (ie. to add a new tag), not just the string of a basic tag
 
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
original_tag.append(new_tag) # <b><a href="http://www.example.com">Link text.</a></b>
 
#To clean up a tag that holds multiple NavigableString objects after using .append():
soup.smooth()
print(soup.p.prettify())
soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
soup.a.extend(["'s", " ", "on"])
soup # <a>Soup's on</a>
soup.a.contents # ['Soup', "'s", ' ', 'on']
 
from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment) # <b>Hello there<!--Nice to see you.--></b>
 
tag = soup.a #<a href="http://example.com/">I linked to <i>example.com</i></a>
tag.insert(1, "but did not endorse ") #<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
 
soup #<b>leave</b>
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag) # <b><i>Don't</i>leave</b>
div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div) # <b><i>Don't</i> you <div>ever</div>leave</b>
soup.b.contents # [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']
 
soup #<a href="http://example.com/">I linked to <i>example.com</i></a>
tag = soup.a
tag.clear() # <a href="http://example.com/"></a>
 
a_tag = soup.a #<a href="http://example.com/">I linked to <i>example.com</i></a>
i_tag = soup.i.extract()
a_tag # <a href="http://example.com/">I linked to</a>
i_tag # <i>example.com</i>
 
a_tag = soup.a #<a href="http://example.com/">I linked to <i>example.com</i></a>
i_tag = soup.i
i_tag.decompose()
a_tag # <a href="http://example.com/">I linked to</a>
 
a_tag = soup.a #<a href="http://example.com/">I linked to <i>example.com</i></a>
new_tag = soup.new_tag("b")
new_tag.string = "example.com"
a_tag.i.replace_with(new_tag) # <a href="http://example.com/">I linked to <b>example.com</b></a>
bold_tag = soup.new_tag("b")
bold_tag.string = "example"
i_tag = soup.new_tag("i")
i_tag.string = "net"
a_tag.b.replace_with(bold_tag, ".", i_tag) # <a href="http://example.com/">I linked to <b>example</b>.<i>net</i></a>
 
#<p>I wish I was bold.</p>
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
 
a_tag = soup.a #<a href="http://example.com/">I linked to <i>example.com</i></a>
a_tag.i.unwrap() # <a href="http://example.com/">I linked to example.com</a>
 
#to inject a tree into another
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
 
#<header/>
header = soup.header
header.string = "blah"

Output

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output

soup.prettify()
soup.a.prettify() #just a subpart of the tree
 
#raw output
str(soup)
str(soup.a)
unicode_string = str(tag.string)
 
Note: The str() function returns a string encoded in UTF-8. See Encodings for other options. You can also call encode() to get a bytestring, and decode() to get Unicode.
 
If you need more sophisticated control over your output, you can use Beautiful Soup’s Formatter class:
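from bs4.formatter import HTMLFormatter
 
#Function applied to each string and attribute value on output
def uppercase(text):
    return text.upper()
 
formatter = HTMLFormatter(uppercase)
print(soup.prettify(formatter=formatter))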

To find where elements are located in the source file:

soup = BeautifulSoup(markup, 'html.parser')
for tag in soup.find_all('p'):
    print(repr((tag.sourceline, tag.sourcepos, tag.string)))

To copy an element (which won't be part of the tree):

import copy
p_copy = copy.copy(soup.p)

To only parse and find certain elements:

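from bs4 import BeautifulSoup, SoupStrainer
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(markup, 'html.parser', parse_only=only_a_tags) #markup: the document to parse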

To investigate what BS does:

from bs4.diagnose import diagnose
with open("bad.html") as fp:
    data = fp.read()
diagnose(data)

Encoding

Regardless of how it's encoded originally, when loaded into Beautiful Soup, it's converted to Unicode. Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.

from bs4 import UnicodeDammit
 
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'

Unicode, Dammit’s guesses will get a lot more accurate if you install one of these Python libraries: charset-normalizer, chardet, or cchardet.

If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'

You can check the encoding found by BS using "soup.original_encoding".

If you happen to know a document’s encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding:

soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")

When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with. If you don’t want UTF-8, you can pass an encoding into prettify():

print(soup.prettify("latin-1"))
 
#alternatively
soup.p.encode("utf-8")

To read

Internal xml module: minidom and ElementTree

"The ElementTree library was contributed to the standard library by Fredrick Lundh. It includes tools for parsing XML using event-based and document-based APIs, searching parsed documents with XPath expressions, and creating new or modifying existing documents."

"Python has two interfaces — minidom and Element Tree — probably because Element Tree was integrated into the standard library a good deal later after minidom came to be. The reason for this was likely its far more "Pythonic" API compared to the W3C-controlled DOM." (Source)

Python's ElementTree has only limited support for XPath. If you need more, try lxml.
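As a rough comparison of the two standard-library interfaces discussed in this section, here is the same lookup done with minidom and with ElementTree. The sample document and its tag names are made up for the example.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Sample data, invented for this sketch
xml_text = "<library><book id='1'>Dune</book><book id='2'>Emma</book></library>"

# minidom: W3C DOM style, the text lives in child text nodes
dom = minidom.parseString(xml_text)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("book")]

# ElementTree: elements expose .text and .attrib directly
root = ET.fromstring(xml_text)
et_titles = [book.text for book in root.findall("book")]

print(dom_titles, et_titles)
```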

Besides the Python implementation xml.etree.ElementTree, there is also a C implementation. Since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically when it is available; the separate xml.etree.cElementTree module was deprecated in 3.3 and removed in Python 3.9.

Python's XML module includes…

ET has two classes for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.
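The split between the two levels can be sketched as follows, using the standard library's xml.etree.ElementTree. The element names are invented, and an in-memory buffer stands in for a real file so that nothing is written to disk.

```python
import io
import xml.etree.ElementTree as ET

# Element level: build and edit individual nodes
root = ET.Element("catalog")
item = ET.SubElement(root, "item", name="first")
item.text = "hello"

# ElementTree level: wrap the root to write a whole document
tree = ET.ElementTree(root)
buffer = io.BytesIO()  # stands in for a file opened in binary mode
tree.write(buffer, encoding="utf-8", xml_declaration=True)

# Reading back gives an ElementTree; getroot() returns its root Element
buffer.seek(0)
tree2 = ET.parse(buffer)
root2 = tree2.getroot()
print(root2.tag, root2[0].get("name"), root2[0].text)
```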

Issue I had while learning how to use minidom and ET:

lxml

Since they're largely compatible, any tutorial about Element(Tree) will do, not just the limited doco from lxml which assumes people already know ET.

"lxml is significantly faster [than ElementTree], can be used to parse HTML, and supports XPath. […] lxml is also easier to use with namespaces." (Source) lxml.etree versus ElementTree

"The lxml toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. lxml.etree follows the ElementTree API as much as possible, building it on top of the native libxml2 tree."

What's the difference between tree and root?

Note: In lxml 4.6.3.0 at least, there's a bug when parsing an HTML through a filename rather than a file handle, with lxml adding "&#13;" before each carriage-return:

#BAD &#13;
tree = et.parse(INPUT,parser)
#OK
with open(INPUT) as tempfile:
        tree = et.parse(tempfile, parser=parser)

If need be, encoding/decoding can be specified: print(ET.tostring(root, encoding='utf8').decode('utf8')).
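A small sketch of the two common ways to serialize, assuming the standard library's xml.etree.ElementTree (the sample element is made up): with a byte encoding, tostring() returns bytes that must be decoded before printing, while encoding="unicode" returns a str directly.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<msg>Sacr\u00e9 bleu!</msg>")

as_bytes = ET.tostring(root, encoding="utf8")    # bytes
as_text = ET.tostring(root, encoding="unicode")  # str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_bytes.decode("utf8"))
print(as_text)
```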

Here's how to find elements and get their parents, which is needed to delete an element:

for movie in root.findall("./foo/bar[@multiple='Yes']/.."):
    print(movie.attrib)

The difference between iterfind() and findall() is that the former returns an iterator and only searches through the tree as needed, while findall() builds the complete list of matches before returning it.
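A minimal sketch of that difference, assuming the standard library's xml.etree.ElementTree and a made-up document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><item>1</item><item>2</item><item>3</item></root>")

all_items = root.findall("item")    # a list, fully built
lazy_items = root.iterfind("item")  # an iterator, evaluated on demand

first = next(lazy_items)            # only advances as far as needed
print(type(all_items).__name__, len(all_items), first.text)
```

On a large document, iterfind() lets you stop at the first interesting match without collecting the rest.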

iter()? https://docs.python.org/3/library/xml.etree.elementtree.html#elementinclude-functions

It's possible to only get the descendants that have a given tag:

tag_name = "ellipse"
for descendant in root.iter(tag_name):
    print(descendant)

Dealing with namespaces is more convenient when using .iterfind(), which accepts an optional mapping of prefixes to namespace URIs:

namespaces = {"": "http://www.w3.org/2000/svg","custom": "http://www.w3.org/2000/svg"}
for descendant in root.iterfind("g", namespaces):
    print(descendant)
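Here is a self-contained sketch of a namespaced search, assuming the standard library's xml.etree.ElementTree. The SVG fragment and the "svg" prefix are made up for the example; the last search uses the {namespace-uri}tag form (Clark notation), which needs no mapping.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
svg_text = (
    f'<svg xmlns="{SVG_NS}">'
    '<g id="layer1"><ellipse rx="5" ry="3"/></g>'
    '<g id="layer2"/>'
    '</svg>'
)
root = ET.fromstring(svg_text)

# The prefix ("svg") is arbitrary; only the URI has to match the document
namespaces = {"svg": SVG_NS}
groups = [g.get("id") for g in root.iterfind("svg:g", namespaces)]
ellipses = root.findall(".//svg:ellipse", namespaces)

# Same search without a mapping, using Clark notation: {namespace-uri}tag
clark_groups = [g.get("id") for g in root.iterfind(f"{{{SVG_NS}}}g")]

print(groups, len(ellipses), clark_groups)
```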

findtext() and itertext() work on elements' text.
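A short sketch of both methods, assuming the standard library's xml.etree.ElementTree and a made-up fragment with mixed content:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>big</b> world<title>Greeting</title></p>")

title = root.findtext("title")                     # text of the first matching child
missing = root.findtext("author", default="n/a")   # default when nothing matches
all_text = "".join(root.itertext())                # every text fragment, in document order

print(title, missing, all_text)
```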

There is no move() method: You'll have to find + append + remove:

action = root.find("./genre[@category='Action']")
new_dec = ET.SubElement(action, 'decade')
new_dec.attrib["years"] = '2000s'
 
xmen = root.find("./genre/decade/movie[@title='X-Men']")
dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
dec2000s.append(xmen)
dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
dec1990s.remove(xmen)
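The same idea as a self-contained sketch, assuming the standard library's xml.etree.ElementTree. The document is a cut-down, invented version of the movie collection, and move() is a hypothetical helper, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<collection>"
    "<decade years='1990s'><movie title='X-Men'/><movie title='Batman Returns'/></decade>"
    "<decade years='2000s'/>"
    "</collection>"
)

def move(element, old_parent, new_parent):
    """Hypothetical helper: detach element from old_parent, attach it to new_parent."""
    old_parent.remove(element)
    new_parent.append(element)

dec1990s = root.find("./decade[@years='1990s']")
dec2000s = root.find("./decade[@years='2000s']")
xmen = dec1990s.find("./movie[@title='X-Men']")
move(xmen, dec1990s, dec2000s)

print(ET.tostring(root, encoding="unicode"))
```

remove() only works on the element's direct parent, which is why the parent has to be found first.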

Install

pip install lxml

Quick test

import lxml.etree as et
 
tree = et.parse("input.gpx")
root = tree.getroot()
 
#Retrieves direct children nodes of the root
for child in root:
    print(child.tag, child.attrib)
    #? print(root[0][1].text)

Logic

An XML file is made of elements (or "nodes"). Each element has a tag, and possibly attributes and text.

The ElementTree package consists of two classes: ElementTree (the whole structure) and Element (nodes).

You first need to read the input, either from a file or a string, have ET parse it and return a pointer to either the tree (ET.parse("myfile.xml") followed by tree.getroot()) or the root element directly (ET.fromstring()).

Once you have a pointer to the root element, you can navigate and modify the tree before writing the edited output back to a file.
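The whole cycle can be sketched as follows, assuming the standard library's xml.etree.ElementTree (lxml.etree should accept the same calls). The GPX-like sample is made up, and tostring() is used instead of tree.write("output.gpx") so that nothing is written to disk.

```python
import xml.etree.ElementTree as ET

# Sample data, invented for this sketch
xml_text = "<gpx><wpt lat='48.85' lon='2.35'><name>Paris</name></wpt></gpx>"

# fromstring() returns the root Element directly
root = ET.fromstring(xml_text)

# Navigate and modify
wpt = root.find("wpt")
wpt.find("name").text = "Paris, France"
wpt.set("ele", "35")

# Wrap the root in an ElementTree when a whole-document object is needed
tree = ET.ElementTree(root)
output = ET.tostring(tree.getroot(), encoding="unicode")
print(output)
```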

Ways to get/set infos from an element:
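A minimal sketch of the usual accessors, assuming the standard library's xml.etree.ElementTree and a made-up element:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<movie title="X-Men" year="2000">Marvel<format/> tail text</movie>')

print(elem.tag)           # element name
print(elem.attrib)        # all attributes, as a dict
print(elem.get("title"))  # one attribute (None if absent)
print(elem.text)          # text before the first child
print(elem[0].tail)       # text that follows a child element

elem.set("year", "2001")    # add or change an attribute
elem.text = "20th Century"  # change the text
```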

To create a tree from code:

a = ET.Element('a')
b = ET.SubElement(a, 'b')
c = ET.SubElement(a, 'c')
d = ET.SubElement(c, 'd')
ET.dump(a) # <a><b /><c><d /></c></a>

Alternatively:

root = etree.HTML("<p>data</p>")
print(etree.tostring(root))

XPath (the XML Path Language) has more features for finding elements than ElementTree (see ElementTree's "Supported XPath syntax"). "The .find*() methods are usually faster than the full-blown XPath support. They also support incremental tree processing through the .iterfind() method, whereas XPath always collects all results before returning them. They are therefore recommended over XPath for both speed and memory reasons, whenever there is no need for highly selective XPath queries."

"ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element:

a  = etree.Element("a")
b  = etree.SubElement(a, "b")
c  = etree.SubElement(a, "c")
d1 = etree.SubElement(c, "d")
d2 = etree.SubElement(c, "d")
tree = etree.ElementTree(c)
print(tree.getpath(d2))
# /c/d[2]

"

"For ElementTree, the xpath method performs a global XPath query against the document (if absolute) or against the root node (if relative):

r = tree.xpath('/foo/bar')
print(r[0].tag)

"

"The XPath class compiles an XPath expression into a callable function. The compilation takes as much time as in the xpath() method, but it is done only once per class instantiation. This makes it especially efficient for repeated evaluation of the same XPath expression. Just like the xpath() method, the XPath class supports XPath variables:

root = etree.XML("<root><a><b/></a><b/></root>")
find = etree.XPath("//b")
print(find(root)[0].tag)

"

"ElementTree supports a language named ElementPath in its find*() methods. One of the main differences between XPath and ElementPath is that the XPath language requires an indirection through prefixes for namespace support, whereas ElementTree uses the Clark notation ({ns}name) to avoid prefixes completely. The other major difference regards the capabilities of both path languages. Where XPath supports various sophisticated ways of restricting the result set through functions and boolean expressions, ElementPath only supports pure path traversal without nesting or further conditions."

lxml.etree vs lxml.objectify: The two modules provide different ways of handling XML. However, objectify builds on top of lxml.etree and therefore inherits most of its capabilities and a large portion of its API.

lxml.etree is a generic API for XML and HTML handling. It aims for ElementTree compatibility and supports the entire XML infoset. It is well suited for both mixed content and data centric XML. Its generality makes it the best choice for most applications.

lxml.objectify is a specialized API for XML data handling in a Python object syntax. It provides a very natural way to deal with data fields stored in a structurally well defined XML format. Data is automatically converted to Python data types and can be manipulated with normal Python operators. Look at the examples in the objectify documentation to see what it feels like to use it. Objectify is not well suited for mixed contents or HTML documents. As it is built on top of lxml.etree, however, it inherits the normal support for XPath, XSLT or validation.

Parsing HTML

https://lxml.de/lxmlhtml.html

If BeautifulSoup's UnicodeDammit doesn't solve an incorrect encoding declaration, ElementSoup makes use of the BeautifulSoup parser to build an lxml HTML tree from broken HTML.

E-factory makes it possible to quickly generate HTML pages and fragments:

import lxml.html
from lxml.html import builder as E
 
html = E.HTML(
        E.HEAD(
                E.LINK(rel="stylesheet", href="great.css", type="text/css"),
                E.TITLE("Best Page Ever")
        ),
        E.BODY(
                E.H1(E.CLASS("heading"), "Top News"),
                E.P("World News only on this page", style="font-size: 200%"),
                "Ah, and here's some more text, by the way.",
                lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
        )
)
 
print(lxml.html.tostring(html))

lxml.html.open_in_browser(lxml_doc) writes the document to disk and opens it in the default browser.

lxml.html also supports working with links and forms, and cleaning HTML (removing embedded or script content, special tags, CSS style annotations, etc.)

Namespaces

Namespaces are needed when an XML file combines data from different sources that may use elements with the same name, eg. "name": the namespace tells which one is meant each time. Simple XML files don't require namespaces, and removing them from the source file can make things easier.

A namespace can be any string; by convention it is a URL, which may point to a document providing information about the namespace.

Namespaces can be either set in the input file, or through ElementTree:

<Author xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="Application_t">
 
#Default namespaces follow this format.
xmlns="namespaceURI"
 
#Be sure to replace "URI" with the actual URI in your XML document.
ET.register_namespace('', "URI")
ET.register_namespace('xsi', "http://www.w3.org/2001/XMLSchema-instance")
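
To see what register_namespace() changes in the output, here is a small sketch. It assumes the standard library's xml.etree.ElementTree and uses the GPX 1.1 namespace URI purely as an example; elements are created with the {URI}name notation.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"   # example namespace URI

# Elements are named in Clark notation: {namespace URI}local name
root = ET.Element("{%s}gpx" % GPX_NS)
ET.SubElement(root, "{%s}wpt" % GPX_NS)

# Register the URI as the default namespace (empty prefix) before serializing
ET.register_namespace("", GPX_NS)
default_ns = ET.tostring(root, encoding="unicode")
print(default_ns)

# Or register an explicit prefix for the same URI instead
ET.register_namespace("gpx", GPX_NS)
prefixed = ET.tostring(root, encoding="unicode")
print(prefixed)
```

Note that the registry is global: the mapping applies to every tree serialized afterwards in the same process.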

Classes

Some interesting methods:

class lxml.etree._Element

class lxml.etree._ElementTree

Output

Note: tostring() returns bytes by default, so the output file must be opened in binary mode ('wb'); writing the result to a text-mode file fails with "TypeError: write() argument must be str, not bytes":

with open(OUTPUTFILE, 'wb') as writer:
       writer.write(et.tostring(root,pretty_print=True))

To print the whole tree:

print(ET.tostring(root, encoding='utf8').decode('utf8'))

#pretty_print alone makes no visible difference here: print() still shows the bytes repr (b'...'), so decode first as above
print(ET.tostring(root,pretty_print=True))
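
The bytes-versus-string behaviour can be checked quickly. This sketch assumes the standard library's xml.etree.ElementTree; lxml's tostring() also accepts encoding="unicode".

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b>text</b></a>")

as_bytes = ET.tostring(root)                     # default: bytes
as_text = ET.tostring(root, encoding="unicode")  # str, no decode() needed

print(type(as_bytes), type(as_text))
print(as_text)
```

Asking for a string directly avoids the decode() step and the b'...' display when printing.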

Displaying infos:

tree = etree.ElementTree(root)
print(tree.docinfo.xml_version)
print(tree.docinfo.doctype)
 
tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'
tree.docinfo.system_url = 'file://local.dtd'
print(tree.docinfo.doctype)

lxml also supports indenting:

etree.indent(root)
print(etree.tostring(root))
 
etree.indent(root, space="    ")
print(etree.tostring(root))
 
etree.indent(root, space="\t")
etree.tostring(root)

Outputting XML, HTML, text:

print(etree.tostring(root)) #Default is XML
print(etree.tostring(root, method='html', pretty_print=True))
print(etree.tostring(root, method='text', encoding="UTF-8"))

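To start from a clean slate (ignoring the whitespace between elements, so that pretty_print can re-indent the output):

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(filename, parser)
root = tree.getroot()
print(etree.tostring(root,pretty_print=True))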

In case there are redundant namespaces:

parser = et.XMLParser(ns_clean=True,remove_blank_text=True)
tree   = et.parse(INPUTFILE, parser)
print(et.tostring(tree.getroot()))

To write the tree to a file:

#tostring() returns bytes, hence binary mode
with open('doc.xml', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

Another way to get pretty printing is to use Tidy:

import tidy
 
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))

If ET complains about encoding, try this:

print(ET.tostring(root, encoding='utf8').decode('utf8'))

Finding elements

There are multiple ways to search for elements.

An element acts like a list whose items are its children, eg. len(root) returns the number of elements directly below the root. Attributes are available as a dictionary (element.attrib).

find() returns a single element (or None if nothing matches), while findall() returns a list.

findall() is part of the original ElementTree API. It supports a simple subset of the XPath language, without functions, boolean conditions and other advanced features. For instance, it doesn't allow a path starting with "/" ("SyntaxError: cannot use absolute path on element"). Given a bare tag name, findall() returns only the elements with that tag which are direct children of the current element; use ".//tag" to search at any depth.
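
A quick way to see how deep each search goes is the sketch below. It assumes the standard library's xml.etree.ElementTree, and the tiny document is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b/><c><b/><d><b/></d></c></a>")

direct = root.findall("b")        # direct children only
deep = root.findall(".//b")       # all descendants, returned as a list
lazy = list(root.iter("b"))       # all descendants, produced by an iterator

print(len(direct), len(deep), len(lazy))  # 1 3 3
```

Unlike findall(), iter() also yields the starting element itself when its tag matches.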

Likewise, findall() doesn't support the "|" symbol to search for different tags, while xpath() does:

for el in root.xpath('.//tag1/*|.//tag2/*'):
    print(el.tag, el.text)

A simpler alternative:

for el in root.iter('tag1', 'tag2'):
    print(el.tag, el.text)

Note: find/findall/iterfind() methods are recommended over using xpath() because they are faster and support incremental searches, and also simplify namespace usage, ie. only use .xpath() for advanced queries.

Important: If findall() returns nothing although the query looks good, it might be an issue with the namespaces. In that case, either remove all namespaces in the input file, or change the search string.
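
Changing the search string could look like the sketch below. It assumes the standard library's xml.etree.ElementTree (lxml accepts the same two forms); the GPX snippet and the "gpx" prefix are only examples.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"
root = ET.fromstring(
    '<gpx xmlns="%s"><wpt lat="1" lon="2"/><wpt lat="3" lon="4"/></gpx>' % GPX_NS
)

plain = root.findall("wpt")                    # looks right, but finds nothing
clark = root.findall("{%s}wpt" % GPX_NS)       # Clark notation: {URI}name
NSMAP = {"gpx": GPX_NS}                        # prefix chosen freely for the query
prefixed = root.findall("gpx:wpt", namespaces=NSMAP)

print(len(plain), len(clark), len(prefixed))   # 0 2 2
```

The prefix used in the query doesn't have to match the one in the file (here the file uses a default namespace, ie. no prefix at all); only the URI counts.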

Important: To check if find() found an element, use: if element.find('...') is not None.

lxml also offers two functions to get siblings: getprevious()/getnext(). It also provides getparent().

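Important: Checking whether find() or findall() found anything requires two different tests, since find() returns None when nothing matches while findall() returns an empty list:

r = root.find('./Document/name')
#if et.iselement(r):
if r is not None:
    print(r.text)
 
tracks = root.findall('.//LineString')
if len(tracks):
    print(len(tracks), "track(s) found")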
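
The two checks above can be tried on a small in-memory document. This is a sketch assuming the standard library's xml.etree.ElementTree, with made-up element names. It also shows why a bare "if r:" is unreliable for find(): the length of an element is its number of children, so an element that was found but has no children has length 0 and may be treated as false.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<kml><Document><name/></Document></kml>")

name = root.find("./Document/name")      # found, but it has no children
missing = root.find("./Document/nope")   # not found
tracks = root.findall(".//LineString")   # no match

print(name is not None)   # the element exists...
print(len(name))          # ...but its length (number of children) is 0
print(missing is None)    # find() returns None when nothing matches
print(tracks == [])       # findall() returns an empty list
```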

This shows the top-level element in the tree:

print(root.tag,root.attrib)

This will return all the elements right below the root, not any deeper:

for child in root:
    print(child.tag, child.attrib)

To get all the elements in the tree…

for elem in root.iter():
    print(elem.tag,elem.attrib,elem.text)

… or only some elements, anywhere in the tree:

for movie in root.iter('movie'):
    print(movie.attrib)

Find the first element that matches:

#find() returns a single element (or None), not a list
wpt = root.find("wpt")
if wpt is not None:
    print(wpt.tag,wpt.attrib,wpt.text)

Getting the value of the content attribute of the "description" meta element in the head section:

description = root.xpath('string(//meta[@name="description"]/@content)')
if len(description):
    print("Description=",description)

find() and findall() support a subset of XPath, which provides a more powerful way to navigate a tree. Here's how to find all waypoints below the root in a GPX file:

for wpt in root.findall("./wpt"):
    print(wpt.tag,wpt.attrib,wpt.text)

Searching with XPath

Note that xpath() returns a list, even if it found only one element:

element = template_tree.xpath('//myelement')
if len(element):
        html_tree = lxml.html.fragment_fromstring("<div>blah</div>", parser=lxml.html.HTMLParser())
        parent = element[0].getparent()
        parent.insert(parent.index(element[0]),html_tree)
        parent.remove(element[0])
        et.dump(template_root)  #dump() prints the element itself and returns None

Examples:

More infos:

lxml.objectify

"lxml supports an alternative API similar to the Amara bindery or gnosis.xml.objectify through a custom Element implementation. The main idea is to hide the usage of XML behind normal Python objects, sometimes referred to as data-binding. It allows you to use XML as if you were dealing with a normal Python object hierarchy."

https://lxml.de/objectify.html

TO READ

Help

https://mailman-mail5.webfaction.com/listinfo/lxml Archives https://mailman-mail5.webfaction.com/pipermail/lxml/

DEAD http://blog.gmane.org/gmane.comp.python.lxml.devel

DEAD https://www.google.com/webhp?q=site:comments.gmane.org%2Fgmane.comp.python.lxml.devel+

Q&A

Can I get rid of namespace information while working with data?

The kludgy way to remove namespaces is to run a regex through the source file, and parse the result to get the root.

A cleaner way is to parse the XML, and then remove all references to the namespace(s):

# Remove namespace prefixes
#Source: https://stackoverflow.com/questions/60486563/
tree = et.parse(INPUTFILE)
root = tree.getroot()
for elem in root.iter():
        #ValueError: Invalid input tag of type <class 'cython_function_or_method'>
        #et.tag = et.QName(elem).localname
 
        # For elements, replace qualified name with localname
        # (comments and processing instructions have no real tag, so skip them)
        if not isinstance(elem, (et._Comment, et._ProcessingInstruction)):
                elem.tag = et.QName(elem).localname
 
        # Remove attributes that are in a namespace
        # (iterate over a copy, since the dictionary is modified in the loop)
        for attr in list(elem.attrib):
                if "{" in attr:
                        elem.attrib.pop(attr)
 
# Remove unused namespace declarations
et.cleanup_namespaces(root)

How to add text when using append()?

for waypoint in root.findall('gpx:wpt', namespaces=NSMAP):

        #How to set text?

        waypoint.append( ET.Element("dummy"))
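
One possible answer, as a sketch: set the .text attribute on the new element before (or after) appending it. This assumes the standard library's xml.etree.ElementTree and a namespace-free snippet to keep things short; the "dummy" and "note" element names are placeholders.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<gpx><wpt/><wpt/></gpx>")

for waypoint in root.findall("wpt"):
    # Option 1: build the element, set its text, then append it
    dummy = ET.Element("dummy")
    dummy.text = "some text"
    waypoint.append(dummy)

    # Option 2: SubElement() creates and appends in one call
    ET.SubElement(waypoint, "note").text = "more text"

result = ET.tostring(root, encoding="unicode")
print(result)
```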

Difference between .iter() and .findall()?

"Element.findall() finds only elements with a tag which are direct children of the current element."

Diff between root.write(ET.tostring()) and tree.write()?

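with open("removed.time.gpx", 'wb') as doc:
        #An Element has no write() method: write the serialized bytes to the file object
        doc.write(ET.tostring(tree, pretty_print = True))

#tree.write() serializes the tree and writes it to the file in a single call
tree.write("removed.time.gpx", pretty_print = True)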

What's the difference between tree and root (parse vs. fromstring)?

parse() returns an ElementTree while fromstring() returns an Element.

https://stackoverflow.com/questions/32620254/python-elementtree-elementtree-vs-root-element

What's the point of getroot()?

Needed with functions that return a whole ElementTree instead of a specific Element (node):

tree = et.parse("input.gpx")
root = tree.getroot()

"fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree." (Source)

"The getroot() method is available on xml.etree.ElementTree.ElementTree objects, not xml.etree.ElementTree.Element objects. ET.fromstring() returns the latter type. You already have the root element."

https://stackoverflow.com/questions/32620254/python-elementtree-elementtree-vs-root-element
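
The difference is easy to check interactively. This sketch assumes the standard library's xml.etree.ElementTree and feeds parse() an in-memory file object instead of a real file.

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<gpx><wpt/></gpx>"

tree = ET.parse(io.StringIO(xml_text))   # parse() takes a filename or a file object
root_from_tree = tree.getroot()          # ElementTree -> root Element
root_direct = ET.fromstring(xml_text)    # already the root Element

print(type(tree).__name__, type(root_direct).__name__)
print(root_from_tree.tag, root_direct.tag)
print(hasattr(root_direct, "getroot"))   # an Element has no getroot()
```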

fastkml

$ pip install fastkml (or "pip install -r requirements.txt" from the base of the source tree; To build KML files, FastKML requires Shapely which requires libgeos).

Note: If the input KML has something it doesn't like, fastkml (or lxml2?) might just complain with "ValueError" without saying which line it didn't like.

from fastkml import kml
 
k = kml.KML()
with open(kml_file, 'rt', encoding="utf-8") as myfile:
    doc=myfile.read()
k.from_string(doc)
 
#features() returns a generator object that you can iterate over
for f in k.features():
    print(f.name)
features = list(k.features())
f2 = list(features[0].features())
print(f2[0].name)
print(k.to_string(prettyprint=True))
 
--
#Presumably pykml's parser, as in the pykml section below
from pykml import parser
 
with open(input) as f:
  doc = parser.parse(f)
root = doc.getroot()
 
folder = root.Document.Folder # parent of Placemark
for pm in folder.getchildren():
  print(pm.tag, pm.getparent().tag)
  """
  if pm.tag == '{http://www.opengis.net/kml/2.2}Placemark':
    keep = False
    #zipcode = ''
    for sd in pm.ExtendedData.SchemaData.getchildren():
      if 'ZCTA5CE10' in sd.values():
              if sd.text in zipcodes:
                  #zipcode = sd.text
                  keep = True
              break
    if not keep:
      removed += 1
      folder.remove(pm)
    else:
      kept += 1
        """
#doc.write('output.kml', xml_declaration=True, encoding='UTF-8')

More infos

pykml

"pyKML is based on the lxml.objectify API which provides a Pythonic API for working with XML documents. pyKML adds additional functionality specific to the KML language. pyKML depends on the lxml Python library, which in turn depends on two C libraries: libxml2 and libxslt. Given this, the first step to installing pyKML is to get lxml running on your system."

The XML parser is used to read an existing KML file (pykml.parser.parse), or write a KML object to a file (lxml.etree.tostring).

"For complete stand alone programs that demonstrate how to use pyKML, check out the pyKML Examples."

"This type of attribute-based access is provided by the lxml packages’s objectify API. pyKML users are encouraged to familiarize themselves with the objectify API documentation on the lxml website, because pyKML inherits this functionality."

"KML documents that you create can be validated against XML Schema documents, which define the rules of which elements are acceptible and what ordering can be used. Both the OGC KML schema and the Google Extension schemas are included with pyKML."

Resources

Install lxml

To check if lxml is installed, run Python, and type "import lxml"

If not: http://lxml.de/installation.html

Install PyKML

pip install pykml

Run Python, and type "import pykml"

Have PyKML create a ready-to-use script

from pykml.factory import write_python_script_for_kml_document
import urllib.request as urllib2 #urllib2 was used in Python 2
from pykml import parser
 
url = 'http://code.google.com/apis/kml/documentation/kmlfiles/altitudemode_reference.kml'
fileobject = urllib2.urlopen(url)
doc = parser.parse(fileobject).getroot()
script = write_python_script_for_kml_document(doc)
print(script)

Later

from lxml import etree

from pykml import parser

from pykml.factory import KML_ElementMaker as KML

To validate:

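from pykml.parser import Schema
from pykml.factory import KML_ElementMaker as KML
from pykml.factory import GX_ElementMaker as GX
 
schema_ogc = Schema("ogckml22.xsd")
schema_gx = Schema("kml22gx.xsd")
 
doc = KML.kml(GX.Tour())
 
#The .validate() method only returns True or False
schema_ogc.validate(doc)
schema_gx.validate(doc)
 
#For more details, assertValid() raises an exception describing the problem
schema_ogc.assertValid(doc)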

Here's how to read a KML file, and remove an element:

from pykml import parser
from lxml import etree
 
with open("input.kml") as f:
  doc = parser.parse(f)
root = doc.getroot()
 
folder = root.Document.Folder.Placemark
for pm in folder.getchildren():
    #print(pm.tag)
   #To prevent PyKML from prepending {http://earth.google.com/kml/2.0},
   #use regex to remove namespace: <kml xmlns="http://earth.google.com/kml/2.0">
   #if pm.tag=="ExtendedData":
    if pm.tag=="{http://earth.google.com/kml/2.0}ExtendedData":
        folder.remove(pm)
 
with open("output.kml","wb") as outfile:
    outfile.write(etree.tostring(doc, pretty_print=True))

Here's how to read coordinates:

import lxml
#import pykml
from pykml import parser
 
doc=None
with open('dummy.kml') as f:
    doc = parser.parse(f).getroot()
 
for e in doc.Document.Folder.Placemark:
  coor = e.LineString.coordinates.text.split(',')
  print(coor)

Here's how to read from a file, and copy data into a new file:

from pykml import parser
from lxml import etree
from pykml.factory import KML_ElementMaker as KML
 
with open('input.kml') as f:
    tree = parser.parse(f)
root = tree.getroot()
coords = root.Document.Folder.Placemark.LineString.coordinates
 
doc = KML.kml(
    KML.Placemark(
        KML.name("test"),
        KML.Style(KML.LineStyle(KML.color("FF0000FF"))),
        KML.LineString(
            KML.coordinates(coords.text)  #pass the text, not the element itself
        )
    )
)
 
with open('output.kml','wb') as outfile:
    outfile.write(etree.tostring(doc, pretty_print=True))

simpleKML

https://simplekml.readthedocs.io/en/latest/

"Unfortunately, simplekml is just a kml generator, it cannot read and manipulate existing kml, only create it. You will have to use an alternative, such as pyKML." (Source)

pip install simplekml

Example:

import simplekml
kml = simplekml.Kml()
kml.document.name = "Test"
kml.save("botanicalgarden.kml")

How to remove id? <Document id="1">

How to read existing KML file, extract needed items (eg. Placemark), edit them, and save everything to a new KML file?

Working with GPX files with gpxpy

https://github.com/tkrajina/gpxpy

http://witkowskibartosz.com/blog/gpx-file-reader.html

https://ocefpaf.github.io/python4oceanographers/blog/2014/08/18/gpx/

pip install gpxpy

Quick code:

import gpxpy
 
with open(path_to_gpx_file, 'r') as f:
    p = gpxpy.parse(f)
print("{} track(s)".format(len(p.tracks)))

To read from a GPX file (gpx being the object returned by gpxpy.parse()):

a = gpx.tracks[0]
b = a.segments[0]
c = b.points[1]
d = [c.longitude, c.latitude, c.elevation, c.time]

To create a new GPX file from scratch:

import gpxpy.gpx
 
gpx = gpxpy.gpx.GPX()
 
# Create first track in our GPX:
gpx_track = gpxpy.gpx.GPXTrack()
gpx.tracks.append(gpx_track)
 
# Create first segment in our GPX track:
gpx_segment = gpxpy.gpx.GPXTrackSegment()
gpx_track.segments.append(gpx_segment)
 
# Create points:
gpx_segment.points.append(gpxpy.gpx.GPXTrackPoint(2.1234, 5.1234, elevation=1234))
gpx_segment.points.append(gpxpy.gpx.GPXTrackPoint(2.1235, 5.1235, elevation=1235))
gpx_segment.points.append(gpxpy.gpx.GPXTrackPoint(2.1236, 5.1236, elevation=1236))
 
# You can add routes and waypoints, too...
 
print('Created GPX:', gpx.to_xml())

xmltodict

"xmltodict is a Python module that makes working with XML feel like you are working with JSON".

Won't do if you need to add a key, but fine if you just need to read, and possibly change any value.

Notes from John E. Simpson's "XPath and XPointer" (2002)

XPath is used for locating XML content within an XML document; XPointer is the standard for addressing such content, once located.

As support for XPath is integrated into the Document Object Model (DOM), DOM developers may also find XPath a convenient alternative to walking through document trees.

"An XPath" consists of one or more chunks of text, delimited by any of a number of special characters, assembled in any of various formal ways. Each chunk, as well as the assemblage as a whole, is called an XPath expression.

Most XPath expressions, by far, locate a document's contents or portions thereof. These pieces of content are located by way of one or more location steps — discrete units of XPath "meaning" — chained together, usually, into location paths.

An XPath expression can be said to consist of various components: tokens and delimiters. The expression taxcut/* locates all elements that are children of a taxcut element.
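
The taxcut/* expression can be tried with the path subset built into ElementTree. This sketch assumes the standard library's xml.etree.ElementTree; the document around the taxcut element is invented, and the expression is evaluated relative to the root element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<budget>"
    "<taxcut><amount>10</amount><year>2002</year></taxcut>"
    "<spending><amount>5</amount></spending>"
    "</budget>"
)

# "taxcut" is a location step, "/" a delimiter, "*" matches any element
children = root.findall("taxcut/*")
tags = [el.tag for el in children]
print(tags)  # ['amount', 'year']
```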

XPath is capable of processing four data types: string, numeric, Boolean, and nodes (or node-sets).

Most nodes have names. Three important terms:

In a location path, the root node is represented by a leading / (forward slash) character.

There's an XPath function, normalize-space(), that trims all leading and trailing whitespace from a given element's content and collapses internal runs of whitespace into a single space.

Editing with XMLStarlet

Read xmlstarlet-ug.pdf

XMLStarlet is an open-source, command-line application that supports testing XPath queries.

Checking the structure: xml el input.xml

Networking

Q&A

How to find the type of a variable/output?

print(type(blah))

UnicodeDecodeError: 'ascii' codec can't decode byte

Python uses Unicode internally, and may need some help when it can't figure out which code page (encoding) was used to encode a byte string:

try:
    cursor.execute(sql.decode('utf-8'))
except UnicodeDecodeError:
    try:
        cursor.execute(sql.decode('iso8859-15'))
    except UnicodeDecodeError:
        cursor.execute(sql.decode('cp1252'))
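
The nested try/except above can be flattened into a loop. The sketch below is written for Python 3, where decode() exists on bytes only; decode_best_effort is a made-up helper name, not a library function. The order also differs from the snippet above: cp1252 is tried before iso8859-15 because, as far as I can tell, Python's iso8859-15 codec accepts every byte value, so anything listed after it would never be reached.

```python
def decode_best_effort(raw, encodings=("utf-8", "cp1252", "iso8859-15")):
    """Return (text, encoding) using the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the encodings could decode the data")

utf8_result = decode_best_effort("café".encode("utf-8"))
cp1252_result = decode_best_effort(b"caf\xe9")
print(utf8_result)
print(cp1252_result)
```

This only guesses: a byte string that happens to decode without error is not necessarily decoded with the right encoding.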

More information:

How to check for errors?

"Pylint analyzes Python source code looking for bugs and signs of poor quality."

How to enhance performance?

Why are strings immutable?

Read that question in a newsgroup. Does it mean a string in Python is read-only?
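
In short, yes: a string object can't be changed in place, and string methods return a new string instead. A minimal sketch to check this:

```python
s = "python"
try:
    s[0] = "P"            # item assignment is not allowed on str
    changed_in_place = True
except TypeError:
    changed_in_place = False

t = s.capitalize()        # methods return a new string, s is untouched
print(changed_in_place, s, t)
```

The variable itself can still be rebound to another string (s = s.capitalize()); only the string object is read-only.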

What's the difference between "import mymodule" and "from mymodule import *"?

The former forces you to prepend the module's name to every member, eg. mymodule.mymethod(), while the latter imports all the public names into the current namespace, letting you call the methods without the module name. Although the latter is easier to use, make sure those new names don't clash with names already in your namespace.

Is there a native-code compiler for Windows?

Check out py2exe. Other sources of information are Distributing Python Apps and How can I create a stand-alone binary from a Python script? Also take a look at Psyco.

Py? Pyc? Pyd? Pyo? Pyw?

(From Boudewijn Rempt's book on PyQT): "The translation from Python code to byte-code only happens once: Python saves a compiled version of your code in another file with the extension .pyc, or an optimized compiled version of your code that removes assert statements and line-number tracking in a file with the extension .pyo.

However, that is only done with Python files that are imported from other files: the bootstrap script will be compiled to bytecode every time you run it, but python will create a myapp.pyc from a file myapp.py (which is not shown here)."

IndentationError

"unindent does not match any outer indentation level": If copy/pasting code from a web page, make sure there are no hidden characters (eg. tabs mixed with spaces) that confuse Python.

Hiding the DOS box when running under Windows?

"Python.exe is used for console mode programs and Pythonw.exe is used for GUI applications that don't need a console window. Python.exe can also be used for GUI programs, but then you get a console window in addition to your GUI window(s)."

How to call a PowerBasic DLL from Python?

http://www.talkaboutprogramming.com/group/alt.lang.powerbasic/messages/7219.html

How to hide the console window when running an EXE generated by py2exe?

Books

From VB to Python

Resources